I am using spark-sql 2.4.1 with Java 8. I need to calculate percentiles such as the 25th, 75th, and 90th for some given data.
I tried using percentile_approx() from Spark SQL to do this, but its results do not match the fractional percentiles of an Excel sheet that uses PERCENTILE.INC().
Hence, I'm wondering how to fix or adjust percentile_approx(). Is there any way to override it, or to write a custom function based on it, that calculates fractional percentiles correctly?
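For reference, PERCENTILE.INC is usually described as linear interpolation between the two closest ranks. A minimal sketch of that definition in plain Scala (no Spark involved), where `PercentileInc` and `percentileInc` are hypothetical names I made up for illustration, would be:

```scala
// Sketch of the PERCENTILE.INC definition: linear interpolation between
// the two closest ranks. `percentileInc` is a hypothetical helper, not a Spark API.
object PercentileInc {
  def percentileInc(values: Seq[Double], p: Double): Double = {
    require(values.nonEmpty, "values must not be empty")
    require(p >= 0.0 && p <= 1.0, "p must be in [0, 1]")
    val sorted = values.sorted
    // Zero-based fractional rank: p = 0 maps to the minimum, p = 1 to the maximum.
    val rank = p * (sorted.length - 1)
    val lower = math.floor(rank).toInt
    val upper = math.ceil(rank).toInt
    val fraction = rank - lower
    // Interpolate between the two neighbouring sorted values.
    sorted(lower) + fraction * (sorted(upper) - sorted(lower))
  }
}
```

For the values 10, 20, 30 this gives 15, 25, and 28 for the 0.25, 0.75, and 0.90 percentiles, which is the kind of fractional result I am after.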
Given dataset:
import spark.implicits._ // needed for toDF; assumes a SparkSession named `spark`

val df = Seq(
  (10, "1/15/2018", 0.010680705, 10, 0.619875458, "east"),
  (10, "1/15/2018", 0.006628853, 4, 0.16039063, "west"),
  (10, "1/15/2018", 0.01378215, 20, 0.082049528, "east"),
  (10, "1/15/2018", 0.810680705, 6, 0.819875458, "west"),
  (10, "1/15/2018", 0.702228853, 30, 0.916039063, "east")
).toDF("id", "date", "revenue", "con_dist_1", "con_dist_2", "zone")

val percentiles = Seq(0.25, 0.75, 0.90) // Which percentiles to calculate
val cols = Seq("con_dist_1", "con_dist_2") // The columns to use
I need to calculate the given percentiles for each zone for the given columns. How can this be achieved?
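One direction that might work, sketched below, is Spark SQL's exact `percentile` aggregate called through `expr`. Unlike percentile_approx(), it interpolates linearly between neighbouring values. The object name `ZonePercentiles`, the helper `percentilesByZone`, and the `<column>_p<NN>` output names are my own assumptions for illustration, and the result is one wide row per zone rather than the long layout shown in the expected results:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.expr

object ZonePercentiles {
  // Hypothetical helper: returns one row per zone with an exact percentile
  // column for every (column, percentile) pair.
  def percentilesByZone(df: DataFrame, cols: Seq[String], percentiles: Seq[Double]): DataFrame = {
    val aggs = for (c <- cols; p <- percentiles)
      // e.g. percentile(con_dist_1, 0.25) aliased as con_dist_1_p25
      yield expr(s"percentile($c, $p)").alias(s"${c}_p${math.round(p * 100)}")
    df.groupBy("zone").agg(aggs.head, aggs.tail: _*)
  }
}
```

Calling `ZonePercentiles.percentilesByZone(df, cols, percentiles)` and joining the result back to `df` on "zone" would attach the values to every original row. Because the exact aggregate has to hold each group's distinct values in memory, this is only a sketch for moderately sized groups.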
Expected results:
+---+---------+-----------+----+------------+--------------+--------------+-------------+
| id|     date|    revenue|zone|perctile_col|qunantile_0.25|qunantile_0.75|qunantile_0.9|
+---+---------+-----------+----+------------+--------------+--------------+-------------+
| 10|1/15/2018|0.006628853|west|  con_dist_1|           4.5|           5.5|          5.8|
| 10|1/15/2018|0.010680705|west|  con_dist_1|           4.5|           5.5|          5.8|
| 10|1/15/2018|0.010680705|east|  con_dist_1|            15|            25|         28.0|
| 10|1/15/2018| 0.01378215|east|  con_dist_1|            15|            25|         28.0|
| 10|1/15/2018|0.006628853|east|  con_dist_1|            15|            25|         28.0|
| 10|1/15/2018|0.006628853|west|  con_dist_2|   0.325261837|   0.655004251| 0.7539269752|
| 10|1/15/2018|0.010680705|west|  con_dist_2|   0.325261837|   0.655004251| 0.7539269752|
| 10|1/15/2018|0.010680705|east|  con_dist_2|   0.350962493|  0.4990442955|  0.749241156|
| 10|1/15/2018| 0.01378215|east|  con_dist_2|   0.350962493|  0.4990442955|  0.749241156|
| 10|1/15/2018|0.006628853|east|  con_dist_2|   0.350962493|  0.4990442955|  0.749241156|
+---+---------+-----------+----+------------+--------------+--------------+-------------+
You can verify the results with "definition 2" at this URL: https://www.translatorscafe.com/unit-converter/en-US/calculator/percentile/