Sample DataFrame:
+----+-------+-----+-------+-----+
|year|country|state|college|marks|
+----+-------+-----+-------+-----+
|2019| India| A| AC| 15|
|2019| India| A| AC| 25|
|2019| India| A| AC| 35|
|2019| India| A| AD| 40|
|2019| India| B| AC| 15|
|2019| India| B| AC| 50|
|2019| India| B| BC| 65|
|2019| USA| A| UC| 15|
|2019| USA| A| UC| 65|
|2019| USA| A| UD| 45|
|2019| USA| B| UC| 44|
|2019| USA| B| MC| 88|
|2019| USA| B| MC| 90|
|2020| India| A| AC| 65|
|2020| India| A| AC| 33|
|2020| India| A| AC| 55|
|2020| India| A| AD| 70|
|2020| India| B| AC| 88|
|2020| India| B| AC| 60|
|2020| India| B| BC| 45|
|2020| USA| A| UC| 85|
|2020| USA| A| UC| 55|
|2020| USA| A| UD| 32|
|2020| USA| B| UC| 64|
|2020| USA| B| MC| 78|
|2020| USA| B| MC| 80|
+----+-------+-----+-------+-----+
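For anyone who wants to reproduce this, here is a minimal sketch that builds the sample DataFrame (it assumes a SparkSession is in scope as spark, whose implicits provide toDF):

import spark.implicits._

// Each tuple is one row of the sample data above.
val df = Seq(
  (2019, "India", "A", "AC", 15), (2019, "India", "A", "AC", 25),
  (2019, "India", "A", "AC", 35), (2019, "India", "A", "AD", 40),
  (2019, "India", "B", "AC", 15), (2019, "India", "B", "AC", 50),
  (2019, "India", "B", "BC", 65), (2019, "USA", "A", "UC", 15),
  (2019, "USA", "A", "UC", 65), (2019, "USA", "A", "UD", 45),
  (2019, "USA", "B", "UC", 44), (2019, "USA", "B", "MC", 88),
  (2019, "USA", "B", "MC", 90), (2020, "India", "A", "AC", 65),
  (2020, "India", "A", "AC", 33), (2020, "India", "A", "AC", 55),
  (2020, "India", "A", "AD", 70), (2020, "India", "B", "AC", 88),
  (2020, "India", "B", "AC", 60), (2020, "India", "B", "BC", 45),
  (2020, "USA", "A", "UC", 85), (2020, "USA", "A", "UC", 55),
  (2020, "USA", "A", "UD", 32), (2020, "USA", "B", "UC", 64),
  (2020, "USA", "B", "MC", 78), (2020, "USA", "B", "MC", 80)
).toDF("year", "country", "state", "college", "marks")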
There are two ways to do multi-dimensional aggregation in Spark: with GROUPING SETS or with rollup.
To read more about multi-dimensional aggregation, follow this link: Multi-Dimensional Aggregation
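For completeness, here is a sketch of the GROUPING SETS route in Spark SQL. It assumes a SparkSession named spark; the temp view name colleges is only for illustration. The five grouping sets listed replicate exactly the levels that the rollup below produces:

df.createOrReplaceTempView("colleges")

val gs_df = spark.sql("""
  SELECT year, country, state, college, max(marks) AS Marks
  FROM colleges
  GROUP BY year, country, state, college
  GROUPING SETS ((year, country, state, college), (year, country, state),
                 (year, country), (year), ())
""")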
The solution using rollup is provided as follows:
import org.apache.spark.sql.functions.max
val ans_df = df.rollup("year", "country", "state", "college").agg(max("marks").as("Marks"))
The result:
+----+-------+-----+-------+-----+
|year|country|state|college|Marks|
+----+-------+-----+-------+-----+
|2020| India| A| AC| 65|
|2019| India| B| BC| 65|
|2020| India| B| null| 88|
|2019| USA| B| UC| 44|
|2020| India| B| AC| 88|
|2020| USA| null| null| 85|
|2019| India| A| AC| 35|
|2019| USA| B| MC| 90|
|2019| India| A| AD| 40|
|2019| USA| A| UD| 45|
|2019| USA| null| null| 90|
|2020| USA| A| UD| 32|
|null| null| null| null| 90|
|2019| USA| B| null| 90|
|2020| India| null| null| 88|
|2019| USA| A| null| 65|
|2019| India| B| null| 65|
|2019| USA| A| UC| 65|
|2020| India| B| BC| 45|
|2020| USA| B| UC| 64|
+----+-------+-----+-------+-----+
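Two things worth noting about this output: show() truncates it to its default of 20 rows (the full rollup of this data has more), and null here marks a subtotal row produced by rollup (e.g. |2020| India| B| null| 88| is the max over all colleges in that state), not a literal null in the data. If the source data could itself contain nulls, grouping_id (from org.apache.spark.sql.functions) can be added to tell the two apart; a minimal sketch:

import org.apache.spark.sql.functions.{grouping_id, max}

// grouping_id() is 0 on fully detailed rows and grows as rollup
// aggregates columns away, so it distinguishes subtotal nulls
// from nulls that were present in the data.
val labeled_df = df.rollup("year", "country", "state", "college")
  .agg(max("marks").as("Marks"), grouping_id().as("agg_level"))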
Moreover, as asked, Spark makes sure this operation is done in an optimal manner and reuses data that is already grouped when aggregating on an additional column. For example, when grouping by the key (year, country, state, college), the data already grouped on the key (year, country, state) is reused, which saves significant computation.
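If you want to see the plan Spark actually chooses for this, you can print it (a quick inspection, not part of the solution itself):

// Prints the parsed, analyzed, optimized, and physical plans,
// showing how Spark expands the rollup into its grouping levels.
ans_df.explain(true)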