139

I'm using PySpark (Python 2.7.9/Spark 1.3.1) and have a dataframe GroupObject which I need to filter and sort in descending order. I'm trying to achieve that with this piece of code.

group_by_dataframe.count().filter("`count` >= 10").sort('count', ascending=False)

But it throws the following error.

sort() got an unexpected keyword argument 'ascending'
rclakmal

8 Answers

222

In PySpark 1.3, the sort method doesn't take an ascending parameter. You can use the desc method instead:

from pyspark.sql.functions import col

(group_by_dataframe
    .count()
    .filter("`count` >= 10")
    .sort(col("count").desc()))

or the desc function:

from pyspark.sql.functions import desc

(group_by_dataframe
    .count()
    .filter("`count` >= 10")
    .sort(desc("count"))

Both methods can be used with Spark >= 1.3 (including Spark 2.x).
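
For a quick end-to-end check, here is a minimal sketch of both forms (the toy data, column name, and local SparkSession are assumptions for illustration, not taken from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([("a",), ("a",), ("b",)], ["key"])

grouped = df.groupBy("key").count()

# Both lines return the same result, ordered by count descending
grouped.sort(col("count").desc()).show()
grouped.sort(desc("count")).show()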

zero323
151

Use orderBy:

df.orderBy('column_name', ascending=False)

Complete answer:

group_by_dataframe.count().filter("`count` >= 10").orderBy('count', ascending=False)

http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html
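
As a quick illustration on a toy DataFrame (the sample data and column names here are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([("x", 5), ("y", 30), ("z", 12)], ["name", "count"])

# ascending=False flips the default ascending order
df.orderBy("count", ascending=False).show()  # rows come back as y, z, x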

Henrique Florencio
33

By far the most convenient way is using this:

df.orderBy(df.column_name.desc())

It doesn't require any special imports.
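
A minimal sketch of this attribute-style access (the DataFrame and column name are assumed for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([(1, "a"), (3, "b"), (2, "c")], ["score", "label"])

# The Column object returned by df.score exposes desc() directly,
# so no pyspark.sql.functions import is needed
df.orderBy(df.score.desc()).show()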

gdoron
  • Credit to [Daniel Haviv](https://www.linkedin.com/in/danielhaviv/), a Solutions Architect at Databricks, who showed me this way. – gdoron Dec 05 '19 at 10:44
  • By far the best answer here. – born_naked Mar 17 '20 at 22:35
  • This should be the accepted answer instead. Much simpler and doesn't rely on packages (perhaps it wasn't available at the time). – Anonymous May 12 '20 at 10:30
  • I really like this answer, but it didn't work for me with count in Spark 3.0.0. I think it is because count is a function rather than a number: TypeError: Invalid argument, not a string or column: of type . For column literals, use 'lit', 'array', 'struct' or 'create_map' function. – Way Too Simple Aug 20 '20 at 20:58
  • This orderBy (sort) works in Azure Synapse Analytics, when reading from dedicatedp1 using "spark.read.synapsesql". – Doug_Ivison Mar 29 '23 at 21:30
7

You can also use groupBy and orderBy as follows:

from pyspark.sql.functions import desc

dataFrameWay = df.groupBy("firstName").count().withColumnRenamed("count", "distinct_name").sort(desc("count"))
Narendra Maru
  • Why are you first renaming the column and then using the old name for sorting? Renaming is not even a part of the question asked. – Sheldore Jan 31 '21 at 14:54
  • @Sheldore I am renaming the column for performance optimization; while working with aggregation queries, it's difficult for Spark to maintain the metadata for the newly added column. – Narendra Maru Mar 25 '21 at 12:31
6

In PySpark 2.4.4:

1) group_by_dataframe.count().filter("`count` >= 10").orderBy('count', ascending=False)

2) from pyspark.sql.functions import desc
   group_by_dataframe.count().filter("`count` >= 10").sort(desc('count'))

Option 1) needs no import and is short and easy to read, so I prefer 1) over 2).

Prabhath Kota
1

RDD.sortBy(keyfunc, ascending=True, numPartitions=None)

An example:

# rdd2: an RDD of text lines (defined elsewhere)
words = rdd2.flatMap(lambda line: line.split(" "))
counter = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# sort the (word, count) pairs by count, highest first, and take the top 10
print(counter.sortBy(lambda a: a[1], ascending=False).take(10))
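
For completeness, a minimal setup under which the snippet above runs (the local SparkContext and sample lines are assumptions added for illustration):

from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-sort")  # hypothetical local context
rdd2 = sc.parallelize(["a b a", "b a c"])        # sample lines standing in for real input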
Aramis NSR
0

PySpark added a Pandas-style sort operator with the ascending keyword argument in version 1.4.0. You can now use

df.sort('<col_name>', ascending = False)

Or you can use the orderBy function:

df.orderBy(df['<col_name>'].desc())
Mr RK
-2

You can use pyspark.sql.functions.desc instead.

from pyspark.sql.functions import desc

g.groupBy('dst').count().sort(desc('count')).show()  # g: a DataFrame with a 'dst' column
Wria Mohammed