139

I'm using PySpark (Python 2.7.9/Spark 1.3.1) and have a dataframe GroupObject which I need to filter and sort in descending order. I'm trying to achieve that with this piece of code.

group_by_dataframe.count().filter("`count` >= 10").sort('count', ascending=False)

But it throws the following error.

sort() got an unexpected keyword argument 'ascending'
rclakmal

8 Answers

222

In PySpark 1.3, the sort method doesn't take an ascending parameter. You can use the desc method instead:

from pyspark.sql.functions import col

(group_by_dataframe
    .count()
    .filter("`count` >= 10")
    .sort(col("count").desc()))

or the desc function:

from pyspark.sql.functions import desc

(group_by_dataframe
    .count()
    .filter("`count` >= 10")
    .sort(desc("count"))

Both methods can be used with Spark >= 1.3 (including Spark 2.x).
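
For a quick end-to-end check, here is a minimal sketch of both forms (the toy data, column name, and local SparkSession are assumptions for illustration, not taken from the question):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, desc

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([("a",), ("a",), ("b",)], ["key"])

grouped = df.groupBy("key").count()

# Both lines return the same result, ordered by count descending
grouped.sort(col("count").desc()).show()
grouped.sort(desc("count")).show()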

zero323
151

Use orderBy:

df.orderBy('column_name', ascending=False)

Complete answer:

group_by_dataframe.count().filter("`count` >= 10").orderBy('count', ascending=False)

http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html
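
As a quick illustration on a toy DataFrame (the sample data and column names here are made up):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([("x", 5), ("y", 30), ("z", 12)], ["name", "count"])

# ascending=False flips the default ascending order
df.orderBy("count", ascending=False).show()  # rows come back as y, z, x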

Henrique Florencio
33

By far the most convenient way is using this:

df.orderBy(df.column_name.desc())

It doesn't require any special imports.
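
A minimal sketch of this attribute-style access (the DataFrame and column name are assumed for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()
df = spark.createDataFrame([(1, "a"), (3, "b"), (2, "c")], ["score", "label"])

# The Column object returned by df.score exposes desc() directly,
# so no pyspark.sql.functions import is needed
df.orderBy(df.score.desc()).show()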

gdoron
  • Credit to [Daniel Haviv](https://www.linkedin.com/in/danielhaviv/), a Solutions Architect at Databricks, who showed me this way. – gdoron Dec 05 '19 at 10:44
  • By far the best answer here. – born_naked Mar 17 '20 at 22:35
  • This should be the accepted answer instead. Much simpler and doesn't rely on packages (perhaps it wasn't available at the time). – Anonymous May 12 '20 at 10:30
  • I really like this answer, but it didn't work for me with count in Spark 3.0.0. I think it is because count is a function rather than a number: TypeError: Invalid argument, not a string or column: of type . For column literals, use 'lit', 'array', 'struct' or 'create_map' function. – Way Too Simple Aug 20 '20 at 20:58
  • This orderBy (sort) works in Azure Synapse Analytics, when reading from dedicatedp1 using "spark.read.synapsesql". – Doug_Ivison Mar 29 '23 at 21:30
7

You can also use groupBy and orderBy as follows:

from pyspark.sql.functions import desc

dataFrameWay = df.groupBy("firstName").count().withColumnRenamed("count", "distinct_name").sort(desc("count"))
Narendra Maru
  • Why are you first renaming the column and then using the old name for sorting? Renaming is not even a part of the question asked. – Sheldore Jan 31 '21 at 14:54
  • @Sheldore I am renaming the column for performance optimization; while working with aggregation queries, it's difficult for Spark to maintain the metadata for the newly added column. – Narendra Maru Mar 25 '21 at 12:31
6

In PySpark 2.4.4:

1) group_by_dataframe.count().filter("`count` >= 10").orderBy('count', ascending=False)

2) from pyspark.sql.functions import desc
   group_by_dataframe.count().filter("`count` >= 10").sort(desc('count'))

Option 1) needs no import and is short and easy to read, so I prefer 1) over 2).

Prabhath Kota
1

RDD.sortBy(keyfunc, ascending=True, numPartitions=None)

An example:

# rdd2: an RDD of text lines (defined elsewhere)
words = rdd2.flatMap(lambda line: line.split(" "))
counter = words.map(lambda word: (word, 1)).reduceByKey(lambda a, b: a + b)

# sort the (word, count) pairs by count, highest first, and take the top 10
print(counter.sortBy(lambda a: a[1], ascending=False).take(10))
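
For completeness, a minimal setup under which the snippet above runs (the local SparkContext and sample lines are assumptions added for illustration):

from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount-sort")  # hypothetical local context
rdd2 = sc.parallelize(["a b a", "b a c"])        # sample lines standing in for real input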
Aramis NSR
0

PySpark added a Pandas-style sort operator with the ascending keyword argument in version 1.4.0. You can now use

df.sort('<col_name>', ascending = False)

Or you can use the orderBy function:

df.orderBy(df['<col_name>'].desc())
Mr RK
-2

You can use pyspark.sql.functions.desc instead.

from pyspark.sql.functions import desc

g.groupBy('dst').count().sort(desc('count')).show()  # g: a DataFrame with a 'dst' column
Wria Mohammed