
I'm trying to find which shop has an 'empty' day, i.e. a day where no customer came.

My table has the following structure:

+----------+-------------+-------------+-------------+-------------+-------------+-------------+------------+
| shop     | 2020-10-15  | 2020-10-16  | 2020-10-17  | 2020-10-18  | 2020-10-19  | 2020-10-20  | 2020-10-21 |
+----------+-------------+-------------+-------------+-------------+-------------+-------------+------------+
| Paris    | 215         | 213         | 128         | 102         | 195         | 180         |        110 |
| London   | 145         | 106         | 102         | 83          | 127         | 111         |         56 |
| Beijing  | 179         | 245         | 134         | 136         | 207         | 183         |        136 |
| Sydney   | 0           | 0           | 0           | 0           | 0           | 6           |         36 |
+----------+-------------+-------------+-------------+-------------+-------------+-------------+------------+

With pandas I can do something like `customers[customers == 0].dropna(how="all")`, which keeps only the rows that contain a 0, and I get this:

+----------+-------------+-------------+-------------+-------------+-------------+-------------+------------+
| shop     | 2020-10-15  | 2020-10-16  | 2020-10-17  | 2020-10-18  | 2020-10-19  | 2020-10-20  | 2020-10-21 |
+----------+-------------+-------------+-------------+-------------+-------------+-------------+------------+
| Sydney   | 0           | 0           | 0           | 0           | 0           | NaN         |         NaN|
+----------+-------------+-------------+-------------+-------------+-------------+-------------+------------+
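For reference, a minimal runnable sketch of that pandas filter; the two-column frame is a made-up subset of the table above, just for illustration:

import pandas as pd

# Masking non-zero cells to NaN, then dropping the rows that are all
# NaN, keeps exactly the rows that contain at least one 0.
customers = pd.DataFrame(
    {"2020-10-15": [215, 145, 179, 0], "2020-10-16": [213, 106, 245, 0]},
    index=["Paris", "London", "Beijing", "Sydney"],
)
print(customers[customers == 0].dropna(how="all"))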

In PySpark I believe .dropna() does something similar, but I want to do the opposite: keep the rows with NA/0 values. How can I do that?

Be Chiller Too
  • You can make a new DataFrame from dropna() of the first DataFrame, then use a left anti join between the two. See this page for the left anti join: https://stackoverflow.com/questions/39887526/filter-spark-dataframe-based-on-another-dataframe-that-specifies-denylist-criter – yasi Oct 21 '20 at 14:13
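A minimal sketch of that anti-join idea, assuming a DataFrame `df` shaped like the table above with `shop` as the key column; zeros are first converted to nulls so `dropna()` can see them:

from pyspark.sql import functions as f

# Convert 0 to null so dropna() treats an empty day as a missing value.
value_cols = [c for c in df.columns if c != "shop"]
with_nulls = df.select(
    "shop",
    *[f.when(f.col(c) == 0, f.lit(None)).otherwise(f.col(c)).alias(c) for c in value_cols],
)
# Rows surviving dropna() have no empty day; the left anti join keeps
# everything else, i.e. the shops with at least one empty day.
no_empty_days = with_nulls.dropna(how="any", subset=value_cols)
df.join(no_empty_days, on="shop", how="left_anti").show()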

1 Answer


Creating a sample dataset:

from pyspark.sql import SparkSession
from pyspark.sql import Row
from pyspark.sql import functions as f

df_list = [
    {"shop": "Paris", "2020-10-15": 215, "2020-10-16": 213, "2020-10-17": 128, "2020-10-18": 195, "2020-10-19": 195},
    {"shop": "London", "2020-10-15": 145, "2020-10-16": 106, "2020-10-17": 102, "2020-10-18": 127, "2020-10-19": 127},
    {"shop": "Beijing", "2020-10-15": 179, "2020-10-16": 245, "2020-10-17": 136, "2020-10-18": 207, "2020-10-19": 207},
    {"shop": "Sydney", "2020-10-15": 0, "2020-10-16": 0, "2020-10-17": 0, "2020-10-18": 0, "2020-10-19": 0},
]
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(Row(**x) for x in df_list)
df.show()


+-------+----------+----------+----------+----------+----------+
|   shop|2020-10-15|2020-10-16|2020-10-17|2020-10-18|2020-10-19|
+-------+----------+----------+----------+----------+----------+
|  Paris|       215|       213|       128|       195|       195|
| London|       145|       106|       102|       127|       127|
|Beijing|       179|       245|       136|       207|       207|
| Sydney|         0|         0|         0|         0|         0|
+-------+----------+----------+----------+----------+----------+

You can then apply the `filter` function; `f.greatest` takes the largest of the per-column conditions, so a row is kept as soon as any value column equals 0:

# Skip the non-numeric "shop" column; keep rows where any day count is 0
df.filter(f.greatest(*[f.col(c).isin(0) for c in df.columns if c != "shop"])).show()

Result:

+------+----------+----------+----------+----------+----------+
|  shop|2020-10-15|2020-10-16|2020-10-17|2020-10-18|2020-10-19|
+------+----------+----------+----------+----------+----------+
|Sydney|         0|         0|         0|         0|         0|
+------+----------+----------+----------+----------+----------+

Samtech
  • Hi, thanks for your answer, unfortunately I prefer not having to iterate over all columns. I want a solution that can work with any number of columns. – Be Chiller Too Oct 22 '20 at 07:21
  • I have modified the code; hope this meets your requirement. – Samtech Oct 22 '20 at 07:38
  • 1
    Oh this is great, I think I can use that! Just a quick remark, we don't need to use `isin(0)` since there's only one value to test. Using `f.col(i) == 0` may work, right? – Be Chiller Too Oct 22 '20 at 12:54
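For completeness, a minimal sketch of that suggested variant, assuming the same `df` and `f` as above; it swaps `isin(0)` for a plain equality and ORs the per-column tests together with functools.reduce, an alternative to `f.greatest`:

from functools import reduce
from operator import or_

# Hypothetical variant: plain `== 0` tests OR-ed together; the
# non-numeric "shop" column is skipped, as in the answer above.
is_empty_day = reduce(or_, [f.col(c) == 0 for c in df.columns if c != "shop"])
df.filter(is_empty_day).show()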