
I am trying to do an anti join in PySpark. For example, I have a common key in both DataFrames, and I need to extract all the rows that are not common to both, i.e. the id of a row in one DataFrame should not match the id of any row in the other.
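For illustration, suppose the DataFrames look like this (hypothetical toy data; spark is an existing SparkSession):

df1 = spark.createDataFrame([(1,), (2,), (3,)], ['id'])
df2 = spark.createDataFrame([(3,), (4,), (5,)], ['id'])
# desired result: ids 1 and 2 (only in df1) plus 4 and 5 (only in df2);
# id 3 appears in both DataFrames, so it should be excluded

Here is my attempt: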

df1 = df1.join(df2, df1.id != df2.id, how='inner')

But with this code, I am getting rows whose ids are the same in both DataFrames.

Thanks in advance for any help.

Surbhi Jain

2 Answers


Maybe you can try a left anti join in both directions -

df3 = df1.join(df2, df1['id']==df2['id'], how='left_anti')
df4 = df2.join(df1, df1['id']==df2['id'], how='left_anti')
final_df = df3.unionAll(df4)

So we do the left anti join twice, once in each direction, and then union the results.
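A minimal runnable sketch of this approach (the toy data and local SparkSession are assumptions for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').getOrCreate()

# toy data: id 3 appears in both DataFrames
df1 = spark.createDataFrame([(1,), (2,), (3,)], ['id'])
df2 = spark.createDataFrame([(3,), (4,), (5,)], ['id'])

# rows of df1 whose id has no match in df2
df3 = df1.join(df2, df1['id'] == df2['id'], how='left_anti')
# rows of df2 whose id has no match in df1
df4 = df2.join(df1, df1['id'] == df2['id'], how='left_anti')

# the union keeps exactly the ids that occur in only one DataFrame
final_df = df3.unionAll(df4)
final_df.show()  # ids 1, 2, 4, 5; id 3 is dropped

Note that unionAll is a deprecated alias of union in recent Spark versions; both work here because the two anti-join results have the same single-column schema.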

Assaf Segev

Spark allows you to handle such use cases in multiple ways; a combined sketch follows the list.

1. Use except: returns a new DataFrame containing rows in dataFrame1 but not in dataFrame2. Note that except is the Scala/Java API name; in Python, except is a reserved word, so PySpark exposes subtract and exceptAll instead.

df1.except(df2)

2. Use subtract: returns a new DataFrame containing rows in this DataFrame but not in another DataFrame, with duplicates removed.

df1.subtract(df2)

3. Use exceptAll(): returns a new DataFrame containing rows in this DataFrame but not in another DataFrame, while preserving duplicates.

df1.exceptAll(df2)

4. Use a left_anti join: a key that is present in both DF1 and DF2 will not be part of the resulting dataset; only rows of DF1 whose key has no match in DF2 are kept.

df = df1.join(df2, df1.key == df2.key, "left_anti")
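A short sketch comparing these options in PySpark (the toy data is an assumption; except is omitted because it is the Scala-side name and a reserved word in Python):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').getOrCreate()

# df1 contains a duplicate id so the variants can be told apart
df1 = spark.createDataFrame([(1,), (2,), (2,), (3,)], ['id'])
df2 = spark.createDataFrame([(3,), (4,)], ['id'])

# subtract: rows in df1 but not in df2, duplicates removed
df1.subtract(df2).show()                              # ids 1, 2

# exceptAll: rows in df1 but not in df2, duplicates preserved
df1.exceptAll(df2).show()                             # ids 1, 2, 2

# left_anti join: rows of df1 whose key has no match in df2
df1.join(df2, df1.id == df2.id, 'left_anti').show()   # ids 1, 2, 2

All of these are one-directional (they only return rows of df1); for the symmetric requirement in the question, run the operation in both directions and union the results, as in the other answer.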

dsk