
I am trying to do an anti join in PySpark. For example, I have a common key in both DataFrames, and I need to extract all the rows that are not common to both, i.e. the id of a row in one DataFrame should not match the id of any row in the other.
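For illustration, suppose the DataFrames look like this (hypothetical toy data; spark is an existing SparkSession):

df1 = spark.createDataFrame([(1,), (2,), (3,)], ['id'])
df2 = spark.createDataFrame([(3,), (4,), (5,)], ['id'])
# desired result: ids 1 and 2 (only in df1) plus 4 and 5 (only in df2);
# id 3 appears in both DataFrames, so it should be excluded

Here is my attempt: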

df1 = df1.join(df2, df1.id != df2.id, how='inner')

But with this code, I am getting rows whose ids are the same in both DataFrames.

Thanks in advance for any help.

Surbhi Jain

2 Answers


Maybe you can try a left anti join in both directions -

df3 = df1.join(df2, df1['id']==df2['id'], how='left_anti')
df4 = df2.join(df1, df1['id']==df2['id'], how='left_anti')
final_df = df3.unionAll(df4)

So we do the left anti join twice, once in each direction, and then union the results.
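A minimal runnable sketch of this approach (the toy data and local SparkSession are assumptions for illustration):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').getOrCreate()

# toy data: id 3 appears in both DataFrames
df1 = spark.createDataFrame([(1,), (2,), (3,)], ['id'])
df2 = spark.createDataFrame([(3,), (4,), (5,)], ['id'])

# rows of df1 whose id has no match in df2
df3 = df1.join(df2, df1['id'] == df2['id'], how='left_anti')
# rows of df2 whose id has no match in df1
df4 = df2.join(df1, df1['id'] == df2['id'], how='left_anti')

# the union keeps exactly the ids that occur in only one DataFrame
final_df = df3.unionAll(df4)
final_df.show()  # ids 1, 2, 4, 5; id 3 is dropped

Note that unionAll is a deprecated alias of union in recent Spark versions; both work here because the two anti-join results have the same single-column schema.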

Assaf Segev

Spark allows you to handle such use cases in multiple ways; a combined sketch follows the list.

1. Use except: returns a new DataFrame containing rows in dataFrame1 but not in dataFrame2. Note that except is the Scala/Java API name; in Python, except is a reserved word, so PySpark exposes subtract and exceptAll instead.

df1.except(df2)

2. Use subtract: returns a new DataFrame containing rows in this DataFrame but not in another DataFrame, with duplicates removed.

df1.subtract(df2)

3. Use exceptAll(): returns a new DataFrame containing rows in this DataFrame but not in another DataFrame, while preserving duplicates.

df1.exceptAll(df2)

4. Use a left_anti join: a key that is present in both DF1 and DF2 will not be part of the resulting dataset; only rows of DF1 whose key has no match in DF2 are kept.

df = df1.join(df2, df1.key == df2.key, "left_anti")
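A short sketch comparing these options in PySpark (the toy data is an assumption; except is omitted because it is the Scala-side name and a reserved word in Python):

from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').getOrCreate()

# df1 contains a duplicate id so the variants can be told apart
df1 = spark.createDataFrame([(1,), (2,), (2,), (3,)], ['id'])
df2 = spark.createDataFrame([(3,), (4,)], ['id'])

# subtract: rows in df1 but not in df2, duplicates removed
df1.subtract(df2).show()                              # ids 1, 2

# exceptAll: rows in df1 but not in df2, duplicates preserved
df1.exceptAll(df2).show()                             # ids 1, 2, 2

# left_anti join: rows of df1 whose key has no match in df2
df1.join(df2, df1.id == df2.id, 'left_anti').show()   # ids 1, 2, 2

All of these are one-directional (they only return rows of df1); for the symmetric requirement in the question, run the operation in both directions and union the results, as in the other answer.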

dsk