
I have a sample Hive/Spark table as below:

row_key  data_as_of_date  key   value
A        20210121         key1  value1
A        20210121         key2  value2
A        20210121         key3  value3
B        20210121         key1  value1
B        20210121         key2  value1
B        20210121         key3  value2
B        20210121         key4  value3
C        20210121         key1  value2

...and goes on.

I have another Hive/Spark table with the same columns. Sample data below:

row_key  data_as_of_date  key   value
A        20210121         key1  value1
A        20210121         key2  value2
B        20210121         key1  value1
B        20210121         key4  value3
C        20210121         key1  value2

row_key is the join column between these two tables, and the same row_key can repeat across multiple rows in both tables.

I am struggling to write a Spark SQL query (or equivalent DataFrame operations) that selects all the rows from table 1 whose key values are not present in table 2 for the same row_key.

For the sample data the result should be:

row_key data_as_of_date key value
A 20210121 key3 value3
B 20210121 key2 value1
B 20210121 key3 value2

Please help with the Spark SQL query or DataFrame operations in Scala.

Let me know if any more info is required.

vijayinani

1 Answer


You can just do a left anti join (join type "leftanti") between the two DataFrames to get the expected output.

 // Table 1 (note: B/key3 carries value2, matching the sample data in the question)
 val df = Seq(
   ("A","20210121","key1","value1"),
   ("A","20210121","key2","value2"),
   ("A","20210121","key3","value3"),
   ("B","20210121","key1","value1"),
   ("B","20210121","key2","value1"),
   ("B","20210121","key3","value2"),
   ("B","20210121","key4","value3"),
   ("C","20210121","key1","value2")
 ).toDF("row_key","data_as_of_date","key","value")

 // Table 2
 val df1 = Seq(
   ("A","20210121","key1","value1"),
   ("A","20210121","key2","value2"),
   ("B","20210121","key1","value1"),
   ("B","20210121","key4","value3"),
   ("C","20210121","key1","value2")
 ).toDF("row_key","data_as_of_date","key","value")

 // Keep only the rows of df whose (row_key, data_as_of_date, key) has no match in df1
 val outputdf = df.join(df1, Seq("row_key","data_as_of_date","key"), "leftanti")
 outputdf.show()

You can see the output as below:

row_key  data_as_of_date  key   value
A        20210121         key3  value3
B        20210121         key2  value1
B        20210121         key3  value2
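If you prefer to express it in Spark SQL instead of the DataFrame API, the same left anti join can be written directly. This is a sketch assuming the two DataFrames have been registered as temp views named table1 and table2 (illustrative names):

 // Register the DataFrames as temporary views (names are illustrative)
 df.createOrReplaceTempView("table1")
 df1.createOrReplaceTempView("table2")

 // LEFT ANTI JOIN keeps rows of table1 that have no match in table2 on the join keys
 val sqlResult = spark.sql("""
   SELECT t1.*
   FROM table1 t1
   LEFT ANTI JOIN table2 t2
     ON  t1.row_key = t2.row_key
     AND t1.data_as_of_date = t2.data_as_of_date
     AND t1.key = t2.key
 """)
 sqlResult.show()

This produces the same rows as the DataFrame version; which one to use is just a matter of preference.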

Nikunj Kakadiya