
I have a sample Hive/Spark table as below:

row_key  data_as_of_date  key   value
A        20210121         key1  value1
A        20210121         key2  value2
A        20210121         key3  value3
B        20210121         key1  value1
B        20210121         key2  value1
B        20210121         key3  value2
B        20210121         key4  value3
C        20210121         key1  value2

...and goes on.

I have another Hive/Spark table with the same columns. Sample data below:

row_key  data_as_of_date  key   value
A        20210121         key1  value1
A        20210121         key2  value2
B        20210121         key1  value1
B        20210121         key4  value3
C        20210121         key1  value2

row_key is the join column between these two tables, and the same row_key can repeat across multiple rows in both tables.

I am struggling to write a Spark SQL query (or equivalent DataFrame operations) that selects all the rows from table 1 whose key values are not present in table 2 for the same row_key.

For the sample data the result should be:

row_key data_as_of_date key value
A 20210121 key3 value3
B 20210121 key2 value1
B 20210121 key3 value2

Please help with the Spark SQL query or DataFrame operations in Scala.

Let me know if any more info is required.

vijayinani

1 Answer


You can just do a left anti join (join type "leftanti") between the two DataFrames to get the expected output.

 // Table 1 (note: B/key3 carries value2, matching the sample data in the question)
 val df = Seq(
   ("A","20210121","key1","value1"),
   ("A","20210121","key2","value2"),
   ("A","20210121","key3","value3"),
   ("B","20210121","key1","value1"),
   ("B","20210121","key2","value1"),
   ("B","20210121","key3","value2"),
   ("B","20210121","key4","value3"),
   ("C","20210121","key1","value2")
 ).toDF("row_key","data_as_of_date","key","value")

 // Table 2
 val df1 = Seq(
   ("A","20210121","key1","value1"),
   ("A","20210121","key2","value2"),
   ("B","20210121","key1","value1"),
   ("B","20210121","key4","value3"),
   ("C","20210121","key1","value2")
 ).toDF("row_key","data_as_of_date","key","value")

 // Keep only the rows of df whose (row_key, data_as_of_date, key) has no match in df1
 val outputdf = df.join(df1, Seq("row_key","data_as_of_date","key"), "leftanti")
 outputdf.show()

You can see the output as below:

row_key  data_as_of_date  key   value
A        20210121         key3  value3
B        20210121         key2  value1
B        20210121         key3  value2
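If you prefer to express it in Spark SQL instead of the DataFrame API, the same left anti join can be written directly. This is a sketch assuming the two DataFrames have been registered as temp views named table1 and table2 (illustrative names):

 // Register the DataFrames as temporary views (names are illustrative)
 df.createOrReplaceTempView("table1")
 df1.createOrReplaceTempView("table2")

 // LEFT ANTI JOIN keeps rows of table1 that have no match in table2 on the join keys
 val sqlResult = spark.sql("""
   SELECT t1.*
   FROM table1 t1
   LEFT ANTI JOIN table2 t2
     ON  t1.row_key = t2.row_key
     AND t1.data_as_of_date = t2.data_as_of_date
     AND t1.key = t2.key
 """)
 sqlResult.show()

This produces the same rows as the DataFrame version; which one to use is just a matter of preference.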

Nikunj Kakadiya