I have a sample Hive/Spark table, shown below:
row_key | data_as_of_date | key | value |
---|---|---|---|
A | 20210121 | key1 | value1 |
A | 20210121 | key2 | value2 |
A | 20210121 | key3 | value3 |
B | 20210121 | key1 | value1 |
B | 20210121 | key2 | value1 |
B | 20210121 | key3 | value2 |
B | 20210121 | key4 | value3 |
C | 20210121 | key1 | value2 |
...and so on.
I have another Hive/Spark table with the same columns. Sample data below:
row_key | data_as_of_date | key | value |
---|---|---|---|
A | 20210121 | key1 | value1 |
A | 20210121 | key2 | value2 |
B | 20210121 | key1 | value1 |
B | 20210121 | key4 | value3 |
C | 20210121 | key1 | value2 |
`row_key` is the joining column between these two tables, and the same `row_key` can repeat in multiple rows in both tables.

I am struggling to write a Spark SQL query (or the equivalent DataFrame operations) that selects all rows from table 1 whose `key` value is not present in table 2 for the same `row_key`.
For the sample data the result should be:
row_key | data_as_of_date | key | value |
---|---|---|---|
A | 20210121 | key3 | value3 |
B | 20210121 | key2 | value1 |
B | 20210121 | key3 | value2 |
Please help with the Spark SQL query or DataFrame operations in Scala. Let me know if any more info is required.
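To make the expected semantics concrete, here is a plain-Scala sketch (no Spark required) that reproduces the desired result on the sample data: keep every table-1 row whose `(row_key, key)` pair has no match in table 2. In Spark this corresponds to a left anti join on both columns, e.g. `df1.join(df2, Seq("row_key", "key"), "left_anti")` or, in Spark SQL, `SELECT * FROM t1 LEFT ANTI JOIN t2 ON t1.row_key = t2.row_key AND t1.key = t2.key` (the names `df1`, `df2`, `t1`, `t2` are placeholders for the actual tables).

```scala
// Sample rows from table 1: (row_key, data_as_of_date, key, value)
val table1 = Seq(
  ("A", "20210121", "key1", "value1"),
  ("A", "20210121", "key2", "value2"),
  ("A", "20210121", "key3", "value3"),
  ("B", "20210121", "key1", "value1"),
  ("B", "20210121", "key2", "value1"),
  ("B", "20210121", "key3", "value2"),
  ("B", "20210121", "key4", "value3"),
  ("C", "20210121", "key1", "value2")
)

// Sample rows from table 2, same columns
val table2 = Seq(
  ("A", "20210121", "key1", "value1"),
  ("A", "20210121", "key2", "value2"),
  ("B", "20210121", "key1", "value1"),
  ("B", "20210121", "key4", "value3"),
  ("C", "20210121", "key1", "value2")
)

// (row_key, key) pairs present in table 2
val t2Pairs = table2.map { case (rk, _, k, _) => (rk, k) }.toSet

// Anti-join: keep table-1 rows whose (row_key, key) pair is absent from table 2
val result = table1.filterNot { case (rk, _, k, _) => t2Pairs((rk, k)) }
// result: (A, key3), (B, key2), (B, key3) rows, matching the expected output above
```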