
I'm using the Spark Java API and I'm trying to find records that were deleted between two files using Dataset. In one of my tests I compare two identical files that have two columns. I use one of the columns as a kind of PK (if the PK is not in the newer file, the record is a delete).

Example of the file

ID|TYPE
ABC|BUY
CDE|BUY
FGH|SELL

Datasets were created as:

// The same call is used for both datasets (previous and actual), each with its own path
Dataset<Row> previous = sparkSession.read()
                       .option("inferSchema","true")
                       .option("header","true")
                       .option("delimiter","|")
                       .csv(*pathToFile*);

I get inconsistent results for the two scenarios below.

Example 1:

Dataset<Row> deleted = previous.join(actual,previous.col("ID").equalTo(actual.col("ID")),"leftanti"); 

As a result I get:

|
Only the pipe is written to my output file. If I invoke deleted.show(), I get null|null.

Example 2 is very similar, but I calculate a hash from all columns (for each dataset separately) as:

//columns has the content of previous.columns();
previous = previous.withColumn("hash", functions.hash(columns.toArray(new Column[0])));

I replace the ID with the hash in the query:

Dataset<Row> deleted = previous.join(actual,previous.col("hash").equalTo(actual.col("hash")),"leftanti");

But now my result, as expected, is a blank file. Why are the results different?

Graciano

1 Answer


If I understood your problem correctly, you want the records that are present in one Dataset but not in the other. In that case you can use the except method.

The same approach is described in the question "Spark: subtract two DataFrames".

dataFrame1.except(dataFrame2) will return a new DataFrame containing the rows that are in dataFrame1 but not in dataFrame2. If that does not give the result you need, swap the operands:

dataFrame2.except(dataFrame1)
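
To make the difference between a whole-row subtraction and a key-only comparison concrete, here is a minimal plain-Java sketch. It does not use Spark at all: rows are modeled as lists of strings, and the helper names exceptRows and leftAntiOnId are hypothetical, chosen only to mirror what except and a "leftanti" join on ID are meant to compute. The contents of actual (ABC changed to SELL, FGH removed) are made-up sample data, not taken from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DeleteDetectionSketch {

    // Whole-row difference, modeled on except: keep the distinct rows of
    // `previous` that do not appear, column for column, in `actual`.
    static List<List<String>> exceptRows(List<List<String>> previous,
                                         List<List<String>> actual) {
        Set<List<String>> current = new HashSet<>(actual);
        Set<List<String>> seen = new HashSet<>();
        List<List<String>> result = new ArrayList<>();
        for (List<String> row : previous) {
            if (!current.contains(row) && seen.add(row)) {
                result.add(row);
            }
        }
        return result;
    }

    // Key-only difference, modeled on a "leftanti" join on ID (column 0):
    // keep rows of `previous` whose ID does not occur in `actual` at all.
    static List<List<String>> leftAntiOnId(List<List<String>> previous,
                                           List<List<String>> actual) {
        Set<String> ids = new HashSet<>();
        for (List<String> row : actual) {
            ids.add(row.get(0));
        }
        List<List<String>> result = new ArrayList<>();
        for (List<String> row : previous) {
            if (!ids.contains(row.get(0))) {
                result.add(row);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<List<String>> previous = Arrays.asList(
                Arrays.asList("ABC", "BUY"),
                Arrays.asList("CDE", "BUY"),
                Arrays.asList("FGH", "SELL"));
        // Hypothetical newer file: ABC was updated, FGH was deleted.
        List<List<String>> actual = Arrays.asList(
                Arrays.asList("ABC", "SELL"),
                Arrays.asList("CDE", "BUY"));

        System.out.println(exceptRows(previous, actual));   // [[ABC, BUY], [FGH, SELL]]
        System.out.println(leftAntiOnId(previous, actual)); // [[FGH, SELL]]
    }
}
```

The whole-row version also reports the updated record (ABC), while the key-only version reports only the ID that disappeared. A join on a hash of every column compares whole rows as well, so, hash collisions aside, it should behave like the first method rather than like a comparison on ID alone.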

tarun
  • Tarun, I want the records from the previous Dataset that were deleted (not modified), meaning the “ID” no longer exists. If the same ID is present in both files but with a different TYPE, then it’s a record I don’t care about, because it’s an update and not a delete. – Graciano Oct 25 '19 at 10:52
  • But more importantly: why did I get different results when running with my "PK" column and with a "hash" column on the same datasets? – Graciano Oct 25 '19 at 11:26