I'm using the Spark Java API and I'm trying to find records that were deleted between two files, using Datasets. For one test I'm comparing two identical files that have 2 columns. I use one of the columns as a kind of primary key (if the PK is not in the newer file, the record was deleted).
Example of the file:
ID|TYPE
ABC|BUY
CDE|BUY
FGH|SELL
Datasets were created as:
Dataset<Row> previous = sparkSession.read()
    .option("inferSchema", "true")
    .option("header", "true")
    .option("delimiter", "|")
    .csv(pathToFile); // actual is created the same way, from its own path
I get inconsistent results in the two scenarios below.
Example 1:
Dataset<Row> deleted = previous.join(actual, previous.col("ID").equalTo(actual.col("ID")), "leftanti");
As a result I get:
|
A single pipe is printed in my output file, and if I invoke deleted.show() I get null|null.
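For reference, here is a minimal self-contained reproduction I put together (my own sketch; the `LeftAntiCheck` and `Record` names are made up, and an in-memory bean replaces the CSV files). Built this way, with the two sides created independently, the left-anti join of identical data comes back empty:

```java
import java.io.Serializable;
import java.util.Arrays;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class LeftAntiCheck {

    // Bean standing in for one line of the file
    public static class Record implements Serializable {
        private String id;
        private String type;
        public Record() { }
        public Record(String id, String type) { this.id = id; this.type = type; }
        public String getId() { return id; }
        public void setId(String id) { this.id = id; }
        public String getType() { return type; }
        public void setType(String type) { this.type = type; }
    }

    public static long run() {
        SparkSession spark = SparkSession.builder()
                .appName("leftanti-check")
                .master("local[*]")
                .getOrCreate();

        // Two identical datasets built independently (no shared lineage)
        Dataset<Row> previous = spark.createDataFrame(Arrays.asList(
                new Record("ABC", "BUY"),
                new Record("CDE", "BUY"),
                new Record("FGH", "SELL")), Record.class);
        Dataset<Row> actual = spark.createDataFrame(Arrays.asList(
                new Record("ABC", "BUY"),
                new Record("CDE", "BUY"),
                new Record("FGH", "SELL")), Record.class);

        // Left-anti join: rows of previous whose id has no match in actual
        Dataset<Row> deleted = previous.join(actual,
                previous.col("id").equalTo(actual.col("id")), "leftanti");
        long n = deleted.count();
        spark.stop();
        return n;
    }

    public static void main(String[] args) {
        System.out.println("deleted count: " + run()); // 0 for identical inputs
    }
}
```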
Example 2 is very similar, but I calculate a hash from all columns (for both datasets separately):
// columns is a List<Column> built from previous.columns()
previous = previous.withColumn("hash", functions.hash(columns.toArray(new Column[0])));
I replace the ID with the hash in the join condition:
Dataset<Row> deleted = previous.join(actual, previous.col("hash").equalTo(actual.col("hash")), "leftanti");
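Spelled out end-to-end, my second scenario looks roughly like this (a sketch with in-memory data instead of the CSV files; building `columns` by mapping the `String[]` from `columns()` through `functions.col` is my assumption, since `hash` takes `Column` objects, and `HashAntiJoin` is a made-up name):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.functions;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class HashAntiJoin {

    // Append a "hash" column computed over all existing columns
    static Dataset<Row> withHash(Dataset<Row> df) {
        List<Column> cols = Arrays.stream(df.columns())
                .map(functions::col)
                .collect(Collectors.toList());
        return df.withColumn("hash", functions.hash(cols.toArray(new Column[0])));
    }

    public static long run() {
        SparkSession spark = SparkSession.builder()
                .appName("hash-anti-join")
                .master("local[*]")
                .getOrCreate();

        StructType schema = new StructType()
                .add("ID", DataTypes.StringType)
                .add("TYPE", DataTypes.StringType);
        List<Row> rows = Arrays.asList(
                RowFactory.create("ABC", "BUY"),
                RowFactory.create("CDE", "BUY"),
                RowFactory.create("FGH", "SELL"));

        // Identical data on both sides, hashed separately
        Dataset<Row> previous = withHash(spark.createDataFrame(rows, schema));
        Dataset<Row> actual = withHash(spark.createDataFrame(rows, schema));

        // Left-anti join on the hash: rows of previous with no matching hash in actual
        Dataset<Row> deleted = previous.join(actual,
                previous.col("hash").equalTo(actual.col("hash")), "leftanti");
        long n = deleted.count();
        spark.stop();
        return n;
    }

    public static void main(String[] args) {
        System.out.println("deleted count: " + run()); // 0 for identical inputs
    }
}
```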
But now my result, as expected, is an empty file. Why do the two approaches give different results?