I have read a lot of Q&A here and still cannot find an answer to my question.
Can I compute the intersection or difference of two Datasets in a way that takes duplicate values into account?
The code below shows that when a dataset contains duplicates, the extra element '2' in t5 (relative to t2) is lost.
For instance, I want to get t5 - t2 = (1, 2). However, I only get t5 - t2 = (1) from the following code:
import spark.implicits._  // provides .toDS(); already in scope inside spark-shell
val t1 = Seq(1, 2, 3).toDS()
val t2 = Seq(2, 3).toDS()
val t3 = Seq(3, 4).toDS()
val t4 = Seq(4, 5).toDS()
val t5 = Seq(1, 2, 2, 3).toDS()
val t6 = Seq(2, 2, 3).toDS()
t1.intersect(t2).show()
> 2 3
t1.intersect(t3).show()
> 3
t1.intersect(t4).show()
> (empty)
t1.union(t2).except(t1.intersect(t2)).show()
> 1
t5.intersect(t2).show()
> 2 3
t5.intersect(t6).show()
> 2 3
t5.except(t2).show()
> 1
t5.except(t6).show()
> 1
t5.union(t2).except(t5.intersect(t2)).show()
> 1
t5.union(t6).except(t5.intersect(t6)).show()
> 1
t5.join(t2, t5("value") === t2("value"), "leftanti").show()
> 1
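
To make the expected result concrete, here is a minimal sketch of the duplicate-aware behavior I am after. It assumes Spark 2.4 or later, where `Dataset.exceptAll` and `Dataset.intersectAll` (the counterparts of SQL `EXCEPT ALL` and `INTERSECT ALL`) are available. The local `SparkSession` setup and the app name are only illustrative assumptions so the snippet can run outside spark-shell.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for illustration; spark-shell already provides `spark`.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("multiset-diff-sketch") // hypothetical app name
  .getOrCreate()
import spark.implicits._

val t2 = Seq(2, 3).toDS()
val t5 = Seq(1, 2, 2, 3).toDS()
val t6 = Seq(2, 2, 3).toDS()

// except is a set difference: duplicates are collapsed,
// so the second 2 in t5 is dropped along with the first.
val setDiff = t5.except(t2)

// exceptAll keeps duplicates: each row of t2 cancels
// at most one matching row of t5.
val bagDiff = t5.exceptAll(t2)

// intersectAll keeps a row as many times as it occurs in both sides.
val bagIntersect = t5.intersectAll(t6)

setDiff.show()
bagDiff.show()
bagIntersect.show()
```

With these inputs, `exceptAll` should return the rows 1 and 2, and `intersectAll` should return 2, 2, 3. Row order is not guaranteed, so sort the result if the order matters.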