0

I have read a lot of Q&A here and still cannot find answer for my question.

My question is that can I find out the intersect of two datasets even when there are duplicated values inside?

The code below shows that with duplicate value, dataset t5 will not be able to show its extra element '2' with respect to the dataset t2.

For instance, I want to get something like t5 - t2 = (1, 2). However, I can only get t5 - t2 = (1) from following code:

val t1 = Seq(1, 2, 3).toDS()
val t2 = Seq(2, 3).toDS()
val t3 = Seq(3, 4).toDS()
val t4 = Seq(4, 5 ).toDS()
val t5 = Seq(1, 2, 2, 3).toDS()
val t6 = Seq(2, 2, 3).toDS()

t1.intersect(t2).show() 
> 2 3 

t1.intersect(t3).show()
> 3

t1.intersect(t4).show()
> null

t1.union(t2).except(t1.intersect(t2))
> 1

t5.intersect(t2).show()
> 2 3 

t5.intersect(t6).show()
> 2 3

t5.except(t2).show()
>1

t5.except(t6).show()
>1

t5.union(t2).except(t5.intersect(t2))
>1

t5.union(t6).except(t5.intersect(t6))
>1    

t5.join(t2, t5("value") === t2("value"), "leftanti").show()
>1
  • looking for something like [this](https://stackoverflow.com/questions/39887526/filter-spark-dataframe-based-on-another-dataframe-that-specifies-blacklist-crite/39889263#39889263) ? – eliasah Oct 13 '17 at 07:18
  • @eliasah Thank you for your link. I check it and find it's not quite the case. The solution in the link also has problem when dealing with duplicate keys. The problem is introduced by "join" which will take also duplications into account. – Gabriele Ran Oct 13 '17 at 07:26
  • @eliasah After trying the answer, I add the result in the question. It also returns (1) – Gabriele Ran Oct 13 '17 at 07:32

1 Answers1

0

The solution I have might not be optimal, open to better solutions.

So, we could use joins in this case.

t5.join(t2, t5("value") === t2("value")).drop(t2("value")).show()

//result
+-----+
|value|
+-----+
|    2|
|    2|
|    3|
+-----+

Thanks

L.Li
  • 523
  • 3
  • 10
  • Thank you for your answer. In fact, your solution just proved my problem: t2 has only (2,3), thus its intersection with t5 should be (2, 3) but not (2, 2, 3) as your result. Otherwise when I want to get the elements "in t5 but not in t2", I will get only (1) rather than the correct answer (1, 2) – Gabriele Ran Oct 13 '17 at 07:10
  • Sorry for the misleading question, I added some more declarations to make it more clear. – Gabriele Ran Oct 13 '17 at 07:20
  • Hi Gabriele, then you might not want to use intersection for it. The properties of a Set are non-ordered and unique. If you want to achieve the goal, then group the datasets, get the counts, and subtract them. Thanks. – L.Li Oct 13 '17 at 07:39
  • yes you are right. Using count, groupby+ count, distinct can partly solve it. The only problem is that if I don't have a unique key, 'group by' must take several dimensions and the operation will be very expensive. – Gabriele Ran Oct 13 '17 at 09:07