2

I have 2 RDDs I would like to join which looks like this

val a:RDD[(Option[Int],V)]
val q:RDD[(Int,V)]

Is there any way I can do a left outer join on them? I have tried this but it does not work because the type of the key is different i.e Int, Option[Int]

 q.leftOuterJoin(a)

2 Answers2

5

The natural solution is to convert the Int to Option[Int] so they have the same type.

Following you example:

val a:RDD[(Option[Int],V)]
val q:RDD[(Int,V)]


q.map{ case (k,v) => (Some(k),v))}.leftOuterJoin(a)

If you want to recover the Int type at the output, you can do this:

q.map{ case (k,v) => (Some(k),v))}.leftOuterJoin(a).map{ case (k,v) => (k.get, v) }

Note that you can do ".get" without any problem since it is not possible to get None's there.

  • I found this post as well to convert from option type https://stackoverflow.com/questions/4730842/how-to-transform-scala-collection-of-optionx-to-collection-of-x# – Priyan Chandrapala Jun 22 '17 at 08:38
2

One way to do is to convert it into dataframe and join

Here is a simple example

import spark.implicits._
val a = spark.sparkContext.parallelize(Seq(
  (Some(3), 33),
  (Some(1), 11),
  (Some(2), 22)
)).toDF("id", "value1")

val q = spark.sparkContext.parallelize(Seq(
  (Some(3), 33)
)).toDF("id", "value2")

q.join(a, a("id") === q("id") , "leftouter").show
koiralo
  • 22,594
  • 6
  • 51
  • 72