How to select the max row for every group in Spark Structured Streaming 2.3.0 without using order by or mapGroupsWithState?

Input:

id | amount     | my_timestamp
-------------------------------------------
1  |      5     |  2018-04-01T01:00:00.000Z
1  |     10     |  2018-04-01T01:10:00.000Z
2  |     20     |  2018-04-01T01:20:00.000Z
2  |     30     |  2018-04-01T01:25:00.000Z
2  |     40     |  2018-04-01T01:30:00.000Z

Expected Output:

id | amount     | my_timestamp
-------------------------------------------
1  |     10     |  2018-04-01T01:10:00.000Z
2  |     40     |  2018-04-01T01:30:00.000Z

Looking for a streaming solution using either raw SQL, e.g. sparkSession.sql("sql query"), or something similar to raw SQL, but not something like mapGroupsWithState.

user1870400

1 Answer

There are multiple approaches to solving this problem.

Approach 1:

You can use window functions in Spark:

import org.apache.spark.sql.expressions.{Window, WindowSpec}
import org.apache.spark.sql.functions.{col, desc, rank}

val filterWindow: WindowSpec = Window.partitionBy("id").orderBy(desc("amount"))

val df = ??? // your input DataFrame

df.withColumn("temp_rank", rank().over(filterWindow))
  .filter(col("temp_rank") === 1)
  .drop("temp_rank")

The problem with this is that it does not work with Structured Streaming: non-time-based window functions are not supported on streaming Datasets, since windowing there is only supported on event-time (TIMESTAMP) columns. This approach works for batch jobs.
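
For reference, since this does work in batch, here is a minimal self-contained sketch of Approach 1 run against the sample data from the question (the object name and toDF column names are illustrative, chosen to match the question's schema):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, desc, rank}

object BatchMaxPerGroup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local").appName("BatchMaxPerGroup").getOrCreate()
    import spark.implicits._

    // Sample rows from the question
    val df = Seq(
      (1, 5.0, "2018-04-01T01:00:00.000Z"),
      (1, 10.0, "2018-04-01T01:10:00.000Z"),
      (2, 20.0, "2018-04-01T01:20:00.000Z"),
      (2, 30.0, "2018-04-01T01:25:00.000Z"),
      (2, 40.0, "2018-04-01T01:30:00.000Z")
    ).toDF("id", "amount", "my_timestamp")

    // Rank rows within each id by descending amount and keep rank 1
    df.withColumn("temp_rank", rank().over(Window.partitionBy("id").orderBy(desc("amount"))))
      .filter(col("temp_rank") === 1)
      .drop("temp_rank")
      .show(false)
    // Prints the max-amount row per id: (1, 10.0, ...) and (2, 40.0, ...)
  }
}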

Approach 2:

Given the constraints specified in the question, you could go with something like the following. The rows are grouped on id, and each group's contents are collected into a Seq[A], where A is a struct of (time, amount). A UDF then picks the record with the maximum amount from that Seq.

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.expressions.UserDefinedFunction
import org.apache.spark.sql.functions.{col, collect_list, struct, udf}
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types.{DoubleType, StructType, TimestampType}

object StreamingDeDuplication {

  case class SubRecord(time: java.sql.Timestamp, amount: Double)

  // Schema of the struct the UDF returns
  val subSchema: StructType = new StructType()
    .add("time", TimestampType)
    .add("amount", DoubleType)

  // Picks the struct with the maximum amount from the collected group
  def deDupe: UserDefinedFunction =
    udf((data: Seq[Row]) => data.maxBy(_.getAs[Double]("amount")), subSchema)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local")
      .appName("StreamingDeDuplication")
      .getOrCreate()

    import spark.implicits._

    // Parse comma-separated lines "id,amount,timestamp" from the socket source
    val records = spark.readStream
      .format("socket")
      .option("host", "localhost")
      .option("port", 9999)
      .load()
      .as[String]
      .map(_.split(","))
      .withColumn("id", $"value".getItem(0).cast("STRING"))
      .withColumn("amount", $"value".getItem(1).cast("DOUBLE"))
      .withColumn("time", $"value".getItem(2).cast("TIMESTAMP"))
      .drop("value")

    val results = records
      .withColumn("temp", struct("time", "amount"))
      .groupByKey(a => a.getAs[String]("id"))
      .agg(collect_list("temp").as[Seq[SubRecord]])
      .withColumnRenamed("value", "id") // groupByKey names the key column "value"
      .withColumnRenamed("collect_list(temp)", "temp_agg")
      .withColumn("af", deDupe($"temp_agg"))
      .withColumn("amount", col("af").getField("amount"))
      .withColumn("time", col("af").getField("time"))
      .drop("af", "temp_agg")

    results
      .writeStream
      .outputMode(OutputMode.Update())
      .option("truncate", "false")
      .format("console")
      .start()
      .awaitTermination()
  }

}
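
To try this locally, start a socket server (for example with nc -lk 9999, assuming netcat is available) and paste the sample rows from the question as comma-separated lines:

1,5,2018-04-01T01:00:00.000Z
1,10,2018-04-01T01:10:00.000Z
2,20,2018-04-01T01:20:00.000Z
2,30,2018-04-01T01:25:00.000Z
2,40,2018-04-01T01:30:00.000Z

Since the output mode is Update, each micro-batch re-emits the current maximum per id as new rows arrive, so the console ends up showing id 1 with amount 10.0 and id 2 with amount 40.0.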

Chitral Verma