
I was going through the Spark SQL plan for a join optimised by Adaptive Query Execution.

On the right side, Spark learns at runtime that the table is small enough to broadcast and therefore switches to a broadcast hash join.

Since broadcast hash join is a narrow operation, why do we still have an exchange on the left (large) table?

(screenshot: Spark SQL plan)

Even in the final physical plan, the exchange is still there:

(screenshot: final physical plan)

https://www.databricks.training/spark-ui-simulator/experiment-3799A/v002-S/index.html?artifact=sql-21
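A minimal repro sketch of the same pattern (Spark 3.x in local mode; table and column names here are illustrative, not from the experiment above). The static optimizer estimates the filtered side as large, so it plans a sort-merge join with an exchange on both sides; at runtime AQE sees the small side's actual size and rewrites the join to a broadcast hash join:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("aqe-bhj-demo")
  .master("local[*]")
  .config("spark.sql.adaptive.enabled", "true")
  .getOrCreate()
import spark.implicits._

// Both sides look large to the static optimizer: without CBO, the size
// estimate for a Filter defaults to its child's size, so the initial
// plan is a sort-merge join with an Exchange on each side.
val large = spark.range(0L, 10000000L).toDF("k")
val small = spark.range(0L, 10000000L).toDF("k").where($"k" < 100)

val joined = large.join(small, "k")
joined.count()    // execute so AQE can re-plan with runtime statistics
joined.explain()  // AdaptiveSparkPlan now shows BroadcastHashJoin, yet
                  // the Exchange on the large side is still in the plan
```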

Rahul Kumar

1 Answer


The exchange on the large side was already planned before AQE switched the join strategy (the static plan chose a sort-merge join), and once that shuffle stage is materialized it cannot simply be removed. Instead, AQE replaces the shuffle read with a local shuffle reader: each task reads the output of its own mappers locally rather than pulling blocks across the network per reducer partition. I.e., you get a localized shuffle based on mappers, not reducers, instead of a regular shuffle.
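As a sketch of the knobs involved (configuration names as documented for Spark 3.x; the node name in the plan varies by version), toggling the local shuffle reader off and re-running explain() makes the difference visible:

```scala
// The relevant AQE settings (values shown are the Spark 3.x defaults).
// With the local shuffle reader on, the large side's shuffle output is
// read back mapper-by-mapper on the nodes that wrote it, so the
// Exchange node survives in the plan but no network-wide
// redistribution takes place.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.localShuffleReader.enabled", "true")

// In the explain output / SQL UI this appears as a
// "CustomShuffleReader (local)" node in Spark 3.0/3.1 (renamed
// AQEShuffleRead in later versions) sitting on top of the Exchange.
```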

See https://www.waitingforcode.com/apache-spark-sql/what-new-apache-spark-3-local-shuffle-reader/read and https://dev.to/yaooqinn/how-to-use-spark-adaptive-query-execution-aqe-in-kyuubi-2ek2

For background you can also peruse "Spark: disk I/O on stage boundaries explanation".

thebluephantom