combineByKey fails

Question

I am copying and pasting the exact code from the O'Reilly Learning Spark textbook and I am getting an error: org.apache.spark.SparkException: Job aborted due to stage failure

I am trying to understand what this code is doing, but am having trouble understanding it because it won't run:

nums = sc.parallelize([1, 2, 3, 4])
sumCount = nums.combineByKey((lambda x: (x,1)),
 (lambda x, y: (x[0] + y, x[1] + 1)),
 (lambda x, y: (x[0] + y[0], x[1] + y[1])))

sumCount.map(lambda key, xy: (key, xy[0]/xy[1])).collectAsMap()

Below is the full error, any insights?

Job aborted due to stage failure: Task 3 in stage 26.0 failed 1 times, most recent failure: Lost task 3.0 in stage 26.0 (TID 73, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/worker.py", line 480, in main
    process()
  File "/databricks/spark/python/pyspark/worker.py", line 470, in process
    out_iter = func(split_index, iterator)
  File "/databricks/spark/python/pyspark/rdd.py", line 2543, in pipeline_func
    return func(split, prev_func(split, iterator))
  File "/databricks/spark/python/pyspark/rdd.py", line 353, in func
    return f(iterator)
  File "/databricks/spark/python/pyspark/rdd.py", line 1905, in combineLocally
    merger.mergeValues(iterator)
  File "/databricks/spark/python/pyspark/shuffle.py", line 238, in mergeValues
    for k, v in iterator:
TypeError: 'int' object is not iterable

    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:514)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:650)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:633)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:468)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$GroupedIterator.fill(Iterator.scala:1124)
    at scala.collection.Iterator$GroupedIterator.hasNext(Iterator.scala:1130)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:55)
    at org.apache.spark.scheduler.Task.doRunTask(Task.scala:139)
    at org.apache.spark.scheduler.Task.run(Task.scala:112)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$13.apply(Executor.scala:497)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1526)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:503)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:

Possible duplicate of [pyspark flatmat error: TypeError: 'int' object is not iterable](https://stackoverflow.com/questions/49189097/pyspark-flatmat-error-typeerror-int-object-is-not-iterable) — oetzi, Oct 07 '19 at 18:15
Something has changed as it appears to have run in prior versions of pyspark. — thebluephantom, Oct 07 '19 at 19:36
Do you know how it has changed? Or what changes I could make to get it to run? — Elliot Huebler, Oct 07 '19 at 19:49
try this person pault. He knows everything about pyspark. There is some change, but I cannot find it. — thebluephantom, Oct 07 '19 at 20:08
Not convinced a valid approach. You normally need K, V, not just V for a Combiner. — thebluephantom, Oct 07 '19 at 20:22
Is this the exact code from the book? It doesn't seem right. You need keys AND values. What do you expect the output to be? **Edit**: Your `nums` is the problem - try [this example](http://abshinn.github.io/python/apache-spark/2014/10/11/using-combinebykey-in-apache-spark/). — pault, Oct 07 '19 at 20:26
Yes this is from the book, tho they do not define nums in the example. I took nums where it was previously defined in the book but not in the example. — Elliot Huebler, Oct 07 '19 at 20:50
@pault thank you for the example, it is good to work through it however it still does not run. When I use the "data" from the example I get the same error as before, and when I run the whole example I get a syntax error with the "averageByKey" step — Elliot Huebler, Oct 07 '19 at 20:52
see my question on this as well, the example does not work with python3 — thebluephantom, Oct 07 '19 at 20:57
I know that it doesn't work still... even with nums changed to data = sc.parallelize( [(0, 2.), (0, 4.), (1, 0.), (1, 10.), (1, 20.)] ) — Elliot Huebler, Oct 07 '19 at 21:26
https://stackoverflow.com/questions/58276755/combinebykey-works-fine-with-pyspark-python-2-but-not-python-3 — thebluephantom, Oct 07 '19 at 21:30
Link to your question? I don't this needs editing, it just needs an answer — Elliot Huebler, Oct 07 '19 at 21:31

thebluephantom · Accepted Answer · 2019-10-08T18:59:53.087

Well, well.

Assuming that the nums above is not that good, as it is not a (K,V) tuple, then assuming the code is as follows:

 data = sc.parallelize( [(0, 2.), (0, 4.), (1, 0.), (1, 10.), (1, 20.)] )

 sumCount = data.combineByKey(lambda value: (value, 1),
                              lambda x, value: (x[0] + value, x[1] + 1),
                              lambda x, y: (x[0] + y[0], x[1] + y[1]))

 averageByKey = sumCount.map(lambda (label, (value_sum, count)): (label, value_sum / count))

 print averageByKey.collectAsMap()

Under Spark, with python2 (pyspark), the above code runs fine.

Under Spark, with python3 (pyspark), the above code generates an error on:

 averageByKey = sumCount.map(lambda (label, (value_sum, count)): (label, value_sum / count))

https://www.python.org/dev/peps/pep-3113/ explains why this feature, "tuple parameter unpacking", was removed in Python 3. It seems some what of a let down to me.

The easy way to solve it is to pass the above code online into https://www.pythonconverter.com/ and run the code converter. This is:

data         = sc.parallelize( [(0, 2.), (0, 4.), (1, 0.), (1, 10.), (1, 20.)] )

sumCount     = data.combineByKey(lambda value: (value, 1),
                                 lambda x, value: (x[0] + value, x[1] + 1),
                                 lambda x, y: (x[0] + y[0], x[1] + y[1]))

averageByKey = sumCount.map(lambda label_value_sum_count: (label_value_sum_count[0], label_value_sum_count[1][0] / label_value_sum_count[1][1]))

print(averageByKey.collectAsMap())

returns correctly:

{0: 3.0, 1: 10.0}

averageByKey has a different declaration now. You need to study and read that link and to get familiarity use the Python 2 to 3 Converter. Saves some time and you can ease your way into it. Esteemed SO-member pault also has had some issues with this, so there you have it, not so simple.

Thank you! This is a great answer :) – Elliot Huebler Oct 08 '19 at 19:18 — Elliot Huebler, Oct 08 '19 at 19:18

combineByKey fails

1 Answers1