
The following works fine in pyspark under Python 2:

data = [
    ('A', 2.), ('A', 4.), ('A', 9.), 
    ('B', 10.), ('B', 20.), 
    ('Z', 3.), ('Z', 5.), ('Z', 8.), ('Z', 12.) 
      ]

rdd = sc.parallelize( data )

sumCount = rdd.combineByKey(lambda value: (value, 1),
                        lambda x, value: (x[0] + value, x[1] + 1),
                        lambda x, y: (x[0] + y[0], x[1] + y[1])
                           )

averageByKey = sumCount.map(lambda (key, (totalSum, count)): (key, totalSum / count))
averageByKey.collectAsMap()

The line:

averageByKey = sumCount.map(lambda (key, (totalSum, count)): (key, totalSum / count))

returns under python3:

  File "<command-2372155099811162>", line 14
    averageByKey = sumCount.map(lambda (key, (totalSum, count)): (key, totalSum / count))
                                       ^
SyntaxError: invalid syntax

I cannot find which Python 3 change causes this, or what the alternative is.

thebluephantom
  • Possible duplicate of [Nested arguments not compiling](https://stackoverflow.com/questions/10607293/nested-arguments-not-compiling) or [Python lambda does not accept tuple argument](https://stackoverflow.com/questions/11328312/python-lambda-does-not-accept-tuple-argument) – pault Oct 07 '19 at 21:16
  • I could not have found that @pault – thebluephantom Oct 07 '19 at 21:17
  • It's a common, albeit lesser known, 2 to 3 change but I've been burned by it before. – pault Oct 07 '19 at 21:19
  • @pault: it helped me but did not provide the answer, only indirectly. I guess it answered the question though. – thebluephantom Oct 08 '19 at 18:53

1 Answer


The following pyspark code works under Python 3:

data         = sc.parallelize( [(0, 2.), (0, 4.), (1, 0.), (1, 10.), (1, 20.)] )

sumCount     = data.combineByKey(lambda value: (value, 1),
                                 lambda x, value: (x[0] + value, x[1] + 1),
                                 lambda x, y: (x[0] + y[0], x[1] + y[1]))

averageByKey = sumCount.map(lambda label_value_sum_count:
                            (label_value_sum_count[0],
                             label_value_sum_count[1][0] / label_value_sum_count[1][1]))

print(averageByKey.collectAsMap()) 

returns correctly:

{0: 3.0, 1: 10.0}

Python 2 and Python 3 differ in places like this, and a lot of the material on SO is still written for Python 2. The specific change here is PEP 3113: Python 3 removed tuple parameter unpacking, so a lambda (or def) can no longer destructure its argument in the parameter list.
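If the indexed lambda above reads poorly, a small named function that unpacks explicitly is a common alternative; it can be passed to sumCount.map() in place of the lambda. A minimal sketch, checked here in plain Python (the sample sum/count pairs are made up for illustration, no SparkContext needed):

```python
# Python 3 removed tuple parameter unpacking (PEP 3113), so
# `lambda (key, (totalSum, count)): ...` is a SyntaxError.
# Instead, unpack explicitly inside the function body:
def average(pair):
    key, (total_sum, count) = pair   # explicit tuple unpacking
    return key, total_sum / count

# Plain-Python check of the same logic:
sum_count = [('A', (15.0, 3)), ('B', (30.0, 2)), ('Z', (28.0, 4))]
print(dict(map(average, sum_count)))
# {'A': 5.0, 'B': 15.0, 'Z': 7.0}
```

In pyspark this would be `averageByKey = sumCount.map(average)`. For a one-liner, `lambda kv: (kv[0], kv[1][0] / kv[1][1])` also works, just without the named fields.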

thebluephantom