What is the best way to add a per-key auto-incrementing position number after sorting? For example:
Raw file:
1,a,b,c,1,1
1,a,b,d,0,0
1,a,b,e,1,0
2,a,e,c,0,0
2,a,f,d,1,0
Desired output (the last column is the position within each group, after grouping on the first three fields and sorting in descending order on the last two values):
1,a,b,c,1,1,1
1,a,b,d,0,0,3
1,a,b,e,1,0,2
2,a,e,c,0,0,2
2,a,f,d,1,0,1
I am using a solution based on groupByKey, but it is running into some issues (possibly a bug in PySpark/Spark?). Is there a better way to achieve this?
My solution:
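# Load the file, drop the header, and parse each line into a record
A = (sc.textFile("train.csv")
     .filter(lambda x: not isHeader(x))
     .map(split)
     .map(parse_train)
     .filter(lambda x: x is not None))
# Key each record on the grouping fields, then collect the records per key
B = (A.map(lambda k: ((k.first_field, k.second_field, k.third_field), k[0:5]))
     .groupByKey())
# Sort each group, append the position, and flatten back to one row per record
(B.map(sort_n_set_position)
 .flatMap(lambda line: line))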
where sort_n_set_position iterates over each group's iterator, sorts the rows, and appends the position as the last column.
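For reference, here is a minimal sketch of what such a per-group function could look like. It is not my actual implementation, only an illustration. It assumes each row is a plain tuple and that the two sort values sit at indexes 4 and 5. The grouping is simulated locally with a dict instead of groupByKey, and KEY_LEN = 2 is an assumption: the sample output above is reproduced when the key is the first two fields.

```python
from collections import defaultdict


def sort_n_set_position(rows, sort_cols=(4, 5)):
    """Sort one group's rows in descending order on sort_cols and
    append a 1-based position to each row."""
    ordered = sorted(
        rows,
        key=lambda r: tuple(r[i] for i in sort_cols),
        reverse=True,
    )
    return [tuple(r) + (pos,) for pos, r in enumerate(ordered, start=1)]


# Sample rows from the raw file above.
rows = [
    (1, "a", "b", "c", 1, 1),
    (1, "a", "b", "d", 0, 0),
    (1, "a", "b", "e", 1, 0),
    (2, "a", "e", "c", 0, 0),
    (2, "a", "f", "d", 1, 0),
]

# Assumption: number of leading fields that form the grouping key.
KEY_LEN = 2

# Local stand-in for groupByKey: collect the rows of each key in a list.
groups = defaultdict(list)
for r in rows:
    groups[r[:KEY_LEN]].append(r)

# Rank every group and flatten the result, like flatMap would.
ranked = [
    out
    for key in sorted(groups)
    for out in sort_n_set_position(groups[key])
]

for row in ranked:
    print(",".join(str(v) for v in row))
```

Note that the rows come out ordered by position within each group rather than in the original file order. On the grouped RDD, the same function could presumably be applied with something like B.flatMap(lambda kv: sort_n_set_position(kv[1])), since groupByKey yields (key, iterable) pairs.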