
I'm using PySpark 2 on an EMR cluster on AWS, and I'm trying to pass a DataFrame column to a function and manipulate the individual items within the column.

Let's say I have the following set up:

from pyspark.sql.types import IntegerType

mylist = [x for x in range(0, 10)]
df = spark.createDataFrame(mylist, IntegerType())
df.show()

+-----+
|value|
+-----+
|    0|
|    1|
|    2|
|    3|
|    4|
|    5|
|    6|
|    7|
|    8|
|    9|
+-----+

I want a function that performs a test on, say, the value in row 5 of the value column and, depending on what it finds, assigns that value to a new variable and perhaps does some other manipulation of the new variable.

e.g.

def myfunc(df_col):
    #
    #   In pseudocode:
    #   x = value in row 5 of the column
    #   if x == whatever:
    #       do something with x
    #

myfunc(df.value)

Can anyone help me out? I just seem to have hit a mental roadblock with this.

user2699504

1 Answer


Thanks for the link given; it was helpful. Here is one possible answer:

from pyspark.sql.types import IntegerType

mylist = [x for x in range(0, 10)]
df = spark.createDataFrame(mylist, IntegerType())
df.show()

# collect() returns every row to the driver as a list of Row objects;
# index 4 is the fifth row
rn = df.collect()[4]
x = rn.value
if x == 4:
    print("fifth row value = ", str(x))
+-----+
|value|
+-----+
|    0|
|    1|
|    2|
|    3|
|    4|
|    5|
|    6|
|    7|
|    8|
|    9|
+-----+

fifth row value =  4