I'm using PySpark 2 on an AWS EMR cluster, and I'm trying to pass a DataFrame column to a function and manipulate the individual items within that column.
Say I have the following setup:
from pyspark.sql.types import IntegerType

mylist = [x for x in range(0, 10)]
df = spark.createDataFrame(mylist, IntegerType())
df.show()
+-----+
|value|
+-----+
|    0|
|    1|
|    2|
|    3|
|    4|
|    5|
|    6|
|    7|
|    8|
|    9|
+-----+
I want a function that performs a test on, say, the value in row 5 of the value column and, depending on what it finds, assigns that value to a new variable, perhaps doing some further manipulation of that variable.
e.g.
def myfunc(df_col):
    #
    # In pseudocode:
    # x = value in row 5 of the column
    # if x == whatever:
    #     do something with x
    #

myfunc(df.value)
Can anyone help me out? I just seem to have hit a mental roadblock with this.