
I'm using PySpark 2 on an EMR cluster on AWS, and I'm trying to pass a DataFrame column to a function and manipulate the individual items within the column.

Let's say I have the following set up:

from pyspark.sql.types import IntegerType

mylist = [x for x in range(0, 10)]
df = spark.createDataFrame(mylist, IntegerType())
df.show()

+-----+
|value|
+-----+
|    0|
|    1|
|    2|
|    3|
|    4|
|    5|
|    6|
|    7|
|    8|
|    9|
+-----+

I want a function that performs a test on, say, the value in row 5 of the value column and, depending on what it finds, assigns that value to a new variable and perhaps does some other manipulation of the new variable.

e.g.

def myfunc(df_col):
    #
    #   In pseudocode:
    #   x = value in row 5 of the column
    #   if x == whatever:
    #       do something with x
    #

myfunc(df.value)

Can anyone help me out? I just seem to have hit a mental roadblock with this.

user2699504

1 Answer


Thanks for the link given; it was helpful. Here is one possible answer:

from pyspark.sql.types import IntegerType

mylist = [x for x in range(0, 10)]
df = spark.createDataFrame(mylist, IntegerType())
df.show()

# collect() returns every row to the driver as a list of Row objects;
# index 4 is the fifth row
rn = df.collect()[4]
x = rn.value
if x == 4:
    print("fifth row value = ", str(x))
+-----+
|value|
+-----+
|    0|
|    1|
|    2|
|    3|
|    4|
|    5|
|    6|
|    7|
|    8|
|    9|
+-----+

fifth row value =  4