I am using spark-sql-2.4.1v with Java 8. I have a scenario where I will be passed the columns names as list/Seq, for those columns only i need to do perform certain operations like sum, avg, percentages etc.
In my scenario, let's say I have column1, column2, column3 columns. First time I will pass column1 name.
Will pull/select "column1" data and perform some operation based on "column1". Second time I will pass column2 name, but earlier column1 not pulled this time so my dataset does not contain "column1" hence earlier conditions are breaking with error "AnalysisException: cannot resolve 'column1
' given input columns".
Hence I need to check the columns, if some column exists then only perform that column related operations else ignore those operations.
How to do this in Spark?
Sample data which is in database.
val data = List(
("20", "score", "school", "2018-03-31", 14 , 12 , 20),
("21", "score", "school", "2018-03-31", 13 , 13 , 21),
("22", "rate", "school", "2018-03-31", 11 , 14, 22),
("21", "rate", "school", "2018-03-31", 13 , 12, 23)
)
val df = data.toDF("id", "code", "entity", "date", "column1", "column2" ,"column3")
.select("id", "code", "entity", "date", "column2") /// these are passed for each run....this set will keep changing.
Dataset<Row> enrichedDs = df
.withColumn("column1_org",col("column1"))
.withColumn("column1",
when(col("column1").isNotNull() , functions.callUDF("lookUpData",col("column1").cast(DataTypes.StringType)))
);
The above logic is only applicable when in select columns "column1" is available. This is failing in the second set as "column1" is not select, so I need some understanding why this only applicable when selected columns as "column1" is available. I need some logic to achieve this.