
I am using spark-sql 2.4.1 with Java 8. I have a scenario like the one below:

List data = List(
  ("20", "score", "school", "2018-03-31", 14 , 12 , 20),
  ("21", "score", "school", "2018-03-31", 13 , 13 , 21),
  ("22", "rate", "school", "2018-03-31", 11 , 14, 22),
  ("21", "rate", "school", "2018-03-31", 13 , 12, 23)
 )

Dataset<Row> df = data.toDF("id", "code", "entity", "date", "column1", "column2", "column3");

Dataset<Row> resultDs = df
          .withColumn("column_names", 
                  array(Arrays.asList(df.columns()).stream().map(s -> new Column(s)).toArray(Column[]::new))
              );

**But this shows each row's column values instead of the column names. What is wrong here, and how do I get "column_names" in Java?**
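For context, the likely cause is that `new Column(s)` resolves each name to the column itself, so the array is filled with row values. A minimal sketch of the intended fix, assuming the standard Spark Java API (`functions.lit` wraps each name as a string literal), would be:

```java
import java.util.Arrays;

import org.apache.spark.sql.Column;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;

import static org.apache.spark.sql.functions.array;
import static org.apache.spark.sql.functions.lit;

public class ColumnNames {

    // Wrap each column *name* in lit() so the array holds string literals,
    // not references to the columns' values.
    static Column namesArray(String[] columns) {
        return array(Arrays.stream(columns)
                .map(name -> lit(name))   // lit("id"), lit("code"), ...
                .toArray(Column[]::new));
    }

    static Dataset<Row> withColumnNames(Dataset<Row> df) {
        return df.withColumn("column_names", namesArray(df.columns()));
    }
}
```

This is an untested sketch against the spark-sql 2.4.x API; the class and method names are illustrative.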

I am trying to solve below use-case:

Let's say I have 100 columns, column1 through column100, and each column's calculation is different, depending on the column name and data. On every run, my Spark job receives the list of columns it needs to calculate, but my code contains the logic for all columns. I need to skip the logic for the unspecified columns; however, because the selected dataframe does not contain all columns, the logic for the non-selected columns throws an exception (column not found). I need to fix this.
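One way to express that guard, as a hypothetical sketch (the registry of per-column logic and the column names below are illustrative, not from the question), is to filter the configured logic against the dataframe's actual columns on the driver side before applying any of it:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class ColumnLogicGuard {

    // Hypothetical registry: column name -> its calculation. In the real job
    // each value would be a Column expression or a transformation function
    // rather than a String description.
    static Map<String, String> applicableLogic(String[] dfColumns,
                                               Map<String, String> allLogic) {
        Set<String> present = new HashSet<>(Arrays.asList(dfColumns));
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : allLogic.entrySet()) {
            if (present.contains(e.getKey())) {  // skip logic for absent columns
                result.put(e.getKey(), e.getValue());
            }
        }
        return result;
    }
}
```

The same containment check can be run directly against `df.columns()` before each `withColumn` call, so no column-existence condition is needed on the executor side.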

  • This is just a list, `df.columns` – Lamanus Aug 18 '20 at 08:49
  • This doesn't look like real Java code. `List data = List(...)`? What do you expect there to be in the new column? – RealSkeptic Aug 18 '20 at 08:58
  • @RealSkeptic at least second part is real java code ...new column i.e. "column_names" should have all column names of the dataframe. – BdEngineer Aug 18 '20 at 11:55
  • @Lamanus df.columns return String[] where as array() function expecting Column ...so this does not work – BdEngineer Aug 18 '20 at 11:58
  • Again, please tell us - by editing the question, as this is necessary information - what you expect that column to contain. If you really meant the column names, which are the same for all the rows in the dataset, can you explain your use case? Why have the same value duplicated in each row? – RealSkeptic Aug 18 '20 at 15:20
  • @RealSkeptic thanks for prompt reply , can you check Someshwar Kale answer i am struggling to achieve that use-case .. https://stackoverflow.com/questions/63450135/applying-when-condition-only-when-column-exists-in-the-dataframe – BdEngineer Aug 18 '20 at 16:43
  • @RealSkeptic Lets say i have 100 columns like column1....to column100 ... each column calculation would be different depend on the column name and data .... but every time i run my spark job i will get which columns i need to calculate ... but in my code i will have all columns logic i.e. each column logic might be different ... i need to ignore the logic of unspecified columns... but as the dataframe contain all columns i am selecting specified columns..so for non-selected columns my code throws exception as the column not found ...i need to fix this – BdEngineer Aug 18 '20 at 16:50
  • I don't understand. You know, on the driver side, which columns exist. So do the calculation only if the column exists. Why have the condition on the executor side? – RealSkeptic Aug 18 '20 at 23:02

0 Answers