
If there is a change in the schema of my incoming CSV file, how can I handle this in Spark?

Suppose on Day-1 I get a CSV file with the schema and data below:

FirstName LastName Age
Sagar     Patro    26
Akash     Nayak    22
Amar      Kumar    18

And on Day-10 my incoming CSV file's schema has changed, as below:

FirstName LastName Mobile     Age 
Sagar     Patro    8984159475 26  
Akash     Nayak    9040988503 22  
Amar      Kumar    9337856871 18  

My Requirement No-1:

I want to know if there is any change in the schema of my incoming CSV file.

My Requirement No-2:

I want to ignore the newly added columns and continue with my earlier schema, i.e. the Day-1 schema.

My Requirement No-3:

I also want to pick up the new schema automatically when the schema of my incoming CSV data changes, i.e. the Day-10 schema.
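The first two requirements reduce to comparing the two files' column-name lists. A minimal sketch in plain Scala, using the column names from the Day-1 and Day-10 files above:

```scala
object ColumnDiffSketch extends App {
  // Column names taken from the two example files above
  val day1Cols  = Seq("FirstName", "LastName", "Age")
  val day10Cols = Seq("FirstName", "LastName", "Mobile", "Age")

  // Requirement 1: any column in the new file that the old one lacks
  val addedCols = day10Cols.diff(day1Cols)
  println(addedCols) // List(Mobile)
}
```

In Spark, `df.columns` gives you exactly such an `Array[String]`, so the same `diff` works on two DataFrames.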

SCouto
Sagar patro
  • you need to find the differences between the schemas, [here](https://stackoverflow.com/questions/47862974/schema-comparison-of-two-dataframes-in-scala) you can find an extensive discussion about it – abiratsis Apr 25 '20 at 10:50

1 Answer

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object SchemaDiff {

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("SchemaDiff").getOrCreate()

    // Since it's a simple CSV, column data type changes are not considered here
    val df1: DataFrame = spark.read.option("header", "true").csv("day1.csv")  // yesterday's data (example path)
    val df2: DataFrame = spark.read.option("header", "true").csv("day10.csv") // today's data (example path)

    // Requirement 1: columns present today but not yesterday
    val deltaColumnNames = df2.columns.diff(df1.columns)
    if (deltaColumnNames.nonEmpty) {
      println(s"Schema change: new columns ${deltaColumnNames.mkString(", ")}")
    }

    val ignoreSchemaChange = true
    val resultDf = if (ignoreSchemaChange) {
      // Requirement 2: keep only yesterday's columns.
      // Note: df2.toDF(df1.columns: _*) would fail here, because toDF
      // requires the same number of column names as the DataFrame has columns.
      df2.select(df1.columns.map(col): _*)
    } else {
      df2 // Requirement 3: use the updated schema
    }
  }
}
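For Requirement 3, if you also need to combine old Day-1 data with the new Day-10 schema, one approach (a sketch, assuming Spark 2.3+ for `unionByName`; the helper name is my own) is to backfill the missing columns with nulls before unioning:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, lit}

object SchemaEvolve {
  // Add any columns present in targetColumns but missing from df as nulls,
  // then reorder so df matches the target column order.
  def alignToSchema(df: DataFrame, targetColumns: Seq[String]): DataFrame = {
    val withMissing = targetColumns.diff(df.columns).foldLeft(df) {
      (acc, c) => acc.withColumn(c, lit(null).cast("string"))
    }
    withMissing.select(targetColumns.map(col): _*)
  }

  // Usage sketch: old Day-1 data gets a null Mobile column,
  // then both days can be combined under the Day-10 schema.
  // val combined = alignToSchema(day1Df, day10Df.columns).unionByName(day10Df)
}
```

Casting the null columns to `string` keeps the sketch simple; with a typed schema you would cast to each column's actual type instead.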

QuickSilver