
I have an existing Hive table:

CREATE TABLE form_submit (form_id String,
submitter_name String)
PARTITIONED BY (submission_date String)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS ORC;

I have a csv of raw data, which I read using

 val session = SparkSession.builder()
      .enableHiveSupport()
      .config("spark.hadoop.hive.exec.dynamic.partition", "true")
      .config("spark.hadoop.hive.exec.dynamic.partition.mode", "nonstrict")
      .getOrCreate()
 val dataframe = session
      .read
      .option("header", "true")
      .csv(hdfsPath)

I then perform some manipulations on this data, using a series of withColumn and drop statements, to make sure that the schema matches the table's schema.
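For context, the manipulations look roughly like this (a minimal sketch; the raw CSV column names such as raw_date are hypothetical, only the table columns form_id, submitter_name and submission_date come from the schema above):

```scala
import org.apache.spark.sql.functions._

// Sketch: cast/rename the raw CSV columns so the dataframe matches
// the Hive table's schema, then drop the leftovers.
val formattedDataframe = dataframe
  .withColumn("form_id", col("form_id").cast("string"))
  .withColumn("submitter_name", trim(col("submitter_name")))
  .withColumn("submission_date", col("raw_date").cast("string"))
  .drop("raw_date")
```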

I then try to write it like so:

formattedDataframe.write
      .mode(SaveMode.Append)
      .format("hive")
      .partitionBy("submission_date")
      .saveAsTable(tableName)

I'm not using insertInto, because the columns in the dataframe end up in a bad order, and I wouldn't want to rely on column order anyway.
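For reference, the insertInto variant I'm avoiding would need the columns reordered by name first, since insertInto matches by position rather than by name (a sketch, assuming the table already exists and session is the SparkSession from above):

```scala
// Sketch: read the target table's column order, select the dataframe
// columns in that same order by name, then append positionally.
val tableCols = session.table(tableName).columns
formattedDataframe
  .select(tableCols.map(col): _*)
  .write
  .mode(SaveMode.Append)
  .insertInto(tableName)
```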

And run it as a Spark job. I get an exception:

Exception in thread "main" org.apache.spark.SparkException: Requested partitioning does not match the form_submit table:
Requested partitions:
Table partitions: "submission_date"

What am I doing wrong? Didn't I choose the partitioning by calling partitionBy?

