I have an existing Hive table:
CREATE TABLE form_submit (form_id String,
submitter_name String)
PARTITIONED BY (submission_date String)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t'
LINES TERMINATED BY '\n'
STORED AS ORC;
I have a CSV of raw data, which I read using:
val session = SparkSession.builder()
.enableHiveSupport()
.config("spark.hadoop.hive.exec.dynamic.partition", "true")
.config("spark.hadoop.hive.exec.dynamic.partition.mode", "nonstrict")
.getOrCreate()
val dataframe = session
.read
.option("header", "true")
.csv(hdfsPath)
I then perform some manipulations on this data, using a series of withColumn and drop statements, to make sure that the format matches the table format.
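(The shaping step looks roughly like this; the raw column names are placeholders, not my real ones:)

```scala
import org.apache.spark.sql.functions.col

// Hypothetical shaping: rename/cast the raw CSV columns so the schema
// matches the Hive table (form_id, submitter_name, submission_date).
val formattedDataframe = dataframe
  .withColumn("form_id", col("raw_id"))                      // placeholder source column
  .withColumn("submission_date", col("date").cast("string")) // placeholder source column
  .drop("raw_id", "date")
```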
I then try to write it like so:
formattedDataframe.write
.mode(SaveMode.Append)
.format("hive")
.partitionBy("submission_date")
.saveAsTable(tableName)
I'm not using insertInto, because the columns in the dataframe end up in a bad order, and I wouldn't want to rely on column order anyway.
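(For context, making insertInto work would mean reordering the columns positionally first, something like the sketch below; insertInto matches columns by position rather than by name, and the partition column has to come last:)

```scala
// Sketch of the positional reordering that insertInto would require.
// The select order must mirror the Hive table layout, with the
// partition column (submission_date) at the end.
val reordered = formattedDataframe
  .select("form_id", "submitter_name", "submission_date")

reordered.write
  .mode(SaveMode.Append)
  .insertInto(tableName)
```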
I run it as a Spark job and get an exception:
Exception in thread "main" org.apache.spark.SparkException: Requested partitioning does not match the form_submit table:
Requested partitions:
Table partitions: "submission_date"
What am I doing wrong? Didn't I choose the partitioning by calling partitionBy?