Use case is to append a column to a Parquet dataset and then re-write efficiently at the same location. Here is a minimal example.
Create a pandas
DataFrame and write as a partitioned Parquet dataset.
import pandas as pd
df = pd.DataFrame({
'id': ['a','a','a','b','b','b','b','c','c'],
'value': [0,1,2,3,4,5,6,7,8]})
path = r'c:/data.parquet'
df.to_parquet(path=path, engine='pyarrow', compression='snappy', index=False, partition_cols=['id'], flavor='spark')
Then load the Parquet dataset as a pyspark
view and create a modified dataset as a pyspark
DataFrame.
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
spark.read.parquet(path).createTempView('data')
sf = spark.sql(f"""SELECT id, value, 0 AS segment FROM data""")
At this point sf
data is same as df
data but with an additional segment
column of all zeros. I would like to efficiently overwrite the existing Parquet dataset at path
with sf
as a Parquet dataset in the same location. Below is what does not work. Also prefer not to write sf
to a new location, delete old Parquet dataset, and rename as does not seem efficient.
# saves existing data and new data
sf.write.partitionBy('id').mode('append').parquet(path)
# immediately deletes existing data then crashes
sf.write.partitionBy('id').mode('overwrite').parquet(path)