I ran into this issue today when using to_parquet with partition_cols. My workaround was to skip pandas' writer and call pyarrow's write_to_dataset directly:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# load the CSV and derive a date-only column to partition on
df = pd.read_csv('box.csv', parse_dates=True)
df['SETTLEMENTDATE'] = pd.to_datetime(df['SETTLEMENTDATE'])
df['Date'] = df['SETTLEMENTDATE'].dt.date

# convert to a pyarrow table and write a partitioned dataset
df_pa = pa.Table.from_pandas(df)
pq.write_to_dataset(df_pa,
                    root_path='nem.parquet',
                    partition_cols=['Date'],
                    basename_template="part-{i}.parquet",
                    existing_data_behavior='delete_matching')
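As a quick sanity check (my addition, not something the workaround requires), the partitioned dataset can be read straight back with pandas; the Date partition column is reconstructed from the directory names:

# read the whole partitioned dataset back and confirm the partitions exist
df_check = pd.read_parquet('nem.parquet')
print(df_check['Date'].unique())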
The key argument in the write_to_dataset call is existing_data_behavior, which the pyarrow documentation describes as follows:
Controls how the dataset will handle data that already exists in the destination. The default behavior is ‘overwrite_or_ignore’.

‘overwrite_or_ignore’ will ignore any existing data and will overwrite files with the same name as an output file. Other existing files will be ignored. This behavior, in combination with a unique basename_template for each write, will allow for an append workflow.

‘error’ will raise an error if any data exists in the destination.

‘delete_matching’ is useful when you are writing a partitioned dataset. The first time each partition directory is encountered the entire directory will be deleted. This allows you to overwrite old partitions completely. This option is only supported for use_legacy_dataset=False.
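To illustrate delete_matching, here is a rough sketch of a follow-up write; box_update.csv is a hypothetical file containing revised rows for dates that were already written. Because each affected Date directory is deleted before the new files land, re-running the export replaces those partitions instead of duplicating their rows, while untouched partitions are left alone.

# hypothetical second load with revised rows for some already-written dates
df_new = pd.read_csv('box_update.csv')
df_new['SETTLEMENTDATE'] = pd.to_datetime(df_new['SETTLEMENTDATE'])
df_new['Date'] = df_new['SETTLEMENTDATE'].dt.date

# delete_matching wipes each Date directory this write touches before writing,
# so the old files for those dates are replaced rather than appended to
pq.write_to_dataset(pa.Table.from_pandas(df_new),
                    root_path='nem.parquet',
                    partition_cols=['Date'],
                    basename_template="part-{i}.parquet",
                    existing_data_behavior='delete_matching')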
Run with pyarrow==10.0.1.