
I'm looking to enforce a specific size limit (4 GB) per file when writing a DataFrame to CSV in PySpark. I have already tried maxPartitionBytes, but it is not working as expected.

Below is what I have used, tested on a 90 GB ORC-formatted Hive table. At the export (write) stage it produces files of arbitrary sizes rather than 4 GB.

Any suggestions on how to split the output files at a size limit while writing? I don't want to use repartition or coalesce here, as the DataFrame goes through a lot of wide transformations.

df.write.format("csv").mode("overwrite").option("maxPartitionBytes", 4 * 1024 * 1024 * 1024).save(outputpath)
Vikas T

1 Answer

According to the Spark documentation, spark.sql.files.maxPartitionBytes applies at read time. If you do some shuffles afterwards, the final task sizes, and therefore the final file sizes on write, may change.
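
A minimal sketch of how that config is normally applied, just to illustrate the point: it is a session-level setting that shapes input splits at read time, not a DataFrameWriter option, so it cannot bound output file sizes (the ORC path and the 256 MB split size below are placeholders):

    # Session-level read config, not a write option; values here are placeholders.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("read-split-size-demo")
        .config("spark.sql.files.maxPartitionBytes", 256 * 1024 * 1024)  # ~256 MB input splits
        .getOrCreate()
    )

    df = spark.read.format("orc").load("/path/to/hive/orc/table")  # placeholder path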

You may try spark.sql.files.maxRecordsPerFile instead, which according to the documentation applies at write time:

spark.sql.files.maxRecordsPerFile Maximum number of records to write out to a single file. If this value is zero or negative, there is no limit.
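
For example, a rough sketch of how this could be aimed at the 4 GB target. The average-row-size estimate below is an assumption; you would need to measure it on a sample of your own data:

    # Cap records per file so that rows-per-file * average row size lands near 4 GB.
    target_file_bytes = 4 * 1024 * 1024 * 1024   # 4 GB target per file
    approx_row_bytes = 1024                      # assumed average CSV row size; measure and adjust
    max_records = target_file_bytes // approx_row_bytes

    (
        df.write
        .format("csv")
        .mode("overwrite")
        .option("maxRecordsPerFile", max_records)  # per-file record cap applied at write time
        .save(outputpath)                          # outputpath as in the question
    )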

If that does not do the trick, I think the other option is, as you mentioned, to repartition the dataset just before the write.
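
In that case, a sketch of the repartition approach. The 90 GB figure comes from the question, and CSV output is usually larger than the ORC source, so treat the partition count as a starting point rather than an exact answer:

    # Estimate how many ~4 GB output files are needed and repartition just before the write.
    target_file_bytes = 4 * 1024 * 1024 * 1024
    estimated_output_bytes = 90 * 1024 * 1024 * 1024  # ORC source size from the question; CSV may be larger
    num_files = max(1, estimated_output_bytes // target_file_bytes)

    (
        df.repartition(int(num_files))
        .write
        .format("csv")
        .mode("overwrite")
        .save(outputpath)
    )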

M_S