I'm looking to enforce a specific size limit (4 GB) per output file when writing a DataFrame to CSV in PySpark. I have already tried using maxPartitionBytes, but it is not working as expected.
Below is what I have used, tested on a 90 GB ORC-formatted Hive table. At the export (write) step it produces files of essentially random sizes rather than 4 GB.
Any suggestions on how to split the output into files of a limited size while writing? I don't want to use repartition or coalesce here, as the DataFrame goes through a lot of wide transformations.
df.write.format("csv").mode("overwrite").option("maxPartitionBytes", 4*1024*1024(1024).save(outputpath)