Extension of compressed parquet file in Spark

Question

In my Spark job, I write a compressed parquet file like this:

df
  .repartition(numberOutputFiles)
  .write
  .option("compression","gzip")
  .mode(saveMode)
  .parquet(avroPath)

The resulting files then have this extension: file_name.gz.parquet

How can I get ".parquet.gz" instead?

Can you show us what value `avroPath` has? In my case (on Spark 2.4.5 using `spark-shell`) when I use the same command as you I just get the exact filename I specify. So if I have `orangeJuice` instead of your `avroPath`, I will get `orangeJuice` as file name. If I choose `orangeJuice.parquet.gz`, I get that file name. — Koedlt, Dec 26 '22 at 18:23

mazaneicha · Accepted Answer · 2022-12-27T21:30:37.950

I don't believe you can. The file extension is hard-coded in ParquetWrite.scala as the concatenation of the codec's extension and ".parquet", in that order:

  :
    override def getFileExtension(context: TaskAttemptContext): String = {
      CodecConfig.from(context).getCodec.getExtension + ".parquet"
    }
  :

So, unless you want to change the source and compile your own Spark version, or open a JIRA request against Spark... ;))
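If the only goal is the final file name, one workaround is to rename the part files after the write has finished. Below is a minimal sketch of that idea, not something Spark provides: `ParquetRename`, `swapExtension` and `renameAll` are hypothetical names, and it assumes the output directory is on the local file system. On HDFS or an object store the same loop would go through Hadoop's `FileSystem.rename` instead.

```scala
import java.io.File
import java.nio.file.{Files, Path}

// Hypothetical helper, not part of Spark: renames part files after the job has written them.
object ParquetRename {
  private val OldSuffix = ".gz.parquet"
  private val NewSuffix = ".parquet.gz"

  // "part-00000.gz.parquet" becomes "part-00000.parquet.gz".
  // Names without the old suffix (e.g. "_SUCCESS", ".crc" files) are returned unchanged.
  def swapExtension(name: String): String =
    if (name.endsWith(OldSuffix)) name.stripSuffix(OldSuffix) + NewSuffix
    else name

  // Renames every matching file directly under `dir` (assumption: local file system).
  // Returns the new paths of the files that were actually renamed.
  def renameAll(dir: Path): Seq[Path] = {
    // listFiles() returns null if `dir` is not a readable directory.
    val files = Option(dir.toFile.listFiles()).getOrElse(Array.empty[File])
    files.toSeq.filter(_.isFile).flatMap { f =>
      val newName = swapExtension(f.getName)
      if (newName != f.getName) {
        val source = f.toPath
        Some(Files.move(source, source.resolveSibling(newName)))
      } else None
    }
  }
}
```

You would call `ParquetRename.renameAll(java.nio.file.Paths.get(avroPath))` once the `.parquet(avroPath)` call has returned. Keep in mind that the gzip codec compresses the pages inside the Parquet file, so a file named `.parquet.gz` is still not a gzip archive and `gunzip` will not open it.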
