In my Spark job, I write a compressed parquet file like this:

df
  .repartition(numberOutputFiles)
  .write
  .option("compression","gzip")
  .mode(saveMode)
  .parquet(avroPath)

The resulting files then have this extension: file_name.gz.parquet

How can I get ".parquet.gz" instead?

mazaneicha
Marwan02
  • Can you show us what value `avroPath` has? In my case (on Spark 2.4.5 using `spark-shell`) when I use the same command as you I just get the exact filename I specify. So if I have `orangeJuice` instead of your `avroPath`, I will get `orangeJuice` as file name. If I choose `orangeJuice.parquet.gz`, I get that file name. – Koedlt Dec 26 '22 at 18:23
  • It is something like "/my/path/partition_id=xxxxx" – Marwan02 Dec 27 '22 at 09:49

1 Answer

I don't believe you can. The file extension is hardcoded in ParquetWrite.scala as the concatenation of the codec's extension and ".parquet", in that order:

  :
    override def getFileExtension(context: TaskAttemptContext): String = {
      CodecConfig.from(context).getCodec.getExtension + ".parquet"
    }
  :

So, unless you want to change the source and compile your own Spark version, or open a JIRA request against Spark... ;))
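If neither of those is an option, one possible workaround is to rename the files after the write has finished. The following is only a minimal sketch under some assumptions: the output sits on a local filesystem, the object `ParquetRename` and its methods `swapExtension` and `renameAll` are hypothetical names made up for this illustration, and only the data files are touched. On HDFS or similar, the same idea would presumably go through Hadoop's `FileSystem.rename` instead of `java.io.File`.

```scala
import java.io.File

object ParquetRename {
  // Hypothetical helper: "part-00000.gz.parquet" -> "part-00000.parquet.gz".
  // Names that do not end in ".gz.parquet" are returned unchanged.
  def swapExtension(name: String): String = {
    val suffix = ".gz.parquet"
    if (name.endsWith(suffix)) name.stripSuffix(suffix) + ".parquet.gz"
    else name
  }

  // Renames every "*.gz.parquet" file directly inside `dir`
  // (local filesystem only, assumption) and returns the new file names.
  def renameAll(dir: File): Seq[String] = {
    val files = Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
    files
      .filter(f => f.isFile && f.getName.endsWith(".gz.parquet"))
      .map { f =>
        val target = new File(dir, swapExtension(f.getName))
        if (!f.renameTo(target)) sys.error(s"could not rename ${f.getName}")
        target.getName
      }
  }
}
```

You would call `ParquetRename.renameAll(new File(outputDir))` once the Spark write has completed, where `outputDir` is whatever path was passed to `.parquet(...)`. Keep in mind that the files are still Parquet files with gzip-compressed pages inside, not gzip archives, so a ".parquet.gz" name may confuse tools that try to gunzip them.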

mazaneicha