I am learning about the internal representation of Parquet files, so I went through Apache Parquet's GitHub page, Google's Dremel paper (to understand definition and repetition levels), and Twitter's blog post to learn more about the Parquet file format.
To relate what I learned from my reading to an actual Parquet file's representation, I ran the `parquet-tools` command with the `meta` option on a sample Parquet file. It printed details in 3 major sections: Header, File schema, and Row_groups. I understood the details presented in the first 2 sections, but I couldn't completely understand all the details in the row group section.
Below are the questions that I have.
- What exactly are `DO`, `FPO`, and `VC` (this looks like the count of all the rows in the current row group)? The expansion of each abbreviation can be found on the parquet-tools GitHub page, but I wanted more details about them. I understand what `SZ` and `ST` are.
- Next to `ENC` I see a list of encoding schemes such as `BIT_PACKED`, `PLAIN`, and `RLE`. I understand what each one means individually, but I do not understand why at least 3 encoding schemes are always used.
- Next to the record count `RC` and total size `TS` of the row group, I see `OFFSET`. For the first page it is always 4. How is it calculated?
- I learned that a Parquet file's header and footer contain the 4-byte magic code "PAR1". Does it have any special meaning, or is it just some arbitrary text used to decide whether a file is Parquet (without depending on the file extension)?
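To make the last question concrete, here is a minimal sketch of the kind of check I have in mind. It assumes only what I read, namely that the same 4 bytes "PAR1" appear at the very start and the very end of the file. The function name `has_parquet_magic` is my own and is not part of any Parquet library.

```python
import os

# Assumption from my reading: the same 4-byte marker sits at both ends of the file.
MAGIC = b"PAR1"


def has_parquet_magic(path):
    """Return True if the file both starts and ends with the 4-byte magic."""
    # A file shorter than two markers cannot carry both of them.
    if os.path.getsize(path) < 2 * len(MAGIC):
        return False
    with open(path, "rb") as f:
        head = f.read(len(MAGIC))         # bytes 0..3 of the file
        f.seek(-len(MAGIC), os.SEEK_END)  # jump to the last 4 bytes
        tail = f.read(len(MAGIC))
    return head == MAGIC and tail == MAGIC
```

For example, `has_parquet_magic("sample.parquet")` (a hypothetical file name) would tell me whether the file looks like Parquet regardless of its extension. Since the leading marker occupies bytes 0 to 3, the first byte after it is at position 4, which may or may not be related to the `OFFSET` value I asked about above.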
Unfortunately, I couldn't attach a snippet of the `parquet-tools meta` command's output due to security constraints, but I hope it is not too hard to visualize what I mean in each of the questions.