Parquet meta data of "has_dictionary_page" is false but column has "PLAIN_DICTIONARY" encoding

Question

I used pyarrow's parquet module to read the metadata of a Parquet file with this code:

from pyarrow import parquet

p_file = parquet.ParquetFile("v-c000.gz.parquet")

for rg_idx in range(p_file.metadata.num_row_groups):
    rg = p_file.metadata.row_group(rg_idx)
    for col_idx in range(rg.num_columns):
        col = rg.column(col_idx)
        print(col)

and got this in the output: has_dictionary_page: False (for every row group).

But according to my checks, all the column chunks in every row group are PLAIN_DICTIONARY encoded. Furthermore, I inspected the dictionary statistics and could see all of its keys and values (part of that output was attached as an image).

How is it possible that there is no dictionary page?

Please [don't upload text as image](https://meta.stackoverflow.com/a/285557/13447). Edit your question to contain all the information in text form - consider to use the editor's formatting options. Also see [ask]. — Olaf Kock, Nov 10 '21 at 17:09
If you rewrite that parquet file using pyarrow and then read it back do you see the correct value for `has_dictionary_page`? — Pace, Nov 11 '21 at 02:43

score 3 · Answer 1 · answered Nov 11 '21 at 02:41

My best guess is that you are running into PARQUET-1547, which is described a bit more in this question.

In summary, some Parquet writers did not write the dictionary_page_offset field correctly. Some Parquet readers have workarounds in place to recognize the invalid write. However, parquet-cpp (which is used by pyarrow) does not have such a workaround in place.
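To see which column chunks are affected, one option is to compare the encodings reported for each chunk against has_dictionary_page. The following is only a rough sketch: the helper name find_suspect_chunks is made up for illustration, and it assumes that a chunk listing PLAIN_DICTIONARY or RLE_DICTIONARY among its encodings should normally have a dictionary page.

```python
# Encodings that normally imply a dictionary page exists for the chunk.
DICTIONARY_ENCODINGS = {"PLAIN_DICTIONARY", "RLE_DICTIONARY"}


def find_suspect_chunks(metadata):
    """Return (row_group_index, column_path) pairs for column chunks that
    report a dictionary encoding while has_dictionary_page is False.

    `metadata` is expected to behave like pyarrow's FileMetaData
    (i.e. ParquetFile(...).metadata). The helper itself is hypothetical.
    """
    suspects = []
    for rg_idx in range(metadata.num_row_groups):
        rg = metadata.row_group(rg_idx)
        for col_idx in range(rg.num_columns):
            col = rg.column(col_idx)
            uses_dictionary = bool(DICTIONARY_ENCODINGS & set(col.encodings))
            if uses_dictionary and not col.has_dictionary_page:
                suspects.append((rg_idx, col.path_in_schema))
    return suspects


# Usage (assumes pyarrow is installed and the file from the question exists):
#   from pyarrow import parquet
#   p_file = parquet.ParquetFile("v-c000.gz.parquet")
#   print(find_suspect_chunks(p_file.metadata))
```

If every chunk in the file shows up in the result, that would be consistent with the file having been written without a valid dictionary_page_offset, though it does not prove that PARQUET-1547 is the cause.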
