
I used pyarrow's parquet module to read the metadata of a Parquet file with this code:

from pyarrow import parquet

p_file = parquet.ParquetFile("v-c000.gz.parquet")

for rg_idx in range(p_file.metadata.num_row_groups):
    rg = p_file.metadata.row_group(rg_idx)
    for col_idx in range(rg.num_columns):
        col = rg.column(col_idx)
        print(col)

The output shows has_dictionary_page: False for every column chunk in all of the row groups.

However, according to my checks, every column chunk in every row group is PLAIN_DICTIONARY encoded. Furthermore, I inspected the dictionary statistics and could see all of its keys and values.

How is it possible that there is no dictionary page?

Olaf Kock
etiel
  • Please [don't upload text as image](https://meta.stackoverflow.com/a/285557/13447). Edit your question to contain all the information in text form - consider to use the editor's formatting options. Also see [ask]. – Olaf Kock Nov 10 '21 at 17:09
  • If you rewrite that parquet file using pyarrow and then read it back do you see the correct value for `has_dictionary_page`? – Pace Nov 11 '21 at 02:43

1 Answer


My best guess is that you are running into PARQUET-1547, which is described a bit more in this question.

In summary, some Parquet writers did not write the `dictionary_page_offset` field correctly. Some Parquet readers have workarounds in place to recognize the invalid write. However, parquet-cpp (which is used by pyarrow) does not have such a workaround in place, so it can report `has_dictionary_page: False` even though a dictionary page exists.
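To see which column chunks show this mismatch, you could compare each chunk's reported encodings against `has_dictionary_page`. This is only a rough sketch: the helper names `suspicious_chunks` and `report` are made up for illustration, and it assumes pyarrow exposes `encodings` as a sequence of strings such as `PLAIN_DICTIONARY` or `RLE_DICTIONARY`.

```python
# Encoding names that imply a dictionary page should be present.
DICTIONARY_ENCODINGS = {"PLAIN_DICTIONARY", "RLE_DICTIONARY"}


def suspicious_chunks(metadata):
    """Return (row_group_index, column_path) pairs whose encodings mention a
    dictionary while has_dictionary_page is False.

    `metadata` is expected to behave like pyarrow's FileMetaData.
    """
    found = []
    for rg_idx in range(metadata.num_row_groups):
        rg = metadata.row_group(rg_idx)
        for col_idx in range(rg.num_columns):
            col = rg.column(col_idx)
            uses_dictionary = bool(DICTIONARY_ENCODINGS & set(col.encodings))
            if uses_dictionary and not col.has_dictionary_page:
                found.append((rg_idx, col.path_in_schema))
    return found


def report(path):
    """Print every column chunk in the file at `path` that looks inconsistent."""
    # Imported here so the helper above can be used without pyarrow installed.
    from pyarrow import parquet

    p_file = parquet.ParquetFile(path)
    for rg_idx, column_path in suspicious_chunks(p_file.metadata):
        print(f"row group {rg_idx}, column {column_path}: "
              "dictionary encoding listed, but no dictionary page reported")
```

Calling `report("v-c000.gz.parquet")` on the file from the question should list the affected chunks if this guess is right. An empty result would suggest the cause lies elsewhere.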

Pace