
I am new to Apache Beam, and I come from the Spark world, where the API is so rich.

How can I get the schema of a Parquet file using Apache Beam, without loading the data into memory? The data can be huge, and I am only interested in knowing the columns, and optionally the column types.

The language is Python.

The storage system is Google Cloud Storage, and the Apache Beam job must run on Dataflow.

FYI, I have tried the following, as suggested in another Stack Overflow post:

from pyarrow.parquet import ParquetFile
ParquetFile(source).metadata

First, it didn't work when I gave it a gs://.. path, giving me this error: error: No such file or directory

Then I tried it with a local file on my machine, and I slightly changed the code to:

from pyarrow.parquet import ParquetFile
ParquetFile(source).metadata.schema

And so I could get the columns:

<pyarrow._parquet.ParquetSchema object at 0x10927cfd0>
name: BYTE_ARRAY
age: INT64
hobbies: BYTE_ARRAY String
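
As an aside, once a ParquetFile object is available, the column names and types can also be collected as plain Python lists instead of just printing the schema; a minimal sketch, assuming pyarrow exposes the Arrow schema through the schema_arrow attribute:

from pyarrow.parquet import ParquetFile

pf = ParquetFile(source)         # source is a local path or a file-like object
arrow_schema = pf.schema_arrow   # Arrow view of the schema; only the footer is read
print(arrow_schema.names)        # column names, e.g. ['name', 'age', 'hobbies']
print(arrow_schema.types)        # corresponding Arrow types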

But this solution, as it seems to me, requires getting the file locally (onto the Dataflow worker??), and it doesn't use Apache Beam.

Any (better) solution?

Thank you!

Farah
  • Well, can you use gsutil with the same path to prove that file exists? – OneCricketeer Feb 12 '20 at 14:56
  • Yes, running "gsutil ls", the file exists. Moreover, in the Python script, if I test using beam.io.ReadFromParquet(path, columns), the file is read successfully. Thank you. – Farah Feb 12 '20 at 15:25
  • Cool. Feel free to answer your own question below – OneCricketeer Feb 12 '20 at 15:31
  • But that's not what I need. The method ReadFromParquet (or is it a PTransform?) reads the content of the file and takes the array of column names to select as an argument, so the column names are not in the return value. I need the schema to be returned, containing all the existing column names. Thank you. – Farah Feb 12 '20 at 15:37
  • Personally, I would just use the GCS Python SDK, download the file, then use those pyarrow functions (see the sketch after these comments). I'm not sure why you'd really need the schema in Beam when it's more intended to read the entire file for processing – OneCricketeer Feb 12 '20 at 16:04
  • Hey, I could make a quick-and-dirty solution, so I answered my own question. Thank you a lot! – Farah Feb 12 '20 at 16:07
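
For reference, a minimal sketch of the approach suggested in the comments above: download the object with the google-cloud-storage client and read only the footer with pyarrow. The bucket and object names are placeholders, and download_as_bytes assumes a reasonably recent google-cloud-storage version:

from io import BytesIO
from google.cloud import storage
from pyarrow.parquet import ParquetFile

client = storage.Client()
blob = client.bucket("<bucket_name>").blob("<path/to/file.parquet>")
buf = BytesIO(blob.download_as_bytes())  # pulls the whole object into memory

pf = ParquetFile(buf)
print(pf.metadata.schema)  # column names and their Parquet types

The drawback is that the whole object is downloaded before pyarrow looks at the footer, which is what I wanted to avoid for huge files.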

1 Answer


I'm happy I could come up with a hand-made solution after reading the source code of apache_beam.io.parquetio:

import pyarrow.parquet as pq
from apache_beam.io.parquetio import _ParquetSource
import os

os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '<json_key_path>'

# Arguments: file_pattern, min_bundle_size, validate, columns
ps = _ParquetSource("", None, None, None)

# open_file gives back a file-like object for the GCS path;
# pyarrow then only reads the Parquet footer to build the metadata
with ps.open_file("<GCS_path_of_parquet_file>") as f:
    pf = pq.ParquetFile(f)
    print(pf.metadata.schema)
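
A variation of the same idea, as a sketch: it avoids the private _ParquetSource class, assuming Beam's public FileSystems API can open the gs:// path (it should, when apache-beam[gcp] is installed and the same credentials setup as above is in place):

import pyarrow.parquet as pq
from apache_beam.io.filesystems import FileSystems

# FileSystems.open returns a file-like object for the gs:// path
with FileSystems.open("<GCS_path_of_parquet_file>") as f:
    pf = pq.ParquetFile(f)
    print(pf.metadata.schema)

Either way, pyarrow only reads the Parquet footer through the file handle, so the data itself is not loaded into memory.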
Farah