
I am trying to load a dataset using the Spark SQL API in Java. My folder structure in S3 is:

s3://my-bucket-name/prefix1/date=yyyy-mm-dd/

So I have two date folders in S3, like below:

s3://my-bucket-name/prefix1/date=2020-06-15/
s3://my-bucket-name/prefix1/date=2020-06-16/

I load my dataset using the following code snippet:

    public static Dataset<MyPOJOClass> getORCRecordDataset(SQLContext sqlContext, String setOfPath) {
        Dataset<Row> rows = sqlContext.read().option("inferSchema", true).orc(setOfPath);
        Encoder<MyPOJOClass> myORCRecordEncoder = Encoders.bean(MyPOJOClass.class);
        Dataset<MyPOJOClass> myORCRecordDataset = rows.as(myORCRecordEncoder);
        log.error("count of records in myORCRecordDataset is = {} ", myORCRecordDataset.count());
        return myORCRecordDataset;
    }

When I pass the setOfPath variable as

s3://my-bucket-name/prefix1/{date=2020-06-15/,date=2020-06-16/}

the above code snippet gives me the correct dataset loaded.
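For context, this is how I build that working comma-separated brace pattern from a list of dates (buildSetOfPath is a hypothetical helper name, not part of my actual code; it only does string concatenation, no Spark involved):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class PathGlobBuilder {

    // Builds the brace-alternation pattern accepted by Spark/Hadoop path
    // globbing, e.g. s3://bucket/prefix1/{date=2020-06-15/,date=2020-06-16/}
    static String buildSetOfPath(String basePath, List<String> dates) {
        String alternatives = dates.stream()
                .map(d -> "date=" + d + "/")
                .collect(Collectors.joining(","));
        return basePath + "{" + alternatives + "}";
    }

    public static void main(String[] args) {
        String setOfPath = buildSetOfPath("s3://my-bucket-name/prefix1/",
                Arrays.asList("2020-06-15", "2020-06-16"));
        // Prints: s3://my-bucket-name/prefix1/{date=2020-06-15/,date=2020-06-16/}
        System.out.println(setOfPath);
    }
}
```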

Now, I was trying to use a glob pattern as explained here: glob-pattern, and I passed the setOfPath variable as

s3://my-bucket-name/prefix1/{date=2020-06-[15-16]}/

This did not work and threw this exception:

User class threw exception: org.apache.spark.sql.AnalysisException: Path does not exist: s3://my-bucket-name/prefix1/{date=2020-06-[15-16]};
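To show what I understand about bracket expressions so far: in glob syntax, `[...]` is a character class that matches exactly one character, not a numeric range of multi-digit values. I sketched this with `java.nio.file.PathMatcher` (only to illustrate the matching semantics, which I believe are similar to Hadoop's globbing; this is not Spark itself):

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobDemo {
    public static void main(String[] args) {
        // [56] is a character class: it matches exactly ONE character,
        // either '5' or '6', so "1[56]" covers the days 15 and 16.
        PathMatcher matcher = FileSystems.getDefault()
                .getPathMatcher("glob:date=2020-06-1[56]");

        System.out.println(matcher.matches(Paths.get("date=2020-06-15"))); // true
        System.out.println(matcher.matches(Paths.get("date=2020-06-16"))); // true
        System.out.println(matcher.matches(Paths.get("date=2020-06-17"))); // false
    }
}
```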

Can anyone guide me on what I am doing wrong here?

Ajay Kr Choudhary
