I am trying to load a dataset using the Spark SQL API in Java. My folder structure in S3 is
s3://my-bucket-name/prefix1/date=yyyy-mm-dd/
So I have two date folders in S3, like below:
s3://my-bucket-name/prefix1/date=2020-06-15/
s3://my-bucket-name/prefix1/date=2020-06-16/
I load my dataset using the following code snippet:
public static Dataset<MyPOJOClass> getORCRecordDataset(SQLContext sqlContext, String setOfPath) {
    // Read the ORC files under the given path(s) into an untyped DataFrame
    Dataset<Row> rows = sqlContext.read().option("inferSchema", true).orc(setOfPath);
    // Map each Row onto the POJO via a bean encoder to get a typed Dataset
    Encoder<MyPOJOClass> myORCRecordEncoder = Encoders.bean(MyPOJOClass.class);
    Dataset<MyPOJOClass> myORCRecordDataset = rows.as(myORCRecordEncoder);
    log.error("count of records in myORCRecordDataset is = {} ", myORCRecordDataset.count());
    return myORCRecordDataset;
}
When I pass the setOfPath variable as
s3://my-bucket-name/prefix1/{date=2020-06-15/,date=2020-06-16/}
the above code snippet loads the dataset correctly.
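For clarity, this is a minimal sketch of how I invoke the method in the working case (the sqlContext variable is assumed to be an already-initialized SQLContext from my driver; MyPOJOClass is the bean shown above):

// Working case: both partition directories enumerated explicitly in a brace list
String setOfPath = "s3://my-bucket-name/prefix1/{date=2020-06-15/,date=2020-06-16/}";
Dataset<MyPOJOClass> records = getORCRecordDataset(sqlContext, setOfPath);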
Now, I was trying to use a glob pattern as explained here: blob-pattern, and I passed the setOfPath variable as
s3://my-bucket-name/prefix1/{date=2020-06-[15-16]}/
This did not work and threw this exception:
User class threw exception: org.apache.spark.sql.AnalysisException: Path does not exist: s3://my-bucket-name/prefix1/{date=2020-06-[15-16]};
org.apache.spark.sql.AnalysisException: Path does not exist: s3://my-bucket-name/prefix1/{date=2020-06-[15-16]};
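As far as I understand, Spark resolves these path patterns through Hadoop's FileSystem glob API, so I also put together the small check below to see which paths a given pattern actually matches. This is only a debugging sketch: the fresh Configuration is illustrative (in my job it comes from sparkContext.hadoopConfiguration()), and it assumes the S3 connector is configured on the classpath.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class GlobCheck {
    public static void main(String[] args) throws Exception {
        // In the real job this Configuration comes from sparkContext.hadoopConfiguration()
        Configuration hadoopConf = new Configuration();
        FileSystem fs = FileSystem.get(URI.create("s3://my-bucket-name/"), hadoopConf);
        // Resolve the glob the same way Spark's file-based data sources do
        FileStatus[] matches = fs.globStatus(
                new Path("s3://my-bucket-name/prefix1/{date=2020-06-[15-16]}/"));
        if (matches != null) {
            for (FileStatus status : matches) {
                System.out.println(status.getPath());
            }
        }
    }
}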
Can anyone point out what I am doing wrong here?