5

I am trying to query data from json data in s3-select.

   {
    person = [
    {
            "Id": 1,
            "Name": "Anshu",
            "Address": "Templestowe",
            "Car": "Jeep"
    }
    {
            "Id": 2,
            "Name": "Ben Mostafa",
            "Address": "Las Vegas",
            "Car": "Mustang"
    }
    {
                    "Id": 3,
                    "Name": "Rohan Wood",
                    "Address": "Wooddon",
                    "Car": "VW"
    }
]
}

QUERY = "select * from S3Object s"
QUERY = "select s.person from S3Object s"
QUERY = "select s.person[0] from S3Object s"
QUERY = "select s.person[0].Name from S3Object s"

All these queries work fine and returns the respective object as desired but when i am trying to search data on name/Car, it doesn't work.

QUERY = "select * from S3Object s where s.person.Name = \"Anshu\" "

error: com.amazonaws.services.s3.model.AmazonS3Exception: The column index at line 1, column 32 is invalid.

There is not much related content available on s3-select online. Wondering whether we can query on field name or not! There are no examples of select query with where clause for s3-select given in the documentation

John Rotenstein
  • 241,921
  • 22
  • 380
  • 470
yogas
  • 189
  • 1
  • 3
  • 17
  • https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-select.html seems to indicate that what you are doing should work. – Michael - sqlbot May 14 '18 at 10:47

3 Answers3

12

I can't find this in any of AWS documentations, but I was just playing around and discovered a working syntax:

QUERY = "select * from S3Object s where 'Anshu' in s.person[*].Name"

Based on some deductions:

  1. I know that syntax like WHERE ('blah' in s.tags) work when tags property is an array of string.
  2. AWS documentation also say that s.person[#] should work when # is a valid index/digit. Based on this, I discovered that using star (*) between square bracket, as in s.person[*].Name, also work. This is after failed testing of various syntax like s.Person[], s.Person[#], s.Person[?], etc...

Proof with Python and Boto3:

import boto3

S3_BUCKET = 'your-bucket-name'

s3 = boto3.client('s3')

r = s3.select_object_content(
        Bucket=S3_BUCKET,
        Key='your-file-name.json',
        ExpressionType='SQL',
        Expression="select * from s3object s where 'Anshu' in s.person[*].Name",
        InputSerialization={'JSON': {"Type": "Lines"}},
        OutputSerialization={'JSON': {}}
)

for event in r['Payload']:
    if 'Records' in event:
        records = event['Records']['Payload'].decode('utf-8')
        print(records)

Weird, I know. Remember to set [default] credentials in ~/.aws/credentials file.

Noogen
  • 1,652
  • 12
  • 13
  • FYI, looks like the AWS documentation was recently updated identifying star ( * ) ( * ) as a wildcard character and that's why the syntax above works. Now you can use the syntax without worrying that Amazon might take it away because it's some kind of a fluke. – Noogen Sep 07 '18 at 22:16
  • Saved my day!! For those using **SageMaker** I filtered images by tags running `SELECT * FROM S3Object s WHERE 1 IN s."test-machine-learning-image-tagging".annotations[*].class_id`. Where `test-machine-learning-image-tagging` is my first level manifest key. – iGian Nov 28 '19 at 07:46
5

After reading the AWS doc, I find following SQL works fine.

select * from S3Object[*].person[*] as p where p.Name='Anshu'

This SQL will give you all the person whose name is 'Anshu', like:

{
    "Id": 1,
    "Name": "Anshu",
    "Address": "Templestowe",
    "Car": "Jeep"
}

When you see [*], it means a json array.

Amazon S3 Select always treats a JSON document as an array of root-level values, so we use S3Object[*] in the SQL. And person value is a array, so we use person[*] in the SQL.

loic
  • 107
  • 1
  • 5
1

You can't do it that way. You need to "flatten" your JSON somewhat so it looks like this:

{
person: {
        "Id": 1,
        "Name": "Anshu",
        "Address": "Templestowe",
        "Car": "Jeep"
    }
}
{ 
person: {
        "Id": 2,
        "Name": "Ben Mostafa",
        "Address": "Las Vegas",
        "Car": "Mustang"
    }
}
{   
person:{
        "Id": 3,
        "Name": "Rohan Wood",
        "Address": "Wooddon",
        "Car": "VW"
    }
}

The query below will work as expcted then

select * from s3object s where s.person.name= 'Anshu'

Chris
  • 5,109
  • 3
  • 19
  • 40
user2699504
  • 195
  • 1
  • 4
  • 18
  • For your example file format to work, I find the syntax need to be (s.person["Name"] = 'Anshu') – Noogen Jun 22 '18 at 01:43