Managing secrets in AWS EMR PySpark job

Question

I have an EMR PySpark job which needs to access an S3 bucket owned by a third party.

The PySpark job is stored at s3://mybucket/job.py and submitted as a step:

        {
            "Name": "Process promo_regs",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "--master", "yarn", "--deploy-mode", "cluster", "s3://mybucket/job.py"],
            }
        }

In job.py I configure a boto3 S3 client:

from pyspark.sql import SparkSession
import boto3

# How to inject this?
env = {
    'AWS_ACCESS_KEY_ID': '',
    '#AWS_SECRET_ACCESS_KEY': '',
    'AWS_REGION_NAME': ''

}
s3 = boto3.client(
    's3',
    aws_access_key_id=env['AWS_ACCESS_KEY_ID'],
    aws_secret_access_key=env['#AWS_SECRET_ACCESS_KEY'],
    region_name=env['AWS_REGION_NAME'],
)

spark = (SparkSession
         .builder
         .appName("Test processing dummy data")
         .getOrCreate())

What are my options for securely injecting the access keys into the script?

I am starting the cluster and submitting the job using boto3.client('emr').run_job_flow(), if that matters.

Snigdhajyoti · Accepted Answer · 2020-06-08T12:36:56.940

There are two ways I can think of:

Ask the third party to add a policy to their S3 bucket.

Explanation: Your EMR cluster (in Account A) runs with an IAM role, EMR_EC2_ROLE. Ask them to grant access from their account (Account B) to your Account A's EMR_EC2_ROLE. You can find more details here.
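As a rough illustration of what the third party would attach to their bucket, here is a minimal sketch that builds such a bucket policy. The account ID, bucket name, `Sid` and the chosen read-only actions are all placeholder assumptions, and `build_cross_account_bucket_policy` is a hypothetical helper, not part of any AWS library. The role in Account A would most likely also need its own IAM policy allowing the same S3 actions on that bucket.

```python
import json


def build_cross_account_bucket_policy(account_id, role_name, bucket):
    """Build a read-only bucket policy that trusts a role in another account.

    All arguments are placeholders; the third party would fill in real values.
    """
    role_arn = f"arn:aws:iam::{account_id}:role/{role_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowEmrRoleRead",  # arbitrary label, assumption
                "Effect": "Allow",
                "Principal": {"AWS": role_arn},
                # Read-only access; adjust to what the job really needs.
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",      # needed for ListBucket
                    f"arn:aws:s3:::{bucket}/*",    # needed for GetObject
                ],
            }
        ],
    }


# Placeholder account ID and bucket name, for illustration only.
policy = build_cross_account_bucket_policy(
    "111122223333", "EMR_EC2_ROLE", "third-party-bucket"
)
print(json.dumps(policy, indent=4))
```

With this in place, job.py would not need any access keys: a plain `boto3.client('s3')` on the cluster picks up the role's credentials from the instance profile.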

If that's not possible, you could use AWS Secrets Manager. Store the access keys as a secret, grant EMR_EC2_ROLE permission to read it with a policy like the one below, and use boto3 to fetch the secret at runtime.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "secretsmanager:GetSecretValue"
            ],
            "Resource": "arn:aws:secretsmanager:us-east-1:<account-no>:secret:<Secret prefix if you have any>*",
            "Effect": "Allow",
            "Sid": "VisualEditor0"
        }
    ]
}
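To show how job.py could fetch the keys at runtime, here is a minimal sketch. It assumes the secret is stored as a JSON string with the keys `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`; the secret name `third-party/s3-credentials`, the region, and both helper functions are hypothetical names chosen for illustration.

```python
import json


def fetch_third_party_credentials(secret_id, region_name, sm_client=None):
    """Return the secret's JSON payload as a dict.

    sm_client can be injected (e.g. a stub in tests); otherwise a real
    Secrets Manager client is created using the cluster's role credentials.
    """
    if sm_client is None:
        import boto3  # imported lazily so the helper works with a stub client

        sm_client = boto3.client("secretsmanager", region_name=region_name)
    response = sm_client.get_secret_value(SecretId=secret_id)
    # Assumes the secret was stored as a JSON string, not as binary.
    return json.loads(response["SecretString"])


def build_s3_client(creds, region_name):
    """Create an S3 client that uses the third party's access keys."""
    import boto3

    return boto3.client(
        "s3",
        aws_access_key_id=creds["AWS_ACCESS_KEY_ID"],
        aws_secret_access_key=creds["AWS_SECRET_ACCESS_KEY"],
        region_name=region_name,
    )


# Usage inside job.py (secret name and region are placeholders):
# creds = fetch_third_party_credentials("third-party/s3-credentials", "us-east-1")
# s3 = build_s3_client(creds, "us-east-1")
```

This way only the secret's name appears in the script or the step arguments, while the keys themselves stay in Secrets Manager.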

Thanks, both do make sense! We will see which is acceptable for the 3rd party. For reference here is the link to the [boto3 secrets api](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/secretsmanager.html#SecretsManager.Client.get_secret_value) — redacted, Jun 08 '20 at 11:14
