diff --git a/README.md b/README.md
index 14466f4..1a152ff 100644
--- a/README.md
+++ b/README.md
@@ -25,7 +25,7 @@ First, let's install the `emr` command.
 python3 -m pip install -U emr-cli
 ```
 
-> **Note** This tutorial assumes you have already [setup EMR Serverless](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/setting-up.html) and have an EMR Serverless application, job role, and S3 bucket you can use. You can also use the `emr bootstrap` command.
+> **Note** This tutorial assumes you have already [set up EMR Serverless](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/setting-up.html) and have an EMR Serverless application, job role, and S3 bucket you can use. If not, you can use the `emr bootstrap` command.
 
 1. Create a sample project
 
@@ -56,30 +56,109 @@ emr run \
     --application-id ${APPLICATION_ID} \
     --job-role ${JOB_ROLE_ARN} \
     --s3-code-uri s3://${S3_BUCKET}/tmp/emr-cli-demo/ \
+    --s3-logs-uri s3://${S3_BUCKET}/logs/emr-cli-demo/ \
     --build \
-    --wait
+    --show-stdout
 ```
 
 This command performs the following actions:
 
-- Packages your project dependencies into a python virtual environment
+- Packages your project dependencies into a Python virtual environment
 - Uploads the Spark entrypoint and packaged dependencies to S3
 - Starts an EMR Serverless job
-- Waits for the job to run to a successful completion!
+- Waits for the job to run to completion and shows the `stdout` of the Spark driver when finished!
 
 And you're done. Feel free to modify the project to experiment with different things. You can simply re-run the command above to re-package and re-deploy your job.
 
-## pyspark code
+## EMR CLI Sub-commands
+
+The EMR CLI has several subcommands that you can see by running `emr --help`:
+
+```
+Commands:
+  bootstrap  Bootstrap an EMR Serverless environment.
+  deploy     Copy a local project to S3.
+  init       Initialize a local PySpark project.
+  package    Package a project and dependencies into dist/
+  run        Run a project on EMR, optionally build and deploy
+  status
+```
+
+### bootstrap
+
+`emr bootstrap` allows you to create a sample EMR Serverless or EMR on EC2 environment for testing. It assumes you have admin access and creates various resources for you using AWS APIs.
+
+#### EMR Serverless
+
+To bootstrap an EMR Serverless environment, use the following command:
+
+```shell
+emr bootstrap \
+    --target emr-serverless \
+    --code-bucket <your-code-bucket> \
+    --job-role-name <your-job-role-name>
+```
+
+When you do this, the CLI creates a new EMR CLI config file at `.emr/config.yaml` that sets default locations for your `emr run` command.
+
+### init
+
+The `init` command creates a new `pyproject.toml` or `poetry` project for you with a sample PySpark application.
+
+`init` is required for those project types as it also initializes a `Dockerfile` used to package your dependencies. Single-file PySpark jobs and simple Python modules do not require the `init` command.
+
+### package
+
+The `package` command bundles your PySpark code and dependencies in preparation for deployment. Typically you'll either use `package` and `deploy` together to upload new artifacts to S3, or use the `--build` flag with `emr run` to handle both tasks for you.
+
+The EMR CLI automatically detects what type of project you have and builds the necessary dependency packages.
+
+### deploy
 
-In many organizations, PySpark is the primary language for writing Spark jobs. But Python projects can be structured in a variety of ways – a single `.py` file, `requirements.txt`, `setup.py` files, or even `poetry` configurations. EMR CLI aims to bundle your PySpark code the same way regardless of which system you use.
+The `deploy` command copies the project dependencies from the `dist/` folder to your specified S3 location.
 
-## Spark scala code (coming)
+### run
 
-While Spark Scala or Java code will be more standard from a packaging perspective, it's still useful to able to easily deploy and run your jobs across multiple EMR environments.
+The `run` command packages, deploys, and runs your PySpark code on EMR on EC2, EMR on EKS, or EMR Serverless.
 
-## Spark SQL (coming)
+You must provide one of `--cluster-id`, `--virtual-cluster-id`, or `--application-id` to specify which environment to run your code on.
 
-Want to just write some `.sql` files and have those deployed? No problem.
+`emr run --help` shows all the available options:
+
+```
+Usage: emr run [OPTIONS]
+
+  Run a project on EMR, optionally build and deploy
+
+Options:
+  --application-id TEXT        EMR Serverless Application ID
+  --cluster-id TEXT            EMR on EC2 Cluster ID
+  --virtual-cluster-id TEXT    EMR on EKS Virtual Cluster ID
+  --entry-point FILE           Python or Jar file for the main entrypoint
+  --job-role TEXT              IAM Role ARN to use for the job execution
+  --wait                       Wait for job to finish
+  --s3-code-uri TEXT           Where to copy/run code artifacts to/from
+  --s3-logs-uri TEXT           Where to send EMR Serverless logs to
+  --job-name TEXT              The name of the job
+  --job-args TEXT              Comma-delimited string of arguments to be
+                               passed to Spark job
+
+  --spark-submit-opts TEXT     String of spark-submit options
+  --build                      Package and deploy job artifacts
+  --show-stdout                Show the stdout of the job after it's finished
+  --save-config                Update the config file with the provided
+                               options
+
+  --emr-eks-release-label TEXT  EMR on EKS release label (emr-6.15.0) -
+                                defaults to latest release
+```
+
+## Supported PySpark configurations
+
+- Single-file project - Projects that have a single `.py` entrypoint file.
+- Multi-file project - A more typical PySpark project, but without dependencies, that has multiple Python files or modules.
+- Python module - A project with dependencies defined in a `pyproject.toml` file.
+- Poetry project - A project using [Poetry](https://python-poetry.org/) for dependency management.
 
 ## Sample Commands
 
@@ -125,6 +204,17 @@ emr run --entry-point main.py \
     --wait
 ```
 
+- Re-run an already deployed job and show the `stdout` of the driver.
+
+```bash
+emr run --entry-point main.py \
+    --s3-code-uri s3://<bucket>/code/ \
+    --s3-logs-uri s3://<bucket>/logs/ \
+    --application-id <application-id> \
+    --job-role <job-role-arn> \
+    --show-stdout
+```
+
 > **Note**: If the job fails, the command will exit with an error code.
 
 - Re-run your jobs with 7 characters.
 
@@ -146,18 +236,27 @@ emr run --entry-point main.py \
 
 🥳
 
-In the future, you'll also be able to do the following:
+- Run the same job against an EMR on EC2 cluster
 
-- Utilize the same code against an EMR on EC2 cluster
-
 ```bash
-emr run --cluster-id j-8675309
+emr run --entry-point main.py \
+    --s3-code-uri s3://<bucket>/code/ \
+    --s3-logs-uri s3://<bucket>/logs/ \
+    --cluster-id <cluster-id> \
+    --show-stdout
 ```
 
 - Or an EMR on EKS virtual cluster.
 
 ```bash
-emr run --virtual-cluster-id 654abacdefgh1uziuyackhrs1
+emr run --entry-point main.py \
+    --s3-code-uri s3://<bucket>/code/ \
+    --s3-logs-uri s3://<bucket>/logs/ \
+    --virtual-cluster-id <virtual-cluster-id> \
+    --job-role <job-role-arn> \
+    --show-stdout
 ```
 
 ## Security
diff --git a/pyproject.toml b/pyproject.toml
index 9e2c7ba..3736b1e 100644
--- a/pyproject.toml
+++ b/pyproject.toml
@@ -1,7 +1,7 @@
 [tool.poetry]
 name = "emr-cli"
 version = "0.0.16"
-description = "A command-line interface for packaging, deploying, and running your EMR Serverless Spark jobs."
+description = "A command-line interface for packaging, deploying, and running your PySpark jobs on EMR."
 authors = ["Amazon EMR "]
 license = "Apache-2.0"
 readme = "README.md"
diff --git a/src/emr_cli/deployments/emr_ec2.py b/src/emr_cli/deployments/emr_ec2.py
index 61c3a58..163afb3 100644
--- a/src/emr_cli/deployments/emr_ec2.py
+++ b/src/emr_cli/deployments/emr_ec2.py
@@ -101,6 +101,7 @@ def _default_s3_bucket_policy(self, bucket_name) -> str:
                     "Sid": "RequireSecureTransport",
                     "Effect": "Deny",
                     "Principal": "*",
+                    "Action": "s3:*",
                     "Resource": [f"arn:aws:s3:::{bucket_name}/*", f"arn:aws:s3:::{bucket_name}"],
                     "Condition": {
                         "Bool": {"aws:SecureTransport": "false", "aws:SourceArn": f"arn:aws:s3:::{bucket_name} "}
diff --git a/src/emr_cli/deployments/emr_serverless.py b/src/emr_cli/deployments/emr_serverless.py
index 4b14c0b..56fd1c3 100644
--- a/src/emr_cli/deployments/emr_serverless.py
+++ b/src/emr_cli/deployments/emr_serverless.py
@@ -49,6 +49,7 @@ def _zip_local_pyfiles(self):
 
 
 class Bootstrap:
+    # Maybe add some UUIDs to these?
     DEFAULT_S3_POLICY_NAME = "emr-cli-S3Access"
     DEFAULT_GLUE_POLICY_NAME = "emr-cli-GlueAccess"
 
@@ -74,7 +75,8 @@ def create_environment(self):
     def print_destroy_commands(self, application_id: str):
         # fmt: off
         for bucket in set([self.log_bucket, self.code_bucket]):
-            print(f"# aws s3 rm s3://{bucket} --force")
+            print(f"aws s3 rm s3://{bucket} --recursive")
+            print(f"aws s3api delete-bucket --bucket {bucket}")
         for policy in self.iam_client.list_attached_role_policies(RoleName=self.job_role_name).get('AttachedPolicies'):  # noqa E501
             arn = policy.get('PolicyArn')
             print(f"aws iam detach-role-policy --role-name {self.job_role_name} --policy-arn {arn}")  # noqa E501
@@ -107,6 +109,7 @@ def _default_s3_bucket_policy(self, bucket_name) -> str:
                     "Sid": "RequireSecureTransport",
                     "Effect": "Deny",
                     "Principal": "*",
+                    "Action": "s3:*",
                     "Resource": [f"arn:aws:s3:::{bucket_name}/*", f"arn:aws:s3:::{bucket_name}"],
                     "Condition": {
                         "Bool": {"aws:SecureTransport": "false", "aws:SourceArn": f"arn:aws:s3:::{bucket_name} "}
diff --git a/src/emr_cli/emr_cli.py b/src/emr_cli/emr_cli.py
index ae1f904..c6d8986 100644
--- a/src/emr_cli/emr_cli.py
+++ b/src/emr_cli/emr_cli.py
@@ -101,6 +101,7 @@ def bootstrap(target, code_bucket, logs_bucket, instance_profile_name, job_role_
             resource_id: config.get(resource_id),
             "job_role": config.get("job_role_arn"),
             "s3_code_uri": f"s3://{config.get('code_bucket')}/code/pyspark/",
+            "s3_logs_uri": f"s3://{config.get('log_bucket')}/logs/pyspark/",
         }
     }
     ConfigWriter.write(run_config)
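
For reference, the `run_config` written out by `ConfigWriter.write` in the last hunk suggests the bootstrapped `.emr/config.yaml` carries defaults roughly like the sketch below. This is only a guess at the file's shape: the `run:` section name, the nesting, and every value are assumptions inferred from the keys visible in the diff (`job_role`, `s3_code_uri`, `s3_logs_uri`), not copied from the repository.

```yaml
# Hypothetical .emr/config.yaml produced by `emr bootstrap`.
# Shape inferred from the run_config dict in emr_cli.py; all values are placeholders.
run:
  application_id: <application-id>                      # EMR Serverless application created by bootstrap
  job_role: arn:aws:iam::123456789012:role/<job-role>   # job_role_arn from the bootstrap config
  s3_code_uri: s3://<code-bucket>/code/pyspark/         # default deploy target for `emr run --build`
  s3_logs_uri: s3://<log-bucket>/logs/pyspark/          # default log destination added in this change
```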