Update docs

dacort committed Feb 1, 2024
1 parent 9ae33b4 commit fcacb7e
Showing 5 changed files with 120 additions and 16 deletions.
127 changes: 113 additions & 14 deletions README.md
@@ -25,7 +25,7 @@ First, let's install the `emr` command.
python3 -m pip install -U emr-cli
```

> **Note** This tutorial assumes you have already [set up EMR Serverless](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/setting-up.html) and have an EMR Serverless application, job role, and S3 bucket you can use. You can also use the `emr bootstrap` command.
> **Note** This tutorial assumes you have already [set up EMR Serverless](https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/setting-up.html) and have an EMR Serverless application, job role, and S3 bucket you can use. If not, you can use the `emr bootstrap` command.
1. Create a sample project

@@ -56,30 +56,109 @@ emr run \
--application-id ${APPLICATION_ID} \
--job-role ${JOB_ROLE_ARN} \
--s3-code-uri s3://${S3_BUCKET}/tmp/emr-cli-demo/ \
--s3-logs-uri s3://${S3_BUCKET}/logs/emr-cli-demo/ \
--build \
--wait \
--show-stdout
```

This command performs the following actions:

- Packages your project dependencies into a python virtual environment
- Packages your project dependencies into a Python virtual environment
- Uploads the Spark entrypoint and packaged dependencies to S3
- Starts an EMR Serverless job
- Waits for the job to run to a successful completion!
- Waits for the job to run to completion and shows the `stdout` of the Spark driver when finished!

And you're done. Feel free to modify the project to experiment with different things. You can simply re-run the command above to re-package and re-deploy your job.

## pyspark code
## EMR CLI Sub-commands

The EMR CLI has several subcommands, which you can see by running `emr --help`:

```
Commands:
bootstrap Bootstrap an EMR Serverless environment.
deploy Copy a local project to S3.
init Initialize a local PySpark project.
package Package a project and dependencies into dist/
run Run a project on EMR, optionally build and deploy
status
```

### bootstrap

`emr bootstrap` allows you to create a sample EMR Serverless or EMR on EC2 environment for testing. It assumes you have admin access and creates various resources for you using AWS APIs.

#### EMR Serverless

To bootstrap an EMR Serverless environment, use the following command:

```shell
emr bootstrap \
--target emr-serverless \
--code-bucket <your_unique_new_bucket_name> \
--job-role-name <your_unique_emr_serverless_job_role_name>
```

When you do this, the CLI creates a config file at `.emr/config.yaml` that sets default options for your `emr run` command.
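
Once those defaults are in place, later `emr run` invocations can usually omit the application, role, and S3 flags. A minimal sketch (assuming your entrypoint file is named `entrypoint.py`; the actual defaults come from your `.emr/config.yaml`):

```bash
# Application ID, job role, and S3 code/log locations are read from .emr/config.yaml
emr run --entry-point entrypoint.py --build --show-stdout
```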

### init

The `init` command creates a new `pyproject.toml` or `poetry` project for you with a sample PySpark application.

`init` is required for these project types because it also creates the `Dockerfile` used to package your dependencies. Single-file PySpark jobs and simple Python modules do not require the `init` command.
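
As a rough sketch (the directory name is hypothetical and the exact options may differ; check `emr init --help`), starting a new dependency-managed project looks something like:

```bash
# Creates a sample PySpark project, including a Dockerfile for packaging dependencies
emr init my-pyspark-job
```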

### package

The `package` command bundles your PySpark code and dependencies in preparation for deployment. Often you'll either use `package` and `deploy` to deploy new artifacts to S3, or you'll use the `--build` flag in the `emr run` command to handle both of those tasks for you.

The EMR CLI automatically detects what type of project you have and builds the necessary dependency packages.
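
For example, packaging a project whose entrypoint is `main.py` (a sketch; `main.py` is a placeholder for your own entrypoint):

```bash
# Builds the dependency artifacts and writes them, along with the entrypoint, to dist/
emr package --entry-point main.py
```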

### deploy

In many organizations, PySpark is the primary language for writing Spark jobs. But Python projects can be structured in a variety of ways – a single `.py` file, `requirements.txt`, `setup.py` files, or even `poetry` configurations. EMR CLI aims to bundle your PySpark code the same way regardless of which system you use.
The `deploy` command copies the project dependencies from the `dist/` folder to your specified S3 location.
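
A typical follow-up to `package` might look like this (a sketch; the bucket and prefix are placeholders):

```bash
# Uploads the previously packaged artifacts in dist/ to S3
emr deploy --entry-point main.py --s3-code-uri s3://<BUCKET>/code/
```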

## Spark scala code (coming)
### run

While Spark Scala or Java code will be more standard from a packaging perspective, it's still useful to be able to easily deploy and run your jobs across multiple EMR environments.
The `run` command is intended to help package, deploy, and run your PySpark code across EMR on EC2, EMR on EKS, or EMR Serverless.

## Spark SQL (coming)
You must provide one of `--cluster-id`, `--virtual-cluster-id`, or `--application-id` to specify which environment to run your code on.

Want to just write some `.sql` files and have those deployed? No problem.
`emr run --help` shows all the available options:

```
Usage: emr run [OPTIONS]
Run a project on EMR, optionally build and deploy
Options:
--application-id TEXT EMR Serverless Application ID
--cluster-id TEXT EMR on EC2 Cluster ID
--virtual-cluster-id TEXT EMR on EKS Virtual Cluster ID
--entry-point FILE Python or Jar file for the main entrypoint
--job-role TEXT IAM Role ARN to use for the job execution
--wait Wait for job to finish
--s3-code-uri TEXT Where to copy/run code artifacts to/from
--s3-logs-uri TEXT Where to send EMR Serverless logs to
--job-name TEXT The name of the job
--job-args TEXT Comma-delimited string of arguments to be
passed to Spark job
--spark-submit-opts TEXT String of spark-submit options
--build Package and deploy job artifacts
--show-stdout Show the stdout of the job after it's finished
--save-config Update the config file with the provided
options
--emr-eks-release-label TEXT EMR on EKS release label (emr-6.15.0) -
defaults to latest release
```
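
As an example, `--build` and `--save-config` can be combined so that packaging, upload, and execution happen in one step and the options are remembered for next time (a sketch; the bracketed values are placeholders):

```bash
# Package, deploy, and run the job, then persist these options to the config file
emr run --entry-point main.py \
    --application-id <APPLICATION_ID> \
    --job-role <JOB_ROLE_ARN> \
    --s3-code-uri s3://<BUCKET>/code/ \
    --build --wait --save-config
```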

## Supported PySpark configurations

- Single-file project - Projects that have a single `.py` entrypoint file.
- Multi-file project - A more typical PySpark project, but without dependencies, that has multiple Python files or modules.
- Python module - A project with dependencies defined in a `pyproject.toml` file.
- Poetry project - A project using [Poetry](https://python-poetry.org/) for dependency management.
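
To illustrate (file and directory names here are purely hypothetical), the four layouts look roughly like this:

```
# Single-file project
main.py

# Multi-file project
main.py
jobs/extract.py

# Python module
pyproject.toml
my_job/__init__.py

# Poetry project
pyproject.toml
poetry.lock
my_job/__init__.py
```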

## Sample Commands

@@ -125,6 +125,17 @@ emr run --entry-point main.py \
--wait
```

- Re-run an already deployed job and show the `stdout` of the driver.

```bash
emr run --entry-point main.py \
--s3-code-uri s3://<BUCKET>/code/ \
--s3-logs-uri s3://<BUCKET>/logs/ \
--application-id <EMR_SERVERLESS_APP> \
--job-role <JOB_ROLE_ARN> \
--show-stdout
```

> **Note**: If the job fails, the command will exit with an error code.
- Re-run your jobs with 7 characters.
@@ -146,18 +146,27 @@ emr run --entry-point main.py \

🥳

In the future, you'll also be able to do the following:
- Run the same job against an EMR on EC2 cluster

- Utilize the same code against an EMR on EC2 cluster

```bash
emr run --cluster-id j-8675309
emr run --entry-point main.py \
--s3-code-uri s3://<BUCKET>/code/ \
--s3-logs-uri s3://<BUCKET>/logs/ \
--cluster-id <EMR_EC2_CLUSTER_ID> \
--show-stdout
```

- Or an EMR on EKS virtual cluster.

```bash
emr run --virtual-cluster-id 654abacdefgh1uziuyackhrs1
emr run --entry-point main.py \
--s3-code-uri s3://<BUCKET>/code/ \
--s3-logs-uri s3://<BUCKET>/logs/ \
--virtual-cluster-id <EMR_EKS_VIRTUAL_CLUSTER_ID> \
--job-role <EMR_EKS_JOB_ROLE_ARN> \
--show-stdout
```

## Security
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,7 +1,7 @@
[tool.poetry]
name = "emr-cli"
version = "0.0.16"
description = "A command-line interface for packaging, deploying, and running your EMR Serverless Spark jobs."
description = "A command-line interface for packaging, deploying, and running your PySpark jobs on EMR."
authors = ["Amazon EMR <emr-developer-advocates@amazon.com>"]
license = "Apache-2.0"
readme = "README.md"
1 change: 1 addition & 0 deletions src/emr_cli/deployments/emr_ec2.py
@@ -101,6 +101,7 @@ def _default_s3_bucket_policy(self, bucket_name) -> str:
"Sid": "RequireSecureTransport",
"Effect": "Deny",
"Principal": "*",
"Action": "s3:*",
"Resource": [f"arn:aws:s3:::{bucket_name}/*", f"arn:aws:s3:::{bucket_name}"],
"Condition": {
"Bool": {"aws:SecureTransport": "false", "aws:SourceArn": f"arn:aws:s3:::{bucket_name} "}
5 changes: 4 additions & 1 deletion src/emr_cli/deployments/emr_serverless.py
@@ -49,6 +49,7 @@ def _zip_local_pyfiles(self):


class Bootstrap:
# Maybe add some UUIDs to these?
DEFAULT_S3_POLICY_NAME = "emr-cli-S3Access"
DEFAULT_GLUE_POLICY_NAME = "emr-cli-GlueAccess"

@@ -74,7 +75,8 @@ def create_environment(self):
def print_destroy_commands(self, application_id: str):
# fmt: off
for bucket in set([self.log_bucket, self.code_bucket]):
print(f"# aws s3 rm s3://{bucket} --force")
print(f"aws s3 rm s3://{bucket} --recursive")
print(f"aws s3api delete-bucket --bucket {bucket}")
for policy in self.iam_client.list_attached_role_policies(RoleName=self.job_role_name).get('AttachedPolicies'): # noqa E501
arn = policy.get('PolicyArn')
print(f"aws iam detach-role-policy --role-name {self.job_role_name} --policy-arn {arn}") # noqa E501
@@ -107,6 +109,7 @@ def _default_s3_bucket_policy(self, bucket_name) -> str:
"Sid": "RequireSecureTransport",
"Effect": "Deny",
"Principal": "*",
"Action": "s3:*",
"Resource": [f"arn:aws:s3:::{bucket_name}/*", f"arn:aws:s3:::{bucket_name}"],
"Condition": {
"Bool": {"aws:SecureTransport": "false", "aws:SourceArn": f"arn:aws:s3:::{bucket_name} "}
1 change: 1 addition & 0 deletions src/emr_cli/emr_cli.py
@@ -101,6 +101,7 @@ def bootstrap(target, code_bucket, logs_bucket, instance_profile_name, job_role_
resource_id: config.get(resource_id),
"job_role": config.get("job_role_arn"),
"s3_code_uri": f"s3://{config.get('code_bucket')}/code/pyspark/",
"s3_logs_uri": f"s3://{config.get('log_bucket')}/logs/pyspark/",
}
}
ConfigWriter.write(run_config)
