- Introduction
- Setup
- Launching a cluster
- Archiving Spark logs
- Spark history server
- Broadcasting commands to all nodes
## Introduction

This repository is a collection of shell scripts and configuration files
used to manage AWS EMR clusters (tested on small clusters).
## Setup

- `common/` has utilities and common variables used in the scripts.
- AWS account specific variables are defined in `common/conf`; it is not
  included in the source tree for security reasons. A template
  `common/conf` is shown below:

  ```sh
  # Amazon EC2 options
  declare SUBNET="SubnetId=<my-subnet>"
  declare KEYS="KeyName=<mykey>"

  # Amazon S3 config
  declare BUCKET=<my-bucket>
  declare LOG_BUCKET=<my-log-bucket>/elasticmapreduce
  ```

  Note: You might need to replace a few `<my-bucket>` placeholders in the
  bootstrap script and the conf files.
- Other configuration files are available under `etc/`:
  - Cluster configuration is in `etc/cluster-conf.json`.
  - Instance configurations are in `etc/instance-groups-*.json`.
  - `etc/bootstrap.sh` has placeholder names for S3 buckets which should
    be edited before using the launch script.
- AWS credentials are expected to be in the usual place (`$HOME/.aws/`).
- Dependencies: The bootstrap script sets up the cluster by making the
  dependencies provided under `<my-bucket>/deps/` available. Python
  dependencies need to be explicitly installed using `pip` during
  bootstrap, while `jar` dependencies are automatically added to the
  `classpath` in the cluster configuration. (See the sketch after this
  list.)
- User management: The bootstrap script also enables a simple SSH
  public-key based user management scheme by copying the public keys
  from `<my-bucket>/conf/keys/`.
- Utilities: Finally, it copies a few useful scripts into the `PATH`.
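A minimal sketch of the dependency and key setup described above,
assuming hypothetical local paths such as `/home/hadoop/deps`; the
authoritative steps live in `etc/bootstrap.sh`:

```sh
#!/bin/bash
# Hypothetical sketch of the bootstrap flow; see etc/bootstrap.sh for
# the real steps and paths.

# Fetch the dependencies staged on S3 onto the node.
aws s3 sync s3://<my-bucket>/deps/ /home/hadoop/deps/

# Python dependencies must be installed explicitly during bootstrap.
sudo pip install /home/hadoop/deps/*.whl

# SSH public keys for the user management scheme.
aws s3 sync s3://<my-bucket>/conf/keys/ /home/hadoop/keys/
```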
## Launching a cluster

```sh
$ ./launch-cluster --name <project-name> --cluster-type SPARK \
    --instance-groups etc/instance-groups-min.json \
    --conf etc/cluster-conf.json
```
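Roughly speaking, this wraps `aws emr create-cluster`; a hand-rolled
equivalent might look like the sketch below. This is an assumption on my
part (the script itself is authoritative), and `<emr-release>` is a
placeholder.

```sh
# Rough sketch of the underlying call (assumed). $SUBNET, $KEYS, $BUCKET
# and $LOG_BUCKET come from common/conf.
aws emr create-cluster \
    --name <project-name> \
    --release-label <emr-release> \
    --applications Name=Spark \
    --use-default-roles \
    --ec2-attributes "$SUBNET,$KEYS" \
    --instance-groups file://etc/instance-groups-min.json \
    --configurations file://etc/cluster-conf.json \
    --bootstrap-actions Path=s3://$BUCKET/bootstrap.sh \
    --log-uri s3://$LOG_BUCKET/
```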
After starting a cluster, you can get the cluster's public DNS name like
this:

```sh
$ aws emr describe-cluster --cluster-id <cluster-id> | grep PublicDns
```
Then you can log in to the cluster using your SSH keys. The `-A` switch
forwards your SSH agent (and thus your current set of keys) to the
cluster, making it easier to use SSH from the cluster to interact with
machines outside it.
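For example, assuming the default EMR login user, `hadoop`:

```sh
# <public-dns> is the MasterPublicDnsName from describe-cluster above.
$ ssh -A hadoop@<public-dns>
```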
The cluster can be terminated like this:

```sh
$ aws emr terminate-clusters --cluster-ids <cluster-id>
```
## Archiving Spark logs

On EMR, the Spark logs reside on HDFS and are destroyed when the
cluster is terminated. You can archive the logs of a currently running
cluster to S3 (under the log bucket) for future reference using the
script `archive-spark-logs`. You may run this script either from the
workstation where you used `launch-cluster` to launch the cluster, or
from an interactive terminal on the cluster; the script detects where
it is running and does the right thing. If multiple clusters are
running, the script prompts you to choose one from a menu.
The archived logs can be retrieved later using `emr-get-logs` by
providing the cluster id.
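Assumed invocations (the exact flags may differ; the scripts themselves
are authoritative):

```sh
$ archive-spark-logs          # prompts for a cluster if several are running
$ emr-get-logs <cluster-id>   # cluster ids look like j-XXXXXXXXXXXXX
```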
## Spark history server

You can start a Spark history server using `history-server` by pointing
it at the downloaded logs. This requires a local standalone installation
of Spark; you can point to the installation by setting `$SPARK_HOME`.
```sh
$ history-server start <log-dir>
$ history-server stop
```
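For example, assuming the logs were fetched into `./spark-logs` with
`emr-get-logs`:

```sh
$ export SPARK_HOME=~/build/spark    # local standalone Spark install
$ history-server start ./spark-logs
# The Spark history UI typically serves on http://localhost:18080.
$ history-server stop
```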
Without going into too many details, you can set up Spark like this:
```sh
function __hadoop_init()
{
    export HADOOP_VER=2.7.3
    export HADOOP_HOME=~/build/hadoop-${HADOOP_VER}
    export LD_LIBRARY_PATH=$HADOOP_HOME/lib/native:$LD_LIBRARY_PATH
    export PATH=$HADOOP_HOME/bin:$PATH
    export CLASSPATH=$($HADOOP_HOME/bin/hdfs classpath --glob)
}

function __spark_init()
{
    PYSPARK_PYTHON=$1
    # __hbase_init
    __hadoop_init
    export SPARK_HOME=~/build/spark
    # Default to python3 unless a Python interpreter was passed in.
    if [[ -z $PYSPARK_PYTHON || ! $PYSPARK_PYTHON =~ .*python.* ]]; then
        PYSPARK_PYTHON=python3
    fi
    export PYSPARK_PYTHON
    # Parentheses necessary for filename glob expansion.
    local py4j=(${SPARK_HOME}/python/lib/py4j*.zip)
    export PYTHONPATH="${SPARK_HOME}/python/lib/pyspark.zip:$PYTHONPATH"
    export PYTHONPATH="$py4j:$PYTHONPATH"
    export PATH=$SPARK_HOME/bin:$PATH
}
```
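Example usage; the functions set everything up in the current shell:

```sh
# Initialize Hadoop + Spark with a specific Python interpreter.
$ __spark_init python3
$ pyspark   # now runs against the standalone Spark under $SPARK_HOME
```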
## Broadcasting commands to all nodes

If you have logged in to the cluster over SSH with the `-A` flag
enabled, you can execute a command across all nodes of the cluster (in
other words, broadcast a command) using the `broadcast` command.
```sh
$ broadcast <mycmd> <with> <arguments>
```
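A minimal sketch of what a broadcast-style helper does, assuming node
discovery via `yarn node -list`; the real `broadcast` script may
discover nodes differently:

```sh
#!/bin/bash
# Hypothetical sketch; the actual broadcast script is authoritative.
# List the running worker hostnames from YARN and run the given command
# on each one (relies on the forwarded SSH agent for authentication).
for host in $(yarn node -list 2>/dev/null |
              awk '/RUNNING/ { split($1, a, ":"); print a[1] }'); do
    echo "=== $host ==="
    ssh -o StrictHostKeyChecking=no "$host" "$@"
done
```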