diff --git a/scripts/cluster/yarn_cluster_setup/README.md b/scripts/cluster/yarn_cluster_setup/README.md
index 5e7b1fc..89cca83 100644
--- a/scripts/cluster/yarn_cluster_setup/README.md
+++ b/scripts/cluster/yarn_cluster_setup/README.md
@@ -1,7 +1,7 @@
-Setup an experiment on Cloudlab using the SparkFHE-Dist-Ubuntu18.04 image. Use the Wisconsin server.
+Set up an experiment on Cloudlab using the SparkFHE-YARN-Client-Ub18-HDFS image. Use the Wisconsin server.
 
-Please note that the scripts are designed to run on Master Node.
+The installation scripts are designed to run from the master node. The cluster start/stop scripts and the example scripts are designed to run from the client node.
 
 # SSH into Master Node
 SSH into the master node and navigate to the directory specified below:
@@ -10,50 +10,166 @@ cd /spark-3.0.0-SNAPSHOT-bin-SparkFHE/SparkFHE-Addon/scripts/cluster/yarn_cluster_setup
 ```
 
 # Install Hadoop and Configure Spark on all nodes through Master Node
-Specify the hostnames of nodes as arguments.
+The hostnames of the nodes in the cluster are picked up from /etc/hosts. See the Appendix for further details about hostnames.
 ```
-sudo bash install_yarn_cluster.sh master,worker1,worker2 ...
+sudo bash install_yarn_cluster.sh
 ```
 
-# Start Yarn Spark Cluster
-Cluster can only be started on master node after installation is complete on all nodes and configuration files for Yarn and Spark are placed in correct folders.
+# SSH into Client Node
+SSH into the client node and navigate to the directory specified below:
+```
+cd /spark-3.0.0-SNAPSHOT-bin-SparkFHE/SparkFHE-Addon/scripts/cluster/yarn_cluster_setup
+```
+
+# Start Yarn Spark Cluster and HDFS from Client Node
+The cluster can only be started after installation is complete on all nodes and the configuration files for Yarn and Spark are placed in the correct folders. See the Appendix for HDFS commands.
 ```
 sudo bash start_yarn_cluster.sh
 ```
 
-# Run Test Spark Job on Master
-Use the link generated after successful completion of cluster building to view the web interface for Yarn.
+# Run Test Spark Job on Master Through Client
 ```
 cd test_scripts
-sudo bash run_spark_test_job_pi.sh
+sudo bash run_spark_test_job_pi_remotely.sh
 ```
+If the job completes successfully, the final status is 'SUCCEEDED'. The links generated can be used by following the guide specified below.
+
+# Web Interfaces:
+
+The different web interfaces can be accessed by changing the port number. The list is specified directly below.
 
-### Useful Links:
-Other links can be generated by changing the port number.
+To view a web interface, some additional steps have to be performed. See the Appendix for SSH tunneling instructions.
 
-YARN Interface:
+The public IP addresses of some interfaces have been closed off to bolster security. See the Appendix section on security for the individual aspects of the cluster.
 
-http://<master-node-ip>:8088/
+## YARN Interface:
 
-Spark Interface:
+http://<master-node-ip>:8088/
 
-http://<master-node-ip>:8080/
+The output of the test job is available at the link above.
 
-Namenode Interface:
+Select the latest application, open the logs for that application, and select stdout. This should show the value of Pi calculated on the cluster.
 
-http://<master-node-ip>:50070/
+## Spark Interface:
 
-Datanode Interface:
+http://<master-node-ip>:8080/
 
-http://<master-node-ip>:50075/
+## Namenode Interface:
 
-JobMaster Interface:
+http://<master-node-ip>:50070/
 
-http://<master-node-ip>:19888/
+## JobMaster Interface:
 
-# Stop the Cluster
+http://<master-node-ip>:19888/
+
+## Datanode Interface:
+
+http://<datanode-ip>:50075/
+
+# Stop the Cluster Through the Client Node
 ```
 cd ..
 sudo bash stop_yarn_cluster.sh
 ```
 After running this command, the web interfaces will not work.
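+
+Note: while the cluster is still running, the test job can also be inspected from the client node without the web interfaces. A minimal sketch using the standard YARN CLI (the application id is taken from the first command's output):
+```
+sudo ssh master 'source /etc/profile; yarn application -list -appStates FINISHED'
+sudo ssh master 'source /etc/profile; yarn application -status <application-id>'
+```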
+
+# Appendix
+
+## Hostnames
+The current process is designed to read the worker names from /etc/hosts. This might not be the case for third-party platforms such as Amazon EC2; there, changes will have to be made to this step, and the user would have to enter the public IP addresses of the master and worker nodes manually.
+
+## HDFS Commands
+An important condition for HDFS to work is the public IP address: please make sure that every node in the cluster has a publicly accessible IP address.
+
+HDFS is turned on when start_yarn_cluster.sh is executed. The individual command to turn on HDFS is /sbin/start-dfs.sh; to turn it off, use /sbin/stop-dfs.sh.
+
+### HDFS Commands on cluster nodes
+Once HDFS is on, the following commands can be run from any of the nodes in the cluster.
+```
+# List folders in HDFS
+hdfs dfs -ls /
+# Make a folder
+hadoop fs -mkdir -p /<folder_name>
+# Confirm folder creation
+hdfs dfs -ls /
+# Move a local file into HDFS
+hadoop fs -put <local_file_path> /<folder_name>/<file_name>
+# View the content of the file created in HDFS
+hdfs dfs -cat /<folder_name>/<file_name>
+```
+Additional information can be found [here](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html)
+
+### HDFS Commands from remote machines using the NameNode RPC address, i.e. port 9000
+For the most part, these commands stay similar to the HDFS commands on cluster nodes; only the address used to reach HDFS changes, in the following manner. (On the cluster nodes themselves, the default address comes from the Hadoop configuration under /usr/local/hadoop/etc/hadoop.)
+```
+# List folders in HDFS
+hdfs dfs -ls hdfs://<master-public-ip>:9000/
+# Make a folder
+hadoop fs -mkdir -p hdfs://<master-public-ip>:9000/<folder_name>
+# Confirm folder creation
+hdfs dfs -ls hdfs://<master-public-ip>:9000/
+# Move a local file into HDFS
+hadoop fs -put <local_file_path> hdfs://<master-public-ip>:9000/<folder_name>/<file_name>
+# View the content of the file created in HDFS
+hdfs dfs -cat hdfs://<master-public-ip>:9000/<folder_name>/<file_name>
+```
+Additional information can be found [here](https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html)
+
+### HDFS Commands from remote machines using webHDFS, i.e. port 50070
+To run commands from machines outside the cluster without an HDFS client, the WebHDFS REST API can be used. Here are a few examples.
+```
+# Make a folder
+curl -X PUT "http://<master-public-ip>:50070/webhdfs/v1/user/<folder_name>?user.name=root&op=MKDIRS"
+# Create an empty file; the response contains a datanode link (specified in quotes) that can be used to upload the content
+curl -i -X PUT "http://<master-public-ip>:50070/webhdfs/v1/user/<folder_name>/<file_name>?user.name=root&op=CREATE"
+# Upload the local file to the link generated by the command above to write it into HDFS
+curl -i -X PUT -T <local_file_path> "http://<datanode-ip>:50075/webhdfs/v1/user/<folder_name>/<file_name>?op=CREATE&user.name=root&namenoderpcaddress=master:9000&createflag=&createparent=true&overwrite=false"
+```
+Additional information can be found [here](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/WebHDFS.html)
+
+## SSH Tunneling Instructions
+
+### Find Internal IP of Master/Worker Node
+
+On the client node, run the following to get the internal IPs of the master node and the worker nodes:
+```
+sudo ssh master "hostname -I | awk '{print \$1}'"
+sudo ssh worker1 "hostname -I | awk '{print \$1}'"
+```
+
+### Setup SSH Tunneling for nodes
+
+Open a terminal window on the local machine and type the following:
+
+```
+ssh -4 -ND <local_port> <username>@<client_node_address>
+```
+This step binds a port of the local machine to the client node, which can reach the internal IP addresses of the master and worker nodes; browser traffic sent through this port is forwarded into the cluster. (A simpler port-forwarding alternative is sketched after the browser steps below.)
+
+### Configure Browser to open link
+
+* Open the Mozilla Firefox browser on the local machine.
+
+* Click on the three horizontal bars available on the top right-hand side.
+
+* Select Preferences and look for 'Network Settings' on the page.
+
+* Once inside Network Settings, select Manual Proxy Configuration.
+
+* Select SOCKS v5 and, for the SOCKS Host port, type in the port number chosen in the previous step. The IP of the SOCKS Host does not need to be changed. Select OK.
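+
+Alternatively, a single interface can be reached without any browser changes by using plain SSH local port forwarding instead of the SOCKS proxy. A minimal sketch (port 8088 and the placeholder addresses are illustrative, using the internal IP found above):
+```
+ssh -4 -L 8088:<master-internal-ip>:8088 <username>@<client_node_address>
+```
+With this tunnel, http://localhost:8088/ serves the YARN interface directly.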
+
+### Open Weblinks (address format and port numbers specified above)
+
+### Stop SSH Tunneling
+
+* To use the Mozilla Firefox browser as usual again, select 'No Proxy' in Network Settings and select OK.
+
+* Stop the SSH tunneling by closing the terminal window or hitting Ctrl + C in it.
+
+
+## Security for individual aspects of cluster
+* YARN - Accessible only on the internal IP
+* Remote HDFS (port 9000) - Publicly accessible
+* webHDFS (port 50070) - Publicly accessible
+* Spark - Publicly accessible
diff --git a/scripts/cluster/yarn_cluster_setup/configs/hadoop/core-site.xml b/scripts/cluster/yarn_cluster_setup/configs/hadoop/core-site.xml
index 7d1a4bb..5690604 100644
--- a/scripts/cluster/yarn_cluster_setup/configs/hadoop/core-site.xml
+++ b/scripts/cluster/yarn_cluster_setup/configs/hadoop/core-site.xml
@@ -1,14 +1,12 @@
 <configuration>
         <property>
                 <name>fs.defaultFS</name>
-                <value>hdfs://master:9000</value>
+                <value>hdfs://master-public-ip:9000</value>
         </property>
-
         <property>
-                <name>dfs.namenode.rpc-bind-host</name>
-                <value>0.0.0.0</value>
+                <name>dfs.namenode.rpc-bind-host</name>
+                <value>master-public-ip</value>
         </property>
 </configuration>
\ No newline at end of file
diff --git a/scripts/cluster/yarn_cluster_setup/configs/hadoop/hdfs-site.xml b/scripts/cluster/yarn_cluster_setup/configs/hadoop/hdfs-site.xml
index 90ad504..7fba571 100644
--- a/scripts/cluster/yarn_cluster_setup/configs/hadoop/hdfs-site.xml
+++ b/scripts/cluster/yarn_cluster_setup/configs/hadoop/hdfs-site.xml
@@ -7,11 +7,15 @@
         <property>
                 <name>dfs.namenode.http-address</name>
-                <value>0.0.0.0:50070</value>
+                <value>master-variable-ip:50070</value>
         </property>
         <property>
                 <name>dfs.namenode.secondary.http-address</name>
-                <value>0.0.0.0:50090</value>
+                <value>master-public-ip:50090</value>
         </property>
+        <property>
+                <name>dfs.namenode.datanode.registration.ip-hostname-check</name>
+                <value>false</value>
+        </property>
         <property>
                 <name>dfs.replication</name>
@@ -25,4 +29,8 @@
         <property>
                 <name>dfs.datanode.data.dir</name>
                 <value>/data/hadoop/data</value>
         </property>
+        <property>
+                <name>dfs.webhdfs.enabled</name>
+                <value>true</value>
+        </property>
 </configuration>
\ No newline at end of file
diff --git a/scripts/cluster/yarn_cluster_setup/configs/hadoop/yarn-site-capacity.xml b/scripts/cluster/yarn_cluster_setup/configs/hadoop/yarn-site-capacity.xml
index b403f69..4a3ef7d 100755
--- a/scripts/cluster/yarn_cluster_setup/configs/hadoop/yarn-site-capacity.xml
+++ b/scripts/cluster/yarn_cluster_setup/configs/hadoop/yarn-site-capacity.xml
@@ -3,7 +3,7 @@
 <configuration>
         <property>
                 <name>yarn.resourcemanager.hostname</name>
-                <value>master</value>
+                <value>master-internal-ip</value>
         </property>
         <property>
                 <name>yarn.nodemanager.aux-services</name>
@@ -15,30 +15,30 @@
         <property>
                 <name>yarn.resourcemanager.address</name>
-                <value>master:8032</value>
+                <value>master-internal-ip:8032</value>
         </property>
         <property>
                 <name>yarn.resourcemanager.scheduler.address</name>
-                <value>master:8030</value>
+                <value>master-internal-ip:8030</value>
         </property>
         <property>
                 <name>yarn.resourcemanager.resource-tracker.address</name>
-                <value>master:8031</value>
+                <value>master-internal-ip:8031</value>
         </property>
         <property>
                 <name>yarn.resourcemanager.admin.address</name>
-                <value>0.0.0.0:8033</value>
+                <value>master-internal-ip:8033</value>
         </property>
         <property>
                 <name>yarn.resourcemanager.webapp.address</name>
-                <value>0.0.0.0:8088</value>
+                <value>master-internal-ip:8088</value>
         </property>
         <property>
                 <name>mapreduce.jobhistory.address</name>
-                <value>master:10020</value>
+                <value>master-internal-ip:10020</value>
         </property>
         <property>
                 <name>mapreduce.jobhistory.webapp.address</name>
-                <value>0.0.0.0:19888</value>
+                <value>master-internal-ip:19888</value>
         </property>
 </configuration>
\ No newline at end of file
diff --git a/scripts/cluster/yarn_cluster_setup/configs/hadoop/yarn-site-fair.xml b/scripts/cluster/yarn_cluster_setup/configs/hadoop/yarn-site-fair.xml
index 2903140..687725d 100755
--- a/scripts/cluster/yarn_cluster_setup/configs/hadoop/yarn-site-fair.xml
+++ b/scripts/cluster/yarn_cluster_setup/configs/hadoop/yarn-site-fair.xml
@@ -3,7 +3,7 @@
 <configuration>
         <property>
                 <name>yarn.resourcemanager.hostname</name>
-                <value>master</value>
+                <value>master-internal-ip</value>
         </property>
         <property>
                 <name>yarn.nodemanager.aux-services</name>
@@ -15,31 +15,31 @@
         <property>
                 <name>yarn.resourcemanager.address</name>
-                <value>master:8032</value>
+                <value>master-internal-ip:8032</value>
         </property>
         <property>
                 <name>yarn.resourcemanager.scheduler.address</name>
-                <value>master:8030</value>
+                <value>master-internal-ip:8030</value>
         </property>
         <property>
                 <name>yarn.resourcemanager.resource-tracker.address</name>
-                <value>master:8031</value>
+                <value>master-internal-ip:8031</value>
         </property>
         <property>
                 <name>yarn.resourcemanager.admin.address</name>
-                <value>0.0.0.0:8033</value>
+                <value>master-internal-ip:8033</value>
         </property>
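+        <!-- Note (added for clarity): the master-internal-ip placeholder values in
+             this file are substituted with the real master name at install time by
+             the sed commands in install_yarn_master_slave.sh -->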
         <property>
                 <name>yarn.resourcemanager.webapp.address</name>
-                <value>0.0.0.0:8088</value>
+                <value>master-internal-ip:8088</value>
         </property>
         <property>
                 <name>mapreduce.jobhistory.address</name>
-                <value>master:10020</value>
+                <value>master-internal-ip:10020</value>
         </property>
         <property>
                 <name>mapreduce.jobhistory.webapp.address</name>
-                <value>0.0.0.0:19888</value>
+                <value>master-internal-ip:19888</value>
         </property>
         <property>
                 <name>yarn.resourcemanager.scheduler.class</name>
diff --git a/scripts/cluster/yarn_cluster_setup/configs/hadoop/yarn-site-regular.xml b/scripts/cluster/yarn_cluster_setup/configs/hadoop/yarn-site-regular.xml
new file mode 100755
index 0000000..4a3ef7d
--- /dev/null
+++ b/scripts/cluster/yarn_cluster_setup/configs/hadoop/yarn-site-regular.xml
@@ -0,0 +1,44 @@
+<?xml version="1.0"?>
+<configuration>
+
+        <property>
+                <name>yarn.resourcemanager.hostname</name>
+                <value>master-internal-ip</value>
+        </property>
+        <property>
+                <name>yarn.nodemanager.aux-services</name>
+                <value>mapreduce_shuffle</value>
+        </property>
+        <property>
+                <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
+                <value>org.apache.hadoop.mapred.ShuffleHandler</value>
+        </property>
+        <property>
+                <name>yarn.resourcemanager.address</name>
+                <value>master-internal-ip:8032</value>
+        </property>
+        <property>
+                <name>yarn.resourcemanager.scheduler.address</name>
+                <value>master-internal-ip:8030</value>
+        </property>
+        <property>
+                <name>yarn.resourcemanager.resource-tracker.address</name>
+                <value>master-internal-ip:8031</value>
+        </property>
+        <property>
+                <name>yarn.resourcemanager.admin.address</name>
+                <value>master-internal-ip:8033</value>
+        </property>
+        <property>
+                <name>yarn.resourcemanager.webapp.address</name>
+                <value>master-internal-ip:8088</value>
+        </property>
+        <property>
+                <name>mapreduce.jobhistory.address</name>
+                <value>master-internal-ip:10020</value>
+        </property>
+        <property>
+                <name>mapreduce.jobhistory.webapp.address</name>
+                <value>master-internal-ip:19888</value>
+        </property>
+</configuration>
\ No newline at end of file
diff --git a/scripts/cluster/yarn_cluster_setup/configs/hadoop/yarn-site.xml b/scripts/cluster/yarn_cluster_setup/configs/hadoop/yarn-site.xml
index b403f69..4a3ef7d 100755
--- a/scripts/cluster/yarn_cluster_setup/configs/hadoop/yarn-site.xml
+++ b/scripts/cluster/yarn_cluster_setup/configs/hadoop/yarn-site.xml
@@ -3,7 +3,7 @@
 <configuration>
         <property>
                 <name>yarn.resourcemanager.hostname</name>
-                <value>master</value>
+                <value>master-internal-ip</value>
         </property>
         <property>
                 <name>yarn.nodemanager.aux-services</name>
@@ -15,30 +15,30 @@
         <property>
                 <name>yarn.resourcemanager.address</name>
-                <value>master:8032</value>
+                <value>master-internal-ip:8032</value>
         </property>
         <property>
                 <name>yarn.resourcemanager.scheduler.address</name>
-                <value>master:8030</value>
+                <value>master-internal-ip:8030</value>
         </property>
         <property>
                 <name>yarn.resourcemanager.resource-tracker.address</name>
-                <value>master:8031</value>
+                <value>master-internal-ip:8031</value>
         </property>
         <property>
                 <name>yarn.resourcemanager.admin.address</name>
-                <value>0.0.0.0:8033</value>
+                <value>master-internal-ip:8033</value>
         </property>
         <property>
                 <name>yarn.resourcemanager.webapp.address</name>
-                <value>0.0.0.0:8088</value>
+                <value>master-internal-ip:8088</value>
         </property>
         <property>
                 <name>mapreduce.jobhistory.address</name>
-                <value>master:10020</value>
+                <value>master-internal-ip:10020</value>
         </property>
         <property>
                 <name>mapreduce.jobhistory.webapp.address</name>
-                <value>0.0.0.0:19888</value>
+                <value>master-internal-ip:19888</value>
         </property>
 </configuration>
\ No newline at end of file
diff --git a/scripts/cluster/yarn_cluster_setup/configs/hostnames b/scripts/cluster/yarn_cluster_setup/configs/hostnames
new file mode 100644
index 0000000..821616e
--- /dev/null
+++ b/scripts/cluster/yarn_cluster_setup/configs/hostnames
@@ -0,0 +1,3 @@
+master
+worker1
+worker2
diff --git a/scripts/cluster/yarn_cluster_setup/configs/master b/scripts/cluster/yarn_cluster_setup/configs/master
new file mode 100644
index 0000000..1f7391f
--- /dev/null
+++ b/scripts/cluster/yarn_cluster_setup/configs/master
@@ -0,0 +1 @@
+master
diff --git a/scripts/cluster/yarn_cluster_setup/configs/slaves b/scripts/cluster/yarn_cluster_setup/configs/slaves
new file mode 100644
index 0000000..6e273a2
--- /dev/null
+++ b/scripts/cluster/yarn_cluster_setup/configs/slaves
@@ -0,0 +1,2 @@
+worker1
+worker2
diff --git a/scripts/cluster/yarn_cluster_setup/configs/spark/spark-defaults.conf b/scripts/cluster/yarn_cluster_setup/configs/spark/spark-defaults.conf
new file mode 100644
index 0000000..11e7fbb
--- /dev/null
+++ b/scripts/cluster/yarn_cluster_setup/configs/spark/spark-defaults.conf
@@ -0,0 +1,28 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements.  See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License.  You may obtain a copy of the License at
+#
+#    http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# Default system properties included when running spark-submit.
+# This is useful for setting default environmental settings.
+
+# Example:
+# spark.master                     spark://127.0.0.1:7077
+spark.eventLog.enabled           true
+spark.eventLog.dir               file:///tmp/spark-events
+spark.history.fs.logDirectory    file:///tmp/spark-events
+# spark.serializer                 org.apache.spark.serializer.KryoSerializer
+# spark.driver.memory              5g
+# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
diff --git a/scripts/cluster/yarn_cluster_setup/install_yarn_cluster.sh b/scripts/cluster/yarn_cluster_setup/install_yarn_cluster.sh
index 90ac210..4504523 100644
--- a/scripts/cluster/yarn_cluster_setup/install_yarn_cluster.sh
+++ b/scripts/cluster/yarn_cluster_setup/install_yarn_cluster.sh
@@ -1,20 +1,41 @@
 #!/bin/sh
 
-# Checking for no arguments passed
-if [[ $# -eq 0 ]] ; then
-    echo "Missing arguments."
-    echo "Usage: bash install_yarn_cluster.bash masterHostname1,workerHostname1,workerHostname2,..."
-    exit 0
-fi
+# Locations of the environment variables file and the hosts file
+ROOT_VARIABLES_ADDRESS=/etc/profile
+HOSTS_ADDRESS=/etc/hosts
 
-# Split based on de-limiter as comma
-cluster=$1
-eval $(echo $cluster | awk '{split($0, array, ",");for(i in array)print "host_array["i"]="array[i]}')
+# Assume that the master and worker nodes contain the phrases master and worker in their names, respectively
+master_name=master
+worker_name=worker
+name_index_location=4
+
+master_index_in_host_array=0
+
+current_directory=`pwd`
+
+# Make Master and Slaves File, Clear Older Files
+rm -rf $current_directory/configs/master || true
+touch $current_directory/configs/master
+rm -rf $current_directory/configs/slaves || true
+touch $current_directory/configs/slaves
+rm -rf $current_directory/configs/hostnames || true
+
+# Assume that /etc/hosts is correctly populated
+# Read hostnames for master and worker nodes
+grep $master_name $HOSTS_ADDRESS | awk -v var="$name_index_location" '{print $var}' >> $current_directory/configs/master
+grep $worker_name $HOSTS_ADDRESS | awk -v var="$name_index_location" '{print $var}' >> $current_directory/configs/slaves
+cat $current_directory/configs/master $current_directory/configs/slaves > $current_directory/configs/hostnames
+
+host_array=($(cat $current_directory/configs/hostnames | tr "\n" " "))
 
 function checkSSH() {
     echo "Checking SSH connections"
-    for(( i=2;i<=${#host_array[@]};i++)) ; do
-        ssh ${host_array[i]} "hostname"
+    for(( i=0;i<${#host_array[@]};i++)) ; do
+        echo ${host_array[i]}
+        PUBLIC_IP=`ssh root@${host_array[i]} "hostname -i"`
+        # Capture the ssh exit status before the sed calls below overwrite $?
+        ssh_status=$?
+        # Replace internal hostnames with the public IP
+        sed -i "s/${host_array[i]}/${PUBLIC_IP}/g" "$current_directory/configs/master"
+        sed -i "s/${host_array[i]}/${PUBLIC_IP}/g" "$current_directory/configs/slaves"
-        if [ $? -eq 0 ]
+        if [ $ssh_status -eq 0 ]
         then
                 echo -e "Can SSH to ${host_array[i]}"
@@ -27,49 +48,46 @@ function checkSSH() {
 checkSSH
 
-# Make Master and Slaves File
-# Clear Content from Files
-
-current_directory=`pwd`
-
-rm -rf $current_directory/configs/master || true
-touch $current_directory/configs/master
-rm -rf $current_directory/configs/slaves || true
-touch $current_directory/configs/slaves
-
-# Save 1st argument in master file
-master_limit=1
-echo ${host_array[$master_limit]} >> $current_directory/configs/master
-
-# Save Remaining arguments in slaves file
-for(( i=2;i<=${#host_array[@]};i++)) ; do
-    echo ${host_array[i]} >> $current_directory/configs/slaves
-done
+MASTER_PUBLIC_IP=`hostname -i`
 
 echo =========================================================
 echo "Setup Yarn Master"
 echo =========================================================
 echo "Installing Yarn-master"
 
-# Setup Environment at node
-bash install_yarn_master_slave.sh
+bash install_yarn_master_slave.sh $MASTER_PUBLIC_IP
 
+# Move Config Files and install_yarn_master_slave.sh,
+# then Install Cluster on all Worker Nodes
 echo =========================================================
 echo "Setting up Yarn Slaves"
 echo =========================================================
-
-# Read addresses in slaves file
-cat $current_directory/configs/slaves | while read line
-
-do
-    if [ "$line" = "-" ]; then
-        echo "Skip $line"
-    else
-        # Move master and slaves file to worker nodes
-        scp $current_directory/configs/master root@$line:$current_directory/configs
-        scp $current_directory/configs/slaves root@$line:$current_directory/configs
-        echo "Installing on $line"
-        echo "Installing Yarn-slave"
-        ssh root@$line -n "cd ${current_directory} && sudo bash install_yarn_master_slave.sh"
-        echo "Finished config node $line"
-    fi
+for(( i=1;i<${#host_array[@]};i++)) ; do
+    # ssh root@${host_array[i]} -n "sudo rm -rf ${current_directory} && sudo mkdir -p ${current_directory}"
+    rsync -a --rsync-path="sudo rsync" $current_directory/configs/ ${host_array[i]}:$current_directory/configs/
+    scp $current_directory/install_yarn_master_slave.sh ${host_array[i]}:$current_directory/
+    echo "Installing on "${host_array[i]}
+    ssh root@${host_array[i]} -n "cd ${current_directory} && sudo bash install_yarn_master_slave.sh ${MASTER_PUBLIC_IP}"
+    echo "Finished configuration on "${host_array[i]}
+    echo ""
 done
+
+
+echo "Starting Cluster to Ping all Nodes"
+
+source /etc/profile
+
+$HADOOP_HOME/sbin/start-dfs.sh
+$HADOOP_HOME/sbin/start-yarn.sh
+$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
+$HADOOP_HOME/bin/hdfs dfsadmin -safemode leave
+$SPARK_HOME/sbin/start-history-server.sh
+$SPARK_HOME/sbin/start-all.sh
+jps
+$HADOOP_HOME/bin/hdfs dfsadmin -report
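+
+# Hedged addition (not part of the original smoke test): warn if the report
+# above listed fewer DataNodes than there are entries in the generated slaves
+# file. Each DataNode section of the report begins with a "Name:" line.
+expected_workers=`wc -l < $current_directory/configs/slaves`
+live_workers=`$HADOOP_HOME/bin/hdfs dfsadmin -report | grep -c "^Name:"`
+if [ "$live_workers" -lt "$expected_workers" ]; then
+    echo "WARNING: only ${live_workers}/${expected_workers} DataNodes registered"
+fi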
+
+echo "Stopping Cluster"
+$SPARK_HOME/sbin/stop-history-server.sh
+$SPARK_HOME/sbin/stop-all.sh
+$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh stop historyserver
+$HADOOP_HOME/sbin/stop-dfs.sh
+$HADOOP_HOME/sbin/stop-yarn.sh
\ No newline at end of file
diff --git a/scripts/cluster/yarn_cluster_setup/install_yarn_master_slave.sh b/scripts/cluster/yarn_cluster_setup/install_yarn_master_slave.sh
index ae806b0..29f51fb 100644
--- a/scripts/cluster/yarn_cluster_setup/install_yarn_master_slave.sh
+++ b/scripts/cluster/yarn_cluster_setup/install_yarn_master_slave.sh
@@ -1,5 +1,13 @@
 #!/bin/sh
 
+if [ $# -eq 0 ]
+  then
+    echo "No arguments supplied, installation on node terminated"
+    exit 255
+fi
+
+# Accept Public IP of master as a parameter
+MASTER_PUBLIC_IP=$1
 JAVA_HOME_INFILE=/usr/lib/jvm/default-java/
 HADOOP_DATA=/data/hadoop/
 HADOOP_HOME_INFILE=/usr/local/hadoop/
@@ -7,7 +15,13 @@ HADOOP_SYMLINK=/usr/local/hadoop
 HADOOP_CONFIG_LOCATION=${HADOOP_HOME_INFILE}etc/hadoop/
 HADOOP_VERSION=2.9.2
 HADOOP_WEB_SOURCE=https://www-us.apache.org/dist/hadoop/common/
-GLOBAL_VARIABLES_SOURCE=/etc/environment
+ROOT_VARIABLES_ADDRESS=/etc/profile
+SPARK_HISTORY_DATA=/tmp/spark-events
+
+# These variable values will change as node names change
+MASTER_INTERNAL_NAME=master
+WORKER_INTERNAL_NAME=worker
+current_hostname=`hostname`
 
 # Install Pre-Reqs
 apt-get update -y
@@ -17,6 +31,22 @@ apt-get install -y python default-jdk wget
 unlink ${HADOOP_SYMLINK} && rm -rf ${HADOOP_DATA}
 rm -rf /usr/local/hadoop-*/
 
+# Remove Global Variables
+sed -i /JAVA_HOME/d $ROOT_VARIABLES_ADDRESS && sed -i /default-java/d $ROOT_VARIABLES_ADDRESS
+sed -i /HADOOP_HOME/d $ROOT_VARIABLES_ADDRESS && sed -i /hadoop/d $ROOT_VARIABLES_ADDRESS
+
+# Make Hadoop Global Variables for User and Root
+echo "export JAVA_HOME="$JAVA_HOME_INFILE >> $ROOT_VARIABLES_ADDRESS
+echo "export PATH=$PATH:"$JAVA_HOME_INFILE"bin/:"$JAVA_HOME_INFILE"sbin/" >> $ROOT_VARIABLES_ADDRESS
+echo "export HADOOP_HOME="$HADOOP_HOME_INFILE >> $ROOT_VARIABLES_ADDRESS
+echo "export HADOOP_MAPRED_HOME="$HADOOP_HOME_INFILE >> $ROOT_VARIABLES_ADDRESS
+echo "export HADOOP_COMMON_HOME="$HADOOP_HOME_INFILE >> $ROOT_VARIABLES_ADDRESS
+echo "export HADOOP_HDFS_HOME="$HADOOP_HOME_INFILE >> $ROOT_VARIABLES_ADDRESS
+echo "export YARN_HOME="$HADOOP_HOME_INFILE >> $ROOT_VARIABLES_ADDRESS
+echo "export HADOOP_COMMON_LIB_NATIVE_DIR="$HADOOP_HOME_INFILE"lib/native" >> $ROOT_VARIABLES_ADDRESS
+echo "export PATH=$PATH:"$HADOOP_HOME_INFILE"bin/:"$HADOOP_HOME_INFILE"sbin/" >> $ROOT_VARIABLES_ADDRESS
+source $ROOT_VARIABLES_ADDRESS
+
 # Make Data Directories for Hadoop
 mkdir -p ${HADOOP_DATA}name
 mkdir -p ${HADOOP_DATA}data
@@ -28,34 +58,62 @@ current_directory=`pwd`
 if [ ! -f "${current_directory}/hadoop-$HADOOP_VERSION.tar.gz" ]; then
     echo "Downloading Hadoop ${HADOOP_VERSION} ..."
     sudo wget ${HADOOP_WEB_SOURCE}hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz
-    # wget ${HADOOP_WEB_SOURCE}hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz -P /hadoop-${HADOOP_VERSION}.tar.gz
-    # sudo curl ${HADOOP_WEB_SOURCE}hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz > /hadoop-${HADOOP_VERSION}.tar.gz
    echo "Download of Hadoop ${HADOOP_VERSION} Successful!"
 fi
 
 # Unzip and Install Hadoop Tar
 tar -xzf $current_directory/hadoop-$HADOOP_VERSION.tar.gz -C /usr/local/
-# tar -xzf /hadoop-$HADOOP_VERSION.tar.gz -C /usr/local/
+
+rm $current_directory/hadoop-$HADOOP_VERSION.tar.gz
 
 # Make Symbolic link
 ln -s /usr/local/hadoop-$HADOOP_VERSION/ $HADOOP_SYMLINK
 
-# Copy Config Files
-
+# Copy Hadoop Config Files
 cp -a $current_directory/configs/hadoop/. $HADOOP_CONFIG_LOCATION
 cp $current_directory/configs/master $HADOOP_CONFIG_LOCATION
 cp $current_directory/configs/slaves $HADOOP_CONFIG_LOCATION
 
+# Editing Config Files
+# Making Uniform Changes applicable to all nodes
+sed -i "s/master-public-ip/${MASTER_PUBLIC_IP}/g" "$HADOOP_CONFIG_LOCATION/core-site.xml"
+sed -i "s/master-public-ip/${MASTER_PUBLIC_IP}/g" "$HADOOP_CONFIG_LOCATION/hdfs-site.xml"
+sed -i "s/master-internal-ip/${MASTER_INTERNAL_NAME}/g" "$HADOOP_CONFIG_LOCATION/yarn-site-capacity.xml"
+sed -i "s/master-internal-ip/${MASTER_INTERNAL_NAME}/g" "$HADOOP_CONFIG_LOCATION/yarn-site-fair.xml"
+sed -i "s/master-internal-ip/${MASTER_INTERNAL_NAME}/g" "$HADOOP_CONFIG_LOCATION/yarn-site-regular.xml"
+sed -i "s/master-internal-ip/${MASTER_INTERNAL_NAME}/g" "$HADOOP_CONFIG_LOCATION/yarn-site.xml"
+
+# The following changes differ between the master and the worker nodes
+if [[ $current_hostname == *$MASTER_INTERNAL_NAME* ]]; then
+    echo "Changing namenode IP on master"
+    sed -i "s/master-variable-ip/0.0.0.0/g" "$HADOOP_CONFIG_LOCATION/hdfs-site.xml"
+else
+    echo "Changing namenode IP on worker"
+    sed -i "s/master-variable-ip/${MASTER_PUBLIC_IP}/g" "$HADOOP_CONFIG_LOCATION/hdfs-site.xml"
+fi
+
 echo "Hadoop Installation Complete on this node"
 
 SPARK_HOME_INFILE=`cd ${current_directory}/../../../.. && pwd`
-
 SPARK_CONFIG_LOCATION=$SPARK_HOME_INFILE/conf/
 
+# Remove Spark Global Variables
+sed -i /SPARK_HOME/d $ROOT_VARIABLES_ADDRESS && sed -i /spark/d $ROOT_VARIABLES_ADDRESS
+
+# Make Spark Global Variables for User and Root
+echo "export SPARK_HOME="$SPARK_HOME_INFILE >> $ROOT_VARIABLES_ADDRESS
+echo "export PATH=$PATH:"$SPARK_HOME_INFILE"/bin/" >> $ROOT_VARIABLES_ADDRESS
+source $ROOT_VARIABLES_ADDRESS
+
+# Make Spark Directory for History Recording
+sudo rm -rf $SPARK_HISTORY_DATA
+mkdir -p $SPARK_HISTORY_DATA
+
+# Copy Spark Config Files
 cp -a $current_directory/configs/spark/. $SPARK_CONFIG_LOCATION
-cp -a $current_directory/configs/hadoop/. $SPARK_CONFIG_LOCATION
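+# Descriptive note (added for clarity): copy the Hadoop configs as edited by
+# the sed commands above, rather than the pristine templates, so Spark sees
+# the same cluster addresses that Hadoop actually uses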
+cp -a $HADOOP_CONFIG_LOCATION. $SPARK_CONFIG_LOCATION
 cp $current_directory/configs/master $SPARK_CONFIG_LOCATION
 cp $current_directory/configs/slaves $SPARK_CONFIG_LOCATION
 
 # Format Namenode
-/usr/local/hadoop/bin/hdfs namenode -format
\ No newline at end of file
+$HADOOP_HOME_INFILE/bin/hdfs namenode -format
\ No newline at end of file
diff --git a/scripts/cluster/yarn_cluster_setup/start_yarn_cluster.sh b/scripts/cluster/yarn_cluster_setup/start_yarn_cluster.sh
index 3e1c531..fde2de5 100644
--- a/scripts/cluster/yarn_cluster_setup/start_yarn_cluster.sh
+++ b/scripts/cluster/yarn_cluster_setup/start_yarn_cluster.sh
@@ -1,21 +1,48 @@
 #!/bin/bash
 
-echo "STARTING HADOOP SERVICES"
-/usr/local/hadoop/sbin/start-dfs.sh
+# Master, Client Name depends on cluster config
+# If cluster config changes, the variable values should change
+client_name=client
+master_name=master
+MASTER_HOSTNAME=`ssh root@$master_name "hostname -i"`
+current_hostname=`hostname`
 
-/usr/local/hadoop/sbin/start-yarn.sh
+if [[ $current_hostname == *$client_name* ]]; then
+    echo "Commands running from correct node"
+    ssh root@$MASTER_HOSTNAME '
+    source /etc/profile
 
-/usr/local/hadoop/sbin/mr-jobhistory-daemon.sh start historyserver
+    echo "STARTING HADOOP SERVICES"
 
-/usr/local/hadoop/bin/hdfs dfsadmin -safemode leave
+    $HADOOP_HOME/sbin/start-dfs.sh
 
-echo "STARTING SPARK SERVICES"
-/spark-3.0.0-SNAPSHOT-bin-SparkFHE/sbin/start-all.sh
+    $HADOOP_HOME/sbin/start-yarn.sh
 
-echo "RUN jps - Java Virtual Machine Process Status Tool"
-jps
+    $HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
 
-echo "Get basic filesystem information and statistics."
-/usr/local/hadoop/bin/hdfs dfsadmin -report
+    $HADOOP_HOME/bin/hdfs dfsadmin -safemode leave
 
-echo "Yarn Cluster is Active"
\ No newline at end of file
+    echo "STARTING SPARK SERVICES"
+    $SPARK_HOME/sbin/start-history-server.sh
+    $SPARK_HOME/sbin/start-all.sh
+
+    echo "RUN jps - Java Virtual Machine Process Status Tool"
+    jps
+
+    echo "Get basic filesystem information and statistics."
+    $HADOOP_HOME/bin/hdfs dfsadmin -report
+
+    echo "Yarn Cluster is Active"
+
+    echo "Follow the instructions for Web Interfaces specified in the Readme page"
+
+    master_node_ip_address_internal=`hostname -I | sed "'"s/\s.*$//"'"`
+
+    echo "YARN Interface Available At: "$master_node_ip_address_internal":8088/"
+    echo "Spark Interface Available At: "$master_node_ip_address_internal":8080/"
+    echo "NameNode Interface Available At: "$master_node_ip_address_internal":50070/"
+    echo "Job Master Interface Available At: "$master_node_ip_address_internal":19888/"
+    '
+else
+    echo "This code can run ONLY on Client Node"
+fi
diff --git a/scripts/cluster/yarn_cluster_setup/stop_yarn_cluster.sh b/scripts/cluster/yarn_cluster_setup/stop_yarn_cluster.sh
index d1cad90..921ef4c 100644
--- a/scripts/cluster/yarn_cluster_setup/stop_yarn_cluster.sh
+++ b/scripts/cluster/yarn_cluster_setup/stop_yarn_cluster.sh
@@ -1,15 +1,31 @@
 #!/bin/bash
 
-echo -e "STOPPING SPARK SERVICES"
+# Master, Client Name depends on cluster config
+# If cluster config changes, the variable values should change
+client_name=client
+master_name=master
+MASTER_HOSTNAME=`ssh root@$master_name "hostname -i"`
+current_hostname=`hostname`
 
-/spark-3.0.0-SNAPSHOT-bin-SparkFHE/sbin/stop-all.sh
+if [[ $current_hostname == *$client_name* ]]; then
+    echo "Commands running from correct node"
+    ssh root@$MASTER_HOSTNAME '
+    source /etc/profile
 
-echo -e "STOPPING HADOOP SERVICES"
+    echo -e "STOPPING SPARK SERVICES"
+    $SPARK_HOME/sbin/stop-history-server.sh
+    $SPARK_HOME/sbin/stop-all.sh
 
-/usr/local/hadoop/sbin/mr-jobhistory-daemon.sh stop historyserver
+    echo -e "STOPPING HADOOP SERVICES"
 
-/usr/local/hadoop/sbin/stop-dfs.sh
+    $HADOOP_HOME/sbin/mr-jobhistory-daemon.sh stop historyserver
 
-/usr/local/hadoop/sbin/stop-yarn.sh
+    $HADOOP_HOME/sbin/stop-dfs.sh
 
-echo "Hadoop Cluster is Inactive Now"
\ No newline at end of file
+    $HADOOP_HOME/sbin/stop-yarn.sh
+
+    echo "Hadoop Cluster is Inactive Now"
+    '
+else
+    echo "This code can run ONLY on Client Node"
+fi
diff --git a/scripts/cluster/yarn_cluster_setup/test_scripts/run_spark_test_job_pi.sh b/scripts/cluster/yarn_cluster_setup/test_scripts/run_spark_test_job_pi.sh
deleted file mode 100644
index 1dcd31d..0000000
--- a/scripts/cluster/yarn_cluster_setup/test_scripts/run_spark_test_job_pi.sh
+++ /dev/null
@@ -1,12 +0,0 @@
-#!/bin/bash
-
-echo "SPARK TEST"
-/spark-3.0.0-SNAPSHOT-bin-SparkFHE/bin/spark-submit --class org.apache.spark.examples.SparkPi \
-    --master yarn \
-    --deploy-mode cluster \
-    --num-executors 1 \
-    --driver-memory 1g \
-    --executor-memory 512m \
-    --executor-cores 1 \
-    /spark-3.0.0-SNAPSHOT-bin-SparkFHE/examples/jars/spark-examples*.jar \
-    10
\ No newline at end of file
diff --git a/scripts/cluster/yarn_cluster_setup/test_scripts/run_spark_test_job_pi_remotely.sh b/scripts/cluster/yarn_cluster_setup/test_scripts/run_spark_test_job_pi_remotely.sh
new file mode 100644
index 0000000..33eeb24
--- /dev/null
+++ b/scripts/cluster/yarn_cluster_setup/test_scripts/run_spark_test_job_pi_remotely.sh
@@ -0,0 +1,28 @@
+#!/bin/bash
+
+# Master, Client Name depends on cluster config
+# If cluster config changes, the variable values should change
+client_name=client
+master_name=master
+MASTER_HOSTNAME=`ssh root@$master_name "hostname -i"`
+current_hostname=`hostname`
+
+# Use the configurable $client_name here (the original compared against the literal "client")
+if [[ $current_hostname == *$client_name* ]]; then
+    echo "Commands running from correct node"
+    ssh root@$MASTER_HOSTNAME '
+    source /etc/profile
+
+    echo "SPARK TEST"
+    $SPARK_HOME/bin/spark-submit --class org.apache.spark.examples.SparkPi \
+        --master yarn \
+        --deploy-mode cluster \
+        --num-executors 1 \
+        --driver-memory 1g \
+        --executor-memory 512m \
+        --executor-cores 1 \
+        $SPARK_HOME/examples/jars/spark-examples*.jar \
+        10
+    '
+else
+    echo "This code can run ONLY on Client Node"
+fi
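+
+# Hedged usage note (not part of the original script): in yarn cluster mode,
+# spark-submit's report ends with "final status: SUCCEEDED" on success, so a
+# caller could verify the run in automation with, e.g.:
+#   sudo bash run_spark_test_job_pi_remotely.sh 2>&1 | grep "final status: SUCCEEDED"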