This tutorial has been written by me and Megasthenis Asteris.
In this tutorial we will explain how to set up Hadoop and configure it to run HDFS-RAID. This GitHub repo is forked from Facebook's version, with a few changes to get it to build. We are also in the process of adding a new type of erasure code to the RAID system, which is being done in the regeneratingcode branch. It is still under development, so please consider using the master branch instead (which has the Reed-Solomon and XOR encoding schemes available).
Alternatively, you can clone Facebook's repo at https://github.com/facebook/hadoop-20, apply the two patches in https://gist.github.com/1365008 so that it builds, and then follow the tutorial below to make the modifications needed to get it running.
Note that this write-up is a work in progress. In case you encounter any problem, PLEASE DO NOT hesitate to send me a message through GitHub, or mail me at msathiam at usc.edu so I can fix it!
The following guide is heavily based on Michael Noll's tutorial for Running Hadoop on Ubuntu Linux, which can be found here.
Note (a): You will need root privileges to install hadoop.
Note (b): Here you may find a script that automates the following installation process in case you wish to avoid the step by step procedure.
Java 1.6.x is recommended for running Hadoop. To install Sun's Java, do the following:
- Add the Canonical Partner Repository to your apt repositories. We do that by adding the following line:
deb http://archive.canonical.com/ lucid partner
in "/etc/apt/sources.list". Replace 'lucid' with your distribution's codename. This can be found by typinglsb_release -c | cut -f2
in a bash terminal. - Then update the source list:
sudo apt-get update
- Install sun-java6-jdk:
sudo apt-get install sun-java6-jdk
- Install sun-java6-plugin:
sudo apt-get install sun-java6-plugin
- Select Sun’s Java as the default on your machine:
sudo update-java-alternatives -s java-6-sun
The full JDK will be placed in "/usr/lib/jvm/java-6-sun". After installation, check whether Sun's JDK is installed correctly by typing the command
java -version
The output should look like the following:
java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
We will use a dedicated Hadoop user account for running Hadoop. While that’s not required it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine. In this tutorial we use hduser as the dedicated hadoop installation user and hdgroup as the corresponding group. You may use your preferred user-name and group-name.
- First create the group and the user:
addgroup hdgroup
adduser hduser --gecos hduser
(If you are building a multi-node cluster, where the user will only be accessed through ssh, consider the --disabled-password option. In that case, to make the password-less account accessible via ssh, you must add the corresponding public ssh keys to hduser's "~/.ssh/authorized_keys" file.)
- Then add the user to the 'hdgroup' group as well as to the 'admin' group:
usermod -a -G admin hduser
usermod -a -G hdgroup hduser
- Finally, add the following line in "/etc/sudoers":
hduser ALL=(ALL) NOPASSWD:ALL
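As an optional sanity check of the steps above, you can verify the group memberships and the sudo rule (both commands only print information):
groups hduser
sudo -l -U hduser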
Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it (which is what we want to do in this short tutorial). For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created in the previous section.
I assume that you have SSH up and running on your machine and configured it to allow SSH public key authentication. If not, there are several guides available.
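If SSH is not installed or running yet, on Ubuntu something along the following lines should be enough (I assume the stock openssh-server package here):
sudo apt-get install openssh-server
pgrep sshd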
WARNING: If you are using an existing user as the hadoop-installation user (instead of creating a new one), be careful before proceeding; the following steps will generate a new public key for your user. If you have used your public key to set up connections to other machines, that key will be lost.
For this reason, you may wish to skip this step and keep your existing public key. In that case, go directly to the 3rd command, which adds your user's existing public key to the authorized keys. However:
- Make sure that your public key file actually contains a key (a string of characters).
- Note that the RSA key-pair used must have been created without a password (which is usually the case). If it was created with a password, you will probably have to create a new key-pair anyway.
- If file "/home/hduser/.ssh/id_rsa" exists, remove it:
sudo -u hduser rm -f "/home/hduser/.ssh/id_rsa"
- Create a new, password-less key-pair:
sudo -u hduser ssh-keygen -q -t rsa -P "" -f "/home/hduser/.ssh/id_rsa"
This command creates an RSA key pair with an empty password. Generally, using an empty password is not recommended, but in this case it is needed so that the key can be used without your interaction (you don't want to enter a passphrase every time Hadoop interacts with its nodes).
- Copy the public key into hduser's authorized keys, so that hduser can ssh to localhost:
sudo -u hduser cat /home/hduser/.ssh/id_rsa.pub >> /home/hduser/.ssh/authorized_keys
- Set hduser as the owner of his own ".ssh" directory and "authorized_keys" file:
chown hduser:hdgroup /home/hduser/.ssh
chown hduser:hdgroup /home/hduser/.ssh/authorized_keys
- The final step is to test the SSH setup by connecting to your local machine as the hduser user. This step is also needed to save your local machine's host key fingerprint to hduser's known_hosts file. You may do that by typing:
sudo -u hduser ssh -o StrictHostKeyChecking=no localhost ' '
If you have any special SSH configuration for your local machine like a non-standard SSH port, you can define host-specific SSH options in $HOME/.ssh/config (see man ssh_config for more information).
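For example, if your SSH daemon listens on a non-standard port (2222 below is just an illustrative value, not something this guide requires), an entry in /home/hduser/.ssh/config might look like:
Host localhost
  Port 2222
  User hduser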
The first thing to do is make sure that the 'git-core' and 'ant' tools are installed on the system. These are required to download the source code from GitHub and build it. In addition, we suggest creating a temporary directory in which to download and build the Hadoop source code. Later, the built code will be transferred to a permanent installation directory.
- To obtain the necessary tools:
sudo apt-get -y --allow-unauthenticated --force-yes install git-core
sudo apt-get -y --allow-unauthenticated --force-yes install ant
- Create a temporary build directory, here "/tmp/hadoop_build":
mkdir -p /tmp/hadoop_build
- Enter the build directory:
cd /tmp/hadoop_build
- Download a fresh copy of our hadoop source from github into the current directory:
git clone https://github.com/madiator/hadoop-20.git .
(Note the dot at the end, which denotes the current directory.)
- Then list the available branches by typing:
git branch -a
- If you are interested in getting the latest 'regeneratingcode' branch do:
git checkout -b regeneratingcode origin/regeneratingcode
We should still be in the temporary build directory, "/tmp/hadoop_build". The directory contains a "build.properties" file. Type
cat ./build.properties
to see its content:
resolvers=internal
#you can increment this number as you see fit
version=0.20-fb-mahesh
project.version=${version}
hadoop.version=${version}
hadoop-core.version=${version}
hadoop-hdfs.version=${version}
hadoop-mapred.version=${version}
You may change the third line, "version=...", to some other value, e.g. "version=0.20-update-01", provided it is not too generic.
- Note that I have placed a symbolic link to "build.properties" inside "src/contrib/raid". You may want to do the same for other contribs as you desire. The main goal of this tutorial is to run RAID, so for the raid contrib I had done:
cd src/contrib/raid
ln -s ../../../build.properties build.properties
cd -
This link should be there already in the cloned copy, so you do not have to do it. Now you are ready to build.
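If you want to double-check that the link is in place before building (an optional step), list it:
ls -l src/contrib/raid/build.properties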
To build the code, being in "/tmp/hadoop_build", type:
ant
If the build is successful, the output will end like the following:
...
BUILD SUCCESSFUL
Total time: 2 minutes 43 seconds
Also build the raid contrib:
cd ./src/contrib/raid
ant package -Ddist.dir=/tmp/hadoop_build/build
The output should be similar to the previous one. Both the above builds should be successful.
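To confirm that the raid jar was actually produced (the exact jar name depends on the 'version' string you set in build.properties), you can list the build output:
ls /tmp/hadoop_build/build/contrib/raid/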
You will have to edit a few configuration files as shown below. Many of these settings should already be in place in the cloned version, but you may want to adjust them to parameters that suit you. The files are located under the directory in which Hadoop was built (/tmp/hadoop_build).
In "conf/hadoop-env.sh", change the line
export HADOOP_USERNAME="hadoop"
to
export HADOOP_USERNAME="hduser"
to reflect the username of the dedicated hadoop installation user for added security (or comment it out).
Further, set
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_CLASSPATH=${HADOOP_HOME}/build/contrib/raid/hadoop-0.20-fb-mahesh-raid.jar
Note:
- The 'HADOOP_HOME' variable appearing in the last line denotes the directory where Hadoop will be permanently installed. The variable will be set in the "/home/hduser/.bashrc" file later in this guide.
- The '0.20-fb-mahesh' part corresponds to the 'version' variable in the "build.properties" file edited above. If you do not set HADOOP_CLASSPATH this way, then every time you compile raid the jar file must be copied to the "libs/" directory.
In "conf/hdfs-site.xml", make sure that the value of 'dfs.permissions' is set to 'false', as shown below:
<property>
<name>dfs.permissions</name>
<value>false</value>
<description>Check for superuser privileges?</description>
</property>
In "conf/core-site.xml", we determine the directory where Hadoop's temporary files will be stored. For this tutorial, we use "/app/hadoop/tmp-hduser", but you may change it to some directory you like. Make sure that the hadoop dedicated user has write permissions to the directory:
mkdir -p /app/hadoop/tmp-hduser
chown -R hduser:hdgroup /app/hadoop/tmp-hduser
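To confirm that hduser can actually write there (an optional check; the file name is arbitrary):
sudo -u hduser touch /app/hadoop/tmp-hduser/write-test
sudo -u hduser rm /app/hadoop/tmp-hduser/write-test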
Then edit conf/core-site.xml, setting the value of 'hadoop.tmp.dir' to that directory, as shown below:
<property>
<name>hadoop.tmp.dir</name>
<value>/app/hadoop/tmp-hduser</value>
<description>A base for other temporary directories.</description>
</property>
To finalize the installation, we copy the built hadoop to its permanent installation directory. Here we use "/usr/local/hadoop":
mkdir -p /usr/local/hadoop
cp -rf /tmp/hadoop_build/* /usr/local/hadoop
chown -R hduser:hdgroup /usr/local/hadoop
The final step is to edit hduser's .bashrc profile file, so that the system knows where Hadoop is installed and which Java it should use.
We add the following lines in "/home/hduser/.bashrc":
export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export PATH=$PATH:$HADOOP_HOME/bin
The variables will be loaded in a new bash shell when you log in as hduser.
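To load the new variables and confirm that the installation is visible to the dedicated user, you can open a fresh shell as hduser; the version printed should match the 'version' string set in build.properties:
sudo su - hduser
echo $HADOOP_HOME
hadoop version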
At this point Hadoop is ready to run; you may go to the next step to see how. However, it does not support the RAID function yet. To enable RAID, we need to do the following:
Create a file raid.xml with the following content:
<configuration>
<srcPath prefix="*">
<policy name = "uscpolicy">
<erasureCode>RS</erasureCode>
<property>
<name>srcReplication</name>
<value>3</value>
<description>
Pick files for RAID only if their replication factor is greater than or equal to this value.
</description>
</property>
<property>
<name>targetReplication</name>
<value>1</value>
<description>
After RAIDing, decrease the replication factor of a file to this value.
</description>
</property>
<property>
<name>metaReplication</name>
<value>1</value>
<description>
The replication factor of the RAID meta file
</description>
</property>
<property>
<name>modTimePeriod</name>
<value>3</value>
<description>
Time (in milliseconds) after a file is modified to make it a candidate for RAIDing
</description>
</property>
</policy>
</srcPath>
</configuration>
Place raid.xml into the directory "/app/hadoop/conf" (or another directory of your choice). Make sure that hduser is the owner of the directory and of the raid.xml file.
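For example, assuming raid.xml was created in the current directory and you keep the paths used in this guide:
sudo mkdir -p /app/hadoop/conf
sudo cp raid.xml /app/hadoop/conf/raid.xml
sudo chown -R hduser:hdgroup /app/hadoop/conf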
Finally, make sure that the value of the property named "raid.config.file" in "conf/hdfs-site.xml" points to the raid.xml file:
<property>
<name>raid.config.file</name>
<value>/app/hadoop/conf/raid.xml</value>
<description>This is needed by the RaidNode </description>
</property>
Also add the following elements to the same file (conf/hdfs-site.xml):
<property>
<name>fs.hdfs.impl</name>
<value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
<description>The FileSystem for hdfs: uris.</description>
</property>
<property>
<name>raid.classname</name>
<value>org.apache.hadoop.raid.DistRaidNode</value>
<description>Specify which implementation of RaidNode to use (class name).</description>
</property>
<property>
<name>raid.policy.rescan.interval</name>
<value>1000</value>
<description>Specify the periodicity in milliseconds after which all source paths are rescanned and parity blocks recomputed if necessary. By default, this value is 1 hour.</description>
</property>
<property>
<name>hdfs.raid.stripeLength</name>
<value>10</value>
<description>The stripe length for the code. stripeLength number of blocks are used for coding for RS coding</description>
</property>
<property>
<name>hdfs.raidrs.rsparitylength</name>
<value>4</value>
<description>The number of parity blocks generated from stripeLength number of blocks</description>
</property>
<property>
<name>hdfs.raidrs.srcparitylength</name>
<value>2</value>
<description>The number of SRC blocks. If its zero we will use pure RS</description>
</property>
<property>
<name>raid.blockfix.classname</name>
<value>org.apache.hadoop.raid.LocalBlockIntegrityMonitor</value>
<description>Specify the BlockFixer implementation to use. The default is org.apache.hadoop.raid.DistBlockFixer.</description>
</property>
<property>
<name>raid.blockfix.interval</name>
<value>1000</value>
<description>interval in milliseconds between checks for lost files. Default is 1 minute</description>
</property>
If everything went well, you should be able to run Hadoop (as the hduser user).
First, format the namenode:
hadoop namenode -format
Then run the start-all.sh script to start the first nodes (processes). The script is in "${HADOOP_HOME}/bin", which has been added to the PATH, so it can be invoked by typing in the bash terminal:
start-all.sh
View the running processes (nodes) by typing:
jps
The output should look like the following:
21610 Jps
21316 SecondaryNameNode
21525 TaskTracker
21377 JobTracker
21168 DataNode
If you want to add extra datanodes, you may do so with the ${HADOOP_HOME}/bin/extra-local-datanodes.sh script. First, you have to edit the script to set dfs.data.dir, by modifying the corresponding line to:
-D dfs.data.dir=/app/hadoop/tmp-hduser/dfs/data$DN"
where "/app/hadoop/tmp-hduser" is the temporary directory we selected previously in this guide. Save the changes and execute the script. For example, to get 5 data nodes running, type:
extra-local-datanodes.sh start 1 2 3 4 5
Typing jps once again, the output should now look like the following:
21867 DataNode
21316 SecondaryNameNode
21525 TaskTracker
21377 JobTracker
21168 DataNode
21815 DataNode
21763 DataNode
21659 DataNode
21711 DataNode
21897 Jps
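As an optional smoke test that HDFS is actually serving reads and writes (run these as hduser; the file names are arbitrary examples):
hadoop dfs -copyFromLocal /etc/hosts /hosts-test
hadoop dfs -ls /
hadoop dfs -cat /hosts-test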
We are done! Hopefully everything went well. You may want to stop the processes by running:
stop-all.sh
extra-local-datanodes.sh stop 1 2 3 4 5
First create a new project using File -> New -> Java Project. Give a project name, say Hadoop, and select 'Create project from existing source'.
Then, if it asks you about deleting the Hadoop/bin folder, click No, and on the next page specify a different output folder, such as Hadoop/build/eclipse-classes.
Set up the ANT_HOME variable by going to Preferences -> Java -> Build Path -> Classpath Variables. Set a new variable called ANT_HOME to /usr/share/ant after verifying that the path exists.
You can now build by browsing to build.xml and right-clicking Run As.. -> Ant Build.., where you can select which targets to build. If you run into trouble, head to the Cloudera Eclipse tutorial.
- HDFS-RAID: http://wiki.apache.org/hadoop/HDFS-RAID (there are a few changes in Facebook's version; for example BlockFixer has been refactored to BlockIntegrityMonitor etc.)
- Michael Noll's tutorial on how to set up Hadoop: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
- Michael Noll's tutorial on building an Hadoop 0.20.x version: http://www.michael-noll.com/blog/2011/04/14/building-an-hadoop-0-20-x-version-for-hbase-0-90-2/
- All other questions: http://www.google.com !!