
This tutorial has been written by me and Megasthenis Asteris.

Introduction

In this tutorial we will explain how to set up Hadoop and configure it to run HDFS-RAID. This GitHub repo is forked from Facebook's version, with a few changes we made so that it builds. We are also in the process of adding a new type of erasure code to the RAID system, in the regeneratingcode branch. Since that branch is still under development, please consider using the master branch instead (which has the Reed-Solomon and XOR encoding schemes available).

You could also clone Facebook's repo available at https://github.com/facebook/hadoop-20, apply the two patches in https://gist.github.com/1365008 so that it builds, and then follow the tutorial below to make the necessary modifications to get it running.
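
For that route, the sequence could look roughly like the following sketch (the patch file names are placeholders; download the actual files from the gist first, and use patch -p1 instead of git apply if the files are in plain diff format):

    git clone https://github.com/facebook/hadoop-20.git
    cd hadoop-20
    # apply the two patches downloaded from the gist (names below are illustrative)
    git apply /path/to/first.patch
    git apply /path/to/second.patch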

Note that this write-up is a work in progress. In case you encounter any problem, please do not hesitate to send me a message through GitHub, or mail me at msathiam at usc.edu so I can fix it!

Acknowledgement

The following guide is heavily based on Michael Noll's tutorial for Running Hadoop on Ubuntu Linux which can be found here.

Prerequisites

Note (a): You will need root privileges to install hadoop.

Note (b): Here you may find a script that automates the following installation process in case you wish to avoid the step by step procedure.

Sun Java 6

Java 1.6.x is recommended for running Hadoop. To install sun's java, we have to do the following:

  1. Add the Canonical Partner Repository to your apt repositories. We do that by adding the following line:
    deb http://archive.canonical.com/ lucid partner
    in "/etc/apt/sources.list". Replace 'lucid' with your distribution's codename. This can be found by typing lsb_release -c | cut -f2 in a bash terminal.
  2. Then update the source list:
    sudo apt-get update
  3. Install sun-java6-jdk:
    sudo apt-get install sun-java6-jdk
  4. Install sun-java6-plugin:
    sudo apt-get install sun-java6-plugin
  5. Select Sun’s Java as the default on your machine:
    sudo update-java-alternatives -s java-6-sun

The full JDK will be placed in "/usr/lib/jvm/java-6-sun". After installation, check whether Sun's JDK is installed correctly by typing the command java -version. The output should look like the following:

java version "1.6.0_26"
Java(TM) SE Runtime Environment (build 1.6.0_26-b03)
Java HotSpot(TM) 64-Bit Server VM (build 20.1-b02, mixed mode)
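
If the reported version is not Sun's JDK, you can list the registered alternatives and switch again (a quick check, assuming update-java-alternatives is available, which it normally is once the JDK packages above are installed):

    update-java-alternatives -l
    sudo update-java-alternatives -s java-6-sun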

Adding a dedicated Hadoop system user

We will use a dedicated Hadoop user account for running Hadoop. While that is not required, it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine. In this tutorial we use hduser as the dedicated hadoop installation user and hdgroup as the corresponding group. You may use your preferred user name and group name.

  1. First create the group and the user:
    addgroup hdgroup
    adduser hduser --gecos hduser
    (If you are building a multi-node cluster, where the user will only be accessed through ssh, consider the --disabled-password option. In that case, to make the password-less account reachable via ssh, add the corresponding public ssh keys to hduser's "~/.ssh/authorized_keys" file.)
  2. Then add the user in the group 'hdgroup' as well as in the 'admin' group:
    usermod -a -G admin hduser
    usermod -a -G hdgroup hduser
  3. Finally, add the following line in "/etc/sudoers":

hduser ALL=(ALL) NOPASSWD:ALL
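
If you prefer to append that line from the command line instead of opening the file in an editor, a sketch like the following should do (visudo is still the safer way to edit sudoers; the syntax check at the end is optional):

    echo 'hduser ALL=(ALL) NOPASSWD:ALL' | sudo tee -a /etc/sudoers
    sudo visudo -c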

Configuring SSH

Hadoop requires SSH access to manage its nodes, i.e. remote machines plus your local machine if you want to use Hadoop on it (which is what we want to do in this short tutorial). For our single-node setup of Hadoop, we therefore need to configure SSH access to localhost for the hduser user we created in the previous section.

I assume that you have SSH up and running on your machine and configured it to allow SSH public key authentication. If not, there are several guides available.


WARNING: If you are using an existing user as the hadoop-installation user (instead of creating a new one), be careful before proceeding; the following steps will generate a new public key for that user. If you have used your existing public key to set up connections to other machines, that key will be lost.

For this reason, you may wish to skip this step and use your existing public key. In that case, go directly to the 3rd command, which adds your user's existing public key to his authorized keys.
But,

  • Make sure that your public key file actually contains a key (a string of characters).
  • Note that the RSA key-pair used must have been created without a passphrase (which is usually the case). If it was created with a passphrase, you will most likely have to create a new key-pair anyway.

  1. If file "/home/hduser/.ssh/id_rsa" exists, remove it:
    sudo -u hduser rm -f "/home/hduser/.ssh/id_rsa"
  2. Create a new, password-less key-pair:
    sudo -u hduser ssh-keygen -q -t rsa -P "" -f "/home/hduser/.ssh/id_rsa"
    This command creates an RSA key pair with an empty password. Generally, using an empty password is not recommended, but in this case it is needed so the key can be unlocked without your interaction (you don't want to enter the passphrase every time Hadoop interacts with its nodes).
  3. Copy the public key, into hduser's authorized keys, so that hduser can ssh local host:
    sudo -u hduser cat /home/hduser/.ssh/id_rsa.pub >> /home/hduser/.ssh/authorized_keys
  4. Set hduser as the owner of his own ".ssh" directory and "authorized_keys" file:
    chown hduser:hdgroup /home/hduser/.ssh
    chown hduser:hdgroup /home/hduser/.ssh/authorized_keys
  5. The final step is to test the SSH setup by connecting to your local machine with the hduser user. The step is also needed to save your local machine’s host key fingerprint to the hduser user’s known_hosts file. You may do that by typing:
    sudo -u hduser ssh -o StrictHostKeyChecking=no localhost ' '
    If you have any special SSH configuration for your local machine like a non-standard SSH port, you can define host-specific SSH options in $HOME/.ssh/config (see man ssh_config for more information).
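
For example, a minimal host-specific entry for a non-standard port could look like this (the port number below is only an illustration):

    # /home/hduser/.ssh/config
    Host localhost
        Port 2222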

Setting up for the build

The first thing to do is make sure that the 'git-core' and 'ant' tools are installed on the system. They are required to download the source code from GitHub and build it. In addition, we suggest creating a temporary directory in which to download and build the hadoop source code. Later, the built code will be transferred to a permanent installation directory.

  1. To obtain the necessary tools:
    sudo apt-get -y --allow-unauthenticated --force-yes install git-core
    sudo apt-get -y --allow-unauthenticated --force-yes install ant
  2. Create a temporary build directory, here "/tmp/hadoop_build":
    mkdir -p /tmp/hadoop_build
  3. Enter the build directory:
    cd /tmp/hadoop_build
  4. Download a fresh copy of our hadoop source from github into the current directory:
    git clone https://github.com/madiator/hadoop-20.git .
    (Watch the dot at the end which denotes the current directory.)
  5. Then type:
    git branch -a
  6. If you are interested in getting the latest 'regeneratingcode' branch do:
    git checkout -b regeneratingcode origin/regeneratingcode

Building the code

We should still be in the temporary build directory, "/tmp/hadoop_build". The directory contains a "build.properties" file. Typing:

 cat ./build.properties  

you can see its content:

resolvers=internal
#you can increment this number as you see fit
version=0.20-fb-mahesh
project.version=${version}
hadoop.version=${version}
hadoop-core.version=${version}
hadoop-hdfs.version=${version}
hadoop-mapred.version=${version}

You may change the third line, "version=...", to some other value, e.g. "version=0.20-update-01", provided it is not too generic.

  • Note that I have placed a symbolic link inside src/contrib/raid. You may want to do the same for other contribs as desired. The main goal of this tutorial is to run RAID, so for the raid contrib I had done:

      cd src/contrib/raid
      ln -s ../../../build.properties build.properties
      cd -
    

This link should already be there in the cloned copy, so you do not have to create it yourself. Now you are ready to build.

To build the code, being in "/tmp/hadoop_build", type:

 ant  

If the build is successful, the output will end like this:

...
BUILD SUCCESSFUL
Total time: 2 minutes 43 seconds

Also build the raid contrib:

cd ./src/contrib/raid
ant package -Ddist.dir=/tmp/hadoop_build/build

The output should be similar to the previous one. Both the above builds should be successful.
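
As a quick sanity check, the raid contrib jar should now exist in the build tree (the exact file name depends on the 'version' set in build.properties; the one below corresponds to the default '0.20-fb-mahesh'):

    ls /tmp/hadoop_build/build/contrib/raid/
    # expect to see hadoop-0.20-fb-mahesh-raid.jar among the outputs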

Setting up to run Hadoop

You will have to edit a few configuration files as shown below. Most of them should already be present in the cloned version, but you may want to adjust their parameters to suit your setup. The files are located under the directory in which hadoop was built (/tmp/hadoop_build).

conf/hadoop-env.sh

Change line

export HADOOP_USERNAME="hadoop"

to

export HADOOP_USERNAME="hduser"

to reflect the username of the dedicated hadoop installation user for added security (or comment it out).

Further, set

export JAVA_HOME=/usr/lib/jvm/java-6-sun
export HADOOP_CLASSPATH=${HADOOP_HOME}/build/contrib/raid/hadoop-0.20-fb-mahesh-raid.jar

Note:

  1. The 'HADOOP_HOME' variable appearing in the last line denotes the directory where hadoop will be permanently installed. This variable will be set in the "/home/hduser/.bashrc" file later in this guide.
  2. The '0.20-fb-mahesh' part corresponds to the 'version' variable in the "build.properties" file edited above. If you do not set HADOOP_CLASSPATH this way, then every time you recompile raid you will have to copy the jar file into the "libs/" directory by hand.

conf/hdfs-site.xml

Make sure that the value of 'dfs.permissions' is set to 'false' as shown below:

<property>  
    <name>dfs.permissions</name>  
    <value>false</value>  
    <description>Check for superuser privileges?</description>  
</property>  

conf/core-site.xml

In this file, we determine the directory where Hadoop's temporary files will be stored. For this tutorial, we use "/app/hadoop/tmp-hduser", but you may change it to some directory you like. Make sure that the hadoop dedicated user has write permissions to the directory:

 mkdir -p /app/hadoop/tmp-hduser   
 chown -R hduser:hdgroup /app/hadoop/tmp-hduser

Then, edit the content of conf/core-site.xml by setting the value of 'hadoop.tmp.dir' to that directory as shown below:

<property>  
    <name>hadoop.tmp.dir</name>  
    <value>/app/hadoop/tmp-hduser</value>  
    <description>A base for other temporary directories.</description>  
</property>  

To finalize the installation, we copy the built hadoop to its permanent installation directory. Here we use "/usr/local/hadoop":

mkdir -p /usr/local/hadoop  
cp -rf /tmp/hadoop_build/* /usr/local/hadoop  
chown -R hduser:hdgroup /usr/local/hadoop  

The final step is editing hduser's .bashrc profile file, so that the system knows where hadoop is installed and which version of java it should use.
We add the following lines to "/home/hduser/.bashrc":

export HADOOP_HOME=/usr/local/hadoop
export JAVA_HOME=/usr/lib/jvm/java-6-sun
export PATH=$PATH:$HADOOP_HOME/bin

The variables will be loaded in a new bash, when you login as hduser.
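
To check that the variables are picked up, you can log in as hduser and verify (hadoop version simply prints the build information):

    su - hduser
    echo $HADOOP_HOME    # should print /usr/local/hadoop
    hadoop version       # should report the version built above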

Setting Up Raid

At this point Hadoop is ready to run; you may skip to the next step to see how. However, it does not support the raid function yet. To support raid, we need to do the following:

Create a file raid.xml with the following content

<configuration>  
        <srcPath prefix="*">  
                <policy name = "uscpolicy">  
                        <erasureCode>RS</erasureCode>  
                        <property>  
                            <name>srcReplication</name>  
                                <value>3</value>  
                                <description>  
                                        Pick files for RAID only if their replication factor is greater than or equal to this value.  
                                </description>  
                        </property>  
                        <property>  
                                <name>targetReplication</name>  
                                <value>1</value>  
                                <description>  
                                        After RAIDing, decrease the replication factor of a file to this value.  
                                </description>  
                        </property>  
                        <property>  
                                <name>metaReplication</name>  
                                <value>1</value>  
                                <description>  
                                        The replication factor of the RAID meta file  
                                </description>  
                        </property>  
                        <property>  
                                <name>modTimePeriod</name>  
                                <value>3</value>  
                                <description>  
                                        Time (in milliseconds) after a file is modified to make it a candidate for RAIDing  
                                </description>  
                        </property>  
                </policy>  
        </srcPath>  
</configuration>  

Place raid.xml into the directory "/app/hadoop/conf" (or another directory of your choice). Make sure that hduser is the owner of the directory and of the raid.xml file.
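
For example, mirroring the ownership commands used earlier in this guide:

    mkdir -p /app/hadoop/conf
    cp raid.xml /app/hadoop/conf/
    chown -R hduser:hdgroup /app/hadoop/conf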

Finally, make sure that the value of the property named "raid.config.file" in conf/hdfs-site.xml points to the raid.xml file:

<property>  
    <name>raid.config.file</name>  
    <value>/app/hadoop/conf/raid.xml</value>  
    <description>This is needed by the RaidNode </description>  
</property>  

Also add the following elements to the same file (conf/hdfs-site.xml):

<property>  
    <name>fs.hdfs.impl</name>  
    <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>          
    <description>The FileSystem for hdfs: uris.</description>  
</property>  
<property>  
    <name>raid.classname</name>  
    <value>org.apache.hadoop.raid.DistRaidNode</value>  
    <description>Specify which implementation of RaidNode to use (class name).</description>  
</property>  
<property>  
    <name>raid.policy.rescan.interval</name>  
    <value>1000</value>  
    <description>Specify the periodicity in milliseconds after which all source paths are rescanned and parity blocks recomputed if necessary. By default, this value is 1 hour.</description>  
</property>  
<property>
    <name>hdfs.raid.stripeLength</name>  
    <value>10</value>  
    <description>The stripe length for the code. stripeLength number of blocks are used for coding for RS coding</description>  
</property>
<property>  
    <name>hdfs.raidrs.rsparitylength</name>  
    <value>4</value>  
    <description>The number of parity blocks generated from stripeLength number of blocks</description>  
</property>
<property>  
    <name>hdfs.raidrs.srcparitylength</name>  
    <value>2</value>  
    <description>The number of SRC blocks. If it is zero, pure RS is used.</description>  
</property>
<property>  
    <name>raid.blockfix.classname</name>  
    <value>org.apache.hadoop.raid.LocalBlockIntegrityMonitor</value>  
    <description>Specify the BlockFixer implementation to use. The default is org.apache.hadoop.raid.DistBlockFixer.</description>  
</property>  
<property>  
    <name>raid.blockfix.interval</name>  
    <value>1000</value>  
    <description>interval in milliseconds between checks for lost files. Default is 1 minute</description>  
</property>  
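
As a rough illustration of what these example values imply: with hdfs.raid.stripeLength set to 10 and hdfs.raidrs.rsparitylength set to 4, every stripe of 10 data blocks gets 4 RS parity blocks, so with targetReplication set to 1 the raw storage comes to roughly (10 + 4)/10 = 1.4 times the file size (plus the 2 SRC blocks per stripe when hdfs.raidrs.srcparitylength is non-zero), compared to 3 times for plain triple replication.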

Run Hadoop

If everything went well, you should be able to run hadoop.

First format the namenode:

   hadoop namenode -format  

Then run the start-all.sh script to launch the Hadoop daemons (processes). The script is in "${HADOOP_HOME}/bin", which has been added to the PATH, so it can be invoked by typing in the bash terminal:

   start-all.sh   

View the running processes (nodes) by typing:

  jps

The output should look like the following:

21610 Jps
21316 SecondaryNameNode
21525 TaskTracker
21377 JobTracker
21168 DataNode

If you want to add extra datanodes, you may do so with the ${HADOOP_HOME}/bin/extra-local-datanodes.sh script. First, you have to edit the script to set dfs.data.dir, by modifying the corresponding line to:

-D dfs.data.dir=/app/hadoop/tmp-hduser/dfs/data$DN"

where "/app/hadoop/tmp-hduser" is the temporary directory we selected previously in this guide. Save the changes and execute the script. For example, to get 5 data nodes running, type:

 extra-local-datanodes.sh start 1 2 3 4 5 

Typing jps once again, the output should now look like the following:

21867 DataNode
21316 SecondaryNameNode
21525 TaskTracker
21377 JobTracker
21168 DataNode
21815 DataNode
21763 DataNode
21659 DataNode
21711 DataNode
21897 Jps

We are done! Hopefully everything went well. You may want to stop the processes by running:

stop-all.sh
extra-local-datanodes.sh stop 1 2 3 4 5

Setting up Eclipse

First create a new project using File -> New -> Java Project.
Give the project a name, say Hadoop, and select 'Create project from existing source'.
If it asks about deleting the Hadoop/bin folder, click No, and on the next page specify a different output folder, such as Hadoop/build/eclipse-classes.

Set up the ANT_HOME variable by going to Preferences -> Java -> Build Path -> Classpath Variables. Create a new variable called ANT_HOME and set it to /usr/share/ant after verifying that the path exists.

You can now build by browsing to build.xml and right-clicking Run As... -> Ant Build..., where you can select which targets to build. If you run into trouble, see the Cloudera Eclipse tutorial.

Other links: