Learning map/reduce frameworks like Hadoop is a useful skill to have but not everyone has the resources to implement and test a full system. Thanks to cheap arm based boards, it is now more feasible for developers to set up a full Hadoop cluster. I coupled my existing knowledge of setting up and running single node Hadoop installs with my BeagleBone cluster from my previous post to create my second project. This tutorial goes through the steps I took to set up and run Hadoop on my Ubuntu cluster. It may not be a practical application for everyone to learn but currently distributed map/reduce experience is a good skill to have. All the machines in my cluster are already set up with Java and SSH from my first project so you may need to install them if you don’t have them.
Set Up
The first step naturally is to download Hadoop from apache’s site on each machine in the cluster and untar it. I used version 1.2.1 but version 2.0 and above is now available. I placed the resulting files in /usr/local and named the directory hadoop. With the files on the system, we can create a new user called hduser to actual run our jobs and a group for it called hadoop:
sudo addgroup hadoop sudo adduser --ingroup hadoop hduser
With the user created, we will make hduser the owner of the directory containing hadoop:
sudo chown -R hduser:hadoop /usr/local/hadoop
Then we create a temp directory to hold files and make the hduser the owner of it as well:
sudo mkdir -p /hadooptemp sudo chown hduser:hadoop /hadooptemp
With the directories set up, log in as hduser. We will start by updating the .bashrc file in the home directory. We need to add two export lines at the top to point to our hadoop and java locations. My java installation was openjdk7 but yours may be different:
# Set Hadoop-related environment variables export HADOOP_HOME=/usr/local/hadoop # Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on) export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-armhf # Add Hadoop bin/ directory to PATH export PATH=$PATH:$HADOOP_HOME/bin
Next we can navigate to our hadoop installation directory and locate the conf directory. Once there we need to edit the hadoop-env.sh file and uncomment the java line to point to the location of the java installation again:
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-armhf
Next we can update core-site.xml to point to the temp location we created above and specify the root of the file system. Note that the name in the default url is the name of the master node:
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/hadooptemp</value>
<description>Root temporary directory.</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://beaglebone1:54310</value>
<description>URI pointing to the default file system.</description>
</property>
</configuration>
Next we can edit mapred-site.xml:
<configuration>
<property>
<name>mapred.job.tracker</name>
<value>beaglebone1:54311</value>
<description>The host and port that the MapReduce job tracker runs
at.</description>
</property>
</configuration>
Finally we can edit hdfs-site.xml to list how many replication nodes we want. In this case I chose all 3:
<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Default block replication.</description>
</property>
</configuration>
Once this has been done on all machines we can edit some configuration files on the master to tell it which nodes are slaves. Start by editing the masters file to make sure it just contains our host name:
beaglebone1
Now edit the slaves file to list all the nodes in our cluster:
beaglebone1 beaglebone2 beaglebone3
Next we need to make sure that our master can communicate to its slaves. Make sure hosts file on your nodes contains the names of all the nodes in your cluster. Now on the master, create a key and copy it to the authorized_keys:
ssh-keygen -t rsa -P "" cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys
Then copy the key to other nodes:
ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@beaglebone2
Finally we can test the connections from the master to itself and others using ssh.
Starting Hadoop
The first step is to format HDFS on the master. From the bin directory in hadoop run the following:
hadoop namenode -format
Once that is finished Now we can start the NameNode and JobTracker daemons on the master. We simply need to execute these two commands from the bin directory:
start-dfs.sh start-mapred.sh
To stop them later we can use:
stop-mapred.sh stop-dfs.sh
Running an Example
With Hadoop running on the master and slaves, we can test out one of the examples. First we need to create some files on our system. Create some text files in a directory of your chosing with a few words in each. We can then copy the files to hdfs. From the main hadoop directory, execute the following:
bin/hadoop dfs -mkdir /user/hduser/test bin/hadoop dfs -copyFromLocal /tmp/*.txt /user/hduser/test
The first line will call mkdir on HDFS to create a directory /user/hduser/test. The second line will copy files I created in /tmp to the new HDFS directory. Now we can run the wordcount sample against it:
bin/hadoop hadoop-examples-1.2.1.jar wordcount /user/hduser/test /user/hduser/testout
The jar file name will vary based on what version of hadoop you downoaded. Once the job is finished, it will output the results in HDFS to /user/hduser/testout. To view the resulting files we can do this:
bin/hadoop dfs -ls /user/hduser/testout
We can then use the cat command to show the contents of the output:
bin/hadoop dfs -cat /user/hduser/testout/part-r-00000
This file will show us each word found and the number of times it was found. If we want to see proof that the job ran on all nodes, we can view the logs on the slaves from the hadoop/logs directory. For example, on the beaglebone2 node I can do this:
cat hadoop-hduser-datanode-beaglebone2.log
When I examined the file, I could see messages at the end showing the jobname and data received and sent, letting me know that all was well.
Conclusion
If you through all of this and it worked, congratulations on setting up a working cluster. Due to the slow performance of the BeagleBone’s SD card, it is not the best device for getting actual work done. However, these steps are applicable to faster arm devices as they come along. In the meantime, the BeagleBone Black is a great platform for practice and learning how to set up distributed systems.