Setting Up a Multi-Node Hadoop Cluster with Beagle Bone Black

243

Learning map/reduce frameworks like Hadoop is a useful skill to have but not everyone has the resources to implement and test a full system. Thanks to cheap arm based boards, it is now more feasible for developers to set up a full Hadoop cluster. I coupled my existing knowledge of setting up and running single node Hadoop installs with my BeagleBone cluster from my previous post to create my second project. This tutorial goes through the steps I took to set up and run Hadoop on my Ubuntu cluster. It may not be a practical application for everyone to learn but currently distributed map/reduce experience is a good skill to have. All the machines in my cluster are already set up with Java and SSH from my first project so you may need to install them if you don’t have them.

Set Up

The first step naturally is to download Hadoop from apache’s site on each machine in the cluster and untar it. I used version 1.2.1 but version 2.0 and above is now available. I placed the resulting files in /usr/local and named the directory hadoop. With the files on the system, we can create a new user called hduser to actual run our jobs and a group for it called hadoop:

sudo addgroup hadoop
sudo adduser --ingroup hadoop hduser

With the user created, we will make hduser the owner of the directory containing hadoop:

sudo chown -R hduser:hadoop /usr/local/hadoop

Then we create a temp directory to hold files and make the hduser the owner of it as well:

sudo mkdir -p /hadooptemp
sudo chown hduser:hadoop /hadooptemp

With the directories set up, log in as hduser. We will start by updating the .bashrc file in the home directory. We need to add two export lines at the top to point to our hadoop and java locations. My java installation was openjdk7 but yours may be different:

# Set Hadoop-related environment variables
export HADOOP_HOME=/usr/local/hadoop
# Set JAVA_HOME (we will also configure JAVA_HOME directly for Hadoop later on)
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-armhf
# Add Hadoop bin/ directory to PATH
export PATH=$PATH:$HADOOP_HOME/bin

Next we can navigate to our hadoop installation directory and locate the conf directory. Once there we need to edit the hadoop-env.sh file and uncomment the java line to point to the location of the java installation again:

export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-armhf

Next we can update core-site.xml to point to the temp location we created above and specify the root of the file system. Note that the name in the default url is the name of the master node:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/hadooptemp</value>
    <description>Root temporary directory.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://beaglebone1:54310</value>
    <description>URI pointing to the default file system.</description>
  </property>
</configuration>

Next we can edit mapred-site.xml:

<configuration>
<property>
<name>mapred.job.tracker</name>
<value>beaglebone1:54311</value>
<description>The host and port that the MapReduce job tracker runs
at.</description>
</property>
</configuration>

Finally we can edit hdfs-site.xml to list how many replication nodes we want. In this case I chose all 3:

<configuration>
<property>
<name>dfs.replication</name>
<value>3</value>
<description>Default block replication.</description>
</property>
</configuration>

Once this has been done on all machines we can edit some configuration files on the master to tell it which nodes are slaves. Start by editing the masters file to make sure it just contains our host name:

beaglebone1

Now edit the slaves file to list all the nodes in our cluster:

beaglebone1
beaglebone2
beaglebone3

Next we need to make sure that our master can communicate to its slaves. Make sure hosts file on your nodes contains the names of all the nodes in your cluster. Now on the master, create a key and copy it to the authorized_keys:

ssh-keygen -t rsa -P ""
cat $HOME/.ssh/id_rsa.pub >> $HOME/.ssh/authorized_keys

Then copy the key to other nodes:

ssh-copy-id -i $HOME/.ssh/id_rsa.pub hduser@beaglebone2

Finally we can test the connections from the master to itself and others using ssh.

Starting Hadoop

The first step is to format HDFS on the master. From the bin directory in hadoop run the following:

hadoop namenode -format

Once that is finished Now we can start the NameNode and JobTracker daemons on the master. We simply need to execute these two commands from the bin directory:

start-dfs.sh
start-mapred.sh

To stop them later we can use:

stop-mapred.sh
stop-dfs.sh

Running an Example

With Hadoop running on the master and slaves, we can test out one of the examples. First we need to create some files on our system. Create some text files in a directory of your chosing with a few words in each. We can then copy the files to hdfs. From the main hadoop directory, execute the following:

bin/hadoop dfs -mkdir /user/hduser/test
bin/hadoop dfs -copyFromLocal /tmp/*.txt /user/hduser/test

The first line will call mkdir on HDFS to create a directory /user/hduser/test. The second line will copy files I created in /tmp to the new HDFS directory. Now we can run the wordcount sample against it:

bin/hadoop hadoop-examples-1.2.1.jar wordcount /user/hduser/test /user/hduser/testout

The jar file name will vary based on what version of hadoop you downoaded. Once the job is finished, it will output the results in HDFS to /user/hduser/testout. To view the resulting files we can do this:

bin/hadoop dfs -ls /user/hduser/testout

We can then use the cat command to show the contents of the output:

bin/hadoop dfs -cat /user/hduser/testout/part-r-00000

This file will show us each word found and the number of times it was found. If we want to see proof that the job ran on all nodes, we can view the logs on the slaves from the hadoop/logs directory. For example, on the beaglebone2 node I can do this:

cat hadoop-hduser-datanode-beaglebone2.log

When I examined the file, I could see messages at the end showing the jobname and data received and sent, letting me know that all was well.

Conclusion

If you through all of this and it worked, congratulations on setting up a working cluster. Due to the slow performance of the BeagleBone’s SD card, it is not the best device for getting actual work done. However, these steps are applicable to faster arm devices as they come along. In the meantime, the BeagleBone Black is a great platform for practice and learning how to set up distributed systems.