Hadoop Cluster Setup
This page will help you set up a real Hadoop cluster of 4 or more nodes.
Prerequisites
- Read Hadoop Single Node Setup and learn how to set up a single-node Hadoop "cluster".
- Have access to Berry patch machines. While the intro can be completed on your own computer, running a cluster requires access to the Berry patch.
Introduction
To set up a Hadoop cluster, you must first choose at least four
machines in the Berry patch. During testing you do not have to
worry about whether others are using the machines you choose,
but when you need to do performance testing for a project we
will help you schedule time so that you have exclusive access to
the machines in your cluster.
Following are the capacities in which nodes may act in your
cluster:
- NameNode: Manages the namespace, file system metadata, and access control. There is exactly one NameNode in each cluster.
- SecondaryNameNode: Downloads periodic checkpoints from the NameNode for fault tolerance. There is exactly one SecondaryNameNode in each cluster. We will learn more about the purpose of the SecondaryNameNode as we study Hadoop.
- JobTracker: Hands out tasks to the slave nodes. There is exactly one JobTracker in each cluster.
- DataNode: Holds file system data; each DataNode manages its own locally-attached storage (i.e., the node's hard disk) and stores a copy of some or all blocks in the file system. There are one or more DataNodes in each cluster. If your cluster has only one DataNode then file system data cannot be replicated.
- TaskTracker: Slaves that carry out map and reduce tasks. There are one or more TaskTrackers in each cluster.
Each node in the cluster is either a master or
a slave. Slave nodes are
always both a DataNode and a TaskTracker. While
it is possible for the same node to be both a NameNode and a
JobTracker (or in fact for a node to operate in all four
capacities, as the machine in your single-node cluster did when
you followed the Single Node
Cluster Setup tutorial), typically you will put the
JobTracker and NameNode on separate machines for performance
reasons.
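For concreteness, here is one illustrative division of roles across the four machines used later in this tutorial (the SecondaryNameNode is omitted from the sketch, and the names are just the placeholders used in the Configuration section below):
machine1    NameNode                  (master)
machine2    JobTracker                (master)
machine3    DataNode + TaskTracker    (slave)
machine4    DataNode + TaskTracker    (slave)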
Configuration
- We assume that you have downloaded and unpacked Hadoop 0.15.2 as described in Single Node Cluster Setup. Do so now if you haven't already.
- You will each be assigned 6 port numbers to use. This ensures that while you are working on getting the installation correct, you will not prevent one another from running your Hadoop clusters. We will use the tokens YOURPORT1 – YOURPORT5 to indicate where you should use your assigned port numbers.
- See the TA if you haven't yet been assigned port numbers.
- The rest of the instructions assume that you enter commands from the Hadoop root directory. E.g., if you unpacked to the directory described above, then commands should be executed from ~/proj/cs147a/hadoop-0.15.2.
- Pick four machines in the Berry Patch. This tutorial will refer to their names as machine1 – machine4.
- One of the machines you chose should be the NameNode. Another should be the JobTracker. You need to remember which one is which. We assume that machine1 is the NameNode and machine2 is the JobTracker.
- Open conf/hadoop-site.xml in your favorite text editor.
- This is the XML configuration file that each Hadoop node reads in order to know how to contact other machines in your compute cluster.
- Because your Hadoop installation is in your home directory (and hence accessible over NFS), each node will read the same configuration file.
- You must specify the machine that acts as the NameNode (fs.default.name) and the machine that acts as the JobTracker (mapred.job.tracker), along with a few other properties. Here is how your configuration file should look:
conf/hadoop-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>MACHINE1:YOURPORT1</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>MACHINE2:YOURPORT1</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.secondary.info.port</name>
    <value>YOURPORT2</value>
  </property>
  <property>
    <name>dfs.info.port</name>
    <value>YOURPORT3</value>
  </property>
  <property>
    <name>mapred.job.tracker.info.port</name>
    <value>YOURPORT4</value>
  </property>
  <property>
    <name>tasktracker.http.port</name>
    <value>YOURPORT5</value>
  </property>
</configuration>
(don't forget to replace MACHINEX and YOURPORTX with
real machine names and your assigned port numbers).
- Open conf/masters in your favorite text editor.
- This file lists the master nodes (i.e., the NameNode and JobTracker).
- Here is how your masters file should look:
MACHINE1
MACHINE2
(don't forget to replace MACHINEX with real
machine names).
- Open conf/slaves in your favorite text editor.
- This file lists the slave nodes (the nodes which will each act as both a DataNode and a TaskTracker). You need to enter the names of the other machines in your cluster (i.e., the names of the machines which aren't the NameNode or the JobTracker).
- Here is how your slaves file should look:
MACHINE3
MACHINE4
(don't forget to replace MACHINEX with real
machine names).
Starting Your Hadoop Cluster
Now that you have configured Hadoop, it's time to start it.
- Log into each one of your chosen nodes to make sure that you can log in without any prompts for a password or checks to see if you trust the machine. If you are asked whether you wish to trust the machine, say yes. If you are asked for a passphrase, then you need to run an ssh agent; one way to set this up is sketched below.
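If passwordless logins are not already working, here is a minimal sketch of one way to set them up with the standard OpenSSH tools (the key type and file locations are the OpenSSH defaults, not something specific to this course; because your home directory is shared over NFS, a single authorized_keys entry covers every node):
# Generate a key pair once; accept the default file location.
ssh-keygen -t rsa
# Authorize your own key (shared over NFS, so it applies to all nodes).
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# If you protected the key with a passphrase, start an agent and add the key
# before running the start scripts, so ssh does not prompt for it.
eval `ssh-agent`
ssh-add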
- Log into your NameNode and change to the Hadoop root directory, then run the bin/start-dfs.sh script. This starts the HDFS. You must do this while logged into the NameNode. For example, if you chose gordius to be your NameNode then you would do something like this:
ssh gordius
cd ~/proj/cs147a/hadoop-0.15.2
bin/start-dfs.sh
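If HDFS has never been formatted for this cluster configuration, the NameNode may fail to come up; the single-node tutorial covers formatting. As a hedged sketch, assuming the standard Hadoop 0.15 command and noting that formatting erases any existing HDFS data, you would run this once on the NameNode before start-dfs.sh:
# Format the HDFS namespace (only before the very first start of this cluster).
bin/hadoop namenode -format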
- Log into your JobTracker and change to the Hadoop root directory, then run the bin/start-mapred.sh script. This starts the Map-Reduce daemon. You must do this while logged into the JobTracker. For example, if you chose creusa to be your JobTracker then you would do something like this:
ssh creusa
cd ~/proj/cs147a/hadoop-0.15.2
bin/start-mapred.sh
Testing Your Hadoop Cluster
You should test your cluster by running the example described in
the Quickstart
and in
the Hadoop
example program.
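For instance, a minimal smoke test along the lines of the Quickstart's grep example might look like the following, run from the Hadoop root directory; the examples jar name is an assumption based on the stock 0.15.2 distribution:
# Copy the conf directory into HDFS to serve as job input.
bin/hadoop dfs -put conf input
# Run the bundled grep example over the input, writing matches to "output".
bin/hadoop jar hadoop-0.15.2-examples.jar grep input output 'dfs[a-z.]+'
# Inspect the results.
bin/hadoop dfs -cat output/*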
Stopping Your Hadoop Cluster
You must always remember to stop your Hadoop
cluster when you are done with it. To do so you should do the
following:
- Log into your NameNode and run the bin/stop-dfs.sh script, similar to how you started the HDFS.
- Log into your JobTracker and run the bin/stop-mapred.sh script, similar to how you started the Map-Reduce daemon.
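For example, reusing the illustrative machine names from above (gordius as the NameNode and creusa as the JobTracker), shutting down would look something like this:
# On the NameNode:
ssh gordius
cd ~/proj/cs147a/hadoop-0.15.2
bin/stop-dfs.sh
# On the JobTracker:
ssh creusa
cd ~/proj/cs147a/hadoop-0.15.2
bin/stop-mapred.sh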
Questions for Further Thought
- Why do slave nodes always act as both a DataNode and a TaskTracker?
What's Next?
Work on the first project!