Hadoop Cluster Setup
This page will help you set up a real Hadoop cluster of 4 or more nodes.
Prerequisites
- Read Hadoop Single Node Setup and learn how to set up a single-node Hadoop "cluster".
- Have access to Berry patch machines. While the intro can be completed on your own computer, running a cluster requires access to the Berry patch.
Introduction
To set up a Hadoop cluster, you must first choose at least four
machines in the Berry patch. During testing you do not have to
worry about whether others are using the machines you choose,
but when you need to do performance testing for a project we
will help you schedule time so that you have exclusive access to
the machines in your cluster.
Following are the capacities in which nodes may act in your
cluster:
- NameNode: Manages the namespace, file system metadata, and access control. There is exactly one NameNode in each cluster.
- SecondaryNameNode: Downloads periodic checkpoints from the NameNode for fault tolerance. There is exactly one SecondaryNameNode in each cluster. We will learn more about the purpose of the SecondaryNameNode as we study Hadoop.
- JobTracker: Hands out tasks to the slave nodes. There is exactly one JobTracker in each cluster.
- DataNode: Holds file system data; each DataNode manages its own locally-attached storage (i.e., the node's hard disk) and stores a copy of some or all blocks in the file system. There are one or more DataNodes in each cluster. If your cluster has only one DataNode then file system data cannot be replicated.
- TaskTracker: Slaves that carry out map and reduce tasks. There are one or more TaskTrackers in each cluster.
Each node in the cluster is either a master or
a slave. Slave nodes are
always both a DataNode and a TaskTracker. While
it is possible for the same node to be both a NameNode and a
JobTracker (or in fact for a node to operate in all four
capacities, as the machine in your single-node cluster did when
you followed the Single Node
Cluster Setup tutorial), typically you will put the
JobTracker and NameNode on separate machines for performance
reasons.
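For concreteness, here is one illustrative division of roles across the four machines used later in this tutorial (the SecondaryNameNode is omitted from the sketch, and the names are just the placeholders used in the Configuration section below):
machine1    NameNode                  (master)
machine2    JobTracker                (master)
machine3    DataNode + TaskTracker    (slave)
machine4    DataNode + TaskTracker    (slave)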
Configuration
- We assume that you have downloaded and unpacked Hadoop 0.15.2 as described in Single Node Cluster Setup. Do so now if you haven't already.
- You will each be assigned 6 port numbers to use. This ensures that while you are working on getting the installation correct, you will not prevent one another from running your Hadoop clusters. We will use the tokens YOURPORT1 – YOURPORT5 to indicate where you should use your assigned port numbers.
- See the TA if you haven't yet been assigned port numbers.
- The rest of the instructions assume that you enter commands from the Hadoop root directory. E.g., if you unpacked to the directory described above, then commands should be executed from ~/proj/cs147a/hadoop-0.15.2.
- Pick four machines in the Berry Patch. This tutorial will refer to their names as machine1 – machine4.
- One of the machines you chose should be the NameNode. Another should be the JobTracker. You need to remember which one is which. We assume that machine1 is the NameNode and machine2 is the JobTracker.
- Open conf/hadoop-site.xml in your favorite text editor.
- This is the XML configuration file that each Hadoop node reads in order to know how to contact other machines in your compute cluster.
- Because your Hadoop installation is in your home directory (and hence accessible over NFS), each node will read the same configuration file.
- You must specify the machine that acts as the NameNode (fs.default.name) and the machine that acts as the JobTracker (mapred.job.tracker), along with a few other properties. Here is how your configuration file should look:
conf/hadoop-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>MACHINE1:YOURPORT1</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>MACHINE2:YOURPORT1</value>
  </property>
  <property>
    <name>dfs.replication</name>
    <value>2</value>
  </property>
  <property>
    <name>dfs.secondary.info.port</name>
    <value>YOURPORT2</value>
  </property>
  <property>
    <name>dfs.info.port</name>
    <value>YOURPORT3</value>
  </property>
  <property>
    <name>mapred.job.tracker.info.port</name>
    <value>YOURPORT4</value>
  </property>
  <property>
    <name>tasktracker.http.port</name>
    <value>YOURPORT5</value>
  </property>
</configuration>
(don't forget to replace MACHINEX and YOURPORTX with
real machine names and your assigned port numbers).
- Open conf/masters in your favorite text editor.
- This file lists the master nodes (i.e., the NameNode and JobTracker).
- Here is how your masters file should look:
MACHINE1
MACHINE2
(don't forget to replace MACHINEX with real
machine names).
- Open conf/slaves in your favorite text editor.
- This file lists the slave nodes (the nodes which will each act as both a DataNode and a TaskTracker). You need to enter the names of the other machines in your cluster (i.e., the names of the machines which aren't the NameNode or the JobTracker).
- Here is how your slaves file should look:
MACHINE3
MACHINE4
(don't forget to replace MACHINEX with real
machine names).
Starting Your Hadoop Cluster
Now that you have configured Hadoop, it's time to start it.
- Log into each one of your chosen nodes to make sure that you can log in without any prompts for a password or checks to see if you trust the machine. If you are asked whether you wish to trust the machine, say yes. If you are asked for a passphrase, then you need to run an ssh agent; one way to set this up is sketched below.
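If passwordless logins are not already working, here is a minimal sketch of one way to set them up with the standard OpenSSH tools (the key type and file locations are the OpenSSH defaults, not something specific to this course; because your home directory is shared over NFS, a single authorized_keys entry covers every node):
# Generate a key pair once; accept the default file location.
ssh-keygen -t rsa
# Authorize your own key (shared over NFS, so it applies to all nodes).
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
# If you protected the key with a passphrase, start an agent and add the key
# before running the start scripts, so ssh does not prompt for it.
eval `ssh-agent`
ssh-add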
- Log into your NameNode and change to the Hadoop root directory, then run the bin/start-dfs.sh script. This starts the HDFS. You must do this while logged into the NameNode. For example, if you chose gordius to be your NameNode then you would do something like this:
ssh gordius
cd ~/proj/cs147a/hadoop-0.15.2
bin/start-dfs.sh
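If HDFS has never been formatted for this cluster configuration, the NameNode may fail to come up; the single-node tutorial covers formatting. As a hedged sketch, assuming the standard Hadoop 0.15 command and noting that formatting erases any existing HDFS data, you would run this once on the NameNode before start-dfs.sh:
# Format the HDFS namespace (only before the very first start of this cluster).
bin/hadoop namenode -format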
- Log into your JobTracker and change to the Hadoop root directory, then run the bin/start-mapred.sh script. This starts the Map-Reduce daemon. You must do this while logged into the JobTracker. For example, if you chose creusa to be your JobTracker then you would do something like this:
ssh creusa
cd ~/proj/cs147a/hadoop-0.15.2
bin/start-mapred.sh
Testing Your Hadoop Cluster
You should test your cluster by running the example described in
the Quickstart
and in
the Hadoop
example program.
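For instance, a minimal smoke test along the lines of the Quickstart's grep example might look like the following, run from the Hadoop root directory; the examples jar name is an assumption based on the stock 0.15.2 distribution:
# Copy the conf directory into HDFS to serve as job input.
bin/hadoop dfs -put conf input
# Run the bundled grep example over the input, writing matches to "output".
bin/hadoop jar hadoop-0.15.2-examples.jar grep input output 'dfs[a-z.]+'
# Inspect the results.
bin/hadoop dfs -cat output/*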
Stopping Your Hadoop Cluster
You must always remember to stop your Hadoop
cluster when you are done with it. To do so you should do the
following:
- Log into your NameNode and run the bin/stop-dfs.sh script, similar to how you started the HDFS.
- Log into your JobTracker and run the bin/stop-mapred.sh script, similar to how you started the Map-Reduce daemon.
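For example, reusing the illustrative machine names from above (gordius as the NameNode and creusa as the JobTracker), shutting down would look something like this:
# On the NameNode:
ssh gordius
cd ~/proj/cs147a/hadoop-0.15.2
bin/stop-dfs.sh
# On the JobTracker:
ssh creusa
cd ~/proj/cs147a/hadoop-0.15.2
bin/stop-mapred.sh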
Questions for Further Thought
- Why do slave nodes always act as both a DataNode and a TaskTracker?
What's Next?
Work on the first project!