Hadoop Cluster Setup

This page will help you set up a real Hadoop cluster of four or more nodes.

Prerequisites

You should have completed the Single Node Cluster Setup tutorial before starting; the steps below build on that installation.

Introduction

To set up a Hadoop cluster, you must first choose at least four machines in the Berry Patch. During testing you do not have to worry about whether others are using the machines you choose, but when you need to do performance testing for a project we will help you schedule time so that you have exclusive access to the machines in your cluster.

A node may act in any of four capacities in your cluster:

  * NameNode: the master daemon for HDFS
  * JobTracker: the master daemon for Map-Reduce
  * DataNode: a slave daemon that stores HDFS blocks
  * TaskTracker: a slave daemon that runs map and reduce tasks

Each node in the cluster is either a master or a slave. Slave nodes are always both a DataNode and a TaskTracker. While it is possible for the same node to be both a NameNode and a JobTracker (or in fact for a node to operate in all four capacities, as the machine in your single-node cluster did when you followed the Single Node Cluster Setup tutorial), you will typically put the JobTracker and NameNode on separate machines for performance reasons.
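For concreteness, here is one possible layout for a four-node cluster. The machine names and role assignment are the ones this tutorial assumes below; your own assignment may differ:

    machine1  NameNode                 (HDFS master)
    machine2  JobTracker               (Map-Reduce master)
    machine3  DataNode + TaskTracker   (slave)
    machine4  DataNode + TaskTracker   (slave)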

Configuration

  1. We assume that you have downloaded and unpacked Hadoop 0.15.2 as described in Single Node Cluster Setup. Do so now if you haven't already.
  2. You will each be assigned five port numbers to use. This ensures that while you are working on getting the installation correct you will not prevent one another from running your Hadoop clusters. We will use the tokens YOURPORT1 through YOURPORT5 to indicate where you should use your assigned port numbers.
  3. The rest of the instructions assume that you enter commands from the Hadoop root directory. E.g., if you unpacked to the directory described above, then commands should be executed from ~/proj/cs147a/hadoop-0.15.2.
  4. Pick four machines in the Berry Patch. This tutorial will refer to their names as machine1 through machine4.
  5. One of the machines you chose should be the NameNode. Another should be the JobTracker. You need to remember which one is which. We assume that machine1 is the NameNode and machine2 is the JobTracker.
  6. Open conf/hadoop-site.xml in your favorite text editor and set the cluster-wide properties (see the configuration sketch after this list).
  7. Open conf/masters in your favorite text editor and list your master machine (see the sketch after this list).
  8. Open conf/slaves in your favorite text editor and list your slave machines, one per line (see the sketch after this list).
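The exact file contents depend on your assigned ports, so the following is only a sketch. For conf/hadoop-site.xml, the two properties every cluster needs are fs.default.name (the NameNode's host and port) and mapred.job.tracker (the JobTracker's host and port). The property names are the standard Hadoop 0.15 keys, but the hostnames and YOURPORT values shown are assumptions based on the machine1/machine2 roles chosen above; your remaining assigned ports go to the other daemon and web-interface port properties, per your assignment:

    <?xml version="1.0"?>
    <configuration>
      <!-- HDFS master: the NameNode (machine1 in this tutorial) -->
      <property>
        <name>fs.default.name</name>
        <value>machine1:YOURPORT1</value>
      </property>
      <!-- Map-Reduce master: the JobTracker (machine2 in this tutorial) -->
      <property>
        <name>mapred.job.tracker</name>
        <value>machine2:YOURPORT2</value>
      </property>
    </configuration>

conf/masters and conf/slaves are plain lists of hostnames, one per line; the start scripts use conf/slaves to decide where to launch the DataNode and TaskTracker daemons. A sketch assuming the layout above (machine1 as the master, machine3 and machine4 as the slaves):

    # conf/masters
    machine1

    # conf/slaves
    machine3
    machine4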

Starting Your Hadoop Cluster

Now that you have configured Hadoop, it's time to start it.

  1. Log into each one of your chosen nodes to make sure that you can log in without any prompts for a password or checks to see if you trust the machine. If you are asked whether you wish to trust the machine, say yes. If you are asked for a passphrase, then you need to run an ssh agent (see the sketch after this list).
  2. Log into your NameNode and change to the Hadoop root directory, then run the bin/start-dfs.sh script. This starts HDFS: the NameNode on the local machine and the DataNodes on the machines listed in conf/slaves. You must do this while logged into the NameNode. For example, if you chose gordius to be your NameNode then you would do something like this:
    ssh gordius
    cd ~/proj/cs147a/hadoop-0.15.2
    bin/start-dfs.sh
  3. Log into your JobTracker and change to the Hadoop root directory, then run the bin/start-mapred.sh script. This starts the Map-Reduce daemons: the JobTracker on the local machine and the TaskTrackers on the machines listed in conf/slaves. You must do this while logged into the JobTracker. For example, if you chose creusa to be your JobTracker then you would do something like this:
    ssh creusa
    cd ~/proj/cs147a/hadoop-0.15.2
    bin/start-mapred.sh
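If step 1 prompted you for a passphrase, start an agent and add your key before running the start scripts. A minimal sketch (ssh-add with no arguments loads your default keys):

    eval `ssh-agent`
    ssh-add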

Testing Your Hadoop Cluster

You should test your cluster by running the example job described in the Quickstart, which uses the Hadoop example program.
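For example, the Quickstart's grep job can be run against the cluster from the Hadoop root directory. This is a sketch only: it assumes the examples jar that ships with Hadoop 0.15.2 and uses the conf directory as convenient input data.

    # Copy some input files into HDFS.
    bin/hadoop dfs -put conf input
    # Run the grep example from the examples jar.
    bin/hadoop jar hadoop-0.15.2-examples.jar grep input output 'dfs[a-z.]+'
    # Inspect the output.
    bin/hadoop dfs -cat output/*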

Stopping Your Hadoop Cluster

You must always remember to stop your Hadoop cluster when you are done with it. To do so:

  1. Log into your NameNode and run the bin/stop-dfs.sh script, similar to how you started the HDFS.
  2. Log into your JobTracker and run the bin/stop-mapred.sh script, similar to how you started the Map-Reduce daemon.
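For example, mirroring the start commands above (with gordius as the NameNode and creusa as the JobTracker):

    ssh gordius
    cd ~/proj/cs147a/hadoop-0.15.2
    bin/stop-dfs.sh

    ssh creusa
    cd ~/proj/cs147a/hadoop-0.15.2
    bin/stop-mapred.sh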

Questions for Further Thought

What's Next?

Work on the first project!