These instructions show you how to run Hadoop on a single machine, simulating a cluster by running multiple Java VMs. This setup is well suited to developing and testing Hadoop applications. The Hadoop website has an excellent tutorial on installing and setting up Hadoop on a single node; this document supplements that tutorial with some tips and gotchas. You will also write a small sample program that uses Hadoop to fetch titles from web pages.
Java 1.5.x is required. You may already have it installed; type java -version at the command line to check. It should report version "1.5.x" (e.g., 1.5.0_14).
echo "JAVA_HOME=/usr/java/java1.5" >>~/.bashrc \ && echo "PATH=/usr/java/java1.5/bin:\$PATH" >>~/.bashrcThen log out and back in (or just close and reopen the terminal).
Once Java is installed, go to the command line and type the following command:
java -version

You should see output similar to this:
java version "1.5.0_13" Java(TM) 2 Runtime Environment, Standard Edition (build 1.5.0_13-b05-241) Java HotSpot(TM) Client VM (build 1.5.0_13-121, mixed mode, sharing)If you see something to the effect of "command not found" then Java may not be on your PATH. If you see a version other than "1.5.x" then you do not have the correct version of Java installed.
If you have any trouble with Java installation or otherwise getting your computer set up for use with Hadoop, please ask the TA for help as soon as possible.
mkdir -p ~/proj/cs147a
mv hadoop-0.15.2.tar.gz ~/proj/cs147a
cd ~/proj/cs147a
tar xvfz hadoop-0.15.2.tar.gz

On Windows you will have to enter these commands from a Cygwin shell.
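As a quick sanity check, list the unpacked directory. The exact contents vary slightly between releases, but you should see at least the bin and conf directories and the examples jar used later in this document.

cd hadoop-0.15.2
ls
# bin/  conf/  lib/  hadoop-0.15.2-examples.jar  ...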
<?xml version="1.0"?> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?> <!-- Put site-specific property overrides in this file. --> <configuration> <property> <name>fs.default.name</name> <value>localhost:YOURPORT1</value> </property> <property> <name>mapred.job.tracker</name> <value>localhost:YOURPORT2</value> </property> <property> <name>dfs.replication</name> <value>1</value> </property> <property> <name>dfs.secondary.info.port</name> <value>YOURPORT3</value> </property> <property> <name>dfs.info.port</name> <value>YOURPORT4</value> </property> <property> <name>mapred.job.tracker.info.port</name> <value>YOURPORT5</value> </property> <property> <name>tasktracker.http.port</name> <value>YOURPORT6</value> </property> </configuration>
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED!        @
@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
IT IS POSSIBLE THAT SOMEONE IS DOING SOMETHING NASTY!
Someone could be eavesdropping on you right now (man-in-the-middle attack)!
It is also possible that the RSA host key has just been changed.

This will happen if you try to run the single-node cluster on two separate machines in the Berry patch (e.g., you work on Pentheus on Tuesday, then work on Gordius on Wednesday), because the name "localhost" points to different machines depending on which machine you are logged into. A solution to this problem can be found in the troubleshooting guide.
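One common workaround, which may or may not be the one described in the troubleshooting guide, is to delete the stale localhost entry from your known_hosts file and accept the new host key on the next connection:

ssh-keygen -R localhost     # remove the old key for localhost from ~/.ssh/known_hosts
ssh localhost               # reconnect and accept the new host key when prompted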
$ bin/hadoop jar hadoop-*-examples.jar grep input output2 'dfs[a-z.]+'

Or else you can remove the output directory and then re-run the original command:
$ bin/hadoop dfs -rmr output
$ bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'
You can see the commands that HDFS allows by typing this at the command line:
bin/hadoop dfs

A non-exhaustive list of important commands to remember is shown below.
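These are standard HDFS shell subcommands; the angle-bracket arguments are placeholders for paths of your choosing.

bin/hadoop dfs -ls <path>            # list files in HDFS
bin/hadoop dfs -put <local> <hdfs>   # copy a local file into HDFS
bin/hadoop dfs -get <hdfs> <local>   # copy a file from HDFS to the local disk
bin/hadoop dfs -cat <hdfs>           # print the contents of a file in HDFS
bin/hadoop dfs -rmr <path>           # recursively delete a directory in HDFS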
When you are done working with Hadoop, you should always shut it down. To do this, execute this command:
bin/stop-all.sh
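To confirm that everything actually shut down, you can list the running Java processes with jps, which ships with the JDK. If shutdown succeeded, the Hadoop daemons (NameNode, DataNode, SecondaryNameNode, JobTracker, TaskTracker) should no longer appear in the output. (The daemon names listed here are the ones a single-node setup normally runs; yours may differ slightly.)

jps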
Write and test a small Hadoop program.
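Below is a minimal sketch of what such a program might look like against the old org.apache.hadoop.mapred API that ships with Hadoop 0.15 (later versions add generics to the Mapper interface, so the exact signature differs). The class names, the one-URL-per-line input format, and the crude title extraction are all assumptions made for illustration; treat it as a starting point rather than a finished solution.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical example: reads a text file with one URL per line and emits (url, page title).
public class TitleFetcher {

  public static class TitleMapper extends MapReduceBase implements Mapper {
    public void map(WritableComparable key, Writable value,
                    OutputCollector output, Reporter reporter) throws IOException {
      String url = value.toString().trim();
      if (url.length() == 0) return;

      String title = "";
      try {
        // Fetch the page. A real program would set timeouts and handle redirects.
        BufferedReader in = new BufferedReader(
            new InputStreamReader(new URL(url).openStream()));
        StringBuilder page = new StringBuilder();
        String line;
        while ((line = in.readLine()) != null) {
          page.append(line).append('\n');
        }
        in.close();

        // Very naive title extraction; an HTML parser would be more robust.
        String html = page.toString();
        int start = html.toLowerCase().indexOf("<title>");
        int end = html.toLowerCase().indexOf("</title>");
        if (start >= 0 && end > start) {
          title = html.substring(start + "<title>".length(), end).trim();
        }
      } catch (IOException e) {
        reporter.setStatus("failed to fetch " + url);
      }
      output.collect(new Text(url), new Text(title));
    }
  }

  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(TitleFetcher.class);
    conf.setJobName("titlefetcher");
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    conf.setMapperClass(TitleMapper.class);
    // No reducer is set, so the default identity reducer passes the pairs straight through.
    conf.setInputPath(new Path(args[0]));   // HDFS directory containing the URL list
    conf.setOutputPath(new Path(args[1]));  // HDFS directory for the results
    JobClient.runJob(conf);
  }
}

Compile it against the Hadoop core jar (something like javac -classpath hadoop-0.15.2-core.jar TitleFetcher.java), package the resulting classes into a jar, and run it with bin/hadoop jar, just as you ran the grep example above.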