Hadoop Example Program

This tutorial will help you write your first Hadoop program.

Prereqs

What You Will Create

Fetch Titles from URLs

First we write a Python program that fetches the titles of one or more web pages:

multifetch.py
#!/usr/bin/env python

import sys, urllib, re

title_re = re.compile("<title>(.*?)</title>",
                      re.MULTILINE | re.DOTALL | re.IGNORECASE)

for url in sys.argv[1:]:
    match = title_re.search(urllib.urlopen(url).read())
    if match:
        print "%s\t%s" % (url, match.group(1).strip())
You can test this yourself. To make the file executable, do this:
chmod u+x multifetch.py
Then execute it like this:
./multifetch.py http://www.cs.brandeis.edu http://www.brandeis.edu
Output should look something like this:
http://www.cs.brandeis.edu      Michtom School of Computer Science
http://www.brandeis.edu         Brandeis University

This program does not do any error checking, so it will not behave correctly if you give it invalid URLs. Also note that it silently ignores a URL if it cannot find a title.
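If you want to guard against bad URLs, one possible approach (not part of the original program, just a minimal sketch that assumes urllib reports fetch failures as IOError) is to wrap the fetch in a try/except block and report problems on standard error:

#!/usr/bin/env python
# Sketch only: multifetch.py with basic error reporting added.

import sys, urllib, re

title_re = re.compile("<title>(.*?)</title>",
                      re.MULTILINE | re.DOTALL | re.IGNORECASE)

for url in sys.argv[1:]:
    try:
        content = urllib.urlopen(url).read()
    except IOError:
        # Could not fetch the page; warn on stderr and move on
        print >>sys.stderr, "could not fetch", url
        continue
    match = title_re.search(content)
    if match:
        print "%s\t%s" % (url, match.group(1).strip())
    else:
        # No <title> found; warn instead of ignoring silently
        print >>sys.stderr, "no title found in", url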

Self Test

See if you can guess what each line of this short program does, even if you are not familiar with Python. If you are unsure about any part of this program, please ask the TA; we are always happy to talk about code with you!

Hadoop Scaffolding

Next we modify multifetch.py so that it reads the URLs to fetch from standard input, one per line, which is how Hadoop Streaming will feed input to our mapper.

multifetch.py
#!/usr/bin/env python
#
# Adapted from an example by Michael G. Noll at:
#
# http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python
#
 
import sys, urllib, re

title_re = re.compile("<title>(.*?)</title>",
                      re.MULTILINE | re.DOTALL | re.IGNORECASE)
 
# Read pairs as lines of input from STDIN
for line in sys.stdin:
    # We assume that we are fed a series of URLs, one per line
    url = line.strip()
    # Fetch the content and output the title (pairs are tab-delimited)
    match = title_re.search(urllib.urlopen(url).read())
    if match:
        print "%s\t%s" % (url, match.group(1).strip())
You can test the new multifetch.py by feeding it URLs separated by newlines, something like this:
echo "http://www.cs.brandeis.edu" >urls
echo "http://www.nytimes.com" >>urls
cat urls | ./multifetch.py

Now we must write the reducer. For this example our reducer does not do anything interesting: it simply echoes every input pair with no aggregation or transformation. This is a perfectly valid reducer, but for your projects you will probably have to do more (a hypothetical aggregating reducer is sketched after the listing below).

reducer.py
#!/usr/bin/env python
#
# Adapted from an example by Michael G. Noll at:
#
# http://www.michael-noll.com/wiki/Writing_An_Hadoop_MapReduce_Program_In_Python
#

import sys

# Echo each input line (a key/value pair) back out unchanged
for line in sys.stdin:
    line = line.strip()
    print line
Again we make it executable:
chmod u+x reducer.py
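For your projects the reducer will usually need to aggregate its input. As a purely hypothetical illustration (not part of this tutorial), a reducer that counts how many values arrive for each tab-delimited key could look something like the sketch below; it relies on the fact that Hadoop sorts the mapper output by key before the reduce phase, so all lines with the same key arrive consecutively:

#!/usr/bin/env python
# Hypothetical sketch: count how many values each key receives.
# Works because Hadoop sorts mapper output by key before reducing,
# so all lines with the same key arrive one after another.

import sys

current_key = None
count = 0

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    key = line.split("\t", 1)[0]
    if key == current_key:
        count += 1
    else:
        if current_key is not None:
            print "%s\t%d" % (current_key, count)
        current_key = key
        count = 1

# Don't forget to emit the final key
if current_key is not None:
    print "%s\t%d" % (current_key, count)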

The identity reducer above does not change the output in any noticeable way, but in general, if you want to test a Python mapper and reducer on the command line, you can do something like the following (the generic names here stand in for the specific scripts above; the same technique works with any MapReduce code written in Python):

cat inputs | ./mapper.py | sort | ./reducer.py
Note that sort is the standard UNIX command-line tool that sorts its input. This step is needed to simulate Hadoop's behavior, since the Hadoop reduce phase is always preceded by a sort of the mapper output.

Running on the command line is not a substitute for testing with Hadoop, but it can be helpful as you are writing your code and will give you a good sense of what your code is doing without waiting on the overhead of starting a Hadoop task.

Running the Example with Hadoop Streaming

First we must create some input data and put it in the HDFS. Make two or more files named urlX where X is a number. Each file should contain exactly one URL. For example, here we create two files:

echo "http://www.cs.brandeis.edu" >url1
echo "http://www.nytimes.com" >url2
This creates these two files:
url1
http://www.cs.brandeis.edu
url2
http://www.nytimes.com

Now we must put these files in the HDFS. Remember the command to put files into HDFS? We will also use the mkdir command so that the input files are in their own directory.

bin/hadoop dfs -mkdir urls
bin/hadoop dfs -put url1 urls/
bin/hadoop dfs -put url2 urls/
If you are running on a real cluster, note that you must give the full, absolute path to your mapper and reducer scripts, as illustrated in the command below, and that path must be in your NFS-mounted home directory. You may also be able to come up with a more efficient way to put many such url files into HDFS; one illustrative possibility is sketched here.
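One possible shortcut (purely illustrative, not part of the original tutorial) is a small helper script, call it make_url_files.py, that reads a list of URLs from standard input and writes each one to its own urlN file in a local directory:

#!/usr/bin/env python
# Hypothetical helper (make_url_files.py): read URLs from standard input,
# one per line, and write each to its own urlN file in a local directory
# so that the whole batch can be copied into HDFS at once.

import sys, os

outdir = "url_files"    # local directory to fill with urlN files
if not os.path.exists(outdir):
    os.mkdir(outdir)

count = 0
for line in sys.stdin:
    url = line.strip()
    if not url:
        continue
    count += 1
    f = open(os.path.join(outdir, "url%d" % count), "w")
    f.write(url + "\n")
    f.close()

You could then feed it your list of URLs on standard input and copy the resulting directory into HDFS with a single bin/hadoop dfs -put of that directory.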

At long last, we can run the command:

bin/hadoop jar contrib/hadoop-0.15.2-streaming.jar \
  -mapper  $HOME/proj/hadoop/multifetch.py         \
  -reducer $HOME/proj/hadoop/reducer.py            \
  -input   urls/*                                  \
  -output  titles
This should produce output something like the following:
null=@@@userJobConfProps_.get(stream.shipped.hadoopstreaming
packageJobJar: [/tmp/hadoop-ross/hadoop-unjar5997/] []
  /tmp/streamjob5998.jar tmpDir=null
08/01/11 10:34:43 INFO mapred.FileInputFormat: Total input paths to process : 5
08/01/11 10:34:44 INFO streaming.StreamJob: getLocalDirs():
  [/tmp/hadoop-ross/mapred/local]
08/01/11 10:34:44 INFO streaming.StreamJob: Running job: job_200801111003_0003
08/01/11 10:34:44 INFO streaming.StreamJob: To kill this job, run:
08/01/11 10:34:44 INFO streaming.StreamJob:
  /home/ross/cs147a/hadoop/hadoop-0.15.2/bin/../bin/hadoop job
  -Dmapred.job.tracker=localhost:9001 -kill job_200801111003_0003
08/01/11 10:34:44 INFO streaming.StreamJob:
  Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_200801111003_0003
08/01/11 10:34:45 INFO streaming.StreamJob:  map 0%  reduce 0%
08/01/11 10:34:52 INFO streaming.StreamJob:  map 40%  reduce 0%
08/01/11 10:34:53 INFO streaming.StreamJob:  map 80%  reduce 0%
08/01/11 10:34:54 INFO streaming.StreamJob:  map 100%  reduce 0%
08/01/11 10:35:03 INFO streaming.StreamJob:  map 100%  reduce 100%
08/01/11 10:35:03 INFO streaming.StreamJob: Job complete: job_200801111003_0003
08/01/11 10:35:03 INFO streaming.StreamJob: Output: titles
And you should be able to view the results like this:
bin/hadoop dfs -cat titles/part-00000
http://www.cs.brandeis.edu  Michtom School of Computer Science
http://www.nytimes.com      The New York Times - Blah blah ...

Look at the Job Stats

You can look at the job statistics by following two URLs: the Map/Reduce administration page (by default http://localhost:50030/) and the NameNode status page (by default http://localhost:50070/).

You can also use the Map/Reduce administration link to view the status of a job as it runs (but you must refresh the page manually; it does not refresh automatically).

These links use the port numbers from the Quickstart. If you are on a Berry patch machine, you should have changed them to your assigned port numbers: the Map/Reduce administration link should use the port you assigned to the mapred.job.tracker.info.port property in conf/hadoop-site.xml, and the NameNode status link should use the port you assigned to the dfs.info.port property. If you are on your personal computer, the links should work as-is.

Questions for Further Thought

What about Java?

There is a Java version of the multifetch example. Please read it and try it out if you wish to write your Hadoop code in Java.

What's Next?

Set up a real Hadoop cluster!