cs147a: Project 1

Assigned: 8 February 2008. Due: 15 February 2008 before 11:59 PM.

In this project you will write a simple Hadoop program to count the frequency of html tags in one or more documents downloaded from the web, then test your program and answer questions about its performance.

Part 1: Programming

Start with the multifetch example from the tutorial. You may begin with either the Python version or the Java version.

You will retain the part of the code which fetches the contents of web pages. But, instead of emitting a single tuple containing the web page title for each document, you will output the average frequency with which html tags are used across all documents fetched. For example, if the following two documents are fetched:

Document 1
<html><body><p>This is a <strong>paragraph</strong>.</p>
<p>This is <emph>another</emph> paragraph.</p></body></html>
Document 2
<html><body><p>Hello <a href="the-world.html">world</a>.</p></body></html>

Then the tag frequency for each document *separately* looks like (order of tuples doesn't matter):

document 1--
  html:   1
  body:   1
  p:      2
  strong: 1
  emph:   1
document 2--
  html:   1
  body:   1
  p:      1
  a:      1

So the average frequency across all documents looks like this:

html:   1
body:   1
p:      1.5
strong: 0.5
emph:   0.5
a:      0.5

You should include all tags that are present in the documents in your frequency counts (i.e., don't just pick a subset of html tags and count those). If you find a document that uses an invalid tag, that's fine; your program is not responsible for validating correctness, so go ahead and count it as a tag name like any other.

You should not treat tag names as case-sensitive. For example, you must count both <A> and <a> as the same tag.

Producing this output will require you to modify both the mapper and the reducer.
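As a sketch of how the mapper and reducer logic might be split (in Python, since the handout suggests Hadoop Streaming), one option is shown below as pure functions. The names `map_document` and `reduce_records`, and the "!doc" sentinel key, are assumptions for illustration, not part of the assignment; in the real streaming job these would read stdin and write tab-separated key/value lines to stdout, and the sentinel approach assumes a single reduce task sees all records.

```python
import re

# Lazy tag matching: "<" followed by a run of letters (see the Notes section).
TAG_RE = re.compile(r"<([A-Za-z]+)")

def map_document(html):
    """Per-document map logic: emit a ("!doc", 1) sentinel record plus one
    (tag, count) record per distinct tag in this document."""
    records = [("!doc", 1)]
    counts = {}
    for tag in TAG_RE.findall(html):
        t = tag.lower()  # <A> and <a> count as the same tag
        counts[t] = counts.get(t, 0) + 1
    records.extend(sorted(counts.items()))
    return records

def reduce_records(records):
    """Reduce logic: sum each tag's counts, then divide by the number of
    documents (recovered from the "!doc" sentinel records)."""
    num_docs = 0
    totals = {}
    for key, value in records:
        if key == "!doc":
            num_docs += value
        else:
            totals[key] = totals.get(key, 0) + value
    return {tag: total / float(num_docs) for tag, total in totals.items()}
```

Feeding the two example documents from the text through `map_document` and then `reduce_records` reproduces the averages shown above (p at 1.5, strong at 0.5, and so on).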

Notes

You should not expend much effort parsing the html; for example, you do not have to validate for correctness. To avoid failing on poorly formed html you may be lazy and treat a tag name as any word (i.e., a string of letters that doesn't contain whitespace) that is preceded by an open angle bracket. This will also help you avoid including attributes in the tag name. For example, here is some html:

<a href="target.html">a link</a>

If you match "<a" and then chop off the first character (the leading angle bracket), you are left with the tag name ("a"). You may use any tactic you wish for parsing documents for tag names provided that it works for most html.
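The lazy approach above can be sketched with a single regular expression; the helper name `extract_tags` is hypothetical, not from the assignment.

```python
import re

# "<" followed immediately by a run of letters is an opening tag name.
# Closing tags like </a> begin with "</", so they are not matched, which
# keeps counts consistent with the worked example earlier in the handout.
# Stopping at the first non-letter also drops any attributes.
TAG_RE = re.compile(r"<([A-Za-z]+)")

def extract_tags(html):
    """Return every opening tag name in the document, lowercased."""
    return [tag.lower() for tag in TAG_RE.findall(html)]

print(extract_tags('<a href="target.html">a link</a>'))  # prints ['a']
```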

For this assignment you do not have to do sophisticated error handling. If there is an error fetching or parsing a document, you may simply throw a runtime exception out of the whole map task. When you are debugging you should look in the web interface for any Java exceptions that might be thrown by your mapper or reducer.

What to Turn In

You should write your code in python (using the HadoopStreaming interface) or the Java API. For this project, the Python+HadoopStreaming approach is probably simplest, but you are free to use the language with which you are most comfortable.

Part 2: Running Your Program

  1. Store 10 urls, each one in its own file, inside a directory called "urls". Important: please be kind to the Internet; don't hammer common sites like google.com or nytimes.com over and over. If you are testing repeatedly, change which urls you use while you work
  2. Start a single-node cluster
  3. Put the urls directory into the DFS (as described in the multifetch example from the tutorial)
  4. Run your program in the single-node cluster 5 times (run as described in the multifetch example)
  5. Go to the web interface for the job tracker and report the running time for each of the 5 executions
  6. Stop the single-node cluster and start a 4-node cluster (set up as described in the multifetch example)
  7. You will probably have to reformat your file system and put the urls directory in it again
  8. Again run your program 5 times, go to the web interface, and report the running times

Part 3: Thinking About Performance

Answer the following questions:

  1. How does the performance of a single-node cluster compare to the performance of a multi-node cluster?
  2. If there is a performance difference, describe its source (i.e., explain why you saw this difference).
  3. How might the performance of running this task with Hadoop compare to simply spawning multiple threads/processes on the same machine?

How to Submit

Follow the instructions on the lab FAQ.