In this project you will write a simple Hadoop program to count the frequency of html tags in one or more documents downloaded from the web, then test your program and answer questions about its performance.
Start with the multifetch example from the tutorial. You may begin with either the:
You will retain the part of the code which fetches the contents of web pages. But, instead of emitting a single tuple containing the web page title for each document, you will output the average frequency with which html tags are used across all documents fetched. For example, if the following two documents are fetched:
<html><body><p>This is a <strong>paragraph</strong>.</p> <p>This is <emph>another</emph> paragraph.</p></body></html>
<html><body><p>Hello <a href="the-world.html">world</a>.</p></body></html>
Then the tag frequency for each document *separately* looks like (order of tuples doesn't matter):
document 1-- html: 1 body: 1 p: 2 strong: 1 emph: 1 document 2-- html: 1 body: 1 p: 1 a: 1
So the average frequency across all documents looks like this:
html: 1 body: 1 p: 1.5 strong: 0.5 emph: 0.5 a: 0.5
You should include all tags that are present in the documents in your frequency counts (i.e., don't just pick a subset of html tags and count those). If you find a document that uses an invalid tag that's fine, your program is not worried about correctness, go ahead and count it as a tag name like any other.
You should not treat tag tag names as case sensitive. For example, you must count both <A> and <a> as the same tag.
Producing this output will require you to modify both the mapper and the reducer.
You should not expend much effort parsing the html; for example, you do not have to validate for correctness. To avoid failing on poorly formed html you may be lazy and treat a tag name as any word (i.e., a string of letters that doesn't contain whitespace) that is preceded by an open angle bracket. This will also help you avoid including attributes in the tag name. For example, here is some html:
<a href="target.html">a link</a>
If you match "<a" and then chop off the first character (the leading angle bracket), you are left with the tag name ("a"). You may use any tactic you wish for parsing documents for tag names provided that it works for most html.
For this assignment you do not have to do sophisticated error handling. If there is an error fetching or parsing a document, you may simply throw a runtime exception out of the whole map task. When you are debugging you should look in the web interface for any Java exceptions that might be thrown by your mapper or reducer.
You should write your code in python (using the HadoopStreaming interface) or the Java API. For this project, the Python+HadoopStreaming approach is probably simplest, but you are free to use the language with which you are most comfortable.
Answer the following questions:
Follow the instructions on the lab FAQ.