cs147a: Project 2

Assigned : 27 February 2008 Part 1 Due : 29 February 2008 before class Part 2 Due : 10 March 2008 before 11:59 PM Part 3 Due : 21 March 2008 before 11:59 PM (extended from 14 March)

In this project you will write a series of Hadoop programs to aggregate and then analyze data from multiple large input files. You will then test your programs and measure their performance.

Example Solution

A possible solution is posted here. It is not the only way of solving this problem, just one way.

The Setup

Your job is to create a system for aggregating and analyzing data produced by existing systems for managing parking at Logan International Airport (BOS). When a customer parks at BOS, they receive a ticket. This ticket is all they need to check out, including discovering their parking fee and the location of their car.

Parking is managed by appending information about vehicles that enter the parking area in the "intake" table, and then when those vehicles leave information about them is appended to the "discharge" table. When vehicles enter, a photograph is taken of their license plate and OCR software is used to recognize their license plate number (which isn't really a number, it's a string of characters) and the state in which their car is registered. These tables are never cleaned, so they just grow over time.

Periodically, a vehicle drives through the parking areas, photographing license plates. These photographs are also processed with OCR and the license plate numbers are associated with a parking spot id in the "parking" table. This table is useful when a customer is leaving, it can tell them where to find their car.

You must write a program to efficiently aggregate these inputs into two output tables. The first ("kiosk") lists the location and current fee for each oustanding ticket in BOS parking, and is accessed by customers at the automated checkout kiosks before they leave the terminal. The second ("fee_history") lists all fees that have been assessed over time.

Input Specification

Input File: intake

A series of tuples formatted as follows:

(ticket, license, state, entry_timestamp)

Where ticket and license are arbitrary strings, state is a two-letter state abbreviation (the state from which the plate was issued) and entry_timestamp is an ISO 8601 datetime in the local time zone (e.g., "2008-02-26 13:30").

You may assume that the license string uniquely identifies a vehicle, i.e., the state is not necessary to identify a vehicle. You may also assume that the same state is always associated with a particular license number. However, you may not assume that a license only appears once in this table, since a vehicle may park at BOS on multiple separate occassions.

Input File: discharge

A series of tuples formatted as follows:

(ticket, exit_timestamp)

Where ticket is an arbitrary string and exit_timestamp is an IOS 8601 datetime.

Input File: parking

A series of tuples formatted as follows:

(spot_id, license)

Where spot_id and license are arbitrary strings.

Format of Input

In this assignment we describe the format of each row in a table using a loose tuple notation, but physically tuples will be given in the input text files in this format:

key\tvalue\n

Where "\t" is a tab character, "\n" is a newline separating one tuple from the next, and both "key" and "value" could contain spaces. The spaces in "value" separate multiple components of the value, thus simulating an n-tuple. For example:

ticket\tlicense state entry_timestamp

This format is easy to use because it will be divided up into <key, value> pairs automatically by HadoopStreaming, and will be divided up correctly as well by the Java API provided you add this to your job configuration:

conf.setInputFormat(KeyValueTextInputFormat.class);

You will probably want to use this same format for your intermediate and summary output, but you are free to use whatever format you wish.

TASK 1: Produce Summary Tables

Output File: kiosk

Produce a table that lists the ticket, spot_id, and current parking fee for all cars that have not exited the parking area. Each row of the kiosk table must be formatted as:

(ticket, spot_id, fee)

The ticket and spot_id should be the same strings used in the source files. The fee must be computed using the entry_timestamp and the current datetime (when debugging, you might want to temporarily fix the datetime used for "now" so that your results don't change with every run). You should use the current Logan airport parking fees to compute the fee.

Any ticket which appears in the discharge table should not be included in the kiosk table.

Creating the Table

This table is similar to the result of a SQL join, such as might be the result of the following SQL query (you are NOT writing SQL for this project, this is just an illustration for the benefit of those of you who are familiar with SQL):

select ticket, spot_id, CALCULATE_FEE(entry_timestamp, now)
from   intake inner join parking on intake.license = parking.license
where  ticket not in (select ticket from discharge)

This SQL command would probably not be very efficient because of the nested select in the where clause, but if you know SQL this might give you an idea of what we are looking for. Of course, the CALCULATE_FEE UDF would be provided by the database programmer, similar to how you must program it into a Hadoop function.

Another way to look at this problem is that you are filtering out those tickets which do not appear in the discharge table, then looking up the spot_id in the parking table, and finally computing the fee from the intake table, and outputting those two values for each ticket (where the ticket is the key).

This means that you should find the fee and spot_id which are associated with the same license number, and then output those three items in a tuple with the license as the key.

Output File: fee_history

Produce a table that lists the license number and fee assessed for all vehicles that have been discharged. A ticket must be in both intake and discharge to be considered for this table, since the fee_history table only considers those fees which have been assessed. Each row of the fee_history table must be formatted as:

(license, state, [fee1, fee2, ..., feeN])

If you wish, you may simply treat the value as a Text object and append fees to that string instead of embedding an array. In other words, you ouput can look like this:

license    state fee1 fee2 ... feeN

Where license and state are the same strings used in the intake table. There may be more than one fee (i.e., if the same car parks more than once at Logan), so you should represent your values as arrays of one or more fees.

Each fee should be computed using the current Logan airport fee structure and the time difference between entry_timestamp and exit_timestamp.

Creating the Table

Similar to the first summary table, you must combine data from two tables. This time, you are combining data from intake and discharge instead of from intake and parking. The key on which you combine is again the license. Fees are computed in a similar way, except you use entry_timestamp and exit_timestamp instead of entry_timestamp and now.

TASK 2: Analyze the Summary Tables

Analysis 1: check_out

When a customer wishes to check out at a parking kiosk, they need data from the kiosk table. They need to know the location of their car and what their fee is. The customer presents their ticket to the machine, from which the ticket is read. Write a MapReduce-style function to run in Hadoop which, given a ticket, can search the kiosk table for the location and fee. The output will be in a file in the HDFS and will be either empty, if the ticket does not appear, or the pair (location, fee) if ticket is found.

Analysis 2: out_of_state_earnings

The BOS executives are interested in the earnings from parking fees from flyers who drive in from out-of-state. You must sum the total fees assessed for all out-of-state tags (that is, fees for which the state field is not "MA").

Downloads

You will not want to debug using these large files. Make your own test inputs that exercise various corners of your algorithm. You should, however, test with these files once you think your programs work for performance testing.

You will probably need to adjust the Java heap size beyond the default. See Hadoop Cluster Setup for instructions on how to change the heap size (look for mapred.child.java.opts).

More updates: I have mentioned this in an email to the class, but in case somebody missed the mail, here is how to grab a copy of the big input without having to make it yourself:

scp -r username@verdandi.cs.brandeis.edu:/tmp/inputs .

Also, you may find that you run out of space on the hard disks. In this case, you can change the location where DFS stores its blocks from /tmp to /var/local/cs147a-spr08.

What to Turn In

This assignment is divided into 3 parts, with 3 separate due dates. Please see the beginning of this document for the due dates for each part.

Part 1

Turn in pseudocode for how you will produce the 2 summary tables and 2 analyses. Your pseudocode must take advantage of the MapReduce style of programming.

Your pseudocode must fit on one printed page. It's fine if it's less than one page.

Part 2

Program the assignment. You should test your code in a single-node cluster. You are welcome to test in a multi-node cluster as well, but doing so is only required for part 3. Turn in the following:

Part 3

Examine the performance of your system and consider the limitations of Hadoop.

How to Submit

Follow the instructions on the lab FAQ.