cs147a: Project 3
Assigned : 29 March 2008
Proposal Due : 08 April 2008 before class
Code Due : 28 April 2008 before 11:59pm
Presentation : 29 April 2008 in class and at 5:30pm
This is the final project for cs147a. We are asking you to
propose a project that leverages what you have learned about
distributed systems in general and Hadoop in particular to solve
an interesting problem in a creative way. You will give a
30-minute presentation at the end of the term structured as a pitch
to potential investors.
Presentation Schedule
The presentation schedule has been posted.
The Proposal
Your project should be completed in groups of 2 or 3. We would
like you to propose a creative topic that interests the members
of your group. Please include the following in your proposal:
- The members of your team
- A description of the problem you want to solve and how you
will employ Hadoop
- A rough sketch of the solution to convince us (and yourselves)
of its feasibility. This should encompass the following:
- descriptions of the map and reduce tasks that will be
employed in your solution;
- enumeration of any code outside of the map and reduce
tasks and the scaffolding to run jobs that will be necessary
to accomplish your task (if any) and why performing these
steps outside of Hadoop is still scalable; and,
- a brief argument why you believe that Hadoop will
provide a performance benefit for your task (i.e., describe
how the input data can scale to be large and still exhibit a
high degree of potential for parallelism).
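As an illustration of the level of detail a map/reduce description might reach, here is a minimal sketch in plain Python. It is not Hadoop code: it simulates the map, shuffle, and reduce phases in a single process, using word counting as a stand-in problem. The function names (`map_task`, `shuffle`, `reduce_task`, `run_job`) are hypothetical and chosen only for this example; your own tasks will differ.

```python
from collections import defaultdict
from itertools import chain


def map_task(line):
    """Map: emit a (word, 1) pair for each word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, standing in for Hadoop's shuffle/sort phase."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_task(key, values):
    """Reduce: sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce locally over an iterable of lines."""
    mapped = chain.from_iterable(map_task(line) for line in lines)
    return dict(reduce_task(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(run_job(["a b a", "B c"]))
```

Each call to `map_task` depends only on its own input line, and each call to `reduce_task` depends only on one key, which is the kind of independence the parallelism argument in your proposal should point to.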
Project Requirements
- Your project must address the problem described in your proposal
- Your project must be able to scale to very large data sizes
- Your code must leverage Hadoop for parallelization
- Your code must be simple to compile and run
- Your project must be original enough to be of interest to
hypothetical investors
- You must complete a set of self-designed experiments to assess
the potential for parallelization in your project. These
experiments should be similar in nature to the experiments for
previous projects, but it is up to you how exactly to show
that your algorithms are parallelizable. Your
experiments must produce some hard data that you can include
with your submission.
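One common way to turn timing measurements into hard data is to report speedup and parallel efficiency relative to your smallest cluster size. The sketch below assumes you have recorded wall-clock times for the same job at several node counts; the helper name `speedup_table` and the numbers in the usage example are made-up placeholders, not measurements or a required format.

```python
def speedup_table(timings):
    """Compute speedup and efficiency from wall-clock timings.

    timings maps node count -> seconds for the same job and input.
    Returns a list of (nodes, speedup, efficiency) tuples, measured
    relative to the run with the fewest nodes.
    """
    base_nodes = min(timings)
    base_time = timings[base_nodes]
    rows = []
    for nodes in sorted(timings):
        speedup = base_time / timings[nodes]
        # Efficiency is speedup divided by the factor of added nodes.
        efficiency = speedup / (nodes / base_nodes)
        rows.append((nodes, speedup, efficiency))
    return rows


if __name__ == "__main__":
    # Placeholder timings in seconds, for illustration only.
    example = {1: 100.0, 2: 50.0, 4: 40.0}
    for nodes, speedup, efficiency in speedup_table(example):
        print(f"{nodes} nodes: speedup {speedup:.2f}, efficiency {efficiency:.2f}")
```

An efficiency that stays close to 1 as nodes are added suggests the algorithm parallelizes well. A falling efficiency is still useful data, and is worth explaining in your interpretation.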
Deliverables
- Code which satisfies the goals of your proposal
- Your main goal is showing performance benefit in Hadoop,
so the most important part of your project is to get your
distributed algorithms working.
- Other (non-distributed) parts of your code (if you
proposed any) are less important, so manage your time such
that the code for these parts is written second, if
possible.
- Hard data from your experiments and a short (2 to 3
paragraphs) interpretation of the data
- Scripts to make running your code easy (e.g., a bash script or
a Makefile). The TA should be able to run your code (including
all initialization tasks such as loading files into the HDFS)
with just a handful of commands that can be copy/pasted from a
README. Please see the TA if you are not confident that your
scripts satisfy this requirement.
- A README which includes
- A short description of your project (1 paragraph)
- A list of your project files and their purpose
- Instructions for how to run your project
- A list of any unresolved bugs in your code
- A 25-minute pitch to investors (to be delivered as a
presentation in class), followed by 5 to 10 minutes
of questions. You should assume that the investors are skilled
technically as well as in business. Your pitch should cover at
least the following points:
- Why your project is worth investment (how will the
investment pay off)?
- What is the originality of your project as compared to
existing approaches (if there are any)?
- A concise but detailed overview of the technical details
of your project, with a special focus on how an
investment in a cluster of machines and a Hadoop
infrastructure will provide ROI when it is used as the
deployment platform for your project. This should include
data from your experiments that show that your project can
scale.
(Just to be clear, this isn't a business class, so don't go
overboard with hypothetical ROI calculations or anything. Our
main goal is for you to demonstrate novelty, usefulness,
functionality, and parallelizability).
Pitch Recommendations
Both the high-level pitch and the technical details are
important for the final presentation, so please be sure to leave
room for both.
- Your pitch should include an overview of existing solutions,
and draw a clear distinction between those approaches and
your approach;
- if there are not existing solutions to your problem, explain
why the problem needs to be solved and why it is worth the
risk of investment to solve; and,
- the technical details should be concise: assume that the
investors will ask questions at your talk or examine your
prototype if they have very specific questions about your
code.
You will need to be very clear and efficient in your
presentation. You should prepare concise slides that complement
rather than mirror what you will say in your
presentation. Please do not spend time going over
"outline" slides or other visual material that is
superfluous to the content you are delivering.
How to Submit
Proposal
Please attach your proposal in plain text or PDF format to an
email and send it to both the professor and the
TA before class on the due date. One email per group,
please. Make sure that all information (including the group
membership) is in the attached document.
Project
Follow the instructions on the lab
FAQ. Please email your project to both the professor and the
TA.