COSI 120a
Topics in Systems: Querying the Web
Spring, 2000
Final Projects
Project Ideas
The following is a list of topics that should form the basis for your
final projects. These topics, as given, are somewhat broad. It is your
job to use these ideas (or any others that you come up with) as a
springboard for a final project proposal. This proposal is due on
March 10 at the start of class, so you should meet with your group of
3 or 4 as soon as possible to choose your project and define its
scope.
Automatic Mirror Selector
Many web sites use mirroring in an attempt to balance the request load
generated by large numbers of clients. An example of this is Real
Audio, which requests that downloads of its product be made from sites
that are in geographic proximity to the requesting client. The
problem with this approach is that clients have no way of knowing
the relative workloads at each mirror site. It may be worthwhile for
a client situated in Boston to download a file from a site across the
country rather than from a site in New York, if the load at the New
York site is heavy enough to make a cross-country file transfer
faster.
For this project, the goal is to come up with an automated mirror
selector that predicts, from the client side, which of a collection of
mirror sites will offer the best response time for a
given file request. You should come up with a number of predictive
tools (an obvious one that comes to mind is ping) and run experiments
that determine how well each tool predicts the response times of the file
requests that follow. Your tools should require no intrusion on the server side (ping, for example, requires none),
as typically you only have control at the client side.
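One possible shape for such an experiment is sketched below in Python. It is an illustration only: the names tcp_connect_time and pick_mirror, the use of TCP connect time as a ping-like predictor, and the median-of-three rule are all assumptions made for this example, not requirements of the project.

```python
import socket
import statistics
import time


def tcp_connect_time(host, port=80, timeout=2.0):
    """Return the seconds needed to open a TCP connection, or None on failure.

    This is a ping-like probe that needs no cooperation from the server
    (an assumed predictor, chosen only for illustration).
    """
    start = time.perf_counter()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return time.perf_counter() - start
    except OSError:
        return None


def pick_mirror(mirrors, probe=tcp_connect_time, samples=3):
    """Probe each mirror `samples` times and return (best_mirror, scores).

    A mirror's score is the median of its successful probes; mirrors with
    no successful probe are left out of `scores`.
    """
    scores = {}
    for mirror in mirrors:
        times = [t for t in (probe(mirror) for _ in range(samples)) if t is not None]
        if times:
            scores[mirror] = statistics.median(times)
    if not scores:
        return None, scores
    return min(scores, key=scores.get), scores
```

To judge how good a predictor is, you could record the mirror it picks, then actually download the file from every mirror and compare the measured transfer times with the prediction. Passing a different probe function lets you swap predictors without changing the experiment.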
Query Benchmarks for XML-Based Query System
The purpose of this project is to give you a chance to experiment with
research prototypes that implement XML query languages.
Since they are prototypes, it's a little unfair to compare their
performance, but, on the other hand, it is a great opportunity to
learn something about how they work and about how you might go about
constructing a benchmark. At the very least, you will use the
XML-QL interpreter
from AT&T Bell Labs and the Lorel
interpreter from Stanford University. You should find one or two other
XML-based query language implementations to examine also.
Your job is to first install and become familiar with the way that
these systems work. You should then pick an application area for
which you will test these systems. An example would be a digital
library. To construct the benchmark, you will first need a data
source. You can get this in any (legal) way that you want. You can
get it off the net, you can get it from a friend, or you can build
it yourself. Part of the exercise is to get this data into a form
that can be processed by your target systems.
You must then design a mix of queries that you will send to each of
the systems. You will also need a clear idea of what you will measure
as a result of the system's execution of your queries. A good set of
queries should exercise different aspects of a system. They should be
representative of what you might expect in practice. You will justify
the design of your benchmark in your project writeup.
Of course, you will run your benchmark on all of the query systems and report your
measured results.
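A timing harness for such a benchmark might look like the following Python sketch. How each interpreter is actually invoked is not specified here, so each system is represented by a placeholder callable that executes one query; the function name run_benchmark, the choice of elapsed time as the metric, and the number of repetitions are assumptions for illustration.

```python
import statistics
import time


def run_benchmark(systems, queries, repetitions=5):
    """Time every query on every system.

    `systems` maps a system name to a callable that executes one query
    string; `queries` maps a query label to the query text.
    Returns {(system, label): {"median": ..., "min": ..., "max": ...}}
    with all times in seconds.
    """
    results = {}
    for sys_name, execute in systems.items():
        for label, text in queries.items():
            timings = []
            for _ in range(repetitions):
                start = time.perf_counter()
                execute(text)
                timings.append(time.perf_counter() - start)
            results[(sys_name, label)] = {
                "median": statistics.median(timings),
                "min": min(timings),
                "max": max(timings),
            }
    return results
```

Reporting the median along with the minimum and maximum makes it easier to spot runs disturbed by caching or other activity on the machine. Elapsed time is only one candidate measurement; your writeup should justify whatever you choose.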
Client-side Profile Language and Interpreter
For this project, you should come up with a site-specific profile
language and interpreter. For example, the chosen site might be
CNN.com, and your profile language might specify the pages that
interest you that can be retrieved from CNN.com as they get generated.
For example, your profile might specify interest in poll results for
the 2000 election primaries as they get generated, basketball game
results where some scorer got over 40 points, or weather forecasts for
any U.S. city that predict over a foot of snow.
Many sites include "server-side" profiles. For example, CNN.com
includes a feature called "myCNN" which allows users to select
topics of interest from a checklist, and automatically
generates pages with links to stories related to the chosen topics. The
project proposed here would differ in the following ways:
- it would process profiles on the client side rather than the server
side (among other advantages, this would allow you to expand this
project to profile and integrate data from multiple sites)
- it would allow finer-grained specification of profile interests,
based on the content of the news stories (e.g., game summaries
for basketball games with players who scored over 40 points) or
on how recently the story was published (e.g., only stories
that are less than a week old).
The exact flavor of the profile language would be up to you. You
should define the language unambiguously (e.g., with an annotated
grammar) and write an interpreter that would process a profile and
generate a personalized web page that includes links to the pages
specified in the profile.
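To make the idea concrete, here is a deliberately tiny Python sketch of a profile interpreter. Everything in it is hypothetical: the rule syntax ("field op value" clauses joined by "and", one rule per line), the story fields (topic, top_score, snow_inches), and the assumption that stories have already been extracted from the site into dictionaries. Your own language will almost certainly be richer.

```python
import html
import operator

OPS = {">": operator.gt, "<": operator.lt, "=": operator.eq}


def parse_rule(line):
    """Parse 'field op value [and field op value ...]' into clause tuples."""
    clauses = []
    for part in line.split(" and "):
        field, op, value = part.split(None, 2)
        if op not in OPS:
            raise ValueError(f"unknown operator: {op}")
        # Numeric literals become numbers; everything else stays a string.
        try:
            value = float(value)
        except ValueError:
            pass
        clauses.append((field, op, value))
    return clauses


def matches(story, clauses):
    """True when the story satisfies every clause of one rule."""
    for field, op, value in clauses:
        if field not in story:
            return False
        try:
            if not OPS[op](story[field], value):
                return False
        except TypeError:  # e.g. comparing text with a number
            return False
    return True


def render(profile_text, stories):
    """Return an HTML list linking to every story matched by any rule."""
    rules = [parse_rule(l.strip()) for l in profile_text.splitlines() if l.strip()]
    items = [
        f'<li><a href="{html.escape(s["url"])}">{html.escape(s["title"])}</a></li>'
        for s in stories
        if any(matches(s, r) for r in rules)
    ]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"
```

With this toy syntax, a profile containing the two lines "topic = basketball and top_score > 40" and "topic = weather and snow_inches > 12" would select the high-scoring game summaries and heavy-snow forecasts mentioned above.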
Web Server Dashboard
This project would involve designing and building a web server tool
that monitors the "current" workload of a web server and displays an
analysis of it graphically, where what counts as "current" (last 5 minutes,
last 24 hours, etc.) can be specified by tweaking a "knob". A simple
version of this tool might simply monitor the web access log, and
maintain current request arrival rates (for time intervals that could
also be specified with a "knob"), per-file retrieval statistics,
per-requester retrieval statistics and average response times. It
might also use clustering techniques to display an analytical model of
the current workload as was discussed in Chapter 6 of the Capacity
Planning text. The dashboard could also have a "spam detector" to
detect excessive numbers of HTTP requests arriving from a single IP
address.
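The bookkeeping behind such a dashboard could be sketched in Python as follows. Parsing the access log into (timestamp, IP address, path) tuples is left out, and the class name WindowStats, the window length, and the spam threshold are assumptions chosen for illustration.

```python
from collections import Counter, deque


class WindowStats:
    """Keep request statistics for the most recent `window` seconds."""

    def __init__(self, window=300.0, spam_threshold=100):
        self.window = window              # the "knob": how long "current" lasts
        self.spam_threshold = spam_threshold
        self.events = deque()             # (timestamp, ip, path), oldest first
        self.by_ip = Counter()            # per-requester counts in the window
        self.by_path = Counter()          # per-file counts in the window

    def add(self, timestamp, ip, path):
        """Record one request; timestamps must be non-decreasing."""
        self.events.append((timestamp, ip, path))
        self.by_ip[ip] += 1
        self.by_path[path] += 1
        self._expire(timestamp)

    def _expire(self, now):
        """Drop requests that arrived at or before now - window."""
        while self.events and self.events[0][0] <= now - self.window:
            _, ip, path = self.events.popleft()
            self.by_ip[ip] -= 1
            self.by_path[path] -= 1

    def arrival_rate(self):
        """Requests per second over the current window."""
        return len(self.events) / self.window

    def suspected_spammers(self):
        """IP addresses with more than `spam_threshold` requests in the window."""
        return sorted(ip for ip, n in self.by_ip.items() if n > self.spam_threshold)
```

Because old requests are dropped as new ones arrive, each update costs time proportional to the number of expired entries rather than to the size of the whole log, which matters if the display is refreshed continuously.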
An important part of this project would involve specifying, in
advance, its functionality and GUI. The functionality proposed here
is meant solely to help you come up with ideas, and is in no way meant
to limit the product you produce.
A Simulated Web Cache Server
The task here is to better understand the role of a web cache and its
effect on web performance. While it is commonly agreed that caching
holds the key to good performance for popular web sites, there is
still a good deal of disagreement about how caches should work and how
and where they should be deployed. You can think of this project as a
white paper containing experimental results that could be used to
inform those decisions.
The fundamental operation of a web cache is very much like that of any
other cache that might be maintained in your computer. The trick is to
always keep the items that are most likely to be accessed next. If your
cache manager were omniscient, you would always win. However, knowing
future accesses is not possible, so the best we can do is to use a good
heuristic like LRU.
Since it is not that easy to obtain and configure a web cache server,
we will be satisfied with a simulated environment. You will simulate a
workload and a set of components that manage a fictitious cache. Your
cache will sit in front of a database (read: web site) with some
specified size. You can vary the size as a part of your simulation.
Your cache will also have a specified size that is smaller than the
database.
The documents in your database will also be of varying size. You can
describe the variation in their size with some form of statistical
distribution (e.g., Poisson or normal distributions). Playing with the
parameters of the distribution will allow you to investigate varying
degrees of skew in document size.
Other things that you might want to vary would include the type of
workload (uniform, hot-cold) and its intensity (exponential
interarrival times, bursty, etc.). The caching policy is also up for
grabs. While LRU is a good starting point, it is likely for web
traffic that you could do better. This is your chance to be creative.
Try to design something that more closely reflects the web and that is
simple (efficient) to implement.
Your final problem is to figure out what to measure. Response time
improvement (with the cache versus without it) is an obvious choice, but you
will likely think of more interesting metrics as your project
progresses.
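As a starting point, the core of such a simulation could be as small as the Python sketch below. It assumes a byte-capacity LRU cache, a two-level hot-cold workload, and hit rate and byte hit rate as the metrics; the function names and default parameters are illustrative choices, and interarrival times are not modeled at all.

```python
import random
from collections import OrderedDict


def simulate_lru(requests, sizes, capacity):
    """Replay `requests` (document ids) through an LRU cache of `capacity` bytes.

    `sizes` maps document id -> size in bytes.
    Returns (hit_rate, byte_hit_rate).
    """
    cache = OrderedDict()          # doc id -> size, least recently used first
    used = hits = hit_bytes = total_bytes = 0
    for doc in requests:
        size = sizes[doc]
        total_bytes += size
        if doc in cache:
            hits += 1
            hit_bytes += size
            cache.move_to_end(doc)            # mark as most recently used
            continue
        if size > capacity:
            continue                          # too large to cache at all
        while used + size > capacity:         # evict until the document fits
            _, evicted_size = cache.popitem(last=False)
            used -= evicted_size
        cache[doc] = size
        used += size
    n = len(requests)
    return (hits / n if n else 0.0,
            hit_bytes / total_bytes if total_bytes else 0.0)


def hot_cold_workload(n_docs, n_requests, hot_fraction=0.1, hot_prob=0.9, seed=0):
    """Generate requests where a small 'hot' set receives most accesses."""
    rng = random.Random(seed)
    n_hot = max(1, int(n_docs * hot_fraction))
    out = []
    for _ in range(n_requests):
        if n_hot >= n_docs or rng.random() < hot_prob:
            out.append(rng.randrange(n_hot))
        else:
            out.append(rng.randrange(n_hot, n_docs))
    return out
```

Replacing the eviction loop with a different policy (for example, one that takes document size into account) and rerunning the same workload gives a direct comparison against the LRU baseline.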
Workload Generator
The wwwstat tool discussed in class generates an analytic workload
model using the web access log. The model (table) that can be
generated with this tool is limited. The only columns (i.e., workload
parameters) of the table that can be generated are "bytes
transferred", "requests", "% bytes transferred" and "% requests". The
only rows of the table (i.e., workload classes) that can be generated
are classes based on time intervals (days or hours), files requested
or IP addresses of requesting clients. The goal of this project would
be to build a more general wwwstat tool. This tool would permit
specification of the columns (e.g., arrival rates) and rows desired
(e.g., rows based on file sizes or automatically generated rows based
on clustering), as well as permit global analysis of the workload file
for identifying such phenomena as load spikes and causality
relationships (e.g., people who load file A tend also to load file B 95%
of the time).
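The "file A implies file B" analysis could be approximated as in the Python sketch below. It assumes the access log has already been reduced to (client, file) pairs, and it measures co-occurrence per client rather than true causality; the function name co_load_confidence and the threshold parameter are illustrative.

```python
from collections import defaultdict


def co_load_confidence(log, min_confidence=0.95):
    """Find pairs (A, B) where clients that loaded A usually also loaded B.

    `log` is an iterable of (client, file) pairs taken from an access log.
    Returns {(A, B): confidence}, where
    confidence = (clients that loaded both A and B) / (clients that loaded A).
    """
    clients_of = defaultdict(set)
    for client, path in log:
        clients_of[path].add(client)
    rules = {}
    for a, a_clients in clients_of.items():
        for b, b_clients in clients_of.items():
            if a == b:
                continue
            confidence = len(a_clients & b_clients) / len(a_clients)
            if confidence >= min_confidence:
                rules[(a, b)] = confidence
    return rules
```

This compares every pair of files, so it is quadratic in the number of distinct files; for a large site you would want to restrict it to frequently requested files first.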
Profile-based Prefetcher
This project is based on the Data Recharging proposal described here. This project would involve
building a tool that analyzes profile specifications generated by
existing software (e.g., schedule files for calendar managers) and
retrieves information from various sources (web pages, newsgroups,
email) that is judged to be relevant according to these profiles. As
an example, a schedule might reveal that a client has a meeting with
Bruce Lindsay at IBM Almaden regarding Data Recharging on Friday.
Your prefetcher tool might then send your client an email that lists
web pages involving Data Recharging, recent mails to you from Bruce
Lindsay, and directions to IBM Almaden (perhaps including flight
schedules to the Bay area, seat sales etc.) Of course, all of this is
fairly ambitious, so a key component of this project would be to
specify that part of it which could be delivered in the time allotted.
Web-page Monitor
This project also has applications to Data Recharging. The goal of
this project would be to build a web-page "monitor" that accepts as
input from a client a set of URLs (i.e., a bookmark file). This
tool would notify the client whenever changes were made to any of
these URLs within some specified time period or since the client's
last visit to the pages. The processing done by this tool might
involve building and maintaining a table that stores a hash value for
each site listed in the bookmark file for a given client. This hash
value would be computed on the basis of the contents of some
preprocessed form of the page (e.g., with advertisements removed).
Then, periodic revisits to pages and subsequent processing would
generate new hash values that could be compared to previous hash
values to detect changes. Such a tool could be used by any processor
of user profiles that express interest in pages according to update
criteria.
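The hash-table approach described above might be sketched in Python as follows. The ad-removal step is purely hypothetical (it assumes ads are wrapped in "<!-- ad -->" and "<!-- /ad -->" comment markers, which real pages will not do for you), the fetch function is supplied by the caller, and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
import re

# Hypothetical convention: ads sit between <!-- ad --> and <!-- /ad --> markers.
AD_BLOCK = re.compile(r"<!--\s*ad\s*-->.*?<!--\s*/ad\s*-->", re.DOTALL | re.IGNORECASE)


def page_fingerprint(html_text):
    """Hash a page after removing marked ad blocks and collapsing whitespace."""
    cleaned = AD_BLOCK.sub("", html_text)
    cleaned = " ".join(cleaned.split())
    return hashlib.sha256(cleaned.encode("utf-8")).hexdigest()


def changed_urls(table, fetch, urls):
    """Revisit `urls`, update `table` (url -> hash) in place, and return
    the URLs whose fingerprint differs from the stored one.

    A URL seen for the first time is recorded but not reported as changed.
    """
    changed = []
    for url in urls:
        new_hash = page_fingerprint(fetch(url))
        old_hash = table.get(url)
        if old_hash is not None and old_hash != new_hash:
            changed.append(url)
        table[url] = new_hash
    return changed
```

Calling changed_urls periodically with the same table gives the "changed since last check" behavior; storing a timestamp next to each hash would be one way to support the "within some specified time period" variant.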
Literature Survey
For those of you who prefer to work on a final project on your own,
you may also complete a literature survey on any topic related to those
covered in the course. Example topics might include
type inference for semistructured data, or dynamic
query optimization. Proposals for literature surveys should describe
the area that will be surveyed, as well as provide a preliminary list
of at least 5 papers that will be reviewed. Note that these papers
should not be chosen from those covered in class.
Project Proposals
One-to-two page proposals for your final projects are due on Friday,
March 10 at the start of class. The idea is that the proposal should
clearly sketch all of what you intend to do so that you 1) get
valuable feedback, and 2) can set out to do it with as
few missteps as possible. Each proposal should include the
following:
- A clear description of the goals of the project work.
- A description of the applications of your project to the
real world. What benefit will your final project offer to the web
and/or database communities?
- Preliminary thoughts about the design of the project.
- Intended milestones for completing the project.
- A description of the deliverable to be submitted at the project's end.
- Evaluation criteria that could be used to assess the success or
failure of the project. Note that the evaluation criteria chosen
should reflect the goals of the project described in the first item above.