Brandeis CS 146a

ASSIGNMENT 5: October 12 through October 16, 20090

For Class Tuesday, October 13, 09

For Lecture Material:

Our lecture topic today is engineering good performance in the end-to-end layer, including a tutorial on asynchronous socket i/o intended to help you with project 3. To make the most of the tutorial, please review the tutorial by Mazieres et. al (to be posted shortly).

There is no formal assignment for today but plenty of reading is due Friday, please take advantage of the opportunity and get an early start. This is also an opportunity to review the early class material in preparation for the upcoming quiz (see the class web page for the schedule of quizes) .

For Discusion Class, Friday, October 16, 09

For Class Discussion, Friday October 16, 09

The discussion today will be on Network Address Translation. Please read the following papers: "Anatomy: a Look Inside Network Translations" by Geoff Huston.

Your suplementary reading is

discussing the problems with NAT. You may also want to check out these relevant documents.

Address the following question in your prezi (make sure to use figures in your explanations):

Lecture Material:

With extra time to read, we are reading about two systems, DNS and Google search engine. Read about Domain Name System (the Internet's name system). from Chapter 4 Appendix A and also browse the RFC1032 man pages of 'Named' - they describe the Internet Domain name server. Your asignment for the DNS reading includes a Hands-on on DNS.

In addition, read The Anatomy of a Large-Scale Hypertextual Web Search Engine by Sergey Brin and Lawrence Page (the reading is available on-line).

Sergy Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. Proc. of the 7th WWW Conference, Brisbane, Australia (April 1998). Published in: COMPUTER NETWORKS, (formerly Computer Networks and ISDN Systems), Vol 30, 1998, pp 107-117.

The paper considers several themes. First, the authors try to explain how their search engine gives better answers than others. Second, they talk about the system architecture that actually lets them to crawl, index, and search the web quickly enough. They also take the time to present some idealism that work in search engines ought to be shared.

For the purposes of cs146a, it is the second issue, the system architecture, that is most obviously relevant. The real meat of the paper for you is Section 4. The driving feature of their entire design is the sheer size of the Web. (You may find some of their numbers a bit small by today's standards, but in the interim machines and the Web have both scaled up such that their assertions about what can fit where are still pretty accurate.)

Some background on search: while the details differ, essentially all large Web search engines work similarly at the systems level. In order to resolve a query made up of a number of search terms, we want to inspect all the documents and find those that have the most, or the best, occurrences of those search terms. One could imagine doing this by brute force (e.g., using the UNIX command grep), but it is clear that a scheme that actually touches every document on the Web in order to resolve every query is going to have some scaling problems.

Instead, any efficient search engine makes use of an inverted index. This is a data structure that lists, for any word, all of the documents that contain that word. When faced with a search term that appears only in a small fraction of all the documents on the web, the search engine can limit its inspection to only those documents that actually contain that word --- a massive improvement in efficiency. The index is called "inverted" because it maps from each word to documents that have the word; conversely the (much less used) forward index maps from each document to the words in that document.

A Web search engine's work is therefore built around three steps. Its first job is to crawl the web, gathering copies of all of the documents that will be searchable by the engine. The second job is to build the inverted index. This is basically a big sorting job. As a crawler visits a document, it can be thought of as outputting a sequence of pairs (called postings) of the form (document-id,term-id) which, since we visit documents in sequence, will be sorted according to the first, document-id coordinate. For the inverted index, we want the same sequence of pairs, sorted by the second, term-id coordinate. Once the inverted index is built, the search engine resolves a query involving multiple terms by looking up each term in the inverted index, finding its containing documents, and somehow combining the list of documents for each term into a single list. Finally, using heuristics (some of which are mentioned in the paper and others of which Google considers proprietary) it arranges the list in order with the goal that the things the client was probably looking for appear at the top.

The performance of the search engine is generally evaluated in terms of recall (what fraction of the desired documents are actually found) and precision (what fraction of the documents shown to the user are actually desired).

Here is a question for your prezi:

Consider the naming supported by DNS and the naming and finding on the Web.

You all have probably used Google to find a site that you are familiar with but have forgotten its URL. What would happen if DNS was abolished and Google queries were used to resolve all DNS names?

CS146a, Assignment 6, issued 10/9/09