cs146a Project 3: Web Proxy Server

Design Document: 06 November by 11:59 pm
Parsing Server: 13 November by 11:59 pm
Synchronous Proxy: 20 November by 11:59 pm
Asynchronous Proxy: 02 December by 11:59 pm
Extra Credit: 02 December by 11:59 pm
Project Defense: TBA

This project must be written in C and must compile and run on one of the RHEL boxes in the berry patch (click here for a list of available machines).

Introduction

You will design and implement a forward web proxy. A forward web proxy is a proxy server that can interpret specially-formatted proxy requests from web browsers using HTTP. Often a web proxy is used to cache responses from web servers to accelerate performance from the perspective of clients. You will have an extra credit opportunity to add caching to your proxy.

Wikipedia has a nice, high-level overview of proxy servers.

Basic Requirements

Web Proxy Requests

Your web proxy will forward web requests (requests made to web servers using HTTP) from clients in a network to web server hosts on the Internet. This type of proxy server is called a forward proxy. A forward proxy requires special client configuration, since the format of a proxy request differs slightly from the HTTP requests made to regular web servers. In HTTP/1.0, the format of a proxy request looks like this:

VERB http://server-name/document-name HTTP/1.0
Header1: value1
Header2: value2

[Content]

Here is an example of a proxy request sent from the Opera web browser:

GET http://www.cs.brandeis.edu/~cs146a/project3 HTTP/1.0
User-Agent: Opera/9.64 (X11; Linux x86_64; U; en) Presto/2.1.1
Host: www.cs.brandeis.edu
Accept: text/html, application/xml;q=0.9, application/xhtml+xml, */*;q=0.1
Accept-Language: en
Accept-Charset: iso-8859-1, utf-8, utf-16, *;q=0.1
Accept-Encoding: deflate, gzip, x-gzip, identity, *;q=0
Proxy-Connection: Keep-Alive

In this example, the client is expecting the proxy server to request "/~cs146a/project3" from "www.cs.brandeis.edu" and to return to it (the client) whatever the remote server (www.cs.brandeis.edu) responds. A web proxy will rewrite the request such that the protocol and server-name parts of the URL are removed, and will instead use the server-name to open a second socket to the remote server. Your web proxy will be transparent, meaning that the request will be otherwise unchanged. To be explicit, the example request above should be rewritten by your web proxy to look like this:

GET /~cs146a/project3 HTTP/1.0
User-Agent: Opera/9.64 (X11; Linux x86_64; U; en) Presto/2.1.1
Host: www.cs.brandeis.edu
Accept: text/html, application/xml;q=0.9, application/xhtml+xml, */*;q=0.1
Accept-Language: en
Accept-Charset: iso-8859-1, utf-8, utf-16, *;q=0.1
Accept-Encoding: deflate, gzip, x-gzip, identity, *;q=0
Proxy-Connection: Keep-Alive

This request should be sent to the web server at www.cs.brandeis.edu.

You may not rely on the "Host" header (such as the one in the example shown above) to determine the remote host which your server must proxy. This header is not always what you think it is, and additionally may not be provided by some clients.
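
Since the remote host has to come from the request line rather than the "Host" header, here is a minimal sketch of splitting the URL from the request line into its parts. The struct proxy_target and the function parse_proxy_url are hypothetical names invented for this example, and the sketch assumes a lowercase "http://" scheme, a default port of 80, and a default document-name of "/". Treat it as one possible starting point, not the required design.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical container for the pieces of a proxy request URL. */
struct proxy_target {
    char host[256];
    char port[6];     /* decimal digits; defaults to "80" (assumption) */
    char path[1024];  /* document-name; defaults to "/" (assumption) */
};

/*
 * Split "http://host[:port][/path]" into host, port, and path.
 * Returns 0 on success, -1 if the URL is malformed or a field is too long.
 */
int parse_proxy_url(const char *url, struct proxy_target *out)
{
    static const char scheme[] = "http://";
    const size_t scheme_len = sizeof(scheme) - 1;
    const char *host, *path, *colon;
    size_t auth_len, host_len;

    assert(url != NULL && out != NULL);
    if (strncmp(url, scheme, scheme_len) != 0)
        return -1;
    host = url + scheme_len;

    /* host[:port] ends at the first '/' or at the end of the string. */
    auth_len = strcspn(host, "/");
    path = host + auth_len;

    /* Only look for ':' inside host[:port], never inside the path. */
    colon = memchr(host, ':', auth_len);
    host_len = colon ? (size_t)(colon - host) : auth_len;
    if (host_len == 0 || host_len >= sizeof(out->host))
        return -1;
    memcpy(out->host, host, host_len);
    out->host[host_len] = '\0';

    if (colon != NULL) {
        size_t port_len = auth_len - host_len - 1;
        if (port_len == 0 || port_len >= sizeof(out->port))
            return -1;
        if (strspn(colon + 1, "0123456789") < port_len)
            return -1;                    /* non-digit in the port */
        memcpy(out->port, colon + 1, port_len);
        out->port[port_len] = '\0';
    } else {
        strcpy(out->port, "80");
    }

    if (*path == '\0')
        path = "/";
    if (strlen(path) >= sizeof(out->path))
        return -1;
    strcpy(out->path, path);
    return 0;
}
```

For the Opera example above, this would yield the host "www.cs.brandeis.edu", the port "80", and the path "/~cs146a/project3".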

You can read more about HTTP/1.0 proxy requests in RFC 1945, especially section 5.1.2.

Web Proxy Responses

A web proxy responds to the client with whatever data is returned in the response from the proxied web server. This includes any type of data, including both text files (e.g., HTML files) and binaries (e.g., images). Your web proxy should be transparent, meaning that you will not change the response in any way.

From a programming perspective, your web proxy will have up to two open TCP connections for each client:

  1. a TCP connection between the client and the web proxy; and,
  2. a TCP connection between the web proxy and the remote server that is opened after the web proxy receives the server name from the client.
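
The second connection in the list above could be opened roughly as follows. This is a sketch only: connect_to_host is a hypothetical helper name, and it uses a blocking connect(), which is fine for the synchronous stages but would need rethinking once you introduce non-blocking sockets.

```c
#include <assert.h>
#include <netdb.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Open a TCP connection to host:port (both given as strings).
 * Returns a connected socket descriptor, or -1 on failure.
 */
int connect_to_host(const char *host, const char *port)
{
    struct addrinfo hints, *res, *ai;
    int fd = -1;
    int rc;

    assert(host != NULL && port != NULL);
    memset(&hints, 0, sizeof(hints));
    hints.ai_family = AF_UNSPEC;      /* IPv4 or IPv6 */
    hints.ai_socktype = SOCK_STREAM;  /* TCP */

    rc = getaddrinfo(host, port, &hints, &res);
    if (rc != 0) {
        fprintf(stderr, "getaddrinfo: %s\n", gai_strerror(rc));
        return -1;
    }

    /* Try each returned address until one of them connects. */
    for (ai = res; ai != NULL; ai = ai->ai_next) {
        fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
        if (fd < 0)
            continue;
        if (connect(fd, ai->ai_addr, ai->ai_addrlen) == 0)
            break;                    /* success: keep this descriptor */
        close(fd);
        fd = -1;
    }
    freeaddrinfo(res);
    return fd;
}
```

Note that the sketch releases the address list and closes every socket that fails to connect, so a failed attempt does not leak descriptors.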

Request / Response Cycle

Here is a diagram that illustrates an example request for an HTML file (that contains the text "<strong>Hello</strong> world!") from a client to the web server at www.cs.brandeis.edu (this file doesn't really exist):

Listening for New Clients

Your web proxy server will listen for new clients on a specified port. Remember that only clients making HTTP requests specially formatted for being understood by a web proxy will connect in this way. Your web proxy will initiate connections to remote servers, not listen for them.

Handling Failures

A web proxy must handle the failure of either or both of the client and web server. For example, your web proxy should not crash or fail to release file descriptors and other resources if a client or remote web server stops responding.
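
One common way a proxy crashes is by writing to a peer that has already closed its end, which by default delivers SIGPIPE and terminates the process. The sketch below shows one way to avoid that and to release descriptors safely. The helper names ignore_sigpipe, peer_gone, and close_fd are made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <unistd.h>

/*
 * Ignore SIGPIPE so that writing to a peer that has gone away makes
 * write() fail with EPIPE instead of terminating the process.
 * Returns 0 on success, -1 on failure.
 */
int ignore_sigpipe(void)
{
    return signal(SIGPIPE, SIG_IGN) == SIG_ERR ? -1 : 0;
}

/* Given errno from a failed read() or write(), has the peer gone away? */
int peer_gone(int err)
{
    return err == EPIPE || err == ECONNRESET;
}

/* Close a descriptor if it is open and mark it closed; safe to call twice. */
void close_fd(int *fd)
{
    assert(fd != NULL);
    if (*fd >= 0) {
        close(*fd);
        *fd = -1;
    }
}
```

Calling ignore_sigpipe() once at startup and routing every cleanup path through something like close_fd() makes it harder to leak or double-close a descriptor when one side fails.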

Invoking your server

Your web proxy server must be named "proxy" and must accept a port number on the command line. You must print a usage string if the wrong number of arguments is given. Your web proxy server will listen on the specified port for incoming client connections. Here is an example invocation:

./proxy 8888

The user of your program may provide any port number they wish, but remember that the OS will not let an unprivileged process listen on low-numbered ports (on Linux, those below 1024).
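
A small sketch of validating the port argument follows. The function name parse_port is hypothetical, and rejecting anything outside 1 to 65535 is a choice made for this example; the assignment only requires a usage string when the argument count is wrong.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/*
 * Convert a command-line argument to a TCP port number.
 * Returns the port in 1..65535, or -1 if the argument is not a valid port.
 */
int parse_port(const char *arg)
{
    char *end;
    long value;

    assert(arg != NULL);
    errno = 0;
    value = strtol(arg, &end, 10);
    /* Reject overflow, empty input, and trailing junk such as "80x". */
    if (errno != 0 || end == arg || *end != '\0')
        return -1;
    if (value < 1 || value > 65535)
        return -1;
    return (int)value;
}
```

In main you might check that argc is 2, call parse_port(argv[1]), and print the usage string to stderr and exit with a non-zero status if either check fails.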

Project Structure

You will complete Project 3 in four stages. Each stage contributes to your overall project grade.

Stage                 Points
Design Document       20
Parsing Server        15
Synchronous Proxy     40
Asynchronous Proxy    25
Total                 100

To complete all four stages of the project, you must start early, work steadily, and practice careful source control. For each stage we expect to see a program that accomplishes that stage's tasks and nothing more. You must complete a stage to move on to the next stage, but you are encouraged to work ahead of the deadlines. If you stop at Stage 3, you can earn a maximum of 75 points.

Stage 1: Design Document

You must write a one-page description of the design for your web proxy server. Careful design up front can save you time and stress later by helping you think through the mechanisms you will program and the tradeoffs you will need to make when implementing them.

Your design document should include a description of the flow of requests and responses, how the proxy server will parse requests, and how you will introduce asynchrony. You must also describe how you will test your work, and what steps you will take to use each assigned stage as a building block for the next stage. You should not include code, but you may include pseudocode or list system calls or library functions that you plan on using.

Additional criteria you should describe and defend include:

Stage 2: Parsing Server

This server acts as one-half of a one-client-at-a-time proxy server. It is a server which waits for a connection from a client, parses the client request, and prints to the server console the following information in a precise format:

Remote host: server-name
Remote port: server-port
Resource name: document-name

rewritten-request

Since you only need to print this to the server console, your web proxy will not send anything back to the client (yet); this means that the client will hang, and that's okay for now.
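
As a rough illustration of the output described above, the sketch below rewrites the first line of a proxy request and prints the report. Both function names (rewrite_request_line and print_report) are invented for this example, and the code assumes the request line uses a lowercase "http://" URL and that the host, port, and resource name have already been extracted elsewhere.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Turn "VERB http://server-name/document-name HTTP/1.0" into
 * "VERB /document-name HTTP/1.0". Anything after the URL is copied as-is.
 * Returns 0 on success, -1 if the line is malformed or out is too small.
 */
int rewrite_request_line(const char *line, char *out, size_t size)
{
    const char *sp, *auth, *rest;
    int n;

    assert(line != NULL && out != NULL);
    sp = strchr(line, ' ');
    if (sp == NULL || strncmp(sp + 1, "http://", 7) != 0)
        return -1;
    auth = sp + 1 + 7;
    /* server-name[:port] ends at '/' (a path follows) or ' ' (no path). */
    rest = auth + strcspn(auth, "/ ");
    if (rest == auth)
        return -1;                       /* empty server-name */
    n = snprintf(out, size, "%.*s %s%s",
                 (int)(sp - line), line,   /* the VERB */
                 *rest == '/' ? "" : "/",  /* supply "/" if no path given */
                 rest);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}

/* Print the Stage 2 report in the format shown above. */
void print_report(FILE *fp, const char *host, const char *port,
                  const char *path, const char *rewritten)
{
    fprintf(fp, "Remote host: %s\n", host);
    fprintf(fp, "Remote port: %s\n", port);
    fprintf(fp, "Resource name: %s\n", path);
    fprintf(fp, "\n%s", rewritten);
}
```

Only the request line changes; the headers and any content that follow it would be appended to the rewritten request unchanged.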

This server should not use fork, threads, or asynchronous I/O. It is a simple server that only allows one client at a time. You will add asynchrony to handle multiple simultaneous clients in Stage 4.

Stage 3: Synchronous Proxy

In stage 2, you produced a server which is capable of accepting one client at a time and rewriting their proxy request into a normal HTTP request plus a remote server name and port to which that request should be sent. Now, your goal is to add proxying to your web proxy. For this stage your server will remain capable of handling only one client at a time, so do not add asynchrony yet (and do not under any circumstances use fork or multithreading).

To complete stage 3, your server should:

  1. accept client requests and parse them to find out the proxied server name (from stage 2)
  2. connect to the remote server (this is a second TCP connection opened while still connected to the client)
  3. send the rewritten request to the remote server (remember that the VERB, protocol identifier, and optional headers and content remain the same, only the URL is rewritten into a document-name)
  4. read the response from the remote server
  5. send the response from the remote server back to the client
  6. close both sockets (close the socket for the connection to the server and the socket for the connection to the client)
  7. go back to listening for a client connection
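
Steps 4 and 5 above amount to copying bytes from one descriptor to another until the remote server closes its end. Here is a minimal blocking sketch, assuming an HTTP/1.0 server that signals the end of the response by closing the connection. The names write_all and relay are hypothetical. Because the loop works on raw bytes and lengths rather than C strings, it handles images and other binary data.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Write exactly len bytes, retrying on short writes and EINTR. */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/*
 * Copy bytes from src to dst until src reports end-of-file.
 * Returns the total number of bytes copied, or -1 on error.
 */
long relay(int src, int dst)
{
    char buf[4096];
    long total = 0;

    assert(src >= 0 && dst >= 0);
    for (;;) {
        ssize_t n = read(src, buf, sizeof(buf));
        if (n == 0)
            return total;            /* peer closed: response is complete */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (write_all(dst, buf, (size_t)n) < 0)
            return -1;
        total += n;
    }
}
```

After relay(server_fd, client_fd) returns, whether it succeeded or failed, both sockets would be closed (step 6) before going back to accept (step 7).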

Your server must work with all file types, including images. It is okay if your server only supports GET at this stage.

Stage 4: Asynchronous Proxy

You have reached the final stage of project 3!

In this stage you will modify your web proxy to use asynchronous I/O so that it can handle multiple simultaneous clients, each connecting to any remote web server. You must do this using the select() system call and non-blocking sockets. Refer to the Using TCP through Sockets handout for examples of how to set sockets to non-blocking mode and how to use select.

Remember that both connections (between the client and web proxy, and between web proxy and remote server) must be non-blocking, and that both will be treated equally by the select call in your web proxy program. It will be up to you to manage the relationship between these two TCP connections.
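
The sketch below illustrates the two building blocks named above: putting a descriptor into non-blocking mode and asking select() which of two descriptors is readable. It is not a complete event loop, and the names set_nonblocking, try_read, wait_readable, and READ_WOULD_BLOCK are invented for this example. A real proxy would watch many descriptors, for both reading and writing, in a single select call.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/select.h>
#include <unistd.h>

#define READ_WOULD_BLOCK (-2)   /* hypothetical "no data yet" return code */

/* Put fd into non-blocking mode. Returns 0 on success, -1 on error. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Read from a non-blocking fd. Returns bytes read, 0 on end-of-file,
 * READ_WOULD_BLOCK if no data is available yet, or -1 on a real error.
 */
ssize_t try_read(int fd, char *buf, size_t len)
{
    ssize_t n = read(fd, buf, len);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK || errno == EINTR))
        return READ_WOULD_BLOCK;
    return n;
}

/*
 * Wait up to timeout_ms for either descriptor to become readable.
 * Returns a bitmask (bit 0: fd_a readable, bit 1: fd_b readable),
 * 0 on timeout, or -1 on error.
 */
int wait_readable(int fd_a, int fd_b, int timeout_ms)
{
    fd_set rfds;
    struct timeval tv;
    int maxfd = fd_a > fd_b ? fd_a : fd_b;

    assert(fd_a >= 0 && fd_b >= 0);
    FD_ZERO(&rfds);
    FD_SET(fd_a, &rfds);
    FD_SET(fd_b, &rfds);
    tv.tv_sec = timeout_ms / 1000;
    tv.tv_usec = (timeout_ms % 1000) * 1000;

    if (select(maxfd + 1, &rfds, NULL, NULL, &tv) < 0)
        return -1;
    return (FD_ISSET(fd_a, &rfds) ? 1 : 0) | (FD_ISSET(fd_b, &rfds) ? 2 : 0);
}
```

The timeout also gives you a natural place to notice a client or server that has stopped responding.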

You will not receive credit for Stage 4 (i.e., you will get a 0) if you employ fork() or multithreading. You must use asynchronous I/O.

Handling errors

We expect your web proxy to handle errors gracefully, and to not hang if the client or remote server stops responding. You will want to refer to your design document carefully here.

malloc and free

You may want to use malloc in your code for stage 4. If you are unfamiliar with heap management in C, I recommend that you review a basic C programming reference for help with malloc() and free(). The man pages for these library functions are probably too terse to learn from, but they are a good reference if you have done memory management in C before.
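
As a refresher, here is a small sketch of a heap-allocated buffer that grows as data arrives. The struct buffer type and the functions buffer_append and buffer_free are hypothetical, and the initial capacity of 1024 bytes and the doubling policy are arbitrary choices for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical growable byte buffer; initialize with { NULL, 0, 0 }. */
struct buffer {
    char *data;
    size_t len;   /* bytes in use */
    size_t cap;   /* bytes allocated */
};

/* Append n bytes from src. Returns 0 on success, -1 if out of memory. */
int buffer_append(struct buffer *b, const char *src, size_t n)
{
    assert(b != NULL && src != NULL);
    if (n > b->cap - b->len) {
        size_t cap = b->cap ? b->cap : 1024;
        char *p;

        while (cap - b->len < n) {
            if (cap > SIZE_MAX / 2)
                return -1;           /* doubling would overflow */
            cap *= 2;
        }
        /* On failure realloc leaves the old block valid, so keep it. */
        p = realloc(b->data, cap);
        if (p == NULL)
            return -1;
        b->data = p;
        b->cap = cap;
    }
    memcpy(b->data + b->len, src, n);
    b->len += n;
    return 0;
}

/* Release the storage and reset the buffer so it can be reused. */
void buffer_free(struct buffer *b)
{
    assert(b != NULL);
    free(b->data);
    b->data = NULL;
    b->len = 0;
    b->cap = 0;
}
```

Every successful allocation needs exactly one matching free, including on the error paths where a client or server disappears partway through a request.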

Strategy

This stage will be easiest if your work for the previous 2 stages is well-modularized. You will need to set up callbacks to respond to read and write readiness of sockets, and these callbacks can leverage the code you have already written to solve proxying for a single (non-simultaneous) client. A key change that must be made to solve stage 4 is that you will need to allocate many buffers, as well as manage a mapping between file descriptors and the buffer from which they read and the buffer to which they write.
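
One possible shape for the mapping between file descriptors and buffers is sketched below, assuming one record per client that is reachable from both of its descriptors. All names here (struct conn, conn_new, conn_attach_server, conn_lookup, conn_free) and the sizes MAX_FDS and BUF_SIZE are invented for this example. Your design document may well call for something different.

```c
#include <assert.h>
#include <stdlib.h>

#define MAX_FDS  1024   /* assumed upper bound on descriptor values */
#define BUF_SIZE 8192   /* arbitrary per-direction buffer size */

/* One record per client, shared by its two TCP connections. */
struct conn {
    int client_fd;
    int server_fd;                 /* -1 until the remote server is known */
    char to_server[BUF_SIZE];      /* read from client, written to server */
    size_t to_server_len;
    char to_client[BUF_SIZE];      /* read from server, written to client */
    size_t to_client_len;
};

static struct conn *conn_by_fd[MAX_FDS];   /* fd -> owning record */

struct conn *conn_new(int client_fd)
{
    struct conn *c;

    if (client_fd < 0 || client_fd >= MAX_FDS)
        return NULL;
    c = calloc(1, sizeof(*c));
    if (c == NULL)
        return NULL;
    c->client_fd = client_fd;
    c->server_fd = -1;
    conn_by_fd[client_fd] = c;
    return c;
}

int conn_attach_server(struct conn *c, int server_fd)
{
    assert(c != NULL);
    if (server_fd < 0 || server_fd >= MAX_FDS)
        return -1;
    c->server_fd = server_fd;
    conn_by_fd[server_fd] = c;
    return 0;
}

/* Find the record for whichever descriptor select() reported as ready. */
struct conn *conn_lookup(int fd)
{
    return (fd >= 0 && fd < MAX_FDS) ? conn_by_fd[fd] : NULL;
}

/* Unregister both descriptors and free the record (caller closes the fds). */
void conn_free(struct conn *c)
{
    if (c == NULL)
        return;
    conn_by_fd[c->client_fd] = NULL;
    if (c->server_fd >= 0)
        conn_by_fd[c->server_fd] = NULL;
    free(c);
}
```

When select() reports a descriptor as ready, conn_lookup(fd) tells a callback which record it belongs to, and comparing fd with client_fd or server_fd tells it which buffer to read into or write from.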

Extra Credit: Caching

Making your web proxy cache resources is optional. A completely functioning web cache is worth up to 20% extra credit on this assignment, but we cannot assign extra credit if your asynchronous proxy does not work, so be sure to do the extra credit only after you have completed the rest of the assignment.

If you choose to implement caching, then your web proxy will cache web resources (web pages, images, etc.) locally so that subsequent requests for those resources can be fulfilled without actually requesting them from the remote server. This can greatly improve the proxy's performance.

You must decide how you will keep your web cache coherent. Clients dislike stale pages; however, a stale page may still have some value, so there is a trade-off between coherency and performance. Document your decision and why you made it in your final README.

Some examples of cache coherency in a web proxy: cache resources for a fixed period of time before allowing them to be refreshed from the remote server; use the If-Modified-Since header to conditionally request a resource from the remote server; or, respect the Expires header sent by the remote server.
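
To make these options concrete, here is a sketch of a freshness check that respects an Expires time when one was given and otherwise falls back to a fixed lifetime, plus a helper that formats an If-Modified-Since header. The struct cache_entry layout and all function names are hypothetical. Parsing the Expires header into a time_t is not shown.

```c
#include <assert.h>
#include <string.h>
#include <time.h>

/* Hypothetical metadata kept alongside each cached resource. */
struct cache_entry {
    char url[512];
    time_t fetched_at;   /* when the resource was stored */
    time_t expires_at;   /* parsed Expires header, or 0 if none was sent */
};

/* Does this entry hold the resource named by url? */
int cache_matches(const struct cache_entry *e, const char *url)
{
    return strcmp(e->url, url) == 0;
}

/*
 * Fresh if the server-supplied expiry has not passed, or, when no expiry
 * was supplied, if the entry is younger than max_age seconds.
 */
int cache_is_fresh(const struct cache_entry *e, time_t now, time_t max_age)
{
    if (e->expires_at != 0)
        return now < e->expires_at;
    return now - e->fetched_at < max_age;
}

/*
 * Write "If-Modified-Since: <HTTP date>\r\n" into buf.
 * Returns the number of characters written, or 0 if buf is too small.
 */
size_t format_if_modified_since(char *buf, size_t size, time_t when)
{
    struct tm tm;

    assert(buf != NULL);
    if (gmtime_r(&when, &tm) == NULL)
        return 0;
    return strftime(buf, size,
                    "If-Modified-Since: %a, %d %b %Y %H:%M:%S GMT\r\n", &tm);
}
```

A stale entry does not have to be thrown away: the proxy could send a conditional request with this header and keep serving the cached copy if the remote server reports that the resource has not changed.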

Caching and asynchrony

If you store cached web pages on disk, you will have to consider the impact of this decision on asynchrony. You can earn most of the extra credit for storing cached pages in memory (provided you limit memory consumption somehow). You may also choose to employ the AIO kernel support in Linux 2.6. If you wish to implement thread pools or use fork to prevent disk I/O from blocking your server this is also fine, but you must make sure that you are not using these techniques for any part of stage 4 or you will lose points. If you have questions, please talk to the TA about how you plan to implement caching.

Additional resources about caching

Tips

Testing with telnet

You can make HTTP/1.0 proxy requests to a proxy server running on localhost at port 8888 with telnet like this:

telnet localhost 8888
Trying 127.0.0.1
Connected to localhost.
Escape character is '^]'.
GET http://www.cs.brandeis.edu/~cs146a HTTP/1.0

Make sure that you press return twice after the command.

Testing with links

You may use text-mode browser links to browse the web from the console (useful if you are connecting to a berry patch machine via ssh) using your proxy server on localhost at port 8888 like this:

http_proxy=http://localhost:8888/ links -source http://www.brandeis.edu

Check the man page for links on your machine. I've found that another version, installed on a different machine, expected the proxy to be given like this:

links -http-proxy localhost:8888 -source http://www.brandeis.edu

Testing with firefox

Because you must ensure that images work too, you will want to test in a graphical browser as well. To use a proxy in Firefox, do the following:

  1. Visit the page about:config
  2. Scroll down until you see the Preference Names beginning with "network" (the Filter box can help)
  3. Make the following changes to your settings:
  4. Go to the Firefox preferences, and under the Advanced tab click the connections settings button and change the proxy to manual config. Set the server to localhost and the port to whatever you provide as the port argument to your proxy on the command line (e.g., 8888).

Collaboration

How to Hand In

Submission instructions appear in the FAQ.

Defense

You will have to defend your project 3. After we receive your submission, we will send your group an email to schedule the defense time. During the defense, you will explain and defend your design decisions and demonstrate your project.