cs146a Project 2: Web Proxy Server

Due: 2011-11-11 (Friday) by 11:59pm

This project must be written in C and must compile (with gcc) and run on one of the iMacs in the Vertica Lounge or one of the RHEL boxes in the conference room or machine room (click here for a list of available machines).

You must include a Makefile for compiling from the command line (you are free to use an IDE during development, but your grader will expect to type `make webproxy` to build your project).

This is a team project; you should continue to work with the same team as in Project 1.

Introduction

In this second milestone, you will modify the TCP proxy that you wrote in Project 1 to act as a web proxy. A web proxy server operates using HTTP, an application-level protocol. You will also add threading so that your server can proxy for multiple clients concurrently.

You will be making use of libraries for this milestone. You will be using the POSIX package for threading, Pthreads. Pthreads should be installed already on your development machine. The Pthreads threading model is a lot like Java threads, so the concepts should be familiar to you.

We will be using the HTTP parser from Node.js (for those of you familiar with Node.js, the HTTP parser is part of "node-core", the low-level functionality that is built in C). This parser is very easy to use and saves us from the difficult task of writing a fast and reliable HTTP parser. Although this milestone doesn't require much in the way of parsing, the third milestone will, so it's important to get the parser integrated with your project now.

Here are links if you want to look at these tools right away. Instructions on how to use them are given later in this assignment.

Web proxy basics

Web proxies have various uses. A somewhat outdated use for web proxies is to give many clients inside an internal network access to the web through a single IP address (if you have NAT, you don't need this). A web proxy can be used to filter content (e.g., to provide additional security or remove offending content). A web proxy can improve performance in some cases by caching frequently-accessed content. You can also use a web proxy to circumvent network security (imagine that there is a web server that is only accessible to hosts inside the Brandeis network; if you can set up a web proxy server on another machine in the Brandeis network and access that proxy from outside Brandeis, then you may be able to gain access to the protected system from an external network).

Web proxies that allow any Internet user to use them to access any host are called open proxies. You will be coding an open proxy. Sometimes open proxies are used to provide a form of anonymity (your IP address is hidden; only the address of the proxy is known to the remote server), but this relies on the assumption that the open proxy itself is not logging your activity.

Web proxy examples

Web proxies are important in the mobile browser space. Proxy servers can be used to compress and optimize web pages for mobile browsers running on small screens and transferring data over slow, unreliable connections. Opera Mini uses a proxy server run by Opera Software that reformats pages for small screens and compresses the content. Amazon's Silk (which will debut on the Kindle Fire) likely uses similar techniques to route your web traffic through EC2 for caching and reformatting (and of course tracking to monetize your web browsing habits).

Outside of the mobile browser space, there are a number of open source proxies which aim to improve web performance for users on a network when they connect to the wider Internet. A good example is Squid. Note that Squid leverages asynchronous I/O for performance reasons (like your TCP proxy), and also uses multiple processes for SMP (remember the Flash web server paper?). Your project will be written from the ground up to support kernel threads instead of heavier-weight processes.

Invoking your web proxy

Your Makefile should name the web proxy binary webproxy. webproxy should take a single argument, the port the server should listen to for new client connections. As in Project 1, it should print a usage message to the console and exit if executed with no arguments.

$ ./webproxy
Usage: ./webproxy server_port
$ ./webproxy 8888

HTTP

Your web proxy will support HTTP/1.0. HTTP is a request/response protocol in which a web client (typically a web browser) makes a request to a web server using TCP, and the server sends a response back and then terminates the TCP connection. The web client and web server are two separate programs which may or may not be on separate machines (they usually are).

There's a fair amount of detail given here about HTTP, but don't worry, you won't actually be dealing too much with the details of HTTP. It's important to have a basic understanding of the protocol in order to debug your proxy, so just bear with the details for now, and we'll return to the project at hand soon.

HTTP requests

The basic contents of an HTTP request are as follows:

  1. A method, which indicates what kind of request the client is making. A method is one of:
    1. GET
    2. HEAD
    3. POST
  2. A request URI, which indicates the resource the client is requesting.
  3. The protocol, which should be HTTP/1.0 in this project.
  4. Zero or more headers, which allow the client to specify extra information to the server. We'll talk a lot more about headers in the next milestone, but for now we won't worry about them much (except Content-Length).
  5. A blank line
  6. A body. A POST request may include data to upload to the server (when you click "submit" on a form, you are often POSTing data to the server). Whether or not there is a body depends on whether the special header Content-Length was in the list of headers. The Content-Length header tells how many bytes of data there will be in the body.

Here is an example HTTP request:

GET /about/defining.html HTTP/1.0
Host: www.brandeis.edu
User-Agent: Mozilla/5.0 (X11; U; Linux i586; de; rv:5.0) Gecko/20100101 Firefox/5.0

We can identify the pieces of this request:

method
GET
URI
/about/defining.html
protocol
HTTP/1.0
headers
The name of the host (www.brandeis.edu) and the identification of the client's browser program (Firefox 5.0)

Note that in HTTP, lines are terminated with carriage return followed by line feed, abbreviated CRLF. These characters can be written in C with the escape sequence "\r\n". If we annotate the request with CRLFs we can see the blank line more clearly:

GET /about/defining.html HTTP/1.0\r\n
Host: www.brandeis.edu\r\n
User-Agent: Mozilla/5.0 (X11; U; Linux i586; de; rv:5.0) Gecko/20100101 Firefox/5.0\r\n
\r\n
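Because the blank line is simply a CRLF that immediately follows the CRLF ending the last header, the end of the headers can be found by searching for the four bytes "\r\n\r\n". The following is a minimal sketch of that idea, assuming the bytes read so far sit in one contiguous buffer; the name find_header_end is hypothetical and not part of the assignment.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return the offset just past the blank line that ends the headers,
 * or 0 if the buffer does not yet contain a complete "\r\n\r\n".
 * The buffer is treated as raw bytes; it need not be NUL-terminated. */
size_t find_header_end(const char *buf, size_t len) {
    assert(buf != NULL);
    for (size_t i = 0; i + 4 <= len; i++) {
        if (memcmp(buf + i, "\r\n\r\n", 4) == 0)
            return i + 4;
    }
    return 0;
}
```

A return value of 0 means "keep reading"; any other value is the index at which the body (if there is one) begins.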

HTTP responses

Web servers send responses in a very similar format to requests from web clients. They consist of a status line, headers, and body. The headers and body are formatted like a request, so we only go over the status line in detail:

  1. The protocol (HTTP/1.0 for this project)
  2. A status code (a number)
  3. A reason phrase (text that describes the status code)

Here is an example HTTP response:

HTTP/1.0 200 OK
Date: Wed, 26 Oct 2011 01:01:01 GMT
Content-Type: text/html
Content-Length: 5

hello
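To make the three parts of the status line concrete, here is a small sketch that pulls the status code out of a status line with sscanf. It assumes the line has already been copied into a NUL-terminated string; parse_status_line is a hypothetical helper, and this milestone does not require you to inspect status codes from the remote server.

```c
#include <assert.h>
#include <stdio.h>

/* Extract the status code from a NUL-terminated status line such as
 * "HTTP/1.0 200 OK". Returns 0 on success, -1 if the line does not
 * look like "HTTP/<major>.<minor> <code>" or the code is outside
 * the range 100..599. */
int parse_status_line(const char *line, int *code) {
    int major, minor, status;
    assert(line != NULL && code != NULL);
    if (sscanf(line, "HTTP/%d.%d %d", &major, &minor, &status) != 3)
        return -1;
    if (status < 100 || status > 599)
        return -1;
    *code = status;
    return 0;
}
```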

HTTP for web proxies

For this milestone, we will write a mostly correct web proxy. To design our proxy, we only need to observe how the example request from above would be sent to a web proxy instead of directly to www.brandeis.edu:

GET http://www.brandeis.edu/about/defining.html HTTP/1.0
Host: www.brandeis.edu
User-Agent: Mozilla/5.0 (X11; U; Linux i586; de; rv:5.0) Gecko/20100101 Firefox/5.0

Notice that the URI is absolute: it includes a protocol name and (most importantly for us) the name of the host to which the request should be sent. The job of our proxy is to strip off the protocol and hostname from the request. The proxy will leave the rest of the request (including headers and body, if any) unchanged. The proxy then forwards the rewritten request to the host specified in the absolute URI.

Many HTTP requests put the host in the Host header as well as giving it in the absolute URI. You should not rely on this header, since it serves a slightly different purpose and may not always match the hostname given in the absolute URI (the Host header is also not required, so it may not be present in all HTTP requests your proxy receives). You should always proxy to the host specified in the absolute URI in the first line of the HTTP request.

Web proxy request / response cycle

The following diagram illustrates an example request for a (fictional) web page from a client, through your web proxy, to the server www.cs.brandeis.edu. Notice that the URI gets rewritten by the proxy, while the response is passed back to the client as-is.

HTTP Resources

Step 1: Parse and proxy HTTP

Request and response

You will parse and rewrite the request, send it to the appropriate host, and then send the response verbatim back to the client, after which you may close the connections to the remote client and server. The only exception to this program flow is if the client sends you a bad request or if there is a problem with the remote server.

Content-Length

You will have to recognize only one header (for now), the Content-Length header. You have to extract the value of the Content-Length header in order to know how many bytes of data to read following the headers. If Content-Length is not present in a request or response, then that request or response has no body and you can stop reading it after the first blank line.
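Recognizing the header comes down to two small checks: a case-insensitive comparison of the header name (HTTP header names are case-insensitive) and a careful conversion of the value to a number. The sketch below shows one way to write both, assuming the name and value arrive as pointer-plus-length pairs that are not NUL-terminated, which is how http-parser hands data to its callbacks. The helper names is_content_length and parse_content_length are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <strings.h>   /* strncasecmp (POSIX) */

/* True if the header name (length len, not NUL-terminated) is
 * Content-Length, compared without regard to case. */
int is_content_length(const char *name, size_t len) {
    static const char want[] = "Content-Length";
    assert(name != NULL);
    return len == sizeof(want) - 1 && strncasecmp(name, want, len) == 0;
}

/* Parse a header value such as "5" into *out. Returns 0 on success,
 * or -1 if the value is empty, too long, negative, or not a number. */
int parse_content_length(const char *value, size_t len, long *out) {
    char digits[20];
    assert(value != NULL && out != NULL);
    if (len == 0 || len >= sizeof(digits))
        return -1;
    /* Copy so that strtol sees a NUL-terminated string. */
    memcpy(digits, value, len);
    digits[len] = '\0';

    char *endp;
    errno = 0;
    long v = strtol(digits, &endp, 10);
    if (errno != 0 || endp == digits || *endp != '\0' || v < 0)
        return -1;
    *out = v;
    return 0;
}
```

A value that fails to parse is a reasonable trigger for the error responses described under "Status codes".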

Status codes

HTTP responses contain status codes. You do not have to know about most of them for this milestone, since responses will be sent back to the client as-is (i.e., in this milestone, your web proxy does not rewrite responses). However, a client might send you a malformed request. For example, a misconfigured client might send you an HTTP request without an absolute URI. If this happens, your proxy will have no way of knowing what remote server to connect to in order to proxy. You should send the following response to the client:

HTTP/1.0 400 Bad Request

It may also happen that the client sends a well-formed request, but the server they requested access to is not accessible or sends an invalid HTTP response, for whatever reason. In these cases your proxy should send the following response:

HTTP/1.0 502 Bad Gateway

Don't forget the CRLF following each line in these responses, including the blank line at the end of the response.
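As an illustration, the sketch below formats one of these body-less responses and writes it to a descriptor, with both CRLFs in place. It assumes a blocking descriptor for simplicity; with the non-blocking sockets from Project 1 you would route the bytes through your own buffer instead. The names write_all and send_status are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Write all len bytes to fd, retrying after short writes and EINTR.
 * Returns 0 on success, -1 on error. Assumes a blocking descriptor;
 * a non-blocking socket would also have to handle EAGAIN. */
int write_all(int fd, const char *data, size_t len) {
    assert(data != NULL);
    while (len > 0) {
        ssize_t n = write(fd, data, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        data += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Send a body-less HTTP/1.0 response: the status line and then the
 * blank line, each terminated with CRLF. */
int send_status(int fd, int code, const char *reason) {
    char msg[128];
    assert(reason != NULL);
    int n = snprintf(msg, sizeof(msg), "HTTP/1.0 %d %s\r\n\r\n", code, reason);
    if (n < 0 || (size_t)n >= sizeof(msg))
        return -1;
    return write_all(fd, msg, (size_t)n);
}
```

With these helpers, a malformed request could be answered with send_status(client_sock, 400, "Bad Request") and an unreachable server with send_status(client_sock, 502, "Bad Gateway").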

In the normal case of correct behavior by the remote client and remote server, you will not be generating responses in your web proxy; you will simply be transferring the response sent back by the remote server. Note that the server may send other kinds of errors through the proxy (such as the common 404 error indicating that the requested resource could not be found). Your proxy is not responsible for generating these kinds of errors; you should just proxy them like any other response from the remote server.

Integrate http-parser with your server

Your TCP proxy code already has the ability to proxy data between the client and server. The new difficulty added in this milestone is that the remote server is not known until after you have parsed the first line of the HTTP request (since the server you are going to contact is in the absolute URI of the requested resource).

You can assume that the first line of the HTTP request fits in your fixed-size proxying buffer (make your buffer reasonably large, say, 10K). The parser will call a function that you provide with the absolute URI, and you will rewrite that URI and extract the hostname.

http-parser is documented in the README.md file in the source download, which you can read in a nice format on the http-parser project page on github.

Installing http-parser

  1. Download a zip or tar.gz package of the http-parser source code (or, if you are familiar with git, you can clone the repository; see the http-parser main page).
  2. Unpack the http-parser source code and enter the created directory
  3. Note the path to the http-parser source code (you can type `pwd` inside that directory if you aren't sure where you put it)
  4. Add the following to your Makefile, replacing "/path/to/http-parser" with the path you noted in the previous step:
    HTTP_PARSER = /path/to/http-parser
    CC          = gcc
    CFLAGS      = -g -Wall -I$(HTTP_PARSER)
    VPATH       = $(HTTP_PARSER)
    
    /path/to/http-parser/libhttp_parser.a:
    	cd /path/to/http-parser; $(MAKE) package
    
    webproxy: webproxy.o /path/to/http-parser/libhttp_parser.a
    This basic template sets up Make to build the http parser in the directory where you downloaded it, and use the resulting library in your webproxy program. You will of course add to this Makefile if you need to link other .o files into webproxy.

Using http-parser

  1. Add this include to your source files
    #include "http_parser.h"
  2. Add these definitions to your web proxy source code:
    static int message_begin_cb(http_parser *parser) {
        printf("message begin\n");
        return 0;
    }
    
    static int message_complete_cb(http_parser *parser) {
        printf("message complete\n");
        return 0;
    }
    
    static int url_cb(http_parser *parser, const char *s, size_t length) {
        printf("url: %.*s\n", (int)length, s);
        return 0;
    }
    
    static http_parser_settings settings = {
        .on_message_begin    = message_begin_cb,
        .on_header_field     = NULL,
        .on_header_value     = NULL,
        .on_url              = url_cb,
        .on_body             = NULL,
        .on_headers_complete = NULL,
        .on_message_complete = message_complete_cb
    };
    The settings variable tells the parser which functions to call when certain pieces of the HTTP request are encountered. The key callbacks are on_url (set to url_cb here), which is called with the URI part of the request, and on_header_field and on_header_value, which tell you about the header lines (this example hasn't yet set callbacks to handle parsed headers).
  3. Creating a parser is simple, just put this before the loop where you proxy data:
    http_parser *parser = malloc(sizeof(http_parser));
    http_parser_init(parser, HTTP_REQUEST);
        
  4. The HTTP parser is incremental, so it works perfectly with our fixed-size buffer strategy for proxying HTTP. Here is pseudocode for how you invoke the parser:
    if(/* data to read from client_sock */) {
        if((nread = buf_read(buf, client_sock)) < 0) {
            /* handle error */
        } else {
            nparsed = http_parser_execute(parser, &settings, buf_data(buf), nread);
            if(nparsed != nread) {
                /* fatal error */
            }
        }
    }
    If you are using a circular buffer then you may have to split the data being parsed up into two chunks. If you are using the write-then-shift method of handling short writes from your buffer, then this code should be sufficient.
  5. To test, run the above code on an HTTP request written into a file. You should see something like the following output on your console:
    message begin
    url: http://www.cs.brandeis.edu/index.html (position 4 in buffer)
    message complete
  6. You will have to change the on_header_field and on_header_value callbacks in order to recognize the Content-Length header. Note that there are some "gotchas" involved with how these callbacks are structured, see the README on the http-parser github page for more detail.
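For the circular-buffer case mentioned in step 4, the readable bytes may wrap around the end of the storage array, so they form at most two contiguous runs, and each run would be passed to http_parser_execute in turn. The sketch below computes those runs. It makes assumptions about a buffer layout that the assignment does not specify (a capacity, a head index, and a count of readable bytes); SPAN and ring_spans are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

/* One contiguous run of bytes inside the ring's storage array. */
typedef struct {
    size_t offset;   /* index of the first byte of the run */
    size_t length;   /* number of bytes in the run */
} SPAN;

/* Describe the len readable bytes that start at index head, in a
 * ring of the given capacity, as at most two contiguous spans.
 * Returns the number of spans filled in (0, 1, or 2). */
int ring_spans(size_t capacity, size_t head, size_t len, SPAN out[2]) {
    assert(capacity > 0 && head < capacity && len <= capacity);
    if (len == 0)
        return 0;
    size_t first = capacity - head;   /* bytes before the wrap point */
    if (first > len)
        first = len;
    out[0].offset = head;
    out[0].length = first;
    if (first == len)
        return 1;
    out[1].offset = 0;                /* the remainder starts at index 0 */
    out[1].length = len - first;
    return 2;
}
```

Because the parser is incremental, feeding it the first span and then the second gives the same result as feeding it one contiguous chunk.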

Rewriting the request

In the url_cb function, you will be able to extract the hostname. You'll also have to rewrite the request in your buffer before proxying it to the remote server. One way to do it is to rewrite the URL in url_cb. To do this, you'll need a pointer to your buffer object inside url_cb. You can get one by using parser->data, a little bit of user data storage given to you by http-parser.

http_parser *parser = malloc(sizeof(http_parser));
http_parser_init(parser, HTTP_REQUEST);
parser->data = buf;

Then inside url_cb you can get a pointer to your buffer:

static int url_cb(http_parser *parser, const char *s, size_t length) {
    BUF *buf = (BUF *)parser->data;
    // ...
}

Another way to handle this is to copy from your buffer into a regular, null-terminated string before parsing, and then copy back from the flattened string into your buffer after parsing. This way, if the parser function wants to modify the string, those modifications get put into the buffer. This is a little more expensive but might make the callbacks easier to write.
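Whichever approach you choose, the rewrite itself amounts to deleting the "http://host" prefix from the request line and sliding the rest of the data down. Here is a minimal sketch that works on a flat array of bytes, assuming the whole request line is already in the array and the scheme is written in lowercase; strip_absolute_uri is a hypothetical name, and your BUF type will need its own equivalent.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Rewrite "METHOD http://host[:port]/path ..." to "METHOD /path ..."
 * in place. buf holds len bytes of request data (not NUL-terminated).
 * Returns the new length, or 0 if the request line has no absolute
 * http:// URI (the caller would answer 400 Bad Request). */
size_t strip_absolute_uri(char *buf, size_t len) {
    static const char scheme[] = "http://";
    const size_t scheme_len = sizeof(scheme) - 1;
    assert(buf != NULL);

    /* The URI starts after the first space on the request line. */
    char *sp = memchr(buf, ' ', len);
    if (sp == NULL)
        return 0;
    char *uri = sp + 1;
    size_t remaining = len - (size_t)(uri - buf);
    if (remaining < scheme_len || memcmp(uri, scheme, scheme_len) != 0)
        return 0;

    /* The authority (host[:port]) ends at the first '/' or ' '. */
    char *end = buf + len;
    char *p = uri + scheme_len;
    while (p < end && *p != '/' && *p != ' ')
        p++;
    if (p == end || p == uri + scheme_len)
        return 0;   /* truncated line or empty host */

    if (*p == ' ') {
        /* "http://host" with no path: the path becomes "/". */
        *uri++ = '/';
    }
    memmove(uri, p, (size_t)(end - p));
    return len - (size_t)(p - uri);
}
```

The headers and body that follow the request line are moved along with it, so they reach the remote server unchanged.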

Parsing the URI

You can hand-create a parser to extract the hostname from the absolute URI, or you can use a high-quality parser created by someone else. I think that a hand-created parser is easiest in this case, but this parser is also a good choice: uriparser. (It seems like there is a bug in the build system in the current head of the git repository for uriparser, but the tar.gz file linked in the download section seems to build correctly).
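If you go the hand-written route, the parser can be quite small. The sketch below splits an absolute URI into a hostname and a port, defaulting to port 80. It assumes a lowercase "http://" scheme and does not handle user info or IPv6 address literals; split_host_port is a hypothetical name, not a function from uriparser or http-parser.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Split "http://host[:port]/path" (length len, not NUL-terminated)
 * into host and port. host receives a NUL-terminated copy; *port
 * defaults to 80 when the URI names none. Returns 0 on success, or
 * -1 if the URI is not an absolute http:// URI, the host does not
 * fit in host_cap bytes, or the port is invalid. */
int split_host_port(const char *uri, size_t len,
                    char *host, size_t host_cap, int *port) {
    static const char scheme[] = "http://";
    const size_t scheme_len = sizeof(scheme) - 1;
    assert(uri != NULL && host != NULL && port != NULL);

    if (len <= scheme_len || memcmp(uri, scheme, scheme_len) != 0)
        return -1;

    /* The authority runs up to the first '/' or the end of the URI. */
    const char *auth = uri + scheme_len;
    size_t auth_len = 0;
    while (scheme_len + auth_len < len && auth[auth_len] != '/')
        auth_len++;

    /* An optional ":port" suffix follows the host name. */
    const char *colon = memchr(auth, ':', auth_len);
    size_t host_len = colon ? (size_t)(colon - auth) : auth_len;
    if (host_len == 0 || host_len >= host_cap)
        return -1;

    *port = 80;
    if (colon != NULL) {
        /* Copy the digits so that strtol sees a NUL-terminated string. */
        char digits[6];
        size_t n = auth_len - host_len - 1;
        if (n == 0 || n >= sizeof(digits))
            return -1;
        memcpy(digits, colon + 1, n);
        digits[n] = '\0';
        char *endp;
        long v = strtol(digits, &endp, 10);
        if (*endp != '\0' || v < 1 || v > 65535)
            return -1;
        *port = (int)v;
    }

    memcpy(host, auth, host_len);
    host[host_len] = '\0';
    return 0;
}
```

The pointer and length that url_cb receives can be passed straight in, since the function never relies on a terminating NUL in the input.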

Parsing and tracking state

You will need some state as your parser is executing callbacks. Don't store this in global variables. Either put it in a heap-allocated struct and point to that struct from parser->data (recommended), or else use Thread-local storage (TLS).
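One piece of state worth sketching is header accumulation. As the http-parser README describes, a header that straddles two calls to http_parser_execute can be delivered to on_header_field or on_header_value in more than one piece, so the callbacks need to remember which one ran last. The sketch below keeps that state in a struct of the kind you might hang off parser->data. It deliberately avoids including http_parser.h so that it stands alone; HEADER_STATE and the two fragment functions are hypothetical names for what would be the bodies of your callbacks.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum last_cb { LAST_NONE, LAST_FIELD, LAST_VALUE };

/* Per-connection header state; one instance per parser. */
typedef struct {
    enum last_cb last;   /* which header callback ran most recently */
    char field[64];      /* header name accumulated so far */
    size_t field_len;
    char value[64];      /* header value accumulated so far */
    size_t value_len;
} HEADER_STATE;

/* Append a fragment to a bounded, NUL-terminated buffer, silently
 * truncating anything that does not fit. */
static void append(char *dst, size_t *dst_len, size_t cap,
                   const char *src, size_t n) {
    if (n > cap - 1 - *dst_len)
        n = cap - 1 - *dst_len;
    memcpy(dst + *dst_len, src, n);
    *dst_len += n;
    dst[*dst_len] = '\0';
}

/* What an on_header_field callback would do with its (at, len). */
void header_field_fragment(HEADER_STATE *st, const char *at, size_t len) {
    assert(st != NULL && at != NULL);
    if (st->last != LAST_FIELD) {
        /* A new header is starting: discard the previous one. */
        st->field_len = 0;
        st->value_len = 0;
        st->value[0] = '\0';
    }
    append(st->field, &st->field_len, sizeof(st->field), at, len);
    st->last = LAST_FIELD;
}

/* What an on_header_value callback would do with its (at, len). */
void header_value_fragment(HEADER_STATE *st, const char *at, size_t len) {
    assert(st != NULL && at != NULL);
    if (st->last != LAST_VALUE)
        st->value_len = 0;
    append(st->value, &st->value_len, sizeof(st->value), at, len);
    st->last = LAST_VALUE;
}
```

Because each connection gets its own struct, this design stays safe once several threads are parsing at the same time.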

Step 2: Add threading

As long as your program does not rely on global variables, this step is actually fairly simple. You will want to consult the man page for how to create threads:

man pthread_create

You will need to include the Pthreads header:

#include <pthread.h>

The last bit of boilerplate is that you need to tell Make to link your binary with Pthreads. Add this to your Makefile:

LDFLAGS = -lpthread

The key idea is to put the code which proxies between two sockets into a separate function, create a struct that holds all the information needed for that proxying, and then pass that function and struct instance to pthread_create instead of calling it directly.

typedef struct {
    int client_sock;
} PROXY_INFO;

// You may want to pass more information in, but the bare minimum
// is the client socket.

static PROXY_INFO *make_proxy_info(int client_sock) {
    // implementation here
}

static void free_proxy_info(PROXY_INFO *proxy_info) {
    free(proxy_info);
}

static void *proxy(void *vproxy_info) {
    PROXY_INFO *proxy_info = (PROXY_INFO *)vproxy_info;

    // Proxy here. This is mostly the code from milestone 1,
    // plus a little extra to parse and rewrite the first line of
    // the request.

    free_proxy_info(proxy_info);

    return NULL;
}

int main(int argc, char **argv) {
    // parse command line arguments, create the listening socket, etc.

    int client;
    PROXY_INFO *proxy_info;
    pthread_t thread;
    for(;;) {
        if((client = server_accept(server)) < 0)
            pdie("server_accept");
        // The remote host is not known yet; proxy() connects to it
        // after parsing the absolute URI in the request.
        if(make_async(client) < 0)
            pdie("make_async");
        if((proxy_info = make_proxy_info(client)) == NULL)
            pdie("out of memory");
        if(pthread_create(&thread, NULL, &proxy, proxy_info))
            pdie("could not create thread");
        // Nothing joins this thread, so detach it to avoid leaking
        // its resources when it exits.
        pthread_detach(thread);
    }
}

If you follow this basic template, you will spawn a new thread for each client, and that thread will persist until you are finished proxying between that client and the requested remote socket.
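The ownership pattern in the template (allocate a struct per client, hand it to the thread, let the thread free it) can be tried in isolation before wiring it into the proxy. The sketch below is a stand-alone illustration with hypothetical names (WORK, worker, spawn_workers) in which each thread writes a single byte to a descriptor in place of proxying. It detaches each thread because nothing ever joins it.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <unistd.h>

typedef struct {
    int fd;      /* stand-in for the client socket */
    char tag;    /* byte this worker writes, to tell workers apart */
} WORK;

/* Thread entry point: takes ownership of its WORK argument and
 * frees it before returning. */
static void *worker(void *arg) {
    WORK *w = arg;
    if (write(w->fd, &w->tag, 1) != 1) {
        /* A real proxy would log or clean up here. */
    }
    free(w);
    return NULL;
}

/* Start one detached worker per tag. Returns the number started. */
int spawn_workers(int fd, const char *tags, int count) {
    int started = 0;
    assert(tags != NULL);
    for (int i = 0; i < count; i++) {
        WORK *w = malloc(sizeof(*w));
        if (w == NULL)
            break;
        w->fd = fd;
        w->tag = tags[i];

        pthread_t thread;
        /* pthread_create returns an error number rather than
         * setting errno; nonzero means the thread was not created. */
        if (pthread_create(&thread, NULL, worker, w) != 0) {
            free(w);
            break;
        }
        /* Nobody joins this thread, so detach it to let the system
         * reclaim its resources when it exits. */
        pthread_detach(thread);
        started++;
    }
    return started;
}
```

Note that each thread gets its own heap-allocated struct; passing the address of a loop variable instead would let the next iteration overwrite data a running thread is still reading.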

Testing

Testing with telnet

You can make HTTP/1.0 proxy requests to a proxy server running on localhost at port 8888 with telnet like this:

telnet localhost 8888
Trying 127.0.0.1
Connected to localhost.
Escape character is '^]'.
GET http://www.cs.brandeis.edu/~cs146a/index.html HTTP/1.0
Host: www.cs.brandeis.edu

Make sure that you press return twice after the last header to send a blank line.

Testing with Firefox

You should check that images work, so you will want to test in a graphical browser as well. To use a proxy in Firefox, do the following:

  1. Visit the page about:config
  2. Scroll down until you see the Preference Names beginning with "network" (the Filter box can help)
  3. Make the following changes to your settings:
  4. Go to the Firefox preferences, and under the Advanced tab click the connections settings button and change the proxy to manual config. Set the server to localhost and the port to whatever you provide as the port argument to your proxy on the command line (e.g., 8888).