This project must be written in C and must compile (with gcc) and run on one of the iMacs in the Vertica Lounge or one of the RHEL boxes in the conference room or machine room (click here for a list of available machines).
You must include a Makefile for compiling from the command line (you are free to use an IDE during development but your grader will expect to type `make tcpproxy` to build your project).
This is a team project; you should continue to work with the same team as in Project 1.
In this second milestone, you will be modifying the TCP proxy that you wrote in Project 1 to act as a web proxy. A web proxy server operates using HTTP, an application-level protocol. You will also be adding threading to allow your server to proxy for multiple clients concurrently.
You will be making use of libraries for this milestone. You will be using the POSIX package for threading, Pthreads. Pthreads should be installed already on your development machine. The Pthreads threading model is a lot like Java's, so the concepts should be familiar to you.
We will be using the HTTP parser from Node.js (for those of you familiar with Node.js, the HTTP parser is part of "node-core", the low-level functionality that is written in C). This parser is very easy to use and saves us from the difficult task of writing a fast and reliable HTTP parser ourselves. Although this milestone doesn't require much in the way of parsing, the third milestone will, so it's important to get the parser integrated with your project now.
Here are links if you want to look at these tools right away. Instructions on how to use them are given later in this assignment.
Web proxies have various uses. A somewhat outdated use for web proxies is to give many clients inside an internal network access to the web through a single IP address (if you have NAT, you don't need this). A web proxy can be used to filter content (e.g., to provide additional security or remove offending content). A web proxy can improve performance in some cases by caching frequently-accessed content. You can also use a web proxy to circumvent network security (imagine that there is a web server that is only accessible to hosts inside the Brandeis network; if you can set up a web proxy server on another machine in the Brandeis network and access that proxy from outside Brandeis, then you may be able to gain access to the protected system from an external network).
Web proxies that allow any Internet user to use them to access any host are called open proxies. You will be coding an open proxy. Sometimes open proxies are used to provide a form of anonymity (your IP address is hidden; only the address of the proxy is visible to the server), but this relies on the assumption that the open proxy itself is not logging your activity.
Web proxies are important in the mobile browser space. Proxy servers can be used to compress and optimize web pages for mobile browsers running on small screens and transferring data over slow, unreliable connections. Opera Mini uses a proxy server run by Opera Software that reformats pages for small screens and compresses the content. Amazon's Silk (which will debut on the Kindle Fire) likely uses similar techniques to route your web traffic through EC2 for caching and reformatting (and of course tracking to monetize your web browsing habits).
Outside of the mobile browser space, there are a number of open source proxies which aim to improve web performance for users on a network when they connect to the wider Internet. A good example is Squid. Note that Squid leverages asynchronous I/O for performance reasons (like your TCP proxy), and also uses multiple processes for SMP (remember the Flash web server paper?). Your project will be written from the ground up to support kernel threads instead of heavier-weight processes.
Your Makefile should name the web proxy binary webproxy. webproxy should take a single argument, the port the server should listen to for new client connections. As in Project 1, it should print a usage message to the console and exit if executed with no arguments.
$ ./webproxy
Usage: ./webproxy server_port
$ ./webproxy 8888
Your web proxy will support HTTP/1.0. HTTP is a request/response protocol in which a web client (typically a web browser) makes a request to a web server using TCP, and the server sends a response back and then terminates the TCP connection. The web client and web server are two separate programs, which may or may not be on separate machines (they usually are).
There's a fair amount of detail given here about HTTP, but don't worry, you won't actually be dealing too much with the details of HTTP. It's important to have a basic understanding of the protocol in order to debug your proxy, so just bear with the details for now, and we'll return to the project at hand soon.
The basic contents of an HTTP request are as follows:
Here is an example HTTP request:
GET /about/defining.html HTTP/1.0
Host: www.brandeis.edu
User-Agent: Mozilla/5.0 (X11; U; Linux i586; de; rv:5.0) Gecko/20100101 Firefox/5.0
We can identify the pieces of this request:
Note that in HTTP, lines are terminated with carriage return followed by line feed, abbreviated CRLF. These characters can be written in C with the escape sequence "\r\n". If we annotate the request with CRLFs we can see the blank line more clearly:
GET /about/defining.html HTTP/1.0\r\n
Host: www.brandeis.edu\r\n
User-Agent: Mozilla/5.0 (X11; U; Linux i586; de; rv:5.0) Gecko/20100101 Firefox/5.0\r\n
\r\n
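Since the blank line is just the four-byte sequence \r\n\r\n, you can detect the end of the headers with a simple search. A minimal sketch (the helper name find_headers_end is ours, and it assumes a null-terminated buffer; with a raw byte buffer you would use a length-aware search instead):

```c
#include <string.h>

/* Return a pointer just past the blank line that ends the headers,
 * or NULL if the buffer does not yet contain a complete header block. */
static const char *find_headers_end(const char *buf)
{
    const char *p = strstr(buf, "\r\n\r\n");
    return p ? p + 4 : NULL;
}
```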
Web servers send responses in a format very similar to that of requests. Responses consist of a status line, headers, and a body. The headers and body are formatted like those of a request, so we only go over the status line in detail:
Here is an example HTTP response:
HTTP/1.0 200 OK
Date: Wed, 26 Oct 2011 01:01:01 GMT
Content-Type: text/html
Content-Length: 5

hello
For this milestone, we will write a mostly correct web proxy. To design our proxy, we only need to observe how the example request from above would be sent to a web proxy instead of directly to www.brandeis.edu:
GET http://www.brandeis.edu/about/defining.html HTTP/1.0
Host: www.brandeis.edu
User-Agent: Mozilla/5.0 (X11; U; Linux i586; de; rv:5.0) Gecko/20100101 Firefox/5.0
Notice that the URI is absolute: it includes a protocol name and (most importantly for us) the name of the host to which the request should be sent. The job of our proxy is to strip off the protocol and hostname from the request. The proxy will leave the rest of the request (including headers and body, if any) unchanged. The proxy then forwards the rewritten request to the host specified in the absolute URI.
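The stripping step can be sketched in a few lines of C. This is only an illustration (the helper name rewrite_request_line is ours, and it assumes single spaces between the method, URI, and version, and a URI beginning with http://):

```c
#include <stdio.h>
#include <string.h>

/* Copy a proxy-style request line into out with the scheme and host
 * stripped from the URI, e.g.
 *   "GET http://www.brandeis.edu/about/defining.html HTTP/1.0"
 * becomes
 *   "GET /about/defining.html HTTP/1.0".
 * Returns 0 on success, -1 if the line is not in the expected shape. */
static int rewrite_request_line(const char *in, char *out, size_t out_len)
{
    const char *uri = strchr(in, ' ');       /* space after the method */
    if (uri == NULL)
        return -1;
    uri++;                                   /* start of the URI */
    if (strncmp(uri, "http://", 7) != 0)
        return -1;                           /* not an absolute http URI */
    const char *path = strchr(uri + 7, '/'); /* first slash after the host */
    if (path == NULL)
        return -1;
    int n = snprintf(out, out_len, "%.*s%s", (int)(uri - in), in, path);
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```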
Many HTTP requests put the host in the Host header as well as giving it in the absolute URI. You should not rely on this header, since it serves a slightly different purpose and may not always match the hostname given in the absolute URI (the Host header is also not required, so it may not be present in all HTTP requests your proxy receives). You should always proxy to the host specified in the absolute URI in the first line of the HTTP request.
The following diagram illustrates an example request for a (fictional) web page from a client, through your web proxy, to the server www.cs.brandeis.edu. Notice that the URI gets rewritten by the proxy, while the response is passed back to the client as-is.
You will parse and rewrite the request, send it to the appropriate host, and then send the response verbatim back to the client, after which you may close the connections to the remote client and server. The only exception to this program flow is if the client sends you a bad request or if there is a problem with the remote server.
You will have to recognize only one header (for now), the Content-Length header. You have to extract the value of the Content-Length header in order to know how many bytes of data to read following the headers. If Content-Length is not present in a request or response, then that request or response has no body and you can stop reading it after the first blank line.
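Extracting the value can be as simple as a strtol call on the header value. A minimal sketch (parse_content_length is a hypothetical helper; strtol conveniently skips any leading whitespace after the colon):

```c
#include <stdlib.h>

/* Parse a Content-Length header value into a byte count.
 * Returns -1 if the value is not a non-negative decimal number. */
static long parse_content_length(const char *value)
{
    char *end;
    long n = strtol(value, &end, 10);
    if (end == value || n < 0)
        return -1;  /* no digits consumed, or a negative length */
    return n;
}
```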
HTTP responses contain status codes. You do not have to know about most of them for this milestone, since responses will be sent back to the client as-is (i.e., in this milestone, your web proxy does not rewrite responses). However, a client might send you a malformed request. For example, a misconfigured client might send you an HTTP request without an absolute URI. If this happens, your proxy will have no way of knowing what remote server to connect to in order to proxy. You should send the following response to the client:
HTTP/1.0 400 Bad Request
It may also happen that the client sends a well-formed request, but the server they requested access to is not accessible or sends an invalid HTTP response, for whatever reason. In these cases your proxy should send the following response:
HTTP/1.0 502 Bad Gateway
Don't forget the CRLF following each line in these responses, including the blank line at the end of the response.
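Putting this together, the sending code might look like the following (send_error is an illustrative helper name; it ignores short writes for brevity, which your real proxy buffer code already handles):

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Send a canned HTTP/1.0 error status line to the client, followed
 * by the CRLF that ends the status line and the CRLF for the blank
 * line that terminates the (empty) header section. */
static void send_error(int client_sock, const char *status_line)
{
    char buf[128];
    int n = snprintf(buf, sizeof(buf), "%s\r\n\r\n", status_line);
    write(client_sock, buf, n);
}

/* e.g. send_error(client_sock, "HTTP/1.0 400 Bad Request");
 *      send_error(client_sock, "HTTP/1.0 502 Bad Gateway"); */
```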
In the normal case of correct behavior by the remote client and remote server, you will not be generating responses in your web proxy; you will simply be transferring the response sent back by the remote server. Note that the server may send other kinds of errors through the proxy (such as the common 404 error indicating that the requested resource could not be found). Your proxy is not responsible for generating these kinds of errors; you should just proxy them like any other response from the remote server.
Your TCP proxy code already has the ability to proxy data between the client and server. The new difficulty added in this milestone is that the remote server is not known until after you have parsed the first line of the HTTP request (since the server you are going to contact is in the absolute URI of the requested resource).
You can assume that the first line of the HTTP request fits in your fixed-size proxying buffer (make your buffer reasonably large, say, 10K). The parser will call a function that you provide with the absolute URI, and you will rewrite that URI and extract the hostname.
http-parser is documented in the README.md file in the source download, which you can read in a nice format on the http-parser project page on github.
HTTP_PARSER = /path/to/http-parser

CC = gcc
CFLAGS = -g -Wall -I$(HTTP_PARSER)
VPATH = $(HTTP_PARSER)

/path/to/http-parser/libhttp_parser.a:
	cd /path/to/http-parser; $(MAKE) package

webproxy: webproxy.o /path/to/http-parser/libhttp_parser.a

This basic template sets up Make to build the http parser in the directory where you downloaded it and to use the resulting library in your webproxy program. You will, of course, add to this Makefile if you need to link other .o files into webproxy.
#include "http_parser.h"
static int message_begin_cb(http_parser *parser)
{
    printf("message begin\n");
    return 0;
}

static int message_complete_cb(http_parser *parser)
{
    printf("message complete\n");
    return 0;
}

static int url_cb(http_parser *parser, const char *s, size_t length)
{
    printf("url: %.*s\n", (int)length, s);
    return 0;
}

static http_parser_settings settings = {
    .on_message_begin = message_begin_cb,
    .on_header_field = NULL,
    .on_header_value = NULL,
    .on_url = url_cb,
    .on_body = NULL,
    .on_headers_complete = NULL,
    .on_message_complete = message_complete_cb
};

The settings variable tells the parser which functions to call when certain pieces of the HTTP request are encountered. The key functions are url_cb, which will be called with the URI part of the request, and on_header_field and on_header_value, which tell you about the header lines (this example hasn't yet set callbacks to handle parsed headers).
http_parser *parser = malloc(sizeof(http_parser));
http_parser_init(parser, HTTP_REQUEST);
if(/* data to read from client_sock */) {
    if((nread = buf_read(buf, client_sock)) < 0) {
        /* handle error */
    } else {
        nparsed = http_parser_execute(parser, &settings, buf_data(buf), nread);
        if(nparsed != nread) {
            /* fatal error */
        }
    }
}

If you are using a circular buffer, then you may have to split the data being parsed into two chunks. If you are using the write-then-shift method of handling short writes from your buffer, then this code should be sufficient.
message begin
url: http://www.cs.brandeis.edu/index.html (position 4 in buffer)
message complete
In the url_cb function, you will be able to extract the hostname. You'll also have to rewrite the request in your buffer before proxying it to the remote server. One way to do this is to rewrite the URL in url_cb. To do that, you'll need a pointer to your buffer object inside url_cb; you can get one by using parser->data, a small amount of user data storage provided by http-parser.
http_parser *parser = malloc(sizeof(http_parser));
http_parser_init(parser, HTTP_REQUEST);
parser->data = buf;
Then inside url_cb you can get a pointer to your buffer:
static int url_cb(http_parser *parser, const char *s, size_t length) { BUF *buf = (BUF *)parser->data; // ... }
Another way to handle this is to copy from your buffer into a regular, null-terminated string before parsing, and then copy back from the flattened string into your buffer after parsing. This way, if a parser callback modifies the string, those modifications end up in the buffer. This is a little more expensive but might make the callbacks easier to write.
You can hand-create a parser to extract the hostname from the absolute URI, or you can use a high-quality parser created by someone else. I think that a hand-created parser is easiest in this case, but this parser is also a good choice: uriparser. (It seems like there is a bug in the build system in the current head of the git repository for uriparser, but the tar.gz file linked in the download section seems to build correctly).
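If you do hand-create the parser, the core of it is just finding the boundaries of the hostname. A minimal sketch (extract_host is our own name; it ignores port numbers and userinfo, which a complete proxy would also want to handle):

```c
#include <string.h>

/* Extract the hostname from an absolute URI such as
 * "http://www.brandeis.edu/about/defining.html" into host.
 * Returns 0 on success, -1 if the URI is not an absolute http URI
 * or the hostname does not fit in host_len bytes. */
static int extract_host(const char *uri, char *host, size_t host_len)
{
    const char *scheme = "http://";
    size_t scheme_len = strlen(scheme);
    const char *start, *end;

    if (strncmp(uri, scheme, scheme_len) != 0)
        return -1;                       /* not an absolute http URI */
    start = uri + scheme_len;
    end = strchr(start, '/');            /* hostname ends at the first slash */
    if (end == NULL)
        end = start + strlen(start);     /* a URI like "http://host" */
    if ((size_t)(end - start) >= host_len)
        return -1;                       /* hostname too long for the buffer */
    memcpy(host, start, end - start);
    host[end - start] = '\0';
    return 0;
}
```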
You will need some state as your parser is executing callbacks. Don't store this in global variables. Either put it in a heap-allocated struct and point to that struct from parser->data (recommended), or else use thread-local storage (TLS).
As long as your program does not rely on global variables, this step is actually fairly simple. You will want to consult the man page for how to create threads:
man pthread_create
You will need to include the Pthreads header:
#include <pthread.h>
The last bit of boilerplate is that you need to tell Make to link your binary with Pthreads. Add this to your Makefile:
LDFLAGS = -lpthread
The key idea is to put the code which proxies between two sockets into a separate function, create a struct that holds all the information needed for that proxying, and then pass that function and struct instance to pthread_create instead of calling it directly.
typedef struct {
    int client_sock;
} PROXY_INFO;

// You may want to pass more information in, but the bare minimum
// is the client socket.
static PROXY_INFO *make_proxy_info(int client_sock)
{
    // implementation here
}

static void free_proxy_info(PROXY_INFO *proxy_info)
{
    free(proxy_info);
}

static void *proxy(void *vproxy_info)
{
    PROXY_INFO *proxy_info = (PROXY_INFO *)vproxy_info;
    // Proxy here; this is mostly the code from milestone 1, plus a
    // little extra to parse and rewrite the first line of the request
    // and connect to the remote host it names.
    free_proxy_info(proxy_info);
    return NULL;
}

int main(int argc, char **argv)
{
    // parse command line arguments, set up the listening socket, etc.
    PROXY_INFO *proxy_info;
    pthread_t thread;
    int client;

    for(;;) {
        if((client = server_accept(server)) < 0)
            pdie("server_accept");
        if(make_async(client) < 0)
            pdie("make_async");
        if((proxy_info = make_proxy_info(client)) == NULL)
            pdie("out of memory");
        if(pthread_create(&thread, NULL, &proxy, proxy_info))
            pdie("could not create thread");
    }
}

Note that the connection to the remote host happens inside proxy(), not in main(): the remote server is not known until the thread has parsed the first line of the client's request.
If you follow this basic template, you will spawn a new thread for each client, and that thread will persist until you are finished proxying between that client and the requested remote socket.
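One detail worth noting: since main() never calls pthread_join, each finished thread's bookkeeping stays allocated until joined, leaking resources as clients come and go. Calling pthread_detach right after pthread_create tells Pthreads to reclaim everything as soon as the thread function returns. A self-contained sketch (worker and spawn_detached are illustrative names, not part of the template above):

```c
#include <pthread.h>

static void *worker(void *arg)
{
    /* the per-client proxying work would go here */
    return NULL;
}

/* Spawn a detached worker thread; returns 0 on success.
 * A detached thread's resources are reclaimed automatically when it
 * exits, so the spawning thread never needs to join it. */
static int spawn_detached(void)
{
    pthread_t t;
    if (pthread_create(&t, NULL, worker, NULL) != 0)
        return -1;
    return pthread_detach(t);
}
```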
You can make HTTP/1.0 proxy requests to a proxy server running on localhost at port 8888 with telnet like this:
telnet localhost 8888
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
GET http://www.cs.brandeis.edu/~cs146a/index.html HTTP/1.0
Host: www.cs.brandeis.edu
Make sure that you press return twice after the last header to send a blank line.
You should check that images work, so you will want to test in a graphical browser as well. To use a proxy in Firefox, do the following: