cs146a Project 2: Webserver

Due: 2009-10-16 before 11:59 PM

This project must be written in C and must compile and run on one of the RHEL boxes in the berry patch (see the list of available machines).

Introduction

You will create a basic HTTP/0.9 webserver. HTTP/0.9, which dates from 1991, is the first version of HTTP. This simple protocol allowed web clients (called browsers) to request documents formatted in HTML from a server one at a time. HTTP requires a reliable connection, and so is implemented as a layer over TCP. Your server will support multiple simultaneous clients using TCP sockets and the fork() mechanism from your UNIX/Linux OS.

Your server will also work with HTTP/1.0 clients and you will be able to test using a conventional web browser like Firefox. You will have an extra credit opportunity at the end of this assignment to implement some of HTTP/1.0 and allow images as well as HTML to be loaded from your server.

Skills

You will employ socket I/O and file I/O, and you will set up a socket to listen for incoming connections. You will also use the fork() system call. We assume that you covered file I/O in a previous class.

How to set up a socket to listen for connections and how to use fork() are both described thoroughly in Using TCP Through Sockets by David Mazières, Frank Dabek, and Eric Peterson (section 3.4).

Design Requirements

Communication Protocol

Your webserver will use HTTP/0.9 to communicate with clients. HTTP/0.9 is a request/response protocol, where a single TCP connection handles exactly one request/response pair. You have seen an example of an HTTP/0.9 request in Project 1: "GET /". The format of an HTTP/0.9 request is as follows:

GET document-name

Here are some examples of HTTP/0.9 requests:

GET /index.html
GET /
GET /foo/bar/baz

Be Forgiving

There aren't many modern web clients that actually send HTTP/0.9 requests. Luckily, HTTP/1.0 was designed to be backwards compatible with HTTP/0.9. For your server to service HTTP/1.0 requests, you should simply ignore any part of the request that appears after the verb GET and the document name. For example, here is what a request from Firefox running in HTTP/1.0 mode might look like:

GET /foo/bar/baz HTTP/1.0
Host: localhost:8080
User-Agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.6)
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
Accept-Language: en-us,en;q=0.5
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Keep-Alive: 300
Connection: keep-alive

Your webserver should only look at the first two words in this request, ignoring the rest; i.e.:

GET /foo/bar/baz
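
One possible way to pull out those first two words is sketched below. The function name parse_request and its return convention are assumptions made for illustration, not part of the assignment. Note that a document name longer than the caller's buffer is silently truncated here; a real server may prefer to reject such a request.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: extracts the document name from a request.
 * Returns 0 on success, -1 if the request is malformed. */
int parse_request(const char *request, char *doc, size_t doc_size)
{
    char verb[8];
    char fmt[32];

    if (doc_size < 2)
        return -1;
    /* Build a format such as "%7s %255s" so that sscanf() can never
     * write more than doc_size bytes into doc. */
    snprintf(fmt, sizeof(fmt), "%%7s %%%zus", doc_size - 1);

    /* Only the first two whitespace-separated words are examined;
     * anything after them (HTTP/1.0 version, headers) is ignored. */
    if (sscanf(request, fmt, verb, doc) != 2)
        return -1;
    if (strcmp(verb, "GET") != 0)
        return -1;
    if (doc[0] != '/')
        return -1;
    return 0;
}
```

Given the Firefox request shown above, this sketch would leave /foo/bar/baz in doc and ignore the version string and all headers.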

Document Root

Your webserver serves the contents of text files to clients. These files will be stored on the local file system of your webserver. In order to send the contents of these files, you must use file I/O to read them, then network socket I/O to send the contents to the client.

The document names requested by clients are absolute paths which you must transform to be relative to the document root. For example, if your document root is /home/ross/.www, then the document name /foo/bar/baz/index.html will be found in the file /home/ross/.www/foo/bar/baz/index.html.
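
The mapping from document name to file path can be sketched as a bounded string join. The helper name build_path is hypothetical, and stripping a trailing slash from the document root is an extra convenience assumed here, not something the assignment requires.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: joins the document root and the requested name.
 * Returns 0 on success, -1 if the result does not fit in out. */
int build_path(const char *root, const char *doc, char *out, size_t out_size)
{
    size_t root_len = strlen(root);
    int n;

    /* Drop one trailing slash from the root, since doc begins with one. */
    if (root_len > 0 && root[root_len - 1] == '/')
        root_len--;

    n = snprintf(out, out_size, "%.*s%s", (int)root_len, root, doc);
    if (n < 0 || (size_t)n >= out_size)
        return -1;   /* output was truncated */
    return 0;
}
```

Because the output buffer has a fixed size, the caller must handle the -1 case, for example by reporting an error to the client.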

Default File Name

If the client requests a directory instead of a file name, you should supply the file index.html inside that directory by default. For example, assume that the document root is:

/home/ross/.www

And that the following file exists:

/home/ross/.www/foo/bar/baz/index.html

If the client makes the following request:

GET /foo/bar/baz

Then your webserver should recognize that /home/ross/.www/foo/bar/baz is a directory, and return in the response the contents of the following file:

/home/ross/.www/foo/bar/baz/index.html
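
A minimal sketch of the directory check, using stat() from the man pages listed in the Hints section, might look like the following. The helper name resolve_default is an assumption for illustration.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Hypothetical helper: if path names a directory, appends "/index.html".
 * Returns 0 on success, -1 if path does not exist or the result does
 * not fit in the buffer. */
int resolve_default(char *path, size_t path_size)
{
    static const char suffix[] = "/index.html";
    struct stat st;
    size_t len;

    if (stat(path, &st) != 0)
        return -1;              /* e.g. no such file */
    if (!S_ISDIR(st.st_mode))
        return 0;               /* a regular file: leave path alone */

    len = strlen(path);
    if (len > 0 && path[len - 1] == '/')
        len--;                  /* avoid a doubled slash */
    if (len + sizeof(suffix) > path_size)
        return -1;
    memcpy(path + len, suffix, sizeof(suffix));   /* copies the NUL too */
    return 0;
}
```

The function only rewrites the path; whether index.html actually exists is discovered later, when the server tries to open it.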

File Types

Your webserver does not need to handle binary data such as images, just plain text (as this is all that was specified as part of HTTP/0.9). You may serve any plain text file verbatim; do not worry about the distinction between HTML and plain text made by the HTTP/0.9 standard.

Errors

If the client requests a file that does not exist, or submits a malformed request, then you must return a well-formed (but simple) HTML page to the client containing a customized error message specifying what went wrong.
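
As an illustration, an error page could be formatted into a fixed-size buffer as sketched below. The helper name format_error_page and the page layout are assumptions; any well-formed HTML page with a meaningful message would do. The sketch assumes title and detail are strings chosen by the server, not raw text copied from the request.

```c
#include <stdio.h>

/* Hypothetical helper: fills buf with a small HTML error page.
 * Returns the page length in bytes, or -1 if buf is too small. */
int format_error_page(char *buf, size_t buf_size,
                      const char *title, const char *detail)
{
    int n = snprintf(buf, buf_size,
                     "<html><head><title>%s</title></head>"
                     "<body><h1>%s</h1><p>%s</p></body></html>\n",
                     title, title, detail);

    if (n < 0 || (size_t)n >= buf_size)
        return -1;   /* page was truncated */
    return n;
}
```

For a missing file, a call such as format_error_page(buf, sizeof(buf), "Not Found", "The requested document does not exist.") would produce a page that can then be written to the client socket.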

Implementation Requirements

Multiplex with fork()

fork() is a system call which spawns a second process that is identical to the current one, except that the new one has a new process ID. After the call to fork(), both processes continue from the same point in the program. The original process is referred to as the parent, and the copy is referred to as the child.

In the parent process, fork() returns the ID of the child process. In the child process, fork() returns 0. If fork() fails, it returns -1 in the parent and no child is created. Thus, the return value of fork() can be used to differentiate between the parent and child processes. The usual pattern looks something like this:

#include <sys/types.h>
#include <unistd.h>

pid_t pid;

pid = fork();
if (pid > 0) {
    /* Parent process: pid is the child's process ID */
} else if (pid == 0) {
    /* Child process */
} else {
    /* fork() returned -1: error; check errno */
}

You will use fork() to spawn a child process to handle each incoming request. This way, your server can handle multiple requests simultaneously. Because fork() spawns separate child processes, you do not have to worry about coordinating access to shared memory: the parent and child do not share memory.

There are examples of using fork() in the Using TCP Through Sockets handout.
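
The sketch below shows one way the per-connection fork could be organized, assuming the parent has already obtained a connected descriptor (for example from accept()). The names spawn_handler, reap_children, and placeholder_handler are hypothetical. In a real server the child would typically also close its copy of the listening socket.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Placeholder for the code that services one connection. */
void placeholder_handler(int conn_fd)
{
    static const char msg[] = "hi\n";

    if (write(conn_fd, msg, sizeof(msg) - 1) < 0)
        _exit(1);
}

/* Hypothetical helper: forks a child that runs handler(conn_fd).
 * Returns the child's pid in the parent, or -1 if fork() failed. */
pid_t spawn_handler(int conn_fd, void (*handler)(int))
{
    pid_t pid = fork();

    if (pid == 0) {
        /* Child: service this one connection, then exit. */
        handler(conn_fd);
        close(conn_fd);
        _exit(0);
    }
    /* Parent (or failed fork): this copy of the descriptor is no
     * longer needed; the child holds its own copy. */
    close(conn_fd);
    return pid;
}

/* Collects children that have already exited, without blocking, so
 * that they do not accumulate as zombie processes. */
void reap_children(void)
{
    while (waitpid(-1, NULL, WNOHANG) > 0)
        ;
}
```

A parent loop would then call accept(), pass the new descriptor to spawn_handler(), and call reap_children() from time to time.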

File I/O

Your server must use C file I/O to serve real files on your local file system to web clients.

Fixed-size Buffers

File and network socket I/O should both be done using fixed-size buffers in memory. This means that you may not allocate memory based on the size of an incoming request or the size of a file that will be sent in a response. This is similar to how you echoed the response from a server using a fixed-size buffer in the program sc in Project 1.
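
A fixed-size buffer copy loop might be sketched as follows. The buffer size of 4096 and the helper name copy_fd are arbitrary choices for illustration. The same loop works whether the descriptors refer to files, pipes, or sockets.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

#define BUF_SIZE 4096   /* arbitrary fixed size chosen for illustration */

/* Hypothetical helper: copies everything readable from in_fd to out_fd
 * using one fixed-size buffer. Returns 0 on success, -1 on error. */
int copy_fd(int in_fd, int out_fd)
{
    char buf[BUF_SIZE];
    ssize_t nread;

    while ((nread = read(in_fd, buf, sizeof(buf))) != 0) {
        ssize_t off = 0;

        if (nread < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;
        }
        /* write() may accept fewer bytes than requested, so loop
         * until the whole chunk has been sent. */
        while (off < nread) {
            ssize_t nwritten = write(out_fd, buf + off, (size_t)(nread - off));

            if (nwritten < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += nwritten;
        }
    }
    return 0;   /* read() returned 0: end of input */
}
```

Memory use stays constant no matter how large the file is, which is the point of the fixed-size buffer requirement.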

Invocation

Your server must be invoked on the command line like this:

./webserver document-root port

Examples:

./webserver /home/ross/.www 8080
./webserver /home/ross/proj/cs146a/project2/test 8080
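
Validating the port argument could look like the sketch below. The helper name parse_port is hypothetical, and rejecting anything outside 1 to 65535 is an assumption based on the range of TCP port numbers.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical helper: converts the port argument to a number.
 * Returns the port (1-65535), or -1 if text is not a valid port. */
int parse_port(const char *text)
{
    char *end;
    long value;

    errno = 0;
    value = strtol(text, &end, 10);
    /* Reject overflow, empty input, and trailing junk such as "80a". */
    if (errno != 0 || end == text || *end != '\0')
        return -1;
    if (value < 1 || value > 65535)
        return -1;
    return (int)value;
}
```

In main(), the server would check that argc is 3, treat argv[1] as the document root, and pass argv[2] to a function like this one, printing a usage message if either check fails.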

Quitting

You may assume that your webserver is always quit by sending the interrupt signal (pressing ctrl-c).

Testing with a Client

TCP client

You can test with your program sc from Project 1, or with the program telnet. Here is an example using sc:

./sc localhost 8080 "GET /foo/bar/baz"

Firefox

You can also test with Firefox, although you will have to configure it to use HTTP/1.0 by carrying out the following steps:

  1. In the location bar, type about:config
  2. In the search box at the top of the page labeled "Filter", type network.http.version
  3. Double-click on the preference named network.http.version
  4. In the pop-up box, replace the string (which probably says 1.1) with the string 1.0
  5. Click ok in the pop-up box
  6. Visit http://localhost:8080/foo/bar/baz, where "/foo/bar/baz" is whatever document name you would like to retrieve

Don't forget to change the HTTP version back to 1.1 when you are done testing.

lynx

Finally, you might find it convenient to test with lynx, a text-mode browser commonly found on Linux. lynx is a simple browser that doesn't try to display images, so it should work well with your server.

Hints

Server Port

As you know, TCP allows a server to support multiple services using the port abstraction. The port number is an identifier that is transmitted to the server when initiating a TCP connection. The server uses the port number to determine which application should service the connection. Many port numbers are reserved by convention for particular applications. If you are curious, you can review a comprehensive list of assigned port numbers managed by the IANA.

It turns out that ordinary (non-root) user programs cannot listen on port numbers below 1024. In order to test your webserver, instruct it to listen on a higher port number. Port 8080 is a user-accessible port often used for HTTP.
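
For orientation, a listening socket on a given port can be created roughly as sketched below. The helper name make_listener, the backlog of 16, and the use of SO_REUSEADDR are assumptions made for this example; the Using TCP Through Sockets handout remains the reference for the details.

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

/* Hypothetical helper: returns a TCP socket listening on port,
 * or -1 on error (for example, if the port is below 1024 and the
 * program is not privileged). */
int make_listener(unsigned short port)
{
    struct sockaddr_in addr;
    int on = 1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    /* Optional: lets the server be restarted quickly on the same port. */
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on));

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);   /* any local interface */
    addr.sin_port = htons(port);                /* network byte order */

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, 16) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

The returned descriptor is what the server would pass to accept() to receive incoming connections.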

man Pages for File I/O

You may find the following man pages useful (execute these commands at a Linux/UNIX command line):

man 2 open
man 2 close
man 2 read
man 2 write
man 2 stat

Includes

The Using TCP Through Sockets handout shows the header files that a socket server needs to #include. You can find the headers for file I/O by consulting the man pages for file I/O-related functions.

Divide the Problem into Stages

You may want to break the project down into smaller chunks. Besides employing modularity to divide the assignment into separable components, you might also want to implement a simple test server to make sure that your TCP handling and use of fork() work correctly before adding support for HTTP. One idea is to first implement an echo server. An echo server accepts TCP connections, forks to handle them, and sends as a response an exact copy of the request. You can test your echo server using telnet.

Extra Credit

Security

The specification above does not prevent clients from requesting resources outside of the document root. For example, if the document root is /home/ross/.www, and the client sends the request GET /../cs146a/grades/project2/mallory.txt, then a student Mallory might be able to learn their Project 2 grade, even though this file was not intended to be accessible via the web.

Earn 5% extra credit for preventing your webserver from serving files outside of the document root.
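
One conservative approach, sketched below, is to reject any document name that contains a ".." path component. This is only one possible strategy and the helper name has_dotdot is hypothetical. It does not address symbolic links inside the document root that point elsewhere; resolving the final path (for example with realpath()) and comparing it against the document root would be more thorough.

```c
#include <string.h>

/* Hypothetical helper: returns 1 if doc contains a ".." path
 * component, 0 otherwise. */
int has_dotdot(const char *doc)
{
    const char *p = doc;

    while (*p != '\0') {
        size_t len;

        while (*p == '/')
            p++;                    /* skip separators */
        len = strcspn(p, "/");      /* length of this component */
        if (len == 2 && p[0] == '.' && p[1] == '.')
            return 1;
        p += len;
    }
    return 0;
}
```

With this check, the request for /../cs146a/grades/project2/mallory.txt would be refused, while names such as /foo/..bar, which merely contain two dots, would still be allowed.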

Images

You can implement a small part of HTTP/1.0 and give your webserver the ability to serve binary data like images. You will need to consider what headers to send in your response and also how to read binary files (as opposed to plain text files).

Earn 5% extra credit for serving images from your webserver.
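
As a starting point, the sketch below picks a Content-Type from the file extension and formats a minimal HTTP/1.0 response header. The helper names and the short list of extensions are assumptions for illustration; which headers to send is left for you to decide. A header like this should only be sent to clients that made an HTTP/1.0 request, since HTTP/0.9 responses carry no headers.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: guesses a MIME type from the file extension. */
const char *content_type_for(const char *path)
{
    const char *dot = strrchr(path, '.');

    if (dot == NULL)
        return "text/plain";
    if (strcmp(dot, ".html") == 0)
        return "text/html";
    if (strcmp(dot, ".png") == 0)
        return "image/png";
    if (strcmp(dot, ".jpg") == 0 || strcmp(dot, ".jpeg") == 0)
        return "image/jpeg";
    if (strcmp(dot, ".gif") == 0)
        return "image/gif";
    return "text/plain";
}

/* Hypothetical helper: writes a minimal HTTP/1.0 success header into
 * buf. Returns its length, or -1 if buf is too small. */
int format_response_header(char *buf, size_t buf_size, const char *path)
{
    int n = snprintf(buf, buf_size,
                     "HTTP/1.0 200 OK\r\nContent-Type: %s\r\n\r\n",
                     content_type_for(path));

    if (n < 0 || (size_t)n >= buf_size)
        return -1;
    return n;
}
```

Binary files may contain zero bytes, so the file contents must be sent using the byte counts returned by read() rather than string functions such as strlen().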

Collaboration

How to Hand In

Submission instructions appear in the FAQ.