Programming Assignment 1: Fun With n-grams

CS 114b, Spring 2008

The idea of this assignment is to get you familiar and comfortable with the NLTK. To that end, I'd like you to play with n-grams over one or more corpora. For instance, you might consider constructing some simple statistical models, or examining co-reference behavior, or looking at syntactic or semantic similarity between the terms in various kinds of n-grams, but these are only examples. For this assignment, I want you to:

  1. Come up with some kind of hypothesis or question you'd like to investigate about n-grams.
  2. Write some code to perform the actual investigation. Please do not write more than about 100 lines of code. If you feel like you must write more, factor your existing code until you just can't factor no more, or consider investigating a different issue.
  3. Write a few paragraphs explaining what you were investigating, how you investigated it, and what the results of this investigation turned out to be.
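
If you are unsure where to begin, here is a minimal sketch of the kind of thing step 2 might start from: counting bigrams over a tokenized text. It is only an illustration, not a required approach. The helper name `ngrams`, the toy sentence, and the whitespace tokenization are all assumptions made for this example; it uses only the Python standard library so it runs anywhere, and in your own work you would draw the tokens from an NLTK corpus and use the NLTK's own tools where they fit.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a list of all n-grams (as tuples) in a sequence of tokens.

    Hypothetical helper for illustration; not part of the assignment spec.
    """
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Toy text standing in for a real corpus (an assumption for this sketch).
text = "the cat sat on the mat and the cat slept"
tokens = text.split()  # naive whitespace tokenization

# Count how often each bigram occurs and show the most frequent ones.
bigram_counts = Counter(ngrams(tokens, 2))
for gram, count in bigram_counts.most_common(3):
    print(gram, count)
```

From counts like these you could go on to estimate simple conditional probabilities, compare corpora, or test whatever hypothesis you came up with in step 1.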

Please submit your code and write-up via email to the CS114 account by 23:59:59.9 on Friday, 1 Feb 2008. Joint work on the code is acceptable and encouraged, but (1) the code must include the names of everyone who worked on it, and (2) the write-ups must be done individually.

This assignment is supposed to teach you about the NLTK, but it's also supposed to be fun. Try something outlandish and original. It doesn't have to be right, but try to make it interesting. If you have any questions, please email me.

Additional: Acceptable formats for write-ups are plain text, PostScript, PDF, (La)TeX, DVI, and HTML. This list should be considered exhaustive; please note in particular that Microsoft Word format is not included.