The idea of this assignment is to get you familiar and comfortable with the NLTK. To that end, I'd like you to play with n-grams over a corpus or some corpora. For instance, you might consider constructing some simple statistical models, or examining co-reference behavior, or looking at syntactic or semantic similarity between the terms in various kinds of n-grams—but these are only examples. For this assignment, I want you to:
Please submit your code and write-up via email to the CS114 account by 23:59:59.9 Friday, 1 Feb, 2008. Joint work on the code is acceptable and encouraged, but (1) the code must include the names of everyone who worked on it, and (2) each write-up must be done individually.
This assignment is supposed to teach you about the NLTK, but it's also supposed to be fun. Try something outlandish and original. It doesn't have to be right, but try to make it interesting. If you have any questions, please email me.
Additional: Acceptable formats for write-ups are plain text, PostScript, PDF, (La)TeX, DVI, and HTML. This list should be considered exhaustive; please note in particular that Microsoft Word format is not included.
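If you're not sure where to start, here is a minimal sketch of one of the "simple statistical models" mentioned above: a bigram model that estimates how likely one word is to follow another. It is only an illustration, not a required approach. The function names (`bigram_model`, `next_word_prob`) and the toy sentence are made up for this example, and the fallback definition of `ngrams` is a stand-in so the sketch still runs where NLTK is not installed. To work on real text, you could replace the toy token list with a corpus reader such as `nltk.corpus.brown.words()`, assuming you have the corpus data installed.

```python
from collections import Counter, defaultdict

try:
    # NLTK's own helper: yields successive n-tuples from a sequence.
    from nltk.util import ngrams
except ImportError:
    def ngrams(sequence, n):
        """Stand-in for nltk.util.ngrams (no padding): yield n-tuples."""
        sequence = list(sequence)
        return zip(*(sequence[i:] for i in range(n)))


def bigram_model(tokens):
    """Map each word to a Counter of the words observed right after it."""
    followers = defaultdict(Counter)
    for first, second in ngrams(tokens, 2):
        followers[first][second] += 1
    return followers


def next_word_prob(model, first, second):
    """Maximum-likelihood estimate of P(second | first).

    Returns 0.0 when `first` was never seen as the start of a bigram.
    """
    counts = model.get(first)
    if not counts:
        return 0.0
    return counts[second] / sum(counts.values())


# Toy data (hypothetical); swap in a real corpus to do anything interesting.
tokens = "the cat sat on the mat and the cat slept".split()
model = bigram_model(tokens)

if __name__ == "__main__":
    print(model["the"].most_common())
    print(next_word_prob(model, "the", "cat"))
```

On the toy sentence, "the" is followed by "cat" twice and "mat" once, so `next_word_prob(model, "the", "cat")` comes out to 2/3. Unsmoothed estimates like this assign zero probability to anything unseen, which is one of the things you might find interesting to poke at.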