Spring 2013

Late Policy

Assignments will be accepted late ONLY if an extension has been agreed upon. You must email Professor Meteer and Alex the day before the assignment is due with the reason you need an extension. Later requests will only be accepted in emergency circumstances.

Assignment 2: Chunking

Part One (due March 18 COB)

  • Use the NLTK chunker: write chunking rules and apply them to the Treebank data, which you can download from NLTK (see the NLTK Book, Ch. 7, for more info on chunking).  
  • The IDs of the sentences to use for the "dev test" are listed here: Development Data. If you just want to see them all in one place to get a sense of what you are looking for, I've pulled them together here: Development Sentences
  • Evaluate your performance on the "tagged" treebank data (which is already chunked). Use only the words and POS tags as input and create the output structures. You can run this set through your rules as many times as you want until you think the performance is good enough. I'll add a "real" test set closer to the due date, which you should run for your final submission without looking at it first.
  • You should basically be seeking to produce something similar to the code in section 7.3 of the NLTK textbook on developing and evaluating rule-based regular expression chunkers, except that you will be using a subset of the treebank_chunk corpus instead of the conll2000 corpus. So your grammar will be a regular expression-style chunk rule string (like the example in the book's code, "NP: {<[CDJNP].*>+}", but more fleshed out), and you will evaluate your rules over a list of test sentences using the nltk.RegexpParser.evaluate() method (see the sketch after this list).
  • To get the particular sentences you are interested in out of the whole treebank_chunk corpus, you can either pass the list of filenames (each with the ".pos" ending) to the treebank_chunk.chunked_sents() function or create a new list out of the elements of the list at those indices.
  • Either way, it is important that you evaluate over all 10 final test sentences at once and get a single accuracy/precision/recall/fmeasure score.
  • FINAL TEST SENTENCES: Test data
  • SUBMIT
    • Your rules
    • Your evaluation scores
    • The test corpus with your chunks marked
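
For reference, here is a minimal sketch of the workflow described above. The grammar is just the starting point from the NLTK Book, and the filenames in dev_files are made-up placeholders; substitute the IDs from the Development Data page and flesh out the rules considerably.

    import nltk
    from nltk.corpus import treebank_chunk

    # Starting-point grammar in the style of NLTK Book Ch. 7; yours should be
    # much more fleshed out.
    grammar = r"""
      NP: {<DT|PP\$>?<JJ>*<NN.*>+}   # determiner/possessive, adjectives, nouns
          {<NNP>+}                   # sequences of proper nouns
    """
    chunker = nltk.RegexpParser(grammar)

    # Placeholder filenames -- use the real dev (and later, final test) IDs,
    # each with the ".pos" ending.
    dev_files = ['wsj_0001.pos', 'wsj_0002.pos']
    dev_sents = treebank_chunk.chunked_sents(dev_files)   # gold chunk trees

    # evaluate() runs the chunker over the POS-tagged words of each gold tree
    # and reports accuracy, precision, recall, and F-measure in one score.
    print(chunker.evaluate(dev_sents))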

Assignment 3: Superchunking

Create larger constituents (due April 3 COB)

  • Create a new chunker which takes the chunked data and produces better chunks (with internal structure if you see fit). (Write your own; do not use the RegexpParser from the book.) Your program should take a tree as input and produce a new tree (where a "tree" is an instance of the NLTK type Tree).
  • Your program should consist of a set of declarative rules (taking advantage of Python data structures) and an "interpreter" which is agnostic as to the specific set of rules being applied (see the sketch after this list for one possible shape).
  • Focus your efforts on people, places, and organizations. Use the dev set from the previous assignment. DON'T LOOK AT THE TEST SET!
  • You will be given a list of filenames to test your program on Monday (April 1st). If you want to change your program after you get the test data, you need to turn in both your original and the rerun.
  • This is not an easy task. Focus on making a coherent program and doing the best you can on the rules. As you might guess, additional information and probabilities will make the job easier (Can you see another assignment in the works?).
  • TURN IN: Your program, your rules, and your results on the test data
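
To give a sense of what "declarative rules plus an agnostic interpreter" might look like, here is one possible shape (a sketch only, not the required design): each rule is a plain Python data structure, and the interpreter walks the children of the input tree without knowing anything about any particular rule. The example rule and the helper name apply_rules are made up for illustration, and the snippet assumes the current NLTK Tree.label() accessor (older NLTK versions use tree.node).

    from nltk.tree import Tree

    # Rules are data: a label for the new constituent, a window size, and a
    # predicate over that window of children. The rule below (merge two
    # adjacent NP chunks) is only a placeholder; real rules for people,
    # places, and organizations will need to inspect the words and tags.
    RULES = [
        {"label": "NP-ORG",
         "span": 2,
         "matches": lambda kids: all(isinstance(k, Tree) and k.label() == "NP"
                                     for k in kids)},
    ]

    def apply_rules(tree, rules=RULES):
        """Return a new Tree with rule matches wrapped in larger constituents."""
        children = list(tree)
        out = []
        i = 0
        while i < len(children):
            for rule in rules:
                window = children[i:i + rule["span"]]
                if len(window) == rule["span"] and rule["matches"](window):
                    out.append(Tree(rule["label"], window))
                    i += rule["span"]
                    break
            else:
                out.append(children[i])
                i += 1
        return Tree(tree.label(), out)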

Programming Assignment 1: Dividing Sentences

Due: Monday, 4 February COB (close of business, 6 pm)

In this assignment, you will be dividing sentences into given and new portions. The data you will be working with is parsed and annotated with part-of-speech tags, but is from spoken dialog, and so contains many dysfluencies and interruptions.

The theory behind this assignment is described in Section 4.1 of the Modeling Conversational Speech paper. Here is the current tag list. More information on the part of speech tags can be found in the Tag Guidelines document. This may not be an exact set, but it's close. You can infer the differences from the data.

Details of the assignment are here

Programming Assignment 1: Part Two: Using Perplexity to Examine the Differences

DUE Wednesday February 27, COB

Now that you've divided the sentences, how do you know there is any difference between the two sets? One way is to build ngram models from the two sets and compare their perplexity.
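
(As a reminder, and not tied to either toolkit: for a test set of N words w_1 ... w_N, a model P has perplexity PP = P(w_1 ... w_N)^(-1/N), so lower perplexity means the model predicts that test set better. If the "before pivot" and "after pivot" portions really are different, a model trained on one should show higher perplexity on the other's test set than on its own.)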

For this part of the assignment, I want you to use one of the open source toolkits to build the models and compute the perplexity. You can use either the SRILM toolkit or the CMUCU toolkit. You should check both out; I have found in the past that some students have had problems with one or the other depending on their operating system, and it's not really predictable, so you'll need to try them out.

Info on the toolkits: SRILM tutorial, SRILM man pages (these are a little dense), paper on the CMUCU toolkit, documentation on typical usage for CMUCU (the documentation overall is pretty good).

If you run into problems on this one, email me (not Alex) and I'll try to help you out. However part of the assignment is learning how to figure out how to use toolkits from the web. Before you come to me, Google the problem you are having (include your operating system if you think that's relevant). It's amazing to find out how many people have the same problems. But look at several answers--not all responders know what they are talking about. Filtering through is an essential skill in this business.

You need to do the following steps:

  • Determine the format requirements for the tool you are using (remember, the ngrams are over only the words, not the tags)
  • Divide the data into the following corpora. Training and test should be non-overlapping, and the test set should be about 25% of the data, selected randomly (e.g. pull out every 4th sentence).
    • All sentences test and training
    • Before pivot test and training
    • After the pivot (including the pivot) test and training
  • Write a new output routine to create the corpora in the correct format
  • Build ngram models using the training data
  • Run the "perplexity" routine in the toolset over each test set on each training set (9 runs in all); see the sketch after this list for what this might look like with SRILM
  • Your final submission should include:
    • a 3x3 matrix of the results of computing the perplexity of each of the three test sets on each of the three models created from the training.
    • A short write up discussing the results
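
If you go the SRILM route, here is a rough sketch of what the split and the nine perplexity runs could look like when driven from Python. The filenames and the split_corpus helper are made up for illustration, and depending on your data you may want to add smoothing/discounting options to ngram-count (see the man pages); the two commands themselves (ngram-count to build a model, ngram -ppl to score a test set) are the standard SRILM entry points.

    import subprocess

    CORPORA = ["all", "before_pivot", "after_pivot"]   # hypothetical names

    def split_corpus(sentences):
        """Hold out every 4th sentence as test (~25%); the rest is training."""
        test = [s for i, s in enumerate(sentences) if i % 4 == 0]
        train = [s for i, s in enumerate(sentences) if i % 4 != 0]
        return train, test

    # Build one trigram model per training set, then score every test set
    # against every model (9 perplexity runs in all).
    for train_name in CORPORA:
        subprocess.call(["ngram-count", "-text", train_name + ".train.txt",
                         "-order", "3", "-lm", train_name + ".lm"])
        for test_name in CORPORA:
            print(train_name + " model on " + test_name + " test:")
            subprocess.call(["ngram", "-lm", train_name + ".lm",
                             "-ppl", test_name + ".test.txt"])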