Spring 2014
Late Policy
Assignments will be accepted late ONLY if an extension has been agreed upon. You must email Professor Meteer and John the day before the assignment is due with the reason you need an extension. Later requests will only be accepted in emergency circumstances.
Assignment 4: Classification (due COB April 29th)
Your assignment is to train the NLTK Naive Bayes classifier on features from the Switchboard Dialog Act Corpus and classify the utterances in a test set according to the dialog act tag (from the 44-member set, see below).
This is the same corpus you used for Assignment 1, Part 2, so you should already have the readers and access to the corpus. Information about the corpus can be found here: http://compprag.christopherpotts.net/swda.html
You need to first pull out a test set (~25% of the whole corpus). If you decide to use previous tags as a feature, your test set should be full dialogs (e.g. 25% of the files, rather than 25% of the utterances).
Use the reduced tag set of 44 tags (or thereabouts) defined in the Coders' Manual (http://www.stanford.edu/~jurafsky/ws97/manual.august1.html).
The Utterance object method damsl_act_tag() converts the original tags to this 44-member subset.
1. Compute the baseline using just the words as features.
2. Compute the "next level" using bigrams and trigrams as features.
3. Design 3-5 more features and see how much you can improve the performance. You can use anything in the data as features. Be creative.
Submit your code, results (include accuracy and a confusion matrix using nltk.metrics.confusionmatrix for each feature set), and a discussion of the features, including why you chose each feature and what it contributed to the results. A small illustrative sketch of the word-feature baseline follows.
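A minimal sketch of the baseline (step 1), assuming the swda.py module from the SWDA page above; the CorpusReader class, utterances() iterator, and pos_words() method are names taken from that module, so check them against your local copy. The split below is by utterance for brevity; split by whole dialogs if you condition on previous tags.

import nltk
from swda import CorpusReader   # reader module from the SWDA page (assumed installed)

def word_features(utt):
    # Baseline: each word in the utterance as a boolean feature.
    return dict(('word:%s' % w.lower(), True) for w in utt.pos_words())

reader = CorpusReader('swda')    # path to the unpacked corpus directory
labeled = [(word_features(u), u.damsl_act_tag())
           for u in reader.utterances(display_progress=False)]

split = int(len(labeled) * 0.75)
train, test = labeled[:split], labeled[split:]

classifier = nltk.NaiveBayesClassifier.train(train)
print('accuracy: %s' % nltk.classify.accuracy(classifier, test))

gold = [tag for (feats, tag) in test]
predicted = [classifier.classify(feats) for (feats, tag) in test]
print(nltk.ConfusionMatrix(gold, predicted))

The same loop works for the bigram/trigram and custom feature sets: swap in a different feature extractor and rerun the training, accuracy, and confusion matrix steps.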
Assignment 2: Chunking
Part One (due March 14 COB)
- Use the NLTK chunker: write chunking rules and apply them to the Treebank data, which you can download from NLTK (see the NLTK Book, Ch. 7 for more info on chunking).
- The IDs of the sentences to use for "dev test" are listed here: Development Data. If you just want to see them all in one place to get a feel for what you are looking for, I've pulled them together here: Development Sentences
- Evaluate your performance on the "tagged" treebank data (which is already chunked). Use only the words and POS as input and create the output structures. You can run this set through your rules as many times as you want until you think the performance is good enough. I'll add a "real" test set closer to the due date, which you should run, without looking at it, for your final submission.
- You should basically be seeking to produce something similar to the code in Section 7.3 of the NLTK textbook on developing and evaluating rule-based regular expression chunkers, except that you will be using a subset of the treebank_chunk corpus instead of the conll2000 corpus. So your grammar will be a regular-expression-style chunk rule string (like the example in the book's code, "NP: {<[CDJNP].*>+}", but more fleshed out), and you will evaluate your rules over a list of test sentences using the nltk.RegexpParser.evaluate() method (a small sketch follows the SUBMIT list below).
- To get the particular sentences you are interested in out of the whole treebank_chunk corpus, you can either pass the list of filenames (with the ending ".pos" on each of them) to the treebank_chunk.chunked_sents() function or create a new list out of the elements of the list at those indices.
- Either way, it is important that you evaluate over all 10 final test sentences at once and get a single accuracy/precision/recall/fmeasure score.
- FINAL TEST SENTENCES: Test data
- SUBMIT
- Your rules
- Your evaluation scores
- The test corpus with your chunks marked
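For reference, a minimal sketch of the evaluation described above. The file IDs here are just the first two files in treebank_chunk, not the assignment's dev or test set, and the grammar is only the book's starter rule; your own rule set should be much more fleshed out.

import nltk
from nltk.corpus import treebank_chunk

# Placeholder file IDs; substitute the dev or final test IDs given above.
file_ids = ['wsj_0001.pos', 'wsj_0002.pos']

# The book's starter rule; expand this into your full grammar.
grammar = r"NP: {<[CDJNP].*>+}"
chunker = nltk.RegexpParser(grammar)

# Evaluate over all selected sentences at once to get a single set of scores.
gold_sents = treebank_chunk.chunked_sents(file_ids)
print(chunker.evaluate(gold_sents))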
Assignment 3: Superchunking
Create larger constituents (due April 4 COB)
- Create a new chunker which takes the chunked data and produces better chunks (with internal structure if you see fit). (Write your own; do NOT use the RegexpParser from the book.) Your program should take a tree as input and produce a new tree (where a "tree" is an instance of the NLTK type Tree).
- Your program should consist of a set of declarative rules (taking advantage of Python data structures) and an "interpreter" which is agnostic as to the specific set of rules being applied (a small illustrative sketch follows this list).
- Focus your efforts on people, places, and organizations. Use the dev set from the previous assignment. DON'T LOOK AT THE TEST SET!
- You will be given a list of filenames to test your program on Monday (April 1st). If you want to change your program after you get the test data, you need to turn in both your original and the rerun.
- This is not an easy task. Focus on making a coherent program and doing the best you can on the rules. As you might guess, additional information and probabilities will make the job easier (Can you see another assignment in the works?).
- TURN IN: Your program, your rules, and your results on the test data (SuperChunkTest). Note that you are also welcome to show the performance on the previous dev and test sets to illustrate the features of your program.
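One possible way to organize the declarative rules and the rule-agnostic interpreter is sketched below. It assumes chunk trees whose non-chunk tokens are (word, tag) tuples and uses NLTK 3's Tree.label() (older versions expose the label as the node attribute); the single merge rule is made up purely for illustration and is not part of the assignment.

import nltk

def merge_np_of_np(tree):
    # Illustrative rule: merge the pattern NP 'of' NP into one larger NP.
    kids = list(tree)
    out, i = [], 0
    while i < len(kids):
        if (i + 2 < len(kids)
                and isinstance(kids[i], nltk.Tree) and kids[i].label() == 'NP'
                and not isinstance(kids[i + 1], nltk.Tree) and kids[i + 1][0].lower() == 'of'
                and isinstance(kids[i + 2], nltk.Tree) and kids[i + 2].label() == 'NP'):
            out.append(nltk.Tree('NP', kids[i:i + 3]))
            i += 3
        else:
            out.append(kids[i])
            i += 1
    return nltk.Tree(tree.label(), out)

RULES = [merge_np_of_np]   # the declarative part: add more rule functions here

def superchunk(tree, rules=RULES):
    # The interpreter: applies each rule in order, knowing nothing about its contents.
    for rule in rules:
        tree = rule(tree)
    return tree

The point of the design is that the interpreter never changes; improving coverage means only adding or reordering entries in the rule list.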
Programming Assignment 1: Dividing Sentences
Due: Wednesday, 29 January COB (close of business, 6 pm)
In this assignment, you will be dividing sentences into given and new portions. The data you will be working with is parsed and annotated with part-of-speech tags, but is from spoken dialog, and so contains many dysfluencies and interruptions.
The theory behind this assignment is described in Section 4.1 of the Modeling Conversational Speech paper. Here is the current tag list. More information on the part of speech tags can be found in the Tag Guidelines document. This may not be an exact set, but it's close. You can infer the differences from the data.
Write a program that divides as many of the sentences in the sample corpus as possible into their given and new portions, using the heuristics described in the paper (i.e., before the first strong verb or after the last weak one). You may use POS data, the full parse, or any combination. Try to deal with dysfluencies and interrupted sentences as best you can.
Submit your code along with a short (1–2 paragraphs) writeup of your algorithms and results.
Data and Tips
The data for this assignment is a small corpus consisting of 59 sentences. You can use the NLTK corpus reader to read the data like so:
>>> import nltk
>>> reader = nltk.corpus.reader.BracketParseCorpusReader('...', 'Swbd_POS_Disf_data.txt', detect_blocks='sexpr')

where '...' is the path to the data file, or just '.' if it's in the current directory.
You can use the sents, tagged_sents, and parsed_sents methods of the corpus reader to obtain the words of the sentences, the words with POS, and the fully parsed sentences, respectively. The last is represented using instances of nltk.Tree, which has many convenient methods for traversing and manipulating tree-structured data.
weak_verbs = ["'m", "'re", "'s", "are", "be", "did", "do", "done", "guess", "has", "have", "is", "mean", "seem", "think", "thinking", "thought", "try", "was", "were"]
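A minimal sketch of the split itself, assuming the weak_verbs list above, the reader's tagged_sents() output of (word, tag) pairs, and verb POS tags beginning with 'VB'; handling of disfluencies and interrupted sentences is left to you.

def split_given_new(tagged_sent, weak_verbs):
    # Indices of all verbs, and of the strong (non-weak) verbs among them.
    verb_idxs = [i for i, (word, tag) in enumerate(tagged_sent) if tag.startswith('VB')]
    strong_idxs = [i for i in verb_idxs if tagged_sent[i][0].lower() not in weak_verbs]
    if strong_idxs:
        pivot = strong_idxs[0]          # split before the first strong verb
    elif verb_idxs:
        pivot = verb_idxs[-1] + 1       # otherwise split after the last weak verb
    else:
        return tagged_sent, []          # no verb at all: leave the sentence whole
    return tagged_sent[:pivot], tagged_sent[pivot:]

# Usage: given, new = split_given_new(reader.tagged_sents()[0], weak_verbs)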
Programming Assignment 1 PART TWO: Using Perplexity to examine the differences
DUE Monday February 24, COB
Now that you've divided the sentences, how do you know there is any difference between the two sets? One way is to build ngram models from the two sets and compare their perplexity.
For this part of the assignment, I want you to use off-the-shelf open source tools to build the models and compute the perplexity. You can use either the SRILM toolkit or the CMUCU toolkit. You should check both out; I have found in the past that some students have had problems with one or the other depending on the operating system, but it's not really predictable, so you'll need to try them out.
Info on the toolkits: SRILM tutorial, SRILM man pages (these are a little dense), paper on CMUCU toolkit, documentation on typical usage for CMUCU (the documentation overall is pretty good).
Part of the assignment is learning how to figure out how to use toolkits from the web. Before you come to John or me, Google the problem you are having (include your operating system if you think that's relevant). It's amazing to find out how many people have the same problems. But look at several answers--not all responders know what they are talking about. Filtering through is an essential skill in this business.
For this part of the assignment you will be using a new, much larger data set. However, if you are feeling unsure about the assignment, I would recommend trying it out with the data from Part 1, since it will reduce the number of things you are changing at once.
You need to do the following steps:
- Using the data set and readers supplied by John, replace your input component and change your "find pivot" component to use the new data.
- Divide the data into training and test sets, which should be non-overlapping; the test set should be about 25% of the data, selected randomly (e.g. pull out every 4th conversation).
- Determine the format requirements for the tool you are using (remember, the ngrams are over only the words, not the tags) and write a new output routine to create the corpora in the correct format.
- Using your "find pivot" program, produce the following 6 files in the required format:
- All sentences test and training
- Before pivot test and training
- After the pivot (including the pivot) test and training
- Extra credit: No pivot test and training
- Build ngram models using the training data (either bigrams or trigrams)
- Run the "perplexity" routine in the toolset over each test set on each training set (9 runs for just before, after, and all; 16 if you are including the no-pivot set). An illustrative SRILM command sequence appears after the submission list below.
- Your final submission should include:
- A 3x3 (or 4x4) matrix of the results of computing the perplexity of each of the test sets on each of the models created from the training data.
- A short write-up discussing the results (about a page)
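If you end up using SRILM, ngram-count and ngram are its standard commands for building a model and computing perplexity; the file names and the Python wrapper below are just one illustrative way to produce the 3x3 (or 4x4) matrix, and CMUCU has its own equivalent commands.

import subprocess

conditions = ['all', 'before', 'after']   # add 'nopivot' for the 4x4 version

# Build one trigram model per training file (file names like all.train.txt are assumed).
for c in conditions:
    subprocess.call(['ngram-count', '-order', '3',
                     '-text', '%s.train.txt' % c,
                     '-lm', '%s.lm' % c])

# Score every test set against every model to fill the perplexity matrix.
for model in conditions:
    for test in conditions:
        print('model=%s test=%s' % (model, test))
        subprocess.call(['ngram', '-order', '3',
                         '-lm', '%s.lm' % model,
                         '-ppl', '%s.test.txt' % test])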