In this assignment, you will be dividing sentences into given and new portions. The data you will be working with is parsed and annotated with part-of-speech tags, but is from spoken dialog, and so contains many dysfluencies and interruptions.
The theory behind this assignment is described in Section 4.1 of the Modeling Conversational Speech paper.
Your task is to write a program that divides as many of the sentences in the Switchboard corpus as possible into their ‘given’ and ‘new’ portions, using the heuristics described in the paper (i.e., before the first strong verb or after the last weak one). You'll be using just POS data for now. Try to deal with dysfluencies and interrupted sentences as best you can.
To get started, you'll need to import the appropriate corpus reader:
>>> from nltk.corpus import switchboard
>>> switchboard
<SwitchboardCorpusReader in '.../corpora/switchboard' (not loaded yet)>
The NLTK corpus reader for the Switchboard corpus is organized around the idea of a ‘discourse’, which is basically just a conversation. We don't actually care about that level of organization in this assignment, but you need to be aware of it in order to use the reader. Inside each discourse, the speakers take turns making utterances, each of which consists of one or more sentences, separated by periods (with part of speech ‘.’). For instance, here are the first three turns of the first discourse; each consists of exactly one sentence:
>>> switchboard.tagged_discourses()[0][:3]
[<A.1: 'Uh/UH ,/, do/VBP you/PRP have/VB a/DT pet/NN Randy/NNP ?/.'>,
 <B.2: 'Uh/UH ,/, yeah/UH ,/, currently/RB we/PRP have/VBP a/DT poodle/NN ./.'>,
 <A.3: 'A/DT poodle/NN ,/, miniature/JJ or/CC ,/, uh/UH ,/, full/JJ size/NN ?/.'>]
The data structure for representing turns is a subclass of list, and can be treated as such:
>>> switchboard.tagged_discourses()[0][0][0]
('Uh', 'UH')
These are the elements you'll be working over: tuples of the form (word, POS). A sentence is basically just a list of such tuples; one or more sentences are concatenated to form a turn; and a list of turns comprises a discourse.
Your first task should be to separate turns into sentences, since that's what we want to operate over. You can just use lists of tagged words to represent sentences.
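Here's a minimal sketch of one way to do the split, assuming that any token whose POS tag is '.' closes a sentence; the name split_turn is just for illustration:

def split_turn(turn):
    """Split a tagged turn into a list of sentences (lists of (word, POS) tuples)."""
    sentences, current = [], []
    for word, pos in turn:
        current.append((word, pos))
        if pos == '.':   # sentence-final punctuation ('.', '?', '!') ends a sentence
            sentences.append(current)
            current = []
    if current:          # keep any trailing fragment, e.g. an interrupted sentence
        sentences.append(current)
    return sentences

Since each of the first few turns contains exactly one sentence, you'd expect:

>>> len(split_turn(switchboard.tagged_discourses()[0][0]))
1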
The primary task is to write a function that takes a sentence and returns a tuple of three lists: the words before the pivot point, the pivot word (as a list, for consistency), and those after the pivot. (If no pivot is found, everything should be in either the first or last sublist; it doesn't much matter which). For example:
>>> pivot_sentence(switchboard.tagged_discourses()[0][6])
([('I', 'PRP')],
[('read', 'VBD')],
[('somewhere', 'RB'), ('that', 'IN'), (',', ','), ('the', 'DT'), ...])
The crucial point here is that it is not sufficient to simply print the divided sentences. (It's probably a good idea to have something that does so for debugging purposes, but that is not the primary goal.) Imagine that your code will be some intermediate stage of processing in a longer pipeline—you get the data as nicely structured objects, and you should be able to annotate those objects with the pivot point and pass them on to the next phase of processing.
Here's a rough list of ‘weak verbs’:
["'m", "'re", "'s" "are", "be", "did", "do", "done", "guess", "has", "have", "is", "mean", "seem", "think", "thinking", "thought", "try", "was", "were"]Feel free to add to or tweak this list as you see fit.
Submit your code along with a short (a few paragraphs) writeup of your
algorithms and results by emailing both Prof. Meteer and Alex with a
subject that contains the literal string "PA1". Your code and writeup
should be in a single ZIP or TAR
file whose name includes (an abbreviation of) your name; e.g.:
cs114-plotnick-pa1.tar.