Assignment 1: Tokenizer

Assigned: January 25th, 2006
Due: February 6th, 2006

Specification

Write a tokenizer for this text, with the following properties:

Assume the output of the tokenizer will be used for lexical lookup (e.g., for part-of-speech tagging or semantic typing).  Thus, it should ouput tokens which match the tokens in the (assumed) lexicon.  As pointed out in class, the best way to do tokenization depends a lot on how you want to do later stages of processing.  To do this exercise, you'll have to make assumptions about how these later stages will proceed.  Part of this assignment involves making those assumptions intelligently.  For example, assume that the following kinds of forms will not be cluttering up the lexicon:

This document specifies many more assumptions you have to make to do the assignment. If you find yourself having to make other assumptions or judgement calls, submit a paragraph or two outlining what the assumptions are and justifying them.

Keep hyphenated forms as single tokens (Don't split words at hyphens.  While no lexicon can contain all possible hyphenated forms, we will assume that some later stage of processing will take care of this).  For example:

Similarly, keep slashed forms as single tokens.  For example:

Assume that whatever abbreviations are in the lexicon (they probably can't all be in there) contain periods. For example:

Thus, your tokenizer should keep the dot as part of the abbreviation.

As discussed in class, this last bit is tricky: compare the required treatment of 'Saddam.' which might occur at the end of a sentence.   How do you know whether you have a sentence ending or the end of an abbreviation (or an abbreviation at the end of a sentence)?  Try to come up with the most general strategy you can to solve this problem. (Hint: trying to list all possible abbreviations is not a great strategy.  And listing only the abbreviations which occur in the sample text is a terrible strategy)  

Another tricky thing: Your tokenizer should not split 'names' (companies, files, URLs, etc.) at punctuation. For example:

Obviously, this is going to be hard to do perfectly. Try to come up with as general a strategy as you can to handle this problem.

Split '%' from numbers.

Assume that the lexicon contains all the forms of the suffix part of a contraction.  So it contains "not" but also "n't", "will" but also "'ll"; However, our lexicon doesn't contain variations of the prefix part of the contraction, once the suffix part is removed from them: It contains will, but not wo  (e.g., won't); it contains can, but not ca (e.g., can't). Thus, your tokenizer should expand all contractions:

Note that 's (genitive) is not a contraction in phrases like Saddam's regime.  Your tokenizer should split the 's from the word it appears on.

Sentence punctuation (, ! ( ) ? ; . etc.) should be treated as distinct tokens:

  1. ... business day. ==> [..., 'business', 'day', '.']
  2. ... reason to be cheerful: ==> [..., 'reason', 'to', 'be', 'cheerful', ':'] 
  3. ... 'we don't know how many,' he said ==> ["'", 'we', 'do', "n't", 'know', 'how', 'many', "'", ',', 'he', 'said']

This last example is not trivial: it requires changing the order of the puntuation sequence: '",' (it must be this way in order to have a coherent treatment of the sequence of '"' plus '.' as in the last sentence of the sample text: '... at least 2 dead". ')

Insert paragraph markers ('<PP>') between the paragraphs, so as not to lose the text marking.
Ideally there should be just one paragraph marker between every two paragraphs --even if they are separated by more than two carriage returns.

The tokenizer should be a function named tokenize, whose argument is a single string and that returns a list of strings.

Submission

Work must be done individually.  For a definition of individually, please see Brandeis University's standards for academic integrity.

Email a file named tokenizer.py, that can be used to generate the following interaction (for example):

% python
Python 1.5.2 (#1, Feb 1 2000, 16:32:16) [GCC egcs-2.91.66 19990314/Linux (egcs- on linux-i386
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> import tokenizer
>>> tokenizer.tokenize("They're in trouble.")
['They', "'re", 'in', 'trouble', '.']
>>> tokenizer.tokenize("They're in trouble.")[0]
'They'

At the same time, the program must be executable with the command python tokenizer.py input_file output_file. This will be of great help to you for checking the adequacy of your program --and for us, to grade the assignment! Allow the output in output_file to be printed as a list which maintains the paragraphs markers (<PP>). Therefore, an input file containing the same testing sentence as above, should be outputted as follows:

They
're
in
trouble
.
<PP>

We'll take points out if your output file is in a format different from the one specified here.

Grading

Total points: up to 10

Part of these credits will be awarded according to the readability and structure of the code. Part of the credits will also be awarded according to whether the code meets the specifications here (We will take into account any arguments you make in an accompanying paragraph).

Some criteria for readable programs are: