Assignment 3: Perl-Tagger, part 1

Write a Perl program that assigns tags to all tokens in a text. Also include a Perl routine that computes the ambiguity-length ratio.

The goal of the tagger is to turn input like:

	The green dog .
into
	The/DT green/JJ|NN dog/NN|VB ./.
That is, multiple tags should be assigned if a token is ambiguous. We will provide a tokenized input text and a lexicon.
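
The output format above can be sketched in a few lines of Perl. This is only an illustration of the token/TAG1|TAG2 notation: the name format_token and the hand-written tag lists are assumptions made for the example, and nothing is read from the lexicon yet.

```perl
use strict;
use warnings;

# Join a token and its candidate tags into the "token/TAG1|TAG2" output form.
# format_token is a hypothetical helper name, not something the assignment prescribes.
sub format_token {
    my ($token, @tags) = @_;
    return $token . '/' . join('|', @tags);
}

# Hand-written tag lists for the example sentence (not taken from the real lexicon).
my @sentence = (
    [ 'The',   'DT' ],
    [ 'green', 'JJ', 'NN' ],
    [ 'dog',   'NN', 'VB' ],
    [ '.',     '.' ],
);

# Each array ref is (token, tag, tag, ...), so it can be passed on directly.
print join(' ', map { format_token(@$_) } @sentence), "\n";
```

Run as is, this should print the tagged example sentence shown above.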

Here are the steps that you have to take:

  1. Decide how you want to use the lexicon. The easiest approach is to read the lexicon file and store all lexical entries in a hash, keyed by token. If you get annoyed by waiting for the program to read in the lexicon on every run, look into the dbmopen function, which ties a hash to a database file on disk.

  2. Go through a sentence and add tags to all tokens. Figure out what to do if a token is not in the lexicon. Sometimes those tokens are plain rubbish and there isn't a lot you can do with them except assign them a default tag or a set of default tags. But in many cases you can do something clever. Most punctuation marks, for example, are not in the lexicon, but you should be able to assign some meaningful tag to them. Likewise for numbers and cases like "Atlanta's".

    Also, some of the lexical entries have a tag that is ambiguous. For example,
    	advanced VBN JJ VBD VBN|VBD
    	fresh JJ JJ|RB
    	running VBG JJ NN NN|VBG
    
    The lexicon file is created from a tagged corpus where the tags were assumed to be correct. But in some instances, the tagger that tagged the corpus could not disambiguate two tags. This is how VBN|VBD and JJ|RB got into the lexicon. For this assignment, you should see the tags of "fresh" as a set of three tags: JJ, JJ and RB. But at some point, you will have to do something about these tags, since duplicate tags inflate the ambiguity-length ratio without adding any information. You may choose to delete duplicate tags at this point, or you may let the heuristics of the next assignment take care of it.

  3. Compute the ambiguity-length ratio for each sentence. This ratio is the total number of possible tag assignments to the sentence divided by the length of the sentence in tokens. For example, with
    	The/DT green/JJ|NN dog/NN|VB ./.
    the ratio is 4 divided by 4: there are 1 * 2 * 2 * 1 = 4 possible tag assignments and 4 tokens. Include punctuation in the calculation. You may do this by writing a separate Perl program that works on the output of the tagger, but it is faster to do it in the same while loop in which you assign the tags. It is not necessary to compute the average ratio for the whole file (contrary to what I said in class).
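
The three steps above can be sketched roughly as follows. This is a minimal sketch under several assumptions, not a model solution: the subroutine names (parse_lexicon, tags_for, tag_sentence) are made up for the example, the lexicon is passed in as a list of lines rather than read from the provided file, and the fallbacks for unknown tokens (punctuation tagged as itself as in "./.", CD for numbers, NN as a default) are illustrative guesses. Duplicate tags are removed while loading, which is only one of the two options mentioned in step 2.

```perl
use strict;
use warnings;

# Step 1: turn lexicon lines such as "fresh JJ JJ|RB" into a hash of
# token => [tags]. Ambiguous entries like "JJ|RB" are split into their
# parts, and duplicate tags are dropped (an assumption; see the lead-in).
sub parse_lexicon {
    my (@lines) = @_;
    my %lexicon;
    for my $line (@lines) {
        my ($word, @entries) = split ' ', $line;
        next unless defined $word;          # skip empty lines
        my (%seen, @tags);
        for my $entry (@entries) {
            for my $tag (split /\|/, $entry) {
                push @tags, $tag unless $seen{$tag}++;
            }
        }
        $lexicon{$word} = \@tags;
    }
    return \%lexicon;
}

# Step 2: look a token up, with hypothetical fallbacks for unknown tokens.
sub tags_for {
    my ($lexicon, $token) = @_;
    return @{ $lexicon->{$token} } if exists $lexicon->{$token};
    return ($token) if $token =~ /^[[:punct:]]+$/;       # guess: punctuation tags itself
    return ('CD')   if $token =~ /^\d+(?:[.,]\d+)*$/;    # guess: CD for numbers
    return ('NN');                                       # guess: default tag
}

# Steps 2 and 3 in one loop: tag each token and multiply the tag counts
# to get the number of possible tag assignments for the sentence.
sub tag_sentence {
    my ($lexicon, $sentence) = @_;
    my @tokens      = split ' ', $sentence;
    my $assignments = 1;
    my @tagged;
    for my $token (@tokens) {
        my @tags = tags_for($lexicon, $token);
        $assignments *= scalar @tags;
        push @tagged, $token . '/' . join('|', @tags);
    }
    my $ratio = @tokens ? $assignments / scalar(@tokens) : 0;
    return (join(' ', @tagged), $ratio);
}

# Toy lexicon lines for illustration; only "fresh" is copied from the text above.
my $lexicon = parse_lexicon('The DT', 'green JJ NN', 'dog NN VB', 'fresh JJ JJ|RB');
my ($tagged, $ratio) = tag_sentence($lexicon, 'The green dog .');
print "$tagged\n";
printf "ambiguity-length ratio: %.2f\n", $ratio;
```

With the toy lexicon, the sketch reproduces the tagged example sentence and a ratio of 1.00. In a real program the lexicon lines would come from the lexicon file, and the fallback rules are where most of the clever work for unknown tokens would go.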
Again, put in lots of (correct) comments and hand in a hard copy and a soft copy.