The goal of the tagger is to turn input like:

    The green dog .

into

    The/DT green/JJ|NN dog/NN|VB ./.

That is, multiple tags should be assigned if a token is ambiguous. We will provide a tokenized input text and a lexicon.
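The output format above can be sketched in a few lines. This is only an illustration in Python, not the required implementation. The function name `tag_sentence`, the in-memory `lexicon` dictionary, and the `UNK` placeholder for words missing from the lexicon are all assumptions made for the example; the assignment text here does not say how unknown words should be handled.

```python
def tag_sentence(tokens, lexicon, unknown_tag="UNK"):
    """Attach every candidate tag to each token, joined by '|'."""
    tagged = []
    for token in tokens:
        # Hypothetical fallback: words not in the lexicon get a placeholder tag.
        tags = lexicon.get(token, [unknown_tag])
        tagged.append(token + "/" + "|".join(tags))
    return " ".join(tagged)


# Toy lexicon built by hand for this example (not the provided lexicon file).
lexicon = {
    "The": ["DT"],
    "green": ["JJ", "NN"],
    "dog": ["NN", "VB"],
    ".": ["."],
}

print(tag_sentence("The green dog .".split(), lexicon))
# prints: The/DT green/JJ|NN dog/NN|VB ./.
```

The same lookup-and-join logic carries over directly to a Perl hash and `join`.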
Here are the steps that you have to take:
    advanced VBN JJ VBD VBN|VBD
    fresh JJ JJ|RB
    running VBG JJ NN NN|VBG

The lexicon file is created from a tagged corpus whose tags were assumed to be correct. In some instances, however, the tagger that tagged the corpus could not disambiguate between two tags. This is how VBN|VBD and JJ|RB got into the lexicon. For this assignment, you should treat the tags of "fresh" as a set of three tags: JJ, JJ and RB. At some point you will have to do something about these tags, since duplicate tags don't do any good for the ambiguity-length ratio. You may delete duplicate tags at this point, or you may let the heuristics of the next assignment take care of it.
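One way to read a lexicon entry and flatten the `A|B` combinations into single tags is sketched below, again in Python purely for illustration. It assumes one entry per line with whitespace-separated fields, which the sample entries suggest but the assignment does not state. The helper names `parse_lexicon_line` and `dedupe` are hypothetical.

```python
def parse_lexicon_line(line):
    """Split 'word TAG TAG|TAG ...' into (word, tags), expanding 'A|B' entries."""
    fields = line.split()
    word, entries = fields[0], fields[1:]
    tags = []
    for entry in entries:
        # 'JJ|RB' contributes two tags, 'JJ' and 'RB'.
        tags.extend(entry.split("|"))
    return word, tags


def dedupe(tags):
    """Drop duplicate tags while keeping the order of first occurrence."""
    return list(dict.fromkeys(tags))


word, tags = parse_lexicon_line("fresh JJ JJ|RB")
print(word, tags)    # prints: fresh ['JJ', 'JJ', 'RB']
print(dedupe(tags))  # prints: ['JJ', 'RB']
```

Calling `dedupe` is the optional step: skip it if you prefer to leave the duplicates for the heuristics of the next assignment.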
For the sentence

    The/DT green/JJ|NN dog/NN|VB ./.

the ratio is 4 divided by 4. Include punctuation in the calculations. You may do this by writing a separate Perl program that works on the output of the tagger, but it is faster to do it in the same while loop in which you assign the tags. It is not necessary to compute the average ratio for the whole file (contrary to what I said during the class).