    The/DT green/JJ|NN dog/NN|VB ./.

should not be disambiguated to

    The/DT green/NN dog/VB ./.

but to

    The/DT green/JJ dog/NN ./.

The goal therefore is to reduce the ambiguity ratio without throwing away correct tags. To test this, we will provide a tagged unambiguous text that has the correct tags (although some errors might still occur).
Your program will take tokenized text as input. This text is provided in the input file, the same file that your last Perl script worked on. The first part of your program is easy because you already did it: assign tags from the lexicon and do something smart with tokens that do not occur in the lexicon. Use the same lexicon as before.
Then you will have to define heuristics that perform partial disambiguation of the tokens. Some of these heuristics were given in class and some will appear among the hints below. But you also have to define a couple of heuristics yourself (by "a couple" I mean between 5 and 10). The task is to show that you can write the Perl code that applies the heuristics, that you can create heuristics that do some disambiguation (without introducing too many errors), and that you become aware of some of the problems. You don't have to write a perfect disambiguator. The Brill tagger, for example, uses 800 rules for disambiguation (give or take a few hundred).
The output of your Perl script has to be in a one-sentence-per-line format with the tags separated from the tokens by a slash ("/") and multiple tags separated by a vertical bar ("|"). At the end of each line there have to be two numbers: the ambiguity ratio before disambiguation and the ratio after disambiguation. Examples:
    The/DT green/JJ dog/NN ./. [1 ; 0.25]
    The/DT green/JJ dog/NN|VB ./. [1 ; 0.5]

The reason for this strict format is that I might write a program that compares your output with the correct unambiguous text and computes a correctness score (on the other hand, I might leave this up to you).
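If the scoring is left up to you, a comparison along the following lines is one possibility. This is only a sketch: the function name `score_line` is made up, the gold text is assumed to use the same word/tag format with exactly one tag per token, and "correct" is assumed to mean that the gold tag is still among the tags left on the token. The real scoring program may well define things differently.

```perl
use strict;
use warnings;

# Hypothetical scorer (not part of the assignment): returns the fraction of
# tokens whose gold tag is still among the tags in the output line.
sub score_line {
    my ($output, $gold) = @_;
    $output =~ s/\s*\[[^\]]*\]\s*$//;    # strip the trailing "[before ; after]"
    my @out  = split ' ', $output;
    my @gold = split ' ', $gold;
    die "token count mismatch\n" unless @out == @gold;
    return 0 unless @out;                # empty line: nothing to score

    my $correct = 0;
    for my $i (0 .. $#out) {
        # Take everything after the last slash, so words containing "/" survive.
        my ($tags)     = $out[$i]  =~ m{/([^/]+)$};
        my ($gold_tag) = $gold[$i] =~ m{/([^/]+)$};
        next unless defined $tags && defined $gold_tag;
        my %have = map { $_ => 1 } split /\|/, $tags;
        $correct++ if $have{$gold_tag};
    }
    return $correct / @out;
}

print score_line('The/DT green/JJ dog/NN|VB ./. [1 ; 0.5]',
                 'The/DT green/JJ dog/NN ./.'), "\n";
```

Under this definition an output that keeps every tag always scores perfectly, which is why the score only makes sense when read together with the ambiguity ratio.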
The thing to do with the input line is to create an array of hashes, like this:
    @line = (
        { word => 'The',   tags => { DT => 1 } },
        { word => 'green', tags => { JJ => 1 } },
        { word => 'dog',   tags => { NN => 1, VB => 1 } },
        { word => '.',     tags => { '.' => 1 } },
    );

Now you can access individual words, hashes of tags, and tags like this:
    $line[0]->{word};         # "The"
    $line[1]->{tags};         # "HASH(0x1001ae8c)", i.e. a reference to a hash
    $line[2]->{tags}->{NN};   # "1", i.e. the third token is tagged as a NN
    $line[2]->{tags}->{JJ};   # undef (false), i.e. the third token is not tagged as a JJ

Of course, you will have to figure out how to create this structure. Here is a template, please fill in the holes:
    @line = split;                # while looping through input
    foreach $token ( @line ) {
        ......                    # extract word and tags
        ......                    # put tags on hash
        $token = { word => $word, tags => ...... }
    }

This loop will actually change the @line array, because inside foreach the variable $token is an alias for the array element itself.
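Once the structure exists, writing the output line is the reverse operation. The following is a minimal sketch, not a required design: the name `format_line` is made up, and the two ratios are simply passed in, since computing them is a separate step.

```perl
use strict;
use warnings;

# Hypothetical helper: render one sentence in the required output format.
# $line is a reference to the array of { word => ..., tags => {...} } hashes;
# $before and $after are ambiguity ratios computed elsewhere.
sub format_line {
    my ($line, $before, $after) = @_;
    my @tokens;
    foreach my $token (@$line) {
        # Join the remaining tags with "|"; sorting keeps the output stable.
        my $tags = join '|', sort keys %{ $token->{tags} };
        push @tokens, "$token->{word}/$tags";
    }
    return join(' ', @tokens) . " [$before ; $after]";
}

my @line = (
    { word => 'The',   tags => { DT => 1 } },
    { word => 'green', tags => { JJ => 1 } },
    { word => 'dog',   tags => { NN => 1, VB => 1 } },
    { word => '.',     tags => { '.' => 1 } },
);
print format_line(\@line, 1, 0.5), "\n";
# prints: The/DT green/JJ dog/NN|VB ./. [1 ; 0.5]
```

Hash keys come back in no particular order, so without the sort the same token could be printed as NN|VB in one run and VB|NN in the next.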
And now the heuristics. A heuristic always focuses on a particular token: it states what it looks for in that token, what it expects to the left and right, and what it will do to the token (presumably delete tags). Use a list of hashes like this one:
    @heuristics = (
        { name   => 'initial_det',
          tags   => ['DT', 'NNP', 'JJ'],   # list of tags expected on token
          left   => ['^'],                 # left context is beginning of line
          right  => [],                    # right context not important
          delete => ['NNP', 'JJ'] },       # tags to be deleted from token
        { name   => 'adjective_noun',
          tags   => ['NN', 'VB'],
          left   => ['JJ'],
          right  => [],
          delete => ['VB'] },
    );

Note that the heuristics themselves as well as the lists of tags are really references (to hashes and to lists, respectively), so if you want to access them you have to dereference them, for example:
    @{ $heuristics[0]->{tags} };   # the list (DT,NNP,JJ)

The heuristics have their limitations, of course. Maybe you want to refer directly to the word. Also, the left and right contexts are just lists of tags; maybe you want to refer to words instead. Or maybe you want to be able to state that the token to the left has an ambiguous tag. There is some work to do here.
Use

    delete $line[2]->{tags}->{VB};

to delete the VB tag from the word "dog".
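Putting the pieces together, applying one heuristic at one position might look roughly like the sketch below. Several things here are my assumptions rather than requirements: the function names, the '$' marker for end of line (mirroring '^'), the rule that a context matches when the neighbouring token carries at least one of the listed tags, and the rule that a heuristic only fires when the token keeps at least one expected tag that is not on the delete list (so a token never ends up without tags).

```perl
use strict;
use warnings;

# Tags on the token at position $i; '^' before the first token and
# '$' after the last one ('$' is an assumption mirroring '^').
sub tags_at {
    my ($line, $i) = @_;
    return ('^') if $i < 0;
    return ('$') if $i > $#$line;
    return keys %{ $line->[$i]{tags} };
}

# True if @$wanted is empty (context not important) or shares a tag with @have.
sub context_ok {
    my ($wanted, @have) = @_;
    return 1 unless @$wanted;
    my %have = map { $_ => 1 } @have;
    return scalar(grep { $have{$_} } @$wanted);
}

# Apply heuristic $h to position $i of @$line; returns the number of tags deleted.
sub apply_heuristic {
    my ($h, $line, $i) = @_;
    my $tags = $line->[$i]{tags};

    # Assumed firing condition: some expected tag is present and survives deletion.
    my %del  = map { $_ => 1 } @{ $h->{delete} };
    my @keep = grep { !$del{$_} && exists $tags->{$_} } @{ $h->{tags} };
    return 0 unless @keep;

    return 0 unless context_ok($h->{left},  tags_at($line, $i - 1));
    return 0 unless context_ok($h->{right}, tags_at($line, $i + 1));

    my $deleted = 0;
    foreach my $tag (@{ $h->{delete} }) {
        next unless exists $tags->{$tag};
        delete $tags->{$tag};
        $deleted++;
    }
    return $deleted;
}

my @line = (
    { word => 'The',   tags => { DT => 1 } },
    { word => 'green', tags => { JJ => 1 } },
    { word => 'dog',   tags => { NN => 1, VB => 1 } },
    { word => '.',     tags => { '.' => 1 } },
);
my %adjective_noun = (
    name => 'adjective_noun', tags => ['NN', 'VB'],
    left => ['JJ'], right => [], delete => ['VB'],
);
apply_heuristic(\%adjective_noun, \@line, 2);
print join('|', sort keys %{ $line[2]{tags} }), "\n";   # prints: NN
```

Note that the left context here matches even when the neighbour is itself still ambiguous (for example JJ|NN). Whether that is too permissive is exactly the kind of problem you should become aware of.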
More hints Tuesday in class.