Assignment 4: POS tagger, part 2

Assignment 4: Rule-based heuristics for a POS tagger

In the previous assignment, you looked up lexical items in a lexicon and assigned all tags to the token. For this assignment, you'll have to reduce the ambiguity ratio by using a set of heuristics. However, you should not just throw away all tags that you don't like. The trick is to disambiguate correctly. For example,

   The/DT green/JJ|NN dog/NN|VB ./.

should not be disambiguated to

   The/DT green/NN dog/VB ./.

but to

   The/DT green/JJ dog/NN ./.

The goal therefore is to reduce the ambiguity ratio without throwing away correct tags. To test this we will provide a tagged unambiguous text that has the correct tags (although some errors might still occur).

Your program will take tokenized text as input. This text is provided in the input file, the same file that your last Perl script worked on. The first part of your program is easy because you have already done it: assign tags from the lexicon and do something smart with tokens that do not occur in the lexicon. Use the same lexicon as before.

Then you will have to define heuristics that perform partial disambiguation of the tokens. Some of these heuristics were given in class and some will appear among the hints below. But you also have to define a couple of heuristics yourself (by a couple I mean between 5 and 10). The task is to show that you can write the Perl code that applies the heuristics, that you can create heuristics that do some disambiguation (without introducing too many errors), and that you become aware of some of the problems. You don't have to write a perfect disambiguator. The Brill tagger, for example, uses 800 rules for disambiguation (give or take a few hundred).

The output of your Perl script has to be in a one-sentence-per-line format with the tags separated from the tokens by a slash ("/") and multiple tags separated by a vertical bar ("|"). At the end of each line there have to be two numbers: the ambiguity ratio before disambiguation and the ratio after disambiguation. Examples:

   The/DT green/JJ dog/NN ./.		[1 ;  0.25]
   The/DT green/JJ dog/NN|VB ./.	[1 ;  0.5]

The reason for this strict format is that I might write a program that compares your output with the correct unambiguous text and computes a correctness score (on the other hand, I might leave this up to you).

Hints

Three parts: data structures, small useful subroutines and overall control.

1. Data Structures

First you should think about how to represent the sentence and the heuristics (well, you don't have to do that anymore if you take my word for it, but it is good to realise why I gave you the representations below).

The thing to do with the input line is to create an array of hashes, like this:

   @line = (
      { word => 'The' , tags => { DT => 1 } },
      { word => 'green' , tags => { JJ => 1 } },
      { word => 'dog' , tags => { NN => 1, VB => 1 } },
      { word => "." , tags => { "." => 1 } } );

Now you can access individual words, hashes of tags and tags like this:

   $line[0]->{word};		# "The"
   $line[1]->{tags};		# "HASH(0x1001ae8c)", i.e. a reference to a hash
   $line[2]->{tags}->{NN};	#  "1", i.e. the third token is tagged as a NN
   $line[2]->{tags}->{JJ};	#  undef (false), i.e. the third token is not tagged as a JJ
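
To see the access patterns at work, here is a small sketch that walks the array of hashes and turns it back into the required output format. The subroutine name format_line is my own invention, and the two ratios are simply passed in as numbers (how you compute them is up to your script). The exact spacing inside the brackets is also a guess on my part. Take it as one possible way, not as the required one.

```perl
use strict;
use warnings;

# Hypothetical helper: render one tagged sentence as a single output line.
# $before and $after are the two ambiguity ratios, computed elsewhere.
sub format_line {
    my ($line, $before, $after) = @_;
    my @parts;
    foreach my $token (@$line) {
        # hash keys come back in no fixed order, so sort the tags
        my $tags = join '|', sort keys %{ $token->{tags} };
        push @parts, "$token->{word}/$tags";
    }
    return join(' ', @parts) . "\t[$before ; $after]";
}

my @line = (
    { word => 'The',   tags => { DT => 1 } },
    { word => 'green', tags => { JJ => 1 } },
    { word => 'dog',   tags => { NN => 1, VB => 1 } },
    { word => '.',     tags => { '.' => 1 } },
);
print format_line(\@line, 1, 0.5), "\n";
```

Sorting the tags is not strictly needed, but it makes two runs of the script produce identical output, which helps when you compare against the correct text.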

Of course, you will have to figure out how to create this structure. Here is a template, please fill in the holes:

   @line = split;	# while looping through input
   foreach $token ( @line ) {
      ......		# extract word and tags
      ......		# put tags on hash
      $token = { word => $word, tags => ...... }}

Because foreach makes $token an alias for each element in turn, this loop will actually change the @line array.
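
If you get stuck on the template, the following sketch shows one way the holes could be filled. It assumes that each token already looks like word/TAG1|TAG2 after the lexicon lookup (your own intermediate format may differ), and the subroutine name parse_line is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: turn "dog/NN|VB" style tokens into the array of hashes.
sub parse_line {
    my ($text) = @_;
    my @line = split ' ', $text;
    foreach my $token (@line) {
        # split at the LAST slash, so that "./." gives word "." and tag "."
        my ($word, $tagstring) = $token =~ m{^(.*)/([^/]+)$}
            or die "cannot parse token '$token'\n";
        my %tags = map { $_ => 1 } split /\|/, $tagstring;
        # $token is an alias for the array element, so this changes @line
        $token = { word => $word, tags => \%tags };
    }
    return @line;
}

my @line = parse_line('The/DT green/JJ|NN dog/NN|VB ./.');
print "$line[2]->{word}: ", join('|', sort keys %{ $line[2]->{tags} }), "\n";
```

Matching at the last slash rather than the first is a deliberate choice here: tokens such as "./." or a fraction like "1/2" contain a slash in the word itself.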

And now the heuristics. A heuristic always focuses on a particular token and has information about what it looks for in that token, what it expects to the left and right, and what it will do to the token (deleting tags, presumably). Use a list of hashes like this one:

   @heuristics = (
   {
    name   => 'initial_det' ,
    tags   => ['DT','NNP','JJ'] ,	# list of tags expected on token
    left   => ['^'] ,			# left context is beginning of line
    right  => [] ,			# right context not important
    delete => ['NNP','JJ'] },		# tags to be deleted from token
   {
    name   => 'adjective_noun' ,
    tags   => ['NN','VB'] ,
    left   => ['JJ'] ,
    right  => [] ,
    delete => ['VB'] });

Note that the heuristics themselves, as well as the lists of tags, are really references, so if you want to access the underlying list you have to dereference it, for example:

   @{ $heuristics[0]->{tags} };	# the list (DT,NNP,JJ)

The heuristics have their limitations of course. Maybe you want to refer directly to the word on the token. Also, the left and right contexts are just lists of tags; maybe you want them to refer to words as well. Or maybe you want to be able to state that the token to the left has an ambiguous tag. There is some work to do here.

2. Small Useful Subroutines

Once you have your data structures in order and you know how to access them, it is time to write some functions. Here are a couple that you will need:

&Check_Tags( x , y )

A function that checks whether the tags in the heuristic occur on the token. The arguments are either integers or references to hashes and arrays. If they are integers, then x will stand for token x and y for heuristic y, and the function will need to figure out where to get the tags.
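
As a rough sketch of the "references" variant: the version below takes a reference to the token's hash of tags and a reference to the heuristic's list of tags. It assumes that ALL tags named by the heuristic have to be present on the token. That is only one reading of "occur on the token"; you may prefer a looser test.

```perl
use strict;
use warnings;

# Sketch of Check_Tags under the "references" reading:
#   $token_tags - reference to the hash of tags on the token
#   $wanted     - reference to the list of tags named by the heuristic
# Returns 1 only if every wanted tag is on the token (an assumption).
sub Check_Tags {
    my ($token_tags, $wanted) = @_;
    foreach my $tag (@$wanted) {
        return 0 unless exists $token_tags->{$tag};
    }
    return 1;
}

my %dog = ( NN => 1, VB => 1 );
print Check_Tags(\%dog, ['NN', 'VB'])        ? "match\n" : "no match\n";
print Check_Tags(\%dog, ['DT', 'NNP', 'JJ']) ? "match\n" : "no match\n";
```

The test uses exists rather than the stored value, so it keeps working after tags have been removed with delete.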

&Check_Left( x , y )

Check the left context of a heuristic. Probably the best approach is to let x be a token number and y a list of tags that have to be checked (taken from the heuristic). This function is likely to be recursive (but by no means necessarily so). You also need to do something about a left context that is the beginning of a line.

&Check_Right( x , y )

Similar to &Check_Left( x , y ).
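
Here is a non-recursive sketch of the left check. It makes two assumptions of its own: the list is read as a sequence of tags for the tokens directly to the left of token x (the last entry belongs to the nearest neighbour), and a token "matches" a tag if that tag is among its tags, even when the token is still ambiguous. The right check would mirror this, walking to the right instead.

```perl
use strict;
use warnings;

my @line;   # the array of hashes for the current sentence

# Sketch of Check_Left: $i is a token number, $context a reference to a list
# of tags. '^' stands for the beginning of the line.
sub Check_Left {
    my ($i, $context) = @_;
    my $pos = $i - 1;                       # start with the nearest neighbour
    foreach my $tag (reverse @$context) {
        if ($tag eq '^') {
            return 0 unless $pos == -1;     # must be just before token 0
        }
        else {
            return 0 if $pos < 0;           # ran off the start of the line
            return 0 unless exists $line[$pos]->{tags}->{$tag};
        }
        $pos--;
    }
    return 1;                               # an empty context always matches
}

@line = (
    { word => 'The',   tags => { DT => 1 } },
    { word => 'green', tags => { JJ => 1 } },
    { word => 'dog',   tags => { NN => 1, VB => 1 } },
);
print Check_Left(0, ['^']),  "\n";   # "The" is first on the line
print Check_Left(2, ['JJ']), "\n";   # "green", left of "dog", carries JJ
print Check_Left(2, ['^']),  "\n";   # "dog" is not first on the line
```

The explicit test for a negative position matters in Perl: $line[-1] is the LAST element of the array, so without the guard the beginning of the line would silently wrap around to the end.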

&Delete_Tags( x , y )

Again, x and y could be a heuristic number and a token number. But they also could be references to a list of tags on the heuristic and a hash of tags on the token. You can use

   delete $line[2]->{tags}->{VB};

to delete the VB tag from the word "dog".
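
A sketch of the "references" variant follows. The guard that refuses to remove the last remaining tag is my own addition, on the assumption that a token should never end up without any tag; leave it out if your heuristics guarantee that already.

```perl
use strict;
use warnings;

# Sketch of Delete_Tags under the "references" reading:
#   $token_tags - reference to the hash of tags on the token
#   $doomed     - reference to the list of tags the heuristic deletes
sub Delete_Tags {
    my ($token_tags, $doomed) = @_;
    foreach my $tag (@$doomed) {
        last if keys(%$token_tags) <= 1;    # never leave a token untagged
        delete $token_tags->{$tag};
    }
    return;
}

my %dog = ( NN => 1, VB => 1 );
Delete_Tags(\%dog, ['VB']);
print join('|', sort keys %dog), "\n";
```

Because the subroutine receives a reference, it changes the token's own hash; there is nothing to return.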

3. Overall Control

You should figure out the general control structure. That is, how do you go through the string (token by token, starting from the left) and how and when do you apply the heuristics? This comes down to writing one or more loops around the useful little functions (or around one function that calls the useful little functions). I ain't writing no more about this right now.
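
Well, a little more then. The skeleton below shows one possible shape: an outer loop over the tokens and an inner loop over the heuristics. To keep it short it uses a made-up stand-in called applies instead of the real checking subroutines; that stand-in only looks at the token and its immediate left neighbour, ignores the right context, and does not handle '^'. Your own control structure may well look different.

```perl
use strict;
use warnings;

my @heuristics = (
    { name => 'adjective_noun', tags => ['NN', 'VB'],
      left => ['JJ'], right => [], delete => ['VB'] },
);

my @line = (
    { word => 'The',   tags => { DT => 1 } },
    { word => 'green', tags => { JJ => 1 } },
    { word => 'dog',   tags => { NN => 1, VB => 1 } },
    { word => '.',     tags => { '.' => 1 } },
);

# Stand-in for the checking subroutines (hypothetical, deliberately limited):
# tests the tags on token $i and on the token directly to its left.
sub applies {
    my ($i, $h) = @_;
    foreach my $tag (@{ $h->{tags} }) {
        return 0 unless exists $line[$i]->{tags}->{$tag};
    }
    foreach my $tag (@{ $h->{left} }) {
        return 0 if $i == 0;
        return 0 unless exists $line[$i - 1]->{tags}->{$tag};
    }
    return 1;
}

# Outer loop over tokens (left to right), inner loop over heuristics.
foreach my $i (0 .. $#line) {
    foreach my $h (@heuristics) {
        next unless applies($i, $h);
        delete $line[$i]->{tags}->{$_} for @{ $h->{delete} };
    }
}

print join(' ', map { "$_->{word}/" . join('|', sort keys %{ $_->{tags} }) } @line), "\n";
```

Going from left to right means that by the time a heuristic looks at its left context, the tokens there have already been through the heuristics once, so they tend to be less ambiguous than the tokens to the right.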

More hints Tuesday in class.