Assignment 2: Chunking/Tagging

Assigned: February 8th, 2006
Due: February 15th, 2006

Specifications:


Goal:  Get used to tagging/chunking tools and their results on different types of corpora, as well as to common tagging/chunking problems.

1)  Obtain a combined Tagger/Chunker, or a separate Tagger and Chunker, able to tag/chunk English using the Penn Treebank Tagset (Brill's Tagset). The tools should already be trained; if they are not, you must train them.

Here are some pointers (there are many more resources out there):

                 Carafe taggers (Conditional Random Fields, MITRE): pre-built for Linux, with pre-trained POS and chunking models. Source code and additional documentation are available here.
                 CoNLL on Chunking, with references to the chunking literature and some scripts; some of these scripts are also here.
                 NLTK: Natural Language tools in Python. You may want to use some of them.
                 YamCha: "a generic, customizable, and open source text chunker oriented toward a lot of NLP tasks,
                                 such as POS tagging, Named Entity Recognition, base NP chunking, and Text Chunking."
                 CRF++: an open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data.
                 Brill's Tagger, which you can download from here.
                
  

In your submission, briefly describe what kind of machinery each tool uses (e.g. HMM, SVM, CRF), and what features and context window it uses for each task.

2)  Tag and chunk the following texts, which will serve as the Gold Standards:

a)  WSJ  subset.     b) Genia Corpus    

The texts are in the CoNLL format, which means that each line has three columns:

Word    POS-TAG    CHUNK-TAG   (in IOB format). 
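A minimal sketch of reading this three-column format in Python is shown here. It assumes whitespace-separated columns and blank lines between sentences (the usual convention in the CoNLL chunking data); the function name parse_conll is only illustrative, so check both assumptions against the actual Gold Standard files.

```python
from typing import List, Optional, Tuple

# One parsed line: (word, pos_tag, chunk_tag), or None for a blank line.
Row = Optional[Tuple[str, str, str]]

def parse_conll(lines: List[str]) -> List[Row]:
    """Parse three-column lines, keeping blank lines as None placeholders."""
    rows: List[Row] = []
    for number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            rows.append(None)  # assumed sentence boundary
        elif len(fields) == 3:
            rows.append((fields[0], fields[1], fields[2]))
        else:
            raise ValueError(f"line {number}: expected 3 columns, got {len(fields)}")
    return rows

sample = [
    "He PRP B-NP",
    "reckons VBZ B-VP",
    "the DT B-NP",
    "deficit NN I-NP",
    ". . O",
    "",
]
rows = parse_conll(sample)
# The words alone are what the tagger needs as input.
words = [row[0] for row in rows if row is not None]
print(words)
```

Keeping the blank lines as placeholders makes it easier to write an output file with exactly as many lines as the input.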


You have to take the Words as input for your tagger and, probably, the (Word POS-TAG) pairs as input for the chunker.
You may need to write a script to transform the Gold Standard files into the input format of your tagger/chunker and to generate the output.

Your output for each file must have the same format and number of lines as the input.
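One possible way to meet this requirement is to walk over the Gold Standard lines and substitute the predicted tags, leaving blank lines in place. The sketch below assumes the predictions arrive as two flat lists with one tag per token, and that a single space separates the columns; merge_predictions is a hypothetical helper name, not part of any of the tools listed above.

```python
from typing import List

def merge_predictions(gold_lines: List[str],
                      pos_tags: List[str],
                      chunk_tags: List[str]) -> List[str]:
    """Build output lines aligned one-to-one with the gold input lines."""
    n_tokens = sum(1 for line in gold_lines if line.split())
    if len(pos_tags) != n_tokens or len(chunk_tags) != n_tokens:
        raise ValueError("number of predicted tags does not match number of tokens")
    out: List[str] = []
    position = 0  # index into the flat prediction lists
    for line in gold_lines:
        fields = line.split()
        if not fields:
            out.append("")  # keep blank lines where they were
        else:
            out.append(f"{fields[0]} {pos_tags[position]} {chunk_tags[position]}")
            position += 1
    return out

gold_lines = ["He PRP B-NP", "runs VBZ B-VP", ". . O", ""]
output_lines = merge_predictions(gold_lines,
                                 ["PRP", "NNS", "."],
                                 ["B-NP", "I-NP", "O"])
print("\n".join(output_lines))
```

Because every input line produces exactly one output line, the line counts match by construction.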

In case you need to train your algorithms for chunking, you can use the following training corpus:

                   CoNLL training

You must report:

i)   POS accuracy for Gold Standards   a) and b). 
ii)  The 5 most frequent error types (e.g. confusions between VBD and VBN) and their incidence in the error rate (as a percentage).
iii) Chunking  Precision and Recall for  Verbal Chunks and Nominal Chunks:
    
     Where Precision is (replace Nominal with Verbal to get the Verbal measures; Produced refers to the output of the chunker):
  
       (Number of Correct Nominal Chunk Tags Produced) / (Number of Total Nominal Chunk Tags Produced)

     Where Recall is:
       (Number of Correct Nominal Chunk Tags Produced) / (Number of Total Nominal Chunk Tags in the Gold Standard)
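The three measures above can be sketched in Python as follows. This is only an illustration under stated assumptions: the gold and predicted tags are aligned lists of equal length, Nominal chunk tags are those ending in "-NP" and Verbal ones those ending in "-VP", and an error type is taken to be an ordered (gold, predicted) pair. Sorting each pair before counting would merge the two directions if "between VBD VBN" is meant to be unordered. The function names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

def pos_accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of tokens whose predicted POS tag equals the gold tag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def top_confusions(gold: List[str], pred: List[str], n: int = 5
                   ) -> List[Tuple[Tuple[str, str], int, float]]:
    """Most frequent (gold, predicted) errors with their share of all errors in %."""
    errors = Counter((g, p) for g, p in zip(gold, pred) if g != p)
    total = sum(errors.values())
    return [(pair, count, 100.0 * count / total)
            for pair, count in errors.most_common(n)]

def chunk_tag_pr(gold: List[str], pred: List[str], suffix: str
                 ) -> Tuple[float, float]:
    """Tag-level precision and recall for chunk tags ending in `suffix`."""
    produced = sum(p.endswith(suffix) for p in pred)
    in_gold = sum(g.endswith(suffix) for g in gold)
    correct = sum(p == g and p.endswith(suffix) for g, p in zip(gold, pred))
    precision = correct / produced if produced else 0.0
    recall = correct / in_gold if in_gold else 0.0
    return precision, recall

# Tiny made-up example, not taken from either Gold Standard.
gold_pos = ["DT", "NN", "VBD", "VBN", "IN"]
pred_pos = ["DT", "NN", "VBN", "VBN", "IN"]
gold_chunk = ["B-NP", "I-NP", "B-VP", "I-VP", "O"]
pred_chunk = ["B-NP", "I-NP", "B-VP", "B-NP", "O"]

print(pos_accuracy(gold_pos, pred_pos))
print(top_confusions(gold_pos, pred_pos))
print(chunk_tag_pr(gold_chunk, pred_chunk, "-NP"))
print(chunk_tag_pr(gold_chunk, pred_chunk, "-VP"))
```

Note that these are measures over chunk tags, as defined above, not over whole chunks; a chunk-level score would count a chunk as correct only if all of its tags match.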
 
   
3)   Select an English text of no fewer than 500 words (about 3 paragraphs), where words are counted after tokenizing and include English words, punctuation characters, and symbols.  The whole assignment is individual: each student must select a different text, although different students may use the same tagger/chunker.

     a)   Indicate the source and the type of text (literature, scientific, etc.).
     b)   Annotate the text with POS tags and with Nominal and Verbal chunking tags (all other elements outside a chunk get the O tag; it is a capital o, not a zero).
           You can do this by first running your tagger and chunker, but you have to CORRECT the tags MANUALLY.
           Chunks do not have to include prepositions.  They should include coordinated nouns inside a noun phrase, but not
           coordinated noun phrases (e.g. the book and the pencil are two chunks: [the book] and [the pencil]).
     c)   Report the tag accuracy of your tagger and the Nominal and Verbal Precision and Recall for your chunker.
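To check a manual annotation, it can help to turn the IOB tags back into bracketed chunks and read them. The sketch below assumes tags of the form B-NP, I-NP, B-VP, I-VP and O, and treats an I- tag that does not continue the current chunk as the start of a new one; iob_to_chunks is a hypothetical helper, and the tagged sentence is a made-up illustration of the book/pencil example.

```python
from typing import List, Tuple

def iob_to_chunks(words: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB-tagged words into (chunk_type, chunk_text) pairs."""
    chunks: List[Tuple[str, str]] = []
    current_type = None
    current_words: List[str] = []
    for word, tag in zip(words, tags):
        prefix, _, chunk_type = tag.partition("-")
        if prefix == "I" and chunk_type == current_type:
            current_words.append(word)  # continue the open chunk
            continue
        if current_words:  # close the open chunk, if any
            chunks.append((current_type, " ".join(current_words)))
        if prefix in ("B", "I"):
            current_type, current_words = chunk_type, [word]
        else:  # "O": outside any chunk
            current_type, current_words = None, []
    if current_words:
        chunks.append((current_type, " ".join(current_words)))
    return chunks

words = ["the", "book", "and", "the", "pencil", "are", "here"]
tags = ["B-NP", "I-NP", "O", "B-NP", "I-NP", "B-VP", "O"]
for chunk_type, text in iob_to_chunks(words, tags):
    print(f"[{text}] {chunk_type}")
```

Here the coordinating "and" gets the O tag, so the two coordinated noun phrases come out as separate chunks.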

4) Submit your annotated text, together with your report on each of the above mentioned items.