Assignment 2: Chunking/Tagging
Assigned: February 8th, 2006
Due: February 15th, 2006
Specifications:
Goal: Get familiar with Tagging/Chunking tools, with their results on
different types of corpora, and with common tagging/chunking problems.
1) Obtain a Tagger/Chunker, or a separate Tagger and Chunker, that is
already trained (if it is not trained, you must train it) and able to
tag/chunk English using the Penn Treebank Tagset (Brill's Tagset).
Here are some pointers (there are many more resources out there):
Carafe taggers (Conditional Random
Fields, MITRE) - pre-built for Linux with pre-trained POS and chunking
models. Source code and additional documentation are available here
CoNLL
on Chunking, with references to the chunking literature and some
scripts; some of these
scripts are also here
NLTK Natural
Language tools in Python. You may want to use some of them.
YamCha
"a generic, customizable, and open source text chunker oriented toward
a lot of NLP tasks,
such as POS tagging, Named Entity
Recognition, base NP chunking, and Text Chunking."
CRF++
An open source implementation of Conditional
Random Fields (CRFs) for segmenting/labeling sequential data.
You can download Brill's Tagger from here.
In your submission, briefly describe what kind of machinery (HMM, SVM,
CRF) your tools use, and what features/window they use for each task.
2) Tag and Chunk the following texts, which are
going to be the Gold Standards:
a) WSJ subset.
b) Genia Corpus
The texts are in the CoNLL format, which means that each line has
three columns:
Word POS-TAG CHUNK-TAG (in
IOB format).
You have to take the Words as input for your tagger
and probably the pairs of (Word POS-TAG) for the chunker.
You may need to write a script to transform the Gold Standard files
into the input format of your tagger/chunker and to generate the output.
Your output for each file must have the same format and number of lines
as the input.
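The conversion described above can be sketched as follows. This is only a minimal sketch: the function names are made up for illustration, the sample lines are illustrative, and it assumes that columns are separated by whitespace and that sentences are separated by blank lines, as in the CoNLL-2000 shared task files. Check this against the actual Gold Standard files.

```python
import io

def read_conll(stream):
    """Return one entry per input line: a tuple of columns, or None for a blank line."""
    rows = []
    for line in stream:
        fields = line.split()
        rows.append(tuple(fields) if fields else None)
    return rows

def words_only(rows):
    """Keep only the word column; blank lines stay blank so line counts match."""
    return ["" if row is None else row[0] for row in rows]

def write_conll(rows, pos_tags, chunk_tags, stream):
    """Write the tags next to the original words, one output line per input line."""
    for row, pos, chunk in zip(rows, pos_tags, chunk_tags):
        stream.write("\n" if row is None else f"{row[0]} {pos} {chunk}\n")

# Illustrative three-column sample with a blank line between sentences.
sample = "He PRP B-NP\nreckons VBZ B-VP\n\nIt PRP B-NP\n"
rows = read_conll(io.StringIO(sample))
print(words_only(rows))
```

Here `pos_tags` and `chunk_tags` are expected to have one entry per input line (any placeholder for blank lines), which keeps the output aligned with the Gold Standard line by line.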
In case you need to train your algorithms for
Chunking, you can use the following training corpus:
CoNLL
training
You must report:
i) POS accuracy for Gold Standards a) and
b).
ii) You must also report the 5 most frequent error
dimensions (e.g. between VBD and VBN) and their incidence in the
error rate (percentage).
iii) Chunking Precision and Recall for Verbal Chunks and
Nominal Chunks.
Precision is (replace Nominal by Verbal to get the Verbal measures;
Produced refers to the output of the chunker):
(Number of Correct Nominal Chunk Tags Produced) /
(Total Number of Nominal Chunk Tags Produced)
Recall is:
(Number of Correct Nominal Chunk Tags Produced) /
(Total Number of Nominal Chunk Tags in the Gold Standard)
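The three measures in i) to iii) can be computed roughly as below. This is a sketch under stated assumptions, not a reference scorer: the function names are hypothetical, the gold and predicted tag lists are assumed to be aligned token by token, Nominal and Verbal chunk tags are assumed to be written B-NP/I-NP and B-VP/I-VP, and each confusion is counted as an ordered (gold tag, produced tag) pair.

```python
from collections import Counter

def pos_accuracy(gold, pred):
    """Fraction of tokens whose produced POS tag equals the gold POS tag."""
    correct = sum(g == p for g, p in zip(gold, pred))
    return correct / len(gold)

def top_confusions(gold, pred, n=5):
    """Return (gold_tag, produced_tag, percent_of_all_errors) for the n most frequent confusions."""
    errors = Counter((g, p) for g, p in zip(gold, pred) if g != p)
    total = sum(errors.values())
    return [(g, p, 100.0 * c / total) for (g, p), c in errors.most_common(n)]

def chunk_tag_precision_recall(gold, pred, kind="NP"):
    """Tag-level precision and recall for one chunk type, following the definitions above."""
    wanted = {f"B-{kind}", f"I-{kind}"}
    produced = sum(p in wanted for p in pred)   # tags of this type in the chunker output
    in_gold = sum(g in wanted for g in gold)    # tags of this type in the Gold Standard
    correct = sum(p in wanted and p == g for g, p in zip(gold, pred))
    precision = correct / produced if produced else 0.0
    recall = correct / in_gold if in_gold else 0.0
    return precision, recall
```

Calling `chunk_tag_precision_recall(gold, pred, kind="VP")` gives the Verbal measures. If "between VBD and VBN" is meant to cover both directions of the confusion, sorting each pair before counting would merge them.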
3) Select an English text of
no less than 500 words (about 3 paragraphs), where words are the
tokens obtained after tokenizing: English words, punctuation
characters and symbols. The whole assignment is
individual, and each student must select a different Text, although
different students may use the same tagger/chunker.
a) Indicate the source, type of
text (literature, scientific, etc.)
b) Annotate the text
with POS tags, and with Nominal and Verbal CHUNKING Tags (all
the other elements outside a chunk get the
O tag; it is a capital O, not a zero).
You can do
this by first running your tagger and chunker, but you have
to CORRECT the tags MANUALLY.
Chunking Tags do not have to include
prepositions. They should include coordinated nouns inside a
noun phrase, but not
coordinated noun phrases (e.g. the book and the
pencil are two chunks: [the book] and [the pencil]).
c) Report Tag accuracy
of your tagger and Nominal and Verbal Precision and Recall for your
chunker.
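To check a manual annotation against the coordination guideline in b), it can help to turn the IOB tags back into bracketed chunks. The helper below is a hypothetical sketch: it assumes every chunk starts with a B- tag and continues with I- tags of the same type, and it simply opens a new chunk if an I- tag appears without a matching chunk before it.

```python
def iob_chunks(words, tags):
    """Group words into (chunk_type, [words]) spans from IOB chunk tags."""
    chunks = []
    current = None
    for word, tag in zip(words, tags):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and (current is None or current[0] != tag[2:])
        )
        if starts_new:
            current = (tag[2:], [word])
            chunks.append(current)
        elif tag.startswith("I-"):
            current[1].append(word)
        else:  # "O": the token is outside any chunk
            current = None
    return chunks

# The example from b): two separate Nominal chunks, with "and" outside both.
words = ["the", "book", "and", "the", "pencil"]
tags = ["B-NP", "I-NP", "O", "B-NP", "I-NP"]
print(iob_chunks(words, tags))
# prints [('NP', ['the', 'book']), ('NP', ['the', 'pencil'])]
```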
4) Submit your annotated text, together with your report on each of the
above-mentioned items.