Assignment 4: Entity Recognition

Assigned: March 1st, 2006
Due: March 15th, 2006

Goal

Add entity and event tags to a chunked document.

Specifications

1)  Take the tagger and chunker you used for assignment 2.

2)  Tag and chunk the whole document from the tokenizer assignment.

3)  Write a script to recognize entities and events.

The entities should be typed, and you will need to recognize at least the following types: persons, companies/organizations, locations, URLs, and numbers. Try to make your script as general as possible; we do not want to see regular expressions like /Mr. Smith/ in your code.

The output should look like the input file, but with tags added. For example:

<ENTITY type=PERSON>Dr Smith</ENTITY>, an M.D. and P.M.D. from <ENTITY type=ORGANIZATION>Health Works Inc.</ENTITY>, <EVENT>located</EVENT> in the S.E. part of the city,
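As a rough illustration of what "general" patterns could look like, here is a minimal sketch that produces the ENTITY tags from the example line above. It is only one possible starting point, not the expected solution. It assumes plain text as input rather than your tagger/chunker output, it covers only four of the required types, and it does not attempt locations or EVENT tags (those most likely need the part-of-speech and chunk information). The names ENTITY_RE and tag_entities, and the lists of titles and company suffixes, are made up for this sketch.

```python
import re

# Hypothetical patterns: each one describes a class of strings (any title plus
# capitalized names, any capitalized words plus a company suffix, ...) instead
# of a literal such as /Mr. Smith/. All alternatives live in one regex so that
# a number inside a URL is consumed by the URL and never tagged twice.
ENTITY_RE = re.compile(
    r"(?P<URL>(?:https?://|www\.)[^\s<>]*[^\s<>.,;:!?)])"
    r"|(?P<PERSON>\b(?:Mr|Mrs|Ms|Dr|Prof)\.? [A-Z][a-z]+(?: [A-Z][a-z]+)*)"
    r"|(?P<ORGANIZATION>\b(?:[A-Z][A-Za-z]+ )+(?:Inc|Corp|Ltd|LLC)\b\.?)"
    r"|(?P<NUMBER>\b\d+(?:,\d{3})*(?:\.\d+)?\b)"
)


def tag_entities(text):
    """Wrap every match in an ENTITY tag typed after the alternative that matched."""
    def wrap(match):
        # match.lastgroup is the name of the named group that matched,
        # i.e. URL, PERSON, ORGANIZATION or NUMBER.
        return "<ENTITY type=%s>%s</ENTITY>" % (match.lastgroup, match.group(0))
    return ENTITY_RE.sub(wrap, text)


print(tag_entities("Dr Smith, an M.D. and P.M.D. from Health Works Inc., "
                   "located in the S.E. part of the city,"))
```

On the sample sentence this sketch tags "Dr Smith" as PERSON and "Health Works Inc." as ORGANIZATION and leaves the abbreviations M.D., P.M.D. and S.E. alone. It will miss persons without a title and organizations without a suffix, which is exactly the kind of recall problem you are asked to discuss in your write-up.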

4)  Select a few documents from any source of your liking and tag and chunk those as well. Together, these extra documents should contain at least 10,000 words.

5)  Run your script on these documents and analyze the results. You do not need to update your script, but you should give pointers as to how you would improve precision and recall.
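If you want numbers to back up your analysis, one option is to hand-annotate a small sample of your extra documents and compare it with what your script found. The sketch below assumes that each entity is represented as a (text, type) pair and that an entity counts as correct only on an exact match of both; the function name precision_recall and this representation are assumptions made for the example, not a required format.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of (text, type) pairs.

    precision = correct predictions / all predictions
    recall    = correct predictions / all gold entities
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    # Guard against empty sets so the function never divides by zero.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall
```

Using sets means that the same (text, type) pair is counted once even if it occurs several times in the document; if you care about individual occurrences, add character offsets to each tuple.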

What you need to submit

Submit a tar file with three files:
  1. Your script, named "recognize.py". The TA should be able to run this script on a department machine. If your script uses other resources, which is likely, then put those in a directory named "resources".
  2. The parsed output of the document from the tokenizer assignment, named "output.txt".
  3. A two-page write-up in which you explain which characteristics of your original script hurt precision and recall and, for each of those problems, suggest how to fix it.
Mail the tar file to the class account on or before the due date.