Assigned: March 1st, 2006
Due: March 15th, 2006
1) Take the tagger and chunker you used for assignment 2.
2) Tag and chunk the whole document from the tokenizer assignment.
3) Write a script to recognize entities and events.
The entities should be typed, and you'll need to recognize at least
the following types: persons, companies/organizations, locations,
URLs, and numbers. Try to make your script as general as possible; we
do not want to see regular expressions like /Mr. Smith/ in your code.
The output should be like the input file but with tags added, for example:
<ENTITY type=PERSON>Dr Smith</ENTITY>, an M.D. and P.M.D. from
<ENTITY type=ORGANIZATION>Health Works Inc.</ENTITY>,
<EVENT>located</EVENT> in the S.E. part of the city,
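As a rough illustration of what "general" could mean in step 3, the sketch below keys on triggers (titles, company suffixes, URL prefixes, digit patterns) rather than on specific names. It is not a reference solution: the trigger lists, the function names tag_entities and tag_events, and the use of Penn Treebank style VB* tags for events are all assumptions made for this example. Locations are left out because they usually need a gazetteer or chunk context.

```python
import re

# Hypothetical trigger lists; a real script would use larger lexicons
# and the POS/chunk output from steps 1 and 2.
TITLES = r"(?:Mr|Mrs|Ms|Dr|Prof)\.?"
ORG_SUFFIXES = r"(?:Inc|Corp|Ltd|Co)\."
CAP_WORD = r"[A-Z][a-z]+"

# Order matters: earlier patterns win when two matches overlap.
PATTERNS = [
    ("URL", re.compile(r"\b(?:https?://|www\.)[^\s<>]+[^\s<>.,;:!?)]")),
    ("ORGANIZATION",
     re.compile(rf"\b{CAP_WORD}(?:\s+{CAP_WORD})*\s+{ORG_SUFFIXES}")),
    ("PERSON", re.compile(rf"\b{TITLES}\s+{CAP_WORD}(?:\s+{CAP_WORD})*")),
    ("NUMBER", re.compile(r"\b\d+(?:,\d{3})*(?:\.\d+)?\b")),
]


def tag_entities(text):
    """Return text with non-overlapping matches wrapped in ENTITY tags."""
    spans = []
    for etype, pattern in PATTERNS:
        for m in pattern.finditer(text):
            # Keep a match only if it does not overlap an accepted span.
            if all(m.end() <= s or m.start() >= e for s, e, _ in spans):
                spans.append((m.start(), m.end(), etype))
    spans.sort()
    out, pos = [], 0
    for s, e, etype in spans:
        out.append(text[pos:s])
        out.append(f"<ENTITY type={etype}>{text[s:e]}</ENTITY>")
        pos = e
    out.append(text[pos:])
    return "".join(out)


def tag_events(tagged_tokens):
    """Wrap tokens whose POS tag starts with 'VB' in EVENT tags.

    Assumes (word, tag) pairs with Penn Treebank style tags; this is a
    crude stand-in that also marks auxiliaries as events.
    """
    return " ".join(
        f"<EVENT>{word}</EVENT>" if tag.startswith("VB") else word
        for word, tag in tagged_tokens
    )


if __name__ == "__main__":
    print(tag_entities("Dr Smith works for Health Works Inc. in the city."))
    print(tag_events([("Works", "NNP"), ("located", "VBN"), ("in", "IN")]))
```

Collecting spans first and writing the tags in a second pass keeps the character offsets valid, and the pattern order gives a simple way to resolve overlaps (for example, digits inside a URL are not tagged again as a number).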
4) Select a few documents from any source you like, and tag and chunk those as well. Together, these extra documents should contain at least 10,000 words.
5) Run your script on these documents and analyze the results. You do not need to update your script, but you should give pointers as to how you would improve precision and recall.
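For the analysis in step 5, one possible way to put numbers on precision and recall is to hand-annotate a sample of the output and compare it with what the script produced. The sketch below assumes each entity is represented as a (start, end, type) tuple and counts only exact matches as correct; that representation, the function name precision_recall, and the sample values are illustrative choices, not part of the assignment.

```python
def precision_recall(predicted, gold):
    """Compute exact-match precision and recall over entity spans.

    Both arguments are iterables of (start, end, type) tuples; a
    prediction counts as correct only if all three fields match.
    """
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    # Made-up spans for illustration only.
    gold = [(0, 8, "PERSON"), (27, 44, "ORGANIZATION"), (60, 65, "NUMBER")]
    predicted = [(0, 8, "PERSON"), (27, 44, "ORGANIZATION"),
                 (50, 54, "LOCATION"), (70, 74, "NUMBER")]
    p, r = precision_recall(predicted, gold)
    print(f"precision={p:.2f} recall={r:.2f}")
```

Listing the false positives (predicted but not in the hand annotation) and the false negatives (annotated but missed) by entity type is often a useful starting point for the improvement pointers the step asks for.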