CS140b Annotation Efforts

Assignment (to be done in pairs)

NOTE: We will discuss the sets in more detail in class on Friday (1/23). If you know what you want to do before then, email me directly. I'll set up the forum after class.

By the end of the day on Friday, select one of the sets below and announce your selection on the forum I will send out. Look at the forum before making your choice so we don't get duplicates. If it's a race condition, I'll let the second
Using the information provided (which is a mix of guidelines and papers), determine the annotation goal and the annotation task.
Describe the properties of the corpus. How was it collected? Is it a good example of sampling and balance?
Using the text provided (soon), follow the guidelines and annotate the text by hand, individually.
In your groups, compare your annotations with the "gold standard" and discuss differences, what was hard, what was underspecified, what was clear.
Find at least two research groups that used the annotated corpus (you can each do one) and determine their annotation goal, what algorithms they used, how they evaluated their work (e.g. F-measure, WER, etc.), and what the result was.
Present to the class a description of the annotation project, your assessment of the guidelines, and a summary of the research that used the data. You should have roughly 1 slide per bullet point (though one for each research group you look at).
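When comparing your hand annotations with the gold standard, or when reading how a research group evaluated its system, it can help to see how the two metrics named above are computed. The following is a minimal sketch, not part of the assignment materials. It assumes annotations can be represented as hashable items such as (start, end, label) tuples and that WER is computed over whitespace-separated words; the function names and sample data are made up for illustration.

```python
def precision_recall_f1(gold, predicted):
    """Score a set of annotations against a gold-standard set.

    Each annotation is any hashable item, e.g. a (start, end, label) tuple.
    An annotation counts as correct only if it matches a gold item exactly.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref) if ref else 0.0


if __name__ == "__main__":
    # Hypothetical example data: token spans with a label.
    gold = {(0, 1, "PER"), (2, 3, "ORG"), (5, 6, "LOC")}
    mine = {(0, 1, "PER"), (2, 3, "LOC")}
    print(precision_recall_f1(gold, mine))
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Exact matching is the strictest choice; some evaluations give partial credit for overlapping spans, so check which convention each research group reports.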

Annotation Efforts to choose from

Choose one from the following annotation efforts or come up with your own. However, you must have guidelines, a data set, and 2 users of the data if you are going to select your own.

Part of Speech Tagging

Part-of-Speech Tagging Guidelines for the Penn Treebank Project (3rd Revision)
Data available on Google Docs for both WSJ and SWBD. It is useful and interesting to try both.

Penn Treebank

The Penn Treebank Project
Data available on Google Docs for both WSJ and SWBD. It is useful and interesting to try both.

Discourse Treebank

Time ML

Unified Linguistic Annotation Text Collection: Committed Belief and REFLEX Entity Translation

Overview

Language Understanding Corpus, Committed Belief

Data & Docs on Brandeis CS Servers: /home/j/corpuswork/ldc-data/unified_lang_ann/Language_Understanding_Corpus/data/committed_belief

REFLEX Entity Translation Dev Test

Data & Docs on Brandeis CS Servers: /home/j/corpuswork/ldc-data/unified_lang_ann/Reflex_entity_relations

CoNLL 2010: Detecting Uncertain Information and Resolution on in-sentence scopes of hedge cues

CoNLL-2010 Shared Task
Guidelines for both tasks are described in the first paper in the set: The CoNLL 2010 Shared Task: ...
Choose one of the following: (Data available for both on the class GoogleDrive)

Task 1: Detecting Uncertain Information
Task 2: Resolution on in-sentence scopes of hedge cues

OntoNotes: coreference, named entity, parses, propositions, sense

Project documentation and annotation guidelines
Sample WSJ English data for coreference, named entity, parses, propositions, sense and raw (unmarked) text on the class GoogleDrive
Full Data & Docs on Brandeis CS Servers: /home/j/corpuswork/ldc-data/ontonotes_r1

SemEval

Propbank: Semantic Role Labeling (CoNLL-2005 Shared Task)

Propbank

Some data (or something) on Brandeis CS Servers: /home/j/corpuswork/corpora/PropBank/

More links to guidelines and tasks. Some of the data may overlap with data pointed to or linked to above.

Senseval 3

*Sem 2012: Resolving the Scope and Focus of Negation

*Sem

*Sem 2013: Semantic Textual Similarity

Dysfluencies

Possibly interesting options, but I couldn't find data. But if you can, go for it!

Dependency Treebank

i2b2

ACE: Automatic Content Extraction

BOLT: Broad Operational Language Translation