Programming Assignment 3: Semantic Role Labeling

CS 114b, Spring 2008
Due: Monday, 14 April

Introduction

In this assignment, you will experiment with semantic role labeling (SRL). In particular, your task is the recognition sub-problem: given a predicate and its arguments, decide the type of each argument. (The general task also includes locating the arguments themselves, but we're not asking you to do that.) To accomplish this, you are asked to use (at least) two classifiers: Naïve Bayes (nltk.classify.naivebayes) and decision trees (nltk.classify.decisiontree).
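As a rough sketch of what the two classifiers expect: NLTK classifiers are trained on a list of (featureset, label) pairs, where each featureset is a dictionary mapping feature names to values. The feature names below (predicate, phrase_type, position), the helper names make_featureset and train_both, and the toy data are illustrative assumptions, not part of the assignment or the provided driver code.

```python
def make_featureset(predicate, phrase_type, arg_before_predicate):
    """Build one feature dictionary for a single argument (hypothetical features)."""
    return {
        "predicate": predicate,
        "phrase_type": phrase_type,
        "position": "before" if arg_before_predicate else "after",
    }


def train_both(labeled_featuresets):
    """Train a Naive Bayes and a decision tree classifier on the same data.

    NLTK is imported inside the function so that make_featureset can be
    used and tested without NLTK installed.
    """
    from nltk.classify import DecisionTreeClassifier, NaiveBayesClassifier

    naive_bayes = NaiveBayesClassifier.train(labeled_featuresets)
    decision_tree = DecisionTreeClassifier.train(labeled_featuresets)
    return naive_bayes, decision_tree


# Toy training data: (featureset, argument label) pairs, made up for illustration.
toy_training_set = [
    (make_featureset("give", "NP", True), "ARG0"),
    (make_featureset("give", "NP", False), "ARG1"),
    (make_featureset("give", "PP", False), "ARG2"),
]
```

Calling train_both(toy_training_set) should return both classifiers; each has a classify(featureset) method that returns a predicted label for one feature dictionary.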

Resources for Semantic Role Labeling

You can check out how various groups have solved this problem. The CoNLL-2004 Shared Task on Semantic Role Labeling is a good place to start. There are lots of good papers on this page, so look around.

Problem Statement

Separate the PropBank corpus into a training set and a smaller test set. (I don't care what the exact proportions are; try testing on 5% and adjust as necessary.) Determine a set of features that can be extracted from a PropBank instance (nltk.corpus.reader.propbank.PropbankInstance) and the associated TreeBank parse tree. Using those features, train a classifier on the training set, then use that classifier to label the argument types for the predicates in the test set. Report precision and recall values for each of the Naïve Bayes and decision tree classifiers.
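The split and the reporting step described above could be sketched as follows. This is a minimal illustration in plain Python, not the provided driver code: the function names, the fixed random seed, and the choice to report per-label scores are all assumptions.

```python
import random
from collections import Counter


def split_instances(instances, test_fraction=0.05, seed=0):
    """Shuffle a copy of the instances and split it into (training, test) lists."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def precision_recall(gold_labels, predicted_labels):
    """Return {label: (precision, recall)} for every label that occurs."""
    gold_counts = Counter(gold_labels)
    predicted_counts = Counter(predicted_labels)
    true_positives = Counter()
    for gold, predicted in zip(gold_labels, predicted_labels):
        if gold == predicted:
            true_positives[gold] += 1

    scores = {}
    for label in set(gold_counts) | set(predicted_counts):
        # Use 0.0 when a label was never predicted (or never appears in gold)
        # rather than dividing by zero.
        n_predicted = predicted_counts[label]
        n_gold = gold_counts[label]
        precision = true_positives[label] / n_predicted if n_predicted else 0.0
        recall = true_positives[label] / n_gold if n_gold else 0.0
        scores[label] = (precision, recall)
    return scores
```

Because every argument in the test set receives exactly one predicted label here, precision and recall summed over all labels both reduce to plain accuracy; per-label scores are one way to make the two numbers informative.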

Data

The NLTK does not ship with the entire TreeBank corpus. For developing your classifier features, it might be convenient to use the smaller corpus that it includes; however, for reporting final values, you should do at least one run over the entire corpus.

The combined data files for the entire TreeBank are available via HTTP as a gzipped TAR file called treebank-combined.tar.gz in the top-level CS114 directory (I can't link directly to it for licensing reasons). This file should be unpacked in your data/corpora/treebank/ directory. The example driver loop discussed in the next section includes code to produce corpus readers for the full TreeBank and PropBank.
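If you would rather script the unpacking step than do it by hand, something like the following should work. It is only a sketch using the standard tarfile module; the function name unpack_archive is made up, and the paths in the commented example assume that the archive sits in the current directory and that data/corpora/treebank/ is relative to wherever your NLTK data lives.

```python
import tarfile
from pathlib import Path


def unpack_archive(archive_path, dest_dir):
    """Extract a gzipped TAR archive into dest_dir, creating it if needed.

    Returns the sorted relative paths of all files found under dest_dir.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        archive.extractall(dest)
    return sorted(
        path.relative_to(dest).as_posix()
        for path in dest.rglob("*")
        if path.is_file()
    )


# Example call (paths are assumptions; adjust to your own setup):
# unpack_archive("treebank-combined.tar.gz", "data/corpora/treebank")
```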

Code

To help you focus on the interesting bits of the problem (i.e., extracting features and dealing with the classifiers), we're providing code for a simple driver loop. You should feel free to ignore or rewrite this code as you see fit for your own project, but you may find it helpful. If you find any bugs or problems with it, please contact me.

Documentation

To understand what kind of features you might want to use, you should begin by reading some of the documentation for both TreeBank and PropBank. The TreeBank homepage has many useful links, as does Martha Palmer's PropBank page. SRL was also part of the CoNLL-2004 and -2005 shared tasks.