Official Rules

Organizing committee

  • Nianwen Xue (chair), Brandeis University
  • Hwee Tou Ng, National University of Singapore
  • Sameer Pradhan, Harvard University
  • Bonnie Webber, Edinburgh University
  • Attapol Rutherford, Brandeis University
  • Chuan Wang, Brandeis University

Training and Development Data

The training and development data for English are from the Penn Discourse TreeBank (PDTB) 2.0, a 1-million-word Wall Street Journal corpus. The training data are from Sections 2-21 and the development set is from Section 22.

New: The Chinese data are from the Chinese Discourse Treebank (CDTB), plus some additional annotation in version 1.0, which has not been released through LDC. The Chinese data set is much smaller than the English one. The training data are from Sections 0001-0270 and 0400-0803, and the development set is from Sections 0301-0325.

More information on the PDTB, shallow discourse parsing, how to obtain the data, and the data format is available on this website.

Evaluation Data

There are two English evaluation datasets:

  • The blind test set. This will consist of 20,000 to 30,000 words of newswire text annotated following the PDTB annotation guidelines. This test set will not be given to you. Instead, you will be asked to deploy your system on a remote machine, and we will run it on this blind test set for the final ranking of systems.
  • The PDTB test set. Section 23 of the PDTB serves as an additional test set. Results on this test set will NOT be used to rank the systems, but they can serve as a reference for comparison with previously reported results. Participants are asked NOT to use this additional test set to train or tune their systems; they can, however, use the development set to tune (but not train) their systems.
New: There are two Chinese evaluation datasets:
  • A small Chinese test set held out from Sections 0271-0300 of the Chinese Discourse Treebank.
  • A blind (unseen) test set.

System Submission

Participants must deploy their systems to a remote virtual machine for automatic evaluation instead of submitting system output for the test set. You are free to install any toolbox or software on the virtual machine to make your system work. Participants will develop their systems as usual and produce their output in JSON format, as sketched below. More details will be provided to all registered participants.
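For illustration only, here is a minimal sketch of how a system might write its predicted relations as one JSON object per line. The field names (DocID, Arg1, Arg2, Connective, Sense, Type) and the use of document-level token indices are assumptions based on the PDTB-style JSON used in the data release; defer to the official format description for the authoritative specification.

    # Hedged sketch: serializing predicted discourse relations, one JSON object per line.
    # Field names and token-index conventions are assumptions; check the official
    # data format documentation before submitting.
    import json

    predicted_relations = [
        {
            "DocID": "wsj_2300",                      # hypothetical document identifier
            "Arg1": {"TokenList": [10, 11, 12, 13]},  # document-level token indices of Arg1
            "Arg2": {"TokenList": [15, 16, 17, 18]},  # document-level token indices of Arg2
            "Connective": {"TokenList": [14]},        # empty list for implicit relations
            "Sense": ["Comparison.Contrast"],         # predicted sense label(s)
            "Type": "Explicit",                       # Explicit, Implicit, EntRel, or AltLex
        },
    ]

    with open("output.json", "w") as out:
        for relation in predicted_relations:
            out.write(json.dumps(relation) + "\n")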

Participants can submit a system for any combination of open vs. closed track, English vs. Chinese, and main vs. supplementary task. Participants can also tune their systems for a specific evaluation combination.

Open and Closed Tracks

There are two evaluation tracks. In the closed track, a participating system may be trained only on the provided training set and the linguistic resources listed below for the corresponding language. To reduce the data preparation burden on participants, the data package includes the training, development, and test data with the following layers of automatic linguistic annotation:
  1. Phrase structure parses (predicted by the Berkeley Parser)
  2. Dependency parses (converted from the Berkeley Parser output, using the basic Stanford dependency converter)
Closed-track participants are also allowed to use the following publicly available linguistic resources, which have been found useful in the shallow discourse parsing literature:
  1. Brown Clusters
  2. VerbNet
  3. Sentiment lexicon
  4. Word embeddings (produced using Word2Vec)
The versions that are allowed can be downloaded here. To ensure meaningful comparisons among participating systems, we ask closed-track participants to use only the versions of these linguistic resources that we provide. Participants can, however, re-process the training set themselves. If they use a third-party tool, we require that it be publicly available so that results can be replicated by other researchers who are interested in working on this problem after the shared task is over.
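As a hedged example of using one of these resources, the snippet below loads pre-trained Word2Vec embeddings with gensim and looks up a token vector. The file name and text/binary flag are hypothetical and should be adjusted to whatever is actually in the resource package we distribute.

    # Hedged sketch: loading the provided Word2Vec embeddings as features.
    # The file name and format (text vs. binary) are assumptions about the resource package.
    from gensim.models import KeyedVectors

    embeddings = KeyedVectors.load_word2vec_format("word2vec.txt", binary=False)

    def token_vector(token):
        """Return the embedding for a token, or None if it is out of vocabulary."""
        return embeddings[token] if token in embeddings else None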

In the open track, a participating system may use any publicly available NLP tools to process the data AND any publicly available (i.e., non-proprietary) data for training. A participating team can choose to participate in the closed track or the open track or both.

Main Evaluation Metrics

The evaluation metric for the main task will be based on the F measure, the harmonic mean of precision and recall; a simplified sketch of the computation appears after the criteria below. The scorer can be downloaded from the shared task GitHub repo. The evaluation will be done on a per-discourse-relation basis. A relation is correctly predicted if and only if:

  1. the discourse connective is correctly detected (for explicit discourse relations),
  2. the sense of a discourse relation is correctly predicted, and
  3. the text spans of the two arguments as well as their labels (Arg1 and Arg2) are correctly predicted.
The winning system is judged based on this main evaluation metric only.
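As a hedged illustration of this metric, the sketch below reduces each relation to a comparable tuple and computes precision, recall, and their harmonic mean. The official scorer in the task repository is more elaborate (for example, in how connective and argument spans are matched), so treat this only as a conceptual reference.

    # Hedged sketch of the main metric: a relation counts as correct only if the
    # connective span, both argument spans (with their Arg1/Arg2 labels), and the
    # sense all match; F1 is the harmonic mean of precision and recall.

    def as_tuple(rel):
        """Reduce a relation dict to a hashable key (field names as assumed above)."""
        return (
            rel["DocID"],
            tuple(rel["Connective"]["TokenList"]),
            tuple(rel["Arg1"]["TokenList"]),
            tuple(rel["Arg2"]["TokenList"]),
            tuple(rel["Sense"]),
        )

    def f_measure(gold_relations, predicted_relations):
        gold = {as_tuple(r) for r in gold_relations}
        pred = {as_tuple(r) for r in predicted_relations}
        correct = len(gold & pred)
        precision = correct / len(pred) if pred else 0.0
        recall = correct / len(gold) if gold else 0.0
        if precision + recall == 0.0:
            return 0.0
        return 2 * precision * recall / (precision + recall)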

New Supplementary Evaluation Metrics

Discourse relation sense classification has been of particular interest to many researchers in discourse parsing. This year, we accommodate teams that want to focus on this subtask. The supplementary evaluation will be based on the accuracy of sense classification. The dataset includes a list of discourse relations without the 'Sense' and 'Type' fields. Gold-standard arguments and explicit discourse connectives are provided.
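As a hedged sketch of this supplementary metric, accuracy can be computed as the fraction of relations whose predicted sense matches a gold sense. The assumption that matching any of a relation's gold senses counts as correct should be verified against the official scorer.

    # Hedged sketch of the supplementary metric: sense classification accuracy.
    # Assumes one predicted sense per relation and counts a prediction as correct
    # if it matches any gold sense; the official scorer may differ in details.

    def sense_accuracy(gold_relations, predicted_senses):
        correct = sum(
            1 for gold, pred in zip(gold_relations, predicted_senses)
            if pred in gold["Sense"]
        )
        return correct / len(gold_relations) if gold_relations else 0.0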

Auxiliary Evaluation Metrics

To help with error analysis and diagnostics, the evaluation script that we provide will also evaluate individual components (e.g. discourse connective detection, argument labeling, sense classification) for you, but these component scores are not taken into account in the ranking of systems. They are there for you to use as analytic and development tools. If you create additional tools that you would like to share with other participants, please make a pull request to the task repository.