Official Rules
Organizing committee
- Nianwen Xue (chair), Brandeis University
- Hwee Tou Ng, National University of Singapore
- Sameer Pradhan, Harvard University
- Bonnie Webber, Edinburgh University
- Attapol Rutherford, Brandeis University
- Chuan Wang, Brandeis University
Training and Development Data
The training and development data for English are from the Penn
Discourse TreeBank (PDTB) 2.0, a 1-million-word Wall Street Journal corpus. The training data are from Sections 2-21 and the development set is from Section 22.
New The Chinese data are from the Chinese Discourse Treebank (CDTB), plus some additional annotation in version 1.0, which has not been released through LDC. The Chinese dataset is much smaller than the English one. The training data are from Sections 0001-0270 and 0400-0803, and the development set is from Sections 0301-0325.
More information on the PDTB, shallow discourse parsing, how to obtain the data, and the data format is available on this website.
Evaluation Data
There are two English evaluation datasets:
- The blind test set. This will consist of 20,000 to 30,000 words of newswire text annotated
following the PDTB annotation guidelines. This test set will not be given to you.
Instead, you will be asked to deploy the system on the remote machine, and we will run the system on this
blind test set for the final ranking of systems.
- The PDTB test set. Section 23 of the PDTB serves as an additional test set. Results on this PDTB test set will NOT be
used to rank the systems, but they can be used as a reference for comparison with
previously reported results. Participants are asked NOT to use this additional test set to train or tune their
systems; they may, however, use the development set to tune (but not train) their systems.
New There are two Chinese evaluation datasets:
- A small Chinese test set will be held out from Sections 0271-0300 of the Chinese Discourse Treebank.
- A blind (unseen) test set
System Submission
Participants must deploy their systems to a remote virtual machine for automatic evaluation instead of
submitting system output for the test set. You are free to install any toolbox or software
on the virtual machine to make your system work. The participants will develop their
systems as usual and produce their output in JSON format. More details will be
provided to all registered participants.
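As a rough illustration of what JSON-lines output might look like, the sketch below builds one discourse relation and serializes it. The field names and token-list layout here are assumptions for illustration only; the official data-format page is the authoritative schema.

```python
import json

# Hypothetical sketch of one discourse relation in JSON form.
# Field names and the token-list layout are assumptions; consult the
# shared task's data-format documentation for the real schema.
relation = {
    "DocID": "wsj_2200",               # document identifier (hypothetical)
    "Type": "Explicit",                # e.g. Explicit / Implicit / EntRel / AltLex
    "Sense": ["Comparison.Contrast"],  # list of predicted sense labels
    "Connective": {"TokenList": [[621, 624, 120, 5, 0]]},
    "Arg1": {"TokenList": [[599, 620, 115, 4, 25]]},
    "Arg2": {"TokenList": [[625, 650, 121, 5, 1]]},
}

# Systems would emit one JSON object per line:
line = json.dumps(relation)
parsed = json.loads(line)
print(parsed["Type"])  # Explicit
```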
Participants can submit a system for any combination of open vs. closed track, English vs. Chinese, and main vs. supplementary task.
Participants can also tune their system separately for each evaluation combination.
Open and Closed Tracks
There are two evaluation tracks. In the closed track, a participating system can only
be trained on the provided training set and linguistic resources in its corresponding language. To reduce
the data preparation burden on participants, the data package includes the training, development,
and test data with the following layers of automatic linguistic annotation:
- Phrase structure parses (predicted by the Berkeley Parser)
- Dependency parses (converted from the Berkeley Parser output, using the basic
Stanford dependency converter)
Closed track participants are also allowed to use the following publicly available linguistic
resources that have been found to be useful in the shallow discourse parsing literature:
- Brown Clusters
- VerbNet
- Sentiment lexicon
- Word embeddings (produced using Word2Vec)
The permitted versions of these resources can be downloaded
here.
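To give a sense of how one of these resources is typically consumed, the sketch below parses Brown cluster output, which conventionally uses one `<bit-path>\t<word>\t<count>` entry per line. The inline sample data is invented for illustration; the provided files may use a different layout, so adjust accordingly.

```python
# Sample lines in the common Brown cluster format (invented data).
sample = "0010\tthe\t1523\n0010\ta\t998\n1101\thowever\t87\n"

def load_brown_clusters(text):
    """Map each word to its cluster bit-string path."""
    clusters = {}
    for entry in text.strip().splitlines():
        path, word, _count = entry.split("\t")
        clusters[word] = path
    return clusters

clusters = load_brown_clusters(sample)
print(clusters["however"])  # 1101
```

Words sharing a bit-string prefix belong to nearby clusters, so prefixes of the path (e.g. the first 4 or 6 bits) are often used directly as features.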
To ensure meaningful comparisons among participating systems, we ask closed
track participants to use only the version of linguistic resources that we provide.
Participants can, however, re-process the training set themselves. If they use a third-party
tool, we require that it be publicly available so that results
can be replicated by other researchers who are interested in working on this
problem after the shared task is over.
In the open track, a participating system may use any publicly available NLP tools to
process the data AND any publicly available (i.e., non-proprietary) data for training.
A participating team can choose to participate in the closed track or the open track
or both.
Main Evaluation Metrics
The evaluation metric for the main task will be based on the F measure, the harmonic mean of
precision and recall. The scorer can be downloaded from the shared task GitHub repository.
The evaluation will be done on a per-discourse relation basis. A relation is correctly
predicted if and only if
- the discourse connective is correctly detected (for explicit discourse
relations)
- the sense of a discourse relation is correctly predicted, and
- the text spans of the two arguments as well as their labels (Arg1 and Arg2) are
correctly predicted.
The winning system is judged based on this main evaluation metric only.
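The exact-match rule above can be sketched as follows. This is only an illustration of the "all parts must match" criterion, under an assumed simplified relation representation (flat token-index lists and a single sense string); the official scorer in the task repository is authoritative.

```python
# Minimal sketch of per-relation exact-match F1 scoring.
# The relation representation here is a simplifying assumption,
# not the official scorer's data model.
def relation_key(rel):
    """Reduce a relation to the tuple that must match exactly."""
    return (
        tuple(rel["Connective"]),  # connective token indices (empty for implicit)
        tuple(rel["Arg1"]),        # Arg1 token indices
        tuple(rel["Arg2"]),        # Arg2 token indices
        rel["Sense"],              # sense label
    )

def f1_score(gold, predicted):
    gold_set = {relation_key(r) for r in gold}
    pred_set = {relation_key(r) for r in predicted}
    tp = len(gold_set & pred_set)          # true positives: exact matches
    precision = tp / len(pred_set) if pred_set else 0.0
    recall = tp / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [{"Connective": [5], "Arg1": [0, 1], "Arg2": [6, 7], "Sense": "Contingency.Cause"}]
pred = [{"Connective": [5], "Arg1": [0, 1], "Arg2": [6, 7], "Sense": "Contingency.Cause"}]
print(f1_score(gold, pred))  # 1.0
```

Note that a relation with correct argument spans but a wrong sense contributes nothing: partial credit is not awarded under the main metric.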
New Supplementary Evaluation Metrics
Discourse relation sense classification has been of particular interest to many
researchers in discourse parsing. This year, we accommodate teams that want to focus on this subtask.
The supplementary evaluation will be based on the accuracy of sense classification.
The dataset includes a list of discourse relations without the 'Sense' and 'Type' fields.
Gold-standard arguments and explicit discourse connectives are provided.
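Since arguments and connectives are given as gold, the supplementary metric reduces to plain classification accuracy over relations, which can be sketched as below. The sense labels in the example are illustrative.

```python
# Sketch of the supplementary metric: fraction of relations whose
# predicted sense matches the gold sense (arguments are given as gold,
# so only the sense label is scored).
def sense_accuracy(gold_senses, predicted_senses):
    correct = sum(g == p for g, p in zip(gold_senses, predicted_senses))
    return correct / len(gold_senses)

gold = ["Expansion.Conjunction", "Comparison.Contrast", "Temporal.Synchrony"]
pred = ["Expansion.Conjunction", "Comparison.Contrast", "Contingency.Cause"]
print(sense_accuracy(gold, pred))  # 2 of 3 correct
```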
Auxiliary Evaluation Metrics
To help with error analysis and diagnostics, the evaluation script that we provide will also evaluate
individual components (e.g. discourse connective
detection, argument labeling, sense classification) for you, but these scores are not taken into account
in the ranking of systems. They are there for you to use as an analytic and development tool.
If you create additional tools that you would like to share with other participants, please make
a pull request to the task repository.