CoNLL-2015 Shared Task Results

A formal, detailed description of the task and an analysis of the results can be found in our overview paper. The best-performing system uses a refined pipeline architecture, which is also described in the paper. The code for this system is available for forking here.

Supplemental Evaluation Results

The official evaluation requires that errors propagate from the upstream tasks (e.g., argument extraction and discourse connective detection). We also evaluated the winning parser on sense classification without error propagation, in order to measure sense classification in isolation, which has been the focus of work in this area of discourse parsing. More specifically, the system is given the gold-standard argument pairs and the gold-standard discourse connectives, so it only needs to tag each relation (with its argument pair and connective) with one of the senses.
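The sense-only setting described above can be pictured with a minimal sketch. Everything in it is an assumption made for illustration: the `Relation` container, the `tag_senses` helper, and the rule-based `toy_classifier` are hypothetical names, and the toy rules are in no way the winning system's method. The sketch only shows the shape of the task: arguments and connective arrive as gold input, and the system fills in the sense.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Relation:
    """Hypothetical container for one discourse relation with gold inputs."""
    arg1: str                      # gold-standard first argument
    arg2: str                      # gold-standard second argument
    connective: Optional[str]      # gold connective; None for non-explicit relations
    sense: Optional[str] = None    # the only field the system has to predict


def tag_senses(relations: List[Relation],
               classify: Callable[[Relation], str]) -> List[Relation]:
    """Fill in the sense of every relation; arguments and connective are left untouched."""
    for rel in relations:
        rel.sense = classify(rel)
    return relations


def toy_classifier(rel: Relation) -> str:
    """Placeholder rules for illustration only, not a real sense classifier."""
    if rel.connective is None:
        return "EntRel"
    lookup = {
        "but": "Comparison.Contrast",
        "because": "Contingency.Cause.Reason",
    }
    return lookup.get(rel.connective.lower(), "Expansion.Conjunction")


if __name__ == "__main__":
    relations = [
        Relation("Sales rose", "profits fell", "but"),
        Relation("The plant closed", "demand collapsed", "because"),
        Relation("The company reported earnings", "It is based in Ohio", None),
    ]
    for rel in tag_senses(relations, toy_classifier):
        print(rel.connective, "->", rel.sense)
```

Because the arguments and the connective are never predicted in this setting, any error in the output can be attributed to sense classification alone.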

                                          WSJ Dev Set                      WSJ Test Set
Sense                                     Overall  Explicit  Non-explicit  Overall  Explicit  Non-explicit
Comparison.Concession                     42.86    52.17     0.00          45.61    50.98     0.00
Comparison.Contrast                       72.85    91.17     9.80          77.03    93.45     17.22
Contingency.Cause.Reason                  55.76    84.62     43.98         51.69    95.16     36.21
Contingency.Cause.Result                  45.61    91.43     25.32         45.45    1.00      15.49
Contingency.Condition                     94.51    94.51     -             90.43    90.43     -
EntRel                                    62.47    -         62.47         52.65    -         52.65
Expansion.Alternative                     92.31    92.31     -             76.92    76.92     -
Expansion.Alternative.Chosen Alternative  76.92    90.91     0.00          28.57    1.00      0.00
Expansion.Conjunction                     73.15    97.28     45.48         69.55    96.34     30.68
Expansion.Instantiation                   40.00    1.00      21.05         53.73    1.00      32.61
Expansion.Restatement                     36.92    90.91     34.54         30.12    61.54     29.08
Temporal.Asynchronous.Precedence          77.17    98.00     0.00          86.75    97.30     0.00
Temporal.Asynchronous.Succession          84.21    86.96     0.00          81.36    84.96     0.00
Temporal.Synchrony                        78.21    82.43     0.00          72.73    73.56     0.00
Accuracy                                  65.11    90.00     42.72         61.27    90.79     34.45

The table shows the per-sense F1 scores of the best-performing system. The last row shows the accuracy. A dash indicates that there are no instances of that type.
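For readers who want to see how per-sense F1 and accuracy relate to each other, the following is a minimal sketch and not the official scorer. It assumes each relation has exactly one gold sense and one predicted sense, which is a simplification. The function names `per_sense_f1` and `accuracy` are hypothetical, and scores are scaled to percentages.

```python
from collections import Counter
from typing import Dict, List


def per_sense_f1(gold: List[str], predicted: List[str]) -> Dict[str, float]:
    """Return F1 (as a percentage) for each sense seen in gold or predicted labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1   # predicted sense was wrong
            fn[g] += 1   # gold sense was missed
    scores = {}
    for sense in set(gold) | set(predicted):
        n_pred = tp[sense] + fp[sense]
        n_gold = tp[sense] + fn[sense]
        precision = tp[sense] / n_pred if n_pred else 0.0
        recall = tp[sense] / n_gold if n_gold else 0.0
        denom = precision + recall
        scores[sense] = 100.0 * (2 * precision * recall / denom) if denom else 0.0
    return scores


def accuracy(gold: List[str], predicted: List[str]) -> float:
    """Return the percentage of relations whose predicted sense equals the gold sense."""
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


if __name__ == "__main__":
    gold = ["EntRel", "EntRel", "Comparison.Contrast", "Temporal.Synchrony"]
    pred = ["EntRel", "Comparison.Contrast", "Comparison.Contrast", "Temporal.Synchrony"]
    for sense, f1 in sorted(per_sense_f1(gold, pred).items()):
        print(f"{sense}: {f1:.2f}")
    print(f"Accuracy: {accuracy(gold, pred):.2f}")
```

In this sketch, a sense that is never predicted correctly gets an F1 of 0.00, while a sense that appears in neither the gold nor the predicted labels is simply absent from the output, which corresponds to a dash in the table.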

For the blind test set, the accuracy rates are 54.76 (overall), 76.44 (explicit), and 36.29 (non-explicit). We are withholding the per-sense results for the blind test set because we may need to reuse it in next year's evaluation and do not want to reveal its sense distribution.