A formal, detailed description of the task and an analysis of the results can be found in our overview paper. The best-performing system uses a refined pipeline architecture and is described in the paper as well. The code for this system is available for forking here.
The official evaluation requires that errors propagate from the upstream tasks (e.g. argument extraction and discourse connective detection). We also evaluated the winning parser on sense classification without error propagation, to isolate the performance of sense classification, which has been the focus of work in this area of discourse parsing. More specifically, the system is given the gold standard argument pairs and the gold standard discourse connective. The system only needs to tag each relation (with its argument pair and connective) with one of the senses.
Sense | WSJ Dev Set: Overall | WSJ Dev Set: Explicit | WSJ Dev Set: Non-explicit | WSJ Test Set: Overall | WSJ Test Set: Explicit | WSJ Test Set: Non-explicit |
---|---|---|---|---|---|---|
Comparison.Concession | 42.86 | 52.17 | 0.00 | 45.61 | 50.98 | 0.00 |
Comparison.Contrast | 72.85 | 91.17 | 9.80 | 77.03 | 93.45 | 17.22 |
Contingency.Cause.Reason | 55.76 | 84.62 | 43.98 | 51.69 | 95.16 | 36.21 |
Contingency.Cause.Result | 45.61 | 91.43 | 25.32 | 45.45 | 1.00 | 15.49 |
Contingency.Condition | 94.51 | 94.51 | - | 90.43 | 90.43 | - |
EntRel | 62.47 | - | 62.47 | 52.65 | - | 52.65 |
Expansion.Alternative | 92.31 | 92.31 | - | 76.92 | 76.92 | - |
Expansion.Alternative.Chosen Alternative | 76.92 | 90.91 | 0.00 | 28.57 | 1.00 | 0.00 |
Expansion.Conjunction | 73.15 | 97.28 | 45.48 | 69.55 | 96.34 | 30.68 |
Expansion.Instantiation | 40.00 | 1.00 | 21.05 | 53.73 | 1.00 | 32.61 |
Expansion.Restatement | 36.92 | 90.91 | 34.54 | 30.12 | 61.54 | 29.08 |
Temporal.Asynchronous.Precedence | 77.17 | 98.00 | 0.00 | 86.75 | 97.30 | 0.00 |
Temporal.Asynchronous.Succession | 84.21 | 86.96 | 0.00 | 81.36 | 84.96 | 0.00 |
Temporal.Synchrony | 78.21 | 82.43 | 0.00 | 72.73 | 73.56 | 0.00 |
Accuracy | 65.11 | 90.00 | 42.72 | 61.27 | 90.79 | 34.45 |
For the blind test set, the accuracy rates are 54.76 (overall), 76.44 (explicit) and 36.29 (non-explicit). We withhold the per-sense results for the blind test set because we may need to reuse it for next year's evaluation, and we do not want to reveal its sense distribution.
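The overall, explicit and non-explicit accuracy figures reported above can be illustrated with a minimal sketch. This is not the official scorer: the field names `ID`, `Type` and `Sense`, the function name `sense_accuracy`, and the rule that a prediction counts as correct when it matches any of the gold senses are all assumptions made for illustration.

```python
from collections import defaultdict


def sense_accuracy(gold_relations, predicted_senses):
    """Return accuracy in percent, overall and per relation group.

    gold_relations: list of dicts with assumed keys "ID", "Type" and
        "Sense" (a list of acceptable gold senses).
    predicted_senses: dict mapping a relation ID to one predicted sense.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rel in gold_relations:
        # Assumption: every relation that is not "Explicit" is non-explicit.
        group = "explicit" if rel["Type"] == "Explicit" else "non-explicit"
        # Assumption: matching any gold sense counts as correct;
        # a missing prediction counts as wrong.
        hit = predicted_senses.get(rel["ID"]) in rel["Sense"]
        for key in ("overall", group):
            total[key] += 1
            correct[key] += int(hit)
    return {key: 100.0 * correct[key] / total[key] for key in total}


# Toy data, invented purely to exercise the function.
gold = [
    {"ID": 1, "Type": "Explicit", "Sense": ["Comparison.Contrast"]},
    {"ID": 2, "Type": "Explicit", "Sense": ["Contingency.Condition"]},
    {"ID": 3, "Type": "Implicit",
     "Sense": ["Expansion.Conjunction", "Expansion.Restatement"]},
    {"ID": 4, "Type": "EntRel", "Sense": ["EntRel"]},
]
predicted = {
    1: "Comparison.Contrast",
    2: "Contingency.Condition",
    3: "Expansion.Restatement",
    4: "Expansion.Conjunction",
}
scores = sense_accuracy(gold, predicted)
```

Because the gold argument pairs and connectives are supplied, every gold relation has exactly one prediction to compare against, so no alignment step between predicted and gold relations is needed in this setting.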