CoNLL 2015 Shared Task

CoNLL-2015 Shared Task Results

A formal, detailed description of the task and an analysis of the results can be found in our overview paper. The best-performing system uses a refined pipeline architecture, which is also described in the paper. The code for this system is available for forking here.

Supplemental Evaluation Results

The official evaluation requires that errors propagate from the upstream tasks (e.g. argument extraction and discourse connective detection). We also evaluated the winning parser on sense classification without error propagation, in order to measure sense classification in isolation, since it has been the focus of work in this area of discourse parsing. More specifically, the system is given the gold-standard argument pairs and the gold-standard discourse connectives. The system only needs to tag each relation (with its argument pair and connective) with one of the senses.

	WSJ Dev Set			WSJ Test Set
Sense	Overall	Explicit	Non-explicit	Overall	Explicit	Non-explicit
Comparison.Concession	42.86	52.17	0.00	45.61	50.98	0.00
Comparison.Contrast	72.85	91.17	9.80	77.03	93.45	17.22
Contingency.Cause.Reason	55.76	84.62	43.98	51.69	95.16	36.21
Contingency.Cause.Result	45.61	91.43	25.32	45.45	100.00	15.49
Contingency.Condition	94.51	94.51	-	90.43	90.43	-
EntRel	62.47	-	62.47	52.65	-	52.65
Expansion.Alternative	92.31	92.31	-	76.92	76.92	-
Expansion.Alternative.Chosen Alternative	76.92	90.91	0.00	28.57	100.00	0.00
Expansion.Conjunction	73.15	97.28	45.48	69.55	96.34	30.68
Expansion.Instantiation	40.00	100.00	21.05	53.73	100.00	32.61
Expansion.Restatement	36.92	90.91	34.54	30.12	61.54	29.08
Temporal.Asynchronous.Precedence	77.17	98.00	0.00	86.75	97.30	0.00
Temporal.Asynchronous.Succession	84.21	86.96	0.00	81.36	84.96	0.00
Temporal.Synchrony	78.21	82.43	0.00	72.73	73.56	0.00
Accuracy	65.11	90.00	42.72	61.27	90.79	34.45

The table shows the per-sense F1 scores (in percent) of the best-performing system. The last row shows the accuracy. A dash indicates that there are no instances of that type.
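As a rough illustration of how figures of this kind are derived from system output, the sketch below computes per-sense F1 and overall accuracy from parallel lists of gold and predicted sense labels. It is not the official scorer: the function name `sense_scores` and the example labels are hypothetical, and the sketch assumes exactly one gold sense per relation.

```python
from collections import Counter


def sense_scores(gold, predicted):
    """Return (per-sense F1, accuracy), both in percent.

    Assumption: gold[i] and predicted[i] are the single sense labels
    of the same relation. This is an illustrative sketch only.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1   # correct label: true positive for that sense
        else:
            fp[p] += 1   # predicted sense was wrongly assigned
            fn[g] += 1   # gold sense was missed

    f1 = {}
    for sense in set(gold) | set(predicted):
        # F1 = 2*TP / (2*TP + FP + FN), equivalent to the harmonic
        # mean of precision and recall.
        denom = 2 * tp[sense] + fp[sense] + fn[sense]
        f1[sense] = 100.0 * 2 * tp[sense] / denom if denom else 0.0

    accuracy = 100.0 * sum(tp.values()) / len(gold) if gold else 0.0
    return f1, accuracy


if __name__ == "__main__":
    # Hypothetical labels, not taken from any of the data sets.
    gold = ["Expansion.Conjunction", "Expansion.Conjunction",
            "Comparison.Contrast", "EntRel"]
    pred = ["Expansion.Conjunction", "Comparison.Contrast",
            "Comparison.Contrast", "EntRel"]
    per_sense, acc = sense_scores(gold, pred)
    for sense in sorted(per_sense):
        print(f"{sense}\t{per_sense[sense]:.2f}")
    print(f"Accuracy\t{acc:.2f}")
```

Under this scheme a sense that the system never predicts correctly receives an F1 of 0.00, even when it has instances in the data. Applying the same function separately to the explicit and non-explicit relations would give the corresponding columns.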

For the blind test set, the accuracies are 54.76 (overall), 76.44 (explicit) and 36.29 (non-explicit). We are withholding the per-sense results for the blind test set because we may need to reuse it in next year's evaluation, and we do not want to reveal its sense distribution.