Brandeis University's Chinese Language Processing program is anchored by linguistic corpora
annotated with morphological, syntactic, semantic and discourse structures. The
Chinese Treebank, started at the University of Pennsylvania, is a segmented, part-of-speech
tagged, and fully bracketed corpus. The latest release is CTB7.0, which will soon be available through the LDC. It
has 51 thousand sentences, 1.2 million words, and 1.9 million Chinese characters. The sources of this corpus include newswire, magazine articles, broadcast news, broadcast conversations, and weblogs.
The segmentation, POS-tagging and syntactic bracketing standards are fully documented.
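As a rough illustration of what "segmented, part-of-speech tagged, and fully bracketed" means in practice, the sketch below reads one Penn-Treebank-style bracketed string and recovers the word segmentation and POS tags from it. The example sentence is made up for illustration and is not taken from the corpus; the function names are hypothetical, and the actual release files may carry extra markup that this minimal reader does not handle.

```python
import re

def parse_bracketed(text):
    """Parse one bracketed string into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with an opening bracket"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # leaf word
                pos += 1
        pos += 1  # consume the closing bracket
        return (label, children)

    return read_node()

def tagged_words(node):
    """Collect (word, POS tag) pairs from the leaves, left to right."""
    label, children = node
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_words(child))
    return pairs

# Made-up example in the bracketing style; not an actual corpus sentence.
EXAMPLE = "(IP (NP (NR 中国)) (VP (VV 发展) (NP (NN 经济))))"
tree = parse_bracketed(EXAMPLE)
pairs = tagged_words(tree)
words = [word for word, _ in pairs]  # the segmentation layer alone
```

The three annotation layers are all recoverable from the one structure: `words` gives the segmentation, `pairs` the POS tagging, and `tree` the bracketing.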
The Chinese Proposition Bank adds a layer of semantic annotation to the Chinese Treebank. This layer mainly captures the predicate-argument structure of Chinese verbs. The task is also called semantic role labeling, in the sense that each verb is expected to take a fixed number of arguments and each argument plays a role with regard to the verb. The current release is CPB2.0, also available via the LDC. The annotation of CPB3.0 has been completed, and the release will be available soon.
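To make the idea of a predicate-argument structure concrete, here is a minimal sketch of how one annotated proposition might be represented and checked against the set of roles its verb is expected to take. The class, the `FRAMES` table, and the numbered role labels (`ARG0`, `ARG1`) are assumptions chosen for illustration; they are not the actual file format or frame inventory of the Chinese Proposition Bank.

```python
from dataclasses import dataclass, field

@dataclass
class Proposition:
    predicate: str
    # Maps a role label to the text of the argument that fills it.
    arguments: dict = field(default_factory=dict)

# Hypothetical frame table: the roles each verb is expected to take.
FRAMES = {"发展": ("ARG0", "ARG1")}

def unexpected_roles(prop, frames=FRAMES):
    """Return role labels on the proposition that the verb's frame does not list."""
    expected = frames.get(prop.predicate, ())
    return sorted(role for role in prop.arguments if role not in expected)

# Made-up examples: one consistent with the frame, one with a stray role.
prop = Proposition("发展", {"ARG0": "中国", "ARG1": "经济"})
bad = Proposition("发展", {"ARG0": "中国", "ARG5": "经济"})
```

Here `unexpected_roles(prop)` returns an empty list, while `unexpected_roles(bad)` flags `ARG5`, which the hypothetical frame for the verb does not include.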
Extending the idea of predicate-argument structure to discourse, we are also in the initial stages of building a Chinese Discourse Treebank in which discourse connectives are treated as predicates that take arguments. A discourse connective can be a subordinating conjunction, a coordinating conjunction, or an anaphoric adverbial expression. Sometimes a discourse relation can be inferred even when no explicit discourse marker is present.
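The same predicate-argument view can be sketched for discourse: a relation holds between two text spans, and the connective, when there is one, plays the role of the predicate. The record below is only an illustrative assumption about how such an annotation could be stored; the field names and example clauses are hypothetical and do not reflect the Chinese Discourse Treebank's actual scheme.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DiscourseRelation:
    arg1: str
    arg2: str
    # None models a relation that is inferred without an explicit marker.
    connective: Optional[str] = None

    @property
    def is_explicit(self):
        return self.connective is not None

# Made-up examples: one relation signalled by a connective, one inferred.
explicit = DiscourseRelation(arg1="比赛 取消 了", arg2="下雨 了", connective="因为")
implicit = DiscourseRelation(arg1="下雨 了", arg2="比赛 取消 了")
```

Leaving `connective` optional keeps explicit and inferred relations in one structure, so downstream code can treat them uniformly and branch on `is_explicit` only where it matters.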
Other Chinese annotation projects carried out at Brandeis University include temporal annotation in the context of discourse. These projects are still at the pilot stage.
In the context of NLP research, building annotated corpora is of course only part of the larger picture, a means to an end. The goal is to train natural language systems. To that end, we have built Chinese segmenters, part-of-speech taggers, parsers, and semantic role labelers. We are also working on empty category recovery and automatic inference of temporal structures. These projects are carried out in the context of machine translation research between Chinese and English.
The Chinese Treebank Project is supported by DOD, NSF, and the DARPA TIDES and DARPA GALE programs.