These are some of the areas we have been working on.
Brandeis University's Chinese Language Processing program is anchored by linguistic corpora annotated with morphological, syntactic, semantic, and discourse structures. The Chinese Treebank, started at the University of Pennsylvania, is a segmented, part-of-speech tagged, and fully bracketed corpus. The latest release is CTB7.0, which will soon be available through the LDC. It has 51 thousand sentences, 1.2 million words, and 1.9 million Chinese characters. The sources of this corpus include newswire, magazine articles, broadcast news, broadcast conversations, and weblogs. The segmentation, POS-tagging, and syntactic bracketing standards are fully documented.
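To make "segmented, part-of-speech tagged, and fully bracketed" concrete, here is a minimal sketch of how one bracketed tree could be read to recover its words and POS tags. The sentence in `TOY_TREE` is a made-up illustration, not taken from the corpus, and the function names `parse_bracketed` and `tagged_words` are hypothetical. Real treebank files also carry file-level markup, empty categories, and functional tags that this sketch does not handle.

```python
def parse_bracketed(text):
    """Parse one Penn-style bracketed tree into nested (label, children) tuples."""
    # Pad the parentheses so that a plain whitespace split yields clean tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with '('"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # word at a leaf
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read_node()


def tagged_words(tree):
    """Collect (word, POS tag) pairs from the preterminal nodes, left to right."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_words(child))
    return pairs


# Hypothetical toy tree, written for illustration only (not from the corpus).
TOY_TREE = "(IP (NP (NR 张三)) (VP (VV 喜欢) (NP (NN 语言学))))"

tree = parse_bracketed(TOY_TREE)
pairs = tagged_words(tree)
words = [word for word, _ in pairs]        # the segmentation
n_chars = sum(len(word) for word in words)  # character count, as in corpus statistics

print(pairs)
print(len(words), "words,", n_chars, "characters")
```

The three annotation layers fall out of the same structure: the leaves give the word segmentation, the labels directly above them give the POS tags, and the nesting gives the syntactic bracketing. Counting words and characters this way is also how a words-versus-characters distinction like the one in the corpus statistics arises.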