Chinese Treebank Project

Description of the project

The Chinese Treebank Project started at the Institute for Research in Cognitive Science (IRCS) at the University of Pennsylvania. It later moved to the CLEAR Lab at the University of Colorado at Boulder. Two old websites for the project still exist, one at Penn and another at CU, but they are no longer actively maintained and the information there is very outdated.

The development of the Chinese Treebank has been supported by DOD, NSF, and the DARPA TIDES, GALE, and BOLT programs. The latest release of the Chinese Treebank is CTB 9.0. The genres covered in this release include newswire, magazine articles, broadcast news, broadcast conversations, newsgroups, weblogs, and discussion forums. The corpus is currently being expanded, and more genres will be included in future releases.

A semantic layer of annotation has been added to the Chinese Treebank through the Chinese Proposition Bank Project. The latest release of the Chinese Proposition Bank is CPB 3.0, which is also released via the Linguistic Data Consortium.

Talks

A talk about the Chinese Treebank given at TLT'11: [slides]

Annotation guidelines for the Chinese Treebank

Segmentation guidelines (final version): [ps-file], [pdf-file]
POS tagging guidelines (final version): [ps-file], [pdf-file]
Bracketing guidelines (final version): [ps-file], [pdf-file]
All three guidelines are now IRCS technical reports, numbered 00-06, 00-07 and 00-08, respectively.

Publications

2014: Joint POS Tagging and Transition-based Constituent Parsing in Chinese with Non-local Features: Zhiguo Wang and Nianwen Xue; Proceedings of ACL 2014, Baltimore, MD.

2013: A Lattice-based Framework for Joint Chinese Word Segmentation, POS-tagging and Parsing: Zhiguo Wang, Chengqing Zong and Nianwen Xue; Proceedings of ACL 2013, Sofia, Bulgaria.

2012: Extending and Scaling up the Chinese Treebank Annotation: Xiuhong Zhang and Nianwen Xue; Proceedings of the 2nd CIPS-SIGHAN Joint Conference on Chinese Language Processing (CLP-2012), Tianjin, China.

2007: Tapping the implicit information for the PS to DS conversion of the Chinese Treebank: Nianwen Xue; Proceedings of the Sixth International Workshop on Treebanks and Linguistic Theories, Bergen, Norway. [pdf] This paper describes the tool that converts the phrase structure representation of the Chinese Treebank to a dependency structure, which was used to generate the Chinese section of the CoNLL 2009 Shared Task data. The code for the tool can be downloaded here.

2005: The Penn Chinese TreeBank: Phrase Structure Annotation of a Large Corpus: Nianwen Xue, Fei Xia, Fu-Dong Chiou, and Martha Palmer; Natural Language Engineering, 11(2):207-238.

2002: Building a Large-Scale Annotated Chinese Corpus: Nianwen Xue, Fu-Dong Chiou, and Martha Palmer; Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan, 2002.

2001: Facilitating Treebank Annotation with a Statistical Parser: Fu-Dong Chiou, David Chiang, and Martha Palmer; Proceedings of the Human Language Technology Conference (HLT 2001), San Diego, California, 2001.

2000: Developing Guidelines and Ensuring Consistency for Chinese Text Annotation: Fei Xia, Martha Palmer, Nianwen Xue, Mary Ellen Okurowski, John Kovarik, Fu-Dong Chiou, Shizhe Huang, Tony Kroch, and Mitch Marcus; Proceedings of the Second International Conference on Language Resources and Evaluation (LREC 2000), Athens, Greece, 2000.

Workshops and meetings

1st CLP Workshop (6-7/98), Philadelphia, USA

Meeting during ACL-98 (8/98), Montreal, Canada

Meeting during ICCIP-98 (11/98), Beijing, China

Meeting during ACL-99 (6/99), Maryland, USA

2nd CLP Workshop (10/00), Hong Kong, China

Links to other sites

Penn English Treebank Project

Penn Korean Treebank Project

Acknowledgment

Last modified on December 28, 2012.