Chinese Language Processing at Brandeis University

Research

STAGES

Sponsor: NSF.

Statistical machine translation (MT) systems have improved greatly in the past several years, and have reached a point where they are widely used for at least getting the “gist” of foreign language documents and web pages. However, reading the output of even the best Chinese-English machine translation systems remains a painful experience. Furthermore, current systems perform well only on the type of text on which they have been trained (most often newswire text), and require very large amounts of text from this domain.

Under STAGES, we attempt to address these problems by abstracting away from the surface realization of the source language sentence in order to provide a semantically based representation that can guide the translation process. This will allow us to handle fundamental differences in how Chinese and English encode information. By leveraging recent advances in semantic parsing and discourse analysis at Colorado and Brandeis, we will provide automatic high-performance annotations of source and target language sentences that will better support semantics-based statistical machine translation. Our translation approach has as its foundation both the ISI and Rochester MT systems, state-of-the-art statistical MT systems that already make effective use of syntactic structure. We will feed all levels of the output of these systems to Columbia’s state-of-the-art generation system, which will be able to address the issue of the semantic plausibility of the target language translation. Our several layers of representation will include high-level conceptual views of the source and target language sentences, as well as fully lexicalized syntactic parses. For some sentences these different views will be largely compatible and will merge seamlessly, whereas for other sentences there will be considerable conflict. It will be the task of the generation system to assess the conflicts and the intrinsic semantic plausibility of the alternatives in an attempt to reconcile them and produce the best possible translation.

The Chinese Treebank

Sponsor: DARPA GALE.

The Chinese Treebank Project originated at Penn and was later moved to the University of Colorado at Boulder. It is now in the process of being moved to Brandeis University. The goal of the Chinese Treebank Project is to build a large-scale linguistically annotated Chinese corpus that can be used to train a wide range of NLP tools such as word segmenters, POS taggers and syntactic parsers. Data from the Chinese Treebank has been used in a series of international Chinese word segmentation bakeoffs sponsored by SIGHAN, a special interest group of the Association for Computational Linguistics. It has also been converted to a dependency representation and used to support the 2009 CoNLL Shared Task on multilingual dependency parsing and semantic role labeling. The initial seed funding for the project was provided by DoD, and over the years it has been funded by the National Science Foundation and DARPA. The data sources of the Chinese Treebank include Xinhua newswire (mainland China), Hong Kong news, and Sinorama Magazine (Taiwan). More recently, under DARPA GALE funding, it has been expanded to include broadcast news, broadcast conversation, newsgroups and weblog data. It currently has over one million words and is fully segmented, POS-tagged and annotated with phrase structures similar to those of the Penn English Treebank (Marcus et al., 1993). The latest release via the Linguistic Data Consortium (LDC) is CTB 6.0. The future direction of the project is to build parallel treebanks between Chinese and other languages such as English that can be used to support the development of next-generation MT systems.
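
To give a sense of what a segmented, POS-tagged and bracketed sentence looks like in practice, here is a minimal Python sketch that reads a Penn-style bracketed tree and recovers the word/tag pairs. The example sentence is made up for illustration and is not taken from the corpus, the function names are our own, and the actual release files carry additional markup (file headers, empty categories, function tags) that this sketch does not handle.

```python
import re


def parse_bracketed(text):
    """Parse one Penn-style bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        assert tokens[pos] == "(", "a node must start with '('"
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())
            else:
                children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read_node()


def tagged_words(tree):
    """Collect (word, POS tag) pairs from the preterminal nodes, left to right."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_words(child))
    return pairs


# Hypothetical example sentence ("Zhangsan reads books"), not from the corpus.
example = "(IP (NP (NR 张三)) (VP (VV 读) (NP (NN 书))))"

if __name__ == "__main__":
    print(tagged_words(parse_bracketed(example)))
```

Reading off the leaves in this way yields the word segmentation and POS tagging layers, while the nested tuples keep the phrase structure available for parser training.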

Temporal annotation of Chinese text

Sponsor: NSF.

Under construction.

The Chinese Propbank

Sponsor: DARPA GALE.

The Chinese Proposition Bank Project grows out of the Chinese Treebank Project and adds a layer of semantic annotation to the syntactic parses in the Chinese Treebank. This layer of semantic annotation mainly deals with the predicate-argument structure of Chinese verbs and their nominalizations. This task is also called semantic role labeling, in the sense that each verb is expected to take a fixed number of arguments and each argument plays a role with regard to the verbal or nominal predicate. The annotated data, as well as the frame files, a predicate-argument structure lexicon we created to guide the annotation, have been released to the computational linguistics research community via the Linguistic Data Consortium. The latest release is Chinese Proposition Bank 2.0. This data has been used to support the semantic role labeling component of the 2009 CoNLL Shared Task on multilingual dependency parsing and semantic role labeling.
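
The relationship between a frame file entry and an annotated predicate can be sketched in a few lines of Python. This is only an illustration under our own assumptions: the class names, the role descriptions for the verb 读 and the example proposition are hypothetical and are not copied from the released frame files, and the released data uses its own file formats. The sketch assumes PropBank-style labels, with numbered core arguments (ARG0, ARG1, ...) defined per predicate and adjunct labels such as ARGM-TMP that are not predicate-specific.

```python
from dataclasses import dataclass, field


@dataclass
class Frameset:
    """One sense of a predicate and the numbered roles it is expected to take."""
    lemma: str
    roles: dict  # numbered argument label -> role description


@dataclass
class Proposition:
    """One annotated predicate instance with its labeled arguments."""
    predicate: str
    arguments: dict = field(default_factory=dict)  # label -> text span


def undefined_core_args(prop, frameset):
    """Return numbered argument labels in prop that the frameset does not define.

    Adjunct labels (ARGM-*) are ignored because they are not predicate-specific.
    """
    return [
        label
        for label in prop.arguments
        if label.startswith("ARG")
        and label[3:].isdigit()
        and label not in frameset.roles
    ]


# Hypothetical frameset for the verb 读 ("read"); role descriptions are illustrative.
read_frame = Frameset(lemma="读", roles={"ARG0": "reader", "ARG1": "thing read"})

# Hypothetical annotation of "张三 昨天 读 书" ("Zhangsan read books yesterday").
prop = Proposition(
    predicate="读",
    arguments={"ARG0": "张三", "ARG1": "书", "ARGM-TMP": "昨天"},
)

if __name__ == "__main__":
    print(undefined_core_args(prop, read_frame))
```

A check of this kind reflects how the lexicon guides annotation: the numbered arguments on a predicate instance are expected to come from the set its frameset defines.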

Copyright 2012