Generative Lexicon Theory

The focus of research in Generative Lexicon Theory is the computational and cognitive modelling of natural language meaning. More specifically, we investigate how words and their meanings combine to form meaningful texts. This research has focused on developing a lexically oriented theory of semantics, drawing on methods from formal and computational semantics. That is, we are looking at how word meaning in natural language might be characterized both formally and computationally, in order to account both for the subtle use of words in different sentences and for the creative use of words in novel contexts. One of the major goals of our current research, therefore, is to study polysemy, ambiguity, and sense-shifting phenomena in different languages.

PI: James Pustejovsky

TimeML: Temporal Markup Language

TimeML is a robust specification language for events and temporal expressions in natural language. It is designed to address four problems in event and temporal expression markup. TimeML has been developed in the context of three AQUAINT workshops and projects. The 2002 TERQAS workshop set out to enhance natural language question answering systems to answer temporally-based questions about the events and entities in news articles. The first version of TimeML was defined and the TimeBank corpus was created as an illustration. TANGO was a follow-up workshop in which a graphical annotation tool was developed. Currently, the TARSQI project develops algorithms that tag events and time expressions in NL texts and temporally anchor and order the events.
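
To make the markup concrete, the snippet below parses a minimal, hypothetical fragment in the style of TimeML (the sentence and IDs are invented for illustration, not drawn from TimeBank): events are wrapped in EVENT tags, temporal expressions in TIMEX3 tags, and a TLINK records the relation between them.

```python
# Parse a small TimeML-style fragment (hypothetical example sentence).
import xml.etree.ElementTree as ET

fragment = """<TimeML>
  John <EVENT eid="e1" class="OCCURRENCE">arrived</EVENT>
  on <TIMEX3 tid="t1" type="DATE" value="2002-07-15">July 15, 2002</TIMEX3>.
  <TLINK lid="l1" eventID="e1" relatedToTime="t1" relType="IS_INCLUDED"/>
</TimeML>"""

root = ET.fromstring(fragment)
event = root.find("EVENT")    # the tagged event mention
timex = root.find("TIMEX3")   # the normalized temporal expression
tlink = root.find("TLINK")    # the temporal link anchoring e1 to t1

print(event.text, event.get("eid"))   # arrived e1
print(timex.get("value"))             # 2002-07-15
print(tlink.get("relType"))           # IS_INCLUDED
```

The TIMEX3 value attribute carries the normalized date, so downstream reasoning can work with ISO-style values rather than surface strings.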

PI: James Pustejovsky

TARSQI Toolkit: Temporal Parsing and Reasoning in Language

The TARSQI Project allows developers and analysts to sort and organize information in NL texts by its temporal characteristics. Specifically, we will develop algorithms that tag mentions of events in NL texts, tag and normalize time expressions, and temporally anchor and order the events. We will also develop temporal reasoning algorithms that operate on the resulting event-time graph for each document. These temporal reasoning algorithms will include a graph query capability that will, for example, find when a particular event occurs, or which events occur in a given time period. They will also include a temporal closure algorithm that allows more complete coverage of queries (by using the transitivity of temporal precedence and inclusion relations to insert additional links into the graph), and a timelining algorithm that provides chronological views, at various granularities, of an event graph as a whole or of a region of it. We will also develop a capability to compare event graphs across documents. Finally, we will develop a model of the typical durations of various kinds of events.
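
The closure idea can be sketched as follows. This is a minimal illustration, not the TARSQI implementation: it applies only the transitivity of a single relation (BEFORE with BEFORE, INCLUDES with INCLUDES), whereas a full closure component would compose the complete set of temporal relations.

```python
# Minimal sketch of temporal closure over an event-time graph:
# repeatedly apply transitivity until no new links can be inferred.
def temporal_closure(links):
    """links: set of (x, relation, y) triples, relation in
    {'BEFORE', 'INCLUDES'}; returns the transitively closed set."""
    closed = set(links)
    changed = True
    while changed:
        changed = False
        for (a, r1, b) in list(closed):
            for (c, r2, d) in list(closed):
                # x BEFORE y and y BEFORE z entails x BEFORE z
                # (and likewise for INCLUDES).
                if b == c and r1 == r2 and (a, r1, d) not in closed:
                    closed.add((a, r1, d))
                    changed = True
    return closed

links = {("e1", "BEFORE", "e2"), ("e2", "BEFORE", "e3"),
         ("t1", "INCLUDES", "e1")}
print(("e1", "BEFORE", "e3") in temporal_closure(links))  # True
```

A query for "events before e3" over the closed graph now finds e1 as well as e2, which is the sense in which closure gives more complete query coverage.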

PI: James Pustejovsky


Spatiotemporal Reasoning

The goals of this research are to further the representational and algorithmic support for spatio-temporal reasoning from natural language text in the service of practical applications. One such task is tracking the movements of individuals; providing automated support for such a task can be vital for national security. To create such technological support, we propose to use lexical resources to integrate two existing annotation schemes, creating an entirely new representation that captures, in a fine-grained manner, the movement of individuals through spatial and temporal locations. This integrated representation will be extracted automatically from natural language documents using symbolic and machine learning methods.

The other challenge we address is translating verbal, subjective descriptions of spatial relations into metrically meaningful positional information, and extending this capability to spatiotemporal monitoring. Document collections, transcriptions, cables, and narratives routinely make reference to objects moving through space over time. Integrating such information derived from textual sources into a geosensor data system can enhance the overall spatiotemporal representation in changing and evolving situations, such as when tracking objects through space with limited image data.

PI: James Pustejovsky



ULA: Unified Linguistic Annotation

Lack of progress in automatically producing semantic representations constitutes a major obstacle for natural language processing. Our research addresses this issue by creating a Unified Linguistic Annotation (ULA) exemplified by the first large (550K words), balanced, semantically annotated corpus. This corpus has most basic types of semantic information annotated according to high-quality schemes using state-of-the-art annotation technology. Crucially, all individual annotations, although unified, are kept separate in order to make it easy to produce alternative annotations of a specific type of semantic information (word senses, anaphora, etc.) without modifying annotation at other levels. The ULA framework is easily extendable to incorporate new annotation schemes as they become available. We have created an infrastructure including both multiply-annotated corpora and guidelines for merging so that the ULA can be extended. This work is in conjunction with the University of Pennsylvania, NYU, the University of Pittsburgh, and the University of Colorado at Boulder.

PI: James Pustejovsky




SILT: Sharable Infrastructure for Language Technology

This proposal for U.S. participation in a network to work toward achieving interoperability among language resources is being submitted in parallel with a proposal to the European Union eContentplus Programme to support European participation. The resulting international effort (hereafter called “the Network”) will involve members of the language processing community and others working in related areas to build consensus regarding the sharing of data and technologies for language resources and applications, to work towards interoperability of existing data, and, where possible, to promote standards for annotation and resource building. In addition to broad-based US and European participation, we are seeking the participation of colleagues in Asia.

The resources and technologies to be addressed include annotated corpora (texts, audio), lexicons, ontologies, automatic speech recognizers, part-of-speech taggers and lemmatizers, named entity recognizers, information extractors, etc., as well as systems for search, access, and annotation. The creation and use of these resources spans several related but relatively isolated disciplines, including NLP, information retrieval, machine translation, speech, and the semantic web. The goal is to turn existing, fragmented technology and resources developed within these groups in relative isolation into accessible, stable, and interoperable resources that can be readily reused across several fields.

PI: James Pustejovsky


Chinese TreeBank

The Chinese Treebank Project originated at Penn and was later moved to the University of Colorado at Boulder; it is now in the process of being moved to Brandeis University. The goal of the Chinese Treebank Project is to build a large-scale linguistically annotated Chinese corpus that can be used to train a wide range of NLP tools such as word segmenters, POS taggers, and syntactic parsers. Data from the Chinese Treebank has been used in a series of international Chinese word segmentation bakeoffs sponsored by SIGHAN, a special interest group of the Association for Computational Linguistics. It has also been converted to a dependency representation and used to support the 2009 CoNLL Shared Task on multilingual dependency parsing and semantic role labeling. The initial seed funding for the project was provided by the DoD, and over the years it has been funded by the National Science Foundation and DARPA. The data sources of the Chinese Treebank include Xinhua newswire (mainland China), Hong Kong news, and Sinorama Magazine (Taiwan). More recently, under DARPA GALE funding, it has been expanded to include broadcast news, broadcast conversation, newsgroup, and web log data. It currently has over one million words and is fully segmented, POS-tagged, and annotated with phrase structures similar to those of the Penn English Treebank (Marcus et al., 1993). The latest release via the Linguistic Data Consortium (LDC) is CTB 6.0. The future direction of the project is to build parallel treebanks between Chinese and other languages, such as English, that can be used to support the development of next-generation MT systems.

PI: Nianwen Xue


Chinese PropBank

The Chinese Proposition Bank Project grows out of the Chinese Treebank Project and adds a layer of semantic annotation to the syntactic parses in the Chinese Treebank. This layer of semantic annotation mainly deals with the predicate-argument structure of Chinese verbs and their nominalizations. The task is also called semantic role labeling, in the sense that each verb is expected to take a fixed number of arguments and each argument plays a role with regard to the verbal or nominal predicate. The annotated data, as well as the "frame files" (a predicate-argument structure lexicon we created to guide the annotation), have been released to the computational linguistics research community via the Linguistic Data Consortium. The latest release is Chinese Proposition Bank 2.0. This data has been used to support the semantic role labeling component of the 2009 CoNLL Shared Task on multilingual dependency parsing and semantic role labeling.
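
As a rough illustration of what predicate-argument annotation records, the sketch below pairs a hypothetical frame entry with one labeled sentence. The predicate, role inventory, and sentence are invented for this example; the actual frame files have their own format and conventions.

```python
# Hypothetical frame entry: one roleset for the verb 给 ('give'),
# mapping numbered arguments to role descriptions.
frame = {
    "predicate": "给",
    "roleset": "给.01",
    "roles": {"ARG0": "giver", "ARG1": "thing given", "ARG2": "recipient"},
}

# One labeled instance: 'He gave me a book', with each argument
# pointing at the word indices that realize it.
instance = {
    "sentence": ["他", "给", "我", "一本书"],
    "predicate_index": 1,
    "arguments": {"ARG0": [0], "ARG2": [2], "ARG1": [3]},
}

# Recover each argument span and its role description from the frame.
for role, idxs in sorted(instance["arguments"].items()):
    span = "".join(instance["sentence"][i] for i in idxs)
    print(role, span, "->", frame["roles"][role])
```

The point of the frame file is exactly this mapping: annotators label spans with numbered arguments, and the lexicon fixes what each number means for that predicate.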

PI: Nianwen Xue



OntoNotes

OntoNotes is a five-year multi-site collaboration between BBN Technologies, the Information Sciences Institute of the University of Southern California, the University of Colorado, the University of Pennsylvania, and Brandeis University. The goal of the OntoNotes project is to provide linguistic data annotated with a skeletal representation of the literal meaning of sentences, allowing a new generation of language understanding technologies to be developed with new functional capabilities. It aims to annotate a large corpus comprising various genres of text (news, conversational telephone speech, weblogs, Usenet, broadcast, talk shows) in three languages (English, Chinese, and Arabic) with structural information (syntax and predicate-argument structure) and shallow semantics (word sense linked to an ontology, and coreference). OntoNotes builds on two time-tested resources, following the Penn Treebank for syntax and the Penn PropBank for predicate-argument structure. Its semantic representation will include word sense disambiguation for nouns and verbs, with each word sense connected to an ontology, and coreference. Over the course of the five-year program, the goal of the project is to annotate over a million words each of English and Chinese, and half a million words of Arabic.

PI: Nianwen Xue