CoreLex: Systematic Polysemy and Underspecification

-----

This thesis is concerned with a unified approach to the systematic polysemy and underspecification of nouns. Systematic polysemy, in which senses are systematically related and therefore predictable over classes of lexical items, is fundamentally different from homonymy, in which senses are unrelated, non-systematic and therefore not predictable. At the same time, studies in discourse analysis show that lexical items are often left underspecified for a number of related senses. There is clearly a correspondence between these phenomena, and its investigation is the topic of this thesis.

Acknowledging the systematic nature of polysemy and its relation to underspecified representations allows one to structure ontologies for lexical semantic processing more efficiently, generating more appropriate interpretations within context. Achieving this requires a thorough analysis of systematic polysemy and underspecification on a large and useful scale. The thesis establishes an ontology and semantic database (CoreLex) of 126 semantic types, covering around 40,000 nouns and defining a large number of systematically polysemous classes that are derived through a careful analysis of sense distributions in WordNet. The semantic types are underspecified representations based on generative lexicon theory.
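
The following sketch gives a rough feel for the kind of analysis involved. It is only an illustration, assuming NLTK's WordNet interface: nouns are grouped by the set of top-level hypernyms their senses fall under, loosely analogous to deriving systematically polysemous classes from sense distributions. The actual CoreLex types were defined through a careful manual analysis of WordNet 1.5, not computed this way.

    # Illustrative sketch only (not the CoreLex procedure): group nouns by the
    # combination of top-level WordNet hypernyms covering their senses.
    from collections import defaultdict
    from nltk.corpus import wordnet as wn

    def sense_signature(noun):
        """Set of second-level hypernyms (just below 'entity') over all senses."""
        tops = set()
        for synset in wn.synsets(noun, pos=wn.NOUN):
            for path in synset.hypernym_paths():
                tops.add(path[1].name() if len(path) > 1 else path[0].name())
        return frozenset(tops)

    # Nouns that share a signature are candidates for one systematically
    # polysemous class, and hence for one underspecified semantic type.
    classes = defaultdict(list)
    for noun in ["book", "novel", "magazine", "door", "window"]:
        classes[sense_signature(noun)].append(noun)

    for signature, members in classes.items():
        print(sorted(signature), "->", members)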

The representations are used in underspecified semantic tagging, which addresses two problems in traditional semantic tagging: sense enumeration (the difficulty of deciding on the number of discrete senses), due to systematic polysemy; and multiple reference (NPs denoting more than one model-theoretic referent), due to underspecification. In addition, traditional semantic tags based on discrete senses tend to be too fine-grained for practical use. For instance, WordNet has, in principle, around 60,000 different tags (synsets) for nouns alone. The CoreLex approach, on the other hand, offers a concise set of 126 tags that are inherently more coarse-grained because they take systematic polysemy and underspecification into account.
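
To make the granularity difference concrete, here is a minimal sketch of what an underspecified tag amounts to: a single combination of basic types per noun, so one tag covers readings that a synset-level tagger would have to enumerate and choose between. The class names and the tiny tag table are hypothetical placeholders, not the CoreLex inventory.

    # Illustrative sketch only: one underspecified tag per noun instead of a
    # choice among enumerated synsets. Names and table entries are hypothetical.
    COARSE_TAGS = {
        "book":      frozenset({"artifact", "information"}),
        "newspaper": frozenset({"artifact", "information", "organization"}),
        "door":      frozenset({"artifact", "location"}),
    }

    def tag(noun):
        """Assign one underspecified tag rather than forcing a single discrete sense."""
        return COARSE_TAGS.get(noun)

    # "The newspaper fired its editor and fell off the table": a single tag keeps
    # both the organization and the physical-object referent available in context.
    print(tag("newspaper"))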

Underspecified semantic tagging is implemented using probabilistic classification, in order to cover unknown nouns (those not in CoreLex) and to identify context-specific and novel interpretations. The classification algorithm is centered on the computation of a Jaccard (similarity) score that compares lexical items in terms of the attributes (linguistic patterns acquired from domain-specific corpora) they share.
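
A minimal sketch of the classification step, assuming each noun (known or unknown) is represented by the set of corpus-derived attributes it occurs with: an unknown noun receives the class of the known noun whose attribute set it shares most, measured by the Jaccard score |A ∩ B| / |A ∪ B|. Attribute extraction itself is not shown, and the training data below is invented for illustration.

    def jaccard(a, b):
        """Jaccard similarity of two attribute sets."""
        if not a and not b:
            return 0.0
        return len(a & b) / len(a | b)

    # Hypothetical training data: noun -> (class, attribute set).
    known = {
        "book":  ("artifact+information", {"read_obj", "publish_obj", "on_shelf"}),
        "lunch": ("event+food",           {"eat_obj", "during_mod", "serve_obj"}),
    }

    def classify(attributes):
        """Return the class of the most similar known noun and its Jaccard score."""
        best = max(known.items(), key=lambda kv: jaccard(attributes, kv[1][1]))
        noun, (corelex_class, attrs) = best
        return corelex_class, jaccard(attributes, attrs)

    print(classify({"read_obj", "on_shelf", "borrow_obj"}))  # -> ('artifact+information', 0.5)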





