CoreLex: Systematic Polysemy and Underspecification
PhD Thesis, Computer Science, Brandeis University, February 1998
This thesis is concerned with a unified approach to the systematic polysemy and underspecification of nouns. Systematic polysemy -- senses that are systematically related and therefore predictable over classes of lexical items -- is fundamentally different from homonymy -- senses that are unrelated, non-systematic and therefore not predictable. At the same time, studies in discourse analysis show that lexical items are often left underspecified for a number of related senses. Clearly, there is a correspondence between these phenomena, the investigation of which is the topic of this thesis.
Acknowledging the systematic nature of polysemy and its relation to underspecified representations allows one to structure ontologies for lexical semantic processing more efficiently and to generate more appropriate interpretations in context. To achieve this, one needs a thorough analysis of systematic polysemy and underspecification on a large and useful scale. The thesis establishes an ontology and semantic database (CoreLex) of 126 semantic types, covering around 40,000 nouns and defining a large number of systematic polysemous classes that are derived from a careful analysis of sense distributions in WordNet. The semantic types are underspecified representations based on generative lexicon theory.
The representations are used in underspecified semantic tagging, addressing two problems in traditional semantic tagging: sense enumeration (the difficulty of deciding on the number of discrete senses), due to systematic polysemy; and multiple reference (NPs denoting more than one model-theoretic referent), due to underspecification. In addition, traditional semantic tags that are based on discrete senses tend to be too fine-grained for practical use. For instance, WordNet has, in principle, around 60,000 different tags (synsets) for nouns alone. The CoreLex approach, on the other hand, offers a concise set of 126 tags that are inherently more coarse-grained because they take systematic polysemy and underspecification into account.
Underspecified semantic tagging is implemented using probabilistic classification, in order to cover unknown nouns (not in CoreLex) and to identify context-specific and new interpretations. The classification algorithm is centered on the computation of a Jaccard (similarity) score that compares lexical items in terms of the attributes they share (linguistic patterns acquired from domain-specific corpora).
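As an illustration of the Jaccard comparison described above, here is a minimal sketch. It is not the thesis implementation: the function names, the type labels TYPE_A and TYPE_B, and the pattern names p1 to p5 are hypothetical placeholders, and the sketch simply ranks candidate types by the plain set-based Jaccard score between an unknown noun's attributes and the attributes collected for each type.

```python
def jaccard(a, b):
    """Jaccard score |A & B| / |A | B| for two attribute collections."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # two empty attribute sets share nothing measurable
    return len(a & b) / len(union)


def rank_types(unknown_attrs, type_attrs):
    """Return (type, score) pairs for an unknown noun, best score first.

    type_attrs maps a semantic type label to the set of attributes
    observed for nouns of that type (hypothetical data layout).
    """
    scores = [(label, jaccard(unknown_attrs, attrs))
              for label, attrs in type_attrs.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical attribute sets; real attributes would be linguistic
# patterns acquired from a domain-specific corpus.
type_attrs = {
    "TYPE_A": {"p1", "p2", "p4"},
    "TYPE_B": {"p5"},
}
ranked = rank_types({"p1", "p2", "p3"}, type_attrs)
print(ranked[0])  # ('TYPE_A', 0.5)
```

With these placeholder sets, the unknown noun shares two of four distinct attributes with TYPE_A and none with TYPE_B, so TYPE_A is ranked first.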
You can download the complete thesis in Postscript (gzipped 256k or unzipped 950k)
The CoreLex database is freely available for research purposes, including commercial ones. This is in accordance with the WordNet policy on licensing. However, if you decide to use any of this material, please make sure to reference:
- the thesis:
- Paul Buitelaar - CoreLex: Systematic Polysemy and Underspecification, PhD Thesis, Computer Science, Brandeis University, February 1998
- this webpage:
- the WordNet webpage:
The database consists of three files. Each file has a fixed format; shared fields link the files to one another like tables in a relational database, and SYNSET_NUMBER references link them to WordNet 1.5.
In summary, a NOUN is of a certain CORELEX_TYPE, each CORELEX_TYPE corresponds to one or more SYSTEMATIC_POLYSEMOUS_CLASS, each consisting of a set of BASIC_TYPE, each of which corresponds to one or more SYNSET in WordNet 1.5.
The files and formats are:
- CoreLex (536k)
- NOUN :: CORELEX_TYPE
- CoreLex Classes (2k)
- CORELEX_TYPE :: SYSTEMATIC_POLYSEMOUS_CLASS
- CoreLex Basic Types (5k)
- BASIC_TYPE :: SYNSET_NUMBER :: SYNSET
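The relational chain from a noun to its WordNet synsets can be sketched as follows. This is only a sketch under several assumptions that the page does not confirm: that fields are literally separated by "::", that each line holds one record, and that a systematic polysemous class is written as a whitespace-separated list of basic types. The sample lines, type labels, and synset numbers are made-up placeholders, not real CoreLex or WordNet data.

```python
from collections import defaultdict


def parse_records(lines, n_fields, sep="::"):
    """Split each non-empty line on sep (assumed separator) into n_fields fields."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(sep)]
        if len(fields) != n_fields:
            continue  # skip lines that do not match the assumed layout
        records.append(tuple(fields))
    return records


def build_index(records):
    """Index records by their first field; remaining fields become the values."""
    index = defaultdict(list)
    for key, *rest in records:
        index[key].append(tuple(rest))
    return index


def synsets_for_noun(noun, noun_idx, class_idx, basic_idx):
    """Follow NOUN -> CORELEX_TYPE -> class -> BASIC_TYPE -> SYNSET_NUMBER."""
    found = set()
    for (corelex_type,) in noun_idx.get(noun, []):
        for (polysemous_class,) in class_idx.get(corelex_type, []):
            for basic_type in polysemous_class.split():
                for synset_number, _synset in basic_idx.get(basic_type, []):
                    found.add(synset_number)
    return sorted(found)


# Placeholder lines standing in for the three files (hypothetical content).
corelex_lines = ["book :: TYPE_X"]
class_lines = ["TYPE_X :: bt1 bt2", "TYPE_X :: bt1 bt3"]
basic_lines = [
    "bt1 :: 00000001 :: placeholder synset one",
    "bt2 :: 00000002 :: placeholder synset two",
    "bt3 :: 00000003 :: placeholder synset three",
]

noun_idx = build_index(parse_records(corelex_lines, 2))
class_idx = build_index(parse_records(class_lines, 2))
basic_idx = build_index(parse_records(basic_lines, 3))

print(synsets_for_noun("book", noun_idx, class_idx, basic_idx))
```

Keeping one index per file mirrors the relational layout: each lookup is a join on the shared field, so the files can be loaded independently. If the actual files use a different separator or class notation, only parse_records and the split inside synsets_for_noun would need to change.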
Disclaimer: The CoreLex database as presented here results from an automatic extraction process. Some of the systematic polysemous classes therefore contain inappropriate instances, which should be removed manually. This step is described in the thesis, but it has not yet been carried out on this version of the database.
The CoreLex database is also available on-line, giving a complete overview of all 126 underspecified semantic types, the 317 systematic polysemous classes they correspond to, and all 39,937 noun instances for each class as derived from WordNet 1.5.
This research has been supported by NSF/ARPA grant IRI-9314955 for the Core Lexical Engine project.