Information extraction, Word Sense Disambiguation and NLP Architectures

Yorick Wilks
University of Sheffield, UK

Thursday, November 4, Volen 101, 2:00-3:00 pm

The talk describes the Sheffield GATE architecture for NLP (General Architecture for Text Engineering) and in particular two projects within it: the LaSIE Information Extraction project and our multi-engined word-sense disambiguator for English.

One first distinguishes information detection or retrieval (IR) from information extraction (IE), and then moves to recent advances in using IE technology for fast access to very large amounts of textual information, for example on the World Wide Web, and its extraction into a browsable database. It is argued that multilingual applications of IE/IR make the distinction between these information-processing technologies and machine translation, automatic question answering and summarisation less clear than before, and that they can now be combined in original ways to optimise information access via electronic text. Promising applications are mentioned in security, publishing, communications, finance, science, patents, etc. The problems in advancing the field rapidly are described, particularly the design of an appropriate interface, the modelling of users' needs, and the automatic adaptation of such systems to new domains.

We then discuss the more precise question of whether it is possible to sense-tag systematically and on a large scale, and how we should assess progress so far. That is to say, how to attach each occurrence of a word in a text to one and only one sense in a dictionary (a particular dictionary, of course, and that is part of the problem). The paper proposes a partial solution to the question; we have reported empirical findings elsewhere (Cowie et al. 1992; Wilks and Stevenson 1998) and intend to continue and refine that work. The paper also examines two well-known contributions critically: one (Kilgarriff 1993) which is widely taken as showing that the task, as defined, cannot be carried out systematically by humans, and another (Yarowsky 1995) which claims strikingly good results at doing exactly that, but over very small selected sets of words. We argue that it is more reasonable to stick to general sense tagging (i.e. over all content words) as the primary task.

The empirical results we report (95% of words correctly sense-tagged) are, we believe, currently the best worldwide obtained from a training and test methodology. We used several different methods for tagging, based on different kinds of information (e.g. word definitions, preferences, thesaurus codes).
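To make the multi-engine idea concrete, the sketch below combines several partial taggers by simple majority vote. It is a minimal illustration only, not the Sheffield system: the tagger functions and sense labels are hypothetical stand-ins for knowledge sources of the kind mentioned above (dictionary definitions, preferences, thesaurus codes), and the real combination method may differ.

```python
from collections import Counter

def combine_taggers(word, context, taggers):
    """Combine several partial WSD taggers by majority vote.

    Each tagger maps (word, context) to a candidate sense label,
    or None when its evidence does not apply. Returns the sense
    with the most votes, or None if no tagger fired.
    """
    votes = Counter(
        sense
        for tagger in taggers
        if (sense := tagger(word, context)) is not None
    )
    return votes.most_common(1)[0][0] if votes else None

# Hypothetical knowledge sources, one per kind of evidence.
def definition_overlap_tagger(word, context):
    # Stand-in for overlap between dictionary definitions and context words.
    return "bank/finance" if "money" in context else None

def thesaurus_code_tagger(word, context):
    # Stand-in for thesaurus-category evidence.
    return "bank/finance" if {"money", "deposit"} & context else None

def preference_tagger(word, context):
    # Stand-in for selectional-preference evidence.
    return "bank/river" if "water" in context else None

taggers = [definition_overlap_tagger, thesaurus_code_tagger, preference_tagger]
print(combine_taggers("bank", {"money", "deposit"}, taggers))  # bank/finance
```

The point of the combination is that no single knowledge source covers all words; a vote (or, more generally, a learned combination) lets partial sources reinforce one another where they agree and fill in where only one applies.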

Host: James Pustejovsky