Explorations in Automatic Thesaurus Discovery
EXPLORATIONS IN AUTOMATIC THESAURUS DISCOVERY

Gregory GREFENSTETTE
University of Pittsburgh, Pittsburgh, Pennsylvania, USA
Rank Xerox Research Centre, Grenoble, France

KLUWER ACADEMIC PUBLISHERS
Boston / London / Dordrecht
1994. DOI: 10.1007/978-1-4615-2710-7

CONTENTS

PREFACE  v
1  INTRODUCTION  1
2  SEMANTIC EXTRACTION  7
  2.1  Historical Overview  7
  2.2  Cognitive Science Approaches  8
  2.3  Recycling Approaches  17
  2.4  Knowledge-Poor Approaches  23
3  SEXTANT  33
  3.1  Philosophy  33
  3.2  Methodology  34
  3.3  Other Examples  54
  3.4  Discussion  57
4  EVALUATION  69
  4.1  Deese Antonyms Discovery  70
  4.2  Artificial Synonyms  75
  4.3  Gold Standards Evaluations  81
  4.4  Webster's 7th  89
  4.5  Syntactic vs. Document Co-occurrence  91
  4.6  Summary  100
5  APPLICATIONS  101
  5.1  Query Expansion  101
  5.2  Thesaurus Enrichment  114
  5.3  Word Meaning Clustering  126
  5.4  Automatic Thesaurus Construction  131
  5.5  Discussion and Summary  133
6  CONCLUSION  137
  6.1  Summary  137
  6.2  Criticisms  139
  6.3  Future Directions  141
  6.4  Vision  147
1  PREPROCESSORS  149
2  WEBSTER STOPWORD LIST  151
3  SIMILARITY LIST  153
4  SEMANTIC CLUSTERING  163
5  AUTOMATIC THESAURUS GENERATION  171
6  CORPORA TREATED  181
  6.1  ADI  181
  6.2  AI  187
  6.3  AIDS  192
  6.4  ANIMALS  197
  6.5  BASEBALL  202
  6.6  BROWN  207
  6.7  CACM  212
  6.8  CISI  219
  6.9  CRAN  228
  6.10  HARVARD  235
  6.11  JFK  240
  6.12  MED  247
  6.13  MERGERS  252
  6.14  MOBYDICK  257
  6.15  NEJM  261
  6.16  NPL  266
  6.17  SPORTS  273
  6.18  TIME  278
  6.19  XRAY  285
  6.20  THESIS  290
INDEX  303

PREFACE

by David A. Evans
Associate Professor of Linguistics and Computer Science
Carnegie Mellon University

There are several `quests' in contemporary computational linguistics; at least one version of the `Holy Grail' is represented by a procedure to discover semantic relations in free text, automatically. No one has found such a procedure, but everyone could use one.

Semantic relations are important, of course, because the essential content of texts is not revealed by `words' alone. To identify the propositions, the attributes and their values, even the phrases that count as `names' or other lexical atoms in texts, one must appeal to semantic relations. Indeed, all natural-language understanding (NLU) efforts require resources for representing semantic relations and some mechanism to interpret them.

Of course, not all systems that process natural language aspire to natural-language understanding. Historically, for example, information retrieval systems have not attempted to use natural-language processing (NLP) of any sort, let alone processing requiring semantic relations, in analyzing (indexing) free text.
Increasingly, however, in the face of degrading performance in large-scale applications, the designers of such systems are attempting to overcome the problems of language variation (e.g., the many ways to express the `same' idea) and the polysemy of `words' by using linguistic and semantic resources. Unfortunately, the modest success of computational linguists who have used knowledge bases and declarative representations of lexical-semantic relations to support NLP systems for relatively constrained domains has not been repeated in the context of large-scale information retrieval systems. The reasons are no mystery: no general, comprehensive linguistic knowledge bases exist; no one knows how to build them; even with adequate understanding, no one could afford to build them; and the number of `senses' of words and the number of relation types is virtually unlimited; each new corpus requires new knowledge.

The interests and needs of information scientists, computational linguists, and NL-applications-oriented computer scientists have been converging. But robust and reliable techniques to support larger-scale linguistic processing have been hard to find. In such a context, the work presented in this volume is remarkable. Dr. Grefenstette has not only developed novel techniques to solve practical problems of large-scale text processing; he has also demonstrated effective methods for discovering classes of related lexical items (or `concepts'), and he has proposed a variety of independent techniques for evaluating the results of semantic-relation discovery. While the results of Dr. Grefenstette's processing are not `semantic' networks or interpretations of texts, they are arguably sets of related concepts, going beyond a simple notion of synonymy to include pragmatically associated terms.
Such association sets reveal implicit sense restrictions and manifest the attributes of more complex semantic frames that would be required for a complete, declarative representation of the target concepts. The key function facilitated by Dr. Grefenstette's process is the discovery of such sets of related terms `on the fly', as they are expressed in the actual texts in a database or domain. There is no need for `external' knowledge (such as thesauri) to suggest the structure of information in a database; the process effectively finds empirically based and pragmatically appropriate term-relation sets for idiosyncratic corpora.

Dr. Grefenstette's approach is distinctive, in part, because it combines both linguistic and information-scientific perspectives. He begins with the observations (that others have made as well) that (1) ``you can begin to know the meaning of a word (or term) by the company it keeps'' and (2) ``words or terms that occur in `the same contexts' are `equivalent'''. He has followed the logical implication: if you find a sensible definition for ``context'' and you keep track of all the co-occurring elements in that context, you will find a `co-substitution set', arguably a semantic property revealed through syntactic analysis. For Dr. Grefenstette, a practical definition of ``context'' is given by a handful of syntactic features. In particular, he takes both syntactic structural relations (such as ``subject of the verb v'' or ``modifier of the noun n'') and general relations (such as ``in the clause c'' or ``in the sentence s'') to be potential contexts. He identifies all the elements (= words or terms) that occur or co-occur in a context and finds statistically prominent patterns of co-occurrence. Each element, then, can be identified with a vector (or list) of other elements, which are the context markers or attributes of the head element.
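The construction just described (collect the context attributes of each word, then compare words by their shared attributes) can be sketched in a few lines. Everything here is invented for illustration: the relation names, the word pairs, and the plain Jaccard overlap are assumptions standing in for the book's actual SEXTANT relations and weighted similarity measure.

```python
from collections import defaultdict

# Toy (word, syntactic-relation, other-word) contexts, of the kind a parser
# might extract. The triples below are made up for illustration; they are not
# actual SEXTANT output.
contexts = [
    ("barrel", "modifier", "oil"),
    ("barrel", "object-of", "fill"),
    ("drum", "modifier", "oil"),
    ("drum", "object-of", "fill"),
    ("drum", "object-of", "beat"),
    ("idea", "subject-of", "emerge"),
]

def attribute_vectors(triples):
    """Collect, for each head word, the set of (relation, word) attributes it
    keeps company with -- the `vector of context markers' described above."""
    vectors = defaultdict(set)
    for head, relation, other in triples:
        vectors[head].add((relation, other))
    return dict(vectors)

def jaccard(a, b):
    """Unweighted Jaccard overlap between two attribute sets; the similarity
    actually used in the book weights attributes, but the idea is the same."""
    return len(a & b) / len(a | b) if (a or b) else 0.0

vecs = attribute_vectors(contexts)
# `barrel' and `drum' share 2 of their 3 distinct attributes, so they score
# high and would fall into the same equivalence class; `idea' scores 0.
print(jaccard(vecs["barrel"], vecs["drum"]))
print(jaccard(vecs["barrel"], vecs["idea"]))
```

On a real corpus the triples would come from a robust parser and the vectors would be weighted and pruned, but the pipeline shape (parse, collect attributes, compare vectors) is the one the preface outlines.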
By calculating `similarity' scores for each vector, he can find lists of elements that are `close' to each other, as determined by their shared attributes. These sets are, operationally, the equivalence classes of the words or terms relative to the documents that were analyzed.

Dr. Grefenstette's work is noteworthy in several respects. First, he uses sophisticated (but practical and robust) NLP to derive context sets from free text. This sets his work apart from most other research in information science that has addressed similar concerns. While some scientists are beginning to discover the utility of combining NLP with information-management technology, Dr. Grefenstette has already begun using varieties of NLP tools in routine processing. It should be emphasized that Dr. Grefenstette's work has involved serious NLP: his parser is an excellent example of a `robust' NLP tool practically tuned to its task.

Second, Dr. Grefenstette asks (and answers) empirical questions over large quantities of data. This distinguishes his work both from theoretical linguists, who generally have neither the means nor the inclination to work with large volumes of performance data, and also from many computational linguists (and artificial-intelligence researchers), whose tools are less robust than Dr. Grefenstette's. In fact, in exploring the thesis presented in this volume, Dr. Grefenstette processed more than a dozen separate databases, representing more than thirty megabytes of text. While multi-megabyte collections are not unusual in information-science applications, they are still rare in AI and linguistic research targeted on semantic features of language.

Finally, Dr. Grefenstette focuses his efforts as much on the scientific evaluation of his techniques as on their implementation and raw performance.
This is an especially challenging and important concern because there are no well-established techniques for evaluating the results of processes that operate over large volumes of free text. If we are to make scientific progress, however, we must establish new techniques for measuring success or failure. Dr. Grefenstette has embraced this challenge. For example, he evaluates his results from the point of view of intuitive validity by appealing to psychological data. He considers the question of the stability of his techniques by testing whether the results found in increasingly large subcollections of a corpus converge. He assesses the results of (pseudo-)synonym discovery by comparing