Text, Speech and Language Technology

VOLUME 7

Series Editors

Nancy Ide, Vassar College, New York
Jean Véronis, Université de Provence and CNRS, France

Editorial Board

Harald Baayen, Max Planck Institute for Psycholinguistics, The Netherlands
Kenneth W. Church, AT&T Bell Labs, New Jersey, USA
Judith Klavans, Columbia University, New York, USA
David T. Barnard, University of Regina, Canada
Dan Tufis, Romanian Academy of Sciences, Romania
Joaquim Llisterri, Universitat Autònoma de Barcelona, Spain
Stig Johansson, University of Oslo, Norway
Joseph Mariani, LIMSI-CNRS, France

The titles published in this series are listed at the end of this volume.

Natural Language Information Retrieval

Edited by

Tomek Strzalkowski
General Electric, Research & Development

SPRINGER-SCIENCE+BUSINESS MEDIA, B.V.

A C.I.P. Catalogue record for this book is available from the Library of Congress.

ISBN 978-90-481-5209-4
ISBN 978-94-017-2388-6 (eBook)
DOI 10.1007/978-94-017-2388-6

Printed on acid-free paper

All Rights Reserved
© 1999 Springer Science+Business Media Dordrecht
Originally published by Kluwer Academic Publishers in 1999
No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system, without written permission from the copyright owner.

TABLE OF CONTENTS

PREFACE xiii

CONTRIBUTING AUTHORS xxiii

1 WHAT IS THE ROLE OF NLP IN TEXT RETRIEVAL?
Karen Sparck Jones 1
1 Introduction 1
2 Linguistically-motivated indexing 2
2.1 Basic Concepts 2
2.2 Complex Descriptions and Terms 6
3 Research and tests 9
3.1 Phase 1: Experiments from the 1960s to the 1980s 10
3.2 Phase 2: The Nineties 16
3.3 The TREC Programme 17
4 Other roles for NLP 21

2 NLP FOR TERM VARIANT EXTRACTION: SYNERGY BETWEEN MORPHOLOGY, LEXICON, AND SYNTAX
Christian Jacquemin and Evelyne Tzoukermann 25
1 Introduction: From Term Conflation to Linguistic Analysis of Term Variants 26
1.1 From Syntactic Variants to Morpho-syntactic Variants 27
2 Controlled Indexing and Term Variant Extraction 28
2.1 Conflation of Single Terms 29
2.2 Multi-word Term Conflation 31
2.3 Purpose of this Chapter 34
3 An Architecture for Controlled Indexing 35
3.1 Morphological Analysis 35
3.2 Multi-word Term Extraction and Conflation 38
4 Morphological Analysis 38
4.1 Inflectional Analysis 41
4.2 Morpho-syntactic Disambiguation 43


4.3 Derivational Analysis 47
4.4 System Implementation 50
4.5 Advantages of Overgenerating 50
5 FASTR: A Tool for Term Variant Extraction 52
5.1 A Grammar of Terms and a Metagrammar of Transformations 52
5.2 A Metagrammar for Syntactic Variants 54
5.3 A Metagrammar for Morpho-syntactic Variants 56
5.4 A Method for the Formulation of Metarules 60
5.5 Evaluation 66
6 Conclusion 70

3 COMBINING CORPUS LINGUISTICS AND HUMAN MEMORY MODELS FOR AUTOMATIC TERM ASSOCIATION
Gerda Ruge 75
1 Introduction 75
2 Various Approaches to Automatic Term Association 77
2.1 Term Association by Statistic Corpus Analysis 78
2.2 Term Association by Linguistically Based Corpus Analysis 79
3 Human Memory Models 81
3.1 A Well Known Memory Model 81
3.2 A Memory Model Explaining Human Recall Capability 82
4 Associationism and Term Association 85
5 Spreading Activation with Heads and Modifiers 88
5.1 Spreading Activation on the Basis of Heads and Modifiers 88
5.2 Indirect Activation of Semantically Similar Terms 89
5.3 Taking into account Synonymous Heads and Modifiers 90
6 Experiments 92
6.1 Test Data 92
6.2 Parameters of the Network 92
6.3 Similarity Measure 94
6.4 Results 94
7 Valuation of the Spreading Activation Approach 95

4 USING NLP OR NLP RESOURCES FOR INFORMATION RETRIEVAL TASKS
Alan F. Smeaton 99
1 Introduction 99
2 Early Experiments 100
3 Using Natural Language Processes or NLP Resources 103
4 Using WordNet for Information Retrieval 105
5 Status and Plans 109

5 EVALUATING NATURAL LANGUAGE PROCESSING TECHNIQUES IN INFORMATION RETRIEVAL
Tomek Strzalkowski, Fang Lin, Jin Wang and Jose Perez-Carballo 113
1 Introduction and Motivation 113
2 NLP-Based Indexing in Information Retrieval 116
3 NLP in Information Retrieval: A Perspective 118
4 Stream-based Information Retrieval Model 121
5 Advanced Linguistic Streams 123
5.1 Head+Modifier Pairs Stream 123
5.2 Simple Noun Phrase Stream 128
5.3 Name Stream 129
5.4 Other Streams 130
6 Stream Merging and Weighting 133
6.1 Inter-stream merging using score calculation 133
6.2 Inter-stream merging using precision distribution estimates 135
6.3 Stream coefficients 136
7 Query Expansion Experiments 136
7.1 Why Query Expansion? 136
7.2 Guidelines for manual query expansion 138
7.3 Automatic Query Expansion 139
8 Summary of Results 140
8.1 Ad-hoc runs 140
8.2 Routing Runs 141
9 Conclusions 142

6 STYLISTIC EXPERIMENTS IN INFORMATION RETRIEVAL
Jussi Karlgren 147
1 Stylistics 147
2 Materials and Processing 148
2.1 Experiments Performed 148
2.2 Corpus 148

2.3 Variables Examined 149
2.4 On Non-Parametric Multivariate Statistics 153
2.5 Correlation Between Variables 153
3 Visualizing Stylistic Variation 153
4 Stylistics and Relevance 158
4.1 Relevance Judgments 158
4.2 Relevance of Stylistics to Relevance 159
4.3 Hypotheses 160
4.4 Results and Discussion 160
5 Stylistics and Precision 161
6 Further Work 163

7 EXTRACTION-BASED TEXT CATEGORIZATION: GENERATING DOMAIN-SPECIFIC ROLE RELATIONSHIPS AUTOMATICALLY
Ellen Riloff and Jeffrey Lorenzen 167
1 Introduction 167
2 Extraction-based text categorization 169
2.1 Extraction patterns 169
2.2 Relevancy signatures 172
2.3 Augmented relevancy signatures 175
3 Automatically generating extraction patterns 177
4 Word-augmented relevancy signatures 179
4.1 Fully Automatic Text Categorization 181
5 Experimental results 182
5.1 The Terrorism Category 182
5.2 The Attack Category 185
5.3 The Bombing Category 188
5.4 The Kidnapping Category 191
5.5 Comparing automatic and hand-crafted dictionaries 193
6 Conclusions 194

8 LASIE JUMPS THE GATE
Yorick Wilks and Robert Gaizauskas 197
1 Introduction 197
2 Background 199
2.1 TIPSTER 200
3 GATE Design 201
3.1 GDM 202
3.2 CREOLE 202
3.3 GGI 204
4 LaSIE: An Application In GATE 205
4.1 Significant Features 207

4.2 LaSIE Modules 207
4.3 System Performance 209
5 Other IE systems and modules within GATE 210
6 The European IE scene 211
7 Limitations of IE systems 212
8 Conclusion 213

9 PHRASAL TERMS IN REAL-WORLD IR APPLICATIONS
Joe Zhou 215
1 Introduction 215
2 Phrasing/Proximity in IR: A Compatibility Study 217
2.1 Method 217
2.2 Empirical Data Input 219
2.3 Empirical Data Output 220
2.4 Evaluation 221
2.5 Claims 224
3 Automatic Suggestion of Key Terms 225
3.1 Introduction 225
3.2 Methodology 227
3.3 Results and Discussion 231
4 Information Retrieval Applications 247
4.1 Introduction 247
4.2 Document Surrogater: A Summarization Prototype 248
4.3 Document Sampler: A Categorization Prototype 253
5 Conclusion 257

10 NAME RECOGNITION AND RETRIEVAL PERFORMANCE
Paul Thompson and Christopher Dozier 261
1 Introduction 261
2 Definitions, Problems, and Issues 262
3 The Study 264
3.1 Name Recognition Accuracy 264
3.2 Evaluation of Name Recognition and Retrieval Performance 264
3.3 Name Recognition Case Law Collection 265
4 Results 266
4.1 Name Recognition Accuracy 266
4.2 Effect on Retrieval Performance 267
4.3 Name Frequencies in the Case Law Collection 267
5 Discussion 268

6 Conclusions 270
A The 38 Case Law Queries with Names Highlighted 272

11 COLLAGE: AN NLP TOOLSET TO SUPPORT BOOLEAN RETRIEVAL
Jim Cowie 273
1 Introduction 273
2 Objectives 274
3 Rube Goldberg (or Heath Robinson) Recipe 275
4 Language Processing Technology 277
5 Query Algebra 277
6 Topic Structuring 278
6.1 Name Recognition 278
6.2 Noun Phrase Recognition 279
7 Topic 279
8 Query Generation 281
9 Document Ranking 281
10 BRS/SEARCH 282
11 Lexical Resources 283
12 Wordnet 283
13 Transfer Lexicons 283
14 Standard Source Lookup 284
15 Bi-gram generation 285
16 Further Work 286

12 DOCUMENT CLASSIFICATION AND ROUTING
Louise Guthrie, Joe Guthrie and James Leistensnider 289
1 Background 289
1.1 Meaning of a Text 290
1.2 Flavor of a Text 290
2 Introduction 291
2.1 The Intuitive Model 291
2.2 Routing vs. Classification vs. Retrieval 292
2.3 The Relevance of a Topic 293
2.4 Some Approaches 294
2.5 Overview of this Paper 294
3 Application of the Multinomial Distribution to Classification 295
3.1 Flavors 295
3.2 Word Selection 296
3.3 A Simple Test 297
4 Application of the Multinomial Distribution to Routing 297
4.1 Word Selection 298
4.2 Zero Word Counts 299
4.3 Routing System Performance 299
4.4 Document Frequency Measure 300
4.5 Boolean Test 301
4.6 The TREC5 Evaluation 303
5 Application to Retrospective Retrieval 303
6 Conclusions 304
A Appendix 304
A.1 Use of the Multinomial Distribution 304
A.2 Routing Using the Multinomial Distribution 306

13 MURAX: FINDING AND ORGANIZING ANSWERS FROM TEXT SEARCH
Julian Kupiec 311
1 Introduction 311
1.1 Corpus-Based Analysis 312
1.2 Demand-Based Analysis 312
2 Exploiting the Query 313
3 Murax 314
3.1 An Example 314
3.2 Primary Document Matching 316
3.3 Answer Extraction 316
4 Question Characteristics 317
5 System Architecture 319
5.1 Primary Queries 319
5.2 Primary Query Construction 321
5.3 Index Organizations 323
5.4 Scoring Primary Matches 323
6 Extracting Answers 324
6.1 Equivalent Hypotheses 324
6.2 Scoring Equivalent Hypotheses 325
6.3 Verifying Implied Expectations 326
6.4 Combining Evidence for Answer Hypotheses 328
6.5 Secondary Queries 329
7 Discussion 331
8 Conclusions 331

14 THE USE OF CATEGORIES AND CLUSTERS FOR ORGANIZING RETRIEVAL RESULTS
Marti Hearst 333
1 Introduction 333
2 Preliminaries 336
2.1 Meta-Data 336
2.2 Definitions 337
2.3 The Collection Testbed 338
3 Using Categories to Organize Documents 340
3.1 Examples of MeSH Category Assignments 342
3.2 User Interfaces for Category Organization 343
3.3 Relationship of Categories to Ad Hoc and Standing Queries 350
4 Using Clusters to Organize Documents 352
4.1 Text Clustering Algorithms 352
4.2 Cluster Example 1 354
4.3 Cluster Example 2 356
4.4 Some Characteristics of Clustering Retrieval Results 359
4.5 Applying Clustering to Ad Hoc Queries 362
4.6 Graphical Displays of Text Clusters 362
5 Relationships between Categories and Clusters 363
5.1 Supervised vs. Unsupervised Algorithms 364
5.2 Comparing DynaCat to Clustering 367
5.3 Using Results of Clustering as a Category Hierarchy 367
6 Conclusions 369

INDEX 375

PREFACE

The last decade has been one of dramatic progress in the field of Natural Language Processing (NLP). This hitherto largely academic discipline has found itself at the center of an information revolution ushered in by the Internet age, as demand for human-computer communication and information access has exploded. Emerging applications in computer-assisted information production and dissemination, automated understanding of news, understanding of spoken language, and processing of foreign languages have given impetus to research that resulted in a new generation of robust tools, systems, and commercial products. Well-positioned government research funding, particularly in the U.S., has helped to advance the state-of-the-art at an unprecedented pace, in no small measure thanks to the rigorous evaluations.¹

This volume focuses on the use of Natural Language Processing in Information Retrieval (IR), an area of science and technology that deals with cataloging, categorization, classification, and search of large amounts of information, particularly in textual form. An outcome of an information retrieval process is usually a set of documents containing information on a given topic, and may consist of newspaper-like articles, memos, reports of any kind, entire books, as well as annotated image and sound files. Since we assume that the information is primarily encoded as text, IR is also a natural language processing problem: in order to decide if a document is relevant to a given information need, one needs to be able to understand its content. Modern NLP technology has been quite successful in providing reliable content analysis in specific situations, such as named entity extraction, topic tracking, co-reference resolution, and sense disambiguation. Unfortunately, more general content "understanding" remains elusive, particularly when large amounts of text are involved. When full processing becomes impractical, the second best thing is, perhaps, to make an educated guess, that is, to perform some amount of shallow processing in order to see if something can be derived from the content. A fair amount of current NLP research attempts to discover SYSTEMATIC methods to do just that.

In contrast, classical IR techniques, both ad-hoc and empirical, are aimed primarily at detecting relevance, with little regard for linguistic phenomena. An IR technique is judged effective if it can differentiate one piece of text from another, namely a relevant document from an irrelevant one. This has been done fairly successfully using quantitative methods based on word and/or character counts, especially when relatively coarse-grained distinctions in content were sufficient. However, IR techniques were never systematic in their effectiveness, and these limitations have led to recurring interest in NLP approaches as the latter became more efficient, more robust, and capable of handling larger amounts of data. Today, IR remains the primary technology for dealing with our ever growing information needs. Whether NLP can revolutionize IR or help to make it more effective is still largely an open question; this volume attempts to provide at least a part of the answer. The fourteen original contributions give a broad overview of the work being done at the junction of these two important fields, and suggest directions for future explorations. This book will be a valuable reference for researchers and practitioners in the fields of Natural Language Processing, Information Retrieval, and Computational Linguistics.

¹ Prime examples of now international evaluations are the Text Retrieval Conferences (TRECs) (Harman, 1998) and the Message Understanding Conferences (MUCs) (DARPA, 1995), both associated with the Tipster Text research programme (DARPA, 1996).

1. NLP and Information Retrieval: A Perspective

Natural language processing is a key technology for building the information systems of the future. Unlike today's relatively crude search engines that retrieve long lists of documents of often questionable relevance, the future systems will deliver the exact information that the user is seeking, and will do so with the highest precision and reliability. To accomplish this will require the systems to "understand" both the user's information need and the information they possess in their databases.

Modern IR technology for the most part is not linguistically motivated. Both conceptually and philosophically it is still closer to library-style catalogue search than to detailed information seeking. Indeed, much of current IR practice is built upon techniques that represent text as collections of terms: words or word surrogates along with statistically or otherwise determined weights that indicate their "importance" as content discriminators. These relatively simple quantitative representations, the computer-age equivalent of catalogue cards, have been studied extensively for the past 30 years, but just like their library cousins they are largely inadequate as content descriptors. Doing research in a library has always meant obtaining a short list of "hits" from a catalogue search, and then running through the floors and stacks only to find most of them non-relevant. Today's Web search does not involve running through the stacks; instead it returns hundreds of irrelevant hits that are just as painful to go through. In both cases the problem is that the search is only superficially related to content. In other words, the bag-of-terms representations of text are simply insufficient to support an effective and accurate content search. This is not to say that the quantitative search methods are not useful, especially when accuracy is not of critical importance.
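The bag-of-terms representation described above can be sketched in a few lines. A minimal illustration, assuming tf-idf as the "statistically determined weight" (one common choice, not the only one, and not a claim about any particular system discussed in this volume):

```python
import math
from collections import Counter

def tfidf_index(docs):
    """Build a bag-of-terms index: for each document, map each term
    to a tf-idf weight indicating its value as a content discriminator."""
    n = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc.lower().split()))
    index = []
    for doc in docs:
        tf = Counter(doc.lower().split())
        # weight = term frequency * log(inverse document frequency)
        index.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return index

docs = ["joint venture in Poland",
        "venture capital in Germany",
        "Poland is attacked by Germany"]
index = tfidf_index(docs)
# 'venture' occurs in two of the three documents, so it receives a lower
# weight than 'joint', which occurs in only one
```

Note how the representation records only which words occur and how distinctive they are; it cannot tell "Poland is attacked by Germany" from "Germany is attacked by Poland", which is exactly the limitation the surrounding discussion is pointing at.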
When accuracy and thoroughness matter, and as the concept of relevance becomes more complex, as it does in legal or medical domains or when national security is involved, better, more advanced technologies are required. Language, of course, is more than just a collection of words. It is used to communicate about entities, concepts, and relations which may be expressed in many different linguistic forms. For instance, word order may or may not matter, as seen in classical examples such as Venetian blind vs. blind Venetian, and Poland is attacked by Germany vs. Germany attacks Poland. Words are combined into phrases and other larger units, which are then tied together using a variety of correlations that include structural dependencies, co-references, semantic roles, discourse dependencies, intentions, and so forth. Given such considerations it has often been postulated that a better representation of text should include groups of words, namely phrases and other expressions, which denote meaningful entities, concepts, or relations within the search domain (Fagan, 1987; Metzler et al., 1989; Smeaton and Rijsbergen, 1988; Sparck Jones and Tait, 1984). For example, "joint venture" is a meaningful concept that has only a loose connection to "venture" and none at all to "joint". Indeed, the use of multi-word terms has become a common practice in IR; many systems participating in the annual Text Retrieval Conferences (TRECs) now use one or another form of phrase extraction (Harman, 1998). This approach led to measurable advances (Strzalkowski, 1995; Evans and Lefferts, 1994; Allan et al., 1998). Phrases, in particular, have been a perennial favorite since the early days of IR, partly because they were relatively easy to obtain, and partly because no specific definition of phrasehood was adopted.
Historically, what has been referred to as a "phrase" in IR varies considerably among researchers and practitioners, and may include anything from simple word collocations to statistically validated N-grams to part-of-speech sequences to syntactic structures, and on to semantic concepts (Smeaton and Rijsbergen, 1988; Sparck Jones and Tait, 1984; Metzler et al., 1989; Fagan, 1987; Mauldin, 1991; Strzalkowski, 1995). The lower end of this spectrum encompasses what has been referred to as "statistical phrases" because they could be obtained using quantitative, non-linguistic methods. It may be worth noting that these methods were aimed primarily at identifying multi-word semi-fixed expressions such as "white collar" or "electric car". The linguistic methods, on the other hand, targeted more opportunistic associations that may be strongly domain- and subject-dependent, and thus could vary from topic to topic. The debate as to which of these two approaches may be more useful in improving retrieval performance has been going on for years and even now appears to be only partially settled. Today we know only that "statistical phrases" have a positive but limited impact on performance, and that "syntactic phrases" when used in place of statistical terms are not significantly more effective (Fagan, 1987; Lewis and Croft, 1990; Strzalkowski et al., 1998; Mitra et al., 1997). We still know very little about how linguistic phrases should be used, and what happens when we manipulate the entities, concepts and relations that these phrases denote, and not just the words used to make them. Advanced methods of phrase extraction that have been evaluated in realistic IR tasks include syntactic analysis and structural disambiguation techniques. These methods attempt to capture an underlying semantic uniformity across various surface forms of expression, and therefore are a step closer to dealing with content.
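The lower, "statistical phrase" end of the spectrum can be illustrated with a toy sketch: collect adjacent word pairs that recur across a corpus. Real systems of the kind cited above used stoplists, frequency thresholds and significance tests; the threshold of 2 here is an arbitrary placeholder.

```python
from collections import Counter

def statistical_phrases(docs, min_count=2):
    """Collect adjacent word pairs (bigrams) that recur across a corpus,
    a crude proxy for semi-fixed expressions like 'electric car'."""
    pairs = Counter()
    for doc in docs:
        words = doc.lower().split()
        pairs.update(zip(words, words[1:]))
    return {p for p, c in pairs.items() if c >= min_count}

docs = ["the electric car market",
        "a new electric car subsidy",
        "the car was electric"]
phrases = statistical_phrases(docs)
# ('electric', 'car') recurs and is kept; ('car', 'was') occurs once and is dropped
```

Note that the third document contains both words but not the phrase, and contributes nothing: purely positional co-occurrence is all this method can see, which is why it favors semi-fixed expressions over the opportunistic, domain-dependent associations targeted by linguistic methods.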
At the same time, the primarily syntactic nature of the process makes it practically computable, though certainly a more complex task than simple word counting. There were good reasons to believe that this was a worthwhile exercise. Syntactic phrases appear to be reasonable indicators of content, arguably better than proximity-based statistical phrases, since they can adequately deal with word order changes and other structural variations (e.g., "college junior" vs. "junior in college" vs. "junior college"). A subsequent regularization process, where alternative structures could be reduced to a "normal form", helped to achieve the desired uniformity: for example, "college+junior" may represent a junior-level college, while "junior+college" may represent a junior in a college. A more radical normalization would also have "verb object", "head-noun relative-clause", and other syntactic and structural dependencies converted into indexable terms. We should note here that these largely non-linear dependencies were able to deliver terms that no purely quantitative methods could ever hope to get. This presented an exciting opportunity to break out of the bag-of-words paradigm, but there was also a significant risk involved. Syntactic techniques are a poor substitute for semantic analysis, and as it turned out they were unable to deliver sufficiently high quality content representation (Sparck Jones, 1998).

Moving beyond phrase extraction, attempts at using advanced NLP techniques in information retrieval have met with rather severe obstacles, not the least among which was a paralyzing lack of robustness and efficiency (Mauldin, 1991; Guthrie et al., 1996). For example, the methods which led to such remarkable progress in automated language understanding are nowhere near to becoming practical in dealing with large amounts of text in unrestricted domains. Therefore, a more graded approach has usually been favored.
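The head+modifier normalization discussed above can be sketched on the "college junior" example. The two hard-coded stopword rules below stand in for a real parser and structural disambiguator, which this illustration does not attempt; the `head+modifier` output convention follows the examples in the text.

```python
def head_modifier(phrase):
    """Reduce a simple two-content-word noun phrase to a 'head+modifier'
    term, so that word-order variants index to the same normal form."""
    tokens = phrase.lower().split()
    words = [w for w in tokens if w not in {"in", "of", "a", "the"}]
    if tokens != words:          # "junior in college": the head comes first
        head, mod = words[0], words[1]
    else:                        # "college junior": the head comes last
        head, mod = words[-1], words[0]
    return f"{head}+{mod}"

# "college junior" and "junior in college" reduce to the same term,
# while "junior college" stays distinct -- it denotes a different concept
print(head_modifier("college junior"))     # junior+college
print(head_modifier("junior in college"))  # junior+college
print(head_modifier("junior college"))     # college+junior
```

The point of the exercise is visible in the output: a purely proximity-based method would treat all three phrases as the same word pair, whereas the normalized terms separate the two underlying concepts while conflating their surface variants.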
NLP techniques have been used to assist an IR system in selecting appropriate indexing terms, both words and phrases, that could be deemed to stand for actual entities, concepts and relations, and therefore sharpen the word-based search. This provided the extra maneuverability for softening any inadequacies of existing NLP software without incapacitating the entire system: in the worst case the system would default to the original bag-of-words representation.

Why hasn't NLP had more success in IR? One explanation is that the predominantly syntactic approaches investigated to date are just not going far enough. The semantic content predictions made on the basis of syntactic structures or quantitative information are less reliable than we had hoped. In other words, if we are aiming at a semantically-motivated concept-based representation of text, other, more challenging avenues need to be explored. These include entity and event extraction as mentioned above, and also cross-document co-references, topic tracking and expansion, text summarization, various knowledge acquisition techniques, as well as a full exploitation of linguistic and lexical resources which are available today like never before. This research continues and important insights are gained every day.

Another possibility is that a more radical change of focus is needed. The future of IR is not to obtain a better ranking of retrieved documents, but to supply the user with the very information he or she is seeking. This will involve more than just pointing at "relevant" texts, or even making a putative "relevance ranking". We will need techniques to organize and present the information in a form that is at once understandable and immediately usable. Ranked lists of documents may be understandable (although they are often misinterpreted) but they are rarely usable. A summary report, a list of facts, a one-page brief, or a chart are instantly usable. However, obtaining these requires advanced NLP techniques. Moreover, it isn't clear whether the traditional IR quality metrics of recall and precision can adequately reflect a system's ability to derive such information. In other words, as we watch the recall-precision gauge for a breakthrough, we should not ignore the more global changes sweeping the information processing field that may transform the IR game forever.

Contents of this volume

This book is organized into two loosely structured parts. The first part, consisting of chapters 1 through 7, discusses research systems that represent major avenues where the impact of NLP technologies in IR is being explored. The second part (chapters 8 through 14) describes specific implementations and prototypes of information systems where NLP techniques are used or proposed to assist in accurate retrieval, text categorization, and in organizing the results for the user.

In the opening chapter, Karen Sparck Jones discusses the value of linguistically motivated indexing (LMI) for text retrieval. She reviews basic concepts, assumptions, and results of using NLP techniques for text indexing since the early days of information retrieval, and concludes that LMI makes little difference in traditional IR. This chapter provides an extended introduction to historical and current LMI approaches, and is recommended reading before the other chapters, particularly for those new to the field.

Christian Jacquemin and Evelyne Tzoukermann present a detailed account of an NLP approach to indexing which accounts for morphological, lexical and syntactic variation of terms. The approach combines a morphological form generator, a part-of-speech tagger and a shallow parser to achieve more effective indexing and appreciably higher recall in French language text retrieval.

Gerda Ruge's paper explores the problem of term association and its relevance to IR. It describes a spreading activation network representation of a human memory model which has been designed to produce additional search terms associated with those supplied by the user. The author argues that her approach can lead to a more effective selection of appropriate query terms than can be expected from similarity-based techniques.

Alan Smeaton presents a precis of a series of experiments in text retrieval using NLP performed over several years. These experiments focus on the use of NLP resources, such as lexicons and thesauri, to add power to traditional statistical indexing and retrieval methods. Specifically, the value of Princeton's WordNet ontology is discussed.

Tomek Strzalkowski, Fang Lin, Jin Wang and Jose Perez-Carballo report on the progress of the Natural Language Information Retrieval project and its evaluation in the series of Text Retrieval Conferences (TREC) conducted since 1992 under the guidance of the National Institute of Standards and Technology. The experiments were designed to demonstrate the value of various NLP techniques in large-scale text retrieval. They showed only limited success for LMI methods, while full-text query expansion techniques were found more promising.

Jussi Karlgren looks at the stylistic aspects of text for clues that may help to determine relevance. He proposes a variety of simple quantitative techniques to categorize text stylistically, then goes on to exploit this information for document retrieval purposes.

Ellen Riloff and Jeffrey Lorenzen investigate the use of information extraction techniques to achieve high-precision text categorization. They present the word-augmented relevancy signatures algorithm that uses lexical items to represent domain-specific role relationships. Extraction patterns are learned automatically using a supervised training method.

Yorick Wilks and Robert Gaizauskas describe the General Architecture for Text Engineering (GATE). GATE is a framework, similar to the Tipster Architecture, which allows for reconfiguration of various language engineering modules to assemble alternative information extraction systems. Set within GATE is the LaSIE information extraction system, which possesses a number of distinctive features that make it suitable for a variety of applications, including IR.

Joe Zhou reports the results of large-scale empirical studies performed at Lexis-Nexis to determine the most effective indexing methods for a commercial information retrieval engine. He describes an algorithm that automatically learns how to select meaningful phrasal terms by comparing a focused text sample against a diverse baseline collection. The resulting index is tested on text summarization and text categorization tasks.

Paul Thompson and Christopher Dozier investigate the problem of proper name recognition in text, the accuracy of automated name extraction methods, and the impact of name searching on the overall performance of a commercial text search system. They conclude that in applications that involve searching legal or news databases, where names occur frequently, retrieval performance can be improved.

Jim Cowie describes COLLAGE, a collection of processes and methods which carry out automatic analysis of topics in natural language text. The results of this analysis are used to determine which NLP resources should be applied to converting each part of a topic into a set of Boolean search queries, and how the search results should be ranked for the output.

Louise Guthrie, Joe Guthrie and James Leistensnider describe a probabilistic method of categorizing text documents based on their semantic content. The method is shown to be effective for a pre-defined set of categories with multinomial distribution. The categories are represented as Boolean expressions derived either manually or automatically from natural language descriptions. The basic method is subsequently expanded to handle routing tasks at TREC-5.

Julian Kupiec presents Murax, an advanced information retrieval system developed at Xerox. The system uses shallow linguistic analysis to obtain high-precision retrieval, then assembles the answer for the user by compiling information from the most relevant passages found in the retrieved documents. Murax's performance is demonstrated by searching for answers to Trivial Pursuit questions in Grolier's Encyclopedia.

Marti Hearst discusses the problem of organizing the retrieval results for an effective presentation to the user, using automated text categorization and text clustering methods. This chapter describes various user interfaces that use categories and clusters to organize the retrieval results, and examines the relationship between the two approaches.

Acknowledgements

The idea of writing a book on natural language processing and information retrieval was first suggested to me in early 1996 by Professor Nancy Ide, an editor of Kluwer's Text, Speech and Language Technology book series. The plans for production of the present volume were drafted shortly afterwards, with the first call for participation going out in April 1996. I would like to take this opportunity to thank all the authors who responded to that call for their outstanding contributions and for keeping up as closely as they could with the breakneck schedule imposed on them. Preliminary papers started arriving in late December, followed by a sometimes grueling cross-review and revision process that lasted until June 1997. Eventually, 14 contributions were selected for inclusion in the final volume. My thanks go to the reviewers who helped to assure the highest level of quality for this publication.

I would like to thank Polly Margules and Vanessa Nijweide, our editors at Kluwer, for their support and assistance throughout the publication process.

This book would not be possible without the continuing financial support for my research by the Defense Advanced Research Projects Agency, under Tipster Phase 3 Contract 97-F157200-000. I would also like to thank Norm Sondheimer for his encouragement early in this project, and the members of GE's Natural Language Group: Amit Bagga, Ron Brandow, Fang Lin, Gees Stein, Jin Wang, Bowden Wise, and Longdon White, for their support. Thanks to the ACM for permission to reprint parts of the material in Chapter 3.

Tomek Strzalkowski, Schenectady, November 1998

References

Allan, J., J. Callan, B. Croft, L. Ballesteros, D. Byrd, R. Swan, and J. Xu. (1998). "INQUERY Does Battle with TREC-6." Proceedings of the Sixth Text REtrieval Conference (TREC-6), NIST Special Publication 500-240, National Institute of Standards and Technology, Gaithersburg, MD. pp. 169-206.

DARPA. (1995). Proceedings of the 6th Message Understanding Conference, Morgan-Kaufmann Publishers, San Francisco, CA.

DARPA. (1996). Tipster Text Phase 2: 24 Month Conference, Morgan-Kaufmann Publishers, San Francisco, CA.

Evans, David, and Robert G. Lefferts. (1994). "Design and Evaluation of the CLARIT-TREC-2 System." Proceedings of the Second Text REtrieval Conference (TREC-2), NIST Special Publication 500-215, National Institute of Standards and Technology, Gaithersburg, MD. pp. 137-150.

Fagan, Joel L. (1987). Experiments in Automated Phrase Indexing for Document Retrieval: A Comparison of Syntactic and Non-Syntactic Methods. Ph.D. Thesis, Department of Computer Science, Cornell University.

Guthrie, L., T. Strzalkowski, F. Lin, and J. Wang. (1996). "Integration of Document Detection and Information Extraction." Proceedings of Tipster Phase II, Morgan-Kaufmann Publishers, San Francisco, CA. pp. 195-200.

Harman, Donna K. (ed.). (1998). The Sixth Text REtrieval Conference (TREC-6). NIST Special Publication 500-240, National Institute of Standards and Technology, Gaithersburg, MD.

Lewis, David D., and W. Bruce Croft. (1990). "Term Clustering of Syntactic Phrases." Proceedings of ACM SIGIR-90, pp. 385-405.

Mauldin, Michael. (1991). "Retrieval Performance in Ferret: A Conceptual Information Retrieval System." Proceedings of ACM SIGIR-91, pp. 347-355.

Metzler, Douglas P., Stephanie W. Haas, Cynthia L. Cosic, and Leslie H. Wheeler. (1989). "Constituent Object Parsing for Information Retrieval and Similar Text Processing Problems." Journal of the ASIS, 40(6), pp. 398-423.

Mitra, Mandar, Chris Buckley, Amit Singhal, and Claire Cardie. (1997). "An Analysis of Statistical and Syntactic Phrases." Proceedings of the RIAO-97 Conference, Centre de Hautes Etudes Internationales d'Informatique Documentaires, pp. 200-214.

Smeaton, Alan F., and C. J. van Rijsbergen. (1988). "Experiments on Incorporating Syntactic Processing of User Queries into a Document Retrieval Strategy." Proceedings of ACM SIGIR-88, pp. 31-51.

Sparck Jones, K., and J. I. Tait. (1984). "Automatic Search Term Variant Generation." Journal of Documentation, 40(1), pp. 50-66.

Sparck Jones, K. (1998). "What is the Role of NLP in Text Retrieval?" This volume.

Strzalkowski, Tomek. (1995). "Natural Language Information Retrieval." Information Processing and Management, 31(3), pp. 397-417, Pergamon/Elsevier.

Strzalkowski, Tomek, Fang Lin, and Jose Perez-Carballo. (1998). "Natural Language Information Retrieval: TREC-6 Report." Proceedings of the Sixth Text REtrieval Conference (TREC-6), NIST Special Publication 500-240. pp. 347-366.

CONTRIBUTING AUTHORS

Jim Cowie Computing Research Laboratory, New Mexico State University, Las Cruces, NM 88003, USA [email protected]

Christopher C. Dozier West Group, 610 Opperman Drive, Eagan, MN 55123, USA [email protected]

Robert Gaizauskas Department of Computer Science, University of Sheffield, Sheffield, ENGLAND [email protected]

Louise Guthrie Department of Mathematical Sciences, University of Texas at El Paso, El Paso, TX, USA [email protected]

Joe Guthrie Department of Mathematical Sciences, University of Texas at El Paso, El Paso, TX, USA [email protected]

Marti A. Hearst University of California Berkeley, School of Information Management & Systems, 102 South Hall, Berkeley, CA 94720, USA [email protected]

Christian Jacquemin Institut de Recherche en Informatique de Nantes, 2, chemin de la Houssiniere, BP 92208, 44322 NANTES Cedex 3, FRANCE [email protected]

Jussi Karlgren Swedish Institute for Computer Science, Box 1263, Kista, SWEDEN jussi@sics.se

Julian M. Kupiec Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, CA 94304, USA [email protected]

James Leistensnider Management and Data Systems, Lockheed Martin Corporation, King of Prussia, PA, USA [email protected]

Fang Lin GE Corporate Research & Development, Schenectady, NY 12301, USA [email protected]

Jeffrey Lorenzen Department of Computer Science, University of Utah, Salt Lake City, UT 84112, USA [email protected]

Jose Perez-Carballo School of Communication, Information and Library Studies, Rutgers University, New Brunswick, NJ 04612, USA [email protected]

Ellen Riloff Department of Computer Science, University of Utah, Salt Lake City, UT 84112, USA [email protected]

Gerda Ruge AI Group, Institut fuer Informatik, TU Muenchen, Arcisstr. 21, D-80290 Muenchen, GERMANY ruge@informatik.tu-muenchen.de

Alan F. Smeaton Dublin City University, Glasnevin, Dublin 9, IRELAND [email protected]

Karen Sparck Jones Computer Laboratory, University of Cambridge, New Museum Site, Pembroke Street, Cambridge CB2 3QG, ENGLAND [email protected]

Tomek Strzalkowski GE Corporate Research & Development, Schenectady, NY 12301, USA [email protected]

Paul Thompson West Group, 610 Opperman Drive, Eagan, MN 55123, USA [email protected]

Evelyne Tzoukermann Bell Laboratories, Lucent Technologies, 700 Mountain Avenue, P.O. Box 636, Murray Hill, NJ 07974, USA [email protected]

Jin Wang GE Corporate Research & Development, Schenectady, NY 12301, USA [email protected]

Yorick Wilks Department of Computer Science, University of Sheffield, Sheffield, ENGLAND [email protected]

Joe Zhou LEXIS-NEXIS, a Division of Reed Elsevier, Inc., 9555 Springboro Pike, Miamisburg, OH 45342, USA [email protected]