Text, Speech and Language Technology

VOLUME 7

Series Editors

Nancy Ide, Vassar College, New York
Jean Véronis, Université de Provence and CNRS, France

Editorial Board

Harald Baayen, Max Planck Institute for Psycholinguistics, The Netherlands
Kenneth W. Church, AT&T Bell Labs, New Jersey, USA
Judith Klavans, Columbia University, New York, USA
David T. Barnard, University of Regina, Canada
Dan Tufis, Romanian Academy of Sciences, Romania
Joaquim Llisterri, Universitat Autònoma de Barcelona, Spain
Stig Johansson, University of Oslo, Norway
Joseph Mariani, LIMSI-CNRS, France

The titles published in this series are listed at the end of this volume.

Natural Language Information Retrieval

Edited by

Tomek Strzalkowski
General Electric, Research & Development

SPRINGER-SCIENCE+BUSINESS MEDIA, B.V.

A C.I.P. Catalogue record for this book is available from the Library of Congress.

ISBN 978-90-481-5209-4
ISBN 978-94-017-2388-6 (eBook)
DOI 10.1007/978-94-017-2388-6

Printed on acid-free paper

All Rights Reserved
© 1999 Springer Science+Business Media Dordrecht
Originally published by Kluwer Academic Publishers in 1999
No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system, without written permission from the copyright owner.

TABLE OF CONTENTS

PREFACE xiii

CONTRIBUTING AUTHORS xxiii

1 WHAT IS THE ROLE OF NLP IN TEXT RETRIEVAL?
Karen Sparck Jones 1
1 Introduction 1
2 Linguistically-motivated indexing 2
2.1 Basic Concepts 2
2.2 Complex Descriptions and Terms 6
3 Research and tests 9
3.1 Phase 1: Experiments from the 1960s to the 1980s 10
3.2 Phase 2: The Nineties 16
3.3 The TREC Programme 17
4 Other roles for NLP 21

2 NLP FOR TERM VARIANT EXTRACTION: SYNERGY BETWEEN MORPHOLOGY, LEXICON, AND SYNTAX
Christian Jacquemin and Evelyne Tzoukermann 25
1 Introduction: From Term Conflation to Linguistic Analysis of Term Variants 26
1.1 From Syntactic Variants to Morpho-syntactic Variants 27
2 Controlled Indexing and Term Variant Extraction 28
2.1 Conflation of Single Terms 29
2.2 Multi-word Term Conflation 31
2.3 Purpose of this Chapter 34
3 An Architecture for Controlled Indexing 35
3.1 Morphological Analysis 35
3.2 Multi-word Term Extraction and Conflation 38
4 Morphological Analysis 38
4.1 Inflectional Analysis 41
4.2 Morpho-syntactic Disambiguation 43


4.3 Derivational Analysis 47
4.4 System Implementation 50
4.5 Advantages of Overgenerating 50
5 FASTR: A Tool for Term Variant Extraction 52
5.1 A Grammar of Terms and a Metagrammar of Transformations 52
5.2 A Metagrammar for Syntactic Variants 54
5.3 A Metagrammar for Morpho-syntactic Variants 56
5.4 A Method for the Formulation of Metarules 60
5.5 Evaluation 66
6 Conclusion 70

3 COMBINING CORPUS LINGUISTICS AND HUMAN MEMORY MODELS FOR AUTOMATIC TERM ASSOCIATION
Gerda Ruge 75
1 Introduction 75
2 Various Approaches to Automatic Term Association 77
2.1 Term Association by Statistic Corpus Analysis 78
2.2 Term Association by Linguistically Based Corpus Analysis 79
3 Human Memory Models 81
3.1 A Well Known Memory Model 81
3.2 A Memory Model Explaining Human Recall Capability 82
4 Associationism and Term Association 85
5 Spreading Activation with Heads and Modifiers 88
5.1 Spreading Activation on the Basis of Heads and Modifiers 88
5.2 Indirect Activation of Semantically Similar Terms 89
5.3 Taking into account Synonymous Heads and Modifiers 90
6 Experiments 92
6.1 Test Data 92
6.2 Parameters of the Network 92
6.3 Similarity Measure 94
6.4 Results 94
7 Valuation of the Spreading Activation Approach 95

4 USING NLP OR NLP RESOURCES FOR INFORMATION RETRIEVAL TASKS
Alan F. Smeaton 99
1 Introduction 99
2 Early Experiments 100
3 Using Natural Language Processes or NLP Resources 103
4 Using WordNet for Information Retrieval 105
5 Status and Plans 109

5 EVALUATING NATURAL LANGUAGE PROCESSING TECHNIQUES IN INFORMATION RETRIEVAL
Tomek Strzalkowski, Fang Lin, Jin Wang and Jose Perez-Carballo 113
1 Introduction and Motivation 113
2 NLP-Based Indexing in Information Retrieval 116
3 NLP in Information Retrieval: A Perspective 118
4 Stream-based Information Retrieval Model 121
5 Advanced Linguistic Streams 123
5.1 Head+Modifier Pairs Stream 123
5.2 Simple Noun Phrase Stream 128
5.3 Name Stream 129
5.4 Other Streams 130
6 Stream Merging and Weighting 133
6.1 Inter-stream merging using score calculation 133
6.2 Inter-stream merging using precision distribution estimates 135
6.3 Stream coefficients 136
7 Query Expansion Experiments 136
7.1 Why Query Expansion? 136
7.2 Guidelines for manual query expansion 138
7.3 Automatic Query Expansion 139
8 Summary of Results 140
8.1 Ad-hoc runs 140
8.2 Routing Runs 141
9 Conclusions 142

6 STYLISTIC EXPERIMENTS IN INFORMATION RETRIEVAL
Jussi Karlgren 147
1 Stylistics 147
2 Materials and Processing 148
2.1 Experiments Performed 148
2.2 Corpus 148

2.3 Variables Examined 149
2.4 On Non-Parametric Multivariate Statistics 153
2.5 Correlation Between Variables 153
3 Visualizing Stylistic Variation 153
4 Stylistics and Relevance 158
4.1 Relevance Judgments 158
4.2 Relevance of Stylistics to Relevance 159
4.3 Hypotheses 160
4.4 Results and Discussion 160
5 Stylistics and Precision 161
6 Further Work 163

7 EXTRACTION-BASED TEXT CATEGORIZATION: GENERATING DOMAIN-SPECIFIC ROLE RELATIONSHIPS AUTOMATICALLY
Ellen Riloff and Jeffrey Lorenzen 167
1 Introduction 167
2 Extraction-based text categorization 169
2.1 Extraction patterns 169
2.2 Relevancy signatures 172
2.3 Augmented relevancy signatures 175
3 Automatically generating extraction patterns 177
4 Word-augmented relevancy signatures 179
4.1 Fully Automatic Text Categorization 181
5 Experimental results 182
5.1 The Terrorism Category 182
5.2 The Attack Category 185
5.3 The Bombing Category 188
5.4 The Kidnapping Category 191
5.5 Comparing automatic and hand-crafted dictionaries 193
6 Conclusions 194

8 LASIE JUMPS THE GATE
Yorick Wilks and Robert Gaizauskas 197
1 Introduction 197
2 Background 199
2.1 TIPSTER 200
3 GATE Design 201
3.1 GDM 202
3.2 CREOLE 202
3.3 GGI 204
4 LaSIE: An Application In GATE 205
4.1 Significant Features 207

4.2 LaSIE Modules 207
4.3 System Performance 209
5 Other IE systems and modules within GATE 210
6 The European IE scene 211
7 Limitations of IE systems 212
8 Conclusion 213

9 PHRASAL TERMS IN REAL-WORLD IR APPLICATIONS
Joe Zhou 215
1 Introduction 215
2 Phrasing/Proximity in IR: A Compatibility Study 217
2.1 Method 217
2.2 Empirical Data Input 219
2.3 Empirical Data Output 220
2.4 Evaluation 221
2.5 Claims 224
3 Automatic Suggestion of Key Terms 225
3.1 Introduction 225
3.2 Methodology 227
3.3 Results and Discussion 231
4 Information Retrieval Applications 247
4.1 Introduction 247
4.2 Document Surrogater: A Summarization Prototype 248
4.3 Document Sampler: A Categorization Prototype 253
5 Conclusion 257

10 NAME RECOGNITION AND RETRIEVAL PERFORMANCE
Paul Thompson and Christopher Dozier 261
1 Introduction 261
2 Definitions, Problems, and Issues 262
3 The Study 264
3.1 Name Recognition Accuracy 264
3.2 Evaluation of Name Recognition and Retrieval Performance 264
3.3 Name Recognition Case Law Collection 265
4 Results 266
4.1 Name Recognition Accuracy 266
4.2 Effect on Retrieval Performance 267
4.3 Name Frequencies in the Case Law Collection 267
5 Discussion 268

6 Conclusions 270
A The 38 Case Law Queries with Names Highlighted 272

11 COLLAGE: AN NLP TOOLSET TO SUPPORT BOOLEAN RETRIEVAL
Jim Cowie 273
1 Introduction 273
2 Objectives 274
3 Rube Goldberg (or Heath Robinson) Recipe 275
4 Language Processing Technology 277
5 Query Algebra 277
6 Topic Structuring 278
6.1 Name Recognition 278
6.2 Noun Phrase Recognition 279
7 Topic 279
8 Query Generation 281
9 Document Ranking 281
10 BRS/SEARCH 282
11 Lexical Resources 283
12 Wordnet 283
13 Transfer Lexicons 283
14 Standard Source Lookup 284
15 Bi-gram generation 285
16 Further Work 286

12 DOCUMENT CLASSIFICATION AND ROUTING
Louise Guthrie, Joe Guthrie and James Leistensnider 289
1 Background 289
1.1 Meaning of a Text 290
1.2 Flavor of a Text 290
2 Introduction 291
2.1 The Intuitive Model 291
2.2 Routing vs. Classification vs. Retrieval 292
2.3 The Relevance of a Topic 293
2.4 Some Approaches 294
2.5 Overview of this Paper 294
3 Application of the Multinomial Distribution to Classification 295
3.1 Flavors 295
3.2 Word Selection 296
3.3 A Simple Test 297
4 Application of the Multinomial Distribution to Routing 297
4.1 Word Selection 298
4.2 Zero Word Counts 299
4.3 Routing System Performance 299
4.4 Document Frequency Measure 300
4.5 Boolean Test 301
4.6 The TREC5 Evaluation 303
5 Application to Retrospective Retrieval 303
6 Conclusions 304
A Appendix 304
A.1 Use of the Multinomial Distribution 304
A.2 Routing Using the Multinomial Distribution 306

13 MURAX: FINDING AND ORGANIZING ANSWERS FROM TEXT SEARCH
Julian Kupiec 311
1 Introduction 311
1.1 Corpus-Based Analysis 312
1.2 Demand-Based Analysis 312
2 Exploiting the Query 313
3 Murax 314
3.1 An Example 314
3.2 Primary Document Matching 316
3.3 Answer Extraction 316
4 Question Characteristics 317
5 System Architecture 319
5.1 Primary Queries 319
5.2 Primary Query Construction 321
5.3 Index Organizations 323
5.4 Scoring Primary Matches 323
6 Extracting Answers 324
6.1 Equivalent Hypotheses 324
6.2 Scoring Equivalent Hypotheses 325
6.3 Verifying Implied Expectations 326
6.4 Combining Evidence for Answer Hypotheses 328
6.5 Secondary Queries 329
7 Discussion 331
8 Conclusions 331

14 THE USE OF CATEGORIES AND CLUSTERS FOR ORGANIZING RETRIEVAL RESULTS
Marti Hearst 333
1 Introduction 333
2 Preliminaries 336
2.1 Meta-Data 336
2.2 Definitions 337
2.3 The Collection Testbed 338
3 Using Categories to Organize Documents 340
3.1 Examples of MeSH Category Assignments 342
3.2 User Interfaces for Category Organization 343
3.3 Relationship of Categories to Ad Hoc and Standing Queries 350
4 Using Clusters to Organize Documents 352
4.1 Text Clustering Algorithms 352
4.2 Cluster Example 1 354
4.3 Cluster Example 2 356
4.4 Some Characteristics of Clustering Retrieval Results 359
4.5 Applying Clustering to Ad Hoc Queries 362
4.6 Graphical Displays of Text Clusters 362
5 Relationships between Categories and Clusters 363
5.1 Supervised vs. Unsupervised Algorithms 364
5.2 Comparing DynaCat to Clustering 367
5.3 Using Results of Clustering as a Category Hierarchy 367
6 Conclusions 369

INDEX 375

PREFACE

The last decade has been one of dramatic progress in the field of Natural Language Processing (NLP). This hitherto largely academic discipline has found itself at the center of an information revolution ushered in by the Internet age, as demand for human-computer communication and information access has exploded. Emerging applications in computer-assisted information production and dissemination, automated understanding of news, understanding of spoken language, and processing of foreign languages have given impetus to research that resulted in a new generation of robust tools, systems, and commercial products. Well-positioned government research funding, particularly in the U.S., has helped to advance the state-of-the-art at an unprecedented pace, in no small measure thanks to the rigorous evaluations.¹

This volume focuses on the use of Natural Language Processing in Information Retrieval (IR), an area of science and technology that deals with cataloging, categorization, classification, and search of large amounts of information, particularly in textual form. An outcome of an information retrieval process is usually a set of documents containing information on a given topic, and may consist of newspaper-like articles, memos, reports of any kind, entire books, as well as annotated image and sound files. Since we assume that the information is primarily encoded as text, IR is also a natural language processing problem: in order to decide if a document is relevant to a given information need, one needs to be able to understand its content. Modern NLP technology has been quite successful in providing reliable content analysis in specific situations, such as named entity extraction, topic tracking, co-reference resolution, and sense disambiguation. Unfortunately, more general content "understanding" remains elusive, particularly when large amounts of text are involved. When full processing becomes impractical, the second best thing is, perhaps, to make an educated guess, that is, to perform some amount of shallow processing in order to see if something can be derived from the content. A fair amount of current NLP research attempts to discover SYSTEMATIC methods to do just that.

In contrast, classical IR techniques, both ad-hoc and empirical, are aimed primarily at detecting relevance, with little regard for linguistic phenomena. An IR technique is judged effective if it can differentiate one piece of text from another, namely a relevant document from an irrelevant one. This has been done fairly successfully using quantitative methods based on word and/or character counts, especially when relatively coarse-grained distinctions in content were sufficient. However, IR techniques were never systematic in their effectiveness, and these limitations have led to recurring interest in NLP approaches as the latter became more efficient, more robust, and capable of handling larger amounts of data. Today, IR remains the primary technology for dealing with our ever growing information needs. Whether NLP can revolutionize IR or help to make it more effective is still largely an open question; this volume attempts to provide at least a part of the answer. The fourteen original contributions give a broad overview of the work being done at the junction of these two important fields, and suggest directions for future explorations. This book will be a valuable reference for researchers and practitioners in the fields of Natural Language Processing, Information Retrieval, and Computational Linguistics.

¹ Prime examples of now international evaluations are the Text Retrieval Conferences (TRECs) (Harman, 1998) and the Message Understanding Conferences (MUCs) (DARPA, 1995), both associated with the Tipster Text research programme (DARPA, 1996).

1. NLP and Information Retrieval: A Perspective

Natural language processing is a key technology for building the information systems of the future. Unlike today's relatively crude search engines that retrieve long lists of documents of often questionable relevance, the future systems will deliver the exact information that the user is seeking, and will do so with the highest precision and reliability. To accomplish this will require the systems to "understand" both the user's information need and the information they possess in their databases.

Modern IR technology for the most part is not linguistically motivated. Both conceptually and philosophically it is still closer to library-style catalogue search than to detailed information seeking. Indeed, much of current IR practice is built upon techniques that represent text as collections of terms: words or word surrogates along with statistically or otherwise determined weights that indicate their "importance" as content discriminators. These relatively simple quantitative representations, the computer-age equivalent of catalogue cards, have been studied extensively for the past 30 years, but just like their library cousins they are largely inadequate as content descriptors. Doing research in a library has always meant obtaining a short list of "hits" from a catalogue search, and then running through the floors and stacks only to find most of them non-relevant. Today's Web search does not involve running through the stacks; instead it returns hundreds of irrelevant hits that are just as painful to go through. In both cases the problem is that the search is only superficially related to content. In other words, the bag-of-terms representations of text are simply insufficient to support an effective and accurate content search. This is not to say that the quantitative search methods are not useful, especially when accuracy is not of critical importance.
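The bag-of-terms representation described above can be sketched in a few lines. A minimal illustration, assuming tf-idf as the "statistically determined weight" (one common choice, not the only one, and not a claim about any particular system discussed in this volume):

```python
import math
from collections import Counter

def tfidf_index(docs):
    """Build a bag-of-terms index: for each document, map each term
    to a tf-idf weight indicating its value as a content discriminator."""
    n = len(docs)
    # document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc.lower().split()))
    index = []
    for doc in docs:
        tf = Counter(doc.lower().split())
        # weight = term frequency * log(inverse document frequency)
        index.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return index

docs = ["joint venture in Poland",
        "venture capital in Germany",
        "Poland is attacked by Germany"]
index = tfidf_index(docs)
# 'venture' occurs in two of the three documents, so it receives a lower
# weight than 'joint', which occurs in only one
```

Note how the representation records only which words occur and how distinctive they are; it cannot tell "Poland is attacked by Germany" from "Germany is attacked by Poland", which is exactly the limitation the surrounding discussion is pointing at.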
When accuracy and thoroughness matter, and as the concept of relevance becomes more complex, as it does in legal or medical domains or when national security is involved, better, more advanced technologies are required. Language, of course, is more than just a collection of words. It is used to communicate about entities, concepts, and relations which may be expressed in many different linguistic forms. For instance, word order may or may not matter, as seen in classical examples such as Venetian blind vs. blind Venetian, and Poland is attacked by Germany vs. Germany attacks Poland. Words are combined into phrases and other larger units, which are then tied together using a variety of correlations that include structural dependencies, co-references, semantic roles, discourse dependencies, intentions, and so forth. Given such considerations it has often been postulated that a better representation of text should include groups of words, namely phrases and other expressions, which denote meaningful entities, concepts, or relations within the search domain (Fagan, 1987; Metzler et al., 1989; Smeaton and Rijsbergen, 1988; Sparck Jones and Tait, 1984). For example, "joint venture" is a meaningful concept that has only a loose connection to "venture" and none at all to "joint". Indeed, the use of multi-word terms has become a common practice in IR; many systems participating in the annual Text Retrieval Conferences (TRECs) now use one or another form of phrase extraction (Harman, 1998). This approach led to measurable advances (Strzalkowski, 1995; Evans and Lefferts, 1994; Allan et al., 1998). Phrases, in particular, have been a perennial favorite since the early days of IR, partly because they were relatively easy to obtain, and partly because no specific definition of phrasehood was adopted.
Historically, what has been referred to as a "phrase" in IR varies considerably among researchers and practitioners, and may include anything from simple word collocations to statistically validated N-grams to part-of-speech sequences to syntactic structures, and on to semantic concepts (Smeaton and Rijsbergen, 1988; Sparck Jones and Tait, 1984; Metzler et al., 1989; Fagan, 1987; Mauldin, 1991; Strzalkowski, 1995). The lower end of this spectrum encompasses what has been referred to as "statistical phrases" because they could be obtained using quantitative, non-linguistic methods. It may be worth noting that these methods were aimed primarily at identifying multi-word semi-fixed expressions such as "white collar" or "electric car". The linguistic methods, on the other hand, targeted more opportunistic associations that may be strongly domain- and subject-dependent, and thus could vary from topic to topic. The debate as to which of these two approaches may be more useful in improving retrieval performance has been going on for years and even now appears to be only partially settled. Today we know only that "statistical phrases" have a positive but limited impact on performance, and that "syntactic phrases" when used in place of statistical terms are not significantly more effective (Fagan, 1987; Lewis and Croft, 1990; Strzalkowski et al., 1998; Mitra et al., 1997). We still know very little about how linguistic phrases should be used, and what happens when we manipulate the entities, concepts and relations that these phrases denote, and not just the words used to make them. Advanced methods of phrase extraction that have been evaluated in realistic IR tasks include syntactic analysis and structural disambiguation techniques. These methods attempt to capture an underlying semantic uniformity across various surface forms of expression, and therefore are a step closer to dealing with content.
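The lower, "statistical phrase" end of the spectrum can be illustrated with a toy sketch: collect adjacent word pairs that recur across a corpus. Real systems of the kind cited above used stoplists, frequency thresholds and significance tests; the threshold of 2 here is an arbitrary placeholder.

```python
from collections import Counter

def statistical_phrases(docs, min_count=2):
    """Collect adjacent word pairs (bigrams) that recur across a corpus,
    a crude proxy for semi-fixed expressions like 'electric car'."""
    pairs = Counter()
    for doc in docs:
        words = doc.lower().split()
        pairs.update(zip(words, words[1:]))
    return {p for p, c in pairs.items() if c >= min_count}

docs = ["the electric car market",
        "a new electric car subsidy",
        "the car was electric"]
phrases = statistical_phrases(docs)
# ('electric', 'car') recurs and is kept; ('car', 'was') occurs once and is dropped
```

Note that the third document contains both words but not the phrase, and contributes nothing: purely positional co-occurrence is all this method can see, which is why it favors semi-fixed expressions over the opportunistic, domain-dependent associations targeted by linguistic methods.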
At the same time, the primarily syntactic nature of the process makes it practically computable, though certainly a more complex task than simple word counting. There were good reasons to believe that this was a worthwhile exercise. Syntactic phrases appear to be reasonable indicators of content, arguably better than proximity-based statistical phrases, since they can adequately deal with word order changes and other structural variations (e.g., "college junior" vs. "junior in college" vs. "junior college"). A subsequent regularization process, where alternative structures could be reduced to a "normal form", helped to achieve the desired uniformity: for example, "college+junior" may represent a junior-level college, while "junior+college" may represent a junior in a college. A more radical normalization would also have "verb object", "head-noun relative-clause", and other syntactic and structural dependencies converted into indexable terms. We should note here that these largely non-linear dependencies were able to deliver terms that no purely quantitative methods could ever hope to get. This presented an exciting opportunity to break out of the bag-of-words paradigm, but there was also a significant risk involved. Syntactic techniques are a poor substitute for semantic analysis, and as it turned out they were unable to deliver sufficiently high quality content representation (Sparck Jones, 1998).

Moving beyond phrase extraction, attempts at using advanced NLP techniques in information retrieval have met with rather severe obstacles, not the least among which was a paralyzing lack of robustness and efficiency (Mauldin, 1991; Guthrie et al., 1996). For example, the methods which led to such remarkable progress in automated language understanding are nowhere near to becoming practical in dealing with large amounts of text in unrestricted domains. Therefore, a more graded approach has usually been favored.
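The head+modifier normalization discussed above can be sketched on the "college junior" example. The two hard-coded stopword rules below stand in for a real parser and structural disambiguator, which this illustration does not attempt; the `head+modifier` output convention follows the examples in the text.

```python
def head_modifier(phrase):
    """Reduce a simple two-content-word noun phrase to a 'head+modifier'
    term, so that word-order variants index to the same normal form."""
    tokens = phrase.lower().split()
    words = [w for w in tokens if w not in {"in", "of", "a", "the"}]
    if tokens != words:          # "junior in college": the head comes first
        head, mod = words[0], words[1]
    else:                        # "college junior": the head comes last
        head, mod = words[-1], words[0]
    return f"{head}+{mod}"

# "college junior" and "junior in college" reduce to the same term,
# while "junior college" stays distinct -- it denotes a different concept
print(head_modifier("college junior"))     # junior+college
print(head_modifier("junior in college"))  # junior+college
print(head_modifier("junior college"))     # college+junior
```

The point of the exercise is visible in the output: a purely proximity-based method would treat all three phrases as the same word pair, whereas the normalized terms separate the two underlying concepts while conflating their surface variants.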
NLP techniques have been used to assist an IR system in selecting appropriate indexing terms, both words and phrases, that could be deemed to stand for actual entities, concepts and relations, and therefore sharpen the word-based search. This provided the extra maneuverability for softening any inadequacies of existing NLP software without incapacitating the entire system: in the worst case the system would default to the original bag-of-words representation.

Why hasn't NLP had more success in IR? One explanation is that the predominantly syntactic approaches investigated to date are just not going far enough. The semantic content predictions made on the basis of syntactic structures or quantitative information are less reliable than we had hoped. In other words, if we are aiming at a semantically-motivated concept-based representation of text, other, more challenging avenues need to be explored. These include entity and event extraction as mentioned above, and also cross-document co-references, topic tracking and expansion, text summarization, various knowledge acquisition techniques, as well as a full exploitation of linguistic and lexical resources which are available today like never before. This research continues and important insights are gained every day.

Another possibility is that a more radical change of focus is needed. The future of IR is not to obtain a better ranking of retrieved documents, but to supply the user with the very information he or she is seeking. This will involve more than just pointing at "relevant" texts, or even making a putative "relevance ranking". We will need techniques to organize and present the information in a form that is at once understandable and immediately usable. Ranked lists of documents may be understandable (although they are often misinterpreted) but they are rarely usable. A summary report, a list of facts, a one-page brief, or a chart are instantly usable. However, obtaining these requires advanced NLP techniques. Moreover, it isn't clear whether the traditional IR quality metrics of recall and precision can adequately reflect a system's ability to derive such information. In other words, as we watch the recall-precision gauge for a breakthrough, we should not ignore the more global changes sweeping the information processing field that may transform the IR game forever.

Contents of this volume

This book is organized into two loosely structured parts. The first part, consisting of chapters 1 through 7, discusses research systems that represent major avenues where the impact of NLP technologies in IR is being explored. The second part (chapters 8 through 14) describes specific implementations and prototypes of information systems where NLP techniques are used or proposed to assist in accurate retrieval, text categorization, and in organizing the results for the user.

In the opening chapter, Karen Sparck Jones discusses the value of linguistically motivated indexing (LMI) for text retrieval. She reviews basic concepts, assumptions, and results of using NLP techniques for text indexing since the early days of information retrieval, and concludes that LMI makes little difference in traditional IR. This chapter provides an extended introduction to historical and current LMI approaches, and is recommended reading before the other chapters, particularly for those new to the field.

Christian Jacquemin and Evelyne Tzoukermann present a detailed account of an NLP approach to indexing which accounts for morphological, lexical and syntactic variation of terms. The approach combines a morphological form generator, a part-of-speech tagger and a shallow parser to achieve more effective indexing and appreciably higher recall in French language text retrieval.

Gerda Ruge's paper explores the problem of term association and its relevance to IR. It describes a spreading activation network representation of a human memory model which has been designed to produce additional search terms associated with those supplied by the user. The author argues that her approach can lead to a more effective selection of appropriate query terms than can be expected from similarity-based techniques.

Alan Smeaton presents a precis of a series of experiments in text retrieval using NLP performed over several years. These experiments focus on the use of NLP resources, such as lexicons and thesauri, to add power to traditional statistical indexing and retrieval methods. Specifically, the value of Princeton's WordNet ontology is discussed.

Tomek Strzalkowski, Fang Lin, Jin Wang and Jose Perez-Carballo report on the progress of the Natural Language Information Retrieval project and its evaluation in the series of Text Retrieval Conferences (TREC) conducted since 1992 under the guidance of the National Institute of Standards and Technology. The experiments were designed to demonstrate the value of various NLP techniques in large-scale text retrieval. They showed only limited success for LMI methods, while full-text query expansion techniques were found more promising.

Jussi Karlgren looks at the stylistic aspects of text for clues that may help to determine relevance. He proposes a variety of simple quantitative techniques to categorize text stylistically, then goes on to exploit this information for document retrieval purposes.

Ellen Riloff and Jeffrey Lorenzen investigate the use of information extraction techniques to achieve high-precision text categorization. They present the word-augmented relevancy signatures algorithm that uses lexical items to represent domain-specific role relationships. Extraction patterns are learned automatically using a supervised training method.

Yorick Wilks and Robert Gaizauskas describe the General Architecture for Text Engineering (GATE). GATE is a framework, similar to the Tipster Architecture, which allows for reconfiguration of various language engineering modules to assemble alternative information extraction systems. Set within GATE is the LaSIE information extraction system, which possesses a number of distinctive features that make it suitable for a variety of applications, including IR.

Joe Zhou reports the results of large-scale empirical studies performed at Lexis-Nexis to determine the most effective indexing methods for a commercial information retrieval engine. He describes an algorithm that automatically learns how to select meaningful phrasal terms by comparing a focused text sample against a diverse baseline collection. The resulting index is tested on text summarization and text categorization tasks.

Paul Thompson and Christopher Dozier investigate the problem of proper name recognition in text, the accuracy of automated name extraction methods, and the impact of name searching on the overall performance of a commercial text search system. They conclude that in applications that involve searching legal or news databases, where names occur frequently, retrieval performance can be improved.

Jim Cowie describes COLLAGE, a collection of processes and methods which carry out automatic analysis of topics in natural language text. The results of this analysis are used to determine which NLP resources should be applied to converting each part of a topic into a set of Boolean search queries, and how the search results should be ranked for the output.

Louise Guthrie, Joe Guthrie and James Leistensnider describe a probabilistic method of categorizing text documents based on their semantic content. The method is shown to be effective for a pre-defined set of categories with multinomial distribution. The categories are represented as Boolean expressions derived either manually or automatically from natural language descriptions. The basic method is subsequently expanded to handle routing tasks at TREC-5.

Julian Kupiec presents Murax, an advanced information retrieval system developed at Xerox. The system uses shallow linguistic analysis to obtain high-precision retrieval, then assembles the answer for the user by compiling information from the most relevant passages found in the retrieved documents. Murax's performance is demonstrated by searching for answers to Trivial Pursuit questions in Grolier's Encyclopedia.

Marti Hearst discusses the problem of organizing the retrieval results for an effective presentation to the user, using automated text categorization and text clustering methods. This chapter describes various user interfaces that use categories and clusters to organize the retrieval results, and examines the relationship between the two approaches.

Acknowledgements

The idea of writing a book on natural language processing and information retrieval was first suggested to me in early 1996 by Professor Nancy Ide, an editor of Kluwer's Text, Speech and Language Technology book series. The plans for production of the present volume were drafted shortly afterwards, with the first call for participation going out in April 1996. I would like to take this opportunity to thank all the authors who responded to that call for their outstanding contributions and for keeping up as closely as they could with the breakneck schedule imposed on them. Preliminary papers started arriving in late December, followed by a sometimes grueling cross-review and revision process that lasted until June 1997. Eventually, 14 contributions were selected for inclusion in the final volume. My thanks go to the reviewers who helped to assure the highest level of quality for this publication.

I would like to thank Polly Margules and Vanessa Nijweide, our editors at Kluwer, for their support and assistance throughout the publication process.

This book would not be possible without the continuing financial support for my research by the Defense Advanced Research Projects Agency, under Tipster Phase 3 Contract 97-F157200-000. I would also like to thank Norm Sondheimer for his encouragement early in this project, and the members of GE's Natural Language Group: Amit Bagga, Ron Brandow, Fang Lin, Gees Stein, Jin Wang, Bowden Wise, and Longdon White, for their support. Thanks to the ACM for permission to reprint parts of the material in Chapter 3.

Tomek Strzalkowski, Schenectady, November 1998

References

Allan, J., J. Callan, B. Croft, L. Ballesteros, D. Byrd, R. Swan, and J. Xu. (1998). "INQUERY Does Battle with TREC-6." Proceedings of the Sixth Text REtrieval Conference (TREC-6), NIST Special Publication 500-240, National Institute of Standards and Technology, Gaithersburg, MD. pp. 169-206.

DARPA. (1995). Proceedings of the 6th Message Understanding Conference, Morgan-Kaufmann Publishers, San Francisco, CA.

DARPA. (1996). Tipster Text Phase 2: 24 Month Conference, Morgan-Kaufmann Publishers, San Francisco, CA.

Evans, David, and Robert G. Lefferts. (1994). "Design and Evaluation of the CLARIT-TREC-2 System." Proceedings of the Second Text REtrieval Conference (TREC-2), NIST Special Publication 500-215, National Institute of Standards and Technology, Gaithersburg, MD. pp. 137-150.

Fagan, Joel L. (1987). Experiments in Automated Phrase Indexing for Document Retrieval: A Comparison of Syntactic and Non-Syntactic Methods. Ph.D. Thesis, Department of Computer Science, Cornell University.

Guthrie, L., T. Strzalkowski, F. Lin, and J. Wang. (1996). "Integration of Document Detection and Information Extraction." Proceedings of Tipster Phase II, Morgan-Kaufmann Publishers, San Francisco, CA. pp. 195-200.

Harman, Donna K. (ed.). (1998). The Sixth Text REtrieval Conference (TREC-6). NIST Special Publication 500-240, National Institute of Standards and Technology, Gaithersburg, MD.

Lewis, David D., and W. Bruce Croft. (1990). "Term Clustering of Syntactic Phrases." Proceedings of ACM SIGIR-90, pp. 385-405.

Mauldin, Michael. (1991). "Retrieval Performance in Ferret: A Conceptual Information Retrieval System." Proceedings of ACM SIGIR-91, pp. 347-355.

Metzler, Douglas P., Stephanie W. Haas, Cynthia L. Cosic, and Leslie H. Wheeler. (1989). "Constituent Object Parsing for Information Retrieval and Similar Text Processing Problems." Journal of the ASIS, 40(6), pp. 398-423.

Mitra, Mandar, Chris Buckley, Amit Singhal, and Claire Cardie. (1997). "An Analysis of Statistical and Syntactic Phrases." Proceedings of the RIAO-97 Conference, Centre de Hautes Etudes Internationales d'Informatique Documentaires, pp. 200-214.

Smeaton, Alan F., and C. J. van Rijsbergen. (1988). "Experiments on Incorporating Syntactic Processing of User Queries into a Document Retrieval Strategy." Proceedings of ACM SIGIR-88, pp. 31-51.

Sparck Jones, K., and J. I. Tait. (1984). "Automatic Search Term Variant Generation." Journal of Documentation, 40(1), pp. 50-66.

Sparck Jones, K. (1998). "What is the Role of NLP in Text Retrieval?" This volume.

Strzalkowski, Tomek. (1995). "Natural Language Information Retrieval." Information Processing and Management, 31(3), pp. 397-417, Pergamon/Elsevier.

Strzalkowski, Tomek, Fang Lin, and Jose Perez-Carballo. (1998). "Natural Language Information Retrieval: TREC-6 Report." Proceedings of the Sixth Text REtrieval Conference (TREC-6), NIST Special Publication 500-240. pp. 347-366.

CONTRIBUTING AUTHORS

Jim Cowie Computing Research Laboratory, New Mexico State University, Las Cruces, NM 88003, USA [email protected]

Christopher C. Dozier West Group, 610 Opperman Drive, Eagan, MN 55123, USA [email protected]

Robert Gaizauskas Department of Computer Science, University of Sheffield, Sheffield, ENGLAND [email protected]

Louise Guthrie Department of Mathematical Sciences, University of Texas at El Paso, El Paso, TX, USA [email protected]

Joe Guthrie Department of Mathematical Sciences, University of Texas at El Paso, El Paso, TX, USA [email protected]

Marti A. Hearst University of California Berkeley, School of Information Management & Systems, 102 South Hall, Berkeley, CA 94720, USA [email protected]

Christian Jacquemin Institut de Recherche en Informatique de Nantes, 2, chemin de la Houssiniere, BP 92208, 44322 NANTES Cedex 3, FRANCE [email protected]

Jussi Karlgren Swedish Institute for Computer Science, Box 1263, Kista, SWEDEN jussi@sics.se

Julian M. Kupiec Xerox Palo Alto Research Center, 3333 Coyote Hill Road, Palo Alto, CA 94304, USA [email protected]

James Leistensnider Management and Data Systems, Lockheed Martin Corporation, King of Prussia, PA, USA [email protected]

Fang Lin GE Corporate Research & Development, Schenectady, NY 12301, USA [email protected]

Jeffrey Lorenzen Department of Computer Science, University of Utah, Salt Lake City, UT 84112, USA [email protected]

Jose Perez-Carballo School of Communication, Information and Library Studies, Rutgers University, New Brunswick, NJ 04612, USA [email protected]

Ellen Riloff Department of Computer Science, University of Utah, Salt Lake City, UT 84112, USA [email protected]

Gerda Ruge AI Group, Institut fuer Informatik, TU Muenchen, Arcisstr. 21, D-80290 Muenchen, GERMANY ruge@informatik.tu-muenchen.de

Alan F. Smeaton Dublin City University, Glasnevin, Dublin 9, IRELAND [email protected]

Karen Sparck Jones Computer Laboratory, University of Cambridge, New Museum Site, Pembroke Street, Cambridge CB2 3QG, ENGLAND [email protected]

Tomek Strzalkowski GE Corporate Research & Development, Schenectady, NY 12301, USA [email protected]

Paul Thompson West Group, 610 Opperman Drive, Eagan, MN 55123, USA [email protected]

Evelyne Tzoukermann Bell Laboratories, Lucent Technologies, 700 Mountain Avenue, P.O. Box 636, Murray Hill, NJ 07974, USA [email protected]

Jin Wang GE Corporate Research & Development, Schenectady, NY 12301, USA [email protected]

Yorick Wilks Department of Computer Science, University of Sheffield, Sheffield, ENGLAND [email protected]

Joe Zhou LEXIS-NEXIS, a Division of Reed Elsevier, Inc., 9555 Springboro Pike, Miamisburg, OH 45342, USA [email protected]