Natural Language Processing for Online Applications Text Retrieval, Extraction and Categorization

Natural Language Processing

Editor
Prof. Ruslan Mitkov
School of Humanities, Languages and Social Sciences
University of Wolverhampton
Stafford St.
Wolverhampton WV1 1SB, United Kingdom
Email: [email protected]

Advisory Board
Christian Boitet (University of Grenoble)
John Carroll (University of Sussex, Brighton)
Eugene Charniak (Brown University, Providence)
Eduard Hovy (Information Sciences Institute, USC)
Richard Kittredge (University of Montreal)
Geoffrey Leech (Lancaster University)
Carlos Martin-Vide (Rovira i Virgili Un., Tarragona)
Andrei Mikheev (University of Edinburgh)
John Nerbonne (University of Groningen)
Nicolas Nicolov (IBM, T.J. Watson Research Center)
Kemal Oflazer (Sabanci University)
Allan Ramsey (UMIST, Manchester)
Monique Rolbert (Université de Marseille)
Richard Sproat (AT&T Labs Research, Florham Park)
K-Y Su (Behavior Design Corp.)
Isabelle Trancoso (INESC, Lisbon)
Benjamin Tsou (City University of Hong Kong)
Jun-ichi Tsujii (University of Tokyo)
Evelyne Tzoukermann (Bell Laboratories, Murray Hill)
Yorick Wilks (University of Sheffield)

Volume 5
Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization
by Peter Jackson and Isabelle Moulinier

Peter Jackson
Isabelle Moulinier
Thomson Legal & Regulatory

John Benjamins Publishing Company
Amsterdam / Philadelphia

The paper used in this publication meets the minimum requirements of American National Standard for Information Sciences – Permanence of Paper for Printed Library Materials, ANSI Z39.48-1984.

Library of Congress Cataloging-in-Publication Data
Jackson, Peter, 1948-
Natural language processing for online applications : text retrieval, extraction, and categorization / Peter Jackson, Isabelle Moulinier.
p. cm. (Natural Language Processing, ISSN 1567–8202 ; v. 5)
Includes bibliographical references and index.
I. Jackson, Peter. II. Moulinier, Isabelle. III. Title. IV. Series.
QA76.9.N38 I33 2002
006.3’5--dc21 2002066539
ISBN 90 272 4988 1 (Eur.) / 1 58811 249 7 (US) (Hb; alk. paper)
ISBN 90 272 4989 X (Eur.) / 1 58811 250 0 (US) (Pb; alk. paper)

© 2002 – John Benjamins B.V.
No part of this book may be reproduced in any form, by print, photoprint, microfilm, or any other means, without written permission from the publisher.

John Benjamins Publishing Co. · P.O. Box 36224 · 1020 ME Amsterdam · The Netherlands
John Benjamins North America · P.O. Box 27519 · Philadelphia PA 19118-0519 · USA

Table of contents

Preface

Chapter 1  Natural language processing
  What is NLP?
  NLP and linguistics
    Syntax and semantics
    Pragmatics and context
    Two views of NLP
    Tasks and supertasks
  Linguistic tools
    Sentence delimiters and tokenizers
    Stemmers and taggers
    Noun phrase and name recognizers
    Parsers and grammars
  Plan of the book

Chapter 2  Document retrieval
  Information retrieval
  Indexing technology
  Query processing
    Boolean search
    Ranked retrieval
    Probabilistic retrieval
    Language modeling
  Evaluating search engines
    Evaluation studies
    Evaluation metrics
    Relevance judgments
    Total system evaluation
  Attempts to enhance search performance
    Query expansion and thesauri
    Query expansion from relevance information*
  The future of Web searching
    Indexing the Web
    Searching the Web
    Ranking and reranking documents
    The state of online search
  Summary of information retrieval

Chapter 3  Information extraction
  The Message Understanding Conferences
  Regular expressions
  Finite automata in FASTUS
    Finite State Machines and regular languages
    Finite State Machines as parsers
  Pushdown automata and context-free grammars
    Analyzing case reports
    Context free grammars
    Parsing with a pushdown automaton
    Coping with incompleteness and ambiguity
  Limitations of current technology and future research
    Explicit versus implicit statements
    Machine learning for information extraction
    Statistical language models for information extraction
  Summary of information extraction

Chapter 4  Text categorization
  Overview of categorization tasks and methods
  Handcrafted rule based methods
  Inductive learning for text classification
    Naïve Bayes classifiers
    Linear classifiers*
    Decision trees and decision lists
  Nearest Neighbor algorithms
  Combining classifiers
    Data fusion
    Boosting
    Using multiple classifiers
  Evaluation of text categorization systems
    Evaluation studies
    Evaluation metrics
    Relevance judgments
    System evaluation

Chapter 5  Towards text mining
  What is text mining?
  Reference and coreference
    Named entity recognition
    The coreference task
  Automatic summarization
    Summarization tasks
    Constructing summaries from document fragments
    Multi-document summarization (MDS)
  Testing of automatic summarization programs
    Evaluation problems in summarization research
    Building a corpus for training and testing
  Prospects for text mining and NLP

Index

Preface

There is no single text on the market that covers the emerging technologies of document retrieval, information extraction, and text categorization in a coherent fashion. This book seeks to satisfy a genuine need on the part of technology practitioners in the Internet space, who are faced with having to make difficult decisions as to what research has been done, and what the best practices are. It is not intended as a vendor guide (such things are quickly out of date), or as a recipe for building applications (such recipes are very context-dependent). But it does identify the key technologies, the issues involved, and the strengths and weaknesses of the various approaches.
There is also a strong emphasis on evaluation in every chapter, both in terms of methodology (how to evaluate) and what controlled experimentation and industrial experience have to tell us.

I was prompted to write this book after spending seven years running an R&D group in an Internet publishing and solutions business. During that time, we were able to put into production a number of systems that either generated revenue or enabled cost savings for the company, leveraging technologies from information retrieval, information extraction, and text categorization. This is not a chronicle of these exploits, but a primer for those who are already interested in natural language processing for online applications. Nevertheless, my treatment of the philosophy and practice of language processing is colored by the context in which I function, namely the arena of commercial exploitation. Thus, although there is a focus on technical detail and research results, I also address some of the issues that arise in applying such systems to data collections of realistic size and complexity.

The book is not intended exclusively as an academic text, although I suspect that it will be of interest to students who wish to use these technologies in an industrial setting. It is also aimed at software engineers, project managers, and technology executives who want or need to understand the technology at some level. I hope that such people find it useful, and that it provokes ideas, discussion, and action in the field of applied research and development. Each chapter begins with lighter material and then progresses to heavier stuff, with some of the later sections and sidebars being marked with an asterisk as optional. Every attempt has been made to cover more ground, and make the material more accessible, than one would expect from the typical textbook.
Thanks are due to Khalid Al-Kofahi, Chris Dozier, David Lewis, Jane Lund, Ruslan Mitkov, and two anonymous referees, for providing insightful comments on one or more chapters. I would also like to thank James Allen and David Israel for helping me get started in information extraction back in 1995, and Chris Wolff for backing that project at Thomson Legal Publishing. I thank my colleagues at Thomson Legal & Regulatory for bearing with me while I finished the book, and for providing a stimulating work environment. Finally, I thank my wife Sandy for putting up with me while I devoted long hours to this project.

Peter Jackson
St Paul, MN

Chapter 1
Natural language processing

A conversation with Eliza

Eliza: how are you today?
PJ: not so good
Eliza: why do you say not so good?
PJ: i suffer from allergies
Eliza: that’s interesting
PJ: i come out in spots
Eliza: i’m not sure i know what you’re talking about
PJ: spots on my face
Eliza: your face
PJ: that’s right
Eliza: is it really that right?
PJ: you’re a program, aren’t you?
Eliza: does it please you to believe i am a program, am not i?
PJ: it sure does
Eliza: i see
PJ: yup
Eliza: a while back you talked about your face
PJ: what about your face?
Eliza: what is it you really want to know?
PJ: are you my kind of girl?
Eliza: perhaps in your fantasies i am your kind of girl
PJ: in my fantasies you are blonde
Eliza: what makes you think i am blonde?
PJ: i don’t know
Eliza: don’t you really know?
PJ: no
Eliza: are you saying no just to be negative?
PJ: no
Eliza: you are being a bit negative
PJ: no i’m not
Eliza: how long have you been not?

The above dialogue took place recently between one of the authors and a computer program called Eliza,1 downloaded from the World Wide Web. Eliza was an AI brainchild of the 1960s that employed rather primitive pattern-matching methods to respond to inputs and sustain a conversation
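
To make the idea of pattern-matching responses concrete, here is a minimal sketch in Python. It is not the script of the original Eliza, nor of the program used for the dialogue above: the rules, the REFLECTIONS table, and the function names reflect and respond are all hypothetical, chosen only so that a few of the exchanges shown can be reproduced. Each rule pairs a regular expression with a response template, and matched fragments have their first- and second-person words swapped before being echoed back.

```python
import re

# Hypothetical pronoun swaps, so that a fragment such as "my kind of girl"
# is echoed back as "your kind of girl".
REFLECTIONS = {
    "i": "you", "me": "you", "my": "your", "am": "are",
    "you": "i", "your": "my", "are": "am",
}

# Hypothetical (pattern, template) rules, tried in order.
# The final rule matches any input and acts as a catch-all.
RULES = [
    (re.compile(r"^not (.+)$"), "why do you say not {0}?"),
    (re.compile(r"^are you (.+)$"), "perhaps in your fantasies i am {0}"),
    (re.compile(r"\bmy (.+)$"), "your {0}"),
    (re.compile(r"^no$"), "are you saying no just to be negative?"),
    (re.compile(r".*"), "i'm not sure i know what you're talking about"),
]


def reflect(fragment: str) -> str:
    """Swap first- and second-person words in a matched fragment."""
    return " ".join(REFLECTIONS.get(word, word) for word in fragment.split())


def respond(utterance: str) -> str:
    """Answer with the template of the first rule whose pattern is found."""
    text = utterance.lower().strip(" .!?")
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            return template.format(*(reflect(g) for g in match.groups()))
    return "i see"  # not reached while the catch-all rule is present


if __name__ == "__main__":
    for line in ["not so good", "spots on my face", "are you my kind of girl?"]:
        print("PJ:", line)
        print("Eliza:", respond(line))
```

Nothing in this sketch models meaning: the program only finds a surface pattern and rearranges the words it matched, which is why inputs outside the rule set fall through to a stock reply.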
