Text Analysis with LingPipe 4

Bob Carpenter
Breck Baldwin

LingPipe Publishing
New York
2011

© LingPipe Publishing

Pending: Library of Congress Cataloging-in-Publication Data
Carpenter, Bob.
Text Analysis with LingPipe 4.0 / Bob Carpenter, Breck Baldwin
p. cm.
Includes bibliographical references and index.
ISBN X-XXX-XXXXX-X (pbk.)
1. Natural Language Processing  2. Java (computer program language)
I. Carpenter, Bob  II. Baldwin, Breck  III. Title
QAXXX.XXXXXXXX 2011
XXX/.XX'X-xxXX
XXXXXXXX

All rights reserved. This book, or parts thereof, may not be reproduced in any form without permission of the publishers.

Contents

1 Getting Started ... 1
  1.1 Tools of the Trade ... 1
  1.2 Hello World Example ... 9
  1.3 Introduction to Ant ... 11
2 Handlers, Parsers, and Corpora ... 19
  2.1 Handlers and Object Handlers ... 19
  2.2 Parsers ... 20
  2.3 Corpora ... 25
  2.4 Cross Validation ... 29
3 Tokenization ... 33
  3.1 Tokenizers and Tokenizer Factories ... 33
  3.2 LingPipe's Base Tokenizer Factories ... 37
  3.3 LingPipe's Filtered Tokenizers ... 40
  3.4 Morphology, Stemming, and Lemmatization ... 46
  3.5 Soundex: Pronunciation-Based Tokens ... 53
  3.6 Character Normalizing Tokenizer Filters ... 56
  3.7 Penn Treebank Tokenization ... 57
  3.8 Adapting to and from Lucene Analyzers ... 64
  3.9 Tokenizations as Objects ... 71
4 Suffix Arrays ... 75
  4.1 What Is a Suffix Array? ... 75
  4.2 Character Suffix Arrays ... 76
  4.3 Token Suffix Arrays ... 78
  4.4 Document Collections as Suffix Arrays ... 81
  4.5 Implementation Details ... 84
5 Symbol Tables ... 85
  5.1 The SymbolTable Interface ... 85
  5.2 The MapSymbolTable Class ... 86
  5.3 The SymbolTableCompiler Class ... 89
6 Character Language Models ... 93
  6.1 Applications of Language Models ... 93
  6.2 The Basics of N-Gram Language Models ... 94
  6.3 Character-Level Language Models and Unicode ... 95
  6.4 Language Model Interfaces ... 95
  6.5 Process Character Language Models ... 98
  6.6 Sequence Character Language Models ... 101
  6.7 Tuning Language Model Smoothing ... 104
  6.8 Underlying Sequence Counter ... 107
  6.9 Learning Curve Evaluation ... 107
  6.10 Pruning Counts ... 112
  6.11 Compiling and Serializing Character LMs ... 112
  6.12 Thread Safety ... 113
  6.13 The Mathematical Model ... 113
7 Tokenized Language Models ... 119
  7.1 Applications of Tokenized Language Models ... 119
  7.2 Token Language Model Interface ... 119
8 Spelling Correction ... 121
9 Classifiers and Evaluation ... 123
  9.1 What Is a Classifier? ... 123
  9.2 Kinds of Classifiers ... 125
  9.3 Gold Standards, Annotation, and Reference Data ... 129
  9.4 Confusion Matrices ... 130
  9.5 Precision-Recall Evaluation ... 140
  9.6 Micro- and Macro-Averaged Statistics ... 144
  9.7 Scored Precision-Recall Evaluations ... 147
  9.8 Contingency Tables and Derived Statistics ... 155
  9.9 Bias Correction ... 167
  9.10 Post-Stratification ... 168
10 Naive Bayes Classifiers ... 169
  10.1 Introduction to Naive Bayes ... 169
  10.2 Getting Started with Naive Bayes ... 173
  10.3 Independence, Overdispersion and Probability Attenuation ... 175
  10.4 Tokens, Counts and Sufficient Statistics ... 177
  10.5 Unbalanced Category Probabilities ... 177
  10.6 Maximum Likelihood Estimation and Smoothing ... 178
  10.7 Item-Weighted Training ... 181
  10.8 Document Length Normalization ... 183
  10.9 Serialization and Compilation ... 185
  10.10 Training and Testing with a Corpus ... 187
  10.11 Cross-Validating a Classifier ... 192
  10.12 Formalizing Naive Bayes ... 199
11 Tagging ... 205
  11.1 Taggings ... 205
  11.2 Tag Lattices ... 208
  11.3 Taggers ... 210
  11.4 Tagger Evaluators ... 211
12 Tagging with Hidden Markov Models ... 215
13 Conditional Random Fields ... 217
14 Latent Dirichlet Allocation ... 219
  14.1 Corpora, Documents, and Tokens ... 219
  14.2 LDA Parameter Estimation ... 220
  14.3 Interpreting LDA Output ... 224
  14.4 LDA's Gibbs Samples ... 227
  14.5 Handling Gibbs Samples ... 228
  14.6 Scalability of LDA ... 233
  14.7 Understanding the LDA Model Parameters ... 238
  14.8 LDA Instances for Multi-Topic Classification ... 239
  14.9 Comparing Documents with LDA ... 244
  14.10 Stability of Samples ... 245
  14.11 The LDA Model ... 249
15 Singular Value Decomposition ... 251
16 Sentence Boundary Detection ... 253
A Mathematics ... 255
  A.1 Basic Notation ... 255
  A.2 Useful Functions ... 255
B Statistics ... 259
  B.1 Discrete Probability Distributions ... 259
  B.2 Continuous Probability Distributions ... 261
  B.3 Maximum Likelihood Estimation ... 261
  B.4 Maximum a Posteriori Estimation ... 261
  B.5 Information Theory ... 261
C Java Basics ... 267
  C.1 Generating Random Numbers ... 267
D Corpora ... 271
  D.1 Canterbury Corpus ... 271
  D.2 20 Newsgroups ... 272
  D.3 MedTag ... 272
  D.4 WormBase MEDLINE Citations ... 273
E Further Reading ... 275
  E.1 Algorithms ... 275
  E.2 Probability and Statistics ... 276
  E.3 Machine Learning ... 276
  E.4 Linguistics ... 277
  E.5 Natural Language Processing ... 277
F Licenses ... 279
  F.1 LingPipe License ... 279
  F.2 Java Licenses ... 280
  F.3 Apache License 2.0 ... 287
  F.4 Common Public License 1.0 ... 288
  F.5 X License ... 290
  F.6 Creative Commons Attribution-ShareAlike 3.0 Unported License ... 290

Preface

LingPipe is a software library for natural language processing implemented in Java.
This book explains the tools that are available in LingPipe and provides examples of how they can be used to build natural language processing (NLP) applications for multiple languages and genres, and for many kinds of applications. LingPipe's application programming interface (API) is tailored to abstract over low-level implementation details to enable components such as tokenizers, feature extractors, or classifiers to be swapped in a plug-and-play fashion. LingPipe contains a mixture of heuristic rule-based components and statistical components, often implementing the same interfaces, such as chunking or tokenization.

The presentation here will be hands-on. You should be comfortable reading short and relatively simple Java programs. Java programming idioms like loop boundaries being inclusive/exclusive and higher-level design patterns like visitors will also be presupposed. More specific aspects of Java coding relating to text processing, such as streaming I/O, character decoding, string representations, and regular expression processing, will be discussed in more depth. We will also go into some detail on collections, XML/HTML parsing with SAX, and serialization patterns.

We do not presuppose any knowledge of linguistics beyond a simple understanding of the terms used in dictionaries, such as words, syllables, pronunciations, and parts of speech such as noun and preposition. We will spend considerable time introducing linguistic concepts, such as word senses or noun phrase chunks, as they relate to natural language processing modules in LingPipe. We will do our best to introduce LingPipe's modules and their application from a hands-on, practical API perspective rather than a theoretical one.
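The plug-and-play design mentioned above can be sketched in a few lines of plain Java: components share a common interface, so callers depend only on that interface and implementations can be swapped freely. The names below (`SimpleTokenizer`, `WhitespaceTokenizer`, and so on) are illustrative inventions for this sketch, not LingPipe's actual API, which is introduced in Chapter 3.

```java
import java.util.Arrays;
import java.util.List;

// A minimal sketch of plug-and-play components: implementations share
// an interface, so callers never depend on a particular tokenizer.
// These class names are hypothetical, not LingPipe's API.
public class PlugAndPlaySketch {

    interface SimpleTokenizer {
        List<String> tokenize(String text);
    }

    // A heuristic rule-based implementation: split on whitespace.
    static class WhitespaceTokenizer implements SimpleTokenizer {
        public List<String> tokenize(String text) {
            return Arrays.asList(text.trim().split("\\s+"));
        }
    }

    // A different implementation: lowercase, then split on non-letters.
    static class LowerCaseLetterTokenizer implements SimpleTokenizer {
        public List<String> tokenize(String text) {
            return Arrays.asList(text.toLowerCase().split("[^a-z]+"));
        }
    }

    // Caller code depends only on the interface, so either
    // implementation may be plugged in without other changes.
    static int countTokens(SimpleTokenizer tokenizer, String text) {
        return tokenizer.tokenize(text).size();
    }

    public static void main(String[] args) {
        String text = "Plug-and-play, in Java!";
        System.out.println(countTokens(new WhitespaceTokenizer(), text));      // 3
        System.out.println(countTokens(new LowerCaseLetterTokenizer(), text)); // 5
    }
}
```

Swapping tokenizers changes the token stream (and hence the behavior of anything downstream) without touching the calling code, which is the point of programming against the interface.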
In most cases, such as for logistic regression classifiers and conditional random field (CRF) taggers and chunkers, it's possible to learn how to effectively fit complex and useful models without fully understanding the mathematical basis of LingPipe's estimation and optimization algorithms. In other cases, such as naive Bayes classifiers, hierarchical clusterers, and hidden Markov models (HMMs), the models are simpler, estimation is a matter of counting, and there is almost no hand-tuning required.

Deeper understanding of LingPipe's algorithms and statistical models requires familiarity with computational complexity analysis and basic probability theory, including information theory. We provide suggested readings in algorithms, statistics, machine learning, and linguistics in Appendix E.
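The remark that naive Bayes estimation "is a matter of counting" can be made concrete with a back-of-the-envelope sketch. The helper below is a hypothetical illustration, not LingPipe's classifier API (covered in Chapter 10): it derives an add-one smoothed estimate of p(word | category) directly from frequency counts over a tiny labeled corpus.

```java
// Illustration of "estimation is a matter of counting" for naive Bayes:
// word-given-category probabilities fall out of raw frequency counts
// plus additive (add-one) smoothing. Hypothetical sketch, not LingPipe's API.
public class CountingEstimateSketch {

    // Add-one smoothed estimate of p(word | category) from labeled examples,
    // where each example is a {category, space-separated tokens} pair.
    static double smoothedProb(String[][] training, String category,
                               String word, int vocabularySize) {
        int wordCount = 0;
        int totalCount = 0;
        for (String[] example : training) {
            if (!example[0].equals(category)) continue;
            for (String token : example[1].split(" ")) {
                ++totalCount;
                if (token.equals(word)) ++wordCount;
            }
        }
        return (wordCount + 1.0) / (totalCount + vocabularySize);
    }

    public static void main(String[] args) {
        String[][] training = {
            {"spam", "buy cheap pills"},
            {"spam", "cheap cheap offer"},
            {"ham",  "meeting at noon"},
        };
        // "cheap" occurs 3 times among 6 spam tokens; 7 distinct words overall.
        System.out.println(smoothedProb(training, "spam", "cheap", 7));
        // (3 + 1) / (6 + 7), roughly 0.308
    }
}
```

There is no iterative optimization here at all; the estimate is one pass of counting and a division, which is why such models need almost no hand-tuning.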