Text Analysis with LingPipe 4
Volume I

Bob Carpenter

LingPipe Publishing
New York
2010

© LingPipe Publishing

Pending: Library of Congress Cataloging-in-Publication Data

Carpenter, Bob.
Text Analysis with LingPipe 4.0 / Bob Carpenter
p. cm.
Includes bibliographical references and index.
ISBN X-XXX-XXXXX-X (pbk. : alk. paper)
1. Natural Language Processing 2. Java (computer program language) I. Carpenter, Bob II. Title
QAXXX.XXXXXXXX 2011
XXX/.XX’X-xxXX XXXXXXXX

Text printed on demand in the United States of America.

All rights reserved. This book, or parts thereof, may not be reproduced in any form without permission of the publishers.

Contents

1 Getting Started
  1.1 Tools of the Trade
  1.2 Hello World Example
  1.3 Introduction to Ant

2 Characters and Strings
  2.1 Character Encodings
  2.2 Encoding Java Programs
  2.3 char Primitive Type
  2.4 Character Class
  2.5 CharSequence Interface
  2.6 String Class
  2.7 StringBuilder Class
  2.8 CharBuffer Class
  2.9 Charset Class
  2.10 International Components for Unicode

3 Regular Expressions
  3.1 Matching and Finding
  3.2 Character Regexes
  3.3 Character Classes
  3.4 Concatenation
  3.5 Disjunction
  3.6 Greedy Quantifiers
  3.7 Reluctant Quantifiers
  3.8 Possessive Quantifiers
  3.9 Non-Capturing Regexes
  3.10 Parentheses for Grouping
  3.11 Back References
  3.12 Pattern Match Flags
  3.13 Pattern Construction Exceptions
  3.14 Find-and-Replace Operations
  3.15 String Regex Utilities
  3.16 Thread Safety, Serialization, and Reuse

4 Input and Output
  4.1 Files
  4.2 I/O Exceptions
  4.3 Security and Security Exceptions
  4.4 Input Streams
  4.5 Output Streams
  4.6 Closing Streams
  4.7 Readers
  4.8 Writers
  4.9 Converting Byte Streams to Characters
  4.10 File Input and Output Streams
  4.11 Buffered Streams
  4.12 Array-Backed Input and Output Streams
  4.13 Compressed Streams
  4.14 Tape Archive (Tar) Streams
  4.15 Accessing Resources from the Classpath
  4.16 Print Streams and Formatted Output
  4.17 Standard Input, Output, and Error Streams
  4.18 URIs, URLs and URNs
  4.19 The Input Source Abstraction
  4.20 File Channels and Memory Mapped Files
  4.21 Object and Data I/O
  4.22 LingPipe I/O Utilities

5 Handlers, Parsers, and Corpora
  5.1 Handlers and Object Handlers
  5.2 Parsers
  5.3 Corpora
  5.4 Cross Validation

6 Tokenization
  6.1 Tokenizers and Tokenizer Factories
  6.2 LingPipe’s Base Tokenizer Factories
  6.3 LingPipe’s Filtered Tokenizers
  6.4 Morphology, Stemming, and Lemmatization
  6.5 Soundex: Pronunciation-Based Tokens
  6.6 Character Normalizing Tokenizer Filters
  6.7 Penn Treebank Tokenization
  6.8 Adapting to and from Lucene Analyzers
  6.9 Tokenizations as Objects

7 Symbol Tables
  7.1 The SymbolTable Interface
  7.2 The MapSymbolTable Class
  7.3 The SymbolTableCompiler Class

8 Character Language Models
  8.1 Applications of Language Models
  8.2 The Basics of N-Gram Language Models
  8.3 Character-Level Language Models and Unicode
  8.4 Language Model Interfaces
  8.5 Process Character Language Models
  8.6 Sequence Character Language Models
  8.7 Tuning Language Model Smoothing
  8.8 Underlying Sequence Counter
  8.9 Learning Curve Evaluation
  8.10 Pruning Counts
  8.11 Compiling and Serializing Character LMs
  8.12 Thread Safety
  8.13 The Mathematical Model

9 Tokenized Language Models
  9.1 Applications of Tokenized Language Models
  9.2 Token Language Model Interface

10 Classifiers and Evaluation
  10.1 What is a Classifier?
  10.2 Gold Standards, Annotation, and Reference Data
  10.3 Confusion Matrices
  10.4 Precision-Recall Evaluation
  10.5 Micro- and Macro-Averaged Statistics
  10.6 Scored Precision-Recall Evaluations
  10.7 Contingency Tables and Derived Statistics
  10.8 Bias Correction
  10.9 Post-Stratification

11 Naive Bayes Classifiers
  11.1 Introduction to Naive Bayes
  11.2 Getting Started with Naive Bayes
  11.3 Independence, Overdispersion and Probability Attenuation
  11.4 Tokens, Counts and Sufficient Statistics
  11.5 Unbalanced Category Probabilities
  11.6 Maximum Likelihood Estimation and Smoothing
  11.7 Item-Weighted Training
  11.8 Document Length Normalization
  11.9 Serialization and Compilation
  11.10 Training and Testing with a Corpus
  11.11 Cross-Validating a Classifier
  11.12 Formalizing Naive Bayes

12 Tagging
  12.1 Taggings
  12.2 Tag Lattices

13 Latent Dirichlet Allocation
  13.1 Corpora, Documents, and Tokens
  13.2 LDA Parameter Estimation
  13.3 Interpreting LDA Output
  13.4 LDA’s Gibbs Samples
  13.5 Handling Gibbs Samples
  13.6 Scalability of LDA
  13.7 Understanding the LDA Model Parameters
  13.8 LDA Instances for Multi-Topic Classification
  13.9 Comparing Documents with LDA
  13.10 Stability of Samples
  13.11 The LDA Model

14 Singular Value Decomposition

15 Tagging with Hidden Markov Models

16 Conditional Random Fields

17 Spelling Correction

18 Sentence Boundary Detection

A Mathematics
  A.1 Basic Notation
  A.2 Useful Functions

B Statistics
  B.1 Discrete Probability Distributions
  B.2 Continuous Probability Distributions
  B.3 Maximum Likelihood Estimation
  B.4 Maximum a Posteriori Estimation
  B.5 Information Theory

C Corpora
  C.1 Canterbury Corpus
  C.2 20 Newsgroups
  C.3 MedTag
  C.4 WormBase MEDLINE Citations

D Java Basics
  D.1 Numbers
  D.2 Objects
  D.3 Arrays
  D.4 Synchronization
  D.5 Generating Random Numbers

E The Lucene Search Library
  E.1 Fields
  E.2 Documents
  E.3 Analysis and Token Streams
  E.4 Directories
  E.5 Indexing
  E.6 Queries and Query Parsing
  E.7 Search
  E.8 Deleting Documents
  E.9 Lucene and Databases

F Further Reading
  F.1 Unicode
  F.2 Java
  F.3 General Programming
  F.4 Algorithms