Natural Language Annotation for Machine Learning

by James Pustejovsky and Amber Stubbs

Copyright © 2013 James Pustejovsky and Amber Stubbs. All rights reserved.
Printed in the United States of America.

Published by O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472.

O’Reilly books may be purchased for educational, business, or sales promotional use. Online editions are also available for most titles (http://my.safaribooksonline.com). For more information, contact our corporate/institutional sales department: 800-998-9938 or [email protected].

Editors: Julie Steele and Meghan Blanchette
Proofreader: Linley Dolby
Production Editor: Kristen Borg
Indexer: WordCo Indexing Services
Copyeditor: Audrey Doyle
Cover Designer: Randy Comer
Interior Designer: David Futato
Illustrator: Rebecca Demarest

October 2012: First Edition

Revision History for the First Edition:
2012-10-10 First release

See http://oreilly.com/catalog/errata.csp?isbn=9781449306663 for release details.

Nutshell Handbook, the Nutshell Handbook logo, and the O’Reilly logo are registered trademarks of O’Reilly Media, Inc. Natural Language Annotation for Machine Learning, the image of a cockatiel, and related trade dress are trademarks of O’Reilly Media, Inc.

Many of the designations used by manufacturers and sellers to distinguish their products are claimed as trademarks. Where those designations appear in this book, and O’Reilly Media, Inc., was aware of a trademark claim, the designations have been printed in caps or initial caps.

While every precaution has been taken in the preparation of this book, the publisher and authors assume no responsibility for errors or omissions, or for damages resulting from the use of the information contained herein.

ISBN: 978-1-449-30666-3
[LSI]

Table of Contents

Preface

1. The Basics
    The Importance of Language Annotation
    The Layers of Linguistic Description
    What Is Natural Language Processing?
    A Brief History of Corpus Linguistics
    What Is a Corpus?
    Early Use of Corpora
    Corpora Today
    Kinds of Annotation
    Language Data and Machine Learning
    Classification
    Clustering
    Structured Pattern Induction
    The Annotation Development Cycle
    Model the Phenomenon
    Annotate with the Specification
    Train and Test the Algorithms over the Corpus
    Evaluate the Results
    Revise the Model and Algorithms
    Summary

2. Defining Your Goal and Dataset
    Defining Your Goal
    The Statement of Purpose
    Refining Your Goal: Informativity Versus Correctness
    Background Research
    Language Resources
    Organizations and Conferences
    NLP Challenges
    Assembling Your Dataset
    The Ideal Corpus: Representative and Balanced
    Collecting Data from the Internet
    Eliciting Data from People
    The Size of Your Corpus
    Existing Corpora
    Distributions Within Corpora
    Summary

3. Corpus Analytics
    Basic Probability for Corpus Analytics
    Joint Probability Distributions
    Bayes Rule
    Counting Occurrences
    Zipf’s Law
    N-grams
    Language Models
    Summary

4. Building Your Model and Specification
    Some Example Models and Specs
    Film Genre Classification
    Adding Named Entities
    Semantic Roles
    Adopting (or Not Adopting) Existing Models
    Creating Your Own Model and Specification: Generality Versus Specificity
    Using Existing Models and Specifications
    Using Models Without Specifications
    Different Kinds of Standards
    ISO Standards
    Community-Driven Standards
    Other Standards Affecting Annotation
    Summary

5. Applying and Adopting Annotation Standards
    Metadata Annotation: Document Classification
    Unique Labels: Movie Reviews
    Multiple Labels: Film Genres
    Text Extent Annotation: Named Entities
    Inline Annotation
    Stand-off Annotation by Tokens
    Stand-off Annotation by Character Location
    Linked Extent Annotation: Semantic Roles
    ISO Standards and You
    Summary

6. Annotation and Adjudication
    The Infrastructure of an Annotation Project
    Specification Versus Guidelines
    Be Prepared to Revise
    Preparing Your Data for Annotation
    Metadata
    Preprocessed Data
    Splitting Up the Files for Annotation
    Writing the Annotation Guidelines
    Example 1: Single Labels—Movie Reviews
    Example 2: Multiple Labels—Film Genres
    Example 3: Extent Annotations—Named Entities
    Example 4: Link Tags—Semantic Roles
    Annotators
    Choosing an Annotation Environment
    Evaluating the Annotations
    Cohen’s Kappa (κ)
    Fleiss’s Kappa (κ)
    Interpreting Kappa Coefficients
    Calculating κ in Other Contexts
    Creating the Gold Standard (Adjudication)
    Summary

7. Training: Machine Learning
    What Is Learning?
    Defining Our Learning Task
    Classifier Algorithms
    Decision Tree Learning
    Gender Identification
    Naïve Bayes Learning
    Maximum Entropy Classifiers
    Other Classifiers to Know About
    Sequence Induction Algorithms
    Clustering and Unsupervised Learning
    Semi-Supervised Learning
    Matching Annotation to Algorithms
    Summary

8. Testing and Evaluation
    Testing Your Algorithm
    Evaluating Your Algorithm
    Confusion Matrices
    Calculating Evaluation Scores
    Interpreting Evaluation Scores
    Problems That Can Affect Evaluation
    Dataset Is Too Small
    Algorithm Fits the Development Data Too Well
    Too Much Information in the Annotation
    Final Testing Scores
    Summary

9. Revising and Reporting
    Revising Your Project
    Corpus Distributions and Content
    Model and Specification
    Annotation
    Training and Testing
    Reporting About Your Work
    About Your Corpus
    About Your Model and Specifications
    About Your Annotation Task and Annotators
    About Your ML Algorithm
    About Your Revisions
    Summary

10. Annotation: TimeML
    The Goal of TimeML
    Related Research
    Building the Corpus
    Model: Preliminary Specifications
    Times
    Signals
    Events
    Links
    Annotation: First Attempts
    Model: The TimeML Specification Used in TimeBank
    Time Expressions
    Events
    Signals
    Links
    Confidence
    Annotation: The Creation of TimeBank
    TimeML Becomes ISO-TimeML
    Modeling the Future: Directions for TimeML
    Narrative Containers
    Expanding TimeML to Other Domains
    Event Structures
    Summary

11. Automatic Annotation: Generating TimeML
    The TARSQI Components
    GUTime: Temporal Marker Identification
    EVITA: Event Recognition and Classification
    GUTenLINK
    Slinket
    SputLink
    Machine Learning in the TARSQI Components
    Improvements to the TTK
    Structural Changes
    Improvements to Temporal Entity Recognition: BTime
    Temporal Relation Identification
    Temporal Relation Validation
    Temporal Relation Visualization
    TimeML Challenges: TempEval-2
    TempEval-2: System Summaries
    Overview of Results
    Future of the TTK
    New Input Formats
    Narrative Containers/Narrative Times
    Medical Documents
    Cross-Document Analysis
    Summary

12. Afterword: The Future of Annotation
    Crowdsourcing Annotation
    Amazon’s Mechanical Turk
    Games with a Purpose (GWAP)
    User-Generated Content
    Handling Big Data
    Boosting
    Active Learning
    Semi-Supervised Learning
    NLP Online and in the Cloud
    Distributed Computing
    Shared Language Resources
    Shared Language Applications
    And Finally...

A. List of Available Corpora and Specifications
B. List of Software Resources
C. MAE User Guide
D. MAI User Guide
E. Bibliography
Index

Preface

This book is intended as a resource for people who are interested in using computers to help process natural language. A natural language refers to any language spoken by humans, either currently (e.g., English, Chinese, Spanish) or in the past (e.g., Latin, ancient Greek, Sanskrit). Annotation refers to the process of adding metadata information to the text in order to augment a computer’s capability to perform Natural Language Processing (NLP). In particular, we examine how information can be added to natural language text through annotation in order to increase the performance of machine learning algorithms: computer programs designed to extrapolate rules from the information provided over texts in order to apply those rules to unannotated texts later on.

Natural Language Annotation for Machine Learning

This book details the multistage process for building your own annotated natural language dataset (known as a corpus) in order to train machine learning (ML) algorithms for language-based data and knowledge discovery. The overall goal of this book is to show readers how to create their own corpus, starting with selecting an annotation task, creating the annotation specification, designing the guidelines, creating a “gold standard” corpus, and then beginning the actual data creation with the annotation process.
Because the annotation process is not linear, multiple iterations can be required for defining the tasks, annotations, and evaluations, in order to achieve the best results for a particular goal. The process can be summed up in terms of the MATTER Annotation Development Process: Model, Annotate, Train, Test, Evaluate, Revise. This book guides the reader through the cycle, and provides detailed examples and discussion
