Natural Language Processing in Python
Total Page:16
File Type:pdf, Size:1020Kb
Natural Language Processing in Python Authors: Steven Bird, Ewan Klein, Edward Loper Version: 0.9.1 (draft only, please send feedback to authors) Copyright: © 2001-2008 the authors License: Creative Commons Attribution-Noncommercial-No Derivative Works 3.0 United States License Revision: Date: January 24, 2008 2 Bird, Klein & Loper Contents 1 Introduction to Natural Language Processing 19 1.1 The Language Challenge ................................. 19 1.1.1 The Richness of Language ............................ 19 1.1.2 The Promise of NLP ............................... 20 1.2 Language and Computation ................................ 21 1.2.1 NLP and Intelligence ............................... 21 1.2.2 Language and Symbol Processing ........................ 22 1.2.3 Philosophical Divides .............................. 23 1.3 The Architecture of Linguistic and NLP Systems .................... 24 1.3.1 Generative Grammar and Modularity ...................... 24 1.4 Before Proceeding Further... ............................... 27 1.5 Further Reading ...................................... 27 I BASICS 29 2 Programming Fundamentals and Python 33 2.1 Getting Started ...................................... 33 2.2 Understanding the Basics: Strings and Variables ..................... 34 2.2.1 Representing text ................................. 34 2.2.2 Storing and Reusing Values ........................... 35 2.2.3 Printing and Inspecting Strings .......................... 36 2.2.4 Creating Programs with a Text Editor ...................... 37 2.2.5 Exercises ..................................... 38 2.3 Slicing and Dicing .................................... 38 2.3.1 Accessing Individual Characters ......................... 38 2.3.2 Accessing Substrings ............................... 40 2.3.3 Exercises ..................................... 41 2.4 Strings, Sequences, and Sentences ............................ 41 2.4.1 Lists ........................................ 42 2.4.2 Working on Sequences One Item at a Time ................... 44 2.4.3 String Formatting ................................. 45 2.4.4 Converting Between Strings and Lists ...................... 47 2.4.5 Mini-Review ................................... 48 2.4.6 Exercises ..................................... 49 2.5 Making Decisions ..................................... 51 3 CONTENTS 2.5.1 Making Simple Decisions ............................ 51 2.5.2 Conditional Expressions ............................. 52 2.5.3 Iteration, Items, and if ............................. 53 2.5.4 A Taster of Data Types .............................. 54 2.5.5 Exercises ..................................... 55 2.6 Getting Organized ..................................... 56 2.6.1 Accessing Data with Data ............................ 57 2.6.2 Counting with Dictionaries ............................ 59 2.6.3 Getting Unique Entries .............................. 60 2.6.4 Scaling Up .................................... 61 2.6.5 Exercises ..................................... 61 2.7 Regular Expressions ................................... 62 2.7.1 Groupings ..................................... 65 2.7.2 Practice Makes Perfect .............................. 66 2.7.3 Exercises ..................................... 67 2.8 Summary ......................................... 68 2.9 Further Reading ...................................... 69 2.9.1 Python ...................................... 69 2.9.2 Regular Expressions ............................... 69 2.9.3 Unicode ...................................... 69 3 Words: The Building Blocks of Language 71 3.1 Introduction ........................................ 71 3.2 Tokens, Types and Texts ................................. 71 3.2.1 Extracting Text from Files ............................ 73 3.2.2 Extracting Text from the Web .......................... 74 3.2.3 Extracting Text from NLTK Corpora ....................... 75 3.2.4 Exercises ..................................... 76 3.3 Text Processing with Unicode .............................. 78 3.3.1 What is Unicode? ................................. 79 3.3.2 Extracting encoded text from files ........................ 79 3.3.3 Using your local encoding in Python ....................... 82 3.3.4 Chinese and XML ................................ 82 3.3.5 Exercises ..................................... 83 3.4 Tokenization and Normalization ............................. 83 3.4.1 Tokenization with Regular Expressions ..................... 83 3.4.2 Lemmatization and Normalization ........................ 84 3.4.3 Transforming Lists ................................ 86 3.4.4 Exercises ..................................... 87 3.5 Counting Words: Several Interesting Applications .................... 89 3.5.1 Frequency Distributions ............................. 89 3.5.2 Stylistics ..................................... 90 3.5.3 Aside: Defining Functions ............................ 91 3.5.4 Lexical Dispersion ................................ 91 3.5.5 Comparing Word Lengths in Different Languages ................ 93 3.5.6 Generating Random Text with Style ....................... 93 3.5.7 Collocations ................................... 94 January 24, 2008 4 Bird, Klein & Loper CONTENTS Introduction to Natural Language Processing (DRAFT) 3.5.8 Exercises ..................................... 94 3.6 WordNet: An English Lexical Database ......................... 97 3.6.1 Senses and Synonyms .............................. 97 3.6.2 The WordNet Hierarchy ............................. 98 3.6.3 WordNet Similarity ................................ 100 3.6.4 Exercises ..................................... 101 3.7 Conclusion ........................................ 101 3.8 Summary ......................................... 101 3.9 Further Reading ...................................... 102 4 Categorizing and Tagging Words 103 4.1 Introduction ........................................ 103 4.2 Getting Started with Tagging ............................... 105 4.2.1 Representing Tags and Reading Tagged Corpora ................ 105 4.2.2 Nouns and Verbs ................................. 107 4.2.3 Nouns and Verbs in Tagged Corpora ....................... 109 4.2.4 The Default Tagger ................................ 109 4.2.5 Exercises ..................................... 111 4.3 Looking for Patterns in Words .............................. 112 4.3.1 Some Morphology ................................ 112 4.3.2 The Regular Expression Tagger ......................... 113 4.3.3 Exercises ..................................... 114 4.4 Baselines and Backoff .................................. 115 4.4.1 The Lookup Tagger ................................ 115 4.4.2 Backoff ...................................... 116 4.4.3 Choosing a Good Baseline ............................ 116 4.4.4 Exercises ..................................... 116 4.5 Getting Better Coverage ................................. 118 4.5.1 More English Word Classes ........................... 118 4.5.2 Some Diagnostics ................................ 119 4.5.3 Unigram Tagging ................................. 119 4.5.4 Affix Taggers ................................... 120 4.5.5 N-Gram Taggers ................................. 120 4.5.6 Combining Taggers ................................ 121 4.5.7 Tagging Unknown Words ............................ 122 4.5.8 Storing Taggers .................................. 122 4.5.9 Exercises ..................................... 123 4.6 Summary ......................................... 124 4.7 Further Reading ...................................... 124 4.8 Appendix: Brown Tag Set ................................ 125 4.8.1 Acknowledgments ................................ 126 5 Data-Intensive Language Processing 127 Bird, Klein & Loper 5 January 24, 2008 CONTENTS II PARSING 129 6 Structured Programming in Python 133 6.1 Introduction ........................................ 133 6.2 Back to the Basics ..................................... 133 6.2.1 Assignment .................................... 134 6.2.2 Sequences: Strings, Lists and Tuples ....................... 135 6.2.3 Combining Different Sequence Types ...................... 136 6.2.4 Stacks and Queues ................................ 137 6.2.5 More List Comprehensions ............................ 138 6.2.6 Dictionaries .................................... 140 6.2.7 Exercises ..................................... 141 6.3 Presenting Results ..................................... 143 6.3.1 Strings and Formats ............................... 144 6.3.2 Lining Things Up ................................. 144 6.3.3 Writing Results to a File ............................. 145 6.3.4 Graphical Presentation .............................. 148 6.3.5 Exercises ..................................... 148 6.4 Functions ......................................... 149 6.4.1 Function Arguments ............................... 151 6.4.2 An Important Subtlety .............................. 153 6.4.3 Functional Decomposition ............................ 153 6.4.4 Documentation (notes) .............................. 154 6.4.5 Functions as Arguments ............................. 155 6.4.6 Exercises ..................................... 156 6.5 Algorithm Design Strategies ............................... 157 6.5.1 Recursion (notes) ................................. 158 6.5.2 Deeply Nested Objects (notes) .......................... 159 6.5.3 Dynamic Programming .............................. 159 6.5.4 Timing (notes) .................................