Natural Language Processing with Python CS372: Spring, 2015
Lecture 12
Categorizing and Tagging Words
Jong C. Park
Department of Computer Science
Korea Advanced Institute of Science and Technology

CATEGORIZING AND TAGGING WORDS
• Using a Tagger
• Tagged Corpora
• Mapping Words to Properties Using Python Dictionaries
• Automatic Tagging
• N-Gram Tagging
• Transformation-based Tagging
• How to Determine the Category of a Word
2015-04-09 CS372: NLP with Python

Introduction
Questions
• What are lexical categories, and how are they used in natural language processing?
• What is a good Python data structure for storing words and their categories?
• How can we automatically tag each word of a text with its word class?

Mapping Words to Properties Using Python Dictionaries
• Indexing Lists Versus Dictionaries
• Dictionaries in Python
• Defining Dictionaries
• Default Dictionaries
• Incrementally Updating a Dictionary
• Complex Keys and Values
• Inverting a Dictionary
dictionary data type

Indexing Lists Versus Dictionaries
List
• A text is treated in Python as a list of words.
• We can look up a particular item by giving its index.
• text1[100]
Figure 5-2. List lookup.

Indexing Lists Versus Dictionaries
With frequency distributions, we specify a word and get back a number.
• fdist['monstrous']
Figure 5-3. Dictionary lookup.
Other names for dictionary are map, hashmap, hash, and associative array.

Indexing Lists Versus Dictionaries
In Figure 5-3, we mapped from names to numbers, unlike with a list.
Table 5-4. Linguistic objects as mappings from keys to values.
The mapping is from a “word” to some structured object.

Dictionaries in Python
Python provides a dictionary data type that can be used for mapping between arbitrary types.
pos is defined as an empty dictionary.
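
The contrast between list lookup and dictionary lookup can be sketched as follows. This is a minimal illustration in Python 3; the word list and the entries stored in pos are assumptions chosen for the example, standing in for the text1 and fdist objects named on the slides.

```python
# Illustrative data only (an assumption); the slides use text1 and fdist.
words = ["colorless", "green", "ideas", "sleep", "furiously"]

# List lookup: an integer index maps to an item.
third_word = words[2]

# Dictionary lookup: an immutable key (here a string) maps to a value.
pos = {}                  # pos is defined as an empty dictionary
pos["colorless"] = "ADJ"
pos["ideas"] = "N"
pos["sleep"] = "V"
pos["furiously"] = "ADV"

print(third_word)         # ideas
print(pos["ideas"])       # N
```

A list answers "what is at position 2?", whereas the dictionary answers "what is stored under this word?".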

Dictionaries in Python
We can employ the keys to retrieve values.
Question:
• How do we work out the legal keys for a dictionary, whereas for lists and strings we can use len() to work out which integers are legal indexes?
If the dictionary is not big, we can simply inspect its contents by evaluating the variable pos.

Dictionaries in Python
To just find the keys, we can either convert the dictionary to a list or use the dictionary in a context where a list is expected, as the parameter of sorted() or in a for loop.

Dictionaries in Python
The dictionary methods keys(), values(), and items() allow us to access the keys, values, and key-value pairs as separate lists.

Dictionaries in Python
When we look something up in a dictionary, we get only one value for each key.
However, there is a way of storing multiple values in an entry.
We may use a list value, e.g., pos['sleep'] = ['N', 'V'].
Cf. the CMU Pronouncing Dictionary

Defining Dictionaries
We can use the same key-value pair format to create a dictionary.
Dictionary keys must be immutable types, such as strings and tuples.

Default Dictionaries
If we try to access a key that is not in a dictionary, we get an error.
Since Python 2.5, a special kind of dictionary called a defaultdict has been available.
The default value is produced by a type such as int, float, str, list, dict, or tuple.
When we access a non-existent entry, it is automatically added to the dictionary.

Default Dictionaries
We can use default dictionaries to deal with hapaxes and low-frequency words.
We can replace low-frequency words with a special “out of vocabulary” token.

Incrementally Updating a Dictionary
Example 5-3. Incrementally updating a dictionary, and sorting by value.
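
A minimal sketch of the pattern named in Example 5-3, written for Python 3. The small hand-made list tagged_words is an assumption standing in for a tagged corpus such as Brown; the variable names are illustrative.

```python
from collections import defaultdict
from operator import itemgetter

# Hypothetical sample data standing in for a tagged corpus.
tagged_words = [
    ("the", "DET"), ("dog", "N"), ("barked", "V"),
    ("the", "DET"), ("cat", "N"), ("a", "DET"),
]

# int() returns 0, so a tag seen for the first time starts from zero.
counts = defaultdict(int)
for word, tag in tagged_words:
    counts[tag] += 1

# Sort the (tag, count) pairs by the count (element 1), largest first.
by_frequency = sorted(counts.items(), key=itemgetter(1), reverse=True)
print(by_frequency)  # [('DET', 3), ('N', 2), ('V', 1)]
```

Using defaultdict(int) avoids the explicit "is this key present yet?" check that a plain dictionary would need inside the loop.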

Incrementally Updating a Dictionary
itemgetter(n) returns a function that can be called on some other sequence object to obtain the nth element.

Incrementally Updating a Dictionary
Useful programming idiom:
• We initialize a defaultdict and then use a for loop to update its values.

Incrementally Updating a Dictionary
The following example uses the same pattern to create an anagram dictionary.
NLTK provides a convenient way of accumulating words through nltk.Index().

Complex Keys and Values
Default dictionaries can have complex keys and values.

Inverting a Dictionary
Dictionaries support efficient lookup.
• However, finding a key given a value is slower and more cumbersome.
• If we expect to do this kind of “reverse lookup” often, it helps to construct a dictionary that maps values to keys.

Inverting a Dictionary
Examples of reverse lookup

Inverting a Dictionary
Table 5-5. Python’s dictionary methods.

Automatic Tagging
• The Default Tagger
• The Regular Expression Tagger
• The Lookup Tagger
• Evaluation
>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> brown_sents = brown.sents(categories='news')

The Default Tagger
The simplest possible tagger assigns the same tag to each token.
• It establishes an important baseline.
• In order to get the best result, we tag each word with the most likely tag.
Pros?

The Regular Expression Tagger
Assign tags to tokens on the basis of matching patterns.
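
Pattern-based tagging can be sketched with the standard re module alone. The pattern list, the tag names, and the helper regexp_tag below are illustrative assumptions, not the code from the slides; NLTK's nltk.RegexpTagger is built from a similar list of (pattern, tag) pairs.

```python
import re

# Hypothetical (pattern, tag) pairs, tried in order; the first match wins.
PATTERNS = [
    (r".*ing$", "VBG"),                # gerunds
    (r".*ed$", "VBD"),                 # simple past
    (r".*es$", "VBZ"),                 # 3rd person singular present
    (r".*'s$", "NN$"),                 # possessive nouns
    (r".*s$", "NNS"),                  # plural nouns
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),  # cardinal numbers
    (r".*", "NN"),                     # everything else defaults to noun
]


def regexp_tag(tokens, patterns=PATTERNS):
    """Tag each token with the tag of the first pattern that matches it."""
    compiled = [(re.compile(pattern), tag) for pattern, tag in patterns]
    tagged = []
    for token in tokens:
        for regex, tag in compiled:
            if regex.match(token):
                tagged.append((token, tag))
                break
    return tagged


print(regexp_tag(["walking", "jumped", "dogs", "42", "table"]))
# [('walking', 'VBG'), ('jumped', 'VBD'), ('dogs', 'NNS'), ('42', 'CD'), ('table', 'NN')]
```

The order of the patterns matters: the catch-all ".*" must come last, otherwise every token would be tagged as a noun.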

The Lookup Tagger
Find the hundred most frequent words and store their most likely tag.
• Use it as the model for a “lookup tagger”.

The Lookup Tagger
Example 5-4. Lookup tagger performance with varying model size.

Evaluation
We evaluate the performance of a tagger relative to the tags a human expert would assign.
• Since we usually don’t have access to an expert and impartial human judge, we make do instead with gold standard test data.
• The tagger is regarded as being correct if the tag it guesses for a given word is the same as the gold standard tag.

Summary
Mapping Words to Properties Using Python Dictionaries
• Indexing Lists Versus Dictionaries
• Dictionaries in Python
• Defining Dictionaries
• Default Dictionaries
• Incrementally Updating a Dictionary
• Complex Keys and Values
• Inverting a Dictionary

Summary
Automatic Tagging
• The Default Tagger
• The Regular Expression Tagger
• The Lookup Tagger
• Evaluation

Project: First Presentation
30 April, 2015 (in class)
Prepare a 5-minute presentation for your term project (approximately 7 slides).
• The project must have a clear I/O.
• Explain the measure of the quality of the output.
• Give a measure of how good your system is (together with a prediction of your system’s performance against this measure).
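
One concrete measure of output quality is the tagging accuracy described under Evaluation. The sketch below builds a small lookup tagger and scores it against gold standard tags. The training and gold data, the function names build_lookup_model and accuracy, and the default_tag parameter are all illustrative assumptions; a real experiment would use a tagged corpus such as Brown.

```python
from collections import Counter, defaultdict

# Hypothetical training data: (word, tag) pairs.
train = [
    ("the", "DET"), ("dog", "N"), ("the", "DET"), ("run", "V"),
    ("the", "DET"), ("dog", "N"), ("run", "N"), ("the", "DET"),
    ("run", "V"), ("cat", "N"),
]

# Hypothetical gold standard test data.
gold = [("the", "DET"), ("dog", "N"), ("run", "V"), ("cat", "N")]


def build_lookup_model(tagged_words, size):
    """Map each of the `size` most frequent words to its most frequent tag."""
    word_freq = Counter(word for word, _ in tagged_words)
    tag_freq = defaultdict(Counter)
    for word, tag in tagged_words:
        tag_freq[word][tag] += 1
    return {
        word: tag_freq[word].most_common(1)[0][0]
        for word, _ in word_freq.most_common(size)
    }


def accuracy(model, gold_tagged, default_tag=None):
    """Fraction of tokens whose guessed tag equals the gold standard tag."""
    correct = sum(
        model.get(word, default_tag) == tag for word, tag in gold_tagged
    )
    return correct / len(gold_tagged)


for size in (1, 2):
    model = build_lookup_model(train, size)
    print(size, accuracy(model, gold), accuracy(model, gold, default_tag="N"))
# 1 0.25 0.75
# 2 0.5 1.0
```

Backing off to a default tag for words outside the model raises accuracy in this toy setting, and a larger model size helps as well; the exact numbers depend entirely on the made-up data.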