Natural Language Processing with Python CS372: Spring, 2015

Lecture 12 Categorizing and Tagging

Jong C. Park
Department of Computer Science
Korea Advanced Institute of Science and Technology

CATEGORIZING AND TAGGING WORDS

• Using a Tagger
• Tagged Corpora
• Mapping Words to Properties Using Python Dictionaries
• Automatic Tagging
• N-Gram Tagging
• Transformation-based Tagging
• How to Determine the Category of a Word

2015-04-09 CS372: NLP with Python

Introduction

Questions
• What are lexical categories, and how are they used in natural language processing?
• What is a good Python data structure for storing words and their categories?
• How can we automatically tag each word of a text with its word class?

Mapping Words to Properties Using Python Dictionaries

• Indexing Lists Versus Dictionaries
• Dictionaries in Python
• Defining Dictionaries
• Default Dictionaries
• Incrementally Updating a Dictionary
• Complex Keys and Values
• Inverting a Dictionary

Indexing Lists Versus Dictionaries

List
• A text is treated in Python as a list of words.
• We can look up a particular item by giving its index, e.g., text1[100].

Figure 5-2. List lookup.

With frequency distributions, we specify a word and get back a number, e.g., fdist['monstrous'].

Figure 5-3. Dictionary lookup.

Other names for a dictionary are map, hashmap, hash, and associative array.
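
The contrast between the two kinds of lookup can be sketched as follows. This is a minimal illustration, assuming a short made-up word list in place of text1 and a collections.Counter standing in for a frequency distribution; the variable names are hypothetical.

```python
from collections import Counter

# A text as a list of words: we look up an item by its integer index.
text = "the monstrous whale and the monstrous sea".split()
third_word = text[2]

# A frequency distribution as a dictionary: we look up a count by word.
fdist = Counter(text)
monstrous_count = fdist['monstrous']

print(third_word, monstrous_count)
```

The list maps a position to a word, while the dictionary maps a word to a number.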

In Figure 5-3, we mapped from names to numbers, whereas a list maps from numbers (indexes) to items.

Table 5-4. Linguistic objects as mappings from keys to values.

The mapping is from a “word” to some structured object.

Dictionaries in Python

Python provides a dictionary data type that can be used for mapping between arbitrary types.

pos is defined as an empty dictionary.
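
A minimal sketch of this step, assuming a handful of illustrative words and simplified tags (the particular entries are not taken from the slides):

```python
# Start with an empty dictionary that will map words to part-of-speech tags.
pos = {}

# Add entries by assigning a value to a key.
pos['colorless'] = 'ADJ'
pos['ideas'] = 'N'
pos['sleep'] = 'V'
pos['furiously'] = 'ADV'

# Retrieve a value by giving its key.
print(pos['ideas'])
```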

We can employ the keys to retrieve values.

Question:
• For lists and strings, we can use len() to work out which integers are legal indexes. How do we work out the legal keys for a dictionary?

If the dictionary is not big, we can simply inspect its contents by evaluating the variable pos.

To just find the keys, we can either convert the dictionary to a list or use the dictionary in a context where a list is expected, such as the parameter of sorted() or a for loop.
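
The three options can be sketched as below, assuming a small illustrative pos dictionary (the entries are hypothetical):

```python
pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}

# Convert the dictionary to a list of its keys.
keys_as_list = list(pos)

# Pass the dictionary to sorted(), which iterates over the keys.
keys_sorted = sorted(pos)

# Iterate over the keys in a for loop (here inside a comprehension).
words_ending_in_s = [word for word in pos if word.endswith('s')]

print(keys_sorted)
```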

The dictionary methods keys(), values(), and items() allow us to access the keys, values, and key-value pairs separately (as lists in Python 2, and as view objects in Python 3).
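
A short sketch of the three methods, assuming the same kind of illustrative pos dictionary. The list() calls are only needed in Python 3, where these methods return view objects rather than lists.

```python
pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}

words = list(pos.keys())    # the keys
tags = list(pos.values())   # the values
pairs = list(pos.items())   # (key, value) tuples

# items() is convenient for looping over keys and values together.
for word, tag in sorted(pos.items()):
    print(word + ':', tag)
```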

When we look something up in a dictionary, we get only one value for each key.

However, there is a way of storing multiple values in an entry: we may use a list value, e.g., pos['sleep'] = ['N', 'V'].

Cf. the CMU Pronouncing Dictionary, which stores multiple pronunciations for a single word.
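
A minimal sketch of list-valued entries, assuming a few illustrative words; the name ambiguous_words is hypothetical.

```python
# Each word maps to a list of possible tags instead of a single tag.
pos = {'colorless': ['ADJ'], 'ideas': ['N'], 'sleep': ['N']}

# Record a second tag for an existing entry.
pos['sleep'].append('V')

# Words with more than one tag are ambiguous.
ambiguous_words = [word for word, tags in pos.items() if len(tags) > 1]

print(pos['sleep'], ambiguous_words)
```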

Defining Dictionaries

We can use the same key-value pair format to create a dictionary.

Dictionary keys must be immutable types, such as strings and tuples.
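
Both points can be sketched as follows, assuming illustrative entries. The names pos_alt, bigram_counts, and error_message are hypothetical.

```python
# Two equivalent ways to define a dictionary from key-value pairs.
pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}
pos_alt = dict(colorless='ADJ', ideas='N', sleep='V', furiously='ADV')

# A tuple is immutable, so it is a legal key.
bigram_counts = {('very', 'good'): 3}

# A list is mutable, so using it as a key raises a TypeError.
try:
    bad = {['very', 'good']: 3}
except TypeError as err:
    error_message = str(err)

print(error_message)
```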

Default Dictionaries

If we try to access a key that is not in a dictionary, we get an error (a KeyError).

Since Python 2.5, a special kind of dictionary called a defaultdict has been available. We supply a parameter that is used to create the default value, e.g., int, float, str, list, dict, or tuple.

When we access a non-existent entry, it is automatically added to the dictionary with that default value.
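
A minimal sketch of this behavior, assuming illustrative words and tags; the variable names are hypothetical.

```python
from collections import defaultdict

# int() produces 0, so missing entries start out as 0.
frequency = defaultdict(int)
frequency['colorless'] = 4
missing_count = frequency['ideas']   # not present: added with value 0

# list() produces [], so missing entries start out as empty lists.
pos = defaultdict(list)
pos['sleep'] = ['N', 'V']
missing_tags = pos['ideas']          # not present: added with value []

# Any function of no arguments can supply the default, e.g. a lambda.
pos_default_noun = defaultdict(lambda: 'N')
pos_default_noun['colorless'] = 'ADJ'
guessed = pos_default_noun['blog']   # not present: added with value 'N'

print(missing_count, missing_tags, guessed)
```

Note that merely reading a missing key inserts it, so the key 'blog' is in pos_default_noun after the lookup.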

We can use default dictionaries to deal with hapaxes and low-frequency words.

We can replace low-frequency words with a special “out of vocabulary” token.
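
One way to do this replacement is sketched below. It assumes a toy sentence, a vocabulary of the 3 most frequent words, and the token name 'UNK'; all three are choices made for illustration.

```python
from collections import Counter, defaultdict

tokens = "the cat sat on the mat and the dog sat on the log".split()

# Keep only the most frequent words as the known vocabulary.
vocab_size = 3
most_common_words = [word for word, _ in Counter(tokens).most_common(vocab_size)]

# Known words map to themselves; everything else falls back to 'UNK'.
mapping = defaultdict(lambda: 'UNK')
for word in most_common_words:
    mapping[word] = word

tokens_with_unk = [mapping[word] for word in tokens]
print(tokens_with_unk)
```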

Incrementally Updating a Dictionary

Example 5-3. Incrementally updating a dictionary, and sorting by value.

itemgetter(n) returns a function that can be called on some other sequence object to obtain the nth element.
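
The pattern of counting and then sorting by value can be sketched as follows. This is not the original Example 5-3 listing; it assumes a small hand-made list of tagged words in place of a tagged corpus.

```python
from collections import defaultdict
from operator import itemgetter

# Hypothetical tagged words standing in for a tagged corpus.
tagged_words = [
    ('the', 'DET'), ('very', 'ADV'), ('big', 'ADJ'), ('old', 'ADJ'),
    ('red', 'ADJ'), ('dog', 'N'), ('barked', 'V'), ('loudly', 'ADV'),
]

# Incrementally update the count for each tag.
counts = defaultdict(int)
for word, tag in tagged_words:
    counts[tag] += 1

# itemgetter(1) picks the count out of each (tag, count) pair,
# so the pairs are sorted by value, largest first.
ranked = sorted(counts.items(), key=itemgetter(1), reverse=True)
tags_by_frequency = [tag for tag, count in ranked]

print(ranked[:2])
```

Calling itemgetter(1)(('ADJ', 3)) returns 3, which is why it works as a sort key here.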

Useful programming idiom:
• We initialize a defaultdict and then use a for loop to update its values.

The following example uses the same pattern to create an anagram dictionary.

NLTK provides a convenient way of accumulating words through nltk.Index().
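
A minimal sketch of the anagram dictionary, assuming a short hand-picked word list instead of a full wordlist corpus. It uses only the standard library, so it shows the defaultdict pattern rather than nltk.Index().

```python
from collections import defaultdict

# Hypothetical word list; a real run would use a large wordlist corpus.
words = ['listen', 'silent', 'enlist', 'google', 'tinsel', 'inlets', 'banana']

# Words that are anagrams of each other share the same sorted letters,
# so the sorted letters serve as the dictionary key.
anagrams = defaultdict(list)
for word in words:
    key = ''.join(sorted(word))
    anagrams[key].append(word)

print(anagrams['eilnst'])
```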

Complex Keys and Values

Default dictionaries can have complex keys and values.
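
As an illustration, the sketch below uses (previous tag, word) tuples as keys and nested default dictionaries of tag counts as values. The tagged phrase is made up for the example, and the variable names are hypothetical.

```python
from collections import defaultdict

# Hypothetical tagged phrase in which 'right' occurs with two different tags.
tagged_words = [
    ('take', 'V'), ('the', 'DET'), ('right', 'N'), ('turn', 'N'),
    ('at', 'P'), ('the', 'DET'), ('right', 'ADJ'), ('time', 'N'),
]

# Key: (tag of the previous word, current word).
# Value: a dictionary counting the tags seen for the current word.
pos = defaultdict(lambda: defaultdict(int))
for (_, prev_tag), (word, tag) in zip(tagged_words, tagged_words[1:]):
    pos[(prev_tag, word)][tag] += 1

tags_after_det = dict(pos[('DET', 'right')])
print(tags_after_det)
```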

Inverting a Dictionary

Dictionaries support efficient lookup.
• However, finding a key given a value is slower and more cumbersome.
• If we expect to do this kind of “reverse lookup” often, it helps to construct a dictionary that maps values to keys.

Examples of reverse lookup
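
Reverse lookup can be sketched in three steps, assuming an illustrative pos dictionary: a slow search over all items, a simple inversion that works when values are unique, and an inversion into lists when several keys share a value. The names pos_inverted and tag_to_words are hypothetical.

```python
from collections import defaultdict

pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}

# Slow reverse lookup: scan every item for the wanted value.
words_tagged_v = [word for word, tag in pos.items() if tag == 'V']

# If every value is unique, swap keys and values to invert the dictionary.
pos_inverted = {tag: word for word, tag in pos.items()}

# If several keys share a value, collect the keys for each value in a list.
pos.update({'cats': 'N', 'scratch': 'V', 'peacefully': 'ADV', 'old': 'ADJ'})
tag_to_words = defaultdict(list)
for word, tag in pos.items():
    tag_to_words[tag].append(word)

print(pos_inverted['N'], sorted(tag_to_words['ADV']))
```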

2015-04-09 CS372: NLP with Python 23 Inverting a Dictionary

Table 5-5. Python’s dictionary methods.

2015-04-09 CS372: NLP with Python 24 Automatic Tagging

• The Default Tagger
• The Regular Expression Tagger
• The Lookup Tagger
• Evaluation

>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> brown_sents = brown.sents(categories='news')

2015-04-09 CS372: NLP with Python 25 The Default Tagger

The simplest possible tagger assigns the same tag to each token.
• It establishes an important baseline.
• In order to get the best result, we tag each word with the most likely tag.
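
The idea can be sketched in plain Python as below. NLTK provides nltk.DefaultTagger for this purpose; the sketch avoids NLTK so that it runs without the Brown corpus, and it assumes two hand-made tagged sentences with Brown-style tags. The function name default_tag is hypothetical.

```python
from collections import Counter

# Hypothetical tagged sentences standing in for brown_tagged_sents.
tagged_sents = [
    [('the', 'AT'), ('jury', 'NN'), ('said', 'VBD'), ('the', 'AT'),
     ('election', 'NN'), ('was', 'BEDZ'), ('fair', 'JJ')],
    [('the', 'AT'), ('mayor', 'NN'), ('praised', 'VBD'), ('the', 'AT'),
     ('city', 'NN'), ('council', 'NN')],
]

# Find the single most frequent tag in the training data.
tag_counts = Counter(tag for sent in tagged_sents for _, tag in sent)
most_likely_tag = tag_counts.most_common(1)[0][0]

def default_tag(tokens, tag=most_likely_tag):
    """Assign the same tag to every token."""
    return [(token, tag) for token in tokens]

tagged = default_tag("I do not like green eggs".split())
print(tagged)
```

Every token receives the same tag, so the tagger is only right for words that actually carry that tag. It still tags unseen words, which makes it useful as a fallback.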

Pros?

The Regular Expression Tagger

Assign tags to tokens on the basis of matching patterns.
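
A minimal sketch of pattern-based tagging using only the re module. NLTK offers nltk.RegexpTagger for this; the patterns and Brown-style tags below are illustrative assumptions, and the function name regexp_tag is hypothetical. Patterns are tried in order and the first match wins.

```python
import re

# (pattern, tag) pairs, tried in order; the last one is a catch-all.
patterns = [
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # simple past
    (r'.*es$', 'VBZ'),                 # 3rd person singular present
    (r'.*ould$', 'MD'),                # modals
    (r".*'s$", 'NN$'),                 # possessive nouns
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN'),                     # nouns (default)
]
compiled = [(re.compile(pattern), tag) for pattern, tag in patterns]

def regexp_tag(tokens):
    """Tag each token with the tag of the first pattern it matches."""
    tagged = []
    for token in tokens:
        for regex, tag in compiled:
            if regex.match(token):
                tagged.append((token, tag))
                break
    return tagged

print(regexp_tag(['running', 'walked', 'could', 'cats', '42', 'table']))
```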

2015-04-09 CS372: NLP with Python 27 The Lookup Tagger

Find the hundred most frequent words and store their most likely tag.
• Use it as the model for a “lookup tagger”.
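
The construction can be sketched without NLTK as follows. It assumes a tiny hand-made list of tagged words and a model size of 3 instead of 100; the names likely_tags and lookup_tag are hypothetical. Words outside the model receive the given default, which is None unless a fallback tag is supplied.

```python
from collections import Counter, defaultdict

# Hypothetical tagged words standing in for a tagged corpus.
tagged_words = (
    [('the', 'AT')] * 5
    + [('of', 'IN')] * 3
    + [('to', 'TO')] * 2 + [('to', 'IN')]
    + [('jury', 'NN'), ('said', 'VBD'), ('election', 'NN')]
)

# How often each word occurs, and how often each word gets each tag.
word_freq = Counter(word for word, _ in tagged_words)
tags_for_word = defaultdict(Counter)
for word, tag in tagged_words:
    tags_for_word[word][tag] += 1

# The model: most likely tag for each of the most frequent words.
model_size = 3
likely_tags = {
    word: tags_for_word[word].most_common(1)[0][0]
    for word, _ in word_freq.most_common(model_size)
}

def lookup_tag(tokens, model, default=None):
    """Tag known words from the model; unknown words get the default."""
    return [(token, model.get(token, default)) for token in tokens]

sentence = ['the', 'jury', 'went', 'to', 'court']
print(lookup_tag(sentence, likely_tags))
print(lookup_tag(sentence, likely_tags, default='NN'))
```

Supplying a default tag combines the lookup tagger with the default tagger, so that words missing from the model still get a tag.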

2015-04-09 CS372: NLP with Python 28 The Lookup Tagger

Example 5-4. Lookup tagger performance with varying model size.

2015-04-09 CS372: NLP with Python 29 The Lookup Tagger

2015-04-09 CS372: NLP with Python 30 Evaluation

We evaluate the performance of a tagger relative to the tags a human expert would assign.
• Since we usually don’t have access to an expert and impartial human judge, we make do instead with gold standard test data.
• The tagger is regarded as being correct if the tag it guesses for a given word is the same as the gold standard tag.
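
This notion of correctness amounts to token-level accuracy, which can be sketched as below. The function name accuracy and the toy gold and predicted sequences are assumptions for illustration.

```python
def accuracy(predicted, gold):
    """Fraction of tokens whose predicted tag equals the gold standard tag."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        return 0.0
    correct = sum(
        1 for (_, guess), (_, truth) in zip(predicted, gold) if guess == truth
    )
    return correct / len(gold)

# Hypothetical gold standard tags and the output of a tag-everything-NN baseline.
gold = [('the', 'AT'), ('jury', 'NN'), ('said', 'VBD'), ('it', 'PPS')]
predicted = [(word, 'NN') for word, _ in gold]

score = accuracy(predicted, gold)
print(score)
```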

2015-04-09 CS372: NLP with Python 31 Summary

Mapping Words to Properties Using Python Dictionaries
• Indexing Lists Versus Dictionaries
• Dictionaries in Python
• Defining Dictionaries
• Default Dictionaries
• Incrementally Updating a Dictionary
• Complex Keys and Values
• Inverting a Dictionary

2015-04-09 CS372: NLP with Python 32 Summary

Automatic Tagging
• The Default Tagger
• The Regular Expression Tagger
• The Lookup Tagger
• Evaluation

2015-04-09 CS372: NLP with Python 33 Project: First Presentation

30 April, 2015 (in class)

Prepare a 5-minute presentation for your term project (approximately 7 slides).
• The project must have clearly defined input and output (I/O).
• Explain the measure of the quality of the output.
• Give a measure of how good your system is (together with a prediction of your system’s performance against this measure).
