
Part-of-Speech Tagging

CS 341: Natural Language Processing
Prof. Heather Pon-Barry
www.mtholyoke.edu/courses/ponbarry/cs341.html

Announcements

• Lit Review Part 2
  • Written review of 2 articles, due April 1
• Final Project Proposal
  • Due Monday April 6

Today

• POS Tagging

POS Tagging

• Process of assigning a part-of-speech marker to each word in a collection

  She/pronoun found/verb herself/pronoun falling/verb ...

Penn Treebank Tagset

[table: the Penn Treebank POS tags]

Words often have more than one POS: e.g., back

• The back door = adjective (JJ)
• On my back = noun (NN)
• Win the voters back = adverb (RB)
• Promised to back the bill = verb (VB)

• The POS tagging problem is to determine the POS tag for a particular instance of a word.

Applications

• Named entity recognition
• Word sense disambiguation

POS Tagging Performance

• How many tags are correct? (Tag accuracy)
• State of the art: about 97%
  • “I ...” vs. “This object ...”
• But baseline is already 90%
• Baseline performance:
  • Tag every word with its most frequent tag
  • Tag unknown words as nouns
• Partly easy because many words are unambiguous
• You get points for unambiguous words (the, a, etc.) and for punctuation marks!
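The most-frequent-tag baseline described above is easy to make concrete. A minimal Python sketch (the toy tagged corpus is made up for illustration):

```python
from collections import Counter, defaultdict

# Toy tagged corpus (hypothetical): a list of (word, tag) pairs.
corpus = [
    ("the", "DT"), ("back", "JJ"), ("door", "NN"),
    ("on", "IN"), ("my", "PRP$"), ("back", "NN"),
    ("win", "VB"), ("the", "DT"), ("voters", "NNS"), ("back", "RB"),
    ("promised", "VBD"), ("to", "TO"), ("back", "VB"),
    ("the", "DT"), ("bill", "NN"),
]

# Count tags per word, then keep each word's most frequent tag.
counts = defaultdict(Counter)
for word, tag in corpus:
    counts[word][tag] += 1
most_frequent = {w: c.most_common(1)[0][0] for w, c in counts.items()}

def baseline_tag(words):
    # Unknown words are tagged as nouns (NN), per the baseline above.
    return [(w, most_frequent.get(w, "NN")) for w in words]

print(baseline_tag(["the", "back", "room"]))
```

Despite ignoring all context, this scheme already gets about 90% of tokens right on real corpora, because most tokens are unambiguous.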

How difficult is POS Tagging?

• In the Brown corpus:
  • ~ 11% of the word types are ambiguous with regard to part of speech
  • ~ 40% of the word tokens are ambiguous
• But they tend to be very common words. E.g., that
  • I know that he is honest = preposition (IN)
  • Yes, that play was nice = determiner (DT)
  • You can’t go that far = adverb (RB)

Automatic POS Tagging

• Symbolic
  • Rule-based
  • Transformation-based
• Probabilistic
  • Hidden Markov models
  • Log-linear models

Rule-based Tagging

• Start with a dictionary
• Assign all possible tags to words from the dictionary
• Write rules by hand to selectively remove tags
• Leaving the correct tag for each word

Rule-based Example

• Assign all possible tags from the dictionary:

  She      → PRP
  promised → VBN, VBD
  to       → TO
  back     → VB, JJ, RB, NN
  the      → DT
  bill     → NN

Rule-based Example

• Apply rule: Eliminate VBN if VBD is an option when VBN|VBD follows “<start> PRP”

  She      → PRP
  promised → VBD
  to       → TO
  back     → VB, JJ, RB, NN
  the      → DT
  bill     → NN

Transformation-based

• Combines rule-based and probabilistic tagging
  • like rule-based, rules are used to specify tags in a certain environment
  • like probabilistic, we use a tagged corpus to find the best performing rules (supervised learning)
• Input
  • tagged corpus
  • dictionary (with most frequent tags)
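The rule-based recipe — assign every dictionary tag, then remove tags with hand-written rules — can be sketched in a few lines of Python. The lexicon mirrors the “She promised to back the bill” example; the code itself is illustrative, not from the slides:

```python
# Dictionary: every tag each word can have (toy lexicon from the example).
lexicon = {
    "She": ["PRP"],
    "promised": ["VBN", "VBD"],
    "to": ["TO"],
    "back": ["VB", "JJ", "RB", "NN"],
    "the": ["DT"],
    "bill": ["NN"],
}

def rule_based_tag(words):
    # Step 1: assign all possible tags from the dictionary.
    candidates = [list(lexicon[w]) for w in words]
    # Step 2: hand-written rule -- eliminate VBN if VBD is an option
    # when VBN|VBD follows "<start> PRP" (sentence-initial pronoun).
    for i in range(1, len(words)):
        cur = candidates[i]
        if i == 1 and candidates[0] == ["PRP"] and "VBN" in cur and "VBD" in cur:
            cur.remove("VBN")
    return candidates

print(rule_based_tag(["She", "promised", "to", "back", "the", "bill"]))
# "promised" is narrowed to VBD; "back" stays ambiguous until more rules fire.
```

Note that one rule rarely resolves every word; real rule-based taggers stack many such constraints.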

• Example: Brill tagger

HMM: Part-of-Speech Transition Probabilities

[figure: matrix of transition probabilities P(tag | previous tag)]

Observation Likelihoods: P(word|tag)

[figure: matrix of observation likelihoods P(word | tag)]
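An HMM tagger combines the transition probabilities P(tag | previous tag) with the observation likelihoods P(word | tag) and finds the best tag sequence with the Viterbi algorithm. A toy sketch (all probabilities are invented for illustration):

```python
import math

# Tiny hand-made HMM over three tags (numbers are made up).
tags = ["DT", "NN", "VB"]
start = {"DT": 0.6, "NN": 0.3, "VB": 0.1}          # P(tag at sentence start)
trans = {                                           # P(tag | previous tag)
    "DT": {"DT": 0.05, "NN": 0.85, "VB": 0.10},
    "NN": {"DT": 0.10, "NN": 0.30, "VB": 0.60},
    "VB": {"DT": 0.50, "NN": 0.30, "VB": 0.20},
}
emit = {                                            # P(word | tag)
    "DT": {"the": 0.7, "a": 0.3},
    "NN": {"back": 0.4, "bill": 0.6},
    "VB": {"back": 0.5, "bill": 0.5},
}

def viterbi(words):
    # V[i][t] = best log-prob of any tag sequence ending in tag t at word i.
    V = [{t: math.log(start[t]) + math.log(emit[t].get(words[0], 1e-12))
          for t in tags}]
    backptr = [{}]
    for i in range(1, len(words)):
        V.append({})
        backptr.append({})
        for t in tags:
            best = max(tags, key=lambda p: V[i - 1][p] + math.log(trans[p][t]))
            V[i][t] = (V[i - 1][best] + math.log(trans[best][t])
                       + math.log(emit[t].get(words[i], 1e-12)))
            backptr[i][t] = best
    # Trace back the best path from the highest-scoring final tag.
    last = max(tags, key=lambda t: V[-1][t])
    path = [last]
    for i in range(len(words) - 1, 0, -1):
        path.append(backptr[i][path[-1]])
    return list(reversed(path))

print(viterbi(["the", "back"]))  # → ['DT', 'NN']
```

Here “back” is tagged NN rather than VB because the DT→NN transition dominates, which is exactly the context an HMM adds over the word-by-word baseline.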

Maxent P(tag|word)

• Can do surprisingly well just looking at a word by itself:
  • Word            the: the → DT
  • Prefixes        unfathomable: un- → JJ
  • Suffixes        Importantly: -ly → RB
  • Capitalization  Meridian: CAP → NNP
  • Word shapes     35-year: d-x → JJ
• Then build a classifier to predict tag
• Maxent P(tag|word): 93.7% overall / 82.6% unknown

MEMMs

• Maximum Entropy Markov Model
• A sequence version of the maximum entropy classifier

[figure: MEMM predicting tag t_i from previous tags t_{i-2}, t_{i-1} and words w_{i-1}, w_i, w_{i+1}; example sentence “Janet will back the bill” tagged NNP MD VB ...]

Slide adapted from Dan Jurafsky

MEMMs: More Features

[figure: MEMM diagram for “Janet will back the bill” with additional features over nearby words and tags]

MEMM Decoding

• Simplest algorithm
  • Greedy: at each step in sequence, select tag that maximizes P(tag | nearby words, nearby tags)
• In practice
  • Viterbi algorithm
  • Beam search

POS Tagging Accuracies

• Rough accuracies:
  • Baseline: most freq tag: ~90%
  • HMM: ~95%
  • Maxent P(t|w): 93.7%
  • MEMM tagger: 96.9%
  • Bidirectional MEMM: 97.2%
  • Upper bound: ~98% (human agreement)
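The greedy decoding strategy from the MEMM decoding slide can be sketched as follows. A real MEMM would score each tag with a trained maximum entropy classifier over nearby words and tags; the scoring tables here are a toy stand-in:

```python
# Toy local scores (made up): a word-tag preference table plus one
# tag-bigram preference standing in for P(tag | nearby words, nearby tags).
TAGS = ["DT", "NN", "VB", "MD", "NNP"]
word_tag_score = {
    ("Janet", "NNP"): 2.0, ("will", "MD"): 2.0, ("will", "NN"): 1.0,
    ("back", "VB"): 1.0, ("back", "NN"): 1.0, ("the", "DT"): 2.0,
    ("bill", "NN"): 2.0,
}
tag_bigram_score = {("MD", "VB"): 1.5}  # modals tend to precede base verbs

def score(word, tag, prev_tag):
    return (word_tag_score.get((word, tag), 0.0)
            + tag_bigram_score.get((prev_tag, tag), 0.0))

def greedy_decode(words):
    # Left to right: commit to the locally best tag at each step.
    tags, prev = [], "<s>"
    for w in words:
        best = max(TAGS, key=lambda t: score(w, t, prev))
        tags.append(best)
        prev = best
    return tags

print(greedy_decode(["Janet", "will", "back", "the", "bill"]))
# → ['NNP', 'MD', 'VB', 'DT', 'NN']
```

Greedy decoding can never revise an early mistake, which is why Viterbi or beam search is used in practice.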

More Resources

• Stanford POS Tagger (cyclic dependency network, bidirectional version of MEMM)
  • http://nlp.stanford.edu/software/tagger.shtml
• CMU Twitter POS tagger
  • http://www.ark.cs.cmu.edu/TweetNLP/

References

• Log-linear models
  • Ratnaparkhi, EMNLP 1996
  • Toutanova et al., NAACL 2003
• Excellent recent survey: “Part-of-speech tagging from 97% to 100%: is it time for some linguistics?” (Manning, 2011)

Summary

• Penn Treebank: standard tagset
• Approaches to POS tagging:
  • Symbolic: rule-based, transformation-based
  • Probabilistic: HMMs, MEMMs

Training a Tagger

• Input
  • tagged corpus
  • dictionary (with most frequent tags)
• These are available for English
• What about other languages?

Research in POS Tagging

• Low resource languages

• Learning a Part-of-Speech Tagger from Two Hours of Annotation (Garrette and Baldridge, 2013) [video]