Announcements
• Lit Review Part 2
  • Written review of 2 articles, due April 1
• Final Project Proposal
  • Due Monday April 6

Part-of-Speech Tagging
CS 341: Natural Language Processing
Prof. Heather Pon-Barry
www.mtholyoke.edu/courses/ponbarry/cs341.html

Today: POS Tagging
POS Tagging
• Process of assigning a part-of-speech marker to each word in a collection
• Example:
  She/pronoun found/verb herself/pronoun falling/verb ...

Penn Treebank Tagset
[table of Penn Treebank tags with examples]
POS Tagging
• Words often have more than one POS: e.g., back
  • The back door = adjective (JJ)
  • On my back = noun (NN)
  • Win the voters back = adverb (RB)
  • Promised to back the bill = verb (VB)
• The POS tagging problem is to determine the POS tag for a particular instance of a word.
Applications
• Speech synthesis
  • “I object” vs. “This object...”
• Parsing
• Machine translation
• Named entity recognition
• Word sense disambiguation

POS Tagging Performance
• How many tags are correct? (Tag accuracy)
  • State of the art: about 97%
  • But baseline is already 90%
• Baseline performance is:
  • Tag every word with its most frequent tag
  • Tag unknown words as nouns
• Partly easy because
  • Many words are unambiguous
  • You get points for them (the, a, etc.) and for punctuation marks!
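The baseline above (most frequent tag per word, unknown words tagged as nouns) can be sketched in a few lines. The tiny training corpus here is invented for illustration:

```python
# Baseline tagger sketch: tag each word with its most frequent training-set
# tag; tag unknown words as NN (noun). Toy corpus invented for illustration.
from collections import Counter, defaultdict

train = [("the", "DT"), ("back", "NN"), ("door", "NN"),
         ("promised", "VBD"), ("to", "TO"), ("back", "VB"),
         ("the", "DT"), ("bill", "NN"), ("back", "NN")]

# Count (word, tag) pairs, then keep the most frequent tag for each word.
counts = defaultdict(Counter)
for word, tag in train:
    counts[word][tag] += 1
most_freq = {w: c.most_common(1)[0][0] for w, c in counts.items()}

def baseline_tag(words):
    return [most_freq.get(w, "NN") for w in words]  # unknown word -> NN

print(baseline_tag(["promised", "to", "back", "the", "proposal"]))
# ['VBD', 'TO', 'NN', 'DT', 'NN']
```

Note the baseline's characteristic error: "back" is tagged NN (its most frequent tag in this toy corpus) even in the verb context "to back", which is exactly why context-sensitive taggers beat the ~90% baseline.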
How difficult is POS Tagging?
• In the Brown corpus:
  • ~11% of the word types are ambiguous with regard to part of speech
  • ~40% of the word tokens are ambiguous
• But they tend to be very common words. E.g., that
  • I know that he is honest = preposition (IN)
  • Yes, that play was nice = determiner (DT)
  • You can’t go that far = adverb (RB)

Automatic POS Tagging
• Symbolic
  • Rule-based
  • Transformation-based
• Probabilistic
  • Hidden Markov models
  • Log-linear models
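The type vs. token distinction above can be made concrete: a type is ambiguous if it appears with more than one tag, and token ambiguity counts every occurrence of such words. A minimal sketch on an invented toy corpus (the Brown-corpus percentages on the slide come from the same computation over real data):

```python
# Type vs. token ambiguity sketch on a toy tagged corpus (invented data).
from collections import defaultdict

tagged_tokens = [("I", "PRP"), ("know", "VBP"), ("that", "IN"),
                 ("that", "DT"), ("play", "NN"), ("was", "VBD"),
                 ("that", "RB"), ("far", "RB")]

# Collect the set of tags seen for each word type.
tags_per_type = defaultdict(set)
for word, tag in tagged_tokens:
    tags_per_type[word].add(tag)

ambiguous_types = {w for w, tags in tags_per_type.items() if len(tags) > 1}

# Fraction of types vs. fraction of tokens that are ambiguous.
type_ambiguity = len(ambiguous_types) / len(tags_per_type)
token_ambiguity = sum(w in ambiguous_types
                      for w, _ in tagged_tokens) / len(tagged_tokens)

print(type_ambiguity, token_ambiguity)  # 1/6 of types, 3/8 of tokens
```

Even in this toy corpus only one type ("that") is ambiguous, yet it accounts for a much larger share of the tokens, mirroring the ~11% vs. ~40% gap on the slide.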
Rule-based Tagging
• Start with a dictionary
• Assign all possible tags to words from the dictionary
• Write rules by hand to selectively remove tags
  • Leaving the correct tag for each word

Rule-based Example
• Assign all possible tags from the dictionary:

  She   promised   to   back   the   bill
  PRP   VBD        TO   NN     DT    NN
        VBN             RB
                        VBN
                        JJ
                        VB
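The two steps above (assign all dictionary tags, then prune with a hand-written rule) can be sketched as follows. The lexicon mirrors the slide's example; the pruning rule is one plausible hand-written constraint of the kind the slide describes:

```python
# Rule-based tagging sketch: build a tag lattice from a dictionary, then
# apply a hand-written rule that removes tags. Lexicon follows the slide's
# "She promised to back the bill" example.
lexicon = {"She": ["PRP"], "promised": ["VBD", "VBN"], "to": ["TO"],
           "back": ["NN", "RB", "VBN", "JJ", "VB"],
           "the": ["DT"], "bill": ["NN"]}

def remove_vbn_after_prp(lattice):
    # Rule: eliminate VBN when VBD is also an option and the previous
    # word is unambiguously tagged PRP.
    for i in range(1, len(lattice)):
        tags = lattice[i]
        if "VBN" in tags and "VBD" in tags and lattice[i - 1] == ["PRP"]:
            lattice[i] = [t for t in tags if t != "VBN"]
    return lattice

words = ["She", "promised", "to", "back", "the", "bill"]
lattice = [list(lexicon[w]) for w in words]
lattice = remove_vbn_after_prp(lattice)
print(lattice[1])  # "promised" is now unambiguously ['VBD']
```

After this one rule, "promised" is resolved to VBD while "back" remains five-ways ambiguous; a real rule-based tagger applies many such rules until each word keeps a single tag.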
Rule-based Example
• Apply a hand-written rule:
  Eliminate VBN if VBD is an option when VBN|VBD follows “…”

  She   promised   to   back   the   bill
  PRP   VBD        TO   NN     DT    NN
                        RB
                        VBN
                        JJ
                        VB

Transformation-based
• Combines rule-based and probabilistic tagging
• Input
  • tagged corpus
  • dictionary (with most frequent tags)
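A transformation-based (Brill-style) tagger starts from each word's most frequent tag and then applies learned transformation rules of the form "change tag A to B in context C". A minimal sketch with one classic rule; the corpus and rule here are invented for illustration, not learned from data:

```python
# Transformation-based tagging sketch: start from most-frequent tags,
# then apply transformation rules "change FROM to TO when the previous
# tag is PREV". Toy data invented for illustration.
initial = [("promised", "VBD"), ("to", "TO"), ("back", "NN"),
           ("the", "DT"), ("bill", "NN")]

# (from_tag, to_tag, previous_tag) -- the classic "NN -> VB after TO" rule.
rules = [("NN", "VB", "TO")]

def apply_rules(tagged):
    tagged = list(tagged)
    for frm, to, prev in rules:
        for i in range(1, len(tagged)):
            word, tag = tagged[i]
            if tag == frm and tagged[i - 1][1] == prev:
                tagged[i] = (word, to)
    return tagged

print(apply_rules(initial))
# "back" flips from NN to VB because it follows TO
```

In the real Brill tagger the rules are learned automatically: at each iteration the system picks the transformation that most reduces error on the tagged training corpus.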
• Example: Brill tagger

HMM: Part-of-Speech Transition Probabilities
[figure: transition probabilities between POS tags]

HMM: Observation Likelihoods P(word|tag)
[figure: likelihood of each word given its tag]
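An HMM tagger combines exactly the two quantities in the figures above: transition probabilities P(tag_i | tag_{i-1}) and observation likelihoods P(word_i | tag_i). A minimal scoring sketch with made-up probabilities (a real tagger estimates these from a tagged corpus and decodes with Viterbi):

```python
# HMM scoring sketch: score a tag sequence as the product of transition
# and emission probabilities. All probabilities are invented toy values.
trans = {("<s>", "PRP"): 0.3, ("PRP", "VBD"): 0.4, ("PRP", "VBN"): 0.05}
emit = {("She", "PRP"): 0.01,
        ("promised", "VBD"): 0.002, ("promised", "VBN"): 0.002}

def score(words, tags):
    p, prev = 1.0, "<s>"
    for w, t in zip(words, tags):
        # P(tag | previous tag) * P(word | tag); unseen pairs get 0.
        p *= trans.get((prev, t), 0.0) * emit.get((w, t), 0.0)
        prev = t
    return p

words = ["She", "promised"]
# The transition model prefers VBD over VBN after PRP, resolving the
# "promised" ambiguity even though the emission probabilities are equal:
print(score(words, ["PRP", "VBD"]) > score(words, ["PRP", "VBN"]))
```

This is the probabilistic counterpart of the hand-written "eliminate VBN after PRP" rule from the rule-based example: the transition table encodes the same preference as a soft constraint.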
Maxent P(tag|word)
• Can do surprisingly well just looking at a word by itself:
  • Word            the: the → DT
  • Prefixes        unfathomable: un- → JJ
  • Suffixes        Importantly: -ly → RB
  • Capitalization  Meridian: CAP → NNP
  • Word shapes     35-year: d-x → JJ
• Then build a classifier to predict tag
• Maxent P(tag|word): 93.7% overall / 82.6% unknown

MEMMs
• Maximum Entropy Markov Model
• A sequence version of the maximum entropy classifier
[figure: features over tags t(i-2), t(i-1) (NNP, MD) and words w(i-1), w(i), w(i+1), predicting VB for “back” in “Janet will back the bill”]
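The word-level cues listed on the Maxent slide (word identity, prefix, suffix, capitalization, word shape) can be sketched as a feature extractor. The feature names here are illustrative, not from any particular tagger:

```python
# Word-level feature sketch for a maxent tagger: word, prefix, suffix,
# capitalization, and word shape (digit -> d, lower -> x, upper -> X).
def word_shape(word):
    shape = []
    for ch in word:
        if ch.isdigit():
            shape.append("d")
        elif ch.isupper():
            shape.append("X")
        elif ch.islower():
            shape.append("x")
        else:
            shape.append(ch)  # punctuation kept as-is
    # Collapse runs of the same class, e.g. "35-year" -> "d-x".
    collapsed = [shape[0]]
    for ch in shape[1:]:
        if ch != collapsed[-1]:
            collapsed.append(ch)
    return "".join(collapsed)

def features(word):
    return {"word": word.lower(),
            "prefix2": word[:2].lower(),
            "suffix2": word[-2:].lower(),
            "capitalized": word[0].isupper(),
            "shape": word_shape(word)}

print(features("35-year")["shape"])        # d-x  (cue for JJ)
print(features("Importantly")["suffix2"])  # ly   (cue for RB)
```

A maxent classifier trained on such feature dictionaries is what reaches the 93.7% overall / 82.6% unknown-word accuracy quoted on the slide; the shape and affix features are what let it generalize to unseen words.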
Slide adapted from Dan Jurafsky

More Features
[figure: MEMM with additional features over t(i-2), t(i-1) and surrounding words, tagging “Janet will back the bill”]
Slide adapted from Dan Jurafsky
MEMM Decoding
• Simplest algorithm
  • Greedy: at each step in the sequence, select the tag that maximizes P(tag | nearby words, nearby tags)
• In practice
  • Viterbi algorithm
  • Beam search

POS Tagging Accuracies
• Rough accuracies:
  • Baseline: most frequent tag: ~90%
  • Trigram HMM: ~95%
  • Maxent P(t|w): 93.7%
  • MEMM tagger: 96.9%
  • Bidirectional MEMM: 97.2%
  • Upper bound: ~98% (human agreement)
Slide adapted from Dan Jurafsky
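Greedy MEMM decoding, the "simplest algorithm" above, can be sketched as a single left-to-right pass. The local scoring table here stands in for a trained log-linear model and its values are invented:

```python
# Greedy MEMM-style decoding sketch: left to right, at each position pick
# the tag maximizing a local score conditioned on the current word and the
# previous tag. Scores are invented; a real MEMM uses a trained model, and
# Viterbi / beam search would keep more than one tag history alive.
scores = {("will", "<s>", "MD"): 0.8, ("will", "<s>", "NN"): 0.2,
          ("back", "MD", "VB"): 0.7, ("back", "MD", "NN"): 0.3}
TAGS = ["MD", "NN", "VB"]

def greedy_decode(words):
    tags, prev = [], "<s>"
    for w in words:
        best = max(TAGS, key=lambda t: scores.get((w, prev, t), 0.0))
        tags.append(best)
        prev = best
    return tags

print(greedy_decode(["will", "back"]))  # ['MD', 'VB']
```

Greedy decoding cannot recover from an early mistake, which is why Viterbi (exact) or beam search (approximate, wider than greedy) are used in practice.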
More Resources
• Stanford POS Tagger
  • http://nlp.stanford.edu/software/tagger.shtml
• CMU Twitter POS tagger
  • http://www.ark.cs.cmu.edu/TweetNLP/

References
• Log-linear models
  • Ratnaparkhi, EMNLP 1996
  • Toutanova et al., NAACL 2003 (cyclic dependency network, bidirectional version of MEMM)
• Excellent recent survey: “Part-of-speech tagging from 97% to 100%: is it time for some linguistics?” (Manning, 2011)
Summary
• Penn Treebank: standard tagset
• Approaches to POS tagging:
  • Symbolic: rule-based, transformation-based
  • Probabilistic: HMMs, MEMMs

Training a Tagger
• Input
  • tagged corpus
  • dictionary (with most frequent tags)
• These are available for English
• What about other languages?

Research in POS Tagging
• Low resource languages
• Learning a Part-of-Speech Tagger from Two Hours of Annotation (Garrette and Baldridge, 2013) [video]