
ICL: Tagging

Steve Renals
[email protected]

Lecture 11, Week 6
Monday 4 October 2004

Overview

• Introduction to parts of speech and tagsets
• Open and closed classes
• Tagging
• Taggers in NLTK
• Evaluating taggers

Parts of speech

• Traditionally, words are divided into classes, or parts of speech
• How many word classes are there? The traditional answer is eight parts of speech: noun, verb, pronoun, preposition, adverb, conjunction, participle, article (often divided further, eg into personal and possessive pronouns)
• More recently, much larger tagsets have been used, eg the Penn Treebank (45 tags) and Susanne (353 tags)

What use are parts of speech?

• They tell us a lot about a word (and about the words near it)
• They tell us what words are likely to occur in the neighbourhood, eg possessive pronouns are likely to be followed by a noun, personal pronouns by a verb
• The pronunciation of a word can depend on its part of speech, eg object, content, discount (useful in speech synthesis)
• Parts of speech help predict the behaviour of similar words, and are a useful component in many NLP systems, eg word-level n-gram models (speech recognition), information retrieval, information extraction, partial parsing

Closed and open classes

• Parts of speech may be categorised as open or closed classes
• Closed classes have a fixed membership (more or less), eg determiners, pronouns, prepositions
• Closed class words are usually function words: frequently occurring, grammatically important, often short (eg of, it, the, in)
• The major open classes are nouns, verbs, adjectives and adverbs

Open classes

• nouns: proper nouns (Scotland, BBC) and common nouns:
  • count nouns (goat, glass)
  • mass nouns (snow, pacifism)
• verbs: actions and processes (run, hope), also auxiliary verbs
• adjectives: properties and qualities (age, colour, value)
• adverbs: modify verbs, or verb phrases, or other adverbs: Unfortunately John walked home extremely slowly yesterday


Closed classes in English

prepositions      on, under, over, to, with, by
determiners       the, a, an, some
pronouns          she, you, I, who
conjunctions      and, but, or, as, when, if
auxiliary verbs   can, may, are
particles         up, down, at, by
numerals          one, two, first, second

The Penn Treebank tagset (1)

CC    Coordinating conjunction   and, but, or
CD    Cardinal number            one, two
DT    Determiner                 the, a, some
EX    Existential there          there
FW    Foreign word               mon dieu
IN    Preposition                of, in, by
JJ    Adjective                  big
JJR   Adjective, comparative     bigger
JJS   Adjective, superlative     biggest
LS    List item marker           1, One
MD    Modal                      can, should
NN    Noun, singular or mass     dog
NNS   Noun, plural               dogs
NNP   Proper noun, singular      Edinburgh
NNPS  Proper noun, plural        Orkneys
PDT   Predeterminer              all, both
POS   Possessive ending          's
PP    Personal pronoun           I, you, she
PP$   Possessive pronoun         my, your, one's
RB    Adverb                     quickly, never
RBR   Adverb, comparative        faster
RBS   Adverb, superlative        fastest

The Penn Treebank tagset (2)

RP    Particle                     up, off
SYM   Symbol                       +, %, &
TO    "to"                         to
UH    Interjection                 oh, oops
VB    Verb, base form              eat
VBD   Verb, past tense             ate
VBG   Verb, gerund                 eating
VBN   Verb, past participle        eaten
VBP   Verb, non-3sg present        eat
VBZ   Verb, 3sg present            eats
WDT   Wh-determiner                which, that
WP    Wh-pronoun                   what, who
WP$   Possessive wh-pronoun        whose
WRB   Wh-adverb                    how, where
$     Dollar sign                  $
#     Pound sign                   #
“     Left quote                   ‘, “
”     Right quote                  ’, ”
(     Left paren                   (
)     Right paren                  )
,     Comma                        ,
.     Sentence-final punctuation   . ! ?
:     Mid-sentence punctuation     : ; — ...

Tagging in NLTK

The simplest possible tagger tags everything as a noun:

from nltk.tokenizer import *
from nltk.tagger import *

text_token = Token(TEXT=
    'There are 11 players in a football team')
WhitespaceTokenizer().tokenize(text_token)
NN_tagger = DefaultTagger('NN')
NN_tagger.tag(text_token)

>>> print text_token
<[<There/NN>, <are/NN>, <11/NN>, <players/NN>, <in/NN>,
  <a/NN>, <football/NN>, <team/NN>]>
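The snippet above uses the NLTK tokenizer/tagger API as it existed in 2004. A rough sketch of the same "everything is a noun" tagger with the current nltk.tag module, assuming a recent NLTK install (plain str.split() stands in for the whitespace tokenizer):

from nltk.tag import DefaultTagger

tokens = 'There are 11 players in a football team'.split()
nn_tagger = DefaultTagger('NN')     # assigns the tag 'NN' to every token
print(nn_tagger.tag(tokens))
# [('There', 'NN'), ('are', 'NN'), ('11', 'NN'), ('players', 'NN'), ...]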


Tagging

• Definition: Tagging is the assignment of a single part-of-speech tag to each word (and punctuation marker) in a corpus. For example:
  “/“ The/DT guys/NNS that/WDT make/VBP traditional/JJ hardware/NN are/VBP really/RB being/VBG obsoleted/VBN by/IN microprocessor-based/JJ machines/NNS ,/, ”/” said/VBD Mr./NNP Benton/NNP ./.
• Non-trivial: POS tagging must resolve ambiguities, since the same word can have different tags in different contexts
• In the Brown corpus 11.5% of word types and 40% of word tokens are ambiguous
• In many cases one tag is much more likely for a given word than any other

A regular expression tagger

We can use regular expressions to tag tokens based on regularities in the text, eg numerals:

NN_CD_tagger = RegexpTagger([
    (r'^[0-9]+(.[0-9]+)?$', 'CD'),
    (r'.*', 'NN')])
NN_CD_tagger.tag(text_token)

>>> print text_token
<[<There/NN>, <are/NN>, <11/CD>, <players/NN>, <in/NN>,
  <a/NN>, <football/NN>, <team/NN>]>
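As above, the RegexpTagger call uses the 2004-era interface. A comparable sketch with the current nltk.tag.RegexpTagger; the escaped \. in the numeral pattern is my assumption that only decimal points should match:

from nltk.tag import RegexpTagger

# Patterns are tried in order; the first one that matches a token wins.
nn_cd_tagger = RegexpTagger([
    (r'^[0-9]+(\.[0-9]+)?$', 'CD'),   # cardinal numbers, eg 11 or 3.5
    (r'.*', 'NN'),                    # everything else defaults to noun
])

tokens = 'There are 11 players in a football team'.split()
print(nn_cd_tagger.tag(tokens))
# [('There', 'NN'), ('are', 'NN'), ('11', 'CD'), ('players', 'NN'), ...]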

A unigram tagger

The NLTK UnigramTagger class implements a tagging algorithm based on a table of unigram probabilities:

    tag(w) = argmax_{t_i} P(t_i | w)

Training a UnigramTagger on the Penn Treebank:

from nltk.corpus import treebank
train_tokens = []
for item in treebank.items('tagged'):
    train_tokens.append(treebank.read(item))
uni_tagger = UnigramTagger(SUBTOKENS='WORDS', TAG='POS')
for tok in train_tokens:
    for stok in tok['SENTS']:
        uni_tagger.train(stok)

Combining taggers in NLTK

The UnigramTagger assigns the default tag None to unseen words. We can combine taggers to ensure every word is tagged:

comb_tagger = BackoffTagger([uni_tagger, NN_CD_tagger],
                            SUBTOKENS='WORDS', TAG='POS')
comb_tagger.tag(text_token)

>>> print text_token['WORDS']
[, , <11/CD>, , , , , , ]
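The training loop above again relies on the 2004-era token API. As a minimal, self-contained sketch of the same two ideas, here is a unigram tagger built from tag counts (picking argmax_{t_i} P(t_i | w) for each known word) combined with a regular-expression backoff for unseen words; the toy training data and the fall-back patterns are illustrative assumptions, not the slide's actual setup:

import re
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Count (word, tag) pairs and keep the most frequent tag for each word."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def regexp_backoff(word):
    """Fall-back for unseen words: numerals get CD, everything else NN."""
    return 'CD' if re.match(r'^[0-9]+(\.[0-9]+)?$', word) else 'NN'

def tag(words, unigram_table):
    """Unigram tag where the word was seen in training, otherwise back off."""
    return [(w, unigram_table.get(w, regexp_backoff(w))) for w in words]

# Toy training data standing in for the Penn Treebank sentences on the slide.
train = [[('There', 'EX'), ('are', 'VBP'), ('players', 'NNS'),
          ('in', 'IN'), ('a', 'DT'), ('football', 'NN'), ('team', 'NN')]]
table = train_unigram(train)
print(tag('There are 11 bizarre players in a football team'.split(), table))
# '11' and 'bizarre' are unseen, so the backoff tags them CD and NN.

Recent NLTK releases express a similar combination as nltk.tag.UnigramTagger(train_sents, backoff=...), but the hand-rolled version above makes the unigram table and the backoff step explicit.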


Unigram tagging

text_token = Token(TEXT=
    'There are 11 bizarre players in a football team')
WhitespaceTokenizer().tokenize(text_token)
uni_tagger.tag(text_token)

>>> print text_token['WORDS']
[, , <11/CD>, , , , , , ]

Evaluating taggers

• Basic idea: compare the output of a tagger with a human-labelled gold standard (a small accuracy computation is sketched below)
• Need to compare how well an automatic method does with the agreement between people
• The best automatic methods have an accuracy of about 96-97% when using the (small) Penn Treebank tagset
• Inter-annotator agreement is also only about 97%
• A good unigram baseline (with smoothing) can obtain 90-91%!
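A minimal sketch of this evaluation, assuming the gold standard and the tagger output are parallel lists of (word, tag) pairs; the variable names and data are illustrative:

def accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag matches the gold-standard tag."""
    assert len(gold) == len(predicted)
    correct = sum(1 for (_, g), (_, p) in zip(gold, predicted) if g == p)
    return correct / len(gold)

gold      = [('a', 'DT'), ('football', 'NN'), ('team', 'NN')]
predicted = [('a', 'DT'), ('football', 'NN'), ('team', 'NNS')]
print(accuracy(gold, predicted))   # 0.666..., ie 2 of the 3 tags are correct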

Error analysis

• The % correct score doesn't tell you everything; it is useful to know what is misclassified as what
• Confusion matrix: a matrix (ntags x ntags) where the rows correspond to the correct tags and the columns correspond to the tagger output. Cell (i, j) gives the count of the number of times tag i was classified as tag j
• The leading diagonal elements correspond to correct classifications
• Off-diagonal elements correspond to misclassifications
• Thus a confusion matrix gives information on the major problems facing a tagger (eg NNP vs. NN vs. JJ); a small sketch of building one follows below
• See section 3 of the NLTK tutorial on Tagging
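A minimal sketch of building such a confusion matrix from parallel gold and predicted tag sequences, using a plain dictionary of counts; the tag lists are illustrative:

from collections import Counter

def confusion_matrix(gold_tags, predicted_tags):
    """Map (correct tag, tagger output) pairs to counts: rows are gold, columns are output."""
    return Counter(zip(gold_tags, predicted_tags))

gold      = ['NNP', 'NN', 'JJ', 'NN', 'NNP']
predicted = ['NN',  'NN', 'NN', 'NN', 'NNP']
cm = confusion_matrix(gold, predicted)
print(cm[('NNP', 'NN')])   # 1: one NNP token was misclassified as NN
print(cm[('NN', 'NN')])    # 2: leading-diagonal cells count correct classifications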

Recap

• Reading: Chapter 8 of Jurafsky and Martin
• Parts of speech and tagsets
• Tagging
• Constructing simple taggers in NLTK
• Evaluating taggers
• Next lecture: an n-gram based part-of-speech tagger
