2/18/04

Words and their meaning: Disambiguation
Lecture #6

Introduction to Natural Language Processing
CMPSCI 585, Spring 2004
University of Massachusetts Amherst

Andrew McCallum

Three lectures:
• Last time: Collocations
– multiple words together, different meaning than the sum of the parts
• Today: Word disambiguation
– one word, multiple meanings
– Expectation Maximization
• Next time: Word clustering
– multiple words, “same” meaning


Word Sense Disambiguation
• The task is to determine which of various senses of a word are invoked in context.
– True annuals are plants grown from seed that blossom, set new seed, and die in a single year.
– Nissan’s Tennessee manufacturing plant beat back a United Auto Workers organizing effort with aggressive tactics.
• This is an important problem: most words are ambiguous (have multiple senses).
• Problem statement:
– A word is assumed to have a finite number of discrete senses.
– Make a forced choice between the senses for each word usage, based on some limited context around the word.
• There isn’t generally one way to divide the uses of a word into a set of non-overlapping categories.
• Senses depend on the task [Kilgarriff 1997]

WSD important for…
• Translation
– “The spirit is willing but the flesh is weak.”
– “The vodka is good, but the meat is spoiled.”
• Information retrieval
– query: “wireless mouse”
– document: “Australian long tail hopping mouse”
• Computational lexicography
– To automatically identify multiple definitions to be listed in a dictionary
• Parsing
– To give preference to parses with correct use of senses
• Converse: words or senses that mean (almost) the same:
– image, likeness, portrait, facsimile, picture
– (Next lecture)


WSD: Many other cases are harder
• “title”
– Name/heading of a book, statute, work of art or music, etc.
– Material at the start of a film
– The right of legal ownership (of land)
– The document that is evidence of this right
– An appellation of respect attached to a person’s name
– A written work

WSD: types of problems
• Homonymy: meanings are unrelated:
– bank of a river
– bank, the financial institution
• Polysemy: related meanings (as on the previous slide)
– title of a book
– title material at the start of a film
• Systematic polysemy: standard methods of extending a meaning, such as from an organization to the building where it is housed.
– The speaker of the legislature…
– The legislature decided today…
– He opened the door, and entered the legislature.
• A word frequently takes on further related meanings through systematic polysemy or metaphor.


Upper and lower bounds on performance
• Upper bound: human performance
– How often do human judges agree on the correct sense assignment?
– Particularly interesting if you only give humans the same input context given to the machine method. (A good test for any NLP method!)
– Gale 1992: give pairs of words in context; humans say if they are the same sense. Agreement 97-99% for words with clear senses, but ~65-70% for polysemous words.
• Lower bound: simple baseline algorithm
– Always pick the most common sense for each word (sketched in code after these slides).
– Accuracy depends greatly on the sense distribution! 90-50%?

Senseval competitions
• Senseval 1: September 1998. Results in Computers and the Humanities 34(1-2). OUP Hector corpus.
• Senseval 2: In first half of 2001. WordNet senses.
• http://www.itri.brighton.ac.uk/events/senseval
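A minimal sketch of the most-frequent-sense baseline mentioned above; the toy sense inventory and counts are invented for illustration, not from the lecture:

```python
from collections import Counter

def train_baseline(labeled_examples):
    """labeled_examples: iterable of (word, sense) pairs from a sense-tagged corpus.
    Returns a dict mapping each word to its most frequent sense."""
    counts = {}
    for word, sense in labeled_examples:
        counts.setdefault(word, Counter())[sense] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def predict(most_frequent_sense, word):
    # Always answer with the word's most common sense, ignoring context entirely.
    return most_frequent_sense.get(word)

# Toy illustration: if 90% of training uses of "plant" are the "factory" sense,
# this baseline gets 90% on "plant"; with a flatter sense distribution it does far worse.
examples = [("plant", "factory")] * 9 + [("plant", "flora")]
mfs = train_baseline(examples)
print(predict(mfs, "plant"))   # -> "factory"
```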


WSD automated method performance
• Varies widely depending on how difficult the disambiguation task is.
• Accuracies over 90% are commonly reported on the classic, often fairly easy, word disambiguation tasks (pike, star, interest).
• Senseval brought careful evaluation of difficult WSD (many senses, different POS).
• Senseval 1, fine-grained senses, wide range of types:
– Overall: about 75% accuracy
– Nouns: about 80% accuracy
– Verbs: about 70% accuracy

WSD solution #1: expert systems [Small 1980] [Hirst 1988]
• Most early work used semantic networks, frames, logical reasoning, or “expert system” methods for disambiguation based on contexts.
• The problem got quite out of hand:
– The word expert for “throw” is “currently six pages long, but should be ten times that size” (Small and Rieger 1982).


WSD solution #2: dictionary-based [Lesk 1986]
• A word’s dictionary definitions are likely to be good indicators for the senses they define.
– One sense for each dictionary definition.
– Look for overlap between words in the definition and words in the context at hand (sketched in code after these slides).
• E.g., word = “ash”:

  Sense      Definition
  1. tree    a tree of the olive family
  2. burned  the solid residue left when combustible material is burned

– “This cigar burns slowly and creates a stiff ash”: sense1=0, sense2=1
– “The ash is one of the last trees to come into leaf”: sense1=1, sense2=0
• Insufficient information in definitions. Accuracy 50-70%.

WSD solution #3: thesaurus-based [Walker 1987] [Yarowsky 1992]
• Occurrence of a word under multiple thesaurus “subject codes” is a good indicator of its senses.
• Count the number of times context words appear among the entries for each possible “subject code”.
• Increase coverage of rare words and proper nouns by also looking in the thesaurus for words that co-occur with context words more often than chance, e.g. “Hawking” co-occurs with “cosmology”, “black hole”.

  Word  Sense         Roget category  Accuracy
  star  space object  UNIVERSE        96%
  star  celebrity     ENTERTAINER     95%
  star  star-shaped   INSIGNIA        82%
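A minimal sketch of the Lesk-style definition-overlap idea from the dictionary-based slide above. The two “ash” definitions come from the slide; the tokenization, stopword list, and crude suffix-stripping are simplifications of my own:

```python
# Simplified Lesk: pick the sense whose dictionary definition shares
# the most (non-stopword) words with the context of the ambiguous word.
STOPWORDS = {"a", "an", "the", "of", "is", "to", "and", "when", "one", "this", "into"}

def stem(word):
    # Very crude suffix stripping so that "burns"/"burned" and "trees"/"tree" match.
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def tokens(text):
    words = [w.strip(".,").lower() for w in text.split()]
    return {stem(w) for w in words if w and w not in STOPWORDS}

def lesk(context, sense_definitions):
    """context: the sentence containing the ambiguous word.
    sense_definitions: dict mapping sense name -> definition string.
    Returns the sense whose definition overlaps the context most (ties broken arbitrarily)."""
    context_words = tokens(context)
    return max(sense_definitions,
               key=lambda s: len(context_words & tokens(sense_definitions[s])))

ash_senses = {
    "tree": "a tree of the olive family",
    "burned": "the solid residue left when combustible material is burned",
}
print(lesk("This cigar burns slowly and creates a stiff ash", ash_senses))       # -> burned
print(lesk("The ash is one of the last trees to come into leaf", ash_senses))    # -> tree
```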


An extra trick: global constraints [Yarowsky 1995]
• One sense per discourse: the sense of a word is highly consistent within a document.
– You get a lot more context words, because you combine the context of multiple occurrences.
– True for topic-dependent words.
– Not so true for other items like adjectives and verbs, e.g. “make”, “take”.

Other similar “disambiguation” problems
• Sentence boundary detection
– “I live on Palm Dr. Smith lives downtown.”
– Only really ambiguous when the
• word before the period is an abbreviation (one that can end a sentence, not something like a title), and the
• word after the period is capitalized (and can be a proper name; otherwise it must be a sentence end).
– (A minimal sketch of this heuristic appears after these slides.)
• Context-sensitive spelling correction
– “I know there is a problem with there account.”
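A minimal sketch of the sentence-boundary heuristic described above: a period is only a genuinely ambiguous case when the token before it is a sentence-ending abbreviation and the token after it is capitalized. The abbreviation and title lists here are illustrative stand-ins of my own, not from the lecture:

```python
ABBREVIATIONS = {"dr", "st", "ave", "inc", "etc"}   # illustrative, not exhaustive
TITLES = {"mr", "mrs", "ms", "prof"}                # titles almost never end a sentence

def is_sentence_boundary(left_token, right_token):
    """Decide whether the period after left_token ends a sentence.
    Returns True/False when the heuristic is decisive, None when genuinely ambiguous
    (abbreviation before the period AND capitalized word after it)."""
    left = left_token.lower().rstrip(".")
    if left in TITLES:
        return False                  # "Mr. Smith" -- not a boundary
    if left not in ABBREVIATIONS:
        return True                   # ordinary word + period: boundary
    if not right_token[:1].isupper():
        return False                  # abbreviation + lowercase word: not a boundary
    return None                       # the hard case: hand off to a classifier / more context

print(is_sentence_boundary("Dr.", "Smith"))      # -> None (the hard case from the slide)
print(is_sentence_boundary("downtown.", "He"))   # -> True
```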


WSD solution #4: supervised classification
• Gather a lot of labeled data: words in context, hand-labeled into different sense categories.
(Figure: word+context examples, labeled according to sense.)
• Use naïve Bayes document classification with the “context” as the document! (A context-window sketch follows these slides.)
– Straightforward classification problem.
– Simple, powerful method! :-)
– Requires hand-labeling a lot of data :-(
• Train one multinomial per class via maximum likelihood. What you just did for HW#1.

WSD sol’n #5: unsupervised disambiguation
(Figure: word+context examples, unlabeled.)
• Can we still use naïve Bayes, but without labeled data?… Label is missing!
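A minimal sketch of treating the context as the “document”: for each occurrence of the ambiguous word, collect a bag of the surrounding words, which can then be handed to any document classifier. The window size and tokenization are my own choices, not from the lecture:

```python
from collections import Counter

def context_bag(tokens, target, k=5):
    """For every occurrence of `target` in a tokenized passage, return a
    bag-of-words Counter over the k words on either side (the 'document')."""
    bags = []
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            window = tokens[max(0, i - k):i] + tokens[i + 1:i + 1 + k]
            bags.append(Counter(w.lower() for w in window))
    return bags

tokens = "The manufacturing plant beat back a union organizing effort".split()
print(context_bag(tokens, "plant", k=3))
# [Counter({'the': 1, 'manufacturing': 1, 'beat': 1, 'back': 1, 'a': 1})]
```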


Filling in Missing Labels with EM
28 years ago…

[Dempster et al ‘77], [Ghahramani & Jordan ‘95], [McLachlan & Krishnan ‘97]

Expectation Maximization is a class of iterative algorithms for maximum likelihood estimation with incomplete data.

• E-step: Use current estimates of model parameters to “guess” the values of the missing labels.
• M-step: Use current “guesses” for the missing labels to calculate new estimates of the model parameters.
• Repeat E- and M-steps until convergence.

Finds the model parameters that locally maximize the probability of both the labeled and the unlabeled data.


Recall: “Naïve Bayes”
Pick the most probable class, given the evidence:
$c^* = \arg\max_{c_j} \Pr(c_j \mid d)$
– $c_j$ : a class (like “Planning”)
– $d$ : a document (like “language intelligence proof...”)

Bayes Rule:
$\Pr(c_j \mid d) = \dfrac{\Pr(c_j)\,\Pr(d \mid c_j)}{\Pr(d)}$

“Naïve Bayes”:
$\Pr(c_j \mid d) \approx \dfrac{\Pr(c_j)\prod_{i=1}^{|d|}\Pr(w_{d_i} \mid c_j)}{\sum_{c_k}\Pr(c_k)\prod_{i=1}^{|d|}\Pr(w_{d_i} \mid c_k)}$
– $w_{d_i}$ : the $i$-th word in $d$ (like “proof”)

Recall: Parameter Estimation in Naïve Bayes
Estimate of P(c):
$P(c_j) = \dfrac{1 + \mathrm{Count}(d \in c_j)}{|C| + \sum_{c_k} \mathrm{Count}(d \in c_k)}$

Estimate of P(w|c):
$\hat{P}(w_i \mid c_j) = \dfrac{1 + \sum_{d_k \in c_j} \mathrm{Count}(w_i, d_k)}{|V| + \sum_{t=1}^{|V|} \sum_{d_k \in c_j} \mathrm{Count}(w_t, d_k)}$
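A minimal sketch of the smoothed estimates above for bag-of-words documents given as token lists. The add-one smoothing follows the formulas on the slide; the function and variable names, and the toy data, are my own:

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """docs: list of token lists; labels: parallel list of class names.
    Returns (log P(c), log P(w|c), vocabulary) with add-one smoothing."""
    classes = set(labels)
    vocab = {w for d in docs for w in d}
    doc_counts = Counter(labels)
    word_counts = {c: Counter() for c in classes}
    for d, c in zip(docs, labels):
        word_counts[c].update(d)
    log_prior = {c: math.log((1 + doc_counts[c]) / (len(classes) + len(docs)))
                 for c in classes}
    log_cond = {}
    for c in classes:
        total = sum(word_counts[c].values())
        log_cond[c] = {w: math.log((1 + word_counts[c][w]) / (len(vocab) + total))
                       for w in vocab}
    return log_prior, log_cond, vocab

def classify(doc, log_prior, log_cond, vocab):
    # Pick argmax_c  log P(c) + sum_i log P(w_i | c), ignoring words unseen in training.
    scores = {c: log_prior[c] + sum(log_cond[c][w] for w in doc if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)

docs = [["strike", "inning", "homerun"], ["ice", "jump", "skate"], ["inning", "hitter"]]
labels = ["baseball", "skating", "baseball"]
model = train_nb(docs, labels)
print(classify(["perfect", "jump", "ice"], *model))   # -> "skating"
```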


EM Recipe
• Initialization
– Create an array P(c|d) for each document, and fill it with random values.
• M-step (likelihood Maximization)
– Calculate maximum-likelihood estimates for the parameters P(c) and P(w|c) using the current P(c|d):
$P(c_j) = \dfrac{1 + \sum_{d_k \in D} P(c_j \mid d_k)}{|C| + |D|}$
$\hat{P}(w_i \mid c_j) = \dfrac{1 + \sum_{d_k \in D} \mathrm{Count}(w_i, d_k)\, P(c_j \mid d_k)}{|V| + \sum_{t=1}^{|V|} \sum_{d_k \in D} \mathrm{Count}(w_t, d_k)\, P(c_j \mid d_k)}$
• E-step (missing-value Estimation)
– Using the current parameters, calculate a new P(c|d) the same way you would at test time:
$\Pr(c_j \mid d) = \dfrac{\Pr(c_j)\,\Pr(d \mid c_j)}{\Pr(d)}$
• Loop back to the M-step, until convergence. (A code sketch of this loop follows these two slides.)
– Converged when the maximum change in any parameter P(w|c) is below some threshold.

EM
• We could have simply written down the likelihood, taken the derivative, and solved…
– but unlike the “complete data” case, it is not solvable in closed form
– must use an iterative method:
• gradient ascent
• EM is another form of ascent on this likelihood surface
– Convergence, speed, and local minima are all issues.
• If you make “hard 0 versus 1” assignments in P(c|d), you get the K-means algorithm.
• Likelihood will always be highest with more classes.
– Use a prior over the number of classes, or just pick arbitrarily.
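A minimal sketch of the EM recipe above for a naïve Bayes mixture over unlabeled token-list documents. It runs a fixed number of iterations rather than testing parameter change against a threshold, and all names and the toy data are my own:

```python
import math, random
from collections import Counter

def em_nb(docs, num_classes, iterations=20, seed=0):
    """Unsupervised naive Bayes via EM. docs: list of token lists.
    Returns P(c|d) for every document (a list of length-num_classes lists)."""
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    # Initialization: random P(c|d), normalized per document.
    resp = []
    for _ in docs:
        r = [rng.random() for _ in range(num_classes)]
        s = sum(r)
        resp.append([x / s for x in r])
    for _ in range(iterations):
        # M-step: smoothed estimates of P(c) and P(w|c), weighted by the current P(c|d).
        log_prior, log_cond = [], []
        for c in range(num_classes):
            weight = sum(resp[k][c] for k in range(len(docs)))
            log_prior.append(math.log((1 + weight) / (num_classes + len(docs))))
            wc = Counter()
            for d, r in zip(docs, resp):
                for w in d:
                    wc[w] += r[c]
            total = sum(wc.values())
            log_cond.append({w: math.log((1 + wc[w]) / (len(vocab) + total))
                             for w in vocab})
        # E-step: recompute P(c|d) exactly as at test time (normalized Bayes rule).
        for k, d in enumerate(docs):
            scores = [log_prior[c] + sum(log_cond[c][w] for w in d)
                      for c in range(num_classes)]
            m = max(scores)
            exp_scores = [math.exp(s - m) for s in scores]   # subtract max for stability
            z = sum(exp_scores)
            resp[k] = [e / z for e in exp_scores]
    return resp

docs = [["ice", "jump", "skate"], ["inning", "hitter", "homerun"],
        ["skate", "ice"], ["homerun", "inning"]]
for d, r in zip(docs, em_nb(docs, num_classes=2)):
    print(d, [round(p, 2) for p in r])
```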


EM
• Some good things about EM
– no learning rate parameter
– very fast for low dimensions
– each iteration is guaranteed to improve likelihood
– adapts unused units rapidly
• Some bad things about EM
– can get stuck in local minima
– ignores model cost (how many classes?)
– both steps require considering all explanations of the data (all classes)

Semi-Supervised Document Classification
(Figure: web pages the user says are interesting and web pages the user says are uninteresting are training data with class labels; web pages the user hasn’t seen or said anything about are data available at training time, but without class labels.)
Can we use the unlabeled documents to increase accuracy?


Semi-Supervised Document Classification
• Build a classification model using the limited labeled data.
• Use the model to estimate the labels of the unlabeled documents.
• Use all documents to build a new classification model, which is often more accurate because it is trained using more data.

An Example
Labeled data:
– Ice Skating: “Tara Lipinski’s substitute ice skates didn’t hurt her performance. She graced the ice with a series of perfect jumps and won the gold medal.” “Fell on the ice...” “Perfect triple jump...” “Katarina Witt’s gold medal performance...” “New ice skates...” “Practice at the ice rink every day...”
– Baseball: “The new hitter struck out...” “Struck out in last inning...” “Homerun in the first inning...”
Unlabeled data:
– “Tara Lipinski bought a new house for her parents.”
– “Pete Rose is not as good an athlete as Tara Lipinski...”
Before EM: Pr(Lipinski | Ice Skating) = 0.01, Pr(Lipinski | Baseball) = 0.001
After EM: Pr(Lipinski | Ice Skating) = 0.02, Pr(Lipinski | Baseball) = 0.003


WebKB Data Set
• 4 classes: student, faculty, course, project
• 4,199 documents from CS academic departments

Word Vector Evolution with EM
(Top words of one class’s word vector across EM iterations; D is a digit.)

  Iteration 0      Iteration 1    Iteration 2
  intelligence     DD             D
  DD               D              DD
  artificial       lecture        lecture
  understanding    cc             cc
  DDw              D*             DD:DD
  dist             DD:DD          due
  identical        handout        D*
  rus              due            homework
  arrange          problem        assignment
  games            set            handout
  dartmouth        tay            set
  natural          DDam           hw
  cognitive        yurtas         exam
  logic            homework       problem
  proving          kfoury         DDam
  prolog           sec            postscript


EM as Clustering
(Figure: points in clusters; “X” marks labeled points, the remaining points are unlabeled.)

EM as Clustering, Gone Wrong
(Figure: a similar data set where the clusters EM finds do not correspond to the intended classes.)


20 Newsgroups Data Set
• 20 class labels, 20,000 documents, 62k unique words
• Classes include: alt.atheism, comp.graphics, comp.os.ms-windows.misc, comp.sys.ibm.pc.hardware, comp.sys.mac.hardware, comp.windows.x, rec.sport.baseball, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, talk.politics.guns, talk.politics.mideast, talk.politics.misc, talk.religion.misc, …

Newsgroups Classification Accuracy, varying # labeled documents
(Figure: classification accuracy plotted against the number of labeled training documents.)
