
CS546: Machine Learning in NLP (Spring 2020) http://courses.engr.illinois.edu/cs546/

Lecture 4: Static embeddings

Julia Hockenmaier, [email protected], 3324 Siebel Center. Office hours: Monday, 11am–12:30pm

(Static) Word Embeddings

A (static) word embedding is a function that maps each word type to a single vector

— these vectors are typically dense and have much lower dimensionality than the size of the vocabulary

— this mapping function typically ignores that the same string of letters may have different senses (dining table vs. a table of contents) or parts of speech (to table a motion vs. a table)

— this mapping function typically assumes a fixed-size vocabulary (so an UNK token is still required)

Embeddings

Main idea: Use a classifier to predict which words appear in the context of (i.e. near) a target word (or vice versa). This classifier induces a dense vector representation of words (an embedding).

Words that appear in similar contexts (that have high distributional similarity) will have very similar vector representations.

These models can be trained on large amounts of raw text (and pre-trained embeddings can be downloaded)

Word2Vec (Mikolov et al. 2013)

The first really influential dense word embeddings

Two ways to think about Word2Vec:
— a simplification of neural language models
— a binary classifier

Variants of Word2Vec:
— Two different context representations: CBOW or Skip-Gram
— Two different optimization objectives: Negative sampling (NS) or hierarchical softmax

Word2Vec architectures

[Figure 1 from Mikolov et al. (2013), showing the two Word2Vec architectures: the CBOW architecture predicts the current word w(t) from the surrounding context words w(t-2), w(t-1), w(t+1), w(t+2), which are summed at the projection layer; the Skip-gram architecture predicts the surrounding context words given the current word w(t).]

CBOW: predict target from context

(CBOW = Continuous Bag of Words)
Training sentence: ... lemon, a tablespoon of apricot jam a pinch ...
(context words: c1 = tablespoon, c2 = of, c3 = jam, c4 = a; target: t = apricot)

Given the surrounding context words (tablespoon, of, jam, a), predict the target word (apricot).

Input: each context word is a one-hot vector
Projection layer: map each one-hot vector down to a dense D-dimensional vector, and average these vectors
Output: predict the target word with a softmax
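A minimal numpy sketch of this CBOW forward pass (the vocabulary size, embedding dimension, and randomly initialized matrices are placeholders, not values from the slides):

```python
import numpy as np

V, D = 10000, 300                      # vocabulary size, embedding dimension (placeholders)
E = np.random.randn(V, D) * 0.01       # projection embeddings, one row per word
W = np.random.randn(D, V) * 0.01       # output weights for the softmax layer

def cbow_forward(context_ids):
    """Predict a distribution over the target word from the ids of the context words."""
    h = E[context_ids].mean(axis=0)     # project each one-hot context word and average
    scores = h @ W                      # one score per vocabulary word
    scores -= scores.max()              # numerical stability
    return np.exp(scores) / np.exp(scores).sum()   # softmax over the vocabulary

# e.g. the ids for (tablespoon, of, jam, a) would come from a word-to-id table
p = cbow_forward(np.array([17, 42, 205, 3]))
print(p.shape, p.sum())                 # (10000,) 1.0
```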

Skipgram: predict context from target

Training sentence: ... lemon, a tablespoon of apricot jam a pinch ...
(context words: c1 = tablespoon, c2 = of, c3 = jam, c4 = a; target: t = apricot)

Given the target word (apricot), predict the surrounding context words (tablespoon, of, jam, a).
Input: the target word is a one-hot vector
Projection layer: map the one-hot vector down to a dense D-dimensional vector
Output: predict each context word with a softmax

Skipgram

[Figure: the Skipgram network. The one-hot encoding of the target word selects a row of the V×H input-to-hidden weight matrix; multiplying the resulting hidden vector by the H×V hidden-to-output weight matrix gives a score for each candidate context word wj.]

p(wc | wt) ∝ exp(wc · wt)

The rows in the weight matrix for the hidden layer correspond to the weights for each hidden unit. The columns in the weight matrix from the input to the hidden layer correspond to the input vectors for each (target) word [typically, these are used as the word2vec vectors]. The rows in the weight matrix from the hidden to the output layer correspond to the output vectors for each (context) word [typically, these are ignored].
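As a sketch, the two weight matrices viewed as embedding tables, and the softmax p(wc | wt) ∝ exp(wc · wt) computed from them (sizes, random initialization, and the example word id are placeholders):

```python
import numpy as np

V, D = 10000, 300                        # vocabulary size and embedding dimension (placeholders)
W_in = np.random.randn(V, D) * 0.01      # input-to-hidden weights: input (target) vectors
W_out = np.random.randn(V, D) * 0.01     # hidden-to-output weights: output (context) vectors

def context_distribution(target_id):
    """p(wc | wt) ∝ exp(wc · wt): softmax over dot products with every context vector."""
    t = W_in[target_id]                  # dense D-dimensional vector for the target word
    scores = W_out @ t                   # one score per candidate context word
    scores -= scores.max()               # numerical stability
    return np.exp(scores) / np.exp(scores).sum()

probs = context_distribution(205)        # 205: a hypothetical word id
word2vec_vectors = W_in                  # typically, these rows are kept as the word embeddings
```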

Negative sampling

Skipgram aims to optimize the average log-likelihood of the data:

(1/T) ∑_{t=1..T} ∑_{−c≤j≤c, j≠0} log p(wt+j | wt) = (1/T) ∑_{t=1..T} ∑_{−c≤j≤c, j≠0} log [ exp(wt+j · wt) / ∑_{k=1..V} exp(wk · wt) ]

But the partition function ∑_{k=1..V} exp(wk · wt) is very expensive to compute.
— This can be mitigated by hierarchical softmax (represent each wt+j by its Huffman encoding, and predict the sequence of nodes in the resulting binary tree via softmax).
— Noise Contrastive Estimation is an alternative to (hierarchical) softmax that aims to distinguish actual data points wt+j from noise via logistic regression.
— But we just want good word representations, so we do something simpler:

Negative Sampling instead aims to optimize

log σ(wt · wc) + ∑_{i=1..k} E_{wi∼P(w)} [ log σ(−wt · wi) ]    with σ(x) = 1 / (1 + exp(−x))

Skip-Gram Training data

Training sentence: ... lemon, a tablespoon of apricot jam a pinch ...
(context words: c1 = tablespoon, c2 = of, c3 = jam, c4 = a; target: t = apricot)

Training data: input/output pairs centering on apricot
Assume a ±2 word window (in reality: use ±10 words)
Positive examples: (apricot, tablespoon), (apricot, of), (apricot, jam), (apricot, a)
For each positive example, sample k negative examples using noise words (drawn according to the [adjusted] unigram probability): (apricot, aardvark), (apricot, puddle), …
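A rough sketch (not the original word2vec code) of this pair-generation step, with noise words drawn from a smoothed unigram distribution (freq^0.75, discussed on a later slide); window size, k, and the toy sentence are illustrative:

```python
import random
from collections import Counter

def skipgram_pairs(tokens, window=2, k=2, alpha=0.75, rng=random.Random(0)):
    """Yield (target, context, label) triples: 1 for observed pairs, 0 for sampled noise words."""
    counts = Counter(tokens)
    words = list(counts)
    weights = [counts[w] ** alpha for w in words]        # smoothed unigram noise distribution
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j == i:
                continue
            yield target, tokens[j], 1                   # positive example from the corpus
            for noise in rng.choices(words, weights=weights, k=k):
                yield target, noise, 0                   # k sampled negative examples

sentence = "lemon a tablespoon of apricot jam a pinch".split()
for t, c, label in skipgram_pairs(sentence):
    if t == "apricot":
        print((t, c), label)
```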

P(Y | X) with Logistic Regression

The sigmoid lies between 0 and 1 and is used in (binary) logistic regression

σ(x) = 1 / (1 + exp(−x))

Logistic regression for binary classification (y ∈ {0, 1}):

P(Y = 1 | x) = σ(w·x + b) = 1 / (1 + exp(−(w·x + b)))

Parameters to learn: one feature weight vector w and one bias term b

Back to word2vec

Skipgram with negative sampling also uses the sigmoid, but requires two sets of parameters that are multiplied together (for target and context vectors)

log σ(wt · wc) + ∑_{i=1..k} E_{wi∼P(w)} [ log σ(−wt · wi) ]

We can view word2vec as training a binary classifier for the decision whether c is an actual context word for t.

The probability that c is a positive (real) context word for t: P(D = + | t, c)
The probability that c is a negative (sampled) context word for t: P(D = − | t, c) = 1 − P(D = + | t, c)

Negative Sampling

log σ(wt · wc) + ∑_{i=1..k} E_{wi∼P(w)} [ log σ(−wt · wi) ]

= log [ 1 / (1 + exp(−wt · wc)) ] + ∑_{i=1..k} E_{wi∼P(w)} [ log ( 1 / (1 + exp(wt · wi)) ) ]

= log [ 1 / (1 + exp(−wt · wc)) ] + ∑_{i=1..k} E_{wi∼P(w)} [ log ( 1 − 1 / (1 + exp(−wt · wi)) ) ]

= log P(D = + | wc, wt) + ∑_i E_{wi∼P(w)} [ log (1 − P(D = + | wi, wt)) ]

P(D = + | wc, wt) should be high for actual context words; P(D = + | wi, wt) should be low for sampled context words.

Negative Sampling

Basic idea:
— For each actual (positive) target-context word pair, sample k negative examples consisting of the target word and a randomly sampled word.
— Train a model to predict a high conditional probability for the actual (positive) context words, and a low conditional probability for the sampled (negative) context words.

This can be reformulated as (approximated by) predicting whether a word-context pair comes from the actual (positive) data, or from the sampled (negative) data:

log σ(wt · wc) + ∑_{i=1..k} E_{wi∼P(w)} [ log σ(−wt · wi) ]

Word2Vec: Negative Sampling

Distinguish “good” (correct) word-context pairs (D=1) from “bad” ones (D=0)

Probabilistic objective: P(D = 1 | t, c) defined by the sigmoid:

P(D = 1 | t, c) = 1 / (1 + exp(−s(t, c)))

P(D = 0 | t, c) = 1 − P(D = 1 | t, c)

P(D = 1 | t, c) should be high when (t, c) ∈ D+, and low when (t, c) ∈ D−

Word2Vec: Negative Sampling

Training data: D+ ∪ D-

D+ = actual examples from training data

Where do we get D- from? Word2Vec: for each good pair (w, c), sample k words and add each wi as a negative example (wi, c) to D- (D- is k times as large as D+)

Words can be sampled according to corpus frequency, or according to a smoothed variant where freq′(w) = freq(w)^0.75 (this gives more weight to rare words and performs better)
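To illustrate the effect of the 0.75 exponent, a toy comparison of the raw and smoothed sampling distributions (the word counts here are made up):

```python
from collections import Counter

counts = Counter("the the the the cat sat on the mat aardvark".split())
total = sum(counts.values())

raw = {w: c / total for w, c in counts.items()}
norm = sum(c ** 0.75 for c in counts.values())
smoothed = {w: c ** 0.75 / norm for w, c in counts.items()}

for w in ("the", "aardvark"):
    # rare words gain probability mass relative to very frequent words
    print(w, round(raw[w], 3), round(smoothed[w], 3))
```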

Word2Vec: Negative Sampling

Training objective: Maximize log-likelihood of training data D+ ∪ D-:

L(Θ, D+, D−) = ∑_{(w,c) ∈ D+} log P(D = 1 | w, c) + ∑_{(w,c) ∈ D−} log P(D = 0 | w, c)
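A minimal numpy sketch of this objective for a single positive pair and its k sampled negatives (the vectors below are random placeholders; in training they come from the embedding matrices):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(w_t, w_c, negatives):
    """Negative log-likelihood of one positive (target, context) pair plus its sampled negatives."""
    pos = np.log(sigmoid(w_t @ w_c))                 # log P(D=1 | t, c): should be high
    neg = np.log(sigmoid(-negatives @ w_t)).sum()    # sum of log P(D=0 | t, w_i): should be high
    return -(pos + neg)                              # minimize the negative of the objective

rng = np.random.default_rng(0)
D = 50
print(sgns_loss(rng.normal(size=D), rng.normal(size=D), rng.normal(size=(5, D))))
```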

Skip-Gram with negative sampling

Train a binary classifier that decides whether a target word t appears in the context of other words c1..k
— Context: the set of k words near (surrounding) t
— Treat the target word t and any word that actually appears in its context in a real corpus as positive examples
— Treat the target word t and randomly sampled words that don’t appear in its context as negative examples
— Train a (variant of a) binary logistic regression classifier with two sets of weights (target and context embeddings) to distinguish these cases
— The weights of this classifier depend on the similarity of t and the words in c1..k
Use the target embeddings to represent t

The Skip-Gram classifier

Use logistic regression to predict whether the pair (t, c), consisting of a target word t and a candidate context word c, is a positive or negative example. Assume that t and c are represented as vectors, so that their dot product t·c captures their similarity.

The probability that c is a real context word for t:
P(+ | t, c) = σ(t·c) = 1 / (1 + exp(−t·c))

The probability that c is not a real context word for t:
P(− | t, c) = 1 − P(+ | t, c) = exp(−t·c) / (1 + exp(−t·c))

To capture the entire context window c1..k, assume the context words are independent (multiply their probabilities) and take the log:
P(+ | t, c1:k) = ∏_{i=1..k} 1 / (1 + exp(−t·ci))
log P(+ | t, c1:k) = ∑_{i=1..k} log [ 1 / (1 + exp(−t·ci)) ]

Where do we get vectors t, c from?

Iterative approach (stochastic gradient descent): Assume an initial set of vectors, and then adjust them during training to maximize the probability of the training examples.

Summary: How to learn word2vec (skip-gram) embeddings

For a vocabulary of size V: Start with two sets of V random 300-dimensional vectors as initial embeddings

Train a logistic regression classifier to distinguish words that co-occur in the corpus from those that don’t:
— Pairs of words that co-occur are positive examples
— Pairs of words that don’t co-occur are negative examples
— Train the classifier to distinguish these by slowly adjusting all the embeddings to improve classifier performance

Throw away the classifier code and keep the embeddings.
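Putting the pieces together, a compact and deliberately unoptimized sketch of this training procedure on a toy corpus (dimensions, learning rate, epochs, and the corpus itself are placeholders):

```python
import random
from collections import Counter
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_sgns(tokens, dim=50, window=2, k=2, lr=0.025, epochs=10, seed=0):
    """Skip-gram with negative sampling; returns (target embeddings, word-to-id map)."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    idx = {w: i for i, w in enumerate(vocab)}
    counts = Counter(tokens)
    noise_weights = [counts[w] ** 0.75 for w in vocab]      # smoothed unigram noise distribution
    npr = np.random.default_rng(seed)
    W_in = (npr.random((len(vocab), dim)) - 0.5) / dim      # target embeddings (kept at the end)
    W_out = np.zeros((len(vocab), dim))                     # context embeddings (thrown away)

    for _ in range(epochs):
        for i, target in enumerate(tokens):
            t = idx[target]
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                # one positive context word, plus k sampled negative (noise) words
                examples = [(idx[tokens[j]], 1.0)]
                examples += [(idx[w], 0.0) for w in rng.choices(vocab, noise_weights, k=k)]
                for c, label in examples:
                    grad = sigmoid(W_in[t] @ W_out[c]) - label   # derivative of -log likelihood
                    g_in = grad * W_out[c]
                    W_out[c] -= lr * grad * W_in[t]
                    W_in[t] -= lr * g_in
    return W_in, idx

tokens = "the cat sat on the mat and the dog sat on the mat".split()
E, idx = train_sgns(tokens)
print(E[idx["cat"]] @ E[idx["dog"]])      # dot product between two learned target vectors
```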

Evaluating embeddings

Compare to human scores on word similarity-type tasks:
— WordSim-353 (Finkelstein et al., 2002)
— SimLex-999 (Hill et al., 2015)
— Stanford Contextual Word Similarity (SCWS) dataset (Huang et al., 2012)
— TOEFL dataset: Levied is closest in meaning to: imposed, believed, requested, correlated
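A sketch of the word-similarity evaluation: compute cosine similarities with the embeddings and correlate them with the human scores (this assumes scipy for the Spearman correlation and a list of (word1, word2, human score) triples; the vectors and scores below are made up):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def evaluate_similarity(embeddings, pairs_with_scores):
    """Spearman correlation between embedding cosine similarities and human similarity judgments."""
    model_scores, human_scores = [], []
    for w1, w2, human in pairs_with_scores:
        if w1 in embeddings and w2 in embeddings:       # skip out-of-vocabulary pairs
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(human)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho

# toy illustration with made-up vectors and made-up human scores
emb = {w: np.random.default_rng(i).normal(size=50) for i, w in enumerate(["tiger", "cat", "car"])}
print(evaluate_similarity(emb, [("tiger", "cat", 7.35), ("cat", "car", 1.0)]))
```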

Properties of embeddings

Similarity depends on window size C

C = ±2: The nearest words to Hogwarts: Sunnydale, Evernight

C = ±5: The nearest words to Hogwarts: Dumbledore, Malfoy, halfblood

Vectors “capture concepts”

Analogy pairs

Analogy: Embeddings capture relational meaning!

vector(‘king’) - vector(‘man’) + vector(‘woman’) ≈ vector(‘queen’)
vector(‘Paris’) - vector(‘France’) + vector(‘Italy’) ≈ vector(‘Rome’)
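A sketch of this analogy computation with cosine similarity, excluding the query words from the candidate set (as in Mikolov et al.); `embeddings` is assumed to be a word-to-vector dictionary:

```python
import numpy as np

def analogy(embeddings, a, b, c, topn=1):
    """Return the word(s) closest to vector(b) - vector(a) + vector(c), excluding the query words."""
    query = embeddings[b] - embeddings[a] + embeddings[c]
    query /= np.linalg.norm(query)
    scores = []
    for word, vec in embeddings.items():
        if word in (a, b, c):
            continue                                  # discard the input question words
        scores.append((query @ vec / np.linalg.norm(vec), word))
    return [w for _, w in sorted(scores, reverse=True)[:topn]]

# with well-trained embeddings, analogy(emb, 'man', 'king', 'woman') should return ['queen']
```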

Word2vec results

Word2Vec and distributional similarities

Why does the word2vec objective yield sensible results?

Levy and Goldberg (NIPS 2014): Skipgram with negative sampling can be seen as a weighted factorization of a word-context PMI matrix. => It is actually very similar to traditional distributional approaches!

Levy, Goldberg and Dagan (TACL 2015) suggest tricks that can be applied to traditional approaches that yield similar results on these lexical tests.

Using Word Embeddings

Using pre-trained embeddings

Assume you have pre-trained embeddings E. How do you use them in your model?

-Option 1: Adapt E during training. Disadvantage: only words in the training data will be affected.
-Option 2: Keep E fixed, but add another hidden layer that is learned for your task.
-Option 3: Learn a matrix T ∈ R^(dim(emb)×dim(emb)) and use the rows of E′ = ET (adapts all embeddings, not specific words).
-Option 4: Keep E fixed, but learn a matrix Δ ∈ R^(|V|×dim(emb)) and use E′ = E + Δ or E′ = ET + Δ (this learns to adapt specific words).
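A sketch of the four options in PyTorch (PyTorch is an assumption here, not something the slides prescribe; E below is a random stand-in for a pre-trained embedding matrix):

```python
import torch
import torch.nn as nn

V, d = 10000, 300
E = torch.randn(V, d)                  # stand-in for a pre-trained embedding matrix

# Option 1: adapt E during training (fine-tune the embedding table)
emb_finetune = nn.Embedding.from_pretrained(E, freeze=False)

# Option 2: keep E fixed, add a task-specific layer on top
emb_fixed = nn.Embedding.from_pretrained(E, freeze=True)
task_layer = nn.Linear(d, d)

# Option 3: keep E fixed, learn a global transformation T (adapts all embeddings)
T = nn.Parameter(torch.eye(d))

# Option 4: keep E fixed, learn a per-word correction Δ (adapts specific words)
delta = nn.Parameter(torch.zeros(V, d))

word_ids = torch.tensor([3, 17, 42])
e1 = emb_finetune(word_ids)                   # Option 1
e2 = task_layer(emb_fixed(word_ids))          # Option 2
e3 = emb_fixed(word_ids) @ T                  # Option 3: E' = ET
e4 = emb_fixed(word_ids) + delta[word_ids]    # Option 4: E' = E + Δ
```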

More on embeddings

Embeddings aren’t just for words! You can take any discrete input feature (with a fixed number K of outcomes, e.g. POS tags, etc.) and learn an embedding matrix for that feature.

Where do we get the input embeddings from?
We can learn the embedding matrix during training. Initialization matters: use random weights, but in a special range (e.g. [−1/(2d), +1/(2d)] for d-dimensional embeddings), or use Xavier initialization.
We can also use pre-trained embeddings. LM-based embeddings are useful for many NLP tasks.
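For example, initializing an embedding table for a discrete feature (a hypothetical POS-tag feature with K = 45 outcomes), again sketched in PyTorch:

```python
import torch.nn as nn

K, d = 45, 32                               # e.g. 45 POS tags, 32-dimensional feature embeddings
pos_emb = nn.Embedding(K, d)

# random initialization in a small range, e.g. [-1/(2d), +1/(2d)] ...
nn.init.uniform_(pos_emb.weight, -1.0 / (2 * d), 1.0 / (2 * d))

# ... or Xavier (Glorot) initialization
nn.init.xavier_uniform_(pos_emb.weight)
```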

Dense embeddings you can download!

Word2vec (Mikolov et al.): https://code.google.com/archive/p/word2vec/
Fasttext: http://www.fasttext.cc/

GloVe (Pennington, Socher, Manning): http://nlp.stanford.edu/projects/glove/

Traditional Distributional similarities and PMI

Distributional similarities

Distributional similarities use the set of contexts in which words appear to measure their similarity.

They represent each word w as a vector w = (w1, …, wN) ∈ R^N in an N-dimensional vector space.

-Each dimension corresponds to a particular context cn

-Each element wn of w captures the degree to which the word w is associated with the context cn.

- wn depends on the co-occurrence counts of w and cn

The similarity of words w and u is given by the similarity of their vectors w and u

Using nearby words as contexts

-Decide on a fixed vocabulary of N context words c1..cN
Context words should occur frequently enough in your corpus that you get reliable co-occurrence counts, but you should ignore words that are too common (‘stop words’: a, the, on, in, and, or, is, have, etc.)

-Define what ‘nearby’ means. For example: w appears near c if c appears within ±5 words of w.

-Get co-occurrence counts of words w and contexts c

-Define how to transform co-occurrence counts of words w and contexts c into vector elements wn. For example: compute the (positive) PMI of words and contexts.

-Define how to compute the similarity of word vectors. For example: use the cosine of their angles (see the sketch below).
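A compact sketch of this pipeline on a toy corpus, using raw co-occurrence counts as vector elements and cosine similarity (the PPMI transformation is shown further below); the context vocabulary and corpus here are made up:

```python
import numpy as np
from collections import Counter

def context_counts(tokens, context_vocab, window=5):
    """Count how often each word co-occurs with each context word within ±window tokens."""
    counts = {}
    for i, w in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i and tokens[j] in context_vocab:
                counts.setdefault(w, Counter())[tokens[j]] += 1
    return counts

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

contexts = ["boil", "data", "large", "sugar", "water"]       # toy fixed context vocabulary
tokens = "apricot jam with sugar and water boil the apricot in water data in a large table".split()
counts = context_counts(tokens, set(contexts))
vec = lambda w: np.array([counts.get(w, Counter())[c] for c in contexts], dtype=float)
print(cosine(vec("apricot"), vec("data")))
```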

Defining and counting co-occurrence

Defining co-occurrences:
-Within a fixed window: vi occurs within ±n words of w
-Within the same sentence: requires sentence boundaries
-By grammatical relations: vi occurs as a subject/object/modifier/… of verb w (requires parsing, and separate features for each relation)

Counting co-occurrences:
-fi as binary features (1, 0): w does/does not occur with vi

-fi as frequencies: w occurs n times with vi
-fi as probabilities: e.g. fi is the probability that vi is the subject of w.

Getting co-occurrence counts

Co-occurrence as a binary feature: Does word w ever appear in the context c? (1 = yes / 0 = no)

              arts  boil  data  function  large  sugar  water
apricot         0     1     0      0        1      1      1
pineapple       0     1     0      0        1      1      1
digital         0     0     1      1        1      0      0
information     0     0     1      1        1      0      0

Co-occurrence as a frequency count: How often does word w appear in the context c? (0…n times)

              arts  boil  data  function  large  sugar  water
apricot         0     1     0      0        5      2      7
pineapple       0     2     0      0       10      8      5
digital         0     0    31      8       20      0      0
information     0     0    35     23        5      0      0

Typically: 10K-100K dimensions (contexts), very sparse vectors

Counts vs PMI

Sometimes, low co-occurrence counts are very informative, and high co-occurrence counts are not:
-Any word is going to have relatively high co-occurrence counts with very common contexts (e.g. “it”, “anything”, “is”, etc.), but this won’t tell us much about what that word means.
-We need to identify when co-occurrences are more frequent than we would expect by chance.

We therefore want to use PMI values instead of raw frequency counts:

PMI(w, c) = log [ p(w, c) / (p(w) p(c)) ]

But this requires us to define p(w, c), p(w) and p(c).

Pointwise Mutual Information (PMI)

Recall that two events x, y are independent if their joint probability is equal to the product of their individual probabilities:
x, y are independent iff p(x, y) = p(x)p(y)
x, y are independent iff p(x, y) / (p(x)p(y)) = 1

In NLP, we often use the pointwise mutual information (PMI) of two outcomes/events (e.g. words):

PMI(x, y) = log [ p(X = x, Y = y) / (p(X = x) p(Y = y)) ]

Positive Pointwise Mutual Information

PMI is negative when words co-occur less than expected by chance. This is unreliable without huge corpora:

With P(w1) ≈ P(w2) ≈ 10^−6, we can’t estimate whether P(w1, w2) is significantly different from 10^−12

We often just use positive PMI values, and replace all PMI values < 0 with 0:

Positive Pointwise Mutual Information (PPMI):
PPMI(w, c) = PMI(w, c)  if PMI(w, c) > 0
           = 0          if PMI(w, c) ≤ 0
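A sketch of the PPMI computation from a word-by-context count matrix, estimating p(w, c), p(w) and p(c) from the counts themselves (one common choice; the slides only require that these probabilities be defined somehow):

```python
import numpy as np

def ppmi(counts):
    """PPMI from a word-by-context co-occurrence count matrix.

    p(w, c) = count(w, c) / total; p(w) and p(c) are the row and column marginals.
    """
    total = counts.sum()
    p_wc = counts / total
    p_w = p_wc.sum(axis=1, keepdims=True)
    p_c = p_wc.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
        return np.where(pmi > 0, pmi, 0.0)        # replace negative (or undefined) PMI with 0

# frequency-count table from the "Getting co-occurrence counts" slide
# rows: apricot, pineapple, digital, information; columns: arts, boil, data, function, large, sugar, water
counts = np.array([[0, 1, 0, 0, 5, 2, 7],
                   [0, 2, 0, 0, 10, 8, 5],
                   [0, 0, 31, 8, 20, 0, 0],
                   [0, 0, 35, 23, 5, 0, 0]], dtype=float)
print(np.round(ppmi(counts), 2))
```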

Frequencies vs. PMI

Objects of ‘drink’ (Lin, 1998)

Object        Count   PMI
bunch beer      2    12.34
tea             2    11.75
liquid          2    10.53
champagne       4    11.75
anything        3     5.15
it              3     1.25
