
Dan Jurafsky and James Martin Speech and Language Processing

Chapter 6: Vector Semantics, Part II

Tf-idf and PPMI are sparse representations
tf-idf and PPMI vectors are
◦ long (length |V| = 20,000 to 50,000)
◦ sparse (most elements are zero)

Alternative: dense vectors
Vectors which are
◦ short (length 50-1000)
◦ dense (most elements are non-zero)

Sparse versus dense vectors

Why dense vectors?
◦ Short vectors may be easier to use as features in machine learning (fewer weights to tune)
◦ Dense vectors may generalize better than storing explicit counts
◦ They may do better at capturing synonymy:
◦ car and automobile are synonyms; but are represented as distinct dimensions
◦ a word with car as a neighbor and a word with automobile as a neighbor should be similar, but aren't
◦ In practice, they work better

Dense embeddings you can download!

Word2vec (Mikolov et al.): https://code.google.com/archive/p/word2vec/
Fasttext: http://www.fasttext.cc/
GloVe (Pennington, Socher, Manning): http://nlp.stanford.edu/projects/glove/

Word2vec

Popular embedding method
Very fast to train
Code available on the web
Idea: predict rather than count

Word2vec
◦ Instead of counting how often each word w occurs near "apricot"
◦ Train a classifier on a binary prediction task:
◦ Is w likely to show up near "apricot"?

◦ We don't actually care about this task
◦ But we'll take the learned classifier weights as the word embeddings

Brilliant insight: Use running text as implicitly supervised training data!

• A word that occurs near apricot
• Acts as the gold 'correct answer' to the question
• "Is word w likely to show up near apricot?"
• No need for hand-labeled supervision
• The idea comes from neural language modeling
• Bengio et al. (2003)
• Collobert et al. (2011)

Word2Vec: Skip-Gram Task

Word2vec provides a variety of options. Let's do
◦ "skip-gram with negative sampling" (SGNS)

Skip-gram algorithm
1. Treat the target word and a neighboring context word as positive examples.
2. Randomly sample other words in the lexicon to get negative samples.
3. Use logistic regression to train a classifier to distinguish those two cases.
4. Use the learned weights as the embeddings.


Skip-Gram Training Data

Training sentence:
... lemon, a tablespoon of apricot jam a pinch ...
             c1         c2 target  c3  c4

Assume context words are those in a +/- 2 word window
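A minimal sketch of how the positive (target, context) pairs could be extracted from a tokenized sentence with a +/- 2 window; the function and variable names are illustrative, not word2vec's actual preprocessing, and punctuation is simply dropped:

```python
def positive_pairs(tokens, window=2):
    """Yield (target, context) pairs for every token, using a +/- `window` word window."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield (target, tokens[j])

sentence = "lemon a tablespoon of apricot jam a pinch".split()
print([p for p in positive_pairs(sentence) if p[0] == "apricot"])
# -> [('apricot', 'tablespoon'), ('apricot', 'of'), ('apricot', 'jam'), ('apricot', 'a')]
```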

Skip-Gram Goal

Given a tuple (t,c) = target, context
◦ (apricot, jam)
◦ (apricot, aardvark)

Return the probability that c is a real context word:

P(+|t,c)
P(−|t,c) = 1 − P(+|t,c)


How to compute P(+|t,c)?

Intuition:
◦ Words are likely to appear near similar words
◦ Model similarity with the dot product!
◦ Similarity(t,c) ∝ t · c

Problem:
◦ The dot product is not a probability!
◦ (Neither is cosine)

6.7.1 The classifier

Let's start by thinking about the classification task, and then turn to how to train. Imagine a sentence like the following, with a target word apricot, and assume we're using a window of ±2 context words:

... lemon, a [tablespoon of apricot jam, a] pinch ...
                 c1       c2    t    c3  c4

Our goal is to train a classifier such that, given a tuple (t,c) of a target word t paired with a candidate context word c (for example (apricot, jam), or perhaps (apricot, aardvark)), it will return the probability that c is a real context word (true for jam, false for aardvark):

P(+|t,c)    (6.15)

The probability that word c is not a real context word for t is just 1 minus Eq. 6.15:

P(−|t,c) = 1 − P(+|t,c)    (6.16)

How does the classifier compute the probability P? The intuition of the skip-gram model is to base this probability on similarity: a word is likely to occur near the target if its embedding is similar to the target embedding. How can we compute similarity between embeddings? Recall that two vectors are similar if they have a high dot product (cosine, the most popular similarity metric, is just a normalized dot product). In other words:

Similarity(t,c) ≈ t · c    (6.17)

Of course, the dot product t · c is not a probability, it's just a number ranging from −∞ to ∞. (Recall, for that matter, that cosine isn't a probability either.) To turn the dot product into a probability, we'll use the logistic or sigmoid function σ(x), the fundamental core of logistic regression. The sigmoid lies between 0 and 1:

σ(x) = 1 / (1 + e^(−x))    (6.18)

The probability that word c is a real context word for target word t is thus computed as:

P(+|t,c) = 1 / (1 + e^(−t·c))    (6.19)

The sigmoid function just returns a number between 0 and 1, so to make it a probability we'll need to make sure that the total probability of the two possible events (c being a context word, and c not being a context word) sums to 1. The probability that word c is not a real context word for t is thus:

P(−|t,c) = 1 − P(+|t,c)
         = e^(−t·c) / (1 + e^(−t·c))    (6.20)
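These two probabilities are easy to compute directly from a pair of embedding vectors. A minimal sketch in NumPy; the 4-dimensional vectors and their values below are made up purely for illustration (real embeddings are typically 50-1000 dimensional):

```python
import numpy as np

def sigmoid(x):
    """Logistic function (Eq. 6.18): maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def p_positive(t, c):
    """P(+|t,c) (Eq. 6.19): probability that c is a real context word for t."""
    return sigmoid(np.dot(t, c))

def p_negative(t, c):
    """P(-|t,c) (Eq. 6.20): probability that c is NOT a real context word for t."""
    return 1.0 - p_positive(t, c)

# Toy 4-dimensional embeddings (hypothetical values, just for illustration).
apricot  = np.array([0.5, -0.1, 0.8, 0.2])
jam      = np.array([0.4,  0.0, 0.7, 0.1])
aardvark = np.array([-0.6, 0.9, -0.3, 0.5])

print(p_positive(apricot, jam))       # similar vectors -> higher probability
print(p_positive(apricot, aardvark))  # dissimilar vectors -> lower probability
```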

Equation 6.19 gives us the probability for one word, but we need to take account of the multiple context words in the window. Skip-gram makes the strong but very useful simplifying assumption that all context words are independent, allowing us to just multiply their probabilities:

P(+|t,c1:k) = ∏_{i=1..k} 1 / (1 + e^(−t·ci))    (6.21)

log P(+|t,c1:k) = ∑_{i=1..k} log [ 1 / (1 + e^(−t·ci)) ]    (6.22)

In summary, skip-gram trains a probabilistic classifier that, given a test target word t and its context window of k words c1:k, assigns a probability based on how similar this context window is to the target word. The probability is based on applying the logistic (sigmoid) function to the dot product of the embeddings of the target word with each context word. We could thus compute this probability if only we had embeddings for each target and context word in the vocabulary. Let's now turn to learning these embeddings (which is the real goal of training this classifier in the first place).
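A sketch of how the independence assumption plays out in code: the per-word log-probabilities simply add up. The toy 4-dimensional vectors are again made up purely for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_p_window(t, context_vectors):
    """log P(+|t, c_1:k) under the independence assumption (Eq. 6.22)."""
    return sum(np.log(sigmoid(np.dot(t, c))) for c in context_vectors)

# Target "apricot" and its four context words (hypothetical vectors).
apricot = np.array([0.5, -0.1, 0.8, 0.2])
context = [np.array([0.2,  0.1, 0.5, 0.3]),   # tablespoon
           np.array([0.3, -0.2, 0.6, 0.0]),   # of
           np.array([0.4,  0.0, 0.7, 0.1]),   # jam
           np.array([0.1,  0.0, 0.2, 0.1])]   # a
print(log_p_window(apricot, context))
```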

6.7.2 Learning skip-gram embeddings

Word2vec learns embeddings by starting with an initial set of embedding vectors and then iteratively shifting the embedding of each word w to be more like the embeddings of words that occur nearby in texts, and less like the embeddings of words that don't occur nearby.

Let's start by considering a single piece of the training data, from the sentence above:

... lemon, a [tablespoon of apricot jam, a] pinch ...
                 c1       c2    t    c3  c4

This example has a target word t (apricot) and 4 context words in the L = ±2 window, resulting in 4 positive training instances (on the left below):

positive examples +          negative examples -
t        c                   t        c               t        c
apricot  tablespoon          apricot  aardvark        apricot  twelve
apricot  of                  apricot  puddle          apricot  hello
apricot  preserves           apricot  where           apricot  dear
apricot  or                  apricot  coaxial         apricot  forever

For training a binary classifier we also need negative examples, and in fact skip-gram uses more negative examples than positive examples, the ratio set by a parameter k. So for each of these (t,c) training instances we'll create k negative samples, each consisting of the target t plus a 'noise word'. A noise word is a random word from the lexicon, constrained not to be the target word t. The right above shows the setting where k = 2, so we'll have 2 negative examples in the negative training set for each positive example (t,c).

The noise words are chosen according to their weighted unigram frequency Pα(w), where α is a weight.

Skip-Gram Training Data

Training sentence:
... lemon, a tablespoon of apricot jam a pinch ...
             c1         c2 t       c3  c4

Training data: input/output pairs centering on apricot
Assume a ±2 word window




Skip-Gram Training

• For each positive example, we'll create k negative examples
• Using noise words
• Any random word that isn't t
• Here, k = 2


Choosing noise words

Could pick w according to its unigram frequency P(w)

If we were sampling according to unweighted frequency p(w), it would mean that with unigram probability p("the") we would choose the word the as a noise word, with unigram probability p("aardvark") we would choose aardvark, and so on. But in practice it is common to set α = .75, i.e. use the weighting p^(3/4)(w):

More common to choose them according to Pα(w):

Pα(w) = count(w)^α / ∑_{w'} count(w')^α    (6.23)

Setting α = .75 gives better performance because it gives rare noise words slightly higher probability: for rare words, Pα(w) > P(w). To visualize this intuition, it might help to work out the probabilities for an example with two events, P(a) = .99 and P(b) = .01:

Pα(a) = .99^.75 / (.99^.75 + .01^.75) = .97
Pα(b) = .01^.75 / (.99^.75 + .01^.75) = .03    (6.24)
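A sketch of this weighted sampling with made-up counts; `random.choices` and the helper names here are assumptions made just for illustration (word2vec's own implementation uses a precomputed sampling table):

```python
import random

def noise_distribution(counts, alpha=0.75):
    """P_alpha(w) (Eq. 6.23): unigram counts raised to alpha, then renormalized."""
    weighted = {w: c ** alpha for w, c in counts.items()}
    total = sum(weighted.values())
    return {w: v / total for w, v in weighted.items()}

def sample_noise_words(counts, target, k, alpha=0.75):
    """Draw k noise words for one positive pair, never returning the target itself."""
    dist = noise_distribution(counts, alpha)
    words, probs = zip(*dist.items())
    samples = []
    while len(samples) < k:
        w = random.choices(words, weights=probs, k=1)[0]
        if w != target:
            samples.append(w)
    return samples

# Toy counts: "the" is very frequent, "aardvark" is rare.
counts = {"the": 990_000, "aardvark": 10, "apricot": 300, "jam": 500, "puddle": 40}
print(noise_distribution(counts)["aardvark"])      # boosted relative to its raw frequency
print(sample_noise_words(counts, "apricot", k=2))  # e.g. ['the', 'jam']
```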

Given the set of positive and negative training instances, and an initial set of embeddings, the goal of the learning algorithm is to adjust those embeddings such that we
• Maximize the similarity of the target word, context word pairs (t,c) drawn from the positive examples
• Minimize the similarity of the (t,c) pairs drawn from the negative examples.

We can express this formally over the whole training set as:

L(θ) = ∑_{(t,c)∈+} log P(+|t,c) + ∑_{(t,c)∈−} log P(−|t,c)    (6.25)

Or, focusing in on one word/context pair (t,c) with its k noise words n1...nk, the learning objective L is:

L(θ) = log P(+|t,c) + ∑_{i=1..k} log P(−|t,ni)
     = log σ(c·t) + ∑_{i=1..k} log σ(−ni·t)
     = log [ 1 / (1 + e^(−c·t)) ] + ∑_{i=1..k} log [ 1 / (1 + e^(ni·t)) ]    (6.26)

That is, we want to maximize the dot product of the word with the actual context words, and minimize the dot products of the word with the k negative sampled non-neighbor words.

We can then use stochastic gradient descent to train to this objective, iteratively modifying the parameters (the embeddings for each target word t and each context word or noise word c in the vocabulary) to maximize the objective.

Note that the skip-gram model thus actually learns two separate embeddings for each word w: the target embedding t and the context embedding c. These embeddings are stored in two matrices, the target matrix T and the context matrix C.

Setup

Let's represent words as vectors of some length (say 300), randomly initialized. So we start with 300 * V random parameters.
Over the entire training set, we'd like to adjust those word vectors such that we
◦ Maximize the similarity of the target word, context word pairs (t,c) drawn from the positive data
◦ Minimize the similarity of the (t,c) pairs drawn from the negative data.


Learning the classifier

Iterative process.
We'll start with 0 or random weights
Then adjust the word weights to
◦ make the positive pairs more likely
◦ and the negative pairs less likely
over the entire training set.

Objective Criteria

We want to maximize...

∑_{(t,c)∈+} log P(+|t,c) + ∑_{(t,c)∈−} log P(−|t,c)

Maximize the + label for the pairs from the positive training data, and the − label for the pairs sampled from the negative data.
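A minimal gradient-ascent sketch of one training update, obtained by differentiating Eq. 6.26; the matrix names T and C, the learning rate, and the toy sizes are illustrative assumptions (real implementations add optimizations such as subsampling frequent words and a precomputed sigmoid table):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(T, C, t_idx, c_idx, noise_idxs, lr=0.025):
    """One stochastic gradient step on the objective in Eq. 6.26 for a single
    (target, context) pair plus its k noise words.
    T: target embedding matrix (|V| x d); C: context embedding matrix (|V| x d)."""
    t = T[t_idx]
    c_pos = C[c_idx]
    g_pos = 1.0 - sigmoid(np.dot(c_pos, t))   # how far P(+|t,c) is from 1
    grad_t = g_pos * c_pos
    C[c_idx] += lr * g_pos * t                # pull the true context word toward t
    for n_idx in noise_idxs:
        c_neg = C[n_idx]
        g_neg = sigmoid(np.dot(c_neg, t))     # how far P(-|t,n) is from 1
        grad_t -= g_neg * c_neg
        C[n_idx] -= lr * g_neg * t            # push each noise word away from t
    T[t_idx] += lr * grad_t                   # finally update the target embedding

# Tiny vocabulary of 5 words with 4-dimensional embeddings, randomly initialized.
rng = np.random.default_rng(0)
T = rng.normal(scale=0.1, size=(5, 4))
C = rng.normal(scale=0.1, size=(5, 4))
sgns_step(T, C, t_idx=0, c_idx=1, noise_idxs=[3, 4])
```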






[Figure: the target embedding matrix W (|V| × d) and the context embedding matrix C. Training increases similarity(apricot, jam), since jam is a neighbor word in "...apricot jam...", and decreases similarity(apricot, aardvark), since aardvark is a random noise word.]

Train using gradient descent
Actually learns two separate embedding matrices W and C
Can use W and throw away C, or merge them somehow

Summary: How to learn word2vec (skip-gram) embeddings

Start with V random 300-dimensional vectors as initial embeddings
Use logistic regression, the second most basic classifier used in machine learning after naïve Bayes
◦ Take a corpus and take pairs of words that co-occur as positive examples
◦ Take pairs of words that don't co-occur as negative examples
◦ Train the classifier to distinguish these by slowly adjusting all the embeddings to improve the classifier performance
◦ Throw away the classifier code and keep the embeddings
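In practice you rarely implement SGNS by hand; libraries such as gensim wrap the whole pipeline just summarized. A rough sketch, assuming gensim 4.x (parameter names differ slightly in older versions, and the two-sentence corpus is obviously a stand-in for a real one):

```python
from gensim.models import Word2Vec

# Stand-in corpus: a list of tokenized sentences; a real corpus would be far larger.
sentences = [
    ["lemon", "a", "tablespoon", "of", "apricot", "jam", "a", "pinch"],
    ["she", "spread", "apricot", "preserves", "on", "the", "toast"],
]

model = Word2Vec(
    sentences,
    vector_size=100,    # dimensionality of the dense embeddings
    window=2,           # +/- 2 word context window
    sg=1,               # skip-gram (rather than CBOW)
    negative=5,         # k = 5 noise words per positive example
    ns_exponent=0.75,   # the alpha = 0.75 weighting on the noise distribution
    min_count=1,
)

vector = model.wv["apricot"]                    # the learned embedding for one word
print(model.wv.most_similar("apricot", topn=3))
```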

Evaluating embeddings

Compare to human scores on word similarity-type tasks:
• WordSim-353 (Finkelstein et al., 2002)
• SimLex-999 (Hill et al., 2015)
• Stanford Contextual Word Similarity (SCWS) dataset (Huang et al., 2012)
• TOEFL dataset: "Levied is closest in meaning to: imposed, believed, requested, correlated"

Properties of embeddings

Similarity depends on window size C

C = ±2: The nearest words to Hogwarts:
◦ Sunnydale
◦ Evernight

C = ±5: The nearest words to Hogwarts:
◦ Dumbledore
◦ Malfoy
◦ halfblood

Analogy: Embeddings capture relational meaning!

vector('king') − vector('man') + vector('woman') ≈ vector('queen')
vector('Paris') − vector('France') + vector('Italy') ≈ vector('Rome')
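The analogy test is just vector arithmetic plus a nearest-neighbor search by cosine. A hand-rolled sketch over an in-memory dictionary; the toy 2-dimensional vectors are constructed so the analogy works, where real pretrained GloVe or word2vec vectors would normally be loaded from disk:

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c, embeddings):
    """Return the word whose vector is closest (by cosine) to
    vector(b) - vector(a) + vector(c), excluding the three query words."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = ((w, cosine(target, v)) for w, v in embeddings.items()
                  if w not in (a, b, c))
    return max(candidates, key=lambda pair: pair[1])[0]

# Toy vectors: dimension 0 roughly encodes gender, dimension 1 roughly encodes royalty.
embeddings = {
    "man":   np.array([0.9, 0.1]),
    "woman": np.array([0.1, 0.1]),
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.1, 0.8]),
    "apple": np.array([0.5, 0.2]),
}
print(analogy("man", "king", "woman", embeddings))  # -> queen
```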


Embeddings can help study word history!
Train embeddings on old books to study changes in word meaning!!

Will Hamilton: Diachronic word embeddings for studying language change!

[Figure: word vectors for 1920 vs. word vectors for 1990, comparing the "dog" 1920 word vector with the "dog" 1990 word vector.]



Visualizing changes

Project 300 dimensions down into 2
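One common way to do the projection is PCA (t-SNE is the other frequent choice). A sketch with scikit-learn, using random stand-in vectors where real 300-dimensional embeddings would go:

```python
import numpy as np
from sklearn.decomposition import PCA

words = ["dog", "cat", "car", "truck", "apricot"]
# Stand-in for real 300-dimensional embeddings looked up for these words.
vectors = np.random.default_rng(0).normal(size=(len(words), 300))

coords = PCA(n_components=2).fit_transform(vectors)  # shape (5, 2): an x, y per word
for word, (x, y) in zip(words, coords):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```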

~30 million books, 1850-1990, Books data

The evolution of sentiment words
Negative words change faster than positive words

Embeddings and bias

Embeddings reflect cultural bias

Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. "Man is to computer programmer as woman is to homemaker? Debiasing word embeddings." In Advances in Neural Information Processing Systems, pp. 4349-4357. 2016.

Ask "Paris : France :: Tokyo : x"
◦ x = Japan
Ask "father : doctor :: mother : x"
◦ x = nurse
Ask "man : computer programmer :: woman : x"
◦ x = homemaker

Embeddings reflect cultural bias

Caliskan, Aylin, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science 356:6334, 183-186.

Implicit Association Test (Greenwald et al. 1998): How associated are
◦ concepts (flowers, insects) & attributes (pleasantness, unpleasantness)?
◦ Studied by measuring timing latencies for categorization.

Psychological findings on US participants:
◦ African-American names are associated with unpleasant words (more than European-American names)
◦ Male names associated more with math, female names with arts
◦ Old people's names with unpleasant words, young people with pleasant words.

Caliskan et al. replication with embeddings:
◦ African-American names (Leroy, Shaniqua) had a higher GloVe cosine with unpleasant words (abuse, stink, ugly)
◦ European-American names (Brad, Greg, Courtney) had a higher cosine with pleasant words (love, peace, miracle)

Embeddings reflect and replicate all sorts of pernicious biases.

Directions

Debiasing algorithms for embeddings
◦ Bolukbasi, Tolga, Chang, Kai-Wei, Zou, James Y., Saligrama, Venkatesh, and Kalai, Adam T. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems, pp. 4349-4357.

Use embeddings as a historical tool to study bias

Embeddings as a window onto history

Garg, Nikhil, Schiebinger, Londa, Jurafsky, Dan, and Zou, James (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635–E3644

Use the Hamilton historical embeddings

The cosine of embeddings for decade X for occupations (like teacher) to male vs. female names
◦ Is correlated with the actual percentage of women teachers in decade X
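One simple way to operationalize this kind of measurement (a sketch in the spirit of Garg et al.'s method, not their exact relative-norm formula): compare an occupation vector's average cosine to female names with its average cosine to male names, using the embedding trained on a given decade. The name lists and the random stand-in vectors below are purely illustrative:

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def gender_bias(occupation, female_names, male_names, emb):
    """Positive score: the occupation is closer to the female names than to the male names.
    `emb` maps each word to its embedding for one particular decade."""
    f = np.mean([cosine(emb[occupation], emb[n]) for n in female_names])
    m = np.mean([cosine(emb[occupation], emb[n]) for n in male_names])
    return f - m

# Random stand-ins for embeddings trained on 1950s text.
rng = np.random.default_rng(1)
emb_1950 = {w: rng.normal(size=50)
            for w in ["teacher", "mary", "susan", "john", "robert"]}
print(gender_bias("teacher", ["mary", "susan"], ["john", "robert"], emb_1950))
```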

History of biased framings of women

Garg, Nikhil, Schiebinger, Londa, Jurafsky, Dan, and Zou, James (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635–E3644

Embeddings for competence adjectives are biased toward men
◦ Smart, wise, brilliant, intelligent, resourceful, thoughtful, logical, etc.
This bias is slowly decreasing

Embeddings reflect ethnic stereotypes over time

Garg, Nikhil, Schiebinger, Londa, Jurafsky, Dan, and Zou, James (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635–E3644

• Princeton trilogy experiments
• Attitudes toward ethnic groups (1933, 1951, 1969): scores for adjectives
• industrious, superstitious, nationalistic, etc.
• Cosine of Chinese name embeddings with those adjective embeddings correlates with human ratings.

Change in linguistic framing, 1910-1990

Garg, Nikhil, Schiebinger, Londa, Jurafsky, Dan, and Zou, James (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635–E3644

Change in association of Chinese names with adjectives framed as "othering" (barbaric, monstrous, bizarre)

[Figure (Garg et al., Fig. 6): Asian bias score over time for words related to outsiders in COHA data; the shaded region is the bootstrap SE interval.]

Changes in framing: adjectives associated with Chinese

Garg, Nikhil, Schiebinger, Londa, Jurafsky, Dan, and Zou, James (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635–E3644

Table 3. Top Asian (vs. White) adjectives in 1910, 1950, and 1990, by relative norm difference in the COHA embedding

1910            1950            1990
Irresponsible   Disorganized    Inhibited
Envious         Outrageous      Passive
Barbaric        Pompous         Dissolute
Aggressive      Unstable        Haughty
Transparent     Effeminate      Complacent
Monstrous       Unprincipled    Forceful
Hateful         Venomous        Fixed
Cruel           Disobedient     Active
Greedy          Predatory       Sensitive
Bizarre         Boisterous      Hearty

Conclusion

Concepts or word senses
◦ Have a complex many-to-many association with words (homonymy, multiple senses)
◦ Have relations with each other
◦ Synonymy, Antonymy, Superordinate
◦ But are hard to define formally (necessary & sufficient conditions)

Embeddings = vector models of meaning
◦ More fine-grained than just a string or index
◦ Especially good at modeling similarity/analogy
◦ Just download them and use cosines!!
◦ Can use sparse models (tf-idf) or dense models (word2vec, GloVe)
◦ Useful in practice but know they encode cultural stereotypes