
Dan Jurafsky and James Martin Speech and Language Processing

Chapter 6: Vector Semantics, Part II

Tf-idf and PPMI are sparse representations
tf-idf and PPMI vectors are
◦ long (length |V| = 20,000 to 50,000)
◦ sparse (most elements are zero)

Alternative: dense vectors
Vectors which are
◦ short (length 50-1000)
◦ dense (most elements are non-zero)

Sparse versus dense vectors

Why dense vectors?
◦ Short vectors may be easier to use as features in machine learning (fewer weights to tune)
◦ Dense vectors may generalize better than storing explicit counts
◦ They may do better at capturing synonymy:
◦ car and automobile are synonyms; but are represented as distinct dimensions
◦ a word with car as a neighbor and a word with automobile as a neighbor should be similar, but aren't
◦ In practice, they work better

Dense embeddings you can download!

Word2vec (Mikolov et al.): https://code.google.com/archive/p/word2vec/
Fasttext: http://www.fasttext.cc/
GloVe (Pennington, Socher, Manning): http://nlp.stanford.edu/projects/glove/

Word2vec

Popular embedding method
Very fast to train
Code available on the web
Idea: predict rather than count

Word2vec
◦ Instead of counting how often each word w occurs near "apricot"
◦ Train a classifier on a binary prediction task:
◦ Is w likely to show up near "apricot"?

◦ We don't actually care about this task
◦ But we'll take the learned classifier weights as the word embeddings

Brilliant insight: Use running text as implicitly supervised training data!

• A word that occurs near apricot
• Acts as the gold 'correct answer' to the question
• "Is word w likely to show up near apricot?"
• No need for hand-labeled supervision
• The idea comes from neural language modeling
• Bengio et al. (2003)
• Collobert et al. (2011)

Word2Vec: Skip-Gram Task

Word2vec provides a variety of options. Let's do
◦ "skip-gram with negative sampling" (SGNS)

Skip-gram algorithm
1. Treat the target word and a neighboring context word as positive examples.
2. Randomly sample other words in the lexicon to get negative samples.
3. Use logistic regression to train a classifier to distinguish those two cases.
4. Use the learned weights as the embeddings.


Skip-Gram Training Data

Training sentence:
... lemon, a tablespoon of apricot jam a pinch ...
             c1         c2 target  c3  c4

Assume context words are those in a +/- 2 word window
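A minimal sketch of how the positive (target, context) pairs could be extracted from a tokenized sentence with a +/- 2 window; the function and variable names are illustrative, not word2vec's actual preprocessing, and punctuation is simply dropped:

```python
def positive_pairs(tokens, window=2):
    """Yield (target, context) pairs for every token, using a +/- `window` word window."""
    for i, target in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield (target, tokens[j])

sentence = "lemon a tablespoon of apricot jam a pinch".split()
print([p for p in positive_pairs(sentence) if p[0] == "apricot"])
# -> [('apricot', 'tablespoon'), ('apricot', 'of'), ('apricot', 'jam'), ('apricot', 'a')]
```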

Skip-Gram Goal

Given a tuple (t,c) = target, context
◦ (apricot, jam)
◦ (apricot, aardvark)

Return the probability that c is a real context word:

P(+|t,c)
P(−|t,c) = 1 − P(+|t,c)


How to compute P(+|t,c)?

Intuition:
◦ Words are likely to appear near similar words
◦ Model similarity with the dot product!
◦ Similarity(t,c) ∝ t · c

Problem:
◦ The dot product is not a probability!
◦ (Neither is cosine)

6.7.1 The classifier

Let's start by thinking about the classification task, and then turn to how to train. Imagine a sentence like the following, with a target word apricot, and assume we're using a window of ±2 context words:

... lemon, a [tablespoon of apricot jam, a] pinch ...
                 c1       c2    t    c3  c4

Our goal is to train a classifier such that, given a tuple (t,c) of a target word t paired with a candidate context word c (for example (apricot, jam), or perhaps (apricot, aardvark)), it will return the probability that c is a real context word (true for jam, false for aardvark):

P(+|t,c)    (6.15)

The probability that word c is not a real context word for t is just 1 minus Eq. 6.15:

P(−|t,c) = 1 − P(+|t,c)    (6.16)

How does the classifier compute the probability P? The intuition of the skip-gram model is to base this probability on similarity: a word is likely to occur near the target if its embedding is similar to the target embedding. How can we compute similarity between embeddings? Recall that two vectors are similar if they have a high dot product (cosine, the most popular similarity metric, is just a normalized dot product). In other words:

Similarity(t,c) ≈ t · c    (6.17)

Of course, the dot product t · c is not a probability, it's just a number ranging from −∞ to ∞. (Recall, for that matter, that cosine isn't a probability either.) To turn the dot product into a probability, we'll use the logistic or sigmoid function σ(x), the fundamental core of logistic regression. The sigmoid lies between 0 and 1:

σ(x) = 1 / (1 + e^(−x))    (6.18)

The probability that word c is a real context word for target word t is thus computed as:

P(+|t,c) = 1 / (1 + e^(−t·c))    (6.19)

The sigmoid function just returns a number between 0 and 1, so to make it a probability we'll need to make sure that the total probability of the two possible events (c being a context word, and c not being a context word) sums to 1. The probability that word c is not a real context word for t is thus:

P(−|t,c) = 1 − P(+|t,c)
         = e^(−t·c) / (1 + e^(−t·c))    (6.20)
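These two probabilities are easy to compute directly from a pair of embedding vectors. A minimal sketch in NumPy; the 4-dimensional vectors and their values below are made up purely for illustration (real embeddings are typically 50-1000 dimensional):

```python
import numpy as np

def sigmoid(x):
    """Logistic function (Eq. 6.18): maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def p_positive(t, c):
    """P(+|t,c) (Eq. 6.19): probability that c is a real context word for t."""
    return sigmoid(np.dot(t, c))

def p_negative(t, c):
    """P(-|t,c) (Eq. 6.20): probability that c is NOT a real context word for t."""
    return 1.0 - p_positive(t, c)

# Toy 4-dimensional embeddings (hypothetical values, just for illustration).
apricot  = np.array([0.5, -0.1, 0.8, 0.2])
jam      = np.array([0.4,  0.0, 0.7, 0.1])
aardvark = np.array([-0.6, 0.9, -0.3, 0.5])

print(p_positive(apricot, jam))       # similar vectors -> higher probability
print(p_positive(apricot, aardvark))  # dissimilar vectors -> lower probability
```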

Equation 6.19 gives us the probability for one word, but we need to take account of the multiple context words in the window. Skip-gram makes the strong but very useful simplifying assumption that all context words are independent, allowing us to just multiply their probabilities:

P(+|t,c1:k) = ∏_{i=1..k} 1 / (1 + e^(−t·ci))    (6.21)

log P(+|t,c1:k) = ∑_{i=1..k} log [ 1 / (1 + e^(−t·ci)) ]    (6.22)

In summary, skip-gram trains a probabilistic classifier that, given a test target word t and its context window of k words c1:k, assigns a probability based on how similar this context window is to the target word. The probability is based on applying the logistic (sigmoid) function to the dot product of the embeddings of the target word with each context word. We could thus compute this probability if only we had embeddings for each target and context word in the vocabulary. Let's now turn to learning these embeddings (which is the real goal of training this classifier in the first place).
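A sketch of how the independence assumption plays out in code: the per-word log-probabilities simply add up. The toy 4-dimensional vectors are again made up purely for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def log_p_window(t, context_vectors):
    """log P(+|t, c_1:k) under the independence assumption (Eq. 6.22)."""
    return sum(np.log(sigmoid(np.dot(t, c))) for c in context_vectors)

# Target "apricot" and its four context words (hypothetical vectors).
apricot = np.array([0.5, -0.1, 0.8, 0.2])
context = [np.array([0.2,  0.1, 0.5, 0.3]),   # tablespoon
           np.array([0.3, -0.2, 0.6, 0.0]),   # of
           np.array([0.4,  0.0, 0.7, 0.1]),   # jam
           np.array([0.1,  0.0, 0.2, 0.1])]   # a
print(log_p_window(apricot, context))
```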

6.7.2 Learning skip-gram embeddings

Word2vec learns embeddings by starting with an initial set of embedding vectors and then iteratively shifting the embedding of each word w to be more like the embeddings of words that occur nearby in texts, and less like the embeddings of words that don't occur nearby.

Let's start by considering a single piece of the training data, from the sentence above:

... lemon, a [tablespoon of apricot jam, a] pinch ...
                 c1       c2    t    c3  c4

This example has a target word t (apricot) and 4 context words in the L = ±2 window, resulting in 4 positive training instances (on the left below):

positive examples +          negative examples -
t        c                   t        c               t        c
apricot  tablespoon          apricot  aardvark        apricot  twelve
apricot  of                  apricot  puddle          apricot  hello
apricot  preserves           apricot  where           apricot  dear
apricot  or                  apricot  coaxial         apricot  forever

For training a binary classifier we also need negative examples, and in fact skip-gram uses more negative examples than positive examples, the ratio set by a parameter k. So for each of these (t,c) training instances we'll create k negative samples, each consisting of the target t plus a 'noise word'. A noise word is a random word from the lexicon, constrained not to be the target word t. The right above shows the setting where k = 2, so we'll have 2 negative examples in the negative training set for each positive example (t,c).

The noise words are chosen according to their weighted unigram frequency Pα(w), where α is a weight.

Skip-Gram Training Data

Training sentence:
... lemon, a tablespoon of apricot jam a pinch ...
             c1         c2 t       c3  c4

Training data: input/output pairs centering on apricot
Assume a ±2 word window




Skip-Gram Training

• For each positive example, we'll create k negative examples
• Using noise words
• Any random word that isn't t
• Here, k = 2


Choosing noise words

Could pick w according to its unigram frequency P(w)

If we were sampling according to unweighted frequency p(w), it would mean that with unigram probability p("the") we would choose the word the as a noise word, with unigram probability p("aardvark") we would choose aardvark, and so on. But in practice it is common to set α = .75, i.e. use the weighting p^(3/4)(w):

More common to choose them according to Pα(w):

Pα(w) = count(w)^α / ∑_{w'} count(w')^α    (6.23)

Setting α = .75 gives better performance because it gives rare noise words slightly higher probability: for rare words, Pα(w) > P(w). To visualize this intuition, it might help to work out the probabilities for an example with two events, P(a) = .99 and P(b) = .01:

Pα(a) = .99^.75 / (.99^.75 + .01^.75) = .97
Pα(b) = .01^.75 / (.99^.75 + .01^.75) = .03    (6.24)
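A sketch of this weighted sampling with made-up counts; `random.choices` and the helper names here are assumptions made just for illustration (word2vec's own implementation uses a precomputed sampling table):

```python
import random

def noise_distribution(counts, alpha=0.75):
    """P_alpha(w) (Eq. 6.23): unigram counts raised to alpha, then renormalized."""
    weighted = {w: c ** alpha for w, c in counts.items()}
    total = sum(weighted.values())
    return {w: v / total for w, v in weighted.items()}

def sample_noise_words(counts, target, k, alpha=0.75):
    """Draw k noise words for one positive pair, never returning the target itself."""
    dist = noise_distribution(counts, alpha)
    words, probs = zip(*dist.items())
    samples = []
    while len(samples) < k:
        w = random.choices(words, weights=probs, k=1)[0]
        if w != target:
            samples.append(w)
    return samples

# Toy counts: "the" is very frequent, "aardvark" is rare.
counts = {"the": 990_000, "aardvark": 10, "apricot": 300, "jam": 500, "puddle": 40}
print(noise_distribution(counts)["aardvark"])      # boosted relative to its raw frequency
print(sample_noise_words(counts, "apricot", k=2))  # e.g. ['the', 'jam']
```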

Given the set of positive and negative training instances, and an initial set of embeddings, the goal of the learning algorithm is to adjust those embeddings such that we
• Maximize the similarity of the target word, context word pairs (t,c) drawn from the positive examples
• Minimize the similarity of the (t,c) pairs drawn from the negative examples.

We can express this formally over the whole training set as:

L(θ) = ∑_{(t,c)∈+} log P(+|t,c) + ∑_{(t,c)∈−} log P(−|t,c)    (6.25)

Or, focusing in on one word/context pair (t,c) with its k noise words n1...nk, the learning objective L is:

L(θ) = log P(+|t,c) + ∑_{i=1..k} log P(−|t,ni)
     = log σ(c·t) + ∑_{i=1..k} log σ(−ni·t)
     = log [ 1 / (1 + e^(−c·t)) ] + ∑_{i=1..k} log [ 1 / (1 + e^(ni·t)) ]    (6.26)

That is, we want to maximize the dot product of the word with the actual context words, and minimize the dot products of the word with the k negative sampled non-neighbor words.

We can then use stochastic gradient descent to train to this objective, iteratively modifying the parameters (the embeddings for each target word t and each context word or noise word c in the vocabulary) to maximize the objective.

Note that the skip-gram model thus actually learns two separate embeddings for each word w: the target embedding t and the context embedding c. These embeddings are stored in two matrices, the target matrix T and the context matrix C.

Setup

Let's represent words as vectors of some length (say 300), randomly initialized. So we start with 300 * V random parameters.
Over the entire training set, we'd like to adjust those word vectors such that we
◦ Maximize the similarity of the target word, context word pairs (t,c) drawn from the positive data
◦ Minimize the similarity of the (t,c) pairs drawn from the negative data.


Learning the classifier

Iterative process.
We'll start with 0 or random weights
Then adjust the word weights to
◦ make the positive pairs more likely
◦ and the negative pairs less likely
over the entire training set.

Objective Criteria

We want to maximize...

∑_{(t,c)∈+} log P(+|t,c) + ∑_{(t,c)∈−} log P(−|t,c)

Maximize the + label for the pairs from the positive training data, and the − label for the pairs sampled from the negative data.
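A minimal gradient-ascent sketch of one training update, obtained by differentiating Eq. 6.26; the matrix names T and C, the learning rate, and the toy sizes are illustrative assumptions (real implementations add optimizations such as subsampling frequent words and a precomputed sigmoid table):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(T, C, t_idx, c_idx, noise_idxs, lr=0.025):
    """One stochastic gradient step on the objective in Eq. 6.26 for a single
    (target, context) pair plus its k noise words.
    T: target embedding matrix (|V| x d); C: context embedding matrix (|V| x d)."""
    t = T[t_idx]
    c_pos = C[c_idx]
    g_pos = 1.0 - sigmoid(np.dot(c_pos, t))   # how far P(+|t,c) is from 1
    grad_t = g_pos * c_pos
    C[c_idx] += lr * g_pos * t                # pull the true context word toward t
    for n_idx in noise_idxs:
        c_neg = C[n_idx]
        g_neg = sigmoid(np.dot(c_neg, t))     # how far P(-|t,n) is from 1
        grad_t -= g_neg * c_neg
        C[n_idx] -= lr * g_neg * t            # push each noise word away from t
    T[t_idx] += lr * grad_t                   # finally update the target embedding

# Tiny vocabulary of 5 words with 4-dimensional embeddings, randomly initialized.
rng = np.random.default_rng(0)
T = rng.normal(scale=0.1, size=(5, 4))
C = rng.normal(scale=0.1, size=(5, 4))
sgns_step(T, C, t_idx=0, c_idx=1, noise_idxs=[3, 4])
```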






[Figure: the target embedding matrix W (|V| × d) and the context embedding matrix C. Training increases similarity(apricot, jam), since jam is a neighbor word in "...apricot jam...", and decreases similarity(apricot, aardvark), since aardvark is a random noise word.]

Train using gradient descent
Actually learns two separate embedding matrices W and C
Can use W and throw away C, or merge them somehow

Summary: How to learn word2vec (skip-gram) embeddings

Start with V random 300-dimensional vectors as initial embeddings
Use logistic regression, the second most basic classifier used in machine learning after naïve Bayes
◦ Take a corpus and take pairs of words that co-occur as positive examples
◦ Take pairs of words that don't co-occur as negative examples
◦ Train the classifier to distinguish these by slowly adjusting all the embeddings to improve the classifier performance
◦ Throw away the classifier code and keep the embeddings
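In practice you rarely implement SGNS by hand; libraries such as gensim wrap the whole pipeline just summarized. A rough sketch, assuming gensim 4.x (parameter names differ slightly in older versions, and the two-sentence corpus is obviously a stand-in for a real one):

```python
from gensim.models import Word2Vec

# Stand-in corpus: a list of tokenized sentences; a real corpus would be far larger.
sentences = [
    ["lemon", "a", "tablespoon", "of", "apricot", "jam", "a", "pinch"],
    ["she", "spread", "apricot", "preserves", "on", "the", "toast"],
]

model = Word2Vec(
    sentences,
    vector_size=100,    # dimensionality of the dense embeddings
    window=2,           # +/- 2 word context window
    sg=1,               # skip-gram (rather than CBOW)
    negative=5,         # k = 5 noise words per positive example
    ns_exponent=0.75,   # the alpha = 0.75 weighting on the noise distribution
    min_count=1,
)

vector = model.wv["apricot"]                    # the learned embedding for one word
print(model.wv.most_similar("apricot", topn=3))
```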

Evaluating embeddings

Compare to human scores on word similarity-type tasks:
• WordSim-353 (Finkelstein et al., 2002)
• SimLex-999 (Hill et al., 2015)
• Stanford Contextual Word Similarity (SCWS) dataset (Huang et al., 2012)
• TOEFL dataset: "Levied is closest in meaning to: imposed, believed, requested, correlated"

Properties of embeddings

Similarity depends on window size C

C = ±2: The nearest words to Hogwarts:
◦ Sunnydale
◦ Evernight

C = ±5: The nearest words to Hogwarts:
◦ Dumbledore
◦ Malfoy
◦ halfblood

Analogy: Embeddings capture relational meaning!

vector('king') − vector('man') + vector('woman') ≈ vector('queen')
vector('Paris') − vector('France') + vector('Italy') ≈ vector('Rome')
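The analogy test is just vector arithmetic plus a nearest-neighbor search by cosine. A hand-rolled sketch over an in-memory dictionary; the toy 2-dimensional vectors are constructed so the analogy works, where real pretrained GloVe or word2vec vectors would normally be loaded from disk:

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(a, b, c, embeddings):
    """Return the word whose vector is closest (by cosine) to
    vector(b) - vector(a) + vector(c), excluding the three query words."""
    target = embeddings[b] - embeddings[a] + embeddings[c]
    candidates = ((w, cosine(target, v)) for w, v in embeddings.items()
                  if w not in (a, b, c))
    return max(candidates, key=lambda pair: pair[1])[0]

# Toy vectors: dimension 0 roughly encodes gender, dimension 1 roughly encodes royalty.
embeddings = {
    "man":   np.array([0.9, 0.1]),
    "woman": np.array([0.1, 0.1]),
    "king":  np.array([0.9, 0.8]),
    "queen": np.array([0.1, 0.8]),
    "apple": np.array([0.5, 0.2]),
}
print(analogy("man", "king", "woman", embeddings))  # -> queen
```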


Embeddings can help study word history!
Train embeddings on old books to study changes in word meaning!!

Will Hamilton: Diachronic word embeddings for studying language change!

[Figure: word vectors for 1920 vs. word vectors for 1990, comparing the "dog" 1920 word vector with the "dog" 1990 word vector.]



Visualizing changes

Project 300 dimensions down into 2
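One common way to do the projection is PCA (t-SNE is the other frequent choice). A sketch with scikit-learn, using random stand-in vectors where real 300-dimensional embeddings would go:

```python
import numpy as np
from sklearn.decomposition import PCA

words = ["dog", "cat", "car", "truck", "apricot"]
# Stand-in for real 300-dimensional embeddings looked up for these words.
vectors = np.random.default_rng(0).normal(size=(len(words), 300))

coords = PCA(n_components=2).fit_transform(vectors)  # shape (5, 2): an x, y per word
for word, (x, y) in zip(words, coords):
    print(f"{word}: ({x:.2f}, {y:.2f})")
```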

~30 million books, 1850-1990, Books data

The evolution of sentiment words
Negative words change faster than positive words

Embeddings and bias

Embeddings reflect cultural bias

Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. "Man is to computer programmer as woman is to homemaker? Debiasing word embeddings." In Advances in Neural Information Processing Systems, pp. 4349-4357. 2016.

Ask "Paris : France :: Tokyo : x"
◦ x = Japan
Ask "father : doctor :: mother : x"
◦ x = nurse
Ask "man : computer programmer :: woman : x"
◦ x = homemaker

Embeddings reflect cultural bias

Caliskan, Aylin, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science 356:6334, 183-186.

Implicit Association Test (Greenwald et al. 1998): How associated are
◦ concepts (flowers, insects) & attributes (pleasantness, unpleasantness)?
◦ Studied by measuring timing latencies for categorization.

Psychological findings on US participants:
◦ African-American names are associated with unpleasant words (more than European-American names)
◦ Male names associated more with math, female names with arts
◦ Old people's names with unpleasant words, young people with pleasant words.

Caliskan et al. replication with embeddings:
◦ African-American names (Leroy, Shaniqua) had a higher GloVe cosine with unpleasant words (abuse, stink, ugly)
◦ European-American names (Brad, Greg, Courtney) had a higher cosine with pleasant words (love, peace, miracle)

Embeddings reflect and replicate all sorts of pernicious biases.

Directions

Debiasing algorithms for embeddings
◦ Bolukbasi, Tolga, Chang, Kai-Wei, Zou, James Y., Saligrama, Venkatesh, and Kalai, Adam T. (2016). Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Advances in Neural Information Processing Systems, pp. 4349-4357.

Use embeddings as a historical tool to study bias

Embeddings as a window onto history

Garg, Nikhil, Schiebinger, Londa, Jurafsky, Dan, and Zou, James (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635–E3644

Use the Hamilton historical embeddings

The cosine of embeddings for decade X for occupations (like teacher) to male vs. female names
◦ Is correlated with the actual percentage of women teachers in decade X
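One simple way to operationalize this kind of measurement (a sketch in the spirit of Garg et al.'s method, not their exact relative-norm formula): compare an occupation vector's average cosine to female names with its average cosine to male names, using the embedding trained on a given decade. The name lists and the random stand-in vectors below are purely illustrative:

```python
import numpy as np

def cosine(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def gender_bias(occupation, female_names, male_names, emb):
    """Positive score: the occupation is closer to the female names than to the male names.
    `emb` maps each word to its embedding for one particular decade."""
    f = np.mean([cosine(emb[occupation], emb[n]) for n in female_names])
    m = np.mean([cosine(emb[occupation], emb[n]) for n in male_names])
    return f - m

# Random stand-ins for embeddings trained on 1950s text.
rng = np.random.default_rng(1)
emb_1950 = {w: rng.normal(size=50)
            for w in ["teacher", "mary", "susan", "john", "robert"]}
print(gender_bias("teacher", ["mary", "susan"], ["john", "robert"], emb_1950))
```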

History of biased framings of women

Garg, Nikhil, Schiebinger, Londa, Jurafsky, Dan, and Zou, James (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635–E3644

Embeddings for competence adjectives are biased toward men
◦ Smart, wise, brilliant, intelligent, resourceful, thoughtful, logical, etc.
This bias is slowly decreasing

Embeddings reflect ethnic stereotypes over time

Garg, Nikhil, Schiebinger, Londa, Jurafsky, Dan, and Zou, James (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635–E3644

• Princeton trilogy experiments
• Attitudes toward ethnic groups (1933, 1951, 1969): scores for adjectives
• industrious, superstitious, nationalistic, etc.
• Cosine of Chinese name embeddings with those adjective embeddings correlates with human ratings.

Change in linguistic framing, 1910-1990

Garg, Nikhil, Schiebinger, Londa, Jurafsky, Dan, and Zou, James (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635–E3644

Change in association of Chinese names with adjectives framed as "othering" (barbaric, monstrous, bizarre)

[Figure (Garg et al., Fig. 6): Asian bias score over time for words related to outsiders in COHA data; the shaded region is the bootstrap SE interval.]

Changes in framing: adjectives associated with Chinese

Garg, Nikhil, Schiebinger, Londa, Jurafsky, Dan, and Zou, James (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences, 115(16), E3635–E3644

Table 3. Top Asian (vs. White) adjectives in 1910, 1950, and 1990, by relative norm difference in the COHA embedding

1910            1950            1990
Irresponsible   Disorganized    Inhibited
Envious         Outrageous      Passive
Barbaric        Pompous         Dissolute
Aggressive      Unstable        Haughty
Transparent     Effeminate      Complacent
Monstrous       Unprincipled    Forceful
Hateful         Venomous        Fixed
Cruel           Disobedient     Active
Greedy          Predatory       Sensitive
Bizarre         Boisterous      Hearty

Conclusion

Concepts or word senses
◦ Have a complex many-to-many association with words (homonymy, multiple senses)
◦ Have relations with each other
◦ Synonymy, Antonymy, Superordinate
◦ But are hard to define formally (necessary & sufficient conditions)

Embeddings = vector models of meaning
◦ More fine-grained than just a string or index
◦ Especially good at modeling similarity/analogy
◦ Just download them and use cosines!!
◦ Can use sparse models (tf-idf) or dense models (word2vec, GloVe)
◦ Useful in practice but know they encode cultural stereotypes