Word Embeddings

Benjamin Roth, Nina Poerner

Centrum für Informations- und Sprachverarbeitung, Ludwig-Maximilians-Universität München, [email protected]

November 14, 2018

Motivation

How can we represent words in a neural network? Possible solution: indicator (one-hot) vectors of length |V| (vocabulary size).

w^(the) = (1, 0, 0, ...)^T    w^(cat) = (0, 1, 0, ...)^T    w^(dog) = (0, 0, 1, ...)^T

Question: Why is this a bad idea?

- Parameter explosion (|V| might be > 1M)
- All word vectors are orthogonal to each other → no notion of word similarity


Learn one word vector w^(i) ∈ R^D (the "embedding") per word i
Typical dimensionality: 50 ≤ D ≤ 1000 ≪ |V|
Embedding matrix: W ∈ R^(|V|×D)
Question: Advantages of using word vectors?

- We can express similarities between words, e.g., with cosine similarity:
  cos(w^(i), w^(j)) = w^(i)T w^(j) / (||w^(i)||_2 · ||w^(j)||_2)

- Since the embedding operation is a lookup operation, we only need to update the vectors that occur in a given training batch (see the sketch below)
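A minimal numpy sketch of both points; the toy vocabulary, the dimensionality, and the random embedding matrix W are placeholders for illustration:

import numpy as np

np.random.seed(0)
vocab = {"the": 0, "cat": 1, "dog": 2}   # toy vocabulary (hypothetical)
V, D = len(vocab), 50                    # vocabulary size, embedding dimension
W = np.random.randn(V, D) * 0.01         # embedding matrix W in R^(|V| x D)

def embed(word):
    # Embedding = row lookup, not a matrix product with a one-hot vector
    return W[vocab[word]]

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(embed("cat"), embed("dog")))  # similarity between two word vectors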


Training from scratch: Initialize the embedding matrix randomly and learn it during the training phase
→ words that play similar roles w.r.t. the task get similar embeddings
e.g., from sentiment classification, we might expect w^(great) ≈ w^(awesome)
Question: What could be a problem at test time?

- If the training set is small, many words are unseen during training and therefore have random vectors
We typically have more unlabelled than labelled data. Can we learn embeddings from the unlabelled data?


Distributional hypothesis: "a word is characterized by the company it keeps" (Firth, 1957)
Basic idea: learn similar vectors for words that occur in similar contexts
Examples: GloVe, Word2Vec, FastText


Recap: Language Models

Question: What is a language model?

- A function that assigns a probability to a sequence of words.
Question: What is an n-gram language model?

- Markov assumption: the probability of a word only depends on at most n − 1 previous words:

P(w[1] ... w[T]) = ∏_{t=1}^{T} P(w[t] | w[t−1] ... w[t−n+1])

Word2Vec as a Bigram Language Model

Words in our vocabulary are represented by two sets of vectors:
- w^(i) ∈ R^D if word i is to be predicted
- v^(i) ∈ R^D if word i is conditioned on as context
Predict word i given previous word j:

P(i|j) = f(w^(i), v^(j))

Question: What is a possible function f (·)?

A Simple Neural Network Bigram Language Model

Softmax!
P(i|j) = exp(w^(i)T v^(j)) / Σ_{k=1}^{|V|} exp(w^(k)T v^(j))
Question: What is the problem with training the softmax?

- ⇒ Slow: we need to compute dot products with the whole vocabulary for every single prediction (see the sketch below).
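As a sketch of why this is expensive, a full softmax over a toy vocabulary; the sizes and the random parameter matrices are stand-ins, not trained values:

import numpy as np

np.random.seed(0)
V, D = 10000, 100
W = np.random.randn(V, D) * 0.01     # target ("output") vectors w^(i)
Cv = np.random.randn(V, D) * 0.01    # context vectors v^(j)

def p_i_given_j(i, j):
    scores = W @ Cv[j]                        # |V| dot products for one prediction
    scores -= scores.max()                    # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[i]

print(p_i_given_j(42, 7))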


Speeding up Training: Hierarchical Softmax

Context vectors v are defined like before. Word vectors w are replaced by a binary tree:

[Binary tree over the vocabulary: root node θ^(1); left child θ^(2) with leaves cat and sat; right child θ^(3) with leaves dog and the]


Each tree node l has a parameter vector θ^(l)
Probability of going left at node l, given context word j: p(left|l, j) = σ(θ^(l)T v^(j))
Probability of going right: p(right|l, j) = 1 − p(left|l, j)
Probability of word i given j: product of the probabilities on the path from the root to i (see the sketch below)
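A minimal sketch of this computation for the four-word tree above; the node vectors θ and context vectors v are random placeholders, and each word is encoded by its path from the root as (node, go_left) decisions:

import numpy as np

np.random.seed(0)
D = 50
theta = {1: np.random.randn(D), 2: np.random.randn(D), 3: np.random.randn(D)}
v = {w: np.random.randn(D) for w in ["cat", "sat", "dog", "the"]}

# Path from the root to each leaf: (node id, True if we go left at that node)
paths = {
    "cat": [(1, True),  (2, True)],
    "sat": [(1, True),  (2, False)],
    "dog": [(1, False), (3, True)],
    "the": [(1, False), (3, False)],
}

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def p_word_given_context(word, context):
    p = 1.0
    for node, go_left in paths[word]:
        p_left = sigmoid(theta[node] @ v[context])
        p *= p_left if go_left else (1.0 - p_left)
    return p

# The probabilities over the whole vocabulary sum to 1 for any context word:
print(sum(p_word_given_context(w, "cat") for w in paths))  # ~1.0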

Example

Calculate p(sat|cat).

[Annotated tree: starting from the root θ^(1), going left towards sat has probability σ(θ^(1)T v^(cat)); at node θ^(2), going right to the leaf sat has probability 1 − σ(θ^(2)T v^(cat))]

p(sat|cat) = σ(θ^(1)T v^(cat)) · [1 − σ(θ^(2)T v^(cat))]

Questions

Question: How many dot products do we need to calculate to get to p(i|j)? How does this compare to the naive softmax?

- About log_2 |V| ≪ |V|
Question: Show that Σ_{i'} p(i'|j) sums to 1.


Speeding up Training: Negative Sampling

Another trick: negative sampling (a.k.a. noise contrastive estimation)
This changes the objective function; the resulting model is not a language model anymore!
Idea: Instead of predicting a probability distribution over the whole vocabulary, make binary decisions for a small number of words.
Positive training set: bigrams seen in the corpus.
Negative training set: random bigrams (not seen in the corpus).

Negative Sampling: Likelihood

Given:

- Positive training set: pos(O)
- Negative training set: neg(O)

L = ∏_{(i,j)∈pos(O)} P(pos|w^(i), v^(j)) · ∏_{(i',j')∈neg(O)} P(neg|w^(i'), v^(j'))

P(pos|w, v) = σ(w^T v),   P(neg|w, v) = 1 − P(pos|w, v)
Question: Why not just maximize ∏_{(i,j)∈pos(O)} P(pos|w^(i), v^(j))?

- Trivial solution: make all w and v identical (and large), so that σ(w^T v) ≈ 1 for every pair; see the sketch below.
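A sketch of the resulting negative log-likelihood for one small batch of positive and negative pairs; the vocabulary size, the random vectors, and the example index pairs are placeholders:

import numpy as np

np.random.seed(0)
V, D = 1000, 50
W = np.random.randn(V, D) * 0.01   # target vectors w
C = np.random.randn(V, D) * 0.01   # context vectors v

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_nll(pos_pairs, neg_pairs):
    # -log L = -sum log sigma(w^T v) over positives - sum log(1 - sigma(w^T v)) over negatives
    loss = 0.0
    for i, j in pos_pairs:
        loss -= np.log(sigmoid(W[i] @ C[j]))
    for i, j in neg_pairs:
        loss -= np.log(1.0 - sigmoid(W[i] @ C[j]))
    return loss

pos = [(3, 17), (4, 17)]                              # bigrams "seen in the corpus" (made up)
neg = [(np.random.randint(V), 17) for _ in range(4)]  # random bigrams
print(neg_sampling_nll(pos, neg))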

Word2Vec with Negative Sampling as Classification

Maximize the likelihood of the training data:
L(θ) = ∏_i P(y^(i)|x^(i); θ)
⇔ minimize the negative log-likelihood:
NLL(θ) = − log L(θ) = − Σ_i log P(y^(i)|x^(i); θ)
Question: What do these components stand for in Word2Vec with negative sampling?
- x^(i): a word pair, taken from the corpus OR randomly created
- y^(i): the label; 1 = word pair is from the positive training set, 0 = word pair is from the negative training set
- θ: the parameters v, w
- P(...): the logistic sigmoid; P(1|·) = σ(w^T v), resp. P(0|·) = 1 − σ(w^T v)

Stochastic Gradient Descent

Useful derivatives: dσ(x)/dx = σ(x)(1 − σ(x)),   d log(x)/dx = 1/x

Loss for one word pair: L(w, v, y) = −y log σ(w^T v) − (1 − y) log(1 − σ(w^T v))

∂L/∂w = −y · (1/σ(w^T v)) · σ(w^T v)(1 − σ(w^T v)) · v + (1 − y) · (1/(1 − σ(w^T v))) · σ(w^T v)(1 − σ(w^T v)) · v
      = (σ(w^T v) − y) v

Same for v: ∂L/∂v = (σ(w^T v) − y) w

One SGD update step for one word pair (i, j):

w^(i)_updated ← w^(i) + η (y − σ(w^(i)T v^(j))) v^(j)
v^(j)_updated ← v^(j) + η (y − σ(w^(i)T v^(j))) w^(i)

η > 0 is the learning rate, y ∈ {0, 1} is the label.
Question: When do the vectors of a pair become more/less similar, and why?
- Let c = η (y − σ(w^(i)T v^(j)))
- Positive (observed) word pair: y = 1 ⇒ c > 0.
  - Hence c · v^(j) is added to w^(i) and vice versa → more similar.
- Negative (random) word pair: y = 0 ⇒ c < 0.
  - Hence c · v^(j) is subtracted from w^(i) and vice versa → less similar.
Question: What does the step size c depend on?
- The difference between y and σ(w^(i)T v^(j)) — see the sketch below.
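A sketch of this update rule for a single pair; the dimensionality, the learning rate, and the random initial vectors are illustrative:

import numpy as np

np.random.seed(0)
D, eta = 50, 0.05
w_i = np.random.randn(D) * 0.01   # target vector of word i
v_j = np.random.randn(D) * 0.01   # context vector of word j

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_step(w_i, v_j, y):
    c = eta * (y - sigmoid(w_i @ v_j))   # step size: positive for y=1, negative for y=0
    return w_i + c * v_j, v_j + c * w_i  # vectors move towards / away from each other

w_pos, v_pos = sgd_step(w_i, v_j, y=1)   # observed pair: becomes more similar
w_neg, v_neg = sgd_step(w_i, v_j, y=0)   # sampled pair: becomes less similar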

Speeding up Training: Negative Sampling

Constructing a good negative training set can be difficult.
Often it is some random perturbation of the training data (e.g., replacing the second word of each bigram by a random word); see the sketch below.
The number of negative samples is often a multiple (1x to 20x) of the number of positive samples.
Negative sets are often constructed per batch.
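One possible way to build such a negative set per batch is sketched here by corrupting the second word of each positive bigram; the toy vocabulary, the example batch, and the 5x multiplier are placeholders:

import random

random.seed(0)
vocab = ["the", "cat", "sat", "on", "mat", "dog"]            # toy vocabulary
pos_batch = [("the", "cat"), ("cat", "sat"), ("sat", "on")]  # bigrams from the corpus

def sample_negatives(pos_batch, k=5):
    # Replace the second word of each positive bigram by a random vocabulary word
    neg = []
    for first, _ in pos_batch:
        for _ in range(k):
            neg.append((first, random.choice(vocab)))
    return neg

neg_batch = sample_negatives(pos_batch)   # 5x as many negatives as positives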

Questions

Question: How many dot products do we need to calculate for a given word pair? How does this compare to the naive and hierarchical softmax?

- M + 1 ≈ log_2 |V| ≪ |V|
  (for M = 20, |V| = 1M)


Skip-gram (Word2Vec)

Idea: Learn many bigram language models at the same time.

Given word w[t], predict the words inside a window around w[t]:
- One position before the target word: p(w[t−1]|w[t])
- One position after the target word: p(w[t+1]|w[t])
- Two positions before the target word: p(w[t−2]|w[t])
- ... up to a specified window size c.
All models share the w, v parameters!

Skip-gram: Objective

Optimize the joint likelihood of the 2c language models:
p(w[t−c] ... w[t−1] w[t+1] ... w[t+c] | w[t]) = ∏_{i∈{−c..c}, i≠0} p(w[t+i]|w[t])

Negative log-likelihood for the whole corpus (of size N):

NLL = − Σ_{t=1}^{N} Σ_{i∈{−c..c}, i≠0} log p(w[t+i]|w[t])

Using negative sampling as approximation:

(∗) ≈ − Σ_{t=1}^{N} Σ_{i∈{−c..c}, i≠0} log σ(w[t+i]^T v[t]) − Σ_{m=1}^{M} Σ_{t=1}^{N} Σ_{i∈{−c..c}, i≠0} log[1 − σ(w[t+i]^T v^(∗))]

v^(∗) is the vector of a random context word; M is the number of negative samples per positive sample (see the sketch below).
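Putting the pieces together, a minimal sketch of skip-gram training with negative sampling over a toy corpus, following the objective above (negatives replace the context word with a random one); the corpus, window size, learning rate, and M are made up:

import numpy as np

np.random.seed(0)
corpus = "the cat sat on the mat".split()
vocab = {w: i for i, w in enumerate(sorted(set(corpus)))}
V, D, c, M, eta = len(vocab), 50, 2, 5, 0.05
W = np.random.randn(V, D) * 0.01   # target vectors w
C = np.random.randn(V, D) * 0.01   # context vectors v

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_pair(i, j, y):
    # One negative-sampling SGD step for target word i, context word j, label y
    g = eta * (y - sigmoid(W[i] @ C[j]))
    W[i], C[j] = W[i] + g * C[j], C[j] + g * W[i]

for t, center in enumerate(corpus):
    j = vocab[center]                          # center word provides the context vector v[t]
    for offset in range(-c, c + 1):
        if offset == 0 or not (0 <= t + offset < len(corpus)):
            continue
        i = vocab[corpus[t + offset]]          # word in the window is predicted (w[t+i])
        train_pair(i, j, y=1)                  # positive pair
        for _ in range(M):                     # M negatives: same target, random context v^(*)
            train_pair(i, np.random.randint(V), y=0)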

Continuous Bag of Words (CBOW)

Like Skipgram, but...

Predict word w[t], given the words inside the window around w[t]:

p(w[t] | w[t−c] ... w[t−1] w[t+1] ... w[t+c]) ∝ w[t]^T Σ_{i∈{−c..c}, i≠0} v[t+i]

(see the sketch below)
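A sketch of the CBOW score for the center word, following the formula above; the vocabulary, the random vectors, and the example window are placeholders:

import numpy as np

np.random.seed(0)
vocab = {w: i for i, w in enumerate("the cat sat on mat".split())}
V, D = len(vocab), 50
W = np.random.randn(V, D) * 0.01   # target vectors w
C = np.random.randn(V, D) * 0.01   # context vectors v

def cbow_scores(window_words):
    # Sum the context vectors of the window, then score every candidate center word
    h = sum(C[vocab[w]] for w in window_words)
    return W @ h                     # unnormalized scores (softmax logits) per center word

scores = cbow_scores(["the", "sat", "on"])   # window around the (unknown) center word
predicted = max(vocab, key=lambda w: scores[vocab[w]])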

Benjamin Roth, Nina Poerner (CIS) Word Embeddings November 14, 2018 33 / 53 ./word2vec -train data.txt -output vec.txt -window5 -negative 20 -hs0 -cbow1


FastText

Even if we train Word2Vec on a very large corpus, we will still encounter unknown words at test time.
Orthography can often help us: w^(remuneration) should be similar to
- w^(remunerate) (same stem)
- w^(iteration), w^(consideration), ... (same suffix ≈ same POS)


known word: w^(i) = 1/(|ngrams(i)| + 1) · [u^(i) + Σ_{n∈ngrams(i)} u^(n)]
unknown word: w^(i) = 1/|ngrams(i)| · Σ_{n∈ngrams(i)} u^(n)

ngrams(remuneration) = {$re, rem, $rem, ..., ration, ation$}
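A sketch of the n-gram decomposition and vector composition; the boundary marker follows the $...$ notation above, and the n-gram vectors u are random placeholders rather than trained parameters:

import numpy as np

np.random.seed(0)
D = 50

def char_ngrams(word, n_min=3, n_max=6):
    # Character n-grams of the word with boundary markers, as in the example above
    marked = "$" + word + "$"
    return [marked[i:i + n] for n in range(n_min, n_max + 1)
            for i in range(len(marked) - n + 1)]

u = {}  # n-gram (and word) vectors, lazily filled with random placeholders
def vec(key):
    return u.setdefault(key, np.random.randn(D) * 0.01)

def word_vector(word, known=True):
    grams = char_ngrams(word)
    if known:   # known word: average of its own vector and its n-gram vectors
        return (vec(word) + sum(vec(g) for g in grams)) / (len(grams) + 1)
    else:       # unknown word: average of the n-gram vectors only
        return sum(vec(g) for g in grams) / len(grams)

w = word_vector("remuneration", known=False)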

FastText: Training

ngrams(i) typically contains the 3- to 6-grams of word i
Replace w in the Skipgram objective with its new definition
During backpropagation, the loss gradient ∂J/∂w^(i) is distributed to the word vector u^(i) and the associated n-gram vectors u^(n)

Summary

Word2Vec as a bigram language model
Hierarchical softmax
Negative sampling
Skipgram: predict the words in the window given the word in the middle
CBOW: predict the word in the middle given the words in the window
fastText: n-gram embeddings generalize to unseen words
Any questions?

Initializing Neural Networks with Pretrained Embeddings

Knowledge transfer from an unlabelled corpus
Design choice: fine-tune the embeddings on the task, or freeze them?

- Pro: can learn/strengthen features that are important for the task
- Contra: the training vocabulary is a small subset of the entire vocabulary → we might overfit and mess up the topology w.r.t. unseen words

from keras.layers import Embedding

pretrained = load_some_embeddings()   # placeholder: any numpy array of shape (vocab_size, dim)

frozen = Embedding(input_dim=pretrained.shape[0],
                   output_dim=pretrained.shape[1],
                   weights=[pretrained],
                   trainable=False)

finetunable = Embedding(input_dim=pretrained.shape[0],
                        output_dim=pretrained.shape[1],
                        weights=[pretrained],
                        trainable=True)

(keras)


[Table from Kim 2014, Convolutional Neural Networks for Sentence Classification: comparison of randomly initialized, pretrained+frozen, pretrained+fine-tuned, and combined embedding variants]

Resources

https://fasttext.cc/docs/en/crawl-vectors.html

- Embeddings for 157 languages, trained on large web crawls, up to 2M words per language

https://nlp.stanford.edu/projects/glove/

- GloVe word vectors: co-occurrence-count objective, not n-gram based

Analogy Mining

country-capital

w(Tokio) − w(Japan) + w(Poland) ≈ w(Warsaw)

opposite

w(unacceptable) − w(acceptable) + w(logical) ≈ w(illogical)

Nationality-adjective

w(Australian) − w(Australia) + w(Switzerland) ≈ w(Swiss)


w(a) − w(b) + w(c) = w(?)

w(d) = argmax_{w(d′)∈W} cos(w(?), w(d′))
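A sketch of this argmax, assuming some embedding matrix W and word-to-index mapping (both random placeholders here); in practice a, b, and c themselves are usually excluded from the candidates:

import numpy as np

np.random.seed(0)
words = ["Tokio", "Japan", "Poland", "Warsaw", "cat", "dog"]
W = np.random.randn(len(words), 50)          # pretrained embeddings would go here
idx = {w: i for i, w in enumerate(words)}

def analogy(a, b, c):
    # w(a) - w(b) + w(c), then nearest neighbor by cosine similarity
    query = W[idx[a]] - W[idx[b]] + W[idx[c]]
    sims = (W @ query) / (np.linalg.norm(W, axis=1) * np.linalg.norm(query))
    for i in np.argsort(-sims):              # best match that is not one of the inputs
        if words[i] not in (a, b, c):
            return words[i]

print(analogy("Tokio", "Japan", "Poland"))   # ideally "Warsaw" with real embeddings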

Cross-lingual Embedding Spaces: A Very Short Overview

Embedding space: the space defined by the embeddings of all words in a language
Hypothesis: embedding spaces of different languages have similar structures

Mikolov et al. 2013: Exploiting Similarities among Languages for Machine Translation


Given:

- Monolingual embedding spaces of two languages: W_L1, W_L2
- Dictionary D of a few known translations
Learn a function f, s.t.

∀(i,j)∈D: f(w_L1^(i)) ≈ w_L2^(j)

- e.g., a linear transformation: f(w_L1) = V w_L1
Given a word k in L1 with unknown translation:
- translate it as the L2 word l whose embedding w_L2^(l) minimizes the cosine distance to f(w_L1^(k))
Used as initialization for unsupervised Machine Translation (see the sketch below)
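A sketch of learning such a linear map by least squares from a small seed dictionary and translating by nearest cosine neighbor; the embedding matrices and the dictionary are random placeholders (Mikolov et al. 2013 fit the map with gradient descent, least squares is used here as a simple stand-in):

import numpy as np

np.random.seed(0)
D = 50
W_L1 = np.random.randn(1000, D)   # monolingual embedding space of language 1 (placeholder)
W_L2 = np.random.randn(1000, D)   # monolingual embedding space of language 2 (placeholder)
seed_dict = [(i, i) for i in range(200)]   # a few known translation pairs (i in L1 -> j in L2)

# Learn the linear map over the seed dictionary: rows of W_L1 are mapped onto rows of W_L2
X = W_L1[[i for i, _ in seed_dict]]
Y = W_L2[[j for _, j in seed_dict]]
V_map, *_ = np.linalg.lstsq(X, Y, rcond=None)   # X @ V_map ~ Y

def translate(k):
    mapped = W_L1[k] @ V_map   # f(w_L1^(k)), using row vectors
    sims = W_L2 @ mapped / (np.linalg.norm(W_L2, axis=1) * np.linalg.norm(mapped))
    return int(np.argmax(sims))   # L2 word with minimal cosine distance to the mapped vector

print(translate(5))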

Summary

Applications of word embeddings:
- Word vector initialization in neural networks
- Analogy mining
- Word translation mining
Any questions?
