
Word2vec embeddings: CBOW and Skipgram
VL Embeddings, Uni Heidelberg, SS 2019

Skipgram – Intuition

• Window size: 2
• Center word at position t: Maus
• Predict the probability of each context word given the center word:
  P(w_{t-2} | w_t), P(w_{t-1} | w_t), P(w_{t+1} | w_t), P(w_{t+2} | w_t)

  Die kleine graue Maus frißt den leckeren Käse
  ("The little grey mouse eats the tasty cheese")
  with w_{t-2} = kleine, w_{t-1} = graue, w_t = Maus, w_{t+1} = frißt, w_{t+2} = den

• The same picture holds as the window moves on, e.g. with center word frißt at position t
• The same probability distribution is used for all context words

Skipgram – Objective function

For each position t = 1, ..., T, predict the context words within a window of fixed size m, given the center word w_t:

  Likelihood = L(\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m,\, j \ne 0} P(w_{t+j} \mid w_t; \theta)    (1)

What is \theta? \theta consists of the vector representations of each word.

The objective function J(\theta) (also called cost function or loss function) is the average negative log-likelihood:

  J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\, j \ne 0} \log P(w_{t+j} \mid w_t; \theta)    (2)

We maximise the probability of any context word given the current center word w_t:
minimising the objective function ⇔ maximising predictive accuracy. A minimal numeric sketch of J(\theta) follows below.
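To make equations (1) and (2) concrete, here is a minimal NumPy sketch that evaluates J(\theta) on the example sentence. It assumes the standard skipgram softmax parameterisation with separate center vectors v and context vectors u (not yet introduced in this section); the embedding dimension, the random initialisation and the helper names (p_context_given_center, average_neg_log_likelihood) are illustrative choices, not part of the lecture.

```python
import numpy as np

# Example sentence from the slides as a toy corpus.
corpus = "Die kleine graue Maus frißt den leckeren Käse".split()
vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}

rng = np.random.default_rng(0)
V, d = len(vocab), 10
v = rng.normal(scale=0.1, size=(V, d))  # center-word vectors (part of theta)
u = rng.normal(scale=0.1, size=(V, d))  # context-word vectors (part of theta)

def p_context_given_center(o, c):
    """P(w_o | w_c; theta) via a softmax over the vocabulary (assumed parameterisation)."""
    scores = u @ v[c]                      # u_w . v_c for every word w in the vocabulary
    scores -= scores.max()                 # for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

def average_neg_log_likelihood(tokens, m=2):
    """J(theta) from equation (2): sum over positions t and offsets j != 0, divided by T."""
    T = len(tokens)
    total = 0.0
    for t in range(T):
        for j in range(-m, m + 1):
            if j == 0 or not (0 <= t + j < T):
                continue
            c, o = word2id[tokens[t]], word2id[tokens[t + j]]
            total -= np.log(p_context_given_center(o, c))
    return total / T

print(average_neg_log_likelihood(corpus, m=2))  # J(theta) at the random initialisation
```

Training would then adjust u and v (by gradient descent, as in the following sections) so that this value decreases.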
Objective function – Motivation

• We want to model the probability distribution over mutually exclusive classes
• We measure the difference between the predicted probabilities ŷ and the ground-truth probabilities y
• During training, we tune the parameters so that this difference is minimised

Negative log-likelihood

Why is minimising the negative log-likelihood equivalent to maximum likelihood estimation (MLE)?

  L(\theta) = \prod_{t=1}^{T} \prod_{-m \le j \le m,\, j \ne 0} P(w_{t+j} \mid w_t; \theta)

  MLE = \arg\max_{\theta} L(\theta; x)

• The log allows us to convert a product of factors into a summation of factors (nicer mathematical properties)
• \arg\max_{x} f(x) is equivalent to \arg\min_{x} (-f(x))

  J(\theta) = -\frac{1}{T} \log L(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{-m \le j \le m,\, j \ne 0} \log P(w_{t+j} \mid w_t; \theta)

• We can interpret the negative log-probability as information content or surprisal.
  What is the log-likelihood of a model, given an event? ⇒ the negative of the surprisal of the event, given the model:
  a model is supported by an event to the extent that the event is unsurprising, given the model.

Cross entropy loss

Negative log-likelihood is the same as cross entropy (see the short sketch after the entropy recap below).

Recap: Entropy
• If a discrete random variable X has the probability distribution p(x), then the entropy of X is

  H(X) = \sum_{x} p(x) \log \frac{1}{p(x)} = -\sum_{x} p(x) \log p(x)

  ⇒ the expected number of bits needed to encode X if we use an optimal coding scheme

• Cross entropy ⇒ the number of bits needed to encode X if we use a suboptimal coding scheme q(x) instead of p(x):

  H(p, q) = \sum_{x} p(x) \log \frac{1}{q(x)} = -\sum_{x} p(x) \log q(x)
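The following self-contained sketch (toy distributions p and q chosen purely for illustration) computes the two quantities just defined, entropy H(X) and cross entropy H(p, q), using log base 2 so the values read as bits. The last two lines illustrate why the negative log-likelihood loss is also called cross-entropy loss: the negative log-probability of a single observed outcome equals the cross entropy against the one-hot ground-truth distribution.

```python
import numpy as np

def entropy(p):
    """H(X) = -sum_x p(x) log2 p(x): expected bits with an optimal code for p."""
    p = np.asarray(p, dtype=float)
    nz = p > 0                       # convention: 0 * log 0 = 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) log2 q(x): bits needed when coding with q instead of p."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(q[nz]))

p = np.array([0.5, 0.25, 0.25])      # "true" distribution (made up)
q = np.array([0.4, 0.4, 0.2])        # model / suboptimal coding distribution (made up)

print(entropy(p))                    # 1.5 bits with the optimal code
print(cross_entropy(p, q))           # more bits when coding with q

# Negative log-probability of one observed outcome = cross entropy against a one-hot target.
y = np.array([0.0, 1.0, 0.0])        # ground truth: class 1 occurred
print(cross_entropy(y, q), -np.log2(q[1]))  # identical values
```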
Cross entropy loss and Kullback-Leibler divergence

• Cross entropy is always larger than entropy (exception: if p = q)
• Kullback-Leibler (KL) divergence: the difference between cross entropy and entropy

  KL(p \| q) = \sum_{x} p(x) \log \frac{1}{q(x)} - \sum_{x} p(x) \log \frac{1}{p(x)} = \sum_{x} p(x) \log \frac{p(x)}{q(x)}

  ⇒ the number of extra bits needed when using q(x) instead of p(x)
  (also known as the relative entropy of p with respect to q)

• Cross entropy:

  H(p, q) = -\sum_{x \in X} p(x) \log q(x) = H(p) + KL(p \| q)

• Minimising H(p, q) → minimising the KL divergence from q to p (a short numeric check follows below)
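As a quick numeric check of the identity H(p, q) = H(p) + KL(p‖q), the sketch below (self-contained, using the same made-up distributions as the previous sketch) verifies that the cross entropy decomposes into entropy plus KL divergence, and that the KL term, and hence the gap between cross entropy and entropy, vanishes exactly when p = q.

```python
import numpy as np

p = np.array([0.5, 0.25, 0.25])   # "true" distribution (made up)
q = np.array([0.4, 0.4, 0.2])     # model distribution (made up)

def entropy(p):
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def cross_entropy(p, q):
    nz = p > 0
    return -np.sum(p[nz] * np.log2(q[nz]))

def kl_divergence(p, q):
    """KL(p||q) = sum_x p(x) log2(p(x)/q(x)): extra bits when coding with q instead of p."""
    nz = p > 0
    return np.sum(p[nz] * np.log2(p[nz] / q[nz]))

# H(p, q) = H(p) + KL(p||q)  =>  cross entropy >= entropy, with equality only if p = q.
print(cross_entropy(p, q), entropy(p) + kl_divergence(p, q))  # same value
print(kl_divergence(p, p), cross_entropy(p, p) - entropy(p))  # both 0.0
```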