

Word2vec embeddings: CBOW and Skipgram

VL Embeddings

Uni Heidelberg

SS 2019

Skipgram – Intuition

• Window size: 2
• Center word at position t: Maus

P(wt−2 | wt)   P(wt−1 | wt)   P(wt+1 | wt)   P(wt+2 | wt)

Die kleine graue Maus frißt den leckeren Käse
("The little grey mouse eats the tasty cheese")
wt−2 = kleine, wt−1 = graue, wt = Maus, wt+1 = frißt, wt+2 = den

Same probability distribution used for all context words

• Window size: 2
• Center word at position t: frißt

P(wt−2 | wt)   P(wt−1 | wt)   P(wt+1 | wt)   P(wt+2 | wt)

Die kleine graue Maus frißt den leckeren Käse
wt−2 = graue, wt−1 = Maus, wt = frißt, wt+1 = den, wt+2 = leckeren

Same probability distribution used for all context words


Skipgram – Objective function

For each position t = 1, ..., T, predict the context words within a window of fixed size m, given the center word wt.

Likelihood:

  L(θ) = ∏_{t=1..T} ∏_{−m ≤ j ≤ m, j ≠ 0} P(wt+j | wt; θ)   (1)

What is θ?
θ: the vector representations of each word

Objective function (cost function, loss function): maximise the probability of any context word given the current center word wt.

The objective function J(θ) is the (average) negative log-likelihood:

  J(θ) = −(1/T) log L(θ) = −(1/T) ∑_{t=1..T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(wt+j | wt; θ)   (2)

Minimising the objective function ⇔ maximising predictive accuracy
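To make the window structure of Eqs. (1) and (2) concrete, here is a minimal Python sketch that loops over every center position t and every offset j and averages the negative log-probabilities. The prob() function is a hypothetical stand-in (uniform over the vocabulary); the actual Skipgram model for P(wt+j | wt; θ) is only defined further below.

import math

corpus = "Die kleine graue Maus frißt den leckeren Käse".split()
vocab = sorted(set(corpus))
m = 2  # window size

def prob(context_word, center_word):
    # stand-in for P(context | center; θ): here simply uniform over the vocabulary
    return 1.0 / len(vocab)

def negative_log_likelihood(corpus, m):
    T = len(corpus)
    total = 0.0
    for t in range(T):                      # each center position t = 1..T
        for j in range(-m, m + 1):          # context offsets -m <= j <= m, j != 0
            if j == 0 or not (0 <= t + j < T):
                continue
            total += -math.log(prob(corpus[t + j], corpus[t]))
    return total / T                        # J(θ) = -(1/T) * sum of log-probabilities

print(negative_log_likelihood(corpus, m))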

Objective function – Motivation

• We want to model the probability distribution over mutually exclusive classes
• Measure the difference between the predicted probabilities ŷ and the ground-truth probabilities y
• During training: tune the parameters so that this difference is minimised
• The log allows us to convert a product of factors into a summation of factors (nicer mathematical properties)
• arg max_x (x) is equivalent to arg min_x (−x)

  J(θ) = −(1/T) log L(θ) = −(1/T) ∑_{t=1..T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(wt+j | wt; θ)

Negative log-likelihood

Why is minimising the negative log-likelihood equivalent to maximum likelihood estimation (MLE)?

  L(θ) = ∏_{t=1..T} ∏_{−m ≤ j ≤ m, j ≠ 0} P(wt+j | wt; θ)

  MLE = argmax L(θ; x)

• The log allows us to convert a product of factors into a summation of factors (nicer mathematical properties)
• arg max_x (x) is equivalent to arg min_x (−x)

  J(θ) = −(1/T) log L(θ) = −(1/T) ∑_{t=1..T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(wt+j | wt; θ)

• We can interpret negative log-probability as information content or surprisal

What is the log-likelihood of a model, given an event?
⇒ The negative of the surprisal of the event, given the model: a model is supported by an event to the extent that the event is unsurprising, given the model.

Cross entropy loss

Negative log-likelihood is the same as cross entropy.

Recap: Entropy

• If a discrete random variable X has the probability p(x), then the entropy of X is

  H(X) = ∑_x p(x) log(1/p(x)) = −∑_x p(x) log p(x)

⇒ the expected number of bits needed to encode X if we use an optimal coding scheme

• Cross entropy ⇒ the number of bits needed to encode X if we use a suboptimal coding scheme q(x) instead of p(x):

  H(p, q) = ∑_x p(x) log(1/q(x)) = −∑_x p(x) log q(x)

Cross entropy loss and Kullback-Leibler divergence

• Cross entropy is always larger than entropy (exception: if p = q)
• Kullback-Leibler (KL) divergence: the difference between cross entropy and entropy

  KL(p‖q) = ∑_x p(x) log(1/q(x)) − ∑_x p(x) log(1/p(x)) = ∑_x p(x) log(p(x)/q(x))

⇒ the number of extra bits needed when using q(x) instead of p(x) (also known as the relative entropy of p with respect to q)

• Cross entropy:

  H(p, q) = −∑_{x∈X} p(x) log q(x) = H(p) + KL(p‖q)

• Minimising H(p, q) → minimising the KL divergence from q to p

  J(θ) = −(1/T) log L(θ) = −(1/T) ∑_{t=1..T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(wt+j | wt; θ)

Negative log-likelihood = cross entropy
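A small numerical check of the identities above, with made-up distributions p and q: H(p, q) = H(p) + KL(p‖q), and the cross entropy is never smaller than the entropy (log base 2 gives results in bits).

import numpy as np

p = np.array([0.5, 0.25, 0.25])   # "true" distribution
q = np.array([0.4, 0.4, 0.2])     # model / coding distribution

entropy       = -np.sum(p * np.log2(p))        # H(p)
cross_entropy = -np.sum(p * np.log2(q))        # H(p, q)
kl_divergence =  np.sum(p * np.log2(p / q))    # KL(p || q)

print(f"H(p)     = {entropy:.4f} bits")
print(f"H(p, q)  = {cross_entropy:.4f} bits")
print(f"KL(p||q) = {kl_divergence:.4f} bits")
print("H(p,q) == H(p) + KL(p||q):", np.isclose(cross_entropy, entropy + kl_divergence))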

Cross-entropy loss (or logistic loss)

• Use cross entropy to measure the difference between two distributions p and q
• Use the total cross entropy over all training examples as the loss:

  L_cross-entropy(p, q) = −∑_i p_i log(q_i)
                        = −log(q_t) for hard classification, where q_t is the correct class

Skipgram – Objective function

We want to minimise the objective function (the cross-entropy loss):

  J(θ) = −(1/T) ∑_{t=1..T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(wt+j | wt; θ)   (2)

• Question: How to calculate P(wt+j | wt; θ)?
• Answer: We will use two vectors per word w:
  • vw when w is a center word
  • uw when w is a context word
• Then, for a center word c and a context word o:

  P(o | c) = exp(uo^T vc) / ∑_{w∈V} exp(uw^T vc)   (3)

Take the dot products between the two word vectors and put them into a softmax.

Recap: Dot products

• A measure of similarity (well, kind of...)
• Bigger if u and v are more similar (i.e. the vectors point in the same direction)

  u^T v = u · v = ∑_{i=1..n} u_i v_i   (4)

• Iterating over w = 1, ..., V: uw^T v ⇒ work out how similar each word is to v

  P(o | c) = exp(uo^T vc) / ∑_{w=1..V} exp(uw^T vc)   (5)

Softmax

• Standard mapping from R^V to a probability distribution:

  p_i = exp(x_i) / ∑_{j=1..N} exp(x_j)

  (exponentiate to make the values positive, then normalise to get a probability)

• The softmax function maps arbitrary values x_i to a probability distribution p_i
  • "max" because it amplifies the probability of the largest x_i
  • "soft" because it still assigns some probability to smaller x_i

This gives us a probability estimate p(wt−1 | wt).
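A minimal numpy sketch of Eqs. (3)/(5): score every vocabulary word by its dot product with the center vector vc, then turn the scores into probabilities with a softmax. The vectors are random toy values, not trained embeddings.

import numpy as np

rng = np.random.default_rng(0)
V, d = 8, 4                       # vocabulary size and embedding dimension
U = rng.normal(0, 0.1, (V, d))    # context ("outside") vectors u_w, one row per word
v_c = rng.normal(0, 0.1, d)       # center vector v_c

def softmax(x):
    x = x - np.max(x)             # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

scores = U @ v_c                  # u_w^T v_c for every word w in the vocabulary
p = softmax(scores)               # P(o | c) for every candidate context word o

print(p, p.sum())                 # a probability distribution: sums to 1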

Difference – Softmax

Sigmoid function:
• binary classification in logistic regression
• sum of probabilities not necessarily 1

Softmax function:
• multi-class classification in logistic regression
• sum of probabilities will be 1

Why two representations for each word?

• We create two representations for each word in the corpus:
  1. w as a context word
  2. w as a center word
• Easier to compute → we can optimise the vectors separately
• Also works better in practice...
• For training the model, compute for all words in the corpus:

  J(θ) = −(1/T) ∑_{t=1..T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(wt+j | wt; θ)

Skipgram – Predict the label

• The dot product compares the similarity of o and c: a larger dot product means a larger probability

  p(o | c) = exp(uo^T vc) / ∑_{w∈V} exp(uw^T vc)   (6)

• After taking the exponent, normalise over the entire vocabulary
• For training the model, compute for all words in the corpus:

  J(θ) = −(1/T) ∑_{t=1..T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(wt+j | wt; θ)

Skipgram – Training the model

• Recall: θ represents all model parameters, in one long vector
• For d-dimensional vectors and V-many words:

  θ = (vaas, vamaranth, ..., vzoo, uaas, uameise, ..., uzoo) ∈ R^{2dV}   (7)

• Remember: every word has two vectors ⇒ 2d parameters per word
• We now optimise the parameters θ → use the gradient

Generative model: predict the context for a given center word
• We have an objective function:

  J(θ) = −(1/T) ∑_{t=1..T} ∑_{−m ≤ j ≤ m, j ≠ 0} log P(wt+j | wt)

• We want to minimise the negative log-likelihood (maximise the probability we predict)
• Probability distribution:

  p(o | c) = exp(uo^T vc) / ∑_{w∈V} exp(uw^T vc)

• How do we know how to change the parameters (i.e. the word vectors)? → Use the gradient
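A sketch of the parameter layout of Eq. (7): every word gets a center vector vw and a context vector uw, so θ holds 2dV numbers; the loss for a single (center, context) pair is the negative log of the softmax probability p(o | c). The vocabulary and values are toy examples.

import numpy as np

rng = np.random.default_rng(1)
vocab = ["aas", "amaranth", "ameise", "maus", "zoo"]
V, d = len(vocab), 3

V_center  = rng.normal(0, 0.1, (V, d))   # v_w for every word (center role)
U_context = rng.normal(0, 0.1, (V, d))   # u_w for every word (context role)

theta = np.concatenate([V_center.ravel(), U_context.ravel()])
print(theta.shape)                       # (2*d*V,) = (30,): one long parameter vector

def loss_one_pair(center_id, context_id):
    scores = U_context @ V_center[center_id]            # u_w^T v_c for all words w
    log_probs = scores - np.log(np.sum(np.exp(scores))) # log softmax over the vocabulary
    return -log_probs[context_id]                       # -log p(o | c)

print(loss_one_pair(vocab.index("maus"), vocab.index("zoo")))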

Minimising the objective function

We want to optimise (maximise or minimise) our objective function.
• How do we know how to change the parameters? → Use the gradient
• The gradient ∇J(θ) of a function gives the direction of steepest ascent
• Gradient Descent is an algorithm to minimise J(θ)

Gradient Descent – Intuition

• Idea:
  • for the current value of θ, calculate the gradient of J(θ)
  • then take a small step in the direction of the negative gradient
  • repeat
• Find a local minimum for a given cost function
  • at each step, GD tells us in which direction to move to lower the cost
  • No guarantee that we find the best global solution!
• How do we know the direction?
  • Best guess: move in the direction of the slope (gradient) of the cost function
  • (Figure: arrows show the gradient of the cost function at different points)
• Gradient of a function
  • a vector that points in the direction of the steepest ascent
  • the gradient is deeply connected to its derivative
• Derivative f′ of a function
  • a single number that indicates how fast the function is rising when moving in the direction of its gradient
  • f′(p): the value of f′ at point p
  • f′(p) > 0 ⇒ f is going up
  • f′(p) < 0 ⇒ f is going down
  • f′(p) = 0 ⇒ f is flat
• Gradient Descent: reduce f(x) by moving x in small steps with the opposite sign of the derivative

Gradient-Based Optimisation

Given some function y = f(x) with x, y ∈ R
• We want to optimise (maximise or minimise) it by updating x:

  min_{x∈R} f(x)

• The derivative f′(x) of this function is dy/dx
  • it gives the slope of f(x) at point x
  ⇒ it tells us how to change x to make a small improvement in y:

  x_i = x_{i−1} − α f′(x_{i−1})        α = step size or learning rate

• Gradient Descent: reduce f(x) by moving x in small steps with the opposite sign of the derivative
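A tiny sketch of the one-dimensional update x_i = x_{i−1} − α f′(x_{i−1}), applied to the toy function f(x) = x² (not from the slides), whose derivative is f′(x) = 2x; x should approach the minimum at 0.

def f_prime(x):
    return 2.0 * x        # derivative of f(x) = x^2

x = 5.0                   # arbitrary starting point
alpha = 0.1               # step size / learning rate

for i in range(50):
    x = x - alpha * f_prime(x)   # move against the sign of the derivative

print(x)                  # close to 0, the minimum of f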

What if we have functions with multiple inputs?

Gradient Descent with multiple inputs

• We can use partial derivatives ∂f(x)/∂x_i
  • measures how f changes as only x_i increases at point x
• Gradient of f: ∇_x f(x)
  • gives the direction of steepest ascent
  • a vector containing all partial derivatives of f(x)
  • element i of the gradient ∇ is the partial derivative of f with respect to x_i

Which direction should we step to decrease the function?

• Gradient descent algorithm:
  • compute ∇_x f(x)
  • take a small step in the −∇_x f(x) direction
  • repeat
• Minimise f by applying small updates to x:  x′ = x − α ∇_x f(x)

(Figures: critical points in 2D (one input value); critical points in 3D)

• Update equation (in matrix notation):

  θ_new = θ_old − α ∇_θ J(θ)        α = step size or learning rate

• Update equation (for a single parameter):

  θj_new = θj_old − α ∂J(θ)/∂θj_old
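The same idea with multiple inputs, as a small sketch: the gradient is the vector of partial derivatives, and each step moves against it. The toy cost J(θ) = θ1² + 10·θ2² is made up for illustration.

import numpy as np

def grad_J(theta):
    # vector of partial derivatives [∂J/∂θ1, ∂J/∂θ2]
    return np.array([2.0 * theta[0], 20.0 * theta[1]])

theta = np.array([3.0, 2.0])
alpha = 0.05

for step in range(200):
    theta = theta - alpha * grad_J(theta)   # step in the -∇J(θ) direction

print(theta)    # approaches the minimum at [0, 0]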

• Problem: J(θ) is a function of all windows in the corpus (extremely large!)

• So ∇θJ(θ) is very expensive to compute ⇒ Takes too long for a single update!

• Solution: Stochastic Gradient Descent
• Repeatedly sample windows and update the parameters after each one

Stochastic Gradient Descent (SGD)

Goal: find parameters θ that reduce the cost function J(θ)

Algorithm 1 Pseudocode for SGD
 1: Input:
 2:   – function f(x; θ)
 3:   – training set of inputs x1, ..., xn and gold outputs y1, ..., yn
 4:   – loss function J
 5: while stopping criteria not met do
 6:   Sample a training example xi, yi
 7:   Compute the loss J(f(xi; θ), yi)
 8:   ∇ ← gradient of J(f(xi; θ), yi) w.r.t. θ
 9:   Update θ ← θ − α∇
10: end while
11: return θ
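A runnable sketch of Algorithm 1 on a toy least-squares problem, with f(x; θ) = θ·x and loss J = (f(x; θ) − y)²; the data and model are made up for illustration (for Skipgram, f and J would be the softmax probability and the negative log-likelihood instead).

import numpy as np

rng = np.random.default_rng(0)
true_theta = np.array([2.0, -1.0])
X = rng.normal(size=(200, 2))                   # training inputs x_1..x_n
y = X @ true_theta + rng.normal(0, 0.01, 200)   # gold outputs y_1..y_n

theta = np.zeros(2)
alpha = 0.05

for step in range(2000):                  # "while stopping criteria not met"
    i = rng.integers(len(X))              # sample a training example (x_i, y_i)
    error = X[i] @ theta - y[i]           # f(x_i; θ) - y_i
    grad = 2 * error * X[i]               # gradient of J(f(x_i; θ), y_i) w.r.t. θ
    theta = theta - alpha * grad          # θ ← θ - α∇

print(theta)                              # close to [2, -1]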

• Impact of the learning rate α:
  • too low → learning proceeds slowly
  • initial α too low → learning may become stuck with a high cost
• Important property of SGD (and related minibatch or online gradient-based optimisation):
  • the computation time per update does not grow with an increasing number of training examples

  θ = (w0, w1, w2, ..., w19998, w19999, w20000)

  −∇J(θ) = ( 0.31,   w0 should increase somewhat
              0.03,   w1 should increase a little
             −1.25,   w2 should decrease a lot
              ...
              0.78,   w19998 should increase a lot
             −0.37,   w19999 should decrease somewhat
              0.16 )  w20000 should increase a little

• Average over all training data
• Encodes the relative importance of each weight

⇒ Backpropagation: a procedure to compute the gradient of the cost function

• Make a forward pass through the network to compute the output
• Take the output that the network predicts
• Take the output that it should predict
• Compute the total cost of the network J(θ)
• Propagate the error back through the network

⇒ Backpropagation

• a procedure to compute the gradient of the cost function:
  compute the partial derivatives ∂J(θ)/∂w and ∂J(θ)/∂b of the cost function J(θ) with respect to any weight w or bias b in the network
• How do we have to change the weights and biases in order to change the cost?

Parameter initialisation

• Before we start training the network, we have to initialise the parameters
• Why not use zero as initial values?
  • Not a good idea: the outputs will be the same for all nodes
• Instead, use small random numbers, e.g.:
  • normally distributed values around zero, N(0, 0.1)
  • Xavier initialisation (Glorot and Bengio 2010)
  • for debugging: use fixed random seeds
• Now let's start the training:
  • predict labels
  • compute the loss
  • update the parameters
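A small numpy sketch of these options. The exact formula used for Xavier initialisation below is one common variant of Glorot and Bengio (2010); the slides only name the method, so treat the constant as an assumption.

import numpy as np

rng = np.random.default_rng(42)   # fixed random seed, useful for debugging
n_in, n_out = 300, 100            # layer sizes (toy values)

W_zeros  = np.zeros((n_out, n_in))                    # bad: all nodes behave identically
W_normal = rng.normal(0.0, 0.1, size=(n_out, n_in))   # small values around zero, N(0, 0.1)
W_xavier = rng.uniform(-1, 1, size=(n_out, n_in)) * np.sqrt(6.0 / (n_in + n_out))  # Xavier/Glorot

print(W_zeros.std(), W_normal.std(), W_xavier.std())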

Forward pass

• Computes the output of the network
• Each node's output depends only on itself and on its incoming edges
• Traverse the nodes and compute the output of each node, given the already computed outputs of its predecessors

  a_j^l = σ( ∑_k w_jk^l a_k^{l−1} + b_j^l )

in vector terminology:

  z^l = w^l a^{l−1} + b^l
  a^l = σ(z^l)

(Image taken from http://neuralnetworksanddeeplearning.com/chap2.html)
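A numpy sketch of the vectorised forward pass above (z^l = w^l a^{l−1} + b^l, a^l = σ(z^l)), applied layer by layer; layer sizes and weights are toy values.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
sizes = [4, 5, 3]                                  # input layer, hidden layer, output layer
weights = [rng.normal(0, 0.1, (m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases  = [rng.normal(0, 0.1, (m, 1)) for m in sizes[1:]]

def forward(a):
    # traverse the layers; each activation depends only on the previous layer's output
    for w, b in zip(weights, biases):
        z = w @ a + b          # z^l = w^l a^(l-1) + b^l
        a = sigmoid(z)         # a^l = σ(z^l)
    return a

x = rng.normal(size=(4, 1))    # a single input column vector
print(forward(x))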

Parameter update for a 1-layer network

• After a single forward pass, predict the output ŷ
• Compute the cost J (a single scalar value), given the predicted ŷ and the ground truth y
• Take the derivative of the cost J w.r.t. W and b
• Update W and b by a fraction (the learning rate) of dW and db

Forward pass:
  Z = W^T X + b
  ŷ = A = σ(Z)

Use the chain rule:
  dW = dJ/dW = (dJ/dA)(dA/dZ)(dZ/dW)
  db = dJ/db = (dJ/dA)(dA/dZ)(dZ/db)

Update W and b:
  W = W − α dJ/dW
  b = b − α dJ/db

(Image taken from http://www.adeveloperdiary.com/data-science/machine-learning/understand-and-implement-the-backpropagation-algorithm-from-scratch-in-python/)

Parameter update for a 2-layer network

Forward pass:
  Z^[1] = W^[1]T X + b^[1]
  A^[1] = σ(Z^[1])
  Z^[2] = W^[2]T A^[1] + b^[2]
  ŷ = A^[2] = σ(Z^[2])

Use the chain rule:
  dW^[2] = dJ/dW^[2] = (dJ/dA^[2])(dA^[2]/dZ^[2])(dZ^[2]/dW^[2])
  db^[2] = dJ/db^[2] = (dJ/dA^[2])(dA^[2]/dZ^[2])(dZ^[2]/db^[2])
  dW^[1] = dJ/dW^[1] = (dJ/dA^[2])(dA^[2]/dZ^[2])(dZ^[2]/dA^[1])(dA^[1]/dZ^[1])(dZ^[1]/dW^[1])
  db^[1] = dJ/db^[1] = (dJ/dA^[2])(dA^[2]/dZ^[2])(dZ^[2]/dA^[1])(dA^[1]/dZ^[1])(dZ^[1]/db^[1])

Update W and b:
  W^[1] = W^[1] − α dJ/dW^[1]
  b^[1] = b^[1] − α dJ/db^[1]
  W^[2] = W^[2] − α dJ/dW^[2]
  b^[2] = b^[2] − α dJ/db^[2]
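A numpy sketch of the 2-layer forward pass and the chain-rule gradients listed above, for a single input x. The cost is taken to be the squared error (A^[2] − y)² purely to obtain a concrete dJ/dA^[2]; the slides leave the cost generic, so that choice is an assumption.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_h, n_out = 3, 4, 1
W1, b1 = rng.normal(0, 0.1, (n_in, n_h)), np.zeros((n_h, 1))
W2, b2 = rng.normal(0, 0.1, (n_h, n_out)), np.zeros((n_out, 1))
alpha = 0.5

x = rng.normal(size=(n_in, 1))
y = np.array([[1.0]])

# Forward pass
Z1 = W1.T @ x + b1;  A1 = sigmoid(Z1)        # Z^[1] = W^[1]T X + b^[1]
Z2 = W2.T @ A1 + b2; A2 = sigmoid(Z2)        # ŷ = A^[2] = σ(Z^[2])

# Backward pass (chain rule, from the output layer back to the input layer)
dA2 = 2 * (A2 - y)                           # dJ/dA^[2] for the squared-error cost
dZ2 = dA2 * A2 * (1 - A2)                    # dA^[2]/dZ^[2] = σ'(Z^[2])
dW2 = A1 @ dZ2.T                             # dJ/dW^[2]
db2 = dZ2                                    # dJ/db^[2]
dA1 = W2 @ dZ2                               # dZ^[2]/dA^[1] = W^[2]
dZ1 = dA1 * A1 * (1 - A1)                    # dA^[1]/dZ^[1] = σ'(Z^[1])
dW1 = x @ dZ1.T                              # dJ/dW^[1]
db1 = dZ1                                    # dJ/db^[1]

# Parameter updates
W1 -= alpha * dW1; b1 -= alpha * db1
W2 -= alpha * dW2; b2 -= alpha * db2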

Training with SGD and backpropagation

• Randomly initialise the parameters w and b
• For iteration 1 .. N, do:
  • predict ŷ based on w, b and x
  • compute the loss (or cost) J
  • find dJ/dW and dJ/db
  • update w and b using dW and db

With an increasing number of layers in the network, the computational complexity increases exponentially
⇒ use backpropagation

• Backpropagation:
  • an efficient method for computing gradients in a directed computation graph (e.g. a NN)
  • an implementation of the chain rule of derivatives
  • allows us to compute all required partial derivatives in linear time in terms of the graph size
• Stochastic Gradient Descent:
  • an optimisation method, based on the analysis of the gradient of the objective function
• Backpropagation is often used in combination with SGD

Gradient computation: backprop
Optimisation: SGD, Adam, Rprop, BFGS, ...

(Lecture slide from C. Manning, Stanford University, CS224n, Lecture 2)

Gradient Descent – Sum up

• To minimise J(θ) over the entire corpus: compute gradients for all windows
• Updates for each element of θ:

  θj_new = θj_old − α ∇_θ J(θ)

• α = step size (or learning rate)
• Gradient descent is the most basic tool to minimise functions
• But: very inefficient for large corpora!
  Instead: update the parameters after each window t → Stochastic Gradient Descent (SGD)

  θj_new = θj_old − α ∇_θ Jt(θ)

Skipgram in a nutshell

• Train a simple neural network with a single hidden layer
• Throw away the network, only keep the learned weights of the hidden layer ⇒ word embeddings

• Limitations of the model:
  • the normalisation factor is computationally expensive:

  p(o | c) = exp(uo^T vc) / ∑_{w=1..V} exp(uw^T vc)

• Solution: Skipgram with negative sampling (randomly sample "negative" instances from the corpus)
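A sketch of the negative-sampling idea named above: instead of normalising over the whole vocabulary, the observed (center, context) pair is scored against a few randomly sampled negative words. The loss below is the standard SGNS formulation from Mikolov et al. (2013); it is not derived in these slides, so treat it as background rather than lecture content.

import numpy as np

rng = np.random.default_rng(0)
V, d, k = 1000, 50, 5                   # vocabulary size, dimension, negatives per pair
U = rng.normal(0, 0.1, (V, d))          # context vectors u_w
W = rng.normal(0, 0.1, (V, d))          # center vectors v_w

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(center_id, context_id):
    v_c = W[center_id]
    u_o = U[context_id]
    negatives = rng.integers(0, V, size=k)              # randomly sampled "negative" words
    pos = np.log(sigmoid(u_o @ v_c))                    # observed pair should score high
    neg = np.sum(np.log(sigmoid(-U[negatives] @ v_c)))  # sampled pairs should score low
    return -(pos + neg)                                 # minimise the negative of both terms

print(sgns_loss(center_id=3, context_id=17))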