
Gradients

Tambet Matiisen

October 6, 2015

1 Softmax loss and gradients

Let’s denote

$$x_i = w_i^T \hat{r}$$

$x_i$ is a scalar and can be considered the (unnormalized) "similarity" of the vectors $w_i$ and $\hat{r}$. Vectors $w_i$ and $\hat{r}$ are both $D \times 1$-dimensional. The full matrix $w$ is $V \times D$-dimensional.

$$\hat{y}_i = Pr(\mathrm{word}_i \mid \hat{r}, w) = \frac{e^{x_i}}{\sum_{j=1}^{V} e^{x_j}}$$

$\hat{y}_i$ turns similarity into a probability: the probability that the predicted word $\hat{r}$ is the word $w_i$. $V$ is the total number of words in the vocabulary.

$$J = CE(y, \hat{y}) = -\sum_{i=1}^{V} y_i \log(\hat{y}_i)$$

$J$ is the cross-entropy loss, where $y$ is the vector of target values and $\hat{y}$ is the vector of predictions. In $y$ only $y_k$ is 1 and all others are 0. The loss could also have been written as $J = -\log(\hat{y}_k)$, but the gradients are much easier to write with the above notation.
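
As a quick sanity check of these definitions, here is a minimal NumPy sketch that computes $\hat{y}$ and $J$ for a toy example; the sizes $V$ and $D$, the random seed and the names (w, r_hat, k) are illustrative assumptions, not part of the original derivation.

\begin{verbatim}
import numpy as np

V, D = 5, 3                         # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
w = rng.standard_normal((V, D))     # output word vectors, one row per word
r_hat = rng.standard_normal(D)      # predicted word vector
k = 2                               # index of the target word, so y_k = 1

x = w @ r_hat                       # x_i = w_i^T r_hat, shape (V,)
e = np.exp(x - x.max())             # shift by max for numerical stability
y_hat = e / e.sum()                 # softmax probabilities
y = np.zeros(V)
y[k] = 1.0                          # one-hot target vector
J = -np.sum(y * np.log(y_hat))      # cross-entropy loss
print(J, -np.log(y_hat[k]))         # the two forms of J agree
\end{verbatim}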

1.1 Gradients with respect to $\hat{r}$

$$\frac{\partial J}{\partial \hat{r}} = \sum_{i=1}^{V} \frac{\partial J}{\partial x_i} \frac{\partial x_i}{\partial \hat{r}}$$

The sum is needed because all $x_i$-s depend on $\hat{r}$, and the $\hat{y}_i$-s (which depend on the $x_i$-s) are summed together in $J$. We know the gradient of the cross-entropy loss with respect to the softmax input from task 2b:

$$\frac{\partial J}{\partial x_i} = \hat{y}_i - y_i$$

Also

$$\frac{\partial x_i}{\partial \hat{r}} = w_i$$

Altogether

$$\frac{\partial J}{\partial \hat{r}} = \sum_{i=1}^{V} (\hat{y}_i - y_i) w_i$$

Instead of $w_i$ we can write it in terms of the full matrix $w$:

$$\frac{\partial J}{\partial \hat{r}} = w^T (\hat{y} - y)$$
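
A finite-difference check of this formula can catch algebra slips early. Below is a small sketch under the same assumed toy setup as before; the helper name softmax_loss is made up for illustration.

\begin{verbatim}
import numpy as np

V, D, k = 5, 3, 2
rng = np.random.default_rng(0)
w = rng.standard_normal((V, D))
r_hat = rng.standard_normal(D)
y = np.zeros(V); y[k] = 1.0

def softmax_loss(r):
    """Cross-entropy loss J = -log(y_hat[k]) as a function of r_hat."""
    e = np.exp(w @ r - (w @ r).max())
    y_hat = e / e.sum()
    return -np.log(y_hat[k]), y_hat

J, y_hat = softmax_loss(r_hat)
grad_analytic = w.T @ (y_hat - y)   # dJ/dr_hat = w^T (y_hat - y)

eps = 1e-6
grad_numeric = np.array([
    (softmax_loss(r_hat + eps * e_d)[0] - softmax_loss(r_hat - eps * e_d)[0]) / (2 * eps)
    for e_d in np.eye(D)            # perturb one coordinate of r_hat at a time
])
print(np.max(np.abs(grad_analytic - grad_numeric)))   # should be ~1e-9 or smaller
\end{verbatim}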

1.2 Gradients with respect to $w_i$

$$\frac{\partial J}{\partial w_i} = \frac{\partial J}{\partial x_i} \frac{\partial x_i}{\partial w_i}$$

There is no sum this time, because only one $x_i$ depends on $w_i$. Because

$$\frac{\partial x_i}{\partial w_i} = \hat{r}$$

the result is

$$\frac{\partial J}{\partial w_i} = (\hat{y}_i - y_i)\hat{r}$$

In terms of the full matrix $w$ that is

$$\frac{\partial J}{\partial w} = (\hat{y} - y)\hat{r}^T$$
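
The matrix form is just the outer product of the error vector and $\hat{r}$. A sketch of a numerical check, again under the assumed toy setup:

\begin{verbatim}
import numpy as np

V, D, k = 5, 3, 2
rng = np.random.default_rng(0)
w = rng.standard_normal((V, D))
r_hat = rng.standard_normal(D)
y = np.zeros(V); y[k] = 1.0

def softmax_loss(w_mat):
    """Cross-entropy loss as a function of the full output matrix w."""
    e = np.exp(w_mat @ r_hat - (w_mat @ r_hat).max())
    y_hat = e / e.sum()
    return -np.log(y_hat[k])

grad_analytic = np.outer(softmax_y_hat := np.exp(w @ r_hat) / np.exp(w @ r_hat).sum(), r_hat) \
                - np.outer(y, r_hat)            # dJ/dw = (y_hat - y) r_hat^T, shape (V, D)

eps = 1e-6
grad_numeric = np.zeros_like(w)
for i in range(V):
    for d in range(D):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i, d] += eps
        w_minus[i, d] -= eps
        grad_numeric[i, d] = (softmax_loss(w_plus) - softmax_loss(w_minus)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_numeric)))   # should be tiny
\end{verbatim}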

2 Negative sampling loss and gradient

Let’s start with notation again. $x_i$ will stay the same:

$$x_i = w_i^T \hat{r}$$

For the probability let’s use $p_i$ this time, because now we are using the sigmoid:

$$p_i = \sigma(x_i)$$

For the negative samples let’s use a different notation:

$$z_k = -w_k^T \hat{r}$$

$$q_k = \sigma(z_k)$$

The negative sampling loss then becomes:

$$J = J(\hat{r}, w_i, w_{1..K}) = -\log p_i - \sum_{k=1}^{K} \log q_k$$
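
For concreteness, a minimal NumPy sketch of this loss; the sizes $D$ and $K$, the seed and the names (w_i, w_neg) are assumptions made only for illustration.

\begin{verbatim}
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

D, K = 3, 4
rng = np.random.default_rng(1)
r_hat = rng.standard_normal(D)        # predicted word vector
w_i = rng.standard_normal(D)          # target (positive) word vector
w_neg = rng.standard_normal((K, D))   # K negative sample vectors, one per row

p_i = sigmoid(w_i @ r_hat)            # p_i = sigma(x_i)
q = sigmoid(-(w_neg @ r_hat))         # q_k = sigma(z_k) with z_k = -w_k^T r_hat
J = -np.log(p_i) - np.sum(np.log(q))  # negative sampling loss
print(J)
\end{verbatim}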

2.1 Gradient with respect to $\hat{r}$

$$\frac{\partial J}{\partial \hat{r}} = \frac{\partial J}{\partial p_i} \frac{\partial p_i}{\partial x_i} \frac{\partial x_i}{\partial \hat{r}} + \sum_{k=1}^{K} \frac{\partial J}{\partial q_k} \frac{\partial q_k}{\partial z_k} \frac{\partial z_k}{\partial \hat{r}}$$

Here $i$ is a parameter of $J$ (the target word $w_i$), so we can leave it free. The needed partial derivatives are:

$$\frac{\partial J}{\partial p_i} = -\frac{1}{p_i} \qquad \frac{\partial J}{\partial q_k} = -\frac{1}{q_k}$$

$$\frac{\partial p_i}{\partial x_i} = p_i(1 - p_i) \qquad \frac{\partial q_k}{\partial z_k} = q_k(1 - q_k)$$

$$\frac{\partial x_i}{\partial \hat{r}} = w_i \qquad \frac{\partial z_k}{\partial \hat{r}} = -w_k$$

Altogether

$$\frac{\partial J}{\partial \hat{r}} = -(1 - p_i) w_i + \sum_{k=1}^{K} (1 - q_k) w_k$$
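
This formula is also easy to verify with finite differences; the sketch below reuses the assumed toy setup from the previous snippet.

\begin{verbatim}
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

D, K = 3, 4
rng = np.random.default_rng(1)
r_hat = rng.standard_normal(D)
w_i = rng.standard_normal(D)
w_neg = rng.standard_normal((K, D))

def neg_sampling_loss(r):
    return -np.log(sigmoid(w_i @ r)) - np.sum(np.log(sigmoid(-(w_neg @ r))))

p_i = sigmoid(w_i @ r_hat)
q = sigmoid(-(w_neg @ r_hat))
grad_analytic = -(1 - p_i) * w_i + (1 - q) @ w_neg   # sum_k (1 - q_k) w_k

eps = 1e-6
grad_numeric = np.array([
    (neg_sampling_loss(r_hat + eps * e_d) - neg_sampling_loss(r_hat - eps * e_d)) / (2 * eps)
    for e_d in np.eye(D)
])
print(np.max(np.abs(grad_analytic - grad_numeric)))   # should be tiny
\end{verbatim}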

2.2 Gradient with respect to $w_j$

$$\frac{\partial J}{\partial w_j} = \frac{\partial J}{\partial p_i} \frac{\partial p_i}{\partial x_i} \frac{\partial x_i}{\partial w_j} + \sum_{k=1}^{K} \frac{\partial J}{\partial q_k} \frac{\partial q_k}{\partial z_k} \frac{\partial z_k}{\partial w_j}$$

Notice that we have to use a new index $j$, because the same word may (or may not) appear both in the positive part as $w_i$ and in the negative part as $w_k$. The only new derivatives we need here are

$$\frac{\partial x_i}{\partial w_j} = \delta_{ij}\hat{r} \qquad \frac{\partial z_k}{\partial w_j} = -\delta_{kj}\hat{r}$$

where $\delta_{xy}$ is the Kronecker delta, which is 1 if $x = y$ and 0 otherwise. Altogether

$$\frac{\partial J}{\partial w_j} = -(1 - p_i)\delta_{ij}\hat{r} + \sum_{k=1}^{K} (1 - q_k)\delta_{kj}\hat{r}$$

We can simplify it further if we write it in terms of $w_i$ and $w_k$:

$$\frac{\partial J}{\partial w_i} = -(1 - p_i)\hat{r} \qquad \frac{\partial J}{\partial w_k} = (1 - q_k)\hat{r}$$

So we use the left term to update the target word's weights and the right term to update all negative sample weights. Pay attention that the target word may also occur among the negative samples: we want the model to learn that the same word shouldn't occur next to itself. In that case both updates must be applied to the target word's vector.
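
To make the last point concrete, here is a sketch of how the per-vector gradients could be accumulated into a single update, including the case where the target index reappears among the negative samples; the indices, names and learning rate are illustrative assumptions.

\begin{verbatim}
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

V, D = 6, 3
rng = np.random.default_rng(2)
w = rng.standard_normal((V, D))   # all output word vectors
r_hat = rng.standard_normal(D)    # predicted word vector
target = 2                        # index i of the target word
negatives = [0, 5, 2, 3]          # sampled indices; note the target (2) reappears
lr = 0.1                          # learning rate

grad_w = np.zeros_like(w)
p_i = sigmoid(w[target] @ r_hat)
grad_w[target] += -(1 - p_i) * r_hat     # dJ/dw_i from the positive term
for k in negatives:
    q_k = sigmoid(-(w[k] @ r_hat))
    grad_w[k] += (1 - q_k) * r_hat       # dJ/dw_k from each negative term

w -= lr * grad_w   # row `target` accumulated both updates, as described above
\end{verbatim}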
