
word2vec gradients

Tambet Matiisen

October 6, 2015

1 Softmax loss and gradients

Let's denote

$$x_i = w_i^T \hat{r}$$

$x_i$ is a scalar and can be considered as the (unnormalized) "similarity" of vectors $w_i$ and $\hat{r}$. Vectors $w_i$ and $\hat{r}$ are both $D \times 1$-dimensional. The full matrix $w$ is $V \times D$-dimensional.

$$\hat{y}_i = Pr(\mathrm{word}_i \mid \hat{r}, w) = \frac{e^{x_i}}{\sum_{j=1}^{V} e^{x_j}}$$

$\hat{y}_i$ turns similarity into probability - the probability that the predicted word $\hat{r}$ is the word $w_i$. $V$ is the total number of words in the vocabulary.

$$J = CE(y, \hat{y}) = -\sum_{i=1}^{V} y_i \log(\hat{y}_i)$$

$J$ is the cross-entropy loss, where $y$ is the vector of target values and $\hat{y}$ is the vector of predictions. In $y$ only $y_k$ is 1 and all others are 0s. The loss could also have been written as $J = -\log(\hat{y}_k)$, but the gradients are much easier to write with the above notation.

1.1 Gradients with respect to $\hat{r}$

$$\frac{\partial J}{\partial \hat{r}} = \sum_{i=1}^{V} \frac{\partial J}{\partial x_i} \frac{\partial x_i}{\partial \hat{r}}$$

The sum is needed because all $x_i$-s depend on $\hat{r}$ and the $\hat{y}_i$-s (which depend on the $x_i$-s) are summed together in $J$.

We know the derivative of the cross-entropy loss with respect to the softmax input from task 2b:

$$\frac{\partial J}{\partial x_i} = \hat{y}_i - y_i$$

Also

$$\frac{\partial x_i}{\partial \hat{r}} = w_i$$

Altogether

$$\frac{\partial J}{\partial \hat{r}} = \sum_{i=1}^{V} (\hat{y}_i - y_i) w_i$$

Instead of $w_i$ we can write it in terms of the full matrix $w$:

$$\frac{\partial J}{\partial \hat{r}} = w^T (\hat{y} - y)$$

1.2 Gradients with respect to $w_i$

$$\frac{\partial J}{\partial w_i} = \frac{\partial J}{\partial x_i} \frac{\partial x_i}{\partial w_i}$$

There is no sum this time, because only one $x_i$ depends on $w_i$. Because

$$\frac{\partial x_i}{\partial w_i} = \hat{r}$$

the result is

$$\frac{\partial J}{\partial w_i} = (\hat{y}_i - y_i) \hat{r}$$

In terms of the full matrix $w$ that is

$$\frac{\partial J}{\partial w} = (\hat{y} - y) \hat{r}^T$$

2 Negative sampling loss and gradient

Let's start with notation again. $x_i$ will stay the same:

$$x_i = w_i^T \hat{r}$$

For the probability let's use $p_i$ this time, because now we are using the sigmoid function:

$$p_i = \sigma(x_i)$$

For the negative samples let's use different notation:

$$z_k = -w_k^T \hat{r} \qquad q_k = \sigma(z_k)$$

The negative sampling loss then becomes:

$$J = J(\hat{r}, w_i, w_{1..K}) = -\log p_i - \sum_{k=1}^{K} \log q_k$$

2.1 Gradient with respect to $\hat{r}$

$$\frac{\partial J}{\partial \hat{r}} = \frac{\partial J}{\partial p_i} \frac{\partial p_i}{\partial x_i} \frac{\partial x_i}{\partial \hat{r}} + \sum_{k=1}^{K} \frac{\partial J}{\partial q_k} \frac{\partial q_k}{\partial z_k} \frac{\partial z_k}{\partial \hat{r}}$$

Here $i$ is a parameter of the loss function $J$ (the target word $w_i$), so we can leave it free. The derivatives are:

$$\frac{\partial J}{\partial p_i} = -\frac{1}{p_i} \qquad \frac{\partial J}{\partial q_k} = -\frac{1}{q_k}$$

$$\frac{\partial p_i}{\partial x_i} = p_i (1 - p_i) \qquad \frac{\partial q_k}{\partial z_k} = q_k (1 - q_k)$$

$$\frac{\partial x_i}{\partial \hat{r}} = w_i \qquad \frac{\partial z_k}{\partial \hat{r}} = -w_k$$

Altogether

$$\frac{\partial J}{\partial \hat{r}} = -(1 - p_i) w_i + \sum_{k=1}^{K} (1 - q_k) w_k$$

2.2 Gradient with respect to $w_j$

$$\frac{\partial J}{\partial w_j} = \frac{\partial J}{\partial p_i} \frac{\partial p_i}{\partial x_i} \frac{\partial x_i}{\partial w_j} + \sum_{k=1}^{K} \frac{\partial J}{\partial q_k} \frac{\partial q_k}{\partial z_k} \frac{\partial z_k}{\partial w_j}$$

Notice that we have to use a new index $j$, because the same word may (or may not) appear in the positive part as $w_i$ and in the negative part as $w_k$. The only new derivatives we need to take here are

$$\frac{\partial x_i}{\partial w_j} = \delta_{ij} \hat{r} \qquad \frac{\partial z_k}{\partial w_j} = -\delta_{kj} \hat{r}$$

where $\delta_{xy}$ is the Kronecker delta, which is 1 if $x = y$ and 0 otherwise.

Altogether

$$\frac{\partial J}{\partial w_j} = -(1 - p_i) \delta_{ij} \hat{r} + \sum_{k=1}^{K} (1 - q_k) \delta_{kj} \hat{r}$$

We can simplify it further if we write it in terms of $w_i$ and $w_k$:

$$\frac{\partial J}{\partial w_i} = -(1 - p_i) \hat{r} \qquad \frac{\partial J}{\partial w_k} = (1 - q_k) \hat{r}$$

So we use the left term to update the target vector weights and the right term to update all negative word weights. Pay attention that the target word may also occur in the negative sample - we want the model to learn that the same word shouldn't occur next to itself. In this case both updates must be applied to the target word's vector.
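The following is a minimal NumPy sketch (not part of the original note) of the softmax loss and the two gradients derived in section 1, verified against finite differences. The function name softmax_loss_and_grads and the variables W, r_hat, and k are illustrative choices, assuming W stacks the word vectors $w_i$ as rows.

    import numpy as np

    def softmax_loss_and_grads(r_hat, W, k):
        """Cross-entropy loss for target word index k, with gradients.

        r_hat : (D,)   predicted vector
        W     : (V, D) output word vectors (rows are w_i)
        k     : index of the target word (y_k = 1)
        """
        x = W @ r_hat                        # x_i = w_i^T r_hat
        x = x - x.max()                      # shift for numerical stability
        y_hat = np.exp(x) / np.exp(x).sum()  # softmax probabilities
        J = -np.log(y_hat[k])                # J = -log(y_hat_k)

        y = np.zeros_like(y_hat)
        y[k] = 1.0
        grad_r = W.T @ (y_hat - y)           # dJ/dr_hat = w^T (y_hat - y)
        grad_W = np.outer(y_hat - y, r_hat)  # dJ/dw = (y_hat - y) r_hat^T
        return J, grad_r, grad_W

    # Finite-difference check of dJ/dr_hat
    rng = np.random.default_rng(0)
    V, D, k = 5, 3, 2
    W = rng.normal(size=(V, D))
    r_hat = rng.normal(size=D)
    J, grad_r, _ = softmax_loss_and_grads(r_hat, W, k)

    eps = 1e-6
    num = np.zeros(D)
    for d in range(D):
        e = np.zeros(D); e[d] = eps
        Jp = softmax_loss_and_grads(r_hat + e, W, k)[0]
        Jm = softmax_loss_and_grads(r_hat - e, W, k)[0]
        num[d] = (Jp - Jm) / (2 * eps)
    print(np.allclose(grad_r, num, atol=1e-5))   # True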
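Similarly, a small sketch (again not from the original note) of the negative sampling loss and the gradients from section 2; w_i is the target word vector and W_neg stacks the K negative sample vectors $w_k$ as rows. These names are assumptions made here for illustration.

    import numpy as np

    def neg_sampling_loss_and_grads(r_hat, w_i, W_neg):
        """J = -log sigma(w_i^T r_hat) - sum_k log sigma(-w_k^T r_hat)."""
        sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
        p_i = sigmoid(w_i @ r_hat)       # p_i = sigma(x_i)
        q = sigmoid(-(W_neg @ r_hat))    # q_k = sigma(z_k), z_k = -w_k^T r_hat
        J = -np.log(p_i) - np.log(q).sum()

        # dJ/dr_hat = -(1 - p_i) w_i + sum_k (1 - q_k) w_k
        grad_r = -(1 - p_i) * w_i + (1 - q) @ W_neg
        grad_wi = -(1 - p_i) * r_hat     # dJ/dw_i = -(1 - p_i) r_hat
        grad_Wneg = np.outer(1 - q, r_hat)  # row k: dJ/dw_k = (1 - q_k) r_hat
        return J, grad_r, grad_wi, grad_Wneg

    # Finite-difference check of dJ/dr_hat
    rng = np.random.default_rng(1)
    D, K = 4, 3
    r_hat, w_i = rng.normal(size=D), rng.normal(size=D)
    W_neg = rng.normal(size=(K, D))
    _, grad_r, _, _ = neg_sampling_loss_and_grads(r_hat, w_i, W_neg)

    eps = 1e-6
    num = np.zeros(D)
    for d in range(D):
        e = np.zeros(D); e[d] = eps
        Jp = neg_sampling_loss_and_grads(r_hat + e, w_i, W_neg)[0]
        Jm = neg_sampling_loss_and_grads(r_hat - e, w_i, W_neg)[0]
        num[d] = (Jp - Jm) / (2 * eps)
    print(np.allclose(grad_r, num, atol=1e-5))   # True

In this sketch, the case where the target word is also drawn as a negative sample corresponds to adding both grad_wi and the matching row of grad_Wneg to that word's vector, as described in section 2.2.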