10707 Deep Learning
Russ Salakhutdinov
Machine Learning Department
[email protected]
http://www.cs.cmu.edu/~rsalakhu/10707/

Neural Networks II

Neural Networks Online Course
• Disclaimer: Much of the material and slides for this lecture were borrowed from Hugo Larochelle's class on Neural Networks: https://sites.google.com/site/deeplearningsummerschool2016/
• Hugo's class covers many other topics: convolutional networks, neural language models, Boltzmann machines, autoencoders, sparse coding, etc.
• We will use his material for some of the other lectures.

Recap: Activations and Regularization
• Sigmoid derivative: $g'(a) = g(a)(1 - g(a))$
• Tanh derivative: $g'(a) = 1 - g(a)^2$
• L2 regularization: $\Omega(\theta) = \sum_k \sum_i \sum_j \big(W_{i,j}^{(k)}\big)^2 = \sum_k \|W^{(k)}\|_F^2$, with gradient $\nabla_{W^{(k)}} \Omega(\theta) = 2W^{(k)}$
• L1 regularization: $\Omega(\theta) = \sum_k \sum_i \sum_j |W_{i,j}^{(k)}|$, with gradient $\nabla_{W^{(k)}} \Omega(\theta) = \mathrm{sign}(W^{(k)})$, where $\mathrm{sign}(W^{(k)})_{i,j} = 1_{W_{i,j}^{(k)} > 0} - 1_{W_{i,j}^{(k)} < 0}$

Initialization
• Initialize biases to 0
• For weights $W^{(k)}$:
- Cannot initialize weights to 0 with the tanh activation
Ø All gradients would be zero (saddle point)
- Cannot initialize all weights to the same value
Ø All hidden units in a layer would always behave the same
Ø Need to break symmetry
- Sample $W_{i,j}^{(k)} \sim U[-b, b]$ with $b = \frac{\sqrt{6}}{\sqrt{H_k + H_{k-1}}}$, where $H_k$ is the size of $h^{(k)}(x)$
Ø Sampling around 0 breaks the symmetry (see the sketch after this section)

Recap: Estimators and Maximum Likelihood
• Sample mean: $\hat{\mu} = \frac{1}{T} \sum_t x^{(t)}$
• Sample variance: $\hat{\sigma}^2 = \frac{1}{T-1} \sum_t (x^{(t)} - \hat{\mu})^2$
• Sample covariance: $\hat{\Sigma} = \frac{1}{T-1} \sum_t (x^{(t)} - \hat{\mu})(x^{(t)} - \hat{\mu})^\top$
• These estimators are unbiased: $\mathrm{E}[\hat{\mu}] = \mu$, $\mathrm{E}[\hat{\sigma}^2] = \sigma^2$, $\mathrm{E}[\hat{\Sigma}] = \Sigma$
• Since $\frac{\hat{\mu} - \mu}{\sqrt{\sigma^2 / T}}$ is approximately standard normal, $\hat{\mu} \pm 1.96\sqrt{\hat{\sigma}^2 / T}$ gives a 95% confidence interval for $\mu$
• Maximum likelihood: $\hat{\theta} = \arg\max_\theta \, p(x^{(1)}, \dots, x^{(T)})$, where for i.i.d. data $p(x^{(1)}, \dots, x^{(T)}) = \prod_t p(x^{(t)})$
• The maximum likelihood estimate of the covariance is the biased $\frac{T-1}{T}\hat{\Sigma} = \frac{1}{T} \sum_t (x^{(t)} - \hat{\mu})(x^{(t)} - \hat{\mu})^\top$

Machine Learning: Model Selection
• Supervised learning example: a pair $(x, y)$ of input $x$ and target $y$
• Training Protocol:
- Train your model $f(x; \theta)$ on the Training Set $\mathcal{D}^{\mathrm{train}} = \{(x^{(t)}, y^{(t)})\}$
- For model selection, use the Validation Set $\mathcal{D}^{\mathrm{valid}}$
Ø Hyper-parameter search: hidden layer size, learning rate, number of iterations/epochs, etc.
- Estimate generalization performance using the Test Set $\mathcal{D}^{\mathrm{test}}$
• Remember: Generalization is the behavior of the model on unseen examples.
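A minimal numpy sketch of this initialization scheme, assuming a simple fully-connected network (the function name and the example layer sizes are illustrative, not from the slides):

```python
import numpy as np

def init_params(layer_sizes, seed=0):
    """Biases start at 0; each W^(k) is sampled from U[-b, b] with
    b = sqrt(6) / sqrt(H_k + H_{k-1}), which breaks symmetry while
    keeping the weights centered at 0."""
    rng = np.random.default_rng(seed)
    params = []
    for H_prev, H_k in zip(layer_sizes[:-1], layer_sizes[1:]):
        b = np.sqrt(6.0) / np.sqrt(H_k + H_prev)
        W = rng.uniform(-b, b, size=(H_k, H_prev))  # sample around 0
        params.append((W, np.zeros(H_k)))           # zero biases
    return params

# Hypothetical example: layers of size 784, 500, 250, 10
params = init_params([784, 500, 250, 10])
```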
Early Stopping
• To select the number of epochs, stop training when the validation set error increases (with some look ahead).

Tricks of the Trade
• Normalizing your (real-valued) data:
Ø for each dimension $x_i$, subtract its training set mean
Ø divide each dimension $x_i$ by its training set standard deviation
Ø this can speed up training
• Decreasing the learning rate: as we get closer to the optimum, take smaller update steps:
i. start with a large learning rate (e.g. 0.1)
ii. maintain it until the validation error stops improving
iii. divide the learning rate by 2 and go back to (ii)

Mini-batch, Momentum
• Make updates based on a mini-batch of examples (instead of a single example):
Ø the gradient is the average regularized loss for that mini-batch
Ø can give a more accurate estimate of the gradient
Ø can leverage matrix/matrix operations, which are more efficient
• For convergence, the learning rate schedule should satisfy $\sum_{t=1}^{\infty} \alpha_t = \infty$ and $\sum_{t=1}^{\infty} \alpha_t^2 < \infty$, e.g. $\alpha_t = \frac{\alpha}{1 + \delta t}$ or $\alpha_t = \frac{\alpha}{t^{\delta}}$ with $0.5 < \delta \leq 1$
• Momentum: can use an exponential average of previous gradients:
$\bar{\nabla}_\theta^{(t)} = \nabla_\theta l(f(x^{(t)}), y^{(t)}) + \beta \bar{\nabla}_\theta^{(t-1)}$
Ø can get past plateaus more quickly, by "gaining momentum"

Adapting Learning Rates
• Updates with adaptive learning rates ("one learning rate per parameter"); a sketch of these updates follows this section
Ø Adagrad: learning rates are scaled by the square root of the cumulative sum of squared gradients:
$\gamma^{(t)} = \gamma^{(t-1)} + \big( \nabla_\theta l(f(x^{(t)}), y^{(t)}) \big)^2$
$\bar{\nabla}_\theta^{(t)} = \frac{\nabla_\theta l(f(x^{(t)}), y^{(t)})}{\sqrt{\gamma^{(t)} + \epsilon}}$
Ø RMSProp: instead of a cumulative sum, use an exponential moving average:
$\gamma^{(t)} = \beta \gamma^{(t-1)} + (1 - \beta) \big( \nabla_\theta l(f(x^{(t)}), y^{(t)}) \big)^2$
$\bar{\nabla}_\theta^{(t)} = \frac{\nabla_\theta l(f(x^{(t)}), y^{(t)})}{\sqrt{\gamma^{(t)} + \epsilon}}$
Ø Adam: essentially combines RMSProp with momentum
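To make the update rules concrete, here is a minimal numpy sketch of the momentum, Adagrad, and RMSProp updates for a single parameter array (the function names and hyper-parameter defaults are assumptions for the example):

```python
import numpy as np

def momentum_update(grad, avg_grad, beta=0.9):
    """Exponential average of previous gradients:
    avg_grad^(t) = grad^(t) + beta * avg_grad^(t-1)."""
    return grad + beta * avg_grad

def adagrad_update(grad, gamma, eps=1e-8):
    """Scale the gradient by the square root of the
    cumulative sum of squared gradients."""
    gamma = gamma + grad ** 2
    return grad / np.sqrt(gamma + eps), gamma

def rmsprop_update(grad, gamma, beta=0.9, eps=1e-8):
    """Like Adagrad, but with an exponential moving average
    of squared gradients instead of a cumulative sum."""
    gamma = beta * gamma + (1.0 - beta) * grad ** 2
    return grad / np.sqrt(gamma + eps), gamma

# Inside a training loop (theta, grad, gamma are numpy arrays):
# scaled, gamma = rmsprop_update(grad, gamma)
# theta -= learning_rate * scaled
```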
Recap: Forward Propagation
• Pre-activations: $a^{(1)}(x) = b^{(1)} + W^{(1)}x$, $a^{(2)}(x) = b^{(2)} + W^{(2)}h^{(1)}$, $a^{(3)}(x) = b^{(3)} + W^{(3)}h^{(2)}$
• Hidden layer activations: $h^{(1)}(x) = g(a^{(1)}(x))$, $h^{(2)}(x) = g(a^{(2)}(x))$
• Output layer: $h^{(3)}(x) = o(a^{(3)}(x))$
• Parameters: $W^{(1)}, W^{(2)}, W^{(3)}$ and $b^{(1)}, b^{(2)}, b^{(3)}$

Gradient Checking
• To debug your implementation of fprop/bprop, you can compare with a finite-difference approximation of the gradient (a sketch of this check is given below):
$\frac{\partial f(x)}{\partial x} \approx \frac{f(x + \epsilon) - f(x - \epsilon)}{2\epsilon}$
Ø $f(x)$ would be the loss
Ø $x$ would be a parameter
Ø $f(x + \epsilon)$ would be the loss if you add $\epsilon$ to the parameter
Ø $f(x - \epsilon)$ would be the loss if you subtract $\epsilon$ from the parameter

Debugging on Small Dataset
• Next, make sure your model can overfit on a smaller dataset (~ 500-1000 examples); a minimal version of this check is also sketched below
• If not, investigate the following situations:
Ø Are some of …
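For the Gradient Checking slide above, a minimal numpy sketch of the finite-difference comparison (the function name and the flat-parameter-vector interface are assumptions for the example):

```python
import numpy as np

def numerical_gradient(loss_fn, theta, eps=1e-6):
    """Central finite differences (f(x + eps) - f(x - eps)) / (2 eps),
    applied to each parameter in turn; loss_fn maps a flat parameter
    vector theta to the scalar loss."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        original = theta[i]
        theta[i] = original + eps
        f_plus = loss_fn(theta)      # loss with eps added to the parameter
        theta[i] = original - eps
        f_minus = loss_fn(theta)     # loss with eps subtracted
        theta[i] = original          # restore the parameter
        grad[i] = (f_plus - f_minus) / (2.0 * eps)
    return grad

# Compare against your bprop gradient, e.g.:
# assert np.allclose(numerical_gradient(loss_fn, theta), bprop_grad, atol=1e-5)
```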

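And for the small-dataset check, a sketch of the idea under an assumed generic interface (train and train_error are hypothetical stand-ins for your training loop and training-error evaluation):

```python
import numpy as np

def overfit_check(train, train_error, params, X, y, n=1000, seed=0):
    """Train on ~500-1000 examples; if the implementation is sound,
    the training error on such a small subset should approach 0."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(n, len(X)), replace=False)
    X_small, y_small = X[idx], y[idx]
    params = train(params, X_small, y_small)     # run your full training loop
    err = train_error(params, X_small, y_small)  # fraction of mistakes
    print(f"training error on {len(idx)} examples: {err:.4f}")
    return err < 0.01  # expect a near-perfect fit

# If this check fails, investigate the situations listed above.
```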