On Recurrent and Deep Neural Networks
Razvan Pascanu
Advisor: Yoshua Bengio
PhD Defence, Université de Montréal, LISA lab, September 2014

Motivation
"A computer once beat me at chess, but it was no match for me at kick boxing" (Emo Phillips)
Studying the mechanism behind learning provides a meta-solution for solving tasks.

Supervised Learning
- f ∈ F, f : Θ × D → T
- f_θ(x) = f(θ, x)
- f* = arg min_{θ ∈ Θ} E_{x,t ∼ π} [ d(f_θ(x), t) ]

Optimization for learning
- Learning proceeds by iterative parameter updates θ^[k] → θ^[k+1].

Neural networks
[Diagram: a feedforward network with an input layer, first and second hidden layers, a last hidden layer, output neurons and a bias unit.]

Recurrent neural networks
[Diagram: (a) the feedforward network next to (b) a recurrent network, in which the hidden layers are replaced by a recurrent layer that feeds back into itself.]

On the number of linear regions of Deep Neural Networks
Razvan Pascanu, Guido Montufar, Kyunghyun Cho and Yoshua Bengio
International Conference on Learning Representations 2014
Submitted to the Conference on Neural Information Processing Systems 2014

Big picture
- rect(x) = 0 for x < 0 and x for x ≥ 0, i.e. rect(x) = max(0, x)
- Idea: a composition of piecewise linear functions is a piecewise linear function
- Approach: count the number of linear pieces for a deep versus a shallow model

Single Layer models
[Diagram: the regions R_∅, R_1, R_2, ..., R_123 into which the lines L_1, L_2, L_3 cut the input plane.]
Zaslavsky's Theorem (1975): an arrangement of n_hid hyperplanes cuts an n_inp-dimensional input space into at most
  Σ_{s=0}^{n_inp} C(n_hid, s)
regions.

Multi-Layer models: how would it work?
[Diagram: a small two-input example (x_0, x_1) in which the first-layer units define the regions S_1, ..., S_4.]
[Diagram: the input space is mapped into the first-layer space by (1) folding along the vertical axis and (2) folding along the horizontal axis; (3) the second layer then acts on the folded space.]
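To make the region-counting argument concrete, here is a small sketch of my own (not from the thesis): it counts the linear pieces of random shallow and deep rectifier networks on a 1-D input and compares them with the single-layer Zaslavsky count. The layer sizes, initialization and grid resolution are arbitrary choices, and the piece count is only approximate.

# Minimal sketch: counting the linear pieces of small rectifier networks
# on a 1-D input. Shapes and initialization are arbitrary choices.
import numpy as np
from math import comb

rng = np.random.default_rng(0)

def rect(x):
    # rect(x) = max(0, x), the piecewise linear activation from the slides
    return np.maximum(0.0, x)

def random_mlp(layer_sizes):
    """Random weights for an MLP with 1-D input and 1-D output."""
    sizes = [1] + list(layer_sizes) + [1]
    return [(rng.standard_normal((m, n)), rng.standard_normal(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, x):
    h = x.reshape(-1, 1)
    for i, (W, b) in enumerate(params):
        h = h @ W + b
        if i < len(params) - 1:          # no rectifier on the output layer
            h = rect(h)
    return h.ravel()

def count_pieces(params, lo=-10.0, hi=10.0, n=200001):
    """Approximate the number of linear pieces by counting slope changes
    of the network output on a fine 1-D grid."""
    x = np.linspace(lo, hi, n)
    y = forward(params, x)
    slopes = np.diff(y) / np.diff(x)
    changes = np.sum(~np.isclose(slopes[1:], slopes[:-1], atol=1e-8))
    return changes + 1

def zaslavsky_count(n_hid, n_inp):
    # The single-layer count from Zaslavsky's theorem above
    return sum(comb(n_hid, s) for s in range(n_inp + 1))

shallow = random_mlp([8])        # one hidden layer of 8 rectifier units
deep = random_mlp([4, 4])        # two hidden layers of 4 units each

print("single-layer count (8 units, 1-D input):", zaslavsky_count(8, 1))
print("pieces of the shallow net:", count_pieces(shallow))
print("pieces of the deep net   :", count_pieces(deep))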
Visualizing units
[Figures visualizing individual hidden units.]

Revisiting Natural Gradient for Deep Networks
Razvan Pascanu and Yoshua Bengio
International Conference on Learning Representations 2014

Gist of this work
- Natural Gradient is a generalized Trust Region method (1)
- Hessian-Free Optimization is Natural Gradient
- Using the Empirical Fisher (TONGA) is not equivalent to the same trust region method as natural gradient
- Natural Gradient can be accelerated if we add second-order information about the error
- Natural Gradient can use unlabeled data
- Natural Gradient is more robust to changes in the order of the training set
(1) For particular pairs of activation functions and error functions.

On the saddle point problem for non-convex optimization
Yann Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli and Yoshua Bengio
Submitted to the Conference on Neural Information Processing Systems 2014

Existing evidence
- Statistical physics (on random Gaussian fields)
[Plots: error as a function of the index of a critical point, and the corresponding eigenvalue distribution.]

Existing evidence
- Empirical evidence
[Plots: training error (%) versus the index α of critical points, and the eigenvalue distribution p(λ) for networks at training errors 0.32%, 23.49% and 28.23%.]

Problem
- Saddle points are attractors of second-order dynamics
[Plot: trajectories of Newton, SFN and SGD around a saddle point.]
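To illustrate the point of this slide, a small sketch of my own (not from the thesis): on the toy function f(x, y) = x² − y², an exact Newton step lands on the saddle at the origin, whereas a plain gradient step and a step rescaled by |H| (the saddle-free idea introduced on the next slide) move away from it. The step sizes and starting point are arbitrary.

# Minimal sketch: on f(x, y) = x^2 - y^2, a plain Newton step jumps
# straight to the saddle point (0, 0), while gradient descent and a
# "saddle-free" step using |H| (absolute eigenvalues of the Hessian)
# move away from it. Step sizes and the starting point are arbitrary.
import numpy as np

def grad(p):
    x, y = p
    return np.array([2.0 * x, -2.0 * y])

H = np.array([[2.0, 0.0],
              [0.0, -2.0]])          # constant Hessian of f(x, y) = x^2 - y^2

def newton_step(p):
    return p - np.linalg.solve(H, grad(p))

def saddle_free_step(p, lr=0.1):
    # |H|: same eigenvectors, absolute eigenvalues
    w, V = np.linalg.eigh(H)
    H_abs = V @ np.diag(np.abs(w)) @ V.T
    return p - lr * np.linalg.solve(H_abs, grad(p))

def sgd_step(p, lr=0.1):
    return p - lr * grad(p)

p0 = np.array([0.5, 1e-3])           # start close to the unstable direction
for name, step in [("Newton", newton_step),
                   ("SFN", saddle_free_step),
                   ("SGD", sgd_step)]:
    p = p0.copy()
    for _ in range(50):
        p = step(p)
    print(f"{name:6s} ends at {p}")   # Newton sits at the saddle (0, 0)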
Solution
Minimize the first-order Taylor expansion of the loss around θ, within a trust region in which it stays close to the second-order expansion:
  arg min_{∆θ} T_1{L}(θ, ∆θ)   s.t.   ‖ T_2{L}(θ, ∆θ) − T_1{L}(θ, ∆θ) ‖ ≤ ∆
Using Lagrange multipliers:
  ∆θ = − (∂L(θ)/∂θ) |H|^{−1}
where |H| is the Hessian with its eigenvalues replaced by their absolute values.

Experiments
[Plots on CIFAR-10: training error versus the number of hidden units, and training error and an eigenvalue-based quantity versus the number of epochs, for MSGD, Damped Newton and SFN.]

Experiments
[Plots: training curves of MSGD and SFN for a deep autoencoder and for a recurrent neural network.]

A Neurodynamical Model for Working Memory
Razvan Pascanu, Herbert Jaeger
Neural Networks (journal), 2011

Gist of this work
[Diagram: an architecture with input units (u), a reservoir (x), output units (y) and working-memory (WM) units (m).]

On the difficulty of training recurrent neural networks
Razvan Pascanu, Tomas Mikolov, Yoshua Bengio
International Conference on Machine Learning 2013

The exploding gradients problem
[Diagram: an RNN unrolled in time, with inputs x(t−1), x(t), x(t+1), states h(t−1), h(t), h(t+1), costs C(t−1), C(t), C(t+1), and the Jacobians ∂h(j)/∂h(j−1) linking consecutive states.]
  ∂C/∂W = Σ_t ∂C(t)/∂W = Σ_t Σ_{k=0}^{t} (∂C(t)/∂h(t)) (∂h(t)/∂h(t−k)) (∂h(t−k)/∂W)
  ∂h(t)/∂h(t−k) = Π_{j=t−k+1}^{t} ∂h(j)/∂h(j−1)

Possible geometric interpretation and norm clipping
The error is (h(50) − 0.7)² for h(t) = w σ(h(t−1)) + b with h(0) = 0.5.
[Plots: the error surface over the parameters θ = (w, b), comparing the classical view with one containing a steep wall; rescaling (clipping) the gradient norm prevents an update that hits the wall from being thrown far away.]

The vanishing gradients problem
[Same unrolled diagram and gradient decomposition as above: when the Jacobians ∂h(j)/∂h(j−1) have small norm, the long-range factors ∂h(t)/∂h(t−k) shrink exponentially with k.]

Regularization term
  Ω = Σ_k Ω_k = Σ_k ( ‖ (∂C/∂h_{k+1}) (∂h_{k+1}/∂h_k) ‖ / ‖ ∂C/∂h_{k+1} ‖ − 1 )²

Temporal Order
Important symbols: A, B. Distractor symbols: c, d, e, f.
Each sequence of length T contains two important symbols placed inside short windows of length T/10, separated by long stretches of distractors of length 4T/10; the target is the ordered pair of important symbols:
  de..fAef ccefc..e fAef..e ef..c  → AA
  edefcAccfef..ceceBedef..fedef    → AB
  feBefccde..efddcAfccee..cedcd    → BA
  Bfffede..cffecdBedfd..cedfedc    → BB
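As a concrete illustration of the task just described, a small data-generation sketch of my own. The exact placement of the important symbols is my reading of the slide (one in the first window of length T/10, one in the window starting at T/2), so treat it as an assumption rather than the thesis's exact protocol.

# Sketch of a generator for the temporal-order task described above.
# The symbol windows are an assumption, not the thesis's exact protocol.
import random

IMPORTANT = "AB"
DISTRACTORS = "cdef"

def temporal_order_example(T, rng=random):
    """Return (sequence, target): a length-T string of distractors with two
    important symbols embedded, and the ordered pair of those symbols."""
    seq = [rng.choice(DISTRACTORS) for _ in range(T)]
    first, second = rng.choice(IMPORTANT), rng.choice(IMPORTANT)
    i = rng.randrange(0, T // 10)                 # first short window
    j = rng.randrange(T // 2, T // 2 + T // 10)   # second short window
    seq[i], seq[j] = first, second
    return "".join(seq), first + second

if __name__ == "__main__":
    rng = random.Random(0)
    for _ in range(3):
        s, t = temporal_order_example(50, rng)
        print(t, s)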
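Before the results, a minimal sketch of gradient-norm clipping, the rescaling heuristic motivated by the geometric-interpretation slide above (the "C" in the MSGD-C curves that follow). The threshold and the toy gradient are placeholders of mine; only the rescaling rule itself is the point.

# Sketch of gradient-norm clipping: if the global gradient norm exceeds
# a threshold, rescale every gradient so the global norm equals it.
import numpy as np

def clip_gradient_norm(grads, threshold):
    """Rescale the list of gradient arrays to a global norm of at most
    `threshold`."""
    norm = np.sqrt(sum(float(np.sum(g * g)) for g in grads))
    if norm > threshold:
        grads = [g * (threshold / norm) for g in grads]
    return grads

# Example: a toy gradient whose norm (5.0) exceeds the threshold (1.0)
grads = [np.array([3.0, 0.0]), np.array([0.0, 4.0])]
clipped = clip_gradient_norm(grads, threshold=1.0)
print([g.tolist() for g in clipped])   # rescaled to global norm 1.0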
Results - Temporal order task, sigmoid units
[Plot: rate of success versus sequence length (50 to 250) for MSGD, MSGD-C and MSGD-CR.]

Results - Temporal order task, basic tanh units
[Plot: rate of success versus sequence length for MSGD, MSGD-C and MSGD-CR.]

Results - Temporal order task, smart tanh units
[Plot: rate of success versus sequence length for MSGD, MSGD-C and MSGD-CR.]

Results - Natural tasks

How to construct Deep Recurrent Neural Networks
Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Yoshua Bengio
International Conference on Learning Representations 2014

Gist of this work
[Diagrams: a conventional RNN compared with a deep-transition RNN (DT-RNN), a DOT-RNN with a deep output, an operator view with an intermediate state z_t, stacked RNNs, and the DOT(s)-RNN combining stacking with deep transitions and outputs.]

Overview of contributions
- The efficiency of deep feedforward models with piecewise linear activation functions
- The relationship between a few optimization techniques for deep learning, with a focus on understanding natural gradient
- The importance of saddle points for optimization algorithms when applied to deep learning
- Training recurrent neural networks: the exploding and vanishing gradients problems, and how to construct deep recurrent architectures
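To make the architectural distinction in the deep recurrent networks part above concrete, a minimal sketch of my own: one conventional RNN transition next to a deep-transition (DT-RNN style) step, in which the hidden-to-hidden map is itself a small feedforward network. The sizes, nonlinearities and initialization are arbitrary choices, not the thesis's exact parameterization.

# Minimal sketch: a conventional RNN transition versus a deep-transition
# (DT-RNN style) step with an intermediate state between h_{t-1} and h_t.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_inter = 3, 5, 4

# Conventional transition: h_t = tanh(W h_{t-1} + U x_t + b)
W = rng.standard_normal((n_hid, n_hid)) * 0.1
U = rng.standard_normal((n_hid, n_in)) * 0.1
b = np.zeros(n_hid)

def shallow_step(h_prev, x):
    return np.tanh(W @ h_prev + U @ x + b)

# Deep transition: an intermediate layer z_t between h_{t-1} and h_t
W1 = rng.standard_normal((n_inter, n_hid)) * 0.1
U1 = rng.standard_normal((n_inter, n_in)) * 0.1
W2 = rng.standard_normal((n_hid, n_inter)) * 0.1

def deep_transition_step(h_prev, x):
    z = np.tanh(W1 @ h_prev + U1 @ x)   # intermediate state
    return np.tanh(W2 @ z)              # next hidden state

h_shallow = np.zeros(n_hid)
h_deep = np.zeros(n_hid)
for x in rng.standard_normal((7, n_in)):   # a toy sequence of 7 inputs
    h_shallow = shallow_step(h_shallow, x)
    h_deep = deep_transition_step(h_deep, x)
print("conventional RNN state :", np.round(h_shallow, 3))
print("deep-transition state  :", np.round(h_deep, 3))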