On Recurrent and Deep Neural Networks
Razvan Pascanu (Advisor: Yoshua Bengio)
PhD Defence, Université de Montréal, LISA lab, September 2014
Motivation

“A computer once beat me at chess, but it was no match for me at kick boxing”
— Emo Philips

Studying the mechanism behind learning provides a meta-solution for solving tasks.
Supervised Learning

- f : Θ × D → T, with f ∈ F
- f_θ(x) = f(θ, x)
- f* = arg min_{θ ∈ Θ} E_{x,t ∼ π} [ d(f_θ(x), t) ]
Optimization for learning

[Figure: a parameter update step from θ[k] to θ[k+1] on the error surface]
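The update pictured above can be sketched as plain gradient descent; this is a minimal illustration, where the quadratic loss and the learning rate are my own assumptions, not taken from the talk:

```python
import numpy as np

def gradient_descent(grad, theta0, lr=0.1, steps=100):
    """Iterate theta[k+1] = theta[k] - lr * grad(theta[k])."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - lr * grad(theta)
    return theta

# Illustrative loss L(theta) = (theta - 3)^2, with gradient 2 * (theta - 3);
# the iterates converge toward the minimizer theta = 3.
theta_star = gradient_descent(lambda th: 2.0 * (th - 3.0), np.array([0.0]))
```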
Neural networks

[Figure: a feedforward network with an input layer, first and second hidden layers, a last hidden layer with a bias unit, and output neurons]
Pascanu On Recurrent and Deep Neural Networks 5/ 38 Recurrent neural networks
Output neurons Output neurons
Last hidden layer bias = 1 bias = 1
Recurrent Layer Second hidden layer
First hidden layer
Input layer Input layer (b) Recurrent network (a) Feedforward network
On the number of linear regions of Deep Neural Networks

Razvan Pascanu, Guido Montufar, Kyunghyun Cho and Yoshua Bengio

International Conference on Learning Representations 2014; submitted to Conference on Neural Information Processing Systems 2014

Big picture
- rect(x) = { 0, x < 0; x, x ≥ 0 }
- Idea: a composition of piecewise linear functions is itself piecewise linear
- Approach: count the number of linear pieces for a deep versus a shallow model
Single layer models

[Figure: an arrangement of hyperplanes partitioning the input space into regions R_1, R_2, R_12, R_13, R_123, ...]

Zaslavsky's Theorem (1975): n_hid hyperplanes split an n_inp-dimensional input space into at most Σ_{s=0}^{n_inp} (n_hid choose s) regions
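The bound from Zaslavsky's theorem is easy to evaluate directly; a minimal sketch:

```python
from math import comb

def max_regions(n_inp, n_hid):
    """Zaslavsky (1975): the maximal number of regions cut out of an
    n_inp-dimensional space by n_hid hyperplanes."""
    return sum(comb(n_hid, s) for s in range(n_inp + 1))

# Example: 3 lines split the plane into at most 1 + 3 + 3 = 7 regions.
regions = max_regions(2, 3)  # 7
```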
Multi-Layer models: how would it work?

[Figure: responses of two first-layer units P_1, P_2 over the input x_0, partitioning the input into segments S_1 ... S_4]
Multi-Layer models: how would it work?

[Figure: the input space is mapped onto the first layer space by folding (1. fold along the vertical axis, 2. fold along the horizontal axis, 3. map into the second layer space); each region S_1 ... S_4 is replicated across the folds]
Visualizing units
Revisiting Natural Gradient for Deep Networks
Razvan Pascanu and Yoshua Bengio
International Conference on Learning Representations 2014
Gist of this work

- Natural Gradient is a generalized Trust Region method
- Hessian-Free Optimization is Natural Gradient [1]
- Using the Empirical Fisher (TONGA) is not equivalent to the same trust region method as natural gradient
- Natural Gradient can be accelerated by adding second order information of the error
- Natural Gradient can use unlabeled data
- Natural Gradient is more robust to changes in the order of the training set

[1] for particular pairs of activation functions and error functions

On the saddle point problem for non-convex optimization
Yann Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli and Yoshua Bengio
Submitted to Conference on Neural Information Processing Systems 2014
Existing evidence

- Statistical physics (on random Gaussian fields)

[Figure: error as a function of the index of critical points, and the corresponding eigenvalue distribution]
Existing evidence

- Empirical evidence

[Figure: training error (%) vs. index of critical point α, and the Hessian eigenvalue distribution p(λ) at critical points with training errors 0.32%, 23.49% and 28.23%]
Problem
- Saddle points are attractors of second order dynamics

[Figure: trajectories of Newton, SFN and SGD around a saddle point]
Solution

arg min_{Δθ} T_1{L}(θ, Δθ)   s.t.   ‖ T_2{L}(θ, Δθ) − T_1{L}(θ, Δθ) ‖ ≤ ε

Using Lagrange multipliers:

Δθ = − |H|^{−1} ∂L(θ)/∂θ
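The resulting saddle-free Newton (SFN) step replaces each eigenvalue of the Hessian by its absolute value; a minimal sketch on the toy saddle L(x, y) = x² − y² (the toy function and the starting point are my illustrations, not from the talk):

```python
import numpy as np

def sfn_step(grad, hess):
    """Saddle-free Newton step: -|H|^{-1} grad, where |H| replaces every
    eigenvalue of the Hessian by its absolute value."""
    w, V = np.linalg.eigh(hess)
    return -(V @ ((V.T @ grad) / np.abs(w)))

# Toy saddle L(x, y) = x^2 - y^2 at theta = (0.5, 0.3).
theta = np.array([0.5, 0.3])
g = np.array([2 * theta[0], -2 * theta[1]])   # gradient of L
H = np.diag([2.0, -2.0])                      # Hessian of L
newton = -np.linalg.solve(H, g)  # plain Newton jumps to the saddle (0, 0)
sfn = sfn_step(g, H)             # SFN descends in x and moves away in y
```

Plain Newton treats the negative-curvature direction as a descent direction and lands on the saddle; flipping the eigenvalue sign makes the same direction repulsive.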
Experiments

[Figure: training error (%) of MSGD, Damped Newton and SFN on CIFAR-10, as a function of the number of hidden units (5, 25, 50) and over training epochs]

Experiments
[Figure: training error of MSGD vs. SFN on a deep autoencoder and a recurrent neural network]
A Neurodynamical Model for Working Memory
Razvan Pascanu, Herbert Jaeger
Neural Networks (journal), 2011
Gist of this work

[Figure: an echo state network with input units (u), a reservoir (x), output units (y) and working memory (WM) units (m)]
On the difficulty of training recurrent neural networks
Razvan Pascanu, Tomas Mikolov, Yoshua Bengio
International Conference on Machine Learning 2013
The exploding gradients problem

[Figure: an RNN unrolled in time, with inputs x(t), hidden states h(t), per-step costs C(t), and Jacobians ∂h(j)/∂h(j−1) linking consecutive states]

∂C/∂W = Σ_t ∂C(t)/∂W = Σ_t Σ_{k=0}^{t} ( ∂C(t)/∂h(t) ) · ( ∂h(t)/∂h(t−k) ) · ( ∂h(t−k)/∂W )

∂h(t)/∂h(t−k) = Π_{j=t−k+1}^{t} ∂h(j)/∂h(j−1)
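For the scalar linear recurrence h(t) = w · h(t−1), the product of Jacobians collapses to w^k, which makes the explosion (and, symmetrically, the vanishing) over the time lag easy to see; a minimal sketch:

```python
import numpy as np

# For h(t) = w * h(t-1), the factor dh(t)/dh(t-k) is simply w**k:
# |w| > 1 explodes with the lag k, |w| < 1 vanishes.
def lag_gradient(w, k):
    return w ** k

exploding = [lag_gradient(1.1, k) for k in (0, 10, 50)]  # grows with k
vanishing = [lag_gradient(0.9, k) for k in (0, 10, 50)]  # shrinks with k
```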
Possible geometric interpretation and norm clipping

Classical view: the error is (h(50) − 0.7)² for h(t) = w σ(h(t−1)) + b, with h(0) = 0.5

[Figure: the error surface over the parameters θ = (w, b), exhibiting a steep wall]
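Norm clipping rescales the gradient whenever its norm exceeds a threshold, so an update near the steep wall cannot jump arbitrarily far; a minimal sketch (the example vector and threshold are illustrative):

```python
import numpy as np

def clip_gradient(g, threshold):
    """Norm clipping: if ||g|| exceeds the threshold, rescale g to have
    norm `threshold` while keeping its direction."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        return g * (threshold / norm)
    return g

g = np.array([30.0, 40.0])       # norm 50
clipped = clip_gradient(g, 5.0)  # rescaled to [3, 4], norm 5
```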
The vanishing gradients problem

[Figure: the same unrolled RNN and gradient decomposition as for the exploding gradients problem; gradients vanish when the norms of the Jacobians ∂h(j)/∂h(j−1) are smaller than 1]
Regularization term

Ω = Σ_k Ω_k = Σ_k ( ‖ (∂C/∂h_{k+1}) (∂h_{k+1}/∂h_k) ‖ / ‖ ∂C/∂h_{k+1} ‖ − 1 )²
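Each term Ω_k penalizes the step Jacobian for changing the norm of the backpropagated gradient; a numeric sketch, where the example vectors and Jacobians are my illustrations:

```python
import numpy as np

def omega_k(g, J):
    """One term Omega_k of the regularizer: g is the backpropagated cost
    gradient dC/dh_{k+1}, J is the step Jacobian dh_{k+1}/dh_k."""
    return (np.linalg.norm(g @ J) / np.linalg.norm(g) - 1.0) ** 2

g = np.array([1.0, 0.0])
J_keep = np.eye(2)          # norm-preserving step: zero penalty
J_shrink = 0.5 * np.eye(2)  # step that halves the gradient: penalized
```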
Temporal Order

Important symbols: A, B    Distractor symbols: c, d, e, f

de..fAef ccefc..e fAef..e ef..c → AA   (segment lengths T/10, 4T/10, T/10, 4T/10)
edefcAccfef..ceceBedef..fedef → AB
feBefccde..efddcAfccee..cedcd → BA
Bfffede..cffecdBedfd..cedfedc → BB
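A generator for this task can be sketched as follows; the exact position windows for the two informative symbols are my own reading of the segment lengths, so treat them as an assumption:

```python
import random

def temporal_order_sequence(T, rng=None):
    """One sequence for the temporal order task: distractors from {c,d,e,f}
    plus two informative symbols from {A,B}; the class is their ordered pair.
    The position windows used here are an assumption."""
    rng = rng or random.Random(0)
    seq = [rng.choice("cdef") for _ in range(T)]
    i = rng.randrange(T // 10, 2 * T // 10)       # early informative symbol
    j = rng.randrange(5 * T // 10, 6 * T // 10)   # late informative symbol
    a, b = rng.choice("AB"), rng.choice("AB")
    seq[i], seq[j] = a, b
    return "".join(seq), a + b   # label is one of AA, AB, BA, BB

s, label = temporal_order_sequence(100)
```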
Results - Temporal order task

[Figures: rate of success vs. sequence length (50 to 250) for MSGD, MSGD-C and MSGD-CR, shown for sigmoid, basic tanh and smart tanh units]
Results - Natural tasks
How to construct Deep Recurrent Neural Networks
Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Yoshua Bengio
International Conference on Learning Representations 2014
Gist of this work
[Figure: deep recurrent architectures: DT-RNN and DOT-RNN (deep transition, with and without a deep output y_t), an operator view with state z_t, stacked RNNs, and DOT(s)-RNN]
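The deep-transition idea above can be sketched with a transition that is itself a small MLP; this is a minimal illustration, with layer sizes, weight scales and the tanh nonlinearity chosen by me for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_mid = 3, 4, 5   # illustrative sizes

# DT-RNN: the hidden-to-hidden transition is itself a small MLP.
W1 = rng.normal(scale=0.1, size=(n_mid, n_hid))
U = rng.normal(scale=0.1, size=(n_mid, n_in))
W2 = rng.normal(scale=0.1, size=(n_hid, n_mid))

def dt_step(h_prev, x):
    z = np.tanh(W1 @ h_prev + U @ x)   # intermediate transition layer
    return np.tanh(W2 @ z)             # next hidden state

h = np.zeros(n_hid)
for x in rng.normal(size=(7, n_in)):   # unroll over 7 dummy inputs
    h = dt_step(h, x)
```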
Overview of contributions

- The efficiency of deep feedforward models with piecewise linear activation functions
- The relationship between several optimization techniques for deep learning, with a focus on understanding natural gradient
- The importance of saddle points for optimization algorithms when applied to deep learning
- Training Echo State Networks to exhibit short term memory
- Training recurrent networks with gradient based methods to exhibit short term memory
- How one can construct deep recurrent networks

Thank you!