
On Recurrent and Deep Neural Networks

Razvan Pascanu Advisor: Yoshua Bengio

PhD Defence, Université de Montréal, LISA lab, September 2014

Motivation

“A computer once beat me at chess, but it was no match for me at kick boxing”

— Emo Phillips

Studying the mechanism behind learning provides a meta-solution for solving tasks.

Supervised Learning

- $f : \Theta \times D \to T$, $f \in \mathcal{F}$
- $f_\theta(x) = f(\theta, x)$
- $f^\star = \arg\min_{\theta \in \Theta} \mathbb{E}_{x,t \sim \pi}\left[ d(f_\theta(x), t) \right]$ (see the sketch below)
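A minimal sketch of this objective (my illustration, not from the slides): the expectation over $(x, t) \sim \pi$ is replaced by an empirical average over samples, $d$ is the squared distance, $f_\theta$ is a linear model, and the arg min is approached by gradient descent.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # inputs x drawn from pi
true_theta = np.array([1.0, -2.0, 0.5])
T = X @ true_theta + 0.1 * rng.normal(size=100)    # noisy targets t

theta = np.zeros(3)
lr = 0.1
for _ in range(200):
    pred = X @ theta                               # f_theta(x)
    grad = 2 * X.T @ (pred - T) / len(X)           # gradient of the empirical risk
    theta -= lr * grad                             # gradient step toward f*

print(theta)                                       # close to true_theta
```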

Optimization for learning

[Figure: a gradient-based update on the loss surface, moving from the current parameters θ[k] to θ[k+1]]

Neural networks

[Figure: feedforward network with an input layer, first and second hidden layers, a last hidden layer with a bias = 1 unit, and output neurons]
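A minimal sketch of the forward pass through such a network (my illustration; the layer sizes and the tanh nonlinearity are arbitrary choices, not taken from the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 8, 3]   # input layer, three hidden layers, output neurons
weights = [rng.normal(scale=0.1, size=(m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """Propagate an input through the layers: tanh on hidden layers, linear output."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.tanh(h @ W + b)               # hidden layer: affine map + nonlinearity
    return h @ weights[-1] + biases[-1]      # output neurons

print(forward(rng.normal(size=4)))
```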

Pascanu On Recurrent and Deep Neural Networks 5/ 38 Recurrent neural networks

Output neurons Output neurons

Last hidden layer bias = 1 bias = 1

Recurrent Layer Second hidden layer

First hidden layer

Input layer Input layer (b) Recurrent network (a) Feedforward network
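A minimal sketch of the recurrent layer's update (my illustration; the tanh transition and the sizes are assumptions): the same weights are reused at every time step, and the hidden state feeds back into itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_out = 3, 5, 2
U = rng.normal(scale=0.1, size=(n_in, n_hid))    # input -> recurrent layer
W = rng.normal(scale=0.1, size=(n_hid, n_hid))   # recurrent layer -> itself
V = rng.normal(scale=0.1, size=(n_hid, n_out))   # recurrent layer -> output
b, c = np.zeros(n_hid), np.zeros(n_out)

def run(xs):
    """Process a sequence of inputs, producing one output per time step."""
    h = np.zeros(n_hid)
    ys = []
    for x in xs:
        h = np.tanh(x @ U + h @ W + b)   # h(t) depends on x(t) and h(t-1)
        ys.append(h @ V + c)
    return np.array(ys)

print(run(rng.normal(size=(7, n_in))).shape)  # (7, 2)
```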

On the number of linear regions of Deep Neural Networks

Razvan Pascanu, Guido Montufar, Kyunghyun Cho and Yoshua Bengio

International Conference on Learning Representations 2014; submitted to the Conference on Neural Information Processing Systems 2014

Big picture

- $\mathrm{rect}(x) = \begin{cases} 0, & x < 0 \\ x, & x \geq 0 \end{cases}$

- Idea: a composition of piece-wise linear functions is a piece-wise linear function

- Approach: count the number of linear pieces for a deep versus a shallow model (see the sketch below)
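A minimal sketch of the counting approach in one input dimension (my illustration): evaluate a rectifier network on a dense grid and count maximal runs of constant on/off activation pattern; each run corresponds to one linear piece.

```python
import numpy as np

rng = np.random.default_rng(0)

def rect(x):
    return np.maximum(x, 0.0)

def count_pieces(layers, xs):
    """Count linear pieces of a 1-D rectifier network along the grid xs by
    counting maximal runs of constant on/off activation pattern."""
    h = xs[:, None]
    patterns = []
    for W, b in layers:
        pre = h @ W + b
        patterns.append(pre > 0)
        h = rect(pre)
    pattern = np.concatenate(patterns, axis=1)            # one on/off pattern per grid point
    changes = np.any(pattern[1:] != pattern[:-1], axis=1)
    return 1 + int(changes.sum())

xs = np.linspace(-10, 10, 200001)
width = 8
shallow = [(rng.normal(size=(1, 2 * width)), rng.normal(size=2 * width))]
deep = [(rng.normal(size=(1, width)), rng.normal(size=width)),
        (rng.normal(size=(width, width)), rng.normal(size=width))]
print(count_pieces(shallow, xs), count_pieces(deep, xs))  # piece counts for the two models
```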

Single Layer models

[Figure: lines L1, L2, L3, one per hidden unit, splitting the 2-D input space into regions R∅, R1, R2, R12, R13, R23, R123]

Zaslavsky's Theorem (1975): $n_{hid}$ hyperplanes split an $n_{inp}$-dimensional input space into at most $\sum_{s=0}^{n_{inp}} \binom{n_{hid}}{s}$ regions.
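A direct transcription of this bound (my sketch):

```python
from math import comb

def max_regions_single_layer(n_inp: int, n_hid: int) -> int:
    """Zaslavsky bound: maximum number of regions cut out by n_hid hyperplanes
    (one per hidden unit) in an n_inp-dimensional input space."""
    return sum(comb(n_hid, s) for s in range(n_inp + 1))

print(max_regions_single_layer(2, 3))   # 7: three lines split the plane into at most 7 regions
```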

Multi-Layer models: how would it work?

[Figure: a 2-D input space with axes x0 (from -4 to 4) and x1, with marked points P and regions S1-S4 identified by the first layer]

Multi-Layer models: how would it work?

[Figure: Input Space, First Layer Space and Second Layer Space; 1. fold along the vertical axis, 2. fold along the horizontal axis; the input regions S1-S4 are mapped onto each other, so what the next layer computes in the folded space is replicated in each of them]

Visualizing units

Revisiting Natural Gradient for Deep Networks

Razvan Pascanu and Yoshua Bengio

International Conference on Learning Representations 2014

Gist of this work

- Natural Gradient is a generalized Trust Region method¹ (see the toy sketch after this list)
- Hessian-Free Optimization is Natural Gradient
- Using the Empirical Fisher (as in TONGA) does not correspond to the same trust-region method as natural gradient
- Natural Gradient can be accelerated by adding second-order information about the error
- Natural Gradient can make use of unlabeled data
- Natural Gradient is more robust to changes in the order of the training set

¹ for particular pairs of activation functions and error functions
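Not on the slide, but as a toy illustration of the step these claims revolve around, $\Delta\theta = -\eta F^{-1} \nabla \mathcal{L}$: for a Bernoulli model $p(t=1) = \sigma(\theta)$ the Fisher information is the scalar $p(1-p)$, so the natural-gradient step is just the plain gradient rescaled by it (a small damping term keeps the step finite when the Fisher is tiny).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

targets = np.array([1, 1, 1, 0, 1])            # toy data with empirical mean 0.8
theta, lr, damping = -3.0, 1.0, 0.1
for _ in range(10):
    p = sigmoid(theta)
    grad = p - targets.mean()                  # gradient of the mean negative log-likelihood
    fisher = p * (1.0 - p)                     # Fisher information of the Bernoulli model
    theta -= lr * grad / (fisher + damping)    # damped natural-gradient step
print(sigmoid(theta))                          # close to the empirical mean 0.8
```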

On the saddle point problem for non-convex optimization

Yann Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli and Yoshua Bengio

Submitted to the Conference on Neural Information Processing Systems 2014

Existing evidence

- Statistical physics results (on random Gaussian fields)

[Figure: for random Gaussian fields, error as a function of the index of critical points, and the corresponding eigenvalue distribution]

Existing evidence

- Empirical evidence

[Figure: training error (%) versus the index of critical point α, and the eigenvalue distribution p(λ) at critical points with training errors of 0.32%, 23.49% and 28.23%]

Problem

- Saddle points are attractors of second-order dynamics (see the toy example below)

[Figure: trajectories of Newton's method, SFN and SGD near a saddle point; the Newton trajectory is pulled into the saddle]
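A toy example of this behaviour (mine, not the figure's exact setup): on the quadratic saddle $f(x, y) = x^2 - y^2$, a full Newton step jumps straight onto the saddle point at the origin, whereas plain gradient descent drifts away from it along the negative-curvature direction $y$.

```python
import numpy as np

# Saddle f(x, y) = x^2 - y^2: gradient (2x, -2y), Hessian diag(2, -2)
grad = lambda p: np.array([2 * p[0], -2 * p[1]])
H = np.diag([2.0, -2.0])

start = np.array([0.6, 0.1])
p_newton = start - np.linalg.solve(H, grad(start))   # one Newton step lands exactly on (0, 0)

p_sgd = start.copy()
for _ in range(20):
    p_sgd = p_sgd - 0.1 * grad(p_sgd)                # gradient steps: x shrinks, |y| grows

print(p_newton)   # [0. 0.]          -> attracted to the saddle
print(p_sgd)      # ~[0.007, 3.83]   -> escaping along the negative-curvature direction
```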

Solution

$$\arg\min_{\Delta\theta} \; \mathcal{T}_1\{\mathcal{L}\}(\theta, \Delta\theta) \quad \text{s.t.} \quad \left\| \mathcal{T}_2\{\mathcal{L}\}(\theta, \Delta\theta) - \mathcal{T}_1\{\mathcal{L}\}(\theta, \Delta\theta) \right\| \leq \epsilon$$

Using Lagrange multipliers:

$$\Delta\theta = -\,|\mathbf{H}|^{-1} \, \frac{\partial \mathcal{L}(\theta)}{\partial \theta}$$
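A minimal sketch of this update (my illustration): $|\mathbf{H}|$ is built by taking the absolute values of the Hessian's eigenvalues, so the step rescales the gradient like Newton's method but still descends along negative-curvature directions instead of being attracted to the saddle.

```python
import numpy as np

def saddle_free_newton_step(grad, hessian, damping=1e-3):
    """delta_theta = -|H|^{-1} grad, where |H| replaces the Hessian's eigenvalues
    by their absolute values (damping guards against near-zero eigenvalues)."""
    eigvals, eigvecs = np.linalg.eigh(hessian)
    abs_h = eigvecs @ np.diag(np.abs(eigvals) + damping) @ eigvecs.T
    return -np.linalg.solve(abs_h, grad)

# On the saddle f(x, y) = x^2 - y^2 at (0.6, 0.1): the step pulls x towards its
# minimum at 0 and pushes y away from the saddle, decreasing f; plain Newton
# would instead jump onto the saddle point.
H = np.diag([2.0, -2.0])
g = np.array([2 * 0.6, -2 * 0.1])
print(saddle_free_newton_step(g, H))   # approximately [-0.6, 0.1]
```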

Experiments

[Figure: training error (%) of MSGD, Damped Newton and SFN as a function of the number of hidden units and of the number of epochs, including a CIFAR-10 experiment]

Experiments

[Figure: a deep network trained with MSGD and with SFN; error curves over training]

A Neurodynamical Model for Working Memory

Razvan Pascanu, Herbert Jaeger

Neural Networks (journal), 2011

Gist of this work

[Figure: an Echo State Network with input units (u), a reservoir (x), output units (y) and working-memory (WM) units (m)]

On the difficulty of training recurrent neural networks

Razvan Pascanu, Tomas Mikolov, Yoshua Bengio

International Conference on Machine Learning 2013

The exploding gradients problem

[Figure: an RNN unrolled in time; each hidden state h(t) receives x(t) and h(t-1), each step has a cost C(t), and the gradient factors ∂C(t)/∂h(t) and ∂h(j)/∂h(j-1) are chained backwards through the states]

$$\frac{\partial C}{\partial W} = \sum_t \frac{\partial C(t)}{\partial W} = \sum_t \sum_{k=0}^{t} \frac{\partial C(t)}{\partial h(t)} \frac{\partial h(t)}{\partial h(t-k)} \frac{\partial h(t-k)}{\partial W}, \qquad \frac{\partial h(t)}{\partial h(t-k)} = \prod_{j=t-k+1}^{t} \frac{\partial h(j)}{\partial h(j-1)}$$
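A toy illustration of how this product behaves (mine; it uses a linear recurrence $h(t) = W h(t-1)$, whose one-step Jacobian is simply $W$): the norm of the $k$-step product scales like $\sigma^k$ where $\sigma$ is the largest singular value of $W$, exploding for $\sigma > 1$ and vanishing for $\sigma < 1$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 20
base, _ = np.linalg.qr(rng.normal(size=(n, n)))   # orthogonal matrix: all singular values 1

for scale, regime in [(1.2, "exploding"), (0.8, "vanishing")]:
    W = scale * base                              # largest singular value = scale
    J = np.eye(n)
    for _ in range(50):
        J = W @ J                                 # product of 50 one-step Jacobians
    print(regime, np.linalg.norm(J, 2))           # ~ scale**50: ~9100 vs ~1.4e-5
```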

Possible geometric interpretation and norm clipping

[Figure ("classical view"): the error $(h(50) - 0.7)^2$ for $h(t) = w\,\sigma(h(t-1)) + b$ with $h(0) = 0.5$, plotted as a surface over the parameters θ]
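A minimal sketch of the norm-clipping rule (my code): when the gradient norm exceeds a threshold, rescale it to that threshold, keeping the direction.

```python
import numpy as np

def clip_gradient_norm(grad, threshold):
    """Rescale the gradient so that its norm never exceeds the threshold."""
    norm = np.linalg.norm(grad)
    if norm > threshold:
        grad = grad * (threshold / norm)
    return grad

g = np.array([30.0, -40.0])            # an exploded gradient with norm 50
print(clip_gradient_norm(g, 1.0))      # [0.6, -0.8]: rescaled to norm 1, direction preserved
```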

The vanishing gradients problem

[Same unrolled network and gradient factorization as on the previous slide; the concern here is the product $\prod_{j=t-k+1}^{t} \frac{\partial h(j)}{\partial h(j-1)}$ shrinking exponentially as $k$ grows]

Regularization term

$$\Omega = \sum_k \Omega_k = \sum_k \left( \frac{\left\| \frac{\partial C}{\partial h_{k+1}} \frac{\partial h_{k+1}}{\partial h_k} \right\|}{\left\| \frac{\partial C}{\partial h_{k+1}} \right\|} - 1 \right)^2$$

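A minimal sketch of a single term $\Omega_k$ (my illustration, assuming a tanh transition $h_{k+1} = \tanh(W h_k + \dots)$, whose Jacobian is $\mathrm{diag}(1 - h_{k+1}^2)\,W$): the term penalizes any change in the norm of the cost gradient as it is propagated one step back.

```python
import numpy as np

def omega_k(dC_dh_next, h_next, W):
    """(||dC/dh_{k+1} * dh_{k+1}/dh_k|| / ||dC/dh_{k+1}|| - 1)^2 for a tanh RNN."""
    J = np.diag(1.0 - h_next ** 2) @ W             # Jacobian dh_{k+1}/dh_k
    propagated = dC_dh_next @ J                    # cost gradient propagated one step back
    ratio = np.linalg.norm(propagated) / np.linalg.norm(dC_dh_next)
    return (ratio - 1.0) ** 2

rng = np.random.default_rng(0)
n = 5
print(omega_k(rng.normal(size=n), np.tanh(rng.normal(size=n)), rng.normal(size=(n, n))))
```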
Temporal Order

Important symbols: A, B. Distractor symbols: c, d, e, f. Each sequence contains two important symbols among the distractors, and the target is their order (the braces on the slide mark segment lengths of $\frac{1}{10}T$ and $\frac{4}{10}T$):

de..fAef ccefc..e fAef..e ef..c → AA
edefcAccfef..ceceBedef..fedef → AB
feBefccde..efddcAfccee..cedcd → BA
Bfffede..cffecdBedfd..cedfedc → BB
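A minimal sketch of a generator for this kind of task (my illustration; the exact segment positions on the slide may differ): place one important symbol in each of two short segments, fill everything else with distractors, and label the sequence with the order of the two important symbols.

```python
import numpy as np

rng = np.random.default_rng(0)
DISTRACTORS = list("cdef")

def temporal_order_example(T=50):
    """Generate one sequence of length T with two symbols from {A, B} placed in
    the segments [0, T//10) and [T//2, T//2 + T//10); the label is their order."""
    seq = rng.choice(DISTRACTORS, size=T).tolist()
    first, second = rng.choice(list("AB"), size=2)
    seq[rng.integers(0, T // 10)] = first
    seq[rng.integers(T // 2, T // 2 + T // 10)] = second
    return "".join(seq), first + second

print(temporal_order_example(50))   # (sequence, label), e.g. label 'AB'
```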

Results - Temporal order task

[Figure (sigmoid units): rate of success versus sequence length (50 to 250) for MSGD, MSGD-C (with clipping) and MSGD-CR (with clipping and the regularization term)]

Results - Temporal order task

[Figure (basic tanh units): rate of success versus sequence length for MSGD, MSGD-C and MSGD-CR]

Results - Temporal order task

[Figure (smart tanh units): rate of success versus sequence length for MSGD, MSGD-C and MSGD-CR]

Results - Natural tasks

How to construct Deep Recurrent Neural Networks

Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Yoshua Bengio

International Conference on Learning Representations 2014

Gist of this work

[Figure: ways of adding depth to a recurrent network: DT-RNN (deep transition from h(t-1) and x(t) to h(t)), DOT-RNN (deep transition and deep output), the operator view, stacked RNNs, and DOT(s)-RNN (deep output and transition with shortcut connections)]
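A minimal sketch of one of these constructions (my illustration; weights and sizes are arbitrary): a DT-RNN replaces the usual single-layer transition with a small MLP between consecutive hidden states.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid, n_inter = 3, 5, 8
U = rng.normal(scale=0.1, size=(n_in, n_inter))      # input -> intermediate layer
W1 = rng.normal(scale=0.1, size=(n_hid, n_inter))    # previous state -> intermediate layer
W2 = rng.normal(scale=0.1, size=(n_inter, n_hid))    # intermediate layer -> new state

def deep_transition_step(h_prev, x):
    """One DT-RNN step: h(t-1) and x(t) pass through an extra hidden layer
    before producing h(t), instead of a single affine map + nonlinearity."""
    z = np.tanh(x @ U + h_prev @ W1)   # intermediate layer of the deep transition
    return np.tanh(z @ W2)             # new hidden state h(t)

h = np.zeros(n_hid)
for x in rng.normal(size=(4, n_in)):
    h = deep_transition_step(h, x)
print(h)
```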

Overview of contributions

- The efficiency of deep feedforward models with piece-wise linear activation functions
- The relationship between a few optimization techniques for deep learning, with a focus on understanding natural gradient
- The importance of saddle points for optimization algorithms when applied to deep learning
- Training Echo-State Networks to exhibit short-term memory
- Training Recurrent Networks with gradient-based methods to exhibit short-term memory
- How one can construct deep Recurrent Networks

Thank you!
