Introduction Feedforward Neural Networks

Deep learning

Jérémy Fix

CentraleSupélec, jeremy.fi[email protected]

2016

1 / 94 Introduction Feedforward Neural Networks

Introduction and historical perspective

[Schmidhuber, 2015] Deep Learning in Neural Networks: An Overview, Jürgen Schmidhuber, Neural Networks (61), pages 85-117

2 / 94 Introduction Feedforward Neural Networks

Historical perspective on neural networks

• Perceptron (Rosenblatt, 1962): linear classifier
• AdaLinE (Widrow, Hoff, 1962): linear regressor
• Minsky/Papert (1969): first winter
• Convolutional Neural Networks (1980, 1998): great!
• Multilayer networks and backprop (Rumelhart, 1986): great! but hard to train, and the SVMs come in the 1990s ...: second winter
• 2006: pretraining!
• 2012: AlexNet on ImageNet (10% better on test than the 2nd)
• Now: lots of state-of-the-art neural networks

3 / 94 Introduction Feedforward Neural Networks

Some reasons of the current success

• GPUs (speed of processing) / Data (regularizing)
• theoretical understanding of the difficulty of training deep networks

Which libs?

• Torch (Lua) / PyTorch, Caffe (Python/C++)
• Theano/Lasagne (Python, RIP 2017), Tensorflow (Google, Python, C++), Keras (wrapper over Tensorflow and Theano), CNTK, MXNET, Chainer, DyNet, ...

4 / 94 Introduction Feedforward Neural Networks

What to read ? www.deeplearningbook.org Goodfellow, Bengio, Courville(2016)

Who to follow ?

• N-1: LeCun, Bengio, Hinton, Schmidhuber
• N: Goodfellow, Dauphin, Graves, Sutskever, Karpathy, Krizhevsky, ...
Obviously, the community is much larger.

Which conferences ? ICML, NIPS, ICLR, ..

https://github.com/terryum/awesome-deep-learning-papers 5 / 94 Introduction Feedforward Neural Networks

What is a neural network?
A neural network is a directed graph:
• edges: weighted connections
• nodes: computational units
• no cycle: feedforward neural networks (FNN)
• with cycles: recurrent neural networks (RNN)

The tree hides the jungle: what is a convolutional neural network with a softmax output, ReLU hidden activations, batch normalization layers, trained with RMSprop with Nesterov momentum, regularized with dropout?

6 / 94 Introduction Feedforward Neural Networks

Feedforward Neural Networks (FNN)

[Figure: a feedforward network with an input layer, a hidden layer and an output layer, plus a skip-layer connection]

7 / 94 Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962)

• classification: given (x_i, y_i), y_i ∈ {−1, 1}
• SAR (Sensory, Associative, Result) architecture, basis functions φ_j(x) with φ_0(x) = 1
• algorithm
• geometrical interpretation

[Figure: the SAR architecture: sensory inputs x_j, associative units a_j = φ_j(x), and result units r_i computed from the weighted sums Σ_j w_ij a_j]

8 / 94 Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962)

Classifier

Given feature functions φj , with φ0(x) = 1, the perceptron classifies x as :

y = g(w^T φ(x))    (1)

g(x) = −1 if x < 0, +1 if x ≥ 0    (2)

with φ(x) ∈ R^(n_a+1), φ(x) = (1, φ_1(x), φ_2(x), ...)^T

9 / 94 Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962)

Online Training algorithm

Given (x_i, y_i), y_i ∈ {−1, 1}, the perceptron learning rule operates online:

w ← w              if the input is correctly classified
w ← w + φ(x_i)     if the input is incorrectly classified as −1     (3)
w ← w − φ(x_i)     if the input is incorrectly classified as +1
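A minimal NumPy sketch of this online rule (the feature map, toy dataset and number of epochs are illustrative assumptions, not from the slides):

import numpy as np

def phi(x):
    # feature map with phi_0(x) = 1 (bias feature)
    return np.concatenate(([1.0], x))

def perceptron_train(X, y, n_epochs=10):
    # X: (N, d) inputs, y: labels in {-1, +1}
    w = np.zeros(X.shape[1] + 1)
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            y_hat = 1.0 if w @ phi(xi) >= 0 else -1.0   # g(w^T phi(x_i))
            if y_hat != yi:
                w = w + yi * phi(xi)                    # the two misclassified cases in one update
    return w

# illustrative usage on a linearly separable problem (logical AND)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., -1., -1., 1.])
w = perceptron_train(X, y)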

10 / 94 Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962)

Geometrical interpretation of y = g(w^T φ(x)): cases when a sample is correctly classified (y_i = +1 and y_i = −1).

[Figure: for a correctly classified sample, φ(x_i) lies on the correct side of the hyperplane orthogonal to w]

11 / 94 Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962)

Geometrical interpretation of y = g(w^T φ(x)): cases when a sample is misclassified (y_i = +1 and y_i = −1).

[Figure: the update rotates w toward w + φ(x_i) for a missed positive, and toward w − φ(x_i) for a missed negative]

12 / 94 Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962) The cone of feasible solutions

Consider two samples x_1, x_2 with y_1 = +1, y_2 = −1.

[Figure: the feasible weight vectors form a cone delimited by the hyperplanes v^T φ(x_1) = 0 and v^T φ(x_2) = 0]

13 / 94 Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962)

Online Training algorithm

Given (x_i, y_i), y_i ∈ {−1, 1}, the perceptron learning rule operates online:

w ← w              if the input is correctly classified
w ← w + φ(x_i)     if the input is incorrectly classified as −1     (4)
w ← w − φ(x_i)     if the input is incorrectly classified as +1

w ← w              if g(w^T φ(x_i)) = y_i
w ← w + φ(x_i)     if g(w^T φ(x_i)) = −1 and y_i = +1     (5)
w ← w − φ(x_i)     if g(w^T φ(x_i)) = +1 and y_i = −1

14 / 94 Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962) Online Training algorithm

Given (x_i, y_i), y_i ∈ {−1, 1}, the perceptron learning rule operates online:

w ← w              if g(w^T φ(x_i)) = y_i
w ← w + φ(x_i)     if g(w^T φ(x_i)) = −1 and y_i = +1
w ← w − φ(x_i)     if g(w^T φ(x_i)) = +1 and y_i = −1

w ← w              if g(w^T φ(x_i)) = y_i
w ← w + y_i φ(x_i) if g(w^T φ(x_i)) ≠ y_i

w ← w + (1/2)(y_i − ŷ_i) φ(x_i), with ŷ_i = g(w^T φ(x_i))
15 / 94 Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962)

Definition (Linear separability)
A binary classification problem (x_i, y_i) ∈ R^d × {−1, 1}, i ∈ [1..N], is said to be linearly separable if there exists w ∈ R^d such that:
∀i, sign(w^T x_i) = y_i
with sign(x) = −1 for x < 0 and sign(x) = +1 for x ≥ 0.

Theorem (Perceptron convergence theorem)
A classification problem (x_i, y_i) ∈ R^d × {−1, 1}, i ∈ [1..N], is linearly separable if and only if the perceptron learning rule converges to an optimal solution in a finite number of steps.

⇐: easy; ⇒: we upper/lower bound ||w(t)||²₂
16 / 94 Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962)

• w_t = w_0 + Σ_{i ∈ I(t)} y_i φ(x_i), with I(t) the set of misclassified samples
• it minimizes a loss: J(w) = (1/N) Σ_i max(0, −y_i w^T φ(x_i))
• the solution can be written as w_t = w_0 + (1/2) Σ_i (y_i − ŷ_i) φ(x_i); (y_i − ŷ_i) is the prediction error

17 / 94 Introduction Feedforward Neural Networks

Kernel Perceptron
Any linear predictor involving only scalar products can be kernelized (kernel trick, cf. SVM). Given w(t) = w_0 + Σ_{i ∈ I} y_i x_i:

<w, x> = <w_0, x> + Σ_{i ∈ I} y_i <x_i, x>
⇒ k(w, x) = k(w_0, x) + Σ_{i ∈ I} y_i k(x_i, x)

[Figure: a kernel perceptron on a 2D toy dataset]

18 / 94 Introduction Feedforward Neural Networks

Adaptive Linear Elements (Widrow, Hoff, 1962)

Linear regression, Analytically

• Given (x_i, y_i), y_i ∈ R
• minimize J(w) = (1/N) Σ_i ||y_i − w^T x_i||²
• Analytically, ∇_w J(w) = 0 ⇒ X X^T w = X y
• X X^T non singular: w = (X X^T)^{−1} X y
• X X^T singular (e.g. points along a line in 2D): infinitely many solutions
• regularized least squares: min_w G(w) = J(w) + α w^T w
• ∇_w G(w) = 0 ⇒ (X X^T + α I) w = X y
• as soon as α > 0, (X X^T + α I) is not singular
Needs to compute X X^T, i.e. over the whole training set...
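A NumPy sketch of these analytic solutions, with one sample per column of X as in the slides (the toy data and the value of α are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
N, d = 30, 2
X = rng.normal(size=(d, N))              # one sample per column
w_true = np.array([3.0, 2.0])
y = w_true @ X + 0.01 * rng.normal(size=N)

# ordinary least squares: solve (X X^T) w = X y, assuming X X^T is non singular
w_ols = np.linalg.solve(X @ X.T, X @ y)

# regularized (ridge) least squares: (X X^T + alpha I) w = X y, alpha > 0
alpha = 0.1
w_ridge = np.linalg.solve(X @ X.T + alpha * np.eye(d), X @ y)
print(w_ols, w_ridge)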

19 / 94 Introduction Feedforward Neural Networks

Adaptive Linear Elements (Widrow, Hoff, 1962)

Linear regression with stochastic gradient descent

• start at w_0
• take each sample one after the other (online): x_i, y_i
• denote ŷ_i = w^T x_i the prediction
• update w_{t+1} = w_t − ε ∇_w J(w_t) = w_t + ε (y_i − ŷ_i) x_i
• delta rule: δ = (y_i − ŷ_i) is the prediction error

w_{t+1} = w_t + ε δ x_i

• note the similarity with the perceptron learning rule

The samples xi are supposed to be “extended” with one dimension set to 1.

20 / 94 Introduction Feedforward Neural Networks

Batch/Minibatch/Stochastic gradient descent

J(w, x, y) = (1/N) Σ_{i=1}^N L(w, x_i, y_i),   e.g. L(w, x_i, y_i) = ||y_i − w^T x_i||²

Batch gradient descent

• compute the gradient of the loss J(w) over the whole training set
• perform one step in the direction of −∇_w J(w, x, y)

w_{t+1} = w_t − ε_t ∇_w J(w, x, y)

• ε: learning rate

21 / 94 Introduction Feedforward Neural Networks

Batch/Minibatch/Stochastic gradient descent

J(w, x, y) = (1/N) Σ_{i=1}^N L(w, x_i, y_i),   e.g. L(w, x_i, y_i) = ||y_i − w^T x_i||²

Stochastic gradient descent (SGD)

• one sample at a time, noisy estimate of ∇_w J
• perform one step in the direction of −∇_w L(w, x_i, y_i)

w_{t+1} = w_t − ε_t ∇_w L(w, x_i, y_i)

• faster to converge than batch gradient descent

22 / 94 Introduction Feedforward Neural Networks

Batch/Minibatch/Stochastic gradient descent

J(w, x, y) = (1/N) Σ_{i=1}^N L(w, x_i, y_i),   e.g. L(w, x_i, y_i) = ||y_i − w^T x_i||²

Minibatch

• noisy estimate of the true gradient with M samples (e.g. M = 64, 128); M is the minibatch size
• randomize the minibatches J, |J| = M, one set at a time

w_{t+1} = w_t − ε_t (1/M) Σ_{j ∈ J} ∇_w L(w, x_j, y_j)

• smoother estimate than SGD
• great for parallel architectures (GPU)
23 / 94 Introduction Feedforward Neural Networks
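A sketch of a minibatch SGD loop in NumPy, here on the linear/L2 case of the slides (the batch size, learning rate and toy data are illustrative assumptions):

import numpy as np

def grad_L(w, xb, yb):
    # gradient of the mean L2 loss ||y - w^T x||^2 over a minibatch
    err = yb - xb @ w
    return -2.0 * xb.T @ err / len(yb)

rng = np.random.default_rng(0)
N, d, M, eps = 200, 3, 32, 0.05
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=N)

w = np.zeros(d)
for epoch in range(50):
    perm = rng.permutation(N)                 # randomize the minibatches
    for start in range(0, N, M):
        idx = perm[start:start + M]
        w -= eps * grad_L(w, X[idx], y[idx])  # w <- w - eps * minibatch gradient
print(w)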

Does it make sense to use gradient descent ?

Convex function
A function f: R^n → R is convex:
⇔ ∀ x_1, x_2 ∈ R^n, ∀ t ∈ [0, 1], f(t x_1 + (1 − t) x_2) ≤ t f(x_1) + (1 − t) f(x_2)
⇔ (with f twice differentiable) ∀ x ∈ R^n, H = ∇² f(x) is positive semidefinite, i.e. ∀ x ∈ R^n, x^T H x ≥ 0

For a convex function f, all local minima are global minima. Our losses are lower bounded, so these minima exist. Under mild conditions, gradient descent and stochastic gradient descent converge, typically with Σ_t ε_t = ∞ and Σ_t ε_t² < ∞ (cf. lectures on convex optimization).

24 / 94 Introduction Feedforward Neural Networks

Does it make sense to use gradient descent ?

Linear regression with L2 loss is convex. Indeed,
• Given x_i, y_i, L(w) = (1/2)(w^T x_i − y_i)² is convex:
  ∇_w L = (w^T x_i − y_i) x_i
  ∇²_w L = x_i x_i^T
  ∀ x ∈ R^n, x^T x_i x_i^T x = (x_i^T x)² ≥ 0
• a non-negative weighted sum of convex functions is convex

25 / 94 Introduction Feedforward Neural Networks

Linear regression, synthesis

Linear regression

• samples (x_i, y_i), y_i ∈ R
• extend x_i by adding a constant dimension equal to 1, which accounts for the bias
• linear model ŷ_i = w^T x_i
• L2 loss L(ŷ, y) = (1/2)||ŷ − y||²
• by gradient descent:
  ∇_w L(w, x_i, y_i) = (∂L/∂ŷ)(∂ŷ/∂w) = −(y_i − ŷ_i) x_i

26 / 94 Introduction Feedforward Neural Networks

Linear classification, synthesis
Linear binary classification (logistic regression)

• samples (x_i, y_i), y_i ∈ {0, 1}
• extend x_i by adding a constant dimension equal to 1, which accounts for the bias
• linear model w^T x
• sigmoid transfer function ŷ_i = σ(w^T x_i), with
  σ(x) = 1/(1 + exp(−x)), σ(x) ∈ [0, 1]
  (d/dx) σ(x) = σ(x)(1 − σ(x))
• cross entropy loss L(ŷ, y) = −y log ŷ − (1 − y) log(1 − ŷ)
• by gradient descent:
  ∇_w L(w, x_i, y_i) = (∂L/∂ŷ)(∂ŷ/∂w) = −(y_i − ŷ_i) x_i
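A sketch of one SGD step for logistic regression; note that the gradient has the same (y_i − ŷ_i) x_i form as in linear regression (the data and learning rate are illustrative assumptions):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sgd_step_logreg(w, xi, yi, eps=0.1):
    # xi is assumed already extended with a constant 1 for the bias
    y_hat = sigmoid(w @ xi)          # prediction in [0, 1]
    grad = -(yi - y_hat) * xi        # gradient of the cross entropy loss
    return w - eps * grad

# illustrative usage
w = np.zeros(3)
xi = np.array([1.0, 0.5, -1.2])      # [1, x_1, x_2]
w = sgd_step_logreg(w, xi, 1.0)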

27 / 94 Introduction Feedforward Neural Networks

Linear classification, synthesis

Logistic regression is convex. Indeed,
• Given x_i, y_i = 1:
  L_1(w) = −log(σ(w^T x_i)) = log(1 + exp(−w^T x_i))
  ∇_w L_1 = −(1 − σ(w^T x_i)) x_i
  ∇²_w L_1 = σ(w^T x_i)(1 − σ(w^T x_i)) x_i x_i^T, with σ(w^T x_i)(1 − σ(w^T x_i)) > 0
• Given x_i, y_i = 0: L_2(w) = −log(1 − σ(w^T x_i))
  ∇_w L_2 = σ(w^T x_i) x_i
  ∇²_w L_2 = σ(w^T x_i)(1 − σ(w^T x_i)) x_i x_i^T > 0
• a non-negative weighted sum of convex functions is convex

28 / 94 Introduction Feedforward Neural Networks

Why L2 loss for linear classification with SGD is bad Compute the gradient to see why...

• Take the L2 loss L(ŷ, y) = (1/2)||ŷ − y||²
• Take the "linear" model ŷ_i = σ(w^T x_i)
• Check that (d/dx) σ(x) = σ(x)(1 − σ(x))
• Compute the gradient wrt w:

∇_w L(w, x_i, y_i) = (∂L/∂ŷ)(∂ŷ/∂w) = −(y_i − ŷ_i) σ(w^T x_i)(1 − σ(w^T x_i)) x_i

• If x_i is strongly misclassified (e.g. y_i = 1, w^T x_i very negative), then σ(w^T x_i)(1 − σ(w^T x_i)) ≈ 0, i.e. ∇_w L(w, x_i, y_i) ≈ 0 ⇒ the step size is very small while the sample is misclassified
• With a cross entropy loss, ∇_w L(w, x_i, y_i) is proportional to the error

29 / 94 Introduction Feedforward Neural Networks

Linear classification, synthesis Linear multiclass classification

• samples (x_i, y_i), labels y_i ∈ [|0, k−1|]
• extend x_i by adding a constant dimension equal to 1, which accounts for the bias
• linear models, one per class: w_j^T x
• softmax transfer function: P(y = j | x) = ŷ_j = exp(w_j^T x) / Σ_k exp(w_k^T x)
• generalization of the sigmoid to a vectorial output
• cross entropy loss L(ŷ, y) = −log ŷ_y
• by gradient descent:
  ∇_{w_j} L(w, x, y) = Σ_k (∂L/∂ŷ_k)(∂ŷ_k/∂w_j) = −(δ_{j,y} − ŷ_j) x
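A sketch of the softmax probabilities and of the gradient of the cross-entropy loss for one sample, matching the (δ_{j,y} − ŷ_j) x expression above (shapes and data are illustrative assumptions):

import numpy as np

def softmax(a):
    a = a - a.max()                      # numerical stability
    e = np.exp(a)
    return e / e.sum()

def grad_multiclass(W, x, y):
    # W: (k, d) one weight vector per class, x: (d,) extended input, y: integer label
    y_hat = softmax(W @ x)               # P(y = j | x)
    delta = np.zeros_like(y_hat)
    delta[y] = 1.0
    return -np.outer(delta - y_hat, x)   # dL/dW, row j is -(delta_{j,y} - y_hat_j) x

# illustrative usage
W = np.zeros((4, 3))
g = grad_multiclass(W, np.array([1.0, 0.2, -0.3]), y=2)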

30 / 94 Introduction Feedforward Neural Networks

Perceptron and linear separability

The perceptron performs linear separation in a predefined, fixed feature space.
The XOR: xor(x_1, x_2) = x_1 x̄_2 + x̄_1 x_2

[Figure: XOR is not linearly separable in the (x_1, x_2) space, but becomes linearly separable with an additional feature such as x_1 x_2]

Can we learn the φj (x) ???

31 / 94 Introduction Feedforward Neural Networks

Radial basis functions (RBF), [Broomhead, 1988]

• RBF kernel: φ_0(x) = 1, φ_j(x) = exp(−||x − μ_j||² / (2 σ_j²))
• for regression (L2 loss) or classification (cross entropy loss)
• e.g. for regression:
  ŷ(x) = w^T φ(x)
  L(w, x_i, y_i) = ||y_i − w^T φ(x_i)||²
• What about the centers and variances? [Schwenker, 2001]
  • place them uniformly, randomly, or by vector quantization (k-means, GNG [Fritzke, 1994])
  • two phases: fix the centers/variances, fit the weights
  • three phases: fix the centers/variances, fit the weights, then fit everything (∇_μ L, ∇_σ L, ∇_w L)
32 / 94 Introduction Feedforward Neural Networks

Radial basis functions(RBF)

RBF are universal approximators [Park, Sandberg (1991)]
Denote S the family of functions based on RBF in R^d:
S = { g: R^d → R, g(x) = Σ_{i=1}^N w_i φ_i(x), w ∈ R^N }
Then S is dense in L^p(R^d) for every p ∈ [1, ∞).
Actually, the theorem applies to a larger class of functions φ_i.

33 / 94 Introduction Feedforward Neural Networks

Feedforward neural networks (or MLP [Rumelhart, 1986])

[Figure: a feedforward network with input layer 0, hidden layers 1, ..., L−1 and output layer L; each unit also receives a constant input 1 (bias)]

a_i^(1) = Σ_j w_ij^(1) x_j              y_i^(1) = f(a_i^(1))
a_i^(L−1) = Σ_j w_ij^(L−1) y_j^(L−2)    y_i^(L−1) = f(a_i^(L−1))
a_i^(L) = Σ_j w_ij^(L) y_j^(L−1)        y_i^(L) = g(a_i^(L))

Named MLP for historical reasons. Should be called FNN.

34 / 94 Introduction Feedforward Neural Networks

Feedforward neural networks

[Figure: the same network, with the layer-wise equations a_i^(l) = Σ_j w_ij^(l) y_j^(l−1), y_i^(l) = f(a_i^(l)) for the hidden layers and y_i^(L) = g(a_i^(L)) for the output layer]

Architecture

• Depth: number of layers, without counting the input (deep = large depth)
• Width: number of units per layer
• weights and biases for each unit
• Hidden transfer function f, output transfer function g
35 / 94 Introduction Feedforward Neural Networks
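A sketch of the forward pass following the layer equations above, for a toy architecture; the layer sizes and the choice of ReLU/linear transfer functions are illustrative assumptions:

import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def forward(x, weights, biases, hidden_f=relu, output_g=lambda a: a):
    # weights[l], biases[l]: parameters of layer l+1; returns the output y^(L)
    y = x
    for W, b in zip(weights[:-1], biases[:-1]):
        y = hidden_f(W @ y + b)          # hidden layers: y^(l) = f(W^(l) y^(l-1) + b^(l))
    W, b = weights[-1], biases[-1]
    return output_g(W @ y + b)           # output layer (linear here)

# illustrative 4-3-2 network
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]
biases = [np.zeros(3), np.zeros(2)]
y_out = forward(rng.normal(size=4), weights, biases)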

Feedforward neural networks

Architecture: hidden transfer function
• Historically, f was taken as a sigmoid or tanh.
• Now, mainly Rectified Linear Units (ReLU) or similar: f(x) = max(x, 0)

ReLUs are more favorable for the gradient flow than the saturating functions [Krizhevsky(2012), Nair(2010), Jarrett(2009)]

[Figure: the sigmoid and tanh transfer functions saturate, while the ReLU does not]

36 / 94 Introduction Feedforward Neural Networks

Feedforward neural networks

Architecture: output transfer function and loss
• for regression:
  • linear f(x) = x
  • L2 loss L(ŷ, y) = ||y − ŷ||²
• for multiclass classification:
  • softmax ŷ_j = e^{a_j} / Σ_k e^{a_k}
  • negative log likelihood loss L(ŷ, y) = −log(ŷ_y)

37 / 94 Introduction Feedforward Neural Networks

FNN training : error backpropagation

Training by gradient descent

• initialize weights and biases w_0
• at every iteration, compute:

w ← w − ε ∇_w J

The partial derivatives ∂J/∂w_i?
Fundamentally, use the chain rule within the computational graph linking any variable (inputs, weights, biases) to the output of the loss.

Backprop is usually attributed to [Rumelhart,1986] but [Werbos,1981] already introduced the idea.

38 / 94 Introduction Feedforward Neural Networks

Computing partial derivatives: computational graph
A computational graph is a directed graph
• nodes: variables (weights, inputs, outputs, targets, ...)
• edges: operations (ReLU, Softmax, w^T x + b, ..., losses, ...)

We only need to know, for each operation:
• the partial derivatives wrt its parameters
• the partial derivatives wrt its inputs
39 / 94 Introduction Feedforward Neural Networks

Computing partial derivatives ∂J/∂w_i

The chain rule: single path

Suppose there is a single path, e.g. x_i → u_3.

Applying the chain rule: ∂u_3/∂x_i = (∂u_3/∂u_2)(∂u_2/∂u_1)(∂u_1/∂x_i) = ∂/∂x_i (y_i − w x_i − b)²

u_1 = w x_i + b
u_2 = y_i − u_1
u_3 = u_2²
⇒ ∂u_3/∂x_i = 2 u_2 · (−1) · w = −2 w (y_i − w x_i − b)



40 / 94 Introduction Feedforward Neural Networks

Computing partial derivatives ∂J/∂w_i

The chain rule: multiple paths

Sum over all the paths (e.g. u_3 = w_1 x_i + w_2 x_i):

∂u_3/∂x_i = Σ_{j ∈ {1,2}} (∂u_3/∂u_j)(∂u_j/∂x_i)

u_1 = w_1 x_i
u_2 = w_2 x_i
u_3 = u_1 + u_2
⇒ ∂u_3/∂x_i = 1 · w_1 + 1 · w_2



41 / 94 Introduction Feedforward Neural Networks

But, it is computationally expensive There are a lot of paths...

There are 4 paths from xi to u5

42 / 94

We need to identify and sum over all these paths. Oops, there can be an exponential number of paths: with L fully-connected layers of N units each and 1 output, there are N^L paths.
Introduction Feedforward Neural Networks

Let us be more efficient: forward-mode differentiation
Forward differentiation
Idea: to compute ∂u_5/∂x_i, propagate ∂·/∂x_i forward.

∂u_1/∂x_i = (∂u_1/∂z_1)(∂z_1/∂x_i) + (∂u_1/∂z_2)(∂z_2/∂x_i) = z_2 · 1 + z_1 · 0 = z_2 = w_1

But how to compute ∂u_5/∂w_1? Well, propagate ∂·/∂w_1. And ∂u_5/∂w_2? Propagate again... or...
Griewank(2010) Who Invented the Reverse Mode of Differentiation?
http://colah.github.io/posts/2015-08-Backprop/
43 / 94 Introduction Feedforward Neural Networks

Let us be even more efficient: reverse-mode differentiation
Reverse differentiation
Idea: to compute ∂u_5/∂x_i, backpropagate ∂u_5/∂·.

∂u_5/∂u_1 = (∂u_5/∂u_3)(∂u_3/∂u_1) + (∂u_5/∂u_4)(∂u_4/∂u_1) = 1 · w_3 + 1 · w_6

We get ∂u_5/∂x_i, but also ∂u_5/∂w_1, ∂u_5/∂w_2, ... all in a single pass!
Griewank(2010) Who Invented the Reverse Mode of Differentiation?
http://colah.github.io/posts/2015-08-Backprop/
44 / 94 Introduction Feedforward Neural Networks

FNN training : error backpropagation

In neural networks, reverse-mode differentiation is called error backpropagation.
Training in two phases

• Evaluation of the output: forward propagation
• Evaluation of the gradient: reverse-mode differentiation

Careful: the reverse-mode differentiation reuses the activations computed during the forward propagation.

Libraries like theano augment the computational graph with nodes computing numerically the gradient by reverse-mode differentiation.
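A sketch of the two phases for a one-hidden-layer regression network with ReLU and L2 loss: the forward pass stores the activations, and one backward sweep yields all the gradients. Sizes, naming and the loss are illustrative assumptions, not the slides' exact notation:

import numpy as np

def forward_backward(x, y, W1, b1, W2, b2):
    # forward pass: keep the intermediate activations, backprop reuses them
    a1 = W1 @ x + b1
    h1 = np.maximum(a1, 0.0)            # ReLU hidden layer
    y_hat = W2 @ h1 + b2                # linear output
    loss = 0.5 * np.sum((y_hat - y) ** 2)

    # reverse-mode differentiation: one backward sweep gives every gradient
    d_yhat = y_hat - y                  # dL/dy_hat
    dW2 = np.outer(d_yhat, h1)
    db2 = d_yhat
    d_h1 = W2.T @ d_yhat                # backpropagate through the output layer
    d_a1 = d_h1 * (a1 > 0)              # through the ReLU (derivative in {0, 1})
    dW1 = np.outer(d_a1, x)
    db1 = d_a1
    return loss, (dW1, db1, dW2, db2)

rng = np.random.default_rng(0)
W1, b1 = 0.1 * rng.normal(size=(5, 3)), np.zeros(5)
W2, b2 = 0.1 * rng.normal(size=(2, 5)), np.zeros(2)
loss, grads = forward_backward(rng.normal(size=3), np.array([1.0, -1.0]), W1, b1, W2, b2)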

45 / 94 Introduction Feedforward Neural Networks

Universal approximator
Any well-behaved function can be approximated arbitrarily well by a single-hidden-layer FNN.
Intuition

• Take a sigmoid transfer function f(x) = 1/(1 + exp(−α(x − b_i))): this is the hidden layer
• subtract two such activations to get Gaussian-like kernels

[Figure: the difference of two shifted sigmoids gives a localized, Gaussian-like bump]

• weight such subtractions, and you are back to the RBFs
46 / 94 Introduction Feedforward Neural Networks

But then, why deep networks ?? Going deeper

• A single-hidden-layer FNN is a universal approximator, but the hidden layer can be arbitrarily large
• A deep network (large number of layers) builds high-level features by composing lower-level features
• A shallow network directly learns these high-level features
• Image analogy:
  • first layers: extract oriented contours (e.g. Gabors)
  • second layers: learn corners by combining contours
  • next layers: build up more and more complex features
• Theoretical works compare the expressiveness of {depth d FNN} with {depth d−1 FNN}

Learning deep architectures for AI, Bengio(2009), chap. 2; Benefits of depth in neural networks, Telgarsky(2016)
47 / 94 Introduction Feedforward Neural Networks

And why ReLu ?

Vanishing/exploding gradient [Hochreiter(1991),Bengio(1994)]

Consider u_2 = f(w u_1).
• Remember that when the gradient is "backpropagated", it involves ∂J/∂u_1 = (∂J/∂u_2)(∂u_2/∂u_1) = (∂J/∂u_2) w f'(w u_1)
• backpropagated through L layers: (w f')^L
• with f(x) = 1/(1 + e^{−x}), f'(x) < 1
• if w f' ≠ 1, (w f')^L → 0 or ∞
• ⇒ the gradient vanishes or explodes

With ReLU, f'(x) ∈ {0, 1}. But you can get dead units.

48 / 94 Introduction Feedforward Neural Networks

But the ReLUs can die... Why do they die? If the input to a ReLU is negative, the gradient is 0, that's it... "forever" lost. And then?

• Add a linear component for negative x: Leaky ReLU, Parametric ReLU [He(2015)]
• Exponential Linear Units [Clevert, Hochreiter(2016)]

[Figure: the Leaky ReLU / PReLU and ELU transfer functions, with a non-zero response for negative inputs]
49 / 94 Introduction Feedforward Neural Networks

How to deal with the vanishing/exploding gradient Preventing vanishing gradient by preserving gradient flow

• Using ReLU, Leaky ReLU, PReLU, ELU to ensure a good flow of gradient
• Specific architectures:
  • ResNet (CNN): shortcut connections
  • LSTM (RNN): constant error carousel

Preventing exploding gradients

Gradient clipping [Pascanu, 2013]: clip the norm of the gradient.
[Figure: loss surface of a network with 1 unit, σ(x) activation, 50 layers]
50 / 94 Introduction Feedforward Neural Networks

Regularization

Like kids, FNN can do a lot of things, but we must focus their expressiveness Chap 7, [Bengio et al.(2016)]

51 / 94 Introduction Feedforward Neural Networks

Regularization: L2 regularization
Add an L2 penalty on the weights, α > 0:

J(w) = L(w) + (α/2) ||w||²₂ = L(w) + (α/2) w^T w

∇_w J = ∇_w L + α w

Example: RBF, 1 kernel per sample, N = 30, noisy sinus.
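In an SGD update, the L2 penalty simply adds an α w term to the gradient ("weight decay"); a minimal sketch, with α and ε as illustrative values:

import numpy as np

def sgd_step_l2(w, grad_L, alpha=1e-2, eps=0.1):
    # gradient of J(w) = L(w) + (alpha/2) ||w||^2 is grad_L + alpha * w
    return w - eps * (grad_L + alpha * w)

# illustrative usage: with a zero loss gradient, the weights shrink toward 0
w = np.ones(4)
w = sgd_step_l2(w, grad_L=np.zeros(4))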

[Figure: RBF fits of the noisy sinus with α = 0 and α = 2, and the corresponding weights w*]
Chap. 7 of [Bengio et al.(2016)] for a geometrical interpretation.
52 / 94 Introduction Feedforward Neural Networks

Regularization

L2 regularization: in principle, we should not regularize the bias. Example:

J(w) = (1/N) Σ_{i=1}^N ||y_i − w_0 − Σ_{k≥1} w_k x_{i,k}||²

∇_{w_0} J = 0 ⇒ w_0 = (1/N) Σ_i y_i − Σ_{k≥1} w_k (1/N) Σ_i x_{i,k}

e.g. if your data are centered, i.e. (1/N) Σ_i x_{i,k} = 0, then w_0 = (1/N) Σ_i y_i. Regularizing the bias might lead to underfitting.

53 / 94 Introduction Feedforward Neural Networks

Regularization: L1 regularization promotes sparsity
Add an L1 penalty on the weights:

J(w) = L(w) + α ||w||₁ = L(w) + α Σ_k |w_k|

∇_{w_k} J = ∇_{w_k} L + α sign(w_k)

Example: RBF, 1 kernel per sample, N = 30, noisy sinus, α = 0.003.

[Figure: RBF fits with α = 0 and α = 0.003, and the corresponding weights w*; the L1 penalty drives many weights to exactly zero]

54 / 94 Introduction Feedforward Neural Networks

Regularization Drop-out regularization [Srivastava(2014)]

Idea 1: prevent co-adaptation. A unit should be good by itself, not because others are doing part of the job.
Idea 2: combine an exponential number of networks (ensemble).
How:
• For each minibatch, keep hidden and input activations with probability p (p = 0.5 for hidden, p = 0.8 for inputs). At test time, multiply all the activations by p.
• "Inverted" dropout: scale by 1/p at training time; no scaling at test time.
55 / 94 Introduction Feedforward Neural Networks
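A sketch of "inverted" dropout applied to a layer's activations: scale by 1/p during training, do nothing at test time (the value of p and the shapes are illustrative assumptions):

import numpy as np

def dropout(h, p=0.5, train=True, rng=np.random.default_rng()):
    # h: activations of a layer; keep each unit with probability p
    if not train:
        return h                      # inverted dropout: no scaling at test time
    mask = (rng.random(h.shape) < p).astype(h.dtype)
    return h * mask / p               # scale by 1/p so the expected activation is unchanged

h = np.ones(10)
h_train = dropout(h, p=0.5, train=True)
h_test = dropout(h, p=0.5, train=False)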

Regularization (dropout)

Srivastava(2014), Hinton(2012)

Usually applied after the FC layers (p = 0.5) and the input layer. Can be interpreted as training/averaging all the possible subnetworks.
56 / 94 Introduction Feedforward Neural Networks

Regularization

Split your data in three sets:
• training set: for training
• validation set: for choosing hyperparameters
• test set: for estimating the generalization error

Early stopping
Idea: monitor your error on the validation set (U-shaped performance). Keep the model with the lowest validation error during training.

57 / 94 Introduction Feedforward Neural Networks

Training by some forms of gradient descent

w(t+1) ← w(t) − ε ∇_w J(w_t)

* Chap 8 [Bengio et al. (2016)] * A. Karpathy : http://cs231n.github.io/neural-networks-3/

58 / 94 Introduction Feedforward Neural Networks

But wait...

Does it make sense to apply gradient descent to neural networks?
• we cannot get better than a local minimum?!
• and neural networks lead to non-convex optimization problems, i.e. a lot of local minima (think about the symmetries)
But empirically, most local minima are close to the global minimum with large/deep networks.

Choromanska, 2015 : The Loss Surface of Multilayer Nets Dauphin, 2014 : Identifying and attacking the saddle point problem in high-dimensional non-convex optimization

Pascanu, 2014 : On the saddle point problem for non-convex optimization

59 / 94 Introduction Feedforward Neural Networks

Identifying the critical points The hessian matrix

• Matrix of second-order derivatives; informs on the local curvature:

H(θ) = ∇²_θ J, with entries H_{ij} = ∂²f / (∂x_i ∂x_j)

• for a convex function: H is symmetric, positive semidefinite

60 / 94 Introduction Feedforward Neural Networks

Identifying the type of critical points Eigenvalues of H

• a critical point is a point where ∇_θ J = 0
• if all eigenvalues(H) > 0: local minimum
• if all eigenvalues(H) < 0: local maximum
• if eigenvalues(H) are both positive and negative: saddle point

[Figure: x² + y² (minimum), −(x² + y²) (maximum), x² − y² (saddle)]
61 / 94 Introduction Feedforward Neural Networks

Identifying the type of critical points

And if H is degenerate?
If H is degenerate (some eigenvalues = 0, det(H) = 0), we can have:
• a local minimum: f(x, y) = x⁴ + y⁴, H(x, y) = diag(12x², 12y²)
• a local maximum: f(x, y) = −(x⁴ + y⁴), H(x, y) = diag(−12x², −12y²)
• a saddle point: f(x, y) = x³ + y², H(x, y) = diag(6x, 2)

62 / 94 Introduction Feedforward Neural Networks

Local minima are not an issue with deep networks

Local minima and their loss Experiment : One hidden layer, Mnist, Trained with SGD.

Converges mostly to local minima and some saddle points. The distribution of the test loss of the local minima tends to shrink. Index α: fraction of negative eigenvalues of the Hessian.
Choromanska, 2015: The Loss Surface of Multilayer Nets

63 / 94 Introduction Feedforward Neural Networks

Saddle points seem to be the issue Saddle points and their loss Experiment : “small MLP”, Trained with Saddle-free Newton (converges to critical points).

Newton's method is used to discover the critical points. High-loss critical points are saddle points; low-loss critical points are local minima. Index α: fraction of negative eigenvalues of the Hessian.
Dauphin, 2014: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
64 / 94 Introduction Feedforward Neural Networks

Training 1st order methods

J(θ) ≈ J(θ_0) + (θ − θ_0)^T ∇_θ J(θ_0)

θ ← θ − ε ∇_θ J

Rationale: first-order approximation
Ĵ(θ) = J(θ_0) + (θ − θ_0)^T ∇_θ J(θ_0)
(θ − θ_0) = −ε ∇_θ J(θ_0) ⇒ Ĵ(θ) = J(θ_0) − ε ||∇_θ J(θ_0)||²
For small ε ||∇_θ J(θ_0)||: J(θ_0 − ε ∇_θ J(θ_0)) ≤ J(θ_0)

65 / 94 Introduction Feedforward Neural Networks

Training, 1st order methods

Minibatch Stochastic Gradient descent

• start at θ_0
• for every minibatch:

θ(t+1) = θ(t) − ε (1/M) Σ_i ∇_θ J_i(θ)

• M = 1: very noisy estimate, stochastic gradient descent
• M = N: true gradient, batch gradient descent
• (minibatch) SGD converges faster
• the trajectory may converge slowly or diverge if ε is not appropriate

66 / 94 Introduction Feedforward Neural Networks

Training, 1st order methods Stochastic Gradient descent : example Take N = 30 samples with

y = 3x + 2 + U(−0.1, 0.1)

Let us perform linear regression (ŷ = wx + b, L2 loss) with SGD.

[Figure: the N = 30 training samples (x, y)]

67 / 94 Introduction Feedforward Neural Networks

Training, 1st order methods
Stochastic gradient descent: zigzag

ε = 0.005, b_0 = 10, w_0 = 5. Converges to w* = 2.9975, b* = 1.9882.

[Figure: trajectory of (b, w) during SGD (zigzag) and log(J) vs. iteration]

68 / 94 Introduction Feedforward Neural Networks

Training, 1st order methods

Momentum
Idea: damp the oscillations with a low-pass filter on ∇_θ J.
• Start at θ_0, v = 0
• for every minibatch:

v(t+1) = α v(t) − ε ∇_θ J
θ(t+1) = θ(t) + v(t+1)

Usually, α ≈ 0.9.
Experiment on http://distill.pub/2017/momentum/
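A sketch of this update as a small stateful loop (the gradient function, α, ε and the number of steps are illustrative assumptions):

import numpy as np

def sgd_momentum(theta0, grad_fn, eps=0.005, alpha=0.9, n_steps=1000):
    theta, v = theta0.copy(), np.zeros_like(theta0)
    for _ in range(n_steps):
        v = alpha * v - eps * grad_fn(theta)   # low-pass filtered gradient
        theta = theta + v
    return theta

# illustrative quadratic J(theta) = 0.5 * theta^T diag(1, 10) theta
grad = lambda th: np.array([1.0, 10.0]) * th
theta_star = sgd_momentum(np.array([10.0, 5.0]), grad)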

69 / 94 Introduction Feedforward Neural Networks

Training, 1st order methods
Stochastic gradient descent with momentum

ε = 0.005, α = 0.6, b_0 = 10, w_0 = 5. Converges to w* = 2.9933, b* = 1.9837.

[Figure: trajectory of (b, w) and log(J) vs. iteration with momentum]

Advised: set α ∈ {0.5, 0.9, 0.99}
70 / 94 Introduction Feedforward Neural Networks

Training, 1st order methods

SGD without/with momentum

[Figure: log(J) vs. iteration for SGD without (left) and with (right) momentum]

71 / 94 Introduction Feedforward Neural Networks

Training, 1st order methods

Nesterov Momentum [Sutskever, PhD thesis]
Idea: look ahead to potentially correct the update. Based on Nesterov's Accelerated Gradient.
• Start at θ_0, v = 0
• for every minibatch:

θ̃(t+1) = θ(t) + α v(t)
v(t+1) = α v(t) − ε ∇_θ J(θ̃)
θ(t+1) = θ(t) + v(t+1)

72 / 94 Introduction Feedforward Neural Networks

Training, 1st order methods
SGD with Nesterov momentum

ε = 0.005, α = 0.8, b_0 = 10, w_0 = 5. Converges to w* = 2.9914, b* = 1.9738.

[Figure: trajectory of (b, w) and log(J) vs. iteration with Nesterov momentum]

In this experiment, Nesterov momentum allowed a larger momentum coefficient; with α = 0.8, plain momentum strongly oscillates.
73 / 94 Introduction Feedforward Neural Networks

Training, 1st order methods

SGD/ SGD momentum / SGD nesterov momentum

[Figure: log(J) vs. iteration for SGD, SGD with momentum, and SGD with Nesterov momentum]

74 / 94 Introduction Feedforward Neural Networks

Training 1st order methods with adaptive learning rates

75 / 94 Introduction Feedforward Neural Networks

Training : adapting the learning rate

Learning rate annealing
Some possible schedules:
• linear decay between ε_0 and ε_τ
• halve the learning rate when the validation error stops improving

76 / 94 Introduction Feedforward Neural Networks

Training : adapting the learning rate

Adagrad [Duchi, 2011]

• Accumulate the square of the gradients:
  r(t+1) = r(t) + ∇_θ J(θ(t)) ⊙ ∇_θ J(θ(t))
• Scale the learning rates individually:
  θ(t+1) = θ(t) − ε / (δ + √r(t+1)) ⊙ ∇_θ J(θ(t))

The √· is experimentally critical. δ ≈ 1e−8 to 1e−4, for numerical stability.
Small gradients ⇒ bigger learning rate; big gradients ⇒ smaller learning rate.
Accumulating from the beginning is too aggressive: the learning rates decrease too fast.

77 / 94 Introduction Feedforward Neural Networks

Training : adapting the learning rate

RMSProp [Hinton, unpublished]
Idea: use an exponentially decaying average of the squared gradient.
• Accumulate the square of the gradients:
  r(t+1) = ρ r(t) + (1 − ρ) ∇_θ J(θ(t)) ⊙ ∇_θ J(θ(t))
• Scale the learning rates individually:
  θ(t+1) = θ(t) − ε / (δ + √r(t+1)) ⊙ ∇_θ J(θ(t))
ρ ≈ 0.9.
And some others: Adadelta [Zeiler, 2012], Adam [Kingma, 2014], ...
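A sketch of the RMSProp update (ρ, δ, ε, the gradient function and the number of steps are illustrative assumptions):

import numpy as np

def rmsprop(theta0, grad_fn, eps=1e-2, rho=0.9, delta=1e-8, n_steps=1000):
    theta, r = theta0.copy(), np.zeros_like(theta0)
    for _ in range(n_steps):
        g = grad_fn(theta)
        r = rho * r + (1.0 - rho) * g * g                  # running average of squared gradients
        theta = theta - eps / (delta + np.sqrt(r)) * g     # per-parameter learning rate
    return theta

# illustrative usage on a badly scaled quadratic
grad = lambda th: np.array([1.0, 100.0]) * th
theta_star = rmsprop(np.array([10.0, 5.0]), grad)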

78 / 94 Introduction Feedforward Neural Networks

Training with 1st order methods So, which one do I use ? [Bengio et al. 2016] There is currently no consensus[...]no single best algorithm has emerged[...]the most popular and actively in use include SGD, SGD with momentum, RMSprop, RMSprop with momentum, Adadelta and Adam

A. Karpathy

Schaul(2014). Unit Tests for Stochastic Optimization 79 / 94 Introduction Feedforward Neural Networks

Training A glimpse into 2nd order methods

J(θ) ≈ J(θ_0) + (θ − θ_0)^T ∇_θ J(θ_0) + (1/2)(θ − θ_0)^T ∇²_θ J(θ_0)(θ − θ_0)

∇_θ J(θ_0): gradient vector
∇²_θ J(θ_0): Hessian matrix

Idea : use a better local approximation to make a more informed update

80 / 94 Introduction Feedforward Neural Networks

Training : 2nd order methods Newton method From a 2nd order taylor approximation

J(θ) ≈ J(θ_0) + (θ − θ_0)^T ∇_θ J(θ_0) + (1/2)(θ − θ_0)^T ∇²_θ J(θ_0)(θ − θ_0)

Critical point at:

∇_θ J(θ) = 0 ⇒ θ = θ_0 − H^{−1} ∇_θ J(θ_0)

Critical points (min, max, saddle) are attractors for Newton!
• cool: we can locate critical points
• but: do not use it for optimizing a neural network!
Dauphin(2014): Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
81 / 94 Introduction Feedforward Neural Networks

Training : 2nd order methods

Second order methods require a larger batchsize. Some algorithms

• Conjugate gradient: no need to compute the Hessian, guaranteed to converge in k steps for a k-dimensional quadratic function
• Saddle-free Newton [Dauphin, 2014]
• Hessian-free optimization (truncated Newton) [Martens, 2010]
• BFGS (quasi-Newton): approximation of H^{−1}, which needs to be stored and is large for deep networks
• L-BFGS: limited-memory BFGS

82 / 94 Introduction Feedforward Neural Networks

Initialization and the importance of good activation distributions

83 / 94 Introduction Feedforward Neural Networks

Preprocessing your inputs
Gradient descent converges faster if your data are normalized and decorrelated. Denote by x_i ∈ R^d your input data and x̂_i its normalized version.

Input normalization

• Min-max scaling:
  ∀i, j, x̂_{i,j} = (x_{i,j} − min_k x_{k,j}) / (max_k x_{k,j} − min_k x_{k,j} + ε)
• Z-score normalization (goal: μ̂_j = 0, σ̂_j = 1):
  ∀i, j, x̂_{i,j} = (x_{i,j} − μ_j) / (σ_j + ε)
• ZCA whitening (goal: μ̂_j = 0, σ̂_j = 1, (1/(n−1)) X̂ X̂^T = I):
  X̂ = W X, W = √(n−1) (X X^T)^{−1/2}
84 / 94 Introduction Feedforward Neural Networks

Z-score normalization / standardizing the inputs
Remember our linear regression: y = 3x + 2 + U(−0.1, 0.1), L2 loss, 30 1D samples.

[Figure: loss contours in the (b, w) plane with raw inputs and with standardized inputs]

With standardized inputs, the gradient always points to the minimum!

85 / 94 Introduction Feedforward Neural Networks

The starting point of training is critical

Pretraining Historically, training deep FNN was known to be hard, i.e. bad generalization errors. The starting point of a gradient descent has a dramatic impact. neural history compressors [Schmidhuber, 1991] • competitive learning [Maclin and Shavlik, 1995] • unsupervised pretraining based on Boltzman Machines • [Hinton, 2006] unsupervised pretraining based on [Bengio, • 2006]

86 / 94 Introduction Feedforward Neural Networks

For example, pretraining with autoencoders

Idea: extract features that allow the previous layer's activities to be reconstructed. Followed by fine-tuning with gradient descent. This does not appear to be that critical nowadays (because of ReLU variants and initialization strategies).

87 / 94 Introduction Feedforward Neural Networks

Initializing the weights/biases

Thoughts

• initially behave as a linear predictor; non-linearities should be activated by the learning algorithm only if necessary
• units should not extract the same features: symmetry breaking, otherwise same gradients

Suppose the inputs are standardized; make the outputs and gradients standardized too:
• sigmoid: b = 0, w ∼ N(0, 1/√fanin) ⇒ in the linear part
• sigmoid, tanh: b = 0, w ∼ U(−√6/√(n_i + n_o), √6/√(n_i + n_o)) [Glorot, 2010]
• ReLU: b = 0, w ∼ N(0, √(2/fanin)) [He(2015)]
88 / 94 Introduction Feedforward Neural Networks

LeCun initialization
Initialization in the linear regime for the forward pass.
Aim: initialize the weights so that f acts in its linear part, i.e. w close to 0.
• Use the symmetric transfer function f(x) = 1.7159 tanh((2/3) x) ⇒ f(1) = 1, f(−1) = −1
• Center, normalize (unit variance) and decorrelate the input dimensions
• initialize the weights from a distribution with μ = 0, σ = 1/√n_i
• set the biases to 0
• This ensures the output of the layer is zero mean, unit variance

Efficient Backprop, Lecun et al. (1998); Generalization and network design strategies, LeCun (1989) 89 / 94 Introduction Feedforward Neural Networks

Glorot initialization strategy Keep same distribution for the forward and backward pass

• The activations and the gradients should have, initially, similar distributions across the layers, to avoid vanishing/exploding gradients
• The input dimensions should be centered, normalized, uncorrelated
• With a transfer function f such that f'(0) = 1, this turns into:
  ∀i, Var[W^i] = 2 / (fanin + fanout)

Glorot (Xavier) uniform: W ∼ U(−√6/√(n_i + n_o), √6/√(n_i + n_o)), b = 0
Glorot (Xavier) normal: W ∼ N(0, √(2/(fanin + fanout))), b = 0

Understanding the difficulty of training deep feedforward neural networks, Glorot, Bengio, JMLR(2010).
90 / 94 Introduction Feedforward Neural Networks

He initialization strategy Designed for rectifier non linearities (ReLU, PReLU). Keep same distribution for the forward and backward pass

• The activations and the gradients should have, initially, similar distributions across the layers
• The input dimensions should be centered, normalized, uncorrelated
• With a ReLU transfer function and d Conv k_1 × k_2 filters on c channels: (1/2) k_1 k_2 c Var[w_l] = 1 (forward pass), (1/2) k_1 k_2 d Var[w_l] = 1 (backward pass)

He uniform: W ∼ U(−√6/√n_i, √6/√n_i), b = 0
He normal: W ∼ N(0, √(2/n_i))

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, He et al., ICCV(2015).
91 / 94 Introduction Feedforward Neural Networks
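A sketch of the Glorot and He initializers for a fully-connected layer (the layer sizes in the usage lines are illustrative assumptions):

import numpy as np

def glorot_uniform(n_in, n_out, rng=np.random.default_rng()):
    # W ~ U[-sqrt(6)/sqrt(n_in + n_out), +sqrt(6)/sqrt(n_in + n_out)], b = 0
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in)), np.zeros(n_out)

def he_normal(n_in, n_out, rng=np.random.default_rng()):
    # W ~ N(0, sqrt(2 / n_in)), b = 0; designed for ReLU layers
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in)), np.zeros(n_out)

W1, b1 = glorot_uniform(784, 256)   # e.g. a tanh layer
W2, b2 = he_normal(256, 128)        # e.g. a ReLU layer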

Batch normalization [Ioffe, Szegedy(2015)]

Internal Covariate Shift Def [Ioffe(2015)]: The change in the distribution of network activations due to the change in network parameters during training Exp : 3 FC(100 units), sigmoid, output softmax, MNIST

Measure: distribution of activations of the last hidden layer during training, 15,50,85 th percentile { }

92 / 94 Introduction Feedforward Neural Networks

Batch normalization [Ioffe, Szegedy(2015)]

Batch normalization to prevent covariate shift
Idea: standardize the activations of every layer to keep the same distributions during training.
• The gradient must be aware of this normalization, otherwise we may get parameter explosion (see Ioffe(2015))
• Introduces a differentiable BN normalization layer:
  z = g(Wu + b) → z = g(BN(Wu + b))

y_i = BN_{γ,β}(x_i) = γ x̂_i + β
x̂_i = (x_i − μ_B) / √(σ_B² + ε)

μ_B, σ_B²: minibatch mean and variance
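A sketch of the batch-normalization forward pass at training time for a fully-connected layer; the running statistics that would be kept for test time are omitted (shapes and ε are illustrative assumptions):

import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    # X: (M, d) pre-activations of a minibatch; gamma, beta: (d,) learned parameters
    mu = X.mean(axis=0)                     # minibatch mean
    var = X.var(axis=0)                     # minibatch variance
    X_hat = (X - mu) / np.sqrt(var + eps)   # standardized activations
    return gamma * X_hat + beta             # scale and shift: y = gamma * x_hat + beta

rng = np.random.default_rng(0)
X = rng.normal(3.0, 2.0, size=(64, 100))
Y = batchnorm_forward(X, gamma=np.ones(100), beta=np.zeros(100))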

93 / 94 Introduction Feedforward Neural Networks

Batch Normalization Train and test time

• Where: everywhere along the network, before the ReLUs
• at training: standardize each unit's activations over a minibatch
• at test:
  • with one sample, standardize over the population
  • use the mean/variance from the train set
  • or standardize over a batch of test samples

Learning much faster, better generalization

94 / 94 Convolutional Neural Networks (CNN)

Neocognitron [Fukushima(1980)]

LeNet5 [LeCun(1998)]

1 / 35 Idea : Exploiting the structure of the inputs

Ideas

• Features detected by convolutions with local kernels • parameters sharing, sparse weights ⇒ strongly regularized FNN (e.g. detecting an oriented edge is translation invariant)

2 / 35 The CNN of LeCun(1998)

Architecture

• (Conv/NonLinear/Pool) * n • followed by fully connected layers

3 / 35 General architecture of a CNN)

Architecture : Conv/ReLu/Pool

[Figure: a Conv/ReLU/Pool block: K kernels convolved over the input channels (plus a bias) produce K feature maps, followed by ReLU and max pooling]

• Convolution: depth, size (3x3, 5x5), pad, stride
• Max pooling: size, stride (e.g. (2,2))

4 / 35 Recent CNN

Multi-column DNN, Ciresan(2012)
Ensemble of convolutional neural networks trained with dataset augmentation.

0.23 % test misclassification on MNIST. 1.5 million of parameters.

5 / 35 Recent CNN

SuperVision, Krizhevsky(2012)

• top-5 error of 16%, compared to the runner-up with 26% error
• several convolutions were stacked without pooling
• trained on 2 GPUs, for a week
• 60 million parameters, dropout, momentum, L2 penalty, dataset augmentation (translations, reflections, PCA)

6 / 35 Recent CNN SuperVision, Krizhevsky(2012)

• top-5 error of 16%, compared to the runner-up with 26% error
• several convolutions were stacked without pooling
• 60 million parameters, dropout, momentum, L2 penalty, dataset augmentation (translations, reflections, PCA)

7 / 35 Recent CNN

VGG, Simonyan(2014)

• 16 layers: 13 convolutional, 3 fully connected
• 3x3 convolutions, 2x2 pooling
• two stacked 3x3 convolutions have the receptive field of a 5x5 convolution with fewer parameters:
  • K input channels, K output channels, 5x5 convolution ⇒ 25K² parameters
  • K input channels, K output channels, two 3x3 convolutions ⇒ 18K² parameters
• 140 million parameters, dropout, momentum, L2 penalty, learning rate annealing, trained progressively

8 / 35 Recent CNN

Inception, GoogLeNet (Szegedy,2015) Idea : decrease the number of parameters by using 1x1 convolutions for cross-channel interactions.

• dramatic decrease in the number of parameters ≈ 6 Million • multi-level feature extraction

9 / 35 Recent CNN Residual Networks (He,2015) Idea : shortcut connections, no fully connected layers

• L2 penalty, batch normalization (NO dropout), momentum • up to 150 layers, for only 2 Million parameters 10 / 35 Recent CNN Striving for simplicity: The all convolutional Net (Springenberg,2014)

You can get rid of pooling and only use convolutions (All-CNN-C).
• training with SGD, momentum, L2 penalty, dropout
• only 3x3 convolutions with various strides
• the last layers use 3x3 and 1x1 convolutions instead of FC layers

11 / 35 An attempt at synthesizing CNN design principles

12 / 35 Design principles for Convolutional Neural networks

Increase the number of filters through the network Rationale : • first layers extract low level features • higher layers combine the previous features

Number of filters

• LeNet-5 (1998) : 6 5x5 - 16 5x5 • AlexNet(2012) : 96 11x11, 256 5x5, (384 3x3)*2, 256 3x3 • VGG (2014) : 64 - 128 - 256 - 512; all 3x3 • ResNet (2015) : 64 - 128 - 256 - 512; all 3x3 • Inception (2015) : 64 → 1024, 1x1, 3x3, “5x5”

Design principles for Convolutional Neural networks
Effective Receptive Field size for (Conv 3x3 - Conv 3x3 - Max Pool) blocks

Pipeline (all convolutions "same", stride 1; max poolings 2x2, stride 2):
Conv 3x3 - Conv 3x3 - Max 2x2 - Conv 3x3 - Conv 3x3 - Max 2x2 - Conv 3x3

Layer               | Input | Conv 3x3 | Conv 3x3 | Max 2x2 | Conv 3x3 | Conv 3x3 | Max 2x2 | Conv 3x3
Representation size | 28x28 | 28x28    | 28x28    | 14x14   | 14x14    | 14x14    | 7x7     | 7x7
Input RF size       | 1x1   | 3x3      | 5x5      | 6x6     | 10x10    | 14x14    | 15x15   | 23x23

⇒ Stack layers to ensure the RF covers the objects to detect. https://github.com/vdumoulin/conv_arithmetic

15 / 35 Design principles for Convolutional Neural networks
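A sketch of the usual iterative receptive-field computation (the RF grows by (k−1)·jump at each layer, the jump being the product of the strides so far); depending on the convention used for pooling it can differ by a pixel or two from the table above:

def receptive_field(layers):
    # layers: list of (kernel_size, stride); returns the input RF size after each layer
    rf, jump, sizes = 1, 1, []
    for k, s in layers:
        rf = rf + (k - 1) * jump      # each new layer sees (k-1)*jump more input pixels
        jump = jump * s               # effective stride in input coordinates
        sizes.append(rf)
    return sizes

# two (Conv3x3, Conv3x3, MaxPool2x2 stride 2) blocks followed by a Conv3x3
blocks = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (2, 2), (3, 1)]
print(receptive_field(blocks))        # [3, 5, 6, 10, 14, 16, 24] with this convention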

15 / 35 Design principles for Convolutional Neural networks Stacking small kernels (VGG, Inception)

Szegedi(2015) n input filters,αn output filters : • αn5x5 conv : 25αn2 params √ √ √ • αn, 3x3- αn, 3x3 : 9 αn2 + 9 ααn2 params; α = 2 ⇒ 24% saving n input filters,αn output filters : • αn3x3 conv : 9αn2 params √ √ √ • αn, 1x3 - αn, 3x1 : 3 αn2 + 3α αn2 params; α = 2 ⇒ 30% saving 16 / 35 Design principles for Convolutional Neural Networks Depthwise convolutions MobileNets [Howard,2017]

Decrease the number of parameters by decoupling feature extraction in space from feature combination across channels. See also Xception [Chollet(2016)].

17 / 35 Design principles for Convolutional Neural networks

Multiscale feature extraction

18 / 35 Design principles for Convolutional Neural networks

Dimensionality reduction with 1x1 convolutions

[Figure: a 1x1 convolution with m filters maps a height × width × n volume to a height × width × m volume, followed by a ReLU]

Equivalent to a single-layer FNN slid over the pixels. Multiple 1x1 convolutions ↔ MLP. A trainable non-linear transformation of the channels. Network in Network (Lin, 2013).

19 / 35 Design principles for Convolutional Neural networks

Ease the gradient flow with shortcuts

20 / 35 Design principles for Convolutional Neural networks

Do we need max pooling and fully connected layers ? The all convnet [Springenberg, 2015], ResNet [He, 2015]:

Avantage : you can slide the network over larger images to produce a volume of class probabilities.

21 / 35 Usefull tricks

Using Pre-trained models Already trained models can be found : • https://github.com/tensorflow/models : Tensorflow zoo • https://keras.io/applications/ : Keras pre-trained models • Caffe Model Zoo e.g. use a VGG pretrained on ImageNet, 1) replace the softmax, 2) fine-tune some of the deepest layers

22 / 35 Usefull tricks

Dataset augmentation Regularize your network by providing more samples : • With small perturbations (rotation, shift, zoom, ..)

• by altering the RGB pixel values with a PCA [Krizhevsky et al., 2012]
• by learning a generator (see Generative Adversarial Networks)

23 / 35 Usefull tricks

Model averaging

1 Train several models with different initializations, architectures
2 Average the responses of these models
e.g. on CIFAR-100, 11 models with loss ≈ 1.3, acc ≈ 70%; averaged: loss ≈ 0.82, acc ≈ 77%. All challenge winners use model averaging.

Model compression : speeding up inference time

• Binarized Neural Networks [Courbariaux(2016)]
• Knowledge distillation (train a small model with the soft targets of a big model) [Hinton(2015)]

24 / 35 Viewing and understanding deep networks

Demo of Deep Visualization Toolbox http://yosinski.com/deepvis

25 / 35 Some applications of CNN

26 / 35 Image classification Aim : assign a label to an image Some benchmarks

• MNIST (28x28, 10 classes, grayscale, 60000 training, 10000 testing) • CIFAR-10, CIFAR-100 • ImageNet Task 1 (256x256, 1000 classes, 1.2 million training, 50.000 validation, and 100.000 test)

Image from [He(2016)] 27 / 35 Image classification ImageNet

28 / 35 Image classification

ImageNet

Image from [Canziani(2016)]

29 / 35 Object detection

Aim : detect the objects and output bounding boxes Metrics : detected classes, bbox coverage Some benchmarks

• Pascal-VOC • ImageNet Task 3 • Microsoft COCO

30 / 35 Applications of CNN : Object detection

Region based CNN [Girshick,2014]

Using the model AlexNet [Krizhevsky(2012)] for classifying.

31 / 35 Applications of CNN : Object detection

Fast RCNN (Girshick, 2015)

Applications of CNN: Object detection
Faster RCNN (Ren, Girshick, 2015)
- Introduces a Region Proposal Network feeding a Fast-RCNN
- end-to-end training

More recently : YOLO(2015), YOLO9000(2016) 33 / 35 Applications of CNN : Semantic/Instance segmentation Instance segmentation [He,2017]: Mask-RCNN Predicts a binary mask in addition to the classes and boxes

Other approaches : SegNet(2015), FC-DenseNet(2017), UNet(2015), ENet(2016)

34 / 35 Spatial Transformer Network (Jaderberg, 2016) Learns a differentiable transformation Tθ (crop, translation, rotation, scale, and skew). Aim: decrease the number of degrees of freedom of the objects.

35 / 35 Recurrent Neural Networks

Recurrent neural networks (RNN) Handling sequential data (speech, handwriting, language,...) Predicting in context

Input Output

1 / 15 Recurrent Neural Networks

Handling sequences with FNN

Time delay neural networks [Waibel(1989)]

[Figure: a delay line holding x_{t−6}, ..., x_t feeds hidden layers and an output layer]

But which size of the time window ? Must the history size be always the same ? Do we need the data over the whole time span ?

2 / 15 Recurrent Neural Networks

Recurrent Neural Networks (RNN) Architecture

Input Output

• W^in: inputs to hidden
• W^back: outputs to hidden
• W: hidden to hidden
• W^out: hidden to output

Note: it applies the weight matrices repeatedly. 3 / 15 Recurrent Neural Networks

Training a RNN: Forward mode differentation

Real Time Recurrent Learning (RTRL), [Williams(1989)]
• Same idea as forward-mode differentiation for FNN
• Computationally more expensive than reverse-mode differentiation
• Online training

Sutskever (2013). Training recurrent neural networks, PhD thesis.

4 / 15 Recurrent Neural Networks

Training a RNN : Reverse mode differentation

Backpropagation Through Time (BPTT), [Werbos(1990)]

• Unfolds the computational graph in time ⇒ ∼ backprop in a deep FNN
• computationally cheaper than RTRL
• batch training

Sutskever (2013). Training recurrent neural networks, PhD thesis.

5 / 15 Recurrent Neural Networks

Training a RNN is hard

Long-term dependencies

• Exploding/Vanishing gradient • if one output depends on an input a long time ago (long-term dependencies), that information may actually be lost or hard to be sensitive to • ⇒ introduce memory units, specifically designed to hold some information

6 / 15 Recurrent Neural Networks

Long-Short Term Memory Architecture [Hochreiter,Schmidhuber(1997)][Gers(2000)] Specifically designed to store information for long time delays Gating units specify when to integrate, release, forget

Other possibilities: e.g. Gated Recurrent Units (GRUs) [Cho(2014)], Recurrent Highway Networks.
Jozefowicz(2015); "LSTM: A search space odyssey" [Greff(2017)]
7 / 15 Recurrent Neural Networks

Bidirectional LSTM Speech to text Bidirectional LSTM for Speech recognition [Graves(2013)] Both past and future contexts are used for classifying the current observation. When you speak, past and future phonemes influence the way you pronounce the current one.

8 / 15 Recurrent Neural Networks

Applications of RNN: Language modelling Char RNN [Karpathy(2015)] Train A LSTM network to predict the next character. Then provide a seed and let it generate, character by character, a sentence.

http: //karpathy.github.io/2015/05/21/rnn-effectiveness/

9 / 15 Recurrent Neural Networks

Applications of RNN: Text to text
Mapping a sentence in one language to its translation. Encoder/Decoder [Sutskever(2014), Cho(2014)]

https://devblogs.nvidia.com/parallelforall/ introduction-neural-machine-translation-with-gpus/ 10 / 15 Recurrent Neural Networks

Applications of RNN: Text to text The encoder/decoder suffers when the sentences are long. Idea : let the network decides on which part of the sentence to attend when translating. [Bahdanau(2015)] Attention based LSTM translation

See also [Cho et al.(2015)] for image captioning. 11 / 15 Recurrent Neural Networks

Applications of RNN: Multi language translation Google’s Multilingual Neural Machine Translation System

Model trained with English↔Portuguese and English↔Spanish generalizes to English↔Spanish.

[Wu(2016); Johnson(2016)] 12 / 15 Recurrent Neural Networks

Applications of RNN Text to handwritten text Handwriting [Graves(2013)]

Data: a sequence of characters and a sequence of pen positions (x, y, up/down).
1 learn a handwritten text generator: (δx_{t+1}, δy_{t+1}, ud_t) = f({δx_k, δy_k, ud_k}_{k ∈ [0, t−1]})
2 condition the network by inputting a sequence of characters, and learn to attend to the right characters
3 prime the network with a given character/pen sequence to mimic a style
The network outputs the parameters of a mixture of Gaussians.
http://www.cs.toronto.edu/~graves/handwriting.html

13 / 15 Recurrent Neural Networks

Combining CNN and RNN

Automatic captioning [Karpathy(2015)]

http://cs.stanford.edu/people/karpathy/deepimagesent/

14 / 15 Recurrent Neural Networks

We did not speak about

• Encoders/decoders, Deconvolutional networks
• Models for Natural Language Processing (e.g. word2vec, GloVe, recursive networks)
• Probabilistic/Energy-based models: Hopfield networks, Restricted Boltzmann machines, deep belief networks
• More generally: generative models (e.g. Generative Adversarial Networks [Goodfellow, 2014])
• Neural Turing Machines, Attention-based models https://distill.pub/2016/augmented-rnns/

15 / 15