Introduction Feedforward Neural Networks
Deep learning
Jérémy Fix
CentraleSupélec jeremy.fi[email protected]
2016
Introduction and historical perspective
[Schmidhuber, 2015] Deep Learning in Neural Networks: An Overview, Jürgen Schmidhuber, Neural Networks (61), Pages 85-117
Historical perspective on neural networks
• Perceptron (Rosenblatt, 1962): linear classifier
• AdaLinE (Widrow, Hoff, 1962): linear regressor
• Minsky/Papert (1969): first winter
• Convolutional Neural Networks (1980, 1998): great!
• Multilayer Perceptron and backprop (Rumelhart, 1986): great! But it is hard to train, and the SVMs arrive in the 1990s: second winter
• 2006: pretraining!
• 2012: AlexNet on ImageNet (10% better on test than the 2nd)
• Now: lots of state-of-the-art neural networks
Some reasons of the current success
• GPU (speed of processing) / Data (regularizing)
• theoretical understanding of the difficulty of training deep networks
Which libs?
• Torch (Lua) / PyTorch, Caffe (Python/C++)
• Theano/Lasagne (Python, RIP 2017), TensorFlow (Google, Python, C++), Keras (wrapper over TensorFlow and Theano), CNTK, MXNet, Chainer, DyNet, ...
What to read ? www.deeplearningbook.org Goodfellow, Bengio, Courville(2016)
Who to follow ?
• N-1: LeCun, Bengio, Hinton, Schmidhuber
• N: Goodfellow, Dauphin, Graves, Sutskever, Karpathy, Krizhevsky, ...
Obviously, the community is much larger.
Which conferences ? ICML, NIPS, ICLR, ..
https://github.com/terryum/awesome-deep-learning-papers
What is a neural network
The tree
A neural network is a directed graph:
• edges: weighted connections
• nodes: computational units
• no cycle: feedforward neural networks (FNN)
• with cycles: recurrent neural networks (RNN)
hides the jungle
What is a convolutional neural network with a softmax output, ReLu hidden activations, with batch normalization layers, trained with RMSprop with Nesterov momentum, regularized with dropout?
Feedforward Neural Networks (FNN)
[Figure: feedforward network with input, hidden and output layers, including a skip-layer connection]
Perceptron (Rosenblatt, 1962)
• classification, given $(x_i, y_i)$, $y_i \in \{-1, 1\}$
• SAR architecture, basis functions $\phi_j(x)$ with $\phi_0(x) = 1$
• Algorithm
• Geometrical interpretation
[Figure: Sensory / Associative / Result architecture. The inputs $x_0, \dots, x_3$ feed the associative units $a_j = \phi_j(x)$, which are combined through the weights $w_{ij}$, a sum $\Sigma$ and the transfer function $g$ into the results $r_0, r_1, r_2$]
Perceptron (Rosenblatt, 1962)
Classifier
Given feature functions $\phi_j$, with $\phi_0(x) = 1$, the perceptron classifies $x$ as:
$$y = g(w^T \phi(x)) \quad (1)$$
$$g(x) = \begin{cases} -1 & \text{if } x < 0 \\ +1 & \text{if } x \geq 0 \end{cases} \quad (2)$$
with $\phi(x) \in \mathbb{R}^{n_a + 1}$, $\phi(x) = (1, \phi_1(x), \phi_2(x), \dots)^T$
Perceptron (Rosenblatt, 1962)
Online Training algorithm
Given $(x_i, y_i)$, $y_i \in \{-1, 1\}$, the perceptron learning rule operates online:
$$w = \begin{cases} w & \text{if the input is correctly classified} \\ w + \phi(x_i) & \text{if the input is incorrectly classified as } -1 \\ w - \phi(x_i) & \text{if the input is incorrectly classified as } +1 \end{cases} \quad (3)$$
Perceptron (Rosenblatt, 1962)
Geometrical interpretation
$y = g(w^T \phi(x))$
Cases when a sample is correctly classified
[Figure: two panels, "Case $y_i = +1$" and "Case $y_i = -1$", each showing the vectors $w$ and $\phi(x_i)$]
Perceptron (Rosenblatt, 1962)
Geometrical interpretation
$y = g(w^T \phi(x))$
Cases when a sample is misclassified
[Figure: two panels. "Case $y_i = +1$": $w$, $\phi(x_i)$ and the updated vector $w + \phi(x_i)$. "Case $y_i = -1$": $w$, $\phi(x_i)$ and the updated vector $w - \phi(x_i)$]
Perceptron (Rosenblatt, 1962) The cone of feasible solutions
Consider two samples $x_1, x_2$ with $y_1 = +1$, $y_2 = -1$.
[Figure: the vectors $\phi(x_1)$, $\phi(x_2)$ and $w$, with the hyperplanes $v^T \phi(x_1) = 0$ and $v^T \phi(x_2) = 0$ delimiting the cone of feasible solutions]
Perceptron (Rosenblatt, 1962)
Online Training algorithm
Given $(x_i, y_i)$, $y_i \in \{-1, 1\}$, the perceptron learning rule operates online:
$$w = \begin{cases} w & \text{if the input is correctly classified} \\ w + \phi(x_i) & \text{if the input is incorrectly classified as } -1 \\ w - \phi(x_i) & \text{if the input is incorrectly classified as } +1 \end{cases} \quad (4)$$
$$w = \begin{cases} w & \text{if } g(w^T \phi(x_i)) = y_i \\ w + \phi(x_i) & \text{if } g(w^T \phi(x_i)) = -1 \text{ and } y_i = +1 \\ w - \phi(x_i) & \text{if } g(w^T \phi(x_i)) = +1 \text{ and } y_i = -1 \end{cases} \quad (5)$$
Perceptron (Rosenblatt, 1962) Online Training algorithm
Given $(x_i, y_i)$, $y_i \in \{-1, 1\}$, the perceptron learning rule operates online:
$$w = \begin{cases} w & \text{if } g(w^T \phi(x_i)) = y_i \\ w + \phi(x_i) & \text{if } g(w^T \phi(x_i)) = -1 \text{ and } y_i = +1 \\ w - \phi(x_i) & \text{if } g(w^T \phi(x_i)) = +1 \text{ and } y_i = -1 \end{cases}$$
$$w = \begin{cases} w & \text{if } g(w^T \phi(x_i)) = y_i \\ w + y_i \phi(x_i) & \text{if } g(w^T \phi(x_i)) \neq y_i \end{cases}$$
$$w = w + \frac{1}{2}(y_i - \hat{y}_i)\phi(x_i) \quad \text{with } \hat{y}_i = g(w^T \phi(x_i))$$
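The compact form of the rule can be sketched in a few lines of plain Python. This is only an illustration: the function names (`train_perceptron`, `dot`), the epoch cap and the toy AND dataset are assumptions, not part of the slides.

```python
def g(a):
    """Sign transfer function: -1 if a < 0, +1 otherwise."""
    return -1 if a < 0 else 1

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def train_perceptron(samples, n_epochs=100):
    """samples: list of (phi, y) with phi[0] == 1 and y in {-1, +1}."""
    w = [0.0] * len(samples[0][0])
    for _ in range(n_epochs):
        n_errors = 0
        for phi, y in samples:
            y_hat = g(dot(w, phi))
            if y_hat != y:
                # w <- w + 1/2 (y - y_hat) phi, i.e. w + y * phi on a mistake
                w = [wj + 0.5 * (y - y_hat) * pj for wj, pj in zip(w, phi)]
                n_errors += 1
        if n_errors == 0:  # every sample is correctly classified
            break
    return w

# Hypothetical toy problem: logical AND with labels in {-1, +1}, phi(x) = (1, x1, x2)
and_samples = [([1, 0, 0], -1), ([1, 0, 1], -1), ([1, 1, 0], -1), ([1, 1, 1], 1)]
w_and = train_perceptron(and_samples)
predictions = [g(dot(w_and, phi)) for phi, _ in and_samples]
```

Since AND is linearly separable, the loop stops after a few passes with all four samples correctly classified.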
Perceptron (Rosenblatt, 1962)
Definition (Linear separability)
A binary classification problem $(x_i, y_i) \in \mathbb{R}^d \times \{-1, 1\}$, $i \in [1..N]$, is said to be linearly separable if there exists $w \in \mathbb{R}^d$ such that:
$$\forall i, \ \mathrm{sign}(w^T x_i) = y_i$$
with $\forall x < 0, \mathrm{sign}(x) = -1$ and $\forall x \geq 0, \mathrm{sign}(x) = +1$.
Theorem (Perceptron convergence theorem)
A classification problem $(x_i, y_i) \in \mathbb{R}^d \times \{-1, 1\}$, $i \in [1..N]$, is linearly separable if and only if the perceptron learning rule converges to an optimal solution in a finite number of steps.
$\Leftarrow$: easy; $\Rightarrow$: we upper/lower bound $\|w(t)\|_2^2$
Perceptron (Rosenblatt, 1962)
• $w_t = w_0 + \sum_{i \in I(t)} y_i \phi(x_i)$, with $I(t)$ the set of misclassified samples
• it minimizes a loss: $J(w) = \frac{1}{N} \sum_i \max(0, -y_i w^T \phi(x_i))$
• the solution can be written as
$$w_t = w_0 + \sum_i \frac{1}{2}(y_i - \hat{y}_i)\phi(x_i)$$
$(y_i - \hat{y}_i)$ is the prediction error
Kernel Perceptron
Any linear predictor involving only scalar products can be kernelized (kernel trick, cf. SVM).
Given $w(t) = w_0 + \sum_{i \in I} y_i x_i$:
$$\langle w, x \rangle = \langle w_0, x \rangle + \sum_{i \in I} y_i \langle x_i, x \rangle$$
$$\Rightarrow k(w, x) = k(w_0, x) + \sum_{i \in I} y_i k(x_i, x)$$
[Figure: 2D plot, both axes ranging from -3 to 3]
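A minimal sketch of the kernelized rule, assuming $w_0 = 0$ and a Gaussian kernel (both choices, as well as the function names and the XOR toy data, are illustrative assumptions). Each sample keeps a counter of how many times it was misclassified, which plays the role of the set $I$.

```python
import math

def rbf_kernel(a, b, gamma=1.0):
    """Gaussian kernel k(a, b) = exp(-gamma * ||a - b||^2)."""
    return math.exp(-gamma * sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

def decision(x, samples, counts, kernel):
    """k(w, x) with w_0 = 0: sum over the recorded mistakes of y_i k(x_i, x)."""
    return sum(c * y * kernel(xi, x) for (xi, y), c in zip(samples, counts))

def train_kernel_perceptron(samples, kernel, n_epochs=100):
    counts = [0] * len(samples)  # number of times each sample was misclassified
    for _ in range(n_epochs):
        n_errors = 0
        for idx, (x, y) in enumerate(samples):
            y_hat = 1 if decision(x, samples, counts, kernel) >= 0 else -1
            if y_hat != y:
                counts[idx] += 1
                n_errors += 1
        if n_errors == 0:
            break
    return counts

# XOR is not linearly separable in (x1, x2) but is separable with this kernel
xor_samples = [((0, 0), -1), ((0, 1), 1), ((1, 0), 1), ((1, 1), -1)]
counts = train_kernel_perceptron(xor_samples, rbf_kernel)
xor_predictions = [1 if decision(x, xor_samples, counts, rbf_kernel) >= 0 else -1
                   for x, _ in xor_samples]
```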
Adaptive Linear Elements (Widrow, Hoff, 1962)
Linear regression, Analytically
• Given $(x_i, y_i)$, $y_i \in \mathbb{R}$
• minimize $J(w) = \frac{1}{N} \sum_i \|y_i - w^T x_i\|^2$
• Analytically: $\nabla_w J(w) = 0 \Rightarrow XX^T w = Xy$
  • $XX^T$ non singular: $w = (XX^T)^{-1} X y$
  • $XX^T$ singular (e.g. points along a line in 2D): infinite number of solutions
• regularized least squares: $\min G(w) = J(w) + \alpha w^T w$
  • $\nabla_w G(w) = 0 \Rightarrow (XX^T + \alpha I) w = Xy$
  • as soon as $\alpha > 0$, $(XX^T + \alpha I)$ is not singular
Needs to compute $XX^T$, i.e. over the whole training set...
Adaptive Linear Elements (Widrow, Hoff, 1962)
Linear regression with stochastic gradient descent
• start at $w_0$
• take each sample one after the other (online): $x_i, y_i$
• denote $\hat{y}_i = w^T x_i$ the prediction
• update $w_{t+1} = w_t - \epsilon \nabla_w J(w_t) = w_t + \epsilon (y_i - \hat{y}_i) x_i$
• delta rule, $\delta = (y_i - \hat{y}_i)$: prediction error
$$w_{t+1} = w_t + \epsilon \delta x_i$$
• note the similarity with the perceptron learning rule
The samples xi are supposed to be “extended” with one dimension set to 1.
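As an illustration, the delta rule on extended inputs can be sketched as follows. The learning rate, the number of epochs, the zero initialization and the synthetic data (a line $y = 3x + 2$ with small uniform noise) are assumptions chosen for the example.

```python
import random

def sgd_linear_regression(xs, ys, eps=0.005, n_epochs=200, seed=0):
    """Delta rule on inputs extended with a constant 1: w = (bias, slope)."""
    rng = random.Random(seed)
    w = [0.0, 0.0]
    data = [([1.0, x], y) for x, y in zip(xs, ys)]
    for _ in range(n_epochs):
        rng.shuffle(data)  # visit the samples one after the other, in random order
        for x, y in data:
            y_hat = w[0] * x[0] + w[1] * x[1]
            delta = y - y_hat  # prediction error
            w = [wj + eps * delta * xj for wj, xj in zip(w, x)]
    return w

rng = random.Random(1)
xs = [rng.uniform(-10, 10) for _ in range(30)]
ys = [3 * x + 2 + rng.uniform(-0.1, 0.1) for x in xs]
bias, slope = sgd_linear_regression(xs, ys)
```

With these settings the estimate should end up close to a bias of 2 and a slope of 3; the exact digits depend on the random seed.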
Batch/Minibatch/Stochastic gradient descent
$$J(w, x, y) = \frac{1}{N} \sum_{i=1}^N L(w, x_i, y_i)$$
e.g. $L(w, x_i, y_i) = \|y_i - w^T x_i\|^2$
Batch gradient descent
• compute the gradient of the loss $J(w)$ over the whole training set
• perform one step in the direction of $-\nabla_w J(w, x, y)$
$$w_{t+1} = w_t - \epsilon_t \nabla_w J(w, x, y)$$
• $\epsilon$: learning rate
Batch/Minibatch/Stochastic gradient descent
$$J(w, x, y) = \frac{1}{N} \sum_{i=1}^N L(w, x_i, y_i)$$
e.g. $L(w, x_i, y_i) = \|y_i - w^T x_i\|^2$
Stochastic gradient descent (SGD)
• one sample at a time, noisy estimate of $\nabla_w J$
• perform one step in the direction of $-\nabla_w L(w, x_i, y_i)$
$$w_{t+1} = w_t - \epsilon_t \nabla_w L(w, x_i, y_i)$$
• faster to converge than gradient descent
Batch/Minibatch/Stochastic gradient descent
$$J(w, x, y) = \frac{1}{N} \sum_{i=1}^N L(w, x_i, y_i)$$
e.g. $L(w, x_i, y_i) = \|y_i - w^T x_i\|^2$
Minibatch
• noisy estimate of the true gradient with $M$ samples (e.g. $M = 64, 128$); $M$ is the minibatch size
• randomize $\mathcal{J}$ with $|\mathcal{J}| = M$, one set at a time
$$w_{t+1} = w_t - \epsilon_t \frac{1}{M} \sum_{j \in \mathcal{J}} \nabla_w L(w, x_j, y_j)$$
• smoother estimate than SGD
• great for parallel architectures (GPU)
Does it make sense to use gradient descent ?
Convex function
A function $f : \mathbb{R}^n \to \mathbb{R}$ is convex:
1. $\iff \forall x_1, x_2 \in \mathbb{R}^n, \forall t \in [0, 1], \ f(t x_1 + (1 - t) x_2) \leq t f(x_1) + (1 - t) f(x_2)$
2. with $f$ twice differentiable, $\iff \forall x \in \mathbb{R}^n$, $H = \nabla^2 f(x)$ is positive semidefinite, i.e. $\forall x \in \mathbb{R}^n, \ x^T H x \geq 0$
For a convex function $f$, all local minima are global minima. Our losses are lower bounded, so these minima exist. Under mild conditions, gradient descent and stochastic gradient descent converge, typically with $\sum_t \epsilon_t = \infty$, $\sum_t \epsilon_t^2 < \infty$ (cf. lectures on convex optimization).
Does it make sense to use gradient descent ?
Linear regression with L2 loss is convex
Indeed,
• given $x_i, y_i$, $L(w) = \frac{1}{2}(w^T x_i - y_i)^2$ is convex:
$$\nabla_w L = (w^T x_i - y_i) x_i$$
$$\nabla^2_w L = x_i x_i^T$$
$$\forall x \in \mathbb{R}^n, \ x^T x_i x_i^T x = (x_i^T x)^2 \geq 0$$
• a non-negative weighted sum of convex functions is convex
Linear regression, synthesis
Linear regression
• samples $(x_i, y_i)$, $y_i \in \mathbb{R}$
• extend $x_i$ by adding a constant dimension equal to 1, which accounts for the bias
• linear model $\hat{y}_i = w^T x_i$
• L2 loss $L(\hat{y}, y) = \frac{1}{2}\|\hat{y} - y\|^2$
• by gradient descent:
$$\nabla_w L(w, x_i, y_i) = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w} = -(y_i - \hat{y}_i) x_i$$
Linear classification, synthesis Linear binary classification (logistic regression)
• samples $(x_i, y_i)$, $y_i \in [0, 1]$
• extend $x_i$ by adding a constant dimension equal to 1, which accounts for the bias
• linear model $w^T x$
• sigmoid transfer function $\hat{y}_i = \sigma(w^T x_i)$
  • $\sigma(x) = \frac{1}{1 + \exp(-x)}$, $\sigma(x) \in [0, 1]$
  • $\frac{d}{dx}\sigma(x) = \sigma(x)(1 - \sigma(x))$
• cross entropy loss $L(\hat{y}, y) = -y \log \hat{y} - (1 - y) \log(1 - \hat{y})$
• by gradient descent:
$$\nabla_w L(w, x_i, y_i) = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w} = -(y_i - \hat{y}_i) x_i$$
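The closed-form gradient $-(y_i - \hat{y}_i) x_i$ can be checked against a finite-difference estimate. The sketch below does this for one arbitrary sample; the helper names and the numerical values are illustrative assumptions.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def cross_entropy(w, x, y):
    y_hat = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
    return -y * math.log(y_hat) - (1 - y) * math.log(1 - y_hat)

def grad_cross_entropy(w, x, y):
    """Analytic gradient: -(y - y_hat) x."""
    y_hat = sigmoid(sum(wj * xj for wj, xj in zip(w, x)))
    return [-(y - y_hat) * xj for xj in x]

def numerical_grad(f, w, h=1e-6):
    """Central finite differences, one coordinate at a time."""
    grad = []
    for j in range(len(w)):
        wp = list(w)
        wp[j] += h
        wm = list(w)
        wm[j] -= h
        grad.append((f(wp) - f(wm)) / (2 * h))
    return grad

w = [0.3, -0.7, 0.5]
x = [1.0, 2.0, -1.5]  # first component is the constant 1 for the bias
y = 1
analytic = grad_cross_entropy(w, x, y)
numeric = numerical_grad(lambda v: cross_entropy(v, x, y), w)
max_gap = max(abs(a - n) for a, n in zip(analytic, numeric))
```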
Linear classification, synthesis
Logistic regression is convex Indeed,
• given $x_i$, $y_i = 1$: $L_1(w) = -\log(\sigma(w^T x_i)) = \log(1 + \exp(-w^T x_i))$
$$\nabla_w L_1 = -(1 - \sigma(w^T x_i)) x_i$$
$$\nabla^2_w L_1 = \underbrace{\sigma(w^T x_i)(1 - \sigma(w^T x_i))}_{> 0} x_i x_i^T$$
• given $x_i$, $y_i = 0$: $L_2(w) = -\log(1 - \sigma(w^T x_i))$
$$\nabla_w L_2 = \sigma(w^T x_i) x_i$$
$$\nabla^2_w L_2 = \underbrace{\sigma(w^T x_i)(1 - \sigma(w^T x_i))}_{> 0} x_i x_i^T$$
• a non-negative weighted sum of convex functions is convex
Why L2 loss for linear classification with SGD is bad
Compute the gradient to see why...
• Take the L2 loss $L(\hat{y}, y) = \frac{1}{2}\|\hat{y} - y\|^2$
• Take the "linear" model: $\hat{y}_i = \sigma(w^T x_i)$
• Check that $\frac{d}{dx}\sigma(x) = \sigma(x)(1 - \sigma(x))$
• Compute the gradient wrt $w$:
$$\nabla_w L(w, x_i, y_i) = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial w} = -(y_i - \hat{y}_i)\,\sigma(w^T x_i)(1 - \sigma(w^T x_i))\, x_i$$
• If $x_i$ is strongly misclassified (e.g. $y_i = 1$, $w^T x_i = -\text{big}$), then $\sigma(w^T x_i)(1 - \sigma(w^T x_i)) \approx 0$, i.e. $\nabla_w L(w, x_i, y_i) \approx 0$ $\Rightarrow$ the step size is very small while the sample is misclassified
With a cross entropy loss, $\nabla_w L(w, x_i, y_i)$ is proportional to the error.
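A quick numerical illustration of the argument: for a strongly misclassified sample, compare the scalar factor that multiplies $x_i$ in the two gradients. The value $w^T x_i = -10$ and the function names are assumptions picked for the example.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_scale_l2(a, y):
    """Factor multiplying x in the gradient of 1/2 (sigma(a) - y)^2, with a = w^T x."""
    s = sigmoid(a)
    return -(y - s) * s * (1.0 - s)

def grad_scale_cross_entropy(a, y):
    """Factor multiplying x in the gradient of the cross entropy loss."""
    s = sigmoid(a)
    return -(y - s)

a, y = -10.0, 1  # strongly misclassified: target 1, very negative pre-activation
l2_factor = grad_scale_l2(a, y)
ce_factor = grad_scale_cross_entropy(a, y)
```

The L2 factor is squashed by $\sigma(1 - \sigma)$ and is nearly zero, while the cross entropy factor stays close to the error itself (about -1 here).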
Linear classification, synthesis Linear multiclass classification
• samples $(x_i, y_i)$, labels $y_i \in \{0, \dots, k - 1\}$
• extend $x_i$ by adding a constant dimension equal to 1, which accounts for the bias
• linear models for each class: $w_j^T x$
• softmax transfer function: $P(y = j / x) = \hat{y}_j = \frac{\exp(w_j^T x)}{\sum_k \exp(w_k^T x)}$
• generalization of the sigmoid for a vectorial output
• cross entropy loss $L(\hat{y}, y) = -\log \hat{y}_y$
• by gradient descent:
$$\nabla_{w_j} L(w, x, y) = \sum_k \frac{\partial L}{\partial \hat{y}_k} \frac{\partial \hat{y}_k}{\partial w_j} = -(\delta_{j,y} - \hat{y}_j) x$$
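The multiclass gradient can be sketched directly from the formula. The weight layout (one list per class), the max-subtraction inside the softmax for numerical stability, and the numerical values are assumptions added for the illustration.

```python
import math

def softmax(scores):
    m = max(scores)  # subtracting the max does not change the result
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def loss_and_grads(W, x, y):
    """W: list of k weight vectors w_j. Returns the loss and one gradient per w_j."""
    scores = [sum(wi * xi for wi, xi in zip(w_j, x)) for w_j in W]
    y_hat = softmax(scores)
    loss = -math.log(y_hat[y])
    # gradient wrt w_j: -(delta_{j,y} - y_hat_j) x
    grads = [[-((1.0 if j == y else 0.0) - y_hat[j]) * xi for xi in x]
             for j in range(len(W))]
    return loss, grads

W = [[0.1, -0.2], [0.0, 0.3], [-0.4, 0.2]]  # 3 classes, 2 input dimensions
x = [1.0, 2.0]
loss, grads = loss_and_grads(W, x, 2)
```

Because the softmax outputs sum to one, the per-class gradients sum to zero along each input dimension, which is a convenient sanity check.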
Perceptron and linear separability
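Perceptrons perform linear separation in a predefined, fixed feature space.
The XOR: $\mathrm{xor}(x_1, x_2) = x_1 \bar{x}_2 + \bar{x}_1 x_2$
[Figure: left, the XOR in the $(x_1, x_2)$ plane, with no separating line ("??"); right, the same samples plotted against the product features of the formula above]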
Can we learn the φj (x) ???
Radial basis functions (RBF)
RBF (Broomhead, 1988)
• RBF kernel $\phi_0(x) = 1$, $\phi_j(x) = \exp\left(-\frac{\|x - \mu_j\|^2}{2\sigma_j^2}\right)$
• for regression (L2 loss) or classification (cross entropy loss)
• e.g. for regression:
$$\hat{y}(x) = w^T \phi(x)$$
$$L(w, x_i, y_i) = \|y_i - w^T \phi(x_i)\|^2$$
• What about the centers and variances? [Schwenker, 2001]
  • place them uniformly, randomly, or by vector quantization (k-means, GNG [Fritzke, 1994])
  • two phases: fix the centers/variances, fit the weights
  • three phases: fix the centers/variances, fit the weights, fit everything ($\nabla_\mu L$, $\nabla_\sigma L$, $\nabla_w L$)
Radial basis functions(RBF)
RBF are universal approximators [Park, Sandberg (1991)]
Denote $\mathcal{S}$ the family of functions based on RBF in $\mathbb{R}^d$:
$$\mathcal{S} = \Big\{ g : \mathbb{R}^d \to \mathbb{R}, \ g(x) = \sum_i w_i \phi_i(x), \ w \in \mathbb{R}^N \Big\}$$
Then $\mathcal{S}$ is dense in $L^p(\mathbb{R})$ for every $p \in [1, \infty)$.
Actually, the theorem applies for a larger class of functions $\phi_i$.
Feedforward neural networks (or MLP [Rumelhart, 1986])
[Figure: layered network. Inputs $x_0, \dots, x_3$ (layer 0), hidden layers 1 to $L-1$, output layer $L$; each layer also receives a constant input 1 for the bias]
$$a_i^{(1)} = \sum_j w_{ij}^{(1)} x_j, \quad y_i^{(1)} = g(a_i^{(1)})$$
$$a_i^{(L-1)} = \sum_j w_{ij}^{(L-1)} y_j^{(L-2)}, \quad y_i^{(L-1)} = g(a_i^{(L-1)})$$
$$a_i^{(L)} = \sum_j w_{ij}^{(L)} y_j^{(L-1)}, \quad y_i^{(L)} = f(a_i^{(L)})$$
Named MLP for historical reasons. Should be called FNN.
Feedforward neural networks
[Figure: layered network with inputs $x_j$ (layer 0), hidden layers 1 to $L-1$ and output layer $L$, where each unit computes $a_i^{(l)} = \sum_j w_{ij}^{(l)} y_j^{(l-1)}$ followed by a transfer function]
Architecture
• Depth: number of layers, without counting the input; deep = large depth
• Width: number of units per layer
• weights and biases for each unit
• Hidden transfer function f, output transfer function g
Feedforward neural networks
Architecture
Hidden transfer function
• Historically, $f$ taken as a sigmoid or tanh.
• Now, mainly Rectified Linear Units (ReLu) or similar: $f(x) = \max(x, 0)$
ReLu are more favorable for the gradient flow than the saturating functions [Krizhevsky(2012), Nair(2010), Jarrett(2009)]
[Figure: three transfer functions plotted over $[-4, 4]$, with output ranges $[0, 1]$, $[0, 5]$ and $[-1, 1]$]
Feedforward neural networks
Architecture
Output transfer function and loss
• for regression:
  • linear $f(x) = x$
  • L2 loss $L(\hat{y}, y) = \|y - \hat{y}\|^2$
• for multiclass classification:
  • softmax $\hat{y}_j = \frac{e^{a_j}}{\sum_k e^{a_k}}$
  • negative log likelihood loss $L(\hat{y}, y) = -\log(\hat{y}_y)$
FNN training : error backpropagation
Training by gradient descent
• initialize weights and biases $w_0$
• at every iteration, compute:
$$w \leftarrow w - \epsilon \nabla_w J$$
The partial derivatives $\frac{\partial J}{\partial w_i}$ ??
Fundamentally, use the chain rule within the computational graph linking any variable (inputs, weights, biases) to the output of the loss.
Backprop is usually attributed to [Rumelhart,1986] but [Werbos,1981] already introduced the idea.
Computing partial derivatives
Computational graph
A computational graph is a directed graph:
• nodes: variables (weights, inputs, outputs, targets, ...)
• edges: operations (ReLu, Softmax, $w^T x + b$, ..., losses, ...)
We only need to know, for each operation:
• the partial derivatives wrt its parameters
• the partial derivatives wrt its inputs
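Computing partial derivatives $\frac{\partial J}{\partial w_i}$
The chain rule: single path
Suppose there is a single path, e.g. $x_i \to u_3$, with
$$u_1 = w x_i + b, \quad u_2 = y_i - u_1, \quad u_3 = u_2^2$$
Applying the chain rule:
$$\frac{\partial u_3}{\partial x_i} = \frac{\partial u_3}{\partial u_2} \cdot \frac{\partial u_2}{\partial u_1} \cdot \frac{\partial u_1}{\partial x_i} = \frac{\partial}{\partial x_i}(y_i - w x_i - b)^2$$
$$\frac{\partial u_3}{\partial x_i} = 2 u_2 \cdot (-1) \cdot w = -2w(y_i - w x_i - b)$$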
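Computing partial derivatives $\frac{\partial J}{\partial w_i}$
The chain rule: multiple paths
Sum over all the paths (e.g. $u_3 = w_1 x_i + w_2 x_i$):
$$\frac{\partial u_3}{\partial x_i} = \sum_{j \in \{1, 2\}} \frac{\partial u_3}{\partial u_j} \frac{\partial u_j}{\partial x_i}$$
With $u_1 = w_1 x_i$, $u_2 = w_2 x_i$, $u_3 = u_1 + u_2$:
$$\frac{\partial u_3}{\partial x_i} = 1 \cdot w_2 + 1 \cdot w_1$$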
But, it is computationally expensive
There are a lot of paths...
[Figure: computational graph with 4 paths from $x_i$ to $u_5$]
We need to identify and sum over all these paths. Oops, there can be an exponential number of paths: $L$ fully-connected layers, $N$ units each, 1 output: $N^L$ paths.
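Let us be more efficient: forward-mode differentiation
Forward differentiation
Idea: to compute $\frac{\partial u_5}{\partial x_i}$, propagate forward $\frac{\partial}{\partial x_i}$
$$\frac{\partial u_1}{\partial x_i} = \frac{\partial u_1}{\partial z_1} \frac{\partial z_1}{\partial x_i} + \frac{\partial u_1}{\partial z_2} \frac{\partial z_2}{\partial x_i} = z_2 \cdot 1 + z_1 \cdot 0 = z_2 = w_1$$
But how to compute $\frac{\partial u_5}{\partial w_1}$? Well, propagate $\frac{\partial}{\partial w_1}$. And $\frac{\partial u_5}{\partial w_2}$? Propagate again... or...
Griewank(2010) Who Invented the Reverse Mode of Differentiation?
http://colah.github.io/posts/2015-08-Backprop/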
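Let us be even more efficient: reverse-mode differentiation
Reverse differentiation
Idea: to compute $\frac{\partial u_5}{\partial x_i}$, backpropagate $\frac{\partial u_5}{\partial \cdot}$
$$\frac{\partial u_5}{\partial u_1} = \frac{\partial u_5}{\partial u_3} \frac{\partial u_3}{\partial u_1} + \frac{\partial u_5}{\partial u_4} \frac{\partial u_4}{\partial u_1} = 1 \cdot w_3 + 1 \cdot w_6$$
We have $\frac{\partial u_5}{\partial x_i}$, but also $\frac{\partial u_5}{\partial w_1}$, $\frac{\partial u_5}{\partial w_2}$, ... all in a single pass!
Griewank(2010) Who Invented the Reverse Mode of Differentiation?
http://colah.github.io/posts/2015-08-Backprop/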
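To make the "single pass" concrete, here is a hand-written reverse sweep on the small single-path graph used earlier ($u_1 = w x + b$, $u_2 = y - u_1$, $u_3 = u_2^2$). It is a sketch with assumed function names and numerical values, not a general autodiff engine.

```python
def forward(x, y, w, b):
    """Forward pass: compute and keep every intermediate value."""
    u1 = w * x + b
    u2 = y - u1
    u3 = u2 ** 2
    return u1, u2, u3

def backward(x, w, u2):
    """Reverse sweep: propagate d u3 / d(.) from the output back to the leaves."""
    d_u3 = 1.0
    d_u2 = d_u3 * 2.0 * u2   # u3 = u2^2
    d_u1 = d_u2 * (-1.0)     # u2 = y - u1
    d_w = d_u1 * x           # u1 = w x + b
    d_x = d_u1 * w
    d_b = d_u1 * 1.0
    return {"x": d_x, "w": d_w, "b": d_b}

x, y, w, b = 2.0, 1.0, 0.5, -0.25
u1, u2, u3 = forward(x, y, w, b)
grads = backward(x, w, u2)
```

One backward sweep yields the derivatives with respect to `x`, `w` and `b` at once, and it reuses `u2` from the forward pass, which is why the forward activations must be kept.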
FNN training : error backpropagation
In neural networks, reverse-mode differentiation is called error backpropagation.
Training in two phases
• Evaluation of the output: forward propagation
• Evaluation of the gradient: reverse-mode differentiation
Careful: the reverse-mode differentiation uses the activations computed during the forward propagation.
Libraries like Theano augment the computational graph with nodes computing the gradient numerically by reverse-mode differentiation.
Universal approximator
Any well-behaved function can be approximated arbitrarily well with a single hidden layer FNN.
Intuition
• Take a sigmoid transfer function $f(x) = \frac{1}{1 + \exp(-\alpha(x - b_i))}$: this is the hidden layer
• subtract two such activations to get Gaussian-like kernels
• weight such subtractions, and you are back to the RBFs
[Figure: plot for $x \in [0, 1]$ illustrating the construction]
But then, why deep networks ??
Going deeper
• Single hidden layer FNN are universal approximators, but the hidden layer can be arbitrarily large
• A deep network (large number of layers) builds high level features by composing lower level features
• A shallow network directly learns these high level features
• Image analogy:
  • first layers: extract oriented contours (e.g. Gabors)
  • second layers: learn corners by combining contours
  • next layers: build up more and more complex features
• Theoretical works compare the expressiveness of {depth d FNN} with {depth d-1 FNN}
Learning deep architectures for AI, Bengio(2009), chap. 2; Benefits of depth in neural networks, Telgarsky(2016)
And why ReLu ?
Vanishing/exploding gradient [Hochreiter(1991),Bengio(1994)]
Consider $u_2 = f(w u_1)$
• Remember when the gradient is "backpropagated", it involves $\frac{\partial J}{\partial u_1} = \frac{\partial J}{\partial u_2} \frac{\partial u_2}{\partial u_1} = \frac{\partial J}{\partial u_2} \, w f'(w u_1)$
• backpropagated through $L$ layers: $(w f')^L$
• with $f(x) = \frac{1}{1 + e^{-x}}$, $f'(x) < 1$ for $x \neq 0$
• If $w f' \neq 1$, $(w f')^L \to \{0, \infty\}$
• $\Rightarrow$ the gradient vanishes or explodes
With ReLu, $f'(x) \in \{0, 1\}$. But you can get dead units.
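The factor $(w f')^L$ can be evaluated numerically. The sketch below makes the simplifying assumption that every layer has the same weight $w$ and the same pre-activation (here 0, where the sigmoid derivative is at its maximum, 0.25); the function names and values are illustrative.

```python
import math

def sigmoid_prime(a):
    s = 1.0 / (1.0 + math.exp(-a))
    return s * (1.0 - s)

def backprop_factor(w, a, n_layers):
    """(w f'(a))^L for a chain of L identical sigmoid layers."""
    return (w * sigmoid_prime(a)) ** n_layers

vanishing = backprop_factor(w=1.0, a=0.0, n_layers=50)  # (0.25)^50
exploding = backprop_factor(w=8.0, a=0.0, n_layers=50)  # (2.0)^50
```

With 50 layers the first factor is below 1e-30 and the second above 1e15, which is the vanishing/exploding behaviour described above.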
But the ReLus can die....
Why do they die?
If the input to a ReLu is negative, the gradient is 0; that's it, the unit is "forever" lost.
And then?
• Add a linear component for negative x: Leaky ReLu, Parametric ReLu [He(2015)]
• Exponential Linear Units [Clevert, Hochreiter(2016)]
[Figure: three plots of ReLu variants over $[-4, 4]$]
How to deal with the vanishing/exploding gradient Preventing vanishing gradient by preserving gradient flow
• Using ReLu, Leaky ReLu, PReLu, ELU to ensure a good flow of gradient
• Specific architectures:
  • ResNet (CNN): shortcut connections
  • LSTM (RNN): constant error carousel
Preventing exploding gradients
Gradient clipping [Pascanu, 2013]: clip the norm of the gradient
[Figure: 1 unit, $\sigma(x)$, 50 layers]
Regularization
Like kids, FNN can do a lot of things, but we must focus their expressiveness Chap 7, [Bengio et al.(2016)]
Regularization
L2 regularization
Add an L2 penalty on the weights, $\alpha > 0$:
$$J(w) = L(w) + \frac{\alpha}{2}\|w\|_2^2 = L(w) + \frac{\alpha}{2} w^T w$$
$$\nabla_w J = \nabla_w L + \alpha w$$
Example: RBF, 1 kernel per sample, N=30, noisy sinus.
[Figure: three panels showing the fit with $\alpha = 0$, the fit with $\alpha = 2$, and the weights $w^\star$]
Chap 7 of [Bengio et al.(2016)] for a geometrical interpretation
Regularization
L2 regularization
In principle, we should not regularize the bias. Example:
$$J(w) = \frac{1}{N} \sum_{i=1}^N \Big\| y_i - w_0 - \sum_{k \geq 1} w_k x_{i,k} \Big\|^2$$
$$\nabla_{w_0} J = 0 \Rightarrow w_0 = \Big(\frac{1}{N} \sum_i y_i\Big) - \sum_{k \geq 1} w_k \Big(\frac{1}{N} \sum_i x_{i,k}\Big)$$
e.g. if your data are centered, i.e. $\frac{1}{N} \sum_i x_{i,k} = 0$, then $w_0 = \frac{1}{N} \sum_i y_i$. Regularizing the bias might lead to underfitting.
Regularization
L1 regularization promotes sparsity
Add an L1 penalty on the weights.
$$J(w) = L(w) + \alpha \|w\|_1 = L(w) + \alpha \sum_k |w_k|$$
$$\nabla_w J = \nabla_w L + \alpha \, \mathrm{sign}(w)$$
where the sign is taken componentwise.
Example: RBF, 1 kernel per sample, N=30, noisy sinus, $\alpha = 0.003$.
[Figure: three panels showing the fit with $\alpha = 0$, the fit with $\alpha = 0.003$, and the weights $w^\star$]
Regularization Drop-out regularization [Srivastava(2014)]
Idea 1: prevent co-adaptation. A unit is good by itself, not because others are doing part of the job.
Idea 2: combine an exponential number of networks (ensemble).
How:
• For each minibatch, keep hidden and input activations with probability p (p=0.5 for hidden, p=0.8 for inputs). At test time, multiply all the activations by p
• "Inverted": scale by 1/p at training; no scaling at test time
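The "inverted" variant can be sketched as below, with p the probability of keeping an activation. The function names and the list-based representation of a layer are assumptions for the illustration.

```python
import random

def dropout_train(activations, p, rng):
    """Keep each activation with probability p and scale the kept ones by 1/p."""
    return [a / p if rng.random() < p else 0.0 for a in activations]

def dropout_test(activations):
    """Inverted dropout: nothing to do at test time."""
    return list(activations)

rng = random.Random(0)
acts = [1.0] * 10000
dropped = dropout_train(acts, p=0.5, rng=rng)
mean_train = sum(dropped) / len(dropped)
```

Scaling by 1/p during training keeps the expected activation unchanged (the mean above stays near 1.0), which is why no rescaling is needed at test time.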
Regularization (dropout)
[Figure: dropout illustration, Srivastava(2014), Hinton(2012)]
Usually after FC layers (p=0.5) and input layer. Can be interpreted as training/averaging all the possible subnetworks.
Regularization
Split your data in three sets:
• training set: for training
• validation set: for choosing hyperparameters
• test set: for estimating the generalization error
Early stopping
Idea: monitor your error on the validation set (U-shaped performance). Keep the model with the lowest validation error during training.
Training by some forms of gradient descent
$$w(t + 1) \leftarrow w(t) - \epsilon \nabla_w J(w(t))$$
* Chap 8 [Bengio et al. (2016)] * A. Karpathy : http://cs231n.github.io/neural-networks-3/
But wait...
Does it make sense to apply gradient descent to neural networks?
• we cannot get better than a local minimum ?!?!
• and neural networks lead to non-convex optimization problems, i.e. a lot of local minima (think about the symmetries)
But empirically, most local minima are close to the global minimum with large/deep networks.
Choromanska, 2015 : The Loss Surface of Multilayer Nets Dauphin, 2014 : Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
Pascanu, 2014 : On the saddle point problem for non-convex optimization
Identifying the critical points The hessian matrix
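• Matrix of second order derivatives. It informs on the local curvature:
$$H(\theta) = \nabla^2_\theta J = \begin{pmatrix} \frac{\partial^2 J}{\partial \theta_1^2} & \frac{\partial^2 J}{\partial \theta_1 \partial \theta_2} & \cdots & \frac{\partial^2 J}{\partial \theta_1 \partial \theta_n} \\ \frac{\partial^2 J}{\partial \theta_2 \partial \theta_1} & \frac{\partial^2 J}{\partial \theta_2^2} & \cdots & \frac{\partial^2 J}{\partial \theta_2 \partial \theta_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 J}{\partial \theta_n \partial \theta_1} & \frac{\partial^2 J}{\partial \theta_n \partial \theta_2} & \cdots & \frac{\partial^2 J}{\partial \theta_n^2} \end{pmatrix}$$
• for a convex function: H is symmetric, positive semidefinite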
Identifying the type of critical points Eigenvalues of H
• a critical point is where $\nabla_\theta J = 0$
• if all eigenvalues(H) > 0: local minimum
• if all eigenvalues(H) < 0: local maximum
• if eigenvalues(H) are both positive and negative: saddle point
[Figure: surfaces of $x^2 + y^2$, $-(x^2 + y^2)$ and $x^2 - y^2$]
Identifying the type of critical points
And if H is degenerate
If H is degenerate (some eigenvalues = 0, det(H) = 0), we can have:
• a local minimum: $f(x, y) = x^4 + y^4$, $H(x, y) = \begin{pmatrix} 12x^2 & 0 \\ 0 & 12y^2 \end{pmatrix}$
• a local maximum: $f(x, y) = -(x^4 + y^4)$, $H(x, y) = \begin{pmatrix} -12x^2 & 0 \\ 0 & -12y^2 \end{pmatrix}$
• a saddle point: $f(x, y) = x^3 + y^2$, $H(x, y) = \begin{pmatrix} 6x & 0 \\ 0 & 2 \end{pmatrix}$
Local minima are not an issue with deep networks
Local minima and their loss Experiment : One hidden layer, Mnist, Trained with SGD.
Converges mostly to local minima and some saddle points. The distribution of test loss of the local minima tend to shrink. Index α : fraction of negative eigenvalues of the Hessian Choromanska, 2015 : The Loss Surface of Multilayer Nets
Saddle points seem to be the issue Saddle points and their loss Experiment : “small MLP”, Trained with Saddle-free Newton (converges to critical points).
Used Newton to discover the critical points.
High loss critical points are saddle points.
Low loss critical points are local minima.
Index α: fraction of negative eigenvalues of the Hessian
Dauphin, 2014: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
Training 1st order methods
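$$J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^T \nabla_\theta J(\theta_0)$$
$$\theta \leftarrow \theta - \epsilon \nabla_\theta J$$
Rationale: first order approximation
$$\hat{J}(\theta) = J(\theta_0) + (\theta - \theta_0)^T \nabla_\theta J(\theta_0)$$
$$(\theta - \theta_0) = -\epsilon \nabla_\theta J(\theta_0) \Rightarrow \hat{J}(\theta) = J(\theta_0) - \epsilon \|\nabla_\theta J(\theta_0)\|^2$$
For small $\|\epsilon \nabla_\theta J(\theta_0)\|$: $J(\theta_0 - \epsilon \nabla_\theta J(\theta_0)) \leq J(\theta_0)$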
Training, 1st order methods
Minibatch Stochastic Gradient descent
• start at $\theta_0$
• for every minibatch:
$$\theta(t + 1) = \theta(t) - \epsilon \Big(\frac{1}{M} \sum_i \nabla_\theta J_i(\theta)\Big)$$
• M = 1: very noisy estimate, stochastic gradient descent
• M = N: true gradient, batch gradient descent
• (minibatch) SGD converges faster
• The trajectory may converge slowly or diverge if $\epsilon$ is not appropriate
Training, 1st order methods
Stochastic Gradient descent: example
Take N = 30 samples with
$$y = 3x + 2 + \mathcal{U}(-0.1, 0.1)$$
Let us perform linear regression ($\hat{y} = wx + b$, L2 loss) with SGD.
[Figure: scatter plot of the samples, $y$ against $x$, for $x \in [-10, 10]$]
Training, 1st order methods Stochastic Gradient descent : ZigZag
$\epsilon = 0.005$, $b_0 = 10$, $w_0 = 5$
Converges to $w^\star = 2.9975$, $b^\star = 1.9882$
[Figure: left, trajectory in the $(b, w)$ plane; right, $\log(J)$ against the iteration (0 to 1000)]
Training, 1st order methods
Momentum
Idea: let us damp the oscillations with a low-pass filter on $\nabla_\theta$
• Start at $\theta_0$, $v = 0$
• for every minibatch:
$$v(t + 1) = \alpha v(t) - \epsilon \nabla_\theta J(\theta(t))$$
$$\theta(t + 1) = \theta(t) + v(t + 1)$$
Usually, $\alpha \approx 0.9$
Experiment on http://distill.pub/2017/momentum/
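A small sketch comparing plain gradient descent ($\alpha = 0$) with momentum on an ill-conditioned quadratic. The quadratic $J = \frac{1}{2}(\theta_0^2 + 25\,\theta_1^2)$, the step size and the starting point are assumptions for the illustration; the parameters are moved with the freshly updated velocity, as in the Nesterov update shown later, which is also an assumption about the intended rule.

```python
def J(theta):
    """Illustrative ill-conditioned quadratic."""
    return 0.5 * (theta[0] ** 2 + 25.0 * theta[1] ** 2)

def grad_J(theta):
    return [theta[0], 25.0 * theta[1]]

def momentum_descent(theta, eps, alpha, n_steps):
    v = [0.0] * len(theta)
    for _ in range(n_steps):
        g = grad_J(theta)
        # low-pass filter of the gradient, then move along the velocity
        v = [alpha * vi - eps * gi for vi, gi in zip(v, g)]
        theta = [ti + vi for ti, vi in zip(theta, v)]
    return theta

plain = momentum_descent([10.0, 1.0], eps=0.02, alpha=0.0, n_steps=100)
with_momentum = momentum_descent([10.0, 1.0], eps=0.02, alpha=0.9, n_steps=100)
```

After the same number of steps, the run with momentum should reach a much lower loss, because the accumulated velocity speeds up progress along the flat direction.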
Training, 1st order methods Stochastic Gradient descent with momentum
$\epsilon = 0.005$, $\alpha = 0.6$, $b_0 = 10$, $w_0 = 5$
Converges to $w^\star = 2.9933$, $b^\star = 1.9837$
[Figure: left, trajectory in the $(b, w)$ plane; right, $\log(J)$ against the iteration (0 to 1000)]
Advised: set $\alpha \in \{0.5, 0.9, 0.99\}$
Training, 1st order methods
SGD without/with momentum
[Figure: $\log(J)$ against the iteration (0 to 1000), without momentum (left) and with momentum (right)]
Training, 1st order methods
Nesterov Momentum [Sutskever, PhD Thesis]
Idea: look ahead to potentially correct the update. Based on Nesterov Accelerated Gradient.
• Start at $\theta_0$, $v = 0$
• for every minibatch:
$$\tilde{\theta}(t + 1) = \theta(t) + \alpha v(t)$$
$$v(t + 1) = \alpha v(t) - \epsilon \nabla_\theta J(\tilde{\theta}(t + 1))$$
$$\theta(t + 1) = \theta(t) + v(t + 1)$$
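The three update lines translate almost literally into code. In this sketch the gradient function, the quadratic test problem and the hyperparameters are assumptions chosen for the illustration.

```python
def nesterov_descent(grad, theta, eps, alpha, n_steps):
    v = [0.0] * len(theta)
    for _ in range(n_steps):
        # look ahead along the current velocity
        lookahead = [ti + alpha * vi for ti, vi in zip(theta, v)]
        g = grad(lookahead)  # gradient evaluated at the look-ahead point
        v = [alpha * vi - eps * gi for vi, gi in zip(v, g)]
        theta = [ti + vi for ti, vi in zip(theta, v)]
    return theta

def grad_quadratic(theta):
    """Gradient of the illustrative quadratic 1/2 (theta_0^2 + 25 theta_1^2)."""
    return [theta[0], 25.0 * theta[1]]

theta_final = nesterov_descent(grad_quadratic, [10.0, 1.0],
                               eps=0.02, alpha=0.9, n_steps=200)
```

The only difference with classical momentum is where the gradient is evaluated: at `lookahead` rather than at `theta`.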
Training, 1st order methods SGD with Nesterov Momentum
$\epsilon = 0.005$, $\alpha = 0.8$, $b_0 = 10$, $w_0 = 5$
Converges to $w^\star = 2.9914$, $b^\star = 1.9738$
[Figure: left, trajectory in the $(b, w)$ plane; right, $\log(J)$ against the iteration (0 to 1000)]
In this experiment, Nesterov momentum allowed a larger momentum coefficient. With $\alpha = 0.8$, classical momentum strongly oscillates.
Training, 1st order methods
SGD/ SGD momentum / SGD nesterov momentum
[Figure: $\log(J)$ against the iteration (0 to 1000) for SGD, SGD with momentum and SGD with Nesterov momentum]
Training 1st order methods with adaptive learning rates
Training : adapting the learning rate
Learning rate annealing
Some possible schedules:
• linear decay between $\epsilon_0$ and $\epsilon_\tau$
• halve the learning rate when the validation error stops improving
Training : adapting the learning rate
Adagrad [Duchi, 2011]
• Accumulate the square of the gradients:
$$r(t + 1) = r(t) + \nabla_\theta J(\theta(t)) \odot \nabla_\theta J(\theta(t))$$
• Scale individually the learning rates:
$$\theta(t + 1) = \theta(t) - \frac{\epsilon}{\delta + \sqrt{r(t + 1)}} \odot \nabla_\theta J(\theta(t))$$
The $\sqrt{\cdot}$ is experimentally critical. $\delta \approx [10^{-8}, 10^{-4}]$, for numerical stability.
Small gradients $\Rightarrow$ bigger learning rate
Big gradients $\Rightarrow$ smaller learning rate
Accumulation from the beginning is too aggressive. Learning rates decrease too fast.
Training : adapting the learning rate
RMSProp [Hinton, unpublished]
Idea: use an exponentially decaying integration of the gradient
• Accumulate the square of the gradients:
$$r(t + 1) = \rho r(t) + (1 - \rho) \nabla_\theta J(\theta(t)) \odot \nabla_\theta J(\theta(t))$$
• Scale individually the learning rates:
$$\theta(t + 1) = \theta(t) - \frac{\epsilon}{\delta + \sqrt{r(t + 1)}} \odot \nabla_\theta J(\theta(t))$$
$\rho \approx 0.9$.
And some others: Adadelta [Zeiler, 2012], Adam [Kingma, 2014], ...
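Adagrad and RMSProp share the same scaled update and differ only in how the squared gradients are accumulated, so a single step function can cover both. The function name, the `rho=None` convention and the toy gradient are assumptions for this sketch.

```python
import math

def adaptive_step(theta, r, grad, eps, delta=1e-8, rho=None):
    """One update. rho=None accumulates like Adagrad, rho in (0, 1) like RMSProp."""
    if rho is None:
        r = [ri + gi * gi for ri, gi in zip(r, grad)]
    else:
        r = [rho * ri + (1.0 - rho) * gi * gi for ri, gi in zip(r, grad)]
    # per-coordinate learning rate eps / (delta + sqrt(r))
    theta = [ti - eps / (delta + math.sqrt(ri)) * gi
             for ti, ri, gi in zip(theta, r, grad)]
    return theta, r

grad = [1.0, 100.0]  # two coordinates with very different gradient magnitudes
theta_ada, r_ada = adaptive_step([0.0, 0.0], [0.0, 0.0], grad, eps=0.1)
theta_rms, r_rms = adaptive_step([0.0, 0.0], [0.0, 0.0], grad, eps=0.1, rho=0.9)
```

In both cases the two coordinates move by the same amount despite a factor 100 between their gradients, which illustrates the individual scaling of the learning rates.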
Training with 1st order methods So, which one do I use ? [Bengio et al. 2016] There is currently no consensus[...]no single best algorithm has emerged[...]the most popular and actively in use include SGD, SGD with momentum, RMSprop, RMSprop with momentum, Adadelta and Adam
[Animation: comparison of the optimizers, A. Karpathy]
Schaul(2014). Unit Tests for Stochastic Optimization
Training A glimpse into 2nd order methods
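$$J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^T \nabla_\theta J(\theta_0) + \frac{1}{2}(\theta - \theta_0)^T \nabla^2_\theta J(\theta_0)(\theta - \theta_0)$$
$\nabla_\theta J(\theta_0)$: gradient vector
$\nabla^2_\theta J(\theta_0)$: Hessian matrix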
Idea : use a better local approximation to make a more informed update
Training : 2nd order methods Newton method From a 2nd order taylor approximation
$$J(\theta) \approx J(\theta_0) + (\theta - \theta_0)^T \nabla_\theta J(\theta_0) + \frac{1}{2}(\theta - \theta_0)^T \nabla^2_\theta J(\theta_0)(\theta - \theta_0)$$
Critical point at:
$$\nabla_\theta J(\theta) = 0 \Rightarrow \theta = \theta_0 - H^{-1} \nabla_\theta J(\theta_0)$$
Critical points (min, max, saddle) are attractors for Newton !!
• cool: we can locate critical points
• but: do not use it for optimizing a neural network!
Dauphin(2014): Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
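The attraction towards saddle points is easy to see on $f(x, y) = x^2 - y^2$, whose Hessian is constant and diagonal. The sketch below compares one Newton step with one gradient step; the starting point and step size are assumptions for the illustration.

```python
def grad(theta):
    """Gradient of f(x, y) = x^2 - y^2."""
    x, y = theta
    return [2.0 * x, -2.0 * y]

HESSIAN_DIAG = [2.0, -2.0]  # the Hessian of x^2 - y^2 is diag(2, -2)

def newton_step(theta):
    """theta - H^{-1} grad, using the diagonal Hessian."""
    g = grad(theta)
    return [t - gi / hi for t, gi, hi in zip(theta, g, HESSIAN_DIAG)]

def gradient_step(theta, eps=0.1):
    g = grad(theta)
    return [t - eps * gi for t, gi in zip(theta, g)]

start = [1.0, 0.5]
after_newton = newton_step(start)      # jumps onto the saddle point (0, 0)
after_gradient = gradient_step(start)  # moves away from it along y
```

On a quadratic, Newton lands on the critical point in a single step whatever its type, whereas gradient descent escapes the saddle along the direction of negative curvature.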
Training : 2nd order methods
Second order methods require a larger batchsize. Some algorithms
Conjugate gradient : no need to compute the hessian, • guaranteed to converge in k steps for k dimensional quadratic function, Saddle-free Newton [Daupin, 2014] • Hessian free optimization (truncated newton) [Martens, 2010] • BFGS (quasi-newton): approximation of H−1, needs to store • it which is larged for deep networks, L-BFGS : limited memory BFGS •
Initialization and the importance of good activation distributions
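Preprocessing your inputs
Gradient descent converges faster if your data are normalized and decorrelated. Denote by $x_i \in \mathbb{R}^d$ your input data, $\hat{x}_i$ its normalized version, and $\epsilon$ a small constant avoiding divisions by zero.
Input normalization
• Min-max scaling : $\forall i, j,\ \hat{x}_{i,j} = \frac{x_{i,j} - \min_k x_{k,j}}{\max_k x_{k,j} - \min_k x_{k,j} + \epsilon}$
• Z-score normalization (goal: $\hat{\mu}_j = 0$, $\hat{\sigma}_j = 1$) : $\forall i, j,\ \hat{x}_{i,j} = \frac{x_{i,j} - \mu_j}{\sigma_j + \epsilon}$
• ZCA-Whitening (goal: $\hat{\mu}_j = 0$, $\hat{\sigma}_j = 1$, $\frac{1}{n-1} \hat{X}\hat{X}^T = I$) : $\hat{X} = WX$, $W = \sqrt{n-1}\,(XX^T)^{-1/2}$, with $X$ the centered data matrix (one sample per column)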
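The three normalizations can be sketched in NumPy as follows. The synthetic data are an assumption; note that samples are stored as rows here (the usual NumPy layout), so the whitening matrix is applied on the right.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: n = 500 samples (rows), d = 3 correlated, shifted features
n, d = 500, 3
A = np.array([[2.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.5, 0.3, 0.2]])
X = rng.normal(size=(n, d)) @ A.T + np.array([5.0, -1.0, 3.0])
eps = 1e-8

# Min-max scaling: each feature mapped to [0, 1]
x_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0) + eps)

# Z-score: zero mean, unit variance per feature
mu, sigma = X.mean(axis=0), X.std(axis=0)
x_zscore = (X - mu) / (sigma + eps)

# ZCA whitening: multiply by the inverse square root of the covariance matrix
Xc = X - mu
cov = Xc.T @ Xc / (n - 1)
eigval, eigvec = np.linalg.eigh(cov)
W = eigvec @ np.diag(1.0 / np.sqrt(eigval)) @ eigvec.T
x_zca = Xc @ W
cov_zca = x_zca.T @ x_zca / (n - 1)   # should be close to the identity
```

Unlike the z-score, ZCA whitening also removes the correlations between features, at the cost of a $d \times d$ eigendecomposition.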
Z-score normalization / Standardizing the inputs
Remember our linear regression : $y = 3x + 2 + \mathcal{U}(-0.1, 0.1)$, L2 loss, 30 1D samples
[Figure: loss landscape in the (b, w) plane. Panels: loss with raw input, loss with standardized input.]
With standardized inputs, the gradient always points to the minimum !!
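One way to see why standardizing helps: for the L2 loss of a 1D linear regression, the Hessian with respect to (w, b) depends only on the inputs, and its condition number tells how elongated the loss contours are. The sketch below compares raw and standardized inputs; the input range [0, 10] is an assumption, since the slide does not state it.

```python
import numpy as np

rng = np.random.default_rng(0)
# 30 1D samples as on the slide; the input range is an assumption
x = rng.uniform(0.0, 10.0, size=30)

def hessian_mse(x):
    """Hessian of J(w, b) = mean((w * x + b - y)**2) with respect to (w, b)."""
    return 2.0 * np.array([[np.mean(x * x), np.mean(x)],
                           [np.mean(x), 1.0]])

x_std = (x - x.mean()) / x.std()
cond_raw = np.linalg.cond(hessian_mse(x))      # elongated contours
cond_std = np.linalg.cond(hessian_mse(x_std))  # circular contours
```

With standardized inputs the Hessian is a multiple of the identity, so the contours are circles and the negative gradient points straight at the minimum.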
The starting point of training is critical
Pretraining
Historically, training deep FNN was known to be hard, i.e. bad generalization errors. The starting point of a gradient descent has a dramatic impact.
• neural history compressors [Schmidhuber, 1991]
• competitive learning [Maclin and Shavlik, 1995]
• unsupervised pretraining based on Boltzmann Machines [Hinton, 2006]
• unsupervised pretraining based on autoencoders [Bengio, 2006]
For example, pretraining with autoencoders
Idea : extract features that allow to reconstruct the previous layer activities
Followed by fine-tuning with gradient descent
Does not appear to be that critical nowadays (because of ReLu-like activations and initialization strategies)
Initializing the weights/biases
Thoughts
• initially behave as a linear predictor; non linearities should be activated by the learning algorithm only if necessary.
• units should not extract the same features : symmetry breaking, otherwise, same gradients.
Suppose the inputs standardized, make output and gradients standardized (the second argument of $\mathcal{N}$ is a standard deviation):
• sigmoid : $b = 0$, $w \sim \mathcal{N}(0, \frac{1}{\sqrt{\text{fan}_{in}}})$ $\Rightarrow$ in the linear part
• sigm, tanh : $b = 0$, $w \sim \mathcal{U}(-\frac{\sqrt{6}}{\sqrt{n_i + n_o}}, \frac{\sqrt{6}}{\sqrt{n_i + n_o}})$ [Glorot, 2010]
• ReLu : $b = 0$, $w \sim \mathcal{N}(0, \sqrt{2/\text{fan}_{in}})$ [He(2015)]
LeCun initialization
Initialization in the linear regime for the forward pass
Aim : Initialize the weights so that f acts in its linear part, i.e. w close to 0
• Use the symmetric transfer function $f(x) = 1.7159 \tanh(\frac{2}{3} x)$ $\Rightarrow f(1) = 1, f(-1) = -1$
• Center, normalize (unit variance) and decorrelate the input dimensions
• initialize the weights from a distribution with $\mu = 0$, $\sigma = \frac{1}{\sqrt{n_i}}$
• set the biases to 0
• This ensures the output of the layer is zero mean, unit variance
Efficient Backprop, LeCun et al. (1998); Generalization and network design strategies, LeCun (1989)
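A quick numerical check of this recipe, as a sketch: with standardized inputs and weights of standard deviation $1/\sqrt{n_i}$, the pre-activations of a layer have roughly unit variance. The layer sizes and the Gaussian inputs are assumptions made for the example.

```python
import numpy as np

def lecun_tanh(x):
    """Scaled tanh transfer function, f(x) = 1.7159 tanh(2x / 3)."""
    return 1.7159 * np.tanh(2.0 / 3.0 * x)

rng = np.random.default_rng(0)
n_in, n_out, n_samples = 256, 128, 2000     # hypothetical layer sizes
x = rng.normal(size=(n_samples, n_in))      # standardized, decorrelated inputs
W = rng.normal(0.0, 1.0 / np.sqrt(n_in), size=(n_in, n_out))
b = np.zeros(n_out)
a = x @ W + b                               # pre-activations
z = lecun_tanh(a)                           # layer outputs
```

The constants 1.7159 and 2/3 are what make $f(\pm 1) = \pm 1$, which can be verified with `lecun_tanh(1.0)`.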
Glorot initialization strategy Keep same distribution for the forward and backward pass
• The activations and the gradients should have, initially, similar distributions across the layers to avoid vanishing/exploding gradient
• The input dimensions should be centered, normalized, uncorrelated
• With a transfer function f, $f'(0) = 1$, it turns to : $\forall i,\ Var[W^i] = \frac{2}{\text{fan}_{in} + \text{fan}_{out}}$
Glorot (Xavier) Uniform : $W \sim \mathcal{U}[-\frac{\sqrt{6}}{\sqrt{n_i + n_o}}, \frac{\sqrt{6}}{\sqrt{n_i + n_o}}]$, $b = 0$
Glorot (Xavier) Normal : $W \sim \mathcal{N}(0, \frac{\sqrt{2}}{\sqrt{\text{fan}_{in} + \text{fan}_{out}}})$, $b = 0$
Understanding the difficulty of training deep feedforward neural networks, Glorot, Bengio, AISTATS, JMLR W&CP (2010).
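Both Glorot variants target the same weight variance $2 / (\text{fan}_{in} + \text{fan}_{out})$, which the following sketch checks empirically. The function names and layer sizes are chosen for this example only.

```python
import numpy as np

def glorot_uniform(rng, n_in, n_out):
    """Uniform weights in [-sqrt(6)/sqrt(n_in+n_out), +sqrt(6)/sqrt(n_in+n_out)]."""
    limit = np.sqrt(6.0) / np.sqrt(n_in + n_out)
    return rng.uniform(-limit, limit, size=(n_in, n_out))

def glorot_normal(rng, n_in, n_out):
    """Gaussian weights with standard deviation sqrt(2)/sqrt(n_in+n_out)."""
    std = np.sqrt(2.0) / np.sqrt(n_in + n_out)
    return rng.normal(0.0, std, size=(n_in, n_out))

rng = np.random.default_rng(0)
n_in, n_out = 300, 500                  # hypothetical layer sizes
target_var = 2.0 / (n_in + n_out)
W_u = glorot_uniform(rng, n_in, n_out)
W_n = glorot_normal(rng, n_in, n_out)
```

The uniform bound follows from the variance of $\mathcal{U}[-a, a]$ being $a^2/3$: with $a^2 = 6/(n_i + n_o)$ the variance is $2/(n_i + n_o)$.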
He initialization strategy Designed for rectifier non linearities (ReLU, PReLU). Keep same distribution for the forward and backward pass
• The activations and the gradients should have, initially, similar distributions across the layers
• The input dimensions should be centered, normalized, uncorrelated
• With a ReLU transfer function, d Conv $k \times k$ filters on c channels : $\frac{1}{2} k^2 c\, Var[w_l] = 1$ (forward pass), $\frac{1}{2} k^2 d\, Var[w_l] = 1$ (backward pass)
He Uniform : $W \sim \mathcal{U}[-\frac{\sqrt{6}}{\sqrt{n_i}}, \frac{\sqrt{6}}{\sqrt{n_i}}]$, $b = 0$
He Normal : $W \sim \mathcal{N}(0, \frac{\sqrt{2}}{\sqrt{n_i}})$
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, He et al, ICCV(2015).
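The effect of the factor 2 can be observed by pushing random inputs through a stack of fully connected ReLU layers. This is a sketch under assumptions: depth, width and the comparison with a standard deviation of $\sqrt{1/n_i}$ are choices made for the illustration.

```python
import numpy as np

def forward_scale(std_fn, depth=20, width=512, n_samples=256, seed=0):
    """Mean squared activation after `depth` fully connected ReLU layers (biases at 0)."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(n_samples, width))        # standardized inputs
    for _ in range(depth):
        W = rng.normal(0.0, std_fn(width), size=(width, width))
        h = np.maximum(h @ W, 0.0)                 # ReLU
    return np.mean(h ** 2)

scale_he = forward_scale(lambda n: np.sqrt(2.0 / n))     # He normal
scale_small = forward_scale(lambda n: np.sqrt(1.0 / n))  # variance 1/n: signal halves per layer
```

With the He standard deviation the scale of the activations stays of order one through the 20 layers, while with variance $1/n_i$ it shrinks by about a factor 2 per layer.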
Batch normalization [Ioffe, Szegedy(2015)]
Internal Covariate Shift Def [Ioffe(2015)]: The change in the distribution of network activations due to the change in network parameters during training Exp : 3 FC(100 units), sigmoid, output softmax, MNIST
Measure: distribution of activations of the last hidden layer during training, {15, 50, 85}th percentiles
Batch normalization [Ioffe, Szegedy(2015)]
Batch normalization to prevent covariate shift
Idea: standardize the activations of every layer to keep the same distributions during training.
• The gradient must be aware of this normalization, otherwise may get parameter explosion (see Ioffe(2015))
• Introduces a differentiable BN normalization layer : $z = g(Wu + b) \rightarrow z = g(BN(Wu + b))$
$$y_i = BN_{\gamma,\beta}(x_i) = \gamma \hat{x}_i + \beta, \quad \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$
$\mu_B, \sigma_B^2$ : minibatch mean, variance
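The forward pass of the BN layer at training time can be sketched in a few lines of NumPy. The minibatch is synthetic and the shapes are assumptions; the backward pass is not shown.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Batch normalization of a minibatch x with shape (batch, features)."""
    mu_b = x.mean(axis=0)                      # minibatch mean, per feature
    var_b = x.var(axis=0)                      # minibatch variance, per feature
    x_hat = (x - mu_b) / np.sqrt(var_b + eps)  # standardized activations
    return gamma * x_hat + beta                # learned scale and shift

rng = np.random.default_rng(0)
x = rng.normal(3.0, 5.0, size=(64, 10))        # hypothetical pre-activations Wu + b
y = batchnorm_forward(x, gamma=np.ones(10), beta=np.zeros(10))
```

With $\gamma = 1$ and $\beta = 0$ the output is simply standardized; training $\gamma$ and $\beta$ lets the network recover any other mean and scale if that is useful.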
Batch Normalization Train and test time
• Where : everywhere along the network, before ReLus
• at training : standardize each unit’s activations over a minibatch
• at test :
  • with one sample, standardize over the population
  • use mean/variance from the train set
  • standardize over a batch of test samples
Learning much faster, better generalization
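The train/test distinction can be sketched with a small layer object that keeps running statistics. Using an exponential moving average of the minibatch statistics as the estimate of the train set mean/variance is an assumption of this sketch (a common choice), as are the class name and the `momentum` value; only the forward pass is shown.

```python
import numpy as np

class BatchNorm1D:
    """Minimal batch normalization layer with running statistics (forward pass only)."""
    def __init__(self, n_features, momentum=0.9, eps=1e-5):
        self.gamma = np.ones(n_features)
        self.beta = np.zeros(n_features)
        self.running_mean = np.zeros(n_features)
        self.running_var = np.ones(n_features)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, training):
        if training:
            # standardize over the minibatch and update the running estimates
            mean, var = x.mean(axis=0), x.var(axis=0)
            m = self.momentum
            self.running_mean = m * self.running_mean + (1 - m) * mean
            self.running_var = m * self.running_var + (1 - m) * var
        else:
            # at test time, reuse the statistics gathered on the training data
            mean, var = self.running_mean, self.running_var
        return self.gamma * (x - mean) / np.sqrt(var + self.eps) + self.beta

rng = np.random.default_rng(0)
bn = BatchNorm1D(4)
for _ in range(200):
    bn(rng.normal(2.0, 3.0, size=(32, 4)), training=True)
single = bn(np.full((1, 4), 2.0), training=False)   # one test sample
```

A single test sample cannot be standardized on its own (its variance is zero), which is why the stored statistics are needed.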
Convolutional Neural Networks (CNN)
Neocognitron [Fukushima(1980)]
LeNet5 [LeCun(1998)]
Idea : Exploiting the structure of the inputs
Ideas
• Features detected by convolutions with local kernels
• parameter sharing, sparse weights ⇒ strongly regularized FNN (e.g. detecting an oriented edge is translation invariant)
The CNN of LeCun(1998)
Architecture
• (Conv/NonLinear/Pool) * n • followed by fully connected layers
General architecture of a CNN
Architecture : Conv/ReLu/Pool
[Figure: 3 input channels, K kernels with bias, then ReLu and Max pooling, each stage producing K feature maps]
• Convolution : depth, size (3x3, 5x5), pad, stride • Max pooling : size, stride (e.g. (2,2))
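The size, pad and stride parameters determine the spatial size of the feature maps through the standard formula $\lfloor (n + 2p - k)/s \rfloor + 1$. The helper below is a sketch; its name and the example sizes are chosen for illustration.

```python
def conv_output_size(size, kernel, stride=1, pad=0):
    """Spatial output size of a convolution or pooling layer along one dimension."""
    return (size + 2 * pad - kernel) // stride + 1

# 28x28 input, as for MNIST
after_conv = conv_output_size(28, kernel=5)                    # 5x5 conv, no padding
after_same = conv_output_size(28, kernel=3, pad=1)             # 3x3 conv, "same" padding
after_pool = conv_output_size(after_conv, kernel=2, stride=2)  # 2x2 max pooling, stride 2
```

The same formula applies to pooling layers, since only the kernel size, padding and stride matter for the output size.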
Recent CNN
Multicolumn CDNN, Ciresan(2012) Ensemble of convolutional neural networks trained with dataset augmentation.
0.23 % test misclassification on MNIST. 1.5 million parameters.
Recent CNN
SuperVision, Krizhevsky(2012)
• top 5 error of 16% compared to runner-up with 26% error.
• several convolutions were stacked without pooling,
• trained on 2 GPUs, for a week
• 60 million parameters, dropout, momentum, L2 penalty, dataset augmentation (translations, reflections, PCA)
Recent CNN
VGG, Simonyan(2014)
• 16 layers : 13 convolutional, 3 fully connected
• 3x3 convolutions, 2x2 pooling
• two stacked 3x3 convolutions cover the same receptive field as a 5x5 convolution, with fewer parameters :
  • K input channels, K output channels, 5x5 convolution ⇒ $25K^2$ parameters
  • K input channels, K output channels, two 3x3 convolutions ⇒ $18K^2$ parameters
• 140 million parameters, dropout, momentum, L2 penalty, learning rate annealing, trained progressively
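The parameter comparison is simple arithmetic, sketched below with biases ignored. The channel count `K = 64` is an arbitrary choice; the relative saving does not depend on it.

```python
def conv_weights(k, c_in, c_out):
    """Number of weights (biases ignored) of a k x k convolution."""
    return k * k * c_in * c_out

K = 64                                   # hypothetical channel count
one_5x5 = conv_weights(5, K, K)          # 25 K^2
two_3x3 = 2 * conv_weights(3, K, K)      # 18 K^2
saving = 1 - two_3x3 / one_5x5           # fraction of weights saved
```

The stacked version also inserts an extra non-linearity between the two 3x3 convolutions.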
Recent CNN
Inception, GoogLeNet (Szegedy,2015) Idea : decrease the number of parameters by using 1x1 convolutions for cross-channel interactions.
• dramatic decrease in the number of parameters ≈ 6 Million • multi-level feature extraction
Recent CNN
Residual Networks (He,2015) Idea : shortcut connections, no fully connected layers
• L2 penalty, batch normalization (NO dropout), momentum
• up to 150 layers, for only 2 Million parameters
Recent CNN
Striving for simplicity: The all convolutional Net (Springenberg,2014)
Output : You can get rid of pooling and only use convolutions (All-CNN-C).
• training with SGD, momentum, L2 penalty, dropout
• only 3x3 convolutions with various strides
• last layers use 3x3 and 1x1 convolutions instead of FC layers
An attempt at synthesizing CNN design principles
Design principles for Convolutional Neural networks
Increase the number of filters through the network Rationale : • first layers extract low level features • higher layers combine the previous features
Number of filters
• LeNet-5 (1998) : 6 5x5 - 16 5x5 • AlexNet(2012) : 96 11x11, 256 5x5, (384 3x3)*2, 256 3x3 • VGG (2014) : 64 - 128 - 256 - 512; all 3x3 • ResNet (2015) : 64 - 128 - 256 - 512; all 3x3 • Inception (2015) : 64 → 1024, 1x1, 3x3, “5x5”
Design principles for Convolutional Neural networks
Effective Receptive Field size (Conv 3x3 - Conv 3x3 - Max Pool) blocks

Layer                       Representation size   Input RF size
Input                       28x28                 1x1
Conv 3x3, same, stride 1    28x28                 3x3
Conv 3x3, same, stride 1    28x28                 5x5
Max 2x2, stride 2           14x14                 6x6
Conv 3x3, same, stride 1    14x14                 10x10
Conv 3x3, same, stride 1    14x14                 14x14
Max 2x2, stride 2           7x7                   16x16
Conv 3x3, same, stride 1    7x7                   24x24
⇒ Stack layers to ensure the RF covers the objects to detect. https://github.com/vdumoulin/conv_arithmetic
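The receptive field sizes can be computed with a short recurrence: each layer enlarges the receptive field by (kernel − 1) times the current spacing between adjacent units measured in input pixels, and that spacing is multiplied by the layer's stride. The sketch below applies it to the Conv 3x3 - Conv 3x3 - Max Pool stack; the function name is chosen for this example.

```python
def receptive_fields(layers):
    """Receptive field size after each layer, given (kernel, stride) pairs."""
    rf, jump = 1, 1      # RF of an input pixel; spacing between adjacent units, in input pixels
    sizes = [rf]
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
        sizes.append(rf)
    return sizes

conv, pool = (3, 1), (2, 2)                       # (kernel, stride)
stack = [conv, conv, pool, conv, conv, pool, conv]
rf_sizes = receptive_fields(stack)
```

Padding does not appear in the recurrence: it changes the representation size but not how many input pixels a unit can see.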
Design principles for Convolutional Neural networks
Stacking small kernels (VGG, Inception)
Szegedy(2015)
n input filters, αn output filters :
• αn, 5x5 conv : $25 \alpha n^2$ params
• $\sqrt{\alpha} n$, 3x3 - αn, 3x3 : $9 \sqrt{\alpha} n^2 + 9 \alpha \sqrt{\alpha} n^2$ params; α = 2 ⇒ 24% saving
n input filters, αn output filters :
• αn, 3x3 conv : $9 \alpha n^2$ params
• $\sqrt{\alpha} n$, 1x3 - αn, 3x1 : $3 \sqrt{\alpha} n^2 + 3 \alpha \sqrt{\alpha} n^2$ params; α = 2 ⇒ ≈ 29% saving
Design principles for Convolutional Neural Networks
Depthwise convolutions
MobileNets [Howard,2017]
Decrease the number of parameters by decoupling feature extraction in space and feature combination. See also Xception [Chollet(2016)]
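The parameter counts behind kernel factorization and depthwise separable convolutions can be checked with a few lines of arithmetic. This is a sketch with biases ignored; the function names and the channel counts are assumptions. The depthwise separable count (one k x k filter per input channel, then a 1x1 convolution mixing channels) is the usual MobileNets-style decomposition.

```python
import math

def conv_params(kh, kw, c_in, c_out):
    """Weights of a kh x kw convolution (biases ignored)."""
    return kh * kw * c_in * c_out

def factorization_saving(n, alpha, big, small_a, small_b):
    """Saving when one `big` kernel is replaced by `small_a` then `small_b`,
    with sqrt(alpha) * n intermediate filters."""
    mid = math.sqrt(alpha) * n
    full = conv_params(*big, n, alpha * n)
    stacked = conv_params(*small_a, n, mid) + conv_params(*small_b, mid, alpha * n)
    return 1 - stacked / full

saving_5x5 = factorization_saving(64, 2, (5, 5), (3, 3), (3, 3))
saving_3x3 = factorization_saving(64, 2, (3, 3), (1, 3), (3, 1))

def depthwise_separable_params(k, c_in, c_out):
    """k x k depthwise convolution followed by a 1x1 pointwise convolution."""
    return k * k * c_in + c_in * c_out

ratio = depthwise_separable_params(3, 128, 128) / conv_params(3, 3, 128, 128)
```

For a k x k kernel the depthwise separable ratio is $1/c_{out} + 1/k^2$, so with 3x3 kernels and many channels the layer needs roughly nine times fewer weights.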
Design principles for Convolutional Neural networks
Multiscale feature extraction
Design principles for Convolutional Neural networks
Dimensionality reduction with 1x1 convolutions
[Figure: a height × width volume with n channels is mapped by a Conv 1x1 with m filters, followed by a ReLu, to a height × width volume with m channels]
Equivalent to a single layer FNN, slid over the pixels. Multiple 1x1 ↔ MLP
Trainable non-linear transformation of the channels. Network in Network (Lin, 2013).
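The equivalence between a 1x1 convolution and a single dense layer applied at every pixel can be verified numerically. The sizes below are arbitrary and the channels-last layout is an assumption of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
height, width, n, m = 5, 7, 16, 4          # hypothetical sizes
x = rng.normal(size=(height, width, n))    # input volume, channels last
W = rng.normal(size=(n, m))                # the m filters, each of size 1x1xn
b = rng.normal(size=m)

# 1x1 convolution + ReLu: the same linear map applied to every pixel's channel vector
y_conv = np.maximum(np.einsum('hwn,nm->hwm', x, W) + b, 0.0)

# Explicit loop: a single layer network applied independently at each pixel
y_loop = np.empty((height, width, m))
for i in range(height):
    for j in range(width):
        y_loop[i, j] = np.maximum(x[i, j] @ W + b, 0.0)
```

Choosing m < n reduces the number of channels before an expensive 3x3 or 5x5 convolution, which is how 1x1 convolutions are used for dimensionality reduction.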
Design principles for Convolutional Neural networks
Ease the gradient flow with shortcuts
Design principles for Convolutional Neural networks
Do we need max pooling and fully connected layers ? The all convnet [Springenberg, 2015], ResNet [He, 2015]:
Advantage : you can slide the network over larger images to produce a volume of class probabilities.
Useful tricks
Using Pre-trained models Already trained models can be found : • https://github.com/tensorflow/models : Tensorflow zoo • https://keras.io/applications/ : Keras pre-trained models • Caffe Model Zoo e.g. use a VGG pretrained on ImageNet, 1) replace the softmax, 2) fine-tune some of the deepest layers
Useful tricks
Dataset augmentation Regularize your network by providing more samples : • With small perturbations (rotation, shift, zoom, ..)
• by altering the RGB pixel values with a PCA [Krizhevsky et al.,2012]
• by learning a generator (see Generative Adversarial Networks)
Useful tricks
Model averaging
1 Train several models with different initializations, architectures
2 Average the responses of these models
e.g. on CIFAR-100, 11 models with loss≈1.3, acc≈70%; averaged : loss ≈ 0.82, acc≈ 77%.
All challenge winners use model averaging
Model compression : speeding up inference time
• Binarized Neural Networks [Courbariaux(2016)]
• Knowledge distillation (train a small model with the soft targets of a big model) [Hinton(2015)]
Viewing and understanding deep networks
Demo of Deep Visualization Toolbox http://yosinski.com/deepvis
Some applications of CNN
Image classification
Aim : assign a label to an image
Some benchmarks
• MNIST (28x28, 10 classes, grayscale, 60,000 training, 10,000 testing)
• CIFAR-10, CIFAR-100
• ImageNet Task 1 (256x256, 1000 classes, 1.2 million training, 50,000 validation, and 100,000 test)
Image from [He(2016)]
Image classification
ImageNet
Image from [Canziani(2016)]
Object detection
Aim : detect the objects and output bounding boxes Metrics : detected classes, bbox coverage Some benchmarks
• Pascal-VOC • ImageNet Task 3 • Microsoft COCO
Applications of CNN : Object detection
Region based CNN [Girshick,2014]
Using the model AlexNet [Krizhevsky(2012)] for classifying.
Applications of CNN : Object detection
Fast RCNN (Girshick, 2015)
Applications of CNN : Object detection
Faster RCNN (Ren, Girshick, 2015)
- Introduces Region Proposal Network feeding a Fast-RCNN
- end-to-end training
More recently : YOLO(2015), YOLO9000(2016)
Applications of CNN : Semantic/Instance segmentation
Instance segmentation [He,2017]: Mask-RCNN
Predicts a binary mask in addition to the classes and boxes
Other approaches : SegNet(2015), FC-DenseNet(2017), UNet(2015), ENet(2016)
Spatial Transformer Network (Jaderberg, 2016) Learns a differentiable transformation $T_\theta$ (crop, translation, rotation, scale, and skew). Aim: decrease the number of degrees of freedom of the objects.
Recurrent Neural Networks
Recurrent neural networks (RNN) Handling sequential data (speech, handwriting, language,...) Predicting in context
Handling sequences with FNN
Time delay neural networks [Waibel(1989)]
[Figure: a delay line holding $x_{t-6}, x_{t-5}, \dots, x_{t-1}, x_t$ feeds the hidden layers, followed by the output layer]
But which size of the time window ? Must the history size be always the same ? Do we need the data over the whole time span ?
Recurrent Neural Networks (RNN) Architecture
• $W^{in}$ : inputs to hidden
• $W^{back}$ : outputs to hidden
• $W$ : hidden to hidden
• $W^{out}$ : hidden to output
Note : it repeatedly applies the same weight matrices at every time step.
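A forward pass through such a network can be sketched as below. The slides do not give the update equations, so the form $h(t) = \tanh(W^{in} x(t) + W h(t-1) + W^{back} y(t-1))$, $y(t) = W^{out} h(t)$ is an assumption of this sketch, as are the sizes and the random weights.

```python
import numpy as np

def rnn_forward(xs, W_in, W, W_back, W_out):
    """Runs the recurrent network over a sequence xs of shape (T, n_in)."""
    h = np.zeros(W.shape[0])          # hidden state
    y = np.zeros(W_out.shape[0])      # previous output, fed back to the hidden layer
    ys = []
    for x in xs:
        # the same four matrices are reused at every time step
        h = np.tanh(W_in @ x + W @ h + W_back @ y)
        y = W_out @ h
        ys.append(y)
    return np.array(ys)

rng = np.random.default_rng(0)
n_in, n_hid, n_out, T = 3, 8, 2, 10   # hypothetical sizes
W_in = rng.normal(scale=0.5, size=(n_hid, n_in))
W = rng.normal(scale=0.5, size=(n_hid, n_hid))
W_back = rng.normal(scale=0.5, size=(n_hid, n_out))
W_out = rng.normal(scale=0.5, size=(n_out, n_hid))
ys = rnn_forward(rng.normal(size=(T, n_in)), W_in, W, W_back, W_out)
```

Since the loop runs for as many steps as the sequence has elements, the same network handles sequences of any length.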
Training a RNN: Forward mode differentiation
Real Time Recurrent Learning (RTRL), [Williams(1989)] Same idea as forward-mode differentiation for FNN.
• Computationally more expensive than reverse mode differentiation,
• Online training
Sutskever (2013). Training recurrent neural networks, PhD.
Training a RNN : Reverse mode differentiation
Backpropagation Through Time (BPTT), [Werbos(1990)]
• Unfolds the computational graph in time ⇒ ∼ backprop in a deep FNN
• computationally cheaper than RTRL
• batch training
Sutskever (2013). Training recurrent neural networks, PhD.
Training a RNN is hard
Long-term dependencies
• Exploding/Vanishing gradient • if one output depends on an input a long time ago (long-term dependencies), that information may actually be lost or hard to be sensitive to • ⇒ introduce memory units, specifically designed to hold some information
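The exploding/vanishing behaviour can be illustrated on a deliberately simplified case: a linear recurrence $h(t) = W h(t-1)$, for which backpropagating through k time steps multiplies the gradient by $W^T$ k times. Taking W as a scaled orthogonal matrix is an assumption that makes the effect exact and easy to read.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 16
Q, _ = np.linalg.qr(rng.normal(size=(n, n)))   # random orthogonal matrix

def backprop_norm(scale, steps):
    """Norm of a gradient propagated back `steps` time steps through h(t) = W h(t-1)."""
    W = scale * Q                  # all singular values equal to `scale`
    g = np.ones(n) / np.sqrt(n)    # unit-norm gradient at the last time step
    for _ in range(steps):
        g = W.T @ g
    return np.linalg.norm(g)

vanishing = backprop_norm(0.9, 50)   # shrinks like 0.9**50
exploding = backprop_norm(1.1, 50)   # grows like 1.1**50
```

With a non-linearity such as tanh the Jacobians are further scaled by derivatives at most equal to 1, which makes vanishing even more likely.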
Long Short-Term Memory Architecture [Hochreiter,Schmidhuber(1997)][Gers(2000)] Specifically designed to store information for long time delays Gating units specify when to integrate, release, forget
Jozefowicz(2015); “LSTM: A search space odyssey” [Greff(2017)]; Recurrent Highway Networks
Other possibilities : e.g. Gated Recurrent Units (GRUs) [Cho(2014)]
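A single step of an LSTM cell with input, forget and output gates can be sketched as follows. This is the common formulation without peephole connections; packing the four weight blocks in one matrix `W`, the sizes and the random weights are choices made for this example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4 * n_hid, n_in + n_hid), b has shape (4 * n_hid,)."""
    n_hid = h_prev.shape[0]
    a = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(a[0 * n_hid:1 * n_hid])   # input gate: when to integrate
    f = sigmoid(a[1 * n_hid:2 * n_hid])   # forget gate: when to forget
    o = sigmoid(a[2 * n_hid:3 * n_hid])   # output gate: when to release
    g = np.tanh(a[3 * n_hid:4 * n_hid])   # candidate update of the memory
    c = f * c_prev + i * g                # memory cell
    h = o * np.tanh(c)                    # exposed hidden state
    return h, c

rng = np.random.default_rng(0)
n_in, n_hid = 3, 5                        # hypothetical sizes
W = rng.normal(scale=0.1, size=(4 * n_hid, n_in + n_hid))
b = np.zeros(4 * n_hid)
h, c = np.zeros(n_hid), np.zeros(n_hid)
for x in rng.normal(size=(20, n_in)):
    h, c = lstm_step(x, h, c, W, b)
```

When the forget gate is close to 1 and the input gate close to 0, the cell `c` is copied unchanged from one step to the next, which is what lets information survive long delays.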
Bidirectional LSTM Speech to text Bidirectional LSTM for Speech recognition [Graves(2013)] Both past and future contexts are used for classifying the current observation. When you speak, past and future phonemes influence the way you pronounce the current one.
Applications of RNN: Language modelling
Char RNN [Karpathy(2015)] Train an LSTM network to predict the next character. Then provide a seed and let it generate, character by character, a sentence.
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
Applications of RNN: Text to text
Mapping a sentence in one language to its translation. Encoder/Decoder [Sutskever(2014), Cho(2014)]
https://devblogs.nvidia.com/parallelforall/introduction-neural-machine-translation-with-gpus/
Applications of RNN: Text to text
The encoder/decoder suffers when the sentences are long. Idea : let the network decide which part of the sentence to attend to when translating. [Bahdanau(2015)] Attention based LSTM translation
See also [Cho et al.(2015)] for image captioning.
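The core of an attention mechanism is a softmax over per-position scores followed by a weighted average of the encoder states. The sketch below uses a plain dot product as the scoring function, which is a simplification swapped in for brevity: [Bahdanau(2015)] scores each position with a small learned network instead. The states are random placeholders.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())          # subtract the max for numerical stability
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """Returns attention weights over source positions and the resulting context vector."""
    scores = encoder_states @ decoder_state   # one score per source position
    weights = softmax(scores)                 # where to attend; sums to 1
    context = weights @ encoder_states        # weighted average of the encoder states
    return weights, context

rng = np.random.default_rng(0)
encoder_states = rng.normal(size=(6, 8))      # hypothetical: 6 source words, 8-dim states
decoder_state = rng.normal(size=8)
weights, context = attend(decoder_state, encoder_states)
```

The context vector is recomputed at every decoding step, so the decoder no longer has to rely on a single fixed-size summary of a long sentence.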
Applications of RNN: Multi language translation Google’s Multilingual Neural Machine Translation System
Model trained with English↔Portuguese and English↔Spanish generalizes to Portuguese↔Spanish.
[Wu(2016); Johnson(2016)]
Applications of RNN Text to handwritten text Handwriting [Graves(2013)]
Data : sequence of characters, sequence of pen positions (x,y,up/down)
1 learn a handwritten text generator: $(\delta x_{t+1}, \delta y_{t+1}, ud_t) = f(\{\delta x_k, \delta y_k, ud_k\}_{k \in [0,t-1]})$
2 condition the network by inputting a sequence of characters, and learn to attend to the right characters
3 prime the network with a given character/pen sequence to mimic style
The network outputs parameters for a mixture of gaussians
http://www.cs.toronto.edu/~graves/handwriting.html
Combining CNN and RNN
Automatic captioning [Karpathy(2015)]
http://cs.stanford.edu/people/karpathy/deepimagesent/
We did not speak about
• Encoders/decoders, Deconvolutional networks,
• Models for Natural Language Processing (e.g. word2vec, GloVe, recursive networks)
• Probabilistic/Energy based models : Hopfield networks, Restricted Boltzmann machines, deep belief networks
• More generally : generative models (e.g. Generative Adversarial Networks [Goodfellow, 2014])
• Neural Turing Machines, Attention based models https://distill.pub/2016/augmented-rnns/