Deep Learning book, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Chapter 6: Deep Feedforward Networks

Benoit Massé, Dionyssos Kounades-Bastian

Linear regression (and classification)

Input vector x, output vector y.
Parameters: weight W and bias b.

Prediction: y = W^T x + b

[Diagram: inputs x1, x2, x3 are combined through the weights W and bias b into units u1, u2, giving outputs y1, y2.]
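A minimal NumPy sketch of this prediction (the shapes, three inputs and two outputs, are an assumption taken from the diagram):

    import numpy as np

    W = np.random.randn(3, 2)       # weight matrix: 3 inputs, 2 outputs
    b = np.zeros(2)                 # bias vector
    x = np.array([0.5, -1.0, 2.0])  # input vector

    y = W.T @ x + b                 # prediction: y = W^T x + b
    print(y.shape)                  # (2,)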

Linear regression (and classification)

Advantages: easy to use; easy to train, with low risk of overfitting.

Drawbacks: some problems are inherently non-linear.

Solving XOR

Linear regressor:

[Diagram: inputs x1, x2 mapped through weights W and bias b to a single output y.]

There is no value of W and b such that, for all (x1, x2) ∈ {0, 1}^2,

    W^T [x1, x2]^T + b = xor(x1, x2).

Solving XOR

What about this?

[Diagram: inputs x1, x2 → linear layer (W, b) → u1, u2 → linear layer (V, c) → output y.]

Strictly equivalent: the composition of two linear operations is still a linear operation.

Solving XOR

And what about this?

[Diagram: inputs x1, x2 → linear layer (W, b) → u1, u2 → non-linearity φ → h1, h2 → linear layer (V, c) → output y.]

in which φ(x) = max{0, x}.

It is possible! With

    W = [[1, 1], [1, 1]],   b = [0, −1]^T,   V = [1, −2],   c = 0,

we get

    V φ(W x + b) = xor(x1, x2)   for all (x1, x2) ∈ {0, 1}^2.
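A quick numerical check of this solution, as a minimal NumPy sketch (the array layout simply follows the matrices above):

    import numpy as np

    def relu(z):
        return np.maximum(0, z)

    W = np.array([[1, 1],
                  [1, 1]])
    b = np.array([0, -1])
    V = np.array([1, -2])
    c = 0

    for x1 in (0, 1):
        for x2 in (0, 1):
            x = np.array([x1, x2])
            y = V @ relu(W @ x + b) + c
            print(x1, x2, y)   # prints the XOR truth table: 0, 1, 1, 0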

Neural network with one hidden layer

Compact representation:

[Diagram: x → hidden layer h (parameters W, b, activation φ) → output y (parameters V, c).]

Neural network: a hidden layer with a non-linearity → can represent a broader class of functions.

Universal approximation theorem

Theorem: a neural network with one hidden layer can approximate any continuous function.

More formally, given a continuous function f : C_n → R^m, where C_n is a compact subset of R^n:

    ∀ε > 0, ∃ f_NN^ε : x ↦ Σ_{i=1}^{K} v_i φ(w_i^T x + b_i) + c

such that

    ∀x ∈ C_n, ||f(x) − f_NN^ε(x)|| < ε.
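The f_NN^ε above is simply a one-hidden-layer network with K hidden units. A minimal sketch of that functional form (the shapes, the ReLU choice for φ, and the scalar output are assumptions made for the example):

    import numpy as np

    def f_nn(x, W, b, v, c, phi=lambda z: np.maximum(0, z)):
        # One-hidden-layer network: sum_i v_i * phi(w_i^T x + b_i) + c
        # W has shape (K, n): one row w_i per hidden unit.
        return v @ phi(W @ x + b) + c

    # Example with n = 2 inputs and K = 10 hidden units.
    rng = np.random.default_rng(0)
    K, n = 10, 2
    W, b = rng.normal(size=(K, n)), rng.normal(size=K)
    v, c = rng.normal(size=K), 0.0
    print(f_nn(np.array([0.3, -0.7]), W, b, v, c))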

Problems

Obtaining the network: the universal approximation theorem gives no information about HOW to obtain such a network, i.e. about
  the size of the hidden layer h,
  the values of W and b.

Using the network: even if we find a way to obtain the network, the size of the hidden layer may be prohibitively large.

Deep neural network

Why deep? Let's stack l hidden layers one after the other; l is called the depth of the network.

[Diagram: x → h1 → ... → hl → y, with parameters (W1, b1, φ), ..., (Wl, bl, φ) for the hidden layers and (V, c) for the output layer.]

Properties of DNNs:
  The universal approximation theorem also applies.
  Some functions can be approximated by a DNN with N hidden units, but would require O(e^N) hidden units to be represented by a shallow network.
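A minimal forward-pass sketch for such a stack of l hidden layers (the layer sizes and the ReLU activation are assumptions made for the example):

    import numpy as np

    def forward(x, hidden_params, V, c, phi=lambda z: np.maximum(0, z)):
        # Apply l hidden layers (W_i, b_i) with activation phi, then a linear output layer.
        h = x
        for W, b in hidden_params:
            h = phi(W @ h + b)
        return V @ h + c

    # Example: 3 inputs, two hidden layers of 4 units each, 2 outputs.
    rng = np.random.default_rng(0)
    sizes = [3, 4, 4]
    hidden = [(rng.normal(size=(m, n)), np.zeros(m)) for n, m in zip(sizes, sizes[1:])]
    V, c = rng.normal(size=(2, 4)), np.zeros(2)
    print(forward(rng.normal(size=3), hidden, V, c))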

Summary

Comparison:
  Linear classifier: − limited representational power; + simple.
  Shallow neural network (exactly one hidden layer): + unlimited representational power; − sometimes prohibitively wide.
  Deep neural network: + unlimited representational power; + relatively small number of hidden units needed.

Remaining problem: how do we get this DNN?

On the path to getting my own DNN

Hyperparameters: first, we need to define the architecture of the DNN:
  the depth l,
  the sizes of the hidden layers n1, ..., nl,
  the activation function φ,
  the output unit.

Parameters: once the architecture is defined, we need to train the DNN, i.e. learn
  W1, b1, ..., Wl, bl.

Hyperparameters

The depth l and the sizes of the hidden layers n1, ..., nl: strongly depend on the problem to solve.

The activation function φ:
  ReLU g : x ↦ max{0, x}
  Sigmoid σ : x ↦ (1 + e^(−x))^(−1)
  Many others: tanh, RBF, softplus...

The output unit (sketched in code below):
  Linear output E[y] = V^T h_l + c, for regression with a Gaussian distribution y ∼ N(E[y], I).
  Sigmoid output ŷ = σ(w^T h_l + b), for classification with a Bernoulli distribution P(y = 1 | x) = ŷ.
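A small illustration of these two output units (the final hidden layer h_l is just a random vector here, an assumption made for the example):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    h_l = rng.normal(size=4)                      # final hidden layer

    # Linear output unit (regression): E[y] = V^T h_l + c
    V, c = rng.normal(size=(4, 2)), np.zeros(2)
    y_mean = V.T @ h_l + c

    # Sigmoid output unit (binary classification): P(y = 1 | x) = sigmoid(w^T h_l + b)
    w, b = rng.normal(size=4), 0.0
    p = sigmoid(w @ h_l + b)
    print(y_mean, p)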

Parameters: Training

Objective: let θ = (W1, b1, ..., Wl, bl). Suppose we have a set of inputs X = (x^1, ..., x^N) and a set of expected outputs Y = (y^1, ..., y^N). The goal is to find a neural network f_NN such that

    ∀i, f_NN(x^i, θ) ≈ y^i.

Parameters: Training

Cost function: to evaluate the error that our current network makes, let's define a cost function L(X, Y, θ). The goal becomes to find

    argmin_θ L(X, Y, θ).

Loss function: it should represent a combination of the distances between every y^i and the corresponding f_NN(x^i, θ), for example (sketched below):
  mean square error (rare),
  cross-entropy.
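Minimal sketches of those two losses over a batch of targets y and predictions ŷ (the binary form of the cross-entropy is an assumption chosen for this example):

    import numpy as np

    def mean_square_error(y_true, y_pred):
        return np.mean((y_true - y_pred) ** 2)

    def binary_cross_entropy(y_true, y_pred, eps=1e-12):
        # y_true in {0, 1}, y_pred in (0, 1); eps avoids log(0).
        y_pred = np.clip(y_pred, eps, 1 - eps)
        return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

    y_true = np.array([1.0, 0.0, 1.0])
    y_pred = np.array([0.9, 0.2, 0.7])
    print(mean_square_error(y_true, y_pred), binary_cross_entropy(y_true, y_pred))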

Parameters: Training

Find the minimum: the basic idea consists in computing θ̂ such that

    ∇_θ L(X, Y, θ̂) = 0.

This is difficult to solve analytically, e.g. when θ has millions of degrees of freedom.

Parameters: Training

Gradient descent: let's use a numerical way to optimize θ, called gradient descent (section 4.3). The idea is that

    f(θ − εu) ≈ f(θ) − ε u^T ∇f(θ).

So if we take u = ∇f(θ), we have u^T u > 0 and then

    f(θ − εu) ≈ f(θ) − ε u^T u < f(θ).

If f is a function to minimize, this gives an update rule that improves our estimate.

Parameters: Training

Gradient descent algorithm (see the code sketch below):
  1. Start with an estimate θ̂ of the parameters.
  2. Compute ∇_θ L(X, Y, θ̂).
  3. Update θ̂ ← θ̂ − ε ∇_θ L.
  4. Repeat steps 2-3 until ||∇_θ L|| < threshold.

Problem: how can we estimate ∇_θ L(X, Y, θ̂) efficiently? → The back-propagation algorithm.
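A minimal sketch of that loop on a toy problem (the quadratic loss and its analytic gradient are assumptions standing in for L and ∇_θ L):

    import numpy as np

    def loss(theta):
        return np.sum((theta - 3.0) ** 2)            # toy L with minimum at theta = [3, 3]

    def grad(theta):
        return 2.0 * (theta - 3.0)                   # its gradient

    theta = np.array([10.0, -4.0])                   # step 1: initial estimate
    eps, threshold = 0.1, 1e-6
    while np.linalg.norm(grad(theta)) >= threshold:  # step 4: stopping criterion
        theta = theta - eps * grad(theta)            # steps 2-3: gradient and update
    print(theta, loss(theta))                        # close to [3, 3] and 0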

Back-propagation for Parameter Learning

Consider the architecture

    x → (w1, φ) → (w2, φ) → y

with function

    y = φ(w2 φ(w1 x)),

some training pairs T = {(x̂_n, ŷ_n)}_{n=1}^N, and an activation function φ(·).

Learn w1, w2 so that feeding x̂_n yields ŷ_n.

Prerequisite: differentiable activation function

For learning to be possible, φ(·) has to be differentiable. Let φ'(x) = ∂φ(x)/∂x denote the derivative of φ(x). For example, when φ(x) = ReLU(x), we have φ'(x) = 1 for x > 0 and φ'(x) = 0 for x < 0.

Gradient-based Learning

Minimize the loss L(w1, w2, T). We will learn the weights by iterating:

    [w1, w2]^updated = [w1, w2] − γ [∂L/∂w1, ∂L/∂w2],        (1)

where L is the loss function (it must be differentiable): in detail it is L(w1, w2, T), and we want to compute the gradient(s) at w1, w2; γ is the learning rate (a scalar, typically known).

Back-propagation

Calculate intermediate values on all units:
  1. a = w1 x̂_n
  2. b = φ(w1 x̂_n)
  3. c = w2 φ(w1 x̂_n)
  4. d = φ(w2 φ(w1 x̂_n))
  5. L(d) = L(φ(w2 φ(w1 x̂_n)))

The partial derivatives are:
  6. ∂L(d)/∂d = L'(d)
  7. ∂d/∂c = φ'(w2 φ(w1 x̂_n))
  8. ∂c/∂b = w2
  9. ∂b/∂a = φ'(w1 x̂_n)

Calculating the Gradients I

Apply the chain rule:

    ∂L/∂w1 = ∂L/∂d · ∂d/∂c · ∂c/∂b · ∂b/∂a · ∂a/∂w1,

    ∂L/∂w2 = ∂L(d)/∂d · ∂d/∂c · ∂c/∂w2.

Start the calculation from left to right: we propagate the gradients (partial products) from the last layer towards the input.

Calculating the Gradients

And because we have N training pairs:

    ∂L/∂w1 = Σ_{n=1}^{N} ∂L(d_n)/∂d_n · ∂d_n/∂c_n · ∂c_n/∂b_n · ∂b_n/∂a_n · ∂a_n/∂w1,

    ∂L/∂w2 = Σ_{n=1}^{N} ∂L(d_n)/∂d_n · ∂d_n/∂c_n · ∂c_n/∂w2.
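Putting the recipe together, here is a minimal sketch of back-propagation and the update rule (1) for this two-weight network (the squared-error loss, the ReLU activation with the convention φ'(0) = 0, and the toy data are all assumptions made for the example):

    import numpy as np

    def phi(z):                      # activation: ReLU
        return np.maximum(0.0, z)

    def phi_prime(z):                # its derivative (0 at z = 0 by convention)
        return (z > 0).astype(float)

    # Toy training pairs (x̂_n, ŷ_n) and initial weights.
    x_hat = np.array([0.5, 1.0, 2.0])
    y_hat = np.array([0.2, 0.4, 0.8])
    w1, w2, gamma = 0.3, 0.7, 0.1

    for step in range(200):
        # Forward pass: intermediate values on all units.
        a = w1 * x_hat
        b = phi(a)
        c = w2 * b
        d = phi(c)
        # Squared-error loss L = sum_n (d_n - ŷ_n)^2, so ∂L/∂d_n = 2 (d_n - ŷ_n).
        dL_dd = 2.0 * (d - y_hat)
        # Backward pass: chain rule, summed over the N training pairs.
        dd_dc = phi_prime(c)
        dc_db = w2
        db_da = phi_prime(a)
        grad_w2 = np.sum(dL_dd * dd_dc * b)                        # ∂c/∂w2 = b
        grad_w1 = np.sum(dL_dd * dd_dc * dc_db * db_da * x_hat)    # ∂a/∂w1 = x̂_n
        # Gradient step, eq. (1).
        w1 -= gamma * grad_w1
        w2 -= gamma * grad_w2

    print(w1, w2, phi(w2 * phi(w1 * x_hat)))   # predictions approach ŷ_n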

Thank you!
