Deep Learning book, by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Chapter 6: Deep Feedforward Networks

Benoit Massé, Dionyssos Kounades-Bastian

Linear regression (and classification)

Input vector x, output vector y.
Parameters: weight W and bias b.

Prediction: y = W^T x + b

[Diagram: inputs x1, x2, x3 are combined through the weights W and bias b into units u1, u2, giving outputs y1, y2.]
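A minimal NumPy sketch of this prediction (the shapes, three inputs and two outputs, are an assumption taken from the diagram):

    import numpy as np

    W = np.random.randn(3, 2)       # weight matrix: 3 inputs, 2 outputs
    b = np.zeros(2)                 # bias vector
    x = np.array([0.5, -1.0, 2.0])  # input vector

    y = W.T @ x + b                 # prediction: y = W^T x + b
    print(y.shape)                  # (2,)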

Linear regression (and classification)

Advantages: easy to use; easy to train, with low risk of overfitting.

Drawbacks: some problems are inherently non-linear.

Solving XOR

Linear regressor:

[Diagram: inputs x1, x2 mapped through weights W and bias b to a single output y.]

There is no value of W and b such that, for all (x1, x2) ∈ {0, 1}^2,

    W^T [x1, x2]^T + b = xor(x1, x2).

Solving XOR

What about this?

[Diagram: inputs x1, x2 → linear layer (W, b) → u1, u2 → linear layer (V, c) → output y.]

Strictly equivalent: the composition of two linear operations is still a linear operation.

Solving XOR

And what about this?

[Diagram: inputs x1, x2 → linear layer (W, b) → u1, u2 → non-linearity φ → h1, h2 → linear layer (V, c) → output y.]

in which φ(x) = max{0, x}.

It is possible! With

    W = [[1, 1], [1, 1]],   b = [0, −1]^T,   V = [1, −2],   c = 0,

we get

    V φ(W x + b) = xor(x1, x2)   for all (x1, x2) ∈ {0, 1}^2.
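A quick numerical check of this solution, as a minimal NumPy sketch (the array layout simply follows the matrices above):

    import numpy as np

    def relu(z):
        return np.maximum(0, z)

    W = np.array([[1, 1],
                  [1, 1]])
    b = np.array([0, -1])
    V = np.array([1, -2])
    c = 0

    for x1 in (0, 1):
        for x2 in (0, 1):
            x = np.array([x1, x2])
            y = V @ relu(W @ x + b) + c
            print(x1, x2, y)   # prints the XOR truth table: 0, 1, 1, 0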

Neural network with one hidden layer

Compact representation:

[Diagram: x → hidden layer h (parameters W, b, activation φ) → output y (parameters V, c).]

Neural network: a hidden layer with a non-linearity → can represent a broader class of functions.

Universal approximation theorem

Theorem: a neural network with one hidden layer can approximate any continuous function.

More formally, given a continuous function f : C_n → R^m, where C_n is a compact subset of R^n:

    ∀ε > 0, ∃ f_NN^ε : x ↦ Σ_{i=1}^{K} v_i φ(w_i^T x + b_i) + c

such that

    ∀x ∈ C_n, ||f(x) − f_NN^ε(x)|| < ε.
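The f_NN^ε above is simply a one-hidden-layer network with K hidden units. A minimal sketch of that functional form (the shapes, the ReLU choice for φ, and the scalar output are assumptions made for the example):

    import numpy as np

    def f_nn(x, W, b, v, c, phi=lambda z: np.maximum(0, z)):
        # One-hidden-layer network: sum_i v_i * phi(w_i^T x + b_i) + c
        # W has shape (K, n): one row w_i per hidden unit.
        return v @ phi(W @ x + b) + c

    # Example with n = 2 inputs and K = 10 hidden units.
    rng = np.random.default_rng(0)
    K, n = 10, 2
    W, b = rng.normal(size=(K, n)), rng.normal(size=K)
    v, c = rng.normal(size=K), 0.0
    print(f_nn(np.array([0.3, -0.7]), W, b, v, c))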

Problems

Obtaining the network: the universal approximation theorem gives no information about HOW to obtain such a network, i.e. about
  the size of the hidden layer h,
  the values of W and b.

Using the network: even if we find a way to obtain the network, the size of the hidden layer may be prohibitively large.

Deep neural network

Why deep? Let's stack l hidden layers one after the other; l is called the depth of the network.

[Diagram: x → h1 → ... → hl → y, with parameters (W1, b1, φ), ..., (Wl, bl, φ) for the hidden layers and (V, c) for the output layer.]

Properties of DNNs:
  The universal approximation theorem also applies.
  Some functions can be approximated by a DNN with N hidden units, but would require O(e^N) hidden units to be represented by a shallow network.
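A minimal forward-pass sketch for such a stack of l hidden layers (the layer sizes and the ReLU activation are assumptions made for the example):

    import numpy as np

    def forward(x, hidden_params, V, c, phi=lambda z: np.maximum(0, z)):
        # Apply l hidden layers (W_i, b_i) with activation phi, then a linear output layer.
        h = x
        for W, b in hidden_params:
            h = phi(W @ h + b)
        return V @ h + c

    # Example: 3 inputs, two hidden layers of 4 units each, 2 outputs.
    rng = np.random.default_rng(0)
    sizes = [3, 4, 4]
    hidden = [(rng.normal(size=(m, n)), np.zeros(m)) for n, m in zip(sizes, sizes[1:])]
    V, c = rng.normal(size=(2, 4)), np.zeros(2)
    print(forward(rng.normal(size=3), hidden, V, c))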

Summary

Comparison:
  Linear classifier: − limited representational power; + simple.
  Shallow neural network (exactly one hidden layer): + unlimited representational power; − sometimes prohibitively wide.
  Deep neural network: + unlimited representational power; + relatively small number of hidden units needed.

Remaining problem: how do we get this DNN?

On the path to getting my own DNN

Hyperparameters: first, we need to define the architecture of the DNN:
  the depth l,
  the sizes of the hidden layers n1, ..., nl,
  the activation function φ,
  the output unit.

Parameters: once the architecture is defined, we need to train the DNN, i.e. learn
  W1, b1, ..., Wl, bl.

Hyperparameters

The depth l and the sizes of the hidden layers n1, ..., nl: strongly depend on the problem to solve.

The activation function φ:
  ReLU g : x ↦ max{0, x}
  Sigmoid σ : x ↦ (1 + e^(−x))^(−1)
  Many others: tanh, RBF, softplus...

The output unit (sketched in code below):
  Linear output E[y] = V^T h_l + c, for regression with a Gaussian distribution y ∼ N(E[y], I).
  Sigmoid output ŷ = σ(w^T h_l + b), for classification with a Bernoulli distribution P(y = 1 | x) = ŷ.
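A small illustration of these two output units (the final hidden layer h_l is just a random vector here, an assumption made for the example):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    rng = np.random.default_rng(0)
    h_l = rng.normal(size=4)                      # final hidden layer

    # Linear output unit (regression): E[y] = V^T h_l + c
    V, c = rng.normal(size=(4, 2)), np.zeros(2)
    y_mean = V.T @ h_l + c

    # Sigmoid output unit (binary classification): P(y = 1 | x) = sigmoid(w^T h_l + b)
    w, b = rng.normal(size=4), 0.0
    p = sigmoid(w @ h_l + b)
    print(y_mean, p)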

Parameters: Training

Objective: let θ = (W1, b1, ..., Wl, bl). Suppose we have a set of inputs X = (x^1, ..., x^N) and a set of expected outputs Y = (y^1, ..., y^N). The goal is to find a neural network f_NN such that

    ∀i, f_NN(x^i, θ) ≈ y^i.

Parameters: Training

Cost function: to evaluate the error that our current network makes, let's define a cost function L(X, Y, θ). The goal becomes to find

    argmin_θ L(X, Y, θ).

Loss function: it should represent a combination of the distances between every y^i and the corresponding f_NN(x^i, θ), for example (sketched below):
  mean square error (rare),
  cross-entropy.
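Minimal sketches of those two losses over a batch of targets y and predictions ŷ (the binary form of the cross-entropy is an assumption chosen for this example):

    import numpy as np

    def mean_square_error(y_true, y_pred):
        return np.mean((y_true - y_pred) ** 2)

    def binary_cross_entropy(y_true, y_pred, eps=1e-12):
        # y_true in {0, 1}, y_pred in (0, 1); eps avoids log(0).
        y_pred = np.clip(y_pred, eps, 1 - eps)
        return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

    y_true = np.array([1.0, 0.0, 1.0])
    y_pred = np.array([0.9, 0.2, 0.7])
    print(mean_square_error(y_true, y_pred), binary_cross_entropy(y_true, y_pred))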

Parameters: Training

Find the minimum: the basic idea consists in computing θ̂ such that

    ∇_θ L(X, Y, θ̂) = 0.

This is difficult to solve analytically, e.g. when θ has millions of degrees of freedom.

Parameters: Training

Gradient descent: let's use a numerical way to optimize θ, called gradient descent (section 4.3). The idea is that

    f(θ − εu) ≈ f(θ) − ε u^T ∇f(θ).

So if we take u = ∇f(θ), we have u^T u > 0 and then

    f(θ − εu) ≈ f(θ) − ε u^T u < f(θ).

If f is a function to minimize, this gives an update rule that improves our estimate.

Parameters: Training

Gradient descent algorithm (see the code sketch below):
  1. Start with an estimate θ̂ of the parameters.
  2. Compute ∇_θ L(X, Y, θ̂).
  3. Update θ̂ ← θ̂ − ε ∇_θ L.
  4. Repeat steps 2-3 until ||∇_θ L|| < threshold.

Problem: how can we estimate ∇_θ L(X, Y, θ̂) efficiently? → The back-propagation algorithm.
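A minimal sketch of that loop on a toy problem (the quadratic loss and its analytic gradient are assumptions standing in for L and ∇_θ L):

    import numpy as np

    def loss(theta):
        return np.sum((theta - 3.0) ** 2)            # toy L with minimum at theta = [3, 3]

    def grad(theta):
        return 2.0 * (theta - 3.0)                   # its gradient

    theta = np.array([10.0, -4.0])                   # step 1: initial estimate
    eps, threshold = 0.1, 1e-6
    while np.linalg.norm(grad(theta)) >= threshold:  # step 4: stopping criterion
        theta = theta - eps * grad(theta)            # steps 2-3: gradient and update
    print(theta, loss(theta))                        # close to [3, 3] and 0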

Back-propagation for Parameter Learning

Consider the architecture

    x → (w1, φ) → (w2, φ) → y

with function

    y = φ(w2 φ(w1 x)),

some training pairs T = {(x̂_n, ŷ_n)}_{n=1}^N, and an activation function φ(·).

Learn w1, w2 so that feeding x̂_n yields ŷ_n.

Prerequisite: differentiable activation function

For learning to be possible, φ(·) has to be differentiable. Let φ'(x) = ∂φ(x)/∂x denote the derivative of φ(x). For example, when φ(x) = ReLU(x), we have φ'(x) = 1 for x > 0 and φ'(x) = 0 for x < 0.

Gradient-based Learning

Minimize the loss L(w1, w2, T). We will learn the weights by iterating:

    [w1, w2]^updated = [w1, w2] − γ [∂L/∂w1, ∂L/∂w2],        (1)

where L is the loss function (it must be differentiable): in detail it is L(w1, w2, T), and we want to compute the gradient(s) at w1, w2; γ is the learning rate (a scalar, typically known).

Back-propagation

Calculate intermediate values on all units:
  1. a = w1 x̂_n
  2. b = φ(w1 x̂_n)
  3. c = w2 φ(w1 x̂_n)
  4. d = φ(w2 φ(w1 x̂_n))
  5. L(d) = L(φ(w2 φ(w1 x̂_n)))

The partial derivatives are:
  6. ∂L(d)/∂d = L'(d)
  7. ∂d/∂c = φ'(w2 φ(w1 x̂_n))
  8. ∂c/∂b = w2
  9. ∂b/∂a = φ'(w1 x̂_n)

Calculating the Gradients I

Apply the chain rule:

    ∂L/∂w1 = ∂L/∂d · ∂d/∂c · ∂c/∂b · ∂b/∂a · ∂a/∂w1,

    ∂L/∂w2 = ∂L(d)/∂d · ∂d/∂c · ∂c/∂w2.

Start the calculation from left to right: we propagate the gradients (partial products) from the last layer towards the input.

Calculating the Gradients

And because we have N training pairs:

    ∂L/∂w1 = Σ_{n=1}^{N} ∂L(d_n)/∂d_n · ∂d_n/∂c_n · ∂c_n/∂b_n · ∂b_n/∂a_n · ∂a_n/∂w1,

    ∂L/∂w2 = Σ_{n=1}^{N} ∂L(d_n)/∂d_n · ∂d_n/∂c_n · ∂c_n/∂w2.
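Putting the recipe together, here is a minimal sketch of back-propagation and the update rule (1) for this two-weight network (the squared-error loss, the ReLU activation with the convention φ'(0) = 0, and the toy data are all assumptions made for the example):

    import numpy as np

    def phi(z):                      # activation: ReLU
        return np.maximum(0.0, z)

    def phi_prime(z):                # its derivative (0 at z = 0 by convention)
        return (z > 0).astype(float)

    # Toy training pairs (x̂_n, ŷ_n) and initial weights.
    x_hat = np.array([0.5, 1.0, 2.0])
    y_hat = np.array([0.2, 0.4, 0.8])
    w1, w2, gamma = 0.3, 0.7, 0.1

    for step in range(200):
        # Forward pass: intermediate values on all units.
        a = w1 * x_hat
        b = phi(a)
        c = w2 * b
        d = phi(c)
        # Squared-error loss L = sum_n (d_n - ŷ_n)^2, so ∂L/∂d_n = 2 (d_n - ŷ_n).
        dL_dd = 2.0 * (d - y_hat)
        # Backward pass: chain rule, summed over the N training pairs.
        dd_dc = phi_prime(c)
        dc_db = w2
        db_da = phi_prime(a)
        grad_w2 = np.sum(dL_dd * dd_dc * b)                        # ∂c/∂w2 = b
        grad_w1 = np.sum(dL_dd * dd_dc * dc_db * db_da * x_hat)    # ∂a/∂w1 = x̂_n
        # Gradient step, eq. (1).
        w1 -= gamma * grad_w1
        w2 -= gamma * grad_w2

    print(w1, w2, phi(w2 * phi(w1 * x_hat)))   # predictions approach ŷ_n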

Thank you!
