
Deep learning
Jérémy Fix, CentraleSupélec
jeremy.fi[email protected]
2016

Introduction and historical perspective

[Schmidhuber, 2015] Deep Learning in Neural Networks: An Overview, Jürgen Schmidhuber, Neural Networks (61), pages 85-117.

Historical perspective on neural networks

• Perceptron (Rosenblatt, 1962): linear classifier
• AdaLinE (Widrow, Hoff, 1962): linear regressor
• Minsky/Papert (1969): first winter
• Convolutional Neural Networks (1980, 1998): great!
• Multilayer Perceptron and backprop (Rumelhart, 1986): great! but hard to train, and the SVMs came in the 1990s...: second winter
• 2006: pretraining!
• 2012: AlexNet on ImageNet (10% better on the test set than the runner-up)
• Now: lots of state-of-the-art neural networks

Some reasons for the current success

• GPUs (speed of processing) / data (regularizing)
• theoretical understanding of the difficulty of training deep networks

Which libs? Torch (Lua)/PyTorch, Caffe (Python/C++), Theano/Lasagne (Python, RIP 2017), TensorFlow (Google, Python, C++), Keras (wrapper over TensorFlow and Theano), CNTK, MXNet, Chainer, DyNet, ...

What to read?

www.deeplearningbook.org, Goodfellow, Bengio, Courville (2016)

Who to follow?
• N-1: LeCun, Bengio, Hinton, Schmidhuber
• N: Goodfellow, Dauphin, Graves, Sutskever, Karpathy, Krizhevsky, ...
Obviously, the community is much larger.

Which conferences? ICML, NIPS, ICLR, ...
https://github.com/terryum/awesome-deep-learning-papers

What is a neural network?

The tree: a neural network is a directed graph
• edges: weighted connections
• nodes: computational units
• no cycle: feedforward neural networks (FNN)
• with cycles: recurrent neural networks (RNN)

It hides the jungle: what is a convolutional neural network with a softmax output, ReLU hidden activations and batch normalization layers, trained with RMSprop with Nesterov momentum and regularized with dropout?

Feedforward Neural Networks (FNN)

[Figure: a feedforward network with input, hidden and output layers, and a skip-layer connection.]

Perceptron (Rosenblatt, 1962)

Classification: given (x_i, y_i) with y_i \in \{-1, +1\}
• SAR architecture, basis functions \phi_j(x) with \phi_0(x) = 1
• Algorithm
• Geometrical interpretation

[Figure: Sensory-Associative-Result (SAR) architecture: inputs x_j feed basis functions a_j = \phi_j(x); their weighted sums pass through threshold units g to produce the responses r_k.]

Perceptron (Rosenblatt, 1962): Classifier

Given feature functions \phi_j, with \phi_0(x) = 1, the perceptron classifies x as:

y = g(w^T \phi(x)), \quad g(x) = \begin{cases} -1 & \text{if } x < 0 \\ +1 & \text{if } x \geq 0 \end{cases}

with \phi(x) \in \mathbb{R}^{n_a + 1}, \; \phi(x) = [1, \phi_1(x), \phi_2(x), \ldots]^T.
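To make the classifier concrete, here is a minimal NumPy sketch of the decision rule y = g(w^T \phi(x)). The identity-plus-bias feature map phi and the example weight vector are illustrative assumptions, not taken from the slides.

import numpy as np

def phi(x):
    # Illustrative feature map: the constant basis function phi_0(x) = 1, then the raw inputs.
    return np.concatenate(([1.0], np.asarray(x, dtype=float)))

def g(a):
    # Threshold activation: -1 if a < 0, +1 if a >= 0.
    return -1 if a < 0 else 1

def perceptron_predict(w, x):
    # Perceptron classifier y = g(w^T phi(x)).
    return g(w @ phi(x))

# Usage with a hand-chosen weight vector (bias weight first).
w = np.array([-0.5, 1.0, 2.0])
print(perceptron_predict(w, [0.3, 0.4]))  # prints 1, since -0.5 + 0.3 + 0.8 >= 0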
Perceptron (Rosenblatt, 1962): Online training algorithm

Given (x_i, y_i), y_i \in \{-1, +1\}, the perceptron learning rule operates online:

w \leftarrow \begin{cases} w & \text{if the input is correctly classified} \\ w + \phi(x_i) & \text{if the input is incorrectly classified as } -1 \\ w - \phi(x_i) & \text{if the input is incorrectly classified as } +1 \end{cases}

Perceptron (Rosenblatt, 1962): Geometrical interpretation

y = g(w^T \phi(x))

[Figure: cases where a sample is correctly classified; for y_i = +1, \phi(x_i) lies on the same side of the separating hyperplane as w, for y_i = -1 on the opposite side.]

[Figure: cases where a sample is misclassified; the update w + \phi(x_i) (for y_i = +1) or w - \phi(x_i) (for y_i = -1) moves w towards a correct classification of \phi(x_i).]

Perceptron (Rosenblatt, 1962): The cone of feasible solutions

Consider two samples x_1, x_2 with y_1 = +1, y_2 = -1.

[Figure: the hyperplanes v^T \phi(x_1) = 0 and v^T \phi(x_2) = 0 delimit the cone of weight vectors that classify both samples correctly.]

Perceptron (Rosenblatt, 1962): Online training algorithm, rewritten

The same rule, written with the prediction:

w \leftarrow \begin{cases} w & \text{if } g(w^T \phi(x_i)) = y_i \\ w + \phi(x_i) & \text{if } g(w^T \phi(x_i)) = -1 \text{ and } y_i = +1 \\ w - \phi(x_i) & \text{if } g(w^T \phi(x_i)) = +1 \text{ and } y_i = -1 \end{cases}

which collapses to:

w \leftarrow \begin{cases} w & \text{if } g(w^T \phi(x_i)) = y_i \\ w + y_i \phi(x_i) & \text{if } g(w^T \phi(x_i)) \neq y_i \end{cases}

i.e. \quad w \leftarrow w + \frac{1}{2}(y_i - \hat{y}_i)\phi(x_i), \quad \text{with } \hat{y}_i = g(w^T \phi(x_i)).

Perceptron (Rosenblatt, 1962): Convergence

Definition (Linear separability). A binary classification problem (x_i, y_i) \in \mathbb{R}^d \times \{-1, +1\}, i \in [1..N], is said to be linearly separable if there exists w \in \mathbb{R}^d such that

\forall i, \; \mathrm{sign}(w^T x_i) = y_i

with \mathrm{sign}(x) = -1 for x < 0 and \mathrm{sign}(x) = +1 for x \geq 0.

Theorem (Perceptron convergence theorem). A classification problem (x_i, y_i) \in \mathbb{R}^d \times \{-1, +1\}, i \in [1..N], is linearly separable if and only if the perceptron learning rule converges to an optimal solution in a finite number of steps.
(⇐): easy; (⇒): we upper/lower bound \|w(t)\|_2^2.

Perceptron (Rosenblatt, 1962): Summary

• w_t = w_0 + \sum_{i \in \mathcal{I}(t)} y_i \phi(x_i), with \mathcal{I}(t) the set of misclassified samples
• it minimizes a loss: J(w) = \frac{1}{N}\sum_i \max(0, -y_i w^T \phi(x_i))
• the solution can be written w_t = w_0 + \frac{1}{2}\sum_i (y_i - \hat{y}_i)\phi(x_i), where (y_i - \hat{y}_i) is the prediction error
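A minimal NumPy sketch of the online learning rule above. The phi and g helpers repeat the illustrative definitions from the previous sketch, and the stopping criterion (one full pass without a mistake, capped at max_epochs) is an assumption rather than something stated on the slides.

import numpy as np

def phi(x):
    # Illustrative feature map: constant bias feature followed by the raw inputs.
    return np.concatenate(([1.0], np.asarray(x, dtype=float)))

def g(a):
    # Threshold activation: -1 if a < 0, +1 if a >= 0.
    return -1 if a < 0 else 1

def perceptron_train(X, y, max_epochs=100):
    # Online perceptron rule: on a mistake, w <- w + (1/2)(y_i - y_hat_i) phi(x_i),
    # which is the same as w <- w + y_i phi(x_i).
    w = np.zeros(phi(X[0]).shape[0])        # start from w_0 = 0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            y_hat = g(w @ phi(xi))          # prediction g(w^T phi(x_i))
            if y_hat != yi:
                w = w + 0.5 * (yi - y_hat) * phi(xi)
                mistakes += 1
        if mistakes == 0:                   # one clean pass: the rule has converged
            break
    return w

# Usage on a linearly separable toy problem (AND function with labels in {-1, +1}).
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1, -1, -1, +1])
w = perceptron_train(X, y)
print([g(w @ phi(xi)) for xi in X])         # expected: [-1, -1, -1, 1]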
Kernel Perceptron

Any linear predictor involving only scalar products can be kernelized (kernel trick, cf. SVM). Given w(t) = w_0 + \sum_{i \in \mathcal{I}} y_i x_i,

\langle w, x \rangle = \langle w_0, x \rangle + \sum_{i \in \mathcal{I}} y_i \langle x_i, x \rangle
\;\Rightarrow\; k(w, x) = k(w_0, x) + \sum_{i \in \mathcal{I}} y_i k(x_i, x)

[Figure: 2D example of the non-linear decision boundary obtained with a kernelized perceptron.]

Adaptive Linear Elements (Widrow, Hoff, 1962): Linear regression, analytically

• Given (x_i, y_i), y_i \in \mathbb{R}
• minimize J(w) = \frac{1}{N}\sum_i \|y_i - w^T x_i\|^2
• Analytically, \nabla_w J(w) = 0 \Rightarrow X X^T w = X y
• X X^T non-singular: w = (X X^T)^{-1} X y
• X X^T singular (e.g. points along a line in 2D): infinitely many solutions
• regularized least squares: \min_w G(w) = J(w) + \alpha w^T w
• \nabla_w G(w) = 0 \Rightarrow (X X^T + \alpha I) w = X y
• as soon as \alpha > 0, (X X^T + \alpha I) is non-singular

This needs computing X X^T, i.e. a pass over the whole training set...

Adaptive Linear Elements (Widrow, Hoff, 1962): Linear regression with stochastic gradient descent

• start at w_0
• take each sample one after the other (online): x_i, y_i
• denote \hat{y}_i = w^T x_i the prediction
• update w_{t+1} = w_t - \varepsilon \nabla_w J(w_t) = w_t + \varepsilon (y_i - \hat{y}_i) x_i
• delta rule, with \delta = (y_i - \hat{y}_i) the prediction error: w_{t+1} = w_t + \varepsilon \delta x_i
• note the similarity with the perceptron learning rule

The samples x_i are supposed to be "extended" with one extra dimension set to 1.

Batch/Minibatch/Stochastic gradient descent

J(w; x, y) = \frac{1}{N}\sum_{i=1}^N L(w; x_i, y_i), \quad \text{e.g. } L(w; x_i, y_i) = \|y_i - w^T x_i\|^2

Batch gradient descent
• compute the gradient of the loss J(w) over the whole training set
• perform one step in the direction of -\nabla_w J(w; x, y):
  w_{t+1} = w_t - \varepsilon_t \nabla_w J(w; x, y)
• \varepsilon: the learning rate

Stochastic gradient descent (SGD)
• one sample at a time, a noisy estimate of \nabla_w J
• perform one step in the direction of -\nabla_w L(w; x_i, y_i):
  w_{t+1} = w_t - \varepsilon_t \nabla_w L(w; x_i, y_i)
• faster to converge than batch gradient descent

Minibatch
• noisy estimate of the true gradient with M samples (e.g. M = 64, 128); M is the minibatch size
• draw a random subset \mathcal{J} of the training set with |\mathcal{J}| = M, one set at a time:
  w_{t+1} = w_t - \varepsilon_t \frac{1}{M}\sum_{j \in \mathcal{J}} \nabla_w L(w; x_j, y_j)
• smoother estimate than SGD
• great for parallel architectures (GPU); a minibatch update sketch is given at the end of this section

Does it make sense to use gradient descent?

Convex function: a function f : \mathbb{R}^n \to \mathbb{R} is convex

\iff \forall x_1, x_2 \in \mathbb{R}^n, \forall t \in [0, 1], \; f(t x_1 + (1 - t) x_2) \leq t f(x_1) + (1 - t) f(x_2)
\iff (for f twice differentiable) \forall x \in \mathbb{R}^n, H = \nabla^2 f(x) is positive semidefinite, i.e. \forall x \in \mathbb{R}^n, x^T H x \geq 0

For a convex function f, all local minima are global minima.
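As mentioned in the minibatch slide above, here is a minimal NumPy sketch of the minibatch update for the least-squares loss L(w; x_i, y_i) = \|y_i - w^T x_i\|^2, compared against the regularized closed-form solution. The synthetic data, the constant learning rate, the minibatch size M = 64 and the value of \alpha are illustrative assumptions; also note that the code stores samples as rows, so the X X^T of the slides becomes X^T X here.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data; the inputs are "extended" with a constant 1 for the bias.
N, d = 1000, 3
X = np.hstack([rng.normal(size=(N, d)), np.ones((N, 1))])   # shape (N, d+1), samples as rows
w_true = np.array([1.5, -2.0, 0.5, 0.3])
y = X @ w_true + 0.1 * rng.normal(size=N)

def minibatch_sgd(X, y, lr=0.05, M=64, epochs=50):
    # w_{t+1} = w_t - lr * (1/M) * sum over the minibatch of grad_w ||y_j - w^T x_j||^2
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        order = rng.permutation(len(X))                      # randomize the minibatches
        for start in range(0, len(X), M):
            idx = order[start:start + M]
            Xb, yb = X[idx], y[idx]
            grad = -2.0 / len(idx) * Xb.T @ (yb - Xb @ w)    # gradient of the mean squared error
            w = w - lr * grad
    return w

# Regularized closed form for comparison: (X^T X + alpha * I) w = X^T y.
alpha = 1e-3
w_ridge = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)

print(minibatch_sgd(X, y))   # both estimates should be close to w_true
print(w_ridge)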