Introduction Feedforward Neural Networks

Deep learning

Jérémy Fix

CentraleSupélec, jeremy.fi[email protected]

2016

1 / 94 Introduction Feedforward Neural Networks

Introduction and historical perspective

[Schmidhuber, 2015] Deep Learning in Neural Networks: An Overview, Jürgen Schmidhuber, Neural Networks (61), pages 85-117

2 / 94 Introduction Feedforward Neural Networks

Historical perspective on neural networks

• Perceptron (Rosenblatt, 1962): linear classifier
• AdaLinE (Widrow, Hoff, 1962): linear regressor
• Minsky/Papert (1969): first winter
• Convolutional Neural Networks (1980, 1998): great!
• Multilayer networks and backprop (Rumelhart, 1986): great! but hard to train, and the SVMs come in the 1990s ...: second winter
• 2006: pretraining!
• 2012: AlexNet on ImageNet (10% better on test than the 2nd)
• Now: lots of state-of-the-art neural networks

3 / 94 Introduction Feedforward Neural Networks

Some reasons of the current success

• GPUs (speed of processing) / Data (regularizing)
• theoretical understanding of the difficulty of training deep networks

Which libs?

• Torch (Lua) / PyTorch, Caffe (Python/C++)
• Theano/Lasagne (Python, RIP 2017), Tensorflow (Google, Python, C++), Keras (wrapper over Tensorflow and Theano), CNTK, MXNET, Chainer, DyNet, ...

4 / 94 Introduction Feedforward Neural Networks

What to read ? www.deeplearningbook.org Goodfellow, Bengio, Courville(2016)

Who to follow ?

• N-1: LeCun, Bengio, Hinton, Schmidhuber
• N: Goodfellow, Dauphin, Graves, Sutskever, Karpathy, Krizhevsky, ...
Obviously, the community is much larger.

Which conferences ? ICML, NIPS, ICLR, ..

https://github.com/terryum/awesome-deep-learning-papers 5 / 94 Introduction Feedforward Neural Networks

What is a neural network?
A neural network is a directed graph:
• edges: weighted connections
• nodes: computational units
• no cycle: feedforward neural networks (FNN)
• with cycles: recurrent neural networks (RNN)

The tree hides the jungle: what is a convolutional neural network with a softmax output, ReLU hidden activations, batch normalization layers, trained with RMSprop with Nesterov momentum, regularized with dropout?

6 / 94 Introduction Feedforward Neural Networks

Feedforward Neural Networks (FNN)

[Figure: a feedforward network with an input layer, a hidden layer and an output layer, plus a skip-layer connection]

7 / 94 Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962)

• classification: given (x_i, y_i), y_i ∈ {−1, 1}
• SAR (Sensory, Associative, Result) architecture, basis functions φ_j(x) with φ_0(x) = 1
• algorithm
• geometrical interpretation

[Figure: the SAR architecture: sensory inputs x_j, associative units a_j = φ_j(x), and result units r_i computed from the weighted sums Σ_j w_ij a_j]

8 / 94 Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962)

Classifier

Given feature functions φj , with φ0(x) = 1, the perceptron classifies x as :

y = g(w^T φ(x))    (1)

g(x) = −1 if x < 0, +1 if x ≥ 0    (2)

with φ(x) ∈ R^(n_a+1), φ(x) = (1, φ_1(x), φ_2(x), ...)^T

9 / 94 Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962)

Online Training algorithm

Given (x_i, y_i), y_i ∈ {−1, 1}, the perceptron learning rule operates online:

w ← w              if the input is correctly classified
w ← w + φ(x_i)     if the input is incorrectly classified as −1     (3)
w ← w − φ(x_i)     if the input is incorrectly classified as +1
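A minimal NumPy sketch of this online rule (the feature map, toy dataset and number of epochs are illustrative assumptions, not from the slides):

import numpy as np

def phi(x):
    # feature map with phi_0(x) = 1 (bias feature)
    return np.concatenate(([1.0], x))

def perceptron_train(X, y, n_epochs=10):
    # X: (N, d) inputs, y: labels in {-1, +1}
    w = np.zeros(X.shape[1] + 1)
    for _ in range(n_epochs):
        for xi, yi in zip(X, y):
            y_hat = 1.0 if w @ phi(xi) >= 0 else -1.0   # g(w^T phi(x_i))
            if y_hat != yi:
                w = w + yi * phi(xi)                    # the two misclassified cases in one update
    return w

# illustrative usage on a linearly separable problem (logical AND)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., -1., -1., 1.])
w = perceptron_train(X, y)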

10 / 94 Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962)

Geometrical interpretation of y = g(w^T φ(x)): cases when a sample is correctly classified (y_i = +1 and y_i = −1).

[Figure: for a correctly classified sample, φ(x_i) lies on the correct side of the hyperplane orthogonal to w]

11 / 94 Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962)

Geometrical interpretation of y = g(w^T φ(x)): cases when a sample is misclassified (y_i = +1 and y_i = −1).

[Figure: the update rotates w toward w + φ(x_i) for a missed positive, and toward w − φ(x_i) for a missed negative]

12 / 94 Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962) The cone of feasible solutions

Consider two samples x_1, x_2 with y_1 = +1, y_2 = −1.

[Figure: the feasible weight vectors form a cone delimited by the hyperplanes v^T φ(x_1) = 0 and v^T φ(x_2) = 0]

13 / 94 Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962)

Online Training algorithm

Given (x_i, y_i), y_i ∈ {−1, 1}, the perceptron learning rule operates online:

w ← w              if the input is correctly classified
w ← w + φ(x_i)     if the input is incorrectly classified as −1     (4)
w ← w − φ(x_i)     if the input is incorrectly classified as +1

w ← w              if g(w^T φ(x_i)) = y_i
w ← w + φ(x_i)     if g(w^T φ(x_i)) = −1 and y_i = +1     (5)
w ← w − φ(x_i)     if g(w^T φ(x_i)) = +1 and y_i = −1

14 / 94 Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962) Online Training algorithm

Given (x_i, y_i), y_i ∈ {−1, 1}, the perceptron learning rule operates online:

w ← w              if g(w^T φ(x_i)) = y_i
w ← w + φ(x_i)     if g(w^T φ(x_i)) = −1 and y_i = +1
w ← w − φ(x_i)     if g(w^T φ(x_i)) = +1 and y_i = −1

w ← w              if g(w^T φ(x_i)) = y_i
w ← w + y_i φ(x_i) if g(w^T φ(x_i)) ≠ y_i

w ← w + (1/2)(y_i − ŷ_i) φ(x_i), with ŷ_i = g(w^T φ(x_i))
15 / 94 Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962)

Definition (Linear separability)
A binary classification problem (x_i, y_i) ∈ R^d × {−1, 1}, i ∈ [1..N], is said to be linearly separable if there exists w ∈ R^d such that:
∀i, sign(w^T x_i) = y_i
with sign(x) = −1 for x < 0 and sign(x) = +1 for x ≥ 0.

Theorem (Perceptron convergence theorem)
A classification problem (x_i, y_i) ∈ R^d × {−1, 1}, i ∈ [1..N], is linearly separable if and only if the perceptron learning rule converges to an optimal solution in a finite number of steps.

⇐: easy; ⇒: we upper/lower bound ||w(t)||²₂
16 / 94 Introduction Feedforward Neural Networks

Perceptron (Rosenblatt, 1962)

• w_t = w_0 + Σ_{i ∈ I(t)} y_i φ(x_i), with I(t) the set of misclassified samples
• it minimizes a loss: J(w) = (1/N) Σ_i max(0, −y_i w^T φ(x_i))
• the solution can be written as w_t = w_0 + (1/2) Σ_i (y_i − ŷ_i) φ(x_i); (y_i − ŷ_i) is the prediction error

17 / 94 Introduction Feedforward Neural Networks

Kernel Perceptron
Any linear predictor involving only scalar products can be kernelized (kernel trick, cf. SVM). Given w(t) = w_0 + Σ_{i ∈ I} y_i x_i:

<w, x> = <w_0, x> + Σ_{i ∈ I} y_i <x_i, x>
⇒ k(w, x) = k(w_0, x) + Σ_{i ∈ I} y_i k(x_i, x)

[Figure: a kernel perceptron on a 2D toy dataset]

18 / 94 Introduction Feedforward Neural Networks

Adaptive Linear Elements (Widrow, Hoff, 1962)

Linear regression, Analytically

• Given (x_i, y_i), y_i ∈ R
• minimize J(w) = (1/N) Σ_i ||y_i − w^T x_i||²
• Analytically, ∇_w J(w) = 0 ⇒ X X^T w = X y
• X X^T non singular: w = (X X^T)^{−1} X y
• X X^T singular (e.g. points along a line in 2D): infinitely many solutions
• regularized least squares: min_w G(w) = J(w) + α w^T w
• ∇_w G(w) = 0 ⇒ (X X^T + α I) w = X y
• as soon as α > 0, (X X^T + α I) is not singular
Needs to compute X X^T, i.e. over the whole training set...
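A NumPy sketch of these analytic solutions, with one sample per column of X as in the slides (the toy data and the value of α are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
N, d = 30, 2
X = rng.normal(size=(d, N))              # one sample per column
w_true = np.array([3.0, 2.0])
y = w_true @ X + 0.01 * rng.normal(size=N)

# ordinary least squares: solve (X X^T) w = X y, assuming X X^T is non singular
w_ols = np.linalg.solve(X @ X.T, X @ y)

# regularized (ridge) least squares: (X X^T + alpha I) w = X y, alpha > 0
alpha = 0.1
w_ridge = np.linalg.solve(X @ X.T + alpha * np.eye(d), X @ y)
print(w_ols, w_ridge)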

19 / 94 Introduction Feedforward Neural Networks

Adaptive Linear Elements (Widrow, Hoff, 1962)

Linear regression with stochastic gradient descent

• start at w_0
• take each sample one after the other (online): x_i, y_i
• denote ŷ_i = w^T x_i the prediction
• update w_{t+1} = w_t − ε ∇_w J(w_t) = w_t + ε (y_i − ŷ_i) x_i
• delta rule: δ = (y_i − ŷ_i) is the prediction error

w_{t+1} = w_t + ε δ x_i

• note the similarity with the perceptron learning rule

The samples xi are supposed to be “extended” with one dimension set to 1.

20 / 94 Introduction Feedforward Neural Networks

Batch/Minibatch/Stochastic gradient descent

J(w, x, y) = (1/N) Σ_{i=1}^N L(w, x_i, y_i),   e.g. L(w, x_i, y_i) = ||y_i − w^T x_i||²

Batch gradient descent

• compute the gradient of the loss J(w) over the whole training set
• perform one step in the direction of −∇_w J(w, x, y)

w_{t+1} = w_t − ε_t ∇_w J(w, x, y)

• ε: learning rate

21 / 94 Introduction Feedforward Neural Networks

Batch/Minibatch/Stochastic gradient descent

J(w, x, y) = (1/N) Σ_{i=1}^N L(w, x_i, y_i),   e.g. L(w, x_i, y_i) = ||y_i − w^T x_i||²

Stochastic gradient descent (SGD)

• one sample at a time, noisy estimate of ∇_w J
• perform one step in the direction of −∇_w L(w, x_i, y_i)

w_{t+1} = w_t − ε_t ∇_w L(w, x_i, y_i)

• faster to converge than batch gradient descent

22 / 94 Introduction Feedforward Neural Networks

Batch/Minibatch/Stochastic gradient descent

J(w, x, y) = (1/N) Σ_{i=1}^N L(w, x_i, y_i),   e.g. L(w, x_i, y_i) = ||y_i − w^T x_i||²

Minibatch

• noisy estimate of the true gradient with M samples (e.g. M = 64, 128); M is the minibatch size
• randomize the minibatches J, |J| = M, one set at a time

w_{t+1} = w_t − ε_t (1/M) Σ_{j ∈ J} ∇_w L(w, x_j, y_j)

• smoother estimate than SGD
• great for parallel architectures (GPU)
23 / 94 Introduction Feedforward Neural Networks
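A sketch of a minibatch SGD loop in NumPy, here on the linear/L2 case of the slides (the batch size, learning rate and toy data are illustrative assumptions):

import numpy as np

def grad_L(w, xb, yb):
    # gradient of the mean L2 loss ||y - w^T x||^2 over a minibatch
    err = yb - xb @ w
    return -2.0 * xb.T @ err / len(yb)

rng = np.random.default_rng(0)
N, d, M, eps = 200, 3, 32, 0.05
X = rng.normal(size=(N, d))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.01 * rng.normal(size=N)

w = np.zeros(d)
for epoch in range(50):
    perm = rng.permutation(N)                 # randomize the minibatches
    for start in range(0, N, M):
        idx = perm[start:start + M]
        w -= eps * grad_L(w, X[idx], y[idx])  # w <- w - eps * minibatch gradient
print(w)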

Does it make sense to use gradient descent ?

Convex function
A function f: R^n → R is convex:
⇔ ∀ x_1, x_2 ∈ R^n, ∀ t ∈ [0, 1], f(t x_1 + (1 − t) x_2) ≤ t f(x_1) + (1 − t) f(x_2)
⇔ (with f twice differentiable) ∀ x ∈ R^n, H = ∇² f(x) is positive semidefinite, i.e. ∀ x ∈ R^n, x^T H x ≥ 0

For a convex function f, all local minima are global minima. Our losses are lower bounded, so these minima exist. Under mild conditions, gradient descent and stochastic gradient descent converge, typically with Σ_t ε_t = ∞ and Σ_t ε_t² < ∞ (cf. lectures on convex optimization).

24 / 94 Introduction Feedforward Neural Networks

Does it make sense to use gradient descent ?

Linear regression with L2 loss is convex. Indeed,
• Given x_i, y_i, L(w) = (1/2)(w^T x_i − y_i)² is convex:
  ∇_w L = (w^T x_i − y_i) x_i
  ∇²_w L = x_i x_i^T
  ∀ x ∈ R^n, x^T x_i x_i^T x = (x_i^T x)² ≥ 0
• a non-negative weighted sum of convex functions is convex

25 / 94 Introduction Feedforward Neural Networks

Linear regression, synthesis

Linear regression

• samples (x_i, y_i), y_i ∈ R
• extend x_i by adding a constant dimension equal to 1, which accounts for the bias
• linear model ŷ_i = w^T x_i
• L2 loss L(ŷ, y) = (1/2)||ŷ − y||²
• by gradient descent:
  ∇_w L(w, x_i, y_i) = (∂L/∂ŷ)(∂ŷ/∂w) = −(y_i − ŷ_i) x_i

26 / 94 Introduction Feedforward Neural Networks

Linear classification, synthesis
Linear binary classification (logistic regression)

• samples (x_i, y_i), y_i ∈ {0, 1}
• extend x_i by adding a constant dimension equal to 1, which accounts for the bias
• linear model w^T x
• sigmoid transfer function ŷ_i = σ(w^T x_i), with
  σ(x) = 1/(1 + exp(−x)), σ(x) ∈ [0, 1]
  (d/dx) σ(x) = σ(x)(1 − σ(x))
• cross entropy loss L(ŷ, y) = −y log ŷ − (1 − y) log(1 − ŷ)
• by gradient descent:
  ∇_w L(w, x_i, y_i) = (∂L/∂ŷ)(∂ŷ/∂w) = −(y_i − ŷ_i) x_i
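A sketch of one SGD step for logistic regression; note that the gradient has the same (y_i − ŷ_i) x_i form as in linear regression (the data and learning rate are illustrative assumptions):

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sgd_step_logreg(w, xi, yi, eps=0.1):
    # xi is assumed already extended with a constant 1 for the bias
    y_hat = sigmoid(w @ xi)          # prediction in [0, 1]
    grad = -(yi - y_hat) * xi        # gradient of the cross entropy loss
    return w - eps * grad

# illustrative usage
w = np.zeros(3)
xi = np.array([1.0, 0.5, -1.2])      # [1, x_1, x_2]
w = sgd_step_logreg(w, xi, 1.0)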

27 / 94 Introduction Feedforward Neural Networks

Linear classification, synthesis

Logistic regression is convex. Indeed,
• Given x_i, y_i = 1:
  L_1(w) = −log(σ(w^T x_i)) = log(1 + exp(−w^T x_i))
  ∇_w L_1 = −(1 − σ(w^T x_i)) x_i
  ∇²_w L_1 = σ(w^T x_i)(1 − σ(w^T x_i)) x_i x_i^T, with σ(w^T x_i)(1 − σ(w^T x_i)) > 0
• Given x_i, y_i = 0: L_2(w) = −log(1 − σ(w^T x_i))
  ∇_w L_2 = σ(w^T x_i) x_i
  ∇²_w L_2 = σ(w^T x_i)(1 − σ(w^T x_i)) x_i x_i^T > 0
• a non-negative weighted sum of convex functions is convex

28 / 94 Introduction Feedforward Neural Networks

Why L2 loss for linear classification with SGD is bad Compute the gradient to see why...

• Take the L2 loss L(ŷ, y) = (1/2)||ŷ − y||²
• Take the "linear" model ŷ_i = σ(w^T x_i)
• Check that (d/dx) σ(x) = σ(x)(1 − σ(x))
• Compute the gradient wrt w:

∇_w L(w, x_i, y_i) = (∂L/∂ŷ)(∂ŷ/∂w) = −(y_i − ŷ_i) σ(w^T x_i)(1 − σ(w^T x_i)) x_i

• If x_i is strongly misclassified (e.g. y_i = 1, w^T x_i very negative), then σ(w^T x_i)(1 − σ(w^T x_i)) ≈ 0, i.e. ∇_w L(w, x_i, y_i) ≈ 0 ⇒ the step size is very small while the sample is misclassified
• With a cross entropy loss, ∇_w L(w, x_i, y_i) is proportional to the error

29 / 94 Introduction Feedforward Neural Networks

Linear classification, synthesis Linear multiclass classification

• samples (x_i, y_i), labels y_i ∈ [|0, k−1|]
• extend x_i by adding a constant dimension equal to 1, which accounts for the bias
• linear models, one per class: w_j^T x
• softmax transfer function: P(y = j | x) = ŷ_j = exp(w_j^T x) / Σ_k exp(w_k^T x)
• generalization of the sigmoid to a vectorial output
• cross entropy loss L(ŷ, y) = −log ŷ_y
• by gradient descent:
  ∇_{w_j} L(w, x, y) = Σ_k (∂L/∂ŷ_k)(∂ŷ_k/∂w_j) = −(δ_{j,y} − ŷ_j) x
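A sketch of the softmax probabilities and of the gradient of the cross-entropy loss for one sample, matching the (δ_{j,y} − ŷ_j) x expression above (shapes and data are illustrative assumptions):

import numpy as np

def softmax(a):
    a = a - a.max()                      # numerical stability
    e = np.exp(a)
    return e / e.sum()

def grad_multiclass(W, x, y):
    # W: (k, d) one weight vector per class, x: (d,) extended input, y: integer label
    y_hat = softmax(W @ x)               # P(y = j | x)
    delta = np.zeros_like(y_hat)
    delta[y] = 1.0
    return -np.outer(delta - y_hat, x)   # dL/dW, row j is -(delta_{j,y} - y_hat_j) x

# illustrative usage
W = np.zeros((4, 3))
g = grad_multiclass(W, np.array([1.0, 0.2, -0.3]), y=2)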

30 / 94 Introduction Feedforward Neural Networks

Perceptron and linear separability

The perceptron performs linear separation in a predefined, fixed feature space.
The XOR: xor(x_1, x_2) = x_1 x̄_2 + x̄_1 x_2

[Figure: XOR is not linearly separable in the (x_1, x_2) space, but becomes linearly separable with an additional feature such as x_1 x_2]

Can we learn the φj (x) ???

31 / 94 Introduction Feedforward Neural Networks

Radial basis functions (RBF), [Broomhead, 1988]

• RBF kernel: φ_0(x) = 1, φ_j(x) = exp(−||x − μ_j||² / (2 σ_j²))
• for regression (L2 loss) or classification (cross entropy loss)
• e.g. for regression:
  ŷ(x) = w^T φ(x)
  L(w, x_i, y_i) = ||y_i − w^T φ(x_i)||²
• What about the centers and variances? [Schwenker, 2001]
  • place them uniformly, randomly, or by vector quantization (k-means, GNG [Fritzke, 1994])
  • two phases: fix the centers/variances, fit the weights
  • three phases: fix the centers/variances, fit the weights, then fit everything (∇_μ L, ∇_σ L, ∇_w L)
32 / 94 Introduction Feedforward Neural Networks

Radial basis functions(RBF)

RBF are universal approximators [Park, Sandberg (1991)]
Denote S the family of functions based on RBF in R^d:
S = { g: R^d → R, g(x) = Σ_{i=1}^N w_i φ_i(x), w ∈ R^N }
Then S is dense in L^p(R^d) for every p ∈ [1, ∞).
Actually, the theorem applies to a larger class of functions φ_i.

33 / 94 Introduction Feedforward Neural Networks

Feedforward neural networks (or MLP [Rumelhart, 1986])

[Figure: a feedforward network with input layer 0, hidden layers 1, ..., L−1 and output layer L; each unit also receives a constant input 1 (bias)]

a_i^(1) = Σ_j w_ij^(1) x_j              y_i^(1) = f(a_i^(1))
a_i^(L−1) = Σ_j w_ij^(L−1) y_j^(L−2)    y_i^(L−1) = f(a_i^(L−1))
a_i^(L) = Σ_j w_ij^(L) y_j^(L−1)        y_i^(L) = g(a_i^(L))

Named MLP for historical reasons. Should be called FNN.

34 / 94 Introduction Feedforward Neural Networks

Feedforward neural networks

[Figure: the same network, with the layer-wise equations a_i^(l) = Σ_j w_ij^(l) y_j^(l−1), y_i^(l) = f(a_i^(l)) for the hidden layers and y_i^(L) = g(a_i^(L)) for the output layer]

Architecture

• Depth: number of layers, without counting the input (deep = large depth)
• Width: number of units per layer
• weights and biases for each unit
• Hidden transfer function f, output transfer function g
35 / 94 Introduction Feedforward Neural Networks
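A sketch of the forward pass following the layer equations above, for a toy architecture; the layer sizes and the choice of ReLU/linear transfer functions are illustrative assumptions:

import numpy as np

def relu(a):
    return np.maximum(a, 0.0)

def forward(x, weights, biases, hidden_f=relu, output_g=lambda a: a):
    # weights[l], biases[l]: parameters of layer l+1; returns the output y^(L)
    y = x
    for W, b in zip(weights[:-1], biases[:-1]):
        y = hidden_f(W @ y + b)          # hidden layers: y^(l) = f(W^(l) y^(l-1) + b^(l))
    W, b = weights[-1], biases[-1]
    return output_g(W @ y + b)           # output layer (linear here)

# illustrative 4-3-2 network
rng = np.random.default_rng(0)
weights = [rng.normal(size=(3, 4)), rng.normal(size=(2, 3))]
biases = [np.zeros(3), np.zeros(2)]
y_out = forward(rng.normal(size=4), weights, biases)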

Feedforward neural networks

Architecture: hidden transfer function
• Historically, f was taken as a sigmoid or tanh.
• Now, mainly Rectified Linear Units (ReLU) or similar: f(x) = max(x, 0)

ReLUs are more favorable for the gradient flow than the saturating functions [Krizhevsky(2012), Nair(2010), Jarrett(2009)]

[Figure: the sigmoid and tanh transfer functions saturate, while the ReLU does not]

36 / 94 Introduction Feedforward Neural Networks

Feedforward neural networks

Architecture: output transfer function and loss
• for regression:
  • linear f(x) = x
  • L2 loss L(ŷ, y) = ||y − ŷ||²
• for multiclass classification:
  • softmax ŷ_j = e^{a_j} / Σ_k e^{a_k}
  • negative log likelihood loss L(ŷ, y) = −log(ŷ_y)

37 / 94 Introduction Feedforward Neural Networks

FNN training : error backpropagation

Training by gradient descent

• initialize weights and biases w_0
• at every iteration, compute:

w ← w − ε ∇_w J

The partial derivatives ∂J/∂w_i?
Fundamentally, use the chain rule within the computational graph linking any variable (inputs, weights, biases) to the output of the loss.

Backprop is usually attributed to [Rumelhart,1986] but [Werbos,1981] already introduced the idea.

38 / 94 Introduction Feedforward Neural Networks

Computing partial derivatives: computational graph
A computational graph is a directed graph
• nodes: variables (weights, inputs, outputs, targets, ...)
• edges: operations (ReLU, Softmax, w^T x + b, ..., losses, ...)

We only need to know, for each operation:
• the partial derivatives wrt its parameters
• the partial derivatives wrt its inputs
39 / 94 Introduction Feedforward Neural Networks

Computing partial derivatives ∂J/∂w_i

The chain rule: single path

Suppose there is a single path, e.g. x_i → u_3.

Applying the chain rule: ∂u_3/∂x_i = (∂u_3/∂u_2)(∂u_2/∂u_1)(∂u_1/∂x_i) = ∂/∂x_i (y_i − w x_i − b)²

u_1 = w x_i + b
u_2 = y_i − u_1
u_3 = u_2²
⇒ ∂u_3/∂x_i = 2 u_2 · (−1) · w = −2 w (y_i − w x_i − b)



40 / 94 Introduction Feedforward Neural Networks

Computing partial derivatives ∂J/∂w_i

The chain rule: multiple paths

Sum over all the paths (e.g. u_3 = w_1 x_i + w_2 x_i):

∂u_3/∂x_i = Σ_{j ∈ {1,2}} (∂u_3/∂u_j)(∂u_j/∂x_i)

u_1 = w_1 x_i
u_2 = w_2 x_i
u_3 = u_1 + u_2
⇒ ∂u_3/∂x_i = 1 · w_1 + 1 · w_2



41 / 94 Introduction Feedforward Neural Networks

But, it is computationally expensive There are a lot of paths...

There are 4 paths from xi to u5

42 / 94

We need to identify and sum over all these paths. Oops, there can be an exponential number of paths: with L fully-connected layers of N units each and 1 output, there are N^L paths.
Introduction Feedforward Neural Networks

Let us be more efficient: forward-mode differentiation
Forward differentiation
Idea: to compute ∂u_5/∂x_i, propagate ∂·/∂x_i forward.

∂u_1/∂x_i = (∂u_1/∂z_1)(∂z_1/∂x_i) + (∂u_1/∂z_2)(∂z_2/∂x_i) = z_2 · 1 + z_1 · 0 = z_2 = w_1

But how to compute ∂u_5/∂w_1? Well, propagate ∂·/∂w_1. And ∂u_5/∂w_2? Propagate again... or...
Griewank(2010) Who Invented the Reverse Mode of Differentiation?
http://colah.github.io/posts/2015-08-Backprop/
43 / 94 Introduction Feedforward Neural Networks

Let us be even more efficient: reverse-mode differentiation
Reverse differentiation
Idea: to compute ∂u_5/∂x_i, backpropagate ∂u_5/∂·.

∂u_5/∂u_1 = (∂u_5/∂u_3)(∂u_3/∂u_1) + (∂u_5/∂u_4)(∂u_4/∂u_1) = 1 · w_3 + 1 · w_6

We get ∂u_5/∂x_i, but also ∂u_5/∂w_1, ∂u_5/∂w_2, ... all in a single pass!
Griewank(2010) Who Invented the Reverse Mode of Differentiation?
http://colah.github.io/posts/2015-08-Backprop/
44 / 94 Introduction Feedforward Neural Networks

FNN training : error backpropagation

In neural networks, reverse-mode differentiation is called error backpropagation.
Training in two phases

• Evaluation of the output: forward propagation
• Evaluation of the gradient: reverse-mode differentiation

Careful: the reverse-mode differentiation reuses the activations computed during the forward propagation.

Libraries like theano augment the computational graph with nodes computing numerically the gradient by reverse-mode differentiation.
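A sketch of the two phases for a one-hidden-layer regression network with ReLU and L2 loss: the forward pass stores the activations, and one backward sweep yields all the gradients. Sizes, naming and the loss are illustrative assumptions, not the slides' exact notation:

import numpy as np

def forward_backward(x, y, W1, b1, W2, b2):
    # forward pass: keep the intermediate activations, backprop reuses them
    a1 = W1 @ x + b1
    h1 = np.maximum(a1, 0.0)            # ReLU hidden layer
    y_hat = W2 @ h1 + b2                # linear output
    loss = 0.5 * np.sum((y_hat - y) ** 2)

    # reverse-mode differentiation: one backward sweep gives every gradient
    d_yhat = y_hat - y                  # dL/dy_hat
    dW2 = np.outer(d_yhat, h1)
    db2 = d_yhat
    d_h1 = W2.T @ d_yhat                # backpropagate through the output layer
    d_a1 = d_h1 * (a1 > 0)              # through the ReLU (derivative in {0, 1})
    dW1 = np.outer(d_a1, x)
    db1 = d_a1
    return loss, (dW1, db1, dW2, db2)

rng = np.random.default_rng(0)
W1, b1 = 0.1 * rng.normal(size=(5, 3)), np.zeros(5)
W2, b2 = 0.1 * rng.normal(size=(2, 5)), np.zeros(2)
loss, grads = forward_backward(rng.normal(size=3), np.array([1.0, -1.0]), W1, b1, W2, b2)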

45 / 94 Introduction Feedforward Neural Networks

Universal approximator
Any well-behaved function can be approximated arbitrarily well by a single-hidden-layer FNN.
Intuition

• Take a sigmoid transfer function f(x) = 1/(1 + exp(−α(x − b_i))): this is the hidden layer
• subtract two such activations to get Gaussian-like kernels

[Figure: the difference of two shifted sigmoids gives a localized, Gaussian-like bump]

• weight such subtractions, and you are back to the RBFs
46 / 94 Introduction Feedforward Neural Networks

But then, why deep networks ?? Going deeper

• A single-hidden-layer FNN is a universal approximator, but the hidden layer can be arbitrarily large
• A deep network (large number of layers) builds high-level features by composing lower-level features
• A shallow network directly learns these high-level features
• Image analogy:
  • first layers: extract oriented contours (e.g. Gabors)
  • second layers: learn corners by combining contours
  • next layers: build up more and more complex features
• Theoretical works compare the expressiveness of {depth d FNN} with {depth d−1 FNN}

Learning deep architectures for AI, Bengio(2009), chap. 2; Benefits of depth in neural networks, Telgarsky(2016)
47 / 94 Introduction Feedforward Neural Networks

And why ReLu ?

Vanishing/exploding gradient [Hochreiter(1991),Bengio(1994)]

Consider u_2 = f(w u_1).
• Remember that when the gradient is "backpropagated", it involves ∂J/∂u_1 = (∂J/∂u_2)(∂u_2/∂u_1) = (∂J/∂u_2) w f'(w u_1)
• backpropagated through L layers: (w f')^L
• with f(x) = 1/(1 + e^{−x}), f'(x) < 1
• if w f' ≠ 1, (w f')^L → 0 or ∞
• ⇒ the gradient vanishes or explodes

With ReLU, f'(x) ∈ {0, 1}. But you can get dead units.

48 / 94 Introduction Feedforward Neural Networks

But the ReLUs can die... Why do they die? If the input to a ReLU is negative, the gradient is 0, that's it... "forever" lost. And then?

• Add a linear component for negative x: Leaky ReLU, Parametric ReLU [He(2015)]
• Exponential Linear Units [Clevert, Hochreiter(2016)]

[Figure: the Leaky ReLU / PReLU and ELU transfer functions, with a non-zero response for negative inputs]
49 / 94 Introduction Feedforward Neural Networks

How to deal with the vanishing/exploding gradient Preventing vanishing gradient by preserving gradient flow

• Using ReLU, Leaky ReLU, PReLU, ELU to ensure a good flow of gradient
• Specific architectures:
  • ResNet (CNN): shortcut connections
  • LSTM (RNN): constant error carousel

Preventing exploding gradients

Gradient clipping [Pascanu, 2013]: clip the norm of the gradient.
[Figure: loss surface of a network with 1 unit, σ(x) activation, 50 layers]
50 / 94 Introduction Feedforward Neural Networks

Regularization

Like kids, FNN can do a lot of things, but we must focus their expressiveness Chap 7, [Bengio et al.(2016)]

51 / 94 Introduction Feedforward Neural Networks

Regularization: L2 regularization
Add an L2 penalty on the weights, α > 0:

J(w) = L(w) + (α/2) ||w||²₂ = L(w) + (α/2) w^T w

∇_w J = ∇_w L + α w

Example: RBF, 1 kernel per sample, N = 30, noisy sinus.
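In an SGD update, the L2 penalty simply adds an α w term to the gradient ("weight decay"); a minimal sketch, with α and ε as illustrative values:

import numpy as np

def sgd_step_l2(w, grad_L, alpha=1e-2, eps=0.1):
    # gradient of J(w) = L(w) + (alpha/2) ||w||^2 is grad_L + alpha * w
    return w - eps * (grad_L + alpha * w)

# illustrative usage: with a zero loss gradient, the weights shrink toward 0
w = np.ones(4)
w = sgd_step_l2(w, grad_L=np.zeros(4))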

[Figure: RBF fits of the noisy sinus with α = 0 and α = 2, and the corresponding weights w*]
Chap. 7 of [Bengio et al.(2016)] for a geometrical interpretation.
52 / 94 Introduction Feedforward Neural Networks

Regularization

L2 regularization: in principle, we should not regularize the bias. Example:

J(w) = (1/N) Σ_{i=1}^N ||y_i − w_0 − Σ_{k≥1} w_k x_{i,k}||²

∇_{w_0} J = 0 ⇒ w_0 = (1/N) Σ_i y_i − Σ_{k≥1} w_k (1/N) Σ_i x_{i,k}

e.g. if your data are centered, i.e. (1/N) Σ_i x_{i,k} = 0, then w_0 = (1/N) Σ_i y_i. Regularizing the bias might lead to underfitting.

53 / 94 Introduction Feedforward Neural Networks

Regularization: L1 regularization promotes sparsity
Add an L1 penalty on the weights:

J(w) = L(w) + α ||w||₁ = L(w) + α Σ_k |w_k|

∇_{w_k} J = ∇_{w_k} L + α sign(w_k)

Example: RBF, 1 kernel per sample, N = 30, noisy sinus, α = 0.003.

[Figure: RBF fits with α = 0 and α = 0.003, and the corresponding weights w*; the L1 penalty drives many weights to exactly zero]

54 / 94 Introduction Feedforward Neural Networks

Regularization Drop-out regularization [Srivastava(2014)]

Idea 1: prevent co-adaptation. A unit should be good by itself, not because others are doing part of the job.
Idea 2: combine an exponential number of networks (ensemble).
How:
• For each minibatch, keep hidden and input activations with probability p (p = 0.5 for hidden, p = 0.8 for inputs). At test time, multiply all the activations by p.
• "Inverted" dropout: scale by 1/p at training time; no scaling at test time.
55 / 94 Introduction Feedforward Neural Networks
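A sketch of "inverted" dropout applied to a layer's activations: scale by 1/p during training, do nothing at test time (the value of p and the shapes are illustrative assumptions):

import numpy as np

def dropout(h, p=0.5, train=True, rng=np.random.default_rng()):
    # h: activations of a layer; keep each unit with probability p
    if not train:
        return h                      # inverted dropout: no scaling at test time
    mask = (rng.random(h.shape) < p).astype(h.dtype)
    return h * mask / p               # scale by 1/p so the expected activation is unchanged

h = np.ones(10)
h_train = dropout(h, p=0.5, train=True)
h_test = dropout(h, p=0.5, train=False)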

Regularization (dropout)

Srivastava(2014), Hinton(2012)

Usually applied after the FC layers (p = 0.5) and the input layer. Can be interpreted as training/averaging all the possible subnetworks.
56 / 94 Introduction Feedforward Neural Networks

Regularization

Split your data in three sets:
• training set: for training
• validation set: for choosing hyperparameters
• test set: for estimating the generalization error

Early stopping
Idea: monitor your error on the validation set (U-shaped performance). Keep the model with the lowest validation error during training.

57 / 94 Introduction Feedforward Neural Networks

Training by some forms of gradient descent

w(t+1) ← w(t) − ε ∇_w J(w_t)

* Chap 8 [Bengio et al. (2016)] * A. Karpathy : http://cs231n.github.io/neural-networks-3/

58 / 94 Introduction Feedforward Neural Networks

But wait...

Does it make sense to apply gradient descent to neural networks?
• we cannot get better than a local minimum?!
• and neural networks lead to non-convex optimization problems, i.e. a lot of local minima (think about the symmetries)
But empirically, most local minima are close to the global minimum with large/deep networks.

Choromanska, 2015 : The Loss Surface of Multilayer Nets Dauphin, 2014 : Identifying and attacking the saddle point problem in high-dimensional non-convex optimization

Pascanu, 2014 : On the saddle point problem for non-convex optimization

59 / 94 Introduction Feedforward Neural Networks

Identifying the critical points The hessian matrix

• Matrix of second-order derivatives; informs on the local curvature:

H(θ) = ∇²_θ J, with entries H_{ij} = ∂²f / (∂x_i ∂x_j)

• for a convex function: H is symmetric, positive semidefinite

60 / 94 Introduction Feedforward Neural Networks

Identifying the type of critical points Eigenvalues of H

• a critical point is a point where ∇_θ J = 0
• if all eigenvalues(H) > 0: local minimum
• if all eigenvalues(H) < 0: local maximum
• if eigenvalues(H) are both positive and negative: saddle point

[Figure: x² + y² (minimum), −(x² + y²) (maximum), x² − y² (saddle)]
61 / 94 Introduction Feedforward Neural Networks

Identifying the type of critical points

And if H is degenerate?
If H is degenerate (some eigenvalues = 0, det(H) = 0), we can have:
• a local minimum: f(x, y) = x⁴ + y⁴, H(x, y) = diag(12x², 12y²)
• a local maximum: f(x, y) = −(x⁴ + y⁴), H(x, y) = diag(−12x², −12y²)
• a saddle point: f(x, y) = x³ + y², H(x, y) = diag(6x, 2)

62 / 94 Introduction Feedforward Neural Networks

Local minima are not an issue with deep networks

Local minima and their loss Experiment : One hidden layer, Mnist, Trained with SGD.

Converges mostly to local minima and some saddle points. The distribution of the test loss of the local minima tends to shrink. Index α: fraction of negative eigenvalues of the Hessian.
Choromanska, 2015: The Loss Surface of Multilayer Nets

63 / 94 Introduction Feedforward Neural Networks

Saddle points seem to be the issue Saddle points and their loss Experiment : “small MLP”, Trained with Saddle-free Newton (converges to critical points).

Newton's method is used to discover the critical points. High-loss critical points are saddle points; low-loss critical points are local minima. Index α: fraction of negative eigenvalues of the Hessian.
Dauphin, 2014: Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
64 / 94 Introduction Feedforward Neural Networks

Training 1st order methods

J(θ) ≈ J(θ_0) + (θ − θ_0)^T ∇_θ J(θ_0)

θ ← θ − ε ∇_θ J

Rationale: first-order approximation
Ĵ(θ) = J(θ_0) + (θ − θ_0)^T ∇_θ J(θ_0)
(θ − θ_0) = −ε ∇_θ J(θ_0) ⇒ Ĵ(θ) = J(θ_0) − ε ||∇_θ J(θ_0)||²
For small ε ||∇_θ J(θ_0)||: J(θ_0 − ε ∇_θ J(θ_0)) ≤ J(θ_0)

65 / 94 Introduction Feedforward Neural Networks

Training, 1st order methods

Minibatch Stochastic Gradient descent

• start at θ_0
• for every minibatch:

θ(t+1) = θ(t) − ε (1/M) Σ_i ∇_θ J_i(θ)

• M = 1: very noisy estimate, stochastic gradient descent
• M = N: true gradient, batch gradient descent
• (minibatch) SGD converges faster
• the trajectory may converge slowly or diverge if ε is not appropriate

66 / 94 Introduction Feedforward Neural Networks

Training, 1st order methods Stochastic Gradient descent : example Take N = 30 samples with

y = 3x + 2 + U(−0.1, 0.1)

Let us perform linear regression (ŷ = wx + b, L2 loss) with SGD.

[Figure: the N = 30 training samples (x, y)]

67 / 94 Introduction Feedforward Neural Networks

Training, 1st order methods
Stochastic gradient descent: zigzag

ε = 0.005, b_0 = 10, w_0 = 5. Converges to w* = 2.9975, b* = 1.9882.

[Figure: trajectory of (b, w) during SGD (zigzag) and log(J) vs. iteration]

68 / 94 Introduction Feedforward Neural Networks

Training, 1st order methods

Momentum
Idea: damp the oscillations with a low-pass filter on ∇_θ J.
• Start at θ_0, v = 0
• for every minibatch:

v(t+1) = α v(t) − ε ∇_θ J
θ(t+1) = θ(t) + v(t+1)

Usually, α ≈ 0.9.
Experiment on http://distill.pub/2017/momentum/
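A sketch of this update as a small stateful loop (the gradient function, α, ε and the number of steps are illustrative assumptions):

import numpy as np

def sgd_momentum(theta0, grad_fn, eps=0.005, alpha=0.9, n_steps=1000):
    theta, v = theta0.copy(), np.zeros_like(theta0)
    for _ in range(n_steps):
        v = alpha * v - eps * grad_fn(theta)   # low-pass filtered gradient
        theta = theta + v
    return theta

# illustrative quadratic J(theta) = 0.5 * theta^T diag(1, 10) theta
grad = lambda th: np.array([1.0, 10.0]) * th
theta_star = sgd_momentum(np.array([10.0, 5.0]), grad)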

69 / 94 Introduction Feedforward Neural Networks

Training, 1st order methods
Stochastic gradient descent with momentum

ε = 0.005, α = 0.6, b_0 = 10, w_0 = 5. Converges to w* = 2.9933, b* = 1.9837.

[Figure: trajectory of (b, w) and log(J) vs. iteration with momentum]

Advised: set α ∈ {0.5, 0.9, 0.99}
70 / 94 Introduction Feedforward Neural Networks

Training, 1st order methods

SGD without/with momentum

[Figure: log(J) vs. iteration for SGD without (left) and with (right) momentum]

71 / 94 Introduction Feedforward Neural Networks

Training, 1st order methods

Nesterov Momentum [Sutskever, PhD thesis]
Idea: look ahead to potentially correct the update. Based on Nesterov's Accelerated Gradient.
• Start at θ_0, v = 0
• for every minibatch:

θ̃(t+1) = θ(t) + α v(t)
v(t+1) = α v(t) − ε ∇_θ J(θ̃)
θ(t+1) = θ(t) + v(t+1)

72 / 94 Introduction Feedforward Neural Networks

Training, 1st order methods
SGD with Nesterov momentum

ε = 0.005, α = 0.8, b_0 = 10, w_0 = 5. Converges to w* = 2.9914, b* = 1.9738.

[Figure: trajectory of (b, w) and log(J) vs. iteration with Nesterov momentum]

In this experiment, Nesterov momentum allowed a larger momentum coefficient; with α = 0.8, plain momentum strongly oscillates.
73 / 94 Introduction Feedforward Neural Networks

Training, 1st order methods

SGD/ SGD momentum / SGD nesterov momentum

[Figure: log(J) vs. iteration for SGD, SGD with momentum, and SGD with Nesterov momentum]

74 / 94 Introduction Feedforward Neural Networks

Training 1st order methods with adaptive learning rates

75 / 94 Introduction Feedforward Neural Networks

Training : adapting the learning rate

Learning rate annealing
Some possible schedules:
• linear decay between ε_0 and ε_τ
• halve the learning rate when the validation error stops improving

76 / 94 Introduction Feedforward Neural Networks

Training : adapting the learning rate

Adagrad [Duchi, 2011]

• Accumulate the square of the gradients:
  r(t+1) = r(t) + ∇_θ J(θ(t)) ⊙ ∇_θ J(θ(t))
• Scale the learning rates individually:
  θ(t+1) = θ(t) − ε / (δ + √r(t+1)) ⊙ ∇_θ J(θ(t))

The √· is experimentally critical. δ ≈ 1e−8 to 1e−4, for numerical stability.
Small gradients ⇒ bigger learning rate; big gradients ⇒ smaller learning rate.
Accumulating from the beginning is too aggressive: the learning rates decrease too fast.

77 / 94 Introduction Feedforward Neural Networks

Training : adapting the learning rate

RMSProp [Hinton, unpublished]
Idea: use an exponentially decaying average of the squared gradient.
• Accumulate the square of the gradients:
  r(t+1) = ρ r(t) + (1 − ρ) ∇_θ J(θ(t)) ⊙ ∇_θ J(θ(t))
• Scale the learning rates individually:
  θ(t+1) = θ(t) − ε / (δ + √r(t+1)) ⊙ ∇_θ J(θ(t))
ρ ≈ 0.9.
And some others: Adadelta [Zeiler, 2012], Adam [Kingma, 2014], ...
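A sketch of the RMSProp update (ρ, δ, ε, the gradient function and the number of steps are illustrative assumptions):

import numpy as np

def rmsprop(theta0, grad_fn, eps=1e-2, rho=0.9, delta=1e-8, n_steps=1000):
    theta, r = theta0.copy(), np.zeros_like(theta0)
    for _ in range(n_steps):
        g = grad_fn(theta)
        r = rho * r + (1.0 - rho) * g * g                  # running average of squared gradients
        theta = theta - eps / (delta + np.sqrt(r)) * g     # per-parameter learning rate
    return theta

# illustrative usage on a badly scaled quadratic
grad = lambda th: np.array([1.0, 100.0]) * th
theta_star = rmsprop(np.array([10.0, 5.0]), grad)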

78 / 94 Introduction Feedforward Neural Networks

Training with 1st order methods So, which one do I use ? [Bengio et al. 2016] There is currently no consensus[...]no single best algorithm has emerged[...]the most popular and actively in use include SGD, SGD with momentum, RMSprop, RMSprop with momentum, Adadelta and Adam

A. Karpathy

Schaul(2014). Unit Tests for Stochastic Optimization 79 / 94 Introduction Feedforward Neural Networks

Training A glimpse into 2nd order methods

J(θ) ≈ J(θ_0) + (θ − θ_0)^T ∇_θ J(θ_0) + (1/2)(θ − θ_0)^T ∇²_θ J(θ_0)(θ − θ_0)

∇_θ J(θ_0): gradient vector
∇²_θ J(θ_0): Hessian matrix

Idea : use a better local approximation to make a more informed update

80 / 94 Introduction Feedforward Neural Networks

Training : 2nd order methods Newton method From a 2nd order taylor approximation

J(θ) ≈ J(θ_0) + (θ − θ_0)^T ∇_θ J(θ_0) + (1/2)(θ − θ_0)^T ∇²_θ J(θ_0)(θ − θ_0)

Critical point at:

∇_θ J(θ) = 0 ⇒ θ = θ_0 − H^{−1} ∇_θ J(θ_0)

Critical points (min, max, saddle) are attractors for Newton!
• cool: we can locate critical points
• but: do not use it for optimizing a neural network!
Dauphin(2014): Identifying and attacking the saddle point problem in high-dimensional non-convex optimization
81 / 94 Introduction Feedforward Neural Networks

Training : 2nd order methods

Second order methods require a larger batchsize. Some algorithms

• Conjugate gradient: no need to compute the Hessian, guaranteed to converge in k steps for a k-dimensional quadratic function
• Saddle-free Newton [Dauphin, 2014]
• Hessian-free optimization (truncated Newton) [Martens, 2010]
• BFGS (quasi-Newton): approximation of H^{−1}, which needs to be stored and is large for deep networks
• L-BFGS: limited-memory BFGS

82 / 94 Introduction Feedforward Neural Networks

Initialization and the importance of good activation distributions

83 / 94 Introduction Feedforward Neural Networks

Preprocessing your inputs
Gradient descent converges faster if your data are normalized and decorrelated. Denote by x_i ∈ R^d your input data and x̂_i its normalized version.

Input normalization

• Min-max scaling:
  ∀i, j, x̂_{i,j} = (x_{i,j} − min_k x_{k,j}) / (max_k x_{k,j} − min_k x_{k,j} + ε)
• Z-score normalization (goal: μ̂_j = 0, σ̂_j = 1):
  ∀i, j, x̂_{i,j} = (x_{i,j} − μ_j) / (σ_j + ε)
• ZCA whitening (goal: μ̂_j = 0, σ̂_j = 1, (1/(n−1)) X̂ X̂^T = I):
  X̂ = W X, W = √(n−1) (X X^T)^{−1/2}
84 / 94 Introduction Feedforward Neural Networks

Z-score normalization / standardizing the inputs
Remember our linear regression: y = 3x + 2 + U(−0.1, 0.1), L2 loss, 30 1D samples.

[Figure: loss contours in the (b, w) plane with raw inputs and with standardized inputs]

With standardized inputs, the gradient always points to the minimum!

85 / 94 Introduction Feedforward Neural Networks

The starting point of training is critical

Pretraining Historically, training deep FNN was known to be hard, i.e. bad generalization errors. The starting point of a gradient descent has a dramatic impact. neural history compressors [Schmidhuber, 1991] • competitive learning [Maclin and Shavlik, 1995] • unsupervised pretraining based on Boltzman Machines • [Hinton, 2006] unsupervised pretraining based on [Bengio, • 2006]

86 / 94 Introduction Feedforward Neural Networks

For example, pretraining with autoencoders

Idea: extract features that allow the previous layer's activities to be reconstructed. Followed by fine-tuning with gradient descent. This does not appear to be that critical nowadays (because of ReLU variants and initialization strategies).

87 / 94 Introduction Feedforward Neural Networks

Initializing the weights/biases

Thoughts

• initially behave as a linear predictor; non-linearities should be activated by the learning algorithm only if necessary
• units should not extract the same features: symmetry breaking, otherwise same gradients

Suppose the inputs are standardized; make the outputs and gradients standardized too:
• sigmoid: b = 0, w ∼ N(0, 1/√fanin) ⇒ in the linear part
• sigmoid, tanh: b = 0, w ∼ U(−√6/√(n_i + n_o), √6/√(n_i + n_o)) [Glorot, 2010]
• ReLU: b = 0, w ∼ N(0, √(2/fanin)) [He(2015)]
88 / 94 Introduction Feedforward Neural Networks

LeCun initialization
Initialization in the linear regime for the forward pass.
Aim: initialize the weights so that f acts in its linear part, i.e. w close to 0.
• Use the symmetric transfer function f(x) = 1.7159 tanh((2/3) x) ⇒ f(1) = 1, f(−1) = −1
• Center, normalize (unit variance) and decorrelate the input dimensions
• initialize the weights from a distribution with μ = 0, σ = 1/√n_i
• set the biases to 0
• This ensures the output of the layer is zero mean, unit variance

Efficient Backprop, Lecun et al. (1998); Generalization and network design strategies, LeCun (1989) 89 / 94 Introduction Feedforward Neural Networks

Glorot initialization strategy Keep same distribution for the forward and backward pass

• The activations and the gradients should have, initially, similar distributions across the layers, to avoid vanishing/exploding gradients
• The input dimensions should be centered, normalized, uncorrelated
• With a transfer function f such that f'(0) = 1, this turns into:
  ∀i, Var[W^i] = 2 / (fanin + fanout)

Glorot (Xavier) uniform: W ∼ U(−√6/√(n_i + n_o), √6/√(n_i + n_o)), b = 0
Glorot (Xavier) normal: W ∼ N(0, √(2/(fanin + fanout))), b = 0

Understanding the difficulty of training deep feedforward neural networks, Glorot, Bengio, JMLR(2010).
90 / 94 Introduction Feedforward Neural Networks

He initialization strategy Designed for rectifier non linearities (ReLU, PReLU). Keep same distribution for the forward and backward pass

• The activations and the gradients should have, initially, similar distributions across the layers
• The input dimensions should be centered, normalized, uncorrelated
• With a ReLU transfer function and d Conv k_1 × k_2 filters on c channels: (1/2) k_1 k_2 c Var[w_l] = 1 (forward pass), (1/2) k_1 k_2 d Var[w_l] = 1 (backward pass)

He uniform: W ∼ U(−√6/√n_i, √6/√n_i), b = 0
He normal: W ∼ N(0, √(2/n_i))

Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, He et al., ICCV(2015).
91 / 94 Introduction Feedforward Neural Networks
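A sketch of the Glorot and He initializers for a fully-connected layer (the layer sizes in the usage lines are illustrative assumptions):

import numpy as np

def glorot_uniform(n_in, n_out, rng=np.random.default_rng()):
    # W ~ U[-sqrt(6)/sqrt(n_in + n_out), +sqrt(6)/sqrt(n_in + n_out)], b = 0
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_out, n_in)), np.zeros(n_out)

def he_normal(n_in, n_out, rng=np.random.default_rng()):
    # W ~ N(0, sqrt(2 / n_in)), b = 0; designed for ReLU layers
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in)), np.zeros(n_out)

W1, b1 = glorot_uniform(784, 256)   # e.g. a tanh layer
W2, b2 = he_normal(256, 128)        # e.g. a ReLU layer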

Batch normalization [Ioffe, Szegedy(2015)]

Internal Covariate Shift Def [Ioffe(2015)]: The change in the distribution of network activations due to the change in network parameters during training Exp : 3 FC(100 units), sigmoid, output softmax, MNIST

Measure: distribution of activations of the last hidden layer during training, 15,50,85 th percentile { }

92 / 94 Introduction Feedforward Neural Networks

Batch normalization [Ioffe, Szegedy(2015)]

Batch normalization to prevent covariate shift
Idea: standardize the activations of every layer to keep the same distributions during training.
• The gradient must be aware of this normalization, otherwise we may get parameter explosion (see Ioffe(2015))
• Introduces a differentiable BN normalization layer:
  z = g(Wu + b) → z = g(BN(Wu + b))

y_i = BN_{γ,β}(x_i) = γ x̂_i + β
x̂_i = (x_i − μ_B) / √(σ_B² + ε)

μ_B, σ_B²: minibatch mean and variance
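A sketch of the batch-normalization forward pass at training time for a fully-connected layer; the running statistics that would be kept for test time are omitted (shapes and ε are illustrative assumptions):

import numpy as np

def batchnorm_forward(X, gamma, beta, eps=1e-5):
    # X: (M, d) pre-activations of a minibatch; gamma, beta: (d,) learned parameters
    mu = X.mean(axis=0)                     # minibatch mean
    var = X.var(axis=0)                     # minibatch variance
    X_hat = (X - mu) / np.sqrt(var + eps)   # standardized activations
    return gamma * X_hat + beta             # scale and shift: y = gamma * x_hat + beta

rng = np.random.default_rng(0)
X = rng.normal(3.0, 2.0, size=(64, 100))
Y = batchnorm_forward(X, gamma=np.ones(100), beta=np.zeros(100))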

93 / 94 Introduction Feedforward Neural Networks

Batch Normalization Train and test time

• Where: everywhere along the network, before the ReLUs
• at training: standardize each unit's activations over a minibatch
• at test:
  • with one sample, standardize over the population
  • use the mean/variance from the train set
  • or standardize over a batch of test samples

Learning much faster, better generalization

94 / 94 Convolutional Neural Networks (CNN)

Neocognitron [Fukushima(1980)]

LeNet5 [LeCun(1998)]

1 / 35 Idea : Exploiting the structure of the inputs

Ideas

• Features detected by convolutions with local kernels • parameters sharing, sparse weights ⇒ strongly regularized FNN (e.g. detecting an oriented edge is translation invariant)

2 / 35 The CNN of LeCun(1998)

Architecture

• (Conv/NonLinear/Pool) * n • followed by fully connected layers

3 / 35 General architecture of a CNN)

Architecture : Conv/ReLu/Pool

[Figure: a Conv/ReLU/Pool block: K kernels convolved over the input channels (plus a bias) produce K feature maps, followed by ReLU and max pooling]

• Convolution: depth, size (3x3, 5x5), pad, stride
• Max pooling: size, stride (e.g. (2,2))

4 / 35 Recent CNN

Multi-column DNN, Ciresan(2012)
Ensemble of convolutional neural networks trained with dataset augmentation.

0.23 % test misclassification on MNIST. 1.5 million of parameters.

5 / 35 Recent CNN

SuperVision, Krizhevsky(2012)

• top-5 error of 16%, compared to the runner-up with 26% error
• several convolutions were stacked without pooling
• trained on 2 GPUs, for a week
• 60 million parameters, dropout, momentum, L2 penalty, dataset augmentation (translations, reflections, PCA)

6 / 35 Recent CNN SuperVision, Krizhevsky(2012)

• top-5 error of 16%, compared to the runner-up with 26% error
• several convolutions were stacked without pooling
• 60 million parameters, dropout, momentum, L2 penalty, dataset augmentation (translations, reflections, PCA)

7 / 35 Recent CNN

VGG, Simonyan(2014)

• 16 layers: 13 convolutional, 3 fully connected
• 3x3 convolutions, 2x2 pooling
• two stacked 3x3 convolutions have the receptive field of a 5x5 convolution with fewer parameters:
  • K input channels, K output channels, 5x5 convolution ⇒ 25K² parameters
  • K input channels, K output channels, two 3x3 convolutions ⇒ 18K² parameters
• 140 million parameters, dropout, momentum, L2 penalty, learning rate annealing, trained progressively

8 / 35 Recent CNN

Inception, GoogLeNet (Szegedy,2015) Idea : decrease the number of parameters by using 1x1 convolutions for cross-channel interactions.

• dramatic decrease in the number of parameters ≈ 6 Million • multi-level feature extraction

9 / 35 Recent CNN Residual Networks (He,2015) Idea : shortcut connections, no fully connected layers

• L2 penalty, batch normalization (NO dropout), momentum • up to 150 layers, for only 2 Million parameters 10 / 35 Recent CNN Striving for simplicity: The all convolutional Net (Springenberg,2014)

You can get rid of pooling and only use convolutions (All-CNN-C).
• training with SGD, momentum, L2 penalty, dropout
• only 3x3 convolutions with various strides
• the last layers use 3x3 and 1x1 convolutions instead of FC layers

11 / 35 An attempt at synthesizing CNN design principles

12 / 35 Design principles for Convolutional Neural networks

Increase the number of filters through the network Rationale : • first layers extract low level features • higher layers combine the previous features

Number of filters

• LeNet-5 (1998) : 6 5x5 - 16 5x5 • AlexNet(2012) : 96 11x11, 256 5x5, (384 3x3)*2, 256 3x3 • VGG (2014) : 64 - 128 - 256 - 512; all 3x3 • ResNet (2015) : 64 - 128 - 256 - 512; all 3x3 • Inception (2015) : 64 → 1024, 1x1, 3x3, “5x5”

Design principles for Convolutional Neural networks
Effective Receptive Field size for (Conv 3x3 - Conv 3x3 - Max Pool) blocks

Pipeline (all convolutions "same", stride 1; max poolings 2x2, stride 2):
Conv 3x3 - Conv 3x3 - Max 2x2 - Conv 3x3 - Conv 3x3 - Max 2x2 - Conv 3x3

Layer               | Input | Conv 3x3 | Conv 3x3 | Max 2x2 | Conv 3x3 | Conv 3x3 | Max 2x2 | Conv 3x3
Representation size | 28x28 | 28x28    | 28x28    | 14x14   | 14x14    | 14x14    | 7x7     | 7x7
Input RF size       | 1x1   | 3x3      | 5x5      | 6x6     | 10x10    | 14x14    | 15x15   | 23x23

⇒ Stack layers to ensure the RF covers the objects to detect. https://github.com/vdumoulin/conv_arithmetic

15 / 35 Design principles for Convolutional Neural networks
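A sketch of the usual iterative receptive-field computation (the RF grows by (k−1)·jump at each layer, the jump being the product of the strides so far); depending on the convention used for pooling it can differ by a pixel or two from the table above:

def receptive_field(layers):
    # layers: list of (kernel_size, stride); returns the input RF size after each layer
    rf, jump, sizes = 1, 1, []
    for k, s in layers:
        rf = rf + (k - 1) * jump      # each new layer sees (k-1)*jump more input pixels
        jump = jump * s               # effective stride in input coordinates
        sizes.append(rf)
    return sizes

# two (Conv3x3, Conv3x3, MaxPool2x2 stride 2) blocks followed by a Conv3x3
blocks = [(3, 1), (3, 1), (2, 2), (3, 1), (3, 1), (2, 2), (3, 1)]
print(receptive_field(blocks))        # [3, 5, 6, 10, 14, 16, 24] with this convention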

15 / 35 Design principles for Convolutional Neural networks Stacking small kernels (VGG, Inception)

Szegedi(2015) n input filters,αn output filters : • αn5x5 conv : 25αn2 params √ √ √ • αn, 3x3- αn, 3x3 : 9 αn2 + 9 ααn2 params; α = 2 ⇒ 24% saving n input filters,αn output filters : • αn3x3 conv : 9αn2 params √ √ √ • αn, 1x3 - αn, 3x1 : 3 αn2 + 3α αn2 params; α = 2 ⇒ 30% saving 16 / 35 Design principles for Convolutional Neural Networks Depthwise convolutions MobileNets [Howard,2017]

Decrease the number of parameters by decoupling feature extraction in space from feature combination across channels. See also Xception [Chollet(2016)].

17 / 35 Design principles for Convolutional Neural networks

Multiscale feature extraction

18 / 35 Design principles for Convolutional Neural networks

Dimensionality reduction with 1x1 convolutions

[Figure: a 1x1 convolution with m filters maps a height × width × n volume to a height × width × m volume, followed by a ReLU]

Equivalent to a single-layer FNN slid over the pixels. Multiple 1x1 convolutions ↔ MLP. A trainable non-linear transformation of the channels. Network in Network (Lin, 2013).

19 / 35 Design principles for Convolutional Neural networks

Ease the gradient flow with shortcuts

20 / 35 Design principles for Convolutional Neural networks

Do we need max pooling and fully connected layers ? The all convnet [Springenberg, 2015], ResNet [He, 2015]:

Avantage : you can slide the network over larger images to produce a volume of class probabilities.

21 / 35 Usefull tricks

Using Pre-trained models Already trained models can be found : • https://github.com/tensorflow/models : Tensorflow zoo • https://keras.io/applications/ : Keras pre-trained models • Caffe Model Zoo e.g. use a VGG pretrained on ImageNet, 1) replace the softmax, 2) fine-tune some of the deepest layers

22 / 35 Usefull tricks

Dataset augmentation Regularize your network by providing more samples : • With small perturbations (rotation, shift, zoom, ..)

• by altering the RGB pixel values with a PCA [Krizhevsky et al., 2012]
• by learning a generator (see Generative Adversarial Networks)

23 / 35 Usefull tricks

Model averaging

1 Train several models with different initializations, architectures
2 Average the responses of these models
e.g. on CIFAR-100, 11 models with loss ≈ 1.3, acc ≈ 70%; averaged: loss ≈ 0.82, acc ≈ 77%. All challenge winners use model averaging.

Model compression : speeding up inference time

• Binarized Neural Networks [Courbariaux(2016)]
• Knowledge distillation (train a small model with the soft targets of a big model) [Hinton(2015)]

24 / 35 Viewing and understanding deep networks

Demo of Deep Visualization Toolbox http://yosinski.com/deepvis

25 / 35 Some applications of CNN

26 / 35 Image classification Aim : assign a label to an image Some benchmarks

• MNIST (28x28, 10 classes, grayscale, 60000 training, 10000 testing) • CIFAR-10, CIFAR-100 • ImageNet Task 1 (256x256, 1000 classes, 1.2 million training, 50.000 validation, and 100.000 test)

Image from [He(2016)] 27 / 35 Image classification ImageNet

28 / 35 Image classification

ImageNet

Image from [Canziani(2016)]

29 / 35 Object detection

Aim : detect the objects and output bounding boxes Metrics : detected classes, bbox coverage Some benchmarks

• Pascal-VOC • ImageNet Task 3 • Microsoft COCO

30 / 35 Applications of CNN : Object detection

Region based CNN [Girshick,2014]

Using the model AlexNet [Krizhevsky(2012)] for classifying.

31 / 35 Applications of CNN : Object detection

Fast RCNN (Girshick, 2015)

Applications of CNN: Object detection
Faster RCNN (Ren, Girshick, 2015)
- Introduces a Region Proposal Network feeding a Fast-RCNN
- end-to-end training

More recently : YOLO(2015), YOLO9000(2016) 33 / 35 Applications of CNN : Semantic/Instance segmentation Instance segmentation [He,2017]: Mask-RCNN Predicts a binary mask in addition to the classes and boxes

Other approaches : SegNet(2015), FC-DenseNet(2017), UNet(2015), ENet(2016)

34 / 35 Spatial Transformer Network (Jaderberg, 2016) Learns a differentiable transformation Tθ (crop, translation, rotation, scale, and skew). Aim: decrease the number of degrees of freedom of the objects.

35 / 35 Recurrent Neural Networks

Recurrent neural networks (RNN) Handling sequential data (speech, handwriting, language,...) Predicting in context

Input Output

1 / 15 Recurrent Neural Networks

Handling sequences with FNN

Time delay neural networks [Waibel(1989)]

[Figure: a delay line holding x_{t−6}, ..., x_t feeds hidden layers and an output layer]

But which size of the time window ? Must the history size be always the same ? Do we need the data over the whole time span ?

2 / 15 Recurrent Neural Networks

Recurrent Neural Networks (RNN) Architecture

Input Output

• W^in: inputs to hidden
• W^back: outputs to hidden
• W: hidden to hidden
• W^out: hidden to output

Note: it applies the weight matrices repeatedly. 3 / 15 Recurrent Neural Networks

Training a RNN: Forward mode differentation

Real Time Recurrent Learning (RTRL), [Williams(1989)]
• Same idea as forward-mode differentiation for FNN
• Computationally more expensive than reverse-mode differentiation
• Online training

Sutskever (2013). Training recurrent neural networks, PhD thesis.

4 / 15 Recurrent Neural Networks

Training a RNN : Reverse mode differentation

Backpropagation Through Time (BPTT), [Werbos(1990)]

• Unfolds the computational graph in time ⇒ ∼ backprop in a deep FNN
• computationally cheaper than RTRL
• batch training

Sutskever (2013). Training recurrent neural networks, PhD thesis.

5 / 15 Recurrent Neural Networks

Training a RNN is hard

Long-term dependencies

• Exploding/Vanishing gradient • if one output depends on an input a long time ago (long-term dependencies), that information may actually be lost or hard to be sensitive to • ⇒ introduce memory units, specifically designed to hold some information

6 / 15 Recurrent Neural Networks

Long-Short Term Memory Architecture [Hochreiter,Schmidhuber(1997)][Gers(2000)] Specifically designed to store information for long time delays Gating units specify when to integrate, release, forget

Other possibilities: e.g. Gated Recurrent Units (GRUs) [Cho(2014)], Recurrent Highway Networks.
Jozefowicz(2015); "LSTM: A search space odyssey" [Greff(2017)]
7 / 15 Recurrent Neural Networks

Bidirectional LSTM Speech to text Bidirectional LSTM for Speech recognition [Graves(2013)] Both past and future contexts are used for classifying the current observation. When you speak, past and future phonemes influence the way you pronounce the current one.

8 / 15 Recurrent Neural Networks

Applications of RNN: Language modelling Char RNN [Karpathy(2015)] Train A LSTM network to predict the next character. Then provide a seed and let it generate, character by character, a sentence.

http: //karpathy.github.io/2015/05/21/rnn-effectiveness/

9 / 15 Recurrent Neural Networks

Applications of RNN: Text to text
Mapping a sentence in one language to its translation. Encoder/Decoder [Sutskever(2014), Cho(2014)]

https://devblogs.nvidia.com/parallelforall/ introduction-neural-machine-translation-with-gpus/ 10 / 15 Recurrent Neural Networks

Applications of RNN: Text to text The encoder/decoder suffers when the sentences are long. Idea : let the network decides on which part of the sentence to attend when translating. [Bahdanau(2015)] Attention based LSTM translation

See also [Cho et al.(2015)] for image captioning. 11 / 15 Recurrent Neural Networks

Applications of RNN: Multi language translation Google’s Multilingual Neural Machine Translation System

Model trained with English↔Portuguese and English↔Spanish generalizes to English↔Spanish.

[Wu(2016); Johnson(2016)] 12 / 15 Recurrent Neural Networks

Applications of RNN Text to handwritten text Handwriting [Graves(2013)]

Data: a sequence of characters and a sequence of pen positions (x, y, up/down).
1 learn a handwritten text generator: (δx_{t+1}, δy_{t+1}, ud_t) = f({δx_k, δy_k, ud_k}_{k ∈ [0, t−1]})
2 condition the network by inputting a sequence of characters, and learn to attend to the right characters
3 prime the network with a given character/pen sequence to mimic a style
The network outputs the parameters of a mixture of Gaussians.
http://www.cs.toronto.edu/~graves/handwriting.html

13 / 15 Recurrent Neural Networks

Combining CNN and RNN

Automatic captioning [Karpathy(2015)]

http://cs.stanford.edu/people/karpathy/deepimagesent/

14 / 15 Recurrent Neural Networks

We did not speak about

• Encoders/decoders, Deconvolutional networks
• Models for Natural Language Processing (e.g. word2vec, GloVe, recursive networks)
• Probabilistic/Energy-based models: Hopfield networks, Restricted Boltzmann machines, deep belief networks
• More generally: generative models (e.g. Generative Adversarial Networks [Goodfellow, 2014])
• Neural Turing Machines, Attention-based models https://distill.pub/2016/augmented-rnns/

15 / 15