Neural Networks

School of Computer Science 10-601B Introduction to Machine Learning Neural Networks Readings: Matt Gormley Bishop Ch. 5 Lecture 15 Murphy Ch. 16.5, Ch. 28 October 19, 2016 Mitchell Ch. 4 1 Reminders 2 Outline • Logistic Regression (Recap) • Neural Networks • Backpropagation 3 RECALL: LOGISTIC REGRESSION 4 Using gradient ascent for linear Recall… classifiers Key idea behind today’s lecture: 1. Define a linear classifier (logistic regression) 2. Define an objective function (likelihood) 3. Optimize it with gradient descent to learn parameters 4. Predict the class with highest probability under the model 5 Using gradient ascent for linear Recall… classifiers This decision function isn’t Use a differentiable function differentiable: instead: 1 T pθ(y =1t)= h(t)=sign(θ t) | 1+2tT( θT t) − sign(x) 1 logistic(u) ≡ −u 1+ e 6 Using gradient ascent for linear Recall… classifiers This decision function isn’t Use a differentiable function differentiable: instead: 1 T pθ(y =1t)= h(t)=sign(θ t) | 1+2tT( θT t) − sign(x) 1 logistic(u) ≡ −u 1+ e 7 Recall… Logistic Regression Data: Inputs are continuous vectors of length K. Outputs are discrete. (i) (i) N K = t ,y where t R and y 0, 1 D { }i=1 ∈ ∈ { } Model: Logistic function applied to dot product of parameters with input vector. 1 pθ(y =1t)= | 1+2tT( θT t) − Learning: finds the parameters that minimize some objective function. θ∗ = argmin J(θ) θ Prediction: Output is the most probable class. yˆ = `;Kt pθ(y t) y 0,1 | ∈{ } 8 NEURAL NETWORKS 9 Learning highly non-linear functions f: X à Y l f might be non-linear function l X (vector of) continuous and/or discrete vars l Y (vector of) continuous and/or discrete vars The XOR gate Speech recognition © Eric Xing @ CMU, 2006-2011 10 Perceptron and Neural Nets l From biological neuron to artificial neuron (perceptron) Synapse Inputs Synapse Dendrites x Axon 1 Linear Hard Axon Combiner Limiter w1 Output ∑ Y Soma Soma w2 Dendrites θ x Synapse 2 Threshold l Activation function n ⎧+1, if X ≥ ω0 X = ∑ xiwi Y = ⎨ i=1 ⎩−1, if X < ω0 l Artificial neuron networks l supervised learning l gradient descent i g n a l s g n a l i i g n a l s g n a l i I n p u tI S O u tO p u t S Middle Layer Input Layer Output Layer © Eric Xing @ CMU, 2006-2011 11 Connectionist Models l Consider humans: l Neuron switching time ~ 0.001 second l Number of neurons ~ 1010 l Connections per neuron ~ 104-5 l Scene recognition time ~ 0.1 second l 100 inference steps doesn't seem like enough à much parallel computation l Properties of artificial neural nets (ANN) l Many neuron-like threshold switching units l Many weighted interconnections among units l Highly parallel, distributed processes © Eric Xing @ CMU, 2006-2011 12 Why is everyone talking Motivation about Deep Learning? • Because a lot of money is invested in it… – DeepMind: Acquired by Google for $400 million – DNNResearch: Three person startup (including Geoff Hinton) acquired by Google for unknown price tag – Enlitic, Ersatz, MetaMind, Nervana, Skylab: Deep Learning startups commanding millions of VC dollars • Because it made the front page of the New York Times 13 Why is everyone talking Motivation about Deep Learning? 1960s Deep learning: – Has won numerous pattern recognition 1980s competitions – Does so with minimal feature 1990s engineering This wasn’t always the case! 2006 Since 1980s: Form of models hasn’t changed much, but lots of new tricks… – More hidden units 2016 – Better (online) optimization – New nonlinear functions (ReLUs) – Faster computers (CPUs and GPUs) 14 A Recipe for Background Machine Learning 1. Given training data: Face Face Not a face 2. Choose each of these: – Decision function Examples: Linear regression, Logistic regression, Neural Network – Loss function Examples: Mean-squared error, Cross Entropy 15 A Recipe for Background Machine Learning 1. Given training data: 3. Define goal: 2. Choose each of these: – Decision function 4. Train with SGD: (take small steps opposite the gradient) – Loss function 16 A Recipe for Background Gradients Machine Learning 1. Given training data: Backpropagation3. Define goal: can compute this gradient! And it’s a special case of a more general algorithm called reverse- 2. Choose each of these: mode automatic differentiation that – Decision function can compute the gradient of any 4. Train with SGD: differentiable function efficiently! (take small steps opposite the gradient) – Loss function 17 A Recipe for Background Goals for Today’s Lecture Machine Learning 1. Given training data: 1. Explore a new class of decision functions 3. Define goal: (Neural Networks) 2. Consider variants of this recipe for training 2. Choose each of these: – Decision function 4. Train with SGD: (take small steps opposite the gradient) – Loss function 18 Decision Functions Linear Regression T y = hθ(x)=σ(θ x) Output where σ(a)=a θ θ θ1 θ2 3 M Input … 19 Decision Functions Logistic Regression T y = hθ(x)=σ(θ x) 1 where σ(a)= Output 1+2tT( a) − θ θ θ1 θ2 3 M Input … 20 Decision Functions Logistic Regression T y = hθ(x)=σ(θ x) 1 where σ(a)= Output 1+2tT( a) − Face Face Not a face θ θ θ1 θ2 3 M Input … 21 Decision Functions Logistic Regression T y = hθ(x)=σ(θ x) 1 where σ(a)= Output 1+2tT( a) − 1 1 0 y θ θ x2 θ1 θ2 3 M x1 Input … 22 Decision Functions Logistic Regression T y = hθ(x)=σ(θ x) 1 where σ(a)= Output 1+2tT( a) − θ θ θ1 θ2 3 M Input … 23 Neural Network Model Inputs Output Age 34 .6 .4 .2 Σ 0.6 .1 .5 Gender 2 .2 Σ .3 Σ .8 .7 “Probability of beingAlive” Stage 4 .2 Dependent Independent Weights HiddenL Weights variable variables ayer Prediction © Eric Xing @ CMU, 2006-2011 24 “Combined logistic models” Inputs Output Age 34 .6 .5 0.6 .1 Gender 2 Σ .7 .8 “Probability of beingAlive” Stage 4 Dependent Independent Weights HiddenL Weights variable variables ayer Prediction © Eric Xing @ CMU, 2006-2011 25 Inputs Output Age 34 .5 .2 0.6 Gender 2 .3 Σ “Probability of .8 beingAlive” Stage 4 .2 Dependent Independent Weights HiddenL Weights variable variables ayer Prediction © Eric Xing @ CMU, 2006-2011 26 Inputs Output Age 34 .6 .5 .2 0.6 .1 Gender 1 .3 Σ “Probability of .7 .8 beingAlive” Stage 4 .2 Dependent Independent Weights HiddenL Weights variable variables ayer Prediction © Eric Xing @ CMU, 2006-2011 27 Not really, no target for hidden units... Age 34 .6 .4 .2 Σ 0.6 .1 .5 Gender 2 .2 Σ .3 Σ .8 .7 “Probability of beingAlive” Stage 4 .2 Dependent Independent Weights HiddenL Weights variable variables ayer Prediction © Eric Xing @ CMU, 2006-2011 28 Jargon Pseudo-Correspondence l Independent variable = input variable l Dependent variable = output variable l Coefficients = “weights” l Estimates = “targets” Logistic Regression Model (the sigmoid unit) Inputs Output Age 34 5 0.6 Gende 1 4 Σ “Probability of r beingAlive” Stage 4 8 Independent variables Coefficients Dependent variable x1, x2, x3 a, b, c p Prediction © Eric Xing @ CMU, 2006-2011 29 Decision Functions Neural Network Output Hidden Layer … Input … 30 Decision Functions Neural Network (F) Loss J = 1 (y y(d))2 2 − (E) Output (sigmoid) 1 y = 1+2tT( b) Output − (D) Output (linear) D … b = βj zj Hidden Layer j=0 (C) Hidden (sigmoid) 1 zj = 1+2tT( a ) , j − j ∀ … Input (B) Hidden (linear) a = M α x , j j i=0 ji i ∀ (A) Input Given x , i 31 i ∀ Building a Neural Net Output Features … 32 Building a Neural Net Output Hidden Layer … D = M 1 1 1 Input … 33 Building a Neural Net Output Hidden Layer … D = M Input … 34 Building a Neural Net Output Hidden Layer … D = M Input … 35 Building a Neural Net Output Hidden Layer … D < M Input … 36 Decision Boundary • 0 hidden layers: linear classifier – Hyperplanes y x1 x2 Example from to Eric Postma via Jason Eisner 37 Decision Boundary • 1 hidden layer – Boundary of convex region (open or closed) y x1 x2 Example from to Eric Postma via Jason Eisner 38 Decision Boundary y • 2 hidden layers – Combinations of convex regions x1 x2 Example from to Eric Postma via Jason Eisner 39 Decision Functions Multi-Class Output Output … Hidden Layer … Input … 40 Decision Functions Deeper Networks Next lecture: Output … Hidden Layer 1 … Input 41 Decision Functions Deeper Networks Next lecture: Output … Hidden Layer 2 … Hidden Layer 1 … Input 42 Decision Functions Deeper Networks Next lecture: Output Making the … neural Hidden Layer 3 networks … deeper Hidden Layer 2 … Hidden Layer 1 … Input 43 Decision Different Levels of Functions Abstraction • We don’t know the “right” levels of abstraction • So let the model figure it out! 44 Example from Honglak Lee (NIPS 2010) Decision Different Levels of Functions Abstraction Face Recognition: – Deep Network can build up increasingly higher levels of abstraction – Lines, parts, regions 45 Example from Honglak Lee (NIPS 2010) Decision Different Levels of Functions Abstraction Output … Hidden Layer 3 … Hidden Layer 2 … Hidden Layer 1 … Input 46 Example from Honglak Lee (NIPS 2010) ARCHITECTURES 47 Neural Network Architectures Even for a basic Neural Network, there are many design decisions to make: 1. # of hidden layers (depth) 2. # of units per hidden layer (width) 3. Type of activation function (nonlinearity) 4. Form of objective function 48 Activation Functions (F) Loss Neural Network with sigmoid 1 2 J = (y y∗) activation functions 2 − (E) Output (sigmoid) 1 y = 1+2tT( b) Output − (D) Output (linear) D … b = j=0 βj zj Hidden Layer (C) Hidden (sigmoid) 1 zj = 1+2tT( a ) , j − j ∀ … Input (B) Hidden (linear) a = M α x , j j i=0 ji i ∀ (A) Input Given x , i 49 i ∀ Activation Functions (F) Loss Neural Network with arbitrary 1 2 J = (y y∗) nonlinear activation functions 2 − (E) Output (nonlinear) y = σ(b) Output (D) Output (linear) D … b = j=0 βj zj Hidden Layer (C) Hidden (nonlinear) z = σ(a ), j j j ∀ … Input (B) Hidden (linear) a = M α x , j j i=0 ji i ∀ (A) Input Given x , i 50 i ∀ Activation Functions Sigmoid / Logistic Function So far, we’ve 1 assumed that the logistic(u) ≡ 1+ e−u activation function (nonlinearity) is always the sigmoid function… 51 Activation Functions • A new change: modifying the nonlinearity – The logistic is not widely used in modern ANNs Alternate 1: tanh Like logistic function but shifted to range [-1, +1] Slide from William Cohen AI Stats 2010 depth 4? sigmoid vs.

Neural Networks

Training Autoencoders by Alternating Minimization

6.5 Applications of Exponential and Logarithmic Functions 469

Deep Learning Based Computer Generated Face Identification Using

Matrix Calculus

Chapter 10. Logistic Models

Neural Network Architectures and Activation Functions: a Gaussian Process Approach

Differentiability in Several Variables: Summary of Basic Concepts

Lipschitz Recurrent Neural Networks

Jointly Clustering with K-Means and Learning Representations

Lecture 5: Logistic Regression 1 MLE Derivation

An Introduction to Logistic Regression: from Basic Concepts to Interpretation with Particular Attention to Nursing Domain

Differentiability of the Convolution