10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University
Neural Networks and Backpropagation
Matt Gormley
Lecture 19
March 29, 2017

Neural Net Readings: Murphy --; Bishop 5; HTF 11; Mitchell 4
Reminders
• Homework 6: Unsupervised Learning
  – Release: Wed, Mar. 22
  – Due: Mon, Apr. 03 at 11:59pm
• Homework 5 (Part II): Peer Review
  – Release: Wed, Mar. 29
  – Due: Wed, Apr. 05 at 11:59pm
  – Expectation: You should spend at most 1 hour on your reviews
• Peer Tutoring
2 Neural Networks Outline
• Logistic Regression (Recap) – Data, Model, Learning, Prediction • Neural Networks – A Recipe for Machine Learning – Visual Notation for Neural Networks – Example: Logistic Regression Output Surface – 2-Layer Neural Network – 3-Layer Neural Network • Neural Net Architectures – Objective Functions – Activation Functions • Backpropagation – Basic Chain Rule (of calculus) – Chain Rule for Arbitrary Computation Graph – Backpropagation Algorithm – Module-based Automatic Differentiation (Autodiff)
3 RECALL: LOGISTIC REGRESSION
Using gradient ascent for linear classifiers
Key idea behind today's lecture:
1. Define a linear classifier (logistic regression)
2. Define an objective function (likelihood)
3. Optimize it with gradient ascent to learn parameters
4. Predict the class with highest probability under the model
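The four steps above can be sketched end to end. This is a minimal illustration only; the toy data, learning rate, and epoch count are arbitrary choices, not from the lecture:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_logistic_regression(X, y, lr=0.5, epochs=2000):
    """Steps 1-3: maximize the log-likelihood with (batch) gradient ascent."""
    N, K = X.shape
    theta = np.zeros(K)
    for _ in range(epochs):
        p = sigmoid(X @ theta)       # p_theta(y=1 | x) for every example
        grad = X.T @ (y - p) / N     # gradient of the mean log-likelihood
        theta += lr * grad           # ascent: step in the gradient direction
    return theta

def predict(theta, X):
    """Step 4: predict the class with highest probability."""
    return (sigmoid(X @ theta) >= 0.5).astype(int)

# Toy separable data: label 1 when x1 + x2 > 1 (a bias feature is prepended)
X = np.array([[1., 0., 0.], [1., 0., 1.], [1., 1., 0.],
              [1., 1., 1.], [1., 2., 2.]])
y = np.array([0, 0, 0, 1, 1])
theta = train_logistic_regression(X, y)
print(predict(theta, X))
```

On separable data like this, gradient ascent drives the parameters toward a separating boundary, so the learned model recovers the training labels.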
Using gradient ascent for linear classifiers
This decision function isn't differentiable:
    h(x) = sign(θᵀx)
Use a differentiable function instead:
    p_θ(y = 1 | x) = 1 / (1 + exp(−θᵀx))
where
    logistic(u) ≡ 1 / (1 + e^(−u))

Logistic Regression
Data: Inputs are continuous vectors of length K. Outputs are discrete.
    D = {(x^(i), y^(i))}_{i=1}^N where x ∈ ℝ^K and y ∈ {0, 1}
Model: Logistic function applied to the dot product of parameters with the input vector.
    p_θ(y = 1 | x) = 1 / (1 + exp(−θᵀx))
Learning: Find the parameters that minimize some objective function.
    θ* = argmin_θ J(θ)
Prediction: Output the most probable class.
    ŷ = argmax_{y ∈ {0,1}} p_θ(y | x)

NEURAL NETWORKS
Background: A Recipe for Machine Learning
1. Given training data: [images labeled Face, Face, Not a face]
2. Choose each of these: – Decision function Examples: Linear regression, Logistic regression, Neural Network
– Loss function Examples: Mean-squared error, Cross Entropy
Background: A Recipe for Machine Learning (continued)
1. Given training data
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal: find the parameters that minimize the total loss on the training data
4. Train with SGD: (take small steps opposite the gradient)
Background: A Recipe for Machine Learning (Gradients)
Backpropagation can compute this gradient! And it's a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!
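That claim can be illustrated with a tiny scalar reverse-mode autodiff sketch. Everything here (the class, the two supported operations) is a hypothetical minimal design, not code from the course:

```python
class Node:
    """A scalar in a computation graph, remembering how it was computed."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # Nodes this one was computed from
        self.local_grads = local_grads  # d(self)/d(parent), one per parent
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other),
                    (other.value, self.value))

def backward(output):
    """Reverse-mode sweep: visit nodes in reverse topological order and
    accumulate d(output)/d(node) into node.grad via the chain rule."""
    order, seen = [], set()
    def toposort(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent in node.parents:
                toposort(parent)
            order.append(node)
    toposort(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += node.grad * local

x, y = Node(3.0), Node(4.0)
z = x * y + x          # dz/dx = y + 1 = 5, dz/dy = x = 3
backward(z)
print(x.grad, y.grad)  # 5.0 3.0
```

One forward pass builds the graph; one backward pass yields the gradient with respect to every input, which is exactly what backpropagation does for a neural network's loss.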
Goals for Today's Lecture
1. Explore a new class of decision functions (Neural Networks)
2. Consider variants of this recipe for training
Decision Functions: Linear Regression
y = h_θ(x) = σ(θᵀx) where σ(a) = a

Output
[Diagram: output unit connected to inputs by weights θ1 … θM]
Decision Functions: Logistic Regression
y = h_θ(x) = σ(θᵀx) where σ(a) = 1 / (1 + exp(−a))

Output
Decision Functions: Logistic Regression
y = h_θ(x) = σ(θᵀx) where σ(a) = 1 / (1 + exp(−a))
[Images: example faces labeled Face, Face, Not a face]

Output
Decision Functions: Logistic Regression (In-Class Example)
y = h_θ(x) = σ(θᵀx) where σ(a) = 1 / (1 + exp(−a))
[Plot: output y over inputs x1 and x2, with example labels 1, 1, 0]
Decision Functions: Perceptron
y = h_θ(x) = σ(θᵀx) where σ(a) = sign(a)
[Diagram: output unit connected to inputs by weights θ1 … θM]
From Biological to Artificial
The motivation for Artificial Neural Networks comes from biology…

Biological “Model”
• Neuron: an excitable cell
• Synapse: connection between neurons
• A neuron sends an electrochemical pulse along its synapses when a sufficient voltage change occurs
• Biological Neural Network: collection of neurons along some pathway through the brain

Artificial Model
• Neuron: node in a directed acyclic graph (DAG)
• Weight: multiplier on each edge
• Activation Function: nonlinear thresholding function, which allows a neuron to “fire” when the input value is sufficiently high
• Artificial Neural Network: collection of neurons into a DAG, which define some differentiable function

Biological “Computation”
• Neuron switching time: ~0.001 sec
• Number of neurons: ~10^10
• Connections per neuron: ~10^4–10^5
• Scene recognition time: ~0.1 sec

Artificial Computation
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processes
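The artificial model above is easy to state in code. A minimal sketch of one neuron with a sigmoid activation; the weights and inputs are arbitrary illustrative values:

```python
import math

def neuron(x, weights, bias):
    """One artificial neuron: a weighted sum of inputs passed through a
    sigmoid activation, so the unit 'fires' (output near 1) only when the
    weighted input is sufficiently high."""
    a = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-a))

print(neuron([1.0, 0.0], [4.0, 4.0], -2.0))  # strong input: output near 1
print(neuron([0.0, 0.0], [4.0, 4.0], -2.0))  # weak input: output near 0
```

A network is then just many such units wired into a DAG, with each unit's output feeding the next layer's weighted sums.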
Slide adapted from Eric Xing

Decision Functions: Logistic Regression
y = h_θ(x) = σ(θᵀx) where σ(a) = 1 / (1 + exp(−a))
[Diagram: output unit connected to inputs by weights θ1 … θM]
20 Neural Networks
Whiteboard – Example: Neural Network w/1 Hidden Layer – Example: Neural Network w/2 Hidden Layers – Example: Feed Forward Neural Network
21 Neural Network Model
[Diagram: inputs Age = 34, Gender = 2, Stage = 4 feed two sigmoid hidden units (Σ), which feed one sigmoid output unit producing 0.6, the “Probability of being Alive”. Columns: independent variables, weights, hidden layer, weights, dependent variable (prediction).]

“Combined logistic models”: each hidden unit computes a logistic function of the inputs, and the output unit computes a logistic function of the hidden units. Not really, though: there is no target for the hidden units...

© Eric Xing @ CMU, 2006-2011

Decision Functions: Neural Network
Output
Hidden Layer …
Input …
Decision Functions: Neural Network
(A) Input: given x_i, i = 1, …, M
(B) Hidden (linear): a_j = Σ_{i=0}^{M} θ_{ji} x_i, ∀j
(C) Hidden (sigmoid): z_j = 1 / (1 + exp(−a_j)), ∀j
(D) Output (linear): b = Σ_{j=0}^{D} β_j z_j
(E) Output (sigmoid): y = 1 / (1 + exp(−b))
(F) Loss: J = ½ (y − y^(d))²

NEURAL NETWORKS: REPRESENTATIONAL POWER
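The quantities (A)–(F) of the 2-layer network map directly onto code. A minimal sketch of the forward pass; the weight values are arbitrary illustrative choices, and a constant x_0 = 1 stands in for the hidden bias term:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, theta, beta, y_target):
    """Forward pass computing quantities (A)-(F) for the 2-layer net."""
    a = theta @ x                  # (B) hidden linear: a_j = sum_i theta_ji x_i
    z = sigmoid(a)                 # (C) hidden sigmoid
    b = beta @ z                   # (D) output linear: b = sum_j beta_j z_j
    y = sigmoid(b)                 # (E) output sigmoid
    J = 0.5 * (y - y_target) ** 2  # (F) squared-error loss
    return y, J

x = np.array([1.0, 0.5, -0.5])          # (A) input, with x_0 = 1 as bias
theta = np.array([[0.1, 0.2, 0.3],      # illustrative hidden weights
                  [-0.3, 0.1, 0.2]])
beta = np.array([0.5, -0.5])            # illustrative output weights
y, J = forward(x, theta, beta, y_target=1.0)
print(y, J)
```

Each intermediate quantity is kept by name so that backpropagation can later reuse a, z, b, and y when applying the chain rule.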
Building a Neural Net
Q: How many hidden units, D, should we use?
[Build slides: diagrams of Input → Hidden Layer → Output with varying numbers of hidden units.]
• D = M (one build shows each hidden unit connected to one input with weight 1)
• D < M: What method(s) is this setting similar to?
• D > M: What method(s) is this setting similar to?
Decision Functions: Deeper Networks
Q: How many layers should we use?
[Build slides: diagrams stacking Hidden Layer 1, Hidden Layer 2, Hidden Layer 3 between Input and Output.]
Q: How many layers should we use?
• Theoretical answer:
  – A neural network with 1 hidden layer is a universal function approximator
  – Cybenko (1989): For any continuous function g(x), there exists a 1-hidden-layer neural net h_θ(x) s.t. |h_θ(x) − g(x)| < ϵ for all x, assuming sigmoid activation functions
• Empirical answer:
  – Before 2006: “Deep networks (e.g. 3 or more hidden layers) are too hard to train”
  – After 2006: “Deep networks are easier to train than shallow networks (e.g. 2 or fewer layers) for many problems”
Big caveat: You need to know and use the right tricks.

Decision Boundary
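Cybenko's result can be made concrete by hand: two steep sigmoid hidden units build a “bump”, and sums of such bumps approximate any continuous function. A small sketch; the target function (an indicator) and the steepness k are illustrative choices:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def bump(x, left, right, k=50.0):
    """1-hidden-layer building block: approximates the indicator of
    [left, right] using two steep sigmoid hidden units whose outputs
    are combined with weights +1 and -1."""
    return sigmoid(k * (x - left)) - sigmoid(k * (x - right))

# Approximate g(x) = indicator of [0, 1]; the error shrinks as k grows
print(bump(0.5, 0.0, 1.0))   # inside the interval: close to 1
print(bump(2.0, 0.0, 1.0))   # outside the interval: close to 0
```

Summing many such bumps with different heights and intervals gives a staircase approximation to any continuous g, which is the intuition behind the universal approximation theorem.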
• 0 hidden layers: linear classifier – Hyperplanes
[Plot: linear decision boundary over inputs x1 and x2]
Example from Eric Postma via Jason Eisner
• 1 hidden layer: boundary of a convex region (open or closed)
[Plot: convex decision region over inputs x1 and x2]
Example from Eric Postma via Jason Eisner
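A concrete instance of the 1-hidden-layer case is XOR with hand-picked weights: the decision region for class 1 is the (convex) band between two parallel lines. All weights here are illustrative choices, not from the slides:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def xor_net(x1, x2, k=10.0):
    """1-hidden-layer net for XOR with hand-picked weights.
    h1 fires when x1 + x2 > 0.5 and h2 fires when x1 + x2 > 1.5,
    so the output fires only in the convex band between the two lines."""
    h1 = sigmoid(k * (x1 + x2 - 0.5))
    h2 = sigmoid(k * (x1 + x2 - 1.5))
    return sigmoid(k * (h1 - h2 - 0.5))

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(xor_net(x1, x2)))
```

No single hyperplane (0 hidden layers) separates these four points, but one hidden layer already carves out the band that does.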
• 2 hidden layers: combinations of convex regions
[Plot: non-convex decision region over inputs x1 and x2]
Example from Eric Postma via Jason Eisner

Different Levels of Abstraction
• We don’t know the “right” levels of abstraction • So let the model figure it out!
Example from Honglak Lee (NIPS 2010)
Face Recognition: – Deep Network can build up increasingly higher levels of abstraction – Lines, parts, regions
Example from Honglak Lee (NIPS 2010)
[Diagram: Input → Hidden Layer 1 → Hidden Layer 2 → Hidden Layer 3 → Output, with increasingly abstract features at each layer]

Example from Honglak Lee (NIPS 2010)