
10-601 Introduction to Machine Learning
Machine Learning Department, School of Computer Science, Carnegie Mellon University

Neural Networks and Backpropagation
Matt Gormley, Lecture 19, March 29, 2017
Neural Net Readings: Murphy --; Bishop 5; HTF 11; Mitchell 4

Reminders
• Homework 6: Unsupervised Learning
  – Release: Wed, Mar. 22
  – Due: Mon, Apr. 03 at 11:59pm
• Homework 5 (Part II): Peer Review
  – Release: Wed, Mar. 29
  – Due: Wed, Apr. 05 at 11:59pm
  – Expectation: You should spend at most 1 hour on your reviews
• Peer Tutoring

Neural Networks Outline
• Logistic Regression (Recap)
  – Data, Model, Learning, Prediction
• Neural Networks
  – A Recipe for Machine Learning
  – Visual Notation for Neural Networks
  – Example: Logistic Regression Output Surface
  – 2-Layer Neural Network
  – 3-Layer Neural Network
• Neural Net Architectures
  – Objective Functions
  – Activation Functions
• Backpropagation
  – Basic Chain Rule (of calculus)
  – Chain Rule for Arbitrary Computation Graph
  – Backpropagation Algorithm
  – Module-based Automatic Differentiation (Autodiff)

RECALL: LOGISTIC REGRESSION

Using gradient ascent for linear classifiers
Key idea behind today's lecture:
1. Define a linear classifier (logistic regression)
2. Define an objective function (likelihood)
3. Optimize it with gradient descent to learn parameters
4. Predict the class with highest probability under the model

This decision function isn't differentiable: h(t) = sign(θᵀt)
Use a differentiable function instead: p_θ(y = 1 | t) = 1 / (1 + exp(−θᵀt)), where logistic(u) ≡ 1 / (1 + e^(−u))

Logistic Regression
• Data: Inputs are continuous vectors of length K. Outputs are discrete.
  D = {(t⁽ⁱ⁾, y⁽ⁱ⁾)}_{i=1}^N where t ∈ ℝᴷ and y ∈ {0, 1}
• Model: Logistic function applied to the dot product of the parameters with the input vector.
  p_θ(y = 1 | t) = 1 / (1 + exp(−θᵀt))
• Learning: Finds the parameters that minimize some objective function.
  θ* = argmin_θ J(θ)
• Prediction: Output is the most probable class.
  ŷ = argmax_{y ∈ {0,1}} p_θ(y | t)

NEURAL NETWORKS

A Recipe for Machine Learning (Background)
1. Given training data (e.g. images labeled "Face", "Face", "Not a face")
2. Choose each of these:
  – Decision function (examples: linear regression, logistic regression, neural network)
  – Loss function (examples: mean-squared error, cross entropy)
3. Define goal: find the parameters that minimize the total loss on the training data
4. Train with SGD: take small steps opposite the gradient

Gradients: Backpropagation can compute this gradient! And it's a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!

Goals for Today's Lecture
1. Explore a new class of decision functions (Neural Networks)
2. Consider variants of this recipe for training
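As a concrete instance of this recipe applied to the logistic regression recap above, the following is a minimal NumPy sketch (not from the slides) that trains the model by gradient descent on the negative log-likelihood and predicts the most probable class. The function names, learning rate, step count, and tiny dataset are made up for illustration, and the bias term is omitted for brevity (it could be added by appending a constant-1 feature to each input).

```python
import numpy as np

def sigmoid(u):
    # logistic(u) = 1 / (1 + exp(-u))
    return 1.0 / (1.0 + np.exp(-u))

def train_logistic_regression(T, y, lr=0.1, num_steps=1000):
    """Gradient descent on J(theta) = -sum_i log p_theta(y^(i) | t^(i)).

    T: (N, K) matrix whose rows are the inputs t^(i); y: (N,) labels in {0, 1}.
    The gradient of the (averaged) negative log-likelihood is T^T (p - y) / N.
    """
    N, K = T.shape
    theta = np.zeros(K)
    for _ in range(num_steps):
        p = sigmoid(T @ theta)        # p_theta(y = 1 | t^(i)) for every example
        grad = T.T @ (p - y) / N      # average gradient of the objective
        theta -= lr * grad            # take a small step opposite the gradient
    return theta

def predict(theta, t):
    """Most probable class: y_hat = argmax_y p_theta(y | t)."""
    return int(sigmoid(t @ theta) >= 0.5)

# Tiny made-up dataset with two features (no bias term in this sketch).
T = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, 0, 0])
theta = train_logistic_regression(T, y)
print(predict(theta, np.array([1.5, 1.5])))  # expected: 1
```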
Decision Functions: Linear Regression
y = h_θ(x) = σ(θᵀx), where σ(a) = a
[Diagram: output y connected to inputs x_1 … x_M by weights θ_1, θ_2, θ_3, …, θ_M]

Decision Functions: Logistic Regression
y = h_θ(x) = σ(θᵀx), where σ(a) = 1 / (1 + exp(−a))
[Same diagram; one slide shows the training images labeled "Face", "Face", "Not a face" as inputs]

In-Class Example
[Figure: the logistic regression output y plotted over two inputs x_1 and x_2, with values between 0 and 1]

Decision Functions: Perceptron
y = h_θ(x) = σ(θᵀx), where σ(a) = sign(a)

From Biological to Artificial
The motivation for Artificial Neural Networks comes from biology…
Biological "Model":
• Neuron: an excitable cell
• Synapse: connection between neurons
• A neuron sends an electrochemical pulse along its synapses when a sufficient voltage change occurs
• Biological Neural Network: collection of neurons along some pathway through the brain
Artificial Model:
• Neuron: node in a directed acyclic graph (DAG)
• Weight: multiplier on each edge
• Activation Function: nonlinear thresholding function, which allows a neuron to "fire" when the input value is sufficiently high
• Artificial Neural Network: collection of neurons into a DAG, which define some differentiable function
Biological "Computation":
• Neuron switching time: ~0.001 sec
• Number of neurons: ~10¹⁰
• Connections per neuron: ~10⁴ to 10⁵
• Scene recognition time: ~0.1 sec
Artificial Computation:
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processes
(Slide adapted from Eric Xing)

Neural Networks (Whiteboard)
– Example: Neural Network w/ 1 Hidden Layer
– Example: Neural Network w/ 2 Hidden Layers
– Example: Feed Forward Neural Network

Neural Network Model / "Combined logistic models"
[Figure, repeated over several slides with different subsets of the weights highlighted: independent variables Age = 34, Gender = 2, Stage = 4 feed through a first set of weights into a hidden layer of sigmoid units, which feed through a second set of weights into a single sigmoid output of 0.6, the predicted "Probability of being alive" (the dependent variable). © Eric Xing @ CMU, 2006-2011]
Not really, no target for hidden units…

Decision Functions: Neural Network
[Diagram: inputs x_1 … x_M, a hidden layer z_1 … z_D, and a single output y]
The quantities in the diagram are computed as follows (listed here from input to loss):
(A) Input: given x_i, ∀i
(B) Hidden (linear): a_j = Σ_{i=0}^{M} α_{ji} x_i, ∀j
(C) Hidden (sigmoid): z_j = 1 / (1 + exp(−a_j)), ∀j
(D) Output (linear): b = Σ_{j=0}^{D} β_j z_j
(E) Output (sigmoid): y = 1 / (1 + exp(−b))
(F) Loss: J = ½ (y − y⁽ᵈ⁾)²
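To tie the forward computation (A)-(F) together, here is a minimal NumPy sketch of one forward pass through this 2-layer network, ending with the squared-error loss. The weight arrays, the example sizes (M = 3 inputs, D = 4 hidden units), and the input values are made up for illustration, and the bias terms (the i = 0 and j = 0 entries of the sums) are omitted for brevity.

```python
import numpy as np

def sigmoid(u):
    # logistic function: 1 / (1 + exp(-u))
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, alpha, beta, y_true):
    """Forward pass following (A)-(F), without bias terms.

    x:     input vector, shape (M,)            -- (A)
    alpha: hidden-layer weights, shape (D, M)
    beta:  output-layer weights, shape (D,)
    """
    a = alpha @ x                 # (B) hidden (linear):  a_j = sum_i alpha_ji * x_i
    z = sigmoid(a)                # (C) hidden (sigmoid): z_j = 1 / (1 + exp(-a_j))
    b = beta @ z                  # (D) output (linear):  b = sum_j beta_j * z_j
    y = sigmoid(b)                # (E) output (sigmoid): y = 1 / (1 + exp(-b))
    J = 0.5 * (y - y_true) ** 2   # (F) loss: squared error against the true output
    return y, J

# Made-up example: M = 3 inputs (loosely, Age, Gender, Stage from the figure),
# D = 4 hidden units, random weights, and a true label of 1.
rng = np.random.default_rng(0)
x = np.array([34.0, 2.0, 4.0])
alpha = rng.normal(size=(4, 3))
beta = rng.normal(size=4)
y, J = forward(x, alpha, beta, y_true=1.0)
print(y, J)
```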
NEURAL NETWORKS: REPRESENTATIONAL POWER

Building a Neural Net
Q: How many hidden units, D, should we use?
[Diagrams over several slides: the output connected directly to the features (no hidden layer); a hidden layer with D = M and all weights fixed to 1; a hidden layer with D = M; D < M; D > M]
• D < M: What method(s) is this setting similar to?
• D > M: What method(s) is this setting similar to?

Deeper Networks
Q: How many layers should we use?
[Diagrams: networks with 1, 2, and 3 hidden layers between the input and the output]
• Theoretical answer:
  – A neural network with 1 hidden layer is a universal function approximator
  – Cybenko (1989): For any continuous function g(x), there exists a 1-hidden-layer neural net h_θ(x) s.t. |h_θ(x) − g(x)| < ϵ for all x, assuming sigmoid activation functions
• Empirical answer:
  – Before 2006: "Deep networks (e.g. 3 or more hidden layers) are too hard to train"
  – After 2006: "Deep networks are easier to train than shallow networks (e.g. 2 or fewer layers) for many problems"
  – Big caveat: You need to know and use the right tricks.

Decision Boundary
• 0 hidden layers: linear classifier
  – Hyperplanes
• 1 hidden layer
  – Boundary of convex region (open or closed)
• 2 hidden layers
  – Combinations of convex regions
[Each slide plots the resulting decision boundary over inputs x_1, x_2. Examples from Eric Postma via Jason Eisner.]
(A runnable illustration of the 0- vs. 1-hidden-layer contrast appears below.)

Different Levels of Abstraction
• We don't know the "right" levels of abstraction
• So let the model figure it out!
• Face Recognition:
  – Deep Network can build up increasingly higher levels of abstraction
  – Lines, parts, regions
[Examples from Honglak Lee (NIPS 2010)]
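To make the decision-boundary discussion concrete, here is a small sketch (my own example, not from the slides) of a 1-hidden-layer network of sigmoid units with hand-picked weights that represents XOR, a labeling that no 0-hidden-layer (linear) classifier can produce. The specific weight values are just one choice that works.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

# Hand-picked weights: the first hidden unit approximates OR(x1, x2), the second
# approximates AND(x1, x2), and the output unit computes roughly "OR and not AND",
# i.e. XOR. The last column/entry in each weight array is a bias term.
alpha = np.array([[20.0, 20.0, -10.0],    # hidden unit 1: ~OR
                  [20.0, 20.0, -30.0]])   # hidden unit 2: ~AND
beta = np.array([20.0, -20.0, -10.0])     # output unit: ~(h1 AND NOT h2)

def predict(x1, x2):
    x = np.array([x1, x2, 1.0])            # append a constant 1 for the bias
    z = sigmoid(alpha @ x)                 # hidden layer (sigmoid units)
    y = sigmoid(beta @ np.append(z, 1.0))  # output layer (sigmoid unit)
    return y

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), round(float(predict(x1, x2)), 3))
# Prints values near 0, 1, 1, 0: the XOR pattern, a non-convex labeling of the inputs.
```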