10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University
Neural Networks and Backpropagation
Matt Gormley
Lecture 19
March 29, 2017

Neural Net Readings: Murphy --; Bishop 5; HTF 11; Mitchell 4
Reminders
• Homework 6: Unsupervised Learning
  – Release: Wed, Mar. 22
  – Due: Mon, Apr. 03 at 11:59pm
• Homework 5 (Part II): Peer Review
  – Release: Wed, Mar. 29
  – Due: Wed, Apr. 05 at 11:59pm
  – Expectation: You should spend at most 1 hour on your reviews
• Peer Tutoring
2 Neural Networks Outline
• Logistic Regression (Recap) – Data, Model, Learning, Prediction • Neural Networks – A Recipe for Machine Learning – Visual Notation for Neural Networks – Example: Logistic Regression Output Surface – 2-Layer Neural Network – 3-Layer Neural Network • Neural Net Architectures – Objective Functions – Activation Functions • Backpropagation – Basic Chain Rule (of calculus) – Chain Rule for Arbitrary Computation Graph – Backpropagation Algorithm – Module-based Automatic Differentiation (Autodiff)
3 RECALL: LOGISTIC REGRESSION
Using gradient ascent for linear classifiers
Key idea behind today's lecture:
1. Define a linear classifier (logistic regression)
2. Define an objective function (likelihood)
3. Optimize it with gradient ascent to learn parameters
4. Predict the class with highest probability under the model
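The four steps above can be sketched end to end. This is a minimal illustration only; the toy data, learning rate, and epoch count are arbitrary choices, not from the lecture:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_logistic_regression(X, y, lr=0.5, epochs=2000):
    """Steps 1-3: maximize the log-likelihood with (batch) gradient ascent."""
    N, K = X.shape
    theta = np.zeros(K)
    for _ in range(epochs):
        p = sigmoid(X @ theta)       # p_theta(y=1 | x) for every example
        grad = X.T @ (y - p) / N     # gradient of the mean log-likelihood
        theta += lr * grad           # ascent: step in the gradient direction
    return theta

def predict(theta, X):
    """Step 4: predict the class with highest probability."""
    return (sigmoid(X @ theta) >= 0.5).astype(int)

# Toy separable data: label 1 when x1 + x2 > 1 (a bias feature is prepended)
X = np.array([[1., 0., 0.], [1., 0., 1.], [1., 1., 0.],
              [1., 1., 1.], [1., 2., 2.]])
y = np.array([0, 0, 0, 1, 1])
theta = train_logistic_regression(X, y)
print(predict(theta, X))
```

On separable data like this, gradient ascent drives the parameters toward a separating boundary, so the learned model recovers the training labels.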
Using gradient ascent for linear classifiers
This decision function isn't differentiable:
    h(x) = sign(θᵀx)
Use a differentiable function instead:
    p_θ(y = 1 | x) = 1 / (1 + exp(−θᵀx))
where
    logistic(u) ≡ 1 / (1 + e^(−u))

Logistic Regression
Data: Inputs are continuous vectors of length K. Outputs are discrete.
    D = {(x^(i), y^(i))}_{i=1}^N where x ∈ ℝ^K and y ∈ {0, 1}
Model: Logistic function applied to the dot product of parameters with the input vector.
    p_θ(y = 1 | x) = 1 / (1 + exp(−θᵀx))
Learning: Find the parameters that minimize some objective function.
    θ* = argmin_θ J(θ)
Prediction: Output the most probable class.
    ŷ = argmax_{y ∈ {0,1}} p_θ(y | x)

NEURAL NETWORKS
Background: A Recipe for Machine Learning
1. Given training data: [images labeled Face, Face, Not a face]
2. Choose each of these: – Decision function Examples: Linear regression, Logistic regression, Neural Network
– Loss function Examples: Mean-squared error, Cross Entropy
Background: A Recipe for Machine Learning (continued)
1. Given training data
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal: find the parameters that minimize the total loss on the training data
4. Train with SGD: (take small steps opposite the gradient)
Background: A Recipe for Machine Learning (Gradients)
Backpropagation can compute this gradient! And it's a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!
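That claim can be illustrated with a tiny scalar reverse-mode autodiff sketch. Everything here (the class, the two supported operations) is a hypothetical minimal design, not code from the course:

```python
class Node:
    """A scalar in a computation graph, remembering how it was computed."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # Nodes this one was computed from
        self.local_grads = local_grads  # d(self)/d(parent), one per parent
        self.grad = 0.0

    def __add__(self, other):
        return Node(self.value + other.value, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Node(self.value * other.value, (self, other),
                    (other.value, self.value))

def backward(output):
    """Reverse-mode sweep: visit nodes in reverse topological order and
    accumulate d(output)/d(node) into node.grad via the chain rule."""
    order, seen = [], set()
    def toposort(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent in node.parents:
                toposort(parent)
            order.append(node)
    toposort(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in zip(node.parents, node.local_grads):
            parent.grad += node.grad * local

x, y = Node(3.0), Node(4.0)
z = x * y + x          # dz/dx = y + 1 = 5, dz/dy = x = 3
backward(z)
print(x.grad, y.grad)  # 5.0 3.0
```

One forward pass builds the graph; one backward pass yields the gradient with respect to every input, which is exactly what backpropagation does for a neural network's loss.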
Goals for Today's Lecture
1. Explore a new class of decision functions (Neural Networks)
2. Consider variants of this recipe for training
Decision Functions: Linear Regression
y = h_θ(x) = σ(θᵀx) where σ(a) = a

Output
[Diagram: output unit connected to inputs by weights θ1 … θM]
Decision Functions: Logistic Regression
y = h_θ(x) = σ(θᵀx) where σ(a) = 1 / (1 + exp(−a))

Output
Decision Functions: Logistic Regression
y = h_θ(x) = σ(θᵀx) where σ(a) = 1 / (1 + exp(−a))
[Images: example faces labeled Face, Face, Not a face]

Output
Decision Functions: Logistic Regression (In-Class Example)
y = h_θ(x) = σ(θᵀx) where σ(a) = 1 / (1 + exp(−a))
[Plot: output y over inputs x1 and x2, with example labels 1, 1, 0]
Decision Functions: Perceptron
y = h_θ(x) = σ(θᵀx) where σ(a) = sign(a)
[Diagram: output unit connected to inputs by weights θ1 … θM]
From Biological to Artificial
The motivation for Artificial Neural Networks comes from biology…

Biological “Model”
• Neuron: an excitable cell
• Synapse: connection between neurons
• A neuron sends an electrochemical pulse along its synapses when a sufficient voltage change occurs
• Biological Neural Network: collection of neurons along some pathway through the brain

Artificial Model
• Neuron: node in a directed acyclic graph (DAG)
• Weight: multiplier on each edge
• Activation Function: nonlinear thresholding function, which allows a neuron to “fire” when the input value is sufficiently high
• Artificial Neural Network: collection of neurons into a DAG, which define some differentiable function

Biological “Computation”
• Neuron switching time: ~0.001 sec
• Number of neurons: ~10^10
• Connections per neuron: ~10^4–10^5
• Scene recognition time: ~0.1 sec

Artificial Computation
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processes
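The artificial model above is easy to state in code. A minimal sketch of one neuron with a sigmoid activation; the weights and inputs are arbitrary illustrative values:

```python
import math

def neuron(x, weights, bias):
    """One artificial neuron: a weighted sum of inputs passed through a
    sigmoid activation, so the unit 'fires' (output near 1) only when the
    weighted input is sufficiently high."""
    a = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-a))

print(neuron([1.0, 0.0], [4.0, 4.0], -2.0))  # strong input: output near 1
print(neuron([0.0, 0.0], [4.0, 4.0], -2.0))  # weak input: output near 0
```

A network is then just many such units wired into a DAG, with each unit's output feeding the next layer's weighted sums.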
Slide adapted from Eric Xing

Decision Functions: Logistic Regression
y = h_θ(x) = σ(θᵀx) where σ(a) = 1 / (1 + exp(−a))
[Diagram: output unit connected to inputs by weights θ1 … θM]
20 Neural Networks
Whiteboard – Example: Neural Network w/1 Hidden Layer – Example: Neural Network w/2 Hidden Layers – Example: Feed Forward Neural Network
21 Neural Network Model
[Diagram: inputs Age = 34, Gender = 2, Stage = 4 feed two sigmoid hidden units (Σ), which feed one sigmoid output unit producing 0.6, the “Probability of being Alive”. Columns: independent variables, weights, hidden layer, weights, dependent variable (prediction).]

“Combined logistic models”: each hidden unit computes a logistic function of the inputs, and the output unit computes a logistic function of the hidden units. Not really, though: there is no target for the hidden units...

© Eric Xing @ CMU, 2006-2011

Decision Functions: Neural Network
Output
Hidden Layer …
Input …
Decision Functions: Neural Network
(A) Input: given x_i, i = 1, …, M
(B) Hidden (linear): a_j = Σ_{i=0}^{M} θ_{ji} x_i, ∀j
(C) Hidden (sigmoid): z_j = 1 / (1 + exp(−a_j)), ∀j
(D) Output (linear): b = Σ_{j=0}^{D} β_j z_j
(E) Output (sigmoid): y = 1 / (1 + exp(−b))
(F) Loss: J = ½ (y − y^(d))²

NEURAL NETWORKS: REPRESENTATIONAL POWER
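The quantities (A)–(F) of the 2-layer network map directly onto code. A minimal sketch of the forward pass; the weight values are arbitrary illustrative choices, and a constant x_0 = 1 stands in for the hidden bias term:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, theta, beta, y_target):
    """Forward pass computing quantities (A)-(F) for the 2-layer net."""
    a = theta @ x                  # (B) hidden linear: a_j = sum_i theta_ji x_i
    z = sigmoid(a)                 # (C) hidden sigmoid
    b = beta @ z                   # (D) output linear: b = sum_j beta_j z_j
    y = sigmoid(b)                 # (E) output sigmoid
    J = 0.5 * (y - y_target) ** 2  # (F) squared-error loss
    return y, J

x = np.array([1.0, 0.5, -0.5])          # (A) input, with x_0 = 1 as bias
theta = np.array([[0.1, 0.2, 0.3],      # illustrative hidden weights
                  [-0.3, 0.1, 0.2]])
beta = np.array([0.5, -0.5])            # illustrative output weights
y, J = forward(x, theta, beta, y_target=1.0)
print(y, J)
```

Each intermediate quantity is kept by name so that backpropagation can later reuse a, z, b, and y when applying the chain rule.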
Building a Neural Net
Q: How many hidden units, D, should we use?
[Build slides: diagrams of Input → Hidden Layer → Output with varying numbers of hidden units.]
• D = M (one build shows each hidden unit connected to one input with weight 1)
• D < M: What method(s) is this setting similar to?
• D > M: What method(s) is this setting similar to?
Decision Functions: Deeper Networks
Q: How many layers should we use?
[Build slides: diagrams stacking Hidden Layer 1, Hidden Layer 2, Hidden Layer 3 between Input and Output.]
Q: How many layers should we use?
• Theoretical answer:
  – A neural network with 1 hidden layer is a universal function approximator
  – Cybenko (1989): For any continuous function g(x), there exists a 1-hidden-layer neural net h_θ(x) s.t. |h_θ(x) − g(x)| < ϵ for all x, assuming sigmoid activation functions
• Empirical answer:
  – Before 2006: “Deep networks (e.g. 3 or more hidden layers) are too hard to train”
  – After 2006: “Deep networks are easier to train than shallow networks (e.g. 2 or fewer layers) for many problems”
Big caveat: You need to know and use the right tricks.

Decision Boundary
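Cybenko's result can be made concrete by hand: two steep sigmoid hidden units build a “bump”, and sums of such bumps approximate any continuous function. A small sketch; the target function (an indicator) and the steepness k are illustrative choices:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def bump(x, left, right, k=50.0):
    """1-hidden-layer building block: approximates the indicator of
    [left, right] using two steep sigmoid hidden units whose outputs
    are combined with weights +1 and -1."""
    return sigmoid(k * (x - left)) - sigmoid(k * (x - right))

# Approximate g(x) = indicator of [0, 1]; the error shrinks as k grows
print(bump(0.5, 0.0, 1.0))   # inside the interval: close to 1
print(bump(2.0, 0.0, 1.0))   # outside the interval: close to 0
```

Summing many such bumps with different heights and intervals gives a staircase approximation to any continuous g, which is the intuition behind the universal approximation theorem.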
• 0 hidden layers: linear classifier – Hyperplanes
[Plot: linear decision boundary over inputs x1 and x2]
Example from Eric Postma via Jason Eisner
• 1 hidden layer: boundary of a convex region (open or closed)
[Plot: convex decision region over inputs x1 and x2]
Example from Eric Postma via Jason Eisner
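A concrete instance of the 1-hidden-layer case is XOR with hand-picked weights: the decision region for class 1 is the (convex) band between two parallel lines. All weights here are illustrative choices, not from the slides:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def xor_net(x1, x2, k=10.0):
    """1-hidden-layer net for XOR with hand-picked weights.
    h1 fires when x1 + x2 > 0.5 and h2 fires when x1 + x2 > 1.5,
    so the output fires only in the convex band between the two lines."""
    h1 = sigmoid(k * (x1 + x2 - 0.5))
    h2 = sigmoid(k * (x1 + x2 - 1.5))
    return sigmoid(k * (h1 - h2 - 0.5))

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(xor_net(x1, x2)))
```

No single hyperplane (0 hidden layers) separates these four points, but one hidden layer already carves out the band that does.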
• 2 hidden layers: combinations of convex regions
[Plot: non-convex decision region over inputs x1 and x2]
Example from Eric Postma via Jason Eisner

Different Levels of Abstraction
• We don’t know the “right” levels of abstraction • So let the model figure it out!
Example from Honglak Lee (NIPS 2010)
Face Recognition: – Deep Network can build up increasingly higher levels of abstraction – Lines, parts, regions
Example from Honglak Lee (NIPS 2010)
[Diagram: Input → Hidden Layer 1 → Hidden Layer 2 → Hidden Layer 3 → Output, with increasingly abstract features at each layer]

Example from Honglak Lee (NIPS 2010)