10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University
Neural Networks + Backpropagation
Matt Gormley Lecture 13 Oct. 7, 2019
1 Reminders • Homework 4: Logistic Regression – Out: Wed, Sep. 25 – Due: Fri, Oct. 11 at 11:59pm • Homework 5: Neural Networks – Out: Fri, Oct. 11 – Due: Fri, Oct. 25 at 11:59pm • Today’s In-Class Poll – http://p13.mlcourse.org
2 Q&A
Q: What is mini-batch SGD?
A: A variant of SGD…
3 Mini-Batch SGD
• Gradient Descent: Compute true gradient exactly from all N examples
• Mini-Batch SGD: Approximate true gradient by the average gradient of K randomly chosen examples
• Stochastic Gradient Descent (SGD): Approximate true gradient by the gradient of one randomly chosen example
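As a sketch, here is the mini-batch update for a toy one-parameter least-squares model (the model, data, and constants are illustrative, not from the lecture):

```python
import random

random.seed(0)

def grad_example(w, x, y):
    # Gradient of the single-example squared error (w*x - y)^2 / 2 w.r.t. w
    return (w * x - y) * x

def minibatch_sgd(data, w=0.0, lr=0.1, batch_size=2, epochs=50):
    for _ in range(epochs):
        batch = random.sample(data, batch_size)   # K randomly chosen examples
        # Average gradient over the mini-batch approximates the true gradient
        g = sum(grad_example(w, x, y) for x, y in batch) / len(batch)
        w -= lr * g                               # step opposite the approximate gradient
    return w

data = [(x, 3.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]   # generated with true w = 3
w = minibatch_sgd(data)
```

Setting `batch_size=1` recovers SGD, and `batch_size=len(data)` recovers full gradient descent.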
4 Mini-Batch SGD
Three variants of first-order optimization:
5 NEURAL NETWORKS
6 Neural Networks Chalkboard – Example: Neural Network w/1 Hidden Layer – Example: Neural Network w/2 Hidden Layers – Example: Feed Forward Neural Network
7 Neural Network Parameters
Question: Suppose you are training a one-hidden-layer neural network with sigmoid activations for binary classification. True or False: There is a unique set of parameters that maximize the likelihood of the dataset above.
Answer:
8 ARCHITECTURES
9 Neural Network Architectures
Even for a basic Neural Network, there are many design decisions to make:
1. # of hidden layers (depth)
2. # of units per hidden layer (width)
3. Type of activation function (nonlinearity)
4. Form of objective function
10 Building a Neural Net
Q: How many hidden units, D, should we use?
Output
Hidden Layer … D = M
Input …
14 Building a Neural Net
Q: How many hidden units, D, should we use?
Output
What method(s) is this setting similar to?
Hidden Layer … D < M
Input …
15 Building a Neural Net
Q: How many hidden units, D, should we use?
Output
Hidden Layer … D > M
What method(s) is this setting similar to?
Input …
16 Deeper Networks
Q: How many layers should we use?
Output
… Hidden Layer 1
… Input
17 Deeper Networks
Q: How many layers should we use?
Output
… Hidden Layer 2
… Hidden Layer 1
… Input
18 Deeper Networks
Q: How many layers should we use?
Output
… Hidden Layer 3
… Hidden Layer 2
… Hidden Layer 1
… Input
19 Deeper Networks
Q: How many layers should we use?
• Theoretical answer:
– A neural network with 1 hidden layer is a universal function approximator
– Cybenko (1989): For any continuous function g(x), there exists a 1-hidden-layer neural net hθ(x) s.t. |hθ(x) – g(x)| < ϵ for all x, assuming sigmoid activation functions
• Empirical answer:
– Before 2006: “Deep networks (e.g. 3 or more hidden layers) are too hard to train”
– After 2006: “Deep networks are easier to train than shallow networks (e.g. 2 or fewer layers) for many problems”
Big caveat: You need to know and use the right tricks.
20 Different Levels of Abstraction
• We don’t know the “right” levels of abstraction • So let the model figure it out!
24 Example from Honglak Lee (NIPS 2010) Different Levels of Abstraction
Face Recognition: – Deep Network can build up increasingly higher levels of abstraction – Lines, parts, regions
25 Example from Honglak Lee (NIPS 2010) Different Levels of Abstraction
Output
… Hidden Layer 3
… Hidden Layer 2
… Hidden Layer 1
… Input
26 Example from Honglak Lee (NIPS 2010) Activation Functions
Neural Network with sigmoid activation functions:
(F) Loss: J = ½ (y − y*)²
(E) Output (sigmoid): y = 1 / (1 + exp(−b))
(D) Output (linear): b = Σ_{j=0}^{D} β_j z_j
(C) Hidden (sigmoid): z_j = 1 / (1 + exp(−a_j)), ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^{M} α_{ji} x_i, ∀j
(A) Input: Given x_i, ∀i
27 Activation Functions
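The layers (A)–(F) translate directly into code; a sketch for a single input vector with made-up weights (all values here are hypothetical):

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def forward(x, alpha, beta, y_star):
    # (B) Hidden (linear): a_j = sum_i alpha[j][i] * x[i]   (x[0] = 1 is the bias)
    a = [sum(a_ji * x_i for a_ji, x_i in zip(row, x)) for row in alpha]
    # (C) Hidden (sigmoid): z_j = 1 / (1 + exp(-a_j)); prepend z_0 = 1 for the output bias
    z = [1.0] + [sigmoid(a_j) for a_j in a]
    # (D) Output (linear): b = sum_j beta[j] * z[j]
    b = sum(b_j * z_j for b_j, z_j in zip(beta, z))
    # (E) Output (sigmoid): y = 1 / (1 + exp(-b))
    y = sigmoid(b)
    # (F) Loss: J = 0.5 * (y - y*)^2
    J = 0.5 * (y - y_star) ** 2
    return y, J

x = [1.0, 2.0, -1.0]                              # x_0 = 1 handles the i = 0 bias term
alpha = [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.4]]       # hypothetical hidden-layer weights
beta = [0.5, -1.0, 2.0]                           # beta_0 multiplies the fixed z_0 = 1
y, J = forward(x, alpha, beta, y_star=1.0)
```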
Neural Network with arbitrary nonlinear activation functions:
(F) Loss: J = ½ (y − y*)²
(E) Output (nonlinear): y = σ(b)
(D) Output (linear): b = Σ_{j=0}^{D} β_j z_j
(C) Hidden (nonlinear): z_j = σ(a_j), ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^{M} α_{ji} x_i, ∀j
(A) Input: Given x_i, ∀i
28 Activation Functions
Sigmoid / Logistic Function: logistic(u) ≡ 1 / (1 + e^(−u))
So far, we’ve assumed that the activation function (nonlinearity) is always the sigmoid function…
29 Activation Functions
• A new change: modifying the nonlinearity – The logistic is not widely used in modern ANNs
Alternate 1: tanh
Like logistic function but shifted to range [-1, +1]
Slide from William Cohen
Figure from Glorot & Bengio (AISTATS 2010): sigmoid vs. tanh, depth 4
Activation Functions
• A new change: modifying the nonlinearity – ReLU is often used in vision tasks
Alternate 2: rectified linear unit
Linear with a cutoff at zero
(Implementation: clip the gradient when you pass zero)
Slide from William Cohen Activation Functions
• A new change: modifying the nonlinearity – ReLU is often used in vision tasks
Alternate 2: rectified linear unit
Soft version: log(exp(x)+1)
• Doesn’t saturate (at one end)
• Sparsifies outputs
• Helps with vanishing gradient
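The activation functions discussed in this section can be sketched side by side (the helper names are ours):

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))    # range (0, 1); saturates at both ends

def tanh(u):
    return math.tanh(u)                  # like the sigmoid but shifted to range (-1, +1)

def relu(u):
    return max(0.0, u)                   # linear with a cutoff at zero

def softplus(u):
    return math.log(math.exp(u) + 1.0)   # soft version of the ReLU: log(exp(x) + 1)

# ReLU sparsifies: negative pre-activations map exactly to zero
assert relu(-2.0) == 0.0 and relu(3.0) == 3.0
# tanh is a rescaled sigmoid: tanh(u) = 2*sigmoid(2u) - 1
assert abs(tanh(0.7) - (2 * sigmoid(1.4) - 1)) < 1e-12
```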
Slide from William Cohen
Neural Network Decision Functions
Neural Network for Classification:
(F) Loss: J = ½ (y − y^(d))²
(E) Output (sigmoid): y = 1 / (1 + exp(−b))
(D) Output (linear): b = Σ_{j=0}^{D} β_j z_j
(C) Hidden (sigmoid): z_j = 1 / (1 + exp(−a_j)), ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^{M} α_{ji} x_i, ∀j
(A) Input: Given x_i, ∀i
34 Neural Network Decision Functions
Neural Network for Regression:
(F) Loss: J = ½ (y − y^(d))²
(E)/(D) Output (linear): y = b = Σ_{j=0}^{D} β_j z_j (for regression, the linear output is used directly; there is no output sigmoid)
(C) Hidden (sigmoid): z_j = 1 / (1 + exp(−a_j)), ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^{M} α_{ji} x_i, ∀j
(A) Input: Given x_i, ∀i
35 Objective Functions for NNs
1. Quadratic Loss:
– the same objective as Linear Regression
– i.e. mean squared error
2. Cross-Entropy:
– the same objective as Logistic Regression
– i.e. negative log likelihood
– This requires probabilities, so we add an additional “softmax” layer at the end of our network
Forward and Backward computations for each loss:
Quadratic: J = ½ (y − y*)²; dJ/dy = y − y*
Cross Entropy: J = y* log(y) + (1 − y*) log(1 − y); dJ/dy = y*/y + (1 − y*)/(y − 1)
36 Objective Functions for NNs
Cross-entropy vs. Quadratic loss
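A quick numerical check of the loss derivatives, using the slide's sign convention for the cross-entropy (the helper names are ours):

```python
import math

def quadratic(y, y_star):
    return 0.5 * (y - y_star) ** 2

def d_quadratic(y, y_star):
    return y - y_star

def cross_entropy(y, y_star):
    # Sign convention follows the slide: this is a log-likelihood, not its negation
    return y_star * math.log(y) + (1 - y_star) * math.log(1 - y)

def d_cross_entropy(y, y_star):
    return y_star / y + (1 - y_star) / (y - 1)

def numeric_grad(f, y, eps=1e-6):
    # Central difference approximation of df/dy
    return (f(y + eps) - f(y - eps)) / (2 * eps)

y, y_star = 0.7, 1.0
assert abs(d_quadratic(y, y_star) - numeric_grad(lambda v: quadratic(v, y_star), y)) < 1e-6
assert abs(d_cross_entropy(y, y_star) - numeric_grad(lambda v: cross_entropy(v, y_star), y)) < 1e-5
```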
Figure from Glorot & Bengio (2010)
Multi-Class Output
Output …
Hidden Layer …
Input …
38 Multi-Class Output
(F) Loss (cross-entropy): J = −Σ_{k=1}^{K} y*_k log(y_k)
(E) Output (softmax): y_k = exp(b_k) / Σ_{l=1}^{K} exp(b_l)
(D) Output (linear): b_k = Σ_{j=0}^{D} β_{kj} z_j, ∀k
(C) Hidden (nonlinear): z_j = σ(a_j), ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^{M} α_{ji} x_i, ∀j
(A) Input: Given x_i, ∀i
39 Neural Network Errors
Question A: On which of the datasets below could a one-hidden-layer neural network achieve zero classification error? Select all that apply.
Question B: On which of the datasets below could a one-hidden-layer neural network for regression achieve nearly zero MSE? Select all that apply.
(Choices A–D for each question: four dataset scatter plots.)
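The softmax output layer and its loss can be sketched as follows (subtracting the max before exponentiating is a standard stability trick, not shown on the slide):

```python
import math

def softmax(b):
    # y_k = exp(b_k) / sum_l exp(b_l); subtracting max(b) avoids overflow
    m = max(b)
    exps = [math.exp(b_k - m) for b_k in b]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(y_star, y):
    # J = -sum_k y*_k log(y_k), with y* a one-hot label vector
    return -sum(t * math.log(p) for t, p in zip(y_star, y) if t > 0)

y = softmax([2.0, 1.0, -1.0])          # hypothetical linear outputs b_k
J = cross_entropy([1.0, 0.0, 0.0], y)  # true class is k = 0
```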
40 DECISION BOUNDARY EXAMPLES
41 Decision boundary examples:
Example #1: Diagonal Band
Example #2: One Pocket
Example #3: Four Gaussians
Example #4: Two Pockets
42–48 Example #1: Diagonal Band (a sequence of figures showing the learned decision boundary and the individual hidden-unit activations)
49–56 Example #2: One Pocket (a sequence of figures showing the learned decision boundary and the individual hidden-unit activations)
57–63 Example #3: Four Gaussians (a sequence of figures showing the learned decision boundary and the individual hidden-unit activations)
64–72 Example #4: Two Pockets (a sequence of figures showing the learned decision boundary and the individual hidden-unit activations)
73 Neural Networks Objectives
You should be able to…
• Explain the biological motivations for a neural network
• Combine simpler models (e.g. linear regression, binary logistic regression, multinomial logistic regression) as components to build up feed-forward neural network architectures
• Explain the reasons why a neural network can model nonlinear decision boundaries for classification
• Compare and contrast feature engineering with learning features
• Identify (some of) the options available when designing the architecture of a neural network
• Implement a feed-forward neural network
74 Computing Gradients DIFFERENTIATION
75 A Recipe for Background Machine Learning
1. Given training data: 3. Define goal:
2. Choose each of these: – Decision function 4. Train with SGD: (take small steps opposite the gradient) – Loss function
76 Approaches to Training Differentiation
• Question 1: When can we compute the gradients for an arbitrary neural network?
• Question 2: When can we make the gradient computation efficient?
77 Training: Approaches to Differentiation
1. Finite Difference Method
– Pro: Great for testing implementations of backpropagation
– Con: Slow for high dimensional inputs / outputs
– Required: Ability to call the function f(x) on any input x
2. Symbolic Differentiation
– Note: The method you learned in high school
– Note: Used by Mathematica / Wolfram Alpha / Maple
– Pro: Yields easily interpretable derivatives
– Con: Leads to exponential computation time if not carefully implemented
– Required: Mathematical expression that defines f(x)
3. Automatic Differentiation – Reverse Mode
– Note: Called Backpropagation when applied to Neural Nets
– Pro: Computes partial derivatives of one output f(x)_i with respect to all inputs x_j in time proportional to computation of f(x)
– Con: Slow for high dimensional outputs (e.g. vector-valued functions)
– Required: Algorithm for computing f(x)
4. Automatic Differentiation – Forward Mode
– Note: Easy to implement. Uses dual numbers.
– Pro: Computes partial derivatives of all outputs f(x)_i with respect to one input x_j in time proportional to computation of f(x)
– Con: Slow for high dimensional inputs (e.g. vector-valued x)
– Required: Algorithm for computing f(x)
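Approach 1 above is easy to sketch as a gradient checker using central differences (the example function is our own):

```python
def finite_difference_grad(f, x, eps=1e-5):
    # Approximate df/dx_i by (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps)
    grad = []
    for i in range(len(x)):
        x_plus = list(x); x_plus[i] += eps
        x_minus = list(x); x_minus[i] -= eps
        grad.append((f(x_plus) - f(x_minus)) / (2 * eps))
    return grad

# Example: f(x) = x_0^2 * x_1, so df/dx_0 = 2*x_0*x_1 and df/dx_1 = x_0^2
f = lambda x: x[0] ** 2 * x[1]
g = finite_difference_grad(f, [3.0, 2.0])   # analytic gradient is [12.0, 9.0]
```

Note that this requires one pair of function calls per input dimension, which is why it is slow for high-dimensional inputs.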
78 Training Finite Difference Method
Notes:
• Suffers from issues of floating point precision, in practice
• Typically only appropriate to use on small examples with an appropriately chosen epsilon
79 Training: Symbolic Differentiation
Speed Quiz (2 minute time limit). Differentiation Quiz #1: Suppose x = 2 and z = 3; what are dy/dx and dy/dz for the function below? Round your answer to the nearest integer.
Answer: Answers below are in the form [dy/dx, dy/dz] A. [42, -72] E. [1208, 810] B. [72, -42] F. [810, 1208] C. [100, 127] G. [1505, 94] D. [127, 100] H. [94, 1505] 80 Training Symbolic Differentiation
Differentiation Quiz #2:
…
…
…
82 CHAIN RULE
83 Training Chain Rule
Chalkboard – Chain Rule of Calculus
2.2. NEURAL NETWORKS AND BACKPROPAGATION

…x to J, but also a manner of carrying out that computation in terms of the intermediate quantities a, z, b, y. Which intermediate quantities to use is a design decision. In this way, the arithmetic circuit diagram of Figure 2.1 is differentiated from the standard neural network diagram in two ways. A standard diagram for a neural network does not show this choice of intermediate quantities nor the form of the computations.

The topologies presented in this section are very simple. However, we will see later (Chapter 5) how an entire algorithm can define an arithmetic circuit.

2.2.2 Backpropagation

The backpropagation algorithm (Rumelhart et al., 1986) is a general method for computing the gradient of a neural network. Here we generalize the concept of a neural network to include any arithmetic circuit. Applying the backpropagation algorithm on these circuits amounts to repeated application of the chain rule. This general algorithm goes under many other names: automatic differentiation (AD) in the reverse mode (Griewank and Corliss, 1991), analytic differentiation, module-based AD, autodiff, etc. Below we define a forward pass, which computes the output bottom-up, and a backward pass, which computes the derivatives of all intermediate quantities top-down.

Chain Rule: At the core of the backpropagation algorithm is the chain rule. The chain rule allows us to differentiate a function f defined as the composition of two functions g and h such that f = (g ∘ h). If the inputs and outputs of g and h are vector-valued variables then f is as well: h: R^K → R^J and g: R^J → R^I ⇒ f: R^K → R^I. Given an input vector x = {x_1, x_2, …, x_K}, we compute the output y = {y_1, y_2, …, y_I} in terms of an intermediate vector u = {u_1, u_2, …, u_J}. That is, the computation y = f(x) = g(h(x)) can be described in a feed-forward manner: y = g(u) and u = h(x). Then the chain rule must sum over all the intermediate quantities:

dy_i/dx_k = Σ_{j=1}^{J} (dy_i/du_j)(du_j/dx_k), ∀i, k   (2.3)

If the inputs and outputs of f, g, and h are all scalars, then we obtain the familiar form of the chain rule:

dy/dx = (dy/du)(du/dx)   (2.4)

85 Training: Chain Rule
Given: y = f(x) = g(h(x)), with u = h(x) and y = g(u)
Chain Rule: dy_i/dx_k = Σ_j (dy_i/du_j)(du_j/dx_k)

Binary Logistic Regression: Binary logistic regression can be interpreted as an arithmetic circuit. To compute the derivative of some loss function (below we use regression) with respect to the model parameters θ, we can repeatedly apply the chain rule (i.e. backpropagation). Note that the output q below is the probability that the output label takes on the value 1. y* is the true output label. The forward pass computes the following:

J = y* log q + (1 − y*) log(1 − q)   (2.5)
where q = P_θ(Y_i = 1 | x) = 1 / (1 + exp(−Σ_{j=0}^{D} θ_j x_j))   (2.6)
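The forward pass in Eqs. (2.5)–(2.6) and the chain-rule backward pass can be sketched as follows (parameter values are hypothetical):

```python
import math

def forward(theta, x, y_star):
    s = sum(t * xi for t, xi in zip(theta, x))     # the linear score sum_j theta_j x_j
    q = 1.0 / (1.0 + math.exp(-s))                 # Eq. (2.6)
    J = y_star * math.log(q) + (1 - y_star) * math.log(1 - q)   # Eq. (2.5)
    return q, J

def backward(theta, x, y_star):
    q, _ = forward(theta, x, y_star)
    dJ_dq = y_star / q - (1 - y_star) / (1 - q)    # differentiate Eq. (2.5) w.r.t. q
    dq_ds = q * (1 - q)                            # derivative of the sigmoid w.r.t. its input
    return [dJ_dq * dq_ds * xi for xi in x]        # chain rule: dJ/dtheta_j

theta = [0.5, -0.25, 1.0]   # hypothetical parameters; theta_0 is the bias
x = [1.0, 2.0, -1.0]        # x_0 = 1 for the bias term
grad = backward(theta, x, y_star=1.0)
# the product of local derivatives simplifies to (y* - q) * x_j
```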
86 Training: Chain Rule
Backpropagation is just repeated application of the chain rule from Calculus 101.

Intuitions
BACKPROPAGATION
87 Error Back-Propagation
88–96 (A sequence of slides from (Stoyanov & Eisner, 2012) illustrating error back-propagation through a model with parameters ϴ, intermediate quantities z, prediction p(y|x(i)), and true label y(i).)
97 Slide from (Stoyanov & Eisner, 2012)
Algorithm
BACKPROPAGATION
98 Training Backpropagation
Chalkboard – Example: Backpropagation for Chain Rule #1
Differentiation Quiz #1: Suppose x = 2 and z = 3, what are dy/dx and dy/dz for the function below? Round your answer to the nearest integer.
99 Training Backpropagation
Chalkboard – SGD for Neural Network – Example: Backpropagation for Neural Network
100 Training Backpropagation
Automatic Differentiation – Reverse Mode (aka. Backpropagation)
Forward Computation
1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the “computation graph”)
2. Visit each node in topological order. For variable u_i with inputs v_1, …, v_N:
a. Compute u_i = g_i(v_1, …, v_N)
b. Store the result at the node
Backward Computation
1. Initialize all partial derivatives dy/du_j to 0 and dy/dy = 1.
2. Visit each node in reverse topological order. For variable u_i = g_i(v_1, …, v_N):
a. We already know dy/du_i
b. Increment dy/dv_j by (dy/du_i)(du_i/dv_j)
(Choice of algorithm ensures computing du_i/dv_j is easy)
Return partial derivatives dy/du_i for all variables.
101
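The forward and backward computations above can be implemented with a tiny computation-graph class; a sketch (the Node class and the example function y = x·z + sin(x·z) are ours, not from the slides):

```python
import math

class Node:
    """One variable u_i in the computation graph."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # list of (input_node, local derivative du_i/dv_j)
        self.grad = 0.0          # dy/du_i, initialized to 0

def add(a, b):
    return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])

def mul(a, b):
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def sin(a):
    return Node(math.sin(a.value), [(a, math.cos(a.value))])

def backward(y):
    y.grad = 1.0                 # dy/dy = 1
    order, seen = [], set()
    def topo(n):                 # topological order of the DAG
        if id(n) in seen:
            return
        seen.add(id(n))
        for p, _ in n.parents:
            topo(p)
        order.append(n)
    topo(y)
    for n in reversed(order):    # visit nodes in reverse topological order
        for parent, local in n.parents:
            parent.grad += n.grad * local   # increment dy/dv_j by (dy/du_i)(du_i/dv_j)

# y = x*z + sin(x*z): dy/dx = z*(1 + cos(x*z)), dy/dz = x*(1 + cos(x*z))
x, z = Node(2.0), Node(3.0)
t = mul(x, z)                    # the shared intermediate quantity u = x*z
y = add(t, sin(t))
backward(y)
```

Because `t` feeds into two downstream nodes, its `grad` accumulates two contributions before being propagated to `x` and `z`, which is exactly the "increment" step in the algorithm above.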