Machine Learning I Lecture 18 Multi-Layer Perceptron
ECE595 / STAT598: Machine Learning I, Lecture 18: Multi-Layer Perceptron. Spring 2020. Stanley Chan, School of Electrical and Computer Engineering, Purdue University.

Outline

Discriminative Approaches
  Lecture 16: Perceptron 1: Definition and Basic Concepts
  Lecture 17: Perceptron 2: Algorithm and Property
  Lecture 18: Multi-Layer Perceptron: Back Propagation

This lecture: Multi-Layer Perceptron: Back Propagation
  Multi-Layer Perceptron
    Hidden Layer
    Matrix Representation
  Back Propagation
    Chain Rule
    4 Fundamental Equations
    Algorithm
    Interpretation

Single-Layer Perceptron

Input neurons x, weights w.
Predicted label = σ(w^T x + w_0).

Multi-Layer Network

https://towardsdatascience.com/multi-layer-neural-networks-with-sigmoid-function-deep-learning-for-rookies-2-bf464f09eb7f
Introduce a layer of hidden neurons.
So now you have two sets of weights: from input to hidden, and from hidden to output.

Many Hidden Layers

You can introduce as many hidden layers as you want.
Every time you add a hidden layer, you add a set of weights.

Understanding the Weights

Each hidden neuron is the output of a perceptron, so you will have

    [ h_1^1 ]   [ w_11^1  w_12^1  ...  w_1m^1 ] [ x_1 ]
    [ h_2^1 ] = [ w_21^1  w_22^1  ...  w_2m^1 ] [ x_2 ]
    [  ...  ]   [  ...     ...    ...   ...   ] [ ... ]
    [ h_n^1 ]   [ w_n1^1  w_n2^1  ...  w_nm^1 ] [ x_m ]

Progression to DEEP (Linear) Neural Networks

Single layer:            h = w^T x
One hidden layer:        h = W^T x
Two hidden layers:       h = W_2^T W_1^T x
Three hidden layers:     h = W_3^T W_2^T W_1^T x
A LOT of hidden layers:  h = W_N^T ... W_2^T W_1^T x

(A short NumPy sketch of this progression appears at the end of this section.)

Interpreting the Hidden Layer

Each hidden neuron is responsible for certain features.
Given an object, the network identifies the most likely features.
https://www.scientificamerican.com/article/springtime-for-ai-the-rise-of-deep-learning/

Two Questions about a Multi-Layer Network

How do we efficiently learn the weights?
Ultimately we need to minimize the loss

    J(W_1, ..., W_L) = Σ_{i=1}^N || W_L^T ... W_2^T W_1^T x_i − y_i ||^2

One layer: gradient descent.
Multi-layer: also gradient descent, known as back propagation (BP), by Rumelhart, Hinton and Williams (1986).
Back propagation = very careful book-keeping + chain rule.

What is the optimization landscape? Convex? Global minimum? Saddle point?
The two-layer case was proved by Baldi and Hornik (1989): all local minima are global, and every critical point is either a saddle point or a global minimum.
The L-layer case was proved by Kawaguchi (2016), who also proved the L-layer nonlinear network (with sigmoid between adjacent layers).
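To make the linear progression above concrete, here is a minimal NumPy sketch (the layer sizes, random weights, and variable names are illustrative choices, not from the lecture). It applies W_1^T, W_2^T, W_3^T layer by layer and confirms that, with no nonlinearity in between, the whole stack collapses to a single matrix applied to x.

```python
import numpy as np

# A minimal sketch of the "DEEP (Linear) Neural Network" progression:
# stacking hidden layers without a nonlinearity is still one linear map,
# h = W_3^T W_2^T W_1^T x.

rng = np.random.default_rng(0)

x = rng.standard_normal(5)            # input, m = 5 (illustrative size)
W1 = rng.standard_normal((5, 4))      # input -> hidden layer 1
W2 = rng.standard_normal((4, 3))      # hidden layer 1 -> hidden layer 2
W3 = rng.standard_normal((3, 2))      # hidden layer 2 -> output

# Layer-by-layer forward pass (no sigmoid yet).
h = x
for W in (W1, W2, W3):
    h = W.T @ h

# The same result from one combined matrix: (W_3^T W_2^T W_1^T) x.
W_combined = W3.T @ W2.T @ W1.T
print(np.allclose(h, W_combined @ x))   # True
```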
Back Propagation: A 20-Minute Tour

You will be able to find A LOT of blogs on the internet discussing how back propagation is implemented.
Some mystify back propagation.
Some literally just teach you the procedure of back propagation without telling you the intuition.
I find the following online book by Mike Nielsen fairly well written: http://neuralnetworksanddeeplearning.com/
The following slides are written based on Nielsen's book.
We will not go into great detail. The purpose is to get you exposed to the idea and to de-mystify back propagation.
As stated before, back propagation is chain rule + very careful book-keeping.

Back Propagation

Here is the loss function you want to minimize:

    J(W_1, ..., W_L) = Σ_{i=1}^N || σ(W_L^T ... σ(W_2^T σ(W_1^T x_i))) − y_i ||^2

You have a set of nonlinear activation functions, usually the sigmoid.
To optimize, you need gradient descent. For example, for W_1:

    W_1^{t+1} = W_1^t − α ∇J(W_1^t)

But you need to do this for all W_1, ..., W_L. And there are lots of sigmoid functions.
Let us do the brute force. And this is back propagation. (Really? Yes...)

Let Us See an Example

Let us look at two layers:

    J(W_1, W_2) = || σ(W_2^T σ(W_1^T x)) − y ||^2,   where a_2 = σ(W_2^T σ(W_1^T x)).

Let us go backward:

    ∂J/∂W_2 = (∂J/∂a_2) · (∂a_2/∂W_2)

Now, what is a_2? It is a_2 = σ(z_2) with z_2 = W_2^T σ(W_1^T x). So let us compute:

    ∂a_2/∂W_2 = (∂a_2/∂z_2) · (∂z_2/∂W_2).

Let Us See an Example

    J(W_1, W_2) = || σ(W_2^T σ(W_1^T x)) − y ||^2,   where a_1 = σ(W_1^T x).

How about W_1? Again, let us go backward:

    ∂J/∂W_1 = (∂J/∂a_2) · (∂a_2/∂W_1)

But you can now repeat the calculation as follows (let z_1 = W_1^T x):

    ∂a_2/∂W_1 = (∂a_2/∂a_1)(∂a_1/∂W_1)
              = (∂a_2/∂a_1)(∂a_1/∂z_1)(∂z_1/∂W_1)

So it is just a very long sequence of chain rule. (A numerical check of this two-layer example appears at the end of this section.)

Notations for Back Propagation

The following notations are based on Nielsen's online book.
The purpose of doing this is to write down a concise algorithm.
Weights: w_24^3 means layer 3, with the weight running from the 4th neuron (in layer 2) to the 2nd neuron (in layer 3).

Notations for Back Propagation

Activation and bias: a_1^3 is the 3rd layer, 1st activation; b_3^2 is the 2nd layer, 3rd bias.
Here is the relationship. Think of σ(w^T x + w_0):

    a_j^ℓ = σ( Σ_k w_jk^ℓ a_k^{ℓ−1} + b_j^ℓ ).

Understanding Back Propagation

This is the main equation:

    a_j^ℓ = σ( Σ_k w_jk^ℓ a_k^{ℓ−1} + b_j^ℓ ),   or   a_j^ℓ = σ(z_j^ℓ),   with z_j^ℓ = Σ_k w_jk^ℓ a_k^{ℓ−1} + b_j^ℓ.

a_j^ℓ: activation, z_j^ℓ: intermediate.

Loss

The loss takes the form of

    C = Σ_j (a_j^L − y_j)^2

Think of two-class cross-entropy where each a^L is a 2-by-1 vector.

Error Term

The error is defined as

    δ_j^ℓ = ∂C/∂z_j^ℓ

You can show that at the output,

    δ_j^L = (∂C/∂a_j^L)(∂a_j^L/∂z_j^L) = (∂C/∂a_j^L) σ'(z_j^L).
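The two-layer example above can be checked numerically. The sketch below (NumPy; the vector sizes, the squared-error loss ||a_2 − y||^2, and the finite-difference test are illustrative assumptions) walks the same backward path, J → a_2 → z_2 → W_2 and J → a_2 → a_1 → z_1 → W_1, and compares one entry of ∂J/∂W_1 against a centered difference.

```python
import numpy as np

# A minimal sketch of the two-layer chain rule:
# J(W1, W2) = || sigma(W2^T sigma(W1^T x)) - y ||^2.
# Analytic gradients are checked against a finite-difference estimate.

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W1, W2, x, y):
    a1 = sigmoid(W1.T @ x)        # a1 = sigma(z1), z1 = W1^T x
    a2 = sigmoid(W2.T @ a1)       # a2 = sigma(z2), z2 = W2^T a1
    return np.sum((a2 - y) ** 2)

def gradients(W1, W2, x, y):
    z1 = W1.T @ x;  a1 = sigmoid(z1)
    z2 = W2.T @ a1; a2 = sigmoid(z2)
    # Go backward: dJ/da2 = 2(a2 - y), then multiply by sigma'(z) at each layer.
    delta2 = 2 * (a2 - y) * a2 * (1 - a2)        # dJ/dz2
    delta1 = (W2 @ delta2) * a1 * (1 - a1)       # dJ/dz1
    return np.outer(x, delta1), np.outer(a1, delta2)   # dJ/dW1, dJ/dW2

rng = np.random.default_rng(0)
x, y = rng.standard_normal(4), rng.standard_normal(2)
W1, W2 = rng.standard_normal((4, 3)), rng.standard_normal((3, 2))

gW1, gW2 = gradients(W1, W2, x, y)

# Finite-difference check on one entry of W1.
eps = 1e-6
E = np.zeros_like(W1); E[1, 2] = eps
numeric = (loss(W1 + E, W2, x, y) - loss(W1 - E, W2, x, y)) / (2 * eps)
print(np.isclose(gW1[1, 2], numeric))   # True
```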
4 Fundamental Equations for Back Propagation

BP Equation 1: the error in the output layer:

    δ_j^L = (∂C/∂a_j^L) σ'(z_j^L).        (BP-1)

First term: ∂C/∂a_j^L is the rate of change of C w.r.t. a_j^L.
Second term: σ'(z_j^L) is the rate of change of a_j^L w.r.t. z_j^L.
So it is just the chain rule.
Example: if C = (1/2) Σ_j (y_j − a_j^L)^2, then ∂C/∂a_j^L = a_j^L − y_j.
Matrix-vector form: δ^L = ∇_a C ⊙ σ'(z^L).

4 Fundamental Equations for Back Propagation

BP Equation 2: an equation for the error δ^ℓ in terms of the error in the next layer, δ^{ℓ+1}:

    δ^ℓ = ((w^{ℓ+1})^T δ^{ℓ+1}) ⊙ σ'(z^ℓ).        (BP-2)

You start with δ^{ℓ+1} and take a weighted average through w^{ℓ+1}.
(BP-1) and (BP-2) let you determine the error at any layer.

4 Fundamental Equations for Back Propagation

BP Equation 3: the rate of change of the cost with respect to any bias in the network:

    ∂C/∂b_j^ℓ = δ_j^ℓ.        (BP-3)

Good news: we already know δ_j^ℓ from Equations 1 and 2, so computing ∂C/∂b_j^ℓ is easy.

BP Equation 4: the rate of change of the cost with respect to any weight in the network:

    ∂C/∂w_jk^ℓ = a_k^{ℓ−1} δ_j^ℓ.        (BP-4)

Again, everything on the right is known, so it is easy to compute.

Back Propagation Algorithm

Below is a very concise summary of the BP algorithm (a code sketch of the full procedure appears at the end of this section):
1. Input: set the activation a^1 from the input x.
2. Feed forward: compute z^ℓ and a^ℓ for every layer.
3. Output error: compute δ^L using (BP-1).
4. Back propagate the error: compute δ^ℓ for the earlier layers using (BP-2).
5. Output: read off the gradients from (BP-3) and (BP-4).

Step 2: Feed Forward Step

Let us take a closer look at Step 2.
The feed forward step computes the intermediate variables and the activations:

    z^ℓ = (w^ℓ)^T a^{ℓ−1} + b^ℓ,
    a^ℓ = σ(z^ℓ).

Step 3: Output Error

Let us take a closer look at Step 3.
The output error is given by (BP-1):

    δ^L = ∇_a C ⊙ σ'(z^L).

Step 4: Error Back Propagation

Let us take a closer look at Step 4.
The error back propagation is given by (BP-2):

    δ^ℓ = ((w^{ℓ+1})^T δ^{ℓ+1}) ⊙ σ'(z^ℓ).
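To close the loop, here is a minimal sketch of the whole BP algorithm assembled from (BP-1) through (BP-4). The layer sizes, the quadratic cost C = Σ_j (a_j^L − y_j)^2 from the Loss slide, and the weight-storage convention (row j of each weight matrix holds the weights w_jk^ℓ into neuron j, so the forward map is z = W a + b) are assumptions made for the example, not part of the lecture.

```python
import numpy as np

# A minimal sketch of the BP algorithm: feed forward to get z^l and a^l,
# compute the output error delta^L, back propagate it, then read off the
# gradients with respect to every bias and weight via (BP-3) and (BP-4).

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

def backprop(weights, biases, x, y):
    """weights[i] has shape (fan_out, fan_in), matching a_j = sigma(sum_k w_jk a_k + b_j)."""
    # Step 2: feed forward, storing z^l and a^l for every layer.
    a, activations, zs = x, [x], []
    for W, b in zip(weights, biases):
        z = W @ a + b
        zs.append(z)
        a = sigmoid(z)
        activations.append(a)
    # Step 3: output error, delta^L = grad_a C * sigma'(z^L), with C = sum_j (a_j^L - y_j)^2.
    delta = 2 * (activations[-1] - y) * sigmoid_prime(zs[-1])      # (BP-1)
    grad_b = [None] * len(weights)
    grad_W = [None] * len(weights)
    grad_b[-1] = delta                                             # (BP-3)
    grad_W[-1] = np.outer(delta, activations[-2])                  # (BP-4)
    # Step 4: back propagate the error through the remaining layers.
    for l in range(len(weights) - 2, -1, -1):
        delta = (weights[l + 1].T @ delta) * sigmoid_prime(zs[l])  # (BP-2)
        grad_b[l] = delta                                          # (BP-3)
        grad_W[l] = np.outer(delta, activations[l])                # (BP-4)
    return grad_W, grad_b

# Usage: a small network (4 -> 5 -> 3 -> 2) on one training pair.
rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal(m) for m in sizes[1:]]
x, y = rng.standard_normal(4), rng.standard_normal(2)
grad_W, grad_b = backprop(weights, biases, x, y)
```

Storing each weight matrix with shape (neurons in layer ℓ, neurons in layer ℓ−1) keeps the code aligned with the component formula a_j^ℓ = σ(Σ_k w_jk^ℓ a_k^{ℓ−1} + b_j^ℓ); the lecture's matrix form expresses the same map with a transpose.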