10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University
Neural Networks + Backpropagation
Matt Gormley Lecture 13 Oct. 7, 2019
1 Reminders • Homework 4: Logistic Regression – Out: Wed, Sep. 25 – Due: Fri, Oct. 11 at 11:59pm • Homework 5: Neural Networks – Out: Fri, Oct. 11 – Due: Fri, Oct. 25 at 11:59pm • Today’s In-Class Poll – http://p13.mlcourse.org
2 Q&A
Q: What is mini-batch SGD?
A: A variant of SGD…
3 Mini-Batch SGD
• Gradient Descent: Compute true gradient exactly from all N examples
• Mini-Batch SGD: Approximate true gradient by the average gradient of K randomly chosen examples
• Stochastic Gradient Descent (SGD): Approximate true gradient by the gradient of one randomly chosen example
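As a sketch, here is the mini-batch update for a toy one-parameter least-squares model (the model, data, and constants are illustrative, not from the lecture):

```python
import random

random.seed(0)

def grad_example(w, x, y):
    # Gradient of the single-example squared error (w*x - y)^2 / 2 w.r.t. w
    return (w * x - y) * x

def minibatch_sgd(data, w=0.0, lr=0.1, batch_size=2, epochs=50):
    for _ in range(epochs):
        batch = random.sample(data, batch_size)   # K randomly chosen examples
        # Average gradient over the mini-batch approximates the true gradient
        g = sum(grad_example(w, x, y) for x, y in batch) / len(batch)
        w -= lr * g                               # step opposite the approximate gradient
    return w

data = [(x, 3.0 * x) for x in [1.0, 2.0, 3.0, 4.0]]   # generated with true w = 3
w = minibatch_sgd(data)
```

Setting `batch_size=1` recovers SGD, and `batch_size=len(data)` recovers full gradient descent.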
4 Mini-Batch SGD
Three variants of first-order optimization:
5 NEURAL NETWORKS
6 Neural Networks Chalkboard – Example: Neural Network w/1 Hidden Layer – Example: Neural Network w/2 Hidden Layers – Example: Feed Forward Neural Network
7 Neural Network Parameters
Question: Suppose you are training a one-hidden-layer neural network with sigmoid activations for binary classification. True or False: There is a unique set of parameters that maximize the likelihood of the dataset above.
Answer:
8 ARCHITECTURES
9 Neural Network Architectures
Even for a basic Neural Network, there are many design decisions to make:
1. # of hidden layers (depth)
2. # of units per hidden layer (width)
3. Type of activation function (nonlinearity)
4. Form of objective function
10 Building a Neural Net
Q: How many hidden units, D, should we use?
Output
Hidden Layer … D = M
Input …
14 Building a Neural Net
Q: How many hidden units, D, should we use?
Output
What method(s) is this setting similar to?
Hidden Layer … D < M
Input …
15 Building a Neural Net
Q: How many hidden units, D, should we use?
Output
Hidden Layer … D > M
What method(s) is this setting similar to?
Input …
16 Deeper Networks
Q: How many layers should we use?
Output
… Hidden Layer 1
… Input
17 Deeper Networks
Q: How many layers should we use?
Output
… Hidden Layer 2
… Hidden Layer 1
… Input
18 Deeper Networks
Q: How many layers should we use?
Output
… Hidden Layer 3
… Hidden Layer 2
… Hidden Layer 1
… Input
19 Deeper Networks
Q: How many layers should we use?
• Theoretical answer:
– A neural network with 1 hidden layer is a universal function approximator
– Cybenko (1989): For any continuous function g(x), there exists a 1-hidden-layer neural net hθ(x) s.t. |hθ(x) – g(x)| < ϵ for all x, assuming sigmoid activation functions
• Empirical answer:
– Before 2006: “Deep networks (e.g. 3 or more hidden layers) are too hard to train”
– After 2006: “Deep networks are easier to train than shallow networks (e.g. 2 or fewer layers) for many problems”
Big caveat: You need to know and use the right tricks.
20 Different Levels of Abstraction
• We don’t know the “right” levels of abstraction • So let the model figure it out!
24 Example from Honglak Lee (NIPS 2010) Different Levels of Abstraction
Face Recognition: – Deep Network can build up increasingly higher levels of abstraction – Lines, parts, regions
25 Example from Honglak Lee (NIPS 2010) Different Levels of Abstraction
Output
… Hidden Layer 3
… Hidden Layer 2
… Hidden Layer 1
… Input
26 Example from Honglak Lee (NIPS 2010) Activation Functions
Neural Network with sigmoid activation functions:
(F) Loss: J = ½ (y − y*)²
(E) Output (sigmoid): y = 1 / (1 + exp(−b))
(D) Output (linear): b = Σ_{j=0}^{D} β_j z_j
(C) Hidden (sigmoid): z_j = 1 / (1 + exp(−a_j)), ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^{M} α_{ji} x_i, ∀j
(A) Input: Given x_i, ∀i
27 Activation Functions
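The layers (A)–(F) translate directly into code; a sketch for a single input vector with made-up weights (all values here are hypothetical):

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def forward(x, alpha, beta, y_star):
    # (B) Hidden (linear): a_j = sum_i alpha[j][i] * x[i]   (x[0] = 1 is the bias)
    a = [sum(a_ji * x_i for a_ji, x_i in zip(row, x)) for row in alpha]
    # (C) Hidden (sigmoid): z_j = 1 / (1 + exp(-a_j)); prepend z_0 = 1 for the output bias
    z = [1.0] + [sigmoid(a_j) for a_j in a]
    # (D) Output (linear): b = sum_j beta[j] * z[j]
    b = sum(b_j * z_j for b_j, z_j in zip(beta, z))
    # (E) Output (sigmoid): y = 1 / (1 + exp(-b))
    y = sigmoid(b)
    # (F) Loss: J = 0.5 * (y - y*)^2
    J = 0.5 * (y - y_star) ** 2
    return y, J

x = [1.0, 2.0, -1.0]                              # x_0 = 1 handles the i = 0 bias term
alpha = [[0.1, 0.2, 0.3], [-0.3, 0.1, 0.4]]       # hypothetical hidden-layer weights
beta = [0.5, -1.0, 2.0]                           # beta_0 multiplies the fixed z_0 = 1
y, J = forward(x, alpha, beta, y_star=1.0)
```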
Neural Network with arbitrary nonlinear activation functions:
(F) Loss: J = ½ (y − y*)²
(E) Output (nonlinear): y = σ(b)
(D) Output (linear): b = Σ_{j=0}^{D} β_j z_j
(C) Hidden (nonlinear): z_j = σ(a_j), ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^{M} α_{ji} x_i, ∀j
(A) Input: Given x_i, ∀i
28 Activation Functions
Sigmoid / Logistic Function: logistic(u) ≡ 1 / (1 + e^(−u))
So far, we’ve assumed that the activation function (nonlinearity) is always the sigmoid function…
29 Activation Functions
• A new change: modifying the nonlinearity – The logistic is not widely used in modern ANNs
Alternate 1: tanh
Like logistic function but shifted to range [-1, +1]
Slide from William Cohen
Figure from Glorot & Bengio (AISTATS 2010): sigmoid vs. tanh, depth 4
Activation Functions
• A new change: modifying the nonlinearity – ReLU is often used in vision tasks
Alternate 2: rectified linear unit
Linear with a cutoff at zero
(Implementation: clip the gradient when you pass zero)
Slide from William Cohen Activation Functions
• A new change: modifying the nonlinearity – ReLU is often used in vision tasks
Alternate 2: rectified linear unit
Soft version: log(exp(x)+1)
• Doesn’t saturate (at one end)
• Sparsifies outputs
• Helps with vanishing gradient
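The activation functions discussed in this section can be sketched side by side (the helper names are ours):

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))    # range (0, 1); saturates at both ends

def tanh(u):
    return math.tanh(u)                  # like the sigmoid but shifted to range (-1, +1)

def relu(u):
    return max(0.0, u)                   # linear with a cutoff at zero

def softplus(u):
    return math.log(math.exp(u) + 1.0)   # soft version of the ReLU: log(exp(x) + 1)

# ReLU sparsifies: negative pre-activations map exactly to zero
assert relu(-2.0) == 0.0 and relu(3.0) == 3.0
# tanh is a rescaled sigmoid: tanh(u) = 2*sigmoid(2u) - 1
assert abs(tanh(0.7) - (2 * sigmoid(1.4) - 1)) < 1e-12
```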
Slide from William Cohen
Neural Network Decision Functions
Neural Network for Classification:
(F) Loss: J = ½ (y − y^(d))²
(E) Output (sigmoid): y = 1 / (1 + exp(−b))
(D) Output (linear): b = Σ_{j=0}^{D} β_j z_j
(C) Hidden (sigmoid): z_j = 1 / (1 + exp(−a_j)), ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^{M} α_{ji} x_i, ∀j
(A) Input: Given x_i, ∀i
34 Neural Network Decision Functions
Neural Network for Regression:
(F) Loss: J = ½ (y − y^(d))²
(E)/(D) Output (linear): y = b = Σ_{j=0}^{D} β_j z_j (for regression, the linear output is used directly; there is no output sigmoid)
(C) Hidden (sigmoid): z_j = 1 / (1 + exp(−a_j)), ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^{M} α_{ji} x_i, ∀j
(A) Input: Given x_i, ∀i
35 Objective Functions for NNs
1. Quadratic Loss:
– the same objective as Linear Regression
– i.e. mean squared error
2. Cross-Entropy:
– the same objective as Logistic Regression
– i.e. negative log likelihood
– This requires probabilities, so we add an additional “softmax” layer at the end of our network
Forward and Backward computations for each loss:
Quadratic: J = ½ (y − y*)²; dJ/dy = y − y*
Cross Entropy: J = y* log(y) + (1 − y*) log(1 − y); dJ/dy = y*/y + (1 − y*)/(y − 1)
36 Objective Functions for NNs
Cross-entropy vs. Quadratic loss
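A quick numerical check of the loss derivatives, using the slide's sign convention for the cross-entropy (the helper names are ours):

```python
import math

def quadratic(y, y_star):
    return 0.5 * (y - y_star) ** 2

def d_quadratic(y, y_star):
    return y - y_star

def cross_entropy(y, y_star):
    # Sign convention follows the slide: this is a log-likelihood, not its negation
    return y_star * math.log(y) + (1 - y_star) * math.log(1 - y)

def d_cross_entropy(y, y_star):
    return y_star / y + (1 - y_star) / (y - 1)

def numeric_grad(f, y, eps=1e-6):
    # Central difference approximation of df/dy
    return (f(y + eps) - f(y - eps)) / (2 * eps)

y, y_star = 0.7, 1.0
assert abs(d_quadratic(y, y_star) - numeric_grad(lambda v: quadratic(v, y_star), y)) < 1e-6
assert abs(d_cross_entropy(y, y_star) - numeric_grad(lambda v: cross_entropy(v, y_star), y)) < 1e-5
```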
Figure from Glorot & Bengio (2010)
Multi-Class Output
Output …
Hidden Layer …
Input …
38 Multi-Class Output
(F) Loss (cross-entropy): J = −Σ_{k=1}^{K} y*_k log(y_k)
(E) Output (softmax): y_k = exp(b_k) / Σ_{l=1}^{K} exp(b_l)
(D) Output (linear): b_k = Σ_{j=0}^{D} β_{kj} z_j, ∀k
(C) Hidden (nonlinear): z_j = σ(a_j), ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^{M} α_{ji} x_i, ∀j
(A) Input: Given x_i, ∀i
39 Neural Network Errors
Question A: On which of the datasets below could a one-hidden-layer neural network achieve zero classification error? Select all that apply.
Question B: On which of the datasets below could a one-hidden-layer neural network for regression achieve nearly zero MSE? Select all that apply.
(Choices A–D for each question: four dataset scatter plots.)
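The softmax output layer and its loss can be sketched as follows (subtracting the max before exponentiating is a standard stability trick, not shown on the slide):

```python
import math

def softmax(b):
    # y_k = exp(b_k) / sum_l exp(b_l); subtracting max(b) avoids overflow
    m = max(b)
    exps = [math.exp(b_k - m) for b_k in b]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(y_star, y):
    # J = -sum_k y*_k log(y_k), with y* a one-hot label vector
    return -sum(t * math.log(p) for t, p in zip(y_star, y) if t > 0)

y = softmax([2.0, 1.0, -1.0])          # hypothetical linear outputs b_k
J = cross_entropy([1.0, 0.0, 0.0], y)  # true class is k = 0
```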
40 DECISION BOUNDARY EXAMPLES
41 Decision boundary examples:
Example #1: Diagonal Band
Example #2: One Pocket
Example #3: Four Gaussians
Example #4: Two Pockets
42–48 Example #1: Diagonal Band (a sequence of figures showing the learned decision boundary and the individual hidden-unit activations)
49–56 Example #2: One Pocket (a sequence of figures showing the learned decision boundary and the individual hidden-unit activations)
57–63 Example #3: Four Gaussians (a sequence of figures showing the learned decision boundary and the individual hidden-unit activations)
64–72 Example #4: Two Pockets (a sequence of figures showing the learned decision boundary and the individual hidden-unit activations)
73 Neural Networks Objectives
You should be able to…
• Explain the biological motivations for a neural network
• Combine simpler models (e.g. linear regression, binary logistic regression, multinomial logistic regression) as components to build up feed-forward neural network architectures
• Explain the reasons why a neural network can model nonlinear decision boundaries for classification
• Compare and contrast feature engineering with learning features
• Identify (some of) the options available when designing the architecture of a neural network
• Implement a feed-forward neural network
74 Computing Gradients DIFFERENTIATION
75 A Recipe for Background Machine Learning
1. Given training data: 3. Define goal:
2. Choose each of these: – Decision function 4. Train with SGD: (take small steps opposite the gradient) – Loss function
76 Approaches to Training Differentiation
• Question 1: When can we compute the gradients for an arbitrary neural network?
• Question 2: When can we make the gradient computation efficient?
77 Training: Approaches to Differentiation
1. Finite Difference Method
– Pro: Great for testing implementations of backpropagation
– Con: Slow for high dimensional inputs / outputs
– Required: Ability to call the function f(x) on any input x
2. Symbolic Differentiation
– Note: The method you learned in high school
– Note: Used by Mathematica / Wolfram Alpha / Maple
– Pro: Yields easily interpretable derivatives
– Con: Leads to exponential computation time if not carefully implemented
– Required: Mathematical expression that defines f(x)
3. Automatic Differentiation – Reverse Mode
– Note: Called Backpropagation when applied to Neural Nets
– Pro: Computes partial derivatives of one output f(x)_i with respect to all inputs x_j in time proportional to computation of f(x)
– Con: Slow for high dimensional outputs (e.g. vector-valued functions)
– Required: Algorithm for computing f(x)
4. Automatic Differentiation – Forward Mode
– Note: Easy to implement. Uses dual numbers.
– Pro: Computes partial derivatives of all outputs f(x)_i with respect to one input x_j in time proportional to computation of f(x)
– Con: Slow for high dimensional inputs (e.g. vector-valued x)
– Required: Algorithm for computing f(x)
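Approach 1 above is easy to sketch as a gradient checker using central differences (the example function is our own):

```python
def finite_difference_grad(f, x, eps=1e-5):
    # Approximate df/dx_i by (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps)
    grad = []
    for i in range(len(x)):
        x_plus = list(x); x_plus[i] += eps
        x_minus = list(x); x_minus[i] -= eps
        grad.append((f(x_plus) - f(x_minus)) / (2 * eps))
    return grad

# Example: f(x) = x_0^2 * x_1, so df/dx_0 = 2*x_0*x_1 and df/dx_1 = x_0^2
f = lambda x: x[0] ** 2 * x[1]
g = finite_difference_grad(f, [3.0, 2.0])   # analytic gradient is [12.0, 9.0]
```

Note that this requires one pair of function calls per input dimension, which is why it is slow for high-dimensional inputs.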
78 Training Finite Difference Method
Notes:
• Suffers from issues of floating point precision, in practice
• Typically only appropriate to use on small examples with an appropriately chosen epsilon
79 Training: Symbolic Differentiation
Speed Quiz (2 minute time limit). Differentiation Quiz #1: Suppose x = 2 and z = 3; what are dy/dx and dy/dz for the function below? Round your answer to the nearest integer.
Answer: Answers below are in the form [dy/dx, dy/dz] A. [42, -72] E. [1208, 810] B. [72, -42] F. [810, 1208] C. [100, 127] G. [1505, 94] D. [127, 100] H. [94, 1505] 80 Training Symbolic Differentiation
Differentiation Quiz #2:
…
…
…
82 CHAIN RULE
83 Training Chain Rule
Chalkboard – Chain Rule of Calculus
2.2. NEURAL NETWORKS AND BACKPROPAGATION

…x to J, but also a manner of carrying out that computation in terms of the intermediate quantities a, z, b, y. Which intermediate quantities to use is a design decision. In this way, the arithmetic circuit diagram of Figure 2.1 is differentiated from the standard neural network diagram in two ways. A standard diagram for a neural network does not show this choice of intermediate quantities nor the form of the computations.

The topologies presented in this section are very simple. However, we will see later (Chapter 5) how an entire algorithm can define an arithmetic circuit.

2.2.2 Backpropagation

The backpropagation algorithm (Rumelhart et al., 1986) is a general method for computing the gradient of a neural network. Here we generalize the concept of a neural network to include any arithmetic circuit. Applying the backpropagation algorithm on these circuits amounts to repeated application of the chain rule. This general algorithm goes under many other names: automatic differentiation (AD) in the reverse mode (Griewank and Corliss, 1991), analytic differentiation, module-based AD, autodiff, etc. Below we define a forward pass, which computes the output bottom-up, and a backward pass, which computes the derivatives of all intermediate quantities top-down.

Chain Rule: At the core of the backpropagation algorithm is the chain rule. The chain rule allows us to differentiate a function f defined as the composition of two functions g and h such that f = (g ∘ h). If the inputs and outputs of g and h are vector-valued variables then f is as well: h: R^K → R^J and g: R^J → R^I ⇒ f: R^K → R^I. Given an input vector x = {x_1, x_2, …, x_K}, we compute the output y = {y_1, y_2, …, y_I} in terms of an intermediate vector u = {u_1, u_2, …, u_J}. That is, the computation y = f(x) = g(h(x)) can be described in a feed-forward manner: y = g(u) and u = h(x). Then the chain rule must sum over all the intermediate quantities:

dy_i/dx_k = Σ_{j=1}^{J} (dy_i/du_j)(du_j/dx_k), ∀i, k   (2.3)

If the inputs and outputs of f, g, and h are all scalars, then we obtain the familiar form of the chain rule:

dy/dx = (dy/du)(du/dx)   (2.4)

85 Training: Chain Rule
Given: y = f(x) = g(h(x)), with u = h(x) and y = g(u)
Chain Rule: dy_i/dx_k = Σ_j (dy_i/du_j)(du_j/dx_k)

Binary Logistic Regression: Binary logistic regression can be interpreted as an arithmetic circuit. To compute the derivative of some loss function (below we use regression) with respect to the model parameters θ, we can repeatedly apply the chain rule (i.e. backpropagation). Note that the output q below is the probability that the output label takes on the value 1. y* is the true output label. The forward pass computes the following:

J = y* log q + (1 − y*) log(1 − q)   (2.5)
where q = P_θ(Y_i = 1 | x) = 1 / (1 + exp(−Σ_{j=0}^{D} θ_j x_j))   (2.6)
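The forward pass in Eqs. (2.5)–(2.6) and the chain-rule backward pass can be sketched as follows (parameter values are hypothetical):

```python
import math

def forward(theta, x, y_star):
    s = sum(t * xi for t, xi in zip(theta, x))     # the linear score sum_j theta_j x_j
    q = 1.0 / (1.0 + math.exp(-s))                 # Eq. (2.6)
    J = y_star * math.log(q) + (1 - y_star) * math.log(1 - q)   # Eq. (2.5)
    return q, J

def backward(theta, x, y_star):
    q, _ = forward(theta, x, y_star)
    dJ_dq = y_star / q - (1 - y_star) / (1 - q)    # differentiate Eq. (2.5) w.r.t. q
    dq_ds = q * (1 - q)                            # derivative of the sigmoid w.r.t. its input
    return [dJ_dq * dq_ds * xi for xi in x]        # chain rule: dJ/dtheta_j

theta = [0.5, -0.25, 1.0]   # hypothetical parameters; theta_0 is the bias
x = [1.0, 2.0, -1.0]        # x_0 = 1 for the bias term
grad = backward(theta, x, y_star=1.0)
# the product of local derivatives simplifies to (y* - q) * x_j
```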
86 Training: Chain Rule
Backpropagation is just repeated application of the chain rule from Calculus 101.

Intuitions
BACKPROPAGATION
87 Error Back-Propagation
88–96 (A sequence of slides from (Stoyanov & Eisner, 2012) illustrating error back-propagation through a model with parameters ϴ, intermediate quantities z, prediction p(y|x(i)), and true label y(i).)
97 Slide from (Stoyanov & Eisner, 2012)
Algorithm
BACKPROPAGATION
98 Training Backpropagation
Chalkboard – Example: Backpropagation for Chain Rule #1
Differentiation Quiz #1: Suppose x = 2 and z = 3, what are dy/dx and dy/dz for the function below? Round your answer to the nearest integer.
99 Training Backpropagation
Chalkboard – SGD for Neural Network – Example: Backpropagation for Neural Network
100 Training Backpropagation
Automatic Differentiation – Reverse Mode (aka. Backpropagation)
Forward Computation
1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the “computation graph”)
2. Visit each node in topological order. For variable u_i with inputs v_1, …, v_N:
a. Compute u_i = g_i(v_1, …, v_N)
b. Store the result at the node
Backward Computation
1. Initialize all partial derivatives dy/du_j to 0 and dy/dy = 1.
2. Visit each node in reverse topological order. For variable u_i = g_i(v_1, …, v_N):
a. We already know dy/du_i
b. Increment dy/dv_j by (dy/du_i)(du_i/dv_j)
(Choice of algorithm ensures computing du_i/dv_j is easy)
Return partial derivatives dy/du_i for all variables.
101
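The forward and backward computations above can be implemented with a tiny computation-graph class; a sketch (the Node class and the example function y = x·z + sin(x·z) are ours, not from the slides):

```python
import math

class Node:
    """One variable u_i in the computation graph."""
    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents   # list of (input_node, local derivative du_i/dv_j)
        self.grad = 0.0          # dy/du_i, initialized to 0

def add(a, b):
    return Node(a.value + b.value, [(a, 1.0), (b, 1.0)])

def mul(a, b):
    return Node(a.value * b.value, [(a, b.value), (b, a.value)])

def sin(a):
    return Node(math.sin(a.value), [(a, math.cos(a.value))])

def backward(y):
    y.grad = 1.0                 # dy/dy = 1
    order, seen = [], set()
    def topo(n):                 # topological order of the DAG
        if id(n) in seen:
            return
        seen.add(id(n))
        for p, _ in n.parents:
            topo(p)
        order.append(n)
    topo(y)
    for n in reversed(order):    # visit nodes in reverse topological order
        for parent, local in n.parents:
            parent.grad += n.grad * local   # increment dy/dv_j by (dy/du_i)(du_i/dv_j)

# y = x*z + sin(x*z): dy/dx = z*(1 + cos(x*z)), dy/dz = x*(1 + cos(x*z))
x, z = Node(2.0), Node(3.0)
t = mul(x, z)                    # the shared intermediate quantity u = x*z
y = add(t, sin(t))
backward(y)
```

Because `t` feeds into two downstream nodes, its `grad` accumulates two contributions before being propagated to `x` and `z`, which is exactly the "increment" step in the algorithm above.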