
10-601 Introduction to Machine Learning
Machine Learning Department, School of Computer Science, Carnegie Mellon University

Neural Networks and Backpropagation

Neural Net Readings: Murphy --, Bishop 5, HTF 11, Mitchell 4

Matt Gormley
Lecture 20
April 3, 2017

Reminders
• Homework 6:
  – Release: Wed, Mar. 22
  – Due: Mon, Apr. 03 at 11:59pm
• Homework 5 (Part II): Peer Review
  – Release: Wed, Mar. 29
  – Due: Wed, Apr. 05 at 11:59pm
  – Expectation: You should spend at most 1 hour on your reviews
• Peer Tutoring

Neural Networks Outline

Last Lecture (Recap):
• Neural Networks
  – …, Model, Learning
  – A Recipe for Machine Learning
  – Visual Notation for Neural Networks
  – Example: Logistic Regression Output Surface
  – 2-Layer Neural Network
  – 3-Layer Neural Network

This Lecture:
• Neural Net Architectures
  – Objective Functions
  – Activation Functions
• Backpropagation
  – Basic Chain Rule (of calculus)
  – Chain Rule for Arbitrary Computation Graph
  – Backpropagation
  – Module-based Automatic Differentiation (Autodiff)

DECISION BOUNDARY EXAMPLES

• Example #1: Diagonal Band
• Example #2: One Pocket
• Example #3: Four Gaussians
• Example #4: Two Pockets

[Each example was shown as a series of decision-boundary plots for neural networks with an increasing number of hidden units; the plots are not reproduced here.]

Error in slides: "layers" should read "number of hidden units".

All the neural networks in this section used 1 hidden layer.

ARCHITECTURES

Neural Network Architectures

Even for a basic Neural Network, there are many design decisions to make:
1. # of hidden layers (depth)
2. # of units per hidden layer (width)
3. Type of activation function (nonlinearity)
4. Form of objective function

Activation Functions

Neural Network with sigmoid activation functions:

[Diagram: input layer → hidden layer → output]

(F) Loss: J = ½ (y − y*)²
(E) Output (sigmoid): y = 1 / (1 + exp(−b))
(D) Output (linear): b = Σ_{j=0}^{D} β_j z_j
(C) Hidden (sigmoid): z_j = 1 / (1 + exp(−a_j)),  ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^{M} α_{ji} x_i,  ∀j
(A) Input: Given x_i,  ∀i
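To make the quantities (A)-(E) above concrete, here is a minimal NumPy sketch of the forward pass only. The variable names a, z, b, y mirror the slide; the weights alpha and beta, the sizes, and the example input are made-up values, and bias terms are omitted for brevity.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, alpha, beta):
    """Forward pass for the 2-layer sigmoid network sketched above."""
    a = alpha @ x      # (B) hidden linear:  a_j = sum_i alpha_ji * x_i
    z = sigmoid(a)     # (C) hidden sigmoid: z_j = 1 / (1 + exp(-a_j))
    b = beta @ z       # (D) output linear:  b   = sum_j beta_j * z_j
    y = sigmoid(b)     # (E) output sigmoid: y   = 1 / (1 + exp(-b))
    return y

# Example with made-up sizes and values
rng = np.random.default_rng(0)
x = rng.normal(size=3)
alpha = rng.normal(size=(4, 3))
beta = rng.normal(size=4)
print(forward(x, alpha, beta))
```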

Activation Functions

Neural Network with arbitrary nonlinear activation functions σ(·):

(F) Loss: J = ½ (y − y*)²
(E) Output (nonlinear): y = σ(b)
(D) Output (linear): b = Σ_{j=0}^{D} β_j z_j
(C) Hidden (nonlinear): z_j = σ(a_j),  ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^{M} α_{ji} x_i,  ∀j
(A) Input: Given x_i,  ∀i

Activation Functions

Sigmoid / logistic function: logistic(u) ≡ 1 / (1 + e^(−u))

So far, we've assumed that the activation function (nonlinearity) is always the sigmoid.

Activation Functions

• A new change: modifying the nonlinearity – The logistic is not widely used in modern ANNs

Alternate 1: tanh

Like logistic function but shifted to range [-1, +1]

Slide from William Cohen

[Figure: sigmoid vs. tanh, depth-4 networks, AISTATS 2010]

Figure from Glorot & Bengio (2010)

Activation Functions

• A new change: modifying the nonlinearity
  – ReLU often used in vision tasks

Alternate 2: rectified linear unit

Linear with a cutoff at zero

(Implementation: clip the gradient when you pass zero)

Slide from William Cohen

Activation Functions

• A new change: modifying the nonlinearity
  – ReLU often used in vision tasks

Alternate 2: rectified linear unit

Soft version: log(exp(x)+1)

• Doesn't saturate (at one end)
• Sparsifies outputs
• Helps with vanishing gradient

Slide from William Cohen
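As a small illustrative aside (not from the slides), the activation functions discussed above - logistic, tanh, ReLU, and its soft version (softplus) - are each a one-liner in NumPy:

```python
import numpy as np

def logistic(u):      # sigmoid: output in (0, 1), saturates at both ends
    return 1.0 / (1.0 + np.exp(-u))

def tanh(u):          # like the logistic function but shifted to the range (-1, +1)
    return np.tanh(u)

def relu(u):          # rectified linear unit: linear with a cutoff at zero
    return np.maximum(0.0, u)

def softplus(u):      # "soft" version of ReLU: log(exp(u) + 1)
    return np.log1p(np.exp(u))

u = np.linspace(-3, 3, 7)
for f in (logistic, tanh, relu, softplus):
    print(f.__name__, np.round(f(u), 3))
```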

Objective Functions for NNs
• Regression:
  – Use the same objective as Linear Regression
  – Quadratic loss (i.e. mean squared error)
• Classification:
  – Use the same objective as Logistic Regression
  – Cross-entropy (i.e. negative log likelihood)
  – This requires probabilities, so we add an additional "softmax" layer at the end of our network

Forward and backward computation for these two losses:

Quadratic:
  Forward:  J = ½ (y − y*)²
  Backward: dJ/dy = y − y*

Cross-entropy:
  Forward:  J = y* log y + (1 − y*) log(1 − y)
  Backward: dJ/dy = y*/y + (1 − y*)/(y − 1)
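A minimal sketch of the forward and backward computations in the table above, added for illustration; y_star denotes the true label y*:

```python
import numpy as np

def quadratic_forward(y, y_star):
    return 0.5 * (y - y_star) ** 2

def quadratic_backward(y, y_star):       # dJ/dy for the quadratic loss
    return y - y_star

def cross_entropy_forward(y, y_star):    # as written on the slide (no leading minus)
    return y_star * np.log(y) + (1 - y_star) * np.log(1 - y)

def cross_entropy_backward(y, y_star):   # dJ/dy for the cross-entropy objective
    return y_star / y + (1 - y_star) / (y - 1)

y, y_star = 0.8, 1.0
print(quadratic_forward(y, y_star), quadratic_backward(y, y_star))
print(cross_entropy_forward(y, y_star), cross_entropy_backward(y, y_star))
```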

Cross-entropy vs. Quadratic loss

Figure from Glorot & Bengio (2010)

Background: A Recipe for Machine Learning

1. Given training data:
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal:
4. Train with SGD:
   (take small steps opposite the gradient)

Objective Functions

Matching Quiz: Suppose you are given a neural net with a single output, y, and one hidden layer.

1) Minimizing sum of squared errors…
2) Minimizing sum of squared errors plus squared Euclidean norm of weights…
3) Minimizing cross-entropy…
4) Minimizing hinge loss…

…gives…

5) …MLE estimates of weights assuming target follows a Bernoulli with parameter given by the output value
6) …MAP estimates of weights assuming weight priors are zero mean Gaussian
7) …estimates with a large margin on the training data
8) …MLE estimates of weights assuming zero mean Gaussian noise on the output value

A. 1=5, 2=7, 3=6, 4=8
B. 1=5, 2=7, 3=8, 4=6
C. 1=7, 2=5, 3=5, 4=7
D. 1=7, 2=5, 3=6, 4=8
E. 1=8, 2=6, 3=5, 4=7
F. 1=8, 2=6, 3=8, 4=6

BACKPROPAGATION

Background: A Recipe for Machine Learning

1. Given training data:
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal:
4. Train with SGD:
   (take small steps opposite the gradient)

Training: Approaches to Differentiation

• Question 1: When can we compute the gradients of the parameters of an arbitrary neural network?

• Question 2: When can we make the gradient computation efficient?

Training: Approaches to Differentiation

1. Finite Difference Method
   – Pro: Great for testing implementations of backpropagation
   – Con: Slow for high dimensional inputs / outputs
   – Required: Ability to call the function f(x) on any input x
2. Symbolic Differentiation
   – Note: The method you learned in high-school
   – Note: Used by Mathematica / Wolfram Alpha / Maple
   – Pro: Yields easily interpretable derivatives
   – Con: Leads to exponential computation time if not carefully implemented
   – Required: Mathematical expression that defines f(x)
3. Automatic Differentiation - Reverse Mode
   – Note: Called Backpropagation when applied to Neural Nets
   – Pro: Computes partial derivatives of one output f(x)_i with respect to all inputs x_j in time proportional to computation of f(x)
   – Con: Slow for high dimensional outputs (e.g. vector-valued functions)
   – Required: Algorithm for computing f(x)
4. Automatic Differentiation - Forward Mode
   – Note: Easy to implement. Uses dual numbers.
   – Pro: Computes partial derivatives of all outputs f(x)_i with respect to one input x_j in time proportional to computation of f(x)
   – Con: Slow for high dimensional inputs (e.g. vector-valued x)
   – Required: Algorithm for computing f(x)

Training: Finite Difference Method

Notes:
• Suffers from issues of floating point precision, in practice
• Typically only appropriate to use on small examples with an appropriately chosen epsilon
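As a sketch of how the finite difference method is typically used to test a backpropagation implementation, here is a centered-difference gradient check; the function name, the epsilon value, and the test function are illustrative choices, not from the lecture:

```python
import numpy as np

def finite_difference_grad(f, theta, epsilon=1e-5):
    """Approximate dJ/dtheta_i one coordinate at a time:
    (f(theta + eps*e_i) - f(theta - eps*e_i)) / (2*eps)."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = epsilon
        grad[i] = (f(theta + e) - f(theta - e)) / (2 * epsilon)
    return grad

# Example: check against a known gradient for f(theta) = sum(theta**2)
f = lambda th: np.sum(th ** 2)
theta = np.array([1.0, -2.0, 0.5])
print(finite_difference_grad(f, theta))   # approx [2, -4, 1]
print(2 * theta)                          # exact gradient, for comparison
```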

Training: Symbolic Differentiation

Calculus Quiz #1: Suppose x = 2 and z = 3, what are dy/dx and dy/dz for the function below?

Training: Symbolic Differentiation

Calculus Quiz #2:
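The quiz expressions themselves are not reproduced above; purely as an illustration of symbolic differentiation (the approach used by Mathematica / Wolfram Alpha / Maple), here is a short sympy sketch on a made-up function of x and z:

```python
import sympy as sp

x, z = sp.symbols('x z')
y = sp.exp(x * z) + x * sp.sin(z)   # made-up example function, not the quiz

dy_dx = sp.diff(y, x)               # symbolic partial derivative w.r.t. x
dy_dz = sp.diff(y, z)               # symbolic partial derivative w.r.t. z
print(dy_dx)                        # z*exp(x*z) + sin(z)
print(dy_dz)                        # x*exp(x*z) + x*cos(z)

# Evaluate at a quiz-style point x = 2, z = 3
print(dy_dx.subs({x: 2, z: 3}).evalf(), dy_dz.subs({x: 2, z: 3}).evalf())
```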

Training: Chain Rule

Whiteboard – Chain Rule of Calculus

Training: Chain Rule

2.2 NEURAL NETWORKS AND BACKPROPAGATION

…x to J, but also a manner of carrying out that computation in terms of the intermediate quantities a, z, b, y. Which intermediate quantities to use is a design decision. In this way, the arithmetic circuit diagram of Figure 2.1 is differentiated from the standard neural network diagram in two ways. A standard diagram for a neural network does not show this choice of intermediate quantities nor the form of the computations. The topologies presented in this section are very simple. However, we will see later (Chapter 5) how an entire algorithm can define an arithmetic circuit.

2.2.2 Backpropagation

The backpropagation algorithm (Rumelhart et al., 1986) is a general method for computing the gradient of a neural network. Here we generalize the concept of a neural network to include any arithmetic circuit. Applying the backpropagation algorithm on these circuits amounts to repeated application of the chain rule. This general algorithm goes under many other names: automatic differentiation (AD) in the reverse mode (Griewank and Corliss, 1991), analytic differentiation, module-based AD, autodiff, etc. Below we define a forward pass, which computes the output bottom-up, and a backward pass, which computes the derivatives of all intermediate quantities top-down.

Chain Rule  At the core of the backpropagation algorithm is the chain rule. The chain rule allows us to differentiate a function f defined as the composition of two functions g and h such that f = (g ∘ h). If the inputs and outputs of g and h are vector-valued variables then f is as well: h : R^K → R^J and g : R^J → R^I  ⇒  f : R^K → R^I. Given an input vector x = {x_1, x_2, …, x_K}, we compute the output y = {y_1, y_2, …, y_I} in terms of an intermediate vector u = {u_1, u_2, …, u_J}. That is, the computation y = f(x) = g(h(x)) can be described in a feed-forward manner: y = g(u) and u = h(x). Then the chain rule must sum over all the intermediate quantities:

  dy_i/dx_k = Σ_{j=1}^{J} (dy_i/du_j)(du_j/dx_k),  ∀i, k    (2.3)

If the inputs and outputs of f, g, and h are all scalars, then we obtain the familiar form of the chain rule:

  dy/dx = (dy/du)(du/dx)    (2.4)

Backpropagation is just repeated application of the chain rule from Calculus 101.

Binary Logistic Regression  Binary logistic regression can be interpreted as an arithmetic circuit. To compute the derivative of some loss function (below we use regression) with respect to the model parameters θ, we can repeatedly apply the chain rule (i.e. backpropagation). Note that the output q below is the probability that the output label takes on the value 1. y* is the true output label. The forward pass computes the following:

  J = y* log q + (1 − y*) log(1 − q)    (2.5)

where

  q = P_θ(Y_i = 1 | x) = 1 / (1 + exp(−Σ_{j=0}^{D} θ_j x_j))    (2.6)

Training: Backpropagation

Whiteboard – Example: Backpropagation for Calculus Quiz #1

Calculus Quiz #1: Suppose x = 2 and z = 3, what are dy/dx and dy/dz for the function below?

Training: Backpropagation

Automatic Differentiation - Reverse Mode (aka. Backpropagation)

Forward Computation
1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the "computation graph").
2. Visit each node in topological order. For variable u_i with inputs v_1, …, v_N:
   a. Compute u_i = g_i(v_1, …, v_N)
   b. Store the result at the node

Backward Computation
1. Initialize all partial derivatives dy/du_j to 0 and dy/dy = 1.
2. Visit each node in reverse topological order. For variable u_i = g_i(v_1, …, v_N):
   a. We already know dy/du_i
   b. Increment dy/dv_j by (dy/du_i)(du_i/dv_j)
      (Choice of algorithm ensures computing (du_i/dv_j) is easy)

Return partial derivatives dy/du_i for all variables.

Training: Backpropagation

Simple Example: The goal is to compute J = cos(sin(x²) + 3x²) on the forward pass and the derivative dJ/dx on the backward pass.

Forward: J = cos(u)
Backward: dJ/du = −sin(u)

Forward: u = u1 + u2
Backward: dJ/du1 = (dJ/du)(du/du1), where du/du1 = 1
          dJ/du2 = (dJ/du)(du/du2), where du/du2 = 1

Forward: u1 = sin(t)
Backward: dJ/dt = (dJ/du1)(du1/dt), where du1/dt = cos(t)

Forward: u2 = 3t
Backward: dJ/dt = (dJ/du2)(du2/dt), where du2/dt = 3
          (the two contributions to dJ/dt are summed)

Forward: t = x²
Backward: dJ/dx = (dJ/dt)(dt/dx), where dt/dx = 2x
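The same computation written as code: a forward pass that stores the intermediate quantities t, u1, u2, u, and a backward pass that applies the chain rule row by row, summing the two contributions to dJ/dt. This is an added sketch (the input value is made up), with a finite-difference check at the end:

```python
import numpy as np

def forward_backward(x):
    # Forward pass: compute and store the intermediate quantities
    t  = x ** 2                 # t  = x^2
    u1 = np.sin(t)              # u1 = sin(t)
    u2 = 3 * t                  # u2 = 3t
    u  = u1 + u2                # u  = u1 + u2
    J  = np.cos(u)              # J  = cos(u)

    # Backward pass: apply the chain rule in reverse topological order
    dJ_du  = -np.sin(u)                  # dJ/du
    dJ_du1 = dJ_du * 1.0                 # du/du1 = 1
    dJ_du2 = dJ_du * 1.0                 # du/du2 = 1
    dJ_dt  = dJ_du1 * np.cos(t) + dJ_du2 * 3.0   # contributions from u1 and u2 summed
    dJ_dx  = dJ_dt * 2 * x               # dt/dx = 2x
    return J, dJ_dx

x = 1.5                                  # made-up input value
J, dJ_dx = forward_backward(x)
print(J, dJ_dx)

# Sanity check against a finite difference
eps = 1e-6
f = lambda x: np.cos(np.sin(x**2) + 3*x**2)
print((f(x + eps) - f(x - eps)) / (2 * eps))
```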


Training: Backpropagation

Whiteboard – SGD for Neural Network – Example: Backpropagation for Neural Network

Training: Backpropagation

Case 1: Logistic Regression

[Diagram: inputs x_1, …, x_M feed a single output unit y through parameters θ_1, …, θ_M.]

Forward: J = y* log y + (1 − y*) log(1 − y)
Backward: dJ/dy = y*/y + (1 − y*)/(y − 1)

Forward: y = 1 / (1 + exp(−a))
Backward: dJ/da = (dJ/dy)(dy/da), where dy/da = exp(−a) / (exp(−a) + 1)²

Forward: a = Σ_{j=0}^{D} θ_j x_j
Backward: dJ/dθ_j = (dJ/da)(da/dθ_j), where da/dθ_j = x_j
          dJ/dx_j = (dJ/da)(da/dx_j), where da/dx_j = θ_j
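A minimal NumPy sketch of this Case 1 forward/backward pass; the variable names follow the slide, and the example values of x, theta, and y_star are made up:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logreg_forward_backward(x, theta, y_star):
    # Forward pass
    a = theta @ x                                   # a = sum_j theta_j * x_j
    y = sigmoid(a)                                  # y = 1 / (1 + exp(-a))
    J = y_star * np.log(y) + (1 - y_star) * np.log(1 - y)

    # Backward pass (chain rule, top to bottom)
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)
    dy_da = np.exp(-a) / (np.exp(-a) + 1) ** 2      # equivalently y * (1 - y)
    dJ_da = dJ_dy * dy_da
    dJ_dtheta = dJ_da * x                           # da/dtheta_j = x_j
    dJ_dx = dJ_da * theta                           # da/dx_j = theta_j
    return J, dJ_dtheta, dJ_dx

x = np.array([1.0, 0.5, -1.2])        # made-up example
theta = np.array([0.2, -0.4, 0.1])
print(logreg_forward_backward(x, theta, y_star=1.0))
```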

Training: Backpropagation

(F) Loss: J = ½ (y − y^(d))²
(E) Output (sigmoid): y = 1 / (1 + exp(−b))
(D) Output (linear): b = Σ_{j=0}^{D} β_j z_j
(C) Hidden (sigmoid): z_j = 1 / (1 + exp(−a_j)),  ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^{M} α_{ji} x_i,  ∀j
(A) Input: Given x_i,  ∀i


Training: Backpropagation

Case 2: Neural Network

Forward: J = y* log y + (1 − y*) log(1 − y)
Backward: dJ/dy = y*/y + (1 − y*)/(y − 1)

Forward: y = 1 / (1 + exp(−b))
Backward: dJ/db = (dJ/dy)(dy/db), where dy/db = exp(−b) / (exp(−b) + 1)²

Forward: b = Σ_{j=0}^{D} β_j z_j
Backward: dJ/dβ_j = (dJ/db)(db/dβ_j), where db/dβ_j = z_j
          dJ/dz_j = (dJ/db)(db/dz_j), where db/dz_j = β_j

Forward: z_j = 1 / (1 + exp(−a_j))
Backward: dJ/da_j = (dJ/dz_j)(dz_j/da_j), where dz_j/da_j = exp(−a_j) / (exp(−a_j) + 1)²

Forward: a_j = Σ_{i=0}^{M} α_{ji} x_i
Backward: dJ/dα_{ji} = (dJ/da_j)(da_j/dα_{ji}), where da_j/dα_{ji} = x_i
          dJ/dx_i = Σ_{j=0}^{D} (dJ/da_j)(da_j/dx_i), where da_j/dx_i = α_{ji}
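Putting the Case 2 rows together, here is a hedged NumPy sketch of one full forward and backward pass through the 2-layer network; alpha, beta, the sizes, and the example input are illustrative, and bias terms are omitted:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def nn_forward_backward(x, alpha, beta, y_star):
    # Forward pass
    a = alpha @ x                 # a_j = sum_i alpha_ji * x_i
    z = sigmoid(a)                # z_j = sigmoid(a_j)
    b = beta @ z                  # b   = sum_j beta_j * z_j
    y = sigmoid(b)                # y   = sigmoid(b)
    J = y_star * np.log(y) + (1 - y_star) * np.log(1 - y)

    # Backward pass (chain rule, top to bottom)
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)
    dJ_db = dJ_dy * y * (1 - y)          # dy/db = sigmoid'(b)
    dJ_dbeta = dJ_db * z                 # db/dbeta_j = z_j
    dJ_dz = dJ_db * beta                 # db/dz_j    = beta_j
    dJ_da = dJ_dz * z * (1 - z)          # dz_j/da_j  = sigmoid'(a_j)
    dJ_dalpha = np.outer(dJ_da, x)       # da_j/dalpha_ji = x_i
    dJ_dx = alpha.T @ dJ_da              # dJ/dx_i = sum_j dJ/da_j * alpha_ji
    return J, dJ_dalpha, dJ_dbeta

rng = np.random.default_rng(1)
x = rng.normal(size=3)
alpha = rng.normal(size=(4, 3))
beta = rng.normal(size=4)
print(nn_forward_backward(x, alpha, beta, y_star=1.0))
```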

Training: Backpropagation

Backpropagation (Auto. Diff. - Reverse Mode)

Forward Computation
1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the "computation graph").
2. Visit each node in topological order.
   a. Compute the corresponding variable's value
   b. Store the result at the node

Backward Computation
3. Initialize all partial derivatives dy/du_j to 0 and dy/dy = 1.
4. Visit each node in reverse topological order. For variable u_i = g_i(v_1, …, v_N):
   a. We already know dy/du_i
   b. Increment dy/dv_j by (dy/du_i)(du_i/dv_j)
      (Choice of algorithm ensures computing (du_i/dv_j) is easy)

Return partial derivatives dy/du_i for all variables.
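A compact sketch of this graph-based procedure: each node stores its value, its parent nodes, and the local partial derivatives du_i/dv_j; the backward pass walks the nodes in reverse topological order and increments dy/dv_j exactly as in step 4b. The Node class and helper functions are illustrative, not a standard library API.

```python
import math

class Node:
    """One variable u_i in the computation graph."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value              # result of u_i = g_i(v_1, ..., v_N)
        self.parents = parents          # input nodes v_1, ..., v_N
        self.local_grads = local_grads  # local partials du_i/dv_j, one per parent
        self.grad = 0.0                 # dy/du_i, initialized to 0

# A few primitive operations; each records its local derivatives.
def add(a, b): return Node(a.value + b.value, (a, b), (1.0, 1.0))
def mul(a, b): return Node(a.value * b.value, (a, b), (b.value, a.value))
def sin(a):    return Node(math.sin(a.value), (a,), (math.cos(a.value),))
def cos(a):    return Node(math.cos(a.value), (a,), (-math.sin(a.value),))

def topological_order(y):
    order, seen = [], set()
    def visit(u):
        if id(u) not in seen:
            seen.add(id(u))
            for v in u.parents:
                visit(v)
            order.append(u)         # parents always appear before their children
    visit(y)
    return order

def backward(y):
    y.grad = 1.0                    # dy/dy = 1
    for u in reversed(topological_order(y)):          # reverse topological order
        for v, du_dv in zip(u.parents, u.local_grads):
            v.grad += u.grad * du_dv                   # increment dy/dv_j by (dy/du_i)(du_i/dv_j)

# The simple example from the slides: J = cos(sin(x^2) + 3*x^2), at a made-up x = 1.5
x = Node(1.5)
t = mul(x, x)
J = cos(add(sin(t), mul(Node(3.0), t)))
backward(J)
print(J.value, x.grad)
```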

Training: Backpropagation

Case 2: Neural Network, broken into modules

Module 5: Forward J = y* log y + (1 − y*) log(1 − y); Backward dJ/dy = y*/y + (1 − y*)/(y − 1)
Module 4: Forward y = 1 / (1 + exp(−b)); Backward dJ/db = (dJ/dy)(dy/db), where dy/db = exp(−b) / (exp(−b) + 1)²
Module 3: Forward b = Σ_{j=0}^{D} β_j z_j; Backward dJ/dβ_j = (dJ/db) z_j and dJ/dz_j = (dJ/db) β_j
Module 2: Forward z_j = 1 / (1 + exp(−a_j)); Backward dJ/da_j = (dJ/dz_j)(dz_j/da_j), where dz_j/da_j = exp(−a_j) / (exp(−a_j) + 1)²
Module 1: Forward a_j = Σ_{i=0}^{M} α_{ji} x_i; Backward dJ/dα_{ji} = (dJ/da_j) x_i and dJ/dx_i = Σ_{j=0}^{D} (dJ/da_j) α_{ji}
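A brief sketch of the module-based view of automatic differentiation: each module caches what it needs in forward() and maps dJ/d(output) back to dJ/d(input), plus gradients of its own parameters, in backward(). The class names and the particular composition below are illustrative, not from the lecture.

```python
import numpy as np

class Linear:
    """Module computing out = W @ x (no bias, for brevity)."""
    def __init__(self, W):
        self.W = W
    def forward(self, x):
        self.x = x                            # cache the input for the backward pass
        return self.W @ x
    def backward(self, d_out):
        self.dW = np.outer(d_out, self.x)     # dJ/dW
        return self.W.T @ d_out               # dJ/dx

class Sigmoid:
    def forward(self, a):
        self.z = 1.0 / (1.0 + np.exp(-a))
        return self.z
    def backward(self, d_out):
        return d_out * self.z * (1.0 - self.z)   # chain through sigmoid'(a)

# Compose modules 1-4 of the slide; module 5 (the loss) is applied at the end.
rng = np.random.default_rng(2)
modules = [Linear(rng.normal(size=(4, 3))), Sigmoid(),
           Linear(rng.normal(size=(1, 4))), Sigmoid()]

x, y_star = rng.normal(size=3), 1.0
out = x
for m in modules:                  # forward pass, module by module
    out = m.forward(out)
y = out[0]

d = np.array([y_star / y + (1 - y_star) / (y - 1)])   # dJ/dy from the loss module
for m in reversed(modules):        # backward pass in reverse order
    d = m.backward(d)
```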

Background: A Recipe for Machine Learning (Gradients)

1. Given training data:
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal:
4. Train with SGD:
   (take small steps opposite the gradient)

Backpropagation can compute this gradient! And it's a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!

Summary
1. Neural Networks…
   – provide a way of learning features
   – are highly nonlinear prediction functions
   – (can be) a highly parallel network of logistic regression classifiers
   – discover useful hidden representations of the input
2. Backpropagation…
   – provides an efficient way to compute gradients
   – is a special case of reverse-mode automatic differentiation
