School of Computer Science

10-701 Introduction to Machine Learning

Perceptron & Neural Networks

Matt Gormley
Lecture 12, October 17, 2016

Readings: Bishop Ch. 4.1.7, Ch. 5; Murphy Ch. 16.5, Ch. 28; Mitchell Ch. 4

1 Reminders • Homework 3: – due 10/24/16

Outline
• Discriminative vs. Generative Classifiers
• The Perceptron Algorithm
• Neural Networks
• Backpropagation

3 DISCRIMINATIVE AND GENERATIVE CLASSIFIERS

Generative vs. Discriminative
• Generative Classifiers:
– Example: Naïve Bayes
– Define a joint model of the observations x and the labels y: p(x, y)
– Learning maximizes (joint) likelihood
– Use Bayes' Rule to classify based on the posterior: p(y|x) = p(x|y)p(y)/p(x)
• Discriminative Classifiers:
– Example: Logistic Regression
– Directly model the conditional: p(y|x)
– Learning maximizes conditional likelihood

5 Generative vs. Discriminative Finite Sample Analysis (Ng & Jordan, 2002) [Assume that we are learning from a finite training dataset] If model assumptions are correct: Naive Bayes is a more efficient learner (requires fewer samples) than Logistic Regression

If model assumptions are incorrect: Logistic Regression has lower asymptotic error, and does better than Naïve Bayes

[Figure: solid = Naïve Bayes (NB), dashed = Logistic Regression (LR)]

Slide courtesy of William Cohen

[Figure: solid = Naïve Bayes (NB), dashed = Logistic Regression (LR)]

Naïve Bayes makes stronger assumptions about the data but needs fewer examples to estimate the parameters

“On Discriminative vs Generative Classifiers: ….” Andrew Ng and Michael Jordan, NIPS 2001.

8 Slide courtesy of William Cohen Generative vs. Discriminative Learning (Parameter Estimation)

Naïve Bayes: Parameters are decoupled → Closed-form solution for MLE

Logistic Regression: Parameters are coupled → No closed-form solution – must use iterative optimization techniques instead

9 Naïve Bayes vs. Logistic Reg. Learning (MAP Estimation of Parameters)

Bernoulli Naïve Bayes: Parameters are probabilities → Beta prior (usually) pushes probabilities away from zero / one extremes

Logistic Regression: Parameters are not probabilities → Gaussian prior encourages parameters to be close to zero

(effectively pushes the probabilities away from zero / one extremes)

10 Naïve Bayes vs. Logistic Reg. Features

Naïve Bayes: Features x are assumed to be conditionally independent given y. (i.e. Naïve Bayes Assumption)

Logistic Regression: No assumptions are made about the form of the features x. They can be dependent and correlated in any fashion.

11 THE PERCEPTRON ALGORITHM

Recall… Background: Hyperplanes

Why don't we drop the generative model and try to learn this hyperplane directly?

Hyperplane (Definition 1): H = {x : w^T x = b}
Hyperplane (Definition 2): H = {x : w^T x = 0 and x_1 = 1}
Half-spaces:
H+ = {x : w^T x > 0 and x_1 = 1}
H− = {x : w^T x < 0 and x_1 = 1}

Directly modeling the hyperplane would use a decision function h(x) = sign(θ^T x) for y ∈ {−1, +1}.

Online Learning Model
Setup:
• We receive an example (x, y)
• Make a prediction h(x)
• Check for correctness: is h(x) = y?

Goal: • Minimize the number of mistakes

Recall… Geometric Margins

Definition: The margin of example x w.r.t. a linear separator w is the distance from x to the plane w · x = 0 (or the negative if on the wrong side).

Definition: The margin γ_w of a set of examples S w.r.t. a linear separator w is the smallest margin over points x ∈ S.

Definition: The margin γ of a set of examples S is the maximum γ_w over all linear separators w.

[Figure: positively and negatively labeled points separated by a hyperplane w, with margin γ. Slide from Nina Balcan]

Perceptron Algorithm

Data: Inputs are continuous vectors of length K. Outputs are discrete.
D = {(x^(i), y^(i))}_{i=1}^N where x ∈ R^K and y ∈ {+1, −1}

Prediction: Output determined by a hyperplane.
ŷ = h(x) = sign(θ^T x), where sign(a) = +1 if a ≥ 0 and −1 otherwise

Learning: Iterative procedure:
• while not converged
• receive next example (x, y)
• predict y' = h(x)
• if positive mistake: add x to parameters
• if negative mistake: subtract x from parameters

18 Perceptron Algorithm Learning:

Algorithm 1 Perceptron Learning Algorithm (Batch)
1: procedure Perceptron(D = {(x^(1), y^(1)), ..., (x^(N), y^(N))})
2:     θ ← 0                              ▷ Initialize parameters
3:     while not converged do
4:         for i ∈ {1, 2, ..., N} do      ▷ For each example
5:             ŷ ← sign(θ^T x^(i))        ▷ Predict
6:             if ŷ ≠ y^(i) then          ▷ If mistake
7:                 θ ← θ + y^(i) x^(i)    ▷ Update parameters
8:     return θ
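For concreteness, here is a minimal NumPy sketch of the batch Perceptron above; the function name, the epoch cap, and the stop-when-no-mistakes check are my own choices, not part of the slides:

```python
import numpy as np

def perceptron_batch(X, y, max_epochs=100):
    """Batch Perceptron: X is (N, K), y is (N,) with labels in {+1, -1}."""
    N, K = X.shape
    theta = np.zeros(K)                            # initialize parameters
    for _ in range(max_epochs):                    # "while not converged"
        mistakes = 0
        for i in range(N):                         # for each example
            y_hat = 1 if X[i] @ theta >= 0 else -1 # predict
            if y_hat != y[i]:                      # if mistake
                theta += y[i] * X[i]               # update parameters
                mistakes += 1
        if mistakes == 0:                          # converged: a full pass with no mistakes
            break
    return theta
```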

Analysis: Perceptron — Perceptron Mistake Bound

Guarantee: If the data has margin γ and all points lie inside a ball of radius R, then the Perceptron makes ≤ (R/γ)^2 mistakes.

(Normalized margin: multiplying all points by 100, or dividing all points by 100, doesn't change the number of mistakes; the algorithm is invariant to scaling.)

[Figure: linearly separable points with margin γ inside a ball of radius R. Slide from Nina Balcan]

Analysis: Perceptron — Perceptron Mistake Bound

Theorem 0.1 (Block (1962), Novikoff (1962)).
Given dataset: D = {(x^(i), y^(i))}_{i=1}^N.
Suppose:
1. Finite size inputs: ||x^(i)|| ≤ R
2. Linearly separable data: ∃ θ* s.t. ||θ*|| = 1 and y^(i) (θ* · x^(i)) ≥ γ, ∀i
Then: The number of mistakes made by the Perceptron algorithm on this dataset is

k ≤ (R/γ)^2

[Figure from Nina Balcan]

Analysis: Perceptron — Proof of Perceptron Mistake Bound:

We will show that there exist constants A and B s.t. Ak ≤ ||θ^(k+1)|| ≤ B√k


Analysis: Perceptron

Theorem 0.1 (Block (1962), Novikoff (1962)).
Given dataset: D = {(x^(i), y^(i))}_{i=1}^N.
Suppose:
1. Finite size inputs: ||x^(i)|| ≤ R
2. Linearly separable data: ∃ θ* s.t. ||θ*|| = 1 and y^(i) (θ* · x^(i)) ≥ γ, ∀i
Then: The number of mistakes made by the Perceptron algorithm on this dataset is

k ≤ (R/γ)^2

Algorithm 1 Perceptron Learning Algorithm (Online)
1: procedure Perceptron(D = {(x^(1), y^(1)), (x^(2), y^(2)), ...})
2:     θ^(1) ← 0, k = 1                         ▷ Initialize parameters
3:     for i ∈ {1, 2, ...} do                   ▷ For each example
4:         if y^(i) (θ^(k) · x^(i)) ≤ 0 then    ▷ If mistake
5:             θ^(k+1) ← θ^(k) + y^(i) x^(i)    ▷ Update parameters
6:             k ← k + 1
7:     return θ

Analysis: Perceptron
Whiteboard: Proof of Perceptron Mistake Bound

Analysis: Perceptron — Proof of Perceptron Mistake Bound:

Part 1: for some A, Ak ≤ ||θ^(k+1)||

θ^(k+1) · θ* = (θ^(k) + y^(i) x^(i)) · θ*        (by Perceptron algorithm update)
             = θ^(k) · θ* + y^(i) (θ* · x^(i))
             ≥ θ^(k) · θ* + γ                    (by assumption)
⟹ θ^(k+1) · θ* ≥ kγ                             (by induction on k, since θ^(1) = 0)
⟹ ||θ^(k+1)|| ≥ kγ                              (since ||r|| ||m|| ≥ r · m and ||θ*|| = 1; Cauchy-Schwarz inequality)

Analysis: Perceptron — Proof of Perceptron Mistake Bound:

Part 2: for some B, ||θ^(k+1)|| ≤ B√k

||θ^(k+1)||^2 = ||θ^(k) + y^(i) x^(i)||^2                             (by Perceptron algorithm update)
              = ||θ^(k)||^2 + (y^(i))^2 ||x^(i)||^2 + 2 y^(i) (θ^(k) · x^(i))
              ≤ ||θ^(k)||^2 + (y^(i))^2 ||x^(i)||^2                   (since the kth mistake means y^(i) (θ^(k) · x^(i)) ≤ 0)
              = ||θ^(k)||^2 + R^2                                     (since (y^(i))^2 ||x^(i)||^2 = ||x^(i)||^2 ≤ R^2 by assumption, and (y^(i))^2 = 1)
⟹ ||θ^(k+1)||^2 ≤ kR^2                                               (by induction on k, since ||θ^(1)||^2 = 0)
⟹ ||θ^(k+1)|| ≤ √k R

Analysis: Perceptron — Proof of Perceptron Mistake Bound:

Part 3: Combining the bounds finishes the proof.
kγ ≤ ||θ^(k+1)|| ≤ √k R
⟹ k ≤ (R/γ)^2

The total number of mistakes must be less than this
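As a sanity check (not from the slides), the following sketch runs an online Perceptron on synthetic linearly separable data and compares its mistake count against (R/γ)^2; the data-generation choices here are assumptions made for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic linearly separable data with a known unit-norm separator theta_star.
theta_star = np.array([1.0, -2.0, 0.5])
theta_star /= np.linalg.norm(theta_star)          # ||theta*|| = 1
X = rng.uniform(-1, 1, size=(500, 3))
margins = X @ theta_star
keep = np.abs(margins) >= 0.1                     # enforce a margin of at least 0.1
X, y = X[keep], np.sign(margins[keep])

mistakes, theta = 0, np.zeros(3)
for x_i, y_i in zip(X, y):
    if y_i * (theta @ x_i) <= 0:                  # mistake
        theta += y_i * x_i
        mistakes += 1

R = np.linalg.norm(X, axis=1).max()
gamma = np.min(y * (X @ theta_star))
print(mistakes, (R / gamma) ** 2)                 # mistakes should be <= (R/gamma)^2
```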

27 Extensions of Perceptron • Kernel Perceptron – Choose a kernel K(x’, x) – Apply the kernel trick to Perceptron – Resulting algorithm is still very simple • Structured Perceptron – Basic idea can also be applied when y ranges over an exponentially large set – Mistake bound does not depend on the size of that set

28 Summary: Perceptron • Perceptron is a simple linear classifier • Simple learning algorithm: when a mistake is made, add / subtract the features • For linearly separable and inseparable data, we can bound the number of mistakes (geometric argument) • Extensions support nonlinear separators and structured prediction

29 RECALL: LOGISTIC REGRESSION

Using gradient ascent for linear classifiers (Recall…)

Key idea behind today's lecture:
1. Define a linear classifier (logistic regression)
2. Define an objective function (likelihood)
3. Optimize it with gradient ascent to learn parameters
4. Predict the class with highest probability under the model
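To make this recipe concrete, here is a minimal NumPy sketch of steps 1–4 for binary logistic regression; the learning rate, iteration count, and function names are assumptions for illustration, not part of the lecture:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def train_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """X is (N, K), y is (N,) with labels in {0, 1}."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ theta)      # step 1: linear classifier, p(y=1|x)
        grad = X.T @ (y - p)        # step 2: gradient of the log-likelihood
        theta += lr * grad          # step 3: gradient *ascent* on the likelihood
    return theta

def predict(theta, X):
    return (sigmoid(X @ theta) >= 0.5).astype(int)   # step 4: most probable class
```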

Using gradient ascent for linear classifiers (Recall…)

This decision function isn't differentiable:
h(x) = sign(θ^T x)

Use a differentiable function instead:
p(y = 1 | x) = 1 / (1 + exp(−θ^T x))

[Plots: the non-differentiable step function sign(x) alongside the smooth logistic(u) ≡ 1 / (1 + e^(−u))]

Recall… Logistic Regression

Data: Inputs are continuous vectors of length K. Outputs are discrete.
D = {(x^(i), y^(i))}_{i=1}^N where x ∈ R^K and y ∈ {0, 1}

Model: Logistic function applied to the dot product of the parameters with the input vector.
p(y = 1 | x) = 1 / (1 + exp(−θ^T x))

Learning: Finds the parameters that minimize some objective function.
θ* = argmin_θ J(θ)

Prediction: Output is the most probable class.
ŷ = argmax_{y ∈ {0,1}} p(y | x)

NEURAL NETWORKS

35 Learning highly non-linear functions

f: X → Y

• f might be a non-linear function

• X: (vector of) continuous and/or discrete variables

• Y: (vector of) continuous and/or discrete variables

Examples: the XOR gate; speech recognition

© Eric Xing @ CMU, 2006-2011 36 Perceptron and Neural Nets

• From biological neurons to the artificial neuron (perceptron)

[Figure: a biological neuron (soma, dendrites, axon, synapses) alongside the artificial neuron: inputs x1, x2 with weights w1, w2, a linear combiner Σ, a hard limiter with threshold θ, and output Y]

X = Σ_{i=1}^n x_i w_i,    Y = +1 if X ≥ ω_0, and Y = −1 if X < ω_0

• Artificial neuron networks
• supervised learning
• gradient descent

[Figure: input signals entering an Input Layer, passing through a Middle Layer, and leaving as output signals from an Output Layer]

© Eric Xing @ CMU, 2006-2011

Connectionist Models

• Consider humans:
– Neuron switching time ~ 0.001 second
– Number of neurons ~ 10^10
– Connections per neuron ~ 10^4–10^5
– Scene recognition time ~ 0.1 second
– 100 inference steps doesn't seem like enough → much parallel computation
• Properties of artificial neural nets (ANN):
– Many neuron-like threshold switching units
– Many weighted interconnections among units
– Highly parallel, distributed processes

© Eric Xing @ CMU, 2006-2011

Motivation: Why is everyone talking about Deep Learning?
• Because a lot of money is invested in it…
– DeepMind: Acquired by Google for $400 million
– DNNResearch: Three person startup (including Geoff Hinton) acquired by Google for unknown price tag
– Enlitic, Ersatz, MetaMind, Nervana, Skylab: Deep Learning startups commanding millions of VC dollars
• Because it made the front page of the New York Times

Motivation: Why is everyone talking about Deep Learning?

Deep learning:
– Has won numerous pattern recognition competitions
– Does so with minimal feature engineering

This wasn't always the case!
Since the 1980s: the form of the models hasn't changed much, but lots of new tricks…
– More hidden units
– Better (online) optimization
– New nonlinear functions (ReLUs)
– Faster computers (CPUs and GPUs)

[Timeline in the margin: 1960s, 1980s, 1990s, 2006, 2016]

Background: A Recipe for Machine Learning

1. Given training data: [example images labeled Face / Face / Not a face]

2. Choose each of these: – Decision function Examples: Linear regression, Logistic regression, Neural Network

– Loss function Examples: Mean-squared error, Cross Entropy

Background: A Recipe for Machine Learning

1. Given training data
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal
4. Train with SGD (take small steps opposite the gradient)
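As an illustration of step 4 (not code from the lecture), a generic SGD loop might look like the sketch below; `grad_loss` stands for any routine that returns the gradient of the per-example loss, and the hyperparameters are placeholders:

```python
import numpy as np

def sgd(theta, data, grad_loss, lr=0.01, epochs=10):
    """Generic SGD: take small steps opposite the gradient of the per-example loss."""
    for _ in range(epochs):
        np.random.shuffle(data)                           # visit examples in random order
        for x, y in data:
            theta = theta - lr * grad_loss(theta, x, y)   # small step opposite the gradient
    return theta
```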

Background: A Recipe for Machine Learning — Gradients

Backpropagation can compute this gradient! And it's a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!

(This callout is overlaid on the same recipe: 1. Given training data; 2. Choose a decision function and a loss function; 3. Define goal; 4. Train with SGD — take small steps opposite the gradient.)

Background: A Recipe for Machine Learning — Goals for Today's Lecture

Goals for Today's Lecture:
1. Explore a new class of decision functions (Neural Networks)
2. Consider variants of this recipe for training

(Overlaid on the same recipe slide as before.)

44 Decision Functions Linear Regression

y = h_θ(x) = σ(θ^T x), where σ(a) = a

[Figure: a single output unit computing σ(θ^T x) from inputs x_1, …, x_M with parameters θ_1, θ_2, θ_3, …, θ_M]

45 Decision Functions Logistic Regression

y = h_θ(x) = σ(θ^T x), where σ(a) = 1 / (1 + exp(−a))

[Figure: a single output unit computing σ(θ^T x) from inputs x_1, …, x_M with parameters θ_1, …, θ_M]

Decision Functions: Logistic Regression

y = h_θ(x) = σ(θ^T x), where σ(a) = 1 / (1 + exp(−a))

(The same model is repeated on the next few slides with different illustrations: the Face / Not-a-face training images, and a plot of the predicted probability y as a surface over the inputs x_1, x_2.)

49 Neural Network Model

[Figure: inputs Age = 34, Gender = 2, Stage = 4 feed through weighted connections (.6, .1, .5, .2, .3, .8, .7, .2, …) into two hidden Σ units and then an output Σ unit, producing 0.6 = "Probability of being Alive"]

Independent variables → Weights → Hidden Layer → Weights → Dependent variable (Prediction)

© Eric Xing @ CMU, 2006-2011 50 “Combined logistic models”

[Figures (slides 51–53, © Eric Xing @ CMU, 2006-2011): the same network built up piece by piece — each hidden unit is shown as its own logistic model of the inputs (Age, Gender, Stage), and their outputs are then combined by the output unit into the "Probability of being Alive"]

© Eric Xing @ CMU, 2006-2011 53 Not really, no target for hidden units...

[Figure: the full two-hidden-unit network again, with the same labels: Independent variables → Weights → Hidden Layer → Weights → Dependent variable (Prediction)]

© Eric Xing @ CMU, 2006-2011 54 Jargon Pseudo-Correspondence

l Independent variable = input variable

l Dependent variable = output variable

l Coefficients = “weights”

l Estimates = “targets”

Logistic Regression Model (the sigmoid unit)

[Figure: a single sigmoid unit with inputs Age = 34, Gender = 1, Stage = 4, coefficients (5, 4, 8), and output 0.6 = "Probability of being Alive"]

Independent variables: x1, x2, x3 — Coefficients: a, b, c — Dependent variable: p (Prediction)

© Eric Xing @ CMU, 2006-2011

Decision Functions: Neural Network

Output

Hidden Layer …

Input …

56 Decision Functions Neural Network

(F) Loss:             J = ½ (y − y*)^2
(E) Output (sigmoid): y = 1 / (1 + exp(−b))
(D) Output (linear):  b = Σ_{j=0}^D β_j z_j
(C) Hidden (sigmoid): z_j = 1 / (1 + exp(−a_j)), ∀j
(B) Hidden (linear):  a_j = Σ_{i=0}^M α_{ji} x_i, ∀j
(A) Input:            Given x_i, ∀i

[Figure: Input … → Hidden Layer … → Output]

Building a Neural Net
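A minimal NumPy sketch of the forward computation (A)–(F) above; the parameter names `alpha`, `beta` and the omission of the bias terms (the j = 0 / i = 0 indices) are simplifications of mine, not from the slides:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, alpha, beta, y_star):
    """One-hidden-layer network with sigmoid units and squared-error loss.

    x:     input vector, shape (M,)              (A)
    alpha: hidden-layer weights, shape (D, M)
    beta:  output weights, shape (D,)
    """
    a = alpha @ x                # (B) hidden (linear)
    z = sigmoid(a)               # (C) hidden (sigmoid)
    b = beta @ z                 # (D) output (linear)
    y = sigmoid(b)               # (E) output (sigmoid)
    J = 0.5 * (y - y_star) ** 2  # (F) loss
    return a, z, b, y, J
```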

Output

Features …

58 Building a Neural Net

Output

Hidden Layer … D = M

1 1 1

Input …

59 Building a Neural Net

Output

Hidden Layer … D = M

Input …

60 Building a Neural Net

Output

Hidden Layer … D = M

Input …

61 Building a Neural Net

Output

Hidden Layer … D < M

Input …

62 Decision Boundary

• 0 hidden layers: linear classifier – Hyperplanes

y

x1 x2

Example from Eric Postma via Jason Eisner

Decision Boundary

• 1 hidden layer – Boundary of convex region (open or closed) y

x1 x2

Example from Eric Postma via Jason Eisner

Decision Boundary

y • 2 hidden layers – Combinations of convex regions

x1 x2

Example from Eric Postma via Jason Eisner

Decision Functions: Multi-Class Output

Output …

Hidden Layer …

Input …

66 Decision Functions Deeper Networks

Next lecture:

Output

… Hidden Layer 1

… Input

67 Decision Functions Deeper Networks

Next lecture:

Output

… Hidden Layer 2

… Hidden Layer 1

… Input

68 Decision Functions Deeper Networks

Next lecture: Making the neural networks deeper

[Figure: Input → Hidden Layer 1 → Hidden Layer 2 → Hidden Layer 3 → Output]

Decision Functions: Different Levels of Abstraction
• We don't know the "right" levels of abstraction
• So let the model figure it out!

Example from Honglak Lee (NIPS 2010)

Decision Functions: Different Levels of Abstraction
Face Recognition:
– A deep network can build up increasingly higher levels of abstraction
– Lines, parts, regions

Example from Honglak Lee (NIPS 2010)

Decision Functions: Different Levels of Abstraction

Output

… Hidden Layer 3

… Hidden Layer 2

… Hidden Layer 1

… Input

72 Example from Honglak Lee (NIPS 2010) ARCHITECTURES

73 Neural Network Architectures Even for a basic Neural Network, there are many design decisions to make: 1. # of hidden layers (depth) 2. # of units per hidden layer (width) 3. Type of activation function (nonlinearity) 4. Form of objective function

74 Activation Functions

Neural Network with sigmoid activation functions:

(F) Loss:             J = ½ (y − y*)^2
(E) Output (sigmoid): y = 1 / (1 + exp(−b))
(D) Output (linear):  b = Σ_{j=0}^D β_j z_j
(C) Hidden (sigmoid): z_j = 1 / (1 + exp(−a_j)), ∀j
(B) Hidden (linear):  a_j = Σ_{i=0}^M α_{ji} x_i, ∀j
(A) Input:            Given x_i, ∀i

Activation Functions

Neural Network with arbitrary nonlinear activation functions σ(·):

(F) Loss:               J = ½ (y − y*)^2
(E) Output (nonlinear): y = σ(b)
(D) Output (linear):    b = Σ_{j=0}^D β_j z_j
(C) Hidden (nonlinear): z_j = σ(a_j), ∀j
(B) Hidden (linear):    a_j = Σ_{i=0}^M α_{ji} x_i, ∀j
(A) Input:              Given x_i, ∀i

Activation Functions

Sigmoid / Logistic Function

So far, we've assumed that the activation function (nonlinearity) is always the sigmoid: logistic(u) ≡ 1 / (1 + e^(−u))

77 Activation Functions

• A new change: modifying the nonlinearity – The logistic is not widely used in modern ANNs

Alternate 1: tanh

Like logistic function but shifted to range [-1, +1]

Slide from William Cohen

[Figure from Glorot & Bengio (AISTATS 2010): sigmoid vs. tanh activations in a depth-4 network]

Activation Functions

• A new change: modifying the nonlinearity – ReLU is often used in vision tasks

Alternate 2: rectified linear unit

Linear with a cutoff at zero

(Implementation: clip the gradient when you pass zero)

Slide from William Cohen Activation Functions

• A new change: modifying the nonlinearity – ReLU is often used in vision tasks

Alternate 2: rectified linear unit

Soft version: log(exp(x)+1)

Doesn’t saturate (at one end) Sparsifies outputs Helps with vanishing gradient

Slide from William Cohen Objective Functions for NNs • Regression: – Use the same objective as Linear Regression – Quadratic loss (i.e. mean squared error) • Classification: – Use the same objective as Logistic Regression – Cross-entropy (i.e. negative log likelihood) – This requires probabilities, so we add an additional “softmax” layer at the end of our network

Forward Backward

Quadratic:      Forward: J = ½ (y − y*)^2                     Backward: dJ/dy = y − y*
Cross Entropy:  Forward: J = y* log(y) + (1 − y*) log(1 − y)  Backward: dJ/dy = y*/y + (1 − y*)/(y − 1)

Multi-Class Output
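A quick sketch of the forward/backward pairs in the loss table above; the sign conventions follow the table exactly (the cross-entropy is written without a leading minus, as on the slide):

```python
import numpy as np

def quadratic(y, y_star):
    J = 0.5 * (y - y_star) ** 2
    dJ_dy = y - y_star
    return J, dJ_dy

def cross_entropy(y, y_star):
    J = y_star * np.log(y) + (1 - y_star) * np.log(1 - y)   # as written on the slide
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)
    return J, dJ_dy
```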

Output …

Hidden Layer …

Input …

Multi-Class Output

(F) Loss (softmax / cross-entropy): J = Σ_{k=1}^K y*_k log(y_k)
(E) Output (softmax):               y_k = exp(b_k) / Σ_{l=1}^K exp(b_l)
(D) Output (linear):                b_k = Σ_{j=0}^D β_{kj} z_j
(C) Hidden (nonlinear):             z_j = σ(a_j), ∀j
(B) Hidden (linear):                a_j = Σ_{i=0}^M α_{ji} x_i, ∀j
(A) Input:                          Given x_i, ∀i

[Figure: Input … → Hidden Layer … → Output …]

Cross-entropy vs. Quadratic loss
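A sketch of the softmax output layer and its cross-entropy loss as written in the equations above; the max-subtraction for numerical stability is an addition of mine, not from the slide:

```python
import numpy as np

def softmax(b):
    e = np.exp(b - np.max(b))         # subtract max for numerical stability
    return e / e.sum()

def multiclass_forward(z, beta, y_star):
    """z: hidden activations (D,), beta: (K, D) weights, y_star: one-hot target (K,)."""
    b = beta @ z                      # (D) linear output scores
    y = softmax(b)                    # (E) softmax probabilities
    J = np.sum(y_star * np.log(y))    # (F) cross-entropy as written on the slide
    return y, J
```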

Figure from Glorot & Bengio (2010)

Background: A Recipe for Machine Learning

1. Given training data
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal
4. Train with SGD (take small steps opposite the gradient)

86 Objective Functions Matching Quiz: Suppose you are given a neural net with a single output, y, and one hidden layer.

1) Minimizing sum of squared errors…
2) Minimizing sum of squared errors plus squared Euclidean norm of weights…
3) Minimizing cross-entropy…
4) Minimizing hinge loss…

…gives…

5) …MLE estimates of weights assuming target follows a Bernoulli with parameter given by the output value
6) …MAP estimates of weights assuming weight priors are zero mean Gaussian
7) …estimates with a large margin on the training data
8) …MLE estimates of weights assuming zero mean Gaussian noise on the output value

A. 1=5, 2=7, 3=6, 4=8    D. 1=7, 2=5, 3=6, 4=8
B. 1=5, 2=7, 3=8, 4=6    E. 1=8, 2=6, 3=5, 4=7

C. 1=7, 2=5, 3=5, 4=7    F. 1=8, 2=6, 3=8, 4=6

BACKPROPAGATION

Background: A Recipe for Machine Learning

1. Given training data
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal
4. Train with SGD (take small steps opposite the gradient)

89 Training Backpropagation

• Question 1: When can we compute the gradients of the parameters of an arbitrary neural network?

• Question 2: When can we make the gradient computation efficient?

Training: Chain Rule

Given: y = f(x) = g(h(x)), computed via an intermediate vector u, i.e. u = h(x) and y = g(u).

Chain Rule (must sum over all the intermediate quantities):
dy_i/dx_k = Σ_{j=1}^J (dy_i/du_j)(du_j/dx_k), ∀i, k      (2.3)

If the inputs and outputs of f, g, and h are all scalars, we obtain the familiar form of the chain rule:
dy/dx = (dy/du)(du/dx)      (2.4)

[Background of the slide: a textbook-style excerpt on neural networks and backpropagation, noting that backpropagation (Rumelhart et al., 1986) computes the gradient of any arithmetic circuit, also goes by the name reverse-mode automatic differentiation (Griewank and Corliss, 1991), consists of a forward pass (compute the output bottom-up) and a backward pass (compute the derivatives of all intermediate quantities top-down), and that binary logistic regression can itself be interpreted as an arithmetic circuit with J = y* log q + (1 − y*) log(1 − q) and q = P_θ(Y_i = 1 | x) = 1 / (1 + exp(−Σ_{j=0}^D θ_j x_j)).]

Training: Chain Rule

Backpropagation is just repeated application of the chain rule from Calculus 101.

(Same chain rule and textbook-excerpt background as the previous slide.)

Training: Chain Rule

Backpropagation:
1. Instantiate the computation as a directed acyclic graph, where each intermediate quantity is a node.
2. At each node, store (a) the quantity computed in the forward pass and (b) the partial derivative of the goal with respect to that node's intermediate quantity.
3. Initialize all partial derivatives to 0.
4. Visit each node in reverse topological order. At each node, add its contribution to the partial derivatives of its parents.

This algorithm is also called automatic differentiation in the reverse-mode.

Training: Backpropagation

Simple Example: The goal is to compute J = cos(sin(x^2) + 3x^2) on the forward pass and the derivative dJ/dx on the backward pass.

Forward                  Backward
J = cos(u)               dJ/du = −sin(u)
u = u1 + u2              dJ/du1 = (dJ/du)(du/du1), du/du1 = 1;  dJ/du2 = (dJ/du)(du/du2), du/du2 = 1
u1 = sin(t)              dJ/dt = (dJ/du1)(du1/dt), du1/dt = cos(t)
u2 = 3t                  dJ/dt = (dJ/du2)(du2/dt), du2/dt = 3
                         (the two contributions to dJ/dt are summed)
t = x^2                  dJ/dx = (dJ/dt)(dt/dx), dt/dx = 2x
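A sketch of this forward/backward computation in code, with a finite-difference check of dJ/dx (the check itself is my addition, not part of the slide):

```python
import math

def forward_backward(x):
    # Forward pass
    t = x ** 2
    u1, u2 = math.sin(t), 3 * t
    u = u1 + u2
    J = math.cos(u)
    # Backward pass (reverse order, accumulating the two contributions to dJ/dt)
    dJ_du = -math.sin(u)
    dJ_du1, dJ_du2 = dJ_du * 1, dJ_du * 1
    dJ_dt = dJ_du1 * math.cos(t) + dJ_du2 * 3
    dJ_dx = dJ_dt * 2 * x
    return J, dJ_dx

x, eps = 1.3, 1e-6
J, dJ_dx = forward_backward(x)
numeric = (forward_backward(x + eps)[0] - forward_backward(x - eps)[0]) / (2 * eps)
print(dJ_dx, numeric)   # the two values should agree closely
```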

94 Training Backpropagation

(The same simple example and forward/backward table as above, shown again as the backward pass completes.)

95 Training Backpropagation

Output

Case 1: Logistic Regression

[Figure: Input x_1 … x_M with parameters θ_1, θ_2, θ_3, …, θ_M feeding a single Output unit]

Forward Backward

Forward                               Backward
J = y* log y + (1 − y*) log(1 − y)    dJ/dy = y*/y + (1 − y*)/(y − 1)
y = 1 / (1 + exp(−a))                 dJ/da = (dJ/dy)(dy/da),   dy/da = exp(−a) / (exp(−a) + 1)^2
a = Σ_{j=0}^D θ_j x_j                 dJ/dθ_j = (dJ/da)(da/dθ_j),   da/dθ_j = x_j
                                      dJ/dx_j = (dJ/da)(da/dx_j),   da/dx_j = θ_j

Training: Backpropagation
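A sketch of Case 1 in code, following the table above row by row; the example values in the call are arbitrary:

```python
import numpy as np

def logistic_regression_backprop(x, y_star, theta):
    # Forward
    a = theta @ x
    y = 1.0 / (1.0 + np.exp(-a))
    J = y_star * np.log(y) + (1 - y_star) * np.log(1 - y)
    # Backward (chain rule, top-down)
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)
    dy_da = np.exp(-a) / (np.exp(-a) + 1) ** 2
    dJ_da = dJ_dy * dy_da
    dJ_dtheta = dJ_da * x            # da/dtheta_j = x_j
    dJ_dx = dJ_da * theta            # da/dx_j = theta_j
    return J, dJ_dtheta, dJ_dx

J, g_theta, g_x = logistic_regression_backprop(
    x=np.array([1.0, 2.0, -1.0]), y_star=1.0, theta=np.array([0.1, -0.2, 0.3]))
```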

(F) Loss:             J = ½ (y − y*)^2
(E) Output (sigmoid): y = 1 / (1 + exp(−b))
(D) Output (linear):  b = Σ_{j=0}^D β_j z_j
(C) Hidden (sigmoid): z_j = 1 / (1 + exp(−a_j)), ∀j
(B) Hidden (linear):  a_j = Σ_{i=0}^M α_{ji} x_i, ∀j
(A) Input:            Given x_i, ∀i

[Figure: Input … → Hidden Layer … → Output]

Training: Backpropagation

(Same network equations (A)–(F) as on the previous slide.)

Training: Backpropagation

Case 2: Neural Network

Forward                               Backward
J = y* log y + (1 − y*) log(1 − y)    dJ/dy = y*/y + (1 − y*)/(y − 1)
y = 1 / (1 + exp(−b))                 dJ/db = (dJ/dy)(dy/db),   dy/db = exp(−b) / (exp(−b) + 1)^2
b = Σ_{j=0}^D β_j z_j                 dJ/dβ_j = (dJ/db)(db/dβ_j),   db/dβ_j = z_j
                                      dJ/dz_j = (dJ/db)(db/dz_j),   db/dz_j = β_j
z_j = 1 / (1 + exp(−a_j))             dJ/da_j = (dJ/dz_j)(dz_j/da_j),   dz_j/da_j = exp(−a_j) / (exp(−a_j) + 1)^2
a_j = Σ_{i=0}^M α_{ji} x_i            dJ/dα_{ji} = (dJ/da_j)(da_j/dα_{ji}),   da_j/dα_{ji} = x_i
                                      dJ/dx_i = Σ_{j=0}^D (dJ/da_j)(da_j/dx_i),   da_j/dx_i = α_{ji}

Training: Chain Rule

Backpropagation:
1. Instantiate the computation as a directed acyclic graph, where each node represents a Tensor.
2. At each node, store (a) the quantity computed in the forward pass and (b) the partial derivative of the goal with respect to that node's Tensor.
3. Initialize all partial derivatives to 0.
4. Visit each node in reverse topological order. At each node, add its contribution to the partial derivatives of its parents.

This algorithm is also called automatic differentiation in the reverse-mode.

(Slides 100–101 repeat the textbook excerpt and chain-rule background shown on the earlier Training: Chain Rule slides.)

Training: Backpropagation

Case 2: Neural Network (grouped into modules)

Module 5:  J = y* log y + (1 − y*) log(1 − y);    dJ/dy = y*/y + (1 − y*)/(y − 1)
Module 4:  y = 1 / (1 + exp(−b));                 dJ/db = (dJ/dy)(dy/db)
Module 3:  b = Σ_{j=0}^D β_j z_j;                 dJ/dβ_j = (dJ/db) z_j,   dJ/dz_j = (dJ/db) β_j
Module 2:  z_j = 1 / (1 + exp(−a_j));             dJ/da_j = (dJ/dz_j)(dz_j/da_j)
Module 1:  a_j = Σ_{i=0}^M α_{ji} x_i;            dJ/dα_{ji} = (dJ/da_j) x_i,   dJ/dx_i = Σ_{j=0}^D (dJ/da_j) α_{ji}

Background: A Recipe for Machine Learning — Gradients
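Putting Case 2 together, a NumPy sketch of the full forward and backward pass for the one-hidden-layer network, with the module boundaries marked in comments; the bias terms are omitted here, which is my simplification:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def nn_forward_backward(x, y_star, alpha, beta):
    """x: (M,), y_star: scalar target in {0, 1}, alpha: (D, M), beta: (D,)."""
    # Forward pass
    a = alpha @ x                                              # Module 1
    z = sigmoid(a)                                             # Module 2
    b = beta @ z                                               # Module 3
    y = sigmoid(b)                                             # Module 4
    J = y_star * np.log(y) + (1 - y_star) * np.log(1 - y)      # Module 5
    # Backward pass (reverse order)
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)                # Module 5
    dJ_db = dJ_dy * y * (1 - y)                                # Module 4: dy/db = y(1 - y)
    dJ_dbeta = dJ_db * z                                       # Module 3
    dJ_dz = dJ_db * beta                                       # Module 3
    dJ_da = dJ_dz * z * (1 - z)                                # Module 2: dz/da = z(1 - z)
    dJ_dalpha = np.outer(dJ_da, x)                             # Module 1
    dJ_dx = alpha.T @ dJ_da                                    # Module 1 (sum over j)
    return J, dJ_dalpha, dJ_dbeta, dJ_dx
```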

Backpropagation can compute this gradient! And it's a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!

(Shown overlaid on the same recipe slide as before.)

Summary
1. Neural Networks…
– provide a way of learning features
– are highly nonlinear prediction functions
– (can be) a highly parallel network of logistic regression classifiers
– discover useful hidden representations of the input
2. Backpropagation…
– provides an efficient way to compute gradients
– is a special case of reverse-mode automatic differentiation

104