
10-601 Introduction to Machine Learning
Machine Learning Department, School of Computer Science, Carnegie Mellon University

Neural Networks and Backpropagation

Neural Net Readings: Murphy --, Bishop 5, HTF 11, Mitchell 4

Matt Gormley
Lecture 20
April 3, 2017

Reminders
• Homework 6:
  – Release: Wed, Mar. 22
  – Due: Mon, Apr. 03 at 11:59pm
• Homework 5 (Part II): Peer Review
  – Release: Wed, Mar. 29
  – Due: Wed, Apr. 05 at 11:59pm
  – Expectation: You should spend at most 1 hour on your reviews
• Peer Tutoring

Neural Networks Outline

Last Lecture (Recap):
• Neural Networks
  – …, Model, Learning
  – A Recipe for Machine Learning
  – Visual Notation for Neural Networks
  – Example: Logistic Regression Output Surface
  – 2-Layer Neural Network
  – 3-Layer Neural Network

This Lecture:
• Neural Net Architectures
  – Objective Functions
  – Activation Functions
• Backpropagation
  – Basic Chain Rule (of calculus)
  – Chain Rule for Arbitrary Computation Graph
  – Backpropagation
  – Module-based Automatic Differentiation (Autodiff)

DECISION BOUNDARY EXAMPLES

• Example #1: Diagonal Band
• Example #2: One Pocket
• Example #3: Four Gaussians
• Example #4: Two Pockets

[Each example was shown as a series of decision-boundary plots for neural networks with an increasing number of hidden units; the plots are not reproduced here.]

Error in slides: "layers" should read "number of hidden units".

All the neural networks in this section used 1 hidden layer.

ARCHITECTURES

Neural Network Architectures

Even for a basic Neural Network, there are many design decisions to make:
1. # of hidden layers (depth)
2. # of units per hidden layer (width)
3. Type of activation function (nonlinearity)
4. Form of objective function

Activation Functions

Neural Network with sigmoid activation functions:

[Diagram: input layer → hidden layer → output]

(F) Loss: J = ½ (y − y*)²
(E) Output (sigmoid): y = 1 / (1 + exp(−b))
(D) Output (linear): b = Σ_{j=0}^{D} β_j z_j
(C) Hidden (sigmoid): z_j = 1 / (1 + exp(−a_j)),  ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^{M} α_{ji} x_i,  ∀j
(A) Input: Given x_i,  ∀i
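To make the quantities (A)-(E) above concrete, here is a minimal NumPy sketch of the forward pass only. The variable names a, z, b, y mirror the slide; the weights alpha and beta, the sizes, and the example input are made-up values, and bias terms are omitted for brevity.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, alpha, beta):
    """Forward pass for the 2-layer sigmoid network sketched above."""
    a = alpha @ x      # (B) hidden linear:  a_j = sum_i alpha_ji * x_i
    z = sigmoid(a)     # (C) hidden sigmoid: z_j = 1 / (1 + exp(-a_j))
    b = beta @ z       # (D) output linear:  b   = sum_j beta_j * z_j
    y = sigmoid(b)     # (E) output sigmoid: y   = 1 / (1 + exp(-b))
    return y

# Example with made-up sizes and values
rng = np.random.default_rng(0)
x = rng.normal(size=3)
alpha = rng.normal(size=(4, 3))
beta = rng.normal(size=4)
print(forward(x, alpha, beta))
```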

Activation Functions

Neural Network with arbitrary nonlinear activation functions σ(·):

(F) Loss: J = ½ (y − y*)²
(E) Output (nonlinear): y = σ(b)
(D) Output (linear): b = Σ_{j=0}^{D} β_j z_j
(C) Hidden (nonlinear): z_j = σ(a_j),  ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^{M} α_{ji} x_i,  ∀j
(A) Input: Given x_i,  ∀i

Activation Functions

Sigmoid / logistic function: logistic(u) ≡ 1 / (1 + e^(−u))

So far, we've assumed that the activation function (nonlinearity) is always the sigmoid.

Activation Functions

• A new change: modifying the nonlinearity – The logistic is not widely used in modern ANNs

Alternate 1: tanh

Like logistic function but shifted to range [-1, +1]

Slide from William Cohen

[Figure: sigmoid vs. tanh, depth-4 networks, AISTATS 2010]

Figure from Glorot & Bengio (2010)

Activation Functions

• A new change: modifying the nonlinearity
  – ReLU often used in vision tasks

Alternate 2: rectified linear unit

Linear with a cutoff at zero

(Implementation: clip the gradient when you pass zero)

Slide from William Cohen

Activation Functions

• A new change: modifying the nonlinearity
  – ReLU often used in vision tasks

Alternate 2: rectified linear unit

Soft version: log(exp(x)+1)

• Doesn't saturate (at one end)
• Sparsifies outputs
• Helps with vanishing gradient

Slide from William Cohen
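As a small illustrative aside (not from the slides), the activation functions discussed above - logistic, tanh, ReLU, and its soft version (softplus) - are each a one-liner in NumPy:

```python
import numpy as np

def logistic(u):      # sigmoid: output in (0, 1), saturates at both ends
    return 1.0 / (1.0 + np.exp(-u))

def tanh(u):          # like the logistic function but shifted to the range (-1, +1)
    return np.tanh(u)

def relu(u):          # rectified linear unit: linear with a cutoff at zero
    return np.maximum(0.0, u)

def softplus(u):      # "soft" version of ReLU: log(exp(u) + 1)
    return np.log1p(np.exp(u))

u = np.linspace(-3, 3, 7)
for f in (logistic, tanh, relu, softplus):
    print(f.__name__, np.round(f(u), 3))
```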

Objective Functions for NNs
• Regression:
  – Use the same objective as Linear Regression
  – Quadratic loss (i.e. mean squared error)
• Classification:
  – Use the same objective as Logistic Regression
  – Cross-entropy (i.e. negative log likelihood)
  – This requires probabilities, so we add an additional "softmax" layer at the end of our network

Forward and backward computation for these two losses:

Quadratic:
  Forward:  J = ½ (y − y*)²
  Backward: dJ/dy = y − y*

Cross-entropy:
  Forward:  J = y* log y + (1 − y*) log(1 − y)
  Backward: dJ/dy = y*/y + (1 − y*)/(y − 1)
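A minimal sketch of the forward and backward computations in the table above, added for illustration; y_star denotes the true label y*:

```python
import numpy as np

def quadratic_forward(y, y_star):
    return 0.5 * (y - y_star) ** 2

def quadratic_backward(y, y_star):       # dJ/dy for the quadratic loss
    return y - y_star

def cross_entropy_forward(y, y_star):    # as written on the slide (no leading minus)
    return y_star * np.log(y) + (1 - y_star) * np.log(1 - y)

def cross_entropy_backward(y, y_star):   # dJ/dy for the cross-entropy objective
    return y_star / y + (1 - y_star) / (y - 1)

y, y_star = 0.8, 1.0
print(quadratic_forward(y, y_star), quadratic_backward(y, y_star))
print(cross_entropy_forward(y, y_star), cross_entropy_backward(y, y_star))
```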

Cross-entropy vs. Quadratic loss

Figure from Glorot & Bengio (2010)

Background: A Recipe for Machine Learning

1. Given training data:
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal:
4. Train with SGD:
   (take small steps opposite the gradient)

Objective Functions

Matching Quiz: Suppose you are given a neural net with a single output, y, and one hidden layer.

1) Minimizing sum of squared errors…
2) Minimizing sum of squared errors plus squared Euclidean norm of weights…
3) Minimizing cross-entropy…
4) Minimizing hinge loss…

…gives…

5) …MLE estimates of weights assuming target follows a Bernoulli with parameter given by the output value
6) …MAP estimates of weights assuming weight priors are zero mean Gaussian
7) …estimates with a large margin on the training data
8) …MLE estimates of weights assuming zero mean Gaussian noise on the output value

A. 1=5, 2=7, 3=6, 4=8
B. 1=5, 2=7, 3=8, 4=6
C. 1=7, 2=5, 3=5, 4=7
D. 1=7, 2=5, 3=6, 4=8
E. 1=8, 2=6, 3=5, 4=7
F. 1=8, 2=6, 3=8, 4=6

BACKPROPAGATION

Background: A Recipe for Machine Learning

1. Given training data:
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal:
4. Train with SGD:
   (take small steps opposite the gradient)

Training: Approaches to Differentiation

• Question 1: When can we compute the gradients of the parameters of an arbitrary neural network?

• Question 2: When can we make the gradient computation efficient?

Training: Approaches to Differentiation

1. Finite Difference Method
   – Pro: Great for testing implementations of backpropagation
   – Con: Slow for high dimensional inputs / outputs
   – Required: Ability to call the function f(x) on any input x
2. Symbolic Differentiation
   – Note: The method you learned in high-school
   – Note: Used by Mathematica / Wolfram Alpha / Maple
   – Pro: Yields easily interpretable derivatives
   – Con: Leads to exponential computation time if not carefully implemented
   – Required: Mathematical expression that defines f(x)
3. Automatic Differentiation - Reverse Mode
   – Note: Called Backpropagation when applied to Neural Nets
   – Pro: Computes partial derivatives of one output f(x)_i with respect to all inputs x_j in time proportional to computation of f(x)
   – Con: Slow for high dimensional outputs (e.g. vector-valued functions)
   – Required: Algorithm for computing f(x)
4. Automatic Differentiation - Forward Mode
   – Note: Easy to implement. Uses dual numbers.
   – Pro: Computes partial derivatives of all outputs f(x)_i with respect to one input x_j in time proportional to computation of f(x)
   – Con: Slow for high dimensional inputs (e.g. vector-valued x)
   – Required: Algorithm for computing f(x)

Training: Finite Difference Method

Notes:
• Suffers from issues of floating point precision, in practice
• Typically only appropriate to use on small examples with an appropriately chosen epsilon
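As a sketch of how the finite difference method is typically used to test a backpropagation implementation, here is a centered-difference gradient check; the function name, the epsilon value, and the test function are illustrative choices, not from the lecture:

```python
import numpy as np

def finite_difference_grad(f, theta, epsilon=1e-5):
    """Approximate dJ/dtheta_i one coordinate at a time:
    (f(theta + eps*e_i) - f(theta - eps*e_i)) / (2*eps)."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = epsilon
        grad[i] = (f(theta + e) - f(theta - e)) / (2 * epsilon)
    return grad

# Example: check against a known gradient for f(theta) = sum(theta**2)
f = lambda th: np.sum(th ** 2)
theta = np.array([1.0, -2.0, 0.5])
print(finite_difference_grad(f, theta))   # approx [2, -4, 1]
print(2 * theta)                          # exact gradient, for comparison
```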

Training: Symbolic Differentiation

Calculus Quiz #1: Suppose x = 2 and z = 3, what are dy/dx and dy/dz for the function below?

Training: Symbolic Differentiation

Calculus Quiz #2:
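The quiz expressions themselves are not reproduced above; purely as an illustration of symbolic differentiation (the approach used by Mathematica / Wolfram Alpha / Maple), here is a short sympy sketch on a made-up function of x and z:

```python
import sympy as sp

x, z = sp.symbols('x z')
y = sp.exp(x * z) + x * sp.sin(z)   # made-up example function, not the quiz

dy_dx = sp.diff(y, x)               # symbolic partial derivative w.r.t. x
dy_dz = sp.diff(y, z)               # symbolic partial derivative w.r.t. z
print(dy_dx)                        # z*exp(x*z) + sin(z)
print(dy_dz)                        # x*exp(x*z) + x*cos(z)

# Evaluate at a quiz-style point x = 2, z = 3
print(dy_dx.subs({x: 2, z: 3}).evalf(), dy_dz.subs({x: 2, z: 3}).evalf())
```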

Training: Chain Rule

Whiteboard – Chain Rule of Calculus

Training: Chain Rule

2.2 NEURAL NETWORKS AND BACKPROPAGATION

…x to J, but also a manner of carrying out that computation in terms of the intermediate quantities a, z, b, y. Which intermediate quantities to use is a design decision. In this way, the arithmetic circuit diagram of Figure 2.1 is differentiated from the standard neural network diagram in two ways. A standard diagram for a neural network does not show this choice of intermediate quantities nor the form of the computations. The topologies presented in this section are very simple. However, we will see later (Chapter 5) how an entire algorithm can define an arithmetic circuit.

2.2.2 Backpropagation

The backpropagation algorithm (Rumelhart et al., 1986) is a general method for computing the gradient of a neural network. Here we generalize the concept of a neural network to include any arithmetic circuit. Applying the backpropagation algorithm on these circuits amounts to repeated application of the chain rule. This general algorithm goes under many other names: automatic differentiation (AD) in the reverse mode (Griewank and Corliss, 1991), analytic differentiation, module-based AD, autodiff, etc. Below we define a forward pass, which computes the output bottom-up, and a backward pass, which computes the derivatives of all intermediate quantities top-down.

Chain Rule  At the core of the backpropagation algorithm is the chain rule. The chain rule allows us to differentiate a function f defined as the composition of two functions g and h such that f = (g ∘ h). If the inputs and outputs of g and h are vector-valued variables then f is as well: h : R^K → R^J and g : R^J → R^I  ⇒  f : R^K → R^I. Given an input vector x = {x_1, x_2, …, x_K}, we compute the output y = {y_1, y_2, …, y_I} in terms of an intermediate vector u = {u_1, u_2, …, u_J}. That is, the computation y = f(x) = g(h(x)) can be described in a feed-forward manner: y = g(u) and u = h(x). Then the chain rule must sum over all the intermediate quantities:

  dy_i/dx_k = Σ_{j=1}^{J} (dy_i/du_j)(du_j/dx_k),  ∀i, k    (2.3)

If the inputs and outputs of f, g, and h are all scalars, then we obtain the familiar form of the chain rule:

  dy/dx = (dy/du)(du/dx)    (2.4)

Backpropagation is just repeated application of the chain rule from Calculus 101.

Binary Logistic Regression  Binary logistic regression can be interpreted as an arithmetic circuit. To compute the derivative of some loss function (below we use regression) with respect to the model parameters θ, we can repeatedly apply the chain rule (i.e. backpropagation). Note that the output q below is the probability that the output label takes on the value 1. y* is the true output label. The forward pass computes the following:

  J = y* log q + (1 − y*) log(1 − q)    (2.5)

where

  q = P_θ(Y_i = 1 | x) = 1 / (1 + exp(−Σ_{j=0}^{D} θ_j x_j))    (2.6)

Training: Backpropagation

Whiteboard – Example: Backpropagation for Calculus Quiz #1

Calculus Quiz #1: Suppose x = 2 and z = 3, what are dy/dx and dy/dz for the function below?

Training: Backpropagation

Automatic Differentiation - Reverse Mode (aka. Backpropagation)

Forward Computation
1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the "computation graph").
2. Visit each node in topological order. For variable u_i with inputs v_1, …, v_N:
   a. Compute u_i = g_i(v_1, …, v_N)
   b. Store the result at the node

Backward Computation
1. Initialize all partial derivatives dy/du_j to 0 and dy/dy = 1.
2. Visit each node in reverse topological order. For variable u_i = g_i(v_1, …, v_N):
   a. We already know dy/du_i
   b. Increment dy/dv_j by (dy/du_i)(du_i/dv_j)
      (Choice of algorithm ensures computing (du_i/dv_j) is easy)

Return partial derivatives dy/du_i for all variables.

Training: Backpropagation

Simple Example: The goal is to compute J = cos(sin(x²) + 3x²) on the forward pass and the derivative dJ/dx on the backward pass.

Forward: J = cos(u)
Backward: dJ/du = −sin(u)

Forward: u = u1 + u2
Backward: dJ/du1 = (dJ/du)(du/du1), where du/du1 = 1
          dJ/du2 = (dJ/du)(du/du2), where du/du2 = 1

Forward: u1 = sin(t)
Backward: dJ/dt = (dJ/du1)(du1/dt), where du1/dt = cos(t)

Forward: u2 = 3t
Backward: dJ/dt = (dJ/du2)(du2/dt), where du2/dt = 3
          (the two contributions to dJ/dt are summed)

Forward: t = x²
Backward: dJ/dx = (dJ/dt)(dt/dx), where dt/dx = 2x
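The same computation written as code: a forward pass that stores the intermediate quantities t, u1, u2, u, and a backward pass that applies the chain rule row by row, summing the two contributions to dJ/dt. This is an added sketch (the input value is made up), with a finite-difference check at the end:

```python
import numpy as np

def forward_backward(x):
    # Forward pass: compute and store the intermediate quantities
    t  = x ** 2                 # t  = x^2
    u1 = np.sin(t)              # u1 = sin(t)
    u2 = 3 * t                  # u2 = 3t
    u  = u1 + u2                # u  = u1 + u2
    J  = np.cos(u)              # J  = cos(u)

    # Backward pass: apply the chain rule in reverse topological order
    dJ_du  = -np.sin(u)                  # dJ/du
    dJ_du1 = dJ_du * 1.0                 # du/du1 = 1
    dJ_du2 = dJ_du * 1.0                 # du/du2 = 1
    dJ_dt  = dJ_du1 * np.cos(t) + dJ_du2 * 3.0   # contributions from u1 and u2 summed
    dJ_dx  = dJ_dt * 2 * x               # dt/dx = 2x
    return J, dJ_dx

x = 1.5                                  # made-up input value
J, dJ_dx = forward_backward(x)
print(J, dJ_dx)

# Sanity check against a finite difference
eps = 1e-6
f = lambda x: np.cos(np.sin(x**2) + 3*x**2)
print((f(x + eps) - f(x - eps)) / (2 * eps))
```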


Training: Backpropagation

Whiteboard – SGD for Neural Network – Example: Backpropagation for Neural Network

Training: Backpropagation

Case 1: Logistic Regression

[Diagram: inputs x_1, …, x_M feed a single output unit y through parameters θ_1, …, θ_M.]

Forward: J = y* log y + (1 − y*) log(1 − y)
Backward: dJ/dy = y*/y + (1 − y*)/(y − 1)

Forward: y = 1 / (1 + exp(−a))
Backward: dJ/da = (dJ/dy)(dy/da), where dy/da = exp(−a) / (exp(−a) + 1)²

Forward: a = Σ_{j=0}^{D} θ_j x_j
Backward: dJ/dθ_j = (dJ/da)(da/dθ_j), where da/dθ_j = x_j
          dJ/dx_j = (dJ/da)(da/dx_j), where da/dx_j = θ_j
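A minimal NumPy sketch of this Case 1 forward/backward pass; the variable names follow the slide, and the example values of x, theta, and y_star are made up:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def logreg_forward_backward(x, theta, y_star):
    # Forward pass
    a = theta @ x                                   # a = sum_j theta_j * x_j
    y = sigmoid(a)                                  # y = 1 / (1 + exp(-a))
    J = y_star * np.log(y) + (1 - y_star) * np.log(1 - y)

    # Backward pass (chain rule, top to bottom)
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)
    dy_da = np.exp(-a) / (np.exp(-a) + 1) ** 2      # equivalently y * (1 - y)
    dJ_da = dJ_dy * dy_da
    dJ_dtheta = dJ_da * x                           # da/dtheta_j = x_j
    dJ_dx = dJ_da * theta                           # da/dx_j = theta_j
    return J, dJ_dtheta, dJ_dx

x = np.array([1.0, 0.5, -1.2])        # made-up example
theta = np.array([0.2, -0.4, 0.1])
print(logreg_forward_backward(x, theta, y_star=1.0))
```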

Training: Backpropagation

(F) Loss: J = ½ (y − y^(d))²
(E) Output (sigmoid): y = 1 / (1 + exp(−b))
(D) Output (linear): b = Σ_{j=0}^{D} β_j z_j
(C) Hidden (sigmoid): z_j = 1 / (1 + exp(−a_j)),  ∀j
(B) Hidden (linear): a_j = Σ_{i=0}^{M} α_{ji} x_i,  ∀j
(A) Input: Given x_i,  ∀i


Training: Backpropagation

Case 2: Neural Network

Forward: J = y* log y + (1 − y*) log(1 − y)
Backward: dJ/dy = y*/y + (1 − y*)/(y − 1)

Forward: y = 1 / (1 + exp(−b))
Backward: dJ/db = (dJ/dy)(dy/db), where dy/db = exp(−b) / (exp(−b) + 1)²

Forward: b = Σ_{j=0}^{D} β_j z_j
Backward: dJ/dβ_j = (dJ/db)(db/dβ_j), where db/dβ_j = z_j
          dJ/dz_j = (dJ/db)(db/dz_j), where db/dz_j = β_j

Forward: z_j = 1 / (1 + exp(−a_j))
Backward: dJ/da_j = (dJ/dz_j)(dz_j/da_j), where dz_j/da_j = exp(−a_j) / (exp(−a_j) + 1)²

Forward: a_j = Σ_{i=0}^{M} α_{ji} x_i
Backward: dJ/dα_{ji} = (dJ/da_j)(da_j/dα_{ji}), where da_j/dα_{ji} = x_i
          dJ/dx_i = Σ_{j=0}^{D} (dJ/da_j)(da_j/dx_i), where da_j/dx_i = α_{ji}
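Putting the Case 2 rows together, here is a hedged NumPy sketch of one full forward and backward pass through the 2-layer network; alpha, beta, the sizes, and the example input are illustrative, and bias terms are omitted:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def nn_forward_backward(x, alpha, beta, y_star):
    # Forward pass
    a = alpha @ x                 # a_j = sum_i alpha_ji * x_i
    z = sigmoid(a)                # z_j = sigmoid(a_j)
    b = beta @ z                  # b   = sum_j beta_j * z_j
    y = sigmoid(b)                # y   = sigmoid(b)
    J = y_star * np.log(y) + (1 - y_star) * np.log(1 - y)

    # Backward pass (chain rule, top to bottom)
    dJ_dy = y_star / y + (1 - y_star) / (y - 1)
    dJ_db = dJ_dy * y * (1 - y)          # dy/db = sigmoid'(b)
    dJ_dbeta = dJ_db * z                 # db/dbeta_j = z_j
    dJ_dz = dJ_db * beta                 # db/dz_j    = beta_j
    dJ_da = dJ_dz * z * (1 - z)          # dz_j/da_j  = sigmoid'(a_j)
    dJ_dalpha = np.outer(dJ_da, x)       # da_j/dalpha_ji = x_i
    dJ_dx = alpha.T @ dJ_da              # dJ/dx_i = sum_j dJ/da_j * alpha_ji
    return J, dJ_dalpha, dJ_dbeta

rng = np.random.default_rng(1)
x = rng.normal(size=3)
alpha = rng.normal(size=(4, 3))
beta = rng.normal(size=4)
print(nn_forward_backward(x, alpha, beta, y_star=1.0))
```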

Training: Backpropagation

Backpropagation (Auto. Diff. - Reverse Mode)

Forward Computation
1. Write an algorithm for evaluating the function y = f(x). The algorithm defines a directed acyclic graph, where each variable is a node (i.e. the "computation graph").
2. Visit each node in topological order.
   a. Compute the corresponding variable's value
   b. Store the result at the node

Backward Computation
3. Initialize all partial derivatives dy/du_j to 0 and dy/dy = 1.
4. Visit each node in reverse topological order. For variable u_i = g_i(v_1, …, v_N):
   a. We already know dy/du_i
   b. Increment dy/dv_j by (dy/du_i)(du_i/dv_j)
      (Choice of algorithm ensures computing (du_i/dv_j) is easy)

Return partial derivatives dy/du_i for all variables.
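A compact sketch of this graph-based procedure: each node stores its value, its parent nodes, and the local partial derivatives du_i/dv_j; the backward pass walks the nodes in reverse topological order and increments dy/dv_j exactly as in step 4b. The Node class and helper functions are illustrative, not a standard library API.

```python
import math

class Node:
    """One variable u_i in the computation graph."""
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value              # result of u_i = g_i(v_1, ..., v_N)
        self.parents = parents          # input nodes v_1, ..., v_N
        self.local_grads = local_grads  # local partials du_i/dv_j, one per parent
        self.grad = 0.0                 # dy/du_i, initialized to 0

# A few primitive operations; each records its local derivatives.
def add(a, b): return Node(a.value + b.value, (a, b), (1.0, 1.0))
def mul(a, b): return Node(a.value * b.value, (a, b), (b.value, a.value))
def sin(a):    return Node(math.sin(a.value), (a,), (math.cos(a.value),))
def cos(a):    return Node(math.cos(a.value), (a,), (-math.sin(a.value),))

def topological_order(y):
    order, seen = [], set()
    def visit(u):
        if id(u) not in seen:
            seen.add(id(u))
            for v in u.parents:
                visit(v)
            order.append(u)         # parents always appear before their children
    visit(y)
    return order

def backward(y):
    y.grad = 1.0                    # dy/dy = 1
    for u in reversed(topological_order(y)):          # reverse topological order
        for v, du_dv in zip(u.parents, u.local_grads):
            v.grad += u.grad * du_dv                   # increment dy/dv_j by (dy/du_i)(du_i/dv_j)

# The simple example from the slides: J = cos(sin(x^2) + 3*x^2), at a made-up x = 1.5
x = Node(1.5)
t = mul(x, x)
J = cos(add(sin(t), mul(Node(3.0), t)))
backward(J)
print(J.value, x.grad)
```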

Training: Backpropagation

Case 2: Neural Network, broken into modules

Module 5: Forward J = y* log y + (1 − y*) log(1 − y); Backward dJ/dy = y*/y + (1 − y*)/(y − 1)
Module 4: Forward y = 1 / (1 + exp(−b)); Backward dJ/db = (dJ/dy)(dy/db), where dy/db = exp(−b) / (exp(−b) + 1)²
Module 3: Forward b = Σ_{j=0}^{D} β_j z_j; Backward dJ/dβ_j = (dJ/db) z_j and dJ/dz_j = (dJ/db) β_j
Module 2: Forward z_j = 1 / (1 + exp(−a_j)); Backward dJ/da_j = (dJ/dz_j)(dz_j/da_j), where dz_j/da_j = exp(−a_j) / (exp(−a_j) + 1)²
Module 1: Forward a_j = Σ_{i=0}^{M} α_{ji} x_i; Backward dJ/dα_{ji} = (dJ/da_j) x_i and dJ/dx_i = Σ_{j=0}^{D} (dJ/da_j) α_{ji}
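A brief sketch of the module-based view of automatic differentiation: each module caches what it needs in forward() and maps dJ/d(output) back to dJ/d(input), plus gradients of its own parameters, in backward(). The class names and the particular composition below are illustrative, not from the lecture.

```python
import numpy as np

class Linear:
    """Module computing out = W @ x (no bias, for brevity)."""
    def __init__(self, W):
        self.W = W
    def forward(self, x):
        self.x = x                            # cache the input for the backward pass
        return self.W @ x
    def backward(self, d_out):
        self.dW = np.outer(d_out, self.x)     # dJ/dW
        return self.W.T @ d_out               # dJ/dx

class Sigmoid:
    def forward(self, a):
        self.z = 1.0 / (1.0 + np.exp(-a))
        return self.z
    def backward(self, d_out):
        return d_out * self.z * (1.0 - self.z)   # chain through sigmoid'(a)

# Compose modules 1-4 of the slide; module 5 (the loss) is applied at the end.
rng = np.random.default_rng(2)
modules = [Linear(rng.normal(size=(4, 3))), Sigmoid(),
           Linear(rng.normal(size=(1, 4))), Sigmoid()]

x, y_star = rng.normal(size=3), 1.0
out = x
for m in modules:                  # forward pass, module by module
    out = m.forward(out)
y = out[0]

d = np.array([y_star / y + (1 - y_star) / (y - 1)])   # dJ/dy from the loss module
for m in reversed(modules):        # backward pass in reverse order
    d = m.backward(d)
```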

Background: A Recipe for Machine Learning (Gradients)

1. Given training data:
2. Choose each of these:
   – Decision function
   – Loss function
3. Define goal:
4. Train with SGD:
   (take small steps opposite the gradient)

Backpropagation can compute this gradient! And it's a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!

Summary
1. Neural Networks…
   – provide a way of learning features
   – are highly nonlinear prediction functions
   – (can be) a highly parallel network of logistic regression classifiers
   – discover useful hidden representations of the input
2. Backpropagation…
   – provides an efficient way to compute gradients
   – is a special case of reverse-mode automatic differentiation
