
10-601 Introduction to Machine Learning
Machine Learning Department
School of Computer Science
Carnegie Mellon University

Neural Networks and Backpropagation

Neural Net Readings: Murphy --, Bishop 5, HTF 11, Mitchell 4

Matt Gormley
Lecture 19
March 29, 2017

1 Reminders
• Homework 6:
  – Release: Wed, Mar. 22
  – Due: Mon, Apr. 03 at 11:59pm
• Homework 5 (Part II): Peer Review
  – Release: Wed, Mar. 29
  – Due: Wed, Apr. 05 at 11:59pm
  – Expectation: You should spend at most 1 hour on your reviews
• Peer Tutoring

2 Neural Networks Outline

• Logistic Regression (Recap)
  – Data, Model, Learning, Prediction
• Neural Networks
  – A Recipe for Machine Learning
  – Visual Notation for Neural Networks
  – Example: Logistic Regression Output Surface
  – 2-Layer Neural Network
  – 3-Layer Neural Network
• Neural Net Architectures
  – Objective Functions
  – Activation Functions
• Backpropagation
  – Basic Chain Rule (of calculus)
  – Chain Rule for Arbitrary Computation Graph
  – Backpropagation
  – Module-based Automatic Differentiation (Autodiff)

3 RECALL: LOGISTIC REGRESSION

4 Using gradient ascent for linear classifiers

Key idea behind today's lecture:
1. Define a linear classifier (logistic regression)
2. Define an objective function (likelihood)
3. Optimize it with gradient ascent to learn parameters
4. Predict the class with highest probability under the model

5 Using gradient ascent for linear classifiers

This decision function isn't differentiable:
h(x) = sign(θ^T x)

Use a differentiable function instead:
p(y = 1 | x) = 1 / (1 + exp(-θ^T x))

[Plots: sign(x) and logistic(u) ≡ 1 / (1 + e^(-u))]

7 Logistic Regression

Data: Inputs are continuous vectors of length K. Outputs are discrete.
D = {(x^(i), y^(i))}_{i=1}^N where x ∈ R^K and y ∈ {0, 1}

Model: Logistic function applied to a dot product of the parameters with the input vector.
p(y = 1 | x) = 1 / (1 + exp(-θ^T x))

Learning: Finds the parameters that minimize some objective function.
θ* = argmin_θ J(θ)

Prediction: Output is the most probable class.
ŷ = argmax_{y ∈ {0,1}} p(y | x)

8
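Below is a minimal NumPy sketch of the four pieces above (model, objective, learning by gradient ascent on the log-likelihood, prediction). The function names, learning rate, and epoch count are illustrative choices, not part of the slides.

```python
import numpy as np

def sigmoid(a):
    # logistic function: 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def predict_proba(theta, x):
    # Model: p(y = 1 | x) = sigmoid(theta^T x)
    return sigmoid(theta @ x)

def predict(theta, x):
    # Prediction: output the most probable class
    return 1 if predict_proba(theta, x) >= 0.5 else 0

def learn(X, y, lr=0.1, epochs=100):
    # Learning: gradient ascent on the conditional log-likelihood
    # (equivalently, minimizing J(theta) = negative log-likelihood)
    N, K = X.shape
    theta = np.zeros(K)
    for _ in range(epochs):
        p = sigmoid(X @ theta)               # p(y = 1 | x^(i)) for all N examples
        theta += lr * (X.T @ (y - p)) / N    # step in the direction of the gradient
    return theta
```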

NEURAL NETWORKS

9 Background: A Recipe for Machine Learning

1. Given training data: [example images labeled Face / Face / Not a face]

2. Choose each of these:
– Decision function (Examples: Linear regression, Logistic regression, Neural Network)
– Loss function (Examples: Mean-squared error, Cross Entropy)

10 Background: A Recipe for Machine Learning

1. Given training data: {(x_i, y_i)}, i = 1, ..., N

2. Choose each of these:
– Decision function: ŷ = f_θ(x_i)
– Loss function: ℓ(ŷ, y_i)

3. Define goal: find the parameters that minimize the total loss on the training data,
θ* = argmin_θ Σ_{i=1}^N ℓ(f_θ(x_i), y_i)

4. Train with SGD (take small steps opposite the gradient):
θ^(t+1) = θ^(t) - η_t ∇ℓ(f_θ(x_i), y_i)
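As a rough illustration of steps 3 and 4, here is a generic SGD loop. The `grad_loss` callback, the toy least-squares usage, and the hyperparameters are hypothetical placeholders; the slides do not prescribe a particular implementation.

```python
import random
import numpy as np

def sgd(data, grad_loss, theta0, lr=0.01, epochs=50):
    """Step 4 of the recipe: take small steps opposite the gradient of the
    loss on one training example at a time.

    grad_loss(theta, x, y) should return d loss(f_theta(x), y) / d theta.
    """
    theta = theta0.copy()
    for _ in range(epochs):
        random.shuffle(data)                        # visit examples in random order
        for x, y in data:
            theta = theta - lr * grad_loss(theta, x, y)
    return theta

# toy usage: least-squares regression, loss = 0.5 * (theta^T x - y)^2,
# whose per-example gradient is (theta^T x - y) * x
data = [(np.array([1.0, xi]), 3.0 + 2.0 * xi) for xi in np.linspace(0, 1, 20)]
theta = sgd(data, lambda t, x, y: (t @ x - y) * x, theta0=np.zeros(2), lr=0.1)
print(theta)   # approaches [3, 2]
```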

11 Background: A Recipe for Machine Learning (Gradients)

Backpropagation can compute this gradient! And it's a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!
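To make that claim concrete, the sketch below compares a hand-derived chain-rule gradient (the quantity backpropagation would compute) with a finite-difference estimate, for a single logistic unit under squared error. This is only a gradient-check illustration, not the backpropagation algorithm itself; all names and values are made up.

```python
import numpy as np

def loss(theta, x, y):
    # squared error on one example: J = 0.5 * (sigmoid(theta^T x) - y)^2
    p = 1.0 / (1.0 + np.exp(-theta @ x))
    return 0.5 * (p - y) ** 2

def analytic_grad(theta, x, y):
    # chain rule: dJ/dtheta = (p - y) * p * (1 - p) * x
    p = 1.0 / (1.0 + np.exp(-theta @ x))
    return (p - y) * p * (1 - p) * x

def numerical_grad(theta, x, y, eps=1e-5):
    # centered finite differences, one coordinate at a time
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (loss(theta + e, x, y) - loss(theta - e, x, y)) / (2 * eps)
    return g

theta = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 0.5, -0.25])
print(np.allclose(analytic_grad(theta, x, 1.0), numerical_grad(theta, x, 1.0), atol=1e-6))
```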

12 Background: A Recipe for Machine Learning (Goals for Today's Lecture)

1. Explore a new class of decision functions (Neural Networks)
2. Consider variants of this recipe for training

13 Decision Functions Linear Regression

y = h_θ(x) = σ(θ^T x), where σ(a) = a

[Figure: a single output node y computed from inputs x_1, ..., x_M via weights θ_1, θ_2, θ_3, ..., θ_M]
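This slide and the next stress that linear and logistic regression share the same template y = σ(θ^T x) and differ only in the choice of σ. A tiny sketch of that idea, with made-up weights and input:

```python
import numpy as np

def h(theta, x, sigma):
    # both decision functions share the same form: y = sigma(theta^T x)
    return sigma(theta @ x)

identity = lambda a: a                           # sigma(a) = a            -> linear regression
logistic = lambda a: 1.0 / (1.0 + np.exp(-a))    # sigma(a) = 1/(1+e^(-a)) -> logistic regression

theta = np.array([0.2, -0.5, 1.0])               # made-up weights
x = np.array([1.0, 2.0, 3.0])                    # made-up input
print(h(theta, x, identity))                     # linear regression output: 2.2
print(h(theta, x, logistic))                     # logistic regression output: ~0.90
```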

14 Decision Functions Logistic Regression

y = h_θ(x) = σ(θ^T x), where σ(a) = 1 / (1 + exp(-a))

[Figure: output node y computed from inputs x_1, ..., x_M via weights θ_1, θ_2, θ_3, ..., θ_M]

15 Decision Functions Logistic Regression

y = h_θ(x) = σ(θ^T x), where σ(a) = 1 / (1 + exp(-a))

[Figure: the same network, with example training images labeled Face / Face / Not a face as inputs and weights θ_1, θ_2, θ_3, ..., θ_M]

16 Decision Functions Logistic Regression

y = h_θ(x) = σ(θ^T x), where σ(a) = 1 / (1 + exp(-a))

In-Class Example

[Figure: inputs x_1, x_2 with example labels 1, 1, 0; output y; weights θ_1, θ_2, θ_3, ..., θ_M]

17 Decision Functions

y = h_θ(x) = σ(θ^T x), where σ(a) = 1 / (1 + exp(-a))

[Figure: output node y computed from inputs x_1, ..., x_M via weights θ_1, θ_2, θ_3, ..., θ_M]

18 From Biological to Artificial

The motivation for Artificial Neural Networks comes from biology:

Biological "Model"
• Neuron: an excitable cell
• Synapse: connection between neurons
• A neuron sends an electrochemical pulse along its synapses when a sufficient voltage change occurs
• Biological Neural Network: collection of neurons along some pathway through the brain

Artificial Model
• Neuron: node in a directed acyclic graph (DAG)
• Weight: multiplier on each edge
• Activation Function: nonlinear thresholding function, which allows a neuron to "fire" when the input value is sufficiently high
• Artificial Neural Network: collection of neurons into a DAG, which define some differentiable function

Biological "Computation"
• Neuron switching time: ~0.001 sec
• Number of neurons: ~10^10
• Connections per neuron: ~10^4 to 10^5
• Scene recognition time: ~0.1 sec

Artificial Computation
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processes

Slide adapted from Eric Xing

19 Decision Functions Logistic Regression

y = h_θ(x) = σ(θ^T x), where σ(a) = 1 / (1 + exp(-a))

[Figure: output node y computed from inputs x_1, ..., x_M via weights θ_1, θ_2, θ_3, ..., θ_M]

20 Neural Networks

Whiteboard – Example: Neural Network w/1 Hidden Layer – Example: Neural Network w/2 Hidden Layers – Example: Feed Forward Neural Network

21 Neural Network Model

[Figure: inputs Age = 34, Gender = 2, Stage = 4 feed two sigmoid (Σ) hidden units, whose outputs feed a sigmoid output unit producing 0.6, the "probability of being alive"; edges carry weights]

Independent variables | Weights | Hidden layer | Weights | Dependent variable (prediction)

© Eric Xing @ CMU, 2006-2011 22 “Combined logistic models”

[Figure: a single logistic unit; inputs Age = 34, Gender = 2, Stage = 4 with weights into one sigmoid (Σ) unit, output 0.6, the "probability of being alive"]

Independent variables | Weights | Hidden layer | Weights | Dependent variable (prediction)

© Eric Xing @ CMU, 2006-2011

23 [Figure: another single logistic unit over the same inputs, with different weights, output 0.6]

Independent variables | Weights | Hidden layer | Weights | Dependent variable (prediction)

© Eric Xing @ CMU, 2006-2011

24 [Figure: a single logistic unit over the inputs, output 0.6, the "probability of being alive"]

Independent variables | Weights | Hidden layer | Weights | Dependent variable (prediction)

© Eric Xing @ CMU, 2006-2011 25 Not really, no target for hidden units...

[Figure: the full network again; inputs Age = 34, Gender = 2, Stage = 4, two sigmoid (Σ) hidden units, sigmoid output 0.6, the "probability of being alive"]

Independent variables | Weights | Hidden layer | Weights | Dependent variable (prediction)

© Eric Xing @ CMU, 2006-2011 26 Decision Functions Neural Network

Output

Hidden Layer …

Input …

28 Decision Functions Neural Network

(F) Loss: J = (1/2) (y - y^(d))^2   (squared error against the true label y^(d))

(E) Output (sigmoid): y = 1 / (1 + exp(-b))

(D) Output (linear): b = Σ_{j=0}^{D} β_j z_j

(C) Hidden (sigmoid): z_j = 1 / (1 + exp(-a_j)), ∀j

(B) Hidden (linear): a_j = Σ_{i=0}^{M} α_{ji} x_i, ∀j

(A) Input: Given x_i, ∀i

[Figure: the Input - Hidden Layer - Output network, annotated with quantities (A)-(F)]

29
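A minimal NumPy sketch of the forward computation (A)-(F) above, assuming x_0 = 1 and z_0 = 1 act as bias terms (as the sums starting at 0 suggest); the sizes and weights in the example are made up.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, alpha, beta, y_true):
    """Forward pass for the network defined by (A)-(F).

    x:     inputs, shape (M+1,), with x[0] = 1 acting as the bias term
    alpha: hidden-layer weights, shape (D, M+1)
    beta:  output weights, shape (D+1,), with beta[0] multiplying z_0 = 1
    """
    a = alpha @ x                       # (B) hidden (linear):  a_j = sum_i alpha_ji x_i
    z = sigmoid(a)                      # (C) hidden (sigmoid): z_j = 1 / (1 + exp(-a_j))
    z = np.concatenate(([1.0], z))      # prepend z_0 = 1 (bias for the output layer)
    b = beta @ z                        # (D) output (linear):  b = sum_j beta_j z_j
    y = sigmoid(b)                      # (E) output (sigmoid): y = 1 / (1 + exp(-b))
    J = 0.5 * (y - y_true) ** 2         # (F) loss:             J = (1/2)(y - y^(d))^2
    return y, J

# made-up sizes and weights: M = 2 inputs (plus bias), D = 3 hidden units
rng = np.random.default_rng(0)
x = np.array([1.0, 0.5, -1.2])
alpha = rng.normal(scale=0.1, size=(3, 3))
beta = rng.normal(scale=0.1, size=4)
print(forward(x, alpha, beta, y_true=1.0))
```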

NEURAL NETWORKS: REPRESENTATIONAL POWER

30 Building a Neural Net

Q: How many hidden units, D, should we use?

Output

Features …

31 Building a Neural Net

Q: How many hidden units, D, should we use?

Output

Hidden Layer … D = M


Input …

32 Building a Neural Net

Q: How many hidden units, D, should we use?

Output

Hidden Layer … D = M

Input …

33 Building a Neural Net

Q: How many hidden units, D, should we use?

Output

Hidden Layer … D = M

Input …

34 Building a Neural Net

Q: How many hidden units, D, should we use?

Output

What method(s) is this setting similar to?

Hidden Layer … D < M

Input …

35 Building a Neural Net

Q: How many hidden units, D, should we use?

Output

Hidden Layer … D > M

What method(s) is this setting similar to?

Input …

36 Decision Functions Deeper Networks

Q: How many layers should we use?

Output

… Hidden Layer 1

… Input

37 Decision Functions Deeper Networks

Q: How many layers should we use?

Output

… Hidden Layer 2

… Hidden Layer 1

… Input

38 Decision Functions Deeper Networks

Q: How many layers should we use? Output

… Hidden Layer 3

… Hidden Layer 2

… Hidden Layer 1

… Input

39 Decision Functions Deeper Networks

Q: How many layers should we use?

• Theoretical answer:
  – A neural network with 1 hidden layer is a universal function approximator
  – Cybenko (1989): For any g(x), there exists a 1-hidden-layer neural net hθ(x) s.t. | hθ(x) - g(x) | < ϵ for all x, assuming sigmoid activation functions

• Empirical answer:
  – Before 2006: "Deep networks (e.g. 3 or more hidden layers) are too hard to train"
  – After 2006: "Deep networks are easier to train than shallow networks (e.g. 2 or fewer layers) for many problems"

Big caveat: You need to know and use the right tricks.

40
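To give the universal-approximation claim some intuition, the sketch below approximates g(x) = sin(x) with a linear combination of sigmoid hidden units: each pair of steep sigmoids forms a narrow "bump", and the bumps are scaled to match the target. This is a toy illustration, not Cybenko's construction; the bump widths and steepness are arbitrary choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bump(x, left, right, steep=50.0):
    # difference of two steep sigmoid units ~ indicator of the interval [left, right]
    return sigmoid(steep * (x - left)) - sigmoid(steep * (x - right))

# 1-hidden-layer net: hidden units in pairs form 40 narrow bumps over [0, 2*pi];
# each bump's output weight is the target height sin(center)
xs = np.linspace(0.0, 2.0 * np.pi, 1000)
edges = np.linspace(0.0, 2.0 * np.pi, 41)
centers = 0.5 * (edges[:-1] + edges[1:])
approx = sum(np.sin(c) * bump(xs, lo, hi)
             for c, lo, hi in zip(centers, edges[:-1], edges[1:]))
print(np.max(np.abs(approx - np.sin(xs))))   # modest error; shrinks as bumps narrow and steepen
```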

Decision Boundary

• 0 hidden layers: linear classifier
  – Hyperplanes

[Figure: network with inputs x1, x2 and output y; example linear (hyperplane) decision boundaries]

Example from Eric Postma via Jason Eisner

41 Decision Boundary

• 1 hidden layer
  – Boundary of convex region (open or closed)

[Figure: network with inputs x1, x2, one hidden layer, and output y; example convex decision regions]
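One way to see why a single hidden layer yields convex regions: with hand-picked weights (made up for this sketch), each steep-sigmoid hidden unit approximates a half-plane indicator, and the output unit fires only when all of them agree, carving out their intersection, which is convex (here, a triangle).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def in_convex_region(x1, x2, steep=20.0):
    # three steep-sigmoid hidden units, each ~ an indicator of one half-plane
    halfplanes = [
        ( 1.0,  0.0, 0.0),   # x1 >= 0
        ( 0.0,  1.0, 0.0),   # x2 >= 0
        (-1.0, -1.0, 1.0),   # x1 + x2 <= 1
    ]
    z = [sigmoid(steep * (w1 * x1 + w2 * x2 + b)) for w1, w2, b in halfplanes]
    # the output unit approximates an AND of the hidden units, so it fires
    # only inside the intersection of the half-planes: a convex triangle
    return sigmoid(steep * (sum(z) - 2.5)) > 0.5

print(in_convex_region(0.2, 0.3))   # inside the triangle  -> True
print(in_convex_region(2.0, 2.0))   # outside the triangle -> False
```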

Example from Eric Postma via Jason Eisner

42 Decision Boundary

• 2 hidden layers
  – Combinations of convex regions

[Figure: network with inputs x1, x2, two hidden layers, and output y; example non-convex decision regions formed by combining convex regions]

Example from Eric Postma via Jason Eisner

43 Decision Functions Different Levels of Abstraction

• We don’t know the “right” levels of abstraction • So let the model figure it out!

Example from Honglak Lee (NIPS 2010)

44 Decision Functions Different Levels of Abstraction

Face Recognition: – Deep Network can build up increasingly higher levels of abstraction – Lines, parts, regions

Example from Honglak Lee (NIPS 2010)

45 Decision Functions Different Levels of Abstraction

Output

… Hidden Layer 3

… Hidden Layer 2

… Hidden Layer 1

… Input

46 Example from Honglak Lee (NIPS 2010)