School of Computer Science
10-601B Introduction to Machine Learning
Neural Networks
Matt Gormley, Lecture 15, October 19, 2016
Readings: Bishop Ch. 5; Murphy Ch. 16.5, Ch. 28; Mitchell Ch. 4

Reminders

Outline
• Logistic Regression (Recap)
• Neural Networks
• Backpropagation

RECALL: LOGISTIC REGRESSION

Recall: Using gradient ascent for linear classifiers
Key idea behind today's lecture:
1. Define a linear classifier (logistic regression)
2. Define an objective function (likelihood)
3. Optimize it with gradient descent to learn parameters
4. Predict the class with highest probability under the model

Recall: Using gradient ascent for linear classifiers
This decision function isn't differentiable:
  h(t) = sign(θᵀt)
Use a differentiable function instead:
  p_θ(y = 1 | t) = 1 / (1 + exp(−θᵀt))
where logistic(u) ≡ 1 / (1 + e^(−u))

Recall: Logistic Regression
Data: Inputs are continuous vectors of length K. Outputs are discrete.
  D = {(t^(i), y^(i))}_{i=1}^N, where t ∈ R^K and y ∈ {0, 1}
Model: Logistic function applied to the dot product of the parameters with the input vector.
  p_θ(y = 1 | t) = 1 / (1 + exp(−θᵀt))
Learning: Find the parameters that minimize some objective function.
  θ* = argmin_θ J(θ)
Prediction: Output the most probable class.
  ŷ = argmax_{y ∈ {0,1}} p_θ(y | t)
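As a concrete reference for the recap, the following is a minimal NumPy sketch of the logistic regression model and one gradient-ascent step on the per-example log-likelihood. The function names, the learning rate, and the per-example (stochastic) update are illustrative assumptions, not code from the lecture.

```python
import numpy as np

def logistic(u):
    # logistic(u) = 1 / (1 + e^(-u)), the differentiable surrogate for sign
    return 1.0 / (1.0 + np.exp(-u))

def predict_proba(theta, t):
    # p_theta(y = 1 | t) = logistic(theta^T t)
    return logistic(theta @ t)

def predict(theta, t):
    # y_hat = argmax_y p_theta(y | t); for binary y this is a 0.5 threshold
    return 1 if predict_proba(theta, t) >= 0.5 else 0

def ascent_step(theta, t, y, lr=0.1):
    # One gradient-ascent step on the per-example conditional log-likelihood
    # log p_theta(y | t); its gradient with respect to theta is (y - p) * t.
    p = predict_proba(theta, t)
    return theta + lr * (y - p) * t
```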
NEURAL NETWORKS

Learning highly non-linear functions f: X → Y
• f might be a non-linear function
• X: (vector of) continuous and/or discrete variables
• Y: (vector of) continuous and/or discrete variables
Examples: the XOR gate, speech recognition
© Eric Xing @ CMU, 2006-2011

Perceptron and Neural Nets
• From biological neuron to artificial neuron (perceptron)
  [Figure: a biological neuron (soma, dendrites, axon, synapses) alongside the perceptron: inputs x1, x2 with weights w1, w2 feed a linear combiner Σ and a hard limiter with threshold θ, producing output Y]
• Activation function:
  X = Σ_{i=1}^n x_i w_i,   Y = +1 if X ≥ ω₀, −1 if X < ω₀
• Artificial neural networks
  – supervised learning
  – gradient descent
  [Figure: input signals enter an input layer, pass through a middle (hidden) layer, and leave as output signals from an output layer]
© Eric Xing @ CMU, 2006-2011

Connectionist Models
• Consider humans:
  – Neuron switching time ~ 0.001 second
  – Number of neurons ~ 10^10
  – Connections per neuron ~ 10^4–10^5
  – Scene recognition time ~ 0.1 second
  – 100 inference steps doesn't seem like enough → much parallel computation
• Properties of artificial neural nets (ANNs):
  – Many neuron-like threshold switching units
  – Many weighted interconnections among units
  – Highly parallel, distributed processing
© Eric Xing @ CMU, 2006-2011

Motivation: Why is everyone talking about Deep Learning?
• Because a lot of money is invested in it…
  – DeepMind: acquired by Google for $400 million
  – DNNResearch: three-person startup (including Geoff Hinton) acquired by Google for an undisclosed price
  – Enlitic, Ersatz, MetaMind, Nervana, Skylab: Deep Learning startups commanding millions of VC dollars
• Because it made the front page of the New York Times

Motivation: Why is everyone talking about Deep Learning?
Deep learning:
• Has won numerous pattern recognition competitions
• Does so with minimal feature engineering
This wasn't always the case! (Timeline: 1960s, 1980s, 1990s, 2006, 2016)
Since the 1980s the form of the models hasn't changed much, but there are lots of new tricks…
• More hidden units
• Better (online) optimization
• New nonlinear functions (ReLUs)
• Faster computers (CPUs and GPUs)

Background: A Recipe for Machine Learning
1. Given training data (e.g., images labeled "Face" / "Not a face")
2. Choose each of these:
  – Decision function (examples: linear regression, logistic regression, neural network)
  – Loss function (examples: mean-squared error, cross entropy)
3. Define the goal: minimize the total loss over the training data
4. Train with SGD (take small steps opposite the gradient)

Background: Gradients
Backpropagation can compute this gradient! And it's a special case of a more general algorithm called reverse-mode automatic differentiation that can compute the gradient of any differentiable function efficiently!

Background: Goals for Today's Lecture
1. Explore a new class of decision functions (Neural Networks)
2. Consider variants of this recipe for training
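Step 4 of the recipe can be sketched generically: assuming only that we can evaluate the gradient of the per-example loss, SGD repeatedly takes small steps opposite that gradient. The function names, learning rate, and epoch count below are illustrative assumptions, not part of the slides.

```python
import numpy as np

def sgd(data, theta, grad_loss, lr=0.01, epochs=10):
    # data: list of (x, y) training pairs
    # grad_loss(theta, x, y): gradient of the per-example loss w.r.t. theta
    for _ in range(epochs):
        np.random.shuffle(data)                          # visit examples in random order
        for x, y in data:
            theta = theta - lr * grad_loss(theta, x, y)  # small step opposite the gradient
    return theta

# Example: logistic regression with cross-entropy loss, whose per-example
# gradient is (p - y) * x with p = logistic(theta^T x).
def logistic_grad(theta, x, y):
    p = 1.0 / (1.0 + np.exp(-(theta @ x)))
    return (p - y) * x
```

For the neural networks that follow, backpropagation is what supplies this gradient.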
Decision Functions: Linear Regression
  y = h_θ(x) = σ(θᵀx), where σ(a) = a
[Diagram: a single output unit connected to inputs x1 … xM by weights θ1, θ2, θ3, …, θM]

Decision Functions: Logistic Regression
  y = h_θ(x) = σ(θᵀx), where σ(a) = 1 / (1 + exp(−a))
[Same diagram; illustrated on the "Face" / "Not a face" data and as a sigmoid surface over two inputs x1, x2]

Neural Network Model
[Diagram: inputs (Age = 34, Gender = 2, Stage = 4) feed a hidden layer of sigmoid units through one set of weights; the hidden layer feeds an output unit through a second set of weights, giving 0.6, the "Probability of being Alive". Labels: independent variables, weights, hidden layer, weights, dependent variable (prediction)]
© Eric Xing @ CMU, 2006-2011

"Combined logistic models"
[Diagrams: the same network viewed piecewise; each hidden unit is a logistic model of the inputs, and the output unit is a logistic model of the hidden units; output 0.6, the "Probability of being Alive"]
Not really, no target for hidden units...
[Diagram: the full network again, with both layers of weights shown]
© Eric Xing @ CMU, 2006-2011

Jargon Pseudo-Correspondence
• Independent variable = input variable
• Dependent variable = output variable
• Coefficients = "weights"
• Estimates = "targets"
Logistic Regression Model (the sigmoid unit)
[Diagram: inputs Age = 34, Gender = 1, Stage = 4 (independent variables x1, x2, x3) feed a single sigmoid unit through coefficients a, b, c; output 0.6, the "Probability of being Alive" (dependent variable p, the prediction)]
© Eric Xing @ CMU, 2006-2011

Decision Functions: Neural Network
[Diagram: input layer, hidden layer, single output unit]

Decision Functions: Neural Network
  (F) Loss:             J = ½ (y − y^(d))²
  (E) Output (sigmoid): y = 1 / (1 + exp(−b))
  (D) Output (linear):  b = Σ_{j=0}^{D} β_j z_j
  (C) Hidden (sigmoid): z_j = 1 / (1 + exp(−a_j)), ∀j
  (B) Hidden (linear):  a_j = Σ_{i=0}^{M} α_{ji} x_i, ∀j
  (A) Input:            given x_i, ∀i
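Composing layers (A) through (F) gives the forward computation of this one-hidden-layer network. Below is a minimal NumPy sketch of that forward pass; the function and variable names are illustrative assumptions, and the bias terms implied by the j = 0 and i = 0 indices are omitted for brevity.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def forward(x, alpha, beta, y_star):
    """One-hidden-layer forward pass following (A)-(F) above.
    x:      input vector of length M
    alpha:  D x M matrix of input-to-hidden weights, alpha[j, i] = alpha_ji
    beta:   length-D vector of hidden-to-output weights
    y_star: the target output (y^(d) in the slides)
    """
    a = alpha @ x                # (B) hidden linear:   a_j = sum_i alpha_ji * x_i
    z = sigmoid(a)               # (C) hidden sigmoid:  z_j = 1 / (1 + exp(-a_j))
    b = beta @ z                 # (D) output linear:   b = sum_j beta_j * z_j
    y = sigmoid(b)               # (E) output sigmoid:  y = 1 / (1 + exp(-b))
    J = 0.5 * (y - y_star) ** 2  # (F) squared-error loss
    return y, J
```

Backpropagation, covered later in the lecture, runs these same steps in reverse to obtain the gradient of J with respect to α and β.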
Building a Neural Net
[Diagrams: start from the output and the input features; add a hidden layer with D = M units (weights shown as 1); keep D = M; then shrink to D < M hidden units]

Decision Boundary
• 0 hidden layers: linear classifier (hyperplanes)
• 1 hidden layer: boundary of a convex region (open or closed)
• 2 hidden layers: combinations of convex regions
[Plots of y over inputs x1, x2; example from Eric Postma via Jason Eisner]

Decision Functions: Multi-Class Output
[Diagram: input layer, hidden layer, multiple output units]

Decision Functions: Deeper Networks
Next lecture: making the neural networks deeper
[Diagrams: networks with one, two, and three hidden layers]

Decision Functions: Different Levels of Abstraction
• We don't know the "right" levels of abstraction
• So let the model figure it out!
Face recognition: a deep network can build up increasingly higher levels of abstraction: lines, parts, regions
[Diagram: input, hidden layers 1–3, output; examples from Honglak Lee (NIPS 2010)]

ARCHITECTURES

Neural Network Architectures
Even for a basic Neural Network, there are many design decisions to make:
1. # of hidden layers (depth)
2. # of units per hidden layer (width)
3. Type of activation function (nonlinearity)
4. Form of objective function

Activation Functions
Neural Network with sigmoid activation functions:
  (F) Loss:             J = ½ (y − y*)²
  (E) Output (sigmoid): y = 1 / (1 + exp(−b))
  (D) Output (linear):  b = Σ_{j=0}^{D} β_j z_j
  (C) Hidden (sigmoid): z_j = 1 / (1 + exp(−a_j)), ∀j
  (B) Hidden (linear):  a_j = Σ_{i=0}^{M} α_{ji} x_i, ∀j
  (A) Input:            given x_i, ∀i

Activation Functions
Neural Network with arbitrary nonlinear activation functions:
  (F) Loss:               J = ½ (y − y*)²
  (E) Output (nonlinear): y = σ(b)
  (D) Output (linear):    b = Σ_{j=0}^{D} β_j z_j
  (C) Hidden (nonlinear): z_j = σ(a_j), ∀j
  (B) Hidden (linear):    a_j = Σ_{i=0}^{M} α_{ji} x_i, ∀j
  (A) Input:              given x_i, ∀i

Activation Functions: Sigmoid / Logistic Function
So far, we've assumed that the activation function (nonlinearity) is always the sigmoid function:
  logistic(u) ≡ 1 / (1 + e^(−u))

Activation Functions
• A new change: modifying the nonlinearity
  – The logistic is not widely used in modern ANNs
• Alternate 1: tanh
  – Like the logistic function but shifted to the range [−1, +1]
Slide from William Cohen
[Figure (truncated): AI Stats 2010, depth 4, sigmoid vs. …]
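To make the choice of nonlinearity concrete, here is a small sketch of the activations discussed so far: the logistic (sigmoid), tanh, and the ReLU mentioned among the "new tricks" earlier. Swapping σ changes only steps (C) and (E) of the network; the linear steps stay the same. The function names are illustrative assumptions.

```python
import numpy as np

def logistic(u):
    return 1.0 / (1.0 + np.exp(-u))  # sigmoid: output in (0, 1)

def tanh(u):
    return np.tanh(u)                # like the logistic, but in (-1, +1)

def relu(u):
    return np.maximum(0.0, u)        # rectified linear unit: max(0, u)
```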