
Artificial Neural Networks
• Threshold units
• Gradient descent
• Multilayer networks
• Backpropagation
• Hidden representations
• Example: Face recognition
• Advanced topics

Connectionist Models

Consider humans:
• Neuron switching time ~0.001 second
• Number of neurons ~10^10
• Connections per neuron ~10^4-5
• Scene recognition time ~0.1 second
• 100 inference steps do not seem like enough → must use lots of parallel computation!

Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processing
• Emphasis on tuning weights automatically

When to Consider Neural Networks

• Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant

Examples:
• Speech phoneme recognition [Waibel]
• Image classification [Kanade, Baluja, Rowley]
• Financial prediction

ALVINN drives 70 mph on highways

[Figure: the ALVINN network: a 30x32 sensor input retina feeding 4 hidden units, with output units Sharp Left, Straight Ahead, and Sharp Right]

Perceptron

[Figure: perceptron unit with inputs x_0 = 1, x_1, ..., x_n, weights w_0, w_1, ..., w_n, a summation Σ, and a threshold]

o = 1 if Σ_{i=0}^{n} w_i x_i > 0, -1 otherwise

o(x_1, ..., x_n) = 1 if w_0 + w_1 x_1 + ... + w_n x_n > 0, -1 otherwise

Sometimes we will use simpler vector notation:

o(x⃗) = 1 if w⃗ · x⃗ > 0, -1 otherwise

Decision Surface of Perceptron

[Figure: two decision surfaces in the (x_1, x_2) plane: a linearly separable set of examples on the left, a set that is not linearly separable on the right]


Represents some useful functions

• What weights represent g(x_1, x_2) = AND(x_1, x_2)? (one choice is sketched below)

But some functions are not representable
• e.g., those that are not linearly separable
• therefore, we will want networks of these ...
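A minimal sketch, not from the slides, of one weight setting that makes a threshold unit compute AND, assuming 0/1 inputs and the ±1 output convention above; the values w0 = -0.8, w1 = w2 = 0.5 are just one of many valid choices.

```python
def perceptron(x, w):
    """Threshold unit: output 1 if w.x > 0 (with x[0] = 1 as the bias input), else -1."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0 else -1

w = [-0.8, 0.5, 0.5]          # w0 (bias weight), w1, w2
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, perceptron([1, x1, x2], w))   # prints -1, -1, -1, 1
```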

Perceptron Training Rule

w_i ← w_i + Δw_i  where  Δw_i = η (t − o) x_i

• t = c(x⃗) is the target value
• o is the perceptron output
• η is a small constant (e.g., 0.1) called the learning rate

Can prove it will converge
• if training data is linearly separable
• and η is sufficiently small
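A minimal Python sketch of the perceptron training rule above; the example format, epoch count, and initial-weight range are illustrative assumptions rather than anything specified in the slides.

```python
import random

def train_perceptron(examples, eta=0.1, epochs=100):
    """Perceptron training rule sketch: examples are (x_vector, target) pairs with
    targets in {-1, +1}; x_vector does NOT include the constant bias input, which
    is prepended here as x0 = 1. Converges if the data are linearly separable and
    eta is small enough."""
    n = len(examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n + 1)]   # w0 ... wn
    for _ in range(epochs):
        for x, t in examples:
            xb = [1.0] + list(x)                              # prepend bias input
            o = 1 if sum(wi * xi for wi, xi in zip(w, xb)) > 0 else -1
            for i in range(n + 1):                            # w_i <- w_i + eta*(t-o)*x_i
                w[i] += eta * (t - o) * xb[i]
    return w

# Example: learn AND with targets in {-1, +1}
data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
print(train_perceptron(data))
```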

Gradient Descent

To understand, consider a simple linear unit, where

o = w_0 + w_1 x_1 + ... + w_n x_n

Idea: learn the w_i's that minimize the squared error

E[w⃗] ≡ (1/2) Σ_{d∈D} (t_d − o_d)²

where D is the set of training examples.

Gradient Descent

[Figure: the error surface over weight space]

Gradient: ∇E[w⃗] ≡ [ ∂E/∂w_0, ∂E/∂w_1, ..., ∂E/∂w_n ]

Training rule: Δw⃗ = −η ∇E[w⃗]

i.e., Δw_i = −η ∂E/∂w_i

Gradient Descent

∂E/∂w_i = ∂/∂w_i [ (1/2) Σ_d (t_d − o_d)² ]
        = (1/2) Σ_d ∂/∂w_i (t_d − o_d)²
        = (1/2) Σ_d 2 (t_d − o_d) ∂/∂w_i (t_d − o_d)
        = Σ_d (t_d − o_d) ∂/∂w_i (t_d − w⃗ · x⃗_d)

∂E/∂w_i = Σ_d (t_d − o_d) (−x_{i,d})

Gradient Descent

GRADIENT-DESCENT(training_examples, η)
  Each training example is a pair of the form <x⃗, t>, where x⃗ is the vector of
  input values and t is the target output value. η is the learning rate (e.g., 0.05).
• Initialize each w_i to some small random value
• Until the termination condition is met, do
  - Initialize each Δw_i to zero.
  - For each <x⃗, t> in training_examples, do
    * Input the instance x⃗ and compute the output o
    * For each linear unit weight w_i, do
        Δw_i ← Δw_i + η (t − o) x_i
  - For each linear unit weight w_i, do
        w_i ← w_i + Δw_i
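A minimal Python sketch of the GRADIENT-DESCENT pseudocode above for a single linear unit; the toy data, learning rate, and iteration count are illustrative assumptions.

```python
import random

def gradient_descent(training_examples, eta=0.05, iterations=1000):
    """Batch GRADIENT-DESCENT sketch for a single linear unit, following the
    pseudocode above: accumulate delta_w over all examples, then update.
    Each example is (x_vector, t); x0 = 1 is prepended as the bias input."""
    n = len(training_examples[0][0]) + 1
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]
    for _ in range(iterations):
        dw = [0.0] * n                                  # initialize each delta_w_i to zero
        for x, t in training_examples:
            xb = [1.0] + list(x)
            o = sum(wi * xi for wi, xi in zip(w, xb))   # linear unit output
            for i in range(n):
                dw[i] += eta * (t - o) * xb[i]          # delta_w_i += eta*(t-o)*x_i
        for i in range(n):
            w[i] += dw[i]                               # w_i <- w_i + delta_w_i
    return w

# Fit a noisy line t = 2 + 3*x1 (illustrative data, not from the slides)
data = [((x,), 2 + 3 * x + random.gauss(0, 0.1)) for x in [0, 0.2, 0.4, 0.6, 0.8, 1.0]]
print(gradient_descent(data))   # weights should approach [2, 3]
```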

Summary

Perceptron training rule guaranteed to succeed if
• Training examples are linearly separable
• Sufficiently small learning rate η

Linear unit training rule uses gradient descent
• Guaranteed to converge to hypothesis with minimum squared error
• Given sufficiently small learning rate η
• Even when training data contains noise
• Even when training data not separable by H

Incremental (Stochastic) Gradient Descent

Batch mode Gradient Descent:
Do until satisfied:
  1. Compute the gradient ∇E_D[w⃗]
  2. w⃗ ← w⃗ − η ∇E_D[w⃗]
where E_D[w⃗] ≡ (1/2) Σ_{d∈D} (t_d − o_d)²

Incremental mode Gradient Descent:
Do until satisfied:
  - For each training example d in D
    1. Compute the gradient ∇E_d[w⃗]
    2. w⃗ ← w⃗ − η ∇E_d[w⃗]
where E_d[w⃗] ≡ (1/2) (t_d − o_d)²

Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η is made small enough.
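For contrast, a sketch of the incremental (stochastic) mode using the same conventions as the batch sketch above; the only change is that the weights are updated after every example rather than after a full pass through D.

```python
import random

def incremental_gradient_descent(training_examples, eta=0.05, iterations=1000):
    """Incremental (stochastic) counterpart of the batch sketch above, using the
    same (x_vector, t) example format."""
    n = len(training_examples[0][0]) + 1
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]
    for _ in range(iterations):
        for x, t in training_examples:
            xb = [1.0] + list(x)                        # x0 = 1 bias input
            o = sum(wi * xi for wi, xi in zip(w, xb))   # linear unit output
            for i in range(n):
                w[i] += eta * (t - o) * xb[i]           # immediate update per example
    return w
```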

Multilayer Networks of Sigmoid Units

[Figure: a network with inputs x_1, x_2, hidden units h_1, h_2, and output o, shown alongside the training data (classes d0 to d3) it separates]

Multilayer Decision Space

[Figure: the non-linear decision regions formed by the multilayer network]

Sigmoid Unit

[Figure: sigmoid unit with inputs x_0 = 1, x_1, ..., x_n, weights w_0, w_1, ..., w_n, a summation Σ, and a sigmoid output]

net = Σ_{i=0}^{n} w_i x_i        o = σ(net) = 1 / (1 + e^(−net))

σ(x) is the sigmoid function 1 / (1 + e^(−x))

Nice property: dσ(x)/dx = σ(x) (1 − σ(x))

We can derive gradient descent rules to train:
• One sigmoid unit
• Multilayer networks of sigmoid units → Backpropagation
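A small Python sketch of the sigmoid and the "nice property" of its derivative; the finite-difference check at the end is just an illustration, not part of the slides.

```python
import math

def sigmoid(x):
    """The sigmoid function sigma(x) = 1 / (1 + e^-x)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """Uses the nice property from the slide: d sigma/dx = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Quick check of the property against a finite-difference estimate
h = 1e-6
for x in (-2.0, 0.0, 3.0):
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    print(x, sigmoid_derivative(x), numeric)   # the two derivative values agree
```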

The Sigmoid Function

[Plot: σ(x) = 1 / (1 + e^(−x)), output versus net input over the range −6 to 6]

Sort of a rounded step function. Unlike the step function, we can take its derivative (which makes learning possible).

Error Gradient for a Sigmoid Unit

∂E/∂w_i = ∂/∂w_i [ (1/2) Σ_{d∈D} (t_d − o_d)² ]
        = (1/2) Σ_d ∂/∂w_i (t_d − o_d)²
        = (1/2) Σ_d 2 (t_d − o_d) ∂/∂w_i (t_d − o_d)
        = Σ_d (t_d − o_d) ( −∂o_d/∂w_i )
        = −Σ_d (t_d − o_d) (∂o_d/∂net_d) (∂net_d/∂w_i)

But we know:

∂o_d/∂net_d = ∂σ(net_d)/∂net_d = o_d (1 − o_d)
∂net_d/∂w_i = ∂(w⃗ · x⃗_d)/∂w_i = x_{i,d}

So:

∂E/∂w_i = −Σ_{d∈D} (t_d − o_d) o_d (1 − o_d) x_{i,d}
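A sketch that evaluates the derived gradient for a single sigmoid unit and checks it against a finite-difference estimate; the weights and toy examples here are made up for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def error_and_gradient(w, examples):
    """E(w) = 1/2 * sum_d (t_d - o_d)^2 for one sigmoid unit, plus the analytic
    gradient dE/dw_i = -sum_d (t_d - o_d) * o_d * (1 - o_d) * x_id derived above.
    Examples are (x_vector_with_bias, t) pairs."""
    E = 0.0
    grad = [0.0] * len(w)
    for x, t in examples:
        o = sigmoid(sum(wi * xi for wi, xi in zip(w, x)))
        E += 0.5 * (t - o) ** 2
        for i in range(len(w)):
            grad[i] += -(t - o) * o * (1 - o) * x[i]
    return E, grad

# Compare the analytic gradient with a finite-difference estimate (toy data)
w = [0.1, -0.2, 0.3]
examples = [([1.0, 0.5, -1.0], 1.0), ([1.0, -0.3, 0.8], 0.0)]
E, grad = error_and_gradient(w, examples)
h = 1e-6
for i in range(len(w)):
    wp = list(w); wp[i] += h
    wm = list(w); wm[i] -= h
    numeric = (error_and_gradient(wp, examples)[0] - error_and_gradient(wm, examples)[0]) / (2 * h)
    print(i, grad[i], numeric)   # each pair should match closely
```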

Backpropagation

Initialize all weights to small random numbers.
Until satisfied, do
• For each training example, do
  1. Input the training example and compute the outputs
  2. For each output unit k
       δ_k ← o_k (1 − o_k) (t_k − o_k)
  3. For each hidden unit h
       δ_h ← o_h (1 − o_h) Σ_{k∈outputs} w_{h,k} δ_k
  4. Update each network weight w_{i,j}
       w_{i,j} ← w_{i,j} + Δw_{i,j}  where  Δw_{i,j} = η δ_j x_{i,j}
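A minimal Python sketch of this algorithm (stochastic backpropagation for a single hidden layer of sigmoid units); the data format, learning rate, and epoch count are illustrative assumptions, and the usage example is the 8 x 3 x 8 identity task discussed on the next slides.

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train_backprop(examples, n_hidden, eta=0.3, epochs=5000):
    """Backpropagation sketch for one hidden layer, following the slide's rules.
    Examples are (input_list, target_list) pairs; a constant 1.0 bias input is
    prepended to the inputs and to the hidden layer."""
    n_in, n_out = len(examples[0][0]), len(examples[0][1])
    # w_h[j][i]: weight from input i (0 = bias) to hidden unit j
    w_h = [[random.uniform(-0.05, 0.05) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    # w_o[k][j]: weight from hidden unit j (0 = bias) to output unit k
    w_o = [[random.uniform(-0.05, 0.05) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in examples:
            xb = [1.0] + list(x)
            # forward pass
            h = [1.0] + [sigmoid(sum(w * xi for w, xi in zip(w_h[j], xb)))
                         for j in range(n_hidden)]
            o = [sigmoid(sum(w * hi for w, hi in zip(w_o[k], h))) for k in range(n_out)]
            # delta_k = o_k (1 - o_k)(t_k - o_k) for each output unit
            delta_o = [o[k] * (1 - o[k]) * (t[k] - o[k]) for k in range(n_out)]
            # delta_h = o_h (1 - o_h) sum_k w_hk delta_k for each hidden unit
            delta_h = [h[j + 1] * (1 - h[j + 1]) *
                       sum(w_o[k][j + 1] * delta_o[k] for k in range(n_out))
                       for j in range(n_hidden)]
            # weight updates: w_ij <- w_ij + eta * delta_j * x_ij
            for k in range(n_out):
                for j in range(n_hidden + 1):
                    w_o[k][j] += eta * delta_o[k] * h[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w_h[j][i] += eta * delta_h[j] * xb[i]
    return w_h, w_o

# Usage: the 8 x 3 x 8 identity task (each input pattern should reproduce itself)
identity = []
for j in range(8):
    pattern = [1.0 if i == j else 0.0 for i in range(8)]
    identity.append((pattern, pattern))
w_h, w_o = train_backprop(identity, n_hidden=3)
```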

More on Backpropagation

• Gradient descent over entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global, error minimum
  – In practice, often works well (can run multiple times)
• Often include weight momentum α

Δw_{i,j}(n) = η δ_j x_{i,j} + α Δw_{i,j}(n − 1)

• Minimizes error over training examples
• Will it generalize well to subsequent examples?
• Training can take thousands of iterations -- slow!
  – Using network after training is fast
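A small sketch of the momentum update for a single unit's weight list; the function name and the parameter values are illustrative, not taken from the slides.

```python
def momentum_update(w, dw_prev, delta, inputs, eta=0.3, alpha=0.9):
    """One momentum step, per the rule above:
    delta_w(n) = eta * delta_j * x_ij + alpha * delta_w(n-1).
    'delta' is the unit's error term, 'inputs' its input values (bias first).
    Returns the new delta_w list so the caller can feed it back next step."""
    dw = [eta * delta * x + alpha * dprev for x, dprev in zip(inputs, dw_prev)]
    for i in range(len(w)):
        w[i] += dw[i]
    return dw

# Toy usage: keep dw between calls and pass it back in
w = [0.1, -0.2, 0.05]
dw = [0.0, 0.0, 0.0]
dw = momentum_update(w, dw, delta=0.12, inputs=[1.0, 0.5, -1.0])
dw = momentum_update(w, dw, delta=0.08, inputs=[1.0, 0.3, 0.7])
```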

Learning Hidden Layer Representations

[Figure: an 8 x 3 x 8 network: 8 input units, 3 hidden units, 8 output units, trained so that the output reproduces the input]

Input → Output
10000000 → 10000000
01000000 → 01000000
00100000 → 00100000
00010000 → 00010000
00001000 → 00001000
00000100 → 00000100
00000010 → 00000010
00000001 → 00000001

Learning Hidden Layer Representations

Learned hidden unit values:

Input → Hidden Values → Output
10000000 → .89 .04 .08 → 10000000
01000000 → .01 .11 .88 → 01000000
00100000 → .01 .97 .27 → 00100000
00010000 → .99 .97 .71 → 00010000
00001000 → .03 .05 .02 → 00001000
00000100 → .22 .99 .99 → 00000100
00000010 → .80 .01 .98 → 00000010
00000001 → .60 .94 .01 → 00000001

Output Unit Error during Training

[Plot: sum of squared errors for each output unit versus training iterations (0 to 2500)]

Hidden Unit Encoding

[Plot: the hidden unit encoding for one input versus training iterations (0 to 2500)]

Input to Hidden Weights

[Plot: the weights from the inputs to one hidden unit versus training iterations (0 to 2500)]

Convergence of Backpropagation

Gradient descent to some local minimum
• Perhaps not global minimum
• Momentum can cause quicker convergence
• Stochastic gradient descent also results in faster convergence
• Can train multiple networks and get different results (using different initial weights)

Nature of convergence
• Initialize weights near zero
• Therefore, initial networks are near-linear
• Increasingly non-linear functions as training progresses

Expressive Capabilities of ANNs

Boolean functions:
• Every Boolean function can be represented by a network with a single hidden layer
• But that might require a number of hidden units exponential in the number of inputs

Continuous functions:
• Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
• Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]

Overfitting in ANNs

[Plot: error versus number of weight updates (example 1), showing training-set and validation-set error]

Overfitting in ANNs

[Plot: error versus number of weight updates (example 2), showing training-set and validation-set error]
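The slides show the training and validation curves but do not prescribe a stopping rule; a hedged sketch of one common use of the validation set (stop when validation error stops improving), where train_step and validation_error are caller-supplied, hypothetical callables.

```python
def train_with_validation(train_step, validation_error,
                          max_updates=20000, check_every=100, patience=10):
    """Early-stopping sketch: 'train_step' performs one weight update and
    'validation_error' returns current error on a held-out set. Training stops
    when validation error has not improved for 'patience' consecutive checks."""
    best, since_best = float("inf"), 0
    for n in range(1, max_updates + 1):
        train_step()
        if n % check_every == 0:
            err = validation_error()
            if err < best:
                best, since_best = err, 0
            else:
                since_best += 1
                if since_best >= patience:
                    break                      # validation error stopped improving
    return best
```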

Neural Nets for Face Recognition

[Figure: network with 30x32 inputs and output units left, strt, rgt, up; typical input images shown below]

90% accurate at learning head pose, and recognizing 1-of-20 faces

Learned Network Weights

[Figure: the learned weights from the 30x32 inputs to the hidden units, with output units left, strt, rgt, up; typical input images shown below]

Alternative Error Functions

Penalize large weights:

E(w⃗) ≡ (1/2) Σ_{d∈D} Σ_{k∈outputs} (t_{kd} − o_{kd})² + γ Σ_{i,j} w_{ji}²

Train on target slopes as well as values:

E(w⃗) ≡ (1/2) Σ_{d∈D} Σ_{k∈outputs} [ (t_{kd} − o_{kd})² + μ Σ_j ( ∂t_{kd}/∂x_d^j − ∂o_{kd}/∂x_d^j )² ]

Tie together weights:
• e.g., in phoneme recognition
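A small sketch of how the "penalize large weights" term changes the gradient step: the γ Σ w² penalty adds 2γw to each ∂E/∂w, so every update also shrinks the weights toward zero. The function name and the values of γ and η are illustrative.

```python
def weight_decay_update(w, grad_E, eta=0.3, gamma=0.001):
    """Gradient step for E plus a gamma * sum(w^2) penalty: each weight moves
    against the usual error gradient and also decays by 2*gamma*w. 'grad_E'
    holds dE/dw for the unpenalized squared-error term."""
    for i in range(len(w)):
        w[i] -= eta * (grad_E[i] + 2.0 * gamma * w[i])
    return w

# Toy usage: a large weight shrinks even when its error gradient is zero
print(weight_decay_update([5.0, -0.1], [0.0, 0.0]))
```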

Recurrent Networks

[Figure: a feedforward network mapping x(t) to y(t+1); a recurrent network with a feedback connection; and the recurrent network unfolded in time over inputs x(t−2), x(t−1), x(t)]

Neural Network Summary

• physiologically (neuron) inspired model
• powerful (accurate), slow, opaque (hard to understand the resulting model)
• bias: preferential
  – based on gradient descent
  – finds a local minimum
  – affected by initial conditions, parameters
• neural units
  – linear
  – linear threshold
  – sigmoid

Neural Network Summary (cont)

• gradient descent
  – convergence
• linear units
  – limitation: hyperplane decision surface
  – learning rule
• multilayer network
  – advantage: can have non-linear decision surface
  – backpropagation to learn
    • backprop learning rule
• learning issues
  – units used

Neural Network Summary (cont)

• learning issues (cont)
  – batch versus incremental (stochastic)
  – parameters
    • initial weights
    • learning rate
    • momentum
  – cost (error) function
    • sum of squared errors
    • can include penalty terms
• recurrent networks
  – simple
  – backpropagation through time
