Artificial Neural Networks
• Threshold units
• Gradient descent
• Multilayer networks
• Backpropagation
• Hidden layer representations
• Example: Face recognition
• Advanced topics
CS 5751 Machine Learning — Chapter 4: Artificial Neural Networks

Connectionist Models
Consider humans:
• Neuron switching time ~0.001 second
• Number of neurons ~10^10
• Connections per neuron ~10^4-5
• Scene recognition time ~0.1 second
• 100 inference steps do not seem like enough
→ must use lots of parallel computation!
Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processing
• Emphasis on tuning weights automatically
When to Consider Neural Networks
• Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant
Examples:
• Speech phoneme recognition [Waibel]
• Image classification [Kanade, Baluja, Rowley]
• Financial prediction
ALVINN drives 70 mph on highways
(Figure: the ALVINN network — a 30x32 sensor input retina feeding 4 hidden units, with output units for Sharp Left, Straight Ahead, and Sharp Right.)
Perceptron

(Figure: inputs x1, …, xn with weights w1, …, wn, plus a bias input x0 = 1 with weight w0, feeding a summation unit Σ and a threshold.)

o = 1 if ∑_{i=0}^{n} wi xi > 0, −1 otherwise

o(x1, …, xn) = 1 if w0 + w1x1 + … + wnxn > 0, −1 otherwise

Sometimes we will use simpler vector notation:
o(x) = 1 if w·x > 0, −1 otherwise

Decision Surface of Perceptron

(Figure: two sets of points in the (x1, x2) plane — one linearly separable by a perceptron's decision surface, one not.)
Represents some useful functions
• What weights represent g(x1, x2) = AND(x1, x2)?
But some functions are not representable
• e.g., those that are not linearly separable
• therefore, we will want networks of these ...
Perceptron Training Rule
• wi ← wi + ∆wi, where ∆wi = η(t − o)xi
• t = c(x) is the target value
• o is the perceptron output
• η is a small constant (e.g., 0.1) called the learning rate
Can prove it will converge • If training data is linearly separable • and η is sufficiently small
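The rule above can be sketched in code. This is a minimal illustration, not ALVINN-scale: the function names and the AND training set are choices made here, with targets in {−1, +1} and the bias handled as weight w0 on a fixed input x0 = 1.

```python
def train_perceptron(examples, eta=0.1, epochs=100):
    """Perceptron training rule: wi <- wi + eta*(t - o)*xi.
    `examples` is a list of (x, t) pairs with t in {-1, +1};
    x excludes the bias input x0 = 1, which is prepended here."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)                  # w[0] is the bias weight w0
    for _ in range(epochs):
        converged = True
        for x, t in examples:
            xs = [1.0] + list(x)
            o = 1 if sum(wi * xi for wi, xi in zip(w, xs)) > 0 else -1
            if o != t:                   # update only on mistakes
                converged = False
                for i in range(n + 1):
                    w[i] += eta * (t - o) * xs[i]
        if converged:                    # no mistakes in a full pass
            break
    return w

def predict(w, x):
    xs = [1.0] + list(x)
    return 1 if sum(wi * xi for wi, xi in zip(w, xs)) > 0 else -1

# AND(x1, x2) is linearly separable, so the rule converges:
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w = train_perceptron(and_data)
```

Because AND is linearly separable and η is small, training stops after a few passes with a separating weight vector.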
Gradient Descent
To understand, consider a simple linear unit, where
o = w0 + w1x1 + ...+ wn xn
Idea: learn wi's that minimize the squared error
E[w] ≡ ½ ∑_{d∈D} (td − od)²
where D is the set of training examples
Gradient: ∇E[w] ≡ [∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn]
Training rule: ∆w = −η ∇E[w]
i.e., ∆wi = −η ∂E/∂wi
∂E/∂wi = ∂/∂wi ½ ∑_d (td − od)²
       = ½ ∑_d ∂/∂wi (td − od)²
       = ½ ∑_d 2(td − od) ∂/∂wi (td − od)
       = ∑_d (td − od) ∂/∂wi (td − w·xd)
∂E/∂wi = ∑_d (td − od)(−xi,d)
GRADIENT-DESCENT(training_examples, η)
Each training example is a pair ⟨x, t⟩, where x is the vector of input values and t is the target output value. η is the learning rate (e.g., 0.05).
• Initialize each wi to some small random value
• Until the termination condition is met, do
  – Initialize each ∆wi to zero
  – For each ⟨x, t⟩ in training_examples, do
    * Input the instance x and compute the output o
    * For each linear unit weight wi, do
      ∆wi ← ∆wi + η(t − o)xi
  – For each linear unit weight wi, do
      wi ← wi + ∆wi
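The GRADIENT-DESCENT procedure translates almost line for line into code. A minimal sketch for a linear unit, fitting t = 2·x1 + 1 from four noise-free examples (a data set chosen here just for illustration):

```python
def gradient_descent(training_examples, eta=0.05, iterations=1000):
    """Batch gradient descent for a linear unit
    o = w0 + w1*x1 + ... + wn*xn.
    `training_examples` is a list of (x, t); x excludes the bias input."""
    n = len(training_examples[0][0])
    w = [0.0] * (n + 1)                      # w[0] is the bias weight w0
    for _ in range(iterations):
        delta = [0.0] * (n + 1)              # initialize each delta-w_i to zero
        for x, t in training_examples:
            xs = [1.0] + list(x)
            o = sum(wi * xi for wi, xi in zip(w, xs))
            for i in range(n + 1):           # accumulate eta*(t - o)*x_i
                delta[i] += eta * (t - o) * xs[i]
        for i in range(n + 1):               # apply the accumulated updates
            w[i] += delta[i]
    return w

# Fit t = 2*x1 + 1:
data = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 7.0)]
w = gradient_descent(data)
```

With a sufficiently small η the weights converge to the minimum-squared-error hypothesis, here w0 ≈ 1, w1 ≈ 2.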
Summary
Perceptron training rule guaranteed to succeed if
• Training examples are linearly separable
• Sufficiently small learning rate η
Linear unit training rule uses gradient descent • Guaranteed to converge to hypothesis with minimum squared error • Given sufficiently small learning rate η • Even when training data contains noise • Even when training data not separable by H
Incremental (Stochastic) Gradient Descent

Batch mode gradient descent:
Do until satisfied:
1. Compute the gradient ∇E_D[w], where E_D[w] ≡ ½ ∑_{d∈D} (td − od)²
2. w ← w − η ∇E_D[w]

Incremental mode gradient descent:
Do until satisfied:
– For each training example d in D:
  1. Compute the gradient ∇E_d[w], where E_d[w] ≡ ½ (td − od)²
  2. w ← w − η ∇E_d[w]
Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η made small enough
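The incremental variant differs from batch mode in one place only: the weights change after every example instead of after a full pass. A sketch on the same illustrative data set as before (t = 2·x1 + 1):

```python
def stochastic_gradient_descent(training_examples, eta=0.05, iterations=1000):
    """Incremental (stochastic) gradient descent for a linear unit:
    weights are updated after every example, following the gradient of
    E_d[w] = 1/2 * (t_d - o_d)^2 for that single example d."""
    n = len(training_examples[0][0])
    w = [0.0] * (n + 1)                      # w[0] is the bias weight w0
    for _ in range(iterations):
        for x, t in training_examples:
            xs = [1.0] + list(x)
            o = sum(wi * xi for wi, xi in zip(w, xs))
            for i in range(n + 1):
                w[i] += eta * (t - o) * xs[i]   # update immediately
    return w

data = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 7.0)]
w = stochastic_gradient_descent(data)
```

On this noise-free data both variants reach the same solution; the incremental version simply takes many small steps per pass instead of one accumulated step.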
Multilayer Networks of Sigmoid Units
(Figure: a two-layer sigmoid network — inputs x1 and x2 feed hidden units h1 and h2, which feed output units d0–d3 — alongside the resulting non-linear decision regions.)
Multilayer Decision Space

(Figure: the decision regions formed by the multilayer network above.)
Sigmoid Unit

(Figure: inputs x1, …, xn with weights w1, …, wn, plus a bias input x0 = 1 with weight w0, feeding a summation unit Σ and a sigmoid.)

net = ∑_{i=0}^{n} wi xi        o = σ(net) = 1 / (1 + e^−net)

σ(x) is the sigmoid function: σ(x) = 1 / (1 + e^−x)
Nice property: dσ(x)/dx = σ(x)(1 − σ(x))
We can derive gradient descent rules to train
• One sigmoid unit
• Multilayer networks of sigmoid units → Backpropagation
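The "nice property" is easy to check in code; the function names below are choices made here, and the loop verifies the identity σ′(x) = σ(x)(1 − σ(x)) against a numerical finite difference:

```python
import math

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^-x)"""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """d sigma / dx = sigma(x) * (1 - sigma(x))"""
    s = sigmoid(x)
    return s * (1.0 - s)

# Check the derivative identity against a central finite difference:
h = 1e-6
for x in (-2.0, 0.0, 3.0):
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    assert abs(numeric - sigmoid_prime(x)) < 1e-6
```

It is exactly this closed-form derivative, expressed in terms of the unit's own output, that makes the backpropagation update rules below so compact.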
The Sigmoid Function
(Plot: σ(x) = 1/(1 + e^−x), output versus net input over roughly −6 to 6, rising smoothly from 0 to 1.)
Sort of a rounded step function. Unlike the step function, it has a derivative everywhere (which makes gradient-based learning possible).
Error Gradient for a Sigmoid Unit
∂E/∂wi = ∂/∂wi ½ ∑_{d∈D} (td − od)²
       = ½ ∑_d ∂/∂wi (td − od)²
       = ½ ∑_d 2(td − od) ∂/∂wi (td − od)
       = ∑_d (td − od)(−∂od/∂wi)
       = −∑_d (td − od) (∂od/∂netd)(∂netd/∂wi)

But we know:
∂od/∂netd = ∂σ(netd)/∂netd = od(1 − od)
∂netd/∂wi = ∂(w·xd)/∂wi = xi,d

So:
∂E/∂wi = −∑_{d∈D} (td − od) od (1 − od) xi,d
Backpropagation Algorithm
Initialize all weights to small random numbers.
Until satisfied, do
• For each training example, do
  1. Input the training example and compute the outputs
  2. For each output unit k:
     δk ← ok(1 − ok)(tk − ok)
  3. For each hidden unit h:
     δh ← oh(1 − oh) ∑_{k∈outputs} wh,k δk
  4. Update each network weight wi,j:
     wi,j ← wi,j + ∆wi,j, where ∆wi,j = η δj xi,j
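The four steps above can be sketched for a network with one sigmoid hidden layer. Everything here — function names, the XOR data set, the learning rate, the epoch count, and the fixed random seed — is an illustrative choice, not part of the algorithm itself; whether training reaches a good minimum depends on the random initial weights.

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def feedforward(w_h, w_o, x):
    """Forward pass; both layers carry a bias weight in column 0 (input 1)."""
    xs = [1.0] + list(x)
    h = [1.0] + [sigmoid(sum(w * v for w, v in zip(row, xs))) for row in w_h]
    return [sigmoid(sum(w * v for w, v in zip(row, h))) for row in w_o]

def train_backprop(examples, n_hidden, eta=0.5, epochs=10000, seed=0):
    """Backpropagation for one sigmoid hidden layer and sigmoid outputs.
    `examples` is a list of (x, t) with target values in (0, 1)."""
    rng = random.Random(seed)
    n_in, n_out = len(examples[0][0]), len(examples[0][1])
    # initialize all weights to small random numbers
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_o = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in examples:
            xs = [1.0] + list(x)
            # 1. compute the outputs
            h = [1.0] + [sigmoid(sum(w * v for w, v in zip(row, xs))) for row in w_h]
            o = [sigmoid(sum(w * v for w, v in zip(row, h))) for row in w_o]
            # 2. delta_k = o_k(1 - o_k)(t_k - o_k) for each output unit
            d_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
            # 3. delta_h = o_h(1 - o_h) * sum_k w_hk * delta_k for each hidden unit
            d_h = [h[j + 1] * (1 - h[j + 1]) *
                   sum(w_o[k][j + 1] * d_o[k] for k in range(n_out))
                   for j in range(n_hidden)]
            # 4. w_ij <- w_ij + eta * delta_j * x_ij for every weight
            for k in range(n_out):
                for j in range(n_hidden + 1):
                    w_o[k][j] += eta * d_o[k] * h[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w_h[j][i] += eta * d_h[j] * xs[i]
    return w_h, w_o

# XOR is not linearly separable, so a single perceptron cannot learn it:
xor = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
w_h, w_o = train_backprop(xor, n_hidden=3)
```

After training, the squared error over the training set is far below that of the untrained network, and rounding the outputs typically recovers XOR exactly.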
More on Backpropagation
• Gradient descent over entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global, error minimum
  – In practice, often works well (can run multiple times)
• Often include weight momentum α:
∆wi,j(n) = η δj xi,j + α ∆wi,j(n − 1)
• Minimizes error over training examples
• Will it generalize well to subsequent examples?
• Training can take thousands of iterations -- slow!
  – Using network after training is fast
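The momentum rule needs only one extra piece of state per weight: the previous update ∆w(n−1). A minimal sketch, with names chosen here for illustration (`grad_step` holds the η·δj·xi,j terms for the current iteration, `velocity` the previous ∆w values):

```python
def momentum_update(w, grad_step, velocity, alpha=0.9):
    """One momentum step per weight:
    dw(n) = eta*delta_j*x_ij + alpha*dw(n-1).
    `w`, `grad_step`, and `velocity` are parallel lists of floats."""
    for i in range(len(w)):
        velocity[i] = grad_step[i] + alpha * velocity[i]  # dw(n)
        w[i] += velocity[i]
    return w, velocity
```

When successive gradient steps point the same way, the α term lets the update build up speed across flat regions and roll through small local dips.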
Learning Hidden Layer Representations
(Figure: an 8×3×8 network — 8 inputs, 3 hidden units, 8 outputs — trained to learn the identity function over these examples.)

Input → Output
10000000 → 10000000
01000000 → 01000000
00100000 → 00100000
00010000 → 00010000
00001000 → 00001000
00000100 → 00000100
00000010 → 00000010
00000001 → 00000001
Learned hidden layer representation:

Input → Hidden values → Output
10000000 → .89 .04 .08 → 10000000
01000000 → .01 .11 .88 → 01000000
00100000 → .01 .97 .27 → 00100000
00010000 → .99 .97 .71 → 00010000
00001000 → .03 .05 .02 → 00001000
00000100 → .22 .99 .99 → 00000100
00000010 → .80 .01 .98 → 00000010
00000001 → .60 .94 .01 → 00000001
Output Unit Error during Training
(Plot: sum of squared errors for each output unit versus training iterations 0–2500, decreasing toward zero.)
Hidden Unit Encoding
(Plot: the three hidden unit values for one input pattern versus training iterations 0–2500.)
Input to Hidden Weights
(Plot: the weights from the eight inputs to one hidden unit versus training iterations 0–2500, spreading out over roughly −5 to 4.)
Convergence of Backpropagation
Gradient descent to some local minimum
• Perhaps not global minimum
• Momentum can cause quicker convergence
• Stochastic gradient descent also results in faster convergence
• Can train multiple networks and get different results (using different initial weights)
Nature of convergence
• Initialize weights near zero
• Therefore, initial networks are near-linear
• Increasingly non-linear functions emerge as training progresses

Expressive Capabilities of ANNs
Boolean functions:
• Every Boolean function can be represented by a network with a single hidden layer
• But that might require an exponential number of hidden units (in the number of inputs)
Continuous functions: • Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989] • Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
Overfitting in ANNs
(Plot: error versus number of weight updates, example 1 — over 20,000 updates the training set error keeps decreasing while the validation set error levels off and begins to rise.)
(Plot: error versus number of weight updates, example 2 — training set and validation set error over 6,000 updates; the validation error does not simply track the training error.)
Neural Nets for Face Recognition

(Figure: a network with 30x32 image inputs feeding hidden units and four outputs — left, straight, right, up — shown above a row of typical input images.)

90% accurate at learning head pose, and at recognizing 1 of 20 faces.
Learned Network Weights

(Figure: the same 30x32-input network with the learned weights visualized for the left/straight/right/up outputs, above typical input images.)
Alternative Error Functions
Penalize large weights:
E(w) ≡ ½ ∑_{d∈D} ∑_{k∈outputs} (tkd − okd)² + γ ∑_{i,j} wji²
Train on target slopes as well as values:
E(w) ≡ ½ ∑_{d∈D} ∑_{k∈outputs} [ (tkd − okd)² + μ ∑_j (∂tkd/∂xdʲ − ∂okd/∂xdʲ)² ]
Tie together weights:
• e.g., in phoneme recognition
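The weight-penalty term changes the gradient in a simple way: since the penalty γ·∑wji² contributes 2γ·wji to ∂E/∂wji, every step also shrinks each weight toward zero ("weight decay"). A minimal per-weight sketch, with names chosen here for illustration (`delta` standing for the (t − o)-style error term of whichever unit the weights feed):

```python
def weight_decay_update(w, delta, x, eta=0.05, gamma=0.001):
    """Gradient step for the penalized error
    E = 1/2 * sum (t - o)^2 + gamma * sum w^2.
    The penalty adds 2*gamma*w_i to dE/dw_i, so each step also
    shrinks every weight toward zero (weight decay)."""
    for i in range(len(w)):
        w[i] += eta * (delta * x[i] - 2.0 * gamma * w[i])
    return w
```

With zero error signal (delta = 0) the update still multiplies each weight by (1 − 2ηγ) per step, which is what biases the network toward small weights.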
Recurrent Networks

(Figure: a feedforward network computing y(t+1) from x(t); a recurrent network in which y(t+1) also depends on earlier values; and the same recurrent network unfolded in time over x(t), x(t−1), x(t−2).)

Neural Network Summary
• physiologically (neurons) inspired model
• powerful (accurate), slow, opaque (hard to understand resulting model)
• bias: preferential
  – based on gradient descent
  – finds local minimum
  – affected by initial conditions, parameters
• neural units
  – linear
  – linear threshold
  – sigmoid
Neural Network Summary (cont)
• gradient descent
  – convergence
• linear units
  – limitation: hyperplane decision surface
  – learning rule
• multilayer network
  – advantage: can have non-linear decision surface
  – backpropagation to learn
    • backprop learning rule
• learning issues
  – units used
• learning issues (cont)
  – batch versus incremental (stochastic)
  – parameters
    • initial weights
    • learning rate
    • momentum
  – cost (error) function
    • sum of squared errors
    • can include penalty terms
• recurrent networks
  – simple
  – backpropagation through time