
Artificial Neural Networks

• Threshold units
• Gradient descent
• Multilayer networks
• Backpropagation
• Hidden layer representations
• Example: Face recognition
• Advanced topics

Connectionist Models

Consider humans:
• Neuron switching time ~0.001 second
• Number of neurons ~10^10
• Connections per neuron ~10^4 to 10^5
• Scene recognition time ~0.1 second
• 100 inference steps do not seem like enough -- must use lots of parallel computation!

Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed process
• Emphasis on tuning weights automatically

When to Consider Neural Networks

• Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant

Examples:
• Speech phoneme recognition [Waibel]
• Image classification [Kanade, Baluja, Rowley]
• Financial prediction

ALVINN drives 70 mph on highways

[Figure: the ALVINN steering network. A 30x32 sensor input retina feeds 4 hidden units, whose outputs indicate the steering direction: sharp left, straight ahead, or sharp right.]

Perceptron

[Figure: a perceptron unit. Inputs x1, ..., xn with weights w1, ..., wn (plus a constant input x0 = 1 with weight w0) feed the weighted sum Σ_{i=0..n} wi·xi, which is then thresholded.]

o(x1, ..., xn) = 1 if w0 + w1·x1 + ... + wn·xn > 0, and -1 otherwise

Sometimes we will use the simpler vector notation:

o(x) = 1 if w·x > 0, and -1 otherwise

Decision Surface of Perceptron

[Figure: two data sets plotted in the (x1, x2) plane: one that a single line can separate, and one (e.g., XOR) that no line can separate.]

Represents some useful functions
• What weights represent g(x1, x2) = AND(x1, x2)?

But some functions are not representable
• e.g., those that are not linearly separable
• therefore, we will want networks of these units ...

Perceptron Training Rule

wi ← wi + ∆wi, where ∆wi = η(t − o)xi

• t = c(x) is the target value
• o is the perceptron output
• η is a small constant (e.g., 0.1) called the learning rate

Can prove it will converge
• if the training data is linearly separable
• and η is sufficiently small

Gradient Descent

To understand, consider a simple linear unit, where

o = w0 + w1·x1 + ... + wn·xn

Idea: learn the wi's that minimize the squared error

E[w] ≡ (1/2) Σ_{d∈D} (td − od)²

where D is the set of training examples.

[Figure: the error surface E plotted over the weight space (w0, w1); gradient descent steps downhill along the negative gradient toward the minimum.]

Gradient:

∇E[w] ≡ [∂E/∂w0, ∂E/∂w1, ..., ∂E/∂wn]

Training rule:

∆w = −η ∇E[w], i.e., ∆wi = −η ∂E/∂wi

Computing the gradient:

∂E/∂wi = ∂/∂wi (1/2) Σ_d (td − od)²
       = (1/2) Σ_d ∂/∂wi (td − od)²
       = (1/2) Σ_d 2(td − od) ∂/∂wi (td − od)
       = Σ_d (td − od) ∂/∂wi (td − w·xd)
∂E/∂wi = Σ_d (td − od)(−xi,d)

GRADIENT-DESCENT(training_examples, η)

Each training example is a pair of the form <x, t>, where x is the vector of input values and t is the target output value. η is the learning rate (e.g., 0.05).
• Initialize each wi to some small random value
• Until the termination condition is met, do
  - Initialize each ∆wi to zero
  - For each <x, t> in training_examples, do
    * Input the instance x and compute the output o
    * For each linear unit weight wi, do ∆wi ← ∆wi + η(t − o)xi
  - For each linear unit weight wi, do wi ← wi + ∆wi
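To make the GRADIENT-DESCENT procedure concrete, here is a minimal Python/NumPy sketch for a single linear unit. The function name gradient_descent, the fixed epoch count used as the termination condition, and the toy dataset are assumptions made for illustration; they are not part of the slides.

```python
import numpy as np

def gradient_descent(training_examples, eta=0.05, epochs=500):
    """Batch gradient descent for a linear unit o = w . x (x includes x0 = 1)."""
    n = len(training_examples[0][0])          # number of inputs, including x0
    w = np.random.uniform(-0.05, 0.05, n)     # initialize weights to small random values
    for _ in range(epochs):                   # stand-in for "until the termination condition is met"
        delta_w = np.zeros(n)                 # initialize each delta w_i to zero
        for x, t in training_examples:
            o = np.dot(w, x)                  # compute the linear unit's output
            delta_w += eta * (t - o) * x      # accumulate eta * (t - o) * x_i
        w += delta_w                          # apply the accumulated update
    return w

# Toy example (illustrative): learn t = 2*x1 - 3*x2 + 1 from a few noiseless points.
examples = [(np.array([1.0, x1, x2]), 2 * x1 - 3 * x2 + 1)
            for x1 in (0.0, 1.0, 2.0) for x2 in (0.0, 1.0)]
print(gradient_descent(examples))             # weights should approach [1, 2, -3]
```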
Summary

Perceptron training rule guaranteed to succeed if
• the training examples are linearly separable
• and the learning rate η is sufficiently small

Linear unit training rule uses gradient descent
• guaranteed to converge to the hypothesis with minimum squared error
• given a sufficiently small learning rate η
• even when the training data contains noise
• even when the training data is not separable by H

Incremental (Stochastic) Gradient Descent

Batch mode gradient descent. Do until satisfied:
1. Compute the gradient ∇ED[w], where ED[w] ≡ (1/2) Σ_{d∈D} (td − od)²
2. w ← w − η ∇ED[w]

Incremental mode gradient descent. Do until satisfied:
• For each training example d in D
  1. Compute the gradient ∇Ed[w], where Ed[w] ≡ (1/2)(td − od)²
  2. w ← w − η ∇Ed[w]

Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is made small enough.

Multilayer Networks of Sigmoid Units

[Figure: a two-layer network with inputs x1 and x2, hidden units h1 and h2, and output o, together with the decision regions (labeled d0 through d3) it induces over the input space.]

Multilayer Decision Space

[Figure: the non-linear decision regions that a multilayer network can form over the input space.]

Sigmoid Unit

[Figure: a sigmoid unit. Inputs x1, ..., xn with weights w1, ..., wn (plus x0 = 1 with weight w0) feed net = Σ_{i=0..n} wi·xi, and the output is o = σ(net) = 1 / (1 + e^(−net)).]

σ(x) is the sigmoid function: σ(x) = 1 / (1 + e^(−x))

Nice property: dσ(x)/dx = σ(x)(1 − σ(x))

We can derive gradient descent rules to train
• one sigmoid unit
• multilayer networks of sigmoid units → Backpropagation

The Sigmoid Function

[Figure: plot of σ(x) = 1 / (1 + e^(−x)); the output rises smoothly from 0 to 1 as the net input goes from −6 to 6.]

Sort of a rounded step function. Unlike the step function, it has a derivative (which makes learning possible).

Error Gradient for a Sigmoid Unit

∂E/∂wi = ∂/∂wi (1/2) Σ_{d∈D} (td − od)²
       = (1/2) Σ_d ∂/∂wi (td − od)²
       = (1/2) Σ_d 2(td − od) ∂/∂wi (td − od)
       = Σ_d (td − od)(−∂od/∂wi)
       = −Σ_d (td − od) (∂od/∂netd)(∂netd/∂wi)

But we know:

∂od/∂netd = ∂σ(netd)/∂netd = od(1 − od)
∂netd/∂wi = ∂(w·xd)/∂wi = xi,d

So:

∂E/∂wi = −Σ_{d∈D} (td − od) od (1 − od) xi,d

Backpropagation Algorithm

Initialize all weights to small random numbers.
Until satisfied, do
• For each training example, do
  1. Input the training example and compute the outputs
  2. For each output unit k: δk ← ok(1 − ok)(tk − ok)
  3. For each hidden unit h: δh ← oh(1 − oh) Σ_{k∈outputs} wh,k δk
  4. Update each network weight wi,j: wi,j ← wi,j + ∆wi,j, where ∆wi,j = η δj xi,j

More on Backpropagation

• Gradient descent over the entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global, error minimum
  – in practice, often works well (can run multiple times)
• Often include weight momentum α: ∆wi,j(n) = η δj xi,j + α ∆wi,j(n − 1)
• Minimizes error over the training examples
  – will it generalize well to subsequent examples?
• Training can take thousands of iterations -- slow!
  – using the network after training is fast
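As a concrete illustration of these update rules (including the optional momentum term), here is a minimal NumPy sketch of stochastic backpropagation for a network with one hidden layer of sigmoid units. The function name backprop_train, the network sizes, the training settings, and the XOR usage example are illustrative assumptions, not the slides' own code.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_train(X, T, n_hidden=3, eta=0.3, alpha=0.9, epochs=10000, seed=0):
    """Stochastic backpropagation for an input -> hidden -> output sigmoid network.

    X: (n_examples, n_in) inputs; T: (n_examples, n_out) targets in [0, 1].
    A bias input (x0 = 1) is appended to the input and hidden layers.
    """
    rng = np.random.default_rng(seed)
    n_in, n_out = X.shape[1], T.shape[1]
    # Small random initial weights (rows: destination unit, cols: source unit + bias).
    W_h = rng.uniform(-0.05, 0.05, (n_hidden, n_in + 1))
    W_o = rng.uniform(-0.05, 0.05, (n_out, n_hidden + 1))
    dW_h, dW_o = np.zeros_like(W_h), np.zeros_like(W_o)   # previous updates, for momentum

    for _ in range(epochs):
        for x, t in zip(X, T):
            x1 = np.append(x, 1.0)                         # input plus bias
            h = sigmoid(W_h @ x1)                          # hidden activations
            h1 = np.append(h, 1.0)                         # hidden plus bias
            o = sigmoid(W_o @ h1)                          # output activations
            delta_o = o * (1 - o) * (t - o)                # error term for output units
            delta_h = h * (1 - h) * (W_o[:, :n_hidden].T @ delta_o)  # error term for hidden units
            # Weight updates with momentum: dw(n) = eta*delta*x + alpha*dw(n-1).
            dW_o = eta * np.outer(delta_o, h1) + alpha * dW_o
            dW_h = eta * np.outer(delta_h, x1) + alpha * dW_h
            W_o += dW_o
            W_h += dW_h
    return W_h, W_o

# Usage (illustrative): XOR, a function a single perceptron cannot represent.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
W_h, W_o = backprop_train(X, T, n_hidden=2)
for x in X:
    h1 = np.append(sigmoid(W_h @ np.append(x, 1.0)), 1.0)
    print(x, sigmoid(W_o @ h1))   # outputs should approach 0, 1, 1, 0 if training converged
```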
Learning Hidden Layer Representations

[Figure: an 8 x 3 x 8 network: eight input units feed three hidden units, which feed eight output units. The network is trained to reproduce its input at its output.]

Input → Output
10000000 → 10000000
01000000 → 01000000
00100000 → 00100000
00010000 → 00010000
00001000 → 00001000
00000100 → 00000100
00000010 → 00000010
00000001 → 00000001

After training, the three hidden units have learned a compact encoding of the eight inputs:

Input → Hidden Values → Output
10000000 → .89 .04 .08 → 10000000
01000000 → .01 .11 .88 → 01000000
00100000 → .01 .97 .27 → 00100000
00010000 → .99 .97 .71 → 00010000
00001000 → .03 .05 .02 → 00001000
00000100 → .22 .99 .99 → 00000100
00000010 → .80 .01 .98 → 00000010
00000001 → .60 .94 .01 → 00000001

Output Unit Error during Training

[Figure: the sum of squared errors for each output unit falls toward zero over roughly 2500 training iterations.]

Hidden Unit Encoding

[Figure: the hidden unit encoding for one input pattern evolves over roughly 2500 training iterations, settling to values between 0 and 1.]

Input to Hidden Weights

[Figure: the weights from the inputs to one hidden unit change over roughly 2500 training iterations, ranging from about −5 to 4.]

Convergence of Backpropagation

Gradient descent to some local minimum
• perhaps not the global minimum
• momentum can cause quicker convergence
• stochastic gradient descent also results in faster convergence
• can train multiple networks and get different results (using different initial weights)

Nature of convergence
• initialize weights near zero
• therefore, initial networks are near-linear
• increasingly non-linear functions become representable as training progresses

Expressive Capabilities of ANNs

Boolean functions:
• Every Boolean function can be represented by a network with a single hidden layer
• but that might require a number of hidden units exponential in the number of inputs

Continuous functions:
• Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
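The 8 x 3 x 8 experiment above can be reproduced with a short, self-contained script such as the sketch below. The learning rate, momentum, iteration count, and random seed are illustrative guesses; a particular run may need different settings to converge, and the learned hidden values will not match the table exactly.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Eight one-hot input patterns; the target output is the input itself.
X = np.eye(8)
rng = np.random.default_rng(1)
W_h = rng.uniform(-0.1, 0.1, (3, 9))        # 8 inputs + bias -> 3 hidden units
W_o = rng.uniform(-0.1, 0.1, (8, 4))        # 3 hidden units + bias -> 8 outputs
dW_h, dW_o = np.zeros_like(W_h), np.zeros_like(W_o)
eta, alpha = 0.3, 0.9                        # illustrative learning rate and momentum

for _ in range(5000):                        # stochastic backprop over all 8 patterns
    for x in X:
        x1 = np.append(x, 1.0)
        h = sigmoid(W_h @ x1)
        h1 = np.append(h, 1.0)
        o = sigmoid(W_o @ h1)
        delta_o = o * (1 - o) * (x - o)                   # target equals the input
        delta_h = h * (1 - h) * (W_o[:, :3].T @ delta_o)
        dW_o = eta * np.outer(delta_o, h1) + alpha * dW_o
        dW_h = eta * np.outer(delta_h, x1) + alpha * dW_h
        W_o += dW_o
        W_h += dW_h

# Print the learned hidden encoding for each input pattern.
for x in X:
    h = sigmoid(W_h @ np.append(x, 1.0))
    print(x.astype(int), "->", np.round(h, 2))
```

If training succeeds, each of the eight inputs maps to a distinct pattern of three hidden values, an encoding the network invented for itself, much like the table of hidden values shown earlier.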