Artificial Neural Networks
• Threshold units
• Gradient descent
• Multilayer networks
• Backpropagation
• Hidden layer representations
• Example: Face recognition
• Advanced topics
CS 5751 Machine Learning — Chapter 4: Artificial Neural Networks

Connectionist Models
Consider humans:
• Neuron switching time ~0.001 second
• Number of neurons ~10^10
• Connections per neuron ~10^4-5
• Scene recognition time ~0.1 second
• 100 inference steps do not seem like enough
→ must use lots of parallel computation!
Properties of artificial neural nets (ANNs):
• Many neuron-like threshold switching units
• Many weighted interconnections among units
• Highly parallel, distributed processing
• Emphasis on tuning weights automatically
When to Consider Neural Networks
• Input is high-dimensional discrete or real-valued (e.g., raw sensor input)
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data
• Form of target function is unknown
• Human readability of result is unimportant
Examples:
• Speech phoneme recognition [Waibel]
• Image classification [Kanade, Baluja, Rowley]
• Financial prediction
ALVINN drives 70 mph on highways
(Figure: the ALVINN network — a 30x32 sensor input retina feeding 4 hidden units, with output units for Sharp Left, Straight Ahead, and Sharp Right.)
Perceptron

(Figure: inputs x1, …, xn with weights w1, …, wn, plus a bias input x0 = 1 with weight w0, feeding a summation unit Σ and a threshold.)

o = 1 if ∑_{i=0}^{n} wi xi > 0, −1 otherwise

o(x1, …, xn) = 1 if w0 + w1x1 + … + wnxn > 0, −1 otherwise

Sometimes we will use simpler vector notation:
o(x) = 1 if w·x > 0, −1 otherwise

Decision Surface of Perceptron

(Figure: two sets of points in the (x1, x2) plane — one linearly separable by a perceptron's decision surface, one not.)
Represents some useful functions
• What weights represent g(x1, x2) = AND(x1, x2)?
But some functions are not representable
• e.g., those that are not linearly separable
• therefore, we will want networks of these ...
Perceptron Training Rule
• wi ← wi + ∆wi, where ∆wi = η(t − o)xi
• t = c(x) is the target value
• o is the perceptron output
• η is a small constant (e.g., 0.1) called the learning rate
Can prove it will converge • If training data is linearly separable • and η is sufficiently small
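The rule above can be sketched in code. This is a minimal illustration, not ALVINN-scale: the function names and the AND training set are choices made here, with targets in {−1, +1} and the bias handled as weight w0 on a fixed input x0 = 1.

```python
def train_perceptron(examples, eta=0.1, epochs=100):
    """Perceptron training rule: wi <- wi + eta*(t - o)*xi.
    `examples` is a list of (x, t) pairs with t in {-1, +1};
    x excludes the bias input x0 = 1, which is prepended here."""
    n = len(examples[0][0])
    w = [0.0] * (n + 1)                  # w[0] is the bias weight w0
    for _ in range(epochs):
        converged = True
        for x, t in examples:
            xs = [1.0] + list(x)
            o = 1 if sum(wi * xi for wi, xi in zip(w, xs)) > 0 else -1
            if o != t:                   # update only on mistakes
                converged = False
                for i in range(n + 1):
                    w[i] += eta * (t - o) * xs[i]
        if converged:                    # no mistakes in a full pass
            break
    return w

def predict(w, x):
    xs = [1.0] + list(x)
    return 1 if sum(wi * xi for wi, xi in zip(w, xs)) > 0 else -1

# AND(x1, x2) is linearly separable, so the rule converges:
and_data = [((0, 0), -1), ((0, 1), -1), ((1, 0), -1), ((1, 1), 1)]
w = train_perceptron(and_data)
```

Because AND is linearly separable and η is small, training stops after a few passes with a separating weight vector.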
Gradient Descent
To understand, consider a simple linear unit, where
o = w0 + w1x1 + ...+ wn xn
Idea: learn wi's that minimize the squared error
E[w] ≡ ½ ∑_{d∈D} (td − od)²
where D is the set of training examples
Gradient: ∇E[w] ≡ [∂E/∂w0, ∂E/∂w1, …, ∂E/∂wn]
Training rule: ∆w = −η ∇E[w]
i.e., ∆wi = −η ∂E/∂wi
∂E/∂wi = ∂/∂wi ½ ∑_d (td − od)²
       = ½ ∑_d ∂/∂wi (td − od)²
       = ½ ∑_d 2(td − od) ∂/∂wi (td − od)
       = ∑_d (td − od) ∂/∂wi (td − w·xd)
∂E/∂wi = ∑_d (td − od)(−xi,d)
GRADIENT-DESCENT(training_examples, η)
Each training example is a pair ⟨x, t⟩, where x is the vector of input values and t is the target output value. η is the learning rate (e.g., 0.05).
• Initialize each wi to some small random value
• Until the termination condition is met, do
  – Initialize each ∆wi to zero
  – For each ⟨x, t⟩ in training_examples, do
    * Input the instance x and compute the output o
    * For each linear unit weight wi, do
      ∆wi ← ∆wi + η(t − o)xi
  – For each linear unit weight wi, do
      wi ← wi + ∆wi
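The GRADIENT-DESCENT procedure translates almost line for line into code. A minimal sketch for a linear unit, fitting t = 2·x1 + 1 from four noise-free examples (a data set chosen here just for illustration):

```python
def gradient_descent(training_examples, eta=0.05, iterations=1000):
    """Batch gradient descent for a linear unit
    o = w0 + w1*x1 + ... + wn*xn.
    `training_examples` is a list of (x, t); x excludes the bias input."""
    n = len(training_examples[0][0])
    w = [0.0] * (n + 1)                      # w[0] is the bias weight w0
    for _ in range(iterations):
        delta = [0.0] * (n + 1)              # initialize each delta-w_i to zero
        for x, t in training_examples:
            xs = [1.0] + list(x)
            o = sum(wi * xi for wi, xi in zip(w, xs))
            for i in range(n + 1):           # accumulate eta*(t - o)*x_i
                delta[i] += eta * (t - o) * xs[i]
        for i in range(n + 1):               # apply the accumulated updates
            w[i] += delta[i]
    return w

# Fit t = 2*x1 + 1:
data = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 7.0)]
w = gradient_descent(data)
```

With a sufficiently small η the weights converge to the minimum-squared-error hypothesis, here w0 ≈ 1, w1 ≈ 2.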
Summary
Perceptron training rule guaranteed to succeed if
• Training examples are linearly separable
• Sufficiently small learning rate η
Linear unit training rule uses gradient descent • Guaranteed to converge to hypothesis with minimum squared error • Given sufficiently small learning rate η • Even when training data contains noise • Even when training data not separable by H
Incremental (Stochastic) Gradient Descent

Batch mode gradient descent:
Do until satisfied:
1. Compute the gradient ∇E_D[w], where E_D[w] ≡ ½ ∑_{d∈D} (td − od)²
2. w ← w − η ∇E_D[w]

Incremental mode gradient descent:
Do until satisfied:
– For each training example d in D:
  1. Compute the gradient ∇E_d[w], where E_d[w] ≡ ½ (td − od)²
  2. w ← w − η ∇E_d[w]
Incremental Gradient Descent can approximate Batch Gradient Descent arbitrarily closely if η made small enough
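The incremental variant differs from batch mode in one place only: the weights change after every example instead of after a full pass. A sketch on the same illustrative data set as before (t = 2·x1 + 1):

```python
def stochastic_gradient_descent(training_examples, eta=0.05, iterations=1000):
    """Incremental (stochastic) gradient descent for a linear unit:
    weights are updated after every example, following the gradient of
    E_d[w] = 1/2 * (t_d - o_d)^2 for that single example d."""
    n = len(training_examples[0][0])
    w = [0.0] * (n + 1)                      # w[0] is the bias weight w0
    for _ in range(iterations):
        for x, t in training_examples:
            xs = [1.0] + list(x)
            o = sum(wi * xi for wi, xi in zip(w, xs))
            for i in range(n + 1):
                w[i] += eta * (t - o) * xs[i]   # update immediately
    return w

data = [((0.0,), 1.0), ((1.0,), 3.0), ((2.0,), 5.0), ((3.0,), 7.0)]
w = stochastic_gradient_descent(data)
```

On this noise-free data both variants reach the same solution; the incremental version simply takes many small steps per pass instead of one accumulated step.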
Multilayer Networks of Sigmoid Units
(Figure: a two-layer sigmoid network — inputs x1 and x2 feed hidden units h1 and h2, which feed output units d0–d3 — alongside the resulting non-linear decision regions.)
Multilayer Decision Space

(Figure: the decision regions formed by the multilayer network above.)
Sigmoid Unit

(Figure: inputs x1, …, xn with weights w1, …, wn, plus a bias input x0 = 1 with weight w0, feeding a summation unit Σ and a sigmoid.)

net = ∑_{i=0}^{n} wi xi        o = σ(net) = 1 / (1 + e^−net)

σ(x) is the sigmoid function: σ(x) = 1 / (1 + e^−x)
Nice property: dσ(x)/dx = σ(x)(1 − σ(x))
We can derive gradient descent rules to train
• One sigmoid unit
• Multilayer networks of sigmoid units → Backpropagation
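The "nice property" is easy to check in code; the function names below are choices made here, and the loop verifies the identity σ′(x) = σ(x)(1 − σ(x)) against a numerical finite difference:

```python
import math

def sigmoid(x):
    """sigma(x) = 1 / (1 + e^-x)"""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_prime(x):
    """d sigma / dx = sigma(x) * (1 - sigma(x))"""
    s = sigmoid(x)
    return s * (1.0 - s)

# Check the derivative identity against a central finite difference:
h = 1e-6
for x in (-2.0, 0.0, 3.0):
    numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
    assert abs(numeric - sigmoid_prime(x)) < 1e-6
```

It is exactly this closed-form derivative, expressed in terms of the unit's own output, that makes the backpropagation update rules below so compact.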
The Sigmoid Function
(Plot: σ(x) = 1/(1 + e^−x), output versus net input over roughly −6 to 6, rising smoothly from 0 to 1.)
Sort of a rounded step function. Unlike the step function, it has a derivative everywhere (which makes gradient-based learning possible).
Error Gradient for a Sigmoid Unit
∂E/∂wi = ∂/∂wi ½ ∑_{d∈D} (td − od)²
       = ½ ∑_d ∂/∂wi (td − od)²
       = ½ ∑_d 2(td − od) ∂/∂wi (td − od)
       = ∑_d (td − od)(−∂od/∂wi)
       = −∑_d (td − od) (∂od/∂netd)(∂netd/∂wi)

But we know:
∂od/∂netd = ∂σ(netd)/∂netd = od(1 − od)
∂netd/∂wi = ∂(w·xd)/∂wi = xi,d

So:
∂E/∂wi = −∑_{d∈D} (td − od) od (1 − od) xi,d
Backpropagation Algorithm
Initialize all weights to small random numbers.
Until satisfied, do
• For each training example, do
  1. Input the training example and compute the outputs
  2. For each output unit k:
     δk ← ok(1 − ok)(tk − ok)
  3. For each hidden unit h:
     δh ← oh(1 − oh) ∑_{k∈outputs} wh,k δk
  4. Update each network weight wi,j:
     wi,j ← wi,j + ∆wi,j, where ∆wi,j = η δj xi,j
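The four steps above can be sketched for a network with one sigmoid hidden layer. Everything here — function names, the XOR data set, the learning rate, the epoch count, and the fixed random seed — is an illustrative choice, not part of the algorithm itself; whether training reaches a good minimum depends on the random initial weights.

```python
import math, random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def feedforward(w_h, w_o, x):
    """Forward pass; both layers carry a bias weight in column 0 (input 1)."""
    xs = [1.0] + list(x)
    h = [1.0] + [sigmoid(sum(w * v for w, v in zip(row, xs))) for row in w_h]
    return [sigmoid(sum(w * v for w, v in zip(row, h))) for row in w_o]

def train_backprop(examples, n_hidden, eta=0.5, epochs=10000, seed=0):
    """Backpropagation for one sigmoid hidden layer and sigmoid outputs.
    `examples` is a list of (x, t) with target values in (0, 1)."""
    rng = random.Random(seed)
    n_in, n_out = len(examples[0][0]), len(examples[0][1])
    # initialize all weights to small random numbers
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_o = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in examples:
            xs = [1.0] + list(x)
            # 1. compute the outputs
            h = [1.0] + [sigmoid(sum(w * v for w, v in zip(row, xs))) for row in w_h]
            o = [sigmoid(sum(w * v for w, v in zip(row, h))) for row in w_o]
            # 2. delta_k = o_k(1 - o_k)(t_k - o_k) for each output unit
            d_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
            # 3. delta_h = o_h(1 - o_h) * sum_k w_hk * delta_k for each hidden unit
            d_h = [h[j + 1] * (1 - h[j + 1]) *
                   sum(w_o[k][j + 1] * d_o[k] for k in range(n_out))
                   for j in range(n_hidden)]
            # 4. w_ij <- w_ij + eta * delta_j * x_ij for every weight
            for k in range(n_out):
                for j in range(n_hidden + 1):
                    w_o[k][j] += eta * d_o[k] * h[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w_h[j][i] += eta * d_h[j] * xs[i]
    return w_h, w_o

# XOR is not linearly separable, so a single perceptron cannot learn it:
xor = [((0, 0), (0,)), ((0, 1), (1,)), ((1, 0), (1,)), ((1, 1), (0,))]
w_h, w_o = train_backprop(xor, n_hidden=3)
```

After training, the squared error over the training set is far below that of the untrained network, and rounding the outputs typically recovers XOR exactly.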
More on Backpropagation
• Gradient descent over entire network weight vector
• Easily generalized to arbitrary directed graphs
• Will find a local, not necessarily global, error minimum
  – In practice, often works well (can run multiple times)
• Often include weight momentum α:
∆wi,j(n) = η δj xi,j + α ∆wi,j(n − 1)
• Minimizes error over training examples
• Will it generalize well to subsequent examples?
• Training can take thousands of iterations -- slow!
  – Using network after training is fast
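The momentum rule needs only one extra piece of state per weight: the previous update ∆w(n−1). A minimal sketch, with names chosen here for illustration (`grad_step` holds the η·δj·xi,j terms for the current iteration, `velocity` the previous ∆w values):

```python
def momentum_update(w, grad_step, velocity, alpha=0.9):
    """One momentum step per weight:
    dw(n) = eta*delta_j*x_ij + alpha*dw(n-1).
    `w`, `grad_step`, and `velocity` are parallel lists of floats."""
    for i in range(len(w)):
        velocity[i] = grad_step[i] + alpha * velocity[i]  # dw(n)
        w[i] += velocity[i]
    return w, velocity
```

When successive gradient steps point the same way, the α term lets the update build up speed across flat regions and roll through small local dips.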
Learning Hidden Layer Representations
(Figure: an 8×3×8 network — 8 inputs, 3 hidden units, 8 outputs — trained to learn the identity function over these examples.)

Input → Output
10000000 → 10000000
01000000 → 01000000
00100000 → 00100000
00010000 → 00010000
00001000 → 00001000
00000100 → 00000100
00000010 → 00000010
00000001 → 00000001
Learned hidden layer representation:

Input → Hidden values → Output
10000000 → .89 .04 .08 → 10000000
01000000 → .01 .11 .88 → 01000000
00100000 → .01 .97 .27 → 00100000
00010000 → .99 .97 .71 → 00010000
00001000 → .03 .05 .02 → 00001000
00000100 → .22 .99 .99 → 00000100
00000010 → .80 .01 .98 → 00000010
00000001 → .60 .94 .01 → 00000001
Output Unit Error during Training
(Plot: sum of squared errors for each output unit versus training iterations 0–2500, decreasing toward zero.)
Hidden Unit Encoding
(Plot: the three hidden unit values for one input pattern versus training iterations 0–2500.)
Input to Hidden Weights
(Plot: the weights from the eight inputs to one hidden unit versus training iterations 0–2500, spreading out over roughly −5 to 4.)
Convergence of Backpropagation
Gradient descent to some local minimum
• Perhaps not global minimum
• Momentum can cause quicker convergence
• Stochastic gradient descent also results in faster convergence
• Can train multiple networks and get different results (using different initial weights)
Nature of convergence
• Initialize weights near zero
• Therefore, initial networks are near-linear
• Increasingly non-linear functions emerge as training progresses

Expressive Capabilities of ANNs
Boolean functions:
• Every Boolean function can be represented by a network with a single hidden layer
• But that might require an exponential number of hidden units (in the number of inputs)
Continuous functions: • Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989] • Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]
Overfitting in ANNs
(Plot: error versus number of weight updates, example 1 — over 20,000 updates the training set error keeps decreasing while the validation set error levels off and begins to rise.)
(Plot: error versus number of weight updates, example 2 — training set and validation set error over 6,000 updates; the validation error does not simply track the training error.)
Neural Nets for Face Recognition

(Figure: a network with 30x32 image inputs feeding hidden units and four outputs — left, straight, right, up — shown above a row of typical input images.)

90% accurate at learning head pose, and at recognizing 1 of 20 faces.
Learned Network Weights

(Figure: the same 30x32-input network with the learned weights visualized for the left/straight/right/up outputs, above typical input images.)
Alternative Error Functions
Penalize large weights:
E(w) ≡ ½ ∑_{d∈D} ∑_{k∈outputs} (tkd − okd)² + γ ∑_{i,j} wji²
Train on target slopes as well as values:
E(w) ≡ ½ ∑_{d∈D} ∑_{k∈outputs} [ (tkd − okd)² + μ ∑_j (∂tkd/∂xdʲ − ∂okd/∂xdʲ)² ]
Tie together weights:
• e.g., in phoneme recognition
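The weight-penalty term changes the gradient in a simple way: since the penalty γ·∑wji² contributes 2γ·wji to ∂E/∂wji, every step also shrinks each weight toward zero ("weight decay"). A minimal per-weight sketch, with names chosen here for illustration (`delta` standing for the (t − o)-style error term of whichever unit the weights feed):

```python
def weight_decay_update(w, delta, x, eta=0.05, gamma=0.001):
    """Gradient step for the penalized error
    E = 1/2 * sum (t - o)^2 + gamma * sum w^2.
    The penalty adds 2*gamma*w_i to dE/dw_i, so each step also
    shrinks every weight toward zero (weight decay)."""
    for i in range(len(w)):
        w[i] += eta * (delta * x[i] - 2.0 * gamma * w[i])
    return w
```

With zero error signal (delta = 0) the update still multiplies each weight by (1 − 2ηγ) per step, which is what biases the network toward small weights.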
Recurrent Networks

(Figure: a feedforward network computing y(t+1) from x(t); a recurrent network in which y(t+1) also depends on earlier values; and the same recurrent network unfolded in time over x(t), x(t−1), x(t−2).)

Neural Network Summary
• physiologically (neurons) inspired model
• powerful (accurate), slow, opaque (hard to understand resulting model)
• bias: preferential
  – based on gradient descent
  – finds local minimum
  – affected by initial conditions, parameters
• neural units
  – linear
  – linear threshold
  – sigmoid
Neural Network Summary (cont)
• gradient descent
  – convergence
• linear units
  – limitation: hyperplane decision surface
  – learning rule
• multilayer network
  – advantage: can have non-linear decision surface
  – backpropagation to learn
    • backprop learning rule
• learning issues
  – units used
• learning issues (cont)
  – batch versus incremental (stochastic)
  – parameters
    • initial weights
    • learning rate
    • momentum
  – cost (error) function
    • sum of squared errors
    • can include penalty terms
• recurrent networks
  – simple
  – backpropagation through time