CS536: Machine Learning – Artificial Neural Networks
Fall 2005
Ahmed Elgammal
Dept. of Computer Science, Rutgers University

Neural Networks – Biological Motivation: the Brain
• Networks of processing units (neurons) with connections (synapses) between them
• Large number of neurons: about 10^11
• Large connectivity: each neuron is connected, on average, to 10^4 others
• Switching time: about 10^-3 second
• Parallel processing
• Distributed computation/memory
• Processing is done by the neurons; the memory is in the synapses
• Robust to noise and failures

Neural Networks – Characteristics of Biological Computation
• Massive parallelism
• Locality of computation
• Adaptive (self-organizing)
• Distributed representation

Understanding the Brain
• Levels of analysis (Marr, 1982):
  1. Computational theory
  2. Representation and algorithm
  3. Hardware implementation
• Reverse engineering: from hardware to theory
• Parallel processing: SIMD vs. MIMD
  – A neural net is SIMD with modifiable local memory
  – Learning: update by training/experience

• ALVINN system (Pomerleau, 1993)
• Many successful examples:
  – Speech phoneme recognition [Waibel]
  – Image classification [Kanade, Baluja, Rowley]
  – Financial prediction
  – Backgammon [Tesauro]

When to Use ANNs
• Input is high-dimensional, discrete or real-valued (e.g., raw sensor input); inputs can be highly correlated or independent
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data; data may contain errors
• The form of the target function is unknown
• Long training times are acceptable
• Fast evaluation of the target function is required
• Human readability of the learned target function is unimportant ⇒ an ANN is much like a black box

Perceptron (Rosenblatt, 1962)
  y = Σ_{j=1}^{d} w_j x_j + w_0 = w^T x
  w = [w_0, w_1, …, w_d]^T
  x = [1, x_1, …, x_d]^T

Perceptron Activation Rule
• Linear threshold (step unit); or, more succinctly: o(x) = sgn(w · x)

What a Perceptron Does
• 1-dimensional case:
  – Regression: y = wx + w_0
  – Classification: y = 1(wx + w_0 > 0)
(figure: perceptron units for regression and classification, with weights w, w_0 and bias input x_0 = +1)

Perceptron Decision Surface
• The perceptron acts as a hyperplane decision surface in the n-dimensional input space
• It outputs 1 for instances lying on one side of the hyperplane and -1 for instances on the other side
• Data that can be separated by a hyperplane are called linearly separable

Perceptron Decision Surface (cont.)
• A single unit can represent some useful functions
  – What weights represent g(x1, x2) = AND(x1, x2)? Also Majority, OR
• But some functions are not representable
  – e.g., those that are not linearly separable
  – Therefore, we'll want networks of these units…

Perceptron Training Rule
  w_i ← w_i + ∆w_i, where ∆w_i = η (t - o) x_i
where:
• t = c(x) is the target value
• o is the perceptron output
• η is a small constant (e.g., 0.1) called the learning rate (or step size)

Perceptron Training Rule (cont.)
• One can prove the rule converges if the training data are linearly separable and η is sufficiently small
• Perceptron Convergence Theorem (Rosenblatt): if the data are linearly separable, then the perceptron learning algorithm converges in finite time.
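As a concrete illustration of the perceptron training rule above, here is a minimal Python sketch; the function name, the NumPy implementation, and the toy AND data set with ±1 labels are illustrative choices, not part of the course material.

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=20):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend bias input x_0 = +1 (weight w_0)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, t):
            o = 1 if w @ x > 0 else -1             # linear threshold (step) unit: o(x) = sgn(w . x)
            w += eta * (target - o) * x            # update changes w only when o != target
    return w

# Toy linearly separable data: AND with +/-1 labels.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
t = np.array([-1, -1, -1, 1])
print(train_perceptron(X, t))
```

Because AND is linearly separable, the updates become zero once every example is classified correctly, as the convergence theorem predicts.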
Gradient Descent – Delta Rule
• Also known as the LMS (least mean squares) rule or the Widrow-Hoff rule
• To understand it, consider a simpler linear unit, where
  o = w_0 + w_1 x_1 + … + w_n x_n
• Let's learn the w_i's to minimize the squared error
  E[w] ≡ 1/2 Σ_{d∈D} (t_d - o_d)^2
  where D is the set of training examples

Error Surface
(figure: the error surface E[w] over weight space)

Gradient Descent
• Gradient: ∇E[w] = [∂E/∂w_0, ∂E/∂w_1, …, ∂E/∂w_n]
• When interpreted as a vector in weight space, the gradient specifies the direction that produces the steepest increase in E
• Training rule: ∆w = -η ∇E[w], in other words ∆w_i = -η ∂E/∂w_i
• This results in the following update rule: ∆w_i = η Σ_d (t_d - o_d) x_{i,d}

Gradient of Error
  ∂E/∂w_i = ∂/∂w_i [1/2 Σ_d (t_d - o_d)^2]
          = 1/2 Σ_d ∂/∂w_i (t_d - o_d)^2
          = 1/2 Σ_d 2 (t_d - o_d) ∂/∂w_i (t_d - o_d)
          = Σ_d (t_d - o_d) ∂/∂w_i (t_d - w · x_d)
          = Σ_d (t_d - o_d) (-x_{i,d})
Learning rule: ∆w_i = -η ∂E/∂w_i ⇒ ∆w_i = η Σ_d (t_d - o_d) x_{i,d}

Stochastic Gradient Descent
Batch mode gradient descent:
• Do until satisfied:
  1. Compute the gradient ∇E_D[w]
  2. w ← w - η ∇E_D[w]
Incremental mode gradient descent:
• Do until satisfied:
  – For each training example d in D:
    1. Compute the gradient ∇E_d[w]
    2. w ← w - η ∇E_d[w]

More Stochastic Gradient Descent
  E_D[w] ≡ 1/2 Σ_{d∈D} (t_d - o_d)^2
  E_d[w] ≡ 1/2 (t_d - o_d)^2
• Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is set small enough
• Incremental learning rule: ∆w_i = η (t_d - o_d) x_{i,d}
• Delta rule: ∆w_i = η (t - o) x_i, with δ = (t - o)

Gradient Descent Code
GRADIENT-DESCENT(training_examples, η)
Each training example is a pair <x, t>, where x is the vector of input values and t is the target output value; η is the learning rate (e.g., 0.05).
• Initialize each w_i to some small random value
• Until the termination condition is met, do:
  – Initialize each ∆w_i to zero
  – For each <x, t> in training_examples, do:
    • Input the instance x to the unit and compute the output o
    • For each linear unit weight w_i: ∆w_i ← ∆w_i + η (t - o) x_i
  – For each linear unit weight w_i: w_i ← w_i + ∆w_i
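The GRADIENT-DESCENT pseudocode translates almost line for line into the following Python sketch; NumPy, the fixed epoch count used as the termination condition, and the noisy linear toy data are my additions for illustration.

```python
import numpy as np

def gradient_descent(training_examples, eta=0.05, epochs=200):
    """Batch gradient descent (delta / LMS rule) for a linear unit o = w . x.
    training_examples is a list of (x, t) pairs; each x already starts with the bias input x_0 = +1."""
    d = len(training_examples[0][0])
    w = np.random.uniform(-0.05, 0.05, size=d)      # initialize each w_i to a small random value
    for _ in range(epochs):                         # fixed epoch count as the termination condition
        delta_w = np.zeros(d)                       # initialize each Delta w_i to zero
        for x, t in training_examples:
            o = w @ x                               # compute the linear unit output o
            delta_w += eta * (t - o) * x            # Delta w_i <- Delta w_i + eta (t - o) x_i
        w += delta_w                                # w_i <- w_i + Delta w_i
    return w

# Toy data: noisy samples of t = 2x + 1; the learned weights should end up close to [1, 2].
rng = np.random.default_rng(0)
training_examples = [(np.array([1.0, x]), 2 * x + 1 + rng.normal(0, 0.1))
                     for x in rng.uniform(-1, 1, size=10)]
print(gradient_descent(training_examples))
```

Note that the weights change only after the whole batch has been seen; the incremental (stochastic) variant would instead apply w += eta * (t - o) * x inside the inner loop.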
Summary
• The perceptron training rule succeeds if:
  – the training examples are linearly separable, and
  – the learning rate η is sufficiently small
• Linear unit training uses gradient descent (the delta rule):
  – guaranteed to converge to the hypothesis with minimum squared error
  – given a sufficiently small learning rate η
  – even when the training data contain noise
  – even when the training data are not separable by H

K Outputs
Regression:
  y_i = Σ_{j=1}^{d} w_{ij} x_j + w_{i0} = w_i^T x, i.e., y = Wx
  (a linear map from R^d to R^K)
Classification:
  o_i = w_i^T x
  y_i = exp(o_i) / Σ_k exp(o_k)
  choose C_i if y_i = max_k y_k

Training
• Online (instances seen one by one) vs. batch (whole sample) learning:
  – no need to store the whole sample
  – the problem may change in time
  – wear and degradation in system components
• Stochastic gradient descent: update after a single pattern
• Generic update rule (LMS rule):
  ∆w_{ij}^t = η (r_i^t - y_i^t) x_j^t
  Update = LearningFactor · (DesiredOutput - ActualOutput) · Input

Training a Perceptron: Regression
• Regression (linear output):
  E^t(w | x^t, r^t) = 1/2 (r^t - y^t)^2 = 1/2 [r^t - w^T x^t]^2
  ∆w_j^t = η (r^t - y^t) x_j^t

Classification
• Single sigmoid output:
  y^t = sigmoid(w^T x^t)
  E^t(w | x^t, r^t) = -r^t log y^t - (1 - r^t) log(1 - y^t)
  ∆w_j^t = η (r^t - y^t) x_j^t
• K > 2 softmax outputs (a small code sketch of this update appears at the end of the section):
  y_i^t = exp(w_i^T x^t) / Σ_k exp(w_k^T x^t)
  E^t({w_i} | x^t, r^t) = -Σ_i r_i^t log y_i^t
  ∆w_{ij}^t = η (r_i^t - y_i^t) x_j^t

Learning Boolean AND
(figure: a perceptron learning the AND function)

XOR
• No w_0, w_1, w_2 satisfy (Minsky and Papert, 1969):
  w_0 ≤ 0
  w_2 + w_0 > 0
  w_1 + w_0 > 0
  w_1 + w_2 + w_0 ≤ 0

Multilayer Perceptrons
(Rumelhart et al., 1986)
  y_i = v_i^T z = Σ_{h=1}^{H} v_{ih} z_h + v_{i0}
  z_h = sigmoid(w_h^T x) = 1 / [1 + exp(-(Σ_{j=1}^{d} w_{hj} x_j + w_{h0}))]

x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2)

Sigmoid Activation
• σ(x) is the sigmoid (s-like) function 1/(1 + e^{-x})
• Nice property: dσ(x)/dx = σ(x) (1 - σ(x))
• Other variations: σ(x) = 1/(1 + e^{-kx}), where k is a positive constant that determines the steepness of the threshold

Error Gradient for Sigmoid
  ∂E/∂w_i = ∂/∂w_i [1/2 Σ_d (t_d - o_d)^2]
          = 1/2 Σ_d ∂/∂w_i (t_d - o_d)^2
          = 1/2 Σ_d 2 (t_d - o_d) ∂/∂w_i (t_d - o_d)
          = Σ_d (t_d - o_d) (-∂o_d/∂w_i)
          = -Σ_d (t_d - o_d) (∂o_d/∂net_d)(∂net_d/∂w_i)

Even more…
But we know:
  ∂o_d/∂net_d = ∂σ(net_d)/∂net_d = o_d (1 - o_d)
  ∂net_d/∂w_i = ∂(w · x_d)/∂w_i = x_{i,d}
So:
  ∂E/∂w_i = -Σ_d (t_d - o_d) o_d (1 - o_d) x_{i,d}

Backpropagation
  y_i = v_i^T z = Σ_{h=1}^{H} v_{ih} z_h + v_{i0}
  z_h = sigmoid(w_h^T x) = 1 / [1 + exp(-(Σ_{j=1}^{d} w_{hj} x_j + w_{h0}))]
  ∂E/∂w_{hj} = (∂E/∂y_i)(∂y_i/∂z_h)(∂z_h/∂w_{hj})

Backpropagation Algorithm
Initialize all weights to small random numbers. Until satisfied, do:
• For each training example, do:
  1. Input the training example to the network and compute the network outputs
  2.
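To make the K-output classification update ∆w_ij = η (r_i - y_i) x_j concrete, here is a minimal Python sketch of a softmax-output perceptron trained one pattern at a time; the function names, Gaussian toy clusters, and hyperparameters are illustrative assumptions rather than anything specified in the slides.

```python
import numpy as np

def softmax(o):
    """y_i = exp(o_i) / sum_k exp(o_k), computed stably by shifting o by its maximum."""
    e = np.exp(o - o.max())
    return e / e.sum()

def train_softmax(X, R, eta=0.1, epochs=100):
    """Single-layer softmax classifier trained with Delta w_ij = eta (r_i - y_i) x_j,
    one pattern at a time. X is (n, d); R is (n, K) one-hot targets."""
    n, d = X.shape
    K = R.shape[1]
    Xb = np.hstack([np.ones((n, 1)), X])            # bias input x_0 = +1
    W = np.zeros((K, d + 1))                        # one weight vector w_i per class
    for _ in range(epochs):
        for x, r in zip(Xb, R):
            y = softmax(W @ x)                      # o_i = w_i^T x, then softmax
            W += eta * np.outer(r - y, x)           # Delta w_ij = eta (r_i - y_i) x_j
    return W

# Toy 3-class problem: Gaussian clusters around three different means.
rng = np.random.default_rng(1)
X = rng.normal(size=(90, 2)) + np.repeat(np.array([[0.0, 4.0], [4.0, 0.0], [0.0, 0.0]]), 30, axis=0)
R = np.repeat(np.eye(3), 30, axis=0)
W = train_softmax(X, R)
Xb = np.hstack([np.ones((90, 1)), X])
print(np.mean(np.argmax(Xb @ W.T, axis=1) == np.argmax(R, axis=1)))   # training accuracy (should be high)
```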
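Finally, since the section closes with the backpropagation algorithm, below is a minimal sketch of a two-layer MLP with sigmoid hidden units and a sigmoid output trained on XOR; the network size, learning rate, seed, and squared-error form are my choices, and the updates simply apply the chain rule and the sigmoid derivative property from the slides above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# XOR is not linearly separable, so a single perceptron fails; H = 2 sigmoid hidden units suffice.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 0.0])

rng = np.random.default_rng(3)
W = rng.uniform(-0.5, 0.5, size=(2, 3))   # hidden-layer weights w_hj (column 0 holds the bias w_h0)
V = rng.uniform(-0.5, 0.5, size=3)        # output weights v_h (V[0] is the bias v_0)
eta = 0.5

for _ in range(10000):                    # stochastic updates: weights change after every pattern
    for x, t in zip(X, T):
        xb = np.append(1.0, x)            # bias input x_0 = +1
        z = sigmoid(W @ xb)               # z_h = sigmoid(w_h^T x)
        zb = np.append(1.0, z)
        y = sigmoid(V @ zb)               # sigmoid output unit for the 0/1 target
        # Chain rule dE/dw_hj = (dE/dy)(dy/dz_h)(dz_h/dw_hj) with E = 1/2 (t - y)^2,
        # using the sigmoid property d(sigmoid)/d(net) = output * (1 - output):
        delta_out = (t - y) * y * (1 - y)
        delta_h = z * (1 - z) * V[1:] * delta_out
        V += eta * delta_out * zb         # output-layer update
        W += eta * np.outer(delta_h, xb)  # hidden-layer update

# Outputs should approach 0, 1, 1, 0; if training sticks near 0.5, try another seed or H = 3.
print([round(float(sigmoid(V @ np.append(1.0, sigmoid(W @ np.append(1.0, x))))), 2) for x in X])
```

With two hidden units the network can represent XOR, matching the decomposition x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2), although gradient descent can occasionally get stuck in a poor local minimum depending on the initial weights.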