CS536: Machine Learning – Artificial Neural Networks
Fall 2005
Ahmed Elgammal
Dept. of Computer Science, Rutgers University

Neural Networks – Biological Motivation: the Brain
• Networks of processing units (neurons) with connections (synapses) between them
• Large number of neurons: about 10^11
• Large connectivity: each neuron is connected, on average, to 10^4 others
• Switching time: about 10^-3 second
• Parallel processing
• Distributed computation/memory
• Processing is done by the neurons; the memory is in the synapses
• Robust to noise and failures

Neural Networks – Characteristics of Biological Computation
• Massive parallelism
• Locality of computation
• Adaptive (self-organizing)
• Distributed representation

Understanding the Brain
• Levels of analysis (Marr, 1982):
  1. Computational theory
  2. Representation and algorithm
  3. Hardware implementation
• Reverse engineering: from hardware to theory
• Parallel processing: SIMD vs. MIMD
  – A neural net is SIMD with modifiable local memory
  – Learning: update by training/experience

• ALVINN system (Pomerleau, 1993)
• Many successful examples:
  – Speech phoneme recognition [Waibel]
  – Image classification [Kanade, Baluja, Rowley]
  – Financial prediction
  – Backgammon [Tesauro]

When to Use ANNs
• Input is high-dimensional, discrete or real-valued (e.g., raw sensor input); inputs can be highly correlated or independent
• Output is discrete or real-valued
• Output is a vector of values
• Possibly noisy data; data may contain errors
• The form of the target function is unknown
• Long training times are acceptable
• Fast evaluation of the target function is required
• Human readability of the learned target function is unimportant ⇒ an ANN is much like a black box

Perceptron (Rosenblatt, 1962)
  y = Σ_{j=1}^{d} w_j x_j + w_0 = w^T x
  w = [w_0, w_1, …, w_d]^T
  x = [1, x_1, …, x_d]^T

Perceptron Activation Rule
• Linear threshold (step unit); or, more succinctly: o(x) = sgn(w · x)

What a Perceptron Does
• 1-dimensional case:
  – Regression: y = wx + w_0
  – Classification: y = 1(wx + w_0 > 0)
(figure: perceptron units for regression and classification, with weights w, w_0 and bias input x_0 = +1)

Perceptron Decision Surface
• The perceptron acts as a hyperplane decision surface in the n-dimensional input space
• It outputs 1 for instances lying on one side of the hyperplane and -1 for instances on the other side
• Data that can be separated by a hyperplane are called linearly separable

Perceptron Decision Surface (cont.)
• A single unit can represent some useful functions
  – What weights represent g(x1, x2) = AND(x1, x2)? Also Majority, OR
• But some functions are not representable
  – e.g., those that are not linearly separable
  – Therefore, we'll want networks of these units…

Perceptron Training Rule
  w_i ← w_i + ∆w_i, where ∆w_i = η (t - o) x_i
where:
• t = c(x) is the target value
• o is the perceptron output
• η is a small constant (e.g., 0.1) called the learning rate (or step size)

Perceptron Training Rule (cont.)
• One can prove the rule converges if the training data are linearly separable and η is sufficiently small
• Perceptron Convergence Theorem (Rosenblatt): if the data are linearly separable, then the perceptron learning algorithm converges in finite time.
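As a concrete illustration of the perceptron training rule above, here is a minimal Python sketch; the function name, the NumPy implementation, and the toy AND data set with ±1 labels are illustrative choices, not part of the course material.

```python
import numpy as np

def train_perceptron(X, t, eta=0.1, epochs=20):
    """Perceptron training rule: w_i <- w_i + eta * (t - o) * x_i."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])   # prepend bias input x_0 = +1 (weight w_0)
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, t):
            o = 1 if w @ x > 0 else -1             # linear threshold (step) unit: o(x) = sgn(w . x)
            w += eta * (target - o) * x            # update changes w only when o != target
    return w

# Toy linearly separable data: AND with +/-1 labels.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
t = np.array([-1, -1, -1, 1])
print(train_perceptron(X, t))
```

Because AND is linearly separable, the updates become zero once every example is classified correctly, as the convergence theorem predicts.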
Gradient Descent – Delta Rule
• Also known as the LMS (least mean squares) rule or the Widrow-Hoff rule
• To understand it, consider a simpler linear unit, where
  o = w_0 + w_1 x_1 + … + w_n x_n
• Let's learn the w_i's to minimize the squared error
  E[w] ≡ 1/2 Σ_{d∈D} (t_d - o_d)^2
  where D is the set of training examples

Error Surface
(figure: the error surface E[w] over weight space)

Gradient Descent
• Gradient: ∇E[w] = [∂E/∂w_0, ∂E/∂w_1, …, ∂E/∂w_n]
• When interpreted as a vector in weight space, the gradient specifies the direction that produces the steepest increase in E
• Training rule: ∆w = -η ∇E[w], in other words ∆w_i = -η ∂E/∂w_i
• This results in the following update rule: ∆w_i = η Σ_d (t_d - o_d) x_{i,d}

Gradient of Error
  ∂E/∂w_i = ∂/∂w_i [1/2 Σ_d (t_d - o_d)^2]
          = 1/2 Σ_d ∂/∂w_i (t_d - o_d)^2
          = 1/2 Σ_d 2 (t_d - o_d) ∂/∂w_i (t_d - o_d)
          = Σ_d (t_d - o_d) ∂/∂w_i (t_d - w · x_d)
          = Σ_d (t_d - o_d) (-x_{i,d})
Learning rule: ∆w_i = -η ∂E/∂w_i ⇒ ∆w_i = η Σ_d (t_d - o_d) x_{i,d}

Stochastic Gradient Descent
Batch mode gradient descent:
• Do until satisfied:
  1. Compute the gradient ∇E_D[w]
  2. w ← w - η ∇E_D[w]
Incremental mode gradient descent:
• Do until satisfied:
  – For each training example d in D:
    1. Compute the gradient ∇E_d[w]
    2. w ← w - η ∇E_d[w]

More Stochastic Gradient Descent
  E_D[w] ≡ 1/2 Σ_{d∈D} (t_d - o_d)^2
  E_d[w] ≡ 1/2 (t_d - o_d)^2
• Incremental gradient descent can approximate batch gradient descent arbitrarily closely if η is set small enough
• Incremental learning rule: ∆w_i = η (t_d - o_d) x_{i,d}
• Delta rule: ∆w_i = η (t - o) x_i, with δ = (t - o)

Gradient Descent Code
GRADIENT-DESCENT(training_examples, η)
Each training example is a pair <x, t>, where x is the vector of input values and t is the target output value; η is the learning rate (e.g., 0.05).
• Initialize each w_i to some small random value
• Until the termination condition is met, do:
  – Initialize each ∆w_i to zero
  – For each <x, t> in training_examples, do:
    • Input the instance x to the unit and compute the output o
    • For each linear unit weight w_i: ∆w_i ← ∆w_i + η (t - o) x_i
  – For each linear unit weight w_i: w_i ← w_i + ∆w_i
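The GRADIENT-DESCENT pseudocode translates almost line for line into the following Python sketch; NumPy, the fixed epoch count used as the termination condition, and the noisy linear toy data are my additions for illustration.

```python
import numpy as np

def gradient_descent(training_examples, eta=0.05, epochs=200):
    """Batch gradient descent (delta / LMS rule) for a linear unit o = w . x.
    training_examples is a list of (x, t) pairs; each x already starts with the bias input x_0 = +1."""
    d = len(training_examples[0][0])
    w = np.random.uniform(-0.05, 0.05, size=d)      # initialize each w_i to a small random value
    for _ in range(epochs):                         # fixed epoch count as the termination condition
        delta_w = np.zeros(d)                       # initialize each Delta w_i to zero
        for x, t in training_examples:
            o = w @ x                               # compute the linear unit output o
            delta_w += eta * (t - o) * x            # Delta w_i <- Delta w_i + eta (t - o) x_i
        w += delta_w                                # w_i <- w_i + Delta w_i
    return w

# Toy data: noisy samples of t = 2x + 1; the learned weights should end up close to [1, 2].
rng = np.random.default_rng(0)
training_examples = [(np.array([1.0, x]), 2 * x + 1 + rng.normal(0, 0.1))
                     for x in rng.uniform(-1, 1, size=10)]
print(gradient_descent(training_examples))
```

Note that the weights change only after the whole batch has been seen; the incremental (stochastic) variant would instead apply w += eta * (t - o) * x inside the inner loop.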
Summary
• The perceptron training rule succeeds if:
  – the training examples are linearly separable, and
  – the learning rate η is sufficiently small
• Linear unit training uses gradient descent (the delta rule):
  – guaranteed to converge to the hypothesis with minimum squared error
  – given a sufficiently small learning rate η
  – even when the training data contain noise
  – even when the training data are not separable by H

K Outputs
Regression:
  y_i = Σ_{j=1}^{d} w_{ij} x_j + w_{i0} = w_i^T x, i.e., y = Wx
  (a linear map from R^d to R^K)
Classification:
  o_i = w_i^T x
  y_i = exp(o_i) / Σ_k exp(o_k)
  choose C_i if y_i = max_k y_k

Training
• Online (instances seen one by one) vs. batch (whole sample) learning:
  – no need to store the whole sample
  – the problem may change in time
  – wear and degradation in system components
• Stochastic gradient descent: update after a single pattern
• Generic update rule (LMS rule):
  ∆w_{ij}^t = η (r_i^t - y_i^t) x_j^t
  Update = LearningFactor · (DesiredOutput - ActualOutput) · Input

Training a Perceptron: Regression
• Regression (linear output):
  E^t(w | x^t, r^t) = 1/2 (r^t - y^t)^2 = 1/2 [r^t - w^T x^t]^2
  ∆w_j^t = η (r^t - y^t) x_j^t

Classification
• Single sigmoid output:
  y^t = sigmoid(w^T x^t)
  E^t(w | x^t, r^t) = -r^t log y^t - (1 - r^t) log(1 - y^t)
  ∆w_j^t = η (r^t - y^t) x_j^t
• K > 2 softmax outputs (a small code sketch of this update appears at the end of the section):
  y_i^t = exp(w_i^T x^t) / Σ_k exp(w_k^T x^t)
  E^t({w_i} | x^t, r^t) = -Σ_i r_i^t log y_i^t
  ∆w_{ij}^t = η (r_i^t - y_i^t) x_j^t

Learning Boolean AND
(figure: a perceptron learning the AND function)

XOR
• No w_0, w_1, w_2 satisfy (Minsky and Papert, 1969):
  w_0 ≤ 0
  w_2 + w_0 > 0
  w_1 + w_0 > 0
  w_1 + w_2 + w_0 ≤ 0

Multilayer Perceptrons
(Rumelhart et al., 1986)
  y_i = v_i^T z = Σ_{h=1}^{H} v_{ih} z_h + v_{i0}
  z_h = sigmoid(w_h^T x) = 1 / [1 + exp(-(Σ_{j=1}^{d} w_{hj} x_j + w_{h0}))]

x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2)

Sigmoid Activation
• σ(x) is the sigmoid (s-like) function 1/(1 + e^{-x})
• Nice property: dσ(x)/dx = σ(x) (1 - σ(x))
• Other variations: σ(x) = 1/(1 + e^{-kx}), where k is a positive constant that determines the steepness of the threshold

Error Gradient for Sigmoid
  ∂E/∂w_i = ∂/∂w_i [1/2 Σ_d (t_d - o_d)^2]
          = 1/2 Σ_d ∂/∂w_i (t_d - o_d)^2
          = 1/2 Σ_d 2 (t_d - o_d) ∂/∂w_i (t_d - o_d)
          = Σ_d (t_d - o_d) (-∂o_d/∂w_i)
          = -Σ_d (t_d - o_d) (∂o_d/∂net_d)(∂net_d/∂w_i)

Even more…
But we know:
  ∂o_d/∂net_d = ∂σ(net_d)/∂net_d = o_d (1 - o_d)
  ∂net_d/∂w_i = ∂(w · x_d)/∂w_i = x_{i,d}
So:
  ∂E/∂w_i = -Σ_d (t_d - o_d) o_d (1 - o_d) x_{i,d}

Backpropagation
  y_i = v_i^T z = Σ_{h=1}^{H} v_{ih} z_h + v_{i0}
  z_h = sigmoid(w_h^T x) = 1 / [1 + exp(-(Σ_{j=1}^{d} w_{hj} x_j + w_{h0}))]
  ∂E/∂w_{hj} = (∂E/∂y_i)(∂y_i/∂z_h)(∂z_h/∂w_{hj})

Backpropagation Algorithm
Initialize all weights to small random numbers. Until satisfied, do:
• For each training example, do:
  1. Input the training example to the network and compute the network outputs
  2.
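To make the K-output classification update ∆w_ij = η (r_i - y_i) x_j concrete, here is a minimal Python sketch of a softmax-output perceptron trained one pattern at a time; the function names, Gaussian toy clusters, and hyperparameters are illustrative assumptions rather than anything specified in the slides.

```python
import numpy as np

def softmax(o):
    """y_i = exp(o_i) / sum_k exp(o_k), computed stably by shifting o by its maximum."""
    e = np.exp(o - o.max())
    return e / e.sum()

def train_softmax(X, R, eta=0.1, epochs=100):
    """Single-layer softmax classifier trained with Delta w_ij = eta (r_i - y_i) x_j,
    one pattern at a time. X is (n, d); R is (n, K) one-hot targets."""
    n, d = X.shape
    K = R.shape[1]
    Xb = np.hstack([np.ones((n, 1)), X])            # bias input x_0 = +1
    W = np.zeros((K, d + 1))                        # one weight vector w_i per class
    for _ in range(epochs):
        for x, r in zip(Xb, R):
            y = softmax(W @ x)                      # o_i = w_i^T x, then softmax
            W += eta * np.outer(r - y, x)           # Delta w_ij = eta (r_i - y_i) x_j
    return W

# Toy 3-class problem: Gaussian clusters around three different means.
rng = np.random.default_rng(1)
X = rng.normal(size=(90, 2)) + np.repeat(np.array([[0.0, 4.0], [4.0, 0.0], [0.0, 0.0]]), 30, axis=0)
R = np.repeat(np.eye(3), 30, axis=0)
W = train_softmax(X, R)
Xb = np.hstack([np.ones((90, 1)), X])
print(np.mean(np.argmax(Xb @ W.T, axis=1) == np.argmax(R, axis=1)))   # training accuracy (should be high)
```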
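Finally, since the section closes with the backpropagation algorithm, below is a minimal sketch of a two-layer MLP with sigmoid hidden units and a sigmoid output trained on XOR; the network size, learning rate, seed, and squared-error form are my choices, and the updates simply apply the chain rule and the sigmoid derivative property from the slides above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# XOR is not linearly separable, so a single perceptron fails; H = 2 sigmoid hidden units suffice.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([0.0, 1.0, 1.0, 0.0])

rng = np.random.default_rng(3)
W = rng.uniform(-0.5, 0.5, size=(2, 3))   # hidden-layer weights w_hj (column 0 holds the bias w_h0)
V = rng.uniform(-0.5, 0.5, size=3)        # output weights v_h (V[0] is the bias v_0)
eta = 0.5

for _ in range(10000):                    # stochastic updates: weights change after every pattern
    for x, t in zip(X, T):
        xb = np.append(1.0, x)            # bias input x_0 = +1
        z = sigmoid(W @ xb)               # z_h = sigmoid(w_h^T x)
        zb = np.append(1.0, z)
        y = sigmoid(V @ zb)               # sigmoid output unit for the 0/1 target
        # Chain rule dE/dw_hj = (dE/dy)(dy/dz_h)(dz_h/dw_hj) with E = 1/2 (t - y)^2,
        # using the sigmoid property d(sigmoid)/d(net) = output * (1 - output):
        delta_out = (t - y) * y * (1 - y)
        delta_h = z * (1 - z) * V[1:] * delta_out
        V += eta * delta_out * zb         # output-layer update
        W += eta * np.outer(delta_h, xb)  # hidden-layer update

# Outputs should approach 0, 1, 1, 0; if training sticks near 0.5, try another seed or H = 3.
print([round(float(sigmoid(V @ np.append(1.0, sigmoid(W @ np.append(1.0, x))))), 2) for x in X])
```

With two hidden units the network can represent XOR, matching the decomposition x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2), although gradient descent can occasionally get stuck in a poor local minimum depending on the initial weights.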