Neural Networks
Contents
  Biological Neural Networks .................... 2
  Artificial Neural Networks .................... 3
    ANN Structure ............................... 4
    ANN Illustration ............................ 5
  Perceptrons ................................... 6
    Perceptron Learning Rule .................... 6
    Perceptrons Continued ....................... 7
    Example of Perceptron Learning (α = 1) ...... 8
  Gradient Descent .............................. 9
    Gradient Descent (Linear) ................... 9
    The Delta Rule .............................. 10
  Multilayer Networks ........................... 11
    Illustration ................................ 11
    Sigmoid Activation .......................... 12
    Plot of Sigmoid Function .................... 13
    Applying The Chain Rule ..................... 14
    Backpropagation ............................. 15
    Hypertangent Activation Function ............ 16
    Gaussian Activation Function ................ 17
    Issues in Neural Networks ................... 18

Biological Neural Networks

Neural networks are inspired by our brains. The human brain has about 10^11 neurons and 10^14 synapses. A neuron consists of a soma (the cell body), axons (which send signals), and dendrites (which receive signals). A synapse connects an axon to a dendrite. Given a signal, a synapse might increase (excite) or decrease (inhibit) the electrical potential. A neuron fires when its electrical potential reaches a threshold. Learning might occur through changes to synapses.

CS 5233 Artificial Intelligence                    Artificial Neural Networks – 2

Artificial Neural Networks

An (artificial) neural network consists of units, connections, and weights. Inputs and outputs are numeric.

    Biological NN      Artificial NN
    soma               unit
    axon, dendrite     connection
    synapse            weight
    potential          weighted sum
    threshold          bias weight
    signal             activation
ANN Structure

A typical unit i receives inputs a_j from other units and performs a weighted sum:

    in_i = W_{0,i} + Σ_j W_{j,i} * a_j

and outputs activation a_i = g(in_i). Typically, input units store the inputs, hidden units transform the inputs into an internal numeric vector, and an output unit transforms the hidden values into the prediction.

An ANN is a function f(x, W) = a, where x is an example, W is the weights, and a is the prediction (the activation value from the output unit). Learning is finding a W that minimizes error.

ANN Illustration

[Figure: a network with input units x1–x4, hidden units H5 and H6, and output unit O7 (activation a7). Each input unit connects to both hidden units (weights W15–W46), each hidden unit connects to the output unit (W57, W67), and bias weights W05, W06, W07 feed the hidden and output units.]

Perceptrons

Perceptron Learning Rule

A perceptron is a single unit with activation:

    a = sign(W_0 + Σ_j W_j * x_j)

sign returns −1 or 1. W_0 is the bias weight. One version of the perceptron learning rule is:

    in ← W_0 + Σ_j W_j * x_j
    E ← max(0, 1 − y * in)
    if E > 0 then
        W_0 ← W_0 + α * y
        W_j ← W_j + α * x_j * y   for each input x_j

x is the inputs, y ∈ {1, −1} is the target, and α > 0 is the learning rate.

Perceptrons Continued

This learning rule tends to minimize E. The perceptron convergence theorem states that if some W classifies all the training examples correctly, then the perceptron learning rule will converge to zero error on the training examples. Usually, many epochs (passes over the training examples) are needed until convergence. If zero error is not possible, use α ≈ 0.1/n, where n is the number of normalized or standardized inputs.
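A minimal sketch of the perceptron learning rule described above (the function and variable names are my own; W holds the bias weight in position 0):

```python
def perceptron_update(W, x, y, alpha=1.0):
    """One step of the perceptron learning rule.

    W = [W0, W1, ..., Wn] with W0 the bias weight,
    x = inputs, y in {1, -1} the target, alpha > 0 the learning rate.
    """
    in_ = W[0] + sum(Wj * xj for Wj, xj in zip(W[1:], x))
    E = max(0.0, 1.0 - y * in_)            # error as defined on the slide
    if E > 0:
        W[0] += alpha * y                  # bias-weight update
        for j, xj in enumerate(x):
            W[j + 1] += alpha * xj * y     # weight update for each input
    return W
```

For example, starting from all-zero weights, the example x = (0, 0, 0, 1) with y = −1 has in = 0 and E = 1, so the update sets W0 = −1 and W4 = −1.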
Example of Perceptron Learning (α = 1)

Using α = 1 (the first row shows the initial weights; each later row shows one example, its in and E, and the weights after any update):

    x1 x2 x3 x4    y   in   E   W0 W1 W2 W3 W4
     –  –  –  –    –    –   –    0  0  0  0  0
     0  0  0  1   -1    0   1   -1  0  0  0 -1
     1  1  1  0    1   -1   2    0  1  1  1 -1
     1  1  1  1    1    2   0    0  1  1  1 -1
     0  0  1  1   -1    0   1   -1  1  1  0 -2
     0  0  0  0    1   -1   2    0  1  1  0 -2
     0  1  0  1   -1   -1   0    0  1  1  0 -2
     1  0  0  0    1    1   0    0  1  1  0 -2
     1  0  1  1    1   -1   2    1  2  1  1 -1
     0  1  0  0   -1    2   3    0  2  0  1 -1

The Delta Rule

Given error E(W), obtain the gradient:

    ∇E(W) = (∂E/∂W_0, ∂E/∂W_1, ..., ∂E/∂W_n)

To decrease error, use the update rule:

    W_j ← W_j − α * ∂E/∂W_j

The LMS update rule can be derived using:

    E(W) = (y − (W_0 + Σ_j W_j * x_j))² / 2

For the perceptron learning rule, use:

    E(W) = max(0, 1 − y * (W_0 + Σ_j W_j * x_j))
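The trace above can be replayed in code; a sketch assuming the perceptron learning rule from the previous slide and α = 1 (all names are my own):

```python
def step(W, x, y, alpha=1.0):
    """One perceptron learning-rule update; W[0] is the bias weight."""
    in_ = W[0] + sum(Wj * xj for Wj, xj in zip(W[1:], x))
    E = max(0.0, 1.0 - y * in_)
    if E > 0:
        W[0] += alpha * y
        for j, xj in enumerate(x):
            W[j + 1] += alpha * xj * y
    return W

# The (x1, x2, x3, x4, y) rows from the table, in order:
examples = [
    ((0, 0, 0, 1), -1),
    ((1, 1, 1, 0),  1),
    ((1, 1, 1, 1),  1),
    ((0, 0, 1, 1), -1),
    ((0, 0, 0, 0),  1),
    ((0, 1, 0, 1), -1),
    ((1, 0, 0, 0),  1),
    ((1, 0, 1, 1),  1),
    ((0, 1, 0, 0), -1),
]

W = [0.0] * 5              # W0..W4 start at zero
for x, y in examples:
    W = step(W, x, y)
# W now matches the last row of the table: [0.0, 2.0, 0.0, 1.0, -1.0]
```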
Gradient Descent
Gradient Descent (Linear)

Suppose activation is a linear function:

    a = W_0 + Σ_j W_j * x_j

The Widrow-Hoff (LMS) update rule is:

    diff ← y − a
    W_0 ← W_0 + α * diff
    W_j ← W_j + α * x_j * diff   for each input j

where α is the learning rate. This update rule tends to minimize the squared error:

    E(W) = Σ_{(x,y)} (y − a)²

Multilayer Networks

Illustration

[Figure: the same four-input network with hidden units H5 and H6 and output unit O7 (activation a7), now with example numeric weights (1, 2, 1, −1, 2, 3, 1, 0, −3, −2, −4, 4) shown on the bias and connection weights.]
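The Widrow-Hoff update can be sketched as follows (the function name and example data are my own):

```python
def lms_step(W, x, y, alpha=0.1):
    """One Widrow-Hoff (LMS) update for a linear unit a = W0 + sum_j Wj * xj."""
    a = W[0] + sum(Wj * xj for Wj, xj in zip(W[1:], x))
    diff = y - a                        # prediction error
    W[0] += alpha * diff                # bias-weight update
    for j, xj in enumerate(x):
        W[j + 1] += alpha * xj * diff   # weight update for each input
    return W

# Repeated passes drive the squared error toward zero;
# here the target function is y = 1 + 2x.
W = [0.0, 0.0]
for _ in range(200):
    for x, y in [([0.0], 1.0), ([1.0], 3.0)]:
        W = lms_step(W, x, y)
```

After the passes above, W is close to [1, 2], the exact weights for y = 1 + 2x.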
Sigmoid Activation

The sigmoid function is defined as:

    sigmoid(x) = 1 / (1 + e^(−x))

It is commonly used for ANN activation functions:

    a_i = sigmoid(in_i) = sigmoid(W_{0,i} + Σ_j W_{j,i} * a_j)

Note that

    ∂sigmoid(x)/∂x = sigmoid(x) * (1 − sigmoid(x))

Plot of Sigmoid Function

[Figure: plot of sigmoid(x) for x in [−4, 4]; the curve rises from near 0 through 0.5 at x = 0 toward 1.]

Applying The Chain Rule

Using E = (y_i − a_i)² for output unit i:

    ∂E/∂W_{j,i} = (∂E/∂a_i)(∂a_i/∂in_i)(∂in_i/∂W_{j,i})
                = −2(y_i − a_i) * a_i(1 − a_i) * a_j

For weights from input to hidden units:

    ∂E/∂W_{k,j} = (∂E/∂a_i)(∂a_i/∂in_i)(∂in_i/∂a_j)(∂a_j/∂in_j)(∂in_j/∂W_{k,j})
                = −2(y_i − a_i) * a_i(1 − a_i) * W_{j,i} * a_j(1 − a_j) * x_k

Backpropagation

Backpropagation is an application of the delta rule. Update each weight W using the rule:

    W ← W − α * ∂E/∂W

where α is the learning rate.
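The chain-rule gradients above can be verified numerically; a minimal sketch on a tiny two-input, one-hidden-unit, one-output network (all names and values are my own; bias weights are omitted to keep the sketch short):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Tiny net: inputs x -> hidden unit j -> output unit i.
x = [0.5, -1.0]
W_kj = [0.8, -0.3]   # input-to-hidden weights
W_ji = 1.5           # hidden-to-output weight
y = 1.0

# Forward pass:
in_j = sum(w * xk for w, xk in zip(W_kj, x))
a_j = sigmoid(in_j)
in_i = W_ji * a_j
a_i = sigmoid(in_i)

# Chain-rule gradients, with E = (y - a_i)^2 as on the slide:
dE_dWji = -2 * (y - a_i) * a_i * (1 - a_i) * a_j
dE_dWkj0 = -2 * (y - a_i) * a_i * (1 - a_i) * W_ji * a_j * (1 - a_j) * x[0]

# Finite-difference check of the output-weight gradient:
def E_of(w_ji):
    return (y - sigmoid(w_ji * a_j)) ** 2

eps = 1e-6
numeric = (E_of(W_ji + eps) - E_of(W_ji - eps)) / (2 * eps)
# numeric agrees with dE_dWji to high precision
```

The same central-difference check can be applied to dE_dWkj0 by perturbing W_kj[0] and recomputing the full forward pass.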
Hypertangent Activation Function

[Figure: plot of tanh(x) for x in [−2, 2]; the curve rises from near −1 through 0 at x = 0 toward 1.]

Issues in Neural Networks

Sufficiently complex ANNs can approximate any "reasonable" function. ANNs have, approximately, a preference bias for interpolation.

How many hidden units and layers? What are good initial values?

One approach to avoid overfitting is:
- Remove "validation" examples from the training examples.
- Train the neural network using the training examples.
- Choose the weights that are best on the validation set.

Faster algorithms might rely on:
- Momentum, a running average of deltas;
- Conjugate Gradient, a second-derivative method; or
- RPROP, which updates based only on the sign of the gradient.
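The validation-set recipe above can be sketched with the linear LMS unit from the gradient-descent slide (the function names, split fraction, and synthetic setup are all my own):

```python
import random

def train_with_early_stopping(examples, n_inputs, epochs=200, alpha=0.1, val_frac=0.25):
    """Hold out validation examples, train a linear unit with LMS,
    and keep the weights that score best on the validation set."""
    random.seed(0)                       # deterministic split for the sketch
    data = list(examples)
    random.shuffle(data)
    n_val = max(1, int(len(data) * val_frac))
    val, train = data[:n_val], data[n_val:]

    def predict(W, x):
        return W[0] + sum(Wj * xj for Wj, xj in zip(W[1:], x))

    def sq_error(W, rows):
        return sum((y - predict(W, x)) ** 2 for x, y in rows)

    W = [0.0] * (n_inputs + 1)
    best_W, best_err = list(W), sq_error(W, val)
    for _ in range(epochs):
        for x, y in train:               # one epoch of LMS updates
            diff = y - predict(W, x)
            W[0] += alpha * diff
            for j, xj in enumerate(x):
                W[j + 1] += alpha * xj * diff
        err = sq_error(W, val)
        if err < best_err:               # remember the best-on-validation weights
            best_W, best_err = list(W), err
    return best_W
```

On noise-free data the best-on-validation weights coincide with the converged weights; the held-out set matters when training error keeps falling while validation error starts to rise.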
Gaussian Activation Function

[Figure: plot of gaussian(x, 1) for x in [−3, 3]; a bell curve peaking at 1 at x = 0.]