Multilayer Perceptrons
Informatics 1 CG: Lecture 6

Mirella Lapata
School of Informatics, University of Edinburgh
[email protected]

Reading: Kevin Gurney's Introduction to Neural Networks, Chapters 5–6.5

January 22, 2016


Recap: Perceptrons

[Figure: a perceptron with inputs x_1 … x_n, weights w_1 … w_n, a bias weight w_0, and a step-threshold output y]

- Connectionism is a computer modeling approach inspired by neural networks.
- Anatomy of a connectionist model: units, connections.
- The perceptron as a linear classifier: it computes the weighted sum a = Σ_{i=0}^{n} w_i x_i (with x_0 = 1, so that w_0 acts as a bias) and outputs 1 if a > 0, and 0 otherwise.
- A learning rule for perceptrons.
- Key limitation: only works for linearly separable data.
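The recap can be made concrete with a few lines of code. This is a toy sketch of my own, not from the lecture; the weight values realizing logical AND are hand-picked:

```python
# A minimal perceptron as a linear classifier (sketch with hand-picked weights).
# w0 acts as the bias via a fixed input x0 = 1, matching the sum
# a = sum_{i=0}^{n} w_i x_i with output 1 if a > 0, else 0.

def perceptron(x, w):
    """Step-threshold perceptron: x and w include the bias term at index 0."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if a > 0 else 0

# Hand-picked weights realizing logical AND (a linearly separable function):
w_and = [-1.5, 1.0, 1.0]  # bias w0 = -1.5
print([perceptron([1, x1, x2], w_and) for (x1, x2) in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# [0, 0, 0, 1]
```

XOR, by contrast, is not linearly separable, so no choice of `w` makes this single unit compute it; that is exactly the limitation MLPs address.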

Multilayer Perceptrons (MLPs)

[Figure: an MLP with inputs x_1 … x_n, weighted connections, a hidden layer, and an output y]

MLPs are feed-forward neural networks, organized in layers:
- one input layer, one or more hidden layers, and one output layer;
- each node in a layer is connected to all the nodes in the next layer;
- each connection has a weight (which can be zero).

Activation Functions

[Figure: the step function and the sigmoid, each plotted as output y against input h]

The step function outputs 0 or 1. The sigmoid outputs a real value between 0 and 1.
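The two activation functions can be sketched directly. This toy code is my own illustration (the sigmoid here uses a gain of 1):

```python
import math

# Step vs. sigmoid activation (sketch). The step function outputs 0 or 1;
# the sigmoid squashes any real input into a value strictly between 0 and 1.

def step(h):
    return 1 if h > 0 else 0

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

print(step(-2.0), step(2.0))   # 0 1
print(round(sigmoid(0.0), 3))  # 0.5
```

The crucial difference for what follows: the sigmoid is smooth, so it has a derivative everywhere, while the step function does not.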

Sigmoids

[Figure: sigmoid activation curves]

Learning with MLPs

[Figure: an MLP with an input layer, a hidden layer, and an output layer]

As with perceptrons, finding the right weights is very hard! Solution technique: learning! Learning: adjusting the weights based on training examples.

Learning and Error Minimization

Recall the perceptron learning rule:

    w_i ← w_i + η(t − o)x_i

General Idea
Minimize the difference between the actual and desired outputs:

1. Send the MLP an input pattern, x, from the training set.
2. Get the output from the MLP, y.
3. Compare y with the "right answer", or target t, to get the error quantity.
4. Use the error quantity to modify the weights, so next time y will be closer to t.
5. Repeat with another x from the training set.

When updating the weights after seeing x, the network doesn't just change the way it deals with x, but the way it deals with other inputs too, including inputs it has not seen yet. Generalization is the ability to deal accurately with unseen inputs.

Error Function (MSE)
An error function represents the difference between actual and desired outputs over a set of inputs:

    E(w⃗) = (1 / 2N) Σ_{p=1}^{N} (t^p − o^p)²

where N is the number of patterns, t^p is the target output for pattern p, and o^p is the output obtained for pattern p. The 2 makes little difference, but makes life easier later on!
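The MSE formula translates directly into code. A small sketch of my own (the target and output values are made up):

```python
# Mean squared error over N patterns:
# E = (1 / 2N) * sum_p (t_p - o_p)**2

def mse(targets, outputs):
    n = len(targets)
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / (2 * n)

print(mse([1, 0, 1], [1, 0, 1]))  # 0.0 for a perfect fit
print(mse([1, 0], [0, 1]))        # 0.5
```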


Gradient Descent and Derivatives: The Idea

One technique that can be used for minimizing functions is gradient descent.

The derivative is a measure of the rate of change of a function as its input changes. For a function y = f(x), the derivative dy/dx indicates how much y changes in response to changes in x. If x and y are real numbers, and the graph of y is plotted against x, the derivative measures the slope or gradient of the line at each point, i.e., it describes the steepness or incline.

Can we use this on our error function E? We would like a learning rule that tells us how to update the weights, like this:

    w′_ij = w_ij + Δw_ij

But what should Δw_ij be?

Gradient and Derivatives: The Idea

- dy/dx > 0 implies that y increases as x increases. If we want to find the minimum of y, we should reduce x.
- dy/dx < 0 implies that y decreases as x increases. If we want to find the minimum of y, we should increase x.
- dy/dx = 0 implies that we are at a minimum, a maximum, or a plateau.

To get closer to the minimum:

    x_new = x_old − η (dy/dx)

So, we know how to use derivatives to adjust one input value. But we have several weights to adjust! We need to use partial derivatives. A partial derivative of a function of several variables is its derivative with respect to one of those variables, with the others held constant.

Example: if y = f(x_1, x_2), then we can have ∂y/∂x_1 and ∂y/∂x_2.

In our learning rule case, if we can work out the partial derivatives, we can use this rule to update the weights:

    w′_ij = w_ij + Δw_ij    where Δw_ij = −η ∂E/∂w_ij
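The one-variable update x_new = x_old − η (dy/dx) can be sketched as follows. This is a toy example of my own (minimizing y = (x − 3)², whose derivative I wrote out by hand), not from the slides:

```python
# One-variable gradient descent: x_new = x_old - eta * dy/dx.
# We minimize y = (x - 3)^2, whose derivative is dy/dx = 2(x - 3).

def grad(x):
    return 2 * (x - 3)

x, eta = 0.0, 0.1
for _ in range(100):
    x = x - eta * grad(x)  # step downhill, opposite the slope's sign
print(round(x, 4))  # 3.0, the minimum
```

When the slope is negative (x left of 3), the update increases x; when positive, it decreases x — exactly the sign rules listed above.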

Summary So Far

We learnt what a multilayer perceptron is. We know a learning rule for updating the weights in order to minimise the error:

    w′_ij = w_ij + Δw_ij    where Δw_ij = −η ∂E/∂w_ij

Δw_ij tells us in which direction and how much we should change each weight to roll down the slope (descend the gradient) of the error function E.

Using Gradient Descent to Minimize the Error

[Figure: neuron j connected to neuron i by weight w_ij; each neuron applies the input function Σ and the activation function f]

The mean squared error function E, which we want to minimize:

    E(w⃗) = (1 / 2N) Σ_{p=1}^{N} (t^p − o^p)²

So, how do we calculate ∂E/∂w_ij?

Using Gradient Descent to Minimize the Error

[Figure: neuron j connected to neuron i by weight w_ij; each neuron applies the input function Σ and the activation function f]

If we use a sigmoid f, then the output of neuron i for pattern p is:

    o_i^p = f(u_i) = 1 / (1 + e^{−a u_i})

where a is a pre-defined constant and u_i is the result of the input function in neuron i:

    u_i = Σ_j w_ij x_ij

For the pth pattern and the ith neuron, gradient descent on the error function gives:

    Δw_ij = −η ∂E^p/∂w_ij = η (t_i^p − o_i^p) f′(u_i) x_ij

where f′(u_i) = df/du_i is the derivative of f with respect to u_i. If f is the sigmoid function, f′(u_i) = a f(u_i)(1 − f(u_i)).
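The identity f′(u) = a f(u)(1 − f(u)) can be checked numerically. A sketch of my own, with arbitrary choices of a and u:

```python
import math

# Sigmoid with gain a, and its derivative via the identity
# f'(u) = a * f(u) * (1 - f(u)), checked against a central difference.

A = 1.0  # the pre-defined constant a (arbitrary choice)

def f(u):
    return 1.0 / (1.0 + math.exp(-A * u))

def f_prime(u):
    return A * f(u) * (1.0 - f(u))

u, eps = 0.7, 1e-6
numeric = (f(u + eps) - f(u - eps)) / (2 * eps)
print(abs(numeric - f_prime(u)) < 1e-6)  # True: the identity holds
```

This identity is why the sigmoid is so convenient: once the output f(u) has been computed in the forward pass, the derivative comes almost for free.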

Using Gradient Descent to Minimize the Error

[Figure: neuron j connected to neuron i by weight w_ij; each neuron applies the input function Σ and the activation function f]

We can update the weights after processing each pattern, using the rule:

    Δw_ij = η (t_i^p − o_i^p) f′(u_i^p) x_ij = η δ_i^p x_ij

This is known as the generalized delta rule. We need to use the derivative of the activation function f, so f must be differentiable! The threshold activation function is not continuous, and thus not differentiable. The sigmoid has a derivative which is easy to calculate.

Updating Output vs Hidden Neurons

We can update output neurons using the generalized delta rule:

    Δw_ij = η δ_i^p x_ij    where δ_i^p = (t_i^p − o_i^p) f′(u_i^p)

This δ_i^p is only good for the output neurons, since it relies on the target outputs. But we don't have target outputs for the hidden nodes! What can we use instead?

    δ_i = (Σ_k w_ki δ_k) f′(u_i)

This rule propagates error back from the output nodes to the hidden nodes. In effect, it blames hidden nodes according to how much influence they had. So, now we have rules for updating both output and hidden neurons!
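The two delta rules can be sketched for single neurons. The scalar values below are arbitrary choices of mine, assuming sigmoid activations with a = 1 (so f′(u_i) can be written as o_i(1 − o_i)):

```python
# Delta computations for output and hidden neurons (toy scalar sketch).

def f_prime(o):
    # For the sigmoid with a = 1: f'(u) = o * (1 - o), where o = f(u).
    return o * (1.0 - o)

def delta_output(t, o):
    """delta_i = (t_i - o_i) * f'(u_i) for an output neuron."""
    return (t - o) * f_prime(o)

def delta_hidden(o, w_out, deltas_out):
    """delta_i = (sum_k w_ki * delta_k) * f'(u_i) for a hidden neuron."""
    return sum(w * d for w, d in zip(w_out, deltas_out)) * f_prime(o)

d_out = delta_output(t=1.0, o=0.8)  # output too low, so delta is positive
d_hid = delta_hidden(o=0.6, w_out=[0.5], deltas_out=[d_out])
print(round(d_out, 4), round(d_hid, 4))  # 0.032 0.0038
```

Note how the hidden delta inherits a share of the output delta in proportion to the connecting weight w_ki: that is the "blame assignment" described above.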

Backpropagation

[Figure: an animation stepping through an MLP of Σ units, showing activations flowing forward and error flowing backward]

1. Present the pattern at the input layer.
2. Propagate the activations forward.
3. Calculate the error for the output neurons.
4. Propagate the error backward.
5. Calculate ∂E/∂w_ij.
6. Repeat for all patterns and sum up.

Online Backpropagation

1: Initialize all weights to small random values.
2: repeat
3:    for each training example do
4:       Forward propagate the input features of the example to determine the MLP's outputs.
5:       Back propagate error to generate Δw_ij for all weights w_ij.
6:       Update the weights using Δw_ij.
7:    end for
8: until stopping criteria reached.
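The online backpropagation pseudocode above can be put together into a runnable sketch. Everything below is my own toy setup, not the lecture's code: a 2-2-1 network with sigmoid units (gain a = 1), learning rate 0.5, fixed random seed, trained on logical OR.

```python
import math
import random

random.seed(0)  # small random initial weights (step 1 of the pseudocode)
eta = 0.5
# w[j][i]: weight into neuron j from input i; the last entry is a bias weight.
w_hid = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_out = [[random.uniform(-1, 1) for _ in range(3)]]

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def forward(x):
    """Forward propagate: input -> hidden activations h -> outputs o."""
    h = [sigmoid(sum(wi * xi for wi, xi in zip(wj, x + [1.0]))) for wj in w_hid]
    o = [sigmoid(sum(wi * hi for wi, hi in zip(wj, h + [1.0]))) for wj in w_out]
    return h, o

def train_example(x, t):
    """One online update: forward pass, backward deltas, weight changes."""
    h, o = forward(x)
    # Output deltas: (t - o) * f'(u), with f'(u) = o(1 - o) for the sigmoid.
    d_out = [(tj - oj) * oj * (1 - oj) for tj, oj in zip(t, o)]
    # Hidden deltas: error propagated back through the output weights.
    d_hid = [hi * (1 - hi) * sum(w_out[k][i] * d_out[k] for k in range(len(d_out)))
             for i, hi in enumerate(h)]
    # Weight updates: delta_w = eta * delta * input to that connection.
    for k in range(len(w_out)):
        for i, inp in enumerate(h + [1.0]):
            w_out[k][i] += eta * d_out[k] * inp
    for j in range(len(w_hid)):
        for i, inp in enumerate(x + [1.0]):
            w_hid[j][i] += eta * d_hid[j] * inp

def total_error(data):
    """MSE over the data set: (1 / 2N) * sum_p (t_p - o_p)^2."""
    return sum((t[0] - forward(x)[1][0]) ** 2 for x, t in data) / (2 * len(data))

# Logical OR training set (linearly separable, so easily learnable).
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]
e0 = total_error(data)
for _ in range(2000):          # the "repeat ... until" loop, with a fixed budget
    for x, t in data:
        train_example(x, t)
print(total_error(data) < e0)  # True: training reduces the error
```

The stopping criterion here is simply a fixed number of epochs; in practice one would stop when the error falls below a threshold or stops improving.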

Summary

We learnt what a multilayer perceptron is. We have some intuition about using gradient descent on an error function. We know a learning rule for updating the weights in order to minimize the error:

    Δw_ij = −η ∂E/∂w_ij

If we use the squared error, we get the generalized delta rule:

    Δw_ij = η δ_i^p x_ij

We know how to calculate δ_i^p for the output and hidden layers. We can use this rule to learn an MLP's weights using the backpropagation algorithm.

Next lecture: a neural network model of the past tense.
