Multilayer Perceptrons
Informatics 1 CG: Lecture 6

Mirella Lapata
School of Informatics, University of Edinburgh
[email protected]

Reading: Kevin Gurney's Introduction to Neural Networks, Chapters 5–6.5

January 22, 2016


Recap: Perceptrons

[Figure: a perceptron with inputs x_1 … x_n, weights w_1 … w_n, a bias weight w_0, and a step-threshold output y]

- Connectionism is a computer modeling approach inspired by neural networks.
- Anatomy of a connectionist model: units, connections.
- The perceptron as a linear classifier: it computes the weighted sum a = Σ_{i=0}^{n} w_i x_i (with x_0 = 1, so that w_0 acts as a bias) and outputs 1 if a > 0, and 0 otherwise.
- A learning rule for perceptrons.
- Key limitation: only works for linearly separable data.
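The recap can be made concrete with a few lines of code. This is a toy sketch of my own, not from the lecture; the weight values realizing logical AND are hand-picked:

```python
# A minimal perceptron as a linear classifier (sketch with hand-picked weights).
# w0 acts as the bias via a fixed input x0 = 1, matching the sum
# a = sum_{i=0}^{n} w_i x_i with output 1 if a > 0, else 0.

def perceptron(x, w):
    """Step-threshold perceptron: x and w include the bias term at index 0."""
    a = sum(wi * xi for wi, xi in zip(w, x))
    return 1 if a > 0 else 0

# Hand-picked weights realizing logical AND (a linearly separable function):
w_and = [-1.5, 1.0, 1.0]  # bias w0 = -1.5
print([perceptron([1, x1, x2], w_and) for (x1, x2) in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# [0, 0, 0, 1]
```

XOR, by contrast, is not linearly separable, so no choice of `w` makes this single unit compute it; that is exactly the limitation MLPs address.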

Multilayer Perceptrons (MLPs)

[Figure: an MLP with inputs x_1 … x_n, weighted connections, a hidden layer, and an output y]

MLPs are feed-forward neural networks, organized in layers:
- one input layer, one or more hidden layers, and one output layer;
- each node in a layer is connected to all the nodes in the next layer;
- each connection has a weight (which can be zero).

Activation Functions

[Figure: the step function and the sigmoid, each plotted as output y against input h]

The step function outputs 0 or 1. The sigmoid outputs a real value between 0 and 1.
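The two activation functions can be sketched directly. This toy code is my own illustration (the sigmoid here uses a gain of 1):

```python
import math

# Step vs. sigmoid activation (sketch). The step function outputs 0 or 1;
# the sigmoid squashes any real input into a value strictly between 0 and 1.

def step(h):
    return 1 if h > 0 else 0

def sigmoid(h):
    return 1.0 / (1.0 + math.exp(-h))

print(step(-2.0), step(2.0))   # 0 1
print(round(sigmoid(0.0), 3))  # 0.5
```

The crucial difference for what follows: the sigmoid is smooth, so it has a derivative everywhere, while the step function does not.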

Sigmoids

[Figure: sigmoid activation curves]

Learning with MLPs

[Figure: an MLP with an input layer, a hidden layer, and an output layer]

As with perceptrons, finding the right weights is very hard! Solution technique: learning! Learning: adjusting the weights based on training examples.

Learning and Error Minimization

Recall the perceptron learning rule:

    w_i ← w_i + η(t − o)x_i

General Idea
Minimize the difference between the actual and desired outputs:

1. Send the MLP an input pattern, x, from the training set.
2. Get the output from the MLP, y.
3. Compare y with the "right answer", or target t, to get the error quantity.
4. Use the error quantity to modify the weights, so next time y will be closer to t.
5. Repeat with another x from the training set.

When updating the weights after seeing x, the network doesn't just change the way it deals with x, but the way it deals with other inputs too, including inputs it has not seen yet. Generalization is the ability to deal accurately with unseen inputs.

Error Function (MSE)
An error function represents the difference between actual and desired outputs over a set of inputs:

    E(w⃗) = (1 / 2N) Σ_{p=1}^{N} (t^p − o^p)²

where N is the number of patterns, t^p is the target output for pattern p, and o^p is the output obtained for pattern p. The 2 makes little difference, but makes life easier later on!
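The MSE formula translates directly into code. A small sketch of my own (the target and output values are made up):

```python
# Mean squared error over N patterns:
# E = (1 / 2N) * sum_p (t_p - o_p)**2

def mse(targets, outputs):
    n = len(targets)
    return sum((t - o) ** 2 for t, o in zip(targets, outputs)) / (2 * n)

print(mse([1, 0, 1], [1, 0, 1]))  # 0.0 for a perfect fit
print(mse([1, 0], [0, 1]))        # 0.5
```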


Gradient Descent and Derivatives: The Idea

One technique that can be used for minimizing functions is gradient descent.

The derivative is a measure of the rate of change of a function as its input changes. For a function y = f(x), the derivative dy/dx indicates how much y changes in response to changes in x. If x and y are real numbers, and the graph of y is plotted against x, the derivative measures the slope or gradient of the line at each point, i.e., it describes the steepness or incline.

Can we use this on our error function E? We would like a learning rule that tells us how to update the weights, like this:

    w′_ij = w_ij + Δw_ij

But what should Δw_ij be?

Gradient and Derivatives: The Idea

- dy/dx > 0 implies that y increases as x increases. If we want to find the minimum of y, we should reduce x.
- dy/dx < 0 implies that y decreases as x increases. If we want to find the minimum of y, we should increase x.
- dy/dx = 0 implies that we are at a minimum, a maximum, or a plateau.

To get closer to the minimum:

    x_new = x_old − η (dy/dx)

So, we know how to use derivatives to adjust one input value. But we have several weights to adjust! We need to use partial derivatives. A partial derivative of a function of several variables is its derivative with respect to one of those variables, with the others held constant.

Example: if y = f(x_1, x_2), then we can have ∂y/∂x_1 and ∂y/∂x_2.

In our learning rule case, if we can work out the partial derivatives, we can use this rule to update the weights:

    w′_ij = w_ij + Δw_ij    where Δw_ij = −η ∂E/∂w_ij
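The one-variable update x_new = x_old − η (dy/dx) can be sketched as follows. This is a toy example of my own (minimizing y = (x − 3)², whose derivative I wrote out by hand), not from the slides:

```python
# One-variable gradient descent: x_new = x_old - eta * dy/dx.
# We minimize y = (x - 3)^2, whose derivative is dy/dx = 2(x - 3).

def grad(x):
    return 2 * (x - 3)

x, eta = 0.0, 0.1
for _ in range(100):
    x = x - eta * grad(x)  # step downhill, opposite the slope's sign
print(round(x, 4))  # 3.0, the minimum
```

When the slope is negative (x left of 3), the update increases x; when positive, it decreases x — exactly the sign rules listed above.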

Summary So Far

We learnt what a multilayer perceptron is. We know a learning rule for updating the weights in order to minimise the error:

    w′_ij = w_ij + Δw_ij    where Δw_ij = −η ∂E/∂w_ij

Δw_ij tells us in which direction and how much we should change each weight to roll down the slope (descend the gradient) of the error function E.

Using Gradient Descent to Minimize the Error

[Figure: neuron j connected to neuron i by weight w_ij; each neuron applies the input function Σ and the activation function f]

The mean squared error function E, which we want to minimize:

    E(w⃗) = (1 / 2N) Σ_{p=1}^{N} (t^p − o^p)²

So, how do we calculate ∂E/∂w_ij?

Using Gradient Descent to Minimize the Error

[Figure: neuron j connected to neuron i by weight w_ij; each neuron applies the input function Σ and the activation function f]

If we use a sigmoid f, then the output of neuron i for pattern p is:

    o_i^p = f(u_i) = 1 / (1 + e^{−a u_i})

where a is a pre-defined constant and u_i is the result of the input function in neuron i:

    u_i = Σ_j w_ij x_ij

For the pth pattern and the ith neuron, gradient descent on the error function gives:

    Δw_ij = −η ∂E^p/∂w_ij = η (t_i^p − o_i^p) f′(u_i) x_ij

where f′(u_i) = df/du_i is the derivative of f with respect to u_i. If f is the sigmoid function, f′(u_i) = a f(u_i)(1 − f(u_i)).
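The identity f′(u) = a f(u)(1 − f(u)) can be checked numerically. A sketch of my own, with arbitrary choices of a and u:

```python
import math

# Sigmoid with gain a, and its derivative via the identity
# f'(u) = a * f(u) * (1 - f(u)), checked against a central difference.

A = 1.0  # the pre-defined constant a (arbitrary choice)

def f(u):
    return 1.0 / (1.0 + math.exp(-A * u))

def f_prime(u):
    return A * f(u) * (1.0 - f(u))

u, eps = 0.7, 1e-6
numeric = (f(u + eps) - f(u - eps)) / (2 * eps)
print(abs(numeric - f_prime(u)) < 1e-6)  # True: the identity holds
```

This identity is why the sigmoid is so convenient: once the output f(u) has been computed in the forward pass, the derivative comes almost for free.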

Using Gradient Descent to Minimize the Error

[Figure: neuron j connected to neuron i by weight w_ij; each neuron applies the input function Σ and the activation function f]

We can update the weights after processing each pattern, using the rule:

    Δw_ij = η (t_i^p − o_i^p) f′(u_i^p) x_ij = η δ_i^p x_ij

This is known as the generalized delta rule. We need to use the derivative of the activation function f, so f must be differentiable! The threshold activation function is not continuous, and thus not differentiable. The sigmoid has a derivative which is easy to calculate.

Updating Output vs Hidden Neurons

We can update output neurons using the generalized delta rule:

    Δw_ij = η δ_i^p x_ij    where δ_i^p = (t_i^p − o_i^p) f′(u_i^p)

This δ_i^p is only good for the output neurons, since it relies on the target outputs. But we don't have target outputs for the hidden nodes! What can we use instead?

    δ_i = (Σ_k w_ki δ_k) f′(u_i)

This rule propagates error back from the output nodes to the hidden nodes. In effect, it blames hidden nodes according to how much influence they had. So, now we have rules for updating both output and hidden neurons!
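The two delta rules can be sketched for single neurons. The scalar values below are arbitrary choices of mine, assuming sigmoid activations with a = 1 (so f′(u_i) can be written as o_i(1 − o_i)):

```python
# Delta computations for output and hidden neurons (toy scalar sketch).

def f_prime(o):
    # For the sigmoid with a = 1: f'(u) = o * (1 - o), where o = f(u).
    return o * (1.0 - o)

def delta_output(t, o):
    """delta_i = (t_i - o_i) * f'(u_i) for an output neuron."""
    return (t - o) * f_prime(o)

def delta_hidden(o, w_out, deltas_out):
    """delta_i = (sum_k w_ki * delta_k) * f'(u_i) for a hidden neuron."""
    return sum(w * d for w, d in zip(w_out, deltas_out)) * f_prime(o)

d_out = delta_output(t=1.0, o=0.8)  # output too low, so delta is positive
d_hid = delta_hidden(o=0.6, w_out=[0.5], deltas_out=[d_out])
print(round(d_out, 4), round(d_hid, 4))  # 0.032 0.0038
```

Note how the hidden delta inherits a share of the output delta in proportion to the connecting weight w_ki: that is the "blame assignment" described above.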

Backpropagation

[Figure: an animation stepping through an MLP of Σ units, showing activations flowing forward and error flowing backward]

1. Present the pattern at the input layer.
2. Propagate the activations forward.
3. Calculate the error for the output neurons.
4. Propagate the error backward.
5. Calculate ∂E/∂w_ij.
6. Repeat for all patterns and sum up.

Online Backpropagation

1: Initialize all weights to small random values.
2: repeat
3:    for each training example do
4:       Forward propagate the input features of the example to determine the MLP's outputs.
5:       Back propagate error to generate Δw_ij for all weights w_ij.
6:       Update the weights using Δw_ij.
7:    end for
8: until stopping criteria reached.
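The online backpropagation pseudocode above can be put together into a runnable sketch. Everything below is my own toy setup, not the lecture's code: a 2-2-1 network with sigmoid units (gain a = 1), learning rate 0.5, fixed random seed, trained on logical OR.

```python
import math
import random

random.seed(0)  # small random initial weights (step 1 of the pseudocode)
eta = 0.5
# w[j][i]: weight into neuron j from input i; the last entry is a bias weight.
w_hid = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_out = [[random.uniform(-1, 1) for _ in range(3)]]

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def forward(x):
    """Forward propagate: input -> hidden activations h -> outputs o."""
    h = [sigmoid(sum(wi * xi for wi, xi in zip(wj, x + [1.0]))) for wj in w_hid]
    o = [sigmoid(sum(wi * hi for wi, hi in zip(wj, h + [1.0]))) for wj in w_out]
    return h, o

def train_example(x, t):
    """One online update: forward pass, backward deltas, weight changes."""
    h, o = forward(x)
    # Output deltas: (t - o) * f'(u), with f'(u) = o(1 - o) for the sigmoid.
    d_out = [(tj - oj) * oj * (1 - oj) for tj, oj in zip(t, o)]
    # Hidden deltas: error propagated back through the output weights.
    d_hid = [hi * (1 - hi) * sum(w_out[k][i] * d_out[k] for k in range(len(d_out)))
             for i, hi in enumerate(h)]
    # Weight updates: delta_w = eta * delta * input to that connection.
    for k in range(len(w_out)):
        for i, inp in enumerate(h + [1.0]):
            w_out[k][i] += eta * d_out[k] * inp
    for j in range(len(w_hid)):
        for i, inp in enumerate(x + [1.0]):
            w_hid[j][i] += eta * d_hid[j] * inp

def total_error(data):
    """MSE over the data set: (1 / 2N) * sum_p (t_p - o_p)^2."""
    return sum((t[0] - forward(x)[1][0]) ** 2 for x, t in data) / (2 * len(data))

# Logical OR training set (linearly separable, so easily learnable).
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]
e0 = total_error(data)
for _ in range(2000):          # the "repeat ... until" loop, with a fixed budget
    for x, t in data:
        train_example(x, t)
print(total_error(data) < e0)  # True: training reduces the error
```

The stopping criterion here is simply a fixed number of epochs; in practice one would stop when the error falls below a threshold or stops improving.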

Summary

We learnt what a multilayer perceptron is. We have some intuition about using gradient descent on an error function. We know a learning rule for updating the weights in order to minimize the error:

    Δw_ij = −η ∂E/∂w_ij

If we use the squared error, we get the generalized delta rule:

    Δw_ij = η δ_i^p x_ij

We know how to calculate δ_i^p for the output and hidden layers. We can use this rule to learn an MLP's weights using the backpropagation algorithm.

Next lecture: a neural network model of the past tense.
