
Artificial neural networks, simple supervised learning

AIMS
• Understand the aim of using neural networks for pattern classification.
• Describe how a set of examples of stimuli and correct responses can be used to train an artificial neural network to respond correctly, via changes in synaptic weights governed by the firing rates of the pre- and post-synaptic neurons and the correct post-synaptic firing rate.
• Describe how this type of learning rule is used to perform pattern recognition in a perceptron.
• Discuss the limitation of the perceptron.

READING
Books 1, 2, 5.

Recap: learning in neural networks
The problem: find connection weights such that the network does something useful.
Solution: experience-dependent learning rules to modify connection weights, i.e. learn from examples.
1. 'Unsupervised' (no 'teacher' or feedback about right and wrong outputs).
2. 'Supervised':
   A. Evolution/genetic algorithms.
   B. Occasional reward or punishment ('reinforcement learning').
   C. Fully supervised: each example includes the correct output.

Pattern classification: linear separation of categories
o = f(h),  h = w1 x1 + w2 x2
(Figure: a two-input unit with weights w1 = 1, w2 = 1, threshold T = 1.5 and a threshold-logic transfer function f(h); in the input space the line x1 + x2 = 1.5, i.e. x2 = -x1 + 1.5, separates the region where o = 1 from the region where o = 0.)
Consider the network with weights w1 = 1 and w2 = 1, threshold T = 1.5 and a threshold-logic transfer function. The line separating o = 1 from o = 0 is where the net input equals the threshold:
h = w1 x1 + w2 x2 = T,  i.e. the line x2 = -(w1/w2) x1 + T/w2 (in y = mx + c format).
Changing the weights changes the line. So learning to classify two groups of input patterns means finding weights so that the line separates the groups. With these weights the unit performs the logical 'AND' function.

Pattern classification, example
(Figure: classifying men (o = 1, '+') from women (o = 0, 'o') using x1 = weight/kg and x2 = height/cm. With w1 = 5, w2 = 3 and T = 660 the 'decision boundary' is 5 x1 + 3 x2 = 660, i.e. x2 = -(5/3) x1 + 220, and the 'decision regions' lie on either side of it.)
'Learning' or 'training' on example patterns is used to define the weights and thus the 'decision region'. Performance depends on how well the network 'generalises', i.e. how well it classifies new data. These networks perform 'linear discrimination': the decision boundary is always linear.

Reminder: the 'Hebb rule'
(Figure: pre-synaptic neuron xj connected to post-synaptic neuron oi by weight wij.)
wij → wij + ε xj oi
The resulting weight change Δwij as a function of the pre-synaptic rate xj and post-synaptic rate oi:
Δwij     oi = 0   oi = 1
xj = 0     -        -
xj = 1     -        +
i.e. the weight increases only when the pre- and post-synaptic neurons are both active.

Supervised learning: the 'delta rule'
• Given a 'training set' of inputs x^n = (x1^n, x2^n, ...) with target output values t^n = (t1^n, t2^n, ...), where n = 1, 2, ...
• Present input pattern n and find the corresponding output values o^n = (o1^n, o2^n, ...).
• Change the connection weights according to:
  wij → wij + ε xj^n (ti^n - oi^n)
• Present the next input pattern: n → n + 1.
The term (ti^n - oi^n) is also known as 'delta', δi^n: the 'error' made by output i for input pattern n. The delta rule is like the Hebb rule, but with the post-synaptic term replaced by δ, so that the weights change so as to reduce the 'error'.

The perceptron (Rosenblatt, 1962)
(Figure: an 'artificial retina' of inputs xj feeding a single threshold-logic output unit o through weights wj; output = 1 if the pattern is detected.)
Present input patterns x^n (some are the pattern to be detected, with target t^n = 1; others are 'foils' to be ignored, with target t^n = 0). For each pattern apply the 'delta rule':
wj → wj + ε xj^n (t^n - o^n)
After many presentations of the whole training set (in random order) it can find the best linear discriminator of targets from foils.
Notice that wj changes only if t^n ≠ o^n and xj^n ≠ 0. If t^n > o^n then wj increases; if t^n < o^n then wj decreases, i.e. the delta rule changes weights so as to reduce the error.
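A minimal sketch (in Python) of the delta rule training a single threshold-logic unit to perform the logical AND function from the example above. The fixed threshold T = 1.5 comes from the lecture's first example; the learning rate, initial weights, number of epochs and fixed presentation order are assumptions chosen only for illustration.

# Delta-rule training of a single threshold-logic unit on logical AND.
# Assumed values: eps = 0.1, zero initial weights, 50 epochs, fixed order.
eps = 0.1                       # learning rate (epsilon)
T = 1.5                         # fixed threshold, as in the lecture's AND example
w = [0.0, 0.0]                  # initial weights w1, w2

# Training set: input patterns x^n with target outputs t^n (logical AND).
patterns = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]

def output(w, x, T):
    """Threshold-logic transfer function: o = 1 if h = w1*x1 + w2*x2 > T, else 0."""
    h = sum(wj * xj for wj, xj in zip(w, x))
    return 1 if h > T else 0

for epoch in range(50):                        # many presentations of the whole set
    for x, t in patterns:
        o = output(w, x, T)
        for j in range(len(w)):                # delta rule: wj -> wj + eps*xj*(t - o)
            w[j] += eps * x[j] * (t - o)

print("learned weights:", w)
print([output(w, x, T) for x, _ in patterns])  # expected [0, 0, 0, 1]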
Perceptrons, thresholds and multiple outputs
Q: The delta rule only learns weights, so how do we find the value of the threshold T?
A: Just use T = 0 and add another input x0 = -1; then the weight w0 from it can serve the same purpose. The condition for the output to be active, w1 x1 + w2 x2 + ... > T, is the same as w0 x0 + w1 x1 + w2 x2 + ... > 0 if x0 = -1 and w0 = T. The delta rule can then find the connection weight w0, which is the same as finding T.
(Figure: a perceptron with an extra input x0 = -1 and learnable weight w0 replacing the fixed threshold.)
Perceptrons can have many output units (forming a single-layer neural network); each output is trained using the delta rule independently of the others.

Non-linearly separable problems
e.g. XOR ('exclusive OR'). (Figure: the four XOR input patterns plotted in the (x1, x2) plane; no single straight line can separate the o = 1 cases (0,1) and (1,0) from the o = 0 cases (0,0) and (1,1).)
Perceptrons perform linear discrimination and cannot solve 'non-linearly separable' problems (Minsky & Papert, 1969).
'Hidden' processing units can give more power, e.g. XOR can be computed by a network in which a hidden unit with weights +1, +1 and threshold T = 1.5 detects the (1,1) case and inhibits the output (weight -2), while x1 and x2 each also excite the output directly (weights +1, +1) with output threshold T = 0.5. (A code sketch of this network is given below, after the multi-layer perceptron section.)

Multi-layer perceptrons
(Figure: a two-layer network with inputs x1, x2, hidden units v1, v2, v3 and outputs o1, o2, using a smooth sigmoid transfer function f(h).)
Q: How can we train the connection weights if the 'internal representation' is not known (i.e. ti^n is not known for the 'hidden layer', and xj^n is not known for the output layer)?
Delta rule: wij → wij + ε xj^n (ti^n - oi^n)
A: If the neurons use a smooth (continuous) transfer function, changing a connection weight from any (active) neuron anywhere in the network will change the output firing rate slightly. If the output oi^n is now closer to its target value ti^n, then the change was good.
Example (from the figure): with output o^n = 0.6, target t^n = 0.3, hidden activation v1^n = 0.5 and input x1^n = 1.0, decreasing the weight w on the connection from x1 to v1 reduces the error for input pattern n.
But note: the effect of changing a connection weight depends on the values of the errors δi^n = (ti^n - oi^n), the connections wij and the activations vi further up in the network. (A numerical sketch of this idea is given at the end of this handout, after the XOR example.)
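A small sketch of the hidden-unit solution to XOR described above, with the weights and thresholds read from the figure (hidden unit: weights +1, +1, T = 1.5; output unit: weights +1, +1 from the inputs, -2 from the hidden unit, T = 0.5).

# Hand-wired two-layer network computing XOR: the hidden unit detects
# x1 AND x2 and inhibits the output, while x1 and x2 excite the output.
def step(h, T):
    """Threshold-logic transfer function."""
    return 1 if h > T else 0

def xor_net(x1, x2):
    hidden = step(x1 + x2, 1.5)                 # active only for the (1, 1) case
    return step(x1 + x2 - 2 * hidden, 0.5)      # 'OR minus AND' = XOR

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", xor_net(x1, x2))    # prints 0, 1, 1, 0

No single-layer perceptron can produce this input-output mapping, since XOR is not linearly separable.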
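Finally, a numerical sketch of the intuition behind training multi-layer perceptrons with smooth units: nudge one weight deep in the network and check whether the output moves closer to its target. All weights and the input pattern below are assumptions chosen only so that the numbers roughly echo the lecture's example (o ≈ 0.6, t = 0.3, v1 ≈ 0.5, x1 = 1.0); this is not a training algorithm, just an illustration of why a smooth transfer function makes this kind of credit assignment possible.

import math

def f(h):
    """Smooth (logistic) transfer function."""
    return 1.0 / (1.0 + math.exp(-h))

def forward(w_x1_v1):
    # Tiny 2-input, 2-hidden, 1-output network with assumed fixed weights,
    # except for the x1 -> v1 weight that we perturb.
    x1, x2 = 1.0, 0.5                           # input pattern (x2 is an assumed value)
    v1 = f(w_x1_v1 * x1 - 0.4 * x2)             # hidden unit v1 (weight under study)
    v2 = f(0.5 * x1 + 0.5 * x2)                 # hidden unit v2 (fixed weights)
    o = f(0.5 * v1 + 0.23 * v2)                 # output unit
    return o

t = 0.3                                          # target for this pattern
for w in (0.2, 0.1):                             # original weight, then a small decrease
    o = forward(w)
    print(f"w = {w:.1f}: o = {o:.3f}, squared error = {(t - o) ** 2:.4f}")

# Decreasing the x1 -> v1 weight lowers v1, which lowers o (because v1 excites
# the output), bringing o closer to t: the change was 'good'. Whether a given
# change is good depends on the errors, weights and activations further up the
# network, which is the problem that back-propagation-style rules must solve.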