10-701 Introduction to Machine Learning (PhD) Lecture 9: Neural Networks
Leila Wehbe, Machine Learning Department, Carnegie Mellon University

The Perceptron
Slides based on Tom Mitchell's 10-701 Spring 2016 material. Readings: Hal Daumé III Chapters 4, 10; [TM] Ch. 4; [CB] Ch. 5
Perceptron: inspired by biological neurons
• Axon (output to other neurons)
• Dendrites (inputs from other neurons, can be excitatory or inhibitory)

Perceptron: error-driven learning

Activation: $a = b + \sum_{d=1}^{D} w_d x_d$; prediction $= \mathrm{SIGN}(a)$
• At each step, return SIGN(a)
• if SIGN(a) ≠ y, update the parameters
• otherwise, don't change them
44 a course in machine learning
From http://ciml.info/dl/v0_99/ciml-v0_99-ch08.pdf:

Algorithm 5 PerceptronTrain(D, MaxIter)
 1: $w_d \leftarrow 0$, for all $d = 1 \dots D$        // initialize weights
 2: $b \leftarrow 0$                                   // initialize bias
 3: for iter $= 1 \dots$ MaxIter do
 4:   for all $(x, y) \in D$ do
 5:     $a \leftarrow \sum_{d=1}^{D} w_d x_d + b$      // compute activation for this example
 6:     if $ya \le 0$ then
 7:       $w_d \leftarrow w_d + y x_d$, for all $d = 1 \dots D$   // update weights
 8:       $b \leftarrow b + y$                         // update bias
 9:     end if
10:   end for
11: end for
12: return $w_0, w_1, \dots, w_D, b$

Algorithm 6 PerceptronTest($w_0, w_1, \dots, w_D, b, \hat{x}$)
 1: $a \leftarrow \sum_{d=1}^{D} w_d \hat{x}_d + b$    // compute activation for the test example
 2: return sign($a$)

Second, it is error driven. This means that, so long as it is doing well, it doesn't bother updating its parameters. The algorithm maintains a "guess" at good parameters (weights and bias) as it runs. It processes one example at a time. For a given example, it makes a prediction. It checks to see if this prediction is correct (recall that this is training data, so we have access to true labels). If the prediction is correct, it does nothing. Only when the prediction is incorrect does it change its parameters, and it changes them in such a way that it would do better on this example next time around. It then goes on to the next example. Once it hits the last example in the training set, it loops back around for a specified number of iterations.

The training algorithm for the perceptron is shown in Algorithm 5 and the corresponding prediction algorithm is shown in Algorithm 6. There is one "trick" in the training algorithm, which probably seems silly, but will be useful later. It is in line 6, when we check to see if we want to make an update or not. We want to make an update if the current prediction (just sign(a)) is incorrect. The trick is to multiply the true label $y$ by the activation $a$ and compare this against zero. Since the label $y$ is either $+1$ or $-1$, you just need to realize that $ya$ is positive whenever $a$ and $y$ have the same sign. In other words, the product $ya$ is positive if the current prediction is correct.

[Margin note: It is very very important to check $ya \le 0$ rather than $ya < 0$. Why?]

The particular form of update for the perceptron is quite simple. The weight $w_d$ is increased by $y x_d$ and the bias is increased by $y$. The goal of the update is to adjust the parameters so that they are "better" for the current example. In other words, if we saw this example twice in a row, we should do a better job the second time around.

To see why this particular update achieves this, consider the following scenario. We have some current set of parameters $w_1, \dots, w_D, b$. We observe an example $(x, y)$. For simplicity, suppose this is a positive example, so $y = +1$. We compute an activation $a$, and make an error. Namely, $a < 0$. We now update our weights and bias. Let's call the new weights $w'_1, \dots, w'_D, b'$. Suppose we observe the same example again and need to compute a new activation $a'$. We proceed by a little algebra:

[Slide: Does the perceptron move $a$ in the right direction?
• update: $w' = w + yx = w + x$
• $b' = b + y = b + 1$]

\begin{aligned}
a' &= \sum_{d=1}^{D} w'_d x_d + b' && (4.3)\\
&= \sum_{d=1}^{D} (w_d + x_d)\,x_d + (b + 1) && (4.4)\\
&= \sum_{d=1}^{D} w_d x_d + b + \sum_{d=1}^{D} x_d x_d + 1 && (4.5)\\
&= a + \sum_{d=1}^{D} x_d^2 + 1 \;>\; a && (4.6)
\end{aligned}

So the difference between the old activation $a$ and the new activation $a'$ is $\sum_d x_d^2 + 1$. But $x_d^2 \ge 0$, since it's squared. So this value is always at least one. Thus, the new activation is always at least the old activation plus one. Since this was a positive example, we have successfully moved the activation in the proper direction. (Though note that there's no guarantee that we will correctly classify this point the second, third or even fourth time around!)

[Slide: $a$ becomes more positive (not guaranteed that $a > 0$).]

[Margin note: This analysis holds for the case of positive examples ($y = +1$). It should also hold for negative examples. Work it out.]

The only hyperparameter of the perceptron algorithm is MaxIter, the number of passes to make over the training data. If we make many many passes over the training data, then the algorithm is likely to overfit. (This would be like studying too long for an exam and just confusing yourself.) On the other hand, going over the data only one time might lead to underfitting. This is shown experimentally in Figure 4.3. The x-axis shows the number of passes over the data and the y-axis shows the training error and the test error. As you can see, there is a "sweet spot" at which test performance begins to degrade due to overfitting.

[Figure 4.3: training and test error via early stopping]

One aspect of the perceptron algorithm that is left underspecified is line 4, which says: loop over all the training examples. The natural implementation of this would be to loop over them in a constant order. This is actually a bad idea.

Consider what the perceptron algorithm would do on a data set that consisted of 500 positive examples followed by 500 negative examples. After seeing the first few positive examples (maybe five), it would likely decide that every example is positive, and would stop …
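As a sketch, the PerceptronTrain and PerceptronTest pseudocode above translates almost line for line into Python. This is my own minimal reconstruction (function names are not from the book), using the exact $ya \le 0$ test and the $w_d \leftarrow w_d + y x_d$, $b \leftarrow b + y$ update:

```python
def perceptron_train(data, max_iter):
    """Sketch of Algorithm 5 (PerceptronTrain).
    data: list of (x, y) pairs with x a list of floats and y in {+1, -1}."""
    D = len(data[0][0])
    w = [0.0] * D                # line 1: initialize weights
    b = 0.0                      # line 2: initialize bias
    for _ in range(max_iter):    # line 3: MaxIter passes over the data
        # line 4: loop over examples (the text warns that a constant
        # order is a bad idea; shuffling each pass is the usual fix)
        for x, y in data:
            a = sum(wd * xd for wd, xd in zip(w, x)) + b   # line 5: activation
            if y * a <= 0:       # line 6: ya <= 0, not ya < 0 (handles a = 0)
                w = [wd + y * xd for wd, xd in zip(w, x)]  # line 7: w_d += y x_d
                b += y                                     # line 8: b += y
    return w, b

def perceptron_test(w, b, x):
    """Sketch of Algorithm 6 (PerceptronTest): sign of the activation."""
    a = sum(wd * xd for wd, xd in zip(w, x)) + b
    return 1 if a >= 0 else -1
```

On any linearly separable data set, enough passes leave every training point with $ya > 0$, so `perceptron_test` then agrees with the training labels.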
What is the decision boundary? The set of points with zero activation, $a = b + \sum_{d=1}^{D} w_d x_d = 0$: a hyperplane with normal vector $w$.

How good is this algorithm?
• Convergence: an entire pass through the data without changing the weights.
• If the data is linearly separable, the algorithm will converge, but not necessarily to the "best" boundary.
• If the data is linearly separable with margin $\gamma$ and $\|x\| \le 1$, then the algorithm will converge in $1/\gamma^2$ updates.

… vector that separates the data. (And if the data is inseparable, then it will never converge.) This is great news. It means that the perceptron converges whenever it is even remotely possible to converge.

The second question is: how long does it take to converge? By "how long," what we really mean is "how many updates?" As is the case for much learning theory, you will not be able to get an answer of the form "it will converge after 5293 updates." This is asking too much. The sort of answer we can hope to get is of the form "it will converge after at most 5293 updates."

What you might expect to see is that the perceptron will converge more quickly for easy learning problems than for hard learning problems. This certainly fits intuition. The question is how to define "easy" and "hard" in a meaningful way. One way to make this definition is through the notion of margin. If I give you a data set and a hyperplane that separates it, then the margin is the distance between the hyperplane and the nearest point. Intuitively, problems with large margins should be easy (there's lots of "wiggle room" to find a separating hyperplane); and problems with small margins should be hard (you really have to get a very specific well tuned weight vector).

Formally, given a data set $D$, a weight vector $w$ and bias $b$, the margin of $w, b$ on $D$ is defined as:

$$\mathrm{margin}(D, w, b) = \begin{cases} \min_{(x,y) \in D} \; y\,(w \cdot x + b) & \text{if } w \text{ separates } D\\ -\infty & \text{otherwise} \end{cases} \qquad (4.8)$$

In words, the margin is only defined if $w, b$ actually separate the data (otherwise it is just $-\infty$). In the case that it separates the data, we find the point with the minimum activation, after the activation is multiplied by the label.

[Margin note: So long as the margin is not $-\infty$, it is always positive. Geometrically this makes sense, but why does Eq (4.8) yield this?]

For some historical reason (that is unknown to the author), margins are always denoted by the Greek letter $\gamma$ (gamma). One often talks about the margin of a data set. The margin of a data set is the largest attainable margin on this data. Formally:

$$\mathrm{margin}(D) = \sup_{w,b} \; \mathrm{margin}(D, w, b) \qquad (4.9)$$

In words, to compute the margin of a data set, you "try" every possible $w, b$ pair. For each pair, you compute its margin. We then take the largest of these as the overall margin of the data. If the data is not linearly separable, then the value of the sup, and therefore the value of the margin, is $-\infty$.

[Margin note: You can read "sup" as "max" if you like: the only difference is a technical difference in how the $-\infty$ case is handled.]

There is a famous theorem due to Rosenblatt (1958) that shows that the number of errors that the perceptron algorithm makes is bounded by $\gamma^{-2}$. More formally: …
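Eq (4.8) is easy to evaluate for one candidate $(w, b)$ pair. A sketch (the function name is mine, not from the text):

```python
def margin(data, w, b):
    """Eq (4.8): min over (x, y) in D of y * (w . x + b) when (w, b)
    separates D, and -infinity otherwise.
    data: list of (x, y) pairs with x a list of floats and y in {+1, -1}."""
    activations = [y * (sum(wd * xd for wd, xd in zip(w, x)) + b)
                   for x, y in data]
    if all(ya > 0 for ya in activations):   # (w, b) separates the data
        return min(activations)
    return float("-inf")                    # otherwise the margin is -infinity
```

Note that the margin of a *data set*, Eq (4.9), takes a sup over all $(w, b)$ pairs, which cannot be computed by enumeration; this function only scores a single candidate hyperplane.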
Neural Networks

Every node is analogous to a neuron
Sigmoid unit: the unit combines its inputs into an activation $a_j$ and outputs $y_j = \sigma(a_j)$, with bias $b$.

Forward propagation (prediction):

$$y_L^i = \sigma\!\left(\sum_j w_L^{ij}\, y_{L-1}^j + b_L^i\right), \qquad y_{L-1}^j = \sigma\!\left(\sum_k w_{L-1}^{jk}\, y_{L-2}^k + b_{L-1}^j\right)$$

so that

$$y_L^i = \sigma\!\left(\sum_j w_L^{ij}\, \sigma\!\left(\sum_k w_{L-1}^{jk}\, y_{L-2}^k + b_{L-1}^j\right) + b_L^i\right)$$
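The nested forward-propagation expression can be sketched as two applications of the same layer equation. This is an illustrative reconstruction (function names and shapes are mine), assuming sigmoid units as in the slides:

```python
import math

def sigmoid(a):
    """Logistic sigmoid: sigma(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + math.exp(-a))

def layer(y_prev, W, b):
    """One sigmoid layer: y_i = sigma(sum_j W[i][j] * y_prev[j] + b[i])."""
    return [sigmoid(sum(w_ij * y_j for w_ij, y_j in zip(row, y_prev)) + b_i)
            for row, b_i in zip(W, b)]

def forward(y_Lm2, W_Lm1, b_Lm1, W_L, b_L):
    """Compose the two layer equations: y_{L-2} -> y_{L-1} -> y_L."""
    y_Lm1 = layer(y_Lm2, W_Lm1, b_Lm1)   # y_{L-1}^j
    return layer(y_Lm1, W_L, b_L)        # y_L^i
```

With all weights and biases zero, every activation is 0 and every output is $\sigma(0) = 0.5$, which is a quick sanity check.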
How to train? Backprop with one node per layer
Level L-2 → Level L-1 → Level L: a chain of sigmoid units, compared against the true value $t$:

    y_{L-2} --(w_{L-1}, b_{L-1})--> a_{L-1} --> y_{L-1} --(w_L, b_L)--> a_L --> y_L   vs. t

What is $\partial E / \partial w$?

(In general, $y_L^i = \sigma\!\left(\sum_j w_L^{ij}\, y_{L-1}^j + b_L^i\right)$ and $E = \frac{1}{2}\,(y_L^i - t^i)^2$.)

Loss function: $E = \frac{1}{2}(y_L - t)^2$
Find the gradient for one training point at earlier nodes and parameters: how much does a very small change in the value of the nodes and parameters affect the loss $E$ for that point?
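Applied to the one-node-per-layer chain, the chain rule gives the following (my reconstruction of the steps these slides walk through, using $\sigma'(a) = \sigma(a)(1 - \sigma(a))$ for the sigmoid):

```latex
\begin{aligned}
\frac{\partial E}{\partial y_L} &= y_L - t\\[2pt]
\frac{\partial E}{\partial a_L} &= \frac{\partial E}{\partial y_L}\,\sigma'(a_L),
  \qquad \sigma'(a) = \sigma(a)\,\bigl(1 - \sigma(a)\bigr)\\[2pt]
\frac{\partial E}{\partial w_L} &= \frac{\partial E}{\partial a_L}\, y_{L-1},
  \qquad \frac{\partial E}{\partial b_L} = \frac{\partial E}{\partial a_L}\\[2pt]
\frac{\partial E}{\partial y_{L-1}} &= \frac{\partial E}{\partial a_L}\, w_L\\[2pt]
\frac{\partial E}{\partial a_{L-1}} &= \frac{\partial E}{\partial y_{L-1}}\,\sigma'(a_{L-1})\\[2pt]
\frac{\partial E}{\partial w_{L-1}} &= \frac{\partial E}{\partial a_{L-1}}\, y_{L-2},
  \qquad \frac{\partial E}{\partial b_{L-1}} = \frac{\partial E}{\partial a_{L-1}}
\end{aligned}
```

Each line reuses the quantity computed on the line above, which is exactly what makes the backward pass cheap: one sweep from the loss back to the earliest parameters.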
[Repeated slides: the same chain diagram ($y_{L-2} \to a_{L-1} \to y_{L-1} \to a_L \to y_L$, with parameters $w_{L-1}, b_{L-1}, w_L, b_L$) and loss $E = \frac{1}{2}(y_L - t)^2$, stepping through the backward pass one quantity at a time.]