Neural Networks: A Simple Problem (Linear Regression)
A Simple Problem (Linear Regression)

[Plot: training outputs y versus the attribute x_1 for the data points.]

• We have training data X = { x^k }, k = 1, .., N, with corresponding outputs Y = { y^k }, k = 1, .., N.
• We want to find the parameters that predict the output Y from the data X in a linear fashion:

    y^k ≈ w_o + w_1 x_1^k

• Notation:
  – Superscript: index of the data point in the training data set; k = kth training data point.
  – Subscript: coordinate of the data point; x_1^k = coordinate 1 of data point k.
• It is convenient to define an additional "fake" attribute for the input data: x_o = 1. The prediction then becomes:

    y^k ≈ w_o x_o^k + w_1 x_1^k

More Convenient Notations

• Vector of attributes for each training data point: x^k = [ x_o^k, .., x_M^k ]
• We seek a vector of parameters: w = [ w_o, .., w_M ]
• Such that we have a linear relation between prediction Y and attributes X:

    y^k ≈ w_o x_o^k + w_1 x_1^k + … + w_M x_M^k = ∑_{i=0..M} w_i x_i^k = w ⋅ x^k

• By definition, the dot product between the vectors w and x^k is:

    w ⋅ x^k = ∑_{i=0..M} w_i x_i^k

Neural Network: Linear Perceptron

[Diagram: input units holding the attribute values x_o, .., x_M, each linked to a single output unit by a connection with weight w_i; the output unit computes the prediction ∑_{i=0..M} w_i x_i = w ⋅ x.]

• Note: the input unit for x_o corresponds to the "fake" attribute x_o = 1. It is called the bias unit.
• Neural network learning problem: adjust the connection weights so that the network generates the correct prediction on the training data.

Linear Regression: Gradient Descent

• We seek a vector of parameters w = [ w_o, .., w_M ] that minimizes the error between the prediction Y and the data X:

    E = ∑_{k=1..N} ( y^k − (w_o x_o^k + w_1 x_1^k + … + w_M x_M^k) )^2
      = ∑_{k=1..N} ( y^k − w ⋅ x^k )^2
      = ∑_{k=1..N} δ_k^2,   where δ_k = y^k − w ⋅ x^k

• δ_k is the error between the observed output y^k and the prediction w ⋅ x^k at data point k. Graphically, it is the "vertical" distance between data point k and the prediction calculated by using the vector of linear parameters w.

Gradient Descent

• The minimum of E is reached when the derivatives with respect to each of the parameters w_i are zero:

    ∂E/∂w_i = −2 ∑_{k=1..N} ( y^k − (w_o x_o^k + w_1 x_1^k + … + w_M x_M^k) ) x_i^k
            = −2 ∑_{k=1..N} ( y^k − w ⋅ x^k ) x_i^k
            = −2 ∑_{k=1..N} δ_k x_i^k

• Note that the contribution of training data element number k to the overall gradient is −δ_k x_i^k.
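As a concrete aside (not part of the original notes), the error E and its gradient translate directly into a few lines of NumPy. The sketch below assumes a data matrix X whose first column is the "fake" attribute x_o = 1 and a vector y of outputs; the function and variable names are illustrative.

```python
import numpy as np

def squared_error_and_gradient(w, X, y):
    """Return E(w) and dE/dw for linear regression.

    X : (N, M+1) array of attributes; column 0 holds the "fake" attribute x_o = 1
    y : (N,) array of target outputs
    w : (M+1,) parameter vector [w_o, .., w_M]
    """
    delta = y - X @ w            # delta_k = y^k - w . x^k for every data point
    E = np.sum(delta ** 2)       # E = sum_k delta_k^2
    grad = -2.0 * (X.T @ delta)  # dE/dw_i = -2 sum_k delta_k x_i^k
    return E, grad

# Tiny example using the "true" function from the plots later on: y = 0.3 + 0.7 x_1
rng = np.random.default_rng(0)
x1 = rng.uniform(0.0, 1.0, size=20)
X = np.column_stack([np.ones_like(x1), x1])   # prepend x_o = 1
y = 0.3 + 0.7 * x1
E, grad = squared_error_and_gradient(np.array([0.0, 0.0]), X, y)
```

Gradient descent, introduced next, simply steps w against this gradient.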
Gradient Descent Update Rule

[Plot: the error E as a function of a single weight w_i. Where ∂E/∂w_i is positive we need to decrease w_i; where ∂E/∂w_i is negative we need to increase w_i.]

• Update rule: move in the direction opposite to the gradient direction:

    w_i ← w_i − α ∂E/∂w_i

Perceptron Training

• Given an input training data point x^k with its corresponding value y^k:
  1. Compute the error:  δ_k ← y^k − w ⋅ x^k
  2. Update the NN weights:  w_i ← w_i + α δ_k x_i^k

• α is the learning rate.
  – α too small: may converge slowly and may need a lot of training examples.
  – α too large: may change w too quickly and spend a long time oscillating around the minimum.

[Plots: the estimates of w_1 and w_o versus iterations, converging to the "true" values w_1 = 0.7 and w_o = 0.3; the "true" function is y = 0.3 + 0.7 x_1, i.e. w = [0.3 0.7]. Snapshots of the fit after 2 iterations (2 training points), 6 iterations (6 training points), and 20 iterations (20 training points).]

Perceptrons: Remarks

• The update has many names: delta rule, gradient rule, LMS rule, …
• The update is guaranteed to converge to the best linear fit (global minimum of E).
• Of course, there are more direct ways of solving the linear regression problem by using linear algebra techniques. It boils down to a simple matrix inversion (not shown here).
• In fact, the perceptron training algorithm can be much, much slower than the direct solution.
• So why do we bother with this? The answer is in the next few slides… be patient.

A Simple Classification Problem

• Suppose that we have one attribute x_1.
• Suppose that the data is in two classes (red dots and green dots).
• Given an input value x_1, we wish to predict the most likely class (note: this is the same problem as the one we solved with decision trees and nearest-neighbors).
• We could convert it to a problem similar to the previous one by defining an output value y:

    y = 0 if in the red class
    y = 1 if in the green class

• The problem now is to learn a mapping between the attribute x_1 of the training examples and their corresponding class output y.

• What we would like: a piecewise-constant prediction function,

    y = 0 if x_1 < θ
    y = 1 if x_1 ≥ θ

  This is not continuous and does not have derivatives.
• What we get from the current linear perceptron model: y = w ⋅ x, with w = [ w_o w_1 ] and x = [ 1 x_1 ], a continuous linear prediction.
• Possible solution: transform the linear prediction by some function σ which would turn it into a continuous approximation of a threshold:

    y = σ(w ⋅ x)

  This curve is a continuous approximation (a "soft" threshold) of the hard threshold θ. Note that we can take derivatives of that prediction function.

The Sigmoid Function

    σ(t) = 1 / (1 + e^(−t))

• Note: it is not important to remember the exact expression of σ (in fact, alternate definitions are used for σ). What is important to remember is that:
  – it is smooth and has a derivative σ′ (the exact expression is unimportant);
  – it approximates a hard threshold function at x = 0.
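As a quick illustration (a sketch, not from the notes): for this particular choice of σ, the derivative happens to have the convenient closed form σ′(t) = σ(t)(1 − σ(t)), which is all the classification training rule below needs.

```python
import numpy as np

def sigmoid(t):
    """Soft threshold: sigma(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def sigmoid_prime(t):
    """Derivative sigma'(t); for this sigma it equals sigma(t) * (1 - sigma(t))."""
    s = sigmoid(t)
    return s * (1.0 - s)

# sigma(0) = 0.5 right at the threshold; large |t| pushes the output toward 0 or 1.
print(sigmoid(0.0), sigmoid(5.0), sigmoid(-5.0))
```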
Generalization to M Attributes

• A linear separation is parameterized like a line:

    ∑_{i=0..M} w_i x_i = w ⋅ x = 0

• Two classes are linearly separable if they can be separated by a linear combination of the attributes:
  – a threshold in 1-d,
  – a line in 2-d,
  – a plane in 3-d,
  – a hyperplane in M-d.

[Figure: the separating line w ⋅ x = 0 in the attribute space. In the region where y = 0 we can approximate y by σ(w ⋅ x) ≈ 0; in the region where y = 1 we can approximate y by σ(w ⋅ x) ≈ 1.]

Single Layer Network for Classification

[Diagram: input units x_o, .., x_M, connected by weights w_o, .., w_M to a single output unit; the output prediction is σ( ∑_{i=0..M} w_i x_i ) = σ(w ⋅ x).]

• Term: single-layer perceptron.

Interpreting the Squashing Function

• If the data is very close to the threshold (small margin), the σ value is very close to 1/2: we are not sure, a 50-50 chance that the class is 0 or 1.
• If the data is very far from the threshold (large margin), the σ value is very close to 0 or 1: we are very confident about the class (e.g., a value close to 1 means we are very confident that the class is 1).
• Roughly speaking, we can interpret the output as how confident we are in the classification: Prob(y = 1 | x).

Training

• Given an input training data point x^k with its corresponding value y^k:
  1. Compute the error:  δ_k ← y^k − σ(w ⋅ x^k)
  2. Update the NN weights:  w_i ← w_i + α δ_k x_i^k σ′(w ⋅ x^k)

• Note: it is exactly the same as before, except for the additional complication of passing the linear output through σ.
• The update formula is derived by direct application of the chain rule from calculus.

Example

[Figure: two attributes x_1 and x_2; the separating line x_2 = x_1, obtained with w = [0 1 −1]; y = 1 on one side of the line and y = 0 on the other.]

• Annoying detail: we get the same separating line if we multiply all of w by some constant, so we are really interested only in the relative values, such as the slope of the line, −w_1/w_2.
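Putting the pieces together, here is a minimal end-to-end sketch (assuming NumPy; the synthetic data, learning rate, and epoch count are illustrative choices, not from the notes) of the single-layer perceptron training rule applied to the example above, where the separating line is x_2 = x_1:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def train_single_layer_perceptron(X, y, alpha=0.5, epochs=200):
    """Repeatedly apply w_i <- w_i + alpha * delta_k * x_i^k * sigma'(w . x^k)."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for xk, yk in zip(X, y):
            s = sigmoid(w @ xk)
            delta = yk - s                           # delta_k = y^k - sigma(w . x^k)
            w += alpha * delta * s * (1.0 - s) * xk  # sigma'(w . x^k) = s * (1 - s)
    return w

# Synthetic data for the example: y = 1 below the line x_2 = x_1, y = 0 above it.
rng = np.random.default_rng(1)
pts = rng.uniform(0.0, 1.0, size=(100, 2))
X = np.column_stack([np.ones(len(pts)), pts])  # prepend the "fake" attribute x_o = 1
y = (pts[:, 1] < pts[:, 0]).astype(float)
w = train_single_layer_perceptron(X, y)
print(w, -w[1] / w[2])  # slope of the learned separating line, expected close to 1
```

Since only the relative values of w matter, the check looks at the recovered slope −w_1/w_2 rather than at the raw weights.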