MALIS: Neural Networks
Maria A. Zuluaga, Data Science Department
Recap: Classification

• The input space is divided into decision regions whose boundaries are called decision boundaries or decision surfaces. (Source: A. Zisserman)

Separating hyperplanes

• Two (of infinitely many) possible separating hyperplanes.
• The least squares solution obtained by regressing the class labels, encoded as $+1$ and $-1$, on the inputs leads to a decision line $\{x : \hat{\beta}_0 + \hat{\beta}^T x = 0\}$.
• Separating hyperplane classifiers are linear classifiers which try to "explicitly" separate the data as well as possible. (Figure 4.14 from The Elements of Statistical Learning)

The Perceptron

• Assumptions:
  • The data are linearly separable.
  • Binary classification with labels $y \in \{-1, +1\}$.
• Goal: find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary.

Formulation

$$h(x) = \mathrm{sign}(w^T x + b) = \begin{cases} +1, & w^T x + b \ge 0 \\ -1, & w^T x + b < 0 \end{cases}$$

• As before, we "absorb" $b$ by adding a dummy feature to $x$: with $x \leftarrow (x, 1)$ and $w \leftarrow (w, b)$, the classifier becomes $h(x) = \mathrm{sign}(w^T x)$. (Source: Machine Learning for Intelligent Systems, Cornell University)

Error function: the perceptron criterion

• Conditions:
  • Patterns of the positive class should have $w^T x_n > 0$.
  • Patterns of the negative class should have $w^T x_n < 0$.
• Since $y_n \in \{-1, +1\}$, this means we want all patterns to satisfy $w^T x_n y_n > 0$.
• Perceptron criterion: $E_P(w) = -\sum_{n \in \mathcal{M}} w^T x_n y_n$, where $\mathcal{M}$ is the set of misclassified points.

Interpretation

• The output is the characteristic function of a half-space, bounded by the hyperplane $w^T x + b = 0$.
[Figure: two classes, $y = +1$ and $y = -1$, in the $(x_1, x_2)$ plane, separated by the hyperplane]

Representation

[Diagram: a perceptron unit with inputs $x_i$ weighted by $w_i$, a bias weight $w_0$ on a constant $+1$ input, and an activation producing the output]

Linear separability

• Given two sets of points, is there a perceptron which classifies them?
[Figure: a separable configuration of points (YES) and a non-separable one (NO)]
• The answer is YES only if the sets are linearly separable.

Finding the weights

• Obtain an expression for the gradient of the perceptron criterion.
• Use stochastic gradient descent to minimize the error function.
• A change in the weight vector is given by
$$w^{(\tau+1)} = w^{(\tau)} - \eta \nabla E_P(w) = w^{(\tau)} + \eta\, x_n y_n$$
• Stochastic gradient descent vs gradient descent: rather than computing the sum of the gradient contributions of each observation followed by a step in the negative gradient direction, a step is taken after each single observation is visited.
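To make the criterion and the update rule concrete, here is a minimal NumPy sketch (not from the slides): it assumes labels in $\{-1, +1\}$ and inputs already augmented with a constant 1 so the bias is absorbed into $w$.

```python
import numpy as np

def perceptron_criterion(w, X, y):
    """E_P(w) = -sum over misclassified points of y_n * w^T x_n.
    X: (n_samples, n_features) with a constant-1 column appended,
    y: labels in {-1, +1}."""
    scores = y * (X @ w)
    misclassified = scores <= 0
    return -np.sum(scores[misclassified])

def sgd_update(w, x_n, y_n, lr=1.0):
    """One stochastic gradient step on the perceptron criterion:
    if (x_n, y_n) is misclassified, move w in the direction of y_n * x_n."""
    if y_n * (w @ x_n) <= 0:
        w = w + lr * y_n * x_n
    return w
```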
Perceptron training algorithm

    initialize w
    while TRUE do:
        m = 0
        foreach (x_i, y_i) do:
            if y_i * w^T x_i <= 0:
                m = m + 1
                w = w + eta * y_i * x_i
        if m == 0: break

(Illustration adapted from Fig 4.7, PRML, C. Bishop, stepped through over several slides as the decision boundary is updated.)

Hands-on example: the OR function

• 03_perceptron.ipynb
[Figure: the four OR inputs on the unit square, labelled +1 / -1]
• The worked example tracks, for each update step, the columns X0, X1, X2, b, W1, W2, y, the activation, and the mistake counter m.

Solution

[Worked solution shown on the slide.]

Perceptron convergence theorem

• If there exists an exact solution (i.e., the data are linearly separable), then the perceptron algorithm is guaranteed to converge to such a solution in a finite number of steps.
• However:
  • The number of steps to convergence might be large.
  • Until convergence is achieved, it is not possible to distinguish a non-separable problem from one that is merely slow to converge.

What if we use a different initialization value?

• Question: these are specific examples. What is the general set of inequalities that must be satisfied for an OR perceptron?

Other logic functions

• Exercises to complete in the notebook.

Perceptron limitations

• Perceptrons cannot represent XOR:

    x1  x2  XOR
     0   0    0
     0   1    1
     1   0    1
     1   1    0

[Minsky 1969] -> the AI winter

Perceptron limitations

• The algorithm does not converge when the data are not separable (see the sketch below).
• When the data are separable, there are many solutions, and which one is found depends on the starting values.
• The "finite" number of steps can be very large.
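These points can be checked empirically. Below is an illustrative sketch (not the code of 03_perceptron.ipynb) that runs the training algorithm on the OR function, where it converges in a few epochs, and on XOR, where it never does, so a cap on the number of epochs is needed.

```python
import numpy as np

def train_perceptron(X, y, lr=1.0, max_epochs=100):
    """Perceptron training loop: sweep the data, update on each
    misclassified point, stop when a full sweep makes no mistakes."""
    X = np.hstack([np.ones((X.shape[0], 1)), X])  # absorb the bias
    w = np.zeros(X.shape[1])
    for epoch in range(max_epochs):
        mistakes = 0
        for x_n, y_n in zip(X, y):
            if y_n * (w @ x_n) <= 0:   # misclassified (or on the boundary)
                mistakes += 1
                w += lr * y_n * x_n
        if mistakes == 0:              # converged: data perfectly separated
            return w, epoch
    return w, None                     # no convergence within max_epochs

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y_or  = np.array([-1, 1, 1, 1])   # OR with labels in {-1, +1}
y_xor = np.array([-1, 1, 1, -1])  # XOR: not linearly separable

print(train_perceptron(X, y_or))   # finds a separating w after a few epochs
print(train_perceptron(X, y_xor))  # returns None: never converges
```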
Some history

[Illustration; source: C. Bishop, PRML]

Recap

• We introduced the perceptron algorithm, a linear classifier.
• We saw that it is guaranteed to converge to a solution when the data are separable.
• But we also saw that it has numerous limitations.

Neural Networks

Why neural networks? A note on history

• The term neural network has its origins in attempts to find mathematical representations of information processing in biological systems.
• From Bishop: it has been used very broadly to cover a wide range of different models, many of which have been the subject of exaggerated claims regarding their biological plausibility.
• They are efficient nonlinear models for statistical pattern recognition.

Motivation

• Recall the first lecture on linear models.
• We saw that adding features could give a better fit of the model: $\phi(x) = (1, x, x^2, x^3, \dots, x^n)$.
• $\phi_j$: basis function.
• Model: $y(x, w) = \sum_{j=0}^{M} w_j \phi_j(x)$

Motivation (revisit 01_linear_models.ipynb)

• We also saw that choosing the right set of features was challenging. Which one is the good value for $n$?
• Goal: make the basis functions $\phi_j(x)$ depend on parameters, and allow these parameters to be adjusted along with the coefficients $w_j$ during training.
• How? Neural networks.

Feed-forward networks, a.k.a. the multilayer perceptron (MLP)

• Basic neural network model: a series of functional transformations.
• Step 1: construct $M$ linear combinations of the input variables $x_1, \dots, x_D$:
$$a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}, \qquad j = 1, \dots, M$$
  • The superscript $(1)$ is the layer index, e.g. parameters of the first layer of the network.
  • $w_{ji}$: weights; $w_{j0}$: biases; $a_j$: activations.
  • $j = 1, \dots, M$ indexes the linear combinations; $i = 1, \dots, D$ indexes the dimensions of the input $x$.

• Step 2: transform each activation using a differentiable, nonlinear activation function $h(\cdot)$:
$$z_j = h(a_j)$$
  • $h(\cdot)$ is generally chosen to be a sigmoidal function.
  • $z_j$ corresponds to the outputs of the basis functions of our model. Recall: $y(x, w) = \sum_{j=0}^{M} w_j \phi_j(x)$.
  • In the context of neural networks, these are called hidden units.

• Step 3: the $z_j$ are again linearly combined to give the output unit activations:
$$a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)}, \qquad k = 1, \dots, K$$
  • The superscript $(2)$ is the layer index, e.g. parameters of the second layer of the network.
  • $k = 1, \dots, K$ indexes the outputs; $K$ is the number of outputs.

• Step 4: the $a_k$ are transformed using an activation function to give a set of network outputs $y_k$.
  • The choice of the activation function follows the same considerations as for linear models.
  • For regression: the identity function.
  • Common activation functions for classification (sketched in code below):
    • Sigmoid: $\sigma(a) = 1 / (1 + e^{-a})$
    • Tanh: $\tanh(a) = \dfrac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$
    • Hinge or ReLU: $h(a) = \max(a, 0)$
    • Softmax (multiclass): $h(a_k) = \dfrac{e^{a_k}}{\sum_m e^{a_m}}$

Feed-forward networks, a.k.a. MLP: final expression

• Combining all of the above, and using a sigmoidal output unit activation function:
$$y_k(x, w) = \sigma\!\left( \sum_{j=1}^{M} w_{kj}^{(2)}\, h\!\left( \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)} \right) + w_{k0}^{(2)} \right)$$

Feed-forward networks / MLP — interpretation: network diagram

• Forward propagation of information through the network.
[Fig 5.1, PRML, C. Bishop: diagram of a two-layered network]
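The activation functions listed in Step 4 translate directly into a few lines of NumPy; the following is a sketch, not library code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    return (np.exp(a) - np.exp(-a)) / (np.exp(a) + np.exp(-a))  # same as np.tanh(a)

def relu(a):
    return np.maximum(a, 0.0)

def softmax(a):
    e = np.exp(a - np.max(a))   # subtract the max for numerical stability
    return e / np.sum(e)
```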
Simplifying notation: absorbing the biases

• As with linear models, the bias parameters can be absorbed into the set of weight parameters by adding a dummy input variable $x_0 = 1$:
$$a_j = \sum_{i=0}^{D} w_{ji}^{(1)} x_i$$
• Similarly, in the second layer (with a dummy hidden unit $z_0 = 1$):
$$a_k = \sum_{j=0}^{M} w_{kj}^{(2)} z_j$$

Feed-forward networks, a.k.a. MLP: simplified final expression

• The overall network function now becomes
$$y_k(x, w) = \sigma\!\left( \sum_{j=0}^{M} w_{kj}^{(2)}\, h\!\left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right) \right)$$
• Compare with the linear model: $y(x, w) = \sum_{j=0}^{M} w_j \phi_j(x)$.

Multilayer perceptron: interpretation

• Two stages of processing, each of which resembles the perceptron:
$$y_k(x, w) = \sigma\!\left( \sum_{j=0}^{M} w_{kj}^{(2)}\, h\!\left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right) \right)$$
(Adapted from Fig 5.1, PRML, C. Bishop)
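A minimal sketch of this simplified forward pass, assuming a dummy input $x_0 = 1$ and hidden unit $z_0 = 1$ to carry the biases, tanh hidden units, and a sigmoid output; the names and shapes are illustrative, not from the slides.

```python
import numpy as np

def forward(x, W1, W2):
    """Two-layer feed-forward network with biases absorbed:
    y_k = sigma( sum_j W2[k, j] * h( sum_i W1[j, i] * x_i ) ).
    W1: (M, D+1) first-layer weights, W2: (K, M+1) second-layer weights."""
    x = np.append(1.0, x)              # dummy input x_0 = 1 absorbs the first-layer biases
    a1 = W1 @ x                        # first-layer activations a_j
    z = np.append(1.0, np.tanh(a1))    # hidden units z_j = h(a_j), plus z_0 = 1 for the bias
    a2 = W2 @ z                        # output unit activations a_k
    return 1.0 / (1.0 + np.exp(-a2))   # sigmoidal output y_k

# Example: D = 2 inputs, M = 3 hidden units, K = 1 output, random weights
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 3))           # (M, D+1)
W2 = rng.normal(size=(1, 4))           # (K, M+1)
print(forward(np.array([0.5, -1.0]), W1, W2))
```

Each stage is a linear combination followed by a nonlinearity, which is why the network reads as two perceptron-like units stacked on top of each other.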
