MALIS: Neural Networks
Maria A. Zuluaga, Data Science Department

Recap: Classification

The input space is divided into decision regions whose boundaries are called decision boundaries or decision surfaces.

Source: A. Zisserman

Separating hyperplanes

Two (of infinitely many) possible separating hyperplanes.

The least squares solution, obtained by regressing the labels $y \in \{-1, +1\}$ on $x$, leads to a decision boundary given by

$\{x : \hat{\beta}_0 + \hat{\beta}^\top x = 0\}$

Separating hyperplane classifiers are linear classifiers which try to "explicitly" separate the data as well as possible. (Figure 4.14 from The Elements of Statistical Learning.)

The Perceptron

• Assumptions:
  • Data is linearly separable
  • Binary classification using labels $y \in \{-1, +1\}$

• Goal: Find a separating hyperplane by minimizing the distance of misclassified points to the decision boundary.

Formulation

• $h(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^\top \mathbf{x} + b)$, where

$\mathrm{sign}(z) = \begin{cases} +1, & z \ge 0 \\ -1, & z < 0 \end{cases}$

• As before, we will "absorb" b by adding a "dummy" dimension to x:

$\mathbf{x} \leftarrow (\mathbf{x}, 1), \quad \mathbf{w} \leftarrow (\mathbf{w}, b), \quad \text{so that } h(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^\top \mathbf{x})$

Source: Machine Learning for Intelligent Systems, Cornell University


Error function: The perceptron criterion

• Conditions:
  • Patterns with $y_i = +1$ should have $\mathbf{w}^\top \mathbf{x}_i > 0$
  • Patterns with $y_i = -1$ should have $\mathbf{w}^\top \mathbf{x}_i < 0$
  • Since $y_i \in \{-1, +1\}$, this means we want all patterns to satisfy $y_i\, \mathbf{w}^\top \mathbf{x}_i > 0$

• Perceptron criterion:

$E_P(\mathbf{w}) = -\sum_{i \in \mathcal{M}} y_i\, \mathbf{w}^\top \mathbf{x}_i$

where $\mathcal{M}$ is the set of misclassified points (see the numerical sketch below).
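A minimal numerical sketch of the criterion (the data and weight values below are illustrative, not from the course notebook):

```python
import numpy as np

# Toy data: each row of X already carries a trailing 1 that absorbs the bias.
X = np.array([[ 1.0,  2.0, 1.0],
              [ 2.0,  1.0, 1.0],
              [-1.0, -2.0, 1.0]])
y = np.array([1, 1, -1])
w = np.array([0.5, -0.5, 0.0])        # some current weight vector

scores = X @ w                         # w^T x_i for every sample
misclassified = y * scores <= 0        # points violating y_i w^T x_i > 0
E_P = -np.sum(y[misclassified] * scores[misclassified])
print(misclassified, E_P)              # here: two misclassified points, E_P = 1.0
```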


Interpretation

• The output is the characteristic function of a half-space, bounded by the hyperplane $\mathbf{w}^\top \mathbf{x} + b = 0$

(Figure: the hyperplane $\mathbf{w}^\top \mathbf{x} + b = 0$ in the $(x_1, x_2)$ plane, with the $y = +1$ region on one side and the $y = -1$ region on the other.)


Representation

(Diagram: inputs $x_i$ with weights $w_i$ and a constant input $+1$ with bias weight $w_0$, feeding an activation unit.)

Linear Separability

• Given two sets of points, is there a perceptron which classifies them correctly?

(Figure: two point configurations in the $(x_1, x_2)$ plane; YES for the first, NO for the second.)

• True only if the sets are linearly separable


Finding the weights

• Obtain an expression for the gradient of the perceptron criterion
• Use stochastic gradient descent to minimize the error function
• A change in the weight vector is given by:

$\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_P(\mathbf{w}) = \mathbf{w}^{(\tau)} + \eta\, y_n \mathbf{x}_n$

where $(\mathbf{x}_n, y_n)$ is a misclassified point and $\eta$ the learning rate.

Stochastic gradient descent vs gradient descent: rather than computing the sum of the gradient contributions of all observations followed by a step in the negative gradient direction, a step is taken after each single observation is visited.


Perceptron training algorithm

initialize w
while TRUE do:
    m = 0
    foreach (x_i, y_i) do:
        if y_i (w · x_i) ≤ 0:
            w ← w + y_i x_i
            m = m + 1
    if m == 0: break

(Illustration adapted from Fig. 4.7, PRML, C. Bishop.)
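A possible NumPy implementation of this loop (a sketch: the function name, the unit learning rate and the epoch cap are assumptions; the bias is absorbed into the weights as before):

```python
import numpy as np

def train_perceptron(X, y, max_epochs=1000):
    """X: samples with a trailing column of ones (absorbed bias);
    y: labels in {-1, +1}. Returns the learned weight vector."""
    w = np.zeros(X.shape[1])              # initialize w
    for _ in range(max_epochs):
        m = 0                             # mistakes in this pass
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:      # misclassified (or on the boundary)
                w = w + y_i * x_i         # update step
                m += 1
        if m == 0:                        # a full pass without mistakes: done
            return w
    return w                              # reached the cap: possibly not separable
```

For linearly separable data the loop stops after a finite number of updates; otherwise it only stops because of the epoch cap.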

(The following slides repeat the algorithm while stepping through successive updates on a toy data set; illustrations adapted from Fig. 4.7, PRML, C. Bishop.)

Hands on example: The OR function

• 03_perceptron.ipynb

(Figure: the four OR inputs in the unit square; the point (0, 0) is labelled −1 and the other three are labelled +1.)

Hands on example: The OR function

(Worksheet: a table with columns X0, X1, X2, b, W1, W2, y, activation and m, to be filled in while running the algorithm by hand.)
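A sketch of what the notebook works through, using the update rule above on the OR truth table (the 0/1 outputs are mapped to labels in {−1, +1}; starting from zero weights is an assumption):

```python
import numpy as np

# OR truth table; a trailing 1 in each row absorbs the bias term.
X = np.array([[0, 0, 1],
              [0, 1, 1],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)
y = np.array([-1, 1, 1, 1])            # OR output, with 0 mapped to -1

w = np.zeros(3)
converged = False
while not converged:
    converged = True
    for x_i, y_i in zip(X, y):
        if y_i * (w @ x_i) <= 0:       # misclassified point
            w = w + y_i * x_i          # perceptron update
            converged = False

print(w, np.sign(X @ w))               # one separating w; predictions match y
```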


Solution


Perceptron convergence theorem

• If there exists an exact solution (i.e. the data is linearly separable), then the perceptron algorithm is guaranteed to converge to an exact solution in a finite number of steps.

• However:
  • The number of steps to convergence might be large
  • Until convergence is achieved, it is not possible to distinguish a non-separable problem from one that is merely slow to converge

What if we use a different initialization value?

Question: These are specific examples. What is the general set of inequalities that must be satisfied for an OR perceptron?

Other logic functions

Exercises to complete in the notebook.

Perceptron Limitations

A perceptron cannot generate the XOR function:

X1 X2 | XOR
0  0  | 0
0  1  | 1
1  0  | 1
1  1  | 0

(Figure: the four XOR points in the $(x_1, x_2)$ plane; no single line separates the two classes.)
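A quick check of this limitation (a sketch; the pass cap is needed because the loop would otherwise never terminate): the same update rule applied to the XOR table never completes a mistake-free pass.

```python
import numpy as np

X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([-1, 1, 1, -1])             # XOR, with 0 mapped to -1

w = np.zeros(3)
for epoch in range(1000):                # cap the passes: XOR is not separable
    mistakes = 0
    for x_i, y_i in zip(X, y):
        if y_i * (w @ x_i) <= 0:
            w = w + y_i * x_i
            mistakes += 1
    if mistakes == 0:
        break
print(mistakes)                          # stays > 0: no separating hyperplane exists
```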

[Minsky 1969] → The AI winter


Perceptron Limitations

• The algorithm does not converge when the data are not separable
• When the data are separable, there are many solutions, and which one is found depends on the starting values
• The "finite" number of steps can be very large

Some history

Source: C. Bishop - PRML


Recap

• We introduced the perceptron algorithm, a linear classifier
• We saw that it guarantees a solution (convergence in a finite number of steps) for separable data
• But we also saw that it has numerous limitations

Neural Networks


Why neural networks? A note on history

• The term neural network has its origins in attempts to find mathematical representations of information processing in biological systems

• From Bishop: it has been used very broadly to cover a wide range of different models, many of which have been the subject of exaggerated claims regarding their biological plausibility

• They are efficient nonlinear models for statistical pattern recognition


Motivation

• Recall the first lecture on linear models
• We saw that adding features could give a better fit of the model:

$\boldsymbol{\phi}(x) = (1, x, x^2, \ldots, x^n)$

• $\phi_j$: basis function
• Model:

$y(x, \mathbf{w}) = \sum_j w_j \phi_j(x)$

Motivation

Revisit: 01_linear_models.ipynb
• We also saw that choosing the right set of features was challenging
• Goal: Make the basis functions $\phi_j(x)$ depend on parameters, and allow these parameters to be adjusted along with the coefficients $\{w_j\}$ during training
• How? Neural networks

(Figure from the linear models lecture: polynomial fits of different degrees; which one is the good value for n?)

Feed forward networks a.k.a. the multilayer perceptron (MLP)

• Basic neural network model: a series of functional transformations
• Step 1: Construct M linear combinations of the input variables $\mathbf{x} \in \mathbb{R}^D$, called activations:

$a_j = \sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}$

• The superscript (1) indicates parameters of the first layer of the network
• $j = 1, \ldots, M$: index of the linear combinations; $i = 1, \ldots, D$: index over the dimensions of the input $\mathbf{x}$; $w_{ji}^{(1)}$: weights; $w_{j0}^{(1)}$: biases

Feed forward networks a.k.a. the multilayer perceptron (MLP)

• Step 2: Transform each activation using a differentiable, nonlinear activation function $h(\cdot)$:

$z_j = h(a_j)$

• $h(\cdot)$ is generally chosen to be a sigmoidal function
• $z_j$ corresponds to the output of a basis function of our model. Recall: $y(x, \mathbf{w}) = \sum_j w_j \phi_j(x)$
• In the context of neural networks, the $z_j$ are called hidden units

Feed forward networks a.k.a. the multilayer perceptron (MLP)

• Step 3: The $z_j$ are again linearly combined to give output unit activations:

$a_k = \sum_{j=1}^{M} w_{kj}^{(2)} z_j + w_{k0}^{(2)}$

• The superscript (2) indicates parameters of the second layer of the network
• $k = 1, \ldots, K$: output index, K being the number of outputs; $w_{kj}^{(2)}$: weights; $w_{k0}^{(2)}$: biases

Feed forward networks a.k.a. the multilayer perceptron (MLP)

• Step 4: The $a_k$ are transformed using an activation function to give a set of network outputs $y_k$
• The choice of activation function follows the same considerations as for linear models
• For regression: identity function
• Common activation functions for classification (written out in the sketch below):
  • Sigmoid: $\sigma(a) = 1/(1 + e^{-a})$
  • Tanh: $\tanh(a) = \dfrac{e^{a} - e^{-a}}{e^{a} + e^{-a}}$
  • Hinge or ReLU: $h(a) = \max(a, 0)$
  • Softmax (multiclass): $h(a_k) = \dfrac{e^{a_k}}{\sum_{k'} e^{a_{k'}}}$
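The activation functions above, written out as a small sketch (the softmax shift by the maximum is just for numerical stability):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def tanh(a):
    return np.tanh(a)

def relu(a):
    return np.maximum(a, 0.0)

def softmax(a):
    e = np.exp(a - np.max(a))     # subtract the max for numerical stability
    return e / e.sum()
```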

Feed forward networks a.k.a. MLP: final expression

• Combining all of the above, and using a sigmoidal output unit activation function:

$y_k(\mathbf{x}, \mathbf{w}) = \sigma\!\left(\sum_{j=1}^{M} w_{kj}^{(2)}\, h\!\left(\sum_{i=1}^{D} w_{ji}^{(1)} x_i + w_{j0}^{(1)}\right) + w_{k0}^{(2)}\right)$

Feed forward networks / MLP — Interpretation: network diagram

• Forward propagation of information through the network

(Figure: two-layer network diagram, Fig. 5.1, PRML, C. Bishop.)

Simplifying notation: absorbing the biases

• As with linear models, the bias parameters can be absorbed into the set of weight parameters by adding a dummy variable $x_0 = 1$:

$a_j = \sum_{i=0}^{D} w_{ji}^{(1)} x_i$

• In the second layer, with $z_0 = 1$:

$a_k = \sum_{j=0}^{M} w_{kj}^{(2)} z_j$

Feed forward networks a.k.a. MLP: simplified final expression

• The overall network function now becomes

$y_k(\mathbf{x}, \mathbf{w}) = \sigma\!\left(\sum_{j=0}^{M} w_{kj}^{(2)}\, h\!\left(\sum_{i=0}^{D} w_{ji}^{(1)} x_i\right)\right)$

Compare with the linear model $y(x, \mathbf{w}) = \sum_j w_j \phi_j(x)$.
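A minimal NumPy sketch of this forward computation, with the dummy bias units appended at the end of the input and hidden vectors (the shapes, the choice of tanh for h and the random weights are assumptions):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, W1, W2, h=np.tanh, out=sigmoid):
    """Two-layer network: x has shape (D+1,) with a trailing 1 (absorbed bias),
    W1 has shape (M, D+1), W2 has shape (K, M+1)."""
    a = W1 @ x                     # first-layer activations a_j
    z = np.append(h(a), 1.0)       # hidden units z_j, plus the dummy bias unit
    return out(W2 @ z)             # network outputs y_k

# Hypothetical sizes: D = 2 inputs, M = 3 hidden units, K = 1 output
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 3))
W2 = rng.normal(size=(1, 4))
print(mlp_forward(np.array([0.5, -1.0, 1.0]), W1, W2))
```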

Multilayer perceptron: interpretation

• Two stages of processing, each of which resembles a perceptron:

$y_k(\mathbf{x}, \mathbf{w}) = \sigma\!\left(\sum_{j=0}^{M} w_{kj}^{(2)}\, h\!\left(\sum_{i=0}^{D} w_{ji}^{(1)} x_i\right)\right)$

(Adapted from Fig. 5.1, PRML, C. Bishop.)

Neuron: back to features

• A neuron can be seen as a feature map of the form

$\phi_j(\mathbf{x}) = h\!\left(\sum_i w_{ji} x_i\right)$

• Therefore, each node in the network can be interpreted as a feature variable

• By optimizing the weights $\{w\}$ we are doing feature selection

• Pre-trained networks: the features resulting from optimization are useful for many problems

• They need a lot of data

Network training: Backpropagation

• We cannot use the training algorithm from the perceptron because we don’t know the “correct” outputs of the hidden units

• Strategy: Apply the chain rule to differentiate composite functions

• Refresher: if $z = f(g(x))$, then $z' = f'(g(x))\, g'(x)$, or, in Leibniz's notation,

$\dfrac{dz}{dx} = \dfrac{dz}{dy} \cdot \dfrac{dy}{dx}$

Deriving gradient descent for MLP

• See the board notes for a simpler derivation of the backpropagation algorithm

Deriving gradient descent for MLP

• Error function:

$E(\mathbf{w}) = \sum_n E_n(\mathbf{w})$

• We will estimate $\nabla E_n(\mathbf{w})$
• Let us consider a simple linear model with outputs $y_k$:

$y_k = \sum_i w_{ki} x_i$

• The error function for a particular input sample n will be

$E_n = \frac{1}{2} \sum_k (y_{nk} - t_{nk})^2$

Deriving gradient descent for MLP

• The gradient of this error function w.r.t. a weight $w_{ji}$:

$\dfrac{\partial E_n}{\partial w_{ji}} = (y_{nj} - t_{nj})\, x_{ni}$

• Interpretation: the product of an error signal $(y_{nj} - t_{nj})$, associated to the output end of the link, and the variable $x_{ni}$, associated to the input end
• Similar to the expression obtained for logistic regression when using the sigmoid function

Refresher: exercise proposed on slide 27 (annotated slides).

Refresher: forward propagation

• Let the activation of each unit in the network be denoted as

$a_j = \sum_i w_{ji} z_i$

with $z_i$ the activation (or input) of a unit that sends a connection to unit j, and $w_{ji}$ the weight associated to that connection, and

$z_j = h(a_j)$

• These are composite functions, so let's use the chain rule to estimate the derivative of the error

Deriving gradient descent for MLP • Applying the chain rule

$\dfrac{\partial E_n}{\partial w_{ji}} = \dfrac{\partial E_n}{\partial a_j} \dfrac{\partial a_j}{\partial w_{ji}}, \qquad \delta_j \equiv \dfrac{\partial E_n}{\partial a_j}$

• Since $a_j = \sum_i w_{ji} z_i$, then $\dfrac{\partial a_j}{\partial w_{ji}} = z_i$
• All together:

$\dfrac{\partial E_n}{\partial w_{ji}} = \delta_j z_i$

Same form as before: an error signal times the variable at the input end of the weight.

How to estimate δ? • For the output units, we did it already:

$\delta_k = y_k - t_k$

• For the hidden units, we resort to the chain rule again:

$\delta_j \equiv \dfrac{\partial E_n}{\partial a_j} = \sum_k \dfrac{\partial E_n}{\partial a_k} \dfrac{\partial a_k}{\partial a_j}$

• Can we obtain an expression for it?

How to estimate δ for hidden units?

$\delta_j = \dfrac{\partial E_n}{\partial a_j} = \sum_k \dfrac{\partial E_n}{\partial a_k} \dfrac{\partial a_k}{\partial a_j} = \sum_k \delta_k \dfrac{\partial a_k}{\partial a_j}$

Solution on the next slide, but try to do it on your own. Useful facts:

$z_j = h(a_j), \qquad a_k = \sum_j w_{kj} z_j$

First, note that

$\delta_k = \dfrac{\partial E_n}{\partial a_k}$, so $\delta_j = \sum_k \dfrac{\partial E_n}{\partial a_k} \dfrac{\partial a_k}{\partial a_j} = \sum_k \delta_k \dfrac{\partial a_k}{\partial a_j}$

Now let us find an expression for $\dfrac{\partial a_k}{\partial a_j}$. We have $a_k = \sum_j w_{kj} z_j$ and, directly from the cheat sheet, $z_j = h(a_j)$. The derivative amounts to applying the chain rule:

$\dfrac{\partial a_k}{\partial a_j} = w_{kj}\, h'(a_j)$

Plugging into the original expression:

$\delta_j = h'(a_j) \sum_k w_{kj}\, \delta_k$

Backpropagation formula

$\delta_j = h'(a_j) \sum_k w_{kj}\, \delta_k$

(Figure 5.7, Bishop, PRML: forward propagation computes the activations, while the δ's are propagated backwards.)

Backpropagation algorithm

1. For an input vector $\mathbf{x}_n$ to the network, do a forward pass using

   $a_j = \sum_i w_{ji} z_i, \qquad z_j = h(a_j)$

   to find the activations of all hidden and output units.
2. Evaluate $\delta_k = y_k - t_k$ for the output units.
3. Backward pass the δ's to obtain the $\delta_j$ for all the hidden units, using $\delta_j = h'(a_j) \sum_k w_{kj}\, \delta_k$.
4. Obtain the required derivatives using $\dfrac{\partial E_n}{\partial w_{ji}} = \delta_j z_i$.
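A compact sketch of these four steps for a single sample, assuming a two-layer network with tanh hidden units and linear (identity) outputs, so that $\delta_k = y_k - t_k$; the shapes follow the forward-pass sketch above:

```python
import numpy as np

def backprop(x, t, W1, W2, h=np.tanh, h_prime=lambda a: 1.0 - np.tanh(a) ** 2):
    """Gradients of E_n = 1/2 ||y - t||^2; x carries a trailing bias input,
    W1 has shape (M, D+1), W2 has shape (K, M+1) with the bias in the last column."""
    # 1. Forward pass
    a = W1 @ x                          # a_j = sum_i w_ji x_i
    z = np.append(h(a), 1.0)            # z_j = h(a_j), plus the dummy bias unit
    y = W2 @ z                          # identity output units
    # 2. Output deltas
    delta_k = y - t                     # delta_k = y_k - t_k
    # 3. Backward pass to the hidden deltas (bias unit excluded)
    delta_j = h_prime(a) * (W2[:, :-1].T @ delta_k)
    # 4. Required derivatives
    grad_W2 = np.outer(delta_k, z)      # dE_n/dw_kj = delta_k z_j
    grad_W1 = np.outer(delta_j, x)      # dE_n/dw_ji = delta_j x_i
    return grad_W1, grad_W2
```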


Backpropagation algorithm: DIY

• Read Section 5.3.2 from Bishop for a concrete example
• We have derived a general form that covers any error function, activation function and network topology

• Obtain expression for the backpropagation algorithm when using cross-entropy error function (exercise 11.3 from ESL)


Properties: Universality

• MLPs are universal Boolean functions
  • They can compute any Boolean function

• MLPs are universal classification functions

• MLPs are universal approximators
  • They can actually compose arbitrary functions in any number of dimensions


MLPs are Universal Boolean Functions

• The perceptron could not solve the XOR.
• If the MLP is a universal Boolean function, it should be able to implement an XOR.
• How?

A truth table shows all input combinations for which the output is 1:

X1 X2 | Y
0  0  | 0
0  1  | 1
1  0  | 1
1  1  | 0

We express the function in disjunctive normal form: $Y = \bar{X}_1 X_2 + X_1 \bar{X}_2$

XOR function: $Y = \bar{X}_1 X_2 + X_1 \bar{X}_2$

(* Bias omitted. The following slides build the corresponding network incrementally.)

XOR function: $Y = \bar{X}_1 X_2 + X_1 \bar{X}_2$

(Figure: a first layer of AND units, one per term of the disjunctive normal form, feeding a single OR output unit. Bias omitted.)

Any truth table can be expressed in this manner.

Exercise: Find weights for the XOR

(Diagram: inputs X1, X2 and a constant +1 feed two hidden units Y1 and Y2, which together with a constant +1 feed the output unit Y.)

Step 1: Write down truth tables

X1 X2 | Y1        X1 X2 | Y2
0  0  |           0  0  |
0  1  |           0  1  |
1  0  |           1  0  |
1  1  |           1  1  |

Y1 Y2 | Y
0  0  |
0  1  |
1  0  |
1  1  |

(Output columns to be filled in during the exercise; see the next slides.)

Step 2: Write general expressions — output unit Y

$Y = h(w_b + w_1 Y_1 + w_2 Y_2)$

Y1 Y2 | Y | condition
0  0  | 0 | $w_b \le 0$
0  1  | 1 | $w_b + w_2 > 0$
1  0  | 1 | $w_b + w_1 > 0$
1  1  | 1 | $w_b + w_1 + w_2 > 0$

One solution: $w_b = -3$, $w_1 = 4$, $w_2 = 4$

** Layer index is being omitted

Step 2: Write general expressions — hidden unit Y2

$Y_2 = h(w_b + w_1 X_1 + w_2 X_2)$

X1 X2 | Y2 | condition
0  0  | 0  | $w_b \le 0$
0  1  | 0  | $w_b + w_2 \le 0$
1  0  | 1  | $w_b + w_1 > 0$
1  1  | 0  | $w_b + w_1 + w_2 \le 0$

One solution: $w_b = -3$, $w_1 = 4$, $w_2 = -5$

** Layer index is being omitted

Step 2: Write general expressions — hidden unit Y1

$Y_1 = h(w_b + w_1 X_1 + w_2 X_2)$

X1 X2 | Y1 | condition
0  0  | 0  | $w_b \le 0$
0  1  | 1  | $w_b + w_2 > 0$
1  0  | 0  | $w_b + w_1 \le 0$
1  1  | 0  | $w_b + w_1 + w_2 \le 0$

One solution: $w_b = -3$, $w_1 = -5$, $w_2 = 4$

** Layer index is being omitted

Result: Find weights for the XOR

(Network diagram with the weights found above:)

$Y_1 = h(-3 - 5 X_1 + 4 X_2)$
$Y_2 = h(-3 + 4 X_1 - 5 X_2)$
$Y = h(-3 + 4 Y_1 + 4 Y_2)$

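A quick check of this reading of the diagram (a sketch with a hard threshold unit; the weight-to-input assignment follows the reconstruction above rather than values read verbatim from the figure):

```python
import numpy as np

def step(a):                       # threshold unit: 1 when the activation is positive
    return (a > 0).astype(int)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])

y1 = step(-3 - 5 * X[:, 0] + 4 * X[:, 1])   # computes NOT(X1) AND X2
y2 = step(-3 + 4 * X[:, 0] - 5 * X[:, 1])   # computes X1 AND NOT(X2)
y = step(-3 + 4 * y1 + 4 * y2)              # OR of the two hidden units

print(y)                                     # [0 1 1 0] -> XOR
```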

What to do for more complex functions?

• Karnaugh Maps

(Figure: a Karnaugh map grouping adjacent 1-cells of the truth table to obtain a reduced sum-of-products expression.)

Drawback: an MLP built this way can represent a given function only if it is sufficiently wide.

MLP as universal function approximation

• A feed-forward network with at least one hidden layer containing a finite number of neurons can approximate continuous functions on compact subsets of $\mathbb{R}^d$, under mild assumptions on the activation function (see the sketch below)

• However, the key problem is how to find suitable parameter values given a set of training data

• Proof by G. Cybenko, 1989
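As a small illustration of this property (a sketch using scikit-learn's MLPRegressor; the target function, layer size and iteration count are arbitrary choices, and finding good parameters is exactly the hard part mentioned above):

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X).ravel()                          # a smooth nonlinear target

# One hidden layer with enough units can, in principle, approximate this well.
net = MLPRegressor(hidden_layer_sizes=(50,), activation='tanh',
                   max_iter=5000, random_state=0).fit(X, y)
print(net.score(X, y))                         # training R^2, typically close to 1
```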


MLP Intuitive Potential

No hidden layer: half-spaces

One hidden layer: convex sets (intersections of half-spaces)

Two hidden layers: concave and non-connected sets (unions of intersections of half-spaces)

Summary on MLP

• Advantages
  • Very general, can be applied in many situations
  • Powerful according to theory
  • Efficient according to practice

• Drawbacks
  • Training is often slow
  • The choice of the optimal number of layers and neurons is difficult
  • Little understanding of the actual model

Deep Learning

LeNet-5: Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition (1998)

Recap

• We introduced feedforward networks, aka the multilayer perceptron

• We introduced the backpropagation algorithm which is the mechanism to train feedforward networks

• We saw the strengths but also the limitations of MLPs

• Deep Learning course (spring term) if you want to learn about more powerful neural network architectures


What I have not covered yet

• Some other limitations: problems associated with training


Further reading and useful material

Source                                         | Chapters
The Elements of Statistical Learning           | Sec. 4.5, Ch. 11
Pattern Recognition and Machine Learning       | Sec. 4.1.7, Ch. 5
Rosenblatt's original article – The Perceptron | --

Warning: Notation might vary among the different sources


From the first lecture

Deep learning


Project definition: What I expect from you

• Able to identify a problem that can be solved using ML tools
• Frame it correctly: supervised or unsupervised, regression, classification, density estimation…
• Able to establish reasonable objectives
  • Not too easy
  • Not so difficult that it cannot be completed in the given time frame
• Able to follow instructions
  • Submit via Moodle
  • Work in pairs, or talk to me to agree on exceptions
• Able to produce a readable document
