MACHINE LEARNING & DATA MINING CS/CNS/EE 155

Deep Learning Part I

logistic regression

the logistic function:

P(Y = 1 \mid X) = \frac{1}{1 + e^{-(w^\top x + w_0)}}

[figure: the logistic function P(Y = 1 \mid X) as a function of X, rising from 0 to 1]

can fit binary labels Y \in \{0, 1\}
w^\top x + w_0 = 0 defines the boundary between the classes

in higher dimensions, this is a hyperplane

[figure: two classes, Y = 0 and Y = 1, in the (X_1, X_2) plane, separated by a linear boundary]

additional features in X result in additional weights in w
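As a concrete illustration, here is a minimal NumPy sketch of the prediction rule above; the weight values and data are made up for the example, not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    # logistic function: maps a real-valued score to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, w0):
    # P(Y = 1 | X) = sigmoid(w^T x + w_0), applied row-wise to X
    return sigmoid(X @ w + w0)

# hypothetical 2-feature example with placeholder weights
w = np.array([1.5, -2.0])
w0 = 0.5
X = np.array([[0.2, 0.1], [3.0, -1.0]])
print(predict_proba(X, w, w0))        # probabilities of class 1
print(predict_proba(X, w, w0) > 0.5)  # decision boundary at w^T x + w_0 = 0
```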

if Y \in \{0, 1\}, with P(Y = 1 \mid X) = \frac{1}{1 + e^{-(w^\top x + w_0)}}, use binary cross-entropy:

\mathcal{L} = -\sum_{i=1}^{N} \left[ y^{(i)} \log P(y_i = 1 \mid x^{(i)}) + (1 - y^{(i)}) \log\left(1 - P(y_i = 1 \mid x^{(i)})\right) \right]

when y_i = 1, the first term pushes the output close to 1
when y_i = 0, the second term pushes the output close to 0
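A minimal NumPy sketch of this loss; the clipping constant is a standard implementation detail, not something stated on the slides, and the labels and predictions below are made up.

```python
import numpy as np

def binary_cross_entropy(y, p, eps=1e-12):
    # y: labels in {0, 1}; p: predicted P(y = 1 | x) for each example
    # returns the negative log-likelihood of the labels under the predictions
    p = np.clip(p, eps, 1.0 - eps)   # avoid log(0)
    return -np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

# toy example
y = np.array([1, 0, 1])
p = np.array([0.9, 0.2, 0.6])
print(binary_cross_entropy(y, p))
```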

if Y \in \{-1, +1\}, with P(y \mid x) = \frac{1}{1 + e^{-y(w^\top x + w_0)}}, use the logistic loss function:

\mathcal{L} = \sum_{i=1}^{N} \log\left(1 + e^{-y^{(i)}(w^\top x^{(i)} + w_0)}\right)

this encourages w^\top x^{(i)} + w_0 and y^{(i)} to have the same sign, with w^\top x^{(i)} + w_0 large in magnitude

how do we extend logistic regression to handle multiple classes? y \in \{1, \ldots, K\}

approach I: split the points into one-vs.-rest groups and train a model on each split

[figure: the same (X_1, X_2) data shown once per one-vs.-rest split]

approach II: train one model on all classes simultaneously

P(Y = k \mid X) = \frac{e^{w_k^\top x + w_{0,k}}}{\sum_{k'=1}^{K} e^{w_{k'}^\top x + w_{0,k'}}}

multi-class logistic regression:

P(Y = k \mid X) = \frac{e^{w_k^\top x + w_{0,k}}}{\sum_{k'=1}^{K} e^{w_{k'}^\top x + w_{0,k'}}}

assume probabilities of the (log-linear) form P(Y = k) = \frac{1}{Z} e^{w_k^\top x + w_{0,k}}

Z is the 'partition function,' which normalizes the probabilities: probabilities must sum to one

\sum_{k=1}^{K} P(Y = k) = \frac{1}{Z} \sum_{k=1}^{K} e^{w_k^\top x + w_{0,k}} = 1

therefore Z = \sum_{k=1}^{K} e^{w_k^\top x + w_{0,k}}
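A small NumPy sketch of this computation; subtracting the maximum score is a standard numerical-stability trick, not something from the slides, and the weights below are placeholders.

```python
import numpy as np

def softmax_probs(x, W, w0):
    # scores s_k = w_k^T x + w_{0,k} for each of the K classes
    s = W @ x + w0            # shape (K,)
    s = s - np.max(s)         # stability: does not change the result
    e = np.exp(s)
    Z = np.sum(e)             # partition function
    return e / Z              # P(Y = k | X), sums to one

# hypothetical 3-class, 2-feature example with made-up weights
W = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
w0 = np.zeros(3)
x = np.array([0.5, -0.2])
p = softmax_probs(x, W, w0)
print(p, p.sum())   # probabilities and their sum (1.0)
```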

logistic regression is a linear classifier: a linear scoring function is passed through the non-linear logistic function to give a probability output

it often works well for simple data distributions, but breaks down when confronted with data distributions that are not linearly separable

[figure: AND, OR, and XOR in the (X_1, X_2) plane; AND and OR are linearly separable, XOR is not linearly separable]

to tackle non-linear data distributions with a linear approach, we need to turn the task into a linear problem: use a set of non-linear features in which the data are linearly separable

one approach: use a set of pre-defined non-linear transformations

XOR example: X_1, X_2 \rightarrow X_1, X_2, X_1 X_2
a linear decision boundary in the new feature space corresponds to a hyperbolic decision boundary in the original (X_1, X_2) space

to tackle non-linear data distributions with a linear approach, we need to turn the task into a linear problem: use a set of non-linear features in which the data are linearly separable

another approach: logistic regression outputs a non-linear transformation, so use multiple stages of logistic regression

XOR example: X_1, X_2 \rightarrow X_1 \wedge X_2, X_1 \vee X_2
a single linear decision boundary in the new features corresponds to multiple linear decision boundaries in the original space

XOR: \neg(X_1 \wedge X_2) \wedge (X_1 \vee X_2)
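A minimal sketch of this two-stage construction with hand-picked weights (the specific weight values are illustrative choices, not from the slides): the first stage computes soft versions of AND and OR of the inputs, and the second stage combines them into XOR.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def xor_two_stage(x1, x2):
    # stage 1: soft AND and soft OR of the inputs (hand-picked weights)
    h_and = sigmoid(10 * x1 + 10 * x2 - 15)   # close to 1 only if both inputs are 1
    h_or  = sigmoid(10 * x1 + 10 * x2 - 5)    # close to 1 if at least one input is 1
    # stage 2: OR AND NOT(AND)  ==  XOR
    return sigmoid(10 * h_or - 10 * h_and - 5)

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), round(xor_two_stage(x1, x2)))
```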

we used multiple stages of linear classifiers to create a non-linear classifier

as the number of stages increases, so too does the expressive power of the resulting non-linear classifier

depth: the number of stages of processing

the point: with enough stages of linear/non-linear operations (depth), we can learn a good set of non-linear features to linearize any non-linear problem

basic operation: logistic regression

[figure: an artificial neuron: input features X_1, \ldots, X_p are multiplied by weights, summed, and passed through a non-linearity to give an output feature]

multiple operations

[figure: the input features X_1, \ldots, X_p feeding several neurons in parallel: a layer of artificial neurons]

multiple stages of operations

[figure: several layers of neurons connected in sequence: an artificial neural network]

notation

[figure: a small network diagram, referenced by the notation below]

X^0: units at the input layer
W^1: weights from the input layer to the first layer
S^1: summations at the first layer
\sigma: the non-linearity applied to the summations
X^1: units at the first layer
and similarly for later layers

notation: bias unit

add a bias unit (a constant 1) to the input layer; with N^0 the number of units in layer 0 and W^1_{1,0} the bias weight,

S^1_1 = W^1_{1,0} + \sum_{i=1}^{N^0} W^1_{1,i} X^0_i

absorb the bias unit into X^0 by setting X^0_0 = 1, giving the vectorized form (dot product)

S^1_1 = W^1_1 X^0

stacking all units of the first layer gives the fully vectorized form (matrix multiplication)

S^1 = W^1 X^0

X^1 = \sigma(S^1), where \sigma is the element-wise non-linearity

append a bias unit with its weights to X^1 and proceed to the next layer
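A minimal NumPy sketch of this one-layer computation, using the sigmoid as the non-linearity; the layer sizes and random weights are just for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_forward(X_prev, W):
    # X_prev: activations of the previous layer, shape (N_prev,)
    # W: weight matrix, shape (N_curr, N_prev + 1); first column holds the bias weights
    X_prev = np.concatenate(([1.0], X_prev))  # absorb the bias unit: X_0 = 1
    S = W @ X_prev                            # summations S = W X
    return sigmoid(S)                         # units X = sigma(S), element-wise

# illustrative sizes: 3 input features, 4 units in the first layer
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3 + 1))
X0 = np.array([0.5, -1.0, 2.0])
X1 = layer_forward(X0, W1)
print(X1.shape, X1)
```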

matrix form

over the whole dataset, X^0 is an N^0 \times N matrix (N^0 = number of features, N = number of data points), with corresponding labels Y

S^1 = W^1 X^0, with S^1 of size N^1 \times N and W^1 of size N^1 \times N^0

note: appending biases to W^1 and bias units to X^0 changes N^0 \rightarrow N^0 + 1

X^1 = \sigma(S^1), applied element-wise, also of size N^1 \times N

matrix form

X^\ell = \sigma(W^\ell \, \sigma(W^{\ell-1} \, \sigma(W^{\ell-2} \cdots X^0)))

the input X^0 sits somewhere back inside the nested composition

it's just a (deep, non-linear) function!

note: bias appending omitted for clarity
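A sketch of this composed function as a loop over layers; the layer sizes and random weights are placeholders, and biases are omitted as noted above.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X0, weights):
    # X0: data matrix of shape (N0, N) -- features in rows, data points in columns
    # weights: list [W1, ..., WL]; biases omitted for clarity, as on the slides
    X = X0
    for W in weights:
        X = sigmoid(W @ X)   # X^l = sigma(W^l X^{l-1})
    return X                 # X^L, the network output

# illustrative 3-layer network on 5 data points with 4 input features
rng = np.random.default_rng(0)
sizes = [4, 6, 6, 2]
weights = [rng.normal(size=(sizes[l + 1], sizes[l])) for l in range(len(sizes) - 1)]
XL = forward(rng.normal(size=(4, 5)), weights)
print(XL.shape)   # (2, 5)
```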

training

the final layer X^L is the output of the network
use a loss \mathcal{L}(X^L, Y) to compare X^L and Y
take gradients \nabla_{W^\ell} \mathcal{L} of the loss w.r.t. the weights at each level \ell
update the weights to decrease the loss, bringing X^L closer to Y:

W^\ell \leftarrow W^\ell - \alpha \nabla_{W^\ell} \mathcal{L}

how do we get the gradients?
1. define a loss function
2. use the chain rule to recursively take derivatives at each level, backward through the network

output: what is the best form in which to present the labels?

binary labels: Y \in \{0, 1\} (binary classification)

represent with a single output unit, and use binary logistic regression at the last layer:

X^L_1 \in [0, 1] gives P(Y = 1 \mid X), and P(Y = 0 \mid X) = 1 - P(Y = 1 \mid X)

categorical labels: Y \in \{1, \ldots, K\} (multi-class classification)

represent with a single output unit X^L_1 \in (-\infty, \infty) and regress to the correct class in \{1, \ldots, K\}?

No: in general, classes do not have a numerical relation; class K is not 'greater' than class K - 1

represent with multiple output units X^L_1, \ldots, X^L_{N^L} \in [0, 1] and regress to an encoding of the correct class (e.g. binary), with N^L \neq K?

No: correct output requires coordinated effort from the units, and the encoding induces arbitrary similarities between classes

represent with K output units and multi-class logistic regression (softmax) at the last layer:

X^L_k \in [0, 1] gives P(Y = k \mid X), \quad k = 1, \ldots, K

Yes: this captures the independence assumptions in the class structure and is a valid output

categorical labels: Y \in \{1, \ldots, K\} (multi-class classification)
represent with K output units, multi-class logistic regression at the last layer

one-hot vector encoding example, K = 5:
Y = 1: [1, 0, 0, 0, 0]
Y = 2: [0, 1, 0, 0, 0]
Y = 3: [0, 0, 1, 0, 0]
Y = 4: [0, 0, 0, 1, 0]
Y = 5: [0, 0, 0, 0, 1]

loss function

multi-class logistic regression gives the probability of x belonging to class k:

P(Y = k) = \frac{e^{w_k^\top x}}{\sum_{k'=1}^{K} e^{w_{k'}^\top x}}

softmax loss function (cross-entropy):

\mathcal{L}_i = -\log P(Y = y_i) = -\log \frac{e^{w_{y_i}^\top x^{(i)}}}{\sum_{k'=1}^{K} e^{w_{k'}^\top x^{(i)}}}

minimize the negative log probability of the correct class
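A NumPy sketch of this softmax cross-entropy loss for a single example; the class scores and label below are placeholders, and the max subtraction is a standard stability trick.

```python
import numpy as np

def softmax_cross_entropy(scores, y):
    # scores: array of K class scores w_k^T x; y: index of the correct class
    scores = scores - np.max(scores)          # numerical stability
    log_Z = np.log(np.sum(np.exp(scores)))    # log of the partition function
    return -(scores[y] - log_Z)               # -log P(Y = y)

scores = np.array([2.0, -1.0, 0.5])  # made-up scores for K = 3 classes
print(softmax_cross_entropy(scores, y=0))
```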

recall the binary case:

\mathcal{L} = -\sum_{i=1}^{N} \left[ y^{(i)} \log P(y_i = 1 \mid x^{(i)}) + (1 - y^{(i)}) \log\left(1 - P(y_i = 1 \mid x^{(i)})\right) \right]

the softmax loss is the extension of the binary case to multi-class

backpropagation

deep networks are just composed functions

X^L = \sigma(W^L \, \sigma(W^{L-1} \, \sigma(\cdots \sigma(W^1 X^0))))

the loss \mathcal{L}(X^L, Y) is a function of X^L, which is a function of each layer's weights W^\ell
therefore, we can find \nabla_{W^\ell} \mathcal{L} using the chain rule

recall: X^\ell = \sigma(S^\ell), \quad S^\ell = W^\ell X^{\ell-1}

at the output layer:

\frac{\partial \mathcal{L}}{\partial W^L} = \frac{\partial \mathcal{L}}{\partial X^L} \frac{\partial X^L}{\partial S^L} \frac{\partial S^L}{\partial W^L}

the first two terms together give \frac{\partial \mathcal{L}}{\partial S^L}

at the layer before:

\frac{\partial \mathcal{L}}{\partial W^{L-1}} = \frac{\partial \mathcal{L}}{\partial X^L} \frac{\partial X^L}{\partial S^L} \frac{\partial S^L}{\partial X^{L-1}} \frac{\partial X^{L-1}}{\partial S^{L-1}} \frac{\partial S^{L-1}}{\partial W^{L-1}}

the first four terms together give \frac{\partial \mathcal{L}}{\partial S^{L-1}}

at the layer before that:

\frac{\partial \mathcal{L}}{\partial W^{L-2}} = \frac{\partial \mathcal{L}}{\partial X^L} \frac{\partial X^L}{\partial S^L} \frac{\partial S^L}{\partial X^{L-1}} \frac{\partial X^{L-1}}{\partial S^{L-1}} \frac{\partial S^{L-1}}{\partial X^{L-2}} \frac{\partial X^{L-2}}{\partial S^{L-2}} \frac{\partial S^{L-2}}{\partial W^{L-2}}

backpropagation: general overview (dynamic programming)

compute the gradient of the loss w.r.t. W^L and store \frac{\partial \mathcal{L}}{\partial S^L}

for \ell = L-1, \ldots, 1:
    use the stored \frac{\partial \mathcal{L}}{\partial S^{\ell+1}} to compute the gradient of the loss w.r.t. W^\ell
    store \frac{\partial \mathcal{L}}{\partial S^\ell}
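A sketch of this dynamic-programming scheme for the sigmoid network from the earlier slides, using a squared-error loss as a stand-in (the slides do not fix a particular loss in this derivation); biases are again omitted.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X0, weights):
    # store all layer activations X^0, ..., X^L for use in the backward pass
    Xs = [X0]
    for W in weights:
        Xs.append(sigmoid(W @ Xs[-1]))
    return Xs

def backward(Xs, weights, Y):
    # illustrative loss: L = 0.5 * ||X^L - Y||^2
    grads = [None] * len(weights)
    XL = Xs[-1]
    dL_dS = (XL - Y) * XL * (1.0 - XL)          # dL/dS^L; sigma'(S) = X (1 - X)
    grads[-1] = dL_dS @ Xs[-2].T                # dL/dW^L = dL/dS^L (X^{L-1})^T
    for l in range(len(weights) - 2, -1, -1):
        # reuse the stored dL/dS^{l+1} to get dL/dS^l, then dL/dW^l
        dL_dS = (weights[l + 1].T @ dL_dS) * Xs[l + 1] * (1.0 - Xs[l + 1])
        grads[l] = dL_dS @ Xs[l].T
    return grads

rng = np.random.default_rng(0)
sizes = [4, 6, 2]
weights = [rng.normal(size=(sizes[i + 1], sizes[i])) for i in range(len(sizes) - 1)]
X0 = rng.normal(size=(4, 5))            # 5 data points
Y = rng.random(size=(2, 5))
grads = backward(forward(X0, weights), weights, Y)
print([g.shape for g in grads])         # gradient shapes match the weight shapes
```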

recall: X^\ell = \sigma(S^\ell), \quad S^\ell = W^\ell X^{\ell-1}

\frac{\partial X^\ell}{\partial S^\ell}: derivative of the non-linearity w.r.t. its input; depends on the non-linearity used

\frac{\partial S^\ell}{\partial X^{\ell-1}} = W^\ell

\frac{\partial S^\ell}{\partial W^\ell} = (X^{\ell-1})^\top

vanishing gradients

if we use saturating non-linearities, the derivative of the non-linearity \sigma'(S) will always be less than one; in a deep network, many of these terms are multiplied together, shrinking the gradient at early layers: the gradient 'vanishes'

to solve this issue, use non-saturating non-linearities, such as the rectified linear unit ReLU(S) = \max(0, S)

pro: keeps the gradient signal strong at early layers
con: partially linearizes the network, making it less expressive

rectified linear units (Glorot et al., 2011)

\mathrm{ReLU}(S) = \max(0, S)

\frac{d}{dS}\mathrm{ReLU}(S) = 0 \text{ for } S < 0, \qquad \frac{d}{dS}\mathrm{ReLU}(S) = 1 \text{ for } S > 0, \qquad \text{undefined at } S = 0
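A tiny NumPy sketch of the ReLU and the derivative used in backprop; picking 0 at S = 0, where the true derivative is undefined, is a common implementation convention rather than something from the slides.

```python
import numpy as np

def relu(S):
    # ReLU(S) = max(0, S), element-wise
    return np.maximum(0.0, S)

def relu_grad(S):
    # derivative: 0 for S < 0, 1 for S > 0; undefined at S = 0 (we choose 0 there)
    return (S > 0).astype(float)

S = np.array([-2.0, 0.0, 3.0])
print(relu(S), relu_grad(S))
```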

adversarial examples

[figure: adversarial examples (Szegedy et al., 2013; Goodfellow et al., 2014)]

regularization

we often want to penalize model complexity: deep networks are a powerful model class, which makes it easy for them to overfit. as in other ML methods, we can regularize by putting a penalty on the weights and adding this term to the loss function.

this can be achieved through L1 or L2 regularization of the network's weights:

\sum_{\ell=1}^{L} \|W^\ell\|_1 \quad \text{L1 regularization ('weight sparsity')} \qquad \sum_{\ell=1}^{L} \|W^\ell\|_2^2 \quad \text{L2 regularization ('weight decay')}
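A small sketch of adding these penalty terms to a loss; the regularization strength lam is a hypothetical hyperparameter, and the example weights are made up.

```python
import numpy as np

def l1_penalty(weights):
    # sum over layers of the element-wise absolute values: encourages weight sparsity
    return sum(np.sum(np.abs(W)) for W in weights)

def l2_penalty(weights):
    # sum over layers of the squared L2 norms: 'weight decay'
    return sum(np.sum(W ** 2) for W in weights)

def regularized_loss(data_loss, weights, lam=1e-3, kind="l2"):
    penalty = l2_penalty(weights) if kind == "l2" else l1_penalty(weights)
    return data_loss + lam * penalty

# toy usage
Ws = [np.array([[1.0, -2.0], [0.5, 0.0]])]
print(regularized_loss(1.23, Ws, lam=1e-3, kind="l1"))
```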

other forms of regularization: dropout

often, deep networks will learn entangled representations, in which the internal representation depends heavily on coordinated activity from multiple units.

this is referred to as ‘fragile co-adaptation,’ and often generalizes poorly

to encourage units to learn statistically independent features, randomly drop out a fraction of the units during each training iteration

something like an ‘internal ensemble’ of an exponential number of different models
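A minimal sketch of dropping a fraction of a layer's units during training; the drop probability p and layer shape are illustrative, and the test-time rescaling by the keep probability follows Srivastava et al., 2014 rather than being spelled out on the slide.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(X, p=0.5):
    # X: activations of a layer during training; p: probability of dropping a unit
    mask = rng.random(X.shape) >= p   # keep each unit with probability 1 - p
    return X * mask

def dropout_test(X, p=0.5):
    # at test time all units stay active; scale to match the expected training activation
    return X * (1.0 - p)

X = rng.normal(size=(4, 3))
print(dropout_train(X))   # a random fraction of units zeroed out
print(dropout_test(X))
```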

during test time, keep all units active (Srivastava et al., 2014)

[figure: features learned on MNIST digits without dropout (redundant features) vs. with dropout p = 0.5 (Srivastava et al., 2014)]

other forms of regularization: early stopping

stop training the model when the validation loss or error plateaus (stops decreasing)

prevents the model from overfitting to the training set

[figure: training and validation error vs. training iterations; stop training where the validation error plateaus while the training error keeps decreasing]
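A sketch of an early-stopping training loop; train_epoch and validation_error are hypothetical caller-supplied callables standing in for the actual training and evaluation code, and patience is an assumed hyperparameter.

```python
def train_with_early_stopping(train_epoch, validation_error, max_epochs=1000, patience=5):
    # train_epoch and validation_error are hypothetical callables supplied by the caller
    best_val, stale = float("inf"), 0
    for _ in range(max_epochs):
        train_epoch()                   # one pass over the training set
        val = validation_error()        # error on a held-out validation set
        if val < best_val:
            best_val, stale = val, 0
        else:
            stale += 1
            if stale >= patience:       # validation error has plateaued
                break
    return best_val
```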

optimization

gradient descent: feed in the entire dataset, calculate gradients for each weight, update the weights, repeat until convergence. with a large model and a large dataset, this will be an incredibly slow process.

gradient contributions from each data example are averaged

accurate gradient, but one epoch results in a single ‘step’

[figure: optimization landscape in weight space]


(mini-batch) stochastic gradient descent (SGD): feed in the dataset one batch at a time, calculate gradients for each weight from that batch, update the weights, repeat until convergence, randomly shuffling after each epoch

each batch provides a noisy estimate of the gradient

with an adequate batch size, this is often good enough to head in the right direction

noise in the gradient can actually help prevent getting stuck in local optima
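A sketch of the mini-batch SGD loop described above; compute_gradients is a hypothetical stand-in for backprop on one batch, and alpha, batch_size, and epochs are assumed hyperparameters.

```python
import numpy as np

def sgd(weights, X, Y, compute_gradients, alpha=0.1, batch_size=32, epochs=10):
    # X: data matrix with data points as columns; Y: matching labels
    rng = np.random.default_rng(0)
    N = X.shape[1]
    for _ in range(epochs):
        order = rng.permutation(N)                  # randomly shuffle after each epoch
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]   # one mini-batch
            grads = compute_gradients(weights, X[:, idx], Y[:, idx])  # noisy gradient estimate
            for W, g in zip(weights, grads):
                W -= alpha * g                      # W <- W - alpha * grad
    return weights
```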



training a deep neural network involves non-convex optimization
convex optimization: with a proper learning rate, guaranteed to converge to the global optimum
non-convex optimization: no guarantees, may converge to a local optimum

should we be worried about ending up in local optima? no, not really. as the number of weights grows, it tends to become easier to escape these local minima. they appear to be mostly saddle points, and most local minima are actually pretty good.



what makes a deep network non-convex?

it’s the non-linearities!

X^\ell = \sigma(W^\ell \, \sigma(W^{\ell-1} \, \sigma(W^{\ell-2} \cdots X^0)))

if we remove the non-linearities…

X^\ell = W^\ell W^{\ell-1} W^{\ell-2} \cdots X^0

X^\ell = W^{1 \ldots \ell} X^0

the network collapses down into a (convex) linear optimization problem

optimization techniques

vanilla stochastic gradient descent

W^\ell \leftarrow W^\ell - \alpha \nabla_{W^\ell} \mathcal{L}

stochastic gradient descent with momentum, with momentum coefficient \gamma \in (0, 1]:

\mu \leftarrow \gamma \mu + \alpha \nabla_{W^\ell} \mathcal{L}
W^\ell \leftarrow W^\ell - \mu

[figure: optimization paths for vanilla SGD vs. SGD with momentum]

the gradient step is influenced by previous gradient updates

this speeds up convergence immensely and prevents the optimization from bouncing back and forth too much. rule of thumb: momentum = 0.9
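A sketch of this momentum update for a single weight matrix; gamma = 0.9 follows the rule of thumb above, while alpha and the toy gradient values are made up.

```python
import numpy as np

def momentum_step(W, grad, mu, alpha=0.01, gamma=0.9):
    # mu accumulates an exponentially decaying sum of past gradient steps
    mu = gamma * mu + alpha * grad
    W = W - mu
    return W, mu

# toy usage with made-up values
W = np.zeros((2, 2))
mu = np.zeros_like(W)
grad = np.ones((2, 2))
for _ in range(3):
    W, mu = momentum_step(W, grad, mu)
print(W)
```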

(http://sebastianruder.com/optimizing-gradient-descent/index.html)

optimization techniques

problem: how do we set and anneal the learning rate? and why have the same learning rate for all parameters?

adaptive learning rate techniques:
- adagrad: shrink each parameter's learning rate according to the magnitude of the sum of its past gradients
- adadelta: similar to adagrad, but with exponentially decaying influence from past gradients
- RMSprop: similar idea to adadelta
- adam: similar to adadelta, but with an additional decaying influence from previous gradients (like momentum)

deep learning history (abridged)

1957   perceptron learning algorithm
       [figure: 'Mark I Perceptron at the Cornell Aeronautical Laboratory', hardware implementation of the first Perceptron (Source: Wikipedia / Cornell Library)]
       neural net winter
1980   neocognitron (Fukushima, 1980)
1986   backprop becomes popular
1989   convolutional neural networks (LeCun, 1998)
       neural net winter II
2006   unsupervised pre-training of deep networks (Teh et al., 2006)
2009   use of GPUs for training deep networks (NVIDIA)
2011   unsupervised learning of a cat filter from YouTube (https://googleblog.blogspot.com/2012/06/using-large-scale-brain-simulations-for.html)
2011   deep networks become state-of-the-art for speech recognition
2012   deep networks become state-of-the-art for object recognition (Krizhevsky et al., 2012)
2012-  deep learning boom: ResNets (He et al., 2015), Neural Turing Machine (Graves et al., 2014), deep generative models (Salimans et al., 2016), deep reinforcement learning (Mnih et al., 2015; Silver et al., 2016), etc.

recent neural network boom

deep learning is hot right now

what started this boom?

driving factors: hardware, data, software, research discoveries, results

who started this boom?

'the big three': geoff hinton, yann lecun, yoshua bengio …and others

(http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning-part-4/)

neural network winters

research in neural networks has followed a boom-bust cycle

technological breakthrough

increased excitement, research

A.I. hype we’ve finally solved intelligence! general A.I. is only 5-10 years away!

research doesn’t live up to expectations

funding dries up; most researchers move on to other pursuits

we're obviously in a boom period. is this time any different?

yes and no. predicting the future of A.I. research progress is notoriously difficult.

it’s possible that deep learning research will hit a wall, where either it becomes too computationally expensive for most individuals to do groundbreaking research or a new, better approach comes along

however, deep learning is now commercially successful. most large tech companies profit from it. so as long as it remains the state-of-the-art approach, there will be (funding for) basic research.

beware the hype

just because deep learning is booming, it doesn’t mean human-level A.I. is ‘just around the corner’

saying that something is 5-10 years away is almost always just pure speculation

if someone tries to tell you that deep learning works off of the same principles as the brain, tell them that they don’t know what they’re talking about

cool graphics, but highly inaccurate

does deep learning live up to the hype?

non-believers: deep learning is nothing more than a passing phase
- we don't have a great intuition for how or why it works
- models are often uninterpretable, a.k.a. 'black box'
- doesn't actually work like the brain

believers: deep learning is the dawn of general A.I.
- consistently beats all other methods on vision and NLP benchmarks
- learn your features instead of hand-coding them
- it's biologically inspired!

where does deep learning fit in?

deep learning is a useful method for approximating complicated, hierarchical functions. this makes it well-suited for many A.I. tasks

but ultimately it's just a tool for linearizing non-linear data. it doesn't replace other techniques; rather, it enhances them

still much work to be done in understanding these models:
- what do they learn?
- how do we train them more efficiently?
- architectural principles?

better methods will be developed eventually, but they will almost certainly involve:
- hierarchies
- learned features
- many parameters

when to use deep learning?

try a simple method first. deep learning requires compute and data; unless you have both, deep learning won't work.

deep learning works by hierarchically sectioning non-linear surfaces, i.e. it works best on non-linear data with hierarchical structure

why not other methods?

deep learning’s power is in its depth

most other methods are not capable of depth, or if they are, they are difficult to train

next lecture: convolutional neural networks
