Basic Classification Methods

The Essentials of Data Analytics and Machine Learning

[A guide for anyone who wants to learn practical machine learning using R]

Author: Dr. Mike Ashcroft Editor: Ali Syed

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc- sa/4.0/.

© 2016 Dr. Michael Ashcroft and Persontyle Limited


BASIC CLASSIFICATION METHODS (Module 9)

In this module we look at a number of basic classification methods. We first look at cases where the independent variables are real, and examine the following algorithms:

- Linear and Quadratic Discriminant Analysis

- The Perceptron Algorithm

- Logistic Regression

We then look at the case where both the independent and target variables are discrete, which we will term discrete-discrete classification.

Basic Classification Methods

We now turn to classification problems. Where regression problems sought to estimate real valued target (Y) variables, classification problems seek to estimate discrete (nominal) target variables. So we seek to estimate the category to which a new case belongs, based on our knowledge of the independent (or input or X) variables of that case.

In this module we look at a number of basic classification methods. We first look at cases where the independent variables are real, and examine the following algorithms:

- Linear and Quadratic Discriminant Analysis - The Perceptron Algorithm - Logistic Regression

We then look at the case where both the independent and target variables are discrete, which we will term discrete-discrete classification.

Misclassification Error

Since we no longer have real-valued variables as our target variables, we cannot use the error measures discussed in module 10. Instead, we will use misclassification error:

\[ MCE(f) = \frac{1}{N}\sum_{i=1}^{N} I\left(y_i \neq f(X_i)\right) \qquad (1) \]

Where N is the number of items in the data the model's performance is being evaluated on, and I is an indicator function which takes a proposition and returns 1 if that proposition is true, and 0 otherwise.

Obviously, the misclassification error is simply the proportion of cases that the model misclassified. It is the most common error score for classification problems. However, we will see other possible classification error scores in module 12.
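As a minimal illustration (with hypothetical vectors of actual and predicted classes), the misclassification error is just the proportion of mismatches:

> actual=c(1,2,2,3,1)
> predicted=c(1,2,3,3,3)
> mean(actual!=predicted)
[1] 0.4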

MSE for probabilistic classification

In fact, in binary cases the misclassification error of a classifier is equivalent to its mean square error, so long as the target variable’s classes are coded as 0 and 1. And in general it is valid to


calculate the mean square error of a classifier in terms of the probability it assigns to the actual class of a case:

\[ MSE(m) = \frac{1}{n}\sum_{i=1}^{n}\left(1 - \hat{P}(Y = y_i)\right)^2 \]

This can be particularly useful when evaluating the performance of models which are used as classifiers but which output a probability distribution over all classes rather than a simple estimated class. The difference between these two cases should be seen as analogous to that between a regression model which outputs a probability distribution over ℝ compared with one which simply provides point estimates.
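As a small sketch (with hypothetical predicted probabilities for three cases over three classes), this score can be computed by picking out the probability assigned to each case's actual class:

> probs=rbind(c(.7,.2,.1),c(.1,.6,.3),c(.3,.3,.4))
> actual=c(1,2,3)
> mean((1-probs[cbind(1:3,actual)])^2)
[1] 0.2033333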

Linear & Quadratic Discriminant Analysis

The first two classification algorithms we look at are linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). These make use of multivariate normal distributions. Essentially they function by fitting a multivariate normal distribution over each class and then examining the relative densities of these multiple distributions at points of interest. Where the two methods differ is in how they fit the parameters of the multivariate normal distributions.

Multivariate Normal Distributions

The multivariate normal distribution is the multivariate equivalent of the normal distribution discussed in module 8. It is parameterized by the mean vector μ and covariance matrix Σ. The mean vector, μ, is simply the mean of each of the variables:

\[ \boldsymbol{\mu} = \langle \mu_1, \ldots, \mu_n \rangle \qquad (2) \]

\[ \mu_i = E[X_i] \qquad (3) \]

We estimate μ_i using the sample mean x̄_i:

\[ \bar{x}_i = \frac{1}{N}\sum_{j=1}^{N} x_{ij} \qquad (4) \]

Where the sample data has N rows, and x_{ij} is the value for variable X_i in row j.

The covariance matrix Σ generalizes the notion of variance to multiple dimensions, where:


\[ \boldsymbol{\Sigma} = \begin{bmatrix} \sigma_{11} & \cdots & \sigma_{1n} \\ \vdots & \ddots & \vdots \\ \sigma_{n1} & \cdots & \sigma_{nn} \end{bmatrix} \qquad (5) \]

\[ \sigma_{ij} = cov(X_i, X_j) = E\left[(X_i - \mu_i)(X_j - \mu_j)\right] \qquad (6) \]

We estimate 횺 using the sample covariance matrix Q:

\[ \mathbf{Q} = \begin{bmatrix} q_{11} & \cdots & q_{1n} \\ \vdots & \ddots & \vdots \\ q_{n1} & \cdots & q_{nn} \end{bmatrix} \qquad (7) \]

\[ q_{ij} = \frac{1}{N-1}\sum_{k=1}^{N}\left(x_{ik} - \bar{x}_i\right)\left(x_{jk} - \bar{x}_j\right) \qquad (8) \]

Note that the ith element of the diagonal of the covariance matrix corresponds to the ordinary univariate variance of the variable X_i.

For reference, the probability density of the normal distribution is given by:

\[ \frac{1}{\sqrt{(2\pi)^k |\Sigma|}}\, e^{\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)} \qquad (9) \]

Where k is the number of variables.

Quadratic Discriminant Analysis (QDA)

QDA assumes that each class, i, is drawn from a multivariate normal distribution 𝒩(μ_i, Σ_i). For each class, we estimate the mean vector and covariance matrix of this distribution from the cases belonging to that class.

For this example, we will use the kmeans data from the sml package.


> library(sml)
> data(kmeans)
> head(kmeans)
          [,1]       [,2] [,3]
[1,] -2.080294 -0.7169352    1
[2,] -2.569492 -1.0970885    1
[3,] -1.670747 -0.3413764    1
[4,] -1.979666 -0.4501546    1
[5,] -1.948958 -1.0647235    1
[6,] -1.279791 -0.4383802    1

The input variables are columns one and two, and the target class variable is column 3. Let’s see what we are working with:

> plot(kmeans[,1:2],col=kmeans[,3])

We now estimate the parameters for the normal distributions associated with each class, using the mean and cov functions:

> mus=sapply(1:5,function(i) colMeans(kmeans[which(kmeans[,3]==i),1:2]))
> mus
          [,1]       [,2]        [,3]       [,4]      [,5]
[1,] -1.758795 0.06843362 -0.03256789  0.8862959 0.8366331
[2,] -0.572754 0.75716673 -0.29897861 -1.3024879 1.4170538
> sigmas=lapply(1:5,function(i) cov(kmeans[which(kmeans[,3]==i),1:2]))
> sigmas
[[1]]
           [,1]       [,2]
[1,] 0.10755842 0.07069203
[2,] 0.07069203 0.07954641

[[2]]
           [,1]       [,2]
[1,] 0.16242031 0.01819607
[2,] 0.01819607 0.08335855

[[3]]
            [,1]        [,2]
[1,]  0.05804393 -0.05676342
[2,] -0.05676342  0.08059121

[[4]]
            [,1]        [,2]
[1,] 0.012387898 0.002603444
[2,] 0.002603444 0.015393026

[[5]]
           [,1]       [,2]
[1,] 0.05485341 0.02018681
[2,] 0.02018681 0.02188345


Notice that the kmeans data is scaled and centered – so if you look at the covariance of the entire data the diagonal entries are ones.

We can have a look at the class densities using the plot3d, surface3d, and dmvnorm functions. The first two are in the rgl package. The last is in the mvtnorm package. We will also make use of the apply and outer functions. See the text box below for an analysis of what is going on in the complicated line!

> library(rgl)
> library(mvtnorm)
> # x and y give the plotting grid; they are not defined in the original text,
> # but something like the following will work:
> x=seq(-3,3,length.out=50); y=seq(-3,3,length.out=50)
> plot3d(cbind(kmeans[,1:2],rep(0,nrow(kmeans))),zlim=c(0,1),col=kmeans[,3])
> for (i in 1:5) surface3d(x,y,outer(x,y,function(a,b).1*apply(cbind(a,b),1,function(r)dmvnorm(r,mus[,i],sigmas[[i]]))),col=i)


Let’s look a little more closely at that one

You will have noticed that the example code can contain complicated commands. This is to expose you to 'real' R code. Normally it has been left to you to work through the commands and understand how they bring together multiple functions into compound statements. But in case this last example was too much, here is an analysis of the complicated command:

for (i in 1:5) surface3d(x,y,outer(x,y,function(a,b).1*apply(cbind(a,b),1,function(r)dmvnorm(r,mus[,i],sigmas[[i]]))),col=i)

Let’s analyze from the outer scope inwards:

1. for (i in 1:5): This part is easy: we are going to build a surface for each class, 1 to 5. So we enter a loop whose index is 1 to 5.

2. surface3d(x,y,outer…,col=i): The surface3d function draws a 3d surface. We need to pass vectors giving the x and y values for a grid, then a vector of z values for every point in that grid, so the number of entries in z is equal to the number of entries in x multiplied by the number of entries in y. Calculating the z values is the hard part, which we look at more below. We want these surfaces to be colored, so we use the class index as the color for each surface.

3. outer(x,y,function(a,b)…): The outer function performs an operator on each pair of items in two vectors. By default the function is multiplication, which results in the outer product of the two vectors. We want instead to use a custom function to find the density at each point in our grid using the appropriate multivariate normal function.

4. function(a,b).1*apply(cbind(a,b),1,function(r)…): The outer function passes all combinations of our original x and y vectors. So a and b do not equal x and y. We bind them into a matrix, and then use apply to apply another function to each row of this matrix. This will be the density estimation for the given points (see below). We multiply this by .1 so as to flatten the surfaces; this is only to make the resulting picture look nice and would not normally be done.

5. dmvnorm(r,mus[,i],sigmas[[i]]): The dmvnorm function is the density function for a multivariate normal distribution. We pass the point we are interested in calculating the density for as the vector r, then we specify the mean vector and covariance matrix we are using for this surface.
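If it helps, here is an equivalent, decomposed version of the same loop (a sketch only; it assumes the grid vectors x and y and the mus and sigmas objects built earlier):

for (i in 1:5) {
  # density of class i's normal distribution at the point (a,b), scaled by .1
  densityAt = function(a,b) .1*apply(cbind(a,b),1,function(r) dmvnorm(r,mus[,i],sigmas[[i]]))
  # z values for every combination of grid points
  z = outer(x,y,densityAt)
  surface3d(x,y,z,col=i)
}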


We can now see how the input space is covered by the different class densities, and how each dominates the area around its own class data points. The preponderance of the black distribution is a little misleading: All the distributions approach 0 quickly, but these have been assigned the color black since it was the first distribution drawn. We can draw the distributions in the opposite order and the ‘floor’ color is light blue.

> plot3d(cbind(kmeans[,1:2],rep(0,nrow(kmeans))),zlim=c(0,1),col=kmeans[,3])
> for (i in 5:1) surface3d(x,y,outer(x,y,function(a,b).1*apply(cbind(a,b),1,function(r)dmvnorm(r,mus[,i],sigmas[[i]]))),col=i)

Note that we did not really need to use the dmvnorm function. We could have calculated directly from the pdf equation:

> r=c(-1,0)
> dmvnorm(r,mus[,1],sigmas[[1]])
[1] 0.1689647
> 1/sqrt(((2*pi)^2)*det(sigmas[[1]]))*exp(-1/2*(t(r-mus[,1])%*%solve(sigmas[[1]])%*%(r-mus[,1])))
          [,1]
[1,] 0.1689647

While calculating the density using the pdf equation is simply masochistic, it is sometimes necessary to implement equations manually and it is a good idea to get comfortable doing so.

What is important to note is that, due to their different covariance matrices, each distribution has a different shape.

Finally, if we look at the decision boundaries of the distributions from directly overhead we see why this is called quadratic discriminant analysis: these decision boundaries are indeed quadratic.


We now have the ability to calculate the class densities at any point in the input space. So, if we have a new case where 푋1 = −0.4 and 푋2 = 0.3, we can classify it by assigning it to the class associated with the normal distributions of maximal density at that point. In this case, we have:

> densities=(sapply(1:5,function(i)dmvnorm(c(-0.4,0.3),mus[,i],sigmas[[i]])))
> densities
[1] 4.965691e-04 2.574646e-01 4.148971e-01 5.811862e-80 1.304800e-12
> which.max(densities)
[1] 3

So we would categorize this new case as class 3. Note, though, that this assumes that all classes are equally likely – which is reasonable in our case, since all classes had equal numbers of cases in the training data. If this were not the case, we would multiply these class densities by a distribution giving the prior probabilities of the classes. For example, if the prior on the classes was:

Class   Probability
1       .2
2       .4
3       .1
4       .1
5       .2

Then looking simply at the densities at point 푋1 = −0.4 and 푋2 = 0.3 would lead us to classify a new case with such observed input values as being of class 2:

> prior=c(.2,.4,.1,.1,.2)
> which.max(densities*prior)
[1] 2

If the training data is an unbiased sample, you can estimate the class prior simply by calculating the proportions of the sample data that are in different classes. Note that the prior need not be normalized to be a probability distribution: <20, 40, 10, 10, 20> will work as well as <.2,.4,.1,.1,.2>.


Note that what is occurring is really a form of density estimation rather than classification. That is to say, our QDA model provides estimates of the density of each class at different points in the input space. So in principle we could use it to provide conditional probability distributions for the target variable given the input variables. We simply find the value of each class density at the point corresponding to the observed input variable values and then normalize. In our case where X1 = −0.4 and X2 = 0.3 with a uniform prior:

> densities/sum(densities)
[1] 7.379996e-04 3.826431e-01 6.166189e-01 8.637573e-80 1.939190e-12
> barplot(densities/sum(densities),ylim=c(0,1),names=1:5)

We see the probability that the new case is of class three is estimated to be .617.

Alternatively, with the non-uniform prior given above:

> (densities*prior)/sum(densities*prior)
[1] 6.869371e-04 7.123357e-01 2.869774e-01 4.019968e-80 1.805016e-12
> barplot((densities*prior)/sum(densities*prior),ylim=c(0,1),names=1:5)

We see the probability that the new case is of class three is now only .287, which is less than that of class two, which is estimated to be .712.

In practice, QDA is seldom used as a density estimator. It assumes that the class data is drawn from multivariate normal distributions. This is seldom the case, but the models are simple and so robust to overfitting and hence can perform well in classification tasks even when this assumption is not even approximately true. The density estimation is not so robust, so when this assumption is not approximately true the conditional probability estimates are likely to be very poor. We will look at other, better density estimation techniques in later modules.

The MASS package contains the qda function, which is very simple to use. Let's label the columns of our data so we can make use of formulas, and then use it:

> colnames(kmeans)=c("X1","X2","Y")
> qdaModel=MASS::qda(Y~.,as.data.frame(kmeans))
> qdaModel
Call:
qda(Y ~ ., data = as.data.frame(kmeans))

Prior probabilities of groups:
  1   2   3   4   5
0.2 0.2 0.2 0.2 0.2

Group means:
           X1         X2
1 -1.75879474 -0.5727540
2  0.06843362  0.7571667
3 -0.03256789 -0.2989786
4  0.88629594 -1.3024879
5  0.83663307  1.4170538

We can perform prediction by placing new cases to classify in a data frame with appropriately named columns. So for our example case X1 = −0.4 and X2 = 0.3:

> predict(qdaModel,data.frame(X1=c(-.4),X2=c(.3)))
$class
[1] 3
Levels: 1 2 3 4 5

$posterior
             1         2         3            4           5
1 0.0007379996 0.3826431 0.6166189 8.637573e-80 1.93919e-12

As you can see, we receive as output both the predicted class and the posterior conditional probabilities.

Where k is the number of classes and p the number of input variables, both the number of free parameters and the effective degrees of freedom of a QDA model are:

\[ (k-1)\left(\frac{p(p+3)}{2} + 1\right) \]

This is fewer parameters than we worked with in our naïve approach, and if you examine the objects returned by the qda function, you will see that they work with only a triangular matrix, and the values are not those of the actual covariance matrix.
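For example, with the k = 5 classes and p = 2 input variables of the kmeans data this gives \((5-1)\left(\frac{2(2+3)}{2}+1\right) = 4 \times 6 = 24\) parameters.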

Linear Discriminant Analysis (LDA)

Linear discriminant analysis differs from quadratic discriminant analysis in how the covariance matrices of the normal distributions are estimated. Rather than each covariance matrix being estimated from the corresponding class data, it is assumed that all distributions share the same covariance matrix. In this case the estimation of the terms in the covariance matrix is given by:

\[ q_{ij} = \frac{1}{N-1}\sum_{k=1}^{N}\left(x_{ik} - \bar{x}_{i,m_k}\right)\left(x_{jk} - \bar{x}_{j,m_k}\right) \qquad (10) \]

Where \(m_k\) is the class of row k.


Note that rather than subtracting the overall means, it is the relevant class mean that is subtracted. So if our data was:

X1   X2   Y
 4    5   1
 2    7   1
12    2   2
15    3   2

Where x̄_{a,b} is the mean of variable a when Y = b, the means are:

\[ \bar{x}_{1,1} = 3, \quad \bar{x}_{1,2} = 13.5, \quad \bar{x}_{2,1} = 6, \quad \bar{x}_{2,2} = 2.5 \]

The covariance matrix would be:

\[ \mathbf{Q} = \begin{bmatrix} 2\tfrac{1}{6} & -\tfrac{1}{6} \\[4pt] -\tfrac{1}{6} & \tfrac{5}{6} \end{bmatrix} \]

\[ q_{11} = \frac{1}{4-1}\left((4-3)^2 + (2-3)^2 + (12-13.5)^2 + (15-13.5)^2\right) = 2\tfrac{1}{6} \]

\[ q_{21} = q_{12} = \frac{1}{4-1}\left((4-3)(5-6) + (2-3)(7-6) + (12-13.5)(2-2.5) + (15-13.5)(3-2.5)\right) = -\tfrac{1}{6} \]

\[ q_{22} = \frac{1}{4-1}\left((5-6)^2 + (7-6)^2 + (2-2.5)^2 + (3-2.5)^2\right) = \tfrac{5}{6} \]
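As a quick check, the same pooled covariance matrix can be computed in R for this toy data set (a sketch; the variable names are our own):

X = matrix(c(4,5, 2,7, 12,2, 15,3), ncol=2, byrow=TRUE)
y = c(1,1,2,2)
# class means of each variable (rows = classes, columns = variables)
classMeans = apply(X, 2, function(col) tapply(col, y, mean))
# subtract the relevant class mean from each row, then form the covariance
centered = X - classMeans[y,]
crossprod(centered)/(nrow(X)-1)
#            [,1]       [,2]
# [1,]  2.1666667 -0.1666667
# [2,] -0.1666667  0.8333333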

We now provide an example of LDA using the kmeans data used in the QDA example. The means are calculated as before:

> mus=sapply(1:5,function(i) colMeans(kmeans[which(kmeans[,3]==i),1:2]))


> mus
          [,1]       [,2]        [,3]       [,4]      [,5]
[1,] -1.758795 0.06843362 -0.03256789  0.8862959 0.8366331
[2,] -0.572754 0.75716673 -0.29897861 -1.3024879 1.4170538

To calculate the covariance parameters, we create a temporary data frame that stores the values of each variable minus its class mean:

> temp=sapply(1:2,function(i)sapply(1:nrow(kmeans),function(j) kmeans[j,i]-mus[i,kmeans[j,3]]))
> head(temp)
         [,1]       [,2]
X1 -0.3214993 -0.1441812
X1 -0.8106972 -0.5243345
X1  0.0880481  0.2313776
X1 -0.2208715  0.1225994
X1 -0.1901636 -0.4919695
X1  0.4790035  0.1343738

Then we calculate the covariance matrix:

> ldaCov=crossprod(temp)/(nrow(temp)-1)
> ldaCov
           [,1]       [,2]
[1,] 0.07693057 0.01068814
[2,] 0.01068814 0.05464702

We can now examine how the input space is covered by the class densities. Now each density is identically shaped. The difference is merely where they are located.

If we look at the decision boundaries of the distributions from directly overhead we see the decision boundaries are linear. This is why it is called linear discriminant analysis.

Just as in the QDA case, we can now calculate densities at a point corresponding to a new case, and classify the new case to the class associated with the maximum density at that point. So where 푋1 = −0.4 and 푋2 = 0.3 we have:

> densities=(sapply(1:5,function(i)dmvnorm(c(-0.4,0.3),mus[,i],ldaCov)))
> densities
[1] 1.877575e-07 1.408192e-01 1.945355e-02 5.842888e-18 2.690597e-08
> which.max(densities)
[1] 2


Exactly as in the case of QDA, if the prior probabilities of the classes are not uniform this must be taken into account. Likewise, we are able to obtain conditional probabilities rather than simply classifications, but the legitimacy of this relies on the assumption that the classes really are drawn from normal distributions with equal covariance matrices:

> densities/sum(densities)
[1] 1.171486e-06 8.786211e-01 1.213776e-01 3.645585e-17 1.678759e-07

You should refer to the QDA section to review these matters.

The MASS package contains the lda function, which we use as we used the qda function. Again we label the columns of our data so we can make use of formulas:

> colnames(kmeans)=c("X1","X2","Y")
> ldaModel=MASS::lda(Y~.,as.data.frame(kmeans))
> ldaModel
Call:
lda(Y ~ ., data = as.data.frame(kmeans))

Prior probabilities of groups:
  1   2   3   4   5
0.2 0.2 0.2 0.2 0.2

Group means:
           X1         X2
1 -1.75879474 -0.5727540
2  0.06843362  0.7571667
3 -0.03256789 -0.2989786
4  0.88629594 -1.3024879
5  0.83663307  1.4170538

Coefficients of linear discriminants:
         LD1       LD2
X1 0.3603611  3.587931
X2 4.1283468 -1.123463

Proportion of trace:
   LD1    LD2
0.5994 0.4006

The implementation of LDA in MASS is actually an implementation of Fisher's Linear Discriminant, which involves a projection onto linear discriminants, much as PCA projects onto principal components. The coefficients of these linear discriminants are given in the coefficients matrix above, and the amount of between-class variance that is explained by each successive linear discriminant is given in the proportion of trace vector. Many people use the labels LDA and Fisher's Linear Discriminant interchangeably, but we have not. Accordingly, we ignore these additional elements except to note that they can be used for feature transformation and selection similarly to how PCA was used.


We can perform prediction by placing new cases to classify in a data frame with appropriately named columns. So for our example case X1 = −0.4 and X2 = 0.3:

> predict(ldaModel,data.frame(X1=c(-.4),X2=c(.3)))
$class
[1] 2
Levels: 1 2 3 4 5

$posterior
             1         2         3            4            5
1 1.673359e-06 0.8728385 0.1271596 9.969783e-17 2.526338e-07

$x
      LD1       LD2
1 1.09436 -1.772211

As you can see, we receive as output the predicted class and the posterior conditional probabilities. We also see the projection of the input variables onto the linear discriminants. We do not get exactly the same values as in our manual implementation, which indicates that there is a default setting in the lda function that leads to something not quite equivalent to our manual implementation.

Where k is the number of classes and p the number of input variables, both the number of free parameters and the effective degrees of freedom of an LDA model are:

\[ (k-1)(p+1) \]

Again, this is fewer parameters than we worked with in our naïve approach and if you look through the items in the object returned by lda you will not find the covariance matrix. But as noted, the implementation here differs significantly from that undertaken in our naïve manual implementation.

The Perceptron

The second basic classification algorithm we will look at is the perceptron algorithm. The perceptron algorithm is historically and educationally interesting, but should not be used on real problems: its approach is dominated by the support vector machines of module 19, which should always be preferred.

The idea behind the perceptron algorithm is simple: In a binary classification problem, we will seek to find a hyperplane within the feature space such that when a case falls on one side of this hyperplane we will classify it to one class, when it falls on the other, we classify it to the other class. Letting 푤0 represent the threshold, and remembering the definition of hyperplanes in module 9, this is identical to:


\[ f(X) = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i x_i > w_0 \\ 0 & \text{otherwise} \end{cases} \qquad (11) \]

This is equivalent to:

\[ f(X) = sign\left(\sum_{i=1}^{n} w_i x_i - w_0\right) \qquad (12) \]

Since we seek to find the optimal values of the parameters {w_0, …, w_n}, we can introduce a dummy variable x_0 = 1 and instead seek to find the optimal values of the same parameters in:

\[ f(X) = sign\left(\sum_{i=0}^{n} w_i x_i\right) \qquad (13) \]

The perceptron algorithm is a method for (trying to) optimize{푤0, … , 푤푛}. For training data X, Y where Y is a binary variable, it is as follows:

1. Set \(w_0, \ldots, w_n\) to 0, and \(L = \max(\|X\|^2)\).
2. Repeat:
   a. Classify all training data points using the current weights.
   b. If no points are misclassified, terminate.
   c. Otherwise, pick a misclassified training point, i. A point is misclassified if \(sign(\boldsymbol{W}^T \boldsymbol{X}_i) \neq Y_i\).
   d. Update the weights:
      - \(W = W + Y_i X_i\) (treating \(W\) as excluding \(W_0\)).
      - \(W_0 = W_0 + Y_i L\)


If the training data is linearly separable, the perceptron algorithm is guaranteed to terminate with a separating hyperplane. If it is not, the above algorithm will never terminate. Accordingly, unless we are certain the training data is linearly separable we should keep track of the number of iterations performed of the loop in step 2, as well as the best parameter values found, evaluated by misclassification error on the training data. After a given number of iterations we terminate and return these best found parameters. Note that they will not generally be the parameters being used in the last iteration.

[Figure 1: Not all separating hyperplanes are equal. All else being equal one would prefer the green separating hyperplane to the others.]

While there is a guarantee that the perceptron algorithm will terminate with a separating hyperplane if one exists (and it is allowed to run long enough), there is no guarantee that it will terminate with an optimal non-separating hyperplane if no separating hyperplane exists. Further, while the algorithm will in the separable case converge to a separating hyperplane, which is by definition optimal regarding misclassification error, there is no guarantee that the selected hyperplane will be optimal in other ways. In particular, if we define an optimally separating hyperplane as one which maximizes the minimal distance between the hyperplane and the closest points of the different classes, the returned hyperplane is not guaranteed to be optimal in this sense. The perceptron of optimal stability, now known as a support vector classifier, was designed to overcome these difficulties and should be preferred. We will examine this classifier and its generalization, the support vector machine, in module 19.

[Figure 2: Optimal Separating Hyperplane. An optimal separating hyperplane is defined to be one that maximizes the distance to the closest training points.]

Since its use is to be avoided, we will not examine examples of manual or third party implementations.

Logistic Regression

Logistic regression is a very important technique that we will meet again, utilized within more sophisticated statistical models such as neural networks. It is also another example of a generalized linear model (see module 9).


Logistic Regression estimates 푌 = 푓(푿) when:

• Y consists of proportions/probabilities OR binary coded ({true,false}, {0,1}, etc.) data.
• X = X_0, X_1, …, X_n are real valued variables, with X_0 = 1.

It fits a logistic, or logit, curve to the relationship between X and Y. The logit curve is a sigmoid (s-shaped) curve, and is the natural log of the odds, where the odds of an event n occurring is P(n)/(1−P(n)). When Y gives proportions or probabilities, we have:

[Figure 3: The standard logistic function]

\[ \hat{Y} = \frac{e^{\boldsymbol{XB}}}{1 + e^{\boldsymbol{XB}}} \]

When Y gives binary coded data we have:

\[ P(Y) = \frac{e^{\boldsymbol{XB}}}{1 + e^{\boldsymbol{XB}}} \]

These are generalized linear models, and the associated link function is the natural logarithm of the odds:

\[ \log\left(\widehat{odds}(Y)\right) = \log\left(\frac{\hat{P}(Y)}{1 - \hat{P}(Y)}\right) = \boldsymbol{XB} \]

Noting that we have included 푋0 = 1 as a dummy variable.

It should be pointed out that the estimation of the parameters 푩 is not by OLS, since its assumptions are not even approximately true in this case. Instead, numerical methods are used. This makes it unsuitable for manual implementation, and we will skip directly to working with the glm function introduced in module 9. We will work through two examples. The first will have a Y variable giving proportions and the second will have a binary Y variable, so we can see the slight difference in how we work with these two cases.


The first example will work with the menarche data from the MASS package. This dataset gives the number of female children in Warsaw at various ages who have reached menarche. We first load it and examine the first six rows using the head function.

> library(MASS)
> data(menarche)
> head(menarche)
    Age Total Menarche
1  9.21   376        0
2 10.21   200        0
3 10.58    93        0
4 10.83   120        2
5 11.08    90        2
6 11.33    88        5

We can plot the ratio of female children reaching menarche versus age using the formula version of the plot function:

> plot(Menarche/Total~Age,menarche)

We see that this plot has an approximate sigmoid shape, so we can be confident that the logistic regression model will perform reasonably well.

We now use the glm function. Since we do not have the ratio as a variable we use the following command:

> model=glm(cbind(Menarche,Total-Menarche)~Age,"binomial",menarche)

Specifying “binomial” in the second place sets the family parameter and ensures we will perform logistic regression. Notice that we have to give a matrix of the positive and negative cases and regress this on Age in the formula.

When we use the model to predict the proportion of female children in Warsaw who have reached menarche for a new age value, we need to specify that we want the output to be of type response. So to predict this value for age 14.5, we type:

> predict(model,data.frame(Age=14.5),type="response")
        1
0.9196164

We can add the regression line for this model to the plot of the original data. This time we will use the function curve, which has the annoying issue that you must pass the name of a function, not simply an unnamed function. We will also add residuals. So we type:

> f=function(x)predict(model,data.frame(Age=x),type="response")
> curve(f, 8, 18, n = 100,add=T,col="blue")
> segments(menarche$Age,menarche$Menarche/menarche$Total,menarche$Age,f(menarche$Age),col="red")

We see that the model does a good job of fitting the data, but to get a quantitative evaluation we can obtain an MSE score:

> mean((f(menarche$Age)-menarche$Menarche/menarche$Total)^2)
[1] 0.000912997

Note that we use MSE, not misclassification error. That is because we have a regression problem, not a classification one: we are, at this point, estimating the probability of a female child of a particular age having reached menarche (or, if you prefer, we are estimating the proportion of female children of a particular age in Warsaw who have reached menarche). In other words, our model is an estimate of the conditional probability function rather than a simple classifier. But we could use it as a classifier by simply classifying a new female child as reaching (1) or not reaching (0) menarche, which we could do by:

\[ cls(x) = I\left(P(M = 1 \mid Age) > .5\right) \]

Obviously we throw away information by such use. To calculate the misclassification rate on the training data, we would first set up the classification function:

> g=function(x)as.numeric(f(x)>.5)

We see that the classification of a 14 year old would be 1 (has reached menarche):

> g(14)

[1] 1


Now we can find the in-sample misclassification rate (this is why we cast to numeric in g):

> sum(apply(menarche,1,function(r)ifelse(g(r[1])==0,r[3],r[2]-r[3])))/sum(menarche[,2])
[1] 0.09392547

It would now be possible to compare the performance of our logistic regression model as a classifier to other classifiers. We could also calculate the MSE for probabilistic classification error score, which gives us:

> sum(apply(menarche,1,function(r)f(r[1])*(r[2]-r[3])+(1-f(r[1]))*r[3]))/sum(menarche[,2])
[1] 0.1293836

The second case we will examine uses the seeds data in the sml package. This small data set has a number of seeds of different ages and our target variable is whether they have germinated (0 or 1). We load and examine it by means of a scatter plot:

> library(sml)
> data(seeds)
> plot(seeds)

Since the response (Y) variable is binary, our call to the glm function is slightly different:

> model=glm(Germinated~Age,"binomial",seeds)

We can plot the regression line using:

> f=function(x)predict(model,data.frame(Age=x),type="response")
> curve(f,add=T,col="blue")

In this case it makes no sense to add residuals, as the original data points were not proportions.

Again, we are estimating the probability that a seed of a particular age has germinated. So the output for predict with an input of 4 is:

> predict(model,data.frame(Age=4),type="response")
        1
0.1708797

Like in the first example, we can use this as the basis for a classifier by classifying new cases with more than .5 probability of germinating as 1, and others as 0. We can then get an in-sample misclassification error:

> g=function(x)as.numeric(f(x)>.5)
> sum(g(seeds[,1])!=seeds[,2])/nrow(seeds)
[1] 0.2

Again, note that we lose information by treating the logistic regression model as a simple classifier but this allows us to compare the performance of our logistic regression model as a classifier. Alternatively, we can calculate the MSE for probabilistic classification error score:

> sum((f(seeds[,1])-seeds[,2])^2)/nrow(seeds)
[1] 0.1196765

Discrete-Discrete Classification

The above methods dealt with cases where all the input variables were real. We now turn to the case where all the input variables are themselves discrete. Prima facie this is a very simple topic, but even here there are novel approaches that are worth understanding, such as the noisy-or model. It is also one of the areas where usable Bayesian statistical methods were first developed, and we will take this chance to explore these methods.

Categorical and Conditional Categorical Distributions

Categorical distributions use n parameters to specify the probability distribution of an n-valued discrete random variable. The ith parameter gives the probability of the variable taking the ith value. The degrees of freedom of such a distribution is n-1, since we could represent such a distribution with n-1 parameters using the constraint:

\[ \sum_{j=1}^{n} P(X = x_j) = 1 \]

Here is a three valued categorical distribution representing the result of a football match:


Result:   Win   Draw   Loss
Prob.      .6     .3     .1

Conditional categorical distributions P(Y|X) give a categorical distribution for each possible value of the discrete variables being conditioned upon.

Here is a conditional categorical distribution representing the distribution over the result of a match given the values taken by the location and weather variables:

Location   Weather   Win   Draw   Loss
Home       Raining    .2     .7     .1
Home       Normal     .8    .15    .05
Home       Hot        .6     .2     .1
Away       Raining    .1     .8     .1
Away       Normal     .5     .4     .1
Away       Hot        .2     .6     .2

This gives a probability distribution for our output variable Y given input variables X. We can transform it into a simple classifier by using the rule:

\[ cls(Y \mid \boldsymbol{X}) = \arg\max_{y \in Y} P(Y = y \mid \boldsymbol{X}) \]
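As a small sketch, building the conditional table above by hand, the classification rule just picks the most probable column in the relevant row:

condTable = matrix(c(.2,.7,.1,  .8,.15,.05,  .6,.2,.1,
                     .1,.8,.1,  .5,.4,.1,    .2,.6,.2),
                   ncol=3, byrow=TRUE,
                   dimnames=list(c("Home.Raining","Home.Normal","Home.Hot",
                                   "Away.Raining","Away.Normal","Away.Hot"),
                                 c("Win","Draw","Loss")))
# the most probable result for each combination of the conditioning variables
colnames(condTable)[apply(condTable,1,which.max)]
# [1] "Draw" "Win"  "Win"  "Draw" "Win"  "Draw"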

Count Parameters

We now explain how to fit categorical and conditional categorical distributions from data using count parameters.

Maximum Likelihood Estimate

Take discrete variable 푋: {푥1, 푥2, … , 푥푛}, distributed 푐푎푡(푝1, 푝2, … , 푝푛). Let us track the number of times that we have seen 푋 take particular values with the count parameters:

{푐1, 푐2, … , 푐푛}


The ratio of the count parameters gives us our maximum likelihood estimate of the distribution parameters. Recall that this is the set of parameter values that makes the observations most probable:

\[ p_i = \frac{c_i}{\sum_{j=1}^{n} c_j} \]

So if we had observed a football team winning five times, losing three times and drawing twice, we would give a maximum likelihood estimate of the categorical distribution for the random variable that is the result of this team playing as follows:

Result:   Win                Draw               Loss
Prob.     .5 = 5/(5+2+3)     .2 = 2/(5+2+3)     .3 = 3/(5+2+3)

If we had observed that the team had won seven times, lost once and drawn twice at home, but won only once, lost five times and drawn four times away, we would give a maximum likelihood estimation of the conditional categorical distribution for the result of this team playing given the location of the game as follows:

Location   Win                Draw               Loss
Home       .7 = 7/(7+2+1)     .2 = 2/(7+2+1)     .1 = 1/(7+2+1)
Away       .1 = 1/(1+4+5)     .4 = 4/(1+4+5)     .5 = 5/(1+4+5)
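A minimal sketch of these maximum likelihood calculations in R, using the counts from the football example:

counts = c(Win=5, Draw=2, Loss=3)
counts/sum(counts)
#  Win Draw Loss
#  0.5  0.2  0.3

condCounts = rbind(Home=c(Win=7, Draw=2, Loss=1),
                   Away=c(Win=1, Draw=4, Loss=5))
# divide each row by its total
condCounts/rowSums(condCounts)
#      Win Draw Loss
# Home 0.7  0.2  0.1
# Away 0.1  0.4  0.5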

Adaption

Using count parameters makes it easy to adapt our parameter estimates: We simply add to the counts as observations occur and adjust the parameters accordingly. In the two examples from the previous section, in the first case if we observed ten more results and the team won five of them, drew four and lost one, we would adjust the parameters of our categorical distribution to:


Result:   Win                 Draw                Loss
Prob.     .5 = 10/(10+6+4)    .3 = 6/(10+6+4)     .2 = 4/(10+6+4)

For the second case, if we observed ten more results at home and the team won four of them, drew four and lost two, and we also observed six results away and the team won none, drew three and lost three we would adjust the parameters of our conditional categorical distribution to:

Location   Win                    Draw                   Loss
Home       .55 = 11/(11+6+3)      .3 = 6/(11+6+3)        .15 = 3/(11+6+3)
Away       .0625 = 1/(1+7+8)      .4375 = 7/(1+7+8)      .5 = 8/(1+7+8)

We can adapt to soft (probabilistic) evidence too. Continuing the conditional example, if our counts prior to the new evidence are as directly above and we hear that the team played another home game from someone who cannot remember the result for sure, but is 75% sure that it was a win with 25% chance it was a draw we would update the counts to:

Location   Win                             Draw                            Loss
Home       .5595 ≈ 11.75/(11.75+6.25+3)    .2976 ≈ 6.25/(11.75+6.25+3)     .1429 ≈ 3/(11.75+6.25+3)
Away       .0625 = 1/(1+7+8)               .4375 = 7/(1+7+8)               .5 = 8/(1+7+8)
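In R, this soft update for the Home row is just a fractional addition to the counts, for example:

home = c(Win=11, Draw=6, Loss=3)
# add the probabilistic observation: 75% win, 25% draw
home = home + c(.75, .25, 0)
home/sum(home)
#       Win      Draw      Loss
# 0.5595238 0.2976190 0.1428571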

Note that as the count parameters increase, new observations will alter the ML estimation of the distribution parameters less and less.

Encoding Expert Knowledge & Count Parameters

It is easy to encode both an expert’s knowledge about probabilities and their confidence in their estimation. We simply get them to specify their knowledge ’as if’ they had seen a particular set of observations. To see how this implicitly specifies the confidence an expert has in their


estimate, take these two possible estimates of a categorical distribution over the probability of catching a fish at a certain location:

Expert A        Catch at least one fish   Not catch any fish
Counts          900                       100
Probabilities   .9 = 900/1000             .1 = 100/1000

Expert B        Catch at least one fish   Not catch any fish
Counts          9                         1
Probabilities   .9 = 9/10                 .1 = 1/10

Both experts give the same probabilities, but expert A is much more confident in the sense that the counts she has provided, since they are higher, are much more resilient to empirical counter-evidence. If we actually went to fish at the given spot ten times and never caught any fish, then once we update the expert specified pseudo-counts, expert A’s distribution has hardly changed, but expert B’s has been entirely overwhelmed by the new evidence to the extent that we think it more probable than not that we would not catch any fish at this location on a given visit:

Expert A        Catch at least one fish   Not catch any fish
Counts          900 = 900 + 0             110 = 100 + 10
Probabilities   .891 ≈ 900/1010           .109 ≈ 110/1010


Expert B        Catch at least one fish   Not catch any fish
Counts          9 = 9 + 0                 11 = 1 + 10
Probabilities   .45 = 9/20                .55 = 11/20
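The same updates in R, as a quick sketch:

expertA = c(catch=900, none=100)
expertB = c(catch=9,   none=1)
newObs  = c(0, 10)          # ten visits, no fish caught
(expertA + newObs)/sum(expertA + newObs)
#     catch      none
# 0.8910891 0.1089109
(expertB + newObs)/sum(expertB + newObs)
# catch  none
#  0.45  0.55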

Ignorance & Conservativism

A problem with the (maximum likelihood) method given for estimating parameters in categorical and conditional categorical distributions using count parameters is that it is extremely unconservative. These distributions represent our knowledge about the system we are using them to model. Yet using the maximum likelihood method, given a single observation we immediately jump to the conclusion that a certain value is 100% likely to occur in all future cases, and that all others will never occur. This seems unreasonable.

The reason for this behavior is that the maximum likelihood method begins our count parameters at zero. A common alternative is to begin our counts at one rather than zero. This ensures a conservativism in estimates from small numbers of observations and also ensures that all values always have a non-zero probability. So, for our original case where we had observed a football team winning five times, losing three times and drawing twice, we would give an estimate of the categorical distribution for the random variable that is the result of this team playing as follows:

Result:   Win                     Draw                    Loss
Prob.     .462 ≈ (5+1)/(6+3+4)    .231 ≈ (2+1)/(6+3+4)    .308 ≈ (3+1)/(6+3+4)
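In R this 'add one to every count' estimate is simply:

counts = c(Win=5, Draw=2, Loss=3)
(counts+1)/sum(counts+1)
#       Win      Draw      Loss
# 0.4615385 0.2307692 0.3076923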

The choice of ones to model ignorance may also be justified by an understanding of these count parameters as the parameters of a Dirichlet distribution, which we look at in the next section.

Count Parameters & Dirichlet Distributions

Bayesian statistics understands the true parameters of a distribution to themselves be random variables, which we can model by another distribution. This second distribution gives, for different values, the probability that these values are the true parameters of the base


distribution. It is then possible to update this distribution over the parameters on the basis of new observations of data known to be drawn from the base distribution using Bayes' theorem:

\[ p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{\int p(x \mid \theta')\, p(\theta')\, d\theta'} \]

In general, of course, the integration over the parameter space in the denominator renders this equation insoluble. But there are special cases where it is solvable. One of those special cases is where the base distribution is the categorical distribution and the distribution over the parameters of the categorical distribution is the Dirichlet distribution. The Dirichlet distribution can be understood as being parameterized by the count parameters, and the update rules that satisfy Bayes' theorem in the face of new data drawn from the base categorical distribution are precisely the addition of one to the count parameter associated with each observed datum.

Formally, the density function of the Dirichlet distribution is given as:

\[ Dir(c_1, c_2, \ldots, c_n) = \frac{\Gamma\left(\sum_{i=1}^{n} c_i\right)}{\prod_{i=1}^{n}\Gamma(c_i)} \prod_{i=1}^{n} x_i^{c_i - 1} \]

The support of the distribution is \(x_1, \ldots, x_{n-1}\) where \(x_i \in [0,1]\) and \(\sum_{i=1}^{n-1} x_i < 1\). This amounts to saying that a Dirichlet distribution with n parameters provides a distribution over vectors of length n-1 whose items are all positive and which sum to less than one. This allows us to interpret them as the parameters of a categorical distribution which gives non-zero probability to all items. The \(x_1, \ldots, x_{n-1}\) are directly the probabilities associated with the first n-1 values of the random variable associated with the categorical distribution, with the last value, \(x_n\), implicit from the constraint that all values must sum to 1.

It is easy to gain an intuitive understanding of Dirichlet distributions by examining some examples. We will use the ddirichlet function found in the gtools package. This time we will use the namespace notation rather than loading gtools into the environment. Our first example is for when we have two count parameters, and we look at the distribution over the single parameter for the base categorical distribution when the count parameters are 〈1,1〉, 〈2,8〉, 〈20,80〉, and 〈200,800〉:

> plot(function(x)sapply(x,function(t)gtools::ddirichlet(c(t,1-t),c(1,1))),from=0,to=1,n=50)
> plot(function(x)sapply(x,function(t)gtools::ddirichlet(c(t,1-t),c(2,8))),from=0,to=1,n=50)
> plot(function(x)sapply(x,function(t)gtools::ddirichlet(c(t,1-t),c(20,80))),from=0,to=1,n=50)
> plot(function(x)sapply(x,function(t)gtools::ddirichlet(c(t,1-t),c(200,800))),from=0,to=1,n=50)

We see that when the count parameters are 〈1,1〉 we have a uniform distribution over all possible values for p. This represents complete ignorance. As the count parameters increase in the ratio 2:8 we see the mode is over .2 in all cases, and this is the maximum a posteriori estimate of p. But as the count parameters get larger, the density clusters more and more densely around this mode, modeling our increased confidence that the true value of the parameter is close to .2.

Our second example has three count parameters, and we look at the distribution over the two parameters for the base categorical distribution when the count parameters are 〈1,1,1〉, 〈2,6,2〉, 〈20,60,20〉, and 〈200,600,200〉. We have to be a little careful to avoid warnings and we add a zlim to avoid a flat graph in the case where the Dirichlet distribution parameters are all ones:

> rgl::persp3d(function(x,y) apply(cbind(x,y),1,function(r) if (sum(r)>=1) NA else gtools::ddirichlet(c(r,1-sum(r)),c(1,1,1))),zlim=c(0,3))
> rgl::plot3d(function(x,y) apply(cbind(x,y),1,function(r) if (sum(r)>=1) NA else gtools::ddirichlet(c(r,1-sum(r)),c(2,6,2))))
> rgl::plot3d(function(x,y) apply(cbind(x,y),1,function(r) if (sum(r)>=1) NA else gtools::ddirichlet(c(r,1-sum(r)),c(20,60,20))))
> rgl::plot3d(function(x,y) apply(cbind(x,y),1,function(r) if (sum(r)>=1) NA else gtools::ddirichlet(c(r,1-sum(r)),c(200,600,200))))


As we now expect, we see that when the count parameters are 〈1,1,1〉 we have a uniform distribution over all possible values for p. Again, this represents complete ignorance. As the count parameters increase in the ratio 2:6:2 we see the mode is over 〈.2, .6〉, which is the maximum a posteriori estimate of the categorical distribution parameters. As the count parameters get larger, the density clusters more and more densely around this mode, representing our increased confidence that the true values of the parameters are close to the maximum a posteriori estimate.

The utility of this is that we have a method for estimating confidence in our parameters in a categorical or conditional categorical distribution, and it can work with parameters estimated from data, or encoded from expert knowledge, or both. Although we have simply asserted that the Dirichlet distribution is a model used for this purpose, there are arguments to the effect that it is optimally rational to model our degrees of belief about the parameters of a categorical distribution using a Dirichlet distribution based on count parameters. We will not rehearse them here, but refer the interested reader to (Neapolitan, 2004).

Unfortunately, this requires a function for calculating the interval for a particular variable in the most probable n% of density of a Dirichlet distribution. We have not found any in packages available on CRAN, so include here a copy of the qdirichlet function from Rob Carnell's R-Forge project Dirichlet Distributions (Carnell) that estimates such a quantity (examine the notes in the function for more details):

qdirichlet <- function(X, alpha)
{
    # qdirichlet is not an exact quantile function since the quantile of a
    # multivariate distribution is not unique
    # qdirichlet is also not the quantiles of the marginal distributions since
    # those quantiles do not sum to one
    # qdirichlet is the quantile of the underlying gamma functions, normalized
    # This has been tested to show that qdirichlet approximates the dirichlet
    # distribution well and creates the correct marginal means and variances
    # when using a latin hypercube sample
    lena <- length(alpha)
    stopifnot(is.matrix(X))
    sims <- dim(X)[1]
    stopifnot(dim(X)[2] == lena)
    if(any(is.na(alpha)) || any(is.na(X)))
        stop("NA values not allowed in qdirichlet")
    Y <- matrix(0, nrow=sims, ncol=lena)
    ind <- which(alpha != 0)
    for(i in ind)
    {
        # add check to trap numerical instability in qgamma
        # start to worry if alpha is less than 1.0
        if (alpha[i] < 1)
        {
            # look for places where NaN will be returned by qgamma
            nanind <- which(pgamma(.Machine$double.xmin, alpha[i], 1) >= X[,i])
            # if there are such places
            if (length(nanind) > 0)
            {
                # set the output probability to near zero
                Y[nanind,i] <- .Machine$double.xmin
                # calculate the rest
                Y[-nanind,i] <- qgamma(X[-nanind,i], alpha[i], 1)
                warning("at least one probability set to the minimum machine double")
            } else {
                Y[,i] <- qgamma(X[,i], alpha[i], 1)
            }
        } else {
            Y[,i] <- qgamma(X[,i], alpha[i], 1)
        }
    }
    Y <- Y / rowSums(Y)
    return(Y)
}

The function expects a matrix of desired quantiles, one column for each variable, ordered by the order of the alpha values. The alpha values are the Dirichlet parameters. So, for example, if our counts are 〈2,6,2〉 then we can use the above function to estimate the 95% confidence intervals around the MAP estimates of the parameters 〈. 2, .6, .2〉:

> m=matrix(rep(c(.025,.975),3),ncol=3)
> qdirichlet(matrix(rep(c(.025,.975),3),ncol=3),c(2,6,2))
           [,1]      [,2]       [,3]
[1,] 0.09016421 0.8196716 0.09016421
[2,] 0.24424586 0.5115083 0.24424586

Noisy-Or Distributions

There is a big problem with conditional categorical distributions: the number of rows of a conditional categorical distribution is exponential in the number of variables conditioned upon. So if all variables have n values, and there are c variables conditioned upon, then there will be \(n^c\) rows and \(n^{c+1}\) count parameters in such a distribution.

The noisy-or distribution is an alternative approach that utilizes an assumed causal relationship between the input and target variables to vastly reduce the number of parameters required. As with the (conditional) categorical distribution, it is simple to use to encode expert knowledge. Since many relationships are in fact causal, and since most expert domain knowledge is causal, this model can be surprisingly useful.

The simplest case is that of binary input and output variables. Each input variable has one value associated with the variable being ‘on’, and one with it being ‘off’, and we assume that an input variable can cause the target variable to be ‘on’ only when it is itself ‘on’. Such effects, though, are noisy in that an ‘on’ input variable may fail to cause the target variable to be ‘on’.


For each input (or causal) variable, we model its ’causal power’ using a count parameter specification of a categorical distribution :

• a represents how often it is 'as if' we had observed the cause had been on and this had failed to cause the effect to be on.
• b represents how often it is 'as if' we had observed the cause had been on and this had caused the effect to be on.

In addition, we include an additional categorical distribution of two parameters that models the causal power of all unknown/unmodeled causes. We can think of this as modeling an additional, slack, cause which is always ‘on’ and which represent everything other than the explicit causes.

Accordingly, if we were modelling 푃(푌|푋1, 푋2) we would have:

              Failed to Cause   Succeeded to Cause
X_0 = Slack   a_0               b_0
X_1           a_1               b_1
X_2           a_2               b_2

For a cause/input variable X with causal power ⟨a, b⟩, we define the failrate:

\[ failrate(X) = \begin{cases} 1 & \text{if } X \text{ is `off'} \\ \frac{a}{a+b} & \text{otherwise} \end{cases} \]

Where our slack variable is 푋0, the noisy or distribution is:

\[ P(Y \mid \boldsymbol{X}) = 1 - \prod_{i=0}^{n} failrate(X_i) \]

It is obvious that the number of parameters in the model is 2(n+1), where n is the number of input variables.
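A minimal sketch of this binary noisy-or model in R (the counts and function name here are our own, purely for illustration):

# one row of pseudo-counts per cause; row 1 is the always-on slack cause
counts = rbind(slack = c(a=9, b=1),
               X1    = c(a=2, b=8),
               X2    = c(a=5, b=5))

noisyOr = function(x, counts) {
  # x: 0/1 vector for the explicit causes (X1, X2, ...); the slack cause is always on
  on = c(TRUE, x==1)
  failrates = ifelse(on, counts[,"a"]/rowSums(counts), 1)
  1 - prod(failrates)
}

noisyOr(c(1,0), counts)   # P(Y on | X1 on, X2 off) = 1 - (9/10)*(2/10)*1 = 0.82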

This model can be simply extended to multinomial input variables, or variables where there is no 'off' value. In both cases, all active values are given a categorical distribution modelling their causal power, and the appropriate rows are used in the multiplication. So if we have an input variable \(X_i\) with n values, all of which can cause Y to be 'on' (there is no 'off' value), we would have:

…         …        …
X_{i1}    a_{i1}   b_{i1}
X_{i2}    a_{i2}   b_{i2}
…         …        …
X_{in}    a_{in}   b_{in}
…         …        …

Then we need only redefine the fail rate to:

\[ failrate(X_i) = \frac{a_{ij}}{a_{ij} + b_{ij}} \quad \text{where } X_i = x_{ij} \]

In this case, the number of parameters in the model is 2(nm+1), where n is the number of input variables and m the average number of values for these input variables.

Mixed-Discrete Classification

We have examined classifiers where the input variables were real, and those where they were discrete. What about the case where the input variables are a mixture of real and discrete variables?

In fact, there are ways to deal with discrete variables in LDA, QDA, the perceptron and logistic regression. The simplest method is to map the discrete variables onto real values. If the discrete variables have more than two values and are not ordinal, it is usually preferable to turn them into multiple binary variables, each mapped to {0,1}, instead of mapping individual variables with n values to the integers {0,1,…,n-1}. By doing this we lose, in the mathematical space, the information that the values are mutually exclusive and covering (that such variables will always take one and only one of their values), but this information will be preserved in the data values. On the other hand, mapping to the integers up to n-1 would make it appear that there are relationships between the values (that value 4 is more like value 5 than value 1) that are not actually present and which can cause major problems for the models.
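As a sketch of the first approach, R's model.matrix function will expand a factor into indicator columns (with the default treatment coding it drops one reference level; keeping all levels would give a full one-hot encoding). The data frame here is hypothetical:

df = data.frame(X1 = c(1.2, 0.7, -0.3),
                Colour = factor(c("red", "green", "blue")))
model.matrix(~ X1 + Colour, df)
#   (Intercept)   X1 Colourgreen Colourred
# 1           1  1.2           0         1
# 2           1  0.7           1         0
# 3           1 -0.3           0         0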

Another alternative is conditional classification. In this case, we make one model for all different combinations of values of the discrete variables. This is exactly what we did in the case of conditional categorical variables, and like in that case the drawback is that the number of models required is exponential on the number of discrete input variables.

We will see that there are more sophisticated modeling techniques that can naturally deal with both discrete and real valued input variables, and the best approach is to use these. See, for example, module 15 on tree based methods.
