Basic Classification Methods

The Essentials of Data Analytics and Machine Learning

[A guide for anyone who wants to learn practical machine learning using R]

Author: Dr. Mike Ashcroft Editor: Ali Syed

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc- sa/4.0/.

© 2016 Dr. Michael Ashcroft and Persontyle Limited


BASIC CLASSIFICATION METHODS (Module 9)

In this module we look at a number of basic classification methods. We first look at cases where the independent variables are real, and examine the following algorithms:

- Linear and Quadratic Discriminant Analysis

- The Perceptron Algorithm

- Logistic Regression

We then look at the case where both the independent and target variables are discrete, which we will term discrete-discrete classification.

Basic Classification Methods

We now turn to classification problems. Where regression problems sought to estimate real valued target (Y) variables, classification problems seek to estimate discrete (nominal) target variables. So we seek to estimate the category to which a new case belongs, based on our knowledge of the independent (or input or X) variables of that case.

In this module we look at a number of basic classification methods. We first look at cases where the independent variables are real, and examine the following algorithms:

- Linear and Quadratic Discriminant Analysis - The Perceptron Algorithm - Logistic Regression

We then look at the case where both the independent and target variables are discrete, which we will term discrete-discrete classification.

Misclassification Error

Since we no longer have real-valued variables as our target variables, we cannot use the error measures discussed in module 10. Instead, we will use misclassification error:

\[ MCE(f) = \frac{1}{N}\sum_{i=1}^{N} I\left(y_i \neq f(X_i)\right) \qquad (1) \]

Where N is the number of items in the data the model's performance is being evaluated on, and I is an indicator function which takes a proposition and returns 1 if that proposition is true, and 0 otherwise.

Obviously, the misclassification error is simply the proportion of cases that the model misclassified. It is the most common error score for classification problems. However, we will see other possible classification error scores in module 12.
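As a minimal illustration (with hypothetical vectors of actual and predicted classes), the misclassification error is just the proportion of mismatches:

> actual=c(1,2,2,3,1)
> predicted=c(1,2,3,3,3)
> mean(actual!=predicted)
[1] 0.4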

MSE for probabilistic classification

In fact, in binary cases the misclassification error of a classifier is equivalent to its mean square error, so long as the target variable’s classes are coded as 0 and 1. And in general it is valid to


calculate the mean square error of a classifier in terms of the probability it assigns to the actual class of a case:

\[ MSE(m) = \frac{1}{n}\sum_{i=1}^{n}\left(1 - \hat{P}(Y = y_i)\right)^2 \]

This can be particularly useful when evaluating the performance of models which are used as classifiers but which output a probability distribution over all classes rather than a simple estimated class. The difference between these two cases should be seen as analogous to that between a regression model which outputs a probability distribution over ℝ compared with one which simply provides point estimates.
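As a small sketch (with hypothetical predicted probabilities for three cases over three classes), this score can be computed by picking out the probability assigned to each case's actual class:

> probs=rbind(c(.7,.2,.1),c(.1,.6,.3),c(.3,.3,.4))
> actual=c(1,2,3)
> mean((1-probs[cbind(1:3,actual)])^2)
[1] 0.2033333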

Linear & Quadratic Discriminant Analysis

The first two classification algorithms we look at are linear discriminant analysis (LDA) and quadratic discriminant analysis (QDA). These make use of multivariate normal distributions. Essentially they function by fitting a multivariate normal distribution over each class and then examining the relative densities of these multiple distributions at points of interest. Where the two methods differ is in how they fit the parameters of the multivariate normal distributions.

Multivariate Normal Distributions

The multivariate normal distribution is the multivariate equivalent of the normal distribution discussed in module 8. It is parameterized by the mean vector μ and covariance matrix Σ. The mean vector, μ, is simply the mean of each of the variables:

\[ \boldsymbol{\mu} = \langle \mu_1, \ldots, \mu_n \rangle \qquad (2) \]

\[ \mu_i = E[X_i] \qquad (3) \]

We estimate μ_i using the sample mean x̄_i:

\[ \bar{x}_i = \frac{1}{N}\sum_{j=1}^{N} x_{ij} \qquad (4) \]

Where the sample data has N rows, and x_{ij} is the value for variable X_i in row j.

The covariance matrix Σ generalizes the notion of variance to multiple dimensions, where:


\[ \boldsymbol{\Sigma} = \begin{bmatrix} \sigma_{11} & \cdots & \sigma_{1n} \\ \vdots & \ddots & \vdots \\ \sigma_{n1} & \cdots & \sigma_{nn} \end{bmatrix} \qquad (5) \]

\[ \sigma_{ij} = cov(X_i, X_j) = E\left[(X_i - \mu_i)(X_j - \mu_j)\right] \qquad (6) \]

We estimate 횺 using the sample covariance matrix Q:

\[ \mathbf{Q} = \begin{bmatrix} q_{11} & \cdots & q_{1n} \\ \vdots & \ddots & \vdots \\ q_{n1} & \cdots & q_{nn} \end{bmatrix} \qquad (7) \]

\[ q_{ij} = \frac{1}{N-1}\sum_{k=1}^{N}\left(x_{ik} - \bar{x}_i\right)\left(x_{jk} - \bar{x}_j\right) \qquad (8) \]

Note that the ith element of the diagonal of the covariance matrix corresponds to the ordinary univariate variance of the variable X_i.

For reference, the probability density of the normal distribution is given by:

\[ \frac{1}{\sqrt{(2\pi)^k |\Sigma|}}\, e^{\left(-\frac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)} \qquad (9) \]

Where k is the number of variables.

Quadratic Discriminant Analysis (QDA)

QDA assumes that each class, i, is drawn from a multivariate normal distribution 𝒩(μ_i, Σ_i). For each class, we estimate the mean vector and covariance matrix of this distribution from the cases belonging to that class.

For this example, we will use the kmeans data from the sml package.


> library(sml)
> data(kmeans)
> head(kmeans)
          [,1]       [,2] [,3]
[1,] -2.080294 -0.7169352    1
[2,] -2.569492 -1.0970885    1
[3,] -1.670747 -0.3413764    1
[4,] -1.979666 -0.4501546    1
[5,] -1.948958 -1.0647235    1
[6,] -1.279791 -0.4383802    1

The input variables are columns one and two, and the target class variable is column 3. Let’s see what we are working with:

> plot(kmeans[,1:2],col=kmeans[,3])

We now estimate the parameters for the normal distributions associated with each class, using the mean and cov functions:

> mus=sapply(1:5,function(i) colMeans(kmeans[which(kmeans[,3]==i),1:2]))
> mus
          [,1]       [,2]        [,3]       [,4]      [,5]
[1,] -1.758795 0.06843362 -0.03256789  0.8862959 0.8366331
[2,] -0.572754 0.75716673 -0.29897861 -1.3024879 1.4170538
> sigmas=lapply(1:5,function(i) cov(kmeans[which(kmeans[,3]==i),1:2]))
> sigmas
[[1]]
           [,1]       [,2]
[1,] 0.10755842 0.07069203
[2,] 0.07069203 0.07954641

[[2]]
           [,1]       [,2]
[1,] 0.16242031 0.01819607
[2,] 0.01819607 0.08335855

[[3]]
            [,1]        [,2]
[1,]  0.05804393 -0.05676342
[2,] -0.05676342  0.08059121

[[4]]
            [,1]        [,2]
[1,] 0.012387898 0.002603444
[2,] 0.002603444 0.015393026

[[5]]
           [,1]       [,2]
[1,] 0.05485341 0.02018681
[2,] 0.02018681 0.02188345


Notice that the kmeans data is scaled and centered – so if you look at the covariance of the entire data the diagonal entries are ones.

We can have a look at the class densities using the plot3d, surface3d, and dmvnorm functions. The first two are in the rgl package. The last is in the mvtnorm package. We will also make use of the apply and outer functions. See the text box below for an analysis of what is going on in the complicated line!

> library(rgl)
> library(mvtnorm)
> # x and y give the plotting grid; they are not defined in the original text,
> # but something like the following will work:
> x=seq(-3,3,length.out=50); y=seq(-3,3,length.out=50)
> plot3d(cbind(kmeans[,1:2],rep(0,nrow(kmeans))),zlim=c(0,1),col=kmeans[,3])
> for (i in 1:5) surface3d(x,y,outer(x,y,function(a,b).1*apply(cbind(a,b),1,function(r)dmvnorm(r,mus[,i],sigmas[[i]]))),col=i)


Let’s look a little more closely at that one

You will have noticed that the example code can contain complicated commands. This is to expose you to 'real' R code. Normally it has been left to you to work through the commands and understand how they bring together multiple functions into compound statements. But in case this last example was too much, here is an analysis of the complicated command:

for (i in 1:5) surface3d(x,y,outer(x,y,function(a,b).1*apply(cbind(a,b),1,function(r)dmvnorm(r,mus[,i],sigmas[[i]]))),col=i)

Let’s analyze from the outer scope inwards:

1. for (i in 1:5): This part is easy: we are going to build a surface for each class, 1 to 5. So we enter a loop whose index is 1 to 5.

2. surface3d(x,y,outer…,col=i): The surface3d function draws a 3d surface. We need to pass vectors giving the x and y values for a grid, then a vector of z values for every point in that grid, so the number of entries in z is equal to the number of entries in x multiplied by the number of entries in y. Calculating the z values is the hard part, which we look at more below. We want these surfaces to be colored, so we use the class index as the color for each surface.

3. outer(x,y,function(a,b)…): The outer function performs an operator on each pair of items in two vectors. By default the function is multiplication, which results in the outer product of the two vectors. We want instead to use a custom function to find the density at each point in our grid using the appropriate multivariate normal function.

4. function(a,b).1*apply(cbind(a,b),1,function(r)…): The outer function passes all combinations of our original x and y vectors. So a and b do not equal x and y. We bind them into a matrix, and then use apply to apply another function to each row of this matrix. This will be the density estimation for the given points (see below). We multiply this by .1 so as to flatten the surfaces; this is only to make the resulting picture look nice and would not normally be done.

5. dmvnorm(r,mus[,i],sigmas[[i]]): The dmvnorm function is the density function for a multivariate normal distribution. We pass the point we are interested in calculating the density for as the vector r, then we specify the mean vector and covariance matrix we are using for this surface.
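If it helps, here is an equivalent, decomposed version of the same loop (a sketch only; it assumes the grid vectors x and y and the mus and sigmas objects built earlier):

for (i in 1:5) {
  # density of class i's normal distribution at the point (a,b), scaled by .1
  densityAt = function(a,b) .1*apply(cbind(a,b),1,function(r) dmvnorm(r,mus[,i],sigmas[[i]]))
  # z values for every combination of grid points
  z = outer(x,y,densityAt)
  surface3d(x,y,z,col=i)
}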


We can now see how the input space is covered by the different class densities, and how each dominates the area around its own class data points. The preponderance of the black distribution is a little misleading: All the distributions approach 0 quickly, but these have been assigned the color black since it was the first distribution drawn. We can draw the distributions in the opposite order and the ‘floor’ color is light blue.

> plot3d(cbind(kmeans[,1:2],rep(0,nrow(kmeans))),zlim=c(0,1),col=kmeans[,3])
> for (i in 5:1) surface3d(x,y,outer(x,y,function(a,b).1*apply(cbind(a,b),1,function(r)dmvnorm(r,mus[,i],sigmas[[i]]))),col=i)

Note that we did not really need to use the dmvnorm function. We could have calculated directly from the pdf equation:

> r=c(-1,0)
> dmvnorm(r,mus[,1],sigmas[[1]])
[1] 0.1689647
> 1/sqrt(((2*pi)^2)*det(sigmas[[1]]))*exp(-1/2*(t(r-mus[,1])%*%solve(sigmas[[1]])%*%(r-mus[,1])))
          [,1]
[1,] 0.1689647

While calculating the density using the pdf equation is simply masochistic, it is sometimes necessary to implement equations manually and it is a good idea to get comfortable doing so.

What is important to note is that, due to their different covariance matrices, each distribution has a different shape.

Finally, if we look at the decision boundaries of the distributions from directly overhead we see why this is called quadratic discriminant analysis: these decision boundaries are indeed quadratic.


We now have the ability to calculate the class densities at any point in the input space. So, if we have a new case where 푋1 = −0.4 and 푋2 = 0.3, we can classify it by assigning it to the class associated with the normal distributions of maximal density at that point. In this case, we have:

> densities=(sapply(1:5,function(i)dmvnorm(c(-0.4,0.3),mus[,i],sigmas[[i]])))
> densities
[1] 4.965691e-04 2.574646e-01 4.148971e-01 5.811862e-80 1.304800e-12
> which.max(densities)
[1] 3

So we would categorize this new case as class 3. Note, though, that this assumes that all classes are equally likely – which is reasonable in our case, since all classes had equal numbers of cases in the training data. If this were not the case, we would multiply these class densities by a distribution giving the prior probabilities of the classes. For example, if the prior on the classes was:

Class   Probability
1       .2
2       .4
3       .1
4       .1
5       .2

Then looking simply at the densities at point 푋1 = −0.4 and 푋2 = 0.3 would lead us to classify a new case with such observed input values as being of class 2:

> prior=c(.2,.4,.1,.1,.2)
> which.max(densities*prior)
[1] 2

If the training data is an unbiased sample, you can estimate the class prior simply by calculating the proportions of the sample data that are in different classes. Note that the prior need not be normalized to be a probability distribution: <20, 40, 10, 10, 20> will work as well as <.2,.4,.1,.1,.2>.


Note that what is occurring is really a form of density estimation rather than classification. That is to say, our QDA model provides estimates of the density of each class at different points in the input space. So in principle we could use it to provide conditional probability distributions for the target variable given the input variables. We simply find the value of each class density at the point corresponding to the observed input variable values and then normalize. In our case where X1 = −0.4 and X2 = 0.3 with a uniform prior:

> densities/sum(densities)
[1] 7.379996e-04 3.826431e-01 6.166189e-01 8.637573e-80 1.939190e-12
> barplot(densities/sum(densities),ylim=c(0,1),names=1:5)

We see the probability that the new case is of class three is estimated to be .617.

Alternatively, with the non-uniform prior given above:

> (densities*prior)/sum(densities*prior)
[1] 6.869371e-04 7.123357e-01 2.869774e-01 4.019968e-80 1.805016e-12
> barplot((densities*prior)/sum(densities*prior),ylim=c(0,1),names=1:5)

We see the probability that the new case is of class three is now only .287, which is less than that of class two, which is estimated to be .712.

In practice, QDA is seldom used as a density estimator. It assumes that the class data is drawn from multivariate normal distributions. This is seldom the case, but the models are simple and so robust to overfitting and hence can perform well in classification tasks even when this assumption is not even approximately true. The density estimation is not so robust, so when this assumption is not approximately true the conditional probability estimates are likely to be very poor. We will look at other, better density estimation techniques in later modules.

The MASS package contains the qda function, which is very simple to use. Let's label the columns of our data so we can make use of formulas, and then use it:

> colnames(kmeans)=c("X1","X2","Y")
> qdaModel=MASS::qda(Y~.,as.data.frame(kmeans))
> qdaModel
Call:
qda(Y ~ ., data = as.data.frame(kmeans))

Prior probabilities of groups:
  1   2   3   4   5
0.2 0.2 0.2 0.2 0.2

Group means:
           X1         X2
1 -1.75879474 -0.5727540
2  0.06843362  0.7571667
3 -0.03256789 -0.2989786
4  0.88629594 -1.3024879
5  0.83663307  1.4170538

We can perform prediction by placing new cases to classify in a data frame with appropriately named columns. So for our example case X1 = −0.4 and X2 = 0.3:

> predict(qdaModel,data.frame(X1=c(-.4),X2=c(.3)))
$class
[1] 3
Levels: 1 2 3 4 5

$posterior
             1         2         3            4           5
1 0.0007379996 0.3826431 0.6166189 8.637573e-80 1.93919e-12

As you can see, we receive as output both the predicted class and the posterior conditional probabilities.

Where k is the number of classes and p the number of input variables, both the number of free parameters and the effective degrees of freedom of a QDA model are:

\[ (k-1)\left(\frac{p(p+3)}{2} + 1\right) \]

This is fewer parameters than we worked with in our naïve approach, and if you examine the objects returned by the qda function, you will see that they work with only a triangular matrix, and the values are not those of the actual covariance matrix.
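For example, with the k = 5 classes and p = 2 input variables of the kmeans data this gives \((5-1)\left(\frac{2(2+3)}{2}+1\right) = 4 \times 6 = 24\) parameters.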

Linear Discriminant Analysis (LDA)

Linear discriminant analysis differs from quadratic discriminant analysis in how the covariance matrices of the normal distributions are estimated. Rather than each covariance matrix being estimated from the corresponding class data, it is assumed that all distributions share the same covariance matrix. In this case the estimation of the terms in the covariance matrix is given by:

\[ q_{ij} = \frac{1}{N-1}\sum_{k=1}^{N}\left(x_{ik} - \bar{x}_{i,m_k}\right)\left(x_{jk} - \bar{x}_{j,m_k}\right) \qquad (10) \]

Where \(m_k\) is the class of row k.


Note that rather than subtracting the overall means, it is the relevant class mean that is subtracted. So if our data was:

X1   X2   Y
 4    5   1
 2    7   1
12    2   2
15    3   2

Where x̄_{a,b} is the mean of variable a when Y = b, the means are:

\[ \bar{x}_{1,1} = 3, \quad \bar{x}_{1,2} = 13.5, \quad \bar{x}_{2,1} = 6, \quad \bar{x}_{2,2} = 2.5 \]

The covariance matrix would be:

\[ \mathbf{Q} = \begin{bmatrix} 2\tfrac{1}{6} & -\tfrac{1}{6} \\[4pt] -\tfrac{1}{6} & \tfrac{5}{6} \end{bmatrix} \]

\[ q_{11} = \frac{1}{4-1}\left((4-3)^2 + (2-3)^2 + (12-13.5)^2 + (15-13.5)^2\right) = 2\tfrac{1}{6} \]

\[ q_{21} = q_{12} = \frac{1}{4-1}\left((4-3)(5-6) + (2-3)(7-6) + (12-13.5)(2-2.5) + (15-13.5)(3-2.5)\right) = -\tfrac{1}{6} \]

\[ q_{22} = \frac{1}{4-1}\left((5-6)^2 + (7-6)^2 + (2-2.5)^2 + (3-2.5)^2\right) = \tfrac{5}{6} \]
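As a quick check, the same pooled covariance matrix can be computed in R for this toy data set (a sketch; the variable names are our own):

X = matrix(c(4,5, 2,7, 12,2, 15,3), ncol=2, byrow=TRUE)
y = c(1,1,2,2)
# class means of each variable (rows = classes, columns = variables)
classMeans = apply(X, 2, function(col) tapply(col, y, mean))
# subtract the relevant class mean from each row, then form the covariance
centered = X - classMeans[y,]
crossprod(centered)/(nrow(X)-1)
#            [,1]       [,2]
# [1,]  2.1666667 -0.1666667
# [2,] -0.1666667  0.8333333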

We now provide an example of LDA using the kmeans data used in the QDA example. The means are calculated as before:

> mus=sapply(1:5,function(i) colMeans(kmeans[which(kmeans[,3]==i),1:2]))


> mus
          [,1]       [,2]        [,3]       [,4]      [,5]
[1,] -1.758795 0.06843362 -0.03256789  0.8862959 0.8366331
[2,] -0.572754 0.75716673 -0.29897861 -1.3024879 1.4170538

To calculate the covariance parameters, we create a temporary data frame that stores the values of each variable minus its class mean:

> temp=sapply(1:2,function(i)sapply(1:nrow(kmeans),function(j) kmeans[j,i]-mus[i,kmeans[j,3]]))
> head(temp)
         [,1]       [,2]
X1 -0.3214993 -0.1441812
X1 -0.8106972 -0.5243345
X1  0.0880481  0.2313776
X1 -0.2208715  0.1225994
X1 -0.1901636 -0.4919695
X1  0.4790035  0.1343738

Then we calculate the covariance matrix:

> ldaCov=crossprod(temp)/(nrow(temp)-1)
> ldaCov
           [,1]       [,2]
[1,] 0.07693057 0.01068814
[2,] 0.01068814 0.05464702

We can now examine how the input space is covered by the class densities. Now each density is identically shaped. The difference is merely where they are located.

If we look at the decision boundaries of the distributions from directly overhead we see the decision boundaries are linear. This is why it is called linear discriminant analysis.

Just as in the QDA case, we can now calculate densities at a point corresponding to a new case, and classify the new case to the class associated with the maximum density at that point. So where 푋1 = −0.4 and 푋2 = 0.3 we have:

> densities=(sapply(1:5,function(i)dmvnorm(c(-0.4,0.3),mus[,i],ldaCov)))
> densities
[1] 1.877575e-07 1.408192e-01 1.945355e-02 5.842888e-18 2.690597e-08
> which.max(densities)
[1] 2


Exactly as in the case of QDA, if the prior probabilities of the classes are not uniform this must be taken into account. Likewise, we are able to obtain conditional probabilities rather than simply classifications, but the legitimacy of this relies on the assumption that the classes really are drawn from normal distributions with equal covariance matrices:

> densities/sum(densities)
[1] 1.171486e-06 8.786211e-01 1.213776e-01 3.645585e-17 1.678759e-07

You should refer to the QDA section to review these matters.

The MASS package contains the lda function, which we use as we used the qda function. Again we label the columns of our data so we can make use of formulas:

> colnames(kmeans)=c("X1","X2","Y")
> ldaModel=MASS::lda(Y~.,as.data.frame(kmeans))
> ldaModel
Call:
lda(Y ~ ., data = as.data.frame(kmeans))

Prior probabilities of groups:
  1   2   3   4   5
0.2 0.2 0.2 0.2 0.2

Group means:
           X1         X2
1 -1.75879474 -0.5727540
2  0.06843362  0.7571667
3 -0.03256789 -0.2989786
4  0.88629594 -1.3024879
5  0.83663307  1.4170538

Coefficients of linear discriminants:
         LD1       LD2
X1 0.3603611  3.587931
X2 4.1283468 -1.123463

Proportion of trace:
   LD1    LD2
0.5994 0.4006

The implementation of LDA in MASS is actually an implementation of Fisher's Linear Discriminant, which involves a projection onto linear discriminants, much as PCA projects onto principal components. The coefficients of these linear discriminants are given in the coefficients matrix above, and the amount of between-class variance that is explained by each successive linear discriminant is given in the proportion of trace vector. Many people use the labels LDA and Fisher's Linear Discriminant interchangeably, but we have not. Accordingly, we ignore these additional elements except to note that they can be used for feature transformation and selection similarly to how PCA was used.


We can perform prediction by placing new cases to classify in a data frame with appropriately named columns. So for our example case X1 = −0.4 and X2 = 0.3:

> predict(ldaModel,data.frame(X1=c(-.4),X2=c(.3)))
$class
[1] 2
Levels: 1 2 3 4 5

$posterior
             1         2         3            4            5
1 1.673359e-06 0.8728385 0.1271596 9.969783e-17 2.526338e-07

$x
      LD1       LD2
1 1.09436 -1.772211

As you can see, we receive as output the predicted class and the posterior conditional probabilities. We also see the projection of the input variables onto the linear discriminants. We do not get exactly the same values as in our manual implementation, which indicates that there is a default setting in the lda function that leads to something not quite equivalent to our manual implementation.

Where k is the number of classes and p the number of input variables, both the number of free parameters and the effective degrees of freedom of an LDA model are:

\[ (k-1)(p+1) \]

Again, this is fewer parameters than we worked with in our naïve approach and if you look through the items in the object returned by lda you will not find the covariance matrix. But as noted, the implementation here differs significantly from that undertaken in our naïve manual implementation.

The Perceptron

The second basic classification algorithm we will look at is the perceptron algorithm. The perceptron algorithm is historically and educationally interesting, but should not be used on real problems: its approach is dominated by the support vector machines of module 19, which should always be preferred.

The idea behind the perceptron algorithm is simple: In a binary classification problem, we will seek to find a hyperplane within the feature space such that when a case falls on one side of this hyperplane we will classify it to one class, when it falls on the other, we classify it to the other class. Letting 푤0 represent the threshold, and remembering the definition of hyperplanes in module 9, this is identical to:


\[ f(X) = \begin{cases} 1 & \text{if } \sum_{i=1}^{n} w_i x_i > w_0 \\ 0 & \text{otherwise} \end{cases} \qquad (11) \]

This is equivalent to:

\[ f(X) = sign\left(\sum_{i=1}^{n} w_i x_i - w_0\right) \qquad (12) \]

Since we seek to find the optimal values of the parameters {w_0, …, w_n}, we can introduce a dummy variable x_0 = 1 and instead seek to find the optimal values of the same parameters in:

\[ f(X) = sign\left(\sum_{i=0}^{n} w_i x_i\right) \qquad (13) \]

The perceptron algorithm is a method for (trying to) optimize{푤0, … , 푤푛}. For training data X, Y where Y is a binary variable, it is as follows:

1. Set \(w_0, \ldots, w_n\) to 0, and \(L = \max(\|X\|^2)\).
2. Repeat:
   a. Classify all training data points using the current weights.
   b. If no points are misclassified, terminate.
   c. Otherwise, pick a misclassified training point, i. A point is misclassified if \(sign(\boldsymbol{W}^T \boldsymbol{X}_i) \neq Y_i\).
   d. Update the weights:
      - \(W = W + Y_i X_i\) (treating \(W\) as excluding \(W_0\)).
      - \(W_0 = W_0 + Y_i L\)


If the training data is linearly separable, the perceptron algorithm is guaranteed to terminate with a separating hyperplane. If it is not, the above algorithm will never terminate. Accordingly, unless we are certain the training data is linearly separable we should keep track of the number of iterations performed of the loop in step 2, as well as the best parameter values found, evaluated by misclassification error on the training data. After a given number of iterations we terminate and return these best found parameters. Note that they will not generally be the parameters being used in the last iteration.

[Figure 1: Not all separating hyperplanes are equal. All else being equal one would prefer the green separating hyperplane to the others.]

While there is a guarantee that the perceptron algorithm will terminate with a separating hyperplane if one exists (and it is allowed to run long enough), there is no guarantee that it will terminate with an optimal non-separating hyperplane if no separating hyperplane exists. Further, while the algorithm will in the separable case converge to a separating hyperplane, which is by definition optimal regarding misclassification error, there is no guarantee that the selected hyperplane will be optimal in other ways. In particular, if we define an optimally separating hyperplane as one which maximizes the minimal distance between the hyperplane and the closest points of the different classes, the returned hyperplane is not guaranteed to be optimal in this sense. The perceptron of optimal stability, now known as a support vector classifier, was designed to overcome these difficulties and should be preferred. We will examine this classifier and its generalization, the support vector machine, in module 19.

[Figure 2: Optimal Separating Hyperplane. An optimal separating hyperplane is defined to be one that maximizes the distance to the closest training points.]

Since its use is to be avoided, we will not examine examples of manual or third party implementations.

Logistic Regression

Logistic regression is a very important technique that we will meet again, utilized within more sophisticated statistical models such as neural networks. It is also another example of a generalized linear model (see module 9).


Logistic Regression estimates 푌 = 푓(푿) when:

• Y consists of proportions/probabilities OR binary coded ({true,false}, {0,1}, etc.) data.
• X = X_0, X_1, …, X_n are real valued variables, with X_0 = 1.

It fits a logistic, or logit, curve to the relationship between X and Y. The logit curve is a sigmoid (s-shaped) curve, and is the natural log of the odds, where the odds of an event n occurring is P(n)/(1−P(n)). When Y gives proportions or probabilities, we have:

[Figure 3: The standard logistic function]

\[ \hat{Y} = \frac{e^{\boldsymbol{XB}}}{1 + e^{\boldsymbol{XB}}} \]

When Y gives binary coded data we have:

\[ P(Y) = \frac{e^{\boldsymbol{XB}}}{1 + e^{\boldsymbol{XB}}} \]

These are generalized linear models, and the associated link function is the natural logarithm of the odds:

\[ \log\left(\widehat{odds}(Y)\right) = \log\left(\frac{\hat{P}(Y)}{1 - \hat{P}(Y)}\right) = \boldsymbol{XB} \]

Noting that we have included 푋0 = 1 as a dummy variable.

It should be pointed out that the estimation of the parameters 푩 is not by OLS, since its assumptions are not even approximately true in this case. Instead, numerical methods are used. This makes it unsuitable for manual implementation, and we will skip directly to working with the glm function introduced in module 9. We will work through two examples. The first will have a Y variable giving proportions and the second will have a binary Y variable, so we can see the slight difference in how we work with these two cases.


The first example will work with the menarche data from the MASS package. This dataset gives the number of female children in Warsaw at various ages who have reached menarche. We first load it and examine the first six rows using the head function.

> library(MASS)
> data(menarche)
> head(menarche)
    Age Total Menarche
1  9.21   376        0
2 10.21   200        0
3 10.58    93        0
4 10.83   120        2
5 11.08    90        2
6 11.33    88        5

We can plot the ratio of female children reaching menarche versus age using the formula version of the plot function:

> plot(Menarche/Total~Age,menarche)

We see that this plot has an approximate sigmoid shape, so we can be confident that the logistic regression model will perform reasonably well.

We now use the glm function. Since we do not have the ratio as a variable we use the following command:

> model=glm(cbind(Menarche,Total-Menarche)~Age,"binomial",menarche)

Specifying “binomial” in the second place sets the family parameter and ensures we will perform logistic regression. Notice that we have to give a matrix of the positive and negative cases and regress this on Age in the formula.

When we use the model to predict the proportion of female children in Warsaw who have reached menarche for a new age value, we need to specify that we want the output to be of type response. So to predict this value for age 14.5, we type:

> predict(model,data.frame(Age=14.5),type="response")
        1
0.9196164

We can add the regression line for this model to the plot of the original data. This time we will use the function curve, which has the annoying issue that you must pass the name of a function, not simply an unnamed function. We will also add residuals. So we type:

> f=function(x)predict(model,data.frame(Age=x),type="response")
> curve(f, 8, 18, n = 100,add=T,col="blue")
> segments(menarche$Age,menarche$Menarche/menarche$Total,menarche$Age,f(menarche$Age),col="red")

We see that the model does a good job of fitting the data, but to get a quantitative evaluation we can obtain an MSE score:

> mean((f(menarche$Age)-menarche$Menarche/menarche$Total)^2)
[1] 0.000912997

Note that we use MSE, not misclassification error. That is because we have a regression problem, not a classification one: we are, at this point, estimating the probability of a female child of a particular age having reached menarche (or, if you prefer, we are estimating the proportion of female children of a particular age in Warsaw who have reached menarche). In other words, our model is an estimate of the conditional probability function rather than a simple classifier. But we could use it as a classifier by simply classifying a new female child as reaching (1) or not reaching (0) menarche, which we could do by:

\[ cls(x) = I\left(P(M = 1 \mid Age) > .5\right) \]

Obviously we throw away information by such use. To calculate the misclassification rate on the training data, we would first set up the classification function:

> g=function(x)as.numeric(f(x)>.5)

We see that the classification of a 14 year old would be 1 (has reached menarche):

> g(14)

[1] 1


Now we can find the in-sample misclassification rate (this is why we cast to numeric in g):

> sum(apply(menarche,1,function(r)ifelse(g(r[1])==0,r[3],r[2]-r[3])))/sum(menarche[,2])
[1] 0.09392547

It would now be possible to compare the performance of our logistic regression model as a classifier to other classifiers. We could also calculate the MSE for probabilistic classification error score, which gives us:

> sum(apply(menarche,1,function(r)f(r[1])*(r[2]-r[3])+(1-f(r[1]))*r[3]))/sum(menarche[,2])
[1] 0.1293836

The second case we will examine uses the seeds data in the sml package. This small data set has a number of seeds of different ages and our target variable is whether they have germinated (0 or 1). We load and examine it by means of a scatter plot:

> library(sml)
> data(seeds)
> plot(seeds)

Since the response (Y) variable is binary, our call to the glm function is slightly different:

> model=glm(Germinated~Age,"binomial",seeds)

We can plot the regression line using:

> f=function(x)predict(model,data.frame(Age=x),type="response")
> curve(f,add=T,col="blue")

In this case it makes no sense to add residuals, as the original data points were not proportions.

Again, we are estimating the probability that a seed of a particular age has germinated. So the output for predict with an input of 4 is:

> predict(model,data.frame(Age=4),type="response")
        1
0.1708797

Like in the first example, we can use this as the basis for a classifier by classifying new cases with more than .5 probability of germinating as 1, and others as 0. We can then get an in-sample misclassification error:

> g=function(x)as.numeric(f(x)>.5)
> sum(g(seeds[,1])!=seeds[,2])/nrow(seeds)
[1] 0.2

Again, note that we lose information by treating the logistic regression model as a simple classifier but this allows us to compare the performance of our logistic regression model as a classifier. Alternatively, we can calculate the MSE for probabilistic classification error score:

> sum((f(seeds[,1])-seeds[,2])^2)/nrow(seeds)
[1] 0.1196765

Discrete-Discrete Classification

The above methods dealt with cases where all the input variables were real. We now turn to the case where all the input variables are themselves discrete. Prima facie this is a very simple topic, but even here there are novel approaches that are worth understanding, such as the noisy-or model. It is also one of the areas where usable Bayesian statistical methods were first developed, and we will take this chance to explore these methods.

Categorical and Conditional Categorical Distributions

Categorical distributions use n parameters to specify the probability distribution of an n-valued discrete random variable. The ith parameter gives the probability of the variable taking the ith value. The degrees of freedom of such a distribution is n-1, since we could represent such a distribution with n-1 parameters using the constraint:

\[ \sum_{j=1}^{n} P(X = x_j) = 1 \]

Here is a three valued categorical distribution representing the result of a football match:


Result:   Win   Draw   Loss
Prob.      .6     .3     .1

Conditional categorical distributions P(Y|X) give a categorical distribution for each possible value of the discrete variables being conditioned upon.

Here is a conditional categorical distribution representing the distribution over the result of a match given the values taken by the location and weather variables:

Location   Weather   Win   Draw   Loss
Home       Raining    .2     .7     .1
Home       Normal     .8    .15    .05
Home       Hot        .6     .2     .1
Away       Raining    .1     .8     .1
Away       Normal     .5     .4     .1
Away       Hot        .2     .6     .2

This gives a probability distribution for our output variable Y given input variables X. We can transform it into a simple classifier by using the rule:

\[ cls(Y \mid \boldsymbol{X}) = \arg\max_{y \in Y} P(Y = y \mid \boldsymbol{X}) \]
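As a small sketch, building the conditional table above by hand, the classification rule just picks the most probable column in the relevant row:

condTable = matrix(c(.2,.7,.1,  .8,.15,.05,  .6,.2,.1,
                     .1,.8,.1,  .5,.4,.1,    .2,.6,.2),
                   ncol=3, byrow=TRUE,
                   dimnames=list(c("Home.Raining","Home.Normal","Home.Hot",
                                   "Away.Raining","Away.Normal","Away.Hot"),
                                 c("Win","Draw","Loss")))
# the most probable result for each combination of the conditioning variables
colnames(condTable)[apply(condTable,1,which.max)]
# [1] "Draw" "Win"  "Win"  "Draw" "Win"  "Draw"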

Count Parameters

We now explain how to fit categorical and conditional categorical distributions from data using count parameters.

Maximum Likelihood Estimate

Take discrete variable 푋: {푥1, 푥2, … , 푥푛}, distributed 푐푎푡(푝1, 푝2, … , 푝푛). Let us track the number of times that we have seen 푋 take particular values with the count parameters:

{푐1, 푐2, … , 푐푛}


The ratio of the count parameters gives us our maximum likelihood estimate of the distribution parameters. Recall that this is the set of parameter values that makes the observations most probable:

\[ p_i = \frac{c_i}{\sum_{j=1}^{n} c_j} \]

So if we had observed a football team winning five times, losing three times and drawing twice, we would give a maximum likelihood estimate of the categorical distribution for the random variable that is the result of this team playing as follows:

Result:   Win                Draw               Loss
Prob.     .5 = 5/(5+2+3)     .2 = 2/(5+2+3)     .3 = 3/(5+2+3)

If we had observed that the team had won seven times, lost once and drawn twice at home, but won only once, lost five times and drawn four times away, we would give a maximum likelihood estimation of the conditional categorical distribution for the result of this team playing given the location of the game as follows:

Location   Win                Draw               Loss
Home       .7 = 7/(7+2+1)     .2 = 2/(7+2+1)     .1 = 1/(7+2+1)
Away       .1 = 1/(1+4+5)     .4 = 4/(1+4+5)     .5 = 5/(1+4+5)
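A minimal sketch of these maximum likelihood calculations in R, using the counts from the football example:

counts = c(Win=5, Draw=2, Loss=3)
counts/sum(counts)
#  Win Draw Loss
#  0.5  0.2  0.3

condCounts = rbind(Home=c(Win=7, Draw=2, Loss=1),
                   Away=c(Win=1, Draw=4, Loss=5))
# divide each row by its total
condCounts/rowSums(condCounts)
#      Win Draw Loss
# Home 0.7  0.2  0.1
# Away 0.1  0.4  0.5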

Adaption

Using count parameters makes it easy to adapt our parameter estimates: We simply add to the counts as observations occur and adjust the parameters accordingly. In the two examples from the previous section, in the first case if we observed ten more results and the team won five of them, drew four and lost one, we would adjust the parameters of our categorical distribution to:


Result:   Win                 Draw                Loss
Prob.     .5 = 10/(10+6+4)    .3 = 6/(10+6+4)     .2 = 4/(10+6+4)

For the second case, if we observed ten more results at home and the team won four of them, drew four and lost two, and we also observed six results away and the team won none, drew three and lost three we would adjust the parameters of our conditional categorical distribution to:

Location   Win                    Draw                   Loss
Home       .55 = 11/(11+6+3)      .3 = 6/(11+6+3)        .15 = 3/(11+6+3)
Away       .0625 = 1/(1+7+8)      .4375 = 7/(1+7+8)      .5 = 8/(1+7+8)

We can adapt to soft (probabilistic) evidence too. Continuing the conditional example, if our counts prior to the new evidence are as directly above and we hear that the team played another home game from someone who cannot remember the result for sure, but is 75% sure that it was a win with 25% chance it was a draw we would update the counts to:

Location   Win                             Draw                            Loss
Home       .5595 ≈ 11.75/(11.75+6.25+3)    .2976 ≈ 6.25/(11.75+6.25+3)     .1429 ≈ 3/(11.75+6.25+3)
Away       .0625 = 1/(1+7+8)               .4375 = 7/(1+7+8)               .5 = 8/(1+7+8)
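In R, this soft update for the Home row is just a fractional addition to the counts, for example:

home = c(Win=11, Draw=6, Loss=3)
# add the probabilistic observation: 75% win, 25% draw
home = home + c(.75, .25, 0)
home/sum(home)
#       Win      Draw      Loss
# 0.5595238 0.2976190 0.1428571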

Note that as the count parameters increase, new observations will alter the ML estimation of the distribution parameters less and less.

Encoding Expert Knowledge & Count Parameters

It is easy to encode both an expert’s knowledge about probabilities and their confidence in their estimation. We simply get them to specify their knowledge ’as if’ they had seen a particular set of observations. To see how this implicitly specifies the confidence an expert has in their


estimate, take these two possible estimates of a categorical distribution over the probability of catching a fish at a certain location:

Expert A        Catch at least one fish   Not catch any fish
Counts          900                       100
Probabilities   .9 = 900/1000             .1 = 100/1000

Expert B        Catch at least one fish   Not catch any fish
Counts          9                         1
Probabilities   .9 = 9/10                 .1 = 1/10

Both experts give the same probabilities, but expert A is much more confident in the sense that the counts she has provided, since they are higher, are much more resilient to empirical counter-evidence. If we actually went to fish at the given spot ten times and never caught any fish, then once we update the expert specified pseudo-counts, expert A’s distribution has hardly changed, but expert B’s has been entirely overwhelmed by the new evidence to the extent that we think it more probable than not that we would not catch any fish at this location on a given visit:

Expert A        Catch at least one fish   Not catch any fish
Counts          900 = 900 + 0             110 = 100 + 10
Probabilities   .891 ≈ 900/1010           .109 ≈ 110/1010


Expert B        Catch at least one fish   Not catch any fish
Counts          9 = 9 + 0                 11 = 1 + 10
Probabilities   .45 = 9/20                .55 = 11/20
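The same updates in R, as a quick sketch:

expertA = c(catch=900, none=100)
expertB = c(catch=9,   none=1)
newObs  = c(0, 10)          # ten visits, no fish caught
(expertA + newObs)/sum(expertA + newObs)
#     catch      none
# 0.8910891 0.1089109
(expertB + newObs)/sum(expertB + newObs)
# catch  none
#  0.45  0.55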

Ignorance & Conservativism

A problem with the (maximum likelihood) method given for estimating parameters in categorical and conditional categorical distributions using count parameters is that it is extremely unconservative. These distributions represent our knowledge about the system we are using them to model. Yet using the maximum likelihood method, given a single observation we immediately jump to the conclusion that a certain value is 100% likely to occur in all future cases, and that all others will never occur. This seems unreasonable.

The reason for this behavior is that the maximum likelihood method begins our count parameters at zero. A common alternative is to begin our counts at one rather than zero. This ensures a conservativism in estimates from small numbers of observations and also ensures that all values always have a non-zero probability. So, for our original case where we had observed a football team winning five times, losing three times and drawing twice, we would give an estimate of the categorical distribution for the random variable that is the result of this team playing as follows:

Result:   Win                     Draw                    Loss
Prob.     .462 ≈ (5+1)/(6+3+4)    .231 ≈ (2+1)/(6+3+4)    .308 ≈ (3+1)/(6+3+4)
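In R this 'add one to every count' estimate is simply:

counts = c(Win=5, Draw=2, Loss=3)
(counts+1)/sum(counts+1)
#       Win      Draw      Loss
# 0.4615385 0.2307692 0.3076923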

The choice of ones to model ignorance may also be justified by an understanding of these count parameters as the parameters of a Dirichlet distribution, which we look at in the next section.

Count Parameters & Dirichlet Distributions

Bayesian statistics understands the true parameters of a distribution to themselves be random variables, which we can model by another distribution. This second distribution gives, for different values, the probability that these values are the true parameters of the base


distribution. It is then possible to update this distribution over the parameters on the basis of new observations of data known to be drawn from the base distribution using Bayes' theorem:

\[ p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{\int p(x \mid \theta')\, p(\theta')\, d\theta'} \]

In general, of course, the integration over the parameter space in the denominator renders this equation insoluble. But there are special cases where it is solvable. One of those special cases is where the base distribution is the categorical distribution and the distribution over the parameters of the categorical distribution is the Dirichlet distribution. The Dirichlet distribution can be understood as being parameterized by the count parameters, and the update rules that satisfy Bayes' theorem in the face of new data drawn from the base categorical distribution are precisely the addition of one to the count parameter associated with each observed datum.

Formally, the density function of the Dirichlet distribution is given as:

\[ Dir(c_1, c_2, \ldots, c_n) = \frac{\Gamma\left(\sum_{i=1}^{n} c_i\right)}{\prod_{i=1}^{n}\Gamma(c_i)} \prod_{i=1}^{n} x_i^{c_i - 1} \]

The support of the distribution is \(x_1, \ldots, x_{n-1}\) where \(x_i \in [0,1]\) and \(\sum_{i=1}^{n-1} x_i < 1\). This amounts to saying that a Dirichlet distribution with n parameters provides a distribution over vectors of length n-1 whose items are all positive and which sum to less than one. This allows us to interpret them as the parameters of a categorical distribution which gives non-zero probability to all items. The \(x_1, \ldots, x_{n-1}\) are directly the probabilities associated with the first n-1 values of the random variable associated with the categorical distribution, with the last value, \(x_n\), implicit from the constraint that all values must sum to 1.

It is easy to gain an intuitive understanding of Dirichlet distributions by examining some examples. We will use the ddirichlet function found in the gtools package. This time we will use the namespace notation rather than loading gtools into the environment. Our first example is for when we have two count parameters, and we look at the distribution over the single parameter for the base categorical distribution when the count parameters are 〈1,1〉, 〈2,8〉, 〈20,80〉, and 〈200,800〉:

> plot(function(x)sapply(x,function(t)gtools::ddirichlet(c(t,1-t),c(1,1))),from=0,to=1,n=50)
> plot(function(x)sapply(x,function(t)gtools::ddirichlet(c(t,1-t),c(2,8))),from=0,to=1,n=50)
> plot(function(x)sapply(x,function(t)gtools::ddirichlet(c(t,1-t),c(20,80))),from=0,to=1,n=50)
> plot(function(x)sapply(x,function(t)gtools::ddirichlet(c(t,1-t),c(200,800))),from=0,to=1,n=50)

We see that when the count parameters are 〈1,1〉 we have a uniform distribution over all possible values for p. This represents complete ignorance. As the count parameters increase in the ratio 2:8 we see the mode is over .2 in all cases, and this is the maximum a posteriori estimate of p. But as the count parameters get larger, the density clusters more and more densely around this mode, modeling our increased confidence that the true value of the parameter is close to .2.

Our second example has three count parameters, and we look at the distribution over the two parameters for the base categorical distribution when the count parameters are 〈1,1,1〉, 〈2,6,2〉, 〈20,60,20〉, and 〈200,600,200〉. We have to be a little careful to avoid warnings and we add a zlim to avoid a flat graph in the case where the Dirichlet distribution parameters are all ones:

> rgl::persp3d(function(x,y) apply(cbind(x,y),1,function(r) if (sum(r)>=1) NA else gtools::ddirichlet(c(r,1-sum(r)),c(1,1,1))),zlim=c(0,3))
> rgl::plot3d(function(x,y) apply(cbind(x,y),1,function(r) if (sum(r)>=1) NA else gtools::ddirichlet(c(r,1-sum(r)),c(2,6,2))))
> rgl::plot3d(function(x,y) apply(cbind(x,y),1,function(r) if (sum(r)>=1) NA else gtools::ddirichlet(c(r,1-sum(r)),c(20,60,20))))
> rgl::plot3d(function(x,y) apply(cbind(x,y),1,function(r) if (sum(r)>=1) NA else gtools::ddirichlet(c(r,1-sum(r)),c(200,600,200))))


As we now expect, we see that when the count parameters are 〈1,1,1〉 we have a uniform distribution over all possible values for p. Again, this represents complete ignorance. As the count parameters increase in the ratio 2:6:2 we see the mode is over 〈.2, .6〉, which is the maximum a posteriori estimate of the categorical distribution parameters. As the count parameters get larger, the density clusters more and more densely around this mode, representing our increased confidence that the true values of the parameters are close to the maximum a posteriori estimate.

The utility of this is that we have a method for estimating confidence in our parameters in a categorical or conditional categorical distribution, and it can work with parameters estimated from data, or encoded from expert knowledge, or both. Although we have simply asserted that the Dirichlet distribution is a model used for this purpose, there are arguments to the effect that it is optimally rational to model our degrees of belief about the parameters of a categorical distribution using a Dirichlet distribution based on count parameters. We will not rehearse them here, but refer the interested reader to (Neapolitan, 2004).

Unfortunately, this requires a function for calculating the interval for a particular variable in the most probable n% of density of a Dirichlet distribution. We have not found any in packages available on CRAN, so include here a copy of the qdirichlet function from Rob Carnell's R-Forge project Dirichlet Distributions (Carnell) that estimates such a quantity (examine the notes in the function for more details):

qdirichlet <- function(X, alpha)
{
    # qdirichlet is not an exact quantile function since the quantile of a
    # multivariate distribution is not unique
    # qdirichlet is also not the quantiles of the marginal distributions since
    # those quantiles do not sum to one
    # qdirichlet is the quantile of the underlying gamma functions, normalized
    # This has been tested to show that qdirichlet approximates the dirichlet
    # distribution well and creates the correct marginal means and variances
    # when using a latin hypercube sample
    lena <- length(alpha)
    stopifnot(is.matrix(X))
    sims <- dim(X)[1]
    stopifnot(dim(X)[2] == lena)
    if(any(is.na(alpha)) || any(is.na(X)))
        stop("NA values not allowed in qdirichlet")
    Y <- matrix(0, nrow=sims, ncol=lena)
    ind <- which(alpha != 0)
    for(i in ind)
    {
        # add check to trap numerical instability in qgamma
        # start to worry if alpha is less than 1.0
        if (alpha[i] < 1)
        {
            # look for places where NaN will be returned by qgamma
            nanind <- which(pgamma(.Machine$double.xmin, alpha[i], 1) >= X[,i])
            # if there are such places
            if (length(nanind) > 0)
            {
                # set the output probability to near zero
                Y[nanind,i] <- .Machine$double.xmin
                # calculate the rest
                Y[-nanind,i] <- qgamma(X[-nanind,i], alpha[i], 1)
                warning("at least one probability set to the minimum machine double")
            } else {
                Y[,i] <- qgamma(X[,i], alpha[i], 1)
            }
        } else {
            Y[,i] <- qgamma(X[,i], alpha[i], 1)
        }
    }
    Y <- Y / rowSums(Y)
    return(Y)
}

The function expects a matrix of desired quantiles, one column for each variable, ordered by the order of the alpha values. The alpha values are the Dirichlet parameters. So, for example, if our counts are 〈2,6,2〉 then we can use the above function to estimate the 95% confidence intervals around the MAP estimates of the parameters 〈. 2, .6, .2〉:

> m=matrix(rep(c(.025,.975),3),ncol=3)
> qdirichlet(matrix(rep(c(.025,.975),3),ncol=3),c(2,6,2))
           [,1]      [,2]       [,3]
[1,] 0.09016421 0.8196716 0.09016421
[2,] 0.24424586 0.5115083 0.24424586

Noisy-Or Distributions

There is a big problem with conditional categorical distributions: the number of rows of a conditional categorical distribution is exponential in the number of variables conditioned upon. So if all variables have n values, and there are c variables conditioned upon, then there will be \(n^c\) rows and \(n^{c+1}\) count parameters in such a distribution.

The noisy-or distribution is an alternative approach that utilizes an assumed causal relationship between the input and target variables to vastly reduce the number of parameters required. As with the (conditional) categorical distribution, it is simple to use to encode expert knowledge. Since many relationships are in fact causal, and since most expert domain knowledge is causal, this model can be surprisingly useful.

The simplest case is that of binary input and output variables. Each input variable has one value associated with the variable being ‘on’, and one with it being ‘off’, and we assume that an input variable can cause the target variable to be ‘on’ only when it is itself ‘on’. Such effects, though, are noisy in that an ‘on’ input variable may fail to cause the target variable to be ‘on’.


For each input (or causal) variable, we model its ’causal power’ using a count parameter specification of a categorical distribution :

• a represents how often it is 'as if' we had observed the cause had been on and this had failed to cause the effect to be on.
• b represents how often it is 'as if' we had observed the cause had been on and this had caused the effect to be on.

In addition, we include an additional categorical distribution of two parameters that models the causal power of all unknown/unmodeled causes. We can think of this as modeling an additional, slack, cause which is always ‘on’ and which represent everything other than the explicit causes.

Accordingly, if we were modelling 푃(푌|푋1, 푋2) we would have:

              Failed to Cause   Succeeded to Cause
X_0 = Slack   a_0               b_0
X_1           a_1               b_1
X_2           a_2               b_2

For a cause/input variable X with causal power ⟨a, b⟩, we define the failrate:

\[ failrate(X) = \begin{cases} 1 & \text{if } X \text{ is `off'} \\ \frac{a}{a+b} & \text{otherwise} \end{cases} \]

Where our slack variable is 푋0, the noisy or distribution is:

\[ P(Y \mid \boldsymbol{X}) = 1 - \prod_{i=0}^{n} failrate(X_i) \]

It is obvious that the number of parameters in the model is 2(n+1), where n is the number of input variables.
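A minimal sketch of this binary noisy-or model in R (the counts and function name here are our own, purely for illustration):

# one row of pseudo-counts per cause; row 1 is the always-on slack cause
counts = rbind(slack = c(a=9, b=1),
               X1    = c(a=2, b=8),
               X2    = c(a=5, b=5))

noisyOr = function(x, counts) {
  # x: 0/1 vector for the explicit causes (X1, X2, ...); the slack cause is always on
  on = c(TRUE, x==1)
  failrates = ifelse(on, counts[,"a"]/rowSums(counts), 1)
  1 - prod(failrates)
}

noisyOr(c(1,0), counts)   # P(Y on | X1 on, X2 off) = 1 - (9/10)*(2/10)*1 = 0.82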

This model can be simply extended to multinomial input variables, or variables where there is no 'off' value. In both cases, all active values are given a categorical distribution modelling their causal power, and the appropriate rows are used in the multiplication. So if we have an input variable \(X_i\) with n values, all of which can cause Y to be 'on' (there is no 'off' value), we would have:

…         …        …
X_{i1}    a_{i1}   b_{i1}
X_{i2}    a_{i2}   b_{i2}
…         …        …
X_{in}    a_{in}   b_{in}
…         …        …

Then we need only redefine the fail rate to:

\[ failrate(X_i) = \frac{a_{ij}}{a_{ij} + b_{ij}} \quad \text{where } X_i = x_{ij} \]

In this case, the number of parameters in the model is 2(nm+1), where n is the number of input variables and m the average number of values for these input variables.

Mixed-Discrete Classification

We have examined classifiers where the input variables were real, and those where they were discrete. What about the case where the input variables are a mixture of real and discrete variables?

In fact, there are ways to deal with discrete variables in LDA, QDA, the perceptron and logistic regression. The simplest method is to map the discrete variables onto real values. If the discrete variables have more than two values and are not ordinal, it is usually preferable to turn them into multiple binary variables, each mapped to {0,1}, instead of mapping individual variables with n values to the integers {0,1,…,n-1}. By doing this we lose, in the mathematical space, the information that the values are mutually exclusive and covering (that such variables will always take one and only one of their values), but this information will be preserved in the data values. On the other hand, mapping to the integers up to n-1 would make it appear that there are relationships between the values (that value 4 is more like value 5 than value 1) that are not actually present and which can cause major problems for the models.
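As a sketch of the first approach, R's model.matrix function will expand a factor into indicator columns (with the default treatment coding it drops one reference level; keeping all levels would give a full one-hot encoding). The data frame here is hypothetical:

df = data.frame(X1 = c(1.2, 0.7, -0.3),
                Colour = factor(c("red", "green", "blue")))
model.matrix(~ X1 + Colour, df)
#   (Intercept)   X1 Colourgreen Colourred
# 1           1  1.2           0         1
# 2           1  0.7           1         0
# 3           1 -0.3           0         0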

Another alternative is conditional classification. In this case, we make one model for all different combinations of values of the discrete variables. This is exactly what we did in the case of conditional categorical variables, and like in that case the drawback is that the number of models required is exponential on the number of discrete input variables.

We will see that there are more sophisticated modeling techniques that can naturally deal with both discrete and real valued input variables, and the best approach is to use these. See, for example, module 15 on tree based methods.
