Stat 428/528: Advanced Data Analysis 2
Chapter 17: Classification
Instructor: Yan Lu
Classification analysis

You might have several groups already defined and want to classify a new observation.

- If you have run a PCA, you could take the linear combinations of variables corresponding to PC1 and PC2 determined from the original set of data.
  - You could then apply those linear combinations to a new data point (even if it didn't contribute to the calculation of the PCs) and see where it falls on a plot of PC2 versus PC1.
- If clusters have been defined from the original data, you might want to find which cluster the new observation is closest to.

Examples

- classifying a tumor as benign or malignant based on a medical image
- making a diagnosis (medical or psychiatric) on the basis of a set of symptoms (this is more open-ended than benign versus malignant)
- classifying a fossil bone as belonging to a male or female, adult or juvenile
- classifying student applicants as likely to complete college or drop out
- finding the best fit for an applicant for a college major, or for a job within the military
- determining disputed authorship (Hamilton versus Madison for the Federalist Papers, or Shakespeare's plays)
- identifying speech patterns for automated voice recognition systems

Example: Fisher's iris data

A famous data set used to illustrate classification is Fisher's iris data from 1936. There are three species of iris with 50 observations each. The species are:

- Iris setosa
- Iris versicolour
- Iris virginica

The variables are (in cm):

1. sepal length
2. sepal width
3. petal length
4. petal width

Figure: Three types of iris flowers.

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

Figure: two new observations plotted with the iris data. Obs 1 is classified as Versicolor and Obs 2 is classified as Setosa.

Classification using Mahalanobis distance

Suppose $p$ features $X_1, X_2, \ldots, X_p$ are used to discriminate among the $k$ groups.

- Group sizes are $n_1, n_2, \ldots, n_k$, with a total sample size of $n = n_1 + n_2 + \cdots + n_k$.
- Let $\bar{X}_i = (\bar{X}_{i1}, \bar{X}_{i2}, \ldots, \bar{X}_{ip})'$ be the vector of mean responses for the $i$th group, $i = 1, 2, \ldots, k$.
- Let $S_i$ be the $p \times p$ variance-covariance matrix for the $i$th group.
- The pooled variance-covariance matrix is given by
  $$S = \frac{(n_1 - 1)S_1 + (n_2 - 1)S_2 + \cdots + (n_k - 1)S_k}{n - k}.$$

M-distance

The M-distance from an observation $X$ to (the center of) the $i$th sample is
$$D_i^2(X) = (X - \bar{X}_i)' S^{-1} (X - \bar{X}_i).$$

- Note that if $S$ is the identity matrix, this is the squared Euclidean distance.
- Given the M-distance from $X$ to each sample, classify $X$ into the group which has the minimum M-distance, as in the sketch below.
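To make the rule concrete, here is a minimal base-R sketch of minimum-M-distance classification on the iris data. mahalanobis() is a standard base-R function; the new observation x_new and the object names (xbar, S, D2) are invented for illustration.

# Group means, sizes, and per-group covariance matrices
X      <- iris[, 1:4]
groups <- split(X, iris$Species)
xbar   <- lapply(groups, colMeans)   # mean vector for each group
n.i    <- sapply(groups, nrow)       # group sizes n_1, ..., n_k
k      <- length(groups)

# Pooled variance-covariance matrix: S = sum_i (n_i - 1) S_i / (n - k)
S <- Reduce(`+`, Map(function(Si, ni) (ni - 1) * Si,
                     lapply(groups, cov), n.i)) / (sum(n.i) - k)

# Squared M-distance from a new observation to each group center;
# classify into the group with the smallest distance
x_new <- c(6.1, 2.8, 4.7, 1.2)       # hypothetical new flower (in cm)
D2 <- sapply(xbar, function(m) mahalanobis(x_new, m, S))
names(which.min(D2))                 # predicted species

With equal prior probabilities, this minimum-M-distance rule is the same rule that linear discriminant analysis applies, which is where the lda() function used later in this chapter comes in.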
Figure: How classification works, three groups and two features.

- Observation 1 is closest in M-distance to the center of group 3. Thus, classify observation 1 into group 3.
- Observation 2 is closest to group 1. Thus, classify observation 2 into group 1.
- Observation 3 is closest to the center of group 2 in terms of the standard Euclidean (walking) distance.
  - However, observation 3 is more similar to data in group 1 than it is to either of the other groups.
  - The M-distance from observation 3 to group 1 is substantially smaller than the M-distances to either group 2 or group 3.
  - Thus, you would classify observation 3 into group 1.

The M-distance from the $i$th group to the $j$th group is the M-distance between the centers of the groups:
$$D^2(i, j) = D^2(j, i) = (\bar{X}_i - \bar{X}_j)' S^{-1} (\bar{X}_i - \bar{X}_j).$$

- Larger values suggest relatively better potential for discrimination between groups.
- In the plot above, $D^2(1, 2) < D^2(1, 3)$, which implies that it should be easier to distinguish between groups 1 and 3 than between groups 1 and 2.

Evaluating the accuracy of a classification rule

The misclassification rate, or the expected proportion of misclassified observations, is a good measure for evaluating a classification rule.

- Better rules have smaller misclassification rates, but there is no universal cutoff for what is considered good in a given problem.
- You should judge a classification rule relative to the current standards in your field for "good" classification.

Table: Classification table or confusion matrix

  Actual group   Number of observations   Predicted group 1   Predicted group 2
  1              $n_1$                    $n_{11}$            $n_{12}$
  2              $n_2$                    $n_{21}$            $n_{22}$

Misclassification rate $= (n_{12} + n_{21}) / (n_1 + n_2)$.

Resubstitution

Resubstitution evaluates the misclassification rate using the data from which the classification rule is constructed.

- The resubstitution estimate of the error rate is optimistic (too small).
- A greater percentage of misclassifications is expected when the rule is used on new data, that is, data from which the rule was not constructed.

Cross-validation

Cross-validation is a better way to estimate the misclassification rate.

- In many statistical packages, you can implement cross-validation by randomly splitting the data into a training (or calibration) set from which the classification rule is constructed.
- The remaining data, called the test data set, are used with the classification rule to estimate the error rate.
- This process is often repeated, say 10 times, and the error rate is estimated as the average of the error rates from the individual splits.
- With repeated random splitting, it is common to use 10% of each split as the test data set (a 10-fold cross-validation). A sketch of this scheme in R follows.
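As a rough illustration of repeated random splitting, here is a sketch using lda() from MASS on the iris data. The seed, the number of repetitions, and the 10% holdout fraction are arbitrary choices for the example, not prescriptions.

library(MASS)

set.seed(528)                               # arbitrary seed, for reproducibility
reps <- 10
err  <- numeric(reps)
for (r in 1:reps) {
  test.id <- sample(nrow(iris), round(0.10 * nrow(iris)))  # hold out 10%
  fit  <- lda(Species ~ ., data = iris[-test.id, ])        # train on the rest
  pred <- predict(fit, iris[test.id, ])$class              # classify held-out rows
  err[r] <- mean(pred != iris$Species[test.id])            # error on this split
}
mean(err)                                   # cross-validated misclassification rate

# Confusion matrix (classification table) for the last split
table(Actual = iris$Species[test.id], Predicted = pred)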
Jackknife

Another form of cross-validation uses a jackknife method where single cases are held out of the data (an n-fold cross-validation), then classified after constructing the classification rule from the remaining data. The process is repeated for each case, giving an estimated misclassification rate as the proportion of cases misclassified.

- The jackknife method is necessary with small data sets so that single observations don't greatly bias the classification.

Use real testing data

You can also classify observations with unknown group membership by treating the observations to be classified as a test data set.

Example: Fisher's iris data cross-validation

- The 150 observations were randomly rearranged and separated into two batches of 75.
  - One batch is assigned the label "test"; the rest are "train".
- The 75 observations in the calibration set were used to develop a classification rule.
- This rule was applied to the remaining 75 flowers, which form the test data set.
- There is no general rule about the relative sizes of the test data and the training data. Many researchers use a 50-50 split.
- Combine the two data sets at the end of the cross-validation to create the actual rule for classifying future data.

Construct classification rules

> library(MASS)
> lda.iris0 <- lda(Species ~ Sepal.Length + Sepal.Width +
+                  Petal.Length + Petal.Width, data = iris)
> lda.iris0
Call:
lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
    data = iris)

Prior probabilities of groups:
    setosa versicolor  virginica
 0.3333333  0.3333333  0.3333333

Group means:
           Sepal.Length Sepal.Width Petal.Length Petal.Width
setosa            5.006       3.428        1.462       0.246
versicolor        5.936       2.770        4.260       1.326
virginica         6.588       2.974        5.552       2.026

Coefficients of linear discriminants:
                     LD1         LD2
Sepal.Length   0.8293776  0.02410215
Sepal.Width    1.5344731  2.16452123
Petal.Length  -2.2012117 -0.93192121
Petal.Width   -2.8104603  2.83918785

Proportion of trace:
   LD1    LD2
0.9912 0.0088

- Prior probabilities of groups: the proportion of observations in each group.
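Tying this back to the jackknife: lda() in MASS can produce leave-one-out predictions directly through its CV argument, and predict() classifies new cases from the fitted rule. A minimal sketch, in which the new flower's measurements are invented for illustration:

# Jackknife (leave-one-out) classification: with CV = TRUE, lda() returns
# the predicted class for each observation with that observation held out
lda.cv <- lda(Species ~ ., data = iris, CV = TRUE)
mean(lda.cv$class != iris$Species)      # jackknife misclassification rate

# Classifying a new observation with the fitted rule
# (measurements below are hypothetical)
new.flower <- data.frame(Sepal.Length = 6.1, Sepal.Width = 2.8,
                         Petal.Length = 4.7, Petal.Width = 1.2)
predict(lda.iris0, newdata = new.flower)$class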