TECHNOMETRICS, VOL. 16, NO. 3, AUGUST 1974

Discriminant Analysis When the Initial Samples are Misclassified II: Non-Random Misclassification Models

Peter A. Lachenbruch
Department of Biostatistics, University of North Carolina, Chapel Hill, North Carolina

Received March 1973; revised October 1973

Two models of non-random initial misclassification are studied. In these models, observations which are closer to the mean of the "wrong" population have a greater chance of being misclassified than others. Sampling studies show that (a) the actual error rates of the rules from samples with initial misclassification are only slightly affected; (b) the apparent error rates, obtained by resubstituting the observations into the calculated discriminant function, are drastically affected, and cannot be used; and (c) the Mahalanobis $D^2$ is greatly inflated.

KEYWORDS: Discriminant Analysis; Robustness; Non-Random Initial Misclassification

1. INTRODUCTION

Lachenbruch (1966) studied the effects of misclassification of the initial samples on the performance of the linear discriminant function (LDF). More recently, McLachlan (1972) developed the asymptotic results for this model and found generally good agreement with the earlier results. It was found that if the total proportion was not large, and equal in both populations, no increase in the probability of misclassification occurs. The reason for this is that although the means of the LDF in population 1 and population 2 were closer together, the variance was reduced correspondingly. If $\alpha_i$ is the proportion of initial observations thought to be members of the $i$th group that really are from the other group, $P_i$ is the probability of misclassifying a member of the $i$th group, $\delta^2$ is the Mahalanobis distance, and $\Phi$ is the cumulative normal function, then

$$P_1 = \Phi\bigl(-\tfrac{\delta}{2}(1 + \alpha_1 - \alpha_2)\bigr), \qquad P_2 = \Phi\bigl(-\tfrac{\delta}{2}(1 + \alpha_2 - \alpha_1)\bigr)$$

if $\alpha_1 + \alpha_2 < 1$. Thus if $\alpha_1 = \alpha_2$, the probabilities of misclassification are unaltered from their optimum values. This model assumed that the observations were randomly misclassified and that each observation had the same chance of being initially misclassified. This clearly is unrealistic. The present study attempts to present two other models which it is hoped are more realistic than the "equal chance" model. To do this, we sacrifice some mathematical rigor, as the equations involved are very difficult to set down. Thus, Monte Carlo experiments are used to evaluate the behavior of the LDF.
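As a quick numerical illustration of these formulas, the short Python sketch below evaluates $P_1$ and $P_2$ for chosen values of $\delta$, $\alpha_1$, $\alpha_2$; the function name and example inputs are mine, not the paper's. With $\alpha_1 = \alpha_2$ it returns the optimum rate $\Phi(-\delta/2)$, as stated above.

```python
# Illustrative sketch (not from the paper) of the random ("equal chance")
# misclassification formulas quoted above; valid only when alpha1 + alpha2 < 1.
from scipy.stats import norm

def random_misclassification_error_rates(delta, alpha1, alpha2):
    """Error rates of the LDF when proportions alpha1, alpha2 of the
    initial samples are misclassified at random."""
    if alpha1 + alpha2 >= 1:
        raise ValueError("the formulas require alpha1 + alpha2 < 1")
    p1 = norm.cdf(-(delta / 2.0) * (1 + alpha1 - alpha2))
    p2 = norm.cdf(-(delta / 2.0) * (1 + alpha2 - alpha1))
    return p1, p2

# Equal proportions leave the optimum rate Phi(-delta/2) unchanged:
print(random_misclassification_error_rates(1.0, 0.1, 0.1))  # ~ (0.309, 0.309)
# Unequal proportions shift the two error rates in opposite directions:
print(random_misclassification_error_rates(1.0, 0.2, 0.0))  # ~ (0.274, 0.345)
```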
2. MODELS AND EXPERIMENTS

We assume that observations $x$ from the $i$th population, $\pi_i$, $i = 1, 2$, are normally distributed with mean $\mu_i$ and covariance matrix $I$. In $\pi_1$ we assume $\mu_1 = 0$, and in $\pi_2$, $\mu_2' = (\delta, 0, \ldots, 0)$ with $\delta > 0$. This can be done without loss of generality, since an appropriate linear transformation can be applied to data from multivariate normal distributions with other means and covariance matrices. There are $p$ variates; $n_i$, $i = 1, 2$, will denote the number of observations in the initial samples, and $\alpha_i$ is the proportion of observations initially misclassified as members of $\pi_i$.

Initial misclassification can arise in many ways; e.g., an imperfect criterion for assigning the initial observations to their true populations. This could arise in biomedical studies when the criterion has a false positive or false negative rate greater than zero. In this study, we use the observations themselves to decide if the individual is initially misclassified. This is a limitation of the study in the sense that there usually will be some other criterion used for initial assignment. We shall study two initial misclassification models.

The first model will be called the complete separation model and is defined as follows. For each observation $x$, calculate $(x - \mu_1)'(x - \mu_1) = x'x$ and $(x - \mu_2)'(x - \mu_2)$, and assign the observation to whichever population leads to the smaller of the two quantities. For the model assumed above, this amounts to assigning $x$ to $\pi_1$ if $x_1 < \delta/2$ and to $\pi_2$ otherwise. It is easy to show that this leads to an initial misclassification rate of $\alpha_i = \Phi(-\delta/2)$ in both $\pi_1$ and $\pi_2$. Anderson (1972) has indicated considerable instability in logistic discriminants when complete separation occurs. After setting up the initial samples, the sample discriminant function is calculated and evaluated.

The second model is a generalization of the first. The same criterion is used but, in addition, for an observation from $\pi_i$ to be misclassified, $(x - \mu_i)'(x - \mu_i)$ must be greater than a quantity $V_i$. In this study, we use percentiles of the $\chi_p^2$ distribution. Thus, in addition to being closer to the mean of the wrong population, the observation must also be an unlikely member of the correct population. Suppose $x \in \pi_1$. Then it will be assigned to $\pi_2$ if

$$x'x > (x - \mu_2)'(x - \mu_2) \quad \text{and} \quad x'x > V_1.$$

A similar statement applies to observations from $\pi_2$. Note that $V_1 = V_2 = 0$ gives the first model. Table 1 gives the values of $p$, $n_1 = n_2 = n$, and $(V_1, V_2)$ used in this study ($n$ is the true number from each $\pi_i$). Values of $\delta$ of 1, 2, and 3 were used for each combination. These values of $V_1$, $V_2$ are the upper 25% and 10% percentage points of the appropriate $\chi^2$ distributions. In all cases, 10 replications were used.

Let the subscript 0 refer to the case in which the initial observations are correctly assigned, 1 to the model 1 misclassification case, and 2 to the model 2 misclassification case. In model 0, $\bar{x}_1$, $\bar{x}_2$ and $S$ are calculated from the correctly classified data. In models 1 and 2, they are calculated from the misclassified data. No subscripts have been placed on them to avoid additional notation, but the reader should keep this in mind. Then we define the following:

1. $D_i(x) = \bigl(x - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)\bigr)' S^{-1} (\bar{x}_1 - \bar{x}_2)$ is the discriminant function in case $i = 0, 1, 2$;

2. $P_{1i} = \Phi\bigl(-D_i(0)/\sqrt{V_i}\bigr)$ and $P_{2i} = \Phi\bigl(D_i(\mu_2)/\sqrt{V_i}\bigr)$, where $V_i = (\bar{x}_1 - \bar{x}_2)' S^{-1} S^{-1} (\bar{x}_1 - \bar{x}_2)$ is the variance of $D_i(x)$ (not to be confused with the model 2 cutoffs $V_1$, $V_2$). These are the actual probabilities of misclassification in each case.

3. $R_{1i}$, $R_{2i}$ are the resubstitution estimators of $P_{1i}$ and $P_{2i}$. They have also been referred to as the apparent error rate (Hills, 1966).

4. $Q_i = \Phi(-D/2)$, where $D^2 = (\bar{x}_1 - \bar{x}_2)' S^{-1} (\bar{x}_1 - \bar{x}_2)$.

After the samples are generated, $D_i(x)$, $P_{1i}$, $P_{2i}$, $R_{1i}$, $R_{2i}$, and $Q_i$ are computed for $i = 0, 1, 2$. The actual error rates, $P_{1i}$ and $P_{2i}$, tell us how the procedure is performing. $R_{1i}$, $R_{2i}$ and $Q_i$ are commonly used estimators of $P_{1i}$ and $P_{2i}$; they tell us what our estimators would say, which may be considerably different from $P_{1i}$ and $P_{2i}$. Note that the results for the various misclassification models are not independent, since they are all applied to the same initial sample.
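To make the experimental procedure concrete, here is a minimal Monte Carlo sketch in Python under the stated assumptions ($\pi_1 = N(0, I)$, $\pi_2 = N(\delta e_1, I)$; model II mislabeling, with model I as the special case $V_1 = V_2 = 0$). The function names, seed, and example settings are mine; this is an illustration of the setup, not the paper's original code.

```python
import numpy as np
from scipy.stats import norm

def mislabel(x, true_pop, delta, v1, v2):
    """Model II rule: flip the label if x is closer to the wrong mean AND
    its squared distance from its own mean exceeds V_i (model I: V_i = 0)."""
    p = x.shape[0]
    mu1, mu2 = np.zeros(p), np.r_[delta, np.zeros(p - 1)]
    mu_own, mu_other = (mu1, mu2) if true_pop == 1 else (mu2, mu1)
    d_own = np.sum((x - mu_own) ** 2)
    d_other = np.sum((x - mu_other) ** 2)
    v = v1 if true_pop == 1 else v2
    return (3 - true_pop) if (d_other < d_own and d_own > v) else true_pop

def ldf_error_rates(p, n, delta, v1, v2, rng):
    """One replication: actual error rates (P_1i, P_2i) and apparent
    (resubstitution) rates (R_1i, R_2i) of the LDF built from mislabeled data."""
    x1 = rng.standard_normal((n, p))                      # true pi_1 sample
    x2 = rng.standard_normal((n, p)); x2[:, 0] += delta   # true pi_2 sample
    xs = np.vstack([x1, x2])
    labels = np.array([mislabel(x, 1, delta, v1, v2) for x in x1] +
                      [mislabel(x, 2, delta, v1, v2) for x in x2])
    g1, g2 = xs[labels == 1], xs[labels == 2]
    m1, m2 = g1.mean(axis=0), g2.mean(axis=0)
    s = ((len(g1) - 1) * np.cov(g1, rowvar=False) +
         (len(g2) - 1) * np.cov(g2, rowvar=False)) / (len(g1) + len(g2) - 2)
    a = np.linalg.solve(s, m1 - m2)                       # S^{-1}(xbar_1 - xbar_2)
    D = lambda x: (x - 0.5 * (m1 + m2)) @ a               # D_i(x)
    var_D = a @ a                                         # variance of D_i(x) when Sigma = I
    mu1, mu2 = np.zeros(p), np.r_[delta, np.zeros(p - 1)]
    p1 = norm.cdf(-D(mu1) / np.sqrt(var_D))               # actual error rate for pi_1
    p2 = norm.cdf( D(mu2) / np.sqrt(var_D))               # actual error rate for pi_2
    r1 = np.mean(np.array([D(x) for x in g1]) <= 0)       # apparent error rates by
    r2 = np.mean(np.array([D(x) for x in g2]) > 0)        # resubstituting labeled data
    return p1, p2, r1, r2

rng = np.random.default_rng(0)
print(ldf_error_rates(p=4, n=25, delta=2.0, v1=5.4, v2=5.4, rng=rng))
```

Averaging such replications (the paper uses 10 per configuration) gives quantities comparable in spirit to those reported in Tables 3 and 4.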
3. RESULTS

Table 2 gives the optimal and approximate expected error rates for samples of size $n_1 = n_2 = n$ for the various cases considered in this paper. These may be compared with the experimental results in Tables 3 and 4. The values are calculated from Lachenbruch (1968).

Tables 3 and 4 give the error rates for the correctly classified samples and for the model I and model II misclassified samples. The parenthetic indicators on the model II rows give the upper tail percentages used to define $(V_1, V_2)$.

TABLE 1 -- Values of p, n, and (V1, V2)

  p     n     V1     V2
  4    25      0      0
  4   100      0      0
  4    25    5.4    5.4
  4   100    5.4    5.4
  4    25    7.8    7.8
  4   100    7.8    7.8
  4    25    7.8    5.4
  4   100    7.8    5.4
 10    25      0      0
 10   100      0      0
 10    25   12.5   12.5
 10   100   12.5   12.5
 10    25   16.0   16.0
 10   100   16.0   16.0
 10    25   16.0   12.5
 10   100   16.0   12.5

TABLE 2 -- Optimal and Expected Error Rates from "True" Samples

  p     n    δ   Optimal   Expected
  4    25    1     .31       .34
  4    25    2     .16       .13
  4    25    3     .07       .08
  4   100    1     .31       .32
  4   100    2     .16       .16
  4   100    3     .07       .07
 10    25    1     .31       .37
 10    25    2     .16       .21
 10    25    3     .07       .10
 10   100    1     .31       .33
 10   100    2     .16       .17
 10   100    3     .07       .07

TABLE 3 -- Means and Standard Deviations of True Error Rates

p = 4                            n = 25                       n = 100
                        δ = 1    δ = 2    δ = 3      δ = 1    δ = 2    δ = 3
True        P10         .320     .168     .074       .325     .170     .073
            sd P10      .046     .033     .019       .028     .021     .012
            P20         .351     .189     .086       .303     .154     .065
            sd P20      .041     .033     .020       .032     .021     .011
Model I     P11         .308     .176     .078       .312     .164     .072
            sd P11      .034     .038     .021       .013     .012     .009
            P21         .313     .159     .071       .306     .156     .066
            sd P21      .034     .023     .018       .013     .013     .010
Model II    P12         .309     .166     .077       .316     .166     .073
(.25,.25)   sd P12      .041     .034     .019       .028     .020     .011
            P22         .325     .173     .072       .304     .155     .065
            sd P22      .048     .038     .018       .029     .019     .011
Model II    P12         .306     .163     .073       .320     .167     .074
(.10,.10)   sd P12      .035     .029     .017       .024     .018     .012
            P22         .332     .176     .076       .304     .154     .064
            sd P22      .040     .032     .021       .025     .018     .011
Model II    P12         .314     .174     .078       .329     .176     .078
(.25,.10)   sd P12      .043     .034     .020       .026     .019     .013
            P22         .320     .167     .071       .298     .146     .061
            sd P22      .044     .033     .018       .027     .017     .011

p = 10                           n = 25                       n = 100
                        δ = 1    δ = 2    δ = 3      δ = 1    δ = 2    δ = 3
True        P10         .358     .210     .104       .329     .172     .074
            sd P10      .079     .061     .037       .035     .025     .014
            P20         .387     .226     .112       .320     .168     .073
            sd P20      .059     .040     .025       .028     .020     .011
Model I     P11         .320     .181     .088       .315     .169     .075
            sd P11      .043     .027     .022       .024     .018     .014
            P21         .320     .181     .109       .308     .160     .071
            sd P21      .042     .026     .031       .024     .020     .011
Model II    P12         .338     .195     .093       .313     .165     .073
(.25,.25)   sd P12      .058     .040     .026       .032     .020     .012
            P22         .365     .205     .107       .315     .164     .071
            sd P22      .067     .047     .028       .031     .019     .010
Model II    P12         .355     .204     .100       .321     .168     .073
(.10,.10)   sd P12      .071     .049     .028       .036     .024     .013
            P22         .378     .222     .111       .320     .166     .072
            sd P22      .061     .041     .024       .031     .021     .011
Model II    P12         .360     .214     .103       .329     .177     .077
(.25,.10)   sd P12      .067     .048     .031       .034     .023     .013
            P22         .365     .207     .109       .306     .157     .068
            sd P22      .066     .047     .024       .032     .020     .011
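As a small check (not part of the paper), the non-zero $(V_1, V_2)$ entries of Table 1 can be reproduced as the upper 25% and 10% points of the $\chi_p^2$ distribution using SciPy:

```python
from scipy.stats import chi2

# Upper 25% and 10% points of chi-square with p degrees of freedom,
# rounded to one decimal; they match the (V1, V2) entries of Table 1.
for p in (4, 10):
    print(p, round(chi2.ppf(0.75, p), 1), round(chi2.ppf(0.90, p), 1))
# 4 5.4 7.8
# 10 12.5 16.0
```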