Testing for Outliers from a Mixture Distribution When Some Data Are Missing

Wayne A. Woodward* and Stephan R. Sain**

* Southern Methodist University
** University of Colorado at Denver

ABSTRACT

We consider the problem of multivariate outlier testing from a population from which a training sample is available. We assume that a new observation is obtained, and we test whether the new observation is from the population of the training sample. Problems of this sort arise in a number of applications including nuclear monitoring, biometrics (including fingerprint and handwriting identification), and medical diagnosis. In many cases it is reasonable to model the population of the training sample using a mixture-of-normals model (e.g., when the observations come from a variety of sources or the data are substantially non-normal). In this paper we consider a modified likelihood ratio test that is applicable to the case in which: (a) the training data follow a mixture-of-normals distribution, (b) all labels in the training sample are missing, (c) some of the observation vectors in the training sample have missing information, and (d) the number of components in the mixture is unknown. The approach often used in practice to handle the fact that some of the data vectors have missing observations is to perform the test based only on the data vectors with full data. When large amounts of data are missing, use of this strategy may lead to loss of valuable information, especially in the case of small training samples, which, for example, is often the case in the nuclear monitoring setting mentioned previously. An alternative procedure is to incorporate all n of the data vectors using the EM algorithm to handle the missing data. We use simulations and examples to compare the use of the EM algorithm on the entire data set with the use of only the complete data vectors.

Key Words: EM Algorithm, Mixture Model, Missing Data, Outlier Detection

1. Introduction

We consider the problem of testing a new data value to determine whether it should be considered an outlier from a distribution for which we have a training sample, i.e., "outlier testing." Fisk, Gray, and McCartor (1996) and Taylor and Hartse (1997) have used a likelihood ratio test for detecting outliers from a multivariate normal (MVN) distribution fit to the training data when no data were missing. These authors applied the test to the problem of detecting seismic signals of underground nuclear explosions when a training sample of non-nuclear seismic events is available.

Our focus in this paper will be the case in which the training data are modeled as a mixture of normals. A mixture model is an obvious choice for a wide variety of settings. For example, in the seismic setting discussed above, the population of non-nuclear observations in a particular region may consist of observations from a variety of sources such as earthquakes and mining explosions, and differing types of earthquakes. Additionally, in the area of medical diagnosis, benign tumors may be of several types, etc. The flexibility of the mixture-of-normals model also makes it useful for modeling non-normality even if distinguishable components are not present.

The training data will be considered a sample of size n from a mixture distribution whose density is given by

$$f(x) = \sum_{i=1}^{m} \lambda_i f_i(x; \mu_i, \Sigma_i), \qquad (1)$$

where $m$ is the number of components in the mixture, $f_i(x; \mu_i, \Sigma_i)$ is the MVN density with mean vector $\mu_i$ and covariance matrix $\Sigma_i$ associated with the $i$th component, the $\lambda_i$, $i = 1, \ldots, m$, are the mixing proportions, and $x$ is a $d$-dimensional vector of variables. Letting the training sample be denoted by $X_1, \ldots, X_n$ and the new observation (whose distribution is unknown) by $X_u$, we wish to test the hypotheses

$$H_0: X_u \in \Pi \qquad H_1: X_u \notin \Pi$$

where $\Pi$ denotes the population of the training data. We consider the case in which data may be missing in the training data.
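To make the mixture density in (1) concrete, the following is a minimal sketch that evaluates a mixture-of-normals density at a point using only the Python standard library. The function names (`cholesky`, `mvn_pdf`, `mixture_pdf`) and the two-component parameter values are hypothetical and chosen only for illustration; they do not come from the paper.

```python
import math

def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low

def mvn_pdf(x, mean, cov):
    """Multivariate normal density f_i(x; mu_i, Sigma_i)."""
    d = len(x)
    low = cholesky(cov)
    # Forward substitution solves low * z = (x - mean);
    # the sum of squares of z is the squared Mahalanobis distance.
    z = []
    for i in range(d):
        s = sum(low[i][k] * z[k] for k in range(i))
        z.append((x[i] - mean[i] - s) / low[i][i])
    quad = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(d))
    return math.exp(-0.5 * (d * math.log(2.0 * math.pi) + log_det + quad))

def mixture_pdf(x, weights, means, covs):
    """Mixture density f(x) = sum_i lambda_i * f_i(x; mu_i, Sigma_i), as in (1)."""
    return sum(w * mvn_pdf(x, mu, cov) for w, mu, cov in zip(weights, means, covs))

# Illustrative (made-up) two-component mixture in d = 2 dimensions.
weights = [0.6, 0.4]                      # mixing proportions, must sum to 1
means = [[0.0, 0.0], [3.0, 3.0]]          # component mean vectors
covs = [[[1.0, 0.0], [0.0, 1.0]],         # component covariance matrices
        [[1.0, 0.0], [0.0, 1.0]]]
print(mixture_pdf([0.0, 0.0], weights, means, covs))
```

The Cholesky route is used here only to avoid an explicit matrix inverse; any numerically sound way of evaluating the MVN density would serve equally well.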
In the case of a mixture model, there are at least three different ways in which "data" may be missing:

(a) missing labels
(b) unknown number of components
(c) missing data in the data vectors

A "label" is said to be known for a given observation if it is known to which component in the mixture that observation belongs. Wang, Woodward, Gray, Wiechecki, and Sain (1997) developed a modified likelihood ratio test for the case in which some but not all of the labels may be missing. The authors assumed that the number of components, $m$, is known and that there is no missing data in the data vectors. The likelihood function under $H_0$ (i.e., under the assumption that $X_u \in \Pi$) is denoted by $L_0(\theta)$, where $\theta$ is an unknown vector-valued parameter associated with the distribution of $X$ under $H_0$. Likewise, let

$$L_1(\theta) = \prod_{s=1}^{n} f(X_s; \theta)$$

denote the likelihood based only on the training sample $X_1, \ldots, X_n$. Wang et al. (1997) and Sain, Gray, Woodward, and Fisk (1999) used the modified likelihood-ratio test statistic

$$W = \frac{\sup_{\theta} L_0(\theta)}{\sup_{\theta} L_1(\theta)}. \qquad (2)$$

The usual likelihood ratio involves a second factor in the denominator, $h(x_u)$, where $h(x)$ is the density function of the outlier population and $X_u$ denotes a single observation available from that population. However, estimating $h(x)$ is very difficult with only one observation available and when a priori information is not available concerning the outlier distribution. Thus, it makes sense to estimate $h(x)$ nonparametrically. Moreover, given any of the potential nonparametric density estimators of $h(x)$ in the case of only one data point, the factor $h(x_u)$ will not vary with $\theta$ in the maximization process, nor will $h(x_u)$ vary as $X_u$ varies from sample to sample. Consider, for example, a histogram estimator of $h(x)$. With a single data value, such an estimator would be a constant regardless of the value of $x$. Thus, for simplicity we use $W$ in (2). It is easily seen in (2) that if $X_u$ does not belong to $\Pi$, then $W$ will tend to be small. Hence the rejection region is of the form $W \le W_\alpha$ for some $W_\alpha$ picked to provide a level $\alpha$ test.
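The behavior of the modified likelihood ratio statistic can be illustrated with a deliberately simplified sketch. It assumes that W is the ratio of the maximized likelihood of the training sample pooled with the new observation to the maximized likelihood of the training sample alone, and it replaces the mixture model with a single univariate normal (m = 1, d = 1) so that both maximizations have closed forms. This is not the EM-based mixture procedure studied in the paper; the function names and data values are hypothetical.

```python
import math

def max_log_lik_normal(data):
    """Maximized log-likelihood of a univariate normal (MLE mean and variance).

    Requires at least two distinct values so that the MLE variance is positive.
    """
    n = len(data)
    mean = sum(data) / n
    var = sum((v - mean) ** 2 for v in data) / n  # the MLE divides by n
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)

def log_w(train, x_new):
    """log W = sup log L0 - sup log L1 under the simplifying assumptions above.

    L0 uses the training sample pooled with the new observation;
    L1 uses the training sample only.
    """
    return max_log_lik_normal(train + [x_new]) - max_log_lik_normal(train)

# Made-up training sample for illustration.
train = [-1.2, -0.4, 0.1, 0.3, 0.9, 1.5, -0.7, 0.6]

# A value near the bulk of the training data gives a larger W than a distant one.
print(log_w(train, 0.2), log_w(train, 6.0))
```

Working on the log scale avoids numerical underflow when n is large; the rejection rule W ≤ W_α is unchanged because the logarithm is monotone.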
Since the null distribution of $W$ has no known closed form, Wang et al. (1997) used a bootstrap procedure (see Efron and Tibshirani, 1993) to derive the critical value $W_\alpha$. Whenever some of the training data are unlabeled, the parameters $\lambda_i$, $\mu_i$, and $\Sigma_i$ of the mixture model are estimated via the Expectation-Maximization (EM) algorithm (see Dempster, Laird, and Rubin, 1977; McLachlan and Krishnan, 1997; and Redner and Walker, 1984). Based on simulations, Wang et al. (1997) showed that in this setting, the modified likelihood ratio test can be used successfully for outlier detection.

Sain et al. (1999) extended the results of Wang et al. (1997) to the case in which no data are labeled and in which the number of components in the mixture is unknown. They demonstrated their results using simulations similar to those of Wang et al. (1997) and showed little or no loss of power when no training data are labeled. Sain et al. (1999) obtained excellent results using their procedure on actual seismic data from the Vogtland region near the Czech-German border and from the WMQ station in western China. Using the China data, the authors demonstrated that a mixture model may be preferable to the use of a single multivariate normal model due to apparent non-normality of the data, even when there are not any identifiable groups of observation types represented in the training data.

In this paper we consider the case in which some of the variables may be missing for some observations in the training sample. For example, in the seismic setting, log(Pg/Lg) ratios at higher frequency bands are often missing because of attenuation effects on high frequencies. We consider the case in which $d$ variables are observed on the new observation, and we denote this observation by $X_u = (X_{u1}, X_{u2}, \ldots, X_{ud})'$, where $X_{uj}$ denotes variable $j$ observed on the new observation. We further assume that there exists a training sample

$$\begin{aligned}
X_1 &= (X_{11}, X_{12}, X_{13}, \ldots, X_{1d})' \\
X_2 &= (X_{21}, X_{22}, X_{23}, \ldots, X_{2d})' \\
&\;\;\vdots \\
X_n &= (X_{n1}, X_{n2}, X_{n3}, \ldots, X_{nd})'
\end{aligned}$$

from $\Pi$.
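Returning to the bootstrap calibration of the critical value W_α mentioned above, the idea can be sketched as follows. The sketch assumes a parametric bootstrap: the model is fit to the training data, repeated samples of n training values plus one new observation are drawn from the fitted model (so that H0 holds by construction), and the α-quantile of the resulting statistics is taken as the critical value. As a simplification, a single univariate normal stands in for the mixture model, so no EM step is needed; the exact bootstrap scheme used by Wang et al. (1997) is not specified here, and all names and data values are hypothetical.

```python
import math
import random

def max_log_lik_normal(data):
    """Maximized log-likelihood of a univariate normal (MLE mean and variance)."""
    n = len(data)
    mean = sum(data) / n
    var = sum((v - mean) ** 2 for v in data) / n
    return -0.5 * n * (math.log(2.0 * math.pi * var) + 1.0)

def log_w(train, x_new):
    """log of the modified likelihood ratio: pooled fit minus training-only fit."""
    return max_log_lik_normal(train + [x_new]) - max_log_lik_normal(train)

def bootstrap_critical_value(train, alpha=0.05, n_boot=2000, seed=0):
    """Approximate the alpha-quantile of log W under H0 by parametric bootstrap."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(train)
    mean = sum(train) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in train) / n)
    stats = []
    for _ in range(n_boot):
        # Draw a bootstrap training sample and one "new" observation from the fitted model.
        sample = [rng.gauss(mean, sd) for _ in range(n + 1)]
        stats.append(log_w(sample[:n], sample[n]))
    stats.sort()
    return stats[max(0, int(alpha * n_boot) - 1)]

# Made-up training sample: 20 evenly spaced values from -1.0 to 0.9.
train = [0.1 * k for k in range(-10, 10)]
crit = bootstrap_critical_value(train)

# Reject H0 (declare an outlier) when log W falls at or below the critical value.
for x_new in (-0.05, 10.0):
    print(x_new, log_w(train, x_new) <= crit)
```

In the mixture setting each bootstrap replicate would require refitting the mixture (for example, by EM), which makes the calibration far more expensive than in this single-normal illustration.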
When some of the training data include missing data, the training sample has the general appearance

$$\begin{aligned}
X_1 &= (X_{11}, -, X_{13}, \ldots, X_{1j}, -, X_{1,j+2}, \ldots, X_{1d})' \\
X_2 &= (-, X_{22}, X_{23}, \ldots, X_{2j}, X_{2,j+1}, \ldots, X_{2d})' \\
&\;\;\vdots \\
X_n &= (X_{n1}, -, X_{n3}, \ldots, X_{nj}, X_{n,j+1}, X_{n,j+2}, \ldots, X_{nd})'
\end{aligned}$$

where "$-$" denotes that the particular variable is missing for that observation. Thus, to apply standard or recently developed outlier detection methodology, one must reduce the data to a subset of the original training data that includes only those data vectors (fewer than $n$) for which all of the variables were observed. It is clear that such a procedure can result in a loss of information and should lead to a reduction in detection power. To demonstrate the extent of this problem, Woodward, Sain, Gray, Zhao, and Fisk (2002) show that, depending on the missing data probabilities and the number of variables, the number of complete vectors available for analysis may be dramatically smaller than the number of original cases. For example, in the case of $d = 4$ variables and a 25% chance that an observation will be missing, we expect fewer than one-third of the data vectors to be retained for analysis using the strategy of analyzing only the complete data vectors (if values are missing independently, the probability that a vector is complete is $0.75^4 \approx 0.32$). This is in spite of the fact that about 75% of the original data set should be available for use.

Another problem arises if there are no cases or only a very few cases in which all $d$ of the variables are observed. If the strategy of using only the complete vectors is used, then some of the variables may need to be deleted. This may also result in loss of some important information. It should also be pointed out that we are not considering the case in which there is missing data in the outlier. In an application of the techniques developed here, the variables observed in the outlier determine the variables to be used in the outlier test.

The purpose of this paper is to examine the extent to which detection power can be improved by
