Focus Article

Robust statistics for outlier detection

Peter J. Rousseeuw and Mia Hubert

When analyzing data, outlying observations cause problems because they may strongly influence the result. Robust statistics aims at detecting the outliers by searching for the model fitted by the majority of the data. We present an overview of several robust methods and outlier detection tools. We discuss robust procedures for univariate, low-dimensional, and high-dimensional data, such as estimation of location and scatter, linear regression, principal component analysis, and classification. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011, 1:73–79. DOI: 10.1002/widm.2

INTRODUCTION

In real data sets, it often happens that some observations are different from the majority. Such observations are called outliers. Outlying observations may be errors, or they could have been recorded under exceptional circumstances, or belong to another population. Consequently, they do not fit the model well. It is very important to be able to detect these outliers.

In practice, one often tries to detect outliers using diagnostics starting from a classical fitting method. However, classical methods can be affected by outliers so strongly that the resulting fitted model does not allow the deviating observations to be detected. This is called the masking effect. In addition, some good data points might even appear to be outliers, which is known as swamping. To avoid these effects, the goal of robust statistics is to find a fit that is close to the fit we would have found without the outliers. We can then identify the outliers by their large deviation from that robust fit.

First, we describe some robust procedures for estimating univariate location and scale. Next, we discuss multivariate location and scatter, as well as linear regression. We also give a summary of available robust methods for principal component analysis (PCA), classification, and clustering. For a more extensive review, see Ref 1. Some full-length books on this topic are Refs 2, 3.

∗Correspondence to: [email protected]. Renaissance Technologies, New York, and Katholieke Universiteit Leuven, Belgium.

ESTIMATING UNIVARIATE LOCATION AND SCALE

As an example, suppose we have five measurements of a length:

$$ 6.27, \; 6.34, \; 6.25, \; 6.31, \; 6.28 \qquad (1) $$

and we want to estimate its true value. For this, one usually computes the mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, which in this case equals $\bar{x} = (6.27 + 6.34 + 6.25 + 6.31 + 6.28)/5 = 6.29$. Let us now suppose that the fourth measurement has been recorded wrongly and the data become

$$ 6.27, \; 6.34, \; 6.25, \; 63.10, \; 6.28. \qquad (2) $$

In this case, we obtain $\bar{x} = 17.65$, which is far from the unknown true value. On the other hand, we could also compute the median of these data. To this end, we sort the observations in (2) from smallest to largest:

$$ 6.25 \le 6.27 \le 6.28 \le 6.34 \le 63.10. $$

The median is then the middle value, yielding 6.28, which is still reasonable. We say that the median is more robust against an outlier.

More generally, the location-scale model states that the $n$ univariate observations $x_i$ are independent and identically distributed (i.i.d.) with distribution function $F[(x - \mu)/\sigma]$, where $F$ is known. Typically, $F$ is the standard Gaussian distribution function $\Phi$. We then want to find estimates for the center $\mu$ and the scale parameter $\sigma$.

The classical estimate of location is the mean. As we saw above, the mean is very sensitive to even 1 aberrant value out of the $n$ observations. We say that the breakdown value4,5 of the sample mean is $1/n$, so it is 0% for large $n$.
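The toy example can be reproduced with a few lines of base R:

    # Measurements (1) and their contaminated version (2)
    x_clean <- c(6.27, 6.34, 6.25, 6.31, 6.28)
    x_cont  <- c(6.27, 6.34, 6.25, 63.10, 6.28)  # fourth value recorded wrongly

    mean(x_clean)    # 6.29   -- close to the true value
    mean(x_cont)     # 17.648 -- dragged far away by the single outlier
    median(x_clean)  # 6.28
    median(x_cont)   # 6.28   -- unaffected by the outlier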

In general, the breakdown value is the smallest proportion of observations in the data set that need to be replaced to carry the estimate arbitrarily far away. See Ref 6 for precise definitions and extensions. The robustness of an estimator is also measured by its influence function,7 which measures the effect of one outlier. The influence function of the mean is unbounded, which again illustrates that the mean is not robust.

For a general definition of the median, we denote the $i$th ordered observation as $x_{(i)}$. Then, the median is $x_{((n+1)/2)}$ if $n$ is odd, and $[x_{(n/2)} + x_{(n/2+1)}]/2$ if $n$ is even. Its breakdown value is about 50%, meaning that the median can resist up to 50% of outliers, and its influence function is bounded. Both properties illustrate the median's robustness.

The situation for the scale parameter $\sigma$ is similar. The classical estimator is the standard deviation $s = \sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2/(n-1)}$. Because a single outlier can already make $s$ arbitrarily large, its breakdown value is 0%. For instance, for the clean data (1) above we have $s = 0.035$, whereas for the data (2) with the outlier we obtain $s = 25.41$! A robust measure of scale is the median of all absolute deviations from the median (MAD), given by

$$ \mathrm{MAD} = 1.483 \, \operatorname{median}_{i=1,\dots,n} \left| x_i - \operatorname{median}_{j=1,\dots,n}(x_j) \right|. \qquad (3) $$

The constant 1.483 is a correction factor that makes the MAD unbiased at the normal distribution. The MAD of (2) is the same as that of (1), namely 0.044. We can also use the Qn estimator,8 defined as

$$ Q_n = 2.2219 \, \bigl\{ |x_i - x_j| ;\; i < j \bigr\}_{(k)} $$

with $k = \binom{h}{2} \approx \binom{n}{2}/4$ and $h = \lfloor n/2 \rfloor + 1$. Here, $\lfloor \cdot \rfloor$ rounds down to the nearest integer. This scale estimator is thus the first quartile of all pairwise differences between two data points. The breakdown value of both the MAD and the Qn estimator is 50%.

Also popular is the interquartile range (IQR), defined as the difference between the third and first quartiles, that is, $\mathrm{IQR} = x_{(\lceil 3n/4 \rceil)} - x_{(\lceil n/4 \rceil)}$ (where $\lceil \cdot \rceil$ rounds up to the nearest integer). Its breakdown value is only 25%, but it has a simple interpretation.

The robustness of the median and the MAD comes at a cost: at the normal model they are less efficient than the mean. To find a better balance between robustness and efficiency, many other robust procedures have been proposed, such as M-estimators.9 They are defined implicitly as the solution of the equation

$$ \sum_{i=1}^{n} \psi\!\left( \frac{x_i - \hat{\theta}}{\hat{\sigma}} \right) = 0 \qquad (4) $$

for a real function $\psi$. The denominator $\hat{\sigma}$ is an initial robust scale estimate such as the MAD. A solution to (4) can be found by the Newton–Raphson algorithm, starting from the initial location estimate $\hat{\theta}^{(0)} = \operatorname{median}_i(x_i)$. Popular choices for $\psi$ are the Huber function $\psi(x) = x \min(1, c/|x|)$ and Tukey's biweight function $\psi(x) = x\,(1 - (x/c)^2)^2 \, I(|x| \le c)$. These M-estimators contain a tuning parameter $c$, which needs to be chosen in advance.
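A minimal R sketch of these scale estimators, and of a Huber-type M-estimate of location, can be based on the robustbase package (its huberM and Qn functions are listed in the Software Availability section); the tuning constant k = 1.5 below is just a common default:

    library(robustbase)

    x_clean <- c(6.27, 6.34, 6.25, 6.31, 6.28)
    x_cont  <- c(6.27, 6.34, 6.25, 63.10, 6.28)

    sd(x_clean);  sd(x_cont)    # about 0.035 vs 25.41: one outlier explodes s
    mad(x_clean); mad(x_cont)   # about 0.044 for both (mad() uses the 1.483 factor)
    Qn(x_cont)                  # Qn scale estimate, 50% breakdown value

    # Huber M-estimate of location, i.e. equation (4) with the Huber psi function;
    # huberM() starts from the median and uses the MAD as initial scale.
    huberM(x_cont, k = 1.5)$mu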

People often use rules to detect outliers. The classical rule is based on the z-scores of the observations, given by

$$ z_i = \frac{x_i - \bar{x}}{s} \qquad (5) $$

where $s$ is the standard deviation. More precisely, the rule flags $x_i$ as outlying if $|z_i|$ exceeds 2.5, say. But in the above-mentioned example (2) with the outlier, the z-scores are

$$ -0.45, \; -0.45, \; -0.45, \; 1.79, \; -0.45 $$

so none of them attains 2.5. The largest value is only 1.79, which is quite similar to the largest z-score for the clean data (1), which equals 1.41. The z-score of the outlier is small because it subtracts the nonrobust mean (which was drawn toward the outlier) and because it divides by the nonrobust standard deviation (which the outlier has made much larger than in the clean data). Plugging robust estimators of location and scale into (5), such as the median and the MAD, yields the robust scores

$$ \frac{x_i - \operatorname{median}_{j=1,\dots,n}(x_j)}{\mathrm{MAD}} \qquad (6) $$

which are more useful; in the contaminated example (2), the robust scores are

$$ -0.22, \; 1.35, \; -0.67, \; 1277.5, \; 0.0 $$

in which the outlier greatly exceeds the 2.5 cutoff.

Also Tukey's boxplot is often used to pinpoint possible outliers. In this plot, a box is drawn from the first quartile $Q_1 = x_{(\lceil n/4 \rceil)}$ to the third quartile $Q_3 = x_{(\lceil 3n/4 \rceil)}$ of the data. Points outside the interval $[Q_1 - 1.5\,\mathrm{IQR}, \; Q_3 + 1.5\,\mathrm{IQR}]$, called the fence, are traditionally marked as outliers. Note that the boxplot assumes symmetry because we add the same amount to $Q_3$ as we subtract from $Q_1$. At asymmetric distributions, the usual boxplot typically flags many regular data points as outlying. The skewness-adjusted boxplot corrects for this by using a robust measure of skewness in determining the fence.10
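The contrast between the classical z-scores (5) and the robust scores (6) is easy to reproduce in base R; the skewness-adjusted boxplot is available as adjbox() in the robustbase package:

    x_cont <- c(6.27, 6.34, 6.25, 63.10, 6.28)

    z_classical <- (x_cont - mean(x_cont)) / sd(x_cont)
    z_robust    <- (x_cont - median(x_cont)) / mad(x_cont)
    round(z_classical, 2)        # -0.45 -0.45 -0.45  1.79 -0.45 : nothing exceeds 2.5
    round(z_robust, 2)           # -0.22  1.35 -0.67  1277.5  0.00
    which(abs(z_robust) > 2.5)   # flags observation 4, the outlier

    # library(robustbase); adjbox(x_cont)   # adjusted boxplot for skewed data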


MULTIVARIATE LOCATION AND COVARIANCE ESTIMATION

From now on, we assume that the data are $p$-dimensional and are stored in an $n \times p$ data matrix $X = (x_1, \dots, x_n)^T$ with $x_i = (x_{i1}, \dots, x_{ip})^T$ the $i$th observation. Classical measures of location and scatter are given by the empirical mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ and the empirical covariance matrix $S_x = \sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})^T/(n-1)$. As in the univariate case, both classical estimators have a breakdown value of 0%, that is, a small fraction of outliers can completely ruin them.

Robust estimates of location and scatter can be obtained by the minimum covariance determinant (MCD) method of Rousseeuw.11,12 The MCD looks for those $h$ observations in the data set (where the number $h$ is given by the user) whose classical covariance matrix has the lowest possible determinant. The MCD estimate of location $\hat{\mu}_0$ is then the average of these $h$ points, whereas the MCD estimate of scatter $\hat{\Sigma}_0$ is their covariance matrix, multiplied by a consistency factor. We can then give each $x_i$ some weight $w_i$, for instance, by putting $w_i = 1$ if the initial robust distance

$$ \mathrm{RD}(x_i, \hat{\mu}_0, \hat{\Sigma}_0) = \sqrt{(x_i - \hat{\mu}_0)^T \hat{\Sigma}_0^{-1} (x_i - \hat{\mu}_0)} \;\le\; \sqrt{\chi^2_{p,0.975}} \qquad (7) $$

and $w_i = 0$ otherwise. The weighted mean $\hat{\mu}_w = (\sum_{i=1}^{n} w_i x_i)/(\sum_{i=1}^{n} w_i)$ and covariance matrix $\hat{\Sigma}_w = (\sum_{i=1}^{n} w_i (x_i - \hat{\mu}_w)(x_i - \hat{\mu}_w)^T)/(\sum_{i=1}^{n} w_i - 1)$ then have a better finite-sample efficiency. The final robust distances $\mathrm{RD}(x_i)$ are obtained by inserting $\hat{\mu}_w$ and $\hat{\Sigma}_w$ into (7).

The MCD estimator, as well as its weighted version, has a bounded influence function and breakdown value $(n - h + 1)/n$, hence the number $h$ determines the robustness of the estimator. The MCD has its highest possible breakdown value when $h = \lfloor (n + p + 1)/2 \rfloor$. When a large proportion of contamination is expected, $h$ should thus be chosen close to $0.5n$. Otherwise an intermediate value for $h$, such as $0.75n$, is recommended to obtain a higher finite-sample efficiency. We refer to Ref 13 for an overview of the MCD estimator and its properties.

The computation of the MCD estimator is nontrivial and naively requires an exhaustive investigation of all $h$-subsets out of $n$. Rousseeuw and Van Driessen14 constructed a much faster algorithm called FAST-MCD. It starts by randomly drawing many subsets of $p + 1$ observations from the data set. On the basis of these subsets, $h$-subsets are constructed by so-called C-steps (see Ref 14 for details).

Many other robust estimators of location and scatter have been presented in the literature. The first such estimator was proposed by Stahel15 and Donoho16 (see also Ref 17). They defined the so-called Stahel–Donoho outlyingness of a data point $x_i$ as

$$ \mathrm{outl}(x_i) = \max_{d} \frac{\left| x_i^T d - \operatorname{median}_{j=1,\dots,n}(x_j^T d) \right|}{\mathrm{MAD}_{j=1,\dots,n}(x_j^T d)} \qquad (8) $$

where the maximum is over all directions (i.e., all $p$-dimensional unit length vectors $d$), and $x_j^T d$ is the projection of $x_j$ on the direction $d$. Next, they gave each observation a weight $w_i$ based on $\mathrm{outl}(x_i)$, and computed the resulting weighted mean and covariance matrix.

Multivariate M-estimators18 have a relatively low breakdown value due to possible implosion of the estimated scatter matrix. More recently, robust estimators of multivariate location and scatter include S-estimators,2,19 MM-estimators,20 and the orthogonalized Gnanadesikan–Kettenring (OGK) estimator.21
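A minimal sketch of MCD-based outlier flagging via the robust distances (7), using covMcd() from the robustbase package; the simulated data matrix and the choice alpha = 0.75 (so that h is roughly 0.75n) are purely illustrative:

    library(robustbase)

    set.seed(1)
    X <- matrix(rnorm(100 * 3), ncol = 3)   # 100 observations in p = 3 dimensions
    X[1:5, ] <- X[1:5, ] + 6                # shift 5 rows to create outliers

    mcd <- covMcd(X, alpha = 0.75)          # reweighted MCD location and scatter
    rd  <- sqrt(mahalanobis(X, center = mcd$center, cov = mcd$cov))
    cutoff <- sqrt(qchisq(0.975, df = ncol(X)))
    which(rd > cutoff)                      # typically flags the 5 shifted rows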
LINEAR REGRESSION

The multiple linear regression model assumes that in addition to the $p$ independent x-variables, a response variable $y$ is also measured, which can be explained by a linear combination of the x-variables. More precisely, the model says that for all observations $(x_i, y_i)$ it holds that

$$ y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \epsilon_i = \beta_0 + \beta^T x_i + \epsilon_i, \qquad i = 1, \dots, n \qquad (9) $$

where the errors $\epsilon_i$ are assumed to be independent and identically distributed with zero mean and constant variance $\sigma^2$. Applying a regression estimator to the data yields $p + 1$ regression coefficients, combined as $\hat{\theta} = (\hat{\beta}_0, \dots, \hat{\beta}_p)^T$. The residual $r_i$ of case $i$ is defined as the difference between the observed response $y_i$ and its estimated value $\hat{y}_i$.

The classical least squares (LS) method to estimate $\theta$ minimizes the sum of the squared residuals. It is popular because it allows one to compute the regression estimates explicitly, and it is optimal if the errors are normally distributed. However, LS is extremely sensitive to regression outliers, that is, observations that do not obey the linear pattern formed by the majority of the data.
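A small simulated illustration of this sensitivity with base R's lm(): a single grossly wrong response is enough to pull the LS slope away from its true value.

    set.seed(2)
    x <- 1:20
    y <- 2 + 0.5 * x + rnorm(20, sd = 0.3)
    coef(lm(y ~ x))     # close to the true intercept 2 and slope 0.5

    y[20] <- -10        # one badly recorded response at the largest x
    coef(lm(y ~ x))     # the LS slope is pulled well away from 0.5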


The least trimmed squares estimator (LTS) proposed by Rousseeuw11 is given by

$$ \operatorname{minimize} \; \sum_{i=1}^{h} (r^2)_{(i)} \qquad (10) $$

where $(r^2)_{(1)} \le (r^2)_{(2)} \le \dots \le (r^2)_{(n)}$ are the ordered squared residuals. (They are first squared and then ordered.) The value $h$ plays the same role as in the MCD estimator. For $h \approx n/2$, we find a breakdown value of 50%, whereas for larger $h$, we obtain roughly $(n - h)/n$. A fast algorithm for the LTS estimator (FAST-LTS) has been developed.22

The scale of the errors $\sigma$ can be estimated by $\hat{\sigma}^2_{\mathrm{LTS}} = c_{h,n} \frac{1}{h} \sum_{i=1}^{h} (r^2)_{(i)}$, where $r_i$ are the residuals from the LTS fit and $c_{h,n}$ is a constant that makes $\hat{\sigma}$ unbiased at Gaussian error distributions. We can then identify regression outliers by their standardized LTS residuals $r_i / \hat{\sigma}_{\mathrm{LTS}}$. We can also use the standardized LTS residuals to assign a weight to every observation. The weighted LS estimator with these LTS weights inherits the nice robustness properties of LTS, but is more efficient and yields all the usual inferential output, such as t-statistics, F-statistics, an $R^2$ statistic, and the corresponding p-values.

The outlier map23 plots the standardized LTS residuals versus robust distances (7) based on (for instance) the MCD estimator, which is applied to the x-variables only. This allows one to distinguish vertical outliers (with small robust distances and outlying residuals), good leverage points (with outlying robust distances but small residuals), and bad leverage points (with outlying robust distances and outlying residuals).

The earliest theory of robust regression was based on M-estimators,24 R-estimators,25 and L-estimators.26 The breakdown value of all these methods is 0% because of their vulnerability to bad leverage points. Generalized M-estimators (GM-estimators)7 were the first to attain a positive breakdown value, which unfortunately still went down to zero for increasing $p$.

The low finite-sample efficiency of LTS can be improved by replacing its objective function by a more efficient scale estimator applied to the residuals $r_i$. This approach has led to the introduction of regression S-estimators27 and MM-estimators.28
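A sketch of LTS regression and of an MM-type fit in R, using ltsReg() and lmrob() from the robustbase package; the data are simulated, and the component names (residuals, scale) follow robustbase's documented ltsReg output:

    library(robustbase)

    set.seed(3)
    x1 <- rnorm(50); x2 <- rnorm(50)
    y  <- 1 + 2 * x1 - x2 + rnorm(50, sd = 0.5)
    y[1:3] <- y[1:3] + 15                    # three regression outliers

    fit_lts <- ltsReg(y ~ x1 + x2, alpha = 0.75)
    coef(fit_lts)                            # close to (1, 2, -1) despite the outliers
    std_res <- fit_lts$residuals / fit_lts$scale
    which(abs(std_res) > 2.5)                # flags cases 1 to 3

    fit_mm <- lmrob(y ~ x1 + x2)             # MM-estimator with the usual inference
    summary(fit_mm)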
FURTHER DIRECTIONS

Principal Component Analysis

PCA is a very popular dimension-reduction method. It tries to explain the covariance structure of the data by a small number of components. These components are linear combinations of the original variables and often allow for an interpretation and a better understanding of the different sources of variation. PCA is often the first step of the data analysis, followed by other multivariate techniques.

In the classical approach, the first principal component corresponds to the direction in which the projected observations have the largest variance. The second component is then orthogonal to the first and again maximizes the variance of the data points projected on it. Continuing in this way produces all the principal components, which correspond to the eigenvectors of the empirical covariance matrix. Unfortunately, both the classical variance (which is being maximized) and the classical covariance matrix (which is being decomposed) are very sensitive to anomalous observations. Consequently, the first components from classical PCA are often attracted toward outlying points and may not capture the variation of the regular observations.

A first group of robust PCA methods is obtained by replacing the classical covariance matrix by a robust covariance estimator such as the weighted MCD estimator or MM-estimators.29,30 Unfortunately, the use of these covariance estimators is limited to small-to-moderate dimensions because they are not defined when $p$ is larger than $n$.

A second approach to robust PCA uses projection pursuit techniques. These methods maximize a robust measure of spread to obtain consecutive directions on which the data points are projected; see Refs 31–34.

The ROBPCA35 approach is a hybrid, which combines ideas of projection pursuit and robust covariance estimation. The projection pursuit part is used for the initial dimension reduction. Some ideas based on the MCD estimator are then applied to this lower-dimensional data space.

For outlier detection, a PCA outlier map35 can be constructed, similar to the regression outlier map. It plots for every observation its (Euclidean) orthogonal distance to the PCA subspace, against its score distance, which measures the robust distance of its projection to the center of all the projected observations. Doing so, four types of observations are visualized. Regular observations have a small orthogonal and a small score distance. Observations with a large score distance but a small orthogonal distance are called good leverage points. Orthogonal outliers have a large orthogonal distance but a small score distance. Bad leverage points have a large orthogonal distance and a large score distance. They lie far outside the space spanned by the robust principal components, and after projection far from the regular data points. Typically, they have a large influence on classical PCA, as the eigenvectors will be tilted toward them.
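A minimal sketch of ROBPCA and its outlier map with the rrcov package (one of the "several robust PCA methods" mentioned in the Software Availability section); the data matrix is simulated, and the function and slot names (PcaHubert, flag) are those provided by rrcov:

    library(rrcov)

    set.seed(4)
    X <- matrix(rnorm(100 * 10), ncol = 10)  # 100 observations, p = 10
    X[1:5, ] <- X[1:5, ] + 8                 # a few clearly outlying rows

    pca <- PcaHubert(X, k = 3)               # ROBPCA with 3 components
    summary(pca)
    plot(pca)                                # outlier map: score vs orthogonal distance
    which(!pca@flag)                         # observations flagged as outliers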

Other proposals for robust PCA include spherical PCA,36 which first projects the data onto a sphere with a robust center, and then applies PCA to these projected data. For a review of robust versions of principal component regression and partial least squares, see Ref 1.

Classification

The goal of classification, also known as discriminant analysis or supervised learning, is to obtain rules that describe the separation between known groups of $p$-dimensional observations. This allows one to classify new observations into one of the groups. We denote the number of groups by $l$ and assume that we can describe each population $\pi_j$ by its density $f_j$. We write $p_j$ for the membership probability, that is, the probability for any observation to come from $\pi_j$.

For low-dimensional data, a popular classification rule results from maximizing the Bayes posterior probability. At Gaussian distributions, this leads to the maximization of the quadratic discriminant scores $d^Q_j(x)$ given by

$$ d^Q_j(x) = -\tfrac{1}{2} \ln|\Sigma_j| - \tfrac{1}{2}(x - \mu_j)^T \Sigma_j^{-1} (x - \mu_j) + \ln(p_j). \qquad (11) $$

When all the covariance matrices are assumed to be equal, these scores can be simplified to

$$ d^L_j(x) = \mu_j^T \Sigma^{-1} x - \tfrac{1}{2} \mu_j^T \Sigma^{-1} \mu_j + \ln(p_j) \qquad (12) $$

where $\Sigma$ is the common covariance matrix. This leads to linear classification boundaries. Robust classification rules can be obtained by replacing the classical covariance matrices by robust alternatives such as the MCD estimator and S-estimators, as in Refs 37–40.

When the data are high dimensional, this approach cannot be applied anymore because the robust covariance estimators become uncomputable. This problem can be solved by first applying PCA to the entire set of observations. Alternatively, one can also apply a PCA method to each group separately. This is the idea behind the soft independent modeling of class analogy (SIMCA) method.41 A robustification of SIMCA is obtained by first applying robust PCA to each group, and then constructing a classification rule for new observations based on their orthogonal distance to each subspace and their score distance within each subspace.42

When linear models are not appropriate, support vector machines (SVM) are powerful tools to handle nonlinear structures.43 An SVM with an unbounded kernel (such as a linear kernel) is not robust and suffers the same problems as traditional linear classifiers. But when a bounded kernel is used, the resulting nonlinear SVM classification handles outliers quite well.
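A minimal sketch of robust discriminant analysis with the rrcov package, whose Linda() and QdaCov() functions replace the classical covariance matrices in (11) and (12) by robust estimates; the iris data set and the 'classification' slot name are used purely for illustration:

    library(rrcov)

    rlda <- Linda(iris[, 1:4], grouping = iris$Species)    # robust linear DA
    rqda <- QdaCov(iris[, 1:4], grouping = iris$Species)   # robust quadratic DA

    pred <- predict(rlda, iris[, 1:4])
    head(pred@classification)    # predicted group labels (slot name as in rrcov)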

Clustering

Cluster analysis is an important technique when handling large data sets. It searches for homogeneous groups in the data, which afterward may be analyzed separately. Nonhierarchical cluster methods search for the best clustering in $k$ groups for a user-specified $k$.

For spherical clusters, the most popular method is k-means, which minimizes the sum of the squared Euclidean distances of the observations to the mean of their group.44 This method is not robust as it uses group averages. To overcome this problem, one of the first robust proposals was the Partitioning Around Medoids method.45 It searches for $k$ observations (medoids) such that the sum of the unsquared distances of the observations to the medoid of their group is minimized.

Later on, the more robust trimmed k-means method has been proposed,46 inspired by the trimming idea of the MCD and the LTS. It searches for the $h$-subset (with $h$ as in the definition of MCD and LTS) such that the sum of the squared distances of the observations to the mean of their group is minimized. Consequently, not all observations need to be classified, as $n - h$ cases can be left unassigned. To perform the trimmed k-means clustering, an iterative algorithm has been developed,47 using C-steps that are similar to those in the FAST-MCD and FAST-LTS algorithms. For nonspherical clusters, constrained maximum likelihood approaches48,49 were developed.
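A minimal sketch of robust clustering in R, using pam() from the cluster package (Partitioning Around Medoids) and tclust() from the tclust package (trimmed, constrained clustering); the two-group data and the 5% trimming level are arbitrary choices:

    library(cluster)
    library(tclust)

    set.seed(5)
    X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),   # group 1
               matrix(rnorm(100, mean = 5), ncol = 2),   # group 2
               matrix(runif(10, -10, 15), ncol = 2))     # 5 scattered outliers

    pam_fit <- pam(X, k = 2)              # medoids instead of group means
    tk <- tclust(X, k = 2, alpha = 0.05)  # trims 5% of the observations
    table(tk$cluster)                     # cluster 0 = trimmed (unassigned) points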

Beyond Outlier Detection

There are other aspects to robust statistics apart from outlier detection. For instance, robust estimation can be used in automated settings such as computer vision.50,51 Another aspect is statistical inference such as the construction of robust hypothesis tests, p-values, confidence intervals, and model selection (e.g., variable selection in regression). This aspect is studied in Refs 3 and 7, which also cite earlier work.

SOFTWARE AVAILABILITY

Matlab functions for many of the procedures mentioned in this article are part of the LIBRA toolbox,52,53 which can be downloaded from wis.kuleuven.be/stat/robust. The MCD and LTS estimators are also built into S-PLUS as well as SAS (version 11 or higher) and SAS/IML (version 7 or higher). The free software R provides many robust procedures in the packages robustbase (e.g., huberM, Qn, adjbox, covMcd, covOGK, ltsReg, and lmrob) and rrcov (many robust covariance estimators, linear and quadratic discriminant analysis, and several robust PCA methods). Robust clustering can be performed with the cluster and tclust packages.
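For readers who want to try the R routines named above, the packages can be installed and loaded in the usual way:

    install.packages(c("robustbase", "rrcov", "cluster", "tclust"))
    library(robustbase)  # huberM, Qn, adjbox, covMcd, covOGK, ltsReg, lmrob
    library(rrcov)       # robust covariance, discriminant analysis, robust PCA
    library(tclust)      # trimmed clustering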

REFERENCES

1. Hubert M, Rousseeuw PJ, Van Aelst S. High breakdown robust multivariate methods. Statist Sci 2008, 23:92–119.
2. Rousseeuw PJ, Leroy AM. Robust Regression and Outlier Detection. New York: Wiley Interscience; 1987.
3. Maronna RA, Martin DR, Yohai VJ. Robust Statistics: Theory and Methods. New York: Wiley; 2006.
4. Hampel FR. A general qualitative definition of robustness. Ann Math Stat 1971, 42:1887–1896.
5. Donoho DL, Huber PJ. The notion of breakdown point. In: Bickel P, Doksum K, Hodges JL, eds. A Festschrift for Erich Lehmann. Belmont, MA: Wadsworth; 1983, 157–184.
6. Hubert M, Debruyne M. Breakdown value. Wiley Interdiscipl Rev: Comput Stat 2009, 1:296–302.
7. Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA. Robust Statistics: The Approach Based on Influence Functions. New York: Wiley; 1986.
8. Rousseeuw PJ, Croux C. Alternatives to the median absolute deviation. J Am Statist Assoc 1993, 88:1273–1283.
9. Huber PJ. Robust estimation of a location parameter. Ann Math Stat 1964, 35:73–101.
10. Hubert M, Vandervieren E. An adjusted boxplot for skewed distributions. Comput Stat Data Anal 2008, 52:5186–5201.
11. Rousseeuw PJ. Least median of squares regression. J Am Statist Assoc 1984, 79:871–880.
12. Rousseeuw PJ. Multivariate estimation with high breakdown point. In: Grossmann W, Pflug G, Vincze I, Wertz W, eds. Mathematical Statistics and Applications, Vol. B. Dordrecht, The Netherlands: Reidel Publishing Company; 1985, 283–297.
13. Hubert M, Debruyne M. Minimum covariance determinant. Wiley Interdiscipl Rev: Comput Stat 2010, 2:36–43.
14. Rousseeuw PJ, Van Driessen K. A fast algorithm for the minimum covariance determinant estimator. Technometrics 1999, 41:212–223.
15. Stahel WA. Robuste Schätzungen: Infinitesimale Optimalität und Schätzungen von Kovarianzmatrizen. Ph.D. thesis, ETH Zürich; 1981.
16. Donoho DL, Gasko M. Breakdown properties of location estimates based on halfspace depth and projected outlyingness. Ann Statist 1992, 20:1803–1827.
17. Maronna RA, Yohai VJ. The behavior of the Stahel-Donoho robust multivariate estimator. J Am Statist Assoc 1995, 90:330–341.
18. Maronna RA. Robust M-estimators of multivariate location and scatter. Ann Statist 1976, 4:51–67.
19. Davies L. Asymptotic behavior of S-estimators of multivariate location parameters and dispersion matrices. Ann Statist 1987, 15:1269–1292.
20. Tatsuoka KS, Tyler DE. On the uniqueness of S-functionals and M-functionals under nonelliptical distributions. Ann Statist 2000, 28:1219–1243.
21. Maronna RA, Zamar RH. Robust estimates of location and dispersion for high-dimensional data sets. Technometrics 2002, 44:307–317.
22. Rousseeuw PJ, Van Driessen K. Computing LTS regression for large data sets. Data Min Knowledge Discovery 2006, 12:29–45.
23. Rousseeuw PJ, van Zomeren BC. Unmasking multivariate outliers and leverage points. J Am Statist Assoc 1990, 85:633–651.
24. Huber PJ. Robust Statistics. New York: Wiley; 1981.
25. Jurečková J. Nonparametric estimate of regression coefficients. Ann Math Stat 1971, 42:1328–1338.
26. Koenker R, Portnoy S. L-estimation for linear models. J Am Statist Assoc 1987, 82:851–857.
27. Rousseeuw PJ, Yohai VJ. Robust regression by means of S-estimators. In: Franke J, Härdle W, Martin RD, eds. Robust and Nonlinear Time Series Analysis. Lecture Notes in Statistics No. 26. Springer-Verlag; 1984, 256–272.
28. Yohai VJ. High breakdown point and high efficiency robust estimates for regression. Ann Statist 1987, 15:642–656.


29. Croux C, Haesbroeck G. Principal components analysis based on robust estimators of the covariance or correlation matrix: influence functions and efficiencies. Biometrika 2000, 87:603–618.
30. Salibian-Barrera M, Van Aelst S, Willems G. PCA based on multivariate MM-estimators with fast and robust bootstrap. J Am Statist Assoc 2006, 101:1198–1211.
31. Li G, Chen Z. Projection-pursuit approach to robust dispersion matrices and principal components: primary theory and Monte Carlo. J Am Statist Assoc 1985, 80:759–766.
32. Croux C, Ruiz-Gazen A. High breakdown estimators for principal components: the projection-pursuit approach revisited. J Multivariate Anal 2005, 95:206–226.
33. Hubert M, Rousseeuw PJ, Verboven S. A fast robust method for principal components with applications to chemometrics. Chemom Intell Lab Syst 2002, 60:101–111.
34. Croux C, Filzmoser P, Oliveira MR. Algorithms for projection-pursuit robust principal component analysis. Chemom Intell Lab Syst 2007, 87:218–225.
35. Hubert M, Rousseeuw PJ, Vanden Branden K. ROBPCA: a new approach to robust principal components analysis. Technometrics 2005, 47:64–79.
36. Locantore N, Marron JS, Simpson DG, Tripoli N, Zhang JT, Cohen KL. Robust principal component analysis for functional data. Test 1999, 8:1–73.
37. Hawkins DM, McLachlan GJ. High-breakdown linear discriminant analysis. J Am Statist Assoc 1997, 92:136–143.
38. He X, Fung WK. High breakdown estimation for multiple populations with applications to discriminant analysis. J Multivariate Anal 2000, 72:151–162.
39. Croux C, Dehon C. Robust linear discriminant analysis using S-estimators. Can J Statist 2001, 29:473–492.
40. Hubert M, Van Driessen K. Fast and robust discriminant analysis. Comput Stat Data Anal 2004, 45:301–320.
41. Wold S. Pattern recognition by means of disjoint principal components models. Pattern Recognit 1976, 8:127–139.
42. Vanden Branden K, Hubert M. Robust classification in high dimensions based on the SIMCA method. Chemom Intell Lab Syst 2005, 79:10–21.
43. Steinwart I, Christmann A. Support Vector Machines. New York: Springer; 2008.
44. MacQueen JB. Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, Vol. 1. University of California Press; 1967, 281–297.
45. Kaufman L, Rousseeuw PJ. Finding Groups in Data: An Introduction to Cluster Analysis. New York: Wiley; 1990.
46. Cuesta-Albertos JA, Gordaliza A, Matrán C. Trimmed k-means: an attempt to robustify quantizers. Ann Statist 1997, 25:553–576.
47. García-Escudero LA, Gordaliza A, Matrán C. Trimming tools in exploratory data analysis. J Comput Graph Stat 2003, 12:434–449.
48. Gallegos MT, Ritter G. A robust method for cluster analysis. Ann Statist 2005, 33:347–380.
49. García-Escudero LA, Gordaliza A, Matrán C, Mayo-Iscar A. A general trimming approach to robust cluster analysis. Ann Statist 2008, 36:1324–1345.
50. Meer P, Mintz D, Rosenfeld A, Kim DY. Robust regression methods in computer vision: a review. Int J Comput Vision 1991, 6:59–70.
51. Stewart CV. MINPRAN: a new robust estimator for computer vision. IEEE Trans Pattern Anal Machine Intell 1995, 17:925–938.
52. Verboven S, Hubert M. LIBRA: a Matlab library for robust analysis. Chemom Intell Lab Syst 2005, 75:127–136.
53. Verboven S, Hubert M. Matlab library LIBRA. Wiley Interdiscipl Rev: Comput Stat 2010, in press.
