arXiv:0808.0657v1 [stat.ME] 5 Aug 2008

Statistical Science
2008, Vol. 23, No. 1, 92–119
DOI: 10.1214/088342307000000087
© Institute of Mathematical Statistics, 2008

High-Breakdown Robust Multivariate Methods

Mia Hubert, Peter J. Rousseeuw and Stefan Van Aelst

Mia Hubert is Professor, University Center for Statistics and Department of Mathematics, Katholieke Universiteit Leuven, Celestijnenlaan 200 B, B-3001 Leuven, Belgium (e-mail: [email protected]). Peter J. Rousseeuw is Professor, Department of Mathematics and Computer Science, University of Antwerp, Middelheimlaan 1, B-2020 Antwerp, Belgium (e-mail: [email protected]). Stefan Van Aelst is Professor, Department of Applied Mathematics and Computer Science, Ghent University, Krijgslaan 281 S9, B-9000 Ghent, Belgium (e-mail: [email protected]).

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2008, Vol. 23, No. 1, 92–119. This reprint differs from the original in pagination and typographic detail.

Abstract. When applying a statistical method in practice it often occurs that some observations deviate from the usual assumptions. However, many classical methods are sensitive to outliers. The goal of robust statistics is to develop methods that are robust against the possibility that one or several unannounced outliers may occur anywhere in the data. These methods then allow to detect outlying observations by their residuals from a robust fit. We focus on high-breakdown methods, which can deal with a substantial fraction of outliers in the data. We give an overview of recent high-breakdown robust methods for multivariate settings such as covariance estimation, multiple and multivariate regression, discriminant analysis, principal components and multivariate calibration.

Key words and phrases: Breakdown value, influence function, multivariate statistics, outliers, partial least squares, principal components, regression, robustness.

1. INTRODUCTION

Many multivariate datasets contain outliers, that is, data points that deviate from the usual assumptions and/or from the pattern suggested by the majority of the data. Outliers are more likely to occur in datasets with many observations and/or variables, and often they do not show up by simple visual inspection.

The usual multivariate analysis techniques (e.g., principal components, discriminant analysis and multivariate regression) are based on empirical covariance and correlation matrices, and on least squares fitting. All of these can be strongly affected by even a few outliers. When the data contain nasty outliers, typically two things happen:

• the multivariate estimates differ substantially from the "right" answer, defined here as the estimates we would have obtained without the outliers;
• the resulting fitted model does not allow to detect the outliers by means of their residuals, Mahalanobis distances or the widely used "leave-one-out" diagnostics.

The first consequence is fairly well known (although the size of the effect is often underestimated). Unfortunately, the second consequence is less well known, and when stated many people find it paradoxical. Common intuition says that outliers must "stick out" from the classical fitted model, and indeed some of them may do so. But the most harmful types of outliers, especially if there are several of them, may affect the estimated model so much "in their direction" that they are now well-fitted by it.

Once this effect is understood, one sees that the following two problems are essentially equivalent:

• Robust estimation: find a "robust" fit, which is similar to the fit we would have found without the outliers.
• Outlier detection: find all the outliers that matter.

Indeed, a solution to the first problem allows us to identify the outliers by their residuals, and so on, from the robust fit. Conversely, a solution to the second problem allows us to remove or downweight the outliers followed by a classical fit, which yields a robust result.

Our research focuses on the first problem, and uses its results to answer the second. We prefer this approach over the opposite direction because from a combinatorial viewpoint it is more feasible to search for sufficiently many "good" data points than to find all the "bad" data points.

It turns out that most of the currently available highly robust multivariate estimators are difficult to compute, which makes them unsuitable for the analysis of large and/or high-dimensional datasets. Among the few exceptions is the minimum covariance determinant estimator (MCD) of Rousseeuw (1984, 1985). The MCD is a highly robust estimator of multivariate location and scatter that can be computed efficiently with the FAST-MCD algorithm of Rousseeuw and Van Driessen (1999).

Section 2 concentrates on robust estimation of location and scatter. We first describe the MCD estimator and discuss its main properties. Alternatives for the MCD are explained briefly with relevant pointers to the literature for more details. Section 3 does the same for linear regression and mainly focuses on the least trimmed squares (LTS) estimator (Rousseeuw, 1984), which is an analog of MCD for multiple regression. Since estimating the covariance matrix is the cornerstone of many multivariate statistical methods, robust scatter estimators have also been used to develop robust and computationally efficient multivariate techniques. The paper then goes on to describe robust methods for multivariate regression (Section 4), classification (Section 5), principal component analysis (Section 6), principal component regression (Section 7), partial least squares regression (Section 8) and other settings (Section 9). Section 10 concludes with pointers to available software for the described techniques.
2. MULTIVARIATE LOCATION AND SCATTER

2.1 The Need for Robustness

In the multivariate location and scatter setting we assume that the data are stored in an n × p data matrix X = (x_1, ..., x_n)′ with x_i = (x_{i1}, ..., x_{ip})′ the ith observation. Hence n stands for the number of objects and p for the number of variables.

To illustrate the effect of outliers we consider the following engineering problem, taken from Rousseeuw and Van Driessen (1999). Philips Mecoma (The Netherlands) produces diaphragm parts for television sets. These are thin metal plates, molded by a press. When starting a new production line, p = 9 characteristics were measured for n = 677 parts. The aim is to gain insight in the production process and to find out whether abnormalities have occurred. A classical approach is to compute the Mahalanobis distance

(1)   MD(x_i) = √{(x_i − µ̂_0)′ Σ̂_0^{−1} (x_i − µ̂_0)}

of each measurement x_i. Here µ̂_0 is the arithmetic mean and Σ̂_0 is the classical covariance matrix. The distance MD(x_i) should tell us how far away x_i is from the center of the cloud, relative to the size of the cloud.

In Figure 1 we plotted the classical Mahalanobis distance versus the index i, which corresponds to the production sequence. The horizontal line is at the usual cutoff value √(χ²_{9,0.975}) = 4.36. Figure 1 suggests that most observations are consistent with the classical assumption that the data come from a multivariate normal distribution, except for a few isolated outliers. This should not surprise us, even in the first experimental run of a new production line, because the Mahalanobis distances are known to suffer from the masking effect. That is, even if there were a group of outliers (here, deformed diaphragm parts), they would affect µ̂_0 and Σ̂_0 in such a way that they get small Mahalanobis distances MD(x_i) and thus become invisible in Figure 1. To get a reliable analysis of these data we need robust estimators µ̂ and Σ̂ that can resist possible outliers. For this purpose we will use the MCD estimates described below.

2.2 Description of the MCD

The MCD method looks for the h observations (out of n) whose classical covariance matrix has the lowest possible determinant. The MCD estimate of location is then the average of these h points, whereas

Fig. 1. Mahalanobis distances of the Philips data.

∗ the MCD estimate of scatter is a multiple of their For many estimators εn(µˆ, X) varies only slightly covariance matrix. The MCD location and scatter with X and n, so that we can denote its limiting estimates are affine equivariant, which means that value (for n →∞) by ε∗(µˆ). Similarly, the break- they behave properly under affine transformations down value of a covariance matrix estimator Σˆ is of the data. That is, for an n × p dataset X the defined as the smallest fraction of outliers that can ˆ ˆ MCD estimates (µˆ, Σ) satisfy take either the largest eigenvalue λ1(Σ) to infin- ˆ ′ ity or the smallest eigenvalue λp(Σ) to zero. The (2) µˆ(XA + 1nv )= µˆ(X)A + v, MCD estimates (µˆ, Σˆ ) of multivariate location and ˆ ′ ′ ˆ (3) Σ(XA + 1nv )= A Σ(X)A, ∗ ∗ ˆ scatter have breakdown value εn(µˆ)= εn(Σ) ≈ (n − for all p×1 vectors v and all nonsingular p×p matri- h)/n. The MCD has its highest possible breakdown ′ ∗ ces A. The vector 1n is (1, 1,..., 1) with n elements. value (ε = 50%) when h = [(n + p + 1)/2] (see Lop- Affine equivariance is a natural property of the un- uha¨aand Rousseeuw, 1991). Note that no affine derlying model and makes the analysis independent equivariant estimator can have a breakdown value of the measurement scales of the variables as well as above 50%. For a recent discussion of the impor- translations or rotations of the data. tance of equivariance in breakdown considerations, A useful measure of robustness is the finite-sample see Davies and Gather (2005). breakdown value (Donoho and Huber, 1983). The An efficient algorithm to compute the MCD is the ∗ breakdown value εn(µˆ, X) of an estimator µˆ at the FAST-MCD algorithm explained in Appendix A.1. dataset X is the smallest amount of contamination By default FAST-MCD computes a one-step weighted that can have an arbitrarily large effect on µˆ. Con- estimate given by sider all possible contaminated datasets X˜ obtained n n by replacing any m of the original observations by (5) µˆ1 = wixi wi , arbitrary points. Then the breakdown value of a lo- ! ! Xi=1  Xi=1 cation estimator µˆ is the smallest fraction m/n of n ˆ ′ outliers that can take the estimate over all bounds: Σ1 = dh,n wi(xi − µˆ1)(xi − µˆ1) ∗ i=1 ! εn(µˆ, X) (6) X (4) n −1 m := min ;supkµˆ(X˜ ) − µˆ(X)k = ∞ . · wi , m n X˜ !   Xi=1 4 M. HUBERT, P. J. ROUSSEEUW AND S. VAN AELST where Van Driessen (1999) selected six of the variables 2 (two from each band). The classical Mahalanobis 1, if d ˆ Σˆ (i) ≤ χp,0.975, wi = (µMCD, MCD) distances revealed a set of outliers which turned out ( 0, otherwise, q to be objects for which at least one measurement fell ˆ outside its physically possible . Therefore, the with µˆMCD and ΣMCD the raw MCD estimates. The data was cleaned by removing all objects with phys- number dh,n in (6) is a correction factor (Pison, Van Aelst and Willems, 2002) to obtain unbiased ically impossible measurements, leading to a cleaned and consistent estimates when the data come from dataset of size 132,402. The Mahalanobis distances a multivariate normal distribution. of the cleaned data are shown in Figure 3(a). This one-step weighted estimator has the same This plot (and a Q–Q plot) suggests that the dis- 2 breakdown value as the initial MCD estimator but a tances approximately come from the χ6 distribu- much better statistical efficiency. In practice we of- tion, as would be the case if the dataq came from ten do not need the maximal breakdown value. For a homogeneous population. Figure 3(b) shows the example, Hampel et al. 
(1986, pages 27–28) write robust distances computed with the FAST-MCD al- that 10% of outliers is quite common. We typically gorithm. In contrast to the innocent-looking Maha- use h = 0.75n so that ε∗ = 25%, which is sufficiently lanobis distances, these robust distances reveal the robust for most applications and has a high sta- presence of two groups. There is a majority with tistical efficiency. For example, with h = 0.75n the 2 RD(xi) ≤ χ6,0.975 and a group with RD(xi) be- asymptotic efficiencies of the weighted MCD loca- tween 8 andq 16. Based on these results the astronomers tion and scatter estimators in 10 dimensions are noted that the lower group are mainly stars while 94% and 88%, respectively (Croux and Haesbroeck, the upper group are mainly galaxies. 1999). 2.4 Other robust estimators of multivariate 2.3 Examples location and scatter Let us now reanalyze the Philips data. For each The breakdown point is not the only important ro- observation xi we now compute the robust distance bustness measure. Another key concept is the influ- (Rousseeuw and Leroy, 1987) given by ence function, which measures the effect on an esti- −1 mator of adding a small mass at a specific point. (See x x µˆ ′Σˆ x µˆ (7) RD( i)= ( i − ) ( i − ), Hampel et al., 1986 for details.) Robust estimators q where (µˆ, Σˆ ) are the MCD location and scatter esti- ideally have a bounded influence function, which mates. Recall that the Mahalanobis distances in Fig- means that a small contamination at any point can ure 1 indicated no groups of outliers. On the other only have a small effect on the estimator. M-estimators (Maronna, 1976; Huber, 1981) were the first class of hand, the robust distances RD(xi) in Figure 2 show a strongly deviating group of outliers, ranging from bounded influence estimators for multivariate loca- index 491 to index 565. Something happened in the tion and scatter. Also the MCD and other estimators production process, which was not visible from the mentioned below have a bounded influence function. classical Mahalanobis distances due to the masking The first high-breakdown location and scatter esti- effect. Furthermore, Figure 2 also shows a remark- mator was proposed by Stahel (1981) and Donoho able change after the first 100 measurements. Both (1982). The Stahel–Donoho estimates are a weighted phenomena were investigated and interpreted by the mean and covariance, like (5)–(6), where the weight engineers at Philips. wi of an observation xi depends on its outlyingness, The second dataset came from a group of Cal Tech given by astronomers working on the Digitized Palomar Sky ′ ′ |xiv − medj(xjv)| Survey (see Odewahn et al., 1998). They made a ui = sup ′ . v mad (x v) survey of celestial objects (light sources) by record- k k=1 j j ing nine characteristics (such as magnitude, area, The estimator has good robustness properties but image moments) in each of three bands: blue, red is computationally very intensive, which limits its and near-infrared. The database contains measure- use (Tyler, 1994; Maronna and Yohai, 1995). The ments for 27 variables on 137,256 celestial objects. Stahel–Donoho estimator measures the outlyingness Based on exploratory data analysis Rousseeuw and by looking at all univariate projections of the data ROBUST MULTIVARIATE STATISTICS 5

Fig. 2. Robust distances of the Philips data.

Fig. 3. Cleaned digitized Palomar data: (a) Mahalanobis distances; (b) robust distances.

and as such is related to projection pursuit methods as studied in Friedman and Tukey (1974), Huber (1985) and Croux and Ruiz-Gazen (2005). Another highly robust estimator of location and scatter based on projections has been proposed by Maronna, Stahel and Yohai (1992).

Together with the MCD, Rousseeuw (1984, 1985) also introduced the minimum volume ellipsoid (MVE) estimator, which looks for the minimal volume ellipsoid covering at least half the data points. However, the MVE has efficiency zero due to its low rate of convergence. Rigorous asymptotic results for the MCD and the MVE are given by Butler, Davies and Jhun (1993) and Davies (1992a). To improve the finite-sample efficiency of MVE and MCD a one-step weighted estimator (5)–(6) can be computed. The breakdown value and asymptotic properties of one-step weighted estimators have been obtained by

Fig. 4. Simple regression data with different types of outliers.

Lopuha¨aand Rousseeuw (1991) and Lopuha¨a(1999). construct formal outlier identification rules; see, for Alternatively, a one-step M-estimator starting from example, Becker and Gather (1999). MVE or MCD can be computed as proposed by Davies To extend the notion of to higher dimen- (1992b). sions, Tukey introduced the halfspace depth. Depth- Another approach to improve the efficiency of MVE based estimators have been proposed and studied or MCD is to use a smoother objective function. An by Donoho and Gasko (1992), Rousseeuw, Ruts and important class of robust estimators of multivariate Tukey (1999a), Liu, Parelius and Singh (1999), Zuo location and scatter are S-estimators (Rousseeuw and Serfling (2000a, 2000b) and Zuo, Cui and He and Leroy, 1987; Davies, 1987), defined as the solu- (2004). tion (µˆ, Σˆ ) which minimizes det(Σ) under the con- Robust estimation and outlier detection in higher straint dimensions has been studied by Rocke (1996) and 1 n Rocke and Woodruff (1996). For very high-dimensional (8) ρ( (x − µ)′Σ−1(x − µ) ) ≤ b n i i data, Maronna and Zamar (2002) and Alqallaf et Xi=1 q al. (2002) proposed computationally efficient robust over all vectors µ and all p × p positive definite sym- estimators of multivariate location and covariance metric matrices Σ. Setting b = EF [ρ(kX)k] assures that are not affine equivariant any more. Chen and consistency at the model distribution F . The func- Victoria-Feser (2002) address robust covariance ma- tion ρ is chosen by the and is often taken trix estimation with . to be Tukey’s biweight ρ-function x2 x4 x6 3. MULTIPLE REGRESSION − + , if |x|≤ c, 2 2c2 6c4 (9) ρ(x)=  2 3.1 Motivation  c  , if |x|≥ c. 6 The multiple regression model assumes that also  a response variable y is measured, which can be ex- The constant cdetermines the breakdown value which is given by ε∗ = 6b/c2. The properties of S-estimators plained as an affine combination of the x-variables. have been investigated by Lopuha¨a(1989). Related More precisely, the model says that for all observa- classes include CM-estimators (Kent and Tyler, 1996), tions (xi,yi) with i = 1,...,n it holds that MM-estimators (Tatsuoka and Tyler, 2000) and τ- yi = θ1xi1 + · · · + θpxip + θp+1 + εi, estimators Lopuha¨a(1991). Positive-breakdown es- (10) timators of location and scatter can also be used to i = 1,...,n, ROBUST MULTIVARIATE STATISTICS 7 where the errors ǫi are assumed to be i.i.d. with regression, scale and affine equivariant. That is, for 2 ′ ′ zero mean and constant σ . The vector any X = (x1,..., xn) and y = (y1,...,yn) it holds ′ β = (θ1,...,θp) is called the slope, and α = θp+1 that ′ the intercept. Denote xi = (xi1,...,xip) and θ = ˆ ˆ ′ ′ ′ ′ ′ θ(X, y + Xv + 1nc)= θ(X, y) + (v , c) (β , α) = (θ1,...,θp,θp+1) . The classical least squares method to estimate θ θˆ(X, cy)= cθˆ(X, y), and σ is extremely sensitive to regression outliers, (12) ˆ ′ ′ ˆ′ −1 that is, observations that do not obey the linear θ(XA + 1nv , y) = (β (X, y)A , α(X, y) pattern formed by the majority of the data. In re- ′ βˆ X y A−1v ′ gression we can distinguish between different types − ( , ) ) , of points. This is illustrated in Figure 4 for simple for any vector v, any constant c and any nonsingular regression. Leverage points are observations (xi,yi) p × p matrix A. whose xi are outlying; that is, xi deviates from the The breakdown value of a regression estimator θˆ majority in x-space. 
We call such an observation at a dataset Z is the smallest fraction of outliers that (xi,yi) a good leverage point if (xi,yi) follows the can have an arbitrarily large effect on θˆ. Formally, it linear pattern of the majority, such as points 2 and is defined by (4) where X is replaced by (X, y). For 21. If, on the other hand, (xi,yi) does not follow this h = [(n + p + 1)/2] the LTS breakdown value equals linear pattern, we call it a bad leverage point, like ε∗(LTS) ≈ 50%, whereas for larger h we have that 4, 7 and 12. An observation whose x belongs to the ∗ i εn(LTS) ≈ (n − h)/n. The usual choice h ≈ 0.75 n majority in x-space but where (xi,yi) deviates from yields ε∗(LTS) = 25%. the linear pattern is called a vertical outlier, like the When using LTS regression, the standard devia- points 6, 13 and 17. A regression dataset can thus tion of the errors can be estimated by have up to four types of points: regular observations, vertical outliers, good leverage points and bad lever- 1 h (13) σˆ = c (r2) , age points. Leverage points attract the least squares h,nv i:n uh solution toward them, so bad leverage points are of- u Xi=1 t ten not apparent in a classical . where ri are the residuals from the LTS fit, and ch,n In low dimensions, as in this example, visual in- makesσ ˆ consistent and unbiased at Gaussian error spection can be used to detect outliers and leverage distributions (Pison, Van Aelst and Willems, 2002). points, but in higher dimensions this is not an op- Note that the LTS scale estimatorσ ˆ is itself highly tion anymore. Therefore, we need robust and com- robust. Therefore, we can identify regression outliers putationally efficient estimators that yield a reli- by their standardized LTS residuals ri/σˆ. able analysis of regression data. We consider the To compute the LTS in an efficient way, Rousseeuw least trimmed squares estimator (LTS) proposed by and Van Driessen (2006) developed the FAST-LTS Rousseeuw (1984) for this purpose. algorithm outlined in Appendix A.2. Similarly to the For a dataset Z = {(xi,yi); i = 1,...,n} and for FAST-MCD algorithm, FAST-LTS returns weighted any θ denote the corresponding residuals by ri = least squares estimates, given by r (θ)= y − β′x − α = y − θ′u with u = (x′ , 1)′. i i i i i i i n −1 n Then the LTS estimator is defined as the θˆ which ˆ ′ (14) θ1 = wiuiui wiuiyi , minimizes ! ! Xi=1 Xi=1 h 2 n ˆ 2 (11) (r )i:n, i=1 wiri(θ1) (15) σˆ1 = dh,nv n , i=1 u i=1 wi X uP 2 2 2 t where (r )1:n ≤ (r )2:n ≤···≤ (r )n:n are the or- ′ ′ P where ui = (x , 1) . The weights are dered squared residuals (note that the residuals are i first squared and then ordered). This is equivalent ˆ 2 1, if |ri(θLTS)/σˆLTS|≤ χ1,0.975, to finding the h-subset with smallest least squares wi = ( 0, otherwise. q objective function, which resembles the definition ˆ of the MCD. The LTS estimate is then the least where θLTS andσ ˆLTS are the raw LTS estimates. squares fit to these h points. The LTS estimates are As before, dh,n is a finite-sample correction factor. 8 M. HUBERT, P. J. ROUSSEEUW AND S. VAN AELST

These weighted estimates have the same breakdown 2 χ8,0.975 are considered to be leverage points. Fig- value as the initial LTS estimates and a much better ureq 6(a) shows that most data points lie between the statistical efficiency. Moreover, from the weighted horizontal cutoffs at ± χ2 which suggests that least squares estimates all the usual inferential out- 1,0.975 put such as t-statistics, F -statistics an R2 statis- most data follow the sameq linear trend. On the other tic and the corresponding p-values can be obtained hand, the outlier map based on LTS residuals and (Rousseeuw and Leroy, 1987). These p-values are robust distances RD(xi) shown in Figure 6(b) tells a approximate since they assume that the data with different story. This plot reveals a rather large group of observations with large robust residuals and large wi =1 come from the model (10) whereas the data robust distances. Hence, these observations are bad with wi =0 do not, and we usually do not know whether that is true. leverage points. This group turned out to be giant stars, which are known to behave differently from In Figure 4 we see that the LTS line obtained by other stars. FAST-LTS yields a robust fit that is not attracted by the leverage points on the right-hand side, and 3.2 Other robust regression methods hence follows the pattern of the majority of the data. The development of robust regression often paral- Of course, the LTS method is most useful when there leled that of robust estimators of multivariate loca- are several x-variables. tion and scatter, and in fact more attention has been To detect leverage points in higher dimensions we dedicated to the regression setting. Robust regres- must detect outlying x in x-space. For this pur- i sion also started with M-estimators (Huber, 1973, pose we will use the robust distances RD based i 1981), later followed by R-estimators (Jureckov´a, on the one-step weighted MCD of the previous sec- 1971) and L-estimators (Koenker and Portnoy, 1987) tion. On the other hand, we can see whether a point that all have breakdown value zero because of their (x ,y ) lies near the majority pattern by looking at i i vulnerability to bad leverage points. its standardized LTS residual r /σˆ. Rousseeuw and i The next step was the development of generalized van Zomeren (1990) proposed an outlier map which M-estimators (GM-estimators) that bound the in- plots robust residuals ri/σˆ versus robust distances fluence of outlying x by giving them a small weight x i RD( i), and indicates the corresponding cutoffs by (see, e.g., Krasker and Welsch, 1982; Maronna and horizontal and vertical lines. It automatically clas- Yohai, 1981). Therefore, GM-estimators are often sifies the observations into the four types of data called bounded influence methods, and they are more points that can occur in a regression dataset. Fig- stable than M-, L- or R-estimators. See Hampel et ure 5 is the outlier map of the data in Figure 4. al. (1986, Chapter 6) for an overview. Unfortunately, To illustrate this plot, we again consider the data- the breakdown value of GM-estimators with a mono- base of the Digitized Palomar Sky Survey. Follow- tone score function still goes down to zero for in- ing Rousseeuw and Van Driessen (2006), we now use creasing p (Maronna, Burtos and Yohai, 1979). GM- the subset of 56,744 stars (not galaxies) for which estimators with a redescending score function can all the characteristics in the blue color (the F band) have a dimension-independent positive breakdown are available. 
The response variable MaperF is re- value (see He, Simpson and Wang, 2000). Note that gressed against the other eight characteristics of the for a small fraction of outliers in the data GM- color band F. These characteristics describe the size estimators are robust, and they are computation- of a light source and the shape of the spatial bright- ally fast. For a discussion of the differences between ness distribution in a source. Figure 6(a) plots the bounded-influence estimators and high-breakdown standardized LS residuals versus the classical Ma- methods see the recent book by Maronna, Martin halanobis distances. Some isolated outliers in the and Yohai (2006). y-direction as well as in x-space were not plotted to The first high-breakdown regression methods were get a better view of the majority of the data. Ob- least of squares (LMS), LTS and the re- servations for which the standardized absolute LS peated median. The origins of LMS go back to Tukey 2 residual exceeds the cutoff χ1,0.975 are considered (Andrews et al., 1972), who proposed a univariate to be regression outliers, whereasq the other observa- estimator based on the shortest half of the sample tions are thought to obey the linear model. Similarly, and called it the shorth. Hampel (1975, page 380) observations for which MD(xi) exceeds the cutoff modified and generalized it to regression and stated ROBUST MULTIVARIATE STATISTICS 9

Fig. 5. Regression outlier map of the data in Figure 4.

Fig. 6. Digitized Palomar Sky Survey data: regression of MaperF on eight regressors. ( a) Plot of LS residual versus Maha- lanobis distance MD(xi); ( b) outlier map of LTS residual versus robust distance RD(xi). that the resulting estimator has a 50% breakdown was Siegel’s repeated median technique (Siegel, 1982), value. He called it the shordth and considered it which has good properties in the simple regression of special mathematical interest. Later, Rousseeuw case (p = 2) but is no longer affine equivariant in (1984) provided theory, algorithms and programs multiple regression (p ≥ 3). for this estimator, as well as applications (see also As for multiple location and scatter, the efficiency Rousseeuw and Leroy, 1987). However, LMS has an of a high-breakdown regression estimator can be im- abnormally slow convergence rate and hence its proved by computing one-step weighted least squares asymptotic efficiency is zero. In contrast, LTS is estimates (14)–(15) or by computing a one-step M- asymptotically normal and can be computed much estimator as done in Rousseeuw (1984). In order to faster. The other high-breakdown regression method combine these advantages with those of the bounded 10 M. HUBERT, P. J. ROUSSEEUW AND S. VAN AELST influence approach, it was later proposed by Simp- (1999b), Van Aelst and Rousseeuw (2000), Van Aelst son, Ruppert and Carroll (1992), Coakley and et al. (2002) and Bai and He (2000). Hettmansperger (1993) and Simpson and Yohai (1998) Another important robustness measure, besides to compute a one-step GM-estimator starting from the breakdown value and the influence function, is LTS. the maxbias curve. The maxbias is the maximum Tests and variable selection for robust regression possible caused by a fixed frac- were developed by Markatou and He (1994), Marka- tion ε of contamination. The maxbias curve plots tou and Hettmansperger (1990), Ronchetti and the maxbias of an estimator as a function of the Staudte (1994) and Ronchetti, Field and Blanchard fraction ε = m/n of contamination. Maxbias curves (1997). For high-breakdown methods, variable se- of robust regression estimators have been studied lection by all subsets regression becomes infeasible. in Martin, Yohai and Zamar (1989), He and Simp- One way out is to apply the robust method to all son (1993), Croux, Rousseeuw and H¨ossjer (1994), variables, yielding weights, and then to apply the Adrover and Zamar (2004) and Berrendero and Za- classical selection methods for weighted least squares. mar (2001). Projection estimators for regression Alternatively, a robust R2 measure (Croux and De- (Maronna and Yohai, 1993) combine a low maxbias hon, 2003) or a robust penalized selection criterion with high breakdown value and bounded influence (M¨uller and Welsh, 2006) can be used in a forward but they are difficult to compute. or backward selection strategy. Unbalanced binary regressors that contain, for ex- Another approach to improve the efficiency of the ample, 90% of zeroes and 10% of ones might be ig- LTS is to replace its objective function by a smoother nored by standard robust regression methods. Ro- alternative. Similarly as in (8), S-estimators of re- bust methods for regression models that include cat- gression (Rousseeuw and Yohai, 1984) are defined egorical or binary regressors have been developed as the solution (θˆ, σˆ) that minimizesσ ˆ subject to by Hubert and Rousseeuw (1996) and Maronna and the constraint Yohai (2000). 
Robust estimators for orthogonal re- n ′ gression and error-in-variables models have been con- 1 yi − θ xi (16) ρ ≤ b. sidered by Zamar (1989, 1992) and Maronna (2005). n σ Xi=1   The constant b usually equals EΦ[ρ(Y )] to assure 4. MULTIVARIATE REGRESSION consistency at the model with normal error distri- The regression model can be extended to the case bution, and as before ρ is often taken to be Tukey’s where we have more than one response variable. For biweight ρ function (9). Salibian-Barrera and Yohai p-variate predictors x = (x ,...,x )′ and q-variate (2006) recently constructed an efficient algorithm to i i1 ip responses y = (y ,...,y )′ the multivariate regres- compute regression S-estimators. Related classes of i i1 iq sion model is given by efficient high-breakdown estimators include ′ MM-estimators (Yohai, 1987), τ-estimators (Yohai (17) yi = B xi + α + εi, and Zamar, 1988), a new type of R-estimators B (H¨ossjer, 1994), generalized S-estimators (Croux, where is the p × q slope matrix, α is the q-dimen- ′ Rousseeuw and H¨ossjer, 1994), CM-estimators sional intercept vector, and the errors εi = (εi1,...,εiq) (Mendes and Tyler, 1996) and generalized τ-esti- are i.i.d. with zero mean and with Cov(ε)= Σε a mators (Ferretti et al., 1999). Inference for these es- positive definite matrix of size q. Note that for q = 1 timators is usually based on their asymptotic distri- we obtain the multiple regression model of the pre- bution at the central model. Alternatively, for MM- vious section. On the other hand, putting p =1 and estimators Salibian-Barrera and Zamar (2002) de- xi = 1 yields the multivariate location and scatter veloped a fast and robust bootstrap procedure that model of Section 2. It is well known that the least yields reliable nonparametric robust inference. squares solution can be written as To extend the good properties of the univariate −1 Bˆ Σˆ Σˆ median to regression, Rousseeuw and Hubert (1999) = xx xy, introduced the notions of regression depth and deep- Bˆ′ (18) αˆ = µˆy − µˆx, est regression. The deepest regression estimator has ˆ ˆ ˆ′ ˆ ˆ been studied by Rousseeuw, Van Aelst and Hubert Σε = Σyy − B ΣxxB, ROBUST MULTIVARIATE STATISTICS 11

′ ′ where where ui = (xi, 1) and d1 is a consistency factor. The weights w are given by µˆ Σˆ Σˆ i µˆ = x and Σˆ = xx xy ˆ ˆ ˆ 2 µˆy Σyx Σyy 1, if d(ri(θMCD)) ≤ χ ,     wi = q,0.975 are the empirical mean and covariance matrix of the ( 0, otherwise, q joint (x, y) variables. ˆ ˆ ′ ˆ −1 ˆ with d(ri(θMCD)) = ri(θMCD) (Σε) ri(θMCD) the Vertical outliers and bad leverage points highly robust distances ofq the residuals, corresponding to influence the least squares estimates in multivari- ˆ ˆ the initial MCD regression estimates θMCD and Σε. ate regression, and may make the results completely Note that these weighted regression estimates (20)– unreliable. Therefore, robust alternatives have been (21) have the same breakdown value as the initial developed. MCD regression estimates. Rousseeuw et al. (2004) proposed to use the MCD To illustrate MCD regression we analyze a dataset estimates for the center µ and scatter matrix Σ from Shell’s polymer laboratory, described in Roussee- in (18). The resulting estimates are called MCD re- uw et al. (2004). The dataset consists of n = 217 ob- gression estimates. It has been shown that the MCD servations with p = 4 predictor variables and q = 3 regression estimates are regression, y-affine and x- response variables. The predictor variables describe ′ affine equivariant. With X = (x1,..., xn) , Y = the chemical characteristics of a piece of foam, whereas ′ ˆ ˆ′ ′ the response variables measure its physical proper- (y1,..., yn) and θ = (B , αˆ) this means that ties such as tensile strength. The physical properties ˆ ′ θ(X, Y + XD + 1nw ) of the foam are determined by the chemical compo- sition used in the production process. Multivariate ˆ ′ ′ = θ(X, Y) + (D , w) , regression is used to establish a relationship between ˆ ′ the chemical inputs and the resulting physical prop- θ(X, YC + 1nw ) erties of the foam. After an initial exploratory study ˆ ′ ′ (19) = θ(X, Y) C + (Opq, w) , of the variables, a robust multivariate MCD regres- sion was used. θˆ XA′ 1 v′ Y ( + n , ) To detect leverage points and outliers the outlier ′ = (Bˆ (X, Y)A−1, αˆ(X, Y) map of Rousseeuw and van Zomeren (1990) has been ′ extended to multivariate regression. In multivari- − Bˆ (X, Y)A−1v)′, ate regression the robust distances of the residuals ˆ ri(θ1) are plotted versus the robust distances of the where D is any p × q matrix, A is any nonsingu- xi. Figure 7 is the outlier map of the Shell foam data. lar p × p matrix, C is any nonsingular q × q ma- Observations 215 and 110 lie far from both the hori- trix, v is any p-dimensional vector and w is any 2 zontal cutoff line at χ3,0.975 = 3.06 and the vertical q-dimensional vector. Here O is the p × q matrix pq 2 q consisting of zeroes. cutoff line at χ4,0.975 = 3.34. These two observa- MCD regression inherits the breakdown value of tions can thusq be classified as bad leverage points. ∗ ˆ Several observations lie substantially above the hor- the MCD estimator, thus εn(θ) ≈ (n − h)/n. To ob- tain a better efficiency, the one-step weighted MCD izontal cutoff but not to the right of the vertical estimates are used in (18) and followed by the re- cutoff, which means that they are vertical outliers (their residuals are outlying but their x-values are gression weighting step described below. For any fit not). θˆ denote the corresponding q-dimensional residuals Based on this list of special points the scientists ˆ Bˆ′ by ri(θ)= yi − xi − αˆ. 
Then the weighted regres- who had made the measurements found out that a sion estimates are given by fraction of the observations in Figure 7 were made n −1 n with a different production technique and hence be- ˆ ′ ′ long to a different population with other character- (20) θ1 = wiuiui wiuiyi , ! ! istics. These include the observations 210, 212 and Xi=1 Xi=1 n −1 n 215. We therefore remove these observations from ˆ 1 ˆ ˆ ′ the data, and retain only observations from the in- (21) Σε = d1 wi wiri(θ1)ri(θ1) , ! ! tended population. Xi=1 Xi=1 12 M. HUBERT, P. J. ROUSSEEUW AND S. VAN AELST
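MCD regression plugs robust estimates of the joint center and scatter of the (x, y) variables into the formulas (18). The plug-in step itself is simple; in the sketch below the classical mean and covariance of the stacked data merely stand in for the (weighted) MCD estimates, which is an assumption of this illustration, not the method.

import numpy as np

def plugin_regression(mu, Sigma, p):
    # Plug-in multivariate regression (18): the first p coordinates of (mu, Sigma)
    # refer to the predictors x, the remaining q coordinates to the responses y.
    Sxx, Sxy, Syy = Sigma[:p, :p], Sigma[:p, p:], Sigma[p:, p:]
    B = np.linalg.solve(Sxx, Sxy)                  # slope matrix  B = Sxx^{-1} Sxy
    alpha = mu[p:] - B.T @ mu[:p]                  # intercept     alpha = mu_y - B' mu_x
    Sigma_eps = Syy - B.T @ Sxx @ B                # error scatter Sigma_eps = Syy - B' Sxx B
    return B, alpha, Sigma_eps

# Illustration on simulated data; in practice replace mean/cov by robust (MCD) estimates.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4))
Y = X @ rng.normal(size=(4, 3)) + rng.normal(size=(100, 3))
Z = np.hstack([X, Y])
B_hat, alpha_hat, Sigma_eps_hat = plugin_regression(Z.mean(axis=0),
                                                    np.cov(Z, rowvar=False), p=4)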

Fig. 7. Regression outlier map of the foam data.

Fig. 8. Regression outlier map of the corrected foam data.

Running the method again yields the outlier map in Figure 8. Observation 110 is still a bad leverage point, and also several of the vertical outliers remain. No chemical/physical mechanism was found to explain why these points are outliers, leaving open the possibility of some large measurement errors. But the detection of these outliers at least provides us with the option to choose whether or not to allow them to affect the final result.
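The outlier maps of Figures 7 and 8 classify each observation by comparing the robust distance of its residual with the robust distance of its x-part, using chi-squared cutoffs. A minimal sketch of this classification step is given below; it assumes the two distance vectors have already been computed (for instance from an MCD regression fit and a weighted MCD on the predictors).

import numpy as np
from scipy.stats import chi2

def outlier_map_classes(resid_dist, x_dist, q, p, level=0.975):
    # resid_dist: robust distances of the residuals (|standardized residuals| when q = 1)
    # x_dist:     robust distances RD(x_i) of the predictors
    # q, p:       degrees of freedom of the two chi-squared cutoffs
    rc = np.sqrt(chi2.ppf(level, q))
    lc = np.sqrt(chi2.ppf(level, p))
    return np.where(resid_dist > rc,
                    np.where(x_dist > lc, "bad leverage", "vertical outlier"),
                    np.where(x_dist > lc, "good leverage", "regular"))

# e.g. outlier_map_classes(rd_res, rd_x, q=3, p=4) for the dimensions of the foam data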

Since MCD regression is mainly intended for re- In practice µj, Σj and pj have to be estimated. gression data with random carriers, Agull´o, Croux Classical quadratic discriminant analysis (CQDA) and Van Aelst (2006) developed an alternative ro- uses the group’s mean and empirical covariance ma- bust multivariate regression method which can be trix to estimate µj and Σj. The membership prob- seen as an extension of LTS to the multivariate set- abilities are usually estimated by the relative fre- ting. This multivariate least trimmed squares es- quencies of the observations in each group, hence timator (MLTS) can also be used in cases where pˆj = nj/n with nj the number of observations in the carriers are fixed. The MLTS looks for a subset group j. of size h such that the determinant of the covari- A robust quadratic discriminant analysis (RQDA) ance matrix of its residuals corresponding to its least is derived by using robust estimators of µj, Σj and squares fit is minimal. Similarly as for MCD regres- pj. In particular, we can apply the weighted MCD ∗ sion, the MLTS has breakdown value εn(θMLTS) ≈ estimator of location and scatter in each group. As a (n − h)/n and the equivariance properties (19) are byproduct of this robust procedure, outliers (within satisfied. The MLTS can be computed quickly with each group) can be distinguished from the regular an algorithm similar to that in Appendix A.1. To observations. Finally, the membership probabilities improve the efficiency while keeping the breakdown can be robustly estimated as the relative frequency value, a one-step weighted MLTS estimator can be of regular observations in each group. For an out- computed using expressions (20)–(21). Alternatively, line of this approach, see Hubert and Van Driessen Van Aelst and Willems (2005) introduced multivari- (2004). ate regression S-estimators and extended the fast When the groups are assumed to have a common robust bootstrap methodology of Salibian-Barrera covariance matrix Σ, the quadratic scores (22) can and Zamar (2002) to this setting while Garc´ıa Ben, be simplified to Mart´ınez and Yohai (2006) proposed τ-estimators L ′ −1 1 ′ −1 for multivariate regression. (23) dj (x)= µjΣ x − 2 µjΣ µj + ln(pj) 1 Σ 1 x′Σ−1x 5. CLASSIFICATION since the terms − 2 ln| | and − 2 do not de- pend on j. The resulting scores (23) are linear in x, The goal of classification, also known as discrim- hence the maximum likelihood rule belongs to the inant analysis or supervised learning, is to obtain class of linear discriminant analysis. It is well known rules that describe the separation between known that if we have only two populations (l = 2) with groups of observations. Moreover, it allows to clas- a common covariance structure and if both groups sify new observations into one of the groups. We de- have equal membership probabilities, this rule coin- note the number of groups by l and assume that we cides with Fisher’s linear discriminant rule. Robust can describe our in each population πj linear discriminant analysis based on the MCD esti- by a p-dimensional random variable Xj with density mator or S-estimators has been studied in Hawkins function fj. We write pj for the membership proba- and McLachlan (1997), He and Fung (2000), Croux bility, that is, the probability for an observation to and Dehon (2001) and Hubert and Van Driessen come from πj. The maximum likelihood rule then ˆ (2004). 
The latter paper computes µˆj and Σj by classifies an observation x into πk if ln(pkfk(x)) is weighted MCD and then defines the pooled covari- the maximum of the set {ln(pjfj(x)); j = 1,...,l}. ˆ l ˆ ance matrix Σ = ( j=1 njΣj)/n. If we assume that the density fj for each group is We consider a dataset that contains the spectra of P Gaussian with mean µj and covariance matrix Σj, three different cultivars of the same fruit (cantaloupe— then it can be seen that the maximum likelihood rule Cucumis melo L. Cantaloupensis). The cultivars is equivalent to maximizing the discriminant scores (named D, M and HA) have sizes 490, 106 and 500, Q x dj ( ) with and all spectra were measured in 256 wavelengths. Q 1 The dataset thus contains 1096 observations and 256 d (x)= − ln|Σj| j 2 variables. First, a robust principal component anal- 1 ′ −1 (22) − 2 (x − µj) Σj (x − µj) ysis (as described in the next section) was applied to reduce the dimension of the data space, and the first + ln(p ). j two components were retained. For a more detailed Q Q That is, x is allocated to πk if dk (x) > dj (x) for all description and analysis of these data, see Hubert j = 1,...,l (see, e.g., Johnson and Wichern, 1998). and Van Driessen (2004). 14 M. HUBERT, P. J. ROUSSEEUW AND S. VAN AELST
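Once per-group centers, scatter matrices and membership probabilities are available, the allocation rule based on the quadratic scores (22) is easy to express directly: feeding classical estimates gives CQDA, while feeding weighted MCD estimates gives the robust rule. A minimal sketch, with the group estimates assumed given:

import numpy as np

def quadratic_scores(x, mus, Sigmas, priors):
    # Discriminant scores d_j^Q(x) from (22) for a single observation x;
    # mus, Sigmas, priors are lists of group centers, scatter matrices and p_j.
    scores = []
    for mu, S, p in zip(mus, Sigmas, priors):
        diff = x - mu
        maha2 = diff @ np.linalg.solve(S, diff)        # (x - mu)' S^{-1} (x - mu)
        _, logdet = np.linalg.slogdet(S)
        scores.append(-0.5 * logdet - 0.5 * maha2 + np.log(p))
    return np.array(scores)

def allocate(x, mus, Sigmas, priors):
    # Assign x to the group pi_k with the largest discriminant score.
    return int(np.argmax(quadratic_scores(x, mus, Sigmas, priors)))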

Fig. 9. (a) Classical tolerance ellipses for the fruit data with common covariance matrix; (b) robust tolerance ellipses.

The data were divided randomly in a training set the analysis of high-dimensional data which are fre- and a validation set, containing 60% and 40% of the quently encountered in , computer vi- observations. Figure 9 shows the training data. In sion, engineering, genetics, and other domains. PCA this figure cultivar D is marked with crosses, culti- is then often the first step of the data analysis, fol- var M with circles and cultivar HA with diamonds. lowed by discriminant analysis, , or We see that cultivar HA has a cluster of outliers other multivariate techniques (see, e.g., Hubert and that are far away from the other observations. As it Engelen, 2004). It is thus important to find those turns out, these outliers were caused by a change in components that contain most of the information. the illumination system. To classify the data, we will In the classical approach, the first component cor- use model (23) with a common covariance matrix Σ. responds to the direction in which the projected Figure 9(a) shows the classical tolerance ellipses for observations have the largest variance. The second ′ ˆ −1 2 component is then orthogonal to the first and again the groups, given by (x−µˆj) Σ (x−µˆj)= χ2,0.975. Note how strongly the classical covariance estima- maximizes the variance of the projected data points. tor of the common Σ is influenced by the outlying Continuing in this way produces all the principal subgroup of cultivar HA. On the other hand, Fig- components, which correspond to the eigenvectors of the empirical covariance matrix. Unfortunately, ure 9(b) shows the same data with the correspond- both the classical variance (which is being maxi- ing robust tolerance ellipses. mized) and the classical covariance matrix (which is The effect on the resulting classical linear discrim- being decomposed) are very sensitive to anomalous inant rules is dramatic for cultivar M. It appears observations. Consequently, the first components are that all the observations are badly classified because often pulled toward outlying points, and may not they would have to belong to a region that lies com- capture the variation of the regular observations. pletely outside the boundary of this figure! The ro- Therefore, data reduction based on classical PCA bust discriminant analysis does a better job. The (CPCA) becomes unreliable if outliers are present tolerance ellipses are not affected by the outliers and in the data. the resulting discriminant lines split up the different To illustrate this, let us consider a small artificial groups more accurately. The misclassification rates dataset in p = 4 dimensions. The Hawkins–Bradu– are 17% for cultivar D, 95% for cultivar M and 6% Kass dataset (see, e.g., Rousseeuw and Leroy, 1987) for cultivar HA. The misclassification rate of culti- consists of n = 75 observations in which two groups var M remains very high. This is due to the intrin- of outliers were created, labeled 1–10 and 11–14. The sic overlap between the three groups, and the fact first two eigenvalues explain already 98% of the to- that cultivar M has few data points compared to the tal variation, so we select k =2. The CPCA scores others. (When we impose that all three groups are plot is depicted in Figure 10(a). In this figure we equally important by setting the membership proba- can clearly distinguish the two groups of outliers, bilities equal to 1/3, we obtain a better classification but we see several other undesirable effects. We first of cultivar M with 46% of errors.) 
observe that, although the scores have zero mean, This example thus clearly shows that outliers can the regular data points lie far from zero. This stems have a huge effect on the classical discriminant rules, from the fact that the mean of the data points is a whereas the robust version fares better. bad estimate of the true center of the data in the presence of outliers. It is clearly shifted toward the 6. PRINCIPAL COMPONENT ANALYSIS outlying group, and consequently the origin even 6.1 Classical PCA falls outside the cloud of the regular data points. On the plot we have also superimposed the 97.5% Principal component analysis is a popular statis- tolerance ellipse. We see that the outliers 1–10 are tical method which tries to explain the covariance within the tolerance ellipse, and thus do not stand structure of data by means of a small number of out based on their Mahalanobis distance. The ellipse components. These components are linear combina- has stretched itself to accommodate these outliers. tions of the original variables, and often allow for 6.2 Robust PCA an interpretation and a better understanding of the different sources of variation. Because PCA is con- The goal of robust PCA methods is to obtain cerned with data reduction, it is widely used for principal components that are not influenced much 16 M. HUBERT, P. J. ROUSSEEUW AND S. VAN AELST

Fig. 10. Score plot and 97.5% tolerance ellipse of the Hawkins–Bradu–Kass data obtained with ( a) CPCA; ( b) MCD. by outliers. A first group of methods is obtained can handle up to about 100 dimensions, whereas by replacing the classical covariance matrix by a there are fields like chemometrics, which need to an- robust covariance estimator. Maronna (1976) and alyze data with dimensions in the thousands. Campbell (1980) proposed using affine equivariant A second approach to robust PCA uses projection M-estimators of scatter for this purpose, but these pursuit (PP) techniques. These methods maximize cannot resist many outliers. Croux and Haesbroeck a robust measure of spread to obtain consecutive di- (2000) used positive-breakdown estimators of scat- rections on which the data points are projected. In ter such as the MCD and S-estimators. Recently, Hubert, Rousseeuw and Verboven (2002) a PP algo- Salibian-Barrera, Van Aelst and Willems (2006) pro- rithm is presented, based on the ideas of Li and Chen posed using S- or MM-estimators of scatter and de- (1985) and Croux and Ruiz-Gazen (1996, 2005). It veloped a fast robust bootstrap procedure for infer- has been successfully applied in several studies, for ence and to assess the stability of the PCA solution. example, to detect outliers in large microarray data Let us reconsider the Hawkins–Bradu–Kass data in (Model et al., 2002). Asymptotic results about this p = 4 dimensions. Robust PCA using the weighted approach are presented in Cui, He and Ng (2003). MCD estimator yields the score plot in Figure 10(b). Hubert, Rousseeuw and Vanden Branden (2005) We now see that the center is correctly estimated in proposed a robust PCA method, called ROBPCA, the middle of the regular observations. The 97.5% which combines ideas of both projection pursuit and tolerance ellipse nicely encloses these points and ex- robust covariance estimation. The PP part is used cludes all 14 outliers. for the initial dimension reduction. Some ideas based Unfortunately, the use of these affine equivariant on the MCD estimator are then applied to this lower- covariance estimators is limited to small to moder- dimensional data space. Simulations in Hubert, ate dimensions. To see why, consider, for example, Rousseeuw and Vanden Branden (2005) have shown the MCD estimator. If p denotes the number of vari- that this combined approach yields more accurate ables in our dataset, the MCD estimator can only be estimates than the raw PP algorithm. An outline of computed if p < h; otherwise the covariance matrix the ROBPCA algorithm is given in Appendix A.3. of any h-subset has zero determinant. Since h

Fig. 11. (a) Different types of outliers when a three-dimensional dataset is projected on a robust two-dimensional PCA-subspace; (b) the corresponding PCA outlier map.

example, Xn,p is an n × p matrix and Pp,k is p × k. k 2 tij (Note that it is possible to robustly scale the vari- = v . u lj ables first by dividing them by a robust scale esti- uj=1 uX mate; see, e.g., Rousseeuw and Croux, 1993.) The t All the above mentioned methods are translation robust scores are the k × 1 column vectors and orthogonal equivariant, that is, (2)–(3) hold for ′ ′ any vector v and any p × p matrix A with AA = I. ti = (Pp,k) (xi − µˆ x). To be precise, let µˆ x and P denote the robust cen- The orthogonal distance measures the distance be- ter and loading matrix of the original observations tween an observation and its projection in the k- xi. Then the robust center and loadings of the trans- Ax v Aµ v AP dimensional PCA subspace: formed data i + are equal to ˆ x + and . The scores (and distances) remain the same after this transformation, since (24) ODi = kxi − µˆ x − Pp,ktik. ′ ′ ti(Axi + v)= P A (Axi + v − (Aµˆ + v)) Let L denote the diagonal matrix which contains the x = P′(x − µˆ )= t (x ). eigenvalues lj of the MCD scatter matrix, sorted i x i i from largest to smallest. The score distance of xi We also mention the robust LTS-subspace esti- with respect to µˆ x, P and L is then defined as mator and its generalizations, introduced and dis- cussed in Rousseeuw and Leroy (1987) and Maronna ′ −1 ′ SDi = (xi − µˆ x) Pp,kLk,k(Pp,k) (xi − µˆ x) (2005). The idea behind these approaches consists q 18 M. HUBERT, P. J. ROUSSEEUW AND S. VAN AELST in minimizing a robust scale of the orthogonal dis- over p = 750 wavelengths (Lemberge et al., 2000). tances, similar to the LTS estimator and S-estimators The measurements were performed using a Jeol JSM in regression. For functional data, a fast PCA method 6300 scanning electron microscope equipped with is introduced in Locantore et al. (1999). an energy-dispersive Si(Li) X-ray detection system. Three principal components were retained for CPCA 6.3 Outlier Map and ROBPCA, yielding the outlier maps in Fig- The result of the PCA analysis can be represented ure 12. In Figure 12(a) we see that CPCA does not by means of the outlier map given in Hubert, find big outliers. On the other hand the ROBPCA Rousseeuw and Vanden Branden (2005). As in re- plot in Figure 12(b) clearly distinguishes two major gression, this figure highlights the outliers and clas- groups in the data, as well as a smaller group of bad sifies them into several types. In general, an out- leverage points, a few orthogonal outliers, and the lier is an observation which does not obey the pat- isolated case 180 in between the two major groups. tern of the majority of the data. In the context of A high-breakdown method such as ROBPCA de- PCA, this means that an outlier either lies far from tects the smaller group with cases 143–179 as a set the subspace spanned by the k eigenvectors, and/or of outliers. Later, it turned out that the window of that the projected observation lies far from the bulk the detector system had been cleaned before the last of the data within this subspace. This can be ex- 38 spectra were measured. As a result less X-ray ra- pressed by means of the orthogonal and the score diation was absorbed, resulting in higher X-ray in- distances. These two distances define four types of tensities. The other bad leverage points (57–63) and observations, as illustrated in Figure 11(a). Regular (74–76) are samples with a large concentration of observations have a small orthogonal and a small calcic. The orthogonal outliers (22, 23 and 30) are score distance. 
Bad leverage points, such as observa- borderline cases, although it turned out that they tions 2 and 3, have a large orthogonal distance and have larger measurements at the channels 215–245. a large score distance. They typically have a large This might indicate a larger concentration of phos- influence on classical PCA, as the eigenvectors will phorus. be tilted toward them. When points have a large score distance but a small orthogonal distance, we 7. PRINCIPAL COMPONENT REGRESSION call them good leverage points. Observations 1 and Principal component regression is typically used 4 in Figure 7(a) can be classified into this category. for models (10) or (17) where the Finally, orthogonal outliers have a large orthogonal number of independent variables p is very large or distance, but a small score distance, as, for example, where the regressors are highly correlated (this is case 5. They cannot be distinguished from the reg- known as multicollinearity). An important applica- ular observations once they are projected onto the tion of PCR is multivariate calibration in chemo- PCA subspace, but they lie far from this subspace. metrics, which predicts constituent concentrations The outlier map in Figure 11(b) displays the OD i of a material based on its spectrum. This spectrum versus the SD . In this plot, lines are drawn to distin- i can be obtained via several techniques such as flu- guish the observations with a small and a large OD, orescence spectrometry, near-infrared spectrometry and with a small and a large SD. For the latter dis- (NIR), nuclear magnetic resonance (NMR), ultravi- 2 tances, the cutoff value c = χk,0.975 is used. For the olet spectrometry (UV), energy dispersive X-ray flu- orthogonal distances, the approachq of Box (1954) is orescence spectrometry (ED-XRF), and so on. Since followed. The squared orthogonal distances can be a spectrum typically ranges over a large number of approximated by a scaled χ2 distribution which in wavelengths, it is a high-dimensional vector with its turn can be approximated by a normal distribu- hundreds of components. The number of concentra- tion using the Wilson–Hilferty transformation. The tions, on the other hand, is usually limited to at mean and variance of this normal distribution are most, say, five. In the univariate approach, only one then estimated by applying the univariate MCD to concentration at a time is modeled and analyzed. 2/3 the ODi . The more general problem assumes that the number of response variables q is larger than 1, which means 6.4 Example that several concentrations are to be estimated to- We illustrate the PCA outlier map on a dataset gether. This model has the advantage that the co- consisting of spectra of 180 archaeological glass pieces variance structure between the concentrations is also ROBUST MULTIVARIATE STATISTICS 19

Fig. 12. PCA outlier map of the glass dataset based on three principal components, computed with (a) CPCA; (b) ROBPCA.

Fig. 13. Robust R-RMSECVk curve for the Biscuit Dough dataset.

7. PRINCIPAL COMPONENT REGRESSION

Principal component regression (PCR) is typically used for models (10) or (17) where the number of independent variables p is very large or where the regressors are highly correlated (this is known as multicollinearity). An important application of PCR is multivariate calibration in chemometrics, which predicts constituent concentrations of a material based on its spectrum. This spectrum can be obtained via several techniques such as fluorescence spectrometry, near-infrared spectrometry (NIR), nuclear magnetic resonance (NMR), ultraviolet spectrometry (UV), energy dispersive X-ray fluorescence spectrometry (ED-XRF), and so on. Since a spectrum typically ranges over a large number of wavelengths, it is a high-dimensional vector with hundreds of components. The number of concentrations, on the other hand, is usually limited to at most, say, five. In the univariate approach, only one concentration at a time is modeled and analyzed. The more general problem assumes that the number of response variables q is larger than 1, which means that several concentrations are to be estimated together. This model has the advantage that the covariance structure between the concentrations is also taken into account, which is appropriate when the concentrations are known to be strongly correlated with each other.

Classical PCR (CPCR) starts by replacing the large number of explanatory variables X_j by a small number of loading vectors, which correspond to the first (classical) principal components of X_{n,p}. Then the response variables Y_j are regressed on these components using least squares regression. It is thus a two-step procedure, which starts by computing scores t_i for every data point. Then the y_i are regressed on the t_i.

The robust PCR method (RPCR) proposed by Hubert and Verboven (2003) combines robust PCA for high-dimensional x-data with a robust multivariate regression technique such as the MCD regression described in Section 4. The robust scores t_i obtained with ROBPCA thus serve as the explanatory variables in the regression model (10) or (17).

The RPCR method inherits the y-affine equivariance [the second equation in (19)] from the MCD regression method. RPCR is also x-translation equivariant and x-orthogonally equivariant, that is, the estimates satisfy the third equation in (19) for any orthogonal matrix A. These properties follow in a straightforward way from the orthogonal equivariance of the ROBPCA method. Robust PCR methods which are based on nonequivariant PCA estimators, such as those proposed in Pell (2000), are not x-equivariant.

An important issue in PCR is selecting the number of principal components, for which several methods have been proposed. A popular approach minimizes the root mean squared error of cross-validation criterion RMSECV_k which, for one response variable (q = 1), equals

(25)  RMSECV_k = sqrt( (1/n) Σ_{i=1}^{n} (y_i − ŷ_{−i,k})² )

with ŷ_{−i,k} the predicted value for observation i, where i was left out of the dataset when performing the PCR method with k principal components. The goal of the RMSECV_k is twofold. It yields an estimate of the root mean squared prediction error sqrt(E(y − ŷ)²) when k components are used in the model, whereas the curve of RMSECV_k for k = 1,...,k_max is a popular graphical tool to choose the optimal number of components.

This RMSECV_k statistic is, however, not suited for contaminated datasets because it also includes the prediction error of the outliers in (25). Therefore Hubert and Verboven (2003) proposed a robust RMSECV measure. These R-RMSECV_k values were rather time consuming, because for every choice of k they required the whole RPCR procedure to be performed n times. Faster algorithms for cross-validation have recently been developed (Engelen and Hubert, 2005). They avoid the complete recomputation of resampling methods such as the MCD when one observation is removed from the dataset.

To illustrate RPCR we analyze the Biscuit Dough dataset of Osborne et al. (1984), preprocessed as in Hubert, Rousseeuw and Verboven (2002). This dataset consists of 40 NIR spectra of biscuit dough with measurements every 2 nanometers, from 1200 nm up to 2400 nm. The responses are the percentages of four constituents in the biscuit dough: y_1 = fat, y_2 = flour, y_3 = sucrose and y_4 = water. Because there is a significant correlation among the responses, a multivariate regression is performed. The robust R-RMSECV_k curve is plotted in Figure 13 and suggests selecting k = 2 components.

Differences between CPCR and RPCR show up in the loading vectors and in the calibration vectors. Figure 14 shows the second loading vector and the second calibration vector for y_3 (sucrose). For instance, CPCR and RPCR give quite different results between wavelengths 1390 and 1440 (the so-called C-H bend).

Next, we can construct outlier maps as in Sections 4 and 6.3. ROBPCA yields the PCA outlier map displayed in Figure 15(a). We see that there are no leverage points but there are some orthogonal outliers, the largest being 23, 7 and 20. The result of the regression step is shown in Figure 15(b). It plots the robust distances of the residuals (or the standardized residuals if q = 1) versus the score distances. RPCR shows that observation 21 has an extremely high residual distance. Other vertical outliers are 23, 7, 20 and 24, whereas there are a few borderline cases.
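As an illustration of the cross-validation criterion (25), the sketch below computes a naive leave-one-out RMSECV_k curve for classical PCR with a single response (q = 1). It is deliberately unoptimized, and it is not the robust R-RMSECV_k of Hubert and Verboven (2003): a robust variant would replace the PCA and least squares steps by RPCR and would trim or downweight outlying cases before averaging the squared prediction errors. The function name is ours.

import numpy as np

def rmsecv_pcr(X, y, k_max):
    """Naive leave-one-out RMSECV_k curve for classical PCR (q = 1, sketch)."""
    n, p = X.shape                      # assumes k_max <= min(n - 1, p)
    rmsecv = []
    for k in range(1, k_max + 1):
        sq_err = np.empty(n)
        for i in range(n):
            keep = np.arange(n) != i
            Xt, yt = X[keep], y[keep]
            mx, my = Xt.mean(axis=0), yt.mean()
            # k leading principal components of the training x-data
            _, _, Vt = np.linalg.svd(Xt - mx, full_matrices=False)
            P = Vt[:k].T                            # (p, k) loading matrix
            T = (Xt - mx) @ P                       # training scores
            gamma, *_ = np.linalg.lstsq(T, yt - my, rcond=None)  # LS on scores
            y_hat = my + ((X[i] - mx) @ P) @ gamma  # prediction for left-out case
            sq_err[i] = (y[i] - y_hat) ** 2
        rmsecv.append(np.sqrt(sq_err.mean()))
    return np.array(rmsecv)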

Fig. 14. Second loading vector and calibration vector of sucrose for the Biscuit Dough dataset, computed with (a) CPCR; (b) RPCR.

Fig. 15. (a) PCA outlier map when applying RPCR to the Biscuit Dough dataset; (b) corresponding regression outlier map.

8. PARTIAL LEAST SQUARES REGRESSION

Partial least squares regression (PLSR) is similar to PCR. Its goal is to estimate regression coefficients in a linear model with a large number of x-variables which are highly correlated. In the first step of PCR, the scores were obtained by extracting the main information present in the x-variables by performing a principal component analysis on them, without using any information about the y-variables. In contrast, the PLSR scores are computed by maximizing a covariance criterion between the x- and y-variables. Hence, this technique uses the responses already from the start.

More precisely, let X̃_{n,p} and Ỹ_{n,q} denote the mean-centered data matrices, with x̃_i = x_i − x̄ and ỹ_i = y_i − ȳ. The normalized PLS weight vectors r_a and q_a (with ‖r_a‖ = ‖q_a‖ = 1) are then defined as the vectors that maximize

(26)  cov(Ỹ q_a, X̃ r_a) = q_a' (Ỹ'X̃ / (n − 1)) r_a = q_a' Σ̂_yx r_a

for each a = 1,...,k, where Σ̂_yx = Σ̂_xy', with Σ̂_xy = X̃'Ỹ/(n − 1) the empirical cross-covariance matrix between the x- and the y-variables. The elements of the scores t̃_i are then defined as linear combinations of the mean-centered data: t̃_{ia} = x̃_i' r_a, or equivalently T̃_{n,k} = X̃_{n,p} R_{p,k} with R_{p,k} = (r_1,..., r_k).

The computation of the PLS weight vectors can be performed using the SIMPLS algorithm (de Jong, 1993), which is described in Appendix A.4.

Hubert and Vanden Branden (2003) developed the robust method RSIMPLS. It starts by applying ROBPCA on the x- and y-variables in order to replace Σ̂_xy and Σ̂_x by robust estimates, and then proceeds analogously to the SIMPLS algorithm. Similarly to RPCR, a robust regression method (ROBPCA regression) is performed in the second stage. Vanden Branden and Hubert (2004) proved that for low-dimensional data the RSIMPLS approach yields bounded influence functions for the weight vectors r_a and q_a and for the regression estimates. Also the breakdown value is inherited from the MCD estimator.

The robustness of RSIMPLS is illustrated on the octane dataset (Esbensen, Schönkopf and Midtgaard, 1994), consisting of NIR absorbance spectra over p = 226 wavelengths ranging from 1102 nm to 1552 nm with measurements every two nanometers. For each of the n = 39 production gasoline samples the octane number y was measured, so q = 1. It is known that the octane dataset contains six outliers (25, 26, 36–39) to which alcohol was added. From the RMSECV values (Engelen et al., 2004) it follows that k = 2 components should be retained.

Fig. 16. (a) Score outlier map of the octane dataset using the SIMPLS results; (b) based on RSIMPLS; (c) regression outlier map based on SIMPLS; (d) based on RSIMPLS.

The score outlier map based on SIMPLS is shown in Figure 16(a). We see that the classical analysis only detects the outlying spectrum 26, which does not even stick out much above the border line. The robust score outlier map is displayed in Figure 16(b). Here we immediately spot the six samples with added alcohol. The robust regression outlier map in Figure 16(d) shows that the outliers are good leverage points, whereas SIMPLS again only reveals spectrum 26.

Note that canonical correlation analysis tries to maximize the correlation between linear combinations of the x- and the y-variables, instead of the covariance in (26). Robust methods for canonical correlation are presented in Croux and Dehon (2002).
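The covariance criterion (26) can be made concrete with a small sketch: over unit vectors, q_a' Σ̂_yx r_a is maximized by the leading singular vectors of the cross-covariance matrix, which is how the first pair of SIMPLS weight vectors is obtained (Appendix A.4). The code below only performs this first, classical step; further components require the deflation described in the appendix, and RSIMPLS would replace the empirical cross-covariance by a ROBPCA-based estimate. Function and variable names are ours.

import numpy as np

def first_pls_weights(X, Y):
    """First PLS weight vectors (r1, q1) maximizing criterion (26) (sketch)."""
    Xc = X - X.mean(axis=0)                  # mean-centered x-data
    Yc = Y - Y.mean(axis=0)                  # mean-centered y-data
    S_yx = Yc.T @ Xc / (len(X) - 1)          # empirical cross-covariance (q x p)
    # cov(Y q, X r) = q' S_yx r is maximized over unit vectors by the
    # leading left/right singular vectors of S_yx
    U, s, Vt = np.linalg.svd(S_yx, full_matrices=False)
    q1, r1 = U[:, 0], Vt[0]
    t1 = Xc @ r1                             # first score vector
    return r1, q1, t1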

9. SOME OTHER MULTIVARIATE FRAMEWORKS

Apart from the frameworks covered in the previous sections, there is also work in other multivariate settings. These methods cannot be described in detail here due to lack of space, but here are some pointers to the literature. In the framework of multivariate location and scatter, an MCD-based alternative to the Hotelling test was provided by Willems et al. (2002) and a technique based on robust distances was applied to the control of electrical power systems in Mili et al. (1996). High-breakdown regression techniques were extended to computer vision settings (e.g., Meer et al., 1991; Stewart, 1995). For generalized linear models, robust approaches have been proposed by Cantoni and Ronchetti (2001), Künsch, Stefanski and Carroll (1989), Markatou, Basu and Lindsay (1998), Müller and Neykov (2003) and Rousseeuw and Christmann (2003). A high-breakdown method for mixed linear models has been proposed by Copt and Victoria-Feser (2006). Robust nonlinear regression methods have been studied by Stromberg (1993), Stromberg and Ruppert (1992) and Mizera (2002), who considered a depth-based approach. Boente, Pires and Rodrigues (2002) introduced robust estimators for common principal components. Robust methods were proposed for factor analysis (Pison et al., 2003) and independent component analysis (Brys, Hubert and Rousseeuw, 2005). Croux et al. (2003) fitted general multiplicative models such as FANOVA. Robust clustering methods have been investigated by Kaufman and Rousseeuw (1990), Cuesta-Albertos, Gordaliza and Matrán (1997) and Hardin and Rocke (2004). Robustness in time series analysis and econometrics has been studied by Martin and Yohai (1986), Bustos and Yohai (1986), Muler and Yohai (2002), Franses, Kloek and Lucas (1999), van Dijk, Franses and Lucas (1999a, 1999b) and Lucas and Franses (1998). Of course, this short list is far from complete.

10. AVAILABILITY

Stand-alone programs carrying out FAST-MCD and FAST-LTS can be downloaded from the website http://www.agoras.ua.ac.be, as well as Matlab versions. The FAST-MCD algorithm is available in the package S-PLUS (as the built-in function cov.mcd), in R (as part of the packages rrcov, robust and robustbase), and in SAS/IML Version 7. It is also included in SAS Version 9 (in PROC ROBUSTREG). These packages all provide the one-step weighted MCD estimates. The LTS is available in S-PLUS as the built-in function ltsreg, which uses a slower algorithm and has a low default breakdown value. The FAST-LTS algorithm is available in R (as part of rrcov and robustbase) and in SAS/IML Version 7. In SAS Version 9 it is incorporated in PROC ROBUSTREG.

Matlab functions for most of the procedures mentioned in this paper (MCD, LTS, MCD-regression, RQDA, ROBPCA, RPCR and RSIMPLS) are part of LIBRA, a Matlab LIBrary for Robust Analysis (Verboven and Hubert, 2005), which can be downloaded from http://wis.kuleuven.be/stat/robust. Several of these functions are also available in the PLS toolbox of Eigenvector Research (www.eigenvector.com).

APPENDIX

A.1 The FAST-MCD Algorithm

Rousseeuw and Van Driessen (1999) developed the FAST-MCD algorithm to efficiently compute the MCD. The key component is the C-step:

Theorem. Take X = {x_1,..., x_n} and let H_1 ⊂ {1,...,n} be an h-subset, that is, |H_1| = h. Put µ̂_1 := (1/h) Σ_{i∈H_1} x_i and Σ̂_1 := (1/h) Σ_{i∈H_1} (x_i − µ̂_1)(x_i − µ̂_1)'. If det(Σ̂_1) ≠ 0, define the relative distances

d_1(i) := sqrt( (x_i − µ̂_1)' Σ̂_1^{-1} (x_i − µ̂_1) )   for i = 1,...,n.

Now take H_2 such that {d_1(i); i ∈ H_2} := {(d_1)_{1:n},..., (d_1)_{h:n}}, where (d_1)_{1:n} ≤ (d_1)_{2:n} ≤ ··· ≤ (d_1)_{n:n} are the ordered distances, and compute µ̂_2 and Σ̂_2 based on H_2. Then

det(Σ̂_2) ≤ det(Σ̂_1)

with equality if and only if µ̂_2 = µ̂_1 and Σ̂_2 = Σ̂_1.

If det(Σ̂_1) > 0, the C-step yields Σ̂_2 with det(Σ̂_2) ≤ det(Σ̂_1). Note that the C stands for "concentration" since Σ̂_2 is more concentrated (has a lower determinant) than Σ̂_1. The condition det(Σ̂_1) ≠ 0 in the C-step theorem is no real restriction because if det(Σ̂_1) = 0 we already have the minimal objective value.

In the algorithm the C-step works as follows. Given (µ̂_old, Σ̂_old):

1. compute the distances d_old(i) for i = 1,...,n;
2. sort these distances, which yields a permutation π for which d_old(π(1)) ≤ d_old(π(2)) ≤ ··· ≤ d_old(π(n));
3. put H_new := {π(1), π(2),..., π(h)};
4. compute µ̂_new := ave(H_new) and Σ̂_new := cov(H_new).

For a fixed number of dimensions p, the C-step takes only O(n) time [because H_new can be determined in O(n) operations without fully sorting all the d_old(i) distances].

C-steps can be iterated until det(Σ̂_new) = 0 or det(Σ̂_new) = det(Σ̂_old). The sequence of determinants obtained in this way must converge in a finite number of steps because there are only finitely many h-subsets. However, there is no guarantee that the final value det(Σ̂_new) of the iteration process is the global minimum of the MCD objective function. Therefore an approximate MCD solution can be obtained by taking many initial choices of H_1, applying C-steps to each and keeping the solution with lowest determinant. For more discussion on resampling algorithms, see Hawkins and Olive (2002).

To construct an initial subset H_1, a random (p + 1)-subset J is drawn and µ̂_0 := ave(J) and Σ̂_0 := cov(J) are computed. [If det(Σ̂_0) = 0, then J can be extended by adding observations until det(Σ̂_0) > 0.] Then, for i = 1,...,n the distances d²_0(i) := (x_i − µ̂_0)' Σ̂_0^{-1} (x_i − µ̂_0) are computed and sorted into d_0(π(1)) ≤ ··· ≤ d_0(π(n)), which leads to H_1 := {π(1),..., π(h)}. This method yields better initial subsets than drawing random h-subsets directly, because the probability of drawing an outlier-free subset is much higher when drawing (p + 1)-subsets than with h-subsets.

The FAST-MCD algorithm contains several computational improvements. Since each C-step involves the calculation of a covariance matrix, its determinant and the corresponding distances, using fewer C-steps considerably improves the speed of the algorithm. It turns out that after two C-steps, many runs that will lead to the global minimum already have a considerably smaller determinant. Therefore, the number of C-steps is reduced by applying only two C-steps on each initial subset and selecting the 10 different subsets with lowest determinants. Only for these 10 subsets, further C-steps are taken until convergence.

This procedure is very fast for small sample sizes n, but when n grows the computation time increases due to the n distances that need to be calculated in each C-step. For large n FAST-MCD uses a partitioning of the dataset, which avoids doing all the calculations in the entire data. In any case, let µ̂_opt and Σ̂_opt denote the mean and covariance matrix of the h-subset with lowest covariance determinant. Then the algorithm returns

µ̂_MCD = µ̂_opt   and   Σ̂_MCD = c_{h,n} Σ̂_opt,

where c_{h,n} is the product of a consistency factor and a finite-sample correction factor (Pison, Van Aelst and Willems, 2002). Note that the FAST-MCD algorithm is itself affine equivariant.

A.2 The FAST-LTS Algorithm

The basic component of the LTS algorithm is again the C-step, which now says that starting from an initial h-subset H_1 or an initial fit θ̂_1, we can construct a new h-subset H_2 by taking the h observations with smallest absolute residuals |r_i(θ̂_1)|. Applying LS to H_2 then yields a new fit θ̂_2, which is guaranteed to have a lower objective function (11).

To construct the initial h-subsets, the algorithm starts from randomly drawn (p + 1)-subsets. For each (p + 1)-subset the coefficients θ_0 of the hyperplane through the points in the subset are calculated. [If a (p + 1)-subset does not define a unique hyperplane, then it is extended by adding more observations until it does.] The corresponding initial h-subset is then formed by the h points closest to the hyperplane (i.e., with smallest residuals). As was the case for the MCD, also here this approach yields much better initial fits than would be the case if random h-subsets were drawn directly.

Let θ̂_opt denote the least squares fit of the optimal h-subset found by the whole resampling procedure; then FAST-LTS returns

θ̂_LTS = θ̂_opt   and   σ̂_LTS = c_{h,n} sqrt( (1/h) Σ_{i=1}^{h} (r²(θ̂_opt))_{i:n} ).
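A minimal sketch of the two concentration steps just described may help: c_step_mcd performs one C-step of FAST-MCD (mean, covariance and relative distances on the current h-subset, then the h smallest distances form the new subset), and c_step_lts performs the analogous step of FAST-LTS based on absolute residuals. These helpers, with names of our choosing, omit the (p + 1)-subset starts, the two-step preselection of candidates, the partitioning for large n and the factor c_{h,n}.

import numpy as np

def c_step_mcd(X, H):
    """One concentration step of FAST-MCD: from an h-subset H to a better one."""
    mu = X[H].mean(axis=0)
    S = np.cov(X[H], rowvar=False, bias=True)      # (1/h)-normalized covariance
    # squared relative distances d^2(i) = (x_i - mu)' S^{-1} (x_i - mu)
    d2 = np.einsum('ij,jk,ik->i', X - mu, np.linalg.inv(S), X - mu)
    H_new = np.argsort(d2)[:len(H)]                # h smallest relative distances
    return H_new, np.linalg.det(S)

def c_step_lts(X1, y, H):
    """One C-step of FAST-LTS: LS fit on H, then keep the h smallest |residuals|.

    X1 is the design matrix (including an intercept column if desired).
    """
    theta, *_ = np.linalg.lstsq(X1[H], y[H], rcond=None)
    r = np.abs(y - X1 @ theta)
    return np.argsort(r)[:len(H)], theta

# In practice these steps are iterated from many starting subsets until the
# determinant (for MCD) or the trimmed sum of squared residuals (for LTS)
# no longer decreases, and the best solution over all starts is kept.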

A.3 The ROBPCA Algorithm

First, the data are preprocessed by reducing their data space to the subspace spanned by the n observations. This is done by a singular value decomposition of X_{n,p}. As a result, the data are represented using at most n − 1 = rank(X̃_{n,p}) variables without loss of information.

In the second step of the ROBPCA algorithm, a measure of outlyingness is computed for each data point. This is obtained by projecting the high-dimensional data points on many univariate directions. On every direction the univariate MCD estimator of location and scale is computed, and for every data point its standardized distance to that center is measured. Finally, for each data point its largest distance over all the directions is considered. The h data points with smallest outlyingness are kept, and from the covariance matrix Σ_h of this h-subset we select the number k of principal components to retain.

The last stage of ROBPCA consists of projecting the data points onto the k-dimensional subspace spanned by the largest eigenvectors of Σ_h and of computing their center and shape using the weighted MCD estimator. The eigenvectors of this scatter matrix then determine the robust principal components, and the location estimate serves as a robust center.

A.4 The SIMPLS Algorithm

The solution of the maximization problem (26) is found by taking r_1 and q_1 as the first left and right singular vectors of Σ̂_xy. The other PLSR weight vectors r_a and q_a for a = 2,...,k are obtained by imposing an orthogonality constraint on the elements of the scores. If we require that Σ_{i=1}^{n} t_{ia} t_{ib} = 0 for a ≠ b, a deflation of the cross-covariance matrix Σ̂_xy provides the solutions for the other PLSR weight vectors. This deflation is carried out by first calculating the x-loading

p_a = Σ̂_x r_a / (r_a' Σ̂_x r_a)

with Σ̂_x the empirical variance–covariance matrix of the x-variables. Next an orthonormal base {v_1,..., v_a} of {p_1,..., p_a} is constructed and Σ̂_xy is deflated as

Σ̂_xy^{a} = Σ̂_xy^{a−1} − v_a (v_a' Σ̂_xy^{a−1})

with Σ̂_xy^{1} = Σ̂_xy. In general the PLSR weight vectors r_a and q_a are obtained as the left and right singular vectors of Σ̂_xy^{a}.

ACKNOWLEDGMENTS

We would like to thank Sanne Engelen, Karlien Vanden Branden and Sabine Verboven for help with preparing the figures of this paper.

REFERENCES

Adrover, J. and Zamar, R. H. (2004). Bias robustness of three median-based regression estimates. J. Statist. Plann. Inference 122 203–227. MR2057923
Agulló, J., Croux, C. and Van Aelst, S. (2006). The multivariate least trimmed squares estimator. J. Multivariate Anal. To appear.
Alqallaf, F. A., Konis, K. P., Martin, R. D. and Zamar, R. H. (2002). Scalable robust covariance and correlation estimates for data mining. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Edmonton.
Andrews, D. F., Bickel, P. J., Hampel, F. R., Huber, P. J., Rogers, W. H. and Tukey, J. W. (1972). Robust Estimates of Location: Survey and Advances. Princeton Univ. Press. MR0331595
Bai, Z. D. and He, X. (2000). Asymptotic distributions of the maximal depth estimators for regression and multivariate location. Ann. Statist. 27 1616–1637. MR1742502
Becker, C. and Gather, U. (1999). The masking breakdown point of multivariate outlier identification rules. J. Amer. Statist. Assoc. 94 947–955. MR1723295
Berrendero, J. R. and Zamar, R. H. (2001). Maximum bias curves for robust regression with non-elliptical regressors. Ann. Statist. 29 224–251. MR1833964
Boente, G., Pires, A. M. and Rodrigues, I. (2002). Influence functions and outlier detection under the common principal components model: A robust approach. Biometrika 89 861–875. MR1946516
Box, G. E. P. (1954). Some theorems on quadratic forms applied in the study of analysis of variance problems. I. Effect of inequality of variance in one-way classification. Ann. Math. Statist. 25 290–302. MR0061787
Brys, G., Hubert, M. and Rousseeuw, P. J. (2005). A robustification of independent component analysis. J. Chemometrics 19 364–375.
Bustos, O. H. and Yohai, V. J. (1986). Robust estimates for ARMA models. J. Amer. Statist. Assoc. 81 155–168. MR0830576
Butler, R. W., Davies, P. L. and Jhun, M. (1993). Asymptotics for the Minimum Covariance Determinant estimator. Ann. Statist. 21 1385–1400. MR1241271
Campbell, N. A. (1980). Robust procedures in multivariate analysis I: Robust covariance estimation. Appl. Statist. 29 231–237.
Cantoni, E. and Ronchetti, E. (2001). Robust inference for generalized linear models. J. Amer. Statist. Assoc. 96 1022–1030. MR1947250
Chen, T.-C. and Victoria-Feser, M. (2002). High breakdown estimation of multivariate location and scale with missing observations. British J. Math. Statist. Psych. 55 317–335. MR1949260
Coakley, C. W. and Hettmansperger, T. P. (1993). A bounded influence, high breakdown, efficient regression estimator. J. Amer. Statist. Assoc. 88 872–880. MR1242938
Copt, S. and Victoria-Feser, M.-P. (2006). High breakdown inference in the mixed linear model. J. Amer. Statist. Assoc. 101 292–300. MR2268046
Croux, C. and Dehon, C. (2001). Robust linear discriminant analysis using S-estimators. Canad. J. Statist. 29 473–493. MR1872648
Croux, C. and Dehon, C. (2002). Analyse canonique basée sur des estimateurs robustes de la matrice de covariance. La Revue de Statistique Appliquee 2 5–26.

Croux, C. and Dehon, C. (2003). Estimators of the multi- ative study. In Theory and Applications of Recent Robust ple correlation coefficient: Local robustness and confidence Methods (M. Hubert, G. Pison, A. Struyf and S. V. Aelst, intervals. Statist. Papers 44 315–334. MR1996955 eds.) 105–117. Birkh¨auser, Basel. MR2085889 Croux, C., Filzmoser, P., Pison, P. and Rousseeuw, Esbensen, K., Schonkopf,¨ S. and Midtgaard, T. (1994). P. J. (2003). Fitting multiplicative models by robust alter- Multivariate Analysis in Practice. Camo, Trondheim. nating regressions. Statist. Comput. 13 23–36. MR1973864 Ferretti, N., Kelmansky, D., Yohai, V. J. and Zamar, Croux, C. and Haesbroeck, G. (1999). Influence function R. H. (1999). A class of locally and globally robust re- and efficiency of the Minimum Covariance Determinant gression estimates. J. Amer. Statist. Assoc. 94 174–188. scatter matrix estimator. J. Multivariate Anal. 71 161–190. MR1689223 MR1735108 Franses, P. H., Kloek, T. and Lucas, A. (1999). Out- Croux, C. and Haesbroeck, G. (2000). Principal compo- lier robust analysis of long-run marketing effects for weekly nents analysis based on robust estimators of the covariance scanning data. J. 89 293–315. or correlation matrix: Influence functions and efficiencies. Friedman, J. H. and Tukey, J. W. (1974). A projec- Biometrika 87 603–618. MR1789812 tion pursuit algorithm for exploratory data analysis. IEEE Croux, C., Rousseeuw, P. J. and Hossjer,¨ O. (1994). Gen- Transactions on Computers C 23 881–889. eralized S-estimators. J. Amer. Statist. Assoc. 89 1271– Garc´ıa Ben, M., Mart´ınez, E. and Yohai, V. J. (2006). 1281. MR1310221 Robust estimation for the multivariate linear model based Croux, C. and Ruiz-Gazen, A. (1996). A fast algorithm for on a τ-scale. J. Multivariate Anal. 97 1600–1622. robust principal components based on projection pursuit. Hampel, F. R. (1975). Beyond location parameters: Robust In COMPSTAT 1996 211–217. Physica, Heidelberg. concepts and methods. Bull. Internat. Statist. Inst. 46 375– Croux, C. and Ruiz-Gazen, A. (2005). High breakdown es- 382. MR0483172 timators for principal components: The projection-pursuit Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J. approach revisited. J. Multivariate Anal. 95 206–226. and Stahel, W. A. (1986). : The Ap- MR2164129 proach Based on Influence Functions. Wiley, New York. Cuesta-Albertos, J. A., Gordaliza, A. and Matran,´ C. MR0829458 (1997). Trimmed k-means: An attempt to robustify quan- Hardin, J. and Rocke, D. M. (2004). Outlier detection tizers. Ann. Statist. 25 553–576. MR1439314 in the multiple cluster setting using the minimum covari- Cui, H., He, X. and Ng, K. W. (2003). Asymptotic distribu- ance determinanr estimator. Comput. Statist. Data Anal. tions of principal components based on robust dispersions. 44 625–638. MR2026436 Biometrika 90 953–966. MR2024769 Hawkins, D. M. and McLachlan, G. J. (1997). High- Davies, L. (1987). Asymptotic behavior of S-estimators of breakdown linear discriminant analysis. J. Amer. Statist. multivariate location parameters and dispersion matrices. Assoc. 92 136–143. MR1436102 Ann. Statist. 15 1269–1292. MR0902258 Hawkins, D. M. and Olive, D. (2002). Inconsistency of re- Davies, L. (1992a). The asymptotics of Rousseeuw’s min- algorithms for high breakdown regression esti- imum volume ellipsoid estimator. Ann. Statist. 20 1828– mators and a new algorithm (with discussion). J. Amer. 1843. MR1193314 Statist. Assoc. 97 136–159. MR1947276 Davies, L. (1992b). 
An efficient Fr´echet differentiable high He, X. and Fung, W. K. (2000). High breakdown estimation breakdown multivariate location and dispersion estimator. for multiple populations with applications to discriminant J. Multivariate Anal. 40 311–327. MR1150615 analysis. J. Multivariate Anal. 72 151–162. MR1740638 Davies, P. L. and Gather, U. (2005). Breakdown and He, X. and Simpson, D. G. (1993). Lower bounds for con- groups. Ann. Statist. 33 977–1035. MR2195626 tamination bias: Globally minimax versus locally linear es- de Jong, S. (1993). SIMPLS: an alternative approach to par- timation. Ann. Statist. 21 314–337. MR1212179 tial least squares regression. Chemometrics and Intelligent He, X., Simpson, D. G. and Wang, G. (2000). Breakdown Laboratory Systems 18 251–263. points of t-type regression estimators. Biometrika 87 675– Donoho, D. L. (1982). Breakdown properties of multivari- 687. MR1789817 ate location estimators. Qualifying paper, Harvard Univ., Hossjer,¨ O. (1994). Rank-based estimates in the linear Boston. model with high breakdown point. J. Amer. Statist. As- Donoho, D. L. and Gasko, M. (1992). Breakdown prop- soc. 89 149–158. MR1266292 erties of location estimates based on halfspace depth Huber, P. J. (1973). Robust regression: Asymptotics, con- and projected outlyingness. Ann. Statist. 20 1803–1827. jectures and Monte Carlo. Ann. Statist. 1 799–821. MR1193313 MR0356373 Donoho, D. L. and Huber, P. J. (1983). The notion of Huber, P. J. (1981). Robust Statistics. Wiley, New York. breakdown point. In A Festschrift for Erich Lehmann MR0606374 (P. Bickel, K. Doksum and J. Hodges, eds.) 157–184. Huber, P. J. (1985). Projection pursuit. Ann. Statist. 13 435– Wadsworth, Belmont, CA. MR0689745 525. MR0790553 Engelen, S. and Hubert, M. (2005). Fast Hubert, M. and Engelen, S. (2004). Robust PCA and clas- for robust calibration methods. Analytica Chimica Acta sification in biosciences. 20 1728–1736. 544 219–228. Hubert, M. and Rousseeuw, P. J. (1996). Robust regres- Engelen, S., Hubert, M., Vanden Branden, K. and Ver- sion with both continuous and binary regressors. J. Statist. boven, S. (2004). Robust PCR and robust PLS: A compar- Plann. Infer. 57 153–163. ROBUST MULTIVARIATE STATISTICS 27

Hubert, M., Rousseeuw, P. J. and Vanden Branden, Lopuhaa,¨ H. P. (1999). Asymptotics of reweighted estima- K. (2005). ROBPCA: A new approach to robust principal tors of multivariate location and scatter. Ann. Statist. 27 components analysis. Technometrics 47 64–79. MR2135793 1638–1665. MR1742503 Hubert, M., Rousseeuw, P. J. and Verboven, S. (2002). Lopuhaa,¨ H. P. and Rousseeuw, P. J. (1991). Breakdown A fast robust method for principal components with ap- points of affine equivariant estimators of multivariate lo- plications to chemometrics. Chemometrics and Intelligent cation and covariance matrices. Ann. Statist. 19 229–248. Laboratory Systems 60 101–111. MR1091847 Hubert, M. and Van Driessen, K. (2004). Fast and robust Lucas, A. and Franses, P. H. (1998). Outlier detection in 16 discriminant analysis. Comput. Statist. Data Anal. 45 301– analysis. J. Bus. Econom. Statist. 459– 320. MR2045634 468. Hubert, M. and Vanden Branden, K. (2003). Robust Markatou, M., Basu, A. and Lindsay, B. G. (1998). methods for Partial Least Squares Regression. J. Chemo- Weighted likelihood equations with bootstrap root search. J. Amer. Statist. Assoc. 93 740–750. MR1631378 metrics 17 537–549. Markatou, M. and He, X. (1994). Bounded influence and Hubert, M. and Verboven, S. (2003). A robust PCR high breakdown point testing procedures in linear models. method for high-dimensional regressors. J. Chemometrics J. Amer. Statist. Assoc. 89 543–549. MR1294081 17 438–452. Markatou, M. and Hettmansperger, T. P. (1990). Ro- Johnson, R. A. and Wichern, D. W. (1998). Applied Multi- bust bounded-influence tests in linear models. J. Amer. variate Statistical Analysis. Prentice Hall Inc., Englewood Statist. Assoc. 85 187–190. MR1137365 Cliffs, NJ. MR1168210 Maronna, R. A. (1976). Robust M-estimators of multivari- ´ Jureckova, J. (1971). Nonparametric estimate of regression ate location and scatter. Ann. Statist. 4 51–67. MR0388656 42 coefficients. Ann. Math. Statist. 1328–1338. MR0295487 Maronna, R. A. (2005). Principal components and orthog- Kaufman, L. and Rousseeuw, P. J. (1990). Finding Groups onal regression based on robust scales. Technometrics 47 in Data. Wiley, New York. MR1044997 264–273. MR2164700 Kent, J. T. and Tyler, D. E. (1996). Constrained M- Maronna, R. A., Bustos, O. and Yohai, V. J. (1979). estimation for multivariate location and scatter. Ann. Bias- and efficiency-robustness of general M-estimators for Statist. 24 1346–1370. MR1401854 regression with random carriers. Smoothing Techniques for Koenker, R. and Portnoy, S. (1987). L-estimation for Curve Estimation (T. Gasser and M. Rosenblatt, eds.). linear models. J. Amer. Statist. Assoc. 82 851–857. Lecture Notes in Math. 757 91–116. Springer, New York. MR0909992 MR0564254 Krasker, W. S. and Welsch, R. E. (1982). Efficient Maronna, R. A., Martin, D. R. and Yohai, V. J. (2006). bounded-influence regression estimation. J. Amer. Statist. Robust Statistics: Theory and Methods. Wiley, New York. Assoc. 77 595–604. MR0675886 MR2238141 Kunsch,¨ H. R., Stefanski, L. A. and Carroll, R. J. Maronna, R. A., Stahel, W. A. and Yohai, V. J. (1992). (1989). Conditionally unbiased bounded influence estima- Bias-robust estimators of multivariate scatter based on pro- tion in general regression models with applications to gen- jections. J. Multivar. Anal. 42 141–161. MR1177523 eralized linear models. J. Amer. Statist. Assoc. 84 460–466. Maronna, R. A. and Yohai, V. J. (1981). 
Asymptotic be- MR1010334 havior of general M-estimators for regression and scale 58 Lemberge, P., De Raedt, I., Janssens, K. H., Wei, F. with random carriers. Z. Wahrsch. Verw. Gebiete 7–20. and Van Espen, P. J. (2000). Quantitative Z-analysis of MR0635268 16th–17th century archaelogical glass vessels using PLS re- Maronna, R. A. and Yohai, V. J. (1993). Bias-robust esti- 21 gression of EPXMA and µ-XRF data. J. Chemometrics 14 mates of regression based on projections. Ann. Statist. 965–990. MR1232528 751–763. Maronna, R. A. and Yohai, V. J. (1995). The behavior of Li, G. and Chen, Z. (1985). Projection-pursuit approach to the Stahel–Donoho robust multivariate estimator. J. Amer. robust dispersion matrices and principal components: Pri- Statist. Assoc. 90 330–341. MR1325140 mary theory and Monte Carlo. J. Amer. Statist. Assoc. 80 Maronna, R. A. and Yohai, V. J. (2000). Robust regression 759–766. with both continuous and categorical predictors. J. Statist. Liu, R. Y. Parelius, J. M. Singh, K. , and (1999). Multivari- Plann. Infer. 89 197–214. MR1794422 ate analysis by data depth: , graphics Maronna, R. A. and Zamar, R. H. (2002). Robust multi- 27 and inference. Ann. Statist. 783–840. MR1724033 variate estimates for high dimensional data sets. Techno- Locantore, N., Marron, J. S., Simpson, D. G., Tripoli, metrics 44 307–317. MR1939680 N., Zhang, J. T. and Cohen, K. L. (1999). Robust prin- Martin, R. D. and Yohai, V. J. (1986). Influence function- cipal component analysis for functional data. Test 8 1–73. als for time series. Ann. Statist. 14 781–818. MR0856793 MR1707596 Martin, R. D., Yohai, V. J. and Zamar, R. H. (1989). Min- Lopuhaa,¨ H. P. (1989). On the relation between S-estimators max bias robust regression. Ann. Statist. 17 1608–1630. and M-estimators of multivariate location and covariance. MR1026302 Ann. Statist. 17 1662–1683. MR1026304 Meer, P., Mintz, D., Rosenfeld, A. and Kim, D. Y. Lopuhaa,¨ H. P. (1991). Multivariate τ-estimators for loca- (1991). Robust regression methods in computer vision: A tion and scatter. Canad. J. Statist. 19 307–321. MR1144148 review. Internat. J. Comput. Vision 6 59–70. 28 M. HUBERT, P. J. ROUSSEEUW AND S. VAN AELST

Mendes, B. and Tyler, D. E. (1996). Constrained M- Rousseeuw, P. J. and Christmann, A. (2003). Robustness estimates for regression. Robust Statistics: Data Analysis against separation and outliers in logistic regression. Com- and Computer Intensive Methods (H. Rieder, ed.). Lec- put. Statist. Data Anal. 43 315–332. MR1996815 ture Notes in Statist. 109 299–320. Springer, New York. Rousseeuw, P. J. and Croux, C. (1993). Alternatives to MR1491412 the median absolute deviation. J. Amer. Statist. Assoc. 88 Mili, L., Cheniae, M. G., Vichare, N. S. and Rousseeuw, 1273–1283. MR1245360 P. J. (1996). Robust state estimation based on projection Rousseeuw, P. J. and Hubert, M. (1999). Regression statistics. IEEE Transactions on Power Systems 11 1118– depth. J. Amer. Statist. Assoc. 94 388–402. MR1702314 1127. Rousseeuw, P. J. and Leroy, A. M. (1987). Robust Regres- Mizera, I. (2002). On depth and deep points: A calculus. sion and Outlier Detection. Wiley-Interscience, New York. Ann. Statist. 30 1681–1736. MR1969447 MR0914792 Model, F., Konig,¨ T., Piepenbrock, C. and Adorjan, P. Rousseeuw, P. J., Ruts, I. and Tukey, J. W. (1999a). (2002). Statistical process control for large scale microarray The : A bivariate boxplot. American Statistician 53 . Bioinformatics 1 1–9. 382–387. Muler, N. and Yohai, V. J. (2002). Robust estimates Rousseeuw, P. J., Van Aelst, S. and Hubert, M. (1999b). for ARCH processes. J. Time Series Anal. 23 341–375. Rejoinder to the discussion of “Regression depth.” J. 94 MR1908596 Amer. Statist. Assoc. 388–433. MR1702314 Muller,¨ C. H. and Neykov, N. (2003). Breakdown points Rousseeuw, P. J., Van Aelst, S., Van Driessen, K. and ´ of trimmed likelihood estimators and related estimators in Agullo, J. (2004). Robust multivariate regression. Tech- 46 generalized linear models. J. Statist. Plann. Infer. 116 503– nometrics 293–305. MR2082499 519. MR2000097 Rousseeuw, P. J. and Van Driessen, K. (1999). A fast Muller,¨ S. and Welsh, A. H. (2006). Outlier robust model algorithm for the minimum covariance determinant esti- 41 selection in linear regression. J. Amer. Statist. Assoc. 100 mator. Technometrics 212–223. Rousseeuw, P. J. Van Driessen, K. 1297–1310. MR2236443 and (2006). Comput- ing LTS regression for large data sets. Data Mining and Odewahn, S. C., Djorgovski, S. G., Brunner, R. J. and Knowledge Discovery 12 29–45. MR2225526 Gal, R. (1998). Data from the digitized palomar sky sur- Rousseeuw, P. J. and van Zomeren, B. C. (1990). Un- vey. Technical report, California Institute of Technology. masking multivariate outliers and leverage points. J. Amer. Osborne, B. G., Fearn, T., Miller, A. R. and Douglas, Statist. Assoc. 85 633–651. S. (1984). Application of near infrared reflectance spec- Rousseeuw, P. J. and Yohai, V. J. (1984). Robust regres- troscopy to the compositional analysis of biscuits and bis- sion by means of S-estimators. Robust and Nonlinear Time cuit dough. J. Scientific Food Agriculture 35 99–105. Series Analysis (J. Franke, W. H¨ardle and R. Martin, eds.). Pell, R. J. (2000). Multiple outlier detection for multivariate Lecture Notes in Statist. 26 256–272. Springer, New York. calibration using robust statistical techniques. Chemomet- MR0786313 rics and Intelligent Laboratory Systems 52 87–104. Salibian-Barrera, M., Van Aelst, S. and Willems, G. Pison, G., Rousseeuw, P. J., Filzmoser, P. and Croux, (2006). PCA based on multivariate MM-estimators with C. 84 (2003). Robust factor analysis. J. Multivariate Anal. fast and robust bootstrap. J. Amer. Statist. Assoc. 
101 145–172. MR1965827 1198–1211. Pison, G., Van Aelst, S. and Willems, G. (2002). Small Salibian-Barrera, M. and Yohai, V. J. (2006). A fast 55 sample corrections for LTS and MCD. Metrika 111–123. algorithm for S-regression estimates. J. Comput. Graph. MR1903287 Statist. 15 414–427. MR2246273 Rocke, D. M. (1996). Robustness properties of S-estimators Salibian-Barrera, M. and Zamar, R. H. (2002). Boot- of multivariate location and shape in high dimension. Ann. strapping robust estimates of regression. Ann. Statist. 30 24 Statist. 1327–1345. MR1401853 556–582. MR1902899 Rocke, D. M. and Woodruff, D. L. (1996). Identification Siegel, A. F. (1982). Robust regression using repeated me- 91 of outliers in multivariate data. J. Amer. Statist. Assoc. dians. Biometrika 69 242–244. 1047–1061. MR1424606 Simpson, D. G., Ruppert, D. and Carroll, R. J. (1992). Ronchetti, E., Field, C. and Blanchard, W. (1997). Ro- On one-step GM-estimates and stability of inferences in bust linear model selection by cross-validation. J. Amer. linear regression. J. Amer. Statist. Assoc. 87 439–450. Statist. Assoc. 92 1017–1023. MR1482132 MR1173809 Ronchetti, E. and Staudte, R. G. (1994). A robust ver- Simpson, D. G. and Yohai, V. J. (1998). Functional stability sion of Mallows’ Cp. J. Amer. Statist. Assoc. 89 550–559. of one-step estimators in approximately linear regression. MR1294082 Ann. Statist. 26 1147–1169. MR1635458 Rousseeuw, P. J. (1984). Least median of squares regression. Stahel, W. A. (1981). Robuste Sch¨atzungen: Infinites- J. Amer. Statist. Assoc. 79 871–880. MR0770281 imale Optimalit¨at und Sch¨atzungen von Kovarianzma- Rousseeuw, P. J. (1985). Multivariate estimation with high trizen. Ph.D. thesis, ETH Z¨urich. breakdown point. In Mathematical Statistics and Applica- Stewart, C. V. (1995). MINPRAN: A new robust estima- tions, B (W. Grossmann, G. Pflug, I. Vincze and W. Wertz, tor for computer vision. IEEE Trans. Pattern Anal. Mach. eds.). Reidel Publishing Company, Dordrecht. MR0851060 Intelligence 17 925–938. ROBUST MULTIVARIATE STATISTICS 29

Stromberg, A. J. (1993). Computation of high breakdown Laboratory Systems 75 127–136. nonlinear regression. J. Amer. Statist. Assoc. 88 237–244. Willems, G., Pison, G., Rousseeuw, P. J. and Van Aelst, Stromberg, A. J. and Ruppert, D. (1992). Breakdown in S. (2002). A robust Hotelling test. Metrika 55 125–138. nonlinear regression. J. Amer. Statist. Assoc. 87 991–997. MR1903288 MR1209560 Woodruff, D. L. and Rocke, D. M. (1994). Computable Tatsuoka, K. S. and Tyler, D. E. (2000). On the unique- robust estimation of multivariate location and shape in ness of S-functionals and M-functionals under nonelliptical high dimension using compound estimators. J. Amer. distributions. Ann. Statist. 28 1219–1243. MR1811326 Statist. Assoc. 89 888–896. MR1294732 Tyler, D. E. (1994). Finite-sample breakdown points of Yohai, V. J. (1987). High breakdown point and high ef- projection-based multivariate location and scatter statis- ficiency robust estimates for regression. Ann. Statist. 15 tics. Ann. Statist. 22 1024–1044. MR1292555 642–656. MR0888431 Van Aelst, S. and Rousseeuw, P. J. (2000). Robustness Yohai, V. J. and Zamar, R. H. (1988). High breakdown of deepest regression. J. Multivariate Anal. 73 82–106. point estimates of regression by means of the minimization MR1766122 of an efficient scale. J. Amer. Statist. Assoc. 83 406–413. Van Aelst, S., Rousseeuw, P. J., Hubert, M. and MR0971366 Struyf, A. (2002). The deepest regression method. J. Zamar, R. H. (1989). Robust estimation in the errors in vari- Multivariate Anal. 81 138–166. MR1901211 ables model. Biometrika 76 149–160. MR0991433 Van Aelst, S. and Willems, G. (2005). Multivariate re- Zamar, R. H. (1992). Bias robust estimation in orthogonal gression S-estimators for robust estimation and inference. regression. Ann. Statist. 20 1875–1888. MR1193316 Statist. Sinica 15 981–1001. MR2234409 Zuo, Y., Cui, H. and He, X. (2004). On the Stahel–Donoho van Dijk, D., Franses, P. H. and Lucas, A. (1999a). Test- estimator and depth-weighted means of multivariate data. ing for ARCH in the presence of additive outliers. J. Appl. Ann. Statist. 32 167–188. MR2051003 Econometrics 14 539–562. Zuo, Y. and Serfling, R. (2000a). General notions of statis- van Dijk, D., Franses, P. H. and Lucas, A. (1999b). Test- tical depth function. Ann. Statist. 28 461–482. MR1790005 ing for smooth transition nonlinearity in the presence of Zuo, Y. and Serfling, R. (2000b). Nonparametric notions of outliers. J. Bus. Econom. Statist. 17 217–235. multivariate “scatter measure” and “more scattered” based Vanden Branden, K. and Hubert, M. (2004). Robustness on statistical depth functions. J. Multivariate Anal. 75 62– properties of a robust PLS regression method. Analytica 78. MR1787402 Chimica Acta 515 229–241. Verboven, S. and Hubert, M. (2005). LIBRA: A Matlab library for robust analysis. Chemometrics and Intelligent