High-Breakdown Robust Multivariate Methods
Statistical Science, 2008, Vol. 23, No. 1, 92–119
DOI: 10.1214/088342307000000087
© Institute of Mathematical Statistics, 2008

Mia Hubert, Peter J. Rousseeuw and Stefan Van Aelst

Mia Hubert is Professor, University Center for Statistics and Department of Mathematics, Katholieke Universiteit Leuven, Celestijnenlaan 200 B, B-3001 Leuven, Belgium (e-mail: [email protected]). Peter J. Rousseeuw is Professor, Department of Mathematics and Computer Science, University of Antwerp, Middelheimlaan 1, B-2020 Antwerp, Belgium (e-mail: [email protected]). Stefan Van Aelst is Professor, Department of Applied Mathematics and Computer Science, Ghent University, Krijgslaan 281 S9, B-9000 Ghent, Belgium (e-mail: [email protected]).

Abstract. When applying a statistical method in practice it often occurs that some observations deviate from the usual assumptions. However, many classical methods are sensitive to outliers. The goal of robust statistics is to develop methods that are robust against the possibility that one or several unannounced outliers may occur anywhere in the data. These methods then make it possible to detect outlying observations by their residuals from a robust fit. We focus on high-breakdown methods, which can deal with a substantial fraction of outliers in the data. We give an overview of recent high-breakdown robust methods for multivariate settings such as covariance estimation, multiple and multivariate regression, discriminant analysis, principal components and multivariate calibration.

Key words and phrases: Breakdown value, influence function, multivariate statistics, outliers, partial least squares, principal components, regression, robustness.

1. INTRODUCTION

Many multivariate datasets contain outliers, that is, data points that deviate from the usual assumptions and/or from the pattern suggested by the majority of the data. Outliers are more likely to occur in datasets with many observations and/or variables, and often they do not show up by simple visual inspection.

The usual multivariate analysis techniques (e.g., principal components, discriminant analysis and multivariate regression) are based on empirical means, covariance and correlation matrices, and least squares fitting. All of these can be strongly affected by even a few outliers. When the data contain nasty outliers, typically two things happen:

• the multivariate estimates differ substantially from the "right" answer, defined here as the estimates we would have obtained without the outliers;
• the resulting fitted model does not allow us to detect the outliers by means of their residuals, Mahalanobis distances or the widely used "leave-one-out" diagnostics.

The first consequence is fairly well known (although the size of the effect is often underestimated). Unfortunately, the second consequence is less well known, and when stated many people find it hard to believe or paradoxical. Common intuition says that outliers must "stick out" from the classical fitted model, and indeed some of them may do so. But the most harmful types of outliers, especially if there are several of them, may affect the estimated model so much "in their direction" that they are now well fitted by it.

Once this effect is understood, one sees that the following two problems are essentially equivalent:

• Robust estimation: find a "robust" fit, which is similar to the fit we would have found without the outliers.
• Outlier detection: find all the outliers that matter.

Indeed, a solution to the first problem allows us to identify the outliers by their residuals, and so on, from the robust fit. Conversely, a solution to the second problem allows us to remove or downweight the outliers and then apply a classical fit, which yields a robust result.

Our research focuses on the first problem, and uses its results to answer the second. We prefer this approach over the opposite direction because from a combinatorial viewpoint it is more feasible to search for sufficiently many "good" data points than to find all the "bad" data points.

It turns out that most of the currently available highly robust multivariate estimators are difficult to compute, which makes them unsuitable for the analysis of large and/or high-dimensional datasets. Among the few exceptions is the minimum covariance determinant estimator (MCD) of Rousseeuw (1984, 1985). The MCD is a highly robust estimator of multivariate location and scatter that can be computed efficiently with the FAST-MCD algorithm of Rousseeuw and Van Driessen (1999).

Section 2 concentrates on robust estimation of location and scatter. We first describe the MCD estimator and discuss its main properties. Alternatives for the MCD are explained briefly with relevant pointers to the literature for more details. Section 3 does the same for robust regression and mainly focuses on the least trimmed squares (LTS) estimator (Rousseeuw, 1984), which is an analog of the MCD for multiple regression. Since estimating the covariance matrix is the cornerstone of many multivariate statistical methods, robust scatter estimators have also been used to develop robust and computationally efficient multivariate techniques. The paper then goes on to describe robust methods for multivariate regression (Section 4), classification (Section 5), principal component analysis (Section 6), principal component regression (Section 7), partial least squares regression (Section 8) and other settings (Section 9). Section 10 concludes with pointers to available software for the described techniques.
2. MULTIVARIATE LOCATION AND SCATTER

2.1 The Need for Robustness

In the multivariate location and scatter setting we assume that the data are stored in an n × p data matrix X = (x_1, ..., x_n)′ with x_i = (x_{i1}, ..., x_{ip})′ the ith observation. Hence n stands for the number of objects and p for the number of variables.

To illustrate the effect of outliers we consider the following engineering problem, taken from Rousseeuw and Van Driessen (1999). Philips Mecoma (The Netherlands) produces diaphragm parts for television sets. These are thin metal plates, molded by a press. When starting a new production line, p = 9 characteristics were measured for n = 677 parts. The aim is to gain insight into the production process and to find out whether abnormalities have occurred. A classical approach is to compute the Mahalanobis distance

(1)  $\mathrm{MD}(x_i) = \sqrt{(x_i - \hat{\mu}_0)'\,\hat{\Sigma}_0^{-1}(x_i - \hat{\mu}_0)}$

of each measurement x_i. Here μ̂_0 is the arithmetic mean and Σ̂_0 is the classical covariance matrix. The distance MD(x_i) should tell us how far away x_i is from the center of the cloud, relative to the size of the cloud.

In Figure 1 we plotted the classical Mahalanobis distance versus the index i, which corresponds to the production sequence. The horizontal line is at the usual cutoff value $\sqrt{\chi^2_{9,0.975}} = 4.36$. Figure 1 suggests that most observations are consistent with the classical assumption that the data come from a multivariate normal distribution, except for a few isolated outliers. This should not surprise us, even in the first experimental run of a new production line, because the Mahalanobis distances are known to suffer from the masking effect. That is, even if there were a group of outliers (here, deformed diaphragm parts), they would affect μ̂_0 and Σ̂_0 in such a way that they get small Mahalanobis distances MD(x_i) and thus become invisible in Figure 1. To get a reliable analysis of these data we need robust estimators μ̂ and Σ̂ that can resist possible outliers. For this purpose we will use the MCD estimates described below.

Fig. 1. Mahalanobis distances of the Philips data.
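To make the masking effect concrete, the following short sketch computes the classical Mahalanobis distances of equation (1) and compares them with the cutoff $\sqrt{\chi^2_{p,0.975}}$. It is our illustration, not code from the paper: the Philips data are not publicly available, so simulated data of the same size (n = 677, p = 9), with a planted group of outliers, stand in for them, and all variable names are our own choices.

```python
# Classical Mahalanobis distances, equation (1), with the chi-square cutoff.
# Simulated stand-in for the Philips data (n = 677, p = 9).
import numpy as np
from scipy.stats import chi2

rng = np.random.default_rng(0)
n, p = 677, 9

X = rng.multivariate_normal(np.zeros(p), np.eye(p), size=n)
X[:100] += 6.0                      # plant a tight group of 100 outliers

mu0 = X.mean(axis=0)                # arithmetic mean (classical location)
Sigma0 = np.cov(X, rowvar=False)    # classical covariance matrix
prec0 = np.linalg.inv(Sigma0)

diff = X - mu0
MD = np.sqrt(np.einsum("ij,jk,ik->i", diff, prec0, diff))  # equation (1)

cutoff = np.sqrt(chi2.ppf(0.975, df=p))   # sqrt(chi^2_{9,0.975}) = 4.36
print(f"cutoff = {cutoff:.2f}, flagged = {np.sum(MD > cutoff)} of {n}")
# Masking in action: although 100 points were shifted far away, they inflate
# mu0 and Sigma0 in their own direction, so only a handful exceed the cutoff.
```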
2.2 Description of the MCD

The MCD method looks for the h observations (out of n) whose classical covariance matrix has the lowest possible determinant. The MCD estimate of location is then the average of these h points, whereas the MCD estimate of scatter is a multiple of their covariance matrix. The MCD location and scatter estimates are affine equivariant, which means that they behave properly under affine transformations of the data. That is, for an n × p dataset X the MCD estimates (μ̂, Σ̂) satisfy

(2)  $\hat{\mu}(XA + 1_n v') = \hat{\mu}(X)A + v'$,

(3)  $\hat{\Sigma}(XA + 1_n v') = A'\hat{\Sigma}(X)A$,

for all p × 1 vectors v and all nonsingular p × p matrices A. The vector 1_n is (1, 1, ..., 1)′ with n elements. Affine equivariance is a natural property of the underlying model and makes the analysis independent of the measurement scales of the variables as well as translations or rotations of the data.

A useful measure of robustness is the finite-sample breakdown value ε*_n(μ̂, X) of an estimator μ̂ at a dataset X, that is, the smallest fraction of observations that need to be replaced by arbitrary values to carry the estimate beyond all bounds. For many estimators ε*_n(μ̂, X) varies only slightly with X and n, so that we can denote its limiting value (for n → ∞) by ε*(μ̂). Similarly, the breakdown value of a covariance matrix estimator Σ̂ is defined as the smallest fraction of outliers that can take either the largest eigenvalue λ_1(Σ̂) to infinity or the smallest eigenvalue λ_p(Σ̂) to zero. The MCD estimates (μ̂, Σ̂) of multivariate location and scatter have breakdown value ε*_n(μ̂) = ε*_n(Σ̂) ≈ (n − h)/n. The MCD has its highest possible breakdown value (ε* = 50%) when h = [(n + p + 1)/2] (see Lopuhaä and Rousseeuw, 1991). Note that no affine equivariant estimator can have a breakdown value above 50%. For a recent discussion of the importance of equivariance in breakdown considerations, see Davies and Gather (2005).
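As a companion to the sketch above, the MCD analysis itself can be carried out with scikit-learn's MinCovDet, an implementation of the FAST-MCD algorithm; using this particular library is our choice, not the paper's. With the default subset size h = [(n + p + 1)/2], the robust distances computed from the MCD location and scatter unmask the planted group that the classical distances missed.

```python
# Robust distances from the MCD (FAST-MCD via scikit-learn), continuing the
# simulated example above; the data matrix X is reused from that sketch.
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import MinCovDet

# support_fraction=None keeps the default h = (n + p + 1)/2 observations,
# which gives the maximal breakdown value (n - h)/n, about 50%.
mcd = MinCovDet(support_fraction=None, random_state=0).fit(X)

mu_rob = mcd.location_       # robust location (after sklearn's reweighting step)
Sigma_rob = mcd.covariance_  # robust scatter estimate

cutoff = np.sqrt(chi2.ppf(0.975, df=X.shape[1]))
RD = np.sqrt(mcd.mahalanobis(X))   # robust distances based on the MCD fit
print(f"flagged by robust distances: {np.sum(RD > cutoff)} of {len(X)}")
# The MCD fit is driven by the h "good" points with the smallest covariance
# determinant, so the 100 shifted observations now lie far beyond the cutoff.
```

Plotting RD against the observation index gives the robust counterpart of the display in Figure 1, in which a genuine group of outliers stands out instead of being masked.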