Focus Article

Robust statistics for outlier detection

Peter J. Rousseeuw and Mia Hubert

When analyzing data, outlying observations cause problems because they may strongly influence the result. Robust statistics aims at detecting the outliers by searching for the model fitted by the majority of the data. We present an overview of several robust methods and outlier detection tools. We discuss robust procedures for univariate, low-dimensional, and high-dimensional data such as estimation of location and scatter, linear regression, principal component analysis, and classification. © 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 73–79 DOI: 10.1002/widm.2

∗Correspondence to: [email protected]
Renaissance Technologies, New York, and Katholieke Universiteit Leuven, Belgium

INTRODUCTION

In real data sets, it often happens that some observations are different from the majority. Such observations are called outliers. Outlying observations may be errors, or they could have been recorded under exceptional circumstances, or belong to another population. Consequently, they do not fit the model well. It is very important to be able to detect these outliers.

In practice, one often tries to detect outliers using diagnostics starting from a classical fitting method. However, classical methods can be affected by outliers so strongly that the resulting fitted model does not allow us to detect the deviating observations. This is called the masking effect. In addition, some good data points might even appear to be outliers, which is known as swamping. To avoid these effects, the goal of robust statistics is to find a fit that is close to the fit we would have found without the outliers. We can then identify the outliers by their large deviation from that robust fit.

First, we describe some robust procedures for estimating univariate location and scale. Next, we discuss multivariate location and scatter, as well as linear regression. We also give a summary of available robust methods for principal component analysis (PCA), classification, and clustering. For a more extensive review, see Ref 1. Some full-length books on this topic are Refs 2, 3.

ESTIMATING UNIVARIATE LOCATION AND SCALE

As an example, suppose we have five measurements of a length:

$$6.27,\ 6.34,\ 6.25,\ 6.31,\ 6.28 \tag{1}$$

and we want to estimate its true value. For this, one usually computes the mean $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, which in this case equals $\bar{x} = (6.27 + 6.34 + 6.25 + 6.31 + 6.28)/5 = 6.29$. Let us now suppose that the fourth measurement has been recorded wrongly and the data become

$$6.27,\ 6.34,\ 6.25,\ 63.1,\ 6.28. \tag{2}$$

In this case, we obtain $\bar{x} = 17.65$, which is far from the unknown true value. Alternatively, we could compute the median of these data. To this end, we sort the observations in (2) from smallest to largest:

$$6.25 \le 6.27 \le 6.28 \le 6.34 \le 63.10.$$

The median is then the middle value, yielding 6.28, which is still reasonable. We say that the median is more robust against an outlier.

More generally, the location-scale model states that the $n$ univariate observations $x_i$ are independent and identically distributed (i.i.d.) with distribution function $F[(x - \mu)/\sigma]$ where $F$ is known. Typically, $F$ is the standard Gaussian distribution function $\Phi$. We then want to find estimates for the center $\mu$ and the scale parameter $\sigma$.

The classical estimate of location is the mean. As we saw above, the mean is very sensitive to even one aberrant value out of the $n$ observations. We say that the breakdown value (Refs 4, 5) of the sample mean is $1/n$, so it is 0% for large $n$. In general, the breakdown value is the smallest proportion of observations in the data set that need to be replaced to carry the estimate arbitrarily far away. See Ref 6 for precise definitions and extensions. The robustness of an estimator is also measured by its influence function (Ref 7), which measures the effect of one outlier. The influence function of the mean is unbounded, which again illustrates that the mean is not robust.

For a general definition of the median, we denote the $i$th ordered observation as $x_{(i)}$. Then, the median is $x_{((n+1)/2)}$ if $n$ is odd, and $[x_{(n/2)} + x_{(n/2+1)}]/2$ if $n$ is even. Its breakdown value is about 50%, meaning that the median can resist up to 50% of outliers, and its influence function is bounded. Both properties illustrate the median's robustness.

The situation for the scale parameter $\sigma$ is similar. The classical estimator is the standard deviation $s = \sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2 / (n - 1)}$. Because a single outlier can already make $s$ arbitrarily large, its breakdown value is 0%. For instance, for the clean data (1) above, we have $s = 0.035$, whereas for the data (2) with the outlier, we obtain $s = 25.41$! A robust measure of scale is the median of all absolute deviations from the median (MAD):

$$\mathrm{MAD} = 1.483 \operatorname*{median}_{i=1,\ldots,n} \left| x_i - \operatorname*{median}_{j=1,\ldots,n}(x_j) \right|. \tag{3}$$

The constant 1.483 is a correction factor that makes the MAD unbiased at the normal distribution. The MAD of (2) is the same as that of (1), namely 0.044.

We can also use the $Q_n$ estimator (Ref 8), defined as

$$Q_n = 2.2219\, \{|x_i - x_j|;\ i < j\}_{(k)}$$

with $k = \binom{h}{2} \approx \binom{n}{2}/4$ and $h = \lfloor n/2 \rfloor + 1$. Here, $\lfloor \ldots \rfloor$ rounds down to the nearest integer. This scale estimator is thus the first quartile of all pairwise differences between two data points. The breakdown value of both the MAD and the $Q_n$ estimator is 50%.

Also popular is the interquartile range (IQR) defined as the difference between the third and first quartiles, that is, $\mathrm{IQR} = x_{(\lceil 3n/4 \rceil)} - x_{(\lceil n/4 \rceil)}$ (where $\lceil \ldots \rceil$ rounds up to the nearest integer). Its breakdown value is only 25% but it has a simple interpretation.

The robustness of the median and the MAD comes at a cost: At the normal model they are less efficient than the mean. To find a better balance between robustness and efficiency, many other robust procedures have been proposed such as M-estimators (Ref 9). They are defined implicitly as the solution of the equation

$$\sum_{i=1}^{n} \psi\!\left(\frac{x_i - \hat{\theta}}{\hat{\sigma}}\right) = 0 \tag{4}$$

for a real function $\psi$. The denominator $\hat{\sigma}$ is an initial robust scale estimate such as the MAD. A solution to (4) can be found by the Newton–Raphson algorithm, starting from the initial location estimate $\hat{\theta}^{(0)} = \operatorname{median}_i(x_i)$. Popular choices for $\psi$ are the Huber function $\psi(x) = x \min(1, c/|x|)$, and Tukey's biweight function $\psi(x) = x\,(1 - (x/c)^2)^2\, I(|x| \le c)$. These M-estimators contain a tuning parameter $c$, which needs to be chosen in advance.

People often use rules to detect outliers. The classical rule is based on the z-scores of the observations given by

$$z_i = (x_i - \bar{x})/s \tag{5}$$

where $s$ is the standard deviation. More precisely, the rule flags $x_i$ as outlying if $|z_i|$ exceeds 2.5, say. But in the above-mentioned example (2) with the outlier, the z-scores are

$$-0.45,\ -0.45,\ -0.45,\ 1.79,\ -0.45$$

so none of them attains 2.5. The largest value is only 1.79, which is quite similar to the largest z-score for the clean data (1), which equals 1.41. The z-score of the outlier is small because it subtracts the nonrobust mean (which was drawn toward the outlier) and because it divides by the nonrobust standard deviation (which the outlier has made much larger than in the clean data). Plugging robust estimators of location and scale into (5), such as the median and the MAD, yields the robust scores

$$\left( x_i - \operatorname*{median}_{j=1,\ldots,n}(x_j) \right) / \mathrm{MAD} \tag{6}$$

which are more useful; in the contaminated example (2), the robust scores are

$$-0.22,\ 1.35,\ -0.67,\ 1277.5,\ 0.0$$

in which the outlier greatly exceeds the 2.5 cutoff.

Also Tukey's boxplot is often used to pinpoint possible outliers. In this plot, a box is drawn from the first quartile $Q_1 = x_{(\lceil n/4 \rceil)}$ to the third quartile $Q_3 = x_{(\lceil 3n/4 \rceil)}$ of the data. Points outside the interval $[Q_1 - 1.5\,\mathrm{IQR},\ Q_3 + 1.5\,\mathrm{IQR}]$, called the fence, are traditionally marked as outliers. Note that the boxplot assumes symmetry because we add the same amount to $Q_3$ as what we subtract from $Q_1$. At asymmetric distributions, the usual boxplot typically flags many regular data points as outlying. The skewness-adjusted boxplot corrects for this by using a robust measure of skewness in determining the fence (Ref 10).

MULTIVARIATE LOCATION AND COVARIANCE ESTIMATION

From now on, we assume that the data are $p$-dimensional and are stored in an $n \times p$ data matrix $X = (x_1, \ldots, x_n)^T$ with $x_i = (x_{i1}, \ldots, x_{ip})^T$ the $i$th ob-

these subsets, $h$-subsets are constructed by so-called C-steps (see Ref 14 for details).

Many other robust estimators of location and scatter have been presented in the literature. The first such estimator was proposed by Stahel (Ref 15) and Donoho (Ref 16) (see also Ref 17).