Robust Covariance Estimation for Financial Applications

Tim Verdonck, Mia Hubert, Peter Rousseeuw
Department of Mathematics, K.U.Leuven
August 30, 2011

Contents
1 Introduction: Robust Statistics
2 Multivariate Location and Scatter Estimates
3 Minimum Covariance Determinant Estimator (MCD)
  ◮ FAST-MCD algorithm
  ◮ DetMCD algorithm
4 Principal Component Analysis
5 Multivariate Time Series
6 Conclusions
7 Selected references

Introduction: Robust Statistics

Real data often contain outliers, and most classical methods are highly influenced by them.

What is robust statistics? Robust statistical methods try to fit the model imposed by the majority of the data. They aim to find a 'robust' fit, similar to the fit we would have found without the outliers (the observations deviating from the robust fit). This also allows for outlier detection: a robust estimate applied to all observations is comparable to the classical estimate applied to the outlier-free data set.

A good robust estimator combines high robustness with high efficiency:
◮ Robustness: being less influenced by outliers.
◮ Efficiency: being precise at uncontaminated data.

Univariate Scale Estimation: Wages data set

6000 households with a male head earning less than USD 15000 annually in 1966, classified into 39 demographic groups (we concentrate on the variable AGE).

◮ Standard Deviation (SD): $\sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2} = 4.91$
◮ Interquartile Range (IQR): $0.74\,\big(x_{(\lfloor 0.75n \rfloor)} - x_{(\lfloor 0.25n \rfloor)}\big) = 0.91$
◮ Median Absolute Deviation (MAD): $1.48\;\mathrm{med}_i\,|x_i - \mathrm{med}_j\,x_j| = 0.96$
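These three estimates are easy to reproduce. A minimal R sketch on simulated data (the Wages data themselves are not included here; note that base R's mad() already uses the constant 1.4826, while IQR() must be scaled by roughly 0.7413 to be consistent at the normal distribution):

```r
## Scale estimation on clean-plus-contaminated data (illustrative only)
set.seed(1)
x <- c(rnorm(95, mean = 40, sd = 2),   # bulk of the data
       rnorm(5,  mean = 40, sd = 25))  # 5% contamination

sd(x)              # classical SD: breakdown 1/n, blown up by the outliers
0.7413 * IQR(x)    # normal-consistent IQR: 25% breakdown
mad(x)             # MAD (constant 1.4826 by default): 50% breakdown
```

The two robust estimates stay close to the scale of the clean part, while the SD does not — the same pattern as in the Wages example above.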
Measures of robustness: breakdown point

The breakdown point of a scale estimator S is the smallest fraction of observations to be contaminated such that S ↑ ∞ or S ↓ 0.

Scale estimator   Breakdown point
SD                1/n ≈ 0
IQR               25%
MAD               50%

Note that when the breakdown value of an estimator is ε, this does not imply that a contamination proportion smaller than ε does not affect the estimator at all.

Measures of robustness: influence function

A specific type of contamination is point contamination,
$$F_{\varepsilon,y} = (1 - \varepsilon)F + \varepsilon \Delta_y,$$
with $\Delta_y$ the Dirac measure at $y$.

Influence Function (Hampel, 1986): the influence function measures how $T(F)$ changes when contamination is added at $y$,
$$IF(y; T, F) = \lim_{\varepsilon \to 0} \frac{T(F_{\varepsilon,y}) - T(F)}{\varepsilon},$$
where $T(\cdot)$ is the functional version of the estimator.
◮ The IF is a local measure of robustness, whereas the breakdown point is a global measure.
◮ We prefer estimators that have a bounded IF.

[Figure: influence function illustration (Hampel, 1986).]

Multivariate Location and Scatter Estimates

Scatterplot of bivariate data with true correlation ρ = 0.990:
◮ classical estimate: ρ̂ = 0.779
◮ robust estimate: ρ̂_MCD = 0.987
[Figure: scatterplot of the bivariate data.]

[Figure: boxplots of the marginals.] In the multivariate setting, outliers cannot be detected by simply applying outlier detection rules to each variable separately. Only by correctly estimating the covariance structure can we detect the outliers.

Classical estimator

Data: $X_n = \{x_1, \ldots, x_n\}$ with $x_i \in \mathbb{R}^p$. Model: $x_i \sim N_p(\mu, \Sigma)$. More generally, we can assume that the data are generated from an elliptical distribution, i.e. a distribution whose density contours are ellipses.

The classical estimators for $\mu$ and $\Sigma$ are the empirical mean and covariance matrix,
$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \qquad S_n = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x})'.$$
Both are highly sensitive to outliers:
◮ zero breakdown value
◮ unbounded IF.

Tolerance ellipsoid

The boundary of the tolerance ellipsoid contains the x-values with constant Mahalanobis distance to the mean,
$$MD_i = \sqrt{(x_i - \bar{x})' S_n^{-1} (x_i - \bar{x})}.$$

Classical tolerance ellipsoid:
$$\{x \mid MD(x) \le \sqrt{\chi^2_{p,0.975}}\},$$
with $\chi^2_{p,0.975}$ the 97.5% quantile of the $\chi^2$ distribution with p degrees of freedom. In large samples we expect about 97.5% of the observations to belong to this ellipsoid, so we can flag observation $x_i$ as an outlier if it falls outside the tolerance ellipsoid.

[Figure: classical tolerance ellipsoid for the bivariate example.]
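This classical flagging rule is straightforward to code. A minimal R sketch on simulated bivariate data (MASS::mvrnorm and the base functions mahalanobis() and qchisq() are all standard):

```r
## Flagging outliers with the classical tolerance ellipsoid
library(MASS)
set.seed(1)
X <- mvrnorm(100, mu = c(0, 0), Sigma = matrix(c(1, 0.9, 0.9, 1), 2, 2))
X[1:5, 1] <- X[1:5, 1] + 4          # shift 5 points off the correlation
X[1:5, 2] <- X[1:5, 2] - 4          #   structure (bivariate outliers)

md     <- sqrt(mahalanobis(X, center = colMeans(X), cov = cov(X)))
cutoff <- sqrt(qchisq(0.975, df = ncol(X)))
which(md > cutoff)                  # observations outside the ellipsoid
```

Because the planted outliers also inflate the classical mean and covariance, some of them can escape this cutoff (masking). This is exactly what motivates the robust distances of the next section.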
Minimum Covariance Determinant Estimator (MCD)

An estimator of multivariate location and scatter [Rousseeuw, 1984].

Raw MCD estimator:
◮ Choose h between ⌊(n + p + 1)/2⌋ and n.
◮ Find the h < n observations whose classical covariance matrix has the lowest determinant,
$$H_0 = \operatorname*{argmin}_{H} \det\big(\operatorname{cov}(x_i \mid i \in H)\big).$$
◮ $\hat{\mu}_0$ is the mean of those h observations,
$$\hat{\mu}_0 = \frac{1}{h} \sum_{i \in H_0} x_i.$$
◮ $\hat{\Sigma}_0$ is the covariance matrix of those h observations, multiplied by a consistency factor,
$$\hat{\Sigma}_0 = c_0 \operatorname{cov}(x_i \mid i \in H_0).$$

Reweighted MCD estimator:
◮ Compute the initial robust distances
$$d_i = D(x_i, \hat{\mu}_0, \hat{\Sigma}_0) = \sqrt{(x_i - \hat{\mu}_0)' \hat{\Sigma}_0^{-1} (x_i - \hat{\mu}_0)}.$$
◮ Assign weights $w_i = 0$ if $d_i > \sqrt{\chi^2_{p,0.975}}$, else $w_i = 1$.
◮ Compute the reweighted mean and covariance matrix,
$$\hat{\mu}_{MCD} = \frac{\sum_{i=1}^n w_i x_i}{\sum_{i=1}^n w_i}, \qquad \hat{\Sigma}_{MCD} = c_1 \left( \sum_{i=1}^n w_i (x_i - \hat{\mu}_{MCD})(x_i - \hat{\mu}_{MCD})' \right) \left( \sum_{i=1}^n w_i \right)^{-1}.$$
◮ Compute the final robust distances and assign new weights $w_i$.

Outlier detection

For outlier detection, recompute the robust distances, now based on the MCD:
$$RD_i = \sqrt{(x_i - \hat{\mu}_{MCD})' \hat{\Sigma}_{MCD}^{-1} (x_i - \hat{\mu}_{MCD})}.$$
Flag observation $x_i$ as an outlier if $RD_i > \sqrt{\chi^2_{p,0.975}}$. This is equivalent to flagging the observations that do not belong to the robust tolerance ellipsoid
$$\{x \mid RD(x) \le \sqrt{\chi^2_{p,0.975}}\}.$$

[Figure: robust tolerance ellipsoid (based on the MCD) for the bivariate example.]

Properties of the MCD
◮ Robust: breakdown point tunable from 0 up to 50% (through the choice of h), and a bounded influence function [Croux and Haesbroeck, 1999].
◮ Positive definite.
◮ Affine equivariant: given X, the MCD estimates satisfy
$$\hat{\mu}(XA + 1_n v') = \hat{\mu}(X) A + v', \qquad \hat{\Sigma}(XA + 1_n v') = A' \hat{\Sigma}(X) A$$
for all nonsingular matrices A and all constant vectors v. Hence the data may be rotated, translated or rescaled without affecting the outlier detection diagnostics.
◮ Not very efficient; this is improved by the reweighting step.
◮ Computation: the FAST-MCD algorithm [Rousseeuw and Van Driessen, 1999].

FAST-MCD algorithm

Computation of the raw estimates for n ≤ 600 (a minimal C-step sketch in R follows at the end of this section):
◮ For m = 1 to 500:
  ◮ draw a random subset of size p + 1;
  ◮ apply two C-steps: compute the robust distances $d_i = D(x_i, \hat{\mu}, \hat{\Sigma}) = \sqrt{(x_i - \hat{\mu})' \hat{\Sigma}^{-1} (x_i - \hat{\mu})}$, take the h observations with smallest robust distance, and compute the mean and covariance matrix of this h-subset.
◮ Retain the 10 h-subsets with lowest covariance determinant.
◮ Apply C-steps on these 10 subsets until convergence.
◮ Retain the h-subset with lowest covariance determinant.

Notes on the FAST-MCD algorithm:
◮ A C-step can only decrease the determinant of the covariance matrix.
◮ As there are only finitely many h-subsets, convergence to a (local) minimum is guaranteed.
◮ The algorithm is not guaranteed to yield the global minimum; the fixed number of initial (p+1)-subsets (500) is a compromise between robustness and computation time.
◮ Implementations of the FAST-MCD algorithm are widely available:
  ◮ R: packages robustbase and rrcov
  ◮ Matlab: the LIBRA toolbox and the PLS toolbox of Eigenvector Research
  ◮ SAS: PROC ROBUSTREG
  ◮ S-plus: the built-in function cov.mcd.

Example: Animals data set

Logarithm of body and brain weight for 28 animals. Outlier detection based on the MCD correctly indicates the outliers.
[Figure: scatterplot with classical and robust tolerance ellipses.]

In dimension p > 2 a scatterplot and tolerance ellipsoid cannot be drawn. To expose the differences between a classical and a robust analysis, a distance–distance plot can be made, which plots the robust distances against the classical Mahalanobis distances. Here too, outlier detection based on the MCD correctly indicates the outliers.
[Figure: distance–distance plot for the Animals data.]

DetMCD algorithm

A deterministic algorithm for the MCD [Hubert, Rousseeuw and Verdonck, 2010].
◮ Idea: compute several 'robust' h-subsets, based on robust transformations of the variables and on robust estimators of multivariate location and scatter.
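As promised above, here is the C-step at the computational core of FAST-MCD. The sketch was written for this text (the reference implementations live in robustbase and rrcov) and assumes the seed points are in general position, so the seed covariance is nonsingular:

```r
## One C-step: given (center, scatter), keep the h observations with
## smallest robust distance and recompute mean and covariance.
## Illustrative sketch only -- not the robustbase/rrcov implementation.
c_step <- function(X, center, scatter, h) {
  d2 <- mahalanobis(X, center = center, cov = scatter)  # squared distances
  H  <- order(d2)[seq_len(h)]                           # h closest points
  list(center  = colMeans(X[H, , drop = FALSE]),
       scatter = cov(X[H, , drop = FALSE]))
}

## Iterate C-steps from an initial (p+1)-subset until the covariance
## determinant stops decreasing (guaranteed after finitely many steps).
raw_mcd_from_subset <- function(X, h, seed_idx) {
  est <- list(center  = colMeans(X[seed_idx, , drop = FALSE]),
              scatter = cov(X[seed_idx, , drop = FALSE]))
  repeat {
    new <- c_step(X, est$center, est$scatter, h)
    if (det(new$scatter) >= det(est$scatter)) return(new)
    est <- new
  }
}
```

FAST-MCD runs this from 500 random (p + 1)-subsets (two C-steps each), keeps the 10 best candidates, and iterates only those to convergence.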
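In practice one calls an existing implementation. A brief R session using robustbase on the Animals data from MASS; treat the plot option "dd" and the DetMCD option nsamp = "deterministic" as assumptions to verify against your installed version's ?covMcd:

```r
library(MASS)        # Animals: body and brain weight of 28 animals
library(robustbase)  # covMcd: reweighted MCD computed via FAST-MCD

X <- log(Animals)                    # work on the log scale, as in the slides

mcd <- covMcd(X, alpha = 0.75)       # h is roughly 0.75 * n
mcd$center                           # robust location estimate
mcd$cov                              # robust scatter estimate

rd     <- sqrt(mahalanobis(X, mcd$center, mcd$cov))  # robust distances
cutoff <- sqrt(qchisq(0.975, df = ncol(X)))
rownames(X)[rd > cutoff]             # names of the flagged animals

plot(mcd, which = "dd")              # distance-distance plot (robust vs classical)
# covMcd(X, nsamp = "deterministic") # DetMCD starting subsets (recent robustbase)
```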