Review of Robustbase Software for R
Total Pages: 16
File Type: PDF, Size: 1020 KB
Recommended publications
Robust Statistics Part 3: Regression Analysis
Robust Statistics Part 3: Regression analysis
Peter Rousseeuw, LARS-IASC School, May 2019

Linear regression: Outline
1 Classical regression estimators
2 Classical outlier diagnostics
3 Regression M-estimators
4 The LTS estimator
5 Outlier detection
6 Regression S-estimators and MM-estimators
7 Regression with categorical predictors
8 Software

Classical estimators: The linear regression model

The linear regression model says:

$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \varepsilon_i = x_i'\beta + \varepsilon_i$$

with i.i.d. errors $\varepsilon_i \sim N(0, \sigma^2)$, $x_i = (1, x_{i1}, \dots, x_{ip})'$ and $\beta = (\beta_0, \beta_1, \dots, \beta_p)'$.

Denote the $n \times (p+1)$ matrix containing the predictors $x_i$ as $X = (x_1, \dots, x_n)'$, the vector of responses $y = (y_1, \dots, y_n)'$ and the error vector $\varepsilon = (\varepsilon_1, \dots, \varepsilon_n)'$. Then:

$$y = X\beta + \varepsilon$$

Any regression estimate $\hat\beta$ yields fitted values $\hat{y} = X\hat\beta$ and residuals $r_i = r_i(\hat\beta) = y_i - \hat{y}_i$.

Classical estimators: The least squares estimator

Least squares estimator:

$$\hat\beta_{LS} = \operatorname*{argmin}_{\beta} \sum_{i=1}^{n} r_i^2(\beta)$$

If $X$ has full rank, then the solution is unique and given by

$$\hat\beta_{LS} = (X'X)^{-1} X'y$$

The usual unbiased estimator of the error variance is

$$\hat\sigma^2_{LS} = \frac{1}{n-p-1} \sum_{i=1}^{n} r_i^2(\hat\beta_{LS})$$

Classical estimators: Outliers in regression

Different types of outliers:
[Figure: scatter plot of y against x showing regular data, a vertical outlier, a good leverage point and a bad leverage point.]
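As a rough illustration of the least squares formulas in the excerpt above, the following sketch fits a single-predictor model (p = 1) in plain Python and then adds one bad leverage point. The data values and the function names `least_squares` and `error_variance` are invented for this example and do not come from the slides.

```python
def least_squares(x, y):
    """Closed-form least squares fit for one predictor (p = 1).

    For a single predictor, (X'X)^{-1} X'y reduces to
    slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x).
    Assumes the x values are not all identical (X has full rank).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def error_variance(x, y, intercept, slope, p=1):
    """Error variance estimate: sum of squared residuals / (n - p - 1)."""
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return sum(r * r for r in residuals) / (len(x) - p - 1)


# Hypothetical regular data lying exactly on the line y = 2x.
x_clean = [1.0, 2.0, 3.0, 4.0, 5.0]
y_clean = [2.0, 4.0, 6.0, 8.0, 10.0]
clean_b0, clean_b1 = least_squares(x_clean, y_clean)
clean_s2 = error_variance(x_clean, y_clean, clean_b0, clean_b1)

# The same data plus one hypothetical bad leverage point: far away in x
# and far from the line followed by the regular data.
x_bad = x_clean + [20.0]
y_bad = y_clean + [0.0]
bad_b0, bad_b1 = least_squares(x_bad, y_bad)

print("clean fit:        intercept", clean_b0, "slope", clean_b1)
print("contaminated fit: intercept", bad_b0, "slope", bad_b1)
```

With these made-up numbers a single bad leverage point is enough to turn the fitted slope from positive to negative, which is the kind of sensitivity the robust estimators listed in the outline are meant to avoid.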
Robustbase: Basic Robust Statistics
Package ‘robustbase’
June 2, 2021

Version 0.93-8
VersionNote Released 0.93-7 on 2021-01-04 to CRAN
Date 2021-06-01
Title Basic Robust Statistics
URL http://robustbase.r-forge.r-project.org/
Description ``Essential'' Robust Statistics. Tools allowing to analyze data with robust methods. This includes regression methodology including model selections and multivariate statistics where we strive to cover the book ``Robust Statistics, Theory and Methods'' by 'Maronna, Martin and Yohai'; Wiley 2006.
Depends R (>= 3.5.0)
Imports stats, graphics, utils, methods, DEoptimR
Suggests grid, MASS, lattice, boot, cluster, Matrix, robust, fit.models, MPV, xtable, ggplot2, GGally, RColorBrewer, reshape2, sfsmisc, catdata, doParallel, foreach, skewt
SuggestsNote mostly only because of vignette graphics and simulation
Enhances robustX, rrcov, matrixStats, quantreg, Hmisc
EnhancesNote linked to in man/*.Rd
LazyData yes
NeedsCompilation yes
License GPL (>= 2)
Author Martin Maechler [aut, cre] (<https://orcid.org/0000-0002-8685-9910>),
  Peter Rousseeuw [ctb] (Qn and Sn),
  Christophe Croux [ctb] (Qn and Sn),
  Valentin Todorov [aut] (most robust Cov),
  Andreas Ruckstuhl [aut] (nlrob, anova, glmrob),
  Matias Salibian-Barrera [aut] (lmrob orig.),
  Tobias Verbeke [ctb, fnd] (mc, adjbox),
  Manuel Koller [aut] (mc, lmrob, psi-func.),
  Eduardo L. T. Conceicao [aut] (MM-, tau-, CM-, and MTL- nlrob),
  Maria Anna di Palma [ctb] (initial version of Comedian)
Maintainer Martin Maechler <[email protected]>
Repository CRAN
Date/Publication 2021-06-02 10:20:02 UTC

R topics documented:
adjbox
adjboxStats
adjOutlyingness
aircraft
airmay
alcohol
ambientNOxCH
Animals2
anova.glmrob
anova.lmrob
A Practical Guide to Support Predictive Tasks in Data Science
A Practical Guide to Support Predictive Tasks in Data Science
José Augusto Câmara Filho1, José Maria Monteiro1, César Lincoln Mattos1 and Juvêncio Santos Nobre2
1 Department of Computing, Federal University of Ceará, Fortaleza, Ceará, Brazil
2 Department of Statistics and Applied Mathematics, Federal University of Ceará, Fortaleza, Ceará, Brazil

Keywords: Practical Guide, Prediction, Data Science.

Abstract: Currently, professionals from the most diverse areas of knowledge need to explore their data repositories in order to extract knowledge and create new products or services. Several tools have been proposed in order to facilitate the tasks involved in the Data Science lifecycle. However, such tools require their users to have specific (and deep) knowledge in different areas of Computing and Statistics, making their use practically unfeasible for non-specialist professionals in data science. In this paper, we propose a guideline to support predictive tasks in data science. In addition to being useful for non-experts in Data Science, the proposed guideline can support data scientists, data engineers or programmers who are starting to deal with predictive tasks. Besides, we present a tool, called DSAdvisor, which follows the stages of the proposed guideline. DSAdvisor aims to encourage non-expert users to build machine learning models to solve predictive tasks, extracting knowledge from their own data repositories. More specifically, DSAdvisor guides these professionals in predictive tasks involving regression and classification.

1 INTRODUCTION
Due to the large amount of data currently available, the need arises for professionals of different areas to extract knowledge from their repositories to create … predict the future, and create new services and products (Ozdemir, 2016). Data science makes it possible to identify hidden patterns and obtain new insights hidden in these datasets, from complex machine learning algorithms.
CFE-CMStatistics 2020 Book of Abstracts
CFE-CMStatistics 2020
PROGRAMME AND ABSTRACTS

14th International Conference on Computational and Financial Econometrics (Virtual CFE 2020)
http://www.cfenetwork.org/CFE2020
and
13th International Conference of the ERCIM (European Research Consortium for Informatics and Mathematics) Working Group on Computational and Methodological Statistics (Virtual CMStatistics 2020)
http://www.cmstatistics.org/CMStatistics2020

19 – 21 December 2020

CMStatistics: Computational and Methodological Statistics
CFENetwork: Computational and Financial Econometrics

ISBN 978-9963-2227-9-7
© 2020 - ECOSTA ECONOMETRICS AND STATISTICS. All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted, in any other form or by any means without the prior permission from the publisher.

International Organizing Committee: Ana Colubi, Erricos Kontoghiorghes and Manfred Deistler.

CFE 2020 Co-chairs: Anurag Banerjee, Scott Brave, Peter Pedroni and Mike So.

CFE 2020 Programme Committee: Knut Are Aastveit, Alessandra Amendola, David Ardia, Josu Arteche, Anindya Banerjee, Travis Berge, Monica Billio, Raffaella Calabrese, Massimiliano Caporin, Julien Chevallier, Serge Darolles, Luca De Angelis, Filippo Ferroni, Ana-Maria Fuertes, Massimo Guidolin, Harry Haupt, Masayuki Hirukawa, Benjamin Holcblat, Rustam Ibragimov, Laura Jackson Young, Michel Juillard, Edward Knotek, Robinson Kruse-Becher, Svetlana Makarova, Ilia Negri, Ingmar Nolte, Jose Olmo, Yasuhiro Omori, Jesus Otero, Michael Owyang, Alessia Paccagnini, Indeewara Perera, Jean-Yves Pitarakis, Tommaso Proietti, Artem Prokhorov, Tatevik Sekhposyan, Etsuro Shioji, Michael Smith, Robert Taylor, Martin Wagner and Ralf Wilke.

CMStatistics 2020 Co-chairs: Tapabrata Maiti, Sofia Olhede, Michael Pitt, Cheng Yong Tang and Tim Verdonck.
Detecting Outliers in Weighted Univariate Survey Data
Detecting outliers in weighted univariate survey data
Anna Pauliina Sandqvist∗
October 27, 2015
Preliminary Version

Abstract
Outliers and influential observations are a frequent concern in all kinds of statistics, data analysis and survey data. Especially if the data is asymmetrically distributed or heavy-tailed, outlier detection turns out to be difficult, as most of the already introduced methods are not optimal in this case. In this paper we examine various non-parametric outlier detection approaches for (size-)weighted growth rates from quantitative surveys and propose new respectively modified methods which can account better for skewed and long-tailed data. We apply empirical influence functions to compare these methods under different data specifications.

JEL Classification: C14
Keywords: Outlier detection, influential observation, size-weight, periodic surveys

1 Introduction
Outliers are usually considered to be extreme values which are far away from the other data points (see, e.g., Barnett and Lewis (1994)). Chambers (1986) was first to differentiate between representative and non-representative outliers. The former are observations with correct values and are not considered to be unique, whereas non-representative outliers are elements with incorrect values or are for some other reasons considered to be unique. Most outlier analysis focuses on the representative outliers, as non-representative values should be taken care of already in (survey) data editing.

The main reason to be concerned about possible outliers is that, depending on whether or not they are included in the sample, the estimates might differ considerably. The data points with substantial influence on the estimates are called influential observations and they should be

∗ Author's address: KOF Swiss Economic Institute, ETH Zurich, Leonhardstrasse 21 LEE, CH-8092 Zurich, Switzerland.
Robust Linear Regression: Optimal Rates in Polynomial Time
Robust Linear Regression: Optimal Rates in Polynomial Time
Ainesh Bakshi* (CMU), Adarsh Prasad (CMU)
[email protected] [email protected]

Abstract
We obtain robust and computationally efficient estimators for learning several linear models that achieve statistically optimal convergence rate under minimal distributional assumptions. Concretely, we assume our data is drawn from a $k$-hypercontractive distribution and an $\epsilon$-fraction is adversarially corrupted. We then describe an estimator that converges to the optimal least-squares minimizer for the true distribution at a rate proportional to $\epsilon^{2-2/k}$, when the noise is independent of the covariates. We note that no such estimator was known prior to our work, even with access to unbounded computation. The rate we achieve is information-theoretically optimal and thus we resolve the main open question in Klivans, Kothari and Meka [COLT'18].

Our key insight is to identify an analytic condition that serves as a polynomial relaxation of independence of random variables. In particular, we show that when the moments of the noise and covariates are negatively-correlated, we obtain the same rate as independent noise. Further, when the condition is not satisfied, we obtain a rate proportional to $\epsilon^{2-4/k}$, and again match the information-theoretic lower bound. Our central technical contribution is to algorithmically exploit independence of random variables in the "sum-of-squares" framework by formulating it as the aforementioned polynomial inequality.

arXiv:2007.01394v4 [stat.ML] 4 Dec 2020

*AB would like to thank the partial support from the Office of Naval Research (ONR) grant N00014-18-1-2562, and the National Science Foundation (NSF) under Grant No.
Outlier Detection
OUTLIER DETECTION
Short Course Session 1
Nedret BILLOR, Auburn University, Department of Mathematics & Statistics, USA
Statistics Conference, Colombia, Aug 8-12, 2016

OUTLINE
Motivation and Introduction
Approaches to Outlier Detection
Sensitivity of Statistical Methods to Outliers
Statistical Methods for Outlier Detection
Outliers in Univariate data
Outliers in Multivariate Classical and Robust Statistical Distance-based Methods
PCA based Outlier Detection
Outliers in Functional Data

MOTIVATION & INTRODUCTION
Hadlum vs. Hadlum (1949) [Barnett 1978]
Ozone Hole

Case I: Hadlum vs. Hadlum (1949) [Barnett 1978]
The birth of a child to Mrs. Hadlum happened 349 days after Mr. Hadlum left for military service. Average human gestation period is 280 days (40 weeks). Statistically, 349 days is an outlier.
- blue: statistical basis (13634 observations of gestation periods)
- green: assumed underlying Gaussian process
- Very low probability for the birth of Mrs. Hadlum's child being generated by this process
- red: assumption of Mr. Hadlum (another Gaussian process responsible for the observed birth, where the gestation period responsible)
- Under this assumption the gestation period has an average duration and highest-possible probability

Case II: The Antarctic Ozone Hole
The History behind the Ozone Hole
• The Earth's ozone layer protects all life from the sun's harmful radiation.
• Human activities (e.g. CFC's in aerosols) have damaged this shield.
• Less protection from ultraviolet light will, over time, lead to higher skin cancer and cataract rates and crop damage.
• Molina and Rowland in 1974 (lab study), and many studies after this, demonstrated the ability of CFC's (Chlorofluorocarbons) to break down ozone in the presence of high frequency UV light.
Computational Statistics and Data Analysis Robust PCA for Skewed
Computational Statistics and Data Analysis 53 (2009) 2264–2274
journal homepage: www.elsevier.com/locate/csda

Robust PCA for skewed data and its outlier map
Mia Hubert a, Peter Rousseeuw b, Tim Verdonck b,∗
a Department of Mathematics, LSTAT, Katholieke Universiteit Leuven, Belgium
b Department of Mathematics and Computer Science, University of Antwerp, Belgium

Article history: Available online 6 June 2008

Abstract
The outlier sensitivity of classical principal component analysis (PCA) has spurred the development of robust techniques. Existing robust PCA methods like ROBPCA work best if the non-outlying data have an approximately symmetric distribution. When the original variables are skewed, too many points tend to be flagged as outlying. A robust PCA method is developed which is also suitable for skewed data. To flag the outliers a new outlier map is defined. Its performance is illustrated on real data from economics, engineering, and finance, and confirmed by a simulation study.
© 2008 Elsevier B.V. All rights reserved.

1. Introduction
Principal component analysis is one of the best known techniques of multivariate statistics. It is a dimension reduction technique which transforms the data to a smaller set of variables while retaining as much information as possible. These new variables, called the principal components (PCs), are uncorrelated and maximize variance (information). Once the PCs are computed, all further analysis like cluster analysis, discriminant analysis, regression, ... can be carried out on the transformed data.

When given a data matrix $X$ with $n$ observations and $p$ variables, the PCs $t_i$ are linear combinations of the data, $t_i = X p_i$, where

$$p_i = \operatorname*{argmax}_{a} \{\operatorname{var}(Xa)\}$$

under the constraints $\|a\| = 1$ and $a \perp \{p_1, \dots, p_{i-1}\}$.
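To make the variance-maximizing definition of the first principal component in the excerpt above concrete, here is a minimal sketch in plain Python. It computes only the classical (non-robust) first loading vector, and it uses power iteration on the sample covariance matrix, which is a technique chosen for this illustration rather than anything described in the paper. The data set and the function names are hypothetical.

```python
import math


def covariance_matrix(rows):
    """Return the sample covariance matrix and the column-centered data."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]
    cov = [
        [sum(c[j] * c[k] for c in centered) / (n - 1) for k in range(d)]
        for j in range(d)
    ]
    return cov, centered


def first_principal_component(rows, iterations=100):
    """Approximate the unit vector a that maximizes var(Xa) by power iteration.

    Assumes the covariance matrix is nonzero and that the starting vector
    (all ones) is not orthogonal to the leading eigenvector.
    """
    cov, centered = covariance_matrix(rows)
    d = len(cov)
    a = [1.0] * d
    for _ in range(iterations):
        a = [sum(cov[j][k] * a[k] for k in range(d)) for j in range(d)]
        norm = math.sqrt(sum(v * v for v in a))
        a = [v / norm for v in a]  # enforce the constraint ||a|| = 1
    scores = [sum(c[j] * a[j] for j in range(d)) for c in centered]
    return a, scores


# Hypothetical two-variable data: most of the spread is along the first axis.
data = [(2.0, 0.0), (-2.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
loading, scores = first_principal_component(data)
score_variance = sum(s * s for s in scores) / (len(scores) - 1)

print("first loading vector:", loading)
print("variance of the first PC scores:", score_variance)
```

Because the covariance matrix is built from ordinary means and sums of squares, a single extreme row can dominate it, which is the outlier sensitivity that motivates the robust variants discussed in the abstract.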
Robust Statistics for Outlier Detection Peter J
Focus Article
Robust statistics for outlier detection
Peter J. Rousseeuw and Mia Hubert

When analyzing data, outlying observations cause problems because they may strongly influence the result. Robust statistics aims at detecting the outliers by searching for the model fitted by the majority of the data. We present an overview of several robust methods and outlier detection tools. We discuss robust procedures for univariate, low-dimensional, and high-dimensional data such as estimation of location and scatter, linear regression, principal component analysis, and classification.
© 2011 John Wiley & Sons, Inc. WIREs Data Mining Knowl Discov 2011 1 73–79 DOI: 10.1002/widm.2

INTRODUCTION
In real data sets, it often happens that some observations are different from the majority. Such observations are called outliers. Outlying observations may be errors, or they could have been recorded under exceptional circumstances, or belong to another population. Consequently, they do not fit the model well. It is very important to be able to detect these outliers.

In practice, one often tries to detect outliers using diagnostics starting from a classical fitting method.

ESTIMATING UNIVARIATE LOCATION AND SCALE
As an example, suppose we have five measurements of a length:

6.27, 6.34, 6.25, 6.31, 6.28 (1)

and we want to estimate its true value. For this, one usually computes the mean $\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$, which in this case equals $\bar{x} = (6.27 + 6.34 + 6.25 + 6.31 + 6.28)/5 = 6.29$. Let us now suppose that the fourth measurement has been recorded wrongly and the data
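The five-measurement example above can be reproduced in a few lines of Python. The excerpt breaks off before stating the wrongly recorded value, so the value 63.1 used below is a hypothetical stand-in (a misplaced decimal point) and is not taken from the text. The median is likewise added here only as a familiar robust point of comparison.

```python
from statistics import mean, median

# The five length measurements quoted in the excerpt.
measurements = [6.27, 6.34, 6.25, 6.31, 6.28]
clean_mean = mean(measurements)
clean_median = median(measurements)

# Hypothetical recording error in the fourth measurement; the excerpt is
# truncated before giving the actual wrong value.
HYPOTHETICAL_BAD_VALUE = 63.1
corrupted = list(measurements)
corrupted[3] = HYPOTHETICAL_BAD_VALUE
corrupted_mean = mean(corrupted)
corrupted_median = median(corrupted)

print("mean:  ", clean_mean, "->", corrupted_mean)
print("median:", clean_median, "->", corrupted_median)
```

Under this assumed error the mean is pulled far away from all four correct measurements, while the median still returns one of the original values.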
Robust Covariance Estimation for Financial Applications
Robust covariance estimation for financial applications
Tim Verdonck, Mia Hubert, Peter Rousseeuw
Department of Mathematics, K.U.Leuven
August 30 2011

Contents
1 Introduction Robust Statistics
2 Multivariate Location and Scatter Estimates
3 Minimum Covariance Determinant Estimator (MCD)
  FAST-MCD algorithm
  DetMCD algorithm
4 Principal Component Analysis
5 Multivariate Time Series
6 Conclusions
7 Selected references

Introduction Robust Statistics
Real data often contain outliers. Most classical methods are highly influenced by these outliers.

What is robust statistics?
Robust statistical methods try to fit the model imposed by the majority of the data. They aim to find a ‘robust’ fit, which is similar to the fit we would have found without outliers (observations deviating from robust fit). This also allows for outlier detection. Robust estimate applied on all observations is comparable with the classical estimate applied on the outlier-free data set.

Robust estimator
A good robust estimator combines high robustness with high efficiency.
◮ Robustness: being less influenced by outliers.
◮ Efficiency: being precise at uncontaminated data.

Univariate Scale Estimation: Wages data set
6000 households with male head earning
Mining Software Engineering Data for Useful Knowledge Boris Baldassari
Mining Software Engineering Data for Useful Knowledge
Boris Baldassari

To cite this version: Boris Baldassari. Mining Software Engineering Data for Useful Knowledge. Machine Learning [stat.ML]. Université de Lille, 2014. English. tel-01297400

HAL Id: tel-01297400
https://tel.archives-ouvertes.fr/tel-01297400
Submitted on 4 Apr 2016

HAL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers.

École doctorale Sciences Pour l’Ingénieur
THESIS presented to obtain the degree of Doctor, specialty Computer Science
by Boris Baldassari
Mining Software Engineering Data for Useful Knowledge
prepared in the joint project-team SequeL
Publicly defended on 1 July 2014 before a jury composed of:
Philippe Preux, University Professor - Université de Lille 3 - Supervisor
Benoit Baudry, INRIA Research Scientist - INRIA Rennes - Reviewer
Laurence Duchien, University Professor - Université de Lille 1 - Examiner
Flavien Huynh, Engineer, PhD - Squoring Technologies - Examiner
Pascale Kuntz, University Professor - Polytech’ Nantes - Reviewer
Martin Monperrus, Associate Professor - Université de Lille 1 - Examiner

Preface
Maisqual is a recursive acronym standing for “Maisqual Automagically Improves Software QUALity”. It may sound naive or pedantic at first sight, but it clearly stated at one time the expectations of Maisqual.
Robust Statistics Part 1: Introduction and Univariate Data General References
Robust Statistics Part 1: Introduction and univariate data
Peter Rousseeuw, LARS-IASC School, May 2019

General references
Hampel, F.R., Ronchetti, E.M., Rousseeuw, P.J., Stahel, W.A. Robust Statistics: the Approach based on Influence Functions. Wiley Series in Probability and Mathematical Statistics. John Wiley and Sons, New York, 1986.
Rousseeuw, P.J., Leroy, A. Robust Regression and Outlier Detection. Wiley Series in Probability and Mathematical Statistics. John Wiley and Sons, New York, 1987.
Maronna, R.A., Martin, R.D., Yohai, V.J. Robust Statistics: Theory and Methods. Wiley Series in Probability and Statistics. John Wiley and Sons, Chichester, 2006.
Hubert, M., Rousseeuw, P.J., Van Aelst, S. (2008), High-breakdown robust multivariate methods, Statistical Science, 23, 92–119.
wis.kuleuven.be/stat/robust

Outline of the course
General notions of robustness
Robustness for univariate data
Multivariate location and scatter
Linear regression
Principal component analysis
Advanced topics

General notions of robustness: Outline
1 Introduction: outliers and their effect on classical estimators
2 Measures of robustness: breakdown value, sensitivity curve, influence function, gross-error sensitivity, maxbias curve.

General notions of robustness: Introduction
What is robust statistics? Real data often contain outliers.