Review of Robustbase Software for R
Robert Finger (ETH Zürich, Switzerland)

Postprint. This is the accepted version of a paper published in the Journal of Applied Econometrics. It has been peer-reviewed but does not include the final publisher proof-corrections or journal pagination. When citing this work, cite the original published paper: Finger, R. (2010). Review of ‘robustbase’ software for R. Journal of Applied Econometrics 25(7): 1205-1210. https://doi.org/10.1002/jae.1194. Permanent link to this version: https://doi.org/10.3929/ethz-b-000159379. Rights / License: In Copyright - Non-Commercial Use Permitted.

1. Introduction

Robust statistical methods are powerful tools to increase the reliability and accuracy of statistical modeling and data analysis because they still work well when some observations deviate from (or violate) the assumed model. For many statistical models (e.g. location, scale and regression), several robust methods are available as alternatives to classical statistical methods (Maronna et al., 2006). However, robust methods have been poorly represented in statistical software (e.g. Stromberg, 2004). The R package ‘robustbase’ (as well as other related packages) helps to fill the gap between theoretically developed robust methods and their availability in standard statistical software, and makes both basic and advanced methods available to a broad range of researchers.
‘robustbase’ is an add-on package for the open-source statistical software R, a language and environment for statistical computing. Brief introductions to R are given, for instance, by Cribari-Neto and Zarkos (1999) and Racine and Hyndman (2002). The ‘robustbase’ package is written by Valentin Todorov, Andreas Ruckstuhl, Matias Salibian-Barrera, Tobias Verbeke and Martin Maechler [1]. It was developed to make “basic robust statistics” available within R in a single package and to provide tools for analyzing data with robust methods. Furthermore, ‘robustbase’ implements new robust methods that have not been available so far in R (Maechler and Ruckstuhl, 2006). A further comprehensive add-on package on robust statistics is the ‘robust’ package, a version of the robust library of S-PLUS made available in R [2]. This package covers similar topics and partially overlaps with ‘robustbase’. Coordination effort is made to develop the ‘robust’ package as a supplement that provides convenient routines and comparisons with classical estimators. In contrast, ‘robustbase’ should provide the more advanced statistician with a considerable choice of methodology (Maechler, 2008). A complement to ‘robustbase’ is the R package ‘robustX’ (written by Werner Stahel and Martin Maechler), which is especially designed to collect experimental methods as well as methods that are beyond the scope of basic robust statistics. Moreover, additional robust methods are available in various other R packages (see Maechler, 2008, for a brief overview).

[1] The package developers acknowledge that the original code has been written by many authors, notably by Peter Rousseeuw and Christophe Croux.
[2] The authors of the ‘robust’ package are: Jiahui Wang, Ruben Zamar, Alfio Marazzi, Victor Yohai, Matias Salibian-Barrera, Ricardo Maronna, Eric Zivot, David Rocke, Doug Martin, Martin Maechler, and Kjell Konis.

2. Installation and Setup

Prior to the installation of the ‘robustbase’ package, R has to be installed; it is available for free from www.r-project.org. The package source, the reference manual and installation files for the ‘robustbase’ package are directly available at http://cran.r-project.org/web/packages/robustbase/index.html. The package is installed most easily using the drop-down menu of the R workstation environment with ‘Packages…Install Package(s)’. It then needs to be loaded in the R workspace, either using the drop-down menu ‘Packages…Load Package’ or by typing library(robustbase). The package requires at least version 2.5.1 of R and the default R packages stats, graphics and methods. The installation of the MASS package is also suggested. I used version 2.9.0 of R and version 0.4-5 of ‘robustbase’.

3. Using R and ‘robustbase’

R and, therefore, its packages such as ‘robustbase’ are object-oriented and provide command/code-based interfaces (Harrison, 2008). The main advantages of using R for robust statistics are that it includes the latest, cutting-edge methods, that modifications of existing methods are easy to develop, and that it is available for free and runs under all commonly used operating systems (Racine and Hyndman, 2002). Introductory notes and examples on the use of R for statistical analysis and graphics are given in several sources such as textbooks [3] and articles (e.g. Dalgaard, 2008, Zuur et al., 2009, Racine and Hyndman, 2002), or at the R-project homepage (www.r-project.org): either in ‘An Introduction to R’ or in the R-Wiki (http://wiki.r-project.org/).

[3] A list of books that are related to R and may be useful to the R user is given at www.r-project.org/doc/bib/R-books.html.
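The menu-based installation described in Section 2 can also be carried out from the R prompt. The following is only a minimal sketch: it assumes an internet connection, and the CRAN mirror URL and the object names r_version and pkg_version are illustrative choices made here, not taken from the package documentation.

```r
# Install 'robustbase' from CRAN only if it is not yet available locally.
# Assumption: network access; the mirror URL is an illustrative choice.
if (!requireNamespace("robustbase", quietly = TRUE)) {
  install.packages("robustbase", repos = "https://cloud.r-project.org")
}

# Attach the package to the current R session.
library(robustbase)

# Report the versions in use, e.g. to check the minimum R version
# (2.5.1) mentioned in the text.
r_version   <- getRversion()
pkg_version <- packageVersion("robustbase")
print(r_version)
print(pkg_version)
```

Checking requireNamespace() first avoids reinstalling the package on every run of the script.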
The use of ‘robustbase’ can be demonstrated with a simple example of a robust regression of y on x from dataset d using the MM-estimator. After loading the ‘robustbase’ package with library(robustbase) and typing objectname <- lmrob(y ~ x, data = d), the standard regression output as well as details on the robust estimation are provided by typing summary(objectname). A complete list of the regression output as well as the tuning parameters and algorithms employed in the MM-estimation is shown with str(objectname). Very helpful (robust) regression diagnostic plots (e.g. Tukey-Anscombe plot, normal Q-Q plot) are generated with plot(objectname).

The manual of ‘robustbase’ is comprehensive and detailed: it provides syntax, options and arguments for the commands as well as references for the employed methods. It also provides both a thematic and an alphabetic index of the data sets and functions included in ‘robustbase’. The manual page for a specific command can be retrieved with ?command. To directly open the R-Help and search for keywords, type help.start(). Furthermore, the manual gives examples for most commands, which can be executed by typing example(command). The package includes more than 30 datasets, including some of the most prominent example datasets for robust statistics used in the textbooks of Hampel et al. (1986), Maronna et al. (2006) and Rousseeuw and Leroy (1987) [4].

If the user is familiar with the R language, the ‘robustbase’ package is easily applicable. Datasets from other popular statistical packages can be imported into R with the package ‘foreign’. To improve the user-friendliness of R, freely available graphical user interfaces such as ESS (Emacs Speaks Statistics, http://ess.r-project.org), Tinn-R (www.sciviews.org/Tinn-R), and R-WinEdt (www.winedt.org) [5] can be used. Because the package and the manual are continuously work in progress, the user will encounter some incomplete descriptions or functions.
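The lmrob() workflow just described can be tried on one of the datasets shipped with the package. The sketch below is an illustration only: the choice of the starsCYG data (with its columns log.Te and log.light), the object names fit_mm and fit_ls, the fixed random seed and the comparison with lm() are assumptions added here and are not part of the original review.

```r
library(robustbase)

# Hertzsprung-Russell star data shipped with 'robustbase'
data(starsCYG)

# Fix the random seed, since the initial S-estimate relies on
# random subsampling (illustrative choice of seed).
set.seed(1)

# MM-estimation: log light intensity regressed on log surface temperature
fit_mm <- lmrob(log.light ~ log.Te, data = starsCYG)

# Classical least-squares fit, for comparison only
fit_ls <- lm(log.light ~ log.Te, data = starsCYG)

# Standard regression output plus details on the robust estimation
print(summary(fit_mm))

# Top-level components of the fitted object (control settings, weights, ...)
str(fit_mm, max.level = 1)

# Side-by-side comparison of the estimated coefficients
print(rbind(MM = coef(fit_mm), LS = coef(fit_ls)))

# Diagnostic plots; shown only in an interactive session
if (interactive()) plot(fit_mm)

# Help page and packaged examples (uncomment to run interactively):
# ?lmrob
# example(lmrob)
```

In this dataset a few giant stars act as leverage points, so the robust and the least-squares slope estimates can be expected to differ noticeably.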
However, the current, though preliminary, version of ‘robustbase’ is already recommendable for a wide range of applications of robust statistics.

[4] For instance, the introductory examples given in Rousseeuw and Leroy (1987) of telephone calls in Belgium, the Hertzsprung-Russell star data as well as brain and body weights of different animals are available in ‘robustbase’ under telef, starsCYG and Animals2, respectively.
[5] See www.sciviews.org/_rgui/ for a comprehensive overview of graphical user interfaces developed for R.

4. Functionality

As summarized in Table 1, ‘robustbase’ provides different robust regression techniques as well as robust univariate and multivariate methods. Apart from robust regression techniques for linear regression analysis, it also includes robust methods for nonlinear and generalized linear models. Linear regression analysis in ‘robustbase’ is focused on MM-estimation. It is the strongly recommended estimation technique because it is superior to other robust regression techniques due to its high breakdown point and its high efficiency. Unfortunately, the functions for M- and S-estimation do not have the usual R syntax and outputs for regression analysis, hampering cross-checking and comparison between different methods. Moreover, the regression outputs of the ‘robustbase’ package (and other R packages that include robust regression) do not yet provide information on the robust goodness of fit. The univariate and multivariate methods included in ‘robustbase’ are mainly complements and generalizations of methods that are available in the default R packages. The included robust measures of location, scale and skewness reflect recent developments and are characterized by optimal tradeoffs between efficiency and robustness. However, the inclusion of additional robust univariate and multivariate methods, for instance from other packages, might complement the package.
Time and effort searching for appropriate methods would be reduced and users of ‘robustbase’ would have the possibility to easily compare different methods. In order to use