Research Collection

Review Article

Review of ‘robustbase’ software for R

Author(s): Finger, Robert

Publication Date: 2010-11

Permanent Link: https://doi.org/10.3929/ethz-b-000159379

Originally published in: Journal of Applied Econometrics 25(7), http://doi.org/10.1002/jae.1194

Rights / License: In Copyright - Non-Commercial Use Permitted


ETH Library Postprint

This is the accepted version of a paper published in the Journal of Applied Econometrics. This paper has been peer-reviewed but does not include the final publisher proof corrections or journal pagination.

Citation for the original published paper: Finger, R. (2010). Review of ‘robustbase’ software for R. Journal of Applied Econometrics 25(7): 1205-1210

https://doi.org/10.1002/jae.1194

Access to the published version may require subscription.

N.B. When citing this work, cite the original published paper

Review of ‘robustbase’ software for R

Robert Finger (ETH Zürich, Switzerland)

1. Introduction

Robust statistical methods are powerful tools to increase the reliability and accuracy of statistical modeling and data analysis because they still work well when some observations deviate from (or violate) the assumed model. For many statistical models (e.g. location, scale and regression), several robust methods are available as alternatives to classical statistical methods (Maronna et al., 2006). However, robust methods have been poorly represented in statistical software (e.g. Stromberg, 2004). The R package ‘robustbase’ (as well as other related packages) helps to fill the gap between theoretically developed robust methods and their availability in standard statistical software, and makes both basic and advanced methods available to a broad range of researchers. ‘robustbase’ is an add-on package for the open-source statistical software R, a language and environment for statistical computing. Brief introductions to R are given, for instance, by Cribari-Neto and Zarkos (1999) and Racine and Hyndman (2002).

The ‘robustbase’ package is written by Valentin Todorov, Andreas Ruckstuhl, Matias Salibian-Barrera, Tobias Verbeke and Martin Maechler¹. It was developed to make “basic robust statistics” within R available in a single package and to provide tools for analyzing data with robust methods. Furthermore, ‘robustbase’ implements new robust methods that had not previously been available in R (Maechler and Ruckstuhl, 2006). A further comprehensive add-on package on robust statistics is the ‘robust’ package, a version of the robust library of S-PLUS made available in R². This package covers similar topics and partially overlaps with ‘robustbase’. Coordination efforts are being made to develop the ‘robust’ package as a supplement that provides convenient routines and comparisons with classical estimators. In contrast, ‘robustbase’ should provide the more advanced statistician with a considerable choice of methodology (Maechler, 2008). A complement to ‘robustbase’ is the R package ‘robustX’ (written by Werner Stahel and Martin Maechler), which is especially designed to collect experimental methods as well as methods that are beyond the scope of basic robust statistics. Moreover, additional robust methods are available in various other R packages (see Maechler, 2008, for a brief overview).

¹ The package developers acknowledge that the original code has been written by many authors, notably by Peter Rousseeuw and Christophe Croux.
² The authors of the ‘robust’ package are: Jiahui Wang, Ruben Zamar, Alfio Marazzi, Victor Yohai, Matias Salibian-Barrera, Ricardo Maronna, Eric Zivot, David Rocke, Doug Martin, Martin Maechler, and Kjell Konis.

2. Installation and Setup

Prior to the installation of the ‘robustbase’ package, R has to be installed; it is available for free from www.r-project.org. The package source, the reference manual and installation files for the ‘robustbase’ package are directly available at http://cran.r-project.org/web/packages/robustbase/index.html. The package is installed most easily using the drop-down menu of the R workstation environment with ‘Packages…Install Package(s)’. It then needs to be loaded in the R workspace, either using the drop-down menu ‘Packages…Load Package’ or by typing library(robustbase). It requires at least version 2.5.1 of R and the default R packages stats, graphics and methods. The installation of the MASS package is also suggested. I used version 2.9.0 of R and version 0.4-5 of ‘robustbase’.
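For readers who prefer the console to the menus, the installation and loading steps described above can be sketched as follows. The CRAN mirror URL is an assumption chosen for illustration (any mirror should work), and the variable names are hypothetical.

```r
# Install 'robustbase' from CRAN only if it is not already available.
# The mirror URL is an illustrative assumption, not prescribed by the package.
if (!requireNamespace("robustbase", quietly = TRUE)) {
  install.packages("robustbase", repos = "https://cloud.r-project.org")
}

# Attach the package to the current session.
library(robustbase)

# Record the versions in use, which helps when reproducing results.
r_version   <- R.version.string
pkg_version <- packageVersion("robustbase")
cat(r_version, "\nrobustbase", as.character(pkg_version), "\n")
```

Results obtained with a current release may differ in detail from those of the versions used in this review.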

3. Using R and ‘robustbase’

R, and therefore its packages like ‘robustbase’, are object-oriented and provide command/code-based interfaces (Harrison, 2008). The main advantages of using R for robust statistics are that it includes the latest, cutting-edge methods, that it is easy to develop modifications of existing methods, and that it is available for free and runs under all commonly used operating systems (Racine and Hyndman, 2002). Introductory notes and examples on the use of R for statistical analysis and graphics are given in several sources such as textbooks³ and articles (e.g. Dalgaard, 2008, Zuur et al., 2009, Racine and Hyndman, 2002), or at the R-project homepage (www.r-project.org): either in ‘An Introduction to R’ or in the R-Wiki (http://wiki.r-project.org/).

The use of ‘robustbase’ can be demonstrated with a simple example for a regression of y on x from dataset d using the MM-estimator. After loading the ‘robustbase’ package with library(robustbase) and typing objectname<-lmrob(y~x, data=d), the standard regression output as well as details on the robust estimation are provided by typing summary(objectname). A complete list of the regression output as well as the tuning parameters and algorithms employed in the MM-estimation is shown with str(objectname). Very helpful (robust) regression diagnostic plots (e.g. Tukey-Anscombe plot, normal Q-Q plot) are generated with plot(objectname).

The manual of ‘robustbase’ is comprehensive and detailed, providing syntax, options and arguments for the commands as well as references for the employed methods. It also provides both a thematic and an alphabetic index of the data sets and functions included in ‘robustbase’. The manual for a specific command can be retrieved with ?command. To directly open the R-Help and search for keywords, type help.start(). Furthermore, the manual gives examples for most commands, which can be executed by typing example(command).

The package includes more than 30 datasets, including some of the most prominent example datasets for robust statistics used in the textbooks of Hampel et al. (1986), Maronna et al. (2006) and Rousseeuw and Leroy (1987)⁴. If the user is familiar with the R language, the ‘robustbase’ package is easily applicable. Datasets from other popular statistical packages can be imported into R with the package ‘foreign’.

³ A list of books that are related to R and may be useful to the R user is given at www.r-project.org/doc/bib/R-books.html.
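The lmrob workflow described above can be tried on one of the bundled datasets. The following is a minimal sketch, assuming a recent ‘robustbase’ release; the choice of the starsCYG data, the seed, the object names and the 0.1 weight threshold are illustrative assumptions rather than part of the review.

```r
library(robustbase)

# Hertzsprung-Russell star data shipped with 'robustbase'
# (columns log.Te and log.light).
data(starsCYG)

set.seed(42)  # lmrob draws random subsamples for its initial estimate

# Classical least squares fit, kept for comparison only.
fit_ls <- lm(log.light ~ log.Te, data = starsCYG)

# Robust MM-estimate with default settings.
fit_mm <- lmrob(log.light ~ log.Te, data = starsCYG)

summary(fit_mm)             # regression output plus robustness details
str(fit_mm, max.level = 1)  # top-level components of the fitted object

# Side-by-side coefficients of the classical and the robust fit.
print(rbind(LS = coef(fit_ls), MM = coef(fit_mm)))

# Robustness weights near zero mark observations that were downweighted.
# The 0.1 cut-off is an arbitrary illustrative threshold.
rw      <- weights(fit_mm, type = "robustness")
flagged <- which(rw < 0.1)
print(flagged)

# Diagnostic plots; shown only in interactive sessions.
if (interactive()) plot(fit_mm)
```

In this dataset a few giant stars pull the least squares slope below zero, whereas the MM-estimate follows the main sequence, which is why the two rows of coefficients differ in sign.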
To improve the user-friendliness of R, freely available graphical user interfaces such as ESS (Emacs Speaks Statistics, http://ess.r-project.org), Tinn-R (www.sciviews.org/Tinn-R), and R-WinEdt (www.winedt.org)⁵ can be used. Because the package and the manual are still a work in progress, the user will encounter some incomplete descriptions or functions. However, the current version of ‘robustbase’, though preliminary, can already be recommended for a wide range of applications of robust statistics.

⁴ For instance, the introductory examples given in Rousseeuw and Leroy (1987) of telephone calls in Belgium, the Hertzsprung-Russell star data, and brain and body weights of different animals are available in ‘robustbase’ under telef, starsCYG and Animals2, respectively.
⁵ See www.sciviews.org/_rgui/ for a comprehensive overview of graphical user interfaces developed for R.

4. Functionality

As summarized in Table 1, ‘robustbase’ provides different robust regression techniques as well as robust univariate and multivariate methods. Apart from robust regression techniques for linear regression analysis, it also includes robust methods for nonlinear and generalized linear models. Linear regression analysis in ‘robustbase’ is focused on MM-estimation. It is the strongly recommended estimation technique because its combination of a high breakdown point and high efficiency makes it superior to other robust regression techniques. Unfortunately, the functions for M- and S-estimation do not have the usual R syntax and outputs for regression analysis, which hampers cross-checking and comparison between different methods. Moreover, the regression outputs of the ‘robustbase’ package (and of other R packages that include robust regression) do not yet provide information on the robust goodness of fit.

The univariate and multivariate methods included in ‘robustbase’ are mainly complements and generalizations of methods that are available in the default R packages. The included robust measures of location, scale and skewness reflect recent developments and are characterized by optimal tradeoffs between efficiency and robustness. However, the inclusion of additional robust univariate and multivariate methods, for instance from other packages, might complement the package: less time and effort would be spent searching for appropriate methods, and users of ‘robustbase’ could easily compare different methods. To make robust methods available for the visualization of data as well, ‘robustbase’ includes a boxplot and some multivariate plots based on robust measures of location, scatter and skewness. Examples of robust graphics created with ‘robustbase’ are given in Figure 1.
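As a small illustration of the cross-checking mentioned above, the MM-estimate from lmrob can be set against an LTS fit from ltsReg, which accepts the usual formula interface. This is a sketch under stated assumptions: the starsCYG data, the seed and the object names are illustrative choices, and a recent ‘robustbase’ release is assumed.

```r
library(robustbase)
data(starsCYG)

set.seed(1)  # both estimators rely on random subsampling

# MM-estimate and LTS estimate of the same linear model.
fit_mm  <- lmrob(log.light ~ log.Te, data = starsCYG)
fit_lts <- ltsReg(log.light ~ log.Te, data = starsCYG)

# Collect the coefficients in one table; unname() avoids relying on
# the two classes labelling their coefficients identically.
coefs <- rbind(MM = unname(coef(fit_mm)), LTS = unname(coef(fit_lts)))
colnames(coefs) <- c("intercept", "slope")
print(round(coefs, 3))
```

Broad agreement between the two robust fits is reassuring; a large discrepancy would be a reason to inspect the data and the tuning parameters more closely.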

Table 1. Summary of ‘robustbase’ commands.

Functionality | Command | Comment

Robust Regression: Linear Models⁶
MM-estimation | lmrob | The strongly recommended (linear) robust regression technique in this package.
LTS regression / M-estimation / S-estimation | ltsReg / lmrob..M..fit / lmrob.S | These methods are supposed to support the lmrob function and to serve as cross-checking tools, but are not intended to be used on their own.

Robust Regression: Other Methods and Options
Robust nonlinear regression | nlrob | M-estimation with iterated reweighted least squares.
Robust fit of generalized linear models | glmrob | The current version supports only robust estimation of binomial and Poisson models.
Robust regression model comparison | anova(lmrob or glmrob object) | Model comparison based on a robust Wald-type or deviance-type test for lmrob or glmrob objects.
Robust regression diagnostic plots | plot(lmrob or ltsReg object) | Available for ltsReg and lmrob objects.

Robust Univariate Methods
Adjusted boxplot for skewed distributions | adjbox | A robust measure of skewness (medcouple) is used to identify outliers.
Robust measure of skewness (medcouple) | mc |
Generalized Huber M-estimator of location | huberM | Robust M-estimate of location with MAD scale.
Robust scale estimators (Qn, Sn, Tau estimate) | Qn, Sn, scaleTau2 | Robust scale estimates that are more efficient than the MAD.

Robust Multivariate Methods
Multivariate outlyingness | adjOutlyingness | A generalized Stahel-Donoho measure of multivariate outlyingness taking skewness into account with the medcouple.
Robust location and covariance estimation | covMcd | Robust multivariate location and scale estimate using the Minimum Covariance Determinant (MCD).
Robust multivariate plots | plot(covMcd object) | Multivariate plots based on robust estimates of location and covariance.
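The univariate commands listed in Table 1 can be explored on a small contaminated sample. The sketch below uses made-up numbers (seven similar values and one gross outlier) purely for illustration; the data and object names are assumptions and do not come from the review.

```r
library(robustbase)

# Made-up sample: seven similar values plus one gross outlier.
x <- c(2.1, 2.4, 2.2, 2.5, 2.3, 2.6, 2.2, 25.0)

# Location: classical mean versus Huber M-estimate (MAD scale by default).
loc_classic <- mean(x)
loc_huber   <- huberM(x)$mu

# Scale: classical standard deviation versus robust alternatives.
scales <- c(sd   = sd(x),
            mad  = mad(x),
            Qn   = Qn(x),
            Sn   = Sn(x),
            tau2 = scaleTau2(x))

# Skewness: the medcouple, a robust measure bounded by -1 and 1.
skew_mc <- mc(x)

print(c(mean = loc_classic, huberM = loc_huber))
print(round(scales, 3))
print(skew_mc)
```

The single outlier inflates the mean and the standard deviation, while the robust estimates stay close to the bulk of the data.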

Figures 1a and 1b show a standard boxplot (using the function boxplot from the default R package ‘graphics’) and an adjusted (robust) boxplot, respectively, for a dataset of NOx concentrations that is available in ‘robustbase’ under NOxEmissions. The latter boxplot is based on the adjbox function of ‘robustbase’, which uses a robust measure of skewness to identify outliers⁷. In contrast to the standard boxplot, the adjusted version takes the (robust) skewness of the data into account and results in longer whiskers and fewer outliers. Light intensity and temperature of stars in the star cluster CYG OB1⁸ (for details see Rousseeuw and Leroy, 1987) are used to visualize 97.5% tolerance ellipses based on classical and robust multivariate measures of location and scatter in Figure 1c, using the plot option of a covMcd object. Using adjOutlyingness, multivariate outliers can be identified based on a generalized Stahel-Donoho measure of multivariate outlyingness and can be visualized in a simple scatter plot (Figure 1d). In the latter graph, outliers might be identified and visualized as follows. First, compute measures of outlyingness and identify outliers with object<-adjOutlyingness(dataset). Then, given a vector p of plotting symbols with one entry per observation, different values can be assigned to outliers and non-outliers, for instance with p <- ifelse(object$nonOut ==FALSE, p, 1), which keeps the existing symbol for outliers and sets the symbol to 1 for all other observations. Finally, the vector p is used as an indicator in the plot function to assign different plotting symbols to outliers: plot(dataset, pch=p).

⁶ In addition to the functions presented here, several commands to control the estimation and summarize the regression results are integrated in ‘robustbase’.

⁷ When the data are skewed, usually many points exceed the whiskers and are often erroneously declared as outliers (Hubert and Vandervieren, 2008). Thus, fewer outliers are indicated by the adjusted boxplot, which takes the skewness of the data into account.
⁸ This dataset is available in ‘robustbase’ under starsCYG.

Figure 1. Graphical representations of robust methods using ‘robustbase’.
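Graphics of the kind shown in Figure 1 can be approximated with the sketch below. It is not the code used for the published figure: the use of the exponentiated LNOx column as the NOx variable, the seeds, the plotting symbols and the panel titles are all assumptions made for illustration, and a recent ‘robustbase’ release is assumed.

```r
library(robustbase)
data(NOxEmissions)
data(starsCYG)

# Assumption: NOxEmissions stores the log concentration in column LNOx,
# so it is exponentiated here to obtain a skewed sample.
nox     <- exp(NOxEmissions$LNOx)
nox_skw <- mc(nox)  # medcouple, the skewness measure used by adjbox

# Robust location and scatter via the Minimum Covariance Determinant.
set.seed(7)
mcd <- covMcd(starsCYG)

# Adjusted outlyingness; nonOut is TRUE for observations not flagged.
set.seed(7)
out <- adjOutlyingness(starsCYG)

# Plotting symbols: 1 (open circle) for regular points, 16 for outliers.
p <- ifelse(out$nonOut, 1, 16)

# Draw the four panels only in interactive sessions.
if (interactive()) {
  op <- par(mfrow = c(2, 2))
  boxplot(nox, main = "(a) standard boxplot")
  adjbox(nox, main = "(b) adjusted boxplot")
  plot(mcd, which = "tolEllipsePlot", classic = TRUE)  # (c)
  plot(starsCYG, pch = p, main = "(d) adjusted outlyingness")
  par(op)
}
```

Because both covMcd and adjOutlyingness involve random subsampling or random directions, fixing the seed keeps the flagged observations reproducible between runs.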

The methods and particularly the graphical tools of ‘robustbase’ for univariate and multivariate analysis as well as for regression analysis are well-suited instruments to explore and analyze data from a robust statistics perspective. Additional robust methods that are interesting in the field of applied statistics and econometrics, but are not yet implemented in (or are beyond the scope of) ‘robustbase’, can be found, for instance, in the ‘robust’ package (e.g. the Stahel-Donoho estimator, further options and methods for linear and generalized linear regression models, tests for biases of MM- and LS-estimates against S-estimates), the ‘rrcov’ package⁹ (e.g. robust principal component analysis methods, robust linear and quadratic discriminant analysis), the ‘robustX’ package (e.g. the L1-median) or the ‘robust-ts’ package (robust time series analysis).

5. Conclusion

The package ‘robustbase’ for R provides a comprehensive set of methods for basic univariate and multivariate robust statistics as well as for robust regression analysis. Though parts of the program are still preliminary, it offers a wide range of methods to explore and analyze data from a robust statistics perspective. Based on the ‘robustbase’ package and other related packages, R provides some of the most advanced and latest methods in robust statistics, which makes R one of the leading statistical software tools with regard to robust statistical methods.

Acknowledgements

I would like to thank Matias Salibian-Barrera, Valentin Todorov, Andreas Ruckstuhl and the editor for helpful comments on earlier versions of this review. All mistakes are solely the author’s.

⁹ See Todorov and Filzmoser (2009) for descriptions of the ‘rrcov’ package and further developments and examples in robust multivariate analysis.

References

Cribari-Neto F, Zarkos SG. 1999. R: Yet another Econometric Programming Environment. Journal of Applied Econometrics 14: 319-329.
Dalgaard P. 2008. Introductory Statistics with R. Springer: Berlin.
Hampel FR, Ronchetti EM, Rousseeuw PJ, Stahel WA. 1986. Robust Statistics. Wiley & Sons: New York.
Harrison TD. 2008. Review of np Software for R. Journal of Applied Econometrics 23: 861-865.
Hubert M, Vandervieren E. 2008. An adjusted boxplot for skewed distributions. Computational Statistics & Data Analysis 52: 5186-5201.
Maechler M. 2008. CRAN Task View: Robust Statistical Methods. http://cran.r-project.org/web/views/Robust.html [07 April 2010].
Maechler M, Ruckstuhl A. 2006. Robust Statistics Collaborative Package Development: ‘robustbase’. Use R! - The R User Conference 2006, June 15–17 2006, Vienna, Austria.
Maronna RA, Martin RD, Yohai V. 2006. Robust Statistics: Theory and Methods. Wiley & Sons: New York.
Racine J, Hyndman R. 2002. Using R to teach econometrics. Journal of Applied Econometrics 17: 175-189.
Rousseeuw PJ, Leroy A. 1987. Robust Regression and Outlier Detection. Wiley & Sons: New York.
Stromberg A. 2004. Why Write Statistical Software? The Case of Robust Statistical Methods. Journal of Statistical Software 10(5).
Todorov V, Filzmoser P. 2009. An Object-Oriented Framework for Robust Multivariate Analysis. Journal of Statistical Software 32(3).
Zuur AF, Ieno EN, Meesters E. 2009. A Beginner's Guide to R. Springer: Berlin.