R: a Free Software Project in Statistical Computing
Achim Zeileis
http://statmath.wu-wien.ac.at/~zeileis/

Overview

• A short introduction to some letters of interest
  – R, S, Z
• Statistics and computing
  – applied vs. computational statistics vs. statistical computing
  – statistical software
• The R project
• Basic functionality
• Empirical application: diabetes in native populations
  – exploratory analysis
  – linear regression
  – tree models
• Packages
• Summary

Some letters

R is an interactive computational environment for data analysis, inference and visualization.

S is a language for data analysis and graphics, implemented in the commercial software S-PLUS and the open-source software R.

Z (aka Achim Zeileis) is a statistician at WU Wien spending a considerable share of his time using and developing R:
• for research,
• for applied data analysis,
• for course administration,
• for Web page generation, CD covers, mp3 administration, ... (but that’s another story).

Some letters: R

• R is an interactive computational environment for data analysis, inference and visualization.
• Developed for the Unix, Windows and Macintosh families of operating systems by an international team.
• Released under the GPL (General Public License), similar to the open-source operating system Linux.
• Highly extensible through user-defined functions and a fast-growing list of add-on packages.
• Based on the S language but with a new underlying implementation.

Some letters: S

• S is a language for data analysis and graphics developed by John Chambers and co-workers at Bell Labs (of AT&T, now Lucent Technologies).
• Exclusively licensed (and eventually sold) to Insightful Corp. as the basis for the commercial statistics system S-PLUS.
• Award-winning language which “has forever altered the way how people analyze, visualize and manipulate data.” (ACM Software System Award 1998 to John Chambers).
Some letters: Z

Short bio:

1996–2000   University studies in Statistics, Universität Dortmund, Germany
2000        Dipl.-Stat. (∼ M.Sc.) in Statistics
2000–2003   Research Assistant, SFB “Adaptive Information Systems and Modelling in Economics and Management Science”, Wirtschaftsuniversität Wien, Austria
2003        Dr. rer. nat. (∼ Ph.D.) in Statistics
since 2003  Assistant Professor, Department of Statistics & Mathematics, Wirtschaftsuniversität Wien, Austria

Some letters: Z

Interests as a student:
• Year 1: some interest in exploratory data analysis – tried to learn SAS – failed miserably – decided software is not for me.
• Year 2–3: learned a lot about theoretical concepts in statistics – did applied work only if necessary – used various commercial software packages: SPSS, Minitab, S-PLUS, GLIM, some SAS again, ...
• Year 4: needed software for seminars and thesis: implement concepts, run simulations, apply to real data – discovered open-source software R – switched from Windows to Linux, from Word to LaTeX, from everything else to R – never looked back.

Some letters: Z

Interests as a researcher:
• Statistical computing,
• Applied econometrics (in particular: testing, monitoring and dating of structural changes),
• Statistical learning (in particular: tree-based models),
• Visualization & significance testing.

Statistics and computing

A large part of statistics involves the use of computers and software. Different parts of this spectrum, with varying emphasis, are the following.

Applied statistics:
• task: understand structures in data and the underlying mechanisms.
• required: software that provides appropriate statistical techniques for application to real data.
• tools: preprocessing, inference, visualization, reporting.
• extensibility: modify existing tools, customize analyses, define workflows, automate.

Statistics and computing

Computational statistics:
• task: solve computing-intensive statistical problems.
• examples: difficult optimization problems, Markov chain Monte Carlo algorithms.
• required: efficient implementation/programming language.

Statistics and computing

Statistical computing:
• task: turn theoretical concepts into software.
• examples: implement a new statistical model so that it can be easily applied to new data sets.
• required: flexible software system with programming language.
• tools: object orientation, compiled code, re-usable components.

Of course, these areas are not well separated but overlapping!

Software is always the bridge between theoretical concepts and statistical practice.

Statistics and computing: Software

There is a broad range of statistical software packages; some of the most popular are:

Excel: most popular spreadsheet, only very basic statistical functionality, some programming possible in Visual Basic.
SPSS: “statistical” spreadsheet, standard statistical tool box (some emphasis on social sciences), programming possible in SPSS language.
SAS: comprehensive package with numerous interfaces, some emphasis on business solutions, programming in SAS macro language.
S-PLUS: built on top of S, adds graphical user interface (GUI), spreadsheet, various extensions.
R: command line interface (CLI), highly extensible, only specialized GUIs available.

[Screenshots: Excel, SPSS, SAS, S-PLUS, R]

Statistics and computing: Software

                     GUI          CLI
learn                easy         harder
flexible             not really   very
repeat tasks         tiring       easy
reproduce results    hard         easy

GUIs are very popular with practitioners, but trained statisticians will usually have to do at least some programming at a certain point.

Reproducibility is crucial in scientific work, automation in many corporate applications. Hence, all large packages offer some “command language”.

S/R offers the most complete and modern programming language.
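The "repeat tasks" and "reproduce results" rows of the comparison above can be illustrated with a minimal R sketch. The helper name `describe` and the simulated data are assumptions made only for this illustration; they do not come from the slides.

```r
# Hypothetical helper: summarise one numeric variable the same way every time.
describe <- function(x) {
  c(n = length(x), mean = mean(x), sd = sd(x))
}

# Fix the random seed so the simulated data are identical on every run.
set.seed(1)
samples <- list(
  a = rnorm(50, mean = 0),
  b = rnorm(50, mean = 1),
  c = rnorm(50, mean = 2)
)

# Repeat the same task for every data set with a single call.
# The result is a matrix with one column per data set.
results <- sapply(samples, describe)
print(round(results, 2))
```

Because the whole analysis is a script, rerunning it with the same seed should give the same table, and adding a fourth data set is a one-line change rather than another round of menu clicks.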
The R project

History of S:

1976: John Chambers and co-workers at Bell Labs begin work on a project that will become S (S1).
1981: Licenses for a portable Unix version of S outside Bell Labs (S2).
1988: Statistical software package S-PLUS based on S.
1992: Object orientation and statistical modeling toolbox included (S3).
1993: Exclusively licensed to MathSoft (now Insightful).
1998: New object orientation model introduced (S4).
2004: Sold to Insightful.

The R project

History of R:

1991: Ross Ihaka and Robert Gentleman begin work on a project that will ultimately become R.
1993: First binary copies of R on Statlib.
1995: R release of sources under the GPL.
1997: R core group is formed.
1998: Comprehensive R Archive Network (CRAN).
1999: DSC meeting in Vienna, first R core meeting.
2000: R 1.0.0 is released.
2001: R Newsletter launched.
2002: R Foundation established.
2004: First useR! conference in Vienna.
2007: R-forge server launched.

The R project

Home of the R project is

http://www.R-project.org/

where manuals, FAQs, links, and much other information are available.

The R software (current version 2.4.0) can be obtained from

http://CRAN.R-project.org/

in source and binary form along with many extension packages.

The R project

Free software: Open-source software like R does not cost money (free as in beer). But it takes time to learn, and as time is money, this talk is mostly about free software in the sense of free as in speech.

R is open source: everybody can read the source code, hence you need not rely on documentation to infer what the software really does. More importantly, everybody can use it, making research reproducible.

No owner? R is not in the public domain; you are given a license (GPL) to run the software.
The R project

The base R system is maintained by the “R Development Core Team” with members from New Zealand, Europe and North America:

Douglas Bates, John Chambers, Peter Dalgaard, Robert Gentleman, Kurt Hornik, Stefano Iacus, Ross Ihaka, Friedrich Leisch, Thomas Lumley, Martin Maechler, Duncan Murdoch, Paul Murrell, Martyn Plummer, Brian Ripley, Duncan Temple Lang, Luke Tierney, and Simon Urbanek.

But R would not be what it is without the support of a very large and active user community around the world, both in academia and in industry, who have contributed by donating code, bug fixes, documentation, packages, and discussion on the mailing lists.

The R project

The R user community communicates by means of mailing lists (R-help, R-devel, R-bugs, SIGs) where questions can be asked, problems discussed, bugs reported, solutions suggested.

Using the R language interactively allows for a smooth transition from using R to developing in R.

Bundles of new functions, manual pages, data sets, examples, demos, documentation can be effectively shared in the R community by means of CRAN packages.

Basic functionality

• an oversized pocket calculator.
• matrix-based language.
• full-featured programming language: (statistical) data structures, flow control, object orientation, interfaces to other languages, operating system interaction.
• statistical toolbox: exploratory data analysis, inference, (generalized) linear models, multivariate analysis, time-series analysis, ...
• production-quality graphics.

Basic functionality

R> 1 + 1
[1] 2
R> 2^3
[1] 8
R> x <- c(2, 7)
R> x
[1] 2 7
R> 1/x
[1] 0.5000000 0.1428571

Basic functionality

R> y <- matrix(c(1, 2, 3, 4), ncol = 2)
R> y
     [,1] [,2]
[1,]    1    3
[2,]    2    4
R> y %*% x
     [,1]
[1,]   23
[2,]   32
R> solve(y)
     [,1] [,2]
[1,]   -2  1.5
[2,]    1 -0.5

Basic functionality

Fundamental language design principle: Everything in R is an object.

Every object has a class (e.g., numeric, factor, function, ...) and methods for generic functions are