Statistics with Free and Open-Source Software

Free and Open-Source Software • the four essential freedoms according to the FSF: • to run the program as you wish, for any purpose • to study how the program works, and change it so it does Statistics with Free and your computing as you wish Open-Source Software • to redistribute copies so you can help your neighbor • to distribute copies of your modified versions to others • access to the source code is a precondition for this Wolfgang Viechtbauer • think of ‘free’ as in ‘free speech’, not as in ‘free beer’ Maastricht University http://www.wvbauer.com • maybe the better term is: ‘libre’ 1 2 General Purpose Statistical Software Popularity of Statistical Software • proprietary (the big ones): SPSS, SAS/JMP, • difficult to define/measure (job ads, articles, Stata, Statistica, Minitab, MATLAB, Excel, … books, blogs/posts, surveys, forum activity, …) • FOSS (a selection): R, Python (NumPy/SciPy, • maybe the most comprehensive comparison: statsmodels, pandas, …), PSPP, SOFA, Octave, http://r4stats.com/articles/popularity/ LibreOffice Calc, Julia, … • for programming languages in general: TIOBE Index, PYPL, GitHut, Language Popularity Index, RedMonk Rankings, IEEE Spectrum, … • note that users of certain software may be are heavily biased in their opinion 3 4 5 6 1 7 8 What is R? History of S and R • R is a system for data manipulation, statistical • … it began May 5, 1976 at: and numerical analysis, and graphical display • simply put: a statistical programming language • freely available under the GNU General Public License (GPL) → open-source • cross-platform (can be used under Windows, Unix/Linux, Mac OS, …) Bell Laboratories, Murray Hill, New Jersey 9 10 History of S and R History of S and R • informal meetings to • "the system" → "S" (the S language) (also play discuss development on name of programming languages such as C) of a new system for • first UNIX version of S in 1979 (version 2) statistical computing • distributed outside Bell Labs in 1980 • first implementation • source code released in 1981, then licensed in made by Rick Becker 1984 for educational and commercial purposes and John Chambers (and a few others) • called "the system" sketch of the system design made on the first meeting 11 12 2 History of S and R History of S and R • history/development of S can be traced via a number • S-PLUS, a commercial implementation of S, released of influential books: in 1988 by Statistical Sciences, Inc. (now TIBCO) • Becker & Chambers (1984). S: An Interactive Environment • Ross Ihaka and Robert Gentleman start developing a for Data Analysis and Graphics. statistical programming language "not unlike S" at • Becker & Chambers (1985). Extending the S System. the University of Auckland (New Zealand) in the • Becker, Chambers, & Wilks (1988): The New S Language: A 1990s Programming Environment for Data Analysis and Graphics. • Chambers & Hastie (1991). Statistical Models in S. • Chambers (1998). Programming with Data: A Guide to the S Language. 13 14 History of S and R History of S and R • some R milestones: • new website and logo in 2015: • first binary of R released in 1993 • Ihaka, R., & Gentleman, R. (1996). R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics, 5(3), 299-314. the old one the new one • source code released in 1997 (CRAN is started) • R Core group is formed in 1997 with 9 members (now 20) • R Consortium formed in June 2015 • version 1.0.0 (2000), version 2.0.0 (2004) • current version: R 3.2.2 (released August 2015) • first useR! conference in May 2004 in Vienna, Austria • (3.2.3 about to be released) • version 3.0.0 released April 2013 15 16 2.8 1.4 1.5 1.7 1.8 1.9 2.0 2.1 2.2 2.3 2.4 2.5 2.6 2.7 2.9 2.10 2.15 3.1 3.2 3.2.2 1.3 Other Related Developments 7500 7000 • Bioconductor started 2001 6500 6000 • Revolution Analytics founded in 2007, acquired by 5500 5000 Microsoft in 2015 4500 • RStudio founded in 2008 4000 3500 • New York Times article about R in January 2009 3000 2500 • “big data” (Google, Oracle, IBM, Intel, Microsoft, …) Number of CRAN Packages CRAN of Number 2000 1500 • “data science” (‘hacking’ skills core component) 1000 500 • open science, reproducible research 0 20 12 18 16 31 12 12 18 17 26 22 07 07 21 17 12 25 05 21 16 12 16 08 - - - - - - - - - - - - - - - - - - - - - - - 10 10 06 12 05 12 04 11 03 04 10 08 06 12 06 12 06 05 11 06 09 11 05 - - - - - - - - - - - - - - - - - - - - - - - 2008 2004 2005 2005 2006 2006 2007 2007 2008 2009 2009 2011 2014 2015 2015 2001 2001 2002 2003 2003 2004 2012 2013 17 18 3 Why is it called R? Everybody Loves Videos! • Ross Ihaka and Robert Gentleman • pun/play on the name of the S language • like computer scientists, statisticians are geeks • Data Scientist: The Sexiest Job of the 21st Century 19 20 Why Use R? • IT’S F&CKING FREE (as in beer), YOU DUMMY! • and it is free (as in free speech) & open source • and its capabilities are at least as good as that of proprietary software (often better) • most comprehensive coverage of methods • huge/active/enthusiastic user community • the ‘lingua franca’ of statistic(ian)s • forces you to adopt a scripting approach • cross-platform 21 22 Tooth Growth Data Edgar Anderson's Iris Data • sample: 60 guinea pigs • sepal and petal length and width for 50 • outcome: length of odontoblasts (teeth) samples for 3 different iris species • treatment: Vitamin C supplementation (0.5, 1, or 2 mg/day delivered either via orange juice petal or a solution with ascorbic acid) sepal Iris Setosa Iris Versicolor Iris Virginica http://en.wikipedia.org/wiki/Iris_flower_data_set 23 24 4 Puromycin Data Open Science / Reproducible Research • Puromycin is an antibiotic that is a protein synthesis inhibitor • study examined the velocity of an enzymatic reaction as a function of substrate concentration with and without the enzyme treated with Puromycin https://en.wikipedia.org/wiki/Puromycin 25 26 rmarkdown & knitr My Recommendations • to create dynamic and fully reproducible • learn R if you plan on staying in research documents/presentations/reports with R • knowing SPSS, Stata, and SAS is also useful • in essence: you write a single document that • learn Python is you want to be ‘data scientist’ includes the text and analysis code that is • keep your eyes on Julia then rendered into a desired output format • embrace learning new • more details: tools and statistics • http://rmarkdown.rstudio.com/ • http://yihui.name/knitr/ • Gandrud (2013): Reproducible Research with R & RStudio (website; source code) 27 28 Some Interesting Developments Some R Resources • Calling R from SPSS • R and CRAN (manuals, contributed documentation) • R Integration in JMP • use RStudio (unless you already use vim, Emacs, • R Interface in SAS/IML Studio Notepad++, Sublime Text, Atom, WinEdt, …) • SAS University Edition • many books / hard to give recommendations (search your favorite bookseller and look at reviews) • Python Interfaces for R • Field et al. (2012): Discovering Statistics Using R • Dalgaard (2008): Introductory Statistics with R • Muenchen (2011): R for SAS and SPSS Users • Springer Use R! Series, Chapman & Hall/CRC The R Series • courses (lots: Coursera, DataCamp, Code School, Udemy, Udacity, statistics.com, my own, …) 29 30 5 Free Software and Beyond … • free software is a movement • open access and sharing of knowledge as a U Haz Questionz? general philosophy • Free Software Foundation • Foundation for Open Access Statistics • Open Access Journals • Open Educational Resources • Wikipedia • OpenCola • … 31 32 6 .

Statistics with Free and Open-Source Software

An Evaluation of Statistical Software for Research and Instruction

The R and Omegahat Projects in Statistical Computing

Overview-Of-Statistical-Analytics-And

A Practical Guide for Improving Transparency and Reproducibility in Neuroimaging Research Krzysztof J

Reproducible Reports with Knitr and R Markdown

Hadley Wickham, the Man Who Revolutionized R Hadley Wickham, the Man Who Revolutionized R · 51,321 Views · More Stats

Annual Report of the Center for Statistical Research and Methodology Research and Methodology Directorate Fiscal Year 2017

Fibrinogen Levels Are Associated with Lymph Node Involvement And

Towards a Fully Automated Extraction and Interpretation of Tabular Data Using Machine Learning

R Markdown Cheat Sheet I

R Generation [1] 25

HGC: a Fast Hierarchical Graph-Based Clustering Method