PROGRAMMING TOOLS: ADVENTURES WITH R
A guide to the popular, free statistics and visualization software that gives scientists control of their own data analysis.
BY SYLVIA TIPPMANN
ILLUSTRATION BY THE PROJECT TWINS
TOOLBOX | 1 JANUARY 2015 | VOL 517 | NATURE | 109 | © 2015 Macmillan Publishers Limited. All rights reserved

For years, geneticist Helene Royo used commercial software to analyse her work. She would extract DNA from the developing sperm cells of mice, send it for analysis and then fire up a package called GeneSpring to study the results. “As a scientist, I wanted to understand everything I was doing,” she says. “But this kind of analysis didn’t allow that: I just pressed buttons and got answers.” And as Royo’s studies comparing genetic activity on different chromosomes became more involved, she realized that the commercial tool could not keep up with her data-processing demands.

With the results of her first genomic sequencing experiments in hand at the start of a new postdoc, Royo had a choice: pass the sequences over to the experts or learn to analyse the data herself. She took the plunge, and began learning how to parse data in the free, open-source software package R. It helped that the centre she had joined, the Friedrich Miescher Institute for Biomedical Research in Basel, Switzerland, ran regular courses on the software. But she was also following a wider trend: for many academics seeking to wean themselves off commercial software, R is the data-analysis tool of choice.

Besides being free, R is popular partly because it presents different faces to different users. It is, first and foremost, a programming language, requiring input through a command line, which may seem forbidding to non-coders. But beginners can surf over the complexities and call up preset software packages, which come ready-made with commands for statistical analysis and data visualization. These packages create a welcoming middle ground between the comfort of commercial ‘black-box’ solutions and the expert world of code. “R made it very easy,” says Royo. “It did everything for me.”

That, indeed, is what R’s developers intended when they designed it in the 1990s. Ross Ihaka and Robert Gentleman, statisticians at the University of Auckland in New Zealand, had an interest in computing but lacked practical software for their needs. So they developed a programming language with which they could perform data analysis themselves. R got its name in part from its developers’ initials, although it was also a reference to the most widely used coding language at the time, S.

In the early days of the World Wide Web, R quickly attracted interest from scientists around the globe who needed statistical software and were willing to contribute ideas. Gentleman and Ihaka decided to make their source code accessible to everybody, and coding-literate scientists quickly developed packages of pre-programmed routines and commands for particular fields. “I can write software that would be good for somebody doing astronomy,” says Gentleman, “but it’s a lot better if someone doing astronomy writes software for other people doing astronomy.”

MATHEMATICAL SOLUTIONS

Karline Soetaert, an oceanographer at the Royal Netherlands Institute for Sea Research in Yerseke, took up that idea when, in 2008, she wanted to check the health of zooplankton in the estuary of the river Scheldt. Soetaert wanted to calculate how fast zooplankton were dying, using measurements along the river, but R was not equipped for that. To tackle the problem, she worked with two ecologists to develop deSolve, the first package written in R to solve differential equations. “Other software can do that, but it is expensive and closed source,” she notes. Now deSolve is used by epidemiologists modelling infectious diseases, geneticists working on gene-regulatory networks and drug developers working on pharmacokinetics (how compounds behave in living organisms).

By 2003, 10 years after R’s first release, scientists had developed more than 200 packages, and the first citations of the ‘R Project’ appeared. Today, nearly 6,000 packages exist for all kinds of specialized purposes. They allow scientists to compare a human and a Neanderthal genome (using Bioconductor: go.nature.com/s7mq39); to model population growth (IPMpack: go.nature.com/cyhons); predict equity prices (quantmod: go.nature.com/jxqasm); and visualize the results in polished graphics (ggplot2: ggplot2.org) in a few lines of code. Experts can use R to write up manuscripts, embedding raw code in them to be run by the reader (knitr: http://yihui.name/knitr). Nearly 1 in 100 scholarly articles indexed in Elsevier’s Scopus database last year cites R or one of its packages, and in agricultural and environmental sciences, the share is even higher (see ‘A rising tide of R’).

[Figure: A RISING TIDE OF R. An increasing proportion of research articles explicitly reference R or an R package. Y axis: Articles citing R (%), 0 to 4; x axis: 2000 to 2010. Series: Agricultural and biological sciences; Biochemistry, genetics and molecular biology; Earth and planetary sciences; Environmental science; Immunology and microbiology; Mathematics; Neuroscience. Credit: Sylvia Tippmann; source: Elsevier Scopus database.]

STATISTICAL SUCCESS

For many users, R’s quality as statistics software stands out. The tool is on a par with commercial packages such as SPSS and SAS, says Robert Muenchen, a statistician at the University of Tennessee in Knoxville who analyses the popularity of software used in statistical computing. In the past decade, R has caught up with and overtaken the market leaders. “Most likely, R became the top statistics package used during the summer of this year,” he says.

In genomics and molecular biology, a software project called Bioconductor was developed on the back of R. It helps scientists to process and compare huge numbers of genetic sequences, to query results against databases such as Gene Expression Omnibus and to upload data to the databases. It includes almost 1,000 packages, some of which help to link the millions of DNA snippets from next-generation sequencing experiments to annotated genes.

For her dive into R, Royo had intensive training: under the supervision of […] bioinformatics group, she took about half a year to work on R and Bioconductor. But there are plentiful chances to learn, says Karthik Ram, an ecologist at the Berkeley Institute for Data Science in California who founded rOpenSci, an initiative that helps scientists to adopt and develop R (see ‘An R starter kit’). He and his colleagues teach free courses that do not require existing programming skills and are targeted towards scientists’ specific problems.

One researcher who took that training is Megan Jennings, an ecologist at San Diego State University in California. She tracks bobcats, mountain lions and other wild animals, to understand their movements. Armed with more than 400,000 time-stamped photos to which she had appended species names (taken from 36 cameras running for almost a year), Jennings wanted to follow particular species at particular times of year. At first, she manually selected the photos she wanted and fed them into a black-box program called PRESENCE. But with Ram’s help, she is creating an R package that reads in the tagged photos, cleans them up and then sends customized subsets of the data to a pre-existing modelling package in R. “What took me one hour to do manually, I will now be able to do in five minutes,” Jennings says.

One of the greatest perks of R is its online support. Discussion forums about R-related topics outstrip online questions about any commercial statistics software, says Muenchen. “It’s common to see someone post a question and the person who developed the package answer within half an hour,” he says. This rapid response is key for scientists in basic research. “I can find an answer to almost any question online,” says Royo. She can confidently do most of her day-to-day data analysis herself, and she helps out less proficient colleagues. Still, “I google things every day”, she adds. Learning R, says Royo, has not only taught her coding skills, but has also made her more critical about other scientists’ analyses.

Not every scientist is enthusiastic about learning the necessary programming, even though, says Ram, R is less intimidating than languages such as Python (let alone Perl or C). “There are going to be far more scientists that will be comfortable with click-and-drop interfaces than will ever learn to program at any time,” Muenchen says. Geneticist Rabih Murr, for example, took the same R course as Royo when he was a postdoc, but he did not invest as much time in practising. To get started and develop research-specific skills in R definitely requires a commitment: “It’s a matter of priorities,” he says. But after becoming a lab head at the University of Geneva in Switzerland this year, he is planning to hire someone with R experience.

Like any other skill, learning R cannot be done overnight. But Jennings says that it is worth it.

TUTORIALS
An R starter kit

● Install R at the Comprehensive R Archive Network: http://cran.r-project.org. This also provides an introduction to the system: go.nature.com/jh9jb8.
● Many researchers recommend using a (free) powerful interface called RStudio: www.rstudio.com
● Among many online tutorials are those provided by DataCamp (go.nature.com/qndp6w), rOpenSci (ropensci.org), Software Carpentry (go.nature.com/wg3s9u) and R-bloggers (www.r-bloggers.com).
● For a sample list of R packages in