The Coming Revolution in Statistics

Lee E. Edlefsen, Ph.D. Chief Scientist

• The leading commercial provider of software and support for the popular open source statistics language.
• Palo Alto, Seattle, New York.
• www.revolutionanalytics.com/video.php

RevoScaleR Webinar

“R is the most powerful & flexible statistical programming language in the world”…
Revolution Confidential

[Figure: three panels comparing statistical packages (R, SAS, SPSS, S-Plus, Stata)]
• Internet Discussion: mean monthly traffic on email discussion list
• Web Site Popularity: number of links to main web site, 1995-2010
• Scholarly Activity: Google Scholar hits (’05-’09 CAGR): R 46%, SAS -11%, SPSS -27%, S-Plus 0%, Stata 10%

Source: http://r4stats.com/popularity

The coming revolution – due to disruptive technological change
• I believe there is going to be a revolution in both statistical practice and theory over the next several years
• This revolution will be driven by disruptive technological change: our ability to collect and store data is rapidly and greatly outpacing our ability to analyze that data

Huge benefits to huge data
• More information, more to be learned
• Variables and relationships can be visualized and analyzed in much greater detail
• Can allow the data to speak for itself; can relax or eliminate assumptions
• Can get better predictions and better understandings of effects

Revolution R Enterprise

We are currently incapable of analyzing much of the data we have
• The most commonly used statistical software tools either fail completely or are too slow to be useful on huge data sets
• In many ways we are back where we were in the ’70s and ’80s in terms of ability to handle common data sizes

Code museums and the end of an era
• The vast majority of the data analysis software in use today is based on algorithms that are 30, 40, 50 or more years old
• Much of the actual code dates back that far
• During that period of time the rising tide of technology allowed the same code to run faster and on bigger data sets
• We are at the end of that era

To keep up with the tsunami of data
• We must:
  • use more cores
  • use more hard drives
  • use more computers
• Existing statistical software can’t do this
• We need new software

New statistical software must be
• Scalable – the same code that works on 100 observations should work on 100 billion
• Fast – need results in a timely manner; good data analysis requires interactivity
• Easy to use – we need to be able to use clusters and clouds as easily as we can use single workstations today

New software should also
• Be easily extendible to new fast, scalable algorithms
• Leverage as much existing software as possible
• Be flexible, forgiving, and familiar to lots of people

Is this possible? Yes!
• Based on R – R is not only the statistical language of the present, in my opinion it is the language of the future
• Based on “parallel external memory algorithms”: we have released a framework for automatically and efficiently parallelizing and distributing a wide class of statistical algorithms
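To make the idea of an external memory algorithm concrete, here is a minimal sketch in plain Python, not the RevoScaleR API. It fits a linear regression by reading the data one chunk at a time, keeping only the small summary matrices X'X and X'y, and adding them up. All function names (`chunk_stats`, `combine`, `solve`, `chunked_regression`) and the toy data are hypothetical and chosen only for illustration. Because the combine step is plain addition, chunks could in principle be processed on different cores or machines and merged in any order.

```python
from typing import Iterable, List, Sequence, Tuple

Row = Tuple[Sequence[float], float]  # (predictors, response)
Stats = Tuple[List[List[float]], List[float]]  # (X'X, X'y)


def chunk_stats(chunk: Iterable[Row], p: int) -> Stats:
    """Return X'X and X'y for one chunk, with an intercept column added."""
    xtx = [[0.0] * (p + 1) for _ in range(p + 1)]
    xty = [0.0] * (p + 1)
    for x, y in chunk:
        row = [1.0, *x]
        for i in range(p + 1):
            xty[i] += row[i] * y
            for j in range(p + 1):
                xtx[i][j] += row[i] * row[j]
    return xtx, xty


def combine(a: Stats, b: Stats) -> Stats:
    """Merge two partial results; addition makes the merge order irrelevant."""
    xtx = [[u + v for u, v in zip(ra, rb)] for ra, rb in zip(a[0], b[0])]
    xty = [u + v for u, v in zip(a[1], b[1])]
    return xtx, xty


def solve(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a @ beta = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    beta = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * beta[j] for j in range(i + 1, n))
        beta[i] = (m[i][n] - s) / m[i][i]
    return beta


def chunked_regression(chunks: Iterable[Iterable[Row]], p: int) -> List[float]:
    """Fit y = b0 + b1*x1 + ... + bp*xp, holding one chunk in memory at a time."""
    total: Stats = ([[0.0] * (p + 1) for _ in range(p + 1)], [0.0] * (p + 1))
    for chunk in chunks:
        total = combine(total, chunk_stats(chunk, p))
    return solve(*total)


# Toy data following y = 2 + 3*x, split into three chunks of unequal size.
data = [((float(i),), 2.0 + 3.0 * i) for i in range(10)]
chunks = [data[0:4], data[4:7], data[7:10]]
beta = chunked_regression(chunks, p=1)
print(beta)
```

Memory use in this sketch depends on the number of coefficients, not the number of rows, which is what lets the same code run on 100 observations or on far more. A production implementation would read chunks from disk and use a more numerically robust solver than the normal equations shown here.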

Scalability of RevoScaleR: Regression, 1 million - 1.1 billion rows, 443 betas

[Figure: Time (secs), axis range 0-1200; annotation: ~1.1 million rows/second]
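The throughput annotation on this chart supports a quick back-of-envelope estimate. The sketch below assumes that throughput stays constant at roughly 1.1 million rows/second across the whole range, which is a simplification read off the slide's annotation rather than a measured model. The function name `estimated_seconds` is hypothetical.

```python
ROWS_PER_SECOND = 1.1e6  # approximate throughput annotated on the slide


def estimated_seconds(rows: float, rate: float = ROWS_PER_SECOND) -> float:
    """Estimated run time if throughput stays constant as the data grows."""
    return rows / rate


for rows in (1e6, 1e8, 1.1e9):
    print(f"{rows:>15,.0f} rows -> about {estimated_seconds(rows):,.0f} secs")
```

Under that assumption, the largest data set on the slide (1.1 billion rows) would take on the order of 1,000 seconds, which is consistent with the 0-1200 range of the time axis.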

Please contact me if you have questions
Lee Edlefsen
[email protected]
