The Coming Revolution in Statistics
Lee E. Edlefsen, Ph.D. Chief Scientist
• The leading commercial provider of software and support for the popular open source R statistics language.
• Palo Alto, Seattle, New York.
• www.revolutionanalytics.com/video.php
RevoScaleR Webinar
Revolution Confidential

“R is the most powerful & flexible statistical programming language in the world”…

Internet Discussion
Mean monthly traffic on email discussion list
[Line chart, 1995-2010: R, Stata, SAS, SPSS, S-Plus]

Web Site Popularity
Number of links to main web site
R       4,000
SAS     2,000
SPSS    1,050
S-Plus    900
Stata     600

Scholarly Activity
Google Scholar hits (’05-’09 CAGR)
R        46%
SAS     -11%
SPSS    -27%
S-Plus    0%
Stata    10%

Source: http://r4stats.com/popularity

The coming revolution – due to disruptive technological change
. I believe there is going to be a revolution in both statistical practice and theory over the next several years
. This revolution will be driven by disruptive technological change: our ability to collect and store data is rapidly and greatly outpacing our ability to analyze that data

Huge benefits to huge data
. More information, more to be learned
. Variables and relationships can be visualized and analyzed in much greater detail
. Can allow the data to speak for itself; can relax or eliminate assumptions
. Can get better predictions and better understandings of effects

Revolution R Enterprise

We are currently incapable of analyzing much of the data we have
. The most commonly used statistical software tools either fail completely or are too slow to be useful on huge data sets
. In many ways we are back where we were in the ’70s and ’80s in terms of ability to handle common data sizes

Code museums and the end of an era
. The vast majority of the data analysis software in use today is based on algorithms that are 30, 40, 50 or more years old
. Much of the actual code dates back that far
. During that period of time the rising tide of technology allowed the same code to run faster and on bigger data sets
. We are at the end of that era

To keep up with the tsunami of data
. We must:
  . use more cores
  . use more hard drives
  . use more computers
. Existing statistical software can’t do this
. We need new software

New statistical software must be
. Scalable – the same code that works on 100 observations should work on 100 billion
. Fast – need results in a timely manner; good data analysis requires interactivity
. Easy to use – we need to be able to use clusters and clouds as easily as we can use single workstations today

New software should also
. Be easily extendible to new fast, scalable algorithms
. Leverage as much existing software as possible
. Be flexible, forgiving, and familiar to lots of people

Is this possible? Yes!
. Based on R – R is not only the statistical language of the present; in my opinion it is also the language of the future
. Based on “parallel external memory algorithms” – at Revolution Analytics, we have released a framework for automatically and efficiently parallelizing and distributing a wide class of statistical and data mining algorithms
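
To make the idea of a parallel external memory algorithm concrete, here is a minimal sketch in Python. It is not RevoScaleR's framework or API, and every name in it (Stats, process_chunk, combine, finalize, fit) is hypothetical. It assumes the simplest case, a one-predictor regression y = a + b*x, where each chunk of rows can be reduced to a few running sums, the sums from different chunks can be added together, and the coefficients are computed once at the end. Only one chunk needs to be in memory at a time.

```python
from dataclasses import dataclass
from functools import reduce
from typing import Iterable, Iterator, List, Tuple

Row = Tuple[float, float]  # one observation: (x, y)


@dataclass
class Stats:
    """Running sums that are sufficient to fit y = a + b*x."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0


def process_chunk(chunk: Iterable[Row]) -> Stats:
    """Reduce one chunk of rows to its running sums."""
    s = Stats()
    for x, y in chunk:
        s.n += 1
        s.sx += x
        s.sy += y
        s.sxx += x * x
        s.sxy += x * y
    return s


def combine(a: Stats, b: Stats) -> Stats:
    """Merge the sums of two chunks; the order of merging does not matter."""
    return Stats(a.n + b.n, a.sx + b.sx, a.sy + b.sy,
                 a.sxx + b.sxx, a.sxy + b.sxy)


def finalize(s: Stats) -> Tuple[float, float]:
    """Turn the combined sums into (intercept, slope)."""
    slope = (s.n * s.sxy - s.sx * s.sy) / (s.n * s.sxx - s.sx ** 2)
    intercept = (s.sy - slope * s.sx) / s.n
    return intercept, slope


def chunks(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Yield successive lists of at most `size` rows from any iterable."""
    buf: List[Row] = []
    for row in rows:
        buf.append(row)
        if len(buf) == size:
            yield buf
            buf = []
    if buf:
        yield buf


def fit(rows: Iterable[Row], chunk_size: int = 1000) -> Tuple[float, float]:
    """Fit the regression one chunk at a time."""
    partials = map(process_chunk, chunks(rows, chunk_size))
    return finalize(reduce(combine, partials, Stats()))


# Usage: rows come from a generator, so the full data set is never held in memory.
rows = ((float(i), 2.0 + 3.0 * i) for i in range(1000))
intercept, slope = fit(rows, chunk_size=64)
```

On this synthetic data the fit should recover an intercept near 2 and a slope near 3 whatever chunk size is chosen. Because process_chunk touches only its own chunk and combine is a plain addition, the chunks could in principle be handed to different cores or machines and merged afterwards; this sketch runs them sequentially.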

Scalability of RevoScaleR: Regression, 1 million - 1.1 billion rows, 443 betas
[Chart: time (secs) plotted over the 1 million to 1.1 billion row range; annotation: ~1.1 million rows/second]

Please contact me if you have questions
Lee Edlefsen [email protected]