The Present and Future of Statistics: Challenges and Opportunities

Marie Davidian

Department of Statistics, North Carolina State University

http://www4.stat.ncsu.edu/~davidian

1/47 New York Times, August 6, 2009

“I keep saying that the sexy job in the next 10 years will be statisticians” – Hal Varian, Chief Economist, Google

2/47 McKinsey Report, 2011

2011 McKinsey Global Institute report: Big data: The next frontier for innovation, competition, and productivity

“A significant constraint. . . will be a shortage of . . . people with deep expertise in statistics and data mining. . . a talent gap of 140K - 190K positions in 2018 (in the US)”

http://www.mckinsey.com/insights/mgi/research/technology_and_innovation/big_data_the_next_frontier_for_innovation

3/47 The Wall Street Journal, March 1, 2013

4/47 Advanced placement statistics

[Figure: AP Statistics Exam Participation, 1997−2013. Number of students (thousands) by year.]

5/47 Rock star

Nate Silver attracted ∼4,000 statisticians to his President’s Invited Address at the 2013 Joint Statistical Meetings

6/47 Bestsellers

7/47 The Wall Street Journal, November 15, 2013

“Statistics is cool” – Ron Wasserstein, ASA Executive Director

8/47 Rock stars

2013 MacArthur Fellow Susan Murphy

9/47 Rock stars

2013 Prime Minister’s Prize for Science (Australia) recipient Terry Speed

10/47 Rock stars

Sir David Spiegelhalter (aka “Professor Risk”)

11/47 Essential

“Statistical rigor is necessary to justify the inferential leap from data to knowledge”

12/47 Big opportunities

The opportunities for statistics to have major impact are endless

13/47 Big challenge

Big Data and data science

14/47 Big hype

15/47 Big hype, receding

“We are now past the ‘peak of inflated expectations’ of the hype cycle” (http://en.wikipedia.org/wiki/Hype_cycle)

16/47 Investment

19/47 What is data science?

20/47 Big challenge remains

• Perception that other disciplines are more relevant to data-driven science and business
  – Computer science, machine learning, mathematics, engineering, physics, analytics, . . .
• Perception that statistics is old-fashioned and rigid
• Institutes, centers, degree and certificate programs; in many cases statistics is MIA

21/47 For example

“Statistics departments and journals still strongly emphasize a very narrow range of topics and methods and techniques, all driven by a tiny handful of results, many dating from the 1930s. . . the older zombie methods persist in the statistics literature and teaching.”

22/47 For example

“My impression is that scientists view statistics not so much as a science but as a ‘bag of tools.’ You have a visibility problem in Science and AAAS.”

– Alan Leshner, Chief Executive Officer of the American Association for the Advancement of Science (AAAS), to representatives of the ASA in 2011

Science, February 11, 2011

23/47 Weird juxtaposition

• We continue to have daily impact in our critical collaborative roles
• We have gotten lots of great press
• Enrollments in undergraduate programs and applications to graduate programs are skyrocketing
• But statistics is still often absent from the discourse on Big Data and data science
• And the importance of statistical rigor is sometimes lost on fellow scientists
• Statistics and statistical principles continue to be misunderstood

24/47 ASA impact

The American Statistical Association has undertaken numerous initiatives to
• Promote the role of statistics in all the sciences
• Highlight the unique perspectives statistics brings to business and policy
• Increase awareness of opportunities for statisticians among students
and thereby enhance our impact on science and society

25/47 Key ASA Staff

Ron Wasserstein, Steve Pierson, Jeff Myers

26/47 AAAS and Science

AAAS – A focal point for science
• World’s largest general scientific society, ∼120K members
• 261 affiliated scientific societies (including ASA)
• Publisher of Science
• 24 AAAS sections, including Section U on Statistics

2013 presidential initiative
• Raise the profile of statistics within AAAS

27/47 AAAS and Science

• Button campaigns encouraging statisticians to join AAAS
• Section U membership increased 14% from 2012 to 2013
• Section U invited session proposals for AAAS Annual Meetings
• Nominations for AAAS Fellow

28/47 Impacting AAAS and Science

September 26, 2013
• A group of ASA representatives met with Alan Leshner
• Very positive reception!
• We also met with new Science editor-in-chief Marcia McNutt and several senior editors
• Great interest in enhancing the role of statistics and statisticians

29/47 Impacting Science

Science Editor-in-Chief Marcia McNutt

30/47 Science, January 17, 2014

EDITORIAL

Reproducibility

Science advances on a foundation of trusted discoveries. Reproducing an experiment is one important approach that scientists use to gain confidence in their conclusions. Recently, the scientific community was shaken by reports that a troubling proportion of peer-reviewed preclinical studies are not reproducible. Because confidence in results is of paramount importance to the broad scientific community, we are announcing new initiatives to increase confidence in the studies published in Science.

For preclinical studies (one of the targets of recent concern), we will be adopting recommendations of the U.S. National Institute of Neurological Disorders and Stroke (NINDS) for increasing transparency.* Authors will indicate whether there was a pre-experimental plan for data handling (such as how to deal with outliers), whether they conducted a sample size estimation to ensure a sufficient signal-to-noise ratio, whether samples were treated randomly, and whether the experimenter was blind to the conduct of the experiment. These criteria will be included in our author guidelines.

There are a number of reasons why peer-reviewed preclinical studies may not be reproducible. The system under investigation may be more complex than previously thought, so that the experimenter is not actually controlling all independent variables. Authors may not have divulged all of the details of a complicated experiment, making it irreproducible by another lab. It is also expected that through random chance, a certain number of studies will produce false positives. If researchers are not alert to this possibility and have not set appropriately stringent significance tests for their results, the outcome is a study with irreproducible results. Although there is always the possibility that an occasional study is fraudulent, the number of preclinical studies that cannot be reproduced is inconsistent with the idea that all irreproducibility results from misconduct in such research.

It is unlikely that the issues with irreproducibility are confined to preclinical studies (social science has been equally noted, for example). Unfortunately, there are no equivalents to the NINDS recommendations for other disciplines that provide a basis for requiring transparency across all fields. For the next 6 months, we will be asking reviewers and editors to identify papers submitted to Science that demonstrate excellence in transparency and instill confidence in the results. This will inform the next steps in implementing reproducibility guidelines. Science Translational Medicine, a sister journal of Science, already enforces the NINDS guidelines for preclinical studies. Both journals also are open to improving on the NINDS recommendations for preclinical studies.

There is also a wide range of sophistication in the application of statistics displayed in research analysis, ranging from practically no statistics, to the routine use of generic software packages, to the application of advanced methods that extract subtle signals from noise. Because reviewers who are chosen for their expertise in the subject matter of a study may not be authorities in statistics as well, statistical errors in manuscripts may slip through undetected. For that reason, with the advice of the American Statistical Association and others, we are adding new members to our Board of Reviewing Editors from the statistics community to ensure that manuscripts receive appropriate scrutiny in their methods of data analysis.

Science’s standards have always been high, and these measures add to steps we have already taken to increase transparency, such as requiring data accessibility. Nevertheless, journals can only do so much to assure readers of the validity of the studies they publish. The ultimate responsibility lies with authors to be completely open with their methods, all of their findings, and the possible pitfalls that could invalidate their conclusions.

– Marcia McNutt

Marcia McNutt is Editor-in-Chief of Science.

10.1126/science.1250475

*S. C. Landis et al., Nature 490, 187 (2012).

Science, Vol. 343, 17 January 2014, p. 229. Published by AAAS.

31/47 Science, July 4, 2014

EDITORIAL

Raising the bar

Numbers. Lots and lots of numbers. It is hard to find a paper published in Science or any other journal that is not full of numbers. Interpretation of those numbers provides the basis for the conclusions, as well as an assessment of the confidence in those conclusions. But unfortunately, there have been far too many cases where the quantitative analysis of those numbers has been flawed, causing doubt about the authors’ interpretation and uncertainty about the result. Furthermore, it is not realistic to expect that a technical reviewer, chosen for her or his expertise in the topical subject matter or experimental protocol, will also be an expert in data analysis. For that reason, with much help from the American Statistical Association, Science has established, effective 1 July 2014, a Statistical Board of Reviewing Editors (SBoRE), consisting of experts in various aspects of statistics and data analysis, to provide better oversight of the interpretation of observational data.

For those familiar with the role of Science’s Board of Reviewing Editors (BoRE), the function of the SBoRE will be slightly different. Members of the BoRE perform a rapid quality check of manuscripts and recommend which should receive in-depth review by technical specialists. Members of the SBoRE will receive manuscripts that have been identified by editors, BoRE members, or possibly reviewers as needing additional scrutiny of the data analysis or statistical treatment. The SBoRE member assesses what the issue is that requires screening and suggests experts from the statistics community to provide it.

So why is Science taking this additional step? Readers must have confidence in the conclusions published in our journal. We want to continue to take reasonable measures to verify the accuracy of those results. We believe that establishing the SBoRE will help avoid honest mistakes and raise the standards for data analysis, particularly when sophisticated approaches are needed. But even when taking added precautions, no review system is infallible, and no combination of reviewers can be expected to uncover all of the ways in which the interpretation of results may have gone wrong. In particular, it is difficult for reviewers to detect whether authors have approached the study with a lack of bias in their data collection and presentation.

I recall a study that I conducted years ago involving a global analysis of some oceanographic features that I was modeling to understand the rheology of oceanic plates on million-year time scales. I had only a handful of data points, perhaps a dozen or so, and the fit to my model failed a significance test. Clearly, throwing out a few of the data points by declaring them “outliers” would have improved the fit dramatically, and in fact I even recall a reviewer of the paper commenting: “Can’t you make the data fit the model better?”

Really?

The editor published the paper despite the lousy fit of the model to the data. It was not too long before it was realized that those “outliers” were the key to a more complete understanding of the long-term rheological behavior of the oceanic plates. Although the model in the earlier paper needed an overhaul, the fundamental observations, because they were presented without bias, inspired much further progress in the field.

In the years since, I have been amazed at how many scientists have never considered that their data might be presented with bias. There are fundamental truths that may be missed when bias is unintentionally overlooked, or worse yet, when data are “massaged.” Especially as we enter an era of “big data,” we should raise the bar ever higher in scrutinizing the analyses that take us from observations to understanding.

– Marcia McNutt

Marcia McNutt is Editor-in-Chief of Science.

10.1126/science.1257891

Science, Vol. 345, Issue 6192, 4 July 2014, p. 9. Published by AAAS.

32/47 Science Statistics Board of Reviewing Editors

“. . . with much help from the American Statistical Association, Science has established, effective 1 July 2014, a Statistical Board of Reviewing Editors (SBoRE), consisting of experts in various aspects of statistics and data analysis, to provide better oversight of the interpretation of observational data.”

“I have been amazed at how many scientists have never considered that their data might be presented with bias . . . Especially as we enter an era of ‘big data,’ we should raise the bar ever higher in scrutinizing the analyses that take us from observations to understanding.”

33/47 Science Statistics Board of Reviewing Editors

Ron Brookmeyer, UCLA
Alison Motsinger-Reif, NC State University
Giovanni Parmigiani, Dana-Farber Cancer Institute
Richard Smith, University of North Carolina at Chapel Hill
Jane-Ling Wang, University of California, Davis
Chris Wikle, University of Missouri
Ian A. Wilson, The Scripps Research Institute

34/47 Science Statistics Perspectives

35/47 Impacting US federal research priorities

Whitepapers
• Make case that statisticians are essential to tackling national research priorities
• Drive research funding
  – National Science Foundation (NSF)
  – White House Office of Science and Technology Policy (OSTP)
• Modeled after the success of Computing Community Consortium (CCC) whitepapers
• See Steve’s article in the July 2014 Amstat News

36/47 OSTP Big Data initiative

Discovery with Data: Leveraging Statistics with Computer Science to Transform Science and Society

July 2, 2014

A Working Group of the American Statistical Association¹

Cynthia Rudin, Chair; David Dunson, Rafael Irizarry, Hongkai Ji, Eric Laber, Jeff Leek, Tyler McCormick, Sherri Rose, Chad Schafer, Mark van der Laan, Larry Wasserman, Lingzhou Xue

Summary:

The Big Data Research and Development Initiative is now in its third year and making great strides to address the challenges of Big Data. To further advance this initiative, we describe how statistical thinking can help tackle the many Big Data challenges, emphasizing that often the most productive approach will involve multidisciplinary teams with statistical, computational, mathematical, and scientific domain expertise.

With a major Big Data objective of turning data into knowledge, statistics is an essential scientific discipline because of its sophisticated methods for statistical inference, prediction, quantification of uncertainty, and experimental design. Such methods have helped and will continue to enable researchers to make discoveries in science, government, and industry.

The paper discusses the statistical components of scientific challenges facing many broad areas being transformed by Big Data (including healthcare, social sciences, civic infrastructure, and the physical sciences) and describes how statistical advances made in collaboration with other scientists can address these challenges. We recommend more ambitious efforts to incentivize researchers of various disciplines to work together on national research priorities in order to achieve better science more quickly. Finally, we emphasize the need to attract, train, and retain the next generation of statisticians necessary to address the research challenges outlined here.

¹ Authors: Cynthia Rudin, MIT (Chair); David Dunson, Duke University; Rafael Irizarry, Harvard University; Hongkai Ji, Johns Hopkins University; Eric Laber, North Carolina State University; Jeffrey Leek, Johns Hopkins University; Tyler McCormick, University of Washington; Sherri Rose, Harvard University; Chad Schafer, Carnegie Mellon University; Mark van der Laan, University of California, Berkeley; Larry Wasserman, Carnegie Mellon University; Lingzhou Xue, Pennsylvania State University. Affiliations are for identification purposes only and do not imply an institution’s endorsement of this document.

37/47 OSTP BRAIN initiative

STATISTICAL RESEARCH AND TRAINING UNDER THE BRAIN INITIATIVE

A Working Group of the American Statistical Association∗

April 2014

Rob Kass, Chair; Genevera Allen, Brian Caffo, John Cunningham, Uri Eden, Timothy Johnson, Martin Lindquist, Thomas Nichols, Hernando Ombao, Liam Paninski, Russell T. Shinohara, Bin Yu

1 Introduction and Summary

The BRAIN (Brain Research through Advancing Innovative Neurotechnologies) Initiative aims to produce a sophisticated understanding of the link between brain and behavior and to uncover new ways to treat, prevent and cure brain disorders.¹ Success in meeting these multifaceted challenges will require scientific and technological paradigms that incorporate novel statistical methods for data acquisition and analysis. Our purpose here is to substantiate this proposition, and to identify implications for training.

Brain research relies on a wide variety of existing methods for collecting human and animal neural data, including neuroimaging (radiography, fMRI, MEG, PET), electrophysiology from multiple electrodes (EEG, ECoG, LFP, spike trains), calcium imaging, optical imaging, optogenetics, and anatomical methods (diffusion imaging, electron microscopy, fluorescent microscopy). Each of these modalities produces data with its own set of statistical and analytical challenges. As neuroscientists improve these techniques and develop new ones, data are being acquired at very large scales. For example, advances in multiple-electrode recording and two-photon calcium imaging have led to an exponential growth in the size of neural populations that can be observed simultaneously, at single-cell resolution (2; 3; 52). Similarly, new anatomical methods have led to a rapid rise in the size and the scale of data, and the resulting level of detail with which brain structure can be investigated (17; 10; 32; 39). Furthermore, both new and existing technologies are often used together, and are increasingly accompanied by rich characterizations of individuals and their behavior, ranging from genetic information to sensor-based monitors of activity.

∗ The working group included Robert E. Kass, Carnegie Mellon University (Chair); Genevera Allen, Rice University; Brian Caffo, Johns Hopkins University; John Cunningham, Columbia University; Uri Eden, Boston University; Timothy D. Johnson, University of Michigan; Martin A. Lindquist, Johns Hopkins University; Thomas A Nichols, University of Warwick; Hernando Ombao, University of California, Irvine; Liam Paninski, Columbia University; Russell T. Shinohara, University of Pennsylvania; Bin Yu, University of California, Berkeley. Affiliations are for identification purposes only and do not imply an institution’s endorsement of this document.

¹ http://www.whitehouse.gov/share/brain-initiative

38/47 OSTP climate change initiative

Statistical Science: Contributions to the Administration’s Research Priority on Climate Change

April 2014

A White Paper of the American Statistical Association’s Advisory Committee for Climate Change Policy¹

Bruno Sanso, Chair; Mark Berliner, Daniel Cooley, Peter Craigmile, Noel Cressie, Murali Haran, Robert Lund, Doug Nychka, Chris Paciorek, Stephan Sain, Richard Smith, Michael Stein

EXECUTIVE SUMMARY

Data are fundamental to all of science. Data enhance scientific theories and their statistical analysis suggests new avenues of research and data collection. Climate science is no exception. Earth’s climate system is complex, involving the interaction of many different kinds of physical processes and many different time scales. Thus this area of science has a critical dependence on the examination of all relevant data and the application of statistics for its interpretation.

Climate datasets are increasing in number, size, and complexity and challenge traditional methods of data analysis. Satellite remote sensing campaigns, automated weather monitoring networks, and climate-model experiments have contributed to a data explosion that provides a wealth of new information but can overwhelm standard approaches. Developing new statistical approaches is an essential part of understanding climate and its impact on society in the presence of uncertainty. Experience has shown that rapid progress can be made when “big data” is used with statistics to derive new technologies. Crucial to this success are new statistical methods that recognize uncertainties in the measurements and the scientific processes but are also tailored to the unique scientific questions being studied.

This white paper makes the case for the National Science Foundation (NSF) to establish an interdisciplinary research program around climate, where statisticians have the opportunity to collaborate with researchers from other disciplines to advance the understanding of the climate system (e.g., quantification of uncertainties, the development of powerful tests of scientific hypotheses). Although NSF supports basic and applied statistical research, these efforts often do not involve scientists and statisticians in partnerships or in teams to address problems in climate science. This program would also address the critical need for training a new generation of interdisciplinary researchers who can tackle challenging scientific problems that require complex data analysis by developing and using the necessary sophisticated statistical methods.

¹ Authors: Bruno Sanso, University of California, Santa Cruz (Chair); L. Mark Berliner, Ohio State University; Daniel S. Cooley, Colorado State University; Peter Craigmile, Ohio State University; Noel A. Cressie, University of Wollongong; Murali Haran, Pennsylvania State University; Robert B. Lund, Clemson University; Douglas W. Nychka, National Center for Atmospheric Research; Chris Paciorek, University of California, Berkeley; Stephan R. Sain, National Center for Atmospheric Research; Richard L Smith, Statistical and Applied Mathematical Sciences Institute; Michael L. Stein, University of Chicago. Affiliations are for identification purposes only and do not imply an institution’s endorsement of this document.

39/47 Impacting our role in Big Data/data science

Big Data/data science initiative by the ASA presidents (Bob, Marie, Nat) (see my June 2013 Amstat News column)
• Workgroup on curriculum development
• Meetings between ASA representatives and stakeholders from business and technology companies at the forefront of data science (Alexandria, Cincinnati, New York, Palo Alto)
• Training in text analytics at 2014 CSP and JSM

40/47 Data science meetings

Major messages
• Great shortage of statistical talent, concerns over pipeline
• Concerns over ability of fresh PhDs to work independently, identify problems
• Concerns over computational and data manipulation skills – Python favored over R
• Communication, collaboration, and leadership skills
• Must be able to “make it to the middle”

41/47 Preparing statisticians to make an impact

• Report of the curriculum development workgroup, 2014 JSM panel (Monday)
• Professional skills development
• Training in statistical leadership (see Nat’s May 2014 Amstat News column)

42/47 Public relations campaign

First national PR campaign for statistics
• Promote the profession, increase visibility
• Students and those who advise/influence them
• Careers in statistics, importance of statistical literacy in everyday life
• Consulting with Stanton Communications
• Not just an “ASA thing”
• See Nat’s June 2014 Amstat News column

43/47 This is Statistics

“It’s not what you think it is.”

44/47 This is Statistics website

http://www.thisisstatistics.org (going live on August 18, 2014)

45/47 This is Statistics approaches

• Profiles of young statisticians in “cool” positions
• Exploit social media
• Pitch statistics career stories to the media

46/47 Making an impact

I have touched on just a few of the many things the ASA is doing to enhance the impact of statistics

Please contact me or Ron ([email protected]) if you are interested in participating

47/47