
CHANCE, Vol. 32, No. 1, 2019. Using Data to Advance Science, Education, and Society

Including: Big Data and Privacy; How We Know that the Earth is Warming


ARTICLES
4  Why Most Published Research Findings Are False, John P. A. Ioannidis
14 Length of the Beatles' Songs, Tatsuki Koyama
19 Unlocking the Statistics of Slavery, Kevin Bales
27 Bond. James Bond. A Statistical Look at Cinema's Most Famous Spy, Derek S. Young
36 How We Know that the Earth is Warming, Peter Guttorp
42 A Second Chance to Get Causal Inference Right: A Classification of Data Science Tasks, Miguel A. Hernán, John Hsu, and Brian Healy
50 Book Excerpt: Improving Your NCAA® Bracket with Statistics, Tom Adams

COLUMNS
55 The Big Picture: Big Data and Privacy, Nicole Lazar, Column Editor
59 Book Reviews, Christian Robert, Column Editor

DEPARTMENTS
3  Editor's Letter

Abstracted/indexed in Academic OneFile, Academic Search, ASFA, CSA/ProQuest, Current Abstracts, Current Index to Statistics, Gale, Google Scholar, MathEDUC, Mathematical Reviews, OCLC, Summon by Serial Solutions, TOC Premier, Zentralblatt Math.

Cover image: Melissa Gotherman

AIMS AND SCOPE
CHANCE is designed for anyone who has an interest in using data to advance science, education, and society. CHANCE is a non-technical magazine highlighting applications that demonstrate sound statistical practice. CHANCE represents a cultural record of an evolving field, intended to entertain as well as inform.

EXECUTIVE EDITOR
Scott Evans, Harvard School of Public Health, Boston, Massachusetts; [email protected]

ADVISORY EDITORS
Sam Behseta, California State University, Fullerton
Michael Larsen, St. Michael's College, Colchester, Vermont
Michael Lavine, University of Massachusetts, Amherst
Dalene Stangl, Carnegie Mellon University, Pittsburgh, Pennsylvania
Hal S. Stern, University of California, Irvine

EDITORS
Jim Albert, Bowling Green State University, Ohio
Phil Everson, Swarthmore College, Pennsylvania
Dean Follman, NIAID and Biostatistics Research Branch, Maryland
Toshimitsu Hamasaki, Office of Biostatistics and Data Management, National Cerebral and Cardiovascular Research Center, Osaka, Japan
Jo Hardin, Pomona College, Claremont, California
Tom Lane, MathWorks, Natick, Massachusetts
Michael P. McDermott, University of Rochester Medical Center, New York
Mary Meyer, Colorado State University at Fort Collins
Kary Myers, Los Alamos National Laboratory, New Mexico
Babak Shahbaba, University of California, Irvine
Lu Tian, Stanford University, California

COLUMN EDITORS
Di Cook, Iowa State University, Ames (Visiphilia)
Chris Franklin, University of Georgia, Athens (K-12 Education)
Andrew Gelman, Columbia University, New York, New York (Ethics and Statistics)
Mary Gray, American University, Washington, D.C. (The Odds of Justice)
Shane Jensen, Wharton School at the University of Pennsylvania, Philadelphia (A Statistician Reads the Sports Pages)
Nicole Lazar, University of Georgia, Athens (The Big Picture)
Bob Oster, University of Alabama, Birmingham, and Ed Gracely, Drexel University, Philadelphia, Pennsylvania (Teaching Statistics in the Health Sciences)
Christian Robert, Université Paris-Dauphine, France (Book Reviews)
Aleksandra Slavkovic, Penn State University, University Park (O Privacy, Where Art Thou?)
Dalene Stangl, Carnegie Mellon University, Pittsburgh, Pennsylvania, and Mine Çetinkaya-Rundel, Duke University, Durham, North Carolina (Taking a Chance in the Classroom)
Howard Wainer, National Board of Medical Examiners, Philadelphia, Pennsylvania (Visual Revelations)

SUBSCRIPTION INFORMATION
CHANCE (ISSN: 0933-2480) is co-published quarterly in February, April, September, and November for a total of four issues per year by the American Statistical Association, 732 North Washington Street, Alexandria, VA 22314, USA, and Taylor & Francis Group, LLC, 530 Walnut Street, Suite 850, Philadelphia, PA 19106, USA.

U.S. Postmaster: Please send address changes to CHANCE, Taylor & Francis Group, LLC, 530 Walnut Street, Suite 850, Philadelphia, PA 19106, USA.

ASA MEMBER SUBSCRIPTION RATES
ASA members who wish to subscribe to CHANCE should go to ASA Members Only, www.amstat.org/membersonly, select the "My Account" tab, and then "Add a Publication." ASA members' publications period will correspond with their membership cycle.

SUBSCRIPTION OFFICES
USA/North America: Taylor & Francis Group, LLC, 530 Walnut Street, Suite 850, Philadelphia, PA 19106, USA. Telephone: 215-625-8900; Fax: 215-207-0050. UK/Europe: Taylor & Francis Customer Service, Sheepen Place, Colchester, Essex, CO3 3LP, United Kingdom. Telephone: +44-(0)-20-7017-5544; Fax: +44-(0)-20-7017-5198.

For information and subscription rates, please email [email protected] or visit www.tandfonline.com/pricing/journal/ucha.

OFFICE OF PUBLICATION
American Statistical Association, 732 North Washington Street, Alexandria, VA 22314, USA. Telephone: 703-684-1221. Editorial Production: Megan Murphy, Communications Manager; Valerie Nirala, Publications Coordinator; Ruth E. Thaler-Carter, Copyeditor; Melissa Gotherman, Graphic Designer. Taylor & Francis Group, LLC, 530 Walnut Street, Suite 850, Philadelphia, PA 19106, USA. Telephone: 215-625-8900; Fax: 215-207-0047.

Copyright ©2019 American Statistical Association. All rights reserved. No part of this publication may be reproduced, stored, transmitted, or disseminated in any form or by any means without prior written permission from the American Statistical Association. The American Statistical Association grants authorization for individuals to photocopy copyrighted material for private research use on the sole basis that requests for such use are referred directly to the requester's local Reproduction Rights Organization (RRO), such as the Copyright Clearance Center (www.copyright.com) in the United States or The Copyright Licensing Agency (www.cla.co.uk) in the United Kingdom. This authorization does not extend to any other kind of copying by any means, in any form, and for any purpose other than private research use. The publisher assumes no responsibility for any statements of fact or opinion expressed in the published papers. The appearance of advertising in this journal does not constitute an endorsement or approval by the publisher, the editor, or the editorial board of the quality or value of the product advertised or of the claims made for it by its manufacturer.

RESPONSIBLE FOR ADVERTISEMENTS
Send inquiries and space reservations to: [email protected].

Printed in the United States on acid-free paper.

WEBSITE
http://chance.amstat.org

EDITOR'S LETTER

Scott Evans

Dear CHANCE Colleagues,

Happy new year! To bring in this new year, we are playing some of CHANCE's greatest hits by revisiting some of the most-popular articles through the years and mixing in a few new ones.

Our first article was quite thought-provoking and one of the forerunners of today's growing discussion regarding the use of p-values. John Ioannidis discusses why most published research findings are false!

Tatsuki Koyama then discusses how the length of the Beatles' songs increased during the latter part of their career. (On another Beatles topic, check out Mark Glickman on Science Friday: https://www.sciencefriday.com/segments/who-wrote-that-beatles-song-this-algorithm-will-tell-you/. John Lennon claimed to have written "In My Life," but Paul McCartney remembers it differently. Mark evaluates who is telling the truth.)

Browsing further through the CHANCE photo album, Kevin Bales unlocks statistics from a special issue of CHANCE on modern slavery.

Derek Young evaluates 007…after all, he informs us that "James Bond will return."

Peter Guttorp explains how we know the Earth is warming in his article from a special issue of CHANCE on climate change.

Let's move to today: Data science is all the rage! Miguel Hernán, John Hsu, and Brian Healy discuss using data science to redefine data analysis to accommodate causal inference from observational data. They offer a classification of data science tasks.

Tom Adams then shares an excerpt from his exciting new book, Improving Your NCAA Bracket with Statistics.

We also revisit one of our column articles from yesteryear: Nicole Lazar discusses "Big Data and Privacy" from The Big Picture column. Nicole wraps up the article with, "To all of my students, former students, collaborators past and present, and old friends who try to connect via LinkedIn, Facebook, ResearchGate, and the like—when I don't respond, just know that I don't participate in any of those fora. It's my small way of keeping a corner of privacy in the world." I bet Nicole is good at filling out NCAA brackets.

We end this issue with Christian Robert reviewing Pragmatics of Uncertainty by Joseph Kadane; 10 Great Ideas About Chance by Persi Diaconis and Brian Skyrms; Independent Random Sampling Methods by Luca Martino, David Luengo, and Joaquin Miguez; and Computational Methods for Numerical Analysis with R by James Howard.

We hope that you enjoy the trip down memory lane.

Scott Evans

Correction: In the CHANCE 31.4 article "Elementary Statistics on Trial—The Case of Lucia de Berk" by Richard D. Gill, Piet Groeneboom, and Peter de Jong, the caption for the Fokke and Sukke cartoon on page 11 refers to "two ducks." It should be: "The canary and the duck are defending a family guardian…"

Why Most Published Research Findings Are False
John P. A. Ioannidis
Reprinted courtesy of the Public Library of Science

Editor's note: This article contains opinion on topics of broad interest and does not necessarily reflect the views of CHANCE. It originally appeared in CHANCE 18.4.

There is increasing concern that most current published research findings are false. The probability that a research claim is true may depend on study power and bias, the number of other studies on the same question, and, importantly, the ratio of true to no relationships among the relationships probed in each scientific field. In this framework, a research finding is less likely to be true when the studies conducted in a field are smaller; effect sizes are smaller; there is a greater number and lesser preselection of tested relationships; there is greater flexibility in designs, definitions, outcomes, and analytical modes; there is greater financial and other interest and prejudice; and more teams are involved in a scientific field in chase of statistical significance. Simulations show that for most study designs and settings, it is more likely for a research claim to be false than true. Moreover, for many current scientific fields, claimed research findings often may be simply accurate measures of the prevailing bias.

Published research findings are sometimes refuted by subsequent evidence, with ensuing confusion and disappointment. Refutation and controversy is seen across the range of research designs, from clinical trials and traditional epidemiological studies to the most modern molecular research. There is increasing concern that in modern research, false findings may be the majority, or even the vast majority, of published research claims. However, this should not be surprising. It can be proven that most claimed research findings are false.

Modeling the Framework for False-Positive Findings

Several methodologists have pointed out that the high rate of nonreplication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded, strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. Research is not most appropriately represented

Table 1—Research Findings and True Relationships

and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values. Research findings are defined here as any relationship reaching formal statistical significance (e.g., effective interventions, informative predictors, risk factors, or associations). "Negative" research also is very useful. "Negative" is actually a misnomer, and the misinterpretation is widespread. However, here we will target relationships that investigators claim exist, rather than null findings.

As has been shown previously, the probability that a research finding is indeed true depends on the prior probability of it being true (before doing the study), the statistical power of the study, and the level of statistical significance. Consider a 2 × 2 table in which research findings are compared against the gold standard of true relationships in a scientific field. In a research field, both true and false hypotheses can be made about the presence of relationships. Let R be the ratio of the number of true relationships to no relationships among those tested in the field. R is characteristic of the field and can vary a lot, depending on whether the field targets highly likely relationships or searches for only one or a few true relationships among thousands and millions of hypotheses that may be postulated. Let us also consider, for computational simplicity, circumscribed fields where either there is only one true relationship (among many that can be hypothesized) or there is roughly equal power for finding any of the several existing true relationships.

The pre-study probability of a relationship being true is R/(R + 1). The probability of a study finding a true relationship reflects the power 1 − β (one minus the Type II error rate). The probability of claiming a relationship when none truly exists reflects the Type I error rate, α. Assuming that c relationships are being probed in the field, the expected values of the 2 × 2 table are given in Table 1. After a research finding has been claimed based on achieving formal statistical significance, the post-study probability that it is true is the positive predictive value (PPV). The PPV is also the complementary probability of what has been called the false positive report probability. According to the 2 × 2 table, one gets PPV = (1 − β)R/(R − βR + α). A research finding is thus more likely true than false if (1 − β)R > α. Because usually the vast majority of investigators depend on α = 0.05, a research finding is more likely true than false if (1 − β)R > 0.05.

What is less appreciated is that bias and the extent of repeated independent testing by different teams of investigators around the globe may further distort this picture and may lead to even smaller probabilities of the research findings being indeed true. We will try to model these two factors in the context of similar 2 × 2 tables.

Bias

First, let us define bias as the combination of various design, data, analysis, and presentation factors that tend to produce research findings when they should not be produced. Let u be the proportion of probed analyses that would not have been "research findings," but nevertheless end up presented and reported as such because of bias. Bias should not be confused with chance variability that causes some findings to be false by chance, even though the study design, data, analysis, and presentation are perfect. Bias can entail manipulation in the analysis or reporting of findings. Selective or distorted reporting is a typical form of such bias. We may assume that u does not depend on whether a true relationship exists. This is not an unreasonable assumption, as

Table 2—Research Findings and True Relationships in the Presence of Bias

typically it is impossible to know which relationships are indeed true. In the presence of bias (Table 2), one gets PPV = ([1 − β]R + uβR)/(R + α − βR + u − uα + uβR), and PPV decreases with increasing u, unless 1 − β ≤ α (i.e., 1 − β ≤ 0.05), for most situations. Thus, with increasing bias, the chance that research findings are true diminishes considerably. This is shown for different levels of power and for different pre-study odds in Figure 1.

Conversely, true research findings may occasionally be annulled because of reverse bias. For example, with large measurement errors, relationships are lost in noise, or investigators use data inefficiently or fail to notice statistically significant relationships, or there may be conflicts of interest that tend to "bury" significant findings. There is no good large-scale empirical evidence of how frequently such reverse bias may occur across diverse research fields. However, it is probably fair to say that reverse bias is not as common. Moreover, measurement errors and inefficient use of data are probably becoming less-frequent problems, since measurement error has decreased with technological advances in the molecular era and investigators are becoming increasingly sophisticated about their data.

Figure 1. PPV (Probability that a Research Finding is True) as a Function of the Pre-study Odds for Various Levels of Bias, u (panels correspond to power of 0.20, 0.50, and 0.80).
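The 2 × 2-table algebra above is compact enough to check numerically. The sketch below encodes the two PPV formulas as stated in the text (Table 1 without bias, Table 2 with bias u); the particular values of R, u, and power are illustrative choices, not figures from the article.

```python
def ppv(R, alpha=0.05, beta=0.20):
    """PPV without bias (Table 1): (1 - beta)R / (R - beta*R + alpha).
    R = pre-study odds of a true relationship, alpha = Type I error
    rate, beta = Type II error rate (power = 1 - beta)."""
    return (1 - beta) * R / (R - beta * R + alpha)

def ppv_bias(R, u, alpha=0.05, beta=0.20):
    """PPV with bias u (Table 2), where u is the proportion of probed
    analyses reported as findings even though they should not be."""
    num = (1 - beta) * R + u * beta * R
    den = R + alpha - beta * R + u - u * alpha + u * beta * R
    return num / den

# The denominator of ppv() equals (1 - beta)R + alpha, so PPV > 1/2
# exactly when (1 - beta)R > alpha, as the text states.
# Illustrative pre-study odds R = 0.2 (one true per five tested), 80% power:
print(round(ppv(0.2), 3))          # 0.762
for u in (0.0, 0.1, 0.3, 0.5):     # PPV erodes as bias u grows
    print(u, round(ppv_bias(0.2, u), 3))
```

Setting u = 0 collapses the Table 2 expression to the Table 1 one, which is a quick sanity check on the transcription of the formula.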

Table 3—Research Findings and True Relationships in the Presence of Multiple Studies

Regardless, reverse bias may be modeled in the same way as bias above. Also, reverse bias should not be confused with chance variability that may lead to missing a true relationship because of chance.

Testing by Several Independent Teams

Several independent teams may be addressing the same sets of research questions. As research efforts are globalized, it is practically the rule that several research teams, often dozens of them, may probe the same or similar questions. Unfortunately, in some areas, the prevailing mentality until now has been to focus on isolated discoveries by single teams and interpret research experiments in isolation. An increasing number of questions has at least one study claiming a research finding, and this receives unilateral attention. The probability that at least one study, among several on the same question, claims a statistically significant research finding is easy to estimate. For n independent studies of equal power, the 2 × 2 table is shown in Table 3: PPV = R(1 − βⁿ)/(R + 1 − [1 − α]ⁿ − Rβⁿ) (not considering bias). With an increasing number of independent studies, PPV tends to decrease, unless 1 − β < α (i.e., typically 1 − β < 0.05). This is shown for different levels of power

Figure 2. PPV (Probability that a Research Finding is True) as a Function of the Pre-study Odds for Various Numbers of Conducted Studies, n (panels correspond to power of 0.20, 0.50, and 0.80).

and for different pre-study odds in Figure 2. For n studies of different power, the term βⁿ is replaced by the product of the terms βᵢ for i = 1 to n, but inferences are similar.

Corollaries

Based on the above considerations, one may deduce several interesting corollaries about the probability that a research finding is indeed true.

Corollary 1: The smaller the studies conducted in a scientific field, the less likely the research findings are to be true. Small sample size means smaller power and, for all functions above, the PPV for a true research finding decreases as power decreases toward 1 − β = 0.05. Thus, other factors being equal, research findings are more likely true in scientific fields that undertake large studies, such as randomized controlled trials in cardiology (several thousand subjects randomized), than in scientific fields with small studies, such as most research of molecular predictors (sample sizes 100-fold smaller).

Corollary 2: The smaller the effect sizes in a scientific field, the less likely the research findings are to be true. Power also is related to the effect size. Thus, research findings are more likely true in scientific fields with large effects, such as the impact of smoking on cancer or cardiovascular disease (relative risks 3–20), than in scientific fields where postulated effects are small, such as genetic risk factors for multigenetic diseases (relative risks 1.1–1.5). Modern epidemiology is increasingly obliged to target smaller effect sizes. Consequently, the proportion of true research findings is expected to decrease. In the same line of thinking, if the true effect sizes are very small in a scientific field, this field is likely to be plagued by almost-ubiquitous false-positive claims. For example, if the majority of true genetic or nutritional determinants of complex diseases confer relative risks of less than 1.05, genetic or nutritional epidemiology would be largely utopian endeavors.

Corollary 3: The greater the number and the lesser the selection of tested relationships in a scientific field, the less likely the research findings are to be true. As shown above, the post-study probability that a finding is true (PPV) depends a lot on the pre-study odds (R). Thus, research findings are more likely true in confirmatory designs, such as large phase III randomized controlled trials, or meta-analyses thereof, than in hypothesis-generating experiments. Fields considered highly informative and creative given the wealth of the assembled and tested information, such as microarrays and other high-throughput discovery-oriented research, should have extremely low PPV.

Corollary 4: The greater the flexibility in designs, definitions, outcomes, and analytical modes in a scientific field, the less likely the research findings are to be true. Flexibility increases the potential for transforming what would be "negative" results into "positive" results (i.e., bias, u). For several research designs (e.g., randomized controlled trials or meta-analyses), there have been efforts to standardize their conduct and reporting. Adherence to common standards is likely to increase the proportion of true findings. The same applies to outcomes. True findings may be more common when outcomes are unequivocal and universally agreed (e.g., death), rather than when multifarious outcomes are devised (e.g., scales for schizophrenia outcomes). Similarly, fields that use commonly agreed, stereotyped analytical methods (e.g., Kaplan-Meier plots and the log-rank test) may yield a larger proportion of true findings than fields where analytical methods are still under experimentation (e.g., artificial intelligence methods) and only "best" results are reported. Regardless, even in the most stringent research designs, bias seems to be a major problem. For example, there is strong evidence that selective outcome reporting, with manipulation of the outcomes and analyses reported, is a common problem even for randomized trials. Simply abolishing selective publication would not make this problem go away.

Corollary 5: The greater the financial and other interests and prejudices in a scientific field, the less likely the research findings are to be true. Conflicts of interest and prejudice may increase bias (u). Conflicts of interest are very common in biomedical research, and typically they are inadequately and sparsely reported. Prejudice may not necessarily have financial roots. Scientists in a given field may be prejudiced purely because of their belief in a scientific theory or commitment to their own findings. Many otherwise seemingly independent, university-based studies may be conducted for no other reason than to give physicians and researchers qualifications for promotion or tenure. Such nonfinancial conflicts also may lead to distorted reported results and interpretations. Prestigious investigators may suppress, via the peer review process, the appearance and dissemination of findings that refute their findings, thus condemning their field to perpetuate false dogma. Empirical evidence on expert opinion shows it is extremely unreliable.

Corollary 6: The hotter a scientific field (with more scientific teams involved), the less likely the research findings are to be true. This seemingly paradoxical corollary follows because, as stated

Table 4—PPV of Research Findings for Various Combinations of Power (1 − β), Ratio of True to Not-true Relationships (R), and Bias (u)

above, the PPV of isolated findings decreases when many teams of investigators are involved in the same field. This may explain why we occasionally see major excitement followed rapidly by severe disappointments in fields that draw wide attention. With many teams working in the same field and with massive experimental data being produced, timing is of the essence in beating competition. Thus, each team may prioritize on pursuing and disseminating its most impressive "positive" results. "Negative" results may become attractive for dissemination only if some other team has found a "positive" association on the same question. In that case, it may be attractive to refute a claim made in some prestigious journal. The term "Proteus phenomenon" has been coined to describe rapidly alternating extreme research claims and equally extreme opposite refutations. Empirical evidence suggests this sequence of extreme opposites is very common in molecular genetics.

These corollaries consider each factor separately, but these factors often influence each other. For example, investigators working in fields where true effect sizes are perceived to be small may be more likely to perform large studies than investigators working in fields where true effect sizes are perceived to be large. Prejudice may prevail in a hot scientific field, further undermining the predictive value of its research findings. Highly prejudiced stakeholders may even create a barrier that aborts efforts at obtaining and disseminating opposing results. Conversely, the fact that a field is hot or has strong invested interests may sometimes promote larger studies and improved standards of research, enhancing the predictive value of its research findings. Massive discovery-oriented testing also may result in such a large yield of significant relationships that investigators have enough to report and search further and thus refrain from data dredging and manipulation.

Most Research Findings Are False for Most Research Designs and for Most Fields

In the described framework, a PPV exceeding 50% is quite difficult to get. Table 4 provides the results of simulations using the formulas developed for the influence of

power, ratio of true to non-true relationships, and bias for various types of situations that may be characteristic of specific study designs and settings. A finding from a well-conducted, adequately powered, randomized controlled trial, starting with a 50% pre-study chance that the intervention is effective, is eventually true about 85% of the time.

A fairly similar performance is expected of a confirmatory meta-analysis of good-quality randomized trials: Potential bias probably increases, but power and pre-test chances are higher compared to a single randomized trial. Conversely, a meta-analytic finding from inconclusive studies, where pooling is used to "correct" the low power of single studies, is probably false if R ≤ 1:3.

Research findings from underpowered, early-phase clinical trials would be true about one in four times, or even less frequently if bias is present. Epidemiological studies of an exploratory nature perform even worse, especially when underpowered, but even well-powered epidemiological studies may have only a one-in-five chance of being true, if R = 1:10.

Finally, in discovery-oriented research with massive testing, where tested relationships exceed true ones 1,000-fold (e.g., 30,000 genes tested, of which 30 may be the true culprits), PPV for each claimed relationship is extremely low, even with considerable standardization of laboratory and statistical methods, outcomes, and reporting to minimize bias.

Claimed Research Findings Often May Simply Be Accurate Measures of the Prevailing Bias

As shown, the majority of modern biomedical research is operating in areas with very low pre- and post-study probability for true findings. Let us suppose that in a research field there are no true findings at all to be discovered. The history of science teaches us that scientific endeavor has wasted effort in fields with absolutely no yield of true scientific information often in the past, at least based on our current understanding. In such a null field, one would ideally expect all observed effect sizes to vary by chance around the null in the absence of bias. The extent that observed findings deviate from what is expected by chance alone would be simply a pure measure of the prevailing bias.

For example, let us suppose that no nutrients or dietary patterns are actually important determinants for the risk of developing a specific tumor. Let us also suppose that the scientific literature has examined 60 nutrients and claims all of them to be related to the risk of developing this tumor, with relative risks in the range of 1.2 to 1.4 for the comparison of the upper to lower intake tertiles. Then, the claimed effect sizes are simply measuring nothing but the net bias that has been involved in the generation of this scientific literature. Claimed effect sizes are in fact the most accurate estimates of the net bias. It even follows that between null fields, the fields that claim stronger effects (often with accompanying claims of medical or public health importance) are simply those that have sustained the worst biases.

For fields with very low PPV, the few true relationships would not distort this overall picture much. Even if a few relationships are true, the shape of the distribution of the observed effects would still yield a clear measure of the biases involved in the field. This concept totally reverses the way we view scientific results. Traditionally, investigators have viewed large and highly significant effects with excitement, as signs of important discoveries. Too large and too highly significant effects actually may be more likely to be signs of large bias in most fields of modern research. They should lead investigators to careful critical thinking about what might have gone wrong with their data, analyses, and results.

Of course, investigators working in any field are likely to resist accepting that the whole field in which they have spent their careers is a null field. However, other lines of evidence, or advances in technology and experimentation, eventually may lead to the dismantling of a scientific field. Obtaining measures of the net bias in one field also may be useful for obtaining insight into what might be the range of bias operating in other fields where similar analytical methods, technologies, and conflicts are operating.

How Can We Improve the Situation?

Is it unavoidable that most research findings are false, or can we improve the situation? A major problem is that it is impossible to know with 100% certainty what the truth is in any research question. In this regard, the pure "gold" standard is unattainable. However, there are several approaches to improve the post-study probability. Better-powered evidence (e.g., large studies or low-bias meta-analyses) may help, since it comes closer to the unknown gold standard. However, large studies still may have biases, and these should be acknowledged and avoided. Moreover, large-scale evidence is impossible to obtain for all of the millions and trillions of research questions posed in current research.
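The post-study probabilities quoted in these corollaries (about 85% for a well-powered randomized trial, roughly one in five for well-powered exploratory epidemiology at R = 1:10) follow from the PPV formula in the original PLoS Medicine version of this essay. A minimal sketch in Python; the power and bias values below are assumptions chosen to match the essay's worked examples, not new data:

```python
def ppv(R, power, alpha=0.05, u=0.0):
    """Post-study probability that a claimed research finding is true,
    using the PPV formula from the original version of this essay.
    R is the pre-study odds of a true relationship; u is the bias, the
    fraction of would-be non-findings that end up reported as positive."""
    beta = 1.0 - power
    numerator = (1 - beta) * R + u * beta * R
    denominator = R + alpha - beta * R + u - u * alpha + u * beta * R
    return numerator / denominator

# Adequately powered RCT: 50% pre-study chance (R = 1), small bias.
print(round(ppv(R=1.0, power=0.80, u=0.10), 2))   # 0.85
# Well-powered exploratory epidemiology: R = 1:10, larger bias.
print(round(ppv(R=0.1, power=0.80, u=0.30), 2))   # 0.2
```

Lowering R or power, or raising u, pushes the PPV below one half, which is the precise sense in which "most published research findings are false."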

A WHOLE GENOME ASSOCIATION STUDY

Let us assume a team of investigators performs a whole genome association study to test whether any of 100,000 gene polymorphisms are associated with susceptibility to schizophrenia. Based on what we know about the extent of heritability of the disease, it is reasonable to expect that probably around 10 gene polymorphisms among those tested would be truly associated with schizophrenia, with relatively similar odds ratios around 1.3 for the 10 or so polymorphisms and with a fairly similar power to identify any of them. Then R = 10/100,000 = 10⁻⁴, and the pre-study probability for any polymorphism to be associated with schizophrenia is also R/(R + 1) = 10⁻⁴. Let us also suppose the study has 60% power to find an association with an odds ratio of 1.3 at α = 0.05. Then it can be estimated that if a statistically significant association is found with the p-value barely crossing the 0.05 threshold, the post-study probability that this is true increases about twelve-fold compared with the pre-study probability, but it is still only 12 × 10⁻⁴.

Now let us suppose the investigators manipulate their design, analyses, and reporting so as to make more relationships cross the p = 0.05 threshold, even though this would not have been crossed with a perfectly adhered-to design and analysis and with perfect, comprehensive reporting of the results, strictly according to the original study plan. Such manipulation could be done, for example, with serendipitous inclusion or exclusion of certain patients or controls, post hoc subgroup analyses, investigation of genetic contrasts that were not originally specified, changes in the disease or control definitions, and various combinations of selective or distorted reporting of the results. Commercially available data mining packages actually are proud of their ability to yield statistically significant results through data dredging. In the presence of bias with u = 0.10, the post-study probability that a research finding is true is only 4.4 × 10⁻⁴. Furthermore, even in the absence of any bias, when 10 independent research teams perform similar experiments around the world, if one of them finds a formally statistically significant association, the probability that the research finding is true is only 1.5 × 10⁻⁴, hardly any higher than the probability we had before any of this extensive research was undertaken!

Large-scale evidence should be targeted for research questions where the pre-study probability is already considerably high, so a significant research finding will lead to a post-test probability that would be considered quite definitive. Large-scale evidence also is indicated, particularly when it can test major concepts rather than narrow, specific questions. A negative finding can then refute not only a specific proposed claim, but a whole field or considerable portion thereof. Selecting the performance of large-scale studies based on narrow-minded criteria, such as the marketing promotion of a specific drug, is largely wasted research. Moreover, one should be cautious that extremely large studies may be more likely to find a formally statistically significant difference for a trivial effect that is not meaningfully different from the null.

Second, most research questions are addressed by many teams, and it is misleading to emphasize the statistically significant findings of any single team. What matters is the totality of the evidence. Diminishing bias through enhanced research standards and curtailing of prejudices also may help.
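The single-study figures in the whole genome association study box can be checked directly. A sketch, again assuming the PPV-with-bias formula from the original version of the essay:

```python
def ppv(R, power, alpha=0.05, u=0.0):
    # PPV formula from the original version of this essay; u is the bias term.
    beta = 1.0 - power
    return ((1 - beta) * R + u * beta * R) / (
        R + alpha - beta * R + u - u * alpha + u * beta * R)

R = 10 / 100_000                      # 10 true among 100,000 polymorphisms
pre_study = R / (R + 1)               # roughly 1e-4
clean = ppv(R, power=0.60)            # faithful design and reporting
biased = ppv(R, power=0.60, u=0.10)   # manipulated analyses, u = 0.10

print(f"{clean:.1e}")                 # 1.2e-03, i.e., 12 x 10^-4
print(round(clean / pre_study))       # 12 (the twelve-fold increase)
print(f"{biased:.1e}")                # 4.4e-04
```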
However, this may require a change in scientific mentality that might be difficult to achieve. In some research designs, efforts also may be more successful with upfront registration of studies (e.g., randomized trials). Registration would pose a challenge for hypothesis-generating research. Some kind of registration or networking of data collections or investigators within fields may be more feasible than registration of each and every hypothesis-generating experiment. Regardless, even if we do not see a great deal of progress with registration of studies in other fields, the principles of developing and adhering to a protocol could be more widely borrowed from randomized controlled trials.

Finally, instead of chasing statistical significance, we should improve our understanding of the range of R values—the pre-study odds—where research efforts operate. Before running an experiment, investigators should consider what they believe the chances are that they are testing a true, rather than a non-true, relationship. Speculated high R values then may be ascertained sometimes. As described above, whenever ethically acceptable, large studies with minimal bias should be performed on research findings that are considered relatively established to see how often they are indeed confirmed. I suspect several established "classic" studies will fail the test.

Nevertheless, most new discoveries will continue to stem from hypothesis-generating research with low or very low pre-study odds. We should acknowledge then that statistical significance testing in the report of a single study gives only a partial picture, without knowing how much testing has been done outside the

report and in the relevant field at large. Despite a large statistical literature for multiple testing corrections, usually it is impossible to decipher how much data dredging by the reporting authors or other research teams has preceded a reported research finding. Even if determining this were feasible, this would not inform us about the pre-study odds. Thus, it is unavoidable that one should make approximate assumptions on how many relationships are expected to be true among those probed across the relevant research fields and research designs. The wider field may yield some guidance for estimating this probability for the isolated research project. Experiences from biases detected in neighboring fields also would be useful to draw upon. Even though these assumptions would be considerably subjective, they would be useful in interpreting research claims and putting them in context.

John P. A. Ioannidis can be emailed at [email protected].

Length of the Beatles' Songs

Tatsuki Koyama

This article originally appeared in CHANCE 25.1.

"Length of the Beatles' Songs" shows the length and release date of each of the Beatles' 210 songs. Perhaps the figure should include more than 210 songs, since the list of Beatles songs on Wikipedia.org (visited September 5, 2011) gives 305 songs with the caveat, "many more…should be added." Two-hundred and ten is the number of songs released between the band's debut single and last studio album, give or take. Perhaps "Kansas City/Hey-Hey-Hey-Hey" should be counted as two songs. Also, maybe "Revolution" (single version) and "Revolution 1" (album version) should be counted as two songs, given that the versions differ by nearly one minute in length.

Song Length

Many criteria could be considered to illustrate The Beatles' transition from mop-headed pop idols to psychedelic cultural icons. I chose song length. Simple, straightforward, dry, statistical. To chronicle the Beatles' transition in any other way would probably require painstaking research and deep understanding of Sixties culture.

Beatles fans of all ages can probably agree that the band got weird as their career progressed. "Revolution 9" is weird, and perhaps so are "I Want You (She's So Heavy)" and "Her Majesty," but weirdness is subjective. Not everyone will agree which Beatles' songs are weird. By contrast, song length is objective and no cultural insight is required to make statements such as, "The Beatles' songs got longer as their career progressed." More importantly, song length data are readily available.

The data show something a bit more complex than "their songs got longer." The Beatles indeed had longer songs in the latter part of their career, but their shortest songs also came from that time. Dichotomization of a continuous variable is usually discouraged, but the distribution of the Beatles' song lengths, as visualized in the plot, begs for a dividing line right after "Revolver" (August 5, 1966) to split their career into early and late phases.

Most striking as seen in the plot is the consistency of the Beatles' song lengths in the early phase. The majority of the songs (110/128 = 86%) in this phase are between two and three minutes. This amazing consistency and complete lack of songs lasting more than four minutes (fairly common in the post-"Revolver" era) allow the main discussion of the plot to be presented in the empty space above the pre-"Revolver" data, enhancing the immediacy of the information describing the plot. Although obvious from a look, consistency of pre-"Revolver" song lengths is confirmed in the statistics, as shown in Table 1.

As well as providing the big picture of the song lengths, the plot also includes details (e.g., song

Table 1—Summary of Song Lengths on the Beatles' Studio Albums

                                Number                Song Length
Album               Release Date  of Songs   Mean   SD     Min    Median   Max
Please Please Me    1963/3/22     14         2'20   0'21   1'50   2'23     2'57
With the Beatles    1963/11/22    14         2'23   0'20   1'48   2'22     3'02
A Hard Day's Night  1964/6/26     13         2'21   0'16   1'49   2'20     2'46
Beatles for Sale    1964/12/4     14         2'21   0'18   1'46   2'27     2'55
Help!               1965/8/6      14         2'24   0'20   1'54   2'23     3'10
Rubber Soul         1965/12/3     14         2'33   0'18   2'05   2'29     3'22
Revolver            1966/8/5      14         2'29   0'20   2'01   2'44     3'01
Sgt. Pepper         1967/6/1      13         3'05   1'06   1'20   2'54     5'33
The Beatles         1968/11/22    30         3'06   1'16   0'52   3'21     8'13
Yellow Submarine    1969/1/17     6*         3'37   1'23   2'10   2'45     6'28
Abbey Road          1969/9/26     17         2'46   1'38   0'23   3'20     7'47
Let It Be           1970/5/8      12         2'54   1'05   0'41   2'33     4'01
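The per-album summaries in Table 1 are easy to recompute once track lengths are parsed; the only wrinkle is converting between seconds and the minutes'seconds notation used here. A sketch in Python with made-up track lengths (the real data are available from the author on request):

```python
from statistics import mean, median, stdev

def to_seconds(t):            # "2'20" -> 140
    minutes, seconds = t.split("'")
    return 60 * int(minutes) + int(seconds)

def to_mmss(x):               # 140 -> "2'20"
    s = int(round(x))
    return f"{s // 60}'{s % 60:02d}"

tracks = ["1'50", "2'10", "2'23", "2'30", "2'57"]   # hypothetical album
secs = [to_seconds(t) for t in tracks]

for label, value in [("Mean", mean(secs)), ("SD", stdev(secs)),
                     ("Min", min(secs)), ("Median", median(secs)),
                     ("Max", max(secs))]:
    print(label, to_mmss(value))
```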

Table 2—Songs Composed by Rare Combinations of The Beatles

Title             Release Date   Credits                              Length
What Goes On      1965/12/3      Lennon/McCartney/Starkey             2'50
Flying            1967/12/8      Lennon/McCartney/Harrison/Starkey    2'16
Don't Pass Me By  1968/11/22     Starkey                              3'50
Octopus's Garden  1969/9/26      Starkey                              2'51
Dig It            1970/5/8       Lennon/McCartney/Harrison/Starkey    0'49
Maggie Mae        1970/5/8       Lennon/McCartney/Harrison/Starkey    0'41

names) of interest. The unusually long and short songs are all labeled, as well as the songs mentioned in the discussion.

The composers are another piece of information depicted in the plot. Of 210 songs, 161 (77%) are credited to John Lennon-Paul McCartney and 20 to George Harrison. This leaves 23 cover songs and six compositional outliers (Table 2). "Maggie Mae," released on May 8, 1970, should probably be regarded as a cover song; however, all four Beatles were credited as composers of this public domain piece, robbing "Maggie" of the distinction of being the first cover song in nearly five years (since "Dizzy Miss Lizzy" and "Act Naturally," released in Help! on August 6, 1965) and the shortest song the Beatles ever covered. Crediting the Beatles for "Maggie Mae" maintains the truth of the claim: "All 23 cover songs came in the early phase."

Perhaps the Beatles are not the only band that shows these trends with regard to song lengths in this era. Superimposing similar data for another band or for the top 10 songs from each year might be of interest.

About the Author

Tatsuki Koyama is a research associate professor in the Division of Cancer Biostatistics, Department of Biostatistics, and an investigator at the Center for Quantitative Sciences at Vanderbilt University School of Medicine. His research has primarily focused on adaptive and flexible clinical trial design methodologies.

The Beatles released 210 songs, give or take—from their debut single, "Love Me Do" (released October 5, 1962) to their final studio album, Let It Be (May 8, 1970)—in just under eight years. This plot shows length versus release date of each song. All release dates are for the United Kingdom market with one exception ("Bad Boy").

Data and Notation

The closed circles indicate songs included on the Beatles' 12 studio albums, the titles and release dates of which appear in the abscissa. The Beatles also released 44 songs as singles, shown here with open circles. They released 16 songs twice each, both as singles and as part of an album; all these pairs are identical in both versions, except for "Revolution," "Get Back," and "Let It Be." "Yellow Submarine" was released three times—as a single and on Revolver (both on August 5, 1966) and on Yellow Submarine (January 17, 1969).

The studio albums and singles encompass all but 10 of the Beatles' 210 songs. Two of their 13 extended-play albums (EPs), Long Tall Sally and Magical Mystery Tour, consist of songs that were never released elsewhere. A total of nine songs from these two EPs also are shown in the plot. Magical Mystery Tour includes "I am the Walrus," which is not shown in the plot because it was previously released as a single. One song was never a part of any studio album, single, or EP; "Bad Boy" was released on December 10, 1966, on a compilation album, A Collection of Beatles Oldies, with 15 other songs, all of which were previously released. This track was available in the U.S. market more than a year earlier, on June 14, 1965, when it was included in Beatles VI, and it is shown with the release date for the U.S. market. This track was recorded on the same day as "Dizzy Miss Lizzy," which was subsequently included in the studio album Help!.

The black points are songs credited to John Lennon-Paul McCartney; red indicates George Harrison as songwriter; the purple originates with other combinations of the Beatles; and the blue points are cover songs.

To prevent songs with exactly the same length and release date from appearing on top of each other, data points are jittered horizontally by a small amount equivalent to four days when necessary. For example, both "And Your Bird Can Sing" and "For No One" from Revolver are two minutes and one second long. Also, in two instances, a single was released on the same day as was an album—once with Beatles for Sale and another with Revolver. These two singles also were moved to the right by four days so the data can be seen.

Observations

The vast majority of songs in the early phase of the Beatles' career (up to Revolver) are between two and three minutes in length, perhaps reflecting expectations for radio play at the time. The inclusion of a relatively large number of cover songs during this period also may have been intended to attract a wider audience. All 23 of their cover songs come from this early phase. The end of this phase is marked by their final concert, which took place on August 29, 1966, 24 days after the release of Revolver. After this concert, the Beatles started to spend much more time in the studio creating less-traditional and less radio-friendly songs. Songs lasting longer than three minutes became commonplace, although they also released many short songs during this time. Furthermore, Harrison's contribution was much more prominent during this later phase.

Note: The data and R codes are available upon request. Email the author at [email protected].
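The four-day jitter described above amounts to shifting repeat occurrences of an identical (release date, length) pair to the right. One way to sketch that rule in Python (the plotting itself, done in R by the author, is omitted):

```python
from datetime import date, timedelta

def jitter(points, offset=timedelta(days=4)):
    """Shift the 2nd, 3rd, ... occurrence of an identical
    (release_date, seconds) pair right by multiples of `offset`,
    so coincident points remain visible in the plot."""
    seen, out = {}, []
    for day, secs in points:
        n = seen.get((day, secs), 0)
        seen[(day, secs)] = n + 1
        out.append((day + n * offset, secs))
    return out

# "And Your Bird Can Sing" and "For No One" (Revolver) are both 2'01:
pts = [(date(1966, 8, 5), 121), (date(1966, 8, 5), 121)]
print(jitter(pts))  # the second point shifts to 1966-08-09
```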

Photo by Grace Forrest, Walk Free Foundation, www.by-grace.org.

Unlocking the Statistics of Slavery

Kevin Bales

This article originally appeared in CHANCE 30.3.

Statistically, the history of slavery in our world can be divided neatly into two. In the first half, from the very earliest of human writing and records, slaves made up a part of that record. From Sumerian cuneiform to Egyptian hieroglyphs to Greek and Latin scripts and the incorporation of the useful Arabic zero, slavery was a blatant and measurable part of human existence. The clay counting tablets of Mesopotamia, for example, recorded slaves amongst the cattle and grain, and while the papyrus records of the pharaohs rarely survive, the great stone carvings along the Nile enumerate slave captures and clearly assign ownership.

Consider the "Battlefield Palette" (see page 20), thought to originate around 5,200 years ago. Carved into a soft, sedimentary stone, it is considered an important link to the distant past. While most of the story it tells is through pictures, it is thought to be both the earliest depiction of a battle scene and include some of the first representations of the glyphs that, in time, would become the Egyptian writing system of hieroglyphs.

Two of these glyphs are important. One represents the standard or totem that denotes power and authority, and the other is the "man-prisoner," or "captive," glyph. While the meanings of these first written "words" are potent, the picture itself is perfectly clear enough. Note the bound men being marched away at the top left, the hands that control them emerging from the "standard glyph" of power and authority. Below, their slaughtered compatriots have been stripped naked and are being feasted upon by vultures, crows, and a lion. One bound captive has been killed and a bird is pecking out his eyes. Just above them to the right, another captive (seen only from the waist down) is being marched away, his hands tied

Battlefield Palette, thought to originate around 5,200 years ago. © Trustees of the British Museum.

behind his back. Driving him along is the only fully clothed figure in the stone.

When Rome grew into an empire, its economy running on slavery the way the United States today runs on oil, the counting, buying, selling, transferring, giving, and inheritance of slaves must have filled entire record halls. When David Eltis and David Richardson began their project to illustrate the entire trans-Atlantic slave trade over a 366-year period (1501–1867), they found surviving records covering four-fifths of all voyages made—34,934 deliveries in the trade that carried some 12.5 million slaves to the New World.

Possibly the last truly accurate measurement of slavery occurred in 1860, when the United States Census enumerated those held in legal bondage. In that year, the total number of slaves was 3,950,529, accounting for 13% of the U.S. population. A precise count of slaves was crucial, since the Constitution of the United States calculated how many representatives each state could send to Congress based on population. Although they could not vote and were essentially items of property, each slave was included in the population count as three-fifths of a person, thus greatly aggrandizing the voting power of each slave owner as well as assuring that numerically fewer Southerners could match, through their property, the representatives of the more-populous North.

The second half of the statistical history of slavery begins when slavery becomes illegal. From that time and through the rest of the 20th century, however, there has been no reliable measurement of the extent of slavery in any country, with the possible exception of the records of slave laborers kept by the Nazi regime during the Second World War (Allen, 2004). While legal slavery meant formal records were created, the ongoing criminalization of slavery (even when antislavery laws were rarely enforced) meant there were now no, or very few, hidden accounts of slavery. Notable limited exceptions include files kept by social service agencies on escaped slaves, or the very few legal records linked to the arrest of slaveholders.

At the beginning of the 21st century, just as interest and awareness in modern slavery and its supporting conduit activity of human trafficking was growing rapidly, no reliable information existed on the prevalence of slavery—but it is worth noting that there were widely circulating, but baseless, estimates ranging from no slaves in the world to a few million to a much-quoted total of 100 million.

Within this context, I carried out a systematic collection of information from 1994 until 1999 to construct a global estimate (Bales, 1999), with a revised version (Bales, 2002) and a detailed explanation of my methodology (Bales, 2004). This estimate put the number of slaves in the global population at 27 million. This was an estimate, as I made clear, built from secondary source information, processed by a team that assessed each source for validity as far as possible, with care taken to treat each source with skepticism and to record only the conservative end of any range of estimates.

From the beginning, I was highly critical of my data and very much aware of the shortcomings of the data. I noted, for example, that I was "potentially building upon bad estimates to construct worse ordinal or ranking estimates. Even worse…there was no way to know if this was the case or not." That being the situation, and with that and other provisos, I made my data freely available, leading to its use by the statistician and methodologist Robert Smith and others.

Also at the beginning of the 21st century, a number of other groups and individual scholars began attempting to measure slavery in local areas, nations, regions, and globally. In doing so, they quickly divided into two groups and then, in parallel, proceeded through four stages of methodological development.

These two groups were divided by their approach to data transparency, reproducibility, and replication. Some researchers, primarily social scientists in academic appointments and some nongovernmental organizations (NGOs), operated on the basis that it was important to make their data freely available in ways that would allow other researchers to test, replicate, and potentially reproduce their results in commensurate studies—thus adhering to one of the fundamental principles of the scientific method.

The other group, for a number of reasons, did not feel able to share their data freely. This was sometimes due to political sensitivities, or notions or requirements of proprietary interest in the data collection, concerns about the data itself, or the methods of collection or analysis. Government sources, in particular, were loath to make data public. So were commercial organizations whose business model was to use the freely available data to construct indices and synthetic reports that they then sold to clients, but which were not transparent about data origins.

It is worth noting that transparency, replication, and reproducibility are issues of increasing concern more broadly across the sciences. A recent article in Nature and a recent report on biomedical research both point to a growing unease over the lack of data sharing and replication. While there is general disquiet over this issue, there is clear consensus about its remedies: 1) openly sharing results and the underlying data with other scientists; 2) collaboration with other research groups, both formally and informally; 3) publicly publishing the detail of study protocols; and 4) reporting guidelines and checklists that help researchers meet certain criteria when publishing studies. At the time of writing this article, no consensus has arisen concerning the practice of data transparency in the field of the measurement of slavery prevalence, nor have reporting guidelines been agreed upon and set for slavery researchers.

Within this context, the first stage in the measurement of contemporary slavery, exemplified by my work, relied upon secondary sources, including governmental records, NGO and service provider tallies, and reports in the media; in short, any source that might shed light on the extent of slavery. Even when sources were systematically assessed for reliability, these estimates (Bales, 2004; ILO, 2005) could only be seen, at best, as an approximation of the global situation. One expansion of this method (Hidden Slaves) in the United States was an attempt to triangulate secondary sources with surveys of service providers and government and law enforcement records. While the estimates derived in this first methodological stage were not widely different from each other, it was impossible to ensure their comparability or validity.

The second stage was set in motion by the pioneering work of Pennington, Ball, Hampton, and Soulakova in 2009. This team introduced a series of questions concerning human trafficking into a random sample health survey of five Eastern European countries (Belarus, Ukraine, Moldova, Romania, and Bulgaria). Employing random sample surveys, they were able to build the first representative estimate of the proportion of each country's population that had been caught up in human trafficking.

(It is worth noting that the terms "modern slavery" and "human trafficking" are sometimes used interchangeably; trafficking is simply one of many processes by which a person might be brought into a state of enslavement. While the human trafficking process suffers from being defined in different ways in a number of legal instruments and operational definitions, it is normally understood to mean the recruitment and then movement of a person into a situation of enslavement and exploitation.)

The work of Pennington, et al., was critical to advancing the measurement of the prevalence of slavery for two reasons. Firstly, it demonstrated that, at least in some countries and circumstances, enslavement could be measured through random sample surveys of the full population. Secondly, by fixing valid data points for these five countries, it became possible to begin the process of estimating the range of modern slavery in countries by using these, and other emerging survey results, to extrapolate the prevalence of slavery in other countries (Datta and Bales).

In addressing the question of this range, it had become clear by 2009 that cases of slavery (although not measures of slavery prevalence) were being reported in virtually all countries with populations over 100,000 (Bales, 2004; UN-GIFT, 2009). For that reason, the low end of the global range of prevalence, for countries in which measurement was possible, could be assumed to be greater than zero. In the same year, a U.S. Agency for International Development (US-AID) and Pan American Development Foundation report included a random sample survey

of child restavek trafficking and slavery in Haiti. This form of the enslavement of children into domestic service and other types of exploitation had been widely investigated (Cadet and Skinner), in part because of its ubiquity in urban settings, but never estimated through surveys.

This US-AID survey estimated that 225,000 children were enslaved in Haitian cities, equaling 2.3% of the national population. This estimated proportion of the Haitian population was assumed to be in the upper range of the global distribution of slavery prevalence for two reasons. The first was that most investigators had noted the pervasive nature of this form of slavery as compared to slavery in other countries; the second was that, of the few existing representative sample measures of slavery by country, Haitian restavek slavery was, by far, the largest.

The culmination of this second stage came with an emerging sense of the range of prevalence across countries and an increase in the amount of data available from random sample surveys. In addition to data from the Pennington, et al., and Haiti surveys, random sample surveys of slavery were also identified in three more countries (Niger, Namibia, and the Democratic Republic of Congo; "Namibia Child Activities Survey" and Johnson, et al., 2010). The combination of these disparate surveys, and their use in building an extrapolation estimation process, generated the 2013 global estimate of 29.8 million slaves in the first edition of the Global Slavery Index.

The third stage in the estimation of the prevalence of slavery came with the introduction of systematic and comparable representative random samples in a number of countries. In late 2013, the Walk Free Foundation commissioned seven national surveys (Pakistan, Indonesia, Brazil, Nigeria, Ethiopia, Nepal, and Russia), using the Gallup International World Poll. These comparable surveys were rolled into the iterative extrapolation process that generated a global estimate of 35.8 million people in slavery worldwide, published in the 2014 Global Slavery Index.

The World Poll survey data are representative of 95 percent of the world's adult population. The World Poll uses face-to-face or telephone surveys, conducted through households (where a household is defined as any abode with its own cooking facilities, which could be anything from a standing stove in the kitchen to a small fire in the courtyard) in more than 160 countries and more than 140 languages. The target sample is the entire civilian, non-institutionalized population, aged 15 and older. With the exception of areas that are scarcely populated or present a threat to the safety of interviewers, samples are probability-based and nationally representative. The questionnaire is translated into the major languages of each country, and field staff conduct in-depth training and receive a standardized training manual. Quality control procedures ensure that correct samples are selected and the correct person is randomly selected in each household. A detailed description of the World Poll methodology is available online: http://bit.ly/2u9v6Iz.

This mixture of comparable representative surveys using the same format and wording, and the "found" surveys, each unique in design and sampling, was used to build an extrapolation process that also included a series of variables measuring a range of factors that might predict vulnerability or propensity to slavery within a country. In many ways, this introduction of an extrapolation process for estimating slavery in those countries without direct surveys, linked to a number of predictors of enslavement, was the platform for building the fourth stage of prevalence estimation.

It is important to contextualize this fourth stage, since all previous stages lead to this system of longitudinal and iterative testing of prevalence measures. Slavery estimation had moved from secondary-source "guesstimation" to comparable random sample surveys to an algorithmic process ensuring comparability and the potential for replication, reproducibility, and further "ground-truthing" research. Because this fourth stage of testing can continue to be elaborated over much iteration, it is unlikely a fifth stage will emerge in the near future.

Given, as well, that this technique can also be combined with Multiple Systems Estimation (MSE) (van Dijk and van der Heijden provide a full discussion of MSE in this issue) to generate prevalence measures for highly developed nations for which surveys are not appropriate, it is possible to imagine a global estimate in which most country estimates rest on a firm quantitative methodological foundation.

If there is a fly in this optimistic ointment, it is that while measurement issues are slowly being resolved, little progress has been made in arriving at a shared operational definition for the object of study: slavery.

The Definitional Challenge

It is worth noting that, for most of human history, slavery was both ubiquitous and undefined; slavery was so common that defining it was not necessary. Over time, laws did set out who might be enslaved or manumitted, but it was an activity so well understood that
it was rarely given a precise definition. In a number of historical contexts, detailed criteria set out who might be enslaved, such as the Slave Codes of the U.S. Deep South, Roman slave laws, or even Nazi Nuremberg—Reich Citizenship—laws, which allowed the separation within the population of people without rights. These, however, are not definitions in a framework, but tools designed by slaveholders to specify and control the enslaved and/or enslavable. Nor was slavery normally defined in the early treaties and laws that regulated and then abolished legal slavery in the 18th and 19th centuries.

For example, the 13th Amendment to the U.S. Constitution simply reads, "Neither slavery nor involuntary servitude…shall exist within the United States." It was only in the 20th century, when virtually all slavery was ostensibly illegal, that it was felt that the human activity known as "slavery" needed a specific definition. This perceived need was exacerbated in the early 1990s, when a mushrooming of human trafficking paralleled an equally growing traffic in arms and drugs across the same borders.

In response to this suddenly visible movement of trafficked people into developed countries, especially into commercial sexual exploitation, a number of groups and political actors pressed for new regulation. Some commentators describe a "moral panic" in this period, pushed by diverse groups. A key outcome of this sudden interest and energy was a number of new international conventions and national laws, all of which tended to define slavery or human trafficking differently.

This is not the place to review all these varying definitions, but it is worth noting, as an example of the mix of definitional frameworks, that some activities, such as forced or compelled marriage or organ trafficking, were defined as subsets within slavery and others were not. Still other new legal definitions defined slavery itself as a subset of another activity, such as human trafficking. The lack of agreement between these legal instruments has created confusion across jurisdictions and generated a lack of clarity when confronting activities that may or may not be considered within the wider category of slavery.

A second result is that courts have issued rulings that either set down divergent definitions or interpret the same definition very differently. Remarkably, international law also says that the prohibition of slavery is jus cogens—an internationally applicable peremptory norm from which no derogation is ever permitted. Thus, we find ourselves with a universally and comprehensively forbidden crime, but one that is defined in different, often even contradictory, ways.

These disparities in legal definitions create difficulties in developing an operational definition for another reason: The voices and views of those who have been enslaved have been excluded in their construction. After all, slavery is, first and foremost, a lived experience—not a legal definition, an analytical framework, or a philosophical construct. At the moment it is occurring, slavery is first the experience of an individual person, and second a relationship between at least two people: the slave and the slaveholder. Slavery also carries cultural, political, and social meanings; meanings that are important to understand if we are to grasp the context of slavery and the factors that might predict its occurrence.

Within these different dimensions of enslavement, the lived experience of slaves is of primary importance, not least because the way in which slavery is classified and defined, in law and in public opinion, determines who is eligible for relief and who is not; who may live with some measure of personal autonomy and who may die in bondage.

It has been necessary to discuss the legal definitions of slavery as a preamble to understanding the lack of a generally accepted operational definition of slavery because of controversy, and misunderstanding, within the larger anti-slavery field concerning how slavery is defined. Many actors in the larger academic, as well as the applied, anti-slavery movement have argued that, because slavery is now an illegal activity, legal definitions must be paramount. But legal definitions are written for a specific purpose: to guide the implementation of law; to make clear, within the legal framework, when a specific crime has been committed. That is not the aim of an operational definition.

An operational definition aims to identify, in a precise way, the nature and characteristics of an object of scientific research. It is fundamentally a definition that sets out clearly what is, and what is not, the subject of inquiry and measurement.

Attempts to use any of the widely disparate legal definitions to guide research into the social activity known as slavery have not been illuminating or successful—with one exception. Over a three-year period (2010–2012), a group of legal, social science, and other experts met to resolve the definitional confusion, asking if there were a definition that might both apply and be useful within the law and operationally to guide social science, and especially quantitative, research. The resulting consensus of this group was that the definition
of slavery available in the existing international legal framework that provided the clarity and usefulness was that in the 1926 Slavery Convention of the League of Nations: "Slavery is the status or condition of a person over whom any or all of the powers attaching to the right of ownership are exercised."

The committee of experts added explanatory guidelines to clarify the application of the 1926 definition. The aim was to elucidate the "powers attaching to the right of ownership" so the attributes of any instance of suspected enslavement might be compared to the criteria inherent to the 1926 Convention. To accomplish that, it is necessary to, firstly, locate the legal definition in the lived reality of enslavement, and, secondly, specify more clearly the attributes of ownership that apply to the law of property and make clear how these attributes apply to the situation of enslavement.

The core of this adaptable definition is the powers attaching to ownership. The most-central of these is the right to possess—according to Honoré, this is "the foundation on which the whole superstructure of ownership rests" (1961). Possession is demonstrated by control—normally, exclusive control. This is best demonstrated in what Hickey describes as the "maintenance of effective control" (2010), meaning exercising control over time, and likely to include other instances or indicators of ownership.

These other instances are the right to use, the right to manage, and the right to income—"use" being the right to enjoy the benefit of the possession; "manage," the right to make decisions about how a possession is used; and "income," the right to profits generated by a possession. In addition to these central rights of ownership is the right to capital, which refers to the right to dispose of the possession by transfer, consumption, or destruction.

These "instances of ownership"—control, use, management, and profit—may be regarded as the central rights of ownership. It is their presence and exercise that can be applied and tested within a situation, such as slavery, where actual legal possession is not permitted. Given the illegality of slavery in all countries, they provide a critical power of definition and identification of the crime of slavery—these "instances" can be treated as measurable indicators in an operational definition of slavery.

The other attributes of possession, as normally expressed, pertain primarily to ownership that is sanctioned by law and so are less useful in understanding modern forms of illegal enslavement. That is not to say, however, that modern slaveholders do not seek to exercise these "rights" when they can. These other attributes of possession include rights of security—protection against illegal appropriation of a possession; transmissibility—the right to transfer legal ownership; and two indicators of the permanence of possession: absence of term—the lack of a time restriction on ownership (an attribute in that slavery is a relationship of control that exists for an indeterminate period of time); and the residual character of ownership—a possession may be loaned or rented, but will return to its owner and never cease to be property.

The key product of the committee of experts was the Bellagio-Harvard Guidelines on the Legal Parameters of Slavery. This short document sets out how the definition of slavery used in the 1926 Convention is coherent and useful in both legal and social science contexts.

It is unlikely that there will be a consensus on a shared operational definition among researchers into slavery any time soon, but at the very least, a discussion and exploration of such potential operational definitions should occur. The arguments offered for the "exceptionalism" of certain types or methods of slavery tend to create studies and reports that are non-comparable; that is, they "compare apples and oranges." The fundamental result is a growing body of literature that is much less useful in addressing the crime of slavery than it might be.

If the definitional problem were not sufficiently challenging, it is exacerbated by the lack of transparency and reproducibility in research methods and data.

The Need for Transparency and Reproducibility

If the social sciences have achieved a basic set of methodological tools with which to measure slavery, and an operational definition that might guide and achieve comparable research on contemporary slavery, a serious challenge remains in a lack of data transparency, which makes the fundamental scientific requirement of reproducibility impossible. In many ways, it is surprising that such a lack of transparency exists, given the nature of the phenomenon being studied.

Slavery and human trafficking are serious crimes, with terrible repercussions on the lives of the enslaved. The deaths, diseases, injuries, and mental health impacts of slavery are well-known. Slavery is a threat to life, health, well-being, and the social stability of communities, as well as a known facilitator of conflict, rape, violence in many forms, and brutal treatment of children. Both the
immediate effects of slavery and its sequelae across not only generations but time are also well known.

Given those demonstrated and widely known facts, the ongoing lack of transparency and the lack of data sharing in the study of slavery are not just a threat to good science. They prevent comparable analyses that might reduce suffering and the extreme human cost of slavery. Shared data have the potential of leading to the amelioration and reduction of this horrific crime. For that reason, a quick review of the nature of why science operates through open dialogue and the sharing of data and results, and why that practice is critically needed today in the study of slavery, is necessary.

Within the sciences, including the social sciences, the internal political economy—the measurement of worth and meaning—is not financial. It is much closer to what anthropologists call a "gift economy." Spufford provided a wonderful explanation of the gift economy in the medical sciences: "In a gift economy, status is not determined by what you have, but by what you give away. The more generous you are, the more you are respected; and in turn, your generosity lays an obligation on other people to behave generously themselves, to try to match your generosity and so claim equal or greater status.…When scientists practice [their gift economy], the gift they give away is information" (2003).

While there are informal expectations within the academic gift economy, it is also rigorously and formally governed by the rules of scientific publishing. These rules include requirements that published articles must make data freely available for re-analysis, and that sources of data and ideas are clearly acknowledged and cited.

It is important to note that these rules do not hamper competition; in fact, they increase it and foster it, since giving everyone access to the same shared information and data doesn't just level the playing field; it opens the field to any and all comers. This competition can be harsh, energetic, even bruising, but that is also a reflection of the fact that the reward for competing successfully is nothing as mundane as money. It is a much more powerful motivator: respect.

Of course, if the only reason for transparency and reproducibility was to gain respect in a circular game of academic one-upmanship, there would be little point to observing such rules, but the highly productive scientific gift economy is only a foundation for a much-more-important and pragmatic activity.

Science is based upon the accretion of ideas and findings. Every scholar may believe their ideas and findings are important, but more widely, in society as a whole, certain ideas and findings are considered critically important and valuable in their power to transform or protect human life. Medical research is a clear example, and the hoarding of a new idea or data with the potential to save lives or reduce suffering would be seen as not just unacceptable, but shameful.

So, too, it must be argued, would be withholding ideas, findings, or data pertaining to a locus of suffering, a crime as monstrous as slavery. When businesses seek to monetize information about slavery as a condition of their business plans, they have to lock away ideas and data, since free data cannot be monetized. When non-governmental organizations seek to lock away and control data, for whatever reason, they place themselves in the same category of selfish negligence as such businesses, since ideas and data withheld cannot be used to solve pressing problems, reduce suffering or, even, free slaves.

In many areas of research with a direct impact on lives and well-being, shared systems for information and data exchange are common. The open and freely searchable European Bioinformatics Database, for example, hosts a whole series of separate specialist databases. One of these alone, the Malaria Data site, holds records of 371,255 compounds and 25,726 publications.

The systematic study of contemporary slavery is relatively recent, but the destructive potential of the object of study suggests that a system of information and data exchange is overdue. In the same way that the scientific study of slavery is hampered by definitional confusion, it is also held back by a failure to respect the rules of science. In some arcane areas of academic endeavor, that might not matter, but slavery—for obvious reasons—is not one of them.

This is why [the special] issue of CHANCE [was] not simply useful to scholars, but important in the wider sense. The articles in [that] issue seek to achieve two key goals. The first is to make clear the current state of play in the field of measuring slavery; the second is to demonstrate what can be achieved when researchers in this field operate by the shared rules of scientific endeavour. All of the authors are keenly aware that they are working in a new field; that they are, at times, setting out new ideas, procedures, and methods, and most of all, that to make progress, they must do so in a way that is transparent and open to critique and improvement.

The work presented in this special edition on developments in vulnerability modeling and how that might be used in an extrapolation
process to estimate the prevalence of slavery is both groundbreaking and a work in progress. The explanation of the use of methodologically sophisticated Gallup World Poll surveys to measure slavery prevalence better explores what happens when a trusted and refined tool is brought to bear on a hidden human activity.

Since much of the information available globally is the product of governments, it is crucial to assess the reliability of such data, and what tools might be used to resolve data integrity. The exploration of the innovative application of the technique of MSE to measuring the prevalence of slavery appears to offer a solution to the problem of estimating the extent of slavery in well-developed countries.

The final article [in the special issue] suggests not just the way forward, but the tools and practices that will be needed to move forward expeditiously; ideally into a world where the metrics of slavery are used to guide the eradication of slavery.

Further Reading

Academy of Medical Sciences. 2015. Reproducibility and reliability of biomedical research: improving research practice.

Allen, M.T. 2004. Hitler's Slave Lords: The Business of Forced Labour in Occupied Europe. Chapel Hill: University of North Carolina, The History Press.

Baker, M. 2016. Is there a reproducibility crisis? Nature 535:452.

Bales, K. 1999. Disposable People: New Slavery in the Global Economy. University of California Press.

Bales, K. 2002. The Social Psychology of Modern Slavery. Scientific American.

Bales, K. 2004. International Labor Standards: Quality of Information and Measures of Progress in Combating Forced Labor. Comparative Labor Law and Policy 24(2).

Cadet, J-R. 1998. Restavec: From Haitian Slave Child to Middle-Class American. Austin: University of Texas Press.

Datta, M.N., and Bales, K. 2013. Slavery in Europe: Part 1, Estimating the Dark Figure. Human Rights Quarterly 35.3.

Eltis, D., and Richardson, D. 2010. Atlas of the Transatlantic Slave Trade. New Haven: Yale University Press.

Hickey, R. 2010. Exploring the Conceptual Structure of the Definition. Paper presented at the Slavery as the Powers Attaching to the Right of Ownership symposium, Bellagio, Italy. Accessed October 20, 2016, at http://bit.ly/2vYDgZS.

Hidden Slaves: Forced Labor in the United States. 2004. National report to the International Labor Office for the Global Survey of Forced Labor, with the Human Rights Center. Berkeley: University of California Berkeley.

Honoré, A.M. 1961. Ownership. In A.G. Guest (ed.), Oxford Essays in Jurisprudence. Oxford, England: Oxford University Press.

International Labor Office. 2005. Report of the Director General: A global alliance against forced labor. Global Report under the Follow-up to the ILO Declaration on Fundamental Principles and Rights at Work. Geneva, Switzerland.

International Labor Organization. ILO Global Estimate of Forced Labor in 2012: Results and Methodology. Geneva, Switzerland: ILO.

Johnson, K., Scott, J., Rughita, B., Kisielewski, M., Asher, J., and Ong, R. 2010. Association of Sexual Violence and Human Rights Violations with Physical and Mental Health in Territories of the Eastern Democratic Republic of the Congo. Journal of the American Medical Association 304(5).

Ministry of Labour and Social Welfare, Directorate of Labour and Market Services. 2009. Namibia Child Activities Survey (2005), ISBN: 978-086976-787-0; Enquete Nationale sur le Travail des Enfants au Niger (ILO), accessed 23/09/14.

Pennington, J.R., Ball, W.A., Hampton, R.D., and Soulakova, J.N. 2009. The Cross-National Market in Human Beings. Journal of Macromarketing 29(2):119–134.

Pierre, Y-F., Smucker, G.R., and Tardieu, J-F. 2009. Lost childhoods in Haiti: Quantifying child trafficking, restavèks, and victims of violence. International Development and Pan American Development Report. http://bit.ly/1N6TLVv.

Skinner, E.B. 2008. A Crime So Monstrous. New York: Mainstream Publishing.

Smith, R.B. 2009. Global human development: accounting for its regional disparities. Quality and Quantity 43(1).

Spufford, F. 2003. The Backroom Boys: The Secret Return of the British Boffin. London: Faber & Faber.

United Nations Global Initiative to Fight Human Trafficking. 2009. Global Report on Trafficking in Persons. Vienna, Austria: U.N. Office on Drugs and Crime.

About the Author

Kevin Bales, PhD, CMG, FRSA, is the Professor of Contemporary Slavery at the University of Nottingham, research director of the Rights Lab, and lead author of the Global Slavery Index. He was co-founder of the NGO Free the Slaves. His book Disposable People: New Slavery in the Global Economy was published in 11 languages and named one of "100 World-Changing Discoveries" by the Association of British Universities. The film version won a Peabody and two Emmy awards. His newest book, Blood and Earth: Modern Slavery, Ecocide, and the Secret to Saving the World (2016), proves the deep link between modern slavery and climate change.
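MSE, mentioned above as the route to prevalence measures in highly developed nations, builds on capture-recapture logic: the overlap between independently compiled lists of identified victims indicates how many victims appear on no list at all. As a minimal sketch of that idea, here is the simplest two-list case with invented counts; real MSE applications use several lists and model the dependence between them:

```python
# Two-list capture-recapture, the simplest instance of the Multiple Systems
# Estimation (MSE) idea. All counts below are invented for illustration.

def two_list_estimate(n1, n2, overlap):
    """Classic Lincoln-Petersen estimator: if appearing on one list is
    independent of appearing on the other, the estimated total population
    is n1 * n2 / overlap."""
    if overlap == 0:
        raise ValueError("estimator undefined when the lists do not overlap")
    return n1 * n2 / overlap

# Hypothetical counts: 400 victims known to the police, 300 known to an
# NGO, and 60 individuals appearing on both lists.
total = two_list_estimate(400, 300, 60)
print(total)  # 2000.0; with 640 distinct listed victims, ~1,360 were never listed
```

In the full MSE setting discussed in this issue, three or more lists are cross-classified and a log-linear model is fitted to the overlap pattern, so that dependence between lists can be estimated rather than assumed away.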

Bond. James Bond.
A Statistical Look at Cinema's Most Famous Spy
Derek S. Young

In 1953, the literary world was introduced to the fictional British Secret Service agent James Bond, a.k.a. 007. The character was conceived by author Ian Fleming, who drew upon some of his experiences as a British Naval Intelligence officer during World War II. Fleming wrote 12 James Bond novels and nine short stories before he died from heart disease in 1964.

(This article originally appeared in CHANCE 27.2.)

Fleming's 007 novels achieved moderate success and critical acclaim; however, what catapulted James Bond into the pantheon of cultural icons was the introduction of the character portrayed by Sean Connery in 1962's "Dr. No." As of 2013, there have been 23 official James Bond films produced and released by Eon Productions, six actors have donned the tuxedo as 007, and numerous cultural influences can be attributed to the secret agent. The films have enjoyed enormous success and popularity spanning their 50-year tenure. When not adjusting for inflation, the Bond franchise is ranked number two on the list of most successful film franchises worldwide, behind the "Harry Potter" films and above the "Star Wars" films. In fact, it has been estimated that around 20% of the world's population has seen at least one James Bond film.

James Bond also has been the subject for academic discourse. For example, Christoph Lindner edited the 2010 book The James Bond Phenomenon: A Critical Reader, which is an essay collection providing theoretical perspectives on James Bond as a cultural figure. Daniel Savoye's 2013 book, The Signs of James Bond: Semiotic Explorations in the World of 007, provides a critical analysis of the formulaic elements that comprise James Bond films. There also have been university topics courses that explore themes and issues conveyed through the James Bond films and novels. Such courses have been offered at Missouri State University, Thiel College, and the University of Michigan.

In this article, we analyze the James Bond films from a statistical perspective and provide data visualizations of various aspects of the films. We build a multivariate regression model to dissect the box office totals (adjusting for inflation) and average ratings when considering other variables about the films. Using our model, we provide a prediction region for the box office gross and average rating score of the next James Bond film. We also provide a ranking of the actors who have played James Bond in an attempt to answer an often-debated question that even the most-devoted fans have surely struggled with: Who is the best Bond?

The Data

In addition to these 23 movies, there have been a few "unofficial" James Bond movies made. The unofficial films are the 1954 made-for-TV movie "Casino Royale," the 1967 comedic spoof "Casino Royale" starring David Niven, and the 1983 competing film "Never Say Never Again," which is a remake of "Thunderball" and stars Connery.

Table 1 shows some of the data for the movies we use in our analysis. The table includes the year each film was released, the name of the film, which actor portrayed James Bond, worldwide gross for the film (in $1,000s) adjusted for inflation, the budget for the film (in $1,000s) adjusted for inflation, and the average online ratings for the films. The original box office totals and budgets were retrieved from www.the-numbers.com, a website that maintains box office history and other statistics about movies. These values were adjusted to 2013 dollars using the U.S. Bureau of Labor Statistics' inflation calculator, which uses the Consumer Price Index and was accessed on April 3, 2013. The average user ratings were pulled from each movie's webpage on www.imdb.com and www.rottentomatoes.com, both of which were also accessed on April 3, 2013. These websites allow users to rate movies on a scale from 1 to 10. For the James Bond films, the user ratings between these two websites are highly correlated, with a Pearson correlation of +0.8868. Thus, for our analysis, we average the pairs of user ratings.

In addition to the variables presented in Table 1, we consider the following about each film for our analysis:

• Length of the theatrical release
• Number of "conquests" James Bond had
• Number of martinis Bond had
• Number of times "Bond. James Bond" was said
• Number of kills by Bond
• Number of kills by characters other than Bond
• Whether the theme song peaked within the top 100 on the UK Singles Chart and the U.S. Billboard Hot 100

The values for the number of kills by Bond and the other characters were obtained from data published on a data blog at the Guardian's website (www.guardian.co.uk). As noted earlier, we only include the "official" movies produced by Eon Productions.
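The rating treatment described above, checking that the two websites' user ratings agree and then averaging each pair, can be sketched as follows. The ratings here are invented placeholders, not the article's actual data:

```python
# Check agreement between two sets of user ratings with the Pearson
# correlation, then average each pair, mirroring the article's handling of
# the IMDb and Rotten Tomatoes scores. The ratings below are invented.
from math import sqrt

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

site_a = [7.5, 7.8, 8.1, 6.9, 6.6]   # hypothetical ratings from one site
site_b = [7.4, 7.7, 8.2, 7.0, 6.4]   # hypothetical ratings from the other

r = pearson(site_a, site_b)           # a high r justifies averaging the pairs
averaged = [(a + b) / 2 for a, b in zip(site_a, site_b)]
print(round(r, 4))
```

A high correlation (the article reports +0.8868 for the real data) indicates the two sites largely agree, so the pairwise average loses little information while smoothing site-specific quirks.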

Table 1—Official James Bond Films by Release Year

Year  Movie                            Bond Actor      Adjusted Worldwide  Adjusted Budget  Average
                                                       Gross (in $1,000)   (in $1,000)      Rating
1962  Dr. No                           Sean Connery      457,928              7,688          7.50
1963  From Russia with Love            Sean Connery      598,624             15,174          7.75
1964  Goldfinger                       Sean Connery      935,404             22,468          8.10
1965  Thunderball                      Sean Connery    1,040,693             66,333          6.90
1967  You Only Live Twice              Sean Connery      775,740             66,035          6.60
1969  On Her Majesty's Secret Service  George Lazenby    518,736             50,608          6.75
1971  Diamonds Are Forever             Sean Connery      664,969             41,274          6.50
1973  Live and Let Die                 Roger Moore       846,046             36,603          6.35
1974  The Man with the Golden Gun      Roger Moore       459,623             32,965          5.90
1977  The Spy Who Loved Me             Roger Moore       710,290             53,636          6.95
1979  Moonraker                        Roger Moore       672,514             99,134          5.95
1981  For Your Eyes Only               Roger Moore       498,812             71,514          6.55
1983  Octopussy                        Roger Moore       437,059             64,102          5.90
1985  A View to a Kill                 Roger Moore       329,322             64,730          5.45
1987  The Living Daylights             Timothy Dalton    390,758             81,749          6.50
1989  Licence to Kill                  Timothy Dalton    292,392             78,637          6.25
1995  Goldeneye                        Pierce Brosnan    542,985             91,404          7.05
1997  Tomorrow Never Dies              Pierce Brosnan    491,098            159,117          6.20
1999  The World Is Not Enough          Pierce Brosnan    504,091            188,130          6.00
2002  Die Another Day                  Pierce Brosnan    557,433            183,255          6.05
2006  Casino Royale                    Daniel Craig      686,784            117,465          7.85
2008  Quantum of Solace                Daniel Craig      638,035            248,014          6.40
2012  Skyfall                          Daniel Craig    1,120,980            202,240          8.00

Note: The worldwide gross and budget are both adjusted to 2013 dollars.
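The rating-combination step described in the Data section, computing the Pearson correlation between the two websites' user ratings and then averaging each pair, can be sketched as follows. The ratings below are illustrative placeholders, not the article's actual data.

```python
# Sketch of the rating-combination step: compute the Pearson
# correlation between two rating sources, then average each pair.
# The ratings below are illustrative placeholders, not the
# article's actual IMDb/Rotten Tomatoes data.
from math import sqrt

def pearson(x, y):
    """Sample Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

imdb = [7.5, 7.7, 8.1, 6.9, 6.6]
rotten = [7.4, 7.9, 8.0, 7.0, 6.5]

r = pearson(imdb, rotten)                          # strong positive correlation
combined = [(a + b) / 2 for a, b in zip(imdb, rotten)]
```

When the two sources agree this closely, averaging the pair is a reasonable way to get a single rating per film.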


Figure 1. Bubble plot showing the adjusted worldwide gross of James Bond films by the year they were released.

Exploratory Data Analysis

A timeline of the adjusted worldwide gross is given as a bubble plot in Figure 1, where the scale of the bubbles reflects the average user rating of the film. From this plot, we note a few interesting aspects about the James Bond films. First is that, with the exception of the lone George Lazenby film, the first film by each actor had higher adjusted worldwide gross sales than the last film by the previous actor. We can see that Roger Moore, whose tenure as James Bond was the longest with seven films, tended to decrease across time in both adjusted worldwide gross sales and average user ratings. The Pierce Brosnan films tended to be consistent in terms of the adjusted worldwide gross sales. Finally, we see that the most recent Daniel Craig film, "Skyfall," is currently the highest-grossing James Bond film and has the second-highest average user rating, just behind Connery's "Goldfinger."

We also provide a stacked area chart of the cumulative sales of the James Bond films in Figure 2. This chart is in terms of millions of U.S. dollars, with the sales adjusted to 2013 dollars. We see that the cumulative U.S. sales have increased consistently over time. However, the total world sales have increased at a greater rate relative to just the U.S. sales, especially given the success of "Skyfall." For example, as of 2013, "Skyfall" was the highest-grossing film of all time in the United Kingdom.

Next, we look at two vices associated with James Bond: martinis and conquests of female companions. Figure 3 is a bar chart showing the distribution of these two vices by each James Bond actor. Overall, the average number of conquests per Bond is similar among the actors. Moore has the highest number, but he also starred in the most films. The distribution of martinis by each actor is noticeably more variable. For example, Moore has the lowest average number of martinis, while Craig has both the highest total and average number of martinis. In fact, Craig's James Bond is the only one whose martini count exceeds his conquest count.

While James Bond is infamous for his vices, he is also known for his license to kill. Figure 4 is a bubble plot showing each James Bond actor's kill ratio (i.e., the average number of kills per film) versus his total kills in all films. The bubbles are proportional to the kill ratio by all other characters in the corresponding films. We see that the kill ratio by other characters always exceeds that of the James Bond actor, except for Brosnan. The biggest differential between kill ratios is with Connery, while Brosnan clearly had the highest kill ratio and total kills. Thus, Brosnan has the dubious distinction of being the deadliest Bond.

Figure 2. Stacked area chart of cumulative sales of the James Bond films.
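The quantities behind a chart like Figure 2 are simple running totals. The sketch below shows the arithmetic on the first three films' adjusted gross and budget figures from Table 1 (in $1,000), with net profit taken as the adjusted gross minus the adjusted budget, as defined later in the article.

```python
# Cumulative-sales arithmetic behind a stacked area chart like Figure 2.
# Gross and budget figures (in $1,000) for the first three films are
# taken from Table 1; net profit is gross minus budget.
gross  = [457_928, 598_624, 935_404]   # Dr. No, From Russia with Love, Goldfinger
budget = [7_688, 15_174, 22_468]

net_profit = [g - b for g, b in zip(gross, budget)]

def cumulative(values):
    """Running totals, as plotted over time in a stacked area chart."""
    out, total = [], 0
    for v in values:
        total += v
        out.append(total)
    return out

cum_gross = cumulative(gross)   # [457928, 1056552, 1991956]
```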

Figure 3. Bar chart showing total number of martinis and conquests by Bond actor. Dashed lines represent the mean for that actor.


Figure 4. Bubble plot of each Bond actor’s kill ratio (i.e., average kills per film) versus his total kills. The area of the bubbles is proportional to the kill ratio by others in the corresponding films.

Building a Statistical Model of the Bond Movies

The adjusted worldwide gross and average user ratings are variables that provide a good indication of each movie's success (see Figure 5). The Pearson correlation between these variables is +0.5659, which is a moderately high value and significantly different from 0 (p-value = 0.0049). Treating these variables as a bivariate response, we build a multivariate multiple linear regression model and explore the variables listed in the previous section as possible predictors in the model.

We construct the multivariate analysis of variance (MANOVA) table to assess the significance of each independent variable. We compute the following four multivariate test statistics: Wilks' lambda, Pillai's trace, the Hotelling-Lawley trace, and Roy's greatest root. Using a significance level of 0.05, we perform backwards elimination by removing the least-significant variable based on the MANOVA results. Using this approach, we end up with only the Bond actor and the adjusted budget as significant predictors of the bivariate response. In fact, our decisions based on the p-values for each predictor are consistent across the four test statistics. Since our decisions do not vary across the four tests, we are confident in the results.

Our final multivariate regression model treats the adjusted worldwide gross and average rating as the response vector with Bond actor and adjusted budget as predictors. We next analyze the residual vectors for each regression to assess the normality and constant variance assumptions. The Shapiro-Wilk test for normality applied to the residuals yields p-values of


Figure 5. Scatterplot of the adjusted sales versus average rating for the James Bond films. The 95% prediction ellipse for the next James Bond film starring Daniel Craig is overlaid.

0.2780 and 0.2352 for the adjusted worldwide gross and average user rating relationships, respectively. The corresponding results from the Levene test for constant variance yield p-values of 0.4533 and 0.4039. Therefore, we claim that the normal regression assumptions are appropriate for this model.

We also construct a 95% prediction ellipse for the next James Bond film. It has been confirmed that Craig is signed to star in two more James Bond films, with the next film slated for release in late 2015. Using our estimated multivariate regression model and assuming that the next film will have a budget around $250 million, we obtain an estimated average rating for the film of about 6.80, while the estimated worldwide gross of the film is around $865 million. The corresponding 95% prediction ellipse is overlaid on the data plotted in Figure 5.

Ranking the Bonds

Anyone who considers himself or herself even a casual Bond fan typically opines on who they think is the best James Bond. In this section, we present a simple ranking methodology as a way to answer this question. We consider the worldwide gross, net profit of the films (i.e., the adjusted worldwide gross minus the adjusted budget), the average user ratings of the films, and the number of films featuring each actor. For each variable, we rank the James Bond actors and then calculate the means of their rankings. The

Table 2—Rankings of the James Bond Actors by Different Variables

                Ranking by       Ranking by  Ranking by    Ranking by
                Average          Average     Average       Number of    Overall
Bond Actor      Worldwide Gross  Net Profit  Movie Rating  Movies Made  Ranking
Sean Connery    2                1           2             2            1
Daniel Craig    1                2           1             4            2
Roger Moore     3                3           6             1            3
Pierce Brosnan  4                5           5             3            4
George Lazenby  5                4           3             6            5
Timothy Dalton  6                6           4             5            6

Note: The last column is the ranking of the average of the four ranking variables.
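The mean-rank procedure behind Table 2 can be sketched as follows, using the four per-variable rankings copied from the table itself. Averaging each actor's rankings and ranking those means reproduces the last column.

```python
# Mean-rank procedure behind Table 2: average each actor's rankings
# across the four variables, then rank those means (1 = best).
ranks = {  # per-variable rankings copied from Table 2
    "Sean Connery":   [2, 1, 2, 2],
    "Daniel Craig":   [1, 2, 1, 4],
    "Roger Moore":    [3, 3, 6, 1],
    "Pierce Brosnan": [4, 5, 5, 3],
    "George Lazenby": [5, 4, 3, 6],
    "Timothy Dalton": [6, 6, 4, 5],
}

mean_rank = {actor: sum(r) / len(r) for actor, r in ranks.items()}

# Sort by mean rank; the lowest mean gets overall ranking 1.
overall = {actor: i + 1
           for i, (actor, _) in enumerate(sorted(mean_rank.items(),
                                                 key=lambda kv: kv[1]))}
```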

lowest mean is given a final ranking of 1, and the highest mean is given a ranking of 6. All these rankings are reported in Table 2.

The results show that Connery comes out as the best James Bond, according to our methodology. This is not unexpected, considering he defined the role and has some of the most-popular movies during his tenure as 007, like "From Russia with Love" and "Goldfinger." Craig is second, which again is not unexpected, considering he is the current James Bond and his three films have been very successful. Third is Moore, who starred in the most James Bond films of any actor. While we noted the decline in the adjusted worldwide gross of his films, their success is still quite good considering they were released during years that saw competition from box office hits like "E.T." and the original "Star Wars" films.

Ranked fourth by our methodology is Brosnan, whose debut film "Goldeneye" was very well received after the franchise took a six-year hiatus. Each of his subsequent movies saw success at the box office, but they are sometimes viewed as declining in quality, as demonstrated by the decreasing ratings in Table 1. Fifth is Lazenby, who only starred in one James Bond film, "On Her Majesty's Secret Service." Fortunately, this movie is often regarded by critics and fans as being one of the best in the Bond canon, as we see the character humanized by getting married, only to have his wife killed on their wedding day.

In sixth place is Timothy Dalton, who was actually asked by the original producer of the James Bond films, Cubby Broccoli, to succeed Connery. Dalton passed at the time because he felt he was "too young" for the role, only being in his late 20s at the time. While Dalton is a classically trained actor who spent time performing with the Royal Shakespeare

Company, he was unfortunate in that his films were made during a time when the Bond formula was seen as stale. In fact, he was slated to star in a third James Bond film, but the concept was shelved due to studio issues until Brosnan made his Bond debut in "Goldeneye."

Discussion

This study explored the James Bond film franchise through various data visualizations and by building a multivariate regression model that characterizes the worldwide box office receipts and average user ratings as a function of the actors and the films' budgets. From this model, we constructed a prediction region to characterize where the worldwide gross and user ratings will likely fall for the next James Bond film starring Craig. Finally, we averaged the rankings of certain variables for each James Bond actor to provide an overall ranking. Based on this ranking methodology, we determined that Connery is the "best" actor to have played 007.

While the model we presented is a way to characterize the films based on a few variables, the true success and endurance of the films is, perhaps, a topic for a more-philosophical discussion. It could be that we enjoy how the James Bond films depict travels to exotic locales and provide adventurous tales that romanticize espionage. It could be that while Bond is a flawed human being, at the core of his objective is to protect "Queen and Country," and we enjoy seeing him being the proverbial good that triumphs over evil. More personally, it could simply be that we relate to the era and style reflected in a particular Bond film that reminds us of a particular time in our own lives.

Regardless, the appeal of James Bond has spanned many generations and the films have endured through numerous global events. Even with so many uncertainties in an ever-changing world, we can rest assured that "James Bond will return…"

Further Reading

Cork, J., and Stutz, C. 2009. James Bond Encyclopedia. New York, USA: DK Publishing.

Desowitz, B. 2012. James Bond Unmasked. Maryland, USA: Spies Publishing.

Dodds, K. 2003. Licensed to stereotype: Geopolitics, James Bond, and the spectre of balkanism. Geopolitics 8(2):125–156.

Johnson, R.A., and Wichern, D.W. 2007. Applied Multivariate Statistical Analysis (6th ed.). New Jersey, USA: Prentice Hall.

Moore, R. 2012. Bond on Bond: Reflections on 50 Years of James Bond Movies. Connecticut, USA: Lyons Press.

Neuendorf, K.A., Gore, T.D., Dalessandro, A., Janstova, P., and Snyder-Suhy, S. 2010. Shaken and stirred: A content analysis of women's portrayals in James Bond films. Sex Roles 62(11/12):747–761.

About the Author

Derek Young is a research mathematical statistician at the U.S. Census Bureau, lecturer in statistics at Penn State University, and Accredited Professional Statistician™. His research interests include tolerance regions, mixture models, and statistical depth functions.

How We Know that the Earth is Warming

Peter Guttorp

(This article originally appeared in CHANCE 30.4.)

There is no doubt that global temperatures are increasing, and that human greenhouse gas emissions largely are to blame, but how do we go about measuring global temperature? It is not just a matter of reading an instrument.

In Figure 1, we see a variety of curves depicting annual global mean temperature. They are not the same, although they all show a strong increase after about 1980. Different groups, using different data and different techniques, have computed the different curves. It would be hoped that the curves would all be measurements of the annual global mean temperature, but global mean temperature is not something that can be measured directly using an instrument. On the other hand, it is the quantity most commonly used to indicate global warming.

Where do the numbers come from? We will go through some issues that are associated with determining surface temperature, and illustrate some of the uses of these temperatures.

Local Daily Mean

The basic measurements that go into the calculation of global mean temperature are readings of thermometers or other instruments determining temperature. For land stations, these instruments are typically kept in some kind of box in an open, flat space covered with grass (see Figure 2). The box keeps direct sunlight from hitting the instrument but allows wind to penetrate it.

Readings are done at different schedules in different countries. The modern instruments measure continuously, but the measurements are not always recorded. In the United States, daily maximum and minimum temperature are recorded, and their average is the daily mean temperature. In Sweden, three hourly readings throughout the day are combined with the minimum and the maximum to calculate the daily mean temperature. In Iceland, linear combinations of two readings in the morning and afternoon are used.

Modern instruments can compute the daily average automatically, but to compare to historical data, a specific averaging method has to be applied.

Local Annual Mean Temperature

Once you have a daily mean temperature, it is easy to compute an annual mean temperature: Sum all the daily means and divide by

VOL. 32.1, 2019 36 Figure 1. Five estimates of the annual global mean anomalies relative to 1981–2010: Black is from Berkeley Earth, red from the UK Met Office Hadley Center, purple from the Japanese Met Office, blue from the Goddard Institute for Space Science (GISS), and green from the National Oceanic and Atmospheric Administration (NOAA).

Figure 2. Thermometer and other instruments at Stockholm Observatory, where measurements have been made daily since 1756. The station has been moved short distances twice during this time. The box to the left is a Stevenson screen, and was used for the measurements until 2006. The pipe sticking up in the middle contains the modern measurement device that has been used since then. Photograph courtesy of Peter Guttorp.

the number of days in the year. What is often used instead of the annual mean is something called a mean anomaly: How much did the year deviate from the average over a reference period? This makes it easier to compare sites at different altitudes, for example. A station at a higher elevation always tends to be colder than one at a lower elevation, but anomalies allow us to see if both sites are colder than usual.

The largest collection of land station data, used in the Berkeley Earth global temperature series, has some 39,000 stations and a total of 1.6 billion temperature measurements.

Sea Surface Temperature

Since more than two-thirds of the surface of our planet is water, it is not enough to take temperature measurements on land to compute a global average. Ocean-faring ships have long kept daily logbooks, with measurements of wind, air temperature, and water temperature. The water temperature used to be taken in a bucket of seawater. Later, it would be measured at the cooling water intake for the motor. Of course, ships do not travel everywhere on the oceans and, therefore, there are fairly large areas of ocean where we have no sea surface temperature measurements from ships. In some of these areas, there are buoys that measure the temperature. Over the last several decades, there have been satellite measurements of sea surface temperature; for over a decade, floats that measure the temperature profile of the water have been dropped all over the oceans. The largest collection of ocean data, the ICOADS 3.0 data set, uses about 1.2 billion different records.

The main groups estimating global mean temperature:

• Hadley Center of the UK Met Office with the Climate Research Unit of the University of East Anglia, United Kingdom
• Goddard Institute for Space Science (part of NASA), USA
• National Centers for Environmental Information (part of NOAA), USA
• Japanese Met Office, Japan
• Berkeley Earth Project, USA

Combining All the Measurements

To combine the many measurements over land and oceans into an average global temperature requires estimating the temperature anomaly where there are no actual measurements, such as on a regular grid, and then averaging the estimates and measurements (if any) over the grid. Such estimation tools are derived in what is called spatial statistics, although atmospheric scientists have developed some methods on their own (through what they call objective analysis). In essence, a statistician would treat the problem as one of regression, with data that are spatially dependent. The process has to take into account the fact that data are on a globe and not in the plane. The average temperature anomaly for land and ocean can be computed separately, and the global mean temperature would then be the area-weighted average of the two means.

Uncertainty

There are several sources of uncertainty in the determination of global mean temperature. First of all, each measurement has error associated with it. Second, how to deal with missing areas of measurement causes uncertainty. The choice of measurement stations can also be a source of uncertainty, as can the homogenization of measurements, such as when stations are moved or measurement devices updated. There are other sources of uncertainty as well.

It is important to try to quantify the uncertainty in global mean temperature. Different groups approach this issue in different ways. Figure 3 uses the Hadley series uncertainties to compute a simultaneous Bonferroni-based 95% confidence band for global average temperature. The term simultaneous means that the confidence band covers all the true temperatures at the same time with 95% probability, as opposed to a pointwise confidence interval, which only covers the true temperature at a particular time point with 95% probability.

Let $E_i$ be the event that the true value at time $i$ is not covered by its (pointwise) confidence set. A simultaneous confidence band for $n$ normally distributed estimates can be obtained from the Bonferroni inequality

$$P\left(\bigcup_{i=1}^{n} E_i\right) \le \sum_{i=1}^{n} P(E_i).$$

In fact, we want the complement:

$$P\left(\bigcap_{i=1}^{n} E_i^c\right) \ge 1 - \sum_{i=1}^{n} P(E_i).$$

If we let each confidence set have level $1 - \alpha/n$, we see that the probability that all parameters (in our case, the global average temperature for each year) are covered by their respective intervals is at least $1 - n(\alpha/n) = 1 - \alpha$. The confidence band then is $\hat{t}_i \pm \Phi^{-1}(1 - \alpha/(2n))\,\mathrm{se}(\hat{t}_i)$, where $\hat{t}_i$ is the estimated global mean temperature for year $i$, $\mathrm{se}(\hat{t}_i)$ is the standard error of the estimate, and $\Phi^{-1}$ is the normal quantile function (inverse of the cdf).

Ranking

In January 2017, NOAA made the claim that the global mean temperature had set a record for the third straight year. This statement is not quite accurate: For the third straight year, the estimated global mean temperature had set a record. In fact, four of the five series had this feature, while the Berkeley series showed 2005 as warmer than 2014 and 2010 tied with 2014. Only two of the estimates (the Hadley series

Figure 3. Hadley series with red dashed line being the lower 95% simultaneous confidence bound on the 2016 temperature and blue dashed line the upper bound on the 2015 temperature.

Figure 4. 10 realizations (blue) of possible Hadley temperature series and the Hadley estimate of global mean tem- perature (black).

and the Berkeley series) provide uncertainty estimates.

Figure 3 shows the Hadley series with associated simultaneous 95% confidence bands. If the 2015 actual temperature (which we do not know) were at the high end of its confidence band (blue dashed line), and the 2016 was at the low end of its band (red dashed line), it is quite possible that 2015 could have been substantially warmer than 2016, but that 2016 clearly was warmer than any year before 1998.

How can we say something about the uncertainty in the rankings as opposed to the estimates? One way is to simulate repeated draws from the sampling distribution of the estimates. Since we are averaging a large number of measurements, many of which are nearly uncorrelated, a central limit theorem leads us to treat the estimates as normal, with mean equal to the actual estimate and standard deviation equal to the standard error of the estimate. Figure 4 shows 10 such realizations of the Hadley temperature series.

For each of the realizations, we can calculate the rank of 2016. The distribution of that rank tells us how likely 2016 is to be the warmest year on record: It is warmest in 58% of the simulations, while 2015 is warmest in 42%. In 10,000 simulations, 2016 was as low as the eighth-warmest in one of them.
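The ranking simulation described above can be sketched in a few lines: draw each year's temperature from a normal distribution centered at its estimate, then tabulate how often the most recent year comes out warmest. The anomaly estimates and standard error below are illustrative stand-ins, not the actual Hadley values.

```python
# Simulate draws from the sampling distribution of the annual
# estimates and tabulate how often 2016 ranks warmest.
# The anomaly estimates (degrees C) and standard error below are
# illustrative stand-ins, not the actual Hadley series values.
import random

random.seed(1)

estimates = {2014: 0.58, 2015: 0.76, 2016: 0.80}
std_err = 0.05

def simulate(n_sims=10_000):
    warmest_2016 = 0
    for _ in range(n_sims):
        draws = {yr: random.gauss(est, std_err)
                 for yr, est in estimates.items()}
        if max(draws, key=draws.get) == 2016:
            warmest_2016 += 1
    return warmest_2016 / n_sims

share = simulate()  # fraction of simulations in which 2016 is warmest
```

With real estimates and standard errors for the full series, the same loop yields the full distribution of 2016's rank.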

Figure 5. Global annual mean temperature anomalies from 32 CMIP5 models with historical simulations (gray), and the Hadley Center data series (black). Reference period is 1970–1999.

Figure 6. QQ-plots of historical climate model simulations against Hadley Center data for two 30-year periods. The gray lines are simultaneous 95% confidence bands, and the red lines indicate equal distributions.
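The QQ comparison underlying Figure 6, pairing the empirical quantiles of two samples and checking how far the pairs stray from the line y = x, can be sketched as follows. The data here are simulated for illustration; the article compares CMIP5 model output against the Hadley Center series, and uses simultaneous confidence bands rather than the crude gap check shown below.

```python
# QQ comparison of two samples: pair up empirical quantiles at common
# levels and check how far the pairs stray from the line y = x.
# Data are simulated for illustration only.
import random

random.seed(7)

model_runs = [random.gauss(0.0, 1.0) for _ in range(32 * 30)]
observations = [random.gauss(0.0, 1.0) for _ in range(30)]

def empirical_quantile(sample, p):
    """Crude empirical quantile: value at position p in the sorted sample."""
    s = sorted(sample)
    idx = min(int(p * len(s)), len(s) - 1)
    return s[idx]

# Pair the quantiles of one sample against the other at common levels.
levels = [i / 31 for i in range(1, 31)]
pairs = [(empirical_quantile(model_runs, p),
          empirical_quantile(observations, p)) for p in levels]

# If the two distributions agree, the pairs hug the line y = x.
max_gap = max(abs(x - y) for x, y in pairs)
```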

How about all three years—2014–2016—being record-breakers? That happened in 21% of the simulations, and in the actual Hadley estimates, of course.

Models and Data

Climate, from a statistical point of view, is the distribution of weather. Climate change means that this distribution is changing over time. The World Meteorological Organization recommends using 30 years to estimate climate. This definition indicates, for example, that it does not make sense to look at shorter stretches of data to try to assess questions such as "Is global warming slowing down?"

What is a Climate Model?

A climate model is a deterministic model describing the atmosphere, sometimes the oceans, and sometimes also the biosphere. It is based on a numerical solution of coupled partial differential equations on a grid. In fact, the equations for the atmosphere are essentially the same as for weather prediction, but the latter is an initial value problem (we use today's weather to forecast tomorrow's) of a chaotic system, while a climate model has to show long-term stability. Many processes, such as hurricanes or thunderstorms, are important in transferring heat between different layers in the model, but often take place at a scale that is at most similar to a grid square, and sometimes much smaller.

Different climate models deal with this subgrid variability differently and, as a consequence, the detailed outputs are different. CMIP5 is a large collection of model runs, using the same input variables (solar radiation, volcanic eruptions, greenhouse gas concentrations, etc.). These model outputs were used for the latest IPCC report in 2013. Figure 5 shows the global mean temperature anomalies (with respect to 1970–1999) with the corresponding Hadley Center series.

Comparing Distributions

Many tests have been developed to compare some aspects of distributions, such as means or medians. To compare two entire distributions, we can plot the quantiles of one against the other (called a quantile-quantile plot, or QQ-plot). An advantage of this plot is that if the distributions are the same, then the plot will be a straight line. Of course, we will be estimating the quantiles from data, so there will be uncertainty. Another advantage of the QQ-plot is being able to develop simultaneous confidence bands, enabling a simple test of equal distributions: Does the line y = x fit inside the confidence band?

It is not trivial to compare climate model output to data. Remember, the climate model represents the distribution of the data. The observations in Figure 5 are, therefore, not directly comparable to the model runs. Instead, we need to compare the distributions of model output and data, respectively.

Figure 6 compares these distributions using QQ-plots for two 30-year stretches. In both cases, the distribution of the data fits the distribution of the ensemble of model outputs quite well, in that the red y = x line falls inside the simultaneous 95% confidence bands. Since we have 32 × 30 observations of the models, and only 30 of the data, the empirical tails of the model distribution are much longer than the tails of the data, but the confidence band is quite wide in the tails, meaning that we are very uncertain there.

Thus, for these two time intervals and for the global mean temperature variable, the ensemble of CMIP5 models and the Hadley Center data seem to have the same distribution—they are describing the same climate.

Further Reading

Arguez, A., Karl, T.R., Squires, M.F., and Vose, R.S. 2013. Uncertainty in annual rankings from NOAA's global temperature time series. Geophysical Research Letters 40:5,965–5,969.

Doksum, K. 1974. Empirical probability plots and statistical inference for nonlinear models in the two-sample case. Annals of Statistics 2:267–277.

Guttorp, P. 2014. Statistics and climate. Annual Reviews of Statistics and its Applications 1:87–101.

Katz, R.W., Craigmile, P.F., Guttorp, P., Haran, M., Sanso, B., and Stein, M.L. 2013. Uncertainty analysis in climate change assessments. Nature Climate Change 3:769–771.

About the Author

Peter Guttorp is a professor at the Norwegian Computing Center in Oslo and professor emeritus at the University of Washington in Seattle. He has worked on stochastic models in a variety of scientific applications, such as hydrology, climatology, and hematology. He has published six books and about 200 scientific papers.

A Second Chance to Get Causal Inference Right: A Classification of Data Science Tasks

Miguel A. Hernán, John Hsu, and Brian Healy

For much of the recent history of science, learning from data was the academic realm of statistics,1,2 but in the early 20th century, the founders of modern statistics made a momentous decision about what could and could not be learned from data: They proclaimed that statistics could be applied to make causal inferences when using data from randomized experiments, but not when using nonexperimental (observational) data.3,4,5 This decision classified an entire class of scientific questions in the health and social sciences as not amenable to formal quantitative inference.

Not surprisingly, many scientists ignored the statisticians' decree and continued to use observational data to study the unintended harms of medical treatments, health effects of lifestyle activities, or social impact of educational policies. Unfortunately, these scientists' causal questions often were mismatched with their statistical training. Perplexing paradoxes arose; for example, the famous "Simpson's paradox" stemmed from a failure to recognize that the choice of data analysis depends on the causal structure of the problem.6 Mistakes occurred, as a generation of medical researchers and clinicians believed that postmenopausal hormone therapy reduced the risk of heart disease because of data analyses that deviated from basic causal considerations. Even today, confusions generated by a century-old refusal to tackle causal questions explicitly are widespread in scientific research.7

To bridge science and data analysis, a few rogue statisticians, epidemiologists, econometricians, and computer scientists developed formal methods to quantify causal effects from observational data. Initially, each discipline emphasized different types of causal questions, developed different terminologies, and preferred different data analysis techniques. By the beginning of the 21st century, while some conceptual discrepancies remained, a unified theory of quantitative causal inference had emerged.8,9

We now have a historic opportunity to redefine data analysis in such a way that it naturally accommodates a science-wide framework for causal inference from observational data. A recent influx of data analysts, many not formally trained in statistical theory, bring a fresh attitude that does not a priori exclude causal questions. This new wave of data analysts refer to themselves as data scientists and to their activities as data science, a term popularized by technology companies and embraced by academic institutions.

Data science, as an umbrella term for all types of data analysis, can tear down the barriers erected by traditional statistics; put data analysis at the service of all scientific questions, including causal ones; and prevent unnecessary inferential mistakes. We may miss our chance to successfully integrate data analysis into all scientific

1. Tukey, J.W. 1962. The future of data analysis. Annals of Mathematical Statistics 33:1–67.
2. Donoho, D. 2017. 50 years of data science. Journal of Computational and Graphical Statistics 26(4):745–66.
3. Pearl, J. 2009. Causality: Models, Reasoning, and Inference (2nd edition). New York: Cambridge University Press.
4. Fisher, R.A. 1925. Statistical Methods for Research Workers, 1st ed. Edinburgh: Oliver and Boyd.
5. Pearson, K. 1911. The Grammar of Science, 3rd ed. London: Adam and Charles Black.
6. Hernán, M.A., Clayton, D., and Keiding, N. 2011. The Simpson's paradox unraveled. International Journal of Epidemiology 40(3):780–5.
7. Hernán, M.A. 2018. The C-word: Scientific euphemisms do not improve causal inference from observational data (with discussion). American Journal of Public Health 108(5):616–9.
8. Hernán, M.A., and Robins, J.M. 2018 (forthcoming). Causal Inference. Boca Raton: Chapman & Hall/CRC.
9. Pearl, J. 2018. The Book of Why. New York: Basic Books.

VOL. 32.1, 2019 42

questions, though, if data science ends up being defined exclusively in terms of technical10 activities (management, processing, analysis, visualization…) without explicit consideration of the scientific tasks.

A Classification of Data Science Tasks

Data scientists often define their work as "gaining insights" or "extracting meaning" from data. These definitions are too vague to characterize the scientific uses of data science. Only by precisely classifying the "insights" and "meaning" that data can provide will we be able to think systematically about the types of data, assumptions, and analytics that are needed.

The scientific contributions of data science can be organized into three classes of tasks: description, prediction, and counterfactual prediction (see table for examples of research questions for each of these tasks).

Description is using data to provide a quantitative summary of certain features of the world. Descriptive tasks include, for example, computing the proportion of individuals with diabetes in a large healthcare database and representing social networks in a community. The analytics employed for description range from elementary calculations (a mean or a proportion) to sophisticated techniques such as unsupervised learning algorithms (cluster analysis) and clever data visualizations.

Prediction is using data to map some features of the world (the inputs) to other features of the world (the outputs). Prediction often starts with simple tasks (quantifying the association between albumin levels at admission and death within one week among patients in the intensive care unit) and then progresses to more-complex ones (using hundreds of variables measured at admission to predict which patients are more likely to die within one week). The analytics employed for prediction range from elementary calculations (a correlation coefficient or a risk difference) to sophisticated pattern recognition methods and supervised learning algorithms that can be used as classifiers (random forests, neural networks) or predict the joint distribution of multiple variables.

Counterfactual prediction is using data to predict certain features of the world as if the world had been different, which is required in causal inference applications. An example of causal inference is the estimation of the mortality rate that would have been observed if all individuals in a study population had received screening for colorectal cancer vs. if they had not received screening. The analytics employed for causal inference range from elementary calculations in randomized experiments with no loss to follow-up and perfect adherence (the difference in mortality rates between the screened and the unscreened) to complex implementations of g-methods in observational studies with treatment-confounder feedback11 (the plug-in g-formula).

Note that, contrary to some computer scientists' belief, "causal inference" and "reinforcement learning" are not synonyms. Reinforcement learning is a technique that, in some simple settings, leads to sound causal inference. However, reinforcement learning is insufficient for causal inference in complex settings (discussed below).

Statistical inference is often required for all three tasks. For example, one might want to add 95% confidence intervals for descriptive, predictive, or causal estimates involving samples of target populations.

As in most attempts at classification, the boundaries between the above categories are not always sharp. However, this trichotomy provides a useful starting point to discuss the data requirements, assumptions, and analytics necessary to successfully perform each task of data science. A similar taxonomy has traditionally been taught by data scientists from many disciplines, including epidemiology, biostatistics,12 economics,13 and political science.14 Some methodologists have referred to the causal inference task as "explanation,"15 but this is a somewhat-misleading term because causal effects may be quantified while remaining unexplained (randomized trials identify causal effects even if the causal mechanisms that explain them are unknown).

Sciences are primarily defined by their questions rather than by

10Cleveland, W. 2001. Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics. International Statistical Review 69(1):21–6.
11Robins, J.M. 1986. A new approach to causal inference in mortality studies with a sustained exposure period—Application to the healthy worker survivor effect. Mathematical Modelling 7:1,393–512 (1987 errata, Mathematical Modelling 14:917–21).
12Vittinghoff, E., Glidden, D.V., Shiboski, S.C., and McCulloch, C.E. 2012. Regression Methods in Biostatistics. New York: Springer.
13Mullainathan, S., and Spiess, J. 2017. Machine Learning: An Applied Econometric Approach. Journal of Economic Perspectives 31(2):87–106.
14Toshkov, D. 2016. Research Design in Political Science. London: Palgrave Macmillan.
15Shmueli, G. 2010. To explain or to predict? Statistical Science 25(3):289–310.

their tools: We define astrophysics as the discipline that learns the composition of the stars, not as the discipline that uses the spectroscope. Similarly, data science is the discipline that describes, predicts, and makes causal inferences (or, more generally, counterfactual predictions), not the discipline that uses machine learning algorithms or other technical tools. Of course data science certainly benefits from the development of tools for the acquisition, storage, integration, access, and processing of data, as well as from the development of scalable and parallelizable analytics. This data engineering powers the scientific tasks of data science.

Prediction vs. Causal Inference

Data science has excelled at commercial applications, such as shopping and movie recommendations, credit rating, stock trading algorithms, and advertisement placement. Some data scientists have transferred their skills to scientific research with biomedical applications such as Google's algorithm to diagnose diabetic retinopathy16 (after 54 ophthalmologists classified more than 120,000 images), Microsoft's algorithm to predict pancreatic cancer months before its usual diagnosis17 (using the online search histories of 3,000 users who were later diagnosed with cancer), and Facebook's algorithm to detect users who may be suicidal18 (based on posts and live videos).

All these applications of data science have one thing in common: They are predictive, not causal. They map inputs (an image of a human retina) to outputs (a diagnosis of retinopathy), but they do not consider how the world would look under different courses of action (whether the diagnosis would change if we operated on the retina).

Mapping observed inputs to observed outputs is a natural candidate for automated data analysis because this task only requires: 1) a large data set with inputs and outputs, 2) an algorithm that establishes a mapping between inputs and outputs, and 3) a metric to assess the performance of the mapping, often based on a gold standard.19 Once these three elements are in place, as in the retinopathy example, predictive tasks can be automated via data-driven analytics that evaluate and iteratively improve the mapping between inputs and outputs without human intervention.

More precisely, the component of prediction tasks that can be automated easily is the one that does not involve any expert knowledge. Prediction tasks require expert knowledge to specify the scientific question—what to input and what to output—and to identify/generate relevant data sources.20 (The extent of expert knowledge varies with different prediction tasks.21) However, no expert knowledge is required for prediction after candidate inputs and the outputs are specified and measured in the population of interest. At this point, a machine learning algorithm can take over the data analysis to deliver a mapping and quantify its performance. The resulting mapping may be opaque, as in many deep learning applications, but its ability to map the inputs to the outputs with a known accuracy in the studied population is not in question.

The role of expert knowledge is the key difference between prediction and causal inference tasks. Causal inference tasks require expert knowledge not only to specify the question (the causal effect of what treatment on what outcome) and identify/generate relevant data sources, but also to describe the causal structure of the system under study. Causal knowledge, usually in the form of unverifiable assumptions,22,23 is necessary to guide the data analysis and to provide a justification for endowing the resulting numerical estimates with a causal interpretation. In other words, the validity of causal inferences depends on structural knowledge, which is usually incomplete, to supplement the information in the data. As a consequence, no algorithm

16Gulshan, V., Peng, L., Coram, M., et al. 2016. Development and Validation of a Deep Learning Algorithm for Detection of Diabetic Retinopathy in Retinal Fundus Photographs. JAMA 316(22):2,402–10.
17Paparrizos, J., White, R.W., and Horvitz, E. 2016. Screening for Pancreatic Adenocarcinoma Using Signals From Web Search Logs: Feasibility Study and Results. Journal of Oncology Practice 12(8):737–44.
18Rosen, G. 2017. Getting Our Community Help in Real Time. https://newsroom.fb.com/news/2017/11/getting-our-community-help-in-real-time/ (accessed April 26, 2018).
19Brynjolfsson, E., and Mitchell, T. 2017. What can machine learning do? Workforce implications. Science 358(6370):1,530–4.
20Conway, D. 2010. The Data Science Venn Diagram. http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram (accessed October 9, 2018).
21Beam, A.L., and Kohane, I.S. 2018. Big Data and Machine Learning in Health Care. JAMA 319(13):1,317–8.
22Robins, J.M. 2001. Data, design, and background knowledge in etiologic inference. Epidemiology 11:313–20.
23Robins, J.M., and Greenland, S. 1986. The role of model selection in causal inference from nonexperimental data. American Journal of Epidemiology 123(3):392–402.

can quantify the accuracy of causal inferences from observational data. The following simplified example helps fix ideas about the different role of expert knowledge for prediction versus causal inference.

Example

Suppose we want to use a large health records database to predict infant mortality (the output) using clinical and lifestyle factors collected during pregnancy (the inputs). We have just applied our expert knowledge to decide what the output and candidate inputs are, and to select a particular database in the population of interest. The only requirement is that the potential inputs must precede the outputs temporally, regardless of the causal structure linking them. At this point of the process, our expert knowledge will not be needed any more: An algorithm can provide a mapping between inputs and outputs at least as good as any mapping we could propose and, in many cases, astoundingly better.

Now suppose we want to use the same health records database to determine the causal effect of maternal smoking during pregnancy on the risk of infant mortality. A key problem is confounding: Pregnant women who do and do not smoke differ in many characteristics (including alcohol consumption, diet, access to adequate prenatal care) that affect the risk of infant mortality. Therefore, a causal analysis must identify and adjust for those confounding factors which, by definition, are associated with both maternal smoking and infant mortality.

However, not all factors associated with maternal smoking and infant mortality are confounders that should be adjusted for. For example, birthweight is strongly associated with both maternal smoking and infant mortality, but adjustment for birthweight induces bias because birthweight is a risk factor that is itself causally affected by maternal smoking. In fact, adjustment for birthweight results in a bias often referred to as the "birthweight paradox": Low birthweight babies from mothers who smoked during pregnancy have a lower mortality than those from mothers who did not smoke during pregnancy.24

An algorithm devoid of causal expert knowledge will rely exclusively on the associations found in the data and is therefore at risk of selecting features, like birthweight, that increase bias. The "birthweight paradox" is indeed an example of how the use of automatic adjustment procedures may lead to an incorrect causal conclusion. In contrast, a human expert can readily identify many variables that, like birthweight, should not be adjusted for because of their position in the causal structure.

A human expert also may identify features that should be adjusted for, even if they are not available in the data, and propose sensitivity analyses25 to assess the reliability of causal inferences in the absence of those features. In contrast, an algorithm that ignores the causal structure will not issue an alert about the need to adjust for features that are not in the data.

Given the central role of (potentially fallible) expert causal knowledge in causal inference, it is not surprising that researchers look for procedures to alleviate the reliance of causal inferences on causal knowledge. Randomization is the best such procedure. When a treatment is randomly assigned, we can unbiasedly estimate the average causal effect of treatment assignment in the absence of detailed causal knowledge about the system under study. Randomized experiments are central in many areas of science where relatively simple causal questions are asked. Randomized experiments are also commonly used, often under the name A/B testing, to answer simple causal questions in commercial web applications. However, randomized designs are often infeasible, untimely, or unethical in the extremely complex systems studied by health and social scientists.26

A failure to grasp the different role of expert knowledge in prediction and causal inference is a common source of confusion in data science (the confusion is compounded by the fact that predictive analytic techniques, such as regression, can also be used for causal inference when combined with causal knowledge).

Both prediction and causal inference require expert knowledge to formulate the scientific question, but only causal inference requires causal expert knowledge to answer the question. As a result,

24Hernández-Díaz, S., Schisterman, E.F., and Hernán, M.A. 2006. The birth weight "paradox" uncovered? American Journal of Epidemiology 164(11):1,115–20.
25Robins, J.M., Rotnitzky, A., and Scharfstein, D.O. 1999. Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models. In Halloran, E., and Berry, D., eds. Statistical Methods in Epidemiology: The Environment and Clinical Trials. New York: Springer Verlag, 1–92.
26Hernán, M.A. 2015. Invited commentary: Agent-based models for causal inference—reweighting data and theory in epidemiology. American Journal of Epidemiology 181(2):103–5.

the accuracy of causal estimates cannot be assessed by using metrics computed from the data, even if the data were perfectly measured in the population of interest.

Implications for Decision-making

A goal of data science is to help people make better decisions. For example, in health settings, the goal is to help decision-makers—patients, clinicians, policy-makers, public health officers, regulators—decide among several possible strategies. Frequently, the ability of data science to improve decision-making is predicated on its success at prediction.

However, the premise that predictive algorithms will lead to better decisions is questionable. An algorithm that excels at using data about patients with heart failure to predict who will die within the next five years is agnostic about how to reduce mortality. For example, a prior hospitalization may be identified as a useful predictor of mortality, but nobody would suggest that we stop hospitalizing people to reduce mortality. Identifying patients with bad prognoses is very different from identifying the best course of action for preventing or treating a disease. Worse, predictive algorithms, when incorrectly used for causal inference, may lead to incorrect confounder adjustment and therefore conclude, for example, that maternal smoking appears to be beneficial for low birthweight babies.

Predictive algorithms inform us that decisions have to be made, but they cannot help us make the decisions. For example, a predictive algorithm that identifies patients with severe heart failure does not provide information about whether heart transplant is the best treatment option. In contrast, causal analyses are designed to help us make decisions because they tackle "what if" questions. A causal analysis will, for instance, compare the benefit-risk profile of heart transplant versus medical treatment in patients with certain severity of heart failure.

Interestingly, the distinction between prediction and causal inference (counterfactual prediction) becomes unnecessary for decision-making when the relevant expert knowledge can readily be encoded and incorporated into the algorithms. A purely predictive algorithm that learns to play Go can perfectly predict the counterfactual state of the game under different moves, and a predictive algorithm that learns to drive a car can accurately predict the counterfactual state of the car if, say, the brakes are not operated. Because these systems are governed by a set of known game rules (in the case of games like Go) or physical laws with some stochastic components (in the case of engineering applications like self-driving cars), an algorithm can eventually predict the behavior of the entire system under a hypothetical intervention.

Take the game of Go, which has been mastered by an algorithm "without human knowledge."27 When making a move, the algorithm has access to all information that matters: game rules, current board position, and future outcomes fully determined by the sequence of moves. Further, a reinforcement learning algorithm can collect an arbitrary amount of data by playing more games (conducting numerous experiments), which allows it to learn by trial and error. In this setting, a cleverly designed algorithm running on a powerful computer can spectacularly outperform humans—but this form of causal inference has, at this time in history, a restricted domain of applicability.

Many scientists work on complex systems with partly known and nondeterministic governing laws (the "rules of the game"), with uncertainty about whether all necessary data are available, and for which learning by trial and error—or even conducting a single experiment—is impossible. Even when the laws are known and the data available, the system may still be too chaotic for exact long-term prediction. For example, it was impossible to predict when and where the Chinese space station,28 while in orbit at an altitude of about 250 km, would fall to Earth.

Consider a causal question about the effect of different epoetin strategies on the mortality of patients with renal disease. We do not understand the causal structure by which molecular, cellular, individual, social, and environmental factors regulate the effect of epoetin dose on mortality risk. As a result, it is currently impossible to construct a predictive model based on electronic health records to reproduce the behavior of the system under a hypothetical intervention on an individual. Some widely publicized disappointments in causal applications of data science, like "Watson for Oncology,"

27Silver, D., Schrittwieser, J., Simonyan, K., et al. 2017. Mastering the game of Go without human knowledge. Nature 550(7676):354–9.
28The Data Team. 2018. An out-of-control Chinese space station will soon fall to Earth. The Economist, March 19, 2018.

Table 1—Examples of Tasks Conducted by Data Scientists Working with Electronic Health Records

Description
• Example of scientific question: How can women aged 60–80 years with stroke history be partitioned in classes defined by their characteristics?
• Data: eligibility criteria; features (symptoms, clinical parameters …)
• Examples of analytics: cluster analysis …

Prediction
• Example of scientific question: What is the probability of having a stroke next year for women with certain characteristics?
• Data: eligibility criteria; output (diagnosis of stroke over the next year); inputs (age, blood pressure, history of stroke, diabetes at baseline)
• Examples of analytics: regression; decision trees; random forests; support vector machines; neural networks …

Causal inference
• Example of scientific question: Will starting a statin reduce, on average, the risk of stroke in women with certain characteristics?
• Data: eligibility criteria; outcome (diagnosis of stroke over the next year); treatment (initiation of statins at baseline); confounders; effect modifiers (optional)
• Examples of analytics: regression; matching; inverse probability weighting; g-formula; g-estimation; instrumental variable estimation …
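To make one of the causal analytics in the table concrete, here is a minimal sketch of inverse probability weighting on simulated data. Everything here is invented for illustration: the variable names, the effect sizes, and the use of the true treatment probabilities (in practice a propensity score would be estimated from the data).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500_000

# Hypothetical data-generating process: the confounder x raises both the
# chance of treatment and the outcome, so the naive treated-vs-untreated
# difference is biased. The true treatment effect is 1.0.
x = rng.normal(size=n)                        # confounder
p_treat = 1 / (1 + np.exp(-x))                # true treatment probability
t = rng.random(n) < p_treat                   # treatment indicator
y = 1.0 * t + 2.0 * x + rng.normal(size=n)    # outcome

naive = y[t].mean() - y[~t].mean()

# Inverse probability weighting: weight each arm by 1 / P(arm | x),
# which removes the confounding by x.
ipw = np.mean(t * y / p_treat) - np.mean((~t) * y / (1 - p_treat))

print(f"naive difference: {naive:.2f}")   # well above the true effect
print(f"IPW estimate:     {ipw:.2f}")     # close to the true effect, 1.0
```

Note the design choice that makes this example trivial: because the propensity p_treat is known here, no causal structure needs to be learned; in an observational study it would have to be justified by expert causal knowledge, which is exactly the article's point.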

have arguably resulted from trying to predict a complex system that is still poorly understood and for which a sound model to combine expert causal knowledge with the available data is lacking.29

The striking contrast between the cautious attitude of most traditional data scientists (statisticians, epidemiologists, economists, political scientists…) and the "can do" attitude of many computer scientists, informaticians, and others seems to be, to a large extent, the consequence of the different complexity of the causal questions historically tackled by each of these groups. Epidemiologists and other data scientists working with extremely complex systems tend to focus on the relatively modest goal of designing observational analyses to answer narrow causal questions about the average causal effect of a variable (such as epoetin treatment), rather than try to explain the causal structure of the entire system or identify globally optimal decision-making strategies.

On the other hand, newcomers to data science have often focused on systems governed by known laws (like board games or self-driving cars), so it is not surprising that they have deemphasized the distinction between prediction and causal inference. Bringing this distinction to the forefront is, however, urgent as an increasing number of data scientists address the causal questions traditionally asked by health and social scientists. Sophisticated prediction algorithms may suffice to develop unbeatable Go software and, eventually, safe self-driving vehicles, but causal inferences in complex systems (say, the effects of clinical strategies to

29Ross, C., and Swetlitz, I. 2017. IBM pitched its Watson supercomputer as a revolution in cancer care. It's nowhere close. STAT. https://www.statnews.com/2017/09/05/watson-ibm-cancer/.
30Pearl, J. 2018. Theoretical Impediments to Machine Learning With Seven Sparks from the Causal Revolution. Technical Report R-475. http://ftp.cs.ucla.edu/pub/stat_ser/r475.pdf (accessed April 26, 2018).

treat a chronic disease) need to rely on data analysis methods equipped with causal knowledge.30

Processes and Implications for Teaching

The training of data scientists tends to emphasize mastering tools for data management and data analysis. While learning to use these tools will continue to play a central role, it is important that the technical training of data scientists makes it clear that the tools are at the service of distinct scientific tasks—description, prediction, and causal inference.

A training program in data science can, therefore, be organized explicitly in three components, each devoted to one of the three tasks of data science. Each component would describe how to articulate scientific questions, data requirements, threats to validity, data analysis techniques, and the role of expert knowledge (separately for description, prediction, and causal inference). This is the approach that we adopted to develop the curriculum of the Clinical Data Science core at the Harvard Medical School, which three cohorts of clinical investigators have now learned.

Our students first learn to differentiate between the three tasks of data science, then how to generate and analyze data for each task, as well as the differences between tasks. They learn that description and prediction may be affected by selection and measurement biases, but that only causal inference is affected by confounding. After learning predictive algorithms, teams of students compete against each other in a machine learning competition to develop the best predictive model (in an application of the Common Task Framework2).

By contrast, after learning causal inference techniques, students understand that a similar competition is not possible because their causal estimates cannot be ranked automatically. Teams with different subject-matter knowledge may produce different causal estimates, and there often is no objective way to determine which one is closest to the truth using the existing data.31

Then students learn to ask causal questions in terms of a contrast of interventions conducted over a fixed time period as would be specified in the protocol of a (possibly hypothetical) experiment, which is the target of inference. For example, to compare the mortality under various epoetin dosing strategies in patients with renal failure, students use subject-matter knowledge to 1) outline the design of the hypothetical randomized experiment that would estimate the causal effect of interest—the target trial, 2) identify an observational database with sufficient information to approximately emulate the target trial, and 3) emulate the target trial and therefore estimate the causal effect of interest using the observational database. We discuss why causal questions that cannot be translated into target experiments are not sufficiently well-defined,31 and why the accuracy of causal answers cannot be quantified using observational data. In parallel, the students also learn computer coding and the basics of statistical inference to deal with the uncertainty inherent to any data analyses involving description, prediction, or causal inference.

A data science curriculum along the three dimensions of description, prediction, and causal inference facilitates interdisciplinary integration. Learning from data requires paying attention to the different emphases, questions, and analytic methods developed over several decades in statistics, epidemiology, econometrics, computer science, and others. Data scientists without subject-matter knowledge cannot conduct causal analyses in isolation: They don't know how to articulate the questions (what the target experiment is) and they don't know how to answer them (how to emulate the target experiment).

Conclusion

Data science is a component of many sciences, including the health and social ones. Therefore, the tasks of data science are the tasks of those sciences—description, prediction, causal inference. A sometimes-overlooked point is that a successful data science requires not only good data and algorithms, but also domain knowledge (including causal knowledge) from its parent sciences.

The current rebirth of data science is an opportunity to rethink data analysis free of the historical constraints imposed by traditional statistics, which have left scientists ill-equipped to handle causal questions. While the clout of statistics in scientific training and publishing impeded the introduction of a unified formal framework for causal inference in data

31Hernán, M.A. 2019 (in press). Spherical cows in a vacuum: Data analysis competitions for causal inference. Statistical Science.

analysis, the coining of the term "data science" and the recent influx of "data scientists" interested in causal analyses provides a once-in-a-generation chance of integrating all scientific questions, including causal ones, in a principled data analysis framework. An integrated data science curriculum can present a coherent conceptual framework that fosters understanding and collaboration between data analysts and domain experts.

On the other hand, if the definitions of data science currently discussed in mainstream statistics take hold, causal inference from observational data will be once more marginalized, leaving health and social scientists on their own. The American Statistical Association statement on "The Role of Statistics in Data Science" (August 8, 2015) makes no reference to causal inference. A recent assessment of data science and statistics2 did not include the word "causal" (except when mentioning the title of the course "Experiments and Causal Inference"). Heavily influenced by statisticians, many medical editors actively suppress the term "causal" from their publications.32

A data science that embraces causal inference must (1) develop methods for the integration of sophisticated analytics with expert causal expertise, and (2) acknowledge that, unlike for prediction, the assessment of the validity of causal inferences cannot be exclusively data-driven because the validity of causal inferences also depends on the adequacy of expert causal knowledge. Causal directed acyclic graphs33,34 may play an important role in the development of analytic methods that integrate learning algorithms and subject-matter knowledge. These graphs can be used to represent different sets of causal structures that are compatible with existing causal knowledge and thus to explore the impact of causal uncertainty on the effect estimates.

Large amounts of data could make expert knowledge irrelevant for prediction and for relatively simple causal inferences involving games and some engineering applications, but expert causal knowledge is necessary to formulate and answer causal questions in more-complex systems. Affirming causal inference as a legitimate scientific pursuit is the first step in transforming data science into a reliable tool to guide decision-making.

Finally, the distinction between prediction and causal inference is also crucial to defining artificial intelligence (AI). Some data scientists argue that "the essence of intelligence is the ability to predict," and therefore that good predictive algorithms are a form of AI. From this point of view, large chunks of data science can be rebranded as AI (and that is exactly what the tech industry is doing). However, mapping observed inputs to observed outputs barely qualifies as intelligence. Rather, a hallmark of intelligence is the ability to predict counterfactually how the world would change under different actions by integrating expert knowledge and mapping algorithms. No AI will be worthy of the name without causal inference.

About the Authors

Miguel Hernán conducts research to learn what works for the treatment and prevention of cancer, cardiovascular disease, and HIV infection. With his collaborators, he designs analyses of healthcare databases, epidemiologic studies, and randomized trials. He teaches clinical data science at the Harvard Medical School, clinical epidemiology at the Harvard-MIT Division of Health Sciences and Technology, and causal inference methodology at the Harvard T.H. Chan School of Public Health, where he is the Kolokotrones Professor of Biostatistics and Epidemiology.

John Hsu is director of the Program for Clinical Economics and Policy Analysis in the Mongan Institute, Massachusetts General Hospital, and Harvard Medical School. He studies innovations in healthcare financing and delivery, and their effects on medical quality and efficiency. He primarily uses large automated and electronic health record data sets, often exploiting natural experiments from both clinical and behavioral economics perspectives.

Brian Healy is an assistant professor of neurology at the Harvard Medical School and an assistant professor in the Department of Biostatistics at the Harvard T.H. Chan School of Public Health. He is the primary biostatistician for the Partners MS Center at Brigham and Women's Hospital and a member of the Massachusetts General Hospital (MGH) Biostatistics Center. He teaches introductory statistics in several programs and co-directs the clinical data science sequence in the master of medical science and clinical investigation with Miguel Hernán.

32Ruich, P. 2017. The Use of Cause-and-Effect Language in the JAMA Network Journals. AMA Style Insider. http://amastyleinsider.com/2017/09/19/use-cause-effect-language-jama-network-journals/ (accessed May 25, 2018).
33Hernán, M.A., Hernández-Díaz, S., Werler, M.M., and Mitchell, A.A. 2002. Causal knowledge as a prerequisite for confounding evaluation: an application to birth defects epidemiology. American Journal of Epidemiology 155:176–84.
34Greenland, S., Pearl, J., and Robins, J.M. 1999. Causal diagrams for epidemiologic research. Epidemiology 10(1):37–48.
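The "birthweight paradox" described in the article can be reproduced in a short simulation. This is only a sketch: the variable names and every numerical parameter are invented for illustration, with u standing for an unmeasured harmful condition that causes both low birthweight and infant death.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500_000

# Hypothetical data-generating process (all parameters invented):
# smoking and the unmeasured condition u both cause low birthweight;
# smoking and u both raise mortality; birthweight has no direct effect.
u = rng.random(n) < 0.05                                     # unmeasured condition
smoke = rng.random(n) < 0.30                                 # maternal smoking
low_bw = rng.random(n) < (0.05 + 0.25 * smoke + 0.60 * u)    # low birthweight
death = rng.random(n) < (0.01 + 0.02 * smoke + 0.40 * u)     # infant death

def risk(mask):
    # Mortality risk within the subgroup selected by mask.
    return death[mask].mean()

# Unconditionally, smoking is harmful, as built into the simulation:
print(risk(smoke), risk(~smoke))

# Conditioning on low birthweight (adjusting for a variable affected by
# the treatment) reverses the comparison: among low-birthweight babies,
# the non-smokers' low weight is more often explained by u, so smoking
# spuriously appears protective.
print(risk(smoke & low_bw), risk(~smoke & low_bw))
```

An automatic feature-selection procedure that adjusted for low_bw because it predicts death well would report the second, reversed comparison, which is exactly the failure mode the article attributes to algorithms devoid of causal expert knowledge.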

Book Excerpt: Improving Your NCAA® Bracket with Statistics
Tom Adams

2019. CRC Press LLC/ASA. Reprinted with permission. Edited for house style.

Statistical Hypothesis Testing

Let's say you have been entering one bracket in a winner-take-all pool with 30 entries and a $10 entry fee for the last 10 years. You have won in one of those years. That puts you well into the black, since you have won $300 and all your fees add up to only $100. But does your success actually constitute convincing evidence that you have a superior bracket strategy?

If you were an average player, you would have a 1/30 chance of winning. (We are assuming that the pool has a rule for breaking ties, so we can ignore those.) Your chance of losing is 29/30. The outcomes from year to year are presumably independent events, so you can multiply their probabilities together. Your chance of losing in all 10 years is (29/30)^10, which comes out to about 0.71 or 71%. Your chance of winning one or more pools is about 29%, a little less than one chance in three, even if you have no superior skill. Winning only once is not that impressive. This or better would happen by chance about one-third of the time even if you are just an average player.

This reasoning constitutes a statistical hypothesis test. In the terminology of statistical hypothesis testing, we have failed to reject the hypothesis that you are an average player in this pool. This hypothesis (the hypothesis that you would like to reject) is called the null hypothesis.

You might want to claim that this test proves nothing—that you are indeed a superior player. This is called the alternative hypothesis. If you are right, then we failed to reject a false null hypothesis. This is called a type II error or a false negative. (The type I error or false positive is when we reject a true null hypothesis.)

Statistical Power

Perhaps you could silence the scoffers via another hypothesis test. You could do another 10-year test. If you played one entry for the next 10 years, maybe you would win more than once. If you won more than once in the next 10 years, would that allow us to reject the null hypothesis?

The math is a bit more complicated for this case, but the calculations can be performed using the binomial test. The binomial test is appropriate because the average player's wins can be modeled as a binomial probability distribution. The binomial distribution represents the probability of any number of successes for a fixed number of trials and a fixed probability of success. In our case, we have 10 trials (pools) that result in either a win or a loss, and the win probability of the average player is 1/30.

The binomial test shows that an average player in a pool with 30 entries would have a 0.042 probability of winning two or more times in 10 years. That would be pretty convincing evidence that you are a superior pool player by the usual standards of hypothesis testing. Anything below a significance level of 5% is typically considered to be a good basis for rejecting the null hypothesis, but it also depends on how credible the alternative hypothesis is in the first place.

If you are picking your teams based on which mascot is the strongest, or if you claim to have extrasensory perception, then people might still think you were just lucky. If you were using bracket improvement strategies based on the peer-reviewed research, then the notion that you are a superior pool player would already be plausible.

What are the chances that you will win twice in the next 10 years? We don't have enough information to calculate that.
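These one-bracket probabilities are easy to check numerically. The short script below is a sketch using only the Python standard library (the helper name binom_tail is my own); it reproduces the 71%, 29%, and 0.042 figures.

```python
from math import comb

def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p_win = 1 / 30   # average player's win probability in a 30-entry pool
n_years = 10

p_no_wins = (1 - p_win) ** n_years             # about 0.71
p_one_or_more = 1 - p_no_wins                  # about 0.29
p_two_or_more = binom_tail(n_years, p_win, 2)  # about 0.042

print(p_no_wins, p_one_or_more, p_two_or_more)
```

The same tail function serves for every binomial calculation in this chapter.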

VOL. 32.1, 2019 50

How much better are you than the average player? We have to know the answer to that question, or at least have an estimate. If you are using the strategies from the peer-reviewed literature, then it's reasonable to estimate that your win probability is around three times better than that of the average player. This quantity (three times better or 300%) is a measure of the effect size of a strategy. The inverse (1/3) is the relative risk of losing and is also called the risk ratio.

A good strategy protects against the risk of losing. Relative risk is one of many different metrics that may be used to quantify effect sizes. A particular effect-size metric is chosen based on its usefulness for communicating the magnitude of the effect or improving the experimental design. In our example, relative risk is useful for both purposes.

The average player wins 1 in 30 pools on average. If you are three times better, then you will win 1 in 10 pools or 10% of the time on average. Using the binomial test calculation, your chance of winning two or more pools over 10 years is about 0.26 or 26%. The 0.26 value is called the power of the statistical hypothesis test. The power of a statistical test is the probability that the test will result in the rejection of the null hypothesis.

This all means that you will wait 10 years for a new hypothesis test to complete, and there is only a 26% chance that it will reject the notion that you are just average. And that is even assuming that you are three times better than average!

The problem is that the proposed statistical test is weak. How can we make it stronger? One thing to try is to enter more brackets. Let's see what would happen if you entered 15 brackets in a pool with 30 opposing brackets. Of course, your pool might have a rule that limits the number of brackets that you can enter, or the pool czar might balk when you try to do this even if there is no explicit rule against it, but you could always do a "what if?" experiment where you prepare 15 brackets before betting closes and see how they would have done if you had been able to submit them.

The null hypothesis will now be that your 15 brackets are average overall. There will be 30 opposing entries and 15 of your entries, for a total of 45 entries. You will be betting 1/3 of the entries. If the null hypothesis is true, then you have a 1/3 chance of winning the pool.

According to the binomial test calculation, the chance that you will win in six or more of the 10 years is 7.6%, so that would not be good enough to reject the null hypothesis at the 5% significance level. You'd have to win seven or more times in 10 years. The probability of that happening under the null hypothesis is less than 2%.

What is the power of this test? If you win three times better than average, then you will win in all 10 years, but that is not a realistic assumption. You have been submitting the best single bracket that you could estimate, so those other 14 brackets have to be less likely to win. We need a new estimate for the effect size of betting 15 brackets instead of just one bracket.

Let's assume that 200% or two times better than average is a reasonable estimate of the effect size. In that scenario, 15 average brackets have a 1/3 chance of winning, so your superior set of 15 brackets will have a 2/3 chance of winning. If you have a 2/3 chance of winning, then your chance of winning the required seven or more pools is about 56%, according to the binomial test calculation. The power of the test is 0.56.

A statistical hypothesis test where you submit one bracket has a power of 26%—but if you submit 15 brackets, the power goes up to 56%. The 56% is not a slam dunk, but it's a lot better than 26%. Using some basic statistical tools and a few assumptions, we have come up with a better design for a statistical hypothesis test.

The rest of this chapter describes an actual statistical hypothesis test using 10 years of pool data from a standard scoring pool. This pool has four payouts. The binomial test is not applicable to this data because the binomial test is only applicable to binary outcomes. The outcome of a pool with four payouts is not just a win-lose proposition. The returns for someone playing 1/3 of the total entries for 10 years in a pool with four payouts are roughly normally distributed, so we can invoke the central limit theorem and use a statistical test that assumes normal distributions, but first we need to come up with an effective multiple-entry strategy.

Back-testing a Multiple-entry Strategy

Strategies for submitting multiple entries had been little discussed in the peer-reviewed literature. Breiter and Carlin referred to the "time-honored method of multiple pool entries." They suggested the heuristic of using multiple entries from different strategies for optimizing brackets for submission to pools with upset incentives. Clair and Letscher pointed out that submitting multiple entries could reduce dependence on a specific tournament outcome model, and Breiter and Carlin said that the variability in the tournament outcome and the bracket scores is large even if the tournament outcome model specifies each probability correctly. The precision of the estimated score of a single optimized bracket is low even if we assume that the accuracy is high.
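The power comparison between the one-bracket and 15-bracket designs can be checked with the same kind of binomial tail sum (a standard-library sketch; binom_tail is my own helper name).

```python
from math import comb

def binom_tail(n, p, k):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# One-bracket design: null win probability 1/30, assumed true probability 1/10.
power_one = binom_tail(10, 1/10, 2)      # about 0.26: the test's power

# Fifteen-bracket design: null win probability 1/3 per year.
p_six = binom_tail(10, 1/3, 6)           # about 0.076: six wins is not enough at 5%
p_seven = binom_tail(10, 1/3, 7)         # below 0.02: seven or more wins rejects
power_fifteen = binom_tail(10, 2/3, 7)   # about 0.56 under the assumed 2/3 win rate

print(power_one, p_six, p_seven, power_fifteen)
```

The jump from 0.26 to 0.56 is exactly the improvement in test design described in the text.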

In 2017, I presented research on multiple-entry strategies in standard scoring pools at the poster session of the New England Symposium on Statistics in Sports. The goal of this research was to test the quality of return on investment (ROI) estimates from a bracket strategy using a statistical hypothesis test.

A multiple-entry strategy was back-tested on 10 years of data from an office pool in Connecticut. The back-test involved predicting the best brackets to submit to these pools and then determining how these brackets would have performed if they had been submitted to these historical pools.

Historical pool data were available for the years 2008 through 2017. This was a standard scoring pool using exponential scoring (1, 2, 4, 8, 16, 32). The number of brackets submitted ranged from 178 to 241. The entry fee was $5. The pool awarded four prizes: 40%, 30%, 20%, and 10% of the pot. In the actual pools, ties were broken based on the best guess of the total score in the championship game—a guess had to be submitted with each bracket entry—but in this ROI analysis, the winnings were assumed to be divided in the case of a tie. The betting pattern in these pools had the characteristics noted in earlier research. The 1 seeds were over-bet. The hometown effect was evident: There was a home-state bias in favor of the Connecticut Huskies.

Calculating Bracket ROIs

Bracket ROIs were estimated using pool simulations. These used the probability model of a bracket pool described by Clair and Letscher. The inputs to modeling a pool are (1) N, the number of opponent entries, (2) a tournament outcome model, and (3) an opponent model. The tournament outcome is a random variable based on the tournament outcome model. Opponent brackets are random variables based on the opponent model. One pool simulation consists of one tournament outcome and N opponent brackets. The tournament outcome model was based on the Sagarin Predictor ratings. The opponent model was derived from either the Yahoo bracket contest pick distribution or the ESPN bracket contest pick distribution, using the heuristic from Clair and Letscher's research paper to estimate the head-to-head pick probabilities for each game. No adjustments to the pick distributions were made to address the home-state bias.

…this heuristic for deriving an opponent model is somewhat biased. Later review of the results indicated some bias: Simulated opponents were somewhat less likely to advance the most-popular champ pick than they should have been according to the pick advancement table for the nationwide bracket contest. A reanalysis performed using the mRchmadness method for deriving the opponent model indicated a slightly higher mean and a slightly higher variance in the ROIs, but overall, the results were not greatly improved by changing the opponent model.

The number of opponent entries used in these simulations was 200. In actual practice, the specific number of pool entries is not known in advance, so the number has to be estimated. Since pre-2008 pools had approximately 200 entries, 200 was a reasonable estimate. Bracket optimization results are relatively insensitive to errors in the estimated number of pool entries.

All the inputs used in this pool probability model are available before tournament tip-off during the period while it is still possible to submit brackets to a pool. The Sagarin ratings for the season are available on Selection Sunday. The Yahoo and ESPN pick distributions are available and updated multiple times before the Thursday tournament tipoff. The number of opponent entries is estimated from past pools.

The bracket ROIs were based on 10,000 pool simulations for each year. For each simulated pool, the tournament outcome and all opponent scores that were among the top four scores were saved. This is the minimum information needed to determine the ROI of any additional brackets that may be added to the pool. This information allows us to estimate the ROI of any candidate bracket that we are considering for submission to the pool.

The ROI of a candidate bracket in a simulated pool is calculated by computationally "submitting" it to the pool and determining how it would have fared in the pool competition. The bracket is scored based on the tournament outcome for that pool. If the bracket score is not among the top four scores in the pool, then its ROI is minus $5 (minus one betting unit). That is, you just lost your entry fee.

If the bracket score is above the highest opponent score, then the bracket ROI will be 40% of the pot minus the entry fee, or 201 × 0.4 − 1 = 79.4 betting units or $397. If the score falls among the top four opponent scores, then its ROI depends on the bracket score rank and whether it ties one or more scores. The estimated average ROI of a bracket is the bracket's average ROI over the 10,000 simulated pools.

The estimated average ROI or expected ROI of a set of candidate brackets submitted to the pool can be calculated in a similar manner. If a set of 10 brackets is submitted and none are in the money, then the ROI is −10 betting units. If one has the highest score and none of the others are in the top four scores, then the ROI is 210 × 0.4 − 10 = 74 betting units. The estimated expected ROI of the set is the average ROI for the 10,000 simulated pools.
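The bookkeeping just described can be sketched in a few lines. The score distribution, random seed, and simulation count below are invented stand-ins (the actual back-test drew opponent brackets from a pick-distribution model and scored them against a simulated tournament outcome); only the payout arithmetic follows the text.

```python
import random

random.seed(2019)

N_OPPONENTS = 200                  # estimated number of opponent entries
SHARES = (0.40, 0.30, 0.20, 0.10)  # the pool's four payouts
N_SIMS = 1000                      # the article used 10,000 simulations

def simulate_top_four():
    """One toy pool simulation: draw opponent scores and save only the top
    four, the minimum information needed to price any later bracket.
    (Normally distributed scores are a made-up stand-in for scoring
    opponent brackets against a simulated tournament outcome.)"""
    scores = [random.gauss(100, 15) for _ in range(N_OPPONENTS)]
    return sorted(scores, reverse=True)[:4]

saved_pools = [simulate_top_four() for _ in range(N_SIMS)]

def candidate_roi(score, top_four, entry_fee=1.0):
    """ROI in betting units of a single candidate bracket in one simulated
    pool, ignoring ties. Pot = one unit per entry (200 opponents + us)."""
    pot = (N_OPPONENTS + 1) * entry_fee
    rank = sum(s > score for s in top_four)  # opponents beating the candidate
    if rank >= len(SHARES):
        return -entry_fee                    # out of the money: fee is lost
    return pot * SHARES[rank] - entry_fee

def set_roi(ranks_won, n_mine, entry_fee=1.0):
    """ROI of a set of n_mine submitted brackets given the payout ranks they
    capture (0 = first place), again ignoring ties."""
    pot = (N_OPPONENTS + n_mine) * entry_fee
    return sum(pot * SHARES[r] for r in ranks_won) - n_mine * entry_fee

# A bracket beating every opponent reproduces 201 * 0.4 - 1 = 79.4 units.
print(candidate_roi(1_000, saved_pools[0]))
# Ten brackets with one first place: 210 * 0.4 - 10 = 74 units.
print(set_roi([0], 10))
# Ten brackets, none in the money: -10 units.
print(set_roi([], 10))

# The estimated average ROI is the mean over all saved simulations.
avg_roi = sum(candidate_roi(130, t4) for t4 in saved_pools) / N_SIMS
```

Saving only each pool's top four opponent scores, as in the article, keeps the stored state tiny while still allowing any later candidate bracket to be priced against every simulation.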

Optimizing ROI

Clair and Letscher optimized the ROI of a single candidate bracket using a hill-climbing algorithm. It is necessary to generalize this process to mutually optimize multiple brackets.

The multiple-entry strategy uses an iterative algorithm. First, a single bracket is optimized using a hill-climbing algorithm. A commitment is made to submit this optimized bracket to the pool. Then the next bracket is optimized using the hill-climbing algorithm—but this next bracket is optimized relative to the pool consisting of the opponent brackets plus our previously submitted optimized bracket(s). That is, each additional bracket is optimized using knowledge of our previous submittals.

The hill-climbing algorithm was used to optimize picks for only the last seven games of the tournament. These seven games are the four regional championship games, the two semi-final games, and the final championship game. The optimization was limited to the last seven games of the tournament to reduce the computation time for what is already a computation-intensive algorithm.

For each candidate bracket, the hill-climb always starts with the same bracket. This is the bracket where the higher seed is picked to win each game. Then, starting with the championship game, different teams are tried as picks for that game and the bracket's expected ROI is calculated for each team. The bracket ROI is maximized over a range of possible picks for that game. To reduce computation time, some of the lower-seeded teams are not considered, since these lower-seeded teams are very unlikely to advance deep into the tournament. After the champ pick is optimized, the runner-up is optimized in the same manner. Then the two other regional champs are optimized.

Finally, the process is repeated again, starting with the champ. The process is repeated until repetition provides no improvement in the ROI.

This hill-climbing algorithm is different from the one that Clair and Letscher employed ... The Clair and Letscher algorithm was not tested in this application, so it is unclear whether their algorithm would be more accurate or less computation-intensive.

Results

Figure 1 shows the results of entering 100 mutually optimized entries in the 10 bracket pools. The ROIs ranged from 209% to −100%. The whole pot was won in the best year and the pool had more than 200 entries, resulting in an ROI of 209%. In the worst year, there were no winnings, leading to a 100% loss of all entry fees.

Figure 1. The ROI of 100 mutually optimized pool entries in a pool with approximately 200 opponent entries.

The two years when the multiple-entry strategy failed to show a profit were those when the Connecticut Huskies won the tournament. Since the office was in Connecticut, there was a large home-state bias (see Figure 2) in the pick distribution. The home-state bias is similar to estimates by Brad Null. Even a multiple-entry strategy that involved betting approximately 1/3 of the pool entries could not overcome this bias when the home-state team won the tournament.
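The commit-one-bracket-at-a-time loop can be sketched as follows. Everything here is a toy stand-in: brackets are reduced to a single champ pick, and expected_roi is a hypothetical function of my own invention (the real one scores a bracket set against thousands of saved pool simulations). Only the control flow, optimizing each new bracket relative to the already-committed ones, follows the article.

```python
TEAMS = list(range(8))   # hypothetical 8-team field; a real bracket has 64 teams

def expected_roi(champ_picks):
    """Hypothetical stand-in for the expected ROI of a set of brackets,
    summarized by their champ picks. The toy version rewards strong champs
    (low index) and diversity of picks across the set."""
    strength = sum(1.0 / (1 + pick) for pick in champ_picks)
    diversity = len(set(champ_picks))
    return strength + diversity

def hill_climb(committed):
    """Optimize the next bracket relative to the pool of opponents plus our
    previously committed brackets (here: an exhaustive search over champ picks)."""
    best_pick, best_roi = None, float("-inf")
    for pick in TEAMS:
        roi = expected_roi(committed + [pick])
        if roi > best_roi:
            best_pick, best_roi = pick, roi
    return best_pick

committed = []
for _ in range(4):                  # commit four brackets, one at a time
    committed.append(hill_climb(committed))
print(committed)                    # diversified champ picks: [0, 1, 2, 3]
```

Because each new bracket is scored jointly with the committed ones, the loop naturally spreads champ picks instead of submitting four copies of the single best bracket.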

Adjusting the opponent model for home-state bias using Null's estimates is probably a good idea in general, but it would not have prevented these losses. Nor would such an adjustment help the strategy in any year when the home-state team wins, because the more over-bet teams tend to be advanced less in the optimized brackets. A multiple-entry strategy for a single pool does not mitigate the risk of the home-state team winning the tournament.

Over the 10 years of the back-test, the multiple-entry strategy had an average ROI of 53%. The standard deviation of returns was 79%. The p-value of a one-tailed t-test is 3.2%, indicating grounds for accepting the hypothesis that this large-scale multiple-entry strategy is superior to the strategy employed by the average player.

A Practical Multiple-entry Strategy

Your pool czar will probably not let you bet one-third of the brackets in your pool, but you can generate a few brackets for multiple entries by varying your champ pick. Pick from among the strongest teams according to a predictive rating system… Varying your champ pick is a good way to generate multiple entries for a bracket pool.

Figure 2. Home-state bias: the percentage by which the champ pick frequency of the local favorite exceeded its estimated value. The estimate was based on the pick advancement table from a nationwide bracket contest.

Further Reading

Adams, T. 2017. Modeling Multiple Entry Strategies in the NCAA Tournament Bracket Pool. Poster presentation at the New England Symposium on Statistics in Sports. http://nessis.org/nessis17/Adams.pdf. Accessed May 31, 2017.

Adams, T. 2019. Improving Your NCAA Bracket with Statistics. Boca Raton: CRC Press.

Breiter, D.J., and Carlin, B.P. 1997. How to Play Office Pools if You Must. CHANCE 10:5–11.

Clair, B., and Letscher, D. 2007. Optimal Strategies for Sports Betting Pools. Operations Research 55:1163–1177.

ESPN. 2018. Who Picked Whom. https://bit.ly/2FELGXd. Accessed April 13, 2018.

Null, B. 2016. Homer Bias Is Real and It Will Derail Your March Madness Bracket. https://bit.ly/2HlmjPQ. Accessed March 3, 2018.

Shayer, E., and Powers, S. 2017. mRchmadness: Numerical Tools for Filling Out an NCAA Basketball Tournament Bracket. https://bit.ly/2SVYHCo. Accessed March 5, 2019.

Yahoo. 2018. Pick Distribution. https://tournament.fantasysports.yahoo.com/t1/pickdistribution. Accessed April 13, 2018.

About the Author

Tom Adams is a systems analyst who has spent most of his career in scientific research support and the creator of Poologic, a website that has provided research-based advice on winning bracket pools since 2000. Poologic has been featured in Sports Illustrated, SmartMoney, and other publications. Adams has published in statistics and probability. He has a BSc in mathematics from the University of North Carolina.

[The Big Picture]
Nicole Lazar, Column Editor

Big Data and Privacy

When thinking about the topic of this issue's column, I batted various ideas around in my mind before settling on the theme of privacy in the age of Big Data. I knew I was on to something when, a few days later, my copy of Significance appeared in my mailbox and on the cover was the same theme!

I actually think about the questions of privacy and anonymity a fair bit, because I am by nature a private person, and don't like to share my information—online or otherwise—unnecessarily. At the same time, I do have an online presence as an academic. Going "off the grid" does not strike me as realistic or desirable. Where is the balance? I'll admit that at times, I might go too far in the direction of discretion.

An anecdote: When I was in my first assistant professor job, living in Pittsburgh, a local grocery chain had a loyalty card. When you bought groceries, you would swipe the card, and in return, the store would print coupons that matched your shopping patterns and would give you discounts from time to time, etc. The usual things.

CHANCE 55 your shopping patterns and would to monitor student attendance. That this is one of the perti- This article origi- give you discounts from time to Cameras in lecture halls took nent issues of the Big Data age nally appeared in time, etc. The usual things. pictures every minute, and the is also made evident by a recent CHANCE 30.4. One of my senior colleagues numbers of empty and full seats report— “Big Data and Privacy: A told me (with a gleam in his eye were counted, presumably by some Technological Perspective”— that only a true lover of data would pattern-recognition software. submitted in May 2014 to Presi- have at such a moment) that the Harvard’s Institutional Review dent Obama by the President’s grocery chain had approached Board ruled that the practice didn't Council of Advisors on Science and him about analyzing the masses of fall under the purview of human- Technology (PCAST). Concerns data that would be accumulating subject research. Faculty members about privacy and anonymity have each day. were apparently informed about long been coupled with changes As a statistician and lover of the study while students were kept in technology, dating back to the data myself, I could see why he was in the dark. Had they known, stu- establishment of the U.S. Postal intrigued. This was a rather early dents likely would have modified System in 1775, and on through example of data mining: looking their behavior by coming to class the inventions of the telegraph, for patterns in shoppers’ baskets more than usual. (Or, for those telephone, and portable cameras. of groceries. What’s the likelihood who like to mess with the data, The PCAST report surveyed the that a customer will buy bread, maybe skipping class when they ways in which data collection, milk, and eggs on the same trip? wouldn’t ordinarily!) analysis, and use have converged Do shoppers who buy beer tend The study was conducted in in the modern age to make these to buy chips as well? 
You can think spring 2014 and made public in issues particularly fraught. of your own examples. Again, this the fall. Students were outraged, In the past, it was easier—even is all pretty common now, but at feasible—for an individual to con- as were many faculty, claiming the time, for me at least, it seemed trol what personal information that Harvard had invaded their to open up new vistas of data was revealed in the public sphere. privacy and administrators should analysis—and it was very exciting. This is no longer the case. Public have been more forthcoming about That’s how the statistician in surveillance cameras and sensors the project. It didn’t help that this me reacted. The other me, the indi- record data without the individual followed another scandal at vidual, didn't want one of those being filmed or recorded necessar- Harvard, in which the administra- cards. I didn’t want the grocery ily even aware that it’s happening. tion had apparently been scanning chain to know what I bought, Additionally, social media such emails of particular individuals. when I bought it, what my patterns as Facebook and expose were. It was none of their business! Although school officials a great deal of personal informa- When my boyfriend, now husband, destroyed the classroom pic- tion—sometimes intentionally, left Pittsburgh, he gave me his loy- tures immediately, and were at sometimes not. alty card. 
And I’ll admit, my first pains to insist that individual The emergence of statistical thought was, “Oh, good, now I can students per se were not of techniques for the analysis of dis- mess with their data!” I was amused interest, many students, faculty parate data sources—a hallmark to think about my colleague or one members, and observers con- of Big Data—means that even if of our graduate students mining demned the practice, blasting the information disclosed in one this huge accumulation of data and school officials on comment boards database maintains the individual’s seeing a blip or switch in the record at the Chronicle of Higher Education privacy, when taken together with for that card. and the Harvard Crimson. other (also privacy-protecting) Of course, I knew that it didn’t Others, though, wondered what databases, identification of the really work that way, which brings the fuss was all about, pointing individual is possible, and maybe me around to the bigger point— out that data (including video and inevitable in some cases. There are namely, how much anonymity or photographic) are collected about legal precedents governing some of privacy do we really have in a situ- us all the time, so what’s one more these practices and concerns, but ation like this? data mining experiment? Some not all, and in any event, they’ll all The question was very much also argued that there is no rea- have to be revisited in light of the in the air as I write this column. sonable expectation to privacy in evolving technological landscape. A few days earlier, a story broke a college classroom, especially at Part of the challenge inherent in about Harvard University photo- a private institution that can, in the modern paradigm, as noted in graphing classrooms in an effort many respects, set its own rules. the PCAST report, is that a certain

For example, government is limited by the Fourth Amendment to the U.S. Constitution from searching private records in a home without probable cause. This seems straightforward until you think about what constitutes the boundaries of your home in a Wi-Fi, cloud-enabled world. The same technology that allows you to put family photographs and documents into, say, the cloud, to share with friends and relatives who live far away from you, also blurs the definition of what constitutes your home. Do the same legal protections extend there? If not, how might your privacy be compromised?

That may be a question for lawyers. How about one for statisticians? The PCAST report emphasizes that much of the benefit of Big Data comes from the data mining aspect—the ability to detect correlations that may be of interest or use. But it's important to keep in mind that these are just correlations, which means that the discovered relationships do not hold in all generality, and may not hold in particular for certain—and possibly vulnerable—sub-populations. Harm in a medical context, for example, could arise from mistaking correlations for hard-and-fast rules. And, as I've already noted, additional harm can come from merging data sets that were not meant to be merged, and subsequent analysis revealing facts about an individual that the person never wanted to make public.

The PCAST report also distinguishes between two types of data especially relevant for the privacy discussion: "born digital" and "born analog." Data that are born digital, as the name implies, are created specifically for use by a data-processing system. Examples include email and short message services, data entered into a computer or cell phone, location data from a GPS, "cookies" that track visits to websites, and metadata from phone calls. In all of these cases, there is intent, at some level, to provide the data to the monitoring system.

Privacy concerns here stem from two main sources: over-collection of data and fusion of data sources. Data over-collection occurs when the system, intentionally or otherwise, collects more information than what the user was aware of. An example from the PCAST report is an app called Brightest Flashlight Free, which millions of people have downloaded. Every time it was used, the app sent details of its location to the vendor. Why is this information necessary for a flashlight? Clearly, it isn't, and hence, this is an instance of over-collection.

The violation of privacy is obvious in that people who download a flashlight app for their phones are not expecting to reveal data about their locations every time they use it. To make it worse, the location information was apparently also sold to advertisers.

Data fusion is the term used when data from various sources are combined and analyzed together using data mining or other Big Data techniques. Even if each source on its own provides adequate privacy protection, they may draw a picture when multiple sources are analyzed together that is detailed enough to reveal specific and confidential information at the individual level. The challenge with data fusion is to devise analysis approaches that preserve the rights of individuals to control what they reveal about themselves to the world at large. This is an area of current research in statistics.

In contrast to data that are born digital, born analog data originate in the physical world and are created when some features of the physical world are detected by a sensor of some sort and converted to digital form. Examples include health statistics as collected by a Fitbit, imaging infrared video, cameras in video games that interpret hand or other gestures by the player, and microphones in meeting rooms.

When a camera takes pictures of a busy downtown neighborhood, it is bound to pick up some signal that is not of immediate interest. Hence born analog data will frequently contain more information than originally intended. Again, this can result in benefit as well as in harm, depending on how the data are used. Once the born analog data are digitized, they can be fused with born digital data, just like any other source, and analyzed together as well.

CHANCE 57

the born analog data are digitized, they can be fused with born digital data, just like any other source, and analyzed together as well.

Given this new state of reality, is it even possible to protect one's privacy? The answer to me seems to be a cautious yes. There is something of an arms race in which people work to crack protections, which in turn spurs the development of more sophisticated protective measures, which then pose a challenge to the first group, and on and on. Encryption is one way to make data more secure, although codes can be broken or stolen.

"Notice and consent" is most often used for the protection of privacy in commercial settings. We have all encountered these notices when we want to install new software or a new app and are supposed to read and agree to a long list of terms before proceeding with the download. Among those terms are items relating to how your data may be used—for example, sold to advertisers or other third parties. But how often do you read those notices through and through? Always? Sometimes? Never? For most people, it's probably the last option, and this is a problem because it shifts the responsibility from the entity collecting your data back to you. The practice is even more problematic because not everyone is a lawyer. Even if we take the time to read the terms, many of us don't understand the implications and can't object if some of the terms seem unreasonable. In addition, the provider can change the privacy terms down the line without informing you of that fact.

Since notice and consent is the most prevalent model, and given the problems I just described, the PCAST report recommends that major effort be put into devising more workable and effective protections at this level, in part by placing the responsibility for using your data according to your wishes back with the providers—those collecting the data. One proposed framework is to have a variety of privacy protocols that emphasize different utilities. You would sign on to such a protocol, which would be passed on to app stores and the like when you use their services. The protocol you chose would dictate how your data could be used, shared, disseminated, etc. The groups offering the protocols could vet new programs and apps to ensure that they meet the desired standards. In either case, consumers would no longer have to wade through screens of legalese (or ignore them altogether!); the burden would be on the other parties to the transaction.

There is another perspective I should add, and that is symbolic data analysis, about which I have written here in the past. Recently my colleague Lynne Billard gave a talk in our weekly colloquium series about this approach to data analysis. She mentioned that one of the ways in which a statistician may receive a data set for which symbolic methods are appropriate is through aggregation. Her motivating example was from a medical data set, where the insurance company almost certainly doesn't care about my visits to the doctor; it cares about visits of "people like me." This provides a type of built-in anonymity for these massive, automatically generated data sets if they are indeed analyzed in an aggregated fashion.

Of course, this raises some other interesting and pertinent statistical questions: What does it mean to be "like me" for the purposes of this type of analysis? Who defines these categories? How sensitive are the analysis and its results to the particular aggregation scheme? Presumably the questions that the insurance company is asking should guide the aggregation. In some cases, maybe my age is relevant; in others, it may be my height, weight, cholesterol levels, blood pressure, and so on. I can think of many dimensions in which someone may be "like me." Some will be important for certain types of analysis and not for others, but in any case, "I as me" will perhaps not be so critical.

These questions of data collection, analysis, and usage, and how they intersect with personal rights to (or desire for) anonymity and privacy, are not going to disappear. They are part of our cultural and technological landscape and are likely to expand over time. I think it's important for us to think about these issues and to decide where our personal lines are, how much we are comfortable sharing about ourselves, and what sort of presence we want to have in the digital and other modern realms.

To all of my students, former students, collaborators past and present, and old friends who try to connect via LinkedIn, Facebook, ResearchGate, and the like—when I don't respond, just know that I don't participate in any of those fora. It's my small way of keeping a corner of privacy in the world.

Further Reading

President's Council of Advisors on Science and Technology, May 2014. "Big Data and Privacy: A Technological Perspective."

About the Author

Nicole Lazar, who writes The Big Picture, earned her PhD from the University of Chicago. She is a professor in the Department of Statistics at the University of Georgia, and her research interests include the statistical analysis of neuroimaging data, empirical and other likelihood methods, and data visualization. She also is an associate editor of The American Statistician and The Annals of Applied Statistics and author of The Statistical Analysis of Functional MRI Data.
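The group-level aggregation in Billard's insurance example can be sketched in a few lines of Python. This is a hypothetical illustration: the records, fields, and ten-year age bands below are invented, not taken from any real data set.

```python
# Hypothetical toy records (age, systolic_bp, doctor_visits); the
# fields and the banding scheme are illustrative only.
records = [
    (34, 118, 2), (37, 121, 4), (41, 130, 1),
    (45, 135, 3), (52, 141, 6), (58, 139, 5),
]

def aggregate_by_age_band(rows, width=10):
    """Group records into age bands and report only group averages,
    so no individual's visit count appears in the output."""
    bands = {}
    for age, bp, visits in rows:
        bands.setdefault(age // width * width, []).append((bp, visits))
    avg = lambda xs: sum(xs) / len(xs)
    return {band: (avg([bp for bp, _ in grp]), avg([v for _, v in grp]))
            for band, grp in sorted(bands.items())}

print(aggregate_by_age_band(records))
# {30: (119.5, 3.0), 40: (132.5, 2.0), 50: (140.0, 5.5)}
```

Changing the `width` parameter changes the grouping, which makes concrete the column's question about how sensitive the results are to the particular aggregation scheme.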

[Book Reviews]
Christian Robert, Column Editor

Pragmatics of Uncertainty
Joseph Kadane
Hardcover: 351 pages
Year: 2017
Publisher: Chapman and Hall/CRC Press
ISBN-13: 978-1498719841

Preliminary versions of these reviews were posted on xianblog.wordpress.com.

I went through this book during the flight to the 2017 O'Bayes conference in Austin, somewhat paradoxically, since its message remains adamantly and unapologetically subjectivist. Pragmatics of Uncertainty is to be seen as a practical illustration of the Principles of Uncertainty Jay wrote in 2011 (and which I reviewed for CHANCE). Its avowed purpose is to allow the reader to check through Jay's applied work, whether or not he had "made good" on clearly setting out the motivations for his subjective Bayesian modeling.

(While I presume the use of the same sequence "P of U" in both books is mostly a coincidence, I started wondering what a third "P of U" volume could be called. Perils of Uncertainty? Peddlers of Uncertainty? The game is afoot!)

The structure of the book is a collection of 15 case studies undertaken by the author over the past 30 years of his career, covering paleontology, survey sampling, legal knowledge, physics, climate, and even medieval Norwegian history. Each chapter starts with a short introduction that often explains how he came by the problem (most often, as an interesting PhD student consulting project at CMU), the difficulties in the analysis, and what became of his co-authors. As noted by the author, the main bulk of each chapter is the reprint (in a unified style) of the paper, and most of these papers are actually and freely available online. The chapter always concludes with an epilogue (or post-mortem) that reconsiders (very briefly) what had been done, what could have been done, and whether the Bayesian perspective was useful for the problem (unsurprisingly so for the majority of the chapters!). There are also reading suggestions, as in the other P of U, and a few exercises. It thus comes as a complement for the classroom, if needed.

The purpose of the book is philosophical, to address, with specific examples, the question of whether Bayesian statistics is ready for prime time. Can it be used in a variety of applied settings to address real applied problems? The book is also a logical complement of the principles, helping to demonstrate how Jay himself applied his Bayesian principles to specific cases and how one can engage in the construction of a prior, a loss function, or a statistical model in identifiable parts that can then be criticized or reanalyzed.

I find browsing through this series of 15 problems fascinating and exhilarating, while I admire the dedication of Jay to every case he presents in the book. I also feel that this comes as a perfect complement to the earlier P of U, in that it makes referring to a complete application of a given principle most straightforward, with the problem being entirely described, analyzed, and—in most cases—solved within a given chapter. A few chapters have discussions, originally published in the Valencia meeting proceedings or in another journal with discussions.

We think however that Bayes factors are overemphasized. In the very special case in which there are only two possible "states of the world," Bayes factors are sufficient. However in the typical case in which there are many possible states of the world, Bayes factors are sufficient only when the decision-maker's loss has only two values. (p. 278)

(This is Jay's reply to a discussion by John Skilling where he regrets the absence of marginal likelihoods in the chapter, a reply to which I completely subscribe.)

While all papers have been reset in the book style and fonts, and hence are seamlessly integrated, I do wish the original graphs had been edited as well, since they do not always look pretty. Although it would have implied a massive effort, it also would have been great had each chapter and problem been re-analyzed or at least discussed by a fellow Bayesian to illustrate the impact of individual modeling sensibilities. This may, however, be a future project for a graduate class, assuming all data sets are available, which is unclear from the text.

10 Great Ideas About Chance
Persi Diaconis and Brian Skyrms
Hardcover: 272 pages
Year: 2018
Publisher: Princeton University Press
ISBN-13: 978-0691174167

This book is a well-articulated collection of essays on chance, written jointly by Persi Diaconis, an American mathematician of Greek descent and former professional magician who is the Mary V. Sunseri Professor of Statistics and Mathematics at Stanford University, and Brian Skyrms, an American professor of logic and philosophy of science. As a warning, let me point out that I was a reviewer of this book for PUP, which means this review was adapted from the report I sent to the publisher.

The historical introduction ("measurement") of this book is most interesting, especially its analogy of chance with length. I would have appreciated a connection earlier than Cardano, such as some of the Greek philosophers, even though I gladly discovered that Cardano was not responsible only for the closed-form solutions to the third-degree equation. I would also have liked to see more comments on the vexing issue of equiprobability: We all spend (if not waste) hours in the classroom explaining to (or arguing with) students why their solution is not correct, and they sometimes never get it! (We sometimes get it wrong as well.) Why is such a simple concept so hard to explain or express? In short, although this is nothing but a personal choice, I would have made the chapter more conceptual and less chronologically historical.

Coherence is again a question of consistent evaluations of a betting arrangement that can be implemented in alternative ways. (p. 46)

The second chapter, about Frank Ramsey, is definitely interesting, if only because it puts this "man of genius" back under the spotlight when he has been all but forgotten (at least in my circles) for joining probability and utility together, and for postulating that probability can be derived from expectations rather than the opposite. This is even though betting or gambling has a stigma in many cultures; at least, gambling for money, since most of our actions involve some degree of betting, but not in a rational or reasoned manner. (Of course, this is not a mathematical, but rather, a psychological objection.) Further, the justification through betting is somewhat tautological in that it assumes probabilities are true probabilities from the start. For instance, the Dutch book example on p. 39 produces a gain of .2 only if the probabilities are correct.

The force of accumulating evidence made it less and less plausible to hold that subjective probability is, in general, approximate psychology. (p. 55)

A chapter on "psychology" may come as a surprise, but I feel a posteriori that it is appropriate. Most of it is about the Allais paradox, with entries on Ellsberg's distinction between risk and uncertainty, with only the former being quantifiable by "objective" probabilities, and on Tversky's and Kahneman's distinction between heuristics and the framing effect; i.e., how propositions are expressed affects the choices of decision-makers.

This is Bernoulli's swindle. Try to make it precise and it falls apart. The conditional probabilities go in different directions, the desired intervals are of different quantities, and the desired probabilities are different probabilities. (p. 66)

The next chapter ("frequency") is about Bernoulli's Law of Large Numbers and the stabilization of frequencies, with von Mises making it the basis of his approach to probability, and Birkhoff's extension, which is capital for the development of stochastic processes and later for MCMC. I like the notions of "disreputable twin" (p. 63) and "Bernoulli's swindle," about the idea that "chance is frequency." The authors call the identification of probabilities as limits of frequencies Bernoulli's swindle, because it cannot handle zero-probability events. They offer a nice link with the testing fallacy of equating rejection of the null with acceptance of the alternative, and an interesting description of how Venn perceived the fallacy but could not overcome it: "If Venn's theory appears to be full of holes, it is to his credit that he saw them himself."

The description of von Mises's Kollektivs (and the welcome intervention of Abraham Wald) clarifies my previous and partial understanding of the notion, although I am unsure it is that clear for all readers. I also appreciate the connection with the very notion of randomness, which has not yet found, I fear, a satisfactory definition. This chapter asks more (interesting) questions than it answers (to those or others). But enough, this is a brilliant chapter!

…a random variable, the notion that Kac found mysterious in early expositions of probability theory. (p. 87)

Chapter 5 ("mathematics") is very important, from my perspective, in that it justifies the necessity to associate measure theory with probability if one wishes to explore further than urns and dice, or to entitle Kolmogorov to posit his axioms of probability and to properly define conditional probabilities as random variables (as some of my students fail to realize). I very much enjoyed reading this chapter, but it may prove difficult for readers with no or little background in measure theory (some advanced mathematical details have vanished from the published version). Still, this chapter constitutes a strong argument for preserving measure theory courses in graduate programs. As an aside, I find it amazing that mathematicians (including Kac!) had not at first realized the connection between measure theory and probability (p. 84), but it may not be so amazing given the difficulty many still have with the notion of conditional probability. (I would have liked to see some description of Borel's paradox when it is mentioned; p. 89.)

Nothing hangs on a flat prior (…) Nothing hangs on a unique quantification of ignorance. (p. 115)

The following chapter ("inverse inference") is about Thomas Bayes and his posthumous theorem, with an introduction setting the theorem at the center of the Hume-Price-Bayes triangle. (It is nice that the authors include a picture of the original version of the essay, as the initial title is much more explicit than the published version, as uncovered by Stephen Stigler.) This is a short coverage, in tune with the fact that Bayes only contributed a 20-plus-page paper to the field, and it is logically followed by a second part (formerly another chapter) about Pierre-Simon Laplace. Both parts focus on the selection of prior distributions on the probability of a binomial (coin tossing) distribution and emerge into a discussion of the position of statistics within or even outside mathematics. (The assertion that Fisher was the "Einstein of Statistics" on p. 120 may be disputed by many readers.)

So it is perfectly legitimate to use Bayes' mathematics even if we believe that chance does not exist. (p. 124)

The seventh chapter is about Bruno de Finetti and his astounding representation of exchangeable sequences as mixtures of independent and identically distributed (iid) sequences, defining an implicit prior on the side. While the description sticks to binary events, it quickly gets more advanced with the notion of partial and Markov exchangeability, including the most interesting connection between those exchangeabilities and sufficiency. (I would, however, disagree with the statement that "Bayes was the father of parametric Bayesian analysis" [p. 133], since this is extrapolating too much from The Essay.) This may prove nonsensical, but I would have welcomed an entry at the end of the chapter on cases where the exchangeability representation fails; for instance, those cases when there is no sufficiency structure to exploit in the model. A bonus to the chapter is a description of the Birkhoff ergodic theorem "as a generalization of de Finetti" (p. 134–136), plus half-a-dozen pages of appendices on more technical aspects of de Finetti's theorem.

We want random sequences to pass all tests of randomness, with tests being computationally implemented. (p. 151)

The eighth chapter ("algorithmic randomness") comes (again!) as a pleasant surprise since it centers on the character of Per Martin-Löf, who is little known in statistics circles. (The chapter starts with a picture of him with the iconic Oberwolfach sculpture in the background.) Martin-Löf's work concentrates on the notion of randomness, in a mathematical rather than probabilistic sense, and on its algorithmic consequences. I very much like the section on random generators, which includes a mention of our old friend RANDU, the 16-planes random generator!

This chapter connects with Chapter 4, since von Mises also attempted to define a random sequence, to the extent that it feels slightly repetitive (for instance, Jean Ville is mentioned in rather similar terms in both chapters). Martin-Löf's central notion is computability, which (kindly) forces us to visit Turing's machine and its role in the undecidability of some logical statements, along with Church's recursive functions (with a potential link, not exploited here, to the notion of probabilistic programming, where one language is actually named Church, after Alonzo Church).

I do not see how Martin-Löf's test for randomness can be implemented on a real machine, because the whole test requires going through the entire sequence: Since this notion connects with von Mises's Kollektivs, I am missing the point. Then Kolmogorov is brought back with his own notion of complexity (which is also Chaitin's and Solomonoff's). Overall, this is a pretty challenging chapter, both because of the notions it introduces and because I do not find it completely conclusive about the notion(s) of randomness. (A side remark about casino hustlers and their "exploitation" of weak random generators: I believe Jeff Rosenthal has a similar, if maybe simpler, story about Canadian lotteries in his book.)

Does quantum mechanics need a different notion of probability? We think not. (p. 180)

The penultimate chapter is about Boltzmann and the notion of "physical chance" or statistical physics, through a story that involves Zermelo and Poincaré,

and Gibbs, Maxwell, and the Ehrenfests. The discussion focuses on the definition of probability in a thermodynamic setting, opposing time frequencies to space frequencies, which requires ergodicity and hence Birkhoff (no surprise; this is about ergodicity), as well as von Neumann. This reaches a point where conjectures in the theory are still open. What I always (if, presumably, naïvely) find fascinating in this topic is that ergodicity operates without requiring randomness: Dynamical systems can enjoy the ergodic theorem while being completely deterministic.

This chapter also discusses quantum mechanics, the main tenet of which requires probability, which has to be defined from a frequency or a subjective perspective. The Bernoulli shift brings us back to random generators. The authors briefly mention the Einstein-Podolsky-Rosen paradox, which sounds more metaphysical than mathematical in my opinion, although they go into great detail to explain Bell's conclusion that quantum theory leads to a mathematical impossibility (but they lost me along the way)—except that we "are left with quantum probabilities" (p. 183). The chapter leaves me still uncertain about why statistical mechanics carries the label "statistical," since it does not seem to involve inference at all.

If you don't like calling these ignorance priors on the ground that they may be sharply peaked, call them non-dogmatic priors or skeptical priors, because these priors are quite in the spirit of ancient skepticism. (p. 199)

The last chapter ("induction") brings us back to Hume and the 18th century, where somehow "everything"—including statistics—started, except that Hume's strong skepticism makes induction seemingly impossible. (This is a perspective with which I agree to some extent, if not to Keynes's extreme version, when considering, for instance, financial time series as stationary. It is also a reason why I do not see the criticisms in The Black Swan as completely pertinent: They savage normality while accepting stationarity.) The chapter re-discusses Bayes's and Laplace's contributions to inference as well, challenging Hume's conclusion of the impossibility of finer analyses, even though the representation of ignorance is not unique (p. 199). The authors call again for de Finetti's representation theorem as bypassing the issue of whether or not there is such a thing as chance, and escaping inductive skepticism. (The section about Goodman's grue hypothesis is somewhat distracting, maybe because I have always found this argument to be quite artificial and based on a linguistic pun rather than a logical contradiction.)

The part about (Richard) Jeffrey is quite new to me, but ends up quite abruptly, similarly to the one about Popper and his exclusion of induction. In this chapter, I appreciated very much the section on skeptical priors and its analysis from a meta-probabilist perspective. There is no conclusion to the book, but to end with a chapter on induction seems quite appropriate. There is an appendix as a probability tutorial, mentioning Monte Carlo resolutions, plus notes on all chapters and a commented bibliography. Definitely recommended!

Independent Random Sampling Methods
Luca Martino, David Luengo, and Joaquín Míguez
Hardcover: 280 pages
Year: 2018
Publisher: Springer Verlag
ISBN-13: 978-3319726335

When I received a dedicated copy of this book from the first author, I was not aware of its existence. The three authors all work at Madrid universities, and I have read (and blogged about) several papers of theirs about (population) Monte Carlo simulation in recent years.

The book is definitely a pedagogical coverage of most algorithms used to simulate independent samples from a given distribution, which, of course, recoups some of the techniques exposed in more detail by Luc Devroye in his 1986 Non-Uniform Random Variate Generation bible. It includes a whole chapter on accept-reject methods, in particular with a section about Payne-Dagpunar's band rejection that I had not seen previously. There is another entire chapter on ratio-of-uniforms techniques, on which the three authors had proposed generalizations covered by the book, years before I attempted to go the same way, having completely forgotten reading their paper at the time…or the much-earlier 1991 paper by Jon Wakefield, Alan Gelfand, and Adrian Smith.

The book also covers the "vertical density representation," due to Troutt (1991), which consists of considering the distribution of the density p(.) of a random variable X as a random variable, p(X). I remember pondering about this alternative to the cdf

transform and giving up on it, since the outcome has a distribution depending on p, even when the density is monotonous. However, after reading the section and related papers, I am not certain that this is particularly appealing.

Given its title, the book obviously contains very little about MCMC. The exception is a last and final chapter that covers adaptive independent Metropolis-Hastings algorithms, in connection with some of the authors' recent work, such as multiple-try Metropolis. This relates to the (unidimensional) ARMS "ancestor" of adaptive MCMC methods.

All in all, and with the bias induced by my working in the same area, I find the book quite a nice entry on the topic. It can be put to use directly in a Monte Carlo course at both undergraduate and graduate levels to avoid going into Markov chains. The book is certainly less likely to scare students away than the comprehensive Non-Uniform Random Variate Generation reference by Devroye (or a certain book by yours truly with a pumpkin-orange cover). On the contrary, it may well induce some of them to pursue a research career in this domain.

Computational Methods for Numerical Analysis with R
James Howard
Hardcover: 277 pages
Year: 2017
Publisher: CRC Press
ISBN-13: 978-1498723633

This book consists of a traditional introduction to numerical analysis with backup from R codes and packages. The early chapters set the scenery, from basics on R to notions of numerical errors, before moving to linear algebra, interpolation, optimization, integration, differentiation, and ODEs. It comes with a package, cmna, that reproduces its algorithms and testing. While I do not find much originality in the book, given its adherence to simple resolutions of these topics, I could nonetheless use it for an elementary course in our first-year classes, with maybe the exception of the linear algebra chapter, which I did not find very helpful.

…you can have a solution fast, cheap, or correct, provided you only pick two. (p. 27)

The (minor) issue I have with the book—which a potential mathematically keen student could face as well—is that there is little in the way of justifying a particular approach to a given numerical problem (as opposed to others) and in characterizing the limitations and failures of the presented methods (although this happens from time to time, such as in gradient descent, p. 191). (Basking in my Gallic "mal-être," I am prone to over-criticize methods during classes, to the [increased] despair of my students, but I also feel that avoiding over-rosy presentations is a good way to prevent later disappointments or even disasters.) In the case of this book, finding [more] ways of detecting would-be disasters would have been nice and to the point. A side comment as a statistician is that mentioning time series inter- or extrapolation without a statistical model sounds close to anathema! And makes extrapolation a weapon without a cause.

…we know, a priori, exactly how long the [simulated annealing] process will take since it is a function of the temperature and the cooling rate. (p. 199)

Unsurprisingly, the section on Monte Carlo integration is disappointing for a statistician/probabilistic numericist like me, since it fails to give a complete-enough picture of the methodology. All simulations there seem to proceed from a large-enough hypercube, and recommending the "fantastic" (p. 171) R function integrate as a default is scary, given the ability of the selected integration bounds to mislead its users. Similarly, I feel that the simulated annealing section does not provide enough of a cautionary tale about the highly sensitive impact of cooling rates and absolute temperatures. It is only through the raw output of the algorithm applied to the traveling salesman problem that the novice reader can perceive the impact of some of these factors. (The acceptance bound on the jump (6.9) is, incidentally, wrongly called a probability on p. 199, since it can take values larger than 1.)

About the Author

Christian Robert is a professor of statistics at the universities of Paris-Dauphine, France, and Warwick, UK. He has written extensively about Bayesian statistics and computational methods, including the books The Bayesian Choice and Monte Carlo Statistical Methods. He has served as president of the International Society for Bayesian Analysis and editor-in-chief of the Journal of the Royal Statistical Society (Series B), and currently is deputy editor of Biometrika.
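The accept-reject method at the core of the Martino, Luengo, and Míguez book reviewed above takes only a few lines to sketch. This is a generic version with an illustrative Beta(2, 2) target and a uniform proposal, not code from the book.

```python
import random

def accept_reject(target_pdf, proposal_sampler, proposal_pdf, M, rng=random):
    """Draw X from the proposal and accept it with probability
    target_pdf(X) / (M * proposal_pdf(X)), where M bounds the ratio
    target_pdf / proposal_pdf everywhere; accepted draws are exact,
    independent samples from the target."""
    while True:
        x = proposal_sampler()
        if rng.random() * M * proposal_pdf(x) <= target_pdf(x):
            return x

# Beta(2, 2) has density 6x(1-x) on (0, 1), which a Uniform(0, 1)
# proposal dominates with M = 1.5 (the density's maximum, at x = 1/2).
random.seed(42)
beta22 = lambda x: 6 * x * (1 - x)
draws = [accept_reject(beta22, random.random, lambda x: 1.0, 1.5)
         for _ in range(20_000)]
print(sum(draws) / len(draws))  # near the Beta(2, 2) mean of 0.5
```

The expected acceptance rate is 1/M (here 2/3), so the art of the method, and a fair part of the book, lies in finding proposals that make M small.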


2019 JOINT STATISTICAL MEETINGS
July 27–August 1 • Denver, Colorado

KEY DATES

ATTEND
• May 1, 2019: Registration and Housing Open
• May 31, 2019: Early Registration Deadline
• June 1–June 29, 2019: Regular Registration
• June 30–August 1: Late Registration
• July 3: Housing Deadline

PARTICIPATE
• January 23–April 3, 2019: Meeting and Event Request Submissions Accepted
• January 25: Government Agency Registration Extension Request Deadline
• April 1–18, 2019: Abstract Editing Open
• April 15, 2019: Late-Breaking Session Proposals Deadline
• May 17, 2019: Draft Manuscript Deadline

ww2.amstat.org/meetings/jsm/2019

HELP US RECRUIT THE NEXT GENERATION OF STATISTICIANS

The field of statistics is growing fast. Jobs are plentiful, opportunities are exciting, and salaries are high. So what's keeping more kids from entering the field? Many just don't know about statistics. But the ASA is working to change that, and here's how you can help:

• Send your students to www.ThisIsStatistics.org and use its resources in your classroom. It's all about the profession of statistics.

• Download a handout for your students about careers in statistics at www.ThisIsStatistics.org/educators.

If you're on social media, connect with us at www.Facebook.com/ThisIsStats and www.Twitter.com/ThisIsStats. Encourage your students to connect with us, as well.

Site features:
• Videos of young statisticians passionate about their work
• A myth-busting quiz about statistics
• Photos of cool careers in statistics, like a NASA biostatistician and a wildlife statistician
• Colorful graphics displaying salary and job growth data
• A blog about jobs in statistics and data science
• An interactive map of places that employ statisticians in the U.S.