
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2020, VOL. 115, NO. 530, 636–655: Theory and Methods Discussion. https://doi.org/10.1080/01621459.2020.1762613

Prediction, Estimation, and Attribution

Bradley Efron

Department of Statistics, Stanford University, Stanford, CA

ABSTRACT
The scientific needs and computational limitations of the twentieth century fashioned classical statistical methodology. Both the needs and limitations have changed in the twenty-first, and so has the methodology. Large-scale prediction algorithms—neural nets, deep learning, boosting, support vector machines, random forests—have achieved star status in the popular press. They are recognizable as heirs to the regression tradition, but ones carried out at enormous scale and on titanic datasets. How do these algorithms compare with standard regression techniques such as ordinary least squares or logistic regression? Several key discrepancies will be examined, centering on the differences between prediction and estimation or prediction and attribution (significance testing). Most of the discussion is carried out through small numerical examples.

ARTICLE HISTORY: Received October 2019; Accepted April 2020

KEYWORDS: Black box; Ephemeral predictors; Random forests; Surface plus noise

1. Introduction

Statistical regression methods go back to Gauss and Legendre in the early 1800s, and especially to Galton in 1877. During the twentieth century, regression ideas were adapted to a variety of important statistical tasks: the prediction of new cases, the estimation of regression surfaces, and the assignment of significance to individual predictors, what I have called "attribution" in the title of this article. Many of the most powerful ideas of twentieth century statistics are connected with regression: least squares fitting, logistic regression, generalized linear models, ANOVA, predictor significance testing, regression to the mean.

The twenty-first century has seen the rise of a new breed of what can be called "pure prediction algorithms"—neural nets, deep learning, boosting, support vector machines, random forests—recognizably in the Gauss–Galton tradition, but able to operate at immense scales, with millions of data points and even more millions of predictor variables. Highly successful at automating tasks like online shopping, machine translation, and airline scheduling, the algorithms (particularly deep learning) have become media darlings in their own right, generating an immense rush of interest in the business world. More recently, the rush has extended into the world of science, a one-minute browser search producing "deep learning in biology"; "computational linguistics and deep learning"; and "deep learning for regulatory genomics."

How do the pure prediction algorithms relate to traditional regression methods? That is the central question pursued in what follows. A series of salient differences will be examined—differences of assumption, scientific philosophy, and goals. The story is a complicated one, with no clear winners or losers; but a rough summary, at least in my mind, is that the pure prediction algorithms are a powerful addition to the statistician's armory, yet substantial further development is needed for their routine scientific applicability. Such development is going on already in the statistical world, and has provided a welcome shot of energy into our discipline.

This article, originally a talk, is written with a broad brush, and is meant to be descriptive of current practice rather than normative of how things have to be. No previous knowledge of the various prediction algorithms is assumed, though that will sorely underestimate many readers.

This is not a research paper, and most of the argumentation is carried out through numerical examples. These are of small size, even miniscule by current prediction standards. A certain giganticism has gripped the prediction literature, with swelling prefixes such as tera-, peta-, and exa- bestowing bragging rights. But small datasets can be better for exposing the limitations of a new methodology.

An excellent reference for prediction methods, both traditional and modern, is Hastie, Tibshirani, and Friedman (2009). Very little will be said here about the mechanics of the pure prediction algorithms: just enough, I hope, to get across the idea of how radically different they are from their traditional counterparts.

2. Surface Plus Noise Models

For both the prediction algorithms and traditional regression methods, we will assume that the data d available to the statistician has this structure:

    d = {(x_i, y_i), i = 1, 2, ..., n};    (1)

here x_i is a p-dimensional vector of predictors taking its value in a known space X contained in R^p, and y_i is a real-valued

CONTACT Bradley Efron [email protected] Department of Statistics, Stanford University, 390 Jane Stanford Way, Sequoia Hall, Stanford, CA 94305-4065. Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/r/JASA. © 2020 American Statistical Association

[Figure 1 plot: cholesterol decrease (y-axis, −20 to 100) versus normalized compliance (x-axis, −2 to 2); annotation: Adj R-squared = .481]

Figure 1. Black curve is OLS fitted regression to cholostyramine data (dots); vertical bars indicate ± one standard error of estimation.

response. The n pairs are assumed to be independent of each other. More concisely we can write

    d = {x, y},    (2)

where x is the n × p matrix having x_i^t as its ith row, and y = (y_1, y_2, ..., y_n)^t. Perhaps the most traditional of traditional regression models is

    y_i = x_i^t β + ϵ_i  (i = 1, 2, ..., n),    (3)

with ϵ_i ~iid N(0, σ²), that is, "ordinary least squares with normal errors." Here, β is an unknown p-dimensional parameter vector. In matrix notation,

    y = xβ + ϵ.    (4)

For any choice of x in X, model (3) says that the response y has expectation µ = x^t β, so y ∼ N(µ, σ²). The linear surface S_β,

    S_β = {µ = x^t β, x ∈ X},    (5)

contains all the true expectations, but the truth is blurred by the noise terms ϵ_i.

More generally, we might expand (3) to

    y_i = s(x_i, β) + ϵ_i  (i = 1, 2, ..., n),    (6)

where s(x, β) is some known functional form that, for any fixed value of β, gives expectation µ = s(x, β) as a function of x ∈ X. Now the surface of true expectations, that is, the regression surface, is

    S_β = {µ = s(x, β), x ∈ X}.    (7)

Most traditional regression methods depend on some sort of surface plus noise formulation (though "plus" may refer to, say, binomial variability). The surface describes the scientific truths we wish to learn, but we can only observe points on the surface obscured by noise. The statistician's traditional estimation task is to learn as much as possible about the surface from the data d.

Figure 1 shows a small example, taken from a larger dataset in Efron and Feldman (1991): n = 164 male doctors volunteered to take the cholesterol-lowering drug cholostyramine. Two numbers were recorded for each doctor,

    x_i = normalized compliance  and  y_i = observed cholesterol decrease.    (8)

Compliance, the proportion of the intended dose actually taken, ranged from 0% to 100%, −2.25 to 1.97 on the normalized scale, and of course it was hoped to see larger cholesterol decreases for the better compliers.

A normal regression model (6) was fit, with

    s(x_i, β) = β_0 + β_1 x_i + β_2 x_i² + β_3 x_i³,    (9)

in other words, a cubic regression model. The black curve is the estimated surface

    Ŝ = {s(x, β̂) for x ∈ X},    (10)

fit by maximum likelihood or, equivalently, by ordinary least squares (OLS). The vertical bars indicate ± one standard error for the estimated values s(x, β̂), at 11 choices of x, showing how inaccurate Ŝ might be as an estimate of the true S.

That is the estimation side of the story. As far as attribution is concerned, only β̂_0 and β̂_1 were significantly nonzero. The adjusted R² was 0.482, a traditional measure of the model's predictive power.

Another mainstay of traditional methodology is logistic regression. Table 1 concerns the neonate data (Mediratta et al. 2019): n = 812 very sick babies at an African facility were observed over the course of one year, 605 who lived and 207 who died. Eleven covariates were measured at entry: gestational age, body weight, Apgar score, etc., so x_i in (1) was 11-dimensional, while y_i equaled 0 or 1 as the baby lived or died. This is a surface

[Figure 2 panels: left, surface of acceleration as a function of force and mass; right, the same surface obscured by noise.]

Figure 2. On left, a surface depicting Newton's second law of motion, acceleration = force/mass; on right, a noisy version.
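As a concrete aside on the mechanics behind Figure 1, the cubic OLS fit (9) and its ± one standard error bars follow from the classical formulas β̂ = (XᵗX)⁻¹Xᵗy and Cov(β̂) = σ̂²(XᵗX)⁻¹. The sketch below is a minimal Python/numpy illustration on simulated data; the coefficients and noise level are invented stand-ins, not the actual cholostyramine measurements.

```python
import numpy as np

def fit_cubic_ols(x, y):
    """OLS fit of model (9): y = b0 + b1*x + b2*x^2 + b3*x^3.

    Returns (beta_hat, se): coefficient estimates and their standard
    errors from the classical sigma^2 * (X^T X)^{-1} formula.
    """
    X = np.vander(x, 4, increasing=True)      # columns: 1, x, x^2, x^3
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    n, p = X.shape
    sigma2 = resid @ resid / (n - p)          # unbiased noise-variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return beta, np.sqrt(np.diag(cov))

# Simulated stand-in for the n = 164 compliance/cholesterol pairs:
rng = np.random.default_rng(0)
x = np.linspace(-2.25, 1.97, 164)             # the normalized-compliance range
y = 10 + 12 * x + 4 * x**2 + 2 * x**3 + rng.normal(0, 10, x.size)
beta_hat, se = fit_cubic_ols(x, y)            # s(x, beta_hat), with +/- one SE per coefficient
```

Evaluating s(x, β̂) on a grid of x values, with error bars from the covariance of β̂, reproduces the kind of display in Figure 1.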

Table 1. Logistic regression analysis of neonate data.

              Estimate     SE      p-value
    Intercept   −1.549   0.457    0.001***
    gest        −0.474   0.163    0.004**
    ap          −0.583   0.110    0.000***
    bwei        −0.488   0.163    0.003**
    resp         0.784   0.140    0.000***
    cpap         0.271   0.122    0.027*
    ment         1.105   0.271    0.000***
    rate        −0.089   0.176    0.612
    hr           0.013   0.108    0.905
    head         0.103   0.111    0.355
    gen         −0.001   0.109    0.994
    temp         0.015   0.124    0.905

NOTE: Significant two-sided p-values indicated for 6 of 11 predictors; estimated logistic regression made 18% prediction errors.

plus noise model, with a linear logistic surface and Bernoulli noise.

The 11 predictor variables were standardized to have mean 0 and variance 1, after which logistic regression analysis was carried out. Table 1 shows some of the output. Columns 1 and 2 give estimates and standard errors for the regression coefficients (which amount to a description of the estimated linear logistic surface and its accuracy). Column 3 shows standard two-sided p-values for the 11 variables, 6 of which are significantly nonzero, 5 of them strongly so. This is the attribution part of the analysis. As far as prediction is concerned, the fitted logistic regression model gave an estimated probability p̂_i of death for each baby. The prediction rule

    predict dies if p̂_i > 0.25, and lives if p̂_i ≤ 0.25,    (11)

had an empirical error rate of 18%. (Threshold 0.25 was chosen to compensate for the smaller proportion of deaths.)

All of this is familiar stuff, serving here as a reminder of how traditional regression analyses typically begin: a description of the underlying scientific truth (the "surface") is formulated, along with a model of the errors that obscure direct observation. The pure prediction algorithms follow a different path, as described in Section 3.

The left panel of Figure 2 shows the surface representation of a scientific icon, Newton's second law of motion,

    acceleration = force / mass.    (12)

It is pleasing to imagine the second law falling full-born out of Newton's head, but he was a master of experimentation.¹ The right panel shows a (fanciful) picture of what experimental data might have looked like.

1. A half-century earlier, Galileo famously used inclined planes and a water clock to estimate the acceleration of falling objects.

In the absence of genius-level insight, statistical estimation theory is intended as an instrument for peering through the noisy data and discerning a smooth underlying truth. Neither the cholostyramine nor the neonate examples is as fundamental as Newton's second law but they share the goal of extracting dependable scientific structure in a noisy environment. The noise is ephemeral but the structure, hopefully, is eternal, or at least long-lasting (see Section 8).

3. The Pure Prediction Algorithms

The twenty-first century² has seen the rise of an extraordinary collection of prediction algorithms: random forests, gradient boosting, support vector machines, neural nets (including deep learning), and others. I will refer to these collectively as the "pure prediction algorithms" to differentiate them from the traditional prediction methods illustrated in the previous section. Some spectacular successes—machine translation, iPhone's Siri, facial recognition, championship chess, and Go programs—have elicited a tsunami of public interest. If media attention is the appropriate metric, then the pure prediction algorithms are our era's statistical stars.

2. Actually, the "long twenty-first century," much of the activity beginning in the 1990s.

The adjective "pure" is justified by the algorithms' focus on prediction, to the neglect of estimation and attribution. Their basic strategy is simple: to go directly for high predictive accuracy and not worry about surface plus noise models. This has some striking advantages and some drawbacks, too. Both advantages and drawbacks will be illustrated in what follows.

A prediction algorithm is a general program for inputting a dataset d = {(x_i, y_i), i = 1, 2, ..., n} (1) and outputting a rule

[Figure 3: classification tree diagram, "lived" branches to the left and "died" to the right. Root split: cpap < 0.6654; second-level splits gest ≥ −1.672 and gest ≥ −1.941; further splits on ment < 0.5, resp < 0.3503, and ap ≥ 0.1228. Eight terminal nodes, with death proportions ranging from 0.05 (n = 412, far left) to 1 (n = 41, far right).]

Figure 3. Classification tree for neonate data. Triangled terminal nodes predict baby lives, circled predict baby dies; the rule has apparent prediction error rate 17% and cross-validated rate 18%.

f(x, d) that, for any predictor vector x, yields a prediction

    ŷ = f(x, d).    (13)

We hope that the apparent error rate of the rule, for classification problems the proportion of cases where ŷ_i ≠ y_i,

    err = #{f(x_i, d) ≠ y_i}/n,    (14)

is small. More crucially, we hope that the true error rate

    Err = E{f(X, d) ≠ Y},    (15)

is small, where (X, Y) is a random draw from whatever probability distribution gave the (x_i, y_i) pairs in d; see Section 6. Random forests, boosting, deep learning, etc. are algorithms that have well-earned reputations for giving small values of Err in complicated situations.

Besides being very different from traditional prediction methods, the pure prediction algorithms are very different from each other. The least intricate and easiest to describe is random forests (Breiman 2001). For dichotomous prediction problems, like that for the neonate babies, random forests depends on ensembles of classification trees.

Figure 3 shows a single classification tree obtained by applying the R program Rpart³ to the neonates. At the top of the tree all 812 babies were divided into two groups: those with cpap (an airway blockage signifier) less than threshold 0.6654 were put into the more-favorable prognosis group to the left; those with cpap ≥ 0.6654 were shunted right into the less-favorable prognosis group. The predictor cpap and threshold 0.6654 were chosen to maximize, among all possible (predictor, threshold) choices, the difference in observed death rates between the two groups.⁴ Next, each of the two groups was itself split in two, following the same Gini criterion. The splitting process continued until certain stopping rules were invoked, involving very small or very homogeneous groups.

At the bottom of Figure 3, the splitting process ended at eight terminal nodes: the node at far left contained 412 of the original 812 babies, only 5% of which were deaths; the far right node contained 41 babies, all of which were deaths. Triangles indicate the three terminal nodes having death proportions less than the original proportion 25.5%, while circles indicate proportions exceeding 25.5%. The prediction rule is "lives" at triangles, "dies" at circles. If a new baby arrived at the facility with vector x of 11 measurements, the doctors could predict life or death by following x down the tree to termination.

This prediction rule has apparent error rate 17%, taking the observed node proportions, 0.05, etc., as true. Classification trees have a reputation for being greedy overfitters, but in this case a 10-fold cross-validation analysis gave error rate 18%, nearly the same. The careful "traditional" analysis of the neonate data in Mediratta et al. (2019) gave a cross-validated error rate of 20%. It is worth noting that the splitting variables in Figure 3 agree nicely with those found significant in Table 1.

So far so good for regression trees, but with larger examples they have earned a reputation for poor predictive performance; see Section 9.2 of Breiman (2001). As an improvement, Breiman's random forest algorithm relies on averaging a large number of bootstrap trees, each generated as follows:

3. A knockoff of CART (Breiman et al. 1984).
4. More precisely: if n_L and n_R are the numbers in the left and right groups, and p̂_L and p̂_R the proportions of deaths, then the algorithm minimized the Gini criterion n_L p̂_L(1 − p̂_L) + n_R p̂_R(1 − p̂_R). This equals n p̂(1 − p̂) − (n_L n_R/n)(p̂_L − p̂_R)², where n = n_L + n_R and p̂ = (n_L p̂_L + n_R p̂_R)/n, so that minimizing Gini's criterion is equivalent to maximizing (p̂_L − p̂_R)², for any given values of n_L and n_R.
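The split search described in footnote 4 fits in a few lines of code. Below is a plain-Python sketch, with toy 0/1 values rather than the neonate measurements, that scans a single predictor for the Gini-minimizing threshold:

```python
def gini_cost(y_left, y_right):
    """The criterion of footnote 4: n_L*pL*(1-pL) + n_R*pR*(1-pR),
    with y-values coded 0/1 and p the proportion of 1s in a group."""
    cost = 0.0
    for part in (y_left, y_right):
        n = len(part)
        if n:
            p = sum(part) / n
            cost += n * p * (1 - p)
    return cost

def best_split(xs, ys):
    """Scan one predictor for the threshold minimizing the Gini
    criterion, as in the root split cpap < 0.6654 of Figure 3."""
    pairs = sorted(zip(xs, ys))
    best_cost, best_t = float("inf"), None
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue                      # no threshold between tied x-values
        t = (pairs[i][0] + pairs[i - 1][0]) / 2
        cost = gini_cost([y for _, y in pairs[:i]],
                         [y for _, y in pairs[i:]])
        if cost < best_cost:
            best_cost, best_t = cost, t
    return best_cost, best_t
```

Minimizing this cost over every (predictor, threshold) pair, then recursing on each daughter node until the stopping rules kick in, is the greedy tree-growing recipe behind Figure 3.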


Figure 4. Random forest prediction error rate for neonate data, as a function of number of bootstrapped trees; it has cross-validated error rate 17%.

1. Draw a nonparametric bootstrap sample d* from the original data d, that is, a random sample of n pairs (x_i, y_i) chosen with replacement from d.
2. Construct a classification tree from d* as before, but choose each split using only a random subset of p* predictors chosen independently from the original p (p* ≐ √p).

Having generated, say, B such classification trees, a newly observed x is classified by following it down to termination in each tree; finally, ŷ = f(x, d) is determined by the majority of the B votes. Typically, B is taken in the range 100–1000.

Random forests was applied to the neonate prediction problem, using the R program randomForest, with the results graphed in Figure 4. The prediction error rate⁵ is shown as a function of the number of bootstrap trees sampled. In all, B = 501 trees were used but there was not much change after 200. The overall prediction error rate fluctuated around 17%, only a small improvement over the 18% cross-validated rate in Figure 3. Random forests is shown to better advantage in the microarray example of Section 4.

Random forests begins with the p columns of x as predictors, but then coins a host of new predictors via the splitting process (e.g., "cpap less than or greater than 0.6654"). The new variables bring a high degree of interaction to the analysis, for instance, between cpap and gest in Figure 3. Though carried out differently, high interactivity and fecund coinage of predictor variables are hallmarks of all pure prediction algorithms.

4. A Microarray Prediction Problem

Newsworthy breakthroughs for pure prediction algorithms have involved truly enormous datasets. The original English/French translator tool on Google, for instance, was trained on millions of parallel snippets of English and French obtained from Canadian and European Union legislative records. There is nothing of that size to offer here but, as a small step up from the neonate data, we will consider a microarray study of prostate cancer.

The study involved n = 102 men, 52 cancer patients and 50 normal controls. Each man's genetic expression levels were measured on a panel of p = 6033 genes,

    x_ij = activity of jth gene for ith man,    (16)

i = 1, 2, ..., 102 and j = 1, 2, ..., 6033. The n × p matrix x is much wider than it is tall in this case, "wide data" being the trendy name for p ≫ n situations, as contrasted with the p ≪ n "tall" datasets traditionally favored.

Random forests was put to the task of predicting normal or cancer from a man's microarray measurements. Following standard procedure, the 102 men were randomly divided into training and test sets of size 51,⁶ each having 25 normal controls and 26 cancer patients.

The training data d_train consists of 51 (x, y) pairs, x a vector of p = 6033 genetic activity measurements and y equal 0 or 1 indicating a normal or cancer patient. R program randomForest yielded prediction rule f(x, d_train). This rule was applied to the test set, yielding predictions ŷ_i = f(x_i, d_train) for the 51 test subjects.

Figure 5 graphs the test set error rate as the number of random forest trees increased. After 100 trees, the test set error rate was 2%. That is, ŷ_i agreed with y_i, the actual outcome, for 50 of the 51 test set subjects: an excellent performance by anyone's standards! This was not a particularly lucky result. Subsequently, random training/test set splits were carried out 100 times, each time repeating the random forest calculations in Figure 5 and counting the number of test set errors. The modal number of errors was 2, as seen in Table 2, with "1 prediction error" occurring frequently.

Table 2. Number of random forest test set errors in 100 random training/test splits of prostate data.

    Errors      0   1   2   3   4   5   7
    Frequency   3  26  39  12   5   4   1
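The bootstrap-and-vote recipe of steps 1 and 2 above can be sketched compactly. The Python/numpy sketch below is only a toy: it uses depth-1 trees ("stumps," in the language of the next section) in place of full classification trees, and invented data in place of the microarray matrix, but the three ingredients — bootstrap resampling, random √p feature subsets, and the majority vote — are all present.

```python
import numpy as np

def fit_stump(X, y, feat_ids):
    """Depth-1 tree fit by the Gini criterion, restricted to a random
    subset of features (step 2 of the recipe above)."""
    best = (np.inf, 0, 0.0, 0, 0)  # (cost, feature, threshold, left vote, right vote)
    for j in feat_ids:
        for t in np.unique(X[:, j])[:-1]:
            left, right = y[X[:, j] <= t], y[X[:, j] > t]
            cost = sum(len(s) * np.mean(s) * (1 - np.mean(s))
                       for s in (left, right))
            if cost < best[0]:
                best = (cost, j, t,
                        round(float(np.mean(left))), round(float(np.mean(right))))
    return best[1:]

def random_forest_fit(X, y, B=100, seed=0):
    """Step 1 (bootstrap) plus step 2 (random feature subsets), B times;
    returns a classifier taking the majority of the B votes."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    k = max(1, int(np.sqrt(p)))                 # p* on the order of sqrt(p)
    stumps = []
    for _ in range(B):
        rows = rng.integers(0, n, n)            # bootstrap sample, with replacement
        feats = rng.choice(p, size=k, replace=False)
        stumps.append(fit_stump(X[rows], y[rows], feats))

    def predict(x):
        votes = [lv if x[j] <= t else rv for (j, t, lv, rv) in stumps]
        return int(np.mean(votes) > 0.5)        # majority vote
    return predict
```

A real implementation (Breiman's randomForest, as used in the text) grows full trees rather than stumps; the voting logic is the same.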

5. These are "out-of-bag" estimates of prediction error, a form of cross-validation explained in Appendix A.
6. It would be more common to choose, say, 81 training and 21 test, but for the comparisons that follow it will be helpful to have larger test sets.
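The out-of-bag idea in footnote 5 is pure bookkeeping: each bootstrap sample omits roughly a third of the cases, and those omitted cases serve as a built-in test set for that tree. A plain-Python sketch, with the learner abstracted into caller-supplied fit/predict functions (the toy demonstration below uses a fixed threshold rule, not a real tree):

```python
import random
from collections import defaultdict

def oob_error(X, y, fit, predict, B=200, seed=0):
    """Out-of-bag estimate of prediction error.

    For each of B bootstrap rounds, fit on the in-bag cases and record
    votes for the left-out cases; each case is finally classified by the
    majority of votes from fits that never saw it, and the error rate of
    those classifications is returned.
    """
    rng = random.Random(seed)
    n = len(y)
    votes = defaultdict(list)
    for _ in range(B):
        bag = [rng.randrange(n) for _ in range(n)]   # bootstrap indices
        model = fit([X[i] for i in bag], [y[i] for i in bag])
        for i in set(range(n)) - set(bag):           # out-of-bag cases
            votes[i].append(predict(model, X[i]))
    wrong = sum(round(sum(v) / len(v)) != y[i] for i, v in votes.items())
    return wrong / len(votes)

# Toy demonstration: a perfect threshold classifier has OOB error 0.
X = [-2.0, -1.5, -1.0, 1.0, 1.5, 2.0]
y = [0, 0, 0, 1, 1, 1]
err = oob_error(X, y,
                fit=lambda Xb, yb: None,             # 'training' ignored here
                predict=lambda model, x: int(x > 0))
# err == 0.0
```

Plugging in a genuine tree learner for fit/predict yields the kind of out-of-bag error curve graphed in Figure 4.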


Figure 5. Test set error rate for random forests applied to prostate cancer microarray study, as a function of number of bootstrap trees.


Figure 6. Test set error for boosting algorithm gbm applied to prostate cancer data. Thin curve is training set error, which went to zero at step 86.

A classification tree can be thought of as a function f(x) taking values 0 or 1 for x in its sample space X. The tree in Figure 3 partitions the 11-dimensional space X into 8 rectangular regions, three of which have ŷ = 0 and five have ŷ = 1. A simpler function is obtained by stopping the division process after the first split, in which case X is divided into just two regions, cpap < 0.6654 and cpap ≥ 0.6654. Such simple trees are picturesquely known as "stumps."

This brings up another prominent pure prediction method, boosting. Figure 6 shows the results of applying the R program gbm (for gradient boosting modeling) to the prostate cancer prediction problem.⁷ Gbm sequentially fits a weighted sum of classification trees,

    Σ_{k=1}^{K} w_k f_k(x),    (17)

at step k + 1 choosing tree f_{k+1}(x) to best improve the fit. The weights w_k are kept small to avoid getting trapped in a bad sequence. After 400 steps, Figure 6 shows a test sample error of 4%, that is, two mistakes out of 51; once again, an impressive performance. (The examples in Hastie, Tibshirani, and Friedman (2009) show gbm usually doing a little better than random forests.)

In the evocative language of boosting, the stumps going into Figure 6's construction are called "weak learners": any one of them by itself would barely lower prediction errors below 50%.

7. Applied with d = 1, that is, fitting stumps, and shrinkage factor 0.1.

That a myriad of weak learners can combine so effectively is a happy surprise and a central advance of the pure prediction enterprise. In contrast, traditional methods focus on strong individual predictors, as with the asterisks in Table 1, a key difference to be discussed in subsequent sections.

The light curve in Figure 6 traces the gbm rule's error rate on its own training set. It went to zero at step 86 but training continued on, with some improvement in test error. Cross-validation calculations give some hint of when to stop the fitting process—here we would have done better to stop at step 200—but it's not a settled question.

The umbrella package keras was used to apply neural nets/deep learning to the prostate data. Results were poorer than for random forests or gbm: 7 or 8 errors on the test set depending on the exact stopping rule. A support vector machine algorithm did worse still, with 11 test set errors.

The deep learning algorithm is much more intricate than the others, reporting "780,736 parameters used," these being internally adjusted tuning parameters set by cross-validation. That this is possible at all is a tribute to modern computing power, the underlying enabler of the pure prediction movement.

5. Advantages and Disadvantages of Prediction

For those of us who have struggled to find "significant" genes in a microarray study,⁸ the almost perfect prostate cancer predictions of random forests and gbm have to come as a disconcerting surprise. Without discounting the surprise, or the ingenuity of the prediction algorithms, a contributing factor might be that prediction is an easier task than either attribution or estimation. This is a difficult suspicion to support in general, but a couple of examples help make the point.

Regarding estimation, suppose that we observe 25 independent replications from a normal distribution with unknown expectation µ,

    x_1, x_2, ..., x_25 ~ind N(µ, 1),    (18)

and consider estimating µ with either the sample mean x̄ or the sample median x̆. As far as squared error is concerned, the mean is an overwhelming winner, being more than half again more efficient,

    E{(x̆ − µ)²} / E{(x̄ − µ)²} = 1.57.    (19)

Suppose instead that the task is to predict the value of a new, independent realization X ∼ N(µ, 1). The mean still wins, but now by only 2%,

    E{(X − x̆)²} / E{(X − x̄)²} = 1.02.    (20)

The reason, of course, is that most of the prediction error comes from the variability of X, which neither x̄ nor x̆ can cure.⁹

Prediction is easier than estimation, at least in the sense of being more forgiving. This allows for the use of inefficient estimators, like the gbm stumps, that are convenient for mass deployment. The pure prediction algorithms operate nonparametrically, a side benefit of not having to worry much about estimation efficiency.

For the comparison of prediction with attribution we consider an idealized version of a microarray study involving n subjects, n/2 healthy controls and n/2 sick patients: any one subject provides a vector of measurements on N genes, X = (X_1, X_2, ..., X_N)^t, with

    X_j ~ind N(±δ_j/(2c), 1),  c = (n/4)^(1/2),    (21)

for j = 1, 2, ..., N, "plus" for the sick and "minus" for the healthy; δ_j is the effect size for gene j. Most of the genes are null, δ_j = 0, say N_0 of them, but a small number N_1 have δ_j equal a positive value Δ,

    N_0 : δ_j = 0  and  N_1 : δ_j = Δ.    (22)

A new person arrives and produces a microarray of measurements X = (X_1, X_2, ..., X_N)^t satisfying (21) but without us knowing the person's healthy/sick status; that is, without knowledge of the ± value. Question: How small can N_1/N_0 get before prediction becomes impossible? The answer, motivated in Appendix A, is that asymptotically as N_0 → ∞, accurate prediction is possible if

    N_1 = O(N_0^(1/2)),    (23)

but not below that.

By contrast, Appendix A shows that effective attribution requires

    N_1 = O(N_0).    (24)

In terms of "needles in haystacks" (Johnstone and Silverman 2004), attribution needs an order of magnitude more needles than prediction. The prediction tactic of combining weak learners is not available for attribution, which, almost by definition, is looking for strong individual predictors. At least in this example, it seems fair to say that prediction is much easier than attribution.

The three main regression categories can usefully be arranged in order

    prediction ··· estimation ··· attribution,    (25)

with estimation in a central position and prediction and attribution more remote from each other. Traditionally, estimation is linked to attribution through p-values and confidence intervals, as in Table 1. Looking in the other direction, good estimators, when they are available, are usually good predictors. Both prediction and estimation focus their output on the n side of the n × p matrix x, while attribution focuses on the p side. Estimation faces both ways in (25).

The randomForest algorithm does attempt to connect prediction and attribution. Along with the predictions, an importance measure¹⁰ is computed for each of the p predictor variables.

8. See Figure 15.5 of Efron and Hastie (2016).
9. This imagines that we have a single new X ∼ N(µ, 1) to predict. Suppose instead that we have m new X_1, X_2, ..., X_m ~ind N(µ, 1), and that we wish to predict their mean X̄. With m = 10 the efficiency ratio is E{(X̄ − x̆)²}/E{(X̄ − x̄)²} = 1.16; with m = 100, 1.46; and with m = ∞, 1.57. One can think of estimation as the prediction of mean values.
10. There are several such measures. The one in Figure 7 relates to Gini's criterion, Section 3. At the conclusion of the algorithm we have a long list of all the splits in all the bootstrap trees; a single predictor's importance score is the sum of the decreases in the Gini criterion over all splits where that predictor was the splitting variable.
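The mean-versus-median comparison in (19) and (20) is easy to examine by simulation. A plain-Python Monte Carlo sketch follows; note that the exact finite-sample ratio at n = 25 sits somewhat below the asymptotic π/2 ≈ 1.57, so only rough agreement with (19) should be expected, while the near-1 prediction ratio of (20) reproduces closely.

```python
import random
import statistics

# Monte Carlo look at (19) and (20): sample mean versus sample median
# of n = 25 normal observations, first for estimating mu, then for
# predicting a fresh independent draw X ~ N(mu, 1).
rng = random.Random(7)
mu, n, reps = 0.0, 25, 40_000
est_mean = est_med = pred_mean = pred_med = 0.0
for _ in range(reps):
    xs = [rng.gauss(mu, 1) for _ in range(n)]
    xbar = sum(xs) / n
    xmed = statistics.median(xs)
    est_mean += (xbar - mu) ** 2
    est_med += (xmed - mu) ** 2
    x_new = rng.gauss(mu, 1)            # the value to be predicted
    pred_mean += (x_new - xbar) ** 2
    pred_med += (x_new - xmed) ** 2

est_ratio = est_med / est_mean          # compare with (19)
pred_ratio = pred_med / pred_mean       # compare with (20)
```

The estimation ratio comes out well above 1 (the mean's "overwhelming" win), while the prediction ratio hovers just above 1: the variability of the new X dominates either estimator's error, which is the point of (20).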


Figure 7. Random forest importance measures for prostate cancer prediction rule of Figure 5, plotted in order of declining importance.
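The Gini-based importance score behind Figure 7 sums, over every split in every bootstrap tree, the decrease in the Gini criterion achieved when a given variable was the splitter. The bookkeeping is simple enough to sketch in plain Python; the split log below is hypothetical, not taken from the prostate forest.

```python
from collections import defaultdict

def gini_impurity(part):
    """n * p * (1 - p) for a node with 0/1-coded responses."""
    n = len(part)
    p = sum(part) / n if n else 0.0
    return n * p * (1 - p)

def gini_decrease(y_node, y_left, y_right):
    """Drop in the Gini criterion achieved by one split."""
    return gini_impurity(y_node) - gini_impurity(y_left) - gini_impurity(y_right)

def importance_scores(splits):
    """splits: (variable, y_node, y_left, y_right) records collected over
    all splits in all bootstrap trees; importance = summed Gini decreases."""
    scores = defaultdict(float)
    for var, y_node, y_left, y_right in splits:
        scores[var] += gini_decrease(y_node, y_left, y_right)
    return dict(scores)

# Hypothetical split log from a tiny two-split forest:
log = [("cpap", [0, 0, 1, 1], [0, 0], [1, 1]),
       ("gest", [0, 1, 1, 1], [0, 1], [1, 1])]
scores = importance_scores(log)        # cpap: 1.0, gest: 0.25
```

Sorting the resulting scores in decreasing order gives exactly the kind of display in Figure 7.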

Table 3. Number of test set errors for prostate cancer random forest predictions, removing top predictors shown in Figure 7.

    #removed   0   1   5  10  20  40  80  160  348
    #errors    1   0   3   1   1   2   2    2    0

Figure 7 shows the ordered importance scores for the prostate cancer application of Figure 5. Of the p = 6033 genes, 348 had positive scores, these being the genes that ever were chosen as splitting variables. Gene 1031 achieved the most importance, with about 25 others above the sharp bend in the importance curve. Can we use the importance scores for attribution, as with the asterisks in Table 1?

In this case, the answer seems to be no. I removed gene 1031 from the dataset, reducing the data matrix x to 102 × 6032, and reran the randomForest prediction algorithm. Now the number of test set prediction errors was zero. Removing the most important five genes, the most important 10, ..., the most important 348 genes had similarly minor effects on the number of test set prediction errors, as shown in Table 3.

At the final step, all of the genes involved in constructing the original prediction rule of Figure 5 had been removed. Now x was 102 × 5685, but the random forest rule based on the reduced dataset d = {x, y} still gave excellent predictions. As a matter of fact, there were zero test set errors for the realization shown in Table 3. The prediction rule at the final step yielded 364 "important" genes, disjoint from the original 348. Removing all 712 = 348 + 364 genes from the prediction set—so now x was 102 × 5321—still gave a random forest prediction rule that made only one test set error.

The "weak learners" model of prediction seems dominant in this example. Evidently there are a great many genes weakly correlated with prostate cancer, which can be combined in different combinations to give near-perfect predictions. This is an advantage if prediction is the only goal, but a disadvantage as far as attribution is concerned. Traditional methods of attribution operate differently, striving as in Table 1 to identify a small set of causal covariates (even if strict causality cannot be inferred). The pure prediction algorithms' penchant for coining weakly correlated new predictors moves them in the opposite direction from attribution. Section 9 addresses sparsity—a working assumption of there being only a few important predictors—which is not at all the message conveyed by Table 3.

6. The Training/Test Set Paradigm

A crucial ingredient of modern prediction methodology is the training/test set paradigm: the data d (1) is partitioned into a training set d_train and a test set d_test; a prediction rule f(x, d_train) is computed using only the data d_train; finally, f(x, d_train) is applied to the cases in d_test, yielding an honest estimate of the rule's error rate. But honest does not mean perfect.

This paradigm was carried out in Section 4 for the prostate cancer microarray study, producing an impressively small error rate estimate of 2% for random forests.¹¹ This seemed extraordinary to me. Why not use this rule to diagnose prostate cancer based on the vector of a new man's 6033 gene expression measurements? The next example suggests how this might go wrong.

The training and test sets for the prostate cancer data of Section 4 were obtained by randomly dividing the 102 men into two sets of 51, each with 25 normal controls and 26 cancer patients. Randomization is emphasized in the literature as a guard against bias. Violating this advice, I repeated the analysis, this time selecting for the training set the 25 normal controls and 26 cancer patients with the lowest ID numbers. The test set was the remaining 51 subjects, those with the highest IDs, and again contained 25 normal controls and 26 cancer patients.

In the reanalysis randomForest did not perform nearly as well as in Figure 5: f(x, d_train) made 12 wrong predictions

11. Taking account of the information in Table 2, a better error rate estimate is 3.7%.
EFRON 0.6 0.5


Figure 8. randomForest test set error for prostate cancer microarray study, now with training/test sets determined by early/late ID number. Results are much worse than in Figure 5.


Figure 9. gbm test set error, early/late division; compare with Figure 6. Going on to 800 trees decreased error estimate to 26%. Training set error rate, thin curve, was zero after step 70 but test error rate continued to decline. See the brief discussion in Criterion 5 of Section 8.

on dtest with error rate 24%, rather than the previous 2%, as graphed in Figure 8. The boosting algorithm gbm was just as bad, producing prediction error rate 28% (14 wrong predictions) as shown in Figure 9.

Why are the predictions so much worse now? It is not obvious from inspection but the prostate study subjects might have been collected in the order listed,12 with some small methodological differences creeping in as time progressed. Perhaps all those weak learners going into randomForest and gbm were vulnerable to such differences. The prediction literature uses concept drift as a label for this kind of trouble, a notorious example being the Google flu predictor, which beat the CDC for a few years before failing spectacularly.13 Choosing one's test set by random selection sounds prudent but it is guaranteed to hide any drift effects.

Concept drift gets us into the question of what our various regression methods, new and old, are supposed to be telling us. Science, historically, has been the search for the underlying truths that govern our universe: truths that are supposed to be

12A singular value decomposition of the normal-subject data had second principal vector sloping upward with ID number, but this was not true for the cancer patient data.
13The CDC itself now sponsors annual Internet-based flu challenges (Schmidt 2019); see their past results at predict.cdc.gov.

Figure 10. Black line segments indicate active episodes in the hypothetical microarray study. (Matrix transposed for typographical convenience.)

Table 4. Comparing logistic regression coefficients for neonate data for year 1 (as in Table 1) and year 2; correlation coefficient 0.79.

          gest     ap   bwei   resp   cpap   ment   rate     hr   head    gen   temp
Year 1   −0.47  −0.58  −0.49   0.78   0.27   1.10  −0.09   0.01    0.1   0.00   0.02
Year 2   −0.65  −0.27  −0.19   1.13   0.15   0.41  −0.47   0.02   −0.2  −0.04  −0.16

eternal, like Newton's laws. The eternal part is clear enough in physics and astronomy—the speed of light, E = mc², Hubble's law—and perhaps in medicine and biology, too, for example, DNA and the circulation of blood. But modern science has moved on to fields where truth may be more contingent, such as economics, sociology, and ecology.

Without holding oneself to Newtonian standards, traditional estimation and attribution usually aim for long-lasting results that transcend the immediate datasets. In the surface plus noise paradigm of Section 2, the surface plays the role of truth—at least eternal enough to justify striving for its closest possible estimation.

In the neonate example of Table 1 we would hope that starred predictors like gest and ap would continue to show up as important in future studies. A second year of data was obtained, but with only n = 246 babies. The same logistic regression model was run for the year 2 data and yielded coefficient estimates reasonably similar to the year 1 values; see Table 4. Newton would not be jealous, but something of more than immediate interest seems to have been discovered.

Nothing rules out eternal truth-seeking for the pure prediction algorithms, but they have been most famously applied to more ephemeral phenomena: credit scores, Netflix movie recommendations, facial recognition, Jeopardy! competitions. The ability to extract information from large heterogeneous data collections, even if just for short-term use, is a great advantage of the prediction algorithms. Random selection of the test set makes sense in this setting, as long as one does not accept the estimated error rate as applying too far outside the limited range of the current data.

Here is a contrived microarray example where all the predictors are ephemeral: n = 400 subjects participate in the study, arriving one per day in alternation between Treatment and Control; each subject is measured on a microarray of p = 200 genes. The 400 × 200 data matrix x has independent normal entries

    x_ij ~ind N(µ_ij, 1)   for i = 1, 2, ..., 400 and j = 1, 2, ..., 200.   (26)

Most of the µ_ij are null, µ_ij = 0, but occasionally a gene will have an active episode of 30 days during which

    µ_ij = 2 for Treatment and −2 for Control   (27)

for the entire episode, or, in fact,

    µ_ij = 2 for Control and −2 for Treatment   (28)

for the entire episode. The choice between (27) and (28) is random, as is the starting date for each episode. Each gene has expected number of episodes equal to 1. The black line segments in Figure 10 indicate all the active time periods.

The 400 hypothetical subjects were randomly divided into a training set of 320 and a test set of 80. A randomForest analysis gave the results seen in the left panel of Figure 11, with test set error rate 19%. A second randomForest analysis was carried out, using the subjects from days 1 to 320 for the training set and from days 321 to 400 for the test set. The right panel of Figure 11 now shows test set error about 45%.

In this case, it is easy to see how things go wrong. From any one day's measurements it is possible to predict Treatment or Control from the active episode responses on nearby days


Figure 11. randomForest prediction applied to contrived microarray study pictured in Figure 10. Left panel: Test set of size 80, selected randomly from 400 days; heavy black curve shows final estimated test error rate of 19%. Right panel: Test set days 321–400; now error rate estimate is 45%. Light dotted curves in both panels are training set errors.

(even without knowledge of the activity map in Figure 10). This works for the random training/test division, where most of the test days will be intermixed with training days. Not so for the early/late division, where most of the test days are far removed from training set episodes. To put it another way, prediction is easier for interpolation than extrapolation.14

What in general can we expect to learn from training/test set error estimates? Going back to formulation (1), the usual assumption is that the pairs (x_i, y_i) are independent and identically distributed (iid) from some probability distribution F on (p + 1)-dimensional space,

    (x_i, y_i) ~iid F   for i = 1, 2, ..., n.   (29)

A training set d0 of size n0 and a test set d1 of size n1 = n − n0 are chosen (how is irrelevant under model (29)), rule f(x, d0) is computed and applied to d1, generating an error estimate

    Êrr_n0 = (1/n1) Σ_{d1} L(y_i, f(x_i, d0)),   (30)

L some loss function like squared error or counting error. Then, under model (29), Êrr_n0 is an unbiased estimate of

    Err_n0(F) = E_F{Êrr_n0},   (31)

the average prediction error of a rule15 f(x, d0) formed from n0 draws from F.

Concept drift can be interpreted as a change in the data-generating mechanism (29), say F changing to some new distribution F̃, as seems the likely culprit in the prostate cancer example of Figures 8 and 9.16 Traditional prediction methods are also vulnerable to such changes. In the neonate study, the logistic regression rule based on the year 1 data had a cross-validated error rate of 20% which increased to 22% when applied to the year 2 data.

The story is more complicated for the contrived example of Figures 10 and 11, where model (29) does not strictly apply. There the effective predictor variables are ephemeral, blooming and fading over short time periods. A reasonable conjecture (but no more than that) would say the weak learners of the pure prediction algorithms are prone to ephemerality, or at least are more prone than the "main effects" kind of predictors favored in traditional methodology. Whether or not this is true, I feel there is some danger in constructing training/test sets by random selection, and that their error estimates must be taken with a grain of statistical salt. To put things operationally, I'd worry about recommending the random forests prediction rule in Figure 5 to a friend concerned about prostate cancer.

This is more than a hypothetical concern. In their 2019 article, "Deep Neural Networks are Superior to Dermatologists in Melanoma Image Classification," Brinker et al. demonstrate just what the title says; the authors are justifiably cautious, recommending future studies for validation. Moreover, they acknowledge the limitations of using a randomly selected test set, along with the possible ephemerality of some of the algorithm's predictor variables. Frequent updating would be necessary for serious use of any such diagnostic algorithm, along with studies to show that certain subpopulations were not being misdiagnosed.17
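The unbiasedness of estimate (30) for its target (31) under the iid model (29) is easy to check numerically. A sketch with ordinary least squares as the rule f and squared-error loss; the linear form of F used here is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
p, n0, n1 = 3, 50, 50
beta_true = np.array([1.0, -1.0, 0.5])

def draw(n):                          # pairs (x_i, y_i) ~iid F, as in (29)
    x = rng.normal(size=(n, p))
    return x, x @ beta_true + rng.normal(size=n)

errs = []
for _ in range(500):
    x0, y0 = draw(n0)                 # training set d0
    x1, y1 = draw(n1)                 # test set d1
    beta_hat = np.linalg.lstsq(x0, y0, rcond=None)[0]   # rule f(x, d0): OLS fit
    errs.append(np.mean((y1 - x1 @ beta_hat) ** 2))     # Err_hat of (30)
# averaging over replications approximates Err_{n0}(F) in (31),
# here roughly sigma^2 * (1 + p/n0) = 1.06 for this toy F
err_f = np.mean(errs)
```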

14Yu and Kumbier (2019) proposed the useful distinction of "internal testing" versus "external testing."
15An important point is that "a rule" means one formed according to the algorithm of interest and the data-generating mechanism, not the specific rule f(x, d0) at hand; see Figure 12.3 of Efron and Hastie (2016).
16Cox, in his discussion of Breiman (2001), says of the applicability of model (29): "However, much prediction is not like this. Often the prediction is under quite different …conditions …[for example] what would be the effect on annual incidence of cancer in the United States of reducing by 10% the medical use of x-rays? etc."
17Facial recognition algorithms have been shown to possess gender, age, and race biases.

7. Smoothness

It was not just a happy coincidence that Newton's calculus accompanied Newton's laws of motion. The Newtonian world, as fleshed out by Laplace, is an infinitely smooth one in which


Figure 12. randomForest and gbm fits to the cholostyramine data of Figure 1 in Section 2. Heavy curve is cubic OLS; dashed curve in right panel is 8th degree OLS fit.

small changes in cause yield small changes in effect; a world where derivatives of many orders make physical sense. The parametric models of traditional statistical methodology enforce the smooth-world paradigm. Looking back at Figure 1 in Section 2, we might not agree with the exact shape of the cholostyramine cubic regression curve but the smoothness of the response seems unarguable: going from, say, 1 to 1.01 on the compliance scale should not much change the predicted cholesterol decrease.

Smoothness of response is not built into the pure prediction algorithms. The left panel of Figure 12 shows a randomForest estimate of cholesterol decrease as a function of normalized compliance. It roughly follows the OLS cubic curve but in a jagged, definitely unsmooth fashion. Algorithm gbm, in the right panel, gave a less jagged "curve" but still with substantial local discontinuity.

The choice of cubic in Figure 1 was made on the basis of a Cp comparison of polynomial regressions degrees 1 through 8, with cubic best. Both randomForest and gbm in Figure 12 began by taking x to be the 164 × 8 matrix poly(c,8) (in R notation), with c the vector of adjusted compliances—an 8th degree polynomial basis. The light dashed curve in the right panel is the 8th degree polynomial OLS fit, a pleasant surprise being how the gbm predictions follow it over much of the compliance range. Perhaps this is a hopeful harbinger of how prediction algorithms could be used as nonparametric regression estimates, but the problems get harder in higher dimensions.

Consider the supernova data: absolute brightness y_i has been recorded for each of n = 75 supernovas, as well as x_i a vector of spectral energy measurements at p = 25 different wavelengths, so the dataset is

    d = {x, y},   (32)

with x 75 × 25 and y a 75-vector. After some preprocessing, a reasonable model is

    y_i ~ind N(µ_i, 1).   (33)

It is desired to predict µ_i from x_i. Our data d is unusually favorable in that the 75 supernovas occurred near enough to Earth to allow straightforward determination of y_i without the use of x_i. However, this kind of determination is not usually available, while x_i is always observable; an accurate prediction rule

    ŷ_i = f(x_i, d)   (34)

would let astronomers better use Type 1a supernovas as "standard candles" in determining the distances to remote galaxies.18 In this situation, the smoothness of f(x, d) as a function of x would be a given.

Algorithms randomForest and gbm were fit to the supernova data (32). How smooth or jagged were they? For any two of the 75 cases, say i1 and i2, let {x_α} be the straight line connecting x_i1 and x_i2 in R^25,

    x_α = α x_i1 + (1 − α) x_i2   for α ∈ [0, 1],   (35)

and {ŷ_α} the corresponding predictions. A linear model would yield linear interpolation, ŷ_α = α ŷ_i1 + (1 − α) ŷ_i2.

Figure 13 graphs {ŷ_α} for three cases: i1 = 1 and i2 = 3, i1 = 1 and i2 = 39, and i1 = 39 and i2 = 65. The randomForest traces are notably eccentric, both locally and globally; gbm less so, but still far from smooth.19

There is no need for model smoothness in situations where the target objects are naturally discrete: movie recommendations, credit scores, chess moves. For scientific applications,

18The discovery of dark energy and the cosmological expansion of the universe involved treating Type 1a supernovas as always having the same absolute brightness, that is, as being perfect standard candles. This is not exactly true. The purpose of this analysis is to make the candles more standard by regression methods, and so improve the distance measurements underlying cosmic expansion. Efron and Hastie (2016) discussed a subset of this data in their Chapter 12.
19The relatively smoother results from gbm have to be weighed against the fact that it gave much worse predictions for the supernova data, greatly overshrinking the ŷ_i toward zero.
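The interpolation experiment (35) is easy to reproduce in miniature. The sketch below fits scikit-learn's RandomForestRegressor to synthetic data of the same shape as the supernova matrix (the data, the chosen case pair, and the settings are stand-ins, not the actual spectral measurements) and evaluates predictions along the line x_α:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n, p = 75, 25                             # supernova-sized problem, synthetic data
x = rng.normal(size=(n, p))
y = x[:, 0] - 0.5 * x[:, 1] + 0.1 * rng.normal(size=n)   # stand-in for model (33)

rf = RandomForestRegressor(n_estimators=500, random_state=1).fit(x, y)

# the straight line (35) between cases i1 and i2 in R^25, and predictions along it
i1, i2 = 0, 2
alphas = np.linspace(0, 1, 101)
x_alpha = alphas[:, None] * x[i1] + (1 - alphas[:, None]) * x[i2]
y_alpha = rf.predict(x_alpha)

# a linear rule would trace the line alpha*yhat_i1 + (1 - alpha)*yhat_i2;
# the random forest trace is piecewise constant, hence jagged
n_jumps = np.sum(np.diff(y_alpha) != 0)
```

Plotting y_alpha against alpha gives a step-like trace analogous to the randomForest panels of Figure 13.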


Figure 13. Interpolation between pairs of points in supernova data. Left side is randomForest, right side is gbm.

at least for some of them, smoothness will be important to a model's plausibility. As far as I know, there is no inherent reason that a pure prediction algorithm must give jagged results. Neural networks, which are essentially elaborate logistic regression programs, might be expected to yield smoother output.

8. A Comparison Checklist

Prediction is not the same as estimation, though the two are often conflated. Much of this article has concerned the differences. As a summary of what has gone before as well as a springboard for broader discussion, this section presents a checklist of important distinctions and what they mean in terms of statistical practice.

The new millennium got off to a strong start on the virtues of prediction with Leo Breiman's (2001) Statistical Science publication, "Statistical Modeling: The Two Cultures." In an energetic and passionate argument for the "algorithmic culture"—what I have been calling the pure prediction algorithms—Leo excoriated the "data modeling culture" (i.e., traditional methods) as of limited utility in the dawning world of Big Data. Professor David Cox, the lead discussant, countered with a characteristically balanced defense of mainstream statistics, not rejecting prediction algorithms but pointing out their limitations. I was the second discussant, somewhat skeptical of Leo's claims (which were effusive toward random forests, at that time new) but also somewhat impressed.

Breiman turned out to be more prescient than me: pure prediction algorithms have seized the statistical limelight in the twenty-first century, developing much along the lines Leo suggested. The present paper can be thought of as a continued effort on my part to answer the question of how prediction algorithms relate to traditional regression inference.

Table 5 displays a list of six criteria that distinguish traditional regression methods from the pure prediction algorithms. My previous "broad brush" warning needs to be made again: I am sure that exceptions can be found to all six distinctions, nor are the listed properties written in stone, the only implication being that they reflect current usage.

Criterion 1. Surface plus noise models are ubiquitous in traditional regression methodology, so much so that their absence is disconcerting in the pure prediction world. Neither surface nor noise is required as input to randomForest, gbm, or their kin. This is an enormous advantage for easy usage. Moreover, you cannot be using a wrong model if there is no model.

Table 5. A comparison checklist of differences between traditional regression methods and pure prediction algorithms.

   Traditional regression methods            Pure prediction algorithms
1. Surface plus noise models                 Direct prediction
   (continuous, smooth)                      (possibly discrete, jagged)
2. Scientific truth                          Empirical prediction accuracy
   (long-term)                               (possibly short-term)
3. Parametric modeling                       Nonparametric
   (causality)                               (black box)
4. Parsimonious modeling                     Anti-parsimony
   (researchers choose covariates)           (algorithm chooses predictors)
5. x n × p with p ≪ n                        n and p both possibly enormous, p ≫ n
   (homogeneous data)                        (mixed data)
6. Theory of optimal inference               Training/test paradigm
   (mle, Neyman–Pearson)                     (Common Task Framework)
NOTE: See commentary in the text.

A clinician dealing with possible prostate cancer cases will certainly be interested in effective prediction, but the disease's etiology will be of greater interest to an investigating scientist, and that's where traditional statistical methods come into their own. If random forests had been around since 1908 and somebody just invented regression model significance testing, the news media might now be heralding an era of "sharp data."

Eliminating surface-building from inference has a raft of downstream consequences, as discussed in what follows. One casualty is smoothness (Section 7). Applications of prediction algorithms have focused, to sensational effect, on discrete target spaces—Amazon recommendations, translation programs, driving directions—where smoothness is irrelevant. The natural desire to use them for scientific investigation may hasten development of smoother, more physically plausible algorithms.

Criterion 2. The two sides of Table 5 use similar fitting criteria—some version of least squares for quantitative responses—but they do so with different paradigms in mind. Following a 200-year-old scientific path, traditional regression methods aim to extract underlying truth from noisy data: perhaps not eternal truth but at least some takeaway message transcending current experience.

Without the need to model surface or noise mechanisms, scientific truth fades in importance on the prediction side of the table. There may not be any underlying truth. Prediction methods can be comfortable with ephemeral relationships that need only remain valid until the next update. To quote Breiman, "The theory in this field shifts focus from data models to the properties of algorithms," that is, from the physical world to the computer. Research in the prediction community, which is an enormous enterprise, is indeed heavily focused on computational properties of algorithms—in particular, how they behave as n and p become huge—and less on how they relate to models of data generation.

Criterion 3. Parametric modeling plays a central role in traditional methods of inference, while the prediction algorithms are nonparametric, as in (29). ("Nonparametric," however, can involve hosts of tuning parameters, millions of them in the case of deep learning, all relating to the algorithm rather than to data generation.) Lurking behind a parametric model is usually some notion of causality. In the cholostyramine example of Figure 1 in Section 2, we are likely to believe that increased ingestion of the drug cholostyramine causes cholesterol to decrease in a sigmoidal fashion, even if strict causality is elusive.20

Abandoning mathematical models comes close to abandoning the historic scientific goal of understanding nature. Breiman states the case bluntly:

   Data models are rarely used in this community [the algorithmic culture]. The approach is that nature produces data in a black box whose insides are complex, mysterious, and at least partly unknowable.21

The black-box approach has a scientifically anti-intellectual feeling but, on the other hand, scientific understanding may be beside the point if prediction is the only goal. Machine translation offers a useful case study, where there has been a several-decade conflict between approaches based on linguistic analysis of language structure and more-or-less pure prediction methods. Under the umbrella name of statistical machine translation (SMT), this latter approach has swept the field, with Google Translate, for example, currently using a deep learning prediction algorithm.

Traditional statistical education involves a heavy course of probability theory. Probability occupies a smaller portion of the nonparametric pure-prediction viewpoint, with probabilistically simple techniques such as cross-validation and the bootstrap shouldering the methodological burden. Mosteller and Tukey's (1977) book, Data Analysis and Regression: A Second Course in Statistics, favored a nonprobabilistic approach to inference that would be congenial to a modern course in machine learning.

20Efron and Feldman (1991) struggled to make a causality argument, one not accepted uncritically by subsequent authors.
21Cox counters: "Formal models are useful and often almost, if not quite, essential for incisive thinking."

Criterion 4. The 11 neonate predictor variables in Table 1 were winnowed down from an initial list of 81, following a familiar path of preliminary testing and discussions with the medical scientists. Parsimonious modeling is a characteristic feature of traditional methodology. It can be crucial for estimation and, especially, for attribution, where it is usually true that the power of discovery decreases as the list of predictors grows.

The pure prediction world is anti-parsimonious. Control of the predictor set, or the "features" as they are called, passes from the statistician to the algorithm, which can coin highly interactive new features such as random forests' tree variables. "The more predictor variables, the more information," said Breiman, an especially accurate forecast of the deep learning era.

I was doubtful. My commentary on Breiman's paper began: "At first glance Leo Breiman's stimulating paper looks like an argument against parsimony and scientific insight, and in favor of black boxes with lots of knobs to twiddle. At second glance it still looks that way, but the paper is stimulating …." Impressive results like the randomForest and gbm predictions for the prostate cancer data, Figures 5 and 6 in Section 4, certainly back up Leo's claim. But it is still possible to have reservations. The coined features seem here to be of the weak learner variety, perhaps inherently more ephemeral than the putative strong learners of Table 1.

This is the suggestion made in Section 6. If the prediction algorithms work by clever combinations of armies of weak learners, then they will be more useful for prediction than estimation or, especially, for attribution (as suggested in Section 5). "Short-term science" is an oxymoron. The use of prediction algorithms for scientific discovery will depend on demonstrations of their longer-term validity.

Criterion 5. Traditional applications ask that the n × p data matrix x (n subjects, p predictors) have n substantially greater than p, perhaps n > 5·p, in what is now called "tall data." The neonate data with n = 812 and p = 12 (counting the intercept) is on firm ground; less firm is the supernova data of Section 7, with n = 75 and p = 25. On the other side of Table 5 the pure prediction algorithms allow, and even encourage, "wide data," with p ≫ n. The prostate cancer microarray study is notably wide, with n = 102 and p = 6033. Even if we begin with tall data, as with the cholostyramine example, the prediction algorithms widen it by the coining of new features.

How do the prediction algorithms avoid overfitting in a p ≫ n situation? There are various answers, none of them completely convincing: first of all, using a test set guarantees an honest assessment of error (but see the discussion of Criterion 6). Second, most of the algorithms employ cross-validation checks during the training phase. Finally, there is an active research area that purports to show a "self-regularizing" property of the algorithms such that even running one of them long past the point where the training data are perfectly fit, as in Figure 9 of Section 6, will still produce reasonable predictions.22

Estimation and, particularly, attribution work best with homogeneous datasets, where the (x, y) pairs come from a narrowly defined population. A randomized clinical trial, where the subjects are chosen from a specific disease category, exemplifies strict homogeneity. Not requiring homogeneity makes prediction algorithms more widely applicable, and is a virtue in terms of generalizability of results, but a defect for interpretability.

The impressive scalability of pure prediction algorithms, which allows them to produce results even for enormous values of n and p, is a dangerous virtue. It has led to a lust for ever larger training sets. This has a good effect on prediction, making the task more interpolative and less extrapolative (i.e., more like Figures 5 and 6, and less like Figures 8 and 9) but muddies attempts at attribution.23

Traditional regression methods take the matrix of predictors x as a fixed ancillary statistic. This greatly simplifies the theory of parametric regression models; x is just as random as y in the pure prediction world, the only probability model being the iid nature of the pairs (x, y) ~ F. Theory is more difficult in this world, encouraging the empirical emphasis discussed in Criterion 6. Bayesian statistics is diminished in the a-probabilistic prediction world, leaving a tacit frequentist basis as the theoretical underpinning.

Criterion 6. Traditional statistical practice is based on a century of theoretical development. Maximum likelihood estimation and the Neyman–Pearson lemma are optimality criteria that guide applied methodology. On the prediction side of Table 5, theoretical efficiency is replaced by empirical methods, particularly training/test error estimates. This has the virtue of dispensing with theoretical modeling, but the lack of a firm theoretical structure has led to "many flowers blooming": the popular pure prediction algorithms are completely different from each other. During the past quarter-century, first neural nets then support vector machines, boosting, random forests, and a reprise of neural nets in their deep learning form have all enjoyed the prediction spotlight. In the absence of theoretical guidance we can probably expect more.

In place of theoretical criteria, various prediction competitions have been used to grade algorithms in the so-called "Common Task Framework." The common tasks revolve around some well-known datasets, that of the Netflix movie recommendation data being best known. None of this is a good substitute for a so-far nonexistent theory of optimal prediction.24

Test sets are an honest vehicle for estimating prediction error, but choosing the test set by random selection from the full set d (1) may weaken the inference. Even modest amounts of concept drift can considerably increase the actual prediction error, as in the prostate data microarray example of Section 6. In some situations there are alternatives to random selection, for example, by selecting training and test according to early and late collection dates, as in Figures 8 and 9. In the supernova data of Section 7, the goal is to apply a prediction rule to supernovas much farther from Earth, so choosing the more distant cases for the test set could be prudent.

In 1914 the noted astronomer Arthur Eddington,25 an excellent statistician, suggested that mean absolute deviation rather than root mean square would be more efficient for estimating a standard error from normally distributed data. Fisher responded in 1920 by showing that not only was root mean square better than mean absolute deviation, it was better than any other possible estimator, this being an early example of his theory of sufficiency.
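The Eddington/Fisher disagreement can be replayed by simulation: estimate σ from normal samples both ways and compare sampling variances. A numpy sketch; the sample size and replication count are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 100, 20000
x = rng.normal(size=(reps, n))        # reps samples of n standard normal observations

dev = x - x.mean(axis=1, keepdims=True)
rms = np.sqrt((dev ** 2).mean(axis=1))               # root mean square estimate
mad = np.abs(dev).mean(axis=1) * np.sqrt(np.pi / 2)  # mean absolute deviation, rescaled

# Fisher (1920): the root mean square estimator is more efficient; the
# mean-absolute-deviation estimator's asymptotic relative efficiency is about 0.88
var_rms, var_mad = rms.var(), mad.var()
```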

22For instance, in an OLS fitting problem with p > n where the usual estimate β̂ = (xᵀx)⁻¹xᵀy is not available, the algorithm should converge to the β̂ that fits the data perfectly, y = xβ̂, and has minimum norm ∥β̂∥; see Hastie et al. (2019).
23An experienced statistician will stop reading an article that begins, "Over one million people were asked…," knowing that a random sample of 1000 would be greatly preferable. This bit of statistical folk wisdom is in danger of being lost in the Big Data era. In an otherwise informative popular book titled, of course, Big Data, the authors lose all equilibrium on the question of sample size, advocating for n = all: all the flu cases in the country, all the books on Amazon.com, all possible dog/cat pictures. "Reaching for a random sample in the age of big data is like clutching at a horsewhip in the era of the motor car." In fairness, the book's examples of n = all are actually narrowly defined, for example, all the street manholes in Manhattan.
24Bayes' rule offers such a theory, but at a cost in assumptions far outside the limits of the current prediction environment.
25Later famous for his astronomical verification of Einstein's theory of relativity.
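The minimum-norm property described in footnote 22 can be verified directly: with p > n the pseudoinverse solution interpolates the training data, and any other interpolating coefficient vector has larger norm. A numpy sketch with arbitrary synthetic x and y:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                        # p > n: x'x is singular, usual OLS unavailable
x = rng.normal(size=(n, p))
y = rng.normal(size=n)

beta = np.linalg.pinv(x) @ y          # minimum-norm interpolating solution
assert np.allclose(x @ beta, y)       # fits the training data perfectly: y = x beta

# any other interpolator beta + v, with v in the null space of x, has larger norm
v = rng.normal(size=p)
v = v - np.linalg.pinv(x) @ (x @ v)   # project v into the null space of x
other = beta + v
assert np.allclose(x @ other, y)
assert np.linalg.norm(other) > np.linalg.norm(beta)
```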

Traditional methods are founded on these kinds of parametric insights. The two sides of Table 5 are playing by different rules: the left side functions in a Switzerland of inference, comparatively well ordered and mapped out, while Wild West exuberance thrives on the right. Both sides have much to gain from commerce. Before the 1920s, statisticians did not really understand estimation, and after Fisher's work, we did. We are in the same situation now with the large-scale prediction algorithms: lots of good ideas and excitement, without principled understanding, but progress may be in the air.

9. Traditional Methods in the Wide Data Era

The success of the pure prediction algorithms has had a stimulating effect on traditional theory and practice. The theory, forged in the first half of the twentieth century, was tall-data oriented: small values of n, but even smaller p, often just p = 1 or 2. Whether or not one likes prediction algorithms, parts of modern science have moved into the wide-data era. In response, traditional methods have been stretching to fit this new world. Three examples follow.

Big data is not the sole possession of prediction algorithms. Computational genetics can go very big, particularly in the form of a GWAS, genome-wide association study. An impressive example was given by Ikram et al. (2010), in a study concerning the narrowing of blood vessels in the eye.^26 The amount of narrowing was measured for n = 15,358 individuals; each individual had their genome assessed for about p = 10^6 SNPs (single-nucleotide polymorphisms), a typical SNP having a certain choice of ATCG value that occurs in a majority of the population or a minor, less prevalent alternative value. The goal was to find SNPs associated with vascular narrowing.

With x a 15,358 × 10^6 matrix we are definitely in big data and wide data territory. Surface plus noise models seem out of the question here. Instead, each SNP was considered separately: a linear regression was carried out, with the predictor variable the number of minor polymorphisms in the chromosome pair at that location—0, 1, or 2 for each individual—and response his or her narrowing measure. This gave a p-value p_i against the null hypothesis: polymorphism at location i has no effect on narrowing, i = 1, 2, ..., 10^6. The Bonferroni threshold for 0.05 significance is

    p_i ≤ 0.05/10^6.    (36)

Ikram et al. (2010) displayed their results in a "Manhattan plot" with z_i = −log10(p_i) graphed against location on the genome. Threshold (36) corresponds to z_i ≥ 7.3; 179 of the 10^6 SNPs had z_i > 7.3, rejecting the null hypothesis of no effect. These were bunched into five locations on the genome, one of which was borderline insignificant. The authors claimed credit for discovering four novel loci. These might represent four genes implicated in vascular narrowing (though a spike in chromosome 12 is shown to spread over a few adjacent genes).

Instead of performing a traditional attribution analysis with p = 10^6 predictors, the GWAS procedure performed 10^6 analyses with p = 1 and then used a second layer of inference to interpret the results of the first layer. My next example concerns a more elaborate implementation of the two-layer strategy.

While not 10^6, the p = 6033 features of the prostate cancer microarray study in Section 4 are enough to discourage an overall surface plus noise model. Instead we begin with a separate p = 1 analysis for each of the genes, as in the GWAS example. The data (16) for the jth gene is

    d_j = {x_ij : i = 1, 2, ..., 102},    (37)

with i = 1, 2, ..., 50 for the normal controls and i = 51, 52, ..., 102 for the cancer patients.

Under normality assumptions, we can compute statistics z_j comparing patients with controls which satisfy, to a good approximation,^27

    z_j ∼ N(δ_j, 1),    (38)

where δ_j is the effect size for gene j: δ_j equals 0 for "null genes," genes that show the same genetic activity in patients and controls, while |δ_j| is large for the kinds of genes being sought, namely, those having much different responses for patients versus controls.

Inferences for the individual genes by themselves are immediate. For instance,

    p_j = 2Φ(−|z_j|)    (39)

is the two-sided p-value for testing δ_j = 0. However, this ignores having 6033 p-values to interpret simultaneously. As with the GWAS, a second layer of inference is needed.

A Bayesian analysis would hypothesize a prior "density" g(δ) for the effect size, where g includes an atom of probability π_0 at δ = 0 to account for the null genes. Probably, most of the genes have nothing to do with prostate cancer so π_0 is assumed to be near 1. The local false discovery rate fdr(z)—that is, the probability of a gene being null given z-value z—is, according to Bayes rule,

    fdr(z) = π_0 φ(z)/f(z) ≐ φ(z)/f(z),    (40)

where φ(z) = exp{−z²/2}/√(2π), and f(z) is the marginal density of z,

    f(z) = ∫_{−∞}^{∞} φ(z − δ) g(δ) dδ.    (41)

The prior g(δ) is most often unknown. An empirical Bayes analysis supposes f(z) to be in some parametric family f_β(z); the MLE β̂ is obtained by fitting f_β(·) to {z_1, z_2, ..., z_p}, the observed collection of all 6033 z-values, giving an estimated false discovery rate

    f̂dr(z) = φ(z)/f_β̂(z).    (42)

^26 Microvascular narrowing is thought to contribute to heart attacks, but it is difficult to observe in the heart; observation is much easier in the eye.

^27 If t_j is the two-sample t-statistic comparing patients with controls, we take z_j = Φ^{−1}F_100(t_j), where F_100 is the cdf of a t-statistic with 100 degrees of freedom and Φ is the standard normal cdf. Effect size δ_j is a monotone function of the difference in expectations between patients and controls; see Section 7.4 of Efron (2010).
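The per-SNP arithmetic above is simple enough to sketch directly. This is a minimal illustration of the two-sided p-value (39), the Bonferroni threshold (36), and the Manhattan-plot height z_i = −log10(p_i); the function names are ours, and the numbers are the ones quoted in the text, not a reanalysis of the GWAS data.

```python
import math

def two_sided_p(z):
    """Two-sided p-value 2*Phi(-|z|) for an approximately N(0,1) test statistic."""
    return 2.0 * 0.5 * (1.0 + math.erf(-abs(z) / math.sqrt(2.0)))

P = 10**6                          # SNPs, each analyzed separately with p = 1
bonferroni = 0.05 / P              # threshold (36)
height = -math.log10(bonferroni)   # Manhattan-plot cutoff, about 7.3 as in the text
```
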

Figure 14. Estimated local false discovery curve f̂dr(z) and posterior effect size estimate Ê{δ | z} from empirical Bayes analysis of prostate cancer data (the latter divided by 4 for display purposes). Triangles indicate 29 genes having f̂dr(z) ≤ 0.20; circles are 29 most significant genes from glmnet analysis.

The dashed curve in Figure 14 shows f̂dr(z) based on a fifth-degree log-polynomial model for f_β,

    log{f_β(z)} = β_0 + Σ_{k=1}^{5} β_k z^k;    (43)

f̂dr(z) is seen to be near 1.0 for |z| ≤ 2 (i.e., gene almost certainly null) and declines to zero as |z| grows large, for instance, equaling 0.129 at z = 4. The conventional threshold for attributing significance is f̂dr(z) ≤ 0.20; 29 genes achieved this, as indicated by the triangles in Figure 14.

We can also estimate the expected effect size. Tweedie's formula (Efron 2011) gives a simple expression for the posterior expectation,

    E{δ | z} = z + (d/dz) log f(z),    (44)

f(z) the marginal density (41). Substituting f_β̂ for f gave the estimate Ê{δ | z} in Figure 14. It is nearly zero for |z| ≤ 2, rising to 2.30 at z = 4.

By using a two-level hierarchical model, the empirical Bayes analysis reduces the situation from p = 6033 to p = 5. We are back in the comfort zone for traditional methods, where parametric modeling for estimation and attribution works well. Both are illustrated in Figure 14.

Sparsity offers another approach to wide-data estimation and attribution: we assume that most of the p predictor variables have no effect and concentrate effort on finding the few important ones. The lasso (Tibshirani 1996) provides a key methodology. In an OLS type problem we estimate β, the p-vector of regression coefficients, by minimizing

    (1/n) Σ_{i=1}^{n} (y_i − x_i^t β)² + λ‖β‖_1,    (45)

where ‖β‖_1 = Σ_{j=1}^{p} |β_j|. Here λ is a fixed tuning parameter: λ = 0 corresponds to the ordinary least squares solution for β (if p ≤ n) while λ = ∞ makes β̂ = 0. For large values of λ, only a few of the coordinates β̂_j will be nonzero. The algorithm begins at λ = ∞ and decreases λ, admitting one new nonzero coordinate β̂_j at a time. This works even if p > n.

The lasso was applied to the supernova data of Section 7, where x has n = 75 and p = 25. Figure 15 shows the first six steps, tracking the nonzero coefficients β̂_j as new variables were added. Predictor 15 was selected first, then 16, 18, 22, 8, and 6, going on to the full OLS solution β̂ at step 25. An accuracy formula suggested step 4, with the only nonzero coefficients 15, 16, 18, and 22, as giving the best fit. (These correspond to energy measurements in the iron portion of the spectrum.)

Sparsity and the lasso take us in a direction opposite to the pure prediction algorithms. Rather than combining a myriad of weak predictors, inference is based on a few of the strongest explanatory variables. This is well suited to attribution but less so for prediction.

An R program for the lasso, glmnet, was applied to the prostate cancer prediction problem of Section 4, using the same training/test split as that for Figure 5. It performed much worse than randomForest, making 13 errors on the test set. Applied to the entire dataset of 102 men, however, glmnet gave useful indications of important genes: the circles in Figure 14 show z-values for the 29 genes it ranked as most influential. These have large values of |z_i|, even though the algorithm did not "know" ahead of time to take t-statistics between the cancer and control groups.

The lasso produced biased estimates of β, with the coordinate values β̂_j shrunk toward zero. The criticism leveled at prediction methods also applies here: biased estimation is not yet on a firm theoretical footing.
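Formulas (40), (41), and (44) can be exercised numerically without any model fitting. The sketch below assumes a two-point prior (probability π_0 = 0.95 at δ = 0 and the rest at δ = 3) so that the marginal density (41) is available in closed form; these are our illustrative choices, not the estimated f_β̂ of the prostate study, so the resulting numbers differ from those in Figure 14.

```python
import math

PI0, DELTA = 0.95, 3.0             # assumed two-point prior, for illustration only

def phi(z):                        # standard normal density
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def f(z):                          # marginal density (41) under the two-point prior
    return PI0 * phi(z) + (1.0 - PI0) * phi(z - DELTA)

def fdr(z):                        # local false discovery rate (40)
    return PI0 * phi(z) / f(z)

def post_mean(z, h=1e-5):          # Tweedie's formula (44), derivative taken numerically
    return z + (math.log(f(z + h)) - math.log(f(z - h))) / (2.0 * h)
```

Near z = 0 the fdr is essentially 1 (almost certainly null) and the Tweedie estimate is shrunk to nearly 0, while at z = 4 the fdr is small and the posterior mean climbs toward δ = 3: the same qualitative shape as the two curves in Figure 14.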


Figure 15. First 6 steps of the lasso algorithm applied to the supernova data; coefficients of various predictors are plotted as a function of step size. Predictor 15 was chosen first, followed by 16, 18, 22, 8, and 6. Stopping at step 4 gives the lowest Cp estimate of estimation error.
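The one-coordinate-at-a-time behavior seen in Figure 15 is easiest to verify in the orthonormal-design special case, where minimizing (45) has a closed-form solution: each OLS coefficient is soft-thresholded by λ. The coefficients below are hypothetical, chosen only to reproduce the staircase pattern; this is not the supernova computation.

```python
def soft_threshold(beta_ols, lam):
    """Lasso solution for an orthonormal design: shrink each OLS coefficient toward 0."""
    return [max(abs(b) - lam, 0.0) * (1 if b > 0 else -1) for b in beta_ols]

beta_ols = [0.9, -0.2, 0.5, 0.05]          # hypothetical OLS coefficients
path = [sum(b != 0.0 for b in soft_threshold(beta_ols, lam))
        for lam in (1.0, 0.8, 0.4, 0.1, 0.0)]
# As lambda decreases from a large value toward 0, coordinates enter one at a time.
```
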

10. Two Hopeful Trends

This was not meant to be an "emperor has no clothes" kind of story, rather "the emperor has nice clothes but they're not suitable for every occasion." Where they are suitable, the pure prediction algorithms can be stunningly successful. When one reads an enthusiastic AI-related story in the press, there's usually one of these algorithms, operating at enormous scale, doing the heavy lifting. Regression methods have come a long and big way since the time of Gauss.

Much of this article has been concerned with what the prediction algorithms cannot do, at least not in their present formulations. Their complex "black box" nature makes the algorithms difficult to critique. Here I have tried to use relatively small datasets (by prediction literature standards) to illustrate their differences from traditional methods of estimation and attribution. The criticisms, most of which will not come as a surprise to the prediction community, were summarized in the six criteria of Table 5 in Section 8.

Some time around the year 2000 a split opened up in the world of statistics.^28 For the discussion here we can call the two branches "pure prediction" and "GWAS": both accommodating huge datasets, but with the former having become fully algorithmic while the latter stayed on a more traditional math-modeling path. The "two hopeful trends" in the title of this section refer to attempts at reunification, admittedly not yet very far along.

Trend 1 aims to make the output of a prediction algorithm more interpretable, that is, more like the output of traditional statistical methods. Interpretable surfaces, particularly those of linear models, serve as the ideal for this achievement. Something like attribution is also desired, though usually not in the specific sense of statistical significance.

One tactic is to use traditional methods for the analysis of a prediction algorithm's output; see Hara and Hayashi (2016) and Efron and Hastie (2016, p. 346). Wager, Hastie, and Efron (2014) used bootstrap and jackknife ideas to develop standard error calculations for random forest predictions. Murdoch et al. (2019) and Vellido, Martín-Guerrero, and Lisboa (2012) provided overviews of interpretability, though neither focused on the pure prediction algorithms. Using information theoretic ideas, Achille and Soatto (2018) discussed statistical sufficiency measures for prediction algorithms.

Going in the other direction, Trend 2 moves from left to right in Table 5, hoping to achieve at least some of the advantages of prediction algorithms within a traditional framework. An obvious target is scalability. Qian et al. (2019) provided a glmnet example with n = 500,000 and p = 800,000. Hastie, Tibshirani, and Friedman (2009) successfully connected boosting to logistic regression. The traditional parametric model that has most currency in the prediction world is logistic regression, so it is reasonable to hope for reunification progress in that area.

"Aspirational" might be a more accurate word than "hopeful" for this section's title. The gulf seen in Table 5 is wide and the reunification project, if going at all, is just underway. In my opinion, the impediments are theoretical ones. Maximum likelihood theory provides a lower bound on the accuracy of an estimate, and a practical way of nearly achieving it. What can we say about prediction? The Common Task Framework often shows just small differences in error rates among the contestants, but with no way of knowing whether some other algorithm might do much better. In short, we do not have an optimality theory for prediction.

The talks I hear these days, both in statistics and biostatistics, bristle with energy and interest in prediction algorithms. Much of the current algorithmic development has come from outside the statistics discipline but I believe that future progress, especially in scientific applicability, will depend heavily on us.

^28 See the triangle diagram in the epilogue of Efron and Hastie (2016).

Appendix A

A.1. Out-of-Bag (OOB) Estimates, Section 3

A random forest application involves B nonparametric bootstrap samples from the original dataset d (1), say d*1, d*2, ..., d*B. Let

    ŷ_ik = f(x_i, d*k),   i = 1, 2, ..., n and k = 1, 2, ..., B,    (A.1)

be the estimate for the ith case obtained from the prediction rule based on the kth bootstrap sample. Also, let

    N_ik = number of times case (x_i, y_i) appears in d*k.    (A.2)

The OOB estimate of prediction error is

    Êrr_0 = Σ_{k=1}^{B} Σ_{i: N_ik = 0} L(y_i, ŷ_ik) / Σ_{k=1}^{B} Σ_{i: N_ik = 0} 1,    (A.3)

where L(y_i, ŷ_ik) is the loss function; that is, Êrr_0 is the average loss over all cases in all bootstrap samples where the sample did not include the case. ("Bagging" stands for "bootstrap aggregation," the averaging of the tree predictors employed in random forests.) The idea behind (A.3) is that cases having N_ik = 0 are "out of the bag" of d*k, and so form a natural cross-validation set for error estimation.

A.2. Prediction Is Easier Than Attribution

We wish to motivate (23)–(24), beginning with model (21)–(22). Let x̄_0j be the average of the n/2 healthy control measurements for gene j, and likewise x̄_1j for the n/2 sick patient measurements. Then we can compute z values

    z_j = c(x̄_1j − x̄_0j) ∼ind N(δ_j, 1)    (A.4)

for j = 1, 2, ..., N. Letting π_0 = N_0/N and π_1 = N_1/N be the proportions of null and nonnull genes, (22) says that the z_j take on two possible densities, in proportions π_0 and π_1,

    π_0 : z ∼ f_0(z)   or   π_1 : z ∼ f_1(z),    (A.5)

where f_0(z) is the standard normal density φ(z), and f_1(z) = φ(z − Δ). With both N_0 and N_1 going to infinity, we can and will think of π_0 and π_1 as prior probabilities for null and nonnull genes.

The posterior moments of δ_j given z_j, obtained by applying Bayes rule to model (A.4), have simple expressions in terms of the log derivatives of the marginal density

    f(z) = π_0 f_0(z) + π_1 f_1(z),    (A.6)

    d(z) ≡ E{δ | z} = z + (d/dz) log f(z),

and

    v(z) ≡ var{δ | z} = 1 + (d²/dz²) log f(z).    (A.7)

See Efron (2011). That is,

    δ_j | z_j ∼ (d_j, v_j),    (A.8)

where the parenthetical notation indicates the mean and variance of a random quantity and d_j = d(z_j), v_j = v(z_j). Then X_j ∼ N(δ_j/(2c), 1) for the sick subjects in (21) gives

    X_j | z_j ∼ind (d_j/(2c), v_j/(4c²) + 1).    (A.9)

(The calculations which follow continue to focus on the "plus" case of (21).) A linear combination

    S = Σ_{j=1}^{N} w_j X_j    (A.10)

has posterior mean and variance

    S ∼ (Σ_{j=1}^{N} w_j A_j, Σ_{j=1}^{N} w_j² B_j),    (A.11)

with

    A_j = d_j/(2c)   and   B_j = v_j/(4c²) + 1.

For the "minus" arm of (21) (the healthy controls), S ∼ (−Σ w_j A_j, Σ w_j² B_j). We can use S as a prediction statistic, predicting sick for S > 0 and healthy for S < 0. The probability of a correct prediction depends on

    R² = mean(S)²/var(S) = (Σ_j w_j A_j)² / (Σ_j w_j² B_j).    (A.12)

This is maximized for w_j = A_j/B_j (as with Fisher's linear discriminant function), yielding

    S ∼ (± Σ_{1}^{N} A_j²/B_j, Σ_{1}^{N} A_j²/B_j)   and   R² = Σ_{1}^{N} A_j²/B_j.    (A.13)

A normal approximation for the distribution of S gives

    Φ(−R)    (A.14)

as the approximate probability of a prediction error of either kind; see Efron (2009). It remains to calculate R².

Lemma 1. A continuous approximation to the sum R² = Σ_{1}^{N} A_j²/B_j is

    R² ≐ (N_1²/N_0) (Δ²/(4c²)) ∫_{−∞}^{∞} f_1(z) / {[N_1/N_0 + f_0(z)/f_1(z)] [1 + v(z)/(4c²)]} dz.    (A.15)

Before verifying the lemma, we note that it implies (23): letting N_0 → ∞ with N_1 = O(N_0^{1/2}), and using (A.4), gives

    R² ≐ (N_1²/N_0) (Δ²/(4c²)) ∫_{−∞}^{∞} exp{−(z²/2) + 2Δz − Δ²} / {√(2π) [1 + v(z)/(4c²)]} dz.    (A.16)

The variance v(z) is a bounded quantity under (A.4)—it equals 0.25 for Δ = 1, for instance—so the integral is a finite positive number, say I(Δ). If N_1 = γ N_0^{1/2}, then the prediction error probabilities are approximately Φ(−γ I(Δ)^{1/2}) according to (A.14). However, N_1 = o(N_0^{1/2}) gives R² → 0 and error probabilities → 1/2.

It remains to verify the lemma. Since any δ_j equals either 0 or Δ (22),

    d(z) = E{δ | z} = Pr{δ ≠ 0 | z} Δ = tdr(z) Δ,    (A.17)

where tdr(z) is the true discovery rate Pr{δ ≠ 0 | z},

    tdr(z) = π_1 f_1(z) / (π_0 f_0(z) + π_1 f_1(z)) = 1 / (1 + (N_0/N_1) f_0(z)/f_1(z)).    (A.18)
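The out-of-bag bookkeeping of (A.1)–(A.3) can be sketched in a few lines. The "prediction rule" here is a majority vote over the bootstrap sample, a deliberately crude stand-in for a regression tree, and the labels are toy data; only the accounting of which cases are out of the bag for which sample follows (A.3).

```python
import random

def oob_error(y, B=200, seed=0):
    """Out-of-bag error (A.3): average 0-1 loss over cases absent from each bootstrap sample."""
    rng = random.Random(seed)
    n = len(y)
    loss, count = 0, 0
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]           # bootstrap sample d*k
        pred = 1 if 2 * sum(y[i] for i in idx) >= n else 0   # majority-vote "tree"
        in_bag = set(idx)
        for i in range(n):
            if i not in in_bag:                              # N_ik = 0: case i is out of bag
                loss += (y[i] != pred)
                count += 1
    return loss / count

y = [0] * 30 + [1] * 10       # toy labels
err = oob_error(y)            # roughly the minority-class fraction for this trivial rule
```
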

From (A.11) and (A.13), we get

    R² = Σ_{1}^{N} A_j²/B_j = Σ_{1}^{N} d_j²/(4c² + v_j)
       = Σ_{1}^{N} tdr(z_j)² Δ²/(4c² + v_j) ≐ N ∫_{−∞}^{∞} [tdr(z)² Δ²/(4c² + v(z))] f(z) dz,    (A.19)

the last equality being the asymptotic limit of the discrete sum. Since tdr(z)f(z) = π_1 f_1(z) = (N_1/N) f_1(z), (A.19) becomes

    R² = N_1 ∫_{−∞}^{∞} [tdr(z) Δ²/(4c² + v(z))] f_1(z) dz.    (A.20)

Substituting (A.18) for tdr(z) then gives the lemma. Moreover, unless N_0 = O(N_1), (A.18) shows that tdr(z) → 0 for all z, supporting (24).

Funding

The author's work is supported in part by an award from the National Science Foundation (DMS 1608182).

References

Achille, A., and Soatto, S. (2018), "Emergence of Invariance and Disentanglement in Deep Representations," Journal of Machine Learning Research, 19, 1–34.
Breiman, L. (2001), "Statistical Modeling: The Two Cultures," Statistical Science, 16, 199–231.
Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984), Classification and Regression Trees, Wadsworth Statistics/Probability Series, Belmont, CA: Wadsworth Advanced Books and Software.
Brinker, T. J., Hekler, A., Enk, A. H., Berking, C., Haferkamp, S., Hauschild, A., Weichenthal, M., Klode, J., Schadendorf, D., Holland-Letz, T., von Kalle, C., Fröhling, S., Schilling, B., and Utikal, J. S. (2019), "Deep Neural Networks Are Superior to Dermatologists in Melanoma Image Classification," European Journal of Cancer, 119, 11–17.
Efron, B. (2009), "Empirical Bayes Estimates for Large-Scale Prediction Problems," Journal of the American Statistical Association, 104, 1015–1028.
——— (2010), Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction (Vol. 1), Institute of Mathematical Statistics Monographs, Cambridge: Cambridge University Press.
——— (2011), "Tweedie's Formula and Selection Bias," Journal of the American Statistical Association, 106, 1602–1614.
Efron, B., and Feldman, D. (1991), "Compliance as an Explanatory Variable in Clinical Trials," Journal of the American Statistical Association, 86, 9–17.
Efron, B., and Hastie, T. (2016), Computer Age Statistical Inference: Algorithms, Evidence, and Data Science, Institute of Mathematical Statistics Monographs, Cambridge: Cambridge University Press.
Hara, S., and Hayashi, K. (2016), "Making Tree Ensembles Interpretable," arXiv no. 1606.05390.
Hastie, T., Montanari, A., Rosset, S., and Tibshirani, R. J. (2019), "Surprises in High-Dimensional Ridgeless Least Squares Interpolation," arXiv no. 1903.08560.
Hastie, T., Tibshirani, R., and Friedman, J. (2009), The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer Series in Statistics (2nd ed.), New York: Springer.
Ikram, M. K., Xueling, S., Jensen, R. A., Cotch, M. F., Hewitt, A. W., Ikram, M. A., Wang, J. J., Klein, R., Klein, B. E., Breteler, M. M., and Cheung, N. (2010), "Four Novel Loci (19q13, 6q24, 12q24, and 5q14) Influence the Microcirculation In Vivo," PLOS Genetics, 6, 1–12.
Johnstone, I. M., and Silverman, B. W. (2004), "Needles and Straw in Haystacks: Empirical Bayes Estimates of Possibly Sparse Sequences," Annals of Statistics, 32, 1594–1649.
Mediratta, R., Tazebew, A., Behl, R., Efron, B., Narasimhan, B., Teklu, A., Shehibo, A., Ayalew, M., and Kache, S. (2019), "Derivation and Validation of a Prognostic Score for Neonatal Mortality Upon Admission to a Neonatal Intensive Care Unit in Gondar, Ethiopia" (submitted).
Mosteller, F., and Tukey, J. (1977), Data Analysis and Regression: A Second Course in Statistics, Addison-Wesley Series in Behavioral Science: Quantitative Methods, Reading, MA: Addison-Wesley.
Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R., and Yu, B. (2019), "Interpretable Machine Learning: Definitions, Methods, and Applications," arXiv no. 1901.04592.
Qian, J., Du, W., Tanigawa, Y., Aguirre, M., Tibshirani, R., Rivas, M. A., and Hastie, T. (2019), "A Fast and Flexible Algorithm for Solving the Lasso in Large-Scale and Ultrahigh-Dimensional Problems," bioRxiv no. 630079.
Schmidt, C. (2019), "Real-Time Flu Tracking," Nature, 573, S58–S59.
Tibshirani, R. (1996), "Regression Shrinkage and Selection via the Lasso," Journal of the Royal Statistical Society, Series B, 58, 267–288.
Vellido, A., Martín-Guerrero, J. D., and Lisboa, P. J. G. (2012), "Making Machine Learning Models Interpretable," in Proceedings of the 20th European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning (ESANN 2012), Bruges, Belgium, April 25–27, 2012, pp. 163–172.
Wager, S., Hastie, T., and Efron, B. (2014), "Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife," Journal of Machine Learning Research, 15, 1625–1651.
Yu, B., and Kumbier, K. (2019), "Three Principles of Data Science: Predictability, Computability, and Stability (PCS)," arXiv no. 1901.08152.

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 2020, VOL. 115, NO. 530, 656–658: Comment
https://doi.org/10.1080/01621459.2020.1762618

Discussion of the Paper “Prediction, Estimation, and Attribution” by B. Efron

Emmanuel Candès^a,b and Chiara Sabatti^b,c
^a Department of Mathematics, Stanford University, Stanford, CA; ^b Department of Statistics, Stanford University, Stanford, CA; ^c Department of Biomedical Data Science, Stanford University, Stanford, CA

We enjoyed reading Professor Efron's (Brad) paper just as much as we enjoyed listening to his June 2019 lecture in Leiden. One of the core values underlying statistical research is in how it enables scientific understanding and discoveries, and we appreciate that Efron has placed "science" at the center of his article. Or at least, the discoveries of associations that are not just valid here and now—in the particular dataset I have at hand—but are likely to be stable and replicated under slightly different circumstances.

The successes of purely predictive algorithms (gradient boosting, deep learning) in object recognition, speech recognition, machine translation, and others, are real.¹ Incidentally, when one examines these successes as reported in the media, one observes that they all have in common a nearly infinite sample size (consider that WhatsApp records billions of conversations every year) and a nearly infinite signal-to-noise ratio (SNR) or, put differently, very little noise. If you give me a photograph, there is little debate as to whether there is a cow on it or not. These instances are far from the settings statisticians are accustomed to: moderate sample sizes, a higher degree of uncertainty, and variables/factors which have not been measured—prosaically, things we cannot see. Using Efron's example, it is not because I know how much you comply with a treatment that I can be certain about how much it will be of benefit to you.

Interestingly, Efron puts the pure predictive algorithms to the test in a scenario where the sample size is extremely moderate by today's standards (n = 102). In the prostate cancer microarray study, he observes the following: split the sample at random as to create a training and test set with the same number of cases and controls, and train a random forest. Perhaps to his surprise, random forests perform extremely well, correctly classifying all 51 test patients but one! Surely, we must have "learned" something valuable. Or have we? His second finding is this: if we split patients according to what we believe is the time of entry in the study, then the performance is spectacularly degraded; that is, he now records 12 wrong predictions corresponding to an error rate of about one in four. So what have we really learned? This is all the more discomforting since prediction rules are intended to be deployed in the second scenario, and not in the highly idealized setting where test and training samples are exchangeable or, said differently, where current data are representative of future data. As far as we are concerned, this is the key take-away message from his article.

Efron is right to point out that in case we observe good performance on a random train/test split, we should probably not cork champagne. He introduces a healthy dose of reality and we applaud him for this. In high-stakes applications, we should systematically and empirically test prediction rules and evaluate their robustness just as science tests models and hypotheses. This of course applies irrespective of the nature of the learning algorithm, for example, whether it is parametric or not. In Efron's example, there must be a clear shift in the distribution of the data and it is important to understand it if we had the intention of using the learned model in real life. We regard the modeling of distributional shifts and the performance monitoring of predictive models as one of the most important activities statisticians need to engage in. How do we evaluate algorithms beyond the simple metric of checking a classification error on a test set? In our view, algorithmic developments need perhaps to focus less on a single objective and more on appropriate metrics of stability/robustness and reproducibility.

We wrote that Efron put "science" at the center of his article. To maximize the resonance of this choice, and articulate the thoughts that the article provoked in us, we find it useful to recall what Tukey (1962) identified as three constituents of science: "(1) intellectual content, (2) organization into an understandable form, (3) reliance upon the test of experience as the ultimate standard of validity."

Efron contrasts model-less predictive rules with parametric models, the latter offering, in his view, several advantages that have to do with all the three constituents of science. Certainly the surface plus noise view of the world—where the surface is described by a sparse parametric model on whose estimation we are to focus—represents a "more understandable form" than the black-boxes at the core of many predictive rules. There are however limits to simplicity. The phenomena contemporary science studies are ever more complex and our data gathering is more comprehensive than ever before. To give an example, we have by now (practically) found all the genes behind Mendelian monogenic diseases and we are now focusing on complex diseases, which are going to be influenced by many genes, possibly interacting. Hence, the phenomena under study may be far too rich to be described with a few parameters. If this is the case, why would we want to estimate parameters that only exist in the analyst's mind? Similarly, if the model is incorrect, precise inference about these parameters is of dubious scientific merit.

In this context, it is easy to understand the allure of "prediction" as a tool to assess the validity of any strategy to learn from data: the focus is on observable variables, rather than on the estimation of some parameters that might not actually correspond to any existing entities. The importance of prediction in science is, of course, nothing new. In the scientific method, theories are tested by deriving predictions and cross checking with empirical data. The history of statistical reasoning itself documents an attention to prediction. An example is de Finetti's concept of exchangeable random variables, which emphasizes that the realizations of these random variables are all what we ever get to observe. In contrast, a model for how they are generated is by and large a construct of our imagination, and the rational validity of any mental model is predicated on the coherence of the rules with which we rely on past observations to predict the value of future ones (de Finetti 1992). More recently, and with a different emphasis, a related point is made by the authors of "veridical data science," who identify predictability as a core principle of data science (Yu and Kumbier 2020).

Without contradicting any of this, Efron points out that simply comparing the predictive performance of different strategies might not be as fruitful as one might think. It is hard to know when a predictive model works because everything is drowned in the incompressible randomness of new observations. Often, the difference between competing algorithms can be seen only on the second digits of misclassification error (expressed in percentage so the fourth digit), even if they encode substantially different pictures of reality. How can one learn from this?

We believe that the problem is not with "prediction" per se, but on how we measure performance: one has to take Tukey's call for "understandable form" and "test of experience" more seriously. For prediction to provide actionable information, we need to be able to answer questions such as: how accurate are the results of a prediction model? Does the margin of error change with the values of some predictors? There is much work to be done both in developing effective ways of measuring and communicating errors, and in designing algorithms that lead to robust performance. On the first front, resampling-based techniques (the bootstrap (Efron 1979)) and techniques based only upon an exchangeability assumption (the jackknife (Quenouille 1949, 1956), conformal inference (Vovk, Gammerman, and Saunders 1999; Papadopoulos et al. 2002; Lei and Wasserman 2014; Lei et al. 2018)) have already proven powerful. By now we have concrete and efficient methods for constructing predictive intervals Ĉ(x) from a training sample such that for a new test point (X_test, Y_test), we have

    P(Y_test ∈ Ĉ(X_test) | X_test ∈ 𝒳) ≥ 1 − α,

where the probability is averaged over both the training sample and the test point. (The statement above holds with the proviso that 𝒳 is not too complicated as in the case where 𝒳 is the whole sample space, see Barber et al. (2019a).) In short, we can use model-less learning algorithms (black boxes) to predict with confidence. This is an active area of research (Barber et al. 2019b; Romano, Patterson, and Candès 2019) and further advances can be expected. The careful reader might tick at the notion of exchangeable samples per our earlier discussion. We note that researchers take this issue seriously as well, see Tibshirani et al. (2019) which adapts the ideas of conformal inference to work under covariate shifts. In the latter reference, parametric models are found to be useful to capture distributional shifts.

Efron uses the term "attribution" to describe the goal that is often pursued with tests of hypotheses. We like the word choice and we agree that attribution is important. Being able to identify at least the variables that are stably associated with an outcome both increases our understanding and offers a precise way to adopt the test of experience as the ultimate standard of validity. There is not much utility in capturing "ephemeral" associations. The "replicability crisis," which has received so much attention in both the technical and popular media, makes it abundantly clear. The good news is that one can perform attribution without relying on the truth of a parametric model. In the knockoff framework, one can use any black-box (any predictive model) to identify any kind of possible association—linear or not—between a large set of explanatory variables X = (X_1, X_2, ..., X_p) and an outcome Y. This framework rigorously tests whether the conditional distribution Y | X actually depends on any of the variables X_j without assuming any model for how X and Y are connected (Candès et al. 2018). The burden of knowledge is here on the distribution of the covariates and there are many problems for which good models are at hand. This is the case in genetic studies where large collections of genotype data are available and we have models of linkage disequilibrium to describe their distribution (Sesia, Sabatti, and Candès 2019; Sesia et al. 2020). In such studies, conditional testing—as opposed to the marginal testing discussed in Efron's paper—seems far closer to the scientific question of interest, see Bates et al. (2020) for links with causality.

Returning to Tukey's point about intellectual content, perhaps the most powerful form of knowledge is the ability to act on a system to modify its outcome. This requires having a handle on causal effects. Recent years have witnessed an explosion of research on causal inference methods, many of which try to leverage the power of machine learning algorithms (Wager and Athey 2018; Rothenhäusler, Bühlmann, and Meinshausen 2019).

Efron offered a useful and powerful reminder of what are the ultimate goals of data analysis, and what increased knowledge looks like (or does not). We hope that larger portions of the energy and time currently invested in machine learning can be directed to making attribution possible, to assessing uncertainty correctly, and to achieving robust performance so as to guide interventions. The agenda is clear and we have seen that promising steps have been taken. Another challenge is to educate the growing community of data scientists on what it takes to "learn" general truths from a series of individual observations. One hundred years of statistics (in which the "data modeling culture" has played a large role) have both exposed some of the challenges and identified a number of meaningful guarantees that mitigate the difficulties linked to inference. It is important that we effectively share this rich statistical thinking.

CONTACT Emmanuel Candès [email protected] Department of Statistics, 390 Jane Stanford Way, Stanford, CA 94305.
¹ Has the world ever seen such a massive amount of activity on tasks this well delineated?
© 2020 American Statistical Association
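The coverage guarantee displayed above is achieved, with 𝒳 the whole sample space, by split-conformal inference. The sketch below wraps a deliberately poor predictor (the training-sample mean, ignoring x entirely) to make the point the discussants emphasize: the marginal guarantee does not require the black box to be good. The data and split sizes are our own illustrative choices.

```python
import math
import random

def split_conformal(train_y, cal_y, alpha=0.1):
    """Split-conformal interval: point prediction plus a calibrated residual quantile."""
    mu = sum(train_y) / len(train_y)          # the "fitted model" (a constant)
    scores = sorted(abs(y - mu) for y in cal_y)
    k = min(len(scores) - 1, math.ceil((1 - alpha) * (len(scores) + 1)) - 1)
    return mu - scores[k], mu + scores[k]     # the same interval C-hat for every x

rng = random.Random(1)
ys = [rng.gauss(0.0, 1.0) for _ in range(1200)]
lo, hi = split_conformal(ys[:100], ys[100:200], alpha=0.1)
coverage = sum(lo <= y <= hi for y in ys[200:]) / 1000.0   # empirical coverage, near 1 - alpha
```
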

References

Barber, R. F., Candès, E. J., Ramdas, A., and Tibshirani, R. (2019a), "The Limits of Distribution-Free Conditional Predictive Inference," arXiv no. 1903.04684.
——— (2019b), "Predictive Inference With the Jackknife+," arXiv no. 1905.02928.
Bates, S., Sesia, M., Sabatti, C., and Candès, E. J. (2020), "Causal Inference in Genetic Trio Studies," arXiv no. 2002.09644.
Candès, E. J., Fan, Y., Janson, L., and Lv, J. (2018), "Panning for Gold: 'Model-X' Knockoffs for High Dimensional Controlled Variable Selection," Journal of the Royal Statistical Society, Series B, 80, 551–577.
de Finetti, B. (1992), "Foresight: Its Logical Laws, Its Subjective Sources," in Breakthroughs in Statistics, Springer Series in Statistics (Perspectives in Statistics), eds. S. Kotz and N. L. Johnson, New York: Springer, pp. 134–174.
Efron, B. (1979), "Bootstrap Methods: Another Look at the Jackknife," The Annals of Statistics, 7, 1–26.
Lei, J., G'Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. (2018), "Distribution-Free Predictive Inference for Regression," Journal of the American Statistical Association, 113, 1094–1111.
Lei, J., and Wasserman, L. (2014), "Distribution-Free Prediction Bands for Non-Parametric Regression," Journal of the Royal Statistical Society, Series B, 76, 71–96.
Papadopoulos, H., Proedrou, K., Vovk, V., and Gammerman, A. (2002), "Inductive Confidence Machines for Regression," in European Conference on Machine Learning, Springer, pp. 345–356.
Quenouille, M. H. (1949), "Approximate Tests of Correlation in Time-Series," Journal of the Royal Statistical Society, Series B, 11, 68–84.
——— (1956), "Notes on Bias in Estimation," Biometrika, 43, 353–360.
Romano, Y., Patterson, E., and Candès, E. J. (2019), "Conformalized Quantile Regression," in Advances in Neural Information Processing Systems 32 (NIPS 2019), eds. H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Curran Associates, Inc., pp. 3538–3548.
Rothenhäusler, D., Bühlmann, P., and Meinshausen, N. (2019), "Causal Dantzig: Fast Inference in Linear Structural Equation Models With Hidden Variables Under Additive Interventions," The Annals of Statistics, 47, 1688–1722.
Sesia, M., Katsevich, E., Bates, S., Candès, E. J., and Sabatti, C. (2020), "Multi-Resolution Localization of Causal Variants Across the Genome," Nature Communications, 11, 1093.
Sesia, M., Sabatti, C., and Candès, E. J. (2019), "Gene Hunting With Knockoffs for Hidden Markov Models" (with discussion), Biometrika, 106, 1–18.
Tibshirani, R. J., Barber, R. F., Candès, E. J., and Ramdas, A. (2019), "Conformal Prediction Under Covariate Shift," in Advances in Neural Information Processing Systems 32 (NIPS 2019), eds. H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, Curran Associates, Inc., pp. 2526–2536.
Tukey, J. W. (1962), "The Future of Data Analysis," The Annals of Mathematical Statistics, 33, 1–67.
Vovk, V., Gammerman, A., and Saunders, C. (1999), "Machine-Learning Applications of Algorithmic Randomness," in International Conference on Machine Learning, pp. 444–453.
Wager, S., and Athey, S. (2018), "Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests," Journal of the American Statistical Association, 113, 1228–1242.
Yu, B., and Kumbier, K. (2020), "Veridical Data Science," Proceedings of the National Academy of Sciences of the United States of America, 117, 3920–3929.

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 2020, VOL. 115, NO. 530, 659: Comment
https://doi.org/10.1080/01621459.2020.1762451

Discussion of Paper by Brad Efron

D. R. Cox

Nuffield College, Oxford, UK

It is a privilege to have the chance of congratulating Professor Efron first on this wise paper, then on the richly merited International Prize, the award of which the paper commemorates, but above all on the whole body of his deeply impressive, wide-ranging contributions to our subject.

One issue which the paper indirectly raises is the role of different approaches to the conceptual and mathematical theory of our field. One of the appeals of the field is its totally international character yet, inevitably, broad and hazily defined national contrasts are visible.

Thus Professor Efron links "traditional" statistical theory to Neyman and Pearson. Yet the two set out to clarify earlier work of R. A. Fisher, first with Fisher's encouragement, which only later turned to destructive hostility. The mathematical clarity of Neyman's work is, of course, appealing but it may be argued that its overformalization continues to lead to misunderstanding, unproductive discussion and rigidity concerning, in particular, the role of significance tests.

The Nordic approaches have made an important and distinctive contribution to the field. See, in particular, the recent fine account of Rolf Sundberg, Statistical Modelling by Exponential Families, stemming from the much earlier contribution of Per Martin-Löf. In the UK, Egon Pearson played a distinctive role in the development of our subject. He mostly had a preference for numerical illustration and was deeply involved in applications, stemming in part from an early visit he made to Bell Labs.

One of the great masterpieces of our field, largely unread nowadays, is M. S. Bartlett's (1958) Introduction to Stochastic Processes and their application, and Bartlett's influence on statistical development, at least in the UK, was second only to Fisher's. See, for example, his treatment in 1937 of asymptotic theory in which the dimension of the parameter space increases proportionally to sample size. One of the themes of his wide-ranging work was the use of specific stochastic processes, Markov chains, generalized birth-death processes and others, for the detailed interpretation of biological or physical science data; his final post was as Professor of Biomathematics. An aspect of both Fisher's and Bartlett's work, which I must admit I admire, is a total lack of concern with formal mathematical regularity conditions which, of course subject to due care, typically contribute little to understanding. Professor Efron in his theoretical papers manages with great skill and panache to combine careful mathematical discussion with judicious statistical emphasis.

Special stochastic models probably tend nowadays to be treated much more often by computer simulation than by mathematical analysis of the differential equations involved, with obvious gains and some loss, especially when the model is intended to lead to semi-qualitative understanding of an empirical phenomenon. The ethos is rather different from the use of families of regression-type models and even more so from that of machine learning, the latter achieving very specific purposes often with high effectiveness, as Professor Efron's discussion so elegantly illustrates. But how useful are they as a guide to deeper interpretation or for broader context prediction?

I reemphasize my admiration for Professor Efron's work and for this paper.

CONTACT D. R. Cox [email protected] Nuffield College, Oxford OX1 1NF, UK. © 2020 American Statistical Association

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 2020, VOL. 115, NO. 530, 660–662: Comment
https://doi.org/10.1080/01621459.2020.1762619

Comment: When Is It Data Science and When Is It Data Engineering?

Noel Cressie

National Institute for Applied Statistics Research Australia, University of Wollongong, Wollongong, NSW, Australia

1. Introduction

The International Prize in Statistics (IPS) is awarded by the International Statistical Institute (ISI) every two years, to individuals whose work has made a seismic breakthrough in Statistics. The IPS rivals in stature the Fields Medal in Mathematics, although it is more recent, and the ISI is to be highly commended for initiating the prize. Brad Efron's bootstrap stands out among all the amazing contributions made in Statistics in the last 50 years, and Brad himself is a singularity in the profession. He is known for his deep thinking about important science problems, elegant development of solutions with uncertainty quantified, and clear expositions of the statistical science. The IPS Laureate delivers a lecture at the World Statistics Congress (WSC). Brad's lecture at the 2019 WSC in Kuala Lumpur, Malaysia, was very forward-looking and concentrated on prediction (a more general concept than forecasting). Toward the end of my discussion, I move away from the specifics of Brad's very fine paper to a more philosophical place, where I suggest that prediction is looking more like Data Engineering than Data Science.

2. Prediction With an Extra Component of Variation

I really liked the way Brad trichotomized statistical analysis into prediction (inference on random quantities), estimation (inference on fixed but unknown parameters), and attribution (which includes hypothesis testing, model choice, and classification). I would like to add to this a fourth "-ion," decision (in the presence of uncertainty), which I discuss in later sections.

Brad posed the prediction problem in terms of a "surface-plus-noise" paradigm, in which the surface is a latent process where the scientific truths reside. In his paper, he gives a decomposition that I go along with, until it stops short of what I need for spatial, temporal, and spatio-temporal prediction. In broad-brush strokes, his model for data y is

y = s(x, β) + ϵ,  (1)

where s(x, β) is a "surface" and ϵ is (independent) "noise." I believe there is an extra component that should be included in (1), which is uncertainty in the representation of the scientific truths. That is, if the truths are represented by w (say), then the degree of veracity of the surface s(x, β) is captured by the equation

w = s(x, β) + ξ,  (2)

which is the "process model" in a hierarchical statistical model (HM). Importantly, y are observations on w, not on s(x, β):

y = w + ϵ,  (3)

which is the "data model" in an HM. Combining (2) and (3) results in

y = s(x, β) + ξ + ϵ,  (4)

which reduces to (1) if the process-model error, ξ, is identically zero. Interpreted in terms of linear regression, where s(x, β) = x′β, (4) tells us that the difference between the data and the surface is

y − x′β = ξ + ϵ,

and hence, if x′β is a fixed effect, the error budget is

var(y) = var(ξ) + var(ϵ).  (5)

Misinterpretations are possible by not recognizing equation (5): the budget may not add up to the total variance if var(ϵ) is determined from independent calibration; or, if the noise variance is assumed to be var(y), over-smoothing will occur. In the model I use for spatial and spatio-temporal prediction, (4) is my starting point.

Usually, I want to predict w (where the scientific truths reside), for which the predictor

ŵ ≡ E(w | y) = E(s(x, β) | y) + E(ξ | y)

is commonly used, along with its uncertainty, var(w | y). In spatial, temporal, and spatio-temporal prediction, the process-model error ξ has statistical dependencies that contribute to both E(w | y) and var(w | y). Further, x could be random, a common example of which is the errors-in-variables problem.

There are instances (e.g., when considering the output from numerical climate models) when it is the surface s(x, β), not w, that one wants to predict. In this case, ξ is filtered out and one would use the predictor E(s(x, β) | y), with uncertainty var(s(x, β) | y).
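The error budget (5), and the gain from the hierarchical-model predictor E(w | y) over the raw data y, can be checked in a toy Gaussian version of the process and data models (2)-(3). This is an illustrative sketch only; the surface value and the variance components are arbitrary choices, not values from the discussion.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hierarchical model at a single site:
#   process model:  w = s + xi,   xi ~ N(0, tau^2)
#   data model:     y = w + eps,  eps ~ N(0, sigma^2)
n = 200_000
s = 1.0                 # fixed "surface" value s(x, beta)
tau, sigma = 0.7, 0.5   # process-error and noise standard deviations
w = s + rng.normal(0, tau, n)
y = w + rng.normal(0, sigma, n)

# Error budget (5): var(y) = var(xi) + var(eps) when s is a fixed effect.
print(y.var(), tau**2 + sigma**2)   # both near 0.74

# Optimal (squared-error) predictor of w shrinks y toward the surface:
#   E(w | y) = s + tau^2 / (tau^2 + sigma^2) * (y - s)
w_hat = s + (tau**2 / (tau**2 + sigma**2)) * (y - s)
print(np.mean((w_hat - w) ** 2) < np.mean((y - w) ** 2))   # True
```

Treating var(y) as the noise variance would triple the assumed value of var(ϵ) here (0.74 versus 0.25), which is exactly the over-smoothing risk the passage warns about.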

CONTACT Noel Cressie [email protected] National Institute for Applied Statistics Research Australia, University of Wollongong, Wollongong, NSW 2522, Australia. © 2020 American Statistical Association

The next section discusses how different queries about scientific truths in w usually lead to different optimal predictors.

3. Querying the Predictand

The predictand is the random quantity w (or some function of it) to be predicted. In spatio-temporal statistics, w is often a stochastic process that is observed incompletely and noisily, resulting in data y. A predictor of w is a statistic, namely a function of the data y that here we write as δ(y). We are doing this statistical analysis for a reason, presumably so that we can answer a question about how our world works.

In the presence of uncertainty, there is no universal optimal predictor of w that answers every question about it, which I now discuss briefly. A prediction can be looked upon as a decision in a decision-theoretic context, and hence predictions are made based on cost-benefit (utility, U) and risk (probabilities conditioned on the data y). There are many possible predictions δ, and we need some way to compare them. The utility tells us how well we do if the predictand is w, and we use δ to predict it; we write the utility as U(w, δ). We write the risk as Pr(w | y). Then weight the utility by the risk:

U(w, δ) Pr(w | y).  (6)

Recall that w is the random process we are trying to predict: In many of the problems I work on, w is a high-dimensional spatial or spatio-temporal random field for which I have incomplete and noisy data y. Then for each decision δ, U(w, δ) is an ensemble of utilities, where we can weight each one by its posterior probability (as we do in (6)). Integrating (6) yields the posterior expected utility,

∫ U(w, δ) Pr(w | y) dw.  (7)

Then maximizing (7) with respect to δ over a given predictor space yields an optimal predictor δ∗ that is a function of the data y.

Now suppose a different question is asked that involves making inference not on w but on a given function g(w). A different utility function might be used, but suppose for the sake of argument that here it is the same. Then (6) is replaced with

U(g(w), δ_g) Pr(w | y),  (8)

where δ_g is a generic decision about g(w). Integrating (8) yields

∫ U(g(w), δ_g) Pr(w | y) dw,  (9)

which when maximized yields δ∗_g that is a function of y. However, those who would like a simple life might base their prediction of g(w) on

δ̃_g(y) ≡ g(δ∗(y)),  (10)

which is different from δ∗_g(y).

Now, if δ∗(y) is a great predictor of w, then would not g(δ∗(y)) be a great predictor of g(w)? Yes, if there were complete certainty in w but, modulo death and taxes,

• our world is uncertain;
• our attempts to explain the world (science) are uncertain;
• our measurements of our uncertain world are uncertain.

For example, in the simple case where w is a random variable, the commonly used utility that has a squared-error penalty yields the optimal predictor, δ∗(y) = E(w | y), of w. Now consider prediction of g(w) = exp(w/a), where a is a known positive constant that has the same units as w. This problem occurs when, for example, soil or mineral properties are modeled on the log scale but inferences are needed back on the original scale. Then,

δ∗_g(y) = E(exp(w/a) | y) ≠ exp(E(w/a | y)) = g(δ∗(y)),

because g(·) = exp(·/a) is nonlinear. In fact, because of Jensen's inequality,

δ∗_g(y) ≥ g(δ∗(y));

and note that δ∗_g(y) is unbiased for g(w). Hence, the "shape-shifting" predictor g(δ∗(y)) is biased for g(w), has inferior posterior expected utility, and should be avoided despite its one-size-fits-all appeal.

4. What Metrics Are We Using in the Twenty-First Century?

"What we measure affects what we do. If we have the wrong metrics, we will strive for the wrong things": Joseph Stiglitz, Nobel Laureate in Economics (Financial Times, 13 September 2009).

Statistics has sometimes been described as a discipline that translates variability into uncertainty. We have all seen how our careful attempts to get the probability space just right, so that the uncertainty can be expressed with auditorial rigor, are ignored or misinterpreted by scientists and policy-makers. Really fast computers can ingest really big datasets and predict what look like really interesting signals. But are the signals real? Scientists (including statistical scientists) like to assess their predictions with the LG (Looks Good) "metric," which works well when the signal is both strong and simple.

Statistical scientists can derive probabilities (denoted Pr(·)) for how different the predicted signal is from no signal. Statistics has "metrics" that include bias, variance, mean-squared (prediction) error, Pr(Type I error), Power = 1 − Pr(Type II error), p-value, and so forth. In simulations where the signal is known and from which the simulated data are generated, or in cross-validations, we can obtain empirical measures (or metrics), such as the false positive rate, false negative rate, false discovery rate, and false non-discovery rate.

For example, the false positive rate can be interpreted as an estimate of the conditional probability, Pr(a signal is declared | no signal is present). Both the scientific method and the legal system are predicated on this type of probability, where there is a presumption of the status quo in Science or of innocence before the court. In statistical hypothesis testing, this is known as Pr(Type I error), and in the legal system it corresponds to Pr(conviction | innocent). Another metric is the false negative rate, where there is a presumption of a signal being present, that is, Pr(no signal is declared | signal is present). In statistical hypothesis testing, this is known as Pr(Type II error), and in the legal system it corresponds to Pr(acquittal | guilty).

To a software engineer, these metrics number among many that could be computed to evaluate an algorithm, but to a statistician the choice of metric depends on the question being asked. The statistician is trained to look for variability over sub-populations of the evaluated metric which, amongst other things, might indicate prejudicial behavior of an algorithm. However, our training typically comes up short when an answer is needed to the following type of question: Is a false positive rate or a false negative rate of 2% equally acceptable for a population of 250 people as it is for a population of 25 million? This is an example of where the decision-theoretic machinery in Section 3 could be used to establish cost-benefit in the presence of uncertainty and then used to quantify the choices available.

5. Robo-Debt: Australian Government Demands Its Citizens Prove Their Innocence

The Australian social-services agency, Centrelink, has data on the earnings of Australians receiving various types of welfare payments. Its records are not perfect, and it looks for data from other sources, such as the Australian Tax Office (ATO), to provide a cross-check. If there is a discrepancy, a please-explain notice is issued; at least that was the case up to the first half of 2016. In the middle of that year, a new scheme was introduced: It went from issuing about 20,000 notices annually to issuing 142,634 debt notices in the year following July 1, 2016, with a requirement that recipients prove there was no overpayment to them by Centrelink.

"Robo-debt," as it was dubbed by the media, was devised under a conservative Australian government, where artificial intelligence (AI) was used to catch unentitled welfare recipients and human intelligence (HI) was all but suspended. Clause 39 of Magna Carta Libertatum (signed in 1215) recognizes individual rights over and above the arbitrary will of a government, and it formed the basis of that part of the Fourteenth Amendment of the US Constitution that guaranteed "due process of law" to its citizens. This right to due process has, over the last 800 years, included the presumption of innocence, and it is the cornerstone of most democracies. In Australia, from mid-2016 and beyond, that presumption was suspended in approximately 700,000 cases, using an algorithm that was flawed and a scheme that had a presumption of guilt at its core.

Robo-debt was introduced to save the government money by removing HI and relying on AI and "big data." The robo-debt algorithm was based on matching the Centrelink database with an ATO database. An uneven wages profile, common in the gig economy, was treated differently by the two agencies, which prejudiced the casual and under-employed sector of the economy. The debt notice that was sent included instructions that failure to pay would be pursued by private debt collectors or by garnishing the individual's ATO tax returns. (I do not know of any credit notices being issued under the scheme.)

There was a right of appeal that involved a lengthy, cumbersome, and psychologically stressful process for a sector of the population that was already financially distressed. Moreover, even during the appeal, active collection of alleged debts proceeded. In media interviews, the government minister responsible quoted an approximately 1% error rate without clarifying exactly what that metric was. Subsequently, it was revealed that these "errors" were clerical data-entry errors and, in fact, the false positive rate of the scheme was on the order of 20%.

In the case of robo-debt, prediction failed because the algorithm was flawed and the metric was ill-defined. What the government did with that prediction was shocking and, in late 2019, the Australian Federal Court found the scheme to be illegal.

6. Science and Engineering

Broadly speaking, science tells us "why" and Engineering tells us "how." We keep them separate in our academies and colleges. Is this a left-over from the Twentieth Century? In our daily life, do we no longer care why something works, as long as it works?

Up to now, Engineering has been an important vehicle for society's uptake of Science's discoveries. Machine Learning, which looks for patterns in very big datasets and classifies attributes according to those patterns, is often (incorrectly) equated with AI, which will allow us to do everything in our lives faster, smarter, and cheaper...until it will not, because we have been incorrectly classified!

An AI algorithm is trained on a "typical" dataset, but it is almost certainly subject to unconscious bias that may or may not show up when applied to the general population. AI will tend to reinforce prejudices in the data in a manner similar to a Bayesian analysis where the posterior is used as a prior for the same or a similarly unrepresentative dataset. Are the error rates spread evenly across all demographics and, if not, is it because of unconscious bias?

In conclusion, Brad's paper has embraced Data Science and forcefully argued for statistical thinking to be an essential component. The latter part of my discussion is purposely binary, so that as statisticians we continue to ask whether what we are doing is Data Science or Data Engineering.

Funding

This work was supported by an Australian Research Council Discovery Project DP190100180.

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 2020, VOL. 115, NO. 530, 663–664: Comment
https://doi.org/10.1080/01621459.2020.1762616

Discussion

A. C. Davison

Ecole Polytechnique Fédérale de Lausanne, Lausanne, Switzerland

I congratulate Professor Efron on receiving the International Prize in Statistics. Although given primarily for his invention of the bootstrap, this and his many other contributions add up to a hugely influential body of work, and the award is richly deserved.

As a graduate student I had the privilege of hearing him speak in London, and as we left the seminar someone quietly said to me that this bootstrap thing could not possibly work. Many (too many!) years have passed since then, but what seemed visionary around 1980 has now come to pass: despite the initial doubts of some, resampling does indeed provide a very powerful approach to uncertainty assessment in a huge variety of settings, effectively replacing the analytical derivation of standard errors and suchlike with large-scale computations. There are caveats, of course, and during the 1980s and 1990s powerful theoreticians such as Peter Hall, sorely missed, shed light on how and when the bootstrap would work successfully and how to fix it when it failed. But the basic vision, that computer power could replace mathematical derivations, has come true. At the time some colleagues feared that computing power would replace statisticians, but this (to us) disastrous scenario has not materialized. The last time I looked, the top three "best jobs" in a standard US listing were "data scientist," "statistician," and "university professor" (very gratifying for those wearing all three hats), so I do not think we need fear the disappearance of "traditional" statistics any time soon.

One reason for this is that despite all the hype about scraping information from the web and other sources of big data, the vast majority of datasets remain small or moderate in size. Quite apart from issues of sampling bias and representativity, there is typically a quality/quantity tradeoff in data-gathering. One of many possible examples arises in studying the ages of those humans who live the longest, to see if there is an upper limit to the human lifespan. Individuals who live over 100 years form a tiny proportion of the general population, and even official documentation can be lacunary, so that a considerable effort is needed to verify their identities and thus their ages. One can have lots of data, mostly reliable, about the bulk of the population, but the information about the small group of most interest must be carefully verified. Moreover one runs up against the extrapolation problem mentioned in the paper—since the aspect of most interest lies outside the data, some sort of theory seems essential, be it mathematical, biological or both. Many problems are of this general sort, involving data that are of doubtful value and with missing or incomplete scientific theory, and to which therefore so-called predictive methods may not apply.

I want to make two main comments about this interesting paper.

Prediction

My first comment is that I think we use the word "prediction" too glibly. Given sufficient data, linear regression or a deep learning, random forest or other machine learning algorithm could provide highly accurate predictions of when, and with what momentum, a stone dropped from the Leaning Tower of Pisa will hit the earth. It could not, however, help with a moon shot, since the force of lunar gravity is quite different, and learning in Pisa would not be relevant. Understanding of Newton's laws, on the other hand, did enable a successful moon landing—which of course required enormous numbers of predictions that fortunately turned out to be correct. We should distinguish "algorithmic prediction," based on empirical modeling of data, be it by linear regression, an ARIMA model or a deep learner, from "scientific prediction," based on some knowledge, necessarily imperfect, of "laws of nature." Only the latter enables the reliable transfer of knowledge from one setting to another—and prediction in a new context enables scientific theories to be tested, perhaps to destruction, or extended or modified. A further distinction to be made within "algorithmic" predictors is between those that provide some measure of the uncertainty of the prediction, which might properly be called "statistical," and those, like most machine learning algorithms, that do not. So using logistic regression to obtain a single number would be an algorithmic prediction, whereas using it to give an estimated distribution would be a statistical prediction.

Although I would distinguish between scientific and empirical predictions, there are scientific settings in which algorithmic predictions might be appropriate. For example, numerical climate models, which are ultimately based on the Navier–Stokes equations, are becoming more and more detailed, but there will always be small-scale processes that they cannot properly include, either for computational reasons or because the processes are insufficiently well understood. Such processes might be better modeled empirically within a climate model driven by large-scale scientific considerations.
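The contrast between an "algorithmic" and a "statistical" use of logistic regression can be made concrete with a small simulation. The sketch below is hypothetical (simulated data, a hand-rolled Newton/IRLS fit): the same fitted model yields either a single hard label or an estimated outcome distribution.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated binary outcomes from a true logistic model.
n = 5000
x = rng.normal(size=n)
p_true = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))
y = rng.binomial(1, p_true)

# Fit logistic regression by Newton's method (IRLS).
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    grad = X.T @ (y - p)
    hess = (X * (p * (1 - p))[:, None]).T @ X
    beta += np.linalg.solve(hess, grad)

# "Algorithmic" prediction: a single hard label for a new covariate value.
x_new = np.array([1.0, 0.8])          # intercept term plus x = 0.8
p_new = 1 / (1 + np.exp(-x_new @ beta))
label = int(p_new >= 0.5)

# "Statistical" prediction: the estimated distribution of the outcome.
print(label, {"P(Y=1)": round(p_new, 2), "P(Y=0)": round(1 - p_new, 2)})
```

The fitted coefficients are the same in both cases; only the reported summary differs, which is exactly the distinction drawn in the text.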

CONTACT A. C. Davison [email protected] Institute of Mathematics, Ecole Polytechnique Fédérale de Lausanne, EPFL-FSB-MATH-STAT, Station 8, 1015 Lausanne, Switzerland. © 2020 American Statistical Association

Uncertainty in Big Data

My second comment relates to independence assumptions in big data problems. In classical statistics we expect standard errors to shrink at rate 1/√n, so when the sample size n is enormous everything becomes highly significant: for a huge sample, the logistic regression output shown by Professor Efron would have stars not only for every predictor but also for every possible interaction of the predictors, and we would find ourselves using the significance levels as a ranking tool rather than for attribution in Efron's sense. Yet it seems implausible that our usual over-simplified independence assumptions will apply to huge samples, owing to underlying factors that lead to subtle dependencies, individually too small to be detected, but important when cumulated. We are used to this in the familiar context of linear regression: the small dependencies between residuals can be neglected for many purposes, but when combined in the residual sum of squares, ignoring the loss of degrees of freedom can lead to serious underestimation of the error variance. As an aside, this is different from the correlations between predictors that complicate attribution in wide-data problems. In genetic problems local correlations at the genome level mean that attribution to a single gene may be unreliable, whereas modeling correlations along the genome, in terms of hot spots, leads to more secure inferences. The correlations I refer to are between the rows of the data matrix and the response variables for different units. Sir David Cox, the first winner of the International Prize in Statistics, has discussed an approach to this based on notions of long-range dependence (Cox 2015), but surely much more remains to be done to ensure secure attribution in huge datasets, and some form of subsampling akin to the bootstrap might form the basis of a simple widely applicable approach.

To finish, I would like to congratulate Professor Efron again, and thank him not only for his stimulating paper, but also for all the tools he has given us.

Reference

Cox, D. R. (2015), "Big Data and Precision," Biometrika, 102, 712–716.

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 2020, VOL. 115, NO. 530, 665–666: Comment
https://doi.org/10.1080/01621459.2020.1762617

Discussion of “Prediction, Estimation, and Attribution” by Bradley Efron

Jerome Friedman (a), Trevor Hastie (a,b), and Robert Tibshirani (a,b)

(a) Statistics Department, Stanford University, Stanford, CA; (b) Biomedical Data Science Department, Stanford University, Stanford, CA

ABSTRACT Professor Efron has presented us with a thought-provoking paper on the relationship between prediction, estimation, and attribution in the modern era of data science. While we appreciate many of his arguments, we see more of a continuum between the old and new methodology, and the opportunity for both to improve through their synergy.

Discussion

We would like to thank Professor Efron for a timely and thought-provoking paper. His discussion of the relationship between prediction, estimation, and attribution (a new term?) is an important one and deserves further thought. Overall, however, we find ourselves only in partial agreement with his arguments. We summarize our views as follows:

1. Unlike Professor Efron, we do not see any fundamental tension between prediction and estimation/attribution. They have different goals, and both are needed. We agree that estimation and attribution may be harder than prediction. Science (e.g., biology) is much more complex than it was 200–300 years ago, and estimation and attribution are therefore more difficult. But in this complex world, prediction can still be successful and useful! Here is an example from our own collaborative work (RT): The project involves the diagnosis of skin cancer from 10,000 protein levels measured on skin biopsies. We collected training and test sets, and applied the lasso to try to distinguish cancer from normal tissue. The lasso chose about 200 proteins and predicts quite accurately (test set AUC about 95%). Now if we were to remove those 200 proteins from the dataset and refit, the new model would still predict well. This is because of correlation between predictors. Hence our model has not uncovered the ground truth of skin cancer: this is very difficult to do (it may take 10–20 years!). But the model can be used now to help doctors to diagnose skin cancer. Should we stop using it until we discover the ground truth? Of course not.

2. Professor Efron correctly states that the prediction problem can be "ephemeral." But this does not distinguish it from estimation. Data and processes can drift. In estimation, the "law" that is true today may not hold exactly tomorrow. When modeling the spread of a virus like coronavirus, models for existing viruses are likely to be inadequate. Augmenting these models with empirical data seems like a crucial component for accurate predictions. As Professor Efron notes, prediction tends to be applied more in ephemeral settings. This does not make it fundamentally more so.

3. We agree that interpretable prediction models may sometimes be helpful in getting closer to the ground truth. Simple interpretable models may sometimes be more robust and trustworthy. And when a simple model stops working, it is easier to diagnose and possibly correct the problem. Interpreting predictive models is an active area of research. Even when a prediction model appears to be working, it is important to know why. For example, many groups are using deep learning to diagnose breast cancer from large sets of mammograms, with very promising results. In many cases the systems outperform expert radiologists (e.g., McKinney et al. 2020). However, sometimes the algorithms can be fooled by what has been called the "Clever Hans" effect (Lapuschkin et al. 2019). There can be a small label in the corner of the mammogram indicating the lab where the scan was taken. The network can seize on that label and exploit it to help classify the images. But this information is clearly not generalizable to the general population of mammograms. Hence, there is now a large body of work trying to pinpoint which pixels in a specific image are responsible for the particular classification produced by a deep neural net.

4. In our view the modern prediction algorithms go hand-in-glove with the more traditional estimation models. Boosting, for example, can be viewed as estimation of a response surface in a surface-plus-noise scenario. If you fit a boosting model, and it explains 70% of the (test set) variance compared to the 30% explained by your parametric model, that suggests you may have left out some variables, some interactions, and possibly need some nonlinearities. On the other hand, if they do about the same, you have more confidence that your simpler explanation is on the right track. These same comments apply to other models, like logistic regression, Cox models, and so on, as well as to neural nets and random forests.
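The correlated-predictors point in the skin-cancer example can be illustrated with a small simulation (entirely hypothetical data and settings, not the authors' proteomic study): an L1-penalized logistic regression is fit, every selected feature is then deleted, and a refit on the remaining features still predicts well because correlated "sibling" predictors carry the same signal.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n, blocks, per_block = 2000, 10, 5

# Ten blocks of five strongly correlated features (shared latent + noise).
Z = rng.normal(size=(n, blocks))
X = np.repeat(Z, per_block, axis=1) + 0.5 * rng.normal(size=(n, blocks * per_block))

# The outcome depends on one feature from each of the first three blocks.
beta = np.zeros(blocks * per_block)
beta[[0, 5, 10]] = 1.5
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta)))

tr, te = slice(0, 1000), slice(1000, 2000)

def lasso_fit_auc(X):
    """Fit L1 logistic regression; return test AUC and selected feature indices."""
    model = LogisticRegression(penalty="l1", solver="liblinear", C=0.2)
    model.fit(X[tr], y[tr])
    selected = np.flatnonzero(model.coef_[0])
    return roc_auc_score(y[te], model.decision_function(X[te])), selected

auc_full, selected = lasso_fit_auc(X)

# Delete every selected feature and refit: correlated siblings take over.
auc_refit, _ = lasso_fit_auc(np.delete(X, selected, axis=1))
```

Under these hypothetical settings both AUCs stay high, mirroring the claim that accurate prediction need not identify the "true" predictors.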

CONTACT Trevor Hastie [email protected] Statistics Department, Stanford University, Stanford, CA 94305. © 2020 American Statistical Association
J. Friedman, T. Hastie, and R. Tibshirani

References

Lapuschkin, S., Wäldchen, S., Binder, A., Montavon, G., Samek, W., and Müller, K.-R. (2019), "Unmasking Clever Hans Predictors and Assessing What Machines Really Learn," Nature Communications, 10, 1096.

McKinney, S. M., Sieniek, M., Godbole, V., Godwin, J., Antropova, N., Ashrafian, H., Back, T., Chesus, M., Corrado, G. C., Darzi, A., Etemadi, M., Garcia-Vicente, F., Gilbert, F. J., Halling-Brown, M., Hassabis, D., Jansen, S., Karthikesalingam, A., Kelly, C. J., King, D., Ledsam, J. R., Melnick, D., Mostofi, H., Peng, L., Reicher, J. J., Romera-Paredes, B., Sidebottom, R., Suleyman, M., Tse, D., Young, K. C., De Fauw, J., and Shetty, S. (2020), "International Evaluation of an AI System for Breast Cancer Screening," Nature, 577, 89–94.

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 2020, VOL. 115, NO. 530, 667–671: Comment
https://doi.org/10.1080/01621459.2020.1762614

Discussion of Professor Bradley Efron’s Article on “Prediction, Estimation, and Attribution”

Min-ge Xie and Zheshi Zheng

Department of Statistics, Rutgers University, Piscataway, NJ

1. Introduction

By noting the rapidly growing trend of "pure prediction algorithms," Professor Efron compares and bridges the statistics of the 20th Century (estimation and attribution) to the current fast-growing development of the 21st Century (prediction). The outstanding discussion offers many deep-rooted insights and comments. As did his forward-thinking article on Fisher's influence on modern statistics (Efron 1998), which helped shape many recent developments on statistical inference (including our own work on confidence distribution (Singh, Xie, and Strawderman 2005; Xie and Singh 2013)), this equally inspiring article by Professor Efron will certainly galvanize many contemporary and powerful developments for modern statistics and for the foundations of data science.

In this note, we echo and also provide additional support to two important points made by Professor Efron: (1) prediction is "an easier task than either attribution or estimation"; (2) the IID setting (e.g., random splitting of training and testing datasets) is crucial in the current developments on predictions, but we also need to do more for the case when the IID assumption is not met. Based on our own research, we provide additional evidence to support these discussions. We discover that prediction has a homeostasis property and works well under the IID setting even if the learning model used is completely wrong. We also highlight the importance of having a good modeling and inference practice: a good learning model with good estimation is important to improve prediction efficiency in the IID case, and it becomes essential to maintain validity in the non-IID case. The message remains: we still need to make the effort to build a good learning model and estimation algorithm in prediction, even if prediction is an easier task than estimation.

From the outset, we would like to point out that it is not a straw-man argument to consider non-IID testing data. On the contrary, such data are prevalent in data science. In addition to those examples provided by Professor Efron that showed "drift," we can easily imagine non-IID examples in many typical applications. For instance, a predictive algorithm is trained on a database of patient medical records and we would like to predict potential outcomes of a treatment for a new patient with more severe symptoms than what the average patient shows. The new patient with more severe symptoms is not a typical IID draw from the general patient population. Similarly, in the finance sector, one is often interested in predicting the financial performance of a particular company. If a predictive model is trained on all institutes, then the testing data (of the specific company of interest) are unlikely to be IID draws from the same general population as the training data. The limitation of the IID assumption, in our opinion, has hampered our efforts to fully take advantage of fast-developing machine learning methodologies (e.g., deep neural network models, tree-based methods, etc.) in many real-world applications.

Our discussions in this note are based on a so-called conformal prediction procedure, an attractive new prediction framework that is error (or model) distribution free (see, e.g., Vovk, Gammerman, and Shafer 2005; Shafer and Vovk 2008). We discover a homeostasis phenomenon: the expected bias caused by using a wrong model can largely be offset by the corresponding negatively shifted predictive errors under the IID assumption. Thus, the predictive conclusion is always valid even if the model used to train the data is completely wrong. This robustness result clearly supports the claim that prediction is an easier task than modeling and estimation. However, the use of a wrong training (learning) model has at least two undesirable impacts on prediction: (a) a prediction based on a wrong model typically produces much wider predictive intervals than those based on a correct model; (b) although the IID case enjoys a nice homeostatic cancellation of bias (in the fitted model) and shifts (in the associated predictive errors) when using a wrong learning model, in the non-IID case this cancellation is often no longer effective, resulting in invalid predictions. The use of a correct learning model can help mitigate and sometimes solve the problem of invalid prediction for non-IID (e.g., drifted or individual-specific) testing data.

Section 2 reviews a conformal predictive procedure and shows that the prediction is valid under the IID setting, even if the learning model is completely wrong. Section 3 is a numerical study using a neural network model to demonstrate the impact of a wrong learning model and estimation on prediction in both the IID and non-IID cases. Section 4 is a concluding remark. A more detailed discussion, including an introduction of the predictive curve (to represent predictive intervals of all levels) and an elaborated study of linear models, is in Xie and Zheng (2020).

CONTACT Min-ge Xie [email protected] Department of Statistics, Rutgers University, Piscataway, NJ 08854. © 2020 American Statistical Association

2. Prediction, Testing Data, and Learning Models

As in Equation (6.4) of Professor Efron's article, we assume that a training (observed) dataset of size $n$, say, $\mathcal{D}_{obs} = \{(x_i, y_i), i = 1, \ldots, n\}$, is available, where the $(x_i, y_i)$, $i = 1, \ldots, n$, are IID random samples from an unknown population $\mathcal{F}$. For a given $x_{new}$, we would like to predict what $y_{new}$ would be. We first use the typical assumption that $(x_{new}, y_{new})$ is also an IID draw from $\mathcal{F}$. Later we relax this requirement and only assume that $y_{new}|x_{new}$ relates to $x_{new}$ the same way as $y_i|x_i$ relates to $x_i$, but $x_{new}$ follows a marginal distribution that is different from that of $x_i$.

For notational convenience, we consider $(x_{new}, y_{new})$ as the $(n+1)$th observation and introduce the index $n+1$, with $x_{n+1} = x_{new}$ and $y_{n+1}$ as a potential value of the unobserved $y_{new}$. Unless specified otherwise, the indices "$n+1$" and "new" are exchangeable throughout the note.

2.1. Conformal Prediction Inference With Quantified Confidence Levels

The conformal prediction method has attracted increasing attention in learning communities in recent years (see, e.g., Vovk, Gammerman, and Shafer 2005; Shafer and Vovk 2008; Lei et al. 2018; Barber et al. 2019a, 2019b). The idea is straightforward. To make a prediction of the unknown $y_{new}$ given $x_{n+1} = x_{new}$, we examine a potential value $y_{n+1}$, and see how "conformal" the pair $(x_{n+1}, y_{n+1})$ is among the observed $n$ pairs of IID data points $(x_i, y_i)$, $i = 1, \ldots, n$. The higher the "conformality," the more likely $y_{new}$ takes the potential value $y_{n+1}$. Frequently, a learning model, say $y_i \sim \mu(x_i)$ for $i = 1, \ldots, n, n+1$, is used to assist prediction. However, the learning model is not essential. As we will see later, even if $\mu(\cdot)$ is totally wrong or does not exist, a conformal prediction can still provide us a valid prediction, as long as the IID assumption holds.

To be specific, this note employs a conformal prediction procedure known as the jackknife-plus method (see, e.g., Barber et al. 2019b). Consider a combined collection of both the training and testing data but with the unknown $y_{new}$ replaced by a potential value $y_{n+1}$: $\mathcal{A} = \mathcal{D}_{obs} \cup \{(x_{n+1}, y_{n+1})\} = \{(x_i, y_i), i = 1, \ldots, n, n+1\}$. We define conformal residuals

$$R_{ij} = y_i - \hat{y}_i^{-(i,j)}, \quad \text{for } i \ne j \text{ and } i, j = 1, \ldots, n, n+1,$$

where $\hat{y}_i^{-(i,j)}$ is the prediction of $y_i$ based on the leave-two-out dataset $\mathcal{A}^{-(i,j)} = \mathcal{A} - \{(x_i, y_i), (x_j, y_j)\}$. If a working model $\mu(\cdot)$ is used, for instance, the model is first fit based on the leave-two-out dataset $\mathcal{A}^{-(i,j)}$ and the point prediction is set to be $\hat{y}_i = \hat{\mu}(x_i; \mathcal{A}^{-(i,j)})$, where $\hat{\mu}(\cdot\,; \mathcal{A}^{-(i,j)})$ is the fitted (trained) model using $\mathcal{A}^{-(i,j)}$.

For each given $y_{n+1}$ (a potential value of $y_{new}$), we define

$$Q_n(y_{n+1}) = \frac{1}{n} \sum_{i=1}^{n} 1\{R_{n+1,i} \ge R_{i,n+1}\}, \qquad (1)$$

which relates to the degree of "conformity" of the residual $R_{n+1,i} = y_{n+1} - \hat{y}_{n+1}^{-(i,n+1)}$ among the conformal residuals $R_{i,n+1} = y_i - \hat{y}_i^{-(i,n+1)}$, $i = 1, \ldots, n$. (Here, the $R_{i,n+1}$ are in fact the leave-one-out residuals of using the training dataset $\mathcal{D}_{obs}$.) If $Q_n(y_{n+1}) \approx \frac{1}{2}$, then $R_{n+1,i}$ is around the middle of the training data residuals $R_{i,n+1}$ and thus "most conformal." When $Q_n(y_{n+1}) \approx 0$ or $\approx 1$, $R_{n+1,i}$ is at the extreme ends of the training data residuals $R_{i,n+1}$ and thus "least conformal." This intuition leads us to define a conformal predictive interval of $y_{new}$ as

$$C_\alpha = \Big\{y : Q_n(y) \ge \frac{\alpha}{2}\Big\} \cap \Big\{y : 1 - Q_n(y) \ge \frac{\alpha}{2}\Big\} = \Big[\, q_{\frac{\alpha}{2}}\big(\{\hat{y}_{n+1}^{-(i,n+1)} + R_{i,n+1}\}_{i=1}^n\big),\ q_{1-\frac{\alpha}{2}}\big(\{\hat{y}_{n+1}^{-(i,n+1)} + R_{i,n+1}\}_{i=1}^n\big) \,\Big], \qquad (2)$$

where $q_\alpha(\{a_i\}_{i=1}^n)$ is the $\alpha$th quantile of $a_1, \ldots, a_n$. The interval (2) is a variant of the jackknife-plus predictive interval proposed by Barber et al. (2019b), in which $R_{i,n+1}$ is replaced by $|R_{i,n+1}| = |y_i - \hat{y}_i^{-(i,n+1)}|$ instead. The following proposition states that, under the IID assumption, $C_\alpha$ defined in (2) is a predictive set for $y_{new}$ with guaranteed level $(1 - 2\alpha)$.

Proposition 1. If $(x_i, y_i), (x_{new}, y_{new}) \overset{iid}{\sim} \mathcal{F}$, for $i = 1, \ldots, n$, then $P\big(y_{new} \in C_\alpha\big) \ge 1 - 2\alpha$.

A proof of the proposition can be found in Xie and Zheng (2020); it holds for a finite $n$. Barber et al. (2019b) pointed out that, empirically, intervals like $C_\alpha$ have a typical coverage rate of $1 - \alpha$. In the rest of the note, we treat $C_\alpha$ as an approximate level-$(1 - \alpha)$ predictive interval.

Note that the function $Q_n(y)$ defined in (1) is in essence a predictive distribution function of $y_{new}$ (see, e.g., Shen, Liu, and Xie 2018; Vovk et al. 2019). The corresponding predictive curve of $y_{new}$ is

$$\mathrm{PV}_n(y) = 2 \min\{Q_n(y), 1 - Q_n(y)\}.$$

Clearly, $C_\alpha = \{y : \mathrm{PV}_n(y) \ge \alpha\}$. A plot of the predictive curve function $\mathrm{PV}_n(y)$ provides a full picture of the conformal predictive intervals of all levels. Analogous to the confidence distribution and Birnbaum's confidence curve, the predictive function $Q_n(y)$ has a confidence interpretation as the p-value function of the one-sided test $H_0: y_{new} = y$ versus $H_1: y_{new} \le y$, and the predictive curve $\mathrm{PV}_n(y)$ has the same interpretation for the corresponding two-sided test (see, e.g., Xie and Zheng 2020, sec. 2.2). A formal definition of the conformal predictive function is in Vovk et al. (2019).

A striking result is that Proposition 1 holds even if the learning model $\mu(\cdot)$ used to obtain the prediction is completely wrong, as long as the IID assumption holds. This robustness property against model misspecification is highly touted in the machine learning community. It gives support to the sentiment of using "black box" algorithms, where the role of model fitting is reduced to an afterthought, although we will also provide arguments to counter this sentiment.

2.2. IID Versus Non-IID: Efficiency and Validity Under a Wrong Model

Although the validity of prediction is robust against wrong learning models in the IID case, there is no free lunch. The predictive intervals obtained under a wrong model are typically wider. For instance, suppose that the true model is $y = \mu_0(x) + \epsilon$, but a wrong model $y = \mu_1(x) + e$ is used. Since $y = \mu_0(x) + \epsilon = \mu_1(x) + \{\mu_0(x) - \mu_1(x)\} + \epsilon$, we have $e = \{\mu_0(x) - \mu_1(x)\} + \epsilon$. So, when $\epsilon$ is independent of $x$, $\mathrm{var}(e) = \mathrm{var}\big(\mu_0(x) - \mu_1(x)\big) + \mathrm{var}(\epsilon) \ge \mathrm{var}(\epsilon)$, and the equality holds only when $\mu_1(x) = \mu_0(x)$. Thus, the error term $e$ under a wrong model has a larger variance than the error term $\epsilon$ under the true model. The larger the variance $\mathrm{var}(\mu_0(x) - \mu_1(x))$ is (i.e., the more discrepant $\mu_1(x)$ and $\mu_0(x)$ are), the larger the variance of the error term $e$ is. A larger error translates to less accurate estimation and prediction. See also Proposition 2 of Xie and Zheng (2020) for a formal statement regarding the predictive interval lengths in linear models.

We have an explanation of why a conformal predictive algorithm can still provide valid prediction even under a totally wrong learning model in the IID case. Specifically, when we use a wrong model $\mu_1(x)$, the corresponding point predictor will be biased by the magnitude of $\mu_1(x_{new}) - \mu_0(x_{new})$, but at the same time the error term $e$ absorbs the bias and produces residuals with a shift by the magnitude of $\mu_0(x_i) - \mu_1(x_i) = -\{\mu_1(x_i) - \mu_0(x_i)\}$. In the conformal predictive interval (2), the quantiles of the residuals are added back to the point prediction to form the interval bounds. If the IID assumption holds, the bias is offset by the shift. See also Xie and Zheng (2020), in which an explicit mathematical expression of this cancellation in linear models is derived. Along with the greater residual variance, the offsetting ensures the validity of the conformal prediction in the IID case. We call this tendency of self-balance to maintain validity a homeostasis phenomenon.

The IID assumption is a crucial condition to ensure the validity of a prediction under a wrong model. If the IID assumption does not hold for the testing data, the prediction based on a wrong learning model (or a correct model but a wrong parameter estimation) is often invalid with large errors, as we see in our case studies. We think this IID assumption also explains why deep neural networks and other machine learning methods work so well in academic research settings (where a random split of data into training and testing sets is a common practice) but fail to produce "killer applications" to make predictions for a given patient or company whose $x_{new}$ is often not close to the center of the training data. The good news is that, if we use a correct model for training and can get good model estimates, it is still possible to get a valid prediction for a specific $x_{new}$. Modeling and estimation remain relevant and often crucial for prediction in both the IID and non-IID cases.

3. Case Study: Prediction Under Neural Network Models

We use a neural network model and a simulation study to provide empirical support for our discussion. In the current neural network development, model fitting algorithms do not pay much attention to correctly estimating the model parameters. We find that the estimation of model parameters also plays an important role in prediction, in addition to a correct model specification.

Example 1. Suppose our training data $(y_i, x_i)$, $i = 1, \ldots, n$, are IID samples from the model

$$y_i = \mu_0(x_i) + \epsilon_i = \max\big\{0,\ \max(0, z_{i1} + z_{i2}) - \max(0, w_i)\big\} + \epsilon_i, \quad \epsilon_i \overset{iid}{\sim} N(0, \sigma^2), \qquad (3)$$

where $x_i = (z_{i1}, z_{i2}, w_i)^T \sim N(\mu_x, \Sigma_x)$ and $\epsilon_i$ and $x_i$ are independent. Here, $\mu_x = (0, 0, 0)^T$, the $(k, k')$-element of $\Sigma_x$ is $0.5^{|k - k'|}/2$, for $k, k' \in \{1, 2, 3\}$, $\sigma^2 = 1$ and $n = 300$. Model (3) is in fact a neural network model (with a diagram presented in Figure 1(a)) and we can re-express $\mu_0(x_i)$ as

$$\mu_0(x_i) = f\big(A_2 f(A_1 x_i)\big). \qquad (4)$$

Here, $f(x) = \max(x, 0)$ is the ReLU activation function, and

$$A_1 = \begin{pmatrix} a_{11}^{(1)} & a_{12}^{(1)} & a_{13}^{(1)} \\ a_{21}^{(1)} & a_{22}^{(1)} & a_{23}^{(1)} \end{pmatrix} \quad \text{and} \quad A_2 = \big(a_1^{(2)}, a_2^{(2)}\big)$$

are the model parameters. Corresponding to (3), the true model parameter values are $a_{11}^{(1)} = a_{12}^{(1)} = a_{23}^{(1)} = 1$, $a_{13}^{(1)} = a_{21}^{(1)} = a_{22}^{(1)} = 0$ and $(a_1^{(2)}, a_2^{(2)}) = (1, -1)$. In our analysis, we assume that we know the model form (4) but do not know the values of the model parameters $A_1$ and $A_2$.

For the testing data, we consider two scenarios: (i) [IID case] $x_{new} \overset{iid}{\sim} N(\mu_x, \Sigma_x)$ and, given $x_{new}$, $(x_{new}, y_{new})$ follows (3); (ii) [Non-IID case] the marginal distribution $x_{new} \sim (T_1, T_2, T_3)$ and, given $x_{new}$, $(x_{new}, y_{new})$ follows (3). Here, $T_1, T_2, T_3$ are iid random variables from the $t$-distribution with 3 degrees of freedom and noncentrality parameter 1.
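The construction in interval (2) can be sketched in a few lines. The sketch below is our own illustration, not the authors' code: `fit_loo` is a hypothetical hook for the working model, and the intercept-only `wrong_model` (in the spirit of a no-covariates model) is deliberately misspecified. It checks empirically that the interval stays valid under the wrong model in the IID case.

```python
import numpy as np

def jackknife_plus_interval(x, y, x_new, fit_loo, alpha=0.10):
    """Interval (2): the alpha/2 and (1 - alpha/2) quantiles of the
    leave-one-out predictions at x_new shifted by the signed
    leave-one-out residuals."""
    n = len(y)
    shifted = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        mu = fit_loo(x[keep], y[keep])      # fitted working model mu-hat
        shifted[i] = mu(x_new) + (y[i] - mu(x[i]))
    return np.quantile(shifted, alpha / 2), np.quantile(shifted, 1 - alpha / 2)

# A deliberately wrong working model: ignore covariates, predict the mean.
wrong_model = lambda xs, ys: (lambda x0: ys.mean())

rng = np.random.default_rng(1)
def draw(n):                                # IID pairs with a nonlinear truth
    x = rng.normal(size=n)
    return x, np.maximum(x, 0.0) + rng.normal(size=n)

x_tr, y_tr = draw(100)
hits = 0
for _ in range(200):
    x0, y0 = draw(1)
    lo, hi = jackknife_plus_interval(x_tr, y_tr, x0[0], wrong_model)
    hits += lo <= y0[0] <= hi
coverage = hits / 200
```

Despite the wrong model, `coverage` lands near the typical 1 − α = 0.90 (the guaranteed bound is 1 − 2α = 0.80), illustrating the homeostasis offset of bias by shifted residuals; the intervals are simply wider than under a correct model, and a non-IID redraw of `x0` can break the guarantee.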

Figure 1. Diagrams of four neural network models: (a) the true $\mu_0(\cdot)$; (b) the partial $\mu_1(\cdot)$; and (c, d) the over-parameterized $\mu_2(\cdot)$ and $\mu_3(\cdot)$ with $L$ layers (20 nodes in each layer), $L = 20$ and 100, respectively.
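To make the re-expression (4) concrete, a minimal numerical check (our own sketch, using the true parameter values stated in Example 1) confirms that the two-hidden-unit ReLU network reproduces the closed form in (3):

```python
import numpy as np

relu = lambda t: np.maximum(t, 0.0)

# True parameters of (4): hidden unit 1 computes z1 + z2, hidden unit 2 computes w.
A1 = np.array([[1.0, 1.0, 0.0],
               [0.0, 0.0, 1.0]])
A2 = np.array([1.0, -1.0])

def mu0_net(x):      # network form (4): f(A2 f(A1 x))
    return float(relu(A2 @ relu(A1 @ x)))

def mu0_direct(x):   # closed form in (3)
    z1, z2, w = x
    return max(0.0, max(0.0, z1 + z2) - max(0.0, w))

x = np.array([0.3, -0.1, 0.5])
mu0_net(x), mu0_direct(x)    # identical values for any input
```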

Table 1. Mean square error of each parameter in $\mu_0$ (training data $n = 300$; repetition = 10).

MSE       | a11  | a12   | a13  | a21   | a22   | a23   | b1   | b2
Opt-MSE   | 0.07 | 0.059 | 0.31 | 0.154 | 0.109 | 0.124 | 0.06 | 0.101
Neuralnet | 4.87 | 5.9   | 1.53 | 1.14  | 2.23  | 2.08  | 4.87 | 0.84

In addition to (a) the true model $\mu_0(\cdot)$, four wrong learning models are considered:

(b) $\mu_1(x_i) = f(Bz_i)$ (partially correct neural network model, missing $w_i$);
(c) $\mu_2(x_i) = f\big(C_{20}f(C_{19} \cdots f(C_1 x))\big)$ (deep neural network model with 20 layers);
(d) $\mu_3(x_i) = f\big(D_{100}f(D_{99} \cdots f(D_1 x))\big)$ (deep neural network model with 100 layers);
(e) $\mu_4(x_i) = \eta_0$ (without any covariates),

where $z_i = (z_{i1}, z_{i2})^T$, $B = (b_1, b_2)$, $C_1, D_1 \in R^{20 \times 3}$, $C_{20}, D_{100} \in R^{1 \times 20}$, and $C_i, D_j \in R^{20 \times 20}$, $2 \le i \le 19$, $2 \le j \le 99$. In our analysis, the neural network models $\mu_0(\cdot)$–$\mu_3(\cdot)$ are fit using the neuralnet package (cran.r-project.org/web/packages/neuralnet/).

The neuralnet package is an off-the-shelf machine learning algorithm. Its emphasis is on learning and not on model parameter estimation. Even under the true model $\mu_0(\cdot)$, the estimates of the model parameters from neuralnet are not very accurate (see Table 1). In the table, "Opt-MSE" refers to a code that we wrote by directly minimizing $\mathrm{MSE} = \sum_{j=1}^{n}(y_j - \mu_0(x_j))^2$, which can be implemented when the neural network is small. The calculation in the table is based on 20 repeated runs, each with a training dataset of size $n = 300$ from model (3).

Reported in Table 2 are the coverage rates and average interval lengths of the predictive intervals computed under $10 = 5 \times 2$ settings, with five different learning models $\mu_k(\cdot)$, $k = 0, 1, \ldots, 4$, and in two scenarios. The analysis is repeated 10 times with 10 simulated training datasets from model (3). We use 10 repetitions and not a greater number because it takes a long time to fit a neural network model. However, for each of the 10 training datasets, 20 pairs of $(y_{new}, x_{new})$ are used. So each reported value is computed using $10 \times 20 = 200$ pairs of $(y_{new}, x_{new})$. For the true neural network model $\mu_0(\cdot)$, Opt-MSE is also used to fit the model. As we can see in Table 2, under the IID scenario, all predictive intervals are valid with correct coverage. The best one, with the shortest interval length, is the one that uses the correct model and the Opt-MSE estimation method. In the non-IID case, only the shallow neural network models provide valid predictions, and among them, Opt-MSE can give us confidence intervals with half the width. Indeed, when a wrong learning model is used, the IID assumption is essential for the prediction validity, and the use of a wrong model often results in wider intervals. Furthermore, the estimation of the model parameters seems to also have a big impact on prediction.

To get a full picture of the predictive intervals at all levels, we plot in Figure 2 the predictive curves of $y_{new}$. The plots are based on the first training dataset, making predictions for (a) the IID case with the realization $x_{new} = (-0.909, -1.149, -0.771)$, and (b) the non-IID case with the realization $x_{new} = (3.653, 1.748, 1.063)$. The realized value of $\mu_0(x_{new})$ is 0 and 4.338 in (a) and (b), respectively. From Figure 2, we see that the use of a wrong model $\mu_1(\cdot)$–$\mu_4(\cdot)$ results in wider predictive curves (and predictive intervals at all levels $1 - \alpha \in (0, 1)$) in both the IID and non-IID cases. Although the shallow neural network models $\mu_0(\cdot)$ and $\mu_1(\cdot)$ can provide good coverage rates, the predictive curves in the non-IID case are much wider than other approaches. This peculiar phenomenon occurs even when we assume to know

Table 2. Performance of 95% predictive intervals under five different learning models and in two scenarios: coverage rates (before brackets) and average interval lengths (inside brackets) (training data size = 300; testing data size = 20; repetition = 10).

                 | True model $\mu_0(\cdot)$     |              | Wrong model
                 | Opt-MSE       | Neuralnet     | $\mu_1(\cdot)$ Neuralnet | $\mu_2(\cdot)$ Neuralnet | $\mu_3(\cdot)$ Neuralnet | $\mu_4(\cdot)$ Neuralnet
IID scenario     | 0.995 (4.462) | 0.99 (4.608)  | 0.99 (4.809)  | 0.99 (5.212)  | 0.99 (5.201)  | 0.985 (5.26)
Non-IID scenario | 0.955 (4.52)  | 0.985 (9.327) | 0.98 (9.77)   | 0.71 (5.899)  | 0.695 (5.201) | 0.685 (5.277)

Figure 2. Plots of predictive curves for (a) $x_{new} \overset{iid}{\sim} x_i$ and (b) $x_{new} \not\sim x_i$. In each plot, the red solid curve is the target (oracle) predictive curve $\mathrm{PV}_n(y) = 2\min\{\Phi(y - \mu_{new}), 1 - \Phi(y - \mu_{new})\}$, obtained assuming that the distribution $y_{new} \sim N(\mu_{new}, 1)$ is completely known. The two predictive curves obtained using $\mu_0(\cdot)$ are in black (solid line for Opt-MSE; dashed line for Neuralnet). The other predictive curves (all in dashed or broken lines and in various colors) are obtained using the other four wrong working models.

the true model structure $\mu_0(\cdot)$, indicating the importance of estimating the model parameters accurately. Furthermore, in the non-IID case, there are large shifts when we use the deep neural network models $\mu_2(\cdot)$ and $\mu_3(\cdot)$, leading to invalid predictions. The best prediction result is the one obtained by using the correct learning model $\mu_0(\cdot)$ with the more accurate parameter estimation method Opt-MSE. The message is the same as what we have learned from Table 2, which also exactly mirrors what is found in the case study of linear models in Xie and Zheng (2020).

4. Conclusion

Professor Efron pointed out that "the 21st Century has seen the rise of a new breed of what can be called 'pure prediction algorithms'." We are fully in agreement with Professor Efron's discussion that the prediction algorithms "can be stunningly successful," and that "the emperor has nice clothes but they're not suitable for every occasion." Along the same line, and under the setting of conformal prediction, we have demonstrated and explained how and why a prediction method can be successful under the IID assumption, even if the learning model is completely wrong. More importantly, we have also demonstrated that it is still meaningful, and often crucial, to build our prediction algorithms on a good practice of modeling, estimation, and inference. We fully anticipate and believe that "the most powerful ideas of Twentieth Century statistics" (modeling, estimation, and inference) will play a pivotal role in building the mathematical foundation of modern data science and in fully realizing its potential for real-world applications.

Funding

The research is supported in part by research grants from NSF.

References

Barber, R. F., Candès, E. J., Ramdas, A., and Tibshirani, R. J. (2019a), "The Limits of Distribution-Free Conditional Predictive Inference," arXiv no. 1903.04684.
Barber, R. F., Candès, E. J., Ramdas, A., and Tibshirani, R. J. (2019b), "Predictive Inference With the Jackknife+," arXiv no. 1905.02928.
Efron, B. (1998), "R. A. Fisher in the 21st Century (Invited Paper Presented at the 1996 R. A. Fisher Lecture)," Statistical Science, 13, 95–122.
Lei, J., G'Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. (2018), "Distribution-Free Predictive Inference for Regression," Journal of the American Statistical Association, 113, 1094–1111.
Shafer, G., and Vovk, V. (2008), "A Tutorial on Conformal Prediction," Journal of Machine Learning Research, 9, 371–421.
Shen, J., Liu, R., and Xie, M. (2018), "Prediction With Confidence—A General Framework for Predictive Inference," Journal of Statistical Planning and Inference, 195, 126–140.
Singh, K., Xie, M., and Strawderman, W. E. (2005), "Combining Information From Independent Sources Through Confidence Distributions," The Annals of Statistics, 33, 159–183.
Vovk, V., Gammerman, A., and Shafer, G. (2005), Algorithmic Learning in a Random World, New York: Springer.
Vovk, V., Shen, J., Manokhin, V., and Xie, M. (2019), "Nonparametric Predictive Distributions by Conformal Prediction," Machine Learning, 108, 445–474.
Xie, M., and Singh, K. (2013), "Confidence Distribution, the Frequentist Distribution Estimator of a Parameter" (with discussion), International Statistical Review, 81, 3–39.
Xie, M., and Zheng, Z. (2020), "Homeostasis Phenomenon in Predictive Inference When Using a Wrong Learning Model: A Tale of Random Split of Data Into Training and Test Sets," arXiv no. 2003.08989.

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 2020, VOL. 115, NO. 530, 672–674: Comment
https://doi.org/10.1080/01621459.2020.1762615

The Data Science Process: One Culture

Bin Yu^{a,b,c} and Rebecca Barter^{a}

^{a}Statistics Department, University of California, Berkeley, Berkeley, CA; ^{b}EECS Department, University of California, Berkeley, Berkeley, CA; ^{c}Chan Zuckerberg Biohub, San Francisco, CA

We would like to congratulate Professor Brad Efron for the well-deserved honor of winning the 2019 International Statistics Prize, and for his thought-provoking paper discussing the diverging paths of so-called traditional statistical regression and pure prediction approaches. Professor Efron has made vast contributions to the field of statistics spanning several decades, and we enjoyed reading his take on the current state of the fields of statistics and machine learning.

In this discussion, we provide a portrayal of our own perspectives on the intersection of statistics and machine learning methods, and compare them with Professor Efron's. While we came from very similar beginnings as Professor Efron, rather than arriving at his view of a "discrepancy" or a "radical difference" between "traditional statistical regression methods" and "pure prediction" methods, we arrived at a surprisingly contrasting view that unifies these two approaches. At the very end of his paper, Professor Efron seems "hopeful" about the "reunification" of these two paradigms. However, in his view, efforts toward this reunification are "just underway," with the primary impediments being a lack of theory. In our contrasting view, we are much further along the path of reunification, with the theoretical underpinnings being less critical than, but still important and following on from, practice in today's reality-rooted era.

While our views are quite different, we certainly agree with many of the points that Professor Efron makes in his paper. For instance, Professor Efron appears wary of relying solely on random splits when implementing the train/test paradigm that is pervasive in the world of pure prediction algorithms. Specifically, he points out that there are instances where, instead of providing an honest estimate of the error, a random split will be far too optimistic, as in the case of time-dependent data. He argues that a more realistic split in such scenarios is not random, but rather uses an early/late split where the early data are used to train and the later data are used to test. Indeed, a random split in many cases ensures that the training and test sets are identically distributed, but fails to ensure that they are independent (i.e., a random split does not resemble two independent samples taken from the same population). Such a random split certainly does not resemble the relationship between the data used to train an algorithm and the future data that it will be used on (future data are, at best, an independent sample from the same population, but are often collected from a different source or under different circumstances). While Professor Efron uses this example to focus on how prediction is easier than extrapolation or interpolation (based on the observation that a random split yields better test-set performance than the early/late split in time-dependent data), we view this as an example of the importance of critical thinking. In our view, the train/test paradigm should not focus on randomness; it should instead focus on representing the way the predictive algorithm will be used. The training set should be representative of the current data being used to train the algorithm, and the test set should be representative of the future data to which the algorithm will be applied. So while we agree with Professor Efron's point that a random split is not always appropriate, we argue that it is not the randomness of the split that is critical to the pure prediction paradigm, but the representativeness of the split (relative to the prediction problem).

We were particularly delighted to read Professor Efron's reflections on Leo Breiman's 2001 two cultures paper, including his own discussion and perspectives at the time, along with how they have been updated in the two decades since. Interestingly, Leo's paper, and author Yu's own interactions with him, were the starting point for her own foray into machine learning 20 years ago. The lessons learned and perspectives gained from this journey have become a focal part of Yu's own research and philosophy that she has both passed on to, and developed together with, her students (including coauthor Barter). Today, looking back on Breiman's work, the main difference between our view and Professor Efron's is that his view focuses on the differences between machine learning (or "pure prediction") methods and "traditional regression" methods, whereas our own view focuses on the connections that bridge these two cultures, and the expansions building on both.

Professor Efron highlights his perspective on the differences between "pure prediction" and "traditional regression" methods via six criteria presented in Table 5. For instance, he views traditional regression models as focusing on scientific truth with pure prediction methods focusing instead on prediction accuracy; he views traditional regression models as focusing on low dimensional problems while pure prediction methods focus on high dimensional problems; and he views traditional regression problems as utilizing optimal inference theory such as maximum likelihood and Neyman–Pearson while pure prediction methods focus instead on training/test paradigms.
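The representativeness point can be made with a toy drifting time series (our own hypothetical construction, not from the discussion): a random split makes train and test exchangeable and so looks optimistic, while an early/late split exposes the drift a deployed model would actually face.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
t = np.arange(n)
x = rng.normal(size=n)
y = x + 0.002 * t + rng.normal(scale=0.5, size=n)   # slow drift in the mean over time

def fit_ols(xs, ys):
    """Ordinary least squares with an intercept."""
    A = np.column_stack([np.ones(len(xs)), xs])
    return np.linalg.lstsq(A, ys, rcond=None)[0]

def mse(beta, xs, ys):
    return float(np.mean((ys - (beta[0] + beta[1] * xs)) ** 2))

# Random split: train and test are exchangeable, so the drift "averages out."
perm = rng.permutation(n)
mse_random = mse(fit_ols(x[perm[:500]], y[perm[:500]]), x[perm[500:]], y[perm[500:]])

# Early/late split: the test data come from the drifted future the model never saw.
mse_chrono = mse(fit_ols(x[:500], y[:500]), x[500:], y[500:])
```

Here `mse_chrono` comes out substantially larger than `mse_random`; which split is the "honest" estimate depends on whether the deployment data resemble a random draw or a future draw.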

CONTACT Bin Yu [email protected], Statistics Department, University of California, Berkeley, CA 94720. ©2020 American Statistical Association

However, in our experience, pure prediction methods are routinely used in practice together with classical statistical methods such as PCA and logistic regression. Right or wrong, logistic regression is often even categorized as a machine learning (pure prediction) method, and many modern machine learning books include traditional statistical ideas such as maximum likelihood, linear regression, and logistic regression. We have experienced the use of both traditional regression and pure prediction approaches in low- and high-dimensional problems. The idea that traditional regression methods focus on scientific truth while pure prediction approaches focus on prediction accuracy certainly has some merit, but in our experience the "truth" that traditional regression methods supposedly represent is rarely justified or validated. The mathematical formulae on which traditional regression models are built are rarely based on verifiable domain knowledge. Instead, the primary evidence that these mathematical regression formulations represent the truth is often their predictive accuracy (which, in Professor Efron's paper, is the primary concern of pure prediction methods, rather than of traditional regression models). At the very least, if a traditional regression method does not generate accurate predictions, then it is unlikely to be adequately capturing any kind of underlying "truth." Moreover, since a reasonable goal is to ensure that conclusions drawn (and predictions generated) are generalizable to new data, the training/test paradigm is just as relevant for traditional regression approaches as it is for the pure prediction approaches.

Professor Efron separates regression approaches into three categories: prediction ("the prediction of new cases"), estimation ("an instrument for peering through the noisy data and discerning a smooth underlying truth"), and attribution ("the assignment of significance to individual predictors"). Professor Efron argues that pure prediction algorithms tend to "focus on prediction, to the neglect of estimation and attribution." However, while traditional regression methods generate attribution via p-values associated with model coefficients, many pure prediction methods, such as random forests, have their own methods of attribution in the form of variable importance (Gini or permutation) scores. Professor Efron discusses this idea but dismisses these modern attribution attempts as suboptimal, dampened by the pure prediction methods' focus on prediction using many weak learners that do not prioritize individual strong predictors. Meanwhile, in our own research group we have seen great success using these importance scores to understand attribution. We developed the iterative random forest (iRF) approach (Basu et al. 2018) for identifying higher-order interactions based on decision paths in (pure prediction) random forest models, and have found such approaches to be extremely useful in practical applications, including verifying existing and identifying novel gene interactions in Drosophila embryonic development. Our experience indicates that prediction and attribution can go hand-in-hand, rather than being at odds with one another.

In addition, Professor Efron's focus on estimation implies that there is an underlying "truth" that can be uncovered by a model. As we discussed above, our view has become that no such truth is ever really attainable, whether or not you are working within the realm of traditional regression models or pure prediction models. Our own perspective has transitioned into a framework of accurate approximation for a particular domain, and relative to a particular performance metric, rather than the truth-seeking that was the goal of traditional statistical inference.

Finally, Professor Efron calls for "substantial further development" of machine learning approaches before they are ready for "routine scientific applicability." However, such approaches are already being used successfully in routine scientific applications. Of course, we also agree that further development is always a good thing. In our research group, we have been working toward this goal, building on our own experience gained from many interdisciplinary projects applying pure prediction methods in conjunction with traditional statistical ideas in neuroscience, genomics, and precision medicine. As a result of our work, we have developed a framework that we call veridical ("truthful") data science, which encompasses a unified view of pure prediction and traditional regression approaches under the three principles of predictability, computability, and stability (PCS). The idea of using prediction as a reality check forms the basis of our predictability principle, and ideas from the traditional regression approaches are modernized in the stability principle, which itself bears a strong relationship to, and expands on, Professor Efron's groundbreaking work on the bootstrap. Stability is a critical component of all stages of a data science project, ranging from the literal words used to formulate the problem (every word in the problem formulation should mean the same thing—have stable meaning—to every individual on a multidisciplinary project team) to data cleaning, and data and model perturbations in the modeling stage. Data-driven approaches are brought into the modern big data era via the computability principle.

Our "one culture" veridical PCS data science framework highlights the practice of veridical ("truthful") data science for reliable, reproducible, and transparent data-driven decision making and knowledge generation from data for a particular domain problem (Yu and Kumbier 2020). In addition, our paper also introduces and advocates for PCS documentation, which records the analytical judgment calls made by humans throughout the data science life cycle, along with relevant domain knowledge and code. Our paper provides a nice summary of our views, as this one does for Professor Efron's. The two papers would be an interesting side-by-side read.

Professor Efron's piece gave us much food for thought, allowing us the opportunity to reflect on changing perspectives over time. We appreciate the invitation to provide a discussion for this piece, and we remain ever in debt to the truly innovative contributions that Professor Efron has made to the field of statistics, contributions that have now also made their way into the field of machine learning and, in our view, contributed to the unification (rather than the division) of these two fields to arrive at "one culture" in today's field of data science.

Funding

We would like to acknowledge and thank the following funding sources: Office of Naval Research grant N00014-16-1-2664, NSF grants DMS-1613002 and IIS-1741340, and the Center for Science of Information, a US NSF Science and Technology Center, under grant CCF-0939370. B.Y. is a Chan Zuckerberg Biohub investigator.

References

Basu, S., Kumbier, K., Brown, J. B., and Yu, B. (2018), "Iterative Random Forests to Discover Predictive and Stable High-Order Interactions," Proceedings of the National Academy of Sciences of the United States of America, 115, 1943–1948.
Yu, B., and Kumbier, K. (2020), "Veridical Data Science," Proceedings of the National Academy of Sciences of the United States of America, 117, 3920–3929.

JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION 2020, VOL. 115, NO. 530, 675–677: Rejoinder
https://doi.org/10.1080/01621459.2020.1762453

Rejoinder

Bradley Efron

A hot new statistical technology like the pure prediction algorithms divides the discipline into insiders and outsiders. The insiders, caught up in the excitement of new lands to conquer, are usually in no mood to think critically about methodological weaknesses. Outsiders, everyone else, may lack enough information and insight to mount an effective critique, especially if the new methods are intricate, sprawling, and somewhat mysterious. Eventually, though, there is enough mixing between the two groups to allow the critical process to begin. Statistics has a strong tradition of internal criticism—witness the Bayesian/frequentist debates—with discussion papers like this one serving as useful outlets.

The discussants here span the insider/outsider spectrum, including some of the most prominent developers of modern prediction methods. Predictably, the insiders are the happiest with the fit between traditional and algorithmic methods: they "go hand in glove," say Professors Friedman, Hastie, and Tibshirani (henceforth FHT), speaking with authority as the authors of seminal texts on statistical learning, as well as key technologies such as the lasso. "Data Science Process: One Culture" is the title of Professor Yu and Dr. Barter's commentary, a reverse echo of Leo Breiman's famous "two cultures" paper, which, contrariwise, claimed a stark dichotomy between traditional methods and the "prediction culture." Again, Yu/Barter speak as experienced and accomplished inside developers of modern prediction methodology.

The outsiders are less happy with the current state of the connection between traditional and algorithmic regression ideas. (The term "outsider" here is only relative to algorithmic prediction; all of the discussants have been influential developers of modern data-analytic techniques such as false discovery rates, genomic analysis, and the bootstrap.) This paper was written from an outsider's point of view. Rereading it now, a year later, made me aware of some outsider limitations: there is in fact more connective tissue between the two cultures than I credited, both FHT and Yu/Barter giving convincing examples.

That being said, an extra year of experience has not changed my belief in a disconnect between traditional statistical regression methods and the pure prediction algorithms. Section 8 of the paper, featuring Table 5, makes the case directly in terms of six criteria. Five of the six emerged more or less unscathed from the discussion. Criterion 2, long-term scientific truth versus possibly short-term prediction accuracy, was doubted by FHT and received vigorous push-back from Yu/Barter:

    …but in our experience the "truth" that traditional regression methods supposedly represent is rarely justified or validated…

This is a hard-line Breimanian point of view. That "rarely" is over the top. The truism that begins "all models are wrong" ends with "but some are useful." Traditional models tend to err on the side of over-simplicity (not enough interactions, etc.) but still manage to capture at least some aspect of the underlying mechanism. "Eternal truth" is a little too much to ask for, but in the Neonate example we did wind up believing that respiratory strength had something lasting to do with the babies' survival.

Beyond the assessment of a model's correctness lies a deeper question: do we believe that there is a model? The history of science is a story of successful models—Newton, Maxwell, Einstein, the circulation of the blood, the periodic table, seafloor spreading, DNA and the double helix. Prediction by itself is an end state. Often a very useful end state, but we seem to need some form of theoretical underpinning to make further scientific progress. This is the place where the paper parts ways with the hard-line predictionist point of view.

A side effect of the anti-modeling ethos is the ad hoc nature of the pure prediction algorithms; ingeniously ad hoc, it is fair to say, but still devoid of theoretical justification, as shown for instance by the disparity between random forests and deep learning. The fathers of statistical theory—Pearson, Fisher, Neyman, Hotelling, Wald—forgot to provide us with a comprehensive theory of optimal prediction. We will have to count on the current generation of young statisticians to fill the gap and put prediction on a principled foundation.

Professor Cressie adds a fourth category, "decision," to my prediction-estimation-attribution triad, and also adds a term to the surface-plus-noise paradigm, allowing the underlying "truth" to vary over time.
He makes an interesting comparison between data science and data engineering, not unlike the paper's distinction between long-term and ephemeral inferences, science being more "why" and engineering more "how." The Robo-debt story is a chilling example of what can go wrong when using the wrong metric of prediction accuracy (too much "how"). The question of prediction metrics, missing from the paper, is also raised by Professors Candes and Sabatti.

Candes and Sabatti's wide-ranging essay brings up a variety of interesting topics and questions:

• I used "attribution" to avoid "significance," an iffy terminology in general, and definitely out of place here.


• Causal inference, along with reproducibility another hot topic, lives on a far shore from the prediction algorithms, being a model-heavy topic. But, as Candes/Sabatti say, there are ongoing attempts to use the algorithms for causality purposes. Maybe this would have been an apt example for both of Section 10's "hopeful trends."

• In de Finetti's profound Bayesian universe, parameters (let alone parametric models) do not exist, only observations and how well we can predict them. Candes/Sabatti smartly connect this with Yu and Kumbier's extensively developed theory of "veridical" data science.

• The simplest probability model for prediction problems is the independence of the (x, y) pairs from one another in (2.1). How far can independence by itself take us? The nonparametric bootstrap is one example, conformal inference another, and also, almost, the ingenious knockoff methods Candes/Sabatti describe. It is worth remembering that the independence model is a model and can go wrong, as the discussion of concept drift in Section 6 shows.

This last point connects with Professor Xie and Zheng's discussion, a short paper in itself. Their Proposition 1 shows that conformal inference methods produce well-calibrated prediction intervals in surface-plus-noise models under an iid assumption for the (x, y) pairs, even if the assumed surface model is wrong. In the cholestyramine example of Figure 1, for instance, we might assume a cubic regression even if it should actually be eighth degree. The penalty for using the wrong model in the conformal calculations is increased width of the calculated prediction intervals. Proposition 1 is a more sophisticated version of the fact that the nonparametric bootstrap gives correct standard errors under iid sampling, even for "incorrect" estimators—that is, the variability of Figure 1's cubic regression curve whether or not the true model is cubic.

Professor Davison brings up the question of sample size n. In his practice, and much of mine too, n still tends to be small, whereas, as Candes/Sabatti point out, in the prediction world n seems actually to be going to infinity, or at least approaching the "n = all" ideal, where traditional stalwarts such as √n accuracy no longer apply. My own experience has followed the "too big, too small" principle: the investigator brings in a messy mountain of data, but by the time I figure out the right question to ask, the relevant part of the mountain shrinks to a molehill. I cannot apologize very much for the small size of the paper's examples. Huge sample sizes can conceal a method's weaknesses, which become worrisome when one gets to the molehill stage.

ARIMA models were a hot topic when introduced in the early 1970s, and, like the prediction algorithms, they carried with them a somewhat mysterious allure. It took a decade of critical thinking to reduce the mystery and put ARIMA into its place as a well-understood standard statistical tool. My hope is that the same process is now under way for prediction algorithms.

Nobody has done more than Professors Friedman, Hastie, and Tibshirani to connect the statistics and prediction algorithm communities, through their work and, especially, through their books. I will try to follow their comment numbering in what follows:

(1) I certainly agree that prediction is a useful part of statistical practice. The paper's critique was aimed at the pure prediction algorithms, random forests etc., and not at prediction in general. Lasso applications lie somewhere between traditional and algorithmic methods in offering at least some hope of principled attribution. To blame correlation between predictors for the lack of specificity seen in Table 3—where removing "top" predictors does not increase error rates—is only one possibility. Another is that a great many predictors may be weakly correlated with the response variable, and, because prediction is comparatively easy, these can be assembled into strong prediction rules in many different ways. A more attribution-friendly prediction algorithm would need, like the lasso, to be less promiscuous in selecting predictors. (I wondered if the early selections of the lasso-based skin cancer rule were strongly implicated?)

(2) Both War and Peace and The Big Bang Theory contain great writing, but they are great on different time scales. The actual big bang theory is a data-based estimate of the early universe. Physicists do not expect it to be the last word, but it could predominate for decades or even centuries—not so, say, for the Netflix movie prediction algorithm. My point was not that prediction is inherently ephemeral; rather that current algorithmic predictions may tend by their construction toward a short shelf-life.

(3) Sara Reardon's article in the February 2020 issue of Scientific American comments on a deep-learning pneumonia-predicting algorithm developed at Mount Sinai Hospital:

    …It performed with greater than 90% accuracy on x-rays produced at Mount Sinai but was far less accurate with scans from other institutions. They eventually figured out that instead of just analyzing the images, the algorithm was also factoring in the odds of a positive finding based on how common pneumonia was at each institution—not something they expected or wanted the program to do.

The "clever Hans" effect might occur more subtly, and more widely, than in the mammogram example.

(4) There's no question that a preliminary use of, say, random forests is a good way to start a more traditional data analysis, as FHT recommend. (I have gotten fond of doing just that.) This does not mean that there are not differences worth understanding between traditional and algorithmic methods. The military and local police may work "hand in glove" in a disaster but they are still quite different organizations. The paper was intended to pinpoint the differences between the old and new data analysis methodologies, with the hope of encouraging improvements in both.

Section 8 of the paper repeatedly quotes Professor Cox's commentary on Leo Breiman's 2001 paper "Statistical Modeling: The Two Cultures," where he lucidly defends the traditional culture, the one Leo does not have much time for. His discussion here includes a tribute to the international nature of the statistics community—the happy fact that our field is not split into non-communicating polities. The possibility of such a split is a worrisome corollary of the two cultures point of view.

Parametric modeling, Criterion 4 in Table 5, is a hallmark of the best of traditional statistical theory. Fisher was the key figure in its centrality. At the same time, he argued against over-mathematization. Cox writes:

    An aspect of both Fisher and Bartlett's work, which I must admit I admire, is a total lack of concern with formal mathematical regularity conditions which, of course subject to due care, typically contributes little to understanding.

I could not agree more. Of course, Fisher and Bartlett combined their lack of concern with the ability to produce profound results, a small club that includes Professor Cox as a member. All of this is to say that in hoping for mathematical structure in support of the prediction algorithms, the wish is for more structure, not more math per se.

My thanks to the discussants for what seem to me to be an unusually interesting set of commentaries on a pressing topic. Under the best of circumstances, discussion papers put a strain on the Editor. These are not the best of circumstances (April 2020), but that has not stopped Regina Liu from putting all this together. Thanks, Regina!