Statistical Science 2001, Vol. 16, No. 3, 199–231

Statistical Modeling: The Two Cultures

Leo Breiman

Abstract. There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.

Leo Breiman is Professor, Department of Statistics, University of California, Berkeley, California 94720-4735 (e-mail: [email protected]).

1. INTRODUCTION

Statistics starts with data. Think of the data as being generated by a black box in which a vector of input variables x (independent variables) go in one side, and on the other side the response variables y come out. Inside the black box, nature functions to associate the predictor variables with the response variables, so the picture is like this:

y ← nature ← x

There are two goals in analyzing the data:

Prediction. To be able to predict what the responses are going to be to future input variables;
Information. To extract some information about how nature is associating the response variables to the input variables.

There are two different approaches toward these goals:

The Data Modeling Culture

The analysis in this culture starts with assuming a stochastic data model for the inside of the black box. For example, a common data model is that data are generated by independent draws from

response variables = f(predictor variables, random noise, parameters)

The values of the parameters are estimated from the data and the model then used for information and/or prediction. Thus the black box is filled in like this:

y ← [linear regression, logistic regression, Cox model] ← x

Model validation. Yes–no using goodness-of-fit tests and residual examination.
Estimated culture population. 98% of all statisticians.

The Algorithmic Modeling Culture

The analysis in this culture considers the inside of the box complex and unknown. Their approach is to find a function f(x)—an algorithm that operates on x to predict the responses y. Their black box looks like this:

y ← unknown ← x

y ← [decision trees, neural nets] ← x

Model validation. Measured by predictive accuracy.
Estimated culture population. 2% of statisticians, many in other fields.

In this paper I will argue that the focus in the statistical community on data models has:

• Led to irrelevant theory and questionable scientific conclusions;


• Kept statisticians from using more suitable algorithmic models;
• Prevented statisticians from working on exciting new problems;

I will also review some of the interesting new developments in algorithmic modeling and look at applications to three data sets.

2. ROAD MAP

It may be revealing to understand how I became a member of the small second culture. After a seven-year stint as an academic probabilist, I resigned and went into full-time free-lance consulting. After thirteen years of consulting I joined the Berkeley Statistics Department in 1980 and have been there since. My experiences as a consultant formed my views about algorithmic modeling. Section 3 describes two of the projects I worked on. These are given to show how my views grew from such problems.

When I returned to the university and began reading statistical journals, the research was distant from what I had done as a consultant. All articles begin and end with data models. My observations about published theoretical research in statistics are in Section 4.

Data modeling has given the statistics field many successes in analyzing data and getting information about the mechanisms producing the data. But there is also misuse leading to questionable conclusions about the underlying mechanism. This is reviewed in Section 5. Following that is a discussion (Section 6) of how the commitment to data modeling has prevented statisticians from entering new scientific and commercial fields where the data being gathered is not suitable for analysis by data models.

In the past fifteen years, the growth in algorithmic modeling applications and methodology has been rapid. It has occurred largely outside statistics in a new community—often called machine learning—that is mostly young computer scientists (Section 7). The advances, particularly over the last five years, have been startling. Three of the most important changes in perception to be learned from these advances are described in Sections 8, 9, and 10, and are associated with the following names:

Rashomon: the multiplicity of good models;
Occam: the conflict between simplicity and accuracy;
Bellman: dimensionality—curse or blessing?

Section 11 is titled "Information from a Black Box" and is important in showing that an algorithmic model can produce more and more reliable information about the structure of the relationship between inputs and outputs than data models. This is illustrated using two medical data sets and a genetic data set. A glossary at the end of the paper explains terms that not all statisticians may be familiar with.

3. PROJECTS IN CONSULTING

As a consultant I designed and helped supervise surveys for the Environmental Protection Agency (EPA) and the state and federal court systems. Controlled experiments were designed for the EPA, and I analyzed traffic data for the U.S. Department of Transportation and the California Transportation Department. Most of all, I worked on a diverse set of prediction projects. Here are some examples:

Predicting next-day ozone levels.
Using mass spectra to identify halogen-containing compounds.
Predicting the class of a ship from high altitude radar returns.
Using sonar returns to predict the class of a submarine.
Identity of hand-sent Morse Code.
Toxicity of chemicals.
On-line prediction of the cause of a freeway traffic breakdown.
Speech recognition.
The sources of delay in criminal trials in state court systems.

To understand the nature of these problems and the approaches taken to solve them, I give a fuller description of the first two on the list.

3.1 The Ozone Project

In the mid- to late 1960s ozone levels became a serious health problem in the Los Angeles Basin. Three different alert levels were established. At the highest, all government workers were directed not to drive to work, children were kept off playgrounds and outdoor exercise was discouraged.

The major source of ozone at that time was automobile tailpipe emissions. These rose into the low atmosphere and were trapped there by an inversion layer. A complex chemical reaction, aided by sunlight, cooked away and produced ozone two to three hours after the morning commute hours. The alert warnings were issued in the morning, but would be more effective if they could be issued 12 hours in advance. In the mid-1970s, the EPA funded a large effort to see if ozone levels could be accurately predicted 12 hours in advance.

Commuting patterns in the Los Angeles Basin are regular, with the total variation in any given daylight hour varying only a few percent from one weekday to another. With the total amount of emissions about constant, the resulting ozone levels depend on the meteorology of the preceding days. A large data base was assembled consisting of lower and upper air measurements at U.S. weather stations as far away as Oregon and Arizona, together with hourly readings of surface temperature, humidity, and wind speed at the dozens of air pollution stations in the Basin and nearby areas.

Altogether, there were daily and hourly readings of over 450 meteorological variables for a period of seven years, with corresponding hourly values of ozone and other pollutants in the Basin. Let x be the predictor vector of meteorological variables on the nth day. There are more than 450 variables in x since information several days back is included. Let y be the ozone level on the n + 1st day. Then the problem was to construct a function f(x) such that for any future day and future predictor variables x for that day, f(x) is an accurate predictor of the next day's ozone level y.

To estimate predictive accuracy, the first five years of data were used as the training set. The last two years were set aside as a test set. The algorithmic modeling methods available in the pre-1980s decades seem primitive now. In this project large linear regressions were run, followed by variable selection. Quadratic terms in, and interactions among, the retained variables were added and variable selection used again to prune the equations. In the end, the project was a failure—the false alarm rate of the final predictor was too high. I have regrets that this project can't be revisited with the tools available today.

3.2 The Chlorine Project

The EPA samples thousands of compounds a year and tries to determine their potential toxicity. In the mid-1970s, the standard procedure was to measure the mass spectra of the compound and to try to determine its chemical structure from its mass spectra.

Measuring the mass spectra is fast and cheap. But the determination of chemical structure from the mass spectra requires a painstaking examination by a trained chemist. The cost and availability of enough chemists to analyze all of the mass spectra produced daunted the EPA. Many toxic compounds contain halogens. So the EPA funded a project to determine if the presence of chlorine in a compound could be reliably predicted from its mass spectra.

Mass spectra are produced by bombarding the compound with ions in the presence of a magnetic field. The molecules of the compound split and the lighter fragments are bent more by the magnetic field than the heavier. Then the fragments hit an absorbing strip, with the position of the fragment on the strip determined by the molecular weight of the fragment. The intensity of the exposure at that position measures the frequency of the fragment. The resultant mass spectra has numbers reflecting frequencies of fragments from molecular weight 1 up to the molecular weight of the original compound. The peaks correspond to frequent fragments and there are many zeroes. The available data base consisted of the known chemical structure and mass spectra of 30,000 compounds.

The mass spectrum predictor vector x is of variable dimensionality. Molecular weight in the data base varied from 30 to over 10,000. The variable to be predicted is

y = 1: contains chlorine,
y = 2: does not contain chlorine.

The problem is to construct a function f(x) that is an accurate predictor of y where x is the mass spectrum of the compound.

To measure predictive accuracy the data set was randomly divided into a 25,000 member training set and a 5,000 member test set. Linear discriminant analysis was tried, then quadratic discriminant analysis. These were difficult to adapt to the variable dimensionality. By this time I was thinking about decision trees. The hallmarks of chlorine in mass spectra were researched. This domain knowledge was incorporated into the decision tree algorithm by the design of the set of 1,500 yes–no questions that could be applied to a mass spectra of any dimensionality. The result was a decision tree that gave 95% accuracy on both chlorines and nonchlorines (see Breiman, Friedman, Olshen and Stone, 1984).

3.3 Perceptions on Statistical Analysis

As I left consulting to go back to the university, these were the perceptions I had about working with data to find answers to problems:

(a) Focus on finding a good solution—that's what consultants get paid for.
(b) Live with the data before you plunge into modeling.
(c) Search for a model that gives a good solution, either algorithmic or data.
(d) Predictive accuracy on test sets is the criterion for how good the model is (illustrated in the sketch below).
(e) Computers are an indispensable partner.
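Perception (d) is easy to make concrete in code. The following is a minimal sketch, not a reconstruction of the ozone or chlorine analyses: the synthetic data and the scikit-learn estimator are stand-ins, and only the discipline of holding out a test set is the point.

```python
# A minimal sketch of perception (d): judge a model by its error on data it
# never saw during fitting. The synthetic data and scikit-learn estimator are
# stand-ins, not the EPA ozone or mass-spectra data described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=0)

# Set aside a test set before any modeling decisions are made.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

model = DecisionTreeClassifier(max_depth=6, random_state=0).fit(X_train, y_train)

train_error = 1.0 - model.score(X_train, y_train)   # optimistic
test_error = 1.0 - model.score(X_test, y_test)      # the criterion that counts
print(f"training error {train_error:.3f}, test error {test_error:.3f}")
```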

4. RETURN TO THE UNIVERSITY

I had one tip about what research in the university was like. A friend of mine, a prominent statistician from the Berkeley Statistics Department, visited me in Los Angeles in the late 1970s. After I described the decision tree method to him, his first question was, "What's the model for the data?"

4.1 Statistical Research

Upon my return, I started reading the Annals of Statistics, the flagship journal of theoretical statistics, and was bemused. Every article started with

Assume that the data are generated by the following model: …

followed by mathematics exploring inference, hypothesis testing and asymptotics. There is a wide spectrum of opinion regarding the usefulness of the theory published in the Annals of Statistics to the field of statistics as a science that deals with data. I am at the very low end of the spectrum. Still, there have been some gems that have combined nice theory and significant applications. An example is wavelet theory. Even in applications, data models are universal. For instance, in the Journal of the American Statistical Association (JASA), virtually every article contains a statement of the form:

Assume that the data are generated by the following model: …

I am deeply troubled by the current and past use of data models in applications, where quantitative conclusions are drawn and perhaps policy decisions made.

5. THE USE OF DATA MODELS

Statisticians in applied research consider data modeling as the template for statistical analysis: Faced with an applied problem, think of a data model. This enterprise has at its heart the belief that a statistician, by imagination and by looking at the data, can invent a reasonably good parametric class of models for a complex mechanism devised by nature. Then parameters are estimated and conclusions are drawn. But when a model is fit to data to draw quantitative conclusions:

• The conclusions are about the model's mechanism, and not about nature's mechanism.

It follows that:

• If the model is a poor emulation of nature, the conclusions may be wrong.

These truisms have often been ignored in the enthusiasm for fitting data models. A few decades ago, the commitment to data models was such that even simple precautions such as residual analysis or goodness-of-fit tests were not used. The belief in the infallibility of data models was almost religious. It is a strange phenomenon—once a model is made, then it becomes truth and the conclusions from it are infallible.

5.1 An Example

I illustrate with a famous (also infamous) example: assume the data is generated by independent draws from the model

(R)   y = b0 + Σ_{m=1}^{M} bm xm + ε,

where the coefficients bm are to be estimated, ε is N(0, σ²) and σ² is to be estimated. Given that the data is generated this way, elegant tests of hypotheses, confidence intervals, distributions of the residual sum-of-squares and asymptotics can be derived. This made the model attractive in terms of the mathematics involved. This theory was used both by academic statisticians and others to derive significance levels for coefficients on the basis of model (R), with little consideration as to whether the data on hand could have been generated by a linear model. Hundreds, perhaps thousands of articles were published claiming proof of something or other because the coefficient was significant at the 5% level.

Goodness-of-fit was demonstrated mostly by giving the value of the multiple correlation coefficient R², which was often closer to zero than one and which could be over inflated by the use of too many parameters. Besides computing R², nothing else was done to see if the observational data could have been generated by model (R). For instance, a study was done several decades ago by a well-known member of a university statistics department to assess whether there was gender discrimination in the salaries of the faculty. All personnel files were examined and a data base set up which consisted of salary as the response variable and 25 other variables which characterized academic performance; that is, papers published, quality of journals published in, teaching record, evaluations, etc. Gender appears as a binary predictor variable.

A linear regression was carried out on the data and the gender coefficient was significant at the 5% level. That this was strong evidence of sex discrimination was accepted as gospel. The design of the study raises issues that enter before the consideration of a model—Can the data gathered answer the question posed? Is inference justified when your sample is the entire population? Should a data model be used? The deficiencies in analysis occurred because the focus was on the model and not on the problem.

The linear regression model led to many erroneous conclusions that appeared in journal articles waving the 5% significance level without knowing whether the model fit the data. Nowadays, I think most statisticians will agree that this is a suspect way to arrive at conclusions. At the time, there were few objections from the statistical profession about the fairy-tale aspect of the procedure. But, hidden in an elementary textbook, Mosteller and Tukey (1977) discuss many of the fallacies possible in regression and write "The whole area of guided regression is fraught with intellectual, statistical, computational, and subject matter difficulties."

Even currently, there are only rare published critiques of the uncritical use of data models. One of the few is David Freedman, who examines the use of regression models (1994), the use of path models (1987) and data modeling (1991, 1995). The analysis in these papers is incisive.

5.2 Problems in Current Data Modeling

Current applied practice is to check the data model fit using goodness-of-fit tests and residual analysis. At one point, some years ago, I set up a simulated regression problem in seven dimensions with a controlled amount of nonlinearity. Standard tests of goodness-of-fit did not reject linearity until the nonlinearity was extreme. Recent theory supports this conclusion. Work by Bickel, Ritov and Stoker (2001) shows that goodness-of-fit tests have very little power unless the direction of the alternative is precisely specified. The implication is that omnibus goodness-of-fit tests, which test in many directions simultaneously, have little power, and will not reject until the lack of fit is extreme.

Furthermore, if the model is tinkered with on the basis of the data, that is, if variables are deleted or nonlinear combinations of the variables added, then goodness-of-fit tests are not applicable. Residual analysis is similarly unreliable. In a discussion after a presentation of residual analysis in a seminar at Berkeley in 1993, William Cleveland, one of the fathers of residual analysis, admitted that it could not uncover lack of fit in more than four to five dimensions. The papers I have read on using residual analysis to check lack of fit are confined to data sets with two or three variables.

With higher dimensions, the interactions between the variables can produce passable residual plots for a variety of models. A residual plot is a goodness-of-fit test, and lacks power in more than a few dimensions. An acceptable residual plot does not imply that the model is a good fit to the data.

There are a variety of ways of analyzing residuals. For instance, Landwehr, Pregibon and Shoemaker (1984, with discussion) gives a detailed analysis of fitting a logistic model to a three-variable data set using various residual plots. But each of the four discussants present other methods for the analysis. One is left with an unsettled sense about the arbitrariness of residual analysis.

Misleading conclusions may follow from data models that pass goodness-of-fit tests and residual checks. But published applications to data often show little care in checking model fit using these methods or any other. For instance, many of the current application articles in JASA that fit data models have very little discussion of how well their model fits the data. The question of how well the model fits the data is of secondary importance compared to the construction of an ingenious stochastic model.

5.3 The Multiplicity of Data Models

One goal of statistics is to extract information from the data about the underlying mechanism producing the data. The greatest plus of data modeling is that it produces a simple and understandable picture of the relationship between the input variables and responses. For instance, logistic regression in classification is frequently used because it produces a linear combination of the variables with weights that give an indication of the variable importance. The end result is a simple picture of how the prediction variables affect the response variable plus confidence intervals for the weights. Suppose two statisticians, each one with a different approach to data modeling, fit a model to the same data set. Assume also that each one applies standard goodness-of-fit tests, looks at residuals, etc., and is convinced that their model fits the data. Yet the two models give different pictures of nature's mechanism and lead to different conclusions.

McCullagh and Nelder (1989) write "Data will often point with almost equal emphasis on several possible models, and it is important that the statistician recognize and accept this." Well said, but different models, all of them equally good, may give different pictures of the relation between the predictor and response variables. The question of which one most accurately reflects the data is difficult to resolve. One reason for this multiplicity is that goodness-of-fit tests and other methods for checking fit give a yes–no answer. With the lack of power of these tests with data having more than a small number of dimensions, there will be a large number of models whose fit is acceptable. There is no way, among the yes–no methods for gauging fit, of determining which is the better model. A few statisticians know this. Mountain and Hsiao (1989) write, "It is difficult to formulate a comprehensive model capable of encompassing all rival models. Furthermore, with the use of finite samples, there are dubious implications with regard to the validity and power of various encompassing tests that rely on asymptotic theory."

Data models in current use may have more damaging results than the publications in the social sciences based on a linear regression analysis. Just as the 5% level of significance became a de facto standard for publication, the Cox model for the analysis of survival times and logistic regression for survive–nonsurvive data have become the de facto standard for publication in medical journals. That different survival models, equally well fitting, could give different conclusions is not an issue.

5.4 Predictive Accuracy

The most obvious way to see how well the model box emulates nature's box is this: put a case x down nature's box getting an output y. Similarly, put the same case x down the model box getting an output ŷ. The closeness of y and ŷ is a measure of how good the emulation is. For a data model, this translates as: fit the parameters in your model by using the data, then, using the model, predict the data and see how good the prediction is.

Prediction is rarely perfect. There are usually many unmeasured variables whose effect is referred to as "noise." But the extent to which the model box emulates nature's box is a measure of how well our model can reproduce the natural phenomenon producing the data.

McCullagh and Nelder (1989) in their book on generalized linear models also think the answer is obvious. They write, "At first sight it might seem as though a good model is one that fits the data very well; that is, one that makes µ̂ (the model predicted value) very close to y (the response value)." Then they go on to note that the extent of the agreement is biased by the number of parameters used in the model and so is not a satisfactory measure. They are, of course, right. If the model has too many parameters, then it may overfit the data and give a biased estimate of accuracy. But there are ways to remove the bias. To get a more unbiased estimate of predictive accuracy, cross-validation can be used, as advocated in an important early work by Stone (1974). If the data set is larger, put aside a test set.

Mosteller and Tukey (1977) were early advocates of cross-validation. They write, "Cross-validation is a natural route to the indication of the quality of any data-derived quantity. We plan to cross-validate carefully wherever we can."

Judging by the infrequency of estimates of predictive accuracy in JASA, this measure of model fit that seems natural to me (and to Mosteller and Tukey) is not natural to others. More publication of predictive accuracy estimates would establish standards for comparison of models, a practice that is common in machine learning.
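The cross-validation estimate just described is easy to sketch. The example below is illustrative only: it uses a standard synthetic benchmark and two generic models rather than anything from the paper, and the point is simply that the same out-of-sample yardstick applies to a data model and an algorithmic model alike.

```python
# A minimal sketch of the cross-validation estimate of predictive accuracy
# that Section 5.4 points to (Stone, 1974; Mosteller and Tukey, 1977). The
# data come from a standard synthetic benchmark, not from the paper, and the
# two models are only illustrative.
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=1)

for name, model in [("linear regression", LinearRegression()),
                    ("random forest", RandomForestRegressor(n_estimators=200,
                                                            random_state=1))]:
    # Each fold is predicted by a model fit on the other nine folds, so the
    # averaged score is an out-of-sample estimate of predictive accuracy.
    mse = -cross_val_score(model, X, y, cv=10, scoring="neg_mean_squared_error")
    print(f"{name}: 10-fold cross-validated MSE {mse.mean():.2f}")
```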

6. THE LIMITATIONS OF DATA MODELS

With the insistence on data models, multivariate analysis tools in statistics are frozen at discriminant analysis and logistic regression in classification and multiple linear regression in regression. Nobody really believes that multivariate data is multivariate normal, but that data model occupies a large number of pages in every graduate textbook on multivariate statistical analysis.

With data gathered from uncontrolled observations on complex systems involving unknown physical, chemical, or biological mechanisms, the a priori assumption that nature would generate the data through a parametric model selected by the statistician can result in questionable conclusions that cannot be substantiated by appeal to goodness-of-fit tests and residual analysis. Usually, simple parametric models imposed on data generated by complex systems, for example, medical data, financial data, result in a loss of accuracy and information as compared to algorithmic models (see Section 11).

There is an old saying "If all a man has is a hammer, then every problem looks like a nail." The trouble for statisticians is that recently some of the problems have stopped looking like nails. I conjecture that the result of hitting this wall is that more complicated data models are appearing in current published applications. Bayesian methods combined with Markov Chain Monte Carlo are cropping up all over. This may signify that as data becomes more complex, the data models become more cumbersome and are losing the advantage of presenting a simple and clear picture of nature's mechanism.

Approaching problems by looking for a data model imposes an a priori straight jacket that restricts the ability of statisticians to deal with a wide range of statistical problems. The best available solution to a data problem might be a data model; then again it might be an algorithmic model. The data and the problem guide the solution. To solve a wider range of data problems, a larger set of tools is needed.

Perhaps the most damaging consequence of the insistence on data models is that statisticians have ruled themselves out of some of the most interesting and challenging statistical problems that have arisen out of the rapidly increasing ability of computers to store and manipulate data. These problems are increasingly present in many fields, both scientific and commercial, and solutions are being found by nonstatisticians.

7. ALGORITHMIC MODELING

Under other names, algorithmic modeling has been used by industrial statisticians for decades. See, for instance, the delightful book "Fitting Equations to Data" (Daniel and Wood, 1971). It has been used by psychometricians and social scientists. Reading a preprint of Gifi's book (1990) many years ago uncovered a kindred spirit. It has made small inroads into the analysis of medical data starting with Richard Olshen's work in the early 1980s. For further work, see Zhang and Singer (1999). Jerome Friedman and Grace Wahba have done pioneering work on the development of algorithmic methods. But the list of statisticians in the algorithmic modeling business is short, and applications to data are seldom seen in the journals. The development of algorithmic methods was taken up by a community outside statistics.

7.1 A New Research Community

In the mid-1980s two powerful new algorithms for fitting data became available: neural nets and decision trees. A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.

Their interests range over many fields that were once considered happy hunting grounds for statisticians and have turned out thousands of interesting research papers related to applications and methodology. A large majority of the papers analyze real data. The criterion for any model is what is the predictive accuracy. An idea of the range of research of this group can be got by looking at the Proceedings of the Neural Information Processing Systems Conference (their main yearly meeting) or at the Machine Learning Journal.

7.2 Theory in Algorithmic Modeling

Data models are rarely used in this community. The approach is that nature produces data in a black box whose insides are complex, mysterious, and, at least, partly unknowable. What is observed is a set of x's that go in and a subsequent set of y's that come out. The problem is to find an algorithm f(x) such that for future x in a test set, f(x) will be a good predictor of y.

The theory in this field shifts focus from data models to the properties of algorithms. It characterizes their "strength" as predictors, convergence if they are iterative, and what gives them good predictive accuracy. The one assumption made in the theory is that the data is drawn i.i.d. from an unknown multivariate distribution.

There is isolated work in statistics where the focus is on the theory of the algorithms. Grace Wahba's research on smoothing spline algorithms and their applications to data (using cross-validation) is built on theory involving reproducing kernels in Hilbert Space (1990). The final chapter of the CART book (Breiman et al., 1984) contains a proof of the asymptotic convergence of the CART algorithm to the Bayes risk by letting the trees grow as the sample size increases. There are others, but the relative frequency is small.

Theory resulted in a major advance in machine learning. Vladimir Vapnik constructed informative bounds on the generalization error (infinite test set error) of classification algorithms which depend on the "capacity" of the algorithm. These theoretical bounds led to support vector machines (see Vapnik, 1995, 1998) which have proved to be more accurate predictors in classification and regression than neural nets, and are the subject of heated current research (see Section 10).

My last paper "Some infinity theory for tree ensembles" (Breiman, 2000) uses a function space analysis to try and understand the workings of tree ensemble methods. One section has the heading, "My kingdom for some good theory." There is an effective method for forming ensembles known as "boosting," but there isn't any finite sample size theory that tells us why it works so well.

7.3 Recent Lessons

The advances in methodology and increases in predictive accuracy since the mid-1980s that have occurred in the research of machine learning have been phenomenal. There have been particularly exciting developments in the last five years. What has been learned? The three lessons that seem most important to me:

Rashomon: the multiplicity of good models;
Occam: the conflict between simplicity and accuracy;
Bellman: dimensionality—curse or blessing.

8. RASHOMON AND THE MULTIPLICITY OF GOOD MODELS

Rashomon is a wonderful Japanese movie in which four people, from different vantage points, witness an incident in which one person dies and another is supposedly raped. When they come to testify in court, they all report the same facts, but their stories of what happened are very different.

What I call the Rashomon Effect is that there is often a multitude of different descriptions [equations f(x)] in a class of functions giving about the same minimum error rate. The most easily understood example is subset selection in linear regression. Suppose there are 30 variables and we want to find the best five variable linear regressions. There are about 140,000 five-variable subsets in competition. Usually we pick the one with the lowest residual sum-of-squares (RSS), or, if there is a test set, the lowest test error. But there may be (and generally are) many five-variable equations that have RSS within 1.0% of the lowest RSS (see Breiman, 1996a). The same is true if test set error is being measured.

So here are three possible pictures with RSS or test set error within 1.0% of each other:

Picture 1
y = 2.1 + 3.8x3 − 0.6x8 + 83.2x12 − 2.1x17 + 3.2x27

Picture 2
y = −8.9 + 4.6x5 + 0.01x6 + 12.0x15 + 17.5x21 + 0.2x22

Picture 3
y = −76.7 + 9.3x2 + 22.0x7 − 13.2x8 + 3.4x11 + 7.2x28

Which one is better? The problem is that each one tells a different story about which variables are important.

The Rashomon Effect also occurs with decision trees and neural nets. In my experiments with trees, if the training set is perturbed only slightly, say by removing a random 2–3% of the data, I can get a tree quite different from the original but with almost the same test set error. I once ran a small neural net 100 times on simple three-dimensional data reselecting the initial weights to be small and random on each run. I found 32 distinct minima, each of which gave a different picture and had about equal test set error.

This effect is closely connected to what I call instability (Breiman, 1996a) that occurs when there are many different models crowded together that have about the same training or test set error. Then a slight perturbation of the data or in the model construction will cause a skip from one model to another. The two models are close to each other in terms of error, but can be distant in terms of the form of the model.

If, in logistic regression or the Cox model, the common practice of deleting the less important covariates is carried out, then the model becomes unstable—there are too many competing models. Say you are deleting from 15 variables to 4 variables. Perturb the data slightly and you will very possibly get a different four-variable model and a different conclusion about which variables are important. To improve accuracy by weeding out less important covariates you run into the multiplicity problem. The picture of which covariates are important can vary significantly between two models having about the same deviance.

Aggregating over a large set of competing models can reduce the nonuniqueness while improving accuracy. Arena et al. (2000) bagged (see Glossary) logistic regression models on a data base of toxic and nontoxic chemicals where the number of covariates in each model was reduced from 15 to 4 by standard best subset selection. On a test set, the bagged model was significantly more accurate than the single model with four covariates. It is also more stable. This is one possible fix. The multiplicity problem and its effect on conclusions drawn from models needs serious attention.
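The subset-selection version of the Rashomon Effect can be reproduced in a few lines. The sketch below uses synthetic data (half the columns are noisy copies of the other half) rather than Breiman's simulations; it only counts how many four-variable equations sit within 1% of the best RSS while naming different variables.

```python
# A small numerical illustration of the Rashomon Effect in subset selection.
# Synthetic data, not Breiman's experiments: columns 6-11 are noisy copies of
# columns 0-5, so many competing subsets fit almost equally well.
from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=(n, 6))
X = np.column_stack([base, base + 0.5 * rng.normal(size=(n, 6))])
y = base @ np.array([1.0, -1.0, 0.8, 0.6, -0.5, 0.4]) + rng.normal(scale=1.5, size=n)

def rss(cols):
    A = np.column_stack([np.ones(n), X[:, cols]])          # intercept + subset
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return float(r @ r)

subsets = list(combinations(range(X.shape[1]), 4))
errors = np.array([rss(s) for s in subsets])
close = [s for s, e in zip(subsets, errors) if e <= 1.01 * errors.min()]

print(f"{len(subsets)} four-variable subsets; {len(close)} within 1% of the best RSS")
for s in close[:5]:
    print("  competitive subset:", s)
```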

9. OCCAM AND SIMPLICITY VS. ACCURACY

Occam's Razor, long admired, is usually interpreted to mean that simpler is better. Unfortunately, in prediction, accuracy and simplicity (interpretability) are in conflict. For instance, linear regression gives a fairly interpretable picture of the y, x relation. But its accuracy is usually less than that of the less interpretable neural nets. An example closer to my work involves trees.

On interpretability, trees rate an A+. A project I worked on in the late 1970s was the analysis of delay in criminal cases in state court systems. The Constitution gives the accused the right to a speedy trial. The Center for the State Courts was concerned that in many states, the trials were anything but speedy. It funded a study of the causes of the delay. I visited many states and decided to do the analysis in Colorado, which had an excellent computerized court data system. A wealth of information was extracted and processed.

The dependent variable for each criminal case was the time from arraignment to the time of sentencing. All of the other information in the trial history were the predictor variables. A large decision tree was grown, and I showed it on an overhead and explained it to the assembled Colorado judges. One of the splits was on District N which had a larger delay time than the other districts. I refrained from commenting on this. But as I walked out I heard one judge say to another, "I knew those guys in District N were dragging their feet."

While trees rate an A+ on interpretability, they are good, but not great, predictors. Give them, say, a B on prediction.
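The A+ for interpretability is easy to see in code: a fitted tree is just a short list of readable splits. The sketch below uses scikit-learn's breast cancer data purely as a stand-in for the court-delay data, which is not publicly available.

```python
# A sketch of why trees score so high on interpretability: the fitted model is
# a short list of readable if-then splits. The breast cancer data set is only
# a stand-in for the court-delay data described above.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# Every root-to-leaf path reads as a rule a non-statistician can follow.
print(export_text(tree, feature_names=list(data.feature_names)))
```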

9.1 Growing Forests for Prediction

Instead of a single tree predictor, grow a forest of trees on the same data—say 50 or 100. If we are classifying, put the new x down each tree in the forest and get a vote for the predicted class. Let the forest prediction be the class that gets the most votes. There has been a lot of work in the last five years on ways to grow the forest. All of the well-known methods grow the forest by perturbing the training set, growing a tree on the perturbed training set, perturbing the training set again, growing another tree, etc. Some familiar methods are bagging (Breiman, 1996b), boosting (Freund and Schapire, 1996), arcing (Breiman, 1998), and additive logistic regression (Friedman, Hastie and Tibshirani, 1998).

My preferred method to date is random forests. In this approach successive decision trees are grown by introducing a random element into their construction. For example, suppose there are 20 predictor variables. At each node choose several of the 20 at random to use to split the node. Or use a random combination of a random selection of a few variables. This idea appears in Ho (1998), in Amit and Geman (1997) and is developed in Breiman (1999).
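The recipe just described can be made explicit in a short loop. This is a minimal sketch on synthetic data, not the reference implementation: scikit-learn's RandomForestClassifier packages the same ingredients, and the loop below only makes the bootstrap, the per-node random variable selection, and the vote visible.

```python
# A minimal sketch of the forest-growing recipe in Section 9.1: many trees,
# each grown on a bootstrap sample with a random subset of variables
# considered at every node, combined by a plurality vote. Synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=20, n_informative=10,
                           random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

rng = np.random.default_rng(2)
votes = []
for b in range(100):                                    # grow 100 trees
    boot = rng.integers(0, len(X_tr), size=len(X_tr))   # bootstrap the training set
    tree = DecisionTreeClassifier(max_features="sqrt",  # random variables per node
                                  random_state=b)
    tree.fit(X_tr[boot], y_tr[boot])
    votes.append(tree.predict(X_te))

forest_pred = (np.mean(votes, axis=0) > 0.5).astype(int)   # majority vote (2 classes)
single_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

print("single tree test error:", round(float(np.mean(single_tree.predict(X_te) != y_te)), 3))
print("forest test error:     ", round(float(np.mean(forest_pred != y_te)), 3))
```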
9.2 Forests Compared to Trees

We compare the performance of single trees (CART) to random forests on a number of small and large data sets, mostly from the UCI repository (ftp.ics.uci.edu/pub/MachineLearningDatabases). A summary of the data sets is given in Table 1.

Table 1. Data set descriptions

Data set      Training sample size   Test sample size   Variables   Classes
Cancer                 699                  —                9          2
Ionosphere             351                  —               34          2
Diabetes               768                  —                8          2
Glass                  214                  —                9          6
Soybean                683                  —               35         19
---------------------------------------------------------------------------
Letters             15,000               5,000              16         26
Satellite            4,435               2,000              36          6
Shuttle             43,500              14,500               9          7
DNA                  2,000               1,186              60          3
Digit                7,291               2,007             256         10

Table 2 compares the test set error of a single tree to that of the forest. For the five smaller data sets above the line, the test set error was estimated by leaving out a random 10% of the data, then running CART and the forest on the other 90%. The left-out 10% was run down the tree and the forest and the error on this 10% computed for both. This was repeated 100 times and the errors averaged. The larger data sets below the line came with a separate test set. People who have been in the classification field for a while find these increases in accuracy startling. Some errors are halved. Others are reduced by one-third. In regression, where the forest prediction is the average over the individual tree predictions, the decreases in mean-squared test set error are similar.

Table 2. Test set misclassification error (%)

Data set           Forest   Single tree
Breast cancer        2.9        5.9
Ionosphere           5.5       11.2
Diabetes            24.2       25.3
Glass               22.0       30.4
Soybean              5.7        8.6
----------------------------------------
Letters              3.4       12.4
Satellite            8.6       14.8
Shuttle (×10⁻³)      7.0       62.0
DNA                  3.9        6.2
Digit                6.2       17.1

9.3 Random Forests are A+ Predictors

The Statlog Project (Michie, Spiegelhalter and Taylor, 1994) compared 18 different classifiers. Included were neural nets, CART, linear and quadratic discriminant analysis, nearest neighbor, etc. The first four data sets below the line in Table 1 were the only ones used in the Statlog Project that came with separate test sets. In terms of rank of accuracy on these four data sets, the forest comes in 1, 1, 1, 1 for an average rank of 1.0. The next best classifier had an average rank of 7.3.

The fifth data set below the line consists of 16×16 pixel grayscale depictions of handwritten ZIP Code numerals. It has been extensively used by AT&T Bell Labs to test a variety of prediction methods. A neural net handcrafted to the data got a test set error of 5.1% vs. 6.2% for a standard run of random forest.

9.4 The Occam Dilemma

So forests are A+ predictors. But their mechanism for producing a prediction is difficult to understand. Trying to delve into the tangled web that generated a plurality vote from 100 trees is a Herculean task. So on interpretability, they rate an F. Which brings us to the Occam dilemma:

• Accuracy generally requires more complex prediction methods. Simple and interpretable functions do not make the most accurate predictors.

Using complex predictors may be unpleasant, but the soundest path is to go for predictive accuracy first, then try to understand why. In fact, Section 10 points out that from a goal-oriented statistical viewpoint, there is no Occam's dilemma. (For more on Occam's Razor see Domingos, 1998, 1999.)

10. BELLMAN AND THE CURSE OF DIMENSIONALITY

The title of this section refers to Richard Bellman's famous phrase, "the curse of dimensionality." For decades, the first step in prediction methodology was to avoid the curse. If there were too many prediction variables, the recipe was to find a few features (functions of the predictor variables) that "contain most of the information" and then use these features to replace the original variables. In procedures common in statistics such as regression, logistic regression and survival models the advised practice is to use variable deletion to reduce the dimensionality. The published advice was that high dimensionality is dangerous. For instance, a well-regarded book on pattern recognition (Meisel, 1972) states "the features must be relatively few in number." But recent work has shown that dimensionality can be a blessing.

10.1 Digging It Out in Small Pieces

Reducing dimensionality reduces the amount of information available for prediction. The more predictor variables, the more information. There is also information in various combinations of the predictor variables. Let's try going in the opposite direction:

• Instead of reducing dimensionality, increase it by adding many functions of the predictor variables.

There may now be thousands of features. Each potentially contains a small amount of information. The problem is how to extract and put together these little pieces of information. There are two outstanding examples of work in this direction, the Shape Recognition Forest (Y. Amit and D. Geman, 1997) and Support Vector Machines (V. Vapnik, 1995, 1998).

10.2 The Shape Recognition Forest

In 1992, the National Institute of Standards and Technology (NIST) set up a competition for machine algorithms to read handwritten numerals. They put together a large set of pixel pictures of handwritten numbers (223,000) written by over 2,000 individuals. The competition attracted wide interest, and diverse approaches were tried.

The Amit–Geman approach defined many thousands of small geometric features in a hierarchical assembly. Shallow trees are grown, such that at each node, 100 features are chosen at random from the appropriate level of the hierarchy; and the optimal split of the node based on the selected features is found.

When a pixel picture of a number is dropped down a single tree, the terminal node it lands in gives probability estimates p0, ..., p9 that it represents numbers 0, 1, ..., 9. Over 1,000 trees are grown, the estimates averaged over this forest, and the predicted number is assigned to the largest averaged estimate.

Using a 100,000 example training set and a 50,000 test set, the Amit–Geman method gives a test set error of 0.7%—close to the limits of human error.

10.3 Support Vector Machines

Suppose there is two-class data having prediction vectors in M-dimensional Euclidean space. The prediction vectors for class #1 are {x(1)} and those for class #2 are {x(2)}. If these two sets of vectors can be separated by a hyperplane then there is an optimal separating hyperplane. "Optimal" is defined as meaning that the distance of the hyperplane to any prediction vector is maximal (see below).

The set of vectors in {x(1)} and in {x(2)} that achieve the minimum distance to the optimal separating hyperplane are called the support vectors. Their coordinates determine the equation of the hyperplane. Vapnik (1995) showed that if a separating hyperplane exists, then the optimal separating hyperplane has low generalization error (see Glossary).

[Figure: the optimal separating hyperplane and its support vectors.]

In two-class data, separability by a hyperplane does not often occur. However, let us increase the dimensionality by adding as additional predictor variables all quadratic monomials in the original predictor variables; that is, all terms of the form x_{m1} x_{m2}. A hyperplane in the original variables plus quadratic monomials in the original variables is a more complex creature. The possibility of separation is greater. If no separation occurs, add cubic monomials as input features. If there are originally 30 predictor variables, then there are about 40,000 features if monomials up to the fourth degree are added.

The higher the dimensionality of the set of features, the more likely it is that separation occurs. In the ZIP Code data set, separation occurs with fourth degree monomials added. The test set error is 4.1%. Using a large subset of the NIST data base as a training set, separation also occurred after adding up to fourth degree monomials and gave a test set error rate of 1.1%.

Separation can always be had by raising the dimensionality high enough. But if the separating hyperplane becomes too complex, the generalization error becomes large. An elegant theorem (Vapnik, 1995) gives this bound for the expected generalization error:

E(GE) ≤ E(number of support vectors)/(N − 1),

where N is the sample size and the expectation is over all training sets of size N drawn from the same underlying distribution as the original training set.

The number of support vectors increases with the dimensionality of the feature space. If this number becomes too large, the separating hyperplane will not give low generalization error. If separation cannot be realized with a relatively small number of support vectors, there is another version of support vector machines that defines optimality by adding a penalty term for the vectors on the wrong side of the hyperplane.

Some ingenious algorithms make finding the optimal separating hyperplane computationally feasible. These devices reduce the search to a solution of a quadratic programming problem with linear inequality constraints that are of the order of the number N of cases, independent of the dimension of the feature space. Methods tailored to this particular problem produce speed-ups of an order of magnitude over standard methods for solving quadratic programming problems.

Support vector machines can also be used to provide accurate predictions in other areas (e.g., regression). It is an exciting idea that gives excellent performance and is beginning to supplant the use of neural nets. A readable introduction is in Cristianini and Shawe-Taylor (2000).
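A compact way to see the Section 10.3 construction at work is to fit a polynomial-kernel support vector machine, which implicitly works with all monomials up to the chosen degree. The sketch below is illustrative only: synthetic two-dimensional data and scikit-learn's SVC stand in for the ZIP Code experiments, and the support vector count is reported because it enters Vapnik's bound E(GE) ≤ E(#support vectors)/(N − 1).

```python
# A sketch of the Section 10.3 idea on synthetic data: no hyperplane separates
# the two classes in the original coordinates, but raising the degree of the
# polynomial kernel (i.e., adding monomial features implicitly) does the job.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=600, noise=0.08, factor=0.5, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=3)

for degree in (1, 2, 4):
    # kernel="poly" works implicitly with all monomials up to the given degree.
    clf = SVC(kernel="poly", degree=degree, coef0=1.0, C=10.0).fit(X_tr, y_tr)
    err = 1.0 - clf.score(X_te, y_te)
    n_sv = int(clf.n_support_.sum())
    print(f"degree {degree}: test error {err:.3f}, "
          f"{n_sv} support vectors out of N = {len(X_tr)}")
```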

11. INFORMATION FROM A BLACK BOX

The dilemma posed in the last section is that the models that best emulate nature in terms of predictive accuracy are also the most complex and inscrutable. But this dilemma can be resolved by realizing the wrong question is being asked. Nature forms the outputs y from the inputs x by means of a black box with complex and unknown interior.

y ← nature ← x

Current accurate prediction methods are also complex black boxes.

y ← [neural nets, forests, support vectors] ← x

So we are facing two black boxes, where ours seems only slightly less inscrutable than nature's. In data generated by medical experiments, ensembles of predictors can give cross-validated error rates significantly lower than logistic regression. My biostatistician friends tell me, "Doctors can interpret logistic regression." There is no way they can interpret a black box containing fifty trees hooked together. In a choice between accuracy and interpretability, they'll go for interpretability.

Framing the question as the choice between accuracy and interpretability is an incorrect interpretation of what the goal of a statistical analysis is. The point of a model is to get useful information about the relation between the response and predictor variables. Interpretability is a way of getting information. But a model does not have to be simple to provide reliable information about the relation between predictor and response variables; neither does it have to be a data model.

• The goal is not interpretability, but accurate information.

The following three examples illustrate this point. The first shows that random forests applied to a medical data set can give more reliable information about covariate strengths than logistic regression. The second shows that it can give interesting information that could not be revealed by a logistic regression. The third is an application to microarray data where it is difficult to conceive of a data model that would uncover similar information.

11.1 Example I: Variable Importance in a Survival Data Set

The data set contains survival or nonsurvival of 155 hepatitis patients with 19 covariates. It is available at ftp.ics.uci.edu/pub/MachineLearningDatabases and was contributed by Gail Gong. The description is in a file called hepatitis.names. The data set has been previously analyzed by Diaconis and Efron (1983), and Cestnik, Konenenko and Bratko (1987). The lowest reported error rate to date, 17%, is in the latter paper.

Diaconis and Efron refer to work by Peter Gregory of the Stanford Medical School who analyzed this data and concluded that the important variables were numbers 6, 12, 14, 19 and reports an estimated 20% predictive accuracy. The variables were reduced in two stages—the first was by informal data analysis. The second refers to a more formal (unspecified) statistical procedure which I assume was logistic regression.

Efron and Diaconis drew 500 bootstrap samples from the original data set and used a similar procedure to isolate the important variables in each bootstrapped data set. The authors comment, "Of the four variables originally selected not one was selected in more than 60 percent of the samples. Hence the variables identified in the original analysis cannot be taken too seriously." We will come back to this conclusion later.

Logistic Regression

The predictive error rate for logistic regression on the hepatitis data set is 17.4%. This was evaluated by doing 100 runs, each time leaving out a randomly selected 10% of the data as a test set, and then averaging over the test set errors.

Usually, the initial evaluation of which variables are important is based on examining the absolute values of the coefficients of the variables in the logistic regression divided by their standard deviations. Figure 1 is a plot of these values.

The conclusion from looking at the standardized coefficients is that variables 7 and 11 are the most important covariates. When logistic regression is run using only these two variables, the cross-validated error rate rises to 22.9%. Another way to find important variables is to run a best subsets search which, for any value k, finds the subset of k variables having lowest deviance.

This procedure raises the problems of instability and multiplicity of models (see Section 7.1). There are about 4,000 subsets containing four variables. Of these, there are almost certainly a substantial number that have deviance close to the minimum and give different pictures of what the underlying mechanism is.
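The bootstrap experiment of Efron and Diaconis is easy to imitate in outline. The sketch below is not their procedure (which is unspecified in the source) and does not use the hepatitis data; it uses a generic selection rule on synthetic data only to show how selection frequencies over bootstrap samples expose instability.

```python
# A sketch in the spirit of the Diaconis-Efron bootstrap experiment described
# above. The selection rule (keep the four variables with the largest absolute
# standardized coefficients) is a generic stand-in for the original unspecified
# procedure, and the data are synthetic rather than the hepatitis data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X, y = make_classification(n_samples=155, n_features=19, n_informative=6,
                           n_redundant=4, random_state=4)
X = StandardScaler().fit_transform(X)

counts = np.zeros(X.shape[1])
for _ in range(500):                      # 500 bootstrap samples, as in the paper
    idx = rng.integers(0, len(X), size=len(X))
    coef = LogisticRegression(max_iter=1000).fit(X[idx], y[idx]).coef_[0]
    counts[np.argsort(-np.abs(coef))[:4]] += 1

freq = counts / 500
print("selection frequency per variable:")
print(np.round(freq, 2))
print("variables chosen in >60% of samples:", np.flatnonzero(freq > 0.6))
```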

Fig. 1. Standardized coefficients, logistic regression.

Fig. 2. Variable importance—random forest (percent increase in error).

Random Forests

The random forests predictive error rate, evaluated by averaging errors over 100 runs, each time leaving out 10% of the data as a test set, is 12.3%—almost a 30% reduction from the logistic regression error.

Random forests consists of a large number of randomly constructed trees, each voting for a class. Similar to bagging (Breiman, 1996), a bootstrap sample of the training set is used to construct each tree. A random selection of the input variables is searched to find the best split for each node.

To measure the importance of the mth variable, the values of the mth variable are randomly permuted in all of the cases left out in the current bootstrap sample. Then these cases are run down the current tree and their classification noted. At the end of a run consisting of growing many trees, the percent increase in misclassification rate due to noising up each variable is computed. This is the measure of variable importance that is shown in Figure 2.

Random forests singles out two variables, the 12th and the 17th, as being important. As a verification both variables were run in random forests, individually and together. The test set error rates over 100 replications were 14.3% each. Running both together did no better. We conclude that virtually all of the predictive capability is provided by a single variable, either 12 or 17.

To explore the interaction between 12 and 17 a bit further, at the end of a random forest run using all variables, the output includes the estimated value of the probability of each class vs. the case number. This information is used to get plots of the variable values (normalized to mean zero and standard deviation one) vs. the probability of death. The variable values are smoothed using a weighted linear regression smoother. The results are in Figure 3 for variables 12 and 17.
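The permutation measure of importance just described can be sketched in a few lines. One simplification relative to the text: the values of each variable are permuted in a held-out test set rather than in the out-of-bag cases of each tree, and the data are synthetic stand-ins for the hepatitis data.

```python
# A sketch of the permutation measure of variable importance described above,
# simplified to permute each variable in a held-out test set instead of in the
# out-of-bag cases of every tree. Synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=19, n_informative=5,
                           random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=5)

forest = RandomForestClassifier(n_estimators=300, random_state=5).fit(X_tr, y_tr)
base_err = 1.0 - forest.score(X_te, y_te)

rng = np.random.default_rng(5)
for m in range(X.shape[1]):
    X_noised = X_te.copy()
    X_noised[:, m] = rng.permutation(X_noised[:, m])   # "noise up" variable m
    increase = (1.0 - forest.score(X_noised, y_te)) - base_err
    print(f"variable {m:2d}: error increase {100 * increase:5.2f}%")
```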

Fig. 3. Variables 12 and 17 vs. probability #1.

Fig. 4. Variable importance—Bupa data.

The graphs of the variable values vs. class death probability are almost linear and similar. The two variables turn out to be highly correlated. Thinking that this might have affected the logistic regression results, it was run again with one or the other of these two variables deleted. There was little change.

Out of curiosity, I evaluated variable importance in logistic regression in the same way that I did in random forests, by permuting variable values in the 10% test set and computing how much that increased the test set error. Not much help—variables 12 and 17 were not among the 3 variables ranked as most important. In partial verification of the importance of 12 and 17, I tried them separately as single variables in logistic regression. Variable 12 gave a 15.7% error rate, variable 17 came in at 19.3%.

To go back to the original Diaconis–Efron analysis, the problem is clear. Variables 12 and 17 are surrogates for each other. If one of them appears important in a model built on a bootstrap sample, the other does not. So each one's frequency of occurrence is automatically less than 50%. The paper lists the variables selected in ten of the samples. Either 12 or 17 appear in seven of the ten.

11.2 Example II: Clustering in Medical Data

The Bupa liver data set is a two-class biomedical data set also available at ftp.ics.uci.edu/pub/MachineLearningDatabases. The covariates are:

1. mcv: mean corpuscular volume
2. alkphos: alkaline phosphatase
3. sgpt: alanine aminotransferase
4. sgot: aspartate aminotransferase
5. gammagt: gamma-glutamyl transpeptidase
6. drinks: half-pint equivalents of alcoholic beverage drunk per day

The first five attributes are the results of blood tests thought to be related to liver functioning. The 345 patients are classified into two classes by the severity of their liver malfunctioning. Class two is severe malfunctioning. In a random forests run, the misclassification error rate is 28%. The variable importance given by random forests is in Figure 4.

Blood tests 3 and 5 are the most important, followed by test 4. Random forests also outputs an intrinsic similarity measure which can be used to cluster. When this was applied, two clusters were discovered in class two. The average of each variable is computed and plotted in each of these clusters in Figure 5.
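One standard way to build such an intrinsic similarity from a fitted forest is the proximity: the fraction of trees in which two cases land in the same terminal node. scikit-learn does not expose this directly, so the sketch below reconstructs it from the leaf indices returned by apply() and then clusters one class with it; the data are a synthetic stand-in for the Bupa data, not the analysis reported above.

```python
# A sketch of proximity-based clustering with a random forest: proximity of
# two cases = fraction of trees in which they share a terminal node. Synthetic
# stand-in data with six covariates, mimicking the shape of the Bupa data.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=345, n_features=6, n_informative=4,
                           n_redundant=0, random_state=6)

forest = RandomForestClassifier(n_estimators=300, random_state=6).fit(X, y)
leaves = forest.apply(X)                                  # (n_samples, n_trees)
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# Cluster the class-two cases, using 1 - proximity as the dissimilarity.
idx = np.flatnonzero(y == 1)
dist = squareform(1.0 - prox[np.ix_(idx, idx)], checks=False)
labels = fcluster(linkage(dist, method="average"), t=2, criterion="maxclust")

for k in (1, 2):
    members = idx[labels == k]
    print(f"cluster {k}: {len(members)} cases, variable means "
          f"{np.round(X[members].mean(axis=0), 2)}")
```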

The variable importance given by random forests is in Figure 4. Blood tests 3 and 5 are the most important, followed by test 4. Random forests also outputs an intrinsic similarity measure which can be used to cluster. When this was applied, two clusters were discovered in class two. The average of each variable is computed and plotted in each of these clusters in Figure 5.

Fig. 5. Cluster averages—Bupa data.

An interesting facet emerges. The class two subjects consist of two distinct groups: those that have high scores on blood tests 3, 4, and 5 and those that have low scores on those tests.

11.3 Example III: Microarray Data

Random forests was run on a microarray lymphoma data set with three classes, sample size of 81 and 4,682 variables (genes) without any variable selection [for more information about this data set, see Dudoit, Fridlyand and Speed (2000)]. The error rate was low. What was also interesting from a scientific viewpoint was an estimate of the importance of each of the 4,682 gene expressions.

The graph in Figure 6 was produced by a run of random forests. This result is consistent with assessments of variable importance made using other algorithmic methods, but appears to have sharper detail.

11.4 Remarks about the Examples

The examples show that much information is available from an algorithmic model. Friedman (1999) derives similar variable information from a different way of constructing a forest. The similarity is that they are both built as ways to give low predictive error.

There are 32 deaths and 123 survivors in the hepatitis data set. Calling everyone a survivor gives a baseline error rate of 20.6%. Logistic regression lowers this to 17.4%. It is not extracting much useful information from the data, which may explain its inability to find the important variables. Its weakness might have been unknown and the variable importances accepted at face value if its predictive accuracy was not evaluated.

Random forests is also capable of discovering important aspects of the data that standard data models cannot uncover. The potentially interesting clustering of class two patients in Example II is an illustration. The standard procedure when fitting data models such as logistic regression is to delete variables; to quote from Diaconis and Efron (1983) again, "statistical experience suggests that it is unwise to fit a model that depends on 19 variables with only 155 data points available." Newer methods in machine learning thrive on variables—the more the better. For instance, random forests does not overfit. It gives excellent accuracy on the lymphoma data set of Example III which has over 4,600 variables, with no variable deletion and is capable of extracting variable importance information from the data.
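The intrinsic similarity measure mentioned in Example II can be approximated from any fitted forest by recording how often two cases land in the same terminal node. The following is a rough sketch only, assuming the Bupa covariates and class labels are in numpy arrays X and y (with class two coded as 2), and using scikit-learn and SciPy in place of Breiman's own proximity code; it is practical for a few hundred cases.

    # Proximity of two cases = fraction of trees in which they share a leaf.
    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform
    from sklearn.ensemble import RandomForestClassifier

    forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
    leaves = forest.apply(X)                               # (n_samples, n_trees) leaf ids
    prox = np.mean(leaves[:, None, :] == leaves[None, :, :], axis=2)

    # Cluster only the class-two cases, using 1 - proximity as a dissimilarity.
    idx = np.where(y == 2)[0]
    dist = 1.0 - prox[np.ix_(idx, idx)]
    np.fill_diagonal(dist, 0.0)
    labels = fcluster(linkage(squareform(dist), method="average"), t=2, criterion="maxclust")

    # Compare cluster averages of each covariate (the quantity plotted in Figure 5).
    for k in (1, 2):
        print(k, X[idx[labels == k]].mean(axis=0).round(2))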

Fig. 6. Microarray variable importance.

These examples illustrate the following points:

• Higher predictive accuracy is associated with more reliable information about the underlying data mechanism. Weak predictive accuracy can lead to questionable conclusions.
• Algorithmic models can give better predictive accuracy than data models, and provide better information about the underlying mechanism.

12. FINAL REMARKS

The goals in statistics are to use data to predict and to get information about the underlying data mechanism. Nowhere is it written on a stone tablet what kind of model should be used to solve problems involving data. To make my position clear, I am not against data models per se. In some situations they are the most appropriate way to solve the problem. But the emphasis needs to be on the problem and on the data.

Unfortunately, our field has a vested interest in data models, come hell or high water. For instance, see Dempster's (1998) paper on modeling. His position on the 1990 Census adjustment controversy is particularly interesting. He admits that he doesn't know much about the data or the details, but argues that the problem can be solved by a strong dose of modeling. That more modeling can make error-ridden data accurate seems highly unlikely to me.

Terabytes of data are pouring into computers from many sources, both scientific and commercial, and there is a need to analyze and understand the data. For instance, data is being generated at an awesome rate by telescopes and radio telescopes scanning the skies. Images containing millions of stellar objects are stored on tape or disk. Astronomers need automated ways to scan their data to find certain types of stellar objects or novel objects. This is a fascinating enterprise, and I doubt if data models are applicable. Yet I would enter this in my ledger as a statistical problem.

The analysis of genetic data is one of the most challenging and interesting statistical problems around. Microarray data, like that analyzed in Section 11.3, can lead to significant advances in understanding genetic effects. But the analysis of variable importance in Section 11.3 would be difficult to do accurately using a stochastic data model.

Problems such as stellar recognition or analysis of gene expression data could be high adventure for statisticians. But it requires that they focus on solving the problem instead of asking what data model they can create. The best solution could be an algorithmic model, or maybe a data model, or maybe a combination. But the trick to being a scientist is to be open to using a wide variety of tools.

The roots of statistics, as in science, lie in working with data and checking theory against data. I hope in this century our field will return to its roots. There are signs that this hope is not illusory. Over the last ten years, there has been a noticeable move toward statistical work on real world problems and reaching out by statisticians toward collaborative work with other disciplines. I believe this trend will continue and, in fact, has to continue if we are to survive as an energetic and creative field.

GLOSSARY

Since some of the terms used in this paper may not be familiar to all statisticians, I append some definitions.

Infinite test set error. Assume a loss function L(y, ŷ) that is a measure of the error when y is the true response and ŷ the predicted response. In classification, the usual loss is 1 if ŷ ≠ y and zero if ŷ = y. In regression, the usual loss is (y − ŷ)². Given a set of data (training set) consisting of (y_n, x_n), n = 1, 2, ..., N, use it to construct a predictor function φ(x) of y. Assume that the training set is i.i.d. drawn from the distribution of the random vector (Y, X). The infinite test set error is E[L(Y, φ(X))]. This is called the generalization error in machine learning. The generalization error is estimated either by setting aside a part of the data as a test set or by cross-validation.

Predictive accuracy. This refers to the size of the estimated generalization error. Good predictive accuracy means low estimated error.

Trees and nodes. This terminology refers to decision trees as described in the Breiman et al. (1984) book.

Dropping an x down a tree. When a vector of predictor variables is "dropped" down a tree, at each intermediate node it has instructions whether to go left or right depending on the coordinates of x. It stops at a terminal node and is assigned the prediction given by that node.

Bagging. An acronym for "bootstrap aggregating." Start with an algorithm such that, given any training set, the algorithm produces a prediction function φ(x). The algorithm can be a decision tree construction, logistic regression with variable deletion, etc. Take a bootstrap sample from the training set and use this bootstrap training set to construct the predictor φ_1(x). Take another bootstrap sample and using this second training set construct the predictor φ_2(x). Continue this way for K steps. In regression, average all of the φ_k(x) to get the bagged predictor at x. In classification, that class which has the plurality vote of the φ_k(x) is the bagged predictor. Bagging has been shown effective in variance reduction (Breiman, 1996b).
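A minimal sketch of the bagging recipe just defined, with a regression tree as the base algorithm φ; X, y and x_new are assumed to be numpy arrays, and scikit-learn's tree is a stand-in for any base predictor.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def bagged_predictor(X, y, x_new, K=100, seed=0):
        rng = np.random.default_rng(seed)
        predictions = []
        for _ in range(K):
            idx = rng.integers(0, len(y), size=len(y))   # bootstrap sample of the training set
            phi_k = DecisionTreeRegressor().fit(X[idx], y[idx])
            predictions.append(phi_k.predict(x_new))     # phi_k evaluated at the new points
        return np.mean(predictions, axis=0)              # in regression, average the phi_k(x)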

Boosting. This is a more complex way of forming an ensemble of predictors in classification than bagging (Freund and Schapire, 1996). It uses no randomization but proceeds by altering the weights on the training set. Its performance in terms of low prediction error is excellent (for details see Breiman, 1998).

ACKNOWLEDGMENTS

Many of my ideas about data modeling were formed in three decades of conversations with my old friend and collaborator, Jerome Friedman. Conversations with Richard Olshen about the Cox model and its use in biostatistics helped me to understand the background. I am also indebted to William Meisel, who headed some of the prediction projects I consulted on and helped me make the transition from probability theory to algorithms, and to Charles Stone for illuminating conversations about the nature of statistics and science. I'm grateful also for the comments of the editor, Leon Gleser, which prompted a major rewrite of the first draft of this manuscript and resulted in a different and better paper.

REFERENCES

Amit, Y. and Geman, D. (1997). Shape quantization and recognition with randomized trees. Neural Computation 9 1545–1588.
Arena, C., Sussman, N., Chiang, K., Mazumdar, S., Macina, O. and Li, W. (2000). Bagging structure-activity relationships: a simulation study for assessing misclassification rates. Presented at the Second Indo-U.S. Workshop on Mathematical Chemistry, Duluth, MN. (Available at [email protected]).
Bickel, P., Ritov, Y. and Stoker, T. (2001). Tailor-made tests for goodness of fit for semiparametric hypotheses. Unpublished manuscript.
Breiman, L. (1996a). The heuristics of instability in model selection. Ann. Statist. 24 2350–2381.
Breiman, L. (1996b). Bagging predictors. Machine Learning J. 26 123–140.
Breiman, L. (1998). Arcing classifiers. Discussion paper, Ann. Statist. 26 801–824.
Breiman, L. (2000). Some infinity theory for tree ensembles. (Available at www.stat.berkeley.edu/technical reports).
Breiman, L. (2001). Random forests. Machine Learning J. 45 5–32.
Breiman, L. and Friedman, J. (1985). Estimating optimal transformations in multiple regression and correlation. J. Amer. Statist. Assoc. 80 580–619.
Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.
Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge Univ. Press.
Daniel, C. and Wood, F. (1971). Fitting Equations to Data. Wiley, New York.
Dempster, A. (1998). Logicist statistics I. Models and modeling. Statist. Sci. 13 248–276.
Diaconis, P. and Efron, B. (1983). Computer intensive methods in statistics. Scientific American 248 116–131.
Domingos, P. (1998). Occam's two razors: the sharp and the blunt. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (R. Agrawal and P. Stolorz, eds.) 37–43. AAAI Press, Menlo Park, CA.
Domingos, P. (1999). The role of Occam's razor in knowledge discovery. Data Mining and Knowledge Discovery 3 409–425.
Dudoit, S., Fridlyand, J. and Speed, T. (2000). Comparison of discrimination methods for the classification of tumors. (Available at www.stat.berkeley.edu/technical reports).
Freedman, D. (1987). As others see us: a case study in path analysis (with discussion). J. Ed. Statist. 12 101–223.
Freedman, D. (1991). Statistical models and shoe leather. Sociological Methodology 1991 (with discussion) 291–358.
Freedman, D. (1991). Some issues in the foundations of statistics. Foundations of Science 1 19–83.
Freedman, D. (1994). From association to causation via regression. Adv. in Appl. Math. 18 59–110.
Freund, Y. and Schapire, R. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference 148–156. Morgan Kaufmann, San Francisco.
Friedman, J. (1999). Greedy predictive approximation: a gradient boosting machine. Technical report, Dept. Statistics, Stanford Univ.
Friedman, J., Hastie, T. and Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. Ann. Statist. 28 337–407.
Gifi, A. (1990). Nonlinear Multivariate Analysis. Wiley, New York.
Ho, T. K. (1998). The random subspace method for constructing decision forests. IEEE Trans. Pattern Analysis and Machine Intelligence 20 832–844.
Landwehr, J., Pregibon, D. and Shoemaker, A. (1984). Graphical methods for assessing logistic regression models (with discussion). J. Amer. Statist. Assoc. 79 61–83.
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Chapman and Hall, London.
Meisel, W. (1972). Computer-Oriented Approaches to Pattern Recognition. Academic Press, New York.
Michie, D., Spiegelhalter, D. and Taylor, C. (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood, New York.
Mosteller, F. and Tukey, J. (1977). Data Analysis and Regression. Addison-Wesley, Reading, MA.
Mountain, D. and Hsiao, C. (1989). A combined structural and flexible functional approach for modeling energy substitution. J. Amer. Statist. Assoc. 84 76–87.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. B 36 111–147.
Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer, New York.
Vapnik, V. (1998). Statistical Learning Theory. Wiley, New York.
Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.
Zhang, H. and Singer, B. (1999). Recursive Partitioning in the Health Sciences. Springer, New York.

Comment

D. R. Cox

Professor Breiman’s interesting paper gives both keypoints concern the precise meaning of the data, a clear statement of the broad approach underly- the possible biases arising from the method of ascer- ing some of his influential and widelyadmired con- tainment, the possible presence of major distorting tributions and outlines some striking applications measurement errors and the nature of processes and developments. He has combined this with a cri- underlying missing and incomplete data and data tique of what, for want of a better term, I will call that evolve in time in a wayinvolving complex inter- mainstream statistical thinking, based in part on dependencies. For some of these, at least, it is hard a caricature. Like all good caricatures, it contains to see how to proceed without some notion of prob- enough truth and exposes enough weaknesses to be abilistic modeling. thought-provoking. Next Professor Breiman emphasizes prediction There is not enough space to comment on all the as the objective, success at prediction being the manypoints explicitlyor implicitlyraised in the criterion of success, as contrasted with issues paper. There follow some remarks about a few main of interpretation or understanding. Prediction is issues. indeed important from several perspectives. The One of the attractions of our subject is the aston- success of a theoryis best judged from its abilityto ishinglywide range of applications as judged not predict in new contexts, although one cannot dis- onlyin terms of substantive field but also in terms miss as totallyuseless theories such as the rational of objectives, qualityand quantityof data and action theory(RAT), in political science, which, as so on. Thus anyunqualified statement that “in I understand it, gives excellent explanations of the past but which has failed to predict the real politi- applications” has to be treated sceptically. One cal world. In a clinical trial context it can be argued of our failings has, I believe, been, in a wish to that an objective is to predict the consequences of stress generality, not to set out more clearly the treatment allocation to future patients, and so on. distinctions between different kinds of application If the prediction is localized to situations directly and the consequences for the strategyof statistical similar to those applying to the data there is then analysis. Of course we have distinctions between an interesting and challenging dilemma. Is it prefer- decision-making and inference, between tests and able to proceed with a directlyempirical black-box estimation, and between estimation and predic- approach, as favored byProfessor Breiman, or is tion and these are useful but, I think, are, except it better to tryto take account of some underly- perhaps the first, too phrased in terms of the tech- ing explanatoryprocess? The answer must depend nologyrather than the spirit of statistical analysis. on the context but I certainlyaccept, although it I entirelyagree with Professor Breiman that it goes somewhat against the grain to do so, that would be an impoverished and extremelyunhis- there are situations where a directlyempirical torical view of the subject to exclude the kind of approach is better. Short term economic forecasting work he describes simplybecause it has no explicit and real-time flood forecasting are probablyfurther probabilistic base. exemplars. Keyissues are then the stabilityof the Professor Breiman takes data as his starting predictor as practical prediction proceeds, the need point. 
I would prefer to start with an issue, a ques- from time to time for recalibration and so on. tion or a scientific hypothesis, although I would However, much prediction is not like this. Often be surprised if this were a real source of disagree- the prediction is under quite different conditions ment. These issues mayevolve, or even change from the data; what is the likelyprogress of the radically, as analysis proceeds. Data looking for incidence of the epidemic of v-CJD in the United a question are not unknown and raise puzzles Kingdom, what would be the effect on annual inci- but are, I believe, atypical in most contexts. Next, dence of cancer in the United States of reducing by even if we ignore design aspects and start with data, 10% the medical use of X-rays, etc.? That is, it may be desired to predict the consequences of something onlyindirectlyaddressed bythe data available for D. R. Cox is an Honorary Fellow, Nuffield College, analysis. As we move toward such more ambitious Oxford OX1 1NF, United Kingdom, and associate tasks, prediction, always hazardous, without some member, Department of Statistics, University of understanding of underlying process and linking Oxford (e-mail: david.cox@nuffield.oxford.ac.uk). with other sources of information, becomes more STATISTICAL MODELING: THE TWO CULTURES 217 and more tentative. Formulation of the goals of process involving somehow or other choosing a analysis solely in terms of direct prediction over the model, often a default model of standard form, data set seems then increasinglyunhelpful. and applying standard methods of analysis and This is quite apart from matters where the direct goodness-of-fit procedures. Thus for survival data objective is understanding of and tests of subject- choose a priori the proportional hazards model. matter hypotheses about underlying process, the (Note, incidentally, that in the paper, often quoted nature of pathways of dependence and so on. but probablyrarelyread, that introduced this What is the central strategyof mainstream sta- approach there was a comparison of several of the tistical analysis? This can most certainly not be dis- manydifferent models that might be suitable for cerned from the pages of Bernoulli, The Annals of this kind of data.) It is true that manyof the anal- Statistics or the Scandanavian Journal of Statistics ysesdone bynonstatisticians or bystatisticians nor from Biometrika and the Journal of Royal Sta- under severe time constraints are more or less like tistical Society, Series B or even from the application those Professor Breiman describes. The issue then pages of Journal of the American Statistical Associa- is not whether theycould ideallybe improved, but tion or Applied Statistics, estimable though all these whether theycapture enough of the essence of the journals are. Of course as we move along the list, information in the data, together with some rea- there is an increase from zero to 100% in the papers sonable indication of precision as a guard against containing analyses of “real” data. But the papers under or overinterpretation. Would more refined do so nearlyalwaysto illustrate technique rather analysis, possibly with better predictive power and than to explain the process of analysis and inter- better fit, produce subject-matter gains? There can pretation as such. 
This is entirelylegitimate, but be no general answer to this, but one suspects that is completelydifferent from live analysisof current quite often the limitations of conclusions lie more data to obtain subject-matter conclusions or to help in weakness of data qualityand studydesign than solve specific practical issues. Put differently, if an in ineffective analysis. important conclusion is reached involving statisti- There are two broad lines of development active cal analysis it will be reported in a subject-matter at the moment arising out of mainstream statistical journal or in a written or verbal report to colleagues, ideas. The first is the invention of models strongly government or business. When that happens, statis- tied to subject-matter considerations, represent- tical details are typicallyandcorrectlynot stressed. ing underlying dependencies, and their analysis, Thus the real procedures of statistical analysis can perhaps byMarkov chain Monte Carlo methods. be judged onlybylooking in detail at specific cases, In fields where subject-matter considerations are and access to these is not always easy. Failure to largelyqualitative, we see a development based on discuss enough the principles involved is a major Markov graphs and their generalizations. These criticism of the current state of theory. methods in effect assume, subject in principle to I think tentativelythat the following quite com- empirical test, more and more about the phenom- monlyapplies. Formal models are useful and often ena under study. By contrast, there is an emphasis almost, if not quite, essential for incisive thinking. on assuming less and less via, for example, kernel Descriptivelyappealing and transparent methods estimates of regression functions, generalized addi- with a firm model base are the ideal. Notions of tive models and so on. There is a need to be clearer significance tests, confidence intervals, posterior about the circumstances favoring these two broad intervals and all the formal apparatus of inference approaches, synthesizing them where possible. are valuable tools to be used as guides, but not in a Myown interest tends to be in the former style mechanical way;theyindicate the uncertaintythat of work. From this perspective Cox and Wermuth would applyunder somewhat idealized, maybe (1996, page 15) listed a number of requirements of a veryidealized, conditions and as such are often statistical model. These are to establish a link with lower bounds to real uncertainty. Analyses and background knowledge and to set up a connection model development are at least partlyexploratory. with previous work, to give some pointer toward Automatic methods of model selection (and of vari- a generating process, to have primaryparameters able selection in regression-like problems) are to be with individual clear subject-matter interpretations, shunned or, if use is absolutelyunavoidable, are to to specifyhaphazard aspects well enough to lead be examined carefullyfor their effect on the final to meaningful assessment of precision and, finally, conclusions. Unfocused tests of model adequacyare that the fit should be adequate. From this perspec- rarelyhelpful. tive, fit, which is broadlyrelated to predictive suc- Bycontrast, Professor Breiman equates main- cess, is not the primarybasis for model choice and stream applied statistics to a relativelymechanical formal methods of model choice that take no account 218 L. 
BREIMAN of the broader objectives are suspect in the present the analysis, an important and indeed fascinating context. In a sense these are efforts to establish data question, but a secondarystep. Better a rough descriptions that are potentiallycausal, recognizing answer to the right question than an exact answer that causality, in the sense that a natural scientist to the wrong question, an aphorism, due perhaps to would use the term, can rarelybe established from Lord Kelvin, that I heard as an undergraduate in one type of study and is at best somewhat tentative. applied mathematics. Professor Breiman takes a rather defeatist atti- I have stayed away from the detail of the paper tude toward attempts to formulate underlying but will comment on just one point, the interesting processes; is this not to reject the base of much sci- theorem of Vapnik about complete separation. This entific progress? The interesting illustrations given confirms folklore experience with empirical logistic byBeveridge (1952), where hypothesizedprocesses regression that, with a largish number of explana- in various biological contexts led to important toryvariables, complete separation is quite likelyto progress, even though the hypotheses turned out in occur. It is interesting that in mainstream thinking the end to be quite false, illustrate the subtletyof this is, I think, regarded as insecure in that com- the matter. Especiallyin the social sciences, repre- sentations of underlying process have to be viewed plete separation is thought to be a priori unlikely with particular caution, but this does not make and the estimated separating plane unstable. Pre- them fruitless. sumablybootstrap and cross-validation ideas may The absolutelycrucial issue in serious main- give here a quite misleading illusion of stability. stream statistics is the choice of a model that Of course if the complete separator is subtle and will translate keysubject-matter questions into a stable Professor Breiman’s methods will emerge tri- form for analysis and interpretation. If a simple umphant and ultimatelyit is an empirical question standard model is adequate to answer the subject- in each application as to what happens. matter question, this is fine: there are severe It will be clear that while I disagree with the main hidden penalties for overelaboration. The statisti- thrust of Professor Breiman’s paper I found it stim- cal literature, however, concentrates on how to do ulating and interesting.

Comment

Brad Efron

At first glance Leo Breiman’s stimulating paper dominant interpretational methodologyin dozens of looks like an argument against parsimonyand sci- fields, but, as we sayin California these days, it entific insight, and in favor of black boxes with lots is power purchased at a price: the theoryrequires a of knobs to twiddle. At second glance it still looks modestlyhigh ratio of signal to noise, sample size to that way, but the paper is stimulating, and Leo has number of unknown parameters, to have much hope some important points to hammer home. At the risk of success. “Good experimental design” amounts to of distortion I will tryto restate one of those points, enforcing favorable conditions for unbiased estima- the most interesting one in myopinion, using less tion and testing, so that the statistician won’t find confrontational and more historical language. himself or herself facing 100 data points and 50 From the point of view of statistical development parameters. the twentieth centurymight be labeled “100 years Now it is the twenty-first century when, as the of unbiasedness.” Following Fisher’s lead, most of paper reminds us, we are being asked to face prob- our current statistical theoryand practice revolves lems that never heard of good experimental design. around unbiased or nearlyunbiased estimates (par- Sample sizes have swollen alarminglywhile goals ticularlyMLEs), and tests based on such estimates. grow less distinct (“find interesting data structure”). The power of this theoryhas made statistics the New algorithms have arisen to deal with new prob- lems, a healthysign it seems to me even if the inno- Brad Efron is Professor, Department of Statis- vators aren’t all professional statisticians. There are tics, Sequoia Hall, 390 Serra Mall, Stanford Uni- enough physicists to handle the physics case load, versity, Stanford, California 94305–4065 (e-mail: but there are fewer statisticians and more statistics [email protected]). problems, and we need all the help we can get. An STATISTICAL MODELING: THE TWO CULTURES 219 attractive feature of Leo’s paper is his openness to • The “prediction culture,” at least around Stan- new ideas whatever their source. ford, is a lot bigger than 2%, though its constituency The new algorithms often appear in the form of changes and most of us wouldn’t welcome being black boxes with enormous numbers of adjustable typecast. parameters (“knobs to twiddle”), sometimes more • Estimation and testing are a form of prediction: knobs than data points. These algorithms can be “In our sample of 20 patients drug A outperformed quite successful as Leo points outs, sometimes drug B; would this still be true if we went on to test more so than their classical counterparts. However, all possible patients?” unless the bias-variance trade-off has been sus- • Prediction byitself is onlyoccasionallysuffi- pended to encourage new statistical industries, their cient. The post office is happywith anymethod success must hinge on some form of biased estima- that predicts correct addresses from hand-written tion. The bias maybe introduced directlyas with the scrawls. Peter Gregoryundertook his studyfor pre- “regularization” of overparameterized linear mod- diction purposes, but also to better understand the els, more subtlyas in the pruning of overgrown medical basis of hepatitis. Most statistical surveys regression trees, or surreptitiouslyas with support have the identification of causal factors as their ulti- mate goal. 
vector machines, but it has to be lurking somewhere inside the theory. The hepatitis data was first analyzed by Gail Of course the trouble with biased estimation is Gong in her 1982 Ph.D. thesis, which concerned pre- that we have so little theoryto fall back upon. diction problems and bootstrap methods for improv- Fisher’s information bound, which tells us how well ing on cross-validation. (Cross-validation itself is an a (nearly) unbiased estimator can possibly perform, uncertain methodologythat deserves further crit- is of no help at all in dealing with heavilybiased ical scrutiny; see, for example, Efron and Tibshi- methodology. Numerical experimentation by itself, rani, 1996). The Scientific American discussion is unguided bytheory,is prone to faddish wandering: quite brief, a more thorough description appearing Rule 1. New methods always look better than old in Efron and Gong (1983). Variables 12 or 17 (13 ones. Neural nets are better than logistic regres- or 18 in Efron and Gong’s numbering) appeared sion, support vector machines are better than neu- as “important” in 60% of the bootstrap simulations, ral nets, etc. In fact it is verydifficult to run an which might be compared with the 59% for variable honest simulation comparison, and easyto inadver- 19, the most for anysingle explanator. tentlycheat bychoosing favorable examples, or by In what sense are variable 12 or 17 or 19 not putting as much effort into optimizing the dull “important” or “not important”? This is the kind of old standard as the exciting new challenger. interesting inferential question raised byprediction methodology. Tibshirani and I made a stab at an Rule 2. Complicated methods are harder to crit- answer in our 1998 annals paper. I believe that the icize than simple ones. Bynow it is easyto check current interest in statistical prediction will eventu- the efficiencyof a logistic regression, but it is no allyinvigorate traditional inference, not eliminate small matter to analyze the limitations of a sup- it. port vector machine. One of the best things statis- A third front seems to have been opened in ticians do, and something that doesn’t happen out- the long-running frequentist-Bayesian wars by the side our profession, is clarifythe inferential basis advocates of algorithmic prediction, who don’t really of a proposed new methodology, a nice recent exam- believe in anyinferential school. Leo’s paper is at its ple being Friedman, Hastie, and Tibshirani’s anal- best when presenting the successes of algorithmic ysis of “boosting,” (2000). The past half-century has modeling, which comes across as a positive devel- seen the clarification process successfullyat work on opment for both statistical practice and theoretical nonparametrics, robustness and survival analysis. innovation. This isn’t an argument against tradi- There has even been some success with biased esti- tional data modeling anymore than splines are an mation in the form of Stein shrinkage and empirical argument against polynomials. The whole point of Bayes, but I believe the hardest part of this work science is to open up black boxes, understand their remains to be done. Papers like Leo’s are a call for insides, and build better boxes for the purposes of more analysis and theory, not less. mankind. 
Leo himself is a notablysuccessful sci- Prediction is certainlyan interesting subject but entist, so we can hope that the present paper was Leo’s paper overstates both its role and our profes- written more as an advocacydevice than as the con- sion’s lack of interest in it. fessions of a born-again black boxist. 220 L. BREIMAN Comment

Bruce Hoadley

INTRODUCTION next 6 months. The goal is to estimate the function, fx=logPry = 1x/ Pry = 0x. Professor Professor Breiman’s paper is an important one Breiman argues that some kind of simple logistic for statisticians to read. He and Statistical Science regression from the data modeling culture is not should be applauded for making this kind of mate- the wayto solve this problem. I agree. Let’s take rial available to a large audience. His conclusions are consistent with how statistics is often practiced a look at how the engineers at Fair, Isaac solved in business. This discussion will consist of an anec- this problem—wayback in the 1960s and 1970s. dotal recital of myencounters with the algorithmic The general form used for fx was called a modeling culture. Along the way, areas of mild dis- segmented scorecard. The process for developing agreement with Professor Breiman are discussed. I a segmented scorecard was clearlyan algorithmic also include a few proposals for research topics in modeling process. algorithmic modeling. The first step was to transform x into manyinter- pretable variables called prediction characteristics. CASE STUDY OF AN ALGORITHMIC This was done in stages. The first stage was to MODELING CULTURE compute several time series derived from the orig- inal two. An example is the time series of months Although I spent most of mycareer in manage- delinquent—a nonlinear function. The second stage ment at Bell Labs and Bellcore, the last seven years was to define characteristics as operators on the have been with the research group at Fair, Isaac. time series. For example, the number of times in This companyprovides all kinds of decision sup- the last six months that the customer was more port solutions to several industries, and is very than two months delinquent. This process can lead well known for credit scoring. Credit scoring is a to thousands of characteristics. A subset of these great example of the problem discussed byProfessor characteristics passes a screen for further analysis. Breiman. The input variables, x, might come from The next step was to segment the population companydatabases or credit bureaus. The output based on the screened characteristics. The segmen- variable, y, is some indicator of credit worthiness. tation was done somewhat informally. But when Credit scoring has been a profitable business for I looked at the process carefully, the segments Fair, Isaac since the 1960s, so it is instructive to turned out to be the leaves of a shallow-to-medium look at the Fair, Isaac analytic approach to see how tree. And the tree was built sequentiallyusing it fits into the two cultures described byProfessor mostlybinarysplits based on the best splitting Breiman. The Fair, Isaac approach was developed by characteristics—defined in a reasonable way. The engineers and operations research people and was algorithm was manual, but similar in concept to the driven bythe needs of the clients and the quality CART algorithm, with a different purityindex. of the data. The influences of the statistical com- Next, a separate function, fx, was developed for munitywere mostlyfrom the nonparametric side— each segment. The function used was called a score- things like jackknife and bootstrap. card. Each characteristic was chopped up into dis- Consider an example of behavior scoring, which crete intervals or sets called attributes. A score- is used in credit card account management. 
For card was a linear function of the attribute indicator pedagogical reasons, I consider a simplified version (dummy) variables derived from the characteristics. (in the real world, things get more complicated) of The coefficients of the dummyvariables were called monthlybehavior scoring. The input variables, x, score weights. in this simplified version, are the monthlybills and payments over the last 12 months. So the dimen- This construction amounted to an explosion of sion of x is 24. The output variable is binaryand dimensionality. They started with 24 predictors. is the indicator of no severe delinquencyover the These were transformed into hundreds of charac- teristics and pared down to about 100 characteris- tics. Each characteristic was discretized into about Dr. Bruce Hoadley is with Fair, Isaac and Co., 10 attributes, and there were about 10 segments. Inc., 120 N. Redwood Drive, San Rafael, California This makes 100 × 10 × 10 = 10 000 features. Yes 94903-1996 (e-mail: BruceHoadley@ FairIsaac.com). indeed, dimensionalityis a blessing. STATISTICAL MODELING: THE TWO CULTURES 221

What Fair, Isaac calls a scorecard is now else- meeting, I heard talks on treed regression, which where called a generalized additive model (GAM) looked like segmented scorecards to me. with bin smoothing. However, a simple GAM would After a few years with Fair, Isaac, I developed a not do. Client demand, legal considerations and talk entitled, “Credit Scoring—A Parallel Universe robustness over time led to the concept of score engi- of Prediction and Classification.” The theme was neering. For example, the score had to be monotoni- that Fair, Isaac developed in parallel manyof the callydecreasing in certain delinquencycharacteris- concepts used in modern algorithmic modeling. tics. Prior judgment also played a role in the design Certain aspects of the data modeling culture crept of scorecards. For some characteristics, the score into the Fair, Isaac approach. The use of divergence weights were shrunk toward zero in order to moder- was justified byassuming that the score distribu- ate the influence of these characteristics. For other tions were approximatelynormal. So rather than characteristics, the score weights were expanded in making assumptions about the distribution of the order to increase the influence of these character- inputs, theymade assumptions about the distribu- istics. These adjustments were not done willy-nilly. tion of the output. This assumption of normalitywas Theywere done to overcome known weaknesses in supported bya central limit theorem, which said the data. that sums of manyrandom variables are approxi- matelynormal—even when the component random So how did these Fair, Isaac pioneers fit these variables are dependent and multiples of dummy complicated GAM models back in the 1960s and random variables. 1970s? Logistic regression was not generallyavail- Modern algorithmic classification theoryhas able. And besides, even today’s commercial GAM shown that excellent classifiers have one thing in software will not handle complex constraints. What common, theyall have large margin. Margin, M,is theydid was to maximize (subject to constraints) a random variable that measures the comfort level a measure called divergence, which measures how with which classifications are made. When the cor- well the score, S, separates the two populations with rect classification is made, the margin is positive; different values of y. The formal definition of diver- it is negative otherwise. Since margin is a random 2 gence is 2ESy = 1−ESy = 0 /VSy = variable, the precise definition of large margin is 1+VSy = 0. This constrained fitting was done tricky. It does not mean that EM is large. When with a heuristic nonlinear programming algorithm. I put mydata modeling hat on, I surmised that A linear transformation was used to convert to a log large margin means that EM/ VM is large. odds scale. Lo and behold, with this definition, large margin Characteristic selection was done byanalyzing means large divergence. the change in divergence after adding (removing) Since the good old days at Fair, Isaac, there have each candidate characteristic to (from) the current been manyimprovements in the algorithmic mod- best model. The analysis was done informally to eling approaches. We now use genetic algorithms achieve good performance on the test sample. There to screen verylarge structured sets of prediction were no formal tests of fit and no tests of score characteristics. Our segmentation algorithms have weight statistical significance. 
What counted was been automated to yield even more predictive sys- performance on the test sample, which was a surro- tems. Our palatable GAM modeling tool now han- gate for the future real world. dles smooth splines, as well as splines mixed with These earlyFair, Isaac engineers were ahead of step functions, with all kinds of constraint capabil- their time and charter members of the algorithmic ity. Maximizing divergence is still a favorite, but modeling culture. The score formula was linear in we also maximize constrained GLM likelihood func- an exploded dimension. A complex algorithm was tions. We also are experimenting with computa- used to fit the model. There was no claim that the tionallyintensive algorithms that will optimize any final score formula was correct, onlythat it worked objective function that makes sense in the busi- well on the test sample. This approach grew natu- ness environment. All of these improvements are rallyout of the demands of the business and the squarelyin the culture of algorithmic modeling. qualityof the data. The overarching goal was to develop tools that would help clients make bet- OVERFITTING THE TEST SAMPLE ter decisions through data. What emerged was a Professor Breiman emphasizes the importance of veryaccurate and palatable algorithmic modeling performance on the test sample. However, this can solution, which belies Breiman’s statement: “The be overdone. The test sample is supposed to repre- algorithmic modeling methods available in the pre- sent the population to be encountered in the future. 1980s decades seem primitive now.” At a recent ASA But in reality, it is usually a random sample of the 222 L. BREIMAN current population. High performance on the test So far, all I had was the scorecard GAM. So clearly sample does not guarantee high performance on I was missing all of those interactions that just had future samples, things do change. There are prac- to be in the model. To model the interactions, I tried tices that can be followed to protect against change. developing small adjustments on various overlap- One can monitor the performance of the mod- ping segments. No matter how hard I tried, noth- els over time and develop new models when there ing improved the test sample performance over the has been sufficient degradation of performance. For global scorecard. I started calling it the Fat Score- some of Fair, Isaac’s core products, the redevelop- card. ment cycle is about 18–24 months. Fair, Isaac also Earlier, on this same data set, another Fair, Isaac does “score engineering” in an attempt to make researcher had developed a neural network with the models more robust over time. This includes 2,000 connection weights. The Fat Scorecard slightly damping the influence of individual characteristics, outperformed the neural network on the test sam- using monotone constraints and minimizing the size ple. I cannot claim that this would work for every of the models subject to performance constraints data set. But for this data set, I had developed an on the current test sample. This score engineer- excellent algorithmic model with a simple data mod- ing amounts to moving from verynonparametric (no eling tool. score engineering) to more semiparametric (lots of Whydid the simple additive model work so well? score engineering). One idea is that some of the characteristics in the model are acting as surrogates for certain inter- SPIN-OFFS FROM THE DATA action terms that are not explicitlyin the model. 
MODELING CULTURE Another reason is that the scorecard is reallya In Section 6 of Professor Breiman’s paper, he says sophisticated neural net. The inputs are the original that “multivariate analysis tools in statistics are inputs. Associated with each characteristic is a hid- frozen at discriminant analysis and logistic regres- den node. The summation functions coming into the sion in classification ” This is not necessarilyall hidden nodes are the transformations defining the that bad. These tools can carryyouveryfar as long characteristics. The transfer functions of the hid- as you ignore all of the textbook advice on how to den nodes are the step functions (compiled from the use them. To illustrate, I use the saga of the Fat score weights)—all derived from the data. The final Scorecard. output is a linear function of the outputs of the hid- Earlyin myresearch daysat Fair, Isaac, I den nodes. The result is highlynonlinear and inter- was searching for an improvement over segmented active, when looked at as a function of the original scorecards. The idea was to develop first a very inputs. good global scorecard and then to develop small The Fat Scorecard studyhad an ingredient that adjustments for a number of overlapping segments. is rare. We not onlyhad the traditional test sample, To develop the global scorecard, I decided to use but had three other test samples, taken one, two, logistic regression applied to the attribute dummy and three years later. In this case, the Fat Scorecard variables. There were 36 characteristics available outperformed the more traditional thinner score- for fitting. A typical scorecard has about 15 char- card for all four test samples. So the feared over- acteristics. Myvariable selection was structured fitting to the traditional test sample never mate- so that an entire characteristic was either in or rialized. To get a better handle on this you need out of the model. What I discovered surprised an understanding of how the relationships between me. All models fit with anywhere from 27 to 36 variables evolve over time. characteristics had the same performance on the I recentlyencountered another connection test sample. This is what Professor Breiman calls between algorithmic modeling and data modeling. “Rashomon and the multiplicityof good models.” To In classical multivariate discriminant analysis, one keep the model as small as possible, I chose the one assumes that the prediction variables have a mul- with 27 characteristics. This model had 162 score tivariate normal distribution. But for a scorecard, weights (logistic regression coefficients), whose P- the prediction variables are hundreds of attribute values ranged from 0.0001 to 0.984, with onlyone dummyvariables, which are verynonnormal. How- less than 0.05; i.e., statisticallysignificant. The con- ever, if you apply the discriminant analysis algo- fidence intervals for the 162 score weights were use- rithm to the attribute dummyvariables, youcan less. To get this great scorecard, I had to ignore get a great algorithmic model, even though the the conventional wisdom on how to use logistic assumptions of discriminant analysis are severely regression. violated. STATISTICAL MODELING: THE TWO CULTURES 223

ASOLUTION TO THE OCCAMDILEMMA used as a surrogate for the decision process, and misclassification error is used as a surrogate for I think that there is a solution to the Occam dilemma without resorting to goal-oriented argu- profit. However, I see a mismatch between the algo- ments. Clients reallydo insist on interpretable func- rithms used to develop the models and the business tions, fx. Segmented palatable scorecards are measurement of the model’s value. For example, at veryinterpretable bythe customer and are very Fair, Isaac, we frequentlymaximize divergence. But accurate. Professor Breiman himself gave single when we argue the model’s value to the clients, we trees an A+ on interpretability. The shallow-to- don’t necessarilybrag about the great divergence. medium tree in a segmented scorecard rates an A++. We tryto use measures that the client can relate to. The palatable scorecards in the leaves of the trees The ROC curve is one favorite, but it maynot tell are built from interpretable (possiblycomplex) char- the whole story. Sometimes, we develop simulations acteristics. Sometimes we can’t implement them of the client’s business operation to show how the until the lawyers and regulators approve. And that model will improve their situation. For example, in requires super interpretability. Our more sophisti- a transaction fraud control process, some measures cated products have 10 to 20 segments and up to of interest are false positive rate, speed of detec- 100 characteristics (not all in everysegment). These models are veryaccurate and veryinterpretable. tion and dollars saved when 0.5% of the transactions I coined a phrase called the “Ping-Pong theorem.” are flagged as possible frauds. The 0.5% reflects the This theorem says that if we revealed to Profes- number of transactions that can be processed by sor Breiman the performance of our best model and the current fraud management staff. Perhaps what gave him our data, then he could develop an algo- the client reallywants is a score that will maxi- rithmic model using random forests, which would mize the dollars saved in their fraud control sys- outperform our model. But if he revealed to us the tem. The score that maximizes test set divergence performance of his model, then we could develop or minimizes test set misclassifications does not do a segmented scorecard, which would outperform it. The challenge for algorithmic modeling is to find his model. We might need more characteristics, an algorithm that maximizes the generalization dol- attributes and segments, but our experience in this lars saved, not generalization error. kind of contest is on our side. We have made some progress in this area using However, all the competing models in this game of ideas from support vector machines and boosting. Ping-Pong would surelybe algorithmic models. But Bymanipulating the observation weights used in some of them could be interpretable. standard algorithms, we can improve the test set THE ALGORITHM TUNING DILEMMA performance on anyobjective of interest. But the price we payis computational intensity. As far as I can tell, all approaches to algorithmic model building contain tuning parameters, either explicit or implicit. For example, we use penalized objective functions for fitting and marginal contri- MEASURING IMPORTANCE—IS IT bution thresholds for characteristic selection. With REALLY POSSIBLE? 
experience, analysts learn how to set these tuning I like Professor Breiman’s idea for measuring the parameters in order to get excellent test sample importance of variables in black box models. A Fair, or cross-validation results. However, in industry Isaac spin on this idea would be to build accurate and academia, there is sometimes a little tinker- ing, which involves peeking at the test sample. The models for which no variable is much more impor- result is some bias in the test sample or cross- tant than other variables. There is always a chance validation results. This is the same kind of tinkering that a variable and its relationships will change in that upsets test of fit pureness. This is a challenge the future. After that, you still want the model to for the algorithmic modeling approach. How do you work. So don’t make anyvariable dominant. optimize your results and get an unbiased estimate I think that there is still an issue with measuring of the generalization error? importance. Consider a set of inputs and an algo-

rithm that yields a black box, for which x1 is impor- GENERALIZING THE GENERALIZATION ERROR tant. From the “Ping Pong theorem” there exists a

In most commercial applications of algorithmic set of input variables, excluding x1 and an algorithm modeling, the function, fx, is used to make deci- that will yield an equally accurate black box. For sions. In some academic research, classification is this black box, x1 is unimportant. 224 L. BREIMAN

IN SUMMARY GAMs in the leaves. Theyare veryaccurate and Algorithmic modeling is a veryimportant area of interpretable. And you can do it with data modeling statistics. It has evolved naturallyin environments tools as long as you (i) ignore most textbook advice, with lots of data and lots of decisions. But you (ii) embrace the blessing of dimensionality, (iii) use can do it without suffering the Occam dilemma; constraints in the fitting optimizations (iv) use reg- for example, use medium trees with interpretable ularization, and (v) validate the results.

Comment

Emanuel Parzen

Emanuel Parzen is Distinguished Professor, Department of Statistics, Texas A&M University, 415 C Block Building, College Station, Texas 77843 (e-mail: [email protected]).

1. BREIMAN DESERVES OUR APPRECIATION

I strongly support the view that statisticians must face the crisis of the difficulties in their practice of regression. Breiman alerts us to systematic blunders (leading to wrong conclusions) that have been committed applying current statistical practice of data modeling. In the spirit of "statistician, avoid doing harm" I propose that the first goal of statistical ethics should be to guarantee to our clients that any mistakes in our analysis are unlike any mistakes that statisticians have made before.

The two goals in analyzing data which Leo calls prediction and information I prefer to describe as "management" and "science." Management seeks profit, practical answers (predictions) useful for decision making in the short run. Science seeks truth, fundamental knowledge about nature which provides understanding and control in the long run. As a historical note, Student's t-test has many scientific applications but was invented by Student as a management tool to make Guinness beer better (bitter?).

Breiman does an excellent job of presenting the case that the practice of statistical science, using only the conventional data modeling culture, needs reform. He deserves much thanks for alerting us to the algorithmic modeling culture. Breiman warns us that "if the model is a poor emulation of nature, the conclusions may be wrong." This situation, which I call "the right answer to the wrong question," is called by statisticians "the error of the third kind." Engineers at M.I.T. define "suboptimization" as "elegantly solving the wrong problem."

Breiman presents the potential benefits of algorithmic models (better predictive accuracy than data models, and consequently better information about the underlying mechanism and avoiding questionable conclusions which results from weak predictive accuracy) and support vector machines (which provide almost perfect separation and discrimination between two classes by increasing the dimension of the feature set). He convinces me that the methods of algorithmic modeling are important contributions to the tool kit of statisticians.

If the profession of statistics is to remain healthy, and not limit its research opportunities, statisticians must learn about the cultures in which Breiman works, but also about many other cultures of statistics.

2. HYPOTHESES TO TEST TO AVOID BLUNDERS OF STATISTICAL MODELING

Breiman deserves our appreciation for pointing out generic deviations from standard assumptions (which I call bivariate dependence and two-sample conditional clustering) for which we should routinely check. "Test null hypothesis" can be a useful algorithmic concept if we use tests that diagnose in a model-free way the directions of deviation from the null hypothesis model.

Bivariate dependence (correlation) may exist between features [independent (input) variables] in a regression, causing them to be proxies for each other and our models to be unstable, with different forms of regression models being equally well fitting. We need tools to routinely test the hypothesis of statistical independence of the distributions of independent (input) variables.

Two-sample conditional clustering arises in the distributions of independent (input) variables to discriminate between two classes, which we call the conditional distribution of input variables X given each class.
Class I mayhave onlyone mode (clus- [email protected]). ter) at low values of X while class II has two modes STATISTICAL MODELING: THE TWO CULTURES 225

(clusters) at low and high values of X. We would like approach to estimation of conditional quantile func- to conclude that high values of X are observed only tions which I onlyrecentlyfullyimplemented. I for members of class II but low values of X occur for would like to extend the concept of algorithmic sta- members of both classes. The hypothesis we propose tistical models in two ways: (1) to mean data fitting testing is equalityof the pooled distribution of both byrepresentations which use approximation the- samples and the conditional distribution of sample oryand numerical analysis;(2) to use the notation I, which is equivalent to PclassIX=PclassI. of probabilityto describe empirical distributions of For successful discrimination one seeks to increase samples (data sets) which are not assumed to be the number (dimension) of inputs (features) X to generated bya random mechanism. make PclassIX close to 1 or 0. Myquantile culture has not yetbecome widely applied because “you cannot give away a good idea, 3. STATISTICAL MODELING, MANY you have to sell it” (by integrating it in computer CULTURES, STATISTICAL METHODS MINING programs usable byapplied statisticians and thus Breiman speaks of two cultures of statistics; I promote statistical methods mining). believe statistics has many cultures. At specialized A quantile function Qu,0≤ u ≤ 1, is the −1 workshops (on maximum entropymethods or robust inverse F u of a distribution function Fx, methods or Bayesian methods or ) a main topic −∞ 1 modeling.” We should teach in our introductory are outliers as defined byTukeyEDA. courses that one meaning of statistics is “statisti- For the fundamental problem of comparison of cal data modeling done in a systematic way” by two distributions F and G we define the compar- an iterated series of stages which can be abbrevi- ison distribution Du F G and comparison den- ated SIEVE (specifyproblem and general form of sity du F G=Du F G For F G continu- models, identifytentativelynumbers of parameters ous, define Du F G=GF−1u du F G= and specialized models, estimate parameters, val- gF−1u/fF−1u assuming fx=0 implies idate goodness-of-fit of estimated models, estimate gx=0, written G  F.ForF G discrete with final model nonparametricallyor algorithmically). probabilitymass functions pF and pG define (assum- −1 −1 MacKayand Oldford (2000) brilliantlypresent the ing G  F) du F G=pGF u/pFF u. statistical method as a series of stages PPDAC Our applications of comparison distributions (problem, plan, data, analysis, conclusions). often assume F to be an unconditional distribution and G a conditional distribution. To analyze bivari- 4. QUANTILE CULTURE, ate data X Y a fundamental tool is dependence ALGORITHMIC MODELS density dt u=du F F  When X Y Y YX=QXt is jointlycontinuous, A culture of statistical data modeling based on quantile functions, initiated in Parzen (1979), has dt u been mymain research interest since 1976. In mydiscussion to Stone (1977) I outlined a novel = fX YQXt QYu/fXQXtfYQYu 226 L. BREIMAN
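As an illustration of these definitions, the comparison distribution and comparison density can be estimated directly from data by plugging in empirical distribution and quantile functions. The sketch below is not Parzen's own software; it is a minimal plug-in estimate in Python, with the function names, the ten-bin histogram estimate of d(u; F, G), and the simulated two-sample example all my own choices.

    import numpy as np

    def empirical_cdf(sample):
        # Empirical distribution function of the sample.
        x = np.sort(np.asarray(sample, dtype=float))
        return lambda t: np.searchsorted(x, t, side="right") / len(x)

    def empirical_quantile(sample):
        # Empirical quantile function Q(u) = F^{-1}(u).
        x = np.sort(np.asarray(sample, dtype=float))
        n = len(x)
        def Q(u):
            u = np.clip(np.asarray(u, dtype=float), 0.0, 1.0)
            idx = np.clip(np.ceil(u * n).astype(int) - 1, 0, n - 1)
            return x[idx]
        return Q

    def comparison_distribution(f_sample, g_sample, u):
        # Plug-in estimate of D(u; F, G) = G(F^{-1}(u)).
        return empirical_cdf(g_sample)(empirical_quantile(f_sample)(u))

    def comparison_density(f_sample, g_sample, n_bins=10):
        # Histogram-type estimate of d(u; F, G): slope of D on each bin of [0, 1].
        # Values near 1 everywhere are consistent with G = F; deviations show
        # where (and in which direction) G differs from F.
        u = np.linspace(0.0, 1.0, n_bins + 1)
        return np.diff(comparison_distribution(f_sample, g_sample, u)) * n_bins

    # Example with F taken as a pooled sample and G as one of its two components.
    rng = np.random.default_rng(0)
    sample_1 = rng.normal(0.0, 1.0, size=500)
    sample_2 = rng.normal(0.5, 1.0, size=500)
    pooled = np.concatenate([sample_1, sample_2])
    # d(u; F_pooled, F_1) sits above 1 at small u and below 1 at large u,
    # because sample 1 lies to the left of the pooled sample.
    print(np.round(comparison_density(pooled, sample_1), 2))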

The statistical independence hypothesis F_{X,Y} = F_X F_Y is equivalent to d(t, u) = 1 for all t, u. A fundamental formula for estimation of conditional quantile functions is

    Q_{Y|X=x}(u) = Q_Y(D^{-1}(u; F_Y, F_{Y|X=x})),

equivalently Q_{Y|X=x}(u) = Q_Y(s) with u = D(s; F_Y, F_{Y|X=x}).

To compare the distributions of two univariate samples, let Y denote the continuous response variable and let X be binary 0, 1, denoting the population from which Y is observed. The comparison density is defined (note that F_Y is the pooled distribution function)

    d_1(u) = d(u; F_Y, F_{Y|X=1}) = P(X = 1 | Y = Q_Y(u)) / P(X = 1).

5. QUANTILE IDEAS FOR HIGH DIMENSIONAL DATA ANALYSIS

By high dimensional data we mean multivariate data (Y_1, ..., Y_m). We form approximate high dimensional comparison densities d(u_1, ..., u_m) to test statistical independence of the variables and, when we have two samples, d_1(u_1, ..., u_m) to test equality of sample I with the pooled sample. All our distributions are empirical distributions but we use notation for true distributions in our formulas. Note that

    ∫_0^1 ... ∫_0^1 du_1 ... du_m d(u_1, ..., u_m) d_1(u_1, ..., u_m) = 1.

A decile quantile bin Bin(k_1, ..., k_m) is defined to be the set of observations (Y_1, ..., Y_m) satisfying, for j = 1, ..., m,

    Q_{Y_j}((k_j − 1)/10) < Y_j ≤ Q_{Y_j}(k_j/10).

Instead of deciles k/10 we could use k/M for another base M. To test the hypothesis that Y_1, ..., Y_m are statistically independent we form, for all k_j = 1, ..., 10,

    d(k_1, ..., k_m) = P[Bin(k_1, ..., k_m)] / P[Bin(k_1, ..., k_m); independence].

To test equality of the distribution of a sample from population I and the pooled sample we form

    d_1(k_1, ..., k_m) = P[Bin(k_1, ..., k_m) | population I] / P[Bin(k_1, ..., k_m) | pooled sample]

for all k_1, ..., k_m such that the denominator is positive, and otherwise defined arbitrarily. One can show (letting X denote the population observed)

    d_1(k_1, ..., k_m) = P[X = I | observation from Bin(k_1, ..., k_m)] / P[X = I].

To test the null hypotheses in ways that detect directions of deviations from the null hypothesis, our recommended first step is quantile data analysis of the values d(k_1, ..., k_m) and d_1(k_1, ..., k_m).
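The decile-bin diagnostics can likewise be computed directly from data. The following is a minimal sketch for the bivariate case m = 2 only; the function names, the use of empirical deciles, and the simulated example are my assumptions rather than anything specified in the text. It compares observed joint bin proportions with the product of the marginal bin proportions, which plays the role of P[Bin(k_1, k_2); independence].

    import numpy as np

    def decile_bins(y, n_bins=10):
        # Assign each observation its decile bin index k = 1, ..., n_bins,
        # using the empirical quantiles of y as bin edges.
        edges = np.quantile(y, np.linspace(0.0, 1.0, n_bins + 1))
        k = np.digitize(y, edges[1:-1], right=True)   # 0 .. n_bins-1
        return np.clip(k, 0, n_bins - 1) + 1

    def bin_ratios_m2(y1, y2, n_bins=10):
        # d(k1, k2) = P[Bin(k1, k2)] / (P[k1] P[k2]): observed joint bin proportions
        # divided by the product of the marginal bin proportions, the benchmark
        # that would hold under statistical independence of Y1 and Y2.
        k1, k2 = decile_bins(y1, n_bins), decile_bins(y2, n_bins)
        joint = np.zeros((n_bins, n_bins))
        for a, b in zip(k1, k2):
            joint[a - 1, b - 1] += 1.0 / len(y1)
        expected = np.outer(joint.sum(axis=1), joint.sum(axis=0))
        with np.errstate(divide="ignore", invalid="ignore"):
            return np.where(expected > 0, joint / expected, np.nan)

    # Example: a correlated pair; the ratio matrix is elevated along the diagonal
    # and depressed in the off-diagonal corners, showing the direction of dependence.
    rng = np.random.default_rng(1)
    y1 = rng.normal(size=2000)
    y2 = 0.7 * y1 + rng.normal(size=2000)
    print(np.round(bin_ratios_m2(y1, y2), 2))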

I appreciate this opportunity to bring to the attention of researchers on high dimensional data analysis the potential of quantile methods. My conclusion is that statistical science has many cultures and statisticians will be more successful when they emulate Leo Breiman and apply as many cultures as possible (which I call statistical methods mining). Additional references are on my web site.

Emanuel Parzen is Distinguished Professor, Department of Statistics, Texas A&M University, 415 C Block Building, College Station, Texas 77843 (e-mail: [email protected]).

Rejoinder

Leo Breiman

I thank the discussants. I'm fortunate to have comments from a group of experienced and creative statisticians—even more so in that their comments are diverse. Manny Parzen and Bruce Hoadley are more or less in agreement, Brad Efron has serious reservations and D. R. Cox is in downright disagreement.

I address Professor Cox's comments first, since our disagreement is crucial.

D. R. COX

Professor Cox is a worthy and thoughtful adversary. We walk down part of the trail together and then sharply diverge. To begin, I quote: "Professor Breiman takes data as his starting point. I would prefer to start with an issue, a question, or a scientific hypothesis." I agree, but would expand the starting list to include the prediction of future events. I have never worked on a project that has started with "Here is a lot of data; let's look at it and see if we can get some ideas about how to use it." The data has been put together and analyzed starting with an objective.

C1 Data Models Can Be Useful

Professor Cox is committed to the use of data models. I readily acknowledge that there are situations where a simple data model may be useful and appropriate; for instance, if the science of the mechanism producing the data is well enough known to determine the model apart from estimating parameters. There are also situations of great complexity posing important issues and questions in which there is not enough data to resolve the questions to the accuracy desired. Simple models can then be useful in giving qualitative understanding, suggesting future research areas and the kind of additional data that needs to be gathered.

At times, there is not enough data on which to base predictions; but policy decisions need to be made. In this case, constructing a model using whatever data exists, combined with scientific common sense and subject-matter knowledge, is a reasonable path. Professor Cox points to examples when he writes:

    Often the prediction is under quite different conditions from the data; what is the likely progress of the incidence of the epidemic of v-CJD in the United Kingdom, what would be the effect on annual incidence of cancer in the United States of reducing by 10% the medical use of X-rays, etc.? That is, it may be desired to predict the consequences of something only indirectly addressed by the data available for analysis.... Prediction, always hazardous, without some understanding of the underlying process and linking with other sources of information, becomes more and more tentative.

I agree.

C2 Data Models Only

From here on we part company. Professor Cox's discussion consists of a justification of the use of data models to the exclusion of other approaches. For instance, although he admits, "... I certainly accept, although it goes somewhat against the grain to do so, that there are situations where a directly empirical approach is better," the two examples he gives of such situations are short-term economic forecasts and real-time flood forecasts—among the less interesting of all of the many current successful algorithmic applications. In his view, the only use for algorithmic models is short-term forecasting; there are no comments on the rich information about the data and covariates available from random forests or in the many fields, such as pattern recognition, where algorithmic modeling is fundamental.

He advocates construction of stochastic data models that summarize the understanding of the phenomena under study. The methodology in the Cox and Wermuth book (1996) attempts to push understanding further by finding causal orderings in the covariate effects. The sixth chapter of this book contains illustrations of this approach on four data sets. The first is a small data set of 68 patients with seven covariates from a pilot study at the University of Mainz to identify psychological and socioeconomic factors possibly important for glucose control in diabetes patients. This is a regression-type problem with the response variable measured by GHb (glycosylated haemoglobin). The model fitting is done by a number of linear regressions and validated by the checking of various residual plots. The only other reference to model validation is the statement, "R² = 0.34, reasonably large by the standards usual for this field of study." Predictive accuracy is not computed, either for this example or for the three other examples.

My comments on the questionable use of data models apply to this analysis. Incidentally, I tried to get one of the data sets used in the chapter to conduct an alternative analysis, but it was not possible to get it before my rejoinder was due. It would have been interesting to contrast our two approaches.

C3 Approach to Statistical Problems

Basing my critique on a small illustration in a book is not fair to Professor Cox. To be fairer, I quote his words about the nature of a statistical analysis:

    Formal models are useful and often almost, if not quite, essential for incisive thinking. Descriptively appealing and transparent methods with a firm model base are the ideal. Notions of significance tests, confidence intervals, posterior intervals, and all the formal apparatus of inference are valuable tools to be used as guides, but not in a mechanical way; they indicate the uncertainty that would apply under somewhat idealized, maybe very idealized, conditions and as such are often lower bounds to real uncertainty. Analyses and model development are at least partly exploratory. Automatic methods of model selection (and of variable selection in regression-like problems) are to be shunned or, if use is absolutely unavoidable, are to be examined carefully for their effect on the final conclusions. Unfocused tests of model adequacy are rarely helpful.

Given the right kind of data (relatively small sample size and a handful of covariates), I have no doubt that his experience and ingenuity in the craft of model construction would result in an illuminating model. But data characteristics are rapidly changing. In many of the most interesting current problems, the idea of starting with a formal model is not tenable.

C4 Changes in Problems

My impression from Professor Cox's comments is that he believes every statistical problem can be best solved by constructing a data model. I believe that statisticians need to be more pragmatic. Given a statistical problem, find a good solution, whether it is a data model, an algorithmic model or (although it is somewhat against my grain) a Bayesian data model or a completely different approach.

My work on the 1990 Census Adjustment (Breiman, 1994) involved a painstaking analysis of the sources of error in the data. This was done by a long study of thousands of pages of evaluation documents. This seemed the most appropriate way of answering the question of the accuracy of the adjustment estimates.

The conclusion that the adjustment estimates were largely the result of bad data has never been effectively contested and is supported by the results of the Year 2000 Census Adjustment effort. The accuracy of the adjustment estimates was, arguably, the most important statistical issue of the last decade, and could not be resolved by any amount of statistical modeling.

A primary reason why we cannot rely on data models alone is the rapid change in the nature of statistical problems. The realm of applications of statistics has expanded more in the last twenty-five years than in any comparable period in the history of statistics.

In an astronomy and statistics workshop this year, a speaker remarked that in twenty-five years we have gone from being a small sample-size science to a very large sample-size science. Astronomical data bases now contain data on two billion objects comprising over 100 terabytes, and the rate of new information is accelerating.

A recent biostatistics workshop emphasized the analysis of genetic data. An exciting breakthrough is the use of microarrays to locate regions of gene activity. Here the sample size is small, but the number of variables ranges in the thousands. The questions are which specific genes contribute to the occurrence of various types of diseases.

Questions about the areas of thinking in the brain are being studied using functional MRI. The data gathered in each run consists of a sequence of 150,000 pixel images. Gigabytes of satellite information are being used in projects to predict and understand short- and long-term environmental and weather changes.

Underlying this rapid change is the rapid evolution of the computer, a device for gathering, storing and manipulating incredible amounts of data, together with technological advances incorporating computing, such as satellites and MRI machines.

The problems are exhilarating. The methods used in statistics for small sample sizes and a small number of variables are not applicable. John Rice, in his summary talk at the astronomy and statistics workshop, said, "Statisticians have to become opportunistic." That is, faced with a problem, they must find a reasonable solution by whatever method works. One surprising aspect of both workshops was how opportunistic statisticians faced with genetic and astronomical data had become. Algorithmic methods abounded.

C5 Mainstream Procedures and Tools

Professor Cox views my critique of the use of data models as based in part on a caricature. Regarding my references to articles in journals such as JASA, he states that they are not typical of mainstream statistical analysis, but are used to illustrate technique rather than explain the process of analysis. His concept of mainstream statistical analysis is summarized in the quote given in my Section C3. It is the kind of thoughtful and careful analysis that he prefers and is capable of.

Following this summary is the statement:

    By contrast, Professor Breiman equates mainstream applied statistics to a relatively mechanical process involving somehow or other choosing a model, often a default model of standard form, and applying standard methods of analysis and goodness-of-fit procedures.

The disagreement is definitional—what is "mainstream"? In terms of numbers my definition of mainstream prevails, I guess, at a ratio of at least 100 to 1. Simply count the number of people doing their statistical analysis using canned packages, or count the number of SAS licenses.

In the academic world, we often overlook the fact that we are a small slice of all statisticians and an even smaller slice of all those doing analyses of data. There are many statisticians and nonstatisticians in diverse fields using data to reach conclusions and depending on tools supplied to them by SAS, SPSS, etc. Their conclusions are important and are sometimes published in medical or other subject-matter journals. They do not have the statistical expertise, computer skills, or time needed to construct more appropriate tools. I was faced with this problem as a consultant when confined to using the BMDP linear regression, stepwise linear regression, and discriminant analysis programs. My concept of decision trees arose when I was faced with nonstandard data that could not be treated by these standard methods.

When I rejoined the university after my consulting years, one of my hopes was to provide better general purpose tools for the analysis of data. The first step in this direction was the publication of the CART book (Breiman et al., 1984). CART and other similar decision tree methods are used in thousands of applications yearly in many fields. It has proved robust and reliable. There are others that are more recent; random forests is the latest. A preliminary version of random forests is free source with f77 code, S+ and R interfaces available at www.stat.berkeley.edu/users/breiman. A nearly completed second version will also be put on the web site and translated into Java by the Weka group. My collaborator, Adele Cutler, and I will continue to upgrade it and add new features, graphics, and a good interface.

My philosophy about the field of academic statistics is that we have a responsibility to provide the many people working in applications outside of academia with useful, reliable, and accurate analysis tools. Two excellent examples are wavelets and decision trees. More are needed.

BRAD EFRON

Brad seems to be a bit puzzled about how to react to my article. I'll start with what appears to be his biggest reservation.

E1 From Simple to Complex Models

Brad is concerned about the use of complex models without simple interpretability in their structure, even though these models may be the most accurate predictors possible. But the evolution of science is from simple to complex.

The equations of general relativity are considerably more complex and difficult to understand than Newton's equations. The quantum mechanical equations for a system of molecules are extraordinarily difficult to interpret. Physicists accept these complex models as the facts of life, and do their best to extract usable information from them.

There is no consideration given to trying to understand cosmology on the basis of Newton's equations or nuclear reactions in terms of hard ball models for atoms. The scientific approach is to use these complex models as the best possible descriptions of the physical world and try to get usable information out of them.

There are many engineering and scientific applications where simpler models, such as Newton's laws, are certainly sufficient—say, in structural design. Even here, for larger structures, the model is complex and the analysis difficult. In scientific fields outside statistics, answering questions is done by extracting information from increasingly complex and accurate models.

The approach I suggest is similar. In genetics, astronomy and many other current areas, statistics is needed to answer questions: construct the most accurate possible model, however complex, and then extract usable information from it.

Random forests is in use at some major drug companies whose statisticians were impressed by its ability to determine gene expression (variable importance) in microarray data. They were not concerned about its complexity or black-box appearance.

E2 Prediction

    Leo's paper overstates both its [prediction's] role, and our profession's lack of interest in it.... Most statistical surveys have the identification of causal factors as their ultimate role.

My point was that it is difficult to tell, using goodness-of-fit tests and residual analysis, how well a model fits the data. An estimate of its test set accuracy is a preferable assessment. If, for instance, a model gives predictive accuracy only slightly better than the "all survived" or other baseline estimates, we can't put much faith in its reliability in the identification of causal factors.

I agree that often "... statistical surveys have the identification of causal factors as their ultimate role." I would add that the more predictively accurate the model is, the more faith can be put into the variables that it fingers as important.
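To make the comparison with a baseline concrete, here is a small sketch of my own, on synthetic data and using scikit-learn; nothing in it comes from the discussion itself. It reports a model's test-set accuracy alongside the accuracy of the trivial rule that always predicts the majority class, the analogue of "all survived."

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.dummy import DummyClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # A synthetic, unbalanced two-class problem standing in for a real study.
    X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                               weights=[0.8, 0.2], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                        random_state=0)

    # Baseline: always predict the most frequent class (the "all survived" rule).
    baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    print("baseline test accuracy:", accuracy_score(y_test, baseline.predict(X_test)))
    print("model test accuracy:   ", accuracy_score(y_test, model.predict(X_test)))
    # If the model is only marginally above the baseline, little faith should be
    # put in what its coefficients seem to say about which variables matter.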
E3 Variable Importance

A significant and often overlooked point raised by Brad is what meaning can one give to statements that "variable X is important or not important." This has puzzled me on and off for quite a while. In fact, variable importance has always been defined operationally. In regression the "important" variables are defined by doing "best subsets" or variable deletion.

Another approach used in linear methods such as logistic regression and survival models is to compare the size of the slope estimate for a variable to its estimated standard error. The larger the ratio, the more "important" the variable. Both of these definitions can lead to erroneous conclusions.

My definition of variable importance is based on prediction. A variable might be considered important if deleting it seriously affects prediction accuracy. This brings up the problem that if two variables are highly correlated, deleting one or the other of them will not affect prediction accuracy. Deleting both of them may degrade accuracy considerably. The definition used in random forests spots both variables.

"Importance" does not yet have a satisfactory theoretical definition (I haven't been able to locate the article Brad references but I'll keep looking). It depends on the dependencies between the output variable and the input variables, and on the dependencies between the input variables. The problem begs for research.
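A prediction-based importance measure of the kind described above can be illustrated with a permutation scheme similar in spirit to, though not identical to, the one used in random forests: scramble one variable at a time and record how much test-set accuracy drops. The sketch below uses my own choices of data, model, and function names and is only meant to show the mechanics.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    def permutation_importance_scores(model, X_test, y_test, n_repeats=10, seed=0):
        # Importance of variable j = average drop in test-set accuracy when
        # column j is randomly permuted, breaking its link to the response
        # while leaving the rest of the data and the fitted model unchanged.
        rng = np.random.default_rng(seed)
        base = accuracy_score(y_test, model.predict(X_test))
        scores = np.zeros(X_test.shape[1])
        for j in range(X_test.shape[1]):
            drops = []
            for _ in range(n_repeats):
                X_perm = X_test.copy()
                X_perm[:, j] = rng.permutation(X_perm[:, j])
                drops.append(base - accuracy_score(y_test, model.predict(X_perm)))
            scores[j] = np.mean(drops)
        return scores

    # n_redundant creates correlated copies of informative variables; each copy
    # typically receives a nonzero score here, rather than one masking the other.
    X, y = make_classification(n_samples=2000, n_features=8, n_informative=3,
                               n_redundant=2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                        random_state=0)
    forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)
    print(np.round(permutation_importance_scores(forest, X_test, y_test), 3))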
E4 Other Reservations

    Sample sizes have swollen alarmingly while goals grow less distinct ("find interesting data structure").

I have not noticed any increasing fuzziness in goals, only that they have gotten more diverse. In the last two workshops I attended (genetics and astronomy) the goals in using the data were clearly laid out. "Searching for structure" is rarely seen even though data may be in the terabyte range.

    The new algorithms often appear in the form of black boxes with enormous numbers of adjustable parameters ("knobs to twiddle").

This is a perplexing statement and perhaps I don't understand what Brad means. Random forests has only one adjustable parameter that needs to be set for a run, is insensitive to the value of this parameter over a wide range, and has a quick and simple way for determining a good value. Support vector machines depend on the settings of 1–2 parameters. Other algorithmic models are similarly sparse in the number of knobs that have to be twiddled.

    New methods always look better than old ones.... Complicated models are harder to criticize than simple ones.

In 1992 I went to my first NIPS conference. At that time, the exciting algorithmic methodology was neural nets. My attitude was grim skepticism. Neural nets had been given too much hype, just as AI had been given and had failed expectations. I came away a believer. Neural nets delivered on the bottom line! In talk after talk, in problem after problem, neural nets were being used to solve difficult prediction problems with test set accuracies better than anything I had seen up to that time.

My attitude toward new and/or complicated methods is pragmatic. Prove that you've got a better mousetrap and I'll buy it. But the proof had better be concrete and convincing.

Brad questions where the bias and variance have gone. It is surprising when, trained in classical bias-variance terms and convinced of the curse of dimensionality, one encounters methods that can handle thousands of variables with little loss of accuracy. It is not voodoo statistics; there is some simple theory that illuminates the behavior of random forests (Breiman, 1999). I agree that more theoretical work is needed to increase our understanding.

Brad is an innovative and flexible thinker who has contributed much to our field. He is opportunistic in problem solving and may, perhaps not overtly, already have algorithmic modeling in his bag of tools.

BRUCE HOADLEY

I thank Bruce Hoadley for his description of the algorithmic procedures developed at Fair, Isaac since the 1960s. They sound like people I would enjoy working with. Bruce makes two points of mild contention. One is the following:

    High performance (predictive accuracy) on the test sample does not guarantee high performance on future samples; things do change.

I agree—algorithmic models accurate in one context must be modified to stay accurate in others. This does not necessarily imply that the way the model is constructed needs to be altered, but that data gathered in the new context should be used in the construction.

His other point of contention is that the Fair, Isaac algorithm retains interpretability, so that it is possible to have both accuracy and interpretability. For clients who like to know what's going on, that's a sellable item. But developments in algorithmic modeling indicate that the Fair, Isaac algorithm is an exception.

A computer scientist working in the machine learning area joined a large money management company some years ago and set up a group to do portfolio management using stock predictions given by large neural nets. When we visited, I asked how he explained the neural nets to clients. "Simple," he said; "We fit binary trees to the inputs and outputs of the neural nets and show the trees to the clients. Keeps them happy!" In both stock prediction and credit rating, the priority is accuracy. Interpretability is a secondary goal that can be finessed.
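Fitting binary trees to the inputs and outputs of a black-box predictor is what is now often called a surrogate model. The sketch below is my own minimal reconstruction of that idea on synthetic data, not the firm's system: train an opaque model, then fit a shallow decision tree to the opaque model's predictions so there is something simple to show, while the opaque model keeps making the actual predictions.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = make_classification(n_samples=3000, n_features=10, n_informative=5,
                               random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                        random_state=0)

    # The accurate but opaque model that actually produces the predictions.
    black_box = MLPClassifier(hidden_layer_sizes=(50, 50), max_iter=500,
                              random_state=0).fit(X_train, y_train)

    # A shallow binary tree fit to the black box's outputs, used only to explain
    # roughly what the black box is doing; the black box keeps doing the predicting.
    surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
    surrogate.fit(X_train, black_box.predict(X_train))

    print("black-box test accuracy:", black_box.score(X_test, y_test))
    print("surrogate agreement with black box:",
          (surrogate.predict(X_test) == black_box.predict(X_test)).mean())
    print(export_text(surrogate))  # the human-readable summary shown to clients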

MANNY PARZEN

Manny Parzen opines that there are not two but many modeling cultures. This is not an issue I want to fiercely contest. I like my division because it is pretty clear cut—are you modeling the inside of the box or not? For instance, I would include Bayesians in the data modeling culture. I will keep my eye on the quantile culture to see what develops.

Most of all, I appreciate Manny's openness to the issues raised in my paper. With the rapid changes in the scope of statistical problems, more open and concrete discussion of what works and what doesn't should be welcomed.

WHERE ARE WE HEADING?

Many of the best statisticians I have talked to over the past years have serious concerns about the viability of statistics as a field. Oddly, we are in a period where there has never been such a wealth of new statistical problems and sources of data. The danger is that if we define the boundaries of our field in terms of familiar tools and familiar problems, we will fail to grasp the new opportunities.

ADDITIONAL REFERENCES

Beveridge, W. V. I. (1952) The Art of Scientific Investigation. Heinemann, London.
Breiman, L. (1994) The 1990 Census adjustment: undercount or bad data (with discussion)? Statist. Sci. 9 458–475.
Cox, D. R. and Wermuth, N. (1996) Multivariate Dependencies. Chapman and Hall, London.
Efron, B. and Gong, G. (1983) A leisurely look at the bootstrap, the jackknife, and cross-validation. Amer. Statist. 37 36–48.
Efron, B. and Tibshirani, R. (1996) Improvements on cross-validation: the 632+ rule. J. Amer. Statist. Assoc. 91 548–560.
Efron, B. and Tibshirani, R. (1998) The problem of regions. Ann. Statist. 26 1287–1318.
Gong, G. (1982) Cross-validation, the jackknife, and the bootstrap: excess error estimation in forward logistic regression. Ph.D. dissertation, Stanford Univ.
MacKay, R. J. and Oldford, R. W. (2000) Scientific method, statistical method, and the speed of light. Statist. Sci. 15 224–253.
Parzen, E. (1979) Nonparametric statistical data modeling (with discussion). J. Amer. Statist. Assoc. 74 105–131.
Stone, C. (1977) Consistent nonparametric regression. Ann. Statist. 5 595–645.