Statistical Science 2001, Vol. 16, No. 3, 199–231

Statistical Modeling: The Two Cultures

Leo Breiman

Abstract. There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models. This commitment has led to irrelevant theory, questionable conclusions, and has kept statisticians from working on a large range of interesting current problems. Algorithmic modeling, both in theory and practice, has developed rapidly in fields outside statistics. It can be used both on large complex data sets and as a more accurate and informative alternative to data modeling on smaller data sets. If our goal as a field is to use data to solve problems, then we need to move away from exclusive dependence on data models and adopt a more diverse set of tools.

Leo Breiman is Professor, Department of Statistics, University of California, Berkeley, California 94720-4735 (e-mail: [email protected]).

1. INTRODUCTION

Statistics starts with data. Think of the data as being generated by a black box in which a vector of input variables x (independent variables) go in one side, and on the other side the response variables y come out. Inside the black box, nature functions to associate the predictor variables with the response variables, so the picture is like this:

y ← nature ← x

There are two goals in analyzing the data:

Prediction. To be able to predict what the responses are going to be to future input variables;
Information. To extract some information about how nature is associating the response variables to the input variables.

There are two different approaches toward these goals:

The Data Modeling Culture

The analysis in this culture starts with assuming a stochastic data model for the inside of the black box. For example, a common data model is that data are generated by independent draws from

response variables = f(predictor variables, random noise, parameters)

The values of the parameters are estimated from the data and the model then used for information and/or prediction. Thus the black box is filled in like this:

y ← [linear regression, logistic regression, Cox model] ← x

Model validation. Yes–no using goodness-of-fit tests and residual examination.
Estimated culture population. 98% of all statisticians.

The Algorithmic Modeling Culture

The analysis in this culture considers the inside of the box complex and unknown. Their approach is to find a function f(x)—an algorithm that operates on x to predict the responses y. Their black box looks like this:

y ← unknown ← x

y ← [decision trees, neural nets] ← x

Model validation. Measured by predictive accuracy.
Estimated culture population. 2% of statisticians, many in other fields.

In this paper I will argue that the focus in the statistical community on data models has:

• Led to irrelevant theory and questionable scientific conclusions;


• Kept statisticians from using more suitable algorithmic models;
• Prevented statisticians from working on exciting new problems;

I will also review some of the interesting new developments in algorithmic modeling and look at applications to three data sets.

2. ROAD MAP

It may be revealing to understand how I became a member of the small second culture. After a seven-year stint as an academic probabilist, I resigned and went into full-time free-lance consulting. After thirteen years of consulting I joined the Berkeley Statistics Department in 1980 and have been there since. My experiences as a consultant formed my views about algorithmic modeling. Section 3 describes two of the projects I worked on. These are given to show how my views grew from such problems.

When I returned to the university and began reading statistical journals, the research was distant from what I had done as a consultant. All articles begin and end with data models. My observations about published theoretical research in statistics are in Section 4.

Data modeling has given the statistics field many successes in analyzing data and getting information about the mechanisms producing the data. But there is also misuse leading to questionable conclusions about the underlying mechanism. This is reviewed in Section 5. Following that is a discussion (Section 6) of how the commitment to data modeling has prevented statisticians from entering new scientific and commercial fields where the data being gathered is not suitable for analysis by data models.

In the past fifteen years, the growth in algorithmic modeling applications and methodology has been rapid. It has occurred largely outside statistics in a new community—often called machine learning—that is mostly young computer scientists (Section 7). The advances, particularly over the last five years, have been startling. Three of the most important changes in perception to be learned from these advances are described in Sections 8, 9, and 10, and are associated with the following names:

Rashomon: the multiplicity of good models;
Occam: the conflict between simplicity and accuracy;
Bellman: dimensionality—curse or blessing?

Section 11 is titled "Information from a Black Box" and is important in showing that an algorithmic model can produce more and more reliable information about the structure of the relationship between inputs and outputs than data models. This is illustrated using two medical data sets and a genetic data set. A glossary at the end of the paper explains terms that not all statisticians may be familiar with.

3. PROJECTS IN CONSULTING

As a consultant I designed and helped supervise surveys for the Environmental Protection Agency (EPA) and the state and federal court systems. Controlled experiments were designed for the EPA, and I analyzed traffic data for the U.S. Department of Transportation and the California Transportation Department. Most of all, I worked on a diverse set of prediction projects. Here are some examples:

Predicting next-day ozone levels.
Using mass spectra to identify halogen-containing compounds.
Predicting the class of a ship from high altitude radar returns.
Using sonar returns to predict the class of a submarine.
Identity of hand-sent Morse Code.
Toxicity of chemicals.
On-line prediction of the cause of a freeway traffic breakdown.
Speech recognition.
The sources of delay in criminal trials in state court systems.

To understand the nature of these problems and the approaches taken to solve them, I give a fuller description of the first two on the list.

3.1 The Ozone Project

In the mid- to late 1960s ozone levels became a serious health problem in the Los Angeles Basin. Three different alert levels were established. At the highest, all government workers were directed not to drive to work, children were kept off playgrounds and outdoor exercise was discouraged.

The major source of ozone at that time was automobile tailpipe emissions. These rose into the low atmosphere and were trapped there by an inversion layer. A complex chemical reaction, aided by sunlight, cooked away and produced ozone two to three hours after the morning commute hours. The alert warnings were issued in the morning, but would be more effective if they could be issued 12 hours in advance. In the mid-1970s, the EPA funded a large effort to see if ozone levels could be accurately predicted 12 hours in advance.

Commuting patterns in the Los Angeles Basin are regular, with the total variation in any given daylight hour varying only a few percent from one weekday to another. With the total amount of emissions about constant, the resulting ozone levels depend on the meteorology of the preceding days. A large data base was assembled consisting of lower and upper air measurements at U.S. weather stations as far away as Oregon and Arizona, together with hourly readings of surface temperature, humidity, and wind speed at the dozens of air pollution stations in the Basin and nearby areas.

Altogether, there were daily and hourly readings of over 450 meteorological variables for a period of seven years, with corresponding hourly values of ozone and other pollutants in the Basin. Let x be the predictor vector of meteorological variables on the nth day. There are more than 450 variables in x since information several days back is included. Let y be the ozone level on the n + 1st day. Then the problem was to construct a function f(x) such that for any future day and future predictor variables x for that day, f(x) is an accurate predictor of the next day's ozone level y.

To estimate predictive accuracy, the first five years of data were used as the training set. The last two years were set aside as a test set. The algorithmic modeling methods available in the pre-1980s decades seem primitive now. In this project large linear regressions were run, followed by variable selection. Quadratic terms in, and interactions among, the retained variables were added and variable selection used again to prune the equations. In the end, the project was a failure—the false alarm rate of the final predictor was too high. I have regrets that this project can't be revisited with the tools available today.

3.2 The Chlorine Project

The EPA samples thousands of compounds a year and tries to determine their potential toxicity. In the mid-1970s, the standard procedure was to measure the mass spectra of the compound and to try to determine its chemical structure from its mass spectra.

Measuring the mass spectra is fast and cheap. But the determination of chemical structure from the mass spectra requires a painstaking examination by a trained chemist. The cost and availability of enough chemists to analyze all of the mass spectra produced daunted the EPA. Many toxic compounds contain halogens. So the EPA funded a project to determine if the presence of chlorine in a compound could be reliably predicted from its mass spectra.

Mass spectra are produced by bombarding the compound with ions in the presence of a magnetic field. The molecules of the compound split and the lighter fragments are bent more by the magnetic field than the heavier. Then the fragments hit an absorbing strip, with the position of the fragment on the strip determined by the molecular weight of the fragment. The intensity of the exposure at that position measures the frequency of the fragment. The resultant mass spectra has numbers reflecting frequencies of fragments from molecular weight 1 up to the molecular weight of the original compound. The peaks correspond to frequent fragments and there are many zeroes. The available data base consisted of the known chemical structure and mass spectra of 30,000 compounds.

The mass spectrum predictor vector x is of variable dimensionality. Molecular weight in the data base varied from 30 to over 10,000. The variable to be predicted is

y = 1: contains chlorine,
y = 2: does not contain chlorine.

The problem is to construct a function f(x) that is an accurate predictor of y where x is the mass spectrum of the compound.

To measure predictive accuracy the data set was randomly divided into a 25,000 member training set and a 5,000 member test set. Linear discriminant analysis was tried, then quadratic discriminant analysis. These were difficult to adapt to the variable dimensionality. By this time I was thinking about decision trees. The hallmarks of chlorine in mass spectra were researched. This domain knowledge was incorporated into the decision tree algorithm by the design of the set of 1,500 yes–no questions that could be applied to a mass spectra of any dimensionality. The result was a decision tree that gave 95% accuracy on both chlorines and nonchlorines (see Breiman, Friedman, Olshen and Stone, 1984).

3.3 Perceptions on Statistical Analysis

As I left consulting to go back to the university, these were the perceptions I had about working with data to find answers to problems:

(a) Focus on finding a good solution—that's what consultants get paid for.
(b) Live with the data before you plunge into modeling.
(c) Search for a model that gives a good solution, either algorithmic or data.
(d) Predictive accuracy on test sets is the criterion for how good the model is (illustrated in the sketch below).
(e) Computers are an indispensable partner.
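Perception (d) is easy to make concrete in code. The following is a minimal sketch, not a reconstruction of the ozone or chlorine analyses: the synthetic data and the scikit-learn estimator are stand-ins, and only the discipline of holding out a test set is the point.

```python
# A minimal sketch of perception (d): judge a model by its error on data it
# never saw during fitting. The synthetic data and scikit-learn estimator are
# stand-ins, not the EPA ozone or mass-spectra data described above.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=0)

# Set aside a test set before any modeling decisions are made.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25,
                                                    random_state=0)

model = DecisionTreeClassifier(max_depth=6, random_state=0).fit(X_train, y_train)

train_error = 1.0 - model.score(X_train, y_train)   # optimistic
test_error = 1.0 - model.score(X_test, y_test)      # the criterion that counts
print(f"training error {train_error:.3f}, test error {test_error:.3f}")
```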

4. RETURN TO THE UNIVERSITY

I had one tip about what research in the university was like. A friend of mine, a prominent statistician from the Berkeley Statistics Department, visited me in Los Angeles in the late 1970s. After I described the decision tree method to him, his first question was, "What's the model for the data?"

4.1 Statistical Research

Upon my return, I started reading the Annals of Statistics, the flagship journal of theoretical statistics, and was bemused. Every article started with

Assume that the data are generated by the following model: …

followed by mathematics exploring inference, hypothesis testing and asymptotics. There is a wide spectrum of opinion regarding the usefulness of the theory published in the Annals of Statistics to the field of statistics as a science that deals with data. I am at the very low end of the spectrum. Still, there have been some gems that have combined nice theory and significant applications. An example is wavelet theory. Even in applications, data models are universal. For instance, in the Journal of the American Statistical Association (JASA), virtually every article contains a statement of the form:

Assume that the data are generated by the following model: …

I am deeply troubled by the current and past use of data models in applications, where quantitative conclusions are drawn and perhaps policy decisions made.

5. THE USE OF DATA MODELS

Statisticians in applied research consider data modeling as the template for statistical analysis: Faced with an applied problem, think of a data model. This enterprise has at its heart the belief that a statistician, by imagination and by looking at the data, can invent a reasonably good parametric class of models for a complex mechanism devised by nature. Then parameters are estimated and conclusions are drawn. But when a model is fit to data to draw quantitative conclusions:

• The conclusions are about the model's mechanism, and not about nature's mechanism.

It follows that:

• If the model is a poor emulation of nature, the conclusions may be wrong.

These truisms have often been ignored in the enthusiasm for fitting data models. A few decades ago, the commitment to data models was such that even simple precautions such as residual analysis or goodness-of-fit tests were not used. The belief in the infallibility of data models was almost religious. It is a strange phenomenon—once a model is made, then it becomes truth and the conclusions from it are infallible.

5.1 An Example

I illustrate with a famous (also infamous) example: assume the data is generated by independent draws from the model

(R)   y = b0 + Σ_{m=1}^{M} bm xm + ε,

where the coefficients bm are to be estimated, ε is N(0, σ²) and σ² is to be estimated. Given that the data is generated this way, elegant tests of hypotheses, confidence intervals, distributions of the residual sum-of-squares and asymptotics can be derived. This made the model attractive in terms of the mathematics involved. This theory was used both by academic statisticians and others to derive significance levels for coefficients on the basis of model (R), with little consideration as to whether the data on hand could have been generated by a linear model. Hundreds, perhaps thousands of articles were published claiming proof of something or other because the coefficient was significant at the 5% level.

Goodness-of-fit was demonstrated mostly by giving the value of the multiple correlation coefficient R², which was often closer to zero than one and which could be over inflated by the use of too many parameters. Besides computing R², nothing else was done to see if the observational data could have been generated by model (R). For instance, a study was done several decades ago by a well-known member of a university statistics department to assess whether there was gender discrimination in the salaries of the faculty. All personnel files were examined and a data base set up which consisted of salary as the response variable and 25 other variables which characterized academic performance; that is, papers published, quality of journals published in, teaching record, evaluations, etc. Gender appears as a binary predictor variable.

A linear regression was carried out on the data and the gender coefficient was significant at the 5% level. That this was strong evidence of sex discrimination was accepted as gospel. The design of the study raises issues that enter before the consideration of a model—Can the data gathered answer the question posed? Is inference justified when your sample is the entire population? Should a data model be used? The deficiencies in analysis occurred because the focus was on the model and not on the problem.

The linear regression model led to many erroneous conclusions that appeared in journal articles waving the 5% significance level without knowing whether the model fit the data. Nowadays, I think most statisticians will agree that this is a suspect way to arrive at conclusions. At the time, there were few objections from the statistical profession about the fairy-tale aspect of the procedure. But, hidden in an elementary textbook, Mosteller and Tukey (1977) discuss many of the fallacies possible in regression and write "The whole area of guided regression is fraught with intellectual, statistical, computational, and subject matter difficulties."

Even currently, there are only rare published critiques of the uncritical use of data models. One of the few is David Freedman, who examines the use of regression models (1994), the use of path models (1987) and data modeling (1991, 1995). The analysis in these papers is incisive.

5.2 Problems in Current Data Modeling

Current applied practice is to check the data model fit using goodness-of-fit tests and residual analysis. At one point, some years ago, I set up a simulated regression problem in seven dimensions with a controlled amount of nonlinearity. Standard tests of goodness-of-fit did not reject linearity until the nonlinearity was extreme. Recent theory supports this conclusion. Work by Bickel, Ritov and Stoker (2001) shows that goodness-of-fit tests have very little power unless the direction of the alternative is precisely specified. The implication is that omnibus goodness-of-fit tests, which test in many directions simultaneously, have little power, and will not reject until the lack of fit is extreme.

Furthermore, if the model is tinkered with on the basis of the data, that is, if variables are deleted or nonlinear combinations of the variables added, then goodness-of-fit tests are not applicable. Residual analysis is similarly unreliable. In a discussion after a presentation of residual analysis in a seminar at Berkeley in 1993, William Cleveland, one of the fathers of residual analysis, admitted that it could not uncover lack of fit in more than four to five dimensions. The papers I have read on using residual analysis to check lack of fit are confined to data sets with two or three variables.

With higher dimensions, the interactions between the variables can produce passable residual plots for a variety of models. A residual plot is a goodness-of-fit test, and lacks power in more than a few dimensions. An acceptable residual plot does not imply that the model is a good fit to the data.

There are a variety of ways of analyzing residuals. For instance, Landwehr, Pregibon and Shoemaker (1984, with discussion) gives a detailed analysis of fitting a logistic model to a three-variable data set using various residual plots. But each of the four discussants present other methods for the analysis. One is left with an unsettled sense about the arbitrariness of residual analysis.

Misleading conclusions may follow from data models that pass goodness-of-fit tests and residual checks. But published applications to data often show little care in checking model fit using these methods or any other. For instance, many of the current application articles in JASA that fit data models have very little discussion of how well their model fits the data. The question of how well the model fits the data is of secondary importance compared to the construction of an ingenious stochastic model.

5.3 The Multiplicity of Data Models

One goal of statistics is to extract information from the data about the underlying mechanism producing the data. The greatest plus of data modeling is that it produces a simple and understandable picture of the relationship between the input variables and responses. For instance, logistic regression in classification is frequently used because it produces a linear combination of the variables with weights that give an indication of the variable importance. The end result is a simple picture of how the prediction variables affect the response variable plus confidence intervals for the weights. Suppose two statisticians, each one with a different approach to data modeling, fit a model to the same data set. Assume also that each one applies standard goodness-of-fit tests, looks at residuals, etc., and is convinced that their model fits the data. Yet the two models give different pictures of nature's mechanism and lead to different conclusions.

McCullagh and Nelder (1989) write "Data will often point with almost equal emphasis on several possible models, and it is important that the statistician recognize and accept this." Well said, but different models, all of them equally good, may give different pictures of the relation between the predictor and response variables. The question of which one most accurately reflects the data is difficult to resolve. One reason for this multiplicity is that goodness-of-fit tests and other methods for checking fit give a yes–no answer. With the lack of power of these tests with data having more than a small number of dimensions, there will be a large number of models whose fit is acceptable. There is no way, among the yes–no methods for gauging fit, of determining which is the better model. A few statisticians know this. Mountain and Hsiao (1989) write, "It is difficult to formulate a comprehensive model capable of encompassing all rival models. Furthermore, with the use of finite samples, there are dubious implications with regard to the validity and power of various encompassing tests that rely on asymptotic theory."

Data models in current use may have more damaging results than the publications in the social sciences based on a linear regression analysis. Just as the 5% level of significance became a de facto standard for publication, the Cox model for the analysis of survival times and logistic regression for survive–nonsurvive data have become the de facto standard for publication in medical journals. That different survival models, equally well fitting, could give different conclusions is not an issue.

5.4 Predictive Accuracy

The most obvious way to see how well the model box emulates nature's box is this: put a case x down nature's box getting an output y. Similarly, put the same case x down the model box getting an output ŷ. The closeness of y and ŷ is a measure of how good the emulation is. For a data model, this translates as: fit the parameters in your model by using the data, then, using the model, predict the data and see how good the prediction is.

Prediction is rarely perfect. There are usually many unmeasured variables whose effect is referred to as "noise." But the extent to which the model box emulates nature's box is a measure of how well our model can reproduce the natural phenomenon producing the data.

McCullagh and Nelder (1989) in their book on generalized linear models also think the answer is obvious. They write, "At first sight it might seem as though a good model is one that fits the data very well; that is, one that makes µ̂ (the model predicted value) very close to y (the response value)." Then they go on to note that the extent of the agreement is biased by the number of parameters used in the model and so is not a satisfactory measure. They are, of course, right. If the model has too many parameters, then it may overfit the data and give a biased estimate of accuracy. But there are ways to remove the bias. To get a more unbiased estimate of predictive accuracy, cross-validation can be used, as advocated in an important early work by Stone (1974). If the data set is larger, put aside a test set.

Mosteller and Tukey (1977) were early advocates of cross-validation. They write, "Cross-validation is a natural route to the indication of the quality of any data-derived quantity. We plan to cross-validate carefully wherever we can."

Judging by the infrequency of estimates of predictive accuracy in JASA, this measure of model fit that seems natural to me (and to Mosteller and Tukey) is not natural to others. More publication of predictive accuracy estimates would establish standards for comparison of models, a practice that is common in machine learning.
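The cross-validation estimate just described is easy to sketch. The example below is illustrative only: it uses a standard synthetic benchmark and two generic models rather than anything from the paper, and the point is simply that the same out-of-sample yardstick applies to a data model and an algorithmic model alike.

```python
# A minimal sketch of the cross-validation estimate of predictive accuracy
# that Section 5.4 points to (Stone, 1974; Mosteller and Tukey, 1977). The
# data come from a standard synthetic benchmark, not from the paper, and the
# two models are only illustrative.
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = make_friedman1(n_samples=500, noise=1.0, random_state=1)

for name, model in [("linear regression", LinearRegression()),
                    ("random forest", RandomForestRegressor(n_estimators=200,
                                                            random_state=1))]:
    # Each fold is predicted by a model fit on the other nine folds, so the
    # averaged score is an out-of-sample estimate of predictive accuracy.
    mse = -cross_val_score(model, X, y, cv=10, scoring="neg_mean_squared_error")
    print(f"{name}: 10-fold cross-validated MSE {mse.mean():.2f}")
```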

6. THE LIMITATIONS OF DATA MODELS

With the insistence on data models, multivariate analysis tools in statistics are frozen at discriminant analysis and logistic regression in classification and multiple linear regression in regression. Nobody really believes that multivariate data is multivariate normal, but that data model occupies a large number of pages in every graduate textbook on multivariate statistical analysis.

With data gathered from uncontrolled observations on complex systems involving unknown physical, chemical, or biological mechanisms, the a priori assumption that nature would generate the data through a parametric model selected by the statistician can result in questionable conclusions that cannot be substantiated by appeal to goodness-of-fit tests and residual analysis. Usually, simple parametric models imposed on data generated by complex systems, for example, medical data, financial data, result in a loss of accuracy and information as compared to algorithmic models (see Section 11).

There is an old saying "If all a man has is a hammer, then every problem looks like a nail." The trouble for statisticians is that recently some of the problems have stopped looking like nails. I conjecture that the result of hitting this wall is that more complicated data models are appearing in current published applications. Bayesian methods combined with Markov Chain Monte Carlo are cropping up all over. This may signify that as data becomes more complex, the data models become more cumbersome and are losing the advantage of presenting a simple and clear picture of nature's mechanism.

Approaching problems by looking for a data model imposes an a priori straight jacket that restricts the ability of statisticians to deal with a wide range of statistical problems. The best available solution to a data problem might be a data model; then again it might be an algorithmic model. The data and the problem guide the solution. To solve a wider range of data problems, a larger set of tools is needed.

Perhaps the most damaging consequence of the insistence on data models is that statisticians have ruled themselves out of some of the most interesting and challenging statistical problems that have arisen out of the rapidly increasing ability of computers to store and manipulate data. These problems are increasingly present in many fields, both scientific and commercial, and solutions are being found by nonstatisticians.

7. ALGORITHMIC MODELING

Under other names, algorithmic modeling has been used by industrial statisticians for decades. See, for instance, the delightful book "Fitting Equations to Data" (Daniel and Wood, 1971). It has been used by psychometricians and social scientists. Reading a preprint of Gifi's book (1990) many years ago uncovered a kindred spirit. It has made small inroads into the analysis of medical data starting with Richard Olshen's work in the early 1980s. For further work, see Zhang and Singer (1999). Jerome Friedman and Grace Wahba have done pioneering work on the development of algorithmic methods. But the list of statisticians in the algorithmic modeling business is short, and applications to data are seldom seen in the journals. The development of algorithmic methods was taken up by a community outside statistics.

7.1 A New Research Community

In the mid-1980s two powerful new algorithms for fitting data became available: neural nets and decision trees. A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.

Their interests range over many fields that were once considered happy hunting grounds for statisticians and have turned out thousands of interesting research papers related to applications and methodology. A large majority of the papers analyze real data. The criterion for any model is what is the predictive accuracy. An idea of the range of research of this group can be got by looking at the Proceedings of the Neural Information Processing Systems Conference (their main yearly meeting) or at the Machine Learning Journal.

7.2 Theory in Algorithmic Modeling

Data models are rarely used in this community. The approach is that nature produces data in a black box whose insides are complex, mysterious, and, at least, partly unknowable. What is observed is a set of x's that go in and a subsequent set of y's that come out. The problem is to find an algorithm f(x) such that for future x in a test set, f(x) will be a good predictor of y.

The theory in this field shifts focus from data models to the properties of algorithms. It characterizes their "strength" as predictors, convergence if they are iterative, and what gives them good predictive accuracy. The one assumption made in the theory is that the data is drawn i.i.d. from an unknown multivariate distribution.

There is isolated work in statistics where the focus is on the theory of the algorithms. Grace Wahba's research on smoothing spline algorithms and their applications to data (using cross-validation) is built on theory involving reproducing kernels in Hilbert Space (1990). The final chapter of the CART book (Breiman et al., 1984) contains a proof of the asymptotic convergence of the CART algorithm to the Bayes risk by letting the trees grow as the sample size increases. There are others, but the relative frequency is small.

Theory resulted in a major advance in machine learning. Vladimir Vapnik constructed informative bounds on the generalization error (infinite test set error) of classification algorithms which depend on the "capacity" of the algorithm. These theoretical bounds led to support vector machines (see Vapnik, 1995, 1998) which have proved to be more accurate predictors in classification and regression than neural nets, and are the subject of heated current research (see Section 10).

My last paper "Some infinity theory for tree ensembles" (Breiman, 2000) uses a function space analysis to try and understand the workings of tree ensemble methods. One section has the heading, "My kingdom for some good theory." There is an effective method for forming ensembles known as "boosting," but there isn't any finite sample size theory that tells us why it works so well.

7.3 Recent Lessons

The advances in methodology and increases in predictive accuracy since the mid-1980s that have occurred in the research of machine learning have been phenomenal. There have been particularly exciting developments in the last five years. What has been learned? The three lessons that seem most important to me:

Rashomon: the multiplicity of good models;
Occam: the conflict between simplicity and accuracy;
Bellman: dimensionality—curse or blessing.

8. RASHOMON AND THE MULTIPLICITY OF GOOD MODELS

Rashomon is a wonderful Japanese movie in which four people, from different vantage points, witness an incident in which one person dies and another is supposedly raped. When they come to testify in court, they all report the same facts, but their stories of what happened are very different.

What I call the Rashomon Effect is that there is often a multitude of different descriptions [equations f(x)] in a class of functions giving about the same minimum error rate. The most easily understood example is subset selection in linear regression. Suppose there are 30 variables and we want to find the best five variable linear regressions. There are about 140,000 five-variable subsets in competition. Usually we pick the one with the lowest residual sum-of-squares (RSS), or, if there is a test set, the lowest test error. But there may be (and generally are) many five-variable equations that have RSS within 1.0% of the lowest RSS (see Breiman, 1996a). The same is true if test set error is being measured.

So here are three possible pictures with RSS or test set error within 1.0% of each other:

Picture 1
y = 2.1 + 3.8x3 − 0.6x8 + 83.2x12 − 2.1x17 + 3.2x27

Picture 2
y = −8.9 + 4.6x5 + 0.01x6 + 12.0x15 + 17.5x21 + 0.2x22

Picture 3
y = −76.7 + 9.3x2 + 22.0x7 − 13.2x8 + 3.4x11 + 7.2x28

Which one is better? The problem is that each one tells a different story about which variables are important.

The Rashomon Effect also occurs with decision trees and neural nets. In my experiments with trees, if the training set is perturbed only slightly, say by removing a random 2–3% of the data, I can get a tree quite different from the original but with almost the same test set error. I once ran a small neural net 100 times on simple three-dimensional data reselecting the initial weights to be small and random on each run. I found 32 distinct minima, each of which gave a different picture and had about equal test set error.

This effect is closely connected to what I call instability (Breiman, 1996a) that occurs when there are many different models crowded together that have about the same training or test set error. Then a slight perturbation of the data or in the model construction will cause a skip from one model to another. The two models are close to each other in terms of error, but can be distant in terms of the form of the model.

If, in logistic regression or the Cox model, the common practice of deleting the less important covariates is carried out, then the model becomes unstable—there are too many competing models. Say you are deleting from 15 variables to 4 variables. Perturb the data slightly and you will very possibly get a different four-variable model and a different conclusion about which variables are important. To improve accuracy by weeding out less important covariates you run into the multiplicity problem. The picture of which covariates are important can vary significantly between two models having about the same deviance.

Aggregating over a large set of competing models can reduce the nonuniqueness while improving accuracy. Arena et al. (2000) bagged (see Glossary) logistic regression models on a data base of toxic and nontoxic chemicals where the number of covariates in each model was reduced from 15 to 4 by standard best subset selection. On a test set, the bagged model was significantly more accurate than the single model with four covariates. It is also more stable. This is one possible fix. The multiplicity problem and its effect on conclusions drawn from models needs serious attention.
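The subset-selection version of the Rashomon Effect can be reproduced in a few lines. The sketch below uses synthetic data (half the columns are noisy copies of the other half) rather than Breiman's simulations; it only counts how many four-variable equations sit within 1% of the best RSS while naming different variables.

```python
# A small numerical illustration of the Rashomon Effect in subset selection.
# Synthetic data, not Breiman's experiments: columns 6-11 are noisy copies of
# columns 0-5, so many competing subsets fit almost equally well.
from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=(n, 6))
X = np.column_stack([base, base + 0.5 * rng.normal(size=(n, 6))])
y = base @ np.array([1.0, -1.0, 0.8, 0.6, -0.5, 0.4]) + rng.normal(scale=1.5, size=n)

def rss(cols):
    A = np.column_stack([np.ones(n), X[:, cols]])          # intercept + subset
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ coef
    return float(r @ r)

subsets = list(combinations(range(X.shape[1]), 4))
errors = np.array([rss(s) for s in subsets])
close = [s for s, e in zip(subsets, errors) if e <= 1.01 * errors.min()]

print(f"{len(subsets)} four-variable subsets; {len(close)} within 1% of the best RSS")
for s in close[:5]:
    print("  competitive subset:", s)
```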

9. OCCAM AND SIMPLICITY VS. ACCURACY

Occam's Razor, long admired, is usually interpreted to mean that simpler is better. Unfortunately, in prediction, accuracy and simplicity (interpretability) are in conflict. For instance, linear regression gives a fairly interpretable picture of the y, x relation. But its accuracy is usually less than that of the less interpretable neural nets. An example closer to my work involves trees.

On interpretability, trees rate an A+. A project I worked on in the late 1970s was the analysis of delay in criminal cases in state court systems. The Constitution gives the accused the right to a speedy trial. The Center for the State Courts was concerned that in many states, the trials were anything but speedy. It funded a study of the causes of the delay. I visited many states and decided to do the analysis in Colorado, which had an excellent computerized court data system. A wealth of information was extracted and processed.

The dependent variable for each criminal case was the time from arraignment to the time of sentencing. All of the other information in the trial history were the predictor variables. A large decision tree was grown, and I showed it on an overhead and explained it to the assembled Colorado judges. One of the splits was on District N which had a larger delay time than the other districts. I refrained from commenting on this. But as I walked out I heard one judge say to another, "I knew those guys in District N were dragging their feet."

While trees rate an A+ on interpretability, they are good, but not great, predictors. Give them, say, a B on prediction.
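The A+ for interpretability is easy to see in code: a fitted tree is just a short list of readable splits. The sketch below uses scikit-learn's breast cancer data purely as a stand-in for the court-delay data, which is not publicly available.

```python
# A sketch of why trees score so high on interpretability: the fitted model is
# a short list of readable if-then splits. The breast cancer data set is only
# a stand-in for the court-delay data described above.
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier, export_text

data = load_breast_cancer()
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(data.data, data.target)

# Every root-to-leaf path reads as a rule a non-statistician can follow.
print(export_text(tree, feature_names=list(data.feature_names)))
```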

9.1 Growing Forests for Prediction

Instead of a single tree predictor, grow a forest of trees on the same data—say 50 or 100. If we are classifying, put the new x down each tree in the forest and get a vote for the predicted class. Let the forest prediction be the class that gets the most votes. There has been a lot of work in the last five years on ways to grow the forest. All of the well-known methods grow the forest by perturbing the training set, growing a tree on the perturbed training set, perturbing the training set again, growing another tree, etc. Some familiar methods are bagging (Breiman, 1996b), boosting (Freund and Schapire, 1996), arcing (Breiman, 1998), and additive logistic regression (Friedman, Hastie and Tibshirani, 1998).

My preferred method to date is random forests. In this approach successive decision trees are grown by introducing a random element into their construction. For example, suppose there are 20 predictor variables. At each node choose several of the 20 at random to use to split the node. Or use a random combination of a random selection of a few variables. This idea appears in Ho (1998), in Amit and Geman (1997) and is developed in Breiman (1999).
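The recipe just described can be made explicit in a short loop. This is a minimal sketch on synthetic data, not the reference implementation: scikit-learn's RandomForestClassifier packages the same ingredients, and the loop below only makes the bootstrap, the per-node random variable selection, and the vote visible.

```python
# A minimal sketch of the forest-growing recipe in Section 9.1: many trees,
# each grown on a bootstrap sample with a random subset of variables
# considered at every node, combined by a plurality vote. Synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=3000, n_features=20, n_informative=10,
                           random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=2)

rng = np.random.default_rng(2)
votes = []
for b in range(100):                                    # grow 100 trees
    boot = rng.integers(0, len(X_tr), size=len(X_tr))   # bootstrap the training set
    tree = DecisionTreeClassifier(max_features="sqrt",  # random variables per node
                                  random_state=b)
    tree.fit(X_tr[boot], y_tr[boot])
    votes.append(tree.predict(X_te))

forest_pred = (np.mean(votes, axis=0) > 0.5).astype(int)   # majority vote (2 classes)
single_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

print("single tree test error:", round(float(np.mean(single_tree.predict(X_te) != y_te)), 3))
print("forest test error:     ", round(float(np.mean(forest_pred != y_te)), 3))
```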
9.2 Forests Compared to Trees

We compare the performance of single trees (CART) to random forests on a number of small and large data sets, mostly from the UCI repository (ftp.ics.uci.edu/pub/MachineLearningDatabases). A summary of the data sets is given in Table 1.

Table 1. Data set descriptions

Data set      Training sample size   Test sample size   Variables   Classes
Cancer                 699                  —                9          2
Ionosphere             351                  —               34          2
Diabetes               768                  —                8          2
Glass                  214                  —                9          6
Soybean                683                  —               35         19
---------------------------------------------------------------------------
Letters             15,000               5,000              16         26
Satellite            4,435               2,000              36          6
Shuttle             43,500              14,500               9          7
DNA                  2,000               1,186              60          3
Digit                7,291               2,007             256         10

Table 2 compares the test set error of a single tree to that of the forest. For the five smaller data sets above the line, the test set error was estimated by leaving out a random 10% of the data, then running CART and the forest on the other 90%. The left-out 10% was run down the tree and the forest and the error on this 10% computed for both. This was repeated 100 times and the errors averaged. The larger data sets below the line came with a separate test set. People who have been in the classification field for a while find these increases in accuracy startling. Some errors are halved. Others are reduced by one-third. In regression, where the forest prediction is the average over the individual tree predictions, the decreases in mean-squared test set error are similar.

Table 2. Test set misclassification error (%)

Data set           Forest   Single tree
Breast cancer        2.9        5.9
Ionosphere           5.5       11.2
Diabetes            24.2       25.3
Glass               22.0       30.4
Soybean              5.7        8.6
----------------------------------------
Letters              3.4       12.4
Satellite            8.6       14.8
Shuttle (×10⁻³)      7.0       62.0
DNA                  3.9        6.2
Digit                6.2       17.1

9.3 Random Forests are A+ Predictors

The Statlog Project (Michie, Spiegelhalter and Taylor, 1994) compared 18 different classifiers. Included were neural nets, CART, linear and quadratic discriminant analysis, nearest neighbor, etc. The first four data sets below the line in Table 1 were the only ones used in the Statlog Project that came with separate test sets. In terms of rank of accuracy on these four data sets, the forest comes in 1, 1, 1, 1 for an average rank of 1.0. The next best classifier had an average rank of 7.3.

The fifth data set below the line consists of 16×16 pixel grayscale depictions of handwritten ZIP Code numerals. It has been extensively used by AT&T Bell Labs to test a variety of prediction methods. A neural net handcrafted to the data got a test set error of 5.1% vs. 6.2% for a standard run of random forest.

9.4 The Occam Dilemma

So forests are A+ predictors. But their mechanism for producing a prediction is difficult to understand. Trying to delve into the tangled web that generated a plurality vote from 100 trees is a Herculean task. So on interpretability, they rate an F. Which brings us to the Occam dilemma:

• Accuracy generally requires more complex prediction methods. Simple and interpretable functions do not make the most accurate predictors.

Using complex predictors may be unpleasant, but the soundest path is to go for predictive accuracy first, then try to understand why. In fact, Section 10 points out that from a goal-oriented statistical viewpoint, there is no Occam's dilemma. (For more on Occam's Razor see Domingos, 1998, 1999.)

10. BELLMAN AND THE CURSE OF DIMENSIONALITY

The title of this section refers to Richard Bellman's famous phrase, "the curse of dimensionality." For decades, the first step in prediction methodology was to avoid the curse. If there were too many prediction variables, the recipe was to find a few features (functions of the predictor variables) that "contain most of the information" and then use these features to replace the original variables. In procedures common in statistics such as regression, logistic regression and survival models the advised practice is to use variable deletion to reduce the dimensionality. The published advice was that high dimensionality is dangerous. For instance, a well-regarded book on pattern recognition (Meisel, 1972) states "the features must be relatively few in number." But recent work has shown that dimensionality can be a blessing.

10.1 Digging It Out in Small Pieces

Reducing dimensionality reduces the amount of information available for prediction. The more predictor variables, the more information. There is also information in various combinations of the predictor variables. Let's try going in the opposite direction:

• Instead of reducing dimensionality, increase it by adding many functions of the predictor variables.

There may now be thousands of features. Each potentially contains a small amount of information. The problem is how to extract and put together these little pieces of information. There are two outstanding examples of work in this direction, the Shape Recognition Forest (Y. Amit and D. Geman, 1997) and Support Vector Machines (V. Vapnik, 1995, 1998).

10.2 The Shape Recognition Forest

In 1992, the National Institute of Standards and Technology (NIST) set up a competition for machine algorithms to read handwritten numerals. They put together a large set of pixel pictures of handwritten numbers (223,000) written by over 2,000 individuals. The competition attracted wide interest, and diverse approaches were tried.

The Amit–Geman approach defined many thousands of small geometric features in a hierarchical assembly. Shallow trees are grown, such that at each node, 100 features are chosen at random from the appropriate level of the hierarchy; and the optimal split of the node based on the selected features is found.

When a pixel picture of a number is dropped down a single tree, the terminal node it lands in gives probability estimates p0, ..., p9 that it represents numbers 0, 1, ..., 9. Over 1,000 trees are grown, the estimates averaged over this forest, and the predicted number is assigned to the largest averaged estimate.

Using a 100,000 example training set and a 50,000 test set, the Amit–Geman method gives a test set error of 0.7%—close to the limits of human error.

10.3 Support Vector Machines

Suppose there is two-class data having prediction vectors in M-dimensional Euclidean space. The prediction vectors for class #1 are {x(1)} and those for class #2 are {x(2)}. If these two sets of vectors can be separated by a hyperplane then there is an optimal separating hyperplane. "Optimal" is defined as meaning that the distance of the hyperplane to any prediction vector is maximal (see below).

The set of vectors in {x(1)} and in {x(2)} that achieve the minimum distance to the optimal separating hyperplane are called the support vectors. Their coordinates determine the equation of the hyperplane. Vapnik (1995) showed that if a separating hyperplane exists, then the optimal separating hyperplane has low generalization error (see Glossary).

[Figure: the optimal separating hyperplane and its support vectors.]

In two-class data, separability by a hyperplane does not often occur. However, let us increase the dimensionality by adding as additional predictor variables all quadratic monomials in the original predictor variables; that is, all terms of the form x_{m1} x_{m2}. A hyperplane in the original variables plus quadratic monomials in the original variables is a more complex creature. The possibility of separation is greater. If no separation occurs, add cubic monomials as input features. If there are originally 30 predictor variables, then there are about 40,000 features if monomials up to the fourth degree are added.

The higher the dimensionality of the set of features, the more likely it is that separation occurs. In the ZIP Code data set, separation occurs with fourth degree monomials added. The test set error is 4.1%. Using a large subset of the NIST data base as a training set, separation also occurred after adding up to fourth degree monomials and gave a test set error rate of 1.1%.

Separation can always be had by raising the dimensionality high enough. But if the separating hyperplane becomes too complex, the generalization error becomes large. An elegant theorem (Vapnik, 1995) gives this bound for the expected generalization error:

E(GE) ≤ E(number of support vectors)/(N − 1),

where N is the sample size and the expectation is over all training sets of size N drawn from the same underlying distribution as the original training set.

The number of support vectors increases with the dimensionality of the feature space. If this number becomes too large, the separating hyperplane will not give low generalization error. If separation cannot be realized with a relatively small number of support vectors, there is another version of support vector machines that defines optimality by adding a penalty term for the vectors on the wrong side of the hyperplane.

Some ingenious algorithms make finding the optimal separating hyperplane computationally feasible. These devices reduce the search to a solution of a quadratic programming problem with linear inequality constraints that are of the order of the number N of cases, independent of the dimension of the feature space. Methods tailored to this particular problem produce speed-ups of an order of magnitude over standard methods for solving quadratic programming problems.

Support vector machines can also be used to provide accurate predictions in other areas (e.g., regression). It is an exciting idea that gives excellent performance and is beginning to supplant the use of neural nets. A readable introduction is in Cristianini and Shawe-Taylor (2000).
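A compact way to see the Section 10.3 construction at work is to fit a polynomial-kernel support vector machine, which implicitly works with all monomials up to the chosen degree. The sketch below is illustrative only: synthetic two-dimensional data and scikit-learn's SVC stand in for the ZIP Code experiments, and the support vector count is reported because it enters Vapnik's bound E(GE) ≤ E(#support vectors)/(N − 1).

```python
# A sketch of the Section 10.3 idea on synthetic data: no hyperplane separates
# the two classes in the original coordinates, but raising the degree of the
# polynomial kernel (i.e., adding monomial features implicitly) does the job.
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_circles(n_samples=600, noise=0.08, factor=0.5, random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=3)

for degree in (1, 2, 4):
    # kernel="poly" works implicitly with all monomials up to the given degree.
    clf = SVC(kernel="poly", degree=degree, coef0=1.0, C=10.0).fit(X_tr, y_tr)
    err = 1.0 - clf.score(X_te, y_te)
    n_sv = int(clf.n_support_.sum())
    print(f"degree {degree}: test error {err:.3f}, "
          f"{n_sv} support vectors out of N = {len(X_tr)}")
```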

11. INFORMATION FROM A BLACK BOX

The dilemma posed in the last section is that the models that best emulate nature in terms of predictive accuracy are also the most complex and inscrutable. But this dilemma can be resolved by realizing the wrong question is being asked. Nature forms the outputs y from the inputs x by means of a black box with complex and unknown interior.

y ← nature ← x

Current accurate prediction methods are also complex black boxes.

y ← [neural nets, forests, support vectors] ← x

So we are facing two black boxes, where ours seems only slightly less inscrutable than nature's. In data generated by medical experiments, ensembles of predictors can give cross-validated error rates significantly lower than logistic regression. My biostatistician friends tell me, "Doctors can interpret logistic regression." There is no way they can interpret a black box containing fifty trees hooked together. In a choice between accuracy and interpretability, they'll go for interpretability.

Framing the question as the choice between accuracy and interpretability is an incorrect interpretation of what the goal of a statistical analysis is. The point of a model is to get useful information about the relation between the response and predictor variables. Interpretability is a way of getting information. But a model does not have to be simple to provide reliable information about the relation between predictor and response variables; neither does it have to be a data model.

• The goal is not interpretability, but accurate information.

The following three examples illustrate this point. The first shows that random forests applied to a medical data set can give more reliable information about covariate strengths than logistic regression. The second shows that it can give interesting information that could not be revealed by a logistic regression. The third is an application to microarray data where it is difficult to conceive of a data model that would uncover similar information.

11.1 Example I: Variable Importance in a Survival Data Set

The data set contains survival or nonsurvival of 155 hepatitis patients with 19 covariates. It is available at ftp.ics.uci.edu/pub/MachineLearningDatabases and was contributed by Gail Gong. The description is in a file called hepatitis.names. The data set has been previously analyzed by Diaconis and Efron (1983), and Cestnik, Konenenko and Bratko (1987). The lowest reported error rate to date, 17%, is in the latter paper.

Diaconis and Efron refer to work by Peter Gregory of the Stanford Medical School who analyzed this data and concluded that the important variables were numbers 6, 12, 14, 19 and reports an estimated 20% predictive accuracy. The variables were reduced in two stages—the first was by informal data analysis. The second refers to a more formal (unspecified) statistical procedure which I assume was logistic regression.

Efron and Diaconis drew 500 bootstrap samples from the original data set and used a similar procedure to isolate the important variables in each bootstrapped data set. The authors comment, "Of the four variables originally selected not one was selected in more than 60 percent of the samples. Hence the variables identified in the original analysis cannot be taken too seriously." We will come back to this conclusion later.

Logistic Regression

The predictive error rate for logistic regression on the hepatitis data set is 17.4%. This was evaluated by doing 100 runs, each time leaving out a randomly selected 10% of the data as a test set, and then averaging over the test set errors.

Usually, the initial evaluation of which variables are important is based on examining the absolute values of the coefficients of the variables in the logistic regression divided by their standard deviations. Figure 1 is a plot of these values.

The conclusion from looking at the standardized coefficients is that variables 7 and 11 are the most important covariates. When logistic regression is run using only these two variables, the cross-validated error rate rises to 22.9%. Another way to find important variables is to run a best subsets search which, for any value k, finds the subset of k variables having lowest deviance.

This procedure raises the problems of instability and multiplicity of models (see Section 7.1). There are about 4,000 subsets containing four variables. Of these, there are almost certainly a substantial number that have deviance close to the minimum and give different pictures of what the underlying mechanism is.
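The bootstrap experiment of Efron and Diaconis is easy to imitate in outline. The sketch below is not their procedure (which is unspecified in the source) and does not use the hepatitis data; it uses a generic selection rule on synthetic data only to show how selection frequencies over bootstrap samples expose instability.

```python
# A sketch in the spirit of the Diaconis-Efron bootstrap experiment described
# above. The selection rule (keep the four variables with the largest absolute
# standardized coefficients) is a generic stand-in for the original unspecified
# procedure, and the data are synthetic rather than the hepatitis data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(4)
X, y = make_classification(n_samples=155, n_features=19, n_informative=6,
                           n_redundant=4, random_state=4)
X = StandardScaler().fit_transform(X)

counts = np.zeros(X.shape[1])
for _ in range(500):                      # 500 bootstrap samples, as in the paper
    idx = rng.integers(0, len(X), size=len(X))
    coef = LogisticRegression(max_iter=1000).fit(X[idx], y[idx]).coef_[0]
    counts[np.argsort(-np.abs(coef))[:4]] += 1

freq = counts / 500
print("selection frequency per variable:")
print(np.round(freq, 2))
print("variables chosen in >60% of samples:", np.flatnonzero(freq > 0.6))
```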

Fig. 1. Standardized coefficients, logistic regression.

Fig. 2. Variable importance—random forest (percent increase in error).

Random Forests

The random forests predictive error rate, evaluated by averaging errors over 100 runs, each time leaving out 10% of the data as a test set, is 12.3%—almost a 30% reduction from the logistic regression error.

Random forests consists of a large number of randomly constructed trees, each voting for a class. Similar to bagging (Breiman, 1996), a bootstrap sample of the training set is used to construct each tree. A random selection of the input variables is searched to find the best split for each node.

To measure the importance of the mth variable, the values of the mth variable are randomly permuted in all of the cases left out in the current bootstrap sample. Then these cases are run down the current tree and their classification noted. At the end of a run consisting of growing many trees, the percent increase in misclassification rate due to noising up each variable is computed. This is the measure of variable importance that is shown in Figure 2.

Random forests singles out two variables, the 12th and the 17th, as being important. As a verification both variables were run in random forests, individually and together. The test set error rates over 100 replications were 14.3% each. Running both together did no better. We conclude that virtually all of the predictive capability is provided by a single variable, either 12 or 17.

To explore the interaction between 12 and 17 a bit further, at the end of a random forest run using all variables, the output includes the estimated value of the probability of each class vs. the case number. This information is used to get plots of the variable values (normalized to mean zero and standard deviation one) vs. the probability of death. The variable values are smoothed using a weighted linear regression smoother. The results are in Figure 3 for variables 12 and 17.
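The permutation measure of importance just described can be sketched in a few lines. One simplification relative to the text: the values of each variable are permuted in a held-out test set rather than in the out-of-bag cases of each tree, and the data are synthetic stand-ins for the hepatitis data.

```python
# A sketch of the permutation measure of variable importance described above,
# simplified to permute each variable in a held-out test set instead of in the
# out-of-bag cases of every tree. Synthetic data.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1500, n_features=19, n_informative=5,
                           random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=5)

forest = RandomForestClassifier(n_estimators=300, random_state=5).fit(X_tr, y_tr)
base_err = 1.0 - forest.score(X_te, y_te)

rng = np.random.default_rng(5)
for m in range(X.shape[1]):
    X_noised = X_te.copy()
    X_noised[:, m] = rng.permutation(X_noised[:, m])   # "noise up" variable m
    increase = (1.0 - forest.score(X_noised, y_te)) - base_err
    print(f"variable {m:2d}: error increase {100 * increase:5.2f}%")
```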

Fig. 3. Variables 12 and 17 vs. probability #1.

Fig. 4. Variable importance—Bupa data.

The graphs of the variable values vs. class death probability are almost linear and similar. The two variables turn out to be highly correlated. Thinking that this might have affected the logistic regression results, it was run again with one or the other of these two variables deleted. There was little change.

Out of curiosity, I evaluated variable importance in logistic regression in the same way that I did in random forests, by permuting variable values in the 10% test set and computing how much that increased the test set error. Not much help—variables 12 and 17 were not among the 3 variables ranked as most important. In partial verification of the importance of 12 and 17, I tried them separately as single variables in logistic regression. Variable 12 gave a 15.7% error rate, variable 17 came in at 19.3%.

To go back to the original Diaconis–Efron analysis, the problem is clear. Variables 12 and 17 are surrogates for each other. If one of them appears important in a model built on a bootstrap sample, the other does not. So each one's frequency of occurrence is automatically less than 50%. The paper lists the variables selected in ten of the samples. Either 12 or 17 appear in seven of the ten.

11.2 Example II: Clustering in Medical Data

The Bupa liver data set is a two-class biomedical data set also available at ftp.ics.uci.edu/pub/MachineLearningDatabases. The covariates are:

1. mcv: mean corpuscular volume
2. alkphos: alkaline phosphatase
3. sgpt: alanine aminotransferase
4. sgot: aspartate aminotransferase
5. gammagt: gamma-glutamyl transpeptidase
6. drinks: half-pint equivalents of alcoholic beverage drunk per day

The first five attributes are the results of blood tests thought to be related to liver functioning. The 345 patients are classified into two classes by the severity of their liver malfunctioning. Class two is severe malfunctioning. In a random forests run, the misclassification error rate is 28%. The variable importance given by random forests is in Figure 4.

Blood tests 3 and 5 are the most important, followed by test 4. Random forests also outputs an intrinsic similarity measure which can be used to cluster. When this was applied, two clusters were discovered in class two. The average of each variable is computed and plotted in each of these clusters in Figure 5.
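One standard way to build such an intrinsic similarity from a fitted forest is the proximity: the fraction of trees in which two cases land in the same terminal node. scikit-learn does not expose this directly, so the sketch below reconstructs it from the leaf indices returned by apply() and then clusters one class with it; the data are a synthetic stand-in for the Bupa data, not the analysis reported above.

```python
# A sketch of proximity-based clustering with a random forest: proximity of
# two cases = fraction of trees in which they share a terminal node. Synthetic
# stand-in data with six covariates, mimicking the shape of the Bupa data.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=345, n_features=6, n_informative=4,
                           n_redundant=0, random_state=6)

forest = RandomForestClassifier(n_estimators=300, random_state=6).fit(X, y)
leaves = forest.apply(X)                                  # (n_samples, n_trees)
prox = (leaves[:, None, :] == leaves[None, :, :]).mean(axis=2)

# Cluster the class-two cases, using 1 - proximity as the dissimilarity.
idx = np.flatnonzero(y == 1)
dist = squareform(1.0 - prox[np.ix_(idx, idx)], checks=False)
labels = fcluster(linkage(dist, method="average"), t=2, criterion="maxclust")

for k in (1, 2):
    members = idx[labels == k]
    print(f"cluster {k}: {len(members)} cases, variable means "
          f"{np.round(X[members].mean(axis=0), 2)}")
```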

The variable importance given by random forests is in Figure 4. Blood tests 3 and 5 are the most important, followed by test 4. Random forests also outputs an intrinsic similarity measure which can be used to cluster. When this was applied, two clusters were discovered in class two. The average of each variable is computed and plotted in each of these clusters in Figure 5.

Fig. 5. Cluster averages—Bupa data.

An interesting facet emerges. The class two subjects consist of two distinct groups: those that have high scores on blood tests 3, 4, and 5 and those that have low scores on those tests.

11.3 Example III: Microarray Data

Random forests was run on a microarray lymphoma data set with three classes, sample size of 81 and 4,682 variables (genes) without any variable selection [for more information about this data set, see Dudoit, Fridlyand and Speed (2000)]. The error rate was low. What was also interesting from a scientific viewpoint was an estimate of the importance of each of the 4,682 gene expressions.

The graph in Figure 6 was produced by a run of random forests. This result is consistent with assessments of variable importance made using other algorithmic methods, but appears to have sharper detail.

11.4 Remarks about the Examples

The examples show that much information is available from an algorithmic model. Friedman (1999) derives similar variable information from a different way of constructing a forest. The similarity is that they are both built as ways to give low predictive error.

There are 32 deaths and 123 survivors in the hepatitis data set. Calling everyone a survivor gives a baseline error rate of 20.6%. Logistic regression lowers this to 17.4%. It is not extracting much useful information from the data, which may explain its inability to find the important variables. Its weakness might have been unknown and the variable importances accepted at face value if its predictive accuracy was not evaluated.

Random forests is also capable of discovering important aspects of the data that standard data models cannot uncover. The potentially interesting clustering of class two patients in Example II is an illustration. The standard procedure when fitting data models such as logistic regression is to delete variables; to quote from Diaconis and Efron (1983) again, "statistical experience suggests that it is unwise to fit a model that depends on 19 variables with only 155 data points available." Newer methods in machine learning thrive on variables—the more the better. For instance, random forests does not overfit. It gives excellent accuracy on the lymphoma data set of Example III which has over 4,600 variables, with no variable deletion and is capable of extracting variable importance information from the data.
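The intrinsic similarity measure mentioned in Example II can be approximated from any fitted forest by recording how often two cases land in the same terminal node. The following is a rough sketch only, assuming the Bupa covariates and class labels are in numpy arrays X and y (with class two coded as 2), and using scikit-learn and SciPy in place of Breiman's own proximity code; it is practical for a few hundred cases.

    # Proximity of two cases = fraction of trees in which they share a leaf.
    import numpy as np
    from scipy.cluster.hierarchy import fcluster, linkage
    from scipy.spatial.distance import squareform
    from sklearn.ensemble import RandomForestClassifier

    forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)
    leaves = forest.apply(X)                               # (n_samples, n_trees) leaf ids
    prox = np.mean(leaves[:, None, :] == leaves[None, :, :], axis=2)

    # Cluster only the class-two cases, using 1 - proximity as a dissimilarity.
    idx = np.where(y == 2)[0]
    dist = 1.0 - prox[np.ix_(idx, idx)]
    np.fill_diagonal(dist, 0.0)
    labels = fcluster(linkage(squareform(dist), method="average"), t=2, criterion="maxclust")

    # Compare cluster averages of each covariate (the quantity plotted in Figure 5).
    for k in (1, 2):
        print(k, X[idx[labels == k]].mean(axis=0).round(2))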

Fig. 6. Microarray variable importance.

These examples illustrate the following points:

• Higher predictive accuracy is associated with more reliable information about the underlying data mechanism. Weak predictive accuracy can lead to questionable conclusions.
• Algorithmic models can give better predictive accuracy than data models, and provide better information about the underlying mechanism.

12. FINAL REMARKS

The goals in statistics are to use data to predict and to get information about the underlying data mechanism. Nowhere is it written on a stone tablet what kind of model should be used to solve problems involving data. To make my position clear, I am not against data models per se. In some situations they are the most appropriate way to solve the problem. But the emphasis needs to be on the problem and on the data.

Unfortunately, our field has a vested interest in data models, come hell or high water. For instance, see Dempster's (1998) paper on modeling. His position on the 1990 Census adjustment controversy is particularly interesting. He admits that he doesn't know much about the data or the details, but argues that the problem can be solved by a strong dose of modeling. That more modeling can make error-ridden data accurate seems highly unlikely to me.

Terabytes of data are pouring into computers from many sources, both scientific and commercial, and there is a need to analyze and understand the data. For instance, data is being generated at an awesome rate by telescopes and radio telescopes scanning the skies. Images containing millions of stellar objects are stored on tape or disk. Astronomers need automated ways to scan their data to find certain types of stellar objects or novel objects. This is a fascinating enterprise, and I doubt if data models are applicable. Yet I would enter this in my ledger as a statistical problem.

The analysis of genetic data is one of the most challenging and interesting statistical problems around. Microarray data, like that analyzed in Section 11.3, can lead to significant advances in understanding genetic effects. But the analysis of variable importance in Section 11.3 would be difficult to do accurately using a stochastic data model.

Problems such as stellar recognition or analysis of gene expression data could be high adventure for statisticians. But it requires that they focus on solving the problem instead of asking what data model they can create. The best solution could be an algorithmic model, or maybe a data model, or maybe a combination. But the trick to being a scientist is to be open to using a wide variety of tools.

The roots of statistics, as in science, lie in working with data and checking theory against data. I hope in this century our field will return to its roots. There are signs that this hope is not illusory. Over the last ten years, there has been a noticeable move toward statistical work on real world problems and reaching out by statisticians toward collaborative work with other disciplines. I believe this trend will continue and, in fact, has to continue if we are to survive as an energetic and creative field.

GLOSSARY

Since some of the terms used in this paper may not be familiar to all statisticians, I append some definitions.

Infinite test set error. Assume a loss function L(y, ŷ) that is a measure of the error when y is the true response and ŷ the predicted response. In classification, the usual loss is 1 if ŷ ≠ y and zero if ŷ = y. In regression, the usual loss is (y − ŷ)². Given a set of data (training set) consisting of (y_n, x_n), n = 1, 2, ..., N, use it to construct a predictor function φ(x) of y. Assume that the training set is i.i.d. drawn from the distribution of the random vector (Y, X). The infinite test set error is E[L(Y, φ(X))]. This is called the generalization error in machine learning. The generalization error is estimated either by setting aside a part of the data as a test set or by cross-validation.

Predictive accuracy. This refers to the size of the estimated generalization error. Good predictive accuracy means low estimated error.

Trees and nodes. This terminology refers to decision trees as described in the Breiman et al. (1984) book.

Dropping an x down a tree. When a vector of predictor variables is "dropped" down a tree, at each intermediate node it has instructions whether to go left or right depending on the coordinates of x. It stops at a terminal node and is assigned the prediction given by that node.

Bagging. An acronym for "bootstrap aggregating." Start with an algorithm such that, given any training set, the algorithm produces a prediction function φ(x). The algorithm can be a decision tree construction, logistic regression with variable deletion, etc. Take a bootstrap sample from the training set and use this bootstrap training set to construct the predictor φ_1(x). Take another bootstrap sample and using this second training set construct the predictor φ_2(x). Continue this way for K steps. In regression, average all of the φ_k(x) to get the bagged predictor at x. In classification, that class which has the plurality vote of the φ_k(x) is the bagged predictor. Bagging has been shown effective in variance reduction (Breiman, 1996b).
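A minimal sketch of the bagging recipe just defined, with a regression tree as the base algorithm φ; X, y and x_new are assumed to be numpy arrays, and scikit-learn's tree is a stand-in for any base predictor.

    import numpy as np
    from sklearn.tree import DecisionTreeRegressor

    def bagged_predictor(X, y, x_new, K=100, seed=0):
        rng = np.random.default_rng(seed)
        predictions = []
        for _ in range(K):
            idx = rng.integers(0, len(y), size=len(y))   # bootstrap sample of the training set
            phi_k = DecisionTreeRegressor().fit(X[idx], y[idx])
            predictions.append(phi_k.predict(x_new))     # phi_k evaluated at the new points
        return np.mean(predictions, axis=0)              # in regression, average the phi_k(x)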

Boosting. This is a more complex way of forming an ensemble of predictors in classification than bagging (Freund and Schapire, 1996). It uses no randomization but proceeds by altering the weights on the training set. Its performance in terms of low prediction error is excellent (for details see Breiman, 1998).

ACKNOWLEDGMENTS

Many of my ideas about data modeling were formed in three decades of conversations with my old friend and collaborator, Jerome Friedman. Conversations with Richard Olshen about the Cox model and its use in biostatistics helped me to understand the background. I am also indebted to William Meisel, who headed some of the prediction projects I consulted on and helped me make the transition from probability theory to algorithms, and to Charles Stone for illuminating conversations about the nature of statistics and science. I'm grateful also for the comments of the editor, Leon Gleser, which prompted a major rewrite of the first draft of this manuscript and resulted in a different and better paper.

REFERENCES

Amit, Y. and Geman, D. (1997). Shape quantization and recognition with randomized trees. Neural Computation 9 1545–1588.
Arena, C., Sussman, N., Chiang, K., Mazumdar, S., Macina, O. and Li, W. (2000). Bagging structure-activity relationships: a simulation study for assessing misclassification rates. Presented at the Second Indo-U.S. Workshop on Mathematical Chemistry, Duluth, MN. (Available at [email protected]).
Bickel, P., Ritov, Y. and Stoker, T. (2001). Tailor-made tests for goodness of fit for semiparametric hypotheses. Unpublished manuscript.
Breiman, L. (1996a). The heuristics of instability in model selection. Ann. Statist. 24 2350–2381.
Breiman, L. (1996b). Bagging predictors. Machine Learning J. 26 123–140.
Breiman, L. (1998). Arcing classifiers. Discussion paper, Ann. Statist. 26 801–824.
Breiman, L. (2000). Some infinity theory for tree ensembles. (Available at www.stat.berkeley.edu/technical reports).
Breiman, L. (2001). Random forests. Machine Learning J. 45 5–32.
Breiman, L. and Friedman, J. (1985). Estimating optimal transformations in multiple regression and correlation. J. Amer. Statist. Assoc. 80 580–619.
Breiman, L., Friedman, J., Olshen, R. and Stone, C. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA.
Cristianini, N. and Shawe-Taylor, J. (2000). An Introduction to Support Vector Machines. Cambridge Univ. Press.
Daniel, C. and Wood, F. (1971). Fitting Equations to Data. Wiley, New York.
Dempster, A. (1998). Logicist statistics I. Models and modeling. Statist. Sci. 13 248–276.
Diaconis, P. and Efron, B. (1983). Computer intensive methods in statistics. Scientific American 248 116–131.
Domingos, P. (1998). Occam's two razors: the sharp and the blunt. In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (R. Agrawal and P. Stolorz, eds.) 37–43. AAAI Press, Menlo Park, CA.
Domingos, P. (1999). The role of Occam's razor in knowledge discovery. Data Mining and Knowledge Discovery 3 409–425.
Dudoit, S., Fridlyand, J. and Speed, T. (2000). Comparison of discrimination methods for the classification of tumors. (Available at www.stat.berkeley.edu/technical reports).
Freedman, D. (1987). As others see us: a case study in path analysis (with discussion). J. Ed. Statist. 12 101–223.
Freedman, D. (1991). Statistical models and shoe leather. Sociological Methodology 1991 (with discussion) 291–358.
Freedman, D. (1991). Some issues in the foundations of statistics. Foundations of Science 1 19–83.
Freedman, D. (1994). From association to causation via regression. Adv. in Appl. Math. 18 59–110.
Freund, Y. and Schapire, R. (1996). Experiments with a new boosting algorithm. In Machine Learning: Proceedings of the Thirteenth International Conference 148–156. Morgan Kaufmann, San Francisco.
Friedman, J. (1999). Greedy predictive approximation: a gradient boosting machine. Technical report, Dept. Statistics, Stanford Univ.
Friedman, J., Hastie, T. and Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. Ann. Statist. 28 337–407.
Gifi, A. (1990). Nonlinear Multivariate Analysis. Wiley, New York.
Ho, T. K. (1998). The random subspace method for constructing decision forests. IEEE Trans. Pattern Analysis and Machine Intelligence 20 832–844.
Landwehr, J., Pregibon, D. and Shoemaker, A. (1984). Graphical methods for assessing logistic regression models (with discussion). J. Amer. Statist. Assoc. 79 61–83.
McCullagh, P. and Nelder, J. (1989). Generalized Linear Models. Chapman and Hall, London.
Meisel, W. (1972). Computer-Oriented Approaches to Pattern Recognition. Academic Press, New York.
Michie, D., Spiegelhalter, D. and Taylor, C. (1994). Machine Learning, Neural and Statistical Classification. Ellis Horwood, New York.
Mosteller, F. and Tukey, J. (1977). Data Analysis and Regression. Addison-Wesley, Reading, MA.
Mountain, D. and Hsiao, C. (1989). A combined structural and flexible functional approach for modeling energy substitution. J. Amer. Statist. Assoc. 84 76–87.
Stone, M. (1974). Cross-validatory choice and assessment of statistical predictions. J. Roy. Statist. Soc. B 36 111–147.
Vapnik, V. (1995). The Nature of Statistical Learning Theory. Springer, New York.
Vapnik, V. (1998). Statistical Learning Theory. Wiley, New York.
Wahba, G. (1990). Spline Models for Observational Data. SIAM, Philadelphia.
Zhang, H. and Singer, B. (1999). Recursive Partitioning in the Health Sciences. Springer, New York.

Comment

D. R. Cox

Professor Breiman’s interesting paper gives both keypoints concern the precise meaning of the data, a clear statement of the broad approach underly- the possible biases arising from the method of ascer- ing some of his influential and widelyadmired con- tainment, the possible presence of major distorting tributions and outlines some striking applications measurement errors and the nature of processes and developments. He has combined this with a cri- underlying missing and incomplete data and data tique of what, for want of a better term, I will call that evolve in time in a wayinvolving complex inter- mainstream statistical thinking, based in part on dependencies. For some of these, at least, it is hard a caricature. Like all good caricatures, it contains to see how to proceed without some notion of prob- enough truth and exposes enough weaknesses to be abilistic modeling. thought-provoking. Next Professor Breiman emphasizes prediction There is not enough space to comment on all the as the objective, success at prediction being the manypoints explicitlyor implicitlyraised in the criterion of success, as contrasted with issues paper. There follow some remarks about a few main of interpretation or understanding. Prediction is issues. indeed important from several perspectives. The One of the attractions of our subject is the aston- success of a theoryis best judged from its abilityto ishinglywide range of applications as judged not predict in new contexts, although one cannot dis- onlyin terms of substantive field but also in terms miss as totallyuseless theories such as the rational of objectives, qualityand quantityof data and action theory(RAT), in political science, which, as so on. Thus anyunqualified statement that “in I understand it, gives excellent explanations of the past but which has failed to predict the real politi- applications” has to be treated sceptically. One cal world. In a clinical trial context it can be argued of our failings has, I believe, been, in a wish to that an objective is to predict the consequences of stress generality, not to set out more clearly the treatment allocation to future patients, and so on. distinctions between different kinds of application If the prediction is localized to situations directly and the consequences for the strategyof statistical similar to those applying to the data there is then analysis. Of course we have distinctions between an interesting and challenging dilemma. Is it prefer- decision-making and inference, between tests and able to proceed with a directlyempirical black-box estimation, and between estimation and predic- approach, as favored byProfessor Breiman, or is tion and these are useful but, I think, are, except it better to tryto take account of some underly- perhaps the first, too phrased in terms of the tech- ing explanatoryprocess? The answer must depend nologyrather than the spirit of statistical analysis. on the context but I certainlyaccept, although it I entirelyagree with Professor Breiman that it goes somewhat against the grain to do so, that would be an impoverished and extremelyunhis- there are situations where a directlyempirical torical view of the subject to exclude the kind of approach is better. Short term economic forecasting work he describes simplybecause it has no explicit and real-time flood forecasting are probablyfurther probabilistic base. exemplars. Keyissues are then the stabilityof the Professor Breiman takes data as his starting predictor as practical prediction proceeds, the need point. 
I would prefer to start with an issue, a ques- from time to time for recalibration and so on. tion or a scientific hypothesis, although I would However, much prediction is not like this. Often be surprised if this were a real source of disagree- the prediction is under quite different conditions ment. These issues mayevolve, or even change from the data; what is the likelyprogress of the radically, as analysis proceeds. Data looking for incidence of the epidemic of v-CJD in the United a question are not unknown and raise puzzles Kingdom, what would be the effect on annual inci- but are, I believe, atypical in most contexts. Next, dence of cancer in the United States of reducing by even if we ignore design aspects and start with data, 10% the medical use of X-rays, etc.? That is, it may be desired to predict the consequences of something onlyindirectlyaddressed bythe data available for D. R. Cox is an Honorary Fellow, Nuffield College, analysis. As we move toward such more ambitious Oxford OX1 1NF, United Kingdom, and associate tasks, prediction, always hazardous, without some member, Department of Statistics, University of understanding of underlying process and linking Oxford (e-mail: david.cox@nuffield.oxford.ac.uk). with other sources of information, becomes more STATISTICAL MODELING: THE TWO CULTURES 217 and more tentative. Formulation of the goals of process involving somehow or other choosing a analysis solely in terms of direct prediction over the model, often a default model of standard form, data set seems then increasinglyunhelpful. and applying standard methods of analysis and This is quite apart from matters where the direct goodness-of-fit procedures. Thus for survival data objective is understanding of and tests of subject- choose a priori the proportional hazards model. matter hypotheses about underlying process, the (Note, incidentally, that in the paper, often quoted nature of pathways of dependence and so on. but probablyrarelyread, that introduced this What is the central strategyof mainstream sta- approach there was a comparison of several of the tistical analysis? This can most certainly not be dis- manydifferent models that might be suitable for cerned from the pages of Bernoulli, The Annals of this kind of data.) It is true that manyof the anal- Statistics or the Scandanavian Journal of Statistics ysesdone bynonstatisticians or bystatisticians nor from Biometrika and the Journal of Royal Sta- under severe time constraints are more or less like tistical Society, Series B or even from the application those Professor Breiman describes. The issue then pages of Journal of the American Statistical Associa- is not whether theycould ideallybe improved, but tion or Applied Statistics, estimable though all these whether theycapture enough of the essence of the journals are. Of course as we move along the list, information in the data, together with some rea- there is an increase from zero to 100% in the papers sonable indication of precision as a guard against containing analyses of “real” data. But the papers under or overinterpretation. Would more refined do so nearlyalwaysto illustrate technique rather analysis, possibly with better predictive power and than to explain the process of analysis and inter- better fit, produce subject-matter gains? There can pretation as such. 
This is entirelylegitimate, but be no general answer to this, but one suspects that is completelydifferent from live analysisof current quite often the limitations of conclusions lie more data to obtain subject-matter conclusions or to help in weakness of data qualityand studydesign than solve specific practical issues. Put differently, if an in ineffective analysis. important conclusion is reached involving statisti- There are two broad lines of development active cal analysis it will be reported in a subject-matter at the moment arising out of mainstream statistical journal or in a written or verbal report to colleagues, ideas. The first is the invention of models strongly government or business. When that happens, statis- tied to subject-matter considerations, represent- tical details are typicallyandcorrectlynot stressed. ing underlying dependencies, and their analysis, Thus the real procedures of statistical analysis can perhaps byMarkov chain Monte Carlo methods. be judged onlybylooking in detail at specific cases, In fields where subject-matter considerations are and access to these is not always easy. Failure to largelyqualitative, we see a development based on discuss enough the principles involved is a major Markov graphs and their generalizations. These criticism of the current state of theory. methods in effect assume, subject in principle to I think tentativelythat the following quite com- empirical test, more and more about the phenom- monlyapplies. Formal models are useful and often ena under study. By contrast, there is an emphasis almost, if not quite, essential for incisive thinking. on assuming less and less via, for example, kernel Descriptivelyappealing and transparent methods estimates of regression functions, generalized addi- with a firm model base are the ideal. Notions of tive models and so on. There is a need to be clearer significance tests, confidence intervals, posterior about the circumstances favoring these two broad intervals and all the formal apparatus of inference approaches, synthesizing them where possible. are valuable tools to be used as guides, but not in a Myown interest tends to be in the former style mechanical way;theyindicate the uncertaintythat of work. From this perspective Cox and Wermuth would applyunder somewhat idealized, maybe (1996, page 15) listed a number of requirements of a veryidealized, conditions and as such are often statistical model. These are to establish a link with lower bounds to real uncertainty. Analyses and background knowledge and to set up a connection model development are at least partlyexploratory. with previous work, to give some pointer toward Automatic methods of model selection (and of vari- a generating process, to have primaryparameters able selection in regression-like problems) are to be with individual clear subject-matter interpretations, shunned or, if use is absolutelyunavoidable, are to to specifyhaphazard aspects well enough to lead be examined carefullyfor their effect on the final to meaningful assessment of precision and, finally, conclusions. Unfocused tests of model adequacyare that the fit should be adequate. From this perspec- rarelyhelpful. tive, fit, which is broadlyrelated to predictive suc- Bycontrast, Professor Breiman equates main- cess, is not the primarybasis for model choice and stream applied statistics to a relativelymechanical formal methods of model choice that take no account 218 L. 
BREIMAN of the broader objectives are suspect in the present the analysis, an important and indeed fascinating context. In a sense these are efforts to establish data question, but a secondarystep. Better a rough descriptions that are potentiallycausal, recognizing answer to the right question than an exact answer that causality, in the sense that a natural scientist to the wrong question, an aphorism, due perhaps to would use the term, can rarelybe established from Lord Kelvin, that I heard as an undergraduate in one type of study and is at best somewhat tentative. applied mathematics. Professor Breiman takes a rather defeatist atti- I have stayed away from the detail of the paper tude toward attempts to formulate underlying but will comment on just one point, the interesting processes; is this not to reject the base of much sci- theorem of Vapnik about complete separation. This entific progress? The interesting illustrations given confirms folklore experience with empirical logistic byBeveridge (1952), where hypothesizedprocesses regression that, with a largish number of explana- in various biological contexts led to important toryvariables, complete separation is quite likelyto progress, even though the hypotheses turned out in occur. It is interesting that in mainstream thinking the end to be quite false, illustrate the subtletyof this is, I think, regarded as insecure in that com- the matter. Especiallyin the social sciences, repre- sentations of underlying process have to be viewed plete separation is thought to be a priori unlikely with particular caution, but this does not make and the estimated separating plane unstable. Pre- them fruitless. sumablybootstrap and cross-validation ideas may The absolutelycrucial issue in serious main- give here a quite misleading illusion of stability. stream statistics is the choice of a model that Of course if the complete separator is subtle and will translate keysubject-matter questions into a stable Professor Breiman’s methods will emerge tri- form for analysis and interpretation. If a simple umphant and ultimatelyit is an empirical question standard model is adequate to answer the subject- in each application as to what happens. matter question, this is fine: there are severe It will be clear that while I disagree with the main hidden penalties for overelaboration. The statisti- thrust of Professor Breiman’s paper I found it stim- cal literature, however, concentrates on how to do ulating and interesting.

Comment

Brad Efron

At first glance Leo Breiman’s stimulating paper dominant interpretational methodologyin dozens of looks like an argument against parsimonyand sci- fields, but, as we sayin California these days, it entific insight, and in favor of black boxes with lots is power purchased at a price: the theoryrequires a of knobs to twiddle. At second glance it still looks modestlyhigh ratio of signal to noise, sample size to that way, but the paper is stimulating, and Leo has number of unknown parameters, to have much hope some important points to hammer home. At the risk of success. “Good experimental design” amounts to of distortion I will tryto restate one of those points, enforcing favorable conditions for unbiased estima- the most interesting one in myopinion, using less tion and testing, so that the statistician won’t find confrontational and more historical language. himself or herself facing 100 data points and 50 From the point of view of statistical development parameters. the twentieth centurymight be labeled “100 years Now it is the twenty-first century when, as the of unbiasedness.” Following Fisher’s lead, most of paper reminds us, we are being asked to face prob- our current statistical theoryand practice revolves lems that never heard of good experimental design. around unbiased or nearlyunbiased estimates (par- Sample sizes have swollen alarminglywhile goals ticularlyMLEs), and tests based on such estimates. grow less distinct (“find interesting data structure”). The power of this theoryhas made statistics the New algorithms have arisen to deal with new prob- lems, a healthysign it seems to me even if the inno- Brad Efron is Professor, Department of Statis- vators aren’t all professional statisticians. There are tics, Sequoia Hall, 390 Serra Mall, Stanford Uni- enough physicists to handle the physics case load, versity, Stanford, California 94305–4065 (e-mail: but there are fewer statisticians and more statistics [email protected]). problems, and we need all the help we can get. An STATISTICAL MODELING: THE TWO CULTURES 219 attractive feature of Leo’s paper is his openness to • The “prediction culture,” at least around Stan- new ideas whatever their source. ford, is a lot bigger than 2%, though its constituency The new algorithms often appear in the form of changes and most of us wouldn’t welcome being black boxes with enormous numbers of adjustable typecast. parameters (“knobs to twiddle”), sometimes more • Estimation and testing are a form of prediction: knobs than data points. These algorithms can be “In our sample of 20 patients drug A outperformed quite successful as Leo points outs, sometimes drug B; would this still be true if we went on to test more so than their classical counterparts. However, all possible patients?” unless the bias-variance trade-off has been sus- • Prediction byitself is onlyoccasionallysuffi- pended to encourage new statistical industries, their cient. The post office is happywith anymethod success must hinge on some form of biased estima- that predicts correct addresses from hand-written tion. The bias maybe introduced directlyas with the scrawls. Peter Gregoryundertook his studyfor pre- “regularization” of overparameterized linear mod- diction purposes, but also to better understand the els, more subtlyas in the pruning of overgrown medical basis of hepatitis. Most statistical surveys regression trees, or surreptitiouslyas with support have the identification of causal factors as their ulti- mate goal. 
vector machines, but it has to be lurking somewhere inside the theory. The hepatitis data was first analyzed by Gail Of course the trouble with biased estimation is Gong in her 1982 Ph.D. thesis, which concerned pre- that we have so little theoryto fall back upon. diction problems and bootstrap methods for improv- Fisher’s information bound, which tells us how well ing on cross-validation. (Cross-validation itself is an a (nearly) unbiased estimator can possibly perform, uncertain methodologythat deserves further crit- is of no help at all in dealing with heavilybiased ical scrutiny; see, for example, Efron and Tibshi- methodology. Numerical experimentation by itself, rani, 1996). The Scientific American discussion is unguided bytheory,is prone to faddish wandering: quite brief, a more thorough description appearing Rule 1. New methods always look better than old in Efron and Gong (1983). Variables 12 or 17 (13 ones. Neural nets are better than logistic regres- or 18 in Efron and Gong’s numbering) appeared sion, support vector machines are better than neu- as “important” in 60% of the bootstrap simulations, ral nets, etc. In fact it is verydifficult to run an which might be compared with the 59% for variable honest simulation comparison, and easyto inadver- 19, the most for anysingle explanator. tentlycheat bychoosing favorable examples, or by In what sense are variable 12 or 17 or 19 not putting as much effort into optimizing the dull “important” or “not important”? This is the kind of old standard as the exciting new challenger. interesting inferential question raised byprediction methodology. Tibshirani and I made a stab at an Rule 2. Complicated methods are harder to crit- answer in our 1998 annals paper. I believe that the icize than simple ones. Bynow it is easyto check current interest in statistical prediction will eventu- the efficiencyof a logistic regression, but it is no allyinvigorate traditional inference, not eliminate small matter to analyze the limitations of a sup- it. port vector machine. One of the best things statis- A third front seems to have been opened in ticians do, and something that doesn’t happen out- the long-running frequentist-Bayesian wars by the side our profession, is clarifythe inferential basis advocates of algorithmic prediction, who don’t really of a proposed new methodology, a nice recent exam- believe in anyinferential school. Leo’s paper is at its ple being Friedman, Hastie, and Tibshirani’s anal- best when presenting the successes of algorithmic ysis of “boosting,” (2000). The past half-century has modeling, which comes across as a positive devel- seen the clarification process successfullyat work on opment for both statistical practice and theoretical nonparametrics, robustness and survival analysis. innovation. This isn’t an argument against tradi- There has even been some success with biased esti- tional data modeling anymore than splines are an mation in the form of Stein shrinkage and empirical argument against polynomials. The whole point of Bayes, but I believe the hardest part of this work science is to open up black boxes, understand their remains to be done. Papers like Leo’s are a call for insides, and build better boxes for the purposes of more analysis and theory, not less. mankind. 
Leo himself is a notablysuccessful sci- Prediction is certainlyan interesting subject but entist, so we can hope that the present paper was Leo’s paper overstates both its role and our profes- written more as an advocacydevice than as the con- sion’s lack of interest in it. fessions of a born-again black boxist. 220 L. BREIMAN Comment

Bruce Hoadley

INTRODUCTION next 6 months. The goal is to estimate the function, fx=logPry = 1x/ Pry = 0x. Professor Professor Breiman’s paper is an important one Breiman argues that some kind of simple logistic for statisticians to read. He and Statistical Science regression from the data modeling culture is not should be applauded for making this kind of mate- the wayto solve this problem. I agree. Let’s take rial available to a large audience. His conclusions are consistent with how statistics is often practiced a look at how the engineers at Fair, Isaac solved in business. This discussion will consist of an anec- this problem—wayback in the 1960s and 1970s. dotal recital of myencounters with the algorithmic The general form used for fx was called a modeling culture. Along the way, areas of mild dis- segmented scorecard. The process for developing agreement with Professor Breiman are discussed. I a segmented scorecard was clearlyan algorithmic also include a few proposals for research topics in modeling process. algorithmic modeling. The first step was to transform x into manyinter- pretable variables called prediction characteristics. CASE STUDY OF AN ALGORITHMIC This was done in stages. The first stage was to MODELING CULTURE compute several time series derived from the orig- inal two. An example is the time series of months Although I spent most of mycareer in manage- delinquent—a nonlinear function. The second stage ment at Bell Labs and Bellcore, the last seven years was to define characteristics as operators on the have been with the research group at Fair, Isaac. time series. For example, the number of times in This companyprovides all kinds of decision sup- the last six months that the customer was more port solutions to several industries, and is very than two months delinquent. This process can lead well known for credit scoring. Credit scoring is a to thousands of characteristics. A subset of these great example of the problem discussed byProfessor characteristics passes a screen for further analysis. Breiman. The input variables, x, might come from The next step was to segment the population companydatabases or credit bureaus. The output based on the screened characteristics. The segmen- variable, y, is some indicator of credit worthiness. tation was done somewhat informally. But when Credit scoring has been a profitable business for I looked at the process carefully, the segments Fair, Isaac since the 1960s, so it is instructive to turned out to be the leaves of a shallow-to-medium look at the Fair, Isaac analytic approach to see how tree. And the tree was built sequentiallyusing it fits into the two cultures described byProfessor mostlybinarysplits based on the best splitting Breiman. The Fair, Isaac approach was developed by characteristics—defined in a reasonable way. The engineers and operations research people and was algorithm was manual, but similar in concept to the driven bythe needs of the clients and the quality CART algorithm, with a different purityindex. of the data. The influences of the statistical com- Next, a separate function, fx, was developed for munitywere mostlyfrom the nonparametric side— each segment. The function used was called a score- things like jackknife and bootstrap. card. Each characteristic was chopped up into dis- Consider an example of behavior scoring, which crete intervals or sets called attributes. A score- is used in credit card account management. 
For card was a linear function of the attribute indicator pedagogical reasons, I consider a simplified version (dummy) variables derived from the characteristics. (in the real world, things get more complicated) of The coefficients of the dummyvariables were called monthlybehavior scoring. The input variables, x, score weights. in this simplified version, are the monthlybills and payments over the last 12 months. So the dimen- This construction amounted to an explosion of sion of x is 24. The output variable is binaryand dimensionality. They started with 24 predictors. is the indicator of no severe delinquencyover the These were transformed into hundreds of charac- teristics and pared down to about 100 characteris- tics. Each characteristic was discretized into about Dr. Bruce Hoadley is with Fair, Isaac and Co., 10 attributes, and there were about 10 segments. Inc., 120 N. Redwood Drive, San Rafael, California This makes 100 × 10 × 10 = 10 000 features. Yes 94903-1996 (e-mail: BruceHoadley@ FairIsaac.com). indeed, dimensionalityis a blessing. STATISTICAL MODELING: THE TWO CULTURES 221

What Fair, Isaac calls a scorecard is now else- meeting, I heard talks on treed regression, which where called a generalized additive model (GAM) looked like segmented scorecards to me. with bin smoothing. However, a simple GAM would After a few years with Fair, Isaac, I developed a not do. Client demand, legal considerations and talk entitled, “Credit Scoring—A Parallel Universe robustness over time led to the concept of score engi- of Prediction and Classification.” The theme was neering. For example, the score had to be monotoni- that Fair, Isaac developed in parallel manyof the callydecreasing in certain delinquencycharacteris- concepts used in modern algorithmic modeling. tics. Prior judgment also played a role in the design Certain aspects of the data modeling culture crept of scorecards. For some characteristics, the score into the Fair, Isaac approach. The use of divergence weights were shrunk toward zero in order to moder- was justified byassuming that the score distribu- ate the influence of these characteristics. For other tions were approximatelynormal. So rather than characteristics, the score weights were expanded in making assumptions about the distribution of the order to increase the influence of these character- inputs, theymade assumptions about the distribu- istics. These adjustments were not done willy-nilly. tion of the output. This assumption of normalitywas Theywere done to overcome known weaknesses in supported bya central limit theorem, which said the data. that sums of manyrandom variables are approxi- matelynormal—even when the component random So how did these Fair, Isaac pioneers fit these variables are dependent and multiples of dummy complicated GAM models back in the 1960s and random variables. 1970s? Logistic regression was not generallyavail- Modern algorithmic classification theoryhas able. And besides, even today’s commercial GAM shown that excellent classifiers have one thing in software will not handle complex constraints. What common, theyall have large margin. Margin, M,is theydid was to maximize (subject to constraints) a random variable that measures the comfort level a measure called divergence, which measures how with which classifications are made. When the cor- well the score, S, separates the two populations with rect classification is made, the margin is positive; different values of y. The formal definition of diver- it is negative otherwise. Since margin is a random 2 gence is 2ESy = 1−ESy = 0 /VSy = variable, the precise definition of large margin is 1+VSy = 0. This constrained fitting was done tricky. It does not mean that EM is large. When with a heuristic nonlinear programming algorithm. I put mydata modeling hat on, I surmised that A linear transformation was used to convert to a log large margin means that EM/ VM is large. odds scale. Lo and behold, with this definition, large margin Characteristic selection was done byanalyzing means large divergence. the change in divergence after adding (removing) Since the good old days at Fair, Isaac, there have each candidate characteristic to (from) the current been manyimprovements in the algorithmic mod- best model. The analysis was done informally to eling approaches. We now use genetic algorithms achieve good performance on the test sample. There to screen verylarge structured sets of prediction were no formal tests of fit and no tests of score characteristics. Our segmentation algorithms have weight statistical significance. 
What counted was been automated to yield even more predictive sys- performance on the test sample, which was a surro- tems. Our palatable GAM modeling tool now han- gate for the future real world. dles smooth splines, as well as splines mixed with These earlyFair, Isaac engineers were ahead of step functions, with all kinds of constraint capabil- their time and charter members of the algorithmic ity. Maximizing divergence is still a favorite, but modeling culture. The score formula was linear in we also maximize constrained GLM likelihood func- an exploded dimension. A complex algorithm was tions. We also are experimenting with computa- used to fit the model. There was no claim that the tionallyintensive algorithms that will optimize any final score formula was correct, onlythat it worked objective function that makes sense in the busi- well on the test sample. This approach grew natu- ness environment. All of these improvements are rallyout of the demands of the business and the squarelyin the culture of algorithmic modeling. qualityof the data. The overarching goal was to develop tools that would help clients make bet- OVERFITTING THE TEST SAMPLE ter decisions through data. What emerged was a Professor Breiman emphasizes the importance of veryaccurate and palatable algorithmic modeling performance on the test sample. However, this can solution, which belies Breiman’s statement: “The be overdone. The test sample is supposed to repre- algorithmic modeling methods available in the pre- sent the population to be encountered in the future. 1980s decades seem primitive now.” At a recent ASA But in reality, it is usually a random sample of the 222 L. BREIMAN current population. High performance on the test So far, all I had was the scorecard GAM. So clearly sample does not guarantee high performance on I was missing all of those interactions that just had future samples, things do change. There are prac- to be in the model. To model the interactions, I tried tices that can be followed to protect against change. developing small adjustments on various overlap- One can monitor the performance of the mod- ping segments. No matter how hard I tried, noth- els over time and develop new models when there ing improved the test sample performance over the has been sufficient degradation of performance. For global scorecard. I started calling it the Fat Score- some of Fair, Isaac’s core products, the redevelop- card. ment cycle is about 18–24 months. Fair, Isaac also Earlier, on this same data set, another Fair, Isaac does “score engineering” in an attempt to make researcher had developed a neural network with the models more robust over time. This includes 2,000 connection weights. The Fat Scorecard slightly damping the influence of individual characteristics, outperformed the neural network on the test sam- using monotone constraints and minimizing the size ple. I cannot claim that this would work for every of the models subject to performance constraints data set. But for this data set, I had developed an on the current test sample. This score engineer- excellent algorithmic model with a simple data mod- ing amounts to moving from verynonparametric (no eling tool. score engineering) to more semiparametric (lots of Whydid the simple additive model work so well? score engineering). One idea is that some of the characteristics in the model are acting as surrogates for certain inter- SPIN-OFFS FROM THE DATA action terms that are not explicitlyin the model. 
MODELING CULTURE Another reason is that the scorecard is reallya In Section 6 of Professor Breiman’s paper, he says sophisticated neural net. The inputs are the original that “multivariate analysis tools in statistics are inputs. Associated with each characteristic is a hid- frozen at discriminant analysis and logistic regres- den node. The summation functions coming into the sion in classification ” This is not necessarilyall hidden nodes are the transformations defining the that bad. These tools can carryyouveryfar as long characteristics. The transfer functions of the hid- as you ignore all of the textbook advice on how to den nodes are the step functions (compiled from the use them. To illustrate, I use the saga of the Fat score weights)—all derived from the data. The final Scorecard. output is a linear function of the outputs of the hid- Earlyin myresearch daysat Fair, Isaac, I den nodes. The result is highlynonlinear and inter- was searching for an improvement over segmented active, when looked at as a function of the original scorecards. The idea was to develop first a very inputs. good global scorecard and then to develop small The Fat Scorecard studyhad an ingredient that adjustments for a number of overlapping segments. is rare. We not onlyhad the traditional test sample, To develop the global scorecard, I decided to use but had three other test samples, taken one, two, logistic regression applied to the attribute dummy and three years later. In this case, the Fat Scorecard variables. There were 36 characteristics available outperformed the more traditional thinner score- for fitting. A typical scorecard has about 15 char- card for all four test samples. So the feared over- acteristics. Myvariable selection was structured fitting to the traditional test sample never mate- so that an entire characteristic was either in or rialized. To get a better handle on this you need out of the model. What I discovered surprised an understanding of how the relationships between me. All models fit with anywhere from 27 to 36 variables evolve over time. characteristics had the same performance on the I recentlyencountered another connection test sample. This is what Professor Breiman calls between algorithmic modeling and data modeling. “Rashomon and the multiplicityof good models.” To In classical multivariate discriminant analysis, one keep the model as small as possible, I chose the one assumes that the prediction variables have a mul- with 27 characteristics. This model had 162 score tivariate normal distribution. But for a scorecard, weights (logistic regression coefficients), whose P- the prediction variables are hundreds of attribute values ranged from 0.0001 to 0.984, with onlyone dummyvariables, which are verynonnormal. How- less than 0.05; i.e., statisticallysignificant. The con- ever, if you apply the discriminant analysis algo- fidence intervals for the 162 score weights were use- rithm to the attribute dummyvariables, youcan less. To get this great scorecard, I had to ignore get a great algorithmic model, even though the the conventional wisdom on how to use logistic assumptions of discriminant analysis are severely regression. violated. STATISTICAL MODELING: THE TWO CULTURES 223

ASOLUTION TO THE OCCAMDILEMMA used as a surrogate for the decision process, and misclassification error is used as a surrogate for I think that there is a solution to the Occam dilemma without resorting to goal-oriented argu- profit. However, I see a mismatch between the algo- ments. Clients reallydo insist on interpretable func- rithms used to develop the models and the business tions, fx. Segmented palatable scorecards are measurement of the model’s value. For example, at veryinterpretable bythe customer and are very Fair, Isaac, we frequentlymaximize divergence. But accurate. Professor Breiman himself gave single when we argue the model’s value to the clients, we trees an A+ on interpretability. The shallow-to- don’t necessarilybrag about the great divergence. medium tree in a segmented scorecard rates an A++. We tryto use measures that the client can relate to. The palatable scorecards in the leaves of the trees The ROC curve is one favorite, but it maynot tell are built from interpretable (possiblycomplex) char- the whole story. Sometimes, we develop simulations acteristics. Sometimes we can’t implement them of the client’s business operation to show how the until the lawyers and regulators approve. And that model will improve their situation. For example, in requires super interpretability. Our more sophisti- a transaction fraud control process, some measures cated products have 10 to 20 segments and up to of interest are false positive rate, speed of detec- 100 characteristics (not all in everysegment). These models are veryaccurate and veryinterpretable. tion and dollars saved when 0.5% of the transactions I coined a phrase called the “Ping-Pong theorem.” are flagged as possible frauds. The 0.5% reflects the This theorem says that if we revealed to Profes- number of transactions that can be processed by sor Breiman the performance of our best model and the current fraud management staff. Perhaps what gave him our data, then he could develop an algo- the client reallywants is a score that will maxi- rithmic model using random forests, which would mize the dollars saved in their fraud control sys- outperform our model. But if he revealed to us the tem. The score that maximizes test set divergence performance of his model, then we could develop or minimizes test set misclassifications does not do a segmented scorecard, which would outperform it. The challenge for algorithmic modeling is to find his model. We might need more characteristics, an algorithm that maximizes the generalization dol- attributes and segments, but our experience in this lars saved, not generalization error. kind of contest is on our side. We have made some progress in this area using However, all the competing models in this game of ideas from support vector machines and boosting. Ping-Pong would surelybe algorithmic models. But Bymanipulating the observation weights used in some of them could be interpretable. standard algorithms, we can improve the test set THE ALGORITHM TUNING DILEMMA performance on anyobjective of interest. But the price we payis computational intensity. As far as I can tell, all approaches to algorithmic model building contain tuning parameters, either explicit or implicit. For example, we use penalized objective functions for fitting and marginal contri- MEASURING IMPORTANCE—IS IT bution thresholds for characteristic selection. With REALLY POSSIBLE? 
experience, analysts learn how to set these tuning I like Professor Breiman’s idea for measuring the parameters in order to get excellent test sample importance of variables in black box models. A Fair, or cross-validation results. However, in industry Isaac spin on this idea would be to build accurate and academia, there is sometimes a little tinker- ing, which involves peeking at the test sample. The models for which no variable is much more impor- result is some bias in the test sample or cross- tant than other variables. There is always a chance validation results. This is the same kind of tinkering that a variable and its relationships will change in that upsets test of fit pureness. This is a challenge the future. After that, you still want the model to for the algorithmic modeling approach. How do you work. So don’t make anyvariable dominant. optimize your results and get an unbiased estimate I think that there is still an issue with measuring of the generalization error? importance. Consider a set of inputs and an algo-

rithm that yields a black box, for which x1 is impor- GENERALIZING THE GENERALIZATION ERROR tant. From the “Ping Pong theorem” there exists a

In most commercial applications of algorithmic set of input variables, excluding x1 and an algorithm modeling, the function, fx, is used to make deci- that will yield an equally accurate black box. For sions. In some academic research, classification is this black box, x1 is unimportant. 224 L. BREIMAN

IN SUMMARY GAMs in the leaves. Theyare veryaccurate and Algorithmic modeling is a veryimportant area of interpretable. And you can do it with data modeling statistics. It has evolved naturallyin environments tools as long as you (i) ignore most textbook advice, with lots of data and lots of decisions. But you (ii) embrace the blessing of dimensionality, (iii) use can do it without suffering the Occam dilemma; constraints in the fitting optimizations (iv) use reg- for example, use medium trees with interpretable ularization, and (v) validate the results.

Comment

Emanuel Parzen

Emanuel Parzen is Distinguished Professor, Department of Statistics, Texas A&M University, 415 C Block Building, College Station, Texas 77843 (e-mail: [email protected]).

1. BREIMAN DESERVES OUR APPRECIATION

I strongly support the view that statisticians must face the crisis of the difficulties in their practice of regression. Breiman alerts us to systematic blunders (leading to wrong conclusions) that have been committed applying current statistical practice of data modeling. In the spirit of "statistician, avoid doing harm" I propose that the first goal of statistical ethics should be to guarantee to our clients that any mistakes in our analysis are unlike any mistakes that statisticians have made before.

The two goals in analyzing data which Leo calls prediction and information I prefer to describe as "management" and "science." Management seeks profit, practical answers (predictions) useful for decision making in the short run. Science seeks truth, fundamental knowledge about nature which provides understanding and control in the long run. As a historical note, Student's t-test has many scientific applications but was invented by Student as a management tool to make Guinness beer better (bitter?).

Breiman does an excellent job of presenting the case that the practice of statistical science, using only the conventional data modeling culture, needs reform. He deserves much thanks for alerting us to the algorithmic modeling culture. Breiman warns us that "if the model is a poor emulation of nature, the conclusions may be wrong." This situation, which I call "the right answer to the wrong question," is called by statisticians "the error of the third kind." Engineers at M.I.T. define "suboptimization" as "elegantly solving the wrong problem."

Breiman presents the potential benefits of algorithmic models (better predictive accuracy than data models, and consequently better information about the underlying mechanism and avoiding questionable conclusions which results from weak predictive accuracy) and support vector machines (which provide almost perfect separation and discrimination between two classes by increasing the dimension of the feature set). He convinces me that the methods of algorithmic modeling are important contributions to the tool kit of statisticians.

If the profession of statistics is to remain healthy, and not limit its research opportunities, statisticians must learn about the cultures in which Breiman works, but also about many other cultures of statistics.

2. HYPOTHESES TO TEST TO AVOID BLUNDERS OF STATISTICAL MODELING

Breiman deserves our appreciation for pointing out generic deviations from standard assumptions (which I call bivariate dependence and two-sample conditional clustering) for which we should routinely check. "Test null hypothesis" can be a useful algorithmic concept if we use tests that diagnose in a model-free way the directions of deviation from the null hypothesis model.

Bivariate dependence (correlation) may exist between features [independent (input) variables] in a regression, causing them to be proxies for each other and our models to be unstable, with different forms of regression models being equally well fitting. We need tools to routinely test the hypothesis of statistical independence of the distributions of independent (input) variables.

Two-sample conditional clustering arises in the distributions of independent (input) variables to discriminate between two classes, which we call the conditional distribution of input variables X given each class.
Class I mayhave onlyone mode (clus- [email protected]). ter) at low values of X while class II has two modes STATISTICAL MODELING: THE TWO CULTURES 225

(clusters) at low and high values of X. We would like approach to estimation of conditional quantile func- to conclude that high values of X are observed only tions which I onlyrecentlyfullyimplemented. I for members of class II but low values of X occur for would like to extend the concept of algorithmic sta- members of both classes. The hypothesis we propose tistical models in two ways: (1) to mean data fitting testing is equalityof the pooled distribution of both byrepresentations which use approximation the- samples and the conditional distribution of sample oryand numerical analysis;(2) to use the notation I, which is equivalent to PclassIX=PclassI. of probabilityto describe empirical distributions of For successful discrimination one seeks to increase samples (data sets) which are not assumed to be the number (dimension) of inputs (features) X to generated bya random mechanism. make PclassIX close to 1 or 0. Myquantile culture has not yetbecome widely applied because “you cannot give away a good idea, 3. STATISTICAL MODELING, MANY you have to sell it” (by integrating it in computer CULTURES, STATISTICAL METHODS MINING programs usable byapplied statisticians and thus Breiman speaks of two cultures of statistics; I promote statistical methods mining). believe statistics has many cultures. At specialized A quantile function Qu,0≤ u ≤ 1, is the −1 workshops (on maximum entropymethods or robust inverse F u of a distribution function Fx, methods or Bayesian methods or ) a main topic −∞ 1 modeling.” We should teach in our introductory are outliers as defined byTukeyEDA. courses that one meaning of statistics is “statisti- For the fundamental problem of comparison of cal data modeling done in a systematic way” by two distributions F and G we define the compar- an iterated series of stages which can be abbrevi- ison distribution Du F G and comparison den- ated SIEVE (specifyproblem and general form of sity du F G=Du F G For F G continu- models, identifytentativelynumbers of parameters ous, define Du F G=GF−1u du F G= and specialized models, estimate parameters, val- gF−1u/fF−1u assuming fx=0 implies idate goodness-of-fit of estimated models, estimate gx=0, written G  F.ForF G discrete with final model nonparametricallyor algorithmically). probabilitymass functions pF and pG define (assum- −1 −1 MacKayand Oldford (2000) brilliantlypresent the ing G  F) du F G=pGF u/pFF u. statistical method as a series of stages PPDAC Our applications of comparison distributions (problem, plan, data, analysis, conclusions). often assume F to be an unconditional distribution and G a conditional distribution. To analyze bivari- 4. QUANTILE CULTURE, ate data X Y a fundamental tool is dependence ALGORITHMIC MODELS density dt u=du F F  When X Y Y YX=QXt is jointlycontinuous, A culture of statistical data modeling based on quantile functions, initiated in Parzen (1979), has dt u been mymain research interest since 1976. In mydiscussion to Stone (1977) I outlined a novel = fX YQXt QYu/fXQXtfYQYu 226 L. BREIMAN
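As an illustration of these definitions, the comparison distribution and comparison density can be estimated directly from data by plugging in empirical distribution and quantile functions. The sketch below is not Parzen's own software; it is a minimal plug-in estimate in Python, with the function names, the ten-bin histogram estimate of d(u; F, G), and the simulated two-sample example all my own choices.

    import numpy as np

    def empirical_cdf(sample):
        # Empirical distribution function of the sample.
        x = np.sort(np.asarray(sample, dtype=float))
        return lambda t: np.searchsorted(x, t, side="right") / len(x)

    def empirical_quantile(sample):
        # Empirical quantile function Q(u) = F^{-1}(u).
        x = np.sort(np.asarray(sample, dtype=float))
        n = len(x)
        def Q(u):
            u = np.clip(np.asarray(u, dtype=float), 0.0, 1.0)
            idx = np.clip(np.ceil(u * n).astype(int) - 1, 0, n - 1)
            return x[idx]
        return Q

    def comparison_distribution(f_sample, g_sample, u):
        # Plug-in estimate of D(u; F, G) = G(F^{-1}(u)).
        return empirical_cdf(g_sample)(empirical_quantile(f_sample)(u))

    def comparison_density(f_sample, g_sample, n_bins=10):
        # Histogram-type estimate of d(u; F, G): slope of D on each bin of [0, 1].
        # Values near 1 everywhere are consistent with G = F; deviations show
        # where (and in which direction) G differs from F.
        u = np.linspace(0.0, 1.0, n_bins + 1)
        return np.diff(comparison_distribution(f_sample, g_sample, u)) * n_bins

    # Example with F taken as a pooled sample and G as one of its two components.
    rng = np.random.default_rng(0)
    sample_1 = rng.normal(0.0, 1.0, size=500)
    sample_2 = rng.normal(0.5, 1.0, size=500)
    pooled = np.concatenate([sample_1, sample_2])
    # d(u; F_pooled, F_1) sits above 1 at small u and below 1 at large u,
    # because sample 1 lies to the left of the pooled sample.
    print(np.round(comparison_density(pooled, sample_1), 2))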

The statistical independence hypothesis F_{X,Y} = F_X F_Y is equivalent to d(t, u) = 1 for all t, u. A fundamental formula for estimation of conditional quantile functions is

    Q_{Y|X=x}(u) = Q_Y(D^{-1}(u; F_Y, F_{Y|X=x})),

equivalently Q_{Y|X=x}(u) = Q_Y(s) with u = D(s; F_Y, F_{Y|X=x}).

To compare the distributions of two univariate samples, let Y denote the continuous response variable and let X be binary 0, 1, denoting the population from which Y is observed. The comparison density is defined (note that F_Y is the pooled distribution function)

    d_1(u) = d(u; F_Y, F_{Y|X=1}) = P(X = 1 | Y = Q_Y(u)) / P(X = 1).

5. QUANTILE IDEAS FOR HIGH DIMENSIONAL DATA ANALYSIS

By high dimensional data we mean multivariate data (Y_1, ..., Y_m). We form approximate high dimensional comparison densities d(u_1, ..., u_m) to test statistical independence of the variables and, when we have two samples, d_1(u_1, ..., u_m) to test equality of sample I with the pooled sample. All our distributions are empirical distributions but we use notation for true distributions in our formulas. Note that

    ∫_0^1 ... ∫_0^1 du_1 ... du_m d(u_1, ..., u_m) d_1(u_1, ..., u_m) = 1.

A decile quantile bin Bin(k_1, ..., k_m) is defined to be the set of observations (Y_1, ..., Y_m) satisfying, for j = 1, ..., m,

    Q_{Y_j}((k_j − 1)/10) < Y_j ≤ Q_{Y_j}(k_j/10).

Instead of deciles k/10 we could use k/M for another base M. To test the hypothesis that Y_1, ..., Y_m are statistically independent we form, for all k_j = 1, ..., 10,

    d(k_1, ..., k_m) = P[Bin(k_1, ..., k_m)] / P[Bin(k_1, ..., k_m); independence].

To test equality of the distribution of a sample from population I and the pooled sample we form

    d_1(k_1, ..., k_m) = P[Bin(k_1, ..., k_m) | population I] / P[Bin(k_1, ..., k_m) | pooled sample]

for all k_1, ..., k_m such that the denominator is positive, and otherwise defined arbitrarily. One can show (letting X denote the population observed)

    d_1(k_1, ..., k_m) = P[X = I | observation from Bin(k_1, ..., k_m)] / P[X = I].

To test the null hypotheses in ways that detect directions of deviations from the null hypothesis, our recommended first step is quantile data analysis of the values d(k_1, ..., k_m) and d_1(k_1, ..., k_m).
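The decile-bin diagnostics can likewise be computed directly from data. The following is a minimal sketch for the bivariate case m = 2 only; the function names, the use of empirical deciles, and the simulated example are my assumptions rather than anything specified in the text. It compares observed joint bin proportions with the product of the marginal bin proportions, which plays the role of P[Bin(k_1, k_2); independence].

    import numpy as np

    def decile_bins(y, n_bins=10):
        # Assign each observation its decile bin index k = 1, ..., n_bins,
        # using the empirical quantiles of y as bin edges.
        edges = np.quantile(y, np.linspace(0.0, 1.0, n_bins + 1))
        k = np.digitize(y, edges[1:-1], right=True)   # 0 .. n_bins-1
        return np.clip(k, 0, n_bins - 1) + 1

    def bin_ratios_m2(y1, y2, n_bins=10):
        # d(k1, k2) = P[Bin(k1, k2)] / (P[k1] P[k2]): observed joint bin proportions
        # divided by the product of the marginal bin proportions, the benchmark
        # that would hold under statistical independence of Y1 and Y2.
        k1, k2 = decile_bins(y1, n_bins), decile_bins(y2, n_bins)
        joint = np.zeros((n_bins, n_bins))
        for a, b in zip(k1, k2):
            joint[a - 1, b - 1] += 1.0 / len(y1)
        expected = np.outer(joint.sum(axis=1), joint.sum(axis=0))
        with np.errstate(divide="ignore", invalid="ignore"):
            return np.where(expected > 0, joint / expected, np.nan)

    # Example: a correlated pair; the ratio matrix is elevated along the diagonal
    # and depressed in the off-diagonal corners, showing the direction of dependence.
    rng = np.random.default_rng(1)
    y1 = rng.normal(size=2000)
    y2 = 0.7 * y1 + rng.normal(size=2000)
    print(np.round(bin_ratios_m2(y1, y2), 2))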

I appreciate this opportunity to bring to the attention of researchers on high dimensional data analysis the potential of quantile methods. My conclusion is that statistical science has many cultures and statisticians will be more successful when they emulate Leo Breiman and apply as many cultures as possible (which I call statistical methods mining). Additional references are on my web site.

Emanuel Parzen is Distinguished Professor, Department of Statistics, Texas A&M University, 415 C Block Building, College Station, Texas 77843 (e-mail: [email protected]).

Rejoinder

Leo Breiman

I thank the discussants. I'm fortunate to have comments from a group of experienced and creative statisticians—even more so in that their comments are diverse. Manny Parzen and Bruce Hoadley are more or less in agreement, Brad Efron has serious reservations and D. R. Cox is in downright disagreement.

I address Professor Cox's comments first, since our disagreement is crucial.

D. R. COX

Professor Cox is a worthy and thoughtful adversary. We walk down part of the trail together and then sharply diverge. To begin, I quote: "Professor Breiman takes data as his starting point. I would prefer to start with an issue, a question, or a scientific hypothesis." I agree, but would expand the starting list to include the prediction of future events. I have never worked on a project that has started with "Here is a lot of data; let's look at it and see if we can get some ideas about how to use it." The data has been put together and analyzed starting with an objective.

C1 Data Models Can Be Useful

Professor Cox is committed to the use of data models. I readily acknowledge that there are situations where a simple data model may be useful and appropriate; for instance, if the science of the mechanism producing the data is well enough known to determine the model apart from estimating parameters. There are also situations of great complexity posing important issues and questions in which there is not enough data to resolve the questions to the accuracy desired. Simple models can then be useful in giving qualitative understanding, suggesting future research areas and the kind of additional data that needs to be gathered.

At times, there is not enough data on which to base predictions; but policy decisions need to be made. In this case, constructing a model using whatever data exists, combined with scientific common sense and subject-matter knowledge, is a reasonable path. Professor Cox points to examples when he writes:

    Often the prediction is under quite different conditions from the data; what is the likely progress of the incidence of the epidemic of v-CJD in the United Kingdom, what would be the effect on annual incidence of cancer in the United States of reducing by 10% the medical use of X-rays, etc.? That is, it may be desired to predict the consequences of something only indirectly addressed by the data available for analysis.... Prediction, always hazardous, without some understanding of the underlying process and linking with other sources of information, becomes more and more tentative.

I agree.

C2 Data Models Only

From here on we part company. Professor Cox's discussion consists of a justification of the use of data models to the exclusion of other approaches. For instance, although he admits, "... I certainly accept, although it goes somewhat against the grain to do so, that there are situations where a directly empirical approach is better," the two examples he gives of such situations are short-term economic forecasts and real-time flood forecasts—among the less interesting of all of the many current successful algorithmic applications. In his view, the only use for algorithmic models is short-term forecasting; there are no comments on the rich information about the data and covariates available from random forests or in the many fields, such as pattern recognition, where algorithmic modeling is fundamental.

He advocates construction of stochastic data models that summarize the understanding of the phenomena under study. The methodology in the Cox and Wermuth book (1996) attempts to push understanding further by finding causal orderings in the covariate effects. The sixth chapter of this book contains illustrations of this approach on four data sets. The first is a small data set of 68 patients with seven covariates from a pilot study at the University of Mainz to identify psychological and socioeconomic factors possibly important for glucose control in diabetes patients. This is a regression-type problem with the response variable measured by GHb (glycosylated haemoglobin). The model fitting is done by a number of linear regressions and validated by the checking of various residual plots. The only other reference to model validation is the statement, "R² = 0.34, reasonably large by the standards usual for this field of study." Predictive accuracy is not computed, either for this example or for the three other examples.

My comments on the questionable use of data models apply to this analysis. Incidentally, I tried to get one of the data sets used in the chapter to conduct an alternative analysis, but it was not possible to get it before my rejoinder was due. It would have been interesting to contrast our two approaches.

C3 Approach to Statistical Problems

Basing my critique on a small illustration in a book is not fair to Professor Cox. To be fairer, I quote his words about the nature of a statistical analysis:

    Formal models are useful and often almost, if not quite, essential for incisive thinking. Descriptively appealing and transparent methods with a firm model base are the ideal. Notions of significance tests, confidence intervals, posterior intervals, and all the formal apparatus of inference are valuable tools to be used as guides, but not in a mechanical way; they indicate the uncertainty that would apply under somewhat idealized, maybe very idealized, conditions and as such are often lower bounds to real uncertainty. Analyses and model development are at least partly exploratory. Automatic methods of model selection (and of variable selection in regression-like problems) are to be shunned or, if use is absolutely unavoidable, are to be examined carefully for their effect on the final conclusions. Unfocused tests of model adequacy are rarely helpful.

Given the right kind of data (relatively small sample size and a handful of covariates), I have no doubt that his experience and ingenuity in the craft of model construction would result in an illuminating model. But data characteristics are rapidly changing. In many of the most interesting current problems, the idea of starting with a formal model is not tenable.

C4 Changes in Problems

My impression from Professor Cox's comments is that he believes every statistical problem can be best solved by constructing a data model. I believe that statisticians need to be more pragmatic. Given a statistical problem, find a good solution, whether it is a data model, an algorithmic model or (although it is somewhat against my grain) a Bayesian data model or a completely different approach.

My work on the 1990 Census Adjustment (Breiman, 1994) involved a painstaking analysis of the sources of error in the data. This was done by a long study of thousands of pages of evaluation documents. This seemed the most appropriate way of answering the question of the accuracy of the adjustment estimates.

The conclusion that the adjustment estimates were largely the result of bad data has never been effectively contested and is supported by the results of the Year 2000 Census Adjustment effort. The accuracy of the adjustment estimates was, arguably, the most important statistical issue of the last decade, and could not be resolved by any amount of statistical modeling.

A primary reason why we cannot rely on data models alone is the rapid change in the nature of statistical problems. The realm of applications of statistics has expanded more in the last twenty-five years than in any comparable period in the history of statistics.

In an astronomy and statistics workshop this year, a speaker remarked that in twenty-five years we have gone from being a small sample-size science to a very large sample-size science. Astronomical data bases now contain data on two billion objects comprising over 100 terabytes, and the rate of new information is accelerating.

A recent biostatistics workshop emphasized the analysis of genetic data. An exciting breakthrough is the use of microarrays to locate regions of gene activity. Here the sample size is small, but the number of variables ranges in the thousands. The questions are which specific genes contribute to the occurrence of various types of diseases.

Questions about the areas of thinking in the brain are being studied using functional MRI. The data gathered in each run consists of a sequence of 150,000 pixel images. Gigabytes of satellite information are being used in projects to predict and understand short- and long-term environmental and weather changes.

Underlying this rapid change is the rapid evolution of the computer, a device for gathering, storing and manipulating incredible amounts of data, together with technological advances incorporating computing, such as satellites and MRI machines.

The problems are exhilarating. The methods used in statistics for small sample sizes and a small number of variables are not applicable. John Rice, in his summary talk at the astronomy and statistics workshop, said, "Statisticians have to become opportunistic." That is, faced with a problem, they must find a reasonable solution by whatever method works. One surprising aspect of both workshops was how opportunistic statisticians faced with genetic and astronomical data had become. Algorithmic methods abounded.

C5 Mainstream Procedures and Tools

Professor Cox views my critique of the use of data models as based in part on a caricature. Regarding my references to articles in journals such as JASA, he states that they are not typical of mainstream statistical analysis, but are used to illustrate technique rather than explain the process of analysis. His concept of mainstream statistical analysis is summarized in the quote given in my Section C3. It is the kind of thoughtful and careful analysis that he prefers and is capable of.

Following this summary is the statement:

    By contrast, Professor Breiman equates mainstream applied statistics to a relatively mechanical process involving somehow or other choosing a model, often a default model of standard form, and applying standard methods of analysis and goodness-of-fit procedures.

The disagreement is definitional—what is "mainstream"? In terms of numbers my definition of mainstream prevails, I guess, at a ratio of at least 100 to 1. Simply count the number of people doing their statistical analysis using canned packages, or count the number of SAS licenses.

In the academic world, we often overlook the fact that we are a small slice of all statisticians and an even smaller slice of all those doing analyses of data. There are many statisticians and nonstatisticians in diverse fields using data to reach conclusions and depending on tools supplied to them by SAS, SPSS, etc. Their conclusions are important and are sometimes published in medical or other subject-matter journals. They do not have the statistical expertise, computer skills, or time needed to construct more appropriate tools. I was faced with this problem as a consultant when confined to using the BMDP linear regression, stepwise linear regression, and discriminant analysis programs. My concept of decision trees arose when I was faced with nonstandard data that could not be treated by these standard methods.

When I rejoined the university after my consulting years, one of my hopes was to provide better general purpose tools for the analysis of data. The first step in this direction was the publication of the CART book (Breiman et al., 1984). CART and other similar decision tree methods are used in thousands of applications yearly in many fields. It has proved robust and reliable. There are others that are more recent; random forests is the latest. A preliminary version of random forests is free source with f77 code, S+ and R interfaces available at www.stat.berkeley.edu/users/breiman. A nearly completed second version will also be put on the web site and translated into Java by the Weka group. My collaborator, Adele Cutler, and I will continue to upgrade it and add new features, graphics, and a good interface.

My philosophy about the field of academic statistics is that we have a responsibility to provide the many people working in applications outside of academia with useful, reliable, and accurate analysis tools. Two excellent examples are wavelets and decision trees. More are needed.

BRAD EFRON

Brad seems to be a bit puzzled about how to react to my article. I'll start with what appears to be his biggest reservation.

E1 From Simple to Complex Models

Brad is concerned about the use of complex models without simple interpretability in their structure, even though these models may be the most accurate predictors possible. But the evolution of science is from simple to complex.

The equations of general relativity are considerably more complex and difficult to understand than Newton's equations. The quantum mechanical equations for a system of molecules are extraordinarily difficult to interpret. Physicists accept these complex models as the facts of life, and do their best to extract usable information from them.

There is no consideration given to trying to understand cosmology on the basis of Newton's equations or nuclear reactions in terms of hard ball models for atoms. The scientific approach is to use these complex models as the best possible descriptions of the physical world and try to get usable information out of them.

There are many engineering and scientific applications where simpler models, such as Newton's laws, are certainly sufficient—say, in structural design. Even here, for larger structures, the model is complex and the analysis difficult. In scientific fields outside statistics, answering questions is done by extracting information from increasingly complex and accurate models.

The approach I suggest is similar. In genetics, astronomy and many other current areas, statistics is needed to answer questions: construct the most accurate possible model, however complex, and then extract usable information from it.

Random forests is in use at some major drug companies whose statisticians were impressed by its ability to determine gene expression (variable importance) in microarray data. They were not concerned about its complexity or black-box appearance.

E2 Prediction

    Leo's paper overstates both its [prediction's] role, and our profession's lack of interest in it.... Most statistical surveys have the identification of causal factors as their ultimate role.

My point was that it is difficult to tell, using goodness-of-fit tests and residual analysis, how well a model fits the data. An estimate of its test set accuracy is a preferable assessment. If, for instance, a model gives predictive accuracy only slightly better than the "all survived" or other baseline estimates, we can't put much faith in its reliability in the identification of causal factors.

I agree that often "... statistical surveys have the identification of causal factors as their ultimate role." I would add that the more predictively accurate the model is, the more faith can be put into the variables that it fingers as important.
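To make the comparison with a baseline concrete, here is a small sketch of my own, on synthetic data and using scikit-learn; nothing in it comes from the discussion itself. It reports a model's test-set accuracy alongside the accuracy of the trivial rule that always predicts the majority class, the analogue of "all survived."

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.dummy import DummyClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    # A synthetic, unbalanced two-class problem standing in for a real study.
    X, y = make_classification(n_samples=2000, n_features=20, n_informative=5,
                               weights=[0.8, 0.2], random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                        random_state=0)

    # Baseline: always predict the most frequent class (the "all survived" rule).
    baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)
    model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

    print("baseline test accuracy:", accuracy_score(y_test, baseline.predict(X_test)))
    print("model test accuracy:   ", accuracy_score(y_test, model.predict(X_test)))
    # If the model is only marginally above the baseline, little faith should be
    # put in what its coefficients seem to say about which variables matter.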
E3 Variable Importance

A significant and often overlooked point raised by Brad is what meaning can one give to statements that "variable X is important or not important." This has puzzled me on and off for quite a while. In fact, variable importance has always been defined operationally. In regression the "important" variables are defined by doing "best subsets" or variable deletion.

Another approach used in linear methods such as logistic regression and survival models is to compare the size of the slope estimate for a variable to its estimated standard error. The larger the ratio, the more "important" the variable. Both of these definitions can lead to erroneous conclusions.

My definition of variable importance is based on prediction. A variable might be considered important if deleting it seriously affects prediction accuracy. This brings up the problem that if two variables are highly correlated, deleting one or the other of them will not affect prediction accuracy. Deleting both of them may degrade accuracy considerably. The definition used in random forests spots both variables.

"Importance" does not yet have a satisfactory theoretical definition (I haven't been able to locate the article Brad references but I'll keep looking). It depends on the dependencies between the output variable and the input variables, and on the dependencies between the input variables. The problem begs for research.
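A prediction-based importance measure of the kind described above can be illustrated with a permutation scheme similar in spirit to, though not identical to, the one used in random forests: scramble one variable at a time and record how much test-set accuracy drops. The sketch below uses my own choices of data, model, and function names and is only meant to show the mechanics.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import train_test_split

    def permutation_importance_scores(model, X_test, y_test, n_repeats=10, seed=0):
        # Importance of variable j = average drop in test-set accuracy when
        # column j is randomly permuted, breaking its link to the response
        # while leaving the rest of the data and the fitted model unchanged.
        rng = np.random.default_rng(seed)
        base = accuracy_score(y_test, model.predict(X_test))
        scores = np.zeros(X_test.shape[1])
        for j in range(X_test.shape[1]):
            drops = []
            for _ in range(n_repeats):
                X_perm = X_test.copy()
                X_perm[:, j] = rng.permutation(X_perm[:, j])
                drops.append(base - accuracy_score(y_test, model.predict(X_perm)))
            scores[j] = np.mean(drops)
        return scores

    # n_redundant creates correlated copies of informative variables; each copy
    # typically receives a nonzero score here, rather than one masking the other.
    X, y = make_classification(n_samples=2000, n_features=8, n_informative=3,
                               n_redundant=2, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                        random_state=0)
    forest = RandomForestClassifier(n_estimators=300, random_state=0).fit(X_train, y_train)
    print(np.round(permutation_importance_scores(forest, X_test, y_test), 3))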
E4 Other Reservations

    Sample sizes have swollen alarmingly while goals grow less distinct ("find interesting data structure").

I have not noticed any increasing fuzziness in goals, only that they have gotten more diverse. In the last two workshops I attended (genetics and astronomy) the goals in using the data were clearly laid out. "Searching for structure" is rarely seen even though data may be in the terabyte range.

    The new algorithms often appear in the form of black boxes with enormous numbers of adjustable parameters ("knobs to twiddle").

This is a perplexing statement and perhaps I don't understand what Brad means. Random forests has only one adjustable parameter that needs to be set for a run, is insensitive to the value of this parameter over a wide range, and has a quick and simple way for determining a good value. Support vector machines depend on the settings of 1–2 parameters. Other algorithmic models are similarly sparse in the number of knobs that have to be twiddled.

    New methods always look better than old ones.... Complicated models are harder to criticize than simple ones.

In 1992 I went to my first NIPS conference. At that time, the exciting algorithmic methodology was neural nets. My attitude was grim skepticism. Neural nets had been given too much hype, just as AI had been given and had failed expectations. I came away a believer. Neural nets delivered on the bottom line! In talk after talk, in problem after problem, neural nets were being used to solve difficult prediction problems with test set accuracies better than anything I had seen up to that time.

My attitude toward new and/or complicated methods is pragmatic. Prove that you've got a better mousetrap and I'll buy it. But the proof had better be concrete and convincing.

Brad questions where the bias and variance have gone. It is surprising when, trained in classical bias-variance terms and convinced of the curse of dimensionality, one encounters methods that can handle thousands of variables with little loss of accuracy. It is not voodoo statistics; there is some simple theory that illuminates the behavior of random forests (Breiman, 1999). I agree that more theoretical work is needed to increase our understanding.

Brad is an innovative and flexible thinker who has contributed much to our field. He is opportunistic in problem solving and may, perhaps not overtly, already have algorithmic modeling in his bag of tools.

BRUCE HOADLEY

I thank Bruce Hoadley for his description of the algorithmic procedures developed at Fair, Isaac since the 1960s. They sound like people I would enjoy working with. Bruce makes two points of mild contention. One is the following:

    High performance (predictive accuracy) on the test sample does not guarantee high performance on future samples; things do change.

I agree—algorithmic models accurate in one context must be modified to stay accurate in others. This does not necessarily imply that the way the model is constructed needs to be altered, but that data gathered in the new context should be used in the construction.

His other point of contention is that the Fair, Isaac algorithm retains interpretability, so that it is possible to have both accuracy and interpretability. For clients who like to know what's going on, that's a sellable item. But developments in algorithmic modeling indicate that the Fair, Isaac algorithm is an exception.

A computer scientist working in the machine learning area joined a large money management company some years ago and set up a group to do portfolio management using stock predictions given by large neural nets. When we visited, I asked how he explained the neural nets to clients. "Simple," he said; "We fit binary trees to the inputs and outputs of the neural nets and show the trees to the clients. Keeps them happy!" In both stock prediction and credit rating, the priority is accuracy. Interpretability is a secondary goal that can be finessed.
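Fitting binary trees to the inputs and outputs of a black-box predictor is what is now often called a surrogate model. The sketch below is my own minimal reconstruction of that idea on synthetic data, not the firm's system: train an opaque model, then fit a shallow decision tree to the opaque model's predictions so there is something simple to show, while the opaque model keeps making the actual predictions.

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier
    from sklearn.tree import DecisionTreeClassifier, export_text

    X, y = make_classification(n_samples=3000, n_features=10, n_informative=5,
                               random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                        random_state=0)

    # The accurate but opaque model that actually produces the predictions.
    black_box = MLPClassifier(hidden_layer_sizes=(50, 50), max_iter=500,
                              random_state=0).fit(X_train, y_train)

    # A shallow binary tree fit to the black box's outputs, used only to explain
    # roughly what the black box is doing; the black box keeps doing the predicting.
    surrogate = DecisionTreeClassifier(max_depth=3, random_state=0)
    surrogate.fit(X_train, black_box.predict(X_train))

    print("black-box test accuracy:", black_box.score(X_test, y_test))
    print("surrogate agreement with black box:",
          (surrogate.predict(X_test) == black_box.predict(X_test)).mean())
    print(export_text(surrogate))  # the human-readable summary shown to clients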

MANNY PARZEN

Manny Parzen opines that there are not two but many modeling cultures. This is not an issue I want to fiercely contest. I like my division because it is pretty clear cut—are you modeling the inside of the box or not? For instance, I would include Bayesians in the data modeling culture. I will keep my eye on the quantile culture to see what develops.

Most of all, I appreciate Manny's openness to the issues raised in my paper. With the rapid changes in the scope of statistical problems, more open and concrete discussion of what works and what doesn't should be welcomed.

WHERE ARE WE HEADING?

Many of the best statisticians I have talked to over the past years have serious concerns about the viability of statistics as a field. Oddly, we are in a period where there has never been such a wealth of new statistical problems and sources of data. The danger is that if we define the boundaries of our field in terms of familiar tools and familiar problems, we will fail to grasp the new opportunities.

ADDITIONAL REFERENCES

Beveridge, W. V. I. (1952) The Art of Scientific Investigation. Heinemann, London.
Breiman, L. (1994) The 1990 Census adjustment: undercount or bad data (with discussion)? Statist. Sci. 9 458–475.
Cox, D. R. and Wermuth, N. (1996) Multivariate Dependencies. Chapman and Hall, London.
Efron, B. and Gong, G. (1983) A leisurely look at the bootstrap, the jackknife, and cross-validation. Amer. Statist. 37 36–48.
Efron, B. and Tibshirani, R. (1996) Improvements on cross-validation: the 632+ rule. J. Amer. Statist. Assoc. 91 548–560.
Efron, B. and Tibshirani, R. (1998) The problem of regions. Ann. Statist. 26 1287–1318.
Gong, G. (1982) Cross-validation, the jackknife, and the bootstrap: excess error estimation in forward logistic regression. Ph.D. dissertation, Stanford Univ.
MacKay, R. J. and Oldford, R. W. (2000) Scientific method, statistical method, and the speed of light. Statist. Sci. 15 224–253.
Parzen, E. (1979) Nonparametric statistical data modeling (with discussion). J. Amer. Statist. Assoc. 74 105–131.
Stone, C. (1977) Consistent nonparametric regression. Ann. Statist. 5 595–645.