
Statistical Science
2010, Vol. 25, No. 3, 289–310
DOI: 10.1214/10-STS330
© Institute of Mathematical Statistics, 2010

To Explain or to Predict?

Galit Shmueli

Galit Shmueli is Associate Professor of Statistics, Department of Decision, Operations and Information Technologies, Robert H. Smith School of Business, University of Maryland, College Park, Maryland 20742, USA (e-mail: [email protected]).

Abstract. Statistical modeling is a powerful tool for developing and testing theories by way of causal explanation, prediction, and description. In many disciplines there is near-exclusive use of statistical modeling for causal explanation and the assumption that models with high explanatory power are inherently of high predictive power. Conflation between explanation and prediction is common, yet the distinction must be understood for progressing scientific knowledge. While this distinction has been recognized in the philosophy of science, the statistical literature lacks a thorough discussion of the many differences that arise in the process of modeling for an explanatory versus a predictive goal. The purpose of this article is to clarify the distinction between explanatory and predictive modeling, to discuss its sources, and to reveal the practical implications of the distinction to each step in the modeling process.

Key words and phrases: Explanatory modeling, causality, predictive modeling, predictive power, statistical strategy, data mining, scientific research.

1. INTRODUCTION

Looking at how statistical models are used in different scientific disciplines for the purpose of theory building and testing, one finds a range of perceptions regarding the relationship between causal explanation and empirical prediction. In many scientific fields such as economics, psychology, education, and environmental science, statistical models are used almost exclusively for causal explanation, and models that possess high explanatory power are often assumed to inherently possess predictive power. In fields such as natural language processing and bioinformatics, the focus is on empirical prediction with only a slight and indirect relation to causal explanation. And yet in other research fields, such as epidemiology, the emphasis on causal explanation versus empirical prediction is more mixed. Statistical modeling for description, where the purpose is to capture the data structure parsimoniously, and which is the most commonly developed within the field of statistics, is not commonly used for theory building and testing in other disciplines. Hence, in this article I focus on the use of statistical modeling for causal explanation and for prediction. My main premise is that the two are often conflated, yet the causal versus predictive distinction has a large impact on each step of the statistical modeling process and on its consequences. Although not explicitly stated in the statistics methodology literature, applied statisticians instinctively sense that predicting and explaining are different. This article aims to fill a critical void: to tackle the distinction between explanatory modeling and predictive modeling.

Clearing the current ambiguity between the two is critical not only for proper statistical modeling, but more importantly, for proper scientific usage. Both explanation and prediction are necessary for generating and testing theories, yet each plays a different role in doing so. The lack of a clear distinction within statistics has created a lack of understanding in many disciplines of the difference between building sound explanatory models versus creating powerful predictive models, as well as confusion of explanatory power with predictive power. The implications of this omission and the lack of clear guidelines on how to model for explanatory versus predictive goals are considerable for both scientific research and practice and have also contributed to the gap between academia and practice.

I start by defining what I term explaining and predicting. These definitions are chosen to reflect the distinct scientific goals that they are aimed at: causal explanation and empirical prediction, respectively.

Explanatory modeling and predictive modeling reflect the process of using data and statistical (or data mining) methods for explaining or predicting, respectively. The term modeling is intentionally chosen over models to highlight the entire process involved, from goal definition, study design, and data collection to scientific use.

1.1 Explanatory Modeling

In many scientific fields, and especially the social sciences, statistical methods are used nearly exclusively for testing causal theory. Given a causal theoretical model, statistical models are applied to data in order to test causal hypotheses. In such models, a set of underlying factors that are measured by variables X are assumed to cause an underlying effect, measured by variable Y. Based on collaborative work with social scientists and economists, on an examination of some of their literature, and on conversations with a diverse group of researchers, I conjecture that, whether statisticians like it or not, the type of statistical models used for testing causal hypotheses in the social sciences are almost always association-based models applied to observational data. Regression models are the most common example. The justification for this practice is that the theory itself provides the causality. In other words, the role of the theory is very strong and the reliance on data and statistical modeling is strictly through the lens of the theoretical model. The theory–data relationship varies in different fields. While the social sciences are very theory-heavy, in areas such as bioinformatics and natural language processing the emphasis on a causal theory is much weaker. Hence, given this reality, I define explaining as causal explanation and explanatory modeling as the use of statistical models for testing causal explanations.

To illustrate how explanatory modeling is typically done, I describe the structure of a typical article in a highly regarded journal in the field of Information Systems (IS). Researchers in the field of IS usually have training in economics and/or the behavioral sciences. The structure of articles reflects the way empirical research is conducted in IS and related fields. The example used is an article by Gefen, Karahanna and Straub (2003), which studies technology acceptance. The article starts with a presentation of the prevailing relevant theory(ies):

    Online purchase intentions should be explained in part by the technology acceptance model (TAM). This theoretical model is at present a preeminent theory of technology acceptance in IS.

The authors then proceed to state multiple causal hypotheses (denoted H1, H2, ... in Figure 1, right panel), justifying the merits of each hypothesis and grounding it in theory. The research hypotheses are given in terms of theoretical constructs rather than measurable variables. Unlike measurable variables, constructs are abstractions that "describe a phenomenon of theoretical interest" (Edwards and Bagozzi, 2000) and can be observable or unobservable. Examples of constructs in this article are trust, perceived usefulness (PU), and perceived ease of use (PEOU). Examples of constructs used in other fields include anger, poverty, well-being, and odor. The hypotheses section will often include a causal diagram illustrating the hypothesized causal relationships between the constructs (see Figure 1, left panel). The next step is construct operationalization, where a bridge is built between theoretical constructs and observable measurements, using previous literature and theoretical justification. Only after the theoretical component is completed, and measurements are justified and defined, do researchers proceed to the next step, where data and statistical modeling are introduced alongside the statistical hypotheses, which are operationalized from the research hypotheses.

FIG. 1. Causal diagram (left) and partial list of stated hypotheses (right) from Gefen, Karahanna and Straub (2003).

Statistical inference will lead to "statistical conclusions" in terms of effect sizes and statistical significance in relation to the causal hypotheses. Finally, the statistical conclusions are converted into research conclusions, often accompanied by policy recommendations.

In summary, explanatory modeling refers here to the application of statistical models to data for testing causal hypotheses about theoretical constructs. Whereas "proper" statistical methodology for testing causality exists, such as designed experiments or specialized causal inference methods for observational data [e.g., causal diagrams (Pearl, 1995), discovery algorithms (Spirtes, Glymour and Scheines, 2000), probability trees (Shafer, 1996), and propensity scores (Rosenbaum and Rubin, 1983; Rubin, 1997)], in practice association-based statistical models, applied to observational data, are most commonly used for that purpose.

1.2 Predictive Modeling

I define predictive modeling as the process of applying a statistical model or data mining algorithm to data for the purpose of predicting new or future observations. In particular, I focus on nonstochastic prediction (Geisser, 1993, page 31), where the goal is to predict the output value (Y) for new observations given their input values (X). This definition also includes temporal forecasting, where observations until time t (the input) are used to forecast future values at time t + k, k > 0 (the output). Predictions include point or interval predictions, prediction regions, predictive distributions, or rankings of new observations. Predictive model is any method that produces predictions, regardless of its underlying approach: Bayesian or frequentist, parametric or nonparametric, data mining algorithm or statistical model, etc.

1.3 Descriptive Modeling

Although not the focus of this article, a third type of modeling, which is the most commonly used and developed by statisticians, is descriptive modeling. This type of modeling is aimed at summarizing or representing the data structure in a compact manner. Unlike explanatory modeling, in descriptive modeling the reliance on an underlying causal theory is absent or incorporated in a less formal way. Also, the focus is at the measurable level rather than at the construct level. Unlike predictive modeling, descriptive modeling is not aimed at prediction. Fitting a regression model can be descriptive if it is used for capturing the association between the dependent and independent variables rather than for causal inference or for prediction. We mention this type of modeling to avoid confusion with causal-explanatory and predictive modeling, and also to highlight the different approaches of statisticians and nonstatisticians.

1.4 The Scientific Value of Predictive Modeling

Although explanatory modeling is commonly used for theory building and testing, predictive modeling is nearly absent in many scientific fields as a tool for developing theory. One possible reason is the statistical training of nonstatistician researchers. A look at many introductory statistics textbooks reveals very little in the way of prediction. Another reason is that prediction is often considered unscientific. Berk (2008) wrote, "In the social sciences, for example, one either did causal modeling econometric style or largely gave up quantitative work." From conversations with colleagues in various disciplines it appears that predictive modeling is often valued for its applied utility, yet is discarded for scientific purposes such as theory building or testing. Shmueli and Koppius (2010) illustrated the lack of predictive modeling in the field of IS. Searching the 1072 papers published in the two top-rated journals Information Systems Research and MIS Quarterly between 1990 and 2006, they found only 52 empirical papers with predictive claims, of which only seven carried out proper predictive modeling or testing.

Even among academic statisticians, there appears to be a divide between those who value prediction as the main purpose of statistical modeling and those who see it as unacademic. Examples of statisticians who emphasize predictive methodology include Akaike ("The predictive point of view is a prototypical point of view to explain the basic activity of statistical analysis" in Findley and Parzen, 1998), Deming ("The only useful function of a statistician is to make predictions" in Wallis, 1980), Geisser ("The prediction of observables or potential observables is of much greater relevance than the estimate of what are often artificial constructs-parameters," Geisser, 1975), Aitchison and Dunsmore ("prediction analysis... is surely at the heart of many statistical applications," Aitchison and Dunsmore, 1975) and Friedman ("One of the most common and important uses for data is prediction," Friedman, 1997). Examples of those who see it as unacademic are Kendall and Stuart ("The Science of Statistics deals with the properties of populations. In considering a population of men we are not interested, statistically speaking, in whether some particular individual has brown eyes or is a forger, but rather in how many of the individuals have brown eyes or are forgers," Kendall and Stuart, 1977) and more recently Parzen ("The two goals in analyzing data... I prefer to describe as 'management' and 'science.' Management seeks profit... Science seeks truth," Parzen, 2001). In economics there is a similar disagreement regarding "whether prediction per se is a legitimate objective of economic science, and also whether observed data should be used only to shed light on existing theories or also for the purpose of hypothesis seeking in order to develop new theories" (Feelders, 2002).

Before proceeding with the discrimination between explanatory and predictive modeling, it is important to establish prediction as a necessary scientific endeavor beyond utility, for the purpose of developing and testing theories. Predictive modeling and predictive testing serve several necessary scientific functions:

1. Newly available large and rich datasets often contain complex relationships and patterns that are hard to hypothesize, especially given theories that exclude newly measurable concepts. Using predictive modeling in such contexts can help uncover potential new causal mechanisms and lead to the generation of new hypotheses. See, for example, the discussion between Gurbaxani and Mendelson (1990, 1994) and Collopy, Adya and Armstrong (1994).

2. The development of new theory often goes hand in hand with the development of new measures (Van Maanen, Sorensen and Mitchell, 2007). Predictive modeling can be used to discover new measures as well as to compare different operationalizations of constructs and different measurement instruments.

3. By capturing underlying complex patterns and relationships, predictive modeling can suggest improvements to existing explanatory models.

4. Scientific development requires empirically rigorous and relevant research. Predictive modeling enables assessing the distance between theory and practice, thereby serving as a "reality check" to the relevance of theories.¹ While explanatory power provides information about the strength of an underlying causal relationship, it does not imply its predictive power.

5. Predictive power assessment offers a straightforward way to compare competing theories by examining the predictive power of their respective explanatory models.

6. Predictive modeling plays an important role in quantifying the level of predictability of measurable phenomena by creating benchmarks of predictive accuracy (Ehrenberg and Bound, 1993). Knowledge of un-predictability is a fundamental component of scientific knowledge (see, e.g., Taleb, 2007). Because predictive models tend to have higher predictive accuracy than explanatory statistical models, they can give an indication of the potential level of predictability. A very low predictability level can lead to the development of new measures, new collected data, and new empirical approaches. An explanatory model that is close to the predictive benchmark may suggest that our understanding of that phenomenon can only be increased marginally. On the other hand, an explanatory model that is very far from the predictive benchmark would imply that there are substantial practical and theoretical gains to be had from further scientific development.

¹Predictive models are advantageous in terms of negative empiricism: a model either predicts accurately or it does not, and this can be observed. In contrast, explanatory models can never be confirmed and are harder to contradict.

For a related, more detailed discussion of the value of prediction to scientific theory development see the work of Shmueli and Koppius (2010).

1.5 Explaining and Predicting Are Different

In the philosophy of science, it has long been debated whether explaining and predicting are one or distinct. The conflation of explanation and prediction has its roots in the philosophy of science literature, particularly the influential hypothetico-deductive model (Hempel and Oppenheim, 1948), which explicitly equated prediction and explanation. However, as later became clear, the type of uncertainty associated with explanation is of a different nature than that associated with prediction (Helmer and Rescher, 1959). This difference highlighted the need for developing models geared specifically toward dealing with predicting future events and trends, such as the Delphi method (Dalkey and Helmer, 1963). The distinction between the two concepts has been further elaborated (Forster and Sober, 1994; Forster, 2002; Sober, 2002; Hitchcock and Sober, 2004; Dowe, Gardner and Oppy, 2007). In his book Theory Building, Dubin (1969, page 9) wrote:

    Theories of social and human behavior address themselves to two distinct goals of science: (1) prediction and (2) understanding. It will be argued that these are separate goals [...] I will not, however, conclude that they are either inconsistent or incompatible.

Herbert Simon distinguished between "basic science" and "applied science" (Simon, 2001), a distinction similar to explaining versus predicting. According to Simon, basic science is aimed at knowing ("to describe the world") and understanding ("to provide explanations of these phenomena"). In contrast, in applied science, "Laws connecting sets of variables allow inferences or predictions to be made from known values of some of the variables to unknown values of other variables."

Why should there be a difference between explaining and predicting? The answer lies in the fact that measurable data are not accurate representations of their underlying constructs. The operationalization of theories and constructs into statistical models and measurable data creates a disparity between the ability to explain phenomena at the conceptual level and the ability to generate predictions at the measurable level.

To convey this disparity more formally, consider a theory postulating that construct 𝒳 causes construct 𝒴, via the function ℱ, such that 𝒴 = ℱ(𝒳). ℱ is often represented by a path model, a set of qualitative statements, a plot (e.g., a supply and demand plot), or mathematical formulas. Measurable variables X and Y are operationalizations of 𝒳 and 𝒴, respectively. The operationalization of ℱ into a statistical model f, such as E(Y) = f(X), is done by considering ℱ in light of the study design (e.g., numerical or categorical Y; hierarchical or flat design; time series or cross-sectional; complete or censored data) and practical considerations such as standards in the discipline. Because ℱ is usually not sufficiently detailed to lead to a single f, often a set of f models is considered. Feelders (2002) described this process in the field of economics. In the predictive context, we consider only X, Y and f.

The disparity arises because the goal in explanatory modeling is to match f and ℱ as closely as possible for the statistical inference to apply to the theoretical hypotheses. The data X, Y are tools for estimating f, which in turn is used for testing the causal hypotheses. In contrast, in predictive modeling the entities of interest are X and Y, and the function f is used as a tool for generating good predictions of new Y values. In fact, we will see that even if the underlying causal relationship is indeed 𝒴 = ℱ(𝒳), a function other than f̂(X) and data other than X might be preferable for prediction.

The disparity manifests itself in different ways. Four major aspects are:

Causation–Association: In explanatory modeling f represents an underlying causal function, and X is assumed to cause Y. In predictive modeling f captures the association between X and Y.

Theory–Data: In explanatory modeling, f is carefully constructed based on ℱ in a fashion that supports interpreting the estimated relationship between X and Y and testing the causal hypotheses. In predictive modeling, f is often constructed from the data. Direct interpretability in terms of the relationship between X and Y is not required, although sometimes transparency of f is desirable.

Retrospective–Prospective: Predictive modeling is forward-looking, in that f is constructed for predicting new observations. In contrast, explanatory modeling is retrospective, in that f is used to test an already existing set of hypotheses.

Bias–Variance: The expected prediction error for a new observation with value x, using a quadratic loss function,² is given by Hastie, Tibshirani and Friedman (2009, page 223):

(1)  EPE = E{Y − f̂(x)}²
         = E{Y − f(x)}² + {E(f̂(x)) − f(x)}² + E{f̂(x) − E(f̂(x))}²
         = Var(Y) + Bias² + Var(f̂(x)).

Bias is the result of misspecifying the statistical model f. Estimation variance (the third term) is the result of using a sample to estimate f. The first term is the error that results even if the model is correctly specified and accurately estimated. The above decomposition reveals a source of the difference between explanatory and predictive modeling: In explanatory modeling the focus is on minimizing bias to obtain the most accurate representation of the underlying theory. In contrast, predictive modeling seeks to minimize the combination of bias and estimation variance, occasionally sacrificing theoretical accuracy for improved empirical precision. This point is illustrated in the Appendix, showing that the "wrong" model can sometimes predict better than the correct one.

²For a binary Y, various 0–1 loss functions have been suggested in place of the quadratic loss function (Domingos, 2000).
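To make the bias–variance point concrete, the following minimal simulation sketch (my own illustration, not the paper's Appendix example) compares the out-of-sample error of a correctly specified linear model against an underspecified one that omits a predictor with a small true coefficient. With a small sample and noisy data, the "wrong" model typically wins.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n=25, sigma=2.0, b2=0.2, reps=2000):
    """Compare out-of-sample MSE of the correctly specified model
    (intercept, x1, x2) against an underspecified model omitting x2,
    whose true coefficient b2 is small."""
    b = np.array([1.0, 0.9, b2])  # intercept, x1, x2 (x2 effect is weak)
    mse_full = mse_under = 0.0
    for _ in range(reps):
        X = rng.normal(size=(n, 2))
        y = b[0] + X @ b[1:] + rng.normal(scale=sigma, size=n)
        Xn = rng.normal(size=(n, 2))
        yn = b[0] + Xn @ b[1:] + rng.normal(scale=sigma, size=n)
        # correctly specified model: unbiased but higher estimation variance
        A = np.column_stack([np.ones(n), X])
        c, *_ = np.linalg.lstsq(A, y, rcond=None)
        mse_full += np.mean((yn - np.column_stack([np.ones(n), Xn]) @ c) ** 2)
        # underspecified model: biased (omits x2) but lower variance
        A1 = np.column_stack([np.ones(n), X[:, 0]])
        c1, *_ = np.linalg.lstsq(A1, y, rcond=None)
        mse_under += np.mean((yn - np.column_stack([np.ones(n), Xn[:, 0]]) @ c1) ** 2)
    return mse_full / reps, mse_under / reps

print(simulate())  # with small n, the "wrong" model often has lower MSE
```

Roughly, omitting x2 adds a squared bias of order b2² per observation while saving estimation variance of order σ²/n, so the biased model wins whenever the variance saving exceeds the bias cost.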

The four aspects impact every step of the modeling process, such that the resulting f is markedly different in the explanatory and predictive contexts, as will be shown in Section 2.

1.6 A Void in the Statistics Literature

The philosophical explaining/predicting debate has not been directly translated into statistical language in terms of the practical aspects of the entire statistical modeling process.

A search of the statistics literature for discussion of explaining versus predicting reveals a lively discussion in the context of model selection, and in particular, the derivation and evaluation of model selection criteria. In this context, Konishi and Kitagawa (2007) wrote:

    There may be no significant difference between the point of view of inferring the true structure and that of making a prediction if an infinitely large quantity of data is available or if the data are noiseless. However, in modeling based on a finite quantity of real data, there is a significant gap between these two points of view, because an optimal model for prediction purposes may be different from one obtained by estimating the 'true model.'

The literature on this topic is vast, and we do not intend to cover it here, although we discuss the major points in Section 2.6.

The focus on prediction in the field of machine learning and by statisticians such as Geisser, Aitchison and Dunsmore, Breiman and Friedman has highlighted aspects of predictive modeling that are relevant to the explanatory/prediction distinction, although they do not directly contrast explanatory and predictive modeling.³ The prediction literature raises the importance of evaluating predictive power using holdout data, and the usefulness of algorithmic methods (Breiman, 2001b). The predictive focus has also led to the development of inference tools that generate predictive distributions. Geisser (1993) introduced "predictive inference" and developed it mainly in a Bayesian context. "Predictive likelihood" (see Bjornstad, 1990) is a likelihood-based approach to predictive inference, and Dawid's prequential theory (Dawid, 1984) investigates inference concepts in terms of predictability. Finally, the bias–variance aspect has been pivotal in data mining for understanding the predictive performance of different algorithms and for designing new ones.

³Geisser distinguished between "[statistical] parameters" and "observables" in terms of the objects of interest. His distinction is closely related to, but somewhat different from, our distinction between theoretical constructs and measurements.

Another area in statistics and econometrics that focuses on prediction is time series. Methods have been developed specifically for testing the predictability of a series [e.g., random walk tests or the concept of Granger causality (Granger, 1969)], and for evaluating predictability by examining performance on holdout data. The time series literature in statistics is dominated by extrapolation models such as ARIMA-type models and exponential smoothing methods, which are suitable for prediction and description, but not for causal explanation. Causal models for time series are common in econometrics (e.g., Song and Witt, 2000), where an underlying causal theory links constructs, which lead to operationalized variables, as in the cross-sectional case. Yet, to the best of my knowledge, there is no discussion in the statistics time series literature regarding the distinction between predictive and explanatory modeling, aside from the debate in economics regarding the scientific value of prediction.

To conclude, the explanatory/predictive modeling distinction has been discussed directly in the model selection context, but not in the larger context. Areas that focus on developing predictive modeling, such as machine learning and statistical time series, and "predictivists" such as Geisser, have considered prediction as a separate issue, and have not discussed its principal and practical distinction from causal explanation in terms of developing and testing theory. The goal of this article is therefore to examine the explanatory versus predictive debate from a statistical perspective, considering how modeling is used by nonstatistician scientists for theory development.

The remainder of the article is organized as follows. In Section 2, I consider each step in the modeling process in terms of the four aspects of the predictive/explanatory modeling distinction: causation–association, theory–data, retrospective–prospective and bias–variance. Section 3 illustrates some of these differences via two examples. A discussion of the implications of the predict/explain conflation, conclusions, and recommendations are given in Section 4.

2. TWO MODELING PATHS

In the following I examine the process of statistical modeling through the explain/predict lens, from goal definition to model use and reporting. For clarity, I broke down the process into a generic set of steps, as depicted in Figure 2.

FIG. 2. Steps in the statistical modeling process.

In each step I point out differences in the choice of methods, criteria, data, and information to consider when the goal is predictive versus explanatory. I also briefly describe the related statistics literature. The conceptual and practical differences invariably lead to a difference between a final explanatory model and a predictive one, even though they may use the same initial data. Thus, a priori determination of the main study goal as either explanatory or predictive⁴ is essential to conducting adequate modeling. The discussion in this section assumes that the main research goal has been determined as either explanatory or predictive.

⁴The main study goal can also be descriptive.

2.1 Study Design and Data Collection

Even at the early stages of study design and data collection, issues of what and how much data to collect, according to what design, and which collection instrument to use are considered differently for prediction versus explanation. Consider sample size. In explanatory modeling, where the goal is to estimate the theory-based f with adequate precision and to use it for inference, statistical power is the main consideration. Reducing bias also requires sufficient data for model specification testing. Beyond a certain amount of data, however, extra precision is negligible for purposes of inference. In contrast, in predictive modeling, f itself is often determined from the data, thereby requiring a larger sample for achieving lower bias and variance. In addition, more data are needed for creating holdout datasets (see Section 2.2). Finally, predicting new individual observations accurately, in a prospective manner, requires more data than retrospective inference regarding population-level parameters, due to the extra uncertainty.
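The contrast in sample-size needs can be sketched in code (an illustrative simulation, not from the paper): the standard error of a regression coefficient shrinks like 1/√n and soon yields diminishing returns for inference, whereas the holdout error of a data-driven predictor (here a simple k-nearest-neighbor fit to a nonlinear f) keeps improving as n grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def coef_se(n):
    # theoretical SE of the slope in y = b*x + noise, with x ~ N(0,1), sigma = 1
    return 1.0 / np.sqrt(n)

def knn_holdout_rmse(n, k=5, n_test=500):
    x = rng.uniform(-3, 3, n)
    y = np.sin(2 * x) + rng.normal(scale=0.3, size=n)    # nonlinear truth
    xt = rng.uniform(-3, 3, n_test)
    yt = np.sin(2 * xt) + rng.normal(scale=0.3, size=n_test)
    preds = np.empty(n_test)
    for i, x0 in enumerate(xt):
        nearest = np.argsort(np.abs(x - x0))[:k]         # k nearest training points
        preds[i] = y[nearest].mean()
    return np.sqrt(np.mean((yt - preds) ** 2))

for n in (100, 1000, 10000):
    print(n, round(coef_se(n), 4), round(knn_holdout_rmse(n), 3))
```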
A second design issue is the sampling scheme. For instance, in the context of hierarchical data (e.g., sampling students within schools) Afshartous and de Leeuw (2005) noted, "Although there exists an extensive literature on estimation issues in multilevel models, the same cannot be said with respect to prediction." Examining issues of sample size, sample allocation, and multilevel modeling for the purpose of "predicting a future observable y∗j in the Jth group of a hierarchical dataset," they found that allocation for estimation versus prediction should be different: "an increase in group size n is often more beneficial with respect to prediction than an increase in the number of groups J ... [whereas] estimation is more improved by increasing the number of groups J instead of the group size n." This relates directly to the bias–variance aspect. A related issue is the choice of f in relation to the sampling scheme. Afshartous and de Leeuw (2005) found that for their hierarchical data, a hierarchical f, which is more appropriate theoretically, had poorer predictive performance than a nonhierarchical f.

A third design consideration is the choice between experimental and observational settings. Whereas for causal explanation experimental data are greatly preferred, subject to availability and resource constraints, in prediction observational data are sometimes preferable to "overly clean" experimental data, if they better represent the realistic context of prediction in terms of the uncontrolled factors, the noise, the measured response, etc. This difference arises from the theory–data and prospective–retrospective aspects. Similarly, when choosing between primary data (data collected for the purpose of the study) and secondary data (data collected for other purposes), the classic criteria of data recency, relevance, and accuracy (Patzer, 1995) are considered from a different angle. For example, a predictive model requires the secondary data to include the exact X, Y variables to be used at the time of prediction, whereas for causal explanation different operationalizations of the constructs 𝒳, 𝒴 may be acceptable.

In terms of the data collection instrument, whereas in explanatory modeling the goal is to obtain a reliable and valid instrument such that the data obtained represent the underlying construct adequately (e.g., item response theory in psychometrics), for predictive purposes it is more important to focus on the measurement quality and its meaning in terms of the variable to be predicted.

Finally, consider the field of design of experiments: two major experimental designs are factorial designs and response surface methodology (RSM) designs. The former is focused on causal explanation in terms of finding the factors that affect the response. The latter is aimed at prediction—finding the combination of predictors that optimizes Y. Factorial designs employ a linear f for interpretability, whereas RSM designs use optimization techniques and estimate a nonlinear f from the data, which is less interpretable but more predictively accurate.⁵

⁵I thank Douglas Montgomery for this insight.

2.2 Data Preparation

We consider two common data preparation operations: handling missing values and data partitioning.

2.2.1 Handling missing values. Most real datasets contain missing values, thereby requiring one to identify the missing values, to determine the extent and type of missingness, and to choose a course of action accordingly. Although a rich literature exists on data imputation, it is monopolized by an explanatory context. In predictive modeling, the solution strongly depends on whether the missing values are in the training data and/or the data to be predicted. For example, Sarle (1998) noted:

    If you have only a small proportion of cases with missing data, you can simply throw out those cases for purposes of estimation; if you want to make predictions for cases with missing inputs, you don't have the option of throwing those cases out.

Sarle further listed imputation methods that are useful for explanatory purposes but not for predictive purposes and vice versa. One example is using regression models with dummy variables that indicate missingness, which is considered unsatisfactory in explanatory modeling, but can produce excellent predictions. The usefulness of creating missingness dummy variables was also shown by Ding and Simonoff (2010). In particular, whereas the classic explanatory approach is based on the Missing-At-Random, Missing-Completely-At-Random or Not-Missing-At-Random classification (Little and Rubin, 2002), Ding and Simonoff (2010) showed that for predictive purposes the important distinction is whether the missingness depends on Y or not. They concluded:

    In the context of classification trees, the relationship between the missingness and the dependent variable, rather than the standard missingness classification approach of Little and Rubin (2002)... is the most helpful criterion to distinguish different missing data methods.

Moreover, missingness can be a blessing in a predictive context, if it is sufficiently informative of Y (e.g., missingness in financial statements when the goal is to predict fraudulent reporting).
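A minimal sketch of the missingness-indicator idea follows (column names are hypothetical): the indicator lets informative missingness contribute to prediction, while a simple fill-in keeps the original predictor usable.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"income": [50.0, np.nan, 80.0, np.nan, 65.0]})
df["income_missing"] = df["income"].isna().astype(int)      # missingness dummy
df["income"] = df["income"].fillna(df["income"].median())   # crude fill-in
print(df)  # both columns can now enter a predictive model
```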
Finally, a completely different approach for handling missing data for prediction, mentioned by Sarle (1998) and further developed by Saar-Tsechansky and Provost (2007), considers the case where to-be-predicted observations are missing some predictor information, such that the missing information can vary across different observations. The proposed solution is to estimate multiple "reduced" models, each excluding some predictors. When predicting an observation with missingness on a certain set of predictors, the model that excludes those predictors is used. This approach means that different reduced models are created for different observations. Although useful for prediction, it is clearly inappropriate for causal explanation.

2.2.2 Data partitioning. A popular solution for avoiding overoptimistic predictive accuracy is to evaluate performance not on the training set, that is, the data used to build the model, but rather on a holdout sample which the model "did not see." The creation of a holdout sample can be achieved in various ways, the most commonly used being a random partition of the sample into training and holdout sets. A popular alternative, especially with scarce data, is cross-validation. Other alternatives are resampling methods, such as the bootstrap, which can be computationally intensive but avoid "bad partitions" and enable predictive modeling with small datasets.

Data partitioning is aimed at minimizing the combined bias and variance by sacrificing some bias in return for a reduction in sampling variance. A smaller sample is associated with higher bias when f is estimated from the data, which is common in predictive modeling but not in explanatory modeling. Hence, data partitioning is useful for predictive modeling but less so for explanatory modeling. With today's abundance of large datasets, where the bias sacrifice is practically small, data partitioning has become a standard preprocessing step in predictive modeling.

In explanatory modeling, data partitioning is less common because of the reduction in statistical power. When used, it is usually done for the retrospective purpose of assessing the robustness of f̂. A rarer yet important use of data partitioning in explanatory modeling is for strengthening model validity, by demonstrating some predictive power. Although one would not expect an explanatory model to be optimal in terms of predictive power, it should show some degree of accuracy (see discussion in Section 4.2).
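A bare-bones version of the random training/holdout partition described above (a numpy-only sketch, with arbitrary simulated data and split proportions):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=500)

idx = rng.permutation(len(y))
train, hold = idx[:350], idx[350:]            # e.g., a 70/30 random partition

A = np.column_stack([np.ones(len(train)), X[train]])
coef, *_ = np.linalg.lstsq(A, y[train], rcond=None)

Ah = np.column_stack([np.ones(len(hold)), X[hold]])
rmse_hold = np.sqrt(np.mean((y[hold] - Ah @ coef) ** 2))
print(rmse_hold)  # performance reported on data the model "did not see"
```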

2.3 Exploratory Data Analysis

Exploratory data analysis (EDA) is a key initial step in both explanatory and predictive modeling. It consists of summarizing the data numerically and graphically, reducing their dimension, and "preparing" for the more formal modeling step. Although the same set of tools can be used in both cases, they are used in a different fashion. In explanatory modeling, exploration is channeled toward the theoretically specified causal relationships, whereas in predictive modeling EDA is used in a more free-form fashion, supporting the purpose of capturing relationships that are perhaps unknown or at least less formally formulated.

One example is how data visualization is carried out. Fayyad, Grinstein and Wierse (2002, page 22) contrasted "exploratory visualization" with "confirmatory visualization":

    Visualizations can be used to explore data, to confirm a hypothesis, or to manipulate a viewer... In exploratory visualization the user does not necessarily know what he is looking for. This creates a dynamic scenario in which interaction is critical... In a confirmatory visualization, the user has a hypothesis that needs to be tested. This scenario is more stable and predictable. System parameters are often predetermined.

Hence, interactivity, which supports exploration across a wide and sometimes unknown terrain, is very useful for learning about measurement quality and associations that are at the core of predictive modeling, but much less so in explanatory modeling, where the data are visualized through the theoretical lens.

A second example is numerical summaries. In a predictive context, one might explore a wide range of numerical summaries for all variables of interest, whereas in an explanatory model, the numerical summaries would focus on the theoretical relationships. For example, in order to assess the role of a certain variable as a mediator, its correlation with the response variable and with other covariates is examined by generating specific correlation tables.

A third example is the use of EDA for assessing assumptions of potential models (e.g., normality or multicollinearity) and exploring possible variable transformations. Here, too, an explanatory context would be more restrictive in terms of the space explored.

Finally, dimension reduction is viewed and used differently. In predictive modeling, a reduction in the number of predictors can help reduce sampling variance. Hence, methods such as principal components analysis (PCA) or other data compression methods that are even less interpretable (e.g., singular value decomposition) are often carried out initially. They may later lead to the use of compressed variables (such as the first few components) as predictors, even if those are not easily interpretable. PCA is also used in explanatory modeling, but for a different purpose. For questionnaire data, PCA and exploratory factor analysis are used to determine the validity of the survey instrument. The resulting factors are expected to correspond to the underlying constructs. In fact, the rotation step in factor analysis is specifically aimed at making the factors more interpretable. Similarly, correlations are used for assessing the reliability of the survey instrument.
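A sketch of predictive dimension reduction (illustrative, with simulated data): correlated predictors are compressed via PCA, computed here through the SVD, and the leading component scores can then serve as (uninterpretable) predictors.

```python
import numpy as np

rng = np.random.default_rng(4)
Z = rng.normal(size=(200, 2))                 # two latent driving factors
# ten observed predictors, all noisy copies of the latent factors
X = np.column_stack([Z + 0.1 * rng.normal(size=(200, 2)) for _ in range(5)])

Xc = X - X.mean(axis=0)                       # center before PCA
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt[:2].T                        # first two principal component scores
print(scores.shape)                           # (200, 2): compressed predictors
```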
2.4 Choice of Variables

The criteria for choosing variables differ markedly in explanatory versus predictive contexts.

In explanatory modeling, where variables are seen as operationalized constructs, variable choice is based on the role of the construct in the theoretical causal structure and on the operationalization itself. A broad terminology related to different variable roles exists in various fields: in the social sciences—antecedent, consequent, mediator and moderator variables;⁶ in pharmacology and medical sciences—treatment and control variables; and in epidemiology—exposure and confounding variables. Carte and Craig (2003) mentioned that explaining moderating effects has become an important scientific endeavor in the field of Management Information Systems. Another important term, common in economics, is endogeneity or "reverse causation," which results in biased parameter estimates. Endogeneity can occur due to different reasons. One reason is incorrectly omitting an input variable, say Z, from f when the causal construct 𝒵 is assumed to cause 𝒳 and 𝒴. In a regression model of Y on X, the omission of Z results in X being correlated with the error term. Winkelmann (2008) gave the example of a hypothesis that health insurance (𝒳) affects the demand for health services (𝒴). The operationalized variables are "health insurance status" (X) and "number of doctor consultations" (Y). Omitting an input measurement Z for "true health status" (𝒵) from the regression model f causes endogeneity because X can be determined by Y (i.e., reverse causation), which manifests as X being correlated with the error term in f. Endogeneity can arise due to other reasons, such as measurement error in X. Because of the focus in explanatory modeling on causality and on bias, there is a vast literature on detecting endogeneity and on solutions such as constructing instrumental variables and using models such as two-stage least squares (2SLS). Another related term is simultaneous causality, which gives rise to special models such as Seemingly Unrelated Regression (SUR) (Zellner, 1962). In terms of chronology, a causal explanatory model can include only "control" variables that take place before the causal variable (Gelman et al., 2003). And finally, for reasons of model identifiability (i.e., given the statistical model, each causal effect can be identified), one is required to include main effects in a model that contains an interaction term between those effects. We note this practice because it is not necessary or useful in the predictive context, due to the acceptability of uninterpretable models and the potential reduction in sampling variance when dropping predictors (see, e.g., the Appendix).

⁶"A moderator variable is one that influences the strength of a relationship between two other variables, and a mediator variable is one that explains the relationship between the two other variables" (from http://psych.wisc.edu/henriques/mediator.html).

In predictive modeling, the focus on association rather than causation, the lack of ℱ, and the prospective context mean that there is no need to delve into the exact role of each variable in terms of an underlying causal structure. Instead, criteria for choosing predictors are the quality of the association between the predictors and the response, data quality, and availability of the predictors at the time of prediction, known as ex-ante availability. In terms of ex-ante availability, whereas chronological precedence of X to Y is necessary in causal models, in predictive models not only must X precede Y, but X must be available at the time of prediction. For instance, explaining wine quality retrospectively would dictate including barrel characteristics as a causal factor. The inclusion of barrel characteristics in a predictive model of future wine quality would be impossible if at the time of prediction the grapes are still on the vine. See the eBay example in Section 3.2 for another example.
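Ex-ante availability can be illustrated in code (a constructed example, borrowing the oil-price/airfare setting discussed in the next subsection): a model used for prediction must be fit on a lagged predictor that will actually be observed at prediction time.

```python
import numpy as np

T = 120
rng = np.random.default_rng(2)
oil = np.cumsum(rng.normal(size=T)) + 50                # simulated oil price series
airfare = 100 + 2.0 * oil + rng.normal(scale=5, size=T)

# Explanatory-style model: airfare_t ~ oil_t (coincident; fine retrospectively).
# Predictive model: airfare_t ~ oil_{t-1}, which is available ex ante.
X_lag = oil[:-1]
y = airfare[1:]
slope, intercept = np.polyfit(X_lag, y, 1)
# The one-step-ahead forecast uses the latest *observed* oil price:
print(intercept + slope * oil[-1])
```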
2.5 Choice of Methods

Considering the four aspects of causation–association, theory–data, retrospective–prospective and bias–variance leads to different choices of plausible methods, with a much larger array of methods useful for prediction. Explanatory modeling requires interpretable statistical models f that are easily linked to the underlying theoretical model ℱ. Hence the popularity of statistical models, and especially regression-type methods, in many disciplines. Algorithmic methods such as neural networks or k-nearest-neighbors, and uninterpretable nonparametric models, are considered ill-suited for explanatory modeling.

In predictive modeling, where the top priority is generating accurate predictions of new observations and f is often unknown, the range of plausible methods includes not only statistical models (interpretable and uninterpretable) but also data mining algorithms. A neural network algorithm might not shed light on an underlying causal mechanism ℱ or even on f, but it can capture complicated associations, thereby leading to accurate predictions. Although model transparency might be important in some cases, it is of secondary importance: "Using complex predictors may be unpleasant, but the soundest path is to go for predictive accuracy first, then try to understand why" (Breiman, 2001b).

Breiman (2001b) accused the statistical community of ignoring algorithmic modeling:

    There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown. The statistical community has been committed to the almost exclusive use of data models.

From the explanatory/predictive view, algorithmic modeling is indeed very suitable for predictive (and descriptive) modeling, but not for explanatory modeling.

Some methods are not suitable for prediction from the retrospective–prospective aspect, especially in time series forecasting. Models with coincident indicators, which are measured simultaneously, are one such class. An example is the model Airfare_t = f(OilPrice_t), which might be useful for explaining the effect of oil price on airfare based on a causal theory, but not for predicting future airfare because the oil price at the time of prediction is unknown. For prediction, alternative models must be considered, such as using a lagged OilPrice variable, or creating a separate model for forecasting oil prices and plugging its forecast into the airfare model. Another example is the centered moving average, which requires the availability of data during a time window before and after a period of interest, and is therefore not useful for prediction.

Lastly, the bias–variance aspect raises two classes of methods that are very useful for prediction, but not for explanation. The first is shrinkage methods such as ridge regression, principal components regression, and partial least squares regression, which "shrink" predictor coefficients or even eliminate them, thereby introducing bias into f, for the purpose of reducing estimation variance. The second class of methods, which "have been called the most influential development in Data Mining and Machine Learning in the past decade" (Seni and Elder, 2010, page vi), are ensemble methods such as bagging (Breiman, 1996), random forests (Breiman, 2001a), boosting⁷ (Schapire, 1999), variations of those methods, and Bayesian alternatives (e.g., Brown, Vannucci and Fearn, 2002). Ensembles combine multiple models to produce more precise predictions by averaging the predictions from different models, and have proven useful in numerous applications (see the Netflix Prize example in Section 3.1).

⁷Although boosting algorithms were developed as ensemble methods, "[they can] be seen as an interesting regularization scheme for estimating a model" (Bühlmann and Hothorn, 2007).
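A toy sketch of the bagging idea (my illustration; polynomial fits stand in for the usual tree base learners): predictions from models fit on bootstrap samples are averaged, trading a little bias for reduced variance.

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.uniform(0, 1, 100)
y = np.sin(4 * x) + rng.normal(scale=0.3, size=100)
x_new = np.linspace(0, 1, 50)

def fit_predict(xs, ys, xq, degree=6):
    # a flexible (high-variance) base learner
    return np.polyval(np.polyfit(xs, ys, degree), xq)

preds = []
for _ in range(200):                        # 200 bootstrap replicates
    b = rng.integers(0, len(x), len(x))     # resample rows with replacement
    preds.append(fit_predict(x[b], y[b], x_new))
bagged = np.mean(preds, axis=0)             # ensemble = average of predictions
print(bagged[:5])
```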
2.6 Validation, Model Evaluation and Model Selection

Choosing the final model among a set of models, validating it, and evaluating its performance differ markedly in explanatory and predictive modeling. Although the process is iterative, I separate it into three components for ease of exposition.

2.6.1 Validation. In explanatory modeling, validation consists of two parts: model validation validates that f adequately represents ℱ, and model fit validates that f̂ fits the data {X, Y}. In contrast, validation in predictive modeling is focused on generalization, which is the ability of f̂ to predict new data {X_new, Y_new}.

Methods used in explanatory modeling for model validation include model specification tests such as the popular Hausman specification test in econometrics (Hausman, 1978), and construct validation techniques such as reliability and validity measures of survey questions and factor analysis. Inference for individual coefficients is also used for detecting over- or under-specification. Validating model fit involves goodness-of-fit tests (e.g., normality tests) and model diagnostics such as residual analysis. Although indications of lack of fit might lead researchers to modify f, modifications are made carefully in light of the relationship with ℱ and the constructs 𝒳, 𝒴.

In predictive modeling, the biggest danger to generalization is overfitting the training data. Hence validation consists of evaluating the degree of overfitting, by comparing the performance of f̂ on the training and holdout sets. If performance is significantly better on the training set, overfitting is implied.

Not only is the larger context of validation markedly different in explanatory and predictive modeling, but so are the details. For example, checking for multicollinearity is a standard operation in assessing model fit. This practice is relevant in explanatory modeling, where multicollinearity can lead to inflated standard errors, which interferes with inference. Therefore, a vast literature exists on strategies for identifying and reducing multicollinearity, variable selection being one strategy. In contrast, for predictive purposes "multicollinearity is not quite as damning" (Vaughan and Berry, 2005). Makridakis, Wheelwright and Hyndman (1998, page 288) distinguished between the role of multicollinearity in explaining versus its role in predicting:

    Multicollinearity is not a problem unless either (i) the individual regression coefficients are of interest, or (ii) attempts are made to isolate the contribution of one explanatory variable to Y, without the influence of the other explanatory variables. Multicollinearity will not affect the ability of the model to predict.

Another example is the detection of influential observations. While classic methods are aimed at detecting observations that are influential in terms of model estimation, Johnson and Geisser (1983) proposed a method for detecting influential observations in terms of their effect on the predictive distribution.
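The overfitting check described above amounts to comparing training and holdout error, as in this small sketch: the overly flexible fit shows low training error but high holdout error.

```python
import numpy as np

rng = np.random.default_rng(6)
x = rng.uniform(-1, 1, 60)
y = x ** 2 + rng.normal(scale=0.2, size=60)
xt = rng.uniform(-1, 1, 60)                 # holdout data
yt = xt ** 2 + rng.normal(scale=0.2, size=60)

for degree in (2, 12):                      # modest vs. overly flexible fit
    c = np.polyfit(x, y, degree)
    tr = np.sqrt(np.mean((y - np.polyval(c, x)) ** 2))
    ho = np.sqrt(np.mean((yt - np.polyval(c, xt)) ** 2))
    print(degree, round(tr, 3), round(ho, 3))  # degree 12: low train, high holdout
```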
2.6.2 Model evaluation. Consider two performance aspects of a model: explanatory power and predictive power. The top priority in terms of model performance in explanatory modeling is assessing explanatory power, which measures the strength of relationship indicated by f̂. Researchers report R²-type values and the statistical significance of overall F-type statistics to indicate the level of explanatory power.

In contrast, in predictive modeling, the focus is on predictive accuracy or predictive power, which refer to the performance of f̂ on new data. Measures of predictive power are typically out-of-sample metrics or their in-sample approximations, which depend on the type of required prediction. For example, predictions of a binary Y could be binary classifications (Ŷ = 0, 1), predicted probabilities of a certain class [P̂(Y = 1)], or rankings of those probabilities. The latter are common in marketing and personnel psychology. These three different types of predictions would warrant different performance metrics. For example, a model can perform poorly in producing binary classifications but adequately in producing rankings. Moreover, in the context of asymmetric costs, where costs are heftier for some types of prediction errors than for others, alternative performance metrics are used, such as the "average cost per predicted observation."

A common misconception in various scientific fields is that predictive power can be inferred from explanatory power. However, the two are different and should be assessed separately. While predictive power can be assessed for both explanatory and predictive models, explanatory power is not typically possible to assess for predictive models because of the lack of ℱ and an underlying causal structure. Measures such as R² and F would indicate the level of association, but not causation.

Predictive power is assessed using metrics computed from a holdout set or using cross-validation (Stone, 1974; Geisser, 1975). Thus, a major difference between explanatory and predictive performance metrics is the data from which they are computed. In general, measures computed from the data to which the model was fitted tend to be overoptimistic in terms of predictive accuracy: "Testing the procedure on the data that gave it birth is almost certain to overestimate performance" (Mosteller and Tukey, 1977). Thus, the holdout set serves as a more realistic context for evaluating predictive power.
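The dependence of the metric on the type of prediction can be sketched with hypothetical numbers: the same predicted probabilities are scored as classifications, as rankings, and under asymmetric costs, and the three scores need not agree.

```python
import numpy as np

y = np.array([1, 0, 1, 1, 0, 0, 1, 0])                    # actual classes
p = np.array([0.9, 0.4, 0.6, 0.3, 0.2, 0.55, 0.8, 0.1])   # predicted P(Y=1)

yhat = (p >= 0.5).astype(int)                  # binary classification at cutoff 0.5
accuracy = np.mean(yhat == y)

# ranking quality: fraction of (positive, negative) pairs ordered correctly
pairs = [(i, j) for i in np.where(y == 1)[0] for j in np.where(y == 0)[0]]
ranking = np.mean([p[i] > p[j] for i, j in pairs])

# asymmetric costs: say a false negative costs 5, a false positive costs 1
cost = np.mean(5 * ((yhat == 0) & (y == 1)) + 1 * ((yhat == 1) & (y == 0)))
print(accuracy, ranking, cost)
```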
2.6.3 Model selection. Once a set of models f1, f2, ... has been estimated and validated, model selection pertains to choosing among them. Two main differentiating aspects are the theory–data and bias–variance considerations.

In explanatory modeling, the models are compared in terms of explanatory power, and hence the popularity of nested models, which are easily compared. Stepwise-type methods, which use overall F statistics to include and/or exclude variables, might appear suitable for achieving high explanatory power. However, optimizing explanatory power in this fashion conceptually contradicts the validation step, where variable inclusion/exclusion and the structure of the statistical model are carefully designed to represent the theoretical model. Hence, proper explanatory model selection is performed in a constrained manner. In the words of Jaccard (2001):

    Trimming potentially theoretically meaningful variables is not advisable unless one is quite certain that the coefficient for the variable is near zero, that the variable is inconsequential, and that trimming will not introduce misspecification error.

A researcher might choose to retain a causal covariate which has strong theoretical justification even if it is statistically insignificant. For example, in medical research, a covariate that denotes whether a person smokes or not is often present in models for health conditions, whether it is statistically significant or not.⁸ In contrast to explanatory power, statistical significance plays a minor or no role in assessing predictive performance. In fact, it is sometimes the case that removing inputs with small coefficients, even if they are statistically significant, results in improved prediction accuracy (Greenberg and Parks, 1997; Wu, Harris and McAuley, 2007; and see the Appendix). Stepwise-type algorithms are very useful in predictive modeling as long as the selection criteria rely on predictive power rather than explanatory power.

⁸I thank Ayala Cohen for this example.
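A small sketch of the AIC/BIC contrast for nested linear models (under standard Gaussian-likelihood assumptions, with constants dropped): AIC's penalty of 2 per parameter reflects its predictive orientation, while BIC's log(n) penalty more aggressively targets recovery of the "true" model.

```python
import numpy as np

rng = np.random.default_rng(7)
n = 200
X = rng.normal(size=(n, 4))
# true model uses three predictors, the third with a weak effect
y = 1.0 + X[:, 0] - 0.5 * X[:, 1] + 0.15 * X[:, 2] + rng.normal(size=n)

for k in range(1, 5):                     # nested models using the first k predictors
    A = np.column_stack([np.ones(n), X[:, :k]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ coef) ** 2)
    p = k + 2                             # slopes + intercept + error variance
    aic = n * np.log(rss / n) + 2 * p
    bic = n * np.log(rss / n) + np.log(n) * p
    print(k, round(aic, 1), round(bic, 1))  # BIC penalizes the weak predictor harder
```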

In a sense, the AIC and the BIC provide estimates of different things; yet, they almost always are thought to be in competition. If the question of which estimator is better is to make sense, we must decide whether the average likelihood of a family [=BIC] or its predictive accuracy [=AIC] is what we want to estimate.

Similarly, Dowe, Gardner and Oppy (2007) contrasted the two Bayesian model selection criteria Minimum Message Length (MML) and Minimum Expected Kullback–Leibler Distance (MEKLD). They concluded,

If you want to maximise predictive accuracy, you should minimise the expected KL distance (MEKLD); if you want the best inference, you should use MML.

Kadane and Lazar (2004) examined a variety of model selection criteria from a Bayesian decision–theoretic point of view, comparing prediction with explanation goals.

Even when using predictive metrics, the fashion in which they are used within a model selection process can deteriorate their adequacy, yielding overoptimistic predictive performance. Berk (2008) described the case where

statistical learning procedures are often applied several times to the data with one or more tuning parameters varied. The AIC may be computed for each. But each AIC is ignorant about the information obtained from prior fitting attempts and how many degrees of freedom were expended in the process. Matters are even more complicated if some of the variables are transformed or recoded... Some unjustified optimism remains.
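As a small, generic illustration of criterion-based selection (simulated data; this is not Berk's example), the sketch below computes AIC and BIC for a sequence of nested OLS models in statsmodels. Because BIC penalizes dimensionality more heavily, the two criteria need not select the same model:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(1)
    n = 100
    X_full = rng.normal(size=(n, 6))
    # Only the first three predictors carry signal, with shrinking strength.
    y = (X_full[:, 0] + 0.4 * X_full[:, 1] + 0.2 * X_full[:, 2]
         + rng.normal(size=n))

    # Fit nested models with 1, 2, ..., 6 predictors and record criteria.
    for k in range(1, 7):
        X = sm.add_constant(X_full[:, :k])
        fit = sm.OLS(y, X).fit()
        print(f"k={k}  AIC={fit.aic:8.2f}  BIC={fit.bic:8.2f}")
    # AIC (a predictive criterion) and BIC (oriented toward the "true"
    # model) can point to different values of k.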
2.7 Model Use and Reporting

Given all the differences that arise in the modeling process, the resulting predictive model would obviously be very different from a resulting explanatory model in terms of the data used ({X, Y}), the estimated model fˆ, and explanatory power and predictive power. The use of fˆ would also greatly differ.

As illustrated in Section 1.1, explanatory models in the context of scientific research are used to derive “statistical conclusions” using inference, which in turn are translated into scientific conclusions regarding F, X, Y and the causal hypotheses. With a focus on theory, causality, bias and retrospective analysis, explanatory studies are aimed at testing or comparing existing causal theories. Accordingly the statistical section of explanatory scientific papers is dominated by statistical inference.

In predictive modeling fˆ is used to generate predictions for new data. We note that generating predictions from fˆ can range in the level of difficulty, depending on the complexity of fˆ and on the type of prediction generated. For example, generating a complete predictive distribution is easier using a Bayesian approach than the predictive likelihood approach.

In practical applications, the predictions might be the final goal. However, the focus here is on predictive modeling for supporting scientific research, as was discussed in Section 1.2. Scientific predictive studies and articles therefore emphasize data, association, bias–variance considerations, and prospective aspects of the study. Conclusions pertain to theory-building aspects such as new hypothesis generation, practical relevance, and predictability level. Whereas explanatory articles focus on theoretical constructs and unobservable parameters and their statistical section is dominated by inference, predictive articles concentrate on the observable level, with predictive power and its comparison across models being the core.

3. TWO EXAMPLES

Two examples are used to broadly illustrate the differences that arise in predictive and explanatory studies. In the first I consider a predictive goal and discuss what would be involved in “converting” it to an explanatory study. In the second example I consider an explanatory study and what would be different in a predictive context. See the work of Shmueli and Koppius (2010) for a detailed example “converting” the explanatory study of Gefen, Karahanna and Straub (2003) from Section 1 into a predictive one.

3.1 Netflix Prize

Netflix is the largest online DVD rental service in the United States. In an effort to improve their movie recommendation system, in 2006 Netflix announced a contest (http://netflixprize.com), making public a huge dataset of user movie ratings. Each observation consisted of a user ID, a movie title, and the rating that the user gave this movie. The task was to accurately predict the ratings of movie–user pairs for a test set such that the predictive accuracy improved upon Netflix's recommendation engine by at least 10%. The grand prize was set at $1,000,000. The 2009 winner was a composite of three teams, one of them from the AT&T research lab (see Bell, Koren and Volinsky, 2010). In their 2008 report, the AT&T team, who also won the 2007 and 2008 progress prizes, described their modeling approach (Bell, Koren and Volinsky, 2008).

Let me point out several operations and choices described by Bell, Koren and Volinsky (2008) that highlight the distinctive predictive context. Starting with sample size, the very large sample released by Netflix was aimed at allowing the estimation of f from the data, reflecting the absence of a strong theory. In the data preparation step, with relation to missingness that is predictively informative, the team found that “the information on which movies each user chose to rate, regardless of specific rating value” turned out to be useful. At the data exploration and reduction step, many teams including the winners found that the noninterpretable Singular Value Decomposition (SVD) data reduction method was key in producing accurate predictions: “It seems that models based on matrix-factorization were found to be most accurate.”
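To illustrate the idea of SVD-based factorization on ratings data, here is a bare-bones sketch on a tiny, complete toy matrix; it is only a cartoon of the approach, not the BellKor method, which coped with massive, sparse data and heavy regularization:

    import numpy as np

    # Toy ratings matrix: rows = users, columns = movies (1-5 scale).
    R = np.array([[5, 4, 1, 1],
                  [4, 5, 2, 1],
                  [1, 1, 5, 4],
                  [2, 1, 4, 5]], dtype=float)

    # Center, factor, and keep k latent dimensions.
    mean = R.mean()
    U, s, Vt = np.linalg.svd(R - mean, full_matrices=False)
    k = 2
    R_hat = mean + U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    # R_hat approximates R through k latent factors; useful for
    # predicting ratings, but silent about *why* users rate as they do.
    print(np.round(R_hat, 2))

The k latent dimensions compress the ratings well enough to predict, but they carry no causal interpretation, which is exactly why such methods suit the predictive context rather than the explanatory one.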
As for choice of variables, supplementing the Netflix data with information about the movie (such as actors, director) actually decreased accuracy: “We should mention that not all data features were found to be useful. For example, we tried to benefit from an extensive set of attributes describing each of the movies in the dataset. Those attributes certainly carry a significant signal and can explain some of the user behavior. However, we concluded that they could not help at all for improving the accuracy of well tuned collaborative filtering models.” In terms of choice of methods, their solution was an ensemble of methods that included nearest-neighbor algorithms, regression models, and shrinkage methods. In particular, they found that “using increasingly complex models is only one way of improving accuracy. An apparently easier way to achieve better accuracy is by blending multiple simpler models.” And indeed, more accurate predictions were achieved by collaborations between competing teams who combined predictions from their individual models, such as the winners' combined team.

All these choices and discoveries are very relevant for prediction, but not for causal explanation. Although the Netflix contest is not aimed at scientific advancement, there is clearly scientific value in the predictive models developed. They tell us about the level of predictability of online user ratings of movies, and the implicated usefulness of the rating scale employed by Netflix. The research also highlights the importance of knowing which movies a user does not rate. And importantly, it sets the stage for explanatory research.

Let us consider a hypothetical goal of explaining movie preferences. After stating causal hypotheses, we would define constructs that link user behavior and movie features X to user preference Y, with a careful choice of F. An operationalization step would link the constructs to measurable data, and the role of each variable in the causality structure would be defined. Even if using the Netflix dataset, supplemental covariates that capture movie features and user characteristics would be absolutely necessary. In other words, the data collected and the variables included in the model would be different from the predictive context. As to methods and models, data compression methods such as SVD, heuristic-based predictive algorithms which learn f from the data, and the combination of multiple models would be considered inappropriate, as they lack interpretability with respect to F and the hypotheses. The choice of f would be restricted to statistical models that can be used for inference, and would directly model issues such as the dependence between records for the same customer and for the same movie. Finally, the model would be validated and evaluated in terms of its explanatory power, and used to conclude about the strength of the causal relationship between various user and movie characteristics and movie preferences. Hence, the explanatory context leads to a completely different modeling path and final result than the predictive context.

It is interesting to note that most competing teams had a background in computer science rather than statistics. Yet, the winning team combines the two disciplines. Statisticians who see the uniqueness and importance of predictive modeling alongside explanatory modeling have the capability of contributing to scientific advancement as well as achieving meaningful practical results (and large monetary awards).

3.2 Online Auction Research

The following example highlights the differences between explanatory and predictive research in online auctions. The predictive approach also illustrates the utility in creating new theory in an area dominated by explanatory modeling.

Online auctions have become a major player in providing electronic commerce services. eBay (www.eBay.com), the largest consumer-to-consumer auction website, enables a global community of buyers and sellers to easily interact and trade. Empirical research of online auctions has grown dramatically in recent years. Studies using publicly available bid data from websites such as eBay have found many divergences of bidding behavior and auction outcomes compared to ordinary offline auctions and classical auction theory.
For instance, according to classical auction theory (e.g., Krishna, 2002), the final price of an auction is determined by a priori information about the number of bidders, their valuation, and the auction format. However, final price determination in online auctions is quite different. Online auctions differ from offline auctions in various ways such as longer duration, anonymity of bidders and sellers, and low barriers of entry. These and other factors lead to new bidding behaviors that are not explained by auction theory. Another important difference is that the total number of bidders in most online auctions is unknown until the auction closes.

Empirical research in online auctions has concentrated in the fields of economics, information systems and marketing. Explanatory modeling has been employed to learn about different aspects of bidder behavior in auctions. A survey of empirical explanatory research on auctions was given by Bajari and Hortacsu (2004). A typical explanatory study relies on game theory to construct F, which can be done in different ways. One approach is to construct a “structural model,” which is a mathematical model linking the various constructs. The major construct is “bidder valuation,” which is the amount a bidder is willing to pay, and is typically operationalized using his observed placed bids. The structural model and operationalized constructs are then translated into a regression-type model [see, e.g., Sections 5 and 6 in Bajari and Hortacsu (2003)].

To illustrate the use of a statistical model in explanatory auction research, consider the study by Lucking-Reiley et al. (2007), who used a dataset of 461 eBay coin auctions to determine the factors affecting the final auction price. They estimated a set of linear regression models where Y = log(Price) and X included auction characteristics (the opening bid, the auction duration, and whether a secret reserve price was used), seller characteristics (the number of positive and negative ratings), and a control variable (book value of the coin). One of their four reported models was of the form

log(Price) = β0 + β1 log(BookValue) + β2 log(MinBid) + β3 Reserve + β4 NumDays + β5 PosRating + β6 NegRating + ε.

The other three models, or “model specifications,” included a modified set of predictors, with some interaction terms and an alternate auction duration measurement. The authors used a censored-Normal regression for model estimation, because some auctions did not receive any bids and therefore the price was truncated at the minimum bid. Typical explanatory aspects of the modeling are:

Choice of variables: Several issues arise from the causal-theoretical context. First is the exclusion of the number of bidders (or bids) as a determinant due to endogeneity considerations, where although it is likely to affect the final price, “it is endogenously determined by the bidders' choices.” To verify endogeneity the authors report fitting a separate regression of Y = Number of bids on all the determinants. Second, the authors discuss operationalization challenges that might result in bias due to omitted variables. In particular, the authors discuss the construct of “auction attractiveness” (X) and their inability to judge measures such as photos and verbal descriptions to operationalize attractiveness.

Model validation: The four model specifications are used for testing the robustness of the hypothesized effect of the construct “auction length” across different operationalized variables such as the continuous number of days and a categorical alternative.

Model evaluation: For each model, its in-sample R² is used for determining explanatory power.

Model selection: The authors report the four fitted regression models, including both significant and insignificant coefficients. Retaining the insignificant covariates in the model is for matching f with F.

Model use and reporting: The main focus is on inference for the β's, and the final conclusions are given in causal terms. (“A seller's feedback ratings...have a measurable effect on her auction prices...when a seller chooses to have her auction last for a longer period of days [sic], this significantly increases the auction price on average.”)
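For concreteness, a sketch of fitting a specification of this form is given below, using the statsmodels formula interface on hypothetical data (all column names are invented). Plain OLS is used here for simplicity; the study itself estimated a censored-Normal (Tobit-type) regression to handle auctions with no bids:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Hypothetical auction data; columns are illustrative only.
    rng = np.random.default_rng(2)
    n = 461
    df = pd.DataFrame({
        "book_value": rng.uniform(10, 500, n),
        "min_bid":    rng.uniform(1, 100, n),
        "reserve":    rng.integers(0, 2, n),
        "num_days":   rng.choice([3, 5, 7, 10], n),
        "pos_rating": rng.poisson(50, n),
        "neg_rating": rng.poisson(2, n),
    })
    df["price"] = np.exp(0.8 * np.log(df["book_value"])
                         + rng.normal(0, 0.5, n))

    # OLS stand-in for the censored-Normal regression in the study.
    fit = smf.ols("np.log(price) ~ np.log(book_value) + np.log(min_bid)"
                  " + reserve + num_days + pos_rating + neg_rating",
                  data=df).fit()
    print(fit.summary())  # inference on the betas is the explanatory focus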
Although online auction research is dominated by explanatory studies, there have been a few predictive studies developing forecasting models for an auction's final price (e.g., Jank, Shmueli and Wang, 2008; Jap and Naik, 2008; Ghani and Simmons, 2004; Wang, Jank and Shmueli, 2008; Zhang, Jank and Shmueli, 2010). For a brief survey of online auction forecasting research see the work of Jank and Shmueli (2010, Chapter 5). From my involvement in several of these predictive studies, let me highlight the purely predictive aspects that appear in this literature:

Choice of variables: If prediction takes place before or at the start of the auction, then obviously the total number of bids or bidders cannot be included as a predictor. While this variable was also omitted in the explanatory study, the omission was due to a different reason, that is, endogeneity. However, if prediction takes place at time t during an ongoing auction, then the number of bidders/bids present at time t is available and useful for predicting the final price. Even more useful is the time series of the number of bidders from the start of the auction until time t as well as the price curve until time t (Bapna, Jank and Shmueli, 2008).

Choice of methods: Predictive studies in online auctions tend to learn f from the data, using flexible models and algorithmic methods, e.g., CART, k-nearest neighbors, neural networks, functional methods and related nonparametric smoothing-based methods, Kalman filters and boosting (see, e.g., Chapter 5 in Jank and Shmueli, 2010). Many of these are not interpretable, yet have proven to provide high predictive accuracy.

Model evaluation: Auction forecasting studies evaluate predictive power on holdout data. They report performance in terms of out-of-sample metrics such as MAPE and RMSE, and are compared against other predictive models and benchmarks (a sketch of this workflow follows below).

Predictive models for auction price cannot provide direct causal explanations. However, by producing high-accuracy price predictions they shed light on new potential variables that are related to price and on the types of relationships that can be further investigated in terms of causality. For instance, a construct that is not directly measurable but that some predictive models are apparently capturing is competition between bidders.
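As a minimal sketch of this modeling style (simulated data with invented features; not a reproduction of any cited study), the following fits a k-nearest-neighbor regression, one of the flexible methods listed above, and evaluates it on a holdout set with MAPE and RMSE:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsRegressor

    rng = np.random.default_rng(3)
    n = 500
    # Illustrative predictors: opening bid, seller rating, #bids by time t.
    X = np.column_stack([rng.uniform(1, 100, n),
                         rng.poisson(50, n),
                         rng.poisson(8, n)])
    price = 20 + 0.7 * X[:, 0] + 0.4 * X[:, 2] + rng.normal(0, 5, n)

    X_tr, X_te, y_tr, y_te = train_test_split(X, price, test_size=0.3,
                                              random_state=3)
    knn = KNeighborsRegressor(n_neighbors=7).fit(X_tr, y_tr)
    pred = knn.predict(X_te)

    mape = 100 * np.mean(np.abs((y_te - pred) / y_te))
    rmse = np.sqrt(np.mean((y_te - pred) ** 2))
    print(f"holdout MAPE: {mape:.1f}%   holdout RMSE: {rmse:.2f}")
    # The fitted neighbor structure is not interpretable causally, but
    # the holdout metrics quantify predictive power directly.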
4. IMPLICATIONS, CONCLUSIONS AND SUGGESTIONS

4.1 The Cost of Indiscrimination to Scientific Research

Currently, in many fields, statistical modeling is used nearly exclusively for causal explanation. The consequence of neglecting to include predictive modeling and testing alongside explanatory modeling is losing the ability to test the relevance of existing theories and to discover new causal mechanisms. Feelders (2002) commented on the field of economics: “The pure hypothesis testing framework of economic data analysis should be put aside to give more scope to learning from the data. This closes the empirical cycle from observation to theory to the testing of theories on new data.” The current accelerated rate of social, environmental, and technological changes creates a burning need for new theories and for the examination of old theories in light of the new realities.

A common practice due to the indiscrimination of explanation and prediction is to erroneously infer predictive power from explanatory power, which can lead to incorrect scientific and practical conclusions. Colleagues from various fields confirmed this fact, and a cursory search of their scientific literature brings up many examples. For instance, in ecology an article intending to predict forest beetle assemblages infers predictive power from explanatory power [“To study...predictive power,... we calculated the R²”; “We expect predictabilities with R² of up to 0.6” (Muller and Brandl, 2009)]. In economics, an article entitled “The predictive power of zero intelligence in financial markets” (Farmer, Patelli and Zovko, 2005) infers predictive power from a high R² value of a linear regression model. In epidemiology, many studies rely on in-sample hazard ratios estimated from Cox regression models to infer predictive power, reflecting an indiscrimination between description and prediction. For instance, Nabi et al. (2010) used hazard ratio estimates and statistical significance “to compare the predictive power of depression for coronary heart disease with that of cerebrovascular disease.” In information systems, an article on “Understanding and predicting electronic commerce adoption” (Pavlou and Fygenson, 2006) incorrectly compared the predictive power of different models using in-sample measures (“To examine the predictive power of the proposed model, we compare it to four models in terms of R² adjusted”). These examples are not singular, but rather they reflect the common misunderstanding of predictive power in these and other fields.

Finally, a consequence of omitting predictive modeling from scientific research is also a gap between research and practice. In an age where empirical research has become feasible in many fields, the opportunity to bridge the gap between methodological development and practical application can be easier to achieve through the combination of explanatory and predictive modeling.

Finance is an example where practice is concerned with prediction whereas academic research is focused on explaining. In particular, there has been a reliance on a limited number of models that are considered pillars of research, yet have proven to perform very poorly in practice. For instance, the CAPM model and more recently the Fama–French model are regression models that have been used for explaining market behavior for the purpose of portfolio management, and have been evaluated in terms of explanatory power (in-sample R² and residual analysis) and not predictive accuracy.⁹ More recently, researchers have begun recognizing the distinction between in-sample explanatory power and out-of-sample predictive power (Goyal and Welch, 2007), which has led to a discussion of predictability magnitude and a search for predictively accurate explanatory variables (Campbell and Thompson, 2005). In terms of predictive modeling, the Chief Actuary of the Financial Supervisory Authority of Sweden commented in 1999: “there is a need for models with predictive power for at least a very near future... Given sufficient and relevant data this is an area for statistical analysis, including cluster analysis and various kind of structure-finding methods” (Palmgren, 1999).

⁹Although in their paper Fama and French (1993) did split the sample into two parts, they did so for purposes of testing the sensitivity of model estimates rather than for assessing predictive accuracy.
While there has been some predictive modeling using genetic algorithms (Chen, 2002) and neural networks (Chakraborty and Sharma, 2007), it has been performed by practitioners and nonfinance academic researchers and outside of the top academic journals. In summary, the omission of predictive modeling for theory development results not only in academic work becoming irrelevant to practice, but also in creating a barrier to achieving significant scientific progress, which is especially unfortunate as data become easier to collect, store and access.

In the opposite direction, in fields that focus on predictive modeling, the reason for omitting explanatory modeling must be sought. A scientific field is usually defined by a cohesive body of theoretical knowledge, which can be tested. Hence, some form of testing, whether empirical or not, must be a component of the field. In areas such as bioinformatics, where there is little theory and an abundance of data, predictive models are pivotal in generating avenues for causal theory.

4.2 Explanatory and Predictive Power: Two Dimensions

I have polarized explaining and predicting in this article in an effort to highlight their fundamental differences. However, rather than considering them as extremes on some continuum, I consider them as two dimensions.¹⁰,¹¹ Explanatory power and predictive accuracy are different qualities; a model will possess some level of each.

¹⁰Similarly, descriptive models can be considered as a third dimension, where yet different criteria are used for assessing the strength of the descriptive model.

¹¹I thank Bill Langford for the two-dimensional insight.

A related controversial question arises: must an explanatory model have some level of predictive power to be considered scientifically useful? And equally, must a predictive model have sufficient explanatory power to be scientifically useful? For instance, some explanatory models that cannot be tested for predictive accuracy yet constitute scientific advances are Darwinian evolution theory and string theory in physics. The latter produces currently untestable predictions (Woit, 2006, pages x–xii). Conversely, there exist predictive models that do not properly “explain” yet are scientifically valuable. Galileo, in his book Two New Sciences, proposed a demonstration to determine whether light was instantaneous. According to Mackay and Oldford (2000), Descartes gave the book a scathing review:

The substantive criticisms are generally directed at Galileo's not having identified the causes of the phenomena he investigated. For most scientists at this time, and particularly for Descartes, that is the whole point of science.

Similarly, consider predictive models that are based on a wrong explanation yet are considered scientifically and practically valuable. One well-known example is Ptolemaic astronomy, which until recently was used for nautical navigation but is based on a theory proven to be wrong long ago. While such examples are extreme, in most cases models are likely to possess some level of both explanatory and predictive power.

Considering predictive accuracy and explanatory power as two axes on a two-dimensional plot would place different models (f), aimed either at explanation or at prediction, on different areas of the plot. The bi-dimensional approach implies that: (1) In terms of modeling, the goal of a scientific study must be specified a priori in order to optimize the criterion of interest; and (2) In terms of model evaluation and scientific reporting, researchers should report both the explanatory and predictive qualities of their models. Even if prediction is not the goal, the predictive qualities of a model should be reported alongside its explanatory power so that it can be fairly evaluated in terms of its capabilities and compared to other models.

Similarly, a predictive model might not require causal explanation in order to be scientifically useful; however, reporting its relation to causal theory is important for purposes of theory building. The availability of information on a variety of predictive and explanatory models along these two axes can shed light on both predictive and causal aspects of scientific phenomena. The statistical modeling process, as depicted in Figure 2, should include “overall model performance” in terms of both predictive and explanatory qualities.

4.3 The Cost of Indiscrimination to the Field of Statistics

Dissolving the ambiguity surrounding explanatory versus predictive modeling is important for advancing our field itself. Recognizing that statistical methodology has focused mainly on inference indicates an important gap to be filled. While our literature contains predictive methodology for model selection and predictive inference, there is scarce statistical predictive methodology for other modeling steps, such as study design, data collection, data preparation and EDA, which present opportunities for new research. Currently, the predictive void has been taken up by the field of machine learning and data mining. In fact, the differences, and some would say rivalry, between the fields of statistics and data mining can be attributed to their different goals of explaining versus predicting even more than to factors such as data size.
While statistical theory has focused on model estimation, inference, and fit, machine learning and data mining have concentrated on developing computationally efficient predictive algorithms and tackling the bias–variance trade-off in order to achieve high predictive accuracy.

Sharpening the distinction between explanatory and predictive modeling can raise a new awareness of the strengths and limitations of existing methods and practices, and might shed light on current controversies within our field. One example is the disagreement in survey methodology regarding the use of sampling weights in the analysis of survey data (Little, 2007). Whereas some researchers advocate using weights to reduce bias at the expense of increased variance, and others disagree, might not the answer be related to the final goal?

Another ambiguity that can benefit from an explanatory/predictive distinction is the definition of parsimony. Some claim that predictive models should be simpler than explanatory models: “Simplicity is relevant because complex families often do a bad job of predicting new data, though they can be made to fit the old data quite well” (Sober, 2002). The same argument was given by Hastie, Tibshirani and Friedman (2009): “Typically the more complex we make the model, the lower the bias but the higher the variance.” In contrast, some predictive models in practice are very complex,¹² and indeed Breiman (2001b) commented: “in some cases predictive models are more complex in order to capture small nuances that improve predictive accuracy.” Zellner (2001) used the term “sophisticatedly simple” to define the quality of a “good” model. I would suggest that the definitions of parsimony and complexity are task-dependent: predictive or explanatory. For example, an “overly complicated” model in explanatory terms might prove “sophisticatedly simple” for predictive purposes.

¹²I thank Foster Provost from NYU for this observation.

4.4 Closing Remarks and Suggestions

The consequences from the explanatory/predictive distinction lead to two proposed actions:

1. It is our responsibility to be aware of how statistical models are used in research outside of statistics, why they are used in that fashion, and in response to develop methods that support sound scientific research. Such knowledge can be gained within our field by inviting scientists from different disciplines to give talks at statistics conferences and seminars, and to require graduate students in statistics to read and present research papers from other disciplines.

2. As a discipline, we must acknowledge the difference between explanatory, predictive and descriptive modeling, and integrate it into statistics education of statisticians and nonstatisticians, as early as possible but most importantly in “research methods” courses. This requires creating written materials that are easily accessible and understandable by nonstatisticians. We should advocate both explanatory and predictive modeling, clarify their differences and distinctive scientific and practical uses, and disseminate tools and knowledge for implementing both. One particular aspect to consider is advocating a more careful use of terms such as “predictors,” “predictions” and “predictive power,” to reduce the effects of terminology on incorrect scientific conclusions.

Awareness of the distinction between explanatory and predictive modeling, and of the different scientific functions that each serve, is essential for the progress of scientific knowledge.

APPENDIX: IS THE “TRUE” MODEL THE BEST PREDICTIVE MODEL? A LINEAR REGRESSION EXAMPLE

Consider F to be the true function relating constructs X and Y and let us assume that f is a valid operationalization of F. Choosing an intentionally biased function f∗ in place of f is clearly undesirable from a theoretical–explanatory point of view. However, we will show that f∗ can be preferable to f from a predictive standpoint.

To illustrate this, consider the statistical model f(x) = β1x1 + β2x2 + ε, which is assumed to be correctly specified with respect to F. Using data, we obtain the estimated model fˆ, which has the properties

(2) Bias = 0,

(3) Var(fˆ(x)) = Var(x1βˆ1 + x2βˆ2) = σ²x′(X′X)⁻¹x,

where x is the vector x = [x1, x2]′, and X is the design matrix based on both predictors. Combining the squared bias with the variance gives

(4) EPE = E[(Y − fˆ(x))²] = σ² + 0 + σ²x′(X′X)⁻¹x = σ²(1 + x′(X′X)⁻¹x).

In comparison, consider the estimated underspecified form fˆ∗(x) = γˆ1x1. The bias and variance here are given by Montgomery, Peck and Vining (2001, pages 292–296):

Bias = x1γ1 − (x1β1 + x2β2) = x1(x′1x1)⁻¹x′1(x1β1 + x2β2) − (x1β1 + x2β2),

Var(fˆ∗(x)) = x1 Var(γˆ1)x′1 = σ²x1(x′1x1)⁻¹x′1.

Combining the squared bias with the variance gives

(5) EPE = [x1(x′1x1)⁻¹x′1x2β2 − x2β2]² + σ²[1 + x1(x′1x1)⁻¹x′1].

Although the bias of the underspecified model f∗(x) is larger than that of f(x), its variance can be smaller, and in some cases so small that the overall EPE will be lower for the underspecified model. Wu, Harris and McAuley (2007) showed the general result for an underspecified model with multiple predictors. In particular, they showed that the underspecified model that leaves out q predictors has a lower EPE when the following inequality holds:

(6) qσ² > β′2X′2(I − H1)X2β2,

where H1 = X1(X′1X1)⁻¹X′1 is the hat matrix based on the retained predictors. This means that the underspecified model produces more accurate predictions, in terms of lower EPE, in the following situations:

• when the data are very noisy (large σ);
• when the true absolute values of the left-out parameters (in our example β2) are small;
• when the predictors are highly correlated; and
• when the sample size is small or the range of left-out variables is small.

The bottom line is nicely summarized by Hagerty and Srinivasan (1991): “We note that the practice in applied research of concluding that a model with a higher predictive validity is ‘truer,’ is not a valid inference. This paper shows that a parsimonious but less true model can have a higher predictive validity than a truer but less parsimonious model.”
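The inequality in (6) is easy to probe by simulation. The sketch below, with illustrative parameter values chosen to match the “noisy data, small left-out coefficient, correlated predictors, small sample” conditions, estimates the expected prediction error (EPE) of the correctly specified and underspecified models at a fixed prediction point:

    import numpy as np

    rng = np.random.default_rng(4)
    n, sigma, beta1, beta2 = 15, 3.0, 1.0, 0.2  # small n, noisy, small beta2
    x_new = np.array([1.0, 1.0])                # prediction point

    def epe(use_x2, reps=20000):
        errs = np.empty(reps)
        for r in range(reps):
            # Correlated predictors make underspecification attractive.
            x1 = rng.normal(size=n)
            x2 = 0.9 * x1 + 0.1 * rng.normal(size=n)
            X = np.column_stack([x1, x2])
            y = beta1 * x1 + beta2 * x2 + rng.normal(scale=sigma, size=n)
            Xm = X if use_x2 else X[:, :1]
            coef, *_ = np.linalg.lstsq(Xm, y, rcond=None)
            y_new = (beta1 * x_new[0] + beta2 * x_new[1]
                     + rng.normal(scale=sigma))
            x_pred = x_new if use_x2 else x_new[:1]
            errs[r] = (y_new - x_pred @ coef) ** 2
        return errs.mean()

    print("EPE, correctly specified:", round(epe(True), 3))
    print("EPE, underspecified:     ", round(epe(False), 3))
    # With noisy data, collinear predictors and small beta2, the biased
    # (underspecified) model can attain the lower EPE, as in (6).

Under these conditions the intentionally biased model typically attains the lower estimated EPE, in line with the Hagerty and Srinivasan quote above.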
ACKNOWLEDGMENTS

I thank two anonymous reviewers, the associate editor, and editor for their suggestions and comments which improved this manuscript. I express my gratitude to many colleagues for invaluable feedback and fruitful discussion that have helped me develop the explanatory/predictive argument presented in this article. I am grateful to Otto Koppius (Erasmus) and Ravi Bapna (U Minnesota) for familiarizing me with explanatory modeling in Information Systems, for collaboratively pursuing prediction in this field, and for tireless discussion of this work. I thank Ayala Cohen (Technion), Ralph Snyder (Monash), Rob Hyndman (Monash) and Bill Langford (RMIT) for detailed feedback on earlier drafts of this article. Special thanks to Boaz Shmueli and Raquelle Azran for their meticulous reading and discussions of the manuscript. And special thanks for invaluable comments and suggestions go to Murray Aitkin (U Melbourne), Yoav Benjamini (Tel Aviv U), Smarajit Bose (ISI), Saibal Chattopadhyay (IIMC), Ram Chellapah (Emory), Etti Doveh (Technion), Paul Feigin (Technion), Paulo Goes (U Arizona), Avi Goldfarb (Toronto U), Norma Hubele (ASU), Ron Kenett (KPA Inc.), Paul Lajbcygier (Monash), Thomas Lumley (U Washington), David Madigan (Columbia U), Isaac Meilejson (Tel Aviv U), Douglas Montgomery (ASU), Amita Pal (ISI), Don Poskitt (Monash), Foster Provost (NYU), Saharon Rosset (Tel Aviv U), Jeffrey Simonoff (NYU) and David Steinberg (Tel Aviv U).

REFERENCES

AFSHARTOUS, D. and DE LEEUW, J. (2005). Prediction in multilevel models. J. Educ. Behav. Statist. 30 109–139.
AITCHISON, J. and DUNSMORE, I. R. (1975). Statistical Prediction Analysis. Cambridge Univ. Press. MR0408097
BAJARI, P. and HORTACSU, A. (2003). The winner's curse, reserve prices and endogenous entry: Empirical insights from ebay auctions. Rand J. Econ. 34 329–355.
BAJARI, P. and HORTACSU, A. (2004). Economic insights from internet auctions. J. Econ. Liter. 42 457–486.
BAPNA, R., JANK, W. and SHMUELI, G. (2008). Price formation and its dynamics in online auctions. Decision Support Systems 44 641–656.
BELL, R. M., KOREN, Y. and VOLINSKY, C. (2008). The BellKor 2008 solution to the Netflix Prize.
BELL, R. M., KOREN, Y. and VOLINSKY, C. (2010). All together now: A perspective on the Netflix Prize. Chance 23 24.
BERK, R. A. (2008). Statistical Learning from a Regression Perspective. Springer, New York.
BJORNSTAD, J. F. (1990). Predictive likelihood: A review. Statist. Sci. 5 242–265. MR1062578
BÜHLMANN, P. and HOTHORN, T. (2007). Boosting algorithms: Regularization, prediction and model fitting. Statist. Sci. 22 477–505. MR2420454
BREIMAN, L. (1996). Bagging predictors. Mach. Learn. 24 123–140. MR1425957
BREIMAN, L. (2001a). Random forests. Mach. Learn. 45 5–32.
BREIMAN, L. (2001b). Statistical modeling: The two cultures. Statist. Sci. 16 199–215. MR1874152
BROWN, P. J., VANNUCCI, M. and FEARN, T. (2002). Bayes model averaging with selection of regressors. J. R. Stat. Soc. Ser. B Stat. Methodol. 64 519–536. MR1924304
CAMPBELL, J. Y. and THOMPSON, S. B. (2005). Predicting excess stock returns out of sample: Can anything beat the historical average? Harvard Institute of Economic Research Working Paper 2084.
CARTE, T. A. and CRAIG, J. R. (2003). In pursuit of moderation: Nine common errors and their solutions. MIS Quart. 27 479–501.
CHAKRABORTY, S. and SHARMA, S. K. (2007). Prediction of corporate financial health by artificial neural network. Int. J. Electron. Fin. 1 442–459.
CHEN, S.-H., ED. (2002). Genetic Algorithms and Genetic Programming in Computational Finance. Kluwer, Dordrecht.
COLLOPY, F., ADYA, M. and ARMSTRONG, J. (1994). Principles for examining predictive validity—the case of information-systems spending forecasts. Inform. Syst. Res. 5 170–179.
DALKEY, N. and HELMER, O. (1963). An experimental application of the delphi method to the use of experts. Manag. Sci. 9 458–467.
DAWID, A. P. (1984). Present position and potential developments: Some personal views: Statistical theory: The prequential approach. J. Roy. Statist. Soc. Ser. A 147 278–292. MR0763811
DING, Y. and SIMONOFF, J. (2010). An investigation of missing data methods for classification trees applied to binary response data. J. Mach. Learn. Res. 11 131–170.
DOMINGOS, P. (2000). A unified bias–variance decomposition for zero–one and squared loss. In Proceedings of the Seventeenth National Conference on Artificial Intelligence 564–569. AAAI Press, Austin, TX.
DOWE, D. L., GARDNER, S. and OPPY, G. R. (2007). Bayes not bust! Why simplicity is no problem for Bayesians. Br. J. Philos. Sci. 58 709–754. MR2375767
DUBIN, R. (1969). Theory Building. The Free Press, New York.
EDWARDS, J. R. and BAGOZZI, R. P. (2000). On the nature and direction of relationships between constructs. Psychological Methods 5 155–174.
EHRENBERG, A. and BOUND, J. (1993). Predictability and prediction. J. Roy. Statist. Soc. Ser. A 156 167–206.
FAMA, E. F. and FRENCH, K. R. (1993). Common risk factors in stock and bond returns. J. Fin. Econ. 33 3–56.
FARMER, J. D., PATELLI, P. and ZOVKO, I. I. (2005). The predictive power of zero intelligence in financial markets. Proc. Natl. Acad. Sci. USA 102 2254–2259.
FAYYAD, U. M., GRINSTEIN, G. G. and WIERSE, A. (2002). Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, San Francisco, CA.
FEELDERS, A. (2002). Data mining in economic science. In Dealing with the Data Flood 166–175. STT/Beweton, Den Haag, The Netherlands.
FINDLEY, D. Y. and PARZEN, E. (1998). A conversation with Hirotugu Akaike. In Selected Papers of Hirotugu Akaike 3–16. Springer, New York. MR1486823
FORSTER, M. (2002). Predictive accuracy as an achievable goal of science. Philos. Sci. 69 S124–S134.
FORSTER, M. and SOBER, E. (1994). How to tell when simpler, more unified, or less ad-hoc theories will provide more accurate predictions. Br. J. Philos. Sci. 45 1–35. MR1277464
FRIEDMAN, J. H. (1997). On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery 1 55–77.
GEFEN, D., KARAHANNA, E. and STRAUB, D. (2003). Trust and TAM in online shopping: An integrated model. MIS Quart. 27 51–90.
GEISSER, S. (1975). The predictive sample reuse method with applications. J. Amer. Statist. Assoc. 70 320–328.
GEISSER, S. (1993). Predictive Inference: An Introduction. Chapman and Hall, London. MR1252174
GELMAN, A., CARLIN, J. B., STERN, H. S. and RUBIN, D. B. (2003). Bayesian Data Analysis, 2nd ed. Chapman & Hall/CRC, New York/Boca Raton, FL. MR1385925
GHANI, R. and SIMMONS, H. (2004). Predicting the end-price of online auctions. In International Workshop on Data Mining and Adaptive Modelling Methods for Economics and Management, Pisa, Italy.
GOYAL, A. and WELCH, I. (2007). A comprehensive look at the empirical performance of equity premium prediction. Rev. Fin. Stud. 21 1455–1508.
GRANGER, C. (1969). Investigating causal relations by econometric models and cross-spectral methods. Econometrica 37 424–438.

GREENBERG, E. and PARKS, R. P. (1997). A predictive approach to model selection and multicollinearity. J. Appl. Econom. 12 67–75. MR1820113
GURBAXANI, V. and MENDELSON, H. (1990). An integrative model of information systems spending growth. Inform. Syst. Res. 1 23–46.
GURBAXANI, V. and MENDELSON, H. (1994). Modeling vs. forecasting—the case of information-systems spending. Inform. Syst. Res. 5 180–190.
HAGERTY, M. R. and SRINIVASAN, S. (1991). Comparing the predictive powers of alternative multiple regression models. Psychometrika 56 77–85. MR1115296
HASTIE, T., TIBSHIRANI, R. and FRIEDMAN, J. H. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York. MR1851606
HAUSMAN, J. A. (1978). Specification tests in econometrics. Econometrica 46 1251–1271. MR0513692
HELMER, O. and RESCHER, N. (1959). On the epistemology of the inexact sciences. Manag. Sci. 5 25–52.
HEMPEL, C. and OPPENHEIM, P. (1948). Studies in the logic of explanation. Philos. Sci. 15 135–175.
HITCHCOCK, C. and SOBER, E. (2004). Prediction versus accommodation and the risk of overfitting. Br. J. Philos. Sci. 55 1–34.
JACCARD, J. (2001). Interaction Effects in Logistic Regression. SAGE Publications, Thousand Oaks, CA.
JANK, W. and SHMUELI, G. (2010). Modeling Online Auctions. Wiley, New York.
JANK, W., SHMUELI, G. and WANG, S. (2008). Modeling price dynamics in online auctions via regression trees. In Statistical Methods in eCommerce Research. Wiley, New York. MR2414052
JAP, S. and NAIK, P. (2008). Bidanalyzer: A method for estimation and selection of dynamic bidding models. Marketing Sci. 27 949–960.
JOHNSON, W. and GEISSER, S. (1983). A predictive view of the detection and characterization of influential observations in regression analysis. J. Amer. Statist. Assoc. 78 137–144. MR0696858
KADANE, J. B. and LAZAR, N. A. (2004). Methods and criteria for model selection. J. Amer. Statist. Assoc. 99 279–290. MR2061890
KENDALL, M. and STUART, A. (1977). The Advanced Theory of Statistics 1, 4th ed. Griffin, London.
KONISHI, S. and KITAGAWA, G. (2007). Information Criteria and Statistical Modeling. Springer, New York. MR2367855
KRISHNA, V. (2002). Auction Theory. Academic Press, San Diego, CA.
LITTLE, R. J. A. (2007). Should we use the survey weights to weight? JPSM Distinguished Lecture, Univ. Maryland.
LITTLE, R. J. A. and RUBIN, D. B. (2002). Statistical Analysis with Missing Data. Wiley, New York. MR1925014
LUCKING-REILEY, D., BRYAN, D., PRASAD, N. and REEVES, D. (2007). Pennies from ebay: The determinants of price in online auctions. J. Indust. Econ. 55 223–233.
MACKAY, R. J. and OLDFORD, R. W. (2000). Scientific method, statistical method, and the speed of light. Working Paper 2000-02, Dept. Statistics and Actuarial Science, Univ. Waterloo. MR1847825
MAKRIDAKIS, S. G., WHEELWRIGHT, S. C. and HYNDMAN, R. J. (1998). Forecasting: Methods and Applications, 3rd ed. Wiley, New York.
MONTGOMERY, D., PECK, E. A. and VINING, G. G. (2001). Introduction to Linear Regression Analysis. Wiley, New York.
MOSTELLER, F. and TUKEY, J. W. (1977). Data Analysis and Regression. Addison-Wesley, Reading, MA.
MULLER, J. and BRANDL, R. (2009). Assessing biodiversity by remote sensing in mountainous terrain: The potential of lidar to predict forest beetle assemblages. J. Appl. Ecol. 46 897–905.
NABI, J., KIVIMÄKI, M., SUOMINEN, S., KOSKENVUO, M. and VAHTERA, J. (2010). Does depression predict coronary heart disease and cerebrovascular disease equally well? The health and social support prospective cohort study. Int. J. Epidemiol. 39 1016–1024.
PALMGREN, B. (1999). The need for financial models. ERCIM News 38 8–9.
PARZEN, E. (2001). Comment on statistical modeling: The two cultures. Statist. Sci. 16 224–226. MR1874152
PATZER, G. L. (1995). Using Secondary Data in Marketing Research: United States and Worldwide. Greenwood Publishing, Westport, CT.
PAVLOU, P. and FYGENSON, M. (2006). Understanding and predicting electronic commerce adoption: An extension of the theory of planned behavior. MIS Quart. 30 115–143.
PEARL, J. (1995). Causal diagrams for empirical research. Biometrika 82 669–709. MR1380809
ROSENBAUM, P. and RUBIN, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika 70 41–55. MR0742974
RUBIN, D. B. (1997). Estimating causal effects from large data sets using propensity scores. Ann. Intern. Med. 127 757–763.
SAAR-TSECHANSKY, M. and PROVOST, F. (2007). Handling missing features when applying classification models. J. Mach. Learn. Res. 8 1625–1657.
SARLE, W. S. (1998). Prediction with missing inputs. In JCIS 98 Proceedings (P. Wang, ed.) II 399–402. Research Triangle Park, Durham, NC.
SCHAPIRE, R. E. (1999). A brief introduction to boosting. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence 1401–1406. Stockholm, Sweden.
SENI, G. and ELDER, J. F. (2010). Ensemble Methods in Data Mining: Improving Accuracy Through Combining Predictions (Synthesis Lectures on Data Mining and Knowledge Discovery). Morgan and Claypool, San Rafael, CA.
SHAFER, G. (1996). The Art of Causal Conjecture. MIT Press, Cambridge, MA.
SHMUELI, G. and KOPPIUS, O. R. (2010). Predictive analytics in information systems research. MIS Quart. To appear.
SIMON, H. A. (2001). Science seeks parsimony, not simplicity: Searching for pattern in phenomena. In Simplicity, Inference and Modelling: Keeping It Sophisticatedly Simple 32–72. Cambridge Univ. Press. MR1932928
SOBER, E. (2002). Instrumentalism, parsimony, and the Akaike framework. Philos. Sci. 69 S112–S123.
SONG, H. and WITT, S. F. (2000). Tourism Demand Modelling and Forecasting: Modern Econometric Approaches. Pergamon Press, Oxford.
SPIRTES, P., GLYMOUR, C. and SCHEINES, R. (2000). Causation, Prediction, and Search, 2nd ed. MIT Press, Cambridge, MA. MR1815675

STONE, M. (1974). Cross-validatory choice and assessment of statistical predictions (with discussion). J. Roy. Statist. Soc. Ser. B 36 111–147. MR0356377
TALEB, N. (2007). The Black Swan. Penguin Books, London.
VAN MAANEN, J., SORENSEN, J. and MITCHELL, T. (2007). The interplay between theory and method. Acad. Manag. Rev. 32 1145–1154.
VAUGHAN, T. S. and BERRY, K. E. (2005). Using Monte Carlo techniques to demonstrate the meaning and implications of multicollinearity. J. Statist. Educ. 13 online.
WALLIS, W. A. (1980). The statistical research group, 1942–1945. J. Amer. Statist. Assoc. 75 320–330. MR0577363
WANG, S., JANK, W. and SHMUELI, G. (2008). Explaining and forecasting online auction prices and their dynamics using functional data analysis. J. Business Econ. Statist. 26 144–160. MR2420144
WINKELMANN, R. (2008). Econometric Analysis of Count Data, 5th ed. Springer, New York. MR2148271
WOIT, P. (2006). Not Even Wrong: The Failure of String Theory and the Search for Unity in Physical Law. Jonathan Cape, London. MR2245858
WU, S., HARRIS, T. and MCAULEY, K. (2007). The use of simplified or misspecified models: Linear case. Canad. J. Chem. Eng. 85 386–398.
ZELLNER, A. (1962). An efficient method of estimating seemingly unrelated regression equations and tests for aggregation bias. J. Amer. Statist. Assoc. 57 348–368. MR0139235
ZELLNER, A. (2001). Keep it sophisticatedly simple. In Simplicity, Inference and Modelling: Keeping It Sophisticatedly Simple 242–261. Cambridge Univ. Press. MR1932939
ZHANG, S., JANK, W. and SHMUELI, G. (2010). Real-time forecasting of online auctions via functional k-nearest neighbors. Int. J. Forecast. 26 666–683.