Risk Analysis, Vol. 24, No. 5, 2004

Invited Commentary

Bounding Analysis as an Inadequately Specified Methodology

Sander Greenland∗

The bounding analysis methodology described by Ha-Duong et al. (this issue) is logically incomplete and invites serious misuse and misinterpretation, as their own example and interpretation illustrate. A key issue is the extent to which these problems are inherent in their methodology, and resolvable by a logically complete assessment (such as Monte Carlo or Bayesian risk assessment), as opposed to being general problems in any risk-assessment methodology. I here attempt to apportion the problems between those inherent in the proposed bounding analysis and those that are more general, such as reliance on questionable expert elicitations. I conclude that the specific methodology of Ha-Duong et al. suffers from logical gaps in the definition and construction of inputs, and hence should not be used in the form proposed. Furthermore, the labor required to do a sound bounding analysis is great enough so that one may as well skip that analysis and carry out a more logically complete probabilistic analysis, one that will better inform the consumer of the appropriate level of uncertainty. If analysts insist on carrying out a bounding analysis in place of more thorough assessments, extensive analyses of sensitivity to inputs and assumptions will be essential to display uncertainties, arguably more essential than they would be in full probabilistic analyses.

KEY WORDS: Bayesian analysis; bias analysis; bounding analysis; causal inference; epidemiologic methods; expert elicitation; Monte Carlo analysis; probabilistic assessment; risk analysis; risk assessment; sensitivity analysis; stochastic simulation

∗ Departments of and , University of California, Los Angeles, CA 90095-1772, USA; lesdomes@ucla.edu.

1. INTRODUCTION

From a narrow but common viewpoint there are currently only two received modes of probabilistic or statistical inference: frequentist and Bayesian. These two modes have many subtypes and variants, and the boundary between them becomes blurred when one enters the realm of hierarchical (multilevel) modeling (Good, 1983, 1987; Greenland, 2000a; Carlin & Louis, 2000). All the variants depend critically on a sampling model for the variability (both systematic and random) in observations. In the hierarchical framework both modes deploy a model for information about or relations among sampling-model parameters (known to Bayesians as a "prior distribution," and known to frequentists by many other names such as "mixing distribution," "second-stage distribution," or "penalty function"), and when they adopt the same models for all levels of variation they produce parallel mathematical deductions.

The consequence is that a savvy modeler can produce identical numerical results (e.g., interval estimates) from each model. Of course, the nominal interpretation of these results will differ. For example, a frequentist interval estimate (such as a confidence interval) is defined and evaluated in terms of a coverage rate in a very hypothetical thought experiment in which all the input models are perfectly correct and used for resampling. In contrast, a subjective-Bayesian interval estimate (a posterior interval) is defined in terms of what a hypothetical observer should bet, given the input models and the data.

As is often the case with competing politicians vying for our votes, these competing interpretations are less far apart than polemics often suggest. At the very least, both start with "assume the following models and distributions" and deduce from there, and both depend on full probability models (even though those models might be infinite-dimensional, as in non- and semiparametric methods). In observational epidemiology, where all the models are subjective speculation, divergence between frequentist and Bayesian inferences can just as well be attributed to differences in the input models (mainly in use of uniform versus more concentrated parameter distributions), rather than to differences in fundamental philosophy (Good, 1987).

On the principle that diversity of viewpoints is beneficial, and given that standard statistics does not encompass as much diversity as is often claimed (except by the virtue of rampant misinterpretations of basic statistics such as significance tests), we should welcome any attempt to produce a truly different analytic viewpoint. It seems widely accepted, however, that any such attempt must have logical clarity and consistency, and any application must be consistent with background (subject matter) information. I will argue that the bounding analysis methodology proposed by Ha-Duong et al. (2004) fails on both counts, and hence should be rejected in the form presented. My criticisms draw heavily on my work and that of colleagues in epidemiologic risk estimation; due to space constraints, many details are omitted but can be found in the references.

2. FLAWS IN THE BOUNDING ANALYSIS OF HA-DUONG ET AL.

The following sections describe three logically independent criticisms of the bounding methodology proposed by Ha-Duong et al. The first and most extensive criticism is important in any application, and thus must be repaired before any use of the proposed method, while the impact of the other two would depend on the context. Section 2.4 raises an ambiguity in the causal model underlying the methodology, which could have an impact in special situations.

2.1. Failure to Conceptually Define the Bounds and Deal with Bound Uncertainty

The bounding proposal begins with a conceptual gap, on which it founders thereafter: "The goal of the analysis is to generate an upper bound on the mortality attributed to the group of poorly characterized factors ..." (Ha-Duong et al., Section 1.1). The error the authors commit is that of treating this upper bound and input bounds used to derive this bound as sharply defined quantities, without ever offering a definition of what they are. They also do not indicate how to derive input bounds from subject matter, except by expert elicitation, which I will criticize later. This definitional gap is fatal: for a mathematical analysis (like that in Section 3 of Ha-Duong et al.) to produce scientifically meaningful outputs, we must have operational definitions of the inputs and reliable means for measuring them.

What is one of their bounds supposed to represent? To appreciate the question, consider a central input to their analysis, the smoking attributable fraction for lung cancer (Ha-Duong et al., Section 2); call it SAF for now. In what seems a nonsequitur, after discussing various estimates of this quantity, Ha-Duong et al. offer 0.70 as a lower bound. Later, in Section 4, Ha-Duong et al. condition their results on the premise "if one is confident in the bounds assigned to the well-understood risk factors"; presumably this refers to the 0.70, among others. Why not 0.67 or 0.75? Is 0.70 a point that they are 100% certain SAF exceeds? If so, how did they come upon that certainty? Or is it something less than certain? If less, by how much? If they are 100% certain SAF exceeds 0.70, are they as certain it exceeds 0.75? If so, why not use 0.75? If they are not as certain about 0.75, why not? The other direction deserves questioning as well: If Ha-Duong et al. are not 100% certain SAF exceeds 0.70, are they 100% certain it exceeds 0.67? If not, why not? If so, why not use 0.67? Finally, if the bound is not supposed to be linked to a subjective probability, then what does it mean, and what does "confident in the bound" mean?

These are questions about the implications of the background data and subject-matter theory. Ha-Duong et al. write as if 0.70 is somehow implied by this scientific background, yet no estimate they present from the literature comes close to this bound. They offer no confidence interval, so apparently the bound is not based on considering random error. Nor do they offer any attempt to quantify the impact of the sources of bias, such as the uncontrolled confounding, the nonrepresentativeness of the source cohorts, the large measurement errors that must be present, or the coarsening errors from use of such crude age and exposure categories. Each of these problems is a source of uncertainty.

Instead of quantifying potential problems, Ha-Duong et al. either make qualitative dismissals or else fail to mention them. As an example, they say "other potential confounders ... may not be a large problem because ...". What kind of confidence should we assign this claim, and how much uncertainty is omitted by ignoring this and other potential problems? Did Ha-Duong et al. quantify and factor in that uncertainty when setting their bounds? Apparently not. Other problems such as measurement error and excessive coarsening are not discussed, yet can have an order-of-magnitude larger impact on results than the problems Ha-Duong et al. do discuss (Greenland, in press; Poole & Greenland, 1997).

It appears instead that Ha-Duong et al. rely on expert consensus as an indication of both precision and validity of published risk estimates. This reliance might not be too misleading for the smoking-lung cancer relation, given that the latter is one of the most well-studied relations in the history of epidemiology. But few relations are in that category; radon and lung cancer is certainly not. Hence, to be credible, a general risk-assessment methodology must have a means to explicitly incorporate methodologic uncertainties into the analysis. I could find no such means in the proposal or example of Ha-Duong et al.

So, again, what is the rationale for using 0.70? Again, why not 0.67 or 0.75? And again, what confidence are we supposed to place in this bound? I believe that the 0.70 (as well as the 0.95 upper bound they offer) represents in large measure digit preference, as arises from demanding a sharp answer to an ill-defined question. Had modern society evolved into using a base-12 number system, Ha-Duong et al. would have offered 0.67 or 0.75 (8/12 or 9/12, expressed as 0.80 or 0.90 in base 12). Given that subsequent analyses in Ha-Duong et al. depend on the remainder of 1 − 0.70 = 0.30, the range for digit preference is not a negligible proportion of the input. Furthermore, as will be argued later, bounding analysis is more sensitive to these input bounds than probabilistic analyses. It thus appears that the input bound of 0.70 is as much a function of digit preference as subject matter; sensitivity of outputs to this preference is a clue that something very arbitrary and poorly thought out underlies the proposed analysis.

Expert elicitation does not address the above issues. If the bound is from an expert, we should ask the same pointed questions: What kind of confidence do you have in this bound? Why that bound, and not another nearby? Elicitations can be severely affected by digit preferences and cognitive biases (Kahneman et al., 1982; Gilovich et al., 2002). Worse, elicitations can incorporate personal biases of the expert, either academic (previous commitment to a value for the bounded quantity) or economic (conflict of interest). Saying that they are "expert judgments" begs these questions. Because the same problems arise in probability elicitations, I will return to these issues after discussing probabilistic assessments.

In summary, the core proposal of Ha-Duong et al. is nothing more than an algorithm for operating on inputs, and gives no meaning to those inputs or the output. By offering neither a precise definition of the input bounds nor their scientific derivation, the proposed bounding analysis first injects and then ignores potentially large sources of uncertainty and bias into the analysis, and leaves unclear the meaning of the outputs. No subsequent mathematical operation can compensate for these deficiencies. These criticisms require more than clarification of the proposed methodology; they require an explicit conceptual foundation for the inputs, and a means of deriving inputs from the literature.

2.2. Unjustified Assumptions

The mathematical development in Section 3 of Ha-Duong et al. can be criticized for the assumptions it injects. In any risk analysis, a sharp mathematical assumption represents a source of uncertainty that the analyst has chosen to ignore. A credible analysis presents scientific reasons to justify those choices, or evidence that the results are insensitive to credible assumption violations. Equation (2) of Ha-Duong et al. assumes independence of the exposures, an assumption with neither empirical nor theoretical support in the application; in fact, the assumption is not credible in the application, as the authors concede in a footnote. More generally, unhealthy behaviors, environments, and occupations tend to cluster (reflecting social-class and education gradients). It thus is a mistake to build a method around this independence assumption, especially when one can see from the formula that results could be highly sensitive to dependencies.
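To make the point about dependence concrete, here is a toy numerical sketch of my own (it is not Ha-Duong et al.'s Equation (2), and all numbers are hypothetical): with the marginal prevalences of two binary exposures held fixed, the probability of carrying neither exposure, the sort of remainder on which a partitioning formula operates, shifts considerably as the exposures become positively associated.

```python
# Toy illustration with hypothetical numbers: sensitivity of a "neither exposure"
# remainder to dependence between two binary exposures with fixed prevalences.
p1, p2 = 0.30, 0.40                            # marginal exposure prevalences

p_both_indep = p1 * p2                         # P(both exposed) under independence = 0.12
p_neither_indep = 1 - p1 - p2 + p_both_indep   # P(neither exposed) = 0.42

# Positive clustering: raise P(both exposed) toward its maximum, min(p1, p2) = 0.30.
p_both_clustered = 0.25
p_neither_clustered = 1 - p1 - p2 + p_both_clustered   # = 0.55

print(p_neither_indep, p_neither_clustered)    # 0.42 versus 0.55, over a 30% relative shift
```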
The development also assumes absence of three-way or higher interactions; the only justification given is absence of information. One should recall that absence of evidence is not evidence of absence, however; in fact, the interactions being assumed away could be large. Unfortunately, Ha-Duong et al. provide no clue as to whether their results are sensitive to their assumption of interaction absence. Finally, the development in Section 3 invokes a variant of Laplace's Principle of Insufficient Reason, saying it is used in Section 4 to estimate the output bounds. I could find nothing in the article that pinpoints where and how the Principle was used to derive the bounds in Section 4. Regardless, to justify the use of the Principle in a real application one needs to demonstrate that indeed background information was accounted for and exhausted before the Principle is applied. Ha-Duong et al. do not do this. Their use of the Principle is not just another source of uncertainty, but a potential source of bias insofar as the Principle invokes use of a uniform distribution, which is an extreme distribution in the space of credible distributions, and is almost never epidemiologically plausible in light of background information (Greenland, 1998a).
2.3. Inferential Deficiencies

Let us for the moment forget the above criticisms and pretend (as do Ha-Duong et al.) that the inputs to the proposed method are logically precise, scientifically measurable, and methodologically sound, and that the assumptions made in the mathematical analysis are contextually justifiable. Even with this pretense, the inferences drawn by Ha-Duong et al. in Section 4 do not follow logically from their outputs. This is in part because of a new logical gap, and in part because the meaning of the output bounds is ambiguous, as with the inputs.

Of other assessments, Ha-Duong et al. state that "if the assessments project the lung cancer risk...from these pollutants ('other' environmental factors, X) to be in excess of 3.2% of the annual lung-cancer mortality, the assumptions of the models should be reexamined and the upper bound on the resulting estimate constrained." This inference is a complete nonsequitur. To see this, let us call the risk being bounded the X-risk. First, Ha-Duong et al. are assuming that their upper bound of 3.2% on this risk is a gold standard. Nothing could be further from the truth. Any disagreement between Ha-Duong et al.'s upper bound and another assessment could be due to faulty inputs or execution in Ha-Duong et al.'s analysis. Unaccounted-for biases (as well as random errors) in the data used to form the inputs would ripple through to distort Ha-Duong et al.'s output (Maclure & Schneeweiss, 2001). "Garbage in, garbage out" is a watch-phrase that applies to every assessment, including a bounding analysis. A competing assessment is just that: a competitor, possibly with better inputs and superior handling of problems such as bias. Given these possibilities, the competitor should not be judged deficient just because it conflicts with the bounding analysis.

Second, as with the input bounds, Ha-Duong et al. give no sense of precision or reliability for their output bound of 3.2%. It is not even clear what this bound means: Is it supposed to represent a point that we are 100% certain or confident is above the X-risk, i.e., that the X-risk is below 3.2%? That is how they treat it in Section 4. How is such certainty justified given the inevitable uncertainty in the inputs to and assumptions in the bounding analysis? If the output bound corresponds to anything less than 100% certainty, how much less? Why use that particular less-than-100% level of certainty as a cutoff point? If we are not to invoke probabilistic concepts of certainty or confidence, how are we supposed to interpret the bound?

In a complementary fashion, the output from any competing assessment will (or should) have some uncertainty attached. At what point do we say that the competitor conflicts with the bounding analysis? Do we construct a test comparing the two outputs? If so, at what critical level do we reject the hypothesis that the two assessments are compatible, given the uncertainties involved? And how do we account for the fact that the bounding analysis and the competing assessment will be based on much the same data, and so the two will be correlated? Ha-Duong et al.'s comparison of their upper bound with a published point projection is naïve at best. While not likely to mislead if the point falls within the bounds, it can be very misleading otherwise: a valid comparison of two analyses cannot be based on just looking at whether bounds overlap with point estimates or other bounds; for example, two 95% confidence intervals may overlap even though the difference between the two estimates they surround is significant at the 0.05 level.
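The point about overlapping intervals is easy to verify numerically. In the sketch below (made-up numbers), the two 95% confidence intervals overlap, yet the conventional two-sided test of the difference between the independent estimates is significant at the 0.05 level.

```python
from math import sqrt
from scipy.stats import norm

b1, b2 = 10.0, 20.0                         # hypothetical estimates
se1 = se2 = 3.5                             # their standard errors

ci1 = (b1 - 1.96 * se1, b1 + 1.96 * se1)    # (3.14, 16.86)
ci2 = (b2 - 1.96 * se2, b2 + 1.96 * se2)    # (13.14, 26.86): overlaps ci1

se_diff = sqrt(se1**2 + se2**2)             # SE of the difference for independent estimates
z = (b2 - b1) / se_diff                     # about 2.02
p = 2 * norm.sf(abs(z))                     # about 0.043 < 0.05

print(ci1, ci2, round(z, 2), round(p, 3))
```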
2.4. Clarifying the Underlying Causal Model

Ha-Duong et al. say that "there is a considerable debate over the meaning of causation in epidemiology." I think that this is a severe overstatement. Among statisticians there has been debate between a growing number that favor a potential-outcome (counterfactual) view of causation and a few opponents (Greenland, 2000, 2004; Maldonado & Greenland, 2002; Dawid, 2000). Among researchers actually developing, teaching, and applying practical causal-analysis methods, potential-outcome models have been in use for over 80 years (Neyman, 1923), and have become textbook material (e.g., Berk, 2003; Freedman et al., 1997; Newman, 2001; Gelman et al., 2003; Rosenbaum, 2002; Van Der Laan & Robins, 2003). Furthermore, the general counterfactual concept on which they are based has long been established in epidemiology; e.g., see MacMahon and Pugh (1967), as well as Rothman and Greenland (1998, Chs. 2 and 4), Parascandola and Weed (2001), and Greenland (2004) for further historical as well as current citations. Models that were initially perceived as successful alternatives (e.g., causal diagrams and structural equations) have since been shown to be isomorphic to potential-outcome models, and thus not alternatives after all (Pearl, 2000; Greenland & Brumback, 2002); the remaining proposals have foundered on hidden circularities and have failed to generate acceptable data-analytic procedures (Pearl, 2000; Maldonado & Greenland, 2002).

To describe the potential-outcome models, suppose we wish to study the effect of an intervention variable X with potential values (range) x1, ..., xJ on a subsequent outcome variable Y defined on an observational unit or a population. One then assumes that there is a baseline vector of potential outcomes y = (y(x1), ..., y(xJ))′, such that if X = xj then Y = y(xj); this vector is simply a mapping from the X range to the Y range for the unit. To say that intervention xi causally affects Y relative to intervention xj then means that y(xi) ≠ y(xj); and the effect of intervention xi relative to xj on Y is measured by y(xi) − y(xj) or (if Y is strictly positive) by y(xi)/y(xj). Under this theory, assignment of a unit to a treatment level xi is simply a choice of which coordinate of y to attempt to observe; regardless of assignment, the remaining coordinates are treated as existing pretreatment covariates on which data are missing (Rubin, 1978). Formally, if we define the vector of potential treatments x = (x1, ..., xJ)′, with treatment indicators ri = 1 if the unit is given treatment xi, 0 otherwise, and r = (r1, ..., rJ)′, then the actual treatment given is xa = r′x and the actual outcome is ya ≡ y(xa) = r′y; the remaining y(xi) are the counterfactuals. Viewing r as the item-response vector for the items in y, causal inference under potential outcomes can be seen as a special case of inference under item nonresponse in which Σi ri = 0 or 1, i.e., at most one item in y is observed per unit (Rubin, 1991).
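A minimal sketch of the deterministic formalism just described may help fix ideas (the data are hypothetical 0/1 outcomes, and the code is mine rather than anything in Ha-Duong et al.): each unit carries a full vector of potential outcomes, the treatment-indicator vector merely selects which coordinate is observed, and effects are contrasts between coordinates averaged over units.

```python
import numpy as np

# One unit: potential outcomes y(x1), y(x2), y(x3) for J = 3 treatment levels.
y = np.array([0, 1, 1])        # deterministic 0/1 potential outcomes (hypothetical)
r = np.array([0, 1, 0])        # treatment indicators: this unit receives level x2

y_actual = r @ y               # observed outcome y_a = r'y; here y(x2) = 1
# The coordinates not selected by r are this unit's counterfactuals (missing data).

# Population version: rows are units, columns are treatment levels.
Y = np.array([[0, 1, 1],
              [0, 0, 1],
              [1, 1, 1]])
risk = Y.mean(axis=0)          # average outcome if every unit were set to level j
rd_21 = risk[1] - risk[0]      # effect of x2 versus x1 as a risk difference
rr_21 = risk[1] / risk[0]      # or as a risk ratio (defined here since risk[0] > 0)

print(y_actual, risk, rd_21, rr_21)
```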
The theory extends to stochastic outcomes by replacing the y(xi) by probability mass functions pi(y) (Greenland, 1987; Robins, 1988; Greenland et al., 1999), so the mapping is from X to the space of probability measures on Y. This extension is embodied in the "set" or "do" calculus for causal actions developed for artificial-intelligence research (Pearl, 2000). The theory also extends to continuous X by allowing the potential-outcome vector to be infinite-dimensional with coordinates indexed by X, and with components y(x) or px(y).

On the matter of causation, Ha-Duong et al. give only one epidemiologic citation: Parascandola and Weed (2001). The latter authors note that nearly all epidemiologic proposals incorporate counterfactual elements, and hence require potential outcomes; the main point of debate is whether and how one incorporates probabilistic elements into each proposal. Traditionalists favor restricting those elements to the sampling model, separated from the underlying causal model, and argue that thinking of deterministic potential outcomes facilitates development of mechanistic hypotheses. The opposing view (sometimes labeled "black-box") regards the actual systems under study as too complex to be modeled adequately in deterministic terms, and favors introducing stochastic potential outcomes (Greenland, 1987; Robins & Greenland, 1989; Greenland et al., 1999; Parascandola & Weed, 2001). This shift to stochastic potential outcomes can affect the choice of statistical procedure, and in particular can increase final uncertainty (Robins, 1988). The black-box viewpoint is supported by the success of highly stochastic modeling in prediction problems (Breiman, 2001), although a pragmatic view would regard both types of models as useful tools with differing domains of utility.

Like traditional and modern epidemiologic views (e.g., MacMahon & Pugh, 1967; Rothman & Greenland, 1998), Ha-Duong et al.'s proposal appears to implicitly use potential outcomes, especially in its discussion of partitioning and interactions. Adoption of potential outcomes has statistical implications for confounding identification and control, especially in an application that uses odds ratios or rate ratios as measures of effect for a common disease (Greenland, 1987; Rothman & Greenland, 1998, Ch. 4; Greenland et al., 1999). Furthermore, Ha-Duong's sharp partitioning of case numbers among risk-factor combinations implies the potential outcomes are deterministic, and so is subject to criticisms of that view (Parascandola & Weed, 2001). This is not a fatal objection (in practice to date, most potential-outcome models have been deterministic), but it is an aspect of their approach that needs acknowledgment.

3. PROBABILISTIC RISK ASSESSMENT VERSUS BOUNDING ANALYSIS

It would seem that Ha-Duong et al. propose their bounding analysis as an alternative to probabilistic risk assessments, or even as a standard to judge the latter. I do not see the need for such an alternative. On the contrary, I think the discipline imposed by having to specify a full probability model is an important brake on the vague hand waving that often substitutes for critical thinking in model specification and output interpretation. Furthermore, I would judge bounding analysis against probabilistic assessment methods, for the latter are far more well established and understood (e.g., Eddy et al., 1992; Saltelli et al., 2000), and are closely related to even more well-studied statistical methods for nonrandom exposure and missing-data mechanisms (Gelman et al., 2003; Little & Rubin, 2002; Robins et al., 1999).

Two types of probabilistic assessments, Bayesian assessment and Monte Carlo sensitivity analysis (MCSA), depart from traditional statistical inference by starting with a model that includes nonidentified sensitivity or bias parameters. Hence both begin with bias modeling, a process conspicuously missing from the bounding proposal of Ha-Duong et al. Full Bayesian assessment requires specification of a joint prior for all unknown parameters, and in turn supplies posterior distributions of target parameters (Leamer, 1974, 1978; Eddy et al., 1992; Little & Rubin, 2002; Gelman et al., 2003). MCSA—a variant of familiar stochastic simulation methods (Morgan & Henrion, 1990; Vose, 2000)—requires priors for the nonidentified parameters only, and supplies simulated distributions of estimates "corrected" for biases and random errors. The MCSA can be viewed as an approximation to a semi-Bayesian analysis and also as an extension of traditional sensitivity analysis (Greenland, 2001, 2003, in press). In traditional sensitivity analysis, bias parameters are varied in a systematic fashion, which can produce misleading range effects (Greenland, 1998b); in MCSA those parameters are instead drawn from a prior distribution. Because it is easily performed with popular software and can be based on familiar bias-sensitivity formulas, MCSA has been promoted for routine epidemiologic applications (Lash & Silliman, 2000; Powell et al., 2001; Greenland, 2001, in press; Lash & Fink, 2003; Phillips, 2003; Steenland & Greenland, 2004). Nonetheless, Bayesian bias analyses have begun to appear as well (Graham, 2000; Gustafson, 2003; Greenland, 2003; Steenland & Greenland, 2004).
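As a deliberately simplified sketch of what an MCSA involves, the code below repeatedly draws bias parameters from priors, applies a simple external-adjustment formula for a single unmeasured binary confounder, and adds conventional random error on the log scale. All numbers and prior ranges are hypothetical, and the adjustment formula is a textbook simplification rather than the bias model of any particular published analysis.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

rr_obs, se_log = 1.5, 0.15           # hypothetical observed rate ratio and log-scale SE

# Priors on the nonidentified bias parameters:
rr_cd = rng.uniform(1.5, 3.0, n)     # confounder-disease rate ratio
p1 = rng.uniform(0.4, 0.7, n)        # confounder prevalence among the exposed
p0 = rng.uniform(0.2, 0.5, n)        # confounder prevalence among the unexposed

# Simple external-adjustment (bias) factor for one unmeasured binary confounder:
bias = (p1 * (rr_cd - 1) + 1) / (p0 * (rr_cd - 1) + 1)

# Bias-correct on the log scale, then add ordinary random error:
log_rr = np.log(rr_obs) - np.log(bias) + rng.normal(0.0, se_log, n)

print(np.percentile(np.exp(log_rr), [2.5, 50, 97.5]))   # simulation interval "corrected" for bias
```

A full Bayesian analysis would instead place a joint prior on all unknown parameters and report a posterior; the sketch only propagates the bias priors and the usual random-error term, which is what gives MCSA its semi-Bayesian character.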
From a Bayesian viewpoint, the bounding analysis of Ha-Duong et al. entails mishandling of crucial uncertainties. Returning to my criticism of input bounds, an attempt to honestly answer the question of "why 0.70 rather than 0.67 or 0.75?" will lead one to face the fuzziness of the bound concept as well as the location of the bound. Suppose this criticism is answered by declaring the goal is to specify a greatest lower bound L that one is 100% certain falls below the parameter at issue. One will then have to face the uncertainty or dispute among experts about the value of L to specify. If one could poll the experts one could plot a cumulative distribution of L, and use that to shape the lower tail of a prior distribution. One could take the analogous approach to the least upper bound U. Although this approach would not imply a uniform density between the 100th percentile of the Ls and the 0th percentile of the Us, uniformly filling in the gap between the highest L and lowest U would approximate averaging over the expert densities if the individual densities are relatively flat between their bounds. In the extreme case of just one expert very certain of his or her bounds, this approach would reduce to using a uniform density between the bounds. The final MCSA or Bayesian outputs would not be very sensitive to small changes in the bounds, given that the mass shifts involved would be small. As used by Ha-Duong et al., bounding analysis seems to correspond to placing 50% subjective prior mass at each of the uncertain bounds, which turns the bounds into leverage points, and thus grossly inflates sensitivity to bound location—a sensitivity that would at the very least need to be explored as part of a bounding analysis.
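A rough sketch of the pooling construction just described, using hypothetical elicited bounds from five experts: averaging relatively flat per-expert densities produces a prior that is nearly uniform between the highest L and the lowest U but tapers in the tails, so a small change in any one expert's bound shifts little probability mass.

```python
import numpy as np

# Hypothetical "100%-certain" lower and upper bounds elicited from five experts:
L = np.array([0.60, 0.65, 0.67, 0.70, 0.72])
U = np.array([0.85, 0.90, 0.92, 0.95, 0.97])

grid = np.linspace(0.5, 1.0, 501)
density = np.zeros_like(grid)
for lo, hi in zip(L, U):
    # Each expert contributes a flat density between his or her own bounds.
    density += ((grid >= lo) & (grid <= hi)) / (hi - lo)
density /= len(L)
density /= density.sum() * (grid[1] - grid[0])   # renormalize on the grid

# The mixture is flat between max(L) = 0.72 and min(U) = 0.85 and tapers outside that
# range; contrast this with an analysis that in effect piles probability at the bounds.
```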
4. EXPERT ELICITATION VERSUS LITERATURE ANALYSIS

In problems with which I am familiar, I would be loath to rely on expert elicitations, whether for bounds or full priors. In my experience, epidemiologic experts often possess highly distorted views of the literature on which they are supposedly expert, freely mixing unsupported impressions with statistical misinterpretations. They often believe a trend is monotone because a trend test (which assumes monotonicity) is "statistically significant" (Maclure & Greenland, 1992). When summarizing a literature, they often believe no association is present because most studies are not "statistically significant," even when this is just an artifact of low power of individual studies, and they often believe the literature is conflicting because some studies are "significant" and others are not, even when this is just an artifact of power differences; see Greenland et al. (2000a) for a meta-analysis that revealed both phenomena. When done well (with proper accounting for heterogeneity and publication bias, as needed), meta-analysis can expose such literature misinterpretations and provide a far more valid and reliable source of input to risk assessment than would an expert elicitation.
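The low-power artifact is easy to demonstrate with a small inverse-variance pooling sketch (the study results are hypothetical): five studies, each with a 95% confidence interval that includes the null rate ratio of 1, combine into a pooled interval that clearly excludes it. A real meta-analysis would, as noted above, also need to address heterogeneity and publication bias.

```python
import numpy as np

# Hypothetical log rate ratios and standard errors from five small studies:
log_rr = np.full(5, np.log(1.3))
se = np.full(5, 0.20)

# Each study alone is "nonsignificant": its 95% CI includes RR = 1.
study_ci = np.exp(np.stack([log_rr - 1.96 * se, log_rr + 1.96 * se]))   # about (0.88, 1.92)

# Fixed-effect (inverse-variance) pooling:
w = 1 / se**2
pooled = np.sum(w * log_rr) / np.sum(w)
pooled_se = np.sqrt(1 / np.sum(w))
pooled_ci = np.exp([pooled - 1.96 * pooled_se, pooled + 1.96 * pooled_se])

print(study_ci[:, 0], np.exp(pooled), pooled_ci)   # pooled RR about 1.30, CI about (1.09, 1.55)
```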

Direct inspection of relevant literature can also uncover much useful data (the best prior information) for estimating biases in published study results (Greenland, 2003; Steenland & Greenland, 2004), whereas expert judgments about methodologic biases tend to be highly prejudiced (depending on whether the bias in question would apply to the expert's own studies, or would move results toward or away from his or her favored answer). Even when without prejudice, bias judgments are often based on incorrect or misleading heuristics, as Jurek et al. (2004) found regarding the oft-cited and often false rule that nondifferential misclassification produces bias toward the null.

To address the problems with elicitations, it appears that Ha-Duong et al. would gather and synthesize relevant literature and present it to the experts for review before doing elicitation. While I have little doubt that this would improve the elicitations, I strongly doubt that it would address most of the cognitive problems described above, such as misinterpretation of lack of statistical significance. Although resource intensive, I see no way around a meta-analytic approach if one wants reliable and valid inputs to a risk analysis; the role of experts is then to conduct or review such an analysis, and interpret the findings in context.

On a more specific issue, the authors propose eliciting expert opinions regarding two-way interactions. With rare exceptions (smoking and radon perhaps being one), the epidemiologic evidence on such interactions is so imprecise that not even a direction can be reliably inferred. Because of their large number (n(n − 1)/2 interactions among n factors) and great instability (typically no less than twice the standard error of main effects), published interaction estimates are subject to especially severe multiple-comparison artifacts and publication bias; they are also more vulnerable to biases from data sparsity (Greenland et al., 2000b). I would not expect any expert elicitation to account for these problems, even when based on a literature synthesis, which is to say I would not trust such an elicitation to be better than noise. Even in the rare case that the evidence regarding an interaction is reliable and valid, one would again be better off deriving inputs from direct analysis of available data, bearing in mind the high likelihood of the aforementioned biases in the published literature.

5. CONCLUSION

It is an open question whether one can break from a full probabilistic framework and still produce a logically sound and scientifically sensible mode of reasoning under uncertainty. Ha-Duong et al.'s failure to do so might be evidence in favor of a negative answer. Still, it would be interesting to see attempts to clarify or repair their methodology. At the very least, such methodologic proposals test the hypothesis that full (not just interval) probability models are essential to rational decision and prediction. This full-probability hypothesis is received wisdom (if often implicit) in the statistics literature, which is largely devoted to proposing new probability models and deducing consequences and procedures from them.

A received hypothesis merits refutation attempts, even if it seems obviously correct; note how physicists never cease testing received theories such as general relativity. Thus, despite my strong criticisms of what Ha-Duong et al. have proposed and my warnings against using it, I would also strongly encourage these authors to repair or replace (not just defend) their proposal, and so provide a stronger challenge to full-probability approaches. This repair will require considerable conceptual clarification of both input and output bounds, along with far more extensive sensitivity analyses than Ha-Duong et al. supply. In the meantime, I again warn that bounding analysis is no substitute or criterion for probabilistic assessment.

REFERENCES

Berk, R. A. (2003). Regression Analysis: A Constructive Critique. Thousand Oaks: Sage.
Breiman, L. (2001). Statistical modeling: The two cultures. Statistical Science, 16, 199–231.
Carlin, B., & Louis, T. A. (2000). Bayes and Empirical-Bayes Methods for Data Analysis, 2nd ed. New York: Chapman and Hall.
Dawid, P. (2000). Causal inference without counterfactuals (with discussion). Journal of the American Statistical Association, 95, 407–448.
Eddy, D. M., Hasselblad, V., & Schachter, R. (1992). Meta-Analysis by the Confidence Profile Method. New York: Academic Press.
Freedman, D., Pisani, R., & Purves, R. (1997). Statistics, 3rd ed. New York: Norton.
Gelman, A., Carlin, J. B., Stern, H. S., & Rubin, D. B. (2003). Bayesian Data Analysis, 2nd ed. New York: Chapman and Hall/CRC.
Gilovich, T., Griffin, D., & Kahneman, D. (2002). Heuristics and Biases: The Psychology of Intuitive Judgment. New York: Cambridge University Press.
Good, I. J. (1983). Good Thinking. Minneapolis: University of Minnesota Press.
Good, I. J. (1987). Hierarchical Bayesian and empirical Bayesian methods (letter). American Statistician, 41, 92.
Graham, P. (2000). Bayesian inference for a generalized population attributable fraction. Statistics in Medicine, 19, 937–956.
Greenland, S. (1987). Interpretation and choice of effect measures in epidemiologic analysis. American Journal of Epidemiology, 125, 761–768.
Greenland, S. (1998a). Probability logic and probabilistic induction. Epidemiology, 9, 322–332.
Greenland, S. (1998b). The sensitivity of a sensitivity analysis. In 1997 Proceedings of the Biometrics Section (pp. 19–21). Alexandria, VA: American Statistical Association.
Greenland, S. (2000). Causal analysis in the health sciences. Journal of the American Statistical Association, 95, 286–289.
Greenland, S. (2001). Sensitivity analysis, Monte-Carlo risk analysis, and Bayesian uncertainty assessment. Risk Analysis, 21, 579–583.
Greenland, S. (2003). The impact of prior distributions for uncontrolled confounding and response bias: A case study of the relation of wire codes and magnetic fields to childhood leukemia. Journal of the American Statistical Association, 98, 47–54.
Greenland, S. (2004). An overview of methods for causal inference from observational studies. In A. Gelman & X. L. Meng (Eds.), Applied Bayesian Modeling and Causal Inference from an Incomplete-Data Perspective. New York: Wiley.
Greenland, S. (in press). Multiple-bias modeling for observational studies (with discussion). Journal of the Royal Statistical Society, Series A.
Greenland, S., & Brumback, B. A. (2002). An overview of relations among causal modelling methods. International Journal of Epidemiology, 31, 1030–1037.
Greenland, S., Robins, J. M., & Pearl, J. (1999). Confounding and collapsibility in causal inference. Statistical Science, 14, 29–46.
Greenland, S., Sheppard, A. R., Kaune, W. T., Poole, C., & Kelsh, M. A. (2000a). A pooled analysis of magnetic fields, wire codes, and childhood leukemia. Epidemiology, 11, 624–663.
Greenland, S., Schwartzbaum, J. A., & Finkle, W. D. (2000b). Problems from small samples and sparse data in conditional logistic regression analysis. American Journal of Epidemiology, 151, 531–539.
Gustafson, P. (2003). Measurement Error and Misclassification in Statistics and Epidemiology. New York: Chapman and Hall.
Ha-Duong, M., Casman, E. A., & Morgan, M. G. (2004). Bounding poorly characterized risks: A lung cancer example. Risk Analysis, 24, 1071–1083.
Jurek, A., Maldonado, G., Church, T., & Greenland, S. (2004). Exposure-measurement error is frequently ignored when interpreting epidemiologic study results. American Journal of Epidemiology, 159, 572.
Kahneman, D., Slovic, P., & Tversky, A. (1982). Judgment Under Uncertainty: Heuristics and Biases. New York: Cambridge University Press.
Lash, T. L., & Fink, A. K. (2003). Semi-automated sensitivity analysis to assess systematic errors in observational epidemiologic data. Epidemiology, 14, 451–458.
Lash, T. L., & Silliman, R. A. (2000). A sensitivity analysis to separate bias due to confounding from bias due to predicting misclassification by a variable that does both. Epidemiology, 11, 544–549.
Leamer, E. E. (1974). False models and post-data model construction. Journal of the American Statistical Association, 69, 122–131.
Leamer, E. E. (1978). Specification Searches. New York: Wiley.
Little, R. J. A., & Rubin, D. A. (2002). Statistical Analysis with Missing Data, 2nd ed. New York: Wiley.
Maclure, M., & Greenland, S. (1992). Tests for trend and dose-response: Misinterpretations and alternatives. American Journal of Epidemiology, 135, 96–104.
Maclure, M., & Schneeweiss, S. (2001). Causation of bias: The episcope. Epidemiology, 12, 114–122.
MacMahon, B., & Pugh, T. F. (1967). Causes and entities of disease. In D. W. Clark & B. MacMahon (Eds.), Preventive Medicine (pp. 11–18). Boston: Little, Brown.
Maldonado, G., & Greenland, S. (2002). Estimating causal effects (with discussion). International Journal of Epidemiology, 31, 421–438.
Morgan, M. G., & Henrion, M. (1990). Uncertainty: A Guide to Dealing with Uncertainty in Quantitative Risk and Policy Analysis. New York: Cambridge University Press.
Newman, S. C. (2001). Biostatistical Methods in Epidemiology. New York: Wiley.
Neyman, J. (1923). Sur les applications de la théorie des probabilités aux expériences agricoles: Essai des principes. (English translation of excerpts: Dabrowska, D., & Speed, T. (1990). Statistical Science, 5, 463–472.)
Parascandola, M., & Weed, D. (2001). Causation in epidemiology. Journal of Epidemiology and Community Health, 55, 905–912.
Pearl, J. (2000). Causality: Models, Reasoning, and Inference. New York: Cambridge University Press.
Phillips, C. V. (2003). Quantifying and reporting uncertainty from systematic errors. Epidemiology, 14, 459–466.
Poole, C., & Greenland, S. (1997). How a court accepted a possible explanation. American Statistician, 51, 112–114.
Powell, M., Ebel, E., & Schlossel, W. (2001). Considering uncertainty in comparing the burden of illness due to foodborne microbial pathogens. International Journal of Food Microbiology, 69, 209–215.
Robins, J. M. (1988). Confidence intervals for causal parameters. Statistics in Medicine, 7, 773–785.
Robins, J. M., & Greenland, S. (1989). The probability of causation under a stochastic model for individual risks. Biometrics, 46, 1125–1138.
Robins, J. M., Rotnitzky, A., & Scharfstein, D. O. (1999). Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models. In M. E. Halloran & D. A. Berry (Eds.), Statistical Models in Epidemiology (pp. 1–92). New York: Springer-Verlag.
Rosenbaum, P. (2002). Observational Studies, 2nd ed. New York: Springer-Verlag.
Rothman, K. J., & Greenland, S. (1998). Modern Epidemiology, 2nd ed. Philadelphia: Lippincott.
Rubin, D. B. (1978). Bayesian inference for causal effects: The role of randomization. Annals of Statistics, 6, 34–58.
Rubin, D. B. (1991). Practical implications of modes of statistical inference for causal effects and the critical role of the assignment mechanism. Biometrics, 47, 1213–1234.
Saltelli, A., Chan, K., & Scott, M. (Eds.) (2000). Mathematical and Statistical Methods for Sensitivity Analysis (pp. 275–292). New York: Wiley.
Steenland, K., & Greenland, S. (2004). Monte-Carlo sensitivity analysis and Bayesian analysis of smoking as an unmeasured confounder in a study of silica and lung cancer. American Journal of Epidemiology, 160, 384–392.
Van Der Laan, M., & Robins, J. M. (2003). Unified Methods for Censored Longitudinal Data and Causality. New York: Springer.
Vose, D. (2000). Risk Analysis. New York: John Wiley and Sons.