<<

Statistical Science 2004, Vol. 19, No. 1, 58–80 DOI 10.1214/088342304000000116 © Institute of Mathematical , 2004 The Interplay of Bayesian and Frequentist Analysis M.J.BayarriandJ.O.Berger

Abstract. Statistics has struggled for nearly a century over the issue of whether the Bayesian or frequentist paradigm is superior. This debate is far from over and, indeed, should continue, since there are fundamental philosophical and pedagogical issues at stake. At the methodological level, however, the debate has become considerably muted, with the recognition that each approach has a great deal to contribute to statistical practice and each is actually essential for full development of the other approach. In this article, we embark upon a rather idiosyncratic walk through some of these issues. Key words and phrases: Admissibility, Bayesian model checking, condi- tional frequentist, confidence intervals, consistency, coverage, design, hierar- chical models, nonparametric Bayes, objective Bayesian methods, p-values, reference priors, testing.

CONTENTS 5. Areas of current disagreement 6. Conclusions 1. Introduction Acknowledgments 2. Inherently joint Bayesian–frequentist situations References 2.1. Design or preposterior analysis 2.2. The meaning of frequentism 2.3. Empirical Bayes, gamma minimax, restricted risk 1. INTRODUCTION Bayes Statisticians should readily use both Bayesian and 3. Estimation and confidence intervals frequentist ideas. In Section 2 we discuss situations 3.1. Computation with hierarchical, multilevel or mixed in which simultaneous frequentist and Bayesian think- model analysis ing is essentially required. For the most part, how- 3.2. Assessment of accuracy of estimation ever, the situations we discuss are situations in which 3.3. Foundations, minimaxity and exchangeability 3.4. Use of frequentist methodology in prior develop- it is simply extremely useful for Bayesians to use fre- ment quentist methodology or frequentists to use Bayesian 3.5. Frequentist simplifications and asymptotic approxi- methodology. mations The most common scenarios of useful connections 4. Testing, and model checking between frequentists and Bayesians are when no exter- 4.1. Conditional frequentist testing nal information (other than the data and model itself) 4.2. Model selection is to be introduced into the analysis—on the Bayesian 4.3. p-Values for model checking side, when “objective prior distributions” are used. Frequentists are usually not interested in subjective, M. J. Bayarri is Professor, Department of Statistics, informative priors, and Bayesians are less likely to be University of Valencia, Av. Dr. Moliner 50, 46100 interested in frequentist evaluations when using sub- Burjassot, Valencia, Spain (e-mail: susie.bayarri@ jective, highly informative priors. uv.es). J. O. Berger is Director, Statistical and Applied We will, for the most part, avoid the question Mathematical Sciences Institute, and Professor, Duke of whether the Bayesian or frequentist approach to University, P.O. Box 90251, Durham, North Carolina statistics is “philosophically correct.” While this is a 27708-0251, USA (e-mail: [email protected]). valid question, and research in this direction can be of

58 BAYESIAN AND FREQUENTIST ANALYSIS 59

fundamental importance, the focus here is simply on EXAMPLE 2.1. Suppose X1,...,Xn are i.i.d. methodology. In a related vein, we avoid the question Poisson random variables with θ, and that it is of what is “pedagogically correct.” If pressed, we desired to estimate√θ under the weighted squared er- would probably argue that (with ror loss (θˆ − θ)2/ θ and using the classical estima- ˆ ¯ emphasis on objective Bayesian methodology) should tor θ = X. This√ estimator√ has frequentist expected loss ¯ 2 be the type of statistics that is taught to the masses, Eθ [(X − θ) / θ]= θ/n. with frequentist statistics being taught primarily to A typical design problem would be to choose the advanced statisticians, but that is not an issue for sample size n so that the expected loss is less than this paper. some prespecified limit C. (An alternative formulation Several caveats are in order. First, we primarily focus might be to minimize C + nc,wherec is the cost of an on the Bayesian and frequentist approaches here; these observation, but this would not significantly alter the are the most generally applicable and accepted statisti- discussion here.) This is clearly not possible, for all θ; cal philosophies, and both have features that are com- hence we must bring prior knowledge about θ into play. pelling to most statisticians. Other statistical schools, A primitive recommendation that one often sees, in such as the likelihood school (see, e.g., Reid, 2000), such situations, is to make√ a “best guess” θ0 for θ, have many attractive features and vocal proponents, but ≤ and then√ choose n so that θ0/n C; that is, choose have not been as extensively developed or utilized as n ≈ θ0/C. This is needlessly dogmatic, in that one the frequentist and Bayesian approaches. rarely believes particularly strongly in a particular A second caveat is that the selection of topics value θ0. here is rather idiosyncratic, being primarily based on A common primitive recommendation in the oppo- situations and examples in which we are currently site direction is to choose√ an upper bound θU for θ, interested. Other Bayesian–frequentist synthesis works ≤ and then√ choose n so that θU /n C; that is, choose (e.g., Pratt, 1965; Barnett, 1982; Rubin, 1984; and n ≈ θU /C. This is needlessly conservative, in that even Berger, 1985a) focus on a quite different set of the resulting n will typically be much larger than situations. Furthermore, we almost completely ignore needed. many of the most time-honored Bayesian–frequentist The Bayesian approach to the design question is synthesis topics, such as empirical Bayes analysis. to elicit a subjective prior distribution π(θ) for θ, Hence, rather than being viewed as a comprehensive √ and then to choose n so that θ π(θ)dθ ≤ C;that review, this paper should be thought of more as a √ n ≈ personal view of current interesting issues in the is, choose n θπ(θ)dθ/C. This is a reasonable Bayesian–frequentist synthesis. compromise between the above two extremes and will typically result in the most reasonable values of n. 2. INHERENTLY JOINT BAYESIAN– Classical design texts often focus on the very spe- FREQUENTIST SITUATIONS cial situations in which the design criterion is constant There are certain statistical scenarios in which a in the unknown model parameter θ, and hence fail to joint frequentist–Bayesian approach is arguably re- clarify the philosophical centrality of Bayesian issues quired. As illustrations of this, we first discuss the in design. 
The basic fact is that, before experimenta- issue of design—in which the notion should not be tion, one knows neither the data nor θ, and so expec- controversial—and then discuss the basic meaning of tations over both (i.e., both frequentist and Bayesian frequentism, which arguably should be (but is not expectations) are needed for design. See Chaloner and typically perceived as) a joint frequentist–Bayesian Verdinelli (1995) and Dawid and Sebastiani (1999). endeavor. A very common situation in which design evaluation is not constant is classical testing, in which the sample 2.1 Design or Preposterior Analysis size is often chosen to achieve a given power at a  Frequentist design focuses on planning of experi- specified value θ of the parameter under the alternative  ments—for instance, the issue of choosing an appro- hypothesis. Again, specifying a specific θ is very priate sample size. In Bayesian analysis this is often crude when viewed from a Bayesian perspective. Far called preposterior analysis, because it is done before more reasonable for a classical tester would be to the data is collected (and, hence, before the posterior specify a prior distribution for θ under the alternative, distribution is available). and consider the average power with respect to this 60 M. J. BAYARRI AND J. O. BERGER distribution. (More controversial would be to consider The impact of this “real” frequentist principle thus an average Type I error.) arises when the frequentist evaluation of a procedure is not constant over the parameter space. Here is 2.2 The Meaning of Frequentism an example. There is a sense in which essentially everyone should EXAMPLE 2.2. Binomial confidence interval. ascribe to frequentism: Brown, Cai and DasGupta (2001, 2002) considered the ∼ FREQUENTIST PRINCIPLE. In repeated practical problem of observing X Binomial(n, θ) and deter- use of a statistical procedure, the long-run average mining a 95% confidence interval for the unknown actual accuracy should not be less than (and ideally success θ. We consider here the special = should equal) the long-run average reported accuracy. case of n 50, and two confidence procedures. The first is CJ (x), defined as the “Jeffreys equal-tailed This version of the frequentist principle is actually 95% confidence interval,” given by a joint frequentist–Bayesian principle. Suppose, for in- J = stance, that we decide it is relevant to statistical prac- (2.1) C (x) q0.025(x), q0.975(x) , tice to repeatedly use a particular and where qα(x) is the αth-quantile of the Beta(x + 0.5, procedure—for instance, a 95% classical confidence 50.5 − x) distribution. The second confidence proce- interval for a normal mean. This procedure will, in dure we consider is the “modified Jeffreys equal-tailed practice, be used on a series of different problems 95% confidence interval,” given by  involving a series of different normal with a cor-  =  q0.025(x), q0.975(x) , if x 0 responding series of data. Hence, in evaluating the pro-  and x = n, cedure, we should simultaneously be averaging over J ∗ = (2.2) C (x)  the differing means and data.  0,q0.975(x) , if x = 0,  This is in contrast to textbook statements of the q0.025(x), 1 , if x = n. frequentist principle which tend to focus on fixing For the , simply consider these as formulae for the value of, say, the normal mean, and imagining confidence intervals; we later discuss their motivation. 
repeatedly drawing data from the given model and Brown, Cai and Dasgupta (2001) provide the graph utilizing the confidence procedure repeatedly on this ∗ of the coverage probability of CJ given in Figure 1. data. The word imagining is emphasized, because this Note that, while roughly close to the target 95%, the is solely a thought . What is done in practice coverage probability varies considerably as a function is to use the confidence procedure on a series of of θ, going from a high of 1 at θ = 0andθ = 1 different problems—not use the confidence procedure to a low of 0.884 at θ = 0.049 and θ = 0.951. for a series of repetitions of the same problem with A “textbook frequentist” might then assert that this different data (which would typically make no sense isonlyan88.4% confidence procedure, since the in practice). coverage cannot be guaranteed to be higher than Neyman himself repeatedly pointed out (see, e.g., this limit. But would the “practical frequentist” agree Neyman, 1977) that the motivation for the frequentist with this? principle is in its use on differing real problems, and The practical frequentist evaluates how CJ ∗ would not imaginary repetitions for one problem with a fixed work for a sequence {θ1,θ2,...,θm} of parameters “true parameter.” Of course, the reason textbooks typi- (and corresponding data) encountered in a series of real cally give the latter (philosophically misleading) ver- problems. If m is large, the law of large numbers guar- sion is because of the convenient mathematical fact antees that the coverage that is actually experienced that if, say, a confidence procedure has 95% frequen- will be the average of the coverages obtained over the tist coverage for each fixed parameter value, then it will sequence of problems. Thus we should be considering necessarily also have 95% coverage when used repeat- averages of the coverage in Figure 1 over sequences edly on a series of differing problems. Thus (as with of θj . design), whenever the frequentist evaluation is constant One could, of course, choose the sequence of θj to over the parameter space, one does not need to also all be 0.049 and/or 0.951, but this is not very realistic. do a Bayesian average over the parameter space; but, One might consider global averages with respect to conceptually, it is the combined frequentist–Bayesian sequences generated from prior distributions π(θ),but average that is practically relevant. a “practical frequentist” presumably does not want to BAYESIAN AND FREQUENTIST ANALYSIS 61

∗ ∗ FIG.1. Frequentist coverage of the CJ intervals, as a function FIG.2. Local average coverage of the CJ intervals, as a of θ when n = 50. function of θ when n = 50 and ε = 0.05.

spend much time thinking about prior distributions. with Beta(θ|a(θ),a(1 − θ)) denoting the beta density One plausible solution is to look at local average with given parameters]. coverage, defined via a local smoothing of the binomial For the binomial example we are considering, this = coverage function. A convenient computational kernel is graphed in Figure 2, for ε 0.05. (Such a value for this problem, when smoothing at a point θ is of ε could be interpreted as implying that one is desired and when a smoothing kernel having standard sure that the sequence of practical problems, for deviation ε is desired (so that θ ± 2ε can roughly be which the binomial confidence interval will be used, has θ varying by at least ±0.05.) Note that this local thought of as the over which the smoothing is j average coverage is always close to 0.95, so that a performed), is the Beta(a(θ), a(1 − θ)) distribution practical frequentist would be quite pleased with the k (·) where ε,θ confidence interval.  − ≤  1 2ε, if θ ε, One could imagine a textbook frequentist arguing  − =  [θ(1 − θ)ε 2 − 1]θ, that, sometimes, a particular value, such as θ 0.049, = could be of special interest in repeated investigations, (2.3) a(θ)  if ε<θ<1 − ε,  the value perhaps corresponding to some important  1  − 3 + 2ε, if θ ≥ 1 − ε. physical theory concerning θ that science will repeat- ε edly investigate. In such a situation, however, it is ar- Writing the standard frequentist coverage as 1 − α(θ), guably not appropriate to utilize confidence intervals; this leads to the ε-local average coverage that there is a special value of θ of interest should be acknowledged via some type of testing procedure. 1 θ 1 − αε(θ) = [1 − α(θ)]kε,θ (θ) dθ Even if there were a distinguished value of and it 0 was erroneously handled by finding a confidence inter- n n val, the practical frequentist has one more arrow in his =  a(θ) + a(1 − θ) x or her quiver: it is not likely that a series of x=0 ·  a(θ) + x investigating this particular physical theory would all choose the same sample size, so one should consider ·  a(1 − θ)+ n − x “practical averaging” over sample size. For instance, · (a(θ)) a(1 − θ) suppose sample sizes would vary between 40 and 60 for the binomial problem we have been considering. − ·  a(θ) + a(1 − θ)+ n 1 Then one could reasonably consider average coverage over these sample sizes, the result of which is given in · Beta θ|a(θ),a(1 − θ) dθ, Figure 3. While not always as close to 0.95 as was the ∗ CJ local average coverage, it would still strike most peo- the last equation following from the standard expres- ple as reasonable to call CJ ∗ a 95% confidence interval sion for the beta–binomial predictive distribution [and when averaged over reasonable sample sizes. 62 M. J. BAYARRI AND J. O. BERGER

lower α/2-quantiles of the resulting posterior distribu- tion for θ.) This is the prior that is customary for an objective Bayesian to use for the binomial problem. (See Section 3.4.3 for further discussion of the Jeffreys prior.) Note that, because of the derivation of the cred- ible set from the objective Bayesian perspective, there is strong reason to believe that “conditionally on the given situation and data, the accuracy assignment of 95% is reasonable.” See Section 3.2.2 for discussion of conditional performance. The frequentist coverage of the intervals CJ (x) is given in Figure 4. A pure frequentist might well be con- cerned with the raw coverage of this because it goes to zero at θ = 0andθ = 1. A moment’s ∗ FIG.3. Average coverage over n between 40 and 60 of the CJ reflection reveals why this is the case: the equal-tailed intervals, as a function of θ. Bayesian credible intervals purposely exclude values in the left and right tails of the posterior distribution A similar idea concerning “local averages” of and, hence, will always exclude θ = 0andθ = 1. The frequentist properties was employed by Woodroofe modification of this interval employed in Brown, Cai J ∗ (1986), who called the concept “very weak expan- and DasGupta (2001) is C (x) in (2.2): for the ob- = = sions.” Brown, Cai and DasGupta (2002), for the servations x 0orx n, one simply extends the binomial problem, considered the average coverage Jeffreys equal-tailed credible intervals to include 0 or 1. defined as the smooth part of their asymptotic expan- Of course, from a conditional Bayesian perspective, sion of coverage, yielding a result similar to that in these intervals then have 0.975, so a Bayesian would no longer call them 95% credible Figure 2. Rousseau (2000) took a different approach, intervals. considering slight adjustment of the Bayesian intervals While the raw frequentist coverage of the Jeffreys through randomization to achieve the correct frequen- equal-tailed credible intervals might seem unappeal- tist coverage. ing, their 0.05-local average coverage is excellent, So far the discussion has been in terms of the virtually the same as that in Figure 2 for the modified practical frequentist acknowledging the importance of interval; indeed, the difference is not visually apparent, considering averages over θ. We would also claim, so that we do not separately include a graph of this lo- however, that Bayesians should ascribe to the above cal average coverage. Hence the practical frequentist version of the frequentist principle. If (say) a Bayesian would be quite happy with use of CJ (x),evenifithas were to repeatedly construct purported 90% credible low coverage right at the endpoints. intervals in his or her practical work, yet they only con- tained the unknowns about 70% of the time, something would be seriously wrong. A Bayesian might feel that the practical frequentist principle will automatically be satisfied if he or she does a good Bayesian job of sep- arately analyzing each individual problem, and hence that it is not necessary to specifically worry about the principle, but that does not mean that the principle is invalid.

EXAMPLE 2.3. In this regard, let us return to the binomial example to discuss the origin of the confidence intervals CJ (x) and CJ ∗(x).Theinter- vals CJ (x) arise as the Bayesian equal-tailed credi- ble sets obtained from use of the Jeffreys prior (see ∝ −1/2 − −1/2 Jeffreys, 1961) π(θ) θ (1 θ) for θ. (In FIG.4. Coverage of the CJ intervals, as a function of θ particular, the intervals are formed by the upper and when n = 50. BAYESIAN AND FREQUENTIST ANALYSIS 63

The issue of Bayesians achieving good pure fre- For examples and references, see Berger (1985a) and quentist coverage near a finite boundary of a para- Vidakovic (2000). meter space is an interesting issue; our guess is that In the restricted risk Bayes approach, one has a sin- this is often not possible. In the above example, for gle prior distribution, but can only consider statisti- instance, whether a Bayesian includes, or excludes, cal procedures whose frequentist risk (expected loss) θ = 0orθ = 1 in a credible interval is rather arbi- is constrained in some fashion. The idea is that one trary and will depend on, for example, a choice such as can utilize the prior information, but in a way that that between an equal-tailed or highest posterior den- will be guaranteed to be acceptable to the frequentist sity (HPD) interval. (The HPD intervals for x = 0and who wants to limit frequentist risk. (See Berger, 1985a, x = n would include θ = 0andθ = 1, respectively.) for discussion and earlier references.) This approach is Furthermore, this choice will typically lead to either actually not “inherently Bayesian–frequentist,” but is 0 frequentist coverage or coverage of 1 at the end- more what could be termed a “hybrid” approach, in the points, unless something unnatural to a Bayesian, such sense that it seeks some type of formal compromise be- as randomization, were incorporated. Hence the recog- tween Bayesian and frequentist positions. There have nition of the centrality to frequentist practice of some been many other attempts at such compromises, but type of average coverage, rather than pointwise cover- none has seemed to significantly affect statistical prac- age, can be important in such problems to achieve si- tice. There are many other important areas in which multaneously acceptable Bayesian and frequentist per- joint frequentist–Bayesian evaluation is used. Some formance. were even developed primarily from the Bayesian 2.3 Empirical Bayes, Gamma Minimax, perspective, such as the prequential approach of Dawid Restricted Risk Bayes (cf. Dawid and Vovk, 1999).

Several approaches to statistical analysis have been 3. ESTIMATION AND CONFIDENCE INTERVALS proposed which are inherently a mixture of Bayesian and frequentist analysis. These approaches have leng- In statistical estimation (including development of thy histories and extensive literature and we so we confidence intervals), objective Bayesian and frequen- can do little more here than simply give pointers to tist methods often give similar (or even identical) the areas. answers in standard parametric problems with contin- Robbins (1955) introduced the empirical Bayes uous parameters. The standard normal linear model is the prototypical example: frequentist estimates and approach, in which one specifies a class of prior confidence intervals coincide exactly with the stan- distributions , but assumes that the prior is other- dard objective Bayesian estimates and credible inter- wise unknown. The data is then used to help de- vals. Indeed, this occurs more generally in situations termine the prior and/or to directly find the optimal that exhibit an “invariance structure,” provided objec- Bayesian answer. Frequentist reasoning was intimately tive Bayesians use the “right-Haar prior density”; see involved in Robbins’ original formulation of empiri- Berger (1985a), Eaton (1989) and Robert (2001) for cal Bayes, and in significant implementations of the discussion and earlier references. paradigm, such as Morris (1983) for hierarchical mod- This dual frequentist–Bayesian interpretation of ma- els. More recently, the name empirical Bayes is often ny textbook estimation procedures has a number of used in association with approximate Bayesian analy- important implications, not the least of which is that ses which do not specifically involve frequentist mea- much of standard textbook statistical methodology sures. (Simply using a maximum likelihood estimate of (and standard software) can alternatively be presented a hyperparameter does not make a technique frequen- and described from the objective Bayesian perspective. tist.) For modern reviews of empirical Bayes analysis In particular, one can teach much of elementary statis- and previous references, see Carlin and Louis (2000) tics from this alternative perspective, without changing and Robert (2001). the procedures that are taught. In the gamma minimax approach, one again has a In more complicated situations, it is still usually class  of possible prior distributions and considers possible to achieve near-agreement between frequen- the frequentist Bayes risk (the expected loss over tist and Bayesian estimation procedures, although this both the data and unknown parameters) of the Bayes may require careful utilization of the tools of both. procedure for priors in the class. One then chooses A number of situations requiring such cross-utilization that prior which minimizes this frequentist Bayes risk. of tools are discussed in this section. 64 M. J. BAYARRI AND J. O. BERGER

3.1 Computation with Hierarchical, Multilevel or priors, and note that a “reasonable” objective prior Analysis may well depend on which parameter is the parame- terofinterest.) With the advent of Gibbs and other Markov • By simulation, obtain a (large) sample from the chain Monte Carlo (MCMC) methods of analysis (cf. posterior distribution of the parameter of interest: Robert and Casella, 1999), it has become relatively Option 1. If a predetermined confidence interval standard to deal with models that go under any of the C(X) is of interest, simply approximate the pos- names listed in the above title as Bayesian methods. terior probability of the interval by the fraction This popularity of the Bayesian methods is not nec- of the samples from the posterior distribution that essarily because of their intrinsic virtues, but rather fall in the interval. because the Bayesian computation is now much easier Option 2. If the confidence interval is not predeter- than computation via more classical routes. See Hobert mined, find the α/2 upper and lower fractiles of (2000) for an overview and other references. the posterior sample; the interval between these On the other hand, any MCMC method relies fun- fractiles approximates the 100(1 − α)% equal- damentally on frequentist reasoning to do the com- tailed posterior credible interval for the parameter putation. An MCMC method generates a sequence of of interest. (Alternative forms for the confidence simulated values θ 1, θ 2,...,θ m of an unknown quan- set can be considered, but the equal-tailed interval tity θ, and then relies upon a law of large num- is fine for most applications.) bers or ergodic theorem (both frequentist) to assert • Assert that the obtained interval is the frequen- ¯ = 1 m → that θ m m i=1 θ i θ. Furthermore, diagnostics tist confidence interval, having frequentist coverage for MCMC convergence are almost universally based given by the posterior probability of the interval. on frequentist tools. There is a purely Bayesian way of looking at such computation problems, which goes There is a large body of theory, discussed in Sec- under the heading “Bayesian numerical analysis” (cf. tion 3.4, as well as considerable practical experience, Diaconis, 1988a; O’Hagan, 1992), but in practice supporting the validity of constructing frequentist con- it is typically much simpler to utilize the frequen- fidence intervals in this way. Here is one example from the “practical experience” side. tist reasoning. In conclusion for much of modern statistical analysis EXAMPLE 3.1. Medical diagnosis (Mossman and in hierarchical models, we already see an inseparable Berger, 2001). Within a population for which p0 = joining of Bayesian and frequentist methodology. Pr( Disease D), a diagnostic test results in either a Positive (+)orNegative(−) reading. Let p = Pr(+| 3.2 Assessment of Accuracy of Estimation 1 patient hasD) and p2 = Pr(+|patient does not haveD). Frequentist methodology for of un- By Bayes’ theorem, known model parameters is relatively straightforward p p θ = Pr(D|+) = 0 1 . and successful. However, assessing the accuracy of the p0p1 + (1 − p0)p2 estimates is considerably more challenging and is a In practice, the pi are typically unknown, but for problem for which frequentists should draw heavily on i = 0, 1, 2 there are available (independent) data x Bayesian methodology. i having Binomial(ni,pi) densities. 
It is desired to find 3.2.1 Finding good confidence intervals in the pres- a 100(1 − α)% confidence set for θ that has good ence of nuisance parameters. Confidence intervals for conditional and frequentist properties. a model parameter are a common way of indicating A simple objective Bayesian approach to this prob- ∝ −1/2 · the accuracy of an estimate of the parameter. Find- lem is to utilize the Jeffreys priors π(pi) pi −1/2 ing good confidence intervals when there are nuisance (1 − pi) for each of the pi , and compute the parameters is very challenging within the frequen- 100(1 − α)% equal-tailed posterior credible interval tist paradigm, unless one utilizes objective Bayesian for θ. A suitable implementation of the algorithm pre- methodology, in which case the frequentist problem sented above is as follows: becomes relatively straightforward. Indeed, here is a • Draw random p from the Beta(x + 1 ,n − x + 1 ) rather general prescription for finding confidence in- i i 2 i i 2 posterior distributions, i = 0, 1, 2. tervals using objective Bayesian methods: • Compute the associated • Begin with a “reasonable” objective prior distrib- p p θ = 0 1 ution. (See Section 3.4 for discussion of objective p0 p1 + (1 − p0)p2 BAYESIAN AND FREQUENTIST ANALYSIS 65

for each random triplet. confidence intervals were the simplest to derive and • Repeat this process 10,000 times. will automatically be conditionally appropriate (see • The α/2and1− α/2 fractiles of these 10,000 Section 3.2.2), because of their Bayesian derivation. generated θ form the desired confidence interval. The finding in the above example, that objective [In other words, simply order the 10,000 values Bayesian analysis very easily provides small confi- of θ, and let the confidence interval be the interval − dence sets with excellent frequentist coverage, has between the (10,000 × α )th and (10,000 × 1 α )th 2 2 been repeatedly shown to happen. See Section 3.4 for values.] additional discussion. The proposed objective Bayesian procedure is clear- 3.2.2 Obtaining good conditional measures of accu- ly simple to use, but is the resulting confidence in- racy. Developing frequentist confidence intervals us- terval a satisfactory frequentist interval? To provide ing the Bayesian approach automatically provides an perspective on this question, note that the above prob- additional significant benefit: the confidence statement lem has also been studied in the frequentist literature, will be conditionally appropriate. Here is a simple arti- using standard log-odds and delta-method procedures ficial example. to develop confidence intervals, as well as more sophis- ticated approaches such as the Gart–Nam (Gart and EXAMPLE 3.2. Two observations, X1 and X2,are Nam, 1988) procedure. For a description of these clas- to be taken, where sical methods, as applied to this problem of medical θ + 1, with probability 1/2, X = diagnosis, see Mossman and Berger (2001). i θ − 1, with probability 1/2. Table 1 gives an indication of the frequentist perfor- Consider the confidence set for the unknown θ ∈ , mance of the confidence intervals developed by these four methods. It is based on a simulation that repeat- C(X1,X2) edly generates data from binomial distributions with 1 the point (X1 + X2) , if X1 = X2, sample sizes ni = 20 and the indicated values of the = 2 the point {X − 1}, if X = X . parameters (p0,p1,p2). For each generated triplet of 1 1 2 data in the simulation, the 95% confidence interval is The frequentist coverage of this confidence set can eas- computed using the objective Bayesian algorithm or ily be shown to be Pθ (C(X1,X2) contains θ)= 0.75. one of the three classical methods. It is then noted This is not at all a sensible report, once the data is at whether the computed interval contains the true θ,or hand. To see this, observe that, if x1 = x2,thenwe misses to the left or right. The entries in the table are know for sure that their average is equal to θ,sothat the long run proportion of misses to the left or right. the confidence set is then actually 100% accurate. On Ideally, these proportions should be 0.025 and, at the the other hand, if x1 = x2, we do not know if θ is the least, their sum should be 0.05. data’s common value plus 1 or their common value mi- Clearly the objective Bayes interval has quite good nus 1, and each of these possibilities is equally likely frequentist performance, better than any of the classi- to have occurred. cally derived confidence intervals. Furthermore, it can To obtain sensible frequentist answers here, one be seen that the objective Bayes intervals are, on av- must define the conditioning S =|X1 − X2|, erage, smaller than the classically derived intervals. 
which can be thought of as measuring the “strength of (See Mossman and Berger, 2001, for these and more evidence” in the data (S = 2 reflecting data with maxi- extensive computations.) Finally, the objective Bayes mal evidential content and S = 0 being data of minimal

TABLE 1 The probability that the nominal 95% interval misses the true θ on the left and on the right, for the indicated parameter values and when n0 = n1 = n2 = 20

(p0,p1,p2) O-Bayes Log odds Gart–Nam Delta 1 , 3 , 1 0.0286, 0.0271 0.0153, 0.0155 0.0277, 0.0257 0.0268, 0.0245 4 4 4 1 , 9 , 1 0.0223, 0.0247 0.0017, 0.0003 0.0158, 0.0214 0.0083, 0.0041 10 10 10 1 9 1 2 , 10 , 10 0.0281, 0.0240 0.0004, 0.0440 0.0240, 0.0212 0.0125, 0.0191 66 M. J. BAYARRI AND J. O. BERGER evidential content). Then one defines frequentist cover- the inferences that arise from the Bayesian approach age conditional on the strength of evidence S.Forthe to be considerably more realistic than those from example, an easy computation shows that this condi- competitors, such as various versions of maximum tional confidence equals likelihood estimation (or empirical Bayes estimation) or (often worse) unbiased estimation. Pθ C(X1,X2) contains θ|S = 2 = 1, One of the potentially severe problems with the | = = 1 maximum likelihood or empirical Bayes approach is Pθ C(X1,X2) contains θ S 0 2 , that maximum likelihood estimates of in for the two distinct cases, which are the intuitively hierarchical models (or component models) correct answers. can easily be zero, especially when there are numer- It is important to realize that conditional frequentist ous variances in the model that are being estimated. measures are fully frequentist and (to most people) (Unbiased estimation will be even worse in such sit- clearly better than unconditional frequentist measures. uations; if the mle is zero, the unbiased estimate will They have the same unconditional property (e.g., in the be negative.) above example one will report 100% confidence half EXAMPLE 3.3. Suppose, for i = 1,...,p,that the time, and 50% confidence half the time, resulting 2 Xi ∼ Normal(µi, 1) and µi ∼ Normal(0,τ ),allran- in an “average” of 75% confidence, as must be the dom variables being independent. Then, marginally, case for a frequentist measure), yet give much better 2 Xi ∼ Normal(0, 1 + τ ), so that the likelihood func- indications of the accuracy for the type of data that one tion of τ 2 can be written has actually encountered. 1 S2 In the above example, finding the appropriate con- (3.1) L(τ 2) ∝ exp − , ditioning statistic was easy but, in more involved sit- (1 + τ 2)p/2 2(1 + τ 2) uations, it can be a challenging undertaking. Luckily, 2 = 2 2 where S Xi . The mle for τ is easily calculated intervals developed via the Bayesian approach will au- 2 ˆ2 = { S − } 2 tomatically condition appropriately. For instance, in to be τ max 0, p 1 .Thus,ifS

quickly, clearly indicating that there is considerable analysis is a nice mathematical coincidence that can be uncertainty as to the true value of τ 2 even though the used to eliminate inferior frequentist procedures, but mle is 0. that Bayesian ideas should not form the basis for choice among acceptable frequentist procedures. Still, the Utilizing an mle of 0 as a variance estimate can be complete class theorems provide a powerful underlying quite dangerous, because it will typically affect the link between frequentist and Bayesian statistics. ensuing analysis in an incorrectly aggressive fashion. One of the most prominent frequentist principles for In the above example, for instance, setting τ 2 to0is choosing a statistical procedure is that of minimaxity; µ equivalent to stating that all the i are exactly equal to see Brown (1994, 2000) and Strawderman (2000) for each other. This is clearly unreasonable in light of the 2 reviews on the important impact of this concept on fact that there is actually great uncertainty about τ , statistics. Bayesian analysis again provides the most as reflected in Figure 5. Since the likelihood maximum useful tool for deriving minimax procedures: one is occurring at the boundary of the parameter space, it finds the “least favorable prior distribution,” and the is also very difficult to utilize likelihood or frequentist 2 minimax procedure is the resulting Bayes rule. methods to attempt to incorporate uncertainty about τ To many Bayesians, the most compelling foundation into the analysis. of statistics is that based on exchangeability, as de- None of these difficulties arises in the Bayesian veloped in de Finetti (1970). From the assumption of approach, and the vague nature of the information in exchangeability of an infinite sequence, X1,X2,..., the data about such variances will be clearly reflected of observations (essentially the assumption that the in the posterior distribution. For instance, if one were to distribution of the sequence remains the same under 2 = use the constant prior density π(τ ) 1intheabove permutation of the coordinates), one can sometimes de- example, the posterior density would be proportional duce the existence of a particular statistical model, with to the likelihood in Figure 5, and the significant unknown parameters, and a prior distribution on the 2 uncertainty about τ would permeate the analysis. parameters. By considering an infinite series of obser- 3.3 Foundations, Minimaxity and Exchangeability vations, frequentist reasoning—or at least frequentist mathematics—is clearly involved. Reviews of more re- There are numerous ties between frequentist and cent developments and other references can be found in Bayesian analysis at the foundational level. The foun- Diaconis (1988b) and Lad (1996). dation of frequentist statistics typically focuses on the There are many other foundational arguments that class of “optimal” procedures in a given situation, begin with axioms of rational behavior and lead to the called a complete class of procedures. Through the conclusion that some type of Bayesian behavior is im- work of Wald (1950) and others, it has long been plied. (See Bernardo and Smith, 1994, for review and known that a complete class of procedures is identi- references.) Many of these effectively involve simul- cal to the class of Bayes procedures or certain limits taneous frequentist–Bayesian evaluations of outcomes, thereof. 
Furthermore, in proving frequentist optimal- such as Rubin (1987), which is perhaps the weakest set ity of a procedure, it is typically necessary to employ of axioms that implies Bayesian behavior. Bayesian tools. (See Berger, 1985a; Robert, 2001, for many examples and references.) Hence, at a fundamen- 3.4 Use of Frequentist Methodology in tal level, the frequentist paradigm is intertwined with Prior Development the Bayesian paradigm. In principle, a subjective Bayesian need not worry Interestingly, this fundamental duality has not had a about frequentist ideas—if a prior distribution is elici- pronounced effect on the Bayesian versus frequentist ted and accurately reflects prior beliefs, then Bayes’ debate. In part, this is because many frequentists theorem guarantees that any resulting inference will find the search for optimal frequentist procedures to be optimal. The hitch is that it is not very common to be of limited practical utility (since such searches have a prior distribution that accurately reflects all prior usually take place in rather limited settings, from the beliefs. Suppose, for instance, that the only unknown perspective of practice), and hence do not themselves model parameter is a normal mean θ. Complete assess- pursue optimality and thereby come into contact with ment of the prior distribution for θ involves an infinite the Bayesian equivalence. Even among frequentists number of judgments [e.g., specification of the prob- who are significantly concerned with optimality, it is ability of the interval (−∞,r) for any rational num- typically perceived that the relationship with Bayesian ber r]. In practice, of course, only a few assessments 68 M. J. BAYARRI AND J. O. BERGER are ever made, with the others being made conven- EXAMPLE 3.4. In numerous models in use today, tionally (e.g., one might specify the first quartile and the number of parameters increases with the amount the , but then choose a Cauchy density for the of data. The classic example of this is the Neyman– prior). Clearly one should worry about the effect of fea- Scott problem (Neyman and Scott, 1948), in which tures of the prior that were not elicited. one observes Even more common in practice is to utilize a default 2 or objective prior distribution, and Bayes’s theorem Xij ∼ N (µi,σ ), i = 1,...,n,j = 1, 2, does not then provide any guarantee as to performance. 2 ¯ = and is interested in estimating σ . Defining xi It has proved to be very useful to evaluate partially + ¯ = ¯ ¯ 2 = n − 2 (xi1 xi2)/2, x (x1,...,xn), S i=1(xi1 xi2) elicited and objective priors by utilizing frequentist = techniques to evaluate their properties in repeated use. and µ (µ1,...,µn), the likelihood function can be written 3.4.1 Information-based developments. A number 1 1 S2 of developments of prior distributions utilize informa- L(µ,σ)∝ exp − |x¯ − µ|2 + . tion-based arguments that rely on frequentist measures. σ 2n σ 2 4 Consider the reference prior theory, for instance, ini- Until relatively recently, the most commonly used tiated in Bernardo (1979) and refined in Berger and objective prior was the Jeffreys-rule prior (Jeffreys, Bernardo (1992). 
The reference prior is defined to 1961), here given by π J (µ,σ) = 1/σ n+1.There- be that distribution which minimizes the asymptotic sulting posterior distribution for σ is proportional to Kullback–Leibler between the posterior the likelihood times the prior, which, after integrating distribution and the prior distribution, thus hopefully out µ,is obtaining a prior that “minimizes information” in an appropriate sense. This divergence is calculated with 1 S2 π(σ|x) ∝ exp − . respect to a joint frequentist–Bayesian computation σ 2n+1 4σ 2 since, as in design, it is being computed before any data has been obtained. One common Bayesian estimate of σ 2 is the poste- The reference prior approach has arguably been rior mean, which here is S2/[4(n − 1)]. This estimate the most generally successful method of obtaining is inconsistent, as can be seen by applying simple fre- Bayes rules that have excellent frequentist performance quentist reasoning to the situation. Indeed, note that 2 2 (see Berger, Philippe and Robert, 1998, as but one (Xi1 − Xi2) /(2σ ) is a chi-squared random variable example). There are, furthermore, many other features with one degree of freedom, and hence that S2/(2σ 2) of reference priors that are influenced by frequentist is chi-squared with n degrees of freedom. It follows by matters. One such feature is that the reference prior the law of large numbers that S2/(2n) → σ 2,sothat typically depends not only on the model, but also on the Bayes estimate converges to σ 2/2, the wrong value. which parameter is the inferential focus. Without such (Any other natural Bayesian estimate, such as the pos- dependence on the “parameter of interest,” optimal terior median or posterior , can also be seen to frequentist performance is typically not attainable by be inconsistent.) Bayesian methods. A number of other information-based priors have The problem in the above example is that the also been derived. See Soofi (2000) for an overview Jeffreys-rule prior is often inappropriate in multidi- and references. mensional settings, yet it can be difficult or impossible 3.4.2 Consistency. Perhaps the simplest frequentist to assess this problem within the Bayesian paradigm estimation tool that a Bayesian can usefully employ is itself. Indeed, the inadequacy of the multidimensional consistency: as the sample size grows to ∞, does the Jeffreys-rule prior has led to a search for improved ob- estimate being studied converge to the true value (in jective priors in multivariable settings. The reference a suitable sense of convergence). Bayes estimates are prior approach, mentioned earlier, has been one suc- virtually always consistent if the parameter space is cessful solution. [For the Neyman–Scott problem, the finite-dimensional (see Schervish, 1995, for a typical reference prior is π R(µ,σ) = 1/σ , which results in result and earlier references), but this need not be a consistent posterior mean and, indeed, yields infer- true if the parameter space is not finite-dimensional ences that are numerically equal to the classical in- or in irregular cases (see Ghosh, Ghosal and Samanta, ferences for σ 2.] Another approach to developing im- 1994). Here is an example of the former. proved priors is discussed in Section 3.4.3. BAYESIAN AND FREQUENTIST ANALYSIS 69

3.4.3 Frequentist performance: coverage and ad- this topic was initiated in Stein (1956), which effec- missibility. Consistency is a rather crude frequentist tively showed that the usual constant prior for a multi- criterion, and more sophisticated frequentist evalua- variate normal mean would result in an inadmissible tions of performance of Bayesian procedures are of- estimator under quadratic loss (in three or more di- ten considered. For instance, one of the most common mensions). One of the first Bayesian works to ad- approaches to evaluation of an objective prior distribu- dress this issue was Hill (1974). To access the huge tion is to see if it yields posterior credible sets that have resulting literature on the role of admissibility in good frequentist coverage properties. We have already choice of hierarchical priors, see Brown (1971), Berger seen examples of this method of evaluation in Exam- and Robert (1990), Berger and Strawderman (1996), ples 2.2 and 3.1. Robert (2001) and Tang (2001). Evaluation by frequentist coverage has actually been Here is an example where initial admissibility con- given a formal theoretical definition and is called the siderations led to significant Bayesian developments. frequentist-matching approach to developing objective EXAMPLE 3.5. Consider estimation of a covari- priors. The idea is to look at one-sided Bayesian cred- ance matrix , based on i.i.d. multivariate normal ible sets for the unknown quantity of interest, and data (x1,...,xn), where each column vector xi arises then seek that prior distribution for which the credible from the Nk(0, ) density. The sufficient statistic for sets have optimal frequentist coverage asymptotically. = n   is S i=1 xixi . Since Stein (1975), it has been Welch and Peers (1963) developed the first extensive understood that the commonly used estimates of , results in this direction, essentially showing that, for which are various multiples of S (depending on the loss one-dimensional continuous parameters, the Jeffreys function considered) are seriously inadmissible. Hence prior is frequentist-matching. There is an extensive lit- there has been a great effort in the frequentist literature erature devoted to finding frequentist-matching priors (see Yang and Berger, 1994, for references) to develop in multivariate contexts; see Efron (1993), Rousseau better estimators of . (2000), Ghosh and Kim (2001), Datta, Mukerjee, The interest in this from the Bayesian perspective is Ghosh and Sweeting (2000) and Fraser, Reid, Wong that by far the most commonly used subjective prior and Yi (2003) for some recent results and earlier for a covariance matrix is the inverse Wishart prior (for references. subjectively specified a and b) Other frequentist properties have also been used to − − (3.2) π() ∝|| a/2 exp − 1 tr[b 1] . help in the choice of an objective prior. For instance, if 2 estimation is the goal, it has long been common to uti- A frequently used objective version of this prior is the lize the frequentist concept of admissibility to help in Jeffreys-rule prior given by choosing a = k + 1and the selection of the prior. The idea behind admissibility b = 0. 
When one notes that the Bayesian estimates is to define a in estimation (e.g., squared arising from these priors are linear functions of S, error loss), and then see if a proposed estimator can be which were deemed to be seriously inadequate by the beaten in terms of frequentist expected loss (e.g., mean frequentists, there is clear cause for concern in the squared error). If so, the estimator is said to be inad- routine use of these priors. missible; if it cannot be beaten, it is admissible.For In this case, it is possible to also indicate the problem instance, in situations having what is known as a group with these priors utilizing Bayesian reasoning. Indeed, = t invariance structure, it has long been known that the write  H DH,whereH is an orthogonal matrix prior distribution defined by the right-Haar measure and D is a diagonal matrix with diagonal entries being ··· will typically yield Bayes estimates that are admis- the eigenvalues of the matrix, d1 >d2 > >dk. A change of variables yields sible from a frequentist perspective, while the seem- ingly more natural (to a Bayesian) left-Haar measure =| |−a/2 − 1 [ −1] π()d D exp 2 tr bD will typically fail to yield admissible estimators. Thus · − · [ ··· ] use of the right-Haar priors has become standard. See (di dj ) I d1> >dk dD dH, Berger (1985a) and Robert (2001) for general discus- i···>dk] denotes the indicator function on − Another situation in which admissibility has played the given set. Since i

estimate their sum); in classical language, σ 2 and τ 2 indeed, Bayesian and frequentist asymptotic answers are not identifiable. Were a Bayesian to attempt to are often (but not always) the same; see Schervish utilize an improper objective prior here, such as π(σ2, (1995) for an introduction to Bayesian asymptotics and τ 2) = 1, the posterior distribution would be improper. Le Cam (1986) for a high-level discussion. One might The point here is that frequentist insight and litera- conclude that this is thus another significant poten- ture about identifiability can be useful to a Bayesian in tial use of frequentist methodology by Bayesians. It is determining whether there is a problem with posterior rather rare for Bayesians to directly use asymptotic an- propriety. Thus, in the above example, upon recogniz- swers, however, since Bayesians can typically directly ing the identifiability problem, the Bayesian will know compute exact small sample size answers, often with not to use the improper objective prior and will attempt less effort than derivation of the asymptotic approxi- to elicit a true subjective proper prior for at least one of mation would require. σ 2 or τ 2. (Of course, more data, such as replications at Still, asymptotic techniques are useful to Bayesians, the first stage of the model, could also be sought.) in a variety of approximations and theoretical develop- ments. For instance, the popular Laplace approxima- 3.5 Frequentist Simplifications and tion (cf. Schervish, 1995) and BIC (cf. Schwarz, 1978) Asymptotic Approximations are based on an asymptotic arguments. Important Situations can occur in which straightforward use of Bayesian methodological developments, such as the frequentist intuition directly yields sensible answers. In definition of reference priors, also make considerable the Neyman–Scott problem, for instance, consideration use of asymptotic theory, as was mentioned earlier. of the paired differences, xi1 − xi2, directly yielded a sensible answer. In contrast, a fairly sophisticated 4. TESTING, MODEL SELECTION AND objective Bayesian analysis (use of the reference prior) MODEL CHECKING was required for a satisfactory answer. Unlike estimation, frequentist reports and conclu- This is not to say that classical methodology is sions in testing (and model selection) are often in con- universally better in such situations. Indeed, Neyman flict with their Bayesian counterparts. For a long time and Scott created this example primarily to show that it was believed that this was unavoidable—that the two use of maximum likelihood methodology can be very paradigms are essentially irreconcilable for testing. inadequate; it essentially leads to the same “bad” answer in the example as the Bayesian analysis based Berger, Brown and Wolpert (1994) showed, however, on the Jeffreys-rule prior. This points out the dilemma that this is not necessarily the case; that the main diffi- facing Bayesians in use of frequentist simplifications: culty with frequentist testing was an inappropriate lack a frequentist answer might be “simple,” but a Bayesian of conditioning which could, in a variety of situations, might well feel uneasy in its utilization unless it were be fixed. This is the focus of the next section, after felt to approximate a Bayesian answer. (For instance, which we turn to more general issues involving the is the answer conditionally sound, as discussed in of frequentist and Bayesian methodology in Section 3.2.2.) 
Of course, if only the frequentist answer testing and model selection. is available, the issue is moot. 4.1 Conditional Frequentist Testing It would be highly useful to catalogue situations in which direct frequentist reasoning is arguably simpler Unconditional Neyman–Pearson testing, in which than Bayesian methodology, but we do not attempt one reports the same error probability regardless of the to do so. Discussion of this and examples can be size of the test statistic (as long as it is in the rejection found in Robins and Ritov (1997) and Robins and region), has long been viewed as problematical by most Wasserman (2000). statisticians. To Fisher, this was the main inadequacy Outside of standard models (such as the normal lin- of Neyman–Pearson testing, and one of the chief mo- ear model), it is unfortunately rather rare to be able to tivations for his championing p-values in testing and obtain exact frequentist answers for small or moderate model checking. Unfortunately (as Neyman would ob- sample sizes. Hence much of frequentist methodology serve), p-values do not have a frequentist justification relies on asymptotic approximations, based on assum- in the sense, say, of the frequentist principle in Sec- ing that the sample size is large. tion 2.2. For more extensive discussion of the perceived Asymptotics can also be used to provide an approx- inadequacies of these two approaches to testing, see imation to Bayesian answers for large sample sizes; Berger (2003). 72 M. J. BAYARRI AND J. O. BERGER

The “solution” proposed in Berger, Brown and Wolpert (1994) for testing, following earlier developments in Kiefer (1977), was to use the Neyman–Pearson approach of formally defining frequentist error probabilities of Type I and Type II, but to do so conditional on the observed value of a statistic measuring the “strength of evidence in the data,” as was done in Example 3.2. (Other proposed solutions to this problem have been considered in, e.g., Hwang et al., 1992.)

For illustration, suppose that we wish to test that the data X arises from the simple (i.e., completely specified) hypotheses H_0: f = f_0 or H_1: f = f_1. The idea is to select a statistic S = S(X) which measures the “strength of the evidence” in X, for or against the hypotheses. Then, conditional error probabilities (CEPs) are computed as

(4.1)  α(s) = P(Type I error | S = s) ≡ P_0(reject H_0 | S(X) = s),
       β(s) = P(Type II error | S = s) ≡ P_1(accept H_0 | S(X) = s),

where P_0 and P_1 refer to probability under H_0 and H_1, respectively.

The proposed conditioning statistic S and associated test utilize p-values to measure the strength of the evidence in the data. Specifically (see Wolpert, 1996; Sellke, Bayarri and Berger, 2001), we consider

S = max{p_0, p_1},

where p_0 is the p-value when testing H_0 versus H_1, and p_1 is the p-value when testing H_1 versus H_0. [Note that the use of p-values in determining evidentiary equivalence is much weaker than their use as an absolute measure of significance; in particular, use of ψ(p_i), where ψ is any strictly increasing function, would determine the same conditioning.] The corresponding conditional frequentist test is then as follows:

(4.2)  if p_0 ≤ p_1, reject H_0 and report Type I CEP α(s);
       if p_0 > p_1, accept H_0 and report Type II CEP β(s);

where the CEPs are given in (4.1).

To this point, there has been no connection with Bayesianism. Conditioning, as above, is completely allowed (and encouraged) within the frequentist paradigm. The Bayesian connection arises because Berger, Brown and Wolpert (1994) show that

(4.3)  α(s) = B(x)/(1 + B(x))  and  β(s) = 1/(1 + B(x)),

where B(x) is the likelihood ratio (or Bayes factor), and these expressions are precisely the Bayesian posterior probabilities of H_0 and H_1, respectively, assuming the hypotheses have equal prior probabilities of 1/2. Therefore, a conditional frequentist can simply compute the objective Bayesian posterior probabilities of the hypotheses, and declare that they are the conditional frequentist error probabilities; there is no need to formally derive the conditioning statistic or perform the conditional frequentist computations. (There are some technical details concerning the definition of the rejection region, but these have almost no practical impact; see Berger, 2003, for further discussion.)

The value of having a merging of the frequentist and objective Bayesian answers in testing goes well beyond the technical convenience of computation; statistics as a whole is the big winner because of the unification that results. But we are trying to avoid philosophical issues here and so will simply focus on the methodological advantages that will accrue to frequentism.

Dass and Berger (2003) and Paulo (2002) extend this result to many classical testing scenarios; here is an example from the former.

EXAMPLE 4.1. McDonald, Vance and Gibbons (1995) studied car emission data X = (X_1,...,X_n), testing whether the i.i.d. X_i follow the Weibull or Lognormal distribution, given, respectively, by

H_0: f_W(x; β, γ) = (γ/β)(x/β)^{γ−1} exp{−(x/β)^γ},
H_1: f_L(x; µ, σ²) = (1/(x√(2πσ²))) exp{−(ln x − µ)²/(2σ²)}.

There are several difficulties with classical analysis of this situation. First, there are no low-dimensional sufficient statistics, and no obvious test statistics; indeed, McDonald, Vance and Gibbons (1995) simply consider a variety of generic tests, such as the likelihood ratio test (MLR), which they eventually recommended as being the most powerful. Second, it is not clear which hypothesis to make the null hypothesis, and the classical conclusion can depend on this choice (although not significantly, in the sense of the choice allowing differing conclusions with low error probabilities).

Finally, computation of unconditional error probabilities requires a rather expensive simulation and, once computed, one is stuck with error probabilities that do not vary with the data.

For comparison, the conditional frequentist test when n = 16 (one of the cases considered by McDonald, Vance and Gibbons, 1995) results in the following test:

T^C = { if B(x) ≤ 0.94,  reject H_0 and report Type I CEP α(x) = B(x)/(1 + B(x));
        if B(x) > 0.94,  accept H_0 and report Type II CEP β(x) = 1/(1 + B(x)) },

where

(4.4)  B(x) = [2 Γ(n) n^{−n/2} π^{(n−1)/2} / Γ((n−1)/2)] ∫_0^∞ y^{n−2} [ (1/n) Σ_{i=1}^n exp((z_i − z̄) y / s_z) ]^{−n} dy,

with z_i = ln x_i, z̄ = (1/n) Σ_{i=1}^n z_i and s_z² = (1/n) Σ_{i=1}^n (z_i − z̄)². In comparison with the situation for the unconditional frequentist test:

• There is a well-defined test statistic B(x).
• If one switches the null hypothesis, the new Bayes factor is simply B(x)^{−1}, which will clearly lead to the same CEPs (i.e., the CEPs do not depend on which hypothesis is called the null hypothesis).
• Computation of the CEPs is almost trivial, requiring only a one-dimensional integration.
• Above all, the CEPs vary continuously with the data.

In elaboration of the last point, consider one of the testing situations considered by McDonald, Vance and Gibbons (1995), namely, testing for the distribution of carbon monoxide emission data, based on a sample of size n = 16. Data was collected at the four different mileage levels indicated in Table 2, with (b) and (a) indicating “before” or “after” scheduled vehicle maintenance. Note that the decisions for both the MLR and the conditional test would be to accept the lognormal model for the data.

TABLE 2
For CO data, the MLR test at level α = 0.05, and the conditional test of H_0: Lognormal versus H_1: Weibull

Mileage         0       4,000    24,000 (b)   24,000 (a)
MLR decision    A       A        A            A
B(x)            2.436   9.009    6.211        2.439
T^C decision    A       A        A            A
CEP             0.288   0.099    0.139        0.291

McDonald, Vance and Gibbons (1995) did not give the Type II error probability associated with acceptance (perhaps because it would depend on the unknown parameters for many of the test statistics they considered) but, even if Type II error had been provided, note that it would be constant. In contrast, the conditional test has CEPs (here, the conditional Type II errors) that vary fully with the data, usefully indicating the differing certainties in the acceptance decision for the considered mileages. Further analyses and comparisons can be found in Dass and Berger (2003).

Derivation of the conditional frequentist test. The conditional frequentist analysis here depends on recognizing an important fact: both the Weibull and the lognormal distributions are location–scale distributions (the Weibull after suitable transformation). In this case, the objective Bayesian (and, hence, conditional frequentist) solution to the problem is to utilize the right-Haar prior density for the distributions [π(µ, σ) = σ^{−1} for the lognormal problem] and compute the resulting Bayes factor. (Unconditional frequentists could, of course, have recognized the invariance of the situation and used this as a test statistic, but they would have faced a much more difficult computational challenge.)

By invariance, the distribution of B(X) under either hypothesis does not depend on model parameters, so that the original testing problem can be reduced to testing two simple hypotheses, namely H_0: “B(X) has distribution F*_W” versus H_1: “B(X) has distribution F*_L,” where F*_W and F*_L are the distribution functions of B(X) under the Weibull and Lognormal distributions, respectively, with an arbitrary choice of the parameters (e.g., β = γ = 1 for the Weibull, and µ = 0, σ = 1 for the Lognormal). Recall that the CEPs happen to equal the objective Bayesian posterior probabilities of the hypotheses.
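To illustrate how mechanical the conditional report is once the Bayes factor is in hand, here is a short sketch of our own that simply applies (4.3) and the acceptance threshold B(x) > 0.94 quoted above to the Bayes factors reported in Table 2 (only the mileage labels and B values are taken from the table; everything else is illustrative):

    def ceps(B):
        """Conditional error probabilities implied by (4.3) for a Bayes factor B = B(x)."""
        return B / (1.0 + B), 1.0 / (1.0 + B)      # (alpha(s), beta(s))

    table2 = [("0", 2.436), ("4,000", 9.009), ("24,000 (b)", 6.211), ("24,000 (a)", 2.439)]
    for mileage, B in table2:
        alpha, beta = ceps(B)
        if B > 0.94:                               # acceptance region of the test T^C
            print(f"mileage {mileage:>10}: accept H0, Type II CEP = {beta:.3f}")
        else:
            print(f"mileage {mileage:>10}: reject H0, Type I CEP = {alpha:.3f}")

    # Switching which hypothesis is called the null replaces B by 1/B, which just
    # swaps alpha(s) and beta(s), so the reported CEPs are unchanged.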

4.2 Model Selection

Clyde and George (2004) give an excellent review of Bayesian model selection. Frequentists have not typically used Bayesian arguments in model selection, although that may be changing, in part due to the pronounced practical success that Bayesian “model averaging” has achieved. Bayesians often use frequentist arguments to develop approximate model selection methods (such as BIC), to evaluate performance of model selection methods and to develop default priors for model selection. There is a huge list of articles of this type, including many listed in Clyde and George (2004). Robert (2001), Berger and Pericchi (2001, 2004) and Berger, Ghosh and Mukhopadhyay (2003) also have general discussions and numerous other recent references. The story here is far from settled, in that there is no agreement on the immediate horizon as to even a reasonable method of model selection. It seems highly likely, however, that any such agreement will be based on a mixture of frequentist and Bayesian arguments.

4.3 p-Values for Model Checking

Both classical statisticians and Bayesians routinely use p-values for model checking. We first consider their use by classical statisticians and show the value of Bayesian methodology in the computation of “proper” p-values; then we turn to Bayesian p-values and the importance of frequentist ideas in their evaluation.

4.3.1 Use of Bayesian methodology in computing classical p-values. Suppose that a statistical model H_0: X ∼ f(x|θ) is being entertained, data x_obs is observed and it is desired to check whether the model is adequate, in light of the data. Classical statisticians have long used p-values for this purpose. A strict frequentist would not do so (since p-values do not satisfy the frequentist principle), but most frequentists relax their morals a bit in this situation (i.e., when there is no alternative hypothesis which would allow construction of a Neyman–Pearson test).

The common approach to model checking is to choose a statistic T = t(X), where (without loss of generality) large values of T indicate less compatibility with the model. The p-value is then defined as

(4.5)  p = Pr(t(X) ≥ t(x_obs) | θ).

When θ is known, this probability computation is with respect to f(x|θ). The crucial question is what to do when θ is unknown.

For future reference, we note a key property of the p-value when θ is known: considered as a random function of X in the continuous case, p(X) has a Uniform(0, 1) distribution under H_0. The implication of this property is that p-values then have a common interpretation across statistical problems, making their general use feasible. Indeed, this property, or its asymptotic version for composite H_0, has been used to characterize “proper,” well-behaved p-values (see Robins, van der Vaart and Ventura, 2000).

When θ is unknown, computation of a p-value requires some way of “eliminating” θ from (4.5). There are many non-Bayesian ways of doing so, some of which are reviewed in Bayarri and Berger (2000) and Robins, van der Vaart and Ventura (2000). Here we consider only the most common method, which is to replace θ in (4.5) by its mle, θ̂. The resulting p-value will be called the plug-in p-value (p_plug). Henceforth using a superscript to denote the density with respect to which the p-value in (4.5) is computed, the plug-in p-value is thus defined as

(4.6)  p_plug = Pr^{f(·|θ̂)}(t(X) ≥ t(x_obs)).

Although very simple to use, there is a worrisome “double use” of the data in p_plug, first to estimate θ and then to compute the tail area corresponding to t(x_obs) in that distribution.

Bayarri and Berger (2000) proposed the following alternative way of eliminating θ, based on Bayesian methodology. Begin with an objective prior density π(θ); we recommend, as before, that this be a reference prior (when available), but even a constant prior density will usually work fine. Next, define the partial posterior density (Bayesian motivation will be given in the next section)

(4.7)  π(θ | x_obs \ t_obs) ∝ f(x_obs | t_obs, θ) π(θ) ∝ f(x_obs | θ) π(θ) / f(t_obs | θ),

resulting in the partial posterior predictive density of T,

(4.8)  m(t | x_obs \ t_obs) = ∫ f(t | θ) π(θ | x_obs \ t_obs) dθ.

Since this density is free of θ, it can be used in (4.5) to compute the partial posterior predictive p-value,

(4.9)  p_ppp = Pr^{m(·|x_obs\t_obs)}(T ≥ t_obs).

Note that p_ppp uses only the information in x_obs that is not in t_obs = t(x_obs) to “train” the prior and to then eliminate θ. Intuitively this avoids double use of the data because the contribution of t_obs to the posterior is “removed” before θ is eliminated by integration. (The notation x_obs \ t_obs was chosen to indicate this.)

When T is not ancillary (or approximately so), the double use of the data can cause the plug-in p-value to fail dramatically, in the sense of being moderately large even when the null model is clearly wrong. Examples and earlier references are given in Bayarri and Berger (2000).
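As a toy numerical illustration of (4.5)–(4.9), consider the following self-contained sketch (our construction, not the paper's): an i.i.d. Exponential(θ) sample is checked with the casual statistic T = max X_i, the plug-in p-value (4.6) is available in closed form, and the partial posterior (4.7) under the objective prior π(θ) = 1/θ is handled on a grid rather than by MCMC.

    import numpy as np

    rng = np.random.default_rng(0)

    # Null model: X_1,...,X_n i.i.d. Exponential with rate theta; statistic T = max X_i.
    x = rng.exponential(scale=1.0, size=20)
    x[0] = 8.0                                   # plant one outlying observation
    n, t_obs, sum_x = len(x), x.max(), x.sum()

    # Plug-in p-value (4.6): replace theta by its mle n / sum(x).
    theta_hat = n / sum_x
    p_plug = 1.0 - (1.0 - np.exp(-theta_hat * t_obs)) ** n

    # Partial posterior (4.7) on a grid, with pi(theta) = 1/theta:
    #   pi(theta | x \ t)  proportional to  f(x|theta) pi(theta) / f(t_obs|theta).
    theta = np.linspace(1e-3, 5.0, 4000)
    log_f_x = n * np.log(theta) - theta * sum_x
    log_f_t = (np.log(n) + np.log(theta) - theta * t_obs
               + (n - 1) * np.log1p(-np.exp(-theta * t_obs)))
    log_w = log_f_x - np.log(theta) - log_f_t
    w = np.exp(log_w - log_w.max())
    w /= w.sum()

    # Partial posterior predictive p-value (4.8)-(4.9) by simulation: draw theta
    # from the partial posterior, then a replicate T = max of n new observations.
    draws = rng.choice(theta, size=20000, p=w)
    t_rep = rng.exponential(scale=1.0 / draws[:, None], size=(20000, n)).max(axis=1)
    p_ppp = (t_rep >= t_obs).mean()

    print(f"p_plug = {p_plug:.3f}, p_ppp = {p_ppp:.3f}")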

Here is another interesting illustration, taken from Bayarri and Castellanos (2004).

EXAMPLE 4.2. Consider the hierarchical (or random effects) model

(4.10)  X_ij | µ_i ∼ N(µ_i, σ_i²)  for i = 1,...,I, j = 1,...,n_i,
        µ_i | ν, τ ∼ N(ν, τ²)  for i = 1,...,I,

where all variables are independent and, for simplicity, we assume that the variances σ_i² at the first level are known. Suppose that we are primarily interested in investigating whether the normality assumption for the means µ_i is compatible with the data. We choose the test statistic T = max{X̄_1,...,X̄_I}, where the X̄_i denote the usual group sample means, which here are sufficient statistics for the µ_i. When no specific alternative hypothesis is postulated, “optimal” test statistics do not exist, and casual choices (such as this) are not uncommon. The issue under study is whether such (easy) choices of the test statistic can usefully be used for computing p-values.

Since the µ_i are random effects, it is well known that tests should be based on the marginal densities of the sufficient statistics X̄_i, with the µ_i integrated out. (Replacing the µ_i by their mle's would result in a vacuous inference here.) The resulting null distribution can be represented

(4.11)  X̄_i | ν, τ ∼ N(ν, σ_i²/n_i + τ²)  for i = 1,...,I.

Thus p_plug is computed with respect to this distribution, with the mle's ν̂, τ̂² [numerically computed from (4.11)] inserted back into (4.11) and (4.10).

To compute the partial posterior predictive p-value, we begin with a common objective prior for (ν, τ²), namely π(ν, τ²) = 1/τ (not 1/τ² as is sometimes done, which would result in an improper posterior). The computation of p_ppp is based on MCMC, as discussed in Bayarri and Castellanos (2004).

Both p-values were computed for a simulated data set, in which one of the groups comes from a distribution with a much larger mean than the other groups. In particular, the data was generated from

(4.12)  X_ij | µ_i ∼ N(µ_i, 4)  for i = 1,...,5, j = 1,...,8,
        µ_i ∼ N(1, 1)  for i = 1,...,4,
        µ_5 ∼ N(5, 1).

The resulting sample means were 1.560, 0.641, 1.982, 0.014 and 6.964. Note that the sample mean of the fifth group is 6.65 standard deviations away from the mean of the other four groups. With this data, p_plug = 0.130, dramatically failing to clearly indicate that the assumption of i.i.d. normality of the µ_i is wrong, while p_ppp = 0.010. Many other similar examples can be found in Castellanos (2002).

A strong indication that a proposed p-value is inappropriate is when it fails to be asymptotically Uniform(0, 1) under the null hypothesis for all values of θ, as a proper p-value should. Robins, van der Vaart and Ventura (2000) prove that the plug-in p-value often fails to be asymptotically proper in this sense, while the partial posterior predictive p-value is asymptotically proper. Furthermore, they show that the latter p-value is uniformly most powerful with respect to Pitman alternatives, lending additional powerful frequentist support to the methodology. Note that this is completely a frequentist evaluation; no Bayesian averaging is involved. Numerical comparison in the above example of the behavior of p_plug and p_ppp under the assumed (null) model is deferred to Section 4.3.2.

4.3.2 Evaluating Bayesian p-values. Most Bayesian p-values are defined analogously to (4.8) and (4.9), but with alternatives to π(θ | x_obs \ t_obs). Subjective Bayesians simply use the prior distribution π(θ) directly to integrate out θ. The resulting p-value, called the predictive p-value and popularized by Box (1980), has the property, when considered as a random variable of X, of being uniformly distributed under the null predictive distribution. (See Meng, 1994, for discussion of the importance of this property.) This p-value is thus Uniform[0, 1] in an average sense over θ, which is presumably satisfactory (for consistency of interpretation across problems) to Bayesians who believe in their subjective prior.

Much of model checking, however, takes place in scenarios in which the model is quite tentative and, hence, for which serious subjective prior elicitation (which is invariably highly costly) is not feasible. Hence model checking is more typically done by Bayesians using objective prior distributions; guarantees concerning “average uniformity” of p-values then no longer apply and, indeed, the resulting p-values are not even defined if the objective prior distribution is improper (since the predictive distribution of T will then be improper). The “solution” that has become quite popular is to utilize objective priors, but to use the posterior distribution π(θ | x_obs), instead of the prior, in defining the distribution used to compute a p-value.

Formally, this leads to the posterior predictive p-value, defined in Guttman (1967) and popularized in Rubin (1984) and Gelman, Carlin, Stern and Rubin (1995), given by

(4.13)  p_post = Pr^{m(·|x_obs)}(T ≥ t_obs),  where  m(t | x_obs) = ∫ f(t | θ) π(θ | x_obs) dθ.

Note that there is also “double use” of the data in p_post, first to convert the (possibly improper) prior into a proper distribution for determining the predictive distribution, and then for computing the tail area corresponding to t(x_obs) in that distribution. The detrimental effect of this double use of the data was discussed in Bayarri and Berger (2000) and arises again in Example 4.2. Indeed, computation yields that the posterior predictive p-value for the given data is 0.409, which does not at all suggest a problem with the random effects normality assumption; yet, recall that one of the means was more than six standard deviations away from the others.

The point of this section is to observe that the frequentist property of “uniformity of a p-value under the null hypothesis” provides a useful discriminatory tool for judging the adequacy of Bayesian (and other) p-values. Note that if a p-value is uniform under the null hypothesis in the frequentist sense for any θ, then it has the strong Bayesian property of being marginally Uniform[0, 1] under any proper prior distribution. More important, if a proposed p-value is always either conservative or anticonservative in a frequentist sense (see Robins, van der Vaart and Ventura, 2000, for definitions), then it is likewise guaranteed to be conservative or anticonservative in a Bayesian sense, no matter what the prior. If the conservatism (or anticonservatism) is severe, then the p-value cannot correspond to any approximate true Bayesian p-value.

In this regard, Robins, van der Vaart and Ventura (2000) show that p_post is often severely conservative (and, surprisingly, is worse in this regard than is p_plug), while p_ppp is asymptotically Uniform[0, 1].

It is also of interest to numerically study the nonasymptotic null distribution of the three p-values considered in this section. We thus return to the random effects example.

EXAMPLE 4.3. Consider again the situation described in Example 4.2. We consider p_plug(X), p_ppp(X) and p_post(X) as random variables and simulate their distribution under an instance of the null model. Specifically, we chose the random effects mean and variance to be 0 and 1, respectively, thus simulating X_ij as in (4.12), but now generating all five µ_i from the N(0, 1) distribution (so that the null normal hierarchical model is correct). Figure 6 shows the resulting sampling distributions of the three p-values.

Note that the distribution of p_ppp(X) is quite close to uniform, even though only five means were involved. In contrast, the distributions of p_plug(X) and p_post(X) are quite far from uniform, with the latter being the worst. Even for larger numbers of means (e.g., 25), p_plug(X) and p_post(X) remained significantly nonuniform, indicating a serious and inappropriate conservatism.

Of course, Bayesians frequently criticize the direct use of p-values in testing a precise hypothesis, in that p-values then fail to correspond to natural Bayesian measures. There are, however, various calibrations of p-values that have been suggested (see Good, 1983; Sellke, Bayarri and Berger, 2001). But these are “higher level” considerations in that, if a purported “Bayesian p-value” fails to adequately criticize a null hypothesis, the higher level concerns are irrelevant.
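The kind of null-distribution check summarized in Figure 6 is easy to emulate. The sketch below is ours, not the paper's: it covers only p_plug, takes σ_i²/n_i as the variance of a group sample mean, uses a simple closed-form mle for (ν, τ²), and makes no attempt to reproduce the figure. It repeatedly simulates the null model of Example 4.3 and examines whether the resulting p-values look uniform.

    import numpy as np
    from math import erf, sqrt

    rng = np.random.default_rng(1)
    I, n_i, sigma2 = 5, 8, 4.0
    se2 = sigma2 / n_i                       # variance of a group sample mean

    def norm_cdf(z):
        return 0.5 * (1.0 + erf(z / sqrt(2.0)))

    def p_plug(xbar):
        # mle's of (nu, tau^2) under Xbar_i ~ N(nu, sigma^2/n_i + tau^2)
        nu_hat = xbar.mean()
        v_hat = max(xbar.var(), se2)         # total variance, floored so tau^2 >= 0
        t_obs = xbar.max()
        # Pr(max of I replicate group means >= t_obs) under the plug-in model
        return 1.0 - norm_cdf((t_obs - nu_hat) / sqrt(v_hat)) ** I

    # Null model of Example 4.3: mu_i ~ N(0, 1), X_ij ~ N(mu_i, 4), n_i = 8.
    pvals = np.array([
        p_plug(rng.normal(rng.normal(0.0, 1.0, size=I), sqrt(se2)))
        for _ in range(5000)
    ])

    # A proper p-value would be roughly Uniform(0, 1); conservatism shows up
    # as too few small values.
    for q in (0.05, 0.25, 0.5):
        print(f"Pr(p <= {q}) = {(pvals <= q).mean():.3f}")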

FIG. 6. Null distribution of p_plug(X) (left), p_post(X) (center) and p_ppp(X) (right) when the null normal hierarchical model is correct.

5. AREAS OF CURRENT DISAGREEMENT

It is worth mentioning some aspects of inference in which it seems very difficult to reconcile the frequentist and Bayesian approaches. Of course, as discussed in Section 4, it was similarly believed to be difficult to reconcile frequentist and Bayesian testing until recently, so it may simply be a matter of time until reconciliation also occurs in these other areas.

Multiple comparisons. When doing multiple tests, such as occurs in the variable selection problem in regression, classical statistics performs some type of adjustment (e.g., the Bonferroni adjustment to the significance level) to account for the multiplicity of tests. In contrast, Bayesian analysis does not explicitly adjust for multiplicity of tests, the argument being that a correct adjustment is automatic within the Bayesian paradigm.

Unfortunately, the duality between frequentist and Bayesian methodology in testing, discussed in Section 4, has not been extended to the multiple hypothesis testing framework, so we are left with competing and quite distinct methodologies. Worse, there are multiple competing frequentist methodologies and multiple competing Bayesian methodologies, a situation that is professionally disconcerting, yet common when there is no frequentist–Bayesian consensus. (To access some of these methodologies, see http://www.ba.ttu.edu/isqs/westfall/mcp2002.htm.)

Sequential analysis. The stopping rule principle says that once the data have been obtained, the reasons for stopping experimentation should have no bearing on the evidence reported about unknown model parameters. This principle is automatically satisfied by Bayesian analysis, but is viewed as crazy by many frequentists. Indeed frequentist practice in clinical trials is to “spend α” for looks at the data; that is, if there are to be interim analyses during the trial, with the option of stopping the trial early should the data look convincing, frequentists feel that it is then mandatory to adjust the allowed error probability (down) to account for the multiple analyses.

This issue is extensively discussed in Berger and Berry (1988), which has many earlier references. That it is a controversial and difficult issue is admirably expressed by Savage (1962): “I learned the stopping rule principle from Professor Barnard, in conversation in the summer of 1952. Frankly, I then thought it a scandal that anyone in the profession could advance an idea so patently wrong, even as today I can scarcely believe that people resist an idea so patently right.”

An interesting recent development is the finding, in Berger, Boukai and Wang (1999), that “optimal” conditional frequentist testing also essentially obeys the stopping rule principle. This suggests the admittedly controversial speculation that optimal (conditional) frequentist procedures will eventually be found to essentially satisfy the stopping rule principle; and that we currently have a controversy only because suboptimal (unconditional) frequentist procedures are being used.

It is, however, also worth noting that in prior development Bayesians sometimes utilize the stopping rule to help define an objective prior, for example, the Jeffreys or reference prior (see Ye, 1993; Sun and Berger, 2003, for examples and motivation). Since the prior will then depend on the stopping rule, objective Bayesian analysis can involve a (probably slight) violation of the stopping rule principle. (See Sweeting, 2001, for related discussion concerning the extent of violation of the likelihood principle by objective Bayesian methods.)

Finite population sampling. In this central area of statistics, classical inference is primarily based on frequentist averaging with respect to the sampling probabilities by which units of the population are selected for inclusion in the sample. In contrast, Bayesian analysis asserts that these sampling probabilities are irrelevant for inference, once the data is at hand. (See Ghosh, 1988, which contains excellent essays by Basu on this subject.) Hence we, again, have a fundamental philosophical and practical conflict.

There have been several arguments (e.g., Rubin, 1984; Robins and Ritov, 1997) to the effect that there are situations in which Bayesians do need to take into account the sampling probabilities, to save themselves from a too-difficult (and potentially nonrobust) prior development. Conversely, there has been a very significant growth in use of Bayesian methodology in finite population contexts, such as in the use of “small area estimation methods” (see, e.g., Rao, 2003). Small area estimation methods actually occur in the broader context of the model-based approach to finite population sampling (which mostly ignores the sampling probabilities), and this model-based approach also has frequentist and Bayesian versions (the differences, however, being much smaller than the differences arising from use, or not, of the sampling probabilities).

6. CONCLUSIONS

It seems quite clear that both Bayesian and frequentist methodology are here to stay, and that we should not expect either to disappear in the future. This is not to say that all Bayesian or all frequentist methodology is fine and will survive. To the contrary, there are many areas of frequentist methodology that should be replaced by (existing) Bayesian methodology that provides superior answers, and the verdict is still out on those Bayesian methodologies that have been exposed as having potentially serious frequentist problems.

Philosophical unification of the Bayesian and frequentist positions is not likely, nor desirable, since each illuminates a different aspect of statistical inference. We can hope, however, that we will eventually have a general methodological unification, with both Bayesians and frequentists agreeing on a body of standard statistical procedures for general use.

ACKNOWLEDGMENTS

Research supported by Spanish Ministry of Science and Technology Grant SAF2001-2931 and by NSF Grants DMS-01-03265 and DMS-01-12069. Part of the work was done while the first author was visiting the Statistical and Applied Mathematical Sciences Institute and ISDS, Duke University.

REFERENCES

BARNETT, V. (1982). Comparative Statistical Inference, 2nd ed. Wiley, New York.
BARRON, A. (1999). Information-theoretic characterization of Bayes performance and the choice of priors in parametric and nonparametric problems. In Bayesian Statistics 6 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 27–52. Oxford Univ. Press.
BARRON, A., SCHERVISH, M. J. and WASSERMAN, L. (1999). The consistency of posterior distributions in nonparametric problems. Ann. Statist. 27 536–561.
BAYARRI, M. J. and BERGER, J. (2000). P-values for composite null models (with discussion). J. Amer. Statist. Assoc. 95 1127–1170.
BAYARRI, M. J. and CASTELLANOS, M. E. (2004). Bayesian checking of hierarchical models. Technical report, Univ. Valencia.
BELITSER, E. and GHOSAL, S. (2003). Adaptive Bayesian inference on the mean of an infinite-dimensional normal distribution. Ann. Statist. 31 536–559.
BERGER, J. (1985a). Statistical Decision Theory and Bayesian Analysis, 2nd ed. Springer, New York.
BERGER, J. (1985b). The frequentist viewpoint and conditioning. In Proceedings of the Berkeley Conference in Honor of Jerzy Neyman and Jack Kiefer (L. Le Cam and R. Olshen, eds.) 15–44. Wadsworth, Monterey, CA.
BERGER, J. (1994). An overview of robust Bayesian analysis (with discussion). Test 3 5–124.
BERGER, J. (2003). Could Fisher, Jeffreys and Neyman have agreed on testing (with discussion)? Statist. Sci. 18 1–32.
BERGER, J. and BERNARDO, J. (1992). On the development of reference priors. In Bayesian Statistics 4 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 35–60. Oxford Univ. Press.
BERGER, J. and BERRY, D. (1988). The relevance of stopping rules in statistical inference (with discussion). In Statistical Decision Theory and Related Topics IV (S. Gupta and J. Berger, eds.) 1 29–72. Springer, New York.
BERGER, J., BOUKAI, B. and WANG, Y. (1999). Simultaneous Bayesian–frequentist sequential testing of nested hypotheses. Biometrika 86 79–92.
BERGER, J., BROWN, L. D. and WOLPERT, R. (1994). A unified conditional frequentist and Bayesian test for fixed and sequential simple hypothesis testing. Ann. Statist. 22 1787–1807.
BERGER, J., GHOSH, J. K. and MUKHOPADHYAY, N. (2003). Approximations and consistency of Bayes factors as model dimension grows. J. Statist. Plann. Inference 112 241–258.
BERGER, J. and PERICCHI, L. (2001). Objective Bayesian methods for model selection: Introduction and comparison (with discussion). In Model Selection (P. Lahiri, ed.) 135–207. IMS, Beachwood, OH.
BERGER, J. and PERICCHI, L. (2004). Training samples in objective Bayesian model selection. Ann. Statist. 32 841–869.
BERGER, J., PHILIPPE, A. and ROBERT, C. (1998). Estimation of quadratic functions: Noninformative priors for non-centrality parameters. Statist. Sinica 8 359–376.
BERGER, J. and ROBERT, C. (1990). Subjective hierarchical Bayes estimation of a multivariate normal mean: On the frequentist interface. Ann. Statist. 18 617–651.
BERGER, J. and STRAWDERMAN, W. (1996). Choice of hierarchical priors: Admissibility in estimation of normal means. Ann. Statist. 24 931–951.
BERGER, J. and WOLPERT, R. L. (1988). The Likelihood Principle: A Review, Generalizations, and Statistical Implications, 2nd ed. IMS, Hayward, CA. (With discussion.)
BERNARDO, J. M. (1979). Reference posterior distributions for Bayesian inference (with discussion). J. Roy. Statist. Soc. Ser. B 41 113–147.
BERNARDO, J. M. and SMITH, A. F. M. (1994). Bayesian Theory. Wiley, New York.
BOX, G. E. P. (1980). Sampling and Bayes' inference in scientific modeling and robustness (with discussion). J. Roy. Statist. Soc. Ser. A 143 383–430.
BROWN, L. D. (1971). Admissible estimators, recurrent diffusions, and insoluble boundary value problems. Ann. Math. Statist. 42 855–903.
BROWN, L. D. (1994). Minimaxity, more or less. In Statistical Decision Theory and Related Topics V (S. Gupta and J. Berger, eds.) 1–18. Springer, New York.
BROWN, L. D. (2000). An essay on statistical decision theory. J. Amer. Statist. Assoc. 95 1277–1281.
BROWN, L. D., CAI, T. T. and DASGUPTA, A. (2001). Interval estimation for a binomial proportion (with discussion). Statist. Sci. 16 101–133.
BROWN, L. D., CAI, T. T. and DASGUPTA, A. (2002). Confidence intervals for a binomial proportion and asymptotic expansions. Ann. Statist. 30 160–201.
CARLIN, B. P. and LOUIS, T. A. (2000). Bayes and Empirical Bayes Methods for Data Analysis, 2nd ed. Chapman and Hall, London.

CASELLA, G. (1988). Conditionally acceptable frequentist solutions (with discussion). In Statistical Decision Theory and Related Topics IV (S. Gupta and J. Berger, eds.) 1 73–117. Springer, New York.
CASTELLANOS, M. E. (2002). Diagnóstico Bayesiano de modelos. Ph.D. dissertation, Univ. Miguel Hernández, Spain.
CHALONER, K. and VERDINELLI, I. (1995). Bayesian experimental design: A review. Statist. Sci. 10 273–304.
CLYDE, M. and GEORGE, E. (2004). Model uncertainty. Statist. Sci. 19 81–94.
DANIELS, M. and POURAHMADI, M. (2002). Bayesian analysis of covariance matrices and dynamic models for longitudinal data. Biometrika 89 553–566.
DASS, S. and BERGER, J. (2003). Unified conditional frequentist and Bayesian testing of composite hypotheses. Scand. J. Statist. 30 193–210.
DATTA, G. S., MUKERJEE, R., GHOSH, M. and SWEETING, T. J. (2000). Bayesian prediction with approximate frequentist validity. Ann. Statist. 28 1414–1426.
DAWID, A. P. and SEBASTIANI, P. (1999). Coherent dispersion criteria for optimal experimental design. Ann. Statist. 27 65–81.
DAWID, A. P. and VOVK, V. G. (1999). Prequential probability: Principles and properties. Bernoulli 5 125–162.
DE FINETTI, B. (1970). Teoria delle Probabilità 1, 2. Einaudi, Torino. [English translations published (1974, 1975) as Theory of Probability 1, 2. Wiley, New York.]
DELAMPADY, M., DASGUPTA, A., CASELLA, G., RUBIN, H. and STRAWDERMAN, W. E. (2001). A new approach to default priors and robust Bayes methodology. Canad. J. Statist. 29 437–450.
DEY, D., MÜLLER, P. and SINHA, D., eds. (1998). Practical Nonparametric and Semiparametric Bayesian Statistics. Lecture Notes in Statist. 133. Springer, New York.
DIACONIS, P. (1988a). Bayesian numerical analysis. In Statistical Decision Theory and Related Topics IV (S. Gupta and J. Berger, eds.) 1 163–175. Springer, New York.
DIACONIS, P. (1988b). Recent progress on de Finetti's notion of exchangeability. In Bayesian Statistics 3 (J. M. Bernardo, M. H. DeGroot, D. V. Lindley and A. F. M. Smith, eds.) 111–125. Oxford Univ. Press.
DIACONIS, P. and FREEDMAN, D. (1986). On the consistency of Bayes estimates (with discussion). Ann. Statist. 14 1–67.
EATON, M. L. (1989). Group Invariance Applications in Statistics. IMS, Hayward, CA.
EFRON, B. (1993). Bayes and likelihood calculations from confidence intervals. Biometrika 80 3–26.
FRASER, D. A. S., REID, N., WONG, A. and YI, G. Y. (2003). Direct Bayes for interest parameters. In Bayesian Statistics 7 (J. M. Bernardo, M. J. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith and M. West, eds.) 529–534. Oxford Univ. Press.
FREEDMAN, D. A. (1963). On the asymptotic behavior of Bayes estimates in the discrete case. Ann. Math. Statist. 34 1386–1403.
FREEDMAN, D. A. (1999). On the Bernstein–von Mises theorem with infinite-dimensional parameters. Ann. Statist. 27 1119–1140.
GART, J. J. and NAM, J. (1988). Approximate interval estimation of the ratio of binomial parameters: A review and corrections for skewness. Biometrics 44 323–338.
GELMAN, A., CARLIN, J. B., STERN, H. and RUBIN, D. B. (1995). Bayesian Data Analysis. Chapman and Hall, London.
GHOSAL, S., GHOSH, J. K. and VAN DER VAART, A. W. (2000). Convergence rates of posterior distributions. Ann. Statist. 28 500–531.
GHOSH, J. K., ed. (1988). Statistical Information and Likelihood. A Collection of Critical Essays. Lecture Notes in Statist. 45. Springer, New York.
GHOSH, J. K., GHOSAL, S. and SAMANTA, T. (1994). Stability and convergence of the posterior in non-regular problems. In Statistical Decision Theory and Related Topics V (S. S. Gupta and J. Berger, eds.) 183–199. Springer, New York.
GHOSH, J. K. and RAMAMOORTHI, R. V. (2003). Bayesian Nonparametrics. Springer, New York.
GHOSH, M. and KIM, Y.-H. (2001). The Behrens–Fisher problem revisited: A Bayes–frequentist synthesis. Canad. J. Statist. 29 5–17.
GOOD, I. J. (1983). Good Thinking: The Foundations of Probability and Its Applications. Univ. Minnesota Press, Minneapolis.
GUTTMAN, I. (1967). The use of the concept of a future observation in goodness-of-fit problems. J. Roy. Statist. Soc. Ser. B 29 83–100.
HILL, B. (1974). On coherence, inadmissibility and inference about many parameters in the theory of least squares. In Studies in Bayesian Econometrics and Statistics (S. Fienberg and A. Zellner, eds.) 555–584. North-Holland, Amsterdam.
HOBERT, J. (2000). Hierarchical models: A current computational perspective. J. Amer. Statist. Assoc. 95 1312–1316.
HWANG, J. T., CASELLA, G., ROBERT, C., WELLS, M. T. and FARRELL, R. (1992). Estimation of accuracy in testing. Ann. Statist. 20 490–509.
JEFFREYS, H. (1961). Theory of Probability, 3rd ed. Oxford Univ. Press.
KIEFER, J. (1977). Conditional confidence statements and confidence estimators (with discussion). J. Amer. Statist. Assoc. 72 789–827.
KIM, Y. and LEE, J. (2001). On posterior consistency of survival models. Ann. Statist. 29 666–686.
LAD, F. (1996). Operational Subjective Statistical Methods: A Mathematical, Philosophical and Historical Introduction. Wiley, New York.
LE CAM, L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer, New York.
LEHMANN, E. L. and CASELLA, G. (1998). Theory of Point Estimation, 2nd ed. Springer, New York.
MCDONALD, G. C., VANCE, L. C. and GIBBONS, D. I. (1995). Some tests for discriminating between lognormal and Weibull distributions—an application to emissions data. In Recent Advances in Life-Testing and Reliability—A Volume in Honor of Alonzo Clifford Cohen, Jr. (N. Balakrishnan, ed.) Chapter 25. CRC Press, Boca Raton, FL.
MENG, X.-L. (1994). Posterior predictive p-values. Ann. Statist. 22 1142–1160.
MORRIS, C. (1983). Parametric empirical Bayes inference: Theory and applications (with discussion). J. Amer. Statist. Assoc. 78 47–65.

MOSSMAN, D. and BERGER, J. (2001). Intervals for post-test probabilities: A comparison of five methods. Medical Decision Making 21 498–507.
NEYMAN, J. (1977). Frequentist probability and frequentist statistics. Synthèse 36 97–131.
NEYMAN, J. and SCOTT, E. L. (1948). Consistent estimates based on partially consistent observations. Econometrica 16 1–32.
O'HAGAN, A. (1992). Some Bayesian numerical analysis. In Bayesian Statistics 4 (J. M. Bernardo, J. O. Berger, A. P. Dawid and A. F. M. Smith, eds.) 345–363. Oxford Univ. Press.
PAULO, R. (2002). Problems on the Bayesian/frequentist interface. Ph.D. dissertation, Duke Univ.
PRATT, J. W. (1965). Bayesian interpretation of standard inference statements (with discussion). J. Roy. Statist. Soc. Ser. B 27 169–203.
RAO, J. N. K. (2003). Small Area Estimation. Wiley, New York.
REID, N. (2000). Likelihood. J. Amer. Statist. Assoc. 95 1335–1340.
RÍOS INSUA, D. and RUGGERI, F., eds. (2000). Robust Bayesian Analysis. Lecture Notes in Statist. 152. Springer, New York.
ROBBINS, H. (1955). An empirical Bayes approach to statistics. Proc. Third Berkeley Symp. Math. Statist. Probab. 1 157–164. Univ. California Press, Berkeley.
ROBERT, C. P. (2001). The Bayesian Choice, 2nd ed. Springer, New York.
ROBERT, C. P. and CASELLA, G. (1999). Monte Carlo Statistical Methods. Springer, New York.
ROBINS, J. M. and RITOV, Y. (1997). Toward a curse of dimensionality appropriate (CODA) asymptotic theory for semiparametric models. Statistics in Medicine 16 285–319.
ROBINS, J. M., VAN DER VAART, A. and VENTURA, V. (2000). Asymptotic distribution of p-values in composite null models. J. Amer. Statist. Assoc. 95 1143–1156.
ROBINS, J. and WASSERMAN, L. (2000). Conditioning, likelihood and coherence: A review of some foundational concepts. J. Amer. Statist. Assoc. 95 1340–1346.
ROBINSON, G. K. (1979). Conditional properties of statistical procedures. Ann. Statist. 7 742–755.
ROUSSEAU, J. (2000). Coverage properties of one-sided intervals in the discrete case and applications to matching priors. Ann. Inst. Statist. Math. 52 28–42.
RUBIN, D. B. (1984). Bayesianly justifiable and relevant frequency calculations for the applied statistician. Ann. Statist. 12 1151–1172.
RUBIN, H. (1987). A weak system of axioms for "rational" behavior and the non-separability of utility from prior. Statist. Decisions 5 47–58.
SAVAGE, L. J. (1962). The Foundations of Statistical Inference. Methuen, London.
SCHERVISH, M. (1995). Theory of Statistics. Springer, New York.
SCHWARZ, G. (1978). Estimating the dimension of a model. Ann. Statist. 6 461–464.
SELLKE, T., BAYARRI, M. J. and BERGER, J. (2001). Calibration of p-values for testing precise null hypotheses. Amer. Statist. 55 62–71.
SOOFI, E. (2000). Principal information theoretic approaches. J. Amer. Statist. Assoc. 95 1349–1353.
STEIN, C. (1956). Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. Proc. Third Berkeley Symp. Math. Statist. Probab. 1 197–206. Univ. California Press, Berkeley.
STEIN, C. (1975). Estimation of a covariance matrix. Reitz Lecture, IMS–ASA Annual Meeting. (Also unpublished lecture notes.)
STRAWDERMAN, W. (2000). Minimaxity. J. Amer. Statist. Assoc. 95 1364–1368.
SUN, D. and BERGER, J. (2003). Objective priors under sequential experimentation. Technical report, Univ. Missouri.
SWEETING, T. J. (2001). Coverage probability bias, objective Bayes and the likelihood principle. Biometrika 88 657–675.
TANG, D. (2001). Choice of priors for hierarchical models: Admissibility and computation. Ph.D. dissertation, Purdue Univ.
VIDAKOVIC, B. (2000). Gamma-minimax: A paradigm for conservative robust Bayesians. In Robust Bayesian Analysis. Lecture Notes in Statist. 152 241–259. Springer, New York.
WALD, A. (1950). Statistical Decision Functions. Wiley, New York.
WELCH, B. and PEERS, H. (1963). On formulae for confidence points based on integrals of weighted likelihoods. J. Roy. Statist. Soc. Ser. B 25 318–329.
WOLPERT, R. L. (1996). Testing simple hypotheses. In Data Analysis and Information Systems: Statistical and Conceptual Approaches (H.-H. Bock and W. Polasek, eds.) 289–297. Springer, Berlin.
WOODROOFE, M. (1986). Very weak expansions for sequential confidence levels. Ann. Statist. 14 1049–1067.
YANG, R. and BERGER, J. (1994). Estimation of a covariance matrix using the reference prior. Ann. Statist. 22 1195–1211.
YE, K. (1993). Reference priors when the stopping rule depends on the parameter of interest. J. Amer. Statist. Assoc. 88 360–363.
ZHAO, L. (2000). Bayesian aspects of some nonparametric problems. Ann. Statist. 28 532–552.