
arXiv:0804.3173v7 [math.ST] 18 Jan 2010

Statistical Science
2009, Vol. 24, No. 2, 141–172
DOI: 10.1214/09-STS284
© Institute of Mathematical Statistics, 2009

Harold Jeffreys's Theory of Probability Revisited1

Christian P. Robert, Nicolas Chopin and Judith Rousseau

Abstract. Published exactly seventy years ago, Jeffreys's Theory of Probability (1939) has had a unique impact on the Bayesian community and is now considered to be one of the main classics of Bayesian Statistics as well as the initiator of the objective Bayes school. In particular, its advances on the derivation of noninformative priors as well as on the scaling of Bayes factors have had a lasting impact on the field. However, the book reflects the characteristics of the time, especially in terms of mathematical rigor. In this paper we point out the fundamental aspects of this reference work, especially the thorough coverage of testing problems and the construction of both estimation and testing noninformative priors based on functional divergences. Our major aim here is to help modern readers in navigating this difficult text and in concentrating on passages that are still relevant today.

Key words and phrases: Bayesian foundations, noninformative prior, σ-finite measure, Jeffreys's prior, Kullback divergence, testing, Bayes factor, p-values, goodness of fit.

Christian P. Robert is Professor of Statistics, Applied Mathematics Department, Université Paris Dauphine, and Head of the Statistics Laboratory, Center for Research in Economics and Statistics (CREST), National Institute for Statistics and Economic Studies (INSEE), Paris, France, e-mail: [email protected]. He was the President of ISBA (International Society for Bayesian Analysis) for 2008. Nicolas Chopin is Professor of Statistics, ENSAE (National School for Statistics and Economic Administration), and Member of the Statistics Laboratory, Center for Research in Economics and Statistics (CREST), National Institute for Statistics and Economic Studies (INSEE), Paris, France, e-mail: [email protected]. Judith Rousseau is Professor of Statistics, Applied Mathematics Department, Université Paris Dauphine, and Member of the Statistics Laboratory, Center for Research in Economics and Statistics (CREST), National Institute for Statistics and Economic Studies (INSEE), Paris, France, e-mail: [email protected].

1 Discussed in 10.1214/09-STS284A, 10.1214/09-STS284B, 10.1214/09-STS284C, 10.1214/09-STS284D, 10.1214/09-STS284E and 10.1214/09-STS284F; rejoinder at 10.1214/09-STS284REJ.

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2009, Vol. 24, No. 2, 141–172. This reprint differs from the original in pagination and typographic detail.

1. INTRODUCTION

The theory of probability makes it possible to respect the great men on whose shoulders we stand.
H. Jeffreys, Theory of Probability, Section 1.6.

Few Bayesian books other than Theory of Probability are so often cited as a foundational text. Among the "Bayesian classics," only Savage (1954), DeGroot (1970) and Berger (1985) seem to get more citations than Jeffreys (1939, 1948, 1961), the more recent book by Bernardo and Smith (1994) coming fairly close. The homonymous Theory of Probability by de Finetti (1974, 1975) gets quoted a third as much (Source: Google Scholar). This book is rightly considered as the principal reference in modern Bayesian statistics. Among other innovations, Theory of Probability states the general principle for deriving noninformative priors from the sampling distribution, using Fisher information. It also proposes a clear processing of Bayesian testing, including the dimension-free scaling of Bayes factors. This comprehensive treatment of Bayesian testing from an objective Bayes perspective is a major innovation for the time, and it has certainly contributed to the advance of a field that was then submitted to severe criticisms by R. A. Fisher (Aldrich, 2008) and others, and was in danger of becoming a feature of the past. As pointed out by Zellner (1980) in his introduction to a volume of essays in honor of Harold Jeffreys, a fundamental strength of Theory of Probability is its affirmation of a unitarian principle in the statistical processing of all fields of science.

For a 21st century reader, Jeffreys's Theory of Probability is nonetheless puzzling for its lack of formalism, including its difficulties in handling improper priors, its reliance on intuition, its long debate about the nature of probability, and its repeated attempts at philosophical justifications. The title itself is misleading in that there is absolutely no exposition of the mathematical bases of probability theory in the sense of Billingsley (1986) or Feller (1970): "Theory of Inverse Probability" would have been more accurate. In other words, the style of the book appears to be both verbose and often vague in its mathematical foundations for a modern reader.2

they may be surprisingly accurate), and parts that are linked with Bayesian perspectives will be covered fleetingly. Thus, when pointing out notions that may seem outdated or even mathematically unsound by modern standards, our only aim is to help the modern reader stroll past them, and we apologize in advance if, despite our intent, our tone seems overly presumptuous: it is rather a reflection of our ignorance of the current conditions at the time since (to borrow from the above quote which may sound itself somehow presumptuous) we stand respectfully at the feet of this giant of Bayesian Statistics.

The plan of the paper follows Theory of Probability linearly by allocating a section to each chapter of the book (Appendices are only mentioned throughout the paper). Section 10 contains a brief conclusion. Note that, in the following, words, sentences or passages quoted from Theory of Probability are written in italics with no precise indication of their location, in order to keep the style as light as possible. We also stress that our review is based on the third edition of Theory of Probability (Jeffreys, 1961), since this is both the most matured and the most available version (through the last reprint by Oxford University Press in 1998). Contemporary re-
views of Theory of Probability are found in Good (Good, 1980, also acknowledges that many passages (1962) and Lindley (1962). of the book are “obscure.”) It is thus difficult to ex- tract from this dense text the principles that made 2. CHAPTER I: FUNDAMENTAL NOTIONS Theory of Probability the reference it is nowadays. In this paper we endeavor to revisit the book from The posterior of the hypotheses are a Bayesian perspective, in order to separate founda- proportional to the products of the prior tional principles from less relevant parts. probabilities and the likelihoods. This review is neither a historical nor a critical H. Jeffreys, Theory of Probability, Section 1.2. exercise: while conscious that Theory of Probabil- ity reflects the idiosyncrasies both of the scientific The first chapter of Theory of Probability sets gen- achievements of the 1930’s—with, in particular, the eral goals for a coherent theory of induction. More emerging formalization of Probability as a branch of importantly, it proposes an axiomatic (if slightly Mathematics against the ongoing debate on the na- tautological) derivation of prior distributions, while ture of probabilities—and of Jeffreys’s background— justifying this approach as coherent, compatible with as a geophysicist—, we aim rather at providing the the ordinary process of learning and allowing for the modern reader with a reading guide, focusing on the incorporation of imprecise information. It also rec- pioneering advances made by this book. Parts that ognizes the fundamental property of coherence when correspond to the lack (at the time) of analytical updating posterior distributions, since they can be (like matrix algebra) or numerical (like simulation) used as the in taking into account tools and their substitution by approximation de- of a further set of data. Despite a style that is often vices (that are not used any longer, even though difficult to penetrate, this is thus a major chapter of Theory of Probability. It will also become clearer at a later stage that the principles exposed in this chap- 2In order to keep readability as high as possible, we shall use modern notation whenever the original notation is either ter correspond to the (modern) notion of objective unclear or inconsistent, for example, Greek letters for param- Bayes inference: despite mentions of prior probabil- eters and roman letters for observations. ities as reflections of prior belief or existing pieces of THEORY OF PROBABILITY REVISITED 3 information, Theory of Probability remains strictly Note that, maybe due to this very call to Ockham, “objective” in that prior distributions are always de- the later Bayesian literature abounds in references rived analytically from sampling distributions and to Ockham’s razor with little formalization of this that all examples are treated in a noninformative principle, even though Berger and Jefferys (1992), manner. One may find it surprising that a physi- Balasubramanian (1997) and MacKay (2002) develop cist like Jeffreys does not emphasise the appeal of elaborate approaches. In particular, the definition subjective Bayes, that is, the ability to take into ac- of the Bayes factor in Section 1.6 can be seen as a count genuine prior information in a principled way. partial implementation of Ockham’s razor when set- But this is in line with both his predecessors, in- ting the probabilities of both models equal to 1/2. 
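The Ockham effect of setting both prior model probabilities equal to 1/2 can be made concrete with a small numerical sketch. The Python fragment below is our own illustration, not taken from Theory of Probability or from the text above: it compares a point-null model H0: mu = 0 against an alternative H1 in which mu receives a N(0, tau^2) prior, for a normal sample with known variance; the sample size, the variance and the prior scale tau are arbitrary choices made only for the example.

```python
import numpy as np
from scipy.stats import norm

# Illustrative sketch (ours): Bayes factor for H0: mu = 0 against H1: mu ~ N(0, tau^2),
# with x_1, ..., x_n ~ N(mu, sigma^2) and sigma known.
rng = np.random.default_rng(0)
n, sigma, tau = 20, 1.0, 1.0          # tau is an arbitrary prior scale (our assumption)
x = rng.normal(loc=0.2, scale=sigma, size=n)
xbar = x.mean()

# Marginal density of the (sufficient) sample mean under each model:
#   under H0, xbar ~ N(0, sigma^2 / n);
#   under H1, integrating mu out, xbar ~ N(0, sigma^2 / n + tau^2).
m0 = norm.pdf(xbar, loc=0.0, scale=sigma / np.sqrt(n))
m1 = norm.pdf(xbar, loc=0.0, scale=np.sqrt(sigma**2 / n + tau**2))

B01 = m0 / m1                          # Bayes factor in favor of the point null
post_H0 = B01 / (1.0 + B01)            # posterior probability of H0 when P(H0) = P(H1) = 1/2
print(f"B01 = {B01:.3f}, P(H0 | data) = {post_H0:.3f}")
```

Because the marginal distribution under H1 spreads its mass over a wider range of values of the sample mean, the simpler model is automatically favored whenever the data are compatible with both, which is one way of reading MacKay's (2002) argument quoted above.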
cluding Laplace and Bayes, and their use of uniform In the beginning of his Chapter 28, entitled Model priors and his main field of study that he perceived Choice and Occam’s Razor, MacKay (2002) argues as objective (Lindley, 2008, private communication), that Bayesian inference embodies Ockham’s razor while one of the main appeals of Theory of Probabil- because “simple” models tend to produce more pre- ity is to provide a general and coherent framework cise predictions and, thus, when the data is equally to derive objective priors. compatible with several models, the simplest one 2.1 A Philosophical Exercise will end up as the most probable. This is generally true, even though there are some counterexamples The chapter starts in Section 1.0 with an epis- in Bayesian nonparametrics. temological discussion of the nature of (statistical) Overall, we nonetheless feel that this part of The- inference. Some sections are quite puzzling. For in- ory of Probability could be skipped at first read- stance, the example that the kinematic equation for ing as less relevant for Bayesian studies. In particu- an object in free-fall, lar, the opposition between mathematical deduction s = a + ut + 1 gt2, and statistical induction does not appear to carry a 2 strong argument, even though the distinction needs cannot be deduced from observations is used as an (needed?) to be made for mathematically oriented argument against deduction under the reasoning that readers unfamiliar with statistics. However, from a an infinite number of functions, historical point of view, this opposition must be con- 1 2 sidered against the then-ongoing debate about the s = a + ut + gt + f(t)(t t1) (t tn), 2 − − nature of induction, as illustrated, for instance, by also apply to describe a free fall observed at times Karl Popper’s articles of this period about the logi- t1,...,tn. The limits of the epistemological discus- cal impossibility of induction (Popper, 1934). sion in those early pages are illustrated by the in- troduction of Ockham’s razor (the choice of the sim- 2.2 Foundational Principles plest law that fits the fact), as the meaning of what The text becomes more focused when dealing with a simplest law can be remains unclear, and the sec- the construction of a theory of inference: while some tion lacks a clear (objective) argument in motivating notions are yet to be defined, including the pervasive this choice, besides common sense, while the discus- evidence, sentences like inference involves in its very sion ends up with a somehow paradoxical statement nature the possibility that the alternative chosen as that, since deductive logic provides no explanation the most likely may in fact be wrong are in line with of the choice of the simplest law, this is proof that our current interpretation of modeling and obviously deductive logic is grossly inadequate to cover scien- with the Bayesian paradigm. In Section 1.1 Jeffreys tific and practical requirements. On the other hand, sets up a collection of postulates or rules that act and from a statistician’s narrower perspective, one like axioms for his theory of inference, some of which can re-interpret this gravity example as possibly the require later explanations to be fully understood: earliest discussion of the conceptual difficulties asso- ciated with model choice, which are still not entirely 1. All hypotheses must be explicitly stated and the resolved today. 
In that respect, it is quite fascinat- conclusions must follow from the hypotheses: what ing to see this discussion appear so early in the book may first sound like an obvious scientific principle (third page), as if Jeffreys had perceived how impor- is in fact a leading characteristic of Bayesian statis- tant this debate would become later. tics. While it seems to open a whole range of new 4 C. P. ROBERT, N. CHOPIN AND J. ROUSSEAU questions—“To what extent must we define our be- have been more adequate. But, from reading fur- lief in the statistical models used to build our in- ther (as in Section 1.2), it appears that this rule is ference? How can a unique conclusion stem from a to be understood as the foundational principle (the given model and a given set of observations?”—and chief constructive rule) for defining prior distribu- while it may sound far too generic to be useful, we tions. While this is certainly not clear at this stage, may interpret this statement as setting the working Bayesian inference does indeed provide for the pos- principle of Bayesian decision theory: given a prior, sibility that the model under study is not correct a sampling distribution, an observation and a loss and for the unreliability of the resulting inference function, there exists a single decision procedure. via a . In contrast, the frequentist theories of Neyman or 5. The theory must not deny any empirical propo- of Fisher require the choice of ad hoc procedures, sition a priori. This principle remains unclear when whose (good or bad) properties they later analyze. put into practice. If it is to be understood in the But this may be a far-fetched interpretation of this sense of a physical theory, there is no reason why rule at this stage even though the comment will ap- some empirical proposition could not be excluded pear more clearly later. from the start. If it is the sense of an inferential 2. The theory must be self-consistent. The state- theory, then the statement would require a better ment is somehow a repetition of the previous rule definition of empirical proposition. But Jeffreys us- and it is only later (in Section 3.10) that its mean- ing the epithet a priori seems to imply that the prior ing becomes clearer, in connection with the intro- distribution corresponding to the theory must be as duction of Jeffreys’s noninformative priors as a self- inclusive as possible. This certainly makes sense as contained principle. Consistency is nonetheless a dom- long as prior information does not exclude parts of inant feature of the book, as illustrated in Section 3.1 the parameter space as, for instance, in Physics. with the rejection of Haldane’s prior.3 6. The number of postulates should be reduced to a 3. Any rule must be applicable in practice. This minimum. This rule sounds like an embedded Ock- “rule” does not seem to carry any weight in prac- ham’s razor, but, more positively, it can also be in- tice. In addition, the explicit prohibition of esti- terpreted as a call for noninformative priors. Once mates based on impossible experiments sounds im- again, the vagueness of the wording opens a wide plementable only through deductive arguments. But range of interpretations. this leads to the exclusion of rules based on fre- 7. The theory need not represent thought-processes quency arguments and, as such, is fundamental in in details, but should agree with them in outline. setting a Bayesian framework. 
Alternatively (and This vague principle could be an attempt at rec- this is another interpretation), this constraint should onciliating statistical theories, but it does not give be worded in more formal terms of the measurability clear directions on how to proceed. In the light of of procedures. Jeffreys’s arguments, it could rather signify that the 4. The theory must provide explicitly for the pos- construction of prior distributions cannot exactly re- sibility that inferences made by it may turn out to flect an actual construction in real life. Since a non- be wrong. This is both a fundamental aspect of sta- informative (or “objective”) perspective is adopted tistical inference and an indication of a surprising for most of the book, this is more likely to be a pre- view of inference. Indeed, even when conditioning liminary argument in favor of this line of thought. In on the model, inference is never right in the sense Section 1.2 this rule is invoked to derive the (prior) that a point estimate rarely gives the true answer. ordering of events. It may be that Jeffreys is solely thinking of sta- 8. An objection carries no weight if [it] would in- tistical testing, in which case the rightfulness of a validate part of pure mathematics. This rule grounds decision is necessarily conditional on the truthful- Theory of Probability within mathematics, which may ness of the corresponding model and thus dubious. be a necessary reminder in the spirit of the time A more relative (or more precise) statement would (where some were attempting to dissociate statis- tics from mathematics). 3Consistency is then to be understood in the weak sense of invariant under reparameterization, which is a usual argument The next paragraph discusses the notion of prob- for Jeffreys’s principle, not in terms of asymptotic convergence ability. Its interest is mostly historical: in the early properties. 1930’s, the axiomatic definition of probability based THEORY OF PROBABILITY REVISITED 5 on Kolmogorov’s axioms was not yet universally ac- formal (modern) definition of a probability distri- cepted, and there were still attempts to base this bution, with the provision that this probability is definition on limiting properties. In particular, always conditional on the same data. This is also Lebesgue integration was not part of the undergrad- reminiscent of the derivation of the existence of a uate curriculum till the late 1950’s at either Cam- prior distribution from an ordering of prior proba- bridge or Oxford (Lindley, 2008, private communi- bilities in DeGroot (1970), but the discussion about cation). This debate is no longer relevant, and the the arbitrary ranking of probabilities between 0 and current theory of probability, as derived from mea- 1 may sound anecdotal today. Note also that, from sure theory, does not bear further discussion. This a mathematical point of view, defining only condi- also removes the ambiguity of constructing objective tional probabilities like P (p q) is somehow superflu- | probabilities as derived from actual or possible ob- ous in that, if the conditioning q is to remain fixed, servations. A probability model is to be understood P ( q) is a regular , while, if | as a mathematical (and thus unobjectionable) con- q is to be updated into qr, P ( qr) can be derived | struct, in agreement with Rule 8 above. from P ( q) by Bayes’ theorem (which is to be in- | Then follows (still in Section 1.1) a rather long troduced later). 
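The remark that P(·|qr) can be derived from P(·|q) by Bayes' theorem is easy to check on a toy finite space. The sketch below is our own illustration with arbitrary numbers: it treats P(·|q) as the reference distribution over four elementary outcomes and recovers P(·|qr) by restriction to r and renormalization.

```python
import numpy as np

# Toy illustration (not from the book): outcomes 0..3 with reference distribution P(. | q).
p_given_q = np.array([0.1, 0.2, 0.3, 0.4])    # arbitrary probabilities, summing to 1
r = np.array([False, True, True, False])       # the event r, as a subset of outcomes

# Updating q into (q and r): condition the reference distribution on r.
p_given_qr = np.where(r, p_given_q, 0.0)
p_given_qr /= p_given_qr.sum()                 # P(. | q r) = P(., r | q) / P(r | q)

print(p_given_qr)                              # [0.  0.4 0.6 0. ]
```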
Therefore, in all cases, P ( q) ap- | debate on causality versus determinism. While the pears like the reference probability. At some stage, principles stated in those pages are quite accept- while stating that the probability of the sure event able, the discussion only uses the most basic concept is equal to one is merely a convention, Jeffreys indi- of determinism, namely, that identical causes give cates that, when expressing ignorance over an infi- identical effects, in the sense of Laplace. We thus nite range of values of a quantity, it may be conve- agree with Jeffreys that, at this level, the principle nient to use instead. Clearly, this paves the way ∞ is useless, but the same paragraph actually leaves for the introduction of improper priors.5 Unfortu- us quite confused as to its real purpose. A likely ex- nately, the convention and the motivation (to keep planation (Lindley, 2008, personal communication) ratios for finite ranges determinate) do not seem is that Jeffreys stresses the inevitability of probabil- correct, if in tune with the perspective of the time ity statements in Science: (measurement) errors are (see, e.g., Lhoste, 1923; Broemeling and Broemel- not mistakes but part of the picture. ing, 2003). Notably, setting all events involving an infinite range with a probability equal to seems 2.3 Prior Distributions to restrict the abilities of the theory to a∞ far ex- 6 In Section 1.2 Jeffreys introduces the notion of tent. Similar to Laplace, Jeffreys is more used to prior in an indirect way, by considering that the handling equal probability finite sets than continu- probability of a proposition is always conditional on ous sets and the extension to continuous settings is some data and that the occurrence of new items of unorthodox, using, for instance, Dedekind’s sections information (new evidence) on this proposition sim- and putting several meanings under the notation dx. ply updates the available data. This is slightly con- Given the convoluted derivation of conditional prob- trary to our current way of defining a prior distri- abilities in this context, the book states the product rule P (qr p)= P (q p)P (r qp) as an axiom, rather bution π on a parameter θ as the information avail- | | | able on θ prior to the observation of the data, but than as a consequence of the basic probability ax- it simply conveys the fact that the prior distribu- ioms. It leads (in Section 1.22) to Bayes’ theorem, tion must be derived from some prior items of in- formation about θ. As pointed out by Jeffreys, this 5Jeffreys’s Theory of Probability strongly differs from the also allows for the coexistence of prior distributions earlier Scientific Inference (1931) in this respect, the latter be- ing rather dismissive of the mathematical difficulty: To make for different experts within the same probabilistic 4 this integral equal to 1 we should therefore have to include framework. In the sequel all statements will, how- a zero factor unless very small and very large values are ex- ever, condition on the same data. cluded. This does appear to be the case (Section 5.43, page The following paragraphs derive standard math- 67). 
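The convention of attributing total mass infinity to an infinite range, criticized here, can be contrasted with the modern handling of improper priors: pi(sigma) = 1/sigma does not integrate, yet the corresponding posterior is typically proper once data are observed. The following sketch is our own numerical check, with simulated data and arbitrary constants, for a N(0, sigma^2) sample; it is not a computation from the book.

```python
import numpy as np
from scipy.integrate import quad

# Our check (not Jeffreys's): pi(sigma) = 1/sigma has unbounded total mass,
# but the posterior given x_1, ..., x_n ~ N(0, sigma^2) has a finite normalizer.
rng = np.random.default_rng(1)
x = rng.normal(scale=2.0, size=10)
n, S = len(x), np.sum(x**2)

def unnorm_post(s):
    # likelihood times prior: sigma^(-n) exp(-S / (2 sigma^2)) * sigma^(-1), in log form
    return np.exp(-(n + 1) * np.log(s) - S / (2 * s**2))

a, b = 1e-6, 1e6
print("prior mass on [a, b]      =", np.log(b / a), "(grows without bound as a -> 0, b -> oo)")
print("posterior normalizer      =", quad(unnorm_post, 1e-9, np.inf)[0], "(finite)")
```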
6 ematical logic axioms that directly follow from a This difficulty with handling σ-finite measures and contin- uous variables will be recurrent throughout the book: Jeffreys does not seem to be adverse to normalizing an improper dis- 4Jeffreys seems to further note that the same conditioning tribution by , even though the corresponding derivations ∞ applies for the model of reference. are not meaningful. 6 C. P. ROBERT, N. CHOPIN AND J. ROUSSEAU namely, that, for all events qr, imposed earlier. This part does not bring much nov- elty, once the fundamental properties of a proba- P (q pH) P (q H)P (p q H), r| ∝ r| | r bility distribution are stated. This is basically the where H denotes the information available and p a purpose of this section, where earlier “Axioms” are set of observations. In this (modern) format P (p qrH) checked in terms of the posterior probability P ( pH). | | is identified as Fisher likelihood and P (qr H) as the A reassuring consequence of this derivation is that prior probability. Bayes’ theorem is defined| as the the use of a posterior probability as the basis for principle of inverse probability and only for finite inference cannot lead to inconsistency. The use of sets, rather than for measures.7 Obviously, the gen- the posterior as a new prior for future observations eral version of Bayes’ theorem is used in the sequel and the corresponding learning principle are devel- for continuous parameter spaces. oped at this stage. The debate about the choice of Section 1.3 represents one of the few forays of the prior distribution is postponed till later, while the book into the realm of decision theory,8 in con- the issue of the influence of this prior distribution nection with Laplace’s notions of mathematical and is dismissed as having very little difference [on] the moral expectations, and with Bernoulli’s Saint Pe- results, which needs to be quantified, as in the quote tersburg paradox, but there is no recognition of the below at the beginning of Section 5. central role of the loss function in defining an opti- Given the informal approach to (or rather with- mal Bayes rule as formalized later by Wald (1950) out) measure theory adopted in Theory of Probabil- and Raiffa and Schlaifer (1961). The attribution of ity, the study of the limiting behavior of posterior a decision-theoretic background to T. Bayes himself distributions in Section 1.6 does not provide much is surprising, since there is not anything close to the insight. For instance, the fact that notion of loss or of benefit in Bayes’ (1763) origi- P (q p1 pnH) nal paper. We nonetheless find there the seed of an | P (q H) idea later developed in Rubin (1987), among oth- = | ers, that prior and loss function are indistinguish- P (p1 H)P (p2 p1H) P (pn p1 pn−1H) | | | able. [Section 1.8 briefly re-enters this perspective is shown to induce that P (pn p1 pn−1H) converges to point out that (posterior) expectations are often to 1 is not particularly surprising,| although it relates nowhere near the actual value of the random quan- to Laplace’s principle that repeated verifications of tity.] The next section (Section 1.4) is important in consequences of a hypothesis will make it practically that it tackles for the first time the issue of nonin- certain that the next consequence will be verified. It formative priors. When the number of alternatives would have been equally interesting to focus on cases is finite, Jeffreys picks the uniform prior as his non- in which P (q p1 pnH) goes to 1. 
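The coherence of reusing the posterior as the prior for a further set of data, and Laplace's principle that repeated verifications drive the predictive probability of the next verification toward 1, can both be checked in a conjugate setting. The sketch below is an illustration of ours (the counts are made up), not a computation taken from Theory of Probability.

```python
# Illustration (ours): Beta-binomial updating from Laplace's flat prior Beta(1, 1).
a, b = 1.0, 1.0

# Batch 1: 7 successes out of 10; batch 2: 5 successes out of 6.
a1, b1 = a + 7, b + 3                 # posterior after batch 1, reused as prior for batch 2
a2, b2 = a1 + 5, b1 + 1               # sequential posterior after batch 2
print((a2, b2) == (a + 12, b + 4))    # True: identical to processing all 16 trials at once

# Repeated verifications: after k consecutive successes, the predictive probability
# of one further success is (1 + k) / (2 + k), which goes to 1.
for k in (1, 10, 100, 1000):
    print(k, (a + k) / (a + b + k))
```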
informative prior, following Laplace’s Principle of The end of| Section 1.62 introduces some quanti- Insufficient Reason. The difficulties associated with ties of interest, such as the distinction between esti- this choice in continuous settings are not mentioned mation problems and significance tests, but with no at this stage. clear guideline: when comparing models of complex- 2.4 More Axiomatics and Some Asymptotics ity m (this quantity being only defined for differen- tial equations), Jeffreys suggests using prior prob- Section 1.5 attempts an axiomatic derivation that abilities that are penalized by m, such as 2−m or the Bayesian principles just stated follow the rules 6/π2m2, the motivation for those specific values be- ing that the corresponding series converge. Penal- 7As noted by Fienberg (2006), the adjective term ization by the model complexity is quite an inter- “Bayesian” had not yet appeared in the statistical literature esting idea, to be formalized later by, for example, by the time Theory of Probability was published, and Jeffreys Rissanen (1983, 1990), but Jeffreys somehow kills sticks to the 19th century denomination of “inverse probabil- ity.” The adjective can be traced back to either , this idea before it is hatched by pointing out the who used it in a rather derogatory meaning, or to Abraham difficulties with the definition of m. Wald, who gave it a more complimentary meaning in Wald Instead, Jeffreys switches to a completely different (1950). (if paramount) topic by defining in a few lines the 8 The reference point estimator advocated by Jeffreys (if Bayes factor for testing a point null hypothesis, any) seems to be the maximum a posteriori (MAP) estimator, even though he stated in his discussion of Lindley (1953) that P (q θH) P (q H) K = | | , he deprecated the whole idea of picking out a unique estimate. P (q′ θH) P (q′ H) | . | THEORY OF PROBABILITY REVISITED 7 where θ denotes the data. He suggests using P (q H)= model assessment must involve a description of the 1/2 as a default value, except for sequences of| em- alternative(s) to be validated. bedded hypotheses for which he suggests The main bulk of the chapter is about sampling distributions. Section 2.1 introduces binomial and P (q H) | = 2, hypergeometric distributions at length, including the P (q′ H) | interesting problem of deciding between binomial presumably because the series with leading term 2−n versus negative binomial experiments when faced is converging. with the outcome of a survey, used later in the de- Once again, the rather quick coverage of this ma- fence of the Likelihood Principle (Berger and Wolpert, terial is somehow frustrating, as further justifica- 1988). The description of the binomial contains the tions would have been necessary for the choice of equally interesting remark that a given coin repeat- the constant and so on.9 Instead, the chapter con- edly thrown will show a bias toward head or tail due cludes with a discussion of the distinction between to the wear, a remark later exploited in Diaconis and “idealism” and “realism” that can be skipped for Ylvisaker (1985) to justify the use of mixtures of most purposes. conjugate priors. Bernoulli’s version of the Central Limit theorem is also recalled in this section, with no particular appeal if one considers that a mod- 3. 
CHAPTER II: DIRECT PROBABILITIES ern Statistics course (see, e.g., Casella and Berger, The whole of the information contained in the 2001) would first start with the probabilistic back- observations that is relevant to the posterior ground.10 probabilities of different hypotheses is summed The Poisson distribution is first introduced as a up in the values that they give to the likelihood. limiting distribution for the binomial distribution H. Jeffreys, Theory of Probability, Section 2.0. (n,p) when n is large and np is bounded. (Connec- Btions with radioactive disintegration are mentioned This chapter is certainly the least “Bayesian” chap- afterward.) The normal distribution is proposed as ter of the book, since it covers both the standard a large sample approximation to a sum of Bernoulli sampling distributions and some equally standard random variables. As for the other distributions, probability results. It starts with a reminder that there is some attempt at justifying the use of the the principle of inverse probability can be stated in normal distribution, as well as [what we find to be] the form a confusing paragraph about the “true” and “actual observed” values of the parameters. A long section Posterior Probability Prior Probability ∝ (Section 2.3) expands about the properties of Pear- Likelihood, son’s distributions, then allowing Jeffreys to intro- duce the negative binomial as a mixture of Poisson thus rephrasing Bayes’ theorem in terms of the like- distributions. The introduction of the bivariate nor- lihood and with the proper indication that the rel- mal distribution is similarly convoluted, using first evant information contained in the observations is binomial variates and second a limiting argument, summarized by the likelihood (sufficiency will be and without resorting to matrix formalism. mentioned later in Section 3.7). Then follows (still Section 2.6 attempts to introduce cumulative dis- in Section 2.0) a long paragraph about the tenta- tribution functions in a more formal manner, using tive nature of models, concluding that a statistical the current three-step definition, but again dealing model must be made part of the prior information with limits in an informal way. Rather coherently H before it can be tested against the observations, from a geophysicist’s point of view, characteristic which (presumably) relates to the fact that Bayesian functions are also covered in great detail, including connections with moments and the Cauchy distribu- 9Similarly, the argument against philosophers that main- tion, as well as L´evy’s inversion theorem. The main tain that no method based on the theory of probability can give a (...) non-zero probability to a precise value against a contin- 10In fact, some of the statements in Theory of Probability uous background is not convincing as stated. The distinction that surround the statement of the Central Limit theorem are between zero measure events and mixture priors including a not in agreement with measure theory, as, for instance, the Dirac mass should have been better explained, since this is confusion between pointwise and uniform convergence, and the basis for Bayesian point-null testing. convergence in probability and convergence in distribution. 8 C. P. ROBERT, N. CHOPIN AND J. 
ROUSSEAU goal of using characteristic functions seems nonethe- since the decision-theoretic perspective for building less to be able to establish the Central Limit theo- (point) estimators is mostly missing from the book rem in its full generality (Section 2.664). (see Section 1.8 for a very brief remark on expecta- Rather surprisingly for a Bayesian reference book tions). Both Good (1980) and Lindley (1980) stress and mostly in complete disconnection with the test- this absence. ing chapters, the χ2 test of goodness of fit is given a 4.1 Noninformative Priors of Former Days large and uncritical place within this book, includ- ing an adjustment for the degrees of freedom.11 Ex- Section 3.1 sets the principles for selecting nonin- amples include the obvious independence of a rect- formative priors. Jeffreys recalls Laplace’s rule that, angular contingency table. The only criticism (Sec- if a parameter is real-valued, its prior probability tion 2.76) is fairly obscure in that it blames poor should be taken as uniformly distributed, while, if performances of the χ2 test on the fact that all di- this parameter is positive, the prior probability of its vergences in the χ2 sum are equally weighted. The logarithm should be taken as uniformly distributed. test is nonetheless implemented in the most classi- The motivation advanced for using both priors is cal manner, namely, that the hypothesis is rejected the invariance principle, namely, the invariance of if the χ2 statistic is outside the standard interval. the prior selection under several different sets of pa- It is unclear from the text in Section 2.76 that re- rameters. At this stage, there is no recognition of jection would occur were the χ2 statistic too small, a potential problem with using a σ-finite measure even though Jeffreys rightly addresses the issue at and, in particular, with the fact that these priors the end of Chapter 5 (Section 5.63). He also men- are not probability distributions, but rather a sim- tions the need to coalesce small groups into groups ple warning that these are formal rules expressing of size at least 5 with no further justification. The ignorance. We face the difficulty mentioned earlier chapter concludes with similar uses of Student’s t when considering σ-finite measures since they are and Fisher’s z tests. not properly handled at this stage: when stating that one starts with any distribution of prior proba- 4. CHAPTER III: ESTIMATION PROBLEMS bility, it is not possible to include σ-finite measures this way, except via the [incorrect] argument that a If we have no information relevant to the actual probability is merely a number and, thus, that the value of the parameter, the probability must be total weight can be as well as 1: use instead chosen so as to express the fact that we have none. of 1 to indicate certainty∞ on data H. The∞ wrong in- H. Jeffreys, Theory of Probability, Section 3.1. terpretation of a σ-finite measure as a probability distribution (and of as a “number”) then leads This is a major chapter of Theory of Probabil- to immediate paradoxes,∞ such as the prior proba- ity as it introduces both exponential families and bility of any finite range being null, which sounds the principle of Jeffreys noninformative priors. The inconsistent with the statement that we know noth- main concepts are already present in the early sec- ing about the parameter, but this results from an tions, including some invariance principles. 
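Before turning to the estimation chapter, the chi-squared goodness-of-fit recipe reviewed above for Chapter II can be restated in a few lines of code. The sketch below is our own toy example with made-up counts; it also reports the lower tail, in line with the remark that a suspiciously small statistic is informative as well.

```python
import numpy as np
from scipy.stats import chi2

# Our toy example: observed counts over 6 categories against equal expected frequencies.
observed = np.array([18, 22, 21, 17, 25, 17])
expected = np.full(6, observed.sum() / 6)

stat = np.sum((observed - expected) ** 2 / expected)
df = len(observed) - 1                      # no parameter is estimated in this toy case
p_upper = chi2.sf(stat, df)                 # usual rejection region: statistic too large
p_lower = chi2.cdf(stat, df)                # a very small statistic is also suspicious
print(f"chi2 = {stat:.2f}, upper-tail p = {p_upper:.3f}, lower-tail p = {p_lower:.3f}")
```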
The pur- over-interpretation of the measure as a probability pose of the chapter is stated as a point estimation distribution already pointed out by Lindley (1971, problem, where obtaining the probability distribu- 1980) and Kass and Wasserman (1996). tion of [the] parameters, given the observations is The argument for using a flat (Lebesgue) prior is the goal. Note that estimation is not to be under- based (a) on its use by both Bayes and Laplace in stood in the (modern?) sense of point estimation, finite or compact settings, and (b) on the argument that is, as a way to produce numerical substitutes that it correctly reflects the absence of prior knowl- for the true parameters that are based on the data, edge about the value of the parameter. At this stage, no point is made against it for reasons related with 11Interestingly enough, the parameters are estimated by the invariance principle—there is only one parame- 2 minimum χ rather than either maximum likelihood or terization that coincides with a uniform prior—but Bayesian point estimates. This is, again, a reflection of the Jeffreys already argues that flat priors cannot be practice of the time, coupled with the fact that most ap- proaches are asymptotically indistinguishable. Posterior ex- used for significance tests, because they would al- pectations are not at all advocated as Bayes (point) estima- ways reject the point null hypothesis. Even though tors in Theory of Probability. Bayesian significance tests, including Bayes factors, THEORY OF PROBABILITY REVISITED 9 have not yet been properly introduced, the notion incomplete since missing the fundamental issue of of an infinite mass canceling a point null hypothesis distinguishing proper from improper priors. is sufficiently intuitive to be used at this point. While Haldane’s (1932) prior on probabilities (or While, indeed, using an improper prior is a ma- rather on chances as defined in Section 1.7), jor difficulty when testing point null hypotheses be- 1 cause it gives an infinite mass to the alternative π(p) , ∝ p(1 p) (DeGroot, 1970), Jeffreys fails to identify the prob- − lem as such but rather blames the flat prior applied is dismissed as too extreme (and inconsistent), there to a parameter with a semi-infinite range of pos- is no discussion of the main difficulty with this prior sible values. He then goes on justifying the use of (or with any other improper prior associated with a π(σ) = 1/σ for positive parameters (replicating the finite-support sampling distribution), which is that argument of Lhoste, 1923) on the basis that it is the corresponding posterior distribution is not de- invariant for the change of parameters ̺ = 1/σ, as fined when x (n,p) is either equal to 0 or to n ∼B well as any other power, failing to recognize that (although Jeffreys concludes that x = 0 leads to a other transforms that preserve positivity do not ex- point mass at p = 0, due to the infinite mass nor- 13 hibit such an invariance. One has to admit, however, malization). Instead, the corresponding Jeffreys’s that, from a physicist’s perspective, power trans- prior forms are more important than other mathematical 1 π(p) transforms, such as arctan, because they can be as- ∝ p(1 p) signed meaningful units of measurement, while other − functions cannot. 
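The invariance claim for pi(sigma) = 1/sigma under power transforms follows from the change-of-variables formula: if rho = sigma^k, then pi(rho) = pi(sigma) |dsigma/drho| is again proportional to 1/rho. The Monte Carlo sketch below is our own check; it restricts the prior to a finite range so that it can be sampled, and verifies that log(sigma) uniform implies log(rho) uniform for any power rho = sigma^k.

```python
import numpy as np

# Our check: on a finite range, pi(sigma) = 1/sigma means log(sigma) is uniform.
rng = np.random.default_rng(2)
log_sigma = rng.uniform(np.log(0.1), np.log(10.0), size=100_000)
sigma = np.exp(log_sigma)

for k in (2.0, -1.0, 0.5):                  # rho = sigma^k, e.g. a variance or a rate
    rho = sigma ** k
    counts, _ = np.histogram(np.log(rho), bins=20)
    print(k, counts.std() / counts.mean())  # close to 0: log(rho) is still uniform
```

A transform such as arctan, by contrast, does not preserve this form, which is the limitation of the invariance argument noted in the text.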
At least this seems to be the spirit is suggested with little justificationp against the (truly) of the examples considered in Theory of Probability: uniform prior: we may as well use the uniform dis- Some methods of measuring the charge of an elec- tribution. 2 tron give e, others e . 4.2 Laplace’s Succession Rule There is a vague indication that Jeffreys may also recognize π(σ) = 1/σ as the scale group invariant Section 3.2 contains a Bayesian processing of measure, but this is unclear. An indefensible argu- Laplace’s succession rule, which is an easy introduc- ment follows, namely, that tion given that the parameter of the sampling dis- tribution, a hypergeometric (N, r), is an integer. a ∞ H vn dv vn dv The choice of a uniform prior on r, π(r) = 1/(N +1), 0 a does not require much of a discussion and the pos- Z . Z is only indeterminate when n = 1, which allows us terior distribution − to avoid contradictions about the lack of prior in- r N r N + 1 π(r l,m,N,H)= − formation. Jeffreys acknowledges that this does not | l m l + m + 1 solve the problem since this choice implies that the    .   is available in closed form, including the normal- prior “probability” of a finite interval (a, b) is then izing constant. The posterior predictive probability always null, but he avoids the difficulty by admit- that the next specimen will be of the same type is ting that the probability that σ falls in a particu- then (l +1)/(l +m+1) and more complex predictive lar range is zero, because zero probability does not probabilities can be computed as well. As in earlier imply impossibility. He also acknowledges that the books involving Laplace’s succession rule, the sec- invariance principle cannot encompass the whole tion argues about its truthfulness from a metaphys- range of transforms without being inconsistent, but ical point of view (using classical arguments about he nonetheless sticks to the π(σ) = 1/σ prior as it is better than the Bayes–Laplace rule.12 Once again, 13 the argument sustaining the whole of Section 3.1 is Jeffreys (1931, 1937) does address the problem in a clearer manner, stating that this is not serious, for so long as the sample is homogeneous (meaning x = 0, n) the ex- 12In both the 19th and early 20th centuries, there is a tra- treme values (meaning p = 0, 1) are still admissible, and we dition within the not-yet-Bayesian literature to go to extreme do attach a high probability to the proposition is of one type; lengths in the justification of a particular prior distribution, as while as soon as any exceptions are known the extreme values if there existed one golden prior. See, for example, Broemeling are completely excluded and no infinity arises (Section 10.1, and Broemeling (2003) in this respect. page 195). 10 C. P. ROBERT, N. CHOPIN AND J. ROUSSEAU the probabilities that the sun rising tomorrow and distribution as the posterior distribution and he de- that all swans are white that always seem to be as- rives the predictive probability that the next member sociates themselves with this topic) but, more inter- will be of the first type as estingly, it then moves to introducing a point mass (x1 + 1)/ xi + r. on specific values of the parameter in preparation i for hypothesis testing. 
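The contrast drawn in Section 4.1 above between Haldane's prior, the uniform prior and Jeffreys's Beta(1/2, 1/2) prior is transparent in the conjugate Beta-binomial setting: with x successes out of n trials, the formal posterior is Beta(x + a, n - x + a) with a = 0 (Haldane), a = 1/2 (Jeffreys) or a = 1 (uniform), and Haldane's choice indeed fails to produce a proper posterior when x = 0 or x = n. The sketch below is a minimal comparison of ours, not a computation from the book.

```python
from scipy.stats import beta

def posterior(x, n, a):
    """Formal Beta(x + a, n - x + a) posterior for a binomial chance p."""
    alpha, bb = x + a, n - x + a
    if alpha <= 0 or bb <= 0:
        return None                       # improper case: Haldane's prior with x = 0 or x = n
    return beta(alpha, bb)

priors = {"Haldane (a=0)": 0.0, "Jeffreys (a=1/2)": 0.5, "uniform (a=1)": 1.0}
for x, n in [(3, 10), (0, 10)]:
    for name, a in priors.items():
        post = posterior(x, n, a)
        mean = post.mean() if post is not None else float("nan")
        print(f"x={x:2d}, n={n}, {name:17s} posterior mean = {mean}")
```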
Namely, following a renewed X There could be some connections there with the ir- criticism of the uniform assessment via the fact that relevance of alternative hypotheses later (in time) P (r = N l,m = 0,N,H) l + 1 discussed in polytomous regression models (Gouri´eroux | = P (r = N l = n,N,H) N + 1 and Monfort, 1996), but they are well hidden. In any | case, the Dirichlet distribution is not invariant to the is too small, Jeffreys suggests setting aside a portion introduction of new types. 2k of the prior mass for both extreme values r = 0 4.3 Poisson Distribution and r = N. This is indeed equivalent to using a point mass on the null hypothesis of homogeneity of the The processing of the estimation of the parameter population. While mixed samples are independent α of the Poisson distribution (α) is based on the [improper] prior π(α) 1/α, deemedP to be the cor- of the choice of k (since they exclude those extreme ∝ values), a sample of the first type with l = n leads rect prior probability distribution for scale invariance reasons. Given n observations from (α) with sum to a posterior probability ratio of P Sn, Jeffreys reproduces Haldane’s (1932) derivation P (r = N l = n,N,H) n + 1 k N 1 of the Gamma posterior a(Sn,n) and he notes that | = − , G P (r = N l = n,N,H) N n 1 2k 1 Sn is a sufficient statistic, but does not make a gen- | − − eral property of it at this stage. (This is done in which leads to the crucial question of the choice14 of Section 3.7.) k. The ensuing discussion is not entirely convincing: The alternative choice π(α) 1/√α will be later 1 1 justified in Section 3.10 not as∝ Jeffreys’s (invariant) 2 is too large, 4 is not unreasonable [but] too low in this case. The alternative prior but as leading to a posterior defined for all observations, which is not the case of π(α) 1/α 1 1 ∝ k = + when x = 0, a fact overlooked by Jeffreys. Note that 4 N + 1 π(α) 1/α can nonetheless be advocated by Jeffreys ∝ argues that the classification of possibilities [is] as on the ground that the Poisson process derives from follows: (1) Population homogeneous on account of the exponential distribution, for which α is a scale −αt some general rule. (2) No general rule but extreme parameter: e represents the fraction of the atoms values to be treated on a level with others. This pro- originally present that survive after time t. posal is mostly interesting for its bearing on the 4.4 Normal Distribution continuous case, for, in the finite case, it does not When the sampling variance σ2 of a normal model sound logical to put weight on the null hypothesis (,σ2) is known, the posterior distribution associ- (r =0 and r = N) within the alternative, since this N ated with a flat prior is correctly derived as x1,..., confuses the issue. (See Berger, Bernardo and Sun, 2 | xn (¯x,σ /n) (with the repeated difficulty about 2009, for a recent reappraisal of this approach from the∼N use of a σ-finite measure as a probability). Under the point of view of reference priors.) the joint improper prior Section 3.3 seems to extend Laplace’s succession π(,σ) 1/σ, rule to the case in which the class sampled consists ∝ of several types, but it actually deals with the (much the (marginal) posterior on is obtained as a Stu- more interesting) case of Bayesian inference for the dent’s t 2 multinomial (n; p1,...,pr) distribution, when us- (n 1, x,s¯ /n(n 1)) M T − − ing the Dirichlet (1,..., 1) distribution as a prior. 
2 D distribution, while the marginal posterior on σ is Jeffreys recovers the Dirichlet (x1 + 1,...,xr + 1) an inverse gamma ((n 1)/2,s2/2).15 D IG −

14A prior weight of 2k = 1/2 is reasonable since it gives 15Section 3.41 also contains the interesting remark that, equal probability to both hypotheses. conditional on two observations, x1 and x2, the posterior THEORY OF PROBABILITY REVISITED 11

Jeffreys notices that, when n = 1, the above prior does not lead to a proper posterior since π( x ) | 1 ∝ 1/ x1 is not integrable, but he concludes that the| solution− | degenerates in the right way, which, we suppose, is meant to say that there is not enough information in the data. But, without further for- malization, it is a delicate conclusion to make. Under the same noninformative prior, the predic- tive density of a second sample with sufficient statis- 16 tic (¯x2,s2) is found to be proportional to

n n −(n1+n2−1)/2 n s2 + n s2 + 1 2 (¯x x¯ )2 . 1 1 2 2 n + n 2 − 1  1 2  A direct conclusion is that this implies thatx ¯2 and Fig. 1. Seven posterior distributions on the values of accel- s2 are dependent for the predictive, if independent eration due to gravity (in cm/sec2) at locations in East Africa given and σ, while the marginal predictives onx ¯2 when using a noninformative prior. 2 and s2 are Student’s t and Fisher’s z, respectively. Extensions to the prediction of multiple future sam- ples with the same (Section 3.43) or with different graph in Section 3.44 contains hints about hierar- (Section 3.44) means follow without surprise. In the chical Bayes modeling as a way of strengthening es- latter case, given m samples of nr (1 r m) nor- timation, which is a perspective later advanced in mal ( ,σ2) measurements, the posterior≤ ≤ on σ2 favor of this approach (Lindley and Smith, 1972; N i under the noninformative prior Berger and Robert, 1990). The extension in Section 3.5 to the setting of the π(1,...,r,σ) 1/σ normal linear regression model should be simple (see, ∝ e.g., Marin and Robert, 2007, Chapter 3), except is again an inverse gamma (ν/2,s2/2) distribu- that the use of tensorial conventions—like when a tion,17 with s2 = (x IGx¯ )2 and ν = n , r i ri − r r r suffix i is repeated it is to be given all values from 1 while the posterior on t = √n ( x¯ )/s is a Stu- i i i to m—and the absence of matrix notation makes the dent’s t with ν degreesP P of freedom− for all iP’s (no reading quite arduous for today’s readers.18 Because matter what the number of observations within this of this lack of matrix tools, Jeffreys uses an implicit group is). Figure 1 represents the posteriors on the diagonalization of the regressor matrix XTX (with means for the data set analyzed in this section on i modern notation) and thus expresses the posterior seven sets of measurements of the gravity. A para- in terms of the transforms ξi of the regression co- efficients βi. This section is worth reading if only probability that µ is between both observations is exactly 1/2. to realize the immense advantage of using matrix Jeffreys attributes this property to the fact that the scale σ is directly estimated from those two observations under a nonin- notation. The case of regression equations formative prior. Section 3.8 generalizes the observation to all y = X β + ε , ε (0,σ2), location-scale families with median equal to the location. Oth- i i i i ∼N i erwise, the posterior probability is less than 1/2. Similarly, the with different unknown variances leads to a poly- probability that a third observation x will be between x and 3 1 t output (Bauwens, 1984) under a noninformative x2 is equal to 1/3 under the predictive. While Jeffreys gives a proof by complete integration, this is a direct consequence prior, which is deemed to be a complication, and 2 2 of the exchangeability of x1, x2 and x3. Note also that this is Jeffreys prefers to revert to the case when σi = ωiσ one of the rare occurrences of a credible interval in the book. with known ω ’s.19 The final part of this section 16 2 i In the current 1961 edition, n2s2 is mistakenly typed as 2 2 n2s2 in equation (6) of Section 3.42. 17 18 Jeffreys does not use the term “inverse gamma distribu- Using the notation ci for yi, xi for βi, yi for βˆi and air tion” but simply notes that this is a distribution with a scale for xir certainly makes reading this part more arduous. parameter that is given by a single set of tables (for a given ν). 
19Sections 3.53 and 3.54 detail the numerical resolution of He also notices that the distribution of the transform log(σ/s) the normal equations by iterative methods and have no real is closer to a normal distribution than the original. bearing on modern Bayesian analysis. 12 C. P. ROBERT, N. CHOPIN AND J. ROUSSEAU mentions the interesting subcase of estimating a nor- which is also the posterior obtained directly from the mal mean α when truncated at α = 0: negative ob- distribution of̺ ˆ. Indeed, the sampling distribution servations do not need to be rejected since only the is given by posterior distribution has to be truncated in 0. [In n 2 a similar spirit, Section 3.6 shows how to process a f(ˆ̺ ̺)= − (1 ̺ˆ2)(n−4)/2 uniform (α σ, α + σ) distribution under the non- | √2π − informativeU π−(α, σ) = 1/σ prior.] Γ(n 1) (1 ̺2)(n−1)/2 − Section 3.9 examines the estimation of a two-dimen- − Γ(n 1/2) sional covariance matrix − (1 ̺̺ˆ)−(n−3/2) σ2 ̺στ Θ= − ̺στ τ 2 F 1/2, 1/2; n 1/2; (1 + ̺̺ˆ)/2 .   2 1{ − } under centred normal observations. The prior advo- There is thus no marginalization paradox (Dawid, cated by Jeffreys is π(τ, σ, ̺) 1/τσ, leading to the ∝ Stone and Zidek, 1973) for this prior selection, while (marginal) posterior one occurs for the alternative choice π(τ, σ, ̺) 1/τ 2σ2. ∝ π(̺ ̺,nˆ ) | 4.5 Sufficiency and Exponential Families ∞ (1 ̺2)n/2 − dβ Section 3.7 generalizes20 observations made pre- ∝ (cosh β ̺̺ˆ)n Z0 − viously about sufficient statistics for particular dis- (1 ̺2)n/2 tributions (Poisson, multinomial, normal, uniform). = − (1 ̺̺ˆ)n−1/2 If there exists a sufficient statistic T (x) when x − f(x α), the posterior distribution on α only depends∼ 1 (1 u)n−1 on T| (x) and on the number n of observations.21 The − 1 (1 + ̺̺ˆ)u/2 −1/2 du √2u { − } generic form of densities from exponential families Z0 that only depends on̺ ˆ. (Jeffreys notes that, when σ log f(x α) = (x α)′(α)+ (α)+ ψ(x) and τ are known, the posterior of ̺ also depends on | − the empirical variances for both components. This is obtained by a convoluted argument of imposingx ¯ paradoxical increase in the dimension of the suffi- as the MLE of α, which is not equivalent to requiring cient statistics when the number of parameters is de- x¯ to be sufficient. The more general formula creasing is another illustration of the limited mean- f(x α ,...,α ) ing of marginal sufficient statistics pointed out by | 1 m Basu, 1988.) While this integral can be computed m via confluent hypergeometric functions (Gradshteyn = φ(α1,...,αm)ψ(x)exp us(α)vs(x) s=1 and Ryzhik, 1980), X 1 (1 x)n−1 is provided as a consequence of the (then very re- − du cent) Pitman–Koopman[–Darmois] theorem22 on the 0 u(1 au) Z − necessary and sufficient connection between the ex- =pB(1/2,n)2F1 1/2, 1/2; n + 1/2; (1 + ̺̺ˆ)/2 , istence of fixed dimensional sufficient statistics and { } the corresponding posterior is certainly less manage- exponential families. The theorem as stated does not impose a fixed support on the densities f(x α) able than the inverse Wishart that would result from | a power prior Θ γ on the matrix Θ itself. The ex- and this invalidates the necessary part, as shown tension to noncentred| | observations with flat priors in Section 3.6 with the uniform distribution. It is on the means induces a small change in the outcome in that 20Jeffreys’s derivation remains restricted to the unidimen- 2 (n−1)/2 sional case. 
(1 ̺ ) 21 π(̺ ̺,nˆ ) − Stating that n is an ancillary statistic is both formally | ∝ (1 ̺̺ˆ)n−3/2 correct in Fisher’s sense (n does not depend on α) and am- − biguous from a Bayesian perspective since the posterior on α 1 (1 u)n−2 − depends on n. 22Darmois (1935) published a version (in French) of this 0 √2u Z theorem in 1935, about a year before both Pitman (1936) 1 (1 + ̺̺ˆ)u/2 −1/2 du, and Koopman (1936). { − } THEORY OF PROBABILITY REVISITED 13 only later in Section 3.6 that parameter-dependent is Fisher information.24 A first comment of impor- supports are mentioned, with an unclear conclusion. tance is that I(α) is equivariant under reparameter- Surprisingly, this section does not contain any in- ization, because both distances are functional dis- dication that the specific structure of exponential tances and thus invariant for all nonsingular trans- families could be used to construct conjugate23 pri- formations of the parameters. Therefore, if α′ is a ors (Raiffa, 1968). This lack of connection with reg- (differentiable) transform of α, ular priors highlights the fully noninformative per- dα dαT spective advocated in Theory of Probability, despite I(α′)= I(α) , comments (within the book) that priors should re- dα′ dα′ flect prior beliefs and/or information. and this is the spot where Jeffreys states his general principle for deriving noninformative priors (Jeffreys’s 4.6 Predictive Densities priors):25 Section 3.8 contains the rather amusing and not π(α) I(α) 1/2 well-known result that, for any location-scale para- ∝ | | metric family such that the location parameter is is thus an ideal prior in that it is invariant under the median, the posterior probability that the third any (differentiable) transformation. observation lies between the first two observations Quite curiously, there is no motivation for this is 1/2. This may be the first use of Bayesian predic- choice of priors other than invariance (at least at this tive distributions, that is, p(x x ,x ) in this case, 3| 1 2 stage) and consistency (at the end of the chapter). where parameters are integrated out. Such predic- Fisher information is only perceived as a second or- tive distributions cannot be properly defined in fre- der approximation to two functional distances, with quentist terms; at best, one may take p(x θ = θˆ) 3| no connection with either the curvature of the like- where θˆ is a plug-in estimator. Building more sen- lihood or the variance of the score function, and no sible predictives seems to be one major appeal of mention of the information content at the current the Bayesian approach for modern practitioners, in value of the parameter or of the local discriminating particular, econometricians. power of the data. Finally, no connection is made 4.7 Jeffreys’s Priors at this stage with Laplace’s approximation (see Sec- tion 4.0). The motivation for centering the choice Section 3.10 introduces Fisher information as a of the prior at I(α) is thus uncertain. No mention quadratic approximation to distributional distances. is made either of the potential use of those func- Given the Hellinger distance and the Kullback– tional distances as intrinsic loss functions for the Leibler divergence, [point] estimation of the parameters (Le Cam, 1986; Robert, 1996). 
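The second-order approximations of the Hellinger and Kullback-Leibler type divergences by the Fisher-information quadratic form, discussed in Section 4.7, are easy to verify numerically. The sketch below is our own check on a Bernoulli(p) model, for which I(p) = 1/{p(1 - p)}; the values of p and of the perturbation delta are arbitrary.

```python
import numpy as np

# Our numerical check of the quadratic approximations for a Bernoulli(p) model.
p, delta = 0.3, 0.01
q = p + delta
I = 1.0 / (p * (1.0 - p))                      # Fisher information of Bernoulli(p)

# d1: squared Hellinger distance; d2: symmetrized Kullback-Leibler divergence.
d1 = (np.sqrt(p) - np.sqrt(q)) ** 2 + (np.sqrt(1 - p) - np.sqrt(1 - q)) ** 2
d2 = (p - q) * np.log(p / q) + ((1 - p) - (1 - q)) * np.log((1 - p) / (1 - q))

print(f"d1 = {d1:.6e}  vs  delta^2 I / 4 = {delta**2 * I / 4:.6e}")
print(f"d2 = {d2:.6e}  vs  delta^2 I     = {delta**2 * I:.6e}")
```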
However, the use of these intrinsic d (P, P ′)= (dP )1/2 (dP ′)1/2 2 1 | − | divergences (measures of discrepancy) to introduce Z and I(α) as a key quantity seems to indicate that Jeffreys dP understood I(α) as a local discriminating power of d (P, P ′)= log d(P P ′), the model and to some extent as the intrinsic fac- 2 dP ′ − Z tor used to compensate for the lack of invariance we have the second-order approximations of α α′ 2. It corroborates the fact that Jeffreys’s | − | 1 ′ T ′ priors are known to behave particularly well in one- d (P , P ′ ) (α α ) I(α)(α α ) 1 α α ≈ 4 − − dimensional cases. and Immediately, a problem associated with this generic ′ T ′ principle is spotted by Jeffreys for the normal distri- d2(Pα, Pα′ ) (α α ) I(α)(α α ), bution (,σ2). While, when considering and σ ≈ − − N where ∂f(x α) ∂f(x α)T 24Jeffreys uses an infinitesimal approximation to derive I(α)= E | | α ∂α ∂α I(α) in Theory of Probability, which is thus not defined this   way, nor connected with Fisher. 25Obviously, those priors are not called Jeffreys’s priors in 23As pointed to us by Dennis Lindley, Section 1.7 comes the book but, as a counter-example to Steve Stigler’s law of close to the concept of exchangeability when introducing eponimy (Stigler, 1999), the name is now correctly associated chances. with the author of this new concept. 14 C. P. ROBERT, N. CHOPIN AND J. ROUSSEAU separately, one recovers the invariance priors π() is found to be the corresponding reference distribu- 1 and π(σ) 1/σ, Jeffreys’s prior on the pair (,σ∝) tion, with a more severe criticism of the other dis- is π(,σ) ∝1/σ2. If, instead, m normal observa- tributions (see Section 4.1): both the usual rule and tions with∝ the same variance σ2 were proposed, they Haldane’s rule are rather unsatisfactory. The corre- m+1 would lead to π(1,...,m,σ) 1/σ , which is sponding Dirichlet (1/2,..., 1/2) prior is obtained unacceptable (because it induces∝ a growing depar- on the probabilitiesD of a multinomial distribution. ture from the true value as m increases). Indeed, if Interestingly too, Jeffreys derives most of his priors one considers the likelihood by recomputing the L2 or Kullback distance and by using a second-order approximation, rather than by L( ,..., ,σ) 1 m following the genuine definition of the Fisher infor- m −mn n 2 2 mation matrix. Because Jeffreys’s prior on the Pois- σ exp 2 (¯xi i) + si , son (λ) parameter is π(λ) 1/√λ, there is some ∝ −2σ { − } P ∝ Xi=1 attempt at justification, with the mention that gen- the marginal posterior on σ is eral rules for the prior probability give a starting m point, that is, act like reference priors (Berger and n σ−mn−1 exp s2, Bernardo, 1992). −2σ2 i i=1 In the case of the (normal) correlation coefficient, X the posterior corresponding to Jeffreys’s prior that is, π(̺, τ, σ) 1/τσ(1 ̺2)3/2 is not properly defined for a single∝ observation,− but Jeffreys does not ex- σ−2 a (mn 1)/2,n s2/2 ∼ G − i pand on the generic improper nature of those prior  i  X distributions. In an attempt close to defining a refer- and ence prior, he notices that, with both τ and σ fixed, n m s2 the (conditional) prior is E[σ2]= i=1 i mn 1 P − 1+ ̺2 whose own expectation is π(̺) 2 , ∝ p1 ̺ mn m − − σ2, which, while improper, can also be compared to the mn 1 0 − arc-sine prior if σ0 denotes the “true” standard deviation. 
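As a side illustration (not part of the original discussion), the 1/σ² outcome and its contrast with the componentwise priors are easy to check symbolically. The following minimal sympy sketch computes the Fisher information of the N(μ, σ²) model as minus the expected Hessian of the log-density and recovers |I(μ, σ)|^{1/2} ∝ 1/σ², against I(μ)^{1/2} ∝ 1 and I(σ)^{1/2} ∝ 1/σ when the parameters are treated one at a time.

```python
import sympy as sp

x = sp.symbols('x', real=True)
mu = sp.symbols('mu', real=True)
sigma = sp.symbols('sigma', positive=True)

# N(mu, sigma^2) density and log-density
f = sp.exp(-(x - mu)**2 / (2 * sigma**2)) / (sp.sqrt(2 * sp.pi) * sigma)
logf = sp.log(f)

params = (mu, sigma)
# Fisher information as minus the expected Hessian of the log-density
I = sp.Matrix(2, 2, lambda i, j: sp.simplify(
    -sp.integrate(sp.diff(logf, params[i], params[j]) * f, (x, -sp.oo, sp.oo))))

print(I)                  # Matrix([[1/sigma**2, 0], [0, 2/sigma**2]])
print(sp.sqrt(I.det()))   # sqrt(2)/sigma**2, i.e. pi(mu, sigma) ∝ 1/sigma^2
# taken one at a time, I(mu)^(1/2) ∝ 1 and I(sigma)^(1/2) ∝ 1/sigma
```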
If n is 1 1 small against m, the bias resulting from this choice π(̺)= , π 1 ̺2 will be important.26 Therefore, in this special case, − Jeffreys proposes a departure from the general rule which is integrable as is. Notep that Jeffreys does not by using π(,σ) 1/σ. (There is a further men- conclude in favor of one of those priors: We cannot tion of difficulties∝ with a large number of param- really say that any of these rules is better than the eters when using one single scale parameter, with uniform distribution. the same solution proposed. There may even be an In the case of exponential families with natural indication about reference priors at this stage, when parameter β, stating that some transforms do not need to be con- f(x β)= ψ(x)φ(β)exp βv(x), sidered.) | The arc-sine law on probabilities, Jeffreys does not take advantage of the fact that 1 1 Fisher information is available as a transform of φ, π(p)= , indeed, π p(1 p) − I(β)= ∂2 log φ(β)/∂β2, p 26As pointed out to us by Lindley (2008, private commu- but rather insists on the invariance of the distri- nication), Jeffreys expresses more clearly the difficulty that bution under location-scale transforms, β = kβ′ + the corresponding t distribution would always be [of index] l, which does not correctly account for potential (n + 1)/2, no matter how many true values were estimated, that is, that the natural reduction of the degrees of freedom boundaries on β. with the number of nuisance parameters does not occur with Somehow, surprisingly, rather than resorting to this prior. the natural “Jeffreys’s prior,” π(β) ∂2 log φ(β)/ ∝ | THEORY OF PROBABILITY REVISITED 15

∂β2 1/2, Jeffreys prefers to use the “standard” flat, of one extra observation. log-flat| and symmetric priors depending on the range H. Jeffreys, Theory of Probability, Section 4.0. of β. He then goes on to study the alternative of defining the noninformative prior via the mean pa- As in Chapter II, many points of this chapter are rameterization suggested by Huzurbazar (see Huzur- outdated by modern Bayesian practice. The main bazar, 1976), bulk of the discussion is about various approxima- tions to (then) intractable quantities or posteriors, (β)= v(x)f(x β) dx. approximations that have limited appeal nowadays | Z when compared with state-of-the-art computational Given the overall invariance of Jeffreys’s priors, this tools. For instance, Sections 4.43 and 4.44 focus on should not make any difference, but Jeffreys chooses the issue of grouping observations for a linear regres- to pick priors depending on the range of (β). For sion problem: if data is gathered modulo a round- instance, this leads him once again to promote the ing process [or if a polyprobit model is to be esti- Dirichlet (1/2, 1/2) prior on the probability p of D mated (Marin and Robert, 2007)], data augmenta- a binomial model if considering that log p/(1 p) tion (Tanner and Wong, 1987; Robert and Casella, 27 − is unbounded, and the uniform prior if consider- 2004) can recover the original values by simulation, ing that (p)= np varies on (0, ). It is interesting ∞ rather than resorting to approximations. Mentions to see that, rather than sticking to a generic prin- are made of point estimators, but there is unfortu- ciple inspired by the Fisher information that Jef- nately no connection with decision theory and loss freys himself recognizes as consistent and that offers functions in the classical sense (DeGroot, 1970; Berger, an almost universal range of applications, he resorts 1985). A long section (Section 4.7) deals with rank to group invariant (Haar) measures when the rule, statistics, containing apparently no connection with though consistent, leads to results that appear to dif- Bayesian Statistics, while the final section (Section 4.9) fer too much from current practice. on randomized designs also does not cover the spe- We conclude with a delicate example that is found cial issue of randomization within Bayesian Statis- within Section 3.10. Our interpretation of a set of tics (Berger and Wolpert, 1988). quantitative laws φ with chances α [such that] if φ r r r The major components of this chapter in terms is true, the chance of a variable x being in a range of Bayesian theory are an introduction to Laplace’s dx is f (x, α ,...,α ) dx is that of a mixture of r r1 rn approximation, although not so-called (with an in- distributions, teresting side argument in favor of Jeffreys’s priors), m some comments on orthogonal parameterisation [un- x α f (x, α ,...,α ). ∼ r r r1 rn derstood from an information point of view] and the r=1 X well-known tramcar example. Because of the complex shape (convex combination) 5.1 Laplace’s Approximation of the distribution, the Fisher information is not readily available and Jeffreys suggests assigning a When the number of observations n is large, the reference prior to the weights (α1,...,αm), that is, posterior distribution can be approximated by a Gaus- a Dirichlet (1/2,..., 1/2), along with separate ref- sian centered at the maximum likelihood estimate D erence priors on the αrs. Unfortunately, this leads with a range of order n−1/2. 
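To make the previous statement concrete (our illustration, with invented data), a minimal sketch compares an exact Beta posterior for a binomial probability with its Laplace, i.e. normal, approximation centred at the MLE with variance 1/{nI(p̂)}.

```python
import numpy as np
from scipy import stats

# Invented data: x successes out of n Bernoulli trials, uniform prior on p,
# so the exact posterior is Beta(x + 1, n - x + 1).
n, x = 50, 17
p_hat = x / n                            # maximum likelihood estimate
info = n / (p_hat * (1 - p_hat))         # n times the Fisher information at the MLE

grid = np.linspace(0.01, 0.99, 99)
exact = stats.beta.pdf(grid, x + 1, n - x + 1)
laplace = stats.norm.pdf(grid, loc=p_hat, scale=info ** -0.5)

# The normal approximation is centred at the MLE with spread of order n^(-1/2);
# the maximal discrepancy shrinks as n grows.
print(np.max(np.abs(exact - laplace)))
```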
There are numerous instances of the use of Laplace's approximation in Bayesian literature (see, e.g., Berger, 1985; MacKay, 2002), but only with specific purposes oriented toward model choice, not as a generic substitute. Jeffreys derives from this approximation an incentive to treat the prior probability as uniform since this is of no practical importance if the number of observations is large. His argument is made more precise through the normal approximation,

L(θ x1,...,xn) 27There is another typo when stating that log p/(1 p) | ranges over (0, ). − L˜(θ x) exp n(θ θˆ)TI(θˆ)(θ θˆ)/2 , ∞ ≈ | ∝ {− − − } 16 C. P. ROBERT, N. CHOPIN AND J. ROUSSEAU to the likelihood. [Jeffreys notes that it is of trivial of this proposal would then be to use variational importance whether I(θ) is evaluated for the actual methods (Jaakkola and Jordan, 2000) or ABC tech- values or for the MLE θˆ.] Since the normalization niques (Beaumont, Zhang and Balding, 2002). factor is An interesting insight is given by the notion of orthogonal parameters in Section 4.31, to be under- (n/2π)m/2 I(θ) 1/2, | | stood as the choice of a parameterization such that using Jeffreys’s prior π(θ) I(θ) 1/2 means that I(θ) is diagonal. This orthogonalization is central in ∝ | | the posterior distribution is properly normalized and the construction of reference priors (Kass, 1989; Tib- ˆ that the posterior distribution of θi θi is nearly shirani, 1989; Berger and Bernardo, 1992; Berger, − ˆ the same (. . .) whether it is taken on data θi or Philippe and Robert, 1998) that are identical to Jef- on θi. This sounds more like a pivotal argument in freys’s priors. Jeffreys indicates, in particular, that Fisher’s fiducial sense than genuine Bayesian rea- full orthogonalization is impossible for m =4 and soning, but it nonetheless brings an additional ar- more dimensions. gument for using Jeffreys’s prior, in the sense that In Section 4.42 the errors-in-variables model is the prior provides the proper normalizing factor. Ac- handled rather poorly, presumably because of com- tually, this argument is much stronger than it first putational difficulties: when considering (1 r n) looks in that it is at the very basis of the construc- ≤ ≤ ′ tion of matching priors (Welch and Peers, 1963). In- yr = αξ + β + εr, xr = ξ + εr, deed, when considering the proper normalizing con- the posterior on (α, β) under standard normal errors 1/2 stant (π(θ) I(θ) ), the agreement between the is frequentist distribution∝ | | of the maximum likelihood π(α, β (x ,y ),..., (x ,y )) estimator and the posterior distribution of θ gets | 1 1 n x closer by an order of 1. n (t2 + α2s2)−1/2 5.2 Outside Exponential Families ∝ r r rY=1 When considering distributions that are not from n (y αx β)2 exponential families, sufficient statistics of fixed di- exp r − r − , − 2(t2 + α2s2) mension do not exist, and the MLE is much harder ( r=1 r r ) to compute. Jeffreys suggests in Section 4.1 using a X which induces a normal conditional distribution on minimum χ2 approximation to overcome this diffi- β and a more complex t-like marginal posterior dis- culty, an approach which is rarely used nowadays. tribution on α that can still be processed by present- A particular example is the poly-t (Bauwens, 1984) distribution day standards. Section 4.45 also contains an interesting example s 2 −(νr+1)/2 ( xr) of a normal (,σ2) sample when there is a known π( x1,...,xs) 1+ − | ∝ ν s2 contributionN to the standard error, that is, when r=1 r r  Y σ2 >σ′2 with σ′ known. In that case, using a flat that happens when several series of observations yield prior on log(σ2 σ′2) leads to the posterior independent estimates [xr] of the same true value − []. The difficulty with this posterior can now be π(,σ x,s¯ 2,n) | easily solved via a Gibbs sampler that demarginal- 1 1 n izes each t density. 
exp ( x¯)2 + s2 , ∝ σ2 σ′2 σn−1 −2σ2 { − } Section 4.3 is not directly related to Bayesian Statis- −   tics in that it is considering (best) unbiased esti- which integrates out over to mators, even though the Rao–Blackwell theorem is 1 1 ns2 somehow alluded to. The closest connection with π(σ s2,n) exp . | ∝ σ2 σ′2 σn−2 −2σ2 Bayesian Statistics could be that, once summary −   statistics have been chosen for their availability, a The marginal obviously has an infinite mode (or corresponding posterior can be constructed condi- pole) at σ = σ′, but there can be a second (and tional on those statistics.28 The present equivalent of the parameters given the statistics seems to precede the 28A side comment on the first-order symmetry between the first-order symmetry of the (posterior and frequentist) confi- probability of a set of statistics given the parameters and that dence intervals established in Welch and Peers (1963). THEORY OF PROBABILITY REVISITED 17
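Since this marginal is straightforward to evaluate, the existence of a second, meaningful mode is easy to check directly. The sketch below (ours) uses the values displayed in Figure 2, namely σ′ = √2, n = 15 and ns² = 100, and locates an interior mode near σ ≈ 2.5 in addition to the pole at σ = σ′.

```python
import numpy as np

# Unnormalised marginal pi(sigma | s^2, n) ∝ exp(-n s^2 / (2 sigma^2)) / ((sigma^2 - sigma'^2) sigma^(n-2)),
# evaluated with the Figure 2 values sigma'^2 = 2, n = 15, n s^2 = 100 (grid choice is ours).
sigma_prime2, n, ns2 = 2.0, 15, 100.0
sigma = np.linspace(1.45, 6.0, 5000)
log_marg = -ns2 / (2 * sigma**2) - np.log(sigma**2 - sigma_prime2) - (n - 2) * np.log(sigma)

# The marginal diverges (pole) as sigma -> sigma' = sqrt(2); away from the pole a second mode appears.
away = sigma > 1.6
print(sigma[away][np.argmax(log_marg[away])])   # roughly 2.5
```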

Therefore, the posterior median (the justification of which as a Bayes estimator is not included) is approximately 2m. Although this point is not discussed by Jeffreys, this example is often mentioned in support of the Bayesian approach against the MLE, since the corresponding maximum likelihood estimator of n is m, always below the true value of n, while the Bayes estimator takes a more reasonable value.
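For the record (our sketch, not in the original), the posterior median is immediate to compute numerically from the posterior π(n | m = 100) ∝ n⁻² of Section 5.3, with the support truncated far beyond any plausible value.

```python
import numpy as np

m = 100                                   # the observed tramcar number
n = np.arange(m, 10**6)                   # truncate the support far beyond plausible values
post = 1.0 / n.astype(float)**2           # pi(n | m) ∝ 1/n^2 on n >= m
post /= post.sum()

median = n[np.searchsorted(np.cumsum(post), 0.5)]
print(median)                             # ≈ 2m, against the maximum likelihood estimate m
```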

6. CHAPTER V: SIGNIFICANCE TESTS: ONE NEW PARAMETER The essential feature is that we express ignorance of whether the new parameter is needed by taking half the prior probability for it as concentrated in Fig. 2. Posterior distribution π(σ s2, n) for σ′ = √2, n = 15 | and ns2 = 100, when using the prior π(µ, σ) 1/σ (blue the value indicated by the null hypothesis and ′ ∝ curve) and the prior π(µ, σ) 1/σ2 σ 2 (brown curve). distributing the other half over the range possible. ∝ − H. Jeffreys, Theory of Probability, Section 5.0. 2 meaningful) mode if s is large enough, as illustrated This chapter (as well as the following one) is con- on Figure 2 (brown curve). The outcome is indeed cerned with the central issue of testing hypotheses, different from using the truncated prior π(,σ) ∝ the title expressing a focus on the specific case of 1/σ (blue curve), but to conclude that the infer- point null hypotheses: Is the new parameter sup- ence using this assessment of the prior probability ′ ported by the observations, or is any variation ex- would be that σ = σ is based once again on the false pressible by it better interpreted as random? 30 The premise that infinite mass posteriors act like Dirac 2 construction of Bayes factors as natural tools for priors, which is not correct: since π(σ s ,n) does not answering such questions does require more math- integrate over σ = σ′, the posterior is| simply not de- 29 ematical rigor when dealing with improper priors fined. In that sense, Jeffreys is thus right in reject- than what is found in Theory of Probability. Even ing this prior choice as absurd. though it can be argued that Jeffreys’s solution (us- 5.3 The Tramcar Problem ing only improper priors on nuisance parameters) is acceptable via a limiting argument (see also Berger, This chapter contains (in Section 4.8) the now Pericchi and Varshavsky, 1998, for arguments based classical “tramway problem” of Newman, about a on group invariance), the specific and delicate fea- man traveling in a foreign country [who] has to change ture of using infinite mass measures would deserve trains at a junction, and goes into the town, the exis- more validation than what is found there. The dis- tence of which he has only just heard. He has no idea cussion on the choice of priors to use for the param- of its size. The first thing that he sees is a tramcar eters of interest is, however, more rewarding since numbered 100. What can he infer about the num- Jeffreys realizes that (point estimation) Jeffreys’s ber of tramcars in the town? It may be assumed that priors cannot be used in this setting (because of they are numbered consecutively from 1 upwards. their improperness) and that an alternative class of This is another illustration of the standard non- (testing) Jeffreys’s priors needs to be introduced. It informative prior for a scale, that is, π(n) 1/n, ∝ seems to us that this second type of Jeffreys’s pri- where n is the number of tramcars; the posterior ors has been overlooked in the subsequent literature, satisfies π(n m = 100) 1/n2I(n 100) and | ∝ ≥ even though the specific case of the Cauchy prior ∞ ∞ is often pointed out as a reference prior for testing P −2 −2 m (n>n0 m)= r r . point null hypotheses involving location parameters. | ≈ n r=n +1 r=m 0 X0  X 30The formulation of the question restricts the test to em- 29For an example of a constant MAP estimator, see Robert bedded hypotheses, even though Section 5.7 deals with nor- (2001, Example 4.2). mality tests. 18 C. P. ROBERT, N. CHOPIN AND J. ROUSSEAU

6.1 Model Choice Formalism This is indeed a stepping stone for Bayesian Statis- tics in that it explicitly recognizes the need to sep- Jeffreys starts by analyzing the question, arate the null hypothesis from the alternative hy- In what circumstances do observations sup- pothesis within the prior, lest the null hypothesis port a change of the form of the law it- is not properly weighted once it is accepted. The self?, overall principle is illustrated for a normal setting, x (θ,σ2) (with known σ2), so that the Bayes from a model-choice perspective, by assigning ∼N factor is prior probabilities to the models Mi that are in competition, π(M ) (i = 1, 2,...). He further con- π(H x) π(H ) i K = 0| 0 strains those probabilities to be terms of a conver- π(H1 x) π(H1) gent series.31 When checking back in Chapter I (Sec- | . exp x2/2σ2 tion 1.62), it appears that this condition is due to the = {− } . f(θ)exp (x θ)2/2σ2 dθ constraint that the probabilities can be normalized {− − } to 1, which sounds like an unnecessary condition if The numericalR calibration of the Bayes factor is not 32 dealing with improper priors at the same time. directly addressed in the main text, except via a The consequence of this constraint is that π(Mi) qualitative divergence from the neutral K = 1. Ap- −i −2 must decrease like 2 or i and it thus (a) pre- pendix B provides a grading of the Bayes factor, as vents the use of equal probabilities advocated before follows: and (b) imposes an ordering of models. Obviously, the use of the Bayes factor eliminates Grade 0. K> 1. Null hypothesis supported. • −1/2 the impact of this choice of prior probabilities, as Grade 1. 1 >K> 10 . Evidence against H0, • it does for the decomposition of an alternative hy- but not worth more than a bare mention. −1/2 −1 pothesis H into a series of mutually irrelevant alter- Grade 2. 10 >K> 10 . Evidence against H0 1 • native hypotheses. The fact that m alternatives are substantial. −1 −3/2 tested at once induces a Bonferroni effect, though, Grade 3. 10 >K> 10 . Evidence against H0 • that is not (correctly) taken into account at the be- strong. ginning of Section 5.04 (even if Jeffreys notes that Grade 4. 10−3/2 >K> 10−2. Evidence against H • 0 the Bayes factor is then multiplied by 0.7m). The very strong. following discussion borders more on “ranking and Grade 5. 10−2 >K>. Evidence against H deci- • 0 selection” than on testing per se, although the use sive. of Bayes factors with correction factor m or m2 is The comparison with the χ2 and t statistics in this the proposed solution. It is only at the end of Sec- appendix shows that a given value of K leads to an tion 5.04 that the Bonferroni effect of repeated test- increasing (in n) value of those statistics, in agree- ing is properly recognized, if not correctly solved from a Bayesian point of view. ment with Lindley’s paradox (see Section 6.3 below). If there are nuisance parameters ξ in the model If the hypothesis to be tested is H0 : θ = 0, against (Section 5.01), Jeffreys suggests using the same prior the alternative H1 that is the aggregate of other pos- sible values [of θ], Jeffreys initiates one of the major on ξ under both alternatives, π0(ξ), resulting in the advances of Theory of Probability by rewriting the general Bayes factor prior distribution as a mixture of a point mass in K = π (ξ)f(x ξ, 0) dξ θ = 0 and of a generic density π on the range of θ, 0 | Z π(θ)= 1 I (θ)+ 1 π(θ). 2 0 2 π (ξ)π (θ ξ)f(x ξ,θ) dξ dθ, 0 1 | | . 
Z 31The perspective of an infinite sequence of models under where π1(θ ξ) is a conditional density. Note that Jef- comparison is not pursued further in this chapter. | 32In Jeffreys (1931), Jeffreys puts forward a similar argu- freys uses a normal model with Laplace’s approxi- ment that it is impossible to construct a theory of quantitative mation to end up with the approximation inference on the hypothesis that all general laws have the same prior probability (Section 4.3, page 43). See Earman (1992) for 1 ngθθ 1 2 K exp ngθθθˆ , a deeper discussion of this point. ≈ π (θˆ ξˆ) 2π −2 1 | r   THEORY OF PROBABILITY REVISITED 19 where θˆ and ξˆ are the MLEs of θ and ξ, and where is mentioned at this stage for reaching a decision gθθ is the component of the information matrix cor- about H0 when looking at K. The only comment responding to θ (under the assumption of strong worth noting there is that K is not very decisive orthogonality between θ and ξ, which means that for small values of n: we cannot get decisive results the MLE of ξ is identical in both situations). The one way or the other from a small sample (without low impact of the choice of π0 on the Bayes fac- adopting a decision framework). The next example tor may be interpreted as a licence to use improper still sticks to a compact parameter space, since it priors on the nuisance parameters despite difficul- deals with the 2 2 contingency table. The null hy- × ties with this approach (DeGroot, 1973). An inter- pothesis H0 is that of independence between both esting feature of this proposal is that the nuisance factors, H0 : p11p22 = p12p21. The reparameterization parameters are processed independently under both in terms of the margins is alternatives/models but with the same prior, with the consequence that it makes little difference to K 1 2 whether we have much or little information about 1 αβ + γ α(1 β) γ − − θ.33 When the nuisance parameters and the param- 2 (1 α)β γ (1 α)(1 β)+ γ − − − − eter of interest are not orthogonal, the MLEs ξˆ0 and but, in order to simplify the constraint ξˆ1 differ and the approximation of the Bayes factor is now min αβ, (1 α)(1 β) π (ξˆ ) 1 ng 1 − { − − } K 0 0 θθ exp ng θˆ2 , γ min α(1 β), (1 α)β , ≈ ˆ ˆ ˆ 2π −2 θθ ≤ ≤ { − − } π0(ξ1) π1(θ ξ1)r   | Jeffreys then assumes that α β 1/2 via a mere which shows that the choice of π0 may have an in- ≤ ≤ fluence too. rearrangement of the table. In this case, π(γ α, β)= 1/α over ( αβ, α(1 β)). Unfortunately,| this as- − − 6.2 Prior Modeling sumption (of being able to rearrange) is not realistic In Section 5.02 Jeffreys perceives the difficulty in when α and β are unknown and, while the author using an improper prior on the parameter of interest notes that in ranges where α is not the smallest, it must be replaced in the denominator [of π(γ α, β)] θ as a normalization problem. If one picks π(θ) or | π1(θ ξ) as a σ-finite measure, the Bayes factor K is by the smallest, the subsequent derivation keeps us- | ing the constraint α β 1/2 and the denominator undefined (rather than always infinite, as put for- ≤ ≤ ward by Jeffreys when normalizing by ). He thus α in the conditional distribution of γ, acknowledg- imposes π(θ) to be of any form whose∞ integral con- ing later that an approximation has been made in verges (to 1, presumably), ending up in the location allowing α to range from 0 to 1 since α<β< 1/2. case34 suggesting a Cauchy (0,σ2) prior as π(θ). 
Obviously, the motivation behind this crude approx- The first example fully processedC in this chapter is imation is to facilitate the computation of the Bayes 35 the innocuous (n,p) model with H0 : p = p0, which factor, as leads to the BayesB factor (n1 + 1)!n2!n1!n2! (n + 1)! K (n + 1) (1) K = px(1 p )n−x ≈ n11!n22!n12!n21!(n + 1)! x!(n x)! 0 − 0 − if the data is under the uniform prior. While K = 1 is recognized as a neutral value, no scaling or calibration of K 1 2 1 n11 n12 n1 33The requirement that ξ′ = ξ when θ = 0 (where ξ′ denotes 2 n21 n22 n2 n n n the nuisance parameter under H1) seems at first meaningless, 1 2 since each model is processed independently, but it could sig- nify that the parameterization of both models must be the The computation of the (true) marginal associ- same when θ = 0. Otherwise, assuming that some parame- ated with this prior (under H1) is indeed involved ters are the same under both models is a source of contention within the Bayesian literature. 34 35 Note that the section seems to consider only location pa- Notice the asymmetry in n1 resulting from the approxi- rameters. mation. 20 C. P. ROBERT, N. CHOPIN AND J. ROUSSEAU
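The binomial Bayes factor just given is simple enough to compute exactly; the following sketch (ours, with invented counts) also maps its value onto the Appendix B grading recalled above.

```python
from math import comb, log10

def binomial_bayes_factor(x, n, p0):
    """K = (n + 1) C(n, x) p0^x (1 - p0)^(n - x): point null p = p0 against a uniform prior on p."""
    return (n + 1) * comb(n, x) * p0**x * (1 - p0)**(n - x)

def jeffreys_grade(K):
    """Jeffreys's Appendix B grading: grade 0 supports H0, grades 1-5 go by half-powers of 10."""
    if K > 1:
        return 0
    return min(5, int(-2 * log10(K)) + 1)

# Invented data: 22 successes out of 30 trials, testing H0: p = 1/2.
K = binomial_bayes_factor(22, 30, 0.5)
print(K, jeffreys_grade(K))               # K ≈ 0.17, grade 2 (substantial evidence against H0)
```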

which is larger than Jeffreys's approximation. A version much closer to Jeffreys's modeling is based on the parameterization

1 2 1 αβ γ(1 β) 2 (1 α)β (1 γ)(1− β) − − − in which case α, β and γ are not constrained by one another and a uniform prior on the three parameters can be proposed. After straightforward calculations, the Bayes factor is given by n !n !(n + 1)!(n + 1)! K = (n + 1) 1 2 1 2 , (n + 1)!n11!n12!n21!n22! Fig. 3. Comparison of a Monte Carlo approximation to the Bayes factor for the 2 2 contingency table with Jeffreys’s which is very similar to Jeffreys’s approximation × 3 approximation, based on 10 randomly generated 2 2 tables since the ratio is (n2 + 1)/(n + 1). Note that the 4 × and 10 generations from the prior. alternative parameterization based on using

1 2 and requires either formal or numerical machine- 1 αβ αγ based integration. For instance, massively simulat- 2 (1 α)(1 β) (1 α)(1 γ) ing from the prior is sufficient to provide this ap- − − − − proximation. As shown by Figure 3, the difference with a uniform prior provides a different answer between the Monte Carlo approximation and Jef- (with ni’s and ni’s being inverted in K). Section 5.12 freys’s approximation is not spectacular, even though reprocesses the contingency table with one fixed mar- Jeffreys’s approximation appears to be always bi- gin, obtaining very similar outcomes.37 ased toward larger values, that is, toward the null In the case of the comparison of two Poisson sam- hypothesis, especially for the values of K larger than ples (Section 5.15), (λ) and (λ′), the null hy- 1. In some occurrences, the bias is such that it means ′ P P pothesis is H0 : λ/λ = a/(1 a), with a fixed. This acceptance versus rejection, depending on which ver- suggests the reparameterization− sion of K is used. However, if one uses instead a Dirichlet (1, 1, 1, 1) λ = αβ, λ′ = (1 α)β′, D − prior on the original parameterization (p11,...,p22), the marginal is (up to the multinomial coefficient) with H0 : α = a. This reparameterization appears to the Dirichlet normalizing constant36 be strongly orthogonal in that x x′ x+x′ −β D(n11 + 1,...,n22 + 1) π(β)a (1 a) β e dβ m1(n) K = − ∝ D(1, 1, 1, 1) π(β)αx(1 α)x′ βx+x′ e−β dβ dα R − ′ ′ (n + 3)! R ax(1 a)x π(β)βx+x e−β dβ = 3! , = − n11!n22!n12!n21! αx(1 α)x′ dα π(β)βx+x′ e−β dβ − R so the (true) Bayes factor in this case is x x′ ′ R a (1 a) R (x + x + 1)! x x′ = − ′ = ′ a (1 a) , n1!n2!n1!n2! 3!(n + 3)! αx(1 α)x dα x!x ! − K = − ((n + 1)!)2 n !n !n !n ! 11 22 12 21 R 37 n1!n2!n1!n2! 3!(n + 3)(n + 2) An interesting example of statistical linguistics is pro- = , cessed in Section 5.14, with the comparison of genders in n11!n22!n12!n21! (n + 1)! Welsh, Latin and German, with Freund’s psychoanalytic sym- bols, whatever that means!, but the fact that both Latin and 36Note that using a Haldane (improper) prior is impossible German have neuters complicated the analysis so much for in this case, since the normalizing constant cannot be elimi- Jeffreys that he did without the neuters, apparently unable nated. to deal with 3 2 tables. × THEORY OF PROBABILITY REVISITED 21 for every prior π(β), a rather unusual invariance by a limiting argument that a null empirical vari- property! Note that, as shown by (1), it also cor- ance implies that σ = 0 and thus that =x ¯ = 0. responds to the Bayes factor for the distribution This constraint is equivalent to the denominator of of x conditional on x + x′ since this is a binomial K diverging, that is, (x + x′, α) distribution. The generalization to the B f(v)vn−1 dv = . Poisson case is therefore marginal since it still fo- ∞ Z cuses on a compact parameter space. A solution39 that works for all n 2 is the Cauchy ≥ 6.3 Improper Priors Enter density, f(v) = 1/π(1 + v2), advocated as such40 a reference prior by Jeffreys (while he criticizes the The bulk of this chapter is dedicated to testing potential use of this distribution for actual data). problems connected with the normal distribution. It While the numerator of K is available in closed form, offers an interesting insight into Jeffreys’s processing ∞ n of improper priors, in that both the infinite mass σ−n−1 exp (¯x2 + s2) dσ −2σ2 and the lack of normalizing constant are not clearly Z0   signaled as potential problems in the book. 
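As an illustration of the "simulating from the prior" device (our sketch, with an invented table), the marginal likelihood of a 2 × 2 table under the Dirichlet D(1, 1, 1, 1) prior can be estimated by averaging the multinomial likelihood over prior draws, and checked against the closed form 3! n!/(n + 3)! that follows from the Dirichlet normalizing constants given above.

```python
import numpy as np
from scipy.special import gammaln
from math import comb

rng = np.random.default_rng(0)
counts = np.array([12, 5, 7, 18])         # an invented 2x2 table, flattened
n = int(counts.sum())

# Monte Carlo estimate of the marginal under H1: p ~ Dirichlet(1, 1, 1, 1)
p = rng.dirichlet(np.ones(4), size=10**5)
log_lik = gammaln(n + 1) - gammaln(counts + 1).sum() + (counts * np.log(p)).sum(axis=1)
m1_mc = np.exp(log_lik).mean()

m1_exact = 1 / comb(n + 3, 3)             # closed form: 3! n! / (n + 3)!
print(m1_mc, m1_exact)                    # the prior average agrees with the exact marginal up to Monte Carlo error
```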
n −n/2 = (¯x2 + s2) Γ(n/2), In the original problem of testing the nullity of a 2 normal mean, when x ,...,x (,σ2), Jeffreys   1 n this is not the case for the denominator and Jeffreys uses a reference prior π (σ) ∼Nσ−1 under the null 0 studies in Section 5.2 some approximations to the hypothesis and the same reference∝ prior augmented Bayes factor, the simplest41 being by a proper prior on under the alternative, K πν/2(1 + t2/ν)−(ν+1)/2, 1 1 ≈ π (,σ) π (/σ) , where ν = n 1 and t = √νx/s¯ (which is the stan- 1 ∝ σ 11 σ p dard t statistic− with a constant distribution over ν where σ is used as a scale for . The Bayes factor is under the null hypothesis). Although Jeffreys does then defined as not explicitly delve into this direction, this approx- ∞ n imation of the Bayes factor is sufficient to expose K = σ−n−1 exp (¯x2 + s2) dσ Lindley’s paradox (Lindley, 1957), namely, that the −2σ2 Z0   Bayes factor K, being equivalent to πν/2exp t2/2 , ∞ +∞ {− } −n−2 goes to with ν for a fixed value of t, thus high- π11(/σ)σ ∞ p 0 −∞ lighting the increasing discrepancy between the fre- . Z Z quentist and the Bayesian analyses of this testing n exp problem (Berger and Sellke, 1987). As pointed out −2σ2  to us by Lindley (private communication), the para- dox is sometimes called the Lindley–Jeffreys para- [(¯x )2 + s2] dσ d dox, because this section clearly indicates that t in- −  creases like (log ν)1/2 to keep K constant. without any remark on the use of an improper prior The correct Bayes factor can of course be approx- in both the numerator and the denominator.38 There imated by a Monte Carlo experiment, using, for in- is therefore no discussion about the point of using an stance, samples generated as improper prior on the nuisance parameters present n + 1 ns2 σ−2 a , and σ (¯x,σ2/n). in both models, that has been defended later in, ∼ G 2 2 | ∼N for example, Berger, Pericchi and Varshavsky (1998)   with deeper arguments. The focus is rather on a ref- 39There are obviously many other distributions that also erence choice for the proper prior π11. Jeffreys notes satisfy this constraint. The main drawback of the Cauchy pro- that, if π11 is even, K =1 when n =1, and he forces posal is nonetheless that the scale of 1 is arbitrary, while it the Bayes factor to be zero when s2 = 0 andx ¯ = 0, clearly has an impact on posterior results. 40Cauchy random variables occur in practice as ratios of normal random variables, so they are not completely implau- 38If we extrapolate from earlier remarks by Jeffreys, his sible. justification may be that the same normalizing constant 41The closest to an explicit formula is obtained just before (whether or not it is finite) is used in both the numerator Section 5.21 as a representation of K through a single integral and the denominator. involving a confluent hypergeometric function. 22 C. P. ROBERT, N. CHOPIN AND J. ROUSSEAU

Fig. 4. Comparison of a Monte Carlo approximation to the Fig. 5. Monte Carlo approximation to the Bayes factor for Bayes factor for the normal mean problem with Jeffreys’s ap- the normal mean problem with known variance, compared with proximation, based on 5 103 randomly generated normal suf- Jeffreys’s approximation, based on 106 Monte Carlo simula- × 4 ficient statistics with n = 10 and 10 Monte Carlo simulations tions of µ, when n = 5. of (µ, σ). above, he deduces that the Cauchy prior he proposed The difference between the t approximation and the on is equivalent to a flat prior on arctan J 1/2: true value of the Bayes factor can be fairly impor- d 1 dJ 1/2 1 tant, as shown on Figure 4 for n = 10. As in Fig- −1 1/2 2 2 = = d tan J () , ure 3, the bias is always in the same direction, the πσ(1 + /σ ) π 1+ J π { } approximation penalizing H0 this time. Obviously, and turns this coincidence into a general rule.42 In as n increases, the discrepancy decreases. (The up- particular, the change of variable from to J is per truncation on the cloud is a consequence of Jef- not one-to-one, so there is some technical difficulty freys’s approximation being bounded by πν/2.) linked with this proposal: Jeffreys argues that J 1/2 The Cauchy prior on the mean is also a computa- should be taken to have the same sign as but this p tional hindrance when σ is known: the Bayes factor is not satisfactory nor applicable in general settings. is then Obviously, the symmetrization will not always be possible and correcting when the inverse tangents K = exp nx¯2/2σ2 {− } do not range from π/2 to π/2 can be done in many ∞ − 1 n 2 d ways, thus making the idea not fully compatible exp 2 (¯x ) 2 2 . πσ −∞ −2σ − 1+ /σ with the general invariance principle at the core of . Z    Theory of Probability. Note, however, that Jeffreys’s In this case, Jeffreys proposes the approximation idea of using a functional of the Kullback–Leibler 1 divergence (or of other divergences) as a reference K 2/πn , ≈ 1 +x ¯2/σ2 parameterisation for the new parameter has many p interesting applications. For instance, it is central to which is then much more accurate, as shown by Fig- the locally conic parameterization used by Dacunha- ure 5: the maximum ratio between the approximated Castelle and Gassiat (1999) for testing the number K and the value obtained by simulation is 1.15 for of components in mixture models. n = 5 and the difference furthermore decreases as n In the first case he examines, namely, the case of increases. the contingency table, Jeffreys finds that the cor- 6.4 A Second Type of Jeffreys Priors responding Kullback divergence depends on which In Section 5.3 Jeffreys makes another general pro- 42We were not aware of this rule prior to reading the book posal for the selection of proper priors under the and this second type of Jeffreys’s priors, judging from the alternative hypothesis: Noticing that the Kullback Bayesian literature, does not seem to have inspired many fol- divergence is J( σ) = 2/σ2 in the normal case lowers. | THEORY OF PROBABILITY REVISITED 23

leads to (modulo the improper change of variables) J(ζ) = 2 sinh²(ζ) and

(1/π) d arctan{J^{1/2}(ζ)}/dζ = √2 cosh(ζ)/{π cosh(2ζ)}

as a potential (and overlooked) prior on ζ = log(σ/σ0).43 The corresponding Bayes factor is not available in closed form since

∫_{−∞}^{∞} [cosh(ζ)/cosh(2ζ)] e^{−nζ} exp{−ns²/(2σ0² e^{2ζ})} dζ = ∫_0^{∞} [(1 + u²)/(1 + u⁴)] u^n exp(−ns² u²/2) du

cannot be analytically integrated, even though a Monte Carlo approximation is readily computed. Figure 7 shows that Jeffreys's approximation,

Fig. 6. Jeffreys's reference density on log(σ/σ0) for the test of H0 : σ = σ0.

cosh(2 log s/σ0) n K πn/2 (s/σ0) ≈ cosh(log s/σ0) p exp n(1 (s/σ )2)/2 , { − 0 } is again fairly accurate since the ratio is at worst 0.9 for n = 10 and the difference decreases as n in- creases. The special case of testing a normal correlation co- efficient H0 : ρ = ρ0 is not processed (in Section 5.5) via this general approach but, based on arguments connected with (a) the earlier difficulties in the con- struction of an appropriate noninformative prior (Sec- tion 4.7) and (b) the fact that J diverges for the null hypothesis44 ρ = 1, Jeffreys falls back on the uni- form ( 1, 1) solution,± which is even more convinc- Fig. 7. Ratio of a Monte Carlo approximation to the Bayes U − factor for the normal variance problem and of Jeffreys’s ap- ing in that it leads to an almost closed-form solution proximation, when n = 10 (based on 104 simulations). 2(1 ρ2)n/2/(1 ρρˆ)n−1/2 K = − 0 − . 1 2 n/2 n−1/2 −1(1 ρ ) /(1 ρρˆ) dρ margins are fixed (as is well known, the Fisher in- − − formation matrix is not fully compatible with the Note that Jeffreys’sR approximation, Likelihood Principle, see Berger and Wolpert, 1988). 2n 1 1/2 (1 ρ2)n/2(1 ρˆ2)(n−3)/2 K − − 0 − , Nonetheless, this is an interesting insight that pre- ≈ π (1 ρρˆ)n−1/2 cedes the reference priors of Bernardo (1979): given   − nuisance parameters, it derives the (conditional) prior is quite reasonable in this setting, as shown by Fig- on the parameter of interest as the Jeffreys prior for ure 8, and also that the value of ρ0 has no influence the conditional information. See Bayarri and Garcia- on the ratios of the approximations. The extension Donato (2007) for a modern extension of this per- 43 spective to general testing problems. Note that this is indeed a probability density, whose shape is given in Figure 6, despite the loose change of vari- In the case (Section 5.43) of testing whether a ables, because a missing 2 cancels with a missing 1/2! [normal] standard error has a suggested value σ0 44This choice of the null hypothesis is somehow unusual, when observing ns2 a(n/2,σ2/2), the parame- since, on the one hand, it is more standard to test for no terization ∼ G correlation, that is, ρ = 0, and, on the other hand, having ρ = 1 is akin to a unit-root test that, as we know today, ζ ± σ = σ0e requires firmer theoretical background. 24 C. P. ROBERT, N. CHOPIN AND J. ROUSSEAU
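The claim of footnote 43 that √2 cosh(ζ)/{π cosh(2ζ)} is a proper density on ζ = log(σ/σ0) is immediate to verify numerically; our check below truncates the integration range since the tails decay like e^{−|ζ|}.

```python
import numpy as np
from scipy.integrate import quad

# Jeffreys's "testing" prior on zeta = log(sigma / sigma0)
def pi_zeta(z):
    return np.sqrt(2) * np.cosh(z) / (np.pi * np.cosh(2 * z))

# the tails decay like exp(-|zeta|), so [-40, 40] is effectively the whole real line
total, err = quad(pi_zeta, -40, 40)
print(total)        # 1.0 up to quadrature error: a proper density indeed
```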

for small τ 2, without further specification. The (nu- merical) complexity of the problem may explain why Jeffreys differs from his usual processing, although current computational tools obviously allow for a complete processing (modulo the proper choice of a prior on τ) (see, e.g., Ghosh and Meeden, 1984). Jeffreys also advocates using this principle for test- ing a normal distribution against alternatives from the Pearson family of distributions in Section 5.7, but no detail is given as to how J is computed and how the Bayes factor is derived. Similarly, for the comparison of the Poisson distribution with the neg- ative binomial distribution in Section 5.8, the form of J is provided for the distance between the two Fig. 8. Ratio of a Monte Carlo approximation to the Bayes distributions, but the corresponding Bayes factor is factor for the normal variance problem and of Jeffreys’s ap- 4 only given via a very crude approximation with no proximation, when n = 10 and ρ0 = 0 (based on 10 simula- tions). mention of the corresponding priors. In Section 5.9 the extension of the (regular) model to the case of (linear) regression and of variable se- to two samples in Section 5.51 (for testing whether lection is briefly considered, noticing that (a) for a or not the correlation is the same) is not processed single regressor (Section 5.91), the problem is ex- in a symmetric way, with some uncertainty about actly equivalent to testing whether or not a normal the validity of the expression for the Bayes factor: mean is equal to 0 and (b) for more than one re- a pseudo-common correlation is defined under the gressor (Section 5.92), the test of nullity of one coef- alternative in accordance with the rule that the pa- ficient can be done conditionally on the others, that rameter ρ must appear in the statement of H1, but normalizing constraints on ρ are not properly as- is, they can be treated as nuisance parameters under sessed.45 both hypotheses. (The case of linear calibration in A similar approach is adopted for the compari- Section 5.93 is also processed as a by-product.) son of two correlation coefficients, with some quasi- 6.5 A Foray into Hierarchical Bayes hierarchical arguments (see Section 6.5) for the defi- nition of the prior under the alternative. Section 5.6 Section 5.4 explores further tests related to the is devoted to a very specific case of correlation anal- normal distribution, but Section 5.41 starts with a ysis that corresponds to our modern random effect highly unusual perspective. When testing whether model. A major part of this section argues in fa- or not the means of two normal samples—with like- vor of the model based on observations in various lihood L(1,2,σ) proportional to fields, but the connection with the chapter is the n devising of a test for the presence of those random σ−n1−n2 exp 1 (¯x )2 −2σ2 1 − 1 effects. The model is then formalized as normal ob-  2 2 2 2 servations xr (,τ + σ /kr) (1 r m), where n2 2 n1s1 + n2s2 ∼N ≤ ≤ (¯x2 2) , kr denotes the number of observations within class − 2σ2 − − 2σ2 r and τ is the variance of the random effect. The  —are equal, that is, H : = , Jeffreys also intro- null hypothesis is therefore H0 : τ = 0. Even at this 0 1 2 stage, the development is not directly relevant, ex- duces the value of the common mean, , into the cept for approximation purposes, and the few lines alternative. 
A possible, albeit slightly apocryphal, of discussion about the Bayes factor indicate that interpretation is to consider as an hyperparame- the (testing) Jeffreys prior on τ should be in 1/τ 2 ter that appears both under the null and under the alternative, which is then an incentive to use a single 45To be more specific, a normalizing constant c on the dis- improper prior under both hypotheses (once again tribution of ρ2 that depends on ρ appears in the closed-form because of the lack of relevance of the correspond- expression of K, as, for instance, in equation (14). ing pseudo-normalizing constant). But there is still THEORY OF PROBABILITY REVISITED 25 a difficulty with the introduction of three different n n s2 + n s2 2 (¯x )2 1 1 2 2 dσ d alternatives with a hyperparameter : − 2σ2 2 − − 2σ2  −n+1 1 = and 2 = , 1 = and 2 = , σ n1 2 2 exp 2 (¯x1 1) = and = . π −2σ − 1 2 .Z  2 2 n2 2 n1s1 + n2s2 Given that has no intrinsic meaning under the al- (¯x2 2) 46 − 2σ2 − − 2σ2 ternative, the most logical translation of this mul-  tiplication of alternatives is that the three formula- /( σ2 + ( )2 { 1 − } tions lead to three different priors, 2 2 σ + (2 ) ) dσ d d1 d2 1 1 I { − } π11(,1,2,σ) 2 2 µ1=µ, ∝ π σ + (2 ) makes more sense because of the presence of σ and − on both the numerator and the denominator. While 1 1 I π12(,1,2,σ) 2 2 µ2=µ, the numerator can be fully integrated into ∝ π σ + (1 ) − π/2nΓ (n 1)/2 (ns2/2)−(n−1)/2, { − } 0 π13(,1,2,σ) 2 where nsp0 denotes the usual sum of squares, the 1 σ denominator 2 2 2 2 2 . ∝ π σ + (1 ) σ + (2 ) −n { − }{ − } σ n1 2 exp (¯x1 1) When π11 and π12 are written in terms of a Dirac π/2 −2σ2 − Z  mass, they are clearly identical, 2 2 n2 2 n1s1 + n2s2 (¯x2 2) π11(1,2,σ)= π12(1,2,σ) − 2σ2 − − 2σ2  1 1 /(4σ2 + ( )2) dσ d d 2 2 . 1 2 1 2 ∝ π σ + (1 2) − − does require numerical or Monte Carlo integration. If we integrate out in π , the resulting posterior 13 It can actually be written as an expectation under is the standard noninformative posteriors, 2 1 π ( , ,σ) , 2 2 2 13 1 2 2 2 σ ((n 3)/2, (n s + n s )/2), ∝ π 4σ + (1 2) ∼ IG − 1 1 2 2 − 2 2 whose only difference from π is that the scale in 1 (¯x1,σ /n1), 2 (¯x2,σ /n2), 11 ∼N ∼N the Cauchy is twice as large. As noticed later by of the quantity Jeffreys, there is little to choose between the alterna- 2 tives, even though the third modeling makes more h(1,2,σ ) sense from a modern, hierarchical point of view: 2 Γ((n 3)/2) (n s2 + n s2)/2 −(n−3)/2 and σ denote the location and scale of the prob- = − { 1 1 2 2 } . √n n 4σ2 + ( )2 lem, no matter which hypothesis holds, with an ad- 1 2 1 − 2 ditional parameter (1,2) in the case of the alter- When simulating a range of values of the sufficient native hypothesis. Using a common improper prior statistics (ni, x¯i,si)i=1,2, the difference between the under both hypotheses can then be justified via a Bayes factor and Jeffreys’s approximation, limiting argument, as in Marin and Robert (2007), 1/2 because those parameters are common to both mod- π n1n2 K 2 els. Seen as such, the Bayes factor ≈ 2 n + n  1 2  2 −(n1+n2−1)/2 −n−1 n1 2 (¯x1 x¯2) σ exp 2 (¯x1 ) 1+ n n n + n − , −2σ − 1 2 1 2 n s2 + n s2 Z   1 1 2  is spectacular, as shown in Figure 9. 
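The scale-doubling claim for π13 reduces to a property of the Cauchy distribution, namely that integrating μ out of the product of two C(μi, σ) factors leaves a Cauchy density in μ1 − μ2 with scale 2σ; a quick numerical check (ours, with arbitrary values) follows.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import cauchy

sigma, mu1, mu2 = 1.5, 0.7, -2.0            # arbitrary illustrative values

# integrate the hyperparameter mu out of the product of the two Cauchy(mu_i, sigma) factors of pi_13
integrand = lambda mu: cauchy.pdf(mu, loc=mu1, scale=sigma) * cauchy.pdf(mu, loc=mu2, scale=sigma)
val, _ = quad(integrand, -np.inf, np.inf)

# the result is the Cauchy(0, 2 sigma) density evaluated at mu1 - mu2: twice the scale of pi_11
print(val, cauchy.pdf(mu1 - mu2, loc=0.0, scale=2 * sigma))
```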
The larger discrepancy (when compared to earlier figures) can be attributed in part to the larger number of sufficient statistics involved in this setting.

46This does not seem to be Jeffreys's perspective, since he later (in Sections 5.46 and 5.47) adds up the posterior probabilities of those three alternatives, effectively dividing the Bayes factor by 3 or such.

n2 2 2 (¯x2 ) − 2σ2 − n s2 n s2 1 1 2 2 d − 2σ2 − 2σ2 1 2 

2 2 −n1 −n2 = 2π/(n2σ1 + n1σ2)σ1 σ2 q (¯x x¯ )2 n s2 n s2 exp 1 − 2 1 1 2 2 , −2(σ2/n + σ2/n ) − 2σ2 − 2σ2  1 1 2 2 1 2  but the computation of n n exp 1 (¯x )2 2 (¯x )2 −2σ2 1 − 1 − 2σ2 2 − Z  1 2  2 d d1 Fig. 9. Comparison of a Monte Carlo approximation to the 2 2 2πσ2 σ + ( ) Bayes factor for the normal mean comparison problem and 1 − 1 of Jeffreys’s approximation, corresponding to 103 statistics 4 [and the alternative versions] is not possible in closed (ni, x¯i,si)i=1,2 and 10 generations from the noninformative posterior. form. We note that π13 corresponds to a distribution on the difference with density equal to 1 − 2 π ( , σ ,σ ) A similar split of the alternative is studied in Sec- 13 1 2| 1 2 tion 5.42 when the standard deviations are differ- 1 = ((σ + σ )( )2 ent under both models, with further simplifications π 1 2 1 − 2 in Jeffreys’s approximations to the posteriors (since 3 2 2 3 + σ1 σ1σ2 σ1σ2 + σ2) the i’s are integrated out). It almost seems as if − − 2 2 2 2 2 2 x¯ x¯ acts as a pseudo-sufficient statistic. If we /([(1 2) + σ1 + σ2] 4σ1σ2) 1 − 2 − − start from a generic representation with L(1,2, 1 (σ + σ )(y2 + σ2 2σ σ + σ2) = 1 2 1 − 1 2 2 σ1,σ2) proportional to π (y2 + (σ + σ )2)(y2 + (σ σ )2) 1 2 1 − 2 −n1 −n2 n1 2 1 σ1 + σ2 σ σ exp (¯x1 1) = , 1 2 −2σ2 − π y2 + (σ + σ )2  1 1 2 n n s2 n s2 thus equal to a Cauchy distribution with scale (σ + 2 (¯x )2 1 1 2 2 , 1 2 2 2 2 2 σ ).47 Jeffreys uses instead a Laplace approxima- − 2σ2 − − 2σ1 − 2σ2 2  tion, and if we use again π(,σ ,σ ) 1/σ σ under the 1 2 1 2 2σ 1 null hypothesis and ∝ 1 2 2 , n1n2 σ1 + (¯x1 x¯2) 1 1 σ1 − π11(1,2,σ1,σ2) 2 2 , to the above integral, with no further justification. ∝ σ1σ2 π σ + ( ) 1 2 − 1 Given the differences between the three formulations 1 1 σ2 of the alternative hypothesis, it makes sense to try π12(1,2,σ1,σ2) 2 2 , ∝ σ1σ2 π σ + (2 1) to compare further those three priors (in our re- 2 − interpretation as hierarchical priors). As noted by π13(,1,2,σ1,σ2) Jeffreys, there may be considerable grounds for de- cision between the alternative hypotheses. It seems 1 1 σ1σ2 2 2 2 2 2 to us (based on the Laplace approximations) that ∝ σ1σ2 π σ + (1 ) σ + (2 ) { 1 − }{ 2 − } the most sensible prior is the hierarchical one, π13, under the alternative, then, as stated in Theory of Probability, 47While this result follows from the derivation of the den- sity by integration, a direct proof follows from considering −n −1 −n −1 n1 2 the characteristic function of the Cauchy distribution (0,σ), σ 1 σ 2 exp (¯x ) C 1 2 −2σ2 1 − equal to exp σ ξ (see Feller, 1971). Z  1 − | | THEORY OF PROBABILITY REVISITED 27 in that the scale depends on both variances rather Then, Section 6.0 discusses the difficulty of set- than only one. tling for an informative prior distribution that takes An extension of the test on a (normal) standard into account the actual state of knowledge. By subdi- deviation is considered in Section 5.44 for the agree- viding the sample into groups, different conclusions ment of two estimated standard errors. Once again, can obviously be reached, but this contradicts the the most straightforward interpretation of Jeffreys’s Likelihood Principle that the whole data set must derivation is to see it as a hierarchical modeling, be used simultaneously. Of course, this could also with a reference prior π(σ) = 1/σ on a global scale, be interpreted as a precursor attempt at defining σ1 say, and the corresponding (testing) Jeffreys prior pseudo-Bayes factors (Berger and Pericchi, 1996). 
on the ratio σ1/σ2 = exp ζ. The Bayes factor (in fa- Otherwise, as correctly pointed out by Jeffreys, the vor of the null hypothesis) is then given by prior probability when each subsample is considered is not the original prior probability but the poste- √2 K = rior probability left by the previous one, which is π the basic implementation of the Bayesian learning ∞ cosh(ζ) n e2(z−ζ) + n −n/2 principle. However, even with this correction, the e−n1ζ 1 2 dζ, 2z final outcome of a sequential approach is not the −∞ cosh(2ζ) n2e + n2 . Z   proper Bayesian solution, unless posteriors are also if z denotes log s1/s2 = logσ ˆ1/σˆ2. used within the integrals of the Bayes factor. 6.6 P -what?! Section 6.5 also recapitulates both Chapters V and VI with general comments. It reiterates the warn- Section 5.6 embarks upon a historically interest- ing, already made earlier, that the Bayes factors ob- ing discussion on the warnings given by too good a tained via this noninformative approach are usually 2 p-value: if, for instance, a χ test leads to a value of rarely immensely in favor of H . This somehow con- 2 0 the χ statistic that is very small, this means (al- tradicts later studies, like those of Berger and Sellke 2 most certain) incompatibility with the χ assump- (1987) and Berger, Boukai and Wang (1997), that tion just as well as too large a value. (Jeffreys re- the Bayes factor is generally less prone to reject the calls the example of the data set of Mendel that was null hypothesis. Jeffreys argues that, when an alter- modified by hand to agree with the Mendelian law native is actually used (. . .), the probability that it is 2 of inheritance, leading to too small a χ value.) This false is always of order n−1/2, without further justi- can be seen as an indirect criticism of the standard fication. Note that this last section also includes the tests (see also Section 8 below). seeds of model averaging: when a set of alternative hypotheses (models Mr) is considered, the predic- 7. CHAPTER VI: SIGNIFICANCE TESTS: tive should be VARIOUS COMPLICATIONS p(x′ x)= p (x′ x)π(M x), The best way of testing differences from a | r | r| r systematic rule is always to arrange our work so as X to ask and answer one question at a time. rather than conditional on the accepted hypothesis. H. Jeffreys, Theory of Probability, Section 6.1. Obviously, when K is large, [this] will give almost the same inference as the selected model/hypothesis. This chapter appears as a marginalia of the pre- 7.1 Multiple Parameters vious one in that it contains no major advance but rather a sequence of remarks, such as, for instance, Although it should proceed from first principles, an entry on time-series models (see Section 7.2 be- the extension of Jeffreys’s (second) rule for selection low). The very first paragraph of this chapter pro- priors (see Section 6.4) to several parameters is dis- duces a remarkably simple and intuitive justification cussed in Sections 6.1 and 6.2 in a spirit similar to of the incompatibility between improper priors and the reference priors of Berger and Bernardo (1992), significance tests: the mere fact that we are seriously by pointing out that, if two parameters α and β are considering the possibility that it is zero may be as- introduced sequentially against the null hypothesis sociated with a presumption that if it is not zero it H0 : α = β = 0, testing first that α = 0 then β = 0 is probably small. conditional on α does not lead to the same joint 28 C. P. ROBERT, N. CHOPIN AND J. 
ROUSSEAU prior as the symmetric steps of testing first β = 0 nρˆ2v2 dv 1F1 1 n, 1, , then α = 0 conditional on β. In fact, − −2(2s2 +ρ ˆ2) 1+ v2   1/2 1/2 1F1 denoting a confluent hypergeometric function. d arctan Jα d arctan Jβ|α A similar analysis is conducted in Section 6.21 for = d arctan J 1/2d arctan J 1/2. a linear regression model associated with a pair of β α|β harmonics (xt = α cos t + β sin t + εt), the only dif- Jeffreys then suggests using instead the marginal- ference being the inclusion of the covariate scales A ized version and B within the prior, 1/2 1/2 dJ dJα β π(α, β σ) 1 dα dβ | π(α, β)= 2 , π 1+ Jα 1+ Jβ √A2 + B2 = although he acknowledges that there are cases where π2√2 the symmetry does not make sense (as, for instance, σ . when parameters are not defined under the null, as, α2 + β2 σ2 + (A2 + B2)(α2 + β2)/2 e.g., in a mixture setting). He then resorts to Ock- { } ham’s razor (Section 6.12) to rank those unidimen- 7.2 Markovianp Models sional tests by stating that there is a best order of While the title of Section 6.3 (Partial and serial procedures, although there are cases where such an correlation) is slightly misleading, this section deals ordering is arbitrary or not even possible. Section 6.2 with an AR(1) model, considers a two-dimensional parameter (λ, ) and, switching to polar coordinates, uses a (half-)Cauchy xt+1 = ρxt + τεt. prior on the radius ρ = λ2 + 2 (and a uniform It is not conclusive with respect to the selection of prior on the angle). The Bayes factor for testing the the prior on ρ given that Jeffreys does not consider p nullity of the parameter (λ, ) is then the null value ρ = 0 but rather ρ = 1 which leads to difficulties, if only because there is± no stationary 2ns2 + n(¯x2 +y ¯2) K = σ−2n−1 exp dσ distribution in that case. Since the Kullback diver- − 2σ2 Z   gence is given by 1 1+ ρρ′ 2 2n ′ ′ 2 π σ J(ρ, ρ )= 2 ′2 (ρ ρ) , .Z (1 ρ )(1 ρ ) − exp (2ns2 − − {− Jeffreys’s (testing) prior (against H0 : ρ = 0) should + n([¯x λ]2 be − dλ d dσ 1 J 1/2(ρ, 0)′ 1 1 + [¯y ]2))/2σ2 = , − }ρ(σ2 + ρ2) π 1+ J(ρ, 0) π 1 ρ2 n 2 2 2 −n − = 2 (n 1)! 2ns + n(¯x +y ¯ ) which is also Jeffreys’s regular (estimation) prior in − { } p 1 that case. π2σ2n The (other) correlation problem of Section 6.4 also .Z deals with a Markov structure, namely, that n exp 2 α + (1 α)pr, if s = r, −2σ P (xt+1 = s xt = r)= −  | (1 α)ps, otherwise, [2s2 +ρ ˆ2  − the null (independence) hypothesis corresponding dφ dρ dσ 2ρρˆcos φ + ρ2] , to H0 : α = 0. Note that this parameterization of − ρ(σ2 + ρ2)  the Markov model means that the pr’s are the sta- whereρ ˆ2 =x ¯2 +y ¯2 and which can only be integrated tionary probabilities. The Kullback divergence being up to particularly intractable, m 1 2 ∞ ns2v2 α = exp J = α pr log 1+ , K π −2s2 +ρ ˆ2 pr(1 α) Z0   Xr=1  −  THEORY OF PROBABILITY REVISITED 29

See, however, Dawid, 2004, for a general synthesis on this point.) The fact that Student’s and Fisher’s analyses of the t statistic coincide with Jeffreys’s is seen as an argument in favor both of the Bayesian approach and of the choice of the reference prior π(,σ) 1/σ. The most∝ famous part of the chapter (Section 7.2) contains the often-quoted sentence above, which ap- plies to the criticism of p-values, since a decision to reject the null hypothesis is based on the observed p-value being in the upper tail of its distribution un- der the null, even though nothing but the observed value is relevant. Given that the p-value is a one- to-one transform of the original test statistics, the Fig. 10. Jeffreys’s prior of the coefficient α for the Markov criticism is maybe less virulent than it appears: Jef- model of Section 6.4. freys still refers to twice the standard error as a cri- terion for possible genuineness and three times the Jeffreys first produces the approximation standard error for definite acceptance. The major criticism that this quantity does not account for the (m 1)α2 J − alternative hypothesis (as argued, for instance, in ≈ 1 α − Berger and Wolpert, 1988) does not appear at this that would lead to the (testing) prior stage, but only later in Section 7.22. As perceived 2 1 α/2 in Theory of Probability, the problem with Pear- − son’s and Fisher’s approaches is therefore rather the π √1 α(1 α + α2) − − use of a convenient bound on the test statistic as [since the primitive of the above is arctan(√1 α/ two standard deviations (or on the p-value as 0.05). α)], but the possibility of negative−48 α leads him− to There is, however, an interesting remark that the use instead a flat prior on the possible range of α’s. choice of the hypothesis should eventually be aimed Note from Figure 10 that the above prior is quite at selecting the best inference, even though Jeffreys peaked in α = 1. concludes that there is no way of stating this suffi- ciently precisely to be of any use. Again, expressing 8. CHAPTER VII: FREQUENCY DEFINITIONS this objective in decision-theoretic terms seems the AND DIRECT METHODS most natural solution today. Interestingly, the fol- lowing sentence in Section 7.51 could be interpreted, An hypothesis that may be true may be rejected once again in an apocryphal way, as a precursor to because it has not predicted observable decision theory: There are cases where there is no results that have not occurred. positive new parameter, but important consequences H. Jeffreys, Theory of Probability, Section 7.2. might follow if it was not zero, leading to loss func- tions mixing estimation and testing as in Robert and This short chapter opposes the classical approaches of the time (Fisher’s fiducial and likelihood method- Casella (1994). ologies, Pearson’s and Neyman’s p-values) to the In Section 7.5 we find a similarly interesting rein- Bayesian principles developed in the earlier chap- terpretation of the classical first and second type ters. (The very first part of the chapter is a digres- errors, computing an integrated error based on the sion on the “frequentist” theories of probability that 0–1 loss (even though it is not defined this way) as is not particularly relevant from a mathematical per- ac ∞ spective and that we have already addressed earlier. 
8. CHAPTER VII: FREQUENCY DEFINITIONS AND DIRECT METHODS

An hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred.
H. Jeffreys, Theory of Probability, Section 7.2.

This short chapter opposes the classical approaches of the time (Fisher's fiducial and likelihood methodologies, Pearson's and Neyman's p-values) to the Bayesian principles developed in the earlier chapters. (The very first part of the chapter is a digression on the "frequentist" theories of probability that is not particularly relevant from a mathematical perspective and that we have already addressed earlier. See, however, Dawid, 2004, for a general synthesis on this point.) The fact that Student's and Fisher's analyses of the t statistic coincide with Jeffreys's is seen as an argument in favor both of the Bayesian approach and of the choice of the reference prior π(μ, σ) ∝ 1/σ.

The most famous part of the chapter (Section 7.2) contains the often-quoted sentence above, which applies to the criticism of p-values, since a decision to reject the null hypothesis is based on the observed p-value being in the upper tail of its distribution under the null, even though nothing but the observed value is relevant. Given that the p-value is a one-to-one transform of the original test statistics, the criticism is maybe less virulent than it appears: Jeffreys still refers to twice the standard error as a criterion for possible genuineness and three times the standard error for definite acceptance. The major criticism that this quantity does not account for the alternative hypothesis (as argued, for instance, in Berger and Wolpert, 1988) does not appear at this stage, but only later in Section 7.22. As perceived in Theory of Probability, the problem with Pearson's and Fisher's approaches is therefore rather the use of a convenient bound on the test statistic as two standard deviations (or on the p-value as 0.05). There is, however, an interesting remark that the choice of the hypothesis should eventually be aimed at selecting the best inference, even though Jeffreys concludes that there is no way of stating this sufficiently precisely to be of any use. Again, expressing this objective in decision-theoretic terms seems the most natural solution today. Interestingly, the following sentence in Section 7.51 could be interpreted, once again in an apocryphal way, as a precursor to decision theory: There are cases where there is no positive new parameter, but important consequences might follow if it was not zero, leading to loss functions mixing estimation and testing as in Robert and Casella (1994).

In Section 7.5 we find a similarly interesting reinterpretation of the classical first and second type errors, computing an integrated error based on the 0–1 loss (even though it is not defined this way) as
\[
\int_0^{a_c} f_1(x)\,dx + \int_{a_c}^{\infty} f_0(x)\,dx,
\]
where x is the test statistic, f₀ and f₁ are the marginals under the null and under the alternative, respectively, and a_c is the bound for accepting H₀. The optimal value of a_c is therefore given by f₀(a_c) = f₁(a_c), which amounts to
\[
\pi(H_0\mid x=a_c) = \pi(H_0^c\mid x=a_c),
\]
that is, K = 1 if both hypotheses are equally weighted a priori. This is a completely rigorous derivation of the optimal Bayesian decision for testing, even though Jeffreys does not approach it this way, in particular, because the prior probabilities are not necessarily equal (a point discussed earlier in Section 6.0 for instance). It is nonetheless a fairly convincing argument against p-values in terms of smallest number of mistakes. More prosaically, Jeffreys briefly discusses in this section the disturbing asymmetry of frequentist tests, when both hypotheses are of the same type: if we must choose between two definitely stated alternatives, we should naturally take the one that gives the larger likelihood, even though each may be within the range of acceptance of the other.
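Returning to the Section 7.5 criterion, the optimisation behind it can be spelled out for the modern reader (our rendering, with K denoting the Bayes factor, as elsewhere in Theory of Probability): differentiating the integrated error with respect to a_c gives f₁(a_c) − f₀(a_c) = 0 and, under equal prior weights,
\[
\pi(H_0\mid x) = \frac{f_0(x)}{f_0(x)+f_1(x)},
\]
so that f₀(a_c) = f₁(a_c) is indeed equivalent to π(H₀ | x = a_c) = π(H₀^c | x = a_c) = 1/2, that is, to K = f₀(a_c)/f₁(a_c) = 1.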
9. CHAPTER VIII: GENERAL QUESTIONS

A prior probability used to express ignorance is merely the formal statement of that ignorance.
H. Jeffreys, Theory of Probability, Section 8.1.

This concluding chapter summarizes the main reasons for using the Bayesian perspective:

1. Prior and sampling probabilities are representations of degrees of belief rather than frequencies (Section 8.0). Once again, we believe that this debate[49] is settled today, by considering that probability distributions and improper priors are defined according to the rules of measure theory; see, however, Dawid (2004) for another perspective oriented toward calibration.

2. While prior probabilities are subjective and cannot be uniquely assessed, Theory of Probability sets a general (objective) principle for the derivation of prior distributions (Section 8.1). It is quite interesting to read Jeffreys's defence of this point when taking into account the fact that this book was setting the point of reference for constructing noninformative priors. Theory of Probability does little, however, toward the construction of informative priors by integrating existing prior information (except in the sequential case discussed earlier), recognizing nonetheless the natural discrepancy between two probability distributions conditional on two different data sets. More fundamentally, this stresses that Theory of Probability focuses on prior probabilities used to express ignorance more than anything else.

3. Bayesian statistics naturally allow for model specification and, as such, do not suffer (as much) from the neglect of an unforeseen alternative (Section 8.2). This is obviously true only to some extent: if, in the process of comparing models M_i based on an experiment, one very likely model is omitted from the list, the consequences may be severe. On the other hand, and in relation to the previous discussion on the p-values, the Bayesian approach allows for alternative models and is thus naturally embedding model specification within its paradigm.[50] The fact that it requires an alternative hypothesis to operate a test is an illustration of this feature.

4. Different theories leading to the same posteriors cannot be distinguished since questions that cannot be decided by means of observations are best left alone (Section 8.3). The physicists'[51] concept of rejection of unobservables is to be understood as the elimination of parameters in a law that make no contribution to the results of any observation or as a version of Ockham's principle, introducing new parameters only when observations showed them to be necessary (Section 8.4). See Dawid (1984, 2004) for a discussion of this principle he calls Jeffreys's Law.

5. The theory of Bayesian statistics as presented in Theory of Probability is consistent in that it provides general rules to construct noninformative priors and to conduct tests of hypotheses (Section 8.6). It is in agreement with the Likelihood Principle and with conditioning on sufficient statistics.[52] It also avoids the use of p-values for testing hypotheses by requiring no empirical hypothesis to be true or false a priori. However, special cases and multidimensional settings show that this theory cannot claim to be completely universal.

6. The final paragraph of Theory of Probability states that the present theory does not justify induction; what it does is to provide rules for consistency. This is absolutely coherent with the above: although the book considers many special cases and exceptions, it does provide a general rule for conducting point inference (estimation) and testing of hypotheses by deriving generic rules for the construction of noninformative priors. Many other solutions are available, but the consistency cannot be denied, while a ranking of those solutions is unthinkable.

[49] Jeffreys argues that the limit definition was not stated till eighty years later than Bayes, which sounds incorrect when considering that the Law of Large Numbers was produced by Bernoulli in Ars Conjectandi.

[50] The point about being prepared for occasional wrong decisions could possibly be related to Popper's notion of falsifiability: by picking a specific prior, it is always possible to modify inference toward one's goal. Of course, the divergences between Jeffreys's and Popper's approaches to induction make them quite irreconcilable. See Dawid (2004) for a Bayes–de Finetti–Popper synthesis.

[51] Both paragraphs Sections 8.3 and 8.4 seem only concerned with a physicists' debate, particularly about the relevance of quantum theory.

[52] We recall that Fisher information is not fully compatible with the Likelihood Principle (Berger and Wolpert, 1988).

In essence, Theory of Probability has thus mostly achieved its goal of presenting a self-contained theory of inference based on a minimum of assumptions and covering the whole field of inferential purposes.

10. CONCLUSION

It is essential to the possibility of induction that we shall be prepared for occasional wrong decisions.
H. Jeffreys, Theory of Probability, Section 8.2.

Despite a tone that some may consider as overly critical, and therefore unfair to such a pioneer in our field, this perusal of Theory of Probability leaves us with the feeling of a considerable achievement toward the formalization of Bayesian theory and the construction of an objective and consistent framework. Besides setting the Bayesian principle in full generality,
\[
\text{Posterior Probability} \propto \text{Prior Probability} \times \text{Likelihood},
\]
including using improper priors indistinctly from proper priors, the book sets a generic theory for selecting reference priors in general inferential settings,
\[
\pi(\theta) \propto |I(\theta)|^{1/2},
\]
as well as when testing point null hypotheses,
\[
\frac{1}{\pi}\,\frac{dJ^{1/2}}{1+J} = \frac{1}{\pi}\,d\tan^{-1}\{J^{1/2}(\theta)\},
\]
when J(θ) = div{f(·|θ₀), f(·|θ)} is a divergence measure between the sampling distribution under the null and under the alternative.
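As a simple illustration of the estimation rule (a textbook example rather than one tied to this concluding chapter), a binomial observation x ∼ B(n, p) has Fisher information I(p) = n/{p(1−p)}, so the rule returns
\[
\pi(p) \propto p^{-1/2}(1-p)^{-1/2},
\]
namely the (proper) Be(1/2, 1/2) prior, to be contrasted with both the flat prior and Haldane's (1932) prior π(p) ∝ p^{−1}(1−p)^{−1}.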
The lack of a decision-theoretic formalism for point estimation notwithstanding, Jeffreys sets up a completely operational technology for hypothesis testing and model choice that is centered on the Bayes factor. Premises of hierarchical Bayesian analysis, reference priors, matching priors and mixture analysis can be found at various places in the book. That it sometimes lacks mathematical rigor and often indulges in debates that may look superficial today is once again a reflection of the idiosyncrasies of the time: even the ultimate revolutions cannot be built on void and they do need the shoulders of earlier giants to step further. We thus absolutely acknowledge the depth and worth of Theory of Probability as a foundational text for Bayesian Statistics and hope that the current review may help in its reassessment.

ACKNOWLEDGMENTS

This paper originates from a reading seminar held at CREST in March 2008. The authors are grateful to the participants for their helpful comments. Professor Dennis Lindley very kindly provided light on several difficult passages and we thank him for his time, his patience and his supportive comments. We are also grateful to Jim Berger and to Steve Addison for helpful suggestions, and to Mike Titterington for a very detailed reading of a preliminary version of this paper. David Aldrich directed us to his most informative website about Harold Jeffreys and Theory of Probability. Parts of this paper were written during the first author's visit to the Isaac Newton Institute in Harold Jeffreys's alma mater, Cambridge, whose peaceful working environment was deeply appreciated. Comments from the editorial team of Statistical Science were also most helpful.

REFERENCES

Aldrich, J. (2008). R. A. Fisher on Bayes and Bayes' theorem. Bayesian Anal. 3 161–170. MR2383255
Balasubramanian, V. (1997). Statistical inference, Occam's razor, and statistical mechanics on the space of probability distributions. Neural Comput. 9 349–368.
Basu, D. (1988). Statistical Information and Likelihood: A Collection of Critical Essays by Dr. D. Basu. Springer, New York. MR0953081
Bauwens, L. (1984). Bayesian Full Information of Simultaneous Equations Models Using Integration by Monte Carlo. Lecture Notes in Economics and Mathematical Systems 232. Springer, New York. MR0766396
Bayarri, M. and Garcia-Donato, G. (2007). Extending conventional priors for testing general hypotheses in linear models. Biometrika 94 135–152. MR2367828
Bayes, T. (1763). An essay towards solving a problem in the doctrine of chances. Phil. Trans. Roy. Soc. 53 370–418.

Beaumont, M., Zhang, W. and Balding, D. (2002). Approximate Bayesian computation in population genetics. Genetics 162 2025–2035.
Berger, J. (1985). Statistical Decision Theory and Bayesian Analysis, 2nd ed. Springer, New York. MR0804611
Berger, J. and Bernardo, J. (1992). On the development of the reference prior method. In Bayesian Statistics 4 (J. Berger, J. Bernardo, A. Dawid and A. Smith, eds.) 35–49. Oxford Univ. Press, London. MR1380269
Berger, J., Bernardo, J. and Sun, D. (2009). Natural induction: An objective Bayesian approach. Rev. R. Acad. Cien. Serie A Mat. 103 125–135.
Berger, J., Boukai, B. and Wang, Y. (1997). Unified frequentist and Bayesian testing of a precise hypothesis (with discussion). Statist. Sci. 12 133–160. MR1617518
Berger, J. and Jefferys, W. (1992). Sharpening Ockham's razor on a Bayesian strop. Amer. Statist. 80 64–72.
Berger, J. and Pericchi, L. (1996). The intrinsic Bayes factor for model selection and prediction. J. Amer. Statist. Assoc. 91 109–122. MR1394065
Berger, J., Pericchi, L. and Varshavsky, J. (1998). Bayes factors and marginal distributions in invariant situations. Sankhyā Ser. A 60 307–321. MR1718789
Berger, J., Philippe, A. and Robert, C. (1998). Estimation of quadratic functions: Reference priors for non-centrality parameters. Statist. Sinica 8 359–375. MR1624335
Berger, J. and Robert, C. (1990). Subjective hierarchical Bayes estimation of a multivariate normal mean: On the frequentist interface. Ann. Statist. 18 617–651. MR1056330
Berger, J. and Sellke, T. (1987). Testing a point-null hypothesis: The irreconcilability of significance levels and evidence (with discussion). J. Amer. Statist. Assoc. 82 112–122. MR0883340
Berger, J. and Wolpert, R. (1988). The Likelihood Principle, 2nd ed. IMS Lecture Notes—Monograph Series 9. IMS, Hayward.
Bernardo, J. (1979). Reference posterior distributions for Bayesian inference (with discussion). J. Roy. Statist. Soc. Ser. B 41 113–147. MR0547240
Bernardo, J. and Smith, A. (1994). Bayesian Theory. Wiley, New York. MR1274699
Billingsley, P. (1986). Probability and Measure, 2nd ed. Wiley, New York. MR0830424
Broemeling, L. and Broemeling, A. (2003). Studies in the history of probability and statistics XLVIII: The Bayesian contributions of Ernest Lhoste. Biometrika 90 728–731. MR2006848
Casella, G. and Berger, R. (2001). Statistical Inference, 2nd ed. Wadsworth, Belmont, CA.
Dacunha-Castelle, D. and Gassiat, E. (1999). Testing the order of a model using locally conic parametrization: Population mixtures and stationary ARMA processes. Ann. Statist. 27 1178–1209. MR1740115
Darmois, G. (1935). Sur les lois de probabilité à estimation exhaustive. Comptes Rendus Acad. Sciences Paris 200 1265–1266.
Dawid, A. (1984). Present position and potential developments: Some personal views. Statistical theory. The prequential approach (with discussion). J. Roy. Statist. Soc. Ser. A 147 278–292. MR0763811
Dawid, A. (2004). Probability, causality and the empirical world: A Bayes–de Finetti–Popper–Borel synthesis. Statist. Sci. 19 44–57. MR2082146
Dawid, A., Stone, N. and Zidek, J. (1973). Marginalization paradoxes in Bayesian and structural inference (with discussion). J. Roy. Statist. Soc. Ser. B 35 189–233. MR0365805
de Finetti, B. (1974). Theory of Probability, vol. 1. Wiley, New York.
de Finetti, B. (1975). Theory of Probability, vol. 2. Wiley, New York.
DeGroot, M. (1970). Optimal Statistical Decisions. McGraw-Hill, New York. MR0356303
DeGroot, M. (1973). Doing what comes naturally: Interpreting a tail area as a posterior probability or as a likelihood ratio. J. Amer. Statist. Assoc. 68 966–969. MR0362639
Diaconis, P. and Ylvisaker, D. (1985). Quantifying prior opinion. In Bayesian Statistics 2 (J. Bernardo, M. DeGroot, D. Lindley and A. Smith, eds.) 163–175. North-Holland, Amsterdam. MR0862481
Earman, J. (1992). Bayes or Bust. MIT Press, Cambridge, MA. MR1170349
Feller, W. (1970). An Introduction to Probability Theory and Its Applications, vol. 1. Wiley, New York.
Feller, W. (1971). An Introduction to Probability Theory and Its Applications, vol. 2. Wiley, New York. MR0270403
Fienberg, S. (2006). When did Bayesian inference become "Bayesian"? Bayesian Anal. 1 1–40. MR2227361
Ghosh, M. and Meeden, G. (1984). A new Bayesian analysis of a random effects model. J. Roy. Statist. Soc. Ser. B 43 474–482. MR0790633
Good, I. (1962). Theory of Probability by Harold Jeffreys. J. Roy. Statist. Soc. Ser. A 125 487–489.
Good, I. (1980). The contributions of Jeffreys to Bayesian statistics. In Bayesian Analysis in Econometrics and Statistics: Essays in Honor of Harold Jeffreys 21–34. North-Holland, Amsterdam. MR0576546
Gouriéroux, C. and Monfort, A. (1996). Statistics and Econometric Models. Cambridge Univ. Press.
Gradshteyn, I. and Ryzhik, I. (1980). Tables of Integrals, Series and Products. Academic Press, New York.
Haldane, J. (1932). A note on inverse probability. Proc. Cambridge Philos. Soc. 28 55–61.
Huzurbazar, V. (1976). Sufficient Statistics. Marcel Dekker, New York.
Jaakkola, T. and Jordan, M. (2000). Bayesian parameter estimation via variational methods. Statist. Comput. 10 25–37.
Jeffreys, H. (1931). Scientific Inference, 1st ed. Cambridge Univ. Press.
Jeffreys, H. (1937). Scientific Inference, 2nd ed. Cambridge Univ. Press.
Jeffreys, H. (1939). Theory of Probability, 1st ed. The Clarendon Press, Oxford.
Jeffreys, H. (1948). Theory of Probability, 2nd ed. The Clarendon Press, Oxford.

Jeffreys, H. (1961). Theory of Probability, 3rd ed. Oxford Classic Texts in the Physical Sciences. Oxford Univ. Press, Oxford. MR1647885
Kass, R. (1989). The geometry of asymptotic inference (with discussion). Statist. Sci. 4 188–234. MR1015274
Kass, R. and Wasserman, L. (1996). Formal rules of selecting prior distributions: A review and annotated bibliography. J. Amer. Statist. Assoc. 91 1343–1370. MR1478684
Koopman, B. (1936). On distributions admitting a sufficient statistic. Trans. Amer. Math. Soc. 39 399–409. MR1501854
Le Cam, L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer, New York. MR0856411
Lhoste, E. (1923). Le calcul des probabilités appliqué à l'artillerie. Revue D'Artillerie 91 405–423, 516–532, 58–82 and 152–179.
Lindley, D. (1953). Statistical inference (with discussion). J. Roy. Statist. Soc. Ser. B 15 30–76. MR0057522
Lindley, D. (1957). A statistical paradox. Biometrika 44 187–192. MR0087273
Lindley, D. (1962). Theory of Probability by Harold Jeffreys. J. Amer. Statist. Assoc. 57 922–924.
Lindley, D. (1971). Bayesian Statistics, A Review. SIAM, Philadelphia. MR0329081
Lindley, D. (1980). Jeffreys's contribution to modern statistical thought. In Bayesian Analysis in Econometrics and Statistics: Essays in Honor of Harold Jeffreys 35–39. North-Holland, Amsterdam. MR0576546
Lindley, D. and Smith, A. (1972). Bayes estimates for the linear model. J. Roy. Statist. Soc. Ser. B 34 1–41. MR0415861
MacKay, D. J. C. (2002). Information Theory, Inference & Learning Algorithms. Cambridge Univ. Press. MR2012999
Marin, J.-M., Mengersen, K. and Robert, C. (2005). Bayesian modelling and inference on mixtures of distributions. In Handbook of Statistics (C. Rao and D. Dey, eds.) 25. Springer, New York.
Marin, J.-M. and Robert, C. (2007). Bayesian Core. Springer, New York. MR2289769
Pitman, E. (1936). Sufficient statistics and intrinsic accuracy. Proc. Cambridge Philos. Soc. 32 567–579.
Popper, K. (1934). The Logic of Scientific Discovery. Hutchinson and Co., London. (English translation, 1959.) MR0107593
Raiffa, H. (1968). Decision Analysis: Introductory Lectures on Choices Under Uncertainty. Addison-Wesley, Reading, MA.
Raiffa, H. and Schlaifer, R. (1961). Applied statistical decision theory. Technical report, Division of Research, Graduate School of Business Administration, Harvard Univ. MR0117844
Rissanen, J. (1983). A universal prior for integers and estimation by minimum description length. Ann. Statist. 11 416–431. MR0696056
Rissanen, J. (1990). Complexity of models. In Complexity, Entropy, and the Physics of Information (W. Zurek, ed.) 8. Addison-Wesley, Reading, MA.
Robert, C. (1996). Intrinsic loss functions. Theory and Decision 40 191–214. MR1385186
Robert, C. (2001). The Bayesian Choice, 2nd ed. Springer, New York.
Robert, C. and Casella, G. (1994). Distance penalized losses for testing and confidence set evaluation. Test 3 163–182. MR1293113
Robert, C. and Casella, G. (2004). Monte Carlo Statistical Methods, 2nd ed. Springer, New York. MR2080278
Rubin, H. (1987). A weak system of axioms for rational behavior and the nonseparability of utility from prior. Statist. Decision 5 47–58. MR0886877
Savage, L. (1954). The Foundations of Statistical Inference. Wiley, New York. MR0063582
Stigler, S. (1999). Statistics on the Table: The History of Statistical Concepts and Methods. Harvard Univ. Press, Cambridge, MA. MR1712969
Tanner, M. and Wong, W. (1987). The calculation of posterior distributions by data augmentation. J. Amer. Statist. Assoc. 82 528–550. MR0898357
Tibshirani, R. (1989). Noninformative priors for one parameter of many. Biometrika 76 604–608. MR1040654
Wald, A. (1950). Statistical Decision Functions. Wiley, New York. MR0036976
Welch, B. and Peers, H. (1963). On formulae for confidence points based on integrals of weighted likelihoods. J. Roy. Statist. Soc. Ser. B 25 318–329. MR0173309
Zellner, A. (1980). Introduction. In Bayesian Analysis in Econometrics and Statistics: Essays in Honor of Harold Jeffreys 1–10. North-Holland, Amsterdam. MR0576546