Bayes, Jeffreys, Prior Distributions and the Philosophy of Statistics
Statistical Science 2009, Vol. 24, No. 2, 176–178
DOI: 10.1214/09-STS284D (main article DOI: 10.1214/09-STS284)
© Institute of Mathematical Statistics, 2009

Andrew Gelman

I actually own a copy of Harold Jeffreys's Theory of Probability but have only read small bits of it, most recently over a decade ago to confirm that, indeed, Jeffreys was not too proud to use a classical chi-squared p-value when he wanted to check the misfit of a model to data (Gelman, Meng and Stern, 2006). I do, however, feel that it is important to understand where our probability models come from, and I welcome the opportunity to use the present article by Robert, Chopin and Rousseau as a platform for further discussion of foundational issues.[2]

In this brief discussion I will argue the following: (1) in thinking about prior distributions, we should go beyond Jeffreys's principles and move toward weakly informative priors; (2) it is natural for those of us who work in social and computational sciences to favor complex models, contra Jeffreys's preference for simplicity; and (3) a key generalization of Jeffreys's ideas is to explicitly include model checking in the process of data analysis.

THE ROLE OF THE PRIOR DISTRIBUTION IN BAYESIAN DATA ANALYSIS

At least in the field of statistics, Jeffreys is best known for his eponymous prior distribution and, more generally, for the principle of constructing noninformative, or minimally informative, or objective, or reference prior distributions from the likelihood (see, for example, Kass and Wasserman, 1996). But it can be notoriously difficult to choose among noninformative priors; and, even more importantly, seemingly noninformative distributions can sometimes have strong and undesirable implications, as I have found in my own experience (Gelman, 1996, 2006).
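To make the construction concrete, here is a minimal sketch (my illustration, not from the article or its references) of Jeffreys's rule, prior(theta) proportional to the square root of the Fisher information, applied to a single Bernoulli observation, where it reproduces the familiar Beta(1/2, 1/2) form:

```python
import math

# Jeffreys's rule: take the prior density proportional to the square root
# of the Fisher information of the likelihood. Sketch for a single
# observation y ~ Bernoulli(theta).

def fisher_info_bernoulli(theta):
    # E[-(d^2/dtheta^2) log p(y | theta)] works out to 1 / (theta * (1 - theta))
    return 1.0 / (theta * (1.0 - theta))

def jeffreys_prior(theta):
    # Unnormalized Jeffreys prior density
    return math.sqrt(fisher_info_bernoulli(theta))

def beta_half_half(theta):
    # Unnormalized Beta(1/2, 1/2) density, for comparison
    return theta ** -0.5 * (1.0 - theta) ** -0.5

for theta in (0.1, 0.25, 0.5, 0.9):
    assert math.isclose(jeffreys_prior(theta), beta_half_half(theta))
```

For the binomial this prior happens to be proper; for other likelihoods the same recipe can yield improper densities, which is one source of the difficulties with noninformative priors noted above.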
As a result I have become a convert to the cause of weakly informative priors, which attempt to let the data speak while being strong enough to exclude various "unphysical" possibilities which, if not blocked, can take over a posterior distribution in settings with sparse data—a situation which is increasingly present as we continue to develop the techniques of working with complex hierarchical and nonparametric models.

Andrew Gelman is Professor, Department of Statistics and Department of Political Science, Columbia University. E-mail: [email protected]; URL: http://www.stat.columbia.edu/~gelman.

[1] Discussion of "Harold Jeffreys's Theory of Probability revisited," by Christian Robert, Nicolas Chopin, and Judith Rousseau, for Statistical Science.

[2] On the topic of other books on the foundations of Bayesian statistics, I confess to having found Savage (1954) to be nearly unreadable, a book too much a product of its time in its enthusiasm for game theory as a solution to all problems, an attitude which I find charming in the classic work of Luce and Raiffa (1957) but more of an annoyance in a book of statistical methods. When it comes to Cold War-era foundational work on Bayesian statistics, I much prefer the work of Lindley, in his 1965 book and elsewhere. Also, I would be disloyal to my coauthors if I did not report that, despite what is said in the second footnote in the article under discussion, there is at least one other foundational Bayesian text of 1990s vintage that continues to receive more citations than Jeffreys.

arXiv:1001.2968v1 [stat.ME] 18 Jan 2010. This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2009, Vol. 24, No. 2, 176–178. This reprint differs from the original in pagination and typographic detail.

HOW THE SOCIAL AND COMPUTATIONAL SCIENCES DIFFER FROM PHYSICS

Robert, Chopin and Rousseau trace the application of Ockham's razor (the preference for simpler models) from Jeffreys's discussion of the law of gravity through later work of a mathematical statistician (Jim Berger), an astronomer (Bill Jefferys) and a physicist (David MacKay). From their perspective, Ockham's razor seems unquestionably reasonable, with the only point of debate being the extent to which Bayesian inference automatically encompasses it.

My own perspective as a social scientist is completely different. I've just about never heard someone in social science object to the inclusion of a variable or an interaction in a model; rather, the most serious criticisms of a model involve worries that certain potentially important factors have not been included. In the social science problems I've seen, Ockham's razor is at best an irrelevance and at worst can lead to acceptance of models that are missing key features that the data could actually provide information on. As such, I am no fan of methods such as BIC that attempt to justify the use of simple models that do not fit observed data. Don't get me wrong—all the time I use simple models that don't fit the data—but no amount of BIC will make me feel good about it![3]
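To see the complaint in symbols: BIC scores a model as k·ln(n) minus twice the maximized log-likelihood, so its penalty for extra parameters grows with the sample size. The sketch below is my own illustration with made-up numbers, not an analysis from any of the cited papers; it shows BIC preferring a two-parameter model over a ten-parameter model even though the latter achieves a visibly smaller residual sum of squares.

```python
import math

# Hypothetical numbers: a "simple" regression with k = 2 parameters and a
# "complex" one with k = 10, fit to n = 1000 points. The RSS values are
# chosen so the complex model fits clearly better (5% lower residual sum
# of squares), yet BIC's k*ln(n) penalty still selects the simple model.

def gaussian_profile_loglik(rss, n):
    # Profile log-likelihood of a Gaussian linear model, with the error
    # variance set to its maximum-likelihood value rss / n.
    return -0.5 * n * (math.log(2.0 * math.pi * rss / n) + 1.0)

def bic(log_likelihood, k, n):
    # Schwarz's Bayesian information criterion: lower values are preferred.
    return k * math.log(n) - 2.0 * log_likelihood

n = 1000
bic_simple = bic(gaussian_profile_loglik(1050.0, n), k=2, n=n)
bic_complex = bic(gaussian_profile_loglik(1000.0, n), k=10, n=n)
assert bic_simple < bic_complex  # BIC rejects the better-fitting model
```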
I much prefer Radford Neal's line from his Ph.D. thesis:

    Sometimes a simple model will outperform a more complex model. Nevertheless, I [Neal] believe that deliberately limiting the complexity of the model is not fruitful when the problem is evidently complex. Instead, if a simple model is found that outperforms some particular complex model, the appropriate response is to define a different complex model that captures whatever aspect of the problem led to the simple model performing well.

This is not really a Bayesian or a non-Bayesian issue: complicated models with virtually unlimited nonlinearity and interactions are being developed using Bayesian principles. See, for example, Dunson (2006) and Chipman, George and McCulloch (2008). To put it another way, you can be a practicing Bayesian and prefer simpler models, or be a practicing Bayesian and prefer complicated models. Or you can follow similar inclinations toward simplicity or complexity from various non-Bayesian perspectives.

My point here is only that the Ockhamite tendencies of Jeffreys and his followers up to and including MacKay may derive, to some extent, from the simplicity of the best models of physics, the sense that good science moves from the particular to the general—an attitude that does not fit in so well with modern social and computational science.

[3] See Gelman and Rubin (1995) for a fuller expression of this position, and Raftery (1995) for a defense of BIC in general and in the context of two applications in sociology.

BAYESIAN INFERENCE VS. BAYESIAN DATA ANALYSIS

One of my own epiphanies—actually stimulated by the writings of E. T. Jaynes, yet another Bayesian physicist—and incorporated into the title of my own book on Bayesian statistics, is that sometimes the most important thing to come out of an inference is the rejection of the model on which it is based. Data analysis includes model building and criticism, not merely inference. Only through careful model building is such definitive rejection possible. This idea—the comparison of predictive inferences to data—was forcefully put into Bayesian terms nearly thirty years ago by Box (1980) and Rubin (1984) but is even now still only gradually becoming standard in Bayesian practice.

A famous empiricist once said, "With great power comes great responsibility." In Bayesian terms, the stronger we make our model—following the excellent precepts of Jeffreys and Jaynes—the more able we will be to find the model's flaws and thus perform scientific learning.

To roughly translate into philosophy-of-science jargon: Bayesian inference within a model is "normal science," and "scientific revolution" is the process of checking a model, seeing its mismatches with reality, and coming up with a replacement. The revolution is the glamour boy in this scenario, but, as Kuhn (1962) emphasized, it is only the careful work of normal science that makes the revolution possible: the better we can work out the implications of a theory, the more effectively we can find its flaws and thus learn about nature.[4] In this chicken-and-egg process, both normal science (Bayesian inference) and revolution (Bayesian model revision) are useful, and they feed upon each other. It is in this sense that graphical methods and exploratory data analysis can be viewed as explicitly Bayesian, as tools for comparing posterior predictions to data (Gelman, 2003).

To get back to the Robert, Chopin, and Rousseau article: I am suggesting that their identification (and Jeffreys's) of Bayesian data analysis with Bayesian inference is limiting and, in practice, puts an unrealistic burden on any model.

[4] As Kuhn may very well have written had he lived long enough, scientific progress is fractal, with episodes of normal science and mini-revolutions happening over the period of minutes, hours, days, and years, as well as the more familiar examples of paradigms lasting over decades or centuries.

CONCLUSION

If you wanted to do foundational research in statistics in the mid-twentieth century, you had to be a bit of a mathematician, whether you wanted to or not. As Robert, Chopin, and Rousseau's own work reveals, if you want to do statistical research at the

REFERENCES

Gelman, A. (2003). A Bayesian formulation of exploratory data analysis and goodness-of-fit testing. Internat. Statist. Rev. 71 369–382.
Gelman, A. (2006). Prior distributions for variance parameters in hierarchical models. Bayesian Anal. 1 515–533. MR2221284
Gelman, A., Carlin, J.