Statistical Science
2014, Vol. 29, No. 3, 397–419
DOI: 10.1214/13-STS465
© Institute of Mathematical Statistics, 2014
arXiv:1301.6450v3 [stat.ME] 15 Oct 2014

Recursive Pathways to Marginal Likelihood Estimation with Prior-Sensitivity Analysis

Ewan Cameron and Anthony Pettitt

Abstract. We investigate the utility to computational Bayesian analyses of a particular family of recursive marginal likelihood estimators characterized by the (equivalent) algorithms known as "biased sampling" or "reverse logistic regression" in the statistics literature and "the density of states" in physics. Through a pair of numerical examples (including mixture modeling of the well-known galaxy data set) we highlight the remarkable diversity of sampling schemes amenable to such recursive normalization, as well as the notable efficiency of the resulting pseudo-mixture distributions for gauging prior sensitivity in the Bayesian model selection context. Our key theoretical contributions are to introduce a novel heuristic ("thermodynamic integration via importance sampling") for qualifying the role of the bridging sequence in this procedure and to reveal various connections between these recursive estimators and the nested sampling technique.

Key words and phrases: Bayes factor, Bayesian model selection, importance sampling, marginal likelihood, Metropolis-coupled Markov Chain Monte Carlo, nested sampling, normalizing constant, path sampling, reverse logistic regression, thermodynamic integration.

Ewan Cameron is Research Associate, Science and Engineering Faculty, School of Mathematical Sciences (Statistical Science), Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia; e-mail: [email protected]. Anthony Pettitt is Professor, Science and Engineering Faculty, School of Mathematical Sciences (Statistical Science), Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia.

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2014, Vol. 29, No. 3, 397–419. This reprint differs from the original in pagination and typographic detail.

1. INTRODUCTION

Though typically unnecessary for computational parameter inference in the Bayesian framework, the factor, Z, required to normalize the product of prior and likelihood nevertheless plays a vital role in Bayesian model selection and model averaging (Kass and Raftery, 1995; Hoeting et al., 1999). For priors admitting an "ordinary" density, π(θ), with respect to the Lebesgue measure (a "Λ-density"), we write for the posterior

π(θ|y) = L(y|θ)π(θ)/Z with

(1) Z = ∫_Ω L(y|θ)π(θ) dθ,

and, more generally (e.g., for stochastic process priors), we write

dP_{θ|y}(θ) = L(y|θ) dP_θ(θ)/Z with

(2) Z = ∫_Ω L(y|θ) {dP_θ(θ)},

with the likelihood, L(y|θ), a non-negative, real-valued function supposed integrable with respect to the prior. In this context Z is generally referred to as either the marginal likelihood (i.e., the likelihood of the observed data marginalized [averaged] over the prior) or the evidence. With the latter term, though, one risks the impression of overstating the value of this statistic in the case of limited prior knowledge (cf. Gelman et al., 2004, Chapter 6).
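To make Equation (1) concrete, the following minimal Python sketch evaluates Z for a conjugate Normal–Normal toy model in which the marginal likelihood is available in closed form, and checks it against direct one-dimensional numerical integration. The model, data, and grid settings are illustrative assumptions for this sketch, not anything taken from the paper.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

# Illustrative conjugate toy model (not from the paper):
#   prior       theta ~ N(0, tau^2)
#   likelihood  y_i | theta ~ N(theta, sigma^2),  i = 1, ..., n
rng = np.random.default_rng(1)
tau, sigma, n = 3.0, 1.0, 20
y = rng.normal(1.5, sigma, size=n)

def log_like(theta):
    """Joint log-likelihood log L(y | theta), vectorized over theta."""
    theta = np.atleast_1d(theta)
    return stats.norm.logpdf(y[:, None], loc=theta, scale=sigma).sum(axis=0)

# Closed-form marginal likelihood: y ~ N_n(0, sigma^2 I + tau^2 11')
cov = sigma**2 * np.eye(n) + tau**2 * np.ones((n, n))
log_Z_exact = stats.multivariate_normal.logpdf(y, mean=np.zeros(n), cov=cov)

# Direct numerical integration of Equation (1), feasible because D(theta) = 1;
# the sum over the grid is taken in log space for numerical stability.
grid = np.linspace(-20.0, 20.0, 4001)
log_integrand = log_like(grid) + stats.norm.logpdf(grid, 0.0, tau)
log_Z_grid = logsumexp(log_integrand) + np.log(grid[1] - grid[0])

print(f"log Z (closed form) = {log_Z_exact:.4f}")
print(f"log Z (grid)        = {log_Z_grid:.4f}")
```

The two values should agree closely here precisely because the parameter space is one-dimensional; the discussion below concerns the far more common setting in which neither a closed form nor direct numerical integration is available.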
Problematically, few complex statistical problems admit an analytical solution to Equations (1) or (2), or span such low-dimensional spaces [D(θ) ≲ 5–10] that direct numerical integration presents a viable alternative. With errors (at least in principle) independent of dimension, Monte Carlo-based integration methods have thus become the mode of choice for marginal likelihood estimation across a diverse range of scientific disciplines, from evolutionary biology (Xie et al., 2011; Arima and Tardella, 2012; Baele et al., 2012) and cosmology (Mukherjee, Parkinson and Liddle, 2006; Kilbinger et al., 2010) to quantitative finance (Li, Ni and Lin, 2011) and sociology (Caimo and Friel, 2013).

1.1 Monte Carlo-Based Integration Methods

With the posterior most often "thinner-tailed" than the prior and/or constrained within a much diminished sub-volume of the given parameter space, the simplest marginal likelihood estimators drawing solely from π(θ) or π(θ|y) cannot be relied upon for model selection purposes. In the first case—strictly, that ∫_Ω [1/L(y|θ)]π(θ) dθ diverges—the harmonic mean estimator (HME; Newton and Raftery, 1994),

Ẑ_H = [ Σ_{i=1}^{n} (1/n)/L(y|θ_i) ]^{−1} for θ_i ∼ π(θ|y),

suffers theoretically from an infinite variance, meaning in practice that its convergence toward the true Z as a one-sided α-stable limit law can be incredibly slow (Wolpert and Schmidler, 2012). Even when "robustified" as per Gelfand and Dey (1994) or Raftery et al. (2007), however, the HME remains notably insensitive to changes in π(θ), whereas Z itself is characteristically sensitive (Robert and Wraith, 2009; Friel and Wyse, 2012). [See also Weinberg (2012) for yet another approach to robustifying the HME.] Though assuredly finite by default, the variance of the prior arithmetic mean estimator (AME),

Ẑ_A = Σ_{i=1}^{n} L(y|θ_i)/n for θ_i ∼ π(θ),

on the other hand, will remain impractically large whenever there exists a substantial difference in "volume" between the regions of greatest concentration in prior and posterior mass, with huge sample sizes necessary to achieve reasonable accuracy (e.g., Neal, 1999).
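The sketch below, a continuation of the toy Normal–Normal example with illustrative settings not drawn from the paper, contrasts these two naive estimators against the exact answer as the prior scale τ is varied. One should typically see the exact log Z fall markedly as the prior becomes more diffuse, the AME track it only with degrading precision, and the HME remain strikingly insensitive to the change in prior.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

rng = np.random.default_rng(2)
sigma, n_obs = 1.0, 20
y = rng.normal(1.5, sigma, size=n_obs)

def log_like(theta):
    """Joint log-likelihood log L(y | theta), vectorized over theta."""
    theta = np.atleast_1d(theta)
    return stats.norm.logpdf(y[:, None], loc=theta, scale=sigma).sum(axis=0)

def log_Z_exact(tau):
    cov = sigma**2 * np.eye(n_obs) + tau**2 * np.ones((n_obs, n_obs))
    return stats.multivariate_normal.logpdf(y, mean=np.zeros(n_obs), cov=cov)

def naive_estimators(tau, n_draws=100_000):
    # Conjugate posterior: theta | y ~ N(mu_post, sd_post^2)
    prec = 1.0 / tau**2 + n_obs / sigma**2
    mu_post, sd_post = (y.sum() / sigma**2) / prec, prec**-0.5
    # AME: arithmetic mean of the likelihood over prior draws
    ll_prior = log_like(rng.normal(0.0, tau, n_draws))
    log_ame = logsumexp(ll_prior) - np.log(n_draws)
    # HME: harmonic mean of the likelihood over posterior draws
    ll_post = log_like(rng.normal(mu_post, sd_post, n_draws))
    log_hme = np.log(n_draws) - logsumexp(-ll_post)
    return log_ame, log_hme

for tau in (3.0, 30.0):
    log_ame, log_hme = naive_estimators(tau)
    print(f"tau = {tau:4.0f} | exact {log_Z_exact(tau):8.3f} | "
          f"AME {log_ame:8.3f} | HME {log_hme:8.3f}")
```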
A wealth of more sophisticated integration methods have thus lately been developed for generating improved estimates of the marginal likelihood, as reviewed in depth by Chen, Shao and Ibrahim (2000), Robert and Wraith (2009) and Friel and Wyse (2012). Notable examples include the following: adaptive multiple importance sampling (Cornuet et al., 2012), annealed importance sampling (Neal, 2001), bridge sampling (Meng and Wong, 1996), [ordinary] importance sampling (cf. Liu, 2001), path sampling/thermodynamic integration (Gelman and Meng, 1998; Lartillot and Philippe, 2006; Friel and Pettitt, 2008; Calderhead and Girolami, 2009), nested sampling (Skilling, 2006; Feroz and Hobson, 2008), nested importance sampling (Chopin and Robert, 2010), reverse logistic regression (Geyer, 1994), sequential Monte Carlo (SMC; Cappé et al., 2004; Del Moral, Doucet and Jasra, 2006), the Savage–Dickey density ratio (Marin and Robert, 2010) and the density of states (Habeck, 2012; Tan et al., 2012). A common thread running through almost all these schemes is the aim for a superior exploration of the relevant parameter space via "guided" transitions across a sequence of intermediate distributions, typically following a bridging path between the π(θ) and π(θ|y) extremes. [Or, more generally, the h(θ) and π(θ|y) extremes if a suitable auxiliary/reference density, h(θ), is available to facilitate the integration; cf. Lefebvre, Steele and Vandal (2010).] However, the nature of this bridging path differs significantly between algorithms. Nested sampling, for instance, evolves its "live point set" over a sequence of constrained-likelihood distributions, f(θ) ∝ π(θ)I(L(y|θ) ≥ L_lim), transitioning from the prior (L_lim = 0) through to the vicinity of peak likelihood (L_lim ≈ L_max − ε), while thermodynamic integration, on the other hand, draws progressively (via Markov Chain Monte Carlo [MCMC]; Tierney, 1994) from the family of "power posteriors,"

(3) π_t(θ|y) ∝ π(θ)L(y|θ)^t,

explicitly connecting the prior at t = 0 to the posterior at t = 1.

Another key point of comparison between these rival Monte Carlo techniques lies in their choice of identity by which the evidence is ultimately computed. The (geometric) path sampling identity,

log Z = ∫_0^1 E_{π_t}{log L(y|θ)} dt,

for example, is shared across both thermodynamic integration and SMC, in addition to its namesake. However, SMC can also be run with the "stepping-stone" solution (cf. Xie et al., 2011),

Z = ∏_{j=2}^{m} Z_{t_j}/Z_{t_{j−1}}, where t_1 = 0 and t_m = 1,

with {t_j : j = 1,...,m} indexing a sequence of ("tempered") bridging densities, and, indeed, this is the mode preferred by experienced practitioners (e.g., Del Moral, Doucet and Jasra, 2006). Yet another identity for computing the marginal likelihood is that of the recursive pathway explored here.

First introduced within the "biased sampling" paradigm (Vardi, 1985), the recursive pathway is shared by the popular techniques of "reverse logistic regression" (RLR) and "the density of states" (DoS). By recursive we mean that, algorithmically, each may be run such that the desired Z is obtained through backward induction of the complete set of intermediate normalizing constants corresponding to the sequence of distributions in the given bridging path by supposing these to be already known.

Following this review of the recursive family (which includes our theoretical contributions concerning the link between the DoS and nested sampling in Section 2.2.1), we highlight the potential for efficient prior-sensitivity analysis when using these marginal likelihood estimators (Section 2.3) and discuss issues regarding design and sampling of the bridging sequence (Section 2.4). We then introduce a novel heuristic to help inform the latter by characterizing the connection between the bridging sequences of biased sampling and thermodynamic integration (Section 3). Finally, we present two numerical case studies illustrating the issues and techniques discussed in the previous sections: the first concerns a "mock" banana-shaped likelihood function (Section 4) and includes the demonstration of a novel synthesis of the recursive pathway with nested sampling (Section 4.2), while the second concerns mixture modeling of the galaxy data set (Section 5) and includes a demonstration of importance sample reweighting of an infinite-dimensional mixture posterior to recover its finite-dimensional counterparts (Section 5.4.3).

2. BIASED SAMPLING

The archetypal recursive marginal likelihood estimator—from which both the RLR and DoS methods may be directly recovered—is that of biased sampling.
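To preview how such recursive normalization can proceed in practice, the sketch below applies one standard realization of the biased-sampling/RLR estimate, namely the self-consistency (fixed-point) iteration associated with Vardi (1985) and Geyer (1994), to exact draws from a short power-posterior bridging sequence for the toy Normal–Normal model used above. The model, temperature ladder, and update scheme are illustrative assumptions for this sketch, not a transcription of the paper's own implementation.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp

rng = np.random.default_rng(3)
sigma, n_obs, tau = 1.0, 20, 3.0
y = rng.normal(1.5, sigma, size=n_obs)

def log_like(theta):
    """Joint log-likelihood log L(y | theta), vectorized over theta."""
    theta = np.atleast_1d(theta)
    return stats.norm.logpdf(y[:, None], loc=theta, scale=sigma).sum(axis=0)

def log_prior(theta):
    return stats.norm.logpdf(theta, 0.0, tau)

# Power-posterior bridging sequence pi_t(theta|y) \propto pi(theta) L(y|theta)^t.
# For this conjugate toy model each member is itself Normal, so we can draw
# from it exactly; in general one would rely on MCMC for these draws.
temps = np.linspace(0.0, 1.0, 11) ** 3      # illustrative temperature ladder
n_per = 2_000                               # draws per bridging distribution
draws = []
for t in temps:
    prec = 1.0 / tau**2 + t * n_obs / sigma**2
    mu = t * (y.sum() / sigma**2) / prec
    draws.append(rng.normal(mu, prec**-0.5, n_per))
pooled = np.concatenate(draws)

# log q_j(theta_l): unnormalized log density of bridge j at every pooled draw.
log_q = log_prior(pooled)[None, :] + temps[:, None] * log_like(pooled)[None, :]
log_n = np.full(len(temps), np.log(n_per))

# Fixed-point (self-consistency) iteration for the normalizing constants c_j:
#   c_k <- sum_l q_k(theta_l) / sum_j [ n_j q_j(theta_l) / c_j ],
# anchored at c_1 = 1 since the t = 0 member is the (normalized) prior.
log_c = np.zeros(len(temps))
for _ in range(200):
    log_mix = logsumexp(log_n[:, None] - log_c[:, None] + log_q, axis=0)
    log_c = logsumexp(log_q - log_mix[None, :], axis=1)
    log_c -= log_c[0]                       # fix the overall scale

cov = sigma**2 * np.eye(n_obs) + tau**2 * np.ones((n_obs, n_obs))
log_Z_exact = stats.multivariate_normal.logpdf(y, mean=np.zeros(n_obs), cov=cov)
print(f"log Z (recursive fixed point) = {log_c[-1]:.3f}")
print(f"log Z (closed form)           = {log_Z_exact:.3f}")
```

Reading off the constant attached to t = 1, with the prior's constant anchored at one, yields the marginal likelihood estimate; the same pooled draws and pseudo-mixture weights are what the paper goes on to exploit for prior-sensitivity analysis (Section 2.3).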