Recursive Pathways to Marginal Likelihood Estimation with Prior-Sensitivity Analysis

Total Pages: 16

File Type: pdf, Size: 1020 KB

Recursive Pathways to Marginal Likelihood Estimation with Prior-Sensitivity Analysis

Ewan Cameron and Anthony Pettitt

Statistical Science 2014, Vol. 29, No. 3, 397–419. DOI: 10.1214/13-STS465. © Institute of Mathematical Statistics, 2014. arXiv:1301.6450v3 [stat.ME] 15 Oct 2014.

Ewan Cameron is Research Associate and Anthony Pettitt is Professor, Science and Engineering Faculty, School of Mathematical Sciences (Statistical Science), Queensland University of Technology, GPO Box 2434, Brisbane, QLD 4001, Australia; e-mail: [email protected].

Abstract. We investigate the utility to computational Bayesian analyses of a particular family of recursive marginal likelihood estimators characterized by the (equivalent) algorithms known as "biased sampling" or "reverse logistic regression" in the statistics literature and "the density of states" in physics. Through a pair of numerical examples (including mixture modeling of the well-known galaxy data set) we highlight the remarkable diversity of sampling schemes amenable to such recursive normalization, as well as the notable efficiency of the resulting pseudo-mixture distributions for gauging prior sensitivity in the Bayesian model selection context. Our key theoretical contributions are to introduce a novel heuristic ("thermodynamic integration via importance sampling") for qualifying the role of the bridging sequence in this procedure and to reveal various connections between these recursive estimators and the nested sampling technique.

Key words and phrases: Bayes factor, Bayesian model selection, importance sampling, marginal likelihood, Metropolis-coupled Markov Chain Monte Carlo, nested sampling, normalizing constant, path sampling, reverse logistic regression, thermodynamic integration.

1. INTRODUCTION

Though typically unnecessary for computational parameter inference in the Bayesian framework, the factor, Z, required to normalize the product of prior and likelihood nevertheless plays a vital role in Bayesian model selection and model averaging (Kass and Raftery, 1995; Hoeting et al., 1999). For priors admitting an "ordinary" density, π(θ), with respect to the Lebesgue measure (a "Λ-density"), we write for the posterior

(1)    π(θ|y) = L(y|θ)π(θ)/Z    with    Z = ∫_Ω L(y|θ)π(θ) dθ,

and, more generally (e.g., for stochastic process priors), we write

(2)    dP_{θ|y}(θ) = L(y|θ) dP_θ(θ)/Z    with    Z = ∫_Ω L(y|θ) {dP_θ(θ)},

with the likelihood, L(y|θ), a non-negative, real-valued function supposed integrable with respect to the prior. In this context Z is generally referred to as either the marginal likelihood (i.e., the likelihood of the observed data marginalized [averaged] over the prior) or the evidence. With the latter term, though, one risks the impression of overstating the value of this statistic in the case of limited prior knowledge (cf. Gelman et al., 2004, Chapter 6).
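To make Equation (1) concrete, the following minimal sketch (an illustration added here, not part of the original article) evaluates the evidence of a one-dimensional conjugate Gaussian model by direct grid integration and checks it against the closed-form answer; the data, noise scale and prior are invented for the example.

```python
# Illustrative sketch: for a one-dimensional toy model the evidence in Eq. (1)
# can still be obtained by direct numerical integration on a grid. All model
# settings below are assumptions made purely for the example.
import numpy as np
from scipy.stats import norm, multivariate_normal

y = np.array([1.2, 0.8, 1.9, 1.4])      # assumed observations
sigma, mu0, tau0 = 1.0, 0.0, 3.0        # known noise scale; N(mu0, tau0^2) prior on theta

# Z = integral of L(y|theta) * pi(theta) over theta (trapezoidal rule on a dense grid).
theta = np.linspace(-15.0, 15.0, 20_001)
log_kernel = norm.logpdf(y[:, None], theta, sigma).sum(axis=0) + norm.logpdf(theta, mu0, tau0)
Z_grid = np.trapz(np.exp(log_kernel), theta)

# Closed form for this conjugate model: y ~ N(mu0*1, sigma^2 I + tau0^2 11^T).
m = len(y)
cov = sigma**2 * np.eye(m) + tau0**2 * np.ones((m, m))
Z_exact = np.exp(multivariate_normal.logpdf(y, mean=np.full(m, mu0), cov=cov))

print(f"grid integration: {Z_grid:.6e}")
print(f"closed form:      {Z_exact:.6e}")
```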
Problematically, few complex statistical problems admit an analytical solution to Equations (1) or (2), or span such low-dimensional spaces [D(θ) ≲ 5–10] that direct numerical integration presents a viable alternative. With errors (at least in principle) independent of dimension, Monte Carlo-based integration methods have thus become the mode of choice for marginal likelihood estimation across a diverse range of scientific disciplines, from evolutionary biology (Xie et al., 2011; Arima and Tardella, 2012; Baele et al., 2012) and cosmology (Mukherjee, Parkinson and Liddle, 2006; Kilbinger et al., 2010) to quantitative finance (Li, Ni and Lin, 2011) and sociology (Caimo and Friel, 2013).

1.1 Monte Carlo-Based Integration Methods

With the posterior most often "thinner-tailed" than the prior and/or constrained within a much diminished sub-volume of the given parameter space, the simplest marginal likelihood estimators drawing solely from π(θ) or π(θ|y) cannot be relied upon for model selection purposes. In the first case—strictly, that ∫_Ω [1/L(y|θ)]π(θ) dθ diverges—the harmonic mean estimator (HME; Newton and Raftery, 1994),

    Ẑ_H = [ Σ_{i=1}^{n} (1/n)/L(y|θ_i) ]^{−1}    for θ_i ∼ π(θ|y),

suffers theoretically from an infinite variance, meaning in practice that its convergence toward the true Z as a one-sided α-stable limit law can be incredibly slow (Wolpert and Schmidler, 2012). Even when "robustified" as per Gelfand and Dey (1994) or Raftery et al. (2007), however, the HME remains notably insensitive to changes in π(θ), whereas Z itself is characteristically sensitive (Robert and Wraith, 2009; Friel and Wyse, 2012). [See also Weinberg (2012) for yet another approach to robustifying the HME.] Though assuredly finite by default, the variance of the prior arithmetic mean estimator (AME),

    Ẑ_A = Σ_{i=1}^{n} L(y|θ_i)/n    for θ_i ∼ π(θ),

on the other hand, will remain impractically large whenever there exists a substantial difference in "volume" between the regions of greatest concentration in prior and posterior mass, with huge sample sizes necessary to achieve reasonable accuracy (e.g., Neal, 1999).
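The short sketch below (again an illustration added here, not the authors' code) contrasts the AME and HME on a toy conjugate Gaussian model whose evidence is available in closed form; all settings are assumptions chosen for the example, and because the posterior is known exactly the HME can be computed without MCMC.

```python
# Minimal sketch contrasting the prior arithmetic mean estimator (AME) and the
# harmonic mean estimator (HME) on a toy conjugate Gaussian model with known Z.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)

# Toy model: y_i ~ N(theta, sigma^2) with a N(mu0, tau0^2) prior on theta.
sigma, mu0, tau0 = 1.0, 0.0, 3.0
y = rng.normal(1.5, sigma, size=20)
m, n = len(y), 50_000

def log_like(theta):                      # log L(y|theta), vectorized over theta
    return norm.logpdf(y, loc=np.atleast_1d(theta)[:, None], scale=sigma).sum(axis=1)

# Exact log Z: marginally y ~ N(mu0*1, sigma^2 I + tau0^2 11^T).
cov = sigma**2 * np.eye(m) + tau0**2 * np.ones((m, m))
logZ_exact = multivariate_normal.logpdf(y, mean=np.full(m, mu0), cov=cov)

# AME: average the likelihood over draws from the prior.
theta_prior = rng.normal(mu0, tau0, size=n)
logZ_ame = logsumexp(log_like(theta_prior)) - np.log(n)

# HME: harmonic mean of the likelihood over draws from the (here exact) posterior.
post_prec = 1.0 / tau0**2 + m / sigma**2
post_mean = (mu0 / tau0**2 + y.sum() / sigma**2) / post_prec
theta_post = rng.normal(post_mean, 1.0 / np.sqrt(post_prec), size=n)
logZ_hme = np.log(n) - logsumexp(-log_like(theta_post))

print(f"exact log Z: {logZ_exact: .4f}")
print(f"AME  log Z:  {logZ_ame: .4f}")    # noisy when prior and posterior barely overlap
print(f"HME  log Z:  {logZ_hme: .4f}")    # typically biased and insensitive to the prior
```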
A wealth of more sophisticated integration methods have thus lately been developed for generating improved estimates of the marginal likelihood, as reviewed in depth by Chen, Shao and Ibrahim (2000), Robert and Wraith (2009) and Friel and Wyse (2012). Notable examples include the following: adaptive multiple importance sampling (Cornuet et al., 2012), annealed importance sampling (Neal, 2001), bridge sampling (Meng and Wong, 1996), [ordinary] importance sampling (cf. Liu, 2001), path sampling/thermodynamic integration (Gelman and Meng, 1998; Lartillot and Phillipe, 2006; Friel and Pettitt, 2008; Calderhead and Girolami, 2009), nested sampling (Skilling, 2006; Feroz and Hobson, 2008), nested importance sampling (Chopin and Robert, 2010), reverse logistic regression (Geyer, 1994), sequential Monte Carlo (SMC; Cappé et al., 2004; Del Moral, Doucet and Jasra, 2006), the Savage–Dickey density ratio (Marin and Robert, 2010) and the density of states (Habeck, 2012; Tan et al., 2012). A common thread running through almost all these schemes is the aim for a superior exploration of the relevant parameter space via "guided" transitions across a sequence of intermediate distributions, typically following a bridging path between the π(θ) and π(θ|y) extremes. [Or, more generally, the h(θ) and π(θ|y) extremes if a suitable auxiliary/reference density, h(θ), is available to facilitate the integration; cf. Lefebvre, Steele and Vandal (2010).] However, the nature of this bridging path differs significantly between algorithms. Nested sampling, for instance, evolves its "live point set" over a sequence of constrained-likelihood distributions, f(θ) ∝ π(θ)I(L(y|θ) ≥ L_lim), transitioning from the prior (L_lim = 0) through to the vicinity of peak likelihood (L_lim ≈ L_max − ε), while thermodynamic integration, on the other hand, draws progressively (via Markov Chain Monte Carlo [MCMC]; Tierney, 1994) from the family of "power posteriors,"

(3)    π_t(θ|y) ∝ π(θ) L(y|θ)^t,

explicitly connecting the prior at t = 0 to the posterior at t = 1.

Another key point of comparison between these rival Monte Carlo techniques lies in their choice of identity by which the evidence is ultimately computed. The (geometric) path sampling identity,

    log Z = ∫_0^1 E_{π_t}{ log L(y|θ) } dt,

for example, is shared across both thermodynamic integration and SMC, in addition to its namesake. However, SMC can also be run with the "stepping-stone" solution (cf. Xie et al., 2011),

    Z = ∏_{j=2}^{m} Z_{t_j} / Z_{t_{j−1}},    where t_1 = 0 and t_m = 1,

with {t_j : j = 1, ..., m} indexing a sequence of ("tempered") bridging densities, and, indeed, this is the mode preferred by experienced practitioners (e.g., Del Moral, Doucet and Jasra, 2006). Yet another identity for computing the marginal likelihood is that of the recursive pathway explored here.

First introduced within the "biased sampling" paradigm (Vardi, 1985), the recursive pathway is shared by the popular techniques of "reverse logistic regression" (RLR) and "the density of states" (DoS). By recursive we mean that, algorithmically, each may be run such that the desired Z is obtained through backward induction of the complete set of intermediate normalizing constants corresponding to the sequence of distributions in the given bridging path by supposing these to be already known.

Following this review of the recursive family (which includes our theoretical contributions concerning the link between the DoS and nested sampling in Section 2.2.1), we highlight the potential for efficient prior-sensitivity analysis when using these marginal likelihood estimators (Section 2.3) and discuss issues regarding design and sampling of the bridging sequence (Section 2.4). We then introduce a novel heuristic to help inform the latter by characterizing the connection between the bridging sequences of biased sampling and thermodynamic integration (Section 3). Finally, we present two numerical case studies illustrating the issues and techniques discussed in the previous sections: the first concerns a "mock" banana-shaped likelihood function (Section 4) and includes the demonstration of a novel synthesis of the recursive pathway with nested sampling (Section 4.2), while the second concerns mixture modeling of the galaxy data set (Section 5) and includes a demonstration of importance sample reweighting of an infinite-dimensional mixture posterior to recover its finite-dimensional counterparts (Section 5.4.3).

2. BIASED SAMPLING

The archetypal recursive marginal likelihood estimator—from which both the RLR and DoS methods may be directly recovered—is that of biased sampling.
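As a rough illustration of the recursive idea just introduced, the sketch below (an addition for this page, not the authors' implementation) runs the standard self-consistency (fixed-point) form of the biased-sampling/RLR estimator over a power-posterior bridging sequence for a toy conjugate Gaussian model; the data, prior and temperature ladder are assumptions, and the exact evidence is computed only to check the result.

```python
# Minimal sketch of the recursive "biased sampling" / reverse-logistic-regression
# normalization over a power-posterior bridge, for a toy conjugate Gaussian model.
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(1)

# Toy model: y_i ~ N(theta, sigma^2), prior theta ~ N(mu0, tau0^2) (assumed values).
sigma, mu0, tau0 = 1.0, 0.0, 3.0
y = rng.normal(1.5, sigma, size=20)
m = len(y)

def log_like(theta):                       # log L(y|theta), vectorized over theta
    return norm.logpdf(y, loc=np.atleast_1d(theta)[:, None], scale=sigma).sum(axis=1)

def log_prior(theta):
    return norm.logpdf(theta, loc=mu0, scale=tau0)

# Exact evidence for reference: y ~ N(mu0*1, sigma^2 I + tau0^2 11^T).
cov = sigma**2 * np.eye(m) + tau0**2 * np.ones((m, m))
logZ_exact = multivariate_normal.logpdf(y, mean=np.full(m, mu0), cov=cov)

# Power-posterior ladder pi_t ∝ pi(theta) L(y|theta)^t; each rung is Gaussian here,
# so we can draw from it exactly (no MCMC needed in this toy example).
temps = np.linspace(0.0, 1.0, 11) ** 3     # assumed ladder, denser near t = 0
n_per = 500
draws = []
for t in temps:
    prec = 1.0 / tau0**2 + t * m / sigma**2
    mean = (mu0 / tau0**2 + t * y.sum() / sigma**2) / prec
    draws.append(rng.normal(mean, 1.0 / np.sqrt(prec), size=n_per))
theta_all = np.concatenate(draws)          # pooled sample
counts = np.full(len(temps), n_per)

# log q_k(theta_j) for every rung k and pooled draw j.
log_q = log_prior(theta_all)[None, :] + temps[:, None] * log_like(theta_all)[None, :]

# Recursive normalization: Z_k = sum_j q_k(j) / sum_l n_l q_l(j)/Z_l, iterated to a
# fixed point, with Z at t = 0 pinned to 1 (the prior is already normalized).
log_Z = np.zeros(len(temps))
for _ in range(500):
    log_mix = logsumexp(np.log(counts)[:, None] + log_q - log_Z[:, None], axis=0)
    log_Z_new = logsumexp(log_q - log_mix[None, :], axis=1)
    log_Z_new -= log_Z_new[0]              # fix the overall scale at the prior rung
    if np.max(np.abs(log_Z_new - log_Z)) < 1e-8:
        log_Z = log_Z_new
        break
    log_Z = log_Z_new

print(f"recursive estimate of log Z: {log_Z[-1]:.4f}")
print(f"exact log Z:                 {logZ_exact:.4f}")
```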
Recommended publications
  • The Composite Marginal Likelihood (CML) Inference Approach with Applications to Discrete and Mixed Dependent Variable Models
Technical Report 101: The Composite Marginal Likelihood (CML) Inference Approach with Applications to Discrete and Mixed Dependent Variable Models. Chandra R. Bhat, Center for Transportation Research, September 2014. Data-Supported Transportation Operations & Planning Center (D-STOP), a Tier 1 USDOT University Transportation Center at The University of Texas at Austin. D-STOP is a collaborative initiative by researchers at the Center for Transportation Research and the Wireless Networking and Communications Group at The University of Texas at Austin.

DISCLAIMER: The contents of this report reflect the views of the authors, who are responsible for the facts and the accuracy of the information presented herein. This document is disseminated under the sponsorship of the U.S. Department of Transportation's University Transportation Centers Program, in the interest of information exchange. The U.S. Government assumes no liability for the contents or use thereof.

Technical Report Documentation: Report No. D-STOP/2016/101. Title and Subtitle: The Composite Marginal Likelihood (CML) Inference Approach with Applications to Discrete and Mixed Dependent Variable Models. Report Date: September 2014. Author: Chandra R. Bhat. Performing Organization Report No.: Report 101. Performing Organization: Data-Supported Transportation Operations & Planning Center (D-STOP), The University of Texas at Austin, 1616 Guadalupe Street, Suite 4.202, Austin, Texas 78701. Contract or Grant No.: DTRT13-G-UTC58. Sponsoring Agency: Data-Supported Transportation Operations & Planning Center (D-STOP), The University of Texas at Austin.
  • A Widely Applicable Bayesian Information Criterion
Journal of Machine Learning Research 14 (2013) 867-897. Submitted 8/12; Revised 2/13; Published 3/13. A Widely Applicable Bayesian Information Criterion. Sumio Watanabe, [email protected], Department of Computational Intelligence and Systems Science, Tokyo Institute of Technology, Mailbox G5-19, 4259 Nagatsuta, Midori-ku, Yokohama, Japan 226-8502. Editor: Manfred Opper.

Abstract: A statistical model or a learning machine is called regular if the map taking a parameter to a probability distribution is one-to-one and if its Fisher information matrix is always positive definite. If otherwise, it is called singular. In regular statistical models, the Bayes free energy, which is defined by the minus logarithm of Bayes marginal likelihood, can be asymptotically approximated by the Schwarz Bayes information criterion (BIC), whereas in singular models such approximation does not hold. Recently, it was proved that the Bayes free energy of a singular model is asymptotically given by a generalized formula using a birational invariant, the real log canonical threshold (RLCT), instead of half the number of parameters in BIC. Theoretical values of RLCTs in several statistical models are now being discovered based on algebraic geometrical methodology. However, it has been difficult to estimate the Bayes free energy using only training samples, because an RLCT depends on an unknown true distribution. In the present paper, we define a widely applicable Bayesian information criterion (WBIC) by the average log likelihood function over the posterior distribution with the inverse temperature 1/log n, where n is the number of training samples. We mathematically prove that WBIC has the same asymptotic expansion as the Bayes free energy, even if a statistical model is singular for or unrealizable by a statistical model.
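As a rough illustration of the WBIC definition quoted above, the sketch below (an assumption-laden example, not Watanabe's code) averages the log likelihood over the posterior at inverse temperature 1/log n for a conjugate Gaussian toy model and compares the result with the exact Bayes free energy; the agreement is only asymptotic.

```python
# Illustrative WBIC computation for a toy conjugate Gaussian model (assumed settings).
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)
sigma, mu0, tau0 = 1.0, 0.0, 3.0                 # assumed known noise scale and prior
y = rng.normal(1.0, sigma, size=100)
n = len(y)
beta = 1.0 / np.log(n)                           # WBIC inverse temperature

# Tempered posterior pi(theta) * L(y|theta)^beta is Gaussian for this model.
prec = 1.0 / tau0**2 + beta * n / sigma**2
mean = (mu0 / tau0**2 + beta * y.sum() / sigma**2) / prec
theta = rng.normal(mean, 1.0 / np.sqrt(prec), size=20_000)

loglik = norm.logpdf(y[:, None], theta, sigma).sum(axis=0)   # sum_i log p(y_i|theta)
wbic = -loglik.mean()                                        # WBIC estimate of the free energy

cov = sigma**2 * np.eye(n) + tau0**2 * np.ones((n, n))
free_energy = -multivariate_normal.logpdf(y, mean=np.full(n, mu0), cov=cov)

print(f"WBIC:           {wbic:.2f}")
print(f"-log Z (exact): {free_energy:.2f}")      # agreement is asymptotic, not exact
```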
  • Categorical Distributions in Natural Language Processing Version 0.1
Categorical Distributions in Natural Language Processing, Version 0.1. MURAWAKI Yugo, 12 May 2016.

Categorical distribution: Suppose a random variable x takes one of K values. x is generated according to a categorical distribution Cat(θ), where, for example, θ = (0.1, 0.6, 0.3) over the outcomes (R, Y, G). In many task settings, we do not know the true θ and need to infer it from observed data x = (x_1, ..., x_N). Once we infer θ, we often want to predict a new variable x′. NOTE: In Bayesian settings, θ is usually integrated out and x′ is predicted directly from x.

Categorical distributions are a building block of natural language models: the N-gram language model (predicting the next word), POS tagging based on a Hidden Markov Model (HMM), probabilistic context-free grammars (PCFG), and topic models (Latent Dirichlet Allocation, LDA).

Example: HMM-based POS tagging of "the sun has risen" with tag sequence BOS DT NN VBZ VBN EOS. Let K be the number of POS tags and V be the vocabulary size (ignore BOS and EOS for simplicity). The transition probabilities can be computed using K categorical distributions (θ^TRANS_DT, θ^TRANS_NN, ...), each of dimension K, e.g., θ^TRANS_DT = (0.21, 0.27, 0.09, ...) over (NN, NNS, ADJ, ...). Similarly, the emission probabilities can be computed using K categorical distributions (θ^EMIT_DT, θ^EMIT_NN, ...), each of dimension V, e.g., θ^EMIT_NN = (0.012, 0.002, 0.005, ...) over (sun, rose, cat, ...).

Outline: categorical and multinomial distributions; conjugacy and the posterior predictive distribution; LDA (Latent Dirichlet Allocation) as an application; Gibbs sampling for inference.
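A tiny sketch of the HMM generative story described in these slides is given below; the transition and emission tables are invented for the illustration and are not the values from the slides.

```python
# Generating a tag/word sequence from an HMM whose transition and emission
# distributions are each categorical (all probabilities here are assumptions).
import numpy as np

rng = np.random.default_rng(0)

tags = ["DT", "NN", "VBZ", "VBN"]
vocab = ["the", "sun", "has", "risen"]

# One categorical distribution per current tag (rows sum to 1).
trans = np.array([
    [0.05, 0.80, 0.10, 0.05],   # from DT
    [0.05, 0.15, 0.70, 0.10],   # from NN
    [0.10, 0.20, 0.10, 0.60],   # from VBZ
    [0.40, 0.30, 0.20, 0.10],   # from VBN
])
emit = np.array([
    [0.90, 0.04, 0.03, 0.03],   # DT emits mostly "the"
    [0.05, 0.85, 0.05, 0.05],   # NN emits mostly "sun"
    [0.05, 0.05, 0.85, 0.05],   # VBZ emits mostly "has"
    [0.03, 0.03, 0.04, 0.90],   # VBN emits mostly "risen"
])

state = 0                        # start from DT (standing in for BOS -> DT)
for _ in range(4):
    word = rng.choice(len(vocab), p=emit[state])
    print(tags[state], vocab[word])
    state = rng.choice(len(tags), p=trans[state])
```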
  • Binomial and Multinomial Distributions
Binomial and multinomial distributions. Kevin P. Murphy. Last updated October 24, 2006. (* Denotes more advanced sections.)

1 Introduction. In this chapter, we study probability distributions that are suitable for modelling discrete data, like letters and words. This will be useful later when we consider such tasks as classifying and clustering documents, recognizing and segmenting languages and DNA sequences, data compression, etc. Formally, we will be concerned with density models for X ∈ {1, ..., K}, where K is the number of possible values for X; we can think of this as the number of symbols/letters in our alphabet/language. We assume that K is known, and that the values of X are unordered: this is called categorical data, as opposed to ordinal data, in which the discrete states can be ranked (e.g., low, medium and high). If K = 2 (e.g., X represents heads or tails), we will use a binomial distribution. If K > 2, we will use a multinomial distribution.

2 Bernoullis and Binomials. Let X ∈ {0, 1} be a binary random variable (e.g., a coin toss). Suppose p(X = 1) = θ. Then

(1)    p(X|θ) = Be(X|θ) = θ^X (1 − θ)^(1−X)

is called a Bernoulli distribution. It is easy to show that

(2)    E[X] = p(X = 1) = θ
(3)    Var[X] = θ(1 − θ)

The likelihood for a sequence D = (x_1, ..., x_N) of coin tosses is

(4)    p(D|θ) = ∏_{n=1}^{N} θ^{x_n} (1 − θ)^{1−x_n} = θ^{N_1} (1 − θ)^{N_0}

where N_1 = Σ_{n=1}^{N} x_n is the number of heads (X = 1) and N_0 = Σ_{n=1}^{N} (1 − x_n) = N − N_1 is the number of tails (X = 0).
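The following few lines illustrate Equations (1) and (4): the likelihood of a toss sequence depends on the data only through the counts N_1 and N_0. The data and θ value are assumptions for the example.

```python
# Bernoulli likelihood of a coin-toss sequence, per toss and via sufficient counts.
import numpy as np

D = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])   # assumed coin tosses
N1, N0 = D.sum(), len(D) - D.sum()

theta = 0.6
lik_per_toss = theta**D * (1 - theta)**(1 - D)           # Eq. (1) applied toss by toss
lik_counts = theta**N1 * (1 - theta)**N0                 # Eq. (4), counts only

print(N1, N0)                                            # 7 3
print(np.prod(lik_per_toss), lik_counts)                 # identical values
print("MLE of theta:", N1 / len(D))                      # 0.7
```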
  • Bayesian Monte Carlo
Bayesian Monte Carlo. Carl Edward Rasmussen and Zoubin Ghahramani, Gatsby Computational Neuroscience Unit, University College London, 17 Queen Square, London WC1N 3AR, England. edward,[email protected], http://www.gatsby.ucl.ac.uk

Abstract: We investigate Bayesian alternatives to classical Monte Carlo methods for evaluating integrals. Bayesian Monte Carlo (BMC) allows the incorporation of prior knowledge, such as smoothness of the integrand, into the estimation. In a simple problem we show that this outperforms any classical importance sampling method. We also attempt more challenging multidimensional integrals involved in computing marginal likelihoods of statistical models (a.k.a. partition functions and model evidences). We find that Bayesian Monte Carlo outperformed Annealed Importance Sampling, although for very high dimensional problems or problems with massive multimodality BMC may be less adequate. One advantage of the Bayesian approach to Monte Carlo is that samples can be drawn from any distribution. This allows for the possibility of active design of sample points so as to maximise information gain.

1 Introduction. Inference in most interesting machine learning algorithms is not computationally tractable, and is solved using approximations. This is particularly true for Bayesian models which require evaluation of complex multidimensional integrals. Both analytical approximations, such as the Laplace approximation and variational methods, and Monte Carlo methods have recently been used widely for Bayesian machine learning problems. It is interesting to note that Monte Carlo itself is a purely frequentist procedure [O'Hagan, 1987; MacKay, 1999]. This leads to several inconsistencies which we review below, outlined in a paper by O'Hagan [1987] with the title "Monte Carlo is Fundamentally Unsound".
  • Marginal Likelihood
STA 4273H: Statistical Machine Learning. Russ Salakhutdinov, Department of Statistics, [email protected], http://www.utstat.utoronto.ca/~rsalakhu/, Sidney Smith Hall, Room 6002. Lecture 2.

Last Class: In our last class, we looked at statistical decision theory, linear regression models, linear basis function models, regularized linear regression models, and the bias-variance decomposition. We will now look at the Bayesian framework and Bayesian linear regression models.

Bayesian Approach: We formulate our knowledge about the world probabilistically. We define the model that expresses our knowledge qualitatively (e.g. independence assumptions, forms of distributions). Our model will have some unknown parameters. We capture our assumptions, or prior beliefs, about unknown parameters (e.g. range of plausible values) by specifying the prior distribution over those parameters before seeing the data. We observe the data. We compute the posterior probability distribution for the parameters, given observed data. We use this posterior distribution to make predictions by averaging over the posterior distribution, to examine/account for uncertainty in the parameter values, and to make decisions by minimizing expected posterior loss. (See Radford Neal's NIPS tutorial on "Bayesian Methods for Machine Learning".)

Posterior Distribution: The posterior distribution for the model parameters can be found by combining the prior with the likelihood for the parameters given the data. This is accomplished using Bayes'
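As a concrete companion to the workflow sketched in these slides, the snippet below (an illustration added here, not course material) carries out the conjugate posterior update and a posterior-averaged prediction for a small Bayesian linear regression with known noise variance; the prior, noise level and data are assumptions.

```python
# Conjugate Bayesian linear regression: prior -> posterior -> predictive.
import numpy as np

rng = np.random.default_rng(0)

# Data from an assumed "true" line y = 1 + 2x + noise.
x = np.linspace(-1, 1, 30)
X = np.column_stack([np.ones_like(x), x])          # design matrix with intercept
y = X @ np.array([1.0, 2.0]) + rng.normal(0, 0.3, size=len(x))

sigma2 = 0.3**2                                    # assumed known noise variance
prior_mean = np.zeros(2)
prior_cov = np.eye(2) * 10.0                       # broad Gaussian prior on the weights

# Posterior over weights: Gaussian, via the usual conjugate update.
post_prec = np.linalg.inv(prior_cov) + X.T @ X / sigma2
post_cov = np.linalg.inv(post_prec)
post_mean = post_cov @ (np.linalg.inv(prior_cov) @ prior_mean + X.T @ y / sigma2)

# Predictive mean and variance at a new input, averaging over the posterior.
x_new = np.array([1.0, 0.5])
pred_mean = x_new @ post_mean
pred_var = x_new @ post_cov @ x_new + sigma2

print("posterior mean of weights:", post_mean)
print("prediction at x = 0.5:", pred_mean, "+/-", np.sqrt(pred_var))
```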
  • Bayesian Inference
Bayesian Inference. Thomas Nichols, with thanks to Lee Harrison.

Topics: Bayesian segmentation and normalisation; spatial priors on activation extent; posterior probability maps (PPMs); dynamic causal modelling.

Attention to Motion paradigm (Büchel & Friston 1997, Cereb. Cortex; Büchel et al. 1998, Brain): conditions are fixation only, observing static dots (photic, V1), observing moving dots (motion, V5), and a task on moving dots (attention, V5 + parietal cortex). Results contrast attention with no attention in SPC, V3A and V5+.

Dynamic causal models for the same paradigm: Model 1 (forward) places attentional modulation on the V1→V5 connection; Model 2 (backward) places it on the SPC→V5 connection. Bayesian model selection asks which model is optimal.

Responses to uncertainty (long versus short term memory): the stimuli are a sequence of randomly sampled discrete events; a simple computational model of an observer's response to uncertainty is based on the number of past events (extent of memory); the question is which regions are best explained by the short or long term memory model.

Overview: introductory remarks; some probability densities/distributions; probabilistic (generative) models; Bayesian inference; a simple example (Bayesian linear regression); SPM applications (segmentation, dynamic causal modeling, spatial models of fMRI time series).
  • On the Derivation of the Bayesian Information Criterion
On the derivation of the Bayesian Information Criterion. H. S. Bhat and N. Kumar, School of Natural Sciences, University of California, Merced, 5200 N. Lake Rd., Merced, CA 95343. Corresponding author: [email protected]. November 8, 2010.

Abstract: We present a careful derivation of the Bayesian Information Criterion (BIC) for model selection. The BIC is viewed here as an approximation to the Bayes Factor. One of the main ingredients in the approximation, the use of Laplace's method for approximating integrals, is explained well in the literature. Our derivation sheds light on this and other steps in the derivation, such as the use of a flat prior and the invocation of the weak law of large numbers, that are not often discussed in detail.

1 Notation. Let us define the notation that we will use:
y : observed data y_1, ..., y_n
M_i : candidate model
P(y|M_i) : marginal likelihood of the model M_i given the data
θ_i : vector of parameters in the model M_i
g_i(θ_i) : the prior density of the parameters θ_i
f(y|θ_i) : the density of the data given the parameters θ_i
L(θ_i|y) : the likelihood of y given the model M_i
θ̂_i : the MLE of θ_i that maximizes L(θ_i|y)

2 Bayes Factor. The Bayesian approach to model selection [1] is to maximize the posterior probability of a model (M_i) given the data {y_j}_{j=1}^n. Applying Bayes theorem to calculate the posterior probability of a model given the data, we get

(1)    P(M_i|y_1, ..., y_n) = P(y_1, ..., y_n|M_i) P(M_i) / P(y_1, ..., y_n),

where P(y_1, ..., y_n|M_i) is called the marginal likelihood of the model M_i.
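A small worked sketch of Equation (1) and the BIC approximation follows (assumptions chosen for illustration, not the paper's example): two Gaussian models with known variance are compared via their exact marginal likelihoods, their BIC approximations and the resulting posterior model probabilities under equal prior odds.

```python
# Comparing a fixed-mean model M0 against an unknown-mean model M1 for Gaussian data.
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)
sigma = 1.0
y = rng.normal(0.4, sigma, size=50)
n = len(y)

# M0: y_i ~ N(0, sigma^2), no free parameters, so the marginal likelihood is the likelihood.
logml_M0 = norm.logpdf(y, 0.0, sigma).sum()

# M1: y_i ~ N(theta, sigma^2) with theta ~ N(0, tau0^2); exact marginal likelihood.
tau0 = 2.0
cov = sigma**2 * np.eye(n) + tau0**2 * np.ones((n, n))
logml_M1 = multivariate_normal.logpdf(y, mean=np.zeros(n), cov=cov)

# BIC-style approximation: log P(y|M) ~ log L(theta_hat) - (k/2) log n.
logbic_M0 = logml_M0                                                    # k = 0
theta_hat = y.mean()
logbic_M1 = norm.logpdf(y, theta_hat, sigma).sum() - 0.5 * np.log(n)    # k = 1

# Posterior model probabilities under equal prior odds, Eq. (1).
logml = np.array([logml_M0, logml_M1])
post = np.exp(logml - logml.max())
post /= post.sum()

print("exact log marginal likelihoods:", logml)
print("BIC approximations:            ", [logbic_M0, logbic_M1])
print("posterior model probabilities: ", post)
```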
  • Marginal Likelihood from the Gibbs Output, Siddhartha Chib, Journal of the American Statistical Association
Marginal Likelihood from the Gibbs Output. Siddhartha Chib. Journal of the American Statistical Association, Vol. 90, No. 432 (Dec., 1995), pp. 1313-1321. Published by the American Statistical Association. Stable URL: http://links.jstor.org/sici?sici=0162-1459%28199512%2990%3A432%3C1313%3AMLFTGO%3E2.0.CO%3B2-2

In the context of Bayes estimation via Gibbs sampling, with or without data augmentation, a simple approach is developed for computing the marginal density of the sample data (marginal likelihood) given parameter draws from the posterior distribution.
  • Bayesian Analysis of Graphical Models of Marginal Independence for Three Way Contingency Tables
Bayesian Analysis of Graphical Models of Marginal Independence for Three Way Contingency Tables. Tarantola, Claudia; Ntzoufras, Ioannis. Working Paper, Quaderni di Dipartimento, No. 172. Provided in cooperation with the University of Pavia, Department of Economics and Quantitative Methods (EPMQ), and made available via EconStor (ZBW, Leibniz Information Centre for Economics).

Suggested citation: Tarantola, Claudia; Ntzoufras, Ioannis (2012): Bayesian Analysis of Graphical Models of Marginal Independence for Three Way Contingency Tables, Quaderni di Dipartimento, No. 172, Università degli Studi di Pavia, Dipartimento di Economia Politica e Metodi Quantitativi (EPMQ), Pavia. This version is available at: http://hdl.handle.net/10419/95304
  • Objective Priors in the Empirical Bayes Framework Arxiv:1612.00064V5 [Stat.ME] 11 May 2020
Objective Priors in the Empirical Bayes Framework. Ilja Klebanov, Alexander Sikorski, Christof Schütte, Susanna Röblitz. May 13, 2020. arXiv:1612.00064v5 [stat.ME] 11 May 2020.

Abstract: When dealing with Bayesian inference the choice of the prior often remains a debatable question. Empirical Bayes methods offer a data-driven solution to this problem by estimating the prior itself from an ensemble of data. In the non-parametric case, the maximum likelihood estimate is known to overfit the data, an issue that is commonly tackled by regularization. However, the majority of regularizations are ad hoc choices which lack invariance under reparametrization of the model and result in inconsistent estimates for equivalent models. We introduce a non-parametric, transformation invariant estimator for the prior distribution. Being defined in terms of the missing information similar to the reference prior, it can be seen as an extension of the latter to the data-driven setting. This implies a natural interpretation as a trade-off between choosing the least informative prior and incorporating the information provided by the data, a symbiosis between the objective and empirical Bayes methodologies.

Keywords: Parameter estimation, Bayesian inference, Bayesian hierarchical modeling, invariance, hyperprior, MPLE, reference prior, Jeffreys prior, missing information, expected information. 2010 Mathematics Subject Classification: 62C12, 62G08.

1 Introduction. Inferring a parameter θ ∈ Θ from a measurement x ∈ X using Bayes' rule requires prior knowledge about θ, which is not given in many applications. This has led to a lot of controversy in the statistical community and to harsh criticism concerning the objectivity of the Bayesian approach.
  • Estimating the Marginal Likelihood with Integrated Nested Laplace Approximation (INLA)
Estimating the marginal likelihood with Integrated nested Laplace approximation (INLA). Aliaksandr Hubin, Department of Mathematics, University of Oslo, and Geir Storvik, Department of Mathematics, University of Oslo. November 7, 2016. arXiv:1611.01450v1 [stat.CO] 4 Nov 2016.

Abstract: The marginal likelihood is a well established model selection criterion in Bayesian statistics. It also allows to efficiently calculate the marginal posterior model probabilities that can be used for Bayesian model averaging of quantities of interest. For many complex models, including latent modeling approaches, marginal likelihoods are however difficult to compute. One recent promising approach for approximating the marginal likelihood is Integrated Nested Laplace Approximation (INLA), designed for models with latent Gaussian structures. In this study we compare the approximations obtained with INLA to some alternative approaches on a number of examples of different complexity. In particular we address a simple linear latent model, a Bayesian linear regression model, logistic Bayesian regression models with probit and logit links, and a Poisson longitudinal generalized linear mixed model.

Keywords: Integrated nested Laplace approximation; Marginal likelihood; Model Evidence; Bayes Factor; Markov chain Monte Carlo; Numerical Integration; Linear models; Generalized linear models; Generalized linear mixed models; Bayesian model selection; Bayesian model averaging.

The authors gratefully acknowledge the CELS project at the University of Oslo, http://www.mn.uio.no/math/english/research/groups/cels/index.html, for giving us the opportunity, inspiration and motivation to write this article.

1 Introduction. Marginal likelihoods have been commonly accepted to be an extremely important quantity within Bayesian statistics. For data y and model M, which includes some unknown parameters θ, the marginal likelihood is given by

(1)    p(y|M) = ∫_{Ω_θ} p(y|M, θ) p(θ|M) dθ

where p(θ|M) is the prior for θ under model M while p(y|M, θ) is the likelihood function conditional on θ.
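To illustrate Equation (1) in the spirit of the Laplace-approximation machinery underlying INLA, the sketch below (not the authors' code; a one-parameter stand-in rather than a latent Gaussian model) applies a plain Laplace approximation to the marginal likelihood of a tiny logit model and checks it against grid integration; the data and prior are assumptions.

```python
# Laplace approximation to the marginal likelihood of an intercept-only logit model.
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import norm

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.7, size=40)            # assumed Bernoulli data

def log_joint(theta):                        # log p(y|theta) + log p(theta), N(0, 2^2) prior
    p = 1.0 / (1.0 + np.exp(-theta))
    return (y * np.log(p) + (1 - y) * np.log(1 - p)).sum() + norm.logpdf(theta, 0.0, 2.0)

# Mode and curvature of the log joint (one-dimensional, so the Hessian is a scalar).
theta_hat = minimize_scalar(lambda t: -log_joint(t)).x
eps = 1e-4
hess = -(log_joint(theta_hat + eps) - 2 * log_joint(theta_hat) + log_joint(theta_hat - eps)) / eps**2

# Laplace approximation: log p(y) ~ log p(y, theta_hat) + 0.5*log(2*pi) - 0.5*log(hess).
log_ml_laplace = log_joint(theta_hat) + 0.5 * np.log(2 * np.pi) - 0.5 * np.log(hess)

# Brute-force check by grid integration of Eq. (1).
grid = np.linspace(-10, 10, 20_001)
log_ml_grid = np.log(np.trapz(np.exp([log_joint(t) for t in grid]), grid))

print(f"Laplace: {log_ml_laplace:.4f}   grid: {log_ml_grid:.4f}")
```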