PUBH 8442, Spring 2016 Eric Lock


Yanghao Wang
May 12, 2016

Contents

1  Prior and Posterior
     Maximum Likelihood Estimation (MLE)
     Prior and Posterior
     Beta-Binomial Model
     Conjugate Priors
     Normal-Normal Model
     Prior and Posterior Predictive
     Normal-Normal Predictives
     Non-Informative Priors
     Improper Priors
     Jeffreys Priors
     Elicited Priors
2  Estimation and Decision Theory
     Point Estimation
     Loss Function
     Decision Rules
     Frequentist Risk
     Minimax Rules
     Bayes Risk
     Bayes Decision Rules
3  The Bias-Variance Tradeoff
     Unbiased Rules
     Bias-Variance Trade-Off
4  Interval Estimation
     Credible Sets
     Decision Theory for Credible Sets
     Normal Model with Jeffreys Priors
5  Decisions and Hypothesis Testing
     Frequentist Hypothesis Testing
     Bayesian Hypothesis Testing
     Hypothesis Decision
     Type I Error
     Bayes Factors
6  Model Comparison
     Multiple Hypotheses/Models
     Model Choice
     Bayes Factors for Model Comparison
     Bayesian Information Criterion
     Partial Bayes Factors
     Intrinsic Bayes Factors
     Fractional Bayes Factors
     Cross-Validated Likelihood
     Bayesian Residuals
     Bayesian P-Values
7  Hierarchical Model
     Bayesian Hierarchical Models
     Hierarchical Normal Model
     The Normal-Gamma Model
     Conjugate Normal with both µ and σ² Unknown
8  Bayesian Linear Models
     Bayesian Framework with Uninformative Priors
     Bayesian Framework with Normal-Inverse Gamma Prior
9  Empirical Bayes Approaches
     The Empirical Bayes Approach
10 Direct Sampling
     Direct Monte Carlo Integration
     Direct Sampling for Multivariate Density
11 Importance Sampling
     Importance Sampling
     Density Approximation
12 Rejection Sampling
     Rejection Sampling
13 Metropolis-Hastings Sampling
     Metropolis-Hastings Sampling
     Choice of Proposal Density
14 Gibbs Sampling
     Gibbs Sampling
15 More on MCMC
     Combining MCMC Methods
16 Deviance Information Criterion
     BIC and AIC
17 Bayesian Model Averaging
18 Mixture Models
     Finite Mixture Models
     Bayesian Finite Mixture Model

1 Prior and Posterior

A sampling model for data $y = (y_1, \dots, y_n)$, specified by a parameter $\theta$ (possibly unknown), is often represented as a probability density $f(y \mid \theta)$.

Example. Gaussian model specified by $\theta = (\mu, \sigma^2)$ for independent $y_i$'s:
$$f(y \mid \theta) = \left(\frac{1}{\sigma\sqrt{2\pi}}\right)^n \exp\left(-\frac{\sum_{i=1}^n (y_i - \mu)^2}{2\sigma^2}\right).$$

The probability density $f(y \mid \theta)$, viewed as a function of $\theta$, is often called the likelihood, with the notation $L(\theta; y)$. Use this notation when making inferences about $\theta$.

Maximum Likelihood Estimation (MLE)

Choose $\theta$ to maximize the likelihood:
$$\hat{\theta} = \arg\max_{\theta} L(\theta; y).$$
The MLE approach assumes $\theta$ is fixed (though unknown) and has been criticized for overfitting.

Prior and Posterior

Alternatively, $\theta$ can be treated as a random variable. Let $\pi(\theta)$ be a probability density for $\theta$, or $\pi(\theta \mid \eta)$ be a probability density for $\theta$ where $\eta$ is a hyperparameter. The marginal density for $y$ is
$$p(y) = \int f(y \mid \theta)\, \pi(\theta)\, d\theta,$$
i.e., the marginal density for $y$ averages the sampling density over $\theta$.

• Bayes' rule for continuous random variables:
$$p(\theta \mid y) = \frac{p(y, \theta)}{p(y)} = \frac{f(y \mid \theta)\, \pi(\theta)}{\int f(y \mid \theta)\, \pi(\theta)\, d\theta},$$
where $\pi(\theta)$ is the prior (distribution for $\theta$) and $p(\theta \mid y)$ is the posterior (distribution for $\theta$).
  – $\pi(\theta)$ is the density for $\theta$ without observed data, while $p(\theta \mid y)$ is the density for $\theta$ given observed data.
  – Given data $y$, $\int f(y \mid \theta)\, \pi(\theta)\, d\theta$ is a normalizing constant (a scalar) that makes the posterior $p(\theta \mid y)$ integrate to 1.

• Bayes' rule with a discrete random variable:
  – $\theta$ continuous, $y$ discrete:
$$p(\theta \mid y = k) = \frac{P(y = k \mid \theta)\, \pi(\theta)}{\int P(y = k \mid \theta)\, \pi(\theta)\, d\theta},$$
    where $P(y \mid \theta)$ is the probability mass function (PMF) for $y$ given $\theta$.
  – $\theta$ discrete, $y$ continuous:
$$P(\theta = k \mid y) = \frac{f(y \mid \theta = k)\, P(\theta = k)}{\sum_{k \in \Theta} f(y \mid \theta = k)\, P(\theta = k)},$$
    where $\Theta$ is the set of all possible values of $\theta$.
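Bayes' rule for a continuous parameter can be checked numerically by discretizing $\theta$ on a grid and normalizing likelihood × prior. The sketch below is my own illustration (not from the notes), using a Bernoulli likelihood with three successes and a uniform prior, which anticipates the three-daughters calculation in the next example:

```python
import numpy as np

# Grid approximation of Bayes' rule: posterior ∝ likelihood × prior.
# Hypothetical setting: y = 3 successes in 3 Bernoulli trials, uniform prior.
theta = np.linspace(0.0, 1.0, 10001)   # grid over the parameter space
prior = np.ones_like(theta)            # pi(theta) = 1, i.e. Uniform(0, 1)
likelihood = theta**3                  # P(y = 3 | theta) = theta^3

weights = likelihood * prior
weights /= weights.sum()               # normalization plays the role of 1/p(y)

post_mean = (theta * weights).sum()    # approximates E[theta | y = 3]
print(round(post_mean, 3))             # ≈ 0.8, the exact posterior mean 4/5
```

The normalizing sum is the discrete analogue of $\int f(y \mid \theta)\,\pi(\theta)\,d\theta$; it never needs to be computed in closed form.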
Example (Gender Proneness). The Guptas (of CNN fame) have three daughters. What is the probability their next child will be a daughter?

MLE solution. Let $\theta$ be the probability of a daughter for the Guptas. The binomial likelihood for the first 3 births, with $y$ = number of daughters, is
$$P(y = 3 \mid \theta) = \theta^3,$$
which implies $\hat{\theta} = 1$ (an overfitting problem).

Bayes solution. Assume the prior distribution for $\theta$ is Uniform$(0, 1)$, so $\pi(\theta) = 1$ for all $\theta \in [0, 1]$. Then
$$p(\theta \mid y = 3) = \frac{P(y = 3 \mid \theta)\, \pi(\theta)}{\int P(y = 3 \mid \theta)\, \pi(\theta)\, d\theta} = \frac{\theta^3 \times 1}{1/4} = 4\theta^3,$$
where $\int_0^1 P(y = 3 \mid \theta)\, \pi(\theta)\, d\theta = \frac{1}{4}\theta^4 \big|_{\theta=0}^{1} = \frac{1}{4}$. Estimate $\theta$ by the posterior mean:
$$\hat{\theta} = E[\theta \mid y = 3] = \int_0^1 \theta\, p(\theta \mid y = 3)\, d\theta = \frac{4}{5}\theta^5 \Big|_{\theta=0}^{1} = \frac{4}{5}.$$

Beta-Binomial Model

Assume $\theta \sim \text{Beta}(a, b)$:
$$\pi(\theta) = \frac{1}{B(a, b)}\, \theta^{a-1} (1 - \theta)^{b-1},$$
where $B(\cdot, \cdot)$ is the beta function. If $Y \sim \text{Binomial}(n, \theta)$, where $n$ is the sample size and $\theta$ is the success probability, then
$$p(\theta \mid y = k) = \text{Beta}(a + k, b + n - k).$$

Proof. By Bayes' rule, since the denominator is a constant, the posterior $p(\theta \mid y = k)$ is proportional to the numerator $P(y = k \mid \theta)\, \pi(\theta)$. Thus,
$$p(\theta \mid y = k) \propto P(y = k \mid \theta)\, \pi(\theta) = \binom{n}{k} \theta^k (1 - \theta)^{n-k} \cdot \frac{\theta^{a-1} (1 - \theta)^{b-1}}{B(a, b)} \propto \theta^{a+k-1} (1 - \theta)^{n-k+b-1},$$
which is the kernel of $\text{Beta}(a + k, b + n - k)$.

Note. The Beta distribution describes belief about a probability; i.e., $\theta \sim \text{Beta}(a, b)$ has support $\theta \in [0, 1]$ and $E[\theta] = \frac{a}{a+b}$. Thus, the shape parameters $a$ and $b$ determine the expectation of $\theta$. In addition, $\text{Beta}(1, 1)$ is the uniform distribution $U[0, 1]$.

Example (Gender Proneness, cont.). Based on data from many families, a more realistic prior for $\theta$ is $\pi(\theta) = \text{Beta}(39, 40)$, where 39 reflects the number of girls and 40 the number of boys. For the Gupta family, $p(\theta \mid y = 3) = \text{Beta}(42, 40)$, since $n = 3$ (sample size for the Guptas) and $k = 3$ (observed number of girls).
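The Beta-Binomial update reduces to adding the observed counts to the shape parameters. A minimal sketch (the helper function names are mine, not from the notes):

```python
def beta_binomial_update(a, b, k, n):
    """Posterior Beta(a + k, b + n - k) after k successes in n trials."""
    return a + k, b + n - k

def beta_mean(a, b):
    """Expected value of a Beta(a, b) random variable."""
    return a / (a + b)

# Uniform prior Beta(1, 1) with the Guptas' three daughters (k = n = 3):
print(beta_binomial_update(1, 1, k=3, n=3))   # (4, 1), whose mean is 4/5

# Realistic prior Beta(39, 40) gives posterior Beta(42, 40):
a_post, b_post = beta_binomial_update(39, 40, k=3, n=3)
print(a_post, b_post, round(beta_mean(a_post, b_post), 3))  # 42 40 0.512
```

Note that the uniform-prior case reproduces the posterior mean $4/5$ computed analytically in the example above.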
The expected value of $\text{Beta}(a, b)$ is $a/(a + b)$:
• Prior estimate: $\hat{\theta} = \frac{39}{39 + 40} = 0.494$.
• Posterior estimate: $\hat{\theta} = \frac{42}{42 + 40} = 0.512$.

Conjugate Priors

A prior is conjugate for a given likelihood if its posterior belongs to the same distribution family. For example, a beta prior is conjugate to a binomial likelihood (i.e., if $y$ is generated by a binomial process and the prior $\pi(\theta)$ is a Beta distribution, then the posterior $p(\theta \mid y)$ is also a Beta distribution).

• Conjugate priors facilitate computation of posteriors.
  – Particularly useful when updating the posterior adaptively. E.g., given $\pi(\theta) = \text{Beta}(39, 40)$: after one girl, the posterior is $\text{Beta}(40, 40)$; after another girl, it updates to $\text{Beta}(41, 40)$.
• Conjugacy does not guarantee precision (it has no profound theoretical justification).
• Common conjugate families (credit: John D. Cook).

Example (Poisson-Gamma Model). Let $y \sim \text{Poisson}(\theta)$ and $\theta \sim \text{Gamma}(\alpha, \beta)$, so $\pi(\theta) \propto \theta^{\alpha-1} e^{-\beta\theta}$. Then
$$p(\theta \mid y) \propto p(y \mid \theta)\, \pi(\theta) \propto \theta^y e^{-\theta} \cdot \theta^{\alpha-1} e^{-\beta\theta} = \theta^{\alpha+y-1} e^{-(\beta+1)\theta} \equiv \text{Gamma}(\alpha + y, \beta + 1).$$

Normal-Normal Model

Assume $y \sim \text{Normal}(\mu, \sigma^2)$ with $\sigma^2$ known. If $\pi(\mu) = \text{Normal}(\mu_0, \tau^2)$, then
$$p(\mu \mid y) = \text{Normal}\left(\frac{\sigma^2\mu_0 + \tau^2 y}{\sigma^2 + \tau^2},\; \frac{\sigma^2\tau^2}{\sigma^2 + \tau^2}\right).$$

Proof. $y \sim \text{Normal}(\mu, \sigma^2)$ implies
$$p(y \mid \mu) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{(y - \mu)^2}{2\sigma^2}\right].$$
With $\pi(\mu) = \text{Normal}(\mu_0, \tau^2)$, we have
$$\begin{aligned}
p(\mu \mid y) \propto p(y \mid \mu)\, \pi(\mu) &= \frac{1}{\sigma\sqrt{2\pi}}\, \frac{1}{\tau\sqrt{2\pi}} \exp\left[-\frac{(y - \mu)^2}{2\sigma^2} - \frac{(\mu - \mu_0)^2}{2\tau^2}\right] \\
&\propto \exp\left[-\frac{\tau^2 (y - \mu)^2 + \sigma^2 (\mu - \mu_0)^2}{2\sigma^2\tau^2}\right] \\
&\propto \exp\left[-\frac{(\tau^2 + \sigma^2)\left[\left(\mu - \frac{\sigma^2\mu_0 + \tau^2 y}{\sigma^2 + \tau^2}\right)^2 - \left(\frac{\sigma^2\mu_0 + \tau^2 y}{\sigma^2 + \tau^2}\right)^2 + \frac{\tau^2 y^2 + \sigma^2\mu_0^2}{\sigma^2 + \tau^2}\right]}{2\sigma^2\tau^2}\right] \\
&\propto \exp\left[-\frac{\left(\mu - \frac{\sigma^2\mu_0 + \tau^2 y}{\sigma^2 + \tau^2}\right)^2}{2\,\sigma^2\tau^2/(\tau^2 + \sigma^2)}\right], \quad (*)
\end{aligned}$$
where the third line follows by completing the square in $\mu$:
$$\begin{aligned}
\tau^2 (y - \mu)^2 + \sigma^2 (\mu - \mu_0)^2 &= (\tau^2 + \sigma^2)\mu^2 - 2\tau^2 y\mu - 2\sigma^2\mu_0\mu + \tau^2 y^2 + \sigma^2\mu_0^2 \\
&= (\tau^2 + \sigma^2)\left[\mu^2 - 2\,\frac{\sigma^2\mu_0 + \tau^2 y}{\sigma^2 + \tau^2}\,\mu + \frac{\tau^2 y^2 + \sigma^2\mu_0^2}{\sigma^2 + \tau^2}\right] \\
&= (\tau^2 + \sigma^2)\left[\left(\mu - \frac{\sigma^2\mu_0 + \tau^2 y}{\sigma^2 + \tau^2}\right)^2 - \left(\frac{\sigma^2\mu_0 + \tau^2 y}{\sigma^2 + \tau^2}\right)^2 + \frac{\tau^2 y^2 + \sigma^2\mu_0^2}{\sigma^2 + \tau^2}\right],
\end{aligned}$$
and the fourth line follows because, given $y$, $\mu_0$, $\sigma^2$, and $\tau^2$, the terms $\left(\frac{\sigma^2\mu_0 + \tau^2 y}{\sigma^2 + \tau^2}\right)^2$ and $\frac{\tau^2 y^2 + \sigma^2\mu_0^2}{\sigma^2 + \tau^2}$ do not depend on $\mu$ and can be dropped. Hence $(*)$ is the kernel of a Normal density with mean $\frac{\sigma^2\mu_0 + \tau^2 y}{\sigma^2 + \tau^2}$ and variance $\frac{\sigma^2\tau^2}{\sigma^2 + \tau^2}$.
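The closed-form Normal-Normal posterior can be checked against a brute-force grid computation. The numeric values below are arbitrary illustrative choices of mine, not from the notes:

```python
import numpy as np

# Hypothetical values: one observation y with known sigma^2, prior N(mu0, tau^2).
y, sigma2, mu0, tau2 = 1.5, 1.0, 0.0, 4.0

# Closed form from the Normal-Normal derivation.
post_mean = (sigma2 * mu0 + tau2 * y) / (sigma2 + tau2)   # (0 + 6) / 5 = 1.2
post_var = sigma2 * tau2 / (sigma2 + tau2)                # 4 / 5 = 0.8

# Brute-force check: normalize likelihood × prior on a fine grid over mu.
mu = np.linspace(-10.0, 10.0, 200001)
log_post = -(y - mu)**2 / (2 * sigma2) - (mu - mu0)**2 / (2 * tau2)
w = np.exp(log_post - log_post.max())   # subtract max for numerical stability
w /= w.sum()
grid_mean = (mu * w).sum()
grid_var = ((mu - grid_mean)**2 * w).sum()

print(round(grid_mean, 3), round(grid_var, 3))   # 1.2 0.8, matching the formula
```

The grid agrees with the formula to numerical precision, which is a quick sanity check whenever a conjugate derivation like the one above is in doubt.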