Model Selection, Hyperparameter Optimisation, and Gaussian Processes


Probabilistic & Unsupervised Learning
Model selection, Hyperparameter optimisation, and Gaussian Processes

Maneesh Sahani ([email protected])
Gatsby Computational Neuroscience Unit, and MSc ML/CSML, Dept Computer Science, University College London
Term 1, Autumn 2014

Learning model structure

- How many clusters in the data?
- How smooth should the function be?
- Is this input relevant to predicting that output?
- What is the order of a dynamical system?
- How many states in a hidden Markov model? (e.g. for a protein sequence such as SVYDAAAQLTADVKKDLRDSWKVIGSDKKGNG)
- How many auditory sources in the input?

Model selection

Models (labelled by m) have parameters θ_m that specify the probability of data:

    P(D | θ_m, m).

If the model is known, learning θ_m means finding a posterior or point estimate (ML, MAP, …). What if we need to learn the model too?

- We could combine models into a single "supermodel", with composite parameter (m, θ_m).
- ML learning will overfit: it favours the most flexible (nested) model with the most parameters, even if the data actually come from a simpler one.
- A density function on the composite parameter space (a union of manifolds of different dimensionalities) is difficult to define ⇒ MAP learning is ill-posed.
- The joint posterior is difficult to compute — the dimension of the composite parameter varies.

⇒ Separate model selection step:

    P(θ_m, m | D) = P(θ_m | m, D) · P(m | D)
                    [model-specific posterior]  [model selection]

Model complexity and overfitting: a simple example

[Figure: polynomial fits of order M = 0, 1, …, 7 to the same small data set on x ∈ [0, 10]; low orders underfit, high orders overfit.]

Model selection

Given models labelled by m with parameters θ_m, identify the "correct" model for data D. ML/MAP has no good answer: P(D | θ_m^ML) is always larger for more complex (nested) models.

Neyman-Pearson hypothesis testing
- For nested models. Starting with the simplest model (m = 1), compare (e.g. by a likelihood ratio test) the null hypothesis m to the alternative m + 1. Continue until m + 1 is rejected.
- Usually only valid asymptotically in the number of data.
- Conservative (N-P hypothesis tests are asymmetric).

Likelihood validation
- Partition the data into disjoint training and validation sets D = D_tr ∪ D_vld. Choose the model with greatest P(D_vld | θ_m^ML), where θ_m^ML = argmax_θ P(D_tr | θ).
- Unbiased, but often high-variance.
- Cross-validation uses multiple partitions and averages the likelihoods.

Bayesian model selection
- Choose the most likely model: argmax_m P(m | D).
- Principled from a probabilistic viewpoint — if the true model is in the set being considered — but sensitive to assumed priors etc.
- Can use posterior probabilities to weight models for combined predictions (no need to select at all).
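The contrast between ML fitting and likelihood validation can be seen numerically. The sketch below is a minimal, hypothetical version of the polynomial example (the cubic-plus-noise data and the 50/50 train/validation split are illustrative assumptions, not the slides' actual data): training error never increases with model order, so ML alone always prefers the most complex nested model, while held-out validation error does not behave this way.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the slides' example: cubic signal plus noise.
x = np.linspace(0, 10, 30)
y = 0.5 * x**3 - 6 * x**2 + 15 * x + rng.normal(0, 5, size=x.shape)

x_tr, y_tr = x[::2], y[::2]      # training half
x_va, y_va = x[1::2], y[1::2]    # validation half

def sse(order):
    """Fit a degree-`order` polynomial by least squares (the ML estimate under
    Gaussian noise); return training and validation squared error."""
    coeffs = np.polyfit(x_tr, y_tr, order)
    return (float(np.sum((np.polyval(coeffs, x_tr) - y_tr) ** 2)),
            float(np.sum((np.polyval(coeffs, x_va) - y_va) ** 2)))

errs = [sse(m) for m in range(8)]          # model orders M = 0 .. 7
train_err = [e[0] for e in errs]
val_err = [e[1] for e in errs]

# Nested models: training error is monotone non-increasing in the order,
# so maximum likelihood can never prefer the simpler model.
print("train:", [round(e, 1) for e in train_err])
print("validation picks M =", int(np.argmin(val_err)))
```

Because the models are nested, the training fit improves (or stays equal) at every step regardless of what generated the data; only the held-out likelihood carries information about the right complexity.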
Bayesian model selection: some terminology

A model class m is a set of distributions parameterised by θ_m, e.g. the set of all possible mixtures of m Gaussians. The model implies both a prior over the parameters, P(θ_m | m), and a likelihood of data given parameters (which might require integrating out latent variables), P(D | θ_m, m).

The posterior distribution over parameters is

    P(θ_m | D, m) = P(D | θ_m, m) P(θ_m | m) / P(D | m).

The marginal probability of the data under model class m is

    P(D | m) = ∫_{Θ_m} P(D | θ_m, m) P(θ_m | m) dθ_m

(also called the Bayesian evidence for model m). The ratio of two marginal probabilities (or sometimes its log) is known as the Bayes factor:

    P(D | m) / P(D | m′) = [P(m | D) / P(m′ | D)] · [P(m′) / P(m)].

The Bayesian Occam's razor

Occam's Razor is a principle of scientific philosophy: of two explanations adequate to explain the same set of observations, the simpler should always be preferred. Bayesian inference formalises and automatically implements a form of Occam's Razor.
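For small conjugate models the evidence integral and the Bayes factor are exactly computable. A minimal sketch with a hypothetical coin-flip comparison (not an example from the slides): m0 fixes θ = 1/2, m1 places a uniform Beta(1, 1) prior on θ, and the Bernoulli likelihood integrates against the Beta prior in closed form.

```python
from math import lgamma, log

def log_beta(a, b):
    """log of the Beta function B(a, b), via log-gamma."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_evidence_biased(heads, tails, a=1.0, b=1.0):
    """log P(D | m1): Bernoulli likelihood integrated over a Beta(a, b) prior,
    which gives B(a + heads, b + tails) / B(a, b)."""
    return log_beta(a + heads, b + tails) - log_beta(a, b)

def log_evidence_fair(heads, tails):
    """log P(D | m0): the point model theta = 1/2 has nothing to integrate."""
    return (heads + tails) * log(0.5)

# 60 heads in 100 tosses. The evidence still slightly favours the simpler
# fixed-theta model: m1 spreads prior mass over all theta, and pays an
# automatic Occam penalty for that flexibility.
h, t = 60, 40
log_bf = log_evidence_biased(h, t) - log_evidence_fair(h, t)
print(f"log Bayes factor (m1 vs m0): {log_bf:.3f}")
```

This is the Bayesian Occam's razor in one line of arithmetic: the flexible class can generate many data sets, so it assigns less probability to this particular one.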
Compare model classes m using their posterior probability given the data:

    P(m | D) = P(D | m) P(m) / P(D),   where   P(D | m) = ∫_{Θ_m} P(D | θ_m, m) P(θ_m | m) dθ_m.

P(D | m) is the probability that randomly selected parameter values from the model class would generate data set D. Model classes that are too simple are unlikely to generate the observed data set. Model classes that are too complex can generate many possible data sets, so again, they are unlikely to generate that particular data set at random. Like Goldilocks, we favour a model that is just right.

[Figure: the evidence P(D | m) plotted over possible data sets — a simple model concentrates its mass on few data sets, a complex model spreads it thinly, and an intermediate model assigns the highest evidence to the observed data set D.]

Bayesian model comparison: Occam's razor at work

[Figure: polynomial fits of order M = 0, 1, …, 7, with a bar plot of the model evidence P(Y | M) peaking at an intermediate order.]

Conjugate-exponential families (recap)

Suppose P(D | θ_m, m) is a member of the exponential family:

    P(D | θ_m, m) = ∏_{i=1}^N P(x_i | θ_m, m) = ∏_{i=1}^N e^{ s(x_i)^T θ_m − A(θ_m) }.

If our prior on θ_m is conjugate,

    P(θ_m | m) = e^{ s_p^T θ_m − n_p A(θ_m) } / Z(s_p, n_p),

then the joint is in the same family:

    P(D, θ_m | m) = e^{ (Σ_i s(x_i) + s_p)^T θ_m − (N + n_p) A(θ_m) } / Z(s_p, n_p)

and so:

    P(D | m) = ∫ dθ_m P(D, θ_m | m) = Z(Σ_i s(x_i) + s_p, N + n_p) / Z(s_p, n_p).

But this is a special case. In general, we need to approximate.

Practical Bayesian approaches

Can we compute P(D | m)? … Sometimes.

- Laplace approximation: approximate the posterior by a Gaussian centred at the maximum a posteriori parameter estimate.
- Bayesian Information Criterion (BIC): an asymptotic (N → ∞) approximation.
- Variational Bayes: a lower bound on the marginal probability. A biased estimate, but easy and fast, and often better than Laplace or BIC.
- Monte Carlo methods:
  - (Annealed) importance sampling: estimate the evidence using samples θ^(i) drawn from an arbitrary f(θ):
        ∫ dθ f(θ) [ P(D | θ, m) P(θ | m) / f(θ) ] = P(D | m) ≈ (1/S) Σ_i P(D, θ^(i) | m) / f(θ^(i)).
  - "Reversible jump" Markov chain Monte Carlo: sample from the posterior on the composite parameter (m, θ_m); the number of samples for each m is ∝ P(m | D).
  - Both are exact in the limit of infinite samples, but may have high variance with finite samples.

This is not an exhaustive list (Bethe approximations, expectation propagation, …). We will discuss Laplace and BIC now, leaving the rest for the second half of the course.

Laplace approximation

We want to find P(D | m) = ∫ P(D, θ_m | m) dθ_m. As the data size N grows (relative to the parameter count d), P(D, θ_m | m) ∝ P(θ_m | D, m) becomes concentrated on the posterior mode θ*_m.

Idea: approximate log P(D, θ_m | m) to second order around θ*_m:

    ∫ P(D, θ_m | m) dθ_m = ∫ e^{ log P(D, θ_m | m) } dθ_m
      = ∫ e^{ log P(D, θ*_m | m) + ∇ log P(D, θ*_m | m) · (θ_m − θ*_m) + ½ (θ_m − θ*_m)^T ∇² log P(D, θ*_m | m) (θ_m − θ*_m) } dθ_m,

where the gradient term vanishes at the mode and ∇² log P(D, θ*_m | m) = −A, so

      = P(D, θ*_m | m) ∫ e^{ −½ (θ_m − θ*_m)^T A (θ_m − θ*_m) } dθ_m
      = P(D | θ*_m, m) P(θ*_m | m) (2π)^{d/2} |A|^{−1/2}.

Here A = −∇² log P(D, θ*_m | m) is the negative Hessian of log P(D, θ | m) evaluated at θ*_m. This is equivalent to approximating the posterior by a Gaussian: an approximation which is asymptotically correct.

Bayesian Information Criterion (BIC)

BIC can be obtained from the Laplace approximation:

    log P(D | m) ≈ log P(θ*_m | m) + log P(D | θ*_m, m) + (d/2) log 2π − ½ log |A|.

We have

    A = −∇² log P(D, θ*_m | m) = −∇² log P(D | θ*_m, m) − ∇² log P(θ*_m | m),

so as the number of iid data N → ∞, A grows as N A_0 + constant for a fixed matrix A_0:

    log |A| → log |N A_0| = log (N^d |A_0|) = d log N + log |A_0|.

Retaining only the terms that grow with N, we get:

    log P(D | m) ≈ log P(D | θ*_m, m) − (d/2) log N.

Properties:
- Quick and easy to compute, and does not depend on the prior.
- We can use the ML estimate of θ instead of the MAP estimate (they coincide as N → ∞).
- Related to the "Minimum Description Length" (MDL) criterion.
- Assumes that in the large-sample limit all the parameters are well-determined (i.e. the model is identifiable; otherwise, d should be the number of well-determined parameters).
- Neglects multiple modes (e.g. permutations in a mixture of Gaussians).
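Both approximations can be checked against an exact evidence in one dimension. A sketch using a Beta-Bernoulli model (a hypothetical example chosen because its evidence is closed-form; the slides do not use this model): Laplace is log P(D, θ* | m) + (d/2) log 2π − ½ log |A| with d = 1, and BIC is log P(D | θ̂_ML, m) − (d/2) log N.

```python
from math import lgamma, log, pi

def log_beta(a, b):
    """log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def bernoulli_evidence(h, t, a=2.0, b=2.0):
    """Exact, Laplace, and BIC log evidence for h heads / t tails under a
    Beta(a, b)-Bernoulli model (d = 1; needs a, b > 1 for an interior mode)."""
    n = h + t
    exact = log_beta(a + h, b + t) - log_beta(a, b)

    # Posterior mode (MAP) of theta, and the log joint evaluated there.
    m = (h + a - 1) / (n + a + b - 2)
    log_joint = (h + a - 1) * log(m) + (t + b - 1) * log(1 - m) - log_beta(a, b)
    # A = negative second derivative of the log joint at the mode.
    A = (h + a - 1) / m**2 + (t + b - 1) / (1 - m) ** 2
    laplace = log_joint + 0.5 * log(2 * pi) - 0.5 * log(A)

    # BIC uses the ML estimate and keeps only terms that grow with n.
    ml = h / n
    bic = h * log(ml) + t * log(1 - ml) - 0.5 * log(n)
    return exact, laplace, bic

for n in (10, 100, 1000):
    h = int(0.7 * n)  # 70% heads at each sample size
    exact, laplace, bic = bernoulli_evidence(h, n - h)
    print(f"N={n:5d}  exact={exact:9.3f}  Laplace={laplace:9.3f}  BIC={bic:9.3f}")
```

As the slides suggest, Laplace tracks the exact evidence closely even at modest N, while BIC is cruder (it drops the prior and the constant terms) but shares the correct d log N leading behaviour.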
Hyperparameters and evidence optimisation

In some cases, we need to choose between a family of continuously parameterised models:

    P(D | η) = ∫ P(D | θ) P(θ | η) dθ,   η: hyperparameters.

This choice can be made by ascending the gradient of:
- the exact evidence (if tractable);
- the approximated evidence (Laplace, EP, Bethe, …);
- a free-energy bound on the evidence (variational Bayes);

or by placing a hyperprior on the hyperparameters η and sampling from the posterior

    P(η | D) = P(D | η) P(η) / P(D)

to find optimal values.

Evidence optimisation in linear regression

Consider simple linear regression with hyperparameters C and σ²:

    w ∼ N(0, C),   y_i ∼ N(w^T x_i, σ²).

[Figure: graphical model with C → w → y_i ← x_i and noise variance σ².]

Maximize

    P(y_1 … y_N | x_1 … x_N, C, σ²) = ∫ P(y_1 … y_N | x_1 … x_N, w, σ²) P(w | C) dw

to find optimal values of C and σ.
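In the isotropic special case C = cI with σ² held fixed (an illustrative assumption; the slides allow a general C and optimise σ too), the integral over w is Gaussian: y ∼ N(0, σ²I + cXXᵀ), so the evidence can be evaluated exactly and maximised over c, here by a simple grid search. The data and dimensions below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic regression data: true weights drawn with prior variance c_true.
N, D, sigma2, c_true = 200, 20, 0.5, 2.0
X = rng.normal(size=(N, D))
w = rng.normal(0.0, np.sqrt(c_true), size=D)
y = X @ w + rng.normal(0.0, np.sqrt(sigma2), size=N)

def log_evidence(c):
    """log P(y | X, c, sigma^2) with prior w ~ N(0, c I): integrating out w
    leaves y Gaussian with mean 0 and covariance sigma^2 I + c X X^T."""
    S = sigma2 * np.eye(N) + c * (X @ X.T)
    _, logdet = np.linalg.slogdet(S)
    quad = y @ np.linalg.solve(S, y)
    return -0.5 * (quad + logdet + N * np.log(2 * np.pi))

cs = np.logspace(-2, 2, 81)
best = cs[np.argmax([log_evidence(c) for c in cs])]
print(f"evidence-optimal prior variance c = {best:.3f} (generated with {c_true})")
```

In practice one would ascend the gradient of log P(y | X, c, σ²) in (c, σ²) jointly rather than grid-search, but the grid makes the shape of the evidence surface easy to inspect.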