Bayesian Inference and Bayesian Model Selection


Klaas Enno Stephan
Lecture as part of "Methods & Models for fMRI data analysis", University of Zurich & ETH Zurich, 27 November 2018
With slides from and many thanks to: Kay Brodersen, Will Penny, Sudhir Shankar Raman

Why should I know about Bayesian inference?
Because Bayesian principles are fundamental for
• statistical inference in general
• system identification
• translational neuromodeling ("computational assays")
  – computational psychiatry
  – computational neurology
  – computational psychosomatics
• contemporary theories of brain function (the "Bayesian brain")
  – predictive coding
  – free energy principle
  – active inference

Bayes' theorem
The Reverend Thomas Bayes (1702–1761)

  p(θ | y) = p(y | θ) p(θ) / p(y)

posterior = likelihood ∙ prior / evidence

"... the theorem expresses how a ... degree of belief should rationally change to account for availability of related evidence." (Wikipedia)

The evidence term
The evidence p(y) is obtained by marginalizing the joint over θ:
• continuous: p(y) = ∫ p(y | θ) p(θ) dθ
• discrete: p(y) = Σ_θ p(y | θ) p(θ)

Bayesian inference: an example (with fictitious probabilities)
• symptom y: y = 1 fever, y = 0 no fever
• disease θ: θ = 1 Ebola, θ = 0 any other disease (AOD)
• likelihood p(y | θ):

                    y = 1 (fever)   y = 0 (no fever)
  θ = 1 (Ebola)     99.99%          0.01%
  θ = 0 (AOD)       20%             80%

• prior: p(Ebola) = 10⁻⁶, p(AOD) = 1 − 10⁻⁶
• A patient presents with fever. What is the probability that he/she has Ebola?

  p(θ = 1 | y = 1) = p(y = 1 | θ = 1) p(θ = 1) / Σ_{j=0,1} p(y = 1 | θ = j) p(θ = j)
                   ≈ 0.9999 · 10⁻⁶ / (0.9999 · 10⁻⁶ + 0.2 · (1 − 10⁻⁶)) ≈ 5 · 10⁻⁶
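To make the arithmetic concrete, the following minimal Python sketch (function and variable names are my own) reproduces the fictitious example by normalizing likelihood × prior over the two hypotheses:

```python
def posterior(likelihood, prior):
    """Discrete Bayes rule: normalize likelihood * prior over all hypotheses."""
    joint = {h: likelihood[h] * prior[h] for h in prior}
    evidence = sum(joint.values())              # p(y) = sum_j p(y | theta_j) p(theta_j)
    return {h: joint[h] / evidence for h in joint}

# Fictitious probabilities from the slide
prior = {"ebola": 1e-6, "AOD": 1 - 1e-6}            # p(theta)
likelihood_fever = {"ebola": 0.9999, "AOD": 0.20}   # p(y = fever | theta)

print(posterior(likelihood_fever, prior)["ebola"])  # ~5e-06
```

Despite the highly diagnostic symptom, the posterior probability of Ebola remains tiny because the prior probability of the disease is six orders of magnitude smaller than that of the alternative.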
Generative models
  p(y | θ, m), p(θ | m)  →  p(θ | y, m)
1. specify the joint probability over data (observations) and parameters
2. enforce mechanistic thinking: how could the data have been caused?
3. generate synthetic data (observations) by sampling from the prior – can the model explain certain phenomena at all?
4. inference about parameters → p(θ | y)
5. model evidence p(y | m): an index of model quality

Bayesian inference in practice
• formulation of a generative model: likelihood function p(y | θ) and prior distribution p(θ)
• observation of data: measurement data y
• model inversion – updating one's beliefs: posterior p(θ | y) ∝ p(y | θ) p(θ), model evidence p(y)

Example of a shrinkage prior (figure)

Priors
• objective priors:
  – "non-informative" priors
  – objective constraints (e.g., non-negativity)
• subjective priors:
  – subjective but not arbitrary
  – can express beliefs that result from understanding of the problem or system
  – can be the result of previous empirical results
• shrinkage priors:
  – emphasize regularization and sparsity
• empirical priors:
  – learn parameters of prior distributions from the data ("empirical Bayes")
  – rest on a hierarchical model

A generative modelling framework for fMRI & EEG: dynamic causal modeling (DCM)
• forward model: predicting measured activity from neuronal states, y = g(x, θ)
• model inversion: inferring neuronal parameters from measured activity, given the neural dynamics dx/dt = f(x, u, θ)
Friston et al. 2003, NeuroImage; Stephan et al. 2009, NeuroImage

DCM for fMRI (Stephan et al. 2015, Neuron)
The accompanying figure shows neural population activity in three regions (x1, x2, x3) driven by two experimental inputs (u1, u2), and the resulting fMRI signal change (%).

Nonlinear dynamic causal model for fMRI
  dx/dt = (A + Σ_{i=1..m} u_i B^(i) + Σ_{j=1..n} x_j D^(j)) x + C u
Stephan et al. 2008, NeuroImage

Bayesian system identification
• design experimental inputs u(t)
• define the likelihood model:
  – neural dynamics: dx/dt = f(x, u, θ)
  – observer function: y = g(x, θ)
  – p(y | θ, m) = N(g(θ), Σ(λ))
• specify priors: p(θ | m) = N(μ_θ, Σ_θ)
• invert the model:
  – inference on parameters: p(θ | y, m) = p(y | θ, m) p(θ | m) / p(y | m)
  – inference on model structure: p(y | m) = ∫ p(y | θ, m) p(θ | m) dθ
• make inferences

Generative models as "computational assays"
  p(y | θ, m), p(θ | m)  →  p(θ | y, m)
• translational neuromodeling: models of disease mechanisms, dx/dt = f(x, u, θ)
• application to brain activity and behaviour of individual patients
• detecting physiological subgroups based on inferred mechanisms (e.g., disease mechanisms A, B, C)
• individual treatment prediction
Stephan et al. 2015, Neuron

Generative embedding (supervised)
• step 1 – model inversion: measurements from an individual subject → subject-specific inverted generative model
• step 2 – kernel construction: subject representation in the generative score space (e.g., connection strengths A→B, A→C, B→B, B→C)
• step 3 – support vector classification: a separating hyperplane is fitted to discriminate between groups
• step 4 – interpretation: jointly discriminative model parameters
Brodersen et al. 2011, PLoS Comput. Biol.

Generative embedding (unsupervised)
Brodersen et al. 2014, NeuroImage

Clinical application: differential diagnosis by model selection
Given a symptom y (behaviour or physiology) and hypothetical mechanisms m_1, ..., m_k, ..., m_K with likelihoods p(y | θ, m_k):

  p(m_k | y) = p(y | m_k) p(m_k) / Σ_k p(y | m_k) p(m_k)

Stephan et al. 2017, NeuroImage
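The differential-diagnosis rule above is Bayes' theorem applied at the level of models rather than parameters. As a small illustration (the log-evidence values below are invented and the helper function is my own, not part of any DCM toolbox), posterior model probabilities can be computed from log evidences with a softmax-style normalization, which is numerically safer than exponentiating the evidences directly:

```python
import numpy as np

def posterior_model_probs(log_evidence, prior=None):
    """p(m_k | y) = p(y | m_k) p(m_k) / sum_k p(y | m_k) p(m_k), from log evidences."""
    log_evidence = np.asarray(log_evidence, dtype=float)
    if prior is None:
        prior = np.full(len(log_evidence), 1.0 / len(log_evidence))  # uniform p(m_k)
    log_joint = log_evidence + np.log(prior)   # ln p(y | m_k) + ln p(m_k)
    log_joint -= log_joint.max()               # subtract the max for numerical stability
    w = np.exp(log_joint)
    return w / w.sum()

# Invented log evidences ln p(y | m_k) for three candidate disease mechanisms
print(posterior_model_probs([-1250.0, -1247.0, -1253.0]))
# -> [0.047 0.950 0.002]: mechanism 2 dominates, since it wins by ~3 log units
```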
Perception = inversion of a hierarchical generative model
The brain's forward model p(y | x, m) p(x | m) describes how hidden causes x – environmental states, bodily states, neuronal states, others' mental states – generate sensory input y; perception corresponds to inverting this model, i.e., computing p(x | y, m).

Free energy principle: predictive coding & active inference
• prediction error = sensations − predictions
• action: change sensory input to match predictions
• perception: change predictions to match sensory input
Maximizing the evidence (of the brain's generative model) = minimizing the surprise about the data (sensory inputs).
Friston et al. 2006, J Physiol Paris

How is the posterior computed = how is a generative model inverted?
• compute the posterior analytically
  – requires conjugate priors
• variational Bayes (VB)
  – often hard work to derive, but fast to compute
  – uses approximations (approximate posterior, mean field)
  – problems: local minima, potentially inaccurate approximations
• sampling: Markov chain Monte Carlo (MCMC)
  – theoretically guaranteed to be accurate (for infinite computation time)
  – problems: may require very long run times in practice; convergence is difficult to prove

Conjugate priors
• for a given likelihood function, the choice of prior determines the algebraic form of the posterior
• for some probability distributions a prior can be found such that the posterior p(θ | y) ∝ p(y | θ) p(θ) has the same algebraic form as the prior
• such a prior is called "conjugate" to the likelihood
• examples:
  – Normal × Normal → Normal
  – Beta × Binomial → Beta
  – Dirichlet × Multinomial → Dirichlet

Posterior mean and variance of univariate Gaussians
Likelihood and prior: p(y | θ) = N(y; θ, σ_e²), p(θ) = N(θ; μ_p, σ_p²)
Posterior: p(θ | y) = N(θ; μ, σ²) with
  1/σ² = 1/σ_e² + 1/σ_p²
  μ = σ² (μ_p / σ_p² + y / σ_e²)
The posterior mean is a variance-weighted combination of the prior mean and the data.

The same thing – but expressed as precision weighting
Likelihood and prior: p(y | θ) = N(y; θ, λ_e⁻¹), p(θ) = N(θ; μ_p, λ_p⁻¹)
Posterior: p(θ | y) = N(θ; μ, λ⁻¹) with
  λ = λ_e + λ_p
  μ = (λ_e / λ) y + (λ_p / λ) μ_p
i.e., a relative precision weighting of data and prior.
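As a quick numerical check of the two equivalent formulations above, here is a minimal Python sketch (the numbers and function names are illustrative) showing that the variance-weighted and the precision-weighted update give the same posterior:

```python
import numpy as np

def gaussian_posterior(y, sigma_e2, mu_p, sigma_p2):
    """Conjugate Normal x Normal update, variance form."""
    post_var = 1.0 / (1.0 / sigma_e2 + 1.0 / sigma_p2)       # 1/sigma^2 = 1/sigma_e^2 + 1/sigma_p^2
    post_mean = post_var * (mu_p / sigma_p2 + y / sigma_e2)  # variance-weighted combination
    return post_mean, post_var

def gaussian_posterior_precision(y, lam_e, mu_p, lam_p):
    """Same update written as relative precision weighting."""
    lam = lam_e + lam_p
    return (lam_e / lam) * y + (lam_p / lam) * mu_p, 1.0 / lam

mu1, var1 = gaussian_posterior(y=2.0, sigma_e2=1.0, mu_p=0.0, sigma_p2=4.0)
mu2, var2 = gaussian_posterior_precision(y=2.0, lam_e=1.0, mu_p=0.0, lam_p=0.25)
print(mu1, var1)   # 1.6 0.8 -- the posterior is pulled towards the more precise source (here, the data)
assert np.allclose([mu1, var1], [mu2, var2])
```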
Variational Bayes (VB)
Idea: find an approximate density q(θ) that is maximally similar to the true posterior p(θ | y). This is often done by assuming a particular form for q (fixed-form VB) and then optimizing its sufficient statistics. (The accompanying figure contrasts the true posterior p(θ | y) with the best proxy q(θ) within the chosen hypothesis class; the gap between the two is the divergence KL[q‖p].)

Kullback–Leibler (KL) divergence
• an asymmetric measure of the difference between two probability distributions P and Q
• interpretations of D_KL(P‖Q):
  – "Bayesian surprise" when Q = prior, P = posterior: a measure of the information gained when one updates one's prior beliefs to the posterior P
  – a measure of the information lost when Q is used to approximate P
• non-negative: D_KL(P‖Q) ≥ 0, with equality exactly when P = Q

Variational calculus
Standard calculus (Newton, Leibniz, and others) deals with
• functions f: x ↦ f(x)
• derivatives df/dx
• example: maximize the likelihood expression p(y | θ) w.r.t. θ
Variational calculus (Euler, Lagrange, and others) deals with
• functionals F: f ↦ F[f]
• derivatives dF/df
• example: maximize the entropy H[p] w.r.t. a probability distribution p(x)
(Leonhard Euler, 1707–1783, Swiss mathematician, 'Elementa Calculi Variationum')

Variational Bayes
The log evidence decomposes as
  ln p(y) = KL[q‖p] + F(q, y)
where KL[q‖p] is the divergence between the approximate and the true posterior and F(q, y) is the negative free energy. Because KL[q‖p] ≥ 0, F(q, y) is a lower bound on ln p(y); maximizing F with respect to q therefore tightens the bound on the log evidence while driving q towards the true posterior.
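This decomposition is easy to verify numerically for a small discrete model. The sketch below (the probability values are made up for illustration) confirms that KL[q‖p] + F(q, y) equals ln p(y) for an arbitrary q, and that F is largest exactly when q is the true posterior:

```python
import numpy as np

# Small discrete generative model p(y, theta) = p(y | theta) p(theta); values are illustrative.
prior      = np.array([0.5, 0.3, 0.2])    # p(theta)
likelihood = np.array([0.9, 0.4, 0.1])    # p(y | theta) for the observed y

joint     = likelihood * prior            # p(y, theta)
evidence  = joint.sum()                   # p(y)
posterior = joint / evidence              # p(theta | y)

def free_energy(q):
    """Negative free energy F(q, y) = E_q[ln p(y, theta)] - E_q[ln q(theta)]."""
    return np.sum(q * (np.log(joint) - np.log(q)))

def kl(q, p):
    """KL divergence D_KL(q || p) for discrete distributions."""
    return np.sum(q * (np.log(q) - np.log(p)))

q = np.array([0.6, 0.3, 0.1])             # an arbitrary approximate posterior
print(np.log(evidence), kl(q, posterior) + free_energy(q))   # identical values
print(free_energy(posterior) >= free_energy(q))              # True: F peaks at q = posterior
```

With q fixed to the true posterior, KL[q‖p] vanishes and F(q, y) equals ln p(y), which is exactly the sense in which VB turns inference into optimization.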