Bayesian Inference Chapter 9. Linear Models and Regression


M. Concepcion Ausin
Universidad Carlos III de Madrid
Master in Business Administration and Quantitative Methods
Master in Mathematical Engineering

Objective

To illustrate the Bayesian approach to fitting normal and generalized linear models.

Recommended reading

• Lindley, D.V. and Smith, A.F.M. (1972). Bayes estimates for the linear model (with discussion), Journal of the Royal Statistical Society B, 34, 1-41.
• Broemeling, L.D. (1985). Bayesian Analysis of Linear Models, Marcel Dekker.
• Gelman, A., Carlin, J.B., Stern, H.S. and Rubin, D.B. (2003). Bayesian Data Analysis, Chapter 8.

A.F.M. Smith developed some of the central ideas in the theory and practice of modern Bayesian statistics.

Contents

0. Introduction
1. The multivariate normal distribution
1.1. Conjugate Bayesian inference when the variance-covariance matrix is known up to a constant
1.2. Conjugate Bayesian inference when the variance-covariance matrix is unknown
2. Normal linear models
2.1. Conjugate Bayesian inference for normal linear models
2.2. Example 1: ANOVA model
2.3. Example 2: Simple linear regression model
3. Generalized linear models

1. The multivariate normal distribution

Firstly, we review the definition and properties of the multivariate normal distribution.

Definition. A random variable X = (X_1, ..., X_k)^T is said to have a multivariate normal distribution with mean µ and variance-covariance matrix Σ if

f(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right), \qquad x \in \mathbb{R}^k.

In this case, we write X | µ, Σ ∼ N(µ, Σ).

The following properties of the multivariate normal distribution are well known:

• Any subset of X has a (multivariate) normal distribution.
• Any linear combination \sum_{i=1}^k \alpha_i X_i is normally distributed.
• If Y = a + BX is a linear transformation of X, then Y | µ, Σ ∼ N(a + Bµ, BΣB^T).
• If

X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix} \,\Big|\, \mu, \Sigma \sim N\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right),

then the conditional density of X_1 given X_2 = x_2 is

X_1 \mid x_2, \mu, \Sigma \sim N\left( \mu_1 + \Sigma_{12} \Sigma_{22}^{-1} (x_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \right).

The likelihood function given a sample x = (x_1, ..., x_n) of data from N(µ, Σ) is

l(\mu, \Sigma \mid x) = \frac{1}{(2\pi)^{nk/2} |\Sigma|^{n/2}} \exp\left( -\frac{1}{2} \sum_{i=1}^n (x_i - \mu)^T \Sigma^{-1} (x_i - \mu) \right)
\propto \frac{1}{|\Sigma|^{n/2}} \exp\left( -\frac{1}{2} \left[ \sum_{i=1}^n (x_i - \bar{x})^T \Sigma^{-1} (x_i - \bar{x}) + n (\mu - \bar{x})^T \Sigma^{-1} (\mu - \bar{x}) \right] \right)
\propto \frac{1}{|\Sigma|^{n/2}} \exp\left( -\frac{1}{2} \left[ \operatorname{tr}\left( S \Sigma^{-1} \right) + n (\mu - \bar{x})^T \Sigma^{-1} (\mu - \bar{x}) \right] \right),

where \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i, S = \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T, and tr(M) denotes the trace of the matrix M.

It is possible to carry out Bayesian inference with conjugate priors for µ and Σ.
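Before moving on, here is a quick numerical check of the likelihood factorization above. The sketch is our own illustration, not part of the slides: the parameter values, sample size and variable names are arbitrary assumptions. It computes the sufficient statistics x̄ and S and verifies that the trace form of the exponent reproduces the log-likelihood obtained by summing the log-density over the individual observations.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative (made-up) parameters and sample: n observations from N(mu, Sigma)
mu_true = np.array([1.0, -2.0])
Sigma_true = np.array([[2.0, 0.6],
                       [0.6, 1.0]])
n, k = 50, 2
x = rng.multivariate_normal(mu_true, Sigma_true, size=n)

# Sufficient statistics: sample mean and scatter matrix S
xbar = x.mean(axis=0)
S = (x - xbar).T @ (x - xbar)

def log_lik(mu, Sigma):
    """Log-likelihood via tr(S Sigma^{-1}) + n (mu - xbar)' Sigma^{-1} (mu - xbar)."""
    Sigma_inv = np.linalg.inv(Sigma)
    quad = np.trace(S @ Sigma_inv) + n * (mu - xbar) @ Sigma_inv @ (mu - xbar)
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (n * k * np.log(2 * np.pi) + n * logdet + quad)

# Same value as summing the log-density over the observations one by one
direct = stats.multivariate_normal(mu_true, Sigma_true).logpdf(x).sum()
print(np.isclose(log_lik(mu_true, Sigma_true), direct))  # True
```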
We shall consider two cases which reflect different levels of knowledge about the variance-covariance matrix Σ.

1.1. Conjugate Bayesian inference when Σ = C/φ

Firstly, consider the case where the variance-covariance matrix is known up to a constant, i.e. Σ = C/φ, where C is a known matrix. Then we have X | µ, φ ∼ N(µ, C/φ) and the likelihood function is

l(\mu, \phi \mid x) \propto \phi^{nk/2} \exp\left( -\frac{\phi}{2} \left[ \operatorname{tr}\left( S C^{-1} \right) + n (\mu - \bar{x})^T C^{-1} (\mu - \bar{x}) \right] \right).

Analogous to the univariate case, it can be seen that a multivariate normal-gamma prior distribution is conjugate.

Definition. We say that (µ, φ) have a multivariate normal-gamma prior with parameters \left( m, V^{-1}, \frac{a}{2}, \frac{b}{2} \right) if

\mu \mid \phi \sim N\left( m, \frac{1}{\phi} V \right), \qquad \phi \sim G\left( \frac{a}{2}, \frac{b}{2} \right).

In this case, we write (\mu, \phi) \sim NG\left( m, V^{-1}, \frac{a}{2}, \frac{b}{2} \right).

Analogous to the univariate case, the marginal distribution of µ is a multivariate, non-central t distribution.

Definition. A (k-dimensional) random variable T = (T_1, ..., T_k) has a multivariate t distribution with parameters (d, µ_T, Σ_T) if

f(t) = \frac{\Gamma\left( \frac{d+k}{2} \right)}{(\pi d)^{k/2} |\Sigma_T|^{1/2} \Gamma\left( \frac{d}{2} \right)} \left[ 1 + \frac{1}{d} (t - \mu_T)^T \Sigma_T^{-1} (t - \mu_T) \right]^{-\frac{d+k}{2}}.

In this case, we write T ∼ T(µ_T, Σ_T, d).

Theorem. Let (\mu, \phi) \sim NG\left( m, V^{-1}, \frac{a}{2}, \frac{b}{2} \right). Then the marginal density of µ is

\mu \sim T\left( m, \frac{b}{a} V, a \right).

Theorem. Let X \mid \mu, \phi \sim N\left( \mu, \frac{1}{\phi} C \right) and assume a priori (\mu, \phi) \sim NG\left( m, V^{-1}, \frac{a}{2}, \frac{b}{2} \right). Then, given sample data x, we have

\mu \mid x, \phi \sim N\left( m^*, \frac{1}{\phi} V^* \right), \qquad \phi \mid x \sim G\left( \frac{a^*}{2}, \frac{b^*}{2} \right),

where

V^* = \left( V^{-1} + n C^{-1} \right)^{-1},
m^* = V^* \left( V^{-1} m + n C^{-1} \bar{x} \right),
a^* = a + nk,
b^* = b + \operatorname{tr}\left( S C^{-1} \right) + m^T V^{-1} m + n \bar{x}^T C^{-1} \bar{x} - (m^*)^T (V^*)^{-1} m^*.

Theorem. Given the reference prior p(\mu, \phi) \propto \frac{1}{\phi}, the posterior distribution is

p(\mu, \phi \mid x) \propto \phi^{\frac{nk}{2} - 1} \exp\left( -\frac{\phi}{2} \left[ \operatorname{tr}\left( S C^{-1} \right) + n (\mu - \bar{x})^T C^{-1} (\mu - \bar{x}) \right] \right).

Then \mu, \phi \mid x \sim NG\left( \bar{x}, n C^{-1}, \frac{(n-1)k}{2}, \frac{\operatorname{tr}\left( S C^{-1} \right)}{2} \right), which implies that

\mu \mid x, \phi \sim N\left( \bar{x}, \frac{1}{n\phi} C \right),
\phi \mid x \sim G\left( \frac{(n-1)k}{2}, \frac{\operatorname{tr}\left( S C^{-1} \right)}{2} \right),
\mu \mid x \sim T\left( \bar{x}, \frac{\operatorname{tr}\left( S C^{-1} \right)}{n(n-1)k} C, (n-1)k \right).

1.2. Conjugate Bayesian inference when Σ is unknown

In this case, it is useful to reparameterize the normal distribution in terms of the precision matrix Φ = Σ^{-1}. Then the normal likelihood function becomes

l(\mu, \Phi) \propto |\Phi|^{n/2} \exp\left( -\frac{1}{2} \left[ \operatorname{tr}(S \Phi) + n (\mu - \bar{x})^T \Phi (\mu - \bar{x}) \right] \right).

It is clear that a conjugate prior for µ and Φ must take a similar form to the likelihood. This is a normal-Wishart distribution.

Definition. A k × k dimensional symmetric, positive definite random variable W is said to have a Wishart distribution with parameters d and V if

f(W) = \frac{|W|^{\frac{d-k-1}{2}}}{2^{\frac{dk}{2}} |V|^{\frac{d}{2}} \pi^{\frac{k(k-1)}{4}} \prod_{i=1}^k \Gamma\left( \frac{d+1-i}{2} \right)} \exp\left( -\frac{1}{2} \operatorname{tr}\left( V^{-1} W \right) \right),

where d > k − 1. In this case, E[W] = dV and we write W ∼ W(d, V).

If W ∼ W(d, V), then the distribution of W^{-1} is said to be an inverse Wishart distribution, W^{-1} ∼ IW(d, V^{-1}), with mean E\left[ W^{-1} \right] = \frac{1}{d-k-1} V^{-1}.
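Both moments just quoted (E[W] = dV and the inverse Wishart mean) are easy to check by simulation. The Monte Carlo sketch below is our own illustration, not part of the slides; the values of k, d and V are arbitrary assumptions.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Arbitrary illustrative parameters: dimension k, degrees of freedom d, scale V
k, d = 3, 10
A = rng.normal(size=(k, k))
V = A @ A.T + k * np.eye(k)   # a positive definite scale matrix

draws = stats.wishart(df=d, scale=V).rvs(size=20_000, random_state=rng)

# E[W] = d V: deviation should be small (Monte Carlo error) relative to d V
print(np.abs(draws.mean(axis=0) - d * V).max())

# E[W^{-1}] = V^{-1} / (d - k - 1): the inverse Wishart mean
inv_draws = np.linalg.inv(draws)
print(np.abs(inv_draws.mean(axis=0) - np.linalg.inv(V) / (d - k - 1)).max())
```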
Theorem. Suppose that X \mid \mu, \Phi \sim N\left( \mu, \Phi^{-1} \right), and let \mu \mid \Phi \sim N\left( m, \frac{1}{\alpha} \Phi^{-1} \right) and \Phi \sim W(d, W). Then

\mu \mid \Phi, x \sim N\left( \frac{\alpha m + n \bar{x}}{\alpha + n}, \frac{1}{\alpha + n} \Phi^{-1} \right),
\Phi \mid x \sim W\left( d + n, \left[ W^{-1} + S + \frac{\alpha n}{\alpha + n} (m - \bar{x})(m - \bar{x})^T \right]^{-1} \right).

Theorem. Given the limiting prior p(\Phi) \propto |\Phi|^{-\frac{k+1}{2}}, the posterior distribution is

\mu \mid \Phi, x \sim N\left( \bar{x}, \frac{1}{n} \Phi^{-1} \right), \qquad \Phi \mid x \sim W\left( n - 1, S^{-1} \right).

The conjugacy assumption that the prior precision of µ is proportional to the model precision Φ is very strong in many cases. Often, we may simply wish to use a prior distribution of the form µ ∼ N(m, V), where m and V are known, together with a Wishart prior for Φ, say Φ ∼ W(d, W), as earlier. In this case, the conditional posterior distributions are

\mu \mid \Phi, x \sim N\left( \left( V^{-1} + n \Phi \right)^{-1} \left( V^{-1} m + n \Phi \bar{x} \right), \left( V^{-1} + n \Phi \right)^{-1} \right),
\Phi \mid \mu, x \sim W\left( d + n, \left[ W^{-1} + S + n (\mu - \bar{x})(\mu - \bar{x})^T \right]^{-1} \right),

and therefore it is straightforward to set up a Gibbs sampling algorithm to sample from the joint posterior, as in the univariate case.
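Since both full conditionals above are standard distributions, the Gibbs sampler takes only a few lines of code. The sketch below is our own minimal implementation under the stated semi-conjugate prior; the function and variable names are ours, and the data x, prior values m, V, d, W would come from the application at hand.

```python
import numpy as np
from scipy import stats

def gibbs_normal_wishart(x, m, V, d, W, n_iter=2000, seed=0):
    """Gibbs sampler for mu ~ N(m, V) and Phi ~ Wishart(d, W) (semi-conjugate prior).

    Alternates between the two full conditionals given in the text.
    A minimal sketch: no burn-in handling or convergence checks.
    """
    rng = np.random.default_rng(seed)
    n, k = x.shape
    xbar = x.mean(axis=0)
    S = (x - xbar).T @ (x - xbar)
    V_inv = np.linalg.inv(V)
    W_inv = np.linalg.inv(W)

    mu = xbar.copy()                      # starting value
    mu_draws, Phi_draws = [], []
    for _ in range(n_iter):
        # Phi | mu, x ~ W(d + n, [W^{-1} + S + n (mu - xbar)(mu - xbar)^T]^{-1})
        scale = np.linalg.inv(W_inv + S + n * np.outer(mu - xbar, mu - xbar))
        Phi = stats.wishart(df=d + n, scale=scale).rvs(random_state=rng)

        # mu | Phi, x ~ N((V^{-1} + n Phi)^{-1}(V^{-1} m + n Phi xbar), (V^{-1} + n Phi)^{-1})
        post_cov = np.linalg.inv(V_inv + n * Phi)
        post_mean = post_cov @ (V_inv @ m + n * Phi @ xbar)
        mu = rng.multivariate_normal(post_mean, post_cov)

        mu_draws.append(mu)
        Phi_draws.append(Phi)
    return np.array(mu_draws), np.array(Phi_draws)
```

Posterior summaries, for example the posterior mean of µ or of Σ = Φ^{-1}, can then be approximated by averaging the retained draws after discarding an initial burn-in period.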