Bayesian Modeling, Inference and Prediction
David Draper
Department of Applied Mathematics and Statistics
University of California, Santa Cruz
[email protected]
www.ams.ucsc.edu/∼draper/

Draft 6 (December 2005): Many sections still quite rough. Comments welcome.

© 2005 David Draper. All rights reserved. No part of this book may be reprinted, reproduced, or utilized in any form or by any electronic, mechanical or other means, now known or hereafter invented, including photocopying and recording, or by an information storage or retrieval system, without permission in writing from the author.

This book was typeset by the author using a PostScript-based phototypesetter (© Adobe Systems, Inc.). The figures were generated in PostScript using the R data analysis language (R Project, 2005), and were directly incorporated into the typeset document. The text was formatted using the LaTeX language (Lamport, 1994), a version of TeX (Knuth, 1984).

To Andrea, from whom I've learned so much

Contents

Preface ix

1 Background and basics 1
  1.1 Quantification of uncertainty 1
  1.2 Sequential learning; Bayes' Theorem 5
  1.3 Bayesian decision theory 7
  1.4 Problems 11

2 Exchangeability and conjugate modeling 19
  2.1 Quantification of uncertainty about observables. Binary outcomes 19
  2.2 Review of frequentist modeling 20
    2.2.1 The likelihood function 26
    2.2.2 Calibrating the MLE 31
  2.3 Bayesian modeling 35
    2.3.1 Exchangeability 35
    2.3.2 Hierarchical (mixture) modeling 37
    2.3.3 The simplest mixture model 38
    2.3.4 Conditional independence 39
    2.3.5 Prior, posterior, and predictive distributions 40
    2.3.6 Conjugate analysis. Comparison with frequentist modeling 44
  2.4 Integer-valued outcomes 61
  2.5 Continuous outcomes 89
  2.6 Appendix on hypothesis testing 107
  2.7 Problems 110

3 Simulation-based computation 123
  3.1 IID sampling 123
    3.1.1 Rejection sampling 129
  3.2 Markov chains 134
    3.2.1 The Monte Carlo and MCMC datasets 141
  3.3 The Metropolis algorithm 143
    3.3.1 Metropolis-Hastings sampling 144
    3.3.2 Gibbs sampling 149
    3.3.3 Why the Metropolis algorithm works 154
    3.3.4 Directed acyclic graphs 158
    3.3.5 Practical MCMC monitoring and convergence diagnostics 168
    3.3.6 MCMC accuracy 176
  3.4 Problems 193

4 Bayesian model specification 199
  4.1 Hierarchical model selection: An example with count data 199
    4.1.1 Poisson fixed-effects modeling 202
    4.1.2 Additive and multiplicative treatment effects 208
    4.1.3 Random-effects Poisson regression modeling 216
  4.2 Bayesian model choice 226
    4.2.1 Data-analytic model specification 228
    4.2.2 Model selection as a decision problem 229
    4.2.3 The log score and the deviance information criterion 234
    4.2.4 Connections with Bayes factors 244
    4.2.5 When is a model good enough? 247
  4.3 Problems 250

5 Hierarchical models for combining information 261
  5.1 The role of scientific context in formulating hierarchical models 261
    5.1.1 Maximum likelihood and empirical Bayes 264
    5.1.2 Incorporating study-level covariates 267
  5.2 Problems 282

6 Bayesian nonparametric modeling 285
  6.1 de Finetti's representation theorem revisited 285
  6.2 The Dirichlet process 285
  6.3 Dirichlet process mixture modeling 285
  6.4 Pólya trees 285
  6.5 Problems 285

References 287
Index 295

Preface

Santa Cruz, California
David Draper
December 2005

Chapter 1

Background and basics

1.1 Quantification of uncertainty

Case study: Diagnostic screening for HIV. Widespread screening for HIV has been proposed by some people in some countries (e.g., the U.S.).
Two blood tests that screen for HIV are available: ELISA, which is relatively inexpensive (roughly $20) and fairly accurate; and Western Blot (WB), which is considerably more accurate but costs quite a bit more (about $100). Let's say¹ I'm a physician, and a new patient comes to me with symptoms that suggest the patient (male, say) may be HIV positive. Questions:

• Is it appropriate to use the language of probability to quantify my uncertainty about the proposition A = {this patient is HIV positive}?

• If so, what kinds of probability are appropriate, and how would I assess P(A) in each case?

• What strategy (e.g., ELISA, WB, both?) should I employ to decrease my uncertainty about A? If I decide to run a screening test, how should my uncertainty be updated in light of the test results?

¹ As will become clear, the Bayesian approach to probability and statistics is explicit about the role of personal judgment in uncertainty assessment. To make this clear I'll write in the first person in this book, but as you read I encourage you to constantly imagine yourself in the position of the person referred to as "I" and to think along with that person in quantifying uncertainty and making choices in the face of it.

Statistics might be defined as the study of uncertainty: how to measure it, and what to do about it; and probability as the part of mathematics (and philosophy) devoted to the quantification of uncertainty. The systematic study of probability is fairly recent in the history of ideas, dating back to about 1650 (e.g., Hacking, 1975). In the last 350 years three main ways to define probability have arisen (e.g., Oakes, 1990):

• Classical: Enumerate the elemental outcomes (EOs) (the fundamental possibilities in the process under study) in a way that makes them equipossible on, e.g., symmetry grounds, and compute P_C(A) = n_A/n, where n_A = (number of EOs favorable to A) and n = (total number of EOs).
• Frequentist: Restrict attention to attributes A of events: phenomena that are inherently (and independently) repeatable under "identical" conditions; define P_F(A) = the limiting value of the relative frequency with which A occurs in the repetitions.

• Personal, or "Subjective", or Bayesian: I imagine betting with someone about the truth of proposition A, and ask myself what odds O (in favor of A) I would need to give or receive in order that I judge the bet fair; then (for me) P_B(A) = O/(1 + O).

Other approaches not covered here include logical (e.g., Jeffreys, 1961) and fiducial (Fisher, 1935) probability.

Each of these probability definitions has general advantages and disadvantages:

• Classical
  – Plus: Simple, when applicable (e.g., idealized coin-tossing, drawing colored balls from urns, choosing cards from a well-shuffled deck, and so on).
  – Minus: The only way to define "equipossible" without a circular appeal to probability is through the principle of insufficient reason: I judge EOs equipossible if I have no grounds (empirical, logical, or symmetrical) for favoring one over another. But this can lead to paradoxes (e.g., assertion of equal uncertainty is not invariant to the choice of scale on which it's asserted).

• Frequentist
  – Plus: Mathematics relatively tractable.
  – Minus: Only applies to inherently repeatable events; e.g., from the vantage point of (say) 2005, P_F(the Republicans will win the White House again in 2008) is (strictly speaking) undefined.

• Bayesian
  – Plus: All forms of uncertainty are in principle quantifiable with this approach.
  – Minus: There's no guarantee that the answer I get by asking myself about betting odds will retrospectively be seen by me or others as "good" (but how should the quality of an uncertainty assessment itself be assessed?).
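The betting-odds elicitation in the Bayesian definition is just a change of scale: odds O in favor of A correspond to probability O/(1 + O), with inverse O = P/(1 − P). A minimal sketch (the book's own examples use R; this Python fragment and its function names are mine, for illustration only):

```python
def odds_to_prob(o: float) -> float:
    """Convert odds O in favor of A to P_B(A) = O / (1 + O)."""
    return o / (1.0 + o)

def prob_to_odds(p: float) -> float:
    """Inverse map: probability P back to odds O = P / (1 - P)."""
    return p / (1.0 - p)

# If I judge 3-to-1 odds in favor of A a fair bet, then P_B(A) = 0.75.
print(odds_to_prob(3.0))   # 0.75
# Even odds (O = 1) correspond to probability 1/2.
print(odds_to_prob(1.0))   # 0.5
```

Note that even odds (O = 1) land exactly on probability 1/2, the point of complete indifference between A and its complement.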
Returning to P(A) = P(this patient is HIV-positive), data are available from medical journals on the prevalence of HIV-positivity in various subsets of P = {all humans} (e.g., it's higher in gay people and lower in women). All three probabilistic approaches require me to use my judgment to identify the recognizable subpopulation P_{this patient} (Fisher, 1956; Draper et al., 1993): this is the smallest subset to which this patient belongs for which the HIV prevalence differs from that in the rest of P by an amount I judge as large enough to matter in a practical sense.

The idea is that within P_{this patient} I regard HIV prevalence as close enough to constant that the differences aren't worth bothering over, but the differences between HIV prevalence in P_{this patient} and its complement matter to me. Here P_{this patient} might consist (for me) of everybody who matches this patient on gender, age (category, e.g., 25–29), and sexual orientation.

NB This is a modeling choice based on judgment; different reasonable people might make different choices.

As a classicist I would then (a) use this definition to establish equipossibility within P_{this patient}, (b) count n_A = (number of HIV-positive people in P_{this patient}) and n = (total number of people in P_{this patient}), and (c) compute P_C(A) = n_A/n.

As a frequentist I would (a) equate P(A) to P(a person chosen at random with replacement (independent identically distributed (IID) sampling) from P_{this patient} is HIV-positive), (b) imagine repeating this random sampling indefinitely, and (c) conclude that the limiting value of the relative frequency of HIV-positivity in these repetitions would also be P_F(A) = n_A/n.

NB Strictly speaking I'm not allowed in the frequentist approach to talk about P(this patient is HIV-positive): he either is or he isn't. In the frequentist paradigm I can only talk about the process of sampling people like him from P_{this patient}.
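The classicist's count and the frequentist's limiting relative frequency can both be mimicked in a few lines of simulation. This is an illustration only: the subpopulation size and its 1% prevalence are invented numbers, not data from the book (whose computing examples use R rather than Python):

```python
import random

random.seed(1)

# Hypothetical recognizable subpopulation P_{this patient}:
# n people, of whom n_A are HIV-positive (invented figures).
n, n_A = 10_000, 100

# Classical: equipossibility within the subpopulation gives
# P_C(A) = n_A / n directly, by counting.
p_classical = n_A / n

# Frequentist: IID sampling with replacement from the subpopulation;
# the relative frequency of positives settles near the same value,
# its limit being P_F(A).
population = [1] * n_A + [0] * (n - n_A)
draws = 1_000_000
positives = sum(random.choice(population) for _ in range(draws))
p_freq = positives / draws

print(p_classical)       # 0.01
print(round(p_freq, 3))  # close to 0.01
```

With a million draws the relative frequency is typically within a few ten-thousandths of the classical answer, which is the sense in which the two definitions agree here while resting on different interpretations.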