Basic Elements of Bayesian Analysis

1 Basic Elements of Bayesian Analysis In a frequentist analysis, one chooses a model (likelihood function) for the available data, and then either calculates a p-value (which tells you how un- usual your data would be, assuming your null hypothesis is exactly true), or calculates a confidence interval. We have already seen the many deficiencies of p-values, and confidence intervals, while more useful, have a somewhat unnat- ural interpretation, and ignore any prior information that may be available. Alternatively, Bayesian inferences can be calculated, which provide direct probability statements about parameters of interest, at the expense of hav- ing to first summarize prior information about the parameters of interest. In a nutshell, the choice between frequentist and Bayesian inferences can be seen as a choice between being “backwards” (frequentists calculate P (data|H0), rather than P (H0|data)) or being “subjective” (Bayesian analyses require prior information, which must be elicited subjectively). 2 The basic elements in a “full” Bayesian analysis are: 1. The parameter of interest, say θ. Note that this is completely general, since θ may be vector valued. So θ might be a binomial parameter, or the mean and variance from a Normal distribution, or an odds ratio, or a set of regression coefficients, etc. The parameter of interest is sometimes usefully thought of as the “true state of nature”. 2. The prior distribution of θ, f(θ). This prior distribution summarizes what is known about θ before the experiment is carried out. It is “subjective”, so may vary from investigator to investigator. 3. The likelihood function, f(x|θ). The likelihood function provides the distribution of the data, x, given the parameter value θ. So it may be the binomial likelihood, a normal likelihood, a likelihood from a regression equation with associated normal residual variance, logistic regression model, etc. 4. The posterior distribution, f(θ|x). The posterior distribution summarizes the information in the data, x, together with the information in the prior distribution, f(θ). Thus, it summarizes what is known about the parameter of interest θ after the data are collected. 5. Bayes Theorem. This theorem relates the above quantities: likelihood of the data × prior distribution posterior distribution = , a normalizing constant or f(x|θ) × f(θ) f(θ|x) = R f(x|θ) × f(θ)dθ, or, forgetting about the normalizing constant, f(θ|x) ∝ f(x|θ) × f(θ). Thus we “update” the prior distribution to a posterior distribution after seeing the data via Bayes Theorem. 3 6. The action, a. The action is the decision or action that is taken after the analysis is completed. For example, one may decide to treat a patient with Drug 1 or Drug 2, depending on the data collected in a clinical trial. Thus our action will either be to use Drug 1 (so that a = 1) or Drug 2 (so that a = 2). 7. The loss function, L(θ, a). Each time we choose an action, there is some loss we incur, which depends on what the true state of nature is, and what action we decide to take. For example, if the true state of nature is that Drug 1 is in fact superior to Drug 2, then choosing action a = 1 will incur a smaller loss than choosing a = 2. Now, the usual problem is that we do not know the true state of nature, we only have data that lets us make probabilistic statements about it (ie, we have a posterior distribution for θ, but do not usually know the exact value of θ). Also, we rarely make decisions before seeing the data, so that in general, a = a(x) is a function of the data. Note that while we will refer to these as “losses”, we could equally well use “gains”. 8. Expected Bayes Loss (Bayes Risk): We do not know the true value of θ, but we do have a posterior distribution once the data are known, f(θ|x). Hence, to make a “coherent” Bayesian decision, we minimize the Expected Bayesian Loss, defined by: Z EBL = L(θ, a(x))f(θ|x)dθ In other words, we choose the action a(x) such that the EBL is mini- mized. The first five elements in the above list comprise a non-decision theoretic Bayesian approach to statistical inference. This type of analysis (ie, non- decision theoretic) is what most of us are used to seeing in the medical lit- erature. However, many Bayesians argue that the main reason we carry out any statistical analyses is to help in making decisions, so that elements 6, 7, and 8 are crucial. There is little doubt that we will see more such analyses in the near future, but it remains to be seen how popular the decision theoretic framework will become in medicine. The main problem is to specify the loss functions, since there are so many possible consequences (main outcomes, side- effects, costs, etc.) to medical decisions, and it is difficult to combine these into a single loss function. My guess is that much work will have to be done on developing loss functions before the decision theoretic approach becomes mainstream. This course, therefore, will focus on elements 1 through 5. 4 Simple Univariate Inference for Common Situ- ations As you may have seen in your previous classes and in your experience, many data analyses begin with very simple univariate analyses, using models such as the normal (for continuous data), the binomial (for dichotomous data), the Poisson (for count data) and the multinomial (for multicategorical data). Here we will see how analyses typically proceed for these simple models from a Bayesian viewpoint. As described above, in Bayesian analyses, aside from a data model (by which I mean the likelihood function), we need a prior distribution over all unknown parameters in the model. Thus, here we consider “standard” likelihood-prior combinations for these simple situations. To begin, here is a summary chart of what we will see: Data Type (summary) Model (likelihood function) Conjugate Prior Posterior Density ! θ + x −1 2 2 τ2 σ2/n h 1 n i Continuous (x, n) normal(µ, σ ) normal(θ, τ ) normal 1 + 1 , τ 2 + σ2 τ2 σ2/n Dichotomous (x, n) binomial(θ, n) beta(α, β) beta(α + x, β + (n − x)) Count (x) Poisson(λ) gamma(α, β) gamma(α + x, β + 1) Multicat (x1, x2, . , xm) multinom(p1, p2, . , pm) dirich(α1, α2, . , αm) dirich(α1 + x1, α2 + x2, . , αp + xm) 5 6 We will now look at each of these four cases in detail. Bayesian Inference For A Single Normal Mean Example: Consider the situation where we are trying to estimate the mean diastolic blood pressure of Americans living in the United States from a sample of 27 patients. The data are: 76, 71, 82, 63, 76, 64, 64, 74, 70, 64, 75, 81, 75, 78, 66, 62, 79, 82, 78, 62, 72, 83, 79, 41, 80, 77, 67. [Note: These are in fact real data obtained from an experiment designed to estimate the effects of calcium supplementation on blood pressure. These are the baseline data for 27 subjects from the study, whose reference is: Lyle, R.M., Melby, C.L., Hyner, G.C., Edmonson, J.W., Miller, J.Z., and Weinberger, M.H. (1987). Blood pressure and metabolic effects of calcium supplementation in normotensive white and black men. Journal of the American Medical Association, 257, 1772–1776.] √ From this data, we find x = 71.89, and s2 = 85.18, so that s = 85.18 = 9.22 Let us assume the following: 1. The standard deviation is known a priori to be 9 mm Hg. 2. The observations come from a Normal distribution, i.e., 2 2 xi ∼ N(µ, σ = 9 ), for i = 1, 2,..., 27. We will follow the three usual steps used in Bayesian analyses: 1. Write down the likelihood function for the data. 2. Write down the prior distribution for the unknown parameter, in this case µ. 3. Use Bayes theorem to derive the posterior distribution. Use this posterior distribution, or summaries of it like 95% credible intervals for statistical inferences. 7 Step 1: The likelihood function for the data is based on the Normal distribution, i.e., n 1 (x − µ)2 1 !n Pn (x − µ)2 f(x , x , . , x |µ) = Y √ exp(− i ) = √ exp(− i=1 i ). 1 2 n 2 2 2 i=1 2πσ 2σ 2πσ 2σ Step 2: Suppose that we have a priori information that the random parameter µ is likely to be in the interval (60,80). That is, we think that the mean diastolic blood pressure should be about 70, but would not be too surprised if it were as low as perhaps 60, or as high as about 80. We will represent this prior distribution as a second Normal distribution (not to be confused with the fact that the data are also assumed to follow a Normal density). The Normal prior density is chosen here for the same reason as the Beta distribution is chosen when we looked at the binomial distribution: it makes the solution of Bayes Theorem very easy. We can therefore approximate our prior knowledge as: µ ∼ N(θ, τ 2) = N(70, 52 = 25). (1) In general, this choice for a prior is based on any information that may be available at the time of the experiment. In this case, the prior distribution was chosen to have a somewhat large standard deviation (τ = 5) to reflect that we have very little expertise in blood pressures of average Americans.

Basic Elements of Bayesian Analysis

2 Probability Theory and Classical Statistics

Importance Sampling

Probability Distributions and Related Mathematical Constructs

Lecture 17: Bivariate Normal Distribution F (X, Y) Is Given By

Bridgesampling: an R Package for Estimating Normalizing Constants

Bayesian Inference

9 Importance Sampling 3 9.1 Basic Importance Sampling

Derivations of the Univariate and Multivariate Normal Density

Exact Formulas for the Normalizing Constants of Wishart Distributions for Graphical Models

Orthogonal Polynomials in Stein's Method

Beta Distribution

The Continuous Bernoulli: Fixing a Pervasive Error in Variational