

Bayesian Inference for Linear and Logistic Regression

Bayesian inference for simple linear and logistic regression parameters follows the usual pattern for all Bayesian analyses:

1. Form a prior distribution over all unknown parameters.

2. Write down the likelihood function of the data.

3. Use Bayes theorem to find the posterior distribution of all parameters.

• We have applied this generic formulation so far to problems with binomial distributions, normal means (with known variance), multinomial parameters, and to the Poisson distribution. All of these problems involved only one parameter at a time (strictly speaking, more than one in the multinomial case, but only a single prior distribution was used, so all parameters were treated together, as if it were just one parameter).

• What makes regression different is that we have three unknown parameters, since the intercept and slope of the line, α and β, are unknown, and the residual standard deviation (of the error term), σ, is also unknown.

• Hence our Bayesian problem becomes slightly more complicated, since we are in a multi-parameter situation. Thus, we will use the Gibbs sampler, as automatically implemented in WinBUGS software.

Brief Sketch of Bayesian linear regression

Recall the three steps: prior → likelihood → posterior.

1. We need a joint prior distribution over α, β, and σ. We will specify these as three independent priors [which when multiplied together will produce a joint prior]. One choice is:

• α ∼ uniform[−∞, +∞]

• β ∼ uniform[−∞, +∞]

• log(σ) ∼ uniform[−∞, +∞]

With this choice, we can approximate the results from a frequentist analysis. Another possible choice is (more practical for WinBUGS):

• α ∼ Normal(0, 0.0001)

• β ∼ Normal(0, 0.0001)

• σ ∼ uniform[0, 100]

These priors are very diffuse (what counts as diffuse typically depends on the scale of the data), so they approximate the first set of priors. (They are written in WinBUGS's parameterization, in which the second argument of the normal is the precision, 1/variance; a precision of 0.0001 corresponds to a standard deviation of 100.) Of course, if real prior information is available, it can be incorporated.

Notes:

• The need for the log in the first set of priors comes from the fact that the variance must be positive. The prior on σ is equivalent to a density that is proportional to 1/σ² (a short change-of-variables sketch is given after these notes).

• We specify non-informative prior distributions for these three parameters. Of course, we can also include prior information when available, but this is beyond the scope of this course.

• The first set of priors is in fact “improper” because their densities do not integrate to one: the area under these curves is infinite! In general this is to be avoided, since it can sometimes cause problems with posterior distributions. This is not one of those problem cases, however, and it is sometimes convenient to use a “flat” prior everywhere, so it is mentioned here (even though we use proper priors for the rest of this lecture).
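For completeness (this short derivation is not in the original notes), the equivalence mentioned above follows from a standard change of variables:

\[
p(\log\sigma) \propto 1
\;\Longrightarrow\;
p(\sigma) \propto \frac{1}{\sigma}
\;\Longrightarrow\;
p(\sigma^2) = p(\sigma)\left|\frac{d\sigma}{d\sigma^2}\right| \propto \frac{1}{\sigma}\cdot\frac{1}{2\sigma} \propto \frac{1}{\sigma^2},
\]

which is the 1/σ² factor that will appear in the posterior below.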

2. Likelihood function in regression:

• As is often the case, the likelihood function used in a Bayesian analysis is the same as the one used for the frequentist analysis.

• Recall that we have normally distributed residuals, ε ∼ N(0, σ²).

• Recall that the mean of the regression line, given that we know α and β is y = α + β × x.

• Putting this together, we have y ∼ N(α + β × x, σ²).

• So for a single patient with observed value xi, we have yi ∼ N(α + β × xi, σ²).

• So for a single patient, the likelihood function is:

\[
f(y_i) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left\{-\frac{\bigl(y_i - (\alpha + \beta x_i)\bigr)^2}{2\sigma^2}\right\}
\]

• So for a group of n patients each contributing data (xi, yi), the likelihood function is given by

\[
\prod_{i=1}^{n} f(y_i) = f(y_1)\times f(y_2)\times f(y_3)\times\cdots\times f(y_n)
\]

• So the likelihood function is simply a bunch of normal densities multiplied together . . . a multivariate normal likelihood of dimension n.
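As a concrete illustration (a minimal R sketch with made-up numbers; it is not part of the notes), the log of this likelihood can be evaluated for any candidate values of α, β, and σ by summing log normal densities:

x <- c(0.5, 1.2, 2.3, 3.1, 4.0)            # hypothetical covariate values
y <- c(2.1, 3.9, 6.2, 7.8, 9.5)            # hypothetical responses
loglik <- function(alpha, beta, sigma) {
   mu <- alpha + beta*x                    # mean of the regression line at each x
   sum(dnorm(y, mean = mu, sd = sigma, log = TRUE))   # sum of log normal densities
}
loglik(1, 2, 0.5)                          # log-likelihood at one parameter setting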

3. Posterior densities in regression

• Bayes theorem now says to multiply the likelihood function (multivariate normal) with the prior. For the first prior set, the joint prior is simply 1 × 1 × 1/σ². For the second, it is two normal distributions multiplied by a uniform. Since the uniform is equal to 1 everywhere, this reduces to a product of two normal distributions.

• For the first prior set, the posterior distribution simply is:

\[
\left[\prod_{i=1}^{n} f(y_i)\right]\times\frac{1}{\sigma^2}.
\]

• This is a three-dimensional posterior involving α, β, and σ².

• By integrating this posterior density, we can obtain the marginal densities for each of α, β, and σ².

• After integration (tedious details omitted):

– α ∼ t_{n−2}

– β ∼ t_{n−2}

– σ² ∼ Inverse Chi-Square (so 1/σ² ∼ Chi-Square)

• Note the similar results given by the Bayesian and frequentist approaches for α and β; in fact, the means and standard errors are the same as well, assuming the “infinite priors” listed above are used.

• When the second prior set is used, the resulting formulae are more complex, and WinBUGS will be used. However, as both sets of priors are very “diffuse” or “non-informative”, the results will be numerically similar to each other, and both will be similar to the inferences given by the frequentist approach (but the interpretations will be very different).

• Although the t distributions listed above can be used, Bayesian computations are usually carried out via computer programs. We will use WinBUGS to compute Bayesian posterior distributions for linear and logistic regression problems.

• The Bayesian approach also suggests different ways to assess goodness of fit and to carry out model selection (we will see this later).

Example

Consider the following data:

Community   DMF per 100   Fluoride Concentration
Number      children      in ppm
 1            236          1.9
 2            246          2.6
 3            252          1.8
 4            258          1.2
 5            281          1.2
 6            303          1.2
 7            323          1.3
 8            343          0.9
 9            412          0.6
10            444          0.5
11            556          0.4
12            652          0.3
13            673          0.0
14            703          0.2
15            706          0.1
16            722          0.0
17            733          0.2
18            772          0.1
19            810          0.0
20            823          0.1
21           1027          0.1

We will now analyze these data from a Bayesian viewpoint, using WinBUGS. The program we need is:

model
{
   for (i in 1:21)                            # loop over cities
   {
      mu.dmf[i] <- alpha + beta*fl[i]         # regression equation
      dmf[i] ~ dnorm(mu.dmf[i],tau)           # distribution individual values
   }
   alpha ~ dnorm(0.0,0.000001)                # prior for intercept
   beta ~ dnorm(0.0,0.000001)                 # prior for slope
   sigma ~ dunif(0,400)                       # prior for residual SD
   tau <- 1/(sigma*sigma)                     # precision required by WinBUGS
   for (i in 1:21)
   {                                          # calculate residuals
      residual[i] <- dmf[i] - mu.dmf[i]
   }
   pred.mean.1.7 <- alpha + beta*1.7          # mean prediction for fl=1.7
   pred.ind.1.7 ~ dnorm(pred.mean.1.7, tau)   # individual pred for fl=1.7
}

Data are entered in one of the following two formats (both equivalent, note blank line required after “END” statement):

dmf[]   fl[]
236     1.9
246     2.6
252     1.8
258     1.2
281     1.2
303     1.2
323     1.3
343     0.9
412     0.6
444     0.5
556     0.4
652     0.3
673     0.0
703     0.2
706     0.1
722     0.0
733     0.2
772     0.1
810     0.0
823     0.1
1027    0.1
END

or

list(dmf=c(236, 246, 252, 258, 281, 303, 323, 343, 412, 444, 556, 652, 673,
           703, 706, 722, 733, 772, 810, 823, 1027),
     fl=c(1.9, 2.6, 1.8, 1.2, 1.2, 1.2, 1.3, 0.9, 0.6, 0.5, 0.4, 0.3, 0.0,
          0.2, 0.1, 0.0, 0.2, 0.1, 0.0, 0.1, 0.1))

Let’s look at a graph of these data:
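(The figure itself is not reproduced here. A sketch of how it could be drawn in R, using the dmf and fl values listed above:)

dmf <- c(236, 246, 252, 258, 281, 303, 323, 343, 412, 444, 556,
         652, 673, 703, 706, 722, 733, 772, 810, 823, 1027)
fl  <- c(1.9, 2.6, 1.8, 1.2, 1.2, 1.2, 1.3, 0.9, 0.6, 0.5, 0.4,
         0.3, 0.0, 0.2, 0.1, 0.0, 0.2, 0.1, 0.0, 0.1, 0.1)
plot(fl, dmf, xlab = "Fluoride concentration (ppm)",
     ylab = "DMF per 100 children")        # scatter plot of the 21 communities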

We will use these initial values (rather arbitrary, just to get into the right general area to start):

list(alpha=100, beta = 0, sigma=100)

Running 2000 burn-in values and then 10,000 further iterations produces the following results:

node            mean     sd      2.5%     median   97.5%
alpha           730.1    41.45   646.8    730.6    809.7
beta            -277.4   40.86   -358.8   -277.4   -196.8
mu.dmf[1]       203.0    57.16   89.49    203.0    314.1
mu.dmf[2]       8.869    82.93   -152.8   8.885    171.1
mu.dmf[3]       230.8    53.72   124.3    230.6    335.6
mu.dmf[4]       397.2    35.98   325.8    397.5    467.1
mu.dmf[5]       397.2    35.98   325.8    397.5    467.1
mu.dmf[6]       397.2    35.98   325.8    397.5    467.1
mu.dmf[7]       369.5    38.42   293.3    369.6    444.2
mu.dmf[8]       480.4    30.81   418.9    480.4    540.8
mu.dmf[9]       563.7    30.09   503.0    563.4    621.8
mu.dmf[10]      591.4    30.94   529.2    591.4    651.0
mu.dmf[11]      619.1    32.29   554.6    619.3    680.9
mu.dmf[12]      646.9    34.08   578.9    647.1    712.3
mu.dmf[13]      730.1    41.45   646.8    730.6    809.7
mu.dmf[14]      674.6    36.24   602.0    675.1    744.8
mu.dmf[15]      702.4    38.72   624.6    702.7    777.0
mu.dmf[16]      730.1    41.45   646.8    730.6    809.7
mu.dmf[17]      674.6    36.24   602.0    675.1    744.8
mu.dmf[18]      702.4    38.72   624.6    702.7    777.0
mu.dmf[19]      730.1    41.45   646.8    730.6    809.7
mu.dmf[20]      702.4    38.72   624.6    702.7    777.0
mu.dmf[21]      702.4    38.72   624.6    702.7    777.0
pred.ind.1.7    257.3    145.8   -33.89   255.3    546.8
pred.mean.1.7   258.5    50.37   158.7    258.5    356.8
residual[1]     32.96    57.16   -78.05   32.97    146.5
residual[2]     237.1    82.93   75.25    237.1    398.8
residual[3]     21.22    53.72   -83.55   21.39    127.9
residual[4]     -139.2   35.98   -209.0   -139.5   -67.76
residual[5]     -116.2   35.98   -186.0   -116.5   -44.76
residual[6]     -94.22   35.98   -164.0   -94.46   -22.76
residual[7]     -46.48   38.42   -121.0   -46.57   29.72
residual[8]     -137.4   30.81   -197.8   -137.4   -75.79
residual[9]     -151.7   30.09   -209.8   -151.4   -90.83
residual[10]    -147.4   30.94   -206.9   -147.4   -85.25
residual[11]    -63.13   32.29   -124.9   -63.27   1.432
residual[12]    5.126    34.08   -60.23   4.903    73.22
residual[13]    -57.09   41.45   -136.7   -57.55   26.23
residual[14]    28.39    36.24   -41.83   27.93    101.0
residual[15]    3.647    38.72   -71.02   3.295    81.4
residual[16]    -8.092   41.45   -87.68   -8.55    75.23
residual[17]    58.39    36.24   -11.83   57.93    131.0
residual[18]    69.65    38.72   -5.02    69.3     147.4
residual[19]    79.91    41.45   0.324    79.45    163.2
residual[20]    120.6    38.72   45.98    120.3    198.4
residual[21]    324.6    38.72   250.0    324.3    402.4
sigma           135.0    21.83   98.52    132.5    183.2
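For comparison (a sketch, not part of the original WinBUGS session), the frequentist fit of the same straight-line model can be obtained in R with the dmf and fl vectors defined in the plotting sketch above:

summary(lm(dmf ~ fl))    # least-squares intercept and slope; with priors this
                         # diffuse they should be close to the posterior means
                         # of alpha and beta reported above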

A quadratic model may fit better. Here is a program for this more complex model, which also illustrates Bayesian regression for more than one variable.

model
{
   for (i in 1:21)                                         # loop over cities
   {
      mu.dmf[i] <- alpha + beta1*fl[i] + beta2*fl[i]*fl[i] # regression equation
      dmf[i] ~ dnorm(mu.dmf[i],tau)                        # distribution individual values
   }
   alpha ~ dnorm(0.0, 0.000001)                            # prior for intercept
   beta1 ~ dnorm(0.0, 0.000001)                            # prior for slope for fl
   beta2 ~ dnorm(0.0, 0.000001)                            # prior for slope for fl squared
   sigma ~ dunif(0,200)                                    # prior for residual SD
   tau <- 1/(sigma*sigma)                                  # precision required by WinBUGS
   for (i in 1:21)
   {                                                       # calculate residuals
      residual[i] <- dmf[i] - mu.dmf[i]
   }
   pred.mean.1.7 <- alpha + beta1*1.7 + beta2*1.7*1.7      # mean for fl=1.7
   pred.ind.1.7 ~ dnorm(pred.mean.1.7, tau)                # individual pred for fl=1.7
}

Running this model, the results are:

node            mean     sd      2.5%     median   97.5%
alpha           808.6    34.31   739.8    808.4    875.5
beta1           -624.2   87.07   -792.6   -624.9   -448.1
beta2           161.5    38.16   85.78    162.1    236.3
pred.ind.1.7    214.7    105.0   3.166    213.7    424.9
pred.mean.1.7   214.2    37.34   140.5    213.8    287.7
sigma           94.4     17.54   67.35    91.79    135.8

Comparing Bayesian to Frequentist Regression

• If no prior information is used (i.e., if we use a Bayesian approach with “noninformative” or “flat” or “diffuse” or “reference” priors), then the inferences from the Bayesian and frequentist approaches are numerically similar. For example, 95% confidence intervals will be very similar to 95% credible intervals. We have illustrated this in the above examples.

• However, the interpretations of these intervals are completely different: Bayesian intervals are directly interpreted as the probability that the parameter is in the interval, given the data and any prior information. Frequentist confidence intervals cannot be interpreted this way; one can only say that if the confidence interval procedure were to be used repeatedly, then 95% of all such intervals will contain the true value.

• If there is prior information, then only the Bayesian approach can formally include this information in the model.

• Both Bayesian and frequentist approaches can be extended to much more complex models. However, Bayesian methods tend to be more flexible, in general.

• Using WinBUGS, it is trivially easy to obtain inferences for any function of any estimated parameter. For example, having estimated, say, β1 and β2, it is trivial to obtain the posterior distribution for, say, √(β1 × β2) (a small sketch in R follows this list).

• As we will soon see, Bayesian methods for model selection tend to work better than standard frequentist methods (although there is no single perfect method for model selection developed so far).
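For instance (a sketch with hypothetical numbers, not output from the notes), if MCMC draws of β1 and β2 were exported to R, the posterior of any function of them is obtained by applying the function draw by draw:

set.seed(1)
beta1.draws <- rnorm(5000, mean = 2.0, sd = 0.3)   # stand-ins for MCMC draws of beta1
beta2.draws <- rnorm(5000, mean = 3.0, sd = 0.4)   # stand-ins for MCMC draws of beta2
g.draws <- sqrt(beta1.draws * beta2.draws)         # one draw of sqrt(beta1*beta2) per iteration
quantile(g.draws, c(0.025, 0.5, 0.975))            # posterior median and 95% credible interval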

Bayesian Inference for Logistic Regression Parameters

Bayesian inference for logistic regression analyses follows the usual pattern for all Bayesian analyses: write down the likelihood, specify a prior, then multiply the two to derive the posterior density, from which all inferences follow.

For logistic regression, the three steps are summarized as follows:

Likelihood function: As usual, the likelihood function used by Bayesians matches the one used by frequentists. In particular, recall that once we have the probability of success (which in logistic regression varies from one subject to another, depending on their covariates), the likelihood contribution from the ith subject is binomial:

\[
\text{likelihood}_i = \pi(x_i)^{y_i}\,\bigl(1-\pi(x_i)\bigr)^{(1-y_i)}
\]

where π(xi) represents the probability of the event for subject i who has covariate vector xi, and yi indicates the presence, yi = 1, or absence yi = 0 of the event for that subject. Of course, in logistic regression, we know that

\[
\pi(x) = \frac{e^{\beta_0+\beta_1X_1+\ldots+\beta_pX_p}}{1+e^{\beta_0+\beta_1X_1+\ldots+\beta_pX_p}}
\]

so that the likelihood contribution from the ith subject is

\[
\text{likelihood}_i =
\left(\frac{e^{\beta_0+\beta_1X_{i1}+\ldots+\beta_pX_{ip}}}{1+e^{\beta_0+\beta_1X_{i1}+\ldots+\beta_pX_{ip}}}\right)^{y_i}
\left(1-\frac{e^{\beta_0+\beta_1X_{i1}+\ldots+\beta_pX_{ip}}}{1+e^{\beta_0+\beta_1X_{i1}+\ldots+\beta_pX_{ip}}}\right)^{(1-y_i)}
\]

Since individual subjects are assumed independent from each other, the likelihood function over a data set of n subjects is then

\[
\text{likelihood} = \prod_{i=1}^{n}
\left(\frac{e^{\beta_0+\beta_1X_{i1}+\ldots+\beta_pX_{ip}}}{1+e^{\beta_0+\beta_1X_{i1}+\ldots+\beta_pX_{ip}}}\right)^{y_i}
\left(1-\frac{e^{\beta_0+\beta_1X_{i1}+\ldots+\beta_pX_{ip}}}{1+e^{\beta_0+\beta_1X_{i1}+\ldots+\beta_pX_{ip}}}\right)^{(1-y_i)}
\]
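To make this concrete (a minimal R sketch, not from the notes; the function name and arguments are invented for illustration), the log of this likelihood can be computed for any coefficient vector:

logistic.loglik <- function(beta, X, y) {
   eta <- as.vector(X %*% beta)              # linear predictor beta0 + beta1*Xi1 + ...
   p <- exp(eta)/(1 + exp(eta))              # pi(x_i) for each subject
   sum(y*log(p) + (1 - y)*log(1 - p))        # sum of Bernoulli log-likelihood terms
}
# Here X is an n x (p+1) design matrix whose first column is all 1s,
# y is the vector of 0/1 outcomes, and beta has length p+1.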

Prior distribution:

The set of unknown parameters consists of β0, β1,..., βp.

In general, any prior distribution can be used, depending on the available prior information. The choice can include informative prior distributions if something is known about the likely values of the unknown parameters, or “diffuse” or “non-informative” priors if either little is known about the coefficient values or if one wishes to see what the data themselves provide as inferences.

If informative prior distributions are desired, it is often difficult to give such information directly on the logit scale, i.e., on the β parameters themselves.

One may prefer to provide prior information on the OR = exp(β) scale, and mathematically transform back to the logit scale. Alternatively, one can take various situations (e.g., male, 73 years old, on drug A, etc.) and derive prior distributions on the probability scale. If sufficient elicitations of this type are made, one can mathematically transform back to the coefficient scale. One can find free software (e.g., a program called ELICITOR) that facilitates such prior derivations.
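As a small worked example of such a transformation (illustrative numbers, not from the notes): suppose prior opinion is that an odds ratio is most likely near 2, and lies between 1 and 4 with probability about 95%. On the log scale this translates into a normal prior for β = log(OR) with

\[
\mu = \log(2) \approx 0.69, \qquad
\sigma \approx \frac{\log(4) - \log(1)}{2 \times 1.96} \approx 0.35,
\]

that is, β ∼ N(0.69, 0.35²).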

For this course, however, we will use the most common priors for logistic regression parameters, which are of the form:

\[
\beta_j \sim N(\mu_j, \sigma_j^2)
\]

The most common choice for µ is zero, and σ is usually chosen to be large enough to be considered non-informative, common choices being in the range from σ = 10 to σ = 100.

Posterior distribution via Bayes Theorem: The posterior distribution is derived by multiplying the prior distribution over all parameters by the full likelihood function, so that

\[
\text{posterior} = \prod_{i=1}^{n}
\left(\frac{e^{\beta_0+\beta_1X_{i1}+\ldots+\beta_pX_{ip}}}{1+e^{\beta_0+\beta_1X_{i1}+\ldots+\beta_pX_{ip}}}\right)^{y_i}
\left(1-\frac{e^{\beta_0+\beta_1X_{i1}+\ldots+\beta_pX_{ip}}}{1+e^{\beta_0+\beta_1X_{i1}+\ldots+\beta_pX_{ip}}}\right)^{(1-y_i)}
\times \prod_{j=0}^{p}\frac{1}{\sqrt{2\pi}\,\sigma_j}\exp\left\{-\frac{1}{2}\left(\frac{\beta_j-\mu_j}{\sigma_j}\right)^2\right\}
\]

the latter part of the above expression being recognized as normal distributions for the β parameters.

Of course, the above expression has no closed form, and even if it did, we would have to perform multiple integration to obtain the marginal distribution for each coefficient. So, as is usual for Bayesian analyses, we will use the Gibbs sampler as implemented in WinBUGS to approximate the properties of the marginal posterior distributions for each parameter.

As was the case for frequentist inference, taking exp(β) provides the odds ratio for a one unit change of that parameter. We will see how easy it is to carry out such inferences from a Bayesian viewpoint using WinBUGS.

Let’s see some examples. We will begin with a simple simulated model, and move on to multivariate examples. We will program each in WinBUGS.

Simple Logistic Regression Program Using WinBUGS

We will investigate a simulated logistic regression model of bone fractures with independent variables age and sex. The true model had: alpha = -25, b.sex = 0.5, b.age = 0.4. With so few data points and three parameters to estimate, do not expect the posterior means/medians to equal the correct values exactly, but all should most likely be within the 95% intervals.
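A sketch (in R; the age range and sex distribution are assumptions, only the coefficients come from the stated true model) of how such data could be simulated:

set.seed(123)                               # for reproducibility
n <- 100
sex <- rbinom(n, 1, 0.5)                    # 0/1 indicator for sex
age <- round(runif(n, 50, 80))              # ages roughly in the range of the data
eta <- -25 + 0.5*sex + 0.4*age              # true linear predictor
p <- exp(eta)/(1 + exp(eta))                # true event probabilities
frac <- rbinom(n, 1, p)                     # simulated 0/1 fracture outcomes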

Model

model {
   for (i in 1:n) {

      # Linear regression on logit
      logit(p[i]) <- alpha + b.sex*sex[i] + b.age*age[i]

      # Likelihood function for each data point
      frac[i] ~ dbern(p[i])
   }
   alpha ~ dnorm(0.0,1.0E-4)   # Prior for intercept
   b.sex ~ dnorm(0.0,1.0E-4)   # Prior for slope of sex
   b.age ~ dnorm(0.0,1.0E-4)   # Prior for slope of age
}

Data

list(sex=c(1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0,
           1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1,
           0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1,
           1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0,
           0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 1),
     age=c(69, 57, 61, 60, 69, 74, 63, 68, 64, 53, 60, 58, 79, 56, 53, 74, 56, 76, 72, 56,
           66, 52, 77, 70, 69, 76, 72, 53, 69, 59, 73, 77, 55, 77, 68, 62, 56, 68, 70, 60,
           65, 55, 64, 75, 60, 67, 61, 69, 75, 68, 72, 71, 54, 52, 54, 50, 75, 59, 65, 60,
           60, 57, 51, 51, 63, 57, 80, 52, 65, 72, 80, 73, 76, 79, 66, 51, 76, 75, 66, 75,
           78, 70, 67, 51, 70, 71, 71, 74, 74, 60, 58, 55, 61, 65, 52, 68, 75, 52, 53, 70),
     frac=c(1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0,
            1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 0,
            1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1,
            1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1,
            1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1),
     n=100)

Initial Values

list(alpha=0, b.sex=1, b.age=1)

Results

node   mean     sd       MC error  2.5%     median  97.5%   start  sample
alpha  -22.55   5.013    0.6062    -34.33   -21.64  -14.29  1001   4000
b.age  0.3559   0.07771  0.009395  0.227    0.3418  0.5338  1001   4000
b.sex  1.405    0.7719   0.05094   -0.0387  1.374   3.031   1001   4000
p[1]   0.9575   0.03153  0.002943  0.879    0.9647  0.9952  1001   4000
p[2]   0.307    0.09828  0.004853  0.13     0.3012  0.5082  1001   4000
p[3]   0.6308   0.1041   0.003344  0.4166   0.6356  0.8178  1001   4000
p[4]   0.2477   0.103    0.007281  0.07738  0.2379  0.4728  1001   4000
(etc...)

So, as expected, CrIs are wide, but all contain the true values.

Can compare the results to a frequentist analysis of the same data:

> logistic.regression.or.ci( glm( frac ~ age + sex, family = binomial) )

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -21.85041    4.42511  -4.938 7.90e-07 ***
age           0.34467    0.06857   5.027 4.99e-07 ***
sex           1.36110    0.73336   1.856   0.0635 .

Note the very similar results. Small differences arise from two different places: Frequentist results are not particularly accurate in small sample sizes, and Bayesian results are not that accurate (check the MC error column) because only 4000 iterations were run.

The accuracy of the Bayesian analysis here can be improved by running more iterations. Nothing much can be done for the frequentist analysis except to change to a more “exact” method (beyond the scope of this course).

We will next look at another simple example, but include predictions and odds ratios.

CHD and age example

We have already seen this data set:

Age  CHD    Age  CHD    Age  CHD    Age  CHD
20   0      35   0      44   1      55   1
23   0      35   0      44   1      56   1
24   0      36   0      45   0      56   1
25   0      36   1      45   1      56   1
25   1      36   0      46   0      57   0
26   0      37   0      46   1      57   0
26   0      37   1      47   0      57   1
28   0      37   0      47   0      57   1
28   0      38   0      47   1      57   1
29   0      38   0      48   0      57   1
30   0      39   0      48   1      58   0
30   0      39   1      48   1      58   1
30   0      40   0      49   0      58   1
30   0      40   1      49   0      59   1
30   0      41   0      49   1      59   1
30   1      41   0      50   0      60   0
32   0      42   0      50   1      60   1
32   0      42   0      51   0      61   1
33   0      42   0      52   0      62   1
33   0      42   1      52   1      62   1
34   0      43   0      53   1      63   1
34   0      43   0      53   1      64   0
34   1      43   1      54   1      64   1
34   0      44   0      55   0      65   1
34   0      44   0      55   1      69   1

A WinBUGS program for these data could be:

model {
for (i in 1:n) {

   logit(p[i]) <- alpha + b.age*age[i]   # Linear regression on logit for age

   # Likelihood function for each data point
   CHD[i] ~ dbern(p[i])
}
alpha ~ dnorm(0.0,1.0E-4)   # Prior for intercept
b.age ~ dnorm(0.0,1.0E-4)   # Prior for slope of age

# Now to calculate the odds ratios for various functions of age

# OR per unit change in age

or.age <- exp(b.age)

# OR per decade change in age

or.age10 <- exp(10*b.age)

# OR per five year change in age

or.age5 <- exp(5*b.age)

# We can also make various predictions

# Predict CHD probability for 20 year old

pred.age20 <- exp(alpha + b.age*20)/(1+ exp(alpha + b.age*20))

# Predict CHD probability for 50 year old

pred.age50 <- exp(alpha + b.age*50)/(1+ exp(alpha + b.age*50))

# Predict CHD probability for 70 year old

pred.age70 <- exp(alpha + b.age*70)/(1+ exp(alpha + b.age*70))

}

# Data
list(CHD = c(0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0,
             0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 0,
             0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0,
             1, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0,
             0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1),
     n = 100,
     age = c(20, 23, 24, 25, 25, 26, 26, 28, 28, 29, 30, 30, 30, 30, 30, 30, 32, 32, 33, 33,
             34, 34, 34, 34, 34, 35, 35, 36, 36, 36, 37, 37, 37, 38, 38, 39, 39, 40, 40, 41,
             41, 42, 42, 42, 42, 43, 43, 43, 44, 44, 44, 44, 45, 45, 46, 46, 47, 47, 47, 48,
             48, 48, 49, 49, 49, 50, 50, 51, 52, 52, 53, 53, 54, 55, 55, 55, 56, 56, 56, 57,
             57, 57, 57, 57, 57, 58, 58, 58, 59, 59, 60, 60, 61, 62, 62, 63, 64, 64, 65, 69))

# Inits
list(alpha = 0, b.age=0)

# Results

node        mean     sd       MC error  2.5%     median   97.5%   start  sample
alpha       -5.465   1.166    0.05372   -7.833   -5.426   -3.302  1001   20000
b.age       0.1142   0.02476  0.001142  0.06816  0.1134   0.1646  1001   20000
or.age      1.121    0.02783  0.001285  1.071    1.12     1.179   1001   20000
or.age10    3.232    0.8271   0.03851   1.977    3.108    5.184   1001   20000
or.age5     1.784    0.2238   0.01039   1.406    1.763    2.277   1001   20000
pred.age20  0.0484   0.03161  0.001302  0.01013  0.04084  0.1294  1001   20000
pred.age50  0.5599   0.06235  8.727E-4  0.4388   0.5599   0.6821  1001   20000
pred.age70  0.9143   0.04954  0.001894  0.7902   0.9252   0.9785  1001   20000

# Note ease in getting ORs for any value of age change, and predictions.

# Compare to frequentist results we saw earlier

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  -5.30945    1.13365  -4.683 2.82e-06 ***
age           0.11092    0.02406   4.610 4.02e-06 ***

$intercept.ci
[1] -7.531374 -3.087533

$slopes.ci
[1] 0.06376477 0.15807752

$OR
     age
1.117307

$OR.ci
[1] 1.065842 1.171257

# Note that results are extremely similar, because we ran
# more iterations, and because sample size is larger.

# Can get MC errors smaller by running more iterations...

node    mean    sd       MC error  2.5%     median  97.5%   start  sample
b.age   0.1145  0.02475  5.187E-4  0.06807  0.1137  0.1655  1001   100000
or.age  1.122   0.02782  5.838E-4  1.07     1.12    1.18    1001   100000

# ...but nothing much really changes. Differences due to priors
# and/or frequentist normal approximations, sample size not that
# large still.
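# As a rough check (a sketch, not from the notes), plugging the posterior means
# of alpha and b.age into the inverse logit approximately reproduces the
# prediction reported for a 50 year old:

alpha.hat <- -5.465                          # posterior mean of alpha from above
b.age.hat <- 0.1142                          # posterior mean of b.age from above
exp(alpha.hat + b.age.hat*50)/(1 + exp(alpha.hat + b.age.hat*50))

# This gives about 0.56, close to (but not exactly equal to) the posterior mean
# of pred.age50, since the mean of a nonlinear function of the parameters is not
# the function of the parameter means.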

One more example

Consider the data set below on low birth weights:

Variable                                                   Coding
Low Birth Weight (0 = Birth Weight ≥ 2500g,                low
   1 = Birth Weight < 2500g)
Age of the Mother in Years                                 age
Weight in Pounds at the Last Menstrual Period              lwt
Race (1 = White, 2 = Black, 3 = Other)                     race
Smoking Status During Pregnancy (1 = Yes, 0 = No)          smoke
History of Premature Labor (0 = None, 1 = One, etc.)       ptl
History of Hypertension (1 = Yes, 0 = No)                  ht
Presence of Uterine Irritability (1 = Yes, 0 = No)         ui
Number of Physician Visits During the First Trimester      ftv
   (0 = None, 1 = One, 2 = Two, etc.)
Birth Weight in Grams                                      bwt

Unlike in frequentist analyses, there is no notion of declaring variables as factors (or not) in WinBUGS. This simplifies matters somewhat, but it also means that dummy variables need to be created “manually” before the analysis is run.

The only such variable here is race, which we will need to recode as two separate dummy variables (as is done internally by R when the variable is declared to be a factor).

model {
   for (i in 1:189) {

      logit(p[i]) <- alpha + b.age*age[i] + b.lwt*lwt[i] + b.race2*race2[i] +
                     b.race3*race3[i] + b.smoke*smoke[i] + b.ptl*ptl[i] +
                     b.ht*ht[i] + b.ui*ui[i] + b.ftv*ftv[i]

      # Likelihood function for each data point
      low[i] ~ dbern(p[i])
   }
   alpha ~ dnorm(0.0,1.0E-2)    # Prior for intercept
   b.age ~ dnorm(0.0,1.0E-2)    # Priors for slopes
   b.lwt ~ dnorm(0.0,1.0E-2)
   b.race2 ~ dnorm(0.0,1.0E-2)
   b.race3 ~ dnorm(0.0,1.0E-2)
   b.smoke ~ dnorm(0.0,1.0E-2)
   b.ptl ~ dnorm(0.0,1.0E-2)
   b.ht ~ dnorm(0.0,1.0E-2)
   b.ui ~ dnorm(0.0,1.0E-2)
   b.ftv ~ dnorm(0.0,1.0E-2)

# Now to calculate the odds ratios for various functions of age

# OR per decade change in age

or.age10 <- exp(10*b.age)

# OR for smoke

or.smoke <- exp(b.smoke)

# Predict rate for 40 year old, with lwt, smoker and ht

pred.age20 <- exp(alpha + b.age*40 + b.lwt + b.smoke + b.ht)/(1 + exp(alpha + b.age*40 + b.lwt + b.smoke + b.ht))
}

# Inits
list(alpha = 0, b.age=0, b.lwt=0, b.race2=0, b.race3=0, b.smoke=0, b.ptl=0, b.ht=0)

# Data
low[] age[] lwt[] race2[] race3[] smoke[] ptl[] ht[] ui[] ftv[]
0     19    182   1       0       0       0     0    1    0
0     33    155   0       1       0       0     0    0    1
0     20    105   0       0       1       0     0    0    1
0     21    108   0       0       1       0     0    1    1
0     18    107   0       0       1       0     0    1    0
0     21    124   0       1       0       0     0    0    0
0     22    118   0       0       0       0     0    0    1
0     17    103   0       1       0       0     0    0    1
0     29    123   0       0       1       0     0    0    1
0     26    113   0       0       1       0     0    0    0
...... etc......
END

# Note: Compulsory blank line above after END statement.

# Results

node      mean      sd       MC error  2.5%      median    97.5%     start  sample
alpha     0.7015    1.272    0.06921   -1.659    0.6785    3.368     1001   20000
b.age     -0.02903  0.04015  0.001827  -0.1112   -0.02705  0.0462    1001   20000
b.ftv     -0.07864  0.3802   0.00588   -0.8317   -0.0774   0.6678    1001   20000
b.ht      1.964     0.7269   0.01085   0.5814    1.951     3.447     1001   20000
b.lwt     -0.01705  0.00689  3.053E-4  -0.03086  -0.017    -0.00379  1001   20000
b.ptl     0.5933    0.3651   0.00421   -0.1112   0.5897    1.327     1001   20000
b.race2   1.313     0.5474   0.006877  0.2388    1.308     2.402     1001   20000
b.race3   0.8732    0.4706   0.01131   -0.04335  0.8692    1.822     1001   20000
b.smoke   0.9452    0.4286   0.008723  0.1257    0.942     1.798     1001   20000
b.ui      0.7688    0.4802   0.005109  -0.173    0.7688    1.714     1001   20000
or.age10  0.8096    0.3299   0.01425   0.3288    0.763     1.587     1001   20000
or.smoke  2.825     1.307    0.02651   1.134     2.565     6.035     1001   20000
pred      0.8559    0.163    0.005065  0.3824    0.9182    0.9956    1001   20000
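# One small consistency check (a sketch, not from the notes): because exp() is
# monotone, the credible interval limits for or.smoke are just the exponentials
# of the limits for b.smoke:

exp(c(0.1257, 1.798))    # roughly 1.13 and 6.04, matching the 2.5% and 97.5%
                         # points reported for or.smoke above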

Results are similar to the frequentist results seen previously, with differences mostly owing to some coding changes.