Intermediate Bayes 2016
Kerrie Mengersen, QUT Brisbane, ACEMS, 2016
Collaborative Centre for Data Analysis, Modelling and Computation, QUT, GPO Box 2434, Brisbane 4001, Australia

Course Outline
1. Foundations
2. Linear and hierarchical modelling
3. Computational methods
4. Software
5. Case study
6. Latent variable models
7. Spatial models
8. Bayesian networks

Acknowledgement to Dr Clair Alston, Griffith University, Australia, for some of these course notes.

1. Foundations

A Bayesian and a Frequentist were to be executed. The judge asked them what their last wishes were. The Bayesian replied that he would like to give the Frequentist one more lecture. The judge granted the Bayesian's wish and then turned to the Frequentist for his last wish. The Frequentist quickly responded that he wished to hear the lecture again and again and again and again... (Xiao-Li Meng)

p(θ|y) = p(y|θ) p(θ) / p(y)

A short history: Bayes (1763) and "probability theory", Laplace (1812) and "inverse probability", then Boole, Venn, Fisher, Neyman and Jeffreys, through to Geman & Geman (1980s), Gelfand & Smith (1990s) and today's "Bayesian analysis" (2000s).

Recall Bayes' Rule
p(A|B) = p(B|A) p(A) / p(B)
This follows because
p(A|B) = p(A and B) / p(B)
p(B|A) = p(B and A) / p(A)
p(A and B) = p(B and A)
p(B and A) = p(B|A) p(A)
so p(A|B) = p(B|A) p(A) / p(B).

Bayes' Theorem
p(A|B) = p(B|A) p(A) / p(B)
Think of A = θ (unknown parameters, etc.) and B = y (known/observed 'data').
So: p(θ|y) = p(y|θ) p(θ) / p(y)

The Reverend Thomas Bayes (1701-61) studied how to compute a distribution for the probability parameter of a binomial distribution (in modern terminology).

Bayesian Modelling

Frequentist approach to modelling
We have some data y, and want to know about θ given y.
θ can be unknown parameters, missing data, latent variables, etc.
Eg 1: sample y "successes" from n trials. What is Pr(success), θ?
Eg 2: sample y from N(θ, 1). What is the population mean θ?
Frequentist: estimate θ through the likelihood p(y|θ).
How likely is y for specified values of θ?
Eg: prob. of observing y if y ~ Bin(n, θ=0.3) or y ~ N(θ=1, 1).
Solved using moment estimators or maximum likelihood.
But we really want to know about p(θ|y).

Bayesian approach to modelling
Treat θ as random, give it a prior p(θ), and base inference on the posterior p(θ|y) ∝ p(y|θ) p(θ).

Example: Estimating a proportion
• From an ecologist: I want to know where koalas might be present. I surveyed 29 sites and 22 have koalas. What is the probability that a koala will be present at a different site in the same area, given this information?
• From a clinician: I want to know about the safety of a medical procedure. I treated 29 patients and 22 survived. What is the probability of survival, given this information?
• What is unobserved? θ = probability of success (presence of koalas, survival).

Binomial model
Likelihood: y ~ Binomial(θ, n)
Prior: θ ~ Beta(a, b)
DAG: a, b → θ; θ, n → y
Posterior: θ | y ~ Beta(a + y, b + n − y)

Your turn
Binomial example with 22 successes out of 29 trials. Consider the following priors for θ: Beta(1,1), Beta(9,1), Beta(100,100). Choose one of these priors:
1. What is the prior mean for θ?
2. What is the posterior distribution for θ?
3. What is the posterior mean for θ?
4. What general conclusions can you make about the influence of priors and sample size?
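Before turning to the answers below, the hand calculations can be checked numerically. A minimal R sketch (the helper function beta_binomial_update is ours, not part of the course code):

# Posterior for a Beta(a, b) prior combined with y successes out of n Binomial trials
beta_binomial_update <- function(y, n, a, b) {
  post_a <- a + y          # posterior shape 1
  post_b <- b + n - y      # posterior shape 2
  c(prior_mean = a / (a + b),
    post_a = post_a,
    post_b = post_b,
    post_mean = post_a / (post_a + post_b))
}

beta_binomial_update(y = 22, n = 29, a = 1, b = 1)      # Beta(1,1) prior
beta_binomial_update(y = 22, n = 29, a = 9, b = 1)      # Beta(9,1) prior
beta_binomial_update(y = 22, n = 29, a = 100, b = 100)  # Beta(100,100) prior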
Answers
Sample proportion = 22/29 = 0.76
Beta(1,1): prior mean = 1/(1+1) = 0.5; posterior mean = (22+1)/(22+1+7+1) = 23/31 = 0.74
Beta(9,1): prior mean = 9/(9+1) = 0.90; posterior mean = (22+9)/(22+9+7+1) = 31/39 = 0.79
Beta(100,100): prior mean = 100/(100+100) = 0.5; posterior mean = (22+100)/(22+100+7+100) = 122/229 = 0.53

Your turn
Simulate and plot the density for the likelihood and these sets of priors and posteriors.

Sample code
# calculate the likelihood
# p(y=22|theta) = Bin(n=29, theta)
y=22
n=29
theta=c(0,0.001,seq(0.01,0.99,0.01),0.999,1)
lik=dbinom(x=y,size=n,prob=theta)
lik=lik/sum(lik)
plot(theta,lik,type="l",ylab="prob",ylim=c(0,0.2))
# calculate the prior
# p(theta) = Beta(1,1); change to Beta(9,1), Beta(100,100) later
a1=1; b1=1
prior1=dbeta(theta,a1,b1)
prior1=prior1/sum(prior1)
lines(theta,prior1,col=2,lty=2)
# calculate the corresponding posterior
post1=dbeta(theta,y+a1,n-y+b1)
post1=post1/sum(post1)
lines(theta,post1,col=2,lty=1)

Influences on posterior
• The posterior mean is a compromise between the prior mean and the data.
• The stronger the prior, the more weight the prior has in the posterior.
• The larger the sample size, the more weight the likelihood has in the posterior.

Conjugate priors
• It might be reasonable to expect the posterior distribution to be of the same form as the prior distribution. This is the principle of conjugacy.
• A conjugate prior for a Binomial likelihood is a Beta distribution: the posterior is then also a Beta distribution.

Dynamic Updating
If we obtain more data, we do not have to redo all of the analysis: our posterior from the first analysis simply becomes our prior for the next analysis.
Binomial example:
Stage 0. Prior p(θ) ~ Beta(1,1); i.e. E(θ) = 0.5.
Stage 1. Observe y = 22 presences from 29 sites. Likelihood: p(y|θ) ~ Bin(n=29, θ). Posterior: p(θ|y) ~ Beta(23,8); i.e. E(θ|y) = 0.74.
Stage 2. Observe 5 more presences from 10 sites. Likelihood: p(y|θ) ~ Bin(n=10, θ). Prior: p(θ) ~ Beta(23,8). Posterior: p(θ|y) ~ Beta(28,13); i.e. E(θ|y) = 0.68.

Your turn
Confirm that the dynamic updating method described in the previous slide gives the same outcome as analysing all of the data together.
Data:
Prior:
Posterior:
Posterior mean:

Example: Estimating a normal mean
n observations y = (y1, ..., yn) from a normal distribution with unknown mean μ and known variance σ².

Normal model
Normal model, unknown mean and unknown variance. Possible priors for the variance:
σ² ~ Inverse Gamma(a, b)
σ ~ Uniform(a, b)
σ ~ Half Cauchy(a, b)
What do these 'look like'?

Linear regression
Linear regression: priors
Linear regression: posterior

Model Comparison
• Bayes factors, posterior odds, BIC, DIC
• Reversible jump MCMC, birth and death MCMC
• Model averaging

Bayes factors
• Consider models M1 and M2 (not necessarily nested).
• Choose a model based on its posterior probability given the data. This is proportional to the prior probability of the model multiplied by the likelihood of the model given the data. So we consider: p(M2|y) ∝ p(M2) p(y|M2).

Bayes factors
To compare M2 versus M1:
p(M2|y) / p(M1|y) = {p(M2) / p(M1)} × {p(y|M2) / p(y|M1)}
• The second term (the ratio of marginal likelihoods) is termed the Bayes factor B21. This is similar to a likelihood ratio, but p(y|M) is integrated over the parameters instead of maximised: eg, p(y|M1) = ∫ p(y|M1, θ1) p(θ1) dθ1.
• 2 log(B21) is on the same scale as the usual deviance and likelihood ratio statistics.
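As a concrete illustration (our own, not from the slides), the marginal likelihoods in B21 are available in closed form for the binomial example above, so two prior specifications can be compared as two "models"; the resulting 2 log(B21) can then be read against the guidelines in the table below.

# Marginal likelihood of y successes in n trials under a Beta(a, b) prior:
#   p(y | M) = choose(n, y) * B(y + a, n - y + b) / B(a, b)
marg_lik <- function(y, n, a, b) choose(n, y) * beta(y + a, n - y + b) / beta(a, b)

y <- 22; n <- 29
# M1: theta ~ Beta(1, 1);  M2: theta ~ Beta(9, 1)  (an illustrative pair of 'models')
B21 <- marg_lik(y, n, a = 9, b = 1) / marg_lik(y, n, a = 1, b = 1)
B21             # Bayes factor for M2 versus M1
2 * log(B21)    # deviance scale, for the guidelines below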
Guidelines for Bayes Factors (arbitrary!)
B21         2log(B21)   Interpretation
< 1         negative    Supports M1
1 to 3      0 to 2      Weak support for M2
3 to 20     2 to 6      Supports M2
20 to 150   6 to 10     Strong evidence for M2
> 150       > 10        Very strong support for M2

Bayesian Information Criterion (BIC)
• Approximates the Bayes factor.
• Under some assumptions, if p is the dimension of the model and n is the number of observations:
BIC = log p(y|θ*, M) − (p/2) log n
• For a linear regression model with k parameters, −2 × BIC can be rewritten (up to an additive constant) as n log(1 − R²) + k log(n).

Discussion of BIC
• BIC penalises models which improve fit at the expense of more parameters (it encourages parsimony).
• A problem is that the true dimensionality (number of parameters p) of the model is often not known, and the number of parameters may increase with sample size n.
• Can approximate using the effective number of parameters (Spiegelhalter et al., 1999).
• Alternatives are DIC (deviance information criterion, calculated in WinBUGS), conditional posterior predictive probabilities, etc.

Markov chain Monte Carlo
• "Decompose" the joint posterior distribution into a sequence of conditional distributions; these are often much simpler (eg, simple univariate normals).
• Simulate from each conditional distribution in turn, using a simulation method that forms a Markov chain (so that each new simulated value relies only on the previous value). This gives a set of simulated values θ^(1), θ^(2), ..., θ^(i), ... which converges to the required distribution, so the resulting simulations come from the required joint posterior.
• We can use Markov chain theory to make statements about the behaviour and convergence of the chain.

Computational Algorithms
• Gibbs sampling: sample from the conditionals themselves.
• Metropolis-Hastings: sample from an "easy" distribution and accept those values that conform to the conditional distribution.
• Lots of variations: reversible jump, slice sampling, particle filters, perfect sampling, adaptive rejection sampling, etc.
• Need to ensure conditions such as detailed balance and reversibility.
• Approximations: Variational Bayes (VB), Approximate Bayesian Computation (ABC), Sequential Monte Carlo (SMC).

Gibbs sampling
Suppose we have a joint posterior p(θ1, θ2 | y, ...).
0. Choose starting values θ1^(0), θ2^(0).
1. At the ith iteration:
   sample θ1^(i) from p(θ1 | θ2^(i-1), y, ...)
   sample θ2^(i) from p(θ2 | θ1^(i), y, ...)
2. Repeat step 1 many times.
3. Make inferences based on the simulated values.

Exercise
Data:   yi | λ ~ Poisson(λ), i = 1, ..., m
        yi | φ ~ Poisson(φ), i = m+1, ..., n
Priors: λ ~ Gamma(a, b)
        φ ~ Gamma(c, d)
        m discrete over {1, ..., n}
a, b, c, d are known constants.
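One way to approach this exercise: given m, the Gamma priors are conditionally conjugate, and the full conditional for m is a discrete distribution over {1, ..., n}, so the Gibbs sampler described above applies directly. A sketch in R; the simulated data, starting values and variable names (a0, b0, c0, d0 for the constants a, b, c, d) are our own, and this is one possible implementation rather than the course solution:

# Full conditionals (Gamma(shape, rate) parameterisation):
#   lambda | y, m ~ Gamma(a + y1 + ... + ym,        b + m)
#   phi    | y, m ~ Gamma(c + y(m+1) + ... + yn,    d + n - m)
#   p(m = k | y, lambda, phi) proportional to
#       lambda^Sk * exp(-k*lambda) * phi^(Sn - Sk) * exp(-(n-k)*phi),  Sk = y1 + ... + yk
set.seed(1)
n <- 50; true_m <- 30
y <- c(rpois(true_m, 3), rpois(n - true_m, 8))   # simulated data with a change at true_m

a0 <- 1; b0 <- 1; c0 <- 1; d0 <- 1               # the known constants a, b, c, d
n_iter <- 5000
lambda <- 1; phi <- 1; m <- n %/% 2              # starting values
draws <- matrix(NA, n_iter, 3, dimnames = list(NULL, c("lambda", "phi", "m")))

S <- cumsum(y); total <- S[n]
for (i in 1:n_iter) {
  lambda <- rgamma(1, shape = a0 + S[m],         rate = b0 + m)
  phi    <- rgamma(1, shape = c0 + total - S[m], rate = d0 + n - m)
  logw   <- S * log(lambda) - (1:n) * lambda +
            (total - S) * log(phi) - (n - (1:n)) * phi
  m      <- sample(1:n, 1, prob = exp(logw - max(logw)))   # discrete conditional for m
  draws[i, ] <- c(lambda, phi, m)
}
colMeans(draws[-(1:1000), ])   # posterior means after discarding burn-in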