
15. Bayesian Methods

© A. Colin Cameron & Pravin K. Trivedi 2006

These transparencies were prepared in 2003. They can be used as an adjunct to Chapter 13 of our subsequent book Microeconometrics: Methods and Applications, Cambridge University Press, 2005.

Original version of slides: May 2003

Outline

1. Introduction

2. Bayesian Approach

3. Bayesian Analysis of Linear Regression

4. Monte Carlo Integration

5. Markov Chain Monte Carlo Simulation

6. MCMC Example: Gibbs Sampler for SUR

7. Data Augmentation

8. Bayesian Model Selection

9. Practical Considerations

1 Introduction

Bayesian regression has grown greatly since books by  Arnold Zellner (1971) and Leamer (1978).

Controversial. Requires specifying a probabilistic model of prior beliefs about the unknown θ. [Though the role of the prior is negligible in large samples and relatively uninformative priors can be specified.]

Growth due to computational advances. 

In particular, even with an analytically intractable posterior one can use simulation (Monte Carlo) methods to

– estimate posterior moments

– make draws from the posterior.

2 Bayesian Approach

1. Prior π(θ). Uncertainty about the parameters θ is explicitly modelled by the density π(θ).

e.g. θ is an income elasticity and on the basis of an economic model or previous studies it is felt that Pr[0.8 ≤ θ ≤ 1.2] = 0.95. A possible prior is θ ~ N(1, 0.1²).
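This elicitation can be checked in a couple of lines; a minimal Python sketch of the N(1, 0.1²) prior suggested above:

```python
from scipy.stats import norm

# Prior theta ~ N(1, 0.1^2): probability that the income
# elasticity lies in [0.8, 1.2], i.e. within two prior sd of 1.
prior = norm(loc=1.0, scale=0.1)
p = prior.cdf(1.2) - prior.cdf(0.8)
print(round(p, 3))  # 0.954, close to the stated 0.95
```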

2. Sample joint density or likelihood f(y|θ).
Similar to the ML framework. In the single-equation case y is an N × 1 vector and dependence on regressors X is suppressed.

3. Posterior p(θ|y). Obtained by combining prior and sample.

2.1 Bayes Theorem

Bayes' inverse law of probability gives the posterior
p(θ|y) = f(y|θ)π(θ) / f(y),   (1)
where f(y) is the marginal (wrt θ) probability distribution of y:

f(y) = ∫ f(y|θ)π(θ) dθ.   (2)

Proof: use Pr[A|B] = Pr[A∩B]/Pr[B] = Pr[B|A]·Pr[A]/Pr[B].

f(y) in (1) is free of θ, so we can write p(θ|y) as proportional to the product of the pdf and the prior:
p(θ|y) ∝ f(y|θ)π(θ).   (3)

Big difference:

– Frequentist: θ₀ is a constant and the estimator θ̂ is random.

– Bayesian: θ is random.

2.2 Normal-Normal iid Example

1. Sample density f(y|μ).
Assume yᵢ ~ iid N(μ, σ²), i = 1, ..., N, with μ unknown and σ² given.
f(y|μ) = (2πσ²)^(−N/2) exp{−Σᵢ(yᵢ−μ)²/(2σ²)}
∝ exp{−Σᵢ(yᵢ−μ)²/(2σ²)}.

2. Prior π(μ). Suppose μ ~ N(μ̄, σ̄²), where μ̄ and σ̄² are given.
π(μ) = (2πσ̄²)^(−1/2) exp{−(μ−μ̄)²/(2σ̄²)}
∝ exp{−(μ−μ̄)²/(2σ̄²)}.

3. Posterior density p(μ|y).
p(μ|y) ∝ exp{−Σᵢ(yᵢ−μ)²/(2σ²)} × exp{−(μ−μ̄)²/(2σ̄²)}.
After some algebra (completing the square),
p(μ|y) ∝ exp{−½[(μ−μ̄)²/σ̄² + (ȳ−μ)²/(σ²/N)]} ∝ exp{−(μ−μ₁)²/(2σ₁²)},

where
μ₁ = σ₁² (Nȳ/σ² + μ̄/σ̄²),
σ₁² = (N/σ² + 1/σ̄²)⁻¹.
Properties of the posterior:
– Posterior density is μ|y ~ N(μ₁, σ₁²).

– Posterior mean μ₁ is a weighted average of the prior mean μ̄ and the sample average ȳ.

– Posterior precision σ₁⁻² is the sum of the sample precision of ȳ, N/σ², and the prior precision 1/σ̄². [Precision is the reciprocal of the variance.]
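These updating formulas can be verified numerically. A minimal Python sketch, using the parameter values of the worked example below (σ² = 100, μ̄ = 5, σ̄² = 3, N = 50, ȳ = 10):

```python
# Normal-normal updating: posterior precision is the sum of the
# sample and prior precisions; posterior mean is the
# precision-weighted average of ybar and the prior mean.
sigma2, mubar, sigbar2 = 100.0, 5.0, 3.0  # known variance; prior mean/variance
N, ybar = 50, 10.0                        # sample size and sample mean

post_prec = N / sigma2 + 1 / sigbar2      # 0.5 + 0.333... = 0.833...
post_var = 1 / post_prec                  # sigma_1^2 = 1.2
post_mean = post_var * (N * ybar / sigma2 + mubar / sigbar2)  # mu_1
print(post_mean, post_var)                # approximately 8.0 and 1.2
```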

– As N → ∞, μ|y → N(ȳ, σ²/N).
Normal-Normal example with σ² = 100, μ̄ = 5, σ̄² = 3, N = 50 and ȳ = 10.

2.3 Specification of the Prior

Tricky. Not the focus of this talk. 

Prior can be improper yet yield proper posterior. 

A noninformative prior has little impact on the resulting posterior distribution. Use the Jeffreys prior, not a uniform prior, as it is invariant to reparametrization.

For an informative prior, prefer a natural conjugate prior as it yields an analytical posterior ⇒ prior, sample density and posterior are in the same family. e.g. normal-normal, Poisson-gamma.

Hierarchical priors are popular for multilevel models.

2.4 Measures Related to Posterior

Marginal Posterior:
p(θₖ|y) = ∫ p(θ₁, ..., θ_d | y) dθ₁ ·· dθₖ₋₁ dθₖ₊₁ ·· dθ_d.

Posterior Moments: mean/median; standard deviation.

Point Estimation: there is no unknown θ₀ to estimate. Instead find the value of θ that minimizes a loss function.

Posterior Intervals (95%):
Pr[θ_{k,.025} ≤ θₖ ≤ θ_{k,.975} | y] = 0.95.

Hypothesis Testing: classical testing is not relevant. Use Bayes factors instead.

Conditional Posterior Density:
p(θₖ | θⱼ, j ≠ k, y) = p(θ|y) / p(θⱼ, j ≠ k | y).

2.5 Large Sample Behavior of Posterior

Asymptotically the role of the prior disappears.

If there is a true θ₀ then the posterior mode θ̂ (maximum of the posterior) is consistent for it.

Posterior is asymptotically normal:
θ|y ~ᵃ N(θ̂, I(θ̂)⁻¹),   (4)
centered around the posterior mode θ̂, where
I(θ̂) = −[∂² ln p(θ|y) / ∂θ∂θ′] evaluated at θ = θ̂.

Called a Bayesian central limit theorem.

3 Bayesian Linear Regression

Linear regression model:
y | X, β, σ² ~ N(Xβ, σ²I_N).

Different results with noninformative and informative priors. Even within these, results differ according to the setup.

3.1 Noninformative Priors

Jeffreys' priors: π(β|σ²) ∝ c and π(σ²) ∝ 1/σ².
All values of βⱼ are equally likely. Smaller values of σ² are viewed as more likely.
π(β, σ²) ∝ 1/σ².

Posterior density after some algebra:
p(β, σ² | y, X)
∝ (σ²)^(−K/2) exp{−(β−β̂)′(X′X)(β−β̂)/(2σ²)}
× (σ²)^(−[(N−K)/2+1]) exp{−(N−K)s²/(2σ²)}.

Conditional posterior p(β | σ², y, X) is N(β̂_OLS, σ²(X′X)⁻¹).
Marginal posterior p(β | y, X) (integrate out σ²) is multivariate t, centered at β̂ with N−K degrees of freedom and variance s²(N−K)(X′X)⁻¹/(N−K−2).
Marginal posterior p(σ² | y, X) is inverse gamma.

Qualitatively similar to frequentist analysis in finite samples.

Interpretation is quite different.

E.g. the Bayesian 95 percent posterior interval for βⱼ is
β̂ⱼ ± t_{.025,N−K} se[β̂ⱼ]

– interpreted as: βⱼ lies in this interval with probability 0.95

– not that, if we had many samples and constructed many such intervals, 95 percent of them would contain the true βⱼ₀.

3.2 Informative Priors

Use conjugate priors:
– Prior for β | σ² is N(β₀, σ²A₀⁻¹), with A₀ the prior precision matrix.
– Prior for σ² is inverse-gamma.

Posterior after much algebra is, writing h = 1/σ²,
p(β, h | y, X) ∝ h^((ν₀+N)/2−1) exp{−h ν₁s₁/2} × h^(K/2) exp{−h(β−β₁)′Σ₁⁻¹(β−β₁)/2},
where
β₁ = (A₀ + X′X)⁻¹ (A₀β₀ + X′X β̂),
Σ₁ = (A₀ + X′X)⁻¹,
ν₁s₁ = ν₀s₀ + û′û + (β₀ − β̂)′[A₀⁻¹ + (X′X)⁻¹]⁻¹(β₀ − β̂),
with A₀ the prior precision matrix of β given σ².
Conditional posterior p(β | σ², y, X) is N(β₁, σ²Σ₁).

Marginal posterior p(β | y, X) (integrate out σ²) is multivariate t, centered at β₁.

Here β₁ is a matrix-weighted average of β̂_OLS and the prior mean β₀, and the posterior precision is the sum of the prior and sample precisions.

4 Monte Carlo Integration

Compute key posterior moments without first obtaining the posterior distribution.

Want E[m(θ)|y], where the expectation is wrt the posterior density p(θ|y). For notational convenience suppress y.

So wish to compute
E[m(θ)] = ∫ m(θ) p(θ) dθ.   (5)

Need a numerical estimate of an integral:
– Numerical quadrature: too hard.
– Direct Monte Carlo with draws from p(θ): not possible.
– Instead use importance sampling.

4.1 Importance Sampling

Rewrite
E[m(θ)] = ∫ m(θ) p(θ) dθ = ∫ [m(θ)p(θ)/g(θ)] g(θ) dθ,
where g(θ) > 0 is a known density with the same support as p(θ).

The corresponding Monte Carlo integral estimate is
Ê[m(θ)] = (1/S) Σ_{s=1}^S m(θˢ)p(θˢ)/g(θˢ),   (6)
where θˢ, s = 1, ..., S, are S draws of θ from g(θ), not p(θ).

To apply this to the posterior we also need to account for the constant of integration in the denominator of (1). Let p^ker(θ) = f(y|θ)π(θ) be the posterior kernel.

Then the posterior density is
p(θ) = p^ker(θ) / ∫ p^ker(θ) dθ,
with posterior moment
E[m(θ)] = ∫ m(θ) p^ker(θ) dθ / ∫ p^ker(θ) dθ
= ∫ [m(θ) p^ker(θ)/g(θ)] g(θ) dθ / ∫ [p^ker(θ)/g(θ)] g(θ) dθ.
The importance sampling-based estimate is then
Ê[m(θ)] = [(1/S) Σ_{s=1}^S m(θˢ) p^ker(θˢ)/g(θˢ)] / [(1/S) Σ_{s=1}^S p^ker(θˢ)/g(θˢ)],   (7)
where θˢ, s = 1, ..., S, are S draws of θ from the importance sampling density g(θ).
The method was proposed by Kloek and van Dijk (1978).

Geweke (1989) established consistency and asymptotic normality as S → ∞ if
– E[m(θ)] < ∞, so the posterior moment exists
– ∫ p(θ) dθ = 1, so the posterior density is proper. May require ∫ π(θ) dθ < ∞.
– g(θ) > 0 over the support of p(θ)

– g(θ) should have thicker tails than p(θ), to ensure that the importance weight w(θ) = p(θ)/g(θ) remains bounded. e.g. use a multivariate t.
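A concrete sketch of estimator (7) in Python, on a toy posterior kernel chosen for illustration (one observation y = 2 with y|θ ~ N(θ, 1) and prior θ ~ N(0, 100), so the true posterior mean is 2/1.01 ≈ 1.980), with a heavier-tailed t importance density as suggested above:

```python
import numpy as np
from scipy.stats import norm, t

rng = np.random.default_rng(0)

# Posterior kernel p_ker(theta) = f(y|theta) * pi(theta), toy model
# assumed for illustration: y = 2, y|theta ~ N(theta, 1), theta ~ N(0, 100).
y = 2.0
def p_ker(theta):
    return norm.pdf(y, loc=theta, scale=1.0) * norm.pdf(theta, loc=0.0, scale=10.0)

# Importance density g: a t(3) centered near the posterior mode, with
# thicker tails than the posterior so the weights stay bounded.
g = t(df=3, loc=2.0, scale=2.0)

S = 200_000
draws = g.rvs(size=S, random_state=rng)
w = p_ker(draws) / g.pdf(draws)            # importance weights p_ker/g
post_mean = np.sum(w * draws) / np.sum(w)  # ratio estimator, as in (7)
print(post_mean)  # close to the exact posterior mean 2/1.01
```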

The importance sampling method can be used to estimate many quantities, including the mean, standard deviation and percentiles of the posterior.

5 Markov Chain Monte Carlo Simulation

If we can make S draws from the posterior, E[m(θ)] can be estimated by (1/S) Σ_s m(θˢ).
But it is hard to make draws if there is no tractable closed-form expression for the posterior density.

Instead make sequential draws that, if the sequence is run long enough, converge to a stationary distribution that coincides with the posterior density p(θ).

Called Markov chain Monte Carlo, as it involves  simulation (Monte Carlo) and the sequence is that of a Markov chain.

Note that the draws are correlated.

5.1 Markov Chains

A Markov chain is a sequence of random variables xₙ (n = 0, 1, 2, ...) with

Pr[xₙ₊₁ = x | xₙ, xₙ₋₁, ..., x₀] = Pr[xₙ₊₁ = x | xₙ],
so that the distribution of xₙ₊₁ given the past is completely determined by the preceding value xₙ.

Transition probabilities are
t_{xy} = Pr[xₙ₊₁ = y | xₙ = x].

For a finite-state Markov chain with m states, form an m × m transition matrix T.

Then for a transition from x to y in n steps (stages) the transition probability is given by Tⁿ, the n-times matrix product of T.
The rows t⁽ⁿ⁾ of the matrix Tⁿ give the marginal distribution across the m states at the nth stage.

The chain is said to yield a stationary distribution or invariant distribution t̄(x) if

Σ_{x∈A} t̄(x) T_{x,y} = t̄(y)   ∀ y ∈ A.

For the Bayesian application the chain of interest is θ⁽ⁿ⁾, not xₙ.

We want the chain θ⁽ⁿ⁾:
(1) to converge to a stationary distribution, and
(2) this stationary distribution to be the desired posterior.

5.2 Gibbs Sampler

Easy to describe and implement. 

Let θ = [θ₁′ θ₂′]′ have posterior density p(θ) = p(θ₁, θ₂).

Suppose we know p(θ₁|θ₂) and p(θ₂|θ₁).

Then alternating sequential draws from p(θ₁|θ₂) and p(θ₂|θ₁) in the limit converge to draws from p(θ₁, θ₂).

5.2.1 Gibbs Sampler Example

Let y = (y₁, y₂) ~ N(μ, Σ),
where μ = (μ₁, μ₂)′ and Σ has diagonal entries 1 and off-diagonals ρ.

Then given a uniform prior for μ the posterior is
μ | y ~ N(ȳ, N⁻¹Σ).

So the conditional posterior distributions are
μ₁ | μ₂, y ~ N(ȳ₁ + ρ(μ₂ − ȳ₂), (1−ρ²)/N),
μ₂ | μ₁, y ~ N(ȳ₂ + ρ(μ₁ − ȳ₁), (1−ρ²)/N).

Can iteratively sample from each conditional normal  distribution using updated values of 1 and 2.
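A minimal Python sketch of this two-block Gibbs sampler, with illustrative values ȳ = (0, 0), ρ = 0.5 and N = 10 (assumed here, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

# Gibbs sampler for mu | y ~ N(ybar, Sigma/N), alternating between
# the two normal full conditionals given above.
ybar1, ybar2, rho, N = 0.0, 0.0, 0.5, 10   # illustrative (assumed) values
sd = np.sqrt((1 - rho**2) / N)             # conditional standard deviation

S, burn = 20_000, 1_000
mu1, mu2 = 5.0, -5.0                       # deliberately bad starting values
draws = np.empty((S, 2))
for s in range(S):
    mu1 = ybar1 + rho * (mu2 - ybar2) + sd * rng.standard_normal()
    mu2 = ybar2 + rho * (mu1 - ybar1) + sd * rng.standard_normal()
    draws[s] = (mu1, mu2)
draws = draws[burn:]                       # discard burn-in

print(draws.mean(axis=0))                  # close to ybar = (0, 0)
print(draws.var(axis=0))                   # close to 1/N = 0.1
```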

If the chain is run long enough then it will converge to the bivariate normal.

5.2.2 Gibbs Sampler

More generally, suppose θ is partitioned into d blocks, e.g. θ = [β′ σ²]′ in a linear regression example.

Let θₖ be the kth block and θ₋ₖ denote all components of θ aside from θₖ.

Assume the full conditional distributions p(θₖ | θ₋ₖ), k = 1, ..., d, are known.

Then sequential sampling from the full conditionals can be set up as follows.
1. Let the initial values of θ be θ⁽⁰⁾ = (θ₁⁽⁰⁾, ..., θ_d⁽⁰⁾).

2. The next iteration involves sequentially revising all components of θ to yield θ⁽¹⁾ = (θ₁⁽¹⁾, ..., θ_d⁽¹⁾), generated using d draws from the d conditional distributions as follows:
θ₁⁽¹⁾ ~ p(θ₁ | θ₂⁽⁰⁾, ..., θ_d⁽⁰⁾)
θ₂⁽¹⁾ ~ p(θ₂ | θ₁⁽¹⁾, θ₃⁽⁰⁾, ..., θ_d⁽⁰⁾)
...
θ_d⁽¹⁾ ~ p(θ_d | θ₁⁽¹⁾, θ₂⁽¹⁾, ..., θ_{d−1}⁽¹⁾).

3. Return to step one, reinitialize the vector θ at θ⁽¹⁾ and cycle through step 2 again to obtain the new draw θ⁽²⁾. Repeat the steps until convergence is achieved.
Geman and Geman (1984) showed that the stochastic sequence {θ⁽ⁿ⁾} is a Markov chain with the correct stationary distribution. See also Tanner and Wong (1987) and Gelfand and Smith (1990).

These results do not tell us how many cycles are needed for convergence, which is model dependent.

It is very important to ensure that a sufficient number of cycles are executed for the chain to converge. Discard the earliest results from the chain, the so-called 'burn-in' phase. Diagnostic tests are available.

5.3 Metropolis

The Gibbs sampler is the best known MCMC algorithm.

Limited applicability as it requires direct sampling  from the full conditional distributions which may not be known.

Two extensions that allow MCMC to be applied more generally are the Metropolis algorithm and the Metropolis-Hastings algorithm.

In applying MCMC we use a sequence of approximating posterior distributions; these are called transition distributions or transition kernels or proposal densities.

Use the notation Jₙ(θ⁽ⁿ⁾ | θ⁽ⁿ⁻¹⁾), which emphasizes that the transition distribution varies with n.
1. Draw a starting point θ⁽⁰⁾ from an initial approximation to the posterior for which p(θ⁽⁰⁾) > 0. e.g. draw from a multivariate t-distribution centered on the posterior mode.

2. Set n = 1. Draw θ* from a symmetric jumping distribution J₁(θ⁽¹⁾ | θ⁽⁰⁾),
i.e. for any arbitrary pair (θᵃ, θᵇ), Jₙ(θᵃ | θᵇ) = Jₙ(θᵇ | θᵃ),
e.g. θ⁽¹⁾ | θ⁽⁰⁾ ~ N(θ⁽⁰⁾, V) for some fixed V.

3. Calculate the ratio of densities r = p(θ*)/p(θ⁽⁰⁾).

4. Set

θ⁽¹⁾ = θ*   with probability min(r, 1)
= θ⁽⁰⁾   with probability 1 − min(r, 1).

5. Return to step 2, increase the counter and repeat the steps.
Can view this as an iterative method to maximize p(θ):
if θ* increases p(θ) then θ⁽ⁿ⁾ = θ* always;
if θ* decreases p(θ) then θ⁽ⁿ⁾ = θ* with probability r.
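The five steps above amount to a short loop. A Python sketch for a scalar θ, taking as target an (assumed, purely illustrative) standard normal kernel known only up to its normalizing constant:

```python
import numpy as np

rng = np.random.default_rng(2)

# Random-walk Metropolis. Target p(theta) known only up to a constant:
# here an assumed standard normal kernel exp(-theta^2/2).
def p_kernel(theta):
    return np.exp(-0.5 * theta**2)

S, scale = 50_000, 2.4            # jump scale roughly 2.4 for q = 1
theta = 3.0                       # arbitrary starting point with p > 0
draws = np.empty(S)
for s in range(S):
    prop = theta + scale * rng.standard_normal()  # symmetric jump
    r = p_kernel(prop) / p_kernel(theta)          # ratio of kernel densities
    if rng.uniform() < min(r, 1.0):               # accept w.p. min(r, 1)
        theta = prop
    draws[s] = theta                              # else keep the old value

print(draws[1000:].mean(), draws[1000:].var())    # near 0 and 1 after burn-in
```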

Similar in spirit to accept-reject sampling, but with no requirement that a fixed multiple of the jumping distribution always covers the posterior.

Metropolis generates a Markov chain with properties  of reversibility, irreducibility and Harris recurrence that ensure convergence to a stationary distribution.

To see that the Metropolis stationary distribution is the desired posterior p(θ), proceed as follows. Let θᵃ and θᵇ be points such that p(θᵇ) ≥ p(θᵃ).

(n 1) (n) If  = a and  =  then  =  with   b b certainty and (n) (n 1) Pr[ =  ;  = a] = Jn( a)p(a): b bj

If the order is reversed, with θ⁽ⁿ⁻¹⁾ = θᵇ and θ* = θᵃ, then θ⁽ⁿ⁾ = θᵃ with probability r = p(θᵃ)/p(θᵇ), and

Pr[θ⁽ⁿ⁾ = θᵃ, θ⁽ⁿ⁻¹⁾ = θᵇ] = Jₙ(θᵃ | θᵇ) p(θᵇ) × p(θᵃ)/p(θᵇ)
= Jₙ(θᵃ | θᵇ) p(θᵃ)
= Jₙ(θᵇ | θᵃ) p(θᵃ),
as the jumping distribution is symmetric.

Symmetric joint distribution
⇒ marginal distributions of θ⁽ⁿ⁾ and θ⁽ⁿ⁻¹⁾ are the same
⇒ p(θ) is the stationary distribution.

5.4 Metropolis-Hastings (M-H) Algorithm

The Metropolis-Hastings (M-H) algorithm is the  same as the Metropolis algorithm, except that in step 2 the jumping distribution need not be symmetric.

Then in step 3 the acceptance probability is
rₙ = [p(θ*) / Jₙ(θ* | θ⁽ⁿ⁻¹⁾)] / [p(θ⁽ⁿ⁻¹⁾) / Jₙ(θ⁽ⁿ⁻¹⁾ | θ*)]
= [p(θ*) Jₙ(θ⁽ⁿ⁻¹⁾ | θ*)] / [p(θ⁽ⁿ⁻¹⁾) Jₙ(θ* | θ⁽ⁿ⁻¹⁾)].

Any normalizing constants present in either p(·) or Jₙ(·) cancel in rₙ. So both posterior and jump probabilities need only be computed up to this constant.

5.5 M-H Examples

Different jumping distributions lead to different M-H algorithms.

The Gibbs sampler is a special case of M-H.
If θ is partitioned into d blocks, then there are d Metropolis steps at the nth step of the algorithm. The jumping distribution is the conditional distribution given in subsection 5.2, and the acceptance probability is always 1. Gibbs sampling is also called alternating conditional sampling.

Mixed strategies can be used.  e.g. an M-H step combined with a Gibbs sampler.

The independence chain makes all draws from a fixed density g(θ*). A random walk chain sets the draw θ* = θ⁽ⁿ⁻¹⁾ + ε, where ε is a draw from g(ε).

Gelman et al. (1995, p.334) consider θ ~ N(μ, Σ). For Metropolis with

θ* | θ⁽ⁿ⁻¹⁾ ~ N(θ⁽ⁿ⁻¹⁾, c²Σ),
c ≈ 2.4/√q leads to greatest efficiency relative to direct draws from the q-variate normal. The efficiency is about 0.3, compared to 1/q for the Gibbs sampler for Σ = σ²I_q.

6 Gibbs Sampler for SUR

Two-equation example with ith observation
y₁ᵢ = β₁₁ + β₁₂x₁ᵢ + ε₁ᵢ
y₂ᵢ = β₂₁ + β₂₂x₂ᵢ + ε₂ᵢ,

" 0   1i ;  = 11 12 : " "2i #  N "" 0 # " 21 22 ##

Assume independent informative priors, with
β ~ N(β₀, B₀),
Σ⁻¹ ~ Wishart(ν₀, D₀⁻¹).

Some algebra yields the conditional posteriors
β | Σ, y, X ~ N(C₀(B₀⁻¹β₀ + Σᵢ₌₁ᴺ xᵢ′Σ⁻¹yᵢ), C₀),
Σ⁻¹ | β, y, X ~ Wishart(ν₀ + N, (D₀ + Σᵢ₌₁ᴺ εᵢεᵢ′)⁻¹),
where C₀ = (B₀⁻¹ + Σᵢ₌₁ᴺ xᵢ′Σ⁻¹xᵢ)⁻¹.
The Gibbs sampler can be used since the conditionals are known.
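A Python sketch of this two-block Gibbs sampler on simulated data; the design follows the simulation described below, and scipy's Wishart sampler is used for the Σ⁻¹ block (the Wishart parameterization here, with scale (D₀ + Σεᵢεᵢ′)⁻¹, is one conventional reading of the slide's notation):

```python
import numpy as np
from scipy.stats import wishart

rng = np.random.default_rng(3)

# Simulated two-equation SUR data, design as in the simulation below.
N = 1000
x1, x2 = rng.standard_normal(N), rng.standard_normal(N)
Sig_true = np.array([[1.0, 0.5], [0.5, 1.0]])
eps = rng.multivariate_normal([0, 0], Sig_true, size=N)
y1 = 1 + 1 * x1 + eps[:, 0]
y2 = 1 + 1 * x2 + eps[:, 1]
X1 = np.column_stack([np.ones(N), x1])   # regressors, equation 1
X2 = np.column_stack([np.ones(N), x2])   # regressors, equation 2

# Priors: beta ~ N(0, tau*I) with tau = 10; Sigma^{-1} ~ Wishart(nu0, D0^{-1}).
B0inv, beta0 = np.eye(4) / 10.0, np.zeros(4)
nu0, D0 = 5, np.eye(2)

W = np.linalg.inv(Sig_true)              # initial value for Sigma^{-1}
keep = []
for s in range(3000):
    # beta | Sigma: exploit the block structure of sum_i x_i' W x_i.
    A = np.block([[W[0, 0] * X1.T @ X1, W[0, 1] * X1.T @ X2],
                  [W[1, 0] * X2.T @ X1, W[1, 1] * X2.T @ X2]])
    b = np.concatenate([X1.T @ (W[0, 0] * y1 + W[0, 1] * y2),
                        X2.T @ (W[1, 0] * y1 + W[1, 1] * y2)])
    C = np.linalg.inv(B0inv + A)
    beta = rng.multivariate_normal(C @ (B0inv @ beta0 + b), C)
    # Sigma^{-1} | beta: Wishart with updated dof and scale matrix.
    E = np.column_stack([y1 - X1 @ beta[:2], y2 - X2 @ beta[2:]])
    W = wishart(df=nu0 + N, scale=np.linalg.inv(D0 + E.T @ E)).rvs(random_state=rng)
    if s >= 500:                         # discard burn-in
        keep.append(np.concatenate([beta, np.linalg.inv(W)[[0, 0, 1], [0, 1, 1]]]))

post = np.array(keep).mean(axis=0)
print(post)  # betas near 1; (sig11, sig12, sig22) near (1, 0.5, 1)
```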

Simulation: N = 1000 or N = 10000;
x₁ᵢ ~ N(0, 1) and x₂ᵢ ~ N(0, 1);
β₁₁ = β₁₂ = β₂₁ = β₂₂ = 1;
σ₁₁ = σ₂₂ = 1, σ₁₂ = 0.5.
Priors: β₀ = 0, B₀ = τI (with τ = 10, 1, 0.1), D₀ = I and ν₀ = 5.

The Gibbs sampler samples recursively from the conditional posteriors. Reject the first 5000 replications: the "burn-in". Use subsequent 50000 and 100000 replications.

Table 1 reports the mean and standard deviation of the marginal posterior distribution of the 7 parameters.
First three columns: not sensitive to different values of τ.

Fourth column vs. first shows that doubling the number of reps has very little effect.

Fifth column vs. first shows that increasing the sample size ten-fold to N = 10000 has relatively small impact on point estimates, though precision is much higher.

When the number of reps is small (≈ 1000) the autocorrelation coefficients of the parameters are found to be as high as 0.06. When the number of reps is ≈ 50000, serial correlation is much lower (< 0.01).

7 Data Augmentation

The Gibbs sampler can sometimes be applied to a wider range of models by introduction of auxiliary variables.

In particular, this is the case for models involving  latent variables, such as discrete choice, truncated and censored models.

Observe only y = g(y*) for given g(·) and latent dependent variable y*. e.g. probit / logit have y = 1(y* > 0).

Data augmentation replaces the latent y* by imputed values and treats these as observed data.
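A sketch of data augmentation for the probit case y = 1(y* > 0): alternate between drawing the latent y* from truncated normals and drawing β from its normal full conditional. The simulated design and the flat prior on β are assumptions made purely for illustration:

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(4)

# Simulated probit data: y = 1(y* > 0), y* = b0 + b1*x + e, e ~ N(0,1).
# Illustrative design with true (b0, b1) = (0.5, 1).
N = 1000
x = rng.standard_normal(N)
X = np.column_stack([np.ones(N), x])
y = ((0.5 + 1.0 * x + rng.standard_normal(N)) > 0).astype(float)

XtX_inv = np.linalg.inv(X.T @ X)          # flat prior on beta (assumed)
beta = np.zeros(2)
keep = []
for s in range(1000):
    # 1. Draw latent y* | beta, y from normals truncated at 0:
    #    left-truncated when y = 1, right-truncated when y = 0.
    mu = X @ beta
    lo = np.where(y == 1, -mu, -np.inf)   # standardized lower bound
    hi = np.where(y == 1, np.inf, -mu)    # standardized upper bound
    ystar = mu + truncnorm.rvs(lo, hi, size=N, random_state=rng)
    # 2. Draw beta | y* from its normal full conditional (flat prior).
    bhat = XtX_inv @ X.T @ ystar
    beta = rng.multivariate_normal(bhat, XtX_inv)
    if s >= 250:                          # discard burn-in
        keep.append(beta)

print(np.array(keep).mean(axis=0))        # posterior mean near (0.5, 1.0)
```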

The essential insight, due to Tanner and Wong (1987), is that the posterior based only on observed data is intractable, but that obtained after data augmentation is often tractable using the Gibbs sampler.

8 Bayesian Model Selection

Method uses Bayes factors. 

Two hypotheses under consideration:
– H₁ and H₂, possibly non-nested.
– Prior probabilities Pr[H₁] and Pr[H₂].
– Sample dgp's Pr[y|H₁] and Pr[y|H₂].

Posterior probabilities by Bayes Theorem:
Pr[Hₖ|y] = Pr[y|Hₖ] Pr[Hₖ] / (Pr[y|H₁] Pr[H₁] + Pr[y|H₂] Pr[H₂]).

The posterior odds ratio is
Pr[H₁|y] / Pr[H₂|y] = (Pr[y|H₁] Pr[H₁]) / (Pr[y|H₂] Pr[H₂]) = B₁₂ × Pr[H₁]/Pr[H₂],
where B₁₂ = Pr[y|H₁] / Pr[y|H₂] is called the Bayes factor.
Hypothesis 1 is preferred if the posterior odds ratio > 1.

Bayes factor = posterior odds in favor of H₁ if Pr[H₁] = Pr[H₂].

The Bayes factor has the form of a likelihood ratio. But it depends on unknown parameters θₖ, eliminated by integrating over the parameter space wrt the prior, so

Pr[y|Hₖ] = ∫ Pr[y|θₖ, Hₖ] π(θₖ|Hₖ) dθₖ.
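A toy Bayes-factor computation for two priors on a normal mean (all numerical values assumed for illustration), comparing brute-force integration of the full likelihood times the prior against the analytic marginal likelihood; note the likelihood constants are kept inside the integral:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

rng = np.random.default_rng(5)

# Toy data: y_i ~ N(theta, sigma^2), sample mean forced to ybar exactly
# so the analytic check below applies without simulation error.
sigma2, N, ybar = 1.0, 5, 0.3
y = rng.normal(ybar, np.sqrt(sigma2), N)
y = y - y.mean() + ybar

def marg_lik(mu0, tau2):
    # Pr[y|H] = integral of the FULL likelihood (constants included)
    # times the prior N(mu0, tau2); constants matter for Bayes factors.
    def integrand(theta):
        return np.exp(np.sum(norm.logpdf(y, theta, np.sqrt(sigma2)))) \
               * norm.pdf(theta, mu0, np.sqrt(tau2))
    return quad(integrand, -6, 6, points=[0.0, 1.0, 2.0])[0]

B12 = marg_lik(0.0, 1.0) / marg_lik(2.0, 1.0)   # H1: prior N(0,1); H2: N(2,1)

# Analytic check: data-dependent constants cancel in the ratio, and
# Pr[y|H] is proportional to N(ybar | mu0, sigma^2/N + tau^2).
s2 = sigma2 / N + 1.0
B12_exact = norm.pdf(ybar, 0.0, np.sqrt(s2)) / norm.pdf(ybar, 2.0, np.sqrt(s2))
print(B12, B12_exact)  # the two agree
```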

This expression depends upon all the constants that appear in the likelihood. These constants can be neglected when evaluating the posterior, but are required for the computation of the Bayes factor.

9 Practical Considerations

The WinBUGS package (Bayesian inference Using Gibbs Sampling) is especially useful for hierarchical models and missing data problems.

For more complicated models use Matlab or Gauss. 

Practical issue of how long to run the chain.
Diagnostic checks for convergence are available, but often do not have universal applicability.
Graphs of output for scalar parameters from the Markov chain are a visually attractive way of confirming convergence, but more formal approaches are available (Geweke, 1992).
Gelman and Rubin (1992) use multiple (parallel) Gibbs samplers, each beginning with different starting values, to see if different chains converge to the same posterior distribution.
Zellner and Min (1995) propose several convergence criteria that can be used if the posterior can be written explicitly.

10 Bibliography

Useful books include Gamerman (1997); Gelman, Carlin, Stern and Rubin (1995); Gill (2002); and Koop (2003); plus older texts by Zellner (1971) and Leamer (1978).

Numerous papers by Chib and his collaborators, and Geweke and his collaborators, cover many topics of interest in microeconometrics. See Chib and Greenberg (1996), Chib (2000) and Geweke and Keane (2000).

Albert, J.H. (1988), “Computational Methods Using a Bayesian Hierarchical Model”, Journal of American Statistical Association, 83, 1037-1045.

Casella, G. and E. George (1992), “Explaining the Gibbs Sampler”, The American Statistician, 46, 167-174.

Chib, S. (2000), “Markov Chain Monte Carlo Methods: Computation and Inference”, chapter 57 in J.J. Heckman and E.E. Leamer, Editors, Handbook of Econometrics Volume 5, 3570-3649.

Chib, S., and E. Greenberg (1995), “Understanding the Metropolis-Hastings Algorithm”, The American Statistician, 49, 4, 327-335.

Chib, S., and E. Greenberg (1996), “Markov Chain Monte Carlo Simulation Method in Econometrics”, Econometric Theory, 12, 409-431.

Gamerman, D. (1997), Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, London: Chapman and Hall.

Gelfand, A.E. and A.F.M. Smith (1990), “Sampling Based Approaches to Calculating Marginal Densities”, Journal of American Statistical Association, 85, 398-409.

Gelman, A., J.B. Carlin, H.S. Stern and D.B. Rubin (1995), Bayesian Data Analysis, London: Chapman and Hall.

Gelman, A., and D.B. Rubin (1992), “Inference from Iterative Simulations Using Multiple Sequences”, Statistical Science, 7, 457-511.

Geman, S. and D. Geman (1984), “Stochastic Relaxation, Gibbs Distributions and Bayesian Restoration of Images”, IEEE Transactions on Pattern Analysis and Machine Intelligence, 6, 721-741.

Geweke, J. (1989), “Bayesian Inference in Econometric Models Using Monte Carlo Integration”, Econometrica, 57, 1317-1339.

Geweke, J. (1992), “Evaluating the Accuracy of Sampling-based Approaches to the Calculation of Posterior Moments (with discussion)”, in J. Bernardo, J. Berger, A.P. Dawid, and A.F.M. Smith, Editors, Bayesian Statistics 4, 169-193, Oxford: Oxford University Press.

Geweke, J. and M. Keane (2000), “Computationally Intensive Methods for Integration in Econometrics”, chapter 56 in Heckman, J.J. and E.E. Leamer, Editors, Handbook of Econometrics Volume 5, 3463-3567.

Gill, J. (2002), Bayesian Methods: A Social and Behavioral Sciences Approach, Boca Raton (FL): Chapman and Hall.

Hastings, W.K. (1970), “Monte Carlo Sampling Methods Using Markov Chains and Their Applications”, Biometrika, 57, 97-109.

Kass, R.E. and A.E. Raftery (1995), “Bayes Factors”, Journal of American Statistical Association, 90, 773-795.

Kloek, T. and H.K. van Dijk (1978), “Bayesian Estimates of Equation System Parameters: An Application of Integration by Monte Carlo”, Econometrica, 46, 1-19.

Koop, G. (2003), Bayesian Econometrics, Wiley.

Leamer, E.E. (1978), Specification Searches: Ad Hoc Inference with Nonexperimental Data, New York: John Wiley.

Robert, C.P., and G. Casella (1999), Monte Carlo Methods, New York: Springer-Verlag.

Tanner, M.A., and W.H. Wong (1987), “The Calculation of Posterior Distributions by Data Augmentation”, Journal of American Statistical Association, 82, 528-549.

Zellner, A. (1971), An Introduction to Bayesian Inference in Econo- metrics, New York: John Wiley.

Zellner, A. (1978), “Jeffreys-Bayes Posterior Odds Ratio and the Akaike Information Criterion for Discriminating Between Models”, Economics Letters, 1, 337-342.

Zellner, A., and C-k. Min (1995), “Gibbs Sampler Convergence Criteria”, Journal of American Statistical Association, 90, 921-927.

Table 1: Mean and standard deviation of the posterior distribution of a two-equation SUR model calculated by Gibbs sampling.

          τ = 10    τ = 1     τ = 1/10  τ = 10    τ = 10
N         1000      1000      1000      1000      10000
reps      50000     50000     50000     100000    100000
β11       0.971     1.013     0.983     1.020     1.010
          (0.0310)  (0.0312)  (0.0316)  (0.0324)  (0.0100)
β12       1.026     0.9835    1.006     1.006     1.015
          (0.0265)  (0.0271)  (0.0265)  (0.0268)  (0.0086)
β21       1.016     0.972     0.993     1.017     0.991
          (0.0309)  (0.0325)  (0.0322)  (0.0326)  (0.0100)
β22       0.983     0.992     0.979     1.005     1.007
          (0.0256)  (0.0285)  (0.0272)  (0.0277)  (0.0085)
σ11       0.960     0.969     1.012     1.043     1.010
          (0.0429)  (0.0434)  (0.0453)  (0.0466)  (0.0143)
σ12       -0.499    -0.507    -0.519    -0.576    -0.515
          (0.0340)  (0.0358)  (0.0368)  (0.0379)  (0.0113)
σ22       0.950     1.066     1.049     1.062     1.002
          (0.425)   (0.0476)  (0.0467)  (0.0472)  (0.0141)