15. Bayesian Methods
© A. Colin Cameron & Pravin K. Trivedi 2006. These transparencies were prepared in 2003. They can be used as an adjunct to Chapter 13 of our subsequent book Microeconometrics: Methods and Applications, Cambridge University Press, 2005. Original version of slides: May 2003.

Outline
1. Introduction
2. Bayesian Approach
3. Bayesian Analysis of Linear Regression
4. Monte Carlo Integration
5. Markov Chain Monte Carlo Simulation
6. MCMC Example: Gibbs Sampler for SUR
7. Data Augmentation
8. Bayesian Model Selection
9. Practical Considerations

1 Introduction

Bayesian regression has grown greatly since the books by Arnold Zellner (1971) and Leamer (1978).

Controversial: it requires specifying a probabilistic model of prior beliefs about the unknown parameters. [Though the role of the prior is negligible in large samples, and relatively uninformative priors can be specified.]

Growth is due to computational advances. In particular, despite an analytically intractable posterior one can use simulation (Monte Carlo) methods to
– estimate posterior moments
– make draws from the posterior.

2 Bayesian Approach

1. Prior $\pi(\theta)$
Uncertainty about the parameters is explicitly modelled by the density $\pi(\theta)$.
e.g. $\theta$ is an income elasticity and on the basis of an economic model or previous studies it is felt that $\Pr[0.8 \le \theta \le 1.2] = 0.95$. A possible prior is $\mathcal{N}[1, 0.1^2]$.

2. Sample joint density or likelihood $f(y \mid \theta)$
Similar to the ML framework. In the single-equation case $y$ is an $N \times 1$ vector and dependence on regressors $X$ is suppressed.

3. Posterior $p(\theta \mid y)$
Obtained by combining the prior and the sample.

2.1 Bayes Theorem

Bayes' inverse law of probability gives the posterior
$p(\theta \mid y) = \dfrac{f(y \mid \theta)\,\pi(\theta)}{f(y)}$,   (1)
where $f(y)$ is the marginal (with respect to $\theta$) probability distribution of $y$,
$f(y) = \int f(y \mid \theta)\,\pi(\theta)\,d\theta$.   (2)

Proof: use $\Pr[A \mid B] = \dfrac{\Pr[A \cap B]}{\Pr[B]} = \dfrac{\Pr[B \mid A]\,\Pr[A]}{\Pr[B]}$.

$f(y)$ in (1) is free of $\theta$, so we can write $p(\theta \mid y)$ as proportional to the product of the pdf and the prior:
$p(\theta \mid y) \propto f(y \mid \theta)\,\pi(\theta)$.   (3)

Big difference:
– Frequentist: $\theta_0$ is a constant and $\hat{\theta}$ is random.
– Bayesian: $\theta$ is random.

2.2 Normal-Normal iid Example

1. Sample density $f(y \mid \mu)$
Assume $y_i \sim$ iid $\mathcal{N}[\mu, \sigma^2]$ with $\mu$ unknown and $\sigma^2$ given.
$f(y \mid \mu) = (2\pi\sigma^2)^{-N/2} \exp\left\{ -\sum_{i=1}^{N} (y_i - \mu)^2 / 2\sigma^2 \right\} \propto \exp\left\{ -\dfrac{N}{2\sigma^2}(\bar{y} - \mu)^2 \right\}$

2. Prior $\pi(\mu)$
Suppose $\mu \sim \mathcal{N}[\bar{\mu}, \tau^2]$, where $\bar{\mu}$ and $\tau^2$ are given.
$\pi(\mu) = (2\pi\tau^2)^{-1/2} \exp\left\{ -(\mu - \bar{\mu})^2 / 2\tau^2 \right\} \propto \exp\left\{ -\dfrac{1}{2\tau^2}(\mu - \bar{\mu})^2 \right\}$

3. Posterior density $p(\mu \mid y)$
$p(\mu \mid y) \propto \exp\left\{ -\dfrac{N}{2\sigma^2}(\bar{y} - \mu)^2 \right\} \exp\left\{ -\dfrac{1}{2\tau^2}(\mu - \bar{\mu})^2 \right\}$
After some algebra (completing the square),
$p(\mu \mid y) \propto \exp\left\{ -\dfrac{1}{2}\left[ \dfrac{(\mu - \bar{\mu})^2}{\tau^2} + \dfrac{(\bar{y} - \mu)^2}{\sigma^2/N} \right] \right\} \propto \exp\left\{ -\dfrac{(\mu - \mu_1)^2}{2\tau_1^2} \right\}$,
where
$\mu_1 = \tau_1^2 \left( N\bar{y}/\sigma^2 + \bar{\mu}/\tau^2 \right)$,
$\tau_1^{-2} = N/\sigma^2 + 1/\tau^2$.

Properties of the posterior:
– The posterior density is $\mu \mid y \sim \mathcal{N}[\mu_1, \tau_1^2]$.
– The posterior mean $\mu_1$ is a weighted average of the prior mean $\bar{\mu}$ and the sample average $\bar{y}$.
– The posterior precision $\tau_1^{-2}$ is the sum of the sample precision of $\bar{y}$, $N/\sigma^2$, and the prior precision $1/\tau^2$. [Precision is the reciprocal of the variance.]
– As $N \to \infty$, $\mu \mid y \to \mathcal{N}[\bar{y}, \sigma^2/N]$.

Normal-Normal example with $\sigma^2 = 100$, $\bar{\mu} = 5$, $\tau^2 = 3$, $N = 50$ and $\bar{y} = 10$.
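As a quick numerical check of these formulas, here is a minimal Python sketch (not part of the original slides; the variable names are mine) that plugs in the example's numbers.

import numpy as np

# Normal-Normal conjugate update: y_i ~ N(mu, sigma2) with sigma2 known,
# prior mu ~ N(prior_mean, prior_var). Numbers from the example above.
sigma2, prior_mean, prior_var = 100.0, 5.0, 3.0
N, ybar = 50, 10.0

# Posterior precision = sample precision of ybar + prior precision
post_prec = N / sigma2 + 1.0 / prior_var
post_var = 1.0 / post_prec
# Posterior mean = precision-weighted average of ybar and the prior mean
post_mean = post_var * (N * ybar / sigma2 + prior_mean / prior_var)

print(post_mean, post_var)   # approximately 8.0 and 1.2

With these numbers the sample information dominates the prior, so the posterior mean (8.0) lies closer to $\bar{y} = 10$ than to the prior mean 5.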
2.3 Specification of the Prior

Tricky. Not the focus of this talk.

A prior can be improper yet yield a proper posterior.

A noninformative prior has little impact on the resulting posterior distribution. Use a Jeffreys prior, not a uniform prior, as it is invariant to reparametrization.

For an informative prior, prefer a natural conjugate prior as it yields an analytical posterior. Exponential family prior and density ⇒ exponential family posterior, e.g. normal-normal, Poisson-gamma.

Hierarchical priors are popular for multilevel models.

2.4 Measures Related to Posterior

Marginal posterior: $p(\theta_k \mid y) = \int p(\theta_1, \ldots, \theta_d \mid y)\, d\theta_1 \cdots d\theta_{k-1}\, d\theta_{k+1} \cdots d\theta_d$.

Posterior moments: mean/median; standard deviation.

Point estimation: there is no unknown $\theta_0$ to estimate. Instead find the value of $\theta$ that minimizes a loss function.

Posterior intervals (95%): $\Pr\left[ \theta_{k,.025} \le \theta_k \le \theta_{k,.975} \mid y \right] = 0.95$.

Hypothesis testing: not relevant. Bayes factors.

Conditional posterior density: $p(\theta_k \mid \theta_j,\, j \ne k,\, y) = p(\theta \mid y) / p(\theta_j,\, j \ne k \mid y)$.

2.5 Large Sample Behavior of Posterior

Asymptotically the role of the prior disappears.

If there is a true $\theta_0$ then the posterior mode $\hat{\theta}$ (the maximum of the posterior) is consistent for it.

The posterior is asymptotically normal,
$\theta \mid y \overset{a}{\sim} \mathcal{N}\left[ \hat{\theta}, \mathcal{I}(\hat{\theta})^{-1} \right]$,   (4)
centered around the posterior mode, where
$\mathcal{I}(\hat{\theta}) = -\left[ \dfrac{\partial^2 \ln p(\theta \mid y)}{\partial\theta\,\partial\theta'} \right]_{\theta = \hat{\theta}}$.
Called a Bayesian central limit theorem.

3 Bayesian Linear Regression

Linear regression model: $y \mid X, \beta, \sigma^2 \sim \mathcal{N}[X\beta, \sigma^2 I_N]$.

Different results are obtained with noninformative and informative priors. Even within these, results differ according to the setup.

3.1 Noninformative Priors

Jeffreys' priors: $\pi(\beta) \propto c$ and $\pi(\sigma^2) \propto 1/\sigma^2$.
All values of $\beta_j$ are equally likely. Smaller values of $\sigma^2$ are viewed as more likely.
$\pi(\beta, \sigma^2) \propto 1/\sigma^2$.

The posterior density after some algebra is
$p(\beta, \sigma^2 \mid y, X) \propto \left( \dfrac{1}{\sigma^2} \right)^{K/2} \exp\left\{ -\dfrac{1}{2\sigma^2} (\beta - \hat{\beta})'(X'X)(\beta - \hat{\beta}) \right\} \times \left( \dfrac{1}{\sigma^2} \right)^{(N-K)/2+1} \exp\left\{ -\dfrac{(N-K)s^2}{2\sigma^2} \right\}$.

The conditional posterior $p(\beta \mid \sigma^2, y, X)$ is $\mathcal{N}[\hat{\beta}_{OLS}, \sigma^2 (X'X)^{-1}]$.

The marginal posterior $p(\beta \mid y, X)$ (integrate out $\sigma^2$) is a multivariate t-distribution centered at $\hat{\beta}$ with $N-K$ degrees of freedom and variance $s^2 (N-K)(X'X)^{-1} / (N-K-2)$.

The marginal posterior $p(\sigma^2 \mid y, X)$ is inverse gamma.

Qualitatively similar to frequentist analysis in finite samples, but the interpretation is quite different. E.g. a Bayesian 95 percent posterior interval for $\beta_j$ is $\hat{\beta}_j \pm t_{.025, N-K}\, \mathrm{se}[\hat{\beta}_j]$
– which means that $\beta_j$ lies in this interval with posterior probability 0.95,
– not that if we had many samples and constructed many such intervals, 95 percent of them would contain the true $\beta_{j0}$.

3.2 Informative Priors

Use conjugate priors:
– Prior for $\beta \mid \sigma^2$ is $\mathcal{N}[\bar{\beta}_0, \sigma^2 \Sigma_0]$.
– Prior for $\sigma^2$ is inverse-gamma.

The posterior after much algebra is
$p(\beta, 1/\sigma^2 \mid y, X) \propto \left( \dfrac{1}{\sigma^2} \right)^{(\nu_0+N)/2 - 1} \exp\left\{ -\dfrac{s_1}{2\sigma^2} \right\} \times \left( \dfrac{1}{\sigma^2} \right)^{K/2} \exp\left\{ -\dfrac{1}{2\sigma^2} (\beta - \bar{\beta}_1)' \Sigma_1^{-1} (\beta - \bar{\beta}_1) \right\}$,
where
$\bar{\beta}_1 = \left( \Sigma_0^{-1} + X'X \right)^{-1} \left( \Sigma_0^{-1} \bar{\beta}_0 + X'X \hat{\beta} \right)$,
$\Sigma_1^{-1} = \Sigma_0^{-1} + X'X$,
$s_1 = s_0 + \hat{u}'\hat{u} + (\bar{\beta}_0 - \hat{\beta})' \left[ \Sigma_0 + (X'X)^{-1} \right]^{-1} (\bar{\beta}_0 - \hat{\beta})$.

The conditional posterior $p(\beta \mid \sigma^2, y, X)$ is $\mathcal{N}[\bar{\beta}_1, \sigma^2 \Sigma_1]$.

The marginal posterior $p(\beta \mid y, X)$ (integrate out $\sigma^2$) is a multivariate t-distribution centered at $\bar{\beta}_1$. Here $\bar{\beta}_1$ is a (matrix-weighted) average of $\hat{\beta}_{OLS}$ and the prior mean $\bar{\beta}_0$, and the posterior precision is the sum of the prior and sample precisions.
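To illustrate the conjugate-prior result in 3.2, the following sketch (my own, not from the slides) computes the conditional posterior mean $\bar{\beta}_1$ for simulated data; the simulated design, prior values and variable names are all assumptions chosen for illustration.

import numpy as np

rng = np.random.default_rng(0)

# Simulated regression data (illustrative only)
N, K = 50, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
beta_true = np.array([1.0, 2.0, -0.5])
sigma2 = 4.0
y = X @ beta_true + rng.normal(scale=np.sqrt(sigma2), size=N)

# Conjugate prior: beta | sigma2 ~ N[beta0_bar, sigma2 * Sigma0]
beta0_bar = np.zeros(K)
Sigma0 = 10.0 * np.eye(K)

# OLS estimate
XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ y)

# Posterior: beta | sigma2, y, X ~ N[beta1_bar, sigma2 * Sigma1]
Sigma1 = np.linalg.inv(np.linalg.inv(Sigma0) + XtX)
beta1_bar = Sigma1 @ (np.linalg.inv(Sigma0) @ beta0_bar + XtX @ beta_hat)

print("OLS:      ", beta_hat)
print("Posterior:", beta1_bar)   # shrunk toward the prior mean beta0_bar

Because this $\Sigma_0$ is fairly diffuse, $\bar{\beta}_1$ stays close to the OLS estimate; shrinking $\Sigma_0$ pulls it toward $\bar{\beta}_0$.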
4 Monte Carlo Integration

Compute key posterior moments without first obtaining the posterior distribution.

We want $E[m(\theta) \mid y]$, where the expectation is with respect to the posterior density $p(\theta \mid y)$. For notational convenience suppress $y$. So we wish to compute
$E[m(\theta)] = \int m(\theta)\, p(\theta)\, d\theta$.   (5)

We need a numerical estimate of an integral:
- Numerical quadrature is too hard.
- Direct Monte Carlo with draws from $p(\theta)$ is not possible.
- Instead use importance sampling.

4.1 Importance Sampling

Rewrite
$E[m(\theta)] = \int m(\theta)\, p(\theta)\, d\theta = \int \left( \dfrac{m(\theta)\, p(\theta)}{g(\theta)} \right) g(\theta)\, d\theta$,
where $g(\theta) > 0$ is a known density with the same support as $p(\theta)$.

The corresponding Monte Carlo integral estimate is
$\hat{E}[m(\theta)] = \dfrac{1}{S} \sum_{s=1}^{S} \dfrac{m(\theta^s)\, p(\theta^s)}{g(\theta^s)}$,   (6)
where $\theta^s$, $s = 1, \ldots, S$, are $S$ draws of $\theta$ from $g(\theta)$, not $p(\theta)$.

To apply this to the posterior we also need to account for the constant of integration in the denominator of (1). Let $p^{\mathrm{ker}}(\theta) = f(y \mid \theta)\, \pi(\theta)$ be the posterior kernel. Then the posterior density is
$p(\theta) = \dfrac{p^{\mathrm{ker}}(\theta)}{\int p^{\mathrm{ker}}(\theta)\, d\theta}$,
with posterior moment
$E[m(\theta)] = \int m(\theta) \left( \dfrac{p^{\mathrm{ker}}(\theta)}{\int p^{\mathrm{ker}}(\theta)\, d\theta} \right) d\theta = \dfrac{\int m(\theta)\, p^{\mathrm{ker}}(\theta)\, d\theta}{\int p^{\mathrm{ker}}(\theta)\, d\theta} = \dfrac{\int \left[ m(\theta)\, p^{\mathrm{ker}}(\theta)/g(\theta) \right] g(\theta)\, d\theta}{\int \left[ p^{\mathrm{ker}}(\theta)/g(\theta) \right] g(\theta)\, d\theta}$.

The importance sampling-based estimate is then
$\hat{E}[m(\theta)] = \dfrac{\frac{1}{S} \sum_{s=1}^{S} m(\theta^s)\, p^{\mathrm{ker}}(\theta^s)/g(\theta^s)}{\frac{1}{S} \sum_{s=1}^{S} p^{\mathrm{ker}}(\theta^s)/g(\theta^s)}$,   (7)
where $\theta^s$, $s = 1, \ldots, S$, are $S$ draws of $\theta$ from the importance sampling density $g(\theta)$.

The method was proposed by Kloek and van Dijk (1978). Geweke (1989) established consistency and asymptotic normality as $S \to \infty$ if
– $E[m(\theta)] < \infty$, so the posterior moment exists;
– $\int p(\theta)\, d\theta = 1$, so the posterior density is proper. This may require $\int \pi(\theta)\, d\theta < \infty$;
– $g(\theta) > 0$ over the support of $p(\theta)$;
– $g(\theta)$ has thicker tails than $p(\theta)$, to ensure that the importance weight $w(\theta) = p(\theta)/g(\theta)$ remains bounded, e.g. use a multivariate t.

The importance sampling method can be used to estimate many quantities, including the mean, standard deviation and percentiles of the posterior.

5 Markov Chain Monte Carlo Simulation

If we can make $S$ draws from the posterior, $E[m(\theta)]$ can be estimated by $S^{-1} \sum_s m(\theta^s)$.

But it is hard to make draws if there is no tractable closed-form expression for the posterior density.

Instead make sequential draws that, if the sequence is run long enough, converge to a stationary distribution that coincides with the posterior density $p(\theta)$.

This is called Markov chain Monte Carlo, as it involves simulation (Monte Carlo) and the sequence is that of a Markov chain. Note that the draws are correlated.

5.1 Markov Chains

A Markov chain is a sequence of random variables $x_n$ ($n = 0, 1, 2, \ldots$) with
$\Pr[x_{n+1} = x \mid x_n, x_{n-1}, \ldots, x_0] = \Pr[x_{n+1} = x \mid x_n]$,
so that the distribution of $x_{n+1}$ given the past is completely determined by the preceding value $x_n$ alone.
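The slides take up a Gibbs sampler for SUR in the next section; as a self-contained illustration of a Markov chain whose stationary distribution is the target density, here is a sketch (my own, not from the slides; the correlation value and names such as n_draws are assumptions) of a Gibbs sampler for a bivariate normal using its standard full conditionals.

import numpy as np

# Gibbs sampler for (x1, x2) ~ bivariate normal with means 0, variances 1,
# and correlation rho. Each draw depends only on the previous draw (the Markov
# property); the chain's stationary distribution is the target bivariate normal.
rho = 0.8
n_draws, burn_in = 10_000, 1_000
rng = np.random.default_rng(0)

draws = np.zeros((n_draws, 2))
x1, x2 = 0.0, 0.0                      # arbitrary starting values
cond_sd = np.sqrt(1.0 - rho**2)        # sd of each full conditional

for s in range(n_draws):
    # Draw x1 | x2 ~ N(rho*x2, 1-rho^2), then x2 | x1 ~ N(rho*x1, 1-rho^2)
    x1 = rng.normal(rho * x2, cond_sd)
    x2 = rng.normal(rho * x1, cond_sd)
    draws[s] = (x1, x2)

kept = draws[burn_in:]                 # discard burn-in draws
print(kept.mean(axis=0))               # close to (0, 0)
print(np.corrcoef(kept.T)[0, 1])       # close to rho = 0.8

Each new pair depends only on the current pair, which is exactly the Markov property above; discarding the early (burn-in) draws lets the chain approach its stationary distribution before moments are computed.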