Prior vs Likelihood vs Posterior
Posterior Predictive Distribution
Poisson Data
Statistics 220, Spring 2005
Copyright © 2005 by Mark E. Irwin

Choosing the Likelihood Model

While much thought is put into thinking about priors in a Bayesian analysis, the data (likelihood) model can have a big effect. Choices that need to be made involve:

- independence vs. exchangeability vs. more complex dependence
- tail size, e.g. normal vs. $t_{df}$
- probability of events

Example: Probability of God's Existence

Two different analyses, both using the prior $P[\text{God}] = P[\text{No God}] = 0.5$.

Likelihood ratio components:

$$D_i = \frac{P[\text{Data}_i \mid \text{God}]}{P[\text{Data}_i \mid \text{No God}]}$$

    Evidence (Data_i)                        D_i (Unwin)   D_i (Shermer)
    Recognition of goodness                  10            0.5
    Existence of moral evil                  0.5           0.1
    Existence of natural evil                0.1           0.1
    Intranatural miracles (prayers)          2             1
    Extranatural miracles (resurrection)     1             0.5
    Religious experiences                    2             0.1

Posterior probabilities $P[\text{God} \mid \text{Data}]$:

- Unwin: 2/3
- Shermer: 0.00025

So even starting from the same prior, the different beliefs about what the data say give quite different posterior probabilities.

This is based on an analysis published in a July 2004 Scientific American article. (Available on the course web site on the Articles page.) Stephen D. Unwin is a risk management consultant who has done work in physics on quantum gravity. He is the author of the book The Probability of God. Michael Shermer is the publisher of Skeptic and a regular contributor to Scientific American as author of the column Skeptic.

Note that in the article, Bayes' rule is presented as

$$P[\text{God} \mid \text{Data}] = \frac{P[\text{God}]\, D}{P[\text{God}]\, D + P[\text{No God}]}$$

where $D$ is the product of the $D_i$. See if you can show that this is equivalent to the usual version of Bayes' rule, under the assumption that the components of the data model are independent.

Prior vs Likelihood vs Posterior

The posterior distribution can be seen as a compromise between the prior and the data. In general, this can be seen from the two well-known relationships

$$E[\theta] = E\big[E[\theta \mid y]\big] \qquad (1)$$

$$\mathrm{Var}(\theta) = E[\mathrm{Var}(\theta \mid y)] + \mathrm{Var}(E[\theta \mid y]) \qquad (2)$$

The first equation says that the prior mean is the average of all possible posterior means (averaged over all possible data sets). The second says that the posterior variance is, on average, smaller than the prior variance; the size of the difference depends on the variability of the posterior means.

This can be exhibited more precisely with examples.

Binomial model, conjugate prior:

$$\pi \sim \mathrm{Beta}(a, b), \qquad y \mid \pi \sim \mathrm{Bin}(n, \pi), \qquad \pi \mid y \sim \mathrm{Beta}(a + y,\, b + n - y)$$

Prior mean: $E[\pi] = \dfrac{a}{a + b} = \tilde\pi$. MLE: $\hat\pi = \dfrac{y}{n}$.

The posterior mean then satisfies

$$E[\pi \mid y] = \frac{a + b}{a + b + n}\,\tilde\pi + \frac{n}{a + b + n}\,\hat\pi = \bar\pi,$$

a weighted average of the prior mean and the sample proportion (MLE).

Prior variance and posterior variance:

$$\mathrm{Var}(\pi) = \frac{\tilde\pi(1 - \tilde\pi)}{a + b + 1}, \qquad \mathrm{Var}(\pi \mid y) = \frac{\bar\pi(1 - \bar\pi)}{a + b + n + 1}$$

So if $n$ is large enough, the posterior variance will be smaller than the prior variance.
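To make the weighted-average identity and the variance formulas concrete, here is a minimal numerical sketch (Python with numpy; the variable names are mine, and the values are the $a = 4$, $b = 3$, $n = 5$, $y = 2$ case used in the plots below). It simply evaluates the closed-form expressions from the notes:

```python
import numpy as np

# Beta(a, b) prior, Binomial(n, pi) likelihood  ->  Beta(a + y, b + n - y) posterior
a, b = 4.0, 3.0      # prior parameters
n, y = 5, 2          # data: y successes in n trials

a_post, b_post = a + y, b + n - y

prior_mean = a / (a + b)                    # pi_tilde
mle        = y / n                          # pi_hat
post_mean  = a_post / (a_post + b_post)     # pi_bar

# Posterior mean as a weighted average of prior mean and MLE
w_prior = (a + b) / (a + b + n)
w_data  = n / (a + b + n)
print(post_mean, w_prior * prior_mean + w_data * mle)    # both 0.5

# Prior and posterior variances, using the formulas above
prior_var = prior_mean * (1 - prior_mean) / (a + b + 1)
post_var  = post_mean * (1 - post_mean) / (a + b + n + 1)
print(prior_var, post_var)                               # about 0.0306 and 0.0192
```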
Normal model, conjugate prior, fixed variance:

$$\theta \sim N(\mu_0, \tau_0^2), \qquad y_i \mid \theta \overset{iid}{\sim} N(\theta, \sigma^2), \; i = 1, \dots, n, \qquad \theta \mid y \sim N(\mu_n, \tau_n^2)$$

Then

$$\mu_n = \frac{\dfrac{1}{\tau_0^2}\,\mu_0 + \dfrac{n}{\sigma^2}\,\bar y}{\dfrac{1}{\tau_0^2} + \dfrac{n}{\sigma^2}} \qquad \text{and} \qquad \frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2}$$

So the posterior mean is a weighted average of the prior mean and the sample mean of the data.

The posterior mean can be written in two other ways:

$$\mu_n = \mu_0 + (\bar y - \mu_0)\,\frac{\tau_0^2}{\dfrac{\sigma^2}{n} + \tau_0^2} = \bar y - (\bar y - \mu_0)\,\frac{\dfrac{\sigma^2}{n}}{\dfrac{\sigma^2}{n} + \tau_0^2}$$

The first form shows $\mu_n$ as the prior mean adjusted towards the sample average of the data; the second shows the sample average shrunk towards the prior mean.

In most problems the posterior mean can be thought of as a shrinkage estimator, where the estimate based on the data alone is shrunk toward the prior mean. For more general problems the shrinkage may not have such a nice closed form.

In this example the posterior variance is never bigger than the prior variance, since

$$\frac{1}{\tau_n^2} = \frac{1}{\tau_0^2} + \frac{n}{\sigma^2} \ge \frac{1}{\tau_0^2} \qquad \text{and} \qquad \frac{1}{\tau_n^2} \ge \frac{n}{\sigma^2}$$

The first part of this can be read as

    Posterior Precision = Prior Precision + Data Precision

The first inequality gives $\tau_n^2 \le \tau_0^2$; the second gives $\tau_n^2 \le \sigma^2 / n$.

[Figures: prior, likelihood, and posterior densities of $\pi$ for the four Binomial-Beta examples summarized below; as $n$ grows the posterior narrows and moves from the prior mean toward the MLE.]

All four examples use the $\mathrm{Beta}(4, 3)$ prior, so $E[\pi] = 4/7 = 0.571$, $\mathrm{Var}(\pi) = 3/98 = 0.0306$, and $SD(\pi) = 0.175$; in each case the MLE is $\hat\pi = y/n = 0.4$.

    n      y      E[π | y]            Var(π | y)   SD(π | y)
    5      2      6/12     = 0.500    0.0192       0.139
    20     8      12/27    = 0.444    0.0088       0.094
    100    40     44/107   = 0.411    0.0022       0.047
    1000   400    404/1007 = 0.401    0.00024      0.0154

Prediction

Another useful summary is the posterior predictive distribution of a future observation $\tilde y$:

$$p(\tilde y \mid y) = \int p(\tilde y \mid y, \theta)\, p(\theta \mid y)\, d\theta$$

In many situations $\tilde y$ will be conditionally independent of $y$ given $\theta$, and the distribution reduces to

$$p(\tilde y \mid y) = \int p(\tilde y \mid \theta)\, p(\theta \mid y)\, d\theta$$

In many situations this can be difficult to calculate, though it is often easy with a conjugate prior.

For example, with the Binomial-Beta model the posterior distribution of the success probability is $\mathrm{Beta}(a_1, b_1)$ (for some $a_1, b_1$). Then the distribution of the number of successes $\tilde y$ in $m$ new trials is

$$p(\tilde y \mid y) = \int_0^1 \binom{m}{\tilde y} \pi^{\tilde y} (1 - \pi)^{m - \tilde y}\, \frac{\Gamma(a_1 + b_1)}{\Gamma(a_1)\Gamma(b_1)}\, \pi^{a_1 - 1} (1 - \pi)^{b_1 - 1}\, d\pi = \binom{m}{\tilde y} \frac{\Gamma(a_1 + b_1)}{\Gamma(a_1)\Gamma(b_1)}\, \frac{\Gamma(a_1 + \tilde y)\,\Gamma(b_1 + m - \tilde y)}{\Gamma(a_1 + b_1 + m)},$$

which is an example of the Beta-Binomial distribution. The mean of this distribution is

$$E[\tilde y \mid y] = m\,\frac{a_1}{a_1 + b_1} = m\bar\pi,$$

which can be obtained by applying

$$E[\tilde y \mid y] = E\big[E[\tilde y \mid \pi] \mid y\big] = E[m\pi \mid y]$$

The variance can be obtained from

$$\mathrm{Var}(\tilde y \mid y) = \mathrm{Var}(E[\tilde y \mid \pi] \mid y) + E[\mathrm{Var}(\tilde y \mid \pi) \mid y] = \mathrm{Var}(m\pi \mid y) + E[m\pi(1 - \pi) \mid y],$$

which works out to the form

$$m\bar\pi(1 - \bar\pi)\{1 + (m - 1)\tau^2\}, \qquad \text{where } \bar\pi = \frac{a_1}{a_1 + b_1} \text{ and } \tau^2 = \frac{1}{a_1 + b_1 + 1}.$$

One way of thinking about this is that there are two pieces of uncertainty in predicting a new observation:

1. uncertainty about the true success probability, and
2. deviation of the observation from its expected value.
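As a numerical check on the Beta-Binomial mean and variance formulas, here is a short sketch (Python 3.8+ standard library plus numpy; the choices $a_1 = 6$, $b_1 = 6$ match the $\mathrm{Beta}(4 + 2,\, 3 + 3)$ posterior from the $n = 5$, $y = 2$ example, and $m = 10$ future trials is an arbitrary illustration):

```python
import math
import numpy as np

# Posterior for pi is Beta(a1, b1); y_tilde = number of successes in m new trials
a1, b1 = 6.0, 6.0    # Beta(4 + 2, 3 + 3) posterior from the n = 5, y = 2 example
m = 10               # number of future trials (illustrative choice)

def beta_binom_pmf(k, m, a1, b1):
    """Closed-form p(y_tilde = k | y): C(m, k) * B(a1 + k, b1 + m - k) / B(a1, b1)."""
    log_pmf = (math.log(math.comb(m, k))
               + math.lgamma(a1 + k) + math.lgamma(b1 + m - k) - math.lgamma(a1 + b1 + m)
               - (math.lgamma(a1) + math.lgamma(b1) - math.lgamma(a1 + b1)))
    return math.exp(log_pmf)

pmf = np.array([beta_binom_pmf(k, m, a1, b1) for k in range(m + 1)])
print(pmf.sum())                       # ~1.0, so the pmf is proper

ks = np.arange(m + 1)
mean = (ks * pmf).sum()
var = ((ks - mean) ** 2 * pmf).sum()

pi_bar = a1 / (a1 + b1)                # posterior mean of pi
tau2 = 1.0 / (a1 + b1 + 1.0)
print(mean, m * pi_bar)                                         # both 5.0
print(var, m * pi_bar * (1 - pi_bar) * (1 + (m - 1) * tau2))    # both about 4.23
```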
This is clearer with the Normal-Normal model with fixed variance. As we saw earlier, the posterior distribution has the form $\theta \mid y \sim N(\mu_n, \tau_n^2)$. Then

$$p(\tilde y \mid y) = \int \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{1}{2\sigma^2}(\tilde y - \theta)^2\right) \frac{1}{\tau_n\sqrt{2\pi}} \exp\!\left(-\frac{1}{2\tau_n^2}(\theta - \mu_n)^2\right) d\theta$$

A little bit of calculus shows that this reduces to a normal density with mean

$$E[\tilde y \mid y] = \mu_n = E\big[E[\tilde y \mid \theta] \mid y\big] = E[\theta \mid y]$$

and variance

$$\mathrm{Var}(\tilde y \mid y) = \tau_n^2 + \sigma^2 = \mathrm{Var}(E[\tilde y \mid \theta] \mid y) + E[\mathrm{Var}(\tilde y \mid \theta) \mid y] = \mathrm{Var}(\theta \mid y) + E[\sigma^2 \mid y]$$

An analogue to this is the variance for prediction in linear regression, which has exactly this form:

$$\mathrm{Var}(\hat y \mid x) = \sigma^2\left(1 + \frac{1}{n} + \frac{(x - \bar x)^2}{(n - 1)s_x^2}\right)$$

Simulating the posterior predictive distribution

This is easy to do, assuming that you can simulate from the posterior distribution of the parameter, which is usually feasible. It involves two steps (a minimal sketch appears at the end of these notes):

1. Simulate $\theta_i$ from $\theta \mid y$, $i = 1, \dots, m$.
2. Simulate $\tilde y_i$ from $\tilde y \mid \theta_i$ ($= \tilde y \mid \theta_i, y$), $i = 1, \dots, m$.

The pairs $(\theta_i, \tilde y_i)$ are draws from the joint distribution of $(\theta, \tilde y) \mid y$. Therefore the $\tilde y_i$ are draws from $\tilde y \mid y$.

Why the interest in the posterior predictive distribution?

- You might want to make predictions, for example about what will happen to a stock in 6 months.
- Model checking: is your model reasonable? There are a number of ways of doing this. Future observations could be compared with the posterior predictive distribution. Another option might be something along the lines of cross-validation.
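Here is a minimal sketch of the two-step simulation recipe for the Normal-Normal model (Python with numpy; the prior parameters, $\sigma$, and the data values are made-up illustration values, not from the notes). It compares the simulated draws with the closed-form predictive mean $\mu_n$ and variance $\tau_n^2 + \sigma^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Normal-Normal model with known sigma; all numbers here are illustrative
mu0, tau0 = 0.0, 2.0                       # prior:  theta ~ N(mu0, tau0^2)
sigma = 1.0                                # data:   y_i | theta ~ N(theta, sigma^2)
y = np.array([1.2, 0.8, 1.9, 1.4, 0.6])
n, ybar = len(y), y.mean()

# Posterior theta | y ~ N(mu_n, tau_n^2): precisions add, means are precision-weighted
prec_n = 1 / tau0**2 + n / sigma**2
tau_n = np.sqrt(1 / prec_n)
mu_n = (mu0 / tau0**2 + n * ybar / sigma**2) / prec_n

# Two-step simulation of the posterior predictive for one future observation y_tilde:
# 1. draw theta_i from theta | y;   2. draw y_tilde_i from y_tilde | theta_i
M = 100_000
theta_draws = rng.normal(mu_n, tau_n, size=M)
ytilde_draws = rng.normal(theta_draws, sigma)

# Check against the closed form: y_tilde | y ~ N(mu_n, tau_n^2 + sigma^2)
print(ytilde_draws.mean(), mu_n)
print(ytilde_draws.var(), tau_n**2 + sigma**2)
```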