
Chapter 9

The Exponential Family: Conjugate Priors

Within the Bayesian framework the parameter θ is treated as a random quantity. This requires us to specify a prior distribution p(θ), from which we can obtain the posterior distribution p(θ | x) via Bayes theorem:

    p(\theta \mid x) = \frac{p(x \mid \theta)\, p(\theta)}{p(x)},    (9.1)

where p(x | θ) is the likelihood. Most inferential conclusions obtained within the Bayesian framework are based in one way or another on averages computed under the posterior distribution, and thus for the Bayesian framework to be useful it is essential to be able to compute these integrals with some effective procedure. In particular, prediction of a future observation xnew is based on the predictive distribution:

    p(x_{\mathrm{new}} \mid x) = \int p(x_{\mathrm{new}} \mid \theta)\, p(\theta \mid x)\, d\theta,    (9.2)

which is an integral with respect to the posterior. (We have assumed that Xnew ⊥⊥ X | θ.) Note also that forming the posterior distribution itself involves computing an integral: to normalize the posterior we must compute

    p(x) = \int p(x \mid \theta)\, p(\theta)\, d\theta,    (9.3)

which is an integral with respect to the prior.

In this section we introduce the idea of a conjugate prior. The basic idea is as follows. Given a likelihood p(x | θ), we choose a family of prior distributions such that integrals of the form Eq. (9.3) can be obtained tractably (for every prior in the family).


Moreover, we choose this family such that prior-to-posterior updating yields a posterior that is also in the family. This ensures that integrals of the form Eq. (9.2) can also be obtained tractably for every posterior distribution in the family.

In general these two goals are in conflict. For example, the goal of invariance under prior-to-posterior updating (i.e., asking that the posterior remain in the same family of distributions as the prior) can be achieved vacuously by taking the family to be the set of all probability distributions, but this would not yield tractable integrals. At the other extreme, we could aim to obtain tractable integrals by taking the family of prior distributions to be a single distribution of a simple form (e.g., a constant), but the posterior would not generally retain this form.

In the setting of the exponential family this dilemma is readily resolved. For exponential families the likelihood is a simple standardized function of the parameter and we can define conjugate priors by mimicking the form of the likelihood. Multiplication of a likelihood and a prior that have the same exponential form yields a posterior that retains that form. Moreover, for the exponential families that are most useful in practice, these exponential forms are readily integrated. In the remainder of this section we present examples that illustrate conjugate priors for exponential family distributions.

Conjugate priors thus have appealing computational properties and for this reason they are widely used in practice. Indeed, for the complex models of the kind constructed throughout this book, computational considerations may be paramount, and there may be little choice but to use conjugate priors. On the other hand, there are also good reasons not to use conjugate priors and one should not be lulled into a sense of complacency when using them. Before turning to a presentation of examples, let us briefly discuss some of the philosophical issues; we will return to this discussion in Section ?? after we have obtained a better idea of some of the options.

Recall from our earlier discussion in Section ?? the distinction between the subjective Bayesian and objective Bayesian perspectives. The subjective Bayesian perspective takes the optimistic view that priors are an opportunity to express knowledge; in particular, a prior may be a posterior from a previous experiment. The objective Bayesian perspective takes the more pessimistic view that prior knowledge is often not available and that priors should be chosen to have as little impact on the analysis as possible, relative to the impact of the data.

In this regard, it is important to note that conjugate priors involve making relatively strong assumptions. Indeed, in a sense to be made clear in Section ??, conjugate priors minimize the impact of the data on the posterior. From the subjective perspective, this can be viewed favorably: conjugate priors provide an opportunity to express knowledge in a relatively influential way. From the objective perspective, however, conjugate priors are decidedly dangerous; objective priors aim to maximize the impact of the data on the posterior.

The general point to be made is that one should take care with conjugate priors. The use of conjugate priors involves relatively strong assumptions and thus it is particularly important to do sensitivity analysis to assess how strongly the posterior is influenced by

the prior. If the answer is “not much,” then one can proceed with some confidence. If the answer is “a lot,” then one should either take great care to assess whether a domain expert is comfortable with these priors on subjective grounds or one should consider other kinds of priors (such as those discussed in Section ??) and/or gather more data so as to diminish the effect of the prior.

9.0.1 Bernoulli distribution and beta priors

We have stated that conjugate priors can be obtained by mimicking the form of the likelihood. This is easily understood by considering examples. Let us begin with the Bernoulli distribution. Parameterizing the Bernoulli distribution using the parameter θ, the likelihood takes the following form:

    p(x \mid \theta) = \theta^{x} (1-\theta)^{1-x}.    (9.4)

Under i.i.d. sampling, this expression retains the form of a product of powers of θ and 1 − θ, with the exponents growing. This suggests that to obtain a conjugate prior for θ, we use a distribution that is a product of powers of θ and 1 − θ, with free parameters in the exponents:

    p(\theta \mid \tau) \propto \theta^{\tau_1} (1-\theta)^{\tau_2}.    (9.5)

This expression can be normalized if τ1 > −1 and τ2 > −1. The resulting distribution is known as the beta distribution, another example of an exponential family distribution. The beta distribution is traditionally parameterized using αi − 1 instead of τi in the exponents (for a reason that will become clear below), yielding the following standard form for the conjugate prior:

    p(\theta \mid \alpha) = K(\alpha)\, \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_2 - 1},    (9.6)

where the normalization factor K(α) can be obtained analytically (see Exercise ??):

    K(\alpha) = \left( \int \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_2 - 1}\, d\theta \right)^{-1}    (9.7)

             = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\,\Gamma(\alpha_2)},    (9.8)

as a ratio of gamma functions.

If we multiply the beta density by the Bernoulli likelihood we obtain a beta density. Consider in particular N i.i.d. Bernoulli observations, x = (x_1, ..., x_N)^T:

    p(\theta \mid x, \alpha) \propto \left( \prod_{n=1}^{N} \theta^{x_n} (1-\theta)^{1-x_n} \right) \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_2 - 1}    (9.9)

                             = \theta^{\sum_{n=1}^{N} x_n + \alpha_1 - 1}\, (1-\theta)^{N - \sum_{n=1}^{N} x_n + \alpha_2 - 1}.    (9.10)

This is a beta density with updated values of the parameters. In particular, it is a Beta(∑_{n=1}^N xn + α1, N − ∑_{n=1}^N xn + α2) distribution.

Note the simple nature of the prior-to-posterior updating procedure. For each observation xn we simply add xn to the first parameter of the beta distribution and add 1 − xn to the second parameter of the beta distribution. At each step we simply retain two numbers as our representation of the posterior distribution.

Note also that the form of the updating procedure provides an interpretation for the parameters α1 and α2. In particular, viewing the prior as if it were the posterior from a previous experiment, we can view α1 and α2 as "effective counts"; α1 can be viewed as an effective number of prior observations of X = 1 and α2 can be interpreted as an effective number of prior observations of X = 0. (In general, however, the parameters are not restricted to integer values.)

The fact that the normalization factor of the beta distribution has an analytic form allows us to compute various averages in closed form. Consider, in particular, the mean of a beta random variable:

    E[\theta \mid \alpha] = \int \theta\, K(\alpha)\, \theta^{\alpha_1 - 1} (1-\theta)^{\alpha_2 - 1}\, d\theta    (9.11)

                          = K(\alpha) \int \theta^{\alpha_1} (1-\theta)^{\alpha_2 - 1}\, d\theta    (9.12)

                          = \frac{\Gamma(\alpha_1 + \alpha_2)}{\Gamma(\alpha_1)\,\Gamma(\alpha_2)} \cdot \frac{\Gamma(\alpha_1 + 1)\,\Gamma(\alpha_2)}{\Gamma(\alpha_1 + 1 + \alpha_2)}    (9.13)

                          = \frac{\alpha_1}{\alpha_1 + \alpha_2},    (9.14)

using Γ(a + 1) = aΓ(a) in the final step. A similar calculation yields the variance:

    \mathrm{Var}[\theta \mid \alpha] = \frac{\alpha_1 \alpha_2}{(\alpha_1 + \alpha_2 + 1)(\alpha_1 + \alpha_2)^2}.    (9.15)

From these results we see that the relative values of α1 and α2 determine the mean, whereas the magnitude α1 + α2 determines the variance. That is, for a fixed value of the mean, the variance goes to zero as α1 + α2 goes to infinity.

Applying these results to the posterior distribution in Eq. (9.10), we can compute the posterior mean:

    E[\theta \mid x, \alpha] = \frac{\sum_{n=1}^{N} x_n + \alpha_1}{N + \alpha_1 + \alpha_2},    (9.16)

and the posterior variance:

    \mathrm{Var}[\theta \mid x, \alpha] = \frac{\left(\sum_{n=1}^{N} x_n + \alpha_1\right)\left(N - \sum_{n=1}^{N} x_n + \alpha_2\right)}{(N + \alpha_1 + \alpha_2 + 1)(N + \alpha_1 + \alpha_2)^2}.    (9.17)

These equations yield several significant pieces of information. First, letting N tend to infinity, we see that

    E[\theta \mid x, \alpha] \rightarrow \frac{1}{N} \sum_{n=1}^{N} x_n,    (9.18)

which is the maximum likelihood estimate of θ. Second, we have

    \mathrm{Var}[\theta \mid x, \alpha] \rightarrow 0,    (9.19)

showing that the posterior distribution concentrates around the maximum likelihood estimate for large N. Thus we see inklings of the reconciliation of Bayesian and frequentist statistics that is achieved in the large-sample limit. Finally, rewriting Eq. (9.16), we have:

    E[\theta \mid x, \alpha] = \kappa\, \frac{\alpha_1}{\alpha_1 + \alpha_2} + (1 - \kappa)\, \bar{x},    (9.20)

where κ = (α1 + α2)/(N + α1 + α2). We see that the posterior mean is a convex combination of the prior mean and the maximum likelihood estimate. As N goes to infinity, κ goes to zero and the posterior mean converges to the maximum likelihood estimate.
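The updating rule and the convex-combination form of the posterior mean are easy to verify numerically. The following short sketch (an illustration only; it assumes NumPy and SciPy are available, and the hyperparameter values and variable names are ours) performs the conjugate update of Eq. (9.10) on simulated Bernoulli data and checks Eq. (9.20).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical prior hyperparameters and simulated Bernoulli data.
alpha1, alpha2 = 2.0, 3.0
theta_true = 0.7
N = 50
x = rng.binomial(1, theta_true, size=N)

# Conjugate update (Eq. 9.10): add the counts of ones and zeros.
alpha1_post = alpha1 + x.sum()
alpha2_post = alpha2 + N - x.sum()
posterior = stats.beta(alpha1_post, alpha2_post)

# Posterior mean as a convex combination of prior mean and MLE (Eq. 9.20).
kappa = (alpha1 + alpha2) / (N + alpha1 + alpha2)
prior_mean = alpha1 / (alpha1 + alpha2)
mle = x.mean()
print(posterior.mean(), kappa * prior_mean + (1 - kappa) * mle)  # should agree
```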

Finally, let us compute the predictive probability of a new data point, Xnew, assumed conditionally independent of the observations X. We have:

    p(X_{\mathrm{new}} = 1 \mid x, \alpha) = \int p(X_{\mathrm{new}} = 1 \mid \theta)\, p(\theta \mid x, \alpha)\, d\theta

        = \frac{\Gamma(N + \alpha_1 + \alpha_2)}{\Gamma\!\left(\sum_{n=1}^{N} x_n + \alpha_1\right)\Gamma\!\left(N - \sum_{n=1}^{N} x_n + \alpha_2\right)} \int \theta\; \theta^{\sum_{n=1}^{N} x_n + \alpha_1 - 1} (1-\theta)^{N - \sum_{n=1}^{N} x_n + \alpha_2 - 1}\, d\theta

        = \frac{\Gamma(N + \alpha_1 + \alpha_2)}{\Gamma\!\left(\sum_{n=1}^{N} x_n + \alpha_1\right)\Gamma\!\left(N - \sum_{n=1}^{N} x_n + \alpha_2\right)} \int \theta^{\sum_{n=1}^{N} x_n + \alpha_1} (1-\theta)^{N - \sum_{n=1}^{N} x_n + \alpha_2 - 1}\, d\theta

        = \frac{\Gamma(N + \alpha_1 + \alpha_2)}{\Gamma\!\left(\sum_{n=1}^{N} x_n + \alpha_1\right)\Gamma\!\left(N - \sum_{n=1}^{N} x_n + \alpha_2\right)} \cdot \frac{\Gamma\!\left(\sum_{n=1}^{N} x_n + \alpha_1 + 1\right)\Gamma\!\left(N - \sum_{n=1}^{N} x_n + \alpha_2\right)}{\Gamma(N + \alpha_1 + \alpha_2 + 1)}

        = \frac{\sum_{n=1}^{N} x_n + \alpha_1}{N + \alpha_1 + \alpha_2}.    (9.21)

We have carried out this calculation in detail so that we could display the fact that the predictive distribution takes the form of a ratio of normalizing factors. This is a general result, one that we will demonstrate for general exponential families in Section 9.0.5. However, having done the calculation, we should note that the final result is identical to Eq. (9.16). Indeed, we could have obtained the predictive distribution in this case via a much shorter conditional expectation calculation (see Exercise ??).
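The ratio-of-normalizers form of Eq. (9.21) can likewise be checked numerically. In the sketch below (an illustration, assuming SciPy's log-beta function; the counts and hyperparameters are made up), the predictive probability computed as a ratio of beta normalizing constants agrees with the posterior mean of Eq. (9.16).

```python
import numpy as np
from scipy.special import betaln

# Hypothetical counts and hyperparameters.
alpha1, alpha2 = 2.0, 3.0
N, n_ones = 50, 31          # N observations, n_ones of them equal to 1

# Ratio of beta normalizing constants (Eq. 9.21), computed on the log scale.
log_ratio = betaln(n_ones + alpha1 + 1, N - n_ones + alpha2) \
            - betaln(n_ones + alpha1, N - n_ones + alpha2)
p_new_ratio = np.exp(log_ratio)

# Posterior mean (Eq. 9.16).
p_new_mean = (n_ones + alpha1) / (N + alpha1 + alpha2)
print(p_new_ratio, p_new_mean)  # identical up to floating point
```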

9.0.2 Multinomial distribution and Dirichlet priors

A very similar development can be carried out for the multinomial distribution. In this case the likelihood takes the form

    p(x \mid \theta) = \theta_1^{x_1} \theta_2^{x_2} \cdots \theta_K^{x_K}.    (9.22)

This yields the following conjugate prior:

    p(\theta \mid \alpha) \propto \theta_1^{\alpha_1 - 1} \theta_2^{\alpha_2 - 1} \cdots \theta_K^{\alpha_K - 1}.    (9.23)

For αi > 0 this expression can be normalized (integrating over the simplex ∑_{k=1}^K θk = 1). The result is the Dirichlet distribution:

    p(\theta \mid \alpha) = K(\alpha)\, \theta_1^{\alpha_1 - 1} \theta_2^{\alpha_2 - 1} \cdots \theta_K^{\alpha_K - 1},    (9.24)

where

    K(\alpha) = \frac{\Gamma\!\left(\sum_{k=1}^{K} \alpha_k\right)}{\prod_{k=1}^{K} \Gamma(\alpha_k)}    (9.25)

and where α· := ∑_{k=1}^K αk. The Dirichlet distribution plays an important role in several chapters of the book. We thus provide a detailed treatment of the Dirichlet distribution in Section ??, including a derivation of the normalization constant K(α). In this section we restrict ourselves to presenting a few key results that make clear the parallel between Bernoulli/beta conjugacy and multinomial/Dirichlet conjugacy.

Note in this regard that the beta distribution is the special case of the Dirichlet distribution for K = 2, and, as can be checked, the results of this section reduce to those of Section 9.0.1 for K = 2.

Consider N i.i.d. multinomial observations, x = (x_1, ..., x_N)^T. Multiplying the multinomial likelihood by the Dirichlet prior yields

    p(\theta \mid x, \alpha) \propto \theta_1^{\sum_{n=1}^{N} x_{n1} + \alpha_1 - 1}\, \theta_2^{\sum_{n=1}^{N} x_{n2} + \alpha_2 - 1} \cdots \theta_K^{\sum_{n=1}^{N} x_{nK} + \alpha_K - 1}.    (9.26)

We once again see the simple form of prior-to-posterior updating obtained using a conjugate prior. In particular, each observation xn has one component that is equal to one (all the other components are zero), and to incorporate that observation into the posterior we simply add one to the corresponding component of α. We can use Eq. (9.25) to compute the mean of the Dirichlet distribution (see Exercise ??):

    E[\theta_i \mid \alpha] = \frac{\alpha_i}{\alpha_\cdot},    (9.27)

as well as the variance of the ith component (see Exercise ??):

    \mathrm{Var}[\theta_i \mid \alpha] = \frac{\alpha_i (\alpha_\cdot - \alpha_i)}{\alpha_\cdot^2 (\alpha_\cdot + 1)},    (9.28)

and the covariances (see Exercise ??). Given that the posterior distribution is also Dirichlet, we can use these results to obtain the posterior mean:

    E[\theta_i \mid x, \alpha] = \frac{\sum_{n=1}^{N} x_{ni} + \alpha_i}{N + \alpha_\cdot}    (9.29)

                               = \kappa\, \frac{\alpha_i}{\alpha_\cdot} + (1 - \kappa)\, \bar{x}_i,    (9.30)

where x̄i = ∑_{n=1}^N x_{ni}/N is the maximum likelihood estimate of θi and κ = α·/(N + α·). Once again we see that the posterior mean is a convex combination of the prior mean and the maximum likelihood estimate. Also, as N goes to infinity it can be checked that the posterior concentrates around the maximum likelihood estimate.
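A minimal numerical sketch of the Dirichlet update (assuming NumPy; the hyperparameters and data are arbitrary choices of ours) looks as follows; it checks that the posterior mean of Eq. (9.29) matches the convex combination of Eq. (9.30).

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical Dirichlet hyperparameters and simulated categorical data.
alpha = np.array([1.0, 2.0, 3.0])
theta_true = np.array([0.2, 0.3, 0.5])
N = 200
counts = rng.multinomial(1, theta_true, size=N)   # one-hot rows, one per observation

# Conjugate update (Eq. 9.26): add the observed counts to alpha.
alpha_post = alpha + counts.sum(axis=0)

# Posterior mean (Eq. 9.29) versus the convex combination of Eq. (9.30).
kappa = alpha.sum() / (N + alpha.sum())
posterior_mean = alpha_post / alpha_post.sum()
print(posterior_mean)
print(kappa * alpha / alpha.sum() + (1 - kappa) * counts.mean(axis=0))
```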

9.0.3 Poisson distribution and gamma priors

As another example of a discrete exponential family distribution let us consider the Poisson distribution:

    p(x \mid \theta) = \frac{\theta^{x} e^{-\theta}}{x!}.    (9.31)

The corresponding conjugate prior retains the shape of the Poisson likelihood:

    p(\theta \mid \alpha) \propto \theta^{\alpha_1 - 1} e^{-\alpha_2 \theta},    (9.32)

which we recognize as the gamma distribution:

    p(\theta \mid \alpha) = K(\alpha)\, \theta^{\alpha_1 - 1} e^{-\alpha_2 \theta},    (9.33)

where

    K(\alpha) = \frac{\alpha_2^{\alpha_1}}{\Gamma(\alpha_1)}.    (9.34)

See Exercise ?? for the computation of this normalization factor.

Given N i.i.d. Poisson observations, x = (x_1, ..., x_N)^T, we see that the prior-to-posterior update involves adding the sufficient statistic ∑_{n=1}^N xn to the parameter α1 and adding the number of observations N to the parameter α2. The mean of the gamma distribution is readily computed based on Eq. (9.34):

    E[\theta \mid \alpha] = \frac{\alpha_1}{\alpha_2},    (9.35)

as is the variance:

    \mathrm{Var}[\theta \mid \alpha] = \frac{\alpha_1}{\alpha_2^{2}}.    (9.36)

Plugging in the parameters from the posterior distribution we find that the posterior mean has a familiar linear form:

    E[\theta \mid x, \alpha] = \kappa\, \frac{\alpha_1}{\alpha_2} + (1 - \kappa)\, \bar{x},    (9.37)

where κ = α2/(N + α2). It can also be checked that the posterior variance goes to zero (at rate 1/N).
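The gamma-Poisson update can be sketched in the same way (again an illustration with arbitrary hyperparameters and simulated data, assuming NumPy).

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical gamma hyperparameters and simulated Poisson data.
alpha1, alpha2 = 3.0, 1.5      # shape and rate of the gamma prior
theta_true = 4.0
N = 100
x = rng.poisson(theta_true, size=N)

# Conjugate update: add the sufficient statistic to alpha1 and N to alpha2.
alpha1_post = alpha1 + x.sum()
alpha2_post = alpha2 + N

# Posterior mean (Eq. 9.37) as a convex combination of prior mean and MLE.
kappa = alpha2 / (N + alpha2)
print(alpha1_post / alpha2_post, kappa * alpha1 / alpha2 + (1 - kappa) * x.mean())
```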

9.0.4 Gaussian distribution and normal-inverse-gamma priors

As a final example we consider a somewhat more elaborate case: the univariate Gaussian distribution. Recall that this distribution is a two-parameter exponential family of the following form:

    p(x \mid \mu, \sigma^2) \propto (\sigma^2)^{-1/2} \exp\!\left\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \right\}.    (9.38)

In general we wish to treat both parameters as random and thus we need to obtain a bivariate conjugate prior for µ and σ². To build up this prior, however, we consider two simpler and somewhat artificial cases, the first in which the variance is assumed known and the second in which the mean is assumed known.

Conjugacy for the mean

Inspecting the functional form in Eq. (9.38) from the point of view of its dependence on µ, we see that we have an unnormalized Gaussian density: the exponential of the negative of a quadratic form in µ. Given that the product of two such factors is also an unnormalized Gaussian density, we are led to trying a Gaussian as a conjugate prior for µ. Thus we assume

    p(\mu \mid \mu_0, \sigma_0^2) \propto (\sigma_0^2)^{-1/2} \exp\!\left\{ -\frac{1}{2\sigma_0^2} (\mu - \mu_0)^2 \right\},    (9.39)

where µ0 and σ0² are the mean and variance of µ.

Let us begin by assuming that we have only a single data point x. To compute the posterior, p(µ | x), we need to take the product of Eq. (9.38) and Eq. (9.39) and rearrange so as to obtain a density for µ. This can be approached as an exercise in completing the square (Exercise ??). It can also be approached in a somewhat more revealing way by treating (X, µ) as a bivariate Gaussian and using general rules for computing conditional distributions under a multivariate Gaussian distribution. We thus reach forward to Chapter ??, where the following result is established (using facts about the Schur complement of matrices). Letting (Z1, Z2) be jointly Gaussian, the conditional distribution of Z1 given Z2 is also Gaussian, with conditional mean:

    E[Z_1 \mid Z_2] = E[Z_1] + \frac{\mathrm{Cov}[Z_1, Z_2]}{\mathrm{Var}[Z_2]}\, (Z_2 - E[Z_2]),    (9.40)

and conditional variance:

    \mathrm{Var}[Z_1 \mid Z_2] = \mathrm{Var}[Z_1] - \frac{\mathrm{Cov}^2[Z_1, Z_2]}{\mathrm{Var}[Z_2]}.    (9.41)

We use these results in the following way. Write

X = µ + σǫ (9.42)

µ = µ0 + σ0δ, (9.43)

where ǫ ∼ N(0, 1) and δ ∼ N(0, 1) are independent Gaussian random variables. These equations imply that µ is Gaussian with mean µ0 and variance σ0², and that X is conditionally Gaussian with mean µ and variance σ²; i.e., this is simply another way to write Eq. (9.38) and Eq. (9.39). Moreover, we can now easily calculate:

E[X] = E[µ] + σE[ǫ] = µ0    (9.44)

Var[X] = E[(X − µ0)²] = E[(µ − µ0 + σǫ)²] = σ0² + σ²    (9.45)

Cov[X, µ] = E[(X − µ0)(µ − µ0)] = E[(µ − µ0 + σǫ)(µ − µ0)] = σ0².    (9.46)

Treating X as Z2 and µ as Z1 in Eq. (9.40) and Eq. (9.41), we obtain:

    \mu_{\mathrm{post}} := E[\mu \mid X = x] = \mu_0 + \frac{\sigma_0^2}{\sigma^2 + \sigma_0^2}\,(x - \mu_0) = \frac{\sigma_0^2}{\sigma^2 + \sigma_0^2}\, x + \frac{\sigma^2}{\sigma^2 + \sigma_0^2}\, \mu_0    (9.47)

    \sigma_{\mathrm{post}}^2 := \mathrm{Var}[\mu \mid X = x] = \sigma_0^2 - \frac{\sigma_0^4}{\sigma^2 + \sigma_0^2} = \frac{\sigma^2 \sigma_0^2}{\sigma^2 + \sigma_0^2}    (9.48)

for the posterior mean and posterior variance respectively.

The posterior mean in Eq. (9.47) is a convex combination of the observation x and the prior mean µ0. If σ0² is large relative to σ², meaning that we are less sure of our prior than our data, then the posterior mean is closer to x than to µ0. On the other hand, if σ0² is small relative to σ², then the posterior mean is closer to the prior mean than to the data.

We can also express these results in terms of the inverse of the variance, the precision. In particular, plugging τ = 1/σ² and τ0 = 1/σ0² into Eq. (9.47) yields:

    E[\mu \mid X = x] = \frac{\tau}{\tau + \tau_0}\, x + \frac{\tau_0}{\tau + \tau_0}\, \mu_0.    (9.49)

This has the same interpretation as before, but is slightly more direct: the precision of the data multiplies the data x and the precision of the prior multiplies the prior mean µ0. Moreover, denoting the inverse of the posterior variance by τpost, we can rewrite Eq. (9.48) as:

    \tau_{\mathrm{post}} = \tau + \tau_0,    (9.50)

which has a direct interpretation: precisions add.

Let us now compute the posterior distribution in the case in which multiple data points are observed. Let X = (X1, ..., Xn) be a vector of conditionally independent random variables (conditioning on µ), with Xi ∼ N(µ, σ²). We now have:

    p(x \mid \mu, \tau) \propto \tau^{n/2} \exp\!\left\{ -\frac{\tau}{2} \sum_{i=1}^{n} (x_i - \mu)^2 \right\}.    (9.51)

We rewrite the exponent using a standard trick:

    \sum_{i=1}^{n} (x_i - \mu)^2 = \sum_{i=1}^{n} (x_i - \bar{x} + \bar{x} - \mu)^2
                                 = \sum_{i=1}^{n} (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2,    (9.52)

where x̄ = (1/n) ∑_{i=1}^n x_i is the sample mean. The first term yields a constant factor when exponentiated, and we see that the problem reduces to an equivalent problem involving the univariate random variable X̄. (That the problem reduces to X̄ will not be a surprise after you have read Section ??, where you will learn that X̄ is a sufficient statistic for µ.) Eq. (9.52) also shows that X̄ is Gaussian with mean µ and variance σ²/n. We plug these values into Eq. (9.47) and Eq. (9.48), which yields:

    \mu_{\mathrm{post}} = \frac{\sigma_0^2}{\sigma^2/n + \sigma_0^2}\, \bar{x} + \frac{\sigma^2/n}{\sigma^2/n + \sigma_0^2}\, \mu_0    (9.53)

and

    \sigma_{\mathrm{post}}^2 = \frac{\sigma^2 \sigma_0^2 / n}{\sigma^2/n + \sigma_0^2}.    (9.54)

These results can be expressed slightly more neatly using precisions:

    \mu_{\mathrm{post}} = \frac{n\tau}{n\tau + \tau_0}\, \bar{x} + \frac{\tau_0}{n\tau + \tau_0}\, \mu_0    (9.55)

and

    \tau_{\mathrm{post}} = n\tau + \tau_0,    (9.56)

where again we see the additive property of precisions and note that each additional data point increases the overall precision by τ.

Finally, let us compute the predictive probability of a new data point, Xnew, assumed conditionally independent of the observations X. We have:

    p(x_{\mathrm{new}} \mid x, \tau) = \int p(x_{\mathrm{new}} \mid x, \mu, \tau)\, p(\mu \mid x, \tau)\, d\mu    (9.57)

                                     = \int p(x_{\mathrm{new}} \mid \mu, \tau)\, p(\mu \mid x, \tau)\, d\mu.    (9.58)

This can again be expressed in terms of random variables:

Xnew = µ + σǫ (9.59)

µ = µpost + σpost δ,    (9.60)

where now the distribution for µ is its posterior distribution. We see that Xnew has a Gaussian distribution (it is the sum of Gaussian random variables), with predictive mean:

E[Xnew] = E[µ] + σE[ǫ] = µpost    (9.61)

and predictive variance:

Var[Xnew] = E[(Xnew − µpost)²] = E[(µ − µpost + σǫ)²] = σpost² + σ².    (9.62)

Here we see that variances add; precisions are most natural for the posterior distribution, but variances are most natural for the predictive distribution.
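The following sketch (assuming NumPy; the numerical values are arbitrary) implements the precision-based update of Eqs. (9.55)-(9.56) and the predictive moments of Eqs. (9.61)-(9.62) for simulated data with known observation variance.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical prior and known observation noise (working with precisions).
mu0, sigma0 = 0.0, 2.0          # prior mean and standard deviation for mu
sigma = 1.0                     # known observation standard deviation
tau0, tau = 1.0 / sigma0**2, 1.0 / sigma**2

mu_true, n = 1.5, 25
x = rng.normal(mu_true, sigma, size=n)

# Posterior for mu (Eqs. 9.55-9.56): precisions add.
tau_post = n * tau + tau0
mu_post = (n * tau * x.mean() + tau0 * mu0) / tau_post

# Predictive distribution for a new point (Eqs. 9.61-9.62): variances add.
pred_mean = mu_post
pred_var = 1.0 / tau_post + sigma**2
print(mu_post, 1.0 / tau_post, pred_mean, pred_var)
```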

Conjugacy for the variance

Let us now consider the Gaussian distribution from the point of view of the parameter σ²:

    p(x \mid \mu, \sigma^2) \propto (\sigma^2)^{a} e^{-b/\sigma^2},    (9.63)

where a is equal to −n/2 and b is equal to (1/2)∑_{i=1}^n (x_i − µ)². This has the flavor of the gamma distribution, but the random variable σ² is in the denominator rather than the numerator in the exponential. This is an inverse gamma distribution (see Appendix ??).

We thus assume that the prior distribution for the variance is an inverse gamma distribution:

    p(\sigma^2 \mid \alpha, \beta) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, (\sigma^2)^{-\alpha - 1} e^{-\beta/\sigma^2},    (9.64)

where α and β are hyperparameters. To obtain the posterior distribution, we multiply by the likelihood and normalize. Dropping constants, we have:

    p(\sigma^2 \mid x, \mu, \alpha, \beta) \propto p(x \mid \mu, \sigma^2)\, p(\sigma^2 \mid \alpha, \beta)    (9.65)

        \propto (\sigma^2)^{-n/2} \exp\!\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2 \right\} (\sigma^2)^{-\alpha - 1} e^{-\beta/\sigma^2}    (9.66)

        = (\sigma^2)^{-(\alpha + \frac{n}{2}) - 1} \exp\!\left\{ -\left( \beta + \frac{1}{2} \sum_{i=1}^{n} (x_i - \mu)^2 \right) \Big/ \sigma^2 \right\}.    (9.67)

Comparing to Eq. (9.64), we see that the posterior is an inverse gamma distribution with parameters α + n/2 and β + (1/2)∑_{i=1}^n (x_i − µ)².

This result can also be expressed using the scaled inverse chi-square distribution, an alternative parameterization of the inverse gamma distribution; see Exercise ??.

We will also find it convenient to derive the posterior update in terms of the precision instead of the variance. Clearly, if the variance has an inverse gamma distribution, the precision has a gamma distribution. This can be obtained by a change of variables; alternatively we can simply multiply a Ga(α, β) prior by the likelihood for the precision:

    p(\tau \mid x, \mu, \alpha, \beta) \propto p(x \mid \mu, \tau)\, p(\tau \mid \alpha, \beta)    (9.68)

        \propto \tau^{n/2} \exp\!\left\{ -\frac{\tau}{2} \sum_{i=1}^{n} (x_i - \mu)^2 \right\} \tau^{\alpha - 1} e^{-\beta\tau}    (9.69)

        = \tau^{\alpha + \frac{n}{2} - 1} \exp\!\left\{ -\left( \beta + \frac{1}{2} \sum_{i=1}^{n} (x_i - \mu)^2 \right) \tau \right\},    (9.70)

which is a gamma distribution with parameters α + n/2 and β + (1/2)∑_{i=1}^n (x_i − µ)².

Let us now compute the predictive probability; i.e., the probability of a new data point Xnew, conditioning on the observed data:

    p(x_{\mathrm{new}} \mid x, \mu, \alpha, \beta) = \int p(x_{\mathrm{new}} \mid x, \mu, \tau)\, p(\tau \mid x, \mu, \alpha, \beta)\, d\tau    (9.71)

        = \int p(x_{\mathrm{new}} \mid \mu, \tau)\, p(\tau \mid x, \mu, \alpha, \beta)\, d\tau.    (9.72)

This is a scale mixture of a Gaussian with respect to a gamma density. As we show in Appendix ??, this scale mixture is a Student t distribution. Specifically, we have:

    X_{\mathrm{new}} \sim \mathrm{St}\!\left( \mu,\; \frac{\alpha + n/2}{\beta + \frac{1}{2}\sum_{i=1}^{n}(x_i - \mu)^2},\; 2\alpha + n \right),    (9.73)

using the standard parameterization for the Student t.
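A small Monte Carlo sketch (assuming NumPy and SciPy; the data and hyperparameters are arbitrary, and the parameterization of SciPy's t distribution is mapped from St(µ, λ, p) as described in the comments) can be used to check that the gamma posterior for the precision, mixed over a Gaussian, reproduces the Student t predictive density of Eq. (9.73).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Hypothetical setting: known mean, gamma prior on the precision tau.
mu, alpha, beta = 0.0, 2.0, 2.0
x = rng.normal(mu, 1.3, size=30)           # data with unknown precision
n, ss = x.size, np.sum((x - mu) ** 2)

# Conjugate update for the precision (Eq. 9.70).
alpha_post, beta_post = alpha + n / 2.0, beta + ss / 2.0

# Predictive density: scale mixture of Gaussians versus the Student t of Eq. (9.73).
# St(mu, lam, p) corresponds to scipy's t with df=p, loc=mu, scale=1/sqrt(lam).
lam, dof = alpha_post / beta_post, 2.0 * alpha_post
grid = np.array([-2.0, 0.0, 1.0, 3.0])
taus = stats.gamma(alpha_post, scale=1.0 / beta_post).rvs(100000, random_state=rng)
mix = np.array([stats.norm(mu, 1.0 / np.sqrt(taus)).pdf(g).mean() for g in grid])
t_pdf = stats.t(df=dof, loc=mu, scale=1.0 / np.sqrt(lam)).pdf(grid)
print(mix)    # Monte Carlo scale mixture
print(t_pdf)  # closed-form Student t; the two should be close
```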

Conjugacy for the mean and variance

We turn to the more realistic case in which both the mean and variance are unknown. We wish to place a prior on these quantities that is jointly conjugate. That is, the joint posterior distribution for µ and σ² should be in the same family as the joint prior distribution for these quantities. Our results thus far indicate that we should consider a Gaussian distribution for the mean and an inverse gamma distribution for the variance (or a gamma distribution for the precision).

As a first try in defining a jointly conjugate prior, we might consider a product of these distributions; i.e., we might assume prior independence. Unfortunately, even if µ and σ² are independent in the prior, it turns out that they are dependent in the posterior (Exercise ??). This is not entirely surprising, given that µ, σ² and x form a v-structure in which x is observed. Thus, to obtain a conjugate prior we need to consider some form of dependence between the mean and variance.

One natural form of dependence involves linking the scales associated

with the random variables x and µ. Thus, given the variance of x, we may wish to define the variance of µ as a multiple of that variance. As we will see, this is particularly natural if we wish to view the prior assessment of the variability of µ in terms of “virtual observations” of data. To simplify the algebra we will work with precisions instead of variances. We make the following specifications:

    X_i \sim N(\mu, \tau), \quad i = 1, \ldots, n    (9.74)

    \mu \sim N(\mu_0, n_0 \tau)    (9.75)

    \tau \sim \mathrm{Ga}(\alpha, \beta),    (9.76)

where α, β, µ0 and n0 are hyperparameters and where the Xi are assumed independent given µ and τ. We refer to this prior as a normal-gamma distribution. To compute the posterior of µ and τ given data x = (x1,...,xn), we first compute the conditional posterior of µ given τ and then find the marginal posterior of τ. The conditional posterior of µ given τ is easy to compute. When τ is fixed, we are back in the setting of Section 9.0.4 and we can simply copy the results, plugging in n0τ in place of τ0:

    E[\mu \mid X = x] = \frac{n\tau}{n\tau + n_0\tau}\, \bar{x} + \frac{n_0\tau}{n\tau + n_0\tau}\, \mu_0    (9.77)

                      = \frac{n}{n + n_0}\, \bar{x} + \frac{n_0}{n + n_0}\, \mu_0    (9.78)

and

τpost = nτ + n0τ =(n + n0)τ. (9.79)

We see that n0 can be viewed as a “virtual sample size” on which the prior assessment of variability in µ is based; it is as if n0 prior observations of data points with precision τ have been made in determining the prior precision of µ. We now proceed to working out the marginal posterior of τ. We first write down all factors in the likelihood and prior that involve either τ or µ:

    p(\tau, \mu \mid x, \mu_0, n_0, \alpha, \beta) \propto p(\tau \mid \alpha, \beta)\, p(\mu \mid \tau, \mu_0, n_0)\, p(x \mid \mu, \tau)

        \propto \tau^{\alpha - 1} e^{-\beta\tau}\; \tau^{1/2} \exp\!\left\{ -\frac{n_0\tau}{2} (\mu - \mu_0)^2 \right\}\; \tau^{n/2} \exp\!\left\{ -\frac{\tau}{2} \sum_{i=1}^{n} (x_i - \mu)^2 \right\}

        \propto \tau^{\alpha + \frac{n}{2} - 1} \exp\!\left\{ -\left( \beta + \frac{1}{2}\sum_{i=1}^{n} (x_i - \bar{x})^2 \right)\tau \right\}\; \tau^{1/2} \exp\!\left\{ -\frac{\tau}{2} \left[ n_0 (\mu - \mu_0)^2 + n(\bar{x} - \mu)^2 \right] \right\},    (9.80)

where we have used Eq. (9.52). We now want to integrate out µ. As in Section 9.0.4, this is again a complete-the-squares problem, which we solve indirectly by working with random variables instead of

densities. In particular, the last factor in Eq. (9.80) can be viewed as arising from the following specification:

    \bar{X} \sim \mu + \frac{1}{\sqrt{n\tau}}\, \epsilon    (9.81)

    \mu \sim \mu_0 + \frac{1}{\sqrt{n_0\tau}}\, \delta,    (9.82)

where ǫ ∼ N(0, 1) and δ ∼ N(0, 1) are independent Gaussian random variables. Integrating out µ leaves us with the marginal of X̄. But this is a Gaussian random variable with mean

    E[\bar{X}] = \mu_0    (9.83)

and variance

    \mathrm{Var}[\bar{X}] = E[(\bar{X} - \mu_0)^2]    (9.84)

                          = E[((\bar{X} - \mu) + (\mu - \mu_0))^2]    (9.85)

                          = \frac{1}{n\tau} + \frac{1}{n_0\tau}    (9.86)

                          = \frac{n + n_0}{n n_0 \tau}.    (9.87)

Thus integration over µ leaves us with a τ-dependent factor

    \exp\!\left\{ -\frac{n n_0 \tau}{2(n + n_0)} (\bar{x} - \mu_0)^2 \right\}.

It also creates a normalization factor τ^{−1/2}, but this cancels the τ^{1/2} factor present in Eq. (9.80). Returning to Eq. (9.80), we have:

    p(\tau \mid x, \mu_0, n_0, \alpha, \beta) \propto \tau^{\alpha + \frac{n}{2} - 1} \exp\!\left\{ -\left( \beta + \frac{1}{2}\sum_{i=1}^{n} (x_i - \bar{x})^2 + \frac{n n_0}{2(n + n_0)} (\bar{x} - \mu_0)^2 \right) \tau \right\}.    (9.88)

This is a gamma distribution, with parameters α + n/2 and β + (1/2)∑_{i=1}^n (x_i − x̄)² + (n n0/(2(n + n0)))(x̄ − µ0)².

In summary, by starting with a normal-gamma prior, we obtain a normal-gamma posterior; i.e., we have found a conjugate prior for the mean and precision of the Gaussian.
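The normal-gamma update can be summarized in a few lines of code (a sketch with arbitrary hyperparameters, assuming NumPy), computing the conditional posterior for µ of Eqs. (9.78)-(9.79) and the marginal gamma posterior for τ of Eq. (9.88).

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical normal-gamma hyperparameters.
mu0, n0, alpha, beta = 0.0, 5.0, 2.0, 2.0

x = rng.normal(1.0, 2.0, size=40)
n, xbar = x.size, x.mean()

# Conditional posterior for mu given tau (Eqs. 9.78-9.79): weights n and n0.
mu_post = (n * xbar + n0 * mu0) / (n + n0)
n_post = n + n0                               # posterior "virtual sample size"

# Marginal posterior for tau (Eq. 9.88): a gamma distribution.
alpha_post = alpha + n / 2.0
beta_post = beta + 0.5 * np.sum((x - xbar) ** 2) \
            + (n * n0) / (2.0 * (n + n0)) * (xbar - mu0) ** 2

print(mu_post, n_post, alpha_post, beta_post)
```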

9.0.5 The general case

We now consider the general case of an exponential family distribution. Writing the exponential family density in the canonical form, we have:

    p(x \mid \eta) = h(x) \exp\{ \eta^T T(x) - A(\eta) \}.    (9.89)

Given a random sample, x = (x_1, x_2, ..., x_N), we obtain:

    p(x \mid \eta) = \left( \prod_{n=1}^{N} h(x_n) \right) \exp\!\left\{ \eta^T \left( \sum_{n=1}^{N} T(x_n) \right) - N A(\eta) \right\}    (9.90)

as the likelihood. To obtain a probability density for η we mimic the likelihood:

    p(\eta \mid \tau, n_0) = H(\tau, n_0) \exp\{ \tau^T \eta - n_0 A(\eta) \},    (9.91)

where the underlying measure is Lebesgue measure. Here τ and n0 are parameters and H(τ, n0) denotes the normalizing factor. It is possible to show that this distribution is normalizable if n0 > 0 and if τ/n0 lies in the interior of the convex hull of the support of the measure (Diaconis and Ylvisaker, 1979). (The latter condition has a natural interpretation: we will show in Section 9.0.6 that τ/n0 is the expectation of the mean, and the mean certainly lies in the convex hull of the support of the measure.) To see that Eq. (9.91) is a conjugate prior we compute the posterior density:

    p(\eta \mid x, \tau, n_0) \propto \exp\!\left\{ \left( \tau + \sum_{n=1}^{N} T(x_n) \right)^{\!T} \eta - (n_0 + N) A(\eta) \right\},    (9.92)

which retains the form of Eq. (9.91); thus the prior is conjugate. We can summarize the prior-to-posterior conversion with the following update rules:

    \tau \rightarrow \tau + \sum_{n=1}^{N} T(x_n)    (9.93)

    n_0 \rightarrow n_0 + N    (9.94)

for the parameters of the posterior.

We can also obtain conjugate priors for non-canonical representations. In particular, if we replace η by φ(θ) in Eq. (9.89), then we obtain a conjugate prior for θ by simply replacing η in Eq. (9.91) with φ(θ) (and changing the definition of H(τ, n0) so that it is obtained by integrating over θ). Note that this is not the same prior as would be obtained by applying the change-of-variables procedure to Eq. (9.91). Such a procedure generally yields a well-defined prior, but that prior is not generally a conjugate prior for θ. (We discuss the relationship between these two priors further in Section 9.0.6 below.)

Finally, let us write a general expression for the predictive distribution of a new data point Xnew when the posterior is based on a conjugate prior. Writing τpost = τ + ∑_{n=1}^N T(xn), we

have

    p(x_{\mathrm{new}} \mid x) = \int p(x_{\mathrm{new}} \mid \eta)\, p(\eta \mid x, \tau, n_0)\, d\eta

        = \int h(x_{\mathrm{new}}) \exp\{ \eta^T T(x_{\mathrm{new}}) - A(\eta) \}\; H(\tau_{\mathrm{post}}, n_0 + N) \exp\{ \tau_{\mathrm{post}}^T \eta - (n_0 + N) A(\eta) \}\, d\eta

        = H(\tau_{\mathrm{post}}, n_0 + N) \int h(x_{\mathrm{new}}) \exp\{ (\tau_{\mathrm{post}} + T(x_{\mathrm{new}}))^T \eta - (n_0 + N + 1) A(\eta) \}\, d\eta

        = h(x_{\mathrm{new}})\, \frac{H(\tau_{\mathrm{post}}, n_0 + N)}{H(\tau_{\mathrm{post}} + T(x_{\mathrm{new}}), n_0 + N + 1)}.    (9.95)

We see that the general form of the predictive distribution is a ratio of normalizing factors. It should be kept in mind, however, that this is based on the canonical parameterization. For other parameterizations it may be easiest to compute the predictive distribution directly, as we saw in the case of the Gaussian distribution.
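The update rules of Eqs. (9.93)-(9.94) are generic, as the following sketch illustrates for the Bernoulli family written in canonical form (the helper function, the chosen hyperparameters, and the beta correspondence noted in the comments are our own illustration, not part of the text).

```python
import numpy as np

# A small sketch of the general update rules (Eqs. 9.93-9.94) for the Bernoulli
# family in canonical form: T(x) = x, eta = logit(theta), A(eta) = log(1 + e^eta).
# Under this form, the conjugate prior p(eta | tau, n0) corresponds to a
# Beta(tau, n0 - tau) prior on theta (an assumption of this illustration).

def conjugate_update(tau, n0, data, T=lambda x: x):
    """Apply tau -> tau + sum T(x_n), n0 -> n0 + N."""
    data = np.asarray(data, dtype=float)
    return tau + np.sum(T(data)), n0 + data.size

tau, n0 = 2.0, 5.0                     # hypothetical prior hyperparameters
x = np.array([1, 0, 1, 1, 0, 1])       # Bernoulli observations

tau_post, n0_post = conjugate_update(tau, n0, x)
print(tau_post, n0_post)               # 6.0, 11.0
print(tau_post / n0_post)              # posterior expectation of the mean (Eq. 9.99)
```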

9.0.6 Linearity of the posterior expectation of the mean

In several of the examples we presented in Section ?? we saw that the posterior expectation of the mean is linear in the sufficient statistic; more precisely, it is a convex combination of the prior expectation and the maximum likelihood estimate. How general is this appealing result?

Let us first consider the general case of a canonical exponential family and the corresponding conjugate prior. We wish to compute the expectation of the mean with respect to the posterior distribution of η. Given conjugacy, the posterior distribution is in the same family as the prior, so it suffices to compute this expectation under the prior. We then substitute the appropriate parameters for the posterior in place of those for the prior. Thus we wish to compute:

    E[\mu \mid \tau, n_0] = E[\nabla A(\eta) \mid \tau, n_0].    (9.96)

To carry out this computation, we first compute the gradient of the prior density:

    \nabla p(\eta \mid \tau, n_0) = p(\eta \mid \tau, n_0)\, (\tau - n_0 \nabla A(\eta)).    (9.97)

Note that if we integrate this expression with respect to η, we obtain τ − n0 E[∇A(η) | τ, n0] on the right-hand side. So our problem is solved if we can integrate ∇p(η | τ, n0).

It turns out that the integral of ∇p(η | τ, n0) is equal to zero. This result is a consequence of Green's theorem (a general form of the fundamental theorem of calculus), and it relies on our standing assumption that the exponential family under consideration is regular. A rigorous proof of this result was first presented in Diaconis and Ylvisaker (1979); for a simplified proof see Brown (1986). Assuming this result to be true, we have:

    E[\mu \mid \tau, n_0] = \tau / n_0    (9.98)

for the expectation of the mean under the prior. To convert this to our desired result, we recall that under the posterior τ is replaced by τ + ∑_{n=1}^N T(xn) and n0 is replaced by n0 + N. We therefore obtain:

    E[\mu \mid x, \tau, n_0] = \frac{\tau + \sum_{n=1}^{N} T(x_n)}{n_0 + N}    (9.99)

                             = \kappa\, \frac{\tau}{n_0} + (1 - \kappa)\, \hat{\mu}_{\mathrm{ML}},    (9.100)

where κ = n0/(n0 + N). We thus see that the posterior expectation of the mean is a convex combination of the prior expectation and the maximum likelihood estimate in general for conjugate priors.

Diaconis and Ylvisaker (1979) also established a converse to this result. In particular, under certain regularity conditions (which hold, for example, for all continuous distributions in the exponential family), if the posterior expectation of the mean is linear in the sufficient statistic, then the prior distribution for the canonical parameter must be a standard conjugate prior.

While these results provide insight into the nature of conjugacy, they can't be the entire story. Recall, in particular, that in Section 9.0.1 we placed a conjugate prior on the mean parameter of the Bernoulli distribution. Although the mean parameter is not the same as the canonical parameter, we nonetheless obtained a linear posterior expectation for the mean.

Consider a change of parameterization, θ = φ(η) for a one-to-one function φ. In choosing a prior for θ we have two choices. First, we can retain the standard conjugate prior and use the change-of-variables formula to obtain the prior for θ. Given that we have simply changed variables, we clearly still obtain a linear posterior expectation in this case. Alternatively, we can place a standard conjugate prior on θ directly. Having done so, we can then ask what happens if we transform back to the canonical parameter, using the change-of-variables formula. By the Diaconis and Ylvisaker (1979) results, we know that we obtain a linear posterior expectation if and only if the resulting prior is the standard conjugate prior. In sum, we are asking for the transformed prior,

    p(\theta(\eta) \mid \tau, n_0)\, \left| \frac{\partial \theta}{\partial \eta} \right|,    (9.101)

to be a standard conjugate prior for η. This happens when both factors on the right-hand side mimic the likelihood. But the first factor already mimics the likelihood (because of our choice of a conjugate prior for θ). Thus we obtain the desired result if the Jacobian |∂θ/∂η| mimics the likelihood.

? and ? have provided general conditions under which the Jacobian mimics the likelihood. In the univariate case the result is easily stated: the Jacobian mimics the likelihood if and only if the variance of the sufficient statistic is a quadratic function of the mean. In the

multivariate case things are somewhat more complicated, but again the variance function plays a fundamental role. We pursue this issue in the exercises (Exercise ?? and Exercise ??) and in the Bibliography section.
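As a simple numerical illustration of Eq. (9.98) (a Monte Carlo sketch assuming SciPy; the beta correspondence used here is the same illustrative assumption as in the earlier sketch), we can check that the prior expectation of the mean is τ/n0 for the Bernoulli family.

```python
import numpy as np
from scipy import stats

# Monte Carlo check of Eq. (9.98) for the Bernoulli family, using the assumed
# correspondence between the conjugate prior on the canonical parameter eta and
# a Beta(tau, n0 - tau) prior on the mean mu = E[X] = sigmoid(eta).
tau, n0 = 2.0, 5.0
mu_samples = stats.beta(tau, n0 - tau).rvs(500000, random_state=0)
print(mu_samples.mean(), tau / n0)   # both should be close to 0.4
```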

9.1 Appendix

9.1.1 Inverse gamma distribution

Let X have the gamma distribution Ga(α, β). What is the distribution of Z = 1/X? The Jacobian |dx/dz| of the transformation z = 1/x is 1/z². Multiplying the gamma density

    p(x) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, x^{\alpha - 1} e^{-\beta x}    (9.102)

by this Jacobian we obtain the density of Z:

    p(z) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, z^{-\alpha - 1} e^{-\beta/z}.    (9.103)

We refer to the distribution having this density as the inverse gamma distribution, denoted IG(α, β).
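A quick simulation (a sketch assuming NumPy and SciPy; the parameter values are arbitrary) confirms the change of variables: reciprocals of gamma draws behave like draws from IG(α, β).

```python
import numpy as np
from scipy import stats

# If X ~ Ga(alpha, beta) in the rate parameterization, then Z = 1/X ~ IG(alpha, beta).
alpha, beta = 3.0, 2.0
x = stats.gamma(alpha, scale=1.0 / beta).rvs(500000, random_state=0)
z = 1.0 / x

# The empirical mean of Z should match SciPy's inverse gamma mean, beta/(alpha - 1).
print(z.mean(), stats.invgamma(alpha, scale=beta).mean())
```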

9.1.2 Student t distribution

There are two classical ways to derive the Student t distribution: as a scale mixture of Gaussians and as a ratio of a Gaussian and the square root of a gamma variable. The first derivation, which is the one that we present here, is often associated with Bayesian statistics, where it arises when integrating over a conjugate prior for the variance of a Gaussian. The latter derivation plays an important role in frequentist statistics, where it arises as the sampling distribution of the ratio of the sample mean and the square root of the sample variance (see, e.g., Casella and Berger, 2001).

Let X have a N(µ, σ²) distribution, where the mean µ is assumed to be known and fixed. We place a conjugate prior on σ² and integrate. In particular, working with the precision τ = 1/σ², we place a gamma prior on τ:

    p(\tau) = \frac{\beta^\alpha}{\Gamma(\alpha)}\, \tau^{\alpha - 1} e^{-\beta\tau}    (9.104)

and compute the marginal probability p(x | µ, α, β):

    p(x \mid \mu, \alpha, \beta) = \int \frac{\beta^\alpha}{\Gamma(\alpha)}\, \tau^{\alpha - 1} e^{-\beta\tau} \left( \frac{\tau}{2\pi} \right)^{1/2} e^{-\frac{\tau}{2}(x - \mu)^2}\, d\tau    (9.105)

        = \frac{\beta^\alpha}{\Gamma(\alpha)} \frac{1}{(2\pi)^{1/2}} \int \tau^{\alpha - \frac{1}{2}} e^{-\beta\tau} e^{-\frac{\tau}{2}(x - \mu)^2}\, d\tau    (9.106)

        = \frac{\beta^\alpha}{\Gamma(\alpha)} \frac{1}{(2\pi)^{1/2}} \int \tau^{\alpha - \frac{1}{2}} e^{-\left(\beta + \frac{1}{2}(x - \mu)^2\right)\tau}\, d\tau    (9.107)

        = \frac{\beta^\alpha}{\Gamma(\alpha)} \frac{1}{(2\pi)^{1/2}} \frac{\Gamma(\alpha + \frac{1}{2})}{\left(\beta + \frac{1}{2}(x - \mu)^2\right)^{\alpha + \frac{1}{2}}}    (9.108)

        = \frac{\Gamma(\alpha + \frac{1}{2})}{\Gamma(\alpha)} \frac{1}{(2\pi\beta)^{1/2}} \frac{1}{\left(1 + \frac{1}{2\beta}(x - \mu)^2\right)^{\alpha + \frac{1}{2}}},    (9.109)

where in passing from Eq. (9.107) to Eq. (9.108) we have recognized an unnormalized gamma integral. We refer to the distribution having the density in Eq. (9.109) as the Student t distribution. It is also common to reparameterize this density by defining

    p := 2\alpha, \qquad \lambda := \frac{\alpha}{\beta},    (9.110)

which yields:

    p(x \mid \mu, \lambda, p) = \frac{\Gamma\!\left(\frac{p+1}{2}\right)}{\Gamma\!\left(\frac{p}{2}\right)} \left( \frac{\lambda}{\pi p} \right)^{1/2} \left( 1 + \frac{\lambda}{p}(x - \mu)^2 \right)^{-\frac{p+1}{2}}    (9.111)

as the density of the Student t distribution. We denote this distribution by St(µ, λ, p).

The Cauchy distribution is a special case of the Student t distribution, obtained by setting p = 1. It is also interesting to note that if we let p → ∞ in Eq. (9.111), the final factor approaches the exponential of −(λ/2)(x − µ)², and indeed it can be shown that the Gaussian distribution is a limiting case of the Student t distribution, obtained by letting p → ∞ (see Exercise ??).

Bibliography

Brown, L. D. (1986). Fundamentals of Statistical Exponential Families. Institute of Mathematical Statistics.

Casella, G. and Berger, R. L. (2001). Statistical Inference. Duxbury Press, North Scituate, MA.

Diaconis, P. and Ylvisaker, D. (1979). Conjugate priors for exponential families. Annals of Statistics, 7:269–281.
