Bayesian Inference

Ricardo Ehlers [email protected]

Departamento de Matemática Aplicada e Estatística, Universidade de São Paulo

Introduction to Estimation

A decision problem is completely specified by the description of the following spaces:

• Parameter space, or states of nature, Θ.
• Space of possible results of an experiment, Ω.
• Space of possible actions, A.

A decision rule δ is a function defined on Ω taking values in A, i.e. δ : Ω → A.

For each decision rule δ and each possible value of the parameter θ we can associate a loss L(δ, θ), assumed to take nonnegative values. This measures the penalty incurred by decision δ when the parameter takes the value θ.

Definition. The risk of a decision rule δ, denoted R(δ), is the posterior expected loss, i.e.

R(δ) = E_{θ|x}[L(δ, θ)] = ∫_Θ L(δ, θ) p(θ|x) dθ.

Definition. A decision rule δ* is optimal if its risk is minimal, i.e.

R(δ*) ≤ R(δ), ∀δ.

• This rule is called the Bayes rule and its risk is the Bayes risk.
• Then δ* = arg min_δ R(δ) is the Bayes rule.
• Bayes risks can be used to compare decision rules. A decision rule δ1 is preferred to a rule δ2 if

R(δ1) < R(δ2)

Loss Functions

Definition. The quadratic loss function is defined as,

L(δ, θ) = (δ − θ)²,

and the associated risk is,

R(δ) = E_{θ|x}[(δ − θ)²] = ∫_Θ (δ − θ)² p(θ|x) dθ.

Lemma. If L(δ, θ) = (δ − θ)², the Bayes estimator of θ is E(θ|x). The Bayes risk is,

E{[E(θ|x) − θ]²} = Var(θ|x).

Definition. The absolute loss function is defined as,

L(δ, θ) = |δ − θ|.

Lemma. If L(δ, θ) = |δ − θ|, the Bayes estimator of θ is the median of the posterior distribution.

Definition. The 0-1 loss function is defined as,

L_ε(δ, θ) = 1 − I(|θ − δ| < ε),

where I(·) is an indicator function.

Lemma. Let δ(ε) be the Bayes rule for this loss function and δ* the mode of the posterior distribution. Then,

lim_{ε→0} δ(ε) = δ*.

So, the Bayes estimator for a 0-1 loss function is the mode of the posterior distribution.

Consequently, for a 0-1 loss function and θ continuous, the Bayes estimate is,

θ* = arg max_θ p(θ|x) = arg max_θ p(x|θ)p(θ) = arg max_θ [log p(x|θ) + log p(θ)].

This is also referred to as the generalized maximum likelihood estimator (GMLE).
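These three estimators are easy to check numerically. A minimal sketch in Python, assuming an illustrative Beta(3, 9) posterior (not taken from the slides):

```python
import numpy as np
from scipy import stats

# Illustrative posterior (an assumption, not from the slides): theta | x ~ Beta(3, 9).
posterior = stats.beta(3, 9)

mean = posterior.mean()                       # quadratic loss -> posterior mean
median = posterior.ppf(0.5)                   # absolute loss  -> posterior median
grid = np.linspace(0.001, 0.999, 10_000)
mode = grid[np.argmax(posterior.pdf(grid))]   # 0-1 loss       -> posterior mode

print(f"mean={mean:.4f}  median={median:.4f}  mode={mode:.4f}")
```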

Example. Let X1, ..., Xn be a random sample from a Bernoulli distribution with parameter θ. For a Beta(α, β) prior it follows that the posterior distribution is,

θ|x ∼ Beta(α + t, β + n − t)

where t = Σ_{i=1}^n xi. Under quadratic loss the Bayes estimate is given by,

E(θ|x) = (α + Σ_{i=1}^n xi)/(α + β + n).

Example. In the previous example suppose that n = 100 and t = 10, so that the Bayes estimate under quadratic loss is,

E(θ|x) = (α + 10)/(α + β + 100).

For a Beta(1,1) prior the estimate is 0.108, while for the quite different Beta(1,2) prior the estimate is 0.107. Both estimates are close to the maximum likelihood estimate 0.1. Under 0-1 loss with a Beta(1,1) prior the Bayes estimate is exactly 0.1.
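These numbers can be verified directly; a minimal sketch using scipy, with the n, t and priors of the example:

```python
from scipy import stats

n, t = 100, 10                                # sample size and number of successes

for a, b in [(1, 1), (1, 2)]:
    post = stats.beta(a + t, b + n - t)       # posterior Beta(a + t, b + n - t)
    mode = (a + t - 1) / (a + b + n - 2)      # posterior mode (0-1 loss)
    print(f"Beta({a},{b}) prior: mean = {post.mean():.3f}, mode = {mode:.3f}")
```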

Example. Bayesian estimates of θ under quadratic loss with a Beta(a, b) prior, varying n and keeping x̄ = 0.1.

                       Prior parameters (a, b)
 n     (1,1)          (1,2)          (2,1)          (0.001,0.001)  (7,1.5)
 10    0.167 (0.067)  0.154 (0.063)  0.231 (0.061)  0.1 (0.082)    0.432 (0.039)
 20    0.136 (0.038)  0.13 (0.037)   0.174 (0.035)  0.1 (0.043)    0.316 (0.025)
 50    0.115 (0.016)  0.113 (0.016)  0.132 (0.016)  0.1 (0.018)    0.205 (0.013)
 100   0.108 (0.008)  0.107 (0.008)  0.117 (0.008)  0.1 (0.009)    0.157 (0.007)
 200   0.104 (0.004)  0.103 (0.004)  0.108 (0.004)  0.1 (0.004)    0.129 (0.004)

Example. Let X1, ..., Xn be a random sample from a Poisson distribution with parameter θ. Using the conjugate Gamma prior it follows that,

X1, ..., Xn ∼ Poisson(θ), θ ∼ Gamma(α, β), θ|x ∼ Gamma(α + t, β + n),

where t = Σ_{i=1}^n xi. The Bayes estimate under quadratic loss is,

E[θ|x] = (α + Σ_{i=1}^n xi)/(β + n) = (α + n x̄)/(β + n),

while the Bayes risk is,

Var(θ|x) = (α + n x̄)/(β + n)² = E[θ|x]/(β + n).

If α → 0 and β → 0 then E[θ|x] → x̄ and Var(θ|x) → x̄/n. If n → ∞ then E[θ|x] → x̄.

Bayes estimates of θ under quadratic loss using a Gamma(a, b) prior, varying n and keeping x̄ = 10 (Bayes risks in parentheses).

                       Prior parameters (a, b)
 n     (1,0.01)        (1,2)          (2,1)          (0.001,0.001)  (7,1.5)
 10    10.09 (1.008)   8.417 (0.701)  9.273 (0.843)  9.999 (1)      9.304 (0.809)
 20    10.045 (0.502)  9.136 (0.415)  9.619 (0.458)  10 (0.5)       9.628 (0.448)
 50    10.018 (0.2)    9.635 (0.185)  9.843 (0.193)  10 (0.2)       9.845 (0.191)
 100   10.009 (0.1)    9.814 (0.096)  9.921 (0.098)  10 (0.1)       9.921 (0.098)
 200   10.004 (0.05)   9.906 (0.049)  9.96 (0.05)    10 (0.05)      9.96 (0.049)
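Each cell is the posterior mean E[θ|x] = (a + n x̄)/(b + n) with the Bayes risk E[θ|x]/(b + n) in parentheses; a minimal sketch that reproduces the table:

```python
xbar = 10                                     # sample mean kept fixed
priors = [(1, 0.01), (1, 2), (2, 1), (0.001, 0.001), (7, 1.5)]

for n in [10, 20, 50, 100, 200]:
    cells = []
    for a, b in priors:
        mean = (a + n * xbar) / (b + n)       # E[theta | x]
        risk = mean / (b + n)                 # Var(theta | x) = Bayes risk
        cells.append(f"{mean:.3f} ({risk:.3f})")
    print(n, "  ".join(cells))
```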

Example. If X1, ..., Xn is a random sample from a N(θ, σ²) distribution with σ² known and we use the conjugate prior θ ∼ N(µ0, τ0²), then the posterior is also normal, in which case mean, median and mode coincide. The Bayes estimate of θ is then given by,

(τ0⁻² µ0 + n σ⁻² x̄)/(τ0⁻² + n σ⁻²).

Under a Jeffreys prior for θ the Bayes estimate is simply x̄.
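A minimal sketch of this precision-weighted estimate, with assumed values for σ, the prior and simulated data:

```python
import numpy as np

rng = np.random.default_rng(42)
sigma = 1.0                                   # known standard deviation (assumed)
mu0, tau0 = 0.0, 10.0                         # prior N(mu0, tau0^2), deliberately vague
x = rng.normal(2.0, sigma, size=50)           # simulated data with true mean 2
n, xbar = len(x), x.mean()

# Posterior mean: precision-weighted average of prior mean and sample mean.
post_mean = (mu0 / tau0**2 + n * xbar / sigma**2) / (1 / tau0**2 + n / sigma**2)
print(post_mean, xbar)                        # nearly identical for a vague prior
```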

Example. Let X1, ..., Xn be a random sample from a N(θ, σ²) distribution with θ known and φ = σ⁻² unknown.

φ ∼ Gamma(n0/2, n0σ0²/2),
φ|x ∼ Gamma((n0 + n)/2, (n0σ0² + ns0²)/2),

where ns0² = Σ_{i=1}^n (xi − θ)². Then,

E(φ|x) = (n0 + n)/(n0σ0² + ns0²)

is the Bayes estimate under quadratic loss.
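A minimal sketch with simulated data and assumed hyperparameters n0 and σ0²:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0                                   # known mean
phi_true = 4.0                                # precision used to simulate (assumed)
x = rng.normal(theta, 1 / np.sqrt(phi_true), size=100)

n0, sigma0_sq = 1.0, 1.0                      # assumed prior hyperparameters
n = len(x)
sum_sq = np.sum((x - theta) ** 2)             # n*s0^2 in the slides' notation

post_mean_phi = (n0 + n) / (n0 * sigma0_sq + sum_sq)
print(post_mean_phi)                          # close to phi_true for large n
```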

• The quadratic loss can be extended to the multivariate case,

L(δ, θ) = (δ − θ)′(δ − θ),

and the Bayes estimate is E(θ|x).

• Likewise, the 0-1 loss can also be extended,

L(δ, θ) = lim_{vol(A)→0} I(|θ − δ| ∈ A),

and the Bayes estimate is the joint mode of the posterior distribution.

• However, the absolute loss has no clear extension.

Example. Suppose X = (X1, ..., Xp) has a multinomial distribution with parameters n and θ = (θ1, ..., θp). If we adopt a Dirichlet prior with parameters α1, ..., αp, the posterior distribution is Dirichlet with parameters xi + αi, i = 1, ..., p. Under quadratic loss, the Bayes estimate of θ is E(θ|x), where,

E(θi|x) = E(θi|xi) = (xi + αi)/(n + Σ_{j=1}^p αj).
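A minimal sketch with hypothetical counts x and a uniform Dirichlet prior:

```python
import numpy as np

x = np.array([12, 30, 58])                    # hypothetical multinomial counts
alpha = np.array([1.0, 1.0, 1.0])             # Dirichlet prior parameters
n = x.sum()

theta_hat = (x + alpha) / (n + alpha.sum())   # posterior mean, quadratic loss
print(theta_hat, theta_hat.sum())             # estimates sum to 1
```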

Definition. The asymmetric linear loss function is defined as,

L(δ, θ) = c1(δ − θ)I_{(−∞,δ)}(θ) + c2(θ − δ)I_{(δ,∞)}(θ),

where c1 > 0 and c2 > 0. It can be shown that the Bayes estimate of θ is a value θ* such that,

P(θ ≤ θ*) = c2/(c1 + c2).

So the Bayes estimate is the quantile of order c2/(c1 + c2) of the posterior distribution.
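A minimal sketch, assuming illustrative costs c1, c2 and the Gamma(4, 0.5) posterior used later in the credible-interval figures:

```python
from scipy import stats

c1, c2 = 1.0, 3.0                             # underestimation 3x as costly (assumed)
posterior = stats.gamma(a=4, scale=1 / 0.5)   # Gamma(4, 0.5), rate parametrization

# Bayes estimate: the quantile of order c2 / (c1 + c2) of the posterior.
theta_star = posterior.ppf(c2 / (c1 + c2))
print(theta_star)                             # here the 75th percentile
```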

Definition. The Linex (Linear Exponential) loss function is defined as,

L(δ, θ) = exp[c(δ − θ)] − c(δ − θ) − 1,  c ≠ 0.

It can be shown that the Bayes estimator is,

δ* = −(1/c) log E[e^{−cθ}|x].
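A minimal numerical check of this estimator, assuming an illustrative N(m, v) posterior; for a normal posterior the closed form is δ* = m − cv/2, which the Monte Carlo average should match:

```python
import numpy as np
from scipy import stats

c = -1.0                                      # asymmetry parameter (illustrative)
m, v = 2.0, 1.0                               # assumed N(m, v) posterior

delta_closed = m - c * v / 2                  # closed form for a normal posterior
theta = stats.norm(m, np.sqrt(v)).rvs(size=1_000_000, random_state=1)
delta_mc = -np.log(np.mean(np.exp(-c * theta))) / c
print(delta_closed, delta_mc)                 # both approx 2.5 for c = -1
```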

The Linex function with c < 0 reflects small losses for overestimation and large losses for underestimation.

[Figure: Linex loss as a function of δ − θ for c = −0.5, −1 and −2.]

Credible Sets

• Point estimates simplify the posterior distribution into single figures.

• How precise are point estimates?

• We seek a compromise between reporting a single number summarizing the posterior distribution and reporting the distribution itself.

Definition. A set C ⊆ Θ is a 100(1−α)% credible set for θ if,

P(θ ∈ C) ≥ 1 − α.

• The inequality is useful when θ has a discrete distribution; otherwise an equality is used in practice (see the sketch below).
• This definition differs fundamentally from the classical confidence region.
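For a continuous posterior an interval with equality is immediate to compute. A minimal sketch, using the Gamma(4, 0.5) posterior that appears in the figures below:

```python
from scipy import stats

alpha = 0.10
posterior = stats.gamma(a=4, scale=1 / 0.5)   # Gamma(4, 0.5), rate parametrization

# Equal-tailed 90% credible interval: alpha/2 posterior probability in each tail.
lower, upper = posterior.ppf([alpha / 2, 1 - alpha / 2])
print(lower, upper)
```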

Invariance under Transformation

Credible sets are invariant under one-to-one parameter transformations. Let θ* = g(θ) and let C* denote the image of C under g. Then,

P(θ∗ ∈ C ∗) = 1 − α.

In the univariate case, if C = [a, b] is a 100(1−α)% credible interval for θ then [g(a), g(b)] is a 100(1−α)% credible interval for θ*.

• Credible sets are not unique in general.

• For any α > 0 there are infinitely many solutions to

P(θ ∈ C) = 1 − α.

[Figure: 90% credible intervals for a Poisson parameter θ when the posterior is Gamma(4, 0.5).]

[Figure: 90% credible intervals for a binomial parameter θ when the posterior is θ|x ∼ Beta(2, 1.5).]

[Figure: 90% credible intervals when the posterior is θ|x ∼ N(2, 1).]

Definition

A 100(1-α)% highest probability density (HPD) credible set for θ is a 100(1-α)% credible set for θ with the property

p(θ1|x) ≥ p(θ2|x), ∀θ1 ∈ C and ∀θ2 ∉ C.

• For symmetric distributions, HPD credible sets are obtained by fixing the same probability for the tails (as the sketch after this list illustrates).

• HPD credible sets are not invariant under transformation.

• In the univariate case, if C = [a, b] is a 100(1−α)% HPD interval for θ then [g(a), g(b)] is a 100(1−α)% credible interval for g(θ), but not necessarily an HPD one.
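A minimal numerical sketch of an HPD interval for a unimodal continuous posterior: among all intervals with posterior probability 1 − α, the HPD interval is the shortest, so we can search over the lower tail probability:

```python
from scipy import stats, optimize

def hpd_interval(posterior, alpha=0.10):
    """Shortest interval with posterior probability 1 - alpha (unimodal case)."""
    def width(p_low):
        return posterior.ppf(p_low + (1 - alpha)) - posterior.ppf(p_low)
    res = optimize.minimize_scalar(width, bounds=(0, alpha), method="bounded")
    return posterior.ppf(res.x), posterior.ppf(res.x + (1 - alpha))

print(hpd_interval(stats.gamma(a=4, scale=2)))    # the Gamma(4, 0.5) posterior
print(hpd_interval(stats.norm(2, 1)))             # symmetric: equal tails = HPD
```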

Example. Let X1, ..., Xn be a random sample from a N(θ, σ²) distribution with σ² known. If θ ∼ N(µ0, τ0²) then θ|x ∼ N(µ1, τ1²) and,

Z = (θ − µ1)/τ1 | x ∼ N(0, 1).

Define zα/2 as the value of Z such that,

P(Z ≤ zα/2) = 1 − α/2.

We can find zα/2 such that,

P(−zα/2 ≤ (θ − µ1)/τ1 ≤ zα/2) = 1 − α,

or, equivalently,

P(µ1 − zα/2 τ1 ≤ θ ≤ µ1 + zα/2 τ1) = 1 − α.

Then [µ1 − zα/2 τ1; µ1 + zα/2 τ1] is the 100(1−α)% HPD interval for θ.

Example. In the previous example, if τ0² → ∞ it follows that τ1⁻² → nσ⁻² and µ1 → x̄. Then,

Z = √n(θ − x̄)/σ | x ∼ N(0, 1).

The 100(1−α)% HPD credible interval is given by,

(x̄ − zα/2 σ/√n; x̄ + zα/2 σ/√n),

which coincides numerically with the classical confidence interval. The interpretation, however, is completely different.

Example. In the previous example, the classical approach would base inference on,

X̄ ∼ N(θ, σ²/n),

or equivalently,

U = √n(X̄ − θ)/σ ∼ N(0, 1).

U (called a pivot) is a function of the sample and of θ but its distribution does not depend on θ.

Again we can find the percentile zα/2 such that,

P(−zα/2 ≤ U ≤ zα/2) = 1 − α,

or, equivalently,

P(X̄ − zα/2 σ/√n ≤ θ ≤ X̄ + zα/2 σ/√n) = 1 − α.

However, this is a probabilistic statement about the limits of the interval, and not about θ.

The classical interpretation: if the same experiment were to be repeated infinitely many times, in 100(1 − α)% of them the random limits of the interval would include θ.

Useless in practice since it is based on unobserved samples.

In the example, when X̄ = x̄ is observed, it is said that there is a 100(1 − α)% confidence (not probability) that the interval (x̄ − zα/2 σ/√n; x̄ + zα/2 σ/√n) contains θ.

[Figure: 95% confidence intervals for the mean of 100 samples of size 20 simulated from a N(0, 100); arrows indicate intervals that do not contain zero. Realized confidence level = 95%.]

Normal Approximation

If the posterior distribution is unimodal and approximately symmetric, it can be approximated by a normal distribution centered about the posterior mode. Consider the Taylor expansion of log p(θ|x) about the mode θ*,

log p(θ|x) = log p(θ*|x) + (θ − θ*) [d log p(θ|x)/dθ]_{θ=θ*} + (1/2)(θ − θ*)² [d² log p(θ|x)/dθ²]_{θ=θ*} + ...

By definition,

[d log p(θ|x)/dθ]_{θ=θ*} = 0.

Defining,

h(θ) = −d² log p(θ|x)/dθ²,

it follows that,

p(θ|x) ≈ constant × exp{−(h(θ*)/2)(θ − θ*)²}.

Then, for large n, we have the following approximation,

θ|x ∼ N(θ*, h(θ*)⁻¹).
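A minimal numerical check, assuming an illustrative Gamma(a, b) posterior, for which θ* = (a − 1)/b and h(θ*) = b²/(a − 1):

```python
import numpy as np
from scipy import stats

a, b = 4.0, 0.5                               # illustrative Gamma(a, b), rate b
mode = (a - 1) / b                            # posterior mode theta*
h = b**2 / (a - 1)                            # h(theta*) for this posterior
approx = stats.norm(mode, 1 / np.sqrt(h))
exact = stats.gamma(a, scale=1 / b)

grid = np.linspace(1, 15, 8)
print(np.round(exact.pdf(grid), 4))
print(np.round(approx.pdf(grid), 4))          # close near the mode, worse in tails
```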

These results can be extended to the multivariate case. Since,

[∂ log p(θ|x)/∂θ]_{θ=θ*} = 0,

we define the matrix,

H(θ) = −∂² log p(θ|x)/∂θ∂θ′.

Then, for large n, we have the following approximation,

θ|x ∼ N(θ*, H(θ*)⁻¹).

In particular, it is possible to construct approximate credibility regions based on the above results.

Definition. Let θ ∈ Θ. A region C ⊂ Θ is an asymptotic 100(1 − α)% credibility region if,

lim_{n→∞} P(θ ∈ C|x) ≥ 1 − α.

[Figure: posterior Gamma density and its normal approximation with simulated data (n = 10).]

[Figure: posterior Gamma density and its normal approximation with simulated data (n = 100).]

Example. Consider the model,

X1, ..., Xn ∼ Poisson(θ), θ ∼ Gamma(α, β).

The posterior distribution is given by,

θ|x ∼ Gamma(α + Σ xi, β + n),

therefore,

p(θ|x) ∝ θ^{α + Σ xi − 1} exp{−θ(β + n)},

or equivalently,

log p(θ|x) = (α + Σ xi − 1) log θ − θ(β + n) + constant.

The first and second derivatives are,

d log p(θ|x)/dθ = −(β + n) + (α + Σ xi − 1)/θ,

d² log p(θ|x)/dθ² = −(α + Σ xi − 1)/θ².

It then follows that,

θ* = (α + Σ xi − 1)/(β + n),  h(θ) = (α + Σ xi − 1)/θ²,  h(θ*) = (β + n)²/(α + Σ xi − 1).

The approximate posterior distribution is,

θ|x ∼ N((α + Σ xi − 1)/(β + n), (α + Σ xi − 1)/(β + n)²).

An approximate 100(1−α)% credible interval for θ is,

θ* − zα/2 h(θ*)^{−1/2} < θ < θ* + zα/2 h(θ*)^{−1/2}.
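A minimal sketch with the numbers of the simulation reported below (n = 20, Σ xi = 35, Gamma(1, 2) prior):

```python
import numpy as np
from scipy import stats

a, b = 1.0, 2.0                               # Gamma prior parameters
n, t = 20, 35                                 # sample size and sum of observations
level = 0.95

mode = (a + t - 1) / (b + n)                  # theta* = 35/22
h = (b + n) ** 2 / (a + t - 1)                # h(theta*)
z = stats.norm.ppf((1 + level) / 2)
half = z / np.sqrt(h)
print(mode, (mode - half, mode + half))
```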

[Figure: posterior density and its normal approximation for 20 simulated Poisson observations with θ = 2, prior Gamma(1, 2), Σ xi = 35.]

Example. For the model X1, ..., Xn ∼ Bernoulli(θ) with θ ∼ Beta(α, β) the posterior is,

θ|x ∼ Beta(α + t, β + n − t),  t = Σ_{i=1}^n xi.

Then,

p(θ|x) ∝ θ^{α+t−1}(1 − θ)^{β+n−t−1},

log p(θ|x) = (α + t − 1) log θ + (β + n − t − 1) log(1 − θ) + constant,

d log p(θ|x)/dθ = (α + t − 1)/θ − (β + n − t − 1)/(1 − θ),

d² log p(θ|x)/dθ² = −(α + t − 1)/θ² − (β + n − t − 1)/(1 − θ)².

θ* = (α + t − 1)/(α + β + n − 2),  h(θ) = (α + t − 1)/θ² + (β + n − t − 1)/(1 − θ)²,  h(θ*) = (α + β + n − 2)/(θ*(1 − θ*)).

The approximate posterior distribution is,

θ|x ∼ N(θ*, θ*(1 − θ*)/(α + β + n − 2)).

An approximate 100(1−α)% credible interval for θ is,

[θ* − zα/2 √(θ*(1 − θ*)/(α + β + n − 2)); θ* + zα/2 √(θ*(1 − θ*)/(α + β + n − 2))].
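A minimal sketch with the numbers of the simulation reported below (n = 20, t = 3, Beta(1, 1) prior):

```python
import numpy as np
from scipy import stats

a, b, n, t = 1.0, 1.0, 20, 3                  # Beta(1, 1) prior, 3 successes in 20
level = 0.95

mode = (a + t - 1) / (a + b + n - 2)          # theta* = 3/20
var = mode * (1 - mode) / (a + b + n - 2)     # h(theta*)^{-1}
z = stats.norm.ppf((1 + level) / 2)
half = z * np.sqrt(var)
print(mode, (mode - half, mode + half))
```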

[Figure: conjugate posterior and its normal approximation for 20 simulated Bernoulli observations with θ = 0.2, prior Beta(1, 1), Σ_{i=1}^{20} xi = 3.]

Example. If X1, ..., Xn ∼ Exp(θ) and p(θ) ∝ 1, it follows that,

p(θ|x) ∝ θⁿ e^{−θt},  t = Σ_{i=1}^n xi,

π(θ) = log p(θ|x) = n log θ − θt + c,

π′(θ) = n/θ − t,

π″(θ) = −n/θ².

Then,

Mode(θ|x) = θ* = n/t = 1/x̄,

h(θ*) = n/(θ*)² = n x̄².

50 The approximate posterior distribution is,

θ|x ∼ N(1/x̄, 1/(n x̄²)).

An approximate 100(1−α)% credible interval for θ is,

[1/x̄ − zα/2 √(1/(n x̄²)); 1/x̄ + zα/2 √(1/(n x̄²))].
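A minimal sketch with simulated exponential data (assuming θ = 2):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
theta_true = 2.0                              # rate used to simulate (assumed)
x = rng.exponential(1 / theta_true, size=50)  # Exp(theta) has mean 1/theta
n, xbar = len(x), x.mean()

mode = 1 / xbar                               # theta* = 1/xbar
sd = 1 / (np.sqrt(n) * xbar)                  # h(theta*)^{-1/2}
z = stats.norm.ppf(0.975)
print(mode, (mode - z * sd, mode + z * sd))
```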
