
9 Bayesian

[Figure: Thomas Bayes, 1702–1761.]

9.1 Subjective probability

This is probability regarded as a degree of belief. A subjective probability of an event A is assessed as p if you are prepared to stake £pM to win £M, and are equally prepared to accept a stake of £pM to win £M.

In other words, the bet is fair and you are assumed to behave rationally.

9.1.1 Kolmogorov’s axioms

How does subjective probability fit in with the fundamental axioms? Let A be the set of all subsets of a countable sample space Ω. Then

(i) P(A) ≥ 0 for every A ∈ A;

(ii) P(Ω) = 1;

(iii) If {A_λ : λ ∈ Λ} is a countable set of mutually exclusive events belonging to A, then

$$P\left(\bigcup_{\lambda\in\Lambda} A_\lambda\right) = \sum_{\lambda\in\Lambda} P(A_\lambda).$$

Obviously the subjective interpretation has no difficulty in conforming with (i) and (ii); (iii) is slightly less obvious. Suppose we have two events A and B such that A ∩ B = ∅. Consider a stake of £p_A M to win £M if A occurs and a stake of £p_B M to win £M if B occurs. The total stake for bets on A or B occurring is £p_A M + £p_B M to win £M if A or B occurs. Thus we have staked £(p_A + p_B)M to win £M, and so

$$P(A \cup B) = P(A) + P(B).$$

9.1.2 Conditional probability

Define p_B, p_AB and p_{A|B} such that:

£p_B M is the fair stake for £M if B occurs;
£p_AB M is the fair stake for £M if A and B occur;
£p_{A|B} M is the fair stake for £M if A occurs, given that B has occurred; otherwise the bet is off.

Consider the three alternative outcomes: A only, B only, and A and B. If G₁, G₂, G₃ are the gains, then

$$\begin{aligned}
A:&\quad -p_B M_2 - p_{AB} M_3 = G_1\\
B:&\quad -p_{A|B} M_1 + (1 - p_B) M_2 - p_{AB} M_3 = G_2\\
A \cap B:&\quad (1 - p_{A|B}) M_1 + (1 - p_B) M_2 + (1 - p_{AB}) M_3 = G_3
\end{aligned}$$

The principle of rationality does not allow the possibility of setting the stakes to obtain a sure profit. Thus

$$\begin{vmatrix} 0 & -p_B & -p_{AB} \\ -p_{A|B} & 1 - p_B & -p_{AB} \\ 1 - p_{A|B} & 1 - p_B & 1 - p_{AB} \end{vmatrix} = 0$$

$$\Rightarrow\quad p_{AB} - p_B\, p_{A|B} = 0 \qquad\text{or}\qquad P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$
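As a quick check of this coherence argument, the determinant above can be expanded symbolically. A minimal sketch using sympy; the symbol names are mine, not from the notes:

```python
import sympy as sp

# Fair-stake probabilities for the three bets, in the notation of the notes.
p_B, p_AB, p_AgB = sp.symbols('p_B p_AB p_AgB', positive=True)

# Gain matrix: rows are the outcomes "A only (B fails, so the conditional bet
# is off)", "B only" and "A and B"; columns multiply the winnings M1, M2, M3.
G = sp.Matrix([
    [0,         -p_B,    -p_AB],
    [-p_AgB,    1 - p_B, -p_AB],
    [1 - p_AgB, 1 - p_B, 1 - p_AB],
])

# A sure profit is impossible only if this linear system is singular.
print(sp.expand(G.det()))   # expected: p_AB - p_AgB*p_B
```

Setting the expanded determinant to zero recovers p_AB = p_B p_{A|B}, i.e. the usual definition of conditional probability.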

9.2 Estimation formulated as a decision problem

A  The action space, which is the set of all possible actions a ∈ A available.

Θ  The parameter space, consisting of all possible “states of nature”, only one of which occurs or will occur (this “true” state being unknown).

L  The loss function, having domain Θ × A (the set of all possible consequences (θ, a)) and codomain ℝ.

R_X  The sample space, containing all possible realisations x ∈ R_X of a random variable X belonging to the family {f(x; θ) : θ ∈ Θ}.

D  The decision space, consisting of all possible decisions δ ∈ D, each decision function δ having domain R_X and codomain A.

9.2.1 The basic idea

• The true state of the world is unknown when the action is chosen.

• The loss which results from each of the possible consequences (θ, a) is known.

• Data are collected to provide information about the unknown θ.

• A choice of action is taken for each possible observation of X.

Decision theory is concerned with the criteria for making a choice. The form of the loss function is crucial; it depends upon the nature of the problem.

Example

You are a fashion buyer and must stock up with trendy dresses. The true number you will sell is θ (unknown).

You stock up with a dresses. If you overstock, you must sell off the surplus at your end-of-season sale and lose £A per dress. If you understock, you lose the profit of £B per dress. Thus

$$L(\theta, a) = \begin{cases} A(a - \theta), & a \ge \theta,\\ B(\theta - a), & a \le \theta. \end{cases}$$

Special forms of loss function

If A = B, you have modular loss,

$$L(\theta, a) = A\,|\theta - a|.$$

To suit comparison with classical inference, quadratic loss is often used.

$$L(\theta, a) = C(\theta - a)^2.$$
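These loss functions translate directly into code. A minimal sketch; the constants and the illustrative values of θ and a are mine, not from the notes:

```python
def stock_loss(theta, a, A=1.0, B=3.0):
    """Asymmetric dress-stocking loss: A per dress overstocked, B per dress understocked."""
    return A * (a - theta) if a >= theta else B * (theta - a)

def modular_loss(theta, a, A=1.0):
    """Modular (absolute-error) loss, the special case A = B."""
    return A * abs(theta - a)

def quadratic_loss(theta, a, C=1.0):
    """Quadratic loss, often used for comparison with classical inference."""
    return C * (theta - a) ** 2

print(stock_loss(100, 120), stock_loss(100, 80))       # 20.0 (overstock), 60.0 (understock)
print(modular_loss(100, 90), quadratic_loss(100, 90))  # 10.0, 100.0
```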

9.2.2 The risk function

This is a measure of the quality of a decision function. Suppose we choose a decision δ (i.e. choose a = δ(x)). The function R with domain Θ × D and codomain ℝ defined by

$$R(\theta, \delta) = \int_{R_X} L(\theta, \delta(x))\, f(x; \theta)\, dx$$

or

$$R(\theta, \delta) = \sum_{x \in R_X} L(\theta, \delta(x))\, p(x; \theta)$$

is called the risk function. We may write this as

$$R(\theta, \delta) = E\left[L(\theta, \delta(X))\right].$$

Example

Suppose X = (X₁, ..., Xₙ) is a random sample from N(θ, θ) and suppose we assume quadratic loss.

Consider

$$\delta_1(X) = \bar{X}, \qquad \delta_2(X) = \frac{1}{n-1}\sum_{i=1}^n (X_i - \bar{X})^2.$$

$$R(\theta, \delta_1) = E\left[(\theta - \bar{X})^2\right].$$

We know that E(X̄) = θ, so

$$R(\theta, \delta_1) = V(\bar{X}) = \frac{\theta}{n}.$$

$$R(\theta, \delta_2) = E\left[(\theta - \delta_2(X))^2\right].$$

But E[δ₂(X)] = θ, so that

$$R(\theta, \delta_2) = V\left[\delta_2(X)\right] = \frac{\theta^2}{(n-1)^2}\, V\left[\sum_{i=1}^n \frac{(X_i - \bar{X})^2}{\theta}\right] = \frac{\theta^2}{(n-1)^2} \cdot 2(n-1) = \frac{2\theta^2}{n-1}.$$

[Figure: the risk functions plotted against θ: R(θ, δ₁) = θ/n and R(θ, δ₂) = 2θ²/(n − 1) cross at θ = (n − 1)/(2n).]
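The two risk functions, and the crossing point in the figure, can be checked by simulation. A minimal sketch, assuming we may draw repeatedly from N(θ, θ); the values of n and θ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 10, 200_000

for theta in (0.2, (n - 1) / (2 * n), 1.0):
    x = rng.normal(theta, np.sqrt(theta), size=(reps, n))
    d1 = x.mean(axis=1)                 # delta_1 = sample mean
    d2 = x.var(axis=1, ddof=1)          # delta_2 = sample variance
    mse1 = np.mean((d1 - theta) ** 2)   # estimates R(theta, delta_1) = theta/n
    mse2 = np.mean((d2 - theta) ** 2)   # estimates R(theta, delta_2) = 2*theta^2/(n-1)
    print(theta, mse1, theta / n, mse2, 2 * theta ** 2 / (n - 1))
```

At θ = (n − 1)/(2n) the two theoretical risks coincide; for smaller θ the sample variance δ₂ has the smaller risk, and for larger θ the sample mean δ₁ does.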

Without knowing θ, how do we choose between δ₁ and δ₂? There are two sensible criteria.

1. Take precautions against the worst.

2. Take into account any further information you may have.

Criterion 1 leads to the minimax approach. Choose the decision leading to the smallest maximum risk, i.e. choose δ* ∈ D such that

$$\sup_{\theta\in\Theta}\{R(\theta, \delta^*)\} = \inf_{\delta\in\mathcal{D}}\,\sup_{\theta\in\Theta}\{R(\theta, \delta)\}.$$

Minimax rules have disadvantages, the main one being that they treat Nature as an adversary. Minimax guards against the worst case, but the value of θ should not be regarded as a value deliberately chosen by Nature to make the risk large. Fortunately, in many situations minimax rules turn out to be reasonable in practice, if not in concept.

Criterion 2 involves introducing beliefs about θ into the analysis. If π(θ) represents our beliefs about θ, then define the Bayes risk r(π, δ) as

$$r(\pi, \delta) = \begin{cases} \int_\Theta R(\theta, \delta)\,\pi(\theta)\, d\theta & \text{if } \theta \text{ is continuous},\\[4pt] \sum_{\theta\in\Theta} R(\theta, \delta)\,\pi(\theta) & \text{if } \theta \text{ is discrete}. \end{cases}$$

Now choose δ* such that

$$r(\pi, \delta^*) = \inf_{\delta\in\mathcal{D}}\{r(\pi, \delta)\}.$$

Bayes Rule

The Bayes rule with respect to a prior π is the decision rule δ* that minimises the Bayes risk among all possible decision rules.

[Figure: Bayes’ tomb.]

9.2.3 Decision terminology

δ* is called a Bayes decision function. δ(X) is a Bayes estimator. δ(x) is a Bayes estimate.

If a decision function δ is such that there exists δ′ satisfying

R(θ, δ′) ≤ R(θ, δ) for all θ ∈ Θ and R(θ, δ′) < R(θ, δ) for some θ ∈ Θ,

then δ is said to be inadmissible; otherwise δ is admissible.

9.3 Prior and posterior distributions

The prior distribution represents our initial belief in the parameter θ. This means that θ is regarded as a random variable with p.d.f. π(θ). From Bayes’ Theorem,

$$f(x \mid \theta)\,\pi(\theta) = \pi(\theta \mid x)\, f(x)$$

or

$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{\int_\Theta f(x \mid \theta)\,\pi(\theta)\, d\theta}.$$

π(θ | x) is called the posterior p.d.f. It represents our updated belief in θ, having taken account of the data. We write the expression in the form

$$\pi(\theta \mid x) \propto f(x \mid \theta)\,\pi(\theta)$$

or

Posterior ∝ Likelihood × Prior.

Example

Suppose X₁, X₂, ..., Xₙ is a random sample from a Poisson distribution, Poi(θ), where θ has the prior distribution

$$\pi(\theta) = \frac{\theta^{\alpha-1} e^{-\theta}}{\Gamma(\alpha)}, \qquad \theta \ge 0,\ \alpha > 0.$$

The posterior distribution is given by

$$\pi(\theta \mid x) \propto \frac{\theta^{\sum x_i}\, e^{-n\theta}}{\prod x_i!} \cdot \frac{\theta^{\alpha-1} e^{-\theta}}{\Gamma(\alpha)}$$

which simplifies to

$$\pi(\theta \mid x) \propto \theta^{\sum x_i + \alpha - 1}\, e^{-(n+1)\theta}, \qquad \theta \ge 0.$$

This is recognisable as a Γ-distribution and, by comparison with π(θ), we can write

$$\pi(\theta \mid x) = \frac{(n+1)^{\sum x_i + \alpha}\, \theta^{\sum x_i + \alpha - 1}\, e^{-(n+1)\theta}}{\Gamma\!\left(\sum x_i + \alpha\right)}, \qquad \theta \ge 0,$$

which represents our posterior belief in θ.
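This conjugate update is easy to verify numerically. A minimal sketch using scipy; the prior shape α and the data are illustrative:

```python
import numpy as np
from scipy import stats

alpha = 2.0                       # prior: theta ~ Gamma(alpha) with rate 1, as above
x = np.array([3, 1, 4, 2, 2])     # illustrative Poisson counts
n, s = len(x), x.sum()

# Posterior read off from the algebra above: Gamma(sum(x_i) + alpha, rate n + 1).
posterior = stats.gamma(a=s + alpha, scale=1.0 / (n + 1))

# Check: likelihood x prior should differ from the claimed posterior only by a constant.
theta = np.linspace(0.1, 6, 200)
unnorm = stats.poisson.pmf(x[:, None], theta).prod(axis=0) * stats.gamma.pdf(theta, a=alpha)
ratio = unnorm / posterior.pdf(theta)
print(ratio.std() / ratio.mean())   # should be essentially zero: a constant ratio
```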

9.4 Decisions

We have already defined the Bayes risk to be

$$r(\pi, \delta) = \int_\Theta R(\theta, \delta)\,\pi(\theta)\, d\theta$$

where

$$R(\theta, \delta) = \int_{R_X} L(\theta, \delta(x))\, f(x \mid \theta)\, dx,$$

so that

$$r(\pi, \delta) = \int_\Theta \int_{R_X} L(\theta, \delta(x))\, f(x \mid \theta)\, dx\;\pi(\theta)\, d\theta
= \int_{R_X} \int_\Theta L(\theta, \delta(x))\, \pi(\theta \mid x)\, d\theta\; f(x)\, dx.$$

For any given x, minimising the Bayes risk is therefore equivalent to minimising

$$\int_\Theta L(\theta, \delta(x))\, \pi(\theta \mid x)\, d\theta,$$

i.e. we minimise the posterior expected loss.

Example

Take the posterior distribution in the previous example and, for no special reason, assume quadratic loss

$$L(\theta, \delta(x)) = [\theta - \delta(x)]^2.$$

Then we minimise

$$\int_\Theta [\theta - \delta(x)]^2\, \pi(\theta \mid x)\, d\theta$$

by differentiating with respect to δ to obtain

$$-2\int_\Theta [\theta - \delta(x)]\, \pi(\theta \mid x)\, d\theta = 0
\quad\Rightarrow\quad
\delta(x) = \int_\Theta \theta\,\pi(\theta \mid x)\, d\theta.$$

This is the mean of the posterior distribution.

$$\delta(x) = \int_0^\infty \theta\, \frac{(n+1)^{\sum x_i + \alpha}\, \theta^{\sum x_i + \alpha - 1}\, e^{-(n+1)\theta}}{\Gamma\!\left(\sum x_i + \alpha\right)}\, d\theta
= \frac{(n+1)^{\sum x_i + \alpha}\, \Gamma\!\left(\sum x_i + \alpha + 1\right)}{\Gamma\!\left(\sum x_i + \alpha\right)\, (n+1)^{\sum x_i + \alpha + 1}}
= \frac{\sum x_i + \alpha}{n+1}.$$

Theorem: Suppose that δ_π is a decision rule that is Bayes with respect to some prior distribution π. If the risk function of δ_π satisfies

R(θ, δ_π) ≤ r(π, δ_π) for all θ ∈ Θ, then δ_π is a minimax decision rule.

Proof: Suppose δ_π is not minimax. Then there is a decision rule δ such that

$$\sup_{\theta\in\Theta} R(\theta, \delta) < \sup_{\theta\in\Theta} R(\theta, \delta_\pi).$$

For this decision rule we have, since f(X) ≤ k implies E[f(X)] ≤ k,

$$r(\pi, \delta) \le \sup_{\theta\in\Theta} R(\theta, \delta) < \sup_{\theta\in\Theta} R(\theta, \delta_\pi) \le r(\pi, \delta_\pi),$$

contradicting the statement that δ_π is Bayes with respect to π. Hence δ_π is minimax.

Example

Let X₁, ..., Xₙ be a random sample from a Bernoulli distribution B(1, p). Let Y = Σ Xᵢ and consider estimating p using the loss function

$$L(p, a) = \frac{(p - a)^2}{p(1 - p)},$$

a loss which penalises mistakes more if p is near 0 or 1. The m.l.e. is p̂ = Y/n and

$$R(p, \hat{p}) = E\left[\frac{(p - \hat{p})^2}{p(1 - p)}\right] = \frac{1}{p(1 - p)} \cdot \frac{p(1 - p)}{n} = \frac{1}{n}.$$

The risk is constant. [NB: a decision rule with constant risk is called an equalizer rule.]

We now show that p̂ is the Bayes rule with respect to the U(0, 1) prior, which, by the theorem above, will imply that p̂ is minimax.

$$\pi(p \mid y) \propto p^y (1 - p)^{n - y}$$

so we want the value of a which minimises

$$\int_0^1 \frac{(p - a)^2}{p(1 - p)}\, p^y (1 - p)^{n - y}\, dp.$$

Differentiation with respect to a yields

$$a = \frac{\int_0^1 p^y (1 - p)^{n - y - 1}\, dp}{\int_0^1 p^{y - 1} (1 - p)^{n - y - 1}\, dp}.$$

Now a β-distribution with parameters α, β has p.d.f.

$$f(x) = \frac{\Gamma(\alpha + \beta)\, x^{\alpha - 1}(1 - x)^{\beta - 1}}{\Gamma(\alpha)\,\Gamma(\beta)}, \qquad x \in [0, 1],$$

so

$$a = \frac{\Gamma(y + 1)\,\Gamma(n - y)}{\Gamma(n + 1)} \cdot \frac{\Gamma(n)}{\Gamma(y)\,\Gamma(n - y)}$$

and, using the identity Γ(α + 1) = αΓ(α),

$$a = \frac{y}{n}, \qquad \text{i.e. } \hat{p} = Y/n.$$
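Both parts of this example can be checked numerically: that the posterior expected weighted loss under the U(0, 1) prior is minimised at y/n, and that the risk of p̂ = Y/n is the constant 1/n. A minimal sketch with illustrative n and y:

```python
import numpy as np
from scipy import integrate, optimize

n, y = 12, 4

# Posterior expected loss under the U(0, 1) prior, up to a normalising constant:
# the weight 1/(p(1-p)) has been absorbed into the exponents.
def expected_loss(a):
    integrand = lambda p: (p - a) ** 2 * p ** (y - 1) * (1 - p) ** (n - y - 1)
    return integrate.quad(integrand, 0, 1)[0]

a_star = optimize.minimize_scalar(expected_loss, bounds=(0.01, 0.99), method='bounded').x
print(a_star, y / n)   # the two should agree

# Risk of p_hat = Y/n under the weighted loss: E[(p - Y/n)^2] / (p(1 - p)) = 1/n.
rng = np.random.default_rng(1)
for p in (0.1, 0.5, 0.9):
    yy = rng.binomial(n, p, size=400_000)
    print(p, np.mean((p - yy / n) ** 2) / (p * (1 - p)), 1 / n)
```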

9.5 Bayes estimates

(i) The mean of π(θ | x) is the Bayes estimate with respect to quadratic loss.

Proof: Choose δ(x) to minimise

$$\int_\Theta [\theta - \delta(x)]^2\, \pi(\theta \mid x)\, d\theta$$

by differentiating with respect to δ and equating to zero. Then

$$\delta(x) = \int_\Theta \theta\,\pi(\theta \mid x)\, d\theta$$

since ∫_Θ π(θ | x) dθ = 1.

(ii) The median of π(θ | x) is the Bayes estimate with respect to modular loss (i.e. absolute value loss).

Proof: Choose δ(x) to minimise

$$\int_\Theta |\theta - \delta(x)|\, \pi(\theta \mid x)\, d\theta$$

$$= \int_{-\infty}^{\delta(x)} (\delta(x) - \theta)\,\pi(\theta \mid x)\, d\theta + \int_{\delta(x)}^{\infty} (\theta - \delta(x))\,\pi(\theta \mid x)\, d\theta.$$

Differentiate with respect to δ and equate to zero:

$$\int_{-\infty}^{\delta(x)} \pi(\theta \mid x)\, d\theta = \int_{\delta(x)}^{\infty} \pi(\theta \mid x)\, d\theta
\;\Rightarrow\; 2\int_{-\infty}^{\delta(x)} \pi(\theta \mid x)\, d\theta = \int_{-\infty}^{\infty} \pi(\theta \mid x)\, d\theta = 1,$$

so that

$$\int_{-\infty}^{\delta(x)} \pi(\theta \mid x)\, d\theta = \tfrac{1}{2},$$

and δ(x) is the median of π(θ | x).

(iii) The mode of π(θ | x) is the Bayes estimate with respect to zero-one loss (i.e. a loss of zero for a correct estimate and a loss of one for any incorrect estimate).

Proof: The loss function has the form

$$L(\theta, \delta(x)) = \begin{cases} 1, & \text{if } \delta(x) \neq \theta,\\ 0, & \text{if } \delta(x) = \theta. \end{cases}$$

This is only a useful idea for a discrete distribution.

$$\sum_{\theta\in\Theta} L(\theta, \delta(x))\,\pi(\theta \mid x) = \sum_{\theta\in\Theta\setminus\{\theta^*\}} \pi(\theta \mid x) = 1 - \pi(\theta^* \mid x)$$

where δ(x) = θ*. Minimisation means maximising π(θ* | x), so δ(x) is taken to be the ‘most likely’ value, the mode.

Remember that π(θ | x) represents our most up-to-date belief in θ and is all that is needed for inference.
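For the Γ posterior obtained in the Poisson example of Section 9.3, all three Bayes estimates are available directly. A minimal sketch; the values of Σxᵢ, n and α are illustrative, and the mode here is the continuous analogue of the zero-one-loss argument above:

```python
from scipy import stats

s, n, alpha = 12, 5, 2.0                 # sum of counts, sample size, prior shape
post = stats.gamma(a=s + alpha, scale=1.0 / (n + 1))

mean = post.mean()                       # Bayes estimate under quadratic loss: (s + alpha)/(n + 1)
median = post.median()                   # Bayes estimate under modular loss
mode = (s + alpha - 1) / (n + 1)         # posterior mode (valid since s + alpha > 1)

print(mean, median, mode)
```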

9.6 Credible intervals

Definition

The interval (θ₁, θ₂) is a 100(1 − α)% credible interval for θ if

$$\int_{\theta_1}^{\theta_2} \pi(\theta \mid x)\, d\theta = 1 - \alpha, \qquad 0 \le \alpha \le 1.$$

[Figure: a posterior density π(θ | x) showing two different intervals (θ₁, θ₂) and (θ₁′, θ₂′), each containing posterior probability 1 − α.]

Both (θ₁, θ₂) and (θ₁′, θ₂′) are 100(1 − α)% credible intervals.

Definition

The interval (θ₁, θ₂) is a 100(1 − α)% highest posterior density interval if

(i) (θ₁, θ₂) is a 100(1 − α)% credible interval;

(ii) for all θ ∈ (θ₁, θ₂) and all θ′ ∉ (θ₁, θ₂),

π(θ | x) ≥ π(θ′ | x).

Obviously this defines the shortest possible 100(1 − α)% credible interval.

[Figure: a unimodal posterior density π(θ | x); the highest posterior density interval (θ₁, θ₂) contains probability 1 − α, and the density inside the interval is nowhere lower than outside it.]
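For a unimodal posterior, the highest posterior density interval is the shortest 100(1 − α)% credible interval, so it can be found by sliding the lower tail probability until the interval width is minimised. A minimal sketch for an illustrative Beta posterior:

```python
from scipy import stats, optimize

post = stats.beta(a=5, b=15)    # illustrative unimodal posterior
alpha = 0.05

def width(lower_tail):
    # Credible interval holding probability 1 - alpha, starting at this lower tail.
    return post.ppf(lower_tail + 1 - alpha) - post.ppf(lower_tail)

res = optimize.minimize_scalar(width, bounds=(0.0, alpha), method='bounded')
lo, hi = post.ppf(res.x), post.ppf(res.x + 1 - alpha)
print((lo, hi), post.pdf(lo), post.pdf(hi))   # the densities at the two ends should match
```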

9.7 Features of Bayesian models

(i) If there is a sufficient statistic, say T, the posterior distribution will depend upon the data only through T.

(ii) There is a natural family of priors such that the posterior distributions also belong to this family. Such priors are called conjugate priors.

Example: Binomial likelihood, beta prior

A binomial likelihood has the form

$$p(x \mid \theta) = \binom{n}{x}\, \theta^x (1 - \theta)^{n - x}, \qquad x = 0, 1, \ldots, n,$$

and a beta prior is

$$\pi(\theta) = \frac{\theta^{a - 1}(1 - \theta)^{b - 1}}{B(a, b)}, \qquad \theta \in [0, 1],$$

where

$$B(a, b) = \frac{\Gamma(a)\,\Gamma(b)}{\Gamma(a + b)}.$$

Posterior ∝ Likelihood × Prior, so that

$$\pi(\theta \mid x) \propto \theta^{x + a - 1} (1 - \theta)^{n - x + b - 1}$$

and comparison with the prior leads to

$$\pi(\theta \mid x) = \frac{\theta^{x + a - 1} (1 - \theta)^{n - x + b - 1}}{B(x + a,\, n - x + b)}, \qquad \theta \in [0, 1],$$

which is also a beta distribution.
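The conjugate update is just bookkeeping on the beta parameters. A minimal sketch; the prior parameters and data are illustrative:

```python
from scipy import stats

a, b = 2, 2          # Beta(a, b) prior for theta
n, x = 20, 7         # observe x successes out of n

posterior = stats.beta(x + a, n - x + b)     # Beta(x + a, n - x + b), as above
print(posterior.mean())                      # (x + a) / (n + a + b)
print(posterior.interval(0.95))              # an equal-tailed 95% credible interval
```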

9.8 Hypothesis testing

The Bayesian only has a concept of a hypothesis test in certain very specific situations. This is because of the alien idea of having to regard a parameter as a fixed point. However, the concept has some validity in an example such as the following.

Example: Water samples are taken from a reservoir and checked for bacterial content. These provide information about θ, the proportion of infected samples. If it is believed that the bacterial density is such that more than 1% of samples are infected, there is a legal obligation to change the chlorination procedure. The Bayesian procedure is:

(i) use the data to obtain the posterior distribution of θ;

(ii) calculate P(θ>0.01).

EC rules are formulated in statistical terms and, in such cases, allow for a 5% significance level. This could be interpreted in a Bayesian sense as allowing for a wrong decision with probability no more than 0.05, so order increased chlorination if P(θ>0.01) > 0.05.
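If the water-test data are modelled as binomial with a conjugate beta prior for θ, the decision rule above is a one-line posterior calculation. A minimal sketch with made-up numbers:

```python
from scipy import stats

a, b = 1, 1          # Beta(1, 1), i.e. uniform, prior for theta
n, x = 200, 1        # hypothetical data: 1 infected sample out of 200 tested

posterior = stats.beta(x + a, n - x + b)
p_exceeds = posterior.sf(0.01)     # P(theta > 0.01 | data)
print(p_exceeds, "increase chlorination" if p_exceeds > 0.05 else "no action")
```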

Where for some reason (e.g. a legal one) a fixed parameter value is of interest because it defines a threshold, a Bayesian concept of a test is possible. It is usually performed either by calculating a credible bound or a credible interval and examining it in relation to the threshold value or values, or by calculating posterior probabilities based on the threshold or thresholds.

9.9 Inference for normal distributions

9.9.1 Unknown mean, known variance

Consider data from N(θ, σ²) and a prior θ ~ N(τ, κ²).

Posterior ∝ Likelihood × Prior, so, since

$$f(x \mid \theta, \sigma^2) = (2\pi\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum (x_i - \theta)^2\right\}$$

and

$$\pi(\theta) = \frac{1}{\sqrt{2\pi\kappa^2}}\exp\left\{-\frac{1}{2\kappa^2}(\theta - \tau)^2\right\}, \qquad \theta \in \mathbb{R},$$

we have

$$\pi(\theta \mid x, \sigma^2, \tau, \kappa^2) \propto \exp\left\{-\frac{n}{2\sigma^2}(\theta^2 - 2\theta\bar{x}) - \frac{1}{2\kappa^2}(\theta^2 - 2\theta\tau)\right\}.$$

But

$$\frac{n}{2\sigma^2}(\theta^2 - 2\theta\bar{x}) + \frac{1}{2\kappa^2}(\theta^2 - 2\theta\tau)
= \frac{1}{2}\left(\frac{n}{\sigma^2} + \frac{1}{\kappa^2}\right)\theta^2 - \left(\frac{n\bar{x}}{\sigma^2} + \frac{\tau}{\kappa^2}\right)\theta
= \frac{1}{2}\left(\frac{n}{\sigma^2} + \frac{1}{\kappa^2}\right)\left(\theta - \frac{n\bar{x}/\sigma^2 + \tau/\kappa^2}{n/\sigma^2 + 1/\kappa^2}\right)^2 + \text{garbage},$$

where “garbage” collects terms not involving θ, so the posterior p.d.f. is proportional to

$$\exp\left\{-\frac{1}{2\left(n/\sigma^2 + 1/\kappa^2\right)^{-1}}\left(\theta - \frac{n\bar{x}/\sigma^2 + \tau/\kappa^2}{n/\sigma^2 + 1/\kappa^2}\right)^2\right\}.$$

This means that

$$\theta \mid x, \sigma^2, \tau, \kappa^2 \;\sim\; N\!\left(\frac{n\bar{x}/\sigma^2 + \tau/\kappa^2}{n/\sigma^2 + 1/\kappa^2},\; \left(\frac{n}{\sigma^2} + \frac{1}{\kappa^2}\right)^{-1}\right).$$

The posterior mean is a weighted average: the prior mean τ has weight 1/κ², which is 1/(prior variance), and the observed sample mean x̄ has weight n/σ², which is 1/(variance of the sample mean). This is reasonable because

$$\lim_{n\to\infty} \frac{n\bar{x}/\sigma^2 + \tau/\kappa^2}{n/\sigma^2 + 1/\kappa^2} = \bar{x}
\qquad\text{and}\qquad
\lim_{\kappa^2\to\infty} \frac{n\bar{x}/\sigma^2 + \tau/\kappa^2}{n/\sigma^2 + 1/\kappa^2} = \bar{x}.$$

The posterior mean tends to x̄ as the amount of data swamps the prior, or as prior beliefs become very vague.
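The precision-weighted update takes only a few lines. A minimal sketch; the data and prior values are illustrative:

```python
import numpy as np

x = np.array([4.1, 3.7, 5.2, 4.8, 4.4])   # data assumed N(theta, sigma^2)
sigma2 = 1.0                              # known data variance
tau, kappa2 = 0.0, 10.0                   # prior: theta ~ N(tau, kappa^2)

n, xbar = len(x), x.mean()
w_data, w_prior = n / sigma2, 1.0 / kappa2            # precisions act as weights
post_var = 1.0 / (w_data + w_prior)
post_mean = (w_data * xbar + w_prior * tau) * post_var

print(post_mean, post_var)   # the posterior mean shrinks xbar slightly towards tau
```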

9.9.2 Unknown variance, known mean

We need to assign a prior p.d.f. π(σ²). The Γ family of p.d.f.’s provides a flexible set of shapes over [0, ∞). We take

$$\tau = \frac{1}{\sigma^2} \sim \Gamma\!\left(\frac{\nu}{2}, \frac{\nu\lambda}{2}\right),$$

where λ and ν are chosen to provide suitable location and shape. In other words, the prior p.d.f. of τ = 1/σ² is chosen to be of the form

$$\pi(\tau) = \frac{(\nu\lambda)^{\nu/2}\, \tau^{\nu/2 - 1}}{2^{\nu/2}\, \Gamma(\nu/2)}\exp\left(-\frac{\nu\lambda\tau}{2}\right), \qquad \tau \ge 0,$$

or

$$\pi(\sigma^2) = \frac{(\nu\lambda)^{\nu/2}\, (\sigma^2)^{-\nu/2 - 1}}{2^{\nu/2}\, \Gamma(\nu/2)}\exp\left(-\frac{\nu\lambda}{2\sigma^2}\right).$$

τ = 1/σ² is called the precision. The posterior p.d.f. is given by

$$\pi(\sigma^2 \mid x, \theta) \propto f(x \mid \theta, \sigma^2)\,\pi(\sigma^2),$$

and the right-hand side, as a function of σ², is proportional to

$$(\sigma^2)^{-n/2}\exp\left\{-\frac{1}{2\sigma^2}\sum (x_i - \theta)^2\right\} \times (\sigma^2)^{-\nu/2 - 1}\exp\left(-\frac{\nu\lambda}{2\sigma^2}\right).$$

Thus π(σ² | x, θ) is proportional to

$$(\sigma^2)^{-(\nu + n)/2 - 1}\exp\left\{-\frac{1}{2\sigma^2}\left[\nu\lambda + \sum (x_i - \theta)^2\right]\right\}.$$

This is the same form as the prior except that ν is replaced by ν + n and λ is replaced by

$$\frac{\nu\lambda + \sum (x_i - \theta)^2}{\nu + n}.$$

Consider the random variable

$$W = \frac{\nu\lambda + \sum (x_i - \theta)^2}{\sigma^2}.$$

Clearly

$$f_W(w) = C\, w^{(\nu + n)/2 - 1} e^{-w/2}, \qquad w \ge 0.$$

This is a χ²-distribution with ν + n degrees of freedom. In other words,

$$\frac{\nu\lambda + \sum (x_i - \theta)^2}{\sigma^2} \sim \chi^2(\nu + n).$$

Again this agrees with intuition: as n → ∞ we approach the classical estimate, and also as ν → 0, which corresponds to vague prior knowledge.
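In practice the update is again bookkeeping: ν becomes ν + n and νλ becomes νλ + Σ(xᵢ − θ)². A minimal sketch using the χ²(ν + n) result above; the prior settings and data are illustrative:

```python
import numpy as np
from scipy import stats

theta = 0.0                                   # known mean
x = np.array([0.8, -1.1, 0.4, 1.9, -0.6])     # data assumed N(theta, sigma^2)
nu, lam = 2.0, 1.0                            # prior: 1/sigma^2 ~ Gamma(nu/2, rate nu*lam/2)

n = len(x)
S = nu * lam + np.sum((x - theta) ** 2)       # S / sigma^2 ~ chi^2(nu + n) a posteriori

chi2 = stats.chi2(nu + n)
print("posterior mean of sigma^2:", S / (nu + n - 2))            # needs nu + n > 2
print("95% credible interval:", (S / chi2.ppf(0.975), S / chi2.ppf(0.025)))
```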

9.9.3 Mean and variance unknown

We now have to assign a joint prior p.d.f. π(θ, σ²) to obtain a joint posterior p.d.f.

$$\pi(\theta, \sigma^2 \mid x) \propto f(x \mid \theta, \sigma^2)\,\pi(\theta, \sigma^2).$$

We shall look at the special case of a non-informative prior. From theoretical considerations beyond the scope of this course, a non-informative prior may be approximated by

$$\pi(\theta, \sigma^2) \propto \frac{1}{\sigma^2}.$$

Note that this is not to be thought of as an expression of prior beliefs, but rather as an approximate prior which allows the posterior to be dominated by the data.

$$\pi(\theta, \sigma^2 \mid x) \propto (\sigma^2)^{-n/2 - 1}\exp\left\{-\frac{1}{2\sigma^2}\sum (x_i - \theta)^2\right\}.$$

The right-hand side may be written in the form

$$(\sigma^2)^{-n/2 - 1}\exp\left\{-\frac{1}{2\sigma^2}\left[\sum (x_i - \bar{x})^2 + n(\bar{x} - \theta)^2\right]\right\}.$$

We can use this form to integrate out, in turn, θ and σ².

Using

$$\int_{-\infty}^{\infty} \exp\left\{-\frac{n}{2\sigma^2}(\theta - \bar{x})^2\right\} d\theta = \sqrt{\frac{2\pi\sigma^2}{n}},$$

we find

$$\pi(\sigma^2 \mid x) \propto (\sigma^2)^{-(n-1)/2 - 1}\exp\left\{-\frac{1}{2\sigma^2}\sum (x_i - \bar{x})^2\right\}$$

and, putting W = Σ(xᵢ − x̄)²/σ²,

$$f_W(w) = C\, w^{(n-1)/2 - 1} e^{-w/2}, \qquad w \ge 0,$$

so we see that

$$\frac{\sum (x_i - \bar{x})^2}{\sigma^2} \sim \chi^2(n - 1).$$

To integrate out σ², make the substitution σ² = τ⁻¹. Then π(θ | x) is proportional to

$$\int_0^\infty \tau^{n/2 - 1}\exp\left\{-\frac{\tau}{2}\left[\sum (x_i - \bar{x})^2 + n(\bar{x} - \theta)^2\right]\right\} d\tau.$$

From the definition of the Γ-function,

$$\int_0^\infty y^{\alpha - 1} e^{-\beta y}\, dy = \frac{\Gamma(\alpha)}{\beta^\alpha},$$

and we see, since only the terms in θ are relevant,

$$\pi(\theta \mid x) \propto \left[\sum (x_i - \bar{x})^2 + n(\bar{x} - \theta)^2\right]^{-n/2},$$

or

$$\pi(\theta \mid x) \propto \left(1 + \frac{t^2}{n - 1}\right)^{-n/2},$$

where

$$t = \frac{\sqrt{n}\,(\theta - \bar{x})}{\sqrt{\sum (x_i - \bar{x})^2 / (n - 1)}}.$$

This is the kernel of a Student t-density, so, a posteriori, t has a t-distribution with n − 1 degrees of freedom.
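This marginal t result can be checked by sampling the joint posterior: draw σ² from its posterior (using Σ(xᵢ − x̄)²/σ² ~ χ²(n − 1)) and then θ | σ², x from N(x̄, σ²/n). A minimal sketch with simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(5.0, 2.0, size=15)               # illustrative data
n, xbar = len(x), x.mean()
ss = np.sum((x - xbar) ** 2)

# Sample the joint posterior under pi(theta, sigma^2) proportional to 1/sigma^2.
draws = 200_000
sigma2 = ss / rng.chisquare(n - 1, size=draws)  # since ss/sigma^2 ~ chi^2(n-1)
theta = rng.normal(xbar, np.sqrt(sigma2 / n))   # theta | sigma^2, x ~ N(xbar, sigma^2/n)

# Compare t = sqrt(n)(theta - xbar)/s with a t(n-1) distribution, where s^2 = ss/(n-1).
t = np.sqrt(n) * (theta - xbar) / np.sqrt(ss / (n - 1))
print(stats.kstest(t, stats.t(df=n - 1).cdf))   # the KS statistic should be tiny
```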
