9 Bayesian Inference
[Portrait: Thomas Bayes, 1702 - 1761]

9.1 Subjective probability

This is probability regarded as degree of belief. A subjective probability of an event A is assessed as p if you are prepared to stake £pM to win £M and equally prepared to accept a stake of £pM to win £M. In other words, the bet is fair and you are assumed to behave rationally.

9.1.1 Kolmogorov's axioms

How does subjective probability fit in with the fundamental axioms? Let A be the set of all subsets of a countable sample space Ω. Then

(i) P(A) ≥ 0 for every A ∈ A;

(ii) P(Ω) = 1;

(iii) if {A_λ : λ ∈ Λ} is a countable set of mutually exclusive events belonging to A, then
$$P\left(\bigcup_{\lambda \in \Lambda} A_\lambda\right) = \sum_{\lambda \in \Lambda} P(A_\lambda).$$

Obviously the subjective interpretation has no difficulty in conforming with (i) and (ii). Axiom (iii) is slightly less obvious. Suppose we have two events A and B such that A ∩ B = ∅. Consider a stake of £p_A M to win £M if A occurs and a stake of £p_B M to win £M if B occurs. The total stake for bets on A or B occurring is £p_A M + £p_B M to win £M if A or B occurs. Thus we have £(p_A + p_B)M to win £M, and so

P(A ∪ B) = P(A) + P(B).

9.1.2 Conditional probability

Define p_B, p_{AB}, p_{A|B} such that

£p_B M is the fair stake for £M if B occurs;

£p_{AB} M is the fair stake for £M if A and B occur;

£p_{A|B} M is the fair stake for £M if A occurs given B has occurred; otherwise the bet is off.

Let the three stakes be M_1 (on A given B), M_2 (on B) and M_3 (on A and B), and consider the three alternative outcomes: B does not occur, B occurs without A, and both A and B occur. If G_1, G_2, G_3 are the corresponding gains, then
$$
\begin{aligned}
\bar{B} &: \quad -p_B M_2 - p_{AB} M_3 = G_1 \\
\bar{A} \cap B &: \quad -p_{A|B} M_1 + (1 - p_B) M_2 - p_{AB} M_3 = G_2 \\
A \cap B &: \quad (1 - p_{A|B}) M_1 + (1 - p_B) M_2 + (1 - p_{AB}) M_3 = G_3
\end{aligned}
$$
The principle of rationality does not allow the possibility of setting the stakes to obtain a sure profit: if the determinant of the coefficients were non-zero, the stakes M_1, M_2, M_3 could be chosen to make all three gains positive. Thus
$$
\begin{vmatrix}
0 & -p_B & -p_{AB} \\
-p_{A|B} & 1 - p_B & -p_{AB} \\
1 - p_{A|B} & 1 - p_B & 1 - p_{AB}
\end{vmatrix} = 0
\quad\Rightarrow\quad p_{AB} - p_B\, p_{A|B} = 0,
$$
or
$$P(A \mid B) = \frac{P(A \cap B)}{P(B)}.$$

9.2 Estimation formulated as a decision problem

A    The action space, the set of all possible actions a ∈ A available.

Θ    The parameter space, consisting of all possible "states of nature", only one of which occurs or will occur (this "true" state being unknown).

L    The loss function, having domain Θ × A (the set of all possible consequences (θ, a)) and codomain R.

R_X  The sample space, containing all possible realisations x ∈ R_X of a random variable X belonging to the family {f(x; θ) : θ ∈ Θ}.

D    The decision space, consisting of all possible decision functions δ ∈ D, each decision function having domain R_X and codomain A.

9.2.1 The basic idea

• The true state of the world is unknown when the action is chosen.
• The loss which results from each of the possible consequences (θ, a) is known.
• Data are collected to provide information about the unknown θ.
• A choice of action is taken for each possible observation of X.

Decision theory is concerned with the criteria for making this choice. The form of the loss function is crucial: it depends upon the nature of the problem.

Example

You are a fashion buyer and must stock up with trendy dresses. The true number you will sell is θ (unknown). You stock up with a dresses. If you overstock you must sell off the surplus at your end-of-season sale and lose £A per dress. If you understock you lose the profit of £B per dress. Thus
$$
L(\theta, a) =
\begin{cases}
A(a - \theta), & a \ge \theta, \\
B(\theta - a), & a \le \theta.
\end{cases}
$$

Special forms of loss function

If A = B, you have modular loss,
$$L(\theta, a) = A\,|\theta - a|.$$
To suit comparison with classical inference, quadratic loss is often used:
$$L(\theta, a) = C(\theta - a)^2.$$
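To make these loss functions concrete, here is a minimal Python sketch (the numerical values of A, B, C and θ are illustrative assumptions, not part of the notes) that evaluates the buyer's asymmetric loss together with its modular and quadratic special cases.

```python
# Sketch: the loss functions of Section 9.2.1 with illustrative constants.

def stock_loss(theta, a, A=2.0, B=5.0):
    """Asymmetric linear loss for the fashion-buyer example:
    overstocking costs A per surplus dress, understocking costs B per lost sale."""
    return A * (a - theta) if a >= theta else B * (theta - a)

def modular_loss(theta, a, A=2.0):
    """Special case A = B: loss proportional to |theta - a|."""
    return A * abs(theta - a)

def quadratic_loss(theta, a, C=1.0):
    """Quadratic loss, convenient for comparison with classical inference."""
    return C * (theta - a) ** 2

if __name__ == "__main__":
    theta_true = 100           # dresses actually demanded (unknown in practice)
    for a in (80, 100, 120):   # candidate stock levels
        print(a, stock_loss(theta_true, a),
              modular_loss(theta_true, a), quadratic_loss(theta_true, a))
```

Running the sketch shows the asymmetry: with B > A, understocking by 20 dresses costs more than overstocking by 20.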
9.2.2 The risk function

This is a measure of the quality of a decision function. Suppose we choose the decision δ(x) (i.e. choose a = δ(x)). The function R with domain Θ × D and codomain R defined by
$$R(\theta, \delta) = \int_{R_X} L(\theta, \delta(x))\, f(x; \theta)\, dx$$
or
$$R(\theta, \delta) = \sum_{x \in R_X} L(\theta, \delta(x))\, p(x; \theta)$$
is called the risk function. We may write this as
$$R(\theta, \delta) = E\left[L(\theta, \delta(X))\right].$$

Example

Suppose X = (X_1, ..., X_n) is a random sample from N(θ, θ) and suppose we assume quadratic loss. Consider
$$\delta_1(X) = \bar{X}, \qquad \delta_2(X) = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2.$$
For the first rule, R(θ, δ_1) = E[(θ − X̄)²]. We know that E(X̄) = θ, so
$$R(\theta, \delta_1) = V(\bar{X}) = \frac{\theta}{n}.$$
For the second, R(θ, δ_2) = E[(θ − δ_2(X))²]. But E[δ_2(X)] = θ, so that
$$
R(\theta, \delta_2) = V\left[\delta_2(X)\right]
= \frac{\theta^2}{(n-1)^2}\, V\left[\sum_{i=1}^{n} \frac{(X_i - \bar{X})^2}{\theta}\right]
= \frac{\theta^2}{(n-1)^2} \cdot 2(n-1) = \frac{2\theta^2}{n-1},
$$
since the sum in brackets has a χ²_{n−1} distribution, whose variance is 2(n − 1).

[Figure: the risk functions R(θ, δ_1) and R(θ, δ_2) plotted against θ; the curves cross at θ = (n − 1)/(2n).]

Without knowing θ, how do we choose between δ_1 and δ_2? Two sensible criteria:

1. Take precautions against the worst.
2. Take into account any further information you may have.

Criterion 1 leads to minimax. Choose the decision leading to the smallest maximum risk, i.e. choose δ* ∈ D such that
$$\sup_{\theta \in \Theta} \{R(\theta, \delta^*)\} = \inf_{\delta \in D} \sup_{\theta \in \Theta} \{R(\theta, \delta)\}.$$
Minimax rules have disadvantages, the main one being that they treat Nature as an adversary. Minimax guards against the worst case, but the value of θ should not be regarded as a value deliberately chosen by Nature to make the risk large. Fortunately, in many situations minimax rules turn out to be reasonable in practice, if not in concept.

Criterion 2 involves introducing beliefs about θ into the analysis. If π(θ) represents our beliefs about θ, then define the Bayes risk r(π, δ) as
$$
r(\pi, \delta) =
\begin{cases}
\displaystyle\int_{\Theta} R(\theta, \delta)\, \pi(\theta)\, d\theta & \text{if } \theta \text{ is continuous}, \\[1ex]
\displaystyle\sum_{\theta \in \Theta} R(\theta, \delta)\, \pi(\theta) & \text{if } \theta \text{ is discrete}.
\end{cases}
$$
Now choose δ* such that
$$r(\pi, \delta^*) = \inf_{\delta \in D} \{r(\pi, \delta)\}.$$

Bayes Rule

The Bayes rule with respect to a prior π is the decision rule δ* that minimises the Bayes risk among all possible decision rules.

[Photograph: Bayes' tomb]

9.2.3 Decision terminology

δ* is called a Bayes decision function; δ*(X) is a Bayes estimator and δ*(x) is a Bayes estimate.

If a decision function δ is such that there exists δ′ satisfying
$$R(\theta, \delta') \le R(\theta, \delta) \quad \forall\, \theta \in \Theta$$
and
$$R(\theta, \delta') < R(\theta, \delta) \quad \text{for some } \theta \in \Theta,$$
then δ is dominated by δ′ and δ is said to be inadmissible. All other decisions are admissible.

9.3 Prior and posterior distributions

The prior distribution represents our initial belief about the parameter θ. This means that θ is regarded as a random variable with p.d.f. π(θ). From Bayes' Theorem,
$$f(x \mid \theta)\, \pi(\theta) = \pi(\theta \mid x)\, f(x)$$
or
$$\pi(\theta \mid x) = \frac{f(x \mid \theta)\, \pi(\theta)}{\int_{\Theta} f(x \mid \theta)\, \pi(\theta)\, d\theta}.$$
Here π(θ | x) is called the posterior p.d.f. It represents our updated belief about θ, having taken account of the data. We write the expression in the form
$$\pi(\theta \mid x) \propto f(x \mid \theta)\, \pi(\theta)$$
or
$$\text{Posterior} \propto \text{Likelihood} \times \text{Prior}.$$

Example

Suppose X_1, X_2, ..., X_n is a random sample from a Poisson distribution, Poi(θ), where θ has the prior distribution
$$\pi(\theta) = \frac{\theta^{\alpha-1} e^{-\theta}}{\Gamma(\alpha)}, \qquad \theta \ge 0,\ \alpha > 0.$$
The posterior distribution is given by
$$\pi(\theta \mid x) \propto \frac{\theta^{\sum x_i}\, e^{-n\theta}}{\prod x_i!} \cdot \frac{\theta^{\alpha-1} e^{-\theta}}{\Gamma(\alpha)},$$
which simplifies to
$$\pi(\theta \mid x) \propto \theta^{\sum x_i + \alpha - 1}\, e^{-(n+1)\theta}, \qquad \theta \ge 0.$$
This is recognisable as a Γ-distribution and, by comparison with π(θ), we can write
$$\pi(\theta \mid x) = \frac{(n+1)^{\sum x_i + \alpha}\, \theta^{\sum x_i + \alpha - 1}\, e^{-(n+1)\theta}}{\Gamma\!\left(\sum x_i + \alpha\right)}, \qquad \theta \ge 0,$$
which represents our posterior belief about θ.
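To illustrate this Poisson-Gamma update, here is a minimal Python sketch under stated assumptions: the data and α are invented, and the scipy dependency is mine rather than the notes'. It forms the conjugate posterior Gamma(Σx_i + α, rate n + 1) and compares it with a brute-force normalisation of likelihood × prior on a grid.

```python
import numpy as np
from scipy import stats

# Hypothetical data and prior hyperparameter (not from the notes).
x = np.array([2, 0, 3, 1, 4])   # Poisson(theta) sample
n, alpha = len(x), 2.0

# Conjugate update: prior Gamma(alpha, rate 1) -> posterior Gamma(sum(x)+alpha, rate n+1).
post = stats.gamma(a=x.sum() + alpha, scale=1.0 / (n + 1))

# Brute-force check: normalise likelihood x prior on a grid of theta values.
theta = np.linspace(1e-6, 10.0, 2001)
dtheta = theta[1] - theta[0]
unnorm = theta ** (x.sum() + alpha - 1) * np.exp(-(n + 1) * theta)
grid_post = unnorm / (unnorm.sum() * dtheta)

print("closed-form density at theta=2:", post.pdf(2.0))
print("grid approximation at theta=2: ", np.interp(2.0, theta, grid_post))
```

The two printed values should agree to several decimal places, confirming that the algebraic simplification above identifies the correct Γ-distribution.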
9.4 Decisions

We have already defined the Bayes risk to be
$$r(\pi, \delta) = \int_{\Theta} R(\theta, \delta)\, \pi(\theta)\, d\theta$$
where
$$R(\theta, \delta) = \int_{R_X} L(\theta, \delta(x))\, f(x \mid \theta)\, dx,$$
so that
$$
r(\pi, \delta) = \int_{\Theta} \int_{R_X} L(\theta, \delta(x))\, f(x \mid \theta)\, dx\, \pi(\theta)\, d\theta
= \int_{R_X} \left[ \int_{\Theta} L(\theta, \delta(x))\, \pi(\theta \mid x)\, d\theta \right] f(x)\, dx.
$$
For any given x, minimising the Bayes risk is therefore equivalent to minimising
$$\int_{\Theta} L(\theta, \delta(x))\, \pi(\theta \mid x)\, d\theta,$$
i.e. we minimise the posterior expected loss.

Example

Take the posterior distribution in the previous example and, for no special reason, assume quadratic loss L(θ, δ(x)) = [θ − δ(x)]². Then we minimise
$$\int_{\Theta} [\theta - \delta(x)]^2\, \pi(\theta \mid x)\, d\theta$$
by differentiating with respect to δ to obtain
$$-2 \int_{\Theta} [\theta - \delta(x)]\, \pi(\theta \mid x)\, d\theta = 0
\quad\Rightarrow\quad \delta(x) = \int_{\Theta} \theta\, \pi(\theta \mid x)\, d\theta.$$
This is the mean of the posterior distribution. Here
$$
\delta(x) = \int_0^{\infty} \theta\, \frac{(n+1)^{\sum x_i + \alpha}\, \theta^{\sum x_i + \alpha - 1}\, e^{-(n+1)\theta}}{\Gamma\!\left(\sum x_i + \alpha\right)}\, d\theta
= \frac{(n+1)^{\sum x_i + \alpha}\, \Gamma\!\left(\sum x_i + \alpha + 1\right)}{\Gamma\!\left(\sum x_i + \alpha\right) (n+1)^{\sum x_i + \alpha + 1}}
= \frac{\sum x_i + \alpha}{n+1}.
$$

Theorem

Suppose that δ_π is a decision rule that is Bayes with respect to some prior distribution π. If the risk function of δ_π satisfies
$$R(\theta, \delta_\pi) \le r(\pi, \delta_\pi) \quad \text{for all } \theta \in \Theta,$$
then δ_π is a minimax decision rule.
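As a numerical check on the result δ(x) = (Σx_i + α)/(n + 1), the following sketch reuses the hypothetical data and α from the previous snippet, evaluates the posterior expected quadratic loss on a grid of candidate actions, and confirms that the minimiser sits at the posterior mean.

```python
import numpy as np

# Hypothetical data and prior hyperparameter, as in the previous sketch.
x = np.array([2, 0, 3, 1, 4])
n, alpha = len(x), 2.0

# Closed-form Bayes estimate under quadratic loss: the posterior mean.
bayes_estimate = (x.sum() + alpha) / (n + 1)

# Posterior density on a grid: Gamma(sum(x)+alpha, rate n+1), normalised numerically.
theta = np.linspace(1e-6, 10.0, 4001)
dtheta = theta[1] - theta[0]
post = theta ** (x.sum() + alpha - 1) * np.exp(-(n + 1) * theta)
post /= post.sum() * dtheta

# Posterior expected quadratic loss for each candidate action a.
actions = np.linspace(0.5, 4.0, 701)
exp_loss = [((theta - a) ** 2 * post).sum() * dtheta for a in actions]
best_action = actions[int(np.argmin(exp_loss))]

print("posterior mean (closed form):   ", bayes_estimate)  # (10 + 2) / 6 = 2.0
print("grid minimiser of expected loss:", best_action)     # should be close to 2.0
```

The grid minimiser agrees with the closed-form posterior mean up to the grid resolution, which is exactly what minimising posterior expected loss under quadratic loss predicts.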