Part II Bayesian Statistics
Chapter 6
Bayesian Statistics: Background

In the frequency interpretation of probability, the probability of an event is the limiting proportion of times the event occurs in an infinite sequence of independent repetitions of the experiment. This interpretation assumes that an experiment can be repeated!

Problems with this interpretation:
• Independence is defined in terms of probabilities; if probabilities are defined in terms of independent events, this leads to a circular definition.
• How can we check whether experiments were independent, without doing more experiments?
• In practice we only ever have a finite number of experiments.

6.1 Subjective probability

Let $P(A)$ denote your personal probability of an event $A$; this is a numerical measure of the strength of your degree of belief that $A$ will occur, in the light of available information.

Your personal probabilities may be associated with a much wider class of events than those to which the frequency interpretation pertains. For example:
- non-repeatable experiments (e.g. that England will win the World Cup next time);
- propositions about nature (e.g. that this surgical procedure results in increased life expectancy).

All subjective probabilities are conditional, and may be revised in the light of additional information. Subjective probabilities are assessments made in the light of incomplete information; they may even refer to events in the past.

6.2 Axiomatic development

Coherence states that a system of beliefs should avoid internal inconsistencies. Basically, a quantitative, coherent belief system must behave as if it were governed by a subjective probability distribution. In particular this assumes that all events of interest can be compared.

Note: different individuals may assign different probabilities to the same event, even if they have identical background information.

See Chapters 2 and 3 in Bernardo and Smith for a fuller treatment of foundational issues.

6.3 Bayes' Theorem and terminology

Bayes' Theorem: Let $B_1, B_2, \ldots, B_k$ be a set of mutually exclusive and exhaustive events. For any event $A$ with $P(A) > 0$,
\[
P(B_i \mid A) = \frac{P(B_i \cap A)}{P(A)} = \frac{P(A \mid B_i)\,P(B_i)}{\sum_{j=1}^{k} P(A \mid B_j)\,P(B_j)}.
\]
Equivalently we write
\[
P(B_i \mid A) \propto P(A \mid B_i)\,P(B_i).
\]

Terminology: $P(B_i)$ is called the prior probability of $B_i$; $P(A \mid B_i)$ is called the likelihood of $A$ given $B_i$; $P(B_i \mid A)$ is called the posterior probability of $B_i$; $P(A)$ is called the predictive probability of $A$ implied by the likelihoods and the prior probabilities.

Example: two events. Assume we have two events $B_1, B_2$; then
\[
\frac{P(B_1 \mid A)}{P(B_2 \mid A)} = \frac{P(B_1)}{P(B_2)} \times \frac{P(A \mid B_1)}{P(A \mid B_2)}.
\]
If the data are relatively more probable under $B_1$ than under $B_2$, our belief in $B_1$ compared to $B_2$ is increased, and conversely. If in addition $B_2 = B_1^c$, then
\[
\frac{P(B_1 \mid A)}{P(B_1^c \mid A)} = \frac{P(B_1)}{P(B_1^c)} \times \frac{P(A \mid B_1)}{P(A \mid B_1^c)}.
\]
It follows that
\[
\text{posterior odds} = \text{prior odds} \times \text{likelihood ratio}.
\]

6.4 Parametric models

A Bayesian statistical model consists of
1. a parametric statistical model $f(x \mid \theta)$ for the data $x$, where $\theta \in \Theta$ is a parameter; $x$ may be multidimensional;
2. a prior distribution $\pi(\theta)$ on the parameter.

Note: the parameter $\theta$ is now treated as random!

The posterior distribution of $\theta$ given $x$ is
\[
\pi(\theta \mid x) = \frac{f(x \mid \theta)\,\pi(\theta)}{\int f(x \mid \theta)\,\pi(\theta)\,d\theta}.
\]
More briefly, we write $\pi(\theta \mid x) \propto f(x \mid \theta)\,\pi(\theta)$, or
\[
\text{posterior} \propto \text{prior} \times \text{likelihood}.
\]
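The relation posterior ∝ prior × likelihood is easy to check numerically. The following minimal sketch (Python, which is not part of these notes; the prior and likelihood values are made up purely for illustration) applies Bayes' Theorem to a finite set of events $B_1, \ldots, B_k$ by normalising prior × likelihood, and verifies the posterior-odds identity for two of the events.

```python
# Bayes' Theorem for mutually exclusive, exhaustive events:
# P(B_i | A) is proportional to the prior P(B_i) times the likelihood P(A | B_i).
# All numbers below are hypothetical, chosen only to illustrate the calculation.

prior = [0.5, 0.3, 0.2]          # P(B_1), P(B_2), P(B_3); must sum to 1
likelihood = [0.10, 0.40, 0.70]  # P(A | B_1), P(A | B_2), P(A | B_3)

unnormalised = [p * l for p, l in zip(prior, likelihood)]
predictive = sum(unnormalised)                # P(A) = sum_j P(A | B_j) P(B_j)
posterior = [u / predictive for u in unnormalised]

print("predictive P(A) =", predictive)
print("posterior       =", posterior)         # sums to 1

# With two events: posterior odds = prior odds * likelihood ratio.
prior_odds = prior[0] / prior[1]
likelihood_ratio = likelihood[0] / likelihood[1]
print("posterior odds  =", prior_odds * likelihood_ratio,
      "=", posterior[0] / posterior[1])
```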
Example: normal distribution. Suppose that $\theta$ has the normal $\mathcal{N}(\mu, \sigma^2)$-distribution. Then, focussing only on what depends on $\theta$,
\[
\begin{aligned}
\pi(\theta) &= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{1}{2\sigma^2}(\theta - \mu)^2\right\}
\propto \exp\left\{-\frac{1}{2\sigma^2}(\theta - \mu)^2\right\} \\
&= \exp\left\{-\frac{1}{2\sigma^2}(\theta^2 - 2\theta\mu + \mu^2)\right\}
\propto \exp\left\{-\frac{1}{2\sigma^2}\theta^2 + \frac{\mu}{\sigma^2}\theta\right\}.
\end{aligned}
\]
Conversely, if there are constants $a > 0$ and $b$ such that
\[
\pi(\theta) \propto \exp\{-a\theta^2 + b\theta\},
\]
then it follows that $\theta$ has the normal $\mathcal{N}\!\left(\frac{b}{2a}, \frac{1}{2a}\right)$-distribution.

6.4.1 Nuisance parameters

Let $\theta = (\psi, \lambda)$, where $\lambda$ is a nuisance parameter. Then $\pi(\theta \mid x) = \pi((\psi, \lambda) \mid x)$. We calculate the marginal posterior of $\psi$,
\[
\pi(\psi \mid x) = \int \pi(\psi, \lambda \mid x)\,d\lambda,
\]
and continue inference with this marginal posterior. Thus we simply integrate out the nuisance parameter.

6.4.2 Prediction

The (prior) predictive distribution of $x$ on the basis of $\pi$ is
\[
p(x) = \int f(x \mid \theta)\,\pi(\theta)\,d\theta.
\]
Suppose that data $x_1$ are available, and we want to predict additional data:
\[
\begin{aligned}
p(x_2 \mid x_1) &= \frac{p(x_2, x_1)}{p(x_1)}
= \frac{\int f(x_2, x_1 \mid \theta)\,\pi(\theta)\,d\theta}{\int f(x_1 \mid \theta)\,\pi(\theta)\,d\theta} \\
&= \int f(x_2 \mid \theta, x_1)\,\frac{f(x_1 \mid \theta)\,\pi(\theta)}{\int f(x_1 \mid \theta)\,\pi(\theta)\,d\theta}\,d\theta
= \int f(x_2 \mid \theta)\,\frac{f(x_1 \mid \theta)\,\pi(\theta)}{\int f(x_1 \mid \theta)\,\pi(\theta)\,d\theta}\,d\theta \\
&= \int f(x_2 \mid \theta)\,\pi(\theta \mid x_1)\,d\theta.
\end{aligned}
\]
Note that $x_2$ and $x_1$ are assumed conditionally independent given $\theta$. They are not, in general, unconditionally independent.

Example: billiard ball. A billiard ball $W$ is rolled from left to right on a line of length 1 with a uniform probability of stopping anywhere on the line. It stops at $p$. A second ball $O$ is then rolled $n$ times under the same assumptions, and $X$ denotes the number of times that the ball $O$ stopped to the left of $W$. Given $X$, what can be said about $p$?

Our prior is $\pi(p) = 1$ for $0 \le p \le 1$; our model is
\[
P(X = x \mid p) = \binom{n}{x} p^x (1-p)^{n-x} \quad \text{for } x = 0, \ldots, n.
\]
We calculate the predictive distribution, for $x = 0, \ldots, n$,
\[
P(X = x) = \int_0^1 \binom{n}{x} p^x (1-p)^{n-x}\,dp
= \binom{n}{x} B(x+1, n-x+1)
= \binom{n}{x} \frac{x!\,(n-x)!}{(n+1)!} = \frac{1}{n+1},
\]
where $B$ is the beta function,
\[
B(\alpha, \beta) = \int_0^1 p^{\alpha-1}(1-p)^{\beta-1}\,dp = \frac{\Gamma(\alpha)\Gamma(\beta)}{\Gamma(\alpha+\beta)}.
\]
We calculate the posterior distribution:
\[
\pi(p \mid x) \propto 1 \times \binom{n}{x} p^x (1-p)^{n-x} \propto p^x (1-p)^{n-x},
\]
so
\[
\pi(p \mid x) = \frac{p^x (1-p)^{n-x}}{B(x+1, n-x+1)};
\]
this is the $\mathrm{Beta}(x+1, n-x+1)$-distribution. In particular the posterior mean is
\[
E(p \mid x) = \int_0^1 p\,\frac{p^x (1-p)^{n-x}}{B(x+1, n-x+1)}\,dp
= \frac{B(x+2, n-x+1)}{B(x+1, n-x+1)} = \frac{x+1}{n+2}.
\]
(For comparison: the m.l.e. is $x/n$.) Further, $P(O \text{ stops to the left of } W \text{ on the next roll} \mid x)$ is Bernoulli-distributed with probability of success $E(p \mid x) = \frac{x+1}{n+2}$.

Example: exponential model, exponential prior. Let $X_1, \ldots, X_n$ be a random sample with density $f(x \mid \theta) = \theta e^{-\theta x}$ for $x \ge 0$, and assume $\pi(\theta) = \mu e^{-\mu\theta}$ for $\theta \ge 0$, for some known $\mu$. Then
\[
f(x_1, \ldots, x_n \mid \theta) = \theta^n e^{-\theta \sum_{i=1}^n x_i},
\]
and hence the posterior distribution is
\[
\pi(\theta \mid x) \propto \theta^n e^{-\theta \sum_{i=1}^n x_i}\,\mu e^{-\mu\theta} \propto \theta^n e^{-\theta(\sum_{i=1}^n x_i + \mu)},
\]
which we recognize as $\mathrm{Gamma}(n+1, \mu + \sum_{i=1}^n x_i)$.

Example: normal model, normal prior. Let $X_1, \ldots, X_n$ be a random sample from $\mathcal{N}(\theta, \sigma^2)$, where $\sigma^2$ is known, and assume that the prior is normal, $\pi(\theta) \sim \mathcal{N}(\mu, \tau^2)$, where $\mu$ and $\tau^2$ are known. Then
\[
f(x_1, \ldots, x_n \mid \theta) = (2\pi\sigma^2)^{-\frac{n}{2}} \exp\left\{-\frac{1}{2}\sum_{i=1}^n \frac{(x_i - \theta)^2}{\sigma^2}\right\},
\]
so we calculate for the posterior that
\[
\pi(\theta \mid x) \propto \exp\left\{-\frac{1}{2}\left(\sum_{i=1}^n \frac{(x_i - \theta)^2}{\sigma^2} + \frac{(\theta - \mu)^2}{\tau^2}\right)\right\} =: e^{-\frac{1}{2}M}.
\]
We can calculate (Exercise)
\[
M = a\left(\theta - \frac{b}{a}\right)^2 - \frac{b^2}{a} + c,
\]
where
\[
a = \frac{n}{\sigma^2} + \frac{1}{\tau^2}, \qquad
b = \frac{1}{\sigma^2}\sum x_i + \frac{\mu}{\tau^2}, \qquad
c = \frac{1}{\sigma^2}\sum x_i^2 + \frac{\mu^2}{\tau^2}.
\]
So it follows that the posterior is normal,
\[
\pi(\theta \mid x) \sim \mathcal{N}\!\left(\frac{b}{a}, \frac{1}{a}\right).
\]
Exercise: the predictive distribution for $x$ is $\mathcal{N}(\mu, \sigma^2 + \tau^2)$.

Note: the posterior mean for $\theta$ is
\[
\mu_1 = \frac{\frac{1}{\sigma^2}\sum x_i + \frac{\mu}{\tau^2}}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}}.
\]
If $\tau^2$ is very large compared to $\sigma^2$, then the posterior mean is approximately $\bar{x}$. If $\sigma^2/n$ is very large compared to $\tau^2$, then the posterior mean is approximately $\mu$.

The posterior variance for $\theta$ is
\[
\phi = \frac{1}{\frac{n}{\sigma^2} + \frac{1}{\tau^2}} < \min\left(\frac{\sigma^2}{n}, \tau^2\right),
\]
which is smaller than each of the original variances.
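Returning to the billiard ball example: the following sketch (Python, with an arbitrary illustrative choice of $n$ and $x$; not part of the notes) checks numerically that the prior predictive probability $P(X=x)$ equals $1/(n+1)$ for every $x$, and compares the posterior mean $(x+1)/(n+2)$ with the m.l.e. $x/n$.

```python
import math

# Billiard ball example: uniform prior on p, X | p ~ Binomial(n, p).
# Check that the prior predictive P(X = x) = C(n, x) * B(x+1, n-x+1) = 1/(n+1),
# and compare the Beta(x+1, n-x+1) posterior mean with the m.l.e. x/n.

def beta_fn(a: float, b: float) -> float:
    """Beta function via the Gamma function identity B(a, b) = G(a)G(b)/G(a+b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

n = 10                                  # illustrative number of rolls
for x in range(n + 1):
    predictive = math.comb(n, x) * beta_fn(x + 1, n - x + 1)
    assert abs(predictive - 1 / (n + 1)) < 1e-12

x = 7                                   # illustrative count of "O left of W"
print("posterior mean:", (x + 1) / (n + 2), "  m.l.e.:", x / n)
```

Similarly, for the normal model with normal prior, this sketch (again Python; the data and the prior parameters $\mu$, $\tau^2$ are made up for illustration) computes $a = n/\sigma^2 + 1/\tau^2$ and $b = \sum x_i/\sigma^2 + \mu/\tau^2$ and reports the posterior $\mathcal{N}(b/a, 1/a)$, so the limiting behaviour of the posterior mean and variance can be inspected directly.

```python
# Normal model with known variance sigma^2 and normal prior N(mu, tau^2) on theta.
# Posterior is N(b/a, 1/a) with a = n/sigma^2 + 1/tau^2, b = sum(x)/sigma^2 + mu/tau^2.
# All numbers below are hypothetical.

x = [4.1, 5.3, 4.8, 5.9, 4.4]    # observed sample
sigma2 = 1.0                      # known sampling variance sigma^2
mu, tau2 = 0.0, 10.0              # prior mean and prior variance

n = len(x)
a = n / sigma2 + 1 / tau2
b = sum(x) / sigma2 + mu / tau2

post_mean = b / a                 # mu_1: a precision-weighted average of x-bar and mu
post_var = 1 / a                  # phi: smaller than both sigma^2/n and tau^2

print("posterior mean       :", post_mean)
print("posterior variance   :", post_var)
print("sample mean x-bar    :", sum(x) / n)        # post_mean -> x-bar as tau^2 -> infinity
print("min(sigma^2/n, tau^2):", min(sigma2 / n, tau2))
```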
6.4.3 Credible intervals

A $(1-\alpha)$ (posterior) credible interval is an interval of $\theta$-values within which $1-\alpha$ of the posterior probability lies. In the above example,
\[
P\left(-z_{\alpha/2} < \frac{\theta - \mu_1}{\sqrt{\phi}} < z_{\alpha/2} \,\Big|\, x\right) = 1 - \alpha,
\]
so $\left(\mu_1 - z_{\alpha/2}\sqrt{\phi},\ \mu_1 + z_{\alpha/2}\sqrt{\phi}\right)$ is a $(1-\alpha)$ (posterior) credible interval for $\theta$. The equality is correct conditionally on $x$, but the right-hand side does not depend on $x$, so the equality is also unconditionally correct.

If $\tau^2 \to \infty$ then $\phi - \frac{\sigma^2}{n} \to 0$ and $\mu_1 - \bar{x} \to 0$, and
\[
P\left(\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \theta < \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right) = 1 - \alpha,
\]
which gives the usual $100(1-\alpha)\%$ confidence interval from frequentist statistics.

Note: the interpretation of credible intervals is different from that of confidence intervals. In frequentist statistics, $\bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$ applies before $x$ is observed; the randomness relates to the distribution of $x$. In Bayesian statistics, the credible interval is conditional on the observed $x$; the randomness relates to the distribution of $\theta$.
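A minimal sketch of computing the equal-tailed $(1-\alpha)$ credible interval $\mu_1 \pm z_{\alpha/2}\sqrt{\phi}$ described above (Python, using the standard-library statistics.NormalDist for the normal quantile; the posterior mean and variance passed in at the end are placeholder values, e.g. as produced by the normal/normal calculation sketched earlier).

```python
from statistics import NormalDist

def normal_credible_interval(post_mean: float, post_var: float, alpha: float = 0.05):
    """Equal-tailed (1 - alpha) credible interval for a normal posterior
    N(post_mean, post_var): post_mean +/- z_{alpha/2} * sqrt(post_var)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # standard normal quantile z_{alpha/2}
    half_width = z * post_var ** 0.5
    return post_mean - half_width, post_mean + half_width

# Illustrative posterior mean mu_1 and variance phi (hypothetical numbers).
lower, upper = normal_credible_interval(post_mean=4.85, post_var=0.196, alpha=0.05)
print(f"95% credible interval: ({lower:.3f}, {upper:.3f})")
```

Unlike the frequentist interval, this statement is conditional on the observed data: the probability refers to the posterior distribution of $\theta$, not to repeated sampling of $x$.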