
Confidence & Credible Interval Estimates

Robert L. Wolpert, Department of Statistical Science, Duke University, Durham, NC, USA

1 Introduction

Point estimates of an unknown parameter θ governing the distribution of an observed quantity X are unsatisfying if they come with no measure of accuracy or precision. One approach to giving such a measure is to offer interval estimates for θ, rather than point estimates: upon observing X, we construct an interval [a, b] which is very likely to contain θ, and which is very short. Exactly how these intervals are constructed and interpreted differs between inference in the Sampling Theory tradition and in the Bayesian tradition. In these notes I'll present both approaches to estimating the parameters of the Normal and Exponential distributions, using "pivotal quantities," and the means of Poisson random variables, using detailed features of the distribution, on the basis of a random sample of fixed size n.

1.1 Pivotal Quantities

A pivotal quantity is a function of the data and the parameters (so it's not a statistic) whose distribution does not depend on any uncertain values. Some examples:

• Ex(λ): If X ∼ Ex(λ) then λX ∼ Ex(1) is pivotal and, for samples of size n, λx̄_n ∼ Ga(n, n) and 2nλx̄_n ∼ χ²_{2n} are pivotal.

• Ga(α, λ): If X ∼ Ga(α, λ) with α known then λX ∼ Ga(α, 1) is pivotal and, for samples of size n, λx̄_n ∼ Ga(αn, n) and 2nλx̄_n ∼ χ²_{2αn} are pivotal.

• No(µ, σ²): If µ is unknown but σ² is known, then (X − µ)/σ ∼ No(0, 1) is pivotal and, for samples of size n, √n(x̄_n − µ)/σ ∼ No(0, 1) is pivotal.

• No(µ, σ²): If µ is known but σ² is unknown, then (X − µ)/σ ∼ No(0, 1) is pivotal and, for samples of size n, Σ(X_i − µ)²/σ² ∼ χ²_n is pivotal.

• No(µ, σ²): If µ and σ² are both unknown then, for samples of size n, both Σ(X_i − x̄_n)²/σ² ∼ χ²_{n−1} and √n(x̄_n − µ)/σ ∼ No(0, 1) are pivotal. This is the key example below.

• Un(0, θ): If X ∼ Un(0, θ) with θ unknown then (X/θ) ∼ Un(0, 1) is pivotal and, for samples of size n, max(X_i)/θ ∼ Be(n, 1) is pivotal. Exercise: find a sufficient pair of pivotal quantities for {X_i} iid ∼ Un(α, β).

• We(α, β): If X ∼ We(α, β) has a Weibull distribution then βX^α ∼ Ex(1) is pivotal.
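Claims like these are easy to spot-check by simulation. Here is a minimal R sketch (θ and n are arbitrary choices, purely for illustration) verifying the Un(0, θ) claim that max(X_i)/θ ∼ Be(n, 1):

    ## Simulation check: for X_i iid Un(0, theta), max(X_i)/theta ~ Be(n, 1).
    set.seed(1)
    theta <- 2.5; n <- 7; reps <- 1e5
    piv <- replicate(reps, max(runif(n, 0, theta)) / theta)
    ## Empirical quantiles of the pivotal quantity vs. Be(n, 1) quantiles:
    round(rbind(empirical = quantile(piv, c(.1, .5, .9)),
                beta      = qbeta(c(.1, .5, .9), n, 1)), 3)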

Pivotal quantities allow us to construct sampling-theory "confidence intervals" for uncertain parameters. For example, since 2nλx̄_n ∼ χ²_{2n} for iid {X_j} ∼ Ex(λ), and since the 5% and 95% quantiles of the χ²_{10} distribution are 3.940299 and 18.307038, a sample of size n = 5 from the Ex(λ) distribution satisfies

0.90 = Pr[3.940299 < 2nλx̄_n < 18.307038]
     = Pr[0.3940299/x̄_5 < λ < 1.8307038/x̄_5],

so [0.39/x̄_5, 1.83/x̄_5] is a 90% confidence interval for λ. Most discrete distributions don't have (exact) pivotal quantities, but the Central Limit Theorem usually leads to approximate confidence intervals for most distributions for large samples. Exact intervals are available for many distributions, with a little more work, even for small samples; see Section 4 for a construction of exact confidence and credible intervals for the Poisson distribution. Ask me if you're interested in more details.
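In R the same interval can be computed directly from the χ²_{2n} quantiles; a minimal sketch, with made-up data standing in for an observed sample:

    ## 90% CI for the rate lambda of an Ex(lambda) distribution, n = 5,
    ## using the pivotal quantity 2*n*lambda*xbar ~ chi^2_{2n}.
    x    <- c(0.8, 1.9, 0.4, 2.7, 1.1)     # hypothetical observations
    n    <- length(x); xbar <- mean(x)
    c(lower = qchisq(0.05, df = 2*n) / (2*n*xbar),
      upper = qchisq(0.95, df = 2*n) / (2*n*xbar))

Here qchisq(0.05, 10) and qchisq(0.95, 10) return exactly the quantiles 3.940299 and 18.307038 quoted above.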

2 Confidence Intervals for a Normal Mean

First let's verify the claim above that, for a sample of size n from the No(µ, σ²) distribution, the pivotal quantities Σ(X_i − x̄_n)²/σ² ∼ χ²_{n−1} and √n(x̄_n − µ)/σ ∼ No(0, 1) have the distributions I claimed for them, and, moreover, that they are independent. Let x = {X_1, ..., X_n} iid ∼ No(µ, σ²) be a simple random sample from a normal distribution with mean µ and variance σ². The log likelihood for

θ = (µ, σ²) is

log f(x|θ) = log{ (2πσ²)^{−n/2} e^{−Σ(x_i−µ)²/2σ²} }
           = −(n/2) log(2πσ²) − (1/2σ²) Σ(x_i − x̄_n)² − (n/2σ²)(x̄_n − µ)²,

so the MLEs are

µ̂ = x̄_n = (1/n) Σ x_i   and   σ̂² = s²_n/n = (1/n) Σ(x_i − x̄_n)².

First we turn to discovering the probability distributions of these estimators, so we can make sampling-based interval estimates for µ and σ². Since the {X_i} are independent, their sum has a No(nµ, nσ²) distribution and

µ̂ = x̄_n ∼ No(µ, σ²/n).

Since the covariance between x̄_n and each component of (x − x̄_n) is zero, and since they're all jointly Gaussian, x̄_n must be independent of (x − x̄_n) and hence of any function of (x − x̄_n), including s²_n = Σ(x_i − x̄_n)². Now we can use moment generating functions to discover the distribution of s²_n. Since Z² ∼ χ²_1 = Ga(1/2, 1/2) for a standard normal Z ∼ No(0, 1), we have

n(x̄_n − µ)²/σ² ∼ Ga(1/2, 1/2)   and   Σ(x_i − µ)²/σ² ∼ Ga(n/2, 1/2),

so

n(x̄_n − µ)² ∼ Ga(1/2, 1/2σ²)   and   Σ(x_i − µ)² ∼ Ga(n/2, 1/2σ²).

By completing the square we have

Σ(x_i − µ)² = Σ(x_i − x̄_n)² + n(x̄_n − µ)²

as the sum of two independent terms. Recall (or compute) that the Gamma Ga(α, β) MGF is (1 − t/β)^{−α} for t < β, and that the MGF for the sum of

independent random variables is the product of the individual MGFs, so

E exp{t Σ(x_i − µ)²} = (1 − 2σ²t)^{−n/2}
                     = E exp{t Σ(x_i − x̄_n)²} · E exp{t n(x̄_n − µ)²}
                     = E exp{t Σ(x_i − x̄_n)²} · (1 − 2σ²t)^{−1/2},

so, by dividing,

E exp{t Σ(x_i − x̄_n)²} = (1 − 2σ²t)^{−(n−1)/2}.

Thus

s²_n ≡ Σ(x_i − x̄_n)² ∼ Ga((n−1)/2, 1/2σ²),

and so

s²_n/σ² ∼ χ²_{n−1}   is independent of   (x̄_n − µ)/√(σ²/n) ∼ No(0, 1).
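Both conclusions are easy to confirm by simulation; a minimal R sketch (µ, σ, and n chosen arbitrarily):

    ## Check: s2/sigma^2 ~ chi^2_{n-1}, independent of sqrt(n)(xbar-mu)/sigma.
    set.seed(1)
    mu <- 3; sigma <- 2; n <- 10; reps <- 1e5
    sims <- replicate(reps, {
      x <- rnorm(n, mu, sigma)
      c(z  = sqrt(n) * (mean(x) - mu) / sigma,
        s2 = sum((x - mean(x))^2) / sigma^2)
    })
    cor(sims["z", ], sims["s2", ])     # near 0, consistent with independence
    mean(sims["s2", ]); n - 1          # chi^2_{n-1} has mean n - 1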

2.1 Confidence Intervals for µ when the Variance is Known

The pivotal quantity Z = (x̄_n − µ)/√(σ²/n) has a standard No(0, 1) normal distribution, with CDF Φ(z). If σ² were known, then for any 0 < γ < 1 and for z* such that Φ(z*) = (1 + γ)/2 we could write

γ = P_µ[−z* ≤ (x̄_n − µ)/√(σ²/n) ≤ z*]
  = P_µ[x̄_n − z*σ/√n ≤ µ ≤ x̄_n + z*σ/√n]

and find a "confidence interval" [x̄_n − z*σ/√n, x̄_n + z*σ/√n] for µ by replacing x̄_n with its observed value. Note this is a random interval; from a sampling theory perspective, µ is fixed so once we replace "x̄_n" with its value, the interval is no longer random and we can no longer make probability statements about it. That's why the word "confidence" is used for these intervals, and not "probability." If σ² is not known, and must be estimated from the data, then we have a bit of a problem: although the analogous quantity

(x̄_n − µ)/√(σ̂²/n)

does have a probability distribution that does not depend on µ or σ², and so is pivotal, its distribution is not Normal. Heuristically, this quantity has "fatter tails" than the normal density function, because it can be far from zero if either x̄_n is far from its mean µ or if the estimate σ̂² for the variance is too small.
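The fatter tails show up immediately in simulation; a minimal R sketch (n arbitrary):

    ## Tail comparison: the t-type statistic vs. the known-sigma z statistic.
    set.seed(1)
    n <- 5; reps <- 1e5
    tstat <- replicate(reps, {
      x <- rnorm(n)                  # mu = 0, sigma = 1
      mean(x) / sqrt(var(x) / n)     # sigma estimated from the data
    })
    mean(abs(tstat) > 1.96)          # far above the nominal normal value
    2 * pnorm(-1.96)                 # = 0.04999579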

Following William S. Gosset (a statistician working for the Guinness Brewery in Dublin, Ireland), as adapted by Ronald A. Fisher (an English theoretical statistician), we consider the (slightly rescaled, by Fisher) pivotal quantity:

t := (x̄_n − µ)/√(σ̂²/(n−1)) = (x̄_n − µ)√n / √(s²_n/(n−1))
   = [(x̄_n − µ)/√(σ²/n)] / √(s²_n/(σ²(n−1))) = Z/√(Y/ν),

for independent Z ∼ No(0, 1) and Y ∼ χ²_ν with ν = n − 1. Now we turn to finding the density function for t.

2.2 The t pdf

Note X ≡ Z²/2 ∼ Ga(1/2, 1) and U ≡ Y/2 ∼ Ga(ν/2, 1). Since Z has a symmetric distribution about zero, so does t, and its pdf will satisfy f_ν(t) = f_ν(−t). For t > 0,

P[Z/√(Y/ν) > t] = (1/2) P[Z²/(Y/ν) > t²] = (1/2) P[Z²/2 > (Y/2)(t²/ν)]
                = (1/2) ∫₀^∞ { ∫_{ut²/ν}^∞ x^{−1/2} e^{−x}/Γ(1/2) dx } u^{ν/2−1} e^{−u}/Γ(ν/2) du.

Taking the negative derivative wrt t, and noting Γ(1/2) = √π,

f_ν(t) = (1/(2√π)) ∫₀^∞ (2ut/ν) (ut²/ν)^{−1/2} e^{−ut²/ν} u^{ν/2−1} e^{−u}/Γ(ν/2) du
       = (1/(√(πν) Γ(ν/2))) ∫₀^∞ u^{(ν+1)/2 − 1} e^{−u(1+t²/ν)} du
       = [Γ((ν+1)/2)/(√(πν) Γ(ν/2))] (1 + t²/ν)^{−(ν+1)/2},

the Student t density function. It's not important to remember or be able to reproduce the derivation, or to remember the normalizing constant, but it's useful to know a few things about the t density function:

• f_ν(t) ∝ (1 + t²/ν)^{−(ν+1)/2} is symmetric and bell-shaped, but falls off to zero as t → ±∞ more slowly than the Normal density (f_ν(t) ≍ |t|^{−ν−1} while φ(z) ≍ e^{−z²/2}).

• For one degree of freedom ν = 1, the t₁ distribution is identical to the standard Cauchy distribution f₁(t) = π^{−1}/(1 + t²), with undefined mean and infinite variance.

• As ν → ∞, f_ν(t) → φ(t) converges to the standard Normal No(0, 1) density function.
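These limiting cases are visible numerically in R, where dt() evaluates f_ν; a minimal sketch:

    ## f_1 equals the Cauchy density; f_nu approaches the normal as nu grows.
    t <- c(0, 1, 2, 4)
    rbind(t1    = dt(t, df = 1),    cauchy = dcauchy(t),
          t1000 = dt(t, df = 1000), normal = dnorm(t))

dt(t, 1) and dcauchy(t) agree exactly, while dt(t, 1000) is already very close to dnorm(t).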

2.3 Confidence Intervals for µ when σ2 is Unknown

With the t distribution (and hence its CDF F_ν(t)) now known, for any iid random sample x = {X_1, ..., X_n} ∼ No(µ, σ²) from the normal distribution we can set ν ≡ n − 1, compute the sufficient statistics

x̄_n = (1/n) Σ x_i,   σ̂²_n = s²_n/n = (1/n) Σ(x_i − x̄_n)²

and, for any 0 < γ < 1, find t* such that F_ν(t*) = (1 + γ)/2, then compute

γ = P_µ[−t* ≤ (x̄_n − µ)/√(σ̂²_n/ν) ≤ t*]
  = P_µ[x̄_n − t* σ̂/√ν ≤ µ ≤ x̄_n + t* σ̂/√ν].

Once again we have a random interval that will contain µ with specified probability γ, and we can replace the sufficient statistics x̄_n, σ̂² with their observed values to get a confidence interval. In this sampling theory approach the unknown µ and σ² are kept fixed and only the data x are treated as random; that's what the subscript "µ" on P was intended to suggest. For example, with just n = 2 observations the t distribution will have only ν = 1 degree of freedom, so it coincides with the Cauchy distribution with CDF

F₁(t) = ∫_{−∞}^t (1/π)/(1 + x²) dx = 1/2 + (1/π) arctan(t).

For a confidence interval with γ = 0.95 we need t* = tan[π(0.975 − 0.5)] = tan(0.475π) = 12.70620 (much larger than z* = 1.96); the interval is

0.95 = P_µ[x̄ − 12.7σ̂ ≤ µ ≤ x̄ + 12.7σ̂]

with x̄ = (x₁ + x₂)/2 and σ̂ = |x₁ − x₂|/2. When σ² is known it is better to use its known value than to estimate it, because (on average) the interval estimates will be shorter. Of course some things we "know" turn out to be false... or, as Will Rogers put it,

It isn't what we don't know that gives us trouble, it's what we know that ain't so.

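Before moving on to the Bayesian approach, the whole recipe above can be run in R; a minimal sketch with made-up data (t.test() automates the same computation):

    ## 95% t confidence interval for mu, sigma^2 unknown.
    x     <- c(4.1, 5.6, 3.8, 5.0, 4.4)    # hypothetical data
    n     <- length(x); nu <- n - 1; gam <- 0.95
    xbar  <- mean(x); s2 <- sum((x - xbar)^2)
    tstar <- qt((1 + gam)/2, df = nu)
    xbar + c(-1, 1) * tstar * sqrt(s2/n) / sqrt(nu)
    t.test(x, conf.level = gam)$conf.int    # same interval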
3 Bayesian Credible Intervals for a Normal Mean

How would we make inference about µ for the normal distribution using Bayesian methods?

3.1 Unknown mean µ, known precision τ = σ⁻²

When only the mean µ is uncertain but the precision τ ≡ 1/σ² is known, the normal likelihood function

f(x|µ) = (τ/2π)^{n/2} e^{−(τ/2) Σ(x_i−x̄_n)² − (nτ/2)(x̄_n−µ)²} ∝ e^{−(nτ/2)(µ−x̄_n)²}

is proportional to a normal density in µ (with mean x̄_n and precision nτ), so the family π_µ(µ) ∼ No(µ₀, τ₀⁻¹), with density ∝ e^{−(τ₀/2)(µ−µ₀)²}, is a conjugate prior distribution. The posterior in this case is again normal, µ|x ∼ No(µ₁, τ₁⁻¹), with updated "hyper-parameters"

µ₁ = (τ₀µ₀ + nτx̄_n)/(τ₀ + nτ),   τ₁ = τ₀ + nτ.

Posterior credible intervals are available of any size 0 < γ < 1; using the critical value z* such that Φ(z*) = (1 + γ)/2, we have

γ = P[µ₁ − z*/√τ₁ < µ < µ₁ + z*/√τ₁ | x].

The limit as µ₀ → 0 and τ₀ → 0 leads to the improper uniform prior π_µ(µ) ∝ 1 with posterior µ|x ∼ No(µ₁ = x̄_n, τ₁⁻¹ = σ²/n), with a Bayesian credible interval [x̄_n − z*σ/√n, x̄_n + z*σ/√n] identical to the sampling theory confidence interval of Section 2.1, but with a different interpretation: here γ = P[x̄_n − z*σ/√n < µ < x̄_n + z*σ/√n | x] is a probability statement about the uncertain µ, with the observed data held fixed.
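The updating and the credible interval are one-liners in R; a minimal sketch with arbitrary (hypothetical) prior hyper-parameters:

    ## Conjugate normal update for mu with known precision tau = 1/sigma^2.
    x    <- c(4.1, 5.6, 3.8, 5.0, 4.4)     # hypothetical data
    n    <- length(x); tau <- 1/4          # known precision (sigma = 2)
    mu0  <- 0; tau0 <- 0.01                # vague prior guesses
    tau1 <- tau0 + n*tau
    mu1  <- (tau0*mu0 + n*tau*mean(x)) / tau1
    gam  <- 0.90; zstar <- qnorm((1 + gam)/2)
    mu1 + c(-1, 1) * zstar / sqrt(tau1)    # 90% posterior credible interval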

3.2 Known mean µ, unknown precision τ = σ⁻²

When µ is known and only the precision τ is uncertain,

f(x|τ) ∝ τ^{n/2} e^{−τ Σ(x_i−µ)²/2}

is proportional to a gamma density in τ with shape 1 + n/2 and rate Σ(x_i − µ)²/2, so π_τ(τ) ∼ Ga(α₀, β₀), with density ∝ τ^{α₀−1} e^{−β₀τ}, is conjugate for τ. The posterior distribution is τ|x ∼ Ga(α₁, β₁) with updated hyper-parameters

α₁ = α₀ + n/2,   β₁ = β₀ + (1/2) Σ(x_i − µ)².

There's no point in giving interval estimates for µ (since it's known), but we can use the fact that 2β₁τ ∼ χ²_ν with ν = 2α₁ degrees of freedom to generate intervals for τ or σ². For numbers 0 < γ₁ < γ₂ < 1, find quantiles 0 < c₁ < c₂ < ∞ of the χ²_ν distribution that satisfy P[χ²_ν ≤ c_j] = γ_j; then

γ₂ − γ₁ = P[c₁/(2β₁) < τ < c₂/(2β₁) | x] = P[2β₁/c₂ < σ² < 2β₁/c₁ | x].

For 0 < γ < 1 the symmetric case of γ₁ = (1 − γ)/2, γ₂ = (1 + γ)/2 is not the shortest possible interval of size γ = γ₂ − γ₁, because the χ² density isn't symmetric. The shortest choice is called the "HPD" or "highest posterior density" interval because the Ga(α₁, β₁) density function takes equal values at c₁, c₂ and is higher inside [c₁, c₂] than outside. Typically HPDs are found by a numerical search (a short sketch appears at the end of this subsection). In the limit as α₀ → 0 and β₀ → 0, we have the improper prior π_τ(τ) ∝ τ⁻¹ with posterior τ|x ∼ Ga(n/2, Σ(x_i − µ)²/2) proportional to a χ²_n distribution, and the Bayesian credible interval coincides with a sampling theory confidence interval for σ² (again, with the conditioning reversed).
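The numerical HPD search is short in R; a minimal sketch for an assumed (hypothetical) posterior Ga(α₁, β₁):

    ## Equal-tailed vs. HPD 90% intervals for tau ~ Ga(alpha1, beta1).
    alpha1 <- 12; beta1 <- 8; gam <- 0.90                # hypothetical posterior
    qgamma(c((1 - gam)/2, (1 + gam)/2), alpha1, beta1)   # equal-tailed interval
    ## HPD: among intervals with posterior probability gam, take the shortest.
    width <- function(p) diff(qgamma(c(p, p + gam), alpha1, beta1))
    p1  <- optimize(width, c(0, 1 - gam))$minimum
    (hpd <- qgamma(c(p1, p1 + gam), alpha1, beta1))
    dgamma(hpd, alpha1, beta1)   # density is (numerically) equal at endpoints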

3.3 Both mean µ and precision τ = σ⁻² unknown

When both parameters are uncertain, there is no conjugate family with µ, τ independent under both prior and posterior, but there is a four-parameter family, the "normal-gamma" distributions, that is conjugate (see §7.6 in (3/e) or §8.6 in (4/e) of the text). It is usually expressed in conditional form:

τ ∼ Ga(α₀, β₀),   µ|τ ∼ No(µ₀, [λ₀τ]⁻¹),

with the prior precision for µ proportional to the data precision τ. Its density function is seldom needed, but easy to write down:

π(µ, τ | α₀, β₀, µ₀, λ₀) = (β₀^{α₀}/Γ(α₀)) √(λ₀/2π) τ^{α₀−1/2} e^{−τ[β₀ + λ₀(µ−µ₀)²/2]}    (1)

The posterior distribution is again of the same form, with updated hyper-parameters that depend on n and the sufficient statistics x̄_n and σ̂²_n:

α₁ = α₀ + n/2,   β₁ = β₀ + (n/2) σ̂²_n + (n/(λ₀ + n)) λ₀(x̄_n − µ₀)²/2,
µ₁ = (λ₀µ₀ + n x̄_n)/(λ₀ + n),   λ₁ = λ₀ + n.

The conventional "non-informative" or "vague" improper prior distribution for a location-scale family like this is π(µ, τ) = τ⁻¹, invariant under changes in both location x → x + a and scale x → cx; from Equation (1) we see this can be achieved (apart from the irrelevant normalizing constant) by taking α₀ = −1/2 and β₀ = λ₀ = 0, with µ₀ arbitrary. In this limiting case we find posterior distributions of τ ∼ Ga(ν/2, s²_n/2) and µ|τ ∼ No(x̄_n, [nτ]⁻¹) with ν = n − 1, so

(µ − x̄_n)/√(σ̂²_n/ν) ∼ t_ν

and the Bayesian posterior credible interval

γ = P[x̄_n − t* σ̂_n/√(n−1) < µ < x̄_n + t* σ̂_n/√(n−1) | x]

coincides with the sampling-theory confidence interval

γ = P_{µ,τ}[x̄_n − t* σ̂_n/√(n−1) < µ < x̄_n + t* σ̂_n/√(n−1)]

of Section 2.3, with a different interpretation.

More generally, for any positive α_*, β_*, µ_*, λ_*, the marginal distribution for µ is that of a shifted (or "non-central") and scaled t_ν distribution with ν = 2α_* degrees of freedom; specifically,

(µ − µ_*)/√(β_*/α_*λ_*) ∼ t_ν,   ν = 2α_*.

One way to see that is to begin with the relations

τ ∼ Ga(α_*, β_*),   µ|τ ∼ No(µ_*, [λ_*τ]⁻¹)

and, after scaling and centering, find

Z ≡ (µ − µ_*)√(λ_*τ) ∼ No(0, 1)   ⊥⊥   Y ≡ 2β_*τ ∼ Ga(α_*, 1/2) = χ²_ν

for ν ≡ 2α_*, and hence

Z/√(Y/ν) = (µ − µ_*)/√(β_*/α_*λ_*) ∼ t_ν.

With the normal-gamma prior, and with t* chosen so that F_ν(t*) = (1 + γ)/2, a Bayesian posterior credible interval is:

γ = P[µ₁ − t*√(β₁/α₁λ₁) ≤ µ ≤ µ₁ + t*√(β₁/α₁λ₁) | x],

an interval that is a meaningful probability statement even after x̄_n and σ̂²_n (and hence α₁, β₁, µ₁, λ₁) are replaced with their observed values from the data.
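All of this is mechanical in R; a minimal sketch with arbitrary (hypothetical) prior hyper-parameters:

    ## Normal-gamma update and posterior credible interval for mu.
    x  <- c(4.1, 5.6, 3.8, 5.0, 4.4)       # hypothetical data
    n  <- length(x); xbar <- mean(x); s2hat <- mean((x - xbar)^2)
    a0 <- 1; b0 <- 1; m0 <- 0; l0 <- 0.01  # vague prior guesses
    a1 <- a0 + n/2
    b1 <- b0 + n*s2hat/2 + l0*n*(xbar - m0)^2 / (2*(l0 + n))
    m1 <- (l0*m0 + n*xbar)/(l0 + n); l1 <- l0 + n
    gam <- 0.95; tstar <- qt((1 + gam)/2, df = 2*a1)
    m1 + c(-1, 1) * tstar * sqrt(b1/(a1*l1))   # credible interval for mu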

4 Interval Estimates for Poisson Means

Let x = {X_j} be a sample of n iid observations from the Poisson distribution with unknown mean θ, and let 0 < γ < 1 be a number between zero and one. Denote by X = ℕ₀ⁿ the space of all possible values of x, n-tuples of non-negative integers; also denote by Θ = ℝ₊ the possible values of θ.

4.1 Sampling Theory: Confidence Intervals

A γ-Confidence Interval is a random interval I(x) with the property that P_θ[θ ∈ I(x)] ≥ γ for every fixed value of θ, i.e., one that will contain θ with probability at least γ no matter what θ might be. We can specify an interval by giving its end-points, a pair of functions A: X → ℝ and B: X → ℝ with the property that

(∀θ ∈ Θ)   P_θ[A(x) < θ < B(x)] ≥ γ.    (2)

It is convenient to ask also that the interval be:

• symmetric, in the sense that the two possible errors each have the same error bound,

P_θ[θ ≤ A(x)] ≤ (1 − γ)/2,   P_θ[B(x) ≤ θ] ≤ (1 − γ)/2;    (3)

• as short as possible, subject to the error bound.

Clearly each function A and B will be a monotonically increasing function of the sufficient statistic S = ΣX_j ∼ Po(nθ); let's write A_S and B_S for those functions, and consider A_S first. Before we do, remember that the arrival time T_k for the k'th event in a unit-rate Poisson process X_t has the Ga(k, 1) distribution, and that X_t ≥ k if and only if T_k ≤ t (at least k fish by time t if and only if the k'th fish arrives before time t); hence, in R, the CDF functions for Gamma and Poisson are related for all k ∈ ℕ and t > 0 by

1 − ppois(k − 1, t) = pgamma(t, k, 1).

Also recall the Gamma quantile function in R, an inverse for the CDF function, which satisfies p = pgamma(t, k, b) if and only if t = qgamma(p, k, b), and that if Y ∼ Ga(α, β) and b > 0 then Y/b ∼ Ga(α, βb), so

pgamma(bθ, α, 1) = pgamma(θ, α, b).

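Both identities are easy to spot-check in R; a minimal sketch with arbitrary k, t, b:

    ## Poisson-Gamma CDF identity and the rate-rescaling identity.
    k <- 3; t <- 2.2; b <- 5
    1 - ppois(k - 1, t); pgamma(t, k, 1)    # equal
    pgamma(b * t, k, 1); pgamma(t, k, b)    # equal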
Fix any positive integer k. To achieve Equation (3) for θ ≤ A_k, we need:

(1 − γ)/2 ≥ P_θ[θ ≤ A(x)]
          ≥ P_θ[S ≥ k]
          = 1 − ppois(k − 1, nθ)
          = pgamma(nθ, k, 1) = pgamma(θ, k, n),   i.e.,
qgamma((1 − γ)/2, k, n) ≥ θ

for each θ ≤ A_k. Evidently this happens if and only if:

qgamma((1 − γ)/2, k, n) ≥ A_k.    (4)

Similarly, for nonnegative k and B_k ≤ θ, Equation (3) requires:

(1 − γ)/2 ≥ P_θ[B(x) ≤ θ]
          ≥ P_θ[S ≤ k]
          = ppois(k, nθ)
          = 1 − pgamma(nθ, k + 1, 1),   i.e.,
(1 + γ)/2 ≤ pgamma(θ, k + 1, n),   i.e.,
qgamma((1 + γ)/2, k + 1, n) ≤ θ

for each θ ≥ B_k. This happens if and only if:

qgamma((1 + γ)/2, k + 1, n) ≤ B_k.    (5)

The shortest interval subject to the two constraints of Equations (4, 5) is:

A_k = qgamma((1 − γ)/2, k, n),   B_k = qgamma((1 + γ)/2, k + 1, n).    (6)

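Equation (6) translates directly into a small R function; a minimal sketch (poisCI is a hypothetical name, not from the text):

    ## Exact gamma-level confidence interval for a Poisson mean theta,
    ## from the sufficient statistic S = sum(x) with sample size n.
    poisCI <- function(S, n, gam = 0.90)
      c(A = if (S > 0) qgamma((1 - gam)/2, S, n) else 0,
        B = qgamma((1 + gam)/2, S + 1, n))
    poisCI(S = 7, n = 5)    # e.g., five observations summing to 7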
4.2 Bayesian Credible Intervals

A conjugate Bayesian analysis for iid Poisson data {X_j} ∼ Po(θ) begins with the selection of α > 0, β > 0 for a Ga(α, β) prior density

π(θ) ∝ θ^{α−1} e^{−βθ}

f(x|θ) = ∏_{j=1}^n e^{−θ} θ^{x_j}/x_j! ∝ θ^S e^{−nθ},

where again S = Σ_{j=1}^n X_j. The posterior distribution is

π(θ|x) ∝ θ^{α+S−1} e^{−(β+n)θ} ∼ Ga(α + S, β + n).

Thus a symmetric γ posterior ("credible") interval for θ can be given by

γ = P[a(x) < θ < b(x) | x]    (7)

with end-points

a(x) = qgamma((1 − γ)/2, α + S, β + n),   b(x) = qgamma((1 + γ)/2, α + S, β + n).    (8)
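In R, with the same sufficient statistic S as before; a minimal sketch with arbitrary prior hyper-parameters (poisBayes is a hypothetical name, with defaults matching the Ga(1/2, 0) prior discussed in Section 4.3):

    ## Symmetric gamma-level posterior credible interval for theta.
    poisBayes <- function(S, n, alpha = 0.5, beta = 0, gam = 0.90)
      qgamma(c((1 - gam)/2, (1 + gam)/2), alpha + S, beta + n)
    poisBayes(S = 7, n = 5)    # compare with poisCI(7, 5) above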

4.3 Comparison

The two probability statements in Equations (2, 7) are different: in Equation (2) the value of θ is fixed while x (and hence the sufficient statistic S) is random, whereas in Equation (7) the observed x is fixed and θ is random. Because S has a discrete distribution it is not possible to achieve exact equality for all θ; the coverage probability P_θ[A(x) < θ < B(x)] is at least γ for every θ, but exceeds γ for most values of θ, as Figure 1 illustrates.

Figure 1: Exact coverage probability for 95% Poisson Confidence Intervals

The formulas for the interval endpoints given in Equations (6, 8) are similar: if we take β = 0 and α = 1/2 they will be as close as possible to each other. Note that this corresponds to an improper Ga(1/2, 0) prior distribution

π(θ) ∝ θ^{−1/2} 1_{θ>0},

but the posterior distribution π(θ|x) ∼ Ga(S + 1/2, n) is proper for any x ∈ X. For any α and β, all the intervals have the same asymptotic behavior for

large n; by the central limit theorem,

A(x), a(x) ≈ X̄ − z_γ √(X̄/n),   B(x), b(x) ≈ X̄ + z_γ √(X̄/n),

where Φ(z_γ) = (1 + γ)/2, so γ = Φ(z_γ) − Φ(−z_γ).
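For even modest n the three intervals are already close; a minimal sketch comparing them (using the hypothetical poisCI() and poisBayes() sketches above):

    ## Exact, Bayesian, and CLT-approximate 90% intervals for theta.
    S <- 42; n <- 30; gam <- 0.90; xbar <- S/n
    poisCI(S, n, gam)                        # exact confidence interval
    poisBayes(S, n, 0.5, 0, gam)             # credible, Ga(1/2, 0) prior
    xbar + c(-1, 1) * qnorm((1 + gam)/2) * sqrt(xbar/n)   # CLT approximation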
