
Introduction to Bayesian Inference

Deepayan Sarkar

Based on notes by Mohan Delampady

IIAP Astrostatistics School, July 2013

Outline
1 Statistical Inference
2 Frequentist Statistics
3 Conditioning on Data
4 The Bayesian Recipe
5 Inference for Binomial proportion
6 Inference With Normals/Gaussians
7 Bayesian Computations
8 Empirical Bayes Methods for High Dimensional Problems
9 Formal Methods for Model Selection
10 Bayesian Model Selection
11 Model Selection or Model Averaging?
12 References

What is Statistical Inference?

It is an inverse problem as in the 'Toy Example':

Example 1 (Toy). Suppose a million candidate stars are examined for the presence of planetary systems associated with them. If 272 'successes' are noticed, how likely is it that the success rate is 1%, 0.1%, 0.01%, ... for the entire universe?

Statistical models for observed data involve direct probabilities:

Example 2. An astronomical study involved 100 galaxies, of which 20 are Seyfert galaxies and the rest are starburst galaxies. To illustrate generalization of certain conclusions, say 10 of these 100 galaxies are randomly drawn. How many galaxies drawn will be Seyfert galaxies? This is exactly like an artificial problem involving an urn having 100 marbles, of which 20 are red and the rest blue. 10 marbles are drawn at random with replacement (repeatedly, one by one, after replacing the one previously drawn and mixing the marbles well). How many marbles drawn will be red?

Data and Models

X = number of Seyfert galaxies (red marbles) in the sample (of size n = 10).

P(X = k | θ) = (n choose k) θ^k (1 − θ)^(n−k), k = 0, 1, ..., n. (1)

In (1), θ is the proportion of Seyfert galaxies (red marbles) in the urn, which is also the probability of drawing a Seyfert galaxy at each draw. In Example 2, θ = 20/100 = 0.2 and n = 10. So, P(X = 0 | θ = 0.2) = 0.8^10, P(X = 1 | θ = 0.2) = 10 × 0.2 × 0.8^9, and so on. In practice, as in the 'Toy Example', θ is unknown and inference about it is the question to solve. In the Seyfert/starburst galaxy example, if θ is not known and 3 galaxies out of 10 turned out to be Seyfert, one could ask: how likely is θ = 0.1, or 0.2, or 0.3, or ...? Thus inference about θ is an inverse problem:

Causes (parameters) ←− Effects (observations)

How does this inversion work? The direct probability model P(X = k | θ) provides a likelihood function for the unknown parameter θ when data X = x is observed:

l(θ | x) = f(x | θ) (= P(X = x | θ) when X is a discrete random variable),

as a function of θ for given x. Interpretation: f(x | θ) says how likely x is under different θ or the model P(· | θ); so if x is observed, then P(X = x | θ) = f(x | θ) = l(θ | x) should be able to indicate what the likelihood of different θ values or P(· | θ) is for that x. As a function of x for fixed θ, P(X = x | θ) is a probability mass function or density, but as a function of θ for fixed x it has no such meaning; it is just a measure of likelihood. After an experiment is conducted and data x are seen, the only entity available to convey the information about θ obtained from the experiment is l(θ | x). For the Urn Example we have l(θ | X = 3) ∝ θ^3 (1 − θ)^7:

> curve(dbinom(3, prob = x, size = 10), from = 0, to = 1,
+       xlab = expression(theta), ylab = "", las = 1,
+       main = expression(L(theta) %prop% theta^3 * (1-theta)^7))

[Figure: the likelihood L(θ) ∝ θ^3 (1 − θ)^7 plotted over θ ∈ (0, 1), peaking at θ = 0.3.]

Maximum Likelihood Estimation (MLE): If l(θ | x) measures the likelihood of different θ (or the corresponding models P(· | θ)), just find that θ = θ̂ which maximizes the likelihood. For model (1),

θ̂ = θ̂(x) = x/n = sample proportion of successes.
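A quick numerical check of this formula, using the urn data (x = 3 successes in n = 10 draws):

> # The binomial likelihood for x = 3, n = 10 is maximized at x/n = 0.3
> optimize(function(theta) dbinom(3, size = 10, prob = theta),
+          interval = c(0, 1), maximum = TRUE)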

This is only an estimate. How good is it? What is the possible error in estimation?

Frequentist Approach

- Consider repeating this experiment again and again.
- One can look at all possible sample data: X ∼ Bin(n, θ) → θ̂ = X/n.
- Utilize long-run average behaviour of the MLE, i.e., treat θ̂ as a random quantity (a function of X).

E(θ̂) = E(X/n) = θ, Var(θ̂) = Var(X/n) = θ(1 − θ)/n.

- This gives the "standard error" √(θ̂(1 − θ̂)/n).
- For large n, one can use the Law of Large Numbers and the Central Limit Theorem.

Confidence Statements

Specifically, for large n, approximately

(θ̂ − θ)/√(θ(1 − θ)/n) ∼ N(0, 1),

or

(θ̂ − θ)/√(θ̂(1 − θ̂)/n) ∼ N(0, 1). (2)

From (2), an approximate 95% confidence interval for θ (when n is large) is

θ̂ ± 2 √(θ̂(1 − θ̂)/n).

What Does This Mean?

Simply, if we sample again and again, in about 19 cases out of 20 this random interval

(θ̂(X) − 2√(θ̂(X)(1 − θ̂(X))/n), θ̂(X) + 2√(θ̂(X)(1 − θ̂(X))/n))

will contain the true unknown value of θ. Fine, but what can we say about the one interval that we can construct for the given sample or data x? Nothing; either θ is inside

(0.3 − 2√(0.3 × 0.7/10), 0.3 + 2√(0.3 × 0.7/10))

or it is outside. If θ is treated as a fixed unknown constant, conditioning on the given data X = x is meaningless.
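For reference, a two-line R version of the interval just discussed (θ̂ = 0.3, n = 10):

> # Approximate 95% confidence interval: theta-hat +/- 2 standard errors
> theta.hat <- 0.3; n <- 10
> theta.hat + c(-2, 2) * sqrt(theta.hat * (1 - theta.hat) / n)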

Conditioning on Data

- What other approach is possible, then?
- How does one condition on data?
- How does one talk about probability of a model or a hypothesis?

Example 3 (not from physics but medicine). Consider a blood test for a certain disease; the result is positive (x = 1) or negative (x = 0). Suppose θ1 denotes 'disease present' and θ2 'disease not present'. The test is not confirmatory. Instead, the distribution of X for different θ is:

        x = 0   x = 1   What does it say?
θ1      0.2     0.8     Test is +ve 80% of the time if 'disease present'
θ2      0.7     0.3     Test is −ve 70% of the time if 'disease not present'

If for a particular patient the test result comes out to be 'positive', what should the doctor conclude?

What is the Question?

What is to be answered is ‘what are the chances that the disease is present given that the test is positive?’ i.e., P(θ = θ1|X = 1).

What we have is P(X = 1|θ = θ1) and P(X = 1|θ = θ2). We have the 'wrong' conditional probabilities. They need to be 'reversed'. But how?


The Bayesian Recipe

Recall Bayes Theorem: If A and B are two events,

P(A|B) = P(A and B)/P(B), assuming P(B) > 0.

Therefore P(A and B) = P(A|B)P(B), and by symmetry P(A and B) = P(B|A)P(A). Consequently, if P(B|A) is given and P(A|B) is desired, note that

P(A|B) = P(A and B)/P(B) = P(B|A)P(A)/P(B).

But how can we get P(B)? The rule of total probability says

P(B) = P(B and Ω) = P(B and A) + P(B and A^c) = P(B|A)P(A) + P(B|A^c)(1 − P(A)),

so

P(A|B) = P(B|A)P(A) / [P(B|A)P(A) + P(B|A^c)(1 − P(A))]. (3)

Bayes Theorem allows one to invert a certain conditional probability to get a certain other conditional probability. How does this help us?

In our example we want P(θ = θ1|X = 1). From (3),

P(θ = θ1 | X = 1) = P(X = 1 | θ = θ1)P(θ = θ1) / [P(X = 1 | θ = θ1)P(θ = θ1) + P(X = 1 | θ = θ2)P(θ = θ2)]. (4)

So, all we need is P(θ = θ1), which is simply the probability that a randomly chosen person has this disease, or just the 'prevalence' of this disease in the concerned population. The doctor most likely has this information. But this is not part of the experimental data. This is pre-experimental information, or prior information. If we have this, and are willing to incorporate it in the analysis, we get the post-experimental information, or posterior information, in the form of P(θ | X = x).

In our example, if we take P(θ = θ1) = 0.05 or 5%, we get

P(θ = θ1 | X = 1) = (0.8 × 0.05)/(0.8 × 0.05 + 0.3 × 0.95) = 0.04/0.325 = 0.123,

which is only 12.3%, and P(θ = θ2 | X = 1) = 0.877 or 87.7%.
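The same computation in R, applying (4) directly:

> # Posterior probabilities of theta1 (disease) and theta2 (no disease),
> # given a positive test and prevalence P(theta1) = 0.05
> prior <- c(0.05, 0.95)
> lik <- c(0.8, 0.3)                # P(X = 1 | theta1), P(X = 1 | theta2)
> prior * lik / sum(prior * lik)    # gives 0.123 and 0.877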

Formula (4), which shows how to 'invert' the given conditional probabilities P(X = x | θ) into the conditional probabilities of interest P(θ | X = x), is an instance of the Bayes Theorem; hence the Theory of Inverse Probability (the usage at the time of Bayes and Laplace, late eighteenth century, and even by Jeffreys) is known these days as Bayesian inference.

Ingredients of Bayesian inference:

- likelihood function, l(θ|x); θ can be a parameter vector
- prior distribution, π(θ)

Combining the two, one gets the posterior density or mass function

π(θ | x) = π(θ)l(θ|x) / Σ_j π(θ_j)l(θ_j|x)   if θ is discrete;
π(θ | x) = π(θ)l(θ|x) / ∫ π(u)l(u|x) du      if θ is continuous. (5)

A more relevant example

http://xkcd.com/1132/

Frequentist answer?

- Null hypothesis: the sun has not exploded.
- p-value = ?

Bayesian answer?

Inference for Binomial proportion

Example 2 contd. Suppose we have no special information available on θ. Then assume θ is uniformly distributed on the interval (0, 1). i.e., the prior density is π(θ) = 1, 0 < θ < 1. This is a choice of non-informative or vague or reference prior. Often, Bayesian inference from such a prior coincides with classical inference. The posterior density of θ given x is then

π(θ|x) = π(θ)l(θ|x) / ∫ π(u)l(u|x) du = [(n + 1)!/(x!(n − x)!)] θ^x (1 − θ)^(n−x), 0 < θ < 1.

As a function of θ, this is the same as the likelihood function l(θ|x) ∝ θ^x (1 − θ)^(n−x), and so maximizing the posterior probability density will give the same estimate as the maximum likelihood estimate!
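A minimal numerical companion, using the urn data x = 3, n = 10 (the posterior here is Beta(x + 1, n − x + 1)):

> n <- 10; x <- 3
> (x + 1) / (n + 2)                           # posterior mean
> x / n                                       # posterior mode = MLE
> qbeta(c(0.025, 0.975), x + 1, n - x + 1)    # 95% equal-tail interval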

Influence of the Prior

If we had some knowledge about θ which can be summarized in the form of a Beta prior distribution with parameters α and γ, the posterior will also be Beta, with parameters x + α and n − x + γ. Such priors, which result in posteriors from the same 'family', are called 'natural conjugate priors'.

[Figure: Beta(1.5, 1.5), Beta(2, 2), and Beta(2.5, 2.5) prior densities P(θ) plotted over θ ∈ (0, 1).]
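The figure can be reproduced along the following lines (a sketch in the style of the earlier curve() call; the line types and y-axis limit are arbitrary choices):

> curve(dbeta(x, 1.5, 1.5), from = 0, to = 1, ylim = c(0, 1.8),
+       xlab = expression(theta), ylab = expression(P(theta)), las = 1)
> curve(dbeta(x, 2, 2), add = TRUE, lty = 2)
> curve(dbeta(x, 2.5, 2.5), add = TRUE, lty = 3)
> legend("topright", c("Beta(1.5, 1.5)", "Beta(2, 2)", "Beta(2.5, 2.5)"),
+        lty = 1:3)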

Objective Bayesian Approach:
- Invariant priors (Jeffreys)
- Reference priors (Bernardo, Jeffreys)
- Maximum entropy priors (Jaynes)

Subjective Bayesian Approach: Expert opinion, previous studies, etc.

Robustness? If the answer depends on the choice of prior, which prior should we choose?

In Example 2, what π(θ|x) says is that

- The uncertainty in θ can now be described in terms of an actual probability distribution (the posterior distribution).
- The MLE θ̂ = x/n happens to be the value where the posterior is maximum (traditionally called the mode of the distribution).
- θ̂ can thus be interpreted as the most probable value of the unknown parameter θ conditional on the sample data x.
- Also called the 'maximum a posteriori (MAP) estimate', or the 'highest posterior density (HPD) estimate', or simply the 'posterior mode'.

Of course we do not need to mimic the MLE. In fact the more common Bayes estimate is the posterior mean (which minimizes the posterior dispersion):

E[(θ − θ̂_B)² | x] = min_a E[(θ − a)² | x], when θ̂_B = E(θ|x).

If we choose θ̂_B as the estimate of θ, we get a natural measure of variability of this estimate in the form of the posterior variance:

E[(θ − E(θ|x))² | x].

Therefore the posterior standard deviation is a natural measure of estimation error; i.e., our estimate is θ̂_B ± √(E[(θ − E(θ|x))² | x]). In fact, we can say much more. For any interval around θ̂_B we can compute the (posterior) probability of it containing the true parameter θ. In other words, a statement such as

P(θ̂_B − k1 ≤ θ ≤ θ̂_B + k2 | x) = 0.95

is perfectly meaningful. All these are conditional on the given data. In Example 2, if the prior is a Beta distribution with parameters α and γ, then θ|x will have a Beta(x + α, n − x + γ) distribution, so the Bayes estimate of θ will be

θ̂_B = (x + α)/(n + α + γ) = [n/(n + α + γ)] (x/n) + [(α + γ)/(n + α + γ)] [α/(α + γ)].

This is a convex combination of the sample mean and the prior mean, with the weights depending upon the sample size and the strength of the prior information as measured by the values of α and γ.
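A quick numerical illustration of this decomposition (the urn data x = 3, n = 10 again; the choice α = γ = 2 is arbitrary):

> n <- 10; x <- 3; alpha <- 2; gamma <- 2
> (x + alpha) / (n + alpha + gamma)                    # posterior mean
> w <- n / (n + alpha + gamma)                         # weight on x/n
> w * (x / n) + (1 - w) * alpha / (alpha + gamma)      # same value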

Bayesian inference relies on the conditional probability language to revise one's knowledge. In the above example, prior to the collection of sample data one had some (vague, perhaps) information on θ. Then came the sample data. Combining the model density of this data with the prior density, one gets the posterior density, the conditional density of θ given the data. From now on, until further data are available, this posterior distribution of θ is the only relevant information as far as θ is concerned.

Inference With Normals/Gaussians

Gaussian PDF

f(x | µ, σ²) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)), −∞ < x < ∞. (6)

Common abbreviated notation: X ∼ N(µ, σ²).

Parameters:
µ = E(X) ≡ ∫ x f(x | µ, σ²) dx
σ² = E(X − µ)² ≡ ∫ (x − µ)² f(x | µ, σ²) dx

Inference About a Normal Mean

Example 4. Fit a normal/Gaussian model to the 'globular cluster luminosity functions' data. The set-up is as follows.

Our data consist of n measurements, X_i = µ + ε_i. Suppose the noise contributions are independent, and ε_i ∼ N(0, σ²). Denoting the random sample (x_1, ..., x_n) by x,

f(x | µ, σ²) = ∏_i f(x_i | µ, σ²)
            = ∏_i (2πσ²)^(−1/2) exp(−(x_i − µ)²/(2σ²))
            = (2πσ²)^(−n/2) exp(−(1/(2σ²)) Σ_{i=1}^n (x_i − µ)²)
            = (2πσ²)^(−n/2) exp(−(1/(2σ²)) [Σ_{i=1}^n (x_i − x̄)² + n(x̄ − µ)²]).

Note that (X̄, s² = Σ_{i=1}^n (X_i − X̄)²/(n − 1)) is sufficient for the parameters (µ, σ²). This is a very substantial data reduction.

Inference About a Normal Mean, σ² known

Not useful, but easy to understand.

l(µ|x) ∝ f(x | µ, σ²) ∝ exp(−n(µ − x̄)²/(2σ²)), so that X̄ is sufficient. Also, X̄ | µ ∼ N(µ, σ²/n). If an informative prior µ ∼ N(µ0, τ²) is chosen for µ,

π(µ|x) ∝ l(µ|x)π(µ)
       ∝ exp(−(1/2) [n(µ − x̄)²/σ² + (µ − µ0)²/τ²])
       ∝ exp(−((τ² + σ²/n)/(2τ²σ²/n)) (µ − ((τ²σ²/n)/(τ² + σ²/n)) (µ0/τ² + nx̄/σ²))²),

i.e., µ|x ∼ N(µ̂, δ²):

µ̂ = ((τ²σ²/n)/(τ² + σ²/n)) (µ0/τ² + nx̄/σ²)
  = (τ²/(τ² + σ²/n)) x̄ + ((σ²/n)/(τ² + σ²/n)) µ0.

µ̂ is the Bayes estimate of µ, which is just a weighted average of the sample mean x̄ and the prior mean µ0. δ² is the posterior variance of µ, and

δ² = (τ²σ²/n)/(τ² + σ²/n) = (σ²/n) · τ²/(τ² + σ²/n).

Therefore

- µ̂ ± δ is our estimate for µ, and
- µ̂ ± 2δ is a 95% HPD (Bayesian) credible interval for µ.

What happens as τ² → ∞, or as the prior becomes more and more flat?

µ̂ → x̄, δ → σ/√n.

So Jeffreys' prior π(µ) = C reproduces frequentist inference.
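A minimal sketch of this conjugate update in R; the data summary and prior values below are illustrative, not from the text:

> # Posterior N(mu.hat, delta^2) for a normal mean, sigma^2 known
> xbar <- 1.2; n <- 25; sigma <- 2      # sample summaries (illustrative)
> mu0 <- 0; tau2 <- 1                   # prior mean and variance (illustrative)
> w <- tau2 / (tau2 + sigma^2 / n)      # weight on the sample mean
> mu.hat <- w * xbar + (1 - w) * mu0
> delta2 <- (sigma^2 / n) * w           # posterior variance
> mu.hat + c(-2, 2) * sqrt(delta2)      # 95% HPD credible interval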

Inference About a Normal Mean, σ² unknown

Our observations X1, ..., Xn are a random sample from a Gaussian population with both mean µ and variance σ² unknown. We are only interested in µ. How do we get rid of the nuisance parameter σ²? Bayesian inference uses the posterior distribution, which is a probability distribution, so σ² should be integrated out from the joint posterior distribution of µ and σ² to get the marginal posterior distribution of µ.

l(µ, σ² | x) = (2πσ²)^(−n/2) exp(−(1/(2σ²)) [Σ_{i=1}^n (x_i − x̄)² + n(µ − x̄)²]).

Start with π(µ, σ²) and get

π(µ, σ² | x) ∝ π(µ, σ²) l(µ, σ² | x),

and then get

π(µ | x) = ∫0^∞ π(µ, σ² | x) dσ².

Use Jeffreys' prior π(µ, σ²) ∝ 1/σ²: a flat prior for µ, which is a location or translation parameter, and an independent flat prior for log(σ), which is again a location parameter, being the log of a scale parameter. Then

π(µ, σ² | x) ∝ (1/σ²) l(µ, σ² | x), and

π(µ | x) ∝ ∫0^∞ (σ²)^(−(n/2+1)) exp(−(1/(2σ²)) [Σ_{i=1}^n (x_i − x̄)² + n(µ − x̄)²]) dσ²
        ∝ [(n − 1)s² + n(µ − x̄)²]^(−n/2)
        ∝ [1 + n(µ − x̄)²/((n − 1)s²)]^(−n/2)
        ∝ density of Student's t_{n−1}.

Hence

√n (µ − x̄)/s | data ∼ t_{n−1}, and

P(x̄ − t_{n−1}(0.975) s/√n ≤ µ ≤ x̄ + t_{n−1}(0.975) s/√n | data) = 95%,

i.e., the Jeffreys' translation-scale invariant prior reproduces frequentist inference. What if there are some constraints on µ, such as −A ≤ µ ≤ B, or for example µ > 0? We will get a truncated t_{n−1} instead, but the procedure will go through with minimal change.

Example 4 contd. (GCL Data) n = 360, x̄ = 14.46, s = 1.19.

√360 (µ − 14.46)/1.19 | data ∼ t_359, so µ | data ∼ N(14.46, 0.063²) approximately.

The estimate for the mean GCL is 14.46 ± 0.063, and the 95% HPD credible interval is (14.33, 14.59).
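The interval can be verified in R (qt() gives the factor 1.97 rather than the rounded 2 used above, so the endpoints differ slightly):

> n <- 360; xbar <- 14.46; s <- 1.19
> xbar + qt(c(0.025, 0.975), df = n - 1) * s / sqrt(n)   # about (14.34, 14.58)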

Comparing two Normal Means

Example 5. Check whether the mean distance indicators in the two populations of LMC datasets are different.
http://www.iiap.res.in/astrostat/School10/datasets/LMC_distance.html

> x <- c(18.70, 18.55, 18.55, 18.575, 18.4, 18.42, 18.45,
+        18.59, 18.471, 18.54, 18.54, 18.64, 18.58)
> y <- c(18.45, 18.45, 18.44, 18.30, 18.38, 18.50, 18.55,
+        18.52, 18.40, 18.45, 18.55, 18.69)

Model as follows:

- X1, ..., X_{n1} is a random sample from N(µ1, σ1²).
- Y1, ..., Y_{n2} is a random sample from N(µ2, σ2²) (independent).
- Unknown parameters: (µ1, µ2, σ1², σ2²).
- Quantity of interest: η = µ1 − µ2.
- Nuisance parameters: σ1² and σ2².

Case 1. σ1² = σ2² = σ². Then sufficient for (µ1, µ2, σ²) is

(X̄, Ȳ, s² = (1/(n1 + n2 − 2)) [Σ_{i=1}^{n1} (X_i − X̄)² + Σ_{j=1}^{n2} (Y_j − Ȳ)²]).

It can be shown that

- X̄ | µ1, µ2, σ² ∼ N(µ1, σ²/n1),
- Ȳ | µ1, µ2, σ² ∼ N(µ2, σ²/n2),
- (n1 + n2 − 2)s² | µ1, µ2, σ² ∼ σ² χ²_{n1+n2−2},

and these three are independently distributed. Hence

X̄ − Ȳ | µ1, µ2, σ² ∼ N(η, σ²(1/n1 + 1/n2)), η = µ1 − µ2.

Use Jeffreys' location-scale invariant prior π(µ1, µ2, σ²) ∝ 1/σ². Then

η | σ², x, y ∼ N(x̄ − ȳ, σ²(1/n1 + 1/n2)), and
π(η, σ² | x, y) ∝ π(η | σ², x, y) π(σ² | s²). (7)

Integrate out σ² from (7) as in the previous example to get

(η − (x̄ − ȳ)) / (s √(1/n1 + 1/n2)) | x, y ∼ t_{n1+n2−2}.

The 95% HPD credible interval for η = µ1 − µ2 is

x̄ − ȳ ± t_{n1+n2−2}(0.975) s √(1/n1 + 1/n2),

the same as the frequentist t-interval.

Example 5 contd. We have x̄ = 18.539, ȳ = 18.473, n1 = 13, n2 = 12 and s² = 0.0085; η̂ = x̄ − ȳ = 0.066, s √(1/n1 + 1/n2) = 0.037, t_23(0.975) = 2.069. The 95% HPD credible interval for η = µ1 − µ2 is (0.066 − 2.069 × 0.037, 0.066 + 2.069 × 0.037) = (−0.011, 0.142).
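The same numbers can be obtained in R from the x and y vectors read in earlier:

> # Case 1: pooled-variance 95% HPD credible interval for eta
> n1 <- length(x); n2 <- length(y)
> s2 <- (sum((x - mean(x))^2) + sum((y - mean(y))^2)) / (n1 + n2 - 2)
> eta.hat <- mean(x) - mean(y)
> half <- qt(0.975, df = n1 + n2 - 2) * sqrt(s2 * (1/n1 + 1/n2))
> eta.hat + c(-1, 1) * half              # approximately (-0.011, 0.142)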

Case 2. σ1² and σ2² are not known to be equal. From the one-sample normal example, note that (X̄, s²_X = (1/(n1 − 1)) Σ_{i=1}^{n1} (X_i − X̄)²) is sufficient for (µ1, σ1²), and (Ȳ, s²_Y = (1/(n2 − 1)) Σ_{j=1}^{n2} (Y_j − Ȳ)²) is sufficient for (µ2, σ2²). Making inference on η = µ1 − µ2 when σ1² and σ2² are not assumed to be equal is called the Behrens-Fisher problem, for which the frequentist solution is not very straightforward, but the Bayes solution is. It is a well-known result that

X̄ | µ1, σ1² ∼ N(µ1, σ1²/n1) and (n1 − 1)s²_X | µ1, σ1² ∼ σ1² χ²_{n1−1},

and these are independently distributed. Similarly,

Ȳ | µ2, σ2² ∼ N(µ2, σ2²/n2) and (n2 − 1)s²_Y | µ2, σ2² ∼ σ2² χ²_{n2−1},

and these are independently distributed. The X and Y samples are independent. Use Jeffreys' prior

π(µ1, µ2, σ1², σ2²) ∝ (1/σ1²)(1/σ2²).

Calculations similar to those in the one-sample case give:

√n1 (µ1 − x̄)/s_X | data ∼ t_{n1−1},
√n2 (µ2 − ȳ)/s_Y | data ∼ t_{n2−1}, (8)

and these two are independent. The posterior distribution of η = µ1 − µ2 given the data is non-standard (the difference of two independent t variables) but not difficult to get.

Use Monte Carlo simulation: simply generate (µ1, µ2) repeatedly from (8) and construct a histogram for η = µ1 − µ2.

Example 5 (LMC) contd.

> xbar <- mean(x)
> ybar <- mean(y)
> n1 <- length(x)
> n2 <- length(y)
> sx <- sqrt(var(x))
> sy <- sqrt(var(y))
> mu1.sim <- xbar + sx * rt(100000, df = n1 - 1) / sqrt(n1)
> mu2.sim <- ybar + sy * rt(100000, df = n2 - 1) / sqrt(n2)
> plot(density(mu1.sim - mu2.sim))
> s <- sqrt((1/n1 + 1/n2) * ((n1-1)*sx^2 + (n2-1)*sy^2) / (n1+n2-2))
> curve(dt((x - (xbar-ybar)) / s, df = n1 + n2 - 2) / s,
+       add = TRUE, col = "red")

[Figure: density.default(x = mu1.sim − mu2.sim); Monte Carlo posterior density of η = µ1 − µ2 (N = 100000, bandwidth = 0.003582), with the equal-variance t posterior overlaid in red.]

The posterior mean of η = µ1 − µ2 is

η̂ = E(µ1 − µ2 | data) = 0.0654. (9)

95% HPD credible interval for η = µ1 − µ2 is

(−0.011, 0.142) for the equal-variance case, and (−0.014, 0.147) for the unequal-variance case. (10)

Bayesian Computations

Bayesian analysis requires computation of expectations and quantiles of probability distributions (posterior distributions). Most often posterior distributions will not be standard distributions. Then posterior quantities of inferential interest cannot be computed in closed form. Special techniques are needed.

Example M1. Suppose X1, X2, ..., X_k are the observed numbers of a certain type of star in k similar regions. Model them as independent Poisson counts: X_i ∼ Poisson(θ_i). The θ_i are a priori considered related. Let ν_i = log(θ_i) be the ith element of ν, and suppose

ν ∼ N_k(µ1, τ²[(1 − ρ)I_k + ρ11′]),

where 1 is the k-vector with all elements equal to 1, and µ, τ² and ρ are known constants. Then

f(x | ν) = exp(−Σ_{i=1}^k {e^{ν_i} − ν_i x_i}) / ∏_{i=1}^k x_i!,
π(ν) ∝ exp(−(1/(2τ²)) (ν − µ1)′ [(1 − ρ)I_k + ρ11′]^(−1) (ν − µ1)).

π(ν | x) ∝ exp(−Σ_{i=1}^k {e^{ν_i} − ν_i x_i} − (ν − µ1)′ [(1 − ρ)I_k + ρ11′]^(−1) (ν − µ1)/(2τ²)).

To obtain the posterior mean of θ_j, compute

E^π(θ_j | x) = E^π(exp(ν_j) | x) = ∫_{R^k} exp(ν_j) g(ν | x) dν / ∫_{R^k} g(ν | x) dν,

where

g(ν | x) = exp(−Σ_{i=1}^k {e^{ν_i} − ν_i x_i} − (ν − µ1)′ [(1 − ρ)I_k + ρ11′]^(−1) (ν − µ1)/(2τ²)).

This is a ratio of two k-dimensional integrals, and as k grows, the integrals become less and less easy to work with. Numerical integration techniques fail to be efficient in this case. This problem, known as the curse of dimensionality, is due to the fact that the size of the part of the space that is not relevant for the computation of the integral grows very fast with the dimension. Consequently, the error in approximation associated with this numerical method increases as a power of the dimension k, making the technique inefficient. The recent popularity of the Bayesian approach to statistical applications is mainly due to advances in statistical computing. These include the E-M algorithm and Markov chain Monte Carlo (MCMC) sampling techniques.

Monte Carlo Sampling

Consider an expectation that is not available in closed form. To estimate a population mean, gather a large sample from this population and consider the corresponding sample mean. The Law of Large Numbers guarantees that the estimate will be good provided the sample is large enough. Specifically, let f be a probability density function (or a mass function) and suppose the quantity of interest is a finite expectation of the form

E_f h(X) = ∫_X h(x) f(x) dx (11)

(or the corresponding sum in the discrete case). If i.i.d. observations X_1, X_2, ... can be generated from the density f, then

h̄_m = (1/m) Σ_{i=1}^m h(X_i) (12)

converges in probability to E_f h(X). This justifies using h̄_m as an approximation for E_f h(X) for large m. To provide a measure of accuracy, or the extent of error in the approximation, compute the standard error. If Var_f h(X) is finite, then Var_f(h̄_m) = Var_f h(X)/m. Further, Var_f h(X) = E_f h²(X) − (E_f h(X))² can be estimated by

s²_m = (1/m) Σ_{i=1}^m (h(X_i) − h̄_m)²,

and hence the standard error of h̄_m can be estimated by

s_m/√m = (1/√m) [(1/m) Σ_{i=1}^m (h(X_i) − h̄_m)²]^(1/2).

Confidence intervals for E_f h(X): using the CLT,

√m (h̄_m − E_f h(X))/s_m → N(0, 1) as m → ∞,

so (h̄_m − z_{α/2} s_m/√m, h̄_m + z_{α/2} s_m/√m) can be used as an approximate 100(1 − α)% confidence interval for E_f h(X), with z_{α/2} denoting the 100(1 − α/2)% quantile of the standard normal.

What Does This Say?

If we want to approximate the posterior mean, try to generate i.i.d. observations from the posterior distribution and consider the mean of this sample. This is rarely useful, because most often the posterior distribution will be a non-standard distribution which may not easily allow sampling from it. What are some other possibilities?

Example M2. Suppose X is N(θ, σ²) with known σ², and a Cauchy(µ, τ) prior on θ is considered appropriate. Then

π(θ | x) ∝ exp(−(θ − x)²/(2σ²)) [τ² + (θ − µ)²]^(−1),

and hence the posterior mean is

E^π(θ | x) = ∫ θ exp(−(θ − x)²/(2σ²)) [τ² + (θ − µ)²]^(−1) dθ / ∫ exp(−(θ − x)²/(2σ²)) [τ² + (θ − µ)²]^(−1) dθ
           = ∫ θ (1/σ)φ((θ − x)/σ) [τ² + (θ − µ)²]^(−1) dθ / ∫ (1/σ)φ((θ − x)/σ) [τ² + (θ − µ)²]^(−1) dθ,

where φ denotes the density of the standard normal (all integrals over (−∞, ∞)). E^π(θ | x) is the ratio of the expectation of h(θ) = θ/(τ² + (θ − µ)²) to that of h(θ) = 1/(τ² + (θ − µ)²), both expectations being with respect to the N(x, σ²) distribution. Therefore, we simply sample θ_1, θ_2, ... from N(x, σ²) and use

Ê^π(θ | x) = Σ_{i=1}^m θ_i [τ² + (θ_i − µ)²]^(−1) / Σ_{i=1}^m [τ² + (θ_i − µ)²]^(−1)

as our Monte Carlo estimate of E^π(θ | x). Note that (11) and (12) are applied separately to both the numerator and the denominator, but using the same sample of θ's.
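This estimator takes only a few lines of R (the values of x, σ, µ and τ below are illustrative, not from the text):

> # Direct Monte Carlo: sample theta from N(x, sigma^2) and form the ratio
> x <- 2; sigma <- 1; mu <- 0; tau <- 1; m <- 100000
> th <- rnorm(m, mean = x, sd = sigma)
> w <- 1 / (tau^2 + (th - mu)^2)
> sum(th * w) / sum(w)                   # estimate of E(theta | x)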

It is unwise to assume that the problem has been completely solved. The sample of θ's generated from N(x, σ²) will tend to concentrate around x, whereas to satisfactorily account for the contribution of the Cauchy prior to the posterior mean, a significant portion of the θ's should come from the tails of the posterior distribution. Why not express the posterior mean in the form

E^π(θ | x) = ∫ θ exp(−(θ − x)²/(2σ²)) π(θ) dθ / ∫ exp(−(θ − x)²/(2σ²)) π(θ) dθ,

and then sample θ's from Cauchy(µ, τ) and use the approximation

Ê^π(θ | x) = Σ_{i=1}^m θ_i exp(−(θ_i − x)²/(2σ²)) / Σ_{i=1}^m exp(−(θ_i − x)²/(2σ²)) ?
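This version is equally short (same illustrative values as before):

> # Sample theta from the Cauchy prior; the weights are the normal kernel
> th <- rcauchy(m, location = mu, scale = tau)
> w <- exp(-(th - x)^2 / (2 * sigma^2))
> sum(th * w) / sum(w)                   # alternative estimate of E(theta | x)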

However, this is also not satisfactory, because the tails of the posterior distribution are not as heavy as those of the Cauchy prior, and there will be excess sampling from the tails relative to the center. So the convergence of the approximation will be slower, resulting in a larger error in approximation (for a fixed m). Ideally, therefore, sampling should be from the posterior distribution itself. With this view in mind, a variation of the above theme, called Monte Carlo importance sampling, has been developed.

Consider (11) again. Suppose that it is difficult or expensive to sample directly from f, but there exists a probability density u that is very close to f from which it is easy to sample. Then we can rewrite (11) as

E_f h(X) = ∫_X h(x) f(x) dx = ∫_X h(x) [f(x)/u(x)] u(x) dx = ∫_X {h(x)w(x)} u(x) dx = E_u{h(X)w(X)},

where w(x) = f(x)/u(x). Now apply (12) with f replaced by u and h replaced by hw. In other words, generate i.i.d. observations X_1, X_2, ... from the density u and compute

(hw)_m = (1/m) Σ_{i=1}^m h(X_i) w(X_i).

The sampling density u is called the importance function.

Markov Chain Monte Carlo Methods

A severe drawback of standard Monte Carlo sampling and importance sampling: complete determination of the functional form of the posterior density is needed for implementation. Situations where posterior distributions are incompletely specified, or are specified indirectly, cannot be handled: the joint posterior distribution of the vector of parameters may be specified in terms of several conditional and marginal distributions, but not directly. This covers a large part of Bayesian analysis, because a lot of Bayesian modeling is hierarchical, so that the joint posterior is difficult to calculate but the conditional posteriors, given parameters at different levels of the hierarchy, are easier to write down (and hence sample from).

Markov Chains in MCMC

A sequence of random variables {X_n}_{n≥0} is a Markov chain if for any n, given the current value X_n, the past {X_j : j ≤ n − 1} and the future {X_j : j ≥ n + 1} are independent. In other words,

P(A ∩ B | X_n) = P(A | X_n) P(B | X_n), (13)

where A and B are events defined respectively in terms of the past and the future. An important subclass: Markov chains with time-homogeneous or stationary transition probabilities, where the probability distribution of X_{n+1} given X_n = x and the past X_j : j ≤ n − 1 depends only on x, and does not depend on the values of X_j : j ≤ n − 1 or on n.

If the set S of values {X_n} can take, known as the state space, is countable, this reduces to specifying the transition probability matrix P ≡ ((p_ij)), where for any two values i, j in S, p_ij is the probability that X_{n+1} = j given X_n = i, i.e., of moving from state i to state j in one time unit. For a state space S that is not countable, specify a transition kernel or transition function P(x, ·), where P(x, A) is the probability of moving from x into A in one step, i.e., P(X_{n+1} ∈ A | X_n = x). Given the transition probability and the probability distribution of the initial value X_0, one can construct the joint probability distribution of {X_j : 0 ≤ j ≤ n} for any finite n, i.e.,

P(X_0 = i_0, X_1 = i_1, ..., X_{n−1} = i_{n−1}, X_n = i_n)
  = P(X_n = i_n | X_0 = i_0, ..., X_{n−1} = i_{n−1}) × P(X_0 = i_0, X_1 = i_1, ..., X_{n−1} = i_{n−1})
  = p_{i_{n−1} i_n} P(X_0 = i_0, ..., X_{n−1} = i_{n−1})
  = P(X_0 = i_0) p_{i_0 i_1} p_{i_1 i_2} ... p_{i_{n−1} i_n}.

A probability distribution π is called stationary or invariant for a transition probability P, or the associated Markov chain {X_n}, if whenever the probability distribution of X_0 is π, the same is true for X_n for all n ≥ 1. Thus, in the countable state space case, a probability distribution π = {π_i : i ∈ S} is stationary for a transition probability matrix P if for each j in S,

P(X_1 = j) = Σ_i P(X_1 = j | X_0 = i) P(X_0 = i) = Σ_i π_i p_ij = P(X_0 = j) = π_j. (14)

In vector notation it says π = (π1, π2,...) is a left eigenvector of the matrix P with eigenvalue 1 and

π = πP. (15)

Similarly, if S is a continuum, a probability distribution π with density p(x) is stationary for the transition kernel P(·, ·) if

π(A) = ∫_A p(x) dx = ∫_S P(x, A) p(x) dx for all A ⊂ S.

A Markov chain {X_n} with a countable state space S and transition probability matrix P ≡ ((p_ij)) is said to be irreducible if for any two states i and j the probability of the Markov chain visiting j starting from i is positive, i.e., for some n ≥ 1, p_ij^(n) ≡ P(X_n = j | X_0 = i) > 0. A similar notion of irreducibility, known as Harris or Doeblin irreducibility, exists for the general state space case also.

Theorem (Law of Large Numbers for Markov Chains). Let {X_n}_{n≥0} be a Markov chain with a countable state space S and a transition probability matrix P. Suppose it is irreducible and has a stationary probability distribution π ≡ (π_i : i ∈ S) as defined in (14). Then, for any bounded function h : S → R and for any initial distribution of X_0,

(1/n) Σ_{i=0}^{n−1} h(X_i) → Σ_j h(j) π_j (16)

in probability as n → ∞. A similar law of large numbers (LLN) holds when the state space S is not countable. The limit value in (16) will be the integral of h with respect to the stationary distribution π. A sufficient condition for the validity of this LLN is that the Markov chain {X_n} be Harris irreducible and have a stationary distribution π.

How is this Useful?

A probability distribution π on a set S is given. We want to compute the "integral of h with respect to π", which reduces to Σ_j h(j) π_j in the countable case.

Look for an irreducible Markov chain {Xn } with state space S and stationary distribution π. Starting from some initial value X0, run the Markov chain {Xj } for a period of time, say 0, 1, 2,... n − 1 and consider as an estimate

µ_n = (1/n) Σ_{j=0}^{n−1} h(X_j). (17)

By the LLN (16), µ_n will be close to Σ_j h(j) π_j for large n. This technique is called Markov chain Monte Carlo (MCMC). To approximate π(A) ≡ Σ_{j∈A} π_j for some A ⊂ S, simply consider

π_n(A) ≡ (1/n) Σ_{j=0}^{n−1} I_A(X_j) → π(A),

where I_A(X_j) = 1 if X_j ∈ A and 0 otherwise.

An irreducible Markov chain {X_n} with a countable state space S is called aperiodic if for some i ∈ S the greatest common divisor g.c.d.{n : p_ii^(n) > 0} = 1. Then, in addition to the LLN (16), the following result on the convergence of P(X_n = j) holds:

Σ_j |P(X_n = j) − π_j| → 0 (18)

as n → ∞, for any initial distribution of X_0. In other words, for large n the probability distribution of X_n will be close to π. There exists a result similar to (18) for the general state space case also. This suggests that instead of doing one run of length n, one could do N independent runs, each of length m, so that n = Nm, and then from the ith run use only the mth observation, say X_{m,i}, and consider the estimate

µ̃_{N,m} ≡ (1/N) Σ_{i=1}^N h(X_{m,i}). (19)

Metropolis-Hastings Algorithm

A very general MCMC method with wide applications. The idea is not to directly simulate from the given target density (which may be computationally difficult), but to simulate an easy Markov chain that has this target density as its stationary distribution. Let π be the target probability distribution on S, a finite or countable set. Let Q ≡ ((q_ij)) be a transition probability matrix such that for each i, it is computationally easy to generate a sample from the distribution {q_ij : j ∈ S}. Generate a Markov chain {X_n} as follows. If X_n = i, first sample from the distribution {q_ij : j ∈ S} and denote that observation Y_n. Then choose X_{n+1} from the two values X_n and Y_n according to

P(X_{n+1} = Y_n | X_n, Y_n) = ρ(X_n, Y_n) = 1 − P(X_{n+1} = X_n | X_n, Y_n),

where the "acceptance probability" ρ(·, ·) is given by

ρ(i, j) = min{π_j q_ji / (π_i q_ij), 1} for all (i, j) such that π_i q_ij > 0.

{X_n} is a Markov chain with transition probability matrix P = ((p_ij)) given by

p_ij = q_ij ρ_ij for j ≠ i; p_ii = 1 − Σ_{k≠i} p_ik. (20)

Q is called the "proposal transition probability" and ρ the "acceptance probability". A significant feature of this transition mechanism P is that P and π satisfy

π_i p_ij = π_j p_ji for all i, j. (21)

This implies that for any j,

Σ_i π_i p_ij = Σ_i π_j p_ji = π_j, (22)

i.e., π is a stationary probability distribution for P. Suppose S is irreducible with respect to Q and π_i > 0 for all i in S. It can then be shown that P is irreducible, and because it has a stationary distribution π, the LLN (16) is available. This algorithm is thus a very flexible and useful one. The choice of Q is subject only to the condition that S is irreducible with respect to Q. A sufficient condition for the aperiodicity of P is that p_ii > 0 for some i, or equivalently

Σ_{j≠i} q_ij ρ_ij < 1.

A sufficient condition for this is that there exists a pair (i, j) such that π_i q_ij > 0 and π_j q_ji < π_i q_ij. Recall that if P is aperiodic, then both the LLN (16) and (18) hold. If S is not finite or countable but is a continuum, and the target distribution π(·) has a density p(·), then one proceeds as follows. Let Q be a transition function such that for each x, Q(x, ·) has a density q(x, y). Then proceed as in the discrete case, but set the "acceptance probability" ρ(x, y) to be

ρ(x, y) = min{p(y)q(y, x) / (p(x)q(x, y)), 1}

for all (x, y) such that p(x)q(x, y) > 0. A particularly useful feature of the above algorithm is that it is enough to know p(·) up to a multiplicative constant, as the "acceptance probability" ρ(·, ·) needs only the ratios p(y)/p(x) or π_j/π_i. This assures us that in Bayesian applications it is not necessary to have the normalizing constant of the posterior density available for computation of the posterior quantities of interest.

Most of the new problems that Bayesians are asked to solve are high-dimensional: e.g., micro-arrays, image processing. Bayesian analysis of such problems involves target (posterior) distributions that are high-dimensional multivariate distributions. In image processing, one typically has an N × N square grid of pixels with N = 256, and each pixel has k ≥ 2 possible values. Each configuration has (256)² components, and the state space S has k^((256)²) configurations. How does one simulate a random configuration from a target distribution over such a large S?

The Gibbs sampler is a technique especially suitable for generating an irreducible aperiodic Markov chain that has as its stationary distribution a target distribution in a high-dimensional space having some special structure. The most interesting aspect of this technique: to run this Markov chain, it suffices to generate observations from univariate distributions. The Gibbs sampler in the context of a bivariate probability distribution can be described as follows. Let π be a target probability distribution of a bivariate random vector (X, Y). For each x, let P(x, ·) be the conditional probability distribution of Y given X = x. Similarly, let Q(y, ·) be the conditional probability distribution of X given Y = y. Note that for each x, P(x, ·) is a univariate distribution, and for each y, Q(y, ·) is also a univariate distribution. Now generate a bivariate Markov chain Z_n = (X_n, Y_n) as follows:

Start with some X0 = x0. Generate an observation Y0 from the distribution P(x0, ·). Then generate an observation X1 from Q(Y0, ·). Next generate an observation Y1 from P(X1, ·) and so on. At stage n if Zn = (Xn , Yn ) is known, then generate Xn+1 from Q(Yn , ·) and Yn+1 from P(Xn+1, ·). If π is a discrete distribution concentrated on {(xi , yj ) : 1 ≤ i ≤ K , 1 ≤ j ≤ L} and if πij = π(xi , yj ) then P(xi , yj ) = πij /πi· and Q(yj , xi ) = πij /π·j , where P P πi· = j πij , π·j = i πij . Thus the transition probability matrix R = ((r(ij ),(k`))) for the {Zn } chain is given by

If π is a discrete distribution concentrated on {(x_i, y_j) : 1 ≤ i ≤ K, 1 ≤ j ≤ L}, and if π_ij = π(x_i, y_j), then P(x_i, y_j) = π_ij/π_i· and Q(y_j, x_i) = π_ij/π_·j, where π_i· = Σ_j π_ij and π_·j = Σ_i π_ij. Thus the transition probability matrix R = ((r_(ij),(kℓ))) for the {Z_n} chain is given by

r_(ij),(kℓ) = Q(y_j, x_k) P(x_k, y_ℓ) = (π_kj/π_·j)(π_kℓ/π_k·).

One can verify that this chain is irreducible, aperiodic, and has π as its stationary distribution. Thus the LLN (16) and (18) hold in this case, so for large n, Z_n can be viewed as a sample from a distribution close to π, and one can approximate Σ_{i,j} h(i, j) π_ij by Σ_{i=1}^n h(X_i, Y_i)/n.

Illustration: Consider sampling from

(X, Y)′ ∼ N_2((0, 0)′, [[1, ρ], [ρ, 1]]).

The conditional distribution of X given Y = y and that of Y given X = x are

X | Y = y ∼ N(ρy, 1 − ρ²) and Y | X = x ∼ N(ρx, 1 − ρ²). (23)

Using this property, Gibbs sampling proceeds as follows: Generate (Xn , Yn ), n = 0, 1, 2,..., by starting from an arbitrary value x0 for X0, and repeat the following steps for i = 0, 1,..., n.

1. Given x_i for X, draw a random deviate from N(ρx_i, 1 − ρ²) and denote it by Y_i.
2. Given y_i for Y, draw a random deviate from N(ρy_i, 1 − ρ²) and denote it by X_{i+1}.

The theory of Gibbs sampling tells us that if n is large, then (x_n, y_n) is a random draw from a distribution that is close to N_2((0, 0)′, [[1, ρ], [ρ, 1]]).

Multivariate extension: π is a probability distribution of a k-dimensional random vector (X_1, X_2, ..., X_k). If u = (u_1, u_2, ..., u_k) is any k-vector, let u_{−i} = (u_1, u_2, ..., u_{i−1}, u_{i+1}, ..., u_k) be the (k − 1)-dimensional vector resulting from dropping the ith component u_i. Let π_i(· | x_{−i}) denote the univariate conditional distribution of X_i given that X_{−i} ≡ (X_1, X_2, ..., X_{i−1}, X_{i+1}, ..., X_k) = x_{−i}. Starting with some initial value X_0 = (x_01, x_02, ..., x_0k), generate X_1 = (X_11, X_12, ..., X_1k) sequentially by generating X_11 according to the univariate distribution π_1(· | x_{0,−1}), then generating X_12 according to π_2(· | (X_11, x_03, x_04, ..., x_0k)), and so on. The most important feature to recognize here is that all the univariate conditional distributions X_i | X_{−i} = x_{−i}, known as full conditionals, should easily allow sampling from them. This is the case in most hierarchical Bayes problems. Thus, the Gibbs sampler is particularly well adapted for Bayesian computations with hierarchical priors.
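A minimal sketch of the two-step bivariate scheme above (ρ = 0.6 is an arbitrary choice):

> # Gibbs sampler for the bivariate normal illustration
> rho <- 0.6; n <- 10000
> xg <- yg <- numeric(n)
> xg[1] <- 0                                   # arbitrary starting value
> yg[1] <- rnorm(1, rho * xg[1], sqrt(1 - rho^2))
> for (i in 2:n) {
+   xg[i] <- rnorm(1, rho * yg[i - 1], sqrt(1 - rho^2))
+   yg[i] <- rnorm(1, rho * xg[i], sqrt(1 - rho^2))
+ }
> cor(xg, yg)                                  # should be close to rho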

Rao-Blackwellization

The variance reduction idea of the famous Rao-Blackwell theorem in the presence of auxiliary information can be used to provide improved estimators when MCMC procedures are adopted.

Theorem (Rao-Blackwell). Let δ(X_1, X_2, ..., X_n) be an estimator of θ with finite variance. Suppose that T is sufficient for θ, and let δ*(T), defined by δ*(t) = E(δ(X_1, X_2, ..., X_n) | T = t), be the conditional expectation of δ(X_1, X_2, ..., X_n) given T = t. Then

E(δ*(T) − θ)² ≤ E(δ(X_1, X_2, ..., X_n) − θ)².

The inequality is strict unless δ = δ*, or equivalently, δ is already a function of T. By the property of iterated conditional expectation,

E(δ*(T)) = E[E(δ(X_1, X_2, ..., X_n) | T)] = E(δ(X_1, X_2, ..., X_n)).

Therefore, to compare the mean squared errors (MSE) of the two estimators, compare their variances only. Now,

Var(δ(X_1, X_2, ..., X_n)) = Var[E(δ | T)] + E[Var(δ | T)] = Var(δ*) + E[Var(δ | T)] > Var(δ*),

unless Var(δ | T) = 0, which is the case only if δ is a function of T. The Rao-Blackwell theorem involves two key steps: variance reduction by conditioning, and conditioning by a sufficient statistic. The first step is based on the following formula: for any two random variables S and T, because

Var(S) = Var(E(S | T)) + E(Var(S | T)),

one can reduce the variance of a random variable S by taking conditional expectation given some auxiliary information T. This can be exploited in MCMC. Let (X_j, Y_j), j = 1, 2, ..., N, be a single run of the Gibbs sampler algorithm with a target distribution of a bivariate random vector (X, Y). Let h(X) be a function of the X component of (X, Y), and let its mean value be µ. The goal is to estimate µ. A first estimate is the sample mean of the h(X_j), j = 1, 2, ..., N. From the MCMC theory, as N → ∞, this estimate will converge to µ in probability. The computation of the variance of this estimator is not easy, due to the (Markovian) dependence of the sequence {X_j, j = 1, 2, ..., N}. Suppose we make n independent runs of the Gibbs sampler and generate (X_ij, Y_ij), j = 1, 2, ..., N; i = 1, 2, ..., n. Suppose that N is sufficiently large that (X_iN, Y_iN) can be regarded as a sample from the limiting target distribution of the Gibbs sampling scheme. Thus (X_iN, Y_iN), i = 1, 2, ..., n, form a random sample from the target distribution. Consider a second estimate of µ: the sample mean of h(X_iN), i = 1, 2, ..., n. This estimator ignores part of the MCMC data, but has the advantage that the variables h(X_iN), i = 1, 2, ..., n, are independent, and hence the variance of their mean is of order n^(−1). Now, applying the variance reduction idea of the Rao-Blackwell theorem by using the auxiliary information Y_iN, i = 1, 2, ..., n, one can improve this estimator as follows:

Let k(y) = E(h(X) | Y = y). Then for each i, k(Y_iN) has a smaller variance than h(X_iN), and hence the following third estimator,

(1/n) Σ_{i=1}^n k(Y_iN),

has a smaller variance than the second one. A crucial fact to keep in mind here is that the exact functional form of k(y) must be available for implementing this improvement.

(Example M2 continued.) X | θ ∼ N(θ, σ²) with known σ², and θ ∼ Cauchy(µ, τ). We want to simulate θ from the posterior distribution, but sampling directly is difficult. Gibbs sampling: the Cauchy is a scale mixture of normal densities, with the scale parameter having a Gamma distribution.

π(θ) ∝ [τ² + (θ − µ)²]^(−1)
     ∝ ∫0^∞ (λ/(2πτ²))^(1/2) exp(−λ(θ − µ)²/(2τ²)) λ^(1/2−1) exp(−λ/2) dλ,

so that π(θ) may be considered the marginal prior density from the joint prior density of (θ, λ), where

θ | λ ∼ N(µ, τ²/λ) and λ ∼ Gamma(1/2, 1/2).

This implicit hierarchical prior structure implies: π(θ | x) is the marginal density from π(θ, λ | x). Full conditionals of π(θ, λ | x) are standard distributions:

θ | λ, x ∼ N([τ²/(τ² + λσ²)] x + [λσ²/(τ² + λσ²)] µ, τ²σ²/(τ² + λσ²)), (24)
λ | θ, x ∼ λ | θ ∼ Exponential((τ² + (θ − µ)²)/(2τ²)). (25)

Thus, the Gibbs sampler will use (24) and (25) to generate (θ, λ) from π(θ, λ | x).
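A sketch of this Gibbs sampler in R (the values of x, σ, µ and τ are illustrative, not from the text):

> # Gibbs sampling from pi(theta, lambda | x) via (24) and (25)
> x <- 2; sigma <- 1; mu <- 0; tau <- 1; n <- 10000
> theta <- numeric(n); lambda <- 1             # arbitrary starting value
> for (i in 1:n) {
+   v <- tau^2 * sigma^2 / (tau^2 + lambda * sigma^2)
+   m <- (tau^2 * x + lambda * sigma^2 * mu) / (tau^2 + lambda * sigma^2)
+   theta[i] <- rnorm(1, m, sqrt(v))           # draw from (24)
+   lambda <- rexp(1, rate = (tau^2 + (theta[i] - mu)^2) / (2 * tau^2))  # (25)
+ }
> mean(theta)                                  # posterior mean estimate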

Example M5. Let X = the number of defectives in the daily production of a product. (X | Y, θ) ∼ binomial(Y, θ), where Y, a day's production, is Poisson with known mean λ, and θ is the probability that any product is defective. The difficulty is that Y is not observable, and inference has to be made on the basis of X only. Prior: (θ | Y = y) ∼ Beta(α, γ), with known α and γ, independent of Y. Bayesian analysis here is not difficult, because the posterior distribution of θ | X = x can be obtained as follows. First, X | θ ∼ Poisson(λθ). Next, θ ∼ Beta(α, γ). Therefore,

π(θ | X = x) ∝ exp(−λθ) θ^(x+α−1) (1 − θ)^(γ−1), 0 < θ < 1. (26)

This is not a standard distribution, and hence posterior quantities cannot be obtained in closed form. Instead of focusing on θ | X directly, view it as a marginal component of (Y, θ | X). Check that the full conditionals of this are given by

Y | X = x, θ ∼ x + Poisson(λ(1 − θ)), and θ | X = x, Y = y ∼ Beta(α + x, γ + y − x),

both of which are standard distributions.

Example M5 continued. It is actually possible here to sample from the posterior distribution using the accept-reject method: Let g(x)/K be the target density, where K is the possibly unknown normalizing constant of the unnormalized density g. Suppose h(x) is a density that can be simulated by a known method and is close to g, and suppose there exists a known constant c > 0 such that g(x) < c h(x) for all x. Then, to simulate from the target density, the following two steps suffice.

Step 1. Generate Y ∼ h and U ∼ U(0, 1);
Step 2. Accept X = Y if U ≤ g(Y)/{c h(Y)}; return to Step 1 otherwise.

The optimal choice for c is sup_x {g(x)/h(x)}. In Example M5, from (26),

g(θ) = exp(−λθ) θ^(x+α−1) (1 − θ)^(γ−1) I{0 ≤ θ ≤ 1},

so that h(θ) may be chosen to be the density of Beta(x + α, γ). Then, with the above-mentioned choice for c, if θ ∼ Beta(x + α, γ) is generated in Step 1, its 'acceptance probability' in Step 2 is simply exp(−λθ). Even though this method works here, let us see how the Metropolis-Hastings algorithm can be applied. The required Markov chain is generated by taking the transition density q(z, y) = q(y | z) = h(y), independently of z. Then the acceptance probability is

ρ(z, y) = min{g(y)h(z) / (g(z)h(y)), 1} = min{exp(−λ(y − z)), 1}.

The steps involved in this "independent" M-H algorithm are:

Start at t = 0 with a value x0 in the support of the target distribution; in this case, 0 < x0 < 1. Given xt , generate the next value in the chain as given below.

(a) Draw Y_t from Beta(x + α, γ).
(b) Let x_{t+1} = Y_t with probability ρ_t, and x_{t+1} = x_t otherwise, where ρ_t = min{exp(−λ(Y_t − x_t)), 1}.
(c) Set t = t + 1 and go to step (a).

Run this chain until t = n, a suitably chosen large integer. In our example, for x = 1, α = 1, γ = 49 and λ = 100, we simulated such a Markov chain. The resulting histogram is shown in the figure below, with the true posterior density super-imposed on it.

Figure: M-H frequency histogram and true posterior density.
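A sketch of this chain in R, using the values just given:

> # Independent Metropolis-Hastings for Example M5
> x <- 1; alpha <- 1; gamma <- 49; lambda <- 100; n <- 10000
> th <- numeric(n); th[1] <- 0.01              # starting value in (0, 1)
> for (t in 1:(n - 1)) {
+   y <- rbeta(1, x + alpha, gamma)            # step (a): proposal
+   rho <- min(exp(-lambda * (y - th[t])), 1)  # step (b): acceptance probability
+   th[t + 1] <- if (runif(1) < rho) y else th[t]
+ }
> hist(th, breaks = 50, freq = FALSE)          # compare with (26)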


Empirical Bayes Methods for High Dimensional Problems

This is becoming popular again, this time for 'high dimensional' problems. Astronomers routinely estimate characteristics of millions of similar astronomical objects: distance, radial velocity, whatever. Consider the data:

X_1 = (X_11, X_12, ..., X_1n)′, X_2 = (X_21, X_22, ..., X_2n)′, ..., X_p = (X_p1, X_p2, ..., X_pn)′.

X_j represents n repeated independent observations on the jth object, j = 1, 2, ..., p. The important point is that n is small, say 2, 5, or 10, whereas p is large, such as a million.

Suppose X_j1, ..., X_jn measure µ_j with variability σ².

Problem: Maximum likelihood can give wrong estimates. Take n = 2 and suppose

(X_j1, X_j2)′ ∼ N_2((µ_j, µ_j)′, σ² I_2), j = 1, 2, ..., p,

i.e., we measure µ_j with 2 independent measurements, each coming with a N(0, σ²) error added to it; we do this for a very large number p of objects. What is the MLE of σ²?

l(µ_1, ..., µ_p; σ² | x_1, ..., x_p) = f(x_1, ..., x_p | µ_1, ..., µ_p; σ²)
  = ∏_{j=1}^p ∏_{i=1}^2 f(x_ji | µ_j, σ²)
  = (2πσ²)^(−p) exp(−(1/(2σ²)) Σ_{j=1}^p Σ_{i=1}^2 (x_ji − µ_j)²)
  = (2πσ²)^(−p) exp(−(1/(2σ²)) Σ_{j=1}^p [Σ_{i=1}^2 (x_ji − x̄_j)² + 2(x̄_j − µ_j)²]).

So µ̂_j = x̄_j = (x_j1 + x_j2)/2, and

σ̂² = (1/(2p)) Σ_{j=1}^p Σ_{i=1}^2 (x_ji − x̄_j)²
   = (1/(2p)) Σ_{j=1}^p [(x_j1 − (x_j1 + x_j2)/2)² + (x_j2 − (x_j1 + x_j2)/2)²]
   = (1/(2p)) Σ_{j=1}^p (x_j1 − x_j2)²/2
   = (1/(4p)) Σ_{j=1}^p (x_j1 − x_j2)².

Since X_j1 − X_j2 ∼ N(0, 2σ²), j = 1, 2, ..., p,

(1/p) Σ_{j=1}^p (X_j1 − X_j2)² →_P 2σ² as p → ∞, so that

σ̂² = (1/(4p)) Σ_{j=1}^p (X_j1 − X_j2)² →_P σ²/2, and not σ².

Good estimates for σ² do exist, for example,

(1/(2p)) Σ_{j=1}^p (X_j1 − X_j2)² →_P σ².

What is going wrong here? This is not a small p, large n problem, but a small n, large p problem, i.e., a high dimensional problem, so it needs care! As p → ∞, there are too many parameters to estimate, and the likelihood function is unable to see where the information lies, so it tries to distribute it everywhere. What is the way out? Go Bayesian! There is a lot of information available on σ² (note Σ_{j=1}^p (X_j1 − X_j2)² ∼ 2σ² χ²_p) but very little on the individual µ_j. However, if the µ_j are 'similar', there is a lot of information on where they come from, because we get to see p samples, with p large.

Suppose we are interested in µj . How can we use the above information? Model as follows:

X̄_j | µ_j, σ² ∼ N(µ_j, σ²/2), j = 1, ..., p, independent observations. σ² may be assumed known, since a reliable estimate σ̂² = (1/(2p)) Σ_{j=1}^p (X_j1 − X_j2)² is available. Express the information that the µ_j are 'similar' in the form: µ_j, j = 1, ..., p, is a random sample (collection) from N(η, τ²). Where do we get η and τ², the prior mean and prior variance?

Marginally (or in the predictive sense), X̄_j, j = 1, ..., p, is a random sample from N(η, τ² + σ²/2). Use this random sample: estimate η by η̂ = X̄ = (1/p) Σ X̄_j, and τ² by

τ̂² = ((1/(p − 1)) Σ_{j=1}^p (X̄_j − X̄)² − σ²/2)⁺,

where (a)⁺ = max(a, 0).

Now one could pretend that the prior for (µ_1, ..., µ_p) is N(η̂, τ̂²) and compute the Bayes estimates for µ_j:

E(µ_j | X_1, ..., X_p) = (1 − B̂) X̄_j + B̂ X̄,

where B̂ = (σ²/2)/(σ²/2 + τ̂²). If instead of 2 observations each sample has n observations, replace 2 by n. This is called Empirical Bayes, since the prior is estimated using the data. There is also a fully Bayesian counterpart, called Hierarchical Bayes.
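A small simulation sketch of these formulas (the simulation settings are illustrative):

> # Empirical Bayes shrinkage for the n = 2 case
> p <- 10000; tau <- 1; sigma <- 0.5
> mu <- rnorm(p, mean = 5, sd = tau)               # 'similar' object means
> x1 <- rnorm(p, mu, sigma); x2 <- rnorm(p, mu, sigma)
> xbar <- (x1 + x2) / 2
> sigma2.hat <- sum((x1 - x2)^2) / (2 * p)         # estimates sigma^2
> tau2.hat <- max(var(xbar) - sigma2.hat / 2, 0)   # estimates tau^2
> B <- (sigma2.hat / 2) / (sigma2.hat / 2 + tau2.hat)
> mu.eb <- (1 - B) * xbar + B * mean(xbar)         # shrunk estimates of mu_j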


Formal Methods for Model Selection

What is the best model for a gamma-ray burst afterglow? Consider a simpler, abstract problem instead. Suppose X, having density f(x | θ), is observed, with θ being an unknown element of the parameter space Θ. We are interested in comparing two models M0 and M1:

M0 : X has density f (x | θ) where θ ∈ Θ0;

M1 : X has density f (x | θ) where θ ∈ Θ1. (27)

Simplify even further, and assume we want to test

M0 : θ = θ0 versus M1 : θ ≠ θ0. (28)

Frequentist: A (classical) significance test is derived. It is based on a test statistic T(X), large values of which are deemed to provide evidence against the null hypothesis, M0. If data X = x is observed, with corresponding t = T(x), the P-value is

α = P_θ0(T(X) ≥ T(x)).

Example 6. Consider a random sample X1, ..., Xn from N(θ, σ²), where σ² is known. Then X̄ is sufficient for θ, and it has the N(θ, σ²/n) distribution. Noting that T = T(X̄) = |√n (X̄ − θ0)/σ| is a natural test statistic to test (28), one obtains the usual P-value as α = 2[1 − Φ(t)], where t = |√n (x̄ − θ0)/σ| and Φ is the standard normal cumulative distribution function.

What is a P-value and what does it say? A P-value is the probability, under a (simple) null hypothesis, of obtaining a value of a test statistic that is at least as extreme as that observed in the sample data. To compute a P-value, we take the observed value of the test statistic to the reference distribution and check whether it is likely or unlikely under M0.

χ² Goodness-of-fit test

Example 7. Rutherford and Geiger (1910) gave the following observed numbers of intervals of 1/8 minute in which 0, 1, ... α-particles are ejected by a specimen. Check if a Poisson model fits well.

Number   0     1     2     3     4     5
Obs.     57    203   383   525   532   408
Exp.     54    211   407   525   508   393

Number   6     7     8     9     10    11    12 or more
Obs.     273   139   45    27    10    4     2
Exp.     254   140   68    29    11    4     1

Test statistic: T = Σ_{i=1}^k (O_i − E_i)²/E_i ∼ χ²_{k−2} approximately, for large n, where k is the number of cells, O_i is the observed and E_i the expected (estimated) count for the ith cell. Estimated Poisson intensity rate = (total number of particles ejected)/(total number of intervals) = 10097/2608 = 3.87, and k = 13.

P-value = P(T ≥ 14.03) ≈ 0.21 (under χ²_11).
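The computation can be verified in R from the observed and expected counts above:

> obs  <- c(57, 203, 383, 525, 532, 408, 273, 139, 45, 27, 10, 4, 2)
> expd <- c(54, 211, 407, 525, 508, 393, 254, 140, 68, 29, 11, 4, 1)
> stat <- sum((obs - expd)^2 / expd)                     # about 14.03
> pchisq(stat, df = length(obs) - 2, lower.tail = FALSE) # P-value about 0.2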

Likelihood Ratio Criterion

The standard likelihood ratio criterion for comparing M0 and M1 is

λ_n = f(x | θ̂_0)/f(x | θ̂) = max_{θ∈Θ0} f(x | θ) / max_{θ∈Θ0∪Θ1} f(x | θ). (29)

0 < λn ≤ 1, and large values of λn provide evidence for M0. Reject M0 for small values. Use λn (or a function of λn ) as a test statistic if its distribution under M0 can be derived. Otherwise, use the large sample result:

−2 log(λ_n) →_L χ²_{p1−p0} as n → ∞

under M0, where p0 and p1 are the dimensions of Θ0 and Θ0 ∪ Θ1.


Bayesian Model Selection

How does the Bayesian approach work? X ∼ f(x | θ), and we want to test

M0 : θ ∈ Θ0 versus M1 : θ ∈ Θ1. (30)

If Θ0 and Θ1 are of the same dimension (e.g., M0 : θ ≤ 0 and M1 : θ > 0), choose a prior density that assigns positive prior probability to both Θ0 and Θ1. Then calculate the posterior probabilities P{Θ0 | x} and P{Θ1 | x}, as well as the posterior odds, namely,

P{Θ0 | x}/P{Θ1 | x}. Find a threshold like 1/9 or 1/19, etc., to decide what constitutes evidence against M0.

Alternatively, let π0 and 1 − π0 be the prior probabilities of Θ0 and Θ1, and let gi(θ) be the prior p.d.f. of θ under Θi (or Mi), so that

∫_{Θi} gi(θ) dθ = 1.

The prior in the previous approach is nothing but

π(θ) = π0g0(θ)I {θ ∈ Θ0} + (1 − π0)g1(θ)I {θ ∈ Θ1}.

It is no longer required that Θ0 and Θ1 be of the same dimension; sharp null hypotheses are also covered. Proceed as before and report posterior probabilities or posterior odds. To compute these posterior quantities, note that the marginal density of X under the prior π can be expressed as

mπ(x) = ∫_Θ f(x | θ) π(θ) dθ
      = π0 ∫_{Θ0} f(x | θ) g0(θ) dθ + (1 − π0) ∫_{Θ1} f(x | θ) g1(θ) dθ,

and hence the posterior density of θ given the data X = x as

π(θ | x) = f(x | θ) π(θ) / mπ(x)
         = π0 f(x | θ) g0(θ) / mπ(x)         if θ ∈ Θ0;
         = (1 − π0) f(x | θ) g1(θ) / mπ(x)   if θ ∈ Θ1.

It follows then that

Pπ(M0 | x) = Pπ(Θ0 | x) = (π0 / mπ(x)) ∫_{Θ0} f(x | θ) g0(θ) dθ
  = π0 ∫_{Θ0} f(x | θ) g0(θ) dθ / [ π0 ∫_{Θ0} f(x | θ) g0(θ) dθ + (1 − π0) ∫_{Θ1} f(x | θ) g1(θ) dθ ];

Pπ(M1 | x) = Pπ(Θ1 | x) = ((1 − π0) / mπ(x)) ∫_{Θ1} f(x | θ) g1(θ) dθ
  = (1 − π0) ∫_{Θ1} f(x | θ) g1(θ) dθ / [ π0 ∫_{Θ0} f(x | θ) g0(θ) dθ + (1 − π0) ∫_{Θ1} f(x | θ) g1(θ) dθ ].

One may also report the Bayes factor, which does not depend on π0. The Bayes factor of M0 relative to M1 is defined as

BF01 = [ P(Θ0 | x) / P(Θ1 | x) ] / [ P(Θ0) / P(Θ1) ] = ∫_{Θ0} f(x | θ) g0(θ) dθ / ∫_{Θ1} f(x | θ) g1(θ) dθ. (31)

Note:

• BF10 = 1/BF01.

• Posterior odds ratio of M0 relative to M1:

  P(Θ0 | x) / P(Θ1 | x) = [ π0 / (1 − π0) ] BF01.

• Posterior odds ratio of M0 relative to M1 = BF01 if π0 = 1/2.
• The smaller the value of BF01, the stronger the evidence against M0.
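A minimal numerical sketch of these quantities (assumptions, not from the source: binomial data with x = 3 successes in n = 10 trials, sharp null θ0 = 1/2, a Uniform(0, 1) prior g1 under M1, and π0 = 1/2):

x <- 3; n <- 10; pi0 <- 0.5
m0 <- dbinom(x, n, prob = 0.5)                            # f(x | theta0)
m1 <- integrate(function(t) dbinom(x, n, t), 0, 1)$value  # uniform g1 on (0, 1)
BF01 <- m0 / m1
postM0 <- pi0 * m0 / (pi0 * m0 + (1 - pi0) * m1)
c(BF01 = BF01, postM0 = postM0)   # about 1.29 and 0.56: weak evidence either way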

Testing as a model selection problem using the Bayes factor is illustrated below with the Jeffreys test.

Jeffreys Test for a Normal Mean; σ² Unknown

Let X1, X2, ..., Xn be a random sample from N(μ, σ²). We want to test

M0 : μ = μ0 versus M1 : μ ≠ μ0,

where μ0 is some specified number. The parameter σ² is common to the two models corresponding to M0 and M1, and μ occurs only in M1. Take the prior g0(σ) = 1/σ for σ under M0. Under M1, take the same prior for σ and add a conditional prior for μ given σ, namely

g1(μ | σ) = (1/σ) g2(μ/σ),

where g2(·) is a p.d.f. Jeffreys suggested taking g2 to be Cauchy, so

g0(σ) = 1/σ under M0;
g1(μ, σ) = (1/σ) g1(μ | σ) = (1/σ) · 1/{σπ(1 + μ²/σ²)} under M1.

One may now find the Bayes factor BF01 using (31).

Example 8. Einstein's theory of gravitation predicts the amount of deflection of light by gravitation. Eddington's expedition in 1919 (and other groups in 1922 and 1929) provided 4 observations: x1 = 1.98, x2 = 1.61, x3 = 1.18, x4 = 2.24 (all in seconds of arc, as measures of angular deflection). Suppose they are normally distributed around their predicted value μ; then X1, ..., X4 are independent and identically distributed as N(μ, σ²). Einstein's prediction is μ = 1.75. Test M0 : μ = 1.75 versus M1 : μ ≠ 1.75, where σ² is unknown, using the conventional priors of Jeffreys to calculate the Bayes factor: BF01 = 2.98. The calculations with the given data lend some support to Einstein's prediction; however, the evidence in the data isn't very strong.
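A numerical sketch of this calculation (not the source's code; it assumes the Cauchy prior under M1 is centered at μ0, the standard reading of Jeffreys' proposal):

x <- c(1.98, 1.61, 1.18, 2.24); mu0 <- 1.75
lik <- function(mu, sigma) prod(dnorm(x, mean = mu, sd = sigma))
## m0: integrate the likelihood at mu = mu0 against g0(sigma) = 1/sigma
m0 <- integrate(function(s) sapply(s, function(si) lik(mu0, si) / si),
                lower = 0, upper = Inf)$value
## m1: integrate against (1/sigma) * Cauchy(mu0, sigma) for mu given sigma
m1 <- integrate(function(s) sapply(s, function(si)
        integrate(function(mu) sapply(mu, function(m) lik(m, si)) *
                    dcauchy(mu, location = mu0, scale = si),
                  lower = -Inf, upper = Inf)$value / si),
      lower = 0, upper = Inf)$value
m0 / m1   # should reproduce BF01 = 2.98 approximately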

BIC

When we compare two models M0 : θ ∈ Θ0 and M1 : θ ∈ Θ1, what does the Bayes factor

BF01 = ∫_{Θ0} f(x | θ) g0(θ) dθ / ∫_{Θ1} f(x | θ) g1(θ) dθ = m0(x) / m1(x)

measure? m0(x) measures how well M0 fits the data x, whereas m1(x) measures how well M1 fits the same data, so BF01 is the relative strength of the two models in the predictive sense. This can be difficult to compute for complicated models, so any good approximation is welcome. Approximate the marginal density m(x) of X for large sample size n:

m(x) = ∫ π(θ) f(x | θ) dθ = ?

Laplace's Method

m(x) = ∫ π(θ) f(x | θ) dθ = ∫ π(θ) ∏_{i=1}^{n} f(xi | θ) dθ
     = ∫ π(θ) exp( Σ_{i=1}^{n} log f(xi | θ) ) dθ = ∫ π(θ) exp(n h(θ)) dθ,

where h(θ) = (1/n) Σ_{i=1}^{n} log f(xi | θ). Consider any integral of the form

I = ∫_{−∞}^{∞} q(θ) e^{n h(θ)} dθ,

where q and h are smooth functions of θ, with h having a unique maximum at θ̂. If h has a unique sharp maximum at θ̂, then most of the contribution to the integral I comes from the integral over a small neighborhood (θ̂ − δ, θ̂ + δ) of θ̂. Study the behavior of I as n → ∞. As n → ∞, we have

I ∼ I1 = ∫_{θ̂−δ}^{θ̂+δ} q(θ) e^{n h(θ)} dθ.

Laplace's method involves Taylor series expansion of q and h about θ̂:

I ∼ ∫_{θ̂−δ}^{θ̂+δ} [ q(θ̂) + (θ − θ̂) q′(θ̂) + (1/2)(θ − θ̂)² q″(θ̂) + ··· ]
    × exp[ n h(θ̂) + n h′(θ̂)(θ − θ̂) + (n/2) h″(θ̂)(θ − θ̂)² + ··· ] dθ

  ∼ e^{n h(θ̂)} ∫_{θ̂−δ}^{θ̂+δ} q(θ̂) [ 1 + (θ − θ̂) q′(θ̂)/q(θ̂) + (1/2)(θ − θ̂)² q″(θ̂)/q(θ̂) ]
    × exp[ (n/2) h″(θ̂)(θ − θ̂)² ] dθ,

since h′(θ̂) = 0 at the maximum. Assume c = −h″(θ̂) > 0 and use the change of variable t = √(nc) (θ − θ̂):

I ∼ e^{n h(θ̂)} q(θ̂) (1/√(nc)) ∫_{−δ√(nc)}^{δ√(nc)} [ 1 + (t/√(nc)) q′(θ̂)/q(θ̂) + (t²/(2nc)) q″(θ̂)/q(θ̂) ] e^{−t²/2} dt

  ∼ e^{n h(θ̂)} (√(2π)/√(nc)) q(θ̂) [ 1 + q″(θ̂) / (2 n c q(θ̂)) ]

  = e^{n h(θ̂)} (√(2π)/√(nc)) q(θ̂) (1 + O(n^{−1})). (32)
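As a sanity check of (32), a small sketch with an assumed example (a Bernoulli likelihood with n = 30 trials and 10 successes, uniform prior, so q(θ) = 1 and the exact answer is the Beta integral beta(11, 21)):

n <- 30; s <- 10
h <- function(th) (s * log(th) + (n - s) * log(1 - th)) / n
that <- s / n                                     # maximizer of h (the MLE)
c0 <- (s / that^2 + (n - s) / (1 - that)^2) / n   # c = -h''(that)
laplace <- exp(n * h(that)) * sqrt(2 * pi / (n * c0))
c(laplace = laplace, exact = beta(s + 1, n - s + 1))  # agree to about 2%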

Apply (32) to m(x) = ∫ π(θ) f(x | θ) dθ = ∫ π(θ) exp(n h(θ)) dθ, with q = π, and ignore terms that stay bounded:

log m(x) ≈ n h(θ̂) − (1/2) log n = log f(x | θ̂) − (1/2) log n.

What Happens When θ is p > 1 Dimensional?

Simply replace (32) by its p-dimensional counterpart:

I = e^{n h(θ̂)} (2π)^{p/2} n^{−p/2} det(Δh(θ̂))^{−1/2} q(θ̂) (1 + O(n^{−1})),

where Δh(θ) denotes the Hessian of −h, i.e.,

Δh(θ) = ( −∂²h(θ) / ∂θi ∂θj )_{p×p}.

Now apply this to m(x) = ∫···∫ π(θ) f(x | θ) dθ = ∫···∫ π(θ) exp(n h(θ)) dθ, with q = π, and ignore terms that stay bounded. Then

log m(x) ≈ n h(θ̂) − (p/2) log n = log f(x | θ̂) − (p/2) log n.

Schwarz (1978) proposed a criterion, known as the BIC, based on (32), ignoring the terms that stay bounded as the sample size n → ∞ (for general dimension p of θ):

BIC = log f(x | θ̂) − (p/2) log n.

This serves as an approximation to the logarithm of the integrated likelihood of the model and is free from the choice of prior.

2 log BF01 is a commonly used evidential measure to compare the support provided by the data x for M0 relative to M1. Under the above approximation we have

2 log(BF01) ≈ 2 log( f(x | θ̂0) / f(x | θ̂1) ) − (p0 − p1) log n. (33)

This is the approximate Bayes factor based on the Bayesian information criterion (BIC) due to Schwarz (1978). The term (p0 − p1) log n can be considered a penalty for using a more complex model.
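Applying (33) to the Eddington data of Example 8 gives a quick check (a sketch; here p0 = 1 and p1 = 2, since σ is the only free parameter under M0):

x <- c(1.98, 1.61, 1.18, 2.24); n <- length(x); mu0 <- 1.75
s0 <- sqrt(mean((x - mu0)^2))       # MLE of sigma under M0
s1 <- sqrt(mean((x - mean(x))^2))   # MLE of sigma under M1
l0 <- sum(dnorm(x, mu0, s0, log = TRUE))
l1 <- sum(dnorm(x, mean(x), s1, log = TRUE))
twologBF <- 2 * (l0 - l1) - (1 - 2) * log(n)   # equation (33)
exp(twologBF / 2)   # rough BF01, same order of magnitude as the exact 2.98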

AIC

Recall the likelihood ratio criterion: λn = f(x | θ̂0) / f(x | θ̂). Note that

P(M0 is rejected | M0) = P(λn < c) ≈ P(χ²_{p1−p0} > −2 log c) > 0,

so, from a frequentist point of view, a criterion based solely on the likelihood ratio does not converge to a sure answer under M0. Akaike (1983) suggested a penalized likelihood criterion:

2 log( f(x | θ̂0) / f(x | θ̂1) ) − 2(p0 − p1), (34)

which is based on the Akaike information criterion (AIC), namely,

AIC = 2 log f(x | θ̂) − 2p

for a model f(x | θ). The penalty for using a complex model is not as drastic as that in BIC.
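For comparison, the AIC version of the same sketch (self-contained; note the sign convention above, under which larger AIC is better):

x <- c(1.98, 1.61, 1.18, 2.24); mu0 <- 1.75
l0 <- sum(dnorm(x, mu0, sqrt(mean((x - mu0)^2)), log = TRUE))
l1 <- sum(dnorm(x, mean(x), sqrt(mean((x - mean(x))^2)), log = TRUE))
(2 * l0 - 2 * 1) - (2 * l1 - 2 * 2)   # positive, so AIC favours the simpler M0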

Model Selection or Model Averaging?

Example 9. Velocities (km/second) of 82 galaxies in six well-separated conic sections of the Corona Borealis region: how many clusters? Consider a mixture of normals:

f(x | θ) = ∏_{i=1}^{n} f(xi | θ) = ∏_{i=1}^{n} [ Σ_{j=1}^{k} pj φ(xi | μj, σj²) ],

where k is the number of mixture components and pj is the weight given to the j-th component, N(μj, σj²). Models to consider:

Mk : X has density Σ_{j=1}^{k} pj φ(x | μj, σj²), k = 1, 2, ...,

i.e., Mk is a k-component normal mixture. The Bayesian model selection procedure computes m(x | Mk) = ∫ π(θk) f(x | θk) dθk for each k of interest and picks the model that gives the largest value.

Example 9 (contd.) Chib (1995), JASA:

k    σj²                log(m(x | Mk))
2    σj² = σ²           −240.464
3    σj² = σ²           −228.620
3    σj² unrestricted   −224.138

A 3-component normal mixture with unequal variances seems best.

• From the Bayesian point of view, a natural approach to model uncertainty is to include all models Mk under consideration for future decisions,
• i.e., bypass the model-choice step entirely.
• Unsuitable for scientific inference, where selection of a model is a must.
• Suitable for prediction purposes, since the underestimation of

uncertainty resulting from choosing the single model Mk̂ is eliminated.

We have Θ = ∪k Θk,

f(y | θ) = fk(y | θk) if θ ∈ Θk, and

π(θ) = pk gk(θk) if θ ∈ Θk,

where pk = Pπ(Mk) is the prior probability of Mk and gk integrates to 1 over Θk. Therefore, given the sample x = (x1, ..., xn),

π(θ | x) = f(x | θ) π(θ) / m(x)
         = Σk (pk / m(x)) fk(x | θk) gk(θk) I_{Θk}(θk)
         = Σk P(Mk | x) gk(θk | x) I_{Θk}(θk).

The predictive density m(y | x) given the sample x = (x1, ..., xn) is what is needed. This is given by

m(y | x) = ∫_Θ f(y | θ) π(θ | x) dθ
         = Σk P(Mk | x) ∫_{Θk} fk(y | θk) gk(θk | x) dθk
         = Σk P(Mk | x) mk(y | x),

which is clearly obtained by averaging over all models.
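A minimal sketch of this averaging, with entirely hypothetical ingredients (the posterior model probabilities and per-model predictive densities below are stand-ins, not fitted to the galaxy data):

post <- c(0.05, 0.25, 0.70)    # hypothetical P(M_k | x) for k = 1, 2, 3
mk <- list(                    # hypothetical predictives m_k(y | x)
  function(y) dnorm(y, 21, 5),
  function(y) 0.1 * dnorm(y, 10, 1) + 0.9 * dnorm(y, 22, 4),
  function(y) 0.1 * dnorm(y, 10, 1) + 0.8 * dnorm(y, 21, 2) +
              0.1 * dnorm(y, 33, 2)
)
m.avg <- function(y) sum(post * sapply(mk, function(f) f(y)))
m.avg(20)   # model-averaged predictive density at y = 20 (in 1000 km/s, say)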

Minimum Description Length

Model fitting is like describing the data in a compact form. A model is better if it can provide a more compact description, or if it can compress the data more, or if it can be transmitted with fewer bits. Given a set of models to describe a data set, the best model is the one which provides the shortest description length.

In general one needs log2(n) bits to transmit n, but patterns can reduce the description length:

100···0 : "1 followed by a million 0's"
1010···10 : "the pair 10 repeated a million times"

If data x is known to arise from a probability density p, then the optimal code length (in an average sense) is given by − log p(x). This optimal code length is valid only in the discrete case; what happens in the continuous case? Discretize x and denote it by [x] = [x]δ, where δ denotes the precision. This means we consider

P([x] − δ/2 ≤ X ≤ [x] + δ/2) = ∫_{[x]−δ/2}^{[x]+δ/2} p(u) du ≈ δ p(x)

instead of p(x) itself as far as coding of x is concerned, when x is one-dimensional. In the r-dimensional case, replace the density p(x) by the probability of the r-dimensional cube of side δ containing x, namely p([x]) δ^r ≈ p(x) δ^r, so that the optimal code length changes to − log p(x) − r log δ.

MDL for Estimation or Model Fitting

Consider data x ≡ x^n = (x1, x2, ..., xn), and suppose

F = { f(x^n | θ) : θ ∈ Θ }

is the collection of models of interest. Further, let π(θ) be a prior density for θ. Given a value of θ (or a model), the optimal code length for describing x^n is − log f(x^n | θ); but since θ is unknown, its description requires a further − log π(θ) bits on average. Therefore the optimal code length is obtained upon minimizing

DL(θ) = − log π(θ) − log f(x^n | θ), (35)

so that MDL amounts to seeking the model that minimizes the sum of

(i) the length, in bits, of the description of the model, and
(ii) the length, in bits, of the data when encoded with the help of the model.

π(θ | x^n) = f(x^n | θ) π(θ) / m(x^n), (36)

where m(x^n) is the marginal or predictive density. Minimizing

DL(θ) = − log π(θ) − log f(x^n | θ) = − log{ f(x^n | θ) π(θ) }

over θ is equivalent to maximizing π(θ | x^n). Thus MDL for estimation or model fitting is equivalent to finding the highest posterior density (HPD) estimate of θ.

Consider the case of F having model parameters of different dimensions, staying with the continuous case and discretization. Denote the k-dimensional θ by θ^k = (θ1, θ2, ..., θk). Then

DL(θ^k) = − log{ π([θ^k]_δπ) δπ^k } − log{ f([x^n]_δf | [θ^k]_δπ) δf^n }
        = − log π([θ^k]_δπ) − k log δπ − log f([x^n]_δf | [θ^k]_δπ) − n log δf
        ≈ − log π(θ^k) − k log δπ − log f(x^n | θ^k) − n log δf.

Note that the term −n log δf is common across all models, so it can be ignored. However, the term −k log δπ, which reflects the dimension of θ in the model, varies and is influential. According to Rissanen, δπ = 1/√n is optimal, in which case

DL(θ^k) ≈ − log f(x^n | θ^k) − log π(θ^k) + (k/2) log n + constant. (37)
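To make the estimation version concrete, a sketch with an assumed binomial likelihood (x = 3 successes in n = 10 trials) and an assumed Beta(2, 2) prior; minimizing DL(θ) recovers the posterior mode:

x <- 3; n <- 10
DL <- function(theta) -dbeta(theta, 2, 2, log = TRUE) -
                        dbinom(x, n, theta, log = TRUE)
optimize(DL, interval = c(0.01, 0.99))$minimum
## about 0.333, the mode of the Beta(5, 9) posterior: (5 - 1)/(5 + 9 - 2)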

References

1. Tom Loredo's site: http://www.astro.cornell.edu/staff/loredo/bayes/
2. An Introduction to Bayesian Analysis: Theory and Methods by J.K. Ghosh, Mohan Delampady and T. Samanta, Springer, 2006.
3. Probability Theory: The Logic of Science by E.T. Jaynes, Cambridge University Press, 2003.
4. Bayesian Logical Data Analysis for the Physical Sciences by P.C. Gregory, Cambridge University Press, 2005.
5. Bayesian Reasoning in High-Energy Physics: Principles and Applications by G. D'Agostini, CERN, 1999.