
PUBH 8442, Spring 2016 Eric Lock

Yanghao Wang

May 12, 2016

Contents

1 Prior and Posterior
   Maximum Likelihood Estimation (MLE); Prior and Posterior; Beta-Binomial Model; Conjugate Priors; Normal-Normal Model; Prior and Posterior Predictive; Normal-Normal Predictives; Non-Informative Priors; Improper Priors; Jeffreys Priors; Elicited Priors

2 Estimation, and Decision Theory
   Point Estimation; Loss Function; Decision Rules; Frequentist Risk; Minimax Rules; Bayes Risk; Bayes Decision Rules

3 The Bias-Variance Tradeoff
   Unbiased rules; Bias-Variance Trade-Off

4 Interval Estimation
   Credible sets; Decision Theory for Credible Sets; Normal Model with Jeffreys Priors

5 Decisions and Hypothesis Testing
   Frequentist Hypothesis Testing; Bayesian Hypothesis Testing; Hypothesis Decision; Type I error; Bayes Factors

6 Model Comparison
   Multiple Hypothesis/Models; Model Choice; Bayes Factors for Model Comparison; Bayesian Information Criterion; Partial Bayes Factors; Intrinsic Bayes Factors; Fractional Bayes Factors; Cross-Validated Likelihood; Bayesian Residuals; Bayesian P-Values

7 Hierarchical Model
   Bayesian Hierarchical Models; Hierarchical Normal Model; The Normal-Gamma Model; Conjugate Normal with both µ and σ² Unknown

8 Bayesian Linear Models
   Bayesian Framework with Uninformative Priors; Bayesian Framework with Normal-Inverse Gamma Prior

9 Empirical Bayes Approaches
   The Empirical Bayes Approach

10 Direct Sampling
   Direct Monte Carlo integration; Direct Sampling for Multivariate Density

11 Importance Sampling
   Importance Sampling; Density Approximation

12 Rejection Sampling
   Rejection Sampling

13 Metropolis-Hastings Sampling
   Metropolis-Hastings Sampling; Choice of Proposal Density

14 Gibbs Sampling
   Gibbs Sampling

15 More on MCMC
   Combining MCMC methods

16 Deviance Information Criterion
   BIC and AIC

17 Bayesian Model Averaging

18 Mixture Models
   Finite Mixture Models; Bayesian Finite …

1 Prior and Posterior

Assume a sampling model for data y = (y₁, . . . , yₙ), specified by a parameter θ (possibly unknown); the model is often represented as a probability density f(y | θ).

Example. Gaussian model specified by θ = (µ, σ²) for independent yᵢ's:

f(y | θ) = (1/(σ√(2π)))ⁿ exp[ −Σᵢ₌₁ⁿ (yᵢ − µ)² / (2σ²) ].

The probability density f (y | θ) is often called the likelihood, with the notation L (θ; y). Use this notation when making inferences about θ.

Maximum Likelihood Estimation (MLE)

Choose θ to maximize likelihood

θ̂ = arg maxθ L(θ; y).

The MLE approach assumes θ is fixed (though unknown), and it has been criticized for overfitting.

Prior and Posterior

Alternatively, θ can be a random variable. Let π(θ) be a probability density for θ, or π(θ | η) be a probability density for θ where η is a hyperparameter. The marginal density for y is

p(y) = ∫ f(y | θ) π(θ) dθ,

i.e., the marginal density for y is a density for y averaging over θ.

• Bayes’ rule for continuous random variable

p(θ | y) = p(y, θ)/p(y) = f(y | θ) π(θ) / ∫ f(y | θ) π(θ) dθ,

where π (θ) is the prior (distribution for θ) and p (θ | y) is the posterior (distribution for θ).

– π(θ) is the density for θ without observed data, while p(θ | y) is the density for θ with observed data.
– Given data y, ∫ f(y | θ) π(θ) dθ is a normalizing constant (a scalar) that makes the posterior p(θ | y) integrate to 1.

• Bayes’ rule for discrete random variable

– θ is continuously distributed while y is discrete:

p(θ | y = k) = P(y = k | θ) π(θ) / ∫ P(y = k | θ) π(θ) dθ,

where y is discrete and P(y | θ) is the probability mass function (PMF) for y given θ.

– θ is discretely distributed while y is continuous:

P(θ = k | y) = f(y | θ = k) P(θ = k) / Σ_{k∈Θ} f(y | θ = k) P(θ = k),

where Θ is the set of all possible values of θ.

Example (Gender Proneness). The Guptas (of CNN fame) have three daughters. What is the probability that their next child will be a daughter?

MLE Solution Let θ be the probability of a daughter for the Guptas. The binomial likelihood for the first 3 births, with y = # daughters, is

P(y = 3 | θ) = θ³,

which implies θ̂ = 1 (an overfitting problem).

Bayes Solution Assume the prior distribution for θ is Uniform (0, 1) and hence π (θ) = 1 ∀θ ∈ [0, 1].

p(θ | y = 3) = P(y = 3 | θ) π(θ) / ∫ P(y = 3 | θ) π(θ) dθ = (θ³ × 1)/(1/4) = 4θ³,

where ∫ P(y = 3 | θ) π(θ) dθ = [θ⁴/4] evaluated from θ = 0 to 1, which equals 1/4. Estimate θ by the posterior expectation:

θ̂ = E[θ | y = 3] = ∫₀¹ θ p(θ | y = 3) dθ = [(4/5)θ⁵] from 0 to 1 = 4/5.

Beta-Binomial Model

Assume θ ∼ Beta (a, b):

π(θ) = θ^(a−1) (1 − θ)^(b−1) / B(a, b),

where B(·, ·) is the beta function. If Y ∼ Binomial(n, θ), where n is the sample size and θ is the success probability, then

p (θ | y = k) = Beta (a + k, b + n − k) .

Proof. By Bayes’ rule, the denominator is a constant, so the posterior p(θ | y = k) is proportional to the numerator P(y = k | θ) π(θ). Thus,

p(θ | y = k) ∝ P(y = k | θ) π(θ) = C(n, k) θ^k (1 − θ)^(n−k) · θ^(a−1) (1 − θ)^(b−1) / B(a, b)
∝ θ^(a+k−1) (1 − θ)^(b+n−k−1) ∝ Beta(a + k, b + n − k).

Note. The Beta distribution is a distribution over the belief about a probability; i.e., θ ∼ Beta(a, b) has θ ∈ [0, 1] and E[θ] = a/(a + b). Thus, the shape parameters a and b determine the expectation of θ. In addition, Beta(1, 1) is the uniform distribution U[0, 1].

Example (Gender Proneness (cont.)). Based on data from many families, a more realistic prior for θ is π (θ) = Beta (39, 40), where 39 is the number of girls and 40 is the number of boys. For the Gupta family, p (θ | y = 3) = Beta (42, 40) since n = 3 (sample size in the Gupta’s) and k = 3 (observed number of girls). The expected value of Beta (a, b) is a/ (a + b):

• Prior estimate: θ̂ = 39/(39 + 40) ≈ 0.494
• Posterior estimate: θ̂ = 42/(42 + 40) ≈ 0.512
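The update in this example can be checked with a short script (a sketch; the helper name `beta_binomial_update` is ours, the numbers come from the example above):

```python
# Beta-binomial conjugate update for the Gender Proneness example.
def beta_binomial_update(a, b, k, n):
    """Posterior is Beta(a + k, b + n - k) for k successes in n trials."""
    return a + k, b + n - k

# Prior Beta(39, 40); the Guptas observed k = 3 girls in n = 3 births.
a_post, b_post = beta_binomial_update(39, 40, k=3, n=3)
prior_mean = 39 / (39 + 40)
post_mean = a_post / (a_post + b_post)
print(a_post, b_post)                             # 42 40
print(round(prior_mean, 3), round(post_mean, 3))  # 0.494 0.512
```

Adaptive updating (one birth at a time) gives the same result, since the Beta parameters simply accumulate counts.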

Conjugate Priors

A prior is conjugate for a given likelihood if its posterior belongs to the same distribution family. For example, a beta prior is conjugate to a binomial random variable (i.e., if y is generated by a binomial process and the prior π(θ) is a Beta distribution, then the posterior p(θ | y) is also a Beta distribution).

• Conjugate priors facilitate computation of posteriors.

– Particularly useful when updating the posterior adaptively. E.g., given π (θ) = Beta (39, 40), after one girl, the posterior is Beta (40, 40). Then after another girl, the posterior is updated to Beta (41, 40).

• Conjugacy does not guarantee accuracy (it has no profound theoretical justification).

• Common conjugate families (credit: John D. Cook)

Example (Poisson-Gamma Model). Let y ∼ Poisson(θ) and θ ∼ Gamma(α, β): π(θ) ∝ θ^(α−1) e^(−βθ). Then

p(θ | y) ∝ p(y | θ) π(θ) ∝ θ^y e^(−θ) · θ^(α−1) e^(−βθ) = θ^(α+y−1) e^(−(β+1)θ) ≡ Gamma(α + y, β + 1).

Normal-Normal Model

Assume y ∼ Normal(µ, σ²) with σ² known. If π(µ) = Normal(µ₀, τ²), then

p(µ | y) = Normal( (σ²µ₀ + τ²y)/(σ² + τ²), σ²τ²/(σ² + τ²) ).

Proof. Note y ∼ Normal(µ, σ²) implies

p(y | µ) = 1/(σ√(2π)) exp[ −(y − µ)²/(2σ²) ].

With π(µ) = Normal(µ₀, τ²), we have

p(µ | y) ∝ p(y | µ) π(µ) = (1/(σ√(2π))) (1/(τ√(2π))) exp[ −(y − µ)²/(2σ²) − (µ − µ₀)²/(2τ²) ]
∝ exp[ −(τ²(y − µ)² + σ²(µ − µ₀)²)/(2σ²τ²) ]
∝ exp[ −(µ − (σ²µ₀ + τ²y)/(σ² + τ²))² / (2σ²τ²/(τ² + σ²)) ],   (∗)

where the third line follows because τ²(y − µ)² + σ²(µ − µ₀)² is

τ²(y − µ)² + σ²(µ − µ₀)² = (τ² + σ²)µ² − 2(τ²y + σ²µ₀)µ + τ²y² + σ²µ₀²
= (τ² + σ²)[ µ² − 2((σ²µ₀ + τ²y)/(σ² + τ²))µ + (τ²y² + σ²µ₀²)/(σ² + τ²) ]
= (τ² + σ²)[ (µ − (σ²µ₀ + τ²y)/(σ² + τ²))² − ((σ²µ₀ + τ²y)/(σ² + τ²))² + (τ²y² + σ²µ₀²)/(σ² + τ²) ],

and the fourth line follows because (given y, µ₀, σ², and τ²) the constants ((σ²µ₀ + τ²y)/(σ² + τ²))² and (τ²y² + σ²µ₀²)/(σ² + τ²) are dropped for simplicity. Equation (∗) implies

p(µ | y) = Normal( (σ²µ₀ + τ²y)/(σ² + τ²), σ²τ²/(σ² + τ²) ).

Assume y = (y₁, . . . , yₙ) are i.i.d. with yᵢ ∼ Normal(µ, σ²) and σ² known. If π(µ) = Normal(µ₀, τ²), then

p(µ | y) = Normal( (σ²µ₀ + nτ²ȳ)/(σ² + nτ²), σ²τ²/(σ² + nτ²) ),

where ȳ = Σᵢ yᵢ/n.

Example (Coke Bottles). Coca-Cola bottling machines fill with known variance 0.05 oz. Each machine is calibrated to fill mean capacity µ, so bottles are filled with Gaussian error: Normal(µ, 0.05). Historical data show machine calibrations are approximately Normal(12, 0.01). Five randomly selected bottles from a given machine have sample mean ȳ = 11.88. What is the posterior for the calibration of this machine?

Solution: Given yᵢ ∼ Normal(µ, 0.05) and µ ∼ Normal(12, 0.01), we have σ² = 0.05, µ₀ = 12, τ² = 0.01. With n = 5 and ȳ = 11.88, applying the normal-normal model,

p(µ | y) = Normal( (0.05 × 12 + 5 × 0.01 × 11.88)/(0.05 + 5 × 0.01), (0.05 × 0.01)/(0.05 + 5 × 0.01) ) = Normal(11.94, 0.005).

Prior and Posterior Predictive

Often we are not interested in making inference on θ but in our best guess for the distribution of yi.

• Prior Predictive is the marginal distribution of an observation given the prior:

p(yᵢ) = ∫ f(yᵢ | θ) π(θ) dθ.

With the prior information about the distribution of yᵢ, f(yᵢ | θ), we average over all possible θ's to obtain the prior predictive distribution for yᵢ.

• Posterior Predictive is the distribution for a future observation given the data so far:

p(yₙ₊₁ | y) = ∫ f(yₙ₊₁ | θ) p(θ | y) dθ.

With data y ∈ ℝⁿ, we replace the prior distribution for θ, π(θ), by the posterior p(θ | y), then average over all possible θ's to obtain the posterior predictive distribution for yₙ₊₁.

Normal-Normal Predictives

For the normal-normal model, show that the prior predictive is p(yᵢ) = Normal(µ₀, τ² + σ²) and the posterior predictive is p(yₙ₊₁ | y) = Normal( (σ²µ₀ + nτ²ȳ)/(σ² + nτ²), σ²τ²/(σ² + nτ²) + σ² ).

Proof. By definition of p(yᵢ), we have p(yᵢ) = ∫ f(yᵢ | µ) π(µ) dµ

= ∫ 1/(2πστ) exp[ −(yᵢ − µ)²/(2σ²) − (µ − µ₀)²/(2τ²) ] dµ
= ∫ 1/(2πστ) exp[ −(τ²(yᵢ − µ)² + σ²(µ − µ₀)²)/(2σ²τ²) ] dµ
= 1/(√(2π)√(τ² + σ²)) exp[ −(yᵢ − µ₀)²/(2(τ² + σ²)) ] × ∫ 1/(√(2π) στ/√(τ² + σ²)) exp[ −(µ − (τ²yᵢ + σ²µ₀)/(τ² + σ²))² / (2σ²τ²/(τ² + σ²)) ] dµ,

where the last integral equals 1, since it integrates a Normal( (τ²yᵢ + σ²µ₀)/(τ² + σ²), σ²τ²/(τ² + σ²) ) density. The third line follows by completing the square:

τ²(yᵢ − µ)² + σ²(µ − µ₀)² = (τ² + σ²)µ² − 2(τ²yᵢ + σ²µ₀)µ + τ²yᵢ² + σ²µ₀²
= (τ² + σ²)(µ − (τ²yᵢ + σ²µ₀)/(τ² + σ²))² − (τ²yᵢ + σ²µ₀)²/(τ² + σ²) + τ²yᵢ² + σ²µ₀²
= (τ² + σ²)(µ − (τ²yᵢ + σ²µ₀)/(τ² + σ²))² + τ²σ²(yᵢ − µ₀)²/(τ² + σ²),

where the last step expands (τ²yᵢ + σ²µ₀)² and cancels the τ⁴yᵢ² and σ⁴µ₀² terms. Thus,

p(yᵢ) = ∫ f(yᵢ | µ) π(µ) dµ = 1/(√(2π)√(τ² + σ²)) exp[ −(yᵢ − µ₀)²/(2(τ² + σ²)) ],

that is, yᵢ ∼ Normal(µ₀, τ² + σ²).

By definition of p(yₙ₊₁ | y), we have p(yₙ₊₁ | y) = ∫ p(yₙ₊₁ | µ) p(µ | y) dµ, where

p(yₙ₊₁ | µ) = Normal(µ, σ²) and p(µ | y) = Normal( (σ²µ₀ + nτ²ȳ)/(σ² + nτ²), σ²τ²/(σ² + nτ²) ).

Besides integrating out µ as we did for p(yᵢ), there is an alternative way. Let yₙ₊₁ = (yₙ₊₁ − µ) + µ. Given y, (yₙ₊₁ − µ) ∼ Normal(0, σ²) and µ ∼ p(µ | y) are independently distributed, so their sum is normal with the means and variances added:

yₙ₊₁ | y ∼ Normal( (σ²µ₀ + nτ²ȳ)/(σ² + nτ²), σ² + σ²τ²/(σ² + nτ²) ).

Example (Coke Bottles (cont.)). The predictive for a single bottle from a given machine is

Normal(12, 0.06),

where µ₀ = 12 and τ² + σ² = 0.01 + 0.05 = 0.06. (There is no information from observations yet, so this is the prior predictive for a single yᵢ.) The predictive for the sixth bottle after the five observed (n = 5 and ȳ = 11.88) is

Normal(11.94, 0.055),

where σ²τ²/(σ² + nτ²) = 0.005 and σ² = 0.05.
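Both predictive distributions follow directly from the formulas above; a minimal sketch with the same Coke numbers:

```python
# Prior and posterior predictive for the normal-normal model.
def prior_predictive(mu0, tau2, sigma2):
    """Marginal distribution of one observation: Normal(mu0, tau2 + sigma2)."""
    return mu0, tau2 + sigma2

def posterior_predictive(mu0, tau2, sigma2, ybar, n):
    """Distribution of y_{n+1} given data: posterior of mu plus sampling noise."""
    mean = (sigma2 * mu0 + n * tau2 * ybar) / (sigma2 + n * tau2)
    var = sigma2 * tau2 / (sigma2 + n * tau2) + sigma2
    return mean, var

prior_p = prior_predictive(12, 0.01, 0.05)               # ~ Normal(12, 0.06)
post_p = posterior_predictive(12, 0.01, 0.05, 11.88, 5)  # ~ Normal(11.94, 0.055)
print(prior_p, post_p)
```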

Non-Informative Priors

A non-informative prior is intended to convey no prior belief. That is, let the data alone drive inference.

Example. θ is an unknown parameter; a uniform distribution gives a non-informative prior for θ:

• Discrete random variable: For Θ = {θ₁, . . . , θ_K}, P(θ = θₖ) = 1/K.

• Continuous random variable: For Θ = [a, b], π(θ) = 1/(b − a) for θ ∈ [a, b].

Note (Connection between Prior and Posterior). If π(θ) = c ∀θ ∈ Θ, then p(θ | y) ∝ f(y | θ), and so the mode of p(θ | y) equals the MLE for θ.

Improper Priors

A “prior” that is not a true probability distribution is improper, and an improper prior can result in an improper posterior. Improper priors should be avoided if possible; however, an improper prior can give reasonable results, especially if the posterior is proper.

Example. π(θ) = 1 for all θ ∈ ℝ is an improper prior, since ∫ π(θ) dθ = ∞. (A proper prior must integrate to 1.) Suppose ∫ f(y | θ) π(θ) dθ < ∞; then

p(θ | y) = f(y | θ) π(θ) / ∫ f(y | θ) π(θ) dθ = f(y | θ) / ∫ f(y | θ) dθ,

which is still proper.

Example (Venusian Surface). A spacecraft lands on the planet Venus, and captures a small surface sample. The radioactivity of the sample is measured as the frequency of particle emission. y = 60 particle emissions are observed in one minute. Infer θ, the average particle emissions per minute. Suppose y ∼ Poisson (θ) with a uniform prior π (θ) = c for all θ > 0 (note π (θ) is improper).

p(y = k | θ) = e^(−θ) θ^k / k!  for k = 0, 1, 2, . . .

Since ∫ p(y = k | θ) dθ < ∞,

p(θ | y = k) ∝ p(y = k | θ) ∝ e^(−θ) θ^k ∝ Gamma(k + 1, 1).

An alternative prior distribution for radioactivity by emission frequency is often given on the log scale γ = log(θ), where θ ∼ U(0, ∞). A uniform prior for θ implies that the prior for γ (= log(θ)) is π(γ) ∝ e^γ ∀γ, which is no longer a uniform distribution.

Note (Transformation from a Known Density Function).

π(γ) = |J| π(θ(γ)),

where θ(γ) means θ written as a function of γ (θ = e^γ), |J| is the determinant of the Jacobian J = dθ/dγ = e^γ, and π(θ(γ)) = 1 since π(θ) = 1 for all θ = e^γ > 0.

Jeffreys Priors

For a one-parameter model f(y | θ), the Fisher information is

I(θ) = −E_y|θ[ d²/dθ² log f(y | θ) ],

an average over observed data y given θ; i.e., the RHS is a function of θ rather than y.

Note. The Fisher information is a way of measuring the amount of information that an observable random variable Y carries about an unknown parameter θ of a distribution f(· | θ) that models Y.

• An equivalent definition of the Fisher information (for θ) is

I(θ) = E_y|θ[ (d/dθ log f(y | θ))² ] = ∫ (d/dθ log f(y | θ))² f(y | θ) dy.

• Intuition for Fisher information: (i) The occurrence of a less frequent event (an event with small probability) brings much information about θ. (ii) If θ is the true parameter, the expected log-likelihood achieves its maximum there, so the derivative ∂ log f(· | θ)/∂θ should be close to zero; the second derivative ∂² log f(· | θ)/∂θ² then measures how sharply the likelihood peaks.

• Based on the facts ∫ (∂f(· | θ)/∂θ) dy = 0 and E[∂ log f(· | θ)/∂θ] = 0, there are three equivalent formulas for the Fisher information.

Let θ be an unknown parameter in the distribution model f(x | θ) of a random variable X, with corresponding likelihood function L(θ; x), where x is a collection of observations. The Jeffreys prior for θ is proportional to the square root of the Fisher information I(θ):

π(θ) ∝ [I(θ)]^(1/2).

Jeffreys priors are invariant under all one-to-one (bijective) transformations: if γ = h(θ) and the prior π(θ) is Jeffreys for θ, then the induced prior for γ is Jeffreys for γ.

Claim. If I(·) is the Fisher information, show

[I(γ)]^(1/2) = [I(θ(γ))]^(1/2) |dθ/dγ|.

For multi-parameter θ = (θ₁, . . . , θ_K), the Fisher information is a matrix with (i, j) element

I_ij(θ) = −E_y|θ[ ∂²/(∂θᵢ∂θⱼ) log f(y | θ) ].

The Jeffreys prior for θ is

π(θ) ∝ [det I(θ)]^(1/2).

Example. For a Poisson(θ) model, the Jeffreys prior for θ is

π(θ) ∝ 1/√θ.

Proof. Note that x ∼ Poisson(θ) implies the mean is E_x|θ[x] = θ and the likelihood function is

f(x | θ) = θ^x e^(−θ) / x!.

To construct the Fisher information for θ, first take the logarithm of f(x | θ):

log f(x | θ) = x log θ − θ − log x!.

Then,

d/dθ log f(x | θ) = x/θ − 1  and  d²/dθ² log f(x | θ) = −x/θ².

Therefore,

I(θ) = −E_x|θ[−x/θ²] = E_x|θ[x]/θ² = θ/θ² = 1/θ  ⟹  π(θ) ∝ √I(θ) = 1/√θ.
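The result I(θ) = 1/θ can be spot-checked by Monte Carlo, since −d²/dθ² log f(x | θ) = x/θ² averages to E[x]/θ² = 1/θ (a sketch; θ = 2 and the Knuth sampler are our choices, not from the notes):

```python
import math
import random

random.seed(0)

def poisson_draw(lam):
    """One Poisson(lam) draw via Knuth's product method (fine for small lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

theta = 2.0
n = 200_000
# -d^2/dtheta^2 log f(x | theta) = x / theta^2; its average estimates I(theta).
fisher_mc = sum(poisson_draw(theta) for _ in range(n)) / n / theta**2
print(fisher_mc)  # close to 1 / theta = 0.5
```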

Example. For a Binomial(n, θ) model, the Jeffreys prior for θ is

π (θ) = Beta (0.5, 0.5) .

Proof. Note that x ∼ Binomial(n, θ) implies the mean is nθ and the likelihood function is

f(x | θ) = C(n, x) θ^x (1 − θ)^(n−x).

To construct the Fisher information for θ, first take the logarithm of f, log C(n, x) + x log θ + (n − x) log(1 − θ), and then differentiate w.r.t. θ:

d/dθ log f(x | θ) = x/θ − (n − x)/(1 − θ)  and  d²/dθ² log f(x | θ) = −x/θ² − (n − x)/(1 − θ)².

Thus, the Fisher information for θ is

I(θ) = −E_x|θ[ −x/θ² − (n − x)/(1 − θ)² ] = nθ/θ² + (n − nθ)/(1 − θ)² = n/θ + n/(1 − θ).

Hence,

π(θ) ∝ ( n/θ + n/(1 − θ) )^(1/2) ∝ ( (1 − θ + θ)/(θ(1 − θ)) )^(1/2) = θ^(−1/2) (1 − θ)^(−1/2).   (∗)

Recall the pdf of Beta(a, b) is

f(x | a, b) = x^(a−1) (1 − x)^(b−1) / B(a, b).

Therefore, equation ∗ implies π (θ) = Beta (0.5, 0.5).

Elicited Priors

We often want an expert to elicit the prior for a given experiment. However, there are some problems:

• Human intuition generally approximates uncertainty poorly

• Non-statistician experts have little knowledge of probability distributions

One option is to have the expert create a probability histogram, or to specify certain desirable properties (e.g., quantiles) and choose a known distribution to match those properties. Prior elicitation software is available at http://www.tonyohagan.co.uk/shelf/.

2 Estimation, and Decision Theory

Point Estimation

Let p(θ | y) be the posterior distribution. Then possible point estimates for θ are:

• Posterior mode: θ̂ = arg maxθ p(θ | y)

– Easy to compute because we only need to work with the numerator, i.e., p(θ | y) ∝ π(θ) f(y | θ);
– Sometimes called generalized maximum likelihood estimation; if π(θ) is flat (constant), it is exactly the MLE.

• Posterior expectation: θ̂ = E_θ|y[θ] = ∫ θ p(θ | y) dθ

– The posterior expectation uses the full posterior.

• Posterior median: θ̂ such that Pr(θ ≤ θ̂ | y) = ∫ from −∞ to θ̂ of p(θ | y) dθ = 1/2

– The posterior median is more robust to outliers in y and heavy posterior tails.

Loss Function

How do we choose which estimate is optimal for a given application? A loss function I(truth, a) gives the loss incurred for an action a given the (usually unknown) truth. If the truth is given by parameter θ, the posterior risk is

ρ(π, a) = E_θ|y I(θ, a) = ∫ I(θ, a) p(θ | y) dθ,

an average loss over the posterior for θ. For point estimation, the action a assigns an estimate θ̂ for the unknown parameter θ, so θ̂ contains the information of a and I(θ, a) can be written as I(θ, θ̂).

• Squared error loss: I(θ, θ̂) = (θ − θ̂)² (commonly used).

Claim. The posterior risk for squared error loss is minimized by the posterior expectation:

E_θ|y[θ] = arg min over θ̂ of E_θ|y(θ − θ̂)².

Proof. Let E_θ|y[θ] = µ. Then

E_θ|y(θ − θ̂)² = E_θ|y(θ − µ + µ − θ̂)²
= E_θ|y[ (θ − µ)² + 2(θ − µ)(µ − θ̂) + (µ − θ̂)² ]
= E_θ|y(θ − µ)² + 2(µ − θ̂) E_θ|y(θ − µ) + (µ − θ̂)²
= Var_θ|y(θ) + (µ − θ̂)² ≥ Var_θ|y(θ), with equality iff θ̂ = µ,

where the second step follows because (µ − θ̂) is constant under E_θ|y[·] (µ = E_θ|y[θ] and θ̂ is a constant estimate), and the last step follows because E_θ|y(θ − µ) = E_θ|y(θ) − µ = 0.

• Absolute loss: I(θ, θ̂) = |θ − θ̂|.

Claim. The posterior risk for absolute loss is minimized by the posterior median:

If θ* satisfies Pr(θ ≤ θ* | y) = ∫ from −∞ to θ* of p(θ | y) dθ = 1/2, then θ* = arg min over θ̂ of E_θ|y|θ − θ̂|.

Proof. Let P(θ̂) = Pr(θ ≤ θ̂ | y). Writing the posterior risk as

E_θ|y|θ − θ̂| = ∫ from −∞ to θ̂ of (θ̂ − θ) p(θ | y) dθ + ∫ from θ̂ to ∞ of (θ − θ̂) p(θ | y) dθ,

differentiating w.r.t. θ̂ gives P(θ̂) − [1 − P(θ̂)] = 2P(θ̂) − 1, which is zero when P(θ̂) = 1/2. That is, the minimizing θ̂ is the posterior median s.t. Pr(θ ≤ θ̂ | y) = 1/2.
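Both claims can be illustrated numerically: against draws from a skewed stand-in posterior, a grid search for the estimate minimizing each empirical risk should land near the mean and the median, respectively (a sketch; the Gamma(2, 1) "posterior" is an arbitrary choice of ours):

```python
import random

random.seed(1)
# Stand-in posterior draws: Gamma(shape 2, scale 1), right-skewed so mean != median.
draws = sorted(random.gammavariate(2.0, 1.0) for _ in range(5_000))
post_mean = sum(draws) / len(draws)   # around 2.0
post_median = draws[len(draws) // 2]  # around 1.68

grid = [i * 0.02 for i in range(401)]  # candidate estimates in [0, 8]
best_sq = min(grid, key=lambda t: sum((x - t) ** 2 for x in draws))
best_abs = min(grid, key=lambda t: sum(abs(x - t) for x in draws))
print(best_sq, best_abs)  # near the mean and the median, respectively
```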

Decision Rules

Set up

• Denote the space of allowable actions A (a ∈ A).

• Denote sample space (possible data observations) Y (y ∈ Y).

• A decision rule d ∈ D : Y → A is a rule for determining an action based on data.

– In the context of statistical inference, a decision rule is a rule for determining an estimate based on observed data.

Formal decision-theoretic framework:

• Prior distribution: π (θ), θ ∈ Θ

• Sampling distribution: f (y | θ)

• Allowable actions: a ∈ A (e.g., an action a is to assign a number or an estimate for θ)

• Decision rules: d ∈ D : Y → A (i.e., a mapping from observed data to an action)

• Loss function: I (θ, a)

Example. Recall that y = (y₁, . . . , yₙ) where the yᵢ are iid Normal(µ, σ²) and π(µ) = Normal(µ₀, τ²). Then

p(µ | y) = Normal( (σ²µ₀ + nτ²ȳ)/(σ² + nτ²), σ²τ²/(σ² + nτ²) ).

Consider decision for point estimate of µ:

• Prior distribution: π(µ) = Normal(µ₀, τ²), µ ∈ ℝ.

• Sampling distribution: yᵢ iid Normal(µ, σ²), y ∈ Y = ℝⁿ.

• Allowable actions: a ∈ A = ℝ.

• Decision rule: d(y) = (σ²µ₀ + nτ²ȳ)/(σ² + nτ²), where d ∈ D and d : Y → A is a mapping from y to a.

• Loss function: I(µ, a) = (µ − a)² or |µ − a|.

Example (Coke Bottles). Coke bottles are filled with calibration Normal(12, 0.01). Given a machine with calibration µ, bottles are filled with Normal(µ, 0.05). For n = 5 and ȳ = 11.88 oz, p(µ | y) = Normal(11.94, 0.005). The action a assigns the estimate µ̂ = 11.94; i.e., d(y) = a = µ̂. The posterior risk under squared error loss, the minimum over µ̂ of E_µ|y(µ − µ̂)² = Var_µ|y(µ), is the posterior variance, 0.005.

Note. The posterior risk for a specific loss is a conditional average, E_θ|y[I(θ, θ̂)], where d(y) = a = θ̂ means the decision rule maps the observed data y to an action a that assigns the value of the estimate θ̂.

Frequentist Risk

The frequentist risk of a decision rule d is

R(θ, d) = E_y|θ I(θ, d(y)) = ∫ I(θ, d(y)) f(y | θ) dy,

an average loss over y given θ. R(θ, d) does not depend on the prior π(θ). A rule d is inadmissible if

∃d∗ with (i) R (θ, d∗) ≤ R (θ, d) ∀θ ∈ Θ, and (ii) R (θ, d∗) < R (θ, d) for some θ ∈ Θ.

Implications: (i) some other rule is universally “better,” (ii) a rule that is not inadmissible is admissible, and (iii) admissible rules are NOT necessarily good.

Note (Admissible Rule). A rule d is admissible if

∄ d′ ∈ D s.t. (i) R(θ, d′) ≤ R(θ, d) ∀θ ∈ Θ, and (ii) R(θ, d′) < R(θ, d) for some θ ∈ Θ.

Equivalently, a rule d is admissible if, for every d′ ∈ D, R(θ, d′) ≤ R(θ, d) for all θ ∈ Θ implies R(θ, d′) = R(θ, d) for all θ ∈ Θ.

Example (Coke Bottles (cont.)). The (poor) rule d(y) = 5 is admissible because it is unbeatable when µ = 5. That is, there is no better d′(y) with R(µ, d′(y)) < R(µ, d(y)) = E_y|µ(µ − 5)² = 0 at µ = 5, where R(µ, d(y)) = E_y|µ[µ − d(y)]² is the frequentist risk under the squared loss function for decision rule d(y).

(i) Let d(y) = ȳ be the decision rule. The frequentist risk (under the squared loss function) is R(µ, ȳ) = E_y|µ(ȳ − µ)² = Var_y|µ(ȳ) = σ²/n = 0.05/5 = 0.01. Note that d(y) = ȳ does not depend on µ (due to known σ²), and neither does R(µ, d(y)) = R(µ, ȳ).

(ii) Let d(y) = E_µ|y[µ] be the decision rule, where the posterior mean is Bµ₀ + (1 − B)ȳ with B = σ²/(σ² + nτ²) = 0.05/(0.05 + 5 × 0.01) = 1/2. Then the frequentist risk is

R(µ, E_µ|y[µ]) = E_y|µ[ B²(µ − µ₀)² + 2B(1 − B)(µ − µ₀)(µ − ȳ) + (1 − B)²(µ − ȳ)² ]
= B²(µ − µ₀)² + 2B(1 − B)(µ − µ₀) E_y|µ(µ − ȳ) + (1 − B)² E_y|µ(µ − ȳ)²
= B²(µ − µ₀)² + (1 − B)² Var_y|µ(ȳ)
= 0.25(µ − 12)² + 0.25 × 0.01,

where the second line follows because E_y|µ(µ − µ₀)² = (µ − µ₀)² (µ is given and µ₀ is known), and the third line follows because E_y|µ[µ − ȳ] = 0 (i.e., E_y|µ[ȳ] = µ). In this case, R(µ, d(y)) = R(µ, E_µ|y[µ]) is a function of µ, so its value depends on the true µ.
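The risk formula for the posterior-mean rule can be spot-checked by simulating ȳ at a test value of µ (a sketch; µ = 12.2 is an arbitrary choice of ours):

```python
import random

random.seed(2)
mu, mu0 = 12.2, 12.0             # a test value of the true mu; the prior mean
sigma2, tau2, n = 0.05, 0.01, 5
B = sigma2 / (sigma2 + n * tau2)  # shrinkage factor, 0.5 here
sd_ybar = (sigma2 / n) ** 0.5

total = 0.0
reps = 100_000
for _ in range(reps):
    ybar = random.gauss(mu, sd_ybar)
    d = B * mu0 + (1 - B) * ybar  # posterior-mean rule
    total += (mu - d) ** 2
risk_mc = total / reps

risk_formula = B**2 * (mu - mu0) ** 2 + (1 - B) ** 2 * sigma2 / n
print(risk_mc, risk_formula)  # both near 0.0125
```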

Minimax Rules

A minimax rule d satisfies

sup over θ ∈ Θ of R(θ, d) ≤ sup over θ ∈ Θ of R(θ, d*), ∀d* ∈ D.

Choose d to minimize the maximum frequentist risk: prepare for “the worst case scenario.” Note. “Minimax” alone does not mean d is the best decision rule. A minimax rule d may NOT be unique, admissible, or favorable.

Bayes Risk

The Bayes risk for a given decision rule is

r(π, d) = E_θ E_y|θ[I(θ, d(y))] = ∫ R(θ, d) π(θ) dθ.

Equivalently,

r(π, d) = E_y E_θ|y[I(θ, d(y))] = ∫ ρ(π, d(y)) m(y) dy,

where m is the marginal distribution of y. Bayes risk is also called “preposterior risk”: the expected risk before obtaining any data.

Summary (Loss Function, Posterior, Frequentist, and Bayes Risk).

• Loss function I (θ, d (y)): Function of θ and y.

• Posterior risk ρ (π, d (y)): Function of y, averaged over θ

• Frequentist risk R (θ, d): Function of θ, averaged over y

• Bayes risk r (π, d): Averaged over both y and θ.

Bayes Decision Rules

A Bayes decision rule minimizes the Bayes risk: d* = arg min over d ∈ D of r(π, d). Usually d* is admissible.

Claim. If a Bayes rule is unique, it is admissible. That is, d* is unique ⟹ d* is admissible. Recall that p → q ≡ ¬p ∨ q ≡ ¬(p ∧ ¬q).

Proof. (Per contra) Suppose a Bayes rule d* is unique but inadmissible. Then ∃d s.t. ∀θ ∈ Θ,

R(θ, d) ≤ R(θ, d*) ⟹ ∫ R(θ, d) π(θ) dθ ≤ ∫ R(θ, d*) π(θ) dθ ⟹ r(π, d) ≤ r(π, d*),

which implies d* is not a unique Bayes rule (since d minimizes r(π, d) at least as well), a contradiction.

Example (Normal-Normal Model). For the normal-normal model with I(µ, µ̂) = (µ − µ̂)², we have shown that d(y) = arg min over d ∈ D of ρ(π, d(y)) is given by the posterior mean for any y, so it is a Bayes decision rule (by the equivalent definition r(π, d) = ∫ ρ(π, d(y)) m(y) dy). The Bayes risk is given by

r(π, d) = σ²τ²/(σ² + nτ²).

In the Coke bottling example, r(π, d) = 0.005 is the expected loss before checking any bottles, which is the same as the expected risk after collecting bottles (the posterior risk), because the posterior risk does not depend on y. (r(π, d) is an average over both y and θ.) The Bayes risk of d(y) = ȳ is

r(π, d) = ∫ R(µ, d) π(µ) dµ = ∫ (σ²/n) π(µ) dµ = (σ²/n) ∫ π(µ) dµ = σ²/n = 0.01.

3 The Bias-Variance Tradeoff

Unbiased rules

A rule d(y) is unbiased at θ if the risk under a given θ ∈ Θ is minimized at the truth:

E_y|θ[I(θ′, d(y))] ≥ E_y|θ[I(θ, d(y))] for all θ′, θ ∈ Θ.

Let I(θ, d(y)) = (θ − d(y))². Then

E_y|θ[I(θ, d(y))] = E_y|θ(θ − E_y|θ[d(y)] + E_y|θ[d(y)] − d(y))²
= (θ − E_y|θ[d(y)])² + E_y|θ(E_y|θ[d(y)] − d(y))²,

where the second term is Var_y|θ[d(y)], and the cross term vanishes because E_y|θ[d(y)] is a function of θ rather than y.¹ The form of d(y) can depend on the given θ, and hence Var_y|θ[d(y)] is determined by both the uncertainty of y and the form of d(y). The best we can do to minimize risk under a given θ is to set a decision rule d(y) s.t. E_y|θ[d(y)] = θ.

• Unbiased d(y) implies E_y|θ[d(y)] = θ.

• The bias of d(y) is Δ = θ − E_y|θ[d(y)].

Note. The definitions of admissible and unbiased decision rules are both related to the frequentist risk R(θ, d) = E_y|θ I(θ, d(y)): an admissible d cannot be uniformly beaten over d ∈ D, while an unbiased d minimizes R(θ′, d) at θ′ = θ for every θ ∈ Θ.

Example (Unbiased Estimator). Let X be the number of times an event occurs in an hour, X ∼ Poisson(λ). Using X, estimate Pr(A), where A represents “no event in two hours.”

• d(X) is an unbiased estimator for Pr(A) under squared loss if E_x|λ[d(x)] = e^(−2λ).

• It turns out that d is unbiased only if d(x) = (−1)^x, and d(x) < 0 is a ridiculous estimate for odd x.

Note that p(x | λ) = λ^x e^(−λ)/x! and Pr(X = 0 | λ) = e^(−λ). Hence Pr(A) = Pr(X = 0 | λ) × Pr(X = 0 | λ) = e^(−2λ), since events in disjoint hours occur independently.

e^(−2λ) = E_x|λ[d(x)] = Σ over x ≥ 0 of d(x) λ^x e^(−λ)/x!  ⟹  e^(−λ) = Σ over x ≥ 0 of d(x) λ^x/x!
⟹ Σ over x ≥ 0 of (−1)^x λ^x/x! = Σ over x ≥ 0 of d(x) λ^x/x!  (Taylor expansion of e^(−λ))
⟹ d(x) = (−1)^x.
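The identity E_x|λ[(−1)^x] = e^(−2λ) behind this example can be verified directly from the Poisson PMF (a truncated series; λ = 1.5 is an arbitrary choice of ours):

```python
import math

lam = 1.5
# E[(-1)^X] = sum_x (-1)^x lam^x e^{-lam} / x!; terms vanish quickly, so truncate.
expected = sum((-1) ** x * lam ** x * math.exp(-lam) / math.factorial(x)
               for x in range(80))
print(expected, math.exp(-2 * lam))  # both e^{-3}, about 0.0498
```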

Bias-Variance Trade-Off

Let y₁, . . . , yₙ be iid with mean µ and variance σ². Consider estimates for µ of the form

d_B(y) = Bµ₀ + (1 − B)ȳ,

where 0 ≤ B ≤ 1 is the “shrinkage factor.” This estimate has

• Variance (1 − B)²σ²/n:

Var_y|µ[d_B(y)] = Var_y|µ[d_B(y) − Bµ₀] = Var_y|µ[(1 − B)ȳ] = (1 − B)² Var_y|µ[ȳ] = (1 − B)² σ²/n.

¹ The expectation over y given θ takes every possible y into account, so the result does not depend on any particular y; the conditioning information θ remains, hence E_y|θ[·] gives a function of θ.

• Bias B(µ − µ₀):

µ − E_y|µ[d_B(y)] = µ − E_y|µ[Bµ₀ + (1 − B)ȳ] = B(µ − µ₀) + (1 − B)(µ − E_y|µ[ȳ]) = B(µ − µ₀).

The frequentist risk for d_B under squared loss is

R(µ, d_B) = E_y|µ[I(µ, d_B(y))] = B²(µ − µ₀)² + (1 − B)² E_y|µ(µ − ȳ)²
= B²(µ − µ₀)² [Bias²] + (1 − B)² σ²/n [Variance].

• B = 0, d (y) =y ¯ is unbiased but maximizes variance.

• B = 1, d (y) = µ0 has no variance (i.e., d (y) is a constant µ0 for all y) but maximizes bias.

• The ideal B depends on: σ² (larger σ² ⇒ B closer to 1), n (larger n ⇒ B closer to 0), and the distance of µ from µ₀, which might be measured by τ² (larger distance ⇒ B closer to 0).

Claim. Given a prior π for µ with mean µ₀ and variance τ², the rule d_B that minimizes the Bayes risk r(π, d_B) = ∫ R(µ, d_B) π(µ) dµ has

B = σ²/(σ² + nτ²).

Proof. Let d_B = Bµ₀ + (1 − B)ȳ. The Bayes risk is

r(π, d_B) = E_µ E_y|µ[µ − Bµ₀ − (1 − B)ȳ]² = E_µ[ B²(µ − µ₀)² + (1 − B)² E_y|µ(µ − ȳ)² ]
= B² E_µ(µ − µ₀)² + (1 − B)² E_µ[Var_y|µ(ȳ)] = B²τ² + (1 − B)² σ²/n.

The problem of minimizing r(π, d_B) over d_B is equivalent to minimizing over B with d_B = Bµ₀ + (1 − B)ȳ, which implies the FOC ∂r/∂B = 0; that is,

2Bτ² − 2(1 − B)σ²/n = 0 ⟹ nτ²B = σ²(1 − B) ⟹ B = σ²/(σ² + nτ²).
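A grid search over B reproduces the claim for the Coke numbers (σ² = 0.05, n = 5, τ² = 0.01, so the optimum is B = 0.5):

```python
sigma2, n, tau2 = 0.05, 5, 0.01

def bayes_risk(B):
    """r(pi, d_B) = B^2 tau^2 + (1 - B)^2 sigma^2 / n."""
    return B**2 * tau2 + (1 - B) ** 2 * sigma2 / n

grid = [i / 1000 for i in range(1001)]
B_hat = min(grid, key=bayes_risk)           # grid minimizer
B_star = sigma2 / (sigma2 + n * tau2)       # closed form from the claim
print(B_hat, B_star)  # both 0.5 here
```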

Note. Normal-normal model is a special case (the posterior mean is the Bayes decision rule).

4 Interval Estimation

Credible sets

A 100 × (1 − α)% credible set for θ is a subset C ⊆ Θ with

1 − α ≤ Pr(C | y) = ∫ over C of p(θ | y) dθ.

The probability that θ is in C is at least (1 − α). If θ ∈ ℝ and C = [a, b] for a, b ∈ ℝ, C may be called a credible interval or Bayesian confidence interval.

•A highest posterior density (HPD) includes only the “most likely” θ values:

C = {θ ∈ Θ: p (θ | y) ≥ K (α)}

where K (α) is the largest constant giving Pr (C | y) ≥ 1 − α.

– The smallest possible credible set. – May not be an interval for some α (if p (θ | y) has several local modes). – Visualized as horizontal line intersecting density curve.

• An equal tail (or percentile) interval excludes equal probability in the upper and lower tail.

– C = [a, b] where a = α/2 quantile and b = (1 − α/2) quantile of p (θ | y). – Symmetric – includes “the middle” of the posterior. – Invariant under monotone transformations. h i • Alternatively, construct interval so that point estimate θˆ is central: C = θˆ − m, θˆ + m .

– The interval is centered at MLE. (E.g., The MLE λˆ = 4 is centered in the following interval.) – Note MLE λˆ = 4 is the mode of the likelihood f (λ | y), rather than the mode of this blue curve p (λ | y) which is proportional to π (λ) f (λ | y).

Example (Emergency Room). An emergency room receives y = 4 collapsed-lung patients in one week. Model y as Poisson with weekly rate λ, where the improper Jeffreys prior is

π(λ) ∝ I(λ)^(1/2) = 1/√λ ∝ Gamma(0.5, 0).

Note that λ ∼ Gamma(α, β) means the PDF for λ is π(λ) ∝ λ^(α−1) e^(−βλ). Since a Gamma prior is conjugate to a Poisson random variable, p(λ | y = 4) = Gamma(4.5, 1). Create a 90% credible interval for λ:

Decision Theory for Credible Sets

“We do not view credible sets as having a clear decision-theoretic role, and are therefore leery of ‘optimality’ approaches to selection of a credible set.” – Jim Berger, 1985, in Statistical decision theory and Bayesian analysis.

• Action space could involve all subsets of parameter space A = {a : a ⊆ Θ}.

– For a univariate θ: A = {a = [u, v]: u, v ∈ R}.

• Possible loss function:

I (θ, d (y)) = 1{θ ∉ d(y)} + K · Volume (d (y))

where 1{·} is an indicator function and Volume (d (y)) = ∫_{θ ∈ d(y)} dθ is the volume of d (y).

– Balances the coverage and size of a credible set: the constant K is the penalty per unit size of the credible set.
– For univariate θ with d (y) = [a, b], Volume (d (y)) = b − a.

Example (Posterior Risk). Consider the posterior risk

ρ (π, d (y)) = E_{θ|y} [1{θ ∉ d(y)} + K · Volume (d (y))] = Pr (θ ∉ d (y) | y) + K · Volume (d (y)) .

The posterior risk is minimized by d (y) = {θ : p (θ | y) ≥ K}, a HPD interval.

Proof. By definition, the posterior risk is

ρ (π, d (y)) = E_{θ|y} [1{θ ∉ d(y)}] + K · Volume (d (y))   (the second term is not a function of θ)
= Pr (θ ∉ d (y) | y) + K · Volume (d (y))
= 1 − Pr (θ ∈ d (y) | y) + K ∫_{θ ∈ d(y)} dθ
= 1 + ∫_{θ ∈ d(y)} [K − p (θ | y)] dθ,

which is minimized by an HPD interval, since K ≤ p (θ | y) for all θ ∈ d (y) = {θ : p (θ | y) ≥ K}; i.e., the integrand K − p (θ | y) is non-positive exactly on this set.

Example (Emergency Room (cont.)). Decision theoretic framework:

• Prior distribution: π (λ) = Gamma (1/2, 0), λ ∈ R⁺.

• Sampling distribution: y ∼ Poisson (λ), y ∈ {0, 1, 2, . . . }

• Allowable actions: A = {a : a ⊆ R+}

• Decision rule: d (y) = {λ : p (λ | y) ≥ K} where K = 0.05.

• Loss function: I (λ, d (y)) = 1{λ ∉ d(y)} + 0.05 · Volume (d (y)).

Recall y = 4 and p (λ | y = 4) = Gamma (4.5, 1). Note K = 0.05 gives an HPD interval with α = 0.10. Then d (y = 4) = [1.21, 7.65] with volume 7.65 − 1.21 = 6.44. [Code] Volume plots for HPD and symmetric quantile intervals:
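The HPD endpoints can be reproduced numerically by finding where the posterior density crosses K = 0.05 on each side of the mode (a sketch under the stated model; the mode of Gamma(4.5, 1) is (4.5 − 1)/1 = 3.5):

```r
# HPD interval for lambda | y = 4 ~ Gamma(4.5, 1) with density cutoff K = 0.05:
# solve dgamma(lambda) = K once below and once above the posterior mode 3.5.
K <- 0.05
f <- function(l) dgamma(l, shape = 4.5, rate = 1) - K
lo <- uniroot(f, c(0.01, 3.5))$root
hi <- uniroot(f, c(3.5, 20))$root
round(c(lo, hi), 2)  # matches the interval [1.21, 7.65] in the notes
```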

• Consider the posterior predictive distribution of y2 given y1.

– y1 is observed number of collapsed lung patients for initial week

– y2 is the number of collapsed lung patients for the following week

Note (Gamma Distribution). Suppose random variable X follows a Gamma distribution with shape parameter α (or k = α) and rate parameter β (or scale parameter θ = 1/β). For any x ∈ (0, ∞),

f (x) = (β^α / Γ(α)) x^{α−1} e^{−βx} for X ∼ Gamma (α, β), or f (x) = (1 / (Γ(k) θ^k)) x^{k−1} e^{−x/θ} for X ∼ Gamma (k, θ).

Moreover, the mean is E [x] = α/β = kθ. Recall that under the Jeffreys prior for λ, π (λ) ∝ 1/√λ,

p (λ | y1) = Gamma (1/2 + y1, 1) = λ^{y1 − 1/2} e^{−λ} / Γ (y1 + 1/2) and p (y2 | λ) = Poisson (λ) = λ^{y2} e^{−λ} / y2! .

– Posterior predictive is

p (y2 | y1) = ∫ p (y2 | λ) p (λ | y1) dλ = ∫_0^∞ (λ^{y2} e^{−λ} / y2!) · (λ^{y1 − 1/2} e^{−λ} / Γ (y1 + 1/2)) dλ
= (1 / (y2! Γ (y1 + 1/2))) ∫_0^∞ λ^{y1 + y2 − 1/2} e^{−2λ} dλ   (a Gamma (y1 + y2 + 1/2, 2) kernel)
= Γ (y1 + y2 + 1/2) / (y2! Γ (y1 + 1/2) 2^{y1 + y2 + 1/2})
= C_{y2}^{y1 + y2 − 1/2} (1/2)^{y1 + 1/2} (1/2)^{y2} = NegativeBinomial (y1 + 1/2, 1/2) .

Note. (i) The Binomial distribution, X ∼ Binomial (n, p), is the distribution of the number of successes (x) in a fixed number n of independent Bernoulli trials. Hence the total number of trials is fixed at n. The probability mass function of the Binomial distribution is

Pr (X = k) = C_k^n p^k (1 − p)^{n−k} = (n! / (k! (n − k)!)) p^k (1 − p)^{n−k} .

(ii) The Negative Binomial distribution, X ∼ NegativeBinomial (r, p), is the distribution of the number of successes (x) until the r-th failure occurs, where each independent Bernoulli trial succeeds with probability p. Hence the total number of trials is not fixed. The probability mass function of the Negative Binomial distribution is

Pr (X = k) = [C_k^{k+r−1} p^k (1 − p)^{r−1}] × (1 − p) = C_k^{k+r−1} p^k (1 − p)^r
= ((k + r − 1)! / (k! (r − 1)!)) p^k (1 − p)^r = (Γ (k + r) / (Γ (r) k!)) p^k (1 − p)^r ,

where the bracketed factor covers the first k + r − 1 trials and the final (1 − p) is the r-th failure,

where Γ(n) = (n − 1)! for any integer n.

• Consider creating credible set for y2 given y1.

– Action set is A = {a : a ⊆ {0, 1, 2, . . . }}.
– Loss function I (y2, d (y1)) = 1{y2 ∉ d(y1)} + K · |d (y1)|

* |·| gives cardinality (the number of values in d (y1), since y2 is integer-valued)
* Motivates d (y1) = {k : Pr (y2 = k | y1) ≥ K}

* Discrete analogue of the HPD set, with level 1 − α = Σ_{k ∈ d(y1)} Pr (y2 = k | y1)

– For y1 = 4, d (y1) = {1, 2, 3, 4, 5, 6, 7, 8} corresponds to K = 0.05 with level 1 − α = 0.10 + 0.14 + 0.15 + 0.14 + 0.12 + 0.09 + 0.07 + 0.05 = 0.86.[Code]
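These probabilities can be checked directly, since p (y2 | y1 = 4) is NegativeBinomial (4.5, 1/2), which in R's failures-before-r-successes parameterization is dnbinom with size = 4.5 and prob = 0.5 (a sketch, not code from the notes):

```r
# Discrete HPD-style set for y2 | y1 = 4 ~ NegativeBinomial(4.5, 1/2):
# keep every k whose posterior predictive probability is at least K = 0.05.
K <- 0.05
k <- 0:20
pmf <- dnbinom(k, size = 4.5, prob = 0.5)
d <- k[pmf >= K]
d                   # the set {1, 2, ..., 8}
sum(pmf[pmf >= K])  # level 1 - alpha, about 0.855 (0.86 after rounding)
```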

• Given y1, consider the probability of no patients in the following two weeks (event A)

Pr (A | y1) = Pr (y2 = 0 ∩ y3 = 0 | y1)

= Pr (y2 = 0 | y1) × Pr (y3 = 0 | y2 = 0, y1) = (1/2)^{y1 + 1/2} (2/3)^{y1 + 1/2} = (1/3)^{y1 + 1/2} ,

where p (y2 | y1) = Neg.Binomial (y1 + 1/2, 1/2) and p (y3 | y2 = 0, y1) = Neg.Binomial (y1 + 1/2, 1/3). Note that (i) p (y3 | y2 = 0, y1) = ∫ p (y3 | λ) p (λ | y2 = 0, y1) dλ, where p (λ | y2 = 0, y1) ∝ p (y2 = 0, y1 | λ) p (λ) ∝ p (y2 = 0 | λ) p (y1 | λ) p (λ) ∝ Gamma (y1 + 1/2, 2), and (ii) the CDF collapses to the PMF at y2 = 0 or y3 = 0, since Y is non-negative.

Then, as y1 is discrete, the integral (over all possible y1’s) becomes the summation:

E_{y1|λ} [Pr (A | y1)] = Σ_{k=0}^∞ (1/3)^{k + 1/2} · (λ^k e^{−λ} / k!) = (1/√3) e^{−λ} Σ_{k=0}^∞ (λ/3)^k / k! = (1/√3) e^{−λ} e^{λ/3} = (1/√3) e^{−2λ/3} .

Thus, the posterior predictive estimate has bias

(1/√3) e^{−2λ/3} − e^{−2λ} .

– Recall the only unbiased estimate for Pr (A) is (−1)^{y1}.

The unbiased estimate for Pr (A) is the d (y1) satisfying E_{y1|λ} [d (y1)] = Pr (A) = Pr (y2 = 0 | λ) × Pr (y3 = 0 | λ) = e^{−2λ}. Hence d (y1) = (−1)^{y1}.
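Both the unbiasedness of (−1)^{y1} and the bias of the posterior predictive estimate can be checked numerically by summing against the Poisson pmf (a sketch, not from the notes; λ = 1.3 is an arbitrary test value):

```r
# Check E[(-1)^y1] = e^(-2*lambda) for y1 ~ Poisson(lambda),
# i.e. the (absurd) estimator (-1)^y1 is unbiased for Pr(A) = e^(-2*lambda).
lambda <- 1.3
k <- 0:200  # truncation; Poisson(1.3) mass beyond 200 is negligible
e1 <- sum((-1)^k * dpois(k, lambda))
abs(e1 - exp(-2 * lambda))  # essentially zero

# Expectation of the posterior predictive estimate (1/3)^(y1 + 1/2):
e2 <- sum((1/3)^(k + 0.5) * dpois(k, lambda))
abs(e2 - exp(-2 * lambda / 3) / sqrt(3))  # also essentially zero
```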

Normal Model with Jeffreys Priors

Let y1, . . . , yn be iid N (µ, σ²).

Case I: Assume σ² is known.

• The Jeffreys prior for µ is uniform: π (µ) ∝ c.

Note that log f (y | µ, σ²) = C − (1/(2σ²)) Σ_{i=1}^n (yi − µ)²,

(∂/∂µ) log f = (1/σ²) Σ_{i=1}^n (yi − µ) , and (∂²/∂µ²) log f = −n/σ² .

Given σ², the Jeffreys prior for µ is a constant, and hence µ is uniformly distributed:

π (µ) ∝ √I (µ) = √( −E_{y|µ} [(∂²/∂µ²) log f] ) = √(n/σ²) = c.

• π (µ) = π (µ | σ²) = c gives posterior p (µ | y, σ²) = N (ȳ, σ²/n).

p (µ | y, σ²) ∝ p (y | µ, σ²) π (µ | σ²) ∝ exp{−Σ_{i=1}^n (yi − µ)² / (2σ²)} ∝ exp{−(µ − ȳ)² / (2σ²/n)} ∝ N (ȳ, σ²/n) .

• The 100 (1 − α)% HPD (and percentile/symmetric) credible interval for µ is

C = [ ȳ − z (α/2) σ/√n , ȳ + z (α/2) σ/√n ] ,

where z (α/2) is the upper α/2 critical value (i.e., the 1 − α/2 quantile) of N (0, 1).

Case II: Assume σ² is also unknown and independent of µ.

• The Jeffreys prior for σ² is π (σ²) ∝ 1/σ².

Note that log f (y | µ, σ²) = C − (n/2) log σ² − (1/(2σ²)) Σ_{i=1}^n (yi − µ)²,

∂ log f / ∂(σ²) = −(n/2)/σ² + Σ_{i=1}^n (yi − µ)² / (2(σ²)²) , and ∂² log f / ∂(σ²)² = (n/2)/(σ²)² − Σ_{i=1}^n (yi − µ)² / (σ²)³ .

Given y, the Jeffreys prior for σ2 is

π (σ²) ∝ √( −E_{y|σ²} [ (n/2)/(σ²)² − Σ_{i=1}^n (yi − µ)² / (σ²)³ ] )
= √( −(n/2)/(σ²)² + n E_{y|σ²} [(yi − µ)²] / (σ²)³ )
= √( −(n/2)/(σ²)² + nσ²/(σ²)³ ) = √(n/2) · (1/σ²) ∝ 1/σ² .

• Then, p (µ | y) = ∫_0^∞ p (µ | y, σ²) p (σ² | y) dσ² is a shifted and scaled t-distribution:

(µ − ȳ) / (s/√n) ∼ t_{n−1} where s² = Σ_{i=1}^n (yi − ȳ)² / (n − 1) .

• The HPD credible interval for µ is a standard t-interval with n − 1 degrees of freedom.
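This coincidence of the Jeffreys-prior credible interval with the frequentist t-interval can be confirmed on simulated data (a sketch, not from the notes; the data are arbitrary draws):

```r
# With pi(mu) flat and pi(sigma^2) = 1/sigma^2, the 95% HPD interval for mu
# equals the classical t-interval: ybar +/- t_{.975, n-1} * s / sqrt(n).
set.seed(1)
y <- rnorm(20, mean = 5, sd = 2)
n <- length(y)
bayes_ci <- mean(y) + c(-1, 1) * qt(0.975, df = n - 1) * sd(y) / sqrt(n)
freq_ci <- t.test(y)$conf.int
max(abs(bayes_ci - freq_ci))  # zero up to floating point
```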

5 Decisions and Hypothesis Testing

Example (Coke Bottles (cont.)). Coke bottles are filled by machines with calibration µ ∼ N (12, 0.01). Given a machine with calibration µ, bottles are filled with y ∼ N (µ, 0.05). For n = 5,

p (µ | y) = N ( (σ²µ0 + nτ²ȳ)/(σ² + nτ²) , σ²τ²/(σ² + nτ²) ) = N ( (12 + ȳ)/2 , 0.005 ) .

A machine with calibration µ costs the company $50,000 (µ − 12)², and the cost to recalibrate a machine to µ = 12 is $500. The decision theoretic framework is

• Prior distribution: π (µ) = N (12, 0.01), ∀µ ∈ R

• Sampling distribution: {yi}_{i=1}^5 iid ∼ N (µ, 0.05), with y ∈ R⁵

• Allowable actions: A = {Re-calibrate (R) , Not re-calibrate (N)}

• Loss function: l (µ, a) = 500 if a = R; l (µ, a) = 50,000 (µ − 12)² if a = N

Decide whether to recalibrate after the sample of n = 5; that is, find d (y).

Solution: The posterior loss for a = R (re-calibrate) is ρ (π, R) = Eµ|y [I (µ, R)] = 500. The posterior loss for N (not re-calibrate) is

ρ (π, N) = E_{µ|y} [ 50,000 ((µ − E_{µ|y}µ) + (E_{µ|y}µ − 12))² ]
= 50,000 { E_{µ|y} [(µ − E_{µ|y}µ)²] + (E_{µ|y}µ − 12)² }
= 50,000 [ Var_{µ|y} (µ) + ((12 + ȳ)/2 − 12)² ]
= 50,000 [ 0.005 + 0.25 (ȳ − 12)² ] = 250 + 12,500 (ȳ − 12)² .

ρ (π, N) < ρ (π, R) implies 250 + 12,500 (ȳ − 12)² < 500 =⇒ ȳ ∈ (11.86, 12.14).

Thus, the decision rule is d (y) = N if ȳ ∈ (11.86, 12.14), and R if ȳ ∉ (11.86, 12.14).
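The rule is a one-line comparison of posterior risks; the boundary ȳ = 12 ± √0.02 ≈ 12 ± 0.14 recovers the interval above (a sketch, not from the notes):

```r
# Recalibration decision: compare posterior expected losses for a given ybar.
decide <- function(ybar) {
  risk_N <- 250 + 12500 * (ybar - 12)^2  # posterior risk of not recalibrating
  risk_R <- 500                          # posterior risk of recalibrating
  if (risk_N < risk_R) "N" else "R"
}
decide(12.00)  # "N"
decide(12.10)  # "N"  (inside (11.86, 12.14))
decide(12.20)  # "R"
```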

Frequentist Hypothesis Testing

• Two possible hypotheses involving θ: the null (H0) and alternative (Ha)

• Observe data y and assume Y and y are iid given θ

• The p-value is the probability that Y are more “extreme” than y under H0:

p-value= Pr (Y more “extreme” than y | H0).

– Y represents results that “could have” occurred under H0
– “Extreme” depends on context (refer to “How to Correctly Interpret P Values”)

– Probability always computed under H0, never Ha

– H0 is usually more specific: e.g., H0: θ = θ0 and Ha: θ 6= θ0

– Small p-value is evidence against H0

Suppose H0: θ = 0 with estimate θ̂. The p-value is Pr (|θ̂′| ≥ |θ̂| | θ = 0), where θ̂′ is the estimate from a hypothetical replication of the experiment. If the p-value ≤ 0.01, i.e., θ̂ falls in the 1% rejection region, then we can reject H0 at the 1% significance level. Otherwise, we cannot reject.

• p-value requires considering probability of other possible outcomes, not just the observed outcome. Therefore, it violates the Likelihood Principle:

– In making inferences or decisions about θ after y is observed, all relevant experimental information is contained in the likelihood function for the observed y.

The likelihood principle says that inference should only depend on the likelihood of the observed data (considered as a function of θ), but when we compute a p-value we also consider the sampling distribution at other possible data values that we did not observe.

Two experiments can give data with different p-values but equal likelihood under H0.

Example (Tipping Pennies). Claim: If you stand a penny on its side and let it fall, it will land heads more often than tails.

• Let θ be the probability the penny lands heads

• H0 : θ = 1/2 and Ha : θ > 1/2

Assume for both experiments, we observe 5 heads then one tail.

Experiment 1: Tip the coin until the penny lands tails.

• Random variable X is the number of heads before first tail; X ∼Neg.Binomial(1, θ)

• For the Neg. Binomial distribution the possible values of X go to infinity, and thus the p-value is P (X ≥ 5 | H0) = Σ_{k=5}^∞ P (X = k | H0) = (1/2)^5 (1/2) [1 + 1/2 + (1/2)² + ···] = (1/2)^5 ≈ 0.031.

Experiment 2: Tip the coin 6 times

• Random variable Y is the number of times penny falls heads; Y ∼Binomial(6, θ)

• For the Binomial distribution the possible values of Y end at 6, and thus the p-value is P (Y ≥ 5 | H0) = Σ_{k=5}^6 P (Y = k | H0) = C_5^6 (1/2)^5 (1 − 1/2) + C_6^6 (1/2)^6 ≈ 0.110.

Note. (i) In frequentist hypothesis testing, the p-value represents the probability over all possible extreme values from the sampling distribution (though not observed in either experiment) under H0. (ii) Given the number of observed heads at 5 (for both experiments) and the total number of trials at 6 (since n is an important parameter in the Binomial distribution), the two likelihoods are identical up to a constant: P (X = 5 | θ) ∝ P (Y = 5 | θ) ∝ θ⁵ (1 − θ).
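Both p-values can be computed with R's distribution functions, using the failures-before-first-success parameterization for the negative binomial (a sketch, not from the notes):

```r
# p-values for "5 heads then a tail" under H0: theta = 1/2.
# Experiment 1: X = number of heads before the first tail, X ~ NB(1, 1/2).
p1 <- 1 - pnbinom(4, size = 1, prob = 0.5)  # P(X >= 5) = (1/2)^5
# Experiment 2: Y = heads in 6 fixed tips, Y ~ Binomial(6, 1/2).
p2 <- 1 - pbinom(4, size = 6, prob = 0.5)   # P(Y >= 5) = 7/64
c(p1, p2)  # 0.03125 and about 0.109
```

Same data, same likelihood, different p-values — exactly the violation of the likelihood principle described above.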

Bayesian Hypothesis Testing

• Compute P (H0 | y) directly:

P (H0 | y) = P (y | H0) P (H0) / [P (y | H0) P (H0) + P (y | Ha) P (Ha)]

– Must compute P (y | Ha) which requires the prior for Ha

– Must specify P (H0), which is the prior probability of H0 (being true)

• P (H0 | y) + P (Ha | y) = 1

• Symmetric: Give evidence for or against H0

– p-value can only give evidence against (i.e., never “accept H0”)

Example (Tipping Pennies (cont.)). Possible Bayesian Framework:

• Let θ be the probability the penny lands heads.

• H0: θ = 1/2

• Ha: θ ∼Uniform(0.5, 1)

• P (H0) = 0.5

Then the prior distribution for θ is mixed, s.t., θ = 1/2 with prob. 1/2, and θ ∼ π (θ) = 2 for θ ∈ [0.5, 1] with prob. 1/2.

Experiment 1: The number of heads before the first tail X ∼ Negative Binomial(1, θ)

• P (x = 5 | H0) = (1/2)^5 × (1 − 1/2) = (1/2)^6, where (1/2)^5 is the probability of observing the first five heads and (1 − 1/2) is the probability of observing the first tail.

• P (x = 5 | Ha) = ∫_{0.5}^1 π (θ) · p (x = 5 | θ) dθ = ∫_{0.5}^1 2 θ⁵ (1 − θ) dθ ≈ 0.044

Therefore,

P (H0 | x = 5) = P (x = 5 | H0) P (H0) / [P (x = 5 | H0) P (H0) + P (x = 5 | Ha) P (Ha)] = 0.267,

where P (Ha) = 1 − P (H0) = 1 − 0.5 = 0.5.

Experiment 2: The number of heads in 6 trials Y ∼ Binomial(6, θ)

• P (y = 5 | H0) = C_5^6 (1/2)^5 (1 − 1/2) = 6 (1/2)^6

• P (y = 5 | Ha) = ∫_{0.5}^1 π (θ) · p (y = 5 | θ) dθ = ∫_{0.5}^1 2 · C_5^6 θ⁵ (1 − θ) dθ ≈ 0.254

Therefore,

P (H0 | y = 5) = P (y = 5 | H0) P (H0) / [P (y = 5 | H0) P (H0) + P (y = 5 | Ha) P (Ha)] = 0.267.

Note. “Likelihood under H0” means the probability P (y | H0), where y is the observed data.

Summary. If P (y | θ) ∝ P (x | θ), say P (y | θ) = CP (x | θ), then P (H0 | y) = P (H0 | x):

P (H0 | y) = C P (x | H0) P (H0) / [C P (x | H0) P (H0) + C P (x | Ha) P (Ha)] = P (H0 | x) .
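This invariance is easy to verify numerically with integrate (a sketch, not from the notes; the constant C = C_5^6 = 6 cancels exactly):

```r
# Posterior P(H0 | data) for both penny experiments; priors P(H0) = P(Ha) = 0.5.
post_H0 <- function(lik_H0, lik_Ha) lik_H0 * 0.5 / (lik_H0 * 0.5 + lik_Ha * 0.5)

# Experiment 1 (negative binomial): likelihood theta^5 * (1 - theta).
px_H0 <- 0.5^6
px_Ha <- integrate(function(t) 2 * t^5 * (1 - t), 0.5, 1)$value
# Experiment 2 (binomial): same likelihood times choose(6, 5) = 6.
py_H0 <- 6 * 0.5^6
py_Ha <- integrate(function(t) 2 * 6 * t^5 * (1 - t), 0.5, 1)$value

c(post_H0(px_H0, px_Ha), post_H0(py_H0, py_Ha))  # identical posteriors
```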

Hypothesis Decision

Often there is no need to conclude or decide on a given hypothesis. Simply report strength of evidence given by p-value or posterior. If necessary to choose, can use decision-theoretic framework.

• Action space A = {H0, Ha}

• A natural decision rule is d_c (y) = H0 if P (H0 | y) > c, and Ha otherwise, for c ∈ (0, 1).

– E.g., c = 0.5, which means choosing whichever hypothesis is more probable.

28 Type I error

A type I error occurs when H0 is true, but we conclude it is not. That is, if we observe y and conclude Ha, the probability we made a type I error is P (H0 | y). The standard definition of type I error rate, for a given rule, is

P (d (y) ≠ H0 | H0) (e.g., P (P (H0 | y) ≤ c | H0) if d_c (y) is a binary decision rule).

Note (Type I error, p-value, and Type I error rate).

1. Type I error is NOT the p-value, P (Y more “extreme” than y | H0).

(a) Type I error is the probability of rejecting H0 when it is true. A type I error happens when (i) the sample is unusual due to chance and (ii) it produces a low p-value.
(b) The p-value is a statement about the data, not the hypothesis. Say the p-value is 4% in a symmetric distribution with the population mean at 2 but sample mean at 3: if the null hypothesis is true (the population mean is 2), there is a 4% likelihood of obtaining a sample mean that is at least as extreme as ours (i.e., the mean of another sample from the population is ≥ 3 or ≤ 1), counting both tails of the distribution. Hence the p-value doesn’t say anything directly about what you’re observing; it only talks about the odds of observing it.2

2. The p-value is comparable to the type I error rate for a hypothesis test, although it is neither the type I error nor the error rate. A type I error rate depends on d_c (·) and y, while the p-value only depends on y.

3. The type I error rate for rule d_c (y) depends on (i) prevalence of real effects (where initial probability of a true null = 1 − P (real)), (ii) power (i.e., P (y | Ha) = 1 − β where β is the type II error or “false negative” rate), and (iii) the p-value.3

4. The significance level α (e.g., 5% or 1%) is often referred to as a type I error rate.

Example (Tipping Pennies (cont.)). Bayesian framework: Let θ be the probability the penny lands heads. H0: θ = 1/2 and Ha: θ ∼ Unif(0.5, 1). P (H0) = 0.5. Two different experiments: X ∼ Neg.Binomial(1, θ) and Y ∼ Binomial(6, θ).

Consider d0.3: rule choosing H0 if P (H0 | ·) > 0.3.

x   P (H0 | x)        y   P (H0 | y)
4   0.34              4   0.51
5   0.26              5   0.26
6   0.18              6   0.05
7   0.13

2Refer to two blog articles: Three Things the P-Value Can’t Tell You about Your Hypothesis Test and Understanding Hypothesis Tests: Significance Levels (Alpha) and P values in Statistics. 3Refer to two blog articles: Not All P Values are Created Equal and On the Hazards of Significance Testing.

• Experiment 1: d_0.3 (x) = Ha if P (H0 | x) ≤ 0.3 ⇐⇒ x ≥ 5. Thus, the type I error rate is

P (d (x) ≠ H0 | H0) = P (P (H0 | x) ≤ 0.3 | H0) = P (X ≥ 5 | H0) = 0.031.

• Experiment 2: d_0.3 (y) = Ha if P (H0 | y) ≤ 0.3 ⇐⇒ y ≥ 5. Thus, the type I error rate is

P (d (y) ≠ H0 | H0) = P (P (H0 | y) ≤ 0.3 | H0) = P (Y ≥ 5 | H0) = 0.110.

Note. In this example, the type I error rate is the same as the p-value; both are smaller than the type I error.

Example (Vaccine Trial). A tentative trial for a Zika vaccine completes:

• 1 of 50 who received the vaccine were infected

• 7 of 50 who received a placebo were infected

Conduct a second-stage trial, with n = 500 in each group. Possible Bayesian framework: Let θ1 and θ2 be the probabilities of vaccinated and non-vaccinated infection.

• H0: θ1 = θ2 = θ ∼Beta(9, 93) and Ha: θ1 ∼Beta(2, 50), θ2 ∼Beta(8, 44) are independent.

• P (H0) = 0.25 and hence P (Ha) = 1 − P (H0) = 0.75.

Observe y1 = 36 infections in vaccinated and y2 = 52 in non-vaccinated group.

The observed probability under H0 is

P (y1 = 36, y2 = 52 | H0) = ∫ P (y1 = 36 | θ) P (y2 = 52 | θ) Beta (9, 93) dθ
= ∫ C_36^500 θ^36 (1 − θ)^464 · C_52^500 θ^52 (1 − θ)^448 · [θ^8 (1 − θ)^92 / B (9, 93)] dθ
= (C_36^500 C_52^500 / B (9, 93)) ∫ θ^96 (1 − θ)^1004 dθ   (a Beta (97, 1005) kernel)
= (C_36^500 C_52^500 / B (9, 93)) B (97, 1005) ≈ 0.0002.

The observed probability under Ha is

P (y1 = 36, y2 = 52 | Ha) = ∫∫ P (y1 = 36 | θ1) Beta (2, 50) P (y2 = 52 | θ2) Beta (8, 44) dθ1 dθ2
= ∫ C_36^500 θ1^36 (1 − θ1)^464 [θ1 (1 − θ1)^49 / B (2, 50)] dθ1 ∫ C_52^500 θ2^52 (1 − θ2)^448 [θ2^7 (1 − θ2)^43 / B (8, 44)] dθ2
= (C_36^500 C_52^500 / (B (2, 50) B (8, 44))) ∫ θ1^37 (1 − θ1)^513 dθ1 ∫ θ2^59 (1 − θ2)^491 dθ2
= (C_36^500 C_52^500 / (B (2, 50) B (8, 44))) B (38, 514) B (60, 492) ≈ 0.0001.

The posterior probability of H0 is

P (H0 | y1 = 36, y2 = 52) = P (y1 = 36, y2 = 52 | H0) P (H0) / [P (y1 = 36, y2 = 52 | H0) P (H0) + P (y1 = 36, y2 = 52 | Ha) P (Ha)] ≈ 0.42.

Thus, the probability that the vaccine has an effect is 0.58, which implies we are less sure than before (less than P (Ha) = 0.75). By the beta-binomial model,

• Under H0, P (θ1 = θ2 = θ | H0, y) = Beta (9 + y1 + y2, 93 + 1000 − y1 − y2).

• Under Ha, P (θ1 | Ha, y) = Beta (2 + y1, 50 + 500 − y1) and P (θ2 | Ha, y) = Beta (8 + y2, 44 + 500 − y2).

• The marginal distributions over H0, Ha for θ1, θ2 are

p (θi | y) = P (H0 | y) p (θ | H0, y) + P (Ha | y) p (θi | Ha, y) .

Thus, in this example, p (θ1 = θ2 = θ | H0, y) = Beta (97, 1005), p (θ1 | Ha, y) = Beta (38, 514),

and p (θ2 | Ha, y) = Beta (60, 492). The marginal distributions over H0, Ha are

p (θ1 | y) = 0.42Beta (97, 1005) + 0.58Beta (38, 514) and

p (θ2 | y) = 0.42 Beta (97, 1005) + 0.58 Beta (60, 492).

[Figure: marginal posterior densities of θ1 and θ2 over θ ∈ (0.02, 0.14).]

Decide whether to distribute the vaccine, after the second trial.

• Loss for distributing vaccines is I (θ, distribute) = 50, 000

• Loss for not distributing vaccines is I (θ, not distribute) = 10⁶ (θ2 − θ1)

• Posterior risk for distributing is ρ (θ, distribute) = Eθ|y [I (θ, distribute)] = 50, 000

• Posterior risk for not distributing is ρ (θ, not distribute) = E_{θ|y} [I (θ, not distribute)] = 10⁶ [E_{θ2|y} θ2 − E_{θ1|y} θ1] = 10⁶ [(0.42 · 97/(97+1005) + 0.58 · 60/(60+492)) − (0.42 · 97/(97+1005) + 0.58 · 38/(38+514))] ≈ 23,100 < 50,000.

Thus, choose not to distribute the vaccine at this time.

Bayes Factors

• The posterior odds for H0 over Ha is

P (H0 | y) / P (Ha | y)

• The Bayes factor for H0 over Ha is

BF = p (y | H0) / p (y | Ha)

• The Bayes factor is the posterior odds scaled by the prior odds

BF = (P (Ha) / P (H0)) · (P (H0 | y) / P (Ha | y)) or P (H0 | y) / P (Ha | y) = (P (H0) / P (Ha)) · BF.

Heuristic interpretation of Bayes factors (c. Harold Jeffreys):

BF:                    < 1        1 ∼ 3          3 ∼ 10       10 ∼ 30   30 ∼ 100      > 100
Strength of Evidence:  Negative   Barely worth   Substantial  Strong    Very strong   Decisive
                                  mentioning

Consider the example above: the Bayes factor for the second vaccine trial was

P (y1 = 36, y2 = 52 | H0) / P (y1 = 36, y2 = 52 | Ha) ≈ 2,

which implies that the data give little evidence for one hypothesis over the other.
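The marginal likelihoods, posterior probability, and Bayes factor for the vaccine example can be reproduced on the log scale with lbeta and lchoose (a sketch, not from the notes; working in logs avoids underflow in the Beta functions):

```r
# Marginal log-likelihoods for the second-stage vaccine data (y1 = 36, y2 = 52).
lc <- lchoose(500, 36) + lchoose(500, 52)      # common binomial coefficients
lm_H0 <- lc + lbeta(97, 1005) - lbeta(9, 93)   # H0: shared theta ~ Beta(9, 93)
lm_Ha <- lc + lbeta(38, 514) - lbeta(2, 50) +  # Ha: independent Beta priors
  lbeta(60, 492) - lbeta(8, 44)

BF <- exp(lm_H0 - lm_Ha)                       # Bayes factor for H0 over Ha
post_H0 <- 0.25 * BF / (0.25 * BF + 0.75)      # with prior P(H0) = 0.25
c(BF, post_H0)                                 # roughly 2 and 0.42
```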

6 Model Comparison

Multiple Hypothesis/Models

• Bayesian framework does not treat H0 and Ha differently.

• Methodology may be extended to more than two conclusions.

• Instead of “hypotheses”, compare evidence for “models.”

• For data y, models M1, ... , Mm:

– Mi: y ∼ f (y | θi,Mi) with prior θi ∼ πi (θi) where Mi is the data generating process.

– With prior probabilities P (Mi), Σ_{i=1}^m P (Mi) = 1.

• The posterior probability of model i is

p (Mi | y) = P (Mi) p (y | Mi) / Σ_{j=1}^m P (Mj) p (y | Mj) , where p (y | Mi) = ∫ f (y | θi) πi (θi | Mi) dθi.

Model Choice

Actions A = {M1,...,Mm}

• Under “0-1” loss: I (Mi, d (y)) = 1{d(y) ≠ Mi},

Choose Mi with the highest posterior probability P (Mi | y).

• Under “0-ci” loss: I (Mi, d (y)) = ci 1{d(y) ≠ Mi}, where ci is a penalty,

– Posterior risk for choosing Mi: ρ (π, a = Mi) = E_{M|y} [I (M, Mi)] = Σ_{j≠i} cj P (Mj | y)

Choose Mi with the highest weighted posterior probability ciP (Mi | y).

Example (IQ). Human IQs have a N (100, 225) distribution. A given IQ test has normal error with variance 64. That is, f (y | µ) =Normal(µ, 64) and π (µ) =Normal(100, 225). Observe a test score y ∈ R for a student, then the posterior distribution for their true IQ is

p (µ | y) = Normal ( (σ²µ0 + τ²y)/(σ² + τ²) , σ²τ²/(σ² + τ²) ) = Normal (22.15 + 0.779y, 49.83).

• A given student belongs to

– remedial learning group if IQ < 80 (R)
– standard learning group if 80 ≤ IQ ≤ 120 (S)
– advanced learning group if IQ > 120 (A)

• Assume that a student has score y = 75, then p (µ | y) =Normal(79.80, 49.83).

• The posterior probabilities of belonging to each group are

P (R | y = 75) = ∫_{−∞}^{80} p (µ | y = 75) dµ = 0.511
P (S | y = 75) = ∫_{80}^{120} p (µ | y = 75) dµ = 0.489
P (A | y = 75) = ∫_{120}^{∞} p (µ | y = 75) dµ ≈ 0.

33 • Loss function

I (R, d (y)) = 1{d(y) ≠ R}, I (S, d (y)) = 2 · 1{d(y) ≠ S}, and I (A, d (y)) = 1{d(y) ≠ A}.

• Given y = 75, choose the group with the highest weighted posterior probability: choose the standard group S, since 2P (S | y = 75) > P (R | y = 75) and 2P (S | y = 75) > P (A | y = 75).

• Decision rule for arbitrary y: d (y) = R if y < 70.4; S if 70.4 ≤ y ≤ 129.6; A if y > 129.6

Under the loss functions, choose R if P (R | y) > 2P (S | y). Note that P (S | y) ≈ 1 − P (R | y). Hence, P (R | y) > 2P (S | y) implies P (R | y) > 2/3, which is equivalent to comparison between corresponding quantiles after standardizing the distribution:

∫_{−∞}^{80} p (µ | y) dµ = Φ( (80 − 22.15 − 0.779y)/√49.83 ) > 2/3 ⇐⇒ (80 − 22.15 − 0.779y)/√49.83 > Z_{2/3} = 0.431 =⇒ y < 70.4.
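The cutoff y < 70.4 can also be found numerically by solving P (R | y) = 2/3 for y (a sketch, not from the notes):

```r
# Cutoff between remedial and standard groups: solve P(mu < 80 | y) = 2/3,
# with posterior p(mu | y) = Normal(22.15 + 0.779 * y, 49.83).
g <- function(y) pnorm(80, mean = 22.15 + 0.779 * y, sd = sqrt(49.83)) - 2/3
cut_RS <- uniroot(g, c(50, 90))$root
round(cut_RS, 1)  # 70.4
```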

Bayes Factors for Model Comparison

• Recall the Bayes factor for model M1 over model M2 is BF = p (y | M1) / p (y | M2).

• A likelihood ratio test is based on the maximum likelihood for each model: Λ = max_{θ1} f (y | θ1, M1) / max_{θ2} f (y | θ2, M2).

• Under point models M1: θ = θ^(1) and M2: θ = θ^(2): BF = Λ = f (y | θ^(1)) / f (y | θ^(2)).

Bayesian Information Criterion

• Let ki be number of parameters in model Mi and n be the sample size.

• A heuristic for assessing the fit of a model is the Bayesian Information Criterion (BIC):

BIC (Mi) = −2 log (max_{θi} f (y | θi, Mi)) + ki log n.

• Likelihood ratio test based on transformed ratio:

W = −2 log [ max_{θ1} f (y | θ1, M1) / max_{θ2} f (y | θ2, M2) ] .

• The difference (change from M1 to M2) in BIC can be expressed in terms of W :

∆BIC = W − (k2 − k1) log n.

• For y = y1, y2, . . . , yn iid, as n → ∞, −2 log (BF) ≈ ∆BIC under mild assumptions.

– Derivation: Bhat and Kumar. 2010. “On the derivation of the Bayesian Information Criterion.”
– ∆BIC may be easier to compute than the BF.
– ∆BIC does not depend on prior distributions, but it does depend on the likelihood ratio and the dimension of each model (i.e., W and k2 − k1).
– BIC is also called the Schwarz information criterion (SIC), after G. Schwarz.
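As a quick check of the definition, R's built-in BIC() applied to a fitted model agrees with computing −2 log-likelihood + k log n by hand (a sketch, not from the notes; note that for lm the parameter count k includes the error variance):

```r
# BIC(M) = -2 * log(max likelihood) + k * log(n), illustrated with a toy lm fit.
set.seed(2)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)
fit <- lm(y ~ x)

ll <- logLik(fit)                # maximized log-likelihood
k <- attr(ll, "df")              # parameters: intercept, slope, sigma^2
n <- nobs(fit)
bic_hand <- -2 * as.numeric(ll) + k * log(n)
abs(bic_hand - BIC(fit))         # zero up to floating point
```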

Partial Bayes Factors

If πi (θi) is improper, then so is p (y | Mi) = ∫ f (y | θi, Mi) πi (θi) dθi.

Note. Let ∫ πi (θi) dθi = ∞. Then

∫ p (y | Mi) dy = ∫∫ f (y | θi, Mi) πi (θi) dθi dy = ∫ πi (θi) [∫ f (y | θi, Mi) dy] dθi = ∫ πi (θi) dθi = ∞,

since ∫ f (y | θi, Mi) dy = 1.

Hence BF involving Mi is not well defined. There is a possible solution:

(i) Assume p (θ1 | y1) is proper for y1 = (y1, . . . , yi).

(ii) Find conditional Bayes factor for y2 = (yi+1, . . . , yn) and construct a Partial Bayes Factor:

BF (y2 | y1) = p (y2 | y1, M1) / p (y2 | y1, M2) .

Example (Traffic Accidents). Estimate weekly accident rate at new traffic intersection. Each week observe y ∼Poisson(λ) accidents.

• M1: Elicited prior from city planner: π1 (λ) =Gamma(3, 2).

• M2: Compare with (improper) uniform prior π2 (λ) = 1.

• Observe data for 5 weeks: y1 = 3, y2 = 6, y3 = 2, y4 = 4, and y5 = 2.

• If y1, . . . , yn iid ∼ Poisson(λ) and M: π (λ) = Gamma(α, β), the marginal distribution of y under M is

p (y | M) = ∫ p (y | λ) π (λ) dλ = ∫ (λ^{Σ yi} e^{−nλ} / Π yi!) · (β^α / Γ(α)) λ^{α−1} e^{−βλ} dλ
= (β^α / (Π yi! Γ(α))) ∫ λ^{Σ yi + α − 1} e^{−(β+n)λ} dλ = β^α Γ(Σ yi + α) / (Π yi! Γ(α) (β + n)^{Σ yi + α}) .

• Since π2 (λ) = 1 is improper, so is p (y | M2).

• Condition on y1:

p (λ | M1, y1) ∝ p (y1 | λ) π1 (λ) ∝ (λ^{y1} e^{−λ} / y1!) · λ^{3−1} e^{−2λ} ∝ λ^{y1+3−1} e^{−3λ} ∝ Gamma (y1 + 3, 3) .
p (λ | M2, y1) ∝ p (y1 | λ) π2 (λ) = (λ^{y1} e^{−λ} / y1!) × 1 ∝ λ^{y1} e^{−λ} ∝ Gamma (y1 + 1, 1) .

• Compute the partial Bayes factor, conditioned on y1:

p (y2 = 6, y3 = 2, y4 = 4, y5 = 2 | M1, y1 = 3) = ∫ p (y2, y3, y4, y5 | λ) p (λ | M1, y1) dλ, with p (yj | λ) = Poisson (λ) and p (λ | M1, y1) = Gamma (y1 + 3, 3);
p (y2 = 6, y3 = 2, y4 = 4, y5 = 2 | M2, y1 = 3) = ∫ p (y2, y3, y4, y5 | λ) p (λ | M2, y1) dλ, with p (λ | M2, y1) = Gamma (y1 + 1, 1).

BF (y2, y3, y4, y5 | y1) = p (y−1 | M1, y1) / p (y−1 | M2, y1) = 0.000133 / 0.000224 = 0.596.

Note. (i) The key is to compute the marginal distribution of y under each data generating model. (ii) There is modest evidence that the elicited prior is not better than the flat prior.

Intrinsic Bayes Factors

• Compute n partial Bayes factors: BF ({yj}_{j≠i} | yi) for i = 1, . . . , n.

• The average of these partial BFs is the intrinsic Bayes factor.

– Could take the arithmetic or geometric average
– If BF ({yj}_{j≠i} | yi) does not exist, condition on larger subsets instead

PoisMarginal = function(Yvec, alpha, beta){
  n = length(Yvec)
  Prob = (1/prod(factorial(Yvec)))*((beta^alpha)/gamma(alpha))*
    (gamma(sum(Yvec)+alpha)/(beta+n)^(sum(Yvec)+alpha))
  return(Prob)
}

Partial_y1_M1 = PoisMarginal(c(6,2,4,2), alpha=3+3, beta=3)
Partial_y1_M2 = PoisMarginal(c(6,2,4,2), alpha=1+3, beta=1)
BF_y1 = Partial_y1_M1/Partial_y1_M2
c(Partial_y1_M1, Partial_y1_M2, BF_y1)

## [1] 0.0001340 0.0002248 0.5959680

Partial_y2_M1 = PoisMarginal(c(3,2,4,2), alpha=3+6, beta=3)
Partial_y2_M2 = PoisMarginal(c(3,2,4,2), alpha=1+6, beta=1)
BF_y2 = Partial_y2_M1/Partial_y2_M2

Partial_y3_M1 = PoisMarginal(c(3,6,4,2), alpha=3+2, beta=3)
Partial_y3_M2 = PoisMarginal(c(3,6,4,2), alpha=1+2, beta=1)
BF_y3 = Partial_y3_M1/Partial_y3_M2

Partial_y4_M1 = PoisMarginal(c(3,6,2,2), alpha=3+4, beta=3)
Partial_y4_M2 = PoisMarginal(c(3,6,2,2), alpha=1+4, beta=1)
BF_y4 = Partial_y4_M1/Partial_y4_M2

Partial_y5_M1 = PoisMarginal(c(3,6,2,4), alpha=3+2, beta=3)
Partial_y5_M2 = PoisMarginal(c(3,6,2,4), alpha=1+2, beta=1)
BF_y5 = Partial_y5_M1/Partial_y5_M2

BF_Intrinsic = (BF_y1 + BF_y2 + BF_y3 + BF_y4 + BF_y5)/5
BF_Intrinsic

## [1] 1.639

Fractional Bayes Factors

An alternative to intrinsic BF is the fractional Bayes factor:

BF_b = p (y, b | M1) / p (y, b | M2) ,

where

p (y, b | Mi) = ∫ f (y | θi, Mi) πi (θi) dθi / ∫ f (y | θi, Mi)^b πi (θi) dθi for b ∈ (0, 1) .

• Often choose b = 1/n if BF1/n is well-defined.

• Fractional BF satisfies likelihood principle, intrinsic BF does not.

– The fractional BF only depends on the likelihood of the data f (y | θ) and the prior for θ.
– The intrinsic BF is an (either geometric or arithmetic) average over conditional distributions.

• Note that

p (y, b | Mi) = ∫ f (y | θi, Mi)^{1−b} p (θi | y, b, Mi) dθi ,

where p (θi | y, b, Mi) ∝ f (y | θi, Mi)^b πi (θi).

• For {yj}_{j=1}^n iid given θi,

f (y | θi, Mi)^b = [ Π_{j=1}^n f (yj | θi, Mi) ]^b .

So b = 1/n gives the geometric mean of the likelihood of one observation.

Example (Traffic Accidents (cont.)). Note that p (λ | y, 1/n, M2) = Gamma (ȳ + 1, 1):

p (λ | y, 1/n, M2) ∝ f (y | λ, M2)^{1/n} π2 (λ) = [λ^{Σ yi} e^{−nλ} / Π yi!]^{1/n} × 1 ∝ λ^{ȳ} e^{−λ} ∝ Gamma (ȳ + 1, 1) .

Thus,

p (y, 1/n | M2) = ∫ f (y | λ, M2)^{1−b} p (λ | y, 1/n, M2) dλ = ∫ [λ^{Σ yi} e^{−nλ} / Π yi!]^{1 − 1/n} · (λ^{ȳ} e^{−λ} / Γ(ȳ + 1)) dλ
= (1 / ((Π yi!)^{(n−1)/n} Γ(ȳ + 1))) ∫ λ^{nȳ} e^{−nλ} dλ = Γ(Σ yi + 1) / ((Π yi!)^{(n−1)/n} Γ(ȳ + 1) n^{Σ yi + 1}) .

n = 5
Yvec = c(3,6,2,4,2)
# M2 (flat prior): formula derived above
Fractional_M2 = (1/prod(factorial(Yvec)))^((n-1)/n)*(1/n)^(sum(Yvec)+1)*
  (1/gamma(mean(Yvec)+1))*gamma(sum(Yvec)+1)
# M1 (Gamma(3, 2) prior): analogous derivation, with p(lambda | y, 1/n, M1) = Gamma(ybar + 3, 3)
Fractional_M1 = 3^(mean(Yvec)+3)*(1/prod(factorial(Yvec)))^((n-1)/n)*
  (1/(n+2))^(sum(Yvec)+3)*(1/gamma(mean(Yvec)+3))*gamma(sum(Yvec)+3)

BF_Fractional = Fractional_M1/Fractional_M2; BF_Fractional

## [1] 0.7782

Cross-Validated Likelihood

The conditional predictive distribution for yi is

f (yi | y(i)) = ∫ f (yi | θ, y(i)) p (θ | y(i)) dθ ,

where y(i) = (y1, . . . , yi−1, yi+1, . . . , yn).

• A low f (yi | y(i)) indicates the model is a poor fit for yi.

• The product Π_{i=1}^n f (yi | y(i)) is sometimes called the pseudo marginal likelihood.

– A higher value indicates a better overall model fit.
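For the conjugate Poisson-Gamma setting of the traffic example, each leave-one-out predictive f (yi | y(i)) is negative binomial, so the pseudo marginal likelihood is a short computation (a sketch, not from the notes; the variable names are mine):

```r
# Pseudo marginal likelihood for y ~ Poisson(lambda), lambda ~ Gamma(a, b):
# f(y_i | y_(i)) is NB with size = a + sum(y_(i)) and prob = (b + n - 1)/(b + n).
y <- c(3, 6, 2, 4, 2); a <- 3; b <- 2; n <- length(y)
cpo <- sapply(seq_len(n), function(i) {
  dnbinom(y[i], size = a + sum(y[-i]), prob = (b + n - 1) / (b + n))
})
prod(cpo)  # pseudo marginal likelihood under the Gamma(3, 2) prior
```

Each factor cpo[i] is the conditional predictive ordinate for observation i; unusually small values flag poorly fit observations.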

Bayesian Residuals

Use the conditional predictive distribution to compute Bayesian residuals: ri = yi − E [Yi | y(i)]. The standardized residual is

di′ = (yi − E [Yi | y(i)]) / √(Var (Yi | y(i))) ,

38 which can be used to detect systematic departures from model assumptions.

Bayesian P-Values

Recall: the frequentist p-value based on observed statistic T (y): P (T (Y) ≥ T (y) | H0) where Y and y are iid given θ.

The posterior predictive p-value under M0: y ∼ f (y | θ), θ ∼ π (θ), for a statistic T (y, θ) is

pT = P (T (Y, θ) ≥ T (y, θ) | M0, y) = ∫ P (T (Y, θ) ≥ T (y, θ) | θ) p (θ | M0, y) dθ .

pT is called a Bayesian p-value. More generally, replace ‘≥’ with ‘more extreme than.’

• T can be a function of data (y) and parameters (θ). For example, the goodness-of-fit statistic

T (y, θ) = Σ_{i=1}^n [yi − E (Yi | θ)]² / Var (Yi | θ) .

• Low pT can indicate problems with M0. That is, the observed data are not plausible under the predictive model.

• Little decision-theoretic justification.

• Not recommended as sole basis for comparing models.

Example (Parental Remorse). Interested in θ: proportion of parents who regret having children.

• Prior for θ: Beta(1, 9)

• In a small survey of n = 10 longtime parents, y = 9 say they regret the choice.

• Posterior for θ: Beta(10, 10)

What is the posterior predictive p-value? If y ∼Binomial(n, θ) and π (θ) =Beta(a, b), then

P (Y = k | y = 9) = ∫ C_k^n θ^k (1 − θ)^{n−k} · [θ^{a−1} (1 − θ)^{b−1} / B (a, b)] dθ = (C_k^n / B (a, b)) ∫ θ^{k+a−1} (1 − θ)^{n−k+b−1} dθ = C_k^n B (a + k, b + n − k) / B (a, b) ,

where B is the Beta function.

a = 10; b = 10; n = 10
Pk_y = c()
for(k in 0:n){ i = k+1; Pk_y[i] = choose(n,k)*beta(a+k,b+n-k)/beta(a,b) }
post_pvalue = Pk_y[10] + Pk_y[11]; post_pvalue

## [1] 0.02889

Here we show prior and posterior distribution for θ on the left and P (Y ≥ 9 | y = 9) on the right.

The pT suggests that our prior was overly strong and we should have been less optimistic about parental remorse. Note. Posterior predictive p-values have been criticized for “using the data twice:” (i) use data for reference distribution p (θ | y) and (ii) use data for observed statistic T (y, θ). The practical implication of this has been debated: Read blog article “posterior predictive p-values” and related discussion in comments. Alternatively, for training data y1 and test data y2, compute

pT′ = P (T (Y2, θ) ≥ T (y2, θ) | M0, y1).

7 Hierarchical Model

Bayesian Hierarchical Models

• Standard model: y ∼ f (y | θ1), θ1 ∼ π1 (θ1). The joint distribution of y, θ1 is p (y, θ1) = f (y | θ1) π1 (θ1). If π1 has hyper-parameters θ2: p (y, θ1, θ2) = f (y | θ1) π1 (θ1 | θ2) π2 (θ2). Then a Bayesian hierarchical model of level l is

p (y, θ1, θ2, . . . , θl) = f (y | θ1) π1 (θ1 | θ2) π2 (θ2 | θ3) ··· πl (θl | η).

• The marginal prior for θ1 is

π* (θ1) = ∫ ··· ∫ π1 (θ1 | θ2) ··· πl (θl | η) dθ2 ··· dθl ;

if π1, . . . , πl are proper, so is π*.

• The marginal, posterior, and posterior predictive can be derived from the marginal prior π* in the standard way for y = (y1, . . . , yn) iid ∼ f (· | θ1):

p (y) = ∫ f (y | θ1) π* (θ1) dθ1 ,
p (θ1 | y) = f (y | θ1) π* (θ1) / ∫ f (y | θ1) π* (θ1) dθ1 ,
p (yn+1 | y) = ∫ f (yn+1 | θ1) p (θ1 | y) dθ1 .

40 Example (Liver Disease). Give a collection of mice a dose of alcohol. Observe which mice develop liver disease. Each mouse comes from one of 50 genetic strains. For i = 1,..., 50, let ni be the number of mice from strain i and yi be the number of mice with liver disease from strain i.

• Assume each strain has probability θi of developing liver disease: yi ∼Binomial(ni, θi).

• Model θi’s as iid from Beta(a, b) distribution.

p(θi | yi, a, b) ∝ p(yi | θi) p(θi | a, b)  =⇒  p(θi | yi, a, b) = Beta(a + yi, b + ni − yi),

because the Beta prior p(θi | a, b) is conjugate for the binomial likelihood (Beta-Binomial model).

p(yi | a, b) = ∫ p(yi | θi) p(θi | a, b) dθi = (ni choose yi) ∫ θi^(yi) (1 − θi)^(ni−yi) · θi^(a−1)(1 − θi)^(b−1)/B(a, b) dθi
= (ni choose yi) (1/B(a, b)) ∫ θi^(yi+a−1) (1 − θi)^(ni−yi+b−1) dθi = (ni choose yi) B(a + yi, b + ni − yi)/B(a, b),

where the integrand is the kernel of Beta(a + yi, b + ni − yi), and

p(y | a, b) = ∏i=1..50 p(yi | a, b)  (because the strains are independent).

• How to choose a and b? One approach is to put a prior on (a, b). For computational simplicity, we’ll use a discrete uniform prior:

        b = 1   b = 2   b = 3
a = 1   1/9     1/9     1/9
a = 2   1/9     1/9     1/9
a = 3   1/9     1/9     1/9

• The posterior probability of (a = a0, b = b0) is

p(a = a0, b = b0 | y) = p(y | a = a0, b = b0) P(a = a0, b = b0) / p(y)
= p(y | a = a0, b = b0) / Σi=1..3 Σj=1..3 p(y | a = i, b = j),

since P(a, b) = 1/9 for all (a, b) is a constant that cancels from numerator and denominator.

Thus, the posterior probabilities for a, b:

StrainData= read.csv('http://www.tc.umn.edu/~elock/MiceData.csv')
n= StrainData$n
Yi= StrainData$Yi

Yi_marg <- function(ni,yi,a,b){choose(ni,yi)*beta(a+yi,b+ni-yi)/beta(a,b)}
Y_marg <- function(n,y,a,b){prod(Yi_marg(n,y,a,b))} # product of marginals for each Yi

Norm= Y_marg(n,Yi,1,1)+Y_marg(n,Yi,1,2)+Y_marg(n,Yi,1,3)+
  Y_marg(n,Yi,2,1)+Y_marg(n,Yi,2,2)+Y_marg(n,Yi,2,3)+
  Y_marg(n,Yi,3,1)+Y_marg(n,Yi,3,2)+Y_marg(n,Yi,3,3)
y_marg_ab <- matrix(NA,3,3)
for(a in 1:3){
  for(b in 1:3){
    y_marg_ab[a,b] <- Y_marg(n,Yi,a,b)/Norm
  }
}
print(y_marg_ab, digits=2)

##         [,1]    [,2]    [,3]
## [1,] 3.5e-07 1.4e-01 8.6e-01
## [2,] 7.7e-22 9.2e-10 8.1e-05
## [3,] 3.4e-36 5.7e-20 4.3e-12

• The posterior distribution for θi is p(θi | y) = Σ(a,b) p(θi | yi, a, b) P(a, b | y); that is,

p(θi | y) = 0.14 Beta(1 + yi, 2 + ni − yi) + 0.86 Beta(1 + yi, 3 + ni − yi).

Then the posterior expectation for θi is E[θi | y] = 0.14 (1 + yi)/(3 + ni) + 0.86 (1 + yi)/(4 + ni). Here are the histograms of the raw proportions θ̂i = yi/ni (on the left) and E[θi | y] (on the right).


• The posterior for the probability of liver disease in an unobserved strain, θ51, is

p(θ51 | y) = Σ(a,b) p(θ51 | a, b) p(a, b | y) = 0.14 Beta(1, 2) + 0.86 Beta(1, 3),

since for an unobserved strain there is no data to inform θ51 (equivalent to setting n51 = 0 and y51 = 0).

Note. The marginal prior (over a, b) for θi is π∗(θi) = Σl=1..3 Σm=1..3 (1/9) Beta(a = l, b = m).

[Figure: densities for θi (left) and θ51 (right).]

• The prior predictive pmf for the ni mice from strain i is

p(Yi = yi) = Σ(a,b) P(Yi = yi | a, b) P(a, b) = Σl=1..3 Σm=1..3 (1/9) (ni choose yi) B(l + yi, m + ni − yi)/B(l, m).

The full marginal prior pmf p(y) is

p(y) = Σl=1..3 Σm=1..3 (1/9) ∏i=1..50 (ni choose yi) B(l + yi, m + ni − yi)/B(l, m).

Suppose a mouse's susceptibility to liver disease does not depend on genetics (θ1 = θ2 = · · · = θ50 = θ), with yi ∼ Binomial(ni, θ) and θ ∼ Beta(1, 1). Denote this model M0, in contrast to the previous hierarchical model M1.
Under M0, y ∼ Binomial(n, θ) where n = Σi=1..50 ni and y = Σi=1..50 yi. We observe n = 356 and y = 92, so by the conjugacy of the Beta-Binomial model

p(θ | y, M0) = Beta(1 + 92, 1 + 356 − 92) = Beta(93, 265).

• Marginal for M0:

p(y | M0) = ∫ p(y | θ) π(θ | M0) dθ = ∫ ∏i=1..50 p(yi | θ) · 1 dθ = ∏i=1..50 (ni choose yi) ∫ θ^(Σi yi) (1 − θ)^(Σi (ni−yi)) dθ = ∏i=1..50 (ni choose yi) B(93, 265),

where the integrand is the kernel of Beta(Σi yi + 1, Σi (ni − yi) + 1) = Beta(93, 265).

• Marginal for M1:

p(y | M1) = Σ(a,b) P(a, b) ∏i=1..50 p(yi | a, b) = Σl=1..3 Σm=1..3 (1/9) ∏i=1..50 (ni choose yi) B(l + yi, m + ni − yi)/B(l, m).

Marg_M1= 1/9*(Y_marg(n,Yi,1,1)+Y_marg(n,Yi,1,2)+Y_marg(n,Yi,1,3)+
  Y_marg(n,Yi,2,1)+Y_marg(n,Yi,2,2)+Y_marg(n,Yi,2,3)+
  Y_marg(n,Yi,3,1)+Y_marg(n,Yi,3,2)+Y_marg(n,Yi,3,3))
Marg_M0= prod(choose(n,Yi))*beta(93,265)
BF= Marg_M1/Marg_M0; BF

## [1] 23987863

The Bayes factor for M1 over M0 is greater than 10,000, convincing evidence that the hierarchical model is superior.

Note (Structural Form of M0 and M1). Begin with the fixed distribution and count from the first layer of random variable to the data.

M0 (Standard Model): for fixed (a, b),

Beta(a, b) →(1) θi → (ni, yi),  i = 1, . . . , n.

M1 (Hierarchical Model with Level 2): for a fixed prior distribution Π on (a, b),

Π →(1) (a, b), and then Beta(a, b) →(2) θi → (ni, yi),  i = 1, . . . , n.

Hierarchical Normal Model

Assume y1, . . . , yn ∼ Normal(θ1, σ1²), θ1 ∼ Normal(θ2, σ2²), . . . , θl ∼ Normal(µ∗, τ²). Then

π∗(θ1) = Normal(µ∗, σ2² + σ3² + · · · + σl² + τ²),

where the variances σi², i = 1, . . . , l, are known.

• Assume yij ∼ Normal(θi, σ²) for group i and observations j = 1, . . . , ni within each group, where σ² is known.

• Assume the θi's are iid Normal(µ, τ²) with τ² known, and put a prior π(µ) on µ.

• Given yi = (yi1, . . . , yi,ni) and fixed θi and σ²,

f(yi | θi, σ²) ∝ Normal(ȳi | θi, σi²)  where σi² = σ²/ni,

so it suffices to consider the group means ȳi:

– ȳi is a sufficient statistic for yi.

• With known variances σ² and τ², the full joint distribution is

p(y, θ, µ) = ∏i=1..m Normal(ȳi | θi, σi²) · ∏i=1..m Normal(θi | µ, τ²) · π(µ),

where the first product is f(y | θ; σ²) and the remainder is f(θ | µ; τ²) · π(µ).

• The posterior distribution for µ is p(µ | y):

p(µ | y) ∝ f(y | µ) π(µ) = ∏j=1..m f(yj | µ) · π(µ) ∝ ∏j=1..m f(ȳj | µ) · 1
∝ exp{ −Σj (ȳj − µ)² / [2(σj² + τ²)] }
∝ exp{ −(Σj wj / 2) (µ − Σj wj ȳj / Σj wj)² },  with wj = (σj² + τ²)^(−1),

so that

p(µ | y) = Normal( Σj wj ȳj / Σj wj , (Σj wj)^(−1) ),

where the second line uses f(yj | µ) ∝ f(ȳj | µ) with ȳj ∼ Normal(µ, σj² + τ²) (θj integrated out).
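As a quick numerical check of this precision-weighted form of p(µ | y), here is a small Python sketch; the group means and variances below are made-up illustrative values, not the course data:

```python
# Posterior for mu with known sigma_j^2 and tau^2:
#   mu | y ~ Normal( sum(w_j * ybar_j) / sum(w_j), 1 / sum(w_j) ),
# where w_j = 1 / (sigma_j^2 + tau^2) are the precision weights.

ybar = [1.0, 3.0, 5.0]      # hypothetical group means
sigma2 = [1.0, 1.0, 4.0]    # hypothetical sigma^2 / n_j for each group
tau2 = 1.0

w = [1.0 / (s + tau2) for s in sigma2]
post_mean = sum(wj * yj for wj, yj in zip(w, ybar)) / sum(w)
post_var = 1.0 / sum(w)
print(post_mean, post_var)  # groups with smaller variance get more weight
```

Here the third group, with the largest variance, pulls the posterior mean the least.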

Note. Suppose yj = (yj1, . . . , yj,nj) where the yji are iid Normal(θj, σ²) and θj ∼ Normal(µ, τ²). Then yji = θj + εji with εji iid Normal(0, σ²), and hence

ȳj = (1/nj) Σi yji = θj + (1/nj) Σi εji,
E(ȳj) = E(θj) + (1/nj) Σi E(εji) = µ + 0 = µ,
Var(ȳj) = Var(θj) + (1/nj²) · nj · Var(εji) = τ² + σ²/nj = τ² + σj².

Example (IQ Scores). Human IQs have variance 225 and are centered at 100. We want to infer the calibration µ of a certain IQ test. Sample 20 individuals, where individual i takes the test ni times:

• yij ∼ Normal(θi, σ² = 64) for all i and j,

• θi ∼ Normal(µ, τ² = 225) for all i,

• π(µ) = 1 for all µ.

• The posterior for µ is Normal(108.7, 13.0).

Thus P(µ < 100 | y) ≈ 0.01, suggesting that the IQ test is too generous.

The Normal-Gamma Model

• Assume y = (y1, . . . , yn) are iid Normal(µ, σ²) with σ² known.

• If π(µ) = Normal(µ0, τ²), then

p(µ | y) = Normal( (σ²µ0 + nτ²ȳ) / (σ² + nτ²), σ²τ² / (σ² + nτ²) ).

• The precision is the reciprocal of the variance. Thus, the posterior precision is 1/τ² + n/σ², where 1/τ² is the prior precision and n/σ² is the data precision.
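A one-line numerical check of this precision identity, as a Python sketch; the values of µ0, τ², σ², n, and ȳ are arbitrary stand-ins:

```python
# Normal-normal update with sigma^2 known and prior mu ~ Normal(mu0, tau2).
mu0, tau2 = 0.0, 4.0            # hypothetical prior
sigma2, n, ybar = 1.0, 10, 2.0  # hypothetical data summary

post_mean = (sigma2 * mu0 + n * tau2 * ybar) / (sigma2 + n * tau2)
post_var = (sigma2 * tau2) / (sigma2 + n * tau2)

# posterior precision = prior precision + data precision
assert abs(1.0 / post_var - (1.0 / tau2 + n / sigma2)) < 1e-12
print(post_mean, post_var)
```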

Conjugate Normal with µ Known and σ2 Unknown

• Assume y = (y1, . . . , yn) are iid Normal(µ, σ²), with µ known and σ² unknown.

• Sampling model (likelihood):

f(y | σ²) ∝ (1/σ²)^(n/2) exp[ −Σi=1..n (yi − µ)² / (2σ²) ].

Given y and µ, the right-hand side is proportional to a Gamma density in 1/σ².

• The conjugate prior for σ² is the inverse gamma distribution (IG).
Note. If X ∼ Gamma(α, β), then 1/X ∼ IG(α, β), the inverse gamma distribution.

Assume the prior distribution for σ² is π(σ²) = IG(α, β) with α, β > 0:

π(σ²) = [β^α / Γ(α)] (1/σ²)^(α+1) exp(−β/σ²).

Then the posterior distribution for σ² is

p(σ² | y) ∝ f(y | σ²) π(σ²) ∝ (1/σ²)^(n/2) exp[ −Σi (yi − µ)²/(2σ²) ] · (1/σ²)^(α+1) exp(−β/σ²)
= (1/σ²)^(α+n/2+1) exp{ −[ β + Σi (yi − µ)²/2 ] / σ² }
∝ IG( α + n/2, β + Σi (yi − µ)²/2 ).

That is, p(σ² | y) = IG(α + n/2, β + (n/2)W) where W = (1/n) Σi (yi − µ)². Thus the prior distribution for σ² is conjugate.

• For σ² ∼ IG(α, β), the prior mean and variance are

E[σ²] = β/(α − 1)  for α > 1,
Var(σ²) = β² / [(α − 1)²(α − 2)]  for α > 2.

Solving for α, β:

α = (E[σ²])² / Var(σ²) + 2  and  β = E[σ²] { (E[σ²])² / Var(σ²) + 1 }.
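These inversion formulas are easy to sanity-check numerically; a small Python sketch (the target moments are chosen arbitrarily):

```python
# Moment-matching for sigma^2 ~ IG(alpha, beta):
#   E = beta / (alpha - 1),   Var = beta^2 / ((alpha - 1)^2 (alpha - 2)),
# inverted as alpha = E^2/Var + 2 and beta = E * (E^2/Var + 1).

def ig_params(mean, var):
    alpha = mean ** 2 / var + 2
    beta = mean * (mean ** 2 / var + 1)
    return alpha, beta

alpha, beta = ig_params(10.0, 100.0)   # e.g. target E = 10, Var = 100
print(alpha, beta)                     # -> 3.0 20.0

# plug back into the IG moment formulas
assert abs(beta / (alpha - 1) - 10.0) < 1e-12
assert abs(beta ** 2 / ((alpha - 1) ** 2 * (alpha - 2)) - 100.0) < 1e-12
```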

Conjugate Normal with both µ and σ2 Unknown

• Assume y = (y1, . . . , yn) are iid Normal(µ, σ²), with both µ and σ² unknown.

• A conjugate prior for (µ, σ²) is

π(σ²) = IG(α, β),
π(µ | σ²) = Normal(µ0, σ²/λ),

where the prior for µ depends on σ², and λ reflects the confidence/influence of the prior (larger λ means more confidence in the prior hypermean µ0).

• By the normal-normal model with σ² known,

p(µ | σ², y) = Normal( (λµ0 + nȳ)/(λ + n), σ²/(λ + n) ).

By the model with µ known, p(σ² | µ, y) ∝ p(µ | σ², y) π(σ²):

p(σ² | µ, y) = IG( α + n/2, β + (n/2)W ).

By integrating over µ, p(σ² | y) ∝ ∫ f(y | µ, σ²) π(µ | σ²) π(σ²) dµ:

p(σ² | y) = IG( α + (n − 1)/2, β + (1/2) Σi (yi − ȳ)² ).

8 Bayesian Linear Models

For observations y1, . . . , yn, the basic linear model is yi = x1iβ1 + · · · + xpiβp + εi, where x1i, . . . , xpi are the predictors for the i-th observation and εi is the error term. In matrix form, y = Xβ + ε, where y, ε ∈ R^(n×1), X ∈ R^(n×p), and β ∈ R^(p×1). Assume X is fixed and the errors are iid normal with equal variance: ε ∼ Normal(0, σ²I). The standard frequentist estimates are

β̂ = (XᵀX)^(−1) Xᵀy  and  σ̂² = s² = (1/(n − p)) (y − Xβ̂)ᵀ(y − Xβ̂).

These estimates are unbiased and can be motivated by least squares.

Bayesian Framework with Uninformative Priors

Under a Bayesian framework, we put a prior on β and σ². Consider a uniform prior for β and the Jeffreys prior for σ²:

π(β, σ²) = π(β) · π(σ²) = 1 · (1/σ²) = 1/σ².

To derive the Jeffreys prior for σ², take y ∼ Normal(µ, σ²I) with µ = Xβ held fixed (so terms involving only β cancel as constants):

log f(y | σ²) = const − (n/2) log σ² − Σi (yi − µi)²/(2σ²),
I(σ²) = −Ey|θ[ ∂²/∂(σ²)² log f(y | σ²) ] = −Ey|θ[ n/(2σ⁴) − Σi (yi − µi)²/σ⁶ ] = −n/(2σ⁴) + nσ²/σ⁶ = n/(2σ⁴),

so the Jeffreys prior is π(σ²) ∝ √I(σ²) ∝ 1/σ².

• The posterior for β, given σ², is p(β | y, σ²) = Normal( β̂, σ² (XᵀX)^(−1) ).

Consider the projection matrix onto the column space of X: PX = X(XᵀX)^(−1)Xᵀ. The fitted values are Ey|β,σ²-type projections ŷ = Xβ̂ = PX y, and Xᵀ(y − Xβ̂) = 0. Decomposing y − Xβ = (y − Xβ̂) + X(β̂ − β),

p(β | y, σ²) ∝ f(y | β, σ²) π(β, σ²)
∝ exp[ −(1/(2σ²)) (y − Xβ)ᵀ(y − Xβ) ]
= exp[ −(1/(2σ²)) { (y − Xβ̂)ᵀ(y − Xβ̂) + (β − β̂)ᵀXᵀX(β − β̂) } ]
∝ exp[ −(1/2) (β − β̂)ᵀ [σ²(XᵀX)^(−1)]^(−1) (β − β̂) ] = Normal( β̂, σ²(XᵀX)^(−1) ),

where the cross term vanishes because Xᵀ(y − Xβ̂) = 0.

• The marginal posterior for σ² is

p(σ² | y) ∝ ∫ f(y | β, σ²) π(β, σ²) dβ ∝ IG( (n − p)/2, (n − p)s²/2 ),

where f(y | β, σ²) = MVN(y | Xβ, σ²I). Thus, for the uninformative prior π(β, σ²) ∝ 1/σ², the variance estimate is

E[σ² | y] = [ (n − p)s²/2 ] / [ (n − p)/2 − 1 ] = s² (n − p)/(n − p − 2).

However, the expected precision is

E[1/σ² | y] = 1/s²,

which is why s² is commonly used as the point estimate for the error variance.

Note. (i) If U ∼ IG(a, b), then E[U] = b/(a − 1) for a > 1. (ii) If V = 1/U ∼ Gamma(a, b), then E[V] = a/b. (iii) If U ∼ χ²v, then c · U ∼ Gamma(v/2, scale 2c). Thus,

σ² | y ∼ (n − p)s² / U  where U ∼ χ²(n−p).
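The χ² representation gives an easy way to simulate from the marginal posterior of σ²; a Python sketch with stand-in values for n, p, and s² (not the course data), drawing χ²(n−p) as a sum of squared standard normals:

```python
import random

random.seed(1)
n, p, s2 = 30, 5, 2.0        # hypothetical n, p, and s^2
df = n - p

def draw_sigma2():
    # U ~ chi^2_df as a sum of df squared standard normals
    u = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(df))
    return df * s2 / u       # sigma^2 | y  =  (n - p) s^2 / U

draws = [draw_sigma2() for _ in range(20000)]
mc_mean = sum(draws) / len(draws)
exact = s2 * df / (df - 2)   # E[sigma^2 | y] = s^2 (n - p) / (n - p - 2)
print(mc_mean, exact)        # the two should agree closely
```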

• The marginal posterior for βi is a shifted and scaled t-distribution:

(βi − β̂i) / [ s √((XᵀX)^(−1)ii) ] ∼ t(n−p).

• For a new predictor vector xn+1, the posterior predictive for yn+1 is also a shifted and scaled t-distribution:

(yn+1 − xᵀn+1 β̂) / [ s √(1 + xᵀn+1 (XᵀX)^(−1) xn+1) ] ∼ t(n−p).

Remark. All of the results above under the joint prior π(β, σ²) ∝ 1/σ² correspond to the standard frequentist inference for linear regression.

• Residuals

– Define the Bayesian residual as r′i = yi − E[Yi | y(i)], where y(i) = (y1, . . . , yi−1, yi+1, . . . , yn).
– Alternatively, r′i = yi − xᵀi β̂(i), where β̂(i) = (Xᵀ(i)X(i))^(−1) Xᵀ(i) y(i).

Compared to the standard residual ri = yi − xᵀi β̂, the Bayesian residual corrects the over-fitting of the standard residual, since it does not use observation i to predict itself.

Example (Body Fat). To complete

Data= read.csv('http://www.lock5stat.com/datasets/BodyFat.csv')
Y_tilde= Data$Bodyfat
Y_tilde_bar= mean(Y_tilde)
s_y_tilde= sd(Y_tilde)
Y= Y_tilde-18.5

X_tilde= as.matrix(Data[,2:10])
X= scale(X_tilde)

Bayesian Framework with Normal-Inverse Gamma Prior

• Consider independent normal priors for the βi’s:

β | σ² ∼ Normal(0, σ²T),  where Tij = τi² if i = j and 0 otherwise.

The prior for the βi's is a normal distribution centered at 0, which encodes the null hypothesis H0: βi = 0 for all i. We then expect the data to reject H0 and show that βi is significantly different from 0; in other words, that xi is a significant predictor of y.

• An inverse gamma prior for σ2: σ2 ∼ IG (a, b).

• The full prior is

π(β, σ²) = ∏i=1..p Normal(βi | 0, σ²τi²) · IG(σ² | a, b).

• The posterior for β, given σ2, is

p(β | y, σ²) = Normal( β̃, σ² Vβ ),

where β̃ = (XᵀX + T^(−1))^(−1) Xᵀy and Vβ = (XᵀX + T^(−1))^(−1).

p(β | y, σ²) ∝ p(y, β, σ²) = p(y | β, σ²) · π(β, σ²)
∝ exp[ −(1/(2σ²)) (y − Xβ)ᵀ(y − Xβ) ] · exp[ −(1/(2σ²)) βᵀT^(−1)β ]
= exp[ −(1/(2σ²)) ( βᵀXᵀXβ − 2βᵀXᵀy + βᵀT^(−1)β ) ]
= exp[ −(1/(2σ²)) ( βᵀ(XᵀX + T^(−1))β − 2βᵀXᵀy ) ]
∝ Normal( (XᵀX + T^(−1))^(−1) Xᵀy, σ² (XᵀX + T^(−1))^(−1) ),

where the IG(σ² | a, b) factor drops out because it does not involve β.

• The estimate β̃ solves a penalized least-squares criterion: β̃ = arg minβ ‖y − Xβ‖² + Σi=1..p βi²/τi². The penalized criterion is

‖y − Xβ‖² + Σi=1..p βi²/τi² = (y − Xβ)ᵀ(y − Xβ) + βᵀT^(−1)β = yᵀy − 2βᵀXᵀy + βᵀXᵀXβ + βᵀT^(−1)β.

The first-order condition with respect to β is −2Xᵀy + 2XᵀXβ + 2T^(−1)β = 0; that is, β = (XᵀX + T^(−1))^(−1) Xᵀy.
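The shrinkage implied by this penalized criterion is easiest to see with a single predictor; a Python sketch with hypothetical data (one coefficient, prior mean 0, prior variance σ²τ²):

```python
# One-predictor version of beta_tilde = (X'X + T^{-1})^{-1} X'y:
#   beta_tilde = sum(x*y) / (sum(x^2) + 1/tau^2)

x = [1.0, 2.0, 3.0, 4.0]    # hypothetical predictor
y = [2.1, 3.9, 6.2, 8.1]    # hypothetical response
tau2 = 0.5

sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx = sum(xi * xi for xi in x)

beta_ols = sxy / sxx                    # least-squares estimate
beta_tilde = sxy / (sxx + 1.0 / tau2)   # penalized / posterior mean

print(beta_ols, beta_tilde)
assert abs(beta_tilde) < abs(beta_ols)  # shrunk toward the prior mean 0
```

Smaller τ² (a tighter prior around 0) shrinks β̃ more strongly.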

• The marginal posterior for σ² is p(σ² | y) = IG(an, bn), where an = a + n/2 and bn = b + (1/2)[ yᵀy − β̃ᵀ Vβ^(−1) β̃ ].

• The marginal posterior for βi is a shifted and scaled t-distribution:

(βi − β̃i) / √( (bn/an) (Vβ)ii ) ∼ t(2an), i.e. t(2a+n).

• For a new predictor vector xn+1, the posterior predictive for yn+1 given σ² is

yn+1 | σ², y ∼ Normal( xᵀn+1 β̃, σ² (1 + xᵀn+1 Vβ xn+1) ).

To see this, let yn+1 = xᵀn+1 β + εn+1 with εn+1 ∼ Normal(0, σ²), and β | y ∼ MVN(β̃, σ² Vβ). Write β = β̃ + ε′ where ε′ ∼ MVN(0, σ² Vβ) is independent of εn+1. Then yn+1 = xᵀn+1 β̃ + xᵀn+1 ε′ + εn+1, so E(yn+1) = xᵀn+1 β̃ and

Var(yn+1) = E[ (xᵀn+1 ε′ + εn+1)(xᵀn+1 ε′ + εn+1)ᵀ ] = xᵀn+1 (σ²Vβ) xn+1 + σ² = σ² (1 + xᵀn+1 Vβ xn+1).

Thus the full posterior predictive distribution is a shifted and scaled t-distribution:

(yn+1 − xᵀn+1 β̃) / √( (bn/an)(1 + xᵀn+1 Vβ xn+1) ) ∼ t(2an), i.e. t(2a+n).

9 Empirical Bayes Approaches

Example (Body Fat (cont.)). Use an iid normal prior for the βi's, β ∼ Normal(0, σ²τ²I), and an IG(a, b) prior for σ².

• Choose a and b such that E[σ²] = 10 and Var(σ²) = 100.

Since σ² ∼ IG(a, b) has E[σ²] = b/(a − 1) for a > 1 and Var(σ²) = b²/[(a − 1)²(a − 2)], this gives a = 3 and b = 20.

• Choose the hyper-parameter τ²:

– Use subjective intuition, or put a prior on τ².

– Alternatively, choose τ² empirically: τ̂² = arg maxτ² p(y | τ²). Then fix τ² = τ̂² and proceed as if it were known.

• For the normal-inverse gamma regression model with the βi's iid with prior variance σ²τ²,

p(y | σ², τ²) = Normal( 0, σ² (I + τ² XXᵀ) ).

To see this, let y = Xβ + ε where β ∼ Normal(0, σ²τ²I) and ε ∼ Normal(0, σ²I). Then E[y] = 0 and

Var(y) = E[ (Xβ + ε)(Xβ + ε)ᵀ ] = X E[ββᵀ] Xᵀ + E[εεᵀ] = σ² (I + τ² XXᵀ).

• The marginal for y is a multivariate t-distribution with 2a degrees of freedom:

p(y | τ²) = MVT2a( 0, (b/a) (I + τ² XXᵀ) ).

Example (Body Fat (cont.)). Choose τ² empirically.

library("mvtnorm") ## load the mvtnorm package for the multivariate t-distribution
n= length(Y)
## Plot marginal density of tau^2 where a=3, b=20 and df=2a=6
marY_hypvar= function(hypvar){
  dmvt(Y, delta=rep(0,n), sigma=(20/3)*(diag(n)+hypvar*X%*%t(X)), df=6)
}
index <- seq(0,5, length=1000)
marg= sapply(index, marY_hypvar)
maximizer= index[which.max(marg)]; maximizer

## [1] 0.6206

[Figure: log marginal density of y as a function of tau^2, maximized near 0.62.]

beta_tilde= solve(t(X)%*%X+(1/0.62)*diag(9))%*%t(X)%*%Y
a_n= 3+50
b_n= 20+(1/2)*(t(Y)%*%Y-t(beta_tilde)%*%(t(X)%*%X+(1/0.62)*diag(9))%*%beta_tilde)
var_mat= as.numeric(b_n/a_n)*solve(t(X)%*%X+(1/0.62)*diag(9))
cred_lower= beta_tilde-sqrt(diag(var_mat))*qt(0.975,106) # 2*a_n=106
cred_upper= beta_tilde+sqrt(diag(var_mat))*qt(0.975,106)
table= cbind(round(beta_tilde,3), round(cred_lower,3), round(cred_upper,3))
print(table)

##            [,1]   [,2]   [,3]
## Age       1.193  0.157  2.229
## Weight   -0.465 -4.127  3.197
## Height   -0.380 -1.602  0.842
## Neck     -0.143 -1.749  1.463
## Chest    -0.642 -2.999  1.714
## Abdomen   8.704  6.286 11.123
## Ankle     0.042 -1.300  1.384
## Biceps    0.207 -1.104  1.519
## Wrist    -2.176 -3.683 -0.668

The Empirical Bayes Approach

An empirical Bayes (EB) method is one in which the prior is estimated from the data. It is useful when borrowing strength across multiple related parameters θ = (θ1, . . . , θk): model the θi's as iid from a shared estimated prior π̂(·).

• Parametric EB: Estimate π from a parametric family with hyperparameters η: π(· | η).

• Nonparametric EB: Assume no parametric form for π.

Parametric Empirical Bayes

For the parametric EB approach, we consider π(· | η) and estimate η. A natural estimate is the maximizer of the marginal density:

η̂ = arg maxη p(y | η) = arg maxη ∫ f(y | θ) π(θ | η) dθ.

In the previous regression example, θ = β and η = τ². Alternatively, one could estimate η by matching moments or some other approach.

Non-parametric Empirical Bayes

A non-parametric EB approach does not assume a parametric form for π. Instead, base inference directly on the marginal distribution

m(y) = ∫ f(y | θ) π(θ) dθ.

Example (Robbins' Method). Suppose Yi | θi ∼ Poisson(θi), with an unspecified prior on the θi. Estimate

θ̂i = (yi + 1) · #{y's equal to yi + 1} / #{y's equal to yi}.

Solution: Note that

f(Yi = yi | θi) = θi^(yi) e^(−θi) / yi!  and  f(θi | Yi = yi) = f(yi | θi) π(θi) / ∫ f(yi | θi) π(θi) dθi.

Therefore,

E[θi | yi] = ∫ θi f(θi | yi) dθi = ∫ θi f(yi | θi) π(θi) dθi / ∫ f(yi | θi) π(θi) dθi
= (yi + 1) ∫ [ θi^(yi+1) e^(−θi)/(yi + 1)! ] π(θi) dθi / ∫ f(yi | θi) π(θi) dθi
= (yi + 1) mπ(yi + 1) / mπ(yi).

Let m̂π(y) = (1/n) Σi=1..n 1{yi = y}. Then

θ̂i = (yi + 1) m̂π(yi + 1) / m̂π(yi).
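Robbins' estimate is just counting; a Python sketch on a small hypothetical vector of Poisson counts:

```python
from collections import Counter

# Robbins' nonparametric EB estimate for Poisson means:
#   theta_hat_i = (y_i + 1) * #{y's equal to y_i + 1} / #{y's equal to y_i}

y = [0, 0, 0, 1, 1, 1, 2, 2, 3, 5]   # hypothetical counts
counts = Counter(y)

def robbins(yi):
    return (yi + 1) * counts[yi + 1] / counts[yi]

print([round(robbins(yi), 3) for yi in y])
```

Note the estimate is 0 for the largest observed counts (no y equal to yi + 1 appears in the data), a known weakness of the raw Robbins estimator.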

Example (Liver Disease (cont.)). Recall: mice from 50 genetic strains.

• ni is the total number of mice from strain i

• yi is the number of mice with liver disease from strain i

• θi is the probability that a mouse from strain i develops liver disease: yi ∼ Binomial(ni, θi), where the θi's are iid from a Beta(a, b) distribution. Estimate a and b empirically.

Solution: Consider the reparametrization µ = a/(a + b) and M = a + b. The mean and variance of a beta-distributed variable θ are

E(θ) = µ  and  Var(θ) = µ(1 − µ)/(M + 1).

To compute the mean and variance of Y in terms of µ and M, use

– the Law of Total Expectation: EY[Y] = Eθ[ EY|θ Y ],
– the Law of Total Variance: VarY[Y] = Eθ[ VarY|θ Y ] + Varθ[ EY|θ Y ].

Since Y | θ ∼ Binomial(n, θ), EY|θ Y = nθ and VarY|θ Y = nθ(1 − θ). Let θ̂ = Y/n. Then

EY[θ̂] = Eθ[ EY|θ(Y/n) ] = Eθ[θ] = µ,
VarY[θ̂] = Eθ[ VarY|θ θ̂ ] + Varθ[ EY|θ θ̂ ] = Eθ[ θ(1 − θ)/n ] + Varθ[θ]
= (1/n)( Eθθ − Eθθ² ) + Varθθ = (1/n)[ µ − µ² − Varθθ ] + Varθθ
= µ(1 − µ)/n + [(n − 1)/n] · µ(1 − µ)/(M + 1)
= [ µ(1 − µ)/n ] [ 1 + (n − 1)/(M + 1) ].

Hence, letting θ̂i = yi/ni,

E[θ̂i] = µ  and  Var(θ̂i) = [ µ(1 − µ)/ni ] [ 1 + (ni − 1)/(M + 1) ].

• Consider an “iterated method of moments” estimator based on y1, . . . , ym: the estimator of the mean E[θ̂i] = µ is

µ̂ = (1/m) Σi=1..m yi/ni,

and, plugging in µ̂, the variance satisfies approximately

s² = (1/m) Σi=1..m Var(θ̂i) ≈ (1/m) Σi=1..m [ µ̂(1 − µ̂)/ni ] [ 1 + (ni − 1)/(M̂ + 1) ],

where s² is the sample variance of the θ̂i's. Thus,

M̂ = [ µ̂(1 − µ̂) − s² ] / [ s² − (µ̂(1 − µ̂)/m) Σi=1..m (1/ni) ].

Therefore, â = µ̂M̂ and b̂ = â(1 − µ̂)/µ̂ = M̂ − µ̂M̂.

Note. Since θi ∼ Beta(a, b), we have θi | yi, ni, M̂, µ̂ ∼ Beta(â + yi, b̂ + ni − yi). Thus,

θ̃i = E[θi | yi, ni, M̂, µ̂] = (â + yi)/(â + yi + b̂ + ni − yi) = (µ̂M̂ + yi)/(M̂ + ni)
= B̂i µ̂ + (1 − B̂i)(yi/ni) = B̂i µ̂ + (1 − B̂i) θ̂i,

that is, θ̃i = B̂i µ̂ + (1 − B̂i) θ̂i, where B̂i = M̂/(M̂ + ni) is the “shrinkage factor.”
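A Python sketch of the shrinkage formula; the values of µ̂, M̂, and the (ni, yi) pairs below are hypothetical, not the mice data:

```python
# EB shrinkage for the beta-binomial model:
#   theta_tilde_i = B_i * mu_hat + (1 - B_i) * (y_i / n_i),
#   B_i = M_hat / (M_hat + n_i)

mu_hat, M_hat = 0.26, 3.2   # hypothetical EB estimates

for n_i, y_i in [(5, 0), (5, 5), (20, 5)]:
    B_i = M_hat / (M_hat + n_i)
    theta_tilde = B_i * mu_hat + (1 - B_i) * (y_i / n_i)
    # equivalently (a_hat + y_i) / (M_hat + n_i) with a_hat = mu_hat * M_hat
    assert abs(theta_tilde - (mu_hat * M_hat + y_i) / (M_hat + n_i)) < 1e-12
    print(n_i, y_i, round(theta_tilde, 3))
```

Strains with small ni and extreme raw proportions are pulled hardest toward µ̂.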

StrainData= read.csv('http://www.tc.umn.edu/~elock/MiceData.csv')
n= StrainData$n
Yi= StrainData$Yi
N= length(Yi)
mu_hat= mean(Yi/n)
s_hat_2= var(Yi/n)
M_hat= (mu_hat*(1-mu_hat)-s_hat_2)/(s_hat_2-(mu_hat*(1-mu_hat)/N)*sum(1/n))
a_hat= mu_hat*M_hat
b_hat= (a_hat*(1-mu_hat))/mu_hat
c(a_hat, b_hat)

## [1] 0.8108084 2.3560193

x <- seq(0,1, length=500)
post_theta= 0.14*dbeta(x,1,2)+0.86*dbeta(x,1,3)
emp_theta= dbeta(x,a_hat,b_hat)
plot(x,post_theta, xlim=c(0,1), ylim=c(0,5), ylab="density", xlab="theta",
  type='l', lwd=2)
points(x,emp_theta, lwd=1, type='l', col='red')

[Figure: the mixture density 0.14 Beta(1, 2) + 0.86 Beta(1, 3) from the hierarchical analysis (black) and the estimated empirical Bayes prior Beta(â, b̂) (red), plotted over theta.]

10 Direct Sampling

In statistics it is common to encounter integrals that are difficult to solve analytically. For example,

1. Computing expected values or variances

2. Computing the normalizing constant for marginal distributions

3. Finding probabilities when the CDF is unknown

Monte Carlo simulation, in which observations are randomly generated from the model, is helpful for overcoming this.

Direct Monte Carlo integration

Expected Value Assume θ ∼ p(θ), let θ1, . . . , θN be iid draws from p(θ), and consider a function g(θ). By the Law of Large Numbers, for large N,

∫ g(θ) p(θ) dθ = E[g(θ)] ≈ (1/N) Σj=1..N g(θj).

Standard Error Let γ̂ = (1/N) Σj=1..N g(θj). The variance of g(θ) under p can be approximated similarly:

Var[g(θ)] ≈ (1/(N − 1)) Σj=1..N [ g(θj) − γ̂ ]².

This represents the uncertainty in g(θ) implied by p(· | y). The standard error for γ̂ uses Var(γ̂) = Var(g(θ))/N:

ŝe(γ̂) = √{ [ 1/(N(N − 1)) ] Σj=1..N [ g(θj) − γ̂ ]² }.

This represents the simulation uncertainty in the estimate of E[g(θ)]; ŝe(γ̂) converges to 0 as N increases.
By the Central Limit Theorem, γ̂ ∼ Normal( E[g(θ)], ŝe(γ̂)² ) for large N.
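A Python sketch of direct Monte Carlo with its standard error, for g(θ) = θ² with θ ∼ Normal(0, 1), where E[g(θ)] = 1 exactly:

```python
import math
import random

random.seed(42)
N = 50000
g = [random.gauss(0.0, 1.0) ** 2 for _ in range(N)]   # g(theta) = theta^2

gamma_hat = sum(g) / N                                # estimate of E[g(theta)]
var_g = sum((gi - gamma_hat) ** 2 for gi in g) / (N - 1)
se = math.sqrt(var_g / N)                             # simulation standard error

print(gamma_hat, se)
assert abs(gamma_hat - 1.0) < 0.05   # true value is 1; se is roughly 0.006 here
```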

Consider g(θ) = 1{a < c(θ) < b} for a function c, so E[g(θ)] = Pr(c(θ) ∈ (a, b)), estimated by

P̂r = #{ c(θj)'s ∈ (a, b) } / N,

with standard error √[ P̂r (1 − P̂r) / N ].

Note. (i) g(θ) follows a Bernoulli distribution, so the count N · P̂r follows a binomial distribution.

(ii) A histogram of the c(θj)'s approximates the density of c(θ).

Example (Eye Exam). An eye patient is asked to identify shapes from a given distance. She is shown n = 20 shapes. Use the Jeffreys beta-binomial model for the number of shapes she identifies correctly: y ∼ Binomial(20, θ) and θ ∼ Beta(0.5, 0.5). Assume she identifies 16 shapes correctly; then p(θ | y) = Beta(16.5, 4.5).
Consider the logit transformation g(θ) = log(θ/(1 − θ)), and approximate the distribution of g(θ) induced by the Beta posterior. For j = 1, . . . , N:

(i) simulate θj from Beta(16.5, 4.5), and
(ii) compute g(θj) = log( θj/(1 − θj) ).

Compute the g(θj)'s for N = 10,000 simulations.

N= 10000
theta_js= rbeta(N,16.5,4.5)
g_theta_js= log((theta_js)/(1-theta_js))
E_g_theta= mean(g_theta_js)
Var_g_theta= var(g_theta_js)
SE= sqrt(Var_g_theta/N)
print(c(E_g_theta, Var_g_theta, SE))

## [1] 1.382663 0.308158 0.005551

Note. The mean of the draws follows a normal distribution by the Central Limit Theorem, even though the draws themselves are not normally distributed.

Direct Sampling for Multivariate Density

Let θ = (θ1, . . . , θk) and factor p(θ1, . . . , θk) = ∏i=1..k p(θi | θi−1, . . . , θ1); for example, p(θ1, θ2, θ3) = p(θ3 | θ2, θ1) p(θ2 | θ1) p(θ1). We can draw a direct sample θ̃ from p(θ1, . . . , θk) as follows:

(i) draw θ̃1 from p(θ1),
(ii) draw θ̃2 from p(θ2 | θ̃1),
(iii) . . . , and finally draw θ̃k from p(θk | θ̃k−1, . . . , θ̃1).

This is useful for simulating difficult marginal densities. For example, p(θ1) and p(θ2 | θ1) may be easy to obtain but not p(θ2).

Example (Body Fat (cont.)). Consider the model y = Xβ + ε, with the iid normal prior for the βi's, β ∼ Normal(0, 0.62 σ² I), where τ̂² = 0.62 = arg maxτ² p(y | τ²), and the prior σ² ∼ IG(3, 20) based on moment matching (E[σ²] = b/(a − 1)).

The conditional posterior is p(β | y, σ²) = Normal(β̃, σ² Vβ), where β̃ = (XᵀX + (1/0.62) I)^(−1) Xᵀy and Vβ = (XᵀX + (1/0.62) I)^(−1).
The marginal posterior for σ² is p(σ² | y) = IG(an, bn), where an = 3 + 100/2 and bn = 20 + (1/2)[ yᵀy − β̃ᵀ (XᵀX + (1/0.62) I) β̃ ]. [Note p(σ² | y) ∝ p(y | σ²) π(σ²).]

Direct sampling scheme: for samples j = 1, . . . , N = 10,000,

(i) draw σj² from p(σ² | y),
(ii) draw βj from p(β | y, σj²).

### Load data
Data= read.csv('http://www.lock5stat.com/datasets/BodyFat.csv')
Y_tilde= Data$Bodyfat
Y_tilde_bar= mean(Y_tilde)
s_y_tilde= sd(Y_tilde)
Y= Y_tilde-18.5
X_tilde= as.matrix(Data[,2:10])
X= scale(X_tilde)
beta_tilde= solve(t(X)%*%X+(1/0.62)*diag(9))%*%t(X)%*%Y
V_beta= solve(t(X)%*%X+(1/0.62)*diag(9))
a_n= 3+50
b_n= 20+(1/2)*(t(Y)%*%Y-t(beta_tilde)%*%(t(X)%*%X+(1/0.62)*diag(9))%*%beta_tilde)

##load package MASS, for simulating multivariate normal variables using mvrnorm() library("MASS")

## Draw from posterior
N= 10000
beta_sims= matrix(nrow=9,ncol=N)
sigma2_sims= matrix(nrow=1,ncol=N)
for(j in 1:N){
  sigma2_sims[j]= 1/rgamma(1,a_n,b_n)
  beta_sims[,j]= mvrnorm(1,beta_tilde, sigma2_sims[j]*V_beta)
}

CI95_B1_Sim= c(quantile(beta_sims[1,],0.025), quantile(beta_sims[1,],0.975))
## Compare with t-interval

var_mat= as.numeric(b_n/a_n)*solve(t(X)%*%X+(1/0.62)*diag(9))
cred_lower_t= beta_tilde-sqrt(diag(var_mat))*qt(0.975,106)
cred_upper_t= beta_tilde+sqrt(diag(var_mat))*qt(0.975,106)
CI95_B1_t= c(cred_lower_t[1], cred_upper_t[1])
cat(" 95% credible interval from sim:", CI95_B1_Sim, "\n",
  "95% credible interval from t-dist:", CI95_B1_t, sep="")

## 95% credible interval from sim: 0.1719 2.21 ## 95% credible interval from t-dist: 0.1575 2.229

Assume an individual has the following standardized measurements: Age: 1.20, Weight: 0.5, Height: -0.4; circumference of neck: 2.2, chest: 0.3, abdomen: 0.8, ankle: 0.2, bicep: 0.3, and wrist: 0.6.

Denote his vector of predictors as x101 and simulate from the posterior predictive for y101:

(i) draw σj² and βj as before, then
(ii) draw y101,j from Normal( xᵀ101 βj, σj² ).

Note that x101 only affects y101 in the last step:

p(y101, σ², β | y) = p(σ² | y) · p(β | y, σ²) · p(y101 | σ², β),

corresponding to steps 1, 2, and 3 respectively.

x101= c(1.2,0.5,-0.4,2.2,0.3,0.8,0.2,0.3,0.6)
y101_sims= matrix(nrow=1,ncol=N)
for(j in 1:N) y101_sims[j]= rnorm(1,x101%*%beta_sims[,j],sqrt(sigma2_sims[j]))
Predicted.BF= y101_sims+18.5
summary(as.numeric(Predicted.BF))

## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 8.01 22.10 25.10 25.10 28.10 42.00

11 Importance Sampling

Importance Sampling

Direct sampling can be used to estimate ∫ c(θ) p(θ) dθ by simulating draws from p(θ). But what if the normalizing constant for p(θ) is unknown? For multivariate distributions, for example, the normalizing integral is often difficult to calculate. Several “indirect” simulation methods address this, including importance sampling: simulate θ from some other distribution g, but weight each simulation by its relative likelihood under p. Consider the posterior p(θ | y) ∝ f(θ | y) π(θ), where f(θ | y) = f(y | θ) is the likelihood. For a given function c,

E[c(θ) | y] = ∫ c(θ) f(θ | y) π(θ) dθ / ∫ f(θ | y) π(θ) dθ.

Simulate θ1, . . . , θN iid from a density g(θ), and define the weights

w(θ) = f(θ | y) π(θ) / g(θ).

Although f(θ | y) and π(θ) are known, their product may not be proportional to any well-known distribution, or its normalizing constant may be difficult to calculate. We therefore draw from an alternative, well-defined distribution g(θ) and use the weights w(θ):

E[c(θ) | y] = ∫ c(θ) f(θ | y) π(θ) dθ / ∫ f(θ | y) π(θ) dθ = ∫ c(θ) w(θ) g(θ) dθ / ∫ w(θ) g(θ) dθ
≈ [ (1/N) Σj=1..N c(θj) w(θj) ] / [ (1/N) Σj=1..N w(θj) ].
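A Python sketch of self-normalized importance sampling for a case where the answer is known: the unnormalized target q(θ) = exp(−(θ − 2)²/2) is Normal(2, 1) up to a constant, so E[θ] = 2, and the importance function is a wider Normal(0, 3²):

```python
import math
import random

random.seed(7)
N = 100000

def q(t):
    # unnormalized target: Normal(2, 1) up to a constant
    return math.exp(-0.5 * (t - 2.0) ** 2)

def g(t):
    # importance function: Normal(0, 3^2) density, wide enough to cover q
    return math.exp(-0.5 * (t / 3.0) ** 2) / (3.0 * math.sqrt(2.0 * math.pi))

thetas = [random.gauss(0.0, 3.0) for _ in range(N)]
w = [q(t) / g(t) for t in thetas]   # weights; normalizing constant not needed

est = sum(wi * ti for wi, ti in zip(w, thetas)) / sum(w)
print(est)                          # should be close to the true mean, 2
assert abs(est - 2.0) < 0.1
```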

Comments

1. g (θ) is called the importance function.

2. w (θ) weighs values according to (relative) posterior probability, and probability of being simulated from g:

(a) Higher posterior probability → Higher weight, (b) Higher probability of being simulated from g → lower weight.

3. Ideally, g closely resembles p (θ | y):

(a) Efficiency comes from choosing g(θ) as close as possible to p(θ | y); g(θ) = p(θ | y) minimizes simulation variability (then all the weights are equal).

Density Approximation

Let c(θ) = 1{θ ∈ (a, b)}. Then

E[c(θ) | y] = Pr(θ ∈ (a, b)) ≈ Σ{j: θj ∈ (a,b)} w(θj) / Σj=1..N w(θj).

We can use this to approximate the posterior density with weighted histograms:

Bin Height = [ Σ{j: θj ∈ (a,b)} w(θj) / Σj w(θj) ] · 1/(b − a).

Define

ĉ(θ) = [ (1/N) Σj=1..N c(θj) w(θj) ] / [ (1/N) Σj=1..N w(θj) ].

Denote xj = c(θj) w(θj) and yj = w(θj). We can approximate the simulation variability of ĉ(θ) as

Var(x̄/ȳ) ≈ (1/N) [ σ̂x²/ȳ² + x̄²σ̂y²/ȳ⁴ − 2x̄σ̂xy/ȳ³ ],

where σ̂x² = [1/(N − 1)] Σ(xj − x̄)², σ̂y² = [1/(N − 1)] Σ(yj − ȳ)², and σ̂xy = [1/(N − 1)] Σ(xj − x̄)(yj − ȳ).

Example (Normal-Gamma Model). Suppose y = (y1, . . . , yn) iid ∼ Normal(0, θ) and prior θ ∼ Gamma(3, 0.5). The posterior is

p(θ | y) ∝ θ^((4−n)/2) exp{ −(1/(2θ)) ( θ² + Σi=1..n yi² ) },

which is not a well-known or easily integrable pdf (unlike an IG prior). Consider the importance function g = Gamma(α = cs², β = c) [Note: Eθ = α/β = s²]:

(i) s² is the sample variance of y, so the expected value under g is s².
(ii) Higher c implies less variance.

The importance weight function is

w(θ) = f(θ | y) π(θ) / g(θ) = θ^((4−n)/2 − (cs² − 1)) exp{ −(1/(2θ)) (θ² + Σi yi²) + cθ }.

Note that constants in w(θ) can be ignored, since they cancel in E[c(θ) | y]. For different values of c:

(i) simulate N = 100,000 draws from g(θ),
(ii) compute w(θ) for each draw,
(iii) for the estimated expected value θ̂ = Σj θj w(θj) / Σj w(θj), compute the simulation uncertainty Var(θ̂) using the variance formula.

## Simulate 20 observations y
set.seed(123)
y <- rnorm(n=20, mean=0, sd=sqrt(5))
n= length(y)
s_2= var(y)

## Define weight function, based on data y
w <- function(theta,c){
  theta^((4-n)/2)*exp(-1/(2*theta)*(theta^2+sum(y^2)))/dgamma(theta,c*s_2,c)
}

## Grid of c values
c_grid= c(1:50)/25

## Run importance sampling for each c and calculate simulation variance
SimVar_theta_hat= theta_hat= c()
for(i in 1:length(c_grid)){
  N= 100000
  c= c_grid[i]
  thetas= rgamma(N,c*s_2,c)
  ws= w(thetas,c)
  theta_hat[i]= sum(ws*thetas)/sum(ws) ## Estimated posterior expectation
  ### Compute posterior variance due to simulation:
  x= ws*thetas
  SimVar_theta_hat[i]= (1/N)*(var(x)/mean(ws)^2+
    mean(x)^2*var(ws)/mean(ws)^4-
    2*mean(x)*cov(x,ws)/mean(ws)^3)
}

##Find best c c= c_grid[ which.min(SimVar_theta_hat)]; c

## [1] 0.8

##Get importance samples for best c thetas= rgamma(N,c*s_2,c) ws= w(thetas,c) theta_hat= sum(ws*thetas)/sum(ws); theta_hat

## [1] 5.209

Pr_theta_above_5= sum(ws[thetas>5])/sum(ws); Pr_theta_above_5

## [1] 0.4878

To minimize the variance of importance sampling, choose c = 0.80. With c = 0.80 we have θ̂ = 5.21, and we can also calculate posterior probabilities, for example Pr(θ > 5 | y) = Σ{j: θj > 5} w(θj) / Σj=1..N w(θj) = 0.489.

[Figure: weighted histogram approximating p(θ | y) over theta, with the importance function g(θ) overlaid as a red continuous density curve.]

12 Rejection Sampling

Rejection Sampling

We would like to sample from a (potentially unnormalized) density p. This is equivalent to sampling uniformly from the area under p. If p ∝ q, it is also equivalent to sampling uniformly from the area under q; here p is a posterior distribution and q is its unnormalized version, q(θ) = f(y | θ) π(θ).

Algorithm Choose an enveloping function M · g(θ) that you can sample underneath, with q(θ) ≤ M · g(θ) ∀θ, where in practice g is a density function and M is a constant. Ignore the samples above q and keep the samples underneath. For j = 1, ..., N:

(i) Draw θj from g (θ)

(ii) Draw U from Uniform(0, 1). Then we have a point (θj,U · M · g (θj)) under M · g (θ).

(iii) Accept θ_j if U · M · g(θ_j) ≤ f(y | θ_j)π(θ_j); otherwise reject.
The accepted θ_j's correspond to draws from p(θ | y); each proposal is accepted with probability Pr_j = f(y | θ_j)π(θ_j) / (M · g(θ_j)).

Comments

1. Similar to importance sampling: Importance sampling uses

(a) Simulations from known density g (θ) to approximate unknown p (θ | y).

(b) Prj as weights instead of acceptance probability. (That is, importance sampling accepts all draws with weights.)

2. Rejection sampling yields direct samples from p (θ | y) not weighted samples.

3. The envelope function Mg(θ) MUST satisfy Mg(θ) ≥ f(y | θ)π(θ) ∀θ; otherwise the accepted draws are not from the posterior.

4. Ideally, M is as small as possible and Mg(θ) is close to f(y | θ)π(θ).

Example (Normal Density). Simulate samples from p(θ) = N(0, 1) using rejection sampling. Let g(θ) = Cauchy(0, 2) be the envelope density: g(θ) = [2π(1 + (θ/2)²)]⁻¹.

[Figure: N(0,1) density with envelope functions Cauchy(0,2)·1 and Cauchy(0,2)·3.]

Histogram of accepted samples using M = 1 and M = 3:

# Rejection sampler for multiplication constant M
rejsamp= function(M){
  while(1){ # Repeat this experiment until uMg(x) < q(x)
    # Draw single value from between zero and one
    u= runif(1)
    # Draw candidate value for x from the enveloping distribution
    x= rcauchy(1, location=0, scale=2)
    # Accept or reject candidate value; if rejected try again
    if(u*M*dcauchy(x, location=0, scale=2) < dnorm(x,0,1)) return(x)
  }
}
N= 100000
sampx1= replicate(N, rejsamp(1))
sampx3= replicate(N, rejsamp(3))

[Figure: histograms of accepted samples under envelopes Cauchy(0,2)·1 and Cauchy(0,2)·3.]

Question 1 What value of M gives optimal efficiency (i.e., least time)? In this context, compare envelope and target at the mode:

Mg(θ)|_{θ=0} ≥ q(θ)|_{θ=0} ⟹ M · 1/(2π) ≥ (1/√(2π)) · 1 ⟹ M ≥ √(2π).

More generally, for all θ, f(θ) = Mg(θ) − q(θ) ≥ 0. Solve ∂f(θ)/∂θ = 0 for the minimizer θ̃, then choose M so that Mg(θ̃) − q(θ̃) is as close to zero as possible.

Question 2 What is the acceptance probability for M = √(2π)?

Acceptance prob. = ∫ [q(θ) / (M g(θ))] g(θ) dθ = (∫ q(θ) dθ) / M = 1/M.

In this context q(θ) = N(0, 1) is a normalized density, so ∫ q(θ) dθ = 1; otherwise the acceptance probability is (∫ q(θ) dθ)/M.
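This 1/M acceptance rate is easy to check by simulation. Below is a minimal Python sketch (the course code elsewhere is R) of the same setup: target q(θ) = N(0,1), envelope Cauchy(0,2), and the smallest valid constant M = √(2π).

```python
import math
import random

random.seed(1)

def q(theta):
    # target: the (already normalized) N(0,1) density
    return math.exp(-theta ** 2 / 2) / math.sqrt(2 * math.pi)

def g(theta):
    # Cauchy(0, 2) envelope density
    return 1 / (2 * math.pi * (1 + (theta / 2) ** 2))

M = math.sqrt(2 * math.pi)  # smallest M with M*g(theta) >= q(theta) for all theta

N = 100_000
accepted = 0
for _ in range(N):
    # Cauchy(0,2) draw via inverse CDF; accept when U*M*g(theta) <= q(theta)
    theta = 2 * math.tan(math.pi * (random.random() - 0.5))
    if random.random() * M * g(theta) <= q(theta):
        accepted += 1

rate = accepted / N
print(rate)  # should be close to 1/M, about 0.399
```

Since q here integrates to 1, the empirical acceptance rate matches 1/M; for an unnormalized q it would instead estimate (∫q)/M.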

13 Metropolis-Hastings Sampling

Overview of Posterior Simulation Method

• Direct sampling: Sample from analytical posterior distribution, say, p (θ | y).

• Non-iterative indirect sampling: Suppose p(θ | y) ∝ f(y | θ)π(θ) and the normalizing integral of f(y | θ)π(θ) is difficult to compute. Consider:
(i) Importance sampling: Choose a known density g(θ) and weight samples by w(θ) = f(y | θ)π(θ)/g(θ).
(ii) Rejection sampling: Choose a known density g(θ) and construct an enveloping function M · g(θ), where M is a constant and f(y | θ)π(θ) ≤ Mg(θ) ∀θ. Accept θ_j if U · M · g(θ_j) ≤ f(y | θ_j)π(θ_j), where U ∼ Uniform(0, 1).

• Iterative indirect (MCMC) sampling: (i) Metropolis-Hastings algorithm (ii) Gibbs sampling

Markov Chain Monte Carlo (MCMC)

• “Monte Carlo” refers to any method that uses random sampling to obtain results.

• A Markov chain is a sequence of random variables θ^(1), θ^(2), ..., satisfying the Markov property: P(θ^(t+1) | θ^(1), ..., θ^(t)) = P(θ^(t+1) | θ^(t)); the next state t+1 depends only on the current state t.

• MCMC methods “adaptively” simulate from posterior p (θ | y).

– Current draw depends on previous draw. – Draws converge to approximate dependent samples from p (θ | y).

Metropolis-Hastings Sampling

Wish to draw θ^(1), θ^(2), ... from a (potentially unnormalized) distribution h, e.g., h(θ) = f(y | θ)π(θ). Define a proposal density that depends on the previous draw θ^(t−1): q(· | θ^(t−1)). Each new draw is taken from q(· | θ^(t−1)), with a rejection step to encourage new draws to have high density under h. The Metropolis algorithm applies to symmetric q, q(θ* | θ^(t−1)) = q(θ^(t−1) | θ*), while the Metropolis-Hastings algorithm extends to non-symmetric q.

The Metropolis Algorithm Specify an initial value θ(0). For t = 1,...,T repeat:

1. Draw θ* from q(· | θ^(t−1)), a symmetric proposal density.

2. Compute r = h(θ*)/h(θ^(t−1)).
3. If r ≥ 1, set θ^(t) = θ*;
   if r < 1, set θ^(t) = θ* with probability r and θ^(t) = θ^(t−1) with probability 1 − r.

Note. For computational reasons, often work with log-densities: r = exp{log h(θ*) − log h(θ^(t−1))}.
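The three steps above can be sketched compactly; this minimal Python version (the course examples use R) targets a toy unnormalized h and works on the log scale, as the note suggests:

```python
import math
import random

random.seed(42)

def log_h(theta):
    # toy unnormalized log-target: proportional to a N(3, 1) density
    return -0.5 * (theta - 3.0) ** 2

def metropolis(log_h, theta0, sd, T):
    draws = []
    theta = theta0
    for _ in range(T):
        theta_star = random.gauss(theta, sd)        # symmetric proposal
        log_r = log_h(theta_star) - log_h(theta)    # log of r = h(theta*)/h(theta)
        if log_r >= 0 or random.random() < math.exp(log_r):
            theta = theta_star                      # accept; otherwise keep current
        draws.append(theta)
    return draws

draws = metropolis(log_h, 0.0, 1.0, 20_000)
post = draws[5_000:]               # discard burn-in
mean = sum(post) / len(post)
print(round(mean, 1))              # posterior mean should be near 3
```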

The Metropolis-Hastings Algorithm Specify an initial value θ(0). For t = 1,...,T repeat:

1. Draw θ* from q(· | θ^(t−1)), a (possibly non-symmetric) proposal density.

2. Compute r = [h(θ*) q(θ^(t−1) | θ*)] / [h(θ^(t−1)) q(θ* | θ^(t−1))].
3. If r ≥ 1, set θ^(t) = θ*;
   if r < 1, set θ^(t) = θ* with probability r and θ^(t) = θ^(t−1) with probability 1 − r.

Comments Under mild conditions, as t → ∞, θ(t) converges in distribution to a draw from posterior. Read Chib and Greenberg (1995) “Understanding the Metropolis-Hastings Algorithm.”

1. The Metropolis-Hastings algorithm is identical to the Metropolis if q is symmetric.

2. In practice, a good initial value θ(0) will have high posterior density.

(a) Could initialize by posterior mode, if possible: θ(0) = θˆ. (b) Make a guess or generate θ(0) from prior.

Choice of Proposal Density

A common choice for q is a normal distribution centered at the previous draw: q(θ* | θ^(t−1)) = Normal(θ^(t−1), σ²). If θ is multivariate, replace σ² with a covariance matrix Σ.

Acceptance Ratio

• Higher σ2 often leads to low acceptance ratio: Proposal θ∗ may be far away from areas in which p concentrates (“big jumps”).

• Lower σ² often leads to a high acceptance ratio: proposals θ* are close to θ^(t−1), and many iterations are needed to cover larger areas of the parameter space.

We need to compromise between these two extremes. As a rule of thumb, accepting about 20% ∼ 70% of proposals is reasonable.

• Vary q to give the desired rejection rate: Some algorithms adjust q adaptively during sampling.

• Alternatively, for q = Normal(θ^(t−1), σ²), let σ² be an approximation to the posterior variance. Recall the Bayesian CLT: σ²_{θ|y} = Var_{θ|y}(θ) ≈ (I^π(y))⁻¹ (the asymptotic variance with prior π), where

I^π_ij(y) = −[∂²/∂θ_i∂θ_j] log(f(y | θ)π(θ)) |_{θ=θ̂^π}.

Iterations

• Early iterations depend on the initial value, especially if the initial value is far from where the posterior concentrates.

• Typical to ignore first M beginning iterations as burn-in.

– Burn-in can vary: M = 1, 000, M = 5, 000 or even M = 100, 000 iterations. – May adjust proposal distribution during burn-in.

• Aim for stationarity after burn-in: The probability distribution of θ(t) does not depend on t, while θ(t) does depend on θ(t−1) (correlation).

– Initial iterations not stationary because of dependence on θ(0). – Eventually iterations will be approximately stationary. – The stationary distribution is the posterior: p θ(t) ≈ p (θ | y) for t > M.

• θ(t) for t ≥ M are kept as draws from posterior.

• To validate burn-in, can run from different initialization: See if they converge to similar dis- tributions after burn-in.

Dependence In general, favor low dependence between MCMC samples.

• Low autocorrelation: cor(θ^(t), θ^(t−1)).

• Leads to better convergence toward stationary posterior.

• Leads to lower uncertainty in results from posterior draws. That is, samples can cover the area under the objective (unknown distribution) h in a small size of draws.

Example (Basketball Shooting (cont.)). Consider the shooting percentage for a basketball team over n games: y = (y_1, ..., y_n).
Model y_i ∼iid Beta(θ, 2) for θ > 0: f(y_i | θ) = [1/B(θ, 2)] y_i^{θ−1}(1 − y_i) = θ(1 + θ) y_i^{θ−1}(1 − y_i). Use a Gamma(a, b) prior for θ. Then, given y,

p(θ | y) ∝ θ^{n+a−1}(θ + 1)^n e^{−bθ} (∏_{i=1}^n y_i^θ) ∏_{i=1}^n [(1 − y_i)/y_i] ∝ θ^{n+a−1}(θ + 1)^n e^{−bθ} ∏_{i=1}^n y_i^θ =: h(θ),

where the factor ∏_{i=1}^n (1 − y_i)/y_i is constant in θ.

Observe n = 20 games with Σ_{i=1}^{20} log y_i = −9.89. Prior: a = b = 1. By the Bayesian CLT, the approximate posterior is p(θ | y) ≈ Normal(3.24, 0.33).

To compare with the Bayesian CLT result, we now apply Metropolis sampling to draw from p(θ | y), using the asymptotic approximation to motivate θ^(0) and q: (i) initial value θ^(0) = 3.24, (ii) proposal density q(· | θ^(t−1)) = Normal(θ^(t−1), 0.33), and (iii) unnormalized posterior h(θ). Let T = 25,000 with M = 5,000 iterations as burn-in, leaving N = 20,000 draws from p(θ | y).

##Compute mode and asymptotic variance (see Asymptotic_Posterior_Approximations)
n= 20; logYs= -9.89
A= sum(logYs)-1; B= 2*n-1+sum(logYs); C= n
theta_hat= (-B-sqrt(B^2-4*A*C))/(2*A) ##Quadratic formula
var= (1/20)*(theta_hat^2*(1+theta_hat)^2)/(theta_hat^2+(theta_hat+1)^2)
h <- function(theta) theta^n*(theta+1)^n*exp(-theta)*(exp(sum(logYs)))^theta

##Perform Metropolis algorithm
T= 25000
met.thetas= rep(0,T)
prev.theta= theta_hat
for(t in 1:T){
  theta_star= rnorm(1,prev.theta,sqrt(var))
  r= exp(log(h(theta_star))-log(h(prev.theta)))
  if(r>=1) met.thetas[t]= theta_star
  if(r<1){
    U= runif(1,0,1) ###Uniform RV: accept with probability r, since P(U<r)=r
    if(U<r) met.thetas[t]= theta_star
    else met.thetas[t]= prev.theta
  }
  prev.theta= met.thetas[t]
}
##To compute acceptance rate, count number of times met.thetas[t] != met.thetas[t-1]
AcceptanceRate= sum(met.thetas[(5000+1):(T-1)]!=met.thetas[(5000+2):T])/(T-1)
##0.56 -- a bit higher than we'd like, but still OK
AutoCor= cor(met.thetas[(5000+1):(T-1)],met.thetas[(5000+2):T]) #Autocorrelation = 0.771

[Figure: left, trace plot of the first 100 iterations; right, histogram of the posterior draws with the N(3.24, 0.33) density overlaid.]

14 Gibbs Sampling

Motivation Let θ = (θ_1, ..., θ_k) with density p. Recall that a direct sample θ^(t) from p(θ_1, ..., θ_k) can be drawn as follows:
(i) Draw θ_1^(t) from p(θ_1),
(ii) Draw θ_2^(t) from p(θ_2 | θ_1^(t)),
(iii) ... Draw θ_k^(t) from p(θ_k | θ_{k−1}^(t), ..., θ_1^(t)).
In direct sampling the full distribution p(θ_1, θ_2) can be difficult, while p(θ_1) and p(θ_2 | θ_1) may be easier. However, marginalizing over the other parameters, as for p(θ_1), can also be difficult, while sampling from full conditionals, such as p(θ_k | θ_{k−1}, ..., θ_1), may be easier. This motivates Gibbs sampling.

Gibbs Sampling

Wish to draw θ^(1), θ^(2), ... from a distribution p, where θ^(t) = (θ_1^(t), ..., θ_k^(t)). Specify an initial vector θ^(0). For t = 1, ..., T:
(i) Draw θ_1^(t) from p(θ_1 | θ_2^(t−1), θ_3^(t−1), ..., θ_k^(t−1)),
(ii) Draw θ_2^(t) from p(θ_2 | θ_1^(t), θ_3^(t−1), ..., θ_k^(t−1)),
(iii) ... Draw θ_k^(t) from p(θ_k | θ_1^(t), θ_2^(t), ..., θ_{k−1}^(t)).
Each draw is from a full conditional using the most recent values. As t → ∞, θ^(t) converges in distribution to a draw from p(θ); in practice p(θ) is a multivariate posterior density.
Note. Gibbs sampling is a special case of Metropolis-Hastings, in which the proposal distributions are the full conditionals, giving acceptance probability 1.
Consider p(θ) ∝ h(θ) and proposal density q(θ_1* | θ^(t−1)) = p(θ_1* | θ_2^(t−1), ..., θ_k^(t−1)), i.e., p conditional on the other k − 1 parameters. Under MH sampling,

r = [h(θ_1*, θ_{−1}^(t−1)) q(θ_1^(t−1) | θ_1*, θ_{−1}^(t−1))] / [h(θ_1^(t−1), θ_{−1}^(t−1)) q(θ_1* | θ^(t−1))]
  = [p(θ_1*, θ_{−1}^(t−1)) p(θ_1^(t−1) | θ_2^(t−1), ..., θ_k^(t−1))] / [p(θ^(t−1)) p(θ_1* | θ_2^(t−1), ..., θ_k^(t−1))] = 1.
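For a concrete toy illustration (a Python sketch, not the course's R code), consider a bivariate normal with correlation ρ, whose full conditionals are the standard normals θ_1 | θ_2 ∼ N(ρθ_2, 1 − ρ²) and θ_2 | θ_1 ∼ N(ρθ_1, 1 − ρ²):

```python
import math
import random

random.seed(7)
rho = 0.8
sd = math.sqrt(1 - rho ** 2)
T, burn = 20_000, 2_000

th1, th2 = 0.0, 0.0
draws1, draws2 = [], []
for _ in range(T):
    # each coordinate is drawn from its full conditional given the latest value
    th1 = random.gauss(rho * th2, sd)
    th2 = random.gauss(rho * th1, sd)
    draws1.append(th1)
    draws2.append(th2)

d1, d2 = draws1[burn:], draws2[burn:]
m1, m2 = sum(d1) / len(d1), sum(d2) / len(d2)
cov = sum((a - m1) * (b - m2) for a, b in zip(d1, d2)) / len(d1)
v1 = sum((a - m1) ** 2 for a in d1) / len(d1)
v2 = sum((b - m2) ** 2 for b in d2) / len(d2)
corr = cov / math.sqrt(v1 * v2)
print(round(corr, 2))  # sample correlation should be near rho = 0.8
```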

Comments

1. Gibbs sampling is the standard tool for estimating large multivariate posteriors.

2. As in MH, convergence is quicker if initialization is in area of high posterior concentration.

3. As in MH, burn-in and autocorrelation must be considered, but there is no rejection of proposals.

4. With low autocorrelation, we say the parameters “mix well.”

5. Gibbs sampling assumes we can obtain direct samples from the full conditionals. There are also “hybrid” versions, in which another method (such as MH) is used to sample from intractable full conditionals.

Example (IQ (cont.)). Human IQs have variance 225 and should be centered at 100. Infer the “calibration” µ and variance σ² of a certain IQ test. Sample m = 20 individuals; individual i takes the test n_i times, with n = Σ_{i=1}^m n_i total scores.
Model: y_ij ∼ Normal(θ_i, σ²), where θ_i ∼ Normal(µ, τ²) for i = 1, ..., m, j = 1, ..., n_i.

Layer 1: µ ∼ π(µ) = 1

Layer 2: θ_i ∼ N(µ, τ² = 225) and σ² ∼ π(σ²)

Layer 3: N(θ_1, σ²), N(θ_2, σ²), ..., N(θ_m, σ²)

Layer 4: y_11, ..., y_1n_1; y_21, ..., y_2n_2; ...; y_m1, ..., y_mn_m

Use independent Jeffreys priors for µ, σ²: π(µ, σ²) = 1/σ². Then the full joint density p(y, θ, σ², µ) is

π(µ, σ²) ∏_{i=1}^m N(θ_i | µ, τ²) ∏_{i=1}^m ∏_{j=1}^{n_i} N(y_ij | θ_i, σ²).

Thus, the full conditionals are

p(θ_i | y, σ², µ) = N( (σ²µ + n_iτ²ȳ_i)/(σ² + n_iτ²), σ²τ²/(σ² + n_iτ²) ) ∀i,
p(σ² | y, θ, µ) = IG( n/2, Σ_{i,j}(y_ij − θ_i)²/2 ),
p(µ | y, θ, σ²) = N( θ̄, τ²/m ),

where p(σ² | y, θ, µ) ∝ ∏_{i=1}^m ∏_{j=1}^{n_i} N(y_ij | θ_i, σ²) π(µ, σ²) ∝ (1/σ²)^{Σ_i n_i/2 + 1} exp{−Σ_{i,j}(y_ij − θ_i)²/(2σ²)} ∝ IG( n/2, Σ_{i,j}(y_ij − θ_i)²/2 ), and the conditional posterior of µ does not depend on σ².

Note. X ∼ Gamma(a, b): p(x) = [b^a/Γ(a)] x^{a−1} e^{−bx}; X ∼ Inverse Gamma(a, b): p(x) = [b^a/Γ(a)] x^{−a−1} e^{−b/x}.
Use Gibbs to draw samples (θ^(t), σ²^(t), µ^(t)) from p(θ, σ², µ | y):
(i) Initial values: σ² = 64, µ = 100.
(ii) Begin sampling with θ: each iteration draws from p(θ_i | y, σ², µ) ∀i, then p(σ² | y, θ, µ), then p(µ | y, θ, σ²).
(iii) Run 10000 iterations, using the first 2000 as burn-in.

#####(Here is the code that was used to simulate these data):
set.seed(123)
n= 20
y <- list()
for(i in 1:n){
  # set mu=110 instead of prior calibration mu=100, then check later
  mu= rnorm(1,110,sqrt(225))
  n_i= sample(c(1:5),1)
  y[[i]]= rnorm(n_i,mu,6) # sigma^2=36
}

T= 10000; BurnIn= 2000; N= T-BurnIn; m= 20
n_i_s= sapply(y,length); n= sum(n_i_s)
y_i_bars= sapply(y,mean)
tau_2= 225 ##fixed
sigma_2= 64; mu= 100 ##initialize sigma^2=64 and mu=100
thetas= c() ##no need to initialize these
draws_sigma_2= rep(0,T); draws_mu= rep(0,T); draws_thetas= matrix(nrow=m,ncol=T)
for(t in 1:T){
  #####Draw for thetas
  for(i in 1:m){
    post_mean= (sigma_2*mu+n_i_s[i]*tau_2*y_i_bars[i])/(n_i_s[i]*tau_2+sigma_2)
    post_var= sigma_2*tau_2/(n_i_s[i]*tau_2+sigma_2)
    thetas[i]= rnorm(1,post_mean,sqrt(post_var))
  }

  #####Draw for sigma^2
  a_n= n/2
  SumSquares= 0
  for(i in 1:m) SumSquares= SumSquares+sum((y[[i]]-thetas[i])^2)
  b_n= (1/2)*SumSquares
  sigma_2= 1/rgamma(1,a_n,b_n)

  ######Draw for mu
  mu= rnorm(1,mean(thetas),sqrt(225/m))

  ###Store sampled values
  draws_sigma_2[t]= sigma_2
  draws_mu[t]= mu
  draws_thetas[,t]= thetas
}

[Figure: histograms of the posterior draws for µ and σ².]

By the prior, IQs should be centered at 100. Check if the test is calibrated too high:

(i) Pr(µ > 100 | y) ≈ (1/N) Σ_{i=1}^N 1{µ^(i) > 100} = 0.996,
which is a statistical inference from the posterior distribution of µ | y rather than µ | y, θ, σ², since the draws of µ | y are taken over all possible θ and σ². Also use the draws to construct a 95% credible interval for the IQ of each individual.

sum(draws_mu[(BurnIn+1):T]>100)/N

## [1] 0.9946

###Load package plotrix to make side-by-side confidence intervals by individual
means= apply(draws_thetas[,(BurnIn+1):T],1,mean)
upper= apply(draws_thetas[,(BurnIn+1):T],1,quantile,probs=0.975)
lower= apply(draws_thetas[,(BurnIn+1):T],1,quantile,probs=0.025)
head(cbind(lower, means, upper)) # list the first 6 individuals

##       lower means upper
## [1,]  96.65 103.4 110.2
## [2,]  98.59 105.6 112.5
## [3,] 101.29 106.7 112.1
## [4,] 118.40 124.5 130.4
## [5,]  90.05 101.5 113.1
## [6,] 102.65 108.0 113.4

[Figure: posterior means and 95% credible intervals for IQ, plotted by individual.]

15 More on MCMC

Connections between methods

• Accepted samples under rejection sampling are direct samples from posterior.

• Importance sampling is analogous to rejection sampling, with rejection probabilities used as weights, i.e., w (θ) = f (y | θ) π (θ) /g (θ).

73 • Metropolis-Hastings sampling includes an accept/reject step similar to rejection sampling.

• Gibbs sampling is a special case of Metropolis-Hastings sampling, in which proposal density is conditional posterior.

Combining MCMC methods

• Gibbs sampling requires direct sampling from full conditionals. Otherwise, it can combine with other sampling methods.

– For example: use Gibbs sampling, with an MH step to sample from intractable conditionals.

• A single MH “sub-step” is sufficient for convergence: draw θ_i^(t) using MH with proposal density given by θ^(t−1), rather than from its full conditional.

Example (IQ (cont.)). Model: scores y_ij ∼ Normal(θ_i, σ²) for individuals i = 1, ..., m, trials j = 1, ..., n_i. IQs θ_i ∼ Normal(µ, 225) for i = 1, ..., m.
Use a flat prior for µ and a Gamma(25, 1) prior for σ²: π(µ, σ²) ∝ (σ²)^{25−1} e^{−σ²}.
Thus the full conditional for σ² is proportional to (σ²)^{24−n/2} exp{−σ² − (1/(2σ²)) Σ_{i=1}^m Σ_{j=1}^{n_i} (y_ij − θ_i)²}, which is not a well-known density. Use Gibbs sampling with an MH sub-step for σ²:

• For i = 1, ..., 20 draw θ_i^(t) from

p(θ_i | y, σ², µ) = Normal( (σ²^(t−1)µ^(t−1) + n_iτ²ȳ_i)/(σ²^(t−1) + n_iτ²), σ²^(t−1)τ²/(σ²^(t−1) + n_iτ²) ).

• Draw σ²^(t) using a Metropolis step:

– Draw σ²* from q(· | σ²^(t−1)) = Normal(σ²^(t−1), 25).
– Compute r = p(σ²*, θ^(t), µ^(t−1), y) / p(σ²^(t−1), θ^(t), µ^(t−1), y).
– If r ≥ 1, set σ²^(t) = σ²*; if r < 1, set σ²^(t) = σ²* with probability r and σ²^(t) = σ²^(t−1) with probability 1 − r.

• Draw µ^(t) from p(µ | y, θ, σ²) = Normal(θ̄, 225/m).

Assessing Convergence

• MCMC iterations will eventually converge to their stationary distribution (the posterior), which can be assessed by visual inspection of trace plots: Plot of draws over the iterations for a parameter.

• There should be no indication of a systematic trend, after burn-in.

• The log joint density can be used as a summary: consider log p(θ_1^(t), ..., θ_k^(t), y) for each iteration t. The trace increases during convergence, then appears stationary after burn-in.

Assessing Variability due to Simulation

For λ = g (θ), consider the estimate for λ based on MCMC draws

Ê(λ | y) = λ̂_N = (1/N) Σ_{t=1}^N λ^(t).

Consider Var(λ̂_N), assuming the draws are from the posterior (i.e., the MCMC has converged) but dependent. Denote by ρ_k the autocorrelation between λ^(t) and λ^(t+k). Then the effective sample size, ESS, is defined by

ESS = N/κ(λ), where κ(λ) = 1 + 2 Σ_{k=1}^∞ ρ_k(λ).

The simulation variance of λ̂_N (i.e., the analog of the frequentist standard error) may be approximated by

V̂ar(λ̂_N) = s²_λ / ESS, where s²_λ = (1/(N−1)) Σ_{t=1}^N (λ^(t) − λ̂_N)².

ESS can be computed by summing autocorrelations until they become negligible (say, below 0.01). Often the autocorrelation decays exponentially, ρ_k ≈ ρ_1^k, which gives ESS ≈ N(1 − ρ_1)/(1 + ρ_1) with N the total number of draws after burn-in.
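A minimal Python sketch of this recipe (a toy illustration, not the course's code), validated on an AR(1) chain whose theoretical ESS is N(1 − ρ_1)/(1 + ρ_1):

```python
import math
import random

def ess(draws, cutoff=0.01):
    """Effective sample size: N / (1 + 2 * sum of autocorrelations above cutoff)."""
    n = len(draws)
    mean = sum(draws) / n
    var = sum((x - mean) ** 2 for x in draws) / n
    kappa = 1.0
    for k in range(1, n):
        rho_k = sum((draws[i] - mean) * (draws[i + k] - mean)
                    for i in range(n - k)) / (n * var)
        if rho_k < cutoff:   # stop once the autocorrelation is negligible
            break
        kappa += 2 * rho_k
    return n / kappa

random.seed(0)
phi, N = 0.5, 20_000          # AR(1) chain with lag-1 autocorrelation phi
x, chain = 0.0, []
for _ in range(N):
    x = phi * x + random.gauss(0, math.sqrt(1 - phi ** 2))
    chain.append(x)

e = ess(chain)
print(round(e))  # theory gives roughly N*(1-phi)/(1+phi), about 6667
```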

16 Deviance Information Criterion

BIC and AIC

Define the deviance function for a model with parameter θ: D (θ) = −2 log f (y | θ). Bayesian Information Criterion (BIC) is

BIC = D(θ̂) + p log n,

which is motivated by an asymptotic approximation of the Bayes factor, where θ̂ is the maximum likelihood estimate, p is the model dimension, θ = (θ_1, ..., θ_p), and n is the sample size of y = (y_1, ..., y_n). The Akaike Information Criterion (AIC) is

AIC = D(θ̂) + 2p,

which is motivated by an asymptotic approximation to the Kullback-Leibler divergence.

Question What if the choice of p and n is not clear? This is common in Bayesian hierarchical models. Consider the multi-level normal model y_ij ∼ Normal(θ_i, σ²) for i = 1, ..., m and j = 1, ..., n_i, where θ_i ∼ Normal(µ, τ²).
(i) If the θ_i are all nearly identical (τ² → 0), the model depends only on estimation of µ (p ≈ 1).
(ii) If the θ_i are estimated independently (τ² → ∞), p ≈ m makes sense.
The choice of sample size is similarly unclear.

Effective Number of Parameters Define the effective number of parameters by the “expected” deviance minus the “fitted” deviance:

p_D = E_{θ|y}D(θ) − D(θ̂), where θ̂ = E_{θ|y}θ.

Higher p_D implies more over-fitting with estimate θ̂. For a non-hierarchical model, the Bayesian CLT implies p ≈ p_D for large n.

• Denote L(θ) = log f(y | θ) and D(θ) = −2L(θ).
• By the Bayesian CLT, p(θ | y) ≈ N(θ̂, [−L''(θ̂)]⁻¹), where θ̂ is the posterior mean or MLE, so that L'(θ̂) = 0.

• Using 2nd order Taylor expansion about θˆ,

D(θ) ≈ D(θ̂) + (θ − θ̂)ᵀ D'(θ̂) + (1/2)(θ − θ̂)ᵀ D''(θ̂)(θ − θ̂)
     = D(θ̂) + (1/2)(θ − θ̂)ᵀ [−2L''(θ̂)] (θ − θ̂) = D(θ̂) + (θ − θ̂)ᵀ [−L''(θ̂)] (θ − θ̂),

where (θ − θ̂)ᵀ[−L''(θ̂)](θ − θ̂) is approximately χ²_p when θ ∼ p(θ | y). Thus, taking expectations on both sides:

E_{θ|y}D(θ) ≈ D(θ̂) + p.

Note. More parameters do not necessarily make the model more flexible in fitting the data.

Deviance Information Criterion The deviance information criterion (DIC) is

DIC = E_{θ|y}D(θ) + p_D, where E_{θ|y}D(θ) measures how well the model fits and p_D penalizes the number of parameters.

• Also estimate DIC from posterior samples: DIC = 2D̄ − D(θ̄), where θ̄ = (1/N) Σ_{t=1}^N θ^(t) and D̄ = (1/N) Σ_{t=1}^N [−2 log f(y | θ^(t))].

– Includes a “goodness-of-fit” term D̄ with a penalty for complexity p_D.
– p_D can be negative if D(θ̄) is relatively large: this implies the Bayesian CLT does not hold and θ̄ is a poor estimate.

• Approximates AIC for a non-hierarchical model.

• Similar asymptotic justification as AIC.

• More appropriate for hierarchical models than AIC and BIC.

• Used for model comparison; otherwise DIC values are not very informative on their own.

– Lower DIC are better.
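The sample-based estimate DIC = 2D̄ − D(θ̄) falls out directly from posterior draws. A minimal Python sketch (a toy conjugate normal-mean model, not the course's R example); with one free parameter, p_D should come out near 1:

```python
import math
import random

random.seed(3)
# toy data: y_i ~ N(theta, 1) with a flat prior, so theta | y ~ N(ybar, 1/n)
y = [random.gauss(2.0, 1.0) for _ in range(50)]
n = len(y)
ybar = sum(y) / n

def deviance(theta):
    # D(theta) = -2 log f(y | theta) for the N(theta, 1) likelihood
    return sum((yi - theta) ** 2 + math.log(2 * math.pi) for yi in y)

draws = [random.gauss(ybar, math.sqrt(1 / n)) for _ in range(10_000)]
D_bar = sum(deviance(t) for t in draws) / len(draws)  # posterior mean deviance
theta_bar = sum(draws) / len(draws)
p_D = D_bar - deviance(theta_bar)                     # effective number of parameters
DIC = D_bar + p_D                                     # equals 2*D_bar - D(theta_bar)
print(round(p_D, 1))  # should be near 1
```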

Example (Gene Testing). 40 mice are given a dose of alcohol, and 40 are kept as controls. Expression levels are subsequently measured for 500 genes in the liver. Y_ij^g is the expression level for gene i, mouse j, group g. Measurements are normally distributed with variance 1: Y_ij^g ∼ Normal(µ_i^g, 1).
Let µ_i^diff = µ_i^alc − µ_i^con with µ_i^diff ∼ Normal(0, τ²), and use a Jeffreys prior for the effect variance, π(τ²) ∝ 1/τ². Consider the group differences

Ȳ_i^diff = Ȳ_i^alc − Ȳ_i^con ∼ Normal(µ_i^alc − µ_i^con, 1/20),

where Ȳ_i^g is the average over the 40 mice for gene i in group g, and Var(Ȳ_i^g) = 1/40.

Layer 1: π(τ²) = 1/τ²
Layer 2: µ_i^diff ∼ N(0, τ²)
Layer 3: N(µ_1^diff, 1/20), ..., N(µ_500^diff, 1/20)
Layer 4: y_1^diff, ..., y_500^diff

The full joint density for the y_i^diff's is

(1/τ²) ∏_{i=1}^{500} N(µ_i^diff | 0, τ²) N(y_i^diff | µ_i^diff, 1/20).

• The Gibbs full conditionals for the µ_i^diff's and τ² are

p(µ_i^diff | τ², y) = Normal( ((1/20)·0 + τ²y_i^diff)/(τ² + 1/20), (τ²/20)/(τ² + 1/20) ),
p(τ² | µ^diff, y) = IG( n/2, (1/2) Σ_{i=1}^{500} (µ_i^diff − 0)² ) = IG( 250, (1/2) Σ_i (µ_i^diff)² ).

• Initialize τ 2 = 1/20, run 10000 iterations with 2000 burn-in. Compute at each iteration:

D(µ^diff, τ²) = −2 log f(y | θ) = −2 Σ_{i=1}^{500} log N(y_i^diff | µ_i^diff, 1/20).

• The deviance at µ̂^diff, the mean vector over the draws, is D(µ̂^diff) = −422.7. Thus

p_D = D̄ − D(µ̂^diff) = 344.9 and DIC = D̄ + p_D = 267.1.

• Consider the null model µ_i^diff = 0 ∀i: the effective number of parameters is

p_D = 0 and DIC = −2 Σ_{i=1}^{500} log N(y_i^diff | 0, 1/20) = 1029.

• Evidence there are alcohol effects for at least some genes.

##Simulated data:
set.seed(123)
mus_true= c(rep(0,400),rnorm(100,0,sqrt(0.5)))
y_diffs= rnorm(500,mus_true,sqrt(0.05))

### initialize
T= 10000
BurnIn= 2000
N= T-BurnIn
draws_tau_2= rep(0,T)
draws_mu_diff= matrix(nrow=T, ncol=500)
Ds= rep(0,T)
tau_2= 1/20

## Run gibbs sampler
for(t in 1:T){
  mus= rnorm(500, tau_2*y_diffs/(tau_2+0.05), sqrt(0.05*tau_2/(tau_2+0.05)))
  tau_2= 1/rgamma(1,250, 0.5*sum(mus^2))
  draws_tau_2[t]= tau_2
  draws_mu_diff[t,]= mus
  Ds[t]= -2*sum(log(dnorm(y_diffs,mus,sqrt(0.05))))
}

### compute
mean_mus= colMeans(draws_mu_diff[2001:T,])
D_mean= -2*sum(log(dnorm(y_diffs,mean_mus,sqrt(0.05))))
p_d1= mean(Ds[2001:T])-D_mean
DIC1= 2*mean(Ds[2001:T])-D_mean
DIC_null= -2*sum(log(dnorm(y_diffs,0,sqrt(0.05))))
cat("p_d1 =",p_d1,"DIC1 =",DIC1,"DIC_null =",DIC_null)

## p_d1 = 304.7747 DIC1 = 227.2288 DIC_null = 698.3259

Consider a third model, which allows “no effect” for some genes. P_1 is the shared probability that µ_i^diff ≠ 0 for a given gene:

µ_i^diff = 0 with probability 1 − P_1, and µ_i^diff ∼ N(0, τ²) with probability P_1.

Again π(τ²) = 1/τ², and use a uniform prior for P_1: P_1 ∼ Beta(1, 1). Let ζ_i = 1{µ_i^diff ≠ 0} and Y_i^diff ∼ Normal(ζ_iµ_i^diff, 1/20).
• Gibbs sampling conditionals for (ζ, µ^diff) for each gene i:

– Draw ζi ∈ {0, 1} by

p(ζ_i = 1 | y, τ², P_1) = P_1 · N(y_i^diff | 0, τ² + 1/20) / [P_1 · N(y_i^diff | 0, τ² + 1/20) + (1 − P_1) · N(y_i^diff | 0, 1/20)],

where y_i^diff |_{ζ_i=1} = µ_i^diff + ε_i, and hence p(y_i^diff | ζ_i = 1) = N(y_i^diff | 0, τ² + 1/20).
– Draw µ_i^diff: if ζ_i = 0, set µ_i^diff = 0. Otherwise, draw µ_i^diff ∼ Normal( τ²y_i^diff/(τ² + 1/20), (τ²/20)/(τ² + 1/20) ).

• Conditional for τ²: p(τ² | ζ, µ^diff, y) = IG( (1/2) Σ_i ζ_i, (1/2) Σ_i ζ_i (µ_i^diff)² ).

• Conditional for P_1: p(P_1 | ζ, µ^diff, y, τ²) = Beta( 1 + Σ_i ζ_i, 1 + 500 − Σ_i ζ_i ).

• p_D = 179.8 and DIC = 106.57: this suggests a good compromise between the null model (DIC = 1029) and the model with an effect for every gene (DIC = 267.1).

#### Now estimate model, allowing some mu_diffs to be 0
draws_p1= rep(0,T)
p1= 0.5
tau_2= 1/5
for(t in 1:T){
  Marg0= (1-p1)*dnorm(y_diffs,0,sqrt(0.05))
  MargNot0= p1*dnorm(y_diffs,0,sqrt(0.05+tau_2))
  Probs1= MargNot0/(Marg0+MargNot0)

  Indicator= rbinom(n=500,1,Probs1)
  mus[Indicator==0]= 0
  NumNot0= sum(Indicator)
  mus[Indicator==1]= rnorm(NumNot0, tau_2*y_diffs[Indicator==1]/(tau_2+0.05),
                           sqrt(0.05*tau_2/(tau_2+0.05)))

  tau_2= 1/rgamma(1,NumNot0/2, 0.5*sum(mus[Indicator==1]^2))

  p1= rbeta(1,1+NumNot0,1+500-NumNot0)

  draws_p1[t]= p1
  draws_tau_2[t]= tau_2
  draws_mu_diff[t,]= mus
  Ds[t]= -2*sum(log(dnorm(y_diffs,mus,sqrt(0.05))))
}

mean_mus= colMeans(draws_mu_diff[2001:T,])
D_mean= -2*sum(log(dnorm(y_diffs,mean_mus,sqrt(0.05))))
p_d2= mean(Ds[2001:T])-D_mean
DIC2= 2*mean(Ds[2001:T])-D_mean
cat("p_d2 =",p_d2,"DIC2 =",DIC2)

## p_d2 = 168.3568 DIC2 = 97.0276

17 Bayesian Model Averaging

Example (Body Fat (cont.)). Recall: % body fat (BF%) measured for 100 adult males, together with 9 predictor variables: age, weight, height, and circumference of neck, chest, abdomen, ankle, bicep, and wrist. Consider the model y = Xβ + ε, where y is population-centered BF%, X is the standardized matrix of predictor variables, and ε ∼ Normal(0, σ²I).

• Previously used an iid normal prior for the β_i's: β ∼ Normal(0, 0.62σ²I), where τ̂² = 0.62 is estimated empirically, and σ² ∼ IG(3, 20).

• Now allow each β_i to be identically 0, with probability 1/2:

β_i = 0 with probability 1/2, and β_i ∼ N(0, 0.62σ²) with probability 1/2.

Equivalently, incorporate model inclusion indicators ζ: y_i = Σ_{k=1}^9 ζ_kβ_kx_ik + ε_i, where the ζ_k's have iid Bernoulli(1/2) priors.

model{
  # sampling model
  for(i in 1:100){
    Y[i] ~ dnorm(mu[i], Prec)
    mu[i] <- coeffs[1]*X1[i]+coeffs[2]*X2[i]+coeffs[3]*X3[i]+
             coeffs[4]*X4[i]+coeffs[5]*X5[i]+coeffs[6]*X6[i]+
             coeffs[7]*X7[i]+coeffs[8]*X8[i]+coeffs[9]*X9[i]}
  # priors
  PrecBeta <- (1/0.62)*Prec
  for(j in 1:9){
    beta[j] ~ dnorm(0,PrecBeta)
    zeta[j] ~ dbern(0.5) # change in new model
    coeffs[j] <- beta[j]*zeta[j]}
  Prec ~ dgamma(3,20)
}

– ζ draw statistics: abdominal circumference (6) is non-zero in all draws, wrist circumference (9) is included in 93%, and age (1) in 78%. The remaining variables are most often excluded from the model.

• Assume an individual has the following standardized measurements:

– Age: 1.20, Weight:0.5, Height:-0.4; circumference of neck:2.2, chest:0.3, abdomen:0.8, ankle:0.2, bicep:0.3, and wrist:0.6.

• Let µ101 be the mean BF% for the above measurements, and y101 the individual's BF%. Compute at each MCMC iteration; for each draw, some subset of the coefficients are 0. WinBUGS code:

mu101 <- 18.5+coeffs[1]*1.2+coeffs[2]*0.5-coeffs[3]*0.4-
         coeffs[4]*2.2+coeffs[5]*0.3+coeffs[6]*0.8+
         coeffs[7]*0.2+coeffs[8]*0.3+coeffs[9]*0.6
y101 ~ dnorm(mu101,Prec)

For many applications, consider multiple models M_1, ..., M_m and combine them rather than selecting the best model.

• Let γ be a quantity that is defined in every model. Prior probabilities p(M_i) define the combined prior distribution

p(γ) = Σ_{i=1}^m p(γ | M_i) p(M_i).

• The combined posterior distribution for γ is

p(γ | y) = Σ_{i=1}^m p(γ | M_i, y) p(M_i | y), where p(M_i | y) = p(y | M_i)p(M_i) / Σ_{j=1}^m p(y | M_j)p(M_j); here p(γ | M_i, y) is the posterior given model and data, p(y | M_i) the likelihood, and p(M_i) the prior.

• In the previous example, consider m = 2⁹ = 512 models.

– Each model M_i includes some subset of the 9 predictor variables.
– Prior: p(M_i) = 2⁻⁹ for i = 1, ..., m.

• If p(M_1) = ··· = p(M_m), then p(M_i | y) ∝ p(y | M_i) and

BF = p(y | M_i)/p(y | M_j) = p(M_i | y)/p(M_j | y)

is the Bayes factor of model i over model j.

• Under posterior sampling, p (Mi | y) is approximated by the proportion of times model i is chosen.
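The bookkeeping here is simple arithmetic; a small Python check with made-up marginal likelihood values and equal prior model probabilities:

```python
# hypothetical marginal likelihoods p(y | M_i) for three models
marg_lik = [0.002, 0.010, 0.008]
prior = [1 / 3] * 3                        # equal prior model probabilities

unnorm = [l * p for l, p in zip(marg_lik, prior)]
post = [u / sum(unnorm) for u in unnorm]   # p(M_i | y)
print([round(p, 2) for p in post])         # [0.1, 0.5, 0.4]

# with equal priors, the Bayes factor equals the posterior-probability ratio
BF_21 = marg_lik[1] / marg_lik[0]
print(round(BF_21, 2), round(post[1] / post[0], 2))
```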

BMA Comments

• Even if no parameters are shared, can use BMA for posterior predictive γ = yn+1.

• BMA appropriately accounts for uncertainty in model choice: Selecting the single best model can underestimate uncertainty, overfit.

• In practice, may consider a large set of models, but omit less plausible models in BMA.

– For example, only include those models i such that

max_j p(M_j | y) / p(M_i | y) < C

for some C (say, C = 20).

18 Mixture Models

Finite Mixture Models

Let f_1, ..., f_K be densities with weights q_1, ..., q_K, where q_1 + ··· + q_K = 1 and q_k ≥ 0 ∀k. Assume y_1, ..., y_n are iid with density p(y_i) = Σ_{k=1}^K q_kf_k(y_i). Equivalently, y_i ∼ f_k with probability q_k, for k = 1, ..., K.

Mixtures are useful for modeling complex distributions as combinations of simpler component distributions f_k.
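Drawing from a finite mixture mirrors the two-stage formulation above: pick component k with probability q_k, then draw from f_k. A minimal Python sketch with two toy normal components:

```python
import random

random.seed(5)
q = [0.3, 0.7]                      # mixture weights, summing to 1
comps = [(-2.0, 1.0), (3.0, 0.5)]   # (mean, sd) for each normal component

def rmix():
    # stage 1: choose a component with probability q_k; stage 2: draw from it
    u, acc = random.random(), 0.0
    for qk, (m, s) in zip(q, comps):
        acc += qk
        if u <= acc:
            return random.gauss(m, s)
    return random.gauss(*comps[-1])  # guard against floating-point round-off

ys = [rmix() for _ in range(50_000)]
mean = sum(ys) / len(ys)
# E[y] = sum_k q_k * mu_k = 0.3*(-2) + 0.7*3 = 1.5
print(round(mean, 1))
```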

Dirichlet Distribution

In a Bayesian framework, a Dirichlet prior is commonly used for q = (q1, . . . , qK ).

• Parameterized by concentration values α = (α1, . . . , αK ):

π(q) = (1/B(α)) ∏_{k=1}^K q_k^{α_k−1}

for q1 + q2 + ··· + qK = 1 and qk ≥ 0 ∀k.

• B (·) is the multivariate beta function, defined by the gamma function Γ:

B(α) = ∏_{k=1}^K Γ(α_k) / Γ(Σ_{k=1}^K α_k).

– For K = 2, the Dirichlet(α1, α2) = Beta(α1, α2).

Note. The Dirichlet distribution generalizes the Beta distribution, with the restriction Σ_i q_i = 1. If q ∼ Beta(α_1, α_2), then (q, 1 − q) ∼ Dirichlet(α_1, α_2). The domain of the Dirichlet is the (K − 1) unit simplex

△^{K−1} = { (q_1, ..., q_K) ∈ R^K : Σ_{k=1}^K q_k = 1 and q_k ≥ 0 ∀k }.

Setting α_1 = ··· = α_K = 1 gives a uniform distribution on △^{K−1}.

Dirichlet-Multinomial Model Let z_1, ..., z_n be iid variables from K categories, with z_i ∈ {1, ..., K} and p(z_i = k) = q_k.
Let n_k be the number of z_i's from category k, i.e., n_k = Σ_{i=1}^n 1{z_i = k}. Thus n = (n_1, ..., n_K) is a sufficient statistic for z.
• If (n_1, ..., n_K) ∼ Multinomial(n, q) with n = Σ_k n_k, the Dirichlet(α) is conjugate for q:

p (q | z) = Dirichlet (α1 + n1, . . . , αK + nK ) .

– n | q ∼ Multinomial(n, q) implies p(n | q) ∝ q_1^{n_1} × ··· × q_K^{n_K}.
– q ∼ Dirichlet(α) implies π(q) ∝ q_1^{α_1−1} × ··· × q_K^{α_K−1}.
– Hence p(q | z) ≡ p(q | n(z)) ∝ p(n(z) | q) π(q) ∝ ∏_{k=1}^K q_k^{α_k+n_k−1} ∝ Dirichlet(α + n).
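A quick Python check of this conjugate update, using the standard fact that normalized independent Gamma(α_k, 1) variates form one Dirichlet(α) draw (the counts below are made up):

```python
import random

random.seed(9)

def rdirichlet(alpha):
    # normalized independent Gamma(a_k, 1) variates give one Dirichlet(alpha) draw
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

alpha = [1.0, 1.0, 1.0]   # uniform prior on the 2-simplex
counts = [30, 10, 60]     # observed category counts n_k
post_alpha = [a + nk for a, nk in zip(alpha, counts)]

draws = [rdirichlet(post_alpha) for _ in range(20_000)]
post_mean = [sum(d[k] for d in draws) / len(draws) for k in range(3)]
exact = [a / sum(post_alpha) for a in post_alpha]  # (alpha_k + n_k) / sum(alpha + n)
print([round(m, 2) for m in post_mean])
print([round(e, 2) for e in exact])
```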

Bayesian Finite Mixture Model

• We put a prior on (f_1, ..., f_K). Assume the f_k's are from the same parametric family: f_k(·) = f(· | θ_k) ∀k.

– For example, f_k = Normal(µ_k, σ_k²) with θ_k = (µ_k, σ_k²) and independent priors

µ_k ∼ Normal(µ_0, τ²) and σ_k² ∼ IG(a, b).

• Let z_i ∈ {1, ..., K} indicate the component that generated y_i: if z_i = k, then y_i is drawn from f(· | θ_k).

• Gibbs sample from the full conditionals for z, q, and (θ_1, ..., θ_K).

– Draw z_i for i = 1, ..., n from a multinomial distribution with size 1 (i.e., z_i = k means the kth element is 1 and the others are 0), with category-k probability

p(z_i = k | y, q, θ) = q_kf_k(y_i) / Σ_{k'=1}^K q_{k'}f_{k'}(y_i) ∝ q_kf(y_i | θ_k).

– Draw q from p(q | y, θ, z) = Dirichlet(α + n), where n_k = Σ_{i=1}^n 1{z_i = k}.

– Draw θ_k for k = 1, ..., K from p(θ_k | y, q, z) ∝ π(θ_k) f(y_k | θ_k), where y_k = {y_i : z_i = k}. This may require a Metropolis step if conjugate priors are not used.

Example (Galaxies). Measurements taken from 82 galaxies in Corona Borealis region of space. Consider velocity in km/sec for each galaxy: Galaxies with a higher velocity are farther from earth.

• Model $y$ as a mixture of normal densities: for $i = 1, \dots, n$,

$$p\left(y_i \mid q, \{\mu_k, \sigma_k^2\}_{k=1}^K\right) = \sum_{k=1}^K q_k\, \text{Normal}(y_i \mid \mu_k, \sigma_k^2), \quad \text{iid}.$$

• Use a uniform Dirichlet prior for $q$: $q \sim \text{Dirichlet}(1, \dots, 1)$.

• Use a mildly informative normal-inverse gamma prior for each $(\mu_k, \sigma_k^2)$: $\mu_k \stackrel{iid}{\sim} \text{Normal}(20000, 10^8)$ and $\sigma_k^2 \stackrel{iid}{\sim} \text{IG}(2, 10^8)$.

• For K = 4 clusters, Gibbs sampling initialization is:

– Draw $(\mu_k, \sigma_k^2)$ from the prior for $k = 1, \dots, 4$.

– Set qk = 1/4 for k = 1,..., 4.

• The full conditional for each $(\mu_k, \sigma_k^2)$ can be drawn from a normal-inverse gamma posterior.
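Explicitly, with $\mu_k \sim \text{Normal}(\mu_0, \tau^2)$ and $\sigma_k^2 \sim \text{IG}(a, b)$, the updates used in the code (writing $\bar{y}_k$ for the mean of cluster $k$'s observations, matching a_post, b_post, post_mu_mean, and post_mu_var below) are:

```latex
\sigma_k^2 \mid y, z \sim \mathrm{IG}\!\left(a + \tfrac{n_k}{2} - \tfrac{1}{2},\;
    b + \tfrac{1}{2}\sum_{i:\, z_i = k}(y_i - \bar{y}_k)^2\right), \\
\mu_k \mid \sigma_k^2, y, z \sim \mathrm{Normal}\!\left(
    \frac{\sigma_k^2 \mu_0 + \tau^2 \sum_{i:\, z_i = k} y_i}{n_k \tau^2 + \sigma_k^2},\;
    \frac{\sigma_k^2 \tau^2}{n_k \tau^2 + \sigma_k^2}\right).
```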

• Run Gibbs sampling for T = 50000 iterations.

library(gtools)  ## load package 'gtools' for the Dirichlet distribution
library(MASS)    ## load galaxy data from package 'MASS'
y = galaxies
n = length(y); T = 50000; BurnIn = 2000; N = T - BurnIn
K = 4  # number of mixture components
probs = matrix(nrow = K, ncol = n)  # prob. for each obs.
draws_q = matrix(nrow = T, ncol = K)
draws_mu = matrix(nrow = T, ncol = K)
draws_sigma2 = matrix(nrow = T, ncol = K)
x = seq(from = 5000, to = 40000, length.out = 200)  ### x values to compute density
draws_dens = matrix(nrow = T, ncol = 200)
Ds = rep(0, T)
## Prior for mu
mu0 = 20000; tau2 = 10^8
## Prior for sigma2
a = 2; b = 10^8
## Prior for q
alpha = rep(1, K)

## Initialize
mu = rnorm(K, mu0, sqrt(tau2))
sigma2 = 1/rgamma(K, a, b)
q = rep(1/K, K)
Z = rep(1, n)

## Run Gibbs sampler
set.seed(123)
for(t in 1:T){
  ### Generate component indicators Z
  for(k in 1:K) probs[k,] = q[k]*dnorm(y, mu[k], sqrt(sigma2[k]))
  for(i in 1:n) Z[i] = which(rmultinom(1, size = 1, probs[,i]) == 1)  # multinomial draw

  ### Generate q
  nk = c()
  for(k in 1:K) nk[k] = sum(Z == k)
  q = rdirichlet(1, alpha + nk)

  ### Generate mu_k, sigma2_k from normal-inverse-gamma posterior
  for(k in 1:K){
    a_post = a + nk[k]/2 - 1/2
    b_post = b + (1/2)*sum((y[Z==k] - mean(y[Z==k]))^2)
    sigma2[k] = 1/rgamma(1, a_post, b_post)
    post_mu_mean = (sigma2[k]*mu0 + tau2*sum(y[Z==k]))/(nk[k]*tau2 + sigma2[k])
    post_mu_var = sigma2[k]*tau2/(nk[k]*tau2 + sigma2[k])
    mu[k] = rnorm(1, post_mu_mean, sqrt(post_mu_var))
  }
  draws_q[t,] = q
  draws_mu[t,] = mu
  draws_sigma2[t,] = sigma2
  ### Compute density over grid
  Dens = rep(0, 200)
  for(k in 1:K) Dens = Dens + q[k]*dnorm(x, mu[k], sqrt(sigma2[k]))
  draws_dens[t,] = Dens
}

[Figure: histogram of y, velocity (km/sec), and trace plots of draws_mu over the final iterations.]

for(t in 1:T){  # fix label switching: reorder components by mu
  Order = order(draws_mu[t,])
  draws_q[t,] = draws_q[t, Order]
  draws_mu[t,] = draws_mu[t, Order]
  draws_sigma2[t,] = draws_sigma2[t, Order]
}

## Mean values:
MeanMu = colMeans(draws_mu)
MeanQ = colMeans(draws_q)
Mean_Sigma_2 = colMeans(draws_sigma2)
## Density estimate
MeanDens = colMeans(draws_dens)
cat(MeanMu, "\n", MeanQ, "\n", Mean_Sigma_2, "\n")

## 12641 18381 22017 26931
## 0.09154 0.281 0.4643 0.1631
## 67452042 60855936 43571728 73814521

[Figure: trace plots of draws_mu and draws_sigma2 over iterations 49500–50000.]

[Figure: trace plot of draws_q over iterations 49500–50000, and the posterior mean density estimate over velocity (km/sec) overlaid on the histogram of y.]

Choosing K

• Number of components (i.e., clusters) K can be selected using standard model selection tools.

• DIC, BIC, cross-validation, etc.

• Alternatively, put a prior on K.

• Or, choose K large and use small values for α: some components will have negligible probability.

• Letting $K \to \infty$ with $\alpha_k = c/K$ for all $k$ gives a Dirichlet process.
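The "large K, small α" effect can be seen directly by simulation: symmetric Dirichlet draws with $\alpha_k = c/K$ concentrate almost all mass on a few components. A Python sketch (K, c, and the 5% threshold are arbitrary illustrative choices):

```python
import random

def rdirichlet(alpha, rng):
    # Dirichlet draw via normalized independent Gamma(alpha_k, 1) variates
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [v / s for v in g]

rng = random.Random(42)
K, c = 50, 4.0
q = rdirichlet([c / K] * K, rng)   # symmetric Dirichlet with alpha_k = c/K

# Most components get negligible probability; a handful dominate
heavy = sum(1 for v in q if v > 0.05)
print(heavy, "of", K, "components carry more than 5% of the mass")
```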
