
PUBH 8442, Spring 2016 Eric Lock

Yanghao Wang

May 12, 2016

Contents

1 Prior and Posterior
   Maximum Likelihood Estimation (MLE); Prior and Posterior; Beta-Binomial Model; Conjugate Priors; Normal-Normal Model; Prior and Posterior Predictive; Normal-Normal Predictives; Non-Informative Priors; Improper Priors; Jeffreys Priors; Elicited Priors

2 Estimation, and Decision Theory
   Point Estimation; Loss Function; Decision Rules; Frequentist Risk; Minimax Rules; Bayes Risk; Bayes Decision Rules

3 The Bias-Variance Tradeoff
   Unbiased rules; Bias-Variance Trade-Off

4 Interval Estimation
   Credible sets; Decision Theory for Credible Sets; Normal Model with Jeffreys Priors

5 Decisions and Hypothesis Testing
   Frequentist Hypothesis Testing; Bayesian Hypothesis Testing; Hypothesis Decision; Type I error; Bayes Factors

6 Model Comparison
   Multiple Hypothesis/Models; Model Choice; Bayes Factors for Model Comparison; Bayesian Information Criterion; Partial Bayes Factors; Intrinsic Bayes Factors; Fractional Bayes Factors; Cross-Validated Likelihood; Bayesian Residuals; Bayesian P-Values

7 Hierarchical Model
   Bayesian Hierarchical Models; Hierarchical Normal Model; The Normal-Gamma Model; Conjugate Normal with both µ and σ² Unknown

8 Bayesian Linear Models
   Bayesian Framework with Uninformative Priors; Bayesian Framework with Normal-Inverse Gamma Prior

9 Empirical Bayes Approaches
   The Empirical Bayes Approach

10 Direct Sampling
   Direct Monte Carlo integration; Direct Sampling for Multivariate Density

11 Importance Sampling
   Importance Sampling; Density Approximation

12 Rejection Sampling
   Rejection Sampling

13 Metropolis-Hastings Sampling
   Metropolis-Hastings Sampling; Choice of Proposal Density

14 Gibbs Sampling
   Gibbs Sampling

15 More on MCMC
   Combining MCMC methods

16 Deviance Information Criterion
   BIC and AIC

17 Bayesian Model Averaging

18 Mixture Models
   Finite Mixture Models; Bayesian Finite …

1 Prior and Posterior

Assume a sampling model for data y = (y₁, . . . , yₙ), specified by a parameter θ (possibly unknown); the model is often represented as a probability density f(y | θ).

Example. Gaussian model specified by θ = (µ, σ²) for independent yᵢ's:

f(y | θ) = (1/(σ√(2π)))ⁿ exp[ −Σᵢ₌₁ⁿ (yᵢ − µ)² / (2σ²) ].

The probability density f (y | θ) is often called the likelihood, with the notation L (θ; y). Use this notation when making inferences about θ.

Maximum Likelihood Estimation (MLE)

Choose θ to maximize likelihood

θ̂ = arg maxθ L(θ; y).

The MLE approach assumes θ is fixed (though unknown), and it has been criticized for overfitting.

Prior and Posterior

Alternatively, θ can be a random variable. Let π(θ) be a probability density for θ, or π(θ | η) be a probability density for θ where η is a hyperparameter. The marginal density for y is

p(y) = ∫ f(y | θ) π(θ) dθ,

i.e., the marginal density for y is a density for y averaging over θ.

• Bayes’ rule for continuous random variable

p(θ | y) = p(y, θ)/p(y) = f(y | θ) π(θ) / ∫ f(y | θ) π(θ) dθ,

where π (θ) is the prior (distribution for θ) and p (θ | y) is the posterior (distribution for θ).

– π(θ) is the density for θ without observed data, while p(θ | y) is the density for θ with observed data.
– Given data y, ∫ f(y | θ) π(θ) dθ is a normalizing constant (a scalar) that makes the posterior p(θ | y) integrate to 1.

• Bayes’ rule for discrete random variable

– θ is continuously distributed while y is discrete:

p(θ | y = k) = P(y = k | θ) π(θ) / ∫ P(y = k | θ) π(θ) dθ,

where y is discrete and P(y | θ) is the probability mass function (PMF) for y given θ.

– θ is discretely distributed while y is continuous:

P(θ = k | y) = f(y | θ = k) P(θ = k) / Σ_{k∈Θ} f(y | θ = k) P(θ = k),

where Θ is the set of all possible values of θ.

Example (Gender Proneness). The Guptas (of CNN fame) have three daughters. What is the probability that their next child will be a daughter?

MLE Solution Let θ be the probability of a daughter for the Guptas. The binomial likelihood for the first 3 births, with y = # daughters, is

P(y = 3 | θ) = θ³,

which implies θ̂ = 1 (an overfitting problem).

Bayes Solution Assume the prior distribution for θ is Uniform (0, 1) and hence π (θ) = 1 ∀θ ∈ [0, 1].

p(θ | y = 3) = P(y = 3 | θ) π(θ) / ∫ P(y = 3 | θ) π(θ) dθ = (θ³ × 1)/(1/4) = 4θ³,

where ∫ P(y = 3 | θ) π(θ) dθ = [θ⁴/4] evaluated from θ = 0 to 1, which equals 1/4. Estimate θ by the posterior expectation:

θ̂ = E[θ | y = 3] = ∫₀¹ θ p(θ | y = 3) dθ = [(4/5)θ⁵] from 0 to 1 = 4/5.

Beta-Binomial Model

Assume θ ∼ Beta (a, b):

π(θ) = θ^(a−1) (1 − θ)^(b−1) / B(a, b),

where B(·, ·) is the beta function. If Y ∼ Binomial(n, θ), where n is the sample size and θ is the success probability, then

p (θ | y = k) = Beta (a + k, b + n − k) .

Proof. By Bayes’ rule, the denominator is a constant, so the posterior p(θ | y = k) is proportional to the numerator P(y = k | θ) π(θ). Thus,

p(θ | y = k) ∝ P(y = k | θ) π(θ) = C(n, k) θ^k (1 − θ)^(n−k) · θ^(a−1) (1 − θ)^(b−1) / B(a, b)
∝ θ^(a+k−1) (1 − θ)^(b+n−k−1) ∝ Beta(a + k, b + n − k).

Note. The Beta distribution is a distribution over the belief about a probability; i.e., θ ∼ Beta(a, b) has θ ∈ [0, 1] and E[θ] = a/(a + b). Thus, the shape parameters a and b determine the expectation of θ. In addition, Beta(1, 1) is the uniform distribution U[0, 1].

Example (Gender Proneness (cont.)). Based on data from many families, a more realistic prior for θ is π (θ) = Beta (39, 40), where 39 is the number of girls and 40 is the number of boys. For the Gupta family, p (θ | y = 3) = Beta (42, 40) since n = 3 (sample size in the Gupta’s) and k = 3 (observed number of girls). The expected value of Beta (a, b) is a/ (a + b):

• Prior estimate: θ̂ = 39/(39 + 40) ≈ 0.494
• Posterior estimate: θ̂ = 42/(42 + 40) ≈ 0.512
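The update in this example can be checked with a short script (a sketch; the helper name `beta_binomial_update` is ours, the numbers come from the example above):

```python
# Beta-binomial conjugate update for the Gender Proneness example.
def beta_binomial_update(a, b, k, n):
    """Posterior is Beta(a + k, b + n - k) for k successes in n trials."""
    return a + k, b + n - k

# Prior Beta(39, 40); the Guptas observed k = 3 girls in n = 3 births.
a_post, b_post = beta_binomial_update(39, 40, k=3, n=3)
prior_mean = 39 / (39 + 40)
post_mean = a_post / (a_post + b_post)
print(a_post, b_post)                             # 42 40
print(round(prior_mean, 3), round(post_mean, 3))  # 0.494 0.512
```

Adaptive updating (one birth at a time) gives the same result, since the Beta parameters simply accumulate counts.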

Conjugate Priors

A prior is conjugate for a given likelihood if its posterior belongs to the same distribution family. For example, a beta prior is conjugate to a binomial random variable (i.e., if y is generated by a binomial process and the prior π(θ) is a Beta distribution, then the posterior p(θ | y) is also a Beta distribution).

• Conjugate priors facilitate computation of posteriors.

– Particularly useful when updating the posterior adaptively. E.g., given π (θ) = Beta (39, 40), after one girl, the posterior is Beta (40, 40). Then after another girl, the posterior is updated to Beta (41, 40).

• Conjugacy does not guarantee accuracy (it has no profound theoretical justification).

• Common conjugate families (credit: John D. Cook)

Example (Poisson-Gamma Model). Let y ∼ Poisson(θ) and θ ∼ Gamma(α, β): π(θ) ∝ θ^(α−1) e^(−βθ). Then

p(θ | y) ∝ p(y | θ) π(θ) ∝ θ^y e^(−θ) · θ^(α−1) e^(−βθ) = θ^(α+y−1) e^(−(β+1)θ) ≡ Gamma(α + y, β + 1).

Normal-Normal Model

Assume y ∼ Normal(µ, σ²) with σ² known. If π(µ) = Normal(µ₀, τ²), then

p(µ | y) = Normal( (σ²µ₀ + τ²y)/(σ² + τ²), σ²τ²/(σ² + τ²) ).

Proof. Note y ∼ Normal(µ, σ²) implies

p(y | µ) = 1/(σ√(2π)) exp[ −(y − µ)²/(2σ²) ].

With π(µ) = Normal(µ₀, τ²), we have

p(µ | y) ∝ p(y | µ) π(µ) = (1/(σ√(2π))) (1/(τ√(2π))) exp[ −(y − µ)²/(2σ²) − (µ − µ₀)²/(2τ²) ]
∝ exp[ −(τ²(y − µ)² + σ²(µ − µ₀)²)/(2σ²τ²) ]
∝ exp[ −(µ − (σ²µ₀ + τ²y)/(σ² + τ²))² / (2σ²τ²/(τ² + σ²)) ],   (∗)

where the third line follows because τ²(y − µ)² + σ²(µ − µ₀)² is

τ²(y − µ)² + σ²(µ − µ₀)² = (τ² + σ²)µ² − 2(τ²y + σ²µ₀)µ + τ²y² + σ²µ₀²
= (τ² + σ²)[ µ² − 2((σ²µ₀ + τ²y)/(σ² + τ²))µ + (τ²y² + σ²µ₀²)/(σ² + τ²) ]
= (τ² + σ²)[ (µ − (σ²µ₀ + τ²y)/(σ² + τ²))² − ((σ²µ₀ + τ²y)/(σ² + τ²))² + (τ²y² + σ²µ₀²)/(σ² + τ²) ],

and the fourth line follows because (given y, µ₀, σ², and τ²) the constants ((σ²µ₀ + τ²y)/(σ² + τ²))² and (τ²y² + σ²µ₀²)/(σ² + τ²) are dropped for simplicity. Equation (∗) implies

p(µ | y) = Normal( (σ²µ₀ + τ²y)/(σ² + τ²), σ²τ²/(σ² + τ²) ).

Assume y = (y₁, . . . , yₙ) are i.i.d. with yᵢ ∼ Normal(µ, σ²) and σ² known. If π(µ) = Normal(µ₀, τ²), then

p(µ | y) = Normal( (σ²µ₀ + nτ²ȳ)/(σ² + nτ²), σ²τ²/(σ² + nτ²) ),

where ȳ = Σᵢ yᵢ/n.

Example (Coke Bottles). Coca-Cola bottling machines fill with known variance 0.05 oz. Each machine is calibrated to fill mean capacity µ, so bottles are filled with Gaussian error: Normal(µ, 0.05). Historical data show machine calibrations are approximately Normal(12, 0.01). Five randomly selected bottles from a given machine have sample mean ȳ = 11.88. What is the posterior for the calibration of this machine?

Solution: Given yᵢ ∼ Normal(µ, 0.05) and µ ∼ Normal(12, 0.01), we have σ² = 0.05, µ₀ = 12, τ² = 0.01. With n = 5 and ȳ = 11.88, applying the normal-normal model,

p(µ | y) = Normal( (0.05 × 12 + 5 × 0.01 × 11.88)/(0.05 + 5 × 0.01), (0.05 × 0.01)/(0.05 + 5 × 0.01) ) = Normal(11.94, 0.005).

Prior and Posterior Predictive

Often we are not interested in making inference on θ but in our best guess for the distribution of yi.

• Prior Predictive is the marginal distribution of an observation given the prior:

p(yᵢ) = ∫ f(yᵢ | θ) π(θ) dθ.

With the prior information about the distribution of yᵢ, f(yᵢ | θ), we average over all possible θ's to obtain the prior predictive distribution for yᵢ.

• Posterior Predictive is the distribution for a future observation given the data so far:

p(yₙ₊₁ | y) = ∫ f(yₙ₊₁ | θ) p(θ | y) dθ.

With data y ∈ ℝⁿ, we replace the prior distribution for θ, π(θ), by the posterior p(θ | y), then average over all possible θ's to obtain the posterior predictive distribution for yₙ₊₁.

Normal-Normal Predictives

For the normal-normal model, show that the prior predictive is p(yᵢ) = Normal(µ₀, τ² + σ²) and the posterior predictive is p(yₙ₊₁ | y) = Normal( (σ²µ₀ + nτ²ȳ)/(σ² + nτ²), σ²τ²/(σ² + nτ²) + σ² ).

Proof. By definition of p(yᵢ), we have p(yᵢ) = ∫ f(yᵢ | µ) π(µ) dµ

= ∫ 1/(2πστ) exp[ −(yᵢ − µ)²/(2σ²) − (µ − µ₀)²/(2τ²) ] dµ
= ∫ 1/(2πστ) exp[ −(τ²(yᵢ − µ)² + σ²(µ − µ₀)²)/(2σ²τ²) ] dµ
= 1/(√(2π)√(τ² + σ²)) exp[ −(yᵢ − µ₀)²/(2(τ² + σ²)) ] × ∫ 1/(√(2π) στ/√(τ² + σ²)) exp[ −(µ − (τ²yᵢ + σ²µ₀)/(τ² + σ²))² / (2σ²τ²/(τ² + σ²)) ] dµ,

where the last integral equals 1, since it integrates a Normal( (τ²yᵢ + σ²µ₀)/(τ² + σ²), σ²τ²/(τ² + σ²) ) density. The third line follows by completing the square:

τ²(yᵢ − µ)² + σ²(µ − µ₀)² = (τ² + σ²)µ² − 2(τ²yᵢ + σ²µ₀)µ + τ²yᵢ² + σ²µ₀²
= (τ² + σ²)(µ − (τ²yᵢ + σ²µ₀)/(τ² + σ²))² − (τ²yᵢ + σ²µ₀)²/(τ² + σ²) + τ²yᵢ² + σ²µ₀²
= (τ² + σ²)(µ − (τ²yᵢ + σ²µ₀)/(τ² + σ²))² + τ²σ²(yᵢ − µ₀)²/(τ² + σ²),

where the last step expands (τ²yᵢ + σ²µ₀)² and cancels the τ⁴yᵢ² and σ⁴µ₀² terms. Thus,

p(yᵢ) = ∫ f(yᵢ | µ) π(µ) dµ = 1/(√(2π)√(τ² + σ²)) exp[ −(yᵢ − µ₀)²/(2(τ² + σ²)) ],

that is, yᵢ ∼ Normal(µ₀, τ² + σ²).

By definition of p(yₙ₊₁ | y), we have p(yₙ₊₁ | y) = ∫ p(yₙ₊₁ | µ) p(µ | y) dµ, where

p(yₙ₊₁ | µ) = Normal(µ, σ²) and p(µ | y) = Normal( (σ²µ₀ + nτ²ȳ)/(σ² + nτ²), σ²τ²/(σ² + nτ²) ).

Besides integrating out µ as we did for p(yᵢ), there is an alternative way. Let yₙ₊₁ = (yₙ₊₁ − µ) + µ. Given y, (yₙ₊₁ − µ) ∼ Normal(0, σ²) and µ ∼ p(µ | y) are independently distributed, so their sum is normal with the means and variances added:

yₙ₊₁ | y ∼ Normal( (σ²µ₀ + nτ²ȳ)/(σ² + nτ²), σ² + σ²τ²/(σ² + nτ²) ).

Example (Coke Bottles (cont.)). The predictive for a single bottle from a given machine is

Normal(12, 0.06),

where µ₀ = 12 and τ² + σ² = 0.01 + 0.05 = 0.06. (There is no information from observations yet, so this is the prior predictive for a single yᵢ.) The predictive for the sixth bottle after the five observed (n = 5 and ȳ = 11.88) is

Normal(11.94, 0.055),

where σ²τ²/(σ² + nτ²) = 0.005 and σ² = 0.05.
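Both predictive distributions follow directly from the formulas above; a minimal sketch with the same Coke numbers:

```python
# Prior and posterior predictive for the normal-normal model.
def prior_predictive(mu0, tau2, sigma2):
    """Marginal distribution of one observation: Normal(mu0, tau2 + sigma2)."""
    return mu0, tau2 + sigma2

def posterior_predictive(mu0, tau2, sigma2, ybar, n):
    """Distribution of y_{n+1} given data: posterior of mu plus sampling noise."""
    mean = (sigma2 * mu0 + n * tau2 * ybar) / (sigma2 + n * tau2)
    var = sigma2 * tau2 / (sigma2 + n * tau2) + sigma2
    return mean, var

prior_p = prior_predictive(12, 0.01, 0.05)               # ~ Normal(12, 0.06)
post_p = posterior_predictive(12, 0.01, 0.05, 11.88, 5)  # ~ Normal(11.94, 0.055)
print(prior_p, post_p)
```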

Non-Informative Priors

A non-informative prior is intended to convey no prior belief. That is, let the data alone drive inference.

Example. θ is an unknown parameter; a uniform distribution gives a non-informative prior for θ:

• Discrete random variable: For Θ = {θ₁, . . . , θ_K}, P(θ = θₖ) = 1/K.

• Continuous random variable: For Θ = [a, b], π(θ) = 1/(b − a) for θ ∈ [a, b].

Note (Connection between Prior and Posterior). If π(θ) = c ∀θ ∈ Θ, then p(θ | y) ∝ f(y | θ), and so the mode of p(θ | y) equals the MLE for θ.

Improper Priors

A “prior” that is not a true probability distribution is improper, and an improper prior can result in an improper posterior. Improper priors should be avoided if possible; however, an improper prior can give reasonable results, especially if the posterior is proper.

Example. π(θ) = 1 for all θ ∈ ℝ is an improper prior, since ∫ π(θ) dθ = ∞. (A proper prior must integrate to 1.) Suppose ∫ f(y | θ) π(θ) dθ < ∞; then

p(θ | y) = f(y | θ) π(θ) / ∫ f(y | θ) π(θ) dθ = f(y | θ) / ∫ f(y | θ) dθ,

which is still proper.

Example (Venusian Surface). A spacecraft lands on the planet Venus, and captures a small surface sample. The radioactivity of the sample is measured as the frequency of particle emission. y = 60 particle emissions are observed in one minute. Infer θ, the average particle emissions per minute. Suppose y ∼ Poisson (θ) with a uniform prior π (θ) = c for all θ > 0 (note π (θ) is improper).

p(y = k | θ) = e^(−θ) θ^k / k!  for k = 0, 1, 2, . . .

Since ∫ p(y = k | θ) dθ < ∞,

p(θ | y = k) ∝ p(y = k | θ) ∝ e^(−θ) θ^k ∝ Gamma(k + 1, 1).

An alternative prior distribution for radioactivity by emission frequency is often given on the log scale γ = log(θ), where θ ∼ U(0, ∞). A uniform prior for θ implies that the prior for γ (= log(θ)) is π(γ) ∝ e^γ ∀γ, which is no longer a uniform distribution.

Note (Transformation from a Known Density Function).

π(γ) = |J| π(θ(γ)),

where θ(γ) means θ written as a function of γ (θ = e^γ), |J| is the determinant of the Jacobian J = dθ/dγ = e^γ, and π(θ(γ)) = 1 since π(θ) = 1 for all θ = e^γ > 0.

Jeffreys Priors

For a one-parameter model f(y | θ), the Fisher information is

I(θ) = −E_y|θ[ d²/dθ² log f(y | θ) ],

an average over observed data y given θ; i.e., the RHS is a function of θ rather than y.

Note. The Fisher information is a way of measuring the amount of information that an observable random variable Y carries about an unknown parameter θ of a distribution f(· | θ) that models Y.

• An equivalent definition of the Fisher information (for θ) is

I(θ) = E_y|θ[ (d/dθ log f(y | θ))² ] = ∫ (d/dθ log f(y | θ))² f(y | θ) dy.

• Intuition for Fisher information: (i) The occurrence of a less frequent event (an event with small probability) brings much information about θ. (ii) If θ is the true parameter, the expected log-likelihood achieves its maximum there, so the derivative ∂ log f(· | θ)/∂θ should be close to zero; the second derivative ∂² log f(· | θ)/∂θ² then measures how sharply the likelihood peaks.

• Based on the facts ∫ (∂f(· | θ)/∂θ) dy = 0 and E[∂ log f(· | θ)/∂θ] = 0, there are three equivalent formulas for the Fisher information.

Let θ be an unknown parameter in the distribution model f(x | θ) of a random variable X, with corresponding likelihood function L(θ; x), where x is a collection of observations. The Jeffreys prior for θ is proportional to the square root of the Fisher information I(θ):

π(θ) ∝ [I(θ)]^(1/2).

Jeffreys priors are invariant under all one-to-one (bijective) transformations: if γ = h(θ) and the prior π(θ) is Jeffreys for θ, then the induced prior for γ is Jeffreys for γ.

Claim. If I(·) is the Fisher information, show

[I(γ)]^(1/2) = [I(θ(γ))]^(1/2) |dθ/dγ|.

For multi-parameter θ = (θ₁, . . . , θ_K), the Fisher information is a matrix with (i, j) element

I_ij(θ) = −E_y|θ[ ∂²/(∂θᵢ∂θⱼ) log f(y | θ) ].

The Jeffreys prior for θ is

π(θ) ∝ [det I(θ)]^(1/2).

Example. For a Poisson(θ) model, the Jeffreys prior for θ is

π(θ) ∝ 1/√θ.

Proof. Note that x ∼ Poisson(θ) implies the mean is E_x|θ[x] = θ and the likelihood function is

f(x | θ) = θ^x e^(−θ) / x!.

To construct the Fisher information for θ, first take the logarithm of f(x | θ):

log f(x | θ) = x log θ − θ − log x!.

Then,

d/dθ log f(x | θ) = x/θ − 1  and  d²/dθ² log f(x | θ) = −x/θ².

Therefore,

I(θ) = −E_x|θ[−x/θ²] = E_x|θ[x]/θ² = θ/θ² = 1/θ  ⟹  π(θ) ∝ √I(θ) = 1/√θ.
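The result I(θ) = 1/θ can be spot-checked by Monte Carlo, since −d²/dθ² log f(x | θ) = x/θ² averages to E[x]/θ² = 1/θ (a sketch; θ = 2 and the Knuth sampler are our choices, not from the notes):

```python
import math
import random

random.seed(0)

def poisson_draw(lam):
    """One Poisson(lam) draw via Knuth's product method (fine for small lam)."""
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

theta = 2.0
n = 200_000
# -d^2/dtheta^2 log f(x | theta) = x / theta^2; its average estimates I(theta).
fisher_mc = sum(poisson_draw(theta) for _ in range(n)) / n / theta**2
print(fisher_mc)  # close to 1 / theta = 0.5
```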

Example. For a Binomial(n, θ) model, the Jeffreys prior for θ is

π (θ) = Beta (0.5, 0.5) .

Proof. Note that x ∼ Binomial(n, θ) implies the mean is nθ and the likelihood function is

f(x | θ) = C(n, x) θ^x (1 − θ)^(n−x).

To construct the Fisher information for θ, first take the logarithm of f, log C(n, x) + x log θ + (n − x) log(1 − θ), and then differentiate w.r.t. θ:

d/dθ log f(x | θ) = x/θ − (n − x)/(1 − θ)  and  d²/dθ² log f(x | θ) = −x/θ² − (n − x)/(1 − θ)².

Thus, the Fisher information for θ is

I(θ) = −E_x|θ[ −x/θ² − (n − x)/(1 − θ)² ] = nθ/θ² + (n − nθ)/(1 − θ)² = n/θ + n/(1 − θ).

Hence,

π(θ) ∝ ( n/θ + n/(1 − θ) )^(1/2) ∝ ( (1 − θ + θ)/(θ(1 − θ)) )^(1/2) = θ^(−1/2) (1 − θ)^(−1/2).   (∗)

Recall the pdf of Beta(a, b) is

f(x | a, b) = x^(a−1) (1 − x)^(b−1) / B(a, b).

Therefore, equation ∗ implies π (θ) = Beta (0.5, 0.5).

Elicited Priors

We often want an expert to elicit the prior for a given experiment. However, there are some problems:

• Human intuition generally approximates uncertainty poorly

• Non-statistician experts have little knowledge of probability distributions

One option is to have the expert create a probability histogram, or to specify certain desirable properties (e.g., quantiles) and choose a known distribution to match those properties. Prior elicitation software is available at http://www.tonyohagan.co.uk/shelf/.

2 Estimation, and Decision Theory

Point Estimation

Let p(θ | y) be the posterior distribution. Then possible point estimates for θ are:

• Posterior mode: θ̂ = arg maxθ p(θ | y)

– Easy to compute because we only need to work with the numerator, i.e., p(θ | y) ∝ π(θ) f(y | θ);
– Sometimes called generalized maximum likelihood estimation; if π(θ) is flat (constant), it is exactly the MLE.

• Posterior expectation: θ̂ = E_θ|y[θ] = ∫ θ p(θ | y) dθ

– The posterior expectation uses the full posterior.

• Posterior median: θ̂ such that Pr(θ ≤ θ̂ | y) = ∫ from −∞ to θ̂ of p(θ | y) dθ = 1/2

– The posterior median is more robust to outliers in y and heavy posterior tails.

Loss Function

How do we choose which estimate is optimal for a given application? A loss function I(truth, a) gives the loss incurred for an action a given the (usually unknown) truth. If the truth is given by parameter θ, the posterior risk is

ρ(π, a) = E_θ|y I(θ, a) = ∫ I(θ, a) p(θ | y) dθ,

an average loss over the posterior for θ. For point estimation, the action a assigns an estimate θ̂ for the unknown parameter θ, so θ̂ contains the information of a and I(θ, a) can be written as I(θ, θ̂).

• Squared error loss: I(θ, θ̂) = (θ − θ̂)² (commonly used).

Claim. The posterior risk for squared error loss is minimized by the posterior expectation:

E_θ|y[θ] = arg min over θ̂ of E_θ|y(θ − θ̂)².

Proof. Let E_θ|y[θ] = µ. Then

E_θ|y(θ − θ̂)² = E_θ|y(θ − µ + µ − θ̂)²
= E_θ|y[ (θ − µ)² + 2(θ − µ)(µ − θ̂) + (µ − θ̂)² ]
= E_θ|y(θ − µ)² + 2(µ − θ̂) E_θ|y(θ − µ) + (µ − θ̂)²
= Var_θ|y(θ) + (µ − θ̂)² ≥ Var_θ|y(θ), with equality iff θ̂ = µ,

where the second step follows because (µ − θ̂) is constant under E_θ|y[·] (µ = E_θ|y[θ] and θ̂ is a constant estimate), and the last step follows because E_θ|y(θ − µ) = E_θ|y(θ) − µ = 0.

• Absolute loss: I(θ, θ̂) = |θ − θ̂|.

Claim. The posterior risk for absolute loss is minimized by the posterior median:

If θ* satisfies Pr(θ ≤ θ* | y) = ∫ from −∞ to θ* of p(θ | y) dθ = 1/2, then θ* = arg min over θ̂ of E_θ|y|θ − θ̂|.

Proof. Let P(θ̂) = Pr(θ ≤ θ̂ | y). Writing the posterior risk as

E_θ|y|θ − θ̂| = ∫ from −∞ to θ̂ of (θ̂ − θ) p(θ | y) dθ + ∫ from θ̂ to ∞ of (θ − θ̂) p(θ | y) dθ,

differentiating w.r.t. θ̂ gives P(θ̂) − [1 − P(θ̂)] = 2P(θ̂) − 1, which is zero when P(θ̂) = 1/2. That is, the minimizing θ̂ is the posterior median s.t. Pr(θ ≤ θ̂ | y) = 1/2.
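Both claims can be illustrated numerically: against draws from a skewed stand-in posterior, a grid search for the estimate minimizing each empirical risk should land near the mean and the median, respectively (a sketch; the Gamma(2, 1) "posterior" is an arbitrary choice of ours):

```python
import random

random.seed(1)
# Stand-in posterior draws: Gamma(shape 2, scale 1), right-skewed so mean != median.
draws = sorted(random.gammavariate(2.0, 1.0) for _ in range(5_000))
post_mean = sum(draws) / len(draws)   # around 2.0
post_median = draws[len(draws) // 2]  # around 1.68

grid = [i * 0.02 for i in range(401)]  # candidate estimates in [0, 8]
best_sq = min(grid, key=lambda t: sum((x - t) ** 2 for x in draws))
best_abs = min(grid, key=lambda t: sum(abs(x - t) for x in draws))
print(best_sq, best_abs)  # near the mean and the median, respectively
```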

Decision Rules

Set up

• Denote the space of allowable actions A (a ∈ A).

• Denote sample space (possible data observations) Y (y ∈ Y).

• A decision rule d ∈ D : Y → A is a rule for determining an action based on data.

– In the context of statistical inference, a decision rule is a rule for determining an estimate based on observed data.

Formal decision-theoretic framework:

• Prior distribution: π (θ), θ ∈ Θ

• Sampling distribution: f (y | θ)

• Allowable actions: a ∈ A (e.g., an action a is to assign a number or an estimate for θ)

• Decision rules: d ∈ D : Y → A (i.e., a mapping from observed data to an action)

• Loss function: I (θ, a)

Example. Recall that y = (y₁, . . . , yₙ) where the yᵢ are iid Normal(µ, σ²) and π(µ) = Normal(µ₀, τ²). Then

p(µ | y) = Normal( (σ²µ₀ + nτ²ȳ)/(σ² + nτ²), σ²τ²/(σ² + nτ²) ).

Consider decision for point estimate of µ:

• Prior distribution: π(µ) = Normal(µ₀, τ²), µ ∈ ℝ.

• Sampling distribution: yᵢ iid Normal(µ, σ²), y ∈ Y = ℝⁿ.

• Allowable actions: a ∈ A = ℝ.

• Decision rule: d(y) = (σ²µ₀ + nτ²ȳ)/(σ² + nτ²), where d ∈ D and d : Y → A is a mapping from y to a.

• Loss function: I(µ, a) = (µ − a)² or |µ − a|.

Example (Coke Bottles). Coke bottles are filled with calibration Normal(12, 0.01). Given a machine with calibration µ, bottles are filled with Normal(µ, 0.05). For n = 5 and ȳ = 11.88 oz, p(µ | y) = Normal(11.94, 0.005). The action a assigns the estimate µ̂ = 11.94; i.e., d(y) = a = µ̂. The posterior risk under squared error loss, the minimum over µ̂ of E_µ|y(µ − µ̂)² = Var_µ|y(µ), is the posterior variance, 0.005.

Note. The posterior risk for a specific loss is a conditional average, E_θ|y[I(θ, θ̂)], where d(y) = a = θ̂ means the decision rule maps the observed data y to an action a that assigns the value of the estimate θ̂.

Frequentist Risk

The frequentist risk of a decision rule d is

R(θ, d) = E_y|θ I(θ, d(y)) = ∫ I(θ, d(y)) f(y | θ) dy,

an average loss over y given θ. R(θ, d) does not depend on the prior π(θ). A rule d is inadmissible if

∃d∗ with (i) R (θ, d∗) ≤ R (θ, d) ∀θ ∈ Θ, and (ii) R (θ, d∗) < R (θ, d) for some θ ∈ Θ.

Implications: (i) some other rule is universally “better,” (ii) a rule that is not inadmissible is admissible, and (iii) admissible rules are NOT necessarily good.

Note (Admissible Rule). A rule d is admissible if

∄ d′ ∈ D s.t. (i) R(θ, d′) ≤ R(θ, d) ∀θ ∈ Θ, and (ii) R(θ, d′) < R(θ, d) for some θ ∈ Θ.

Equivalently, a rule d is admissible if, for every d′ ∈ D, R(θ, d′) ≤ R(θ, d) for all θ ∈ Θ implies R(θ, d′) = R(θ, d) for all θ ∈ Θ.

Example (Coke Bottles (cont.)). The (poor) rule d(y) = 5 is admissible because it is unbeatable when µ = 5. That is, there is no better d′(y) with R(µ, d′(y)) < R(µ, d(y)) = E_y|µ(µ − 5)² = 0 at µ = 5, where R(µ, d(y)) = E_y|µ[µ − d(y)]² is the frequentist risk under the squared loss function for decision rule d(y).

(i) Let d(y) = ȳ be the decision rule. The frequentist risk (under the squared loss function) is R(µ, ȳ) = E_y|µ(ȳ − µ)² = Var_y|µ(ȳ) = σ²/n = 0.05/5 = 0.01. Note that d(y) = ȳ does not depend on µ (due to known σ²), and neither does R(µ, d(y)) = R(µ, ȳ).

(ii) Let d(y) = E_µ|y[µ] be the decision rule, where the posterior mean is Bµ₀ + (1 − B)ȳ with B = σ²/(σ² + nτ²) = 0.05/(0.05 + 5 × 0.01) = 1/2. Then the frequentist risk is

R(µ, E_µ|y[µ]) = E_y|µ[ B²(µ − µ₀)² + 2B(1 − B)(µ − µ₀)(µ − ȳ) + (1 − B)²(µ − ȳ)² ]
= B²(µ − µ₀)² + 2B(1 − B)(µ − µ₀) E_y|µ(µ − ȳ) + (1 − B)² E_y|µ(µ − ȳ)²
= B²(µ − µ₀)² + (1 − B)² Var_y|µ(ȳ)
= 0.25(µ − 12)² + 0.25 × 0.01,

where the second line follows because E_y|µ(µ − µ₀)² = (µ − µ₀)² (µ is given and µ₀ is known), and the third line follows because E_y|µ[µ − ȳ] = 0 (i.e., E_y|µ[ȳ] = µ). In this case, R(µ, d(y)) = R(µ, E_µ|y[µ]) is a function of µ, so its value depends on the true µ.
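The risk formula for the posterior-mean rule can be spot-checked by simulating ȳ at a test value of µ (a sketch; µ = 12.2 is an arbitrary choice of ours):

```python
import random

random.seed(2)
mu, mu0 = 12.2, 12.0             # a test value of the true mu; the prior mean
sigma2, tau2, n = 0.05, 0.01, 5
B = sigma2 / (sigma2 + n * tau2)  # shrinkage factor, 0.5 here
sd_ybar = (sigma2 / n) ** 0.5

total = 0.0
reps = 100_000
for _ in range(reps):
    ybar = random.gauss(mu, sd_ybar)
    d = B * mu0 + (1 - B) * ybar  # posterior-mean rule
    total += (mu - d) ** 2
risk_mc = total / reps

risk_formula = B**2 * (mu - mu0) ** 2 + (1 - B) ** 2 * sigma2 / n
print(risk_mc, risk_formula)  # both near 0.0125
```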

Minimax Rules

A minimax rule d satisfies

sup over θ ∈ Θ of R(θ, d) ≤ sup over θ ∈ Θ of R(θ, d*), ∀d* ∈ D.

Choose d to minimize the maximum frequentist risk: prepare for “the worst case scenario.” Note. “Minimax” alone does not mean d is the best decision rule. A minimax rule d may NOT be unique, admissible, or favorable.

Bayes Risk

The Bayes risk for a given decision rule is

r(π, d) = E_θ E_y|θ[I(θ, d(y))] = ∫ R(θ, d) π(θ) dθ.

Equivalently,

r(π, d) = E_y E_θ|y[I(θ, d(y))] = ∫ ρ(π, d(y)) m(y) dy,

where m is the marginal distribution of y. Bayes risk is also called “preposterior risk”: the expected risk before obtaining any data.

Summary (Loss Function, Posterior, Frequentist, and Bayes Risk).

• Loss function I (θ, d (y)): Function of θ and y.

• Posterior risk ρ (π, d (y)): Function of y, averaged over θ

• Frequentist risk R (θ, d): Function of θ, averaged over y

• Bayes risk r (π, d): Averaged over both y and θ.

Bayes Decision Rules

A Bayes decision rule minimizes the Bayes risk: d* = arg min over d ∈ D of r(π, d). Usually d* is admissible.

Claim. If a Bayes rule is unique, it is admissible. That is, d* is unique ⟹ d* is admissible. Recall that p → q ≡ ¬p ∨ q ≡ ¬(p ∧ ¬q).

Proof. (Per contra) Suppose a Bayes rule d* is unique but inadmissible. Then ∃d s.t. ∀θ ∈ Θ,

R(θ, d) ≤ R(θ, d*) ⟹ ∫ R(θ, d) π(θ) dθ ≤ ∫ R(θ, d*) π(θ) dθ ⟹ r(π, d) ≤ r(π, d*),

which implies d* is not a unique Bayes rule (since d minimizes r(π, d) at least as well), a contradiction.

Example (Normal-Normal Model). For the normal-normal model with I(µ, µ̂) = (µ − µ̂)², we have shown that d(y) = arg min over d ∈ D of ρ(π, d(y)) is given by the posterior mean for any y, so it is a Bayes decision rule (by the equivalent definition r(π, d) = ∫ ρ(π, d(y)) m(y) dy). The Bayes risk is given by

r(π, d) = σ²τ²/(σ² + nτ²).

In the Coke bottling example, r(π, d) = 0.005 is the expected loss before checking any bottles, which is the same as the expected risk after collecting bottles (the posterior risk), because the posterior risk does not depend on y. (r(π, d) is an average over both y and θ.) The Bayes risk of d(y) = ȳ is

r(π, d) = ∫ R(µ, d) π(µ) dµ = ∫ (σ²/n) π(µ) dµ = (σ²/n) ∫ π(µ) dµ = σ²/n = 0.01.

3 The Bias-Variance Tradeoff

Unbiased rules

A rule d(y) is unbiased at θ if the risk under a given θ ∈ Θ is minimized at the truth:

E_y|θ[I(θ′, d(y))] ≥ E_y|θ[I(θ, d(y))] for all θ′, θ ∈ Θ.

Let I(θ, d(y)) = (θ − d(y))². Then

E_y|θ[I(θ, d(y))] = E_y|θ(θ − E_y|θ[d(y)] + E_y|θ[d(y)] − d(y))²
= (θ − E_y|θ[d(y)])² + E_y|θ(E_y|θ[d(y)] − d(y))²,

where the second term is Var_y|θ[d(y)], and the cross term vanishes because E_y|θ[d(y)] is a function of θ rather than y.¹ The form of d(y) can depend on the given θ, and hence Var_y|θ[d(y)] is determined by both the uncertainty of y and the form of d(y). The best we can do to minimize risk under a given θ is to set a decision rule d(y) s.t. E_y|θ[d(y)] = θ.

• Unbiased d(y) implies E_y|θ[d(y)] = θ.

• The bias of d(y) is Δ = θ − E_y|θ[d(y)].

Note. The definitions of admissible and unbiased decision rules are both related to the frequentist risk R(θ, d) = E_y|θ I(θ, d(y)): an admissible d cannot be uniformly beaten over d ∈ D, while an unbiased d minimizes R(θ′, d) at θ′ = θ for every θ ∈ Θ.

Example (Unbiased Estimator). Let X be the number of times an event occurs in an hour, X ∼ Poisson(λ). Using X, estimate Pr(A), where A represents “no event in two hours.”

• d(X) is an unbiased estimator for Pr(A) under squared loss if E_x|λ[d(x)] = e^(−2λ).

• It turns out that d is unbiased only if d(x) = (−1)^x, and d(x) < 0 is a ridiculous estimate for odd x.

Note that p(x | λ) = λ^x e^(−λ)/x! and Pr(X = 0 | λ) = e^(−λ). Hence Pr(A) = Pr(X = 0 | λ) × Pr(X = 0 | λ) = e^(−2λ), since events in disjoint hours occur independently.

e^(−2λ) = E_x|λ[d(x)] = Σ over x ≥ 0 of d(x) λ^x e^(−λ)/x!  ⟹  e^(−λ) = Σ over x ≥ 0 of d(x) λ^x/x!
⟹ Σ over x ≥ 0 of (−1)^x λ^x/x! = Σ over x ≥ 0 of d(x) λ^x/x!  (Taylor expansion of e^(−λ))
⟹ d(x) = (−1)^x.
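The identity E_x|λ[(−1)^x] = e^(−2λ) behind this example can be verified directly from the Poisson PMF (a truncated series; λ = 1.5 is an arbitrary choice of ours):

```python
import math

lam = 1.5
# E[(-1)^X] = sum_x (-1)^x lam^x e^{-lam} / x!; terms vanish quickly, so truncate.
expected = sum((-1) ** x * lam ** x * math.exp(-lam) / math.factorial(x)
               for x in range(80))
print(expected, math.exp(-2 * lam))  # both e^{-3}, about 0.0498
```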

Bias-Variance Trade-Off

Let y₁, . . . , yₙ be iid with mean µ and variance σ². Consider estimates for µ of the form

d_B(y) = Bµ₀ + (1 − B)ȳ,

where 0 ≤ B ≤ 1 is the “shrinkage factor.” This estimate has

• Variance (1 − B)²σ²/n:

Var_y|µ[d_B(y)] = Var_y|µ[d_B(y) − Bµ₀] = Var_y|µ[(1 − B)ȳ] = (1 − B)² Var_y|µ[ȳ] = (1 − B)² σ²/n.

¹ The expectation over y given θ takes every possible y into account, so the result does not depend on any particular y; the conditioning information θ remains, hence E_y|θ[·] gives a function of θ.

• Bias B(µ − µ₀):

µ − E_y|µ[d_B(y)] = µ − E_y|µ[Bµ₀ + (1 − B)ȳ] = B(µ − µ₀) + (1 − B)(µ − E_y|µ[ȳ]) = B(µ − µ₀).

The frequentist risk for d_B under squared loss is

R(µ, d_B) = E_y|µ[I(µ, d_B(y))] = B²(µ − µ₀)² + (1 − B)² E_y|µ(µ − ȳ)²
= B²(µ − µ₀)² [Bias²] + (1 − B)² σ²/n [Variance].

• B = 0, d (y) =y ¯ is unbiased but maximizes variance.

• B = 1, d (y) = µ0 has no variance (i.e., d (y) is a constant µ0 for all y) but maximizes bias.

• The ideal B depends on: σ² (larger σ² ⇒ B closer to 1), n (larger n ⇒ B closer to 0), and the distance of µ from µ₀, which might be measured by τ² (larger distance ⇒ B closer to 0).

Claim. Given a prior π for µ with mean µ₀ and variance τ², the rule d_B that minimizes the Bayes risk r(π, d_B) = ∫ R(µ, d_B) π(µ) dµ has

B = σ²/(σ² + nτ²).

Proof. Let d_B = Bµ₀ + (1 − B)ȳ. The Bayes risk is

r(π, d_B) = E_µ E_y|µ[µ − Bµ₀ − (1 − B)ȳ]² = E_µ[ B²(µ − µ₀)² + (1 − B)² E_y|µ(µ − ȳ)² ]
= B² E_µ(µ − µ₀)² + (1 − B)² E_µ[Var_y|µ(ȳ)] = B²τ² + (1 − B)² σ²/n.

The problem of minimizing r(π, d_B) over d_B is equivalent to minimizing over B with d_B = Bµ₀ + (1 − B)ȳ, which implies the FOC ∂r/∂B = 0; that is,

2Bτ² − 2(1 − B)σ²/n = 0 ⟹ nτ²B = σ²(1 − B) ⟹ B = σ²/(σ² + nτ²).
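A grid search over B reproduces the claim for the Coke numbers (σ² = 0.05, n = 5, τ² = 0.01, so the optimum is B = 0.5):

```python
sigma2, n, tau2 = 0.05, 5, 0.01

def bayes_risk(B):
    """r(pi, d_B) = B^2 tau^2 + (1 - B)^2 sigma^2 / n."""
    return B**2 * tau2 + (1 - B) ** 2 * sigma2 / n

grid = [i / 1000 for i in range(1001)]
B_hat = min(grid, key=bayes_risk)           # grid minimizer
B_star = sigma2 / (sigma2 + n * tau2)       # closed form from the claim
print(B_hat, B_star)  # both 0.5 here
```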

Note. Normal-normal model is a special case (the posterior mean is the Bayes decision rule).

4 Interval Estimation

Credible sets

A 100 × (1 − α)% credible set for θ is a subset C ⊆ Θ with

1 − α ≤ Pr(C | y) = ∫ over C of p(θ | y) dθ.

The probability that θ is in C is at least (1 − α). If θ ∈ ℝ and C = [a, b] for a, b ∈ ℝ, C may be called a credible interval or Bayesian confidence interval.

•A highest posterior density (HPD) includes only the “most likely” θ values:

C = {θ ∈ Θ: p (θ | y) ≥ K (α)}

where K (α) is the largest constant giving Pr (C | y) ≥ 1 − α.

– The smallest possible credible set. – May not be an interval for some α (if p (θ | y) has several local modes). – Visualized as horizontal line intersecting density curve.

• An equal tail (or percentile) interval excludes equal probability in the upper and lower tail.

– C = [a, b] where a = α/2 quantile and b = (1 − α/2) quantile of p (θ | y). – Symmetric – includes “the middle” of the posterior. – Invariant under monotone transformations. h i • Alternatively, construct interval so that point estimate θˆ is central: C = θˆ − m, θˆ + m .

– The interval is centered at MLE. (E.g., The MLE λˆ = 4 is centered in the following interval.) – Note MLE λˆ = 4 is the mode of the likelihood f (λ | y), rather than the mode of this blue curve p (λ | y) which is proportional to π (λ) f (λ | y).

Example (Emergency Room). An emergency room receives y = 4 collapsed-lung patients in one week. Model y as Poisson with weekly rate λ, where the improper Jeffreys prior is

π(λ) ∝ I(λ)^(1/2) = 1/√λ ∝ Gamma(0.5, 0).

Note that λ ∼ Gamma(α, β) means the PDF for λ is π(λ) ∝ λ^(α−1) e^(−βλ). Since a Gamma prior is conjugate to a Poisson random variable, p(λ | y = 4) = Gamma(4.5, 1). Create a 90% credible interval for λ:

Decision Theory for Credible Sets

“We do not view credible sets as having a clear decision-theoretic role, and are therefore leery of ‘optimality’ approaches to selection of a credible set.” – Jim Berger, 1985, in Statistical decision theory and Bayesian analysis.

• Action space could involve all subsets of parameter space A = {a : a ⊆ Θ}.

– For a univariate θ: A = {a = [u, v]: u, v ∈ R}.

• Possible loss function:

I (θ, d (y)) = 1{θ ∉ d(y)} + K · Volume (d (y))

where 1{·} is an indicator function and Volume (d (y)) = ∫_{θ ∈ d(y)} dθ is the volume of d (y).

– Balances the coverage and size of a credible set: the constant K is the penalty per unit size of the credible set.
– For univariate θ with d (y) = [a, b], Volume (d (y)) = b − a.

Example (Posterior Risk). Consider the posterior risk

ρ (π, d (y)) = E_{θ|y} [1{θ ∉ d(y)} + K · Volume (d (y))] = Pr (θ ∉ d (y) | y) + K · Volume (d (y)) .

The posterior risk is minimized by d (y) = {θ : p (θ | y) ≥ K}, a HPD interval.

Proof. By definition, the posterior risk is

ρ (π, d (y)) = E_{θ|y} [1{θ ∉ d(y)}] + K · Volume (d (y))   (the second term is not a function of θ)
= Pr (θ ∉ d (y) | y) + K · Volume (d (y))
= 1 − Pr (θ ∈ d (y) | y) + K ∫_{θ ∈ d(y)} dθ
= 1 + ∫_{θ ∈ d(y)} [K − p (θ | y)] dθ,

which is minimized by an HPD interval, since K ≤ p (θ | y) for all θ ∈ d (y) = {θ : p (θ | y) ≥ K}; i.e., the integrand K − p (θ | y) is non-positive exactly on this set.

Example (Emergency Room (cont.)). Decision theoretic framework:

• Prior distribution: π (λ) = Gamma (1/2, 0), λ ∈ R⁺.

• Sampling distribution: y ∼ Poisson (λ), y ∈ {0, 1, 2, . . . }

• Allowable actions: A = {a : a ⊆ R+}

• Decision rule: d (y) = {λ : p (λ | y) ≥ K} where K = 0.05.

• Loss function: I (λ, d (y)) = 1{λ ∉ d(y)} + 0.05 · Volume (d (y)).

Recall y = 4 and p (λ | y = 4) = Gamma (4.5, 1). Note K = 0.05 gives an HPD interval with α = 0.10. Then d (y = 4) = [1.21, 7.65] with volume 7.65 − 1.21 = 6.44. [Code] Volume plots for HPD and symmetric quantile intervals:
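The HPD endpoints can be reproduced numerically by finding where the posterior density crosses K = 0.05 on each side of the mode (a sketch under the stated model; the mode of Gamma(4.5, 1) is (4.5 − 1)/1 = 3.5):

```r
# HPD interval for lambda | y = 4 ~ Gamma(4.5, 1) with density cutoff K = 0.05:
# solve dgamma(lambda) = K once below and once above the posterior mode 3.5.
K <- 0.05
f <- function(l) dgamma(l, shape = 4.5, rate = 1) - K
lo <- uniroot(f, c(0.01, 3.5))$root
hi <- uniroot(f, c(3.5, 20))$root
round(c(lo, hi), 2)  # matches the interval [1.21, 7.65] in the notes
```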

• Consider the posterior predictive distribution of y2 given y1.

– y1 is observed number of collapsed lung patients for initial week

– y2 is the number of collapsed lung patients for the following week

Note (Gamma Distribution). Suppose random variable X follows a Gamma distribution with shape parameter α (or k = α) and rate parameter β (or scale parameter θ = 1/β). For any x ∈ (0, ∞),

f (x) = (β^α / Γ(α)) x^{α−1} e^{−βx} for X ∼ Gamma (α, β), or f (x) = (1 / (Γ(k) θ^k)) x^{k−1} e^{−x/θ} for X ∼ Gamma (k, θ).

Moreover, the mean is E [x] = α/β = kθ. Recall that under the Jeffreys prior for λ, π (λ) ∝ 1/√λ,

p (λ | y1) = Gamma (1/2 + y1, 1) = λ^{y1 − 1/2} e^{−λ} / Γ (y1 + 1/2) and p (y2 | λ) = Poisson (λ) = λ^{y2} e^{−λ} / y2! .

– Posterior predictive is

p (y2 | y1) = ∫ p (y2 | λ) p (λ | y1) dλ = ∫_0^∞ (λ^{y2} e^{−λ} / y2!) · (λ^{y1 − 1/2} e^{−λ} / Γ (y1 + 1/2)) dλ
= (1 / (y2! Γ (y1 + 1/2))) ∫_0^∞ λ^{y1 + y2 − 1/2} e^{−2λ} dλ   (a Gamma (y1 + y2 + 1/2, 2) kernel)
= Γ (y1 + y2 + 1/2) / (y2! Γ (y1 + 1/2) 2^{y1 + y2 + 1/2})
= C_{y2}^{y1 + y2 − 1/2} (1/2)^{y1 + 1/2} (1/2)^{y2} = NegativeBinomial (y1 + 1/2, 1/2) .

Note. (i) The Binomial distribution, X ∼ Binomial (n, p), is the distribution of the number of successes (x) in a fixed number n of independent Bernoulli trials. Hence the total number of trials is fixed at n. The probability mass function of the Binomial distribution is

Pr (X = k) = C_k^n p^k (1 − p)^{n−k} = (n! / (k! (n − k)!)) p^k (1 − p)^{n−k} .

(ii) The Negative Binomial distribution, X ∼ NegativeBinomial (r, p), is the distribution of the number of successes (x) until the r-th failure occurs, where each independent Bernoulli trial succeeds with probability p. Hence the total number of trials is not fixed. The probability mass function of the Negative Binomial distribution is

Pr (X = k) = [C_k^{k+r−1} p^k (1 − p)^{r−1}] × (1 − p) = C_k^{k+r−1} p^k (1 − p)^r
= ((k + r − 1)! / (k! (r − 1)!)) p^k (1 − p)^r = (Γ (k + r) / (Γ (r) k!)) p^k (1 − p)^r ,

where the bracketed factor covers the first k + r − 1 trials and the final (1 − p) is the r-th failure,

where Γ(n) = (n − 1)! for any integer n.

• Consider creating credible set for y2 given y1.

– Action set is A = {a : a ⊆ {0, 1, 2, . . . }}.
– Loss function I (y2, d (y1)) = 1{y2 ∉ d(y1)} + K · |d (y1)|

* |·| gives cardinality (the number of values in d (y1), since y2 is integer-valued)
* Motivates d (y1) = {k : Pr (y2 = k | y1) ≥ K}

* Discrete analogue of the HPD set, with level 1 − α = Σ_{k ∈ d(y1)} Pr (y2 = k | y1)

– For y1 = 4, d (y1) = {1, 2, 3, 4, 5, 6, 7, 8} corresponds to K = 0.05 with level 1 − α = 0.10 + 0.14 + 0.15 + 0.14 + 0.12 + 0.09 + 0.07 + 0.05 = 0.86.[Code]
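These probabilities can be checked directly, since p (y2 | y1 = 4) is NegativeBinomial (4.5, 1/2), which in R's failures-before-r-successes parameterization is dnbinom with size = 4.5 and prob = 0.5 (a sketch, not code from the notes):

```r
# Discrete HPD-style set for y2 | y1 = 4 ~ NegativeBinomial(4.5, 1/2):
# keep every k whose posterior predictive probability is at least K = 0.05.
K <- 0.05
k <- 0:20
pmf <- dnbinom(k, size = 4.5, prob = 0.5)
d <- k[pmf >= K]
d                   # the set {1, 2, ..., 8}
sum(pmf[pmf >= K])  # level 1 - alpha, about 0.855 (0.86 after rounding)
```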

• Given y1, consider the probability of no patients in the following two weeks (event A)

Pr (A | y1) = Pr (y2 = 0 ∩ y3 = 0 | y1)

= Pr (y2 = 0 | y1) × Pr (y3 = 0 | y2 = 0, y1) = (1/2)^{y1 + 1/2} (2/3)^{y1 + 1/2} = (1/3)^{y1 + 1/2} ,

where p (y2 | y1) = Neg.Binomial (y1 + 1/2, 1/2) and p (y3 | y2 = 0, y1) = Neg.Binomial (y1 + 1/2, 1/3). Note that (i) p (y3 | y2 = 0, y1) = ∫ p (y3 | λ) p (λ | y2 = 0, y1) dλ, where p (λ | y2 = 0, y1) ∝ p (y2 = 0, y1 | λ) p (λ) ∝ p (y2 = 0 | λ) p (y1 | λ) p (λ) ∝ Gamma (y1 + 1/2, 2), and (ii) the CDF collapses to the PMF at y2 = 0 or y3 = 0, since Y is non-negative.

Then, as y1 is discrete, the integral (over all possible y1’s) becomes the summation:

E_{y1|λ} [Pr (A | y1)] = Σ_{k=0}^∞ (1/3)^{k + 1/2} · (λ^k e^{−λ} / k!) = (1/√3) e^{−λ} Σ_{k=0}^∞ (λ/3)^k / k! = (1/√3) e^{−λ} e^{λ/3} = (1/√3) e^{−2λ/3} .

Thus, the posterior predictive estimate has bias

(1/√3) e^{−2λ/3} − e^{−2λ} .

– Recall the only unbiased estimate for Pr (A) is (−1)^{y1}.

The unbiased estimate for Pr (A) is the d (y1) satisfying E_{y1|λ} [d (y1)] = Pr (A) = Pr (y2 = 0 | λ) × Pr (y3 = 0 | λ) = e^{−2λ}. Hence d (y1) = (−1)^{y1}.
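Both the unbiasedness of (−1)^{y1} and the bias of the posterior predictive estimate can be checked numerically by summing against the Poisson pmf (a sketch, not from the notes; λ = 1.3 is an arbitrary test value):

```r
# Check E[(-1)^y1] = e^(-2*lambda) for y1 ~ Poisson(lambda),
# i.e. the (absurd) estimator (-1)^y1 is unbiased for Pr(A) = e^(-2*lambda).
lambda <- 1.3
k <- 0:200  # truncation; Poisson(1.3) mass beyond 200 is negligible
e1 <- sum((-1)^k * dpois(k, lambda))
abs(e1 - exp(-2 * lambda))  # essentially zero

# Expectation of the posterior predictive estimate (1/3)^(y1 + 1/2):
e2 <- sum((1/3)^(k + 0.5) * dpois(k, lambda))
abs(e2 - exp(-2 * lambda / 3) / sqrt(3))  # also essentially zero
```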

Normal Model with Jeffreys Priors

Let y1, . . . , yn be iid N (µ, σ²).

Case I: Assume σ² is known.

• The Jeffreys prior for µ is uniform: π (µ) ∝ c.

Note that log f (y | µ, σ²) = C − (1/(2σ²)) Σ_{i=1}^n (yi − µ)²,

(∂/∂µ) log f = (1/σ²) Σ_{i=1}^n (yi − µ) , and (∂²/∂µ²) log f = −n/σ² .

Given σ², the Jeffreys prior for µ is a constant, and hence µ is uniformly distributed:

π (µ) ∝ √I (µ) = √( −E_{y|µ} [(∂²/∂µ²) log f] ) = √(n/σ²) = c.

• π (µ) = π (µ | σ²) = c gives posterior p (µ | y, σ²) = N (ȳ, σ²/n).

p (µ | y, σ²) ∝ p (y | µ, σ²) π (µ | σ²) ∝ exp{−Σ_{i=1}^n (yi − µ)² / (2σ²)} ∝ exp{−(µ − ȳ)² / (2σ²/n)} ∝ N (ȳ, σ²/n) .

• The 100 (1 − α)% HPD (and percentile/symmetric) credible interval for µ is

C = [ ȳ − z (α/2) σ/√n , ȳ + z (α/2) σ/√n ] ,

where z (α/2) is the upper α/2 critical value (i.e., the 1 − α/2 quantile) of N (0, 1).

Case II: Assume σ² is also unknown and independent of µ.

• The Jeffreys prior for σ² is π (σ²) ∝ 1/σ².

Note that log f (y | µ, σ²) = C − (n/2) log σ² − (1/(2σ²)) Σ_{i=1}^n (yi − µ)²,

∂ log f / ∂(σ²) = −(n/2)/σ² + Σ_{i=1}^n (yi − µ)² / (2(σ²)²) , and ∂² log f / ∂(σ²)² = (n/2)/(σ²)² − Σ_{i=1}^n (yi − µ)² / (σ²)³ .

Given y, the Jeffreys prior for σ2 is

π (σ²) ∝ √( −E_{y|σ²} [ (n/2)/(σ²)² − Σ_{i=1}^n (yi − µ)² / (σ²)³ ] )
= √( −(n/2)/(σ²)² + n E_{y|σ²} [(yi − µ)²] / (σ²)³ )
= √( −(n/2)/(σ²)² + nσ²/(σ²)³ ) = √(n/2) · (1/σ²) ∝ 1/σ² .

• Then, p (µ | y) = ∫_0^∞ p (µ | y, σ²) p (σ² | y) dσ² is a shifted and scaled t-distribution:

(µ − ȳ) / (s/√n) ∼ t_{n−1} where s² = Σ_{i=1}^n (yi − ȳ)² / (n − 1) .

• The HPD credible interval for µ is a standard t-interval with n − 1 degrees of freedom.
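This coincidence of the Jeffreys-prior credible interval with the frequentist t-interval can be confirmed on simulated data (a sketch, not from the notes; the data are arbitrary draws):

```r
# With pi(mu) flat and pi(sigma^2) = 1/sigma^2, the 95% HPD interval for mu
# equals the classical t-interval: ybar +/- t_{.975, n-1} * s / sqrt(n).
set.seed(1)
y <- rnorm(20, mean = 5, sd = 2)
n <- length(y)
bayes_ci <- mean(y) + c(-1, 1) * qt(0.975, df = n - 1) * sd(y) / sqrt(n)
freq_ci <- t.test(y)$conf.int
max(abs(bayes_ci - freq_ci))  # zero up to floating point
```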

5 Decisions and Hypothesis Testing

Example (Coke Bottles (cont.)). Coke bottles are filled by machines with calibration µ ∼ N (12, 0.01). Given a machine with calibration µ, bottles are filled with y ∼ N (µ, 0.05). For n = 5,

p (µ | y) = N ( (σ²µ0 + nτ²ȳ)/(σ² + nτ²) , σ²τ²/(σ² + nτ²) ) = N ( (12 + ȳ)/2 , 0.005 ) .

A machine with calibration µ costs the company $50,000 (µ − 12)², and the cost to recalibrate a machine to µ = 12 is $500. The decision theoretic framework is

• Prior distribution: π (µ) = N (12, 0.01), ∀µ ∈ R

• Sampling distribution: {yi}_{i=1}^5 iid ∼ N (µ, 0.05), with y ∈ R⁵

• Allowable actions: A = {Re-calibrate (R) , Not re-calibrate (N)}

• Loss function: l (µ, a) = 500 if a = R; l (µ, a) = 50,000 (µ − 12)² if a = N

Decide whether to recalibrate after the sample of n = 5; that is, find d (y).

Solution: The posterior loss for a = R (re-calibrate) is ρ (π, R) = Eµ|y [I (µ, R)] = 500. The posterior loss for N (not re-calibrate) is

ρ (π, N) = E_{µ|y} [ 50,000 ((µ − E_{µ|y}µ) + (E_{µ|y}µ − 12))² ]
= 50,000 { E_{µ|y} [(µ − E_{µ|y}µ)²] + (E_{µ|y}µ − 12)² }
= 50,000 [ Var_{µ|y} (µ) + ((12 + ȳ)/2 − 12)² ]
= 50,000 [ 0.005 + 0.25 (ȳ − 12)² ] = 250 + 12,500 (ȳ − 12)² .

ρ (π, N) < ρ (π, R) implies 250 + 12,500 (ȳ − 12)² < 500 =⇒ ȳ ∈ (11.86, 12.14).

Thus, the decision rule is d (y) = N if ȳ ∈ (11.86, 12.14), and R if ȳ ∉ (11.86, 12.14).
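The rule is a one-line comparison of posterior risks; the boundary ȳ = 12 ± √0.02 ≈ 12 ± 0.14 recovers the interval above (a sketch, not from the notes):

```r
# Recalibration decision: compare posterior expected losses for a given ybar.
decide <- function(ybar) {
  risk_N <- 250 + 12500 * (ybar - 12)^2  # posterior risk of not recalibrating
  risk_R <- 500                          # posterior risk of recalibrating
  if (risk_N < risk_R) "N" else "R"
}
decide(12.00)  # "N"
decide(12.10)  # "N"  (inside (11.86, 12.14))
decide(12.20)  # "R"
```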

Frequentist Hypothesis Testing

• Two possible hypotheses involving θ: the null (H0) and alternative (Ha)

• Observe data y and assume Y and y are iid given θ

• The p-value is the probability that Y are more “extreme” than y under H0:

p-value= Pr (Y more “extreme” than y | H0).

– Y represents results that “could have” occurred under H0
– “Extreme” depends on context (refer to “How to Correctly Interpret P Values”)

– Probability always computed under H0, never Ha

– H0 is usually more specific: e.g., H0: θ = θ0 and Ha: θ 6= θ0

– Small p-value is evidence against H0

Suppose H0: θ = 0 with estimate θ̂. The p-value is Pr (|θ̂′| ≥ |θ̂| | θ = 0), where θ̂′ is the estimate from a hypothetical replication of the experiment. If the p-value ≤ 0.01, i.e., θ̂ falls in the 1% rejection region, then we can reject H0 at the 1% significance level. Otherwise, we cannot reject.

• p-value requires considering probability of other possible outcomes, not just the observed outcome. Therefore, it violates the Likelihood Principle:

– In making inferences or decisions about θ after y is observed, all relevant experimental information is contained in the likelihood function for the observed y.

The likelihood principle says that inference should only depend on the likelihood of the observed data (considered as a function of θ), but when we compute a p-value we also consider the sampling distribution at other possible data values that we did not observe.

Two experiments can give data with different p-values but equal likelihood under H0.

Example (Tipping Pennies). Claim: If you stand a penny on its side and let it fall, it will land heads more often than tails.

• Let θ be the probability the penny lands heads

• H0 : θ = 1/2 and Ha : θ > 1/2

Assume for both experiments, we observe 5 heads then one tail.

Experiment 1: Tip the coin until the penny lands tails.

• Random variable X is the number of heads before first tail; X ∼Neg.Binomial(1, θ)

• For the Neg. Binomial distribution the possible values of X go to infinity, and thus the p-value is P (X ≥ 5 | H0) = Σ_{k=5}^∞ P (X = k | H0) = (1/2)^5 (1/2) [1 + 1/2 + (1/2)² + ···] = (1/2)^5 ≈ 0.031.

Experiment 2: Tip the coin 6 times

• Random variable Y is the number of times penny falls heads; Y ∼Binomial(6, θ)

• For the Binomial distribution the possible values of Y end at 6, and thus the p-value is P (Y ≥ 5 | H0) = Σ_{k=5}^6 P (Y = k | H0) = C_5^6 (1/2)^5 (1 − 1/2) + C_6^6 (1/2)^6 ≈ 0.110.

Note. (i) In frequentist hypothesis testing, the p-value represents the probability over all possible extreme values from the sampling distribution (though not observed in either experiment) under H0. (ii) Given the number of observed heads at 5 (for both experiments) and the total number of trials at 6 (since n is an important parameter in the Binomial distribution), the two likelihoods are identical up to a constant: P (X = 5 | θ) ∝ P (Y = 5 | θ) ∝ θ⁵ (1 − θ).
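Both p-values can be computed with R's distribution functions, using the failures-before-first-success parameterization for the negative binomial (a sketch, not from the notes):

```r
# p-values for "5 heads then a tail" under H0: theta = 1/2.
# Experiment 1: X = number of heads before the first tail, X ~ NB(1, 1/2).
p1 <- 1 - pnbinom(4, size = 1, prob = 0.5)  # P(X >= 5) = (1/2)^5
# Experiment 2: Y = heads in 6 fixed tips, Y ~ Binomial(6, 1/2).
p2 <- 1 - pbinom(4, size = 6, prob = 0.5)   # P(Y >= 5) = 7/64
c(p1, p2)  # 0.03125 and about 0.109
```

Same data, same likelihood, different p-values — exactly the violation of the likelihood principle described above.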

Bayesian Hypothesis Testing

• Compute P (H0 | y) directly:

P (H0 | y) = P (y | H0) P (H0) / [P (y | H0) P (H0) + P (y | Ha) P (Ha)]

– Must compute P (y | Ha) which requires the prior for Ha

– Must specify P (H0), which is the prior probability of H0 (being true)

• P (H0 | y) + P (Ha | y) = 1

• Symmetric: Give evidence for or against H0

– p-value can only give evidence against (i.e., never “accept H0”)

Example (Tipping Pennies (cont.)). Possible Bayesian Framework:

• Let θ be the probability the penny lands heads.

• H0: θ = 1/2

• Ha: θ ∼Uniform(0.5, 1)

• P (H0) = 0.5

Then the prior distribution for θ is mixed, s.t., θ = 1/2 with prob. 1/2, and θ ∼ π (θ) = 2 for θ ∈ [0.5, 1] with prob. 1/2.

Experiment 1: The number of heads before the first tail X ∼ Negative Binomial(1, θ)

• P (x = 5 | H0) = (1/2)^5 × (1 − 1/2) = (1/2)^6, where (1/2)^5 is the probability of observing the first five heads and (1 − 1/2) is the probability of observing the first tail.

• P (x = 5 | Ha) = ∫_{0.5}^1 π (θ) · p (x = 5 | θ) dθ = ∫_{0.5}^1 2 θ⁵ (1 − θ) dθ ≈ 0.044

Therefore,

P (H0 | x = 5) = P (x = 5 | H0) P (H0) / [P (x = 5 | H0) P (H0) + P (x = 5 | Ha) P (Ha)] = 0.267,

where P (Ha) = 1 − P (H0) = 1 − 0.5 = 0.5.

Experiment 2: The number of heads in 6 trials Y ∼ Binomial(6, θ)

• P (y = 5 | H0) = C_5^6 (1/2)^5 (1 − 1/2) = 6 (1/2)^6

• P (y = 5 | Ha) = ∫_{0.5}^1 π (θ) · p (y = 5 | θ) dθ = ∫_{0.5}^1 2 · C_5^6 θ⁵ (1 − θ) dθ ≈ 0.254

Therefore,

P (H0 | y = 5) = P (y = 5 | H0) P (H0) / [P (y = 5 | H0) P (H0) + P (y = 5 | Ha) P (Ha)] = 0.267.

Note. “Likelihood under H0” means the probability P (y | H0), where y is the observed data.

Summary. If P (y | θ) ∝ P (x | θ), say P (y | θ) = CP (x | θ), then P (H0 | y) = P (H0 | x):

P (H0 | y) = C P (x | H0) P (H0) / [C P (x | H0) P (H0) + C P (x | Ha) P (Ha)] = P (H0 | x) .
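This invariance is easy to verify numerically with integrate (a sketch, not from the notes; the constant C = C_5^6 = 6 cancels exactly):

```r
# Posterior P(H0 | data) for both penny experiments; priors P(H0) = P(Ha) = 0.5.
post_H0 <- function(lik_H0, lik_Ha) lik_H0 * 0.5 / (lik_H0 * 0.5 + lik_Ha * 0.5)

# Experiment 1 (negative binomial): likelihood theta^5 * (1 - theta).
px_H0 <- 0.5^6
px_Ha <- integrate(function(t) 2 * t^5 * (1 - t), 0.5, 1)$value
# Experiment 2 (binomial): same likelihood times choose(6, 5) = 6.
py_H0 <- 6 * 0.5^6
py_Ha <- integrate(function(t) 2 * 6 * t^5 * (1 - t), 0.5, 1)$value

c(post_H0(px_H0, px_Ha), post_H0(py_H0, py_Ha))  # identical posteriors
```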

Hypothesis Decision

Often there is no need to conclude or decide on a given hypothesis. Simply report strength of evidence given by p-value or posterior. If necessary to choose, can use decision-theoretic framework.

• Action space A = {H0, Ha}

• A natural decision rule is d_c (y) = H0 if P (H0 | y) > c, and Ha otherwise, for c ∈ (0, 1).

– E.g., c = 0.5, which means choosing whichever hypothesis is more probable.

28 Type I error

A type I error occurs when H0 is true, but we conclude it is not. That is, if we observe y and conclude Ha, the probability we made a type I error is P (H0 | y). The standard definition of type I error rate, for a given rule, is

P (d (y) ≠ H0 | H0) (e.g., P (P (H0 | y) ≤ c | H0) if d_c (y) is a binary decision rule).

Note (Type I error, p-value, and Type I error rate).

1. Type I error is NOT the p-value, P (Y more “extreme” than y | H0).

(a) Type I error is the probability of rejecting H0 when it is true. A type I error happens when (i) the sample is unusual due to chance and (ii) it produces a low p-value.
(b) The p-value is a statement about the data, not the hypothesis. Say the p-value is 4% in a symmetric distribution with the population mean at 2 but sample mean at 3: if the null hypothesis is true (the population mean is 2), there is a 4% likelihood of obtaining a sample mean that is at least as extreme as ours (i.e., the mean of another sample from the population is ≥ 3 or ≤ 1), counting both tails of the distribution. Hence the p-value doesn’t say anything directly about what you’re observing; it only talks about the odds of observing it.2

2. The p-value is comparable to the type I error rate for a hypothesis test, although it is neither the type I error nor the error rate. A type I error rate depends on d_c (·) and y, while the p-value only depends on y.

3. The type I error rate for rule d_c (y) depends on (i) prevalence of real effects (where initial probability of a true null = 1 − P (real)), (ii) power (i.e., P (y | Ha) = 1 − β where β is the type II error or “false negative” rate), and (iii) the p-value.3

4. The significance level α (e.g., 5% or 1%) is often referred to as a type I error rate.

Example (Tipping Pennies (cont.)). Bayesian framework: Let θ be the probability the penny lands heads. H0: θ = 1/2 and Ha: θ ∼ Unif(0.5, 1). P (H0) = 0.5. Two different experiments: X ∼ Neg.Binomial(1, θ) and Y ∼ Binomial(6, θ).

Consider d0.3: rule choosing H0 if P (H0 | ·) > 0.3.

x   P (H0 | x)        y   P (H0 | y)
4   0.34              4   0.51
5   0.26              5   0.26
6   0.18              6   0.05
7   0.13

2Refer to two blog articles: Three Things the P-Value Can’t Tell You about Your Hypothesis Test and Understanding Hypothesis Tests: Significance Levels (Alpha) and P values in Statistics. 3Refer to two blog articles: Not All P Values are Created Equal and On the Hazards of Significance Testing.

• Experiment 1: d_0.3 (x) = Ha if P (H0 | x) ≤ 0.3 ⇐⇒ x ≥ 5. Thus, the type I error rate is

P (d (x) ≠ H0 | H0) = P (P (H0 | x) ≤ 0.3 | H0) = P (X ≥ 5 | H0) = 0.031.

• Experiment 2: d_0.3 (y) = Ha if P (H0 | y) ≤ 0.3 ⇐⇒ y ≥ 5. Thus, the type I error rate is

P (d (y) ≠ H0 | H0) = P (P (H0 | y) ≤ 0.3 | H0) = P (Y ≥ 5 | H0) = 0.110.

Note. In this example, the type I error rate is the same as the p-value; both are smaller than the type I error.

Example (Vaccine Trial). A tentative trial for a Zika vaccine completes:

• 1 of 50 who received the vaccine were infected

• 7 of 50 who received a placebo were infected

Conduct a second-stage trial, with n = 500 in each group. Possible Bayesian framework: Let θ1 and θ2 be the probabilities of vaccinated and non-vaccinated infection.

• H0: θ1 = θ2 = θ ∼Beta(9, 93) and Ha: θ1 ∼Beta(2, 50), θ2 ∼Beta(8, 44) are independent.

• P (H0) = 0.25 and hence P (Ha) = 1 − P (H0) = 0.75.

Observe y1 = 36 infections in vaccinated and y2 = 52 in non-vaccinated group.

The observed probability under H0 is

P (y1 = 36, y2 = 52 | H0) = ∫ P (y1 = 36 | θ) P (y2 = 52 | θ) Beta (9, 93) dθ
= ∫ C_36^500 θ^36 (1 − θ)^464 · C_52^500 θ^52 (1 − θ)^448 · [θ^8 (1 − θ)^92 / B (9, 93)] dθ
= (C_36^500 C_52^500 / B (9, 93)) ∫ θ^96 (1 − θ)^1004 dθ   (a Beta (97, 1005) kernel)
= (C_36^500 C_52^500 / B (9, 93)) B (97, 1005) ≈ 0.0002.

The observed probability under Ha is

P (y1 = 36, y2 = 52 | Ha) = ∫∫ P (y1 = 36 | θ1) Beta (2, 50) P (y2 = 52 | θ2) Beta (8, 44) dθ1 dθ2
= ∫ C_36^500 θ1^36 (1 − θ1)^464 [θ1 (1 − θ1)^49 / B (2, 50)] dθ1 ∫ C_52^500 θ2^52 (1 − θ2)^448 [θ2^7 (1 − θ2)^43 / B (8, 44)] dθ2
= (C_36^500 C_52^500 / (B (2, 50) B (8, 44))) ∫ θ1^37 (1 − θ1)^513 dθ1 ∫ θ2^59 (1 − θ2)^491 dθ2
= (C_36^500 C_52^500 / (B (2, 50) B (8, 44))) B (38, 514) B (60, 492) ≈ 0.0001.

The posterior probability of H0 is

P (H0 | y1 = 36, y2 = 52) = P (y1 = 36, y2 = 52 | H0) P (H0) / [P (y1 = 36, y2 = 52 | H0) P (H0) + P (y1 = 36, y2 = 52 | Ha) P (Ha)] ≈ 0.42.

Thus, the probability that the vaccine has an effect is 0.58, which implies we are less sure than before (less than P (Ha) = 0.75). By the beta-binomial model,

• Under H0, P (θ1 = θ2 = θ | H0, y) = Beta (9 + y1 + y2, 93 + 1000 − y1 − y2).

• Under Ha, P (θ1 | Ha, y) = Beta (2 + y1, 50 + 500 − y1) and P (θ2 | Ha, y) = Beta (8 + y2, 44 + 500 − y2).

• The marginal distributions over H0, Ha for θ1, θ2 are

p (θi | y) = P (H0 | y) p (θ | H0, y) + P (Ha | y) p (θi | Ha, y) .

Thus, in this example, p (θ1 = θ2 = θ | H0, y) = Beta (97, 1005), p (θ1 | Ha, y) = Beta (38, 514),

and p (θ2 | Ha, y) = Beta (60, 492). The marginal distributions over H0, Ha are

p (θ1 | y) = 0.42Beta (97, 1005) + 0.58Beta (38, 514) and

p (θ2 | y) = 0.42 Beta (97, 1005) + 0.58 Beta (60, 492).

[Figure: marginal posterior densities of θ1 and θ2 over θ ∈ (0.02, 0.14).]

Decide whether to distribute the vaccine, after the second trial.

• Loss for distributing vaccines is I (θ, distribute) = 50, 000

• Loss for not distributing vaccines is I (θ, not distribute) = 10⁶ (θ2 − θ1)

• Posterior risk for distributing is ρ (θ, distribute) = Eθ|y [I (θ, distribute)] = 50, 000

• Posterior risk for not distributing is ρ (θ, not distribute) = E_{θ|y} [I (θ, not distribute)] = 10⁶ [E_{θ2|y} θ2 − E_{θ1|y} θ1] = 10⁶ [(0.42 · 97/(97+1005) + 0.58 · 60/(60+492)) − (0.42 · 97/(97+1005) + 0.58 · 38/(38+514))] ≈ 23,100 < 50,000.

Thus, choose not to distribute the vaccine at this time.

Bayes Factors

• The posterior odds for H0 over Ha is

P (H0 | y) / P (Ha | y)

• The Bayes factor for H0 over Ha is

BF = p (y | H0) / p (y | Ha)

• The Bayes factor is the posterior odds scaled by the prior odds

BF = (P (Ha) / P (H0)) · (P (H0 | y) / P (Ha | y)) or P (H0 | y) / P (Ha | y) = (P (H0) / P (Ha)) · BF.

Heuristic interpretation of Bayes factors (c. Harold Jeffreys):

BF:                    < 1        1 ∼ 3          3 ∼ 10       10 ∼ 30   30 ∼ 100      > 100
Strength of Evidence:  Negative   Barely worth   Substantial  Strong    Very strong   Decisive
                                  mentioning

Consider the example above: the Bayes factor for the second vaccine trial was

P (y1 = 36, y2 = 52 | H0) / P (y1 = 36, y2 = 52 | Ha) ≈ 2,

which implies that the data give little evidence for one hypothesis over the other.
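The marginal likelihoods, posterior probability, and Bayes factor for the vaccine example can be reproduced on the log scale with lbeta and lchoose (a sketch, not from the notes; working in logs avoids underflow in the Beta functions):

```r
# Marginal log-likelihoods for the second-stage vaccine data (y1 = 36, y2 = 52).
lc <- lchoose(500, 36) + lchoose(500, 52)      # common binomial coefficients
lm_H0 <- lc + lbeta(97, 1005) - lbeta(9, 93)   # H0: shared theta ~ Beta(9, 93)
lm_Ha <- lc + lbeta(38, 514) - lbeta(2, 50) +  # Ha: independent Beta priors
  lbeta(60, 492) - lbeta(8, 44)

BF <- exp(lm_H0 - lm_Ha)                       # Bayes factor for H0 over Ha
post_H0 <- 0.25 * BF / (0.25 * BF + 0.75)      # with prior P(H0) = 0.25
c(BF, post_H0)                                 # roughly 2 and 0.42
```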

6 Model Comparison

Multiple Hypothesis/Models

• Bayesian framework does not treat H0 and Ha differently.

• Methodology may be extended to more than two conclusions.

• Instead of “hypotheses”, compare evidence for “models.”

• For data y, models M1, ... , Mm:

– Mi: y ∼ f (y | θi,Mi) with prior θi ∼ πi (θi) where Mi is the data generating process.

– With prior probabilities P (Mi), Σ_{i=1}^m P (Mi) = 1.

• The posterior probability of model i is

p (Mi | y) = P (Mi) p (y | Mi) / Σ_{j=1}^m P (Mj) p (y | Mj) , where p (y | Mi) = ∫ f (y | θi) πi (θi | Mi) dθi.

Model Choice

Actions A = {M1,...,Mm}

• Under “0-1” loss: I (Mi, d (y)) = 1{d(y) ≠ Mi},

Choose Mi with the highest posterior probability P (Mi | y).

• Under “0-ci” loss: I (Mi, d (y)) = ci 1{d(y) ≠ Mi}, where ci is a penalty,

– Posterior risk for choosing Mi: ρ (π, a = Mi) = E_{M|y} [I (M, Mi)] = Σ_{j≠i} cj P (Mj | y)

Choose Mi with the highest weighted posterior probability ciP (Mi | y).

Example (IQ). Human IQs have a N (100, 225) distribution. A given IQ test has normal error with variance 64. That is, f (y | µ) =Normal(µ, 64) and π (µ) =Normal(100, 225). Observe a test score y ∈ R for a student, then the posterior distribution for their true IQ is

p (µ | y) = Normal ( (σ²µ0 + τ²y)/(σ² + τ²) , σ²τ²/(σ² + τ²) ) = Normal (22.15 + 0.779y, 49.83).

• A given student belongs to

– remedial learning group if IQ < 80 (R)
– standard learning group if 80 ≤ IQ ≤ 120 (S)
– advanced learning group if IQ > 120 (A)

• Assume that a student has score y = 75, then p (µ | y) =Normal(79.80, 49.83).

• The posterior probabilities of belonging to each group are

P (R | y = 75) = ∫_{−∞}^{80} p (µ | y = 75) dµ = 0.511
P (S | y = 75) = ∫_{80}^{120} p (µ | y = 75) dµ = 0.489
P (A | y = 75) = ∫_{120}^{∞} p (µ | y = 75) dµ ≈ 0.

33 • Loss function

I (R, d (y)) = 1{d(y) ≠ R}, I (S, d (y)) = 2 · 1{d(y) ≠ S}, and I (A, d (y)) = 1{d(y) ≠ A}.

• Given y = 75, choose the group with the highest weighted posterior probability: choose the standard group S, since 2P (S | y = 75) > P (R | y = 75) and 2P (S | y = 75) > P (A | y = 75).

• Decision rule for arbitrary y: d (y) = R if y < 70.4; S if 70.4 ≤ y ≤ 129.6; A if y > 129.6

Under the loss functions, choose R if P (R | y) > 2P (S | y). Note that P (S | y) ≈ 1 − P (R | y). Hence, P (R | y) > 2P (S | y) implies P (R | y) > 2/3, which is equivalent to comparison between corresponding quantiles after standardizing the distribution:

∫_{−∞}^{80} p (µ | y) dµ = Φ( (80 − 22.15 − 0.779y)/√49.83 ) > 2/3 ⇐⇒ (80 − 22.15 − 0.779y)/√49.83 > Z_{2/3} = 0.431 =⇒ y < 70.4.
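The cutoff y < 70.4 can also be found numerically by solving P (R | y) = 2/3 for y (a sketch, not from the notes):

```r
# Cutoff between remedial and standard groups: solve P(mu < 80 | y) = 2/3,
# with posterior p(mu | y) = Normal(22.15 + 0.779 * y, 49.83).
g <- function(y) pnorm(80, mean = 22.15 + 0.779 * y, sd = sqrt(49.83)) - 2/3
cut_RS <- uniroot(g, c(50, 90))$root
round(cut_RS, 1)  # 70.4
```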

Bayes Factors for Model Comparison

• Recall the Bayes factor for model M1 over model M2 is BF = p (y | M1) / p (y | M2).

• A likelihood ratio test is based on the maximum likelihood for each model: Λ = max_{θ1} f (y | θ1, M1) / max_{θ2} f (y | θ2, M2).

• Under point models M1: θ = θ^(1) and M2: θ = θ^(2): BF = Λ = f (y | θ^(1)) / f (y | θ^(2)).

Bayesian Information Criterion

• Let ki be number of parameters in model Mi and n be the sample size.

• A heuristic for assessing the fit of a model is the Bayesian Information Criterion (BIC):

BIC (Mi) = −2 log (max_{θi} f (y | θi, Mi)) + ki log n.

• Likelihood ratio test based on transformed ratio:

W = −2 log [ max_{θ1} f (y | θ1, M1) / max_{θ2} f (y | θ2, M2) ] .

• The difference (change from M1 to M2) in BIC can be expressed in terms of W :

∆BIC = W − (k2 − k1) log n.

• For y = y1, y2, . . . , yn iid, as n → ∞, −2 log (BF) ≈ ∆BIC under mild assumptions.

– Derivation: Bhat and Kumar. 2010. “On the derivation of the Bayesian Information Criterion.”
– ∆BIC may be easier to compute than the BF.
– ∆BIC does not depend on prior distributions, but it does depend on the likelihood ratio and the dimension of each model (i.e., W and k2 − k1).
– BIC is also called the Schwarz information criterion (SIC), after G. Schwarz.
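As a quick check of the definition, R's built-in BIC() applied to a fitted model agrees with computing −2 log-likelihood + k log n by hand (a sketch, not from the notes; note that for lm the parameter count k includes the error variance):

```r
# BIC(M) = -2 * log(max likelihood) + k * log(n), illustrated with a toy lm fit.
set.seed(2)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)
fit <- lm(y ~ x)

ll <- logLik(fit)                # maximized log-likelihood
k <- attr(ll, "df")              # parameters: intercept, slope, sigma^2
n <- nobs(fit)
bic_hand <- -2 * as.numeric(ll) + k * log(n)
abs(bic_hand - BIC(fit))         # zero up to floating point
```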

Partial Bayes Factors

If πi (θi) is improper, then so is p (y | Mi) = ∫ f (y | θi, Mi) πi (θi) dθi.

Note. Let ∫ πi (θi) dθi = ∞. Then

∫ p (y | Mi) dy = ∫∫ f (y | θi, Mi) πi (θi) dθi dy = ∫ πi (θi) [∫ f (y | θi, Mi) dy] dθi = ∫ πi (θi) dθi = ∞,

since ∫ f (y | θi, Mi) dy = 1.

Hence BF involving Mi is not well defined. There is a possible solution:

(i) Assume p (θ1 | y1) is proper for y1 = (y1, . . . , yi).

(ii) Find conditional Bayes factor for y2 = (yi+1, . . . , yn) and construct a Partial Bayes Factor:

BF (y2 | y1) = p (y2 | y1, M1) / p (y2 | y1, M2) .

Example (Traffic Accidents). Estimate weekly accident rate at new traffic intersection. Each week observe y ∼Poisson(λ) accidents.

• M1: Elicited prior from city planner: π1 (λ) =Gamma(3, 2).

• M2: Compare with (improper) uniform prior π2 (λ) = 1.

• Observe data for 5 weeks: y1 = 3, y2 = 6, y3 = 2, y4 = 4, and y5 = 2.

• If y1, . . . , yn iid ∼ Poisson(λ) and M: π (λ) = Gamma(α, β), the marginal distribution of y under M is

p (y | M) = ∫ p (y | λ) π (λ) dλ = ∫ (λ^{Σ yi} e^{−nλ} / Π yi!) · (β^α / Γ(α)) λ^{α−1} e^{−βλ} dλ
= (β^α / (Π yi! Γ(α))) ∫ λ^{Σ yi + α − 1} e^{−(β+n)λ} dλ = β^α Γ(Σ yi + α) / (Π yi! Γ(α) (β + n)^{Σ yi + α}) .

• Since π2 (λ) = 1 is improper, so is p (y | M2).

• Condition on y1:

p (λ | M1, y1) ∝ p (y1 | λ) π1 (λ) ∝ (λ^{y1} e^{−λ} / y1!) · λ^{3−1} e^{−2λ} ∝ λ^{y1+3−1} e^{−3λ} ∝ Gamma (y1 + 3, 3) .
p (λ | M2, y1) ∝ p (y1 | λ) π2 (λ) = (λ^{y1} e^{−λ} / y1!) × 1 ∝ λ^{y1} e^{−λ} ∝ Gamma (y1 + 1, 1) .

• Compute the partial Bayes factor, conditioned on y1:

p (y2 = 6, y3 = 2, y4 = 4, y5 = 2 | M1, y1 = 3) = ∫ p (y2, y3, y4, y5 | λ) p (λ | M1, y1) dλ, with p (yj | λ) = Poisson (λ) and p (λ | M1, y1) = Gamma (y1 + 3, 3);
p (y2 = 6, y3 = 2, y4 = 4, y5 = 2 | M2, y1 = 3) = ∫ p (y2, y3, y4, y5 | λ) p (λ | M2, y1) dλ, with p (λ | M2, y1) = Gamma (y1 + 1, 1).

BF (y2, y3, y4, y5 | y1) = p (y−1 | M1, y1) / p (y−1 | M2, y1) = 0.000133 / 0.000224 = 0.596.

Note. (i) The key is to compute the marginal distribution of y under each data generating model. (ii) There is modest evidence that the elicited prior is not better than the flat prior.

Intrinsic Bayes Factors

• Compute n partial Bayes factors: BF ({yj}_{j≠i} | yi) for i = 1, . . . , n.

• The average of these partial BFs is the intrinsic Bayes factor.

– Could take the arithmetic or geometric average
– If BF ({yj}_{j≠i} | yi) does not exist, condition on larger subsets instead

PoisMarginal = function(Yvec, alpha, beta){
  n = length(Yvec)
  Prob = (1/prod(factorial(Yvec)))*((beta^alpha)/gamma(alpha))*
    (gamma(sum(Yvec)+alpha)/(beta+n)^(sum(Yvec)+alpha))
  return(Prob)
}

Partial_y1_M1 = PoisMarginal(c(6,2,4,2), alpha=3+3, beta=3)
Partial_y1_M2 = PoisMarginal(c(6,2,4,2), alpha=1+3, beta=1)
BF_y1 = Partial_y1_M1/Partial_y1_M2
c(Partial_y1_M1, Partial_y1_M2, BF_y1)

## [1] 0.0001340 0.0002248 0.5959680

Partial_y2_M1 = PoisMarginal(c(3,2,4,2), alpha=3+6, beta=3)
Partial_y2_M2 = PoisMarginal(c(3,2,4,2), alpha=1+6, beta=1)
BF_y2 = Partial_y2_M1/Partial_y2_M2

Partial_y3_M1 = PoisMarginal(c(3,6,4,2), alpha=3+2, beta=3)
Partial_y3_M2 = PoisMarginal(c(3,6,4,2), alpha=1+2, beta=1)
BF_y3 = Partial_y3_M1/Partial_y3_M2

Partial_y4_M1 = PoisMarginal(c(3,6,2,2), alpha=3+4, beta=3)
Partial_y4_M2 = PoisMarginal(c(3,6,2,2), alpha=1+4, beta=1)
BF_y4 = Partial_y4_M1/Partial_y4_M2

Partial_y5_M1 = PoisMarginal(c(3,6,2,4), alpha=3+2, beta=3)
Partial_y5_M2 = PoisMarginal(c(3,6,2,4), alpha=1+2, beta=1)
BF_y5 = Partial_y5_M1/Partial_y5_M2

BF_Intrinsic = (BF_y1 + BF_y2 + BF_y3 + BF_y4 + BF_y5)/5
BF_Intrinsic

## [1] 1.639

Fractional Bayes Factors

An alternative to intrinsic BF is the fractional Bayes factor:

BF_b = p (y, b | M1) / p (y, b | M2) ,

where

p (y, b | Mi) = ∫ f (y | θi, Mi) πi (θi) dθi / ∫ f (y | θi, Mi)^b πi (θi) dθi for b ∈ (0, 1) .

• Often choose b = 1/n if BF1/n is well-defined.

• Fractional BF satisfies likelihood principle, intrinsic BF does not.

– The fractional BF only depends on the likelihood of the data f (y | θ) and the prior for θ.
– The intrinsic BF is an (either geometric or arithmetic) average over conditional distributions.

• Note that

p (y, b | Mi) = ∫ f (y | θi, Mi)^{1−b} p (θi | y, b, Mi) dθi ,

where p (θi | y, b, Mi) ∝ f (y | θi, Mi)^b πi (θi).

• For {yj}_{j=1}^n iid given θi,

f (y | θi, Mi)^b = [ Π_{j=1}^n f (yj | θi, Mi) ]^b .

So b = 1/n gives the geometric mean of the likelihood of one observation.

Example (Traffic Accidents (cont.)). Note that p (λ | y, 1/n, M2) = Gamma (ȳ + 1, 1):

p (λ | y, 1/n, M2) ∝ f (y | λ, M2)^{1/n} π2 (λ) = [λ^{Σ yi} e^{−nλ} / Π yi!]^{1/n} × 1 ∝ λ^{ȳ} e^{−λ} ∝ Gamma (ȳ + 1, 1) .

Thus,

p (y, 1/n | M2) = ∫ f (y | λ, M2)^{1−b} p (λ | y, 1/n, M2) dλ = ∫ [λ^{Σ yi} e^{−nλ} / Π yi!]^{1 − 1/n} · (λ^{ȳ} e^{−λ} / Γ(ȳ + 1)) dλ
= (1 / ((Π yi!)^{(n−1)/n} Γ(ȳ + 1))) ∫ λ^{nȳ} e^{−nλ} dλ = Γ(Σ yi + 1) / ((Π yi!)^{(n−1)/n} Γ(ȳ + 1) n^{Σ yi + 1}) .

n = 5
Yvec = c(3,6,2,4,2)
# M2 (flat prior): formula derived above
Fractional_M2 = (1/prod(factorial(Yvec)))^((n-1)/n)*(1/n)^(sum(Yvec)+1)*
  (1/gamma(mean(Yvec)+1))*gamma(sum(Yvec)+1)
# M1 (Gamma(3, 2) prior): analogous derivation, with p(lambda | y, 1/n, M1) = Gamma(ybar + 3, 3)
Fractional_M1 = 3^(mean(Yvec)+3)*(1/prod(factorial(Yvec)))^((n-1)/n)*
  (1/(n+2))^(sum(Yvec)+3)*(1/gamma(mean(Yvec)+3))*gamma(sum(Yvec)+3)

BF_Fractional = Fractional_M1/Fractional_M2; BF_Fractional

## [1] 0.7782

Cross-Validated Likelihood

The conditional predictive distribution for yi is

f (yi | y(i)) = ∫ f (yi | θ, y(i)) p (θ | y(i)) dθ ,

where y(i) = (y1, . . . , yi−1, yi+1, . . . , yn).

• A low f (yi | y(i)) indicates the model is a poor fit for yi.

• The product Π_{i=1}^n f (yi | y(i)) is sometimes called the pseudo marginal likelihood.

– A higher value indicates a better overall model fit.
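For the conjugate Poisson-Gamma setting of the traffic example, each leave-one-out predictive f (yi | y(i)) is negative binomial, so the pseudo marginal likelihood is a short computation (a sketch, not from the notes; the variable names are mine):

```r
# Pseudo marginal likelihood for y ~ Poisson(lambda), lambda ~ Gamma(a, b):
# f(y_i | y_(i)) is NB with size = a + sum(y_(i)) and prob = (b + n - 1)/(b + n).
y <- c(3, 6, 2, 4, 2); a <- 3; b <- 2; n <- length(y)
cpo <- sapply(seq_len(n), function(i) {
  dnbinom(y[i], size = a + sum(y[-i]), prob = (b + n - 1) / (b + n))
})
prod(cpo)  # pseudo marginal likelihood under the Gamma(3, 2) prior
```

Each factor cpo[i] is the conditional predictive ordinate for observation i; unusually small values flag poorly fit observations.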

Bayesian Residuals

Use the conditional predictive distribution to compute Bayesian residuals: ri = yi − E [Yi | y(i)]. The standardized residual is

di′ = (yi − E [Yi | y(i)]) / √(Var (Yi | y(i))) ,

38 which can be used to detect systematic departures from model assumptions.

Bayesian P-Values

Recall: the frequentist p-value based on observed statistic T (y): P (T (Y) ≥ T (y) | H0) where Y and y are iid given θ.

The posterior predictive p-value under M0: y ∼ f (y | θ), θ ∼ π (θ), for a statistic T (y, θ) is

pT = P (T (Y, θ) ≥ T (y, θ) | M0, y) = ∫ P (T (Y, θ) ≥ T (y, θ) | θ) p (θ | M0, y) dθ .

pT is called a Bayesian p-value. More generally, replace ‘≥’ with ‘more extreme than.’

• T can be a function of data (y) and parameters (θ). For example, the goodness-of-fit statistic

T (y, θ) = Σ_{i=1}^n [yi − E (Yi | θ)]² / Var (Yi | θ) .

• Low pT can indicate problems with M0. That is, the observed data are not plausible under the predictive model.

• Little decision-theoretic justification.

• Not recommended as sole basis for comparing models.

Example (Parental Remorse). Interested in θ: proportion of parents who regret having children.

• Prior for θ: Beta(1, 9)

• In a small survey of n = 10 longtime parents, y = 9 say they regret the choice.

• Posterior for θ: Beta(10, 10)

What is the posterior predictive p-value? If y ∼Binomial(n, θ) and π (θ) =Beta(a, b), then

P (Y = k | y = 9) = ∫ C_k^n θ^k (1 − θ)^{n−k} · [θ^{a−1} (1 − θ)^{b−1} / B (a, b)] dθ = (C_k^n / B (a, b)) ∫ θ^{k+a−1} (1 − θ)^{n−k+b−1} dθ = C_k^n B (a + k, b + n − k) / B (a, b) ,

where B is the Beta function.

a = 10; b = 10; n = 10
Pk_y = c()
for(k in 0:n){ i = k+1; Pk_y[i] = choose(n,k)*beta(a+k,b+n-k)/beta(a,b) }
post_pvalue = Pk_y[10] + Pk_y[11]; post_pvalue

## [1] 0.02889

Here we show prior and posterior distribution for θ on the left and P (Y ≥ 9 | y = 9) on the right.

The pT suggests that our prior was overly strong and we should have been less optimistic about parental remorse. Note. Posterior predictive p-values have been criticized for “using the data twice:” (i) use data for reference distribution p (θ | y) and (ii) use data for observed statistic T (y, θ). The practical implication of this has been debated: Read blog article “posterior predictive p-values” and related discussion in comments. Alternatively, for training data y1 and test data y2, compute

pT′ = P (T (Y2, θ) ≥ T (y2, θ) | M0, y1).

7 Hierarchical Model

Bayesian Hierarchical Models

• Standard model: y ∼ f (y | θ1), θ1 ∼ π1 (θ1). The joint distribution of y, θ1 is p (y, θ1) = f (y | θ1) π1 (θ1). If π1 has hyper-parameters θ2: p (y, θ1, θ2) = f (y | θ1) π1 (θ1 | θ2) π2 (θ2). Then a Bayesian hierarchical model of level l is

p (y, θ1, θ2, . . . , θl) = f (y | θ1) π1 (θ1 | θ2) π2 (θ2 | θ3) ··· πl (θl | η).

• The marginal prior for θ1 is

π* (θ1) = ∫ ··· ∫ π1 (θ1 | θ2) ··· πl (θl | η) dθ2 ··· dθl ;

if π1, . . . , πl are proper, so is π*.

• The marginal, posterior, and posterior predictive can be derived from the marginal prior π* in the standard way for y = (y1, . . . , yn) iid ∼ f (· | θ1):

p (y) = ∫ f (y | θ1) π* (θ1) dθ1 ,
p (θ1 | y) = f (y | θ1) π* (θ1) / ∫ f (y | θ1) π* (θ1) dθ1 ,
p (yn+1 | y) = ∫ f (yn+1 | θ1) p (θ1 | y) dθ1 .

40 Example (Liver Disease). Give a collection of mice a dose of alcohol. Observe which mice develop liver disease. Each mouse comes from one of 50 genetic strains. For i = 1,..., 50, let ni be the number of mice from strain i and yi be the number of mice with liver disease from strain i.

• Assume each strain has probability θi of developing liver disease: yi ∼Binomial(ni, θi).

• Model θi’s as iid from Beta(a, b) distribution.

p(θi | yi, a, b) ∝ p(yi | θi) p(θi | a, b)  =⇒  p(θi | yi, a, b) = Beta(a + yi, b + ni − yi),

because the Beta prior p(θi | a, b) is conjugate for the binomial likelihood (Beta-Binomial model).

p(yi | a, b) = ∫ p(yi | θi) p(θi | a, b) dθi = (ni choose yi) ∫ θi^(yi) (1 − θi)^(ni−yi) · θi^(a−1)(1 − θi)^(b−1)/B(a, b) dθi
= (ni choose yi) (1/B(a, b)) ∫ θi^(yi+a−1) (1 − θi)^(ni−yi+b−1) dθi = (ni choose yi) B(a + yi, b + ni − yi)/B(a, b),

where the integrand is the kernel of Beta(a + yi, b + ni − yi), and

p(y | a, b) = ∏i=1..50 p(yi | a, b)  (because the strains are independent).

• How to choose a and b? One approach is to put a prior on (a, b). For computational simplicity, we’ll use a discrete uniform prior:

        b = 1   b = 2   b = 3
a = 1   1/9     1/9     1/9
a = 2   1/9     1/9     1/9
a = 3   1/9     1/9     1/9

• The posterior probability of (a = a0, b = b0) is

p(a = a0, b = b0 | y) = p(y | a = a0, b = b0) P(a = a0, b = b0) / p(y)
= p(y | a = a0, b = b0) / Σi=1..3 Σj=1..3 p(y | a = i, b = j),

since P(a, b) = 1/9 for all (a, b) is a constant that cancels from numerator and denominator.

Thus, the posterior probabilities for a, b:

StrainData= read.csv('http://www.tc.umn.edu/~elock/MiceData.csv')
n= StrainData$n
Yi= StrainData$Yi

Yi_marg <- function(ni,yi,a,b){choose(ni,yi)*beta(a+yi,b+ni-yi)/beta(a,b)}
Y_marg <- function(n,y,a,b){prod(Yi_marg(n,y,a,b))} # product of marginals for each Yi

Norm= Y_marg(n,Yi,1,1)+Y_marg(n,Yi,1,2)+Y_marg(n,Yi,1,3)+
  Y_marg(n,Yi,2,1)+Y_marg(n,Yi,2,2)+Y_marg(n,Yi,2,3)+
  Y_marg(n,Yi,3,1)+Y_marg(n,Yi,3,2)+Y_marg(n,Yi,3,3)
y_marg_ab <- matrix(NA,3,3)
for(a in 1:3){
  for(b in 1:3){
    y_marg_ab[a,b] <- Y_marg(n,Yi,a,b)/Norm
  }
}
print(y_marg_ab, digits=2)

##         [,1]    [,2]    [,3]
## [1,] 3.5e-07 1.4e-01 8.6e-01
## [2,] 7.7e-22 9.2e-10 8.1e-05
## [3,] 3.4e-36 5.7e-20 4.3e-12

• The posterior distribution for θi is p(θi | y) = Σ(a,b) p(θi | yi, a, b) P(a, b | y); that is,

p(θi | y) = 0.14 Beta(1 + yi, 2 + ni − yi) + 0.86 Beta(1 + yi, 3 + ni − yi).

Then the posterior expectation for θi is E[θi | y] = 0.14 (1 + yi)/(3 + ni) + 0.86 (1 + yi)/(4 + ni). Here are the histograms of the raw proportions θ̂i = yi/ni (on the left) and E[θi | y] (on the right).


• The posterior for the probability of liver disease in an unobserved strain, θ51, is

p(θ51 | y) = Σ(a,b) p(θ51 | a, b) p(a, b | y) = 0.14 Beta(1, 2) + 0.86 Beta(1, 3),

since for an unobserved strain there is no data to inform θ51 (equivalent to setting n51 = 0 and y51 = 0).

Note. The marginal prior (over a, b) for θi is π∗(θi) = Σl=1..3 Σm=1..3 (1/9) Beta(a = l, b = m).

[Figure: densities for θi (left) and θ51 (right).]

• The prior predictive pmf for the ni mice from strain i is

p(Yi = yi) = Σ(a,b) P(Yi = yi | a, b) P(a, b) = Σl=1..3 Σm=1..3 (1/9) (ni choose yi) B(l + yi, m + ni − yi)/B(l, m).

The full marginal prior pmf p(y) is

p(y) = Σl=1..3 Σm=1..3 (1/9) ∏i=1..50 (ni choose yi) B(l + yi, m + ni − yi)/B(l, m).

Suppose a mouse's susceptibility to liver disease does not depend on genetics (θ1 = θ2 = · · · = θ50 = θ), with yi ∼ Binomial(ni, θ) and θ ∼ Beta(1, 1). Denote this model M0, in contrast to the previous hierarchical model M1.
Under M0, y ∼ Binomial(n, θ) where n = Σi=1..50 ni and y = Σi=1..50 yi. We observe n = 356 and y = 92, so by the conjugacy of the Beta-Binomial model

p(θ | y, M0) = Beta(1 + 92, 1 + 356 − 92) = Beta(93, 265).

• Marginal for M0:

p(y | M0) = ∫ p(y | θ) π(θ | M0) dθ = ∫ ∏i=1..50 p(yi | θ) · 1 dθ = ∏i=1..50 (ni choose yi) ∫ θ^(Σi yi) (1 − θ)^(Σi (ni−yi)) dθ = ∏i=1..50 (ni choose yi) B(93, 265),

where the integrand is the kernel of Beta(Σi yi + 1, Σi (ni − yi) + 1) = Beta(93, 265).

• Marginal for M1:

p(y | M1) = Σ(a,b) P(a, b) ∏i=1..50 p(yi | a, b) = Σl=1..3 Σm=1..3 (1/9) ∏i=1..50 (ni choose yi) B(l + yi, m + ni − yi)/B(l, m).

Marg_M1= 1/9*(Y_marg(n,Yi,1,1)+Y_marg(n,Yi,1,2)+Y_marg(n,Yi,1,3)+
  Y_marg(n,Yi,2,1)+Y_marg(n,Yi,2,2)+Y_marg(n,Yi,2,3)+
  Y_marg(n,Yi,3,1)+Y_marg(n,Yi,3,2)+Y_marg(n,Yi,3,3))
Marg_M0= prod(choose(n,Yi))*beta(93,265)
BF= Marg_M1/Marg_M0; BF

## [1] 23987863

The Bayes factor for M1 over M0 is greater than 10,000, convincing evidence that the hierarchical model is superior.

Note (Structural Form of M0 and M1). Begin with the fixed distribution and count from the first layer of random variable to the data.

M0 (Standard Model): for fixed (a, b),

Beta(a, b) →(1) θi → (ni, yi),  i = 1, . . . , n.

M1 (Hierarchical Model with Level 2): for a fixed prior distribution Π on (a, b),

Π →(1) (a, b), and then Beta(a, b) →(2) θi → (ni, yi),  i = 1, . . . , n.

Hierarchical Normal Model

Assume y1, . . . , yn ∼ Normal(θ1, σ1²), θ1 ∼ Normal(θ2, σ2²), . . . , θl ∼ Normal(µ∗, τ²). Then

π∗(θ1) = Normal(µ∗, σ2² + σ3² + · · · + σl² + τ²),

where the variances σi², i = 1, . . . , l, are known.

• Assume yij ∼ Normal(θi, σ²) for group i and observations j = 1, . . . , ni within each group, where σ² is known.

• Assume the θi's are iid Normal(µ, τ²) with τ² known, and put a prior π(µ) on µ.

• Given yi = (yi1, . . . , yi,ni) and fixed θi and σ²,

f(yi | θi, σ²) ∝ Normal(ȳi | θi, σi²)  where σi² = σ²/ni,

so it suffices to consider the group means ȳi:

– ȳi is a sufficient statistic for yi.

• With known variances σ² and τ², the full joint distribution is

p(y, θ, µ) = ∏i=1..m Normal(ȳi | θi, σi²) · ∏i=1..m Normal(θi | µ, τ²) · π(µ),

where the first product is f(y | θ; σ²) and the remainder is f(θ | µ; τ²) · π(µ).

• The posterior distribution for µ is p(µ | y):

p(µ | y) ∝ f(y | µ) π(µ) = ∏j=1..m f(yj | µ) · π(µ) ∝ ∏j=1..m f(ȳj | µ) · 1
∝ exp{ −Σj (ȳj − µ)² / [2(σj² + τ²)] }
∝ exp{ −(Σj wj / 2) (µ − Σj wj ȳj / Σj wj)² },  with wj = (σj² + τ²)^(−1),

so that

p(µ | y) = Normal( Σj wj ȳj / Σj wj , (Σj wj)^(−1) ),

where the second line uses f(yj | µ) ∝ f(ȳj | µ) with ȳj ∼ Normal(µ, σj² + τ²) (θj integrated out).
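As a quick numerical check of this precision-weighted form of p(µ | y), here is a small Python sketch; the group means and variances below are made-up illustrative values, not the course data:

```python
# Posterior for mu with known sigma_j^2 and tau^2:
#   mu | y ~ Normal( sum(w_j * ybar_j) / sum(w_j), 1 / sum(w_j) ),
# where w_j = 1 / (sigma_j^2 + tau^2) are the precision weights.

ybar = [1.0, 3.0, 5.0]      # hypothetical group means
sigma2 = [1.0, 1.0, 4.0]    # hypothetical sigma^2 / n_j for each group
tau2 = 1.0

w = [1.0 / (s + tau2) for s in sigma2]
post_mean = sum(wj * yj for wj, yj in zip(w, ybar)) / sum(w)
post_var = 1.0 / sum(w)
print(post_mean, post_var)  # groups with smaller variance get more weight
```

Here the third group, with the largest variance, pulls the posterior mean the least.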

Note. Suppose yj = (yj1, . . . , yj,nj) where the yji are iid Normal(θj, σ²) and θj ∼ Normal(µ, τ²). Then yji = θj + εji with εji iid Normal(0, σ²), and hence

ȳj = (1/nj) Σi yji = θj + (1/nj) Σi εji,
E(ȳj) = E(θj) + (1/nj) Σi E(εji) = µ + 0 = µ,
Var(ȳj) = Var(θj) + (1/nj²) · nj · Var(εji) = τ² + σ²/nj = τ² + σj².

Example (IQ Scores). Human IQs have variance 225 and are centered at 100. We want to infer the calibration µ of a certain IQ test. Sample 20 individuals, where individual i takes the test ni times:

• yij ∼ Normal(θi, σ² = 64) for all i and j,

• θi ∼ Normal(µ, τ² = 225) for all i,

• π(µ) = 1 for all µ.

• The posterior for µ is Normal(108.7, 13.0).

Thus P(µ < 100 | y) ≈ 0.01, suggesting that the IQ test is too generous.

The Normal-Gamma Model

• Assume y = (y1, . . . , yn) are iid Normal(µ, σ²) with σ² known.

• If π(µ) = Normal(µ0, τ²), then

p(µ | y) = Normal( (σ²µ0 + nτ²ȳ) / (σ² + nτ²), σ²τ² / (σ² + nτ²) ).

• The precision is the reciprocal of the variance. Thus, the posterior precision is 1/τ² + n/σ², where 1/τ² is the prior precision and n/σ² is the data precision.
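A one-line numerical check of this precision identity, as a Python sketch; the values of µ0, τ², σ², n, and ȳ are arbitrary stand-ins:

```python
# Normal-normal update with sigma^2 known and prior mu ~ Normal(mu0, tau2).
mu0, tau2 = 0.0, 4.0            # hypothetical prior
sigma2, n, ybar = 1.0, 10, 2.0  # hypothetical data summary

post_mean = (sigma2 * mu0 + n * tau2 * ybar) / (sigma2 + n * tau2)
post_var = (sigma2 * tau2) / (sigma2 + n * tau2)

# posterior precision = prior precision + data precision
assert abs(1.0 / post_var - (1.0 / tau2 + n / sigma2)) < 1e-12
print(post_mean, post_var)
```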

Conjugate Normal with µ Known and σ2 Unknown

• Assume y = (y1, . . . , yn) are iid Normal(µ, σ²), with µ known and σ² unknown.

• Sampling model (likelihood):

f(y | σ²) ∝ (1/σ²)^(n/2) exp[ −Σi=1..n (yi − µ)² / (2σ²) ].

Given y and µ, the right-hand side is proportional to a Gamma density in 1/σ².

• The conjugate prior for σ² is the inverse gamma distribution (IG).
Note. If X ∼ Gamma(α, β), then 1/X ∼ IG(α, β), the inverse gamma distribution.

Assume the prior distribution for σ² is π(σ²) = IG(α, β) with α, β > 0:

π(σ²) = [β^α / Γ(α)] (1/σ²)^(α+1) exp(−β/σ²).

Then the posterior distribution for σ² is

p(σ² | y) ∝ f(y | σ²) π(σ²) ∝ (1/σ²)^(n/2) exp[ −Σi (yi − µ)²/(2σ²) ] · (1/σ²)^(α+1) exp(−β/σ²)
= (1/σ²)^(α+n/2+1) exp{ −[ β + Σi (yi − µ)²/2 ] / σ² }
∝ IG( α + n/2, β + Σi (yi − µ)²/2 ).

That is, p(σ² | y) = IG(α + n/2, β + (n/2)W) where W = (1/n) Σi (yi − µ)². Thus the prior distribution for σ² is conjugate.

• For σ² ∼ IG(α, β), the prior mean and variance are

E[σ²] = β/(α − 1)  for α > 1,
Var(σ²) = β² / [(α − 1)²(α − 2)]  for α > 2.

Solving for α, β:

α = (E[σ²])² / Var(σ²) + 2  and  β = E[σ²] { (E[σ²])² / Var(σ²) + 1 }.
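These inversion formulas are easy to sanity-check numerically; a small Python sketch (the target moments are chosen arbitrarily):

```python
# Moment-matching for sigma^2 ~ IG(alpha, beta):
#   E = beta / (alpha - 1),   Var = beta^2 / ((alpha - 1)^2 (alpha - 2)),
# inverted as alpha = E^2/Var + 2 and beta = E * (E^2/Var + 1).

def ig_params(mean, var):
    alpha = mean ** 2 / var + 2
    beta = mean * (mean ** 2 / var + 1)
    return alpha, beta

alpha, beta = ig_params(10.0, 100.0)   # e.g. target E = 10, Var = 100
print(alpha, beta)                     # -> 3.0 20.0

# plug back into the IG moment formulas
assert abs(beta / (alpha - 1) - 10.0) < 1e-12
assert abs(beta ** 2 / ((alpha - 1) ** 2 * (alpha - 2)) - 100.0) < 1e-12
```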

Conjugate Normal with both µ and σ2 Unknown

• Assume y = (y1, . . . , yn) are iid Normal(µ, σ²), with both µ and σ² unknown.

• A conjugate prior for (µ, σ²) is

π(σ²) = IG(α, β),
π(µ | σ²) = Normal(µ0, σ²/λ),

where the prior for µ depends on σ², and λ reflects the confidence/influence of the prior (larger λ means more confidence in the prior hypermean µ0).

• By the normal-normal model with σ² known,

p(µ | σ², y) = Normal( (λµ0 + nȳ)/(λ + n), σ²/(λ + n) ).

By the model with µ known, p(σ² | µ, y) ∝ p(µ | σ², y) π(σ²):

p(σ² | µ, y) = IG( α + n/2, β + (n/2)W ).

By integrating over µ, p(σ² | y) ∝ ∫ f(y | µ, σ²) π(µ | σ²) π(σ²) dµ:

p(σ² | y) = IG( α + (n − 1)/2, β + (1/2) Σi (yi − ȳ)² ).

8 Bayesian Linear Models

For observations y1, . . . , yn, the basic linear model is yi = x1iβ1 + · · · + xpiβp + εi, where x1i, . . . , xpi are the predictors for the i-th observation and εi is the error term. In matrix form, y = Xβ + ε, where y, ε ∈ R^(n×1), X ∈ R^(n×p), and β ∈ R^(p×1). Assume X is fixed and the errors are iid normal with equal variance: ε ∼ Normal(0, σ²I). The standard frequentist estimates are

β̂ = (XᵀX)^(−1) Xᵀy  and  σ̂² = s² = (1/(n − p)) (y − Xβ̂)ᵀ(y − Xβ̂).

These estimates are unbiased and can be motivated by least squares.

Bayesian Framework with Uninformative Priors

Under a Bayesian framework, we put a prior on β and σ². Consider a uniform prior for β and the Jeffreys prior for σ²:

π(β, σ²) = π(β) · π(σ²) = 1 · (1/σ²) = 1/σ².

To derive the Jeffreys prior for σ², take y ∼ Normal(µ, σ²I) with µ = Xβ held fixed (so terms involving only β cancel as constants):

log f(y | σ²) = const − (n/2) log σ² − Σi (yi − µi)²/(2σ²),
I(σ²) = −Ey|θ[ ∂²/∂(σ²)² log f(y | σ²) ] = −Ey|θ[ n/(2σ⁴) − Σi (yi − µi)²/σ⁶ ] = −n/(2σ⁴) + nσ²/σ⁶ = n/(2σ⁴),

so the Jeffreys prior is π(σ²) ∝ √I(σ²) ∝ 1/σ².

• The posterior for β, given σ², is p(β | y, σ²) = Normal( β̂, σ² (XᵀX)^(−1) ).

Consider the projection matrix onto the column space of X: PX = X(XᵀX)^(−1)Xᵀ. The fitted values are Ey|β,σ²-type projections ŷ = Xβ̂ = PX y, and Xᵀ(y − Xβ̂) = 0. Decomposing y − Xβ = (y − Xβ̂) + X(β̂ − β),

p(β | y, σ²) ∝ f(y | β, σ²) π(β, σ²)
∝ exp[ −(1/(2σ²)) (y − Xβ)ᵀ(y − Xβ) ]
= exp[ −(1/(2σ²)) { (y − Xβ̂)ᵀ(y − Xβ̂) + (β − β̂)ᵀXᵀX(β − β̂) } ]
∝ exp[ −(1/2) (β − β̂)ᵀ [σ²(XᵀX)^(−1)]^(−1) (β − β̂) ] = Normal( β̂, σ²(XᵀX)^(−1) ),

where the cross term vanishes because Xᵀ(y − Xβ̂) = 0.

• The marginal posterior for σ² is

p(σ² | y) ∝ ∫ f(y | β, σ²) π(β, σ²) dβ ∝ IG( (n − p)/2, (n − p)s²/2 ),

where f(y | β, σ²) = MVN(y | Xβ, σ²I). Thus, for the uninformative prior π(β, σ²) ∝ 1/σ², the variance estimate is

E[σ² | y] = [ (n − p)s²/2 ] / [ (n − p)/2 − 1 ] = s² (n − p)/(n − p − 2).

However, the expected precision is

E[1/σ² | y] = 1/s²,

which is why s² is commonly used as the point estimate for the error variance.

Note. (i) If U ∼ IG(a, b), then E[U] = b/(a − 1) for a > 1. (ii) If V = 1/U ∼ Gamma(a, b), then E[V] = a/b. (iii) If U ∼ χ²v, then c · U ∼ Gamma(v/2, scale 2c). Thus,

σ² | y ∼ (n − p)s² / U  where U ∼ χ²(n−p).
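The χ² representation gives an easy way to simulate from the marginal posterior of σ²; a Python sketch with stand-in values for n, p, and s² (not the course data), drawing χ²(n−p) as a sum of squared standard normals:

```python
import random

random.seed(1)
n, p, s2 = 30, 5, 2.0        # hypothetical n, p, and s^2
df = n - p

def draw_sigma2():
    # U ~ chi^2_df as a sum of df squared standard normals
    u = sum(random.gauss(0.0, 1.0) ** 2 for _ in range(df))
    return df * s2 / u       # sigma^2 | y  =  (n - p) s^2 / U

draws = [draw_sigma2() for _ in range(20000)]
mc_mean = sum(draws) / len(draws)
exact = s2 * df / (df - 2)   # E[sigma^2 | y] = s^2 (n - p) / (n - p - 2)
print(mc_mean, exact)        # the two should agree closely
```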

• The marginal posterior for βi is a shifted and scaled t-distribution:

(βi − β̂i) / [ s √((XᵀX)^(−1)ii) ] ∼ t(n−p).

• For a new predictor vector xn+1, the posterior predictive for yn+1 is also a shifted and scaled t-distribution:

(yn+1 − xᵀn+1 β̂) / [ s √(1 + xᵀn+1 (XᵀX)^(−1) xn+1) ] ∼ t(n−p).

Remark. All of the results above under the joint prior π(β, σ²) ∝ 1/σ² correspond to the standard frequentist inference for linear regression.

• Residuals

– Define the Bayesian residual as r′i = yi − E[Yi | y(i)], where y(i) = (y1, . . . , yi−1, yi+1, . . . , yn).
– Alternatively, r′i = yi − xᵀi β̂(i), where β̂(i) = (Xᵀ(i)X(i))^(−1) Xᵀ(i) y(i).

Compared to the standard residual ri = yi − xᵀi β̂, the Bayesian residual corrects the over-fitting of the standard residual, since it does not use observation i to predict itself.

Example (Body Fat). To complete

Data= read.csv('http://www.lock5stat.com/datasets/BodyFat.csv')
Y_tilde= Data$Bodyfat
Y_tilde_bar= mean(Y_tilde)
s_y_tilde= sd(Y_tilde)
Y= Y_tilde-18.5

X_tilde= as.matrix(Data[,2:10])
X= scale(X_tilde)

Bayesian Framework with Normal-Inverse Gamma Prior

• Consider independent normal priors for the βi’s:

β | σ² ∼ Normal(0, σ²T),  where Tij = τi² if i = j and 0 otherwise.

The prior for the βi's is a normal distribution centered at 0, which encodes the null hypothesis H0: βi = 0 for all i. We then expect the data to reject H0 and show that βi is significantly different from 0; in other words, that xi is a significant predictor of y.

• An inverse gamma prior for σ2: σ2 ∼ IG (a, b).

• The full prior is

π(β, σ²) = ∏i=1..p Normal(βi | 0, σ²τi²) · IG(σ² | a, b).

• The posterior for β, given σ2, is

p(β | y, σ²) = Normal( β̃, σ² Vβ ),

where β̃ = (XᵀX + T^(−1))^(−1) Xᵀy and Vβ = (XᵀX + T^(−1))^(−1).

p(β | y, σ²) ∝ p(y, β, σ²) = p(y | β, σ²) · π(β, σ²)
∝ exp[ −(1/(2σ²)) (y − Xβ)ᵀ(y − Xβ) ] · exp[ −(1/(2σ²)) βᵀT^(−1)β ]
= exp[ −(1/(2σ²)) ( βᵀXᵀXβ − 2βᵀXᵀy + βᵀT^(−1)β ) ]
= exp[ −(1/(2σ²)) ( βᵀ(XᵀX + T^(−1))β − 2βᵀXᵀy ) ]
∝ Normal( (XᵀX + T^(−1))^(−1) Xᵀy, σ² (XᵀX + T^(−1))^(−1) ),

where the IG(σ² | a, b) factor drops out because it does not involve β.

• The estimate β̃ solves a penalized least-squares criterion: β̃ = arg minβ ‖y − Xβ‖² + Σi=1..p βi²/τi². The penalized criterion is

‖y − Xβ‖² + Σi=1..p βi²/τi² = (y − Xβ)ᵀ(y − Xβ) + βᵀT^(−1)β = yᵀy − 2βᵀXᵀy + βᵀXᵀXβ + βᵀT^(−1)β.

The first-order condition with respect to β is −2Xᵀy + 2XᵀXβ + 2T^(−1)β = 0; that is, β = (XᵀX + T^(−1))^(−1) Xᵀy.
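The shrinkage implied by this penalized criterion is easiest to see with a single predictor; a Python sketch with hypothetical data (one coefficient, prior mean 0, prior variance σ²τ²):

```python
# One-predictor version of beta_tilde = (X'X + T^{-1})^{-1} X'y:
#   beta_tilde = sum(x*y) / (sum(x^2) + 1/tau^2)

x = [1.0, 2.0, 3.0, 4.0]    # hypothetical predictor
y = [2.1, 3.9, 6.2, 8.1]    # hypothetical response
tau2 = 0.5

sxy = sum(xi * yi for xi, yi in zip(x, y))
sxx = sum(xi * xi for xi in x)

beta_ols = sxy / sxx                    # least-squares estimate
beta_tilde = sxy / (sxx + 1.0 / tau2)   # penalized / posterior mean

print(beta_ols, beta_tilde)
assert abs(beta_tilde) < abs(beta_ols)  # shrunk toward the prior mean 0
```

Smaller τ² (a tighter prior around 0) shrinks β̃ more strongly.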

• The marginal posterior for σ² is p(σ² | y) = IG(an, bn), where an = a + n/2 and bn = b + (1/2)[ yᵀy − β̃ᵀ Vβ^(−1) β̃ ].

• The marginal posterior for βi is a shifted and scaled t-distribution:

(βi − β̃i) / √( (bn/an) (Vβ)ii ) ∼ t(2an), i.e. t(2a+n).

• For a new predictor vector xn+1, the posterior predictive for yn+1 given σ² is

yn+1 | σ², y ∼ Normal( xᵀn+1 β̃, σ² (1 + xᵀn+1 Vβ xn+1) ).

To see this, let yn+1 = xᵀn+1 β + εn+1 with εn+1 ∼ Normal(0, σ²), and β | y ∼ MVN(β̃, σ² Vβ). Write β = β̃ + ε′ where ε′ ∼ MVN(0, σ² Vβ) is independent of εn+1. Then yn+1 = xᵀn+1 β̃ + xᵀn+1 ε′ + εn+1, so E(yn+1) = xᵀn+1 β̃ and

Var(yn+1) = E[ (xᵀn+1 ε′ + εn+1)(xᵀn+1 ε′ + εn+1)ᵀ ] = xᵀn+1 (σ²Vβ) xn+1 + σ² = σ² (1 + xᵀn+1 Vβ xn+1).

Thus the full posterior predictive distribution is a shifted and scaled t-distribution:

(yn+1 − xᵀn+1 β̃) / √( (bn/an)(1 + xᵀn+1 Vβ xn+1) ) ∼ t(2an), i.e. t(2a+n).

9 Empirical Bayes Approaches

Example (Body Fat (cont.)). Use an iid normal prior for the βi's, β ∼ Normal(0, σ²τ²I), and an IG(a, b) prior for σ².

• Choose a and b such that E[σ²] = 10 and Var(σ²) = 100.

Since σ² ∼ IG(a, b) has E[σ²] = b/(a − 1) for a > 1 and Var(σ²) = b²/[(a − 1)²(a − 2)], this gives a = 3 and b = 20.

• Choose the hyper-parameter τ²:

– Use subjective intuition, or put a prior on τ².

– Alternatively, choose τ² empirically: τ̂² = arg maxτ² p(y | τ²). Then fix τ² = τ̂² and proceed as if it were known.

• For the normal-inverse gamma regression model with the βi's iid with prior variance σ²τ²,

p(y | σ², τ²) = Normal( 0, σ² (I + τ² XXᵀ) ).

To see this, let y = Xβ + ε where β ∼ Normal(0, σ²τ²I) and ε ∼ Normal(0, σ²I). Then E[y] = 0 and

Var(y) = E[ (Xβ + ε)(Xβ + ε)ᵀ ] = X E[ββᵀ] Xᵀ + E[εεᵀ] = σ² (I + τ² XXᵀ).

• The marginal for y is a multivariate t-distribution with 2a degrees of freedom:

p(y | τ²) = MVT2a( 0, (b/a) (I + τ² XXᵀ) ).

Example (Body Fat (cont.)). Choose τ² empirically.

library("mvtnorm") ## load the mvtnorm package for the multivariate t-distribution
n= length(Y)
## Plot marginal density of tau^2 where a=3, b=20 and df=2a=6
marY_hypvar= function(hypvar){
  dmvt(Y, delta=rep(0,n), sigma=(20/3)*(diag(n)+hypvar*X%*%t(X)), df=6)
}
index <- seq(0,5, length=1000)
marg= sapply(index, marY_hypvar)
maximizer= index[which.max(marg)]; maximizer

## [1] 0.6206

[Figure: log marginal density of y as a function of tau^2, maximized near 0.62.]

beta_tilde= solve(t(X)%*%X+(1/0.62)*diag(9))%*%t(X)%*%Y
a_n= 3+50
b_n= 20+(1/2)*(t(Y)%*%Y-t(beta_tilde)%*%(t(X)%*%X+(1/0.62)*diag(9))%*%beta_tilde)
var_mat= as.numeric(b_n/a_n)*solve(t(X)%*%X+(1/0.62)*diag(9))
cred_lower= beta_tilde-sqrt(diag(var_mat))*qt(0.975,106) # 2*a_n=106
cred_upper= beta_tilde+sqrt(diag(var_mat))*qt(0.975,106)
table= cbind(round(beta_tilde,3), round(cred_lower,3), round(cred_upper,3))
print(table)

##            [,1]   [,2]   [,3]
## Age       1.193  0.157  2.229
## Weight   -0.465 -4.127  3.197
## Height   -0.380 -1.602  0.842
## Neck     -0.143 -1.749  1.463
## Chest    -0.642 -2.999  1.714
## Abdomen   8.704  6.286 11.123
## Ankle     0.042 -1.300  1.384
## Biceps    0.207 -1.104  1.519
## Wrist    -2.176 -3.683 -0.668

The Empirical Bayes Approach

An empirical Bayes (EB) method is one in which the prior is estimated from the data. It is useful when borrowing strength across multiple related parameters θ = (θ1, . . . , θk): model the θi's as iid from a shared estimated prior π̂(·).

• Parametric EB: Estimate π from a parametric family with hyperparameters η: π(· | η).

• Nonparametric EB: Assume no parametric form for π.

Parametric Empirical Bayes

For the parametric EB approach, we consider π(· | η) and estimate η. A natural estimate is the maximizer of the marginal density:

η̂ = arg maxη p(y | η) = arg maxη ∫ f(y | θ) π(θ | η) dθ.

In the previous regression example, θ = β and η = τ². Alternatively, one could estimate η by matching moments or some other approach.

Non-parametric Empirical Bayes

A non-parametric EB approach does not assume a parametric form for π. Instead, base inference directly on the marginal distribution

m(y) = ∫ f(y | θ) π(θ) dθ.

Example (Robbins' Method). Suppose Yi | θi ∼ Poisson(θi), with an unspecified prior on the θi. Estimate

θ̂i = (yi + 1) · #{y's equal to yi + 1} / #{y's equal to yi}.

Solution: Note that

f(Yi = yi | θi) = θi^(yi) e^(−θi) / yi!  and  f(θi | Yi = yi) = f(yi | θi) π(θi) / ∫ f(yi | θi) π(θi) dθi.

Therefore,

E[θi | yi] = ∫ θi f(θi | yi) dθi = ∫ θi f(yi | θi) π(θi) dθi / ∫ f(yi | θi) π(θi) dθi
= (yi + 1) ∫ [ θi^(yi+1) e^(−θi)/(yi + 1)! ] π(θi) dθi / ∫ f(yi | θi) π(θi) dθi
= (yi + 1) mπ(yi + 1) / mπ(yi).

Let m̂π(y) = (1/n) Σi=1..n 1{yi = y}. Then

θ̂i = (yi + 1) m̂π(yi + 1) / m̂π(yi).
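Robbins' estimate is just counting; a Python sketch on a small hypothetical vector of Poisson counts:

```python
from collections import Counter

# Robbins' nonparametric EB estimate for Poisson means:
#   theta_hat_i = (y_i + 1) * #{y's equal to y_i + 1} / #{y's equal to y_i}

y = [0, 0, 0, 1, 1, 1, 2, 2, 3, 5]   # hypothetical counts
counts = Counter(y)

def robbins(yi):
    return (yi + 1) * counts[yi + 1] / counts[yi]

print([round(robbins(yi), 3) for yi in y])
```

Note the estimate is 0 for the largest observed counts (no y equal to yi + 1 appears in the data), a known weakness of the raw Robbins estimator.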

Example (Liver Disease (cont.)). Recall: mice from 50 genetic strains.

• ni is the total number of mice from strain i

• yi is the number of mice with liver disease from strain i

• θi is the probability that a mouse from strain i develops liver disease: yi ∼ Binomial(ni, θi), where the θi's are iid from a Beta(a, b) distribution. Estimate a and b empirically.

Solution: Consider the reparametrization µ = a/(a + b) and M = a + b. The mean and variance of a beta-distributed variable θ are

E(θ) = µ  and  Var(θ) = µ(1 − µ)/(M + 1).

To compute the mean and variance of Y in terms of µ and M, use

– the Law of Total Expectation: EY[Y] = Eθ[ EY|θ Y ],
– the Law of Total Variance: VarY[Y] = Eθ[ VarY|θ Y ] + Varθ[ EY|θ Y ].

Since Y | θ ∼ Binomial(n, θ), EY|θ Y = nθ and VarY|θ Y = nθ(1 − θ). Let θ̂ = Y/n. Then

EY[θ̂] = Eθ[ EY|θ(Y/n) ] = Eθ[θ] = µ,
VarY[θ̂] = Eθ[ VarY|θ θ̂ ] + Varθ[ EY|θ θ̂ ] = Eθ[ θ(1 − θ)/n ] + Varθ[θ]
= (1/n)( Eθθ − Eθθ² ) + Varθθ = (1/n)[ µ − µ² − Varθθ ] + Varθθ
= µ(1 − µ)/n + [(n − 1)/n] · µ(1 − µ)/(M + 1)
= [ µ(1 − µ)/n ] [ 1 + (n − 1)/(M + 1) ].

Hence, letting θ̂i = yi/ni,

E[θ̂i] = µ  and  Var(θ̂i) = [ µ(1 − µ)/ni ] [ 1 + (ni − 1)/(M + 1) ].

• Consider an “iterated method of moments” estimator based on y1, . . . , ym: the estimator of the mean E[θ̂i] = µ is

µ̂ = (1/m) Σi=1..m yi/ni,

and, plugging in µ̂, the variance satisfies approximately

s² = (1/m) Σi=1..m Var(θ̂i) ≈ (1/m) Σi=1..m [ µ̂(1 − µ̂)/ni ] [ 1 + (ni − 1)/(M̂ + 1) ],

where s² is the sample variance of the θ̂i's. Thus,

M̂ = [ µ̂(1 − µ̂) − s² ] / [ s² − (µ̂(1 − µ̂)/m) Σi=1..m (1/ni) ].

Therefore, â = µ̂M̂ and b̂ = â(1 − µ̂)/µ̂ = M̂ − µ̂M̂.

Note. Since θi ∼ Beta(a, b), we have θi | yi, ni, M̂, µ̂ ∼ Beta(â + yi, b̂ + ni − yi). Thus,

θ̃i = E[θi | yi, ni, M̂, µ̂] = (â + yi)/(â + yi + b̂ + ni − yi) = (µ̂M̂ + yi)/(M̂ + ni)
= B̂i µ̂ + (1 − B̂i)(yi/ni) = B̂i µ̂ + (1 − B̂i) θ̂i,

that is, θ̃i = B̂i µ̂ + (1 − B̂i) θ̂i, where B̂i = M̂/(M̂ + ni) is the “shrinkage factor.”
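A Python sketch of the shrinkage formula; the values of µ̂, M̂, and the (ni, yi) pairs below are hypothetical, not the mice data:

```python
# EB shrinkage for the beta-binomial model:
#   theta_tilde_i = B_i * mu_hat + (1 - B_i) * (y_i / n_i),
#   B_i = M_hat / (M_hat + n_i)

mu_hat, M_hat = 0.26, 3.2   # hypothetical EB estimates

for n_i, y_i in [(5, 0), (5, 5), (20, 5)]:
    B_i = M_hat / (M_hat + n_i)
    theta_tilde = B_i * mu_hat + (1 - B_i) * (y_i / n_i)
    # equivalently (a_hat + y_i) / (M_hat + n_i) with a_hat = mu_hat * M_hat
    assert abs(theta_tilde - (mu_hat * M_hat + y_i) / (M_hat + n_i)) < 1e-12
    print(n_i, y_i, round(theta_tilde, 3))
```

Strains with small ni and extreme raw proportions are pulled hardest toward µ̂.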

StrainData= read.csv('http://www.tc.umn.edu/~elock/MiceData.csv')
n= StrainData$n
Yi= StrainData$Yi
N= length(Yi)
mu_hat= mean(Yi/n)
s_hat_2= var(Yi/n)
M_hat= (mu_hat*(1-mu_hat)-s_hat_2)/(s_hat_2-(mu_hat*(1-mu_hat)/N)*sum(1/n))
a_hat= mu_hat*M_hat
b_hat= (a_hat*(1-mu_hat))/mu_hat
c(a_hat, b_hat)

## [1] 0.8108084 2.3560193

x <- seq(0,1, length=500)
post_theta= 0.14*dbeta(x,1,2)+0.86*dbeta(x,1,3)
emp_theta= dbeta(x,a_hat,b_hat)
plot(x,post_theta, xlim=c(0,1), ylim=c(0,5), ylab="density", xlab="theta",
  type='l', lwd=2)
points(x,emp_theta, lwd=1, type='l', col='red')

[Figure: the mixture density 0.14 Beta(1, 2) + 0.86 Beta(1, 3) from the hierarchical analysis (black) and the estimated empirical Bayes prior Beta(â, b̂) (red), plotted over theta.]

10 Direct Sampling

In statistics it is common to encounter integrals that are difficult to solve analytically. For example,

1. Computing expected values or variances

2. Computing the normalizing constant for marginal distributions

3. Finding probabilities when the CDF is unknown

Monte Carlo simulation, in which observations are randomly generated from the model, is helpful for overcoming this.

Direct Monte Carlo integration

Expected Value Assume θ ∼ p(θ), let θ1, . . . , θN be iid draws from p(θ), and consider a function g(θ). By the Law of Large Numbers, for large N,

∫ g(θ) p(θ) dθ = E[g(θ)] ≈ (1/N) Σj=1..N g(θj).

Standard Error Let γ̂ = (1/N) Σj=1..N g(θj). The variance of g(θ) under p can be approximated similarly:

Var[g(θ)] ≈ (1/(N − 1)) Σj=1..N [ g(θj) − γ̂ ]².

This represents the uncertainty in g(θ) implied by p(· | y). The standard error for γ̂ uses Var(γ̂) = Var(g(θ))/N:

ŝe(γ̂) = √{ [ 1/(N(N − 1)) ] Σj=1..N [ g(θj) − γ̂ ]² }.

This represents the simulation uncertainty in the estimate of E[g(θ)]; ŝe(γ̂) converges to 0 as N increases.
By the Central Limit Theorem, γ̂ ∼ Normal( E[g(θ)], ŝe(γ̂)² ) for large N.
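A Python sketch of direct Monte Carlo with its standard error, for g(θ) = θ² with θ ∼ Normal(0, 1), where E[g(θ)] = 1 exactly:

```python
import math
import random

random.seed(42)
N = 50000
g = [random.gauss(0.0, 1.0) ** 2 for _ in range(N)]   # g(theta) = theta^2

gamma_hat = sum(g) / N                                # estimate of E[g(theta)]
var_g = sum((gi - gamma_hat) ** 2 for gi in g) / (N - 1)
se = math.sqrt(var_g / N)                             # simulation standard error

print(gamma_hat, se)
assert abs(gamma_hat - 1.0) < 0.05   # true value is 1; se is roughly 0.006 here
```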

Consider g(θ) = 1{a < c(θ) < b} for a function c, so E[g(θ)] = Pr(c(θ) ∈ (a, b)), estimated by

P̂r = #{ c(θj)'s ∈ (a, b) } / N,

with standard error √[ P̂r (1 − P̂r) / N ].

Note. (i) g(θ) follows a Bernoulli distribution, so the count N · P̂r follows a binomial distribution.

(ii) A histogram of the c(θj)'s approximates the density of c(θ).

Example (Eye Exam). An eye patient is asked to identify shapes from a given distance. She is shown n = 20 shapes. Use the Jeffreys beta-binomial model for the number of shapes she identifies correctly: y ∼ Binomial(20, θ) and θ ∼ Beta(0.5, 0.5). Assume she identifies 16 shapes correctly; then p(θ | y) = Beta(16.5, 4.5).
Consider the logit transformation g(θ) = log(θ/(1 − θ)), and approximate the distribution of g(θ) induced by the Beta posterior. For j = 1, . . . , N:

(i) simulate θj from Beta(16.5, 4.5), and
(ii) compute g(θj) = log( θj/(1 − θj) ).

Compute the g(θj)'s for N = 10,000 simulations.

N= 10000
theta_js= rbeta(N,16.5,4.5)
g_theta_js= log((theta_js)/(1-theta_js))
E_g_theta= mean(g_theta_js)
Var_g_theta= var(g_theta_js)
SE= sqrt(Var_g_theta/N)
print(c(E_g_theta, Var_g_theta, SE))

## [1] 1.382663 0.308158 0.005551

Note. The mean of the draws follows a normal distribution by the Central Limit Theorem, even though the draws themselves are not normally distributed.

Direct Sampling for Multivariate Density

Let θ = (θ1, . . . , θk) and factor p(θ1, . . . , θk) = ∏i=1..k p(θi | θi−1, . . . , θ1); for example, p(θ1, θ2, θ3) = p(θ3 | θ2, θ1) p(θ2 | θ1) p(θ1). We can draw a direct sample θ̃ from p(θ1, . . . , θk) as follows:

(i) draw θ̃1 from p(θ1),
(ii) draw θ̃2 from p(θ2 | θ̃1),
(iii) . . . , and finally draw θ̃k from p(θk | θ̃k−1, . . . , θ̃1).

This is useful for simulating difficult marginal densities. For example, p(θ1) and p(θ2 | θ1) may be easy to obtain but not p(θ2).

Example (Body Fat (cont.)). Consider the model y = Xβ + ε, with the iid normal prior for the βi's, β ∼ Normal(0, 0.62 σ² I), where τ̂² = 0.62 = arg maxτ² p(y | τ²), and the prior σ² ∼ IG(3, 20) based on moment matching (E[σ²] = b/(a − 1)).

The conditional posterior is p(β | y, σ²) = Normal(β̃, σ² Vβ), where β̃ = (XᵀX + (1/0.62) I)^(−1) Xᵀy and Vβ = (XᵀX + (1/0.62) I)^(−1).
The marginal posterior for σ² is p(σ² | y) = IG(an, bn), where an = 3 + 100/2 and bn = 20 + (1/2)[ yᵀy − β̃ᵀ (XᵀX + (1/0.62) I) β̃ ]. [Note p(σ² | y) ∝ p(y | σ²) π(σ²).]

Direct sampling scheme: for samples j = 1, . . . , N = 10,000,

(i) draw σj² from p(σ² | y),
(ii) draw βj from p(β | y, σj²).

### Load data
Data= read.csv('http://www.lock5stat.com/datasets/BodyFat.csv')
Y_tilde= Data$Bodyfat
Y_tilde_bar= mean(Y_tilde)
s_y_tilde= sd(Y_tilde)
Y= Y_tilde-18.5
X_tilde= as.matrix(Data[,2:10])
X= scale(X_tilde)
beta_tilde= solve(t(X)%*%X+(1/0.62)*diag(9))%*%t(X)%*%Y
V_beta= solve(t(X)%*%X+(1/0.62)*diag(9))
a_n= 3+50
b_n= 20+(1/2)*(t(Y)%*%Y-t(beta_tilde)%*%(t(X)%*%X+(1/0.62)*diag(9))%*%beta_tilde)

##load package MASS, for simulating multivariate normal variables using mvrnorm() library("MASS")

## Draw from posterior
N= 10000
beta_sims= matrix(nrow=9,ncol=N)
sigma2_sims= matrix(nrow=1,ncol=N)
for(j in 1:N){
  sigma2_sims[j]= 1/rgamma(1,a_n,b_n)
  beta_sims[,j]= mvrnorm(1,beta_tilde, sigma2_sims[j]*V_beta)
}

CI95_B1_Sim= c(quantile(beta_sims[1,],0.025), quantile(beta_sims[1,],0.975))
## Compare with t-interval

var_mat= as.numeric(b_n/a_n)*solve(t(X)%*%X+(1/0.62)*diag(9))
cred_lower_t= beta_tilde-sqrt(diag(var_mat))*qt(0.975,106)
cred_upper_t= beta_tilde+sqrt(diag(var_mat))*qt(0.975,106)
CI95_B1_t= c(cred_lower_t[1], cred_upper_t[1])
cat(" 95% credible interval from sim:", CI95_B1_Sim, "\n",
  "95% credible interval from t-dist:", CI95_B1_t, sep="")

## 95% credible interval from sim: 0.1719 2.21 ## 95% credible interval from t-dist: 0.1575 2.229

Assume an individual has the following standardized measurements: Age: 1.20, Weight: 0.5, Height: -0.4; circumference of neck: 2.2, chest: 0.3, abdomen: 0.8, ankle: 0.2, bicep: 0.3, and wrist: 0.6.

Denote his vector of predictors as x101 and simulate from the posterior predictive for y101:

(i) draw σj² and βj as before, then
(ii) draw y101,j from Normal( xᵀ101 βj, σj² ).

Note that x101 only affects y101 in the last step:

p(y101, σ², β | y) = p(σ² | y) · p(β | y, σ²) · p(y101 | σ², β),

corresponding to steps 1, 2, and 3 respectively.

x101= c(1.2,0.5,-0.4,2.2,0.3,0.8,0.2,0.3,0.6)
y101_sims= matrix(nrow=1,ncol=N)
for(j in 1:N) y101_sims[j]= rnorm(1,x101%*%beta_sims[,j],sqrt(sigma2_sims[j]))
Predicted.BF= y101_sims+18.5
summary(as.numeric(Predicted.BF))

## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 8.01 22.10 25.10 25.10 28.10 42.00

11 Importance Sampling

Importance Sampling

Direct sampling can be used to estimate ∫ c(θ) p(θ) dθ by simulating draws from p(θ). But what if the normalizing constant for p(θ) is unknown? For multivariate distributions, for example, the normalizing integral is often difficult to calculate. Several “indirect” simulation methods address this, including importance sampling: simulate θ from some other distribution g, but weight each simulation by its relative likelihood under p. Consider the posterior p(θ | y) ∝ f(θ | y) π(θ), where f(θ | y) = f(y | θ) is the likelihood. For a given function c,

E[c(θ) | y] = ∫ c(θ) f(θ | y) π(θ) dθ / ∫ f(θ | y) π(θ) dθ.

Simulate θ1, . . . , θN iid from a density g(θ), and define the weights

w(θ) = f(θ | y) π(θ) / g(θ).

Although f(θ | y) and π(θ) are known, their product may not be proportional to any well-known distribution, or its normalizing constant may be difficult to calculate. We therefore draw from an alternative, well-defined distribution g(θ) and use the weights w(θ):

E[c(θ) | y] = ∫ c(θ) f(θ | y) π(θ) dθ / ∫ f(θ | y) π(θ) dθ = ∫ c(θ) w(θ) g(θ) dθ / ∫ w(θ) g(θ) dθ
≈ [ (1/N) Σj=1..N c(θj) w(θj) ] / [ (1/N) Σj=1..N w(θj) ].
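A Python sketch of self-normalized importance sampling for a case where the answer is known: the unnormalized target q(θ) = exp(−(θ − 2)²/2) is Normal(2, 1) up to a constant, so E[θ] = 2, and the importance function is a wider Normal(0, 3²):

```python
import math
import random

random.seed(7)
N = 100000

def q(t):
    # unnormalized target: Normal(2, 1) up to a constant
    return math.exp(-0.5 * (t - 2.0) ** 2)

def g(t):
    # importance function: Normal(0, 3^2) density, wide enough to cover q
    return math.exp(-0.5 * (t / 3.0) ** 2) / (3.0 * math.sqrt(2.0 * math.pi))

thetas = [random.gauss(0.0, 3.0) for _ in range(N)]
w = [q(t) / g(t) for t in thetas]   # weights; normalizing constant not needed

est = sum(wi * ti for wi, ti in zip(w, thetas)) / sum(w)
print(est)                          # should be close to the true mean, 2
assert abs(est - 2.0) < 0.1
```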

Comments

1. g (θ) is called the importance function.

2. w (θ) weighs values according to (relative) posterior probability, and probability of being simulated from g:

(a) Higher posterior probability → Higher weight, (b) Higher probability of being simulated from g → lower weight.

3. Ideally, g closely resembles p (θ | y):

(a) Efficiency comes from choosing g(θ) as close as possible to p(θ | y); g(θ) = p(θ | y) minimizes simulation variability (then all the weights are equal).

Density Approximation

Let c(θ) = 1{θ ∈ (a, b)}. Then

E[c(θ) | y] = Pr(θ ∈ (a, b)) ≈ Σ{j: θj ∈ (a,b)} w(θj) / Σj=1..N w(θj).

We can use this to approximate the posterior density with weighted histograms:

Bin Height = [ Σ{j: θj ∈ (a,b)} w(θj) / Σj w(θj) ] · 1/(b − a).

Define

ĉ(θ) = [ (1/N) Σj=1..N c(θj) w(θj) ] / [ (1/N) Σj=1..N w(θj) ].

Denote xj = c(θj) w(θj) and yj = w(θj). We can approximate the simulation variability of ĉ(θ) as

Var(x̄/ȳ) ≈ (1/N) [ σ̂x²/ȳ² + x̄²σ̂y²/ȳ⁴ − 2x̄σ̂xy/ȳ³ ],

where σ̂x² = [1/(N − 1)] Σ(xj − x̄)², σ̂y² = [1/(N − 1)] Σ(yj − ȳ)², and σ̂xy = [1/(N − 1)] Σ(xj − x̄)(yj − ȳ).

Example (Normal-Gamma Model). Suppose y = (y1, . . . , yn) iid ∼ Normal(0, θ) and prior θ ∼ Gamma(3, 0.5). The posterior is

p(θ | y) ∝ θ^((4−n)/2) exp{ −(1/(2θ)) ( θ² + Σi=1..n yi² ) },

which is not a well-known or easily integrable pdf (unlike an IG prior). Consider the importance function g = Gamma(α = cs², β = c) [Note: Eθ = α/β = s²]:

(i) s² is the sample variance of y, so the expected value under g is s².
(ii) Higher c implies less variance.

The importance weight function is

w(θ) = f(θ | y) π(θ) / g(θ) = θ^((4−n)/2 − (cs² − 1)) exp{ −(1/(2θ)) (θ² + Σi yi²) + cθ }.

Note that constants in w(θ) can be ignored, since they cancel in E[c(θ) | y]. For different values of c:

(i) simulate N = 100,000 draws from g(θ),
(ii) compute w(θ) for each draw,
(iii) for the estimated expected value θ̂ = Σj θj w(θj) / Σj w(θj), compute the simulation uncertainty Var(θ̂) using the variance formula.

## Simulate 20 observations y
set.seed(123)
y <- rnorm(n=20, mean=0, sd=sqrt(5))
n= length(y)
s_2= var(y)

## Define weight function, based on data y
w <- function(theta,c){
  theta^((4-n)/2)*exp(-1/(2*theta)*(theta^2+sum(y^2)))/dgamma(theta,c*s_2,c)
}

## Grid of c values
c_grid= c(1:50)/25

## Run importance sampling for each c and calculate simulation variance
SimVar_theta_hat= theta_hat= c()
for(i in 1:length(c_grid)){
  N= 100000
  c= c_grid[i]
  thetas= rgamma(N,c*s_2,c)
  ws= w(thetas,c)
  theta_hat[i]= sum(ws*thetas)/sum(ws) ## Estimated posterior expectation
  ### Compute posterior variance due to simulation:
  x= ws*thetas
  SimVar_theta_hat[i]= (1/N)*(var(x)/mean(ws)^2+
    mean(x)^2*var(ws)/mean(ws)^4-
    2*mean(x)*cov(x,ws)/mean(ws)^3)
}

##Find best c c= c_grid[ which.min(SimVar_theta_hat)]; c

## [1] 0.8

##Get importance samples for best c thetas= rgamma(N,c*s_2,c) ws= w(thetas,c) theta_hat= sum(ws*thetas)/sum(ws); theta_hat

## [1] 5.209

Pr_theta_above_5= sum(ws[thetas>5])/sum(ws); Pr_theta_above_5

## [1] 0.4878

To minimize the variance of importance sampling, choose c = 0.80. With c = 0.80 we have θ̂ = 5.21, and we can also calculate posterior probabilities, for example Pr(θ > 5 | y) = Σ{j: θj > 5} w(θj) / Σj=1..N w(θj) = 0.489.

[Figure: weighted histogram approximating p(θ | y) over theta, with the importance function g(θ) overlaid as a red continuous density curve.]

12 Rejection Sampling

Rejection Sampling

We would like to sample from a (potentially unnormalized) density p. This is equivalent to sampling uniformly from the area under p. If p ∝ q, it is also equivalent to sampling uniformly from the area under q; here p is a posterior distribution and q is its unnormalized version, q(θ) = f(y | θ) π(θ).

Algorithm Choose an enveloping function M · g(θ) that you can sample underneath, with q(θ) ≤ M · g(θ) ∀θ, where in practice g is a density function and M is a constant. Ignore the samples above q and keep the samples underneath. For j = 1, ..., N:

(i) Draw θj from g (θ)

(ii) Draw U from Uniform(0, 1). Then we have a point (θj,U · M · g (θj)) under M · g (θ).

(iii) Accept θ_j if U · M · g(θ_j) ≤ f(y | θ_j)π(θ_j); otherwise reject.
The accepted θ_j's correspond to draws from p(θ | y); each proposal is accepted with probability Pr_j = f(y | θ_j)π(θ_j) / (M · g(θ_j)).

Comments

1. Similar to importance sampling: Importance sampling uses

(a) Simulations from known density g (θ) to approximate unknown p (θ | y).

(b) Prj as weights instead of acceptance probability. (That is, importance sampling accepts all draws with weights.)

2. Rejection sampling yields direct samples from p (θ | y) not weighted samples.

3. The envelope function Mg(θ) MUST satisfy Mg(θ) ≥ f(y | θ)π(θ) ∀θ; otherwise the accepted draws are not from the posterior.

4. Ideally, M is as small as possible and Mg(θ) is close to f(y | θ)π(θ).

Example (Normal Density). Simulate samples from p(θ) = N(0, 1) using rejection sampling. Let g(θ) = Cauchy(0, 2) be the envelope density: g(θ) = [2π(1 + (θ/2)²)]⁻¹.

[Figure: N(0,1) density with envelope functions Cauchy(0,2)·1 and Cauchy(0,2)·3.]

Histogram of accepted samples using M = 1 and M = 3:

# Rejection sampler for multiplication constant M
rejsamp= function(M){
  while(1){ # Repeat this experiment until uMg(x) < q(x)
    # Draw single value from between zero and one
    u= runif(1)
    # Draw candidate value for x from the enveloping distribution
    x= rcauchy(1, location=0, scale=2)
    # Accept or reject candidate value; if rejected try again
    if(u*M*dcauchy(x, location=0, scale=2) < dnorm(x,0,1)) return(x)
  }
}
N= 100000
sampx1= replicate(N, rejsamp(1))
sampx3= replicate(N, rejsamp(3))

[Figure: histograms of accepted samples under envelopes Cauchy(0,2)·1 and Cauchy(0,2)·3.]

Question 1 What value of M gives optimal efficiency (i.e., least time)? In this context, compare envelope and target at the mode:

Mg(θ)|_{θ=0} ≥ q(θ)|_{θ=0} ⟹ M · 1/(2π) ≥ (1/√(2π)) · 1 ⟹ M ≥ √(2π).

More generally, for all θ, f(θ) = Mg(θ) − q(θ) ≥ 0. Solve ∂f(θ)/∂θ = 0 for the minimizer θ̃, then choose M so that Mg(θ̃) − q(θ̃) is as close to zero as possible.

Question 2 What is the acceptance probability for M = √(2π)?

Acceptance prob. = ∫ [q(θ) / (M g(θ))] g(θ) dθ = (∫ q(θ) dθ) / M = 1/M.

In this context q(θ) = N(0, 1) is a normalized density, so ∫ q(θ) dθ = 1; otherwise the acceptance probability is (∫ q(θ) dθ)/M.
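This 1/M acceptance rate is easy to check by simulation. Below is a minimal Python sketch (the course code elsewhere is R) of the same setup: target q(θ) = N(0,1), envelope Cauchy(0,2), and the smallest valid constant M = √(2π).

```python
import math
import random

random.seed(1)

def q(theta):
    # target: the (already normalized) N(0,1) density
    return math.exp(-theta ** 2 / 2) / math.sqrt(2 * math.pi)

def g(theta):
    # Cauchy(0, 2) envelope density
    return 1 / (2 * math.pi * (1 + (theta / 2) ** 2))

M = math.sqrt(2 * math.pi)  # smallest M with M*g(theta) >= q(theta) for all theta

N = 100_000
accepted = 0
for _ in range(N):
    # Cauchy(0,2) draw via inverse CDF; accept when U*M*g(theta) <= q(theta)
    theta = 2 * math.tan(math.pi * (random.random() - 0.5))
    if random.random() * M * g(theta) <= q(theta):
        accepted += 1

rate = accepted / N
print(rate)  # should be close to 1/M, about 0.399
```

Since q here integrates to 1, the empirical acceptance rate matches 1/M; for an unnormalized q it would instead estimate (∫q)/M.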

13 Metropolis-Hastings Sampling

Overview of Posterior Simulation Method

• Direct sampling: Sample from analytical posterior distribution, say, p (θ | y).

• Non-iterative indirect sampling: Suppose p(θ | y) ∝ f(y | θ)π(θ) and the normalizing integral of f(y | θ)π(θ) is difficult to compute. Consider:
(i) Importance sampling: Choose a known density g(θ) and weight samples by w(θ) = f(y | θ)π(θ)/g(θ).
(ii) Rejection sampling: Choose a known density g(θ) and construct an enveloping function M · g(θ), where M is a constant and f(y | θ)π(θ) ≤ Mg(θ) ∀θ. Accept θ_j if U · M · g(θ_j) ≤ f(y | θ_j)π(θ_j), where U ∼ Uniform(0, 1).

• Iterative indirect (MCMC) sampling: (i) Metropolis-Hastings algorithm (ii) Gibbs sampling

Markov Chain Monte Carlo (MCMC)

• “Monte Carlo” refers to any method that uses random sampling to obtain results.

• A Markov chain is a sequence of random variables θ^(1), θ^(2), ..., satisfying the Markov property: P(θ^(t+1) | θ^(1), ..., θ^(t)) = P(θ^(t+1) | θ^(t)); the next state t+1 depends only on the current state t.

• MCMC methods “adaptively” simulate from posterior p (θ | y).

– Current draw depends on previous draw. – Draws converge to approximate dependent samples from p (θ | y).

Metropolis-Hastings Sampling

Wish to draw θ^(1), θ^(2), ... from a (potentially unnormalized) distribution h, e.g., h(θ) = f(y | θ)π(θ). Define a proposal density that depends on the previous draw θ^(t−1): q(· | θ^(t−1)). Each new draw is taken from q(· | θ^(t−1)), with a rejection step to encourage new draws to have high density under h. The Metropolis algorithm applies to symmetric q, q(θ* | θ^(t−1)) = q(θ^(t−1) | θ*), while the Metropolis-Hastings algorithm extends to non-symmetric q.

The Metropolis Algorithm Specify an initial value θ(0). For t = 1,...,T repeat:

1. Draw θ* from q(· | θ^(t−1)), a symmetric proposal density.

2. Compute r = h(θ*)/h(θ^(t−1)).
3. If r ≥ 1, set θ^(t) = θ*;
   if r < 1, set θ^(t) = θ* with probability r and θ^(t) = θ^(t−1) with probability 1 − r.

Note. For computational reasons, often work with log-densities: r = exp{log h(θ*) − log h(θ^(t−1))}.
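The three steps above can be sketched compactly; this minimal Python version (the course examples use R) targets a toy unnormalized h and works on the log scale, as the note suggests:

```python
import math
import random

random.seed(42)

def log_h(theta):
    # toy unnormalized log-target: proportional to a N(3, 1) density
    return -0.5 * (theta - 3.0) ** 2

def metropolis(log_h, theta0, sd, T):
    draws = []
    theta = theta0
    for _ in range(T):
        theta_star = random.gauss(theta, sd)        # symmetric proposal
        log_r = log_h(theta_star) - log_h(theta)    # log of r = h(theta*)/h(theta)
        if log_r >= 0 or random.random() < math.exp(log_r):
            theta = theta_star                      # accept; otherwise keep current
        draws.append(theta)
    return draws

draws = metropolis(log_h, 0.0, 1.0, 20_000)
post = draws[5_000:]               # discard burn-in
mean = sum(post) / len(post)
print(round(mean, 1))              # posterior mean should be near 3
```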

The Metropolis-Hastings Algorithm Specify an initial value θ(0). For t = 1,...,T repeat:

1. Draw θ* from q(· | θ^(t−1)), a (possibly non-symmetric) proposal density.

2. Compute r = [h(θ*) q(θ^(t−1) | θ*)] / [h(θ^(t−1)) q(θ* | θ^(t−1))].
3. If r ≥ 1, set θ^(t) = θ*;
   if r < 1, set θ^(t) = θ* with probability r and θ^(t) = θ^(t−1) with probability 1 − r.

Comments Under mild conditions, as t → ∞, θ(t) converges in distribution to a draw from posterior. Read Chib and Greenberg (1995) “Understanding the Metropolis-Hastings Algorithm.”

1. The Metropolis-Hastings algorithm is identical to the Metropolis if q is symmetric.

2. In practice, a good initial value θ(0) will have high posterior density.

(a) Could initialize by posterior mode, if possible: θ(0) = θˆ. (b) Make a guess or generate θ(0) from prior.

Choice of Proposal Density

A common choice for q is a normal distribution centered at the previous draw: q(θ* | θ^(t−1)) = Normal(θ^(t−1), σ²). If θ is multivariate, replace σ² with a covariance matrix Σ.

Acceptance Ratio

• Higher σ2 often leads to low acceptance ratio: Proposal θ∗ may be far away from areas in which p concentrates (“big jumps”).

• Lower σ² often leads to a high acceptance ratio: proposals θ* are close to θ^(t−1), and many iterations are needed to cover larger areas of the parameter space.

We need to compromise between these two extremes. As a rule of thumb, accepting about 20% ∼ 70% of proposals is reasonable.

• Vary q to give the desired rejection rate: Some algorithms adjust q adaptively during sampling.

• Alternatively, for q = Normal(θ^(t−1), σ²), let σ² be an approximation to the posterior variance. Recall the Bayesian CLT: σ²_{θ|y} = Var_{θ|y}(θ) ≈ (I^π(y))⁻¹ (the asymptotic variance with prior π), where

I^π_ij(y) = −[∂²/∂θ_i∂θ_j] log(f(y | θ)π(θ)) |_{θ=θ̂^π}.

Iterations

• Early iterations depend on the initial value, especially if the initial value is far from where the posterior concentrates.

• Typical to ignore first M beginning iterations as burn-in.

– Burn-in can vary: M = 1, 000, M = 5, 000 or even M = 100, 000 iterations. – May adjust proposal distribution during burn-in.

• Aim for stationarity after burn-in: The probability distribution of θ(t) does not depend on t, while θ(t) does depend on θ(t−1) (correlation).

– Initial iterations not stationary because of dependence on θ(0). – Eventually iterations will be approximately stationary. – The stationary distribution is the posterior: p θ(t) ≈ p (θ | y) for t > M.

• θ(t) for t ≥ M are kept as draws from posterior.

• To validate burn-in, can run from different initialization: See if they converge to similar dis- tributions after burn-in.

Dependence In general, favor low dependence between MCMC samples.

• Low autocorrelation: cor(θ^(t), θ^(t−1)).

• Leads to better convergence toward stationary posterior.

• Leads to lower uncertainty in results from posterior draws. That is, samples can cover the area under the objective (unknown distribution) h in a small size of draws.

Example (Basketball Shooting (cont.)). Consider the shooting percentage for a basketball team over n games: y = (y_1, ..., y_n).
Model y_i ∼iid Beta(θ, 2) for θ > 0: f(y_i | θ) = [1/B(θ, 2)] y_i^{θ−1}(1 − y_i) = θ(1 + θ) y_i^{θ−1}(1 − y_i). Use a Gamma(a, b) prior for θ. Then, given y,

p(θ | y) ∝ θ^{n+a−1}(θ + 1)^n e^{−bθ} (∏_{i=1}^n y_i^θ) ∏_{i=1}^n [(1 − y_i)/y_i] ∝ θ^{n+a−1}(θ + 1)^n e^{−bθ} ∏_{i=1}^n y_i^θ =: h(θ),

where the factor ∏_{i=1}^n (1 − y_i)/y_i is constant in θ.

Observe n = 20 games with Σ_{i=1}^{20} log y_i = −9.89. Prior: a = b = 1. By the Bayesian CLT, the approximate posterior is p(θ | y) ≈ Normal(3.24, 0.33).

To compare with the Bayesian CLT result, we now apply Metropolis sampling to draw from p(θ | y), using the asymptotic approximation to motivate θ^(0) and q: (i) initial value θ^(0) = 3.24, (ii) proposal density q(· | θ^(t−1)) = Normal(θ^(t−1), 0.33), and (iii) unnormalized posterior h(θ). Let T = 25,000 with M = 5,000 iterations as burn-in, leaving N = 20,000 draws from p(θ | y).

##Compute mode and asymptotic variance (see Asymptotic_Posterior_Approximations)
n= 20; logYs= -9.89
A= sum(logYs)-1; B= 2*n-1+sum(logYs); C= n
theta_hat= (-B-sqrt(B^2-4*A*C))/(2*A) ##Quadratic formula
var= (1/20)*(theta_hat^2*(1+theta_hat)^2)/(theta_hat^2+(theta_hat+1)^2)
h <- function(theta) theta^n*(theta+1)^n*exp(-theta)*(exp(sum(logYs)))^theta

##Perform Metropolis algorithm
T= 25000
met.thetas= rep(0,T)
prev.theta= theta_hat
for(t in 1:T){
  theta_star= rnorm(1,prev.theta,sqrt(var))
  r= exp(log(h(theta_star))-log(h(prev.theta)))
  if(r>=1) met.thetas[t]= theta_star
  if(r<1){
    U= runif(1,0,1) ###Uniform RV: accept with probability r, since P(U<r)=r
    if(U<r) met.thetas[t]= theta_star
    else met.thetas[t]= prev.theta
  }
  prev.theta= met.thetas[t]
}
##To compute acceptance rate, count number of times met.thetas[t] != met.thetas[t-1]
AcceptanceRate= sum(met.thetas[(5000+1):(T-1)]!=met.thetas[(5000+2):T])/(T-1)
##0.56 -- a bit higher than we'd like, but still OK
AutoCor= cor(met.thetas[(5000+1):(T-1)],met.thetas[(5000+2):T]) #Autocorrelation = 0.771

[Figure: left, trace plot of the first 100 iterations; right, histogram of the posterior draws with the N(3.24, 0.33) density overlaid.]

14 Gibbs Sampling

Motivation Let θ = (θ_1, ..., θ_k) with density p. Recall that a direct sample θ^(t) from p(θ_1, ..., θ_k) can be drawn as follows:
(i) Draw θ_1^(t) from p(θ_1),
(ii) Draw θ_2^(t) from p(θ_2 | θ_1^(t)),
(iii) ... Draw θ_k^(t) from p(θ_k | θ_{k−1}^(t), ..., θ_1^(t)).
In direct sampling the full distribution p(θ_1, θ_2) can be difficult, while p(θ_1) and p(θ_2 | θ_1) may be easier. However, marginalizing over the other parameters, as for p(θ_1), can also be difficult, while sampling from full conditionals, such as p(θ_k | θ_{k−1}, ..., θ_1), may be easier. This motivates Gibbs sampling.

Gibbs Sampling

Wish to draw θ^(1), θ^(2), ... from a distribution p, where θ^(t) = (θ_1^(t), ..., θ_k^(t)). Specify an initial vector θ^(0). For t = 1, ..., T:
(i) Draw θ_1^(t) from p(θ_1 | θ_2^(t−1), θ_3^(t−1), ..., θ_k^(t−1)),
(ii) Draw θ_2^(t) from p(θ_2 | θ_1^(t), θ_3^(t−1), ..., θ_k^(t−1)),
(iii) ... Draw θ_k^(t) from p(θ_k | θ_1^(t), θ_2^(t), ..., θ_{k−1}^(t)).
Each draw is from a full conditional using the most recent values. As t → ∞, θ^(t) converges in distribution to a draw from p(θ); in practice p(θ) is a multivariate posterior density.
Note. Gibbs sampling is a special case of Metropolis-Hastings, in which the proposal distributions are the full conditionals, giving acceptance probability 1.
Consider p(θ) ∝ h(θ) and proposal density q(θ_1* | θ^(t−1)) = p(θ_1* | θ_2^(t−1), ..., θ_k^(t−1)), i.e., p conditional on the other k − 1 parameters. Under MH sampling,

r = [h(θ_1*, θ_{−1}^(t−1)) q(θ_1^(t−1) | θ_1*, θ_{−1}^(t−1))] / [h(θ_1^(t−1), θ_{−1}^(t−1)) q(θ_1* | θ^(t−1))]
  = [p(θ_1*, θ_{−1}^(t−1)) p(θ_1^(t−1) | θ_2^(t−1), ..., θ_k^(t−1))] / [p(θ^(t−1)) p(θ_1* | θ_2^(t−1), ..., θ_k^(t−1))] = 1.
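For a concrete toy illustration (a Python sketch, not the course's R code), consider a bivariate normal with correlation ρ, whose full conditionals are the standard normals θ_1 | θ_2 ∼ N(ρθ_2, 1 − ρ²) and θ_2 | θ_1 ∼ N(ρθ_1, 1 − ρ²):

```python
import math
import random

random.seed(7)
rho = 0.8
sd = math.sqrt(1 - rho ** 2)
T, burn = 20_000, 2_000

th1, th2 = 0.0, 0.0
draws1, draws2 = [], []
for _ in range(T):
    # each coordinate is drawn from its full conditional given the latest value
    th1 = random.gauss(rho * th2, sd)
    th2 = random.gauss(rho * th1, sd)
    draws1.append(th1)
    draws2.append(th2)

d1, d2 = draws1[burn:], draws2[burn:]
m1, m2 = sum(d1) / len(d1), sum(d2) / len(d2)
cov = sum((a - m1) * (b - m2) for a, b in zip(d1, d2)) / len(d1)
v1 = sum((a - m1) ** 2 for a in d1) / len(d1)
v2 = sum((b - m2) ** 2 for b in d2) / len(d2)
corr = cov / math.sqrt(v1 * v2)
print(round(corr, 2))  # sample correlation should be near rho = 0.8
```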

Comments

1. Gibbs sampling is the standard tool for estimating large multivariate posteriors.

2. As in MH, convergence is quicker if initialization is in area of high posterior concentration.

3. As in MH, burn-in and autocorrelation must be considered, but there is no rejection of proposals.

4. With low autocorrelation, we say the parameters “mix well.”

5. Gibbs sampling assumes we can obtain direct samples from the full conditionals. There are also “hybrid” versions, in which another method (such as MH) is used to sample from intractable full conditionals.

Example (IQ (cont.)). Human IQs have variance 225 and should be centered at 100. Infer the “calibration” µ and variance σ² of a certain IQ test. Sample m = 20 individuals; individual i takes the test n_i times, with n = Σ_{i=1}^m n_i total scores.
Model: y_ij ∼ Normal(θ_i, σ²), where θ_i ∼ Normal(µ, τ²) for i = 1, ..., m, j = 1, ..., n_i.

Layer 1: µ ∼ π(µ) = 1

Layer 2: θ_i ∼ N(µ, τ² = 225) and σ² ∼ π(σ²)

Layer 3: N(θ_1, σ²), N(θ_2, σ²), ..., N(θ_m, σ²)

Layer 4: y_11, ..., y_1n_1; y_21, ..., y_2n_2; ...; y_m1, ..., y_mn_m

Use independent Jeffreys priors for µ, σ²: π(µ, σ²) = 1/σ². Then the full joint density p(y, θ, σ², µ) is

π(µ, σ²) ∏_{i=1}^m N(θ_i | µ, τ²) ∏_{i=1}^m ∏_{j=1}^{n_i} N(y_ij | θ_i, σ²).

Thus, the full conditionals are

p(θ_i | y, σ², µ) = N( (σ²µ + n_iτ²ȳ_i)/(σ² + n_iτ²), σ²τ²/(σ² + n_iτ²) ) ∀i,
p(σ² | y, θ, µ) = IG( n/2, Σ_{i,j}(y_ij − θ_i)²/2 ),
p(µ | y, θ, σ²) = N( θ̄, τ²/m ),

where p(σ² | y, θ, µ) ∝ ∏_{i=1}^m ∏_{j=1}^{n_i} N(y_ij | θ_i, σ²) π(µ, σ²) ∝ (1/σ²)^{Σ_i n_i/2 + 1} exp{−Σ_{i,j}(y_ij − θ_i)²/(2σ²)} ∝ IG( n/2, Σ_{i,j}(y_ij − θ_i)²/2 ), and the conditional posterior of µ does not depend on σ².

Note. X ∼ Gamma(a, b): p(x) = [b^a/Γ(a)] x^{a−1} e^{−bx}; X ∼ Inverse Gamma(a, b): p(x) = [b^a/Γ(a)] x^{−a−1} e^{−b/x}.
Use Gibbs to draw samples (θ^(t), σ²^(t), µ^(t)) from p(θ, σ², µ | y):
(i) Initial values: σ² = 64, µ = 100.
(ii) Begin sampling with θ: each iteration draws from p(θ_i | y, σ², µ) ∀i, then p(σ² | y, θ, µ), then p(µ | y, θ, σ²).
(iii) Run 10000 iterations, using the first 2000 as burn-in.

#####(Here is the code that was used to simulate these data):
set.seed(123)
n= 20
y <- list()
for(i in 1:n){
  # set mu=110 instead of prior calibration mu=100, then check later
  mu= rnorm(1,110,sqrt(225))
  n_i= sample(c(1:5),1)
  y[[i]]= rnorm(n_i,mu,6) # sigma^2=36
}

T= 10000; BurnIn= 2000; N= T-BurnIn; m= 20
n_i_s= sapply(y,length); n= sum(n_i_s)
y_i_bars= sapply(y,mean)
tau_2= 225 ##fixed
sigma_2= 64; mu= 100 ##initialize sigma^2=64 and mu=100
thetas= c() ##no need to initialize these
draws_sigma_2= rep(0,T); draws_mu= rep(0,T); draws_thetas= matrix(nrow=m,ncol=T)
for(t in 1:T){
  #####Draw for thetas
  for(i in 1:m){
    post_mean= (sigma_2*mu+n_i_s[i]*tau_2*y_i_bars[i])/(n_i_s[i]*tau_2+sigma_2)
    post_var= sigma_2*tau_2/(n_i_s[i]*tau_2+sigma_2)
    thetas[i]= rnorm(1,post_mean,sqrt(post_var))
  }

  #####Draw for sigma^2
  a_n= n/2
  SumSquares= 0
  for(i in 1:m) SumSquares= SumSquares+sum((y[[i]]-thetas[i])^2)
  b_n= (1/2)*SumSquares
  sigma_2= 1/rgamma(1,a_n,b_n)

  ######Draw for mu
  mu= rnorm(1,mean(thetas),sqrt(225/m))

  ###Store sampled values
  draws_sigma_2[t]= sigma_2
  draws_mu[t]= mu
  draws_thetas[,t]= thetas
}

[Figure: histograms of the posterior draws for µ and σ².]

By the prior, IQs should be centered at 100. Check if the test is calibrated too high:

(i) Pr(µ > 100 | y) ≈ (1/N) Σ_{i=1}^N 1{µ^(i) > 100} = 0.996,
which is a statistical inference from the posterior distribution of µ | y rather than µ | y, θ, σ², since the draws of µ | y are taken over all possible θ and σ². Also use the draws to construct a 95% credible interval for the IQ of each individual.

sum(draws_mu[(BurnIn+1):T]>100)/N

## [1] 0.9946

###Load package plotrix to make side-by-side confidence intervals by individual
means= apply(draws_thetas[,(BurnIn+1):T],1,mean)
upper= apply(draws_thetas[,(BurnIn+1):T],1,quantile,probs=0.975)
lower= apply(draws_thetas[,(BurnIn+1):T],1,quantile,probs=0.025)
head(cbind(lower, means, upper)) # list the first 6 individuals

##       lower means upper
## [1,]  96.65 103.4 110.2
## [2,]  98.59 105.6 112.5
## [3,] 101.29 106.7 112.1
## [4,] 118.40 124.5 130.4
## [5,]  90.05 101.5 113.1
## [6,] 102.65 108.0 113.4

[Figure: posterior means and 95% credible intervals for IQ, plotted by individual.]

15 More on MCMC

Connections between methods

• Accepted samples under rejection sampling are direct samples from posterior.

• Importance sampling is analogous to rejection sampling, with rejection probabilities used as weights, i.e., w (θ) = f (y | θ) π (θ) /g (θ).

73 • Metropolis-Hastings sampling includes an accept/reject step similar to rejection sampling.

• Gibbs sampling is a special case of Metropolis-Hastings sampling, in which proposal density is conditional posterior.

Combining MCMC methods

• Gibbs sampling requires direct sampling from full conditionals. Otherwise, it can combine with other sampling methods.

– For example: use Gibbs sampling, with an MH step to sample from intractable conditionals.

• A single MH “sub-step” is sufficient for convergence: draw θ_i^(t) using MH with proposal density given by θ^(t−1), rather than from its full conditional.

Example (IQ (cont.)). Model: scores y_ij ∼ Normal(θ_i, σ²) for individuals i = 1, ..., m, trials j = 1, ..., n_i. IQs θ_i ∼ Normal(µ, 225) for i = 1, ..., m.
Use a flat prior for µ and a Gamma(25, 1) prior for σ²: π(µ, σ²) ∝ (σ²)^{25−1} e^{−σ²}.
Thus the full conditional for σ² is proportional to (σ²)^{24−n/2} exp{−σ² − (1/(2σ²)) Σ_{i=1}^m Σ_{j=1}^{n_i} (y_ij − θ_i)²}, which is not a well-known density. Use Gibbs sampling with an MH sub-step for σ²:

• For i = 1, ..., 20 draw θ_i^(t) from

p(θ_i | y, σ², µ) = Normal( (σ²^(t−1)µ^(t−1) + n_iτ²ȳ_i)/(σ²^(t−1) + n_iτ²), σ²^(t−1)τ²/(σ²^(t−1) + n_iτ²) ).

• Draw σ²^(t) using a Metropolis step:

– Draw σ²* from q(· | σ²^(t−1)) = Normal(σ²^(t−1), 25).
– Compute r = p(σ²*, θ^(t), µ^(t−1), y) / p(σ²^(t−1), θ^(t), µ^(t−1), y).
– If r ≥ 1, set σ²^(t) = σ²*; if r < 1, set σ²^(t) = σ²* with probability r and σ²^(t) = σ²^(t−1) with probability 1 − r.

• Draw µ^(t) from p(µ | y, θ, σ²) = Normal(θ̄, 225/m).

Assessing Convergence

• MCMC iterations will eventually converge to their stationary distribution (the posterior), which can be assessed by visual inspection of trace plots: Plot of draws over the iterations for a parameter.

• There should be no indication of a systematic trend, after burn-in.

• The log joint density can be used as a summary: consider log p(θ_1^(t), ..., θ_k^(t), y) for each iteration t. The trace increases during convergence, then appears stationary after burn-in.

Assessing Variability due to Simulation

For λ = g (θ), consider the estimate for λ based on MCMC draws

Ê(λ | y) = λ̂_N = (1/N) Σ_{t=1}^N λ^(t).

Consider Var(λ̂_N), assuming the draws are from the posterior (i.e., the MCMC has converged) but dependent. Denote by ρ_k the autocorrelation between λ^(t) and λ^(t+k). Then the effective sample size, ESS, is defined by

ESS = N/κ(λ), where κ(λ) = 1 + 2 Σ_{k=1}^∞ ρ_k(λ).

The simulation variance of λ̂_N (i.e., the analog of the frequentist standard error) may be approximated by

V̂ar(λ̂_N) = s²_λ / ESS, where s²_λ = (1/(N−1)) Σ_{t=1}^N (λ^(t) − λ̂_N)².

ESS can be computed by summing autocorrelations until they become negligible (say, below 0.01). Often the autocorrelation decays exponentially, ρ_k ≈ ρ_1^k, which gives ESS ≈ N(1 − ρ_1)/(1 + ρ_1) with N the total number of draws after burn-in.
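A minimal Python sketch of this recipe (a toy illustration, not the course's code), validated on an AR(1) chain whose theoretical ESS is N(1 − ρ_1)/(1 + ρ_1):

```python
import math
import random

def ess(draws, cutoff=0.01):
    """Effective sample size: N / (1 + 2 * sum of autocorrelations above cutoff)."""
    n = len(draws)
    mean = sum(draws) / n
    var = sum((x - mean) ** 2 for x in draws) / n
    kappa = 1.0
    for k in range(1, n):
        rho_k = sum((draws[i] - mean) * (draws[i + k] - mean)
                    for i in range(n - k)) / (n * var)
        if rho_k < cutoff:   # stop once the autocorrelation is negligible
            break
        kappa += 2 * rho_k
    return n / kappa

random.seed(0)
phi, N = 0.5, 20_000          # AR(1) chain with lag-1 autocorrelation phi
x, chain = 0.0, []
for _ in range(N):
    x = phi * x + random.gauss(0, math.sqrt(1 - phi ** 2))
    chain.append(x)

e = ess(chain)
print(round(e))  # theory gives roughly N*(1-phi)/(1+phi), about 6667
```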

16 Deviance Information Criterion

BIC and AIC

Define the deviance function for a model with parameter θ: D (θ) = −2 log f (y | θ). Bayesian Information Criterion (BIC) is

BIC = D(θ̂) + p log n,

which is motivated by an asymptotic approximation of the Bayes factor, where θ̂ is the maximum likelihood estimate, p is the model dimension, θ = (θ_1, ..., θ_p), and n is the sample size of y = (y_1, ..., y_n). The Akaike Information Criterion (AIC) is

AIC = D(θ̂) + 2p,

which is motivated by an asymptotic approximation to the Kullback-Leibler divergence.

Question What if the choice of p and n is not clear? This is common in Bayesian hierarchical models. Consider the multi-level normal model y_ij ∼ Normal(θ_i, σ²) for i = 1, ..., m and j = 1, ..., n_i, where θ_i ∼ Normal(µ, τ²).
(i) If the θ_i are all nearly identical (τ² → 0), the model depends only on estimation of µ (p ≈ 1).
(ii) If the θ_i are estimated independently (τ² → ∞), p ≈ m makes sense.
The choice of sample size is similarly unclear.

Effective Number of Parameters Define the effective number of parameters by the “expected” deviance minus the “fitted” deviance:

p_D = E_{θ|y}D(θ) − D(θ̂), where θ̂ = E_{θ|y}θ.

Higher p_D implies more over-fitting with estimate θ̂. For a non-hierarchical model, the Bayesian CLT implies p ≈ p_D for large n.

• Denote L(θ) = log f(y | θ) and D(θ) = −2L(θ).
• By the Bayesian CLT, p(θ | y) ≈ N(θ̂, [−L''(θ̂)]⁻¹), where θ̂ is the posterior mean or MLE, so that L'(θ̂) = 0.

• Using 2nd order Taylor expansion about θˆ,

D(θ) ≈ D(θ̂) + (θ − θ̂)ᵀ D'(θ̂) + (1/2)(θ − θ̂)ᵀ D''(θ̂)(θ − θ̂)
     = D(θ̂) + (1/2)(θ − θ̂)ᵀ [−2L''(θ̂)] (θ − θ̂) = D(θ̂) + (θ − θ̂)ᵀ [−L''(θ̂)] (θ − θ̂),

where (θ − θ̂)ᵀ[−L''(θ̂)](θ − θ̂) is approximately χ²_p when θ ∼ p(θ | y). Thus, taking expectations on both sides:

E_{θ|y}D(θ) ≈ D(θ̂) + p.

Note. More parameters do not necessarily make the model more flexible in fitting the data.

Deviance Information Criterion The deviance information criterion (DIC) is

DIC = E_{θ|y}D(θ) + p_D, where E_{θ|y}D(θ) measures how well the model fits and p_D penalizes the number of parameters.

• Also estimate DIC from posterior samples: DIC = 2D̄ − D(θ̄), where θ̄ = (1/N) Σ_{t=1}^N θ^(t) and D̄ = (1/N) Σ_{t=1}^N [−2 log f(y | θ^(t))].

– Includes a “goodness-of-fit” term D̄ with a penalty for complexity p_D.
– p_D can be negative if D(θ̄) is relatively large: this implies the Bayesian CLT does not hold and θ̄ is a poor estimate.

• Approximates AIC for a non-hierarchical model.

• Similar asymptotic justification as AIC.

• More appropriate for hierarchical models than AIC and BIC.

• Used for model comparison; otherwise DIC values are not very informative on their own.

– Lower DIC are better.
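The sample-based estimate DIC = 2D̄ − D(θ̄) falls out directly from posterior draws. A minimal Python sketch (a toy conjugate normal-mean model, not the course's R example); with one free parameter, p_D should come out near 1:

```python
import math
import random

random.seed(3)
# toy data: y_i ~ N(theta, 1) with a flat prior, so theta | y ~ N(ybar, 1/n)
y = [random.gauss(2.0, 1.0) for _ in range(50)]
n = len(y)
ybar = sum(y) / n

def deviance(theta):
    # D(theta) = -2 log f(y | theta) for the N(theta, 1) likelihood
    return sum((yi - theta) ** 2 + math.log(2 * math.pi) for yi in y)

draws = [random.gauss(ybar, math.sqrt(1 / n)) for _ in range(10_000)]
D_bar = sum(deviance(t) for t in draws) / len(draws)  # posterior mean deviance
theta_bar = sum(draws) / len(draws)
p_D = D_bar - deviance(theta_bar)                     # effective number of parameters
DIC = D_bar + p_D                                     # equals 2*D_bar - D(theta_bar)
print(round(p_D, 1))  # should be near 1
```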

Example (Gene Testing). 40 mice are given a dose of alcohol, and 40 are kept as controls. Expression levels are subsequently measured for 500 genes in the liver. Y_ij^g is the expression level for gene i, mouse j, group g. Measurements are normally distributed with variance 1: Y_ij^g ∼ Normal(µ_i^g, 1).
Let µ_i^diff = µ_i^alc − µ_i^con with µ_i^diff ∼ Normal(0, τ²), and use a Jeffreys prior for the effect variance, π(τ²) ∝ 1/τ². Consider the group differences

Ȳ_i^diff = Ȳ_i^alc − Ȳ_i^con ∼ Normal(µ_i^alc − µ_i^con, 1/20),

where Ȳ_i^g is the average over the 40 mice for gene i in group g, and Var(Ȳ_i^g) = 1/40.

Layer 1: π(τ²) = 1/τ²
Layer 2: µ_i^diff ∼ N(0, τ²)
Layer 3: N(µ_1^diff, 1/20), ..., N(µ_500^diff, 1/20)
Layer 4: y_1^diff, ..., y_500^diff

The full joint density for the y_i^diff's is

(1/τ²) ∏_{i=1}^{500} N(µ_i^diff | 0, τ²) N(y_i^diff | µ_i^diff, 1/20).

• The Gibbs full conditionals for the µ_i^diff's and τ² are

p(µ_i^diff | τ², y) = Normal( ((1/20)·0 + τ²y_i^diff)/(τ² + 1/20), (τ²/20)/(τ² + 1/20) ),
p(τ² | µ^diff, y) = IG( n/2, (1/2) Σ_{i=1}^{500} (µ_i^diff − 0)² ) = IG( 250, (1/2) Σ_i (µ_i^diff)² ).

• Initialize τ 2 = 1/20, run 10000 iterations with 2000 burn-in. Compute at each iteration:

D(µ^diff, τ²) = −2 log f(y | θ) = −2 Σ_{i=1}^{500} log N(y_i^diff | µ_i^diff, 1/20).

• The deviance at µ̂^diff, the mean vector over the draws, is D(µ̂^diff) = −422.7. Thus

p_D = D̄ − D(µ̂^diff) = 344.9 and DIC = D̄ + p_D = 267.1.

• Consider the null model µ_i^diff = 0 ∀i: the effective number of parameters is

p_D = 0 and DIC = −2 Σ_{i=1}^{500} log N(y_i^diff | 0, 1/20) = 1029.

• Evidence there are alcohol effects for at least some genes.

##Simulated data:
set.seed(123)
mus_true= c(rep(0,400),rnorm(100,0,sqrt(0.5)))
y_diffs= rnorm(500,mus_true,sqrt(0.05))

### initialize
T= 10000
BurnIn= 2000
N= T-BurnIn
draws_tau_2= rep(0,T)
draws_mu_diff= matrix(nrow=T, ncol=500)
Ds= rep(0,T)
tau_2= 1/20

## Run gibbs sampler
for(t in 1:T){
  mus= rnorm(500, tau_2*y_diffs/(tau_2+0.05), sqrt(0.05*tau_2/(tau_2+0.05)))
  tau_2= 1/rgamma(1,250, 0.5*sum(mus^2))
  draws_tau_2[t]= tau_2
  draws_mu_diff[t,]= mus
  Ds[t]= -2*sum(log(dnorm(y_diffs,mus,sqrt(0.05))))
}

### compute
mean_mus= colMeans(draws_mu_diff[2001:T,])
D_mean= -2*sum(log(dnorm(y_diffs,mean_mus,sqrt(0.05))))
p_d1= mean(Ds[2001:T])-D_mean
DIC1= 2*mean(Ds[2001:T])-D_mean
DIC_null= -2*sum(log(dnorm(y_diffs,0,sqrt(0.05))))
cat("p_d1 =",p_d1,"DIC1 =",DIC1,"DIC_null =",DIC_null)

## p_d1 = 304.7747 DIC1 = 227.2288 DIC_null = 698.3259

Consider a third model, which allows “no effect” for some genes. P_1 is the shared probability that µ_i^diff ≠ 0 for a given gene:

µ_i^diff = 0 with probability 1 − P_1, and µ_i^diff ∼ N(0, τ²) with probability P_1.

Again π(τ²) = 1/τ², and use a uniform prior for P_1: P_1 ∼ Beta(1, 1). Let ζ_i = 1{µ_i^diff ≠ 0} and Y_i^diff ∼ Normal(ζ_iµ_i^diff, 1/20).
• Gibbs sampling conditionals for (ζ, µ^diff) for each gene i:

– Draw ζi ∈ {0, 1} by

p(ζ_i = 1 | y, τ², P_1) = P_1 · N(y_i^diff | 0, τ² + 1/20) / [P_1 · N(y_i^diff | 0, τ² + 1/20) + (1 − P_1) · N(y_i^diff | 0, 1/20)],

where y_i^diff |_{ζ_i=1} = µ_i^diff + ε_i, and hence p(y_i^diff | ζ_i = 1) = N(y_i^diff | 0, τ² + 1/20).
– Draw µ_i^diff: if ζ_i = 0, set µ_i^diff = 0. Otherwise, draw µ_i^diff ∼ Normal( τ²y_i^diff/(τ² + 1/20), (τ²/20)/(τ² + 1/20) ).

• Conditional for τ²: p(τ² | ζ, µ^diff, y) = IG( (1/2) Σ_i ζ_i, (1/2) Σ_i ζ_i (µ_i^diff)² ).

• Conditional for P_1: p(P_1 | ζ, µ^diff, y, τ²) = Beta( 1 + Σ_i ζ_i, 1 + 500 − Σ_i ζ_i ).

• p_D = 179.8 and DIC = 106.57: this suggests a good compromise between the null model (DIC = 1029) and the model with an effect for every gene (DIC = 267.1).

#### Now estimate model, allowing some mu_diffs to be 0
draws_p1= rep(0,T)
p1= 0.5
tau_2= 1/5
for(t in 1:T){
  Marg0= (1-p1)*dnorm(y_diffs,0,sqrt(0.05))
  MargNot0= p1*dnorm(y_diffs,0,sqrt(0.05+tau_2))
  Probs1= MargNot0/(Marg0+MargNot0)

  Indicator= rbinom(n=500,1,Probs1)
  mus[Indicator==0]= 0
  NumNot0= sum(Indicator)
  mus[Indicator==1]= rnorm(NumNot0, tau_2*y_diffs[Indicator==1]/(tau_2+0.05),
                           sqrt(0.05*tau_2/(tau_2+0.05)))

  tau_2= 1/rgamma(1,NumNot0/2, 0.5*sum(mus[Indicator==1]^2))

  p1= rbeta(1,1+NumNot0,1+500-NumNot0)

  draws_p1[t]= p1
  draws_tau_2[t]= tau_2
  draws_mu_diff[t,]= mus
  Ds[t]= -2*sum(log(dnorm(y_diffs,mus,sqrt(0.05))))
}

mean_mus= colMeans(draws_mu_diff[2001:T,])
D_mean= -2*sum(log(dnorm(y_diffs,mean_mus,sqrt(0.05))))
p_d2= mean(Ds[2001:T])-D_mean
DIC2= 2*mean(Ds[2001:T])-D_mean
cat("p_d2 =",p_d2,"DIC2 =",DIC2)

## p_d2 = 168.3568 DIC2 = 97.0276

17 Bayesian Model Averaging

Example (Body Fat (cont.)). Recall: % body fat (BF%) measured for 100 adult males, together with 9 predictor variables: age, weight, height, and circumference of neck, chest, abdomen, ankle, bicep, and wrist. Consider the model y = Xβ + ε, where y is population-centered BF%, X is the standardized matrix of predictor variables, and ε ∼ Normal(0, σ²I).

• Previously used an iid normal prior for the β_i's: β ∼ Normal(0, 0.62σ²I), where τ̂² = 0.62 is estimated empirically, and σ² ∼ IG(3, 20).

• Now allow each β_i to be identically 0, with probability 1/2:

β_i = 0 with probability 1/2, and β_i ∼ N(0, 0.62σ²) with probability 1/2.

Equivalently, incorporate model inclusion indicators ζ: y_i = Σ_{k=1}^9 ζ_kβ_kx_ik + ε_i, where the ζ_k's have iid Bernoulli(1/2) priors.

model{
  # sampling model
  for(i in 1:100){
    Y[i] ~ dnorm(mu[i], Prec)
    mu[i] <- coeffs[1]*X1[i]+coeffs[2]*X2[i]+coeffs[3]*X3[i]+
             coeffs[4]*X4[i]+coeffs[5]*X5[i]+coeffs[6]*X6[i]+
             coeffs[7]*X7[i]+coeffs[8]*X8[i]+coeffs[9]*X9[i]}
  # priors
  PrecBeta <- (1/0.62)*Prec
  for(j in 1:9){
    beta[j] ~ dnorm(0,PrecBeta)
    zeta[j] ~ dbern(0.5) # change in new model
    coeffs[j] <- beta[j]*zeta[j]}
  Prec ~ dgamma(3,20)
}

– ζ draw statistics: abdominal circumference (6) is non-zero in all draws, wrist circumference (9) is included in 93%, and age (1) in 78%. The remaining variables are most often excluded from the model.

• Assume an individual has the following standardized measurements:

– Age: 1.20, Weight:0.5, Height:-0.4; circumference of neck:2.2, chest:0.3, abdomen:0.8, ankle:0.2, bicep:0.3, and wrist:0.6.

• Let µ101 be the mean BF% for the above measurements, and y101 the individual's BF%. Compute at each MCMC iteration; for each draw, some subset of the coefficients are 0. WinBUGS code:

mu101 <- 18.5+coeffs[1]*1.2+coeffs[2]*0.5-coeffs[3]*0.4-
         coeffs[4]*2.2+coeffs[5]*0.3+coeffs[6]*0.8+
         coeffs[7]*0.2+coeffs[8]*0.3+coeffs[9]*0.6
y101 ~ dnorm(mu101,Prec)

For many applications, consider multiple models M_1, ..., M_m and combine them rather than selecting the best model.

• Let γ be a quantity that is defined in every model. Prior probabilities p(M_i) define the combined prior distribution

p(γ) = Σ_{i=1}^m p(γ | M_i) p(M_i).

• The combined posterior distribution for γ is

p(γ | y) = Σ_{i=1}^m p(γ | M_i, y) p(M_i | y), where p(M_i | y) = p(y | M_i)p(M_i) / Σ_{j=1}^m p(y | M_j)p(M_j); here p(γ | M_i, y) is the posterior given model and data, p(y | M_i) the likelihood, and p(M_i) the prior.

• In the previous example, consider m = 2⁹ = 512 models.

– Each model M_i includes some subset of the 9 predictor variables.
– Prior: p(M_i) = 2⁻⁹ for i = 1, ..., m.

• If p(M_1) = ··· = p(M_m), then p(M_i | y) ∝ p(y | M_i) and

BF = p(y | M_i)/p(y | M_j) = p(M_i | y)/p(M_j | y)

is the Bayes factor of model i over model j.

• Under posterior sampling, p (Mi | y) is approximated by the proportion of times model i is chosen.
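The bookkeeping here is simple arithmetic; a small Python check with made-up marginal likelihood values and equal prior model probabilities:

```python
# hypothetical marginal likelihoods p(y | M_i) for three models
marg_lik = [0.002, 0.010, 0.008]
prior = [1 / 3] * 3                        # equal prior model probabilities

unnorm = [l * p for l, p in zip(marg_lik, prior)]
post = [u / sum(unnorm) for u in unnorm]   # p(M_i | y)
print([round(p, 2) for p in post])         # [0.1, 0.5, 0.4]

# with equal priors, the Bayes factor equals the posterior-probability ratio
BF_21 = marg_lik[1] / marg_lik[0]
print(round(BF_21, 2), round(post[1] / post[0], 2))
```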

BMA Comments

• Even if no parameters are shared, can use BMA for posterior predictive γ = yn+1.

• BMA appropriately accounts for uncertainty in model choice: Selecting the single best model can underestimate uncertainty, overfit.

• In practice, may consider a large set of models, but omit less plausible models in BMA.

– For example, only include those models i such that

max_j p(M_j | y) / p(M_i | y) < C

for some C (say, C = 20).

18 Mixture Models

Finite Mixture Models

Let f_1, ..., f_K be densities with weights q_1, ..., q_K, where q_1 + ··· + q_K = 1 and q_k ≥ 0 ∀k. Assume y_1, ..., y_n are iid with density p(y_i) = Σ_{k=1}^K q_kf_k(y_i). Equivalently, y_i ∼ f_k with probability q_k, for k = 1, ..., K.

Mixtures are useful for modeling complex distributions as combinations of simpler component distributions f_k.
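Drawing from a finite mixture mirrors the two-stage formulation above: pick component k with probability q_k, then draw from f_k. A minimal Python sketch with two toy normal components:

```python
import random

random.seed(5)
q = [0.3, 0.7]                      # mixture weights, summing to 1
comps = [(-2.0, 1.0), (3.0, 0.5)]   # (mean, sd) for each normal component

def rmix():
    # stage 1: choose a component with probability q_k; stage 2: draw from it
    u, acc = random.random(), 0.0
    for qk, (m, s) in zip(q, comps):
        acc += qk
        if u <= acc:
            return random.gauss(m, s)
    return random.gauss(*comps[-1])  # guard against floating-point round-off

ys = [rmix() for _ in range(50_000)]
mean = sum(ys) / len(ys)
# E[y] = sum_k q_k * mu_k = 0.3*(-2) + 0.7*3 = 1.5
print(round(mean, 1))
```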

Dirichlet Distribution

In a Bayesian framework, a Dirichlet prior is commonly used for q = (q1, . . . , qK ).

• Parameterized by concentration values α = (α1, . . . , αK ):

π(q) = (1/B(α)) ∏_{k=1}^K q_k^{α_k−1}

for q1 + q2 + ··· + qK = 1 and qk ≥ 0 ∀k.

• B (·) is the multivariate beta function, defined by the gamma function Γ:

B(α) = ∏_{k=1}^K Γ(α_k) / Γ(Σ_{k=1}^K α_k).

– For K = 2, the Dirichlet(α1, α2) = Beta(α1, α2).

Note. The Dirichlet distribution generalizes the Beta distribution, with the restriction Σ_i q_i = 1. If q ∼ Beta(α_1, α_2), then (q, 1 − q) ∼ Dirichlet(α_1, α_2). The domain of the Dirichlet is the (K − 1) unit simplex

△^{K−1} = { (q_1, ..., q_K) ∈ R^K : Σ_{k=1}^K q_k = 1 and q_k ≥ 0 ∀k }.

Setting α_1 = ··· = α_K = 1 gives a uniform distribution on △^{K−1}.

Dirichlet-Multinomial Model Let z_1, ..., z_n be iid variables from K categories, with z_i ∈ {1, ..., K} and p(z_i = k) = q_k.
Let n_k be the number of z_i's from category k, i.e., n_k = Σ_{i=1}^n 1{z_i = k}. Thus n = (n_1, ..., n_K) is a sufficient statistic for z.
• If (n_1, ..., n_K) ∼ Multinomial(n, q) with n = Σ_k n_k, the Dirichlet(α) is conjugate for q:

p (q | z) = Dirichlet (α1 + n1, . . . , αK + nK ) .

– n | q ∼ Multinomial(n, q) implies p(n | q) ∝ q_1^{n_1} × ··· × q_K^{n_K}.
– q ∼ Dirichlet(α) implies π(q) ∝ q_1^{α_1−1} × ··· × q_K^{α_K−1}.
– Hence p(q | z) ≡ p(q | n(z)) ∝ p(n(z) | q) π(q) ∝ ∏_{k=1}^K q_k^{α_k+n_k−1} ∝ Dirichlet(α + n).
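A quick Python check of this conjugate update, using the standard fact that normalized independent Gamma(α_k, 1) variates form one Dirichlet(α) draw (the counts below are made up):

```python
import random

random.seed(9)

def rdirichlet(alpha):
    # normalized independent Gamma(a_k, 1) variates give one Dirichlet(alpha) draw
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

alpha = [1.0, 1.0, 1.0]   # uniform prior on the 2-simplex
counts = [30, 10, 60]     # observed category counts n_k
post_alpha = [a + nk for a, nk in zip(alpha, counts)]

draws = [rdirichlet(post_alpha) for _ in range(20_000)]
post_mean = [sum(d[k] for d in draws) / len(draws) for k in range(3)]
exact = [a / sum(post_alpha) for a in post_alpha]  # (alpha_k + n_k) / sum(alpha + n)
print([round(m, 2) for m in post_mean])
print([round(e, 2) for e in exact])
```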

Bayesian Finite Mixture Model

• We put a prior on (f_1, ..., f_K). Assume the f_k's are from the same parametric family: f_k(·) = f(· | θ_k) ∀k.

– For example, f_k = Normal(µ_k, σ_k²) with θ_k = (µ_k, σ_k²) and independent priors

µ_k ∼ Normal(µ_0, τ²) and σ_k² ∼ IG(a, b).

• Let z_i ∈ {1, ..., K} indicate the component that generated y_i: if z_i = k, then y_i is drawn from f(· | θ_k).

• Gibbs sample from the full conditionals for z, q, and (θ_1, ..., θ_K).

– Draw z_i for i = 1, ..., n from a multinomial distribution with size 1 (i.e., z_i = k means the kth element is 1 and the others are 0), with category-k probability

p(z_i = k | y, q, θ) = q_kf_k(y_i) / Σ_{k'=1}^K q_{k'}f_{k'}(y_i) ∝ q_kf(y_i | θ_k).

– Draw q from p(q | y, θ, z) = Dirichlet(α + n), where n_k = Σ_{i=1}^n 1{z_i = k}.

– Draw θ_k for k = 1, ..., K from p(θ_k | y, q, z) ∝ π(θ_k) f(y_k | θ_k), where y_k = {y_i : z_i = k}. This may require a Metropolis step if conjugate priors are not used.

Example (Galaxies). Measurements taken from 82 galaxies in Corona Borealis region of space. Consider velocity in km/sec for each galaxy: Galaxies with a higher velocity are farther from earth.

• Model $y$ as a mixture of normal densities: for $i = 1, \dots, n$,

$$p\left(y_i \mid q, \{\mu_k, \sigma_k^2\}_{k=1}^K\right) = \sum_{k=1}^K q_k\, \text{Normal}(y_i \mid \mu_k, \sigma_k^2), \quad \text{iid}.$$

• Use a uniform Dirichlet prior for $q$: $q \sim \text{Dirichlet}(1, \dots, 1)$.

• Use a mildly informative normal-inverse gamma prior for each $(\mu_k, \sigma_k^2)$: $\mu_k \stackrel{iid}{\sim} \text{Normal}(20000, 10^8)$ and $\sigma_k^2 \stackrel{iid}{\sim} \text{IG}(2, 10^8)$.

• For K = 4 clusters, Gibbs sampling initialization is:

– Draw $(\mu_k, \sigma_k^2)$ from the prior for $k = 1, \dots, 4$.

– Set qk = 1/4 for k = 1,..., 4.

• The full conditional for each $(\mu_k, \sigma_k^2)$ can be drawn from a normal-inverse gamma posterior.
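Explicitly, with $\mu_k \sim \text{Normal}(\mu_0, \tau^2)$ and $\sigma_k^2 \sim \text{IG}(a, b)$, the updates used in the code (writing $\bar{y}_k$ for the mean of cluster $k$'s observations, matching a_post, b_post, post_mu_mean, and post_mu_var below) are:

```latex
\sigma_k^2 \mid y, z \sim \mathrm{IG}\!\left(a + \tfrac{n_k}{2} - \tfrac{1}{2},\;
    b + \tfrac{1}{2}\sum_{i:\, z_i = k}(y_i - \bar{y}_k)^2\right), \\
\mu_k \mid \sigma_k^2, y, z \sim \mathrm{Normal}\!\left(
    \frac{\sigma_k^2 \mu_0 + \tau^2 \sum_{i:\, z_i = k} y_i}{n_k \tau^2 + \sigma_k^2},\;
    \frac{\sigma_k^2 \tau^2}{n_k \tau^2 + \sigma_k^2}\right).
```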

• Run Gibbs sampling for T = 50000 iterations.

library(gtools)  ## load package 'gtools' for the Dirichlet distribution
library(MASS)    ## load galaxy data from package 'MASS'
y = galaxies
n = length(y); T = 50000; BurnIn = 2000; N = T - BurnIn
K = 4  # number of mixture components
probs = matrix(nrow = K, ncol = n)  # prob. for each obs.
draws_q = matrix(nrow = T, ncol = K)
draws_mu = matrix(nrow = T, ncol = K)
draws_sigma2 = matrix(nrow = T, ncol = K)
x = seq(from = 5000, to = 40000, length.out = 200)  ### x values to compute density
draws_dens = matrix(nrow = T, ncol = 200)
Ds = rep(0, T)
## Prior for mu
mu0 = 20000; tau2 = 10^8
## Prior for sigma2
a = 2; b = 10^8
## Prior for q
alpha = rep(1, K)

## Initialize
mu = rnorm(K, mu0, sqrt(tau2))
sigma2 = 1/rgamma(K, a, b)
q = rep(1/K, K)
Z = rep(1, n)

## Run Gibbs sampler
set.seed(123)
for(t in 1:T){
  ### Generate component indicators Z
  for(k in 1:K) probs[k,] = q[k]*dnorm(y, mu[k], sqrt(sigma2[k]))
  for(i in 1:n) Z[i] = which(rmultinom(1, size = 1, probs[,i]) == 1)  # multinomial draw

  ### Generate q
  nk = c()
  for(k in 1:K) nk[k] = sum(Z == k)
  q = rdirichlet(1, alpha + nk)

  ### Generate mu_k, sigma2_k from normal-inverse-gamma posterior
  for(k in 1:K){
    a_post = a + nk[k]/2 - 1/2
    b_post = b + (1/2)*sum((y[Z==k] - mean(y[Z==k]))^2)
    sigma2[k] = 1/rgamma(1, a_post, b_post)
    post_mu_mean = (sigma2[k]*mu0 + tau2*sum(y[Z==k]))/(nk[k]*tau2 + sigma2[k])
    post_mu_var = sigma2[k]*tau2/(nk[k]*tau2 + sigma2[k])
    mu[k] = rnorm(1, post_mu_mean, sqrt(post_mu_var))
  }
  draws_q[t,] = q
  draws_mu[t,] = mu
  draws_sigma2[t,] = sigma2
  ### Compute density over grid
  Dens = rep(0, 200)
  for(k in 1:K) Dens = Dens + q[k]*dnorm(x, mu[k], sqrt(sigma2[k]))
  draws_dens[t,] = Dens
}

[Figure: histogram of y, velocity (km/sec), and trace plots of draws_mu over the final iterations.]

for(t in 1:T){  # fix label switching: reorder components by mu
  Order = order(draws_mu[t,])
  draws_q[t,] = draws_q[t, Order]
  draws_mu[t,] = draws_mu[t, Order]
  draws_sigma2[t,] = draws_sigma2[t, Order]
}

## Mean values:
MeanMu = colMeans(draws_mu)
MeanQ = colMeans(draws_q)
Mean_Sigma_2 = colMeans(draws_sigma2)
## Density estimate
MeanDens = colMeans(draws_dens)
cat(MeanMu, "\n", MeanQ, "\n", Mean_Sigma_2, "\n")

## 12641 18381 22017 26931
## 0.09154 0.281 0.4643 0.1631
## 67452042 60855936 43571728 73814521

[Figure: trace plots of draws_mu and draws_sigma2 over iterations 49500–50000.]

[Figure: trace plot of draws_q over iterations 49500–50000, and the posterior mean density estimate over velocity (km/sec) overlaid on the histogram of y.]

Choosing K

• Number of components (i.e., clusters) K can be selected using standard model selection tools.

• DIC, BIC, cross-validation, etc.

• Alternatively, put a prior on K.

• Or, choose K large and use small values for α: some components will have negligible probability.

• Letting $K \to \infty$ with $\alpha_k = c/K$ for all $k$ gives a Dirichlet process.
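The "large K, small α" effect can be seen directly by simulation: symmetric Dirichlet draws with $\alpha_k = c/K$ concentrate almost all mass on a few components. A Python sketch (K, c, and the 5% threshold are arbitrary illustrative choices):

```python
import random

def rdirichlet(alpha, rng):
    # Dirichlet draw via normalized independent Gamma(alpha_k, 1) variates
    g = [rng.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [v / s for v in g]

rng = random.Random(42)
K, c = 50, 4.0
q = rdirichlet([c / K] * K, rng)   # symmetric Dirichlet with alpha_k = c/K

# Most components get negligible probability; a handful dominate
heavy = sum(1 for v in q if v > 0.05)
print(heavy, "of", K, "components carry more than 5% of the mass")
```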
