Hierarchical Models & Bayesian Model Selection
Total Page:16
File Type:pdf, Size:1020Kb
Hierarchical Models & Bayesian Model Selection Geoffrey Roeder Departments of Computer Science and Statistics University of British Columbia Jan. 20, 2016 Geoffrey Roeder Hierarchical Models & Bayesian Model Selection Jan. 20, 2016 Contact information Please report any typos or errors to geoff[email protected] Geoffrey Roeder Hierarchical Models & Bayesian Model Selection Jan. 20, 2016 Outline 1 Hierarchical Bayesian Modelling Coin toss redux: point estimates for θ Hierarchical models Application to clinical study 2 Bayesian Model Selection Introduction Bayes Factors Shortcut for Marginal Likelihood in Conjugate Case Geoffrey Roeder Hierarchical Models & Bayesian Model Selection Jan. 20, 2016 Outline 1 Hierarchical Bayesian Modelling Coin toss redux: point estimates for θ Hierarchical models Application to clinical study 2 Bayesian Model Selection Introduction Bayes Factors Shortcut for Marginal Likelihood in Conjugate Case Geoffrey Roeder Hierarchical Models & Bayesian Model Selection Jan. 20, 2016 Let Y be a random variable denoting number of observed heads in n coin tosses. Then, we can model Y ∼ Bin(n; θ), with probability mass function n p(Y = y j θ) = θy (1 − θ)n−y (1) y We want to estimate the parameter θ Coin toss: point estimates for θ Probability model Consider the experiment of tossing a coin n times. Each toss results in heads with probability θ and tails with probability 1 − θ Geoffrey Roeder Hierarchical Models & Bayesian Model Selection Jan. 20, 2016 We want to estimate the parameter θ Coin toss: point estimates for θ Probability model Consider the experiment of tossing a coin n times. Each toss results in heads with probability θ and tails with probability 1 − θ Let Y be a random variable denoting number of observed heads in n coin tosses. Then, we can model Y ∼ Bin(n; θ), with probability mass function n p(Y = y j θ) = θy (1 − θ)n−y (1) y Geoffrey Roeder Hierarchical Models & Bayesian Model Selection Jan. 20, 2016 Coin toss: point estimates for θ Probability model Consider the experiment of tossing a coin n times. Each toss results in heads with probability θ and tails with probability 1 − θ Let Y be a random variable denoting number of observed heads in n coin tosses. Then, we can model Y ∼ Bin(n; θ), with probability mass function n p(Y = y j θ) = θy (1 − θ)n−y (1) y We want to estimate the parameter θ Geoffrey Roeder Hierarchical Models & Bayesian Model Selection Jan. 20, 2016 Let ``(θjy) := log p(yjθ), the log-likelihood. Then, θ^ML = argmax``(θjy) = argmax ylog(θ) + (n − y)log(1 − θ) (2) θ θ Since the log likelihood is a concave function of θ, @`(θjy) argmax `(θjy) , 0 = @θ ^ θ θML y n − y , 0 = − (3) θ^ML 1 − θ^ML y , θ^ = ML n Coin toss: point estimates for θ Maximum Likelihood By interpreting p(Y = yjθ) as a function of θ rather than y, we get the likelihood function for θ Geoffrey Roeder Hierarchical Models & Bayesian Model Selection Jan. 20, 2016 Since the log likelihood is a concave function of θ, @`(θjy) argmax `(θjy) , 0 = @θ ^ θ θML y n − y , 0 = − (3) θ^ML 1 − θ^ML y , θ^ = ML n Coin toss: point estimates for θ Maximum Likelihood By interpreting p(Y = yjθ) as a function of θ rather than y, we get the likelihood function for θ Let ``(θjy) := log p(yjθ), the log-likelihood. Then, θ^ML = argmax``(θjy) = argmax ylog(θ) + (n − y)log(1 − θ) (2) θ θ Geoffrey Roeder Hierarchical Models & Bayesian Model Selection Jan. 20, 2016 Coin toss: point estimates for θ Maximum Likelihood By interpreting p(Y = yjθ) as a function of θ rather than y, we get the likelihood function for θ Let ``(θjy) := log p(yjθ), the log-likelihood. Then, θ^ML = argmax``(θjy) = argmax ylog(θ) + (n − y)log(1 − θ) (2) θ θ Since the log likelihood is a concave function of θ, @`(θjy) argmax `(θjy) , 0 = @θ ^ θ θML y n − y , 0 = − (3) θ^ML 1 − θ^ML y , θ^ = ML n Geoffrey Roeder Hierarchical Models & Bayesian Model Selection Jan. 20, 2016 Asymptotic result that this approaches true parameter Coin toss: point estimates for θ Point estimate for θ: Maximum Likelihood What if sample size is small? Geoffrey Roeder Hierarchical Models & Bayesian Model Selection Jan. 20, 2016 Coin toss: point estimates for θ Point estimate for θ: Maximum Likelihood What if sample size is small? Asymptotic result that this approaches true parameter Geoffrey Roeder Hierarchical Models & Bayesian Model Selection Jan. 20, 2016 Recall that if p(yjθ) is in the exponential family, there exists a conjugate prior p(θ) s.t. if p(θ) 2 F, then p(yjθ)p(θ) 2 F Saw last time that binomial is in the exponential family, and θ ∼ Beta(α; β) is a conjugate prior. Γ(α + β) p(θjα; β) = θα−1(1 − θ)β−1 (5) Γ(α)Γ(β) Coin toss: point estimates for θ Maximum A Posteriori Alternative analysis: reverse the conditioning with Bayes' Theorem: p(yjθ)p(θ) p(θjy) = (4) p(y) Lets us encode our prior beliefs or knowledge about θ in a prior distribution for the parameter, p(θ) Geoffrey Roeder Hierarchical Models & Bayesian Model Selection Jan. 20, 2016 Saw last time that binomial is in the exponential family, and θ ∼ Beta(α; β) is a conjugate prior. Γ(α + β) p(θjα; β) = θα−1(1 − θ)β−1 (5) Γ(α)Γ(β) Coin toss: point estimates for θ Maximum A Posteriori Alternative analysis: reverse the conditioning with Bayes' Theorem: p(yjθ)p(θ) p(θjy) = (4) p(y) Lets us encode our prior beliefs or knowledge about θ in a prior distribution for the parameter, p(θ) Recall that if p(yjθ) is in the exponential family, there exists a conjugate prior p(θ) s.t. if p(θ) 2 F, then p(yjθ)p(θ) 2 F Geoffrey Roeder Hierarchical Models & Bayesian Model Selection Jan. 20, 2016 Coin toss: point estimates for θ Maximum A Posteriori Alternative analysis: reverse the conditioning with Bayes' Theorem: p(yjθ)p(θ) p(θjy) = (4) p(y) Lets us encode our prior beliefs or knowledge about θ in a prior distribution for the parameter, p(θ) Recall that if p(yjθ) is in the exponential family, there exists a conjugate prior p(θ) s.t. if p(θ) 2 F, then p(yjθ)p(θ) 2 F Saw last time that binomial is in the exponential family, and θ ∼ Beta(α; β) is a conjugate prior. Γ(α + β) p(θjα; β) = θα−1(1 − θ)β−1 (5) Γ(α)Γ(β) Geoffrey Roeder Hierarchical Models & Bayesian Model Selection Jan. 20, 2016 Thus, p(θjy) / p(yjθ)p(θ)p(y) so that θ^MAP = argmax p(θjy) θ = argmax p(yjθ)p(θ) θ = argmax log p(yjθ)p(θ) θ By evaluating the first partial derivative w.r.t θ and setting to 0 at θ^MAP we can derive y + α − 1 θ^ = (6) MAP n + β − 1 + α − 1 Coin toss: point estimates for θ Maximum A Posteriori Moreover, for any given realization y of Y, the marginal distribution p(y) = R p(yjθ0)p(θ0)dθ0 is a constant Geoffrey Roeder Hierarchical Models & Bayesian Model Selection Jan. 20, 2016 By evaluating the first partial derivative w.r.t θ and setting to 0 at θ^MAP we can derive y + α − 1 θ^ = (6) MAP n + β − 1 + α − 1 Coin toss: point estimates for θ Maximum A Posteriori Moreover, for any given realization y of Y, the marginal distribution p(y) = R p(yjθ0)p(θ0)dθ0 is a constant Thus, p(θjy) / p(yjθ)p(θ)p(y) so that θ^MAP = argmax p(θjy) θ = argmax p(yjθ)p(θ) θ = argmax log p(yjθ)p(θ) θ Geoffrey Roeder Hierarchical Models & Bayesian Model Selection Jan. 20, 2016 Coin toss: point estimates for θ Maximum A Posteriori Moreover, for any given realization y of Y, the marginal distribution p(y) = R p(yjθ0)p(θ0)dθ0 is a constant Thus, p(θjy) / p(yjθ)p(θ)p(y) so that θ^MAP = argmax p(θjy) θ = argmax p(yjθ)p(θ) θ = argmax log p(yjθ)p(θ) θ By evaluating the first partial derivative w.r.t θ and setting to 0 at θ^MAP we can derive y + α − 1 θ^ = (6) MAP n + β − 1 + α − 1 Geoffrey Roeder Hierarchical Models & Bayesian Model Selection Jan. 20, 2016 Can also choose uninformative prior: Jeffreys’ prior. For 1 1 beta-binomial model, corresponds to (α; β) = ( 2 ; 2 ). Deriving an analytic form for the posterior is possible also if the prior is conjugate. We saw last week that for a single Binomial experiment with a conjugate Beta, p(θjy) ∼ Beta(α + y − 1; β + n − y − 1) Coin toss: point estimates for θ Point estimate for θ: Choosing α; β The point estimate for θ^MAP shows choices of α and β correspond to having already seen prior data. Can encode strength of prior belief using these parameters. Geoffrey Roeder Hierarchical Models & Bayesian Model Selection Jan. 20, 2016 Deriving an analytic form for the posterior is possible also if the prior is conjugate. We saw last week that for a single Binomial experiment with a conjugate Beta, p(θjy) ∼ Beta(α + y − 1; β + n − y − 1) Coin toss: point estimates for θ Point estimate for θ: Choosing α; β The point estimate for θ^MAP shows choices of α and β correspond to having already seen prior data.