
Hierarchical models - motivation: James-Stein inference

• Suppose X ~ N(θ, 1)
  – X is admissible (not dominated) for estimating θ under squared-error loss
• Now suppose X_i ~ N(θ_i, 1), i = 1, …, r
  – X = (X_1, …, X_r) is admissible if r = 1, 2 but not if r ≥ 3
  – for r ≥ 3,
        δ_i = (1 − (r − 2) / Σ_i X_i²) X_i
    yields better estimates
  – known as James-Stein estimation

Hierarchical models - motivation: James-Stein inference (cont'd)

• Bayes view: X_i ~ N(θ_i, 1) and θ_i ~ N(0, a)
  – posterior distn: θ_i | X_i ~ N(a X_i / (a + 1), a / (a + 1))
  – the posterior mean is (1 − 1/(a + 1)) X_i
  – need to estimate a; one natural approach yields James-Stein
• Summary
  – estimation results depend on the loss function
  – under squared-error loss we do well on average but perhaps poorly for a single component
  – powerful lesson about combining related problems to get improved inferences

Hierarchical models

Suppose we have data Y_ij, i = 1, …, n_j, j = 1, …, J, such that Y_ij, i = 1, …, n_j, are independent given θ_j with distribution p(Y | θ_j); e.g., scores (Y) for students (i) in classrooms (j). It might be reasonable to expect the θ_j's to be "similar" (but not necessarily identical). Therefore, we may try to estimate the population distribution of the θ_j's. This is achieved in a natural way if we use a prior distribution in which the θ_j's are viewed as a sample from a common population distribution.

Hierarchical models

• Key: the observed data, y_ij, with units indexed by i within groups indexed by j, can be used to estimate aspects of the population distribution of the θ_j's even though the values of θ_j are not themselves observed.
• How? It is natural to model such a problem hierarchically:
  – observable outcomes are modeled conditionally on parameters θ
  – θ is given a probabilistic specification in terms of other parameters, φ, known as hyperparameters

Hierarchical models

• Nonhierarchical models are usually inappropriate for hierarchical data:
  – a single θ (i.e., θ_j ≡ θ for all j) may be inadequate to fit a combined data set
  – separate, unrelated θ_j's are likely to "overfit" the data
  – information about one θ_j can be obtained from the other groups' data
• A hierarchical model uses many parameters, but the population distribution induces enough structure to avoid overfitting.

Setting up hierarchical models: exchangeability

Recall: a set of random variables (θ_1, …, θ_k) is exchangeable if the joint distribution is invariant to permutations of the indexes (1, …, k). The indexes then contain no information about the values of the random variables.

• hierarchical models often use exchangeable models for the prior distribution of model parameters
• iid random variables are one example
• seemingly non-exchangeable r.v.'s may become exchangeable if we condition on all available information (e.g., regression analysis)

Setting up hierarchical models: exchangeable models

• Basic form of an exchangeable model:
  – θ = (θ_1, …, θ_k) are independent conditional on additional parameters φ (known as hyperparameters):
        p(θ | φ) = ∏_{j=1}^k p(θ_j | φ)
  – φ is referred to as the hyperparameter(s), with hyperprior distn p(φ)
  – this implies p(θ) = ∫ p(θ | φ) p(φ) dφ
  – we work with the joint posterior distribution, p(θ, φ | y)
• One objection to the exchangeable model is that we may have other information, say covariates X_j; in that case we may take
        p(θ_1, …, θ_J | X_1, …, X_J) = ∏_{j=1}^J p(θ_j | φ, X_j)
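The James-Stein setup above is a simple instance of this exchangeable structure: the prior θ_i ~ N(0, a) is exchangeable, and the posterior mean (1 − 1/(a + 1)) X_i shrinks each component toward the prior mean. A minimal simulation sketch, not from the source, comparing average squared-error risk of the raw estimates, the James-Stein estimator, and the posterior mean; the dimension r = 50, prior variance a = 1, and replication count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
r, a, n_sims = 50, 1.0, 2000      # dimension, prior variance, replications

mse_mle = mse_js = mse_bayes = 0.0
for _ in range(n_sims):
    theta = rng.normal(0.0, np.sqrt(a), size=r)  # theta_i ~ N(0, a)
    x = rng.normal(theta, 1.0)                   # X_i | theta_i ~ N(theta_i, 1)

    js = (1.0 - (r - 2) / np.sum(x ** 2)) * x    # James-Stein shrinkage
    bayes = (1.0 - 1.0 / (a + 1.0)) * x          # posterior mean, a known

    mse_mle += np.mean((x - theta) ** 2)
    mse_js += np.mean((js - theta) ** 2)
    mse_bayes += np.mean((bayes - theta) ** 2)

print("avg risk  X: %.3f   James-Stein: %.3f   posterior mean: %.3f"
      % (mse_mle / n_sims, mse_js / n_sims, mse_bayes / n_sims))
```

With a = 1 the posterior-mean shrinkage factor is 1/2, and the data-driven James-Stein factor 1 − (r − 2)/Σ X_i² estimates roughly the same quantity (since E[Σ X_i²] = r(a + 1)), so both should roughly halve the average risk relative to using X directly.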
Setting up hierarchical models

• The model is specified in nested stages:
  – sampling distribution p(y | θ) (first level of hierarchy)
  – prior (or population) distribution for θ: p(θ | φ) (second level of hierarchy)
  – prior distribution for φ (hyperprior): p(φ)
  – Note: more levels are possible
  – the hyperprior at the highest level is often diffuse, but improper priors must be checked carefully to avoid improper posterior distributions

Setting up hierarchical models

• Inference
  – Joint distn: p(y, θ, φ) = p(y | θ, φ) p(θ | φ) p(φ) = p(y | θ) p(θ | φ) p(φ)
  – Posterior distribution:
        p(θ, φ | y) ∝ p(φ) p(θ | φ) p(y | θ) = p(θ | y, φ) p(φ | y)
    ∗ often p(θ | φ) is conjugate for p(y | θ)
    ∗ if we know (or fix) φ: p(θ | y, φ) follows from conjugacy
    ∗ then we need inference for φ: p(φ | y)

Computational approaches for hierarchical models

• Marginal model:
        p(y | φ) = ∫ p(y | θ) p(θ | φ) dθ
  do inference only for φ (e.g., marginal maximum likelihood)
• Empirical Bayes:
        p(θ | y, φ̂) ∝ p(y | θ) p(θ | φ̂)
  do inference for θ
• Hierarchical Bayes (a.k.a. full Bayes):
        p(θ, φ | y) ∝ p(y | θ) p(θ | φ) p(φ)
  inference for both θ and φ

Hierarchical models and random effects: animal breeding example

Consider the following mixed linear model, commonly used in animal breeding studies:

        Y = Xβ + Zu + e

    X = design matrix for fixed effects
    Z = design matrix for random effects
    β = fixed-effects parameters
    u = random-effects parameters
    e = individual variation ~ N(0, σ_e² I)

so that

        Y | β, u, σ_e² ~ N(Xβ + Zu, σ_e² I)
        u | σ_a² ~ N(0, σ_a² A)

(we can also think of β as random with p(β) ∝ 1).

Hierarchical models and random effects: animal breeding example

• Marginal model (after integrating out u):
        Y | β, σ_a², σ_e² ~ N(Xβ, σ_a² Z A Z′ + σ_e² I)
• Note: the separation of parameters into θ and φ is somewhat ambiguous here:
  – the model specification suggests φ = {σ_a²} and θ = {β, u, σ_e²}
  – the marginal model suggests φ = {β, σ_a², σ_e²} and θ = {u}
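To make the marginal model concrete, here is a minimal numpy/scipy sketch, not from the source, that evaluates the marginal log-likelihood under Y | β, σ_a², σ_e² ~ N(Xβ, σ_a² Z A Z′ + σ_e² I). The function name, the toy dimensions, the intercept-only X, and the default A = I (unrelated random effects) are illustrative assumptions; in animal breeding A would typically be a known relationship matrix:

```python
import numpy as np
from scipy.stats import multivariate_normal

def marginal_loglik(y, X, Z, beta, sigma2_a, sigma2_e, A=None):
    """Log density of N(X beta, sigma2_a * Z A Z' + sigma2_e * I) at y,
    i.e. the marginal model with the random effects u integrated out."""
    n = len(y)
    if A is None:
        A = np.eye(Z.shape[1])               # assume unrelated random effects
    V = sigma2_a * Z @ A @ Z.T + sigma2_e * np.eye(n)
    return multivariate_normal.logpdf(y, mean=X @ beta, cov=V)

# toy example: 6 observations, intercept-only fixed effect, 3 groups
rng = np.random.default_rng(1)
X = np.ones((6, 1))                          # intercept only
Z = np.kron(np.eye(3), np.ones((2, 1)))      # 2 observations per group
y = rng.normal(size=6)
print(marginal_loglik(y, X, Z, beta=np.array([0.0]),
                      sigma2_a=1.0, sigma2_e=1.0))
```

Maximizing this function over (β, σ_a², σ_e²) would be marginal maximum likelihood, the first of the three computational approaches listed above.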
Hierarchical models and random effects: animal breeding example

• Empirical Bayes (known as REML/BLUP):
  We can estimate σ_a², σ_e² by marginal (restricted?) maximum likelihood, giving σ̂_a², σ̂_e². Then
        p(u, β | y, σ̂_a², σ̂_e²) ∝ p(y | β, u, σ̂_e²) p(u | σ̂_a²)
  (a joint normal distn).
• Hierarchical Bayes:
        p(β, σ_a², σ_e², u | y) ∝ p(y | β, u, σ_e²) p(u | σ_a²) p(β, σ_a², σ_e²)

Computation with hierarchical models

• Two cases:
  – conjugate case (p(θ | φ) is a conjugate prior for p(y | θ)):
    ∗ approach described below
  – non-conjugate case:
    ∗ requires more advanced computing
    ∗ problem-specific implementations
• Computational strategy for the conjugate case:
  – write p(θ, φ | y) = p(φ | y) p(θ | φ, y)
  – identify the conditional posterior density of θ given φ, p(θ | φ, y) (easy for conjugate models)
  – obtain the marginal posterior distribution of φ, p(φ | y)
  – simulate from p(φ | y) and then from p(θ | φ, y)

Computation with hierarchical models: the marginal posterior distribution p(φ | y)

• Approaches for obtaining p(φ | y):
  – integration: p(φ | y) = ∫ p(θ, φ | y) dθ
  – algebra: for a convenient value of θ,
        p(φ | y) = p(θ, φ | y) / p(θ | φ, y)
• Sampling from p(φ | y):
  – easy if it is a known distribution
  – grid if φ is low-dimensional
  – more sophisticated methods (later)

Beta-binomial example

• Series of toxicology studies
• Study j: n_j exchangeable individuals, of whom y_j develop tumors
• Model specification:
  – y_j | θ_j ~ Bin(n_j, θ_j), j = 1, …, J (indep)
  – θ_j, j = 1, …, J | α, β ~ Beta(α, β) (iid)
  – p(α, β): to be specified later, hopefully "non-informative"
• Marginal model:
  – we can integrate out θ_j, j = 1, …, J, in this case:
        p(y | α, β) = ∫ ⋯ ∫ ∏_{j=1}^J [Γ(α+β) / (Γ(α)Γ(β))] θ_j^{α−1} (1 − θ_j)^{β−1} · (n_j choose y_j) θ_j^{y_j} (1 − θ_j)^{n_j − y_j} dθ_1 ⋯ dθ_J
                    = ∏_{j=1}^J (n_j choose y_j) [Γ(α+β) / (Γ(α)Γ(β))] · [Γ(α+y_j) Γ(β+n_j−y_j) / Γ(α+β+n_j)]
  – y_j, j = 1, …, J, are independent
  – the distn of y_j is known as the beta-binomial distn

Beta-binomial example

• Conditional distn of the θ's given α, β, y:
  – p(θ | α, β, y) = ∏_j Beta(α + y_j, β + n_j − y_j)
  – independent conjugate analyses
  – find this by algebra or by inspection of p(θ, α, β | y)
  – the analysis is thus reduced to finding (and simulating from) p(α, β | y)
• Marginal posterior distn of α, β:
        p(α, β | y) ∝ p(α, β) ∏_{j=1}^J [Γ(α+β) / (Γ(α)Γ(β))] · [Γ(α+y_j) Γ(β+n_j−y_j) / Γ(α+β+n_j)]
  – could derive this from the marginal distn on the previous slide
  – could also derive it from the joint posterior distn
  – not a known distn (in α, β) but easy to evaluate (see the sketch below)
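Since each factor involves only gamma functions, evaluating log p(α, β | y) takes a few lines. A minimal sketch, not from the source: the function name and toy counts are hypothetical, the binomial coefficients are dropped because they are constant in (α, β), and the flat default prior is just the (improper) first-try placeholder replaced in the hyperprior discussion below:

```python
import numpy as np
from scipy.special import gammaln

def log_post(alpha, beta, y, n, log_prior=lambda a, b: 0.0):
    """Unnormalized log p(alpha, beta | y) for the beta-binomial model.

    The (n_j choose y_j) terms are omitted: they do not involve
    (alpha, beta).  log_prior defaults to a flat p(alpha, beta)."""
    if alpha <= 0 or beta <= 0:
        return -np.inf
    ll = np.sum(gammaln(alpha + beta) - gammaln(alpha) - gammaln(beta)
                + gammaln(alpha + y) + gammaln(beta + n - y)
                - gammaln(alpha + beta + n))
    return ll + log_prior(alpha, beta)

# toy data: 3 studies (hypothetical counts, not the rat tumor data)
y = np.array([1, 4, 2])
n = np.array([10, 12, 14])
print(log_post(2.0, 10.0, y, n))
```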
Beta-binomial example

• Hyperprior distn p(α, β)
  – First try: p(α, β) ∝ 1 (flat; noninformative?)
    ∗ equivalent to p(α/(α+β), α+β) ∝ (α+β)
    ∗ equivalent to p(log(α/β), log(α+β)) ∝ αβ
    ∗ check to see if the posterior is proper:
      · consider different cases (e.g., α → 0 with β fixed)
      · if α, β → ∞ with α/(α+β) = c, then p(α, β | y) ∝ constant (not integrable)
      · this is an improper distn
      · a contour plot would also show this (lots of probability extending out towards infinity)
  – Second try: p(α/(α+β), α+β) ∝ 1 (flat on the prior mean and precision)
    ∗ more intuitive; these two parameters are plausibly independent
    ∗ equivalent to p(α, β) ∝ 1/(α+β)
    ∗ still leads to an improper posterior distn
  – Third try: p(log(α/β), log(α+β)) ∝ 1 (flat on a natural transformation of the prior mean and variance)
    ∗ equivalent to p(α, β) ∝ 1/(αβ)
    ∗ still leads to an improper posterior distn
  – Fourth try: p(α/(α+β), (α+β)^{−1/2}) ∝ 1 (flat on the prior s.d. and prior mean)
    ∗ equivalent to p(α, β) ∝ (α+β)^{−5/2}
    ∗ "final answer": a proper posterior distn
    ∗ equivalent to p(log(α/β), log(α+β)) ∝ αβ(α+β)^{−5/2} (this will come up later)

Beta-binomial example

• Computing
  – later we consider more sophisticated approaches
  – for now, use a grid approach:
    ∗ simulate α, β from a grid approximation to the posterior distn
    ∗ then simulate the θ's using the conjugate beta posterior distn (see the sketch at the end of this section)
  – convenient to use the (log(α/β), log(α+β)) scale because the contours "look better" and we can get away with a smaller grid
• Illustrate with the rat tumor data (separate handout)

Normal-normal hierarchical model

• Data model
  – y_j | θ_j ~ N(θ_j, σ_j²), j = 1, …, J (indep)
  – the σ_j²'s are assumed known (we can relax this assumption later)
  – motivation: y_j could be a summary statistic with an (approx) normal distn from the j-th study (e.g., a regression coefficient or a sample mean)
• Prior distn
  – need a prior distn p(θ_1, …, θ_J)
  – if exchangeable, then model the θ's as iid given parameters φ
  – some additional comments follow

Normal-normal hierarchical model

• Constructing a prior distribution
  We can think of this data model as a one-way ANOVA model (especially if y_j is a sample mean of n_j obs in group j).
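Returning to the grid strategy for the beta-binomial example: a minimal sketch, not from the source, that builds a grid on the (log(α/β), log(α+β)) scale using the "final answer" hyperprior p(α, β) ∝ (α+β)^{−5/2}, samples hyperparameter values from the grid, and then draws the θ_j's from their conjugate beta conditionals. The toy data, function name, and grid ranges are illustrative assumptions; with real data (e.g., the rat tumors) the grid would be centered after inspecting posterior contours:

```python
import numpy as np
from scipy.special import gammaln

y = np.array([1, 4, 2])          # toy counts again, not the rat tumor data
n = np.array([10, 12, 14])

def log_post_uv(u, v):
    """log p(u, v | y) up to a constant, with u = log(alpha/beta) and
    v = log(alpha + beta).  The prior p(alpha, beta) propto
    (alpha+beta)^{-5/2} becomes alpha*beta*(alpha+beta)^{-5/2} here."""
    alpha = np.exp(v) / (1.0 + np.exp(-u))
    beta = np.exp(v) - alpha
    ll = np.sum(gammaln(alpha + beta) - gammaln(alpha) - gammaln(beta)
                + gammaln(alpha + y) + gammaln(beta + n - y)
                - gammaln(alpha + beta + n))
    return ll + np.log(alpha) + np.log(beta) - 2.5 * v

# grid over (u, v); ranges are illustrative, not tuned
us = np.linspace(-3.0, 1.0, 100)
vs = np.linspace(-1.0, 5.0, 100)
lp = np.array([[log_post_uv(u, v) for v in vs] for u in us])
prob = np.exp(lp - lp.max())     # subtract max for numerical stability
prob /= prob.sum()

# sample grid cells, then theta_j | alpha, beta, y ~ Beta(alpha+y_j, beta+n_j-y_j)
rng = np.random.default_rng(0)
idx = rng.choice(prob.size, size=1000, p=prob.ravel())
iu, iv = np.unravel_index(idx, prob.shape)
alpha = np.exp(vs[iv]) / (1.0 + np.exp(-us[iu]))
beta = np.exp(vs[iv]) - alpha
theta = rng.beta(alpha[:, None] + y, beta[:, None] + n - y)
print(theta.mean(axis=0))        # posterior means of theta_1, ..., theta_J
```

A refinement mentioned in standard treatments (and not shown here) is to jitter the draws within grid cells so the simulated (u, v) values are not restricted to the grid points.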