Sampling hyperparameters in hierarchical models: improving on Gibbs for high-dimensional latent fields and large data sets

Richard A. Norton^1, J. Andrés Christen^2 and Colin Fox^3

^1 Department of Mathematics and Statistics, University of Otago, Dunedin, New Zealand, [email protected]
^2 Centro de Investigación en Matemáticas (CIMAT), CONACYT, Guanajuato, GT, Mexico, [email protected]
^3 Department of Physics, University of Otago, Dunedin, New Zealand, [email protected]

20 October 2016

arXiv:1610.06632v1 [stat.CO] 21 Oct 2016

Abstract

We consider posterior sampling in the very common Bayesian hierarchical model in which observed data depend on high-dimensional latent variables that, in turn, depend on relatively few hyperparameters. When the full conditional over the latent variables has a known form, the marginal posterior distribution over hyperparameters is accessible and can be sampled using a Markov chain Monte Carlo (MCMC) method on a low-dimensional parameter space. This may improve computational efficiency over standard Gibbs sampling since computation is not over the high-dimensional space of latent variables, and correlations between hyperparameters and latent variables become irrelevant. When the marginal posterior over hyperparameters depends on a fixed-dimensional sufficient statistic, precomputation of the sufficient statistic renders the cost of the low-dimensional MCMC independent of data size. Then, when the hyperparameters are the primary variables of interest, inference may be performed in big-data settings at modest cost. Moreover, since the form of the full conditional for the latent variables does not depend on the form of the hyperprior distribution, the method imposes no restriction on the hyperprior, unlike Gibbs sampling, which typically requires conjugate distributions. We demonstrate these efficiency gains in four computed examples.

1 Introduction

Suppose we have a hierarchical model structure in which the data model depends on a high-dimensional latent structure $X$, which in turn depends on a low-dimensional hyperparameter vector $\Theta$. That is, $f_{Y|X,\Theta}(\,\cdot \mid x, \theta)$ models the data $Y$, and the joint prior distribution for $X$ and $\Theta$ is defined by the hierarchical structure

    f_{X,\Theta}(x, \theta) = f_{X|\Theta}(x \mid \theta)\, f_\Theta(\theta).

This is a very common hierarchical structure present in many statistical analyses (for example Banerjee et al., 2004; Bardsley, 2012; Gilavert et al., 2015; Papaspiliopoulos et al., 2007; Wikle et al., 2001). The posterior distribution is then

    f_{X,\Theta|Y}(x, \theta \mid y) = \frac{f_{Y|X,\Theta}(y \mid x, \theta)\, f_{X|\Theta}(x \mid \theta)\, f_\Theta(\theta)}{f_Y(y)}.    (1)

Indeed, in most relevant circumstances the normalizing constant

    f_Y(y) = \int\!\!\int f_{Y|X,\Theta}(y \mid x, \theta)\, f_{X|\Theta}(x \mid \theta)\, f_\Theta(\theta)\, dx\, d\theta

is intractable and we must rely on Monte Carlo methods to explore the posterior distribution. A convenient, but restrictive, common approach is to define conditionally conjugate prior distributions such that all of the full conditionals are of a known form and may be sampled from, and then proceed by Gibbs sampling. In the following, we require that the prior of the latent structure $f_{X|\Theta}(x \mid \theta)$ has some sort of conjugacy so that the full conditional $f_{X|\Theta,Y}(x \mid \theta, y)$ is of a known form. However, the prior for the hyperparameters $f_\Theta(\theta)$ may have arbitrary form.

Let us assume then that the full conditional for the parameters $X$,

    f_{X|\Theta,Y}(x \mid \theta, y) = \frac{f_{Y|X,\Theta}(y \mid x, \theta)\, f_{X|\Theta}(x \mid \theta)}{f_{Y|\Theta}(y \mid \theta)},    (2)

belongs to a known family of distributions.
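For instance, consider the following conjugate case, written here in our own notation and broadly in the spirit of the pump-failure example of Section 3 (the parameterisation below is an illustrative assumption, not a quotation of that example): Poisson counts with Gamma-distributed latent rates,

    y_i \mid x_i \sim \mathrm{Poisson}(x_i t_i), \qquad x_i \mid \theta \sim \mathrm{Gamma}(\alpha, \theta), \qquad i = 1, \dots, n,

with known exposures $t_i$, fixed shape $\alpha$, and rate hyperparameter $\theta$. Conjugacy gives

    x_i \mid \theta, y \sim \mathrm{Gamma}(\alpha + y_i,\ \theta + t_i),

so $f_{X|\Theta,Y}(x \mid \theta, y)$ is a product of Gamma densities, that is, of known form, whatever hyperprior is placed on $\theta$.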
This means that, in principle, by comparing the full conditional (2) with the numerator in (1), one can identify the $\theta$-dependence of the normalizing constant in the denominator of (2), that is,

    f_{Y|\Theta}(y \mid \theta) = \int f_{Y|X,\Theta}(y \mid x', \theta)\, f_{X|\Theta}(x' \mid \theta)\, dx'.

Hence, if the full conditional for $X$ belongs to a known family of distributions, we should be able to determine the marginal posterior distribution over the hyperparameters,

    f_{\Theta|Y}(\theta \mid y) = \frac{f_{Y|\Theta}(y \mid \theta)\, f_\Theta(\theta)}{f_Y(y)},    (3)

up to the normalizing constant $f_Y(y)$. This allows us to perform MCMC on the low-dimensional marginal posterior distribution over the hyperparameters, for which automatic MCMC methods are available and other customized MCMC samplers may be designed. We discuss automatic methods in Section 2.2.

In many cases this marginal posterior distribution has a sufficient statistic of fixed low dimension that can be precomputed. Consequently, the cost of evaluating $f_{\Theta|Y}(\theta \mid y)$ at each iteration of an MCMC may grow not at all, or only slowly, with data size. This advantage may become critical compared with the alternative of Gibbs sampling (or any other MCMC) on the joint posterior distribution over $X, \Theta \mid Y$, since the cost of performing MCMC on the joint posterior distribution grows with the dimension of $X$, which in many analyses grows linearly with data size. Moreover, correlations between $X$ and $\Theta$ that can adversely affect the computational efficiency of Gibbs sampling from the joint posterior distribution are irrelevant to MCMC on the marginal posterior distribution (Rue and Held, 2005).

Note that once we have (pseudo) independent samples of $\Theta \mid Y$, after burn-in and thinning, namely $\theta^{(i)}$ for $i = 1, 2, \dots, T$, then, if inference on the latent structure is also required, we simply simulate $x^{(i)}$ from $f_{X|\Theta,Y}(\,\cdot \mid \theta^{(i)}, y)$, and $(x^{(i)}, \theta^{(i)})$ for $i = 1, 2, \dots, T$ are (pseudo) independent samples from the joint posterior $f_{X,\Theta|Y}(\,\cdot, \cdot \mid y)$. That is, the complete results of the original Gibbs sampling are recovered. However, performing MCMC on the marginal posterior distribution has at least three potential advantages: the improved computational efficiency for 'big data' just mentioned, an easier MCMC since correlations in a very high-dimensional posterior are now avoided, and greater freedom in the choice of prior.

Rue and Held (2005) proposed a similar decomposition to give the one-block algorithm. From (2), the marginal posterior over hyperparameters is evaluated using the expression

    f_{Y|\Theta}(y \mid \theta) = \frac{f_{Y|X,\Theta}(y \mid x^*, \theta)\, f_{X|\Theta}(x^* \mid \theta)}{f_{X|Y,\Theta}(x^* \mid y, \theta)},    (4)

with $x^*$ suitably chosen, restricted to $f_{X|Y,\Theta}(x^* \mid y, \theta) > 0$. This approach was proposed in Rue and Held (2005) and Simpson et al. (2012) for linear Gaussian models. Fox and Norton (2016), in the same setting, cancelled the terms in $x$ on the right-hand side of (4) to give an expression that does not depend on $x$. Indeed, as noted above, since $f_{X|Y,\Theta}(x^* \mid y, \theta)$ has a known form, this cancellation appears to be always possible and the selection of $x^*$ is unnecessary. This is the procedure followed in the four examples we present in Section 3. In common with the one-block algorithm, the calculation that we use in (3) does not require any special form of the hyperprior distribution $f_\Theta(\theta)$, as opposed to the usual Gibbs sampling that requires conjugate distributions so that all full conditionals are available.
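To make the workflow concrete, the following is a minimal sketch, in Python, of random-walk Metropolis on the (here one-dimensional) marginal posterior $f_{\Theta|Y}(\theta \mid y)$ for the Poisson-Gamma illustration above, followed by direct simulation of the latent rates from their known full conditional. This is our own illustration, not the authors' code: the synthetic data values, the fixed shape $\alpha$, the placeholder Gamma hyperprior, the proposal scale, and names such as log_marginal_posterior are all assumptions made for the sketch; any hyperprior could be substituted without changing the latent-variable step.

```python
import numpy as np
from scipy.special import gammaln

rng = np.random.default_rng(0)

# Illustrative Poisson-Gamma hierarchy (assumed parameterisation, synthetic data):
#   y_i | x_i   ~ Poisson(x_i * t_i)
#   x_i | theta ~ Gamma(alpha, rate=theta)   with alpha fixed
#   theta       ~ arbitrary hyperprior (placeholder Gamma(a0, b0) below)
alpha, a0, b0 = 1.8, 0.1, 1.0
t = np.array([10.0, 50.0, 30.0, 5.0, 80.0])   # synthetic exposures
y = np.array([2, 7, 4, 1, 13])                # synthetic counts

def log_marginal_posterior(theta):
    """log f(theta | y) up to a constant; the latent x is integrated out analytically."""
    if theta <= 0.0:
        return -np.inf
    # log f(y_i | theta): Poisson likelihood with the Gamma(alpha, theta) rate integrated out
    loglik = np.sum(alpha * np.log(theta) - gammaln(alpha)
                    + gammaln(alpha + y) - (alpha + y) * np.log(theta + t)
                    + y * np.log(t) - gammaln(y + 1))
    logprior = (a0 - 1.0) * np.log(theta) - b0 * theta   # placeholder hyperprior
    return loglik + logprior

# Random-walk Metropolis on the low-dimensional marginal posterior over theta
T, step = 20_000, 0.5
theta, logp = 1.0, log_marginal_posterior(1.0)
chain = np.empty(T)
for i in range(T):
    prop = theta + step * rng.standard_normal()
    logp_prop = log_marginal_posterior(prop)
    if np.log(rng.random()) < logp_prop - logp:
        theta, logp = prop, logp_prop
    chain[i] = theta

# If the latent field is also required, draw it afterwards from the known full
# conditional x_i | theta, y ~ Gamma(alpha + y_i, rate = theta + t_i)
# (NumPy's gamma sampler takes a scale, i.e. 1/rate).
thetas = chain[5000::10]                      # burn-in and thinning
x_draws = rng.gamma(alpha + y, 1.0 / (thetas[:, None] + t))
```

The MCMC itself never touches the latent field, so its cost per iteration depends only on evaluating the marginal posterior; latent-variable draws are produced afterwards, one independent draw per retained hyperparameter sample.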
The approach we take is also related to partially collapsed Gibbs samplers (Park and Min, 2016; van Dyk and Park, 2008; Liu et al., 1994), which perform strategic integrations and then apply Gibbs sampling on a reduced set of variables. We are proposing an extreme version of partially collapsed Gibbs sampling in which we "collapse" over all latent variables before sampling. However, we do not then apply Gibbs sampling to the marginal posterior distribution, so we do not fit completely under the aegis of partially collapsed Gibbs samplers. Note also that we do not actually perform any integration, but use the algebraic route implied by (3).

Our first example in Section 3 is the 'pump failure data' analyzed by Gelfand and Smith (1990), which sparked off the widespread use of Gibbs sampling. Back then, analyzing a model with more than three parameters was in most cases prohibitive and, consequently, Bayesian statistics had a very marginal impact until that turning point. The fact that Gelfand and Smith (1990) showed a way of analyzing, and sampling from, in principle arbitrarily large models, and provided an example with 11 parameters, was revolutionary to most statisticians. In practice, however, Gibbs sampling may become arbitrarily inefficient in the presence of high correlations (Belisle, 1998), and certainly so in latent structures with large data sets. We now have a range of automatic, semiautomatic, and customized algorithms that we may employ to run an arbitrary MCMC (with no conjugacy), over 10 to 15 parameters, to sample the hyperparameters; hence the need for, and the applicability of, our approach.

The next section describes a measure of the computational efficiency of MCMC methods, used to make a quantitative comparison between conventional Gibbs sampling and our approach. This is followed by four examples of sampling from the marginal posterior distribution that demonstrate its advantages over sampling from the joint posterior distribution. We finish with some discussion. Overly technical details are left to the appendix.

2 Computational considerations

2.1 Statistical and Computational Efficiency

The statistical efficiency of (geometrically convergent) MCMC algorithms can be measured by the integrated autocorrelation time (IACT; see Geyer, 1992) for some real-valued statistic of interest, often one component of the state or the log target density. Let $h$ be the statistic of interest. The IACT for $h$ is defined by

    \tau_{\mathrm{int}}^{(h)} = 1 + 2 \sum_{k=1}^{\infty} \rho_k^{(h)},    (5)

where $\rho_k^{(h)}$ is the autocorrelation coefficient of $h$ at lag $k$. We estimate $\tau_{\mathrm{int}}$ as twice the IACT value computed by the (Matlab) function provided by Wolff (2004), since in the physics literature the IACT is defined as $\tau_{\mathrm{int}}/2$ (see also Sokal, 1989). The IACT has units of iterations, or steps, of the MCMC. We prefer the statistics definition in (5) since it may be thought of as the length of the chain that has the same variance-reducing power as one independent sample.
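As a rough illustration of the statistic in (5), the sketch below gives a naive Python estimator; it is our own simplification, truncating the sum at the first non-positive estimated autocorrelation rather than using the automatic windowing of the Wolff (2004) routine employed in the paper, and the names iact and max_lag are ours.

```python
import numpy as np

def iact(h, max_lag=None):
    """Naive estimate of the integrated autocorrelation time in (5):
    tau = 1 + 2 * sum_k rho_k, truncating the sum at the first lag whose
    estimated autocorrelation is non-positive (a crude window, not Wolff's)."""
    h = np.asarray(h, dtype=float)
    n = len(h)
    h = h - h.mean()
    var = np.dot(h, h) / n
    if max_lag is None:
        max_lag = n // 3
    tau = 1.0
    for k in range(1, max_lag):
        rho_k = np.dot(h[:-k], h[k:]) / (n * var)  # biased autocovariance / variance
        if rho_k <= 0.0:
            break
        tau += 2.0 * rho_k
    return tau

# Sanity check: an AR(1) chain with coefficient phi has IACT (1 + phi) / (1 - phi).
rng = np.random.default_rng(1)
phi, n = 0.9, 100_000
x = np.empty(n)
x[0] = 0.0
for i in range(1, n):
    x[i] = phi * x[i - 1] + rng.standard_normal()
print(iact(x))  # roughly (1 + 0.9) / (1 - 0.9) = 19
```

Applied to the component chains of a Gibbs sampler and of the marginal-posterior MCMC, such an estimate gives the per-statistic statistical efficiency used in the comparisons that follow.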