Bernoulli 23(4B), 2017, 3711–3743 DOI: 10.3150/16-BEJ863

Accelerated Gibbs sampling of normal distributions using matrix splittings and polynomials

COLIN FOX1 and ALBERT PARKER2
1Department of Physics, University of Otago, Dunedin, New Zealand. E-mail: [email protected]
2Center for Biofilm Engineering, Department of Mathematical Sciences, Montana State University, Bozeman, MT, USA. E-mail: [email protected]

Standard Gibbs sampling applied to a multivariate normal distribution with a specified precision matrix is equivalent in fundamental ways to the Gauss–Seidel iterative solution of linear equations in the precision matrix. Specifically, the iteration operators, the conditions under which convergence occurs, and geometric convergence factors (and rates) are identical. These results hold for arbitrary matrix splittings from classical iterative methods in numerical linear algebra, giving easy access to mature results in that field, including existing convergence results for antithetic-variable Gibbs sampling, REGS sampling, and generalizations. Hence, efficient deterministic stationary relaxation schemes lead to efficient generalizations of Gibbs sampling. The technique of polynomial acceleration that significantly improves the convergence rate of an iterative solver derived from a symmetric splitting may be applied to accelerate the equivalent generalized Gibbs sampler. Identicality of error polynomials guarantees convergence of the inhomogeneous Markov chain, while equality of convergence factors ensures that the optimal solver leads to the optimal sampler. Numerical examples are presented, including a Chebyshev accelerated SSOR Gibbs sampler applied to a stylized demonstration of low-level Bayesian image reconstruction in a large 3-dimensional linear inverse problem.

Keywords: Bayesian inference; Gaussian Markov random field; Gibbs sampling; matrix splitting; multivariate normal distribution; non-stationary stochastic iteration; polynomial acceleration

1. Introduction

The Metropolis–Hastings algorithm for MCMC was introduced to mainstream statistics around 1990 (Robert and Casella [48]), though prior to that the Gibbs sampler provided a coherent approach to investigating distributions with Markov random field structure (Turčin [60], Grenander [32], Geman and Geman [25], Gelfand and Smith [23], Besag and Green [11], Sokal [58]). The Gibbs sampler may be thought of as a particular Metropolis–Hastings algorithm that uses the conditional distributions as proposal distributions, with acceptance probability always equal to 1 (Geyer [26]). In statistics, the Gibbs sampler is popular because of ease of implementation (see, e.g., Roberts and Sahu [51]), when conditional distributions are available in the sense that samples may be drawn from the full conditionals. However, the Gibbs sampler is not often presented as an efficient algorithm, particularly for massive models. In this work, we show that generalized and accelerated Gibbs samplers are contenders for the fastest sampling algorithms for normal target distributions, because they are equivalent to the fastest algorithms for solving systems of linear equations.

Almost all current MCMC algorithms, including Gibbs samplers, simulate a fixed transition kernel that induces a homogeneous Markov chain that converges geometrically in distribution to the desired target distribution. In this aspect, modern variants of the Metropolis–Hastings algorithm are unchanged from the Metropolis algorithm as first implemented in the 1950s. The adaptive Metropolis algorithm of Haario et al. [34] (see also Roberts and Rosenthal [50]) is an exception, though it converges to a geometrically convergent Metropolis–Hastings algorithm that bounds convergence behaviour.

We focus on the application of Gibbs sampling to drawing samples from a multivariate normal distribution with a given covariance or precision matrix. Our concern is to develop generalized Gibbs samplers with optimal geometric, or better than geometric, distributional convergence by drawing on ideas in numerical computation, particularly the mature field of computational linear algebra. We apply the matrix-splitting formalism to show that fixed-scan Gibbs sampling from a multivariate normal is equivalent in fundamental ways to the stationary linear iterative solvers applied to systems of equations in the precision matrix. Stationary iterative solvers are now considered to be very slow precisely because of their geometric rate of convergence, and are no longer used for large systems. However, they remain a basic building block in the most efficient linear solvers. By establishing equivalence of error polynomials, we provide a route whereby acceleration techniques from numerical linear algebra may be applied to Gibbs sampling from normal distributions. The fastest solvers employ non-stationary iterations, hence the equivalent generalized Gibbs sampler induces an inhomogeneous Markov chain. Explicit calculation of the error polynomial guarantees convergence, while control of the error polynomial gives optimal performance.

The adoption of the matrix splitting formalism gives the following practical benefits in the context of fixed-scan Gibbs sampling from normal targets:

1. a one-to-one equivalence between generalized Gibbs samplers and classical linear iterative solvers;
2. rates of convergence and error polynomials for the Markov chain induced by a generalized Gibbs sampler;
3. acceleration of the Gibbs sampler to induce an inhomogeneous Markov chain that achieves the optimal error polynomial, and hence has optimal convergence rate for expectations and in distribution;
4. numerical estimates of convergence rate of the (accelerated) Gibbs sampler in a single chain and a priori estimates of number of iterations to convergence;
5. access to preconditioning, whereby the sampling problem is transformed into an equivalent problem for which the accelerated Gibbs sampler has improved convergence rate.

Some direct linear solvers have already been adapted to sampling from multivariate normal distributions, with Rue [52] demonstrating the use of solvers based on Cholesky factorization to allow computationally efficient sampling. This paper extends the connection to the iterative linear solvers. Since iterative methods are the most efficient for massive linear systems, the associated samplers will be the most efficient for very high dimensional normal targets.
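To make the solver–sampler correspondence concrete, the following is a minimal sketch (not taken from the paper; the function and variable names, and the toy precision matrix, are ours) of the fixed-scan component-wise Gibbs sampler for N(0, A^{-1}) given a precision matrix A. Each coordinate is drawn from its full conditional, visiting coordinates in the same fixed order as a Gauss–Seidel sweep on Ax = 0.

```python
import numpy as np

def gibbs_sweep(x, A, rng):
    """One fixed-scan sweep of the component-wise Gibbs sampler for N(0, A^{-1}).

    Each coordinate is drawn in place from its full conditional
        x_i | x_{-i} ~ N( -(1/a_ii) * sum_{j != i} a_ij x_j , 1/a_ii ),
    in the same order as a Gauss-Seidel sweep applied to A x = 0.
    """
    for i in range(len(x)):
        cond_mean = -(A[i, :] @ x - A[i, i] * x[i]) / A[i, i]
        x[i] = cond_mean + rng.standard_normal() / np.sqrt(A[i, i])
    return x

# Toy 2-d example: a precision matrix with eigenpairs (10, [1 1]^T) and (1, [1 -1]^T),
# our reconstruction of the matrix used in Figure 1.
rng = np.random.default_rng(0)
Q = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2.0)
A = Q @ np.diag([10.0, 1.0]) @ Q.T

x = np.zeros(2)
chain = np.array([gibbs_sweep(x, A, rng).copy() for _ in range(20000)])
print("empirical covariance:\n", np.cov(chain[2000:].T))   # approaches A^{-1}
print("target covariance:\n", np.linalg.inv(A))
```

Successive sweeps form a homogeneous Markov chain whose stationary distribution is N(0, A^{-1}); the empirical covariance of the (correlated) iterates after burn-in approximates A^{-1}.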

1.1. Context and overview of results

The Cholesky factorization is the conventional way to produce samples from a moderately sized multivariate normal distribution (Rue [52], Rue and Held [53]), and is also the preferred method for solving moderately sized linear systems. For large linear systems, iterative solvers are the methods of choice due to their inexpensive cost per iteration and small computer memory requirements.

Gibbs samplers applied to normal distributions are essentially identical to stationary iterative methods from numerical linear algebra. This connection was exploited by Adler [1], and independently by Barone and Frigessi [8], who noted that the component-wise Gibbs sampler is a stochastic version of the Gauss–Seidel linear solver, and accelerated the Gibbs sampler by introducing a relaxation parameter to implement the stochastic version of the successive over-relaxation (SOR) of Gauss–Seidel. This pairing was further analyzed by Goodman and Sokal [30].

This equivalence is depicted in panels A and B of Figure 1. Panel B shows the contours of a normal density π(x), and a sequence of coordinate-wise conditional samples taken by the Gibbs sampler applied to π. Panel A shows the contours of the quadratic −log π(x) and the Gauss–Seidel sequence of coordinate optimizations,¹ or, equivalently, solves of the normal equations ∇ log π(x) = 0. Note how in Gauss–Seidel the step sizes decrease towards convergence, which is a tell-tale sign that convergence (in value) is geometric. In Section 4, we will show that the iteration operator is identical to that of the Gibbs sampler in panel B, and hence the Gibbs sampler also converges geometrically (in distribution). Slow convergence of these algorithms is usually understood in terms of the same intuition; high correlations correspond to long narrow contours, and lead to small steps in coordinate directions and many iterations being required to move appreciably along the long axis of the target function.

Roberts and Sahu [51] considered forward then backward sweeps of coordinate-wise Gibbs sampling, with relaxation parameter, to give a sampler they termed REGS. This is a stochastic version of the symmetric-SOR (SSOR) iteration, which comprises forward then backward sweeps of SOR (a component-form sketch of these relaxed sweeps is given below). The equality of iteration operators and error polynomials, for these pairs of fixed-scan Gibbs samplers and iterative solvers, allows existing convergence results in numerical analysis texts (for example, Axelsson [5], Golub and Van Loan [29], Nevanlinna [45], Saad [54], Young [64]) to be used to establish convergence results for the corresponding Gibbs sampler. Existing results for rates of distributional convergence by fixed-sweep Gibbs samplers (Adler [1], Barone and Frigessi [8], Liu et al. [39], Roberts and Sahu [51]) may be established this way.

The methods of Gauss–Seidel, SOR, and SSOR give stationary linear iterations that were used as linear solvers in the 1950s, and are now considered very slow. The corresponding fixed-scan Gibbs samplers are slow for precisely the same reason. The last fifty years have seen an explosion of theoretical results and algorithmic development that have made linear solvers faster and more efficient, so that for large problems, stationary methods are used as preconditioners at best, while the methods of preconditioned conjugate gradients, GMRES, multigrid, or fast-multipole methods are the current state-of-the-art for solving linear systems in a finite number of steps (Saad and van der Vorst [55]).
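The relaxed sweeps referred to above can be sketched in component form as follows (a sketch under our own naming, not the paper's implementation): Adler's over-relaxed update moves each coordinate a fraction ω of the way to a fresh conditional draw, with noise scaled by sqrt(ω(2 − ω)), which matches the sampler built from the SOR splitting M = D/ω + L; ω = 1 recovers the plain Gibbs sweep, and a forward sweep followed by a backward sweep gives the SSOR (REGS-style) sampler.

```python
import numpy as np

def sor_gibbs_sweep(x, A, omega, rng, reverse=False):
    """One SOR-relaxed Gibbs sweep for N(0, A^{-1}), with 0 < omega < 2.

    Each coordinate is updated in place as
        x_i <- (1 - omega) x_i + omega m_i + sqrt(omega (2 - omega) / a_ii) z,
    where m_i is the full-conditional mean and z ~ N(0, 1).
    """
    order = reversed(range(len(x))) if reverse else range(len(x))
    for i in order:
        m_i = -(A[i, :] @ x - A[i, i] * x[i]) / A[i, i]
        noise = np.sqrt(omega * (2.0 - omega) / A[i, i]) * rng.standard_normal()
        x[i] = (1.0 - omega) * x[i] + omega * m_i + noise
    return x

def ssor_gibbs_sweep(x, A, omega, rng):
    """Forward then backward SOR-relaxed sweep: the stochastic analogue of SSOR,
    i.e. the forward-backward construction referred to as REGS in the text."""
    x = sor_gibbs_sweep(x, A, omega, rng, reverse=False)
    return sor_gibbs_sweep(x, A, omega, rng, reverse=True)
```

The symmetry of the SSOR splitting is what later permits polynomial acceleration of this sampler, as discussed in Section 5.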

¹Gauss–Seidel optimization was rediscovered by Besag [10] as iterated conditional modes.

Figure 1. The panels in the left column show the contours of a quadratic function (1/2)x^T Ax − b^T x in two dimensions and the iteration paths of some common optimizers towards the minimizer μ = A^{-1}b, or equivalently the path of iterative linear solvers of Ax = b. The right column presents the iteration paths of the samplers corresponding to each linear solver, along with the contours of the normal density k exp{−(1/2)x^T Ax + b^T x}, where k is the normalizing constant. In all panels, the matrix A has eigenpairs {(10, [1 1]^T), (1, [1 −1]^T)}. The Gauss–Seidel solver took 45 iterations to converge to μ (shown are the 90 coordinate steps; each iteration is a "sweep" of the two coordinate directions), the Chebyshev polynomial accelerated SSOR required just 16 iterations to converge, while CG finds the minimizer in 2 iterations. For each of the samplers, 10 iterations are shown (the 20 coordinate steps are shown for the Gibbs sampler). The correspondence between these linear solvers/optimizers and samplers is treated in the text (CG in the supplementary material [20]).

Linear iterations derived from a symmetric splitting may be sped up by polynomial acceleration, particularly Chebyshev acceleration that results in optimal error reduction amongst methods that have a fixed non-stationary iteration structure (Fox and Parker [21], Axelsson [5]). The Chebyshev accelerated SSOR solver and corresponding Chebyshev accelerated SSOR sampler (Fox and Parker [19]) are depicted in panels C and D of Figure 1. Both the solver and sampler take steps that are more aligned with the long axis of the target, compared to the coordinate-wise algorithms, and hence achieve faster convergence. However, the step size of Chebyshev-SSOR solving still decreases towards convergence, and hence convergence for both solver and sampler is still asymptotically geometric, albeit with much improved rate.

Fox and Parker [19] considered point-wise convergence of the mean and variance of a Gibbs SSOR sampler accelerated by Chebyshev polynomials. In this paper, we prove convergence in distribution for Gibbs samplers corresponding to any matrix splitting and accelerated by any polynomial that is independent of the Gibbs iterations. We then apply a polynomial accelerated sampler to solve a massive Bayesian linear inverse problem that is infeasible to solve using conventional techniques. Chebyshev acceleration requires estimates of the extreme eigenvalues of the error operator, which we obtain via a conjugate-gradient (CG) algorithm at no significant computational cost (Meurant [41]). The CG algorithm itself may be adapted to sample from normal distributions; the CG solver and corresponding sampler, depicted in panels E and F of Figure 1, were analysed by Parker and Fox [47] and are discussed in the supplementary material [20].
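For reference, the following is a sketch of classical Chebyshev acceleration for the linear solver only, following the standard three-term recursion (as in, e.g., Saad [54]); all names are ours. It assumes bounds lmin, lmax on the eigenvalues of M^{-1}A are available, for instance from a few CG/Lanczos iterations as mentioned above. The corresponding accelerated sampler additionally requires the step-dependent noise distributions derived in Section 5, which this sketch does not attempt.

```python
import numpy as np

def chebyshev_solve(A, b, apply_Minv, lmin, lmax, x0, num_iters):
    """Chebyshev-accelerated iterative solution of A x = b with splitting /
    preconditioner M (a sketch of the classical recursion).

    apply_Minv(r) must return M^{-1} r; lmin, lmax bound the eigenvalues of
    M^{-1} A.  The residual polynomial is the scaled Chebyshev polynomial on
    [lmin, lmax], giving the optimal fixed non-stationary iteration.
    """
    theta = 0.5 * (lmax + lmin)          # centre of the eigenvalue interval
    delta = 0.5 * (lmax - lmin)          # half-width of the interval
    sigma = theta / delta
    rho = 1.0 / sigma
    x = x0.copy()
    r = b - A @ x
    d = apply_Minv(r) / theta            # first correction: x1 = x0 + M^{-1} r0 / theta
    for _ in range(num_iters):
        x = x + d
        r = r - A @ d
        rho_next = 1.0 / (2.0 * sigma - rho)
        d = rho_next * rho * d + (2.0 * rho_next / delta) * apply_Minv(r)
        rho = rho_next
    return x
```

Here apply_Minv might be, for example, a single SSOR solve with the same splitting used by the sampler.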

1.2. Structure of the paper

In Section 2, we review efficient methods for sampling from normal distributions, highlighting Gibbs sampling in various algorithmic forms. Standard results for stationary iterative solvers are presented in Section 3. Theorems in Section 4 establish equivalence of convergence and convergence factors for iterative solvers and Gibbs samplers. Application of polynomial acceleration methods to linear solvers and Gibbs sampling is given in Section 5, including explicit expressions for convergence of the first and second moments of a polynomial accelerated sampler, from which it follows that distributional convergence occurs with rate determined by the same error polynomial. Numerical verification of convergence results is presented in Section 6.

2. Sampling from multivariate normal distributions

We consider the problem of sampling from an n-dimensional normal distribution N(μ, Σ) defined by the mean n-vector μ and the n × n symmetric and positive definite (SPD) covariance matrix Σ. Since if z ∼ N(0, Σ) then z + μ ∼ N(μ, Σ), it often suffices to consider drawing samples from normal distributions with zero mean. An exception is when μ is defined implicitly, which we discuss in Section 4.1.

In Bayesian formulations of inverse problems that use a GMRF as a prior distribution, typically the precision matrix A = Σ^{-1} is explicitly modeled and available (Rue and Held [53],

Higdon [35]), perhaps as part of a hierarchical model (Banerjee et al. [6]). Typically then the precision matrix (conditioned on hyperparameters) is large though sparse, if the neighborhoods specifying conditional independence are small. We are particularly interested in this case, and throughout the paper will focus on sampling from N(0, A^{-1}) when A is sparse and large, or when some other property makes operating by A easy, that is, one can evaluate Ax for any vector x.

Standard sampling methods for moderately sized normal distributions utilize the Cholesky factorization (Rue [52], Rue and Held [53]) since it is fast, incurring approximately (1/3)n^3 floating point operations (flops), and is backwards stable (Watkins [62]). Samples can also be drawn using the more expensive eigen-decomposition (Rue and Held [53]), which costs approximately (10/3)n^3 flops, or more generally using mutually conjugate vectors (Fox [18], Parker and Fox [47]). For stationary Gaussian random fields defined on the lattice, Fourier methods can lead to efficient sampling for large problems (Gneiting et al. [28]). Algorithm 1 shows the steps for sampling from N(0, Σ) using Cholesky factorization, when the covariance matrix Σ is available (Neal [42], MacKay [40], Higdon [35]).

Algorithm 1: Cholesky sampling using a covariance matrix
input : covariance matrix Σ
output: y ∼ N(0, Σ)
Cholesky factor Σ = CC^T;
sample z ∼ N(0, I);
y = Cz;
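A direct numpy transcription of Algorithm 1 might look as follows (a sketch; the function name and the optional mean shift are ours).

```python
import numpy as np

def cholesky_sample(Sigma, rng, mean=None):
    """Algorithm 1: draw y ~ N(0, Sigma) via the Cholesky factor Sigma = C C^T,
    optionally shifted to N(mean, Sigma)."""
    C = np.linalg.cholesky(Sigma)              # lower-triangular C, ~ n^3/3 flops
    z = rng.standard_normal(Sigma.shape[0])    # z ~ N(0, I)
    y = C @ z                                  # Cov(y) = C C^T = Sigma
    return y if mean is None else y + mean
```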

When the precision matrix A is available, a sample y ∼