MCMC Methods for Functions: Modifying Old Algorithms to Make Them Faster

Cotter, Simon and Roberts, Gareth and Stuart, Andrew and White, David

2013

MIMS EPrint: 2014.71

Manchester Institute for Mathematical Sciences School of Mathematics

The University of Manchester

Reports available from: http://eprints.maths.manchester.ac.uk/ and by contacting: The MIMS Secretary, School of Mathematics, The University of Manchester, Manchester, M13 9PL, UK

ISSN 1749-9097. arXiv:1202.0709v3 [stat.CO] 10 Oct 2013

Statistical Science, 2013, Vol. 28, No. 3, 424–446. DOI: 10.1214/13-STS421. © Institute of Mathematical Statistics, 2013.

S. L. Cotter is Lecturer, School of Mathematics, University of Manchester, Manchester, M13 9PL, United Kingdom. G. O. Roberts is Professor, Statistics Department, University of Warwick, Coventry, CV4 7AL, United Kingdom. A. M. Stuart is Professor, Mathematics Department, University of Warwick, Coventry, CV4 7AL, United Kingdom. D. White is Postdoctoral Research Assistant, Mathematics Department, University of Warwick, Coventry, CV4 7AL, United Kingdom.

Abstract. Many problems arising in applications result in the need to probe a probability distribution for functions. Examples include Bayesian nonparametric statistics and conditioned diffusion processes. Standard MCMC algorithms typically become arbitrarily slow under the mesh refinement dictated by nonparametric description of the unknown function. We describe an approach to modifying a whole range of MCMC methods, applicable whenever the target measure has density with respect to a Gaussian process or Gaussian random field reference measure, which ensures that their speed of convergence is robust under mesh refinement.

Gaussian processes or random fields are fields whose marginal distributions, when evaluated at any finite set of N points, are R^N-valued Gaussians. The algorithmic approach that we describe is applicable not only when the desired probability measure has density with respect to a Gaussian process or Gaussian random field reference measure, but also to some useful non-Gaussian reference measures constructed through random truncation. In the applications of interest the data is often sparse and the prior specification is an essential part of the overall modelling strategy. These Gaussian-based reference measures are a very flexible modelling tool, finding wide-ranging application. Examples are shown in density estimation, data assimilation in fluid mechanics, subsurface geophysics and image registration.

The key design principle is to formulate the MCMC method so that it is, in principle, applicable for functions; this may be achieved by use of proposals based on carefully chosen time-discretizations of stochastic dynamical systems which exactly preserve the Gaussian reference measure. Taking this approach leads to many new algorithms which can be implemented via minor modification of existing algorithms, yet which show enormous speed-up on a wide range of applied problems.

Key words and phrases: MCMC, Bayesian nonparametrics, algorithms, Gaussian random field, Bayesian inverse problems.

This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in Statistical Science, 2013, Vol. 28, No. 3, 424–446. This reprint differs from the original in pagination and typographic detail.

1. INTRODUCTION

The use of Gaussian process (or field) priors is widespread in statistical applications (geostatistics [48], nonparametric regression [24], Bayesian emulator modelling [35], density estimation [1] and inverse quantum theory [27] to name but a few substantial areas where they are commonplace). The success of using Gaussian priors to model an unknown function stems largely from the model flexibility they afford, together with recent advances in computational methodology (particularly MCMC for exact likelihood-based methods). In this paper we describe a wide class of statistical problems, and an algorithmic approach to their study, which adds to the growing literature concerning the use of Gaussian process priors. To be concrete, we consider a process {u(x); x ∈ D} for D ⊆ R^d for some d. In most of the examples we consider here u is not directly observed: it is hidden (or latent) and some complicated nonlinear function of it generates the data at our disposal.

Gaussian processes [2] can be characterized by either the covariance or inverse covariance (precision) operator. In most statistical applications, the covariance is specified. This has the major advantage that the distribution can be readily marginalized to suit a prescribed statistical use. For instance, in geostatistics it is often enough to consider the joint distribution of the process at locations where data is present. However, the inverse covariance specification has particular advantages in the interpretability of parameters when there is information about the local structure of u (hence, e.g., the advantages of using Markov random field models in image analysis). In the context where x varies over a continuum (such as ours) this creates particular computational difficulties, since we can no longer work with a projected prior chosen to reflect available data and quantities of interest [e.g., {u(x_i); 1 ≤ i ≤ m}, say]. Instead it is necessary to consider the entire distribution of {u(x); x ∈ D}. This poses major computational challenges, particularly in avoiding unsatisfactory compromises between model approximation (discretization in x, typically) and computational cost.

There is a growing need in many parts of applied mathematics to blend data with sophisticated models involving nonlinear partial and/or stochastic differential equations (PDEs/SDEs). In particular, credible mathematical models must respect physical laws and/or Markov conditional independence relationships, which are typically expressed through differential equations. Gaussian priors arise naturally in this context for several reasons. In particular: (i) they allow for straightforward enforcement of differentiability properties, adapted to the model setting; and (ii) they allow for specification of prior information in a manner which is well adapted to the computational tools routinely used to solve the differential equations themselves. Regarding (ii), it is notable that in many applications it may be computationally convenient to adopt an inverse covariance (precision) operator specification, rather than specification through the covariance function; this allows not only specification of Markov conditional independence relationships but also the direct use of computational tools from numerical analysis [45].

Gaussian processes or random fields are fields whose marginal distributions, when evaluated at any finite set of N points, are R^N-valued Gaussians. Draws from these Gaussian probability distributions can be computed efficiently by a variety of techniques; for expository purposes we will focus primarily on the use of Karhunen–Loève expansions to construct such draws, but the methods we propose simply require the ability to draw from Gaussian measures and the user may choose an appropriate method for doing so. The Karhunen–Loève expansion exploits knowledge of the eigenfunctions and eigenvalues of the covariance operator to construct series with random coefficients which are the desired draws; it is introduced in Section 3.1.

This paper will consider MCMC based computational methods for simulating from distributions of the type described above. Although our motivation is primarily nonparametric Bayesian statistical applications with Gaussian priors, our approach can be applied to other settings, such as conditioned diffusion processes. Furthermore, we also study some generalizations of Gaussian priors which arise from truncation of the Karhunen–Loève expansion to a random number of terms; these can be useful to prevent overfitting and allow the data to automatically determine the scales about which it is informative.

Since in nonparametric Bayesian problems the unknown of interest (a function) naturally lies in an infinite-dimensional space, numerical schemes for evaluating posterior distributions almost always rely on some kind of finite-dimensional approximation or truncation to a parameter space of dimension d_u, say. The Karhunen–Loève expansion provides a natural and mathematically well-studied approach to this problem. The larger d_u is, the better the approximation to the infinite-dimensional true model becomes. However, off-the-shelf MCMC methodology usually suffers from a curse of dimensionality, so that the number of iterations required for these methods to converge diverges with d_u. Therefore, we shall aim to devise strategies which are robust to the value of d_u. Our approach will be to devise algorithms which are well-defined mathematically for the infinite-dimensional limit. Typically, then, finite-dimensional approximations of such algorithms possess robust convergence properties in terms of the choice of d_u. An early specialised example of this approach within the context of diffusions is given in [43].

In practice, we shall thus demonstrate that small, but significant, modifications of a variety of standard Markov chain Monte Carlo (MCMC) methods lead to substantial algorithmic speed-up when tackling Bayesian estimation problems for functions defined via density with respect to a Gaussian process prior, when these problems are approximated on a finite-dimensional space of dimension d_u ≫ 1. Furthermore, we show that the framework adopted encompasses a range of interesting applications.

1.1 Illustration of the Key Idea

Crucial to our algorithm construction will be a detailed understanding of the dominating reference Gaussian measure. Although prior specification might be Gaussian, it is likely that the posterior distribution µ is not. However, the posterior will at least be absolutely continuous with respect to an appropriate Gaussian density. Typically the dominating Gaussian measure can be chosen to be the prior, with the corresponding Radon–Nikodym derivative just being a re-expression of Bayes' formula

dµ/dµ0 (u) ∝ L(u)

for likelihood L and Gaussian dominating measure (prior in this case) µ0. This framework extends in a natural way to the case where the prior distribution is not Gaussian, but is absolutely continuous with respect to an appropriate Gaussian distribution. In either case we end up with

(1.1) dµ/dµ0 (u) ∝ exp(−Φ(u))

for some real-valued potential Φ. We assume that µ0 is a centred Gaussian measure N(0, C).

The key algorithmic idea underlying all the algorithms introduced in this paper is to consider (stochastic or random) differential equations which preserve µ or µ0 and then to employ as proposals for Metropolis–Hastings methods specific discretizations of these differential equations which exactly preserve the Gaussian reference measure µ0 when Φ ≡ 0; thus, the methods do not reject in the trivial case where Φ ≡ 0. This typically leads to algorithms which are minor adjustments of well-known methods, with major algorithmic speed-ups. We illustrate this idea by contrasting the standard random walk method with the pCN algorithm (studied in detail later in the paper), which is a slight modification of the standard random walk arising from the thinking outlined above. To this end, we define

(1.2) I(u) = Φ(u) + ½‖C^{-1/2}u‖²

and consider the following version of the standard random walk method:

• Set k = 0 and pick u^{(0)}.
• Propose v^{(k)} = u^{(k)} + βξ^{(k)}, ξ^{(k)} ∼ N(0, C).
• Set u^{(k+1)} = v^{(k)} with probability a(u^{(k)}, v^{(k)}).
• Set u^{(k+1)} = u^{(k)} otherwise.
• k → k + 1.

The acceptance probability is defined as

a(u, v) = min{1, exp(I(u) − I(v))}.

Here, and in the next algorithm, the noise ξ^{(k)} is independent of the uniform random variable used in the accept–reject step, and this pair of random variables is generated independently for each k, leading to a Metropolis–Hastings algorithm reversible with respect to µ.

The pCN method is the following modification of the standard random walk method:

• Set k = 0 and pick u^{(0)}.
• Propose v^{(k)} = (1 − β²)^{1/2}u^{(k)} + βξ^{(k)}, ξ^{(k)} ∼ N(0, C).
• Set u^{(k+1)} = v^{(k)} with probability a(u^{(k)}, v^{(k)}).
• Set u^{(k+1)} = u^{(k)} otherwise.
• k → k + 1.

Now we set

a(u, v) = min{1, exp(Φ(u) − Φ(v))}.

The pCN method differs only slightly from the random walk method: the proposal is not a centred random walk, but rather of AR(1) type, and this results in a modified, slightly simpler, acceptance probability. As is clear, the new method accepts the proposed move with probability one if the potential Φ ≡ 0; this is because the proposal is reversible with respect to the Gaussian reference measure µ0.

Fig. 1. Acceptance probabilities versus mesh-spacing, with (a) standard random walk and (b) modified random walk (pCN).

This small change leads to significant speed-ups for problems which are discretized on a grid of dimension d_u. It is then natural to compute on sequences of problems in which the dimension d_u increases, in order to accurately sample the limiting infinite-dimensional problem. The new pCN algorithm is robust to increasing d_u, whilst the standard random walk method is not. To illustrate this idea, we consider an example from the field of data assimilation, introduced in detail in Section 2.2 below, and leading to the need to sample a measure µ of the form (1.1). In this problem d_u = ∆x^{-2}, where ∆x is the mesh-spacing used in each of the two spatial dimensions.

Figure 1(a) and (b) shows the average acceptance probability curves, as a function of the parameter β appearing in the proposal, computed by the standard and the modified random walk (pCN) methods. It is instructive to imagine running the algorithms when tuned to obtain an average acceptance probability of, say, 0.25. Note that for the standard method, Figure 1(a), the acceptance probability curves shift to the left as the mesh is refined, meaning that smaller proposal variances are required to obtain the same acceptance probability as the mesh is refined. However, for the new method shown in Figure 1(b), the acceptance probability curves have a limit as the mesh is refined and, hence, as the random field model is represented more accurately; thus, a fixed proposal variance can be used to obtain the same acceptance probability at all levels of mesh refinement. The practical implication of this difference in acceptance probability curves is that the number of steps required by the new method is independent of the number of mesh points d_u used to represent the function, whilst for the old random walk method it grows with d_u. The new method thus mixes more rapidly than the standard method and, furthermore, the disparity in mixing rates becomes greater as the mesh is refined.

In this paper we demonstrate how methods such as pCN can be derived, providing a way of thinking about algorithmic development for Bayesian statistics which is transferable to many different situations. The key transferable idea is to use proposals arising from carefully chosen discretizations of stochastic dynamical systems which exactly preserve the Gaussian reference measure. As demonstrated in the example, taking this approach leads to new algorithms which can be implemented via minor modification of existing algorithms, yet which show enormous speed-up on a wide range of applied problems.

1.2 Overview of the Paper

Our setting is to consider measures on function spaces which possess a density with respect to a Gaussian random field measure, or some related non-Gaussian measures. This setting arises in many applications, including the Bayesian approach to inverse problems [49] and conditioned diffusion processes (SDEs) [20]. Our goals in the paper are then fourfold:

• to show that a wide range of problems may be cast in a common framework requiring samples to be drawn from a measure known via its density with respect to a Gaussian random field or related prior;

• to explain the principles underlying the derivation of these new MCMC algorithms for functions, leading to desirable d_u-independent mixing properties;
• to illustrate the new methods in action on some nontrivial problems, all drawn from Bayesian nonparametric models where inference is made concerning a function;
• to develop some simple theoretical ideas which give deeper understanding of the benefits of the new methods.

Section 2 describes the common framework into which many applications fit and shows a range of examples which are used throughout the paper. Section 3 is concerned with the reference (prior) measure µ0 and the assumptions that we make about it; these assumptions form an important part of the model specification and are guided by both modelling and implementation issues. In Section 4 we detail the derivation of a range of MCMC methods on function space, including generalizations of the random walk, MALA, independence samplers, Metropolis-within-Gibbs samplers and the HMC method. We use a variety of problems to demonstrate the new random walk method in action: Sections 5.1, 5.2, 5.3 and 5.4 include examples arising from density estimation, two inverse problems arising in oceanography and groundwater flow, and the shape registration problem. Section 6 contains a brief analysis of these methods. We make some concluding remarks in Section 7.

Throughout we denote by ⟨·, ·⟩ the standard Euclidean scalar product on R^m, which induces the standard Euclidean norm |·|. We also define ⟨·, ·⟩_C := ⟨C^{-1/2}·, C^{-1/2}·⟩ for any positive-definite symmetric matrix C; this induces the norm |·|_C := |C^{-1/2}·|. Given a positive-definite self-adjoint operator C on a Hilbert space with inner-product ⟨·, ·⟩, we will also define the new inner-product ⟨·, ·⟩_C = ⟨C^{-1/2}·, C^{-1/2}·⟩, with resulting norm denoted by ‖·‖_C or |·|_C.

2. COMMON STRUCTURE

We will now describe a wide-ranging set of examples which fit a common mathematical framework giving rise to a probability measure µ(du) on a Hilbert space X (extension to Banach spaces is also possible), when given its density with respect to a random field measure µ0, also on X. Thus, we have the measure µ as in (1.1) for some potential Φ : X → R. We assume that Φ can be evaluated to any desired accuracy by means of a numerical method. Mesh-refinement refers to increasing the resolution of this numerical evaluation to obtain a desired accuracy and is tied to the number d_u of basis functions or points used in a finite-dimensional representation of the target function u. For many problems of interest Φ satisfies certain common properties which are detailed in Assumptions 6.1 below. These properties underlie much of the algorithmic development in this paper.

A situation where (1.1) arises frequently is nonparametric density estimation (see Section 2.1), where µ0 is a random process prior for the unnormalized log density and µ the posterior. There are also many inverse problems in differential equations which have this form (see Sections 2.2, 2.3 and 2.4). For these inverse problems we assume that the data y ∈ R^{d_y} is obtained by applying an operator G to the unknown function u and adding a realisation of a mean zero random variable with density ρ supported on R^{d_y}, thereby determining P(y|u). (The operator G, mapping the unknown function to the measurement space, is sometimes termed the observation operator in the applied literature; however, we do not use that terminology in the paper.) That is,

(2.1) y = G(u) + η,  η ∼ ρ.

After specifying µ0(du) = P(du), Bayes' theorem gives µ(du) = P(du|y) with Φ(u) = −ln ρ(y − G(u)). We will work mainly with Gaussian random field priors N(0, C), although we will also consider generalisations of this setting found by random truncation of the Karhunen–Loève expansion of a Gaussian random field. This leads to non-Gaussian priors, but much of the methodology for the Gaussian case can be usefully extended, as we will show.

2.1 Density Estimation

Consider the problem of estimating the probability density function ρ(x) of a random variable supported on [−ℓ, ℓ], given d_y i.i.d. observations y_i. To ensure positivity and normalisation, we may write

(2.2) ρ(x) = exp(u(x)) / ∫_{−ℓ}^{ℓ} exp(u(s)) ds.

If we place a Gaussian process prior µ0 on u and apply Bayes' theorem, then we obtain formula (1.1) with Φ(u) = −Σ_{i=1}^{d_y} ln ρ(y_i) and ρ given by (2.2).
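As a concrete illustration of Φ in this example, the sketch below evaluates the negative log likelihood implied by (2.2) for a function u represented by its values on a uniform grid; the helper name make_phi, the grid representation and the quadrature rule are our assumptions rather than the paper's code.

```python
import numpy as np

# Hypothetical toy setup: u is represented by its values on a uniform
# grid over [-l, l]; Phi(u) = -sum_i log rho(y_i), with rho as in (2.2).
def make_phi(y, l=10.0, n_grid=512):
    x = np.linspace(-l, l, n_grid)
    def phi(u_vals):
        # trapezoidal rule for the normalising constant in (2.2)
        log_norm = np.log(np.trapz(np.exp(u_vals), x))
        # interpolate u at the data points to evaluate log rho(y_i)
        u_at_y = np.interp(y, x, u_vals)
        return -np.sum(u_at_y - log_norm)
    return phi, x

# Usage: phi, x = make_phi(data); val = phi(np.zeros(512))
```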

2.2 Data Assimilation in Fluid Mechanics

In weather forecasting and oceanography it is frequently of interest to determine the initial condition u for a PDE dynamical system modelling a fluid, given observations [3, 26]. To gain insight into such problems, we consider a model of incompressible fluid flow, namely, either the Stokes (γ = 0) or Navier–Stokes equation (γ = 1), on a two-dimensional unit torus T². In the following v(·, t) denotes the velocity field at time t, u the initial velocity field and p(·, t) the pressure field at time t, and the following is an implicit nonlinear equation for the pair (v, p):

(2.3)  ∂_t v − ν△v + γv·∇v + ∇p = ψ,  ∀(x, t) ∈ T² × (0, ∞),
       ∇·v = 0,  ∀t ∈ (0, ∞),
       v(x, 0) = u(x),  x ∈ T².

The aim in many applications is to determine the initial state of the fluid velocity, the function u, from some observations relating to the velocity field v at later times.

A simple model of the situation arising in weather forecasting is to determine v from Eulerian data of the form y = {y_{j,k}}_{j,k=1}^{N,M}, where

(2.4) y_{j,k} ∼ N(v(x_j, t_k), Γ).

Thus, the inverse problem is to find u from y of the form (2.1) with G_{j,k}(u) = v(x_j, t_k).

In oceanography Lagrangian data is often encountered: data is gathered from the trajectories of particles z_j(t) moving in the velocity field of interest, and thus satisfying the integral equation

(2.5) z_j(t) = z_{j,0} + ∫_0^t v(z_j(s), s) ds.

Data is of the form

(2.6) y_{j,k} ∼ N(z_j(t_k), Γ).

Thus, the inverse problem is to find u from y of the form (2.1) with G_{j,k}(u) = z_j(t_k).

2.3 Groundwater Flow

In the study of groundwater flow an important inverse problem is to determine the permeability k of the subsurface rock from measurements of the head (water table height) p [30]. To ensure the (physically required) positivity of k, we write k(x) = exp(u(x)) and recast the inverse problem as one for the function u. The head p solves the PDE

(2.7)  −∇·(exp(u)∇p) = g,  x ∈ D,
       p = h,  x ∈ ∂D.

Here D is a domain containing the measurement points x_i and ∂D its boundary; in the simplest case g and h are known. The forward solution operator is G_j(u) = p(x_j). The inverse problem is to find u, given y of the form (2.1).
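The following sketch shows a one-dimensional analogue of the forward map G for (2.7), assembled by finite differences; the paper works in two dimensions, so this solver, its simple averaging of the conductivity on cell faces and all names are our simplifications.

```python
import numpy as np

# Illustrative 1D analogue of the forward map in (2.7): solve
# -(exp(u) p')' = g on (0,1) with p(0) = p(1) = h, then observe p at
# given points. u holds values of log-permeability at interior nodes.
def forward_head(u, g=1.0, h=0.0, obs_pts=(0.25, 0.5, 0.75)):
    n = len(u)
    dx = 1.0 / (n + 1)
    kappa = np.exp(u)
    # simple averaging of conductivity on cell faces (length n + 1)
    k_face = 0.5 * (np.r_[kappa, kappa[-1]] + np.r_[kappa[0], kappa])
    # assemble the tridiagonal system for the interior head values
    A = np.zeros((n, n))
    for i in range(n):
        A[i, i] = (k_face[i] + k_face[i + 1]) / dx**2
        if i > 0:
            A[i, i - 1] = -k_face[i] / dx**2
        if i < n - 1:
            A[i, i + 1] = -k_face[i + 1] / dx**2
    rhs = np.full(n, g)
    rhs[0] += k_face[0] * h / dx**2      # Dirichlet boundary contributions
    rhs[-1] += k_face[-1] * h / dx**2
    p = np.linalg.solve(A, rhs)
    grid = np.linspace(dx, 1 - dx, n)
    return np.interp(obs_pts, grid, p)   # G(u): head at observation points
```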
2.4 Image Registration

In many applications arising in medicine and security it is of interest to calculate the distance between a curve Γ_obs, given only through a finite set of noisy observations, and a curve Γ_db from a database of known outcomes. As we demonstrate below, this may be recast as an inverse problem for two functions: the first, η, representing reparameterisation of the database curve Γ_db, and the second, p, representing a momentum variable, normal to the curve Γ_db, which initiates a dynamical evolution of the reparameterized curve in an attempt to match observations of the curve Γ_obs. This approach to inversion is described in [7] and developed in the Bayesian context in [9]. Here we outline the methodology.

Suppose for a moment that we know the entire observed curve Γ_obs and that it is noise free. We parameterize Γ_db by q_db and Γ_obs by q_obs, s ∈ [0, 1]. We wish to find a path q(s, t), t ∈ [0, 1], between Γ_db and Γ_obs, satisfying

(2.8) q(s, 0) = q_db(η(s)),  q(s, 1) = q_obs(s),

where η is an orientation-preserving reparameterisation. Following the methodology of [17, 31, 52], we constrain the motion of the curve q(s, t) by asking that the evolution between the two curves results from the differential equation

(2.9) ∂q(s, t)/∂t = v(q(s, t), t).

Here v(x, t) is a time-parameterized family of vector fields on R² chosen as follows. We define a metric on the "length" of paths as

(2.10) ∫_0^1 ½‖v‖_B² dt,

where B is some appropriately chosen Hilbert space. The dynamics (2.9) are defined by choosing an appropriate v which minimizes this metric, subject to the end point constraints (2.8).

In [7] it is shown that this minimisation problem can be solved via a dynamical system obtained from the Euler–Lagrange equation. This dynamical system yields q(s, 1) = G(p, η, s), where p is an initial momentum variable normal to Γ_db, and η is the reparameterisation. In the perfectly observed scenario the optimal values of u = (p, η) solve the equation G(u, s) := G(p, η, s) = q_obs(s).

In the partially and noisily observed scenario we are given observations

y_j = q_obs(s_j) + η_j = G(u, s_j) + η_j

for j = 1, ..., J; the η_j represent noise. Thus, we have data in the form (2.1) with G_j(u) = G(u, s_j). The inverse problem is to find the distributions on p and η, given a prior distribution on them, a distribution on η and the data y.

2.5 Conditioned Diffusions

The preceding examples all concern Bayesian nonparametric formulation of inverse problems in which a Gaussian prior is adopted. However, the methodology that we employ readily extends to any situation in which the target distribution is absolutely continuous with respect to a reference Gaussian field law, as arises for certain conditioned diffusion processes [20]. The objective in these problems is to find u(t) solving the equation

du(t) = f(u(t)) dt + γ dB(t),

where B is a Brownian motion and where u is conditioned on, for example: (i) end-point constraints (bridge diffusions, arising in econometrics and chemical reactions); (ii) observation of a single sample path y(t) given by

dy(t) = g(u(t)) dt + σ dW(t)

for some Brownian motion W (continuous time signal processing); or (iii) discrete observations of the path given by

y_j = h(u(t_j)) + η_j.

For all three problems use of the Girsanov formula, which allows expression of the density of the path-space measure arising with nonzero drift in terms of that arising with zero-drift, enables all three problems to be written in the form (1.1).

3. SPECIFICATION OF THE REFERENCE MEASURE

The class of algorithms that we describe are primarily based on measures defined through density with respect to the random field model µ0 = N(0, C), denoting a centred Gaussian with covariance operator C. To be able to implement the algorithms in this paper in an efficient way, it is necessary to make assumptions about this Gaussian reference measure. We assume that information about µ0 can be obtained in at least one of the following three ways:

1. the eigenpairs (φ_i, λ_i²) of C are known, so that exact draws from µ0 can be made from truncation of the Karhunen–Loève expansion and, furthermore, efficient methods exist for evaluation of the resulting sum (such as the FFT);
2. exact draws from µ0 can be made on a mesh, for example, by building on exact sampling methods for Brownian motion or the stationary Ornstein–Uhlenbeck (OU) process or other simple Gaussian process priors;
3. the precision operator L = C^{-1} is known and efficient numerical methods exist for the inversion of (I + ζL) for ζ > 0.

These assumptions are not mutually exclusive and for many problems two or more of these will be possible. Both precision and Karhunen–Loève representations link naturally to efficient computational tools that have been developed in numerical analysis. Specifically, the precision operator L is often defined via differential operators and the operator (I + ζL) can be approximated, and efficiently inverted, by finite element or finite difference methods; similarly, the Karhunen–Loève expansion links naturally to the use of spectral methods.
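As an illustration of the third route, a Gaussian draw can be produced directly from a discretized precision matrix via its Cholesky factor. The sketch below assumes a symmetric positive-definite matrix Q obtained, say, by finite differences; the concrete Q used here (identity plus a 1D Laplacian) is a hypothetical stand-in.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

# Sketch of assumption 3: sample from N(0, C) with C = Q^{-1} when only
# the precision Q is available; any symmetric positive-definite Q works.
def sample_from_precision(Q, rng):
    R = cholesky(Q, lower=False)           # Q = R^T R, R upper triangular
    z = rng.standard_normal(Q.shape[0])
    # If x solves R x = z, then Cov(x) = R^{-1} R^{-T} = Q^{-1} = C.
    return solve_triangular(R, z, lower=False)

n = 200
dx2 = (1.0 / (n + 1))**2
Q = (np.diag(np.full(n, 1.0 + 2.0 / dx2))       # identity plus -Laplacian
     + np.diag(np.full(n - 1, -1.0 / dx2), 1)
     + np.diag(np.full(n - 1, -1.0 / dx2), -1))
u = sample_from_precision(Q, np.random.default_rng(0))
```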
The book [45] describes the literature concerning methods for sampling from Gaussian random fields, and links with efficient numerical methods for inversion of differential operators. An early theoretical exploration of the links between numerical analysis and statistics is undertaken in [14]. The particular links that we develop in this paper are not yet fully exploited in applications and we highlight the possibility of doing so.

3.1 The Karhunen–Loève Expansion

The book [2] introduces the Karhunen–Loève expansion and its properties. Let µ0 = N(0, C) denote a Gaussian measure on a Hilbert space X. Recall that the orthonormalized eigenvalue/eigenfunction pairs of C form an orthonormal basis for X and solve the problem

Cφ_i = λ_i²φ_i,  i = 1, 2, ....

Furthermore, we assume that the operator C is trace-class:

(3.1) Σ_{i=1}^∞ λ_i² < ∞.

Draws from the centred Gaussian measure µ0 can then be made as follows. Let {ξ_i}_{i=1}^∞ denote an independent sequence of normal random variables with distribution N(0, λ_i²) and consider the random function

(3.2) u(x) = Σ_{i=1}^∞ ξ_iφ_i(x).

This series converges in L²(Ω; X) under the trace-class condition (3.1). It is sometimes useful, both conceptually and for purposes of implementation, to think of the unknown function u as being the infinite sequence {ξ_i}_{i=1}^∞, rather than the function with these expansion coefficients.

We let P^d denote projection onto the first d modes of the Karhunen–Loève basis. (Note that "mode" here, denoting an element of a basis in a Hilbert space, differs from the "mode" of a distribution.) Thus,

(3.3) P^{d_u}u(x) = Σ_{i=1}^{d_u} ξ_iφ_i(x).

If the series (3.3) can be summed quickly on a grid, then this provides an efficient method for computing exact samples from truncation of µ0 to a finite-dimensional space. When we refer to mesh-refinement, in the context of the prior, this refers to increasing the number of terms d_u used to represent the target function u.

3.2 Random Truncation and Sieve Priors

Non-Gaussian priors can be constructed from the Karhunen–Loève expansion (3.3) by allowing d_u itself to be a random variable supported on N; we let p(i) = P(d_u = i). Much of the methodology in this paper can be extended to these priors. A draw from such a prior measure can be written as

(3.4) u(x) = Σ_{i=1}^∞ I(i ≤ d_u)ξ_iφ_i(x),

where I(i ∈ E) is the indicator function. We refer to this as the random truncation prior. Functions drawn from this prior are non-Gaussian and almost surely C^∞. However, expectations with respect to d_u will be Gaussian and can be less regular: they are given by the formula

(3.5) E^{d_u}u(x) = Σ_{i=1}^∞ α_iξ_iφ_i(x),

where α_i = P(d_u ≥ i). As in the Gaussian case, it can be useful, both conceptually and for purposes of implementation, to think of the unknown function u as being the infinite vector ({ξ_i}_{i=1}^∞, d_u) rather than the function with these expansion coefficients.

Making d_u a random variable has the effect of switching on (nonzero) and off (zero) coefficients in the expansion of the target function. This formulation switches the basis functions on and off in a fixed order. Random truncation as expressed by equation (3.4) is not the only variable dimension formulation. In dimension greater than one we will employ the sieve prior, which allows every basis function to have an individual on/off switch. This prior relaxes the constraint imposed on the order in which the basis functions are switched on and off, and we write

(3.6) u(x) = Σ_{i=1}^∞ χ_iξ_iφ_i(x),

where {χ_i}_{i=1}^∞ ∈ {0, 1}. We define the distribution on χ = {χ_i}_{i=1}^∞ as follows. Let ν0 denote a reference measure formed from considering an i.i.d. sequence of Bernoulli random variables with success probability one half. Then define the prior measure ν on χ to have density

dν/dν0 (χ) ∝ exp(−λ Σ_{i=1}^∞ χ_i),

where λ ∈ R^+. As for the random truncation method, it is both conceptually and practically valuable to think of the unknown function as being the pair of random infinite vectors {ξ_i}_{i=1}^∞ and {χ_i}_{i=1}^∞. Hierarchical priors, based on Gaussians but with random switches in front of the coefficients, are termed "sieve priors" in [54]. In that paper posterior consistency questions for linear regression are also analysed in this setting. Similar random truncation priors are used in nonparametric inference for drift functions in diffusion processes in [53].
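The three priors just described are simple to simulate once a basis and eigenvalue sequence are fixed. The sketch below uses a sine basis on [0, 1] and λ_i ∝ i^{-2}, with a geometric distribution for d_u; all of these choices are ours for illustration and are not prescribed by the paper.

```python
import numpy as np

# Illustrative draws from the priors (3.2), (3.4) and (3.6), with a sine
# basis on [0,1] and lambda_i = i^{-2} (any trace-class sequence works).
def kl_setup(x, d_max, rng):
    i = np.arange(1, d_max + 1)
    xi = i**-2.0 * rng.standard_normal(d_max)             # xi_i ~ N(0, lambda_i^2)
    phi = np.sqrt(2.0) * np.sin(np.outer(x, i) * np.pi)   # basis functions
    return xi, phi

x = np.linspace(0, 1, 200)
rng = np.random.default_rng(1)
xi, phi = kl_setup(x, d_max=100, rng=rng)

u_gauss = phi @ xi                            # truncated version of (3.2)

d_u = rng.geometric(0.05)                     # random truncation level
mask = np.arange(1, 101) <= d_u
u_trunc = phi @ (mask * xi)                   # draw from (3.4)

chi = rng.random(100) < 0.5                   # sieve switches (the lambda = 0 case,
u_sieve = phi @ (chi * xi)                    # i.e. the reference measure nu_0); (3.6)
```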
4. MCMC METHODS FOR FUNCTIONS

The transferable idea in this section is that design of MCMC methods which are defined on function spaces leads, after discretization, to algorithms which are robust under mesh refinement d_u → ∞. We demonstrate this idea for a number of algorithms, generalizing random walk and Langevin-based Metropolis–Hastings methods, the independence sampler, the Gibbs sampler and the HMC method; we anticipate that many other generalisations are possible. In all cases the proposal exactly preserves the Gaussian reference measure µ0 when the potential Φ is zero, and the reader may take this key idea as a design principle for similar algorithms.

Section 4.1 gives the framework for MCMC methods on a general state space. In Section 4.2 we state and derive the new Crank–Nicolson proposals, arising from discretization of an OU process. In Section 4.3 we generalize these proposals to the Langevin setting where steepest descent information is incorporated: MALA proposals. Section 4.4 is concerned with independence samplers, which may be derived from particular parameter choices in the random walk algorithm. Section 4.5 introduces the idea of randomizing the choice of δ as part of the proposal, which is effective for the random walk methods. In Section 4.6 we introduce Gibbs samplers based on the Karhunen–Loève expansion (3.2). In Section 4.7 we work with non-Gaussian priors specified through random truncation of the Karhunen–Loève expansion as in (3.4), showing how Gibbs samplers can again be used in this situation. Section 4.8 briefly describes the HMC method and its generalisation to sampling functions.

A key idea underlying the new variants on random walk and Langevin-based Metropolis–Hastings algorithms derived below is to use discretizations of stochastic partial differential equations (SPDEs) which are invariant for either the reference or the target measure. These SPDEs have the form, for L = C^{-1} the precision operator for µ0 and DΦ the derivative of the potential Φ,

(4.2) du/ds = −K(Lu + γDΦ(u)) + √(2K) db/ds.

Here b is a Brownian motion in X with covariance operator the identity and K = C or I. Since K is a positive operator, we may define the square-root in the symmetric fashion, via diagonalization in the Karhunen–Loève basis of C. We refer to it as an SPDE because in many applications L is a differential operator. The SPDE has invariant measure µ0 for γ = 0 (when it is an infinite-dimensional OU process) and µ for γ = 1 [12, 18, 22]. The target measure µ will behave like the reference measure µ0 on high frequency (rapidly oscillating) functions. Intuitively, this is because the data, which is finite, is not informative about the function on small scales; mathematically, this is manifest in the absolute continuity of µ with respect to µ0 given by formula (1.1). Thus, discretizations of equation (4.2) with either γ = 0 or γ = 1 form sensible candidate proposal distributions.

The basic idea which underlies the algorithms described here was introduced in the specific context of conditioned diffusions with γ = 1 in [50], and then generalized to include the case γ = 0 in [4]; furthermore, the paper [4], although focussed on the application to conditioned diffusions, applies to general targets of the form (1.1). The papers [4, 50] both include numerical results illustrating applicability of the method to conditioned diffusions in the case γ = 1, and the paper [10] shows application to data assimilation with γ = 0. Finally, we mention that in [33] the algorithm with γ = 0 is mentioned, although the derivation does not use the SPDE motivation that we develop here, and the concept of a nonparametric limit is not used to motivate the construction.

4.1 Set-Up

We are interested in defining MCMC methods for measures µ on a Hilbert space (X, ⟨·, ·⟩), with induced norm ‖·‖, given by (1.1) where µ0 = N(0, C).
The setting we adopt is that given in [51], where Metropolis–Hastings methods are developed in a general state space. Let q(u, ·) denote the transition kernel on X and η(du, dv) denote the measure on X × X found by taking u ∼ µ and then v|u ∼ q(u, ·). We use η^⊥(u, v) to denote the measure found by reversing the roles of u and v in the preceding construction of η. If η^⊥(u, v) is equivalent (in the sense of measures) to η(u, v), then the Radon–Nikodym derivative dη^⊥/dη (u, v) is well-defined and we may define the acceptance probability

(4.1) a(u, v) = min{1, dη^⊥/dη (u, v)}.

We accept the proposed move from u to v with this probability. The resulting Markov chain is µ-reversible.

4.2 Vanilla Local Proposals

The standard random walk proposal for v|u takes the form

(4.3) v = u + √(2δK)ξ0

for any δ ∈ [0, ∞), ξ0 ∼ N(0, I) and K = I or K = C. This can be seen as a discrete skeleton of (4.2) after ignoring the drift terms. Therefore, such a proposal leads to an infinite-dimensional version of the well-known random walk Metropolis algorithm.

The random walk proposal in finite-dimensional problems always leads to a well-defined algorithm and rarely encounters any reducibility problems [46]. Therefore, this method can certainly be applied for arbitrarily fine mesh size. However, taking this approach does not lead to a well-defined MCMC method for functions. This is because η^⊥ is singular with respect to η, so that all proposed moves are rejected with probability 1. (We prove this in Theorem 6.3 below.) Returning to the finite mesh case, algorithm mixing time therefore increases to ∞ as d_u → ∞.

To define methods with convergence properties robust to increasing d_u, alternative approaches leading to well-defined and irreducible algorithms on the Hilbert space need to be considered. We consider two possibilities here, both based on Crank–Nicolson approximations [38] of the linear part of the drift.
In particular, we consider discretization of equation (4.2) with the form

(4.4) v = u − ½δKL(u + v) − δγKDΦ(u) + √(2δK)ξ0

for a (spatial) white noise ξ0.

First consider the discretization (4.4) with γ = 0 and K = I. Rearranging shows that the resulting Crank–Nicolson proposal (CN) for v|u is found by solving

(4.5) (I + ½δL)v = (I − ½δL)u + √(2δ)ξ0.

It is in this form that the proposal is best implemented whenever the prior/reference measure µ0 is specified via the precision operator L and when efficient algorithms exist for inversion of the identity plus a multiple of L. However, for the purposes of analysis it is also useful to write this equation in the form

(4.6) (2C + δI)v = (2C − δI)u + √(8δ)Cw,

where w ∼ N(0, C), found by applying the operator 2C to equation (4.5).

A well-established principle in finite-dimensional sampling algorithms advises that the proposal variance should be approximately a scalar multiple of that of the target (see, e.g., [42]). The variance in the prior, C, can provide a reasonable approximation, at least as far as controlling the large d_u limit is concerned. This is because the data (or change of measure) is typically only informative about a finite set of components in the prior model; mathematically, the fact that the posterior has density with respect to the prior means that it "looks like" the prior in the large i components of the Karhunen–Loève expansion. (An interesting research problem would be to combine the ideas in [16], which provide an adaptive preconditioning but are only practical in a finite number of dimensions, with the prior-based fixed preconditioning used here; note that the method introduced in [16] reduces exactly to the preconditioning used here in the absence of data.)

The CN algorithm violates this principle: the proposal variance operator is proportional to (2C + δI)^{-2}C², suggesting that algorithm efficiency might be improved still further by obtaining a proposal variance of C. In the familiar finite-dimensional case, this can be achieved by a standard reparameterisation argument which has its origins in [23] if not before. This motivates our final local proposal in this subsection. The preconditioned CN proposal (pCN) for v|u is obtained from (4.4) with γ = 0 and K = C, giving the proposal

(4.7) (2 + δ)v = (2 − δ)u + √(8δ)w,

where, again, w ∼ N(0, C). As discussed after (4.5), and in Section 3, there are many different ways in which the prior Gaussian may be specified. If the specification is via the precision L and if there are numerical methods for which (I + ζL) can be efficiently inverted, then (4.5) is a natural proposal. If, however, sampling from C is straightforward (via the Karhunen–Loève expansion or directly), then it is natural to use the proposal (4.7), which requires only that it is possible to draw from µ0 efficiently. For δ ∈ [0, 2] the proposal (4.7) can be written as

(4.8) v = (1 − β²)^{1/2}u + βw,

where w ∼ N(0, C) and β ∈ [0, 1]; in fact, β = √(8δ)/(2 + δ). In this form we see very clearly a simple generalisation of the finite-dimensional random walk given by (4.3) with K = C.

The numerical experiments described in Section 1.1 show that the pCN proposal significantly improves upon the naive random walk method (4.3), and similar positive results can be obtained for the CN method. Furthermore, for both the proposals (4.5) and (4.7) we show in Theorem 6.2 that η^⊥ and η are equivalent (as measures) by showing that they are both equivalent to the same Gaussian reference measure η0, whilst in Theorem 6.3 we show that the proposal (4.3) leads to mutually singular measures η^⊥ and η. This theory explains the numerical observations and motivates the importance of designing algorithms directly on function space.

The accept–reject formula for CN and pCN is very simple. If, for some ρ : X × X → R and some reference measure η0,

(4.9) dη/dη0 (u, v) = Z exp(−ρ(u, v)),  dη^⊥/dη0 (u, v) = Z exp(−ρ(v, u)),

then it follows that

(4.10) dη^⊥/dη (u, v) = exp(ρ(u, v) − ρ(v, u)).

For both CN proposals (4.5) and (4.7) we show in Theorem 6.2 below that, for appropriately defined η0, we have ρ(u, v) = Φ(u), so that the acceptance probability is given by

(4.11) a(u, v) = min{1, exp(Φ(u) − Φ(v))}.

In this sense the CN and pCN proposals may be seen as the natural generalisations of random walks to the setting where the target measure is defined via density with respect to a Gaussian, as in (1.1). This point of view may be understood by noting that the accept/reject formula is defined entirely through differences in this log density, as happens in finite dimensions for the standard random walk if the density is specified with respect to the Lebesgue measure.

4.3 MALA Proposal Distributions

The CN proposals (4.5) and (4.7) contain no information about the potential Φ given by (1.1); they contain only information about the reference measure µ0. Indeed, they are derived by discretizing the SPDE (4.2) in the case γ = 0, for which µ0 is an invariant measure. The idea behind the Metropolis-adjusted Langevin (MALA) proposals (see [39, 44] and the references therein) is to discretize an equation which is invariant for the measure µ. Thus, to construct such proposals in the function space setting, we discretize the SPDE (4.2) with γ = 1. Taking K = I and K = C then gives the following two proposals.

The Crank–Nicolson Langevin proposal (CNL) is given by

(4.12) (2C + δ)v = (2C − δ)u − 2δCDΦ(u) + √(8δ)Cw,

where, as before, w ∼ µ0 = N(0, C). If we define

ρ(u, v) = Φ(u) + ½⟨v − u, DΦ(u)⟩ + (δ/4)⟨C^{-1}(u + v), DΦ(u)⟩ + (δ/4)‖DΦ(u)‖²,

then the acceptance probability is given by (4.1) and (4.10).
Implementation of this proposal simply requires inversion of (I + ζL), as for (4.5). The CNL method is the special case θ = ½ for the IA algorithm introduced in [4].

The preconditioned Crank–Nicolson Langevin proposal (pCNL) is given by

(4.13) (2 + δ)v = (2 − δ)u − 2δCDΦ(u) + √(8δ)w,

where w is again a draw from µ0. Defining

ρ(u, v) = Φ(u) + ½⟨v − u, DΦ(u)⟩ + (δ/4)⟨u + v, DΦ(u)⟩ + (δ/4)‖C^{1/2}DΦ(u)‖²,

the acceptance probability is given by (4.1) and (4.10). Implementation of this proposal requires draws from the reference measure µ0 to be made, as for (4.7). The pCNL method is the special case θ = ½ for the PIA algorithm introduced in [4].

4.4 Independence Sampler

Making the choice δ = 2 in the pCN proposal (4.7) gives an independence sampler. The proposal is then simply a draw from the prior: v = w. The acceptance probability remains (4.11). An interesting generalisation of the independence sampler is to take δ = 2 in the MALA proposal (4.13), giving the proposal

(4.14) v = −CDΦ(u) + w

with resulting acceptance probability given by (4.1) and (4.10) with

ρ(u, v) = Φ(u) + ⟨v, DΦ(u)⟩ + ½‖C^{1/2}DΦ(u)‖².
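A single pCNL step is easy to express in Karhunen–Loève coordinates, where C is diagonal. In the sketch below, phi and dphi stand for user-supplied evaluations of Φ and DΦ; these names, and the diagonal representation, are our assumptions. Setting delta = 2 reduces the proposal to the MALA-type independence sampler (4.14).

```python
import numpy as np

# Sketch of one pCNL step (4.13) in Karhunen-Loeve coordinates, where C
# is diagonal with entries lam2 = lambda_i^2.
def pcnl_step(u, phi, dphi, lam2, delta, rng):
    w = np.sqrt(lam2) * rng.standard_normal(u.size)    # w ~ N(0, C)
    g = dphi(u)
    v = ((2 - delta) * u - 2 * delta * lam2 * g
         + np.sqrt(8 * delta) * w) / (2 + delta)       # proposal (4.13)

    def rho(a, b):
        # rho(a, b) for pCNL, as defined in Section 4.3
        ga = dphi(a)
        return (phi(a) + 0.5 * np.dot(b - a, ga)
                + 0.25 * delta * np.dot(a + b, ga)
                + 0.25 * delta * np.dot(lam2 * ga, ga))   # |C^{1/2} DPhi|^2

    log_alpha = rho(u, v) - rho(v, u)                  # acceptance via (4.10)
    return v if np.log(rng.random()) < log_alpha else u
```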

4.5 Random Proposal Variance

It is sometimes useful to randomise the proposal variance δ in order to obtain better mixing. We discuss this idea in the context of the pCN proposal (4.7). To emphasize the dependence of the proposal kernel on δ, we denote it by q(u, dv; δ). We show in Section 6.1 that the measure η0(du, dv) = q(u, dv; δ)µ0(du) is well-defined and symmetric in u, v for every δ ∈ [0, ∞). If we choose δ at random from any probability distribution ν on [0, ∞), independently from w, then the resulting proposal has kernel

q(u, dv) = ∫_0^∞ q(u, dv; δ)ν(dδ).

Furthermore, the measure q(u, dv)µ0(du) may be written as

∫_0^∞ q(u, dv; δ)µ0(du)ν(dδ)

and is hence also symmetric in u, v. Hence, both the CN and pCN proposals (4.5) and (4.7) may be generalised to allow for δ chosen at random independently of u and w, according to some measure ν on [0, ∞). The acceptance probability remains (4.11), as for fixed δ.

4.6 Metropolis-Within-Gibbs: Blocking in Karhunen–Loève Coordinates

Any function u ∈ X can be expanded in the Karhunen–Loève basis and hence written as

(4.15) u(x) = Σ_{i=1}^∞ ξ_iφ_i(x).

Thus, we may view the probability measure µ given by (1.1) as a measure on the coefficients u = {ξ_i}_{i=1}^∞. For any index set I ⊂ N we write ξ^I = {ξ_i}_{i∈I} and ξ^{−I} = {ξ_i}_{i∉I}. Both ξ^I and ξ^{−I} are independent and Gaussian under the prior µ0, with diagonal covariance operators C^I, C^{−I}, respectively. If we let µ0^I denote the Gaussian N(0, C^I), then (1.1) gives

(4.16) dµ/dµ0^I (ξ^I | ξ^{−I}) ∝ exp(−Φ(ξ^I, ξ^{−I})),

where we now view Φ as a function on the coefficients in the expansion (4.15). This formula may be used as the basis for Metropolis-within-Gibbs samplers using blocking with respect to a set of partitions {I_j}_{j=1,...,J} with the property ∪_{j=1}^J I_j = N. Because the formula is defined for functions, this will give rise to methods which are robust under mesh refinement when implemented in practice. We have found it useful to use the partitions I_j = {j} for j = 1, ..., J − 1 and I_J = {J, J + 1, ...}. On the other hand, standard Gibbs and Metropolis-within-Gibbs samplers are based on partitioning via I_j = {j}, and do not behave well under mesh-refinement, as we will demonstrate.
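A sketch of one sweep of the resulting Metropolis-within-Gibbs sampler follows, with the partitions I_j = {j} for j < J and a final tail block; the per-block move used here is a pCN-type update that preserves the block's prior conditional, which is one convenient choice of ours rather than the only one.

```python
import numpy as np

# One Metropolis-within-Gibbs sweep for (4.16): blocks I_j = {j} for
# j < J and a tail block I_J = {J, J+1, ...}. phi acts on the coefficient
# vector xi; lam holds the prior standard deviations lambda_i.
def mwg_sweep(xi, phi, lam, beta, J, rng):
    blocks = [np.array([j]) for j in range(J - 1)] + [np.arange(J - 1, xi.size)]
    for idx in blocks:
        prop = xi.copy()
        w = lam[idx] * rng.standard_normal(idx.size)
        # pCN-type move on the block, prior-reversible within the block
        prop[idx] = np.sqrt(1 - beta**2) * xi[idx] + beta * w
        if np.log(rng.random()) < phi(xi) - phi(prop):
            xi = prop
    return xi
```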
4.7 Metropolis-Within-Gibbs: Random Truncation and Sieve Priors

We will also use Metropolis-within-Gibbs to construct sampling algorithms which alternate between updating the coefficients ξ = {ξ_i}_{i=1}^∞ in (3.4) or (3.6), and the integer d_u, for (3.4), or the infinite sequence χ = {χ_i}_{i=1}^∞, for (3.6). In words, we alternate between the coefficients in the expansion of a function and the parameters determining which parameters are active.

If we employ the non-Gaussian prior with draws given by (3.4), then the negative log likelihood Φ can be viewed as a function of (ξ, d_u) and it is natural to consider Metropolis-within-Gibbs methods which are based on the conditional distributions for ξ|d_u and d_u|ξ. Note that, under the prior, ξ and d_u are independent with ξ ∼ µ_{0,ξ} := N(0, C) and d_u ∼ µ_{0,d_u}, the latter being supported on N with p(i) = P(d_u = i). For fixed d_u we have

(4.17) dµ/dµ_{0,ξ} (ξ|d_u) ∝ exp(−Φ(ξ, d_u)),

with Φ(u) rewritten as a function of ξ and d_u via the expansion (3.4). This measure can be sampled by any of the preceding Metropolis–Hastings methods designed in the case with Gaussian µ0. For fixed ξ we have

(4.18) dµ/dµ_{0,d_u} (d_u|ξ) ∝ exp(−Φ(ξ, d_u)).

A natural biased random walk for d_u|ξ arises by proposing moves from a random walk on N which satisfies detailed balance with respect to the distribution p(i). The acceptance probability is then

a(u, v) = min{1, exp(Φ(ξ, d_u) − Φ(ξ, d_v))}.

Variants on this are possible and, if p(i) is monotonic decreasing, a simple random walk proposal on the integers, with local moves d_u → d_v = d_u ± 1, is straightforward to implement. Of course, different proposal stencils can give improved mixing properties, but we employ this particular random walk for expository purposes.
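The d_u|ξ update can be sketched as follows; for brevity we fold the prior ratio p(d_v)/p(d_u) into a single Metropolis–Hastings acceptance, which targets (4.18) just as the prior-reversible construction in the text does. The names log_p and phi are ours.

```python
import numpy as np

# Sketch of the d_u | xi update for the random truncation prior (3.4),
# with local moves d_u -> d_u +/- 1. log_p(i) is the log prior mass of i.
def update_du(xi, du, phi, log_p, rng):
    dv = du + (1 if rng.random() < 0.5 else -1)
    if dv < 1:
        return du                      # stay inside the support of p
    log_alpha = (log_p(dv) - log_p(du)) + (phi(xi, du) - phi(xi, dv))
    return dv if np.log(rng.random()) < log_alpha else du
```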

If, instead of (3.4), we use the non-Gaussian sieve prior defined by equation (3.6), the prior and posterior measures may be viewed as measures on u = ({ξ_i}_{i=1}^∞, {χ_j}_{j=1}^∞). These variables may be modified as stated above via Metropolis-within-Gibbs for sampling the conditional distributions ξ|χ and χ|ξ. If, for example, the proposal for χ|ξ is reversible with respect to the prior on χ, then the acceptance probability for this move is given by

a(u, v) = min{1, exp(Φ(ξ_u, χ_u) − Φ(ξ_v, χ_v))}.

In Section 5.3 we implement a slightly different proposal in which, with probability ½, a nonactive mode is switched on, and with the remaining probability an active mode is switched off. If we define N_on = Σ_{i=1}^N χ_i, then the probability of moving from ξ_u to a state ξ_v in which an extra mode is switched on is

a(u, v) = min{1, exp(Φ(ξ_u, χ_u) − Φ(ξ_v, χ_v)) · (N − N_on)/N_on}.

Similarly, the probability of moving to a situation in which a mode is switched off is

a(u, v) = min{1, exp(Φ(ξ_u, χ_u) − Φ(ξ_v, χ_v)) · N_on/(N − N_on)}.

4.8 Hybrid Monte Carlo Methods

The algorithms discussed above have been based on proposals which can be motivated through discretization of an SPDE which is invariant for either the prior measure µ0 or for the posterior µ itself. HMC methods are based on a different idea, which is to consider a Hamiltonian flow in a state space found from introducing extra "momentum" or "velocity" variables to complement the variable u in (1.1). If the momentum/velocity is chosen randomly from an appropriate Gaussian distribution at regular intervals, then the resulting Markov chain in u is invariant under µ. Discretizing the flow, and adding an accept/reject step, results in a method which remains invariant for µ [15]. These methods can break random-walk type behaviour of methods based on local proposals [32, 34]. It is hence of interest to generalise these methods to the function sampling setting dictated by (1.1), and this is undertaken in [5]. The key novel idea required to design this algorithm is the development of a new integrator for the Hamiltonian flow underlying the method; this integrator is exact in the Gaussian case Φ ≡ 0, on function space, and for this reason behaves well for nonparametric problems where d_u may be arbitrarily large, indeed in infinite dimensions.

5. COMPUTATIONAL ILLUSTRATIONS

This section contains numerical experiments designed to illustrate various properties of the sampling algorithms overviewed in this paper. We employ the examples introduced in Section 2.

5.1 Density Estimation

Section 1.1 shows an example which illustrates the advantage of using the function-space algorithms highlighted in this paper in comparison with standard techniques; there we compared pCN with a standard random walk. The first goal of the experiments in this subsection is to further illustrate the advantage of the function-space algorithms over standard algorithms. Specifically, we compare the Metropolis-within-Gibbs method from Section 4.6, based on the partition I_j = {j} and labelled MwG here, with the pCN sampler from Section 4.2. The second goal is to study the effect of prior modelling on algorithmic performance; to do this, we study a third algorithm, RTM-pCN, based on sampling the randomly truncated Gaussian prior (3.4) using the Gibbs method from Section 4.7, with the pCN sampler for the coefficient update.

5.1.1 Target distribution. We will use the true density

ρ ∝ N(−3, 1)I(x ∈ (−ℓ, +ℓ)) + N(+3, 1)I(x ∈ (−ℓ, +ℓ)),

where ℓ = 10. [Recall that I(·) denotes the indicator function of a set.] This density corresponds approximately to a situation where there is a 50/50 chance of being in one of the two Gaussians. This one-dimensional multi-modal density is sufficient to expose the advantages of the function-space samplers pCN and RTM-pCN over MwG.
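For experiments of this kind, synthetic observations can be generated by rejection sampling from the truncated mixture; the paper does not specify its data-generation procedure, so the sketch below is merely one way to produce such data.

```python
import numpy as np

# Synthetic data from the truncated two-Gaussian target of Section 5.1.1,
# drawn by simple rejection (our choice, not prescribed by the paper).
def sample_target(n, l=10.0, rng=np.random.default_rng(2)):
    out = []
    while len(out) < n:
        y = rng.normal(-3.0 if rng.random() < 0.5 else 3.0, 1.0)
        if -l < y < l:                # discard the (tiny) mass outside (-l, l)
            out.append(y)
    return np.array(out)

y = sample_target(500)
```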
which remains invariant for µ [15]. These methods can break random-walk type behaviour of methods 5.1.2 Prior We will make comparisons between based on local proposal [32, 34]. It is hence of in- the three algorithms regarding their computational terest to generalise these methods to the function performance, via various graphical and numerical sampling setting dictated by (1.1) and this is under- measures. In all cases it is important that the reader taken in [5]. The key novel idea required to design appreciates that the comparison between MwG and 14 COTTER, ROBERTS, STUART AND WHITE

pCN corresponds to sampling from the same posterior, since they use the same prior, and that all comparisons between RTM-pCN and other methods also quantify the effect of prior modelling as well as algorithm.

Two priors are used for this experiment: the Gaussian prior given by (3.2) and the randomly truncated Gaussian given by (3.4). We apply the MwG and pCN schemes in the former case, and the RTM-pCN scheme for the latter. The prior uses the same Gaussian covariance structure for the independent ξ, namely, ξ_i ∼ N(0, λ_i²), where λ_i ∝ i^{-2}. Note that the eigenvalues are summable, as required for draws from the Gaussian measure to be square integrable and to be continuous. The prior for the number of active terms d_u is an exponential distribution with rate λ = 0.01.

5.1.3 Numerical implementation. In order to facilitate a fair comparison, we tuned the value of δ in the pCN and RTM-pCN proposals to obtain an average acceptance probability of around 0.234, requiring, in both cases, δ ≈ 0.27. (For RTM-pCN the average acceptance probability refers only to moves in {ξ_i}_{i=1}^∞ and not in d_u.) We note that with the value δ = 2 we obtain the independence sampler for pCN; however, this sampler only accepted 12 proposals out of 10^6 MCMC steps, indicating the importance of tuning δ correctly. For MwG there is no tunable parameter, and we obtain an acceptance rate of around 0.99.

Fig. 2. Trace and autocorrelation plots for sampling posterior measure with true density ρ using MwG, pCN and RTM-pCN methods.

Table 1
Approximate integrated autocorrelation times for target ρ

Algorithm    IACT
MwG          894
pCN           73.2
RTM-pCN      143

5.1.4 Numerical results. In order to compare the performance of pCN, MwG and RTM-pCN, we show, in Figure 2 and Table 1, trace plots, correlation functions and integrated autocorrelation times (the latter are notoriously difficult to compute accurately [47], and displayed numbers to three significant figures should only be treated as indicative). The autocorrelation function decays for ergodic Markov chains, and its integral determines the asymptotic variance of sample path averages. The integrated autocorrelation time is used, via this asymptotic variance, to determine the number of steps required to obtain an effectively independent sample from the MCMC method. The figures and integrated autocorrelation times clearly show that pCN and RTM-pCN outperform MwG by an order of magnitude. This reflects the fact that pCN and RTM-pCN are function-space samplers, designed to mix independently of the mesh-size.
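Integrated autocorrelation times such as those in Table 1 can be estimated with a simple truncated-sum estimator; the implementation below is a basic version of ours, not the more careful procedure referenced via [47].

```python
import numpy as np

# Windowed estimator of the integrated autocorrelation time of a scalar
# chain: 1 + 2 * sum of autocorrelations, truncated at the first
# non-positive lag (a crude but common rule).
def iact(chain, max_lag=None):
    x = np.asarray(chain, dtype=float) - np.mean(chain)
    n = x.size
    max_lag = max_lag or n // 10
    acf = np.array([np.dot(x[:n - k], x[k:]) / np.dot(x, x)
                    for k in range(max_lag)])
    cut = np.argmax(acf <= 0) if np.any(acf <= 0) else max_lag
    return 1.0 + 2.0 * np.sum(acf[1:cut])
```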

In contrast, the MwG method is heavily mesh-dependent, since updates are made one Fourier component at a time.

Finally, we comment on the effect of the different priors. The asymptotic variance for RTM-pCN is approximately double that of pCN. However, RTM-pCN can have a reduced runtime, per unit error, when compared with pCN, as Table 2 shows. This improvement of RTM-pCN over pCN is primarily caused by the reduction in the number of random number generations due to the adaptive size of the basis in which the unknown density is represented.

Table 2
Comparison of computational timings for target ρ

Algorithm    Time for 10^6 steps (s)    Time to draw an independent sample (s)
MwG          262                        0.234
pCN          451                        0.0331
RTM-pCN      278                        0.0398

5.2 Data Assimilation in Fluid Mechanics

We now proceed to a more complex problem and describe numerical results which demonstrate that the function space samplers successfully sample nontrivial problems arising in applications. We study both the Eulerian and Lagrangian data assimilation problems from Section 2.2, for the Stokes flow forward model γ = 0. It has been demonstrated in [8, 10] that pCN can successfully sample from the posterior distribution for such problems. In this subsection we will illustrate three features of such methods: convergence of the algorithm from different starting states, convergence with different proposal step sizes, and behaviour with random distributions for the proposal step size, as discussed in Section 4.5.

5.2.1 Target distributions. In this application we aim to characterize the posterior distribution on the initial condition of the two-dimensional velocity field u0 for Stokes flow [equation (2.3) with γ = 0], given a set of either Eulerian (2.4) or Lagrangian (2.6) observations. In both cases, the posterior is of the form (1.1) with Φ(u) = ½‖G(u) − y‖²_Γ, with G a nonlinear mapping taking u to the observation space. We choose the observational noise covariance to be Γ = σ²I with σ = 10^{-2}.

5.2.2 Prior. We let A be the Stokes operator defined by writing (2.3) as dv/dt + Av = 0, v(0) = u, in the case γ = 0 and ψ = 0. Thus, A is ν times the negative Laplacian, restricted to a divergence-free space; we also work on the space of functions whose spatial average is zero, and then A is invertible. For the numerics that follow, we set ν = 0.05. It is important to note that, in the periodic setting adopted here, A is diagonalized in the basis of divergence-free Fourier series. Thus, fractional powers of A are easily calculated. The prior measure is then chosen as

(5.1) µ0 = N(0, δA^{-α}),

in both the Eulerian and Lagrangian data scenarios. We require α > 1 to ensure that the eigenvalues of the covariance are summable (a necessary and sufficient condition for draws from the prior, and hence the posterior, to be continuous functions, almost surely). In the numerics that follow, the parameters of the prior were chosen to be δ = 400 and α = 2.
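A draw from a prior of the form (5.1) can be produced spectrally. The sketch below is a scalar analogue: the paper's A is the Stokes operator acting on divergence-free velocity fields, whereas here A = −ν∆ acts on mean-zero scalar functions on the torus, and normalisation constants in the FFT colouring are glossed over.

```python
import numpy as np

# Scalar analogue of a draw from (5.1) on the 2D torus, with A = -nu * Laplacian
# acting on mean-zero functions (a simplification of the Stokes operator).
def prior_draw(n=64, nu=0.05, delta=400.0, alpha=2.0,
               rng=np.random.default_rng(3)):
    k = np.fft.fftfreq(n, d=1.0 / n)               # integer wavenumbers
    kx, ky = np.meshgrid(k, k, indexing="ij")
    lap = (2 * np.pi)**2 * (kx**2 + ky**2)         # eigenvalues of -Laplacian
    var = np.zeros_like(lap)
    nz = lap > 0                                   # zero out the mean mode
    var[nz] = delta * (nu * lap[nz])**(-alpha)     # eigenvalues of delta * A^-alpha
    coeff = np.sqrt(var) * np.fft.fft2(rng.standard_normal((n, n)))
    return np.real(np.fft.ifft2(coeff))            # coloured Gaussian field
```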
5.2.3 Numerical implementation. The figures that follow in this section are taken from what are termed identical twin experiments in the data assimilation community: the same approximation of the model described above that is used to simulate the data is also used for the evaluation of Φ in the statistical algorithm, in the calculation of the likelihood of u_0 given the data, with the same assumed covariance structure of the observational noise as was used to simulate the data. Since the domain is the two-dimensional torus, the evolution of the velocity field can be solved exactly for a truncated Fourier series, and in the numerics that follow we truncate this to 100 unknowns, as we have found the results to be robust to further refinement. In the case of the Lagrangian data, we integrate the trajectories (2.5) using an Euler scheme with time step Δt = 0.01. In each case we will give the values of N (number of spatial observations, or particles) and M (number of temporal observations) that were used. The observation stations (Eulerian data) or initial positions of the particles (Lagrangian data) are evenly spaced on a grid. The M observation times are evenly spaced, with the final observation time given by T_M = 1 for Lagrangian observations and T_M = 0.1 for Eulerian. The true initial condition u is chosen randomly from the prior distribution.

5.2.4 Convergence from different initial states. We consider a posterior distribution found from data comprised of 900 Lagrangian tracers observed at 100 evenly spaced times on [0, 1]. The data volume is high, and a form of posterior consistency is observed for low Fourier modes, meaning that the posterior is approximately a Dirac mass at the truth. Observations were made of each of these tracers up to a final time T = 1.

Figure 3 shows traces of the value of one particular Fourier mode⁵ of the true initial conditions. Different starting values are used for pCN, and all converge to the same distribution. The proposal variance β was chosen in order to give an average acceptance probability of approximately 25%.

⁵ The real part of the coefficient of the Fourier mode with wave number 0 in the x-direction and wave number 1 in the y-direction.

Fig. 3. Convergence of the value of one Fourier mode of the initial condition u_0 in the pCN Markov chains with different initial states, with Lagrangian data.

5.2.5 Convergence with different β. Here we study the effect of varying the proposal variance. Eulerian data is used with 900 observations in space and 100 observation times on [0, 1]. Figure 4 shows the different rates of convergence of the algorithm with different values of β, in the same Fourier mode coefficient as used in Figure 3. The value labelled β_opt here is chosen to give an acceptance rate of approximately 50%. This value of β is obtained by using an adaptive burn-in, in which the acceptance probability is estimated over short bursts and the step size β adapted accordingly. With β too small, the algorithm accepts proposed states often, but the changes in state are small, so the algorithm does not explore the state space efficiently. In contrast, with β too big, larger jumps are proposed, but they are often rejected since the proposal typically has small probability density. Figure 4 shows examples of both of these regimes, as well as the more efficient choice β_opt.

Fig. 4. Convergence of one of the Fourier modes of the initial condition in the pCN Markov chains with different proposal variances, with Eulerian data.
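Such an adaptive burn-in can be sketched as follows, written for the pCN proposal in its β parameterization; the burst length, the multiplicative update rule and the toy Gaussian potential are our own illustrative choices, not the exact scheme used in the experiments.

    import numpy as np

    def tune_beta(phi, u0, beta=0.1, target=0.5, bursts=50, burst_len=100, seed=2):
        # Estimate the acceptance rate over short bursts and scale beta
        # towards a target rate; adaptation is confined to burn-in, after
        # which beta is frozen for the production run.
        rng = np.random.default_rng(seed)
        u = np.asarray(u0, float)
        for _ in range(bursts):
            accepted = 0
            for _ in range(burst_len):
                w = rng.standard_normal(u.shape)           # prior draw (identity covariance here)
                v = np.sqrt(1.0 - beta**2) * u + beta * w  # pCN proposal
                if np.log(rng.uniform()) < phi(u) - phi(v):
                    u, accepted = v, accepted + 1
            rate = accepted / burst_len
            beta = min(beta * np.exp(rate - target), 1.0)  # many accepts -> bigger steps
        return beta, u

    # Toy example: Phi(u) = ||u||^2 / (2 * 0.1) on R^10.
    beta_opt, _ = tune_beta(lambda u: 0.5 * np.sum(u**2) / 0.1, np.zeros(10))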
5.2.6 Convergence with random β. Here we illustrate the possibility of using a random proposal variance β, as introduced in Section 4.5 [expressed in terms of δ and (4.7) rather than β and (4.8)]. Such methods have the potential advantage of including the possibility of both large and small steps in the proposal. In this example we use Eulerian data once again, this time with only 9 observation stations and a single observation time T = 0.1. Two instances of the sampler were run with the same data, one with a static value β = β_opt and one with β ∼ U([0.1 × β_opt, 1.9 × β_opt]). The marginal distributions for both Markov chains are shown in Figure 5(a), and are very close indeed, verifying that randomness in the proposal variance scale gives rise to (empirically) ergodic Markov chains. Figure 5(b) shows the distribution of the β for which the proposed state was accepted. As expected, the initially uniform distribution is skewed, as proposals with smaller jumps are more likely to be accepted. The convergence of the method with these two choices for β was roughly comparable in this simple experiment. However, it is of course conceivable that, when attempting to explore multimodal posterior distributions, it may be advantageous to have a mix of both large proposal steps, which may allow large leaps between different areas of high probability density, and smaller proposal steps with which to explore a more localised region.
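A minimal sketch of this random-β variant is given below; the toy potential and dimension are illustrative, and we require 1.9 β_opt ≤ 1 so that the proposal is well defined. Since each fixed-β pCN kernel is reversible with respect to the same target, and the acceptance probability min{1, exp(Φ(u) − Φ(v))} does not involve β, drawing β afresh at each step leaves the target invariant.

    import numpy as np

    def pcn_random_beta(phi, u0, beta_opt, n_steps, seed=3):
        # pCN with a proposal scale drawn each step from
        # U([0.1*beta_opt, 1.9*beta_opt]); requires 1.9 * beta_opt <= 1.
        rng = np.random.default_rng(seed)
        u = np.asarray(u0, float)
        accepted_betas = []                              # empirical analogue of Figure 5(b)
        for _ in range(n_steps):
            beta = rng.uniform(0.1 * beta_opt, 1.9 * beta_opt)
            w = rng.standard_normal(u.shape)             # draw from the (toy) prior
            v = np.sqrt(1.0 - beta**2) * u + beta * w    # pCN proposal
            if np.log(rng.uniform()) < phi(u) - phi(v):  # accept w.p. min{1, e^{Phi(u)-Phi(v)}}
                u = v
                accepted_betas.append(beta)
        return u, np.array(accepted_betas)

    # Toy run: Phi(u) = 5 * ||u - 1||^2 on R^10.
    u_end, betas = pcn_random_beta(lambda u: 5.0 * np.sum((u - 1.0)**2),
                                   np.zeros(10), beta_opt=0.3, n_steps=20_000)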

Fig. 5. Eulerian data assimilation example. (a) Empirical marginal distributions estimated using pCN with and without random β. (b) Plots of the proposal distribution for β and the distribution of the values for which the pCN proposal was accepted.

5.3 Subsurface Geophysics

The purpose of this section is twofold: we demonstrate another nontrivial application where function space sampling is potentially useful, and we demonstrate the use of sieve priors in this context. Key to understanding what follows in this problem is to appreciate that, for the data volume we employ, the posterior distribution can be very diffuse and expensive to explore unless severe prior modelling is imposed, meaning that the prior is heavily weighted towards solutions with only a small number of active Fourier modes, at low wave numbers. This is because the homogenizing property of the elliptic PDE means that a whole range of solutions with different length scales can explain the same data. To combat this, we choose very restrictive priors, either through the form of the Gaussian covariance or through the sieve mechanism, which favour a small number of active Fourier modes.

5.3.1 Target distributions. We consider equation (2.7) in the case D = [0, 1]². Recall that the objective in this problem is to recover the permeability κ = exp(u). The sampling algorithms discussed here are applied to the log permeability u. The “true” permeability for which we test the algorithms is shown in Figure 6 and is given by

(5.2) κ(x) = exp(u_1(x)) = 1/10.

Fig. 6. True permeability function used to create target distributions in the subsurface geophysics application.

The pressure measurement data is y_j = p(x_j) + ση_j, with the η_j i.i.d. standard unit Gaussians and the measurement location shown in Figure 7.
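In the notation of (1.1), the corresponding misfit is Φ(u) = ‖y − G(u)‖²/(2σ²), with G(u) the pressure solution evaluated at the measurement points. The sketch below encodes only this structure; the random linear stand-in for G (the real map requires the finite difference solver of Section 5.3.3) and the value of σ are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(4)
    n_modes, n_obs, sigma = 50, 1, 0.1       # one observation, as in Figure 7
    G = rng.standard_normal((n_obs, n_modes)) / n_modes   # stand-in forward map

    def phi(u, y):
        # Phi(u) = ||y - G(u)||^2 / (2 sigma^2), the misfit entering (1.1)
        r = y - G @ u
        return 0.5 * np.sum(r**2) / sigma**2

    u_true = rng.standard_normal(n_modes)
    y = G @ u_true + sigma * rng.standard_normal(n_obs)
    print(phi(u_true, y))                    # O(1) at the truth, by construction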

5.3.2 Prior. The priors will either be Gaussian or a sieve prior based on a Gaussian. In both cases the Gaussian structure is defined via a Karhunen–Loève expansion of the form

(5.3) u(x) = ζ_{0,0} φ^{(0,0)} + Σ_{(p,q)∈Z²\{0,0}} [ζ_{p,q}/(p² + q²)^α] φ^{(p,q)},

where the φ^{(p,q)} are two-dimensional Fourier basis functions and the ζ_{p,q} are independent random variables with distribution ζ_{p,q} ∼ N(0, 1), with α ∈ R. To ensure that the eigenvalues of the prior covariance operator are summable (a necessary and sufficient condition for draws from it to be continuous functions, almost surely), we require that α > 1. For the target defined via κ we take α = 1.001.

Fig. 7. Measurement locations for the subsurface experiment.

For the Gaussian prior we employ the MwG and pCN schemes, and for the sieve prior we employ the pCN-based Gibbs sampler from Section 4.7; we refer to this latter algorithm as Sieve-pCN. As in Section 5.1, it is important that the reader appreciates that the comparison between MwG and pCN corresponds to sampling from the same posterior, since they use the same prior, but that all comparisons between Sieve-pCN and the other methods also quantify the effect of prior modelling as well as of algorithm.

5.3.3 Numerical implementation. The forward model is evaluated by solving equation (2.7) on the two-dimensional domain D = [0, 1]² using a finite difference method with a mesh of size J × J. This results in a J² × J² banded matrix with bandwidth J, which may be solved, using a banded matrix solver, in O(J⁴) floating point operations (see [25], page 171). As drawing a sample is an O(J⁴) operation, the grid sizes used within these experiments were kept deliberately low: for the target defined via κ we take J = 64. This allowed a sample to be drawn in less than 100 ms, and therefore 10⁶ samples to be drawn in around a day. We used 1 measurement point, as shown in Figure 7.

5.3.4 Numerical results. Since α = 1.001, the eigenvalues of the prior covariance are only just summable, meaning that many Fourier modes will be active in the prior. Figure 8 shows trace plots obtained through application of the MwG and pCN methods to the Gaussian prior, and of the pCN-based Gibbs sampler, Sieve-pCN, to the sieve prior. The proposal variance for pCN and Sieve-pCN was selected to ensure an average acceptance rate of around 0.234. Four different seeds are used. It is clear from these plots that only the MCMC chain generated by the sieve prior/algorithm combination converges in the available computational time; the other algorithms fail to converge under these test conditions. This demonstrates the importance of prior modelling assumptions for these under-determined inverse problems with multiple solutions.

Fig. 8. Trace plots for the subsurface geophysics application, using 1 measurement. The MwG, pCN and Sieve-pCN algorithms are compared. Different colours correspond to identical MCMC simulations with different random number generator seeds.
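The prior (5.3) is straightforward to sample by truncating the Karhunen–Loève sum, as in the sketch below; the truncation, grid and basis normalization are illustrative simplifications, but the weak damping produced by α = 1.001 (many active modes) is exactly the regime discussed above.

    import numpy as np

    # One (approximate) draw from the Karhunen-Loeve prior (5.3): independent
    # N(0, 1) coefficients damped by (p^2 + q^2)^(-alpha).
    alpha, K, n = 1.001, 32, 128
    x = np.linspace(0.0, 1.0, n, endpoint=False)
    X, Y = np.meshgrid(x, x, indexing="ij")

    rng = np.random.default_rng(5)
    u = rng.standard_normal() * np.ones_like(X)        # zeta_{0,0} * phi^{(0,0)}
    for p in range(-K, K + 1):
        for q in range(K + 1):
            if q == 0 and p <= 0:
                continue                               # visit each pair {k, -k} once
            damp = (p**2 + q**2) ** (-alpha)
            a, b = damp * rng.standard_normal(2)
            u += a * np.cos(2 * np.pi * (p * X + q * Y)) \
               + b * np.sin(2 * np.pi * (p * X + q * Y))
    kappa = np.exp(u)                                  # permeability, as in 5.3.1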
5.4 Image Registration

In this subsection we consider the image registration problem from Section 2.4. Our primary purpose is to illustrate the idea that, in the function space setting, it is possible to extend the prior modelling to include an unknown observational precision, and to use conjugate Gamma priors for this parameter.

5.4.1 Target distribution. We study the setup from Section 2.4, with data generated from a noisily observed truth u = (p, η) which corresponds to a smooth closed curve. We make J noisy observations of the curve where, as will be seen below, we consider the cases J = 10, 20, 50, 100, 200, 500 and 1000. The noise used to generate the data is an uncorrelated mean zero Gaussian at each location with variance σ_true² = 0.01. We will study the case where the noise variance σ² is itself considered unknown, introducing a prior on τ = σ^{−2}. We then use MCMC to study the posterior distribution on (u, τ), and hence on (u, σ²).

5.4.2 Prior. The priors on the initial momentum and the reparameterisation are taken as

(5.4) μ_p(p) = N(0, δ_1 H^{−α_1}),  μ_ν(ν) = N(0, δ_2 H^{−α_2}),

where α_1 = 0.55, α_2 = 1.55, δ_1 = 30 and δ_2 = 5 · 10^{−2}. Here H = (I − Δ) denotes the Helmholtz operator in one dimension and, hence, the chosen values of α_i ensure that the eigenvalues of the prior covariance operators are summable. As a consequence, draws from the prior are continuous, almost surely. The prior for τ is defined as

(5.5) μ_τ = Gamma(α_σ, β_σ),

noting that this leads to a conjugate posterior on this variable, since the observational noise is Gaussian. In the numerics that follow, we set α_σ = β_σ = 0.0001.
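Indeed, with Gaussian observational noise and τ ∼ Gamma(α_σ, β_σ) in the shape/rate convention, standard conjugacy gives τ | u, y ∼ Gamma(α_σ + J/2, β_σ + ½‖y − G(u)‖²) for J observations. The sketch below implements this update with a linear stand-in for the forward map G; the dimensions and stand-in map are illustrative, with the true noise level chosen so that σ_true² = 0.01, matching the experiments.

    import numpy as np

    rng = np.random.default_rng(6)
    a_sigma = b_sigma = 1e-4                 # hyperparameters of (5.5)
    J, dim = 100, 20
    G = rng.standard_normal((J, dim)) / dim  # illustrative linear forward map
    u = rng.standard_normal(dim)
    y = G @ u + 0.1 * rng.standard_normal(J) # data with sigma_true^2 = 0.01

    def sample_tau(u, y):
        # Conjugate update: tau | u, y ~ Gamma(a + J/2, b + ||y - G(u)||^2 / 2)
        resid = y - G @ u
        shape = a_sigma + 0.5 * y.size
        rate = b_sigma + 0.5 * np.sum(resid**2)
        return rng.gamma(shape, 1.0 / rate)  # numpy's gamma takes scale = 1/rate

    tau = sample_tau(u, y)
    print(1.0 / tau)                         # one posterior draw of sigma^2, near 0.01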

Fig. 9. Convergence of the lowest wave number Fourier modes in (a) the initial momentum P_0 and (b) the reparameterisation function ν, as the number of observations is increased, using pCN.

5.4.3 Numerical implementation. In each experiment the data is produced using the same template shape Γ_db, with parameterization given by

(5.6) q_db(s) = (cos(s) + π, sin(s) + π),  s ∈ [0, 2π).

In the following numerics, the observed shape is chosen by first sampling an instance of p and ν from their respective prior distributions and using the numerical approximation of the forward model to give us the parameterization of the target shape. The N observational points {s_i}_{i=1}^N are then picked by spacing them evenly over the interval [0, 1), so that s_i = (i − 1)/N.

5.4.4 Finding the observational noise hyperparameter. We implement an MCMC method to sample from the joint distribution of (u, τ), where (recall) τ = σ^{−2} is the observational precision, the inverse of the noise variance. When sampling u we employ the pCN method. In this context it is possible either to: (i) implement a Metropolis-within-Gibbs sampler, alternating between use of pCN to sample u | τ and explicit sampling from the Gamma distribution for τ | u; or (ii) marginalize out τ and sample directly from the marginal distribution for u, generating samples from τ separately. We adopt the second approach.

We show that, by taking data sets with an increasing number of observations N, the true values of the function u and the precision parameter τ can both be recovered: a form of posterior consistency. This is demonstrated in Figure 9, for the posterior distribution on a low wave number Fourier coefficient in the expansion of the initial momentum p and the reparameterisation η. Figure 10 shows the posterior distribution on the value of the observational variance σ²; recall that the true value is 0.01. The posterior distribution becomes increasingly peaked close to this value as N increases.

Fig. 10. Convergence of the posterior distribution on the value of the noise variance σ², as the number of observations N is increased, sampled using pCN.

5.5 Conditioned Diffusions

Numerical experiments which employ function space samplers to study problems arising in conditioned diffusions have been published in a number of articles. The paper [4] introduced the idea of function space samplers in this context and demonstrated the advantage of the CNL method (4.12) over the standard Langevin algorithm for bridge diffusions; in the notation of that paper, the IA method with θ = 1/2 is our CNL method. Figures analogous to Figure 1(a) and (b) are shown there. The article [19] demonstrates the effectiveness of the CNL method for smoothing problems arising in signal processing, and figures analogous to Figure 1(a) and (b) are again shown. The paper [5] contains numerical experiments comparing the function-space HMC method from Section 4.8 with the CNL variant of the MALA method from Section 4.3, for a bridge diffusion problem; the function-space HMC method is superior in that context, demonstrating the power of methods which break the random-walk type behaviour of local proposals.

6. THEORETICAL ANALYSIS

The numerical experiments in this paper demonstrate that the function-space algorithms of Crank–Nicolson type behave well on a range of nontrivial examples. In this section we describe some theoretical analysis which adds weight to the choice of Crank–Nicolson discretizations underlying these algorithms. We also show that the acceptance probability resulting from these proposals behaves as in finite dimensions: in particular, that it is continuous as the scale factor δ for the proposal variance tends to zero. Finally, we briefly summarize the theory available in the literature which relates to the function-space viewpoint that we highlight in this paper. We assume throughout that Φ satisfies the following assumptions:

Assumptions 6.1. The function Φ : X → R satisfies the following:

1. there exist p > 0, K > 0 such that, for all u ∈ X,
   0 ≤ Φ(u; y) ≤ K(1 + ‖u‖^p);
2. for every r > 0 there is a K(r) > 0 such that, for all u, v ∈ X with max{‖u‖, ‖v‖} < r,
   |Φ(u) − Φ(v)| ≤ K(r)‖u − v‖.

These assumptions arise naturally in many Bayesian inverse problems where the data is finite dimensional [49]. Both data assimilation inverse problems from Section 2.2 are shown to satisfy Assumptions 6.1, for appropriate choice of X, in [11] (Navier–Stokes) and [49] (Stokes). The groundwater flow inverse problem from Section 2.3 is shown to satisfy these assumptions in [13], again for appropriate choice of X. It is shown in [9] that Assumptions 6.1 are satisfied for the image registration problem of Section 2.4, again for appropriate choice of X. A wide range of conditioned diffusions satisfy Assumptions 6.1; see [20]. The density estimation problem from Section 2.1 satisfies the second item of Assumptions 6.1, but not the first.

6.1 Why the Crank–Nicolson Choice?

In order to explain this choice, we consider a one-parameter (θ) family of discretizations of equation (4.2), which reduces to the discretization (4.4) when θ = 1/2.
This family is

(6.1) v = u − δK((1 − θ)Lu + θLv) − δγK DΦ(u) + √(2δK) ξ_0,

where ξ_0 ∼ N(0, I) is a white noise on X. Note that w := √C ξ_0 has covariance operator C and is hence a draw from μ_0. Recall that if u is the current state of the Markov chain, then v is the proposal. For simplicity we consider only Crank–Nicolson proposals and not the MALA variants, so that γ = 0. However, the analysis generalises to the Langevin proposals in a straightforward fashion.

Rearranging (6.1) with γ = 0, we see that the proposal v satisfies

(6.2) v = (I + δθKL)^{−1}((I − δ(1 − θ)KL)u + √(2δK) ξ_0).

If K = I, then the operator applied to u is bounded on X for any θ ∈ (0, 1]. If K = C, it is bounded for θ ∈ [0, 1]. The white noise term is almost surely in X for K = I, θ ∈ (0, 1] and for K = C, θ ∈ [0, 1]. The Crank–Nicolson proposal (4.5) is found by letting K = I and θ = 1/2. The preconditioned Crank–Nicolson proposal (4.7) is found by setting K = C and θ = 1/2. The following theorem explains the choice θ = 1/2.

Theorem 6.2. Let μ_0(X) = 1, let Φ satisfy Assumption 6.1(2) and assume that μ and μ_0 are equivalent as measures with the Radon–Nikodym derivative (1.1). Consider the proposal v | u ∼ q(u, ·) defined by (6.2) and the resulting measure η(du, dv) = q(u, dv)μ(du) on X × X. For both K = I and K = C the measure η^⊥ = q(v, du)μ(dv) is equivalent to η if and only if θ = 1/2. Furthermore, if θ = 1/2, then

dη^⊥/dη (u, v) = exp(Φ(u) − Φ(v)).

By use of the analysis of Metropolis–Hastings methods on general state spaces in [51], this theorem shows that the Crank–Nicolson proposal (6.2) leads to a well-defined MCMC algorithm in the function-space setting if and only if θ = 1/2. Note, relatedly, that the choice θ = 1/2 has the desirable property that u ∼ N(0, C) implies that v ∼ N(0, C): thus, the prior measure is preserved under the proposal. This mimics the behaviour of the SDE (4.2), for which the prior is an invariant measure. We have thus justified the proposals (4.5) and (4.7) on function space.
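For K = C and θ = 1/2, (6.2) reduces to v = ((2 − δ)u + √(8δ) w)/(2 + δ) with w ∼ μ_0; writing β = √(8δ)/(2 + δ) ∈ (0, 1], this is v = √(1 − β²) u + βw (note that δ = 2 gives β = 1, the independence sampler of Section 5.1.3), and the acceptance probability is min{1, exp(Φ(u) − Φ(v))}. The sketch below implements this pCN algorithm on a truncated Karhunen–Loève basis; the covariance decay, truncation and potential Φ are illustrative choices.

    import numpy as np

    rng = np.random.default_rng(7)
    d = 200                                        # truncated KL dimension
    sd = np.arange(1, d + 1, dtype=float)**-1.0    # sqrt of the eigenvalues of C

    def sample_prior():
        return sd * rng.standard_normal(d)         # w ~ N(0, C) = mu_0

    def phi(u):
        # Illustrative potential; any Phi satisfying Assumptions 6.1 will do.
        return 0.5 * np.sum((u - sd)**2)

    beta, n_steps = 0.2, 10_000
    u, accepts = sample_prior(), 0
    for _ in range(n_steps):
        v = np.sqrt(1.0 - beta**2) * u + beta * sample_prior()   # pCN proposal
        if np.log(rng.uniform()) < phi(u) - phi(v):
            u, accepts = v, accepts + 1
    print(accepts / n_steps)
    # If Phi were constant the chain would accept every proposal: the proposal
    # preserves mu_0 exactly, which is the content of Theorem 6.2.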
To complete our analysis, it remains to rule out the standard random walk proposal (4.3).

Theorem 6.3. Consider the proposal v | u ∼ q(u, ·) defined by (4.3) and the resulting measure η(du, dv) = q(u, dv)μ(du) on X × X. For both K = I and K = C the measure η^⊥ = q(v, du)μ(dv) is not absolutely continuous with respect to η. Thus, the MCMC method is not defined on function space.

6.2 The Acceptance Probability

We now study the properties of the two Crank–Nicolson methods with proposals (4.5) and (4.7) in the limit δ → 0, showing that finite-dimensional intuition carries over to this function space setting. We define

R(u, v) = Φ(u) − Φ(v)

and note from (4.11) that, for both of the Crank–Nicolson proposals,

a(u, v) = min{1, exp(R(u, v))}.

Theorem 6.4. Let μ_0 be a Gaussian measure on a Hilbert space (X, ‖·‖) with μ_0(X) = 1 and let μ be an equivalent measure on X given by the Radon–Nikodym derivative (1.1), satisfying Assumptions 6.1(1) and 6.1(2). Then both the pCN and CN algorithms with fixed δ are defined on X and, furthermore, the acceptance probability satisfies

lim_{δ→0} E^η a(u, v) = 1.

6.3 Scaling Limits and Spectral Gaps

There are two basic theories which have been developed to explain the advantage of using the algorithms introduced here, which are based on the function-space viewpoint. The first is to prove scaling limits of the algorithms, and the second is to establish spectral gaps. The use of scaling limits was pioneered for local-proposal Metropolis algorithms in the papers [40–42], and recently extended to the hybrid Monte Carlo method [6]. All of this work concerned i.i.d. target distributions, but recently it has been shown that the basic conclusions of the theory, relating to optimal scaling of proposal variance with dimension and optimal acceptance probability, can be extended to target measures of the form (1.1), which are central to this paper; see [29, 36]. These results show that the standard MCMC methods must be scaled with a proposal variance (or time-step, in the case of HMC) which is inversely proportional to a power of d_u, the discretization dimension, and that the number of steps required therefore grows under mesh refinement. The papers [5, 37] demonstrate that judicious modifications of these standard algorithms, as described in this paper, lead to scaling limits without the need for scalings of proposal variance or time-step which depend on dimension. These results indicate that the number of steps required is stable under mesh refinement for the new methods, as demonstrated numerically in this paper. The second approach, namely, the use of spectral gaps, offers the opportunity to further substantiate these ideas: in [21] it is shown that the pCN method has a dimension-independent spectral gap, whilst a standard random walk which closely resembles it has a spectral gap which shrinks with dimension. This method of analysis, via spectral gaps, will be useful for the analysis of many other MCMC algorithms arising in high dimensions.

7. CONCLUSIONS

We have demonstrated the following points:

• A wide range of applications lead naturally to problems defined via a density with respect to a Gaussian random field reference measure, or variants on this structure.
• Designing MCMC methods on function space, and then discretizing the nonparametric problem, produces better insight into algorithm design than discretizing the nonparametric problem and then applying standard MCMC methods.
• The transferable idea underlying all the methods is that, in the purely Gaussian case when only the reference measure is sampled, the resulting MCMC method should accept with probability one; such methods may be identified by time-discretization of certain stochastic dynamical systems which preserve the Gaussian reference measure.
• Using this methodology, we have highlighted new random walk, Langevin and Hybrid Monte Carlo Metropolis-type methods, appropriate for problems where the posterior distribution has density with respect to a Gaussian prior, all of which can be implemented by means of small modifications of existing codes.

• We have applied these MCMC methods to a range of problems, demonstrating their efficacy in comparison with standard methods, and shown their flexibility with respect to the incorporation of standard ideas from MCMC technology, such as Gibbs sampling and estimation of noise precision through conjugate Gamma priors.
• We have pointed to the emerging body of theoretical literature which substantiates the desirable properties of the algorithms we have highlighted here.

The ubiquity of Gaussian priors means that the technology described in this article is of immediate applicability to a wide range of applications. The generality of the philosophy that underlies our approach also suggests the possibility of numerous further developments. In particular, many existing algorithms can be modified to the function space setting that is shown to be so desirable here, when Gaussian priors underlie the desired target; and many similar ideas can be expected to emerge for the study of problems with non-Gaussian priors, such as arise in wavelet based nonparametric estimation.

ACKNOWLEDGEMENTS

S. L. Cotter is supported by EPSRC, ERC (FP7/2007-2013 and Grant 239870) and St. Cross College. G. O. Roberts is supported by EPSRC (especially the CRiSM grant). A. M. Stuart is grateful to EPSRC, ERC and ONR for the financial support of research which underpins this article. D. White is supported by ERC.

REFERENCES

[1] Adams, R. P., Murray, I. and MacKay, D. J. C. (2009). The Gaussian process density sampler. In Advances in Neural Information Processing Systems 21.
[2] Adler, R. J. (2010). The Geometry of Random Fields. SIAM, Philadelphia, PA.
[3] Bennett, A. F. (2002). Inverse Modeling of the Ocean and Atmosphere. Cambridge Univ. Press, Cambridge. MR1920432
[4] Beskos, A., Roberts, G., Stuart, A. and Voss, J. (2008). MCMC methods for diffusion bridges. Stoch. Dyn. 8 319–350. MR2444507
[5] Beskos, A., Pinski, F. J., Sanz-Serna, J. M. and Stuart, A. M. (2011). Hybrid Monte Carlo on Hilbert spaces. Stochastic Process. Appl. 121 2201–2230. MR2822774
[6] Beskos, A., Pillai, N. S., Roberts, G. O., Sanz-Serna, J. M. and Stuart, A. M. (2013). Optimal tuning of hybrid Monte Carlo. Bernoulli. To appear. Available at http://arxiv.org/abs/1001.4460.
[7] Cotter, C. J. (2008). The variational particle-mesh method for matching curves. J. Phys. A 41 344003, 18. MR2456340
[8] Cotter, S. L. (2010). Applications of MCMC methods on function spaces. Ph.D. thesis, Univ. Warwick.
[9] Cotter, C. J., Cotter, S. L. and Vialard, F. X. (2013). Bayesian data assimilation in shape registration. Inverse Problems 29 045011.
[10] Cotter, S. L., Dashti, M. and Stuart, A. M. (2012). Variational data assimilation using targetted random walks. Internat. J. Numer. Methods Fluids 68 403–421. MR2880204
[11] Cotter, S. L., Dashti, M., Robinson, J. C. and Stuart, A. M. (2009). Bayesian inverse problems for functions and applications to fluid mechanics. Inverse Problems 25 115008, 43. MR2558668
[12] Da Prato, G. and Zabczyk, J. (1992). Stochastic Equations in Infinite Dimensions. Encyclopedia of Mathematics and Its Applications 44. Cambridge Univ. Press, Cambridge. MR1207136
[13] Dashti, M., Harris, S. and Stuart, A. (2012). Besov priors for Bayesian inverse problems. Inverse Probl. Imaging 6 183–200. MR2942737
[14] Diaconis, P. (1988). Bayesian numerical analysis. In Statistical Decision Theory and Related Topics IV, Vol. 1 (West Lafayette, IN, 1986) 163–175. Springer, New York. MR0927099
[15] Duane, S., Kennedy, A. D., Pendleton, B. and Roweth, D. (1987). Hybrid Monte Carlo. Phys. Lett. B 195 216–222.
[16] Girolami, M. and Calderhead, B. (2011). Riemann manifold Langevin and Hamiltonian Monte Carlo methods (with discussion). J. R. Stat. Soc. Ser. B Stat. Methodol. 73 123–214. MR2814492
[17] Glaunes, J., Trouvé, A. and Younes, L. (2004). Diffeomorphic matching of distributions: A new approach for unlabelled point-sets and sub-manifolds matching. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition 2 712–718. IEEE.
[18] Hairer, M., Stuart, A. M. and Voss, J. (2007). Analysis of SPDEs arising in path sampling. II. The nonlinear case. Ann. Appl. Probab. 17 1657–1706. MR2358638
[19] Hairer, M., Stuart, A. and Voß, J. (2009). Sampling conditioned diffusions. In Trends in Stochastic Analysis. London Mathematical Society Lecture Note Series 353 159–185. Cambridge Univ. Press, Cambridge. MR2562154
[20] Hairer, M., Stuart, A. and Voss, J. (2011). Signal processing problems on function space: Bayesian formulation, stochastic PDEs and effective MCMC methods. In The Oxford Handbook of Nonlinear Filtering (D. Crisan and B. Rozovsky, eds.) 833–873. Oxford Univ. Press, Oxford. MR2884617

[21] Hairer, M., Stuart, A. M. and Vollmer, S. (2013). Spectral gaps for a Metropolis–Hastings algorithm in infinite dimensions. Available at http://arxiv.org/abs/1112.1392.
[22] Hairer, M., Stuart, A. M., Voss, J. and Wiberg, P. (2005). Analysis of SPDEs arising in path sampling. I. The Gaussian case. Commun. Math. Sci. 3 587–603. MR2188686
[23] Hills, S. E. and Smith, A. F. M. (1992). Parameterization issues in Bayesian inference. In Bayesian Statistics 4 (Peñíscola, 1991) 227–246. Oxford Univ. Press, New York. MR1380279
[24] Hjort, N. L., Holmes, C., Müller, P. and Walker, S. G., eds. (2010). Bayesian Nonparametrics. Cambridge Series in Statistical and Probabilistic Mathematics 28. Cambridge Univ. Press, Cambridge. MR2722987
[25] Iserles, A. (2004). A First Course in the Numerical Analysis of Differential Equations. Cambridge Univ. Press, Cambridge.
[26] Kalnay, E. (2003). Atmospheric Modeling, Data Assimilation and Predictability. Cambridge Univ. Press, Cambridge.
[27] Lemm, J. C. (2003). Bayesian Field Theory. Johns Hopkins Univ. Press, Baltimore, MD.
[28] Liu, J. S. (2001). Monte Carlo Strategies in Scientific Computing. Springer, New York. MR1842342
[29] Mattingly, J. C., Pillai, N. S. and Stuart, A. M. (2012). Diffusion limits of the random walk Metropolis algorithm in high dimensions. Ann. Appl. Probab. 22 881–930. MR2977981
[30] McLaughlin, D. and Townley, L. R. (1996). A reassessment of the groundwater inverse problem. Water Resour. Res. 32 1131–1161.
[31] Miller, M. T. and Younes, L. (2001). Group actions, homeomorphisms, and matching: A general framework. Int. J. Comput. Vis. 41 61–84.
[32] Neal, R. M. (1996). Bayesian Learning for Neural Networks. Springer, New York.
[33] Neal, R. M. (1998). Regression and classification using Gaussian process priors. Available at http://www.cs.toronto.edu/~radford/valencia.abstract.html.
[34] Neal, R. M. (2011). MCMC using Hamiltonian dynamics. In Handbook of Markov Chain Monte Carlo 113–162. CRC Press, Boca Raton, FL. MR2858447
[35] O'Hagan, A., Kennedy, M. C. and Oakley, J. E. (1999). Uncertainty analysis and other inference tools for complex computer codes. In Bayesian Statistics 6 (Alcoceber, 1998) 503–524. Oxford Univ. Press, New York. MR1724872
[36] Pillai, N. S., Stuart, A. M. and Thiéry, A. H. (2012). Optimal scaling and diffusion limits for the Langevin algorithm in high dimensions. Ann. Appl. Probab. 22 2320–2356. MR3024970
[37] Pillai, N. S., Stuart, A. M. and Thiéry, A. H. (2012). On the random walk Metropolis algorithm for Gaussian random field priors and gradient flow. Available at http://arxiv.org/abs/1108.1494.
[38] Richtmyer, D. and Morton, K. W. (1967). Difference Methods for Initial Value Problems. Wiley, New York.
[39] Robert, C. P. and Casella, G. (1999). Monte Carlo Statistical Methods. Springer, New York. MR1707311
[40] Roberts, G. O., Gelman, A. and Gilks, W. R. (1997). Weak convergence and optimal scaling of random walk Metropolis algorithms. Ann. Appl. Probab. 7 110–120. MR1428751
[41] Roberts, G. O. and Rosenthal, J. S. (1998). Optimal scaling of discrete approximations to Langevin diffusions. J. R. Stat. Soc. Ser. B Stat. Methodol. 60 255–268. MR1625691
[42] Roberts, G. O. and Rosenthal, J. S. (2001). Optimal scaling for various Metropolis–Hastings algorithms. Statist. Sci. 16 351–367. MR1888450
[43] Roberts, G. O. and Stramer, O. (2001). On inference for partially observed nonlinear diffusion models using the Metropolis–Hastings algorithm. Biometrika 88 603–621. MR1859397
[44] Roberts, G. O. and Tweedie, R. L. (1996). Exponential convergence of Langevin distributions and their discrete approximations. Bernoulli 2 341–363. MR1440273
[45] Rue, H. and Held, L. (2005). Gaussian Markov Random Fields: Theory and Applications. Monographs on Statistics and Applied Probability 104. Chapman & Hall/CRC, Boca Raton, FL. MR2130347
[46] Smith, A. F. M. and Roberts, G. O. (1993). Bayesian computation via the Gibbs sampler and related Markov chain Monte Carlo methods. J. R. Stat. Soc. Ser. B Stat. Methodol. 55 3–23. MR1210421
[47] Sokal, A. D. (1989). Monte Carlo methods in statistical mechanics: Foundations and new algorithms. Troisième Cycle de la Physique en Suisse Romande, Univ. Lausanne.
[48] Stein, M. L. (1999). Interpolation of Spatial Data: Some Theory for Kriging. Springer, New York. MR1697409
[49] Stuart, A. M. (2010). Inverse problems: A Bayesian perspective. Acta Numer. 19 451–559. MR2652785
[50] Stuart, A. M., Voss, J. and Wiberg, P. (2004). Conditional path sampling of SDEs and the Langevin MCMC method. Commun. Math. Sci. 2 685–697. MR2119934
[51] Tierney, L. (1998). A note on Metropolis–Hastings kernels for general state spaces. Ann. Appl. Probab. 8 1–9. MR1620401
[52] Vaillant, M. and Glaunes, J. (2005). Surface matching via currents. In Information Processing in Medical Imaging 381–392. Springer, Berlin.
[53] van der Meulen, F., Schauer, M. and van Zanten, H. (2013). Reversible jump MCMC for nonparametric drift estimation for diffusion processes. Comput. Statist. Data Anal. To appear.
[54] Zhao, L. H. (2000). Bayesian aspects of some nonparametric problems. Ann. Statist. 28 532–552. MR1790008