<<

Stat Comput DOI 10.1007/s11222-016-9671-0

Statistical analysis of differential : introducing probability measures on numerical solutions

Patrick R. Conrad1 · Mark Girolami1,2 · Simo Särkkä3 · Andrew Stuart4 · Konstantinos Zygalakis5

Received: 5 December 2015 / Accepted: 5 May 2016 © The Author(s) 2016. This article is published with open access at Springerlink.com

Abstract In this paper, we present a formal quantification tainty in a statistical model and its subsequent analysis. of uncertainty induced by numerical solutions of ordinary We show that a wide variety of existing solvers can be and partial differential models. Numerical solu- randomised, inducing a probability over the solu- tions of differential equations contain inherent uncertainties tions of such differential equations. These measures exhibit due to the finite-dimensional approximation of an unknown contraction to a Dirac measure around the true unknown solu- and implicitly defined . When statistically analysing tion, where the rates of convergence are consistent with the models based on differential equations describing physical, underlying deterministic numerical method. Furthermore, or other naturally occurring, phenomena, it can be impor- we employ the method of modified equations to demonstrate tant to explicitly account for the uncertainty introduced by enhanced rates of convergence to perturbations the numerical method. Doing so enables objective deter- of the original deterministic problem. Ordinary differen- mination of this source of uncertainty, relative to other tial equations and elliptic partial differential equations are uncertainties, such as those caused by data contaminated used to illustrate the approach to quantify uncertainty in with noise or model error induced by missing physical or both the statistical analysis of the forward and inverse inadequate descriptors. As ever larger scale mathematical problems. models are being used in the sciences, often sacrificing com- plete resolution of the differential equation on the grids Keywords · Probabilistic numerics · used, formally accounting for the uncertainty in the numer- Inverse problems · Uncertainty quantification ical method is becoming increasingly more important. This paper provides the formal means to incorporate this uncer- Mathematics Subject Classification 62F15 · 65N75 · 65L20

Electronic supplementary material The online version of this article (doi:10.1007/s11222-016-9671-0) contains supplementary material, which is available to authorized users. 1 Introduction B Patrick R. Conrad 1.1 Motivation [email protected]

1 Department of , University of Warwick, Coventry, The numerical analysis literature has developed a large range UK of efficient for solving ordinary and partial dif- 2 Present Address: Alan Turing Institute, London, UK ferential equations, which are typically designed to solve 3 Department of Electrical and , Aalto a single problem as efficiently as possible (Hairer et al. University, Espoo, Finland 1993; Eriksson 1996). When classical numerical methods 4 Department of Mathematics, University of Warwick, are placed within statistical analysis, however, we argue that Coventry, UK significant difficulties can arise as a result of errors in the 5 School of Mathematics, University of Edinburgh, Edinburgh, computed approximate solutions. While the distributions of Scotland interest commonly do converge asymptotically as the solver 123 Stat Comput mesh becomes dense [e.g. in statistical inverse problems the solver becomes completely inaccurate by the end of the (Dashti and Stuart 2016)], we argue that at a finite resolu- depicted interval given the step-size selected, the solver pro- tion, the statistical analyses may be vastly overconfident as vides no obvious characterisation of its error at late times. a result of these unmodelled errors. Compare this with a sample of randomised solutions based The purpose of this paper is to address these issues by on the same integrator and the same step-size; it is obvi- the construction and rigorous analysis of novel probabilistic ous that early-time solutions are accurate and that they integration methods for both ordinary and partial differential diverge at late times, reflecting instability of the solver. equations. The approach in both cases is similar: we iden- Every curve drawn has the same theoretical accuracy as tify the key discretisation assumptions and introduce a local the original classical method, but the randomised integra- random field, in particular a Gaussian field, to reflect our tor provides a detailed and practical approach for revealing uncertainty in those assumptions. The probabilistic solver the sensitivity of the solution to numerical errors. The may then be sampled repeatedly to interrogate the uncer- method used requires only a straightforward modification tainty in the solution. For a wide variety of commonly used of the standard Runge–Kutta integrator and is explained in numerical methods, our construction is straightforward to Sect. 2.3. apply and provably preserves the order of convergence of the We summarise the contributions of this work as follows: original method. Furthermore, we demonstrate the value of these prob- – Construct randomised solvers of ODEs and PDEs using abilistic solvers in statistical inference settings. Analytic natural modification of popular, existing solvers. and numerical examples show that using a classical non- – Prove the convergence of the randomised methods and probabilistic solver with inadequate discretisation when study their behaviour by showing a close link between performing inference can lead to inappropriate and mislead- randomised ODE solvers and stochastic differential ing posterior concentration in a Bayesian setting. In contrast, equations (SDEs). the probabilistic solver reveals the structure of uncertainty – Demonstrate that these randomised solvers can be used in the solution, naturally limiting posterior concentration as to perform statistical analyses that appropriately consider appropriate. solver uncertainty. As a motivating example, consider the solution of the Lorenz’63 system. Since the problem is chaotic, any typi- cal fixed-step numerical methods will become increasingly 1.2 Review of existing work inaccurate for long integration times. Figure 1 depicts a deterministic solution for this problem, computed with a The statistical analysis of models based on ordinary and fixed-step, fourth-order, Runge–Kutta integrator. Although partial differential equations is growing in importance and

Fig. 1 A comparison of 20 solutions to the Lorenz’63 15 10 system using deterministic (red) 5 and randomised (blue) x 0 integrators based on a −5 −10 fourth-order Runge–Kutta −15 integrator −20 30 20 10

y 0 −10 −20 −30 50 40 30 z 20 10 0 0 5 10 15 20 t

123 Stat Comput a number of recent papers in the statistics literature have late insights from our study of the restricted problem to the sought to address certain aspects specific to such mod- more general case. Existing strategies for discretisation error els, e.g. parameter estimation (Liang and Wu 2008; Xue include empirically fitted Gaussian models for PDE errors et al. 2010; Xun et al. 2013; Brunel et al. 2014) and sur- (Kaipio and Somersalo 2007) and randomly perturbed ODEs rogate construction (Chakraborty et al. 2013). However, (Arnold et al. 2013); the latter partially coincides with our the statistical implications of the reliance on a numeri- construction, but our motivation and analysis are distinct. cal approximation to the actual solution of the differential Recent work (Capistrán et al. 2013) uses Bayes factors to equation have not been addressed in the statistics litera- analyse the impact of discretisation error on posterior approx- ture to date and this is the open problem comprehensively imation quality. Probabilistic models have also been used to addressed in this paper. Earlier work in the literature includ- study error propagation due to rounding error; see Hairer ing randomisation in the approximate integration of ordi- et al. (2008). nary differential equations (ODEs) includes (Coulibaly and Lécot 1999; Stengle 1995). Our strategy fits within the 1.3 Organisation emerging field known as Probabilistic Numerics (Hennig et al. 2015), a perspective on computational methods pio- The remainder of the paper has the following structure: neered by Diaconis (1988), and subsequently (Skilling 1992). Sect. 2 introduces and formally analyses the proposed proba- This framework recasts solving differential equations as a bilistic solvers for ODEs. Section 3 explores the characteris- statistical inference problem, yielding a probability mea- tics of random solvers employed in the statistical analysis of sure over functions that satisfy the constraints imposed by both forward and inverse problems. Then, we turn to elliptic the specific differential equation. This measure formally PDEs in Sect. 4, where several key steps of the construction quantifies the uncertainty in candidate solution(s) of the of probabilistic solvers and their analysis have intuitive ana- differential equation, allowing its use in uncertainty quan- logues in the ODE context. Finally, an illustrative example tification (Sullivan 2016) or Bayesian inverse problems of an elliptic PDE inference problem is presented in Sect.5.1 (Dashti and Stuart 2016). A recent Probabilistic Numerics methodology for ODEs (Chkrebtii et al. 2013) [explored in parallel in Hennig and 2 Probability measures via probabilistic time Hauberg (2014)] has two important shortcomings. First, it integrators is impractical, only supporting first-order accurate schemes with a rapidly growing computational cost caused by the Consider the following ordinary differential equation (ODE): growing difference stencil [although Schober et al. (2014) extends to Runge–Kutta methods]. Secondly, this method du = f (u), u(0) = u , (1) does not clearly articulate the relationship between their dt 0 probabilistic structure and the problem being solved. These methods construct a Gaussian process whose mean coin- where u(·) is a taking values in Rn.2 cides with an existing deterministic integrator. While they We let Φt denote the flow map for Eq. (1), so that u(t) = claim that the posterior variance is useful, by the con- Φt u(0) . The conditions ensuring that this solution exists jugacy inherent in linear Gaussian models, it is actually will be formalised in Assumption 2, below. just an aprioriestimate of the of Deterministic numerical methods for the integration of the integrator, independent of the actual forcing or ini- this equation on time interval [0, T ] will produce an approx- tial condition of the problem being solved. These works { = }K imation to the equation on a mesh of points tk kh k=0, also describe a procedure for randomising the construc- with Kh = T , (for simplicity we assume a fixed mesh). Let tion of the mean process, which bears similarity to our uk = u(tk) denote the exact solution of (1) on the mesh and approach, but it is not formally studied. In contrast, we Uk ≈ uk denote the approximation computed using finite formally link each draw from our measure to the analytic evaluations of f . Typically, these methods output a single solution. { }K discrete solution Uk k=0, often augmented with some type Our motivation for enhancing inference problems with of error indicator, but do not statistically quantify the uncer- models of discretisation error is similar to the more gen- tainty remaining in the path. eral concept of model error, as developed by Kennedy and O’Hagan (2001). Although more general types of model 1 error, including uncertainty in the underlying , are Supplementary materials and code are available online: http://www2. warwick.ac.uk/pints. important in many applications, our focus on errors aris- 2 To simplify our discussion we assume that the ODE is autonomous, ing from the discretisation of differential equations leads to that is, f (u) is independent of time. Analogous theory can be developed more specialised methods. Future work may be able to trans- for time-dependent forcing. 123 Stat Comput

n Let Xa,b denote the Banach space C([a, b]; R ).The The solutions on the mesh satisfy [ , ] exact solution of (1) on the time interval 0 T maybeviewed  tk+1   as a Dirac measure δu on X0,T at the element u that solves the uk+1 = uk + f u(s) ds, (5) ODE. We will construct a probability measure μh on X , , 0 T tk that is straightforward to sample from both on and off the mesh, for which h quantifies the size of the discretisation and may be interpolated between mesh points by means of step employed, and whose distribution reflects the uncer- the expression tainty resulting from the solution of the ODE. Convergence  of the numerical method is then related to the contraction of t   h u(t) = uk + f u(s) ds, t ∈[tk, tk+1). (6) μ to δu. tk We briefly summarise the construction of the numerical n n method. Let Ψh : R → R denote a classical determin- We may then write h istic one-step numerical integrator over time-step , a class  including all Runge–Kutta methods and Taylor methods for t ( ) = + ( ) , ∈[ , ), ODE (Hairer et al. 1993). Our numer- u t uk g s ds t tk tk+1 (7) tk ical methods will have the property that, on the mesh, they   take the form where g(s) = f u(s) is an unknown function of time. In the algorithmic setting, we have approximate knowledge about Uk+1 = Ψh(Uk) + ξk(h), (2) g(s) through an underlying numerical method. A variety of traditional numerical algorithms may be derived based on ( ) where ξk(h) are suitably scaled, i.i.d. Gaussian random vari- approximation of g s by various simple deterministic func- ables. That is, the random solution iteratively takes the tions gh(s). The simplest such numerical method arises from standard step, Ψh, followed by perturbation with a random invoking the Euler approximation that draw, ξk(h), modelling uncertainty that accumulates between { }K h( ) = ( ), ∈[ , ). mesh points. The discrete path Uk k=0 is straightforward to g s f Uk s tk tk+1 (8) sample and in general is not a Gaussian process. Furthermore, the discrete can be extended into a continuous time In particular, if we take t = tk+1 and apply this method induc- approximation of the ODE, which we define as a draw from tively the corresponding numerical scheme arising from μh the measure . making such an approximation to g(s) in (7)isUk+1 = The remainder of this section develops these solvers in Uk + hf(Uk). Now consider the more general one-step detail and proves strong convergence of the random solu- numerical method Uk+1 = Ψh(Uk). This may be derived h tions to the exact solution, implying that μ → δu in an by approximating g(s) in (7)by appropriate sense. Finally, we establish a close relationship   between our random solver and a stochastic differential equa- h d g (s) = Ψτ (Uk) , s ∈[tk, tk+1). (9) tion (SDE) with small mesh-dependent noise. Intuitively, dτ τ=s−tk adding Gaussian noise to an ODE suggests a link to SDEs. Additionally, note that the mesh-restricted version of our We note that all consistent (in the sense of numerical analysis) , given by (2), has the same structure as a first-order one-step methods will satisfy Ito–Taylor expansion of the SDE   d Ψτ (u) = f (u). τ τ= du = f (u)dt + σdW, (3) d 0 The approach based on the approximation (9) leads to a for some choice of σ. We make this link precise by perform- deterministic numerical method which is defined as a con- ing a backwards error analysis, which connects the behaviour tinuous function of time. Specifically, we have U(s) = of our solver to an associated SDE. Ψ ( ), ∈[ , ). s−tk Uk s tk tk+1 Consider again the Euler approx- imation, for which Ψτ (U) = U + τ f (U), and whose 2.1 Probabilistic time integrators: general formulation continuous time interpolant is then given by U(s) = Uk + (s − tk) f (Uk), s ∈[tk, tk+1). Note that this produces The form of Eq. (1)is a continuous function, namely an element of X0,T , when extended to s ∈[0, T ]. The preceding development of a  t   numerical integrator does not acknowledge the uncertainty ( ) = + ( ) . u t u0 f u s ds (4) that arises from lack of knowledge about g(s) in the interval 0 123 Stat Comput s ∈[tk, tk+1). We propose to approximate g stochastically in order to represent this uncertainty, taking   h d g (s) = Ψτ (Uk) + χk(s − tk), s ∈[tk, tk+1) dτ τ=s−tk where the {χk} form an i.i.d. of Gaussian random h 3 functions defined on [0, h] with χk ∼ N(0, C ). We will choose Ch to shrink to zero with h at a prescribed rate (see Assumption 1), and also to ensure that χ ∈ X , k 0 h (a) almost surely. The functions {χk} represent our uncertainty about the function g. The corresponding numerical scheme arising from such an approximation is given by

Uk+1 = Ψh(Uk) + ξk(h), (10) where the i.i.d. sequence of functions {ξk} lies in X0,h and is given by  t ξk(t) = χk(τ)dτ. (11) 0 (b) Note that the numerical solution is now naturally defined between grid points, via the expression Fig. 2 An illustration of deterministic Euler steps and randomised vari- ations. The random integrator in (b) outputs the path in red; we overlay the standard Euler step constructed at each step, before it is perturbed ( ) = Ψ ( ) + ξ ( − ), ∈[ , ). U s s−tk Uk k s tk s tk tk+1 (12) (blue)

When it is necessary to evaluate a solution at multiple points in an interval, s ∈ (tk, tk+1], the perturbations ξk(s − tk) While we argue that the choice of modelling local uncer- must be drawn jointly, which is facilitated by their Gaussian tainty in the flow map as a Gaussian process is natural structure. Although most users will only need the formulation and analytically favourable, it is not unique. It is possi- on mesh points, we must consider off-mesh behaviour to ble to construct examples where the Gaussian assumption rigorously analyse higher order methods, as is also required is invalid; for example, when a highly inadequate time- for the deterministic variants of these methods. step is used, a systemic bias may be introduced. However, In the case of the , for example, we have in regimes where the underlying deterministic method per- forms well, the centred Gaussian assumption is a reasonable

Uk+1 = Uk + hf(Uk) + ξk(h) (13) prior. and, between grid points, 2.2 Strong convergence result

U(s) = Uk + (s − tk) f (Uk) + ξk(s − tk), s ∈[tk, tk+1). To prove the strong convergence of our probabilistic numeri- (14) cal solver, we first need two assumptions quantifying proper- ties of the random noise and of the underlying deterministic This method is illustrated in Fig. 2. Observe that Eq. (13) integrator, respectively. In what follows we use ·, · and |·| n has the same form as an Euler–Maryama method for an to denote the Euclidean inner product and norm on R .We Rn×n |·| Eh associated SDE (3) where σ depends on the step-size h.In denote the Frobenius norm on by F, and denotes {χ } particular, in the simple one-dimensional case, σ would be expectation with respect to the i.i.d. sequence k . given by Ch/h. Section 2.4 develops a more sophisticated  ξ ( ):= t χ ( ) χ ∼ ( , h). that extends to higher order methods and off the Assumption 1 Let k t 0 k s dswith k N 0 C > , ≥ ∈[ , ] mesh. Then there exists K 0 p 1 such that, for all t 0 h , Eh|ξ ( )ξ ( )T |2 ≤ 2p+1; Eh|ξ ( )|2 ≤ k t k t F Kt in particular k t + 3 h 2p 1. We use χk ∼ N(0, C ) to denote a zero-mean Gaussian process Kt Furthermore, we assume the existence of matrix h h T 2p+1 defined on [0, h] with a covariance kernel cov(χk (t), χk (s))  C (t, s). Q, independent of h, such that E [ξk(h)ξk(h) ]=Qh . 123 Stat Comput

Here, and in the sequel, K is a constant independent of h, where but possibly changing from line to line. Note that the covari-   ance kernel Ch is constrained, but not uniquely defined. We 1 k1(u) = f (u), k2(u, h) = f u + hk1(u) will assume the form of the constant matrix is Q = σ I , and 2     we discuss one possible strategy for choosing σ in Sect.3.1. 1 k3(u, h) = f u + hk2(u) , k4(u, h) = f u + hk3(u) . Section 2.4 uses a weak convergence analysis to argue that 2 once Q is selected, the exact choice of Ch has little practical impact. The method has local truncation error in the form of Assump- tion 2 with q = 4. It may be used as the basis of a probabilistic Assumption 2 The function f and a sufficient number of its numerical method (12), and hence (10) at the grid points. Rn are bounded uniformly in in order to ensure Thus, provided that we choose to perturb this integrator with that f is globally Lipschitz and that the numerical flow map a random process χ satisfying Assumption 1 with p ≥ 4,5 Ψ + k h has uniform local truncation error of order q 1: Theorem 2.2 shows that the error between the probabilistic integrator based on the classical Runge–Kutta method is, in |Ψ ( ) − Φ ( )|≤ q+1. sup t u t u Kt the mean square sense, of the same order of accuracy as the u∈Rn deterministic classical Runge–Kutta integrator. Remark 2.1 We assume globally Lipschitz f , and bounded derivatives, in order to highlight the key probabilistic ideas, 2.4 Backward error analysis whilst simplifying the numerical analysis. Future work will address the non-trivial issue of extending of analyses to Backwards error analyses are useful tool for numerical analy- weaken these assumptions. In this paper, we provide numer- sis; the idea is to characterise the method by identifying a ical results indicating that a weakening of the assumptions is modified equation (dependent upon h) which is solved by indeed possible. the numerical method either exactly, or at least to a higher Theorem 2.2 Under Assumptions 1, 2 it follows that there degree of accuracy than the numerical method solves the is K > 0 such that original equation. For our random ODE solvers, we will show that the modified equation is a stochastic differential equation h 2 2min{p,q} Q sup E |uk − Uk| ≤ Kh . (SDE) in which only the matrix from Assumption 1 enters; 0≤kh≤T the details of the random processes used in our construction do not enter the modified equation. This universality prop- Furthermore, erty underpins the methodology we introduce as it shows that { , } many different choices of random processes all lead to the sup Eh|u(t) − U(t)|≤Khmin p q . 0≤t≤T same effective behaviour of the numerical method. We introduce the operators L and Lh defined so that, for This theorem implies that every probabilistic solution is a all φ ∈ C∞(Rn, R), good approximation of the exact solution in both a discrete         and continuous sense. Choosing p ≥ q is natural if we want L Lh φ Φ (u) = eh φ (u), Eφ U |U = u = eh φ (u). to preserve the strong order of accuracy of the underlying h 1 0 deterministic integrator; we proceed with the choice p = q, (15) introducing the maximum amount of noise consistent with Lh this constraint. Thus L := f ·∇ and eh is the kernel for the Markov chain generated by the probabilistic integrator (2). In fact we 2.3 Examples of probabilistic time integrators never need to work with Lh itself in what follows, only with h ehL , so that questions involving the logarithm do The canonical illustration of a probabilistic time integrator is not need to be discussed. the probabilistic Euler method already described.4 Another We now introduce a modified ODE and a modified SDE useful example is the classical Runge–Kutta method which which will be needed in the analysis that follows. The mod- defines a one-step numerical integrator as follows: ified ODE is

h   ˆ Ψ ( ) = + ( ) + ( , ) + ( , ) + ( , ) , du h h u u k1 u 2k2 u h 2k3 u h k4 u h = f (uˆ) (16) 6 dt

4 An additional example of a probabilistic integrator, based on a Ornstein–Uhlenbeck process, is available in the supplementary materi- 5 Implementing Eq. 10 is trivial, since it simply adds an appropriately als. scaled Gaussian random number after each classical Runge–Kutta step. 123 Stat Comput whilst the modified SDE has the form Finally we note that (20)impliesthat  h L h L h du˜ = f (u˜)dt + h2p Q dW. (17) eh φ(u) − eh φ(u)   L h 1 2p+1 :∇∇ = h 2 h Q − φ( ) The precise choice of f h is detailed below. Letting E denote e e I u   expectation with respect to W, we introduce the operators Lh L h 1 + + = eh h2p 1 Q :∇∇φ(u) + O(h4p 2) and L h so that, for all φ ∈ C∞(Rn, R), 2       L h 1 2p+1 φ uˆ(h)|ˆu(0) = u = eh φ (u), (18) = I + O(h) h Q :∇∇φ(u)     2 hL h Eφ u˜(h)|˜u(0) = 0 = e φ (u). (19) + O(h4p+2) .

Thus, Thus we have 1 L h := f h ·∇, L h = f h ·∇+ h2p Q :∇∇, (20) L h L h 1 + + 2 eh φ(u) − eh φ(u) = h2p 1 Q :∇∇φ(u) + O(h2p 2). × 2 where : denotes the inner product on Rn n which induces the (26) Frobenius norm, that is, A:B = trace(AT B). The fact that the deterministic numerical integrator has Now using (23), (25), and (26) we obtain uniform local truncation error of order q + 1 (Assumption 2) ∞ L h Lh + + + implies that, since φ ∈ C , eh φ(u) − eh φ(u) = O(h2p 2) + O(hq 2 l ). (27)

hL q+1 e φ(u) − φ(Ψh(u)) = O(h ). (21) Balancing these terms, in what follows we make the choice l = 2p − q.Ifl < 0 we adopt the convention that the drift The theory of modified equations for classical one-step f h is simply f. With this choice of q we obtain numerical integration schemes for ODEs (Hairer et al. 1993) h L h Lh + establishes that it is possible to find f in the form eh φ(u) − eh φ(u) = O(h2p 2). (28) q+l This demonstrates that the error between the Markov ker- f h := f + hi f , (22) i nel of one-step of the SDE (17) and the Markov kernel of the i=q numerical method (2)isoforderO(h2p+2). Some straight- such that forward stability considerations show that the weak error over an O(1) time interval is O(h2p+1). We make assumptions hL h q+2+l e φ(u) − φ(Ψh(u)) = O(h ). (23) giving this stability and then state a theorem comparing the weak error with respect to the modified Eq. (17), and the We work with this choice of f h in what follows. original Eq. (1). Now for our stochastic numerical method we have ∞ Assumption 3 The function f is in C and all its deriv- atives are uniformly bounded on Rn. Furthermore, f is φ(U + ) = φ(Ψ (U )) + ξ (h) ·∇φ(Ψ (U )) h k 1 h k k h k such that the operators ehL and ehL satisfy, for all ψ ∈ 1 T 3 ∞ n + ξk(h)ξ (h) :∇∇φ(Ψh(Uk)) + O(|ξk(h)| ). C (R , R) and some L > 0, 2 k | hLψ( )|≤( + ) |ψ( )|, Furthermore, the last term has mean of size O(|ξ (h)|4). sup e u 1 Lh sup u  k  u∈Rn u∈Rn From Assumption 1 we know that Eh ξ (h)ξ T (h) = k k | hLh ψ( )|≤( + ) |ψ( )|. Qh2p+1. Thus sup e u 1 Lh sup u   u∈Rn u∈Rn Lh eh φ(u) − φ Ψ (u) h =   Remark 2.3 If p q in what follows (our recommended 1 2p+1 4p+2 = h Q :∇∇φ Ψh(u) + O(h ). (24) choice) then the weak order of the method coincides with 2 the strong order; however, measured relative to the modified From this it follows that equation, the weak order is then one plus twice the strong   order. In this case, the second part of Theorem 2.2 gives hLh e φ(u) − φ Ψh(u) us the first weak order result in Theorem 2.4. Additionally, Assumption 3 is stronger than we need, but allows us to 1 + + = h2p 1 Q :∇∇φ(u) + O(h2p 2). (25) 2 highlight probabilistic ideas whilst keeping overly technical 123 Stat Comput aspects of the numerical analysis to a minimum. More sophis- This particular example does not satisfy the stringent ticated, but structurally similar, analysis would be required Assumptions 2 and 3 and the numerical results shown demon- for weaker assumptions on f . Similar considerations apply strate that, as indicated in Remarks 2.1 and 2.3, our theory to the assumptions on φ. will extend to weaker assumptions on f , something we will address in future work. Theorem 2.4 Consider the numerical method (10) and assume that Assumptions 1 and 3 are satisfied. Then, for ∞ φ ∈ C function with all derivatives bounded uniformly on 3.1 Calibrating forward uncertainty propagation Rn, we have that   Consider Eq. (29) with fixed initial conditions V (0) = h min{2p,q} |φ(u(T )) − E φ(Uk) |≤Kh , kh = T, −1, R(0) = 1, and parameter values (.2,.2, 3). Figure 3 shows draws of the V species from the measure and associated with the probabilistic Euler solver with p = q =     1, for various values of the step-size and fixed σ = 0.1. |E φ(˜( )) − Eh φ( ) |≤ 2p+1, = , u T Uk Kh kh T The random draws exhibit non-Gaussian structure at large step-size and clearly contract towards the true solution. ˜ where u and u solve (1) and (17), respectively. Although the rate of contraction is governed by the σ Example 2.5 Consider the probabilistic integrator derived underlying deterministic method, the scale parameter, , 6 from the Euler method in dimension n = 1. We thus have completely controls the apparent uncertainty in the solver. σ q = 1, and we hence set p = 1. The results in Hairer et al. This tuning problem exists in general, since is problem (2006) allow us to calculate f h with l = 1. The preceding dependent and cannot obviously be computed analytically. σ theory then leads to strong order of convergence 1, measured Therefore, we propose to calibrate to replicate the relative to the true ODE (1), and weak order 3 relative to the amount of error suggested by classical error indicators. In the SDE following discussion, we often explicitly denote the depen- σ dence on h and with superscripts, hence the probabilistic 2  h,σ h h solver is U and the corresponding deterministic solver is duˆ = f (uˆ) − f (uˆ) f (uˆ) + f (uˆ) f 2(uˆ) U h,0. Define the deterministic error as e(t) = u(t)−U h,0(t). 2 12  √ Then we assume there is some computable error indicator + ( f (uˆ))2 f (uˆ) t + Ch W. 4 d d E(t) ≈ e(t), defining Ek = E(tk). The simplest error indi- cators might compare differing step-sizes, E(t) = U h,0(t)− , These results allow us to constrain the behaviour of the U 2h 0(t), or differing order methods, as in a Runge–Kutta 4– randomised method using limited information about the 5 scheme. covariance structure, Ch. The randomised solution converges We proceed by constructing a probability distribution weakly, at a high rate, to a solution that only depends on Q. π(σ) that is maximised when the desired matching occurs. Hence, we conclude that the practical behaviour of the solu- We estimate this scale matching by comparing: (i) a Gaussian h μ˜ h,σ = tion is only dependent upon Q, and otherwise, C may be approximation of our random solver at each step k, k N (E( h,σ ), V( h,σ )); any convenient kernel. With these results now available, the Uk Uk and (ii) the natural Gaussian mea- following section provides an empirical study of our proba- h,0 sure from the deterministic solver, Uk , and the available bilistic integrators. νσ = N ( h,0,( )2). error indicator, Ek, k Uk Ek We construct π(σ) by penalising the distance between these two normal  ,σ distributions at every step: π(σ) ∝ exp −d(μ˜ h ,νσ ) . 3 Statistical inference and numerics k k k We find that the Bhattacharyya distance (closely related to the This section explores applications of the randomised ODE Hellinger metric) works well (Kailath 1967), since it diverges solvers developed in Sect.2 to forward and inverse prob- quickly if either the mean or variance differs. The density can lems. Throughout this section, we use the FitzHugh–Nagumo be easily estimated using Monte Carlo. If the ODE state is a model to illustrate ideas (Ramsay et al. 2007). This is a two- vector, we take the product of the univariate Bhattacharyya state non-linear oscillator, with states (V, R) and parameters distances. Note that this calibration depends on the initial (a, b, c), governed by the equations conditions and any parameters of the ODE. dV V 3 dR 1 = c V − + R , =− (V − a + bR) . 6 Recall that throughout we assume that, within the context of Assump- dt 3 dt c tion 1, Q = σ I . More generally it is possible to calibrate an arbitrary (29) positive semi-definite Q. 123 Stat Comput

Fig. 3 Thetruetrajectoryof h =0.2 h =0.1 h =0.05 the V species of the 4 FitzHugh–Nagumo model (red) 2 and one hundred realisations 0 from a probabilistic Euler ODE R −2 solver with various step-sizes and noise scale σ = .1(blue) −4 h =0.02 h =0.01 h =0.005 4 2 0 R −2 −4 h =0.002 h =0.001 h =0.5 × 10−3 4 2 0 R −2 −4 0 5 10 15 20 0 5 10 15 20 0 5 10 15 20 t t t

Fig. 4 A comparison of the error indicator for the V species of the FitzHugh–Nagumo model (blue) and the observed variation in the calibrated probabilistic solver. The red curves depict 50 samples of the magnitude of the difference between a standard Euler solver for several step-sizes and the equivalent randomised variant, using σ ∗, maximising π(σ)

Returning to the FitzHugh–Nagumo model, sampling 3.2 Bayesian posterior inference problems from π(σ) yields strongly peaked, uni-modal posteriors, hence we proceed using σ ∗ = arg maxπ(σ). We exam- Given the calibrated probabilistic ODE solvers described ine the quality of the scale matching by plotting the above, let us consider how to incorporate them into infer- magnitudes of the random variation against the error indi- ence problems. cator in Fig. 4, observing good agreement of the mar- Assume we are interested in inferring parameters of the ginal variances. Note that our measure still reveals non- ODE given noisy observations of the state. Specifically, we Gaussian structure and correlations in time not revealed wish to infer parameters θ ∈ Rd for the differential equation by the deterministic analysis. As described, this procedure u˙ = f (u,θ), with fixed initial conditions u(t = 0) = u0 (a requires fixed inputs to the ODE, but it is straightfor- straightforward modification may include inference on initial m ward to marginalise out a prior distribution over input conditions). Assume we are provided with data d ∈ R , d j = parameters. u(τ j ) + η j at some collection of times τ j , corrupted by i.i.d.

123 Stat Comput

η ∼ N ( ,Γ) Q(θ) du − noise, j 0 .Ifwehaveprior , the posterior = h 1 log(1 + λh)u, we wish to explore is, P(θ | d) ∝ Q(θ)L(d, u(θ)), where dt density L compactly summarises this likelihood model. hence error increases with time. However, the assumed obser- The standard computational strategy is to simply replace vational model does not allow for this, as the observation the unavailable trajectory u with a numerical approxi- variance is γ 2 at all times. mation, inducing approximate posterior Ph,0(θ | d) ∝ In contrast, our proposed probabilistic Euler solver pre- Q(θ)L(d, U h,0(θ)). Informally, this approximation will be dicts accurate when the error in the numerical solver is small com- k−1 pared to Γ and often converges formally to P(θ | d) as h → 0 / − − U = (1 + hλ)ku + σ h3 2 ξ (1 + λh)k j 1, (Dashti and Stuart 2016). However, highly correlated errors k 0 j j=0 at finite h can have substantial impact. In this work, we are concerned about the undue optimism where we have made the natural choice p = q, where σ is in the predicted variance, that is, when the posterior concen- the problem-dependent scaling of the noise and the ξk are trates around an arbitrary parameter value even though the i.i.d. N (0, 1). For a single observation, ηk and every ξk are deterministic solver is inaccurate and is merely able to repro- independent, so we may rearrange the equation to consider duce the data by coincidence. The conventional concern is the perturbation as part of the observation operator. Hence, a that any error in the solver will be transferred into posterior single observation at k has effective variance bias. Practitioners commonly alleviate both concerns by tun- k−1 ing the solver to be nearly perfect, however, we note that this ( − − ) γ 2 := γ 2 + σ 2h3 (1 + λh)2 k j 1 may be computationally prohibitive in many contemporary h = statistical applications. j 0 (1 + λh)2k − 1 We can construct a different posterior that includes the = γ 2 + σ 2h3 . uncertainty in the solver by taking an expectation over ran- (1 + λh)2 − 1 dom solutions to the ODE  Thus, late-time observations are modelled as being increas- ,σ ,σ ingly inaccurate. Ph (θ | d) ∝ Q(θ) L(d, U h (θ, ξ))dξ, (30) Consider inferring u0, given a single observation dk at 2 time k. If a Gaussian prior N (m0,ζ ) is specified for u0, ,σ 0 where U h (θ, ξ) is a draw from the randomised solver given then the posterior is N (m,ζ2), where parameters θ and random draw ξ. Intuitively, this construc- ( + hλ)2k tion favours parameters that exhibit agreement with the entire ζ −2 = 1 + ζ −2, γ 2 0 family of uncertain trajectories. The typical effect of this h k expectation is to increase the posterior uncertainty on θ,pre- (1 + hλ) d − ζ −2m = k + ζ 2m . venting the inappropriate posterior collapse we are concerned γ 2 0 0 h about. Indeed, if the integrator cannot resolve the underlying p+1/2σ h,σ (θ, ξ) dynamics, h will be large. Then U is inde- The observation precision is scaled by (1 + hλ)2k because θ Ph,σ (θ | ) ≈ pendent of , hence the prior is recovered, d late-time data contain increasing information. Assume that Q(θ). λkh † † the data are dk = e u + γη , for some given true ini- h,0 h,σ 0 Notice that as h → 0, both the measures P and P † tial condition u and noise realisation η†. Consider now typically collapse to the analytic posterior, P, hence both 0 the asymptotic regime, where h is fixed and k →∞.For methods are correct. We do not expect the bias of Ph,σ to the standard Euler method, where γh = γ , we see that be improved, since all of the averaged trajectories are of the   ζ 2 → 0, whilst m  (1 + hλ)−1ehλ ku†. Thus the infer- same quality as the deterministic solver in Ph,0.Wenow 0 ence scheme becomes increasingly certain of the wrong construct an analytic inference problem demonstrating these answer: the variance tends to zero and the mean tends to behaviours. infinity. Example 3.1 Consider inferring the , u0 ∈ In contrast, with a randomised integrator, the fixed h, large R, of the scalar linear differential equation, u˙ = λu, with k asymptotics are λ>0. We apply a numerical method to produce the approx- 1 imation Uk ≈ u(kh). We observe the state at some times ζ 2  , 2 −2 −2 −2 t = kh, with additive noise ηk ∼ N (0,γ ): dk = Uk + ηk. ζ + λ(2 + λh)σ h 0   If we use a deterministic Euler solver, the model predicts ( + λ)−1 hλ k † k 1 h e u0 Uk = (1 + hλ) u0. These model predictions coincide with m  . + ζ −2σ 2 2λ−1( + λ )−1 the slightly perturbed problem 1 0 h 2 h 123 Stat Comput

Fig. 5 The posterior marginals θ1 of the FitzHugh–Nagumo h=.1 inference problem using h=.05 deterministic integrators with h=.02 various step-sizes h=.01 h=.005 θ 0.4 2

0.2

0.0

−0.2 θ3 3.0

2.8

2.6 1 0 2 4 0 8 2 6 2 ...... 0 2 0 0 0 0 3 2 0 −

Thus, the mean blows up at a modified rate, but the variance concentration with improving solver quality. The posteriors remains positive. are not perfectly nested, possible evidence that our choice of scale parameter is imperfect, or that the assumption of locally We take an empirical Bayes approach to choosing σ, that Gaussian error deteriorates for large step-sizes. Note that the σ ∗ = π(σ) is, using a constant, fixed value arg max , chosen bias of θ3 is essentially unchanged with the randomised inte- before the data are observed. Joint inference of the parameters grator, but the posterior for θ2 broadens and is correlated and the noise scale suffer from well-known MCMC mixing to θ3, hence introduces a bias in the posterior mode; with- issues in Bayesian hierarchic models. To handle the unknown out randomisation, only the inappropriate certainty about θ3 θ parameter , we can marginalise it out using the prior distrib- allowed the marginal for θ2 to exhibit little bias. ution, or in simple problems, it may be reasonable to choose a fixed representative value. We now return to the FitzHugh–Nagumo model; given fixed initial conditions, we attempt to recover parameters 4 Probabilistic solvers for partial differential θ = (a, b, c) from observations of both species at times equations τ = 1, 2,...,40. The priors are log-normal, centred on the true value with unit variance, and with observational noise We now turn to present a framework for probabilistic solu- Γ = 0.001. The data are generated from a high-quality solu- tions to partial differential equations, working within the tion, and we perform inference using Euler integrators with finite element setting. Our discussion closely resembles the various step-sizes, h ∈{0.005, 0.01, 0.02, 0.05, 0.1}, span- ODE case, except that now we randomly perturb the finite ning a range of accurate and inaccurate integrators. element basis functions. We first perform the inferences with naive use of deter- ministic Euler integrators. We simulate from each posterior 4.1 Probabilistic finite element method for variational using delayed rejection MCMC (Haario et al. 2006), shown problems in Fig. 5. Observe the undesirable concentration of every posterior, even those with poor solvers; the posteriors are Let V be a Hilbert space of real-valued functions defined on almost mutually singular, hence clearly the posterior widths a bounded polygonal domain D ⊂ Rd . Consider a weak for- are meaningless. mulation of a linear PDE specified via a symmetric bilinear Secondly, we repeat the experiment using our probabilis- form a : V ×V −→ R, and a linear form r : V −→ R to give tic Euler integrators, with results shown in Fig. 6.Weuse the problem of finding u ∈ V : a(u,v) = r(v), ∀v ∈ V. a noisy pseudomarginal MCMC method, whose fast mix- This problem can be approximated by specifying a finite- ing is helpful for these initial experiments (Medina-Aguayo dimensional subspace Vh ⊂ V and seeking a solution in et al. 2015). These posteriors are significantly improved, Vh instead. This leads to a finite-dimensional problem to be exhibiting greater mutual agreement and obvious increasing solved for the approximation U: 123 Stat Comput

Fig. 6 The posterior marginals θ1 of the FitzHugh–Nagumo h=.1 inference problem using h=.05 probabilistic integrators with h=.02 various step-sizes h=.01 h=.005 θ 0.4 2

0.2

0.0

−0.2 θ3 3.0

2.8

2.6 2 0 2 4 1 2 6 8 0 ...... 0 0 0 0 0 0 2 2 3 −

U ∈ Vh : a(U,v)= r(v), ∀v ∈ Vh. (31) and the underlying deterministic approximation. The bilin- ear form a(·, ·) is assumed to induce an inner product, and ·2 = (·, ·); This is known as the . then norm via a a furthermore, we assume We will work in the setting of finite element methods, that this norm is equivalent to the norm on V. Through- Vh = {φ }J φ Eh assuming that span j j=1, where j is locally sup- out, denotes expectation with respect to the random basis { }J . functions. ported on a grid of points x j j=1 The parameter h is introduced to measure the diameter of the finite elements. We will also assume that Assumption 4 The collection of random basis functions {φr }J j j=1 are independent, zero-mean, Gaussian random φr ( ) = φ j (xk) = δ jk. (32) fields, each of which satisfies j xk 0 and shares the same support as the corresponding systematic basis function h φs. Any element U ∈ V can then be written as j For all j, the number of basis functions with index k which share the support of the basis functions with index J j is bounded independently of J, the total number of basis ( ) = φ ( ) U x U j j x (33) functions. Furthermore, the basis functions are scaled so that J j=1 Ehφr 2 ≤ 2p. j=1 j a Ch from which it follows that U(x ) = U . The Galerkin k k Assumption 5 The true solution u of problem (4.1)isin method then gives AU = r,forU = (U ,...,U )T , 1 J L∞(D). Furthermore, the standard deterministic interpolant A = a(φ ,φ ), and r = r(φ ).  jk j k k k of the true solution, defined by vs := J u(x )φs, satis- In order to account for uncertainty introduced by the j=1 j j fies u − vs ≤ Chq . numerical method, we will assume that each basis function a φ φs j can be split into the sum of a systematic part j and ran- φr φ φs Theorem 4.1 Under Assumptions 4 and 5 it follows that the dom part j , where both j and j satisfy the nodal property φr ( ) = approximation U, given by (31), satisfies (32), hence j xk 0. Furthermore, we assume that each φr φs j shares the same compact support as the corresponding j , preserving the sparsity structure of the underlying determin- { , } Ehu − U2 ≤ Ch2min p q . istic method. a

4.2 Strong convergence result As for ODEs, the solver accuracy is limited by either the amount of noise injected or the convergence rate of the As in the ODE case, we begin our convergence analy- underlying deterministic method, making p = q the natural sis with assumptions constraining the random perturbations choice. 123 Stat Comput

4.3 Poisson solver in two dimensions 5 PDE inference and numerics

Consider a Poisson equation with Dirichlet boundary condi- We now perform numerical experiments using probabilistic tions in dimension d = 2, namely solvers for elliptic PDEs. Specifically, we perform infer- ence in a 1D elliptic PDE, ∇·(κ(x)∇u(x)) = 4x for ∈[ , ] ( ) = , ( ) = −u = f, x ∈ D, x 0 1 , given boundary conditions u 0 0 u 1 2. We represent log κ as piecewise constant over ten equal- u = 0, x ∈ ∂ D. sized intervals; the first, on x ∈[0,.1) is fixed to be one to avoid non-identifiability issues, and the other nine are given V = 1( ) 2( ) We set H0 D and H to be the space L D with inner apriorθi = log κi ∼ N (0, 1). Observations of the field u product ·, · and resulting norm |·|2 =·, ·. The weak are provided at x = (0.1, 0.2,...0.9), with i.i.d. Gaussian formulation of the problem has the form (4.1) with error, N (0, 10−5); the simulated observations were gener-  ated using a fine grid and quadratic finite elements, then perturbed with error from this distribution. a(u,v)= ∇u(x)∇v(x)dx, r(v) =f,v. D Again we investigate the posterior produced at vari- ous grid sizes, using both deterministic and randomised Now consider piecewise linear finite elements satisfying the solvers. The randomised basis functions are draws from a assumptions of Sect. 4.2 in Johnson (2012) and take these Brownian bridge conditioned to be zero at the nodal points, {φs}J . implemented in practice with a truncated Karhunen–Loève to comprise the set j j=1 Then h measures the width of the triangulation of the finite element mesh. Assuming that expansion. The covariance operator may be viewed as a frac- f ∈ H it follows that u ∈ H 2(D) and that tional Laplacian, as discussed in Lindgren et al. (2011). The scaling σ is again determined by maximising the distribution s described in Sect.3.1, where the error indicator compares lin- u − v a ≤ Chu 2 . (34) H ear to quadratic basis functions, and we marginalise out the prior over the κ values. = {φr }J i Thus q 1. We choose random basis members j j=1 so The posteriors are depicted in Figs. 7 and 8. As in the ODE = that Assumption 4 hold with p 1. Theorem 4.1 then shows examples, the deterministic solvers lead to incompatible pos- = − Eh 2 ≤ 2. that, for e u U, e a Ch We note that that in teriors for varying grid sizes. In contrast, the randomised the deterministic case, we expect an improved rate of conver- solvers suggest increasing confidence as the grid is refined, gence in the function space H. Such a result can be shown as desired. The coarsest grid size uses an obviously inade- to hold in our setting, following the usual arguments for the quate ten elements, but this is only apparent in the randomised Aubin–Nitsche trick Johnson (2012), which is available in posterior. the supplementary materials.

Fig. 7 The marginal posterior θ1 distributions for the first four coefficients in 1D elliptic h=1/10 using a classic h=1/20 deterministic solver with various h=1/40 grid sizes θ2 h=1/60 2 h=1/80 0 −2 −4 θ3 −6 4 2

0

θ4 −2 2

0

−2 1 2 2 1 0 3 4 0 2 0 2 4 0 2 2 6 2 − − − − − − − −

123 Stat Comput

Fig. 8 The marginal posterior θ1 distributions for the first four coefficients in 1D elliptic inverse h=1/10 problem using a randomised h=1/20 solver with various grid sizes h=1/40

θ2 h=1/60 2 h=1/80 0 −2 −4 θ3 −46

2

0

θ4 −2

0

−2 1 2 2 0 0 2 0 2 4 0 2 2 6 4 2 1 3 − − − − − − − −

6 Conclusions converged, our results show standard techniques can be dis- astrously wrong when the solver is not converged. As the We have presented a computational methodology, backed by measure of convergence is not a standard numerical analy- rigorous analysis, which enables quantification of the uncer- sis one, but a statistical one, we have argued that it can be tainty arising from the finite-dimensional approximation of surprisingly difficult to determine in advance which regime solutions of differential equations. These methods play a nat- a particular problem resides in. Therefore, our practical rec- ural role in statistical inference problems as they allow for the ommendation is that the lower cost of the standard approach uncertainty from discretisation to be incorporated alongside makes it preferable when it is certain that the numerical other sources of uncertainty such as observational noise. We method is strongly converged with respect to the statistical provide theoretical analyses of the probabilistic integrators measure of interest. Otherwise, the randomised method we which form the backbone of our methodology. Furthermore propose provides a robust and consistent approach to address we demonstrate empirically that they induce more coher- the error introduced into the statistical task by numerical ent inference in a number of illustrative examples. There solver error. In difficult problem domains, such as numerical are a variety of areas in the sciences and engineering which weather prediction, the focus has typically been on reducing have the potential to draw on the methodology introduced the numerical error in each solver run; techniques such as including climatology, computational , and sys- these may allow a difference balance between numerical and tems . statistical computing effort in the future. Our key strategy is to make assumptions about the local The prevailing approach to model error described in behaviour of solver error, which we have assumed to be Kennedy and O’Hagan (2001) is based on a non-intrusive Gaussian, and to draw samples from the global distribution of methodology where the effect of model discrepancy is uncertainty over solutions that results. Section 2.4 describes allowed for in observation space. Our intrusive randomisa- a universality result, simplifying task of choosing covariance tion of deterministic methods for differential equations can be kernels in practice, within the family of Gaussian processes. viewed as a highly specialised discrepancy model, designed However, assumptions of Gaussian error, even locally, may using our intimate knowledge of the structure and proper- not be appropriate in some cases, or may neglect important ties of numerical methods. In this vein, we intend to extend domain knowledge. Our framework can be extended in future this work to other types of model error, where modifying the work to consider alternate priors on the error, for example, internal structure of the models can produce computationally multiplicative or non-negative errors. and analytically tractable measures of uncertainty which per- Our study highlights difficult decisions practitioners face, form better than non-intrusive methods. Our future work will regarding how to expend computational resources. While continue to study the computational challenges and opportu- standard techniques perform well when the solver is highly nities presented by these techniques.

123 Stat Comput

Acknowledgments The authors gratefully acknowledge support from h, we get EPSRC Grant CRiSM EP/D002060/1, EPSRC Established Career   Research Fellowship EP/J016934/2, EPSRC Programme Grant EQUIP Eh| |2 ≤ + O( ) Eh| |2 + O( 2q+1) + O( 2p+1). EP/K034154/1, and Academy of Finland Research Fellowship 266940. ek+1 1 h ek h h Konstantinos Zygalakis was partially supported by a grant from the Simons Foundation. Part of this work was done during the author’s stay Application of the Gronwall inequality gives the desired at the Newton Institute for the program “Stochastic Dynamical Systems result. in Biology: Numerical Methods and Applications.;” Now we turn to continuous time. We note that, for s ∈ Open Access This article is distributed under the terms of the Creative [tk, tk+1), Commons Attribution 4.0 International License (http://creativecomm ons.org/licenses/by/4.0/), which permits unrestricted use, distribution, ( ) = Ψ − ( ) + ξ ( − ), and reproduction in any medium, provided you give appropriate credit U s s tk Uk k s tk to the original author(s) and the source, provide a link to the Creative ( ) = Φ ( ). u s s−tk uk Commons license, and indicate if changes were made.

Let Ft denote the σ- of events generated by the {ξk} up to time t. Subtracting we obtain, using Assumptions 1 and A Numerical analysis details and proofs Φ 2 and the fact that s−tk has Lipschitz constant of the form 1 + O(h), Proof (Theorem 2.2) We first derive the convergence result    on the grid, and then in continuous time. From (10)wehave Eh | ( ) − ( )|F U s u s tk ≤|Φ − ( ) − Φ − ( )| Uk+1 = Ψh(Uk) + ξk(h) (35) s tk Uk s tk uk +|Ψ − (U ) − Φ − (U )| s tk k s tk k whilst we know that + Eh |ξ ( − )|F k s tk tk ≤ ( + )| |+O( q+1) + Eh|ξ ( − )| uk+1 = Φh(uk). (36) 1 Lh ek h k s tk + ≤ (1 + Lh)|e |+O(hq 1)  = Ψ ( ) − Φ ( ) k Define the truncation error k h Uk h Uk and note   1 that + Eh|ξ (s − t )|2 2 k k   q+1 p+ 1 ≤ (1 + Lh)|ek|+O(h ) + O h 2 . Uk+1 = Φh(Uk) + k + ξk(h). (37)

Subtracting Eq. (37) from (36) and defining ek = uk − Uk, Now taking expectations we obtain we get   1 + Eh|U(s) − u(s)|≤(1 + Lh) Eh|e |2 2 + O(hq 1)   k ek+1 = Φh(uk) − Φh(uk − ek) − k − ξk(h). p+ 1 + O h 2 . Taking the Euclidean norm and expectations give, using Assumption 1 and the independence of the ξk, Using the on-grid error bound gives the desired result, after   noting that the constants appearing are uniform in 0 ≤ kh ≤ 2 h 2 h  2p+1 .  E |ek+1| = E Φh(uk)−Φh(uk − ek)−k + O(h ), T Proof (Theorem 2.4) We prove the second bound first. Let O( 2p+1) :    where the constant in the h term is uniform in k w = E φ(˜( ))|˜( ) = = Eh φ( )| = ). q+1 k u tk u 0 u and Wk Uk U0 u 0 ≤ kh ≤ T. Assumption 2 implies that k = O(h ), Then let δk = sup ∈Rn |Wk −wk|. It follows from the Markov again uniformly in k : 0 ≤ kh ≤ T. Noting that Φ is u h property that globally Lipschitz with constant bounded by 1 + Lh under

Assumption 2, we then obtain hLh hL h Wk+1 − wk+1 = e Wk − e wk h 2 2 h 2 h h E |e + | ≤ (1 + Lh) E |e | hL hL k 1  k  = e Wk − e wk  1   1  h −  h h + E  h 2 Φh(uk) − Φh(uk − ek) , h 2 k  hL hL + e wk − e wk). + O(h2q+2) + O(h2p+1). Using (28) and Assumption 3 we obtain Using Cauchy–Schwarz on the inner product, and the fact 2p+2 that Φh is Lipschitz with constant bounded independently of δk+1 ≤ (1 + Lh)δk + O(h ). 123 Stat Comput

Iterating and employing the Gronwall inequality gives the Ornstein–Uhlenbeck integrator second error bound. Now we turn to the first error bound, comparing with the An additional example of a randomised integrator is an solution u of the original Eq. (1). From (25) and then (21)we integrated Ornstein–Uhlenbeck process, derived as follows. see that Define, on the interval s ∈[tk, tk+1), the pair of equations

hLh 2p+1 e φ(u) − φ(Ψh(u)) = O(h ), dU = V dt, U(t ) = U , (41a) k√ k hL hLh min{2p+1,q+1} e φ(u) − e φ(u) = O(h ). dV =−ΛV dt + 2Σ dW, V (tk) = f (Uk). (41b)

This gives the first weak error estimate, after using the sta- Here W is a standard Brownian motion and Λ and Σ are L bility estimate on eh from Assumption 3.  invertible matrices, possibly depending on h. The approxi- mating function gh(s) is thus defined by V (s), an Ornstein– Proof (Theorem 4.1) Recall the Galerkin orthogonality Uhlenbeck process. property which follows from subtracting the approximate Integrating (41b) we obtain variational principle from the true variational principle: it states that, for e = u − U,   V (s) = exp −Λ(s − tk) f (Uk) + χk(s − tk), (42) a(e,v)= 0, ∀v ∈ Vh. (38) where s ∈[tk, tk+1) and the {χk} form an i.i.d. sequence of Gaussian random functions defined on [0, h] with From this it follows that  √ s   χ ( ) = Σ Λ(τ − ) (τ). e ≤u − v , ∀v ∈ Vh. (39) k s 2 exp s dW a a 0

v ∈ Vh To see this note that, for any , the orthogonality prop- Note that the h-dependence of Ch comes through the time erty (38)gives interval on which χk is defined, and through Λ and Σ. Integrating (41a), using (42), we obtain a(e, e) = a(e, e + U − v) = a(e, u − v). (40)    −1 U(s) = Uk + Λ I − exp −Λ(s − tk) f (Uk) 2 Thus, by Cauchy–Schwarz, e ≤eau − va, ∀v ∈ a + ξk(s − tk), (43) Vh implying (39). We now set, for v ∈ V,

where s ∈[t , t + ], and, for t ∈[0, h], J k k 1 v( ) = ( )φ ( ) x u x j j x  t j=1 ξk(t) = χk(τ)dτ. (44) J J 0 = ( )φs( ) + ( )φr ( ) u x j j x u x j j x j=1 j=1 The numerical method (43) may be written in the form (12), =: vs(x) + vr(x). and hence (10) at the grid points, with the definition

   By the mean-zero and independence properties of the random −1 Ψh(u) = u + Λ I − exp −Λh f (u). basis functions we deduce that

This integrator is first-order accurate and satisfies Assump- Ehu − v2 = Eha(u − v,u − v) a tion 2 with p = 1. Choosing to scale Σ with h so that q ≥ 1in = Eh ( − vs, − vs) + Eh (vr,vr) a u u a Assumption 1 leads to convergence of the numerical method J with order 1. = − vs2 + ( )2Ehφr 2. u a u x j j a Had we carried out the above analysis in the case Λ = 0 j=1 we would have obtained the probabilistic Euler method (14), and hence (13) at grid points, used as our canonical example The result follows from Assumptions 4 and 5.  in the earlier developments. 123 Stat Comput

Convergence rate of L2 convergence References

When considering the Poisson problem in two dimensions, Arnold, A., Calvetti, D., Somersalo, E.: Linear multistep methods, par- as discussed in Sect.4, we expect an improved rate of con- ticle filtering and sequential Monte Carlo. Inverse Probl. 29(8), 085,007 (2013) vergence in the function space H. We now show that such a Brunel, N.J.B., Clairon, Q., dAlch Buc, F.: Parametric estimation result also holds in our random setting. of ordinary differential equations with orthogonality conditions. 109 Note that, under Assumption 4, if we introduce γ jk that J. Am. Stat. Assoc. (505), 173–185 (2014). doi:10.1080/ is 1 when two basis functions have overlapping support, and 01621459.2013.841583 γ Capistrán, M., Christen, J.A., Donnet, S.: Bayesian analysis of 0 otherwise, then jk is symmetric and there is constant C, ODE’s: solver optimal accuracy and Bayes factors (2013). J γ ≤ . arXiv:1311.2281 independent of j and J, such that k=1 jk C Now let ϕ solve the equation a(ϕ, v) =e,v, ∀v ∈ V. Then Chakraborty, A., Mallick, B.K., Mcclarren, R.G., Kuranz, C.C., Bing- ϕ ≤ | |. ϕs ϕr ham, D., Grosskopf, M.J., Rutter, E.M., Stripling, H.F., Drake, H 2 C e We define and in analogy with the R.P.: Spline-based emulators for radiative shock experiments with s r definitions of v and v . Following the usual arguments for measurement error. J. Am. Stat. Assoc. 108(502), 411–428 (2013). application of the Aubin–Nitsche trick Johnson (2012), we doi:10.1080/01621459.2013.770688 have |e|2 = a(e,ϕ)= a(e,ϕ− ϕs − ϕr). Thus Chkrebtii, O.A., Campbell, D.A., Girolami, M.A., Calderhead, B.: Bayesian uncertainty quantification for differential equations (2013). arXiv:1306.2365 2 s r |e| ≤eaϕ − ϕ − ϕ a Coulibaly, I., Lécot, C.: A quasi-randomized Runge–Kutta method. √   1 Math. Comput. Am. Math. Soc. 68(226), 651–659 (1999) ≤   ϕ − ϕs2 +ϕr2 2 . Dashti, M., Stuart, A.: The Bayesian approach to inverse problems. In: 2 e a a a (45) Ghanem, R., Higdon, D., Owhadi, H. (eds.) Handbook of Uncer-   tainty Quantification. Springer, New York (2016) ϕr( ) = J ϕ( )φr ( ) =ϕ J Diaconis, P.: Bayesian numerical analysis. Stat. Relat. We note that x j=1 x j j x H 2 j=1 φr ( ) = := Top. IV 1, 163–175 (1988) a j j x where, by Sobolev embedding (d 2 here), a j Eriksson, K.: Computational Differential Equations, vol. 1. Cambridge ϕ( )/ϕ | |≤ . x j H 2 satisfies max1≤ j≤J a j C Note, however, University Press, Cambridge (1996). https://books.google.co.uk/ that the a j are random and correlated with all of the random books?id=gbK2cUxVhDQC basis functions. Using this, together with (34), in (45), we Haario, H., Laine, M., Mira, A., Saksman, E.: DRAM: efficient adaptive MCMC. Stat. Comput. 16(4), 339–354 (2006) obtain Hairer, E., Nørsett, S., Wanner, G.: Solving Ordinary Differential Equations I: Nonstiff Problems. Solving Ordinary Differential   J   1 Equations, Springer, New York (1993). https://books.google.co. 2 2 | |2 ≤   2 +  φr ( ) ϕ . uk/books?id=F93u7VcSRyYC e C e a h a j j x a H 2 j=1 Hairer, E., Lubich, C., Wanner, G.: Geometric Numerical Integration: Structure-Preserving Algorithms for Ordinary Differential Equa- tions, vol. 31. Springer, New York (2006) We see that Hairer, E., McLachlan, R.I., Razakarivony, A.: Achieving Brouwer’s law with implicit Runge–Kutta methods. BIT Numer. Math. 48(2),  J J  1 231–243 (2008) | |≤   2 + (φr ,φr ) 2 . Hennig, P., Hauberg, S.: Probabilistic solutions to differential equations e C e a h a j ak a j k and their application to Riemannian statistics. In: Proceedings of j=1 k=1 the 17th International Conference on Artificial Intelligence and Statistics (AISTATS), vol. 33 (2014) From this and the of γ jk, we obtain Hennig, P., Osborne, M.A., Girolami, M.: Probabilistic numerics and uncertainty in computations. Proceedings of the Royal Society A (2015) (in press)  J J   1 | |≤   2 + γ φr 2 +φr 2 2 Johnson, C.: Numerical Solution of Partial Differential Equations by e C e a h jk j a k a the . Dover Books on Mathematics , j=1 k=1 Dover Publications, New York (2012). Incorporated, https://books.  J  1 google.co.uk/books?id=PYXjyoqy5qMC ≤ Ce h2 + 2C φr 2 2 . Kailath, T.: The and Bhattacharyya distance measures in sig- a j a nal selection. IEEE Trans. Commun. Technol. 15(1), 52–60 (1967). = j 1 doi:10.1109/TCOM.1967.1089532 Kaipio, J., Somersalo, E.: Statistical inverse problems: , Taking expectations, using that p = q = 1, we find, model reduction and inverse crimes. J. Comput. Appl. Math.    1 198(2), 493–504 (2007) Eh| |≤ Eh 2 2 ≤ 2 using Assumption 4, that e Ch e a Ch Kennedy, M.C., O’Hagan, A.: Bayesian calibration of models. as desired. Thus we recover the extra order of convergence J. R. Stat. Soc. Ser. B (Stat. Methodol.) 63(3), 425–464 (2001) Liang, H., Wu, H.: Parameter estimation for differential equation models over the rate 1 in the ·a norm (although the improved rate is in L1(Ω; H) whilst the lower rate of convergence is in using a framework of measurement error in regression models. J. Am. Stat. Assoc. 103(484), 1570–1583 (2008). doi:10.1198/ 2 L (Ω; V).). 016214508000000797

123 Stat Comput

Lindgren, F., Rue, H., Lindström, J.: An explicit link between Gaussian Stengle, G.: Error analysis of a randomized numerical method. Numer. fields and Gaussian Markov random fields: the stochastic par- Math. 70(1), 119–128 (1995) tial differential equation approach. J. R. Stat. Soc. Ser. B (Stat. Sullivan, T.: Uncertainty Quantification. Springer, New York (2016) Methodol.) 73(4), 423–498 (2011) Xue, H., Miao, H., Wu, H.: Sieve estimation of constant and time- Medina-Aguayo, F.J., Lee, A., Roberts, G.O.: Stability of Noisy varying coefficients in nonlinear ordinary differential equation Metropolis-Hastings (2015). arxiv:1503.07066 models by considering both numerical error and measurement Ramsay, J.O., Hooker, G., Campbell, D., Cao, J.: Parameter estimation error. Ann. Stat. 38(4), 2351–2387 (2010) for differential equations: a generalized smoothing approach. J. R. Xun, X., Cao, J., Mallick, B., Maity, A., Carroll, R.J.: Parameter esti- Stat. Soc. Ser. B (Stat. Methodol.) 69(5), 741–796 (2007) mation of partial differential equation models. J. Am. Stat. Assoc. Schober, M., Duvenaud, D.K., Hennig, P.: Probabilistic ODE solvers 108(503), 1009–1020 (2013) with Runge–Kutta means. In: Advances in Neural Information Processing Systems, pp. 739–747 (2014) Skilling, J.: Bayesian solution of ordinary differential equations. In: Maximum and Bayesian Methods, pp 23–37. Springer, New York (1992)

123