
Neural Stochastic Differential Equations: Deep Latent Gaussian Models in the Diffusion Limit

Belinda Tzen∗ Maxim Raginsky†

Abstract

In deep latent Gaussian models, the latent variable is generated by a time-inhomogeneous Markov chain, where at each time step we pass the current state through a parametric nonlinear map, such as a feedforward neural net, and add a small independent Gaussian perturbation. This work considers the diffusion limit of such models, where the number of layers tends to infinity, while the step size and the noise variance tend to zero. The limiting latent object is an Itô diffusion process that solves a stochastic differential equation (SDE) whose drift and diffusion coefficient are implemented by neural nets. We develop a variational inference framework for these neural SDEs via stochastic automatic differentiation in Wiener space, where the variational approximations to the posterior are obtained by Girsanov (mean-shift) transformation of the standard Wiener process and the computation of gradients is based on the theory of stochastic flows. This permits the use of black-box SDE solvers and automatic differentiation for end-to-end inference. Experimental results with synthetic data are provided.

1 Introduction

Ordinary differential equations (ODEs) and other types of continuous-time flows have always served as convenient abstractions for various deterministic iterative models and algorithms. Recently, however, several authors have started exploring the intriguing possibility of using ODEs for constructing and training very deep neural nets by considering the limiting case of composing a large number of infinitesimal nonlinear transformations (Haber and Ruthotto, 2017; Chen et al., 2018b; Li et al., 2018). In particular, Chen et al. (2018b) have introduced the framework of neural ODEs, in which the overall nonlinear transformation is represented by an ODE, and a black-box ODE solver is used as a computational primitive during end-to-end training. These ideas naturally carry over to the domain of probabilistic modeling. Indeed, since deep probabilistic generative models can be viewed as time-inhomogeneous Markov chains, we can consider the limit of infinitely many layers as a continuous-time Markov process. Since the marginal distributions of such a process evolve through a deterministic continuous-time flow, one can use ODE techniques in this context as well (Tabak and Vanden-Eijnden, 2010; Chen et al., 2018a). However, an alternative possibility is to focus on the stochastic evolution of the sample paths of the limiting process, rather than on the deterministic evolution in the space of measures. This perspective is particularly useful when the generation of sample paths of the underlying process is more tractable than the computation of process distributions.

arXiv:1905.09883v2 [cs.LG] 28 Oct 2019

∗University of Illinois, e-mail: [email protected]. †University of Illinois, e-mail: [email protected].

In this paper, we develop these ideas in the context of Deep Latent Gaussian Models (DLGMs), a flexible family of generative models introduced by Rezende et al. (2014). In these models, the latent variable is generated by a time-inhomogeneous Markov chain, where at each time step we pass the current state through a deterministic nonlinear map, such as a feedforward neural net, and add a small independent Gaussian perturbation. The observed variable is then drawn conditionally on the state of the chain after a large but finite number of steps. The iterative structure of DLGMs, together with the use of differentiable layer-to-layer transformations, is the basis of stochastic backpropagation (Kingma and Welling, 2014; Ranganath et al., 2014; Rezende et al., 2014), an efficient and scalable procedure for performing variational inference with approximate posteriors of the mean-field type. A key feature here is that all the randomness in the latent space is generated by sampling a large but finite number of independent standard Gaussian random vectors, and all other transformations are obtained by suitable differentiable reparametrizations.

If one considers the limiting regime of DLGMs, where the number of layers tends to infinity while the step size and the noise variance in layer-to-layer transformations both tend to zero, the resulting latent random object is a diffusion process of the Itô type (Bichteler, 2002; Protter, 2005), whose drift and diffusion coefficients are implemented by neural nets. We will refer to these models as neural SDEs, in analogy to the deterministic neural ODEs of Chen et al. (2018b). Generative models of this type have been considered in earlier work, first by Movellan et al. (2002) as a noisy continuous-time counterpart of recurrent neural nets, and, more recently, by Archambeau et al. (2007), Hashimoto et al. (2016), Ha et al. (2018), and Ryder et al. (2018). On the theoretical side, Tzen and Raginsky (2019) have investigated the expressive power of diffusion-based generative models and showed that they can be used to obtain approximate samples from any distribution whose Radon–Nikodym derivative w.r.t. the standard Gaussian measure can be efficiently represented by a neural net. In this paper, we leverage this expressive power and develop a framework for variational inference in neural SDEs:

• We show that all the latent randomness can be generated by sampling from the standard multidimensional Wiener process, in analogy to the use of independent standard Gaussian random vectors in DLGMs. Thus, the natural latent space for neural SDEs is the Wiener space of continuous vector-valued functions on [0, 1] equipped with the Wiener measure (the probability law of the standard Wiener process).

• We derive a variational bound on the marginal log-likelihood for the observed variable using the Gibbs variational principle on the path space (Boué and Dupuis, 1998). Moreover, by Girsanov’s theorem, any variational approximation to the posterior is related to the primitive Wiener process by a mean shift. Thus, the natural neural SDE counterpart of a mean-field approximate posterior is obtained by adding an observation-dependent neural net drift to the standard Wiener process.

• Finally, we show how variational inference can be carried out via automatic differentiation (AD) in Wiener space. One of the salient features of the neural ODE framework of Chen et al. (2018b) is that one can backpropagate gradients efficiently through any black-box ODE solver using the so-called method of adjoints (see, e.g., Kokotović and Heller (1967) and references therein). While there exists a counterpart of the method of adjoints for SDEs (Yong and Zhou, 1999, Chap. 3), it cannot be used to backpropagate gradients through a black-box SDE solver, as we explain in Section 5.1. Instead, one has to either derive custom backpropagation rules for each specific solver or use the less time-efficient forward-mode AD with a black-box SDE solver. For the latter, we use the theory of stochastic flows (Kunita, 1984) in order to differentiate the solutions of Itô SDEs with respect to parameters of the drift and the diffusion coefficient. These pathwise derivatives are also solutions of Itô SDEs, whose drift and diffusion coefficient can be obtained from those of the original SDE using the ordinary chain rule of multivariable calculus. Thus, the overall process can be implemented using AD and a black-box SDE solver.

1.1 Related work

Extending the neural ODE framework of Chen et al. (2018b) to the setting of SDEs is a rather natural step that has been taken by several authors. In particular, the use of SDEs to enhance the expressive power of continuous-time neural nets was proposed in a concurrent work of Peluchetti and Favaro (2019). Neural SDEs driven by stochastic processes with jumps were introduced by Jia and Benson (2019) as a generative framework for hybrid dynamical systems with both continuous and discrete behavior. Hegde et al. (2019) considered generative models built from SDEs whose drift and diffusion coefficients are samples from a Gaussian process. Liu et al. (2019) and Wang et al. (2019) have proposed using SDEs (and suitable discretizations) as a noise injection mechanism to stabilize neural nets against adversarial or stochastic input perturbations.

2 Background: variational inference in Deep Latent Gaussian Models

In Deep Latent Gaussian Models (DLGMs) (Rezende et al., 2014), the latent variables $X_0, \ldots, X_k$ and the observed variable $Y$ are generated recursively:

$$X_0 = Z_0 \tag{1a}$$

$$X_i = X_{i-1} + b_i(X_{i-1}) + \sigma_i Z_i, \qquad i = 1, \ldots, k \tag{1b}$$

$$Y \sim p(\cdot \mid X_k), \tag{1c}$$

where $Z_0, \ldots, Z_k \overset{\text{i.i.d.}}{\sim} N(0, I_d)$ are i.i.d. standard Gaussian vectors in $\mathbb{R}^d$, $b_1, \ldots, b_k : \mathbb{R}^d \to \mathbb{R}^d$ are some parametric nonlinear transformations, $\sigma_1, \ldots, \sigma_k \in \mathbb{R}^{d \times d}$ is a sequence of matrices, and $p(\cdot \mid \cdot)$ is the observation likelihood. Letting $\theta$ denote the parameters of $b_1, \ldots, b_k$ and the matrices $\sigma_1, \ldots, \sigma_k$, we can capture the underlying generative process by the joint probability density of $Y$ and the ‘primitive’ random variables $Z_0, \ldots, Z_k$:

$$p_\theta(y, z_0, \ldots, z_k) = p(y \mid f_\theta(z_0, \ldots, z_k))\, \varphi_d(z_0) \cdots \varphi_d(z_k), \tag{2}$$

where $f_\theta : \mathbb{R}^d \times \ldots \times \mathbb{R}^d \to \mathbb{R}^d$ is the overall deterministic transformation $(Z_0, \ldots, Z_k) \mapsto X_k$ specified by (1a)-(1b), and $\varphi_d(z) = (2\pi)^{-d/2} \exp\big(-\tfrac{1}{2}\|z\|^2\big)$ is the standard Gaussian density in $\mathbb{R}^d$. The main object of inference is the marginal likelihood $p_\theta(y)$, obtained by integrating out the latent variables $z_0, \ldots, z_k$ in (2). This integration is typically intractable, so instead one works with the so-called evidence lower bound. Typically, this bound is derived using Jensen’s inequality (see, e.g., Blei et al. (2017)); for our purposes, though, it will be convenient to derive it from the well-known Gibbs variational principle (Dupuis and Ellis, 1997): For any Borel probability measure $\mu$ on $\Omega := (\mathbb{R}^d)^{k+1}$ and any measurable real-valued function $F : \Omega \to \mathbb{R}$,

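$$-\log \mathbf{E}_\mu\big[e^{-F(Z_0, \ldots, Z_k)}\big] = \inf_{\nu \in \mathcal{P}(\Omega)} \big\{ D(\nu \| \mu) + \mathbf{E}_\nu[F(Z_0, \ldots, Z_k)] \big\}, \tag{3}$$

where $D(\cdot \| \cdot)$ is the Kullback–Leibler divergence and the infimum is over all Borel probability measures $\nu$ on $\Omega$. If we let $\mu$ be the marginal distribution of $Z_0, \ldots, Z_k$ in (2) and apply (3) to the function $F_\theta(z_0, \ldots, z_k) := -\log p(y \mid f_\theta(z_0, \ldots, z_k))$, we obtain the well-known variational formula

$$\begin{aligned} -\log p_\theta(y) &= -\log \int p(y \mid f_\theta(z_0, \ldots, z_k))\, \varphi_d(z_0) \cdots \varphi_d(z_k)\, dz_0 \ldots dz_k \\ &= \inf_{\nu \in \mathcal{P}(\Omega)} \Big\{ D(\nu \| \mu) - \int_\Omega \log p(y \mid f_\theta(z_0, \ldots, z_k))\, \nu(dz_0, \ldots, dz_k) \Big\}. \end{aligned} \tag{4}$$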

The infimum in (4) is attained by the posterior density $p_\theta(z_0, \ldots, z_k \mid y)$, whose computation is also generally intractable, so one typically picks a suitable family $\nu_\beta(dz_0, \ldots, dz_k \mid y) = q_\beta(z_0, \ldots, z_k \mid y)\, dz_0 \ldots dz_k$ of approximate posteriors to obtain the variational upper bound

$$-\log p_\theta(y) \le \inf_\beta \mathsf{F}_{\theta,\beta}(y), \tag{5}$$

where

$$\mathsf{F}_{\theta,\beta}(y) := D(\nu_\beta(\cdot \mid y) \| \mu) - \int_\Omega \log p(y \mid f_\theta(z_0, \ldots, z_k))\, \nu_\beta(dz_0, \ldots, dz_k \mid y)$$

is the variational free energy. The choice of $q_\beta(\cdot \mid y)$ is driven by considerations of computational tractability vs. representational richness. A widely used family of approximate posteriors is given by the mean-field approximation: $q_\beta(\cdot \mid y)$ is the product of $k + 1$ Gaussian densities whose means $\tilde{b}_0(y), \ldots, \tilde{b}_k(y)$ and covariance matrices $C_0(y), \ldots, C_k(y)$ are also chosen from some parametric class of nonlinearities, and $\beta$ is then the collection of all the parameters of these transformations. The resulting inference procedure, known as stochastic backpropagation (Kingma and Welling, 2014; Rezende et al., 2014), involves estimating the gradients of the variational free energy $\mathsf{F}_{\theta,\beta}(y)$ with respect to $\theta$ and $\beta$. This procedure hinges on two key steps:

Reparametrization: The joint law of $\tilde{b}_i(y) + C_i(y)^{1/2} Z_i$, $i \in \{0, \ldots, k\}$, with $Z_i \overset{\text{i.i.d.}}{\sim} N(0, I_d)$ is equal to the mean-field posterior $\nu_\beta(\cdot \mid y)$. Thus, we can express the variational free energy as

$$\mathsf{F}_{\theta,\beta}(y) = \sum_{i=0}^{k} D\big(N(\tilde{b}_i(y), C_i(y)) \,\|\, N(0, I_d)\big) + \mathbf{E}\Big[F_\theta\big(\tilde{b}_0(y) + C_0(y)^{1/2} Z_0, \ldots, \tilde{b}_k(y) + C_k(y)^{1/2} Z_k\big)\Big]. \tag{6}$$

Backpropagation with Monte Carlo: Since the expectation in (6) is w.r.t. a collection of i.i.d. standard Gaussian vectors, it follows that the gradients can be computed by interchanging differentiation and expectation and using reverse-mode automatic differentiation or backpropagation (Baydin et al., 2018). Unbiased estimates of $\nabla_\bullet \mathsf{F}_{\theta,\beta}(y)$, $\bullet \in \{\theta, \beta\}$, can then be obtained by Monte Carlo sampling.
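The two steps above can be illustrated on a toy problem. The following is a minimal sketch, assuming a one-layer scalar model ($k = 0$, $X_0 = Z_0$, $p(y \mid x) = N(y; x, 1)$) that is not taken from the paper; the function name `free_energy_grads` and the parametrization of the approximate posterior by a mean and a standard deviation are illustrative choices. The KL term is differentiated in closed form, and the expectation term is differentiated through the reparametrized sample.

```python
import random

def free_energy_grads(y, mean, std, num_samples=20000, seed=0):
    """Monte Carlo reparametrization gradients for a one-layer toy DLGM.

    Toy model (an assumption for illustration): X0 = Z0 ~ N(0, 1) and
    p(y | x) = N(y; x, 1), so F(z) = -log p(y | z) = 0.5 * (y - z)**2 + const.
    Approximate posterior: N(mean, std**2), sampled as mean + std * Z.
    """
    rng = random.Random(seed)
    # Closed-form gradients of D(N(mean, std^2) || N(0, 1)).
    kl_grad_mean = mean
    kl_grad_std = std - 1.0 / std
    # Gradients of E[F(mean + std * Z)]: differentiate inside the expectation.
    acc_mean = 0.0
    acc_std = 0.0
    for _ in range(num_samples):
        z = rng.gauss(0.0, 1.0)
        x = mean + std * z      # reparametrized latent sample
        dF_dx = x - y           # derivative of 0.5 * (y - x)^2 w.r.t. x
        acc_mean += dF_dx       # chain rule with dx/dmean = 1
        acc_std += dF_dx * z    # chain rule with dx/dstd = z
    return (kl_grad_mean + acc_mean / num_samples,
            kl_grad_std + acc_std / num_samples)

grad_mean, grad_std = free_energy_grads(y=1.0, mean=0.3, std=0.8)
```

For this toy model the exact gradients are `2 * mean - y` and `2 * std - 1 / std`, so the Monte Carlo estimates can be checked directly against them.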

3 Neural Stochastic Differential Equations as DLGMs in the diffusion limit

In this work, we consider the continuous-time limit of (1), in analogy to the neural ODE framework of Chen et al. (2018b) (which corresponds to the deterministic case $\sigma_i \equiv 0$). In this limit, the latent object becomes a $d$-dimensional diffusion process $X = \{X_t\}_{t \in [0,1]}$ given by the solution of the Itô stochastic differential equation (SDE)

$$dX_t = b(X_t, t)\, dt + \sigma(X_t, t)\, dW_t, \qquad t \in [0, 1] \tag{7}$$

where $W$ is the standard $d$-dimensional Wiener process or Brownian motion (see, e.g., Bichteler (2002) or Protter (2005) for background on diffusion processes and SDEs). The observed variable $Y$ is generated conditionally on $X_1$: $Y \sim p(\cdot \mid X_1)$. We focus on the case when both the drift $b : \mathbb{R}^d \times [0, 1] \to \mathbb{R}^d$ and the diffusion coefficient $\sigma : \mathbb{R}^d \times [0, 1] \to \mathbb{R}^{d \times d}$ are implemented by feedforward neural nets, and will thus use the term neural SDE to refer to (7).

The special case of (7) with $X_0 = 0$ and $\sigma \equiv I_d$ was studied by Tzen and Raginsky (2019), who showed that generative models of this kind are sufficiently expressive. In particular, they showed that one can use a neural net drift $b(\cdot, \cdot)$ and $\sigma = I_d$ in (7) to obtain approximate samples from any target distribution for $X_1$ whose Radon–Nikodym derivative $f$ w.r.t. the standard Gaussian measure on $\mathbb{R}^d$ can be represented efficiently using neural nets. Moreover, only a polynomial overhead is incurred in constructing $b$ compared to the neural net representing $f$.

The main idea of the construction of Tzen and Raginsky (2019) can be informally described as follows. Consider a target density of the form $q(x) = f(x)\varphi_d(x)$ for a sufficiently smooth $f : \mathbb{R}^d \to \mathbb{R}_+$ and the Itô SDE

$$dX_t = \frac{\partial}{\partial x} \log Q_{1-t} f(X_t)\, dt + dW_t, \qquad X_0 = 0;\ t \in [0, 1] \tag{8}$$

where $Q_t f(x) := \mathbf{E}_{Z \sim \varphi_d}[f(x + \sqrt{t} Z)]$. Then $X_1 \sim q$, i.e., one can use (8) to obtain an exact sample from $q$, and this construction is information-theoretically optimal (see, e.g., Dai Pra (1991), Lehec (2013), Eldan and Lee (2018), or Tzen and Raginsky (2019)). The drift term in (8) is known as the Föllmer drift (Föllmer, 1985). Replacing $Q_{1-t} f(x)$ by a Monte Carlo estimate and $f(\cdot)$ by a suitable neural net approximation $\hat{f}(\cdot; \theta)$, we can approximate the Föllmer drift by functions of the form

$$b(x, t; \theta) = \frac{\partial}{\partial x} \log \Big\{ \frac{1}{N} \sum_{n=1}^{N} \hat{f}(x + \sqrt{1-t}\, z_n; \theta) \Big\} = \frac{\sum_{n=1}^{N} \frac{\partial}{\partial x} \hat{f}(x + \sqrt{1-t}\, z_n; \theta)}{\sum_{n=1}^{N} \hat{f}(x + \sqrt{1-t}\, z_n; \theta)}.$$

This has the following implications (see Tzen and Raginsky (2019) for details):

• the complexity of representing the Föllmer drift by a neural net is comparable to the complexity of representing the Radon–Nikodym derivative $f = q/\varphi_d$ by a neural net;

• the neural net approximation to the Föllmer drift takes both the space variable $x$ and the time variable $t$ as inputs, and its weight parameters $\theta$ do not explicitly depend on time.

In some cases, this can be confirmed by direct computation. As an example, consider a stochastic deep linear neural net (Hardt and Ma, 2017) in the diffusion limit:

$$dX_t = A_t X_t\, dt + C_t\, dW_t, \qquad X_0 = x_0;\ t \in [0, 1]. \tag{9}$$

In this representation, the net is parametrized by the matrix-valued paths $\{A_t\}_{t \in [0,1]}$ and $\{C_t\}_{t \in [0,1]}$, and optimizing over the model parameters is difficult even in the deterministic case. On the other hand, the process $X$ in (9) is Gaussian (in fact, all Gaussian diffusion processes are of this form), and the probability law of $X_1$ can be computed in closed form (Fleming and Rishel, 1975, Chap. V, Sec. 9): $X_1 \sim N(m, \Sigma)$ with

$$m = \Phi_{0,1} x_0 \qquad \text{and} \qquad \Sigma = \int_0^1 \Phi_{t,1} C_t C_t^\top \Phi_{t,1}^\top\, dt,$$

where $\Phi_{s,t}$ (for $s < t$) is the fundamental matrix that solves the ODE

$$\frac{d}{dt} \Phi_{s,t} = A_t \Phi_{s,t}, \qquad \Phi_{s,s} = I_d;\ t > s.$$

The Föllmer drift provides a more parsimonious representation that does not involve time-varying network parameters. Indeed, since $q$ is the $d$-dimensional Gaussian density with mean $m$ and covariance matrix $\Sigma$, we have

$$f(x) = c \exp\Big\{ -\frac{1}{2}(x - m)^\top \Sigma^{-1} (x - m) + \frac{1}{2} x^\top x \Big\},$$

where $c$ is a normalization constant. If $\det \Sigma \neq 0$, a straightforward but tedious computation yields

$$Q_t f(x) = c_t \exp\Big\{ \frac{1}{2} t\, v^\top \Sigma_t v - \frac{1}{2}(x - m)^\top \Sigma^{-1} (x - m) + \frac{1}{2} x^\top x \Big\},$$

where $c_t > 0$ is a constant that does not depend on $x$, $v = (\Sigma^{-1} - I_d)x - \Sigma^{-1} m$, and $\Sigma_t = ((1 - t) I_d + t \Sigma^{-1})^{-1}$. Consequently, the Föllmer drift is given by

$$\frac{\partial}{\partial x} \log Q_{1-t} f(x) = \big[(1 - t)(\Sigma^{-1} - I_d)\Sigma_{1-t}(\Sigma^{-1} - I_d) - (\Sigma^{-1} - I_d)\big] x - \big[(1 - t)(\Sigma^{-1} - I_d)\Sigma_{1-t}\Sigma^{-1} - \Sigma^{-1}\big] m,$$

which is an affine function of $x$ with time-invariant parameters $\Sigma^{-1}$ and $m$.
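As a sanity check of the Gaussian case, the affine Föllmer drift can be simulated directly. The sketch below assumes $d = 1$, where the drift formula reduces to $(m/\Sigma - (\Sigma^{-1} - 1)x) / (1 + (1 - t)(\Sigma^{-1} - 1))$, and uses a fixed-step Euler scheme as a stand-in for an SDE solver; the function names and the particular values of `m` and `s2` are illustrative, not from the paper.

```python
import math
import random

def follmer_drift(x, t, m, s2):
    """Follmer drift for a 1-D Gaussian target N(m, s2) (d = 1 case of the affine formula)."""
    c = 1.0 / s2 - 1.0  # Sigma^{-1} - I_d in one dimension
    return (m / s2 - c * x) / (1.0 + (1.0 - t) * c)

def sample_x1(m, s2, num_steps, rng):
    """Euler simulation of SDE (8) started at X_0 = 0; returns an approximation of X_1."""
    h = 1.0 / num_steps
    x = 0.0
    for i in range(num_steps):
        t = i * h
        x += h * follmer_drift(x, t, m, s2) + math.sqrt(h) * rng.gauss(0.0, 1.0)
    return x

rng = random.Random(0)
m, s2 = 1.5, 0.25  # illustrative target mean and variance
samples = [sample_x1(m, s2, 200, rng) for _ in range(4000)]
mean_hat = sum(samples) / len(samples)
var_hat = sum((x - mean_hat) ** 2 for x in samples) / (len(samples) - 1)
```

Up to discretization and Monte Carlo error, `mean_hat` and `var_hat` should be close to `m` and `s2`, consistent with the claim that $X_1 \sim q$.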

4 Variational inference with neural SDEs

Our objective here is to develop a variational inference framework for neural SDEs that would leverage their expressiveness and the availability of adaptive black-box solvers for SDEs (Ilie et al., 2015). We start by showing that all the building blocks of DLGMs described in Section 2 have their natural counterparts in the context of neural SDEs.

Brownian motion as the latent object: In the case of DLGMs, it was expedient to push all the randomness in the latent space $\Omega = (\mathbb{R}^d)^{k+1}$ into the i.i.d. standard Gaussians $Z_0, \ldots, Z_k$. An analogous procedure can be carried out for neural SDEs as well, except now the latent space is $\mathbb{W} = C([0, 1]; \mathbb{R}^d)$, the space of continuous paths $w : [0, 1] \to \mathbb{R}^d$, and the primitive random object is the Wiener process $W = \{W_t\}_{t \in [0,1]}$. The continuous-time analogue of the independence structure of the $Z_i$’s is the independent Gaussian increment property of $W$: for any $0 \le s < t \le 1$, the increment $W_t - W_s \sim N(0, (t - s) I_d)$ is independent of $\{W_r : 0 \le r \le s\}$. Moreover, there is a unique probability law $\mu$ on $\mathbb{W}$ (the Wiener measure), such that, under $\mu$, $W_0 = 0$ almost surely and, for any $0 < t_1 < t_2 < \ldots < t_m \le 1$, the increments $W_{t_i} - W_{t_{i-1}}$ with $t_0 = 0$ are independent centered Gaussian random vectors with covariance matrices $(t_i - t_{i-1}) I_d$.

Let us explicitly parametrize the drift and the diffusion in (7) as $b(x, t; \theta)$ and $\sigma(x, t; \theta)$, respectively. If, for each $\theta$, $b$ and $\sigma$ in (7) are Lipschitz-continuous in $x$ uniformly in $t \in [0, 1]$, then there exists a mapping $f_\theta : \mathbb{W} \to \mathbb{W}$, such that $X = f_\theta(W)$, and this mapping is progressively measurable (Bichteler, 2002, Sec. 5.2): If we denote by $[f_\theta(W)]_t$ the value of $f_\theta(W)$ at $t$, then

$$[f_\theta(W)]_t = \int_0^t b([f_\theta(W)]_s, s; \theta)\, ds + \int_0^t \sigma([f_\theta(W)]_s, s; \theta)\, dW_s,$$

that is, for each $t$, the path $\{[f_\theta(W)]_s\}_{s \in [0,t]}$ depends only on $\{W_s\}_{s \in [0,t]}$. With these ingredients in place, we have the following path-space analogue of (2):

$$\mathbf{P}^\theta(dy, dw) = p(y \mid [f_\theta(w)]_1)\, \mu(dw)\, dy. \tag{10}$$

As before, the quantity of interest is the marginal density $p_\theta(y) := \int_{\mathbb{W}} p(y \mid [f_\theta(w)]_1)\, \mu(dw)$.
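The progressive measurability of $f_\theta$ has a simple discrete-time counterpart. The sketch below, which assumes a scalar SDE and an Euler discretization, builds the path from a list of Wiener increments and shows that changing the increments after a given step leaves the earlier part of the path untouched. The drift and diffusion coefficients are hypothetical closed-form functions standing in for neural nets.

```python
import math
import random

def path_map(increments, b, sigma, h, x0=0.0):
    """Euler approximation of w -> f_theta(w) for a scalar SDE.

    `increments` holds the Wiener increments W_{t_{i+1}} - W_{t_i}; the
    returned list is the discretized path (X_{t_0}, ..., X_{t_N}).
    """
    path = [x0]
    for i, dw in enumerate(increments):
        x, t = path[-1], i * h
        path.append(x + h * b(x, t) + sigma(x, t) * dw)
    return path

# Hypothetical drift and diffusion coefficients standing in for neural nets.
def b(x, t):
    return math.tanh(0.7 * x) + 0.1 * t

def sigma(x, t):
    return 1.0 + 0.2 * math.sin(x)

rng = random.Random(1)
N = 50
h = 1.0 / N
dw = [math.sqrt(h) * rng.gauss(0.0, 1.0) for _ in range(N)]

# Perturb only the increments from step 30 onward.
dw_perturbed = dw[:30] + [inc + 0.5 for inc in dw[30:]]
path_a = path_map(dw, b, sigma, h)
path_b = path_map(dw_perturbed, b, sigma, h)
```

The two paths agree up to and including index 30 and differ afterwards, mirroring the statement that $\{[f_\theta(W)]_s\}_{s \in [0,t]}$ depends only on $\{W_s\}_{s \in [0,t]}$.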

The variational representation and Girsanov reparametrization: The Gibbs variational principle also holds for probability measures and measurable real-valued functions on the path space $\mathbb{W}$ (Boué and Dupuis, 1998), so we obtain the following variational formula:

$$-\log p_\theta(y) = \inf_{\nu \in \mathcal{P}(\mathbb{W})} \Big\{ D(\nu \| \mu) - \int_{\mathbb{W}} \log p(y \mid [f_\theta(w)]_1)\, \nu(dw) \Big\}. \tag{11}$$

This formula looks forbidding, as it involves integration with respect to probability measures $\nu$ on path space $\mathbb{W}$. However, a significant simplification comes about from the fundamental result known as Girsanov’s theorem (Bichteler, 2002, Prop. 3.9.13): any probability measure $\nu$ on $\mathbb{W}$ which is absolutely continuous w.r.t. the Wiener measure $\mu$ corresponds to the addition of a drift term to the basic Wiener process $W$. That is, there is a one-to-one correspondence between $\nu \in \mathcal{P}(\mathbb{W})$ with $D(\nu \| \mu) < \infty$ and $\mathbb{R}^d$-valued random processes $u = \{u_t\}_{t \in [0,1]}$, such that each $u_t$ is measurable w.r.t. $\{W_s : 0 \le s \le t\}$ and $\mathbf{E}_\mu\big[\frac{1}{2}\int_0^1 \|u_t\|^2\, dt\big] < \infty$. Specifically, if we consider the Itô process

$$Z_t = \int_0^t u_s\, ds + W_t, \qquad t \in [0, 1], \tag{12}$$

then $Z = \{Z_t\}_{t \in [0,1]} \sim \nu$ with

$$D(\nu \| \mu) = \mathbf{E}_\mu\Big[\frac{1}{2} \int_0^1 \|u_t\|^2\, dt\Big].$$

Conversely, any such $\nu$ can be realized in this fashion. This leads to the Girsanov reparametrization of the variational formula (11):

$$-\log p_\theta(y) = \inf_u \mathbf{E}_\mu\Big\{ \frac{1}{2} \int_0^1 \|u_t\|^2\, dt + F_\theta\Big(W + \int_0^\bullet u_s\, ds\Big) \Big\},$$

where $W + \int_0^\bullet u_s\, ds$ is shorthand for the process $\{W_t + \int_0^t u_s\, ds\}_{t \in [0,1]}$, and $F_\theta(w) := -\log p(y \mid [f_\theta(w)]_1)$.

Mean-field approximation: The next order of business is to develop a path-space analogue of the mean-field approximation. This is rather simple: we consider deterministic drifts of the form $u_t = \tilde{b}(y, t; \beta)$, $t \in [0, 1]$, where $\tilde{b}(y, t; \beta)$ is a neural net with weight parameters $\beta$, such that $\int_0^1 \|\tilde{b}(y, t; \beta)\|^2\, dt < \infty$. The resulting process (12) is Gaussian and has independent increments with

$$Z_t - Z_s \sim N\Big( \int_s^t \tilde{b}(y, s'; \beta)\, ds',\ (t - s) I_d \Big),$$

and we have the mean-field variational bound

$$-\log p_\theta(y) \le \inf_\beta \Big\{ \frac{1}{2} \int_0^1 \|\tilde{b}(y, t; \beta)\|^2\, dt + \mathbf{E}\Big[ F_\theta\Big(W + \int_0^\bullet \tilde{b}(y, s; \beta)\, ds\Big) \Big] \Big\}.$$

One key difference from the DLGM set-up is worth mentioning: here, the only degree of freedom we need is an additive drift that affects the mean, whereas in the DLGM case we optimize over both the mean and the covariance matrix in Eq. (6).
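The mean-field bound can be evaluated in closed form on a toy example. The sketch below assumes a scalar model that is not taken from the paper: $b \equiv 0$, $\sigma \equiv 1$, $X_0 = 0$ (so that $X_1 = W_1$), $p(y \mid x) = N(y; x, 1)$, and a constant variational drift $\tilde{b}(y, t; \beta) = \beta$. The marginal of $Y$ is then $N(0, 2)$, and both sides of the bound are explicit.

```python
import math

def neg_log_evidence(y):
    """Exact -log p(y) for the toy model: Y = W_1 + unit Gaussian noise, so Y ~ N(0, 2)."""
    return 0.5 * math.log(4.0 * math.pi) + y * y / 4.0

def mean_field_bound(y, beta):
    """Mean-field free energy for the constant drift b~(y, t; beta) = beta.

    KL term: 0.5 * int_0^1 beta^2 dt = 0.5 * beta^2.
    Expected loss: E[-log N(y; W_1 + beta, 1)] with W_1 ~ N(0, 1).
    """
    kl = 0.5 * beta * beta
    expected_loss = 0.5 * math.log(2.0 * math.pi) + 0.5 * ((y - beta) ** 2 + 1.0)
    return kl + expected_loss

y = 1.2
best_beta = y / 2.0  # minimizer of the bound over constant drifts
gap = mean_field_bound(y, best_beta) - neg_log_evidence(y)
```

In this toy case the gap equals $\tfrac{1}{2} - \tfrac{1}{2}\log 2 > 0$ for every $y$: a deterministic drift shifts the mean of $W_1$ but cannot shrink its variance to match the posterior, so the bound is not tight within this family.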

5 Automatic differentiation in Wiener space

We are now faced with the problem of computing the gradients of the variational free energy

$$\mathsf{F}_{\theta,\beta}(y) := \frac{1}{2} \int_0^1 \|\tilde{b}(y, t; \beta)\|^2\, dt + \mathbf{E}_\mu\Big[ F_\theta\Big(W + \int_0^\bullet \tilde{b}(y, s; \beta)\, ds\Big) \Big] \tag{13}$$

with respect to $\theta$ and $\beta$. The gradients of the first (KL-divergence) term on the right-hand side of (13) can be computed straightforwardly using automatic differentiation, so we turn to the second term. To that end, let us define, for each $\theta$ and $\beta$, the Itô process $X^{\theta,\beta}$ by

$$X^{\theta,\beta}_t := X_0 + \int_0^t b(X^{\theta,\beta}_s, s; \theta)\, ds + \int_0^t \tilde{b}(y, s; \beta)\, ds + \int_0^t \sigma(X^{\theta,\beta}_s, s; \theta)\, dW_s, \qquad t \in [0, 1] \tag{14}$$

which is simply the result of adding the deterministic drift term $\tilde{b}(y, t; \beta)\, dt$ to the neural SDE (7). Then the second term on the right-hand side of (13) is equal to $-\mathbf{E}[\log p(y \mid X^{\theta,\beta}_1)]$, and we need a procedure for computing the gradients of this term w.r.t. $\theta$ and $\beta$. Provided we have a way of differentiating the Itô process (14) w.r.t. $\theta$ and $\beta$, the desired gradients are given by

$$\frac{\partial}{\partial \beta} \mathbf{E}[\log p(y \mid X^{\theta,\beta}_1)] = \mathbf{E}\Big[ \frac{\partial}{\partial x} \log p(y \mid X^{\theta,\beta}_1)\, \frac{\partial X^{\theta,\beta}_1}{\partial \beta} \Big],$$

$$\frac{\partial}{\partial \theta} \mathbf{E}[\log p(y \mid X^{\theta,\beta}_1)] = \mathbf{E}\Big[ \frac{\partial}{\partial x} \log p(y \mid X^{\theta,\beta}_1)\, \frac{\partial X^{\theta,\beta}_1}{\partial \theta} \Big].$$

(Here, we are assuming that the function $x \mapsto \log p(y \mid x)$ is sufficiently well-behaved to permit interchange of differentiation and integration.) In the remainder of this section, we first compare the problem of gradient computation in neural SDEs to its deterministic counterpart in neural ODEs and then describe two possible approaches.

5.1 Gradient computation in neural SDEs

Consider the following problem: We have a $d$-dimensional Itô process

$$dX^\alpha_t = b(X^\alpha_t, t; \alpha)\, dt + \sigma(X^\alpha_t, t; \alpha)\, dW_t, \tag{15}$$

where $\alpha$ is a $p$-dimensional parameter, and a function $f : \mathbb{R}^d \to \mathbb{R}$ is given. We wish to compute the gradient of the expectation $J(\alpha) := \mathbf{E}[f(X^\alpha_1)]$ w.r.t. $\alpha$. In the context of variational inference for neural SDEs, we have $\alpha = (\theta, \beta)$ and $f(\cdot) = -\log p(y \mid \cdot)$. This problem is known as sensitivity analysis (Gobet and Munos, 2005). We are allowed to use automatic differentiation and black-box SDE or ODE solvers as building blocks.

In the deterministic case, i.e., when $\sigma(\cdot) \equiv 0$, Eq. (15) is an instance of a neural ODE (Chen et al., 2018b), and the computation of $\partial J(\alpha)/\partial \alpha$ can be carried out efficiently using any black-box ODE solver. The key idea, based on the so-called adjoint sensitivity method (see, e.g., Kokotović and Heller (1967) and references therein), is to augment the original ODE that runs forward in time with a certain second ODE that runs backward in time. This allows one to efficiently backpropagate gradients through any black-box ODE solver.

Unfortunately, there is no straightforward way to port this construction to SDEs. In very broad strokes, this can be explained as follows: While the analogue of the method of adjoints is available for SDEs (see, e.g., Yong and Zhou (1999, Chap. 3)), the adjoint equation is an SDE that has to be solved backward in time with a terminal condition that depends on the entire Wiener path $W = \{W_t\}_{t \in [0,1]}$, but the solution at each time $t \in [0, 1]$ must still be measurable only w.r.t. the “past” $\{W_s\}_{s \in [0,t]}$. The augmented system consisting of the original forward SDE and the adjoint backward SDE is an instance of a forward-backward SDE, or FBSDE for short (Yong and Zhou, 1999, Chap. 7). To the best of our knowledge, there are no efficient black-box schemes for solving FBSDEs with computation requirements comparable to standard SDE solvers; this stems from the fact that any procedure for solving FBSDEs must rely on a routine for solving a certain class of semilinear parabolic PDEs (Milstein and Tretyakov, 2006), which will incur considerable computational costs in high-dimensional settings. (There are, however, promising first steps in this direction by Han et al. (2018) and Han and Long (2019) based on deep neural nets.)

This unfortunate complication means that we have to forgo the use of adjoint-based methods for SDEs and instead develop gradient computation procedures by other means. We describe two possible approaches in the remainder of this section. As stated earlier, we assume that the following building blocks are available:

• automatic differentiation (or AD; Griewank and Walther (2008); Baydin et al. (2018)): Given a straight-line program (directed acyclic computation graph) for computing the value of a differentiable function $f : \mathbb{R}^m \to \mathbb{R}^n$ at any point $v \in \mathbb{R}^m$, AD is a systematic procedure that generates a program for computing the Jacobian of $f$ at $v$. The time complexity of the resulting program for evaluating the Jacobian scales as $c \cdot m\,T(f)$ using forward-mode AD or $c \cdot n\,T(f)$ using reverse-mode AD, where $T(f)$ is the time complexity of the program for computing $f$ and $c$ is a small absolute constant. In particular, one should use reverse-mode AD when $n \ll m$; however, the space complexity of the program generated by reverse-mode AD may be rather high.

• a black-box SDE solver: For a given initial condition $z \in \mathbb{R}^m$, initial and final times $0 \le t_0 < t_1 \le 1$, drift $f : \mathbb{R}^m \times [0, 1] \to \mathbb{R}^m$, and diffusion coefficient matrix $g : \mathbb{R}^m \times [0, 1] \to \mathbb{R}^{m \times d}$, we will denote by SDE.Solve(z, t0, t1, f, g) the output of any black-box procedure that computes or approximates the solution at time $t_1$ of the Itô SDE

$$dZ_t = f(Z_t, t)\, dt + g(Z_t, t)\, dW_t, \qquad t \in [t_0, t_1],$$

with $Z_{t_0} = z$. We assume, moreover, that one can pass straight-line programs for computing $f$ and $g$ as arguments to SDE.Solve.
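To make the solver interface concrete, here is a minimal sketch of one procedure that could play the role of SDE.Solve: a fixed-step Euler scheme in plain Python. The name `sde_solve` and the extra arguments `num_steps` and `rng` are assumptions introduced for illustration; any adaptive or higher-order solver with the same inputs and outputs would serve equally well.

```python
import math
import random

def sde_solve(z, t0, t1, f, g, num_steps=100, rng=None):
    """Fixed-step Euler stand-in for SDE.Solve(z, t0, t1, f, g).

    z: initial state, a list of m floats.
    f(z, t): drift, returns a list of m floats.
    g(z, t): diffusion matrix, returns m rows of d floats each.
    Returns an approximation of the solution at time t1.
    """
    if rng is None:
        rng = random.Random()
    h = (t1 - t0) / num_steps
    z = list(z)
    for i in range(num_steps):
        t = t0 + i * h
        drift, diff = f(z, t), g(z, t)
        d = len(diff[0])
        # Independent Wiener increments over a step of length h.
        dw = [math.sqrt(h) * rng.gauss(0.0, 1.0) for _ in range(d)]
        z = [z[a] + h * drift[a] + sum(diff[a][l] * dw[l] for l in range(d))
             for a in range(len(z))]
    return z

# Usage: a two-dimensional linear SDE driven by a two-dimensional Wiener process.
z1 = sde_solve([1.0, 0.0], 0.0, 1.0,
               f=lambda z, t: [-z[0], z[0] - z[1]],
               g=lambda z, t: [[0.5, 0.0], [0.0, 0.5]],
               rng=random.Random(0))
```

Because `f` and `g` are passed as ordinary callables, the same routine can be reused for an augmented system whose state stacks the original process together with its sensitivities.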

5.2 Solve-then-differentiate: the Euler backprop

The most straightforward approach is to derive custom backpropagation equations for a specific SDE solver, e.g., the Euler method. We first generate an Euler approximation of the diffusion process (14) and then estimate the gradients of $-\mathbf{E}[\log p(y \mid X^{\theta,\beta}_1)]$ w.r.t. $\theta$ and $\beta$ by backpropagation through the computation graph of the Euler recursion:

• the forward pass: given a time mesh $0 = t_0 < t_1 < \ldots < t_N = 1$, we sample $Z_1, \ldots, Z_N \overset{\text{i.i.d.}}{\sim} \gamma_d$ (the standard Gaussian measure on $\mathbb{R}^d$) and the desired initialization $\hat{X}^{\theta,\beta}_{t_0}$, and generate the updates

$$\hat{X}^{\theta,\beta}_{t_{i+1}} = \hat{X}^{\theta,\beta}_{t_i} + h_{i+1} \Big( b(\hat{X}^{\theta,\beta}_{t_i}, t_i; \theta) + \tilde{b}(y, t_i; \beta) \Big) + \sqrt{h_{i+1}}\, \sigma(\hat{X}^{\theta,\beta}_{t_i}, t_i; \theta) Z_{i+1}$$

for $0 \le i < N$, where $h_{i+1} := t_{i+1} - t_i$;

• the backward pass: compute the gradients of $-\log p(y \mid \hat{X}^{\theta,\beta}_{t_N})$ w.r.t. $\theta$ and $\beta$ using reverse-mode AD.

The overall time complexity is roughly $O\big(N \cdot (T(b) + T(\tilde{b}) + T(\sigma))\big)$, on the order of the time complexity of the forward pass, exactly as one would expect for reverse-mode AD. This approach, which approximates the continuous-time neural SDE by a discrete DLGM of the form (1), has previously been used in mathematical finance (Giles and Glasserman, 2006) and, more recently, in the context of variational inference for SDEs (Ryder et al., 2018). The forward and the backward passes have the same structure as in the original work of Rezende et al. (2014), and Monte Carlo estimates of the gradient can be obtained by averaging over multiple independent realizations.
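The forward and backward passes can be sketched by hand for a scalar toy model. The code below assumes a drift $b(x, t; \theta) = \theta \tanh(x)$, a constant variational drift $\tilde{b} = \beta$, $\sigma \equiv 1$, and $p(y \mid x) = N(y; x, 1)$; these choices, and the hand-written reverse sweep in place of an AD library, are illustrative assumptions rather than the setup used in the paper.

```python
import math
import random

def euler_forward(theta, beta, noise, h, x0=0.0):
    """Forward pass: Euler recursion; returns all states X_{t_0}, ..., X_{t_N}."""
    xs = [x0]
    for z in noise:
        x = xs[-1]
        xs.append(x + h * (theta * math.tanh(x) + beta) + math.sqrt(h) * z)
    return xs

def loss(theta, beta, noise, h, y):
    """-log p(y | X_{t_N}) up to an additive constant, for p(y | x) = N(y; x, 1)."""
    return 0.5 * (y - euler_forward(theta, beta, noise, h)[-1]) ** 2

def euler_backprop(theta, beta, noise, h, y):
    """Backward pass: hand-written reverse-mode sweep through the Euler recursion."""
    xs = euler_forward(theta, beta, noise, h)
    adj = xs[-1] - y  # derivative of the loss w.r.t. X_{t_N}
    g_theta = 0.0
    g_beta = 0.0
    for x in reversed(xs[:-1]):
        g_theta += adj * h * math.tanh(x)  # direct dependence of the step on theta
        g_beta += adj * h                  # direct dependence of the step on beta
        adj *= 1.0 + h * theta * (1.0 - math.tanh(x) ** 2)  # d X_{i+1} / d X_i
    return g_theta, g_beta

rng = random.Random(0)
N = 64
h = 1.0 / N
noise = [rng.gauss(0.0, 1.0) for _ in range(N)]
g_theta, g_beta = euler_backprop(0.5, -0.2, noise, h, y=1.0)
```

For a fixed noise sequence the result can be compared against central finite differences of `loss`; averaging over independent noise sequences gives the Monte Carlo gradient estimate. Note that the sweep stores every intermediate state, which is the storage cost of reverse-mode differentiation mentioned above.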

5.3 Differentiate-then-solve: the pathwise differentiation method

In some settings, it may be desirable to avoid explicit discretization of the neural SDE and work with a black-box SDE solver instead. This approach amounts to differentiating through the forward pass of the solver. The computation of the pathwise derivatives of $X^{\theta,\beta}$ with respect to $\theta$ or $\beta$ can be accomplished by solving another SDE, as a consequence of the theory of stochastic flows (Kunita, 1984).

Theorem 5.1. Assume the following:

1. The drift $b(x, t; \theta)$ and the diffusion matrix $\sigma(x, t; \theta)$ are Lipschitz-continuous with Lipschitz-continuous Jacobians in $x$ and $\theta$, uniformly in $t \in [0, 1]$.

2. The drift $\tilde{b}(y, t; \beta)$ is Lipschitz-continuous with Lipschitz-continuous Jacobian in $\beta$, uniformly in $t \in [0, 1]$.

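Then the pathwise derivatives of $X = X^{\theta,\beta}$ in $\theta$ and $\beta$ are given by the following Itô processes:

$$\frac{\partial X_t}{\partial \beta_i} = \int_0^t \Big( \frac{\partial b_s}{\partial x} \frac{\partial X_s}{\partial \beta_i} + \frac{\partial \tilde{b}_s}{\partial \beta_i} \Big) ds + \sum_{\ell=1}^{d} \int_0^t \frac{\partial \sigma_{s,\ell}}{\partial x} \frac{\partial X_s}{\partial \beta_i}\, dW^\ell_s \tag{16}$$

and

$$\frac{\partial X_t}{\partial \theta_j} = \int_0^t \Big( \frac{\partial b_s}{\partial \theta_j} + \frac{\partial b_s}{\partial x} \frac{\partial X_s}{\partial \theta_j} \Big) ds + \sum_{\ell=1}^{d} \int_0^t \Big( \frac{\partial \sigma_{s,\ell}}{\partial \theta_j} + \frac{\partial \sigma_{s,\ell}}{\partial x} \frac{\partial X_s}{\partial \theta_j} \Big) dW^\ell_s \tag{17}$$

where, e.g., $b_s$ is shorthand for $b(X^{\theta,\beta}_s, s; \theta)$, $\sigma_{s,\ell}$ denotes the $\ell$th column of $\sigma(X^{\theta,\beta}_s, s; \theta)$, and $W^1, \ldots, W^d$ are the independent scalar coordinates of the $d$-dimensional Wiener process $W$.

Proof. We will use the results of Kunita (1984) on differentiability of the solutions of Itô SDEs w.r.t. initial conditions. Consider an $m$-dimensional Itô process of the form

$$Z_t(\xi) = \xi + \int_0^t f_0(Z_s(\xi), s)\, ds + \sum_{i=1}^{m} \int_0^t f_i(Z_s(\xi), s)\, dW^i_s, \qquad t \in [0, 1] \tag{18}$$

where $W^1, \ldots, W^m$ are independent standard scalar Brownian motions, with the following assumptions:

1. The initial condition $Z_0(\xi) = \xi \in \mathbb{R}^m$.

2. The vector fields $f_j : \mathbb{R}^m \times [0, 1] \to \mathbb{R}^m$, $j \in \{0, \ldots, m\}$, are globally Lipschitz-continuous and have Lipschitz-continuous gradients, uniformly in $t \in [0, 1]$.

Then (Kunita, 1984, Chap. 2, Thm. 3.1) the pathwise derivatives of $Z$ w.r.t. the initial condition $\xi$ exist and are given by the Itô processes

$$\frac{\partial Z_t(\xi)}{\partial \xi_i} = e_i + \int_0^t \frac{\partial}{\partial z} f_0(Z_s(\xi), s) \frac{\partial Z_s(\xi)}{\partial \xi_i}\, ds + \sum_{j=1}^{m} \int_0^t \frac{\partial}{\partial z} f_j(Z_s(\xi), s) \frac{\partial Z_s(\xi)}{\partial \xi_i}\, dW^j_s, \tag{19}$$

where $e_1, \ldots, e_m$ are the standard basis vectors in $\mathbb{R}^m$. Now suppose that $\beta \in \mathbb{R}^k$, $\theta \in \mathbb{R}^n$, let $m := d + k + n$, and define the vector fields $f_j(z, t)$, $j \in \{0, \ldots, m\}$, for $z = (x^\top, \beta^\top, \theta^\top)^\top \in \mathbb{R}^m$ and $t \in [0, 1]$ by

$$f_0(z, t) := \begin{pmatrix} b(x, t; \theta) + \tilde{b}(y, t; \beta) \\ 0 \end{pmatrix} \qquad \text{and} \qquad f_j(z, t) := \begin{pmatrix} \sigma_j(x, t; \theta) \\ 0 \end{pmatrix} \cdot \mathbf{1}_{j \in \{1, \ldots, d\}}, \tag{20}$$

where $\sigma_j(x, t; \theta)$ is the $j$th column of $\sigma(x, t; \theta)$. These vector fields satisfy the above Lipschitz continuity condition by hypothesis. Consider the Itô process (18) with the initial condition $\xi = (0^\top, \beta^\top, \theta^\top)^\top$. Then evidently

$$Z_t(\xi) = \begin{pmatrix} X^{\theta,\beta}_t \\ \beta \\ \theta \end{pmatrix}, \qquad t \in [0, 1]$$

and Eqs. (16) and (17) follow from (19), (20), and the chain rule of multivariable calculus.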

From the above, it follows that we can compute the gradients of the variational free energy (13) using any black-box SDE solver. To that end, we first use AD to generate the programs for computing the Jacobians of $\log p(y \mid \cdot)$, $b$, $\tilde{b}$, and $\sigma$, which, together with $b$, $\tilde{b}$, and $\sigma$, can then be supplied to the SDE solver. We can then obtain both $X^{\theta,\beta}$ and the pathwise derivatives $\partial X^{\theta,\beta}/\partial \beta$ and $\partial X^{\theta,\beta}/\partial \theta$ by a call to SDE.Solve with $f$ and $g$ consisting of $b(\cdot; \theta)$, $\sigma(\cdot; \theta)$, $\tilde{b}(\cdot; \beta)$ and their Jacobians w.r.t. $x$, $\beta$, and $\theta$.

The time complexity is determined by the internal workings of SDE.Solve and by the time complexity of computing the Jacobians that enter the drift and the diffusion coefficients in (16) and (17). Suppose that $\theta$ takes values in $\mathbb{R}^n$ and $\beta$ takes values in $\mathbb{R}^k$. Then:

• for (16), we need the Jacobian of $b(x, t; \theta)$ w.r.t. $x$, the Jacobian of $\tilde{b}(y, t; \beta)$ w.r.t. $\beta$, and the Jacobian of $\sigma(x, t; \theta)$ w.r.t. $x$. If we use AD to generate the programs for computing the Jacobians, the total time complexity per iteration of the SDE solver will be

$$O\big( k \cdot d \cdot T(b) + \min(d, k) \cdot T(\tilde{b}) + d \cdot T(\sigma) \big),$$

using forward-mode or reverse-mode AD as needed.

• for (17), we need the Jacobian of $b(x, t; \theta)$ and $\sigma(x, t; \theta)$ w.r.t. $\theta$ and $x$. The total per-iteration time complexity of generating the Jacobians using AD will be

$$O\big( n(\min(d, n) + d) \cdot T(b) + T(\sigma) \big).$$

Typically, the dimension $n$ of the latent parameter $\theta$ will be on the order of $d^2$ for a fully connected neural net.

The solve-then-differentiate approach of Section 5.2 will generally scale better to high-dimensional problems than the black-box pathwise approach. On the other hand, the pathwise approach is more flexible since it works with a generic SDE solver, so the overall time complexity may be reduced by using an adaptive SDE solver (Ilie et al., 2015). In addition, since the pathwise approach amounts to differentiating through the forward operation of the solver, it may incur smaller storage overhead than the Euler backprop method. As before, the gradients of the free energy w.r.t. β and θ can be estimated using Monte Carlo methods, by averaging multiple independent runs of the SDE solver.
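To make the pathwise approach concrete, the following Python sketch integrates a toy SDE jointly with its pathwise derivative $J_t = \partial X_t / \partial \beta$ by applying the Euler-Maruyama scheme to the augmented state. It is an illustration under stated assumptions, not the implementation used in this work: the drift is taken to be $\mathrm{sigmoid}(Ax) + \beta$ with a constant variational drift $\beta$, the diffusion coefficient is the identity, and $X_0 = 0$. The name `solve_with_sensitivity` is hypothetical and merely stands in for a call to a black-box solver such as SDE.Solve. With additive noise the sensitivity carries no diffusion term and obeys $dJ_t = (D_x b(X_t) J_t + I)\,dt$ with $J_0 = 0$.

```python
import numpy as np


def sigmoid(u):
    """Coordinatewise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-u))


def solve_with_sensitivity(A, beta, noise, h):
    """Euler-Maruyama for X and for J = dX/dbeta, driven by the same noise.

    Assumed toy model (illustrative only):
        dX_t = (sigmoid(A X_t) + beta) dt + dW_t,  X_0 = 0.
    `noise` has shape (steps, d) and holds standard normal draws.
    Returns the terminal state X_1 (shape (d,)) and J_1 (shape (d, d)),
    where column i of J_1 is the derivative of X_1 w.r.t. beta_i.
    """
    d = A.shape[0]
    x = np.zeros(d)
    J = np.zeros((d, d))
    for xi in noise:
        s = sigmoid(A @ x)
        # Jacobian of x -> sigmoid(A x): diag(s * (1 - s)) @ A
        Db = (s * (1.0 - s))[:, None] * A
        # Sensitivity step uses the current state, so update J before x.
        J = J + h * (Db @ J + np.eye(d))
        x = x + h * (s + beta) + np.sqrt(h) * xi
    return x, J
```

Because the recursion for `J` is the exact derivative of the Euler-Maruyama map, it can be checked against central finite differences computed with the same noise increments.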

6 Experimental results

We evaluated the performance of the forward-differentiation pathwise method of Sec. 5.3 using gradient descent on synthetic data. The code for all experiments was written in Julia using the DiffEqFlux library (Rackauckas et al., 2019) and executed on a CPU. The synthetic data were generated via a numerical SDE solution of the diffusion process

$$dX_t = \mathrm{sigmoid}(A X_t)\,dt + dW_t, \qquad t \in [0, 1], \tag{21}$$

where $\mathrm{sigmoid}(\cdot)$ is the coordinatewise sigmoid activation function. The observations $Y_1, \ldots, Y_n \in \mathbb{R}^d$ were generated by sampling $n$ independent copies of this process and then adding small independent Gaussian perturbations to each copy of $X_1$. That is, we use (21) parametrized by $A \in \mathbb{R}^{d \times d}$ as the neural SDE model, and take $p(y|x)$ to be Gaussian with mean $x$ and covariance matrix $I_d$. The auxiliary parameters were chosen from the class of constant drift terms independent of the process $X_t$, corresponding to mean-field approximations of the form

$$dX_t^{A,\beta} = \big(\mathrm{sigmoid}(A X_t^{A,\beta}) + \beta\big)\,dt + dW_t.$$
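The data-generating procedure described above can be sketched in a few lines of Python. This is an illustrative reconstruction rather than the Julia code used for the experiments: it assumes a fixed-step Euler-Maruyama discretization, an initial condition $X_0 = 0$, and a hypothetical `noise_std` parameter for the observation noise (set to 1 here to match the identity covariance of $p(y|x)$). The function names are likewise hypothetical.

```python
import numpy as np


def sigmoid(u):
    """Coordinatewise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-u))


def simulate_terminal_states(A, n, h, rng, beta=None):
    """Simulate n independent paths of
        dX_t = (sigmoid(A X_t) + beta) dt + dW_t,  t in [0, 1],
    with Euler-Maruyama and return the terminal states X_1, shape (n, d).

    beta=None gives the model (21); a vector beta gives the mean-field
    variational process. X_0 = 0 is an assumption of this sketch.
    """
    d = A.shape[0]
    steps = int(round(1.0 / h))
    shift = np.zeros(d) if beta is None else np.asarray(beta)
    X = np.zeros((n, d))
    for _ in range(steps):
        dW = np.sqrt(h) * rng.standard_normal((n, d))
        # Row i of X @ A.T equals A @ X[i].
        X = X + h * (sigmoid(X @ A.T) + shift) + dW
    return X


def make_observations(A, n, h, rng, noise_std=1.0):
    """Observations Y_i = X_1^(i) + Gaussian perturbation (noise_std is assumed)."""
    X1 = simulate_terminal_states(A, n, h, rng)
    return X1 + noise_std * rng.standard_normal(X1.shape)
```

For instance, `make_observations(A, n=1000, h=1/32, rng=np.random.default_rng(0))` with a $10 \times 10$ matrix `A` returns an array of shape `(1000, 10)`.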

The ground-truth parameter $A \in \mathbb{R}^{d \times d}$ was randomly chosen with elements drawn i.i.d. from $N(0, 1)$; all other parameters were likewise randomly initialized. For appropriate choices of parameters, performing vanilla gradient descent with constant step size leads to a decrease of the free energy. We examined the effect on performance of different discretization step sizes, as well as different sample sizes. The results are shown in Figures 1 and 2 for a 100-dimensional ground-truth parameter $A \in \mathbb{R}^{10 \times 10}$ and the variational approximation drift $\beta \in \mathbb{R}^{10}$. Interestingly, the improvements from discretization meshes finer than $h = 1/32$ are incremental, suggesting that, at least in a simple case such as this, models with “infinitely many layers” may not offer significant practical advantage over models with “finitely many” layers, such as DLGMs, where the number of layers approaches the dimensionality of the problem. On the other hand, while the method exhibits some robustness when data are not abundant ($n \approx d$), doing not much worse than when they are ($n \gg d$), it does increasingly worse (as reflected in the log-likelihood attained by the optimized parameters) and ultimately fails in a low-data regime, as $n$ approaches $\sqrt{d}$.
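The gradient-descent loop over the variational drift $\beta$ can be illustrated with the following Python sketch. It is not the experimental code, and it rests on an assumption that should be checked against (13): for identity diffusion, a constant drift shift $\beta$, and the Gaussian likelihood above, the free energy for a single observation $y$ is taken to be $\mathbf{E}\big[\tfrac{1}{2}\|\beta\|^2 + \tfrac{1}{2}\|y - X_1^{A,\beta}\|^2\big]$ up to an additive constant. The expectation is replaced by an average over $M$ Euler-Maruyama paths with fixed noise, and the gradient is obtained from the pathwise derivative $J = \partial X_1 / \partial \beta$. The function names, the step size `lr`, and the iteration count are all hypothetical choices.

```python
import numpy as np


def sigmoid(u):
    """Coordinatewise logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-u))


def free_energy_and_grad(A, beta, y, noise, h):
    """Monte Carlo free energy (assumed form, constant dropped) and its
    gradient w.r.t. beta, using forward pathwise sensitivities.

    noise: standard normal draws of shape (M, steps, d), held fixed so that
    the sample-average objective is a deterministic function of beta.
    """
    M, steps, d = noise.shape
    X = np.zeros((M, d))          # X_0 = 0 (assumption of this sketch)
    J = np.zeros((M, d, d))       # J[m] = dX^(m)/dbeta
    eye = np.eye(d)
    for k in range(steps):
        s = sigmoid(X @ A.T)
        Db = (s * (1.0 - s))[:, :, None] * A   # per-path diag(s') @ A
        J = J + h * (Db @ J + eye)
        X = X + h * (s + beta) + np.sqrt(h) * noise[:, k]
    resid = y - X
    value = 0.5 * beta @ beta + 0.5 * np.mean(np.sum(resid**2, axis=1))
    # Chain rule: d/dbeta of 0.5 * |y - X_1|^2 is -J^T (y - X_1).
    grad = beta - np.mean(np.einsum("mij,mi->mj", J, resid), axis=0)
    return value, grad


def gradient_descent(A, y, noise, h, lr=0.05, iters=100):
    """Vanilla gradient descent on beta with a constant step size."""
    beta = np.zeros(A.shape[0])
    history = []
    for _ in range(iters):
        value, grad = free_energy_and_grad(A, beta, y, noise, h)
        history.append(value)
        beta = beta - lr * grad
    return beta, history
```

Holding the noise fixed across iterations is a design choice made here so that the decrease of the objective can be observed without Monte Carlo fluctuations; resampling the noise at each iteration would give a stochastic-gradient variant.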

7 Conclusion and future directions

We have presented an analysis of neural SDEs, which can be viewed as a continuous-time limit of the DLGMs of Rezende et al. (2014) or as a stochastic version of the neural ODEs of Chen et al. (2018b). In particular, this is a best-of-both-worlds perspective that enables us to both reason about the compositional expressive power of DLGMs and the inference process in the space of measures. In addition, it allows us to draw on a variety of standard scientific computing methods developed for continuous-time stochastic processes. We have shown that optimization for neural SDEs is far from a straightforward analogue of the ODE case, with the issue of time-adaptedness of paths complicating the use of the adjoint sensitivity method for gradient computation. The discretize-then-differentiate approach was shown to exactly recover the stochastic backpropagation method of Rezende et al. (2014) when using a simple Euler discretization, with the form of the mean-field variational approximation closely preserved; and the differentiate-then-discretize pathwise approach was demonstrated to be a single-pass method that leverages a black-box SDE solver, with little storage overhead but high computational complexity and thus limited scalability due to the use of forward-mode differentiation.

One interesting direction to examine going forward is more sophisticated discretization schemes that readily enable the use of numerical tools developed for continuous-time processes (e.g., ODE solvers), where the continuous-time process is viewed as a deterministic function of random increments, perhaps arbitrarily small as the discretization becomes increasingly fine. Another promising direction is to investigate the connection between neural SDEs and probabilistic ODE solvers that return a posterior estimate rather than a deterministic approximate solution (Conrad et al., 2017; Schober et al., 2019).

Acknowledgments

The authors would like to thank Matus Telgarsky for many enlightening discussions, and Chris Rackauckas, Markus Heinonen, and Mauricio Álvarez for their comments and constructive suggestions on the first version of this work. This work was supported in part by the NSF CAREER award CCF-1254041, in part by the Center for Science of Information (CSoI), an NSF Science and Technology Center, under grant agreement CCF-0939370, in part by the Center for Advanced Electronics through Machine Learning (CAEML) I/UCRC award no. CNS-16-24811, and in part by the Office of Naval Research under grant no. N00014-12-1-0998.

References

Cédric Archambeau, Manfred Opper, Yuan Shen, Dan Cornford, and John S. Shawe-Taylor. Variational inference for diffusion processes. In Neural Information Processing Systems, 2007.

Atılım Güneş Baydin, Barak A. Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. Automatic differentiation in machine learning: a survey. Journal of Machine Learning Research, 18:1–43, 2018.

Klaus Bichteler. Stochastic Integration with Jumps. Cambridge University Press, 2002.

David M. Blei, Alp Kucukelbir, and John D. McAuliffe. Variational inference: a review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.

Michelle Boué and Paul Dupuis. A variational representation for certain functionals of Brownian motion. Annals of Probability, 26(4):1641–1659, 1998.

Changyou Chen, Chunyuan Li, Liqun Chen, Wenlin Wang, Yunchen Pu, and Lawrence Carin. Continuous-time flows for efficient inference and density estimation. In International Conference on Machine Learning, 2018a.

Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David K. Duvenaud. Neural ordinary differential equations. In Neural Information Processing Systems, 2018b.

Patrick R. Conrad, Mark Girolami, Simo Särkkä, Andrew Stuart, and Konstantinos Zygalakis. Statistical analysis of differential equations: introducing probability measures on numerical solutions. Statistics and Computing, 27:1065–1082, 2017.

Paolo Dai Pra. A stochastic control approach to reciprocal diffusion processes. Applied Mathematics and Optimization, 23:313–329, 1991.

Paul Dupuis and Richard S. Ellis. A Weak Convergence Approach to the Theory of Large Deviations. Wiley, 1997.

Ronen Eldan and James R. Lee. Regularization under diffusion and anticoncentration of the information content. Duke Mathematical Journal, 167(5):969–993, 2018.

Wendell H. Fleming and Raymond W. Rishel. Deterministic and Stochastic Optimal Control. Springer, 1975.

Hans Föllmer. An entropy approach to time reversal of diffusion processes. In Stochastic Differential Systems (Marseille-Luminy, 1984), volume 69 of Lecture Notes in Control and Information Sciences. Springer, 1985.

Michael Giles and Paul Glasserman. Smoking adjoints: fast Monte Carlo Greeks. Risk, pages 88–92, January 2006.

Emmanuel Gobet and Rémi Munos. Sensitivity analysis using Itô–Malliavin calculus and martingales, and application to stochastic optimal control. SIAM Journal on Control and Optimization, 43(5):1676–1713, 2005.

Andreas Griewank and Andrea Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. SIAM, 2nd edition, 2008.

Jung-Su Ha, Young-Jin Park, Hyeok-Joo Chae, Soon-Seo Park, and Han-Lim Choi. Adaptive path-integral autoencoder: representation learning and planning for dynamical systems. In Neural Information Processing Systems, 2018.

Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks. Inverse Problems, 34(1):014004, 2017.

Jiequn Han and Jihao Long. Convergence of the deep BSDE method for coupled FBSDEs, 2019. URL http://arxiv.org/abs/1811.01165.

Jiequn Han, Arnulf Jentzen, and Weinan E. Solving high-dimensional partial differential equations using deep learning. Proceedings of the National Academy of Sciences (U.S.), 115(34):8505–8510, 2018.

Moritz Hardt and Tengyu Ma. Identity matters in deep learning. In International Conference on Learning Representations, 2017.

Tatsunori Hashimoto, David Gifford, and Tommi Jaakkola. Learning population-level diffusions with generative RNNs. In Proceedings of the 33rd International Conference on Machine Learning, pages 2417–2426, 2016.

Pashupati Hegde, Markus Heinonen, Harri Lähdesmäki, and Samuel Kaski. Deep learning with differential Gaussian process flows. In Conference on Artificial Intelligence and Statistics, pages 1812–1821, 2019.

Silvana Ilie, Kenneth R. Jackson, and Wayne H. Enright. Adaptive time-stepping for the strong numerical solution of stochastic differential equations. Numerical Algorithms, 68(4):791–812, 2015.

Junteng Jia and Austin Benson. Neural jump stochastic differential equations. In Neural Information Processing Systems, 2019.

Diederik P. Kingma and Max Welling. Auto-encoding Variational Bayes. In International Conference on Learning Representations, 2014.

Petar Kokotović and James Heller. Direct and adjoint sensitivity equations for parameter optimization. IEEE Transactions on Automatic Control, 12(5):609–610, October 1967.

Hiroshi Kunita. Stochastic differential equations and stochastic flows of diffeomorphisms. In P. L. Hennequin, editor, Ecole d’Été de Probabilités de Saint-Flour XII, volume 1097 of Lecture Notes in Mathematics. Springer-Verlag, 1984.

Joseph Lehec. Representation formula for the entropy and functional inequalities. Annales de l’Institut Henri Poincaré – Probabilités et Statistiques, 49(3):885–899, 2013.

Qianxiao Li, Long Chen, Cheng Tai, and Weinan E. Maximum principle based algorithms for deep learning. Journal of Machine Learning Research, 18:1–29, 2018.

Xuanqing Liu, Tesi Xiao, Si Si, Qin Cao, Sanjiv Kumar, and Cho-Jui Hsieh. Neural SDE: stabilizing neural ODE networks with stochastic noise, 2019. URL http://arxiv.org/abs/1906.02335.

G. N. Milstein and M. V. Tretyakov. Numerical algorithms for forward-backward stochastic differential equations. SIAM Journal on Scientific Computing, 28(2):561–582, 2006.

Javier R. Movellan, Paul Mineiro, and Ruth J. Williams. A Monte Carlo EM approach for partially observable diffusion processes: theory and applications to neural networks. Neural Computation, 14:1507–1544, 2002.

Stefano Peluchetti and Stefano Favaro. Neural stochastic differential equations, 2019. URL http://arxiv.org/abs/1905.11065.

Philip E. Protter. Stochastic Integration and Differential Equations. Springer, 2nd edition, 2005.

Chris Rackauckas, Mike Innes, Yingbo Ma, Jesse Bettencourt, Lyndon White, and Vaibhav Dixit. DiffEqFlux.jl – A Julia library for neural differential equations, 2019. URL http://arxiv.org/abs/1902.02376.

Rajesh Ranganath, Sean Gerrish, and David M. Blei. Black box variational inference. In Conference on Artificial Intelligence and Statistics, 2014.

Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 2014 International Conference on Machine Learning, pages 1278–1286, 2014.

Tom Ryder, Andrew Golightly, A. Steven McGough, and Dennis Prangle. Black-box variational inference for stochastic differential equations. In Proceedings of the 35th International Conference on Machine Learning, pages 4423–4432, 2018.

Michael Schober, Simo Särkkä, and Philipp Hennig. A probabilistic model for the numerical solution of initial value problems. Statistics and Computing, 29:99–122, 2019.

Esteban G. Tabak and Eric Vanden-Eijnden. Density estimation by dual ascent of the log-likelihood. Communications in Mathematical Sciences, 8(1):217–233, 2010.

Belinda Tzen and Maxim Raginsky. Theoretical guarantees for sampling and inference in generative models with latent diffusions. In Conference on Learning Theory, 2019.

Bao Wang, Binjie Yuan, Zuoqiang Shi, and Stanley J. Osher. ResNets ensemble via the Feynman–Kac formalism to improve natural and robust accuracies. In Neural Information Processing Systems, 2019.

Jiongmin Yong and Xun Yu Zhou. Stochastic Controls: Hamiltonian Systems and HJB Equations. Springer, 1999.

Figure 1: Log free energy for different discretization mesh sizes (sample size n = 1000).

Figure 2: Log free energy for different sample sizes (discretization mesh size h = 1/32).