
Learning Stochastic Optimal Policies via Gradient Descent

Stefano Massaroli1,⋆, Michael Poli2,⋆, Stefano Peluchetti3, Jinkyoo Park2, Atsushi Yamashita1 and Hajime Asama1

1 The University of Tokyo, [email protected]   2 KAIST, [email protected]   3 Cogent Labs, [email protected]   ⋆ Equal contribution.

Abstract—We systematically develop a learning-based treatment of stochastic optimal control (SOC), relying on direct optimization of parametric control policies. We propose a derivation of adjoint sensitivity results for stochastic differential equations through direct application of variational calculus. Then, given an objective function for a predetermined task specifying the desiderata for the controller, we optimize its parameters via iterative gradient descent methods. In doing so, we extend the range of applicability of classical SOC techniques, which often require strict assumptions on the functional form of system and control. We verify the performance of the proposed approach on a continuous–time, finite horizon portfolio optimization with proportional transaction costs.

I. INTRODUCTION

In this work we consider the following class of controlled stochastic dynamical systems:

    ẋ_t = f(t, x_t, u_t) + g(t, x_t, u_t) ξ(t)                                   (1)

with state x_t ∈ X ⊂ R^{n_x} and control policy u_t ∈ U ⊂ R^{n_u}. ξ(t) ∈ R^{n_ξ} is a stationary δ-correlated Gaussian noise, i.e. E[ξ(t)] = 0 for all t > 0 and E[ξ(s)ξ(t)] = δ(s − t) for all s, t such that 0 < s < t. The RHS of (1) comprises a drift term f : R × X × U → R^{n_x} and a diffusion term g : R × X × U → R^{n_x × n_ξ}. This paper develops a novel, systematic approach to learning optimal control policies for systems of the form (1) with respect to smooth scalar objective functions.

The link between stochastic optimal control (SOC) and learning has been explored in the discrete–time case [1], with policy iteration and value function approximation methods [2], [3] seeing widespread utilization in reinforcement learning [4]. Adaptive stochastic control has obtained explicit solutions through strict assumptions on the class of systems and objectives [5], preventing its applicability to the general case. Forward–backward SDEs (FBSDEs) have also been proposed to solve classic SOC problems [6], even employing neural approximators for value function dynamics [7], [8]. A further connection between SOC and machine learning has been discussed within the continuous–depth paradigm of neural networks [9], [10], [11], [12]; [13] e.g. showed that fully connected residual networks converge, in the infinite depth and width limit, to diffusion processes.

A different class of techniques for SOC involves the analytic [14] or approximate [15] solution of the Hamilton–Jacobi–Bellman (HJB) optimality conditions. These approaches either restrict the class of objective functions to preserve analytic tractability, or develop specific approximate methods which generally become intractable in high dimensions.

Here, we explore a different direction, motivated by the affinity between neural network training and optimal control, both of which rely on carefully crafted objective functions encoding task–dependent desiderata. In essence, the raison d'être of synthesizing optimal stochastic controllers through gradient–based techniques is to enrich the class of objectives for which an optimal controller can be found, as well as to scale tractability to high–dimensional regimes. This is in line with the empirical results of modern machine learning research, where large deep neural networks are routinely optimized on high–dimensional non–convex problems with outstanding performance [16], [17]. Gradient–based methods are also being explored in classic control settings to obtain a rigorous characterization of optimal control problems in the linear–quadratic regime [18], [19].

Notation: Let (Ω, F, P) be a probability space. If a property (event) A ∈ F holds with P(A) = 1, we say that such property holds almost surely. A family {x_t}_{t∈T} of X–valued random variables defined on a compact time domain T ⊂ R is called a stochastic process and is measurable if x_t(A) is measurable with respect to the σ-algebra B(T) × F, B(T) being the Borel algebra of T. As a convention, we use ∫_t^s = −∫_s^t if s < t and we denote with δ the Kronecker delta function.

II. STOCHASTIC DIFFERENTIAL EQUATIONS

Although (1) "looks like a differential equation, it is really a meaningless string of symbols" [20]. This relation should hence be treated only as a pre–equation and cannot be studied in this form. Such ill–posedness arises from the fact that, ξ(t) being a δ–autocorrelated process, the noise fluctuates an infinite number of times with infinite variance in any time interval (from a control–theoretic perspective, if ξ(t) ∈ R is, for instance, a white noise signal, its energy ∫_{−∞}^{∞} |F(ξ)(ω)| dω is not finite, F(·) denoting the Fourier transform). Therefore, a rigorous treatment of the model requires a different calculus to consistently interpret the integration of the RHS of (1). The resulting well–defined version of (1) is known as a stochastic differential equation (SDE) [21].

A. Itô–Stratonovich Dilemma

According to Van Kampen [20], ξ(t) may be thought of as a random sequence of δ functions causing, at each time t, a sudden jump of x_t. The controversy over the interpretation of (1) arises from the fact that it does not specify which value of x should be used to compute g when the δ functions are applied. There exist two main interpretations, namely Itô's [22] and Stratonovich's [23]. Itô prescribes that g be computed with the value of x before the jump, while Stratonovich uses the mean of the values of x before and after the jump. This choice leads to two different (yet both admissible and equivalent) types of integration. Formally, we consider a compact time horizon T = [0, T], T > 0, and let {B_t}_{t∈T} be the standard n_ξ–dimensional Wiener process defined on a filtered probability space (Ω, F, P; {F_t}_{t∈T}), which is such that B_0 = 0, B is almost surely continuous in t, nowhere differentiable, and has independent Gaussian increments, namely s, t ∈ T, s < t ⇒ B_t − B_s ∼ N(0, t − s); for all t ∈ T we have ξ(t)dt = dB_t. Moreover, let φ_t := g(t, x_t, u_t). The Itô and Stratonovich integrals are then defined as

    ∫_0^T φ_t dB_t = lim_{|D|→0} Σ_{k=1}^K φ_{t_{k−1}} (B_{t_k} − B_{t_{k−1}})

and

    ∫_0^T φ_t ∘ dB_t = lim_{|D|→0} (1/2) Σ_{k=1}^K (φ_{t_k} + φ_{t_{k−1}}) (B_{t_k} − B_{t_{k−1}}),

respectively, where D := {t_k : 0 = t_0 < t_1 < ··· < t_K = T} is a partition of T, |D| = max_k (t_k − t_{k−1}), and the limits are intended in the mean–square sense, if they exist (see [24]). Note that the symbol "∘" in ∘dB_t is used only to indicate that the integral is interpreted in the Stratonovich sense and does not stand for function composition. In the Stratonovich convention we may use the standard rules of calculus (see e.g. Itô's lemma in Stratonovich form [24, Theorem 2.4.1], where the chain rule of classic Riemann–Stieltjes calculus is shown to be preserved under the Stratonovich integral sign), while this is not the case for Itô's. This is because Stratonovich calculus corresponds to the limit case of a smooth process with a small finite auto–correlation approaching B_t [25]. Therefore, there are two different interpretations of (1):

    dx_t = f(t, x_t, u_t) dt + g(t, x_t, u_t) dB_t    and    dx_t = f(t, x_t, u_t) dt + g(t, x_t, u_t) ∘ dB_t.

Despite their name, SDEs are formally defined as integral equations due to the non–differentiability of the Brownian paths B_t.

From a control systems perspective, on the one hand the Stratonovich interpretation seems more appealing for noisy physical systems, as it captures the limiting case of multiplicative noise processes with a very small finite auto–correlation, often a reasonable assumption in real applications. However, if we need to treat diffusion terms as intrinsic components of the system's dynamics or impose martingale assumptions on the solutions of the SDE, the Itô convention should be adopted. In particular, this is always the case in financial applications, where stochastic optimal control is widely applied (see e.g. [26]).

Within the scope of this paper we will mainly adopt the Stratonovich convention for consistency of notation. Nonetheless, all results presented in this manuscript can be equivalently derived for the Itô case. Indeed, an Itô SDE can be transformed into a Stratonovich one (and vice versa) by the equivalence relation between the two calculi: for all t ∈ T,

    ∫_0^t φ_s ∘ dB_s = ∫_0^t φ_s dB_s + (1/2) ∫_0^t Σ_{i=1}^{n_ξ} (∂φ_s^{(i)}/∂x) φ_s^{(i)} ds.
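As a concrete illustration of this equivalence (not part of the original text), the following NumPy sketch simulates a scalar linear–noise SDE on one shared Brownian path, once in the Itô sense with the Euler–Maruyama scheme and once in the Stratonovich sense with the stochastic Heun scheme after applying the drift correction above. The coefficients are illustrative; for small step sizes the two endpoints approximately coincide.

```python
import numpy as np

# Itô SDE:            dx = a*x dt + s*x dB              (Euler–Maruyama)
# Equivalent Strat.:  dx = (a - 0.5*s**2)*x dt + s*x ∘ dB   (stochastic Heun)
a, s = 0.8, 0.5                       # illustrative drift / diffusion coefficients
x0, T, n = 1.0, 1.0, 20000
h = T / n
rng = np.random.default_rng(0)
dB = rng.normal(0.0, np.sqrt(h), n)   # Brownian increments, shared by both schemes

f_ito = lambda x: a * x
f_str = lambda x: (a - 0.5 * s**2) * x   # Itô -> Stratonovich drift correction
g     = lambda x: s * x

x_ito, x_str = x0, x0
for k in range(n):
    # Euler–Maruyama step (converges to the Itô solution)
    x_ito = x_ito + f_ito(x_ito) * h + g(x_ito) * dB[k]
    # Stochastic Heun step (converges to the Stratonovich solution)
    xp    = x_str + f_str(x_str) * h + g(x_str) * dB[k]                 # predictor
    x_str = x_str + 0.5 * (f_str(x_str) + f_str(xp)) * h \
                  + 0.5 * (g(x_str) + g(xp)) * dB[k]                    # corrector

# Both should be close to the exact solution x0*exp((a - s^2/2)*T + s*B_T).
print(x_ito, x_str, x0 * np.exp((a - 0.5 * s**2) * T + s * dB.sum()))
```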
B. Stratonovich SDEs

We formally introduce Stratonovich SDEs following the treatment of Kunita [27], [24]. For 0 ≤ s ≤ t ≤ T we denote by F_{s,t} the smallest σ-algebra containing all null sets of F with respect to which, for all v, w ∈ T such that s ≤ v ≤ w ≤ t, B_w − B_v is measurable. {F_{s,t}}_{s≤t; s,t∈T} is called the two–sided filtration generated by the Wiener process and we write F_t = F_{0,t} for compactness. Then, {F_t}_{t∈T} is itself a filtration with respect to which B_t is F_t–adapted. Further, let f, g be bounded in X, infinitely differentiable in x, continuously differentiable in t and uniformly continuous in u, and assume the controller {u_t}_{t∈T} to be an F_t–adapted process. Given an initial condition x_0 ∈ X, assumed to be an F_0–measurable random variable, we suppose that there exists an X–valued continuous F_t–adapted semi–martingale {x_t}_{t∈T} such that

    x_T = x_0 + ∫_0^T f(t, x_t, u_t) dt + ∫_0^T g(t, x_t, u_t) ∘ dB_t,             (2)

almost surely. Path–wise existence and uniqueness of solutions, i.e. that, for all t_0 ∈ T, two solutions x_t, x'_t with x_{t_0} = x'_{t_0} satisfy x_t = x'_t for all t ∈ T almost surely, is guaranteed under our class assumptions on f, g and the process {u_t}_{t∈T} (less strict sufficient conditions only require uniform Lipschitz continuity of f(t, x_t, u_t) + (1/2) Σ_{i=1}^{n_ξ} g^{(i)}(t, x_t, u_t) [∂g^{(i)}(t, x_t, u_t)/∂x] and of g(t, x_t, u_t) w.r.t. x, and uniform continuity w.r.t. t and u).

If, as assumed here, f, g are functions of class C^{1,∞} in (x_t, t), uniformly continuous in u with bounded derivatives w.r.t. x and t, and {u_t}_{t∈T} belongs to some admissible set A of control functions T → U, then, given a realization of the Wiener process, there exists a C^∞ mapping Φ, called the stochastic flow, from X × A to the space of absolutely continuous functions [s, t] → X such that

    x_t = Φ_s(x_s, {u_s}_{s≤t; s,t∈T})(t),    s ≤ t;  s, t ∈ T;  x_s ∈ X            (3)

almost surely. For the sake of compactness we denote the RHS of (3) by Φ_{s,t}(x_s). It is worth noticing that the collection {Φ_{s,t}}_{s≤t; s,t∈T} satisfies the flow property (see [24, Sec. 3.1]),

    ∀ s, t, v ∈ T : s < v < t ⇒ Φ_{v,t}(Φ_{s,v}(x_s)) = Φ_{s,t}(x_s),

and that it is also a diffeomorphism [24, Theorem 3.7.1], i.e. there exists an inverse flow Ψ_{s,t} := Φ_{s,t}^{−1} which satisfies the backward SDE

    Ψ_{s,t}(x_s) = x_s − ∫_s^t f(v, Ψ_{s,v}(x_s), u_v) dv − ∫_s^t g(v, Ψ_{s,v}(x_s), u_v) ∘ dB̂_v,

{B̂_t}_{t∈T} being the realization of the backward Wiener process defined as B̂_t := B_t − B_T for all t ∈ T. The diffeomorphism property of Φ_{s,t} thus yields Ψ_{s,t}(Φ_{s,t}(x_s)) = x_s. Therefore, it is possible to reverse the solutions of Stratonovich SDEs in a similar fashion to ordinary differential equations (ODEs) by storing/generating identical realizations of the noise process. From a practical point of view, under mild assumptions on the chosen numerical SDE scheme, approximated solutions Φ̂_{s,t}(x_s) satisfy, for all x_s ∈ X, Ψ̂_{s,t}(Φ̂_{s,t}(x_s)) → x_s in probability as the discretization time-step ε → 0 [28, Theorem 3.3].
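The reversal property is easy to probe numerically. The sketch below (added for illustration only; plain NumPy, toy coefficients) integrates a scalar Stratonovich SDE forward with the stochastic Heun scheme while storing the Brownian increments, then replays the same increments in reverse to integrate the backward SDE; the reconstructed initial state approaches the true one as the step size shrinks.

```python
import numpy as np

# Toy Stratonovich SDE  dx = f(x) dt + g(x) ∘ dB   (coefficients are illustrative)
f = lambda x: -1.0 * x + 0.3
g = lambda x: 0.4 * np.sin(x) + 0.6

def heun_step(x, h, db):
    """One Stratonovich (Heun) step of size h with Brownian increment db."""
    xp = x + f(x) * h + g(x) * db                                   # predictor
    return x + 0.5 * (f(x) + f(xp)) * h + 0.5 * (g(x) + g(xp)) * db # corrector

T, n, x0 = 1.0, 50000, 1.2
h = T / n
rng = np.random.default_rng(1)
dB = rng.normal(0.0, np.sqrt(h), n)

# Forward pass x0 -> xT, storing only the noise increments.
x = x0
for k in range(n):
    x = heun_step(x, h, dB[k])
xT = x

# Backward pass: replay the same increments in reverse with flipped signs,
# i.e. integrate the backward SDE driven by the backward Wiener process.
y = xT
for k in reversed(range(n)):
    y = heun_step(y, -h, -dB[k])

print(x0, y)   # y ≈ x0, with the reconstruction error shrinking as h -> 0
```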
III. DIRECT STOCHASTIC OPTIMAL CONTROL

We consider the optimal control problem for systems of class (1) interpreted in the Stratonovich sense. In particular, we aim at determining a control process {u_t}_{t∈T}, within an admissible set of functions A, minimizing or maximizing in expectation some scalar criterion J_u of the form

    J_u(x_0) = ∫_T γ(t, Φ_{0,t}(x_0), u_t) dt + Γ(Φ_{0,T}(x_0), u_T),              (4)

measuring the performance of the system. Within the scope of the paper, we characterize A as the set of all control processes {u_t}_{t∈T} which are F_t–adapted, take values in U and satisfy ∫_T E[‖u_t‖_2^2] dt < ∞. To emphasize the (implicit) dependence of J_u(x_0) on the realization of the Wiener process, we often explicitly write J_u(x_0, {B_t}_{t∈T}).

A. Stochastic Gradient Descent

In this work we propose to directly seek the optimal controller via mini-batch stochastic gradient descent (SGD) (or ascent) optimization [29], [30]. The algorithm works as follows: given a scalar criterion J_α(x_0, {B_t}_{t∈T}) dependent on the variable α and on the Brownian path, it attempts to compute α* = arg min_α E[J_α(x_0, {B_t}_{t∈T})] or α* = arg max_α E[J_α(x_0, {B_t}_{t∈T})], starting from an initial guess α^0 and updating a candidate solution α^k recursively as

    α^{k+1} = α^k ± (η_k / N) Σ_{i=1}^N (d/dα) J_{α^k}(x_0, {B_t^i}_{t∈T}),

where N is a predetermined number of independent and identically distributed samples {B_t^i}_{t∈T} of the Wiener process, η_k is a positive scalar learning rate and the sign of the update depends on the minimization or maximization nature of the problem. If η_k is suitably chosen and J_α is convex, α^k converges in expectation to the minimizer of J_α as k → ∞. Although global convergence is no longer guaranteed in the non–convex case, SGD–based techniques have been employed across application areas due to their scaling and unique convergence characteristics. These methods have further been refined over time to show remarkable results even in non–convex settings [31], [32].

B. Finite Dim. Optimization via Neural Approximators

Consider the case in which the criterion J_u has to be minimized. Locally optimal control processes u = {u_t}_{t∈T} ∈ A could be obtained, in principle, by iterating the SGD algorithm for the criterion J_u in the set A of admissible control processes. Since A is at least in L^2(T → U), local convergence of SGD is preserved in function space [33]. An idealized function–space SGD algorithm would compute iterates of the form

    u^{k+1} = u^k − (η_k / N) Σ_{i=1}^N (δ/δu) J_u(x_s, {B_t^i}_{t∈T}),

where u^k is the solution at the k-th SGD iteration, δ/δ(·) is the variational (functional) derivative and J_u satisfies δE[J_u(x_s, {B_t}_{t∈T})]/δu = E[δJ_u(x_s, {B_t}_{t∈T})/δu]. At each training step of the controller, this approach would thus perform N independent path simulations, compute the criterion J_u and apply the gradient descent update. While local convergence to optimal solutions is still ensured under mild regularity and smoothness assumptions, obtaining derivatives in function space turns out to be computationally intractable: any infinite–dimensional algorithm needs to be discretized in order to be practically implementable. We therefore reduce the problem to finite dimension by approximating u_t^k with a neural network (here, by neural network we mean any learnable parametric function u_θ with some approximation power for specific classes of functions).

Let u_{θ,t} : θ, t ↦ u_{θ,t} be a neural network with parameters θ ∈ R^{n_θ}, which we use as a functional approximator for candidate optimal controllers, i.e. u_{θ,t} ≈ u_t* for all t ∈ T. Further, we denote f_θ(t, x_t) := f(t, x_t, u_{θ,t}), g_θ(t, x_t) := g(t, x_t, u_{θ,t}) and the optimization criterion

    J_θ(x_0) = ∫_T γ_θ(t, Φ_{s,t}(x_0)) dt + Γ_θ(T, Φ_{s,T}(x_0))                  (5)

with γ_θ(·, ·) = γ(·, ·, u_{θ,t}) and Γ_θ(·, ·) = Γ(·, ·, u_{θ,T}). Then, the optimization problem turns into finding the optimal parameters θ via gradient descent, by iterating

    θ^{k+1} = θ^k − (η_k / N) Σ_{i=1}^N (d/dθ) J_{θ^k}(x_0, {B_t^i}_{t∈T}).        (6)

If strong constraints over the set of admissible controllers A are imposed, the approximation problem can instead be rewritten onto a complete orthogonal basis of A, with u_θ parameterized by a truncation of the eigenfunction expansion rather than by a neural network.

Remark 1 (Heuristics): As common practice in machine learning, the proposed approach relies heavily on the following empirical assumptions:
i. the numerical estimate of the mean gradient will be accurate enough to track the direction of the true dJ_θ/dθ (here, estimation errors of the gradient are introduced by the numerical discretization of the SDE and by the finiteness of independent path simulations);
ii. the locally optimal controller reached by gradient descent/ascent will be good enough to satisfy the performance requirements of the control problem.
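Putting Sections III-A and III-B together, the sketch below (not from the paper; plain NumPy, toy dynamics and illustrative hyperparameters) trains a small parametric state-feedback controller on a scalar controlled Stratonovich SDE by averaging path-wise gradients over a mini-batch of simulated Brownian paths, as in (6). For compactness, the path-wise gradient dJ_θ/dθ is estimated here by central finite differences with common random numbers; the forward and adjoint sensitivities derived next replace this estimate in practice.

```python
import numpy as np

rng = np.random.default_rng(2)
T, n = 1.0, 200
h = T / n

# Toy controlled SDE  dx = (-x + u) dt + 0.2 ∘ dB  with a quadratic cost (illustrative).
def policy(theta, x):                       # u_theta(x): linear state feedback
    return theta[0] + theta[1] * x

def simulate_cost(theta, dB, x0=1.0):
    """Heun (Stratonovich) rollout of one path; returns the path-wise cost J_theta."""
    x, J = x0, 0.0
    for k in range(n):
        u = policy(theta, x)                # control frozen within the step
        drift = lambda z: -z + u
        diff  = lambda z: 0.2
        xp = x + drift(x) * h + diff(x) * dB[k]
        xn = x + 0.5 * (drift(x) + drift(xp)) * h + 0.5 * (diff(x) + diff(xp)) * dB[k]
        J += (x**2 + 0.1 * u**2) * h        # running cost gamma
        x = xn
    return J + x**2                         # terminal cost Gamma

theta, eta, N, eps = np.array([0.0, 0.0]), 0.2, 8, 1e-4
for it in range(60):                        # SGD iterations, eq. (6)
    grad = np.zeros_like(theta)
    for _ in range(N):                      # mini-batch of Brownian paths
        dB = rng.normal(0.0, np.sqrt(h), n)
        for j in range(len(theta)):         # finite-difference path-wise gradient
            ej = np.zeros_like(theta); ej[j] = eps
            grad[j] += (simulate_cost(theta + ej, dB) -
                        simulate_cost(theta - ej, dB)) / (2 * eps)
    theta -= eta * grad / N                 # descent step (minimization)
print("learned feedback parameters:", theta)
```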

In order to perform the above gradient descent we therefore need to compute the gradient (i.e. the sensitivity) of J_θ with respect to the parameters θ in a computationally efficient manner. In the following we detail different approaches to differentiate through SDE solutions.

IV. COST GRADIENTS AND ADJOINT DYNAMICS

The most straightforward approach for computing path–wise gradients (i.e. independently for each realization of {B_t}_{t∈T}) is by directional differentiation (differentiability under the integral sign follows from our smoothness assumptions), i.e.

    dJ_θ/dθ = ∫_T [ (∂γ_θ/∂x_t)(dx_t/dθ) + ∂γ_θ/∂θ ] dt + (∂Γ_θ/∂x_T)(dx_T/dθ) + ∂Γ_θ/∂θ,

where the quantities dx_t/dθ and dx_T/dθ can be obtained with the following result.

Proposition 1 (Path–wise Forward Sensitivity): Let S_t = dx_t/dθ. Then S_t is an {F_t}–adapted process satisfying

    dS_t = [ (∂f_θ/∂x_t) S_t + ∂f_θ/∂θ ] dt + Σ_{i=1}^{n_ξ} [ (∂g_θ^{(i)}/∂x_t) S_t + ∂g_θ^{(i)}/∂θ ] ∘ dB_t^{(i)}.

Proof: The proof is an extension of the forward sensitivity analysis for ODEs (see [34, Sec. 3.4]) to the stochastic case. Given the SDE of interest

    x_t = x_0 + ∫_T f_θ(t, x_t) dt + ∫_T g_θ(t, x_t) ∘ dB_t,

differentiating under the integral sign w.r.t. θ gives

    dx_t/dθ = ∫_T [ (∂f_θ/∂x_t)(dx_t/dθ) + ∂f_θ/∂θ ] dt + ∫_T Σ_{i=1}^{n_ξ} [ (∂g_θ^{(i)}/∂x_t)(dx_t/dθ) + ∂g_θ^{(i)}/∂θ ] ∘ dB_t^{(i)}    (7)

and the result follows setting S_t = dx_t/dθ. That differentiating under the integral sign is allowed follows from our assumptions on f_θ and g_θ and from an application of [24, Lemma 3.7.1] to the augmented SDE in (x_t, θ).
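Proposition 1 can be checked numerically on a scalar example. The sketch below (illustrative, not from the paper; plain NumPy) integrates the augmented pair (x_t, S_t) with a Stratonovich Heun scheme for a toy one-parameter system and compares S_T with a finite-difference estimate of dx_T/dθ computed on the same Brownian path.

```python
import numpy as np

# Toy scalar system (illustrative): dx = -theta*x dt + sigma*x ∘ dB, one parameter theta.
# Augmented sensitivity S_t = dx_t/dtheta follows Proposition 1:
#   dS = (df/dx * S + df/dtheta) dt + (dg/dx * S + dg/dtheta) ∘ dB
#      = (-theta*S - x) dt + (sigma*S) ∘ dB          (dg/dtheta = 0 here)
sigma = 0.3
T, n, x0 = 1.0, 20000, 1.0
h = T / n
rng = np.random.default_rng(3)
dB = rng.normal(0.0, np.sqrt(h), n)    # one fixed Brownian path

def heun_aug(theta):
    """Stratonovich (Heun) integration of the augmented pair (x_t, S_t)."""
    drift = lambda x, S: np.array([-theta * x, -theta * S - x])
    diff  = lambda x, S: np.array([sigma * x, sigma * S])
    z = np.array([x0, 0.0])            # S_0 = dx_0/dtheta = 0
    for k in range(n):
        zp = z + drift(*z) * h + diff(*z) * dB[k]
        z  = z + 0.5 * (drift(*z) + drift(*zp)) * h + 0.5 * (diff(*z) + diff(*zp)) * dB[k]
    return z                           # (x_T, S_T)

theta = 0.7
xT, ST = heun_aug(theta)
eps = 1e-5                             # finite-difference check on the same path
dxT_fd = (heun_aug(theta + eps)[0] - heun_aug(theta - eps)[0]) / (2 * eps)
print(ST, dxT_fd)                      # the two values should agree closely
```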

The main issue with the forward sensitivity approach is its curse of dimensionality with respect to the number of parameters n_θ and the state dimension n_x, as it requires solving an n_x × n_θ matrix–valued SDE over the whole time horizon T. At each integration step the full Jacobians ∂f_θ/∂θ, ∂g_θ/∂θ are required, causing memory and computational overheads in software implementations. This issue can be overcome by introducing adjoint backward gradients.

Theorem 1 (Path–wise Backward Adjoint Gradients): Consider the cost function (5) and let λ ∈ C^1(T → X) be a Lagrange multiplier. Then dJ_θ/dθ is given by

    dJ_θ/dθ = ∂Γ_θ/∂θ + ∫_T [ λ_t^⊤ ∂f_θ/∂θ + ∂γ_θ/∂θ ] dt + Σ_{i=1}^{n_ξ} ∫_T λ_t^⊤ ∂g_θ^{(i)}/∂θ ∘ dB_t^{(i)},

where the Lagrange multiplier λ_t satisfies the following backward Stratonovich SDE:

    dλ_t^⊤ = −[ λ_t^⊤ ∂f_θ/∂x_t + ∂γ_θ/∂x_t ] dt − Σ_{i=1}^{n_ξ} λ_t^⊤ ∂g_θ^{(i)}/∂x_t ∘ dB̂_t^{(i)},    λ_T^⊤ = ∂Γ_θ/∂x_T.

Lemma 1 (Line integration by parts): Let T be a compact subset of R and r : T → R^n, v : T → R^n such that, for all t ∈ T, t ↦ r_t := r(t) ∈ C^1 and t ↦ v_t := v(t) ∈ C^1. We have

    ∫_T ⟨r_t, dv_t⟩ = ⟨r_T, v_T⟩ − ⟨r_0, v_0⟩ − ∫_T ⟨dr_t, v_t⟩.

Proof: The proof follows immediately from integration by parts using dr_t = ṙ_t dt, dv_t = v̇_t dt.

Proof (Theorem 1): Let L be the Lagrangian of the optimization problem and the process λ_t ∈ C^1(T → X) an F_t–adapted Lagrange multiplier. We consider L of the form

    L = J_θ(x_0) − ∫_T ⟨λ_t, dx_t − f_θ(t, x_t) dt − g_θ(t, x_t) ∘ dB_t⟩
      = Γ_θ(x_T) + ∫_T γ_θ(t, x_t) dt − ∫_T ⟨λ_t, dx_t − f_θ(t, x_t) dt − g_θ(t, x_t) ∘ dB_t⟩.

Since dx_t − f_θ(t, x_t) dt − g_θ(t, x_t) ∘ dB_t = 0 by construction, the integral term is always null, dL/dθ = dJ_θ(x_0)/dθ and the Lagrange multiplier process can be freely assigned. Thus,

    dJ_θ/dθ = dL/dθ = ∂Γ_θ/∂θ + (∂Γ_θ/∂x_T)(dx_T/dθ) + (d/dθ) ∫_T γ_θ(t, x_t) dt − (d/dθ) ∫_T ⟨λ_t, dx_t − f_θ(t, x_t) dt − g_θ(t, x_t) ∘ dB_t⟩.

Note that, by Lemma 1, ∫_T ⟨λ_t, dx_t⟩ = ⟨λ_t, x_t⟩|_0^T − ∫_T ⟨dλ_t, x_t⟩. For compactness, we omit the arguments of all functions unless needed. We have

    dJ_θ/dθ = ∂Γ_θ/∂θ + [ ∂Γ_θ/∂x_T − λ_T^⊤ ](dx_T/dθ) + λ_0^⊤ (dx_0/dθ)
              + ∫_T [ ∂γ_θ/∂θ + (∂γ_θ/∂x_t)(dx_t/dθ) ] dt + ∫_T dλ_t^⊤ (dx_t/dθ)
              + ∫_T λ_t^⊤ [ ∂f_θ/∂θ + (∂f_θ/∂x_t)(dx_t/dθ) ] dt
              + Σ_{i=1}^{n_ξ} ∫_T λ_t^⊤ [ ∂g_θ^{(i)}/∂θ + (∂g_θ^{(i)}/∂x_t)(dx_t/dθ) ] ∘ dB_t^{(i)},

where the term λ_0^⊤ (dx_0/dθ) vanishes since x_0 does not depend on θ. Reorganizing the terms leads to

    dJ_θ/dθ = ∂Γ_θ/∂θ + ∫_T [ ∂γ_θ/∂θ + λ_t^⊤ ∂f_θ/∂θ ] dt + Σ_{i=1}^{n_ξ} ∫_T λ_t^⊤ ∂g_θ^{(i)}/∂θ ∘ dB_t^{(i)}
              + [ ∂Γ_θ/∂x_T − λ_T^⊤ ](dx_T/dθ)
              + ∫_T [ dλ_t^⊤ + ( λ_t^⊤ ∂f_θ/∂x_t + ∂γ_θ/∂x_t ) dt + Σ_{i=1}^{n_ξ} λ_t^⊤ ∂g_θ^{(i)}/∂x_t ∘ dB_t^{(i)} ] (dx_t/dθ).

Finally, if the process λ_t satisfies the backward SDE dλ_t^⊤ = −[ λ_t^⊤ ∂f_θ/∂x_t + ∂γ_θ/∂x_t ] dt − Σ_{i=1}^{n_ξ} λ_t^⊤ ∂g_θ^{(i)}/∂x_t ∘ dB̂_t^{(i)}, with λ_T^⊤ = ∂Γ_θ/∂x_T, the criterion gradient is simply obtained as

    dJ_θ/dθ = ∂Γ_θ/∂θ + ∫_T [ λ_t^⊤ ∂f_θ/∂θ + ∂γ_θ/∂θ ] dt + Σ_{i=1}^{n_ξ} ∫_T λ_t^⊤ ∂g_θ^{(i)}/∂θ ∘ dB_t^{(i)}.

Note that if the integral term of the objective function is defined point–wise at a finite number of time instants t_k, i.e. ∫_T Σ_{k=1}^K γ_θ(t_k, Φ_{s,t_k}(x_s)) δ(t − t_k) dt, then the RHS of the adjoint state SDE becomes piece–wise continuous in (t_k, t_{k+1}], yielding the hybrid stochastic dynamics

    dλ_t = −(∂f_θ/∂x_t) λ_t dt − Σ_{i=1}^{n_ξ} (∂g_θ^{(i)}/∂x_t) λ_t ∘ dB̂_t^{(i)},    t ∈ (t_k, t_{k+1}],
    λ_t^− = λ_t + ∂γ_θ/∂x_t,    t = t_k,

where λ_t^− indicates the value of λ_t after a discrete jump (the adjoint being integrated backward in time), or, formally, λ_t^− = lim_{s↗t^−} λ_s.
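The backward adjoint computation of Theorem 1 can likewise be sanity-checked on a scalar toy problem. In the sketch below (illustrative only; plain NumPy; the forward trajectory is stored on the grid rather than reconstructed by reversing the SDE as in [28]), the adjoint λ_t is integrated backward over the same Brownian increments and the resulting gradient is compared with a finite-difference estimate on the same path.

```python
import numpy as np

# Toy scalar problem (illustrative): Stratonovich dynamics dx = -theta*x dt + sigma*x ∘ dB,
# cost J = ∫ x_t^2 dt + x_T^2.  Theorem 1 gives dJ/dtheta = ∫ lambda_t * (df/dtheta) dt,
# with the adjoint integrated backward:
#   dlambda = -(lambda*df/dx + dgamma/dx) dt - lambda*dg/dx ∘ dB_hat,  lambda_T = dGamma/dx_T.
theta, sigma = 0.7, 0.3
T, n, x0 = 1.0, 20000, 1.0
h = T / n
rng = np.random.default_rng(4)
dB = rng.normal(0.0, np.sqrt(h), n)

def forward(th):
    """Heun rollout; returns the stored state grid and the path-wise cost."""
    xs = np.empty(n + 1); xs[0] = x0
    J = 0.0
    for k in range(n):
        x = xs[k]
        xp = x + (-th * x) * h + sigma * x * dB[k]
        xs[k + 1] = x + 0.5 * (-th) * (x + xp) * h + 0.5 * sigma * (x + xp) * dB[k]
        J += 0.5 * (x**2 + xs[k + 1]**2) * h              # trapezoidal running cost
    return xs, J + xs[-1]**2

xs, _ = forward(theta)

# Backward adjoint pass over the stored grid (Heun-type corrector), accumulating dJ/dtheta.
lam, grad = 2.0 * xs[-1], 0.0                             # lambda_T = dGamma/dx_T
for k in reversed(range(n)):
    rhs = lambda l, x: (-theta * l + 2.0 * x) * h + sigma * l * dB[k]
    lp = lam + rhs(lam, xs[k + 1])                        # predictor for lambda at t_k
    lam_k = lam + 0.5 * (rhs(lam, xs[k + 1]) + rhs(lp, xs[k]))
    grad += 0.5 * (lam * (-xs[k + 1]) + lam_k * (-xs[k])) * h   # lambda * df/dtheta
    lam = lam_k
print("adjoint gradient :", grad)

eps = 1e-5                                                # finite-difference check, same path
print("finite difference:", (forward(theta + eps)[1] - forward(theta - eps)[1]) / (2 * eps))
# The two estimates should approximately agree for small step sizes.
```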

V. EXPERIMENTAL EVALUATION

We evaluate the proposed approach on a classical problem in financial mathematics, continuous–time portfolio optimization [26]. We consider the challenging finite–horizon, transaction cost case [35].

A. Optimal Portfolio Allocation

Consider a two–asset market with proportional transaction costs in which all securities are perfectly divisible. Suppose that in such a market all assets are traded continuously and that one of them is riskless, i.e. it evolves according to the ODE

    dV_t = r_t V_t dt,    V_s = v ∈ R,

where the interest rate r_t is any F_t–adapted non–negative scalar–valued process. The other asset is a stock S_t ∈ R whose dynamics satisfy the Itô SDE

    dS_t = μ_t S_t dt + σ_t S_t dB_t,    S_s = s ∈ R,

with expected rate of return μ : T → R and instantaneous volatility σ : T → R_+, both F_t–adapted processes.

Now, let us consider an agent who invests in such a market and is interested in rebalancing a two–asset portfolio through buy and sell actions. We obtain

    dV_t = r_t V_t dt − dπ_t^i + (1 − α) dπ_t^d,
    dS_t = μ_t S_t dt + σ_t S_t dB_t + dπ_t^i − dπ_t^d,                            (8)

where π_t^i, π_t^d are nondecreasing, F_t–adapted processes indicating the cumulative amounts of sales and purchases of the risky asset, and α > 0 is the proportional transaction cost. Here we let dπ_t^i = u_t^i dt, dπ_t^d = u_t^d dt and u_t = (u_t^i, u_t^d). An optimal strategy u_t for such a portfolio requires the specification of a utility function encoding the risk aversion and final objective of the investor. Explicit results for the finite horizon, transaction cost case exist only for specific classes of utilities, such as isoelastic ones [35]. As a further complication, the horizon length itself is known to greatly affect the optimal strategy. As an illustrative example, we consider here a portfolio optimality criterion where high levels of stock value are penalized, perhaps due to hedging considerations related to other already existing portfolios in possession of the hypothetical trader:

    J_u = ∫_0^T ( r_t V_t + μ_t S_t − ν S_t^2 σ_t^2 ) dt + r_T V_T + μ_T S_T,

to be maximized in expectation for some constant ν > 0. The same methodology can be directly extended to cover generic objectives, including e.g. different risk measures, complex stock and interest rate dynamics, and more elaborate transaction cost accounting rules. We then obtain a parameterization u_{θ,t}^i, u_{θ,t}^d of the policies u_t^i, u_t^d as follows. In particular, we define feed–forward neural networks N_θ : R^2 → R^2; (S_t, V_t) ↦ (u_{θ,t}^i, u_{θ,t}^d) taking as input the instantaneous values of the assets,

    u_{θ,t} = N_θ(V_t, S_t) = ℓ_L ∘ φ ∘ ℓ_{L−1} ∘ φ ∘ ··· ∘ φ ∘ ℓ_0(V_t, S_t),

where the ℓ_k are affine operators, ℓ_k x = A_k x + b_k, A_k ∈ R^{n_{k+1} × n_k}, b_k ∈ R^{n_{k+1}}, and φ : R → R is any nonlinear activation function understood to act element–wise. The vector of parameters θ is thus the flattened collection of all weights A_k and biases b_k.
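For concreteness, the sketch below (not from the paper; plain NumPy) rolls out the controlled dynamics (8) under a small softplus-capped feed-forward policy and returns a Monte Carlo estimate of J_u. The market and cost parameters follow Section V-B, while the network size and discretization are illustrative; in the experiments this forward pass is wrapped in the gradient-based loop of Section III, with gradients supplied by the adjoint of Section IV (or, in practice, by automatic differentiation).

```python
import numpy as np

rng = np.random.default_rng(5)
r, mu, sigma, alpha, nu = 0.04, 0.23, 0.18, 0.05, 0.5   # market / cost parameters
T, n = 1.0, 200
h = T / n

def init_policy(hidden=16):
    """Small feed-forward net (S, V) -> (u_buy, u_sell) with softplus outputs."""
    return {"A0": rng.normal(0, 0.3, (hidden, 2)), "b0": np.zeros(hidden),
            "A1": rng.normal(0, 0.3, (2, hidden)), "b1": np.zeros(2)}

def policy(p, S, V):
    z = np.tanh(p["A0"] @ np.array([S, V]) + p["b0"])
    return np.log1p(np.exp(p["A1"] @ z + p["b1"]))       # softplus keeps controls >= 0

def criterion(p, n_paths=50):
    """Monte Carlo estimate of J_u under policy p (Euler–Maruyama, Itô sense)."""
    Js = []
    for _ in range(n_paths):
        S, V, J = 1.0, 0.0, 0.0                           # portfolio initialized at (1, 0)
        for _k in range(n):
            ui, ud = policy(p, S, V)                      # buy / sell rates
            J += (r * V + mu * S - nu * S**2 * sigma**2) * h
            dBk = rng.normal(0.0, np.sqrt(h))
            V += (r * V - ui + (1 - alpha) * ud) * h
            S += mu * S * h + sigma * S * dBk + (ui - ud) * h
        Js.append(J + r * V + mu * S)                     # terminal reward
    return float(np.mean(Js))

p = init_policy()
print("estimated J_u under the initial policy:", criterion(p))
```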
B. Numerical Results

In particular, we choose three–layer feed–forward neural networks with thirty–two neurons per layer, capped with a softplus activation φ(x) = log(1 + exp(x)) to ensure that π_θ^i, π_θ^d are nondecreasing processes. The optimization procedure for the strategy parameters θ has been carried out through 100 iterations of the Adam [36] gradient–based optimizer with step size 0.03. The asset dynamics and cost parameters are set as α = 0.05, r = 0.04, μ = 0.23, σ = 0.18. We produce 50 path realizations during each training iteration and average the gradients. The asset dynamics are interpreted in the Itô sense and have been solved forward with the Itô–Milstein method [37], which has been converted to Stratonovich–Milstein for the backward adjoint SDE.

Figure 1 shows different realizations of portfolio trajectories in the (S_t, V_t) state–space for four different values of ν, [0, 0.25, 0.5, 1]. The trajectories agree with plausible strategies for each investment style; risk–seeking investors (ν = 0) sell V_t to maximize the amount of stock owned at final time T, and hence the potential portfolio growth. On the other hand, moderately risk–averse investors (ν = 0.25) learn hybrid strategies where an initial phase of stock purchasing is followed by a movement towards a more conservative portfolio with a balanced amount of S_t and V_t. The policies u_θ^i, u_θ^d are visualized in Figure 2, confirming the highly non–linear nature of the learned controller. It should be noted that, for a strategy to be allowed, the portfolio has to remain above the solvency line V_t + (1 − α) S_t ≥ 0 [35], indicated in the plot. We empirically observe automatic satisfaction of this specific constraint for all but the investors with lowest risk aversion. For the case ν = 0, we introduce a simple logarithmic barrier function as a soft constraint [38].

Fig. 1. Realization of (S_t, V_t) trajectories under learned strategies for four investors with different degrees of risk aversion, encoded in ν, with a portfolio initialized at (1, 0) at time 0. Different path realizations generated by the strategies are shown, along with their mean. [Left] Portfolio evolution against the expected return r_t V_t + μ_t S_t. [Right] Portfolio evolution against the volatility term S_t^2 σ_t^2. Each strategy suits the corresponding risk aversion level.

VI. RELATED WORK

Computing path–wise sensitivities through stochastic differential equations has been extensively explored in the literature. In particular, forward sensitivities with respect to the initial conditions of the solutions, analogous to Proposition 1, were first introduced in [27] and extended to parameter sensitivities in the Itô case in [39]. These approaches rely on integrating forward matrix–valued SDEs whose drift and diffusion require the full Jacobians of f_θ and g_θ at each integration step. Thus, such methods scale poorly in computational cost with large numbers of parameters and in high–dimensional–state regimes. This issue is averted by using backward adjoint sensitivities [40], where a vector–valued SDE is integrated backward and only vector–Jacobian products need to be evaluated. In this direction, [28] proposed to solve the system's backward SDE alongside the adjoint SDE to recover the value of the state x_t and thus improve the memory footprint of the algorithm. Further, that approach is derived as a special case of [24, Theorem 2.4.1] and only considers criteria J_θ depending on the final state x_T of the system. Our stochastic adjoint process extends the results of [28] to integral criteria potentially exhibiting explicit parameter dependence. Further, we proposed a novel proving strategy based on classic variational analysis.

Fig. 2. Learned buy–sell strategies u_θ^i, u_θ^d over the state–space (S_t, V_t). The strategies are highly non–linear, and empirically agree with investment strategies at different risk aversion degrees. For example, the risk–seeking investor (ν = 0) balances u_θ^i, u_θ^d to purchase the maximum amount of stock possible while remaining above the solvency line V_t + (1 − α) S_t ≥ 0.

REFERENCES

[1] D. P. Bertsekas and S. Shreve, Stochastic Optimal Control: The Discrete-Time Case, 2004.
[2] H. Yu and D. P. Bertsekas, "Q-learning and policy iteration algorithms for stochastic shortest path problems," Annals of Operations Research, vol. 208, no. 1, pp. 95–132, 2013.
[3] Y. Wu and T. Shen, "Policy iteration algorithm for optimal control of stochastic logical dynamical systems," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 5, pp. 2031–2036, 2017.
[4] S. Levine, "Reinforcement learning and control as probabilistic inference: Tutorial and review," arXiv preprint arXiv:1805.00909, 2018.
[5] E. Tse and M. Athans, "Adaptive stochastic control for a class of linear systems," IEEE Transactions on Automatic Control, vol. 17, no. 1, pp. 38–52, 1972.
[6] S. Peng and Z. Wu, "Fully coupled forward-backward stochastic differential equations and applications to optimal control," SIAM Journal on Control and Optimization, vol. 37, no. 3, pp. 825–843, 1999.
[7] M. A. Pereira and Z. Wang, "Learning deep stochastic optimal control policies using forward-backward SDEs," in Robotics: Science and Systems, 2019.
[8] Z. Wang, K. Lee, M. A. Pereira, I. Exarchos, and E. A. Theodorou, "Deep forward-backward SDEs for min-max control," in 2019 IEEE 58th Conference on Decision and Control (CDC). IEEE, 2019, pp. 6807–6814.
[9] E. Weinan, "A proposal on machine learning via dynamical systems," Communications in Mathematics and Statistics, vol. 5, no. 1, pp. 1–11, 2017.
[10] R. T. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud, "Neural ordinary differential equations," arXiv preprint arXiv:1806.07366, 2018.
[11] S. Massaroli, M. Poli, J. Park, A. Yamashita, and H. Asama, "Dissecting neural ODEs," arXiv preprint arXiv:2002.08071, 2020.
[12] S. Massaroli, M. Poli, M. Bin, J. Park, A. Yamashita, and H. Asama, "Stable neural flows," arXiv preprint arXiv:2003.08063, 2020.
[13] S. Peluchetti and S. Favaro, "Infinitely deep neural networks as diffusion processes," in International Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. 1126–1136.
[14] L. Arnold, "Stochastic differential equations," New York, 1974.
[15] H. J. Kushner and P. G. Dupuis, Numerical Methods for Stochastic Control Problems in Continuous Time. Springer Science & Business Media, 2001, vol. 24.
[16] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85–117, 2015.
[17] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, Cambridge, 2016, vol. 1, no. 2.
[18] L. Furieri, Y. Zheng, and M. Kamgarpour, "Learning the globally optimal distributed LQ regulator," in Learning for Dynamics and Control. PMLR, 2020, pp. 287–297.
[19] J. Bu, A. Mesbahi, M. Fazel, and M. Mesbahi, "LQR through the lens of first order methods: Discrete-time case," arXiv preprint arXiv:1907.08921, 2019.
[20] N. G. Van Kampen, "Itô versus Stratonovich," Journal of Statistical Physics, vol. 24, no. 1, pp. 175–187, 1981.
[21] B. Øksendal, "Stochastic differential equations," in Stochastic Differential Equations. Springer, 2003, pp. 65–84.
[22] K. Itô, "Differential equations determining a Markoff process," J. Pan-Japan Math. Coll., vol. 1077, pp. 1352–1400, 1942.
[23] R. Stratonovich, "A new representation for stochastic integrals and equations," SIAM Journal on Control, vol. 4, no. 2, pp. 362–371, 1966.
[24] H. Kunita, Stochastic Flows and Jump-Diffusions. Springer.
[25] E. Wong and M. Zakai, "On the convergence of ordinary integrals to stochastic integrals," The Annals of Mathematical Statistics, vol. 36, no. 5, pp. 1560–1564, 1965.
[26] R. C. Merton, "Lifetime portfolio selection under uncertainty: The continuous-time case," The Review of Economics and Statistics, pp. 247–257, 1969.
[27] H. Kunita, Stochastic Flows and Stochastic Differential Equations. Cambridge University Press, 1997, vol. 24.
[28] X. Li, T.-K. L. Wong, R. T. Chen, and D. Duvenaud, "Scalable gradients for stochastic differential equations," arXiv preprint arXiv:2001.01328, 2020.
[29] H. Robbins and S. Monro, "A stochastic approximation method," The Annals of Mathematical Statistics, pp. 400–407, 1951.
[30] S. Ruder, "An overview of gradient descent optimization algorithms," arXiv preprint arXiv:1609.04747, 2016.
[31] C. De Sa, C. Re, and K. Olukotun, "Global convergence of stochastic gradient descent for some non-convex matrix problems," in International Conference on Machine Learning. PMLR, 2015, pp. 2332–2341.
[32] Z. Allen-Zhu and E. Hazan, "Variance reduction for faster non-convex optimization," in International Conference on Machine Learning. PMLR, 2016, pp. 699–707.
[33] G. Smyrlis and V. Zisis, "Local convergence of the steepest descent method in Hilbert spaces," Journal of Mathematical Analysis and Applications, vol. 300, no. 2, pp. 436–453, 2004.
[34] H. K. Khalil, Nonlinear Systems, vol. 3.
[35] H. Liu and M. Loewenstein, "Optimal portfolio selection with transaction costs and finite horizons," The Review of Financial Studies, vol. 15, no. 3, pp. 805–835, 2002.
[36] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[37] G. Mil'shtejn, "Approximate integration of stochastic differential equations," Theory of Probability & Its Applications, vol. 19, no. 3, pp. 557–562, 1975.
[38] J. Nocedal and S. Wright, Numerical Optimization. Springer Science & Business Media, 2006.
[39] E. Gobet and R. Munos, "Sensitivity analysis using Itô–Malliavin calculus and martingales, and application to stochastic optimal control," SIAM Journal on Control and Optimization, vol. 43, no. 5, pp. 1676–1713, 2005.
[40] M. Giles and P. Glasserman, "Smoking adjoints: Fast evaluation of Greeks in Monte Carlo calculations," 2005.