
Learning Stochastic Optimal Policies via Gradient Descent

Stefano Massaroli1,⋆, Michael Poli2,⋆, Stefano Peluchetti3, Jinkyoo Park2, Atsushi Yamashita1 and Hajime Asama1

1 The University of Tokyo, [email protected]   2 KAIST, [email protected]   3 Cogent Labs, [email protected]   ⋆ Equal contribution.

Abstract—We systematically develop a learning-based treatment of stochastic optimal control (SOC), relying on direct optimization of parametric control policies. We propose a derivation of adjoint sensitivity results for stochastic differential equations through direct application of variational calculus. Then, given an objective function for a predetermined task specifying the desiderata for the controller, we optimize its parameters via iterative gradient descent methods. In doing so, we extend the range of applicability of classical SOC techniques, which often require strict assumptions on the functional form of system and control. We verify the performance of the proposed approach on a continuous–time, finite horizon portfolio optimization with proportional transaction costs.

I. INTRODUCTION

In this work we consider the following class of controlled stochastic dynamical systems:

    ẋ_t = f(t, x_t, u_t) + g(t, x_t, u_t) ξ(t)                                   (1)

with state x_t ∈ X ⊂ R^{n_x} and control policy u_t ∈ U ⊂ R^{n_u}. ξ(t) ∈ R^{n_ξ} is a stationary δ-correlated Gaussian noise, i.e. E[ξ(t)] = 0 for all t > 0 and E[ξ(s)ξ(t)] = δ(s − t) for all s, t such that 0 < s < t. The RHS of (1) comprises a drift term f : R × X × U → R^{n_x} and a diffusion term g : R × X × U → R^{n_x × n_ξ}. This paper develops a novel, systematic approach to learning optimal control policies for systems of the form (1) with respect to smooth scalar objective functions.

The link between stochastic optimal control (SOC) and learning has been explored in the discrete–time case [1], with policy iteration and value function approximation methods [2], [3] seeing widespread utilization in reinforcement learning [4]. Adaptive stochastic control has obtained explicit solutions through strict assumptions on the class of systems and objectives [5], preventing its applicability to the general case. Forward–backward SDEs (FBSDEs) have also been proposed to solve classic SOC problems [6], even employing neural approximators for value function dynamics [7], [8]. A further connection between SOC and machine learning has been discussed within the continuous–depth paradigm of neural networks [9], [10], [11], [12]; [13] e.g. showed that fully connected residual networks converge, in the infinite depth and width limit, to diffusion processes.

A different class of techniques for SOC involves the analytic [14] or approximate [15] solution of the Hamilton–Jacobi–Bellman (HJB) optimality conditions. These approaches either restrict the class of objective functions to preserve analytic tractability, or develop specific approximate methods which generally become intractable in high dimensions.

Here, we explore a different direction, motivated by the affinity between neural network training and optimal control, both of which rely on carefully crafted objective functions encoding task–dependent desiderata. In essence, the raison d'être of synthesizing optimal stochastic controllers through gradient–based techniques is to enrich the class of objectives for which an optimal controller can be found, as well as to scale tractability to high–dimensional regimes. This is in line with the empirical results of modern machine learning research, where large deep neural networks are routinely optimized on high–dimensional non–convex problems with outstanding performance [16], [17]. Gradient–based methods are also being explored in classic control settings to obtain a rigorous characterization of optimal control problems in the linear–quadratic regime [18], [19].

Notation: Let (Ω, F, P) be a probability space. If a property (event) A ∈ F holds with P(A) = 1, we say that such property holds almost surely. A family {x_t}_{t∈T} of X–valued random variables defined on a compact time domain T ⊂ R is called a stochastic process and is measurable if x_t(A) is measurable with respect to the σ-algebra B(T) × F, B(T) being the Borel algebra of T. As a convention, we use ∫_t^s = −∫_s^t if s < t and we denote with δ the Kronecker delta function.

II. STOCHASTIC DIFFERENTIAL EQUATIONS

Although (1) "looks like a differential equation, it is really a meaningless string of symbols" [20]. This relation should hence be treated only as a pre–equation and cannot be studied in this form. Such ill–posedness arises from the fact that, ξ(t) being a δ–autocorrelated process, the noise fluctuates an infinite number of times with infinite variance in any time interval (from a control–theoretic perspective, if ξ(t) ∈ R is, for instance, a white noise signal, its energy ∫_{−∞}^{∞} |F(ξ)(ω)| dω is not finite, F(·) denoting the Fourier transform). Therefore, a rigorous treatment of the model requires a different calculus to consistently interpret the integration of the RHS of (1). The resulting well–defined version of (1) is known as a stochastic differential equation (SDE) [21].

A. Itô–Stratonovich Dilemma

According to Van Kampen [20], ξ(t) may be thought of as a random sequence of δ functions causing, at each time t, a sudden jump of x_t. The controversy over the interpretation of (1) arises from the fact that it does not specify which value of x should be used to compute g when the δ functions are applied. There exist two main interpretations, namely Itô's [22] and Stratonovich's [23]. Itô prescribes that g be computed with the value of x before the jump, while Stratonovich uses the mean of the values of x before and after the jump. This choice leads to two different (yet both admissible and equivalent) types of integration. Formally, we consider a compact time horizon T = [0, T], T > 0, and let {B_t}_{t∈T} be the standard n_ξ–dimensional Wiener process defined on a filtered probability space (Ω, F, P; {F_t}_{t∈T}), which is such that B_0 = 0, B is almost surely continuous in t, nowhere differentiable, and has independent Gaussian increments, namely s, t ∈ T, s < t ⇒ B_t − B_s ∼ N(0, t − s); for all t ∈ T we have ξ(t)dt = dB_t. Moreover, let φ_t := g(t, x_t, u_t). The Itô and Stratonovich integrals are then defined as

    ∫_0^T φ_t dB_t = lim_{|D|→0} Σ_{k=1}^K φ_{t_{k−1}} (B_{t_k} − B_{t_{k−1}})

and

    ∫_0^T φ_t ∘ dB_t = lim_{|D|→0} (1/2) Σ_{k=1}^K (φ_{t_k} + φ_{t_{k−1}}) (B_{t_k} − B_{t_{k−1}}),

respectively, where D := {t_k : 0 = t_0 < t_1 < ··· < t_K = T} is a partition of T, |D| = max_k (t_k − t_{k−1}), and the limits are intended in the mean–square sense, if they exist (see [24]). Note that the symbol "∘" in ∘dB_t is used only to indicate that the integral is interpreted in the Stratonovich sense and does not stand for function composition. In the Stratonovich convention we may use the standard rules of calculus (see e.g. Itô's lemma in Stratonovich form [24, Theorem 2.4.1], where the chain rule of classic Riemann–Stieltjes calculus is shown to be preserved under the Stratonovich integral sign), while this is not the case for Itô's. This is because Stratonovich calculus corresponds to the limit case of a smooth process with a small finite auto–correlation approaching B_t [25]. Therefore, there are two different interpretations of (1):

    dx_t = f(t, x_t, u_t) dt + g(t, x_t, u_t) dB_t    and    dx_t = f(t, x_t, u_t) dt + g(t, x_t, u_t) ∘ dB_t.

Despite their name, SDEs are formally defined as integral equations due to the non–differentiability of the Brownian paths B_t.

From a control systems perspective, on the one hand the Stratonovich interpretation seems more appealing for noisy physical systems, as it captures the limiting case of multiplicative noise processes with a very small finite auto–correlation, often a reasonable assumption in real applications. However, if we need to treat diffusion terms as intrinsic components of the system's dynamics or impose martingale assumptions on the solutions of the SDE, the Itô convention should be adopted. In particular, this is always the case in financial applications, where stochastic optimal control is widely applied (see e.g. [26]).

Within the scope of this paper we will mainly adopt the Stratonovich convention for consistency of notation. Nonetheless, all results presented in this manuscript can be equivalently derived for the Itô case. Indeed, an Itô SDE can be transformed into a Stratonovich one (and vice versa) by the equivalence relation between the two calculi: for all t ∈ T,

    ∫_0^t φ_s ∘ dB_s = ∫_0^t φ_s dB_s + (1/2) ∫_0^t Σ_{i=1}^{n_ξ} (∂φ_s^{(i)}/∂x) φ_s^{(i)} ds.
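As a concrete illustration of this equivalence (not part of the original text), the following NumPy sketch simulates a scalar linear–noise SDE on one shared Brownian path, once in the Itô sense with the Euler–Maruyama scheme and once in the Stratonovich sense with the stochastic Heun scheme after applying the drift correction above. The coefficients are illustrative; for small step sizes the two endpoints approximately coincide.

```python
import numpy as np

# Itô SDE:            dx = a*x dt + s*x dB              (Euler–Maruyama)
# Equivalent Strat.:  dx = (a - 0.5*s**2)*x dt + s*x ∘ dB   (stochastic Heun)
a, s = 0.8, 0.5                       # illustrative drift / diffusion coefficients
x0, T, n = 1.0, 1.0, 20000
h = T / n
rng = np.random.default_rng(0)
dB = rng.normal(0.0, np.sqrt(h), n)   # Brownian increments, shared by both schemes

f_ito = lambda x: a * x
f_str = lambda x: (a - 0.5 * s**2) * x   # Itô -> Stratonovich drift correction
g     = lambda x: s * x

x_ito, x_str = x0, x0
for k in range(n):
    # Euler–Maruyama step (converges to the Itô solution)
    x_ito = x_ito + f_ito(x_ito) * h + g(x_ito) * dB[k]
    # Stochastic Heun step (converges to the Stratonovich solution)
    xp    = x_str + f_str(x_str) * h + g(x_str) * dB[k]                 # predictor
    x_str = x_str + 0.5 * (f_str(x_str) + f_str(xp)) * h \
                  + 0.5 * (g(x_str) + g(xp)) * dB[k]                    # corrector

# Both should be close to the exact solution x0*exp((a - s^2/2)*T + s*B_T).
print(x_ito, x_str, x0 * np.exp((a - 0.5 * s**2) * T + s * dB.sum()))
```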
B. Stratonovich SDEs

We formally introduce Stratonovich SDEs following the treatment of Kunita [27], [24]. For 0 ≤ s ≤ t ≤ T we denote by F_{s,t} the smallest σ-algebra containing all null sets of F with respect to which, for all v, w ∈ T such that s ≤ v ≤ w ≤ t, B_w − B_v is measurable. {F_{s,t}}_{s≤t; s,t∈T} is called the two–sided filtration generated by the Wiener process and we write F_t = F_{0,t} for compactness. Then, {F_t}_{t∈T} is itself a filtration with respect to which B_t is F_t–adapted. Further, let f, g be bounded in X, infinitely differentiable in x, continuously differentiable in t and uniformly continuous in u, and assume the controller {u_t}_{t∈T} to be an F_t–adapted process. Given an initial condition x_0 ∈ X, assumed to be an F_0–measurable random variable, we suppose that there exists an X–valued continuous F_t–adapted semi–martingale {x_t}_{t∈T} such that

    x_T = x_0 + ∫_0^T f(t, x_t, u_t) dt + ∫_0^T g(t, x_t, u_t) ∘ dB_t,             (2)

almost surely. Path–wise existence and uniqueness of solutions, i.e. that, for all t_0 ∈ T, two solutions x_t, x'_t with x_{t_0} = x'_{t_0} satisfy x_t = x'_t for all t ∈ T almost surely, is guaranteed under our class assumptions on f, g and the process {u_t}_{t∈T} (less strict sufficient conditions only require uniform Lipschitz continuity of f(t, x_t, u_t) + (1/2) Σ_{i=1}^{n_ξ} g^{(i)}(t, x_t, u_t) [∂g^{(i)}(t, x_t, u_t)/∂x] and of g(t, x_t, u_t) w.r.t. x, and uniform continuity w.r.t. t and u).

If, as assumed here, f, g are functions of class C^{1,∞} in (x_t, t), uniformly continuous in u with bounded derivatives w.r.t. x and t, and {u_t}_{t∈T} belongs to some admissible set A of control functions T → U, then, given a realization of the Wiener process, there exists a C^∞ mapping Φ, called the stochastic flow, from X × A to the space of absolutely continuous functions [s, t] → X such that

    x_t = Φ_s(x_s, {u_s}_{s≤t; s,t∈T})(t),    s ≤ t;  s, t ∈ T;  x_s ∈ X            (3)

almost surely. For the sake of compactness we denote the RHS of (3) by Φ_{s,t}(x_s). It is worth noticing that the collection {Φ_{s,t}}_{s≤t; s,t∈T} satisfies the flow property (see [24, Sec. 3.1]),

    ∀ s, t, v ∈ T : s < v < t ⇒ Φ_{v,t}(Φ_{s,v}(x_s)) = Φ_{s,t}(x_s),

and that it is also a diffeomorphism [24, Theorem 3.7.1], i.e. there exists an inverse flow Ψ_{s,t} := Φ_{s,t}^{−1} which satisfies the backward SDE

    Ψ_{s,t}(x_s) = x_s − ∫_s^t f(v, Ψ_{s,v}(x_s), u_v) dv − ∫_s^t g(v, Ψ_{s,v}(x_s), u_v) ∘ dB̂_v,

{B̂_t}_{t∈T} being the realization of the backward Wiener process defined as B̂_t := B_t − B_T for all t ∈ T. The diffeomorphism property of Φ_{s,t} thus yields Ψ_{s,t}(Φ_{s,t}(x_s)) = x_s. Therefore, it is possible to reverse the solutions of Stratonovich SDEs in a similar fashion to ordinary differential equations (ODEs) by storing/generating identical realizations of the noise process. From a practical point of view, under mild assumptions on the chosen numerical SDE scheme, approximated solutions Φ̂_{s,t}(x_s) satisfy, for all x_s ∈ X, Ψ̂_{s,t}(Φ̂_{s,t}(x_s)) → x_s in probability as the discretization time-step ε → 0 [28, Theorem 3.3].
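The reversal property is easy to probe numerically. The sketch below (added for illustration only; plain NumPy, toy coefficients) integrates a scalar Stratonovich SDE forward with the stochastic Heun scheme while storing the Brownian increments, then replays the same increments in reverse to integrate the backward SDE; the reconstructed initial state approaches the true one as the step size shrinks.

```python
import numpy as np

# Toy Stratonovich SDE  dx = f(x) dt + g(x) ∘ dB   (coefficients are illustrative)
f = lambda x: -1.0 * x + 0.3
g = lambda x: 0.4 * np.sin(x) + 0.6

def heun_step(x, h, db):
    """One Stratonovich (Heun) step of size h with Brownian increment db."""
    xp = x + f(x) * h + g(x) * db                                   # predictor
    return x + 0.5 * (f(x) + f(xp)) * h + 0.5 * (g(x) + g(xp)) * db # corrector

T, n, x0 = 1.0, 50000, 1.2
h = T / n
rng = np.random.default_rng(1)
dB = rng.normal(0.0, np.sqrt(h), n)

# Forward pass x0 -> xT, storing only the noise increments.
x = x0
for k in range(n):
    x = heun_step(x, h, dB[k])
xT = x

# Backward pass: replay the same increments in reverse with flipped signs,
# i.e. integrate the backward SDE driven by the backward Wiener process.
y = xT
for k in reversed(range(n)):
    y = heun_step(y, -h, -dB[k])

print(x0, y)   # y ≈ x0, with the reconstruction error shrinking as h -> 0
```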
III. DIRECT STOCHASTIC OPTIMAL CONTROL

We consider the optimal control problem for systems of class (1) interpreted in the Stratonovich sense. In particular, we aim at determining a control process {u_t}_{t∈T}, within an admissible set of functions A, minimizing or maximizing in expectation some scalar criterion J_u of the form

    J_u(x_0) = ∫_T γ(t, Φ_{0,t}(x_0), u_t) dt + Γ(Φ_{0,T}(x_0), u_T),              (4)

measuring the performance of the system. Within the scope of the paper, we characterize A as the set of all control processes {u_t}_{t∈T} which are F_t–adapted, take values in U and satisfy ∫_T E[‖u_t‖_2^2] dt < ∞. To emphasize the (implicit) dependence of J_u(x_0) on the realization of the Wiener process, we often explicitly write J_u(x_0, {B_t}_{t∈T}).

A. Stochastic Gradient Descent

In this work we propose to directly seek the optimal controller via mini-batch stochastic gradient descent (SGD) (or ascent) optimization [29], [30]. The algorithm works as follows: given a scalar criterion J_α(x_0, {B_t}_{t∈T}) dependent on the variable α and on the Brownian path, it attempts to compute α* = arg min_α E[J_α(x_0, {B_t}_{t∈T})] or α* = arg max_α E[J_α(x_0, {B_t}_{t∈T})], starting from an initial guess α^0 and updating a candidate solution α^k recursively as

    α^{k+1} = α^k ± (η_k / N) Σ_{i=1}^N (d/dα) J_{α^k}(x_0, {B_t^i}_{t∈T}),

where N is a predetermined number of independent and identically distributed samples {B_t^i}_{t∈T} of the Wiener process, η_k is a positive scalar learning rate and the sign of the update depends on the minimization or maximization nature of the problem. If η_k is suitably chosen and J_α is convex, α^k converges in expectation to the minimizer of J_α as k → ∞. Although global convergence is no longer guaranteed in the non–convex case, SGD–based techniques have been employed across application areas due to their scaling and unique convergence characteristics. These methods have further been refined over time to show remarkable results even in non–convex settings [31], [32].

B. Finite Dim. Optimization via Neural Approximators

Consider the case in which the criterion J_u has to be minimized. Locally optimal control processes u = {u_t}_{t∈T} ∈ A could be obtained, in principle, by iterating the SGD algorithm for the criterion J_u in the set A of admissible control processes. Since A is at least in L^2(T → U), local convergence of SGD is preserved in function space [33]. An idealized function–space SGD algorithm would compute iterates of the form

    u^{k+1} = u^k − (η_k / N) Σ_{i=1}^N (δ/δu) J_u(x_s, {B_t^i}_{t∈T}),

where u^k is the solution at the k-th SGD iteration, δ/δ(·) is the variational (functional) derivative and J_u satisfies δE[J_u(x_s, {B_t}_{t∈T})]/δu = E[δJ_u(x_s, {B_t}_{t∈T})/δu]. At each training step of the controller, this approach would thus perform N independent path simulations, compute the criterion J_u and apply the gradient descent update. While local convergence to optimal solutions is still ensured under mild regularity and smoothness assumptions, obtaining derivatives in function space turns out to be computationally intractable: any infinite–dimensional algorithm needs to be discretized in order to be practically implementable. We therefore reduce the problem to finite dimension by approximating u_t^k with a neural network (here, by neural network we mean any learnable parametric function u_θ with some approximation power for specific classes of functions).

Let u_{θ,t} : θ, t ↦ u_{θ,t} be a neural network with parameters θ ∈ R^{n_θ}, which we use as a functional approximator for candidate optimal controllers, i.e. u_{θ,t} ≈ u_t* for all t ∈ T. Further, we denote f_θ(t, x_t) := f(t, x_t, u_{θ,t}), g_θ(t, x_t) := g(t, x_t, u_{θ,t}) and the optimization criterion

    J_θ(x_0) = ∫_T γ_θ(t, Φ_{s,t}(x_0)) dt + Γ_θ(T, Φ_{s,T}(x_0))                  (5)

with γ_θ(·, ·) = γ(·, ·, u_{θ,t}) and Γ_θ(·, ·) = Γ(·, ·, u_{θ,T}). Then, the optimization problem turns into finding the optimal parameters θ via gradient descent, by iterating

    θ^{k+1} = θ^k − (η_k / N) Σ_{i=1}^N (d/dθ) J_{θ^k}(x_0, {B_t^i}_{t∈T}).        (6)

If strong constraints over the set of admissible controllers A are imposed, the approximation problem can instead be rewritten onto a complete orthogonal basis of A, with u_θ parameterized by a truncation of the eigenfunction expansion rather than by a neural network.

Remark 1 (Heuristics): As common practice in machine learning, the proposed approach relies heavily on the following empirical assumptions:
i. the numerical estimate of the mean gradient will be accurate enough to track the direction of the true dJ_θ/dθ (here, estimation errors of the gradient are introduced by the numerical discretization of the SDE and by the finiteness of independent path simulations);
ii. the locally optimal controller reached by gradient descent/ascent will be good enough to satisfy the performance requirements of the control problem.
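Putting Sections III-A and III-B together, the sketch below (not from the paper; plain NumPy, toy dynamics and illustrative hyperparameters) trains a small parametric state-feedback controller on a scalar controlled Stratonovich SDE by averaging path-wise gradients over a mini-batch of simulated Brownian paths, as in (6). For compactness, the path-wise gradient dJ_θ/dθ is estimated here by central finite differences with common random numbers; the forward and adjoint sensitivities derived next replace this estimate in practice.

```python
import numpy as np

rng = np.random.default_rng(2)
T, n = 1.0, 200
h = T / n

# Toy controlled SDE  dx = (-x + u) dt + 0.2 ∘ dB  with a quadratic cost (illustrative).
def policy(theta, x):                       # u_theta(x): linear state feedback
    return theta[0] + theta[1] * x

def simulate_cost(theta, dB, x0=1.0):
    """Heun (Stratonovich) rollout of one path; returns the path-wise cost J_theta."""
    x, J = x0, 0.0
    for k in range(n):
        u = policy(theta, x)                # control frozen within the step
        drift = lambda z: -z + u
        diff  = lambda z: 0.2
        xp = x + drift(x) * h + diff(x) * dB[k]
        xn = x + 0.5 * (drift(x) + drift(xp)) * h + 0.5 * (diff(x) + diff(xp)) * dB[k]
        J += (x**2 + 0.1 * u**2) * h        # running cost gamma
        x = xn
    return J + x**2                         # terminal cost Gamma

theta, eta, N, eps = np.array([0.0, 0.0]), 0.2, 8, 1e-4
for it in range(60):                        # SGD iterations, eq. (6)
    grad = np.zeros_like(theta)
    for _ in range(N):                      # mini-batch of Brownian paths
        dB = rng.normal(0.0, np.sqrt(h), n)
        for j in range(len(theta)):         # finite-difference path-wise gradient
            ej = np.zeros_like(theta); ej[j] = eps
            grad[j] += (simulate_cost(theta + ej, dB) -
                        simulate_cost(theta - ej, dB)) / (2 * eps)
    theta -= eta * grad / N                 # descent step (minimization)
print("learned feedback parameters:", theta)
```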

In order to perform the above gradient descent we therefore need to compute the gradient (i.e. the sensitivity) of J_θ with respect to the parameters θ in a computationally efficient manner. In the following we detail different approaches to differentiate through SDE solutions.

IV. COST GRADIENTS AND ADJOINT DYNAMICS

The most straightforward approach for computing path–wise gradients (i.e. independently for each realization of {B_t}_{t∈T}) is by directional differentiation (differentiability under the integral sign follows from our smoothness assumptions), i.e.

    dJ_θ/dθ = ∫_T [ (∂γ_θ/∂x_t)(dx_t/dθ) + ∂γ_θ/∂θ ] dt + (∂Γ_θ/∂x_T)(dx_T/dθ) + ∂Γ_θ/∂θ,

where the quantities dx_t/dθ and dx_T/dθ can be obtained with the following result.

Proposition 1 (Path–wise Forward Sensitivity): Let S_t = dx_t/dθ. Then S_t is an {F_t}–adapted process satisfying

    dS_t = [ (∂f_θ/∂x_t) S_t + ∂f_θ/∂θ ] dt + Σ_{i=1}^{n_ξ} [ (∂g_θ^{(i)}/∂x_t) S_t + ∂g_θ^{(i)}/∂θ ] ∘ dB_t^{(i)}.

Proof: The proof is an extension of the forward sensitivity analysis for ODEs (see [34, Sec. 3.4]) to the stochastic case. Given the SDE of interest

    x_t = x_0 + ∫_T f_θ(t, x_t) dt + ∫_T g_θ(t, x_t) ∘ dB_t,

differentiating under the integral sign w.r.t. θ gives

    dx_t/dθ = ∫_T [ (∂f_θ/∂x_t)(dx_t/dθ) + ∂f_θ/∂θ ] dt + ∫_T Σ_{i=1}^{n_ξ} [ (∂g_θ^{(i)}/∂x_t)(dx_t/dθ) + ∂g_θ^{(i)}/∂θ ] ∘ dB_t^{(i)}    (7)

and the result follows setting S_t = dx_t/dθ. That differentiating under the integral sign is allowed follows from our assumptions on f_θ and g_θ and from an application of [24, Lemma 3.7.1] to the augmented SDE in (x_t, θ).
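Proposition 1 can be checked numerically on a scalar example. The sketch below (illustrative, not from the paper; plain NumPy) integrates the augmented pair (x_t, S_t) with a Stratonovich Heun scheme for a toy one-parameter system and compares S_T with a finite-difference estimate of dx_T/dθ computed on the same Brownian path.

```python
import numpy as np

# Toy scalar system (illustrative): dx = -theta*x dt + sigma*x ∘ dB, one parameter theta.
# Augmented sensitivity S_t = dx_t/dtheta follows Proposition 1:
#   dS = (df/dx * S + df/dtheta) dt + (dg/dx * S + dg/dtheta) ∘ dB
#      = (-theta*S - x) dt + (sigma*S) ∘ dB          (dg/dtheta = 0 here)
sigma = 0.3
T, n, x0 = 1.0, 20000, 1.0
h = T / n
rng = np.random.default_rng(3)
dB = rng.normal(0.0, np.sqrt(h), n)    # one fixed Brownian path

def heun_aug(theta):
    """Stratonovich (Heun) integration of the augmented pair (x_t, S_t)."""
    drift = lambda x, S: np.array([-theta * x, -theta * S - x])
    diff  = lambda x, S: np.array([sigma * x, sigma * S])
    z = np.array([x0, 0.0])            # S_0 = dx_0/dtheta = 0
    for k in range(n):
        zp = z + drift(*z) * h + diff(*z) * dB[k]
        z  = z + 0.5 * (drift(*z) + drift(*zp)) * h + 0.5 * (diff(*z) + diff(*zp)) * dB[k]
    return z                           # (x_T, S_T)

theta = 0.7
xT, ST = heun_aug(theta)
eps = 1e-5                             # finite-difference check on the same path
dxT_fd = (heun_aug(theta + eps)[0] - heun_aug(theta - eps)[0]) / (2 * eps)
print(ST, dxT_fd)                      # the two values should agree closely
```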

The main issue with the forward sensitivity approach is its curse of dimensionality with respect to the number of parameters n_θ and the state dimension n_x, as it requires solving an n_x × n_θ matrix–valued SDE over the whole time horizon T. At each integration step the full Jacobians ∂f_θ/∂θ, ∂g_θ/∂θ are required, causing memory and computational overheads in software implementations. This issue can be overcome by introducing adjoint backward gradients.

Theorem 1 (Path–wise Backward Adjoint Gradients): Consider the cost function (5) and let λ ∈ C^1(T → X) be a Lagrange multiplier. Then dJ_θ/dθ is given by

    dJ_θ/dθ = ∂Γ_θ/∂θ + ∫_T [ λ_t^⊤ ∂f_θ/∂θ + ∂γ_θ/∂θ ] dt + Σ_{i=1}^{n_ξ} ∫_T λ_t^⊤ ∂g_θ^{(i)}/∂θ ∘ dB_t^{(i)},

where the Lagrange multiplier λ_t satisfies the following backward Stratonovich SDE:

    dλ_t^⊤ = −[ λ_t^⊤ ∂f_θ/∂x_t + ∂γ_θ/∂x_t ] dt − Σ_{i=1}^{n_ξ} λ_t^⊤ ∂g_θ^{(i)}/∂x_t ∘ dB̂_t^{(i)},    λ_T^⊤ = ∂Γ_θ/∂x_T.

Lemma 1 (Line integration by parts): Let T be a compact subset of R and r : T → R^n, v : T → R^n such that, for all t ∈ T, t ↦ r_t := r(t) ∈ C^1 and t ↦ v_t := v(t) ∈ C^1. We have

    ∫_T ⟨r_t, dv_t⟩ = ⟨r_T, v_T⟩ − ⟨r_0, v_0⟩ − ∫_T ⟨dr_t, v_t⟩.

Proof: The proof follows immediately from integration by parts using dr_t = ṙ_t dt, dv_t = v̇_t dt.

Proof (Theorem 1): Let L be the Lagrangian of the optimization problem and the process λ_t ∈ C^1(T → X) an F_t–adapted Lagrange multiplier. We consider L of the form

    L = J_θ(x_0) − ∫_T ⟨λ_t, dx_t − f_θ(t, x_t) dt − g_θ(t, x_t) ∘ dB_t⟩
      = Γ_θ(x_T) + ∫_T γ_θ(t, x_t) dt − ∫_T ⟨λ_t, dx_t − f_θ(t, x_t) dt − g_θ(t, x_t) ∘ dB_t⟩.

Since dx_t − f_θ(t, x_t) dt − g_θ(t, x_t) ∘ dB_t = 0 by construction, the integral term is always null, dL/dθ = dJ_θ(x_0)/dθ and the Lagrange multiplier process can be freely assigned. Thus,

    dJ_θ/dθ = dL/dθ = ∂Γ_θ/∂θ + (∂Γ_θ/∂x_T)(dx_T/dθ) + (d/dθ) ∫_T γ_θ(t, x_t) dt − (d/dθ) ∫_T ⟨λ_t, dx_t − f_θ(t, x_t) dt − g_θ(t, x_t) ∘ dB_t⟩.

Note that, by Lemma 1, ∫_T ⟨λ_t, dx_t⟩ = ⟨λ_t, x_t⟩|_0^T − ∫_T ⟨dλ_t, x_t⟩. For compactness, we omit the arguments of all functions unless needed. We have

    dJ_θ/dθ = ∂Γ_θ/∂θ + [ ∂Γ_θ/∂x_T − λ_T^⊤ ](dx_T/dθ) + λ_0^⊤ (dx_0/dθ)
              + ∫_T [ ∂γ_θ/∂θ + (∂γ_θ/∂x_t)(dx_t/dθ) ] dt + ∫_T dλ_t^⊤ (dx_t/dθ)
              + ∫_T λ_t^⊤ [ ∂f_θ/∂θ + (∂f_θ/∂x_t)(dx_t/dθ) ] dt
              + Σ_{i=1}^{n_ξ} ∫_T λ_t^⊤ [ ∂g_θ^{(i)}/∂θ + (∂g_θ^{(i)}/∂x_t)(dx_t/dθ) ] ∘ dB_t^{(i)},

where the term λ_0^⊤ (dx_0/dθ) vanishes since x_0 does not depend on θ. Reorganizing the terms leads to

    dJ_θ/dθ = ∂Γ_θ/∂θ + ∫_T [ ∂γ_θ/∂θ + λ_t^⊤ ∂f_θ/∂θ ] dt + Σ_{i=1}^{n_ξ} ∫_T λ_t^⊤ ∂g_θ^{(i)}/∂θ ∘ dB_t^{(i)}
              + [ ∂Γ_θ/∂x_T − λ_T^⊤ ](dx_T/dθ)
              + ∫_T [ dλ_t^⊤ + ( λ_t^⊤ ∂f_θ/∂x_t + ∂γ_θ/∂x_t ) dt + Σ_{i=1}^{n_ξ} λ_t^⊤ ∂g_θ^{(i)}/∂x_t ∘ dB_t^{(i)} ] (dx_t/dθ).

Finally, if the process λ_t satisfies the backward SDE dλ_t^⊤ = −[ λ_t^⊤ ∂f_θ/∂x_t + ∂γ_θ/∂x_t ] dt − Σ_{i=1}^{n_ξ} λ_t^⊤ ∂g_θ^{(i)}/∂x_t ∘ dB̂_t^{(i)}, with λ_T^⊤ = ∂Γ_θ/∂x_T, the criterion gradient is simply obtained as

    dJ_θ/dθ = ∂Γ_θ/∂θ + ∫_T [ λ_t^⊤ ∂f_θ/∂θ + ∂γ_θ/∂θ ] dt + Σ_{i=1}^{n_ξ} ∫_T λ_t^⊤ ∂g_θ^{(i)}/∂θ ∘ dB_t^{(i)}.

Note that if the integral term of the objective function is defined point–wise at a finite number of time instants t_k, i.e. ∫_T Σ_{k=1}^K γ_θ(t_k, Φ_{s,t_k}(x_s)) δ(t − t_k) dt, then the RHS of the adjoint state SDE becomes piece–wise continuous in (t_k, t_{k+1}], yielding the hybrid stochastic dynamics

    dλ_t = −(∂f_θ/∂x_t) λ_t dt − Σ_{i=1}^{n_ξ} (∂g_θ^{(i)}/∂x_t) λ_t ∘ dB̂_t^{(i)},    t ∈ (t_k, t_{k+1}],
    λ_t^− = λ_t + ∂γ_θ/∂x_t,    t = t_k,

where λ_t^− indicates the value of λ_t after a discrete jump (the adjoint being integrated backward in time), or, formally, λ_t^− = lim_{s↗t^−} λ_s.
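The backward adjoint computation of Theorem 1 can likewise be sanity-checked on a scalar toy problem. In the sketch below (illustrative only; plain NumPy; the forward trajectory is stored on the grid rather than reconstructed by reversing the SDE as in [28]), the adjoint λ_t is integrated backward over the same Brownian increments and the resulting gradient is compared with a finite-difference estimate on the same path.

```python
import numpy as np

# Toy scalar problem (illustrative): Stratonovich dynamics dx = -theta*x dt + sigma*x ∘ dB,
# cost J = ∫ x_t^2 dt + x_T^2.  Theorem 1 gives dJ/dtheta = ∫ lambda_t * (df/dtheta) dt,
# with the adjoint integrated backward:
#   dlambda = -(lambda*df/dx + dgamma/dx) dt - lambda*dg/dx ∘ dB_hat,  lambda_T = dGamma/dx_T.
theta, sigma = 0.7, 0.3
T, n, x0 = 1.0, 20000, 1.0
h = T / n
rng = np.random.default_rng(4)
dB = rng.normal(0.0, np.sqrt(h), n)

def forward(th):
    """Heun rollout; returns the stored state grid and the path-wise cost."""
    xs = np.empty(n + 1); xs[0] = x0
    J = 0.0
    for k in range(n):
        x = xs[k]
        xp = x + (-th * x) * h + sigma * x * dB[k]
        xs[k + 1] = x + 0.5 * (-th) * (x + xp) * h + 0.5 * sigma * (x + xp) * dB[k]
        J += 0.5 * (x**2 + xs[k + 1]**2) * h              # trapezoidal running cost
    return xs, J + xs[-1]**2

xs, _ = forward(theta)

# Backward adjoint pass over the stored grid (Heun-type corrector), accumulating dJ/dtheta.
lam, grad = 2.0 * xs[-1], 0.0                             # lambda_T = dGamma/dx_T
for k in reversed(range(n)):
    rhs = lambda l, x: (-theta * l + 2.0 * x) * h + sigma * l * dB[k]
    lp = lam + rhs(lam, xs[k + 1])                        # predictor for lambda at t_k
    lam_k = lam + 0.5 * (rhs(lam, xs[k + 1]) + rhs(lp, xs[k]))
    grad += 0.5 * (lam * (-xs[k + 1]) + lam_k * (-xs[k])) * h   # lambda * df/dtheta
    lam = lam_k
print("adjoint gradient :", grad)

eps = 1e-5                                                # finite-difference check, same path
print("finite difference:", (forward(theta + eps)[1] - forward(theta - eps)[1]) / (2 * eps))
# The two estimates should approximately agree for small step sizes.
```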

V. EXPERIMENTAL EVALUATION

We evaluate the proposed approach on a classical problem in financial mathematics, continuous–time portfolio optimization [26]. We consider the challenging finite–horizon, transaction cost case [35].

A. Optimal Portfolio Allocation

Consider a two–asset market with proportional transaction costs in which all securities are perfectly divisible. Suppose that in such a market all assets are traded continuously and that one of them is riskless, i.e. it evolves according to the ODE

    dV_t = r_t V_t dt,    V_s = v ∈ R,

where the interest rate r_t is any F_t–adapted non–negative scalar–valued process. The other asset is a stock S_t ∈ R whose dynamics satisfy the Itô SDE

    dS_t = μ_t S_t dt + σ_t S_t dB_t,    S_s = s ∈ R,

with expected rate of return μ : T → R and instantaneous volatility σ : T → R_+, both F_t–adapted processes.

Now, let us consider an agent who invests in such a market and is interested in rebalancing a two–asset portfolio through buy and sell actions. We obtain

    dV_t = r_t V_t dt − dπ_t^i + (1 − α) dπ_t^d,
    dS_t = μ_t S_t dt + σ_t S_t dB_t + dπ_t^i − dπ_t^d,                            (8)

where π_t^i, π_t^d are nondecreasing, F_t–adapted processes indicating the cumulative amounts of sales and purchases of the risky asset, and α > 0 is the proportional transaction cost. Here we let dπ_t^i = u_t^i dt, dπ_t^d = u_t^d dt and u_t = (u_t^i, u_t^d). An optimal strategy u_t for such a portfolio requires the specification of a utility function encoding the risk aversion and final objective of the investor. Explicit results for the finite horizon, transaction cost case exist only for specific classes of utilities, such as isoelastic ones [35]. As a further complication, the horizon length itself is known to greatly affect the optimal strategy. As an illustrative example, we consider here a portfolio optimality criterion where high levels of stock value are penalized, perhaps due to hedging considerations related to other already existing portfolios in possession of the hypothetical trader:

    J_u = ∫_0^T ( r_t V_t + μ_t S_t − ν S_t^2 σ_t^2 ) dt + r_T V_T + μ_T S_T,

to be maximized in expectation for some constant ν > 0. The same methodology can be directly extended to cover generic objectives, including e.g. different risk measures, complex stock and interest rate dynamics, and more elaborate transaction cost accounting rules. We then obtain a parameterization u_{θ,t}^i, u_{θ,t}^d of the policies u_t^i, u_t^d as follows. In particular, we define feed–forward neural networks N_θ : R^2 → R^2; (S_t, V_t) ↦ (u_{θ,t}^i, u_{θ,t}^d) taking as input the instantaneous values of the assets,

    u_{θ,t} = N_θ(V_t, S_t) = ℓ_L ∘ φ ∘ ℓ_{L−1} ∘ φ ∘ ··· ∘ φ ∘ ℓ_0(V_t, S_t),

where the ℓ_k are affine operators, ℓ_k x = A_k x + b_k, A_k ∈ R^{n_{k+1} × n_k}, b_k ∈ R^{n_{k+1}}, and φ : R → R is any nonlinear activation function understood to act element–wise. The vector of parameters θ is thus the flattened collection of all weights A_k and biases b_k.
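For concreteness, the sketch below (not from the paper; plain NumPy) rolls out the controlled dynamics (8) under a small softplus-capped feed-forward policy and returns a Monte Carlo estimate of J_u. The market and cost parameters follow Section V-B, while the network size and discretization are illustrative; in the experiments this forward pass is wrapped in the gradient-based loop of Section III, with gradients supplied by the adjoint of Section IV (or, in practice, by automatic differentiation).

```python
import numpy as np

rng = np.random.default_rng(5)
r, mu, sigma, alpha, nu = 0.04, 0.23, 0.18, 0.05, 0.5   # market / cost parameters
T, n = 1.0, 200
h = T / n

def init_policy(hidden=16):
    """Small feed-forward net (S, V) -> (u_buy, u_sell) with softplus outputs."""
    return {"A0": rng.normal(0, 0.3, (hidden, 2)), "b0": np.zeros(hidden),
            "A1": rng.normal(0, 0.3, (2, hidden)), "b1": np.zeros(2)}

def policy(p, S, V):
    z = np.tanh(p["A0"] @ np.array([S, V]) + p["b0"])
    return np.log1p(np.exp(p["A1"] @ z + p["b1"]))       # softplus keeps controls >= 0

def criterion(p, n_paths=50):
    """Monte Carlo estimate of J_u under policy p (Euler–Maruyama, Itô sense)."""
    Js = []
    for _ in range(n_paths):
        S, V, J = 1.0, 0.0, 0.0                           # portfolio initialized at (1, 0)
        for _k in range(n):
            ui, ud = policy(p, S, V)                      # buy / sell rates
            J += (r * V + mu * S - nu * S**2 * sigma**2) * h
            dBk = rng.normal(0.0, np.sqrt(h))
            V += (r * V - ui + (1 - alpha) * ud) * h
            S += mu * S * h + sigma * S * dBk + (ui - ud) * h
        Js.append(J + r * V + mu * S)                     # terminal reward
    return float(np.mean(Js))

p = init_policy()
print("estimated J_u under the initial policy:", criterion(p))
```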
B. Numerical Results

In particular, we choose three–layer feed–forward neural networks with thirty–two neurons per layer, capped with a softplus activation φ(x) = log(1 + exp(x)) to ensure that π_θ^i, π_θ^d are nondecreasing processes. The optimization procedure for the strategy parameters θ has been carried out through 100 iterations of the Adam [36] gradient–based optimizer with step size 0.03. The asset dynamics and cost parameters are set as α = 0.05, r = 0.04, μ = 0.23, σ = 0.18. We produce 50 path realizations during each training iteration and average the gradients. The asset dynamics are interpreted in the Itô sense and have been solved forward with the Itô–Milstein method [37], which has been converted to Stratonovich–Milstein for the backward adjoint SDE.

Figure 1 shows different realizations of portfolio trajectories in the (S_t, V_t) state–space for four different values of ν, [0, 0.25, 0.5, 1]. The trajectories agree with plausible strategies for each investment style; risk–seeking investors (ν = 0) sell V_t to maximize the amount of stock owned at final time T, and hence the potential portfolio growth. On the other hand, moderately risk–averse investors (ν = 0.25) learn hybrid strategies where an initial phase of stock purchasing is followed by a movement towards a more conservative portfolio with a balanced amount of S_t and V_t. The policies u_θ^i, u_θ^d are visualized in Figure 2, confirming the highly non–linear nature of the learned controller. It should be noted that, for a strategy to be allowed, the portfolio has to remain above the solvency line V_t + (1 − α) S_t ≥ 0 [35], indicated in the plot. We empirically observe automatic satisfaction of this specific constraint for all but the investors with lowest risk aversion. For the case ν = 0, we introduce a simple logarithmic barrier function as a soft constraint [38].

Fig. 1. Realization of (S_t, V_t) trajectories under learned strategies for four investors with different degrees of risk aversion, encoded in ν, with a portfolio initialized at (1, 0) at time 0. Different path realizations generated by the strategies are shown, along with their mean. [Left] Portfolio evolution against the expected return r_t V_t + μ_t S_t. [Right] Portfolio evolution against the volatility term S_t^2 σ_t^2. Each strategy suits the corresponding risk aversion level.

VI. RELATED WORK

Computing path–wise sensitivities through stochastic differential equations has been extensively explored in the literature. In particular, forward sensitivities with respect to the initial conditions of the solutions, analogous to Proposition 1, were first introduced in [27] and extended to parameter sensitivities in the Itô case in [39]. These approaches rely on integrating forward matrix–valued SDEs whose drift and diffusion require the full Jacobians of f_θ and g_θ at each integration step. Thus, such methods scale poorly in computational cost with large numbers of parameters and in high–dimensional–state regimes. This issue is averted by using backward adjoint sensitivities [40], where a vector–valued SDE is integrated backward and only vector–Jacobian products need to be evaluated. In this direction, [28] proposed to solve the system's backward SDE alongside the adjoint SDE to recover the value of the state x_t and thus improve the memory footprint of the algorithm. Further, that approach is derived as a special case of [24, Theorem 2.4.1] and only considers criteria J_θ depending on the final state x_T of the system. Our stochastic adjoint process extends the results of [28] to integral criteria potentially exhibiting explicit parameter dependence. Further, we proposed a novel proving strategy based on classic variational analysis.

Fig. 2. Learned buy–sell strategies u_θ^i, u_θ^d over the state–space (S_t, V_t). The strategies are highly non–linear, and empirically agree with investment strategies at different risk aversion degrees. For example, the risk–seeking investor (ν = 0) balances u_θ^i, u_θ^d to purchase the maximum amount of stock possible while remaining above the solvency line V_t + (1 − α) S_t ≥ 0.

REFERENCES

[1] D. P. Bertsekas and S. Shreve, Stochastic Optimal Control: The Discrete-Time Case, 2004.
[2] H. Yu and D. P. Bertsekas, "Q-learning and policy iteration algorithms for stochastic shortest path problems," Annals of Operations Research, vol. 208, no. 1, pp. 95–132, 2013.
[3] Y. Wu and T. Shen, "Policy iteration algorithm for optimal control of stochastic logical dynamical systems," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 5, pp. 2031–2036, 2017.
[4] S. Levine, "Reinforcement learning and control as probabilistic inference: Tutorial and review," arXiv preprint arXiv:1805.00909, 2018.
[5] E. Tse and M. Athans, "Adaptive stochastic control for a class of linear systems," IEEE Transactions on Automatic Control, vol. 17, no. 1, pp. 38–52, 1972.
[6] S. Peng and Z. Wu, "Fully coupled forward-backward stochastic differential equations and applications to optimal control," SIAM Journal on Control and Optimization, vol. 37, no. 3, pp. 825–843, 1999.
[7] M. A. Pereira and Z. Wang, "Learning deep stochastic optimal control policies using forward-backward SDEs," in Robotics: Science and Systems, 2019.
[8] Z. Wang, K. Lee, M. A. Pereira, I. Exarchos, and E. A. Theodorou, "Deep forward-backward SDEs for min-max control," in 2019 IEEE 58th Conference on Decision and Control (CDC). IEEE, 2019, pp. 6807–6814.
[9] E. Weinan, "A proposal on machine learning via dynamical systems," Communications in Mathematics and Statistics, vol. 5, no. 1, pp. 1–11, 2017.
[10] R. T. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud, "Neural ordinary differential equations," arXiv preprint arXiv:1806.07366, 2018.
[11] S. Massaroli, M. Poli, J. Park, A. Yamashita, and H. Asama, "Dissecting neural ODEs," arXiv preprint arXiv:2002.08071, 2020.
[12] S. Massaroli, M. Poli, M. Bin, J. Park, A. Yamashita, and H. Asama, "Stable neural flows," arXiv preprint arXiv:2003.08063, 2020.
[13] S. Peluchetti and S. Favaro, "Infinitely deep neural networks as diffusion processes," in International Conference on Artificial Intelligence and Statistics. PMLR, 2020, pp. 1126–1136.
[14] L. Arnold, "Stochastic differential equations," New York, 1974.
[15] H. J. Kushner and P. G. Dupuis, Numerical Methods for Stochastic Control Problems in Continuous Time. Springer Science & Business Media, 2001, vol. 24.
[16] J. Schmidhuber, "Deep learning in neural networks: An overview," Neural Networks, vol. 61, pp. 85–117, 2015.
[17] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, Cambridge, 2016, vol. 1, no. 2.
[18] L. Furieri, Y. Zheng, and M. Kamgarpour, "Learning the globally optimal distributed LQ regulator," in Learning for Dynamics and Control. PMLR, 2020, pp. 287–297.
[19] J. Bu, A. Mesbahi, M. Fazel, and M. Mesbahi, "LQR through the lens of first order methods: Discrete-time case," arXiv preprint arXiv:1907.08921, 2019.
[20] N. G. Van Kampen, "Itô versus Stratonovich," Journal of Statistical Physics, vol. 24, no. 1, pp. 175–187, 1981.
[21] B. Øksendal, "Stochastic differential equations," in Stochastic Differential Equations. Springer, 2003, pp. 65–84.
[22] K. Itô, "Differential equations determining a Markoff process," J. Pan-Japan Math. Coll., vol. 1077, pp. 1352–1400, 1942.
[23] R. Stratonovich, "A new representation for stochastic integrals and equations," SIAM Journal on Control, vol. 4, no. 2, pp. 362–371, 1966.
[24] H. Kunita, Stochastic Flows and Jump-Diffusions. Springer.
[25] E. Wong and M. Zakai, "On the convergence of ordinary integrals to stochastic integrals," The Annals of Mathematical Statistics, vol. 36, no. 5, pp. 1560–1564, 1965.
[26] R. C. Merton, "Lifetime portfolio selection under uncertainty: The continuous-time case," The Review of Economics and Statistics, pp. 247–257, 1969.
[27] H. Kunita, Stochastic Flows and Stochastic Differential Equations. Cambridge University Press, 1997, vol. 24.
[28] X. Li, T.-K. L. Wong, R. T. Chen, and D. Duvenaud, "Scalable gradients for stochastic differential equations," arXiv preprint arXiv:2001.01328, 2020.
[29] H. Robbins and S. Monro, "A stochastic approximation method," The Annals of Mathematical Statistics, pp. 400–407, 1951.
[30] S. Ruder, "An overview of gradient descent optimization algorithms," arXiv preprint arXiv:1609.04747, 2016.
[31] C. De Sa, C. Re, and K. Olukotun, "Global convergence of stochastic gradient descent for some non-convex matrix problems," in International Conference on Machine Learning. PMLR, 2015, pp. 2332–2341.
[32] Z. Allen-Zhu and E. Hazan, "Variance reduction for faster non-convex optimization," in International Conference on Machine Learning. PMLR, 2016, pp. 699–707.
[33] G. Smyrlis and V. Zisis, "Local convergence of the steepest descent method in Hilbert spaces," Journal of Mathematical Analysis and Applications, vol. 300, no. 2, pp. 436–453, 2004.
[34] H. K. Khalil, Nonlinear Systems, vol. 3.
[35] H. Liu and M. Loewenstein, "Optimal portfolio selection with transaction costs and finite horizons," The Review of Financial Studies, vol. 15, no. 3, pp. 805–835, 2002.
[36] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[37] G. Mil'shtejn, "Approximate integration of stochastic differential equations," Theory of Probability & Its Applications, vol. 19, no. 3, pp. 557–562, 1975.
[38] J. Nocedal and S. Wright, Numerical Optimization. Springer Science & Business Media, 2006.
[39] E. Gobet and R. Munos, "Sensitivity analysis using Itô–Malliavin calculus and martingales, and application to stochastic optimal control," SIAM Journal on Control and Optimization, vol. 43, no. 5, pp. 1676–1713, 2005.
[40] M. Giles and P. Glasserman, "Smoking adjoints: Fast evaluation of Greeks in Monte Carlo calculations," 2005.