A Convex Optimization Approach to Dynamic Programming in Continuous State and Action Spaces

Insoon Yang∗

Abstract In this paper, a convex optimization-based method is proposed for numerically solving dynamic programs in continuous state and action spaces. The key idea is to approximate the output of the Bellman operator at a particular state by the optimal value of a convex program. The approximate Bellman operator has a computational advantage because it involves a convex optimization problem in the case of control-affine systems and convex costs. Using this feature, we propose a simple dynamic programming algorithm to evaluate the approximate value function at pre-specified grid points by solving convex optimization problems in each iteration. We show that the proposed method approximates the optimal value function with a uniform convergence property in the case of convex optimal value functions. We also propose an interpolation-free design method for a control policy whose performance converges uniformly to the optimum as the grid resolution becomes finer. When a nonlinear control-affine system is considered, the convex optimization approach provides an approximate policy with a provable suboptimality bound. For general cases, the proposed convex formulation of dynamic programming operators can be modified into a nonconvex bi-level program, in which the inner problem is a linear program, without losing the uniform convergence properties.

Key words. Dynamic programming, Convex optimization

1 Introduction

Dynamic programming (DP) has been one of the most important methods for solving and analyzing sequential decision-making problems in optimal control and dynamic games, among others. By using DP, we can decompose a complicated sequential decision-making problem into multiple tractable subproblems, whose optimal solutions are used to construct an optimal policy of the original problem. Numerical methods for DP are most well studied for discrete-time Markov decision processes (MDPs) with discrete state and action spaces (e.g., [1]) and continuous-time deterministic and stochastic optimal control problems in continuous state spaces (e.g., [2]). This paper focuses on the discrete-time case with continuous state and action spaces in the finite-horizon setting. Unlike infinite-dimensional linear programming (LP) methods (e.g., [3–5]), which require a finite-dimensional approximation of the LP problems, we develop a finite-dimensional convex optimization-based method that uses a discretization of the state space while not discretizing the action space. Moreover, we assume that a system model and a cost function are explicitly known, unlike the literature on reinforcement learning (RL) (e.g., [6–8] and the references therein). Note that our focus is not to resolve the scalability issue in DP.

∗Department of Electrical and Computer Engineering, Automation and Systems Research Institute, Seoul National University ([email protected]). This work was supported in part by the Creative-Pioneering Researchers Program through SNU, the National Research Foundation of Korea funded by the MSIT(2020R1C1C1009766), and Samsung Electronics.


As opposed to the RL algorithms that seek approximate solutions to possibly high-dimensional problems (e.g., [9–12]), our method is useful when a provable convergence guarantee is needed for problems with relatively low-dimensional state spaces.¹

Several discretization methods have been developed for discrete-time DP problems in continuous (Borel) state and action spaces. These methods can be assigned to two categories. The first category discretizes both state and action spaces. Bertsekas [13] proposed two discretization methods and proved their convergence under a set of assumptions, including Lipschitz-type continuity conditions. Langen [14] studied the weak convergence of an approximation procedure, although no explicit error bound was provided. However, the discretization method proposed by Whitt [15] and Hinderer [16] is shown to be convergent and to have error bounds. Unfortunately, these error bounds are sensitive to the choice of partitions, and additional compactness and continuity assumptions are needed to reduce the sensitivity. Chow and Tsitsiklis [17] developed a multi-grid algorithm, which could be more efficient than its single-grid counterpart in achieving a desired level of accuracy. The discretization procedure proposed by Dufour and Prieto-Rumeau [18] can handle unbounded cost functions and locally compact state spaces. However, it still requires the Lipschitz continuity of some components of dynamic programs. A measure concentration result for the Wasserstein metric has also been used to measure the accuracy of the approximation method in [19] for Markov control problems in the average cost setting. Unlike the aforementioned approaches, the finite-state and finite-action approximation method for MDPs with σ-compact state spaces, proposed by Saldi et al., does not rely on any Lipschitz-type continuity conditions [20].

The second category of discretization methods uses a computational grid only for the state space, i.e., this class of methods does not discretize the action space. These approaches have a computational advantage over the methods in the first category, particularly when the action space dimension is large. The state space discretization procedures proposed by Hernández-Lerma [21] are shown to have a convergence guarantee with an error bound under Lipschitz continuity conditions on elements of control models. However, they are subject to the issue of local optimality in solving the nonconvex optimization problem over the action space involved in the Bellman equation. Johnson et al. [22] suggested spline interpolation methods, which are computationally efficient in high-dimensional state and action spaces. Unfortunately, these spline-based approaches do not have a convergence guarantee or an explicit suboptimality bound. Furthermore, the Bellman equation approximated by these methods involves nonconvex optimization problems.

The method proposed in this paper is classified into the second category of discretization procedures. Specifically, our approach only discretizes the state space, and thus it can handle high-dimensional action spaces. The key idea is to use an auxiliary optimization variable that assigns the contribution of each grid point when evaluating the value function at a particular state. By doing so, we can avoid an explicit interpolation of the optimal value function and control policies evaluated at the pre-specified grid points in the state space, unlike most existing methods. The contributions of this work are threefold.
First, we propose an approximate version of the Bellman operator and show that the corresponding approximate value function converges uniformly to the optimal value function $v^\star$ when $v^\star$ is convex. The proposed approximate Bellman operator has a computational advantage because it involves a convex optimization problem in the case of linear systems and convex costs. Thus, we can construct a control policy, whose performance converges uniformly to the optimum, by solving $M$ convex programs in each iteration of the DP algorithm, where $M$ is the number of grid points required for the desired accuracy. Second, we show that the proposed convex optimization approach provides an approximate policy with a provable suboptimality bound in the case of control-affine systems. This error bound is useful when gauging the performance of the approximate policy relative to an optimal policy.

Third, we propose a modified version of our approximation method for general cases by localizing the effect of the auxiliary variable. The modified Bellman operator involves a nonconvex bi-level optimization problem wherein the inner problem is a linear program. We show that both the approximate value function and the cost-to-go function of a policy obtained by this method converge uniformly to the optimal value function if a globally optimal solution to the nonconvex bi-level problem can be evaluated; related error bounds are also characterized. From our convex formulation and analysis, we observe that convexity in DP deserves attention, as it can play a critical role in designing a tractable numerical method that provides a convergent solution.

The remainder of the paper is organized as follows. In Section 2, we introduce the setup for dynamic programming problems and our approximation method. The convex optimization-based method is proposed in Section 3. We show the uniform convergence properties of this method when the optimal value function is convex. In Section 4, we characterize an error bound for cases with control-affine systems and present a modified bi-level version of our approximation method for general cases. Section 5 contains numerical experiments that demonstrate the convergence property of our method.

¹However, our method is suitable for problems with high-dimensional action spaces.

2 The Setup

2.1 Notation

Let $B_b(\mathcal{X})$ denote the set of bounded measurable functions on $\mathcal{X}$, equipped with the sup norm $\|v\|_\infty := \sup_{x \in \mathcal{X}} |v(x)| < +\infty$. Given a measurable function $w : \mathcal{X} \to \mathbb{R}$, let $B_w(\mathcal{X})$ denote the set of measurable functions $v$ on $\mathcal{X}$ such that $\|v\|_w := \sup_{x \in \mathcal{X}} (|v(x)|/w(x)) < +\infty$. Let $L(\mathcal{X})$ denote the set of lower semicontinuous functions on $\mathcal{X}$. Finally, we let $L_b(\mathcal{X}) := L(\mathcal{X}) \cap B_b(\mathcal{X})$ and $L_w(\mathcal{X}) := L(\mathcal{X}) \cap B_w(\mathcal{X})$.

2.2 Dynamic Programming Problems

Consider a discrete-time Markov control system of the form

$$x_{t+1} = f(x_t, u_t, \xi_t),$$

where $x_t \in \mathcal{X} \subseteq \mathbb{R}^n$ is the system state, and $u_t \in \mathcal{U}(x_t) \subseteq \mathcal{U} \subseteq \mathbb{R}^m$ is the control input. The stochastic disturbance process $\{\xi_t\}_{t \ge 0}$, $\xi_t \in \Xi \subseteq \mathbb{R}^l$, is i.i.d. and defined on a standard filtered probability space $(\Omega, \mathcal{F}, \{\mathcal{F}_t\}_{t \ge 0}, \mathbb{P})$. The sets $\mathcal{X}$, $\mathcal{U}(x_t)$, and $\Xi$ are assumed to be Borel sets, and the function $f : \mathcal{X} \times \mathcal{U} \times \Xi \to \mathcal{X}$ is assumed to be measurable. A history up to stage $t$ is defined by $h_t := (x_0, u_0, \ldots, x_{t-1}, u_{t-1}, x_t)$. Let $H_t$ be the set of histories up to stage $t$ and $\pi_t$ be a stochastic kernel from $H_t$ to $\mathcal{U}$. The set of admissible control policies is chosen as

$$\Pi := \{\pi = (\pi_0, \ldots, \pi_{K-1}) : \pi_t(\mathcal{U}(x_t) \mid h_t) = 1 \ \forall h_t \in H_t\}.$$
Our goal is to solve the following finite-horizon stochastic optimal control problem:
$$\inf_{\pi \in \Pi} \mathbb{E}^\pi\Big[\sum_{t=0}^{K-1} r(x_t, u_t) + q(x_K) \,\Big|\, x_0 = x\Big], \tag{2.1}$$
where $r : \mathcal{X} \times \mathcal{U} \to \mathbb{R}$ is a measurable stage-wise cost function, $q : \mathcal{X} \to \mathbb{R}$ is a measurable terminal cost function, and $\mathbb{E}^\pi$ represents the expectation taken with respect to the probability measure induced by a policy $\pi$. Under the following standard assumption for the measurable selection condition, there exists a deterministic Markov policy which is optimal (e.g., [3, Condition 3.3.3, Theorem 3.3.5] and [23, Assumptions 8.5.1–8.5.3, Lemma 8.5.5]). In other words, we can find an optimal policy $\pi^\star$ which is deterministic and Markov, i.e.,

$$\pi^\star \in \Pi^{\mathrm{DM}} := \{\pi = (\pi_0, \ldots, \pi_{K-1}) : \pi_t : \mathcal{X} \to \mathcal{U} \text{ is measurable, } \pi_t(x) = u \in \mathcal{U}(x) \ \forall x \in \mathcal{X}\}.$$

Assumption 1 (Semicontinuous model). Let

$$\mathcal{K} := \{(x, u) : x \in \mathcal{X}, \ u \in \mathcal{U}(x)\}$$
be the set of admissible state-action pairs.

1. The control set $\mathcal{U}(x)$ is compact for each $x \in \mathcal{X}$, and the multifunction $x \mapsto \mathcal{U}(x)$ is upper semicontinuous;

2. The real-valued functions $r$ and $q$ are lower semicontinuous on $\mathcal{K}$ and $\mathcal{X}$, respectively;

3. There exist nonnegative constants $\bar r$, $\bar q$ and $\beta$, with $\beta \ge 1$, and a continuous weight function $w : \mathcal{X} \to [1, \infty[$ such that $|r(x, u)| \le \bar r w(x)$, $|q(x)| \le \bar q w(x)$, and $\mathbb{E}[w(f(x, u, \xi))] \le \beta w(x)$ for all $(x, u) \in \mathcal{K}$;

4. The function $(x, u) \mapsto f(x, u, \xi)$ is continuous on $\mathcal{K}$ for each $\xi \in \Xi$;

5. The support $\Xi$ is finite, i.e., $\Xi := \{\hat\xi^{[1]}, \ldots, \hat\xi^{[N]}\}$ for some $\hat\xi^{[s]}$'s in $\mathbb{R}^l$, $s = 1, \ldots, N$.

Note that by 3), 4) and 5), $(x, u) \mapsto \mathbb{E}[w(f(x, u, \xi))]$ is continuous on $\mathcal{K}$. For any $v \in B_w(\mathcal{X})$, let
$$(\mathcal{T} v)(x) := \inf_{u \in \mathcal{U}(x)} \big\{ r(x, u) + \mathbb{E}[v(f(x, u, \xi))] \big\}, \quad x \in \mathcal{X}. \tag{2.2}$$

Under Assumption 1, the dynamic programming (DP) operator $\mathcal{T}$ maps $L_w(\mathcal{X})$ into itself by [23, Lemma 8.5.5]. Let $v_t^{\mathrm{opt}} : \mathcal{X} \to \mathbb{R}$, $t = 0, \ldots, K$, be defined by the following Bellman equation:
$$v_t^{\mathrm{opt}} := \mathcal{T} v_{t+1}^{\mathrm{opt}}, \quad t = 0, \ldots, K-1,$$
with $v_K^{\mathrm{opt}} \equiv q$. Under Assumption 1, $v_t^{\mathrm{opt}} \in L_w(\mathcal{X})$, and
$$v_t^{\mathrm{opt}}(x) = \inf_{\pi \in \Pi} \mathbb{E}^\pi\Big[\sum_{k=t}^{K-1} r(x_k, u_k) + q(x_K) \,\Big|\, x_t = x\Big],$$

i.e., $v_t^{\mathrm{opt}}$ is the optimal value function of (2.1). Given a measurable function $\varphi : \mathcal{X} \to \mathcal{U}$, let
$$(\mathcal{T}^\varphi v)(x) := r(x, \varphi(x)) + \mathbb{E}[v(f(x, \varphi(x), \xi))].$$
Under Assumption 1, there exists a measurable function $\pi_t^{\mathrm{opt}} : \mathcal{X} \to \mathcal{U}$ such that $\pi_t^{\mathrm{opt}}(x) \in \mathcal{U}(x)$ and
$$v_t^{\mathrm{opt}}(x) = (\mathcal{T}^{\pi_t^{\mathrm{opt}}} v_{t+1}^{\mathrm{opt}})(x)$$
for all $x \in \mathcal{X}$ and all $t = 0, \ldots, K-1$. Then, the deterministic Markov policy $\pi^{\mathrm{opt}} := (\pi_0^{\mathrm{opt}}, \ldots, \pi_{K-1}^{\mathrm{opt}}) \in \Pi^{\mathrm{DM}}$ is an optimal solution to (2.1) by the dynamic programming principle [3, Theorem 3.2.1].


Figure 1: A two-dimensional example of the proposed discretization method. The dark gray box is $\mathcal{Z}_0 = \bigcup_{i=1}^{2} \mathcal{C}_i$, and the light gray box is $\mathcal{Z}_1 = \bigcup_{i=1}^{12} \mathcal{C}_i$. The nodes $x^{[1]}, \ldots, x^{[20]}$ are selected as the vertices of the small square boxes such that $x^{[1]}, \ldots, x^{[6]} \in \mathcal{Z}_0$. In this example, $M_0 = 6$ and $M_1 = 20$.

In practice, Assumption 1-5) can be relaxed to probability distributions with a continuous support by using sampling-based methods. Specifically, the proposed convex optimization approach can be used in conjunction with the sampling-based (or 'empirical') method in [24–26]. The convergence of the sampling-based method has been shown under restrictive conditions such as compact state spaces or finite action spaces. The validity of such a combination has been numerically tested in Section 5.1. Its extension using stochastic subgradient methods can be found in [27].
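As an illustration of this relaxation, the following minimal sketch builds an empirical (sample-based) disturbance support with uniform weights, which recovers the finite-support setting of Assumption 1-5). The function names and the uniform disturbance below are illustrative assumptions, not part of the paper.

```python
# A minimal sketch of the sampling-based ('empirical') relaxation mentioned above: a
# continuous disturbance distribution is replaced by N i.i.d. samples with uniform
# weights p_s = 1/N, which recovers the finite-support setting of Assumption 1-5).
import numpy as np

def empirical_support(sample_disturbance, N, seed=0):
    rng = np.random.default_rng(seed)
    xi_support = np.array([sample_disturbance(rng) for _ in range(N)])
    p = np.full(N, 1.0 / N)          # uniform weights p_s = 1/N
    return xi_support, p

# example: xi ~ Uniform[-0.1, 0.1] with N = 10, as in the experiment of Section 5.1
xi_support, p = empirical_support(lambda rng: rng.uniform(-0.1, 0.1), N=10)
```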

2.3 State Space Discretization

Our method aims at approximating the optimal value function $v_t^{\mathrm{opt}}$ at pre-specified nodes in a convex compact set $\mathcal{Z}_t \subseteq \mathcal{X}$ for all $t$. We select a sequence $\{\mathcal{Z}_t\}_{t=0}^{K}$ of convex compact sets such that
$$\mathcal{Z}_0 \subseteq \mathcal{Z}_1 \subseteq \cdots \subseteq \mathcal{Z}_K \subseteq \mathcal{X}$$
and
$$\{f(x, u, \xi) : x \in \mathcal{Z}_t, \ u \in \mathcal{U}(x), \ \xi \in \Xi\} \subseteq \mathcal{Z}_{t+1}. \tag{2.3}$$
The existence of such a sequence is guaranteed under Assumption 1 because $f$ is continuous in $(x, u)$, $\mathcal{U}(x)$ is compact, and $\Xi$ is finite. Note that $\mathcal{Z}_K \neq \mathcal{X}$ in general, and the state space $\mathcal{X}$ does not have to be compact. Several examples satisfying these two conditions can be found in Section 5. This choice of computational domains is also used in [13]. With such a sequence, we can evaluate $v_t^{\mathrm{opt}}$ on $\mathcal{Z}_t$ by using only information about $v_{t+1}^{\mathrm{opt}}$ on $\mathcal{Z}_{t+1}$.

We choose $N_C$ convex compact subsets $\mathcal{C}_1, \ldots, \mathcal{C}_{N_C}$ of $\mathcal{Z}_K$ satisfying the following properties:

1. $\bigcup_{i=1}^{N_C} \mathcal{C}_i = \mathcal{Z}_K$;

2. $\mathcal{C}_i^o \cap \mathcal{C}_j^o = \emptyset$ for $i \neq j$, where $\mathcal{C}_i^o$ denotes the interior of $\mathcal{C}_i$;

3. For each $t$, there exists a subsequence $\{\mathcal{C}_{i_j}\}_{j=1}^{N_{C,t}}$ of $\{\mathcal{C}_i\}_{i=1}^{N_C}$ such that $\bigcup_{j=1}^{N_{C,t}} \mathcal{C}_{i_j} = \mathcal{Z}_t$.

We relabel the $\mathcal{C}_i$, if necessary, so that $\bigcup_{i=1}^{N_{C,t}} \mathcal{C}_i = \mathcal{Z}_t$ for all $t$. Note that
$$N_{C,K} = N_C, \quad \text{and} \quad N_{C,0} \le N_{C,1} \le \cdots \le N_{C,K}.$$

Let $M_t$ denote the number of nodes (or grid points) in $\mathcal{Z}_t$. The nodes $x^{[1]}, \ldots, x^{[M_K]} \in \mathcal{Z}_K$ are chosen such that each $\mathcal{C}_i$ is the convex hull of a subset of the nodes. A concrete way to construct the $\mathcal{Z}_t$'s and $\mathcal{C}_i$'s using a rectilinear grid is described in Appendix A. For example, in Fig. 1, $\mathcal{C}_1$ is the convex hull of $\{x^{[1]}, x^{[2]}, x^{[4]}, x^{[5]}\}$. The maximum of the diameters of the sets $\{\mathcal{C}_i\}_{i=1}^{N_C}$ is given by
$$\delta := \max_{k=1,\ldots,N_C} \ \max_{x^{[i]}, x^{[j]} \in \mathcal{C}_k} \|x^{[i]} - x^{[j]}\|.$$

For example, in Fig. 1, $\delta$ is equal to the length of $\mathcal{C}_i$'s diagonal. We also relabel the $x^{[i]}$, if necessary, so that $x^{[1]}, \ldots, x^{[M_t]} \in \mathcal{Z}_t$ for all $t$. Thus,

$$M_0 \le M_1 \le \cdots \le M_K.$$

2.4 Value Functions on Restricted Spaces

Let $v_t^\star : \mathcal{Z}_t \to \mathbb{R}$ be the optimal value function restricted on $\mathcal{Z}_t$ for each $t$. In other words,
$$v_t^\star(x) = v_t^{\mathrm{opt}}(x) \quad \forall x \in \mathcal{Z}_t.$$

Given any $v \in B_b(\mathcal{Z}_{t+1})$, let
$$(T_t v)(x) := \inf_{u \in \mathcal{U}(x)} \Big\{ r(x, u) + \mathbb{E}\big[v(f(x, u, \xi))\big] \Big\} = \inf_{u \in \mathcal{U}(x)} \Big\{ r(x, u) + \sum_{s=1}^{N} p_s v(f(x, u, \hat\xi^{[s]})) \Big\} \tag{2.4}$$
for all $x \in \mathcal{Z}_t$, where
$$p_s := \mathrm{Prob}(\xi = \hat\xi^{[s]}) \in [0, 1],$$
and thus $\sum_{s=1}^{N} p_s = 1$ by definition. Under Assumption 1, the restricted optimal value functions satisfy the following Bellman equation:

$$v_t^\star(x) = (T_t v_{t+1}^\star)(x) \quad \forall x \in \mathcal{Z}_t.$$
For any $x \in \mathcal{Z}_t$, $v_t^\star(x)$ can be computed by using $v_{t+1}^\star$, given our choice of the $\mathcal{Z}_t$'s in Section 2.3. This computation does not require information about $v_{t+1}^{\mathrm{opt}}$ outside $\mathcal{Z}_{t+1}$. Our goal is to develop a convex optimization-based approach to approximating the restricted optimal value function $v_t^\star$ in a convergent manner.
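For contrast, the sketch below shows a naive way to evaluate the exact operator $T_t$ in (2.4): discretize the action set and interpolate $v$ off the grid. This is precisely the kind of action-space discretization and explicit interpolation that the method developed in Section 3 avoids; the one-dimensional setting, the piecewise-linear interpolation rule, and all names are illustrative assumptions.

```python
# A naive baseline for (2.4), shown only for contrast with the proposed method: it
# discretizes the action set and interpolates v off the grid.
import numpy as np

def naive_bellman_1d(x, v_nodes, nodes, f, r, actions, xi_support, p):
    """nodes: 1-D grid on Z_{t+1}; v_nodes: values of v at those nodes;
    actions: a finite discretization of U(x)."""
    best = np.inf
    for u in actions:
        y = np.array([f(x, u, xi) for xi in xi_support])   # next states for each sample
        v_next = np.interp(y, nodes, v_nodes)              # explicit interpolation of v
        best = min(best, r(x, u) + p @ v_next)
    return best
```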

3 Convex Value Functions

3.1 Approximation of the Bellman Operator

When computing the optimal value functions via the DP algorithm

$$v_t^\star := T_t v_{t+1}^\star,$$

a closed-form expression of $v_t^\star$ is unavailable in general. Unfortunately, it is challenging to solve the optimization problem in the definition (2.4) of $T_t$ without a closed-form expression of $v_{t+1}^\star$. To resolve this issue, we propose an approximation method that uses the value of $v_{t+1}^\star$ only at the nodes $x^{[1]}, \ldots, x^{[M_{t+1}]}$ in $\mathcal{Z}_{t+1}$.

Given any $v \in B_b(\mathcal{Z}_{t+1})$, let

$$\begin{aligned}
(\hat T_t v)(x) := \inf_{u, \gamma} \ & r(x, u) + \sum_{s=1}^{N} \sum_{i=1}^{M_{t+1}} p_s \gamma_{s,i} v(x^{[i]}) \\
\text{s.t.} \ & f(x, u, \hat\xi^{[s]}) = \sum_{i=1}^{M_{t+1}} \gamma_{s,i} x^{[i]} \quad \forall s \in \mathcal{S}, \\
& u \in \mathcal{U}(x), \ \gamma_s \in \Delta \quad \forall s \in \mathcal{S}
\end{aligned} \tag{3.1}$$
for each $x \in \mathcal{Z}_t$, where $\Delta$ is the $(M_{t+1}-1)$-dimensional probability simplex, i.e., $\Delta := \{\gamma \in \mathbb{R}^{M_{t+1}} : \sum_{i=1}^{M_{t+1}} \gamma_i = 1, \ \gamma_i \ge 0 \ \forall i = 1, \ldots, M_{t+1}\}$, and $\mathcal{S} := \{1, \ldots, N\}$. Under Assumption 1, the objective function of (3.1) is lower semicontinuous in $(u, \gamma) \in \mathcal{U}(x) \times \Delta^N$, and the feasible set is compact. Therefore, by [3, Proposition D.5], (3.1) admits an optimal solution, and $\hat T_t$ maps $B_b(\mathcal{Z}_{t+1})$ to $B_b(\mathcal{Z}_t)$. Moreover, $\hat T_t$ is monotone, i.e., for any $v, v' \in B_b(\mathcal{Z}_{t+1})$ such that $v \le v'$, we have $\hat T_t v \le \hat T_t v'$. The constrained optimization problem in (3.1) for a fixed $x \in \mathcal{Z}_t$ can be numerically solved to obtain a globally optimal solution using several existing convex optimization algorithms (e.g., [28–30]) if the problem is convex, i.e., $u \mapsto r(x, u)$ is a convex function and $u \mapsto f(x, u, \hat\xi^{[s]})$ is an affine function.²

Note that for each sample index $s$, the term $v(f(x, u, \hat\xi^{[s]}))$ in the original Bellman operator is approximated by $\sum_{i=1}^{M_{t+1}} \gamma_{s,i} v(x^{[i]})$, where the auxiliary weight variable $\gamma_{s,i}$ can be interpreted as the degree to which $f(x, u, \hat\xi^{[s]})$ is represented by $x^{[i]}$. This idea of representing the next state as a convex combination of grid points or nodes has also been adopted in the analysis and design of semi-Lagrangian schemes for Hamilton–Jacobi–Bellman partial differential equations arising in continuous-time optimal control problems [31]. Although this paper focuses on discrete-time DP problems, it would be interesting future research to extend our method to the continuous-time setting, possibly using advanced techniques developed in the recent literature on semi-Lagrangian methods (e.g., [32, 33]).

When $v$ is convex, we immediately observe that $T_t v$ is upper bounded by $\hat T_t v$ on $\mathcal{Z}_t$ because
$$v(f(x, u, \hat\xi^{[s]})) = v\Big(\sum_{i=1}^{M_{t+1}} \gamma_{s,i} x^{[i]}\Big) \le \sum_{i=1}^{M_{t+1}} \gamma_{s,i} v(x^{[i]}).$$

By further showing that $T_t v$ is lower bounded by $\hat T_t v - C\delta$ for some positive constant $C$, we prove that $\hat T_t v$ converges uniformly to $T_t v$ on $\mathcal{Z}_t$ as the maximum distance $\delta$ between neighboring nodes tends to zero. By definition, $\hat T_t$ depends on the discretization parameter $\delta$ as well as on $N$. For notational simplicity, however, we suppress the dependence on $\delta$ and $N$.
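As an illustration, the sketch below formulates the problem in (3.1) at a single state $x$ as a convex program, assuming a linear system $x_{t+1} = Ax_t + Bu_t + C\xi_t$, a convex stage cost supplied as a CVXPY expression, and a box control set; all names (A, B, C, nodes, u_max) and the use of CVXPY are illustrative assumptions rather than part of the paper.

```python
# A minimal sketch of the approximate Bellman operator \hat{T}_t in (3.1) for a linear
# system and a convex stage cost, using CVXPY.  All names are illustrative assumptions.
import numpy as np
import cvxpy as cp

def approx_bellman(x, v_nodes, nodes, A, B, C, xi_support, p, r, u_max):
    """Minimize r(x,u) + sum_s p_s gamma_s^T v_nodes over (u, gamma), subject to
    A x + B u + C xi_s = sum_i gamma_{s,i} x^[i] and gamma_s in the probability simplex."""
    M, n = nodes.shape                 # M grid points of Z_{t+1} in R^n
    N = len(xi_support)                # number of disturbance samples
    u = cp.Variable(B.shape[1])
    gamma = cp.Variable((N, M), nonneg=True)

    constraints = [cp.sum(gamma, axis=1) == 1,     # each gamma_s lies in the simplex
                   cp.abs(u) <= u_max]             # box-shaped control set U(x)
    for s, xi in enumerate(xi_support):
        # the next state is represented as a convex combination of the grid nodes
        constraints.append(A @ x + B @ u + C * xi == nodes.T @ gamma[s])

    objective = cp.Minimize(r(x, u) + cp.sum(cp.multiply(p, gamma @ v_nodes)))
    problem = cp.Problem(objective, constraints)
    problem.solve()
    return problem.value, u.value
```

Here `r` would be passed in as a function returning a CVXPY expression, e.g. `lambda x, u: np.abs(x).sum() + cp.sum_squares(u)` for the cost used later in Section 5.1; the returned pair is the approximate value and a minimizing control at $x$.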

3.2 Error Bound and Convergence

In this section, we assume the convexity of $v$ and show the uniform convergence property of our method. This assumption will be relaxed in Section 4.

²More precisely, the set $\mathcal{U}(x)$ needs to be represented by convex inequalities, i.e., there exist functions $a_k : \mathcal{X} \times \mathbb{R}^m \to \mathbb{R}$ and $b_k : \mathcal{X} \to \mathbb{R}$ such that
$$\mathcal{U}(x) := \{u \in \mathbb{R}^m : a_k(x, u) \le b_k(x), \ k = 1, \ldots, N_{\mathrm{ineq}}\},$$
where $u \mapsto a_k(x, u)$ is a convex function for each fixed $x \in \mathcal{X}$ and each $k$.

Proposition 1. Suppose that Assumption 1 holds, and that $v \in L_b(\mathcal{Z}_{t+1})$ is convex. Then, we have
$$(\hat T_t v)(x) - L_v \delta \le (T_t v)(x) \le (\hat T_t v)(x) \quad \forall x \in \mathcal{Z}_t,$$
where $L_v := \max_{j=1,\ldots,N_{C,t+1}} \sup_{x, x' \in \mathcal{C}_j : x \neq x'} \frac{\|v(x) - v(x')\|}{\|x - x'\|}$, and therefore $\hat T_t v$ converges uniformly to $T_t v$ on $\mathcal{Z}_t$ as $\delta \to 0$.

Note that $L_v < \infty$ because $v$ is continuous on the compact set $\mathcal{Z}_{t+1}$ under the assumption of convexity and lower semicontinuity.

Proof. Fix an arbitrary $x \in \mathcal{Z}_t$. We first show that $(T_t v)(x) \le (\hat T_t v)(x)$. Under Assumption 1, the objective function of (3.1) is lower semicontinuous, and the feasible set is compact. Let $(\hat u, \hat\gamma) \in \mathcal{U}(x) \times \Delta^N$ be an optimal solution to (3.1), i.e., it satisfies
$$(\hat T_t v)(x) = r(x, \hat u) + \sum_{s=1}^{N} \sum_{i=1}^{M_{t+1}} p_s \hat\gamma_{s,i} v(x^{[i]}),$$
$$f(x, \hat u, \hat\xi^{[s]}) = \sum_{i=1}^{M_{t+1}} \hat\gamma_{s,i} x^{[i]} \in \mathcal{Z}_{t+1} \quad \forall s \in \mathcal{S}.$$
The convexity of $v$ on $\mathcal{Z}_{t+1}$ implies that
$$v\Big(\sum_{i=1}^{M_{t+1}} \hat\gamma_{s,i} x^{[i]}\Big) \le \sum_{i=1}^{M_{t+1}} \hat\gamma_{s,i} v(x^{[i]}) \quad \forall s \in \mathcal{S}.$$
Therefore, by the definition of $(\hat u, \hat\gamma)$, we have
$$(\hat T_t v)(x) \ge r(x, \hat u) + \sum_{s=1}^{N} p_s v\Big(\sum_{i=1}^{M_{t+1}} \hat\gamma_{s,i} x^{[i]}\Big) = r(x, \hat u) + \sum_{s=1}^{N} p_s v(f(x, \hat u, \hat\xi^{[s]})) \ge \inf_{u \in \mathcal{U}(x)} \Big\{ r(x, u) + \sum_{s=1}^{N} p_s v(f(x, u, \hat\xi^{[s]})) \Big\} = (T_t v)(x).$$

We now show that $(\hat T_t v)(x) - L_v \delta \le (T_t v)(x)$. Under Assumption 1, the objective function of (2.4) is lower semicontinuous [23, Lemma 8.5.5], and the feasible set is compact. Thus, by [3, Proposition D.5], it admits an optimal solution, i.e., there exists $u^\star \in \mathcal{U}(x)$ such that
$$(T_t v)(x) = r(x, u^\star) + \sum_{s=1}^{N} p_s v(f(x, u^\star, \hat\xi^{[s]})).$$
Let
$$y_s := f(x, u^\star, \hat\xi^{[s]}) \quad \forall s \in \mathcal{S}.$$
Then, $y_s \in \mathcal{Z}_{t+1}$ because $x \in \mathcal{Z}_t$. Thus, there exists a unique $j \in \{1, \ldots, N_{C,t+1}\}$ such that $y_s \in \mathcal{C}_j$. Let $\mathcal{N}(y_s)$ denote the set of grid points on the cell $\mathcal{C}_j$ that contains $y_s$. For each $s$, choose $\gamma_s^\star \in \Delta$ such that $f(x, u^\star, \hat\xi^{[s]}) = \sum_{i \in \mathcal{N}(y_s)} \gamma_{s,i}^\star x^{[i]}$ and $\sum_{i \in \mathcal{N}(y_s)} \gamma_{s,i}^\star = 1$. Note that $\gamma_{s,i}^\star = 0$ for all $i \notin \mathcal{N}(y_s)$. We then have
$$(T_t v)(x) = r(x, u^\star) + \sum_{s=1}^{N} p_s v(f(x, u^\star, \hat\xi^{[s]})) = r(x, u^\star) + \sum_{s=1}^{N} p_s v\Big(\sum_{i \in \mathcal{N}(y_s)} \gamma_{s,i}^\star x^{[i]}\Big). \tag{3.2}$$
By the definition of $L_v$, we have
$$v\Big(\sum_{i \in \mathcal{N}(y_s)} \gamma_{s,i}^\star x^{[i]}\Big) \ge v(x^{[i']}) - L_v \Big\| x^{[i']} - \sum_{i \in \mathcal{N}(y_s)} \gamma_{s,i}^\star x^{[i]} \Big\|$$
for all $i' \in \mathcal{N}(y_s)$. This implies that
$$\begin{aligned}
v\Big(\sum_{i \in \mathcal{N}(y_s)} \gamma_{s,i}^\star x^{[i]}\Big) &\ge \max_{i' \in \mathcal{N}(y_s)} \Big( v(x^{[i']}) - L_v \Big\| x^{[i']} - \sum_{i \in \mathcal{N}(y_s)} \gamma_{s,i}^\star x^{[i]} \Big\| \Big) \\
&\ge \max_{i' \in \mathcal{N}(y_s)} \Big( v(x^{[i']}) - L_v \sum_{i \in \mathcal{N}(y_s)} \gamma_{s,i}^\star \|x^{[i']} - x^{[i]}\| \Big) \\
&\ge \max_{i' \in \mathcal{N}(y_s)} \Big( v(x^{[i']}) - L_v \sum_{i \in \mathcal{N}(y_s)} \gamma_{s,i}^\star \delta \Big) \\
&\ge \sum_{i' \in \mathcal{N}(y_s)} \gamma_{s,i'}^\star \big(v(x^{[i']}) - L_v \delta\big) \\
&= \sum_{i'=1}^{M_{t+1}} \gamma_{s,i'}^\star v(x^{[i']}) - L_v \delta,
\end{aligned} \tag{3.3}$$
where the third inequality holds by the definition of $\delta$, and the last equality holds because $\sum_{i' \in \mathcal{N}(y_s)} \gamma_{s,i'}^\star = 1$. Combining (3.2) and (3.3), we obtain that
$$(T_t v)(x) \ge r(x, u^\star) + \sum_{s=1}^{N} \sum_{i=1}^{M_{t+1}} p_s \gamma_{s,i}^\star v(x^{[i]}) - L_v \delta \ge (\hat T_t v)(x) - L_v \delta,$$
where the second inequality holds because $(u^\star, \gamma^\star)$ is a feasible solution to (3.1). Therefore, the result follows.

Let $\hat v_t : \mathcal{Z}_t \to \mathbb{R}$ be defined by
$$\hat v_t(x) := (\hat T_t \hat v_{t+1})(x) \quad \forall x \in \mathcal{Z}_t$$
for $t = 0, \ldots, K-1$, with $\hat v_K \equiv q$ on $\mathcal{Z}_K$. We now show that $\hat v_t$ converges uniformly to the optimal value function $v_t^\star$ on $\mathcal{Z}_t$ as $\delta$ tends to zero when all $v_t^\star$'s are convex. A sufficient condition for the convexity of the $v_t^\star$'s is provided in Section 3.3.

Theorem 1 (Uniform Convergence and Error Bound I). Suppose that Assumption 1 holds, and that the optimal value function $v_t^\star$ is convex on $\mathcal{Z}_t$ for all $t = 0, \ldots, K$. Then, we have
$$0 \le \hat v_t(x) - v_t^\star(x) \le \sum_{k=t+1}^{K} L_k \delta \quad \forall x \in \mathcal{Z}_t, \ \forall t = 0, \ldots, K, \tag{3.4}$$
where $L_k := \sup_{j=1,\ldots,N_{C,k}} \sup_{x, x' \in \mathcal{C}_j : x \neq x'} \frac{\|v_k^\star(x) - v_k^\star(x')\|}{\|x - x'\|}$, and therefore $\hat v_t$ converges uniformly to $v_t^\star$ on $\mathcal{Z}_t$ as $\delta \to 0$.

Proof. We use mathematical induction to prove (3.4). For $t = K$, $\hat v_K = v_K^\star \equiv q$ on $\mathcal{Z}_K$, and thus (3.4) holds. Suppose that the induction hypothesis holds for some $t$. By applying the operator $\hat T_{t-1}$ to all sides of (3.4), we have, for all $x \in \mathcal{Z}_{t-1}$,
$$(\hat T_{t-1} \hat v_t)(x) - \sum_{k=t+1}^{K} L_k \delta \le (\hat T_{t-1} v_t^\star)(x) \le (\hat T_{t-1} \hat v_t)(x)$$
by the monotonicity of $\hat T_{t-1}$. On the other hand, by Proposition 1,
$$(\hat T_{t-1} v_t^\star)(x) - L_t \delta \le (T_{t-1} v_t^\star)(x) \le (\hat T_{t-1} v_t^\star)(x) \quad \forall x \in \mathcal{Z}_{t-1}$$
because $v_t^\star \in L_b(\mathcal{Z}_t)$ [23, Lemma 8.5.5] and it is convex. Therefore, we conclude that
$$(\hat T_{t-1} \hat v_t)(x) - \sum_{k=t}^{K} L_k \delta \le (T_{t-1} v_t^\star)(x) \le (\hat T_{t-1} \hat v_t)(x) \quad \forall x \in \mathcal{Z}_{t-1}.$$
Note that $\hat v_{t-1} = \hat T_{t-1} \hat v_t$ and $v_{t-1}^\star = T_{t-1} v_t^\star$ on $\mathcal{Z}_{t-1}$ by definition. This implies that $\hat v_{t-1}(x) - \sum_{k=t}^{K} L_k \delta \le v_{t-1}^\star(x) \le \hat v_{t-1}(x)$ for all $x \in \mathcal{Z}_{t-1}$, as desired.

The approximate value function $\hat v_{t+1}$ evaluated at the nodes $x^{[1]}, \ldots, x^{[M_{t+1}]}$ for $t = 0, \ldots, K$ can be used to construct a deterministic Markov policy $\hat\pi := (\hat\pi_0, \ldots, \hat\pi_{K-1})$ by setting
$$\begin{aligned}
\hat\pi_t(x) \in \arg\min_{u \in \mathcal{U}(x)} \min_{\gamma \in \Delta^N} \ & r(x, u) + \sum_{s=1}^{N} \sum_{i=1}^{M_{t+1}} p_s \gamma_{s,i} \hat v_{t+1}(x^{[i]}) \\
\text{s.t.} \ & f(x, u, \hat\xi^{[s]}) = \sum_{i=1}^{M_{t+1}} \gamma_{s,i} x^{[i]} \quad \forall s \in \mathcal{S}
\end{aligned}$$
for all $x \in \mathcal{Z}_t$. Let $v_t^{\hat\pi} : \mathcal{Z}_t \to \mathbb{R}$ be defined by
$$v_t^{\hat\pi}(x) := (T^{\hat\pi_t} v_{t+1}^{\hat\pi})(x) := r(x, \hat\pi_t(x)) + \sum_{s=1}^{N} p_s v_{t+1}^{\hat\pi}(f(x, \hat\pi_t(x), \hat\xi^{[s]})) \quad \forall x \in \mathcal{Z}_t,$$
with $v_K^{\hat\pi} \equiv q$ on $\mathcal{Z}_K$. It is straightforward to check that $T^{\hat\pi_t}$ is monotone. To show that the cost-to-go function $v_t^{\hat\pi}$ of $\hat\pi$ converges uniformly to the optimal value function $v_t^\star$, we first observe the following:

Lemma 1. Suppose that Assumption 1 holds, and that $\hat v_t$ is convex on $\mathcal{Z}_t$ for all $t = 0, \ldots, K$. Then, we have
$$v_t^{\hat\pi}(x) \le \hat v_t(x) \quad \forall x \in \mathcal{Z}_t$$
for all $t = 0, \ldots, K$.

Proof. We use mathematical induction. For $t = K$, $v_K^{\hat\pi} = \hat v_K \equiv q$ on $\mathcal{Z}_K$, and thus the induction hypothesis holds. Suppose that $v_{t+1}^{\hat\pi} \le \hat v_{t+1}$ on $\mathcal{Z}_{t+1}$ for some $t+1$. By the monotonicity of $T^{\hat\pi_t}$, we have
$$(T^{\hat\pi_t} v_{t+1}^{\hat\pi})(x) \le (T^{\hat\pi_t} \hat v_{t+1})(x) \quad \forall x \in \mathcal{Z}_t. \tag{3.5}$$
Fix an arbitrary $x \in \mathcal{Z}_t$. By the definition of $\hat T_t$ and $\hat\pi$, under Assumption 1, there exists $\hat\gamma \in \Delta^N$ such that
$$(\hat T_t \hat v_{t+1})(x) = r(x, \hat\pi_t(x)) + \sum_{s=1}^{N} \sum_{i=1}^{M_{t+1}} p_s \hat\gamma_{s,i} \hat v_{t+1}(x^{[i]}) \tag{3.6}$$
and
$$f(x, \hat\pi_t(x), \hat\xi^{[s]}) = \sum_{i=1}^{M_{t+1}} \hat\gamma_{s,i} x^{[i]} \quad \forall s \in \mathcal{S}.$$
By the convexity of $\hat v_{t+1}$, we have
$$\sum_{i=1}^{M_{t+1}} \hat\gamma_{s,i} \hat v_{t+1}(x^{[i]}) \ge \hat v_{t+1}\Big(\sum_{i=1}^{M_{t+1}} \hat\gamma_{s,i} x^{[i]}\Big) = \hat v_{t+1}(f(x, \hat\pi_t(x), \hat\xi^{[s]})) \tag{3.7}$$
for each $s \in \mathcal{S}$. By combining the inequalities (3.6) and (3.7), we obtain that
$$(\hat T_t \hat v_{t+1})(x) \ge r(x, \hat\pi_t(x)) + \sum_{s=1}^{N} p_s \hat v_{t+1}(f(x, \hat\pi_t(x), \hat\xi^{[s]})) = (T^{\hat\pi_t} \hat v_{t+1})(x).$$
Recalling (3.5), we conclude that, for all $x \in \mathcal{Z}_t$,
$$v_t^{\hat\pi}(x) = (T^{\hat\pi_t} v_{t+1}^{\hat\pi})(x) \le (T^{\hat\pi_t} \hat v_{t+1})(x) \le (\hat T_t \hat v_{t+1})(x) = \hat v_t(x).$$
This completes the mathematical induction, and the result follows.

By using this lemma and Theorem 1, we obtain the following uniform convergence result:

Theorem 2 (Uniform Convergence and Error Bound II). Suppose that Assumption 1 holds, and that $v_t^\star$ and $\hat v_t$ are convex on $\mathcal{Z}_t$ for all $t$. Then, we have
$$0 \le v_t^{\hat\pi}(x) - v_t^\star(x) \le \sum_{k=t+1}^{K} L_k \delta \quad \forall x \in \mathcal{Z}_t, \ \forall t = 0, \ldots, K,$$
where $L_k := \sup_{j=1,\ldots,N_{C,k}} \sup_{x, x' \in \mathcal{C}_j : x \neq x'} \frac{\|v_k^\star(x) - v_k^\star(x')\|}{\|x - x'\|}$, and therefore $v_t^{\hat\pi}$ converges uniformly to $v_t^\star$ on $\mathcal{Z}_t$ as $\delta \to 0$.

Algorithm 1: Interpolation-Free DP algorithm

1 Initialize $\hat v_K \equiv q$ on $\mathcal{Z}_K$;
2 for $t = K-1 : -1 : 0$ do
3   Set $\hat v_t(x^{[i]}) := (\hat T_t \hat v_{t+1})(x^{[i]})$ by solving the problem in (3.1) with $v \equiv \hat v_{t+1}$ for each $i = 1, \ldots, M_t$;
4   Set $\hat\pi_t(x^{[i]})$ as an optimal $u$ of the problem in (3.1) with $v \equiv \hat v_{t+1}$ for each $i = 1, \ldots, M_t$;
5 end
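A rough sketch of the backward loop in Algorithm 1 is given below; it reuses the `approx_bellman` helper sketched after (3.1) and assumes the nodes are ordered so that the first $M_t$ of them lie in $\mathcal{Z}_t$, both of which are illustrative assumptions rather than a definitive implementation.

```python
# A rough sketch of Algorithm 1: one convex program per grid point and per stage,
# reusing the approx_bellman(...) helper sketched after (3.1).
import numpy as np

def interpolation_free_dp(nodes, M, K, q, approx_bellman, **model):
    """nodes: (M_K, n) array of grid points; M: list of node counts M_0 <= ... <= M_K;
    q: terminal cost; model: system and cost data forwarded to approx_bellman."""
    v_hat = [None] * (K + 1)
    pi_hat = [None] * K
    v_hat[K] = np.array([q(x) for x in nodes[:M[K]]])        # \hat v_K = q on Z_K
    for t in range(K - 1, -1, -1):                            # backward in time
        values = np.empty(M[t])
        actions = [None] * M[t]
        for i in range(M[t]):                                 # solve (3.1) at each node of Z_t
            values[i], actions[i] = approx_bellman(nodes[i], v_hat[t + 1],
                                                   nodes[:M[t + 1]], **model)
        v_hat[t], pi_hat[t] = values, actions
    return v_hat, pi_hat
```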

Proof. Fix an arbitrary $t \in \{0, \ldots, K\}$. By Theorem 1 and Lemma 1, we first observe that
$$v_t^{\hat\pi}(x) - v_t^\star(x) \le \hat v_t(x) - v_t^\star(x) \le \sum_{k=t+1}^{K} L_k \delta \quad \forall x \in \mathcal{Z}_t.$$
In addition, we have $v_t^\star \le v_t^{\hat\pi}$ since $v_t^\star$ is the minimal cost-to-go function under Assumption 1. Therefore, the result follows.

3.3 Interpolation-Free DP Algorithm

The error bounds and convergence results established in the previous subsection are valid if $v_t^\star$ and $\hat v_t$ are convex. We now introduce a sufficient condition for the convexity of $v_t^\star$ and $\hat v_t$.

Assumption 2 (Linear-Convex Control).

1. The function $(x, u) \mapsto f(x, u, \xi)$ is affine on $\mathcal{K} := \{(x, u) : x \in \mathcal{X}, u \in \mathcal{U}(x)\}$ for each $\xi \in \Xi$. In addition, the stage-wise cost function $r$ is convex on $\mathcal{K}$, and the terminal cost function $q$ is convex on $\mathcal{X}$.

2. If $u_{(k)} \in \mathcal{U}(x_{(k)})$ for $k = 1, 2$, then
$$\lambda u_{(1)} + (1 - \lambda) u_{(2)} \in \mathcal{U}(\lambda x_{(1)} + (1 - \lambda) x_{(2)})$$
for any $x_{(1)}, x_{(2)} \in \mathcal{X}$ and any $\lambda \in (0, 1)$.

Lemma 2. Suppose that Assumptions 1 and 2 hold, and that $v : \mathcal{Z}_{t+1} \to \mathbb{R}$ is convex. Then, $T_t v, \hat T_t v : \mathcal{Z}_t \to \mathbb{R}$ are convex.

Its proof can be found in Appendix B. An immediate observation obtained from Lemma 2 is the convexity of $v_t^\star$ and $\hat v_t$ for all $t = 0, \ldots, K$ because $v_K^\star$ and $\hat v_K$ are convex on $\mathcal{Z}_K$.

Proposition 2. Under Assumptions 1 and 2, the functions $v_t^\star$ and $\hat v_t$ are convex for all $t = 0, \ldots, K$.

By Proposition 2, the error bounds and convergence results in Theorems 1 and 2 are valid under Assumptions 1–2. Note, however, that Assumption 2 is merely a sufficient condition for convexity.

Algorithm 1, which is based on the proposed method, can be used to evaluate the approximate value function $\hat v_t$ and to construct the corresponding policy $\hat\pi_t$ at all grid points $x^{[i]}$ in $\mathcal{Z}_t$. Given $\hat v_t(x^{[i]})$ for all $i = 1, \ldots, M_t$, the value of $\hat v_t$ at the other points in $\mathcal{Z}_t$ can be computed using (3.1) with $v := \hat v_{t+1}$. We also observe that the optimization problem in (3.1) used to evaluate $(\hat T_t v)(x)$ is convex regardless of the convexity of $v$ under Assumption 2. Thus, in each iteration of the proposed DP algorithm, it suffices to solve $M_t$ convex optimization problems, each of which is for $x := x^{[i]}$.

Proposition 3. Under Assumption 2, the optimization problem in (3.1) for any fixed $x \in \mathcal{Z}_t$ is a convex optimization problem if $v \in B_b(\mathcal{Z}_{t+1})$.

Proof. Fix an arbitrary state $x \in \mathcal{Z}_t$. The objective function is convex in $u$ and linear in $\gamma$ (even when $v$ is nonconvex). Furthermore, the equality constraints are linear in $(u, \gamma)$. Assumption 2 implies that $\mathcal{U}(x)$ is convex. Also, the probability simplex $\Delta$ is convex. Therefore, this optimization problem for any fixed $x$ is convex.

Remark 1 (Interpolation-free property). Note that we can evaluate $\hat v_t$ at an arbitrary $x \in \mathcal{Z}_t$ by using the definition (3.1) of $\hat T_t$ without any explicit interpolation that may introduce additional numerical errors. This feature is also useful when the output of $\hat\pi_t$ needs to be specified at a particular state that is different from the grid points $\{x^{[i]}\}_{i=1}^{M_t}$. Unlike many existing discretization-based methods, our approach does not require a separate interpolation stage in constructing both the optimal value function and control policies.

The computational complexity of Algorithm 1 increases exponentially with the state space dimension. This complexity issue, called the curse of dimensionality, is inherited from dynamic programming. Several approximate dynamic programming (ADP) and RL methods have been proposed to alleviate the dimensionality issue using approximation in value spaces (e.g., [9]), policy spaces (e.g., [10]), or both (e.g., [11, 12]). In ADP and RL methods, it is important to find a reasonable balance between scalable implementation and adequate performance. Our focus is to pursue the latter with a uniform convergence guarantee. Nevertheless, an important future research direction is to improve the scalability of the proposed convex optimization approach by replacing a grid with a function approximator.

4 Nonconvex Value Functions

The convexity of the optimal value function plays a critical role in obtaining the convergence results in the previous section. To relax the convexity condition (e.g., Assumption 2), we first show that the proposed approximation method is useful when constructing a suboptimal policy with a provable error bound in the case of nonlinear control-affine systems with convex cost functions. For more general cases, we propose a modified approximation method based on a nonconvex bi-level optimization problem, where the inner problem is a linear program.

4.1 Control-Affine Systems

Consider a control-affine system of the form

$$x_{t+1} = f(x_t, u_t, \xi_t) := g(x_t, \xi_t) + h(x_t, \xi_t) u_t. \tag{4.1}$$

More precisely, we assume the following:

Assumption 3. The function $f : \mathcal{X} \times \mathcal{U} \times \Xi \to \mathcal{X}$ can be expressed as (4.1), where $g : \mathcal{X} \times \Xi \to \mathbb{R}^n$ and $h : \mathcal{X} \times \Xi \to \mathbb{R}^{n \times m}$ are (possibly nonlinear) measurable functions such that $g(\cdot, \xi)$ and $h(\cdot, \xi)$ are continuous for each $\xi \in \Xi$. In addition, $u \mapsto r(x, u)$ is convex on $\mathcal{U}(x)$ for each $x \in \mathcal{X}$.

Note that the condition on $r$ imposed by this assumption is weaker than Assumption 2, which requires the joint convexity of $r$. In this setting, each iteration of the DP algorithm in Section 3.3 still involves $M_t$ convex optimization problems.

Proposition 4. Under Assumption 3, the optimization problem in (3.1) is a convex program.

Due to the nonconvexity of the optimal value function, $\hat v_t$ obtained by value iteration in the previous section is no longer guaranteed to converge to the optimal value function as $\delta$ tends to zero. However, we are still able to characterize an error bound for the approximate policy $\hat\pi$ as follows:

Proposition 5 (Error Bound). Suppose that Assumptions 1 and 3 hold, and that $v_t^\star$ is locally Lipschitz continuous on $\mathcal{Z}_t$ for each $t$. Then, for all $t = 0, \ldots, K$, we have
$$0 \le v_t^{\hat\pi}(x) - v_t^\star(x) \le v_t^{\hat\pi}(x) - \hat v_t(x) + \sum_{k=t+1}^{K} L_k \delta \quad \forall x \in \mathcal{Z}_t,$$
where $L_k := \sup_{j=1,\ldots,N_{C,k}} \sup_{x, x' \in \mathcal{C}_j : x \neq x'} \frac{\|v_k^\star(x) - v_k^\star(x')\|}{\|x - x'\|}$.

Proof. Fix an arbitrary $t \in \{0, \ldots, K\}$. By the optimality of $v_t^\star$, we have
$$v_t^\star(x) \le v_t^{\hat\pi}(x) \quad \forall x \in \mathcal{Z}_t.$$
Note that in the second part of the proof of Proposition 1, the convexity of $v$ is unused. It only requires the continuity of $v$ on $\mathcal{Z}_t$. Therefore, the second inequality of (3.4) in Theorem 1 holds when $v_t^\star$ is continuous, i.e.,
$$\hat v_t(x) - v_t^\star(x) \le \sum_{k=t+1}^{K} L_k \delta \quad \forall x \in \mathcal{Z}_t.$$
This implies that
$$v_t^{\hat\pi}(x) - v_t^\star(x) \le v_t^{\hat\pi}(x) - \hat v_t(x) + \sum_{k=t+1}^{K} L_k \delta \quad \forall x \in \mathcal{Z}_t,$$
and the result follows.

This proposition implies that the performance of $\hat\pi$ converges to the optimum as $\delta$ tends to zero if its cost-to-go function $v_t^{\hat\pi}$ converges to the approximate value function $\hat v_t$ for all $t$. Otherwise, it is possible that $v_t^{\hat\pi}$ converges to some function that is different from the optimal value function, or it may oscillate within the error bound. However, this a posteriori error bound is useful when we need to design a controller with a provable performance guarantee, for example, in safety-critical systems where the objective is to maximize the probability of safety (e.g., [34]), as this problem is subject to nonconvexity issues [35].

4.2 General Case

In the case of general nonlinear systems with nonlinear stage-wise cost functions, we modify the operator $\hat T_t$ as follows. For any $v \in B_b(\mathcal{Z}_{t+1})$, let
$$(\tilde T_t v)(x) := \inf_{u \in \mathcal{U}(x)} \big\{ r(x, u) + (H_t v)(x, u) \big\} \quad \forall x \in \mathcal{Z}_t, \tag{4.2}$$
where

$$\begin{aligned}
(H_t v)(x, u) := \inf_{\gamma} \ & \sum_{s=1}^{N} \sum_{i \in \mathcal{N}(y_s)} p_s \gamma_{s,i} v(x^{[i]}) \\
\text{s.t.} \ & y_s = \sum_{i \in \mathcal{N}(y_s)} \gamma_{s,i} x^{[i]} \quad \forall s \in \mathcal{S},
\end{aligned} \tag{4.3}$$

$$\gamma_s \in \Delta \quad \forall s \in \mathcal{S}, \qquad \gamma_{s,i} = 0 \quad \forall i \notin \mathcal{N}(y_s), \ \forall s \in \mathcal{S},$$
for each $(x, u) \in \mathcal{Z}_t \times \mathcal{U}$ such that $u \in \mathcal{U}(x)$, where $y_s = f(x, u, \hat\xi^{[s]})$ for all $s \in \mathcal{S}$. Here $\mathcal{N}(y_s)$ denotes the set of grid points on the cell $\mathcal{C}_j$ that contains $y_s$. The inner optimization problem (4.3) for $H_t v$ is a linear program for any fixed $(x, u)$. However, the outer problem (4.2) is nonconvex in general. The nonconvexity originates from the local representation of $y_s = f(x, u, \hat\xi^{[s]})$ using only the grid points in the cell that contains $y_s$. However, such a local representation reduces the problem size and allows us to show that $\tilde T_t v$ converges uniformly to $T_t v$ on $\mathcal{Z}_t$ as $\delta$ tends to zero if $v$ is locally Lipschitz continuous on $\mathcal{Z}_{t+1}$.

Proposition 6. Suppose that Assumption 1 holds, and that $v \in B_b(\mathcal{Z}_{t+1})$ is locally Lipschitz continuous on $\mathcal{Z}_{t+1}$. Then, we have

$$|(\tilde T_t v)(x) - (T_t v)(x)| \le L_v \delta \quad \forall x \in \mathcal{Z}_t,$$
where $L_v := \max_{j=1,\ldots,N_{C,t+1}} \sup_{x, x' \in \mathcal{C}_j : x \neq x'} \frac{\|v(x) - v(x')\|}{\|x - x'\|}$, and therefore $\tilde T_t v$ converges uniformly to $T_t v$ on $\mathcal{Z}_t$ as $\delta \to 0$.

Proof. Fix an arbitrary $x \in \mathcal{Z}_t$. Under Assumption 1, both (4.2) and (4.3) have optimal solutions. Let $\tilde u$ be an optimal solution of (4.2), and let $\tilde\gamma$ be a corresponding optimal solution of (4.3) given $u := \tilde u$. We also let $\tilde y_s := f(x, \tilde u, \hat\xi^{[s]})$ for each $s$. Then, we have
$$\tilde y_s = \sum_{i \in \mathcal{N}(\tilde y_s)} \tilde\gamma_{s,i} x^{[i]}.$$
By the continuity of $v$ on the compact set $\mathcal{Z}_{t+1}$, we have
$$|v(\tilde y_s) - v(x^{[i]})| \le L_v \|\tilde y_s - x^{[i]}\| \le L_v \delta \quad \forall i \in \mathcal{N}(\tilde y_s).$$
Since $\sum_{i \in \mathcal{N}(\tilde y_s)} \tilde\gamma_{s,i} = 1$, we have
$$v(\tilde y_s) - \sum_{i \in \mathcal{N}(\tilde y_s)} \tilde\gamma_{s,i} v(x^{[i]}) \le \sum_{i \in \mathcal{N}(\tilde y_s)} \tilde\gamma_{s,i} |v(\tilde y_s) - v(x^{[i]})| \le \sum_{i \in \mathcal{N}(\tilde y_s)} \tilde\gamma_{s,i} L_v \delta = L_v \delta$$
for each $s$. By using this bound, we obtain that
$$(\tilde T_t v)(x) = r(x, \tilde u) + \sum_{s=1}^{N} \sum_{i \in \mathcal{N}(\tilde y_s)} p_s \tilde\gamma_{s,i} v(x^{[i]}) \ge r(x, \tilde u) + \sum_{s=1}^{N} p_s v(\tilde y_s) - L_v \delta = r(x, \tilde u) + \sum_{s=1}^{N} p_s v(f(x, \tilde u, \hat\xi^{[s]})) - L_v \delta \ge (T_t v)(x) - L_v \delta,$$
where the last inequality holds because $\tilde u$ is a feasible solution to the minimization problem in the definition (2.4) of $T_t$. The other inequality, $(\tilde T_t v)(x) - L_v \delta \le (T_t v)(x)$, can be shown by using the second part of the proof of Proposition 1.³ Since $x$ was arbitrarily chosen in $\mathcal{Z}_t$, the result follows.

Let $\tilde v_t : \mathcal{Z}_t \to \mathbb{R}$, $t = 0, \ldots, K$, be defined by

$$\tilde v_t(x) = (\tilde T_t \tilde v_{t+1})(x) \quad \forall x \in \mathcal{Z}_t, \ \forall t = 0, \ldots, K-1,$$
with $\tilde v_K \equiv q$ on $\mathcal{Z}_K$. By using Proposition 6 and the inductive argument in the proof of Theorem 1, we can show that $\tilde v_t$ converges uniformly to the optimal value function on $\mathcal{Z}_t$ as $\delta$ tends to zero.

Theorem 3. Suppose that Assumption 1 holds, and that $v_t^\star$ is locally Lipschitz continuous on $\mathcal{Z}_t$ for each $t = 0, \ldots, K$. Then, we have
$$|v_t^\star(x) - \tilde v_t(x)| \le \sum_{k=t+1}^{K} L_k \delta \quad \forall x \in \mathcal{Z}_t, \ \forall t = 0, \ldots, K, \tag{4.4}$$
where $L_k := \sup_{j=1,\ldots,N_{C,k}} \sup_{x, x' \in \mathcal{C}_j : x \neq x'} \frac{\|v_k^\star(x) - v_k^\star(x')\|}{\|x - x'\|}$, and therefore $\tilde v_t$ converges uniformly to $v_t^\star$ on $\mathcal{Z}_t$ as $\delta \to 0$.

As before, we construct a deterministic Markov policy $\tilde\pi := (\tilde\pi_0, \ldots, \tilde\pi_{K-1})$ by setting $\tilde\pi_t(x)$ to be an optimal solution of (4.2) with $v := \tilde v_{t+1}$, i.e.,
$$\tilde\pi_t(x) \in \arg\min_{u \in \mathcal{U}(x)} \big\{ r(x, u) + (H_t \tilde v_{t+1})(x, u) \big\} \quad \forall x \in \mathcal{Z}_t.$$

Then, the cost-to-go function $v_t^{\tilde\pi} : \mathcal{Z}_t \to \mathbb{R}$ under this policy can be evaluated by solving the following Bellman equation:
$$v_t^{\tilde\pi} := T^{\tilde\pi_t} v_{t+1}^{\tilde\pi} \quad \forall t = 0, \ldots, K-1,$$
where $v_K^{\tilde\pi} \equiv q$ on $\mathcal{Z}_K$. This cost $v_t^{\tilde\pi}$ incurred by the approximate policy $\tilde\pi$ converges uniformly to the optimal value function $v_t^\star$ as the grid resolution becomes finer.

³Note that the convexity of $v$ is unused in the second part of the proof of Proposition 1. Thus, it is valid in the nonconvex case.

Corollary 1. Suppose that Assumption 1 holds, and that $v_t^\star : \mathcal{Z}_t \to \mathbb{R}$ is locally Lipschitz continuous for all $t = 0, \ldots, K$. Then, we have
$$0 \le v_t^{\tilde\pi}(x) - v_t^\star(x) \le \sum_{k=t+1}^{K} 2 L_k \delta \quad \forall x \in \mathcal{Z}_t, \ \forall t = 0, \ldots, K,$$
where $L_k := \sup_{j=1,\ldots,N_{C,k}} \sup_{x, x' \in \mathcal{C}_j : x \neq x'} \frac{\|v_k^\star(x) - v_k^\star(x')\|}{\|x - x'\|}$, and therefore $v_t^{\tilde\pi}$ converges uniformly to $v_t^\star$ on $\mathcal{Z}_t$ as $\delta \to 0$.

Proof. By the optimality of $v_t^\star$ under Assumption 1, we have $v_t^\star \le v_t^{\tilde\pi}$ on $\mathcal{Z}_t$ for all $t$. We now show that $v_t^{\tilde\pi} \le v_t^\star + \sum_{k=t+1}^{K} 2 L_k \delta$ on $\mathcal{Z}_t$ by using mathematical induction. For $t = K$, $v_K^{\tilde\pi} = v_K^\star \equiv q$ on $\mathcal{Z}_K$, and thus the induction hypothesis holds. Suppose that the induction hypothesis is valid for some $t+1$, i.e.,
$$0 \le v_{t+1}^{\tilde\pi}(x) - v_{t+1}^\star(x) \le \sum_{k=t+2}^{K} 2 L_k \delta \quad \forall x \in \mathcal{Z}_{t+1}.$$
Then, by the monotonicity of $T^{\tilde\pi_t}$, we have
$$(T^{\tilde\pi_t} v_{t+1}^{\tilde\pi})(x) \le (T^{\tilde\pi_t} v_{t+1}^\star)(x) + \sum_{k=t+2}^{K} 2 L_k \delta \quad \forall x \in \mathcal{Z}_t. \tag{4.5}$$

Fix an arbitrary state $x \in \mathcal{Z}_t$. Under Assumption 1, by the definition of $\tilde T_t$, there exists $\tilde\gamma \in \Delta^N$ such that
$$(\tilde T_t v_{t+1}^\star)(x) = r(x, \tilde\pi_t(x)) + \sum_{s=1}^{N} \sum_{i \in \mathcal{N}(\tilde y_s)} p_s \tilde\gamma_{s,i} v_{t+1}^\star(x^{[i]}), \tag{4.6}$$
where $\tilde y_s := f(x, \tilde\pi_t(x), \hat\xi^{[s]})$, and
$$f(x, \tilde\pi_t(x), \hat\xi^{[s]}) = \sum_{i \in \mathcal{N}(\tilde y_s)} \tilde\gamma_{s,i} x^{[i]}.$$
Since $v_{t+1}^\star$ is continuous on $\mathcal{Z}_{t+1}$, we have
$$|v_{t+1}^\star(\tilde y_s) - v_{t+1}^\star(x^{[i]})| \le L_{t+1} \|\tilde y_s - x^{[i]}\| \le L_{t+1} \delta \quad \forall i \in \mathcal{N}(\tilde y_s).$$
Note that $\sum_{i \in \mathcal{N}(\tilde y_s)} \tilde\gamma_{s,i} = 1$. Thus, the following inequalities hold:
$$v_{t+1}^\star(\tilde y_s) - \sum_{i \in \mathcal{N}(\tilde y_s)} \tilde\gamma_{s,i} v_{t+1}^\star(x^{[i]}) \le \sum_{i \in \mathcal{N}(\tilde y_s)} \tilde\gamma_{s,i} |v_{t+1}^\star(\tilde y_s) - v_{t+1}^\star(x^{[i]})| \le L_{t+1} \delta. \tag{4.7}$$
By (4.6) and (4.7), we have
$$(\tilde T_t v_{t+1}^\star)(x) \ge r(x, \tilde\pi_t(x)) + \sum_{s=1}^{N} p_s v_{t+1}^\star(\tilde y_s) - L_{t+1} \delta = r(x, \tilde\pi_t(x)) + \sum_{s=1}^{N} p_s v_{t+1}^\star(f(x, \tilde\pi_t(x), \hat\xi^{[s]})) - L_{t+1} \delta \ge (T^{\tilde\pi_t} v_{t+1}^\star)(x) - L_{t+1} \delta. \tag{4.8}$$

On the other hand, by Proposition 6, we have

$$(\tilde T_t v_{t+1}^\star)(x) \le (T_t v_{t+1}^\star)(x) + L_{t+1} \delta. \tag{4.9}$$

Since $x$ was arbitrarily chosen in $\mathcal{Z}_t$, by combining inequalities (4.5), (4.8) and (4.9), we conclude that
$$v_t^{\tilde\pi} = T^{\tilde\pi_t} v_{t+1}^{\tilde\pi} \le T_t v_{t+1}^\star + \sum_{k=t+1}^{K} 2 L_k \delta = v_t^\star + \sum_{k=t+1}^{K} 2 L_k \delta$$
on $\mathcal{Z}_t$, which implies that the induction hypothesis is valid for $t$, as desired.

From the computational perspective, it is challenging to obtain a globally optimal solution of (4.2) due to nonconvexity. Thus, over-approximating $\tilde T_t v$ is inevitable in practice, and the quality of an approximate policy $\tilde\pi$ obtained from an over-approximation of the $\tilde v_t$'s depends on the quality of the locally optimal solutions to (4.2) evaluated in the process of value iteration. As in Section 4.1, one may be able to characterize a suboptimality bound for the approximate policy, for example, by using a convex relaxation of (4.2). In fact, the optimization problem (3.1) can be interpreted as a convex relaxation of (4.2) in the case of control-affine systems. Another classical and possibly practical method for solving the minimization problem in (4.2) is to discretize the action space. After discretization, we evaluate the optimal value of the program (4.3) for each action, and then compare the optimal values to approximately solve (4.2). This approach is used in solving the nonconvex problem in Section 5.2.

On the other hand, the inner optimization problem (4.3) can be efficiently solved using existing linear programming algorithms such as interior point and simplex methods (e.g., [36, 37]). The linear optimization problem has an equivalent form with the reduced optimization variable $\tilde\gamma := (\gamma_i)_{i \in \mathcal{N}(y_s)}$ instead of the full vector $\gamma$. Using this reduced problem significantly improves computational efficiency, since the size of the reduced optimization variable is independent of the grid resolution. For example, in the case of a regular Cartesian grid of size $N_x^n$, the number of optimization variables decreases from $N_x^n \times N$ to $2^n \times N$. An illustrative sketch of this reduced linear program is given below.
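The sketch below sets up the reduced inner linear program for one next state $y_s$ using only the vertices of the cell containing it; the helper names and the use of `scipy.optimize.linprog` are illustrative assumptions, not the implementation used in the paper.

```python
# A minimal sketch of the reduced inner linear program (4.3): the weights gamma are
# supported only on the vertices of the cell containing the next state y_s.
import numpy as np
from scipy.optimize import linprog

def inner_lp(y, cell_vertices, v_vertices):
    """y: next state in R^n; cell_vertices: (2^n, n) vertices of the cell containing y;
    v_vertices: values of v at those vertices.  Returns (optimal value, weights)."""
    k = cell_vertices.shape[0]
    # equality constraints: sum_i gamma_i x^[i] = y  and  sum_i gamma_i = 1
    A_eq = np.vstack([cell_vertices.T, np.ones((1, k))])
    b_eq = np.concatenate([y, [1.0]])
    res = linprog(c=v_vertices, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, 1.0)] * k, method="highs")
    return res.fun, res.x
```

Evaluating $(\tilde T_t v)(x)$ then amounts to minimizing $r(x, u)$ plus the $p_s$-weighted optimal values of this LP over $u$, for example over a discretized action set as described above.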

5 Numerical Experiments

5.1 Linear-Convex Control

We first demonstrate the performance of the proposed method through a linear-convex stochastic control problem. Consider the linear stochastic system

$$x_{t+1} = A x_t + B u_t + C \xi_t,$$
where $x_t \in \mathbb{R}^2$ and $\xi_t \in \mathbb{R}$. We set
$$A = \begin{bmatrix} 0.85 & 0.1 \\ 0.1 & 0.85 \end{bmatrix}, \quad C = \begin{bmatrix} 1 \\ 1 \end{bmatrix}.$$
To demonstrate that our method can handle high-dimensional action spaces, the control variable $u_t$ is chosen to be 1000-dimensional. The matrix $B$ is a 2 by 1000 matrix whose entries are sampled independently from the uniform distribution over $[0, 1]$ and then normalized so that the 1-norm of each row is 1.⁴ The stage-wise cost function is given by

$$r(x_t, u_t) = \|x_t\|_1 + \|u_t\|_2^2,$$
which is convex but non-quadratic.

⁴The matrix $B$ used in our experiments can be downloaded from the following link: http://coregroup.snu.ac.kr/DB/B1000.mat.


Figure 2: Empirical convergence rates of $\hat v_{(N_x),0}$ for the linear-convex problem with grid sizes $21^2$, $41^2$, $81^2$, and $161^2$. Here, the label $N_x$ denotes the number of grid points on one axis. The errors $\|\hat v_{(N_x),0} - \bar v_0\|$ are measured by the 1- and $\infty$-norms over $\mathcal{Z}_0 = [-1, 1]^2$.

Table 1: Computation time (in seconds) and errors for the linear $L^1$-control problem with different grid sizes (evaluated over $\mathcal{Z}_0 = [-1, 1]^2$).

# of grid points    21^2      41^2      81^2       161^2
Time (sec)          18.22     139.93    1877.25    43736.92
ℓ1-error            0.0339    0.0090    0.0020     0.0004
ℓ∞-error            0.0527    0.0157    0.0036     0.0009

The set of admissible control actions is chosen as

$$u_t \in \mathcal{U} := [-0.15, 0.15]^{1000}.$$
The support elements of $\xi_t$ are sampled according to the uniform distribution over $[-0.1, 0.1]$. The computational domains are chosen as follows: $\mathcal{Z}_t := [-1 - 0.2t, 1 + 0.2t]^2$ for $t = 0, \ldots, 5$ so that $\mathcal{Z}_0 = [-1, 1]^2$ and $\mathcal{Z}_5 = [-2, 2]^2$. This choice of computational domains satisfies
$$\{Ax + Bu + C\xi : x \in \mathcal{Z}_t, \ u \in [-0.15, 0.15]^{1000}, \ \xi \in [-0.1, 0.1]\} \subseteq \mathcal{Z}_{t+1}$$
and $\mathcal{Z}_0 \subseteq \mathcal{Z}_1 \subseteq \cdots \subseteq \mathcal{Z}_5$. Each $\mathcal{Z}_t$ is discretized as a two-dimensional regular Cartesian grid using the approach in Appendix A with grid points $\{(-1 - 0.2t + \delta_x(i-1), \ -1 - 0.2t + \delta_x(j-1)) : i, j = 1, \ldots, (2 + 0.4t)/\delta_x + 1\}$ with equal grid spacing $\delta_x$. The terminal cost is set to be $q \equiv 0$. The numerical experiments were conducted on a PC with a 4.2 GHz Intel Core i7 and 64 GB RAM. The optimization problems were solved using the interior point method in CPLEX.

Let $N_x$ denote the number of grid points on one axis of $\mathcal{Z}_5$, so that the total number of grid points is $N_x^2$. To demonstrate the convergence of the proposed method as $\delta = \sqrt{2}\,\delta_x \to 0$, or equivalently as $N_x \to \infty$, we fix the number of sample data as $N = 10$ and compute $\hat v_{(N_x),0}$ for $N_x = 21, 41, 81, 161$, or equivalently for $\delta_x = 0.2, 0.1, 0.05, 0.025$, where $\hat v_{(N_x),t}$ denotes the approximate value function at $t$ with the number of grid points being $N_x^2$. Setting $\bar v_0 := \hat v_{(N_x = 321),0}$, the errors $\|\hat v_{(N_x),0} - \bar v_0\|$ over $\mathcal{Z}_0 = [-1, 1]^2$, measured by the 1- and $\infty$-norms, are shown in Fig. 2. The CPU time for computation is also reported in Table 1.⁵

⁵The CPU time increases superlinearly with the number of grid points. This is because the size of the optimization problem (3.1) also increases with the grid size. Note that the problem size is invariant when using the bi-level method in Section 4.2; thus, in that case the CPU time scales linearly, as shown in Table 3.


Figure 3: Errors for the linear-convex problem with respect to the number of DP iterations. The errors are measured by the 1- and $\infty$-norms over $\mathcal{Z}_0 = [-1, 1]^2$.

Table 2: Computation time (in seconds) and errors for the linear $L^1$-control problem with different sample sizes (evaluated over $\mathcal{Z}_0 = [-1, 1]^2$).

Sample size    20        40        80        160
Time (sec)     35.81     72.53     227.79    857.14
ℓ1-error       0.0529    0.0359    0.0179    0.0110
ℓ∞-error       0.1808    0.1146    0.0410    0.0241

In this example, the empirical convergence rate of our method is beyond the second order.⁶ Furthermore, the relative error $\|(\hat v_{(N_x),0} - \bar v_0)/\bar v_0\|_1$ over $\mathcal{Z}_0 = [-1, 1]^2$ is 0.02% in the case of $N_x = 161$.

Fig. 3 shows the errors in the 1- and $\infty$-norms, evaluated over $\mathcal{Z}_0 = [-1, 1]^2$, depending on the number of DP iterations, or equivalently the length of the time horizon, when $\delta_x = 0.2$. The errors grow sublinearly with respect to the number of DP iterations. This result is consistent with the error bound in Theorem 1, assuming that the constant $L_k$ does not change much over time.

To test the empirical convergence of the proposed method as the sample size increases, we also compute $\hat v_{(N),0}$ for $N = 20, 40, 80, 160$ with $\delta_x$ fixed as 0.2, where $\hat v_{(N),0}$ denotes the approximate value function at $t = 0$ with sample size $N$. The errors $\|\hat v_{(N),0} - \tilde v_0\|$ over $\mathcal{Z}_0 = [-1, 1]^2$ with $N = 20, 40, 80, 160$ are shown in Fig. 4, where $\tilde v_0 := \hat v_{(N=320),0}$. The CPU time is reported in Table 2. In this example, the empirical convergence rate is below the first order. The relative error $\|(\hat v_{(N),0} - \tilde v_0)/\tilde v_0\|_1$ over $\mathcal{Z}_0 = [-1, 1]^2$ is 0.72% in the case of $N = 160$.

⁶The observation of the second-order empirical convergence rate is consistent with our theoretical result, since Theorem 1 only suggests that the suboptimality gap decreases with a first-order rate. Thus, the actual convergence rate can be higher than the convergence rate for the suboptimality gap.
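For reference, the following short sketch shows how the problem data of this example could be assembled for the `approx_bellman` sketch of Section 3.1; the random seed, the sampled $B$ and disturbance values, and all names are illustrative assumptions (the actual $B$ used in the experiments is available at the link in the footnote above).

```python
# A short sketch, under illustrative choices, of the Section 5.1 problem data.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
A = np.array([[0.85, 0.1], [0.1, 0.85]])
C = np.array([1.0, 1.0])
B = rng.uniform(0.0, 1.0, size=(2, 1000))
B = B / np.abs(B).sum(axis=1, keepdims=True)     # normalize each row to unit 1-norm

xi_support = rng.uniform(-0.1, 0.1, size=10)     # N = 10 disturbance samples
p = np.full(10, 0.1)                             # uniform weights p_s = 1/N

def r(x, u):
    # stage-wise cost ||x||_1 + ||u||_2^2, convex but non-quadratic
    return np.abs(x).sum() + cp.sum_squares(u)

u_max = 0.15                                     # U = [-0.15, 0.15]^1000
```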

5.2 Control-Affine

We consider the following nonlinear, control-affine system:
$$x_{t+1} = x_t + [w_t (1 - x_t) x_t - x_t u_t] \Delta t,$$


Figure 4: Empirical convergence rates of $\hat v_{(N),0}$ for the linear-convex problem with sample sizes $N = 20, 40, 80, 160$. The errors $\|\hat v_{(N),0} - \tilde v_0\|$ are measured by the 1- and $\infty$-norms over $\mathcal{Z}_0 = [-1, 1]^2$.


Figure 5: (a) The value function $v_t^{\hat\pi}(x)$ of the approximate policy $\hat\pi$, (b) the optimal value function $v_t^\star(x)$, and (c) the relative error $\frac{v_t^{\hat\pi}(x) - v_t^\star(x)}{v_t^\star(x)} \times 100\%$ for the epidemic control problem.

which models the outbreak of an infectious disease [38]. Here, the system state $x_t \in [0, 1]$ represents the ratio of infected people in a given population, the control input $u_t \in [0, u_{\max}]$ represents a public treatment action, and $w_t > 0$ denotes infectivity. The support of $w_t$ was sampled from the uniform distribution over $[1, 2]$. The stage-wise cost function is chosen as the following convex, non-quadratic function:
$$r(x_t, u_t) = x_t + \lambda u_t^2,$$
where $\lambda > 0$ is the intervention cost. We choose $u_{\max} = 1$, $\Delta t = 10^{-1}$, $K = 20$, $\lambda = 10^{-1}$, $N = 10$, and $q \equiv 0$. In this case, the system state remains in the domain $[0, 1]$ for all time. Thus, we set $\mathcal{Z}_t \equiv [0, 1]$ for all $t = 0, \ldots, 20$. The domain $\mathcal{Z}_t$ is discretized with a uniform grid with grid spacing $\delta_x = 0.01$. Thus, the number of grid points is 101. In this setting, the CPU time for running our method is 12.57 seconds.

To examine the suboptimality of our method in this problem with a control-affine system, we compare the value function $v_t^{\hat\pi}$ of our approximate policy and the optimal value function $v_t^\star$.⁷

⁷To compute the optimal value function, we used the method in Section 4.2, discretizing the action space with 1001 equally spaced grid points.

As shown in Fig. 5(a) and (b), $v_t^{\hat\pi}$ and $v_t^\star$ look very similar to each other. Fig. 5(c) shows the relative error
$$\frac{v_t^{\hat\pi}(x) - v_t^\star(x)}{v_t^\star(x)} \times 100\%.$$
The relative error is less than 11% for all $x \in [0, 1]$ and $t = 0, \ldots, 20$. The $\ell_1$-norm and the $\ell_2$-norm of the relative error are 0.45% and 1.44%, respectively. This result confirms the effectiveness of our method even in nonconvex problems with control-affine systems.
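For intuition, the sketch below performs one backward step of the Section 4.2 scheme on this epidemic model with a discretized action set; on a one-dimensional grid the inner LP (4.3) has a unique feasible weight vector, so it reduces to piecewise-linear interpolation. The parameter values follow the description above, while the action resolution, clipping, and names are illustrative assumptions.

```python
# A rough sketch of one backward step of the Section 4.2 scheme for the epidemic model.
import numpy as np

dt, lam, u_max = 0.1, 0.1, 1.0
w_support = np.random.default_rng(0).uniform(1.0, 2.0, 10)   # N = 10 infectivity samples
p = np.full(10, 0.1)
grid = np.linspace(0.0, 1.0, 101)                             # Z_t = [0, 1], delta_x = 0.01
actions = np.linspace(0.0, u_max, 101)                        # discretized action set

def step(x, u, w):
    return x + (w * (1.0 - x) * x - x * u) * dt               # epidemic dynamics

def backup(v_next):
    """One application of the approximate Bellman operator on the grid."""
    v = np.empty_like(grid)
    for i, x in enumerate(grid):
        costs = []
        for u in actions:
            y = np.clip(step(x, u, w_support), 0.0, 1.0)      # next states for all samples
            # inner LP on a 1-D grid = piecewise-linear interpolation of v_next
            costs.append(x + lam * u**2 + p @ np.interp(y, grid, v_next))
        v[i] = min(costs)
    return v
```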

5.3 Fully Nonlinear System

We consider the following discrete-time Dubins car model [39], which is fully nonlinear:
$$x_{t+1} = x_t + v_t \cos\theta_t, \quad y_{t+1} = y_t + v_t \sin\theta_t, \quad \theta_{t+1} = \theta_t + \frac{1}{L} v_t \tan s_t,$$
where $\mathbf{x}_t := (x_t, y_t, \theta_t) \in \mathbb{R}^3$ is the system state, and $u_t := (v_t, s_t) \in \mathbb{R}^2$ is the control action. Specifically, $(x_t, y_t)$ represents the position of the vehicle in $\mathbb{R}^2$, and $\theta_t$ denotes the heading angle. Moreover, $v_t$ and $s_t$ denote a velocity input and a steering input, respectively. The third state equation specifies the turning rate. The Dubins vehicle model is commonly used in robotics and controls as a model for describing the planar motion of wheeled vehicles. We set $v_t \in \{0, -0.1\}$, $s_t \in \{-1, 0, 1\}$, and $L = 0.5$. The control objective is to steer the vehicle to the $y$-axis with heading angle $\pi$. Accordingly, the terminal cost function is set to be
$$q(\mathbf{x}) = \|(x_1, x_3) - (0, \pi)\|^2,$$
where $\mathbf{x} = (x_1, x_2, x_3) \in \mathbb{R}^3$, with no running costs. With $K = 20$, the computational domains are chosen as follows:

$$\mathcal{Z}_t := [-0.5 - 0.1t, 0.5 + 0.1t] \times [-0.5 - 0.1t, 0.5 + 0.1t] \times [0, 2\pi]$$
for $t = 0, 1, \ldots, 20$. It is straightforward to check that this choice satisfies the condition (2.3). Each $\mathcal{Z}_t$ is discretized as a three-dimensional Cartesian grid using the approach in Appendix A with grid points
$$\{(-0.5 - 0.1t + \delta_x(i-1), \ -0.5 - 0.1t + \delta_x(j-1), \ 0 + \delta_\theta(k-1)) : i, j = 1, \ldots, (1 + 0.2t)/\delta_x + 1, \ k = 1, \ldots, 2\pi/\delta_\theta + 1\}.$$

In our numerical experiment, we set $\delta_x = 0.1$ and $\delta_\theta = \pi/20$, and thus $\mathcal{Z}_{20}$ is discretized with a grid of size $51 \times 51 \times 41$. The CPU time for running our method in Section 4.2 is 4630.93 seconds. Fig. 6 shows the resulting vehicle trajectories starting from $(x_0, y_0) = (0, -0.5)$ with four different initial heading angles $\theta_0 = 0, \pi/2, \pi, 3\pi/2$. In all four cases, the vehicle moves to the $y$-axis with heading $\pi$ as desired.

As in Section 5.1, let $N_x$ denote the number of grid points on one axis of $\mathcal{Z}_5$, so that the total number of grid points is $N_x^3$. To demonstrate the convergence of our method as $\delta_x, \delta_\theta \to 0$, or equivalently as $N_x \to \infty$, we evaluate the errors $\|\hat v_{(N_x),0} - \bar v_0\|$ in the 1- and $\infty$-norms over $\mathcal{Z}_0$, where $\hat v_{(N_x),t}$ denotes the approximate value function at $t$ with the number of grid points being $N_x^3$, and $\bar v_0 := \hat v_{(N_x=161),0}$. Table 3 shows the errors and the CPU time with different grid sizes and $K = 5$. This result indicates that the empirical convergence rate in this example is approximately the second order. Another important observation is that the computation time scales linearly with the number of grid points. This is because the size of the linear optimization problem (4.3) is invariant with the grid size, as discussed in Section 4.2.
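The discrete-time Dubins update and a trajectory rollout can be sketched as follows; the `policy` callable (e.g., a lookup of the computed $\tilde\pi_t$) and the starting state are illustrative assumptions.

```python
# A minimal sketch of the discrete-time Dubins car update used in this example,
# together with an illustrative rollout under some policy(x, t) returning (v, s).
import numpy as np

L = 0.5

def dubins_step(state, v, s):
    x, y, theta = state
    return np.array([x + v * np.cos(theta),
                     y + v * np.sin(theta),
                     theta + (v / L) * np.tan(s)])

def rollout(x0, policy, K=20):
    traj = [np.asarray(x0, dtype=float)]
    for t in range(K):
        v, s = policy(traj[-1], t)          # e.g., (v, s) in {0, -0.1} x {-1, 0, 1}
        traj.append(dubins_step(traj[-1], v, s))
    return np.array(traj)
```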


Figure 6: Controlled trajectories for the Dubins vehicle model with different initial heading angles. Each arrow represents the vehicle's heading at a particular instance.

Table 3: Computation time (in seconds) and errors for the Dubins vehicle problem with different grid sizes (evaluated over $\mathcal{Z}_0$).

# of grid points    21^3      41^3      81^3
Time (sec)          135.31    975.54    7574.77
ℓ1-error            0.0082    0.0027    0.0008
ℓ∞-error            0.0122    0.0038    0.0009

6 Conclusions

We have proposed and analyzed a convex optimization-based method designed to solve dynamic programs in continuous state and action spaces. This interpolation-free approach provides a control policy whose performance converges uniformly to the optimum as the computational grid becomes finer when the optimal and approximate value functions are convex. In the case of control-affine systems with convex cost functions, the proposed bound on the gap between the cost-to-go function of the approximate policy and the optimal value function is useful to gauge the degree of suboptimality. In general nonlinear cases, a simple modification to a bi-level optimization formulation, whose inner problem is a linear program, maintains the uniform convergence properties if an optimal solution to this bi-level program can be computed.

A State Space Discretization Using a Rectilinear Grid

In this appendix, we provide a concrete way to discretize the state space using a rectilinear grid. The construction below satisfies all the conditions in Section 2.3.

1. Choose a convex compact set $\mathcal{Z}_0 := [\underline{x}_{0,1}, \overline{x}_{0,1}] \times [\underline{x}_{0,2}, \overline{x}_{0,2}] \times \cdots \times [\underline{x}_{0,n}, \overline{x}_{0,n}]$, and discretize it using an $n$-dimensional rectilinear grid. Set $t \leftarrow 0$.

2. Compute (or over-approximate) the forward reachable set⁸
$$R_t := \{f(x, u, \xi) : x \in \mathcal{Z}_t, \ u \in \mathcal{U}(x), \ \xi \in \Xi\}.$$

3. Choose a convex compact set $\mathcal{Z}_{t+1} := [\underline{x}_{t+1,1}, \overline{x}_{t+1,1}] \times [\underline{x}_{t+1,2}, \overline{x}_{t+1,2}] \times \cdots \times [\underline{x}_{t+1,n}, \overline{x}_{t+1,n}]$ such that $R_t \subseteq \mathcal{Z}_{t+1}$.

4. Expand the rectilinear grid to fit $\mathcal{Z}_{t+1}$.

5. Stop if $t + 1 = K$; otherwise, set $t \leftarrow t + 1$ and go to Step 2.

We can then choose each $\mathcal{C}_i$ as a grid cell. We label the $\mathcal{C}_i$ so that $\bigcup_{i=1}^{N_{C,t}} \mathcal{C}_i = \mathcal{Z}_t$ for all $t$. A two-dimensional example is shown in Fig. 1. This construction approach was used in Sections 5.1 and 5.3.

⁸The forward reachable set can be over-approximated in an analytical way, particularly when a loose approximation is allowed. For a high quality of approximation, one may use advanced computational techniques with semidefinite approximation [40] and ellipsoidal approximation [41], among others.
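As a concrete illustration of Steps 2–3 for an affine system with box-shaped control and disturbance sets, the sketch below over-approximates the reachable set by interval arithmetic and grows the box domains accordingly (the alignment of the expanded boxes with the grid in Step 4 is omitted); the system structure, bound parameters, and names are illustrative assumptions.

```python
# A rough sketch of the domain construction for x+ = A x + B u + C xi with
# |u_j| <= u_bound and |xi| <= xi_bound, using an interval over-approximation of R_t.
import itertools
import numpy as np

def expand_domain(lo, hi, A, B, C, u_bound, xi_bound):
    """Given Z_t = [lo, hi] (componentwise), return a box containing both Z_t and
    {A x + B u + C xi : x in Z_t, |u_j| <= u_bound, |xi| <= xi_bound}."""
    corners = np.array(list(itertools.product(*zip(lo, hi))))   # vertices of the box Z_t
    images = corners @ A.T                                       # A x attains its extremes at vertices
    slack = np.abs(B).sum(axis=1) * u_bound + np.abs(C) * xi_bound
    new_lo = np.minimum(lo, images.min(axis=0) - slack)
    new_hi = np.maximum(hi, images.max(axis=0) + slack)
    return new_lo, new_hi

def build_domains(lo0, hi0, K, A, B, C, u_bound, xi_bound):
    domains = [(np.asarray(lo0, float), np.asarray(hi0, float))]
    for _ in range(K):
        domains.append(expand_domain(*domains[-1], A, B, C, u_bound, xi_bound))
    return domains   # nested boxes Z_0 ⊆ Z_1 ⊆ ... ⊆ Z_K containing the reachable sets
```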

B Proof of Lemma 2

Proof. Suppose that $v : \mathcal{Z}_{t+1} \to \mathbb{R}$ is convex. Fix two arbitrary states $x_{(k)} \in \mathcal{Z}_t$, $k = 1, 2$.

We first show that $T_t v$ is convex on $\mathcal{Z}_t$. Under Assumption 1, for $k = 1, 2$, there exists $u_{(k)} \in \mathcal{U}(x_{(k)})$ such that
$$(T_t v)(x_{(k)}) = r(x_{(k)}, u_{(k)}) + \sum_{s=1}^{N} p_s v(f(x_{(k)}, u_{(k)}, \hat\xi^{[s]}))$$
and $f(x_{(k)}, u_{(k)}, \hat\xi^{[s]}) \in \mathcal{Z}_{t+1}$ for all $s \in \mathcal{S}$. Fix an arbitrary $\lambda \in (0, 1)$, and let $x_{(\lambda)} := \lambda x_{(1)} + (1 - \lambda) x_{(2)} \in \mathcal{Z}_t$ and $u_{(\lambda)} := \lambda u_{(1)} + (1 - \lambda) u_{(2)}$. By Assumption 2, we have $u_{(\lambda)} \in \mathcal{U}(x_{(\lambda)})$. Thus, the following inequality holds:
$$(T_t v)(x_{(\lambda)}) \le r(x_{(\lambda)}, u_{(\lambda)}) + \sum_{s=1}^{N} p_s v(f(x_{(\lambda)}, u_{(\lambda)}, \hat\xi^{[s]})).$$
Since $(x, u) \mapsto r(x, u)$ and $(x, u) \mapsto v(f(x, u, \xi))$ are convex for each $\xi \in \Xi$, we have
$$(T_t v)(x_{(\lambda)}) \le \lambda r(x_{(1)}, u_{(1)}) + (1 - \lambda) r(x_{(2)}, u_{(2)}) + \sum_{s=1}^{N} p_s \big[\lambda v(f(x_{(1)}, u_{(1)}, \hat\xi^{[s]})) + (1 - \lambda) v(f(x_{(2)}, u_{(2)}, \hat\xi^{[s]}))\big].$$
Therefore, we obtain that
$$(T_t v)(x_{(\lambda)}) \le \lambda (T_t v)(x_{(1)}) + (1 - \lambda)(T_t v)(x_{(2)}),$$
which implies that $T_t v$ is convex on $\mathcal{Z}_t$.

Next, we show that $\hat T_t v$ is convex on $\mathcal{Z}_t$. Under Assumption 1, for $k = 1, 2$, there exists an optimal solution $(\hat u_{(k)}, \hat\gamma_{(k)}) \in \mathcal{U}(x_{(k)}) \times \Delta^N$ to (3.1) with $x := x_{(k)}$, i.e.,
$$(\hat T_t v)(x_{(k)}) = r(x_{(k)}, \hat u_{(k)}) + \sum_{s=1}^{N} \sum_{i=1}^{M_{t+1}} p_s \hat\gamma_{(k),s,i} v(x^{[i]})$$

8The forward reachable set can be over-approximated in an analytical way, particularly when a loose approximation is allowed. For a high quality of approximation, one may use advanced computational techniques with semidefinite approximation [40] and ellipsoidal approximation [41], among others. 25

[s] PMt+1 [i] and f(x , uˆ , ξˆ ) = γˆ x t+1 for all s . Given any fixed λ ]0, 1[, let (k) (k) i=1 (k),s,i ∈ Z ∈ S ∈ (uˆ , γˆ ) := λ(uˆ , γˆ ) + (1 λ)(uˆ , γˆ ). Since (x, u) f(x, u, ξˆ[s]) is affine on , (λ) (λ) (1) (1) − (2) (2) 7→ K

Mt+1 [s] X [i] f(x(λ), uˆ(λ), ξˆ ) = γˆ(λ),s,ix . i=1

This implies that (uˆ(λ), γˆ(λ)) is a feasible solution to the minimization problem in the definition of (Tˆtv)(x(λ)). Therefore,

N Mt+1 X X [i] (Tˆtv)(x ) r(x , uˆ ) + psγˆ v(x ) (λ) ≤ (λ) (λ) (λ),s,i s=1 i=1

N Mt+1 X X [i] λr(x , uˆ ) + (1 λ)r(x , uˆ ) + ps[λγˆ + (1 λ)ˆγ ]v(x ) ≤ (1) (1) − (2) (2) (1),s,i − (2),s,i s=1 i=1

= λ(Tˆtv)(x ) + (1 λ)(Tˆtv)(x ), (1) − (2) where the second inequality holds by the convexity of r. This implies that Tˆtv is convex on t. Z References
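To make the minimization problem used in this proof concrete, the following is a minimal sketch of a convex program with the same structure, written with the cvxpy modeling package: a convex stage cost in $u$, nonnegative weights $\gamma_{s,i}$ summing to one over the grid points of $\mathcal{Z}_{t+1}$, and the constraint that each sampled successor state equals the corresponding convex combination of grid points. All model data (dynamics, cost, input bound, grid values) are hypothetical placeholders rather than the paper's; the exact formulation is (3.1).

```python
import numpy as np
import cvxpy as cp

# Dimensions: n states, m inputs, N disturbance samples, M grid points in Z_{t+1}.
n, m, N = 2, 2, 3
axis = np.linspace(-1.0, 1.0, 3)
X_grid = np.array([[a, b] for a in axis for b in axis])   # grid points x^[i], M = 9
M = X_grid.shape[0]
v_vals = np.sum(X_grid**2, axis=1)                        # v(x^[i]) for a convex v
p = np.full(N, 1.0 / N)                                   # probabilities p_s
F0 = np.array([[0.2, 0.0], [0.0, -0.2], [-0.1, 0.1]])     # f(x, 0, xi_hat^[s]) at the fixed x
G = 0.1 * np.eye(n)                                       # input matrix (control-affine)
R = np.eye(m)                                             # quadratic input-cost weight
u_max = 1.0

u = cp.Variable(m)
gamma = cp.Variable((N, M), nonneg=True)                  # convex-combination weights

constraints = [cp.sum(gamma, axis=1) == 1,                # each gamma_s lies in the simplex
               cp.norm(u, 2) <= u_max]                    # placeholder for u in U(x)
for s in range(N):
    # Successor state must equal a convex combination of the grid points of Z_{t+1}.
    constraints.append(F0[s] + G @ u == X_grid.T @ gamma[s])

stage_cost = cp.quad_form(u, R)                           # placeholder convex r(x, u)
expected_future = cp.sum(cp.multiply(p, gamma @ v_vals))  # sum_s p_s sum_i gamma_si v(x^[i])
prob = cp.Problem(cp.Minimize(stage_cost + expected_future), constraints)
prob.solve()
print("approximate Bellman value at the fixed state:", prob.value)
```

Because the state $x$ is fixed and the system is control-affine, every constraint above is affine in $(u, \gamma)$, which is precisely the property exploited in the convexity argument.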

References

[1] Puterman, M. L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, Hoboken, NJ (2014)

[2] Kushner, H., Dupuis, P. G.: Numerical Methods for Stochastic Control Problems in Continuous Time, Springer Science & Business Media, New York (2013)

[3] Hernández-Lerma, O., Lasserre, J. B.: Discrete-Time Markov Control Processes: Basic Optimality Criteria, Springer Science & Business Media, New York (2012)

[4] Savorgnan, C., Lasserre, J. B., Diehl, M.: Discrete-time stochastic optimal control via occupation measures and moment relaxations. In Proceedings of the Joint 48th IEEE Conference on Decision and Control and 28th Chinese Control Conference (2009)

[5] Dufour, F., Prieto-Rumeau, T.: Finite linear programming approximations of constrained discounted Markov decision processes. SIAM Journal on Control and Optimization 51(2), 1298-1324 (2013)

[6] Sutton, R. S., Barto, A. G.: Reinforcement Learning: An Introduction, 2nd Edition, MIT Press, Cambridge, MA (2018)

[7] Bertsekas, D. P.: Reinforcement Learning and Optimal Control, Athena Scientific, Belmont, MA (2019)

[8] Szepesvári, C.: Algorithms for Reinforcement Learning, Morgan and Claypool Publishers, San Rafael, CA, USA (2010)

[9] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S.: Human-level control through deep reinforcement learning. Nature 518(7540), 529-533 (2015)

[10] Schulman, J., Levine, S., Moritz, P., Jordan, M., Abbeel, P.: Trust region policy optimization. In Proceedings of the International Conference on Machine Learning, 1889-1897 (2015)

[11] Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., Wierstra, D.: Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015)

[12] Haarnoja, T., Zhou, A., Abbeel, P., Levine, S.: Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, 1861-1870 (2018)

[13] Bertsekas, D. P.: Convergence of discretization procedures in dynamic programming. IEEE Transactions on Automatic Control 20(3), 415-419 (1975)

[14] Langen, H.-J.: Convergence of dynamic programming models. Mathematics of Operations Research 6(4), 493-512 (1981)

[15] Whitt, W.: Approximations of dynamic programs, I. Mathematics of Operations Research 3(3), 231-243 (1978)

[16] Hinderer, K.: On approximate solutions of finite-stage dynamic programs. In: Puterman, M. L. (ed.): Dynamic Programming and Its Applications, pp. 289-317. Academic Press (1978)

[17] Chow, C.-S., Tsitsiklis, J. N.: An optimal one-way multigrid algorithm for discrete-time stochastic control. IEEE Transactions on Automatic Control 36(8), 898-914 (1991)

[18] Dufour, F., Prieto-Rumeau, T.: Approximation of Markov decision processes with general state space. Journal of Mathematical Analysis and Applications 388, 1254-1267 (2012)

[19] Dufour, F., Prieto-Rumeau, T.: Approximation of average cost Markov decision processes using empirical distributions and concentration inequalities. Stochastics: An International Journal of Probability and Stochastic Processes 87(2), 273-307 (2015)

[20] Saldi, N., Yüksel, S., Linder, T.: On the asymptotic optimality of finite approximations to Markov decision processes with Borel spaces. Mathematics of Operations Research 42(4), 945-978 (2017)

[21] Hernández-Lerma, O.: Discretization procedures for adaptive Markov control processes. Journal of Mathematical Analysis and Applications 137, 485-514 (1989)

[22] Johnson, S. A., Stedinger, J. R., Shoemaker, C. A., Li, Y., Tejada-Guibert, J. A.: Numerical solution of continuous-state dynamic programs using linear and spline interpolation. Operations Research 41(3), 484-500 (1993)

[23] Hernández-Lerma, O., Lasserre, J. B.: Further Topics on Discrete-Time Markov Control Processes, Springer Science & Business Media, New York, USA (2012)

[24] Rust, J.: Using randomization to break the curse of dimensionality. Econometrica 65(3), 487-516 (1997)

[25] Munos, R., Szepesvári, C.: Finite-time bounds for fitted value iteration. Journal of Machine Learning Research 9, 815-857 (2008)

[26] Haskell, W. B., Jain, R., Sharma, H., Yu, P.: A universal empirical dynamic programming algorithm for continuous state MDPs. IEEE Transactions on Automatic Control 65(1), 115-129 (2020)

[27] Jang, S., Yang, I.: Stochastic subgradient methods for dynamic programming in continuous state and action spaces. In Proceedings of the 58th IEEE Conference on Decision and Control, 7287-7293 (2019)

[28] Nesterov, Y.: Lectures on Convex Optimization, 2nd ed., Springer, Switzerland (2018)

[29] Boyd, S., Vandenberghe, L.: Convex Optimization, Cambridge University Press, Cambridge, UK (2004)

[30] Bertsekas, D. P.: Convex Optimization Algorithms, Athena Scientific, Belmont, MA, USA (2015)

[31] Falcone, M., Ferretti, R.: Semi-Lagrangian Approximation Schemes for Linear and Hamilton-Jacobi Equations, SIAM, Philadelphia, PA, USA (2013)

[32] Alla, A., Falcone, M., Saluzzi, L.: An efficient DP algorithm on a tree-structure for finite horizon optimal control problems. SIAM Journal on Scientific Computing, 41(4), A2384-A2406 (2019)

[33] Picarelli, A., Reisinger, C.: Probabilistic error analysis for some approximation schemes to optimal control problems. Systems & Control Letters, 137, 104619 (2020)

[34] Abate, A., Prandini, M., Lygeros, J., Sastry, S.: Probabilistic reachability and safety for controlled discrete time stochastic hybrid systems. Automatica 44, 2724-2734 (2008)

[35] Yang, I.: A dynamic game approach to distributionally robust safety specifications for stochastic systems. Automatica 94, 94-101 (2018)

[36] Dantzig, G. B.: Linear Programming and Extensions, Princeton Univ. Press, Princeton, NJ, USA (1998)

[37] Bertsimas, D., Tsitsiklis, J. N.: Introduction to Linear Optimization, Athena Scientific, Belmont, MA, USA (1997)

[38] Sethi, S. P., Thompson, G. L.: Optimal Control Theory: Applications to Management Science and Economics, Springer (2000)

[39] Dubins, L. E.: On curves of minimal length with a constraint on average curvature, and with prescribed initial and terminal positions and tangents. American Journal of Mathematics, 79(3), 497-516 (1957)

[40] Magron, V., Garoche, P.-L., Henrion, D., Thirioux, X.: Semidefinite approximations of reachable sets for discrete-time polynomial systems. SIAM Journal on Control and Optimization 57(4), 2799-2820 (2019)

[41] Kurzhanskiy, A. A., Varaiya, P.: Reach set computation and control synthesis for discrete-time dynamical systems with disturbances. Automatica, 47, 1414-1426 (2011)