Risk-Averse Stochastic Shortest Path Planning

Mohamadreza Ahmadi, Anushri Dixit, Joel W. Burdick, and Aaron D. Ames

Abstract— We consider the stochastic shortest path planning problem in MDPs, i.e., the problem of designing policies that ensure reaching a goal state from a given initial state with minimum accrued cost. In order to account for rare but important realizations of the system, we consider a nested dynamic coherent risk total cost functional rather than the conventional risk-neutral total expected cost. Under some assumptions, we show that optimal, stationary, Markovian policies exist and can be found via a special Bellman's equation. We propose a computational technique based on difference convex programs (DCPs) to find the associated value functions and therefore the risk-averse policies. A rover navigation MDP is used to illustrate the proposed methodology with conditional-value-at-risk (CVaR) and entropic-value-at-risk (EVaR) coherent risk measures.

The authors are with the California Institute of Technology, 1200 E. California Blvd., MC 104-44, Pasadena, CA 91125. E-mail: {mrahmadi, adixit, ames}@caltech.edu, [email protected].

I. INTRODUCTION

Shortest path problems [10], i.e., the problem of reaching a goal state from an initial state with minimum total cost, arise in several real-world applications, such as driving directions on web mapping services like MapQuest or Google Maps [41] and robotic path planning [18]. In a shortest path problem, if the transition from one system state to another is subject to stochastic uncertainty, the problem is referred to as a stochastic shortest path (SSP) problem [13], [44]. In this case, we are interested in designing policies such that the total expected cost is minimized. Such planning under uncertainty problems are equivalent to undiscounted total-cost Markov decision processes (MDPs) [37] and can be solved efficiently via dynamic programming [13], [12].

However, emerging applications in path planning, such as autonomous navigation in extreme environments, e.g., subterranean [24] and extraterrestrial environments [2], not only require reaching a goal region, but also risk-awareness for mission success. Nonetheless, the conventional total expected cost is only meaningful if the law of large numbers can be invoked, and it ignores important but rare system realizations. In addition, robust planning solutions may give rise to behavior that is extremely conservative.

Risk can be quantified in numerous ways. For example, mission risks can be mathematically characterized in terms of chance constraints [34], [33], exponential utility functions [23], and distributional robustness [48]. Chance constraints often account for Boolean events (such as collision with an obstacle or reaching a goal set) and do not take into consideration the tail of the cost distribution. To account for the latter, risk measures have been advocated for planning and decision making tasks in robotic systems [32]. The preference of one risk measure over another depends on factors such as sensitivity to rare events, ease of estimation from data, and computational tractability. Artzner et al. [8] characterized a set of natural properties that are desirable for a risk measure, called a coherent risk measure, and coherent risk measures have henceforth obtained widespread acceptance in finance and operations research, among other fields. Coherent risk measures can be interpreted as a special form of distributional robustness, which will be leveraged later in this paper.

Conditional value-at-risk (CVaR) is an important coherent risk measure that has received significant attention in decision making problems, such as MDPs [20], [19], [36], [9]. General coherent risk measures for MDPs were studied in [39], [3], wherein it was further assumed that the risk measure is time consistent, akin to the dynamic programming property. Following the footsteps of [39], [46] proposed sampling-based algorithms for MDPs with static and dynamic coherent risk measures using policy gradient and actor-critic methods, respectively (see also a model predictive control technique for linear dynamical systems with coherent risk objectives [45]). A method based on stochastic reachability analysis was proposed in [17] to estimate a CVaR-safe set of initial conditions via the solution to an MDP. A worst-case CVaR SSP planning method was proposed and solved via dynamic programming in [27]. Also, total cost undiscounted MDPs with static CVaR measures were studied in [16] and solved via a surrogate MDP, whose solution approximates the optimal policy with arbitrary accuracy.

In this paper, we propose a method for designing policies for SSP planning problems such that the total accrued cost in terms of dynamic, coherent risk measures is minimized (a generalization of the problems considered in [27] and [16] to dynamic, coherent risk measures). We begin by showing that, under the assumption that the goal region is reachable in finite time with non-zero probability, the total accumulated risk cost is always bounded. We further show that, if the coherent risk measures satisfy a Markovian property, we can find optimal, stationary, Markovian risk-averse policies by solving a special Bellman's equation. We also propose a computational method based on difference convex programming to solve this Bellman's equation and therefore design risk-averse policies. We elucidate the proposed method via numerical examples involving a rover navigation MDP and the CVaR and entropic-value-at-risk (EVaR) measures.

The rest of the paper is organized as follows. In the next section, we review some definitions and properties used in the sequel. In Section III, we present the problem under study and show its well-posedness under an assumption. In Section IV, we present the main result of the paper, i.e., a special Bellman's equation for the risk-averse SSP problem. In Section V, we describe a computational method to find risk-averse policies. In Section VI, we illustrate the proposed method via a numerical example and, finally, in Section VII, we conclude the paper.

Notation: We denote by $\mathbb{R}^n$ the $n$-dimensional Euclidean space and by $\mathbb{N}_{\ge 0}$ the set of non-negative integers. We use bold font to denote a vector and $(\cdot)^\top$ for its transpose, e.g., $\boldsymbol{a} = (a_1, \ldots, a_n)^\top$, with $n \in \{1, 2, \ldots\}$. For a vector $\boldsymbol{a}$, we use $\boldsymbol{a} \succeq (\preceq)\, \boldsymbol{0}$ to denote element-wise non-negativity (non-positivity) and $\boldsymbol{a} \equiv \boldsymbol{0}$ to show that all elements of $\boldsymbol{a}$ are zero. For a finite set $\mathcal{A}$, we denote its power set by $2^{\mathcal{A}}$, i.e., the set of all subsets of $\mathcal{A}$. For a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ and a constant $p \in [1, \infty)$, $\mathcal{L}_p(\Omega, \mathcal{F}, \mathbb{P})$ denotes the vector space of real-valued random variables $c$ for which $\mathbb{E}|c|^p < \infty$. Superscripts are used to denote indices and subscripts are used to denote time steps (stages), e.g., for $s \in \mathcal{S}$, $s^2_1$ means the value of $s^2 \in \mathcal{S}$ at the 1st stage.

II. PRELIMINARIES

This section briefly reviews notions and definitions used throughout the paper.

We are interested in designing policies for a class of finite MDPs (termed transient MDPs in [16]) as shown in Figure 1, which is defined next.

Definition 1 (MDP): An MDP is a tuple $\mathcal{M} = (\mathcal{S}, Act, T, s_0, c, s^g)$, where
- States $\mathcal{S} = \{s^1, \ldots, s^{|\mathcal{S}|}\}$ of the autonomous agent(s) and world model,
- Actions $Act = \{\alpha^1, \ldots, \alpha^{|Act|}\}$ available to the robot,
- A transition probability $T(s^j \mid s^i, \alpha)$, satisfying $\sum_{s \in \mathcal{S}} T(s \mid s^i, \alpha) = 1$, $\forall s^i \in \mathcal{S}$, $\forall \alpha \in Act$,
- An initial state $s_0 \in \mathcal{S}$,
- An immediate cost function $c(s^i, \alpha^i) \ge 0$ for each state $s^i \in \mathcal{S}$ and action $\alpha^i \in Act$, and
- $s^g \in \mathcal{S}$, a special cost-free goal (termination) state, i.e., $T(s^g \mid s^g, \alpha) = 1$ and $c(s^g, \alpha) = 0$ for all $\alpha \in Act$.

We assume the immediate cost function $c$ is non-negative and upper-bounded by a positive constant $\bar{c}$.

Fig. 1. The transition graph of the particular class of MDPs studied in this paper. The goal state $s^g$ is cost-free and absorbing.

Our risk-averse policies for the SSP problem rely on the notion of dynamic coherent risk measures, whose definitions and properties are presented next.

Consider a probability space $(\Omega, \mathcal{F}, \mathbb{P})$, a filtration $\mathcal{F}_0 \subset \cdots \subset \mathcal{F}_T \subset \mathcal{F}$, and an adapted sequence of random variables (stage-wise costs) $c_t$, $t = 0, \ldots, T$, where $T \in \mathbb{N}_{\ge 0} \cup \{\infty\}$. For $t = 0, \ldots, T$, we further define the spaces $\mathcal{C}_t = \mathcal{L}_p(\Omega, \mathcal{F}_t, \mathbb{P})$, $p \in [1, \infty)$, $\mathcal{C}_{t:T} = \mathcal{C}_t \times \cdots \times \mathcal{C}_T$, and $\mathcal{C} = \mathcal{C}_0 \times \mathcal{C}_1 \times \cdots$. In order to describe how one can evaluate the risk of the sub-sequence $c_t, \ldots, c_T$ from the perspective of stage $t$, we require the following definitions.

Definition 2 (Conditional Risk Measure): A mapping $\rho_{t:T} : \mathcal{C}_{t:T} \to \mathcal{C}_t$, where $0 \le t \le T$, is called a conditional risk measure if it has the following monotonicity property:
$\rho_{t:T}(c) \le \rho_{t:T}(c'), \quad \forall c, c' \in \mathcal{C}_{t:T}$ such that $c \preceq c'$.

Definition 3 (Dynamic Risk Measure): A dynamic risk measure is a sequence of conditional risk measures $\rho_{t:T} : \mathcal{C}_{t:T} \to \mathcal{C}_t$, $t = 0, \ldots, T$.

One fundamental property of dynamic risk measures is their consistency over time [39, Definition 3]. If a risk measure is time-consistent, we can define the one-step conditional risk measure $\rho_t : \mathcal{C}_{t+1} \to \mathcal{C}_t$, $t = 0, \ldots, T-1$, as
$\rho_t(c_{t+1}) = \rho_{t,t+1}(0, c_{t+1}),$   (1)
and for all $t = 0, \ldots, T$, we obtain
$\rho_{t,T}(c_t, \ldots, c_T) = \rho_t\big(c_t + \rho_{t+1}(c_{t+1} + \rho_{t+2}(c_{t+2} + \cdots + \rho_{T-1}(c_{T-1} + \rho_T(c_T)) \cdots ))\big).$   (2)
Note that a time-consistent risk measure is completely defined by the one-step conditional risk measures $\rho_t$, $t = 0, \ldots, T-1$, and, in particular, for $t = 0$, (2) defines a risk measure of the entire sequence $c \in \mathcal{C}_{0:T}$.

At this point, we are ready to define a coherent risk measure.

Definition 4 (Coherent Risk Measure): We call the one-step conditional risk measures $\rho_t : \mathcal{C}_{t+1} \to \mathcal{C}_t$, $t = 0, \ldots, T-1$, as in (2) coherent risk measures if they satisfy the following conditions:
- Convexity: $\rho_t(\lambda c + (1-\lambda)c') \le \lambda \rho_t(c) + (1-\lambda)\rho_t(c')$ for all $\lambda \in (0,1)$ and all $c, c' \in \mathcal{C}_{t+1}$;
- Monotonicity: if $c \le c'$, then $\rho_t(c) \le \rho_t(c')$ for all $c, c' \in \mathcal{C}_{t+1}$;
- Translational invariance: $\rho_t(c' + c) = \rho_t(c') + c$ for all $c \in \mathcal{C}_t$ and $c' \in \mathcal{C}_{t+1}$;
- Positive homogeneity: $\rho_t(\beta c) = \beta \rho_t(c)$ for all $c \in \mathcal{C}_{t+1}$ and $\beta \ge 0$.

In fact, it can be shown that there exists a dual (or distributionally robust) representation for any coherent risk measure. Let $m, n \in [1, \infty)$ be such that $1/m + 1/n = 1$ and
$\mathcal{P} = \Big\{ q \in \mathcal{L}_n(\mathcal{S}, 2^{\mathcal{S}}, \mathbb{P}) \;\Big|\; \sum_{s' \in \mathcal{S}} q(s')\,\mathbb{P}(s') = 1,\ q \ge 0 \Big\}.$

Proposition 1 (Proposition 4.14 in [26]): Let $\mathcal{Q}$ be a closed convex subset of $\mathcal{P}$. The one-step conditional risk measure $\rho_t : \mathcal{C}_{t+1} \to \mathcal{C}_t$, $t = 0, \ldots, T-1$, is a coherent risk measure if and only if
$\rho_t(c) = \sup_{q \in \mathcal{Q}} \langle c, q \rangle_{\mathcal{Q}}, \quad \forall c \in \mathcal{C}_{t+1},$   (3)
where $\langle \cdot, \cdot \rangle_{\mathcal{Q}}$ denotes the inner product in $\mathcal{Q}$.

Hereafter, all risk measures are assumed to be coherent.
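As a concrete illustration of Definition 4 and Proposition 1, the sketch below evaluates the CVaR measure (used later in the experiments and recalled in the Appendix, eq. (24)) for a finite cost distribution through its primal Rockafellar-Uryasev form, and notes its dual risk envelope in a comment. This is only an illustrative sketch; the function name and interface are not from the paper.

```python
import numpy as np

def cvar(costs, probs, eps):
    """CVaR_eps of a discrete cost distribution via the primal form
    inf_z { z + E[(c - z)_+] / eps } from eq. (24); eps close to 1 is
    risk-neutral, eps -> 0 is risk-averse (paper's convention)."""
    order = np.argsort(costs)
    c = np.asarray(costs, float)[order]
    p = np.asarray(probs, float)[order]
    # The infimum is attained at the (1 - eps)-quantile of the cost (VaR_eps).
    z = c[np.searchsorted(np.cumsum(p), 1.0 - eps)]
    return z + np.dot(p, np.maximum(c - z, 0.0)) / eps

# Dual view (Proposition 1): CVaR_eps(c) = max <c, q> over the risk
# envelope Q = { q : 0 <= q_i <= p_i / eps, sum_i q_i = 1 }.
```

The dual comment makes the distributionally robust interpretation of (3) explicit: CVaR re-weights the nominal probabilities by at most a factor $1/\varepsilon$.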
III. PROBLEM FORMULATION

Next, we formally describe the risk-averse SSP problem. We also demonstrate that, if the goal state is reachable in finite time, the risk-averse SSP problem is well-posed. Let $\pi = \{\pi_0, \pi_1, \ldots\}$ be an admissible policy.

Problem 1: Consider an MDP $\mathcal{M}$ as described in Definition 1. Given an initial state $s_0 \ne s^g$, we are interested in solving the following problem:
$\pi^* \in \arg\min_{\pi} J(s_0, \pi),$   (4)
where
$J(s_0, \pi) = \lim_{T \to \infty} \rho_{0:T}\big(c(s_0, \pi_0), \ldots, c(s_T, \pi_T)\big)$   (5)
is the total risk functional for the admissible policy $\pi$.

In fact, we are interested in reaching the goal state $s^g$ such that the total risk cost is minimized (see Footnote 1). Note that the risk-averse deterministic shortest path problem can be obtained as a special case when the transitions are deterministic. We define the optimal risk value function as
$J^*(s) = \min_{\pi} J(s, \pi), \quad \forall s \in \mathcal{S},$
and call a stationary policy $\pi = \{\mu, \mu, \ldots\}$ (denoted $\mu$) optimal if $J(s, \mu) = J^*(s) = \min_{\pi} J(s, \pi)$, $\forall s \in \mathcal{S}$.

We posit the following assumption, which implies that the goal state is eventually reachable under all policies.

Assumption 1 (Goal is Reachable in Finite Time): Regardless of the policy used and the initial state, there exists an integer $\tau$ such that there is a positive probability that the goal state $s^g$ is visited after no more than $\tau$ stages (see Footnote 2).

We then have the following observation with respect to Problem 1 (see Footnote 3).

Proposition 2: Let Assumption 1 hold. Then, the risk-averse SSP problem, i.e., Problem 1, is well-posed and $J(s_0, \pi)$ is bounded for all policies $\pi$.

Proof: Assumption 1 implies that for each admissible policy $\pi$, we have $p_\pi = \max_{s \in \mathcal{S}} \mathbb{P}(s_\tau \ne s^g \mid s_0 = s, \pi) < 1$. That is, given a policy $\pi$, the probability $p_\pi$ of not visiting the goal state $s^g$ is less than one. Let $p = \max_\pi p_\pi$. Remark that $p_\pi$ depends solely on $\{\pi_1, \pi_2, \ldots, \pi_\tau\}$. Moreover, since $Act$ is finite, the number of $\tau$-stage policies is also finite, which implies that there are finitely many values $p_\pi$. Hence, $p < 1$ as well. Therefore, for any policy $\pi$ and initial state $s$, we obtain $\mathbb{P}(s_{2\tau} \ne s^g \mid s_0 = s, \pi) = \mathbb{P}(s_{2\tau} \ne s^g \mid s_\tau \ne s^g, s_0 = s, \pi)\, \mathbb{P}(s_\tau \ne s^g \mid s_0 = s, \pi) \le p^2$. Then, by induction, we can show that, for any admissible SSP policy $\pi$, we have
$\mathbb{P}(s_{k\tau} \ne s^g \mid s_0 = s, \pi) \le p^k, \quad \forall s \in \mathcal{S},\ k = 1, 2, \ldots.$   (6)

Indeed, we can show that the risk-averse cost incurred in the $\tau$ periods between $\tau k$ and $\tau(k+1)-1$ is bounded as follows:
$\rho_0\big(\cdots \rho_{\tau(k+1)-1}(c_{\tau k} + \cdots + c_{\tau(k+1)-1}) \cdots \big)$   (7a)
$= \tilde{\rho}(c_{\tau k} + c_{\tau k+1} + \cdots + c_{\tau(k+1)-1})$   (7b)
$\le \tilde{\rho}(\bar{c} + \cdots + \bar{c})$   (7c)
$= \sup_{q \in \mathcal{Q}} \langle \tau\bar{c}, q \rangle_{\mathcal{Q}}$   (7d)
$\le \langle \tau\bar{c}, q^* \rangle_{\mathcal{Q}}$   (7e)
$= \tau\bar{c} \sum_{s \in \mathcal{S}} \mathbb{P}(s_{k\tau} \ne s^g \mid s_0 = s, \pi)\, q^*(s, \pi)$   (7f)
$\le \tau\bar{c} \times \sup_{s \in \mathcal{S}} \big(\mathbb{P}(s_{k\tau} \ne s^g \mid s_0 = s, \pi)\big) \sum_{s \in \mathcal{S}} |q^*(s, \pi)|$   (7g)
$\le \tau\bar{c}\, p^k,$   (7h)
where in (7b) we used the translational invariance property of coherent risk measures and defined $\tilde{\rho} = \rho_0 \circ \cdots \circ \rho_{\tau(k+1)-1}$. Since any finite composition of coherent risk measures is a coherent risk measure [42], $\tilde{\rho}$ is also a coherent risk measure. Moreover, since the immediate cost function $c$ is upper-bounded, the monotonicity property of the coherent risk measure $\tilde{\rho}$ gives (7c). Equality (7d) is derived from Proposition 1. Inequality (7e) is obtained by defining $q^* = \arg\sup_{q \in \mathcal{Q}} \langle \tau\bar{c}, q \rangle_{\mathcal{Q}}$. In inequality (7g), we used Hölder's inequality and, finally, we used (6) to obtain the last inequality. Thus, the risk-averse total cost $J(s, \pi)$, $s \in \mathcal{S}$, exists and is finite, because, given Assumption 1, we have
$|J(s_0, \pi)| = \lim_{T \to \infty} \rho_0 \circ \cdots \circ \rho_{\tau-1} \circ \cdots \circ \rho_T(c_0 + \cdots + c_T) \le \sum_{k=0}^{\infty} \rho_0\big(\cdots \rho_{\tau(k+1)-1}(c_{\tau k} + \cdots + c_{\tau(k+1)-1}) \cdots\big) \le \sum_{k=0}^{\infty} \tau\bar{c}\, p^k = \frac{\tau\bar{c}}{1 - p},$   (8)
where in the first equality above we used the translational invariance property and in the first inequality we used the sub-additivity property of coherent risk measures. Hence, $J(s_0, \pi)$ is bounded for all $\pi$.

Footnote 1: An important class of SSP planning problems is concerned with minimum-time reachability. Indeed, our formulation also encapsulates minimum-time problems, in which for MDP $\mathcal{M}$ we have $c(s) = 1$ for all $s \in \mathcal{S} \setminus \{s^g\}$.
Footnote 2: If, instead of one goal state $s^g$, we were interested in a set of goal states $G \subset \mathcal{S}$, it suffices to define $\tau = \inf\{t \mid \mathbb{P}(s_t \in G \mid s_0 \in \mathcal{S}\setminus G, \pi) > 0\}$. Then, all of the paper's derivations apply. For simplicity of presentation, we present the results for a single goal state.
Footnote 3: Note that Problem 1 is ill-posed in general. For example, if the induced Markov chain for an admissible policy is periodic, then the limit in (5) may not exist. This is in contrast to risk-averse discounted infinite-horizon MDPs [3], for which we only require non-negativity and boundedness of the immediate costs.
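Assumption 1 and the constant $p$ used in Proposition 2 can be checked numerically for a given policy: because $s^g$ is absorbing (Definition 1), the probability of not having reached the goal within $\tau$ steps is read off from a power of the induced transition matrix. The sketch below is illustrative only; the matrix layout and names are assumptions.

```python
import numpy as np

def prob_goal_missed(P_mu, goal_idx, tau):
    """P(s_tau != s^g | s_0 = s) under a fixed stationary policy mu.
    P_mu is the |S| x |S| transition matrix of the induced Markov chain;
    since s^g is absorbing, P(s_tau = s^g) equals the probability that the
    goal has been visited within tau steps."""
    Pt = np.linalg.matrix_power(np.asarray(P_mu, float), tau)
    return 1.0 - Pt[:, goal_idx]

# p_mu = prob_goal_missed(P_mu, goal_idx, tau).max(); Assumption 1 asks for
# p_mu < 1 under every policy, and Proposition 2 then bounds the total risk
# cost by tau * c_bar / (1 - p), with p the maximum of p_mu over policies.
```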
IV. RISK-AVERSE SSP PLANNING

This section presents the paper's main result, which includes a special Bellman's equation for finding the risk value functions for Problem 1. Furthermore, assuming that the coherent risk measures satisfy a Markovian property, we show that the optimal risk-averse policies are stationary and Markovian.

To begin with, note that at any time $t$, the value of $\rho_t$ is $\mathcal{F}_t$-measurable and is allowed to depend on the entire history of the process $\{s_0, s_1, \ldots\}$, so we cannot expect to obtain a Markov optimal policy [35]. In order to obtain Markov optimal policies for Problem 1, we need the following property [39, Section 4] of risk measures.

Definition 5 (Markov Risk Measure [25]): A one-step conditional risk measure $\rho_t : \mathcal{C}_{t+1} \to \mathcal{C}_t$ is a Markov risk measure with respect to the MDP if there exists a risk transition mapping $\sigma_t : \mathcal{L}_m(\mathcal{S}, 2^{\mathcal{S}}, \mathbb{P}) \times \mathcal{S} \times \mathcal{M} \to \mathbb{R}$ such that for all $v \in \mathcal{L}_m(\mathcal{S}, 2^{\mathcal{S}}, \mathbb{P})$ and $\alpha_t \in \pi(s_t)$, we have
$\rho_t(v(s_{t+1})) = \sigma_t\big(v(s_{t+1}), s_t, T(s_{t+1} \mid s_t, \alpha_t)\big).$   (9)
In fact, if $\rho_t$ is a coherent risk measure, $\sigma_t$ also satisfies the properties of a coherent risk measure (Definition 4).

Assumption 2: The one-step coherent risk measure $\rho_t$ is a Markov risk measure.

We can now present the main result of the paper, a form of Bellman's equations for solving the risk-averse SSP problem.

Theorem 1: Consider an MDP $\mathcal{M}$ as described in Definition 1 and let Assumptions 1 and 2 hold. Then, the following statements are true for the risk-averse SSP problem:

(i) Given a (non-negative) initial condition $J^0(s)$, $s \in \mathcal{S}$, the sequence generated by the recursive formula (dynamic programming)
$J^{k+1}(s) = \min_{\alpha \in Act}\Big( c(s, \alpha) + \sigma\big\{ J^k(s'), s, T(s' \mid s, \alpha) \big\} \Big), \quad \forall s \in \mathcal{S},$   (10)
converges to the optimal risk value function $J^*(s)$, $s \in \mathcal{S}$.

(ii) The optimal risk value functions $J^*(s)$, $s \in \mathcal{S}$, are the unique solution to the Bellman's equation
$J^*(s) = \min_{\alpha \in Act}\Big( c(s, \alpha) + \sigma\big\{ J^*(s'), s, T(s' \mid s, \alpha) \big\} \Big), \quad \forall s \in \mathcal{S};$   (11)

(iii) For any stationary Markovian policy $\mu$, the risk-averse value functions $J(s, \mu)$, $s \in \mathcal{S}$, are the unique solutions to
$J(s, \mu) = c(s, \mu(s)) + \sigma\big\{ J(s', \mu), s, T(s' \mid s, \mu(s)) \big\}, \quad \forall s \in \mathcal{S};$   (12)

(iv) A stationary Markovian policy $\mu$ is optimal if and only if $\mu$ attains the minimum in Bellman's equation (11).

Proof: For every positive integer $M$, an initial state $s_0$, and policy $\pi$, we can split the nested risk cost (5), where $\rho_{t,T}$ is defined in (2), at time index $\tau M$ and obtain
$J(s_0, \pi) = \rho_0\Big( c_0 + \cdots + \rho_{\tau M-1}\big( c_{\tau M-1} + \lim_{T \to \infty} \rho_{\tau M}( c_{\tau M} + \cdots + \rho_T(c_T) \cdots ) \big) \cdots \Big),$   (13)
where we used the fact that the one-step coherent risk measures $\rho_t$ are continuous [40, Corollary 3.1] and hence the limiting process and the measure $\rho_t$ commute. The limit term is indeed the total risk cost starting at $s_{\tau M}$, i.e., $J(s_{\tau M}, \pi)$. Next, we show that under Assumption 1 this term remains bounded. From (8) in the proof of Proposition 2, we have
$|J(s_{\tau M}, \pi)| = \lim_{T \to \infty} \rho_{\tau M}\big( c_{\tau M} + \cdots + \rho_T(c_T) \cdots \big) \le \sum_{k=M}^{\infty} \rho_{\tau k}\big( c_{\tau k} + \cdots + \rho_{\tau(k+1)-1}(c_{\tau(k+1)-1}) \cdots \big) \le \sum_{k=M}^{\infty} \tau\bar{c}\, p^k = \frac{\tau\bar{c}\, p^M}{1 - p}.$   (14)

Substituting the above bound in (13) gives
$J(s_0, \pi) \le \rho_0\Big( c_0 + \cdots + \rho_{\tau M-1}\big( c_{\tau M-1} + \tfrac{\tau\bar{c}\, p^M}{1-p} \big) \cdots \Big) = \rho_0\big( c_0 + \cdots + \rho_{\tau M-1}(c_{\tau M-1}) \cdots \big) + \frac{\tau\bar{c}\, p^M}{1 - p},$   (15)
where the last equality holds via the translational invariance property of the one-step risk measures and the fact that $\frac{\tau\bar{c}\, p^M}{1-p}$ is constant. Similarly, following (14), we can also obtain a lower bound on $J(s_0, \pi)$ as follows:
$J(s_0, \pi) \ge \rho_0\Big( c_0 + \cdots + \rho_{\tau M-1}\big( c_{\tau M-1} - \tfrac{\tau\bar{c}\, p^M}{1-p} \big) \cdots \Big) = \rho_0\big( c_0 + \cdots + \rho_{\tau M-1}(c_{\tau M-1}) \cdots \big) - \frac{\tau\bar{c}\, p^M}{1 - p}.$   (16)

Thus, from (15) and (16), we obtain
$J(s_0, \pi) - \frac{\tau\bar{c}\, p^M}{1 - p} \le \rho_0\big( c_0 + \cdots + \rho_{\tau M-1}(c_{\tau M-1}) \cdots \big) \le J(s_0, \pi) + \frac{\tau\bar{c}\, p^M}{1 - p}.$   (17)

Furthermore, Assumption 1 implies that $J^0(s^g) = 0$. If we consider $J^0$ as a terminal risk value function, we can obtain
$|\rho_{\tau M}(J^0(s_{\tau M}))| = \big| \sup_{q \in \mathcal{Q}} \langle J^0(s_{\tau M}), q \rangle_{\mathcal{Q}} \big|$   (18a)
$= \big| \langle J^0(s_{\tau M}), q^* \rangle_{\mathcal{Q}} \big|$   (18b)
$= \Big| \sum_{s \in \mathcal{S}} \mathbb{P}(s_{\tau M} = s \mid s_0, \pi)\, q^*(s, \pi)\, J^0(s) \Big|$   (18c)
$\le \sum_{s \in \mathcal{S}} \mathbb{P}(s_{\tau M} = s \mid s_0, \pi)\, q^*(s, \pi) \times \max_{s \in \mathcal{S}} |J^0(s)|$   (18d)
$\le \sum_{s \in \mathcal{S}} \mathbb{P}(s_{\tau M} = s \mid s_0, \pi) \times \max_{s \in \mathcal{S}} |J^0(s)|$   (18e)
$\le p^M \max_{s \in \mathcal{S}} |J^0(s)|,$   (18f)
where, similar to the derivation in (7), in (18a) we used Proposition 1. Defining $q^*$ as the maximizer of the supremum in (18a), we obtained (18b), and the last inequality is based on the fact that the probability of $s_{\tau M} \ne s^g$ is less than or equal to $p^M$, as in (6). Combining inequalities (17) and (18), we have
$-p^M \max_{s \in \mathcal{S}} |J^0(s)| + J(s_0, \pi) - \frac{\tau\bar{c}\, p^M}{1 - p} \le \rho_0\big( c_0 + \cdots + \rho_{\tau M-1}( c_{\tau M-1} + \rho_{\tau M}(J^0(s_{\tau M})) ) \cdots \big) \le p^M \max_{s \in \mathcal{S}} |J^0(s)| + J(s_0, \pi) + \frac{\tau\bar{c}\, p^M}{1 - p}.$   (19)

Remark that the middle term in the inequality above is the $\tau M$-stage risk-averse cost of the policy $\pi$ with the terminal cost $J^0(s_{\tau M})$. Given Assumption 2, from [39, Theorem 2], the minimum of this cost is generated by the dynamic programming recursion (10) after $\tau M$ iterations. Taking the minimum over the policy $\pi$ on every side of (19) yields
$-p^M \max_{s \in \mathcal{S}} |J^0(s)| + J^*(s_0) - \frac{\tau\bar{c}\, p^M}{1 - p} \le J^{\tau M}(s_0) \le p^M \max_{s \in \mathcal{S}} |J^0(s)| + J^*(s_0) + \frac{\tau\bar{c}\, p^M}{1 - p},$   (20)
for all $s_0$ and $M$. Finally, let $k = \tau M$. Since the above inequality holds for all $M$, taking the limit $M \to \infty$ gives
$\lim_{M \to \infty} J^{\tau M}(s_0) = \lim_{k \to \infty} J^k(s_0) = J^*(s_0), \quad \forall s_0 \in \mathcal{S}.$   (21)

(ii) Taking the limit $k \to \infty$ of both sides of (10) yields $\lim_{k \to \infty} J^{k+1}(s) = \lim_{k \to \infty} \min_{\alpha \in Act}\big( c(s, \alpha) + \sigma\{ J^k(s'), s, T(s' \mid s, \alpha) \} \big)$, $\forall s \in \mathcal{S}$. Equality (21) in the proof of Part (i) implies that
$J^*(s) = \lim_{k \to \infty} \min_{\alpha \in Act}\Big( c(s, \alpha) + \sigma\big\{ J^k(s'), s, T(s' \mid s, \alpha) \big\} \Big), \quad \forall s \in \mathcal{S}.$
Since the limit and the minimization commute over a finite number of alternatives, we have
$J^*(s) = \min_{\alpha \in Act}\Big( c(s, \alpha) + \lim_{k \to \infty} \sigma\big\{ J^k(s'), s, T(s' \mid s, \alpha) \big\} \Big), \quad \forall s \in \mathcal{S}.$
Finally, because $\sigma$ is continuous [40, Corollary 3.1], the limit and $\sigma$ commute as well, and from (21) we obtain $J^*(s) = \min_{\alpha \in Act}\big( c(s, \alpha) + \sigma\{ J^*(s'), s, T(s' \mid s, \alpha) \} \big)$, $\forall s \in \mathcal{S}$. To show uniqueness, note that for any $J(s)$, $s \in \mathcal{S}$, satisfying the above equation, the dynamic programming recursion (10) starting at $J(s)$, $s \in \mathcal{S}$, replicates $J(s)$, $s \in \mathcal{S}$, and from Part (i) we infer $J(s) = J^*(s)$ for all $s \in \mathcal{S}$.

(iii) Given a stationary Markovian policy $\mu$, at every state $s$ we have $\alpha = \mu(s)$; hence, from Item (i), we have
$J^{k+1}(s, \mu) = \min_{\alpha \in \{\mu(s)\}}\Big( c(s, \alpha) + \sigma\big\{ J^k(s', \mu), s, T(s' \mid s, \alpha) \big\} \Big), \quad \forall s \in \mathcal{S}.$
Since the minimum is only over one element, we have $J^{k+1}(s, \mu) = c(s, \mu(s)) + \sigma\{ J^k(s', \mu), s, T(s' \mid s, \mu(s)) \}$, $\forall s \in \mathcal{S}$, which as $k \to \infty$ converges uniquely (see Item (ii)) to $J(s, \mu)$.

(iv) The stationary policy attains the minimum in (11) if
$J^*(s) = \min_{\alpha \in \{\mu(s)\}}\Big( c(s, \alpha) + \sigma\big\{ J^*(s'), s, T(s' \mid s, \alpha) \big\} \Big) = c(s, \mu(s)) + \sigma\big\{ J^*(s'), s, T(s' \mid s, \mu(s)) \big\}, \quad \forall s \in \mathcal{S}.$
Then, Part (iii) and the above equation imply $J(s, \mu) = J^*(s)$ for all $s \in \mathcal{S}$. Conversely, if $J(s, \mu) = J^*(s)$ for all $s \in \mathcal{S}$, then Items (ii) and (iii) imply that $\mu$ is optimal.

At this point, we should highlight that, in the case of conditional expectation as the coherent risk measure ($\rho_t = \mathbb{E}$ and $\sigma\{J(s'), s, T(s' \mid s, \alpha)\} = \sum_{s' \in \mathcal{S}} T(s' \mid s, \alpha) J(s')$), Theorem 1 simplifies to [11, Proposition 5.2.1] for the risk-neutral SSP problem. In fact, Theorem 1 is a generalization of [11, Proposition 5.2.1] to the risk-averse case.

Recursion (dynamic programming) (10) represents Value Iteration (VI) for finding the risk value functions. In general, value iteration converges with an infinite number of iterations ($k \to \infty$), but it can be shown that, for a stationary policy $\mu$ resulting in an acyclic induced Markov chain, the VI algorithm converges in $|\mathcal{S}|$ steps (see the derivation for the risk-neutral SSP problem in [14]).

Alternatively, one can design risk-averse policies using Policy Iteration (PI). That is, starting with an initial policy $\mu^0$, we can carry out policy evaluation via (12), followed by a policy improvement step, which calculates an improved policy $\mu^{k+1}$ as $\mu^{k+1}(s) = \arg\min_{\alpha \in Act}\big( c(s, \alpha) + \sigma\{ J^{\mu^k}(s'), s, T(s' \mid s, \alpha) \} \big)$, $\forall s \in \mathcal{S}$. This process is repeated until no further improvement is found in terms of the risk value functions, i.e., $J^{\mu^{k+1}}(s) = J^{\mu^k}(s)$ for all $s \in \mathcal{S}$.

However, we do not pursue VI or PI approaches further in this work. The main obstacle to using VI and PI is that equations (10)-(12) are nonlinear (and non-smooth) in the risk value functions for a general coherent risk measure. Solving the nonlinear equations (10)-(12) for the risk value functions may require a significant computational burden (see the specialized non-smooth Newton method in [39] for solving similar nonlinear VIs). Instead, the next section presents a computational method based on difference convex programs (DCPs).
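Although the paper does not pursue VI further, recursion (10) is straightforward to prototype for a finite MDP once a one-step risk transition mapping $\sigma$ is fixed. The sketch below plugs in CVaR (eq. (24) in the Appendix) as $\sigma$; the array shapes and function names are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def cvar_sigma(J, trans_row, eps):
    """One-step CVaR risk transition mapping sigma(J, s, T(.|s,a)).
    The infimum over zeta in (24) is piecewise linear and convex, so it is
    attained at a support point of J; we simply enumerate those points."""
    zetas = np.unique(J)
    return min(z + np.dot(trans_row, np.maximum(J - z, 0.0)) / eps for z in zetas)

def risk_value_iteration(c, T, goal, eps=0.3, iters=200, tol=1e-8):
    """Recursion (10): J_{k+1}(s) = min_a [ c(s,a) + sigma(J_k, s, T(.|s,a)) ].
    c: |S| x |A| immediate costs, T: |A| x |S| x |S| transition kernel."""
    S, A = c.shape
    J = np.zeros(S)
    for _ in range(iters):
        Q = np.array([[c[s, a] + cvar_sigma(J, T[a, s], eps) for a in range(A)]
                      for s in range(S)])
        J_new = Q.min(axis=1)
        J_new[goal] = 0.0                     # goal state is cost-free and absorbing
        if np.max(np.abs(J_new - J)) < tol:
            break
        J = J_new
    return J, Q.argmin(axis=1)                # value function and a greedy policy
```

Replacing `cvar_sigma` with an expectation recovers the classical risk-neutral SSP value iteration.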

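A policy iteration variant of the same recursion can be sketched analogously; it reuses the `cvar_sigma` helper from the previous sketch and is likewise only an illustrative, assumed implementation of the PI scheme described above.

```python
import numpy as np

def risk_policy_iteration(c, T, goal, eps=0.3, evals=200):
    """Policy iteration: evaluate a fixed policy by iterating the operator in
    (12), then improve greedily; stops when the policy no longer changes."""
    S, A = c.shape
    mu = np.zeros(S, dtype=int)
    while True:
        J = np.zeros(S)
        for _ in range(evals):                 # policy evaluation via (12)
            J = np.array([c[s, mu[s]] + cvar_sigma(J, T[mu[s], s], eps)
                          for s in range(S)])
            J[goal] = 0.0
        mu_new = np.array([np.argmin([c[s, a] + cvar_sigma(J, T[a, s], eps)
                                      for a in range(A)]) for s in range(S)])
        if np.array_equal(mu_new, mu):
            return J, mu
        mu = mu_new
```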
V. A DCP COMPUTATIONAL APPROACH

In this section, we propose a computational method based on DCPs to find the risk value functions and subsequently the policies that minimize the accrued dynamic risk in SSP planning. Before stating the DCP formulation, we show that the Bellman operator in (10) is non-decreasing. Let
$D_\pi J := c(s, \pi(s)) + \sigma\big( J(s'), s, T(s' \mid s, \pi(s)) \big), \quad \forall s \in \mathcal{S},$
$D J := \min_{\alpha \in Act}\Big( c(s, \alpha) + \sigma\big( J(s'), s, T(s' \mid s, \alpha) \big) \Big), \quad \forall s \in \mathcal{S}.$

Lemma 1: Let Assumptions 1 and 2 hold. For all $v, w \in \mathcal{L}_m(\mathcal{S}, 2^{\mathcal{S}}, \mathbb{P})$, if $v \le w$, then $D_\pi v \le D_\pi w$ and $Dv \le Dw$.

Proof: Since $\rho$ is a Markov risk measure, we have $\rho(v) = \sigma(v, s, T)$ for all $v \in \mathcal{L}_m(\mathcal{S}, 2^{\mathcal{S}}, \mathbb{P})$. Furthermore, since $\rho$ is a coherent risk measure, from Proposition 1 we know that (3) holds. Taking the inner product of both sides of $v \le w$ with $q \in \mathcal{Q} \subset \mathcal{P}$ and taking the supremum over $\mathcal{Q}$ yields $\sup_{q \in \mathcal{Q}} \langle v, q \rangle_{\mathcal{Q}} \le \sup_{q \in \mathcal{Q}} \langle w, q \rangle_{\mathcal{Q}}$. From Proposition 1, we have $\sigma(v, s, T) = \sup_{q \in \mathcal{Q}} \langle v, q \rangle_{\mathcal{Q}}$ and $\sigma(w, s, T) = \sup_{q \in \mathcal{Q}} \langle w, q \rangle_{\mathcal{Q}}$. Therefore, $\sigma(v, s, T) \le \sigma(w, s, T)$. Adding $c$ to both sides of this inequality gives $D_\pi v \le D_\pi w$. Taking the minimum with respect to $\alpha \in Act$ on both sides of $D_\pi v \le D_\pi w$ does not change the inequality and gives $Dv \le Dw$.

We are now ready to state an optimization formulation of the Bellman equation (10).

Proposition 3: Consider an MDP $\mathcal{M}$ as described in Definition 1. Let the assumptions of Theorem 1 hold. Then, the optimal value functions $J^*(s)$, $s \in \mathcal{S}$, are the solutions to the following optimization problem:
$\sup_{J} \sum_{s \in \mathcal{S}} J(s)$   (22)
subject to
$J(s) \le c(s, \alpha) + \sigma\{ J(s'), s, T(s' \mid s, \alpha) \}, \quad \forall (s, \alpha) \in \mathcal{S} \times Act.$

Proof: From Lemma 1, we infer that $D_\pi$ and $D$ are non-decreasing; i.e., for $v \le w$, we have $D_\pi v \le D_\pi w$ and $Dv \le Dw$. Therefore, if $J \le DJ$, then $DJ \le D(DJ)$. By repeated application of $D$, we obtain $J \le DJ \le D^2 J \le \cdots \le D^\infty J = J^*$. Any feasible solution to (22) must satisfy $J \le DJ$ and hence must satisfy $J \le J^*$. Thus, $J^*$ is the largest $J$ that satisfies the constraint in optimization (22). Hence, the optimal solution to (22) is the same as that of (10).

Once a solution $J^*$ to optimization problem (22) is found, we can find a corresponding stationary Markovian policy as
$\mu^*(s) = \arg\min_{\alpha \in Act}\Big( c(s, \alpha) + \sigma\big\{ J^*(s'), s, T(s' \mid s, \alpha) \big\} \Big), \quad \forall s \in \mathcal{S}.$

A. DCPs for Risk-Averse SSP Planning

Assumption 2 implies that each $\rho$ is a coherent, Markov risk measure. Hence, the mapping $v \mapsto \sigma(v, \cdot, \cdot)$ is convex (because $\sigma$ is also a coherent risk measure). We next show that optimization problem (22) is in fact a DCP.

Let $f_0 = 0$, $g_0(J) = \sum_{s \in \mathcal{S}} J(s)$, $f_1(J) = J(s)$, $g_1(s, \alpha) = c(s, \alpha)$, and $g_2(J) = \sigma(J, \cdot, \cdot)$. Note that $f_0$ and $g_1$ are convex (constant) functions and $g_0$, $f_1$, and $g_2$ are convex functions in $J$. Then, (22) can be expressed as the minimization
$\inf_{J}\; f_0 - g_0(J)$
subject to
$f_1(J) - g_1(s, \alpha) - g_2(J) \le 0, \quad \forall s, \alpha.$   (23)

The above optimization problem is indeed a standard DCP [28]. Many applications require solving DCPs, such as feature selection in machine learning [30] and inverse covariance estimation in statistics [47]. DCPs can be solved globally [28], e.g., using branch and bound algorithms [29]. Yet, a locally optimal solution can be obtained more efficiently using techniques from nonlinear optimization [14].

In particular, in this work, we use a variant of the convex-concave procedure [31], [43], wherein the concave terms are replaced by a convex upper bound and the resulting problem is solved. In fact, the disciplined convex-concave programming (DCCP) [43] technique linearizes DCP problems into a (disciplined) convex program (carried out automatically via the DCCP Python package [43]), which is then converted into an equivalent cone program by replacing each function with its graph implementation. The cone program can then be solved readily by available convex programming solvers, such as CVXPY [21]. In the Appendix, we present the specific DCPs required for risk-averse SSP planning with the CVaR and EVaR risk measures used in our numerical experiments in the next section. Note that, for the risk-neutral conditional expectation measure, optimization (23) becomes a linear program, since $\sigma\{J(s'), s, T(s' \mid s, \alpha)\} = \sum_{s' \in \mathcal{S}} T(s' \mid s, \alpha) J(s')$ is linear in the decision variables $J$.
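As a reference point, the following sketch solves (22) in the risk-neutral case, where the constraint is linear in $J$ and the problem is an LP; for CVaR and EVaR, the expectation term is replaced by the expressions in the Appendix and the resulting DCP is handed to the DCCP add-on. The shapes and names below are illustrative assumptions, not the paper's code.

```python
import cvxpy as cp
import numpy as np

def risk_neutral_ssp_lp(c, T, goal):
    """Optimization (22) with sigma = conditional expectation (an LP).
    c: |S| x |A| immediate costs, T: |A| x |S| x |S| transition kernel."""
    S, A = c.shape
    J = cp.Variable(S)
    cons = [J[goal] == 0]                        # goal state is cost-free
    for s in range(S):
        for a in range(A):
            # Bellman inequality J(s) <= c(s,a) + sum_s' T(s'|s,a) J(s')
            cons.append(J[s] <= c[s, a] + T[a, s] @ J)
    prob = cp.Problem(cp.Maximize(cp.sum(J)), cons)
    prob.solve()
    return J.value
```

The problem has $|\mathcal{S}||Act|$ inequality constraints, matching the count reported in the experiments section.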
VI. NUMERICAL EXPERIMENTS

In this section, we evaluate the proposed method for risk-averse SSP planning with a rover navigation MDP (also used in [2], [3]). We consider the traditional total expectation as well as CVaR and EVaR. The experiments were carried out on a MacBook Pro with a 2.8 GHz Quad-Core Intel Core i5 and 16 GB of RAM. The resultant linear programs and DCPs were solved using CVXPY [21] with the DCCP [43] add-on.

An agent (e.g., a rover) must autonomously navigate a 2-dimensional terrain map (e.g., the Mars surface) represented by an $M \times N$ grid with $0.25MN$ obstacles. Thus, the state space is given by $\mathcal{S} = \{ s^i \mid i = x + M(y-1),\ x \in \{1, \ldots, M\},\ y \in \{1, \ldots, N\} \}$, with $x = 1$, $y = 1$ being the leftmost bottom cell. Since the rover can move from cell to cell, its action set is $Act = \{E, W, N, S\}$. The actions move the robot from its current cell to a neighboring cell, with some uncertainty. The state transition probabilities for various cell types are shown for actions E (East) and N (North) in Figure 2; other actions lead to similar transitions. Hitting an obstacle incurs an immediate cost of 5, while the goal grid region has zero immediate cost. Any other grid cell has a cost of 1 to represent fuel consumption.

Fig. 2. Grid world illustration for the rover navigation example. Blue cells denote the obstacles and the yellow cell denotes the goal.

Once the policies are calculated, as a robustness test similar to [20], [2], [3], we included a set of single-grid obstacles that are perturbed in a random direction to one of the neighboring grid cells with probability 0.2 to represent uncertainty in the terrain map. For each risk measure, we run 100 Monte Carlo simulations with the calculated policies and count the number of runs ending in a collision.

In the experiments, we considered three grid-world sizes of $4 \times 5$, $10 \times 10$, and $10 \times 20$, corresponding to 20, 100, and 200 states, respectively. We allocated 2, 4, and 8 uncertain (single-cell) obstacles for the $4 \times 5$, $10 \times 10$, and $10 \times 20$ grids, respectively. In each case, we solve DCP (22) (a linear program in the case of total expectation) with $|\mathcal{S}||Act| = MN \times 4 = 4MN$ constraints and $MN + 1$ variables (the risk value functions $J$ and $\zeta$ for CVaR and EVaR, as discussed in the Appendix). In these experiments, we set the confidence levels to $\varepsilon = 0.3$ (more risk-averse) and $\varepsilon = 0.7$ (less risk-averse) for both the CVaR and EVaR coherent risk measures. The initial condition was chosen as $s_0 = s^1$, i.e., the agent starts at the leftmost grid cell at the bottom, and the goal state was selected as $s^g = s^{MN}$, i.e., the rightmost grid cell at the top.
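For readers who want to reproduce a similar setup, the sketch below builds a toy version of this grid MDP. The exact per-cell transition probabilities come from Figure 2 and are not fully specified in the text, so the slip model, the convention of charging the cost at the current cell, and all names here are assumptions.

```python
import numpy as np

def grid_mdp(M, N, obstacles, goal, slip=0.1):
    """Toy rover grid MDP: actions E/W/N/S move one cell with probability
    1 - 2*slip and slip sideways otherwise; walls are sticky (assumed).
    Obstacle cost 5, goal cost 0, otherwise 1 (fuel)."""
    S, A = M * N, 4
    moves = {0: (1, 0), 1: (-1, 0), 2: (0, 1), 3: (0, -1)}   # E, W, N, S
    side = {0: [2, 3], 1: [2, 3], 2: [0, 1], 3: [0, 1]}      # lateral slips
    idx = lambda x, y: (y - 1) * M + (x - 1)                 # cell (1,1) -> index 0
    T = np.zeros((A, S, S))
    c = np.ones((S, A))
    for x in range(1, M + 1):
        for y in range(1, N + 1):
            s = idx(x, y)
            c[s, :] = 0.0 if s == goal else (5.0 if s in obstacles else 1.0)
            for a in range(A):
                if s == goal:
                    T[a, s, s] = 1.0                         # goal is absorbing
                    continue
                for b, p in [(a, 1 - 2 * slip)] + [(b, slip) for b in side[a]]:
                    dx, dy = moves[b]
                    nx = min(max(x + dx, 1), M)
                    ny = min(max(y + dy, 1), N)
                    T[a, s, idx(nx, ny)] += p
    return c, T
```

The outputs `c` and `T` have the shapes assumed by the earlier sketches, so they can be fed directly into the LP or value-iteration prototypes.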

A summary of our numerical experiments for risk-averse robot path planning is provided in Table I. Note that the computed values of Problem 1 satisfy $\mathbb{E}(c) \le \mathrm{CVaR}_\varepsilon(c) \le \mathrm{EVaR}_\varepsilon(c)$. This is in accordance with the theory that EVaR is a more conservative coherent risk measure than CVaR [5] (see also our work on EVaR-based model predictive control for dynamically moving obstacles [22]). Furthermore, the total accrued risk cost is higher for $\varepsilon = 0.3$, since this leads to more risk-averse policies. For the total expectation coherent risk measure, the calculations took significantly less time, since they are the result of solving a set of linear programs. For CVaR and EVaR, a set of DCPs were solved. The EVaR calculations were the most computationally involved, since they require solving exponential cone programs. Note that these calculations can be carried out offline for policy synthesis and then the policy can be applied.

TABLE I
Comparison between total expectation, CVaR, and EVaR risk measures. $(M \times N)_\rho$ denotes the grid world of size $M \times N$ and one-step coherent risk measure $\rho$. Total Time denotes the time taken by the CVXPY solver to solve the associated linear programs or DCPs. # U.O. denotes the number of single-grid uncertain obstacles used for the robustness test. F.R. denotes the failure rate out of 100 Monte Carlo simulations.

(M x N)_rho           J*(s_0)   Total Time [s]   # U.O.   F.R.
(4x5)_E                13.25        0.56            2      39%
(10x10)_E              27.31        1.04            4      46%
(10x20)_E              38.35        1.30            8      58%
(4x5)_CVaR_0.7         18.76        0.58            2      14%
(10x10)_CVaR_0.7       35.72        1.12            4      19%
(10x20)_CVaR_0.7       47.36        1.36            8      21%
(4x5)_CVaR_0.3         25.69        0.57            2      10%
(10x10)_CVaR_0.3       43.86        1.16            4      13%
(10x20)_CVaR_0.3       49.03        1.34            8      15%
(4x5)_EVaR_0.7         26.67        1.83            2       9%
(10x10)_EVaR_0.7       41.31        2.02            4      11%
(10x20)_EVaR_0.7       50.79        2.64            8      17%
(4x5)_EVaR_0.3         29.05        1.79            2       7%
(10x10)_EVaR_0.3       48.73        2.01            4      10%
(10x20)_EVaR_0.3       61.28        2.78            8      12%

The table also outlines the failure ratios of each risk measure. In this case, EVaR outperformed both CVaR and total expectation in terms of robustness, which is consistent with the fact that EVaR is a more conservative risk measure. Lower failure/collision rates were observed for $\varepsilon = 0.3$, which corresponds to more risk-averse policies. In addition, these results suggest that, although total expectation can be used as a measure of performance over a high number of Monte Carlo simulations, it may not be practical to use it for real-world planning under uncertainty scenarios. CVaR and EVaR seem to be more efficient metrics for performance in shortest path planning under uncertainty.

VII. CONCLUSIONS

We proposed a method based on dynamic programming for designing risk-averse policies for the SSP problem. We presented a computational approach in terms of difference convex programs for finding the associated risk value functions and hence the risk-averse policies. Future research will extend this work to risk-averse MDPs with average costs and to risk-averse MDPs with linear temporal logic specifications, where the former problem can be cast as a special case of the risk-averse SSP problem. In this work, we assumed the states are fully observable; in the future, we will also study SSP problems with partial state observation [4], [2].

REFERENCES

[1] A. Agrawal, R. Verschueren, S. Diamond, and S. Boyd. A rewriting system for convex optimization problems. Journal of Control and Decision, 5(1):42-60, 2018.
[2] M. Ahmadi, M. Ono, M. D. Ingham, R. M. Murray, and A. D. Ames. Risk-averse planning under uncertainty. In 2020 American Control Conference (ACC), pages 3305-3312. IEEE, 2020.
[3] M. Ahmadi, U. Rosolia, M. D. Ingham, R. M. Murray, and A. D. Ames. Constrained risk-averse Markov decision processes. In The 35th AAAI Conference on Artificial Intelligence (AAAI-21), 2021.
[4] M. Ahmadi, R. Sharan, and J. W. Burdick. Stochastic finite state control of POMDPs with LTL specifications. arXiv preprint arXiv:2001.07679, 2020.
[5] A. Ahmadi-Javid. Entropic value-at-risk: A new coherent risk measure. Journal of Optimization Theory and Applications, 155(3):1105-1123, 2012.
[6] A. Ahmadi-Javid and M. Fallah-Tafti. Portfolio optimization with entropic value-at-risk. European Journal of Operational Research, 279(1):225-241, 2019.
[7] A. Ahmadi-Javid and A. Pichler. An analytical study of norms and Banach spaces induced by the entropic value-at-risk. Mathematics and Financial Economics, 11(4):527-550, 2017.
[8] P. Artzner, F. Delbaen, J. Eber, and D. Heath. Coherent measures of risk. Mathematical Finance, 9(3):203-228, 1999.
[9] N. Bäuerle and J. Ott. Markov decision processes with average-value-at-risk criteria. Mathematical Methods of Operations Research, 74(3):361-379, 2011.
[10] R. Bellman. On a routing problem. Quarterly of Applied Mathematics, 16(1):87-90, 1958.
[11] D. Bertsekas. Dynamic Programming and Optimal Control, vol. 1. Athena Scientific, Nashua, NH, USA, 2017.
[12] D. Bertsekas and H. Yu. Stochastic shortest path problems under weak conditions. Lab. for Information and Decision Systems Report LIDS-P-2909, MIT, 2013.
[13] D. P. Bertsekas and J. N. Tsitsiklis. An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16(3):580-595, 1991.
[14] D. P. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
[15] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[16] S. Carpin, Y. Chow, and M. Pavone. Risk aversion in finite Markov decision processes using total cost criteria and average value at risk. In IEEE International Conference on Robotics and Automation, pages 335-342, 2016.
[17] M. Chapman, J. Lacotte, A. Tamar, D. Lee, K. M. Smith, V. Cheng, J. Fisac, S. Jha, M. Pavone, and C. Tomlin. A risk-sensitive finite-time reachability approach for safety of stochastic dynamic systems. In 2019 American Control Conference (ACC), pages 2958-2963. IEEE, 2019.
[18] D. Z. Chen. Developing algorithms and software for geometric path planning problems. ACM Computing Surveys, 28(4es):18-es, 1996.
[19] Y. Chow and M. Ghavamzadeh. Algorithms for CVaR optimization in MDPs. In Advances in Neural Information Processing Systems, pages 3509-3517, 2014.
[20] Y. Chow, A. Tamar, S. Mannor, and M. Pavone. Risk-sensitive and robust decision-making: a CVaR optimization approach. In Advances in Neural Information Processing Systems, pages 1522-1530, 2015.
[21] S. Diamond and S. Boyd. CVXPY: A Python-embedded modeling language for convex optimization. Journal of Machine Learning Research, 17(83):1-5, 2016.
[22] A. Dixit, M. Ahmadi, and J. W. Burdick. Risk-sensitive motion planning using entropic value-at-risk. European Control Conference, 2021.
[23] K. Dvijotham, M. Fazel, and E. Todorov. Convex risk averse control design. In IEEE Conference on Decision and Control, pages 4020-4025, 2014.
[24] D. D. Fan, K. Otsu, Y. Kubo, A. Dixit, J. Burdick, and A. A. Agha-Mohammadi. STEP: Stochastic traversability evaluation and planning for safe off-road navigation. arXiv preprint arXiv:2103.02828, 2021.
[25] J. Fan and A. Ruszczyński. Process-based risk measures and risk-averse control of discrete-time systems. Mathematical Programming, pages 1-28, 2018.
[26] H. Föllmer and A. Schied. Stochastic Finance: An Introduction in Discrete Time. Walter de Gruyter, 2011.
[27] C. Gavriel, G. Hanasusanto, and D. Kuhn. Risk-averse shortest path problems. In 2012 IEEE 51st Conference on Decision and Control (CDC), pages 2533-2538. IEEE, 2012.
[28] R. Horst and N. V. Thoai. DC programming: overview. Journal of Optimization Theory and Applications, 103(1):1-43, 1999.
[29] E. L. Lawler and D. E. Wood. Branch-and-bound methods: A survey. Operations Research, 14(4):699-719, 1966.
[30] H. A. Le Thi, H. M. Le, T. P. Dinh, et al. A DC programming approach for feature selection in support vector machines learning. Advances in Data Analysis and Classification, 2(3):259-278, 2008.
[31] T. Lipp and S. Boyd. Variations and extension of the convex-concave procedure. Optimization and Engineering, 17(2):263-287, 2016.
[32] A. Majumdar and M. Pavone. How should a robot assess risk? Towards an axiomatic theory of risk in robotics. In Robotics Research, pages 75-84. Springer, 2020.
[33] M. Ono, M. Pavone, Y. Kuwata, and J. Balaram. Chance-constrained dynamic programming with application to risk-aware robotic space exploration. Autonomous Robots, 39(4):555-571, 2015.
[34] M. Ono, B. C. Williams, and L. Blackmore. Probabilistic planning for continuous dynamic systems under bounded risk. Journal of Artificial Intelligence Research, 46:511-577, 2013.
[35] J. T. Ott. A Markov decision model for a surveillance application and risk-sensitive Markov decision processes. 2010.
[36] L. Prashanth. Policy gradients for CVaR-constrained MDPs. In International Conference on Algorithmic Learning Theory, pages 155-169. Springer, 2014.
[37] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1st edition, 1994.
[38] R. T. Rockafellar, S. Uryasev, et al. Optimization of conditional value-at-risk. Journal of Risk, 2:21-42, 2000.
[39] A. Ruszczyński. Risk-averse dynamic programming for Markov decision processes. Mathematical Programming, 125(2):235-261, 2010.
[40] A. Ruszczyński and A. Shapiro. Optimization of convex risk functions. Mathematics of Operations Research, 31(3):433-452, 2006.
[41] P. Sanders. Fast route planning. Google Tech Talk, March 23, 2009.
[42] A. Shapiro, D. Dentcheva, and A. Ruszczyński. Lectures on Stochastic Programming: Modeling and Theory. SIAM, 2014.
[43] X. Shen, S. Diamond, Y. Gu, and S. Boyd. Disciplined convex-concave programming. In 2016 IEEE 55th Conference on Decision and Control (CDC), pages 1009-1014. IEEE, 2016.
[44] C. E. Sigal, A. A. B. Pritsker, and J. J. Solberg. The stochastic shortest route problem. Operations Research, 28(5):1122-1129, 1980.
[45] S. Singh, Y. Chow, A. Majumdar, and M. Pavone. A framework for time-consistent, risk-sensitive model predictive control: Theory and algorithms. IEEE Transactions on Automatic Control, 2018.
[46] A. Tamar, Y. Chow, M. Ghavamzadeh, and S. Mannor. Sequential decision making with coherent risk. IEEE Transactions on Automatic Control, 62(7):3323-3338, 2016.
[47] J. Thai, T. Hunter, A. K. Akametalu, C. J. Tomlin, and A. M. Bayen. Inverse covariance estimation from data with missing values using the concave-convex procedure. In IEEE Conference on Decision and Control, pages 5736-5742, 2014.
[48] H. Xu and S. Mannor. Distributionally robust Markov decision processes. In Advances in Neural Information Processing Systems, pages 2505-2513, 2010.

APPENDIX

In this appendix, we present the specific DCPs for finding the risk value functions for the two coherent risk measures studied in our numerical experiments, namely, CVaR and EVaR.

For a given confidence level $\varepsilon \in (0, 1)$, value-at-risk ($\mathrm{VaR}_\varepsilon$) denotes the $(1-\varepsilon)$-quantile value of the cost variable. $\mathrm{CVaR}_\varepsilon$ is the expected loss in the $(1-\varepsilon)$-tail given that the particular threshold $\mathrm{VaR}_\varepsilon$ has been crossed. $\mathrm{CVaR}_\varepsilon$ is given by
$\rho_t(c_{t+1}) = \inf_{\zeta \in \mathbb{R}} \Big\{ \zeta + \frac{1}{\varepsilon}\, \mathbb{E}\big[ (c_{t+1} - \zeta)_+ \mid \mathcal{F}_t \big] \Big\},$   (24)
where $(\cdot)_+ = \max\{\cdot, 0\}$. A value of $\varepsilon \approx 1$ corresponds to a risk-neutral policy, whereas a value of $\varepsilon \to 0$ corresponds to a risk-averse policy.

In fact, Theorem 1 can be applied to CVaR since it is a coherent risk measure. For MDP $\mathcal{M}$, the risk value functions can be computed by DCP (23), where
$g_2(J) = \inf_{\zeta \in \mathbb{R}} \Big\{ \zeta + \frac{1}{\varepsilon} \sum_{s' \in \mathcal{S}} (J(s') - \zeta)_+\, T(s' \mid s, \alpha) \Big\},$
and the infimum on the right-hand side can be absorbed into the overall infimum problem, i.e., $\inf_{J, \zeta}$. Note that $g_2(J)$ above is convex in $\zeta$ [38, Theorem 1].

Unfortunately, CVaR ignores the losses below the VaR threshold (since it is only concerned with the average loss in the $(1-\varepsilon)$-tail of the cost distribution). EVaR is the tightest upper bound, in the sense of the Chernoff inequality, for the value-at-risk (VaR) and CVaR, and its dual representation is associated with the relative entropy. In fact, it was shown in [7] that $\mathrm{EVaR}_\varepsilon$ and $\mathrm{CVaR}_\varepsilon$ are equal only if there are no losses ($c \to 0$) below the $\mathrm{VaR}_\varepsilon$ threshold. In addition, EVaR is a strictly monotone risk measure, whereas CVaR is only monotone [6] (see Definition 4). $\mathrm{EVaR}_\varepsilon$ is given by
$\rho_t(c_{t+1}) = \inf_{\zeta > 0} \Big\{ \frac{1}{\zeta} \log\Big( \frac{\mathbb{E}[ e^{\zeta c_{t+1}} \mid \mathcal{F}_t ]}{\varepsilon} \Big) \Big\}.$   (25)
Similar to $\mathrm{CVaR}_\varepsilon$, for $\mathrm{EVaR}_\varepsilon$, $\varepsilon \to 1$ is the risk-neutral case, whereas $\varepsilon \to 0$ corresponds to the risk-averse case. In fact, it was demonstrated in [5, Proposition 3.2] that $\lim_{\varepsilon \to 0} \mathrm{EVaR}_\varepsilon(c) = \operatorname{ess\,sup}(c) = \bar{c}$ (the worst-case cost).

Since $\mathrm{EVaR}_\varepsilon$ is a coherent risk measure, the conditions of Theorem 1 hold. Since $\zeta > 0$, using the change of variables $\tilde{J} \equiv \zeta J$ (note that this change of variables is monotone increasing in $\zeta$ [1]), we can compute the EVaR value functions by solving (23), where $f_0 = 0$, $f_1(\tilde{J}) = \tilde{J}$, $g_0(\tilde{J}) = \sum_{s \in \mathcal{S}} \tilde{J}(s)$, $g_1(c) = \zeta c$, and
$g_2(\tilde{J}) = \log\Big( \frac{1}{\varepsilon} \sum_{s' \in \mathcal{S}} e^{\tilde{J}(s')}\, T(s' \mid s, \alpha) \Big).$
Similar to the CVaR case, the infimum over $\zeta$ can be lumped into the overall infimum problem, i.e., $\inf_{\tilde{J}, \zeta > 0}$. Note that $g_2(\tilde{J})$ is convex in $\tilde{J}$, since the logarithm of a sum of exponentials is convex [15, p. 72].
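To make the EVaR case concrete, the sketch below writes the constraint of (23) with the log-sum-exp term $g_2(\tilde{J})$ and passes the resulting non-convex problem to the DCCP heuristic, following the DCCP usage pattern of [43]. For simplicity, $\zeta$ is treated here as a fixed positive constant rather than an additional decision variable; that simplification, the shapes, and all names are assumptions of this sketch, not the paper's implementation.

```python
import cvxpy as cp
import dccp  # disciplined convex-concave programming add-on [43]
import numpy as np

def evar_ssp_dccp(c, T, goal, eps=0.3, zeta=1.0):
    """Optimization (23) for EVaR after the change of variables J_tilde = zeta*J.
    c: |S| x |A| costs, T: |A| x |S| x |S| transitions; zeta fixed (assumed)."""
    S, A = c.shape
    Jt = cp.Variable(S)
    cons = [Jt[goal] == 0]
    for s in range(S):
        for a in range(A):
            # g_2(J_tilde) = log( (1/eps) * sum_s' exp(J_tilde(s')) T(s'|s,a) )
            g2 = cp.log_sum_exp(Jt + np.log(T[a, s] + 1e-12)) - np.log(eps)
            cons.append(Jt[s] <= zeta * c[s, a] + g2)   # difference-of-convex constraint
    prob = cp.Problem(cp.Maximize(cp.sum(Jt)), cons)
    prob.solve(method="dccp")                            # convex-concave heuristic
    return Jt.value / zeta                               # recover J from J_tilde
```

In practice one would also search over $\zeta > 0$ (e.g., by an outer line search), consistent with lumping the infimum over $\zeta$ into the overall problem as described above.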