Risk-Averse Stochastic Shortest Path Planning
Total Page:16
File Type:pdf, Size:1020Kb
Risk-Averse Stochastic Shortest Path Planning Mohamadreza Ahmadi, Anushri Dixit, Joel W. Burdick, and Aaron D. Ames Abstract— We consider the stochastic shortest path planning one risk measure over another depends on factors such as problem in MDPs, i.e., the problem of designing policies that sensitivity to rare events, ease of estimation from data, and ensure reaching a goal state from a given initial state with computational tractability. Artzner et. al. [8] characterized a minimum accrued cost. In order to account for rare but impor- tant realizations of the system, we consider a nested dynamic set of natural properties that are desirable for a risk measure, coherent risk total cost functional rather than the conventional called a coherent risk measure, and have henceforth obtained risk-neutral total expected cost. Under some assumptions, we widespread acceptance in finance and operations research, show that optimal, stationary, Markovian policies exist and among others. Coherent risk measures can be interpreted as can be found via a special Bellman’s equation. We propose a a special form of distributional robustness, which will be computational technique based on difference convex programs (DCPs) to find the associated value functions and therefore the leveraged later in this paper. risk-averse policies. A rover navigation MDP is used to illustrate Conditional value-at-risk (CVaR) is an important coherent risk the proposed methodology with conditional-value-at-risk (CVaR) measure that has received significant attention in decision and entropic-value-at-risk (EVaR) coherent risk measures. making problems, such as MDPs [20], [19], [36], [9]. General coherent risk measures for MDPs were studied in [39], [3], wherein it was further assumed the risk measure is I. INTRODUCTION time consistent, akin to the dynamic programming property. Shortest path problems [10], i.e., the problem of reaching Following the footsteps of [39], [46] proposed a sampling- a goal state form an initial state with minimum total cost, based algorithm for MDPs with static and dynamic coherent arise in several real-world applications, such as driving di- risk measures using policy gradient and actor-critic methods, rections on web mapping websites like MapQuest or Google respectively (also, see a model predictive control technique for Maps [41] and robotic path planning [18]. In a shortest path linear dynamical systems with coherent risk objectives [45]). problem, if transitions from one system state to another is A method based on stochastic reachability analysis was pro- subject to stochastic uncertainty, the problem is referred to posed in [17] to estimate a CVaR-safe set of initial conditions as a stochastic shortest path (SSP) problem [13], [44]. In this via the solution to an MDP. A worst-case CVaR SSP planning case, we are interested in designing policies such that the total method was proposed and solved via dynamic programming expected cost is minimized. Such planning under uncertainty in [27]. Also, total cost undiscounted MDPs with static CVaR problems are indeed equivalent to an undiscounted total cost measures were studied in [16] and solved via a surrogate Markov decision processes (MDPs) [37] and can be solved MDP, whose solution approximates the optimal policy with efficiently via the dynamic programming method [13], [12]. arbitrary accuracy. However, emerging applications in path planning, such as In this paper, we propose a method for designing policies autonomous navigation in extreme environments, e.g., sub- for SSP planning problems, such that the total accrued cost terranean [24] and extraterrestrial environments [2], not only in terms of dynamic, coherent risk measures is minimized (a require reaching a goal region, but also risk-awareness for generalization of the problems considered in [27] and [16] to mission success. Nonetheless, the conventional total expected dynamic, coherent risk measures). We begin by showing that, cost is only meaningful if the law of large numbers can be in- under the assumption that the goal region is reachable in finite arXiv:2103.14727v1 [eess.SY] 26 Mar 2021 voked and it ignores important but rare system realizations. In time with non-zero probability, the total accumulated risk cost addition, robust planning solutions may give rise to behavior is always bounded. We further show that, if the coherent risk that is extremely conservative. measures satisfy a Markovian property, we can find optimal, Risk can be quantified in numerous ways. For example, stationary, Markovian risk-averse policies via solving a spe- mission risks can be mathematically characterized in terms cial Bellman’s equation. We also propose a computational of chance constraints [34], [33], utility functions [23], and method based on difference convex programming to solve the distributional robustness [48]. Chance constraints often ac- Bellman’s equation and therefore design risk-averse policies. count for Boolean events (such as collision with an obstacle We elucidate the proposed method via numerical examples or reaching a goal set) and do not take into consideration involving a rover navigation MDP and CVaR and entropic- the tail of the cost distribution. To account for the latter, value-at-risk (EVaR) measures. risk measures have been advocated for planning and decision The rest of the paper is organized as follows. In the next making tasks in robotic systems [32]. The preference of section, we review some definitions and properties used in the sequel. In Section III, we present the problem under The authors are with the California Institute of Technology, 1200 E. study and show its well-posedness under an assumption. In California Blvd., MC 104-44, Pasadena, CA 91125, e-mail: (fmrahmadi, adixit, [email protected], [email protected].) Section IV, we present the main result of the paper, i.e., a Consider a probability space pΩ; F; Pq, a filtration F0 Ă ¨ ¨ ¨ FT Ă F, and an adapted sequence of random vari- si ables (stage-wise costs) ct; t “ 0;:::;T , where T P N¥0 Y t8u. For t “ 0;:::;T , we further define the spaces Ct “ LppΩ; Ft; Pq, p P r1; 8q, Ct:T “ Ct ˆ ¨ ¨ ¨ ˆ CT and g g T ps | s ; αq “ 1 C “ C0 ˆ C1 ˆ ¨ ¨ ¨ . In order to describe how one can evaluate the risk of sub-sequence ct; : : : ; cT from the perspective of stage t, we require the following definitions. sg sj Definition 2 (Conditional Risk Measure): A mapping ρt:T : Ct:T Ñ Ct, where 0 ¤ t ¤ N, is called a conditional risk Fig. 1. The transition graph of the particular class of MDPs studied in this measure, if it has the following monotonicity property: paper. The goal state sg is cost-free and absorbing. 1 1 1 ρt:T pcq ¤ ρt:T pc q; @c; @c P Ct:T such that c ¨ c : special Bellman’s equation for the risk-averse SSP problem. Definition 3 (Dynamic Risk Measure): A dynamic risk mea- In Section V, we describe a computational method to find sure is a sequence of conditional risk measures ρt:T : Ct:T Ñ risk-averse policies. In Section VI, we illustrate the proposed Ct, t “ 0;:::;T . method via a numerical example and finally, in Section VI, One fundamental property of dynamic risk measures is their we conclude the paper. consistency over time [39, Definition 3]. If a risk measure is Notation: We denote by Rn the n-dimensional Euclidean time-consistent, we can define the one-step conditional risk space and N¥0 the set of non-negative integers. We use measure ρt : Ct`1 Ñ Ct, t “ 0;:::;T ´ 1 as follows: bold font to denote a vector and p¨qJ for its transpose, e.g., J ρtpct`1q “ ρt;t`1p0; ct`1q; (1) a “ pa1; : : : ; anq , with n P t1; 2;:::u. For a vector a, we use a © p¨q0 to denote element-wise non-negativity (non- and for all t “ 1;:::;T , we obtain: positivity) and a ” 0 to show all elements of a are zero. A For a finite set A, we denote its power set by 2 , i.e., the ρt;T pct; : : : ; cT q “ ρt ct ` ρt`1pct`1 ` ρt`2pct`2 ` ¨ ¨ ¨ set of all subsets of A. For a probability space pΩ; F; q P `` ρT ´1 pcT ´1 ` ρT pcT qq ¨ ¨ ¨ qq : (2) and a constant p P r1; 8q, LppΩ; F; Pq denotes the vector p Note that the time-consistent risk measure is completely de- space of real valued random variables c for which E|c| ă 8. ˘ Superscripts are used to denote indices and subscripts are used fined by one-step conditional risk measures ρt, t “ 0;:::;T ´ 2 1 and, in particular, for t “ 0, (2) defines a risk measure of to denote time steps (stages), e.g., for s P S, s1 means the the value of s2 P S at the 1st stage. the entire sequence c P C0:T . At this point, we are ready to define a coherent risk measure. II. PRELIMINARIES Definition 4 (Coherent Risk Measure): We call the one-step This section, briefly reviews notions and definitions used conditional risk measures ρt : Ct`1 Ñ Ct, t “ 1;:::;N ´ 1 throughout the paper. as in (2) a coherent risk measure, if it satisfies the following conditions We are interested in designing policies for a class of finite 1 1 MDPs (termed transient MDPs in [16]) as shown in Figure 1, ‚ Convexity: ρtpλc ` p1 ´ λqc q ¤ λρtpcq ` p1 ´ λqρtpc q, 1 which is defined next. for all λ P p0; 1q and all c; c P Ct`1; 1 1 ‚ Monotonicity: If c ¤ c , then ρtpcq ¤ ρtpc q for all Definition 1 (MDP): An MDP is a tuple, M “ 1 g c; c P Ct`1; pS; Act; T; s0; c; s q, where 1 1 ‚ Translational Invariance: ρtpc `cq “ ρtpc q`c for all 1 |S| ‚ States S s ; : : : ; s of the autonomous agent(s) 1 “ t u c P Ct and c P Ct`1; and world model, ‚ Positive Homogeneity: ρtpβcq “ βρtpcq for all c P Ct`1 1 |Act| ‚ Actions Act “ tα ; : : : ; α u available to the robot, and β ¥ 0.