Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence

Probabilistic Goal Markov Decision Processes∗

Huan Xu
Department of Mechanical Engineering
National University of Singapore, Singapore
[email protected]

Shie Mannor
Department of Electrical Engineering
Technion, Israel
[email protected]

Abstract

The Markov decision process model is a powerful tool in planning tasks and sequential decision making problems. The randomness of state transitions and rewards implies that the performance of a policy is often itself random. In contrast to the standard approach that studies the expected performance, we consider the policy that maximizes the probability of achieving a pre-determined target performance, a criterion we term probabilistic goal Markov decision processes. We show that this problem is NP-hard, but can be solved using a pseudo-polynomial algorithm. We further consider a variant dubbed "chance-constraint Markov decision problems," which treats the probability of achieving the target performance as a constraint instead of the maximizing objective. This variant is NP-hard, but can be solved in pseudo-polynomial time.

1 Introduction

The Markov Decision Process (MDP) model is a powerful tool in planning tasks and sequential decision making problems [Puterman, 1994; Bertsekas, 1995]. In MDPs, the system dynamics is captured by transitions between a finite number of states. In each decision stage, a decision maker picks an action from a finite action set, then the system evolves to a new state accordingly, and the decision maker obtains a reward. Both the state transition and the immediate reward are often random in nature, and hence the cumulative reward X may be inherently stochastic. The classical approach deals with the maximization of the expected value of X, which implicitly assumes that the decision maker is risk-neutral.

In this paper we propose and study an alternative optimization criterion: fix a target level V ∈ R; the decision goal is to find a policy π that maximizes the probability of achieving the target, i.e., to maximize Pr(X_π ≥ V). The criterion of maximizing the probability of achieving a goal, which we call probabilistic goal, is an intuitively appealing objective, and is justified by axiomatic decision theory [Castagnoli and LiCalzi, 1996]. Moreover, empirical research has also concluded that in daily decision making, people tend to regard risk primarily as failure to achieve a predetermined goal [Lanzillotti, 1958; Simon, 1959; Mao, 1970; Payne et al., 1980; 1981].

Different variants of the probabilistic goal formulation, such as chance constrained programs (instead of maximizing the probability of achieving a target, a chance constrained program requires this probability to be no less than a fixed threshold), have been extensively studied in single-period optimization [Miller and Wagner, 1965; Prékopa, 1970]. However, little has been done in the context of sequential decision problems, including MDPs. The standard approaches in risk-averse MDPs include maximization of an expected utility function [Bertsekas, 1995], and optimization of a coherent risk measure [Riedel, 2004; Le Tallec, 2007]. Both approaches lead to formulations that cannot be solved in polynomial time, except for special cases including the exponential utility function [Chung and Sobel, 1987], piecewise linear utility functions with a single breakdown point [Liu and Koenig, 2005], and risk measures that can be reduced to robust MDPs satisfying the so-called "rectangular condition" [Nilim and El Ghaoui, 2005; Iyengar, 2005].

Two notable exceptions that explicitly investigate the probabilistic goal or chance constrained formulation in the context of MDPs are [Filar et al., 1995] and [Delage and Mannor, 2010]. The first reference considers the average reward case of MDPs, and reduces the probability of achieving a target to the probability of entering irreducible chains. The analysis is thus very specific and cannot be extended to the finite horizon case or the discounted reward infinite horizon case. The second reference investigated, instead of the internal randomness, the parametric uncertainty of the cumulative reward. That is, the parameters of the MDP, namely the transition probability and the reward, are not precisely known. While the decision maker is risk-neutral to internal randomness (performance fluctuation due to the stochasticity of the state transition, the immediate reward, and possible randomness of the action), he/she is risk averse to the performance deviation due to the parameter uncertainty. Conceptually, in this setup one can think of having many identical machines all with the same uncertain parameter, and the goal is to maximize the probability that the average reward (w.r.t. all machines) achieves a certain target. Because of this apparent difference in modeling, the techniques used in [Delage and Mannor, 2010] do not extend to the probabilistic goal formulation with internal randomness, the setup that we consider.

∗ H. Xu is supported by an NUS startup grant R-265-000-384-133. S. Mannor is supported by the Israel Science Foundation under contract 890015.

To the best of our knowledge, the probabilistic goal or chance constrained formulation has not been investigated for finite-horizon or infinite-horizon discounted reward MDPs. Our contributions include the following:

1. We compare different policy sets: we show that for a probabilistic goal MDP, an optimal policy may depend on the accumulated reward, and randomization does not improve the performance.

2. We show that the probabilistic goal MDP is NP-hard. Thus, there is little hope that such a problem can be solved in polynomial time in general.

3. We propose a pseudo-polynomial algorithm based on state augmentation that solves the probabilistic goal MDP.

4. We investigate chance constrained MDPs and show that they can be solved in pseudo-polynomial time.

Before concluding this section, let us briefly discuss some practical examples that motivate this study. Consider the following scenario that many of us have experienced. Suppose one wants to drive to the airport to catch a plane which is soon to take off. He/she needs to decide which route to take, while the time he/she will spend on each link is random. This can be modeled as an MDP with a deterministic transition probability and random reward. It is clear that the expected time the decision maker will spend en route is less critical than whether he/she will catch the plane, which makes this a natural fit for the probabilistic goal MDP. Other examples can be found in finance, where synthetic derivatives that are triggered by discrete events are becoming common, and hence minimizing or maximizing the probability of such an event is relevant and important; and in airplane design, where one seeks a reconfiguration policy that maximizes the chance of not having a critical failure.

2 Setup

In this section we present the problem setup and some necessary notation. A (finite) MDP is a sextuple <T, S, A, R, p, g> where

• T is the time horizon, assumed to be finite;
• S is a finite set of states, with s_0 ∈ S the initial state;
• A is a collection of finite action sets, one set for each state. For s ∈ S, we denote its action set by A_s;
• R is a finite subset of R, and is the set of possible values of the immediate rewards. We let K = max_{r∈R} |r|;
• p is the transition probability. That is, p_t(s'|s, a) is the probability that the state at time t+1 is s', given that the state at time t is s, and the action chosen at time t is a;
• g is a set of reward distributions. In particular, g_t(r|s, a) is the probability that the immediate reward at time t is r, if the state is s and the action is a.

As is standard, we use the symbol π to denote a policy of an MDP. A history-dependent, deterministic policy is a mapping from the history H_t = (s_{0:t}, a_{0:t}, r_{0:t}) of the process to an action a ∈ A_{s_t}. We use Π^h to represent the set of history-dependent deterministic policies. The set of deterministic policies that only depend on the current state (and the time horizon), which are often called Markovian policies, is denoted by Π^{t,s}. As we show in the sequel, it is sometimes beneficial to incorporate the accumulated reward in the decision. The set of policies that depend on the time horizon, the current state, and the accumulated reward up to now, which we call "pseudo-Markovian" policies, is denoted by Π^{t,s,x}.

Besides deterministic policies, we also consider randomized policies. That is, we assume there is available a sequence of i.i.d. random variables U_1, ..., U_T, independent of everything else, and a history-dependent, randomized policy is a mapping from (H_t, U_t) to an action. We denote the set of such policies by Π^{h,u}. The sets of Markovian and pseudo-Markovian randomized policies are defined similarly, and denoted respectively by Π^{t,s,u} and Π^{t,s,x,u}.

Note that given the distribution of the reward parameter, the total reward of the MDP under a policy π is a well defined random variable, denoted by X_π. We are interested in the following problems. As is standard in the study of computational complexity, the first problem we consider is a "yes/no" decision problem.

Problem 1 (Decision Problem). Given V ∈ R and α ∈ (0, 1), is there a π ∈ Π^{h,u} such that

Pr(X_π ≥ V) ≥ α?

We call this problem D(Π^{h,u}). Similarly, we define D(Π^{t,s,u}), D(Π^{t,s,x,u}), and their deterministic counterparts. A related problem of more practical interest is the optimization one.

Problem 2 (probabilistic goal MDP). Given V ∈ R, find π ∈ Π^{h,u} that maximizes

Pr(X_π ≥ V).

Let us remark that while we focus in this paper on the finite horizon case, most results easily extend to the infinite horizon discounted reward case, due to the fact that the contribution of the tail of the time horizon can be made arbitrarily small by increasing the time horizon.
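To make the setup and Problems 1 and 2 concrete, here is a minimal sketch, not code from the paper: the class name, the dictionary-based data layout and the function names are our assumptions. It represents a finite MDP as in the sextuple above and estimates Pr(X_π ≥ V) for a given policy by Monte Carlo simulation.

```python
import random

class FiniteMDP:
    """Illustrative container for the sextuple <T, S, A, R, p, g>."""
    def __init__(self, T, states, actions, trans, reward_dist, s0):
        self.T = T                      # finite horizon
        self.states = states            # finite state set S
        self.actions = actions          # dict: state -> list of actions (A_s)
        self.trans = trans              # trans[(t, s, a)] -> list of (next_state, prob)
        self.reward_dist = reward_dist  # reward_dist[(t, s, a)] -> list of (reward, prob)
        self.s0 = s0                    # initial state

def sample(pairs):
    """Draw one outcome from a list of (value, probability) pairs."""
    u, acc = random.random(), 0.0
    for value, p in pairs:
        acc += p
        if u <= acc:
            return value
    return pairs[-1][0]

def estimate_success_prob(mdp, policy, V, n_runs=100_000):
    """Monte Carlo estimate of Pr(X_pi >= V), where policy(t, s, x) -> action
    and x is the reward accumulated so far (a pseudo-Markovian policy)."""
    hits = 0
    for _ in range(n_runs):
        s, x = mdp.s0, 0.0
        for t in range(mdp.T):
            if not mdp.actions.get(s):   # terminal state: no actions available
                break
            a = policy(t, s, x)
            x += sample(mdp.reward_dist[(t, s, a)])
            s = sample(mdp.trans[(t, s, a)])
        hits += (x >= V)
    return hits / n_runs
```

A Markovian policy simply ignores the x argument; Section 3 below shows that letting the action depend on x can strictly improve Pr(X_π ≥ V).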

[Figure 1: Illustration of Example 1]

[Figure 2: Example of NP-hard probabilistic goal MDP]

3 Comparison of policy sets

In this section we compare different policy sets. We say a policy set Π is "inferior" to another set Π' if D(Π) being true implies that D(Π') is true, and there exists an instance where the reverse does not hold. We say two policy sets Π and Π' are "equivalent" if D(Π) is true if and only if D(Π') is true.

We first show that randomization does not help: Π^h is equivalent to Π^{h,u}, Π^{t,s,x} is equivalent to Π^{t,s,x,u}, and similarly Π^{t,s} is equivalent to Π^{t,s,u}. This essentially means that for the probabilistic goal MDP, it suffices to focus on Π^h, Π^{t,s,x} and Π^{t,s}. Note that it suffices to show the following theorem, since the reverse direction holds trivially.

Theorem 1. Given an MDP, α ∈ [0, 1], and V, if there exists π_u ∈ Π^{h,u} (respectively Π^{t,s,x,u} and Π^{t,s,u}) such that

Pr(X_{π_u} ≥ V) ≥ α,

then there exists π ∈ Π^h (respectively Π^{t,s,x} and Π^{t,s}) such that

Pr(X_π ≥ V) ≥ α.

Proof. Since T < ∞, the set Π^h is finite, i.e., there is a finite number of deterministic history-dependent policies. For succinctness of presentation, we write π_u ∼ π, where π_u ∈ Π^{h,u} and π ∈ Π^h, to denote the event (on the probability space of U) that π_u(H_t, U_{0:t}) = π(H_t) for all t and H_t. Fix a randomized policy π_u ∈ Π^{h,u}; by the theorem of total probability we have

Pr(X_{π_u} ≥ V) = Σ_{π∈Π^h} Pr(X_{π_u} ≥ V | π_u ∼ π) Pr(π_u ∼ π) = Σ_{π∈Π^h} Pr(X_π ≥ V) Pr(π_u ∼ π).

Since Σ_{π∈Π^h} Pr(π_u ∼ π) = 1, we have that

max_{π∈Π^h} Pr(X_π ≥ V) ≥ Pr(X_{π_u} ≥ V),

which establishes the theorem for the history-dependent case. The other two cases are essentially the same and hence omitted.

It is straightforward to generalize the finite horizon case to the discounted infinite horizon case, due to the fact that the contribution of the tail of the time horizon can be made arbitrarily small. (The proof is by contradiction: if there were a difference in performance in the limit, it would have to appear in finite time as well.)

We next show that, in general, Π^{t,s} is inferior to Π^{t,s,x}. This essentially means that including the information of the accumulated reward can improve the performance of a policy for the probabilistic goal MDP.

Example 1. Consider the MDP in Figure 1: S = {s_0, s_1, t}, where t is the terminal state. The initial state s_0 has one action, which leads to state s_1, and whose immediate reward is +1 with probability 0.5 and −1 otherwise. State s_1 has two actions a and b, both leading to state t. The immediate reward of a is always 0, and that of b is +1 with probability 0.5 and −2 otherwise. Observe that for either fixed action a or b, Pr(X ≥ 0) = 0.5. In contrast, consider the policy π that at state s_1 takes action a if the accumulated reward is +1, and otherwise takes action b. Observe that Pr(X_π ≥ 0) = 0.75, which shows that Π^{t,s} is inferior to Π^{t,s,x}.

Finally, we remark that adding extra information beyond the cumulative reward does not improve the performance for the probabilistic goal MDP, as we will show in Section 5. The policy classes Π^{t,s,x} and Π^h are therefore equivalent.
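The probabilities claimed in Example 1 can be checked by enumerating the four reward outcomes; the short script below (ours, purely illustrative and not from the paper) reproduces the 0.5 versus 0.75 comparison.

```python
from itertools import product

# Rewards in Example 1: the single action at s0, and action b at s1
# (action a at s1 always gives reward 0).
R0 = [(+1, 0.5), (-1, 0.5)]
RB = [(+1, 0.5), (-2, 0.5)]

def success_prob(rule):
    """Pr(X >= 0) when the action taken at s1 is rule(r0) in {'a', 'b'}."""
    total = 0.0
    for (r0, p0), (rb, pb) in product(R0, RB):
        x = r0 if rule(r0) == 'a' else r0 + rb
        total += p0 * pb * (x >= 0)
    return total

print(success_prob(lambda r0: 'a'))                       # fixed action a: 0.5
print(success_prob(lambda r0: 'b'))                       # fixed action b: 0.5
print(success_prob(lambda r0: 'a' if r0 == +1 else 'b'))  # reward-dependent policy: 0.75
```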

4 Complexity

In this section we show that, in general, solving the probabilistic goal MDP is a computationally hard problem.

Theorem 2. The problem D(Π) is NP-hard, for Π equal to Π^h, Π^{h,u}, Π^{t,s}, Π^{t,s,u}, Π^{t,s,x} or Π^{t,s,x,u}.

Proof. We start by proving the result for Π^{t,s}, by showing that we can reduce a well-known NP-complete problem, the Knapsack Problem (KSP), to a probabilistic goal MDP, and hence establish its NP-hardness. Recall that a KSP is the following problem. There are n items, 1 through n; each item i has a value v_i and a weight w_i. We can assume they are all non-negative integers. The KSP problem is to decide, given two positive numbers W and V, whether there exists I ⊆ [1:n] such that

Σ_{i∈I} w_i ≤ W  and  Σ_{i∈I} v_i ≥ V.

Given n, w_i, v_i, W and V, we construct the following MDP. Let T = n + 2. There are n + 2 states: S = {s_1, ..., s_n, s^bad, t}, where s_1 is the initial state and t is the terminal state. Each state s_i has two actions: if action a is taken, then the immediate reward is 0 and the next state is s_{i+1}; if action b is taken, then the immediate reward is v_i; furthermore, with probability 1/2^{w_i} the next state is s_{i+1} (respectively the terminal state t if i = n) and otherwise the next state is s^bad. The state s^bad has only one action, which incurs an immediate reward −L ≪ −2 Σ_{i=1}^n v_i, and the next state is t. See Figure 2.

Now consider the decision problem D(Π^{t,s}) with α = 1/2^W, that is, to answer whether there exists π ∈ Π^{t,s} such that Pr(X_π ≥ V) ≥ α. We now show that the answer to D(Π^{t,s}) is positive if and only if the answer to KSP is positive.

Suppose the answer to D(Π^{t,s}) is positive, i.e., there exists a policy π ∈ Π^{t,s} such that Pr(X_π ≥ V) ≥ α. Define the set I as all i ∈ [1:n] such that π takes b in s_i. Observe that this implies

Σ_{i∈I} v_i ≥ V.

Furthermore, due to the extremely large negative reward in s^bad,

Pr(X_π ≥ V) ≤ Pr(s^bad is never reached) = ∏_{i∈I} 1/2^{w_i} = 1/2^{Σ_{i∈I} w_i},

which implies that

1/2^{Σ_{i∈I} w_i} ≥ 1/2^W  ⇒  Σ_{i∈I} w_i ≤ W.

Thus, the answer to KSP is also positive.

Now suppose the answer to KSP is positive, i.e., there exists I ⊆ [1:n] such that

Σ_{i∈I} w_i ≤ W  and  Σ_{i∈I} v_i ≥ V.

Consider the following policy π of the MDP: take action b for all s_i where i ∈ I, and a otherwise. We have

Pr(s^bad is never reached) = ∏_{i∈I} 1/2^{w_i} = 1/2^{Σ_{i∈I} w_i} ≥ 1/2^W = α.

Furthermore, when s^bad is never reached, the cumulative reward is Σ_{i∈I} v_i ≥ V. Thus, we have that

Pr(X_π ≥ V) ≥ α,

i.e., the answer to D(Π^{t,s}) is also positive.

Thus, determining the answer to KSP is reduced to answering D(Π^{t,s}), the probabilistic goal MDP. Since the former is NP-complete, we conclude that the probabilistic goal MDP is NP-hard.

Notice that in this example, Π^{t,s} = Π^{t,s,x} = Π^h. Thus, NP-hardness for the decision problems with respect to these policy sets is established as well. Furthermore, Theorem 1 implies that D(Π^{h,u}), D(Π^{t,s,u}), D(Π^{t,s,x,u}) are also NP-hard.

5 Pseudo-polynomial Algorithm

In the previous section we showed that there is little hope that the probabilistic goal MDP can be solved in polynomial time. In this section we develop a pseudo-polynomial algorithm to handle this question. Recall that the running time of a pseudo-polynomial algorithm is polynomial in the number of parameters and the size (magnitude) of the parameters, as opposed to polynomial algorithms whose running time is polynomial in the number of parameters and polylogarithmic in the size of the parameters.

5.1 Integer Reward

The main technique of our pseudo-polynomial algorithm is state augmentation. That is, we construct a new MDP in which the state space is the Cartesian product of the original state space and the set of possible accumulated rewards. To better understand the algorithm, we first consider in this subsection a special case where the immediate reward is integer-valued, i.e., R ⊂ Z. We show that the probabilistic goal MDP can be solved in time polynomial in |S|, |A| and K.

Theorem 3. Suppose that R ⊂ Z. In computational time polynomial in T, |S|, |A| and K, one can solve D(Π^{h,u}), i.e., determine the correctness of the following claim:

∃π ∈ Π^{h,u}: Pr(X_π ≥ V) ≥ α.

Similarly, in computational time polynomial in T, |S|, |A| and K, one can solve the probabilistic goal MDP, i.e., find π ∈ Π^{h,u} that maximizes Pr(X_π ≥ V). Moreover, there exists an optimal solution to the probabilistic goal MDP that belongs to Π^{t,s,x}.

Proof. It suffices to show that an optimal solution to the probabilistic goal MDP

π* = arg max_{π∈Π^h} Pr(X_π ≥ V),

together with the optimal value Pr(X_{π*} ≥ V), can be obtained by solving an MDP with 2TK|S| states.

We construct a new MDP as follows: each state is a pair (s, C), where s ∈ S and C is an integer in [−TK, TK], and the initial state is (s_0, 0). The action set of (s, C) is A_s, for all C. At time t, the transition between states is defined as follows:

Pr((s', C') | a, (s, C)) = p_t(s'|a, s) × g_t(C' − C|s, a).

That is, a transition to (s', C') happens if, in the original MDP, a transition to state s' happens and the immediate reward is C' − C. Notice that since the immediate reward of the original MDP is bounded in [−K, K] and there are only T stages, the accumulated reward of the original MDP is bounded in [−TK, TK]. The immediate reward of the new MDP is zero except in the last stage, in which a reward +1 is incurred for states (s, C) with C ≥ V.

It is easy to see that the solution to the probabilistic goal MDP is equivalent to a policy that maximizes the expected reward of the new MDP. Since there exists a deterministic, Markovian policy for the latter, the probabilistic goal MDP has an optimal solution in Π^{t,s,x}.
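For concreteness, here is a compact sketch of the state-augmentation construction in the proof of Theorem 3, run as a backward induction over augmented states (s, C) with terminal payoff 1 when C ≥ V. This is our illustration, not the paper's code; the dictionary-based data layout and the function name are assumptions.

```python
def solve_probabilistic_goal(T, states, actions, trans, reward_dist, s0, V, K):
    """Backward induction on the augmented MDP of Theorem 3 (integer rewards).
    trans[(t, s, a)]       -> list of (next_state, prob)      # p_t(.|s, a)
    reward_dist[(t, s, a)] -> list of (integer reward, prob)  # g_t(.|s, a)
    Returns the optimal Pr(X >= V) and a pseudo-Markovian policy (t, s, C) -> action."""
    C_range = range(-T * K, T * K + 1)
    # Terminal value: +1 exactly when the accumulated reward meets the target.
    value = {(s, C): float(C >= V) for s in states for C in C_range}
    policy = {}
    for t in reversed(range(T)):
        new_value = {}
        for s in states:
            for C in C_range:
                acts = actions.get(s)
                if not acts:                        # absorbing/terminal state
                    new_value[(s, C)] = value[(s, C)]
                    continue
                best_a, best_v = None, -1.0
                for a in acts:
                    # Next state and reward drawn independently given (t, s, a),
                    # matching Pr((s', C')|a, (s, C)) = p_t(s'|a, s) * g_t(C'-C|s, a).
                    # Clipping only affects pairs unreachable from (s0, 0).
                    v = sum(pr * pg * value[(s2, min(max(C + r, -T * K), T * K))]
                            for (s2, pr) in trans[(t, s, a)]
                            for (r, pg) in reward_dist[(t, s, a)])
                    if v > best_v:
                        best_a, best_v = a, v
                policy[(t, s, C)] = best_a
                new_value[(s, C)] = best_v
        value = new_value
    return value[(s0, 0)], policy
```

The running time is polynomial in T, |S|, |A| and K (the augmented state space has about 2TK|S| states), in line with Theorem 3.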

Theorem 3 leads to the following algorithm for solving probabilistic goal MDPs. Note that, not surprisingly, the proposed algorithm parallels the standard (also pseudo-polynomial) algorithm for solving KSP.

• Input: MDP with R ⊂ Z, V ∈ R.
• Output: π ∈ Π^{t,s,x} that maximizes Pr(X_π ≥ V).
• Algorithm:
  1. Construct the MDP with augmented state space.
  2. Find a Markovian policy π̂* that maximizes the expected reward of the new MDP.
  3. Construct the policy of the original MDP as follows: the action taken at stage t, state s, with accumulated reward C equals π̂*_t(s, C). Output the policy.

5.2 General Reward

In this subsection we relax the assumption that the reward is integer valued. We discretize the real-valued reward parameters to obtain a new MDP that approximates the original one. The new MDP is equivalent to an integer-valued MDP, and hence can be solved in pseudo-polynomial time thanks to the results in Section 5.1. To show that such an approximation is legitimate, we bound the performance gap, which diminishes as the discretization becomes finer.

Theorem 4. There exists an algorithm, whose running time is polynomial in T, |S|, |A|, K and 1/ε, that finds a policy π̂ such that

Pr(X_π ≥ V + ε) ≤ Pr(X_π̂ ≥ V), ∀π ∈ Π^{h,u}.

Proof. Given an MDP <T, S, A, R, p, g>, with K = max_{r∈R} |r|, and a grid size δ > 0, we can construct a new MDP <T, S, A, R̂, p, ĝ>, where R̂ = {iδ | i ∈ Z; −K/δ ≤ i ≤ K/δ}, and

ĝ_t(r̂|s, a) = Σ_{r∈R: r̂ ≤ r < r̂+δ} g_t(r|s, a).

Since each reward is rounded down by less than δ, the cumulative reward of any trajectory decreases by less than Tδ in the new MDP; solving the new (integer-equivalent) MDP with the algorithm of Section 5.1 and choosing δ = ε/T therefore yields a policy π̂ satisfying the claimed guarantee.
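As a concrete illustration of the grid construction above, the helper below merges the probability mass of all rewards falling in the same δ-cell. It is our sketch under the rounding-down reading of ĝ; the function name and data layout are assumptions, not the paper's.

```python
import math

def discretize_rewards(reward_dist, delta):
    """Round every reward down to the grid {i*delta : i integer} and merge
    the probability mass of rewards sharing a grid cell, as in g-hat above."""
    hat = {}
    for key, pairs in reward_dist.items():        # key = (t, s, a)
        cell = {}
        for r, p in pairs:
            r_hat = math.floor(r / delta) * delta
            cell[r_hat] = cell.get(r_hat, 0.0) + p
        hat[key] = sorted(cell.items())
    return hat
```

After rescaling by 1/δ the rewards are integers, so the Section 5.1 procedure applies directly to the discretized MDP.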

6 Chance Constrained MDP

Instead of maximizing the probability of achieving the target, one may treat this probability as a constraint: the decision maker may be interested in maximizing the expected reward while in the meantime ensuring that the probability of achieving the target V is larger than a threshold. This leads to the Chance Constrained MDP (CC-MDP) formulation.

Problem 3 (Chance Constrained MDP). Given V ∈ R and α ∈ (0, 1), find a policy π ∈ Π^{h,u} that solves

Maximize: E(X_π)
Subject to: Pr(X_π ≥ V) ≥ α.

Notice that checking the feasibility of CC-MDP is equivalent to the (NP-hard) problem D(Π^{h,u}), which implies the NP-hardness of CC-MDP. Nevertheless, similarly to the probabilistic goal MDP, there exists a pseudo-polynomial algorithm that solves CC-MDP. However, in contrast to the probabilistic goal MDP, the optimal policy may be randomized. For succinctness we focus in this section on the integer-reward case. Note that the general reward case can be approximated to arbitrary precision using discretization.

Theorem 5. Given an MDP such that R ⊂ Z, V ∈ R, and α ∈ [0, 1], the Chance Constrained MDP can be solved in time polynomial in T, |S|, |A|, K. Furthermore, there exists an optimal solution π belonging to Π^{t,s,x,u}.

Proof. Similar to the probabilistic goal MDP case, we construct a new MDP with augmented state space (s, C). The transition probability is defined in the same way as in the probabilistic goal MDP.

7 Simulation

We simulated a machine replacement problem, similar in spirit to [Delage and Mannor, 2010]. There are four possible states for a machine; the larger the state number, the worse the machine is. One can either replace the machine, which incurs a random cost and moves it back to state s_0, or "do nothing," with no cost, in which case the next state is one state larger; see Figure 3. Fix T = 20. We compute the policy that minimizes the expected cost, as in the standard MDP, and the policies that maximize the probability that the cost is no larger than 5, 6, 7, 8, 9, respectively. For each policy, 1000 simulations are performed and the relative error is under 1%. We plot the performance of each policy in Figure 4: for a given x, the y-value is the frequency with which the cumulative cost of the policy is smaller than or equal to x. The simulation result clearly shows that each policy maximizes the chance of reaching its respective target.

[Figure 3: Simulation: machine replacement dynamics]

[Figure 4: Simulation result. Achieving probability versus target cost for the standard-MDP policy and the policies with targets 5 through 9.]

8 Conclusion and Discussions

In this paper we proposed and investigated the probabilistic goal MDP model, where the decision maker maximizes the probability that the cumulative reward of a policy is above a fixed target. We showed that while this formulation is NP-hard, it can be solved in pseudo-polynomial time using state augmentation. We further discussed the chance constrained MDP formulation and showed it to be solvable in pseudo-polynomial time.

The classical objective for MDPs considers only the expected cumulative reward, and hence may not capture the risk preference of the decision maker. The main thrust of this work is to address this shortcoming with a decision criterion that is both intuitively appealing and justified from a decision theory perspective, without sacrificing too much of the computational efficiency of the classical approach.

From an algorithmic perspective, the proposed algorithm for solving probabilistic MDPs may appear to be "easy" and not scale well, partly due to the fact that the problem is NP-hard. However, we believe that by embedding the probabilistic MDP into a large-scale MDP, it opens the door to using more scalable methods, in the spirit of approximate dynamic programming, that have been developed to handle large-scale MDPs, to approximately solve the probabilistic MDP. This is an interesting and important issue for future research.

References

[Altman, 1999] E. Altman. Constrained Markov Decision Processes. Chapman and Hall, 1999.
[Bertsekas, 1995] D. P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 1995.
[Castagnoli and LiCalzi, 1996] E. Castagnoli and M. LiCalzi. Expected utility without utility. Theory and Decision, 41:281–301, 1996.
[Chung and Sobel, 1987] K. Chung and M. Sobel. Discounted MDP's: distribution functions and exponential utility maximization. SIAM Journal on Control and Optimization, 25(1):49–62, 1987.
[Delage and Mannor, 2010] E. Delage and S. Mannor. Percentile optimization for Markov decision processes with parameter uncertainty. Operations Research, 58(1):203–213, 2010.
[Filar et al., 1995] J. A. Filar, D. Krass, and K. W. Ross. Percentile performance criteria for limiting average Markov decision processes. IEEE Transactions on Automatic Control, 40(1):2–10, 1995.
[Iyengar, 2005] G. N. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30(2):257–280, 2005.
[Lanzillotti, 1958] R. F. Lanzillotti. Pricing objectives in large companies. American Economic Review, 48:921–940, 1958.
[Le Tallec, 2007] Y. Le Tallec. Robust, Risk-Sensitive, and Data-driven Control of Markov Decision Processes. PhD thesis, MIT, 2007.
[Liu and Koenig, 2005] Y. Liu and S. Koenig. Risk-sensitive planning with one-switch utility functions: Value iteration. In Proceedings of the 20th AAAI Conference on Artificial Intelligence, pages 993–999, 2005.
[Mao, 1970] J. Mao. Survey of capital budgeting: Theory and practice. Journal of Finance, 25:349–360, 1970.
[Miller and Wagner, 1965] L. B. Miller and H. Wagner. Chance constrained programming with joint constraints. Operations Research, 13:930–945, 1965.
[Nilim and El Ghaoui, 2005] A. Nilim and L. El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 53(5):780–798, September 2005.
[Payne et al., 1980] J. W. Payne, D. J. Laughhunn, and R. Crum. Translation of gambles and aspiration level effects in risky choice behaviour. Management Science, 26:1039–1060, 1980.
[Payne et al., 1981] J. W. Payne, D. J. Laughhunn, and R. Crum. Further tests of aspiration level effects in risky choice behaviour. Management Science, 26:953–958, 1981.
[Prékopa, 1970] A. Prékopa. On probabilistic constrained programming. In Proceedings of the Princeton Symposium on Mathematical Programming, pages 113–138, 1970.
[Puterman, 1994] M. L. Puterman. Markov Decision Processes. John Wiley & Sons, New York, 1994.
[Riedel, 2004] F. Riedel. Dynamic coherent risk measures. Stochastic Processes and their Applications, 112:185–200, 2004.
[Simon, 1959] H. A. Simon. Theories of decision-making in economics and behavioral science. American Economic Review, 49(3):253–283, 1959.
