Reinforcement Learning for Constrained Markov Decision Processes

Ather Gattami (AI Sweden)    Qinbo Bai (Purdue University)    Vaneet Aggarwal (Purdue University)

Abstract

In this paper, we consider the problem of optimization and learning for constrained and multi-objective Markov decision processes, for both discounted rewards and expected average rewards. We formulate the problems as zero-sum games where one player (the agent) solves a Markov decision problem and its opponent solves a bandit optimization problem, which we here call Markov-Bandit games. We extend Q-learning to solve Markov-Bandit games and show that our new Q-learning algorithms converge to the optimal solutions of the zero-sum Markov-Bandit games, and hence converge to the optimal solutions of the constrained and multi-objective Markov decision problems. We provide numerical examples where we calculate the optimal policies and show by simulations that the algorithm converges to the calculated optimal policies. To the best of our knowledge, this is the first time Q-learning algorithms guarantee convergence to optimal stationary policies for the multi-objective reinforcement learning problem with discounted and expected average rewards, respectively.

1 Introduction

1.1 Motivation

Reinforcement learning has made great advances in several applications, ranging from online learning and recommender engines, natural language understanding and generation, to mastering games such as Go (Silver et al., 2017) and Chess. The idea is to learn from extensive experience how to take actions that maximize a given reward by interacting with the surrounding environment. The interaction teaches the agent how to maximize its reward without knowing the underlying dynamics of the process. A classical example is swinging up a pendulum to an upright position. By making several attempts to swing up a pendulum and balance it, one might be able to learn the necessary forces that need to be applied in order to balance the pendulum without knowing the physical model behind it, which is the general approach of classical model-based control theory (Åström and Wittenmark, 1994).

Informally, the problem of constrained reinforcement learning for Markov decision processes is described as follows. Given a Markov decision process with state s_k at time step k, reward function r, constraint functions r^j, and a discount factor 0 < γ < 1, the multi-objective reinforcement learning problem is for the optimizing agent to find a stationary policy π(s_k) that, in the discounted reward setting, solves

  max_π E( Σ_{k=0}^∞ γ^k r(s_k, π(s_k)) )      (1)
  s.t.  E( Σ_{k=0}^∞ γ^k r^j(s_k, π(s_k)) ) ≥ 0      (2)

or, in the expected average reward setting, solves

  max_π lim_{T→∞} (1/T) E( Σ_{k=0}^{T−1} r(s_k, π(s_k)) )      (3)
  s.t.  lim_{T→∞} (1/T) E( Σ_{k=0}^{T−1} r^j(s_k, π(s_k)) ) ≥ 0      (4)

for j = 1, ..., J (a more formal definition of the problem is introduced in the next section, and some examples of this setup are given in the Appendix).

Surprisingly, although constrained MDP problems are fundamental and have been studied extensively in the literature (see (Altman, 1999) and the references therein), the reinforcement learning counterpart seems to still be open. When an agent takes actions based solely on the observed states and constraint outputs (without any knowledge about the dynamics and/or constraint functions), a general solution seems to be lacking, to the best of the authors' knowledge, for both the discounted and expected average reward cases.
Note that maximizing (1) is equivalent to maximizing λ subject to the constraint

  E( Σ_{k=0}^∞ γ^k r(s_k, π(s_k)) ) ≥ λ.

Thus, one could always replace r with r − λ(1−γ) and obtain a constraint of the form (2). Similarly, for the average reward case, one may replace r with r − λ to obtain a constraint of the form (4). Hence, we can run the bisection method with respect to λ, and the problem in the discounted setting is transformed into finding a policy π such that

  E( Σ_{k=0}^∞ γ^k r^j(s_k, π(s_k)) ) ≥ 0,      (5)

where j = 0, 1, ..., J and r^0 = r − λ(1−γ). Or, in the average setting, find a policy π such that

  lim_{T→∞} (1/T) E( Σ_{k=0}^{T−1} r^j(s_k, π(s_k)) ) ≥ 0,      (6)

where r^0 = r − λ. In this paper, we call problems (5) and (6) multi-objective MDP problems; we propose algorithms for them based on Markov-Bandit games and prove their convergence.
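To make the reduction above concrete, the following minimal sketch (ours, not part of the paper's algorithm statements) runs the outer bisection over λ around a feasibility oracle; is_feasible is a hypothetical stand-in for the learners developed in Section 4, which decide whether the shifted problem (5) or (6) admits a feasible policy.

# Minimal sketch. Assumption: `is_feasible(lam)` runs one of the learners from
# Section 4 on the shifted rewards r^0 = r - lam*(1 - gamma) (discounted) or
# r^0 = r - lam (average) and reports whether the learned game value is >= 0.
def bisect_lambda(is_feasible, lam_lo, lam_hi, tol=1e-2):
    # Invariant: lam_lo is feasible, lam_hi is infeasible.
    while lam_hi - lam_lo > tol:
        lam = 0.5 * (lam_lo + lam_hi)
        if is_feasible(lam):
            lam_lo = lam      # the optimal objective value is at least lam
        else:
            lam_hi = lam      # the optimal objective value is below lam
    return lam_lo             # approximate optimal objective value

The experiments in Section 5, which probe a handful of λ values by hand (e.g., λ = 9.5, 9.55, ..., 9.7 in the discounted case), are essentially a manual version of this outer loop.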

1.2 Related Work

Constrained MDP problems are convex, and hence one can convert the constrained MDP problem into an unconstrained zero-sum game where the objective is the Lagrangian of the optimization problem (Altman, 1999). However, when the dynamics and rewards are not known, it is not apparent how to do this, as the Lagrangian itself becomes unknown to the optimizing agent. Previous work regarding constrained MDPs, when the dynamics of the stochastic process are not known, considers scalarization through weighted sums of the rewards; see (Roijers et al., 2013) and the references therein. Another approach is to consider Pareto optimality when multiple objectives are present (Zhou et al., 2020; Yang et al., 2019). Notice that there may be multiple Pareto optimal points, and in general these points will not satisfy all the constraints. Further, a solution of min-max may not in general be Pareto optimal. Thus, our problem formulation is very different from the above papers, which aim to achieve the Pareto front.

In (Geibel, 2006), the author considers a single constraint and allows for randomized policies. However, no proofs of convergence are provided for the proposed sub-optimal algorithms. Sub-optimal solutions with convergence guarantees were provided in (Chow et al., 2017) for the single-constraint problem, allowing for randomized policies. In (Borkar, 2005), a sub-optimal actor-critic algorithm is provided for a single constraint, and it is claimed that it can be generalized to an arbitrary number of constraints. Reinforcement-learning-based model-free solutions have been proposed for these problems without guarantees (Djonin and Krishnamurthy, 2007; Lizotte et al., 2010; Drugan and Nowe, 2013; Achiam et al., 2017; Abels et al., 2019; Raghu et al., 2019).

Recently, (Tessler et al., 2018) proposed a multi-time-scale policy gradient algorithm with a Lagrange multiplier for discounted constrained reinforcement learning and proved that the policy converges to a feasible policy. (Efroni et al., 2020) found a feasible policy by using a Lagrange multiplier and a zero-sum game for reinforcement learning with convex constraints and discounted reward. Yu et al. (2019) and (Paternain et al., 2019) showed that constrained reinforcement learning has zero duality gap, which provides a theoretical guarantee for policy gradient algorithms in the dual domain. In contrast, our paper does not use policy gradient based algorithms. (Zheng and Ratliff, 2020) proposed the C-UCRL algorithm, which achieves sub-linear regret O(T^{3/4} sqrt(log(T)/δ)) with probability 1 − δ while satisfying the constraints. However, this algorithm needs knowledge of the model dynamics. Brantley et al. (2020) proposed a model-based algorithm for tabular episodic reinforcement learning with concave rewards and convex constraints. Singh et al. (2020) modified the well-known UCRL2 algorithm and proposed the model-based UCRL-CMDP algorithm to solve the CMDP problem with a sub-linear regret result. Efroni et al. (2020) proposed four algorithms for the constrained reinforcement learning problem in the primal, dual, or primal-dual domain and showed sub-linear bounds for regret and constraint violations. However, all these algorithms are model based, while our algorithm is model-free and scalable to continuous spaces. Ding et al. (2020b) employed the natural policy gradient method to solve the discounted infinite-horizon CMDP problem; it achieves an O(1/ε^2) convergence rate with respect to ε-optimality and ε-constraint violation. Although that algorithm is model-free, it still needs a simulator to obtain samples. Ding et al. (2020a) proposed a model-free primal-dual algorithm without a simulator to solve the CMDP and gives an O(sqrt(T)) bound for both reward and constraint, which should be the state-of-the-art result for this problem. Shah and Borkar (2018) proposed a three-time-scale Q-learning based algorithm to solve a constraint satisfaction problem in the expected average setting. In contrast, this paper provides the first single-time-scale Q-learning algorithm for both discounted and expected average rewards.
1.3 Contributions

We consider the problem of optimization and learning for constrained Markov decision processes, for both discounted rewards and expected average rewards. We formulate the problems as zero-sum games where one player (the agent) solves a Markov decision problem and its opponent solves a bandit optimization problem; we call these Markov-Bandit games, and they are interesting in their own right. The opponent acts on a finite set (and not on a continuous space). This transformation is essential in order to achieve a tractable optimal algorithm. The reason is that using Lagrange duality without model knowledge requires infinite-dimensional optimization in the learning algorithm, since the Lagrange multipliers are continuous (compare with the intractability of a partially observable MDP, where the beliefs are continuous variables). We extend Q-learning to solve Markov-Bandit games and show that our new Q-learning algorithms converge to the optimal solutions of the zero-sum Markov-Bandit games, and hence converge to the optimal solutions of the multi-objective Markov decision problems. The proof techniques are different for the discounted and average reward problems, respectively, where the latter becomes much more technically involved. We provide numerical examples where we calculate the optimal policies and show by simulations that the algorithm converges to the calculated optimal policies. To the best of our knowledge, this is the first time Q-learning algorithms guarantee convergence to optimal stationary policies for the constrained and multi-objective MDP problem with discounted and expected average rewards, respectively.

1.4 Outline

In the problem formulation (Section 2), we present a precise mathematical definition of the multi-objective problem for MDPs. Then, we give a brief introduction and some useful results for reinforcement learning with applications to zero-sum Markov-Bandit games (Section 3). Section 4 first shows the connection between the Markov-Bandit game and the multi-objective problem, and then presents the solution to the multi-objective reinforcement learning problem. We demonstrate the proposed algorithm by an example in Section 5, and we conclude the paper in Section 6. Most proofs and some numerical results are relegated to the Appendix.
2 Problem Formulation and Assumptions

Consider a Markov Decision Process (MDP) defined by the tuple (S, A, P), where S = {S_1, S_2, ..., S_n} is a finite set of states, A = {A_1, A_2, ..., A_m} is a finite set of actions taken by the agent, and P : S × A × S → [0, 1] is a transition function mapping each triple (s, a, s_+) to a probability given by

  P(s, a, s_+) = Pr(s_+ | s, a),

and hence

  Σ_{s_+ ∈ S} P(s, a, s_+) = 1,  for all (s, a) ∈ S × A,

where s_+ is the next state for state s. Let Π be the set of policies that map a state s ∈ S to a probability distribution over the actions, with a probability assigned to each action a ∈ A; that is, π(s) = a with probability Pr(a | s). The agent's objective in multi-objective reinforcement learning is to find a policy that satisfies a set of constraints of the form (5) or (6), for s_0 = s ∈ S, where r^j : S × A → R are bounded functions, for j = 0, 1, ..., J, possibly unknown to the agent. The parameter γ ∈ (0, 1) is a discount factor which models how much weight to put on future rewards. The expectation is taken with respect to the randomness introduced by the policy π and the transition mapping P.

Definition 1 (Unichain MDP). An MDP is called unichain if, for each policy π, the Markov chain induced by π is ergodic, i.e., each state is reachable from any other state.

Unichain MDPs are usually considered in reinforcement learning problems with discounted rewards, since they guarantee that we learn the process dynamics from the initial states. Thus, for the discounted reward case we will make the following assumption.

Assumption 1 (Unichain MDP). The MDP (S, A, P) is assumed to be unichain.

For the case of expected average reward, we will make a simpler assumption regarding the existence of a recurrent state, a standard assumption in Markov decision process problems with expected average rewards, to ensure that the expected reward is independent of the initial state.

Assumption 2. There exists a state s* ∈ S which is recurrent for every stationary policy π played by the agent.

Assumption 2 implies that E(r^j(s_k, a_k)) is independent of the initial state at stationarity. Hence, at stationarity the constraint (6) is equivalent to the inequality E(r^j(s_k, a_k)) ≥ 0 for all k. We will use this constraint in the sequel, which turns out to be very useful in the game-theoretic approach to solving the problem.

Assumption 3. The absolute values of the functions {r^j}_{j=0}^{J} are bounded by some constant c known to the agent.

A bounded reward function with a bound known a priori is a typical assumption in RL; see Jin et al. (2018) and Ni et al. (2019).
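For the tabular setting above, the following short sketch (our illustration, not from the paper) fixes the array conventions used in the later code sketches: P is an n × m × n array of transition probabilities, and each r^j is an n × m reward table bounded by c. The sizes below are arbitrary example values.

import numpy as np

n, m = 4, 3                      # |S| = n states, |A| = m actions (example sizes)
rng = np.random.default_rng(0)

# Random transition kernel P[s, a, s_plus] = Pr(s_plus | s, a).
P = rng.random((n, m, n))
P /= P.sum(axis=2, keepdims=True)            # each row sums to 1 over the next state
assert np.allclose(P.sum(axis=2), 1.0)

# Bounded reward/constraint tables r^j(s, a), j = 0, ..., J, with |r^j| <= c.
J, c = 2, 1.0
r = rng.uniform(-c, c, size=(J + 1, n, m))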
3 Reinforcement Learning for Zero-Sum Markov-Bandit Games

A zero-sum Markov-Bandit game is defined by the tuple (S, A, O, P, R), where S, A, and P are defined as in Section 2, and O = {o_1, o_2, ..., o_q} is a finite set of actions made by the agent's opponent. Let Π be the set of policies π(s) that map a state s ∈ S to a probability distribution over the actions, with a probability assigned to each action a ∈ A; that is, π(s) = a with probability Pr(a | s).

For the zero-sum Markov-Bandit game, we define the reward R : S × A × O → R, which is assumed to be bounded. The agent's objective is to maximize the minimum (average or discounted) reward obtained due to the opponent's malicious action. The difference between a zero-sum Markov game and a Markov-Bandit game is that the opponent's action does not affect the state and the opponent chooses a constant action o_k = o ∈ O for all time steps k. This will be made more precise in the following sections.

3.1 Discounted Rewards

Consider a zero-sum Markov-Bandit game where the agent is maximizing the total discounted reward given by

  V(s) = min_o E( Σ_{k=0}^∞ γ^k R(s_k, a_k, o) )      (7)

for the initial state s_0 ∈ S. Let Q(s, a, o) be the expected reward of the agent taking action a_0 = a ∈ A from state s_0 = s, and continuing with a policy π thereafter, when the opponent takes a fixed action o. Note that this is different from zero-sum Markov games with discounted rewards (Littman, 1994), where the opponent's actions may vary over time, that is, o_k is not a constant. Then, for any stationary policy π, we have that

  Q(s, a, o) = R(s, a, o) + E( Σ_{k=1}^∞ γ^k R(s_k, π(s_k), o) )
             = R(s, a, o) + γ · E( Q(s_+, π(s_+), o) ).      (8)

Equation (8) is known as the Bellman equation. The solution to (8), with respect to Q and the initial state s_0, that corresponds to the optimal policy π* is denoted Q*. If we have the function Q*, then we can obtain the optimal policy π* according to the equations

  Q*(s, a, o) = R(s, a, o) + γ · E( Q*(s_+, π*(s_+), o) )
  (π*(s_0), o*) = arg max_{π∈Π} min_o E( Q*(s_0, π(s_0), o) )
  π*(s) = arg max_{π∈Π} E( Q*(s, π(s), o*) ),      (9)

which maximizes the total discounted reward

  min_o E( Σ_{k=0}^∞ γ^k R(s_k, π*(s), o) ) = min_o E( Q*(s, π*(s), o) )

for s = s_0. Note that the optimal policy may not be deterministic, as opposed to reinforcement learning for unconstrained Markov decision processes, where there is always an optimal policy that is deterministic. Also, note that we will get different Q tables for different initial states here. Therefore, Q* is in fact dependent on and varies with respect to s_0. A more proper notation would be Q*_{s_0}, but we omit the indexing with respect to s_0 for ease of notation. It is relevant to introduce the operator

  (TQ)(s, a, o) = R(s, a, o) + γ · E( Q(s_+, π*(s_+), o) ),

in which π* is the policy appearing in Equation (9). It is not hard to check that the operator T is not a contraction, so the standard Q-learning that is commonly used for reinforcement learning in Markov decision processes with discounted rewards cannot be applied here.

In the case where we do not know the process P and the reward function R, we will not be able to take advantage of the Bellman equation directly. The following results show that we will be able to design an algorithm that always converges to Q*.

Theorem 1. Consider a zero-sum Markov-Bandit game given by the tuple (S, A, O, P, R), where (S, A, P) is unichain, and suppose that R is bounded by some constant that is known a priori. Let Q = Q* and π* be solutions to

  Q(s, a, o) = R(s, a, o) + γ · E( Q(s_+, π*(s_+), o) )
  (π*(s), o*) = arg max_{π∈Π} min_o E( Q(s, π(s), o) )
  π*(s) = arg max_{π∈Π} E( Q(s, π(s), o*) ).      (10)

Let α_k(s, a, o) = α_k · 1_{(s,a,o)}(s_k, a_k, o_k) satisfy

  0 ≤ α_k(s, a, o) < 1,   Σ_{k=0}^∞ α_k(s, a, o) = ∞,   Σ_{k=0}^∞ α_k^2(s, a, o) < ∞,   for all (s, a, o) ∈ S × A × O.      (11)

Then, the update rule

  (π_k, o_k) = arg max_{π∈Π} min_o E( Q_k(s_+, π(s_+), o) )
  Q_{k+1}(s, a, o_k) = (1 − α_k(s, a, o_k)) Q_k(s, a, o_k) + α_k(s, a, o_k) ( R(s, a, o_k) + γ E( Q_k(s_+, π_k(s_+), o_k) ) )      (12)

converges to Q* with probability 1. Furthermore, the optimal policy π* ∈ Π given by (9) maximizes (7) with respect to the initial state s = s_0. That is,

  (π*(s_0), o*) = arg max_{π∈Π} min_o E( Q*(s_0, π(s_0), o) )
  π*(s) = arg max_{π∈Π} E( Q*(s, π(s), o*) ).
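The following minimal sketch (ours, with hypothetical helper names) shows one way the update (11)-(12) can be realized for a tabular game. It assumes a routine maxmin_policy(Q, s) that returns a (possibly randomized) policy at state s together with the minimizing opponent action (for instance via the linear program discussed in Section 4), and an environment step function env_step(s, a) returning the next state; R is passed as a table for brevity, whereas in the model-free setting the entries R(s_k, a_k, o) are observed rewards.

import numpy as np

def q_learning_markov_bandit(env_step, R, n, m, q, gamma,
                             maxmin_policy, n_iters=100_000, s0=0, seed=0):
    """Sketch of the update (12); `maxmin_policy(Q, s)` is assumed to return
    (pi, o) with pi an m-vector (randomized policy at s) and o the minimizing
    opponent action."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n, m, q))
    visits = np.zeros((n, m, q), dtype=int)
    s = s0
    a = rng.integers(m)
    for _ in range(n_iters):
        s_next = env_step(s, a)                      # observe transition
        pi_next, o = maxmin_policy(Q, s_next)        # first line of (12)
        visits[s, a, o] += 1
        alpha = 1.0 / visits[s, a, o]                # a step size satisfying (11)
        target = R[s, a, o] + gamma * pi_next @ Q[s_next, :, o]
        Q[s, a, o] = (1 - alpha) * Q[s, a, o] + alpha * target   # second line of (12)
        s, a = s_next, rng.choice(m, p=pi_next)      # sample next action from pi_next
    return Q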

3.2 Expected Average Rewards

The agent's objective is to maximize the minimal average reward obtained due to the opponent's malicious actions, that is, to maximize the total reward given by

  min_{o∈O} lim_{T→∞} (1/T) E( Σ_{k=0}^{T−1} R(s_k, a_k, o) )      (13)

for some initial state s_0 ∈ S. Note that this problem is different from the zero-sum game considered in (Mannor, 2004a): here the opponent has to pick a fixed value for its action, o_k = o, as opposed to the work in (Mannor, 2004a), where o_k is allowed to vary over time. Thus, from the opponent's point of view, the opponent is performing bandit optimization.

Under Assumption 2 and for a given stationary policy π, the value of

  V(o) = lim_{T→∞} (1/T) E( Σ_{k=0}^{T−1} R(s_k, π(s_k), o) )      (14)

is independent of the initial state, and there exist a constant v(o) ∈ R and a vector H = (H(S_1, o), ..., H(S_n, o)) ∈ R^n such that, for each s ∈ S, we have that

  H(s, o) + v(o) = E( R(s, π(s), o) + Σ_{s_+ ∈ S} P(s_+ | s, π(s)) H(s_+, o) ).      (15)

Furthermore, the value of (14) is V(o) = v(o); see (Bertsekas, 2005) for a proof.

Introduce

  Q(s, a, o) − v(o) = R(s, a, o) + Σ_{s_+ ∈ S} P(s_+ | s, a) H(s_+, o)      (16)

and let Q*, v*, and H* be solutions to Equations (15)-(16) corresponding to the optimal policy π* that maximizes (13). Then we have that

  π*(s) = arg max_{π∈Π} min_{o∈O} E( Q*(s, π(s), o) )
  H*(s, o) = E( Q*(s, π*(s), o) )
  Q*(s, a, o) − v*(o) = R(s, a, o) + Σ_{s_+ ∈ S} P(s_+ | s, π*(s)) H*(s_+, o).      (17)

We will make some additional assumptions that will be used in the learning of Q* in the average reward case. We start off by introducing a sequence of learning rates {γ_k} and assume that this sequence satisfies the following assumption. Notice that this is a typical assumption in stochastic approximation, and it is common in the expected average reward RL setting; see for instance Assumption 2.3 in Abounadi et al. (2001a) and Assumption 2 in Mannor (2004b).

Assumption 4 (Learning rate). The sequence {γ_k} satisfies:

1. γ_{k+1} ≤ γ_k eventually;
2. Σ_k γ_k = ∞ and Σ_k γ_k^2 < ∞;
3. for every 0 < x ≤ 1, sup_k γ_{⌊xk⌋} / γ_k < ∞;
4. for every 0 < x ≤ 1, ( Σ_{j=0}^{⌊yk⌋} γ_j ) / ( Σ_{j=0}^{k} γ_j ) → 1 uniformly in y ∈ [x, 1].

For instance, the sequences γ_k = 1/k and γ_k = 1/(k log k) (for k > 1) satisfy Assumption 4.

Now define N(k, s, a, o) as the number of times that state s and actions a and o were played up to time k, that is,

  N(k, s, a, o) = Σ_{t=1}^{k} 1_{(s,a,o)}(s_t, a_t, o_t).

The following assumption is needed to guarantee that all combinations of the triple (s, a, o) are visited often, which can be satisfied by using the ε-greedy method in the algorithm.

Assumption 5 (Often updates). There exists a deterministic number d > 0 such that for every s ∈ S, a ∈ A, and o ∈ O, we have that

  lim inf_{k→∞} N(k, s, a, o) / k ≥ d

with probability 1.

Definition 2. We define the set F as the set of all functions f : R^{n×m×q} → R such that

1. f is Lipschitz;
2. for any c ∈ R, f(cQ) = c f(Q);
3. for any r ∈ R and Q' defined by Q'(s, a, o) = Q(s, a, o) + r for all (s, a, o), with Q ∈ R^{n×m×q}, we have f(Q') = f(Q) + r.

For instance, f(Q) = (1 / (|S| |A| |O|)) Σ_{s,a,o} Q(s, a, o) belongs to the set F.

The next result shows that we will be able to design an algorithm that always converges to Q*.

Theorem 2. Consider a zero-sum Markov-Bandit game given by the tuple (S, A, O, P, R), and suppose that R is bounded. Suppose that Assumptions 2, 4, and 5 hold. Let f ∈ F be given, where the set F is defined as in Definition 2. Then, the asynchronous update algorithm given by

  Q_{k+1}(s, a, o) = Q_k(s, a, o) + 1_{(s,a,o)}(s_k, a_k, o_k) · γ_{N(k,s,a,o)} · ( max_{π∈Π} min_{o_k∈O} ( R(s, a, o_k) + E( Q_k(s_{k+1}, π(s_{k+1}), o_k) ) ) − f(Q_k) − Q_k(s, a, o) )      (18)

converges to Q* in (17) with probability 1. Furthermore, the optimal policy π* ∈ Π given by (17) maximizes (13).
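A minimal sketch of how the asynchronous update (18) can be realized in code (our illustration; maxmin_target and env_step are hypothetical helpers): the offset f(Q_k) is taken to be the mean of the Q table, which belongs to the set F as noted above, and ε-greedy exploration is used to keep Assumption 5 plausible.

import numpy as np

def avg_reward_markov_bandit_q(env_step, R, n, m, q, maxmin_target,
                               n_iters=100_000, s0=0, eps=0.1, seed=0):
    """Sketch of update (18). `maxmin_target(Q, s_next, s, a)` is assumed to
    return the max over (randomized) pi of the min over o of
    R[s, a, o] + pi @ Q[s_next, :, o], a policy to sample from, and the
    minimizing opponent action."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n, m, q))
    N = np.zeros((n, m, q), dtype=int)
    s = s0
    a = rng.integers(m)
    for _ in range(n_iters):
        s_next = env_step(s, a)
        target, pi_next, o = maxmin_target(Q, s_next, s, a)
        N[s, a, o] += 1
        gamma_k = 1.0 / (N[s, a, o] + 1)         # learning rate, cf. Assumption 4
        f_Q = Q.mean()                           # an f in the set F of Definition 2
        Q[s, a, o] += gamma_k * (target - f_Q - Q[s, a, o])    # update (18)
        if rng.random() < eps:                   # epsilon-greedy exploration
            a = rng.integers(m)
        else:
            a = rng.choice(m, p=pi_next)
        s = s_next
    return Q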

4 Reinforcement Learning for Multi-Objective MDPs

4.1 Discounted Rewards

Consider the optimization problem of finding a stationary policy π subject to the initial state s_0 = s and constraints of the form (5), that is,

  find π ∈ Π
  s.t. E( Σ_{k=0}^∞ γ^k r^j(s_k, π(s_k)) ) ≥ 0      (19)

for j = 1, ..., J.

The next theorem states that the optimization problem (19) is equivalent to a zero-sum Markov-Bandit game, in the sense that an optimal strategy of the agent in the zero-sum game is also optimal for (19).

Theorem 3. Consider optimization problem (19), and suppose it is feasible and that Assumption 3 holds. Let π* be an optimal stationary policy in the zero-sum game

  v(s_0) = max_{π∈Π} min_{j∈[J]} E( Σ_{k=0}^∞ γ^k r^j(s_k, π(s_k)) ).      (20)

Then, π* is a feasible solution to (19) if and only if v(s_0) ≥ 0.

The interpretation of the game (20) is that the minimizer chooses the index j ∈ [J], where [J] = {1, 2, ..., J}. Now that we are equipped with Theorems 1 and 3, we are ready to state and prove our next result.

Theorem 4. Consider the constrained MDP problem (19), and suppose that it is feasible and that Assumptions 1 and 3 hold. Also, introduce O = [J], o = j, and

  R(s, a, o) = R(s, a, j) := r^j(s, a),  j = 1, ..., J.

Let Q_k be given by the recursion (12). Then, Q_k → Q* as k → ∞, where Q* is the solution to (10). Furthermore, the policy

  (π*(s_0), o*) = arg max_{π∈Π} min_{o∈O} E( Q*(s_0, π(s_0), o) )
  π*(s) = arg max_{π∈Π} E( Q*(s, π(s), o*) )      (21)

is an optimal solution to (19) for all s ∈ S.

Proof. According to Theorem 3, (19) is equivalent to the zero-sum game (20), which is equivalent to the zero-sum Markov-Bandit game given by (S, A, O, P, R) with the objective

  max_{π∈Π} min_{o∈O} E( Σ_{k=0}^∞ γ^k R(s_k, π(s_k), o) ).      (22)

Assumption 3 implies that |R(s, a, o)| ≤ 2c for all (s, a, o) ∈ S × A × O. Now let Q* be the solution to the maximin Bellman equation (10). According to Theorem 1, Q_k in the recursion given by (11)-(12) converges to Q* with probability 1. By definition, the optimal policy π* achieves the value of the zero-sum Markov-Bandit game in (21), and thus achieves the value of (22), and the proof is complete.
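The construction in Theorem 4 amounts to stacking the constraint functions into a single reward table indexed by the opponent action, as in the following small sketch (ours); when the bisection of Section 1.1 is used, the shifted objective r^0 = r − λ(1−γ) can simply be appended as one more opponent action.

import numpy as np

def build_game_reward(r_constraints, r_objective=None, lam=None, gamma=None):
    """r_constraints: array of shape (J, n, m) with r^j(s, a), j = 1..J.
    Returns R with R[s, a, o] = r^{o+1}(s, a); optionally appends the
    bisection-shifted objective as an extra opponent action."""
    tables = [rj for rj in r_constraints]
    if r_objective is not None:
        tables.append(r_objective - lam * (1.0 - gamma))   # r^0 in (5)
    R = np.stack(tables, axis=-1)        # shape (n, m, q) with q = J or J + 1
    return R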

Algorithm 1 Zero-Sum Markov-Bandit Algorithm for CMDP with Discounted Reward
1: Initialize Q(s, a, o) ← 0 for all (s, a, o) ∈ S × A × O. Observe s_0 and select a_0 randomly. Select α_k according to Eq. (11)
2: for iteration k = 0, ..., K do
3:   Take action a_k and observe the next state s_{k+1}
4:   π_{k+1}, o_k = arg max_{π_{k+1}} min_{o∈O} Q(s_{k+1}, π_{k+1}(s_{k+1}), o)
5:   Q(s_k, a_k, o_k) ← (1 − α_k) Q(s_k, a_k, o_k) + α_k [ r(s_k, a_k, o_k) + γ E(Q(s_{k+1}, π_{k+1}(s_{k+1}), o_k)) ]
6:   Sample a_{k+1} from the distribution π_{k+1}(· | s_{k+1})
7: end for

Finally, the algorithm for the constrained Markov decision process with discounted reward is shown in Alg. 1. In line 1, we initialize the Q-table, observe s_0, and select a_0 randomly. In line 3, we take the current action a_k and observe the next state s_{k+1}, so that we can compute the max-min operator in line 4 based on the first line of Eq. (12). Line 5 updates the Q-table according to the second line of Eq. (12). Line 6 samples the next action from the policy obtained in line 4. Notice that the max-min can be converted to a maximization problem with linear inequalities and can be solved efficiently by linear programming, because the number of inequalities is limited by the action space of the opponent. Even for large-scale problems, efficient algorithms exist for the max-min problem (Parpas and Rustem, 2009).
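As noted above, the inner max-min over randomized policies at a state can be written as a small linear program. The following sketch (our illustration, using scipy.optimize.linprog) maximizes t subject to Σ_a π(a) Q(s, a, o) ≥ t for every opponent action o; it is one concrete instantiation of the hypothetical maxmin_policy helper assumed in the sketch of Section 3.1.

import numpy as np
from scipy.optimize import linprog

def maxmin_policy(Q, s):
    """Solve max over randomized pi of min over o of sum_a pi(a) Q[s, a, o].
    Variables: [pi(1..m), t]; maximize t. Returns (pi, minimizing o)."""
    m, q = Q.shape[1], Q.shape[2]
    c = np.zeros(m + 1)
    c[-1] = -1.0                                    # maximize t
    # Constraint rows: t - sum_a pi(a) Q[s, a, o] <= 0 for every o.
    A_ub = np.hstack([-Q[s].T, np.ones((q, 1))])    # Q[s].T has shape (q, m)
    b_ub = np.zeros(q)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])   # pi sums to one
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    pi = res.x[:m]                                  # res.x[-1] is the attained max-min value
    values = pi @ Q[s]                              # value of pi against each opponent action
    return pi, int(np.argmin(values))

The number of inequality constraints equals the size of the opponent's action set, which in the CMDP instantiation is just the number of constraints J, so the LP stays small even when the state space is large.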

4.2 Expected Average Rewards

Consider the optimization problem of finding a stationary policy π subject to the constraints (6), that is,

  find π ∈ Π
  s.t. lim_{T→∞} (1/T) E( Σ_{k=0}^{T−1} r^j(s_k, π(s_k)) ) ≥ 0      (23)

for j = 1, ..., J.

The next theorem states that the optimization problem (23) is equivalent to a zero-sum Markov-Bandit game, in the sense that an optimal strategy of the agent in the zero-sum game is also optimal for (23).

Theorem 5. Consider optimization problem (23), and suppose that Assumptions 2 and 3 hold. Let π* be an optimal policy in the zero-sum game

  v = max_{π∈Π} min_{j∈[J]} lim_{T→∞} (1/T) E( Σ_{k=0}^{T−1} r^j(s_k, π(s_k)) ).      (24)

Then, π* is a solution to (23) if and only if v ≥ 0.

Now that we are equipped with Theorems 2 and 5, we are ready to state the second main result (proof in the Appendix).

Theorem 6. Consider the constrained Markov decision process problem (23), and suppose that Assumptions 2 and 3 hold. Introduce O = [J], o = j, and

  R(s, a, o) = R(s, a, j) := r^j(s, a),  j = 1, ..., J.

Let Q_k be given by the recursion (18), and suppose that Assumptions 4 and 5 hold. Then, Q_k → Q* as k → ∞, where Q* is the solution to (17). Furthermore, the policy

  π*(s) = arg max_{π∈Π} min_{o∈O} E( Q*(s, π(s), o) )      (25)

is a solution to (23) for all s ∈ S.

Algorithm 2 Zero-Sum Markov-Bandit Algorithm for CMDP with Average Reward
1: Initialize Q(s, a, o) ← 0 and N(s, a, o) ← 0 for all (s, a, o) ∈ S × A × O. Observe s_0 and initialize a_0 randomly
2: for iteration k = 0, ..., K do
3:   Take action a_k and observe the next state s_{k+1}
4:   π_{k+1}, o_k = arg max_{π_{k+1}} min_{o∈O} [ R(s_k, a_k, o) + Q(s_{k+1}, π_{k+1}(s_{k+1}), o) ]
5:   t = N(s_k, a_k, o_k); N(s_k, a_k, o_k) ← N(s_k, a_k, o_k) + 1; α_t = 1/(t+1)
6:   f = (1 / (|S| |A| |O|)) Σ_{s,a,o} Q(s, a, o)
7:   y = R(s_k, a_k, o_k) + E[ Q(s_{k+1}, π_{k+1}(s_{k+1}), o_k) ] − f
8:   Q(s_k, a_k, o_k) ← (1 − α_t) Q(s_k, a_k, o_k) + α_t · y
9:   Sample a_{k+1} from the distribution π_{k+1}(· | s_{k+1})
10: end for

The algorithm for the constrained Markov decision process with average reward is given in Alg. 2. Most of this algorithm is similar to Algorithm 1. However, in line 1 we also initialize the N-table, which records how many times each (s, a, o) has been visited during learning; the N-table is updated in line 5. Besides, in line 6, f is computed according to Definition 2. Finally, in line 8, the Q-table is updated according to Eq. (18).

5 Evaluation on a Discrete-Time Single-Server Queue

In this section, we evaluate the proposed algorithm on a queuing system with a single server in discrete time. In this model, we assume there is a buffer of finite size L. A possible arrival is assumed to occur at the beginning of the time slot. The state of the system is the number of customers waiting in the queue at the beginning of the time slot, so that |S| = L + 1. We assume there are two kinds of actions, a service action a and a flow action b. The service action space is a finite subset A of [a_min, a_max] with 0 < a_min ≤ a_max < 1.

Table 1: Transition probabilities of the queue system

  Current state      P(x_{t+1} = x_t − 1)   P(x_{t+1} = x_t)          P(x_{t+1} = x_t + 1)
  1 ≤ x_t ≤ L − 1    a(1 − b)               ab + (1 − a)(1 − b)       (1 − a)b
  x_t = L            a                      1 − a                     0
  x_t = 0            0                      1 − b(1 − a)              b(1 − a)

The cost function is set to be c(s, a, b) = s − 5, the constraint function for the service is defined as c^1(s, a, b) = 10a − 5, and the constraint function for the flow is c^2(s, a, b) = 5(1 − b)^2 − 2.
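Below is a small simulator of the queue dynamics in Table 1 and the cost and constraint functions above (our sketch). The action grids are illustrative placeholders: their sizes match the 6 × 5 × 4 × 3 Q-table mentioned later in this section, but the exact action values used in the experiments are not reproduced here.

import numpy as np

L = 5                                             # buffer size, |S| = L + 1 = 6
A_grid = np.array([0.2, 0.4, 0.6, 0.8, 0.9])      # illustrative service actions
B_grid = np.array([0.2, 0.4, 0.6, 0.8])           # illustrative flow actions

def queue_step(x, a, b, rng):
    """Sample x_{t+1} according to Table 1, given state x and actions (a, b)."""
    if x == 0:
        p = [0.0, 1.0 - b * (1.0 - a), b * (1.0 - a)]
    elif x == L:
        p = [a, 1.0 - a, 0.0]
    else:
        p = [a * (1.0 - b), a * b + (1.0 - a) * (1.0 - b), (1.0 - a) * b]
    return x + rng.choice([-1, 0, 1], p=p)

def cost(x, a, b):      return x - 5                       # objective cost c(s, a, b)
def c_service(x, a, b): return 10.0 * a - 5.0              # service constraint c^1
def c_flow(x, a, b):    return 5.0 * (1.0 - b) ** 2 - 2.0  # flow constraint c^2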

(Figure 1: value of the Reward, Service, and Flow constraints versus iteration (×10^5) for Algorithm 1 applied to the discrete-time single-server queue in the discounted case, for several values of λ, including λ = 9.5 and λ = 9.6.)

The proposed Algorithm 1 optimizes the minimal value function among V(s, a, o) with respect to o. We see that the three constraints for λ = 9.5 are close to each other and non-negative, thus demonstrating that the constraints are satisfied and that λ = 9.5 is feasible.

On the other extreme, we see the case λ = 9.7. We note that all three constraints are below 0, which means that there is no feasible policy for this setting. Thus, comparing the cases λ = 9.5 and λ = 9.7, we note that the optimal objective is between the two values. Looking at the case λ = 9.625, we also note that the service constraint is clearly below zero and the constraints are not satisfied. Similarly, for λ = 9.55, the constraints are non-negative; the closest to zero is the service constraint, which crosses zero every few iterations, and thus the gap is within the margin. This shows that the optimal objective is within 9.55 and 9.625. However, the judgment is not as evident between the two regimes: it cannot be clearly determined from λ = 9.575 and λ = 9.6 whether they are feasible or not, since they are not consistently below zero after 80,000 iterations, as in the cases of λ = 9.625 and λ = 9.7, and not mostly above zero, as for λ = 9.5. Thus, looking at the figures, we estimate the value of the optimal objective to be between 9.55 and 9.625.

In order to compare the result with the theoretical optimal total reward, we can assume the dynamics of the MDP are known in advance and use the linear programming algorithm to solve the original problem. The result given by the LP is 9.62. We note that Q(s, a, o) has 6 × 5 × 4 × 3 = 360 elements, and it is possible that 10^5 iterations are not enough for all the elements of the Q-table to converge. Further, sampling 10^4 trajectories can only achieve an accuracy of 0.1 with 99% confidence for the constraint function value, and we are within that range. Thus, more iterations and more samples (especially more samples) would help improve the achievable estimate from 9.55 in the algorithm performance. Overall, considering the limited iterations and sampling in the simulations, we conclude that the result obtained by the proposed algorithm is close to the optimal result obtained by linear programming.

Next, we test this example in the expected average rewards case, which is formulated as

  min_{π^a, π^b} lim_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} c(s_t, π^a(s_t), π^b(s_t)) ]
  s.t. lim_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} c^i(s_t, π^a(s_t), π^b(s_t)) ] ≤ 0,  i = 1, 2.      (27)

When the model is known a priori, this problem can be solved by the linear programming approach (Altman, 1999). With the same reward function and the two constraint functions, the LP shows that the optimal total average reward is about 3.62. In order to compare this result with the LP, we run Algorithm 2 with parameters λ = 3.5 and λ = 3.55. The results are shown in Fig. 2.

(Figure 2: Value of constraints with iterations for the Zero-Sum Markov-Bandit algorithm applied to the discrete-time single-server queue in the expected average case; panels (a) λ = 3.5 and (b) λ = 3.55, constraint value versus iteration (×10^5).)

We see that for λ = 3.5, all three constraints are above 0, which means that there exists a feasible policy when λ = 3.5. However, when λ = 3.55, the flow constraint is not satisfied. Thus, the result obtained by the proposed algorithm is between 3.5 and 3.55, which is close to the optimal result 3.62 (after converting the objective to a constraint). The gap arises for the same reasons as in the discounted case.

Finally, we note that two additional numerical examples can be found in the Appendix.
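The LP baseline referred to above can be set up with stationary occupation measures ρ(s, a, b), a standard construction from (Altman, 1999). The following sketch (ours, using scipy.optimize.linprog) minimizes the average cost subject to the two constraints, assuming the transition kernel P and the cost tables are available as arrays; the constraint direction follows (27).

import numpy as np
from scipy.optimize import linprog

def solve_average_cmdp_lp(P, c, c1, c2):
    """P[s, a, b, s_next]: transition kernel; c, c1, c2: cost tables of shape
    (n, |A|, |B|). Decision variables: occupation measure rho(s, a, b) >= 0."""
    n, na, nb, _ = P.shape
    dim = n * na * nb
    obj = c.reshape(dim)                               # minimize average cost
    # Flow conservation: sum_{a,b} rho(s',a,b) = sum_{s,a,b} rho(s,a,b) P(s'|s,a,b)
    A_eq = np.zeros((n + 1, dim))
    for s_next in range(n):
        out_mask = np.zeros((n, na, nb))
        out_mask[s_next, :, :] = 1.0
        A_eq[s_next] = (out_mask - P[:, :, :, s_next]).reshape(dim)
    A_eq[n] = 1.0                                      # rho sums to one
    b_eq = np.zeros(n + 1)
    b_eq[n] = 1.0
    # Average constraints E[c^i] <= 0, i = 1, 2, as in (27).
    A_ub = np.vstack([c1.reshape(dim), c2.reshape(dim)])
    b_ub = np.zeros(2)
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, None)] * dim)
    rho = res.x.reshape(n, na, nb)
    return res.fun, rho            # optimal average cost and occupation measure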
6 Conclusions

We considered the problem of optimization and learning for constrained and multi-objective Markov decision processes, for both discounted rewards and expected average rewards. We formulated the problems as zero-sum games where one player (the agent) solves a Markov decision problem and its opponent solves a bandit optimization problem, which we call Markov-Bandit games. We extended Q-learning to solve Markov-Bandit games and proved that our new Q-learning algorithms converge to the optimal solutions of the zero-sum Markov-Bandit games, and hence converge to the optimal solutions of the constrained and multi-objective Markov decision problems. The provided numerical examples and the simulation results illustrate that the proposed algorithm converges to the optimal policy.

Combining the long-term constraints considered in this paper with the peak constraints considered in Gattami (2019) and Bai et al. (2021) is an interesting direction for future work.

References

Abels A, Roijers D, Lenaerts T, Nowé A, Steckelmacher D (2019) Dynamic weights in multi-objective deep reinforcement learning. In: Proceedings of the 36th International Conference on Machine Learning, PMLR, vol 97, pp 11–20
Abounadi J, Bertsekas D, Borkar V (2001a) Learning algorithms for Markov decision processes with average cost. SIAM Journal on Control and Optimization 40:681–698
Abounadi J, Bertsekas D, Borkar VS (2001b) Learning algorithms for Markov decision processes with average cost. SIAM Journal on Control and Optimization 40:681–698
Achiam J, Held D, Tamar A, Abbeel P (2017) Constrained policy optimization. In: Proceedings of the 34th International Conference on Machine Learning, ICML'17, pp 22–31
Altman E (1999) Constrained Markov Decision Processes, vol 7. CRC Press
Åström KJ, Wittenmark B (1994) Adaptive Control, 2nd edn. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA
Bai Q, Aggarwal V, Gattami A (2021) Provably efficient model-free algorithm for MDPs with peak constraints. arXiv preprint arXiv:2003.05555
Bertsekas DP (2005) Dynamic Programming and Optimal Control, vol 1. Athena Scientific
Borkar VS (2005) An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters 54(3):207–213
Brantley K, Dudik M, Lykouris T, Miryoosefi S, Simchowitz M, Slivkins A, Sun W (2020) Constrained episodic reinforcement learning in concave-convex and knapsack settings. arXiv preprint arXiv:2006.05051
Chow Y, Ghavamzadeh M, Janson L, Pavone M (2017) Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research 18:167:1–167:51
Ding D, Wei X, Yang Z, Wang Z, Jovanović MR (2020a) Provably efficient safe exploration via primal-dual policy optimization. arXiv preprint arXiv:2003.00534
Ding D, Zhang K, Jovanovic M, Basar T (2020b) Natural policy gradient primal-dual method for constrained Markov decision processes. In: Advances in Neural Information Processing Systems (NeurIPS 2020)
Djonin DV, Krishnamurthy V (2007) MIMO transmission control in fading channels: A constrained Markov decision process formulation with monotone randomized policies. IEEE Transactions on Signal Processing 55(10):5069–5083
Drugan MM, Nowe A (2013) Designing multi-objective multi-armed bandits algorithms: A study. In: The 2013 International Joint Conference on Neural Networks (IJCNN), IEEE, pp 1–8
Efroni Y, Mannor S, Pirotta M (2020) Exploration-exploitation in constrained MDPs. arXiv preprint arXiv:2003.02189
Gattami A (2019) Reinforcement learning of Markov decision processes with peak constraints. arXiv preprint arXiv:1901.07839
Geibel P (2006) Reinforcement learning for MDPs with constraints. In: Machine Learning: ECML 2006, Springer Berlin Heidelberg, pp 646–653
Jaakkola T, Jordan MI, Singh SP (1994) On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation 6(6):1185–1201
Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pp 4868–4878
Littman ML (1994) Markov games as a framework for multi-agent reinforcement learning. In: Proceedings of the Eleventh International Conference on Machine Learning, pp 157–163
Lizotte D, Bowling MH, Murphy AS (2010) Efficient reinforcement learning with multiple reward functions for randomized controlled trial analysis. In: Proceedings of the 27th International Conference on Machine Learning, ICML'10, pp 695–702
Mannor S (2004a) Reinforcement learning for average reward zero-sum games. In: Learning Theory, Springer Berlin Heidelberg, pp 49–63
Mannor S (2004b) Reinforcement learning for average reward zero-sum games. In: Learning Theory, Springer Berlin Heidelberg, pp 49–63
Ni C, Yang LF, Wang M (2019) Learning to control in metric space with optimal regret. In: 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), IEEE, pp 726–733
Parpas P, Rustem B (2009) An algorithm for the global optimization of a class of continuous minimax problems. Journal of Optimization Theory and Applications 141:461–473
Paternain S, Chamon LF, Calvo-Fullana M, Ribeiro A (2019) Constrained reinforcement learning has zero duality gap. arXiv preprint arXiv:1910.13393
Raghu R, Upadhyaya P, Panju M, Agarwal V, Sharma V (2019) Deep reinforcement learning based power control for wireless multicast systems. In: 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), IEEE, pp 1168–1175
Roijers DM, Vamplew P, Whiteson S, Dazeley R (2013) A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research 48(1):67–113
Shah SM, Borkar VS (2018) Q-learning for Markov decision processes with a satisfiability criterion. Systems & Control Letters 113:45–51
Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, Baker L, Lai M, Bolton A, Chen Y, Lillicrap T, Hui F, Sifre L, van den Driessche G, Graepel T, Hassabis D (2017) Mastering the game of Go without human knowledge. Nature 550:354–359
Singh R, Gupta A, Shroff NB (2020) Learning in Markov decision processes under constraints. arXiv preprint arXiv:2002.12435
Tessler C, Mankowitz DJ, Mannor S (2018) Reward constrained policy optimization. In: International Conference on Learning Representations
Yang R, Sun X, Narasimhan K (2019) A generalized algorithm for multi-objective reinforcement learning and policy adaptation. arXiv preprint arXiv:1908.08342
Yu M, Yang Z, Kolar M, Wang Z (2019) Convergent policy optimization for safe reinforcement learning. In: Advances in Neural Information Processing Systems 32
Zheng L, Ratliff L (2020) Constrained upper confidence reinforcement learning. In: Proceedings of Machine Learning Research, vol 120, pp 620–629
Zhou D, Chen J, Gu Q (2020) Provable multi-objective reinforcement learning with generative models. arXiv preprint arXiv:2011.10134