Reinforcement Learning for Constrained Markov Decision Processes

Ather Gattami (AI Sweden)    Qinbo Bai (Purdue University)    Vaneet Aggarwal (Purdue University)

Abstract

In this paper, we consider the problem of optimization and learning for constrained and multi-objective Markov decision processes, for both discounted rewards and expected average rewards. We formulate the problems as zero-sum games where one player (the agent) solves a Markov decision problem and its opponent solves a bandit optimization problem, which we here call Markov-Bandit games. We extend Q-learning to solve Markov-Bandit games and show that our new Q-learning algorithms converge to the optimal solutions of the zero-sum Markov-Bandit games, and hence converge to the optimal solutions of the constrained and multi-objective Markov decision problems. We provide numerical examples where we calculate the optimal policies and show by simulations that the algorithm converges to the calculated optimal policies. To the best of our knowledge, this is the first time Q-learning algorithms guarantee convergence to optimal stationary policies for the multi-objective reinforcement learning problem with discounted and expected average rewards, respectively.

1 Introduction

1.1 Motivation

Reinforcement learning has made great advances in several applications, ranging from online learning and recommender engines, natural language understanding and generation, to mastering games such as Go (Silver et al., 2017) and Chess. The idea is to learn from extensive experience how to take actions that maximize a given reward by interacting with the surrounding environment. The interaction teaches the agent how to maximize its reward without knowing the underlying dynamics of the process. A classical example is swinging up a pendulum to an upright position. By making several attempts to swing up a pendulum and balance it, one might be able to learn the necessary forces that need to be applied in order to balance the pendulum without knowing the physical model behind it, which is the general approach of classical model-based control theory (Åström and Wittenmark, 1994).

Informally, the problem of constrained reinforcement learning for Markov decision processes is described as follows. Given a Markov decision process with state s_k at time step k, reward function r, constraint functions r^j, and a discount factor 0 < γ < 1, the multi-objective reinforcement learning problem is for the optimizing agent to find a stationary policy π(s_k) that, in the discounted reward setting, solves

  max_π E( Σ_{k=0}^∞ γ^k r(s_k, π(s_k)) )      (1)
  s.t.  E( Σ_{k=0}^∞ γ^k r^j(s_k, π(s_k)) ) ≥ 0      (2)

or, in the expected average reward setting, solves

  max_π lim_{T→∞} (1/T) E( Σ_{k=0}^{T−1} r(s_k, π(s_k)) )      (3)
  s.t.  lim_{T→∞} (1/T) E( Σ_{k=0}^{T−1} r^j(s_k, π(s_k)) ) ≥ 0      (4)

for j = 1, ..., J (a more formal definition of the problem is introduced in the next section, and some examples of this setup are given in the Appendix).

Surprisingly, although constrained MDP problems are fundamental and have been studied extensively in the literature (see (Altman, 1999) and the references therein), the reinforcement learning counterpart seems to still be open. When an agent takes actions based solely on the observed states and constraint outputs (without any knowledge about the dynamics and/or constraint functions), a general solution seems to be lacking, to the best of the authors' knowledge, for both the discounted and expected average reward cases.
Note that maximizing (1) is equivalent to maximizing λ subject to the constraint

  E( Σ_{k=0}^∞ γ^k r(s_k, π(s_k)) ) ≥ λ.

Thus, one could always replace r with r − λ(1−γ) and obtain a constraint of the form (2). Similarly, for the average reward case, one may replace r with r − λ to obtain a constraint of the form (4). Hence, we can run the bisection method with respect to λ, and the problem in the discounted setting is transformed into finding a policy π such that

  E( Σ_{k=0}^∞ γ^k r^j(s_k, π(s_k)) ) ≥ 0,      (5)

where j = 0, 1, ..., J and r^0 = r − λ(1−γ). Or, in the average setting, find a policy π such that

  lim_{T→∞} (1/T) E( Σ_{k=0}^{T−1} r^j(s_k, π(s_k)) ) ≥ 0,      (6)

where r^0 = r − λ. In this paper, we call problems (5) and (6) multi-objective MDP problems; we propose algorithms for them based on Markov-Bandit games and prove their convergence.
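To make the reduction above concrete, the following minimal sketch (ours, not part of the paper's algorithm statements) runs the outer bisection over λ around a feasibility oracle; is_feasible is a hypothetical stand-in for the learners developed in Section 4, which decide whether the shifted problem (5) or (6) admits a feasible policy.

# Minimal sketch. Assumption: `is_feasible(lam)` runs one of the learners from
# Section 4 on the shifted rewards r^0 = r - lam*(1 - gamma) (discounted) or
# r^0 = r - lam (average) and reports whether the learned game value is >= 0.
def bisect_lambda(is_feasible, lam_lo, lam_hi, tol=1e-2):
    # Invariant: lam_lo is feasible, lam_hi is infeasible.
    while lam_hi - lam_lo > tol:
        lam = 0.5 * (lam_lo + lam_hi)
        if is_feasible(lam):
            lam_lo = lam      # the optimal objective value is at least lam
        else:
            lam_hi = lam      # the optimal objective value is below lam
    return lam_lo             # approximate optimal objective value

The experiments in Section 5, which probe a handful of λ values by hand (e.g., λ = 9.5, 9.55, ..., 9.7 in the discounted case), are essentially a manual version of this outer loop.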

1.2 Related Work

Constrained MDP problems are convex, and hence one can convert the constrained MDP problem into an unconstrained zero-sum game where the objective is the Lagrangian of the optimization problem (Altman, 1999). However, when the dynamics and rewards are not known, it is not apparent how to do this, as the Lagrangian itself becomes unknown to the optimizing agent. Previous work regarding constrained MDPs, when the dynamics of the stochastic process are not known, considers scalarization through weighted sums of the rewards; see (Roijers et al., 2013) and the references therein. Another approach is to consider Pareto optimality when multiple objectives are present (Zhou et al., 2020; Yang et al., 2019). Notice that there may be multiple Pareto optimal points, and in general these points will not satisfy all the constraints. Further, a solution of min-max may not in general be Pareto optimal. Thus, our problem formulation is very different from the above papers, which aim to achieve the Pareto front.

In (Geibel, 2006), the author considers a single constraint and allows for randomized policies. However, no proofs of convergence are provided for the proposed sub-optimal algorithms. Sub-optimal solutions with convergence guarantees were provided in (Chow et al., 2017) for the single-constraint problem, allowing for randomized policies. In (Borkar, 2005), a sub-optimal actor-critic algorithm is provided for a single constraint, and it is claimed that it can be generalized to an arbitrary number of constraints. Reinforcement-learning-based model-free solutions have been proposed for these problems without guarantees (Djonin and Krishnamurthy, 2007; Lizotte et al., 2010; Drugan and Nowe, 2013; Achiam et al., 2017; Abels et al., 2019; Raghu et al., 2019).

Recently, (Tessler et al., 2018) proposed a multi-time-scale policy gradient algorithm with a Lagrange multiplier for discounted constrained reinforcement learning and proved that the policy converges to a feasible policy. (Efroni et al., 2020) found a feasible policy by using a Lagrange multiplier and a zero-sum game for reinforcement learning with convex constraints and discounted reward. Yu et al. (2019) and (Paternain et al., 2019) showed that constrained reinforcement learning has zero duality gap, which provides a theoretical guarantee for policy gradient algorithms in the dual domain. In contrast, our paper does not use policy gradient based algorithms. (Zheng and Ratliff, 2020) proposed the C-UCRL algorithm, which achieves sub-linear regret O(T^{3/4} sqrt(log(T)/δ)) with probability 1 − δ while satisfying the constraints. However, this algorithm needs knowledge of the model dynamics. Brantley et al. (2020) proposed a model-based algorithm for tabular episodic reinforcement learning with concave rewards and convex constraints. Singh et al. (2020) modified the well-known UCRL2 algorithm and proposed the model-based UCRL-CMDP algorithm to solve the CMDP problem with a sub-linear regret result. Efroni et al. (2020) proposed four algorithms for the constrained reinforcement learning problem in the primal, dual, or primal-dual domain and showed sub-linear bounds for regret and constraint violations. However, all these algorithms are model based, while our algorithm is model-free and scalable to continuous spaces. Ding et al. (2020b) employed the natural policy gradient method to solve the discounted infinite-horizon CMDP problem; it achieves an O(1/ε^2) convergence rate with respect to ε-optimality and ε-constraint violation. Although that algorithm is model-free, it still needs a simulator to obtain samples. Ding et al. (2020a) proposed a model-free primal-dual algorithm without a simulator to solve the CMDP and gives an O(sqrt(T)) bound for both reward and constraint, which should be the state-of-the-art result for this problem. Shah and Borkar (2018) proposed a three-time-scale Q-learning based algorithm to solve a constraint satisfaction problem in the expected average setting. In contrast, this paper provides the first single-time-scale Q-learning algorithm for both discounted and expected average rewards.
1.3 Contributions

We consider the problem of optimization and learning for constrained Markov decision processes, for both discounted rewards and expected average rewards. We formulate the problems as zero-sum games where one player (the agent) solves a Markov decision problem and its opponent solves a bandit optimization problem; we call these Markov-Bandit games, and they are interesting in their own right. The opponent acts on a finite set (and not on a continuous space). This transformation is essential in order to achieve a tractable optimal algorithm. The reason is that using Lagrange duality without model knowledge requires infinite-dimensional optimization in the learning algorithm, since the Lagrange multipliers are continuous (compare with the intractability of a partially observable MDP, where the beliefs are continuous variables). We extend Q-learning to solve Markov-Bandit games and show that our new Q-learning algorithms converge to the optimal solutions of the zero-sum Markov-Bandit games, and hence converge to the optimal solutions of the multi-objective Markov decision problems. The proof techniques are different for the discounted and average reward problems, respectively, where the latter becomes much more technically involved. We provide numerical examples where we calculate the optimal policies and show by simulations that the algorithm converges to the calculated optimal policies. To the best of our knowledge, this is the first time Q-learning algorithms guarantee convergence to optimal stationary policies for the constrained and multi-objective MDP problem with discounted and expected average rewards, respectively.

1.4 Outline

In the problem formulation (Section 2), we present a precise mathematical definition of the multi-objective problem for MDPs. Then, we give a brief introduction and some useful results for reinforcement learning with applications to zero-sum Markov-Bandit games (Section 3). Section 4 first shows the connection between the Markov-Bandit game and the multi-objective problem, and then presents the solution to the multi-objective reinforcement learning problem. We demonstrate the proposed algorithm by an example in Section 5, and we conclude the paper in Section 6. Most proofs and some numerical results are relegated to the Appendix.
2 Problem Formulation and Assumptions

Consider a Markov Decision Process (MDP) defined by the tuple (S, A, P), where S = {S_1, S_2, ..., S_n} is a finite set of states, A = {A_1, A_2, ..., A_m} is a finite set of actions taken by the agent, and P : S × A × S → [0, 1] is a transition function mapping each triple (s, a, s_+) to a probability given by

  P(s, a, s_+) = Pr(s_+ | s, a),

and hence

  Σ_{s_+ ∈ S} P(s, a, s_+) = 1,  for all (s, a) ∈ S × A,

where s_+ is the next state for state s. Let Π be the set of policies that map a state s ∈ S to a probability distribution over the actions, with a probability assigned to each action a ∈ A; that is, π(s) = a with probability Pr(a | s). The agent's objective in multi-objective reinforcement learning is to find a policy that satisfies a set of constraints of the form (5) or (6), for s_0 = s ∈ S, where r^j : S × A → R are bounded functions, for j = 0, 1, ..., J, possibly unknown to the agent. The parameter γ ∈ (0, 1) is a discount factor which models how much weight to put on future rewards. The expectation is taken with respect to the randomness introduced by the policy π and the transition mapping P.

Definition 1 (Unichain MDP). An MDP is called unichain if, for each policy π, the Markov chain induced by π is ergodic, i.e., each state is reachable from any other state.

Unichain MDPs are usually considered in reinforcement learning problems with discounted rewards, since they guarantee that we learn the process dynamics from the initial states. Thus, for the discounted reward case we will make the following assumption.

Assumption 1 (Unichain MDP). The MDP (S, A, P) is assumed to be unichain.

For the case of expected average reward, we will make a simpler assumption regarding the existence of a recurrent state, a standard assumption in Markov decision process problems with expected average rewards, to ensure that the expected reward is independent of the initial state.

Assumption 2. There exists a state s* ∈ S which is recurrent for every stationary policy π played by the agent.

Assumption 2 implies that E(r^j(s_k, a_k)) is independent of the initial state at stationarity. Hence, at stationarity the constraint (6) is equivalent to the inequality E(r^j(s_k, a_k)) ≥ 0 for all k. We will use this constraint in the sequel, which turns out to be very useful in the game-theoretic approach to solving the problem.

Assumption 3. The absolute values of the functions {r^j}_{j=0}^{J} are bounded by some constant c known to the agent.

A bounded reward function with a bound known a priori is a typical assumption in RL; see Jin et al. (2018) and Ni et al. (2019).
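For the tabular setting above, the following short sketch (our illustration, not from the paper) fixes the array conventions used in the later code sketches: P is an n × m × n array of transition probabilities, and each r^j is an n × m reward table bounded by c. The sizes below are arbitrary example values.

import numpy as np

n, m = 4, 3                      # |S| = n states, |A| = m actions (example sizes)
rng = np.random.default_rng(0)

# Random transition kernel P[s, a, s_plus] = Pr(s_plus | s, a).
P = rng.random((n, m, n))
P /= P.sum(axis=2, keepdims=True)            # each row sums to 1 over the next state
assert np.allclose(P.sum(axis=2), 1.0)

# Bounded reward/constraint tables r^j(s, a), j = 0, ..., J, with |r^j| <= c.
J, c = 2, 1.0
r = rng.uniform(-c, c, size=(J + 1, n, m))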
3 Reinforcement Learning for Zero-Sum Markov-Bandit Games

A zero-sum Markov-Bandit game is defined by the tuple (S, A, O, P, R), where S, A, and P are defined as in Section 2, and O = {o_1, o_2, ..., o_q} is a finite set of actions made by the agent's opponent. Let Π be the set of policies π(s) that map a state s ∈ S to a probability distribution over the actions, with a probability assigned to each action a ∈ A; that is, π(s) = a with probability Pr(a | s).

For the zero-sum Markov-Bandit game, we define the reward R : S × A × O → R, which is assumed to be bounded. The agent's objective is to maximize the minimum (average or discounted) reward obtained due to the opponent's malicious action. The difference between a zero-sum Markov game and a Markov-Bandit game is that the opponent's action does not affect the state and the opponent chooses a constant action o_k = o ∈ O for all time steps k. This will be made more precise in the following sections.

3.1 Discounted Rewards

Consider a zero-sum Markov-Bandit game where the agent is maximizing the total discounted reward given by

  V(s) = min_o E( Σ_{k=0}^∞ γ^k R(s_k, a_k, o) )      (7)

for the initial state s_0 ∈ S. Let Q(s, a, o) be the expected reward of the agent taking action a_0 = a ∈ A from state s_0 = s, and continuing with a policy π thereafter, when the opponent takes a fixed action o. Note that this is different from zero-sum Markov games with discounted rewards (Littman, 1994), where the opponent's actions may vary over time, that is, o_k is not a constant. Then, for any stationary policy π, we have that

  Q(s, a, o) = R(s, a, o) + E( Σ_{k=1}^∞ γ^k R(s_k, π(s_k), o) )
             = R(s, a, o) + γ · E( Q(s_+, π(s_+), o) ).      (8)

Equation (8) is known as the Bellman equation. The solution to (8), with respect to Q and the initial state s_0, that corresponds to the optimal policy π* is denoted Q*. If we have the function Q*, then we can obtain the optimal policy π* according to the equations

  Q*(s, a, o) = R(s, a, o) + γ · E( Q*(s_+, π*(s_+), o) )
  (π*(s_0), o*) = arg max_{π∈Π} min_o E( Q*(s_0, π(s_0), o) )
  π*(s) = arg max_{π∈Π} E( Q*(s, π(s), o*) ),      (9)

which maximizes the total discounted reward

  min_o E( Σ_{k=0}^∞ γ^k R(s_k, π*(s), o) ) = min_o E( Q*(s, π*(s), o) )

for s = s_0. Note that the optimal policy may not be deterministic, as opposed to reinforcement learning for unconstrained Markov decision processes, where there is always an optimal policy that is deterministic. Also, note that we will get different Q tables for different initial states here. Therefore, Q* is in fact dependent on and varies with respect to s_0. A more proper notation would be Q*_{s_0}, but we omit the indexing with respect to s_0 for ease of notation. It is relevant to introduce the operator

  (TQ)(s, a, o) = R(s, a, o) + γ · E( Q(s_+, π*(s_+), o) ),

in which π* is the policy appearing in Equation (9). It is not hard to check that the operator T is not a contraction, so the standard Q-learning that is commonly used for reinforcement learning in Markov decision processes with discounted rewards cannot be applied here.

In the case where we do not know the process P and the reward function R, we will not be able to take advantage of the Bellman equation directly. The following results show that we will be able to design an algorithm that always converges to Q*.

Theorem 1. Consider a zero-sum Markov-Bandit game given by the tuple (S, A, O, P, R), where (S, A, P) is unichain, and suppose that R is bounded by some constant that is known a priori. Let Q = Q* and π* be solutions to

  Q(s, a, o) = R(s, a, o) + γ · E( Q(s_+, π*(s_+), o) )
  (π*(s), o*) = arg max_{π∈Π} min_o E( Q(s, π(s), o) )
  π*(s) = arg max_{π∈Π} E( Q(s, π(s), o*) ).      (10)

Let α_k(s, a, o) = α_k · 1_{(s,a,o)}(s_k, a_k, o_k) satisfy

  0 ≤ α_k(s, a, o) < 1,   Σ_{k=0}^∞ α_k(s, a, o) = ∞,   Σ_{k=0}^∞ α_k^2(s, a, o) < ∞,   for all (s, a, o) ∈ S × A × O.      (11)

Then, the update rule

  (π_k, o_k) = arg max_{π∈Π} min_o E( Q_k(s_+, π(s_+), o) )
  Q_{k+1}(s, a, o_k) = (1 − α_k(s, a, o_k)) Q_k(s, a, o_k) + α_k(s, a, o_k) ( R(s, a, o_k) + γ E( Q_k(s_+, π_k(s_+), o_k) ) )      (12)

converges to Q* with probability 1. Furthermore, the optimal policy π* ∈ Π given by (9) maximizes (7) with respect to the initial state s = s_0. That is,

  (π*(s_0), o*) = arg max_{π∈Π} min_o E( Q*(s_0, π(s_0), o) )
  π*(s) = arg max_{π∈Π} E( Q*(s, π(s), o*) ).
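The following minimal sketch (ours, with hypothetical helper names) shows one way the update (11)-(12) can be realized for a tabular game. It assumes a routine maxmin_policy(Q, s) that returns a (possibly randomized) policy at state s together with the minimizing opponent action (for instance via the linear program discussed in Section 4), and an environment step function env_step(s, a) returning the next state; R is passed as a table for brevity, whereas in the model-free setting the entries R(s_k, a_k, o) are observed rewards.

import numpy as np

def q_learning_markov_bandit(env_step, R, n, m, q, gamma,
                             maxmin_policy, n_iters=100_000, s0=0, seed=0):
    """Sketch of the update (12); `maxmin_policy(Q, s)` is assumed to return
    (pi, o) with pi an m-vector (randomized policy at s) and o the minimizing
    opponent action."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n, m, q))
    visits = np.zeros((n, m, q), dtype=int)
    s = s0
    a = rng.integers(m)
    for _ in range(n_iters):
        s_next = env_step(s, a)                      # observe transition
        pi_next, o = maxmin_policy(Q, s_next)        # first line of (12)
        visits[s, a, o] += 1
        alpha = 1.0 / visits[s, a, o]                # a step size satisfying (11)
        target = R[s, a, o] + gamma * pi_next @ Q[s_next, :, o]
        Q[s, a, o] = (1 - alpha) * Q[s, a, o] + alpha * target   # second line of (12)
        s, a = s_next, rng.choice(m, p=pi_next)      # sample next action from pi_next
    return Q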

3.2 Expected Average Rewards

The agent's objective is to maximize the minimal average reward obtained due to the opponent's malicious actions, that is, to maximize the total reward given by

  min_{o∈O} lim_{T→∞} (1/T) E( Σ_{k=0}^{T−1} R(s_k, a_k, o) )      (13)

for some initial state s_0 ∈ S. Note that this problem is different from the zero-sum game considered in (Mannor, 2004a): here the opponent has to pick a fixed value for its action, o_k = o, as opposed to the work in (Mannor, 2004a), where o_k is allowed to vary over time. Thus, from the opponent's point of view, the opponent is performing bandit optimization.

Under Assumption 2 and for a given stationary policy π, the value of

  V(o) = lim_{T→∞} (1/T) E( Σ_{k=0}^{T−1} R(s_k, π(s_k), o) )      (14)

is independent of the initial state, and there exist a constant v(o) ∈ R and a vector H = (H(S_1, o), ..., H(S_n, o)) ∈ R^n such that, for each s ∈ S, we have that

  H(s, o) + v(o) = E( R(s, π(s), o) + Σ_{s_+ ∈ S} P(s_+ | s, π(s)) H(s_+, o) ).      (15)

Furthermore, the value of (14) is V(o) = v(o); see (Bertsekas, 2005) for a proof.

Introduce

  Q(s, a, o) − v(o) = R(s, a, o) + Σ_{s_+ ∈ S} P(s_+ | s, a) H(s_+, o)      (16)

and let Q*, v*, and H* be solutions to Equations (15)-(16) corresponding to the optimal policy π* that maximizes (13). Then we have that

  π*(s) = arg max_{π∈Π} min_{o∈O} E( Q*(s, π(s), o) )
  H*(s, o) = E( Q*(s, π*(s), o) )
  Q*(s, a, o) − v*(o) = R(s, a, o) + Σ_{s_+ ∈ S} P(s_+ | s, π*(s)) H*(s_+, o).      (17)

We will make some additional assumptions that will be used in the learning of Q* in the average reward case. We start off by introducing a sequence of learning rates {γ_k} and assume that this sequence satisfies the following assumption. Notice that this is a typical assumption in stochastic approximation, and it is common in the expected average reward RL setting; see for instance Assumption 2.3 in Abounadi et al. (2001a) and Assumption 2 in Mannor (2004b).

Assumption 4 (Learning rate). The sequence {γ_k} satisfies:

1. γ_{k+1} ≤ γ_k eventually;
2. Σ_k γ_k = ∞ and Σ_k γ_k^2 < ∞;
3. for every 0 < x ≤ 1, sup_k γ_{⌊xk⌋} / γ_k < ∞;
4. for every 0 < x ≤ 1, ( Σ_{j=0}^{⌊yk⌋} γ_j ) / ( Σ_{j=0}^{k} γ_j ) → 1 uniformly in y ∈ [x, 1].

For instance, the sequences γ_k = 1/k and γ_k = 1/(k log k) (for k > 1) satisfy Assumption 4.

Now define N(k, s, a, o) as the number of times that state s and actions a and o were played up to time k, that is,

  N(k, s, a, o) = Σ_{t=1}^{k} 1_{(s,a,o)}(s_t, a_t, o_t).

The following assumption is needed to guarantee that all combinations of the triple (s, a, o) are visited often, which can be satisfied by using the ε-greedy method in the algorithm.

Assumption 5 (Often updates). There exists a deterministic number d > 0 such that for every s ∈ S, a ∈ A, and o ∈ O, we have that

  lim inf_{k→∞} N(k, s, a, o) / k ≥ d

with probability 1.

Definition 2. We define the set F as the set of all functions f : R^{n×m×q} → R such that

1. f is Lipschitz;
2. for any c ∈ R, f(cQ) = c f(Q);
3. for any r ∈ R and Q' defined by Q'(s, a, o) = Q(s, a, o) + r for all (s, a, o), with Q ∈ R^{n×m×q}, we have f(Q') = f(Q) + r.

For instance, f(Q) = (1 / (|S| |A| |O|)) Σ_{s,a,o} Q(s, a, o) belongs to the set F.

The next result shows that we will be able to design an algorithm that always converges to Q*.

Theorem 2. Consider a zero-sum Markov-Bandit game given by the tuple (S, A, O, P, R), and suppose that R is bounded. Suppose that Assumptions 2, 4, and 5 hold. Let f ∈ F be given, where the set F is defined as in Definition 2. Then, the asynchronous update algorithm given by

  Q_{k+1}(s, a, o) = Q_k(s, a, o) + 1_{(s,a,o)}(s_k, a_k, o_k) · γ_{N(k,s,a,o)} · ( max_{π∈Π} min_{o_k∈O} ( R(s, a, o_k) + E( Q_k(s_{k+1}, π(s_{k+1}), o_k) ) ) − f(Q_k) − Q_k(s, a, o) )      (18)

converges to Q* in (17) with probability 1. Furthermore, the optimal policy π* ∈ Π given by (17) maximizes (13).
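A minimal sketch of how the asynchronous update (18) can be realized in code (our illustration; maxmin_target and env_step are hypothetical helpers): the offset f(Q_k) is taken to be the mean of the Q table, which belongs to the set F as noted above, and ε-greedy exploration is used to keep Assumption 5 plausible.

import numpy as np

def avg_reward_markov_bandit_q(env_step, R, n, m, q, maxmin_target,
                               n_iters=100_000, s0=0, eps=0.1, seed=0):
    """Sketch of update (18). `maxmin_target(Q, s_next, s, a)` is assumed to
    return the max over (randomized) pi of the min over o of
    R[s, a, o] + pi @ Q[s_next, :, o], a policy to sample from, and the
    minimizing opponent action."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n, m, q))
    N = np.zeros((n, m, q), dtype=int)
    s = s0
    a = rng.integers(m)
    for _ in range(n_iters):
        s_next = env_step(s, a)
        target, pi_next, o = maxmin_target(Q, s_next, s, a)
        N[s, a, o] += 1
        gamma_k = 1.0 / (N[s, a, o] + 1)         # learning rate, cf. Assumption 4
        f_Q = Q.mean()                           # an f in the set F of Definition 2
        Q[s, a, o] += gamma_k * (target - f_Q - Q[s, a, o])    # update (18)
        if rng.random() < eps:                   # epsilon-greedy exploration
            a = rng.integers(m)
        else:
            a = rng.choice(m, p=pi_next)
        s = s_next
    return Q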

4 Reinforcement Learning for Multi-Objective MDPs

4.1 Discounted Rewards

Consider the optimization problem of finding a stationary policy π subject to the initial state s_0 = s and constraints of the form (5), that is,

  find π ∈ Π
  s.t. E( Σ_{k=0}^∞ γ^k r^j(s_k, π(s_k)) ) ≥ 0      (19)

for j = 1, ..., J.

The next theorem states that the optimization problem (19) is equivalent to a zero-sum Markov-Bandit game, in the sense that an optimal strategy of the agent in the zero-sum game is also optimal for (19).

Theorem 3. Consider optimization problem (19), and suppose it is feasible and that Assumption 3 holds. Let π* be an optimal stationary policy in the zero-sum game

  v(s_0) = max_{π∈Π} min_{j∈[J]} E( Σ_{k=0}^∞ γ^k r^j(s_k, π(s_k)) ).      (20)

Then, π* is a feasible solution to (19) if and only if v(s_0) ≥ 0.

The interpretation of the game (20) is that the minimizer chooses the index j ∈ [J], where [J] = {1, 2, ..., J}. Now that we are equipped with Theorems 1 and 3, we are ready to state and prove our next result.

Theorem 4. Consider the constrained MDP problem (19), and suppose that it is feasible and that Assumptions 1 and 3 hold. Also, introduce O = [J], o = j, and

  R(s, a, o) = R(s, a, j) := r^j(s, a),  j = 1, ..., J.

Let Q_k be given by the recursion (12). Then, Q_k → Q* as k → ∞, where Q* is the solution to (10). Furthermore, the policy

  (π*(s_0), o*) = arg max_{π∈Π} min_{o∈O} E( Q*(s_0, π(s_0), o) )
  π*(s) = arg max_{π∈Π} E( Q*(s, π(s), o*) )      (21)

is an optimal solution to (19) for all s ∈ S.

Proof. According to Theorem 3, (19) is equivalent to the zero-sum game (20), which is equivalent to the zero-sum Markov-Bandit game given by (S, A, O, P, R) with the objective

  max_{π∈Π} min_{o∈O} E( Σ_{k=0}^∞ γ^k R(s_k, π(s_k), o) ).      (22)

Assumption 3 implies that |R(s, a, o)| ≤ 2c for all (s, a, o) ∈ S × A × O. Now let Q* be the solution to the maximin Bellman equation (10). According to Theorem 1, Q_k in the recursion given by (11)-(12) converges to Q* with probability 1. By definition, the optimal policy π* achieves the value of the zero-sum Markov-Bandit game in (21), and thus achieves the value of (22), and the proof is complete.
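The construction in Theorem 4 amounts to stacking the constraint functions into a single reward table indexed by the opponent action, as in the following small sketch (ours); when the bisection of Section 1.1 is used, the shifted objective r^0 = r − λ(1−γ) can simply be appended as one more opponent action.

import numpy as np

def build_game_reward(r_constraints, r_objective=None, lam=None, gamma=None):
    """r_constraints: array of shape (J, n, m) with r^j(s, a), j = 1..J.
    Returns R with R[s, a, o] = r^{o+1}(s, a); optionally appends the
    bisection-shifted objective as an extra opponent action."""
    tables = [rj for rj in r_constraints]
    if r_objective is not None:
        tables.append(r_objective - lam * (1.0 - gamma))   # r^0 in (5)
    R = np.stack(tables, axis=-1)        # shape (n, m, q) with q = J or J + 1
    return R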

Algorithm 1 Zero-Sum Markov-Bandit Algorithm for CMDP with Discounted Reward
1: Initialize Q(s, a, o) ← 0 for all (s, a, o) ∈ S × A × O. Observe s_0 and select a_0 randomly. Select α_k according to Eq. (11)
2: for iteration k = 0, ..., K do
3:   Take action a_k and observe the next state s_{k+1}
4:   π_{k+1}, o_k = arg max_{π_{k+1}} min_{o∈O} Q(s_{k+1}, π_{k+1}(s_{k+1}), o)
5:   Q(s_k, a_k, o_k) ← (1 − α_k) Q(s_k, a_k, o_k) + α_k [ r(s_k, a_k, o_k) + γ E(Q(s_{k+1}, π_{k+1}(s_{k+1}), o_k)) ]
6:   Sample a_{k+1} from the distribution π_{k+1}(· | s_{k+1})
7: end for

Finally, the algorithm for the constrained Markov decision process with discounted reward is shown in Alg. 1. In line 1, we initialize the Q-table, observe s_0, and select a_0 randomly. In line 3, we take the current action a_k and observe the next state s_{k+1}, so that we can compute the max-min operator in line 4 based on the first line of Eq. (12). Line 5 updates the Q-table according to the second line of Eq. (12). Line 6 samples the next action from the policy obtained in line 4. Notice that the max-min can be converted to a maximization problem with linear inequalities and can be solved efficiently by linear programming, because the number of inequalities is limited by the action space of the opponent. Even for large-scale problems, efficient algorithms exist for the max-min problem (Parpas and Rustem, 2009).
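As noted above, the inner max-min over randomized policies at a state can be written as a small linear program. The following sketch (our illustration, using scipy.optimize.linprog) maximizes t subject to Σ_a π(a) Q(s, a, o) ≥ t for every opponent action o; it is one concrete instantiation of the hypothetical maxmin_policy helper assumed in the sketch of Section 3.1.

import numpy as np
from scipy.optimize import linprog

def maxmin_policy(Q, s):
    """Solve max over randomized pi of min over o of sum_a pi(a) Q[s, a, o].
    Variables: [pi(1..m), t]; maximize t. Returns (pi, minimizing o)."""
    m, q = Q.shape[1], Q.shape[2]
    c = np.zeros(m + 1)
    c[-1] = -1.0                                    # maximize t
    # Constraint rows: t - sum_a pi(a) Q[s, a, o] <= 0 for every o.
    A_ub = np.hstack([-Q[s].T, np.ones((q, 1))])    # Q[s].T has shape (q, m)
    b_ub = np.zeros(q)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])   # pi sums to one
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    pi = res.x[:m]                                  # res.x[-1] is the attained max-min value
    values = pi @ Q[s]                              # value of pi against each opponent action
    return pi, int(np.argmin(values))

The number of inequality constraints equals the size of the opponent's action set, which in the CMDP instantiation is just the number of constraints J, so the LP stays small even when the state space is large.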

4.2 Expected Average Rewards

Consider the optimization problem of finding a stationary policy π subject to the constraints (6), that is,

  find π ∈ Π
  s.t. lim_{T→∞} (1/T) E( Σ_{k=0}^{T−1} r^j(s_k, π(s_k)) ) ≥ 0      (23)

for j = 1, ..., J.

The next theorem states that the optimization problem (23) is equivalent to a zero-sum Markov-Bandit game, in the sense that an optimal strategy of the agent in the zero-sum game is also optimal for (23).

Theorem 5. Consider optimization problem (23), and suppose that Assumptions 2 and 3 hold. Let π* be an optimal policy in the zero-sum game

  v = max_{π∈Π} min_{j∈[J]} lim_{T→∞} (1/T) E( Σ_{k=0}^{T−1} r^j(s_k, π(s_k)) ).      (24)

Then, π* is a solution to (23) if and only if v ≥ 0.

Now that we are equipped with Theorems 2 and 5, we are ready to state the second main result (proof in the Appendix).

Theorem 6. Consider the constrained Markov decision process problem (23), and suppose that Assumptions 2 and 3 hold. Introduce O = [J], o = j, and

  R(s, a, o) = R(s, a, j) := r^j(s, a),  j = 1, ..., J.

Let Q_k be given by the recursion (18), and suppose that Assumptions 4 and 5 hold. Then, Q_k → Q* as k → ∞, where Q* is the solution to (17). Furthermore, the policy

  π*(s) = arg max_{π∈Π} min_{o∈O} E( Q*(s, π(s), o) )      (25)

is a solution to (23) for all s ∈ S.

Algorithm 2 Zero-Sum Markov-Bandit Algorithm for CMDP with Average Reward
1: Initialize Q(s, a, o) ← 0 and N(s, a, o) ← 0 for all (s, a, o) ∈ S × A × O. Observe s_0 and initialize a_0 randomly
2: for iteration k = 0, ..., K do
3:   Take action a_k and observe the next state s_{k+1}
4:   π_{k+1}, o_k = arg max_{π_{k+1}} min_{o∈O} [ R(s_k, a_k, o) + Q(s_{k+1}, π_{k+1}(s_{k+1}), o) ]
5:   t = N(s_k, a_k, o_k); N(s_k, a_k, o_k) ← N(s_k, a_k, o_k) + 1; α_t = 1/(t+1)
6:   f = (1 / (|S| |A| |O|)) Σ_{s,a,o} Q(s, a, o)
7:   y = R(s_k, a_k, o_k) + E[ Q(s_{k+1}, π_{k+1}(s_{k+1}), o_k) ] − f
8:   Q(s_k, a_k, o_k) ← (1 − α_t) Q(s_k, a_k, o_k) + α_t · y
9:   Sample a_{k+1} from the distribution π_{k+1}(· | s_{k+1})
10: end for

The algorithm for the constrained Markov decision process with average reward is given in Alg. 2. Most of this algorithm is similar to Algorithm 1. However, in line 1 we also initialize the N-table, which records how many times each (s, a, o) has been visited during learning; the N-table is updated in line 5. Besides, in line 6, f is computed according to Definition 2. Finally, in line 8, the Q-table is updated according to Eq. (18).

5 Evaluation on a Discrete-Time Single-Server Queue

In this section, we evaluate the proposed algorithm on a queuing system with a single server in discrete time. In this model, we assume there is a buffer of finite size L. A possible arrival is assumed to occur at the beginning of the time slot. The state of the system is the number of customers waiting in the queue at the beginning of the time slot, so that |S| = L + 1. We assume there are two kinds of actions, a service action a and a flow action b. The service action space is a finite subset A of [a_min, a_max] with 0 < a_min ≤ a_max < 1.

Table 1: Transition probabilities of the queue system

  Current state      P(x_{t+1} = x_t − 1)   P(x_{t+1} = x_t)          P(x_{t+1} = x_t + 1)
  1 ≤ x_t ≤ L − 1    a(1 − b)               ab + (1 − a)(1 − b)       (1 − a)b
  x_t = L            a                      1 − a                     0
  x_t = 0            0                      1 − b(1 − a)              b(1 − a)

The cost function is set to be c(s, a, b) = s − 5, the constraint function for the service is defined as c^1(s, a, b) = 10a − 5, and the constraint function for the flow is c^2(s, a, b) = 5(1 − b)^2 − 2.
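Below is a small simulator of the queue dynamics in Table 1 and the cost and constraint functions above (our sketch). The action grids are illustrative placeholders: their sizes match the 6 × 5 × 4 × 3 Q-table mentioned later in this section, but the exact action values used in the experiments are not reproduced here.

import numpy as np

L = 5                                             # buffer size, |S| = L + 1 = 6
A_grid = np.array([0.2, 0.4, 0.6, 0.8, 0.9])      # illustrative service actions
B_grid = np.array([0.2, 0.4, 0.6, 0.8])           # illustrative flow actions

def queue_step(x, a, b, rng):
    """Sample x_{t+1} according to Table 1, given state x and actions (a, b)."""
    if x == 0:
        p = [0.0, 1.0 - b * (1.0 - a), b * (1.0 - a)]
    elif x == L:
        p = [a, 1.0 - a, 0.0]
    else:
        p = [a * (1.0 - b), a * b + (1.0 - a) * (1.0 - b), (1.0 - a) * b]
    return x + rng.choice([-1, 0, 1], p=p)

def cost(x, a, b):      return x - 5                       # objective cost c(s, a, b)
def c_service(x, a, b): return 10.0 * a - 5.0              # service constraint c^1
def c_flow(x, a, b):    return 5.0 * (1.0 - b) ** 2 - 2.0  # flow constraint c^2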

(Figure 1: value of the Reward, Service, and Flow constraints versus iteration (×10^5) for Algorithm 1 applied to the discrete-time single-server queue in the discounted case, for several values of λ, including λ = 9.5 and λ = 9.6.)

The proposed Algorithm 1 optimizes the minimal value function among V(s, a, o) with respect to o. We see that the three constraints for λ = 9.5 are close to each other and non-negative, thus demonstrating that the constraints are satisfied and that λ = 9.5 is feasible.

On the other extreme, we see the case λ = 9.7. We note that all three constraints are below 0, which means that there is no feasible policy for this setting. Thus, comparing the cases λ = 9.5 and λ = 9.7, we note that the optimal objective is between the two values. Looking at the case λ = 9.625, we also note that the service constraint is clearly below zero and the constraints are not satisfied. Similarly, for λ = 9.55, the constraints are non-negative; the closest to zero is the service constraint, which crosses zero every few iterations, and thus the gap is within the margin. This shows that the optimal objective is within 9.55 and 9.625. However, the judgment is not as evident between the two regimes: it cannot be clearly determined from λ = 9.575 and λ = 9.6 whether they are feasible or not, since they are not consistently below zero after 80,000 iterations, as in the cases of λ = 9.625 and λ = 9.7, and not mostly above zero, as for λ = 9.5. Thus, looking at the figures, we estimate the value of the optimal objective to be between 9.55 and 9.625.

In order to compare the result with the theoretical optimal total reward, we can assume the dynamics of the MDP are known in advance and use the linear programming algorithm to solve the original problem. The result given by the LP is 9.62. We note that Q(s, a, o) has 6 × 5 × 4 × 3 = 360 elements, and it is possible that 10^5 iterations are not enough for all the elements of the Q-table to converge. Further, sampling 10^4 trajectories can only achieve an accuracy of 0.1 with 99% confidence for the constraint function value, and we are within that range. Thus, more iterations and more samples (especially more samples) would help improve the achievable estimate from 9.55 in the algorithm performance. Overall, considering the limited iterations and sampling in the simulations, we conclude that the result obtained by the proposed algorithm is close to the optimal result obtained by linear programming.

Next, we test this example in the expected average rewards case, which is formulated as

  min_{π^a, π^b} lim_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} c(s_t, π^a(s_t), π^b(s_t)) ]
  s.t. lim_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} c^i(s_t, π^a(s_t), π^b(s_t)) ] ≤ 0,  i = 1, 2.      (27)

When the model is known a priori, this problem can be solved by the linear programming approach (Altman, 1999). With the same reward function and the two constraint functions, the LP shows that the optimal total average reward is about 3.62. In order to compare this result with the LP, we run Algorithm 2 with parameters λ = 3.5 and λ = 3.55. The results are shown in Fig. 2.

(Figure 2: Value of constraints with iterations for the Zero-Sum Markov-Bandit algorithm applied to the discrete-time single-server queue in the expected average case; panels (a) λ = 3.5 and (b) λ = 3.55, constraint value versus iteration (×10^5).)

We see that for λ = 3.5, all three constraints are above 0, which means that there exists a feasible policy when λ = 3.5. However, when λ = 3.55, the flow constraint is not satisfied. Thus, the result obtained by the proposed algorithm is between 3.5 and 3.55, which is close to the optimal result 3.62 (after converting the objective to a constraint). The gap arises for the same reasons as in the discounted case.

Finally, we note that two additional numerical examples can be found in the Appendix.
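The LP baseline referred to above can be set up with stationary occupation measures ρ(s, a, b), a standard construction from (Altman, 1999). The following sketch (ours, using scipy.optimize.linprog) minimizes the average cost subject to the two constraints, assuming the transition kernel P and the cost tables are available as arrays; the constraint direction follows (27).

import numpy as np
from scipy.optimize import linprog

def solve_average_cmdp_lp(P, c, c1, c2):
    """P[s, a, b, s_next]: transition kernel; c, c1, c2: cost tables of shape
    (n, |A|, |B|). Decision variables: occupation measure rho(s, a, b) >= 0."""
    n, na, nb, _ = P.shape
    dim = n * na * nb
    obj = c.reshape(dim)                               # minimize average cost
    # Flow conservation: sum_{a,b} rho(s',a,b) = sum_{s,a,b} rho(s,a,b) P(s'|s,a,b)
    A_eq = np.zeros((n + 1, dim))
    for s_next in range(n):
        out_mask = np.zeros((n, na, nb))
        out_mask[s_next, :, :] = 1.0
        A_eq[s_next] = (out_mask - P[:, :, :, s_next]).reshape(dim)
    A_eq[n] = 1.0                                      # rho sums to one
    b_eq = np.zeros(n + 1)
    b_eq[n] = 1.0
    # Average constraints E[c^i] <= 0, i = 1, 2, as in (27).
    A_ub = np.vstack([c1.reshape(dim), c2.reshape(dim)])
    b_ub = np.zeros(2)
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0.0, None)] * dim)
    rho = res.x.reshape(n, na, nb)
    return res.fun, rho            # optimal average cost and occupation measure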
6 Conclusions

We considered the problem of optimization and learning for constrained and multi-objective Markov decision processes, for both discounted rewards and expected average rewards. We formulated the problems as zero-sum games where one player (the agent) solves a Markov decision problem and its opponent solves a bandit optimization problem, which we call Markov-Bandit games. We extended Q-learning to solve Markov-Bandit games and proved that our new Q-learning algorithms converge to the optimal solutions of the zero-sum Markov-Bandit games, and hence converge to the optimal solutions of the constrained and multi-objective Markov decision problems. The provided numerical examples and the simulation results illustrate that the proposed algorithm converges to the optimal policy.

Combining the long-term constraints considered in this paper with the peak constraints considered in Gattami (2019) and Bai et al. (2021) is an interesting direction for future work.

References

Abels A, Roijers D, Lenaerts T, Nowé A, Steckelmacher D (2019) Dynamic weights in multi-objective deep reinforcement learning. In: Proceedings of the 36th International Conference on Machine Learning, PMLR, vol 97, pp 11–20
Abounadi J, Bertsekas D, Borkar V (2001a) Learning algorithms for Markov decision processes with average cost. SIAM Journal on Control and Optimization 40:681–698
Abounadi J, Bertsekas D, Borkar VS (2001b) Learning algorithms for Markov decision processes with average cost. SIAM Journal on Control and Optimization 40:681–698
Achiam J, Held D, Tamar A, Abbeel P (2017) Constrained policy optimization. In: Proceedings of the 34th International Conference on Machine Learning, ICML'17, pp 22–31
Altman E (1999) Constrained Markov Decision Processes, vol 7. CRC Press
Åström KJ, Wittenmark B (1994) Adaptive Control, 2nd edn. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA
Bai Q, Aggarwal V, Gattami A (2021) Provably efficient model-free algorithm for MDPs with peak constraints. arXiv preprint arXiv:2003.05555
Bertsekas DP (2005) Dynamic Programming and Optimal Control, vol 1. Athena Scientific
Borkar VS (2005) An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters 54(3):207–213
Brantley K, Dudik M, Lykouris T, Miryoosefi S, Simchowitz M, Slivkins A, Sun W (2020) Constrained episodic reinforcement learning in concave-convex and knapsack settings. arXiv preprint arXiv:2006.05051
Chow Y, Ghavamzadeh M, Janson L, Pavone M (2017) Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research 18:167:1–167:51
Ding D, Wei X, Yang Z, Wang Z, Jovanović MR (2020a) Provably efficient safe exploration via primal-dual policy optimization. arXiv preprint arXiv:2003.00534
Ding D, Zhang K, Jovanovic M, Basar T (2020b) Natural policy gradient primal-dual method for constrained Markov decision processes. In: Advances in Neural Information Processing Systems (NeurIPS 2020)
Djonin DV, Krishnamurthy V (2007) MIMO transmission control in fading channels: A constrained Markov decision process formulation with monotone randomized policies. IEEE Transactions on Signal Processing 55(10):5069–5083
Drugan MM, Nowe A (2013) Designing multi-objective multi-armed bandits algorithms: A study. In: The 2013 International Joint Conference on Neural Networks (IJCNN), IEEE, pp 1–8
Efroni Y, Mannor S, Pirotta M (2020) Exploration-exploitation in constrained MDPs. arXiv preprint arXiv:2003.02189
Gattami A (2019) Reinforcement learning of Markov decision processes with peak constraints. arXiv preprint arXiv:1901.07839
Geibel P (2006) Reinforcement learning for MDPs with constraints. In: Machine Learning: ECML 2006, Springer Berlin Heidelberg, pp 646–653
Jaakkola T, Jordan MI, Singh SP (1994) On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation 6(6):1185–1201
Jin C, Allen-Zhu Z, Bubeck S, Jordan MI (2018) Is Q-learning provably efficient? In: Proceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS'18, pp 4868–4878
Littman ML (1994) Markov games as a framework for multi-agent reinforcement learning. In: Proceedings of the Eleventh International Conference on Machine Learning, pp 157–163
Lizotte D, Bowling MH, Murphy AS (2010) Efficient reinforcement learning with multiple reward functions for randomized controlled trial analysis. In: Proceedings of the 27th International Conference on Machine Learning, ICML'10, pp 695–702
Mannor S (2004a) Reinforcement learning for average reward zero-sum games. In: Learning Theory, Springer Berlin Heidelberg, pp 49–63
Mannor S (2004b) Reinforcement learning for average reward zero-sum games. In: Learning Theory, Springer Berlin Heidelberg, pp 49–63
Ni C, Yang LF, Wang M (2019) Learning to control in metric space with optimal regret. In: 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), IEEE, pp 726–733
Parpas P, Rustem B (2009) An algorithm for the global optimization of a class of continuous minimax problems. Journal of Optimization Theory and Applications 141:461–473
Paternain S, Chamon LF, Calvo-Fullana M, Ribeiro A (2019) Constrained reinforcement learning has zero duality gap. arXiv preprint arXiv:1910.13393
Raghu R, Upadhyaya P, Panju M, Agarwal V, Sharma V (2019) Deep reinforcement learning based power control for wireless multicast systems. In: 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), IEEE, pp 1168–1175
Roijers DM, Vamplew P, Whiteson S, Dazeley R (2013) A survey of multi-objective sequential decision-making. Journal of Artificial Intelligence Research 48(1):67–113
Shah SM, Borkar VS (2018) Q-learning for Markov decision processes with a satisfiability criterion. Systems & Control Letters 113:45–51
Silver D, Schrittwieser J, Simonyan K, Antonoglou I, Huang A, Guez A, Hubert T, Baker L, Lai M, Bolton A, Chen Y, Lillicrap T, Hui F, Sifre L, van den Driessche G, Graepel T, Hassabis D (2017) Mastering the game of Go without human knowledge. Nature 550:354–359
Singh R, Gupta A, Shroff NB (2020) Learning in Markov decision processes under constraints. arXiv preprint arXiv:2002.12435
Tessler C, Mankowitz DJ, Mannor S (2018) Reward constrained policy optimization. In: International Conference on Learning Representations
Yang R, Sun X, Narasimhan K (2019) A generalized algorithm for multi-objective reinforcement learning and policy adaptation. arXiv preprint arXiv:1908.08342
Yu M, Yang Z, Kolar M, Wang Z (2019) Convergent policy optimization for safe reinforcement learning. In: Advances in Neural Information Processing Systems 32
Zheng L, Ratliff L (2020) Constrained upper confidence reinforcement learning. In: Proceedings of Machine Learning Research, vol 120, pp 620–629
Zhou D, Chen J, Gu Q (2020) Provable multi-objective reinforcement learning with generative models. arXiv preprint arXiv:2011.10134