Reinforcement Learning for Constrained Markov Decision Processes
Ather Gattami (AI Sweden), Qinbo Bai (Purdue University), Vaneet Aggarwal (Purdue University)
Abstract

In this paper, we consider the problem of optimization and learning for constrained and multi-objective Markov decision processes, for both discounted rewards and expected average rewards. We formulate the problems as zero-sum games where one player (the agent) solves a Markov decision problem and its opponent solves a bandit optimization problem, which we here call Markov-Bandit games. We extend Q-learning to solve Markov-Bandit games and show that our new Q-learning algorithms converge to the optimal solutions of the zero-sum Markov-Bandit games, and hence converge to the optimal solutions of the

a given reward by interacting with the surrounding environment. The interaction teaches the agent how to maximize its reward without knowing the underlying dynamics of the process. A classical example is swinging up a pendulum in an upright position. By making several attempts to swing up a pendulum and balancing it, one might be able to learn the necessary forces that need to be applied in order to balance the pendulum without knowing the physical model behind it, which is the general approach of classical model based control theory (Åström and Wittenmark, 1994).

Informally, the problem of constrained reinforcement learning for Markov decision processes is described as follows. Given a stochastic process with state s_k at time step k, reward function r, constraint function r_j, and a discount factor 0 <
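The zero-sum view described in the abstract can be illustrated with a minimal sketch. This is an assumption-laden toy, not the authors' algorithm: it assumes the agent runs tabular Q-learning on a scalarized reward, while the opponent plays a bandit-style update on a single nonnegative weight `lam` that penalizes constraint violation. All names (`P`, `r0`, `r1`, `lam`, `eta`) and the specific update rules are illustrative choices, not the paper's notation.

```python
import numpy as np

# Toy constrained MDP: 2 states, 2 actions, random dynamics and rewards.
rng = np.random.default_rng(0)
nS, nA, gamma = 2, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # transition kernel P[s, a] -> dist over next states
r0 = rng.random((nS, nA))                        # reward to maximize
r1 = rng.random((nS, nA)) - 0.5                  # constraint signal; want its expectation >= 0

Q = np.zeros((nS, nA))
lam, eta, alpha, eps = 1.0, 0.01, 0.1, 0.1       # opponent weight, step sizes, exploration
s = 0
for t in range(20000):
    # epsilon-greedy action for the agent
    a = rng.integers(nA) if rng.random() < eps else int(Q[s].argmax())
    s2 = rng.choice(nS, p=P[s, a])
    # Agent's step: standard Q-learning, but on the opponent-weighted reward r0 + lam * r1.
    target = r0[s, a] + lam * r1[s, a] + gamma * Q[s2].max()
    Q[s, a] += alpha * (target - Q[s, a])
    # Opponent's step: a bandit-style projected gradient on lam, shrinking lam
    # when the constraint signal is positive (satisfied) and growing it when violated.
    lam = max(0.0, lam - eta * r1[s, a])
    s = s2
```

The two coupled updates are the point of the sketch: the agent treats `lam` as fixed within each step, and the opponent adapts `lam` from the observed constraint signal alone, i.e. as a bandit problem, without needing the MDP dynamics.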