Markov Games As a Framework for Multi-Agent Reinforcement Learning

Michael L. Littman
Brown University / Bellcore
Department of Computer Science, Brown University
Providence, RI 02912-1910
[email protected]

Abstract

In the Markov decision process (MDP) formalization of reinforcement learning, a single adaptive agent interacts with an environment defined by a probabilistic transition function. In this solipsistic view, secondary agents can only be part of the environment and are therefore fixed in their behavior. The framework of Markov games allows us to widen this view to include multiple adaptive agents with interacting or competing goals. This paper considers a step in this direction in which exactly two agents with diametrically opposed goals share an environment. It describes a Q-learning-like algorithm for finding optimal policies and demonstrates its application to a simple two-player game in which the optimal policy is probabilistic.

1 INTRODUCTION

No agent lives in a vacuum; it must interact with other agents to achieve its goals. Reinforcement learning is a promising technique for creating agents that co-exist [Tan, 1993, Yanco and Stein, 1993], but the mathematical framework that justifies it is inappropriate for multi-agent environments. The theory of Markov Decision Processes (MDP's) [Barto et al., 1989, Howard, 1960], which underlies much of the recent work on reinforcement learning, assumes that the agent's environment is stationary and as such contains no other adaptive agents.

The theory of games [von Neumann and Morgenstern, 1947] is explicitly designed for reasoning about multi-agent systems. Markov games (see, e.g., [Van Der Wal, 1981]) are an extension of game theory to MDP-like environments. This paper considers the consequences of using the Markov game framework in place of MDP's in reinforcement learning. Only the specific case of two-player zero-sum games is addressed, but even in this restricted version there are insights that can be applied to open questions in the field of reinforcement learning.

2 DEFINITIONS

An MDP [Howard, 1960] is defined by a set of states, S, and a set of actions, A. A transition function, T : S × A → PD(S), defines the effects of the various actions on the state of the environment. (PD(S) represents the set of discrete probability distributions over the set S.) The reward function, R : S × A → ℝ, specifies the agent's task.

In broad terms, the agent's objective is to find a policy, mapping its interaction history to a current choice of action, so as to maximize the expected sum of discounted reward, E[Σ_{j=0}^∞ γ^j r_{t+j}], where r_{t+j} is the reward received j steps into the future. A discount factor, γ, with 0 ≤ γ < 1, controls how much effect future rewards have on the optimal decisions, with small values of γ emphasizing near-term gain and larger values giving significant weight to later rewards.

In its general form, a Markov game, sometimes called a stochastic game [Owen, 1982], is defined by a set of states, S, and a collection of action sets, A_1, ..., A_k, one for each agent in the environment. State transitions are controlled by the current state and one action from each agent: T : S × A_1 × ... × A_k → PD(S). Each agent i also has an associated reward function, R_i : S × A_1 × ... × A_k → ℝ, and attempts to maximize its expected sum of discounted rewards, E[Σ_{j=0}^∞ γ^j r_{i,t+j}], where r_{i,t+j} is the reward received j steps into the future by agent i.

In this paper, we consider a well-studied specialization in which there are only two agents and they have diametrically opposed goals. This allows us to use a single reward function that one agent tries to maximize and the other, called the opponent, tries to minimize.
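To ground these definitions, the short sketch below (an editorial illustration, not code from the paper) shows one way a finite two-player zero-sum Markov game can be represented with tables and how the discounted-reward objective is evaluated on a sampled reward sequence. The class name, array shapes, and example values are assumptions made for the sketch.

```python
import numpy as np


class ZeroSumMarkovGame:
    """Finite two-player zero-sum Markov game (illustrative sketch; names are assumptions).

    T[s, a, o] is a probability distribution over next states, and R[s, a, o] is the
    reward paid to the agent (and charged to the opponent) when the agent plays action a
    and the opponent plays action o in state s.  With |O| = 1 this reduces to an MDP,
    and with |S| = 1 it reduces to a matrix game.
    """

    def __init__(self, T, R, gamma=0.9):
        assert T.ndim == 4 and R.ndim == 3 and 0.0 <= gamma < 1.0
        self.T, self.R, self.gamma = T, R, gamma
        self.n_states, self.n_agent_actions, self.n_opponent_actions = R.shape


def discounted_return(rewards, gamma):
    """Sum_j gamma^j * r_j for one sampled trajectory of rewards."""
    return sum(gamma ** j * r for j, r in enumerate(rewards))


if __name__ == "__main__":
    n_s, n_a, n_o = 2, 2, 2
    T = np.full((n_s, n_a, n_o, n_s), 1.0 / n_s)   # uniform transitions, just for shape checking
    R = np.zeros((n_s, n_a, n_o))
    R[0, 0, 1] = 1.0                               # an arbitrary nonzero reward entry
    game = ZeroSumMarkovGame(T, R, gamma=0.9)
    print(game.n_states, game.n_agent_actions, game.n_opponent_actions)  # 2 2 2
    print(discounted_return([0.0, 1.0, -1.0, 1.0], gamma=game.gamma))    # ≈ 0.819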
In this paper, we use A to denote the agent's action set, O to denote the opponent's action set, and R(s, a, o) to denote the immediate reward to the agent for taking action a ∈ A in state s ∈ S when its opponent takes action o ∈ O.

Adopting this specialization, which we call a two-player zero-sum Markov game, simplifies the mathematics but makes it impossible to consider important phenomena such as cooperation. However, it is a first step and can be considered a strict generalization of both MDP's (when |O| = 1) and matrix games (when |S| = 1).

As in MDP's, the discount factor, γ, can be thought of as the probability that the game will be allowed to continue after the current move. It is possible to define a notion of undiscounted rewards [Schwartz, 1993], but not all Markov games have optimal strategies in the undiscounted case [Owen, 1982]. This is because, in many games, it is best to postpone risky actions indefinitely. For current purposes, the discount factor has the desirable effect of goading the players into trying to win sooner rather than later.

                         Agent
                   rock   paper   scissors
  Opponent rock       0       1        -1
           paper     -1       0         1
           scissors   1      -1         0

Table 1: The matrix game for "rock, paper, scissors."

3 OPTIMAL POLICIES

The previous section defined the agent's objective as maximizing the expected sum of discounted reward. There are subtleties in applying this definition to Markov games, however. First, we consider the parallel scenario in MDP's.

In an MDP, an optimal policy is one that maximizes the expected sum of discounted reward and is undominated, meaning that there is no state from which any other policy can achieve a better expected sum of discounted reward. Every MDP has at least one optimal policy, and of the optimal policies for a given MDP, at least one is stationary and deterministic. This means that, for any MDP, there is a policy π : S → A that is optimal. The policy is called stationary since it does not change as a function of time, and it is called deterministic since the same action is always chosen whenever the agent is in state s, for all s ∈ S.

For many Markov games, there is no policy that is undominated because performance depends critically on the choice of opponent. In the game theory literature, the resolution to this dilemma is to eliminate the choice and evaluate each policy with respect to the opponent that makes it look the worst. This performance measure prefers conservative strategies that can force any opponent to a draw to more daring ones that accrue a great deal of reward against some opponents and lose a great deal to others. This is the essence of minimax: Behave so as to maximize your reward in the worst case.

Given this definition of optimality, Markov games have several important properties. Like MDP's, every Markov game has a non-empty set of optimal policies, at least one of which is stationary. Unlike MDP's, there need not be a deterministic optimal policy. Instead, the optimal stationary policy is sometimes probabilistic, mapping states to discrete probability distributions over actions, π : S → PD(A). A classic example is "rock, paper, scissors," in which any deterministic policy can be consistently defeated.

The idea that optimal policies are sometimes stochastic may seem strange to readers familiar with MDP's or games with alternating turns like backgammon or tic-tac-toe, since in these frameworks there is always a deterministic policy that does no worse than the best probabilistic one. The need for probabilistic action choice stems from the agent's uncertainty of its opponent's current move and its requirement to avoid being "second guessed."
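As a concrete check on these claims (an editorial illustration, not part of the original paper), the sketch below evaluates two "rock, paper, scissors" policies under the minimax criterion, that is, against the opponent action that makes each policy look worst. It uses the payoff matrix of Table 1 above; the function and variable names are assumptions made for the example.

```python
import numpy as np

# Payoff matrix from Table 1: R[o, a] is the agent's reward for action a
# when the opponent plays o (rows and columns ordered rock, paper, scissors).
R = np.array([[ 0.0,  1.0, -1.0],
              [-1.0,  0.0,  1.0],
              [ 1.0, -1.0,  0.0]])

def worst_case_value(pi):
    """Minimax evaluation: expected reward of policy pi against the worst opponent action."""
    return min(float(R[o] @ pi) for o in range(R.shape[0]))

always_rock = np.array([1.0, 0.0, 0.0])   # a deterministic policy
uniform = np.full(3, 1.0 / 3.0)           # the probabilistic optimal policy

print(worst_case_value(always_rock))  # -1.0: the opponent simply answers with paper
print(worst_case_value(uniform))      #  0.0: no opponent action does better than a draw
```

Any deterministic policy scores -1 under this measure, while the uniform policy guarantees a draw, which is exactly why the optimal stationary policy here must be probabilistic.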
4 FINDING OPTIMAL POLICIES

This section reviews methods for finding optimal policies for matrix games, MDP's, and Markov games. It uses a uniform notation that is intended to emphasize the similarities between the three frameworks. To avoid confusion, function names that appear more than once appear with different numbers of arguments each time.

4.1 MATRIX GAMES

At the core of the theory of games is the matrix game defined by a matrix, R, of instantaneous rewards. Component R[o, a] is the reward to the agent for choosing action a when its opponent chooses action o. The agent strives to maximize its expected reward while the opponent tries to minimize it. Table 1 gives the matrix game corresponding to "rock, paper, scissors."

The agent's policy is a probability distribution over actions, π ∈ PD(A). For "rock, paper, scissors," π is made up of three components: π_rock, π_paper, and π_scissors. According to the notion of optimality discussed earlier, the optimal agent's minimum expected reward should be as large as possible. How can we find a policy that achieves this? Imagine that we would be satisfied with a policy that is guaranteed an expected score of V no matter which action the opponent chooses. The inequalities in Table 2 constrain the components of π to represent exactly those policies; any solution to the inequalities would suffice.

  π_paper − π_scissors ≥ V         (vs. rock)
  π_scissors − π_rock ≥ V          (vs. paper)
  π_rock − π_paper ≥ V             (vs. scissors)
  π_rock + π_paper + π_scissors = 1

Table 2: Linear constraints on the solution to a matrix game.

For π to be optimal, we must identify the largest V for which there is some value of π that makes the constraints hold. Linear programming (see, e.g., [Strang, 1980]) is a general technique for solving problems of this kind. In this example, linear programming finds a value of 0 for V and (1/3, 1/3, 1/3) for π. We can abbreviate this linear program as:

    V = max_{π ∈ PD(A)} min_{o ∈ O} Σ_{a ∈ A} R[o, a] π_a,

where Σ_{a ∈ A} R[o, a] π_a expresses the expected reward to the agent for using policy π against the opponent's action o.

4.2 MDP's

There is a host of methods for solving MDP's.
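The linear program of Section 4.1 can be checked numerically. The sketch below is an editorial illustration (the paper prescribes linear programming but no particular solver; the use of scipy.optimize.linprog is an assumption of the example): it encodes the constraints of Table 2 with V as an extra decision variable and recovers the policy (1/3, 1/3, 1/3) with value 0.

```python
import numpy as np
from scipy.optimize import linprog

# Payoff matrix R[o, a]: reward to the agent for action a when the opponent plays o
# (rows and columns ordered rock, paper, scissors, matching Table 1).
R = np.array([[ 0.0,  1.0, -1.0],
              [-1.0,  0.0,  1.0],
              [ 1.0, -1.0,  0.0]])

n_o, n_a = R.shape
# Decision variables: x = (pi_1, ..., pi_{n_a}, V); maximize V, i.e. minimize -V.
c = np.zeros(n_a + 1)
c[-1] = -1.0

# For every opponent action o:  sum_a R[o, a] * pi_a >= V   <=>   -R[o, :] @ pi + V <= 0
A_ub = np.hstack([-R, np.ones((n_o, 1))])
b_ub = np.zeros(n_o)

# The policy components must sum to one.
A_eq = np.concatenate([np.ones(n_a), [0.0]]).reshape(1, -1)
b_eq = np.array([1.0])

bounds = [(0.0, 1.0)] * n_a + [(None, None)]   # probabilities in [0, 1], V unbounded
res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
pi, V = res.x[:-1], res.x[-1]
print("optimal policy:", np.round(pi, 3), "value:", round(V, 3))  # ~ [0.333 0.333 0.333], 0.0
```

The same construction works for any payoff matrix R, which is what makes linear programming the standard tool for solving matrix games.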
Recommended publications
  • Stochastic Game Theory Applications for Power Management in Cognitive Networks
  • Deterministic and Stochastic Prisoner's Dilemma Games: Experiments in Interdependent Security
  • Probabilistic Goal Markov Decision Processes
  • Stochastic Games with Hidden States
  • Semi-Markov Decision Processes
  • Graph-Based State Representation for Deep Reinforcement Learning
  • Minimax Sample Complexity for Turn-Based Stochastic Game
  • A Brief Taster of Markov Decision Processes and Stochastic Games
  • Markov Decision Processes
  • Markovian Competitive and Cooperative Games with Applications to Communications
  • Factored Markov Decision Process Models for Stochastic Unit Commitment
  • Capturing Financial Markets to Apply Deep Reinforcement Learning