An Overview of Multi-Agent Reinforcement Learning from Game Theoretical Perspective

Total Pages: 16

File Type: PDF, Size: 1020 KB

An Overview of Multi-Agent Reinforcement Learning from Game Theoretical Perspective

Yaodong Yang∗ 1,2 and Jun Wang 1,2
1 University College London, 2 Huawei R&D U.K.
arXiv:2011.00583v3 [cs.MA] 18 Mar 2021

Abstract

Following the remarkable success of the AlphaGO series, 2019 was a booming year that witnessed significant advances in multi-agent reinforcement learning (MARL) techniques. MARL corresponds to the learning problem in a multi-agent system in which multiple agents learn simultaneously. It is an interdisciplinary domain with a long history that includes game theory, machine learning, stochastic control, psychology, and optimisation. Although MARL has achieved considerable empirical success in solving real-world games, there is a lack of a self-contained overview in the literature that elaborates the game theoretical foundations of modern MARL methods and summarises the recent advances. In fact, the majority of existing surveys are outdated and do not fully cover the recent developments since 2010. In this work, we provide a monograph on MARL that covers both the fundamentals and the latest developments in the research frontier.

Our work is separated into two parts. From Sections 1 to 4, we present the self-contained fundamental knowledge of MARL, including problem formulations, basic solutions, and existing challenges. Specifically, we present the MARL formulations through two representative frameworks, namely, stochastic games and extensive-form games, along with different variations of games that can be addressed. The goal of this part is to enable the readers, even those with minimal related background, to grasp the key ideas in MARL research. From Sections 5 to 9, we present an overview of recent developments of MARL algorithms. Starting from new taxonomies for MARL methods, we conduct a survey of previous survey papers. In later sections, we highlight several modern topics in MARL research, including Q-function factorisation, multi-agent soft learning, networked multi-agent MDP, stochastic potential games, zero-sum continuous games, online MDP, turn-based stochastic games, policy space response oracle, approximation methods in general-sum games, and mean-field type learning in games with infinite agents. Within each topic, we select both the most fundamental and cutting-edge algorithms.

The goal of our monograph is to provide a self-contained assessment of the current state-of-the-art MARL techniques from a game theoretical perspective. We expect this work to serve as a stepping stone for both new researchers who are about to enter this fast-growing domain and existing domain experts who want to obtain a panoramic view and identify new directions based on recent advances.

∗ This manuscript is under active development. We appreciate any constructive comments and suggestions; please send them to: <[email protected]>.

Contents

1 Introduction
  1.1 A Short History of RL
  1.2 2019: A Booming Year for MARL
2 Single-Agent RL
  2.1 Problem Formulation: Markov Decision Process
  2.2 Justification of Reward Maximisation
  2.3 Solving Markov Decision Processes
    2.3.1 Value-Based Methods
    2.3.2 Policy-Based Methods
3 Multi-Agent RL
  3.1 Problem Formulation: Stochastic Game
  3.2 Solving Stochastic Games
    3.2.1 Value-Based MARL Methods
    3.2.2 Policy-Based MARL Methods
    3.2.3 Solution Concept of the Nash Equilibrium
    3.2.4 Special Types of Stochastic Games
    3.2.5 Partially Observable Settings
  3.3 Problem Formulation: Extensive-Form Game
    3.3.1 Normal-Form Representation
    3.3.2 Sequence-Form Representation
  3.4 Solving Extensive-Form Games
    3.4.1 Perfect-Information Games
    3.4.2 Imperfect-Information Games
4 Grand Challenges of MARL
  4.1 The Combinatorial Complexity
  4.2 The Multi-Dimensional Learning Objectives
  4.3 The Non-Stationarity Issue
  4.4 The Scalability Issue when N ≫ 2
5 A Survey of MARL Surveys
  5.1 Taxonomy of MARL Algorithms
  5.2 A Survey of Surveys
6 Learning in Identical-Interest Games
  6.1 Stochastic Team Games
    6.1.1 Solutions via Q-function Factorisation
    6.1.2 Solutions via Multi-Agent Soft Learning
  6.2 Dec-POMDPs
  6.3 Networked Multi-Agent MDPs
  6.4 Stochastic Potential Games
7 Learning in Zero-Sum Games
  7.1 Discrete State-Action Games
  7.2 Continuous State-Action Games
  7.3 Extensive-Form Games
    7.3.1 Variations of Fictitious Play
    7.3.2 Counterfactual Regret Minimisation
  7.4 Online Markov Decision Processes
  7.5 Turn-Based Stochastic Games
  7.6 Open-Ended Meta-Games
8 Learning in General-Sum Games
  8.1 Solutions by Mathematical Programming
  8.2 Solutions by Value-Based Methods
  8.3 Solutions by Two-Timescale Analysis
  8.4 Solutions by Policy-Based Methods
9 Learning in Games when N → +∞
  9.1 Non-cooperative Mean-Field Game
  9.2 Cooperative Mean-Field Control
  9.3 Mean-Field MARL
10 Future Directions of Interest
Bibliography

1 Introduction

Machine learning can be considered as the process of converting data into knowledge (Shalev-Shwartz and Ben-David, 2014). The input of a learning algorithm is training data (for example, images containing cats), and the output is some knowledge (for example, rules about how to detect cats in an image). This knowledge is usually represented as a computer program that can perform certain task(s) (for example, an automatic cat detector). In the past decade, considerable progress has been made by means of a special kind of machine learning technique: deep learning (LeCun et al., 2015). One of the critical embodiments of deep learning is different kinds of deep neural networks (DNNs) (Schmidhuber, 2015) that can find disentangled representations (Bengio, 2009) in high-dimensional data, which allows the software to train itself to perform new tasks rather than relying merely on the programmer to design hand-crafted rules. An uncountable number of breakthroughs in real-world AI applications have been achieved through the usage of DNNs, with the domains of computer vision (Krizhevsky et al., 2012) and natural language processing (Brown et al., 2020; Devlin et al., 2018) being the greatest beneficiaries.

In addition to feature recognition from existing data, modern AI applications often require computer programs to make decisions based on acquired knowledge (see Figure 1). To illustrate the key components of decision making, let us consider the real-world example of controlling a car to drive safely through an intersection. At each time step, a robot car can move by steering, accelerating and braking. The goal is to safely exit the intersection and reach the destination (with possible decisions of going straight or turning left/right into another lane).
Therefore, in addition to being able to detect objects, such as traffic lights, lane markings, and other cars (by converting data to knowledge), we aim to find a steering policy that can control the car to make a sequence of manoeuvres to achieve the goal (making decisions based on the knowledge gained). In a decision-making setting such as this, two additional challenges arise:

1. First, during the decision-making process, at each time step, the robot car should consider not only the immediate value of its current action but also the consequences of its current action in the future. For example, in the case of driving through an intersection, it would be detrimental to have a policy that chooses to steer in a "safe" direction at the beginning of the process if it would eventually lead to a car crash later on.

2. Second, to make each decision correctly and safely, the car must also consider other cars' behaviour and act accordingly. Human drivers, for example, often predict in advance other cars' movements and then take strategic moves in response (like giving way to an oncoming car or accelerating to merge into another lane).

[Figure 1: Modern AI applications are being transformed from pure feature recognition (for example, detecting a cat in an image) to decision making (driving through a traffic intersection safely), where interaction among multiple agents inevitably occurs. As a result, each agent has to behave strategically. Furthermore, the problem becomes more challenging because current decisions influence future outcomes.]

The need for an adaptive decision-making framework, together with the complexity of addressing multiple interacting learners, has led to the development of multi-agent RL. Multi-agent RL tackles the sequential decision-making problem of having multiple intelligent agents that operate in a shared stochastic environment, each of which aims to maximise its long-term reward through interacting with the environment and the other agents. Multi-agent RL is built on the knowledge of both multi-agent systems (MAS) and RL. In the next section, we provide a brief overview of (single-agent) RL and the research developments in recent decades.

1.1 A Short History of RL

RL is a sub-field of machine learning in which agents learn how to behave optimally through a trial-and-error procedure during their interaction with the environment. Unlike supervised learning, which takes labelled data as input (for example, an image labelled with cats), RL is goal-oriented: it constructs a learning model that learns to achieve the optimal long-term goal by improvement through trial and error, with the learner having no labelled data from which to obtain knowledge. The word "reinforcement" refers to the learning mechanism, since actions that lead to satisfactory outcomes are reinforced in the learner's set of behaviours.
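To make the notion of "long-term reward" used above concrete, the trial-and-error learner's goal is usually formalised as maximising the expected discounted return; written in standard RL notation (generic notation, not specific to this monograph):

\[
J(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right], \qquad \gamma \in [0,1),
\]

where the discount factor γ is precisely what trades off the immediate value of an action against its future consequences, as in the intersection example above.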
Recommended publications
  • A History of Von Neumann and Morgenstern's Theory of Games (Escolha Sob Incerteza Axiomática)
Marcos Thiago Graciani. Axiomatic choice under uncertainty: a history of von Neumann and Morgenstern's Theory of Games (Escolha sob incerteza axiomática: uma história do Theory of Games de von Neumann e Morgenstern). São Paulo, 2019, 133 pages. Master's dissertation presented to the Department of Economics of the School of Economics, Business Administration and Accounting at the University of São Paulo (FEA-USP) as a partial requirement for the degree of Master of Science. Concentration area: Economic Theory. Advisor: Pedro Garcia Duarte. Corrected version (the original version is available in the library of the School of Economics, Business Administration and Accounting). Keywords: 1. History of game theory. 2. Axiomatics. 3. Uncertainty. 4. Von Neumann and Morgenstern.
  • Game Theory Lecture Notes
Game Theory: Penn State Math 486 Lecture Notes, Version 2.1.1. Christopher Griffin © 2010-2021. Licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. With major contributions by James Fan and George Kesidis, and other contributions by Arlan Stutler and Sarthak Shah.
Contents: List of Figures. Preface: 1. Using These Notes; 2. An Overview of Game Theory.
Chapter 1. Probability Theory and Games Against the House: 1. Probability; 2. Random Variables and Expected Values; 3. Conditional Probability; 4. The Monty Hall Problem.
Chapter 2. Game Trees and Extensive Form: 1. Graphs and Trees; 2. Game Trees with Complete Information and No Chance; 3. Game Trees with Incomplete Information; 4. Games of Chance; 5. Pay-off Functions and Equilibria.
Chapter 3. Normal and Strategic Form Games and Matrices: 1. Normal and Strategic Form; 2. Strategic Form Games; 3. Review of Basic Matrix Properties; 4. Special Matrices and Vectors; 5. Strategy Vectors and Matrix Games.
Chapter 4. Saddle Points, Mixed Strategies and the Minimax Theorem: 1. Saddle Points; 2. Zero-Sum Games without Saddle Points; 3. Mixed Strategies; 4. Mixed Strategies in Matrix Games; 5. Dominated Strategies and Nash Equilibria; 6. The Minimax Theorem; 7. Finding Nash Equilibria in Simple Games; 8. A Note on Nash Equilibria in General.
Chapter 5. An Introduction to Optimization and the Karush-Kuhn-Tucker Conditions: 1. A General Maximization Formulation; 2. Some Geometry for Optimization; 3. ...
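Since the excerpt above is only a table of contents, a one-line statement of the central result it builds to, the minimax theorem for finite zero-sum matrix games (standard material, stated here rather than quoted from the notes), may help orient the reader:

\[
\max_{x \in \Delta_m} \min_{y \in \Delta_n} x^{\top} A\, y \;=\; \min_{y \in \Delta_n} \max_{x \in \Delta_m} x^{\top} A\, y,
\]

where A is the m×n payoff matrix for the row player and Δ_m, Δ_n are the sets of mixed strategies (probability vectors) of the two players.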
  • Introduction to Probability
Massachusetts Institute of Technology, Course Notes 10, 6.042J/18.062J, Fall '02: Mathematics for Computer Science. November 4. Professor Albert Meyer and Dr. Radhika Nagpal. Revised November 6, 2002, 572 minutes.
Introduction to Probability
1 Probability. Probability will be the topic for the rest of the term. Probability is one of the most important subjects in Mathematics and Computer Science. Most upper-level Computer Science courses require probability in some form, especially in analysis of algorithms and data structures, but also in information theory, cryptography, control and systems theory, network design, artificial intelligence, and game theory. Probability also plays a key role in fields such as Physics, Biology, Economics and Medicine. There is a close relationship between Counting/Combinatorics and Probability. In many cases, the probability of an event is simply the fraction of possible outcomes that make up the event. So many of the rules we developed for finding the cardinality of finite sets carry over to Probability Theory. For example, we'll apply an Inclusion-Exclusion principle for probabilities in some examples below. In principle, probability boils down to a few simple rules, but it remains a tricky subject because these rules often lead to unintuitive conclusions. Using "common sense" reasoning about probabilistic questions is notoriously unreliable, as we'll illustrate with many real-life examples. This reading is longer than usual. To keep things in bounds, several sections with illustrative examples that do not introduce new concepts are marked "[Optional]". You should read these sections selectively, choosing those where you're unsure about some idea and think another example would be helpful.
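For reference, the two-event case of the Inclusion-Exclusion principle mentioned above (a standard identity, stated here rather than quoted from the notes) reads:

\[
\Pr[A \cup B] \;=\; \Pr[A] + \Pr[B] - \Pr[A \cap B],
\]

which mirrors the counting rule |A ∪ B| = |A| + |B| − |A ∩ B| for finite sets, with probabilities in place of cardinalities.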
  • Probabilistic Goal Markov Decision Processes∗
Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence. Probabilistic Goal Markov Decision Processes∗. Huan Xu (Department of Mechanical Engineering, National University of Singapore, Singapore, [email protected]) and Shie Mannor (Department of Electrical Engineering, Technion, Israel, [email protected]).
Abstract: The Markov decision process model is a powerful tool in planning tasks and sequential decision making problems. The randomness of state transitions and rewards implies that the performance of a policy is often stochastic. In contrast to the standard approach that studies the expected performance, we consider the policy that maximizes the probability of achieving a pre-determined target performance, a criterion we term probabilistic goal Markov decision processes. We show that this problem is NP-hard, but can be solved using a pseudo-polynomial algorithm. We further consider a variant dubbed "chance-constraint Markov decision problems," that treats the probability of achieving target performance as a constraint instead of the maximizing objective.
From the introduction: ... also concluded that in daily decision making, people tend to regard risk primarily as failure to achieve a predetermined goal [Lanzillotti, 1958; Simon, 1959; Mao, 1970; Payne et al., 1980; 1981]. Different variants of the probabilistic goal formulation, such as chance constrained programs, have been extensively studied in single-period optimization [Miller and Wagner, 1965; Prékopa, 1970]. However, little has been done in the context of sequential decision problems including MDPs. The standard approaches in risk-averse MDPs include maximization of expected utility function [Bertsekas, 1995], and optimization of a coherent risk measure [Riedel, 2004; Le Tallec, 2007]. Both approaches lead to formulations that cannot be solved in polynomial time, except for special cases including exponential utility function [Chung and Sobel, 1987], piecewise linear utility function with a single ...
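To make the "probabilistic goal" criterion described above concrete, a generic way to write it (notation assumed here, not taken from the paper) is:

\[
\max_{\pi} \;\; \Pr\!\left[\, \sum_{t=0}^{T} r(s_t, a_t) \;\ge\; V \,\right],
\]

i.e., instead of maximising the expected cumulative reward, the agent maximises the probability that the cumulative reward reaches a pre-determined target level V; the chance-constrained variant keeps this probability above a threshold while optimising another objective.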
  • Semi-Markov Decision Processes
SEMI-MARKOV DECISION PROCESSES. M. Baykal-Gürsoy, Department of Industrial and Systems Engineering, Rutgers University, Piscataway, New Jersey. Email: [email protected]
Abstract: Considered are infinite horizon semi-Markov decision processes (SMDPs) with finite state and action spaces. Total expected discounted reward and long-run average expected reward optimality criteria are reviewed. Solution methodology for each criterion is given; constraints and variance sensitivity are also discussed.
1 Introduction. Semi-Markov decision processes (SMDPs) are used in modeling stochastic control problems arising in Markovian dynamic systems where the sojourn time in each state is a general continuous random variable. They are powerful, natural tools for the optimization of queues [20, 44, 41, 18, 42, 43, 21], production scheduling [35, 31, 2, 3], and reliability/maintenance [22]. For example, in a machine replacement problem with deteriorating performance over time, a decision maker, after observing the current state of the machine, decides whether to continue its usage, initiate a maintenance (preventive or corrective) repair, or replace the machine. There is a reward or cost structure associated with the states and decisions, and an information pattern available to the decision maker. This decision depends on a performance measure over the planning horizon, which is either finite or infinite, such as total expected discounted or long-run average expected reward/cost, with or without external constraints, and variance-penalized average reward. SMDPs are based on semi-Markov processes (SMPs) [9] [Semi-Markov Processes], which include renewal processes [Definition and Examples of Renewal Processes] and continuous-time Markov chains (CTMCs) [Definition and Examples of CTMCs] as special cases.
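As an illustration of the first criterion mentioned above, one common way to write the total expected discounted reward of an SMDP policy π, with lump-sum rewards collected at decision epochs (notation assumed here, not taken from the article), is:

\[
v_{\alpha}^{\pi}(s) \;=\; \mathbb{E}^{\pi}_{s}\!\left[\, \sum_{n=0}^{\infty} e^{-\alpha t_n}\, r(s_n, a_n) \,\right],
\]

where t_n is the (random) time of the n-th decision epoch, α > 0 is a continuous-time discount rate, and the sojourn times t_{n+1} − t_n follow general distributions that may depend on the current state and action.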
  • Graph-based State Representation for Deep Reinforcement Learning (arXiv:2004.13965v3 [cs.LG] 16 Feb 2021)
Graph-based State Representation for Deep Reinforcement Learning. Vikram Waradpande, Daniel Kudenko, and Megha Khosla. L3S Research Center, Leibniz University, Hannover. {waradpande,khosla,kudenko}@l3s.de
Abstract. Deep RL approaches build much of their success on the ability of the deep neural network to generate useful internal representations. Nevertheless, they suffer from a high sample-complexity, and starting with a good input representation can have a significant impact on performance. In this paper, we exploit the fact that the underlying Markov decision process (MDP) represents a graph, which enables us to incorporate the topological information for effective state representation learning. Motivated by the recent success of node representations for several graph analytical tasks, we specifically investigate the capability of node representation learning methods to effectively encode the topology of the underlying MDP in Deep RL. To this end we perform a comparative analysis of several models chosen from 4 different classes of representation learning algorithms for policy learning in grid-world navigation tasks, which are representative of a large class of RL problems. We find that all embedding methods outperform the commonly used matrix representation of grid-world environments in all of the studied cases. Moreover, graph convolution based methods are outperformed by simpler random walk based methods and graph linear autoencoders.
1 Introduction. A good problem representation has been known to be crucial for the performance of AI algorithms. This is not different in the case of reinforcement learning (RL), where representation learning has been a focus of investigation.
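As a rough illustration of the idea of encoding MDP topology in the state representation, the sketch below uses a simple Laplacian-eigenvector embedding of a grid-world graph; this is only a stand-in for the node-embedding models actually compared in the paper, and the function names and sizes are made up:

```python
import numpy as np
import networkx as nx

def grid_world_state_embeddings(width=8, height=8, dim=4):
    """Embed each grid-world state using the first `dim` non-trivial
    eigenvectors of the graph Laplacian of the underlying MDP graph."""
    # Nodes are grid cells; edges connect cells reachable by one move.
    g = nx.grid_2d_graph(width, height)
    nodes = list(g.nodes())
    lap = nx.laplacian_matrix(g, nodelist=nodes).toarray().astype(float)
    # Low-order Laplacian eigenvectors capture the graph's large-scale topology.
    eigvals, eigvecs = np.linalg.eigh(lap)
    embedding = eigvecs[:, 1:dim + 1]          # skip the constant eigenvector
    return {node: embedding[i] for i, node in enumerate(nodes)}

if __name__ == "__main__":
    emb = grid_world_state_embeddings()
    print(emb[(0, 0)])   # feature vector that could replace a one-hot state input
```

In a Deep RL agent, such per-state vectors would be fed to the policy/value network in place of the raw matrix (one-hot) representation that the paper reports as the common baseline.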
  • Turing Learning with Nash Memory
Turing Learning with Nash Memory: Machine Learning, Game Theory and Robotics. Shuai Wang. Master of Science Thesis, University of Amsterdam. Local Supervisor: Dr. Leen Torenvliet (University of Amsterdam, NL). Co-Supervisor: Dr. Frans Oliehoek (University of Liverpool, UK). External Advisor: Dr. Roderich Groß (University of Sheffield, UK).
Abstract: Turing Learning is a method for the reverse engineering of agent behaviors. This approach was inspired by the Turing test, where a machine can pass if its behaviour is indistinguishable from that of a human. Nash memory is a memory mechanism for coevolution; it guarantees monotonicity in convergence. This thesis explores the integration of such a memory mechanism with Turing Learning for faster learning of agent behaviors. We employ the Enki robot simulation platform and learn the aggregation behavior of e-puck robots. Our experiments indicate that using Nash memory can reduce the computation time by 35.4% and results in faster convergence for the aggregation game. In addition, we present TuringLearner, the first Turing Learning platform. Keywords: Turing Learning, Nash Memory, Game Theory, Multi-agent System.
Committee members: Dr. Katrin Schulz (chair), Prof. Dr. Frank van Harmelen, Dr. Peter van Emde Boas, Dr. Peter Bloem, Yfke Dulek MSc. Master thesis, University of Amsterdam. Amsterdam, August 2017.
Contents: 1 Introduction. 2 Background: 2.1 Introduction to Game Theory (2.1.1 Players and Games; 2.1.2 Strategies and Equilibria); 2.2 Computing Nash Equilibria (2.2.1 Introduction to Linear Programming; 2.2.2 Game Solving with LP; 2.2.3 Asymmetric Games; 2.2.4 Bimatrix Games ...)
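The "Game Solving with LP" topic in the thesis outline above refers to the standard linear-programming formulation for two-player zero-sum games; a minimal sketch of that formulation (generic illustration, not code from the thesis) looks like this:

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(payoff):
    """Return a maximin mixed strategy and game value for the row player
    of a zero-sum matrix game with the given row-player payoff matrix."""
    payoff = np.asarray(payoff, dtype=float)
    m, n = payoff.shape
    # Variables: x_1..x_m (mixed strategy) and v (game value); maximise v.
    c = np.zeros(m + 1)
    c[-1] = -1.0                                   # linprog minimises, so use -v
    # For every opponent column j: v - sum_i payoff[i, j] * x_i <= 0.
    a_ub = np.hstack([-payoff.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # The strategy probabilities must sum to one.
    a_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0, 1)] * m + [(None, None)]
    res = linprog(c, A_ub=a_ub, b_ub=b_ub, A_eq=a_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

if __name__ == "__main__":
    strategy, value = solve_zero_sum([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])
    print(strategy, value)   # rock-paper-scissors: roughly (1/3, 1/3, 1/3), value 0
```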
  • A Brief Taster of Markov Decision Processes and Stochastic Games
Algorithmic Game Theory and Applications. Lecture 15: a brief taster of Markov Decision Processes and Stochastic Games. Kousha Etessami.
Warning:
• The subjects we will touch on today are so interesting and vast that one could easily spend an entire course on them alone.
• So, think of what we discuss today as only a brief "taster", and please do explore it further if it interests you.
• Here are two standard textbooks that you can look up if you are interested in learning more: – M. Puterman, Markov Decision Processes, Wiley, 1994. – J. Filar and K. Vrieze, Competitive Markov Decision Processes, Springer, 1997. (This book is really about 2-player zero-sum stochastic games.)
Games against Nature. Consider a game graph, where some nodes belong to Player 1 but others are chance nodes of "Nature". [Figure: a small game graph with a Start node, Player 1 nodes, and Nature nodes, annotated with transition probabilities such as 1/3, 1/6, 2/3 and terminal payoffs such as 3, 1, −2.] Question: What is Player 1's "optimal strategy" and "optimal expected payoff" in this game?
A simple finite game: "make a big number". [Figure: the k digit positions, labelled k−1, k−2, ..., 0.]
• Your goal is to create as large a k-digit number as possible, using digits from D = {0, 1, 2, ..., 9}, which "nature" will give you, one by one.
• The game proceeds in k rounds.
• In each round, "nature" chooses d ∈ D uniformly at random, i.e., each digit has probability 1/10.
• You then choose which "unfilled" position in the k-digit number should be "filled" with digit d.
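The "make a big number" game above has a small dynamic-programming solution; the sketch below (my own illustration, not code from the lecture) computes the optimal expected value of the final number by treating the set of still-unfilled digit positions as the state:

```python
from functools import lru_cache

def optimal_expected_number(k):
    """Optimal expected value of the k-digit number when digits 0..9
    arrive uniformly at random and each is placed in the best open slot."""
    digits = range(10)

    @lru_cache(maxsize=None)
    def value(open_positions):
        # open_positions: tuple of place-value exponents still unfilled.
        if not open_positions:
            return 0.0
        total = 0.0
        for d in digits:   # Nature's uniformly random digit
            # Choose the unfilled position that maximises the expected outcome.
            total += max(
                d * 10 ** p + value(tuple(q for q in open_positions if q != p))
                for p in open_positions
            )
        return total / len(digits)

    return value(tuple(range(k)))

if __name__ == "__main__":
    print(optimal_expected_number(2))   # optimal play beats naive placement
```

For instance, with k = 2 the resulting optimal rule places a digit of 5 or more in the tens position and a digit of 4 or less in the units position.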
  • Markov Decision Processes
Markov Decision Processes. Robert Platt, Northeastern University. Some images and slides are used from: 1. CS188 UC Berkeley; 2. AIMA; 3. Chris Amato; 4. Stacy Marsella.
Stochastic domains. So far, we have studied search. We can use search to solve simple planning problems, e.g. robot planning using A*, but only in deterministic domains: A* doesn't work so well in stochastic environments. We are going to introduce a new framework for encoding problems with stochastic dynamics: the Markov Decision Process (MDP).
Sequential decision-making: making decisions under uncertainty.
• Rational decision making requires reasoning about one's uncertainty and objectives.
• The previous section focused on uncertainty.
• This section will discuss how to make rational decisions based on a probabilistic model and utility function.
• Last class, we focused on single-step decisions; now we will consider sequential decision problems.
Review: expectimax.
• What if we don't know the outcome of actions? Actions can fail: when a robot moves, its wheels might slip; opponents may be uncertain.
• Expectimax search: maximize the average score.
• MAX nodes choose the action that maximizes the outcome; chance nodes model an outcome distribution.
[Figure: an expectimax tree with a max node at the root, chance nodes below, and numeric leaf values.]
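A compact sketch of the expectimax recursion described in those slides (generic illustration; the tree and its values are made up, not taken from the lecture):

```python
def expectimax(node):
    """Evaluate a game-against-nature tree: max nodes pick the best child,
    chance nodes average their children by probability, leaves are values."""
    kind = node["type"]
    if kind == "leaf":
        return node["value"]
    if kind == "max":
        return max(expectimax(child) for child in node["children"])
    if kind == "chance":
        return sum(p * expectimax(child) for p, child in node["children"])
    raise ValueError(f"unknown node type: {kind}")

if __name__ == "__main__":
    # A max node choosing between two risky actions.
    tree = {"type": "max", "children": [
        {"type": "chance", "children": [
            (0.5, {"type": "leaf", "value": 10}),
            (0.5, {"type": "leaf", "value": 0})]},
        {"type": "chance", "children": [
            (0.9, {"type": "leaf", "value": 4}),
            (0.1, {"type": "leaf", "value": 20})]},
    ]}
    print(expectimax(tree))   # second action wins on average: 5.6 vs 5.0
```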
  • Markovian Competitive and Cooperative Games with Applications to Communications
DISSERTATION in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy from the University of Nice Sophia Antipolis. Specialization: Communication and Electronics. Lorenzo Maggi. Markovian Competitive and Cooperative Games with Applications to Communications. Thesis defended on the 9th of October, 2012 before a committee composed of: Reporters: Prof. Jean-Jacques Herings (Maastricht University, The Netherlands), Prof. Roberto Lucchetti (Politecnico di Milano, Italy). Examiners: Prof. Pierre Bernhard (INRIA Sophia Antipolis, France), Prof. Petros Elia (EURECOM, France), Prof. Matteo Sereno (University of Torino, Italy), Prof. Bruno Tuffin (INRIA Rennes Bretagne-Atlantique, France). Thesis Director: Prof. Laura Cottatellucci (EURECOM, France). Thesis Co-Director: Prof. Konstantin Avrachenkov (INRIA Sophia Antipolis, France).
Abstract: In this dissertation we deal with the design of strategies for agents interacting in a dynamic environment. The mathematical tool of Game Theory (GT) on Markov Decision Processes (MDPs) is adopted.
  • Factored Markov Decision Process Models for Stochastic Unit Commitment
MITSUBISHI ELECTRIC RESEARCH LABORATORIES, http://www.merl.com. Factored Markov Decision Process Models for Stochastic Unit Commitment. Daniel Nikovski, Weihong Zhang. TR2010-083, October 2010. 2010 IEEE Conference on Innovative Technologies.
Abstract: In this paper, we consider stochastic unit commitment problems where power demand and the output of some generators are random variables. We represent stochastic unit commitment problems in the form of factored Markov decision process models, and propose an approximate algorithm to solve such models. By incorporating a risk component in the cost function, the algorithm can achieve a balance between the operational costs and blackout risks. The proposed algorithm outperformed existing non-stochastic approaches on several problem instances, resulting in both lower risks and operational costs.
Contact: Daniel Nikovski and Weihong Zhang, {nikovski,wzhang}@merl.com, Mitsubishi Electric Research Laboratories, 201 Broadway, Cambridge, MA 02139, USA.
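The defining feature of the factored MDP representation used above is that the state is a vector of variables (here, for example, demand and generator outputs) whose transition model factorises over a dynamic Bayesian network; in generic notation (not taken from the report):

\[
P\!\left(s'_{1}, \ldots, s'_{K} \mid s, a\right) \;=\; \prod_{k=1}^{K} P\!\left(s'_{k} \mid \mathrm{pa}(s'_{k}), a\right),
\]

where pa(s'_k) denotes the small set of state variables that directly influence variable k; this is what keeps the model compact even when the joint state space is exponentially large.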
  • Capturing Financial Markets to Apply Deep Reinforcement Learning
Capturing Financial Markets to Apply Deep Reinforcement Learning. Souradeep Chakraborty*, BITS Pilani University, K.K. Birla Goa Campus. [email protected]. *While working under Dr. Subhamoy Maitra, at the Applied Statistics Unit, ISI Calcutta. JEL: C02, C32, C63, C45.
Abstract: In this paper we explore the usage of deep reinforcement learning algorithms to automatically generate consistently profitable, robust, uncorrelated trading signals in any general financial market. In order to do this, we present a novel Markov decision process (MDP) model to capture the financial trading markets. We review and propose various modifications to existing approaches and explore different techniques, like the usage of technical indicators, to succinctly capture the market dynamics to model the markets. We then go on to use deep reinforcement learning to enable the agent (the algorithm) to learn how to take profitable trades in any market on its own, while suggesting various methodology changes and leveraging the unique representation of the FMDP (financial MDP) to tackle the primary challenges faced in similar works. Through our experimentation results, we go on to show that our model could be easily extended to two very different financial markets and generates robust, positive performance in all conducted experiments.
Keywords: deep reinforcement learning, online learning, computational finance, Markov decision process, modeling financial markets, algorithmic trading.
Introduction. 1.1 Motivation. Since the early 1990s, efforts have been made to automatically generate trades using data and computation which consistently beat a benchmark and generate continuous positive returns with minimal risk. The goal started to shift from "learning how to win in financial markets" to "create an algorithm that can learn on its own, how to win in financial markets".
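As a toy illustration of what "an MDP model of a trading market" can look like in code, the sketch below is entirely my own, under simplifying assumptions (a single asset, three discrete actions, states built from two technical-indicator-style features); it is not the FMDP from the paper:

```python
import numpy as np

class ToyTradingMDP:
    """Minimal MDP-style environment: observe indicator features of a price
    series, act with hold/long/flat, receive the resulting change in wealth."""

    ACTIONS = {0: "hold", 1: "long", 2: "flat"}

    def __init__(self, prices, window=10):
        self.prices = np.asarray(prices, dtype=float)
        self.window = window
        self.t = window
        self.position = 0          # 0 = flat, 1 = long one unit

    def _state(self):
        recent = self.prices[self.t - self.window:self.t]
        # Two indicator-style features: momentum and normalised volatility.
        momentum = (recent[-1] - recent[0]) / recent[0]
        volatility = recent.std() / recent.mean()
        return np.array([momentum, volatility, float(self.position)])

    def reset(self):
        self.t, self.position = self.window, 0
        return self._state()

    def step(self, action):
        if action == 1:
            self.position = 1
        elif action == 2:
            self.position = 0
        price_change = self.prices[self.t + 1] - self.prices[self.t]
        reward = self.position * price_change          # mark-to-market PnL
        self.t += 1
        done = self.t + 1 >= len(self.prices)
        return self._state(), reward, done

if __name__ == "__main__":
    env = ToyTradingMDP(100 + np.cumsum(np.random.randn(200)))
    state, done = env.reset(), False
    while not done:
        state, reward, done = env.step(np.random.choice(3))   # placeholder policy
```

A deep RL agent would replace the placeholder random policy with a learned one, and a realistic model would add transaction costs, richer indicators, and a larger action space.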