Value Function Approximation in Zero-Sum Markov Games {Mgl,Parr}

Total Page:16

File Type:pdf, Size:1020Kb

Value Function Approximation in Zero-Sum Markov Games {Mgl,Parr} UAI2002 LAGOUDAKIS & PARR 283 Value Function Approximation in Zero-Sum Markov Games Michail G. Lagoudakis and Ronald Parr Department of Computer Science Duke University Durham, NC 27708 {mgl,parr}©cs.duke.edu Abstract by working in strategy space may be represented more ef­ fectively as Markov games. A well-known example of a Markov game is Littman's soccer domain (Littman, 1994). This paper investigates value function approx­ imation in the context of zero-sum Markov Recent work on learning in games has emphasized accel­ games, which can be viewed as a generalization erating learning and exploiting opponent suboptimalities of the Markov decision process (MDP) frame­ (Bowling & Veloso, 2001). The problem of learning in work to the two-agent case. We generalize er­ games with very large state spaces using function approx­ ror bounds from MDPs to Markov games and imation has not been a major focus, possibly due to the describe generalizations of reinforcement learn­ perception that a prohibitively large number of linear pro­ ing algorithms to Markov games. We present grams would need to be solved during learning. a generalization of the optimal stopping prob­ This paper contributes to the theory and practice of learning lem to a two-player simultaneous move Markov in Markov games. On the theoretical side, we generalize game. For this special problem, we provide some standard bounds on MDP performance with approxi­ stronger bounds and can guarantee convergence mate value functions to Markov games. When presented in for LSTD and temporal difference learning with their fully general form, these tum out to be straightforward linear value function approximation. We demon­ applications of known properties of contraction mappings strate the viability of value function approxima­ to Markov games. tion for Markov games by using the Least squares policy iteration (LSPI) algorithm to learn good We also generalize the optimal stopping problem to the policies for a soccer domain and a flow control Markov game case. Optimal stopping is a special case of problem. an MDP in which states have only two actions: continue on the current Markov chain, or exit and receive a (possi­ bly state dependent) reward. Optimal stopping is of inter­ est because it can be used to model the decision problem 1 Introduction faced by the holder of stock options (and is purportedly modeled as such by financial institutions). Moreover, lin­ Markov games can be viewed as generalizations of both ear value function approximation algorithms can be used classical game theory and the Markov decision process for these problems with guaranteed convergence and more (MDP) framework1. In this paper, we consider the two­ compelling error bounds than those obtainable for generic player zero-sum case, in which two players make simulta­ MDPs. We extend these results to a Markov game we call neous decisions in the same environment with shared state the "opt-out", in which two players must decide whether --; information. The reward function and the state transition to remain in the game or opt out, with a payoff that is de­ probabilities depend on the current state and the current pendent upon the actions of both players. For example, this agents' joint actions. The reward function in each state is can be used to model the decision problem faced by two the payoff matrix of a zero-sum game. "partners" who can choose to continue participating in a business venture, or force a sale and redistribution of the Many practical problem can be viewed as Markov games, business assets. including military-style pursuit-evasion problems. More­ over, many problems that are solved as classical games As in the case of MDPs, these theoretical results are some­ what more motivational than practical. They provide re­ 1 Littman (1994) notes that the Markov game framework actu­ assurance that value function approximation architectures ally preceded the MOP framework. 284 LAGOUDAKIS & PARR UAI2002 which can represent value functions with reasonable er­ value function for a zero-sum Markov game can be defined rors will lead to good policies. However, as in the MDP in a manner analogous to the Bellman equation for MDPs: case, they do not provide a priori guarantees of good per­ V(s) = max min Q(s,a,o)1r(s,a) , (1) formance. Therefore we demonstrate, on the practical side rr(s)E!"l(A) oEO L of this paper, that with a good value function approximation aEA architecture, we can achieve good performance on large where the Q-values familiar to MDPs are now definedover Markov games. We achieve this by extending the least states and pairs of agent and opponent actions: squares policy iteration (LSPI) algorithm (Lagoudakis & Parr, 2001), originally designed for MDPs, to general two­ Q(s, a,o) = R(s, a, o) + 'Y L P(s, a, o, s')V(s') . (2) player zero-sum Markov games. LSPI is an approximate s'ES policy iteration algorithm that uses linear approximation We will refer to the policy chosen by Eq. (I) as the minimax architectures, makes efficient use of training samples, and policy with respect to Q. This policy can be determined in learns faster than conventional learning methods. The ad­ any state s by solving the following linear program: vantage of using LSPI for Markov games stems from its generalization ability and its data efficiency. The latter is Maximize: V(s) of even greater importance for large Markov Games than Subject to: L 1r(s, a) = 1 for MDPs since each learning step is not light weight; it aEA requires solving a linear program. 'riaE A, 1r(s,a) 2: 0 o We demonstrate LSPI with value function approximation 'ri E 0, V(s) � L Q(s,a,o)1r(s,a) in two domains: Littman's soccer domain in which LSPI aEA learns very good policies using a relatively small num­ ber of sample "random games" for training; and, a simple In many ways, it is useful to think of the above linear pro­ gram as a generalization of the max operator for MDPs server/router flow control problem where the optimal pol­ (this is also the view taken by Szepesvari and Littman icy is found easily with a small number of training samples. (1999)). We briefly summarize some of the properties of operators for Markov games which are analogous to oper­ ators for MDPs. (See Bertsekas and Tsitsiklis (1996) for 2 Markov Games an overview.) Eq. (I) and Eq. (2) have fixed points V* and Q*, respectively, for which the minimax policy is optimal A two-player zero-sum Markov game is defined as a 6- in the minimax sense. The following iteration, which we name with operator converges to tuple (S, A, 0, P, n, 'f'), where: s = { S), 82, ... , Sn } is a T*, Q*: finite set of game states; A = {a 1,a 2, ..., am} and 0 = (i+ll T* : Q (s,a, o) = R(s, a, o) + { o1,o2 , ..., o!} are finitesets of actions, one for each player; ! L P(s,a,o,s') max min L QUl(s,a,o)7r(s,a) Pis a Markovian state transition model- P(s, a, o, s') is rr(s)E!"!(A) oEO aEA the probability that s' will be the next state of the game s'ES when the players take actions a and o respectively in state This is analogous to value iteration in the MDP case. s; n is a reward (or cost) function- R(s, a,o) is the ex­ For any policy 1r,we can defineQ rr ( s,a, o) as the expected, pected one-step reward for taking actions a and o in state discounted reward when following policy 1rafter taking ac­ s; and, 'Y E (0, 1] is the discount factor for future rewards. tions and for the first step. The corresponding fixed We will refer to the first player as the "agent," and the sec­ a o point equation for Qrr and operator Trrare: ond player as the "opponent." Note that if the opponent is rr i permitted only a single action, the Markov game becomes T : Q(+l)(s,a,o) = R(s,a,o) + an MDP. 'Y '"""' � P(s,a,o,s')mioEOn�'"""' Q(i)(s,a,o)1r(s,a) A policy 1r for a player in a Markov game is a mapping, s'ES aEA 1r : S --+ O(A), which yields probability distributions over the agent's actions for each state in S. Unlike MDPs, the A policy iteration algorithm can be implemented for optimal policy for a Markov game may be stochastic, i.e., Markov games in a manner analogous to policy iteration it may define a mixed strategy for every state. By conven­ for MDPs by fixing a policy 7r;, solving for Qrr,, choosing tion, 1r(s) denotes the probability distribution over actions 1ri+1 as the minimax policy with respect to Qrr, and iterat­ in states and 1r(s,a) denotes the probability of action a in ing. This algorithm will also converge to Q*. states. The agent is interested in maximizing its expected, dis­ 3 Value Function Approximation counted return in the minimax sense, that is, assuming the worst case of an optimal opponent. Since the underlying Value function approximation has played an important role rewards are zero-sum, it is sufficient to view the opponent in extending the classical MDP framework to solve practi­ as acting to minimize the agent's return. Thus, the state cal problems. While we are not yet able to provide strong UAI2002 LAGOUDAKIS & PARR 285 a priori guarantees in most cases for the performance of IIVIIoo = maxsES IV(s)l (Bertsekas & Tsitsiklis, 1996).
Recommended publications
  • Calculating the Value Function When the Bellman Equation Cannot Be
    Journal of Economic Dynamics & Control 28 (2004) 1437–1460 www.elsevier.com/locate/econbase Investment under uncertainty: calculating the value function when the Bellman equation cannot besolvedanalytically Thomas Dangla, Franz Wirlb;∗ aVienna University of Technology, Theresianumgasse 27, A-1040 Vienna, Austria bUniversity of Vienna, Brunnerstr.˝ 72, A-1210 Vienna, Austria Abstract The treatment of real investments parallel to ÿnancial options if an investment is irreversible and facing uncertainty is a major insight of modern investment theory well expounded in the book Investment under Uncertainty by Dixit and Pindyck. Thepurposeof this paperis to draw attention to the fact that many problems in managerial decision making imply Bellman equations that cannot be solved analytically and to the di6culties that arise when approximating therequiredvaluefunction numerically.This is so becausethevaluefunctionis thesaddlepoint path from the entire family of solution curves satisfying the di7erential equation. The paper uses a simple machine replacement and maintenance framework to highlight these di6culties (for shooting as well as for ÿnite di7erence methods). On the constructive side, the paper suggests and tests a properly modiÿed projection algorithm extending the proposal in Judd (J. Econ. Theory 58 (1992) 410). This fast algorithm proves amendable to economic and stochastic control problems(which onecannot say of theÿnitedi7erencemethod). ? 2003 Elsevier B.V. All rights reserved. JEL classiÿcation: D81; C61; E22 Keywords: Stochastic optimal control; Investment under uncertainty; Projection method; Maintenance 1. Introduction Without doubt, thebook of Dixit and Pindyck (1994) is oneof themost stimulating books concerning management sciences. The basic message of this excellent book is the following: Considering investments characterized by uncertainty (either with ∗ Corresponding author.
    [Show full text]
  • From Single-Agent to Multi-Agent Reinforcement Learning: Foundational Concepts and Methods Learning Theory Course
    From Single-Agent to Multi-Agent Reinforcement Learning: Foundational Concepts and Methods Learning Theory Course Gon¸caloNeto Instituto de Sistemas e Rob´otica Instituto Superior T´ecnico May 2005 Abstract Interest in robotic and software agents has increased a lot in the last decades. They allow us to do tasks that we would hardly accomplish otherwise. Par- ticularly, multi-agent systems motivate distributed solutions that can be cheaper and more efficient than centralized single-agent ones. In this context, reinforcement learning provides a way for agents to com- pute optimal ways of performing the required tasks, with just a small in- struction indicating if the task was or was not accomplished. Learning in multi-agent systems, however, poses the problem of non- stationarity due to interactions with other agents. In fact, the RL methods for the single agent domain assume stationarity of the environment and cannot be applied directly. This work is divided in two main parts. In the first one, the reinforcement learning framework for single-agent domains is analyzed and some classical solutions presented, based on Markov decision processes. In the second part, the multi-agent domain is analyzed, borrowing tools from game theory, namely stochastic games, and the most significant work on learning optimal decisions for this type of systems is presented. ii Contents Abstract 1 1 Introduction 2 2 Single-Agent Framework 5 2.1 Markov Decision Processes . 6 2.2 Dynamic Programming . 10 2.2.1 Value Iteration . 11 2.2.2 Policy Iteration . 12 2.2.3 Generalized Policy Iteration . 13 2.3 Learning with Model-free methods .
    [Show full text]
  • Probabilistic Goal Markov Decision Processes∗
    Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence Probabilistic Goal Markov Decision Processes∗ Huan Xu Shie Mannor Department of Mechanical Engineering Department of Electrical Engineering National University of Singapore, Singapore Technion, Israel [email protected] [email protected] Abstract also concluded that in daily decision making, people tends to regard risk primarily as failure to achieving a predeter- The Markov decision process model is a power- mined goal [Lanzillotti, 1958; Simon, 1959; Mao, 1970; ful tool in planing tasks and sequential decision Payne et al., 1980; 1981]. making problems. The randomness of state tran- Different variants of the probabilistic goal formulation sitions and rewards implies that the performance 1 of a policy is often stochastic. In contrast to the such as chance constrained programs , have been extensively standard approach that studies the expected per- studied in single-period optimization [Miller and Wagner, formance, we consider the policy that maximizes 1965; Pr´ekopa, 1970]. However, little has been done in the probability of achieving a pre-determined tar- the context of sequential decision problem including MDPs. get performance, a criterion we term probabilis- The standard approaches in risk-averse MDPs include max- tic goal Markov decision processes. We show that imization of expected utility function [Bertsekas, 1995], this problem is NP-hard, but can be solved using a and optimization of a coherent risk measure [Riedel, 2004; pseudo-polynomial algorithm. We further consider Le Tallec, 2007]. Both approaches lead to formulations that a variant dubbed “chance-constraint Markov deci- can not be solved in polynomial time, except for special sion problems,” that treats the probability of achiev- cases including exponential utility function [Chung and So- ing target performance as a constraint instead of the bel, 1987], piecewise linear utility function with a single maximizing objective.
    [Show full text]
  • 1 Introduction
    SEMI-MARKOV DECISION PROCESSES M. Baykal-G}ursoy Department of Industrial and Systems Engineering Rutgers University Piscataway, New Jersey Email: [email protected] Abstract Considered are infinite horizon semi-Markov decision processes (SMDPs) with finite state and action spaces. Total expected discounted reward and long-run average expected reward optimality criteria are reviewed. Solution methodology for each criterion is given, constraints and variance sensitivity are also discussed. 1 Introduction Semi-Markov decision processes (SMDPs) are used in modeling stochastic control problems arrising in Markovian dynamic systems where the sojourn time in each state is a general continuous random variable. They are powerful, natural tools for the optimization of queues [20, 44, 41, 18, 42, 43, 21], production scheduling [35, 31, 2, 3], reliability/maintenance [22]. For example, in a machine replacement problem with deteriorating performance over time, a decision maker, after observing the current state of the machine, decides whether to continue its usage, or initiate a maintenance (preventive or corrective) repair, or replace the machine. There are a reward or cost structure associated with the states and decisions, and an information pattern available to the decision maker. This decision depends on a performance measure over the planning horizon which is either finite or infinite, such as total expected discounted or long-run average expected reward/cost with or without external constraints, and variance penalized average reward. SMDPs are based on semi-Markov processes (SMPs) [9] [Semi-Markov Processes], that include re- newal processes [Definition and Examples of Renewal Processes] and continuous-time Markov chains (CTMCs) [Definition and Examples of CTMCs] as special cases.
    [Show full text]
  • Arxiv:2004.13965V3 [Cs.LG] 16 Feb 2021 Optimal Policy
    Graph-based State Representation for Deep Reinforcement Learning Vikram Waradpande, Daniel Kudenko, and Megha Khosla L3S Research Center, Leibniz University, Hannover {waradpande,khosla,kudenko}@l3s.de Abstract. Deep RL approaches build much of their success on the abil- ity of the deep neural network to generate useful internal representa- tions. Nevertheless, they suffer from a high sample-complexity and start- ing with a good input representation can have a significant impact on the performance. In this paper, we exploit the fact that the underlying Markov decision process (MDP) represents a graph, which enables us to incorporate the topological information for effective state representation learning. Motivated by the recent success of node representations for several graph analytical tasks we specifically investigate the capability of node repre- sentation learning methods to effectively encode the topology of the un- derlying MDP in Deep RL. To this end we perform a comparative anal- ysis of several models chosen from 4 different classes of representation learning algorithms for policy learning in grid-world navigation tasks, which are representative of a large class of RL problems. We find that all embedding methods outperform the commonly used matrix represen- tation of grid-world environments in all of the studied cases. Moreoever, graph convolution based methods are outperformed by simpler random walk based methods and graph linear autoencoders. 1 Introduction A good problem representation has been known to be crucial for the perfor- mance of AI algorithms. This is not different in the case of reinforcement learn- ing (RL), where representation learning has been a focus of investigation.
    [Show full text]
  • Dynamic Programming for Dummies, Parts I & II
    DYNAMIC PROGRAMMING FOR DUMMIES Parts I & II Gonçalo L. Fonseca [email protected] Contents: Part I (1) Some Basic Intuition in Finite Horizons (a) Optimal Control vs. Dynamic Programming (b) The Finite Case: Value Functions and the Euler Equation (c) The Recursive Solution (i) Example No.1 - Consumption-Savings Decisions (ii) Example No.2 - Investment with Adjustment Costs (iii) Example No. 3 - Habit Formation (2) The Infinite Case: Bellman's Equation (a) Some Basic Intuition (b) Why does Bellman's Equation Exist? (c) Euler Once Again (d) Setting up the Bellman's: Some Examples (i) Labor Supply/Household Production model (ii) Investment with Adjustment Costs (3) Solving Bellman's Equation for Policy Functions (a) Guess and Verify Method: The Idea (i) Guessing the Policy Function u = h(x) (ii) Guessing the Value Function V(x). (b) Guess and Verify Method: Two Examples (i) One Example: Log Return and Cobb-Douglas Transition (ii) The Other Example: Quadratic Return and Linear Transition Part II (4) Stochastic Dynamic Programming (a) Some Basics (b) Some Examples: (i) Consumption with Uncertainty (ii) Asset Prices (iii) Search Models (5) Mathematical Appendices (a) Principle of Optimality (b) Existence of Bellman's Equation 2 Texts There are actually not many books on dynamic programming methods in economics. The following are standard references: Stokey, N.L. and Lucas, R.E. (1989) Recursive Methods in Economic Dynamics. (Harvard University Press) Sargent, T.J. (1987) Dynamic Macroeconomic Theory (Harvard University Press) Sargent, T.J. (1997) Recursive Macroeconomic Theory (unpublished, but on Sargent's website at http://riffle.stanford.edu) Stokey and Lucas does more the formal "mathy" part of it, with few worked-through applications.
    [Show full text]
  • A Brief Taster of Markov Decision Processes and Stochastic Games
    Algorithmic Game Theory and Applications Lecture 15: a brief taster of Markov Decision Processes and Stochastic Games Kousha Etessami Kousha Etessami AGTA: Lecture 15 1 warning • The subjects we will touch on today are so interesting and vast that one could easily spend an entire course on them alone. • So, think of what we discuss today as only a brief “taster”, and please do explore it further if it interests you. • Here are two standard textbooks that you can look up if you are interested in learning more: – M. Puterman, Markov Decision Processes, Wiley, 1994. – J. Filar and K. Vrieze, Competitive Markov Decision Processes, Springer, 1997. (This book is really about 2-player zero-sum stochastic games.) Kousha Etessami AGTA: Lecture 15 2 Games against Nature Consider a game graph, where some nodes belong to player 1 but others are chance nodes of “Nature”: Start Player I: Nature: 1/3 1/6 3 2/3 1/3 1/3 1/6 1 −2 Question: What is Player 1’s “optimal strategy” and “optimal expected payoff” in this game? Kousha Etessami AGTA: Lecture 15 3 a simple finite game: “make a big number” k−1k−2 0 • Your goal is to create as large a k-digit number as possible, using digits from D = {0, 1, 2,..., 9}, which “nature” will give you, one by one. • The game proceeds in k rounds. • In each round, “nature” chooses d ∈ D “uniformly at random”, i.e., each digit has probability 1/10. • You then choose which “unfilled” position in the k-digit number should be “filled” with digit d.
    [Show full text]
  • Markov Decision Processes
    Markov Decision Processes Robert Platt Northeastern University Some images and slides are used from: 1. CS188 UC Berkeley 2. AIMA 3. Chris Amato 4. Stacy Marsella Stochastic domains So far, we have studied search Can use search to solve simple planning problems, e.g. robot planning using A* But only in deterministic domains... Stochastic domains So far, we have studied search Can use search to solve simple planning problems, e.g. robot planning using A* A* doesn't work so well in stochastic environments... !!? Stochastic domains So far, we have studied search Can use search to solve simple planning problems, e.g. robot planning using A* A* doesn't work so well in stochastic environments... !!? We are going to introduce a new framework for encoding problems w/ stochastic dynamics: the Markov Decision Process (MDP) SEQUENTIAL DECISION- MAKING MAKING DECISIONS UNDER UNCERTAINTY • Rational decision making requires reasoning about one’s uncertainty and objectives • Previous section focused on uncertainty • This section will discuss how to make rational decisions based on a probabilistic model and utility function • Last class, we focused on single step decisions, now we will consider sequential decision problems REVIEW: EXPECTIMAX max • What if we don’t know the outcome of actions? • Actions can fail a b • when a robot moves, it’s wheels might slip chance • Opponents may be uncertain 20 55 .3 .7 .5 .5 1020 204 105 1007 • Expectimax search: maximize average score • MAX nodes choose action that maximizes outcome • Chance nodes model an outcome
    [Show full text]
  • Markovian Competitive and Cooperative Games with Applications to Communications
    DISSERTATION in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy from University of Nice Sophia Antipolis Specialization: Communication and Electronics Lorenzo Maggi Markovian Competitive and Cooperative Games with Applications to Communications Thesis defended on the 9th of October, 2012 before a committee composed of: Reporters Prof. Jean-Jacques Herings (Maastricht University, The Netherlands) Prof. Roberto Lucchetti, (Politecnico of Milano, Italy) Examiners Prof. Pierre Bernhard, (INRIA Sophia Antipolis, France) Prof. Petros Elia, (EURECOM, France) Prof. Matteo Sereno (University of Torino, Italy) Prof. Bruno Tuffin (INRIA Rennes Bretagne-Atlantique, France) Thesis Director Prof. Laura Cottatellucci (EURECOM, France) Thesis Co-Director Prof. Konstantin Avrachenkov (INRIA Sophia Antipolis, France) THESE pr´esent´ee pour obtenir le grade de Docteur en Sciences de l’Universit´ede Nice-Sophia Antipolis Sp´ecialit´e: Automatique, Traitement du Signal et des Images Lorenzo Maggi Jeux Markoviens, Comp´etitifs et Coop´eratifs, avec Applications aux Communications Th`ese soutenue le 9 Octobre 2012 devant le jury compos´ede : Rapporteurs Prof. Jean-Jacques Herings (Maastricht University, Pays Bas) Prof. Roberto Lucchetti, (Politecnico of Milano, Italie) Examinateurs Prof. Pierre Bernhard, (INRIA Sophia Antipolis, France) Prof. Petros Elia, (EURECOM, France) Prof. Matteo Sereno (University of Torino, Italie) Prof. Bruno Tuffin (INRIA Rennes Bretagne-Atlantique, France) Directrice de Th`ese Prof. Laura Cottatellucci (EURECOM, France) Co-Directeur de Th`ese Prof. Konstantin Avrachenkov (INRIA Sophia Antipolis, France) Abstract In this dissertation we deal with the design of strategies for agents interacting in a dynamic environment. The mathematical tool of Game Theory (GT) on Markov Decision Processes (MDPs) is adopted.
    [Show full text]
  • Factored Markov Decision Process Models for Stochastic Unit Commitment
    MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Factored Markov Decision Process Models for Stochastic Unit Commitment Daniel Nikovski, Weihong Zhang TR2010-083 October 2010 Abstract In this paper, we consider stochastic unit commitment problems where power demand and the output of some generators are random variables. We represent stochastic unit commitment prob- lems in the form of factored Markov decision process models, and propose an approximate algo- rithm to solve such models. By incorporating a risk component in the cost function, the algorithm can achieve a balance between the operational costs and blackout risks. The proposed algorithm outperformed existing non-stochastic approaches on several problem instances, resulting in both lower risks and operational costs. 2010 IEEE Conference on Innovative Technologies This work may not be copied or reproduced in whole or in part for any commercial purpose. Permission to copy in whole or in part without payment of fee is granted for nonprofit educational and research purposes provided that all such whole or partial copies include the following: a notice that such copying is by permission of Mitsubishi Electric Research Laboratories, Inc.; an acknowledgment of the authors and individual contributions to the work; and all applicable portions of the copyright notice. Copying, reproduction, or republishing for any other purpose shall require a license with payment of fee to Mitsubishi Electric Research Laboratories, Inc. All rights reserved. Copyright c Mitsubishi Electric Research Laboratories, Inc., 2010 201 Broadway, Cambridge, Massachusetts 02139 MERLCoverPageSide2 Factored Markov Decision Process Models for Stochastic Unit Commitment Daniel Nikovski and Weihong Zhang, {nikovski,wzhang}@merl.com, phone 617-621-7510 Mitsubishi Electric Research Laboratories, 201 Broadway, Cambridge, MA 02139, USA, fax 617-621-7550 Abstract—In this paper, we consider stochastic unit commit- generate in order to meet demand.
    [Show full text]
  • Capturing Financial Markets to Apply Deep Reinforcement Learning
    Capturing Financial markets to apply Deep Reinforcement Learning Souradeep Chakraborty* BITS Pilani University, K.K. Birla Goa Campus [email protected] *while working under Dr. Subhamoy Maitra, at the Applied Statistics Unit, ISI Calcutta JEL: C02, C32, C63, C45 Abstract In this paper we explore the usage of deep reinforcement learning algorithms to automatically generate consistently profitable, robust, uncorrelated trading signals in any general financial market. In order to do this, we present a novel Markov decision process (MDP) model to capture the financial trading markets. We review and propose various modifications to existing approaches and explore different techniques like the usage of technical indicators, to succinctly capture the market dynamics to model the markets. We then go on to use deep reinforcement learning to enable the agent (the algorithm) to learn how to take profitable trades in any market on its own, while suggesting various methodology changes and leveraging the unique representation of the FMDP (financial MDP) to tackle the primary challenges faced in similar works. Through our experimentation results, we go on to show that our model could be easily extended to two very different financial markets and generates a positively robust performance in all conducted experiments. Keywords: deep reinforcement learning, online learning, computational finance, Markov decision process, modeling financial markets, algorithmic trading Introduction 1.1 Motivation Since the early 1990s, efforts have been made to automatically generate trades using data and computation which consistently beat a benchmark and generate continuous positive returns with minimal risk. The goal started to shift from “learning how to win in financial markets” to “create an algorithm that can learn on its own, how to win in financial markets”.
    [Show full text]
  • The Uncertainty Bellman Equation and Exploration
    The Uncertainty Bellman Equation and Exploration Brendan O’Donoghue 1 Ian Osband 1 Remi Munos 1 Volodymyr Mnih 1 Abstract tions that maximize rewards given its current knowledge? We consider the exploration/exploitation prob- Separating estimation and control in RL via ‘greedy’ algo- lem in reinforcement learning. For exploitation, rithms can lead to premature and suboptimal exploitation. it is well known that the Bellman equation con- To offset this, the majority of practical implementations in- nects the value at any time-step to the expected troduce some random noise or dithering into their action value at subsequent time-steps. In this paper we selection (such as -greedy). These algorithms will even- consider a similar uncertainty Bellman equation tually explore every reachable state and action infinitely (UBE), which connects the uncertainty at any often, but can take exponentially long to learn the opti- time-step to the expected uncertainties at subse- mal policy (Kakade, 2003). By contrast, for any set of quent time-steps, thereby extending the potential prior beliefs the optimal exploration policy can be com- exploratory benefit of a policy beyond individual puted directly by dynamic programming in the Bayesian time-steps. We prove that the unique fixed point belief space. However, this approach can be computation- of the UBE yields an upper bound on the vari- ally intractable for even very small problems (Guez et al., ance of the posterior distribution of the Q-values 2012) while direct computational approximations can fail induced by any policy. This bound can be much spectacularly badly (Munos, 2014). tighter than traditional count-based bonuses that For this reason, most provably-efficient approaches to re- compound standard deviation rather than vari- inforcement learning rely upon the optimism in the face of ance.
    [Show full text]