Reinforcement Learning


PROF XIAOHUI XIE, SPRING 2019
CS 273P Machine Learning and Data Mining
Slides courtesy of Sameer Singh, David Silver

Machine Learning: Intro to Reinforcement Learning · Markov Processes · Markov Reward Processes · Markov Decision Processes

What makes it different?
• No direct supervision, only rewards
• Feedback is delayed, not instantaneous
• Time really matters, i.e. data is sequential
• The agent’s actions affect what data it will receive

Examples
• Fly stunt maneuvers in a helicopter
• Defeat the world champion at Backgammon or Go
• Manage an investment portfolio
• Control a power station
• Make a humanoid robot walk
• Play many different Atari games better than humans

Agent-Environment Interface
• Agent: decides on an action, receives the next observation, receives the next reward
• Environment: executes the action, computes the next observation, computes the next reward

Reward, Rt
• Measures how well the agent is doing: positive (good) or negative (bad)
• Says nothing about WHY the agent is doing well; it could have little to do with At-1
• The agent is trying to maximize its cumulative reward

Example of Rewards
• Fly stunt maneuvers in a helicopter: +ve reward for following the desired trajectory, -ve reward for crashing
• Defeat the world champion at Backgammon: +/-ve reward for winning/losing a game
• Manage an investment portfolio: +ve reward for each $ in the bank
• Control a power station: +ve reward for producing power, -ve reward for exceeding safety thresholds
• Make a humanoid robot walk: +ve reward for forward motion, -ve reward for falling over
• Play many different Atari games better than humans: +/-ve reward for increasing/decreasing the score

Sequential Decision Making
• Actions have long-term consequences
• Rewards may be delayed
• It may be better to sacrifice short-term reward for long-term benefit
Examples
• A financial investment (may take months to mature)
• Refueling a helicopter (might prevent a crash later)
• Blocking opponent moves (might eventually help win)
• Spend a lot of money and go to college (earn more later)
• Don’t commit crimes (rewarded by not going to jail)
• Get started on the final project early (avoid stress later)
A key aspect of intelligence: how far ahead are you able to plan?

Reinforcement Learning
Given an environment (which produces observations and rewards), reinforcement learning builds an automated agent that selects actions to maximize total reward in that environment.

Let’s look at the Agent
What does the choice of action depend on?
• Can you ignore Ot completely?
• Is just Ot enough? Or (Ot, At)?
• Is it the last few observations?
• Is it all observations so far?

Agent State, St
• History: everything that happened so far, Ht = O1 R1 A1, O2 R2 A2, O3 R3, …, At-1 Ot Rt
• The state St can be, for example: Ot; Ot Rt; At-1 Ot Rt; Ot-3 Ot-2 Ot-1 Ot
• In general, St = f(Ht); you, as the AI designer, specify this function

Agent Policy, π
• The policy maps the current state to the next action: St → π → At
• Good policy: leads to larger cumulative reward
• Bad policy: leads to worse cumulative reward
(we will explore this later)

Example: Atari
• Rules are unknown: what makes the score increase?
• Dynamics are unknown: how do actions change pixels?
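The agent-environment interface and the policy slides above amount to a simple interaction loop. Below is a minimal Python sketch of that loop; the environment object with reset() and step() methods, the policy function, and the state function are hypothetical stand-ins, not an API defined in the slides:

```python
# Minimal agent-environment interaction loop (illustrative sketch).
# Assumed interface: env.reset() -> observation,
#                    env.step(action) -> (observation, reward, done).

def run_episode(env, policy, state_fn):
    """Run one episode and return the cumulative (undiscounted) reward."""
    history = []                            # H_t: everything that happened so far
    obs = env.reset()
    total_reward = 0.0
    done = False
    while not done:
        state = state_fn(history, obs)      # S_t = f(H_t): the designer chooses f
        action = policy(state)              # A_t = pi(S_t)
        next_obs, reward, done = env.step(action)
        history.append((obs, action, reward))
        total_reward += reward              # the agent tries to maximize this total
        obs = next_obs
    return total_reward
```

The state_fn argument is where the "Agent State" choice lives: it can return just the latest observation, the last few observations, or a summary of the whole history.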
Video Time!
https://www.youtube.com/watch?v=V1eYniJ0Rnk

Example: Robotic Soccer
https://www.youtube.com/watch?v=CIF2SBVY-J0

AlphaGo
https://www.youtube.com/watch?v=I2WFvGl4y8c

Overview of Lecture
• States → Markov Processes
• + Rewards → Markov Reward Processes
• + Actions → Markov Decision Processes

Machine Learning: Intro to Reinforcement Learning · Markov Processes · Markov Reward Processes · Markov Decision Processes

Topics: Markov Property · State Transition Matrix · Markov Processes · Student Markov Chain · Student MC: Episodes · Student MC: Transition Matrix

Machine Learning: Intro to Reinforcement Learning · Markov Processes · Markov Reward Processes · Markov Decision Processes

Topics: Markov Reward Process · The Student MRP · Return · Why discount? · Value Function · Student MRP: Returns · Student MRP: Value Function · Bellman Equations for MRP · Backup Diagrams for MRP · Student MRP: Bellman Eq · Matrix Form of Bellman Eq · Solving the Bellman Equation

Dynamic Programming
Writing vk for the value estimate at iteration k:
v2(C1) = -2 + γ ( 0.5 v1(FB) + 0.5 v1(C2) )
v2(FB) = -1 + γ ( 0.9 v1(FB) + 0.1 v1(C1) )  …
v3(FB) = -1 + γ ( 0.9 v2(FB) + 0.1 v2(C1) )  …

Iterates for γ = 0.5:

        C1      C2      C3     Pass   Pub     FB      Slp
t=1    -2      -2      -2      10     1       -1      0
t=2    -2.75   -2.8     1.2    10     0       -1.55   0
t=3    -3.09   -1.52    1      10     0.41    -1.83   0
t=4    -2.84   -1.6     1.08   10     0.59    -1.98   0
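The iteration table above can be reproduced in a few lines of NumPy. This is an illustrative sketch rather than code from the slides: only the C1 and FB transition rows appear in the equations above, so the remaining rows of P are assumed from David Silver’s standard Student MRP example and should be checked against the Student MRP slide.

```python
import numpy as np

# Student MRP, state order matching the table: C1, C2, C3, Pass, Pub, FB, Sleep.
R = np.array([-2.0, -2.0, -2.0, 10.0, 1.0, -1.0, 0.0])   # immediate rewards

# Transition matrix P[s, s']. Rows other than C1 and FB are assumed from
# David Silver's standard Student MRP figure (not shown in this extract).
P = np.array([
    [0.0, 0.5, 0.0, 0.0, 0.0, 0.5, 0.0],   # C1   -> C2 (0.5), FB (0.5)
    [0.0, 0.0, 0.8, 0.0, 0.0, 0.0, 0.2],   # C2   -> C3 (0.8), Sleep (0.2)
    [0.0, 0.0, 0.0, 0.6, 0.4, 0.0, 0.0],   # C3   -> Pass (0.6), Pub (0.4)
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Pass -> Sleep
    [0.2, 0.4, 0.4, 0.0, 0.0, 0.0, 0.0],   # Pub  -> C1, C2, C3
    [0.1, 0.0, 0.0, 0.0, 0.0, 0.9, 0.0],   # FB   -> C1 (0.1), FB (0.9)
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],   # Sleep is absorbing
])

gamma = 0.5

# Synchronous Bellman backups v_{k+1} = R + gamma * P v_k, starting from v_0 = 0.
v = np.zeros_like(R)
for t in range(1, 5):
    v = R + gamma * P @ v
    print(f"t={t}", np.round(v, 2))

# The MRP Bellman equation is linear, so it can also be solved directly:
# v = (I - gamma * P)^{-1} R
v_exact = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print("exact", np.round(v_exact, 2))
```

With v0 = 0 the first backup is just R, which is exactly the t=1 row of the table, and the later rows follow the same recursion. The direct solve stops working once actions enter the picture, because the max in the MDP Bellman optimality equation makes the system nonlinear (the "not easy" point below).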
Machine Learning: Intro to Reinforcement Learning · Markov Processes · Markov Reward Processes · Markov Decision Processes

Topics: Markov Decision Process · The Student MDP · Policies · MPs → MRPs → MDPs · Value Function · Student MDP: Value Function · Bellman Expectation Equation (V and Q forms) · Student MDP: Bellman Expectation Eq · Bellman Expectation Eq: Matrix Form · Optimal Value Function · Student MDP: Optimal V · Student MDP: Optimal Q · Optimal Policy · Finding the Optimal Policy · Student MDP: Optimal Policy · Bellman Optimality Eq (V and Q forms) · Student MDP: Bellman Optimality

Solving Bellman Equations in MDP
Not easy:
● Not a linear equation
● No “closed-form” solutions

Overview
• MDPs: states, transitions, actions, rewards
• Prediction: given a policy π, estimate the state value function and the action value function
• Control: estimate the optimal value functions and the optimal policy
• Does the agent know the MDP?
  Yes: it’s “planning”; the agent knows everything.
  No: it’s “model-free RL”; the agent observes everything as it goes.

Roadmap (this table recurs throughout the lecture)

                    Evaluate Policy π        Find Best Policy π*
MDP Known           Policy Evaluation        Policy/Value Iteration     (planning)
MDP Unknown         MC and TD Learning       Sarsa + Q-Learning         (model-free)

Topics: Dynamic Programming · Requirements for Dynamic Programming · Planning by Dynamic Programming · Applications of DPs · Iterative Policy Evaluation · Random Policy: Grid World

Policy Evaluation: Grid World
• Time 0: do nothing, stop; no cost (v0 = 0 everywhere).
• Time 1: move (reward -1), then follow the k=0 values: v1 = -1 everywhere, unless in the goal (reward 0).
• Time 2: move (reward -1), then follow the k=1 values:
  Most states: move (-1) + [v1 = -1] = -2
  States next to the goal: move (-1) + ¾ [v1 = -1] + ¼ [v1 = 0] = -1.75
In general, this gives the best policy and value for “one step, then follow the random policy” (always a better policy than random!).

Topics: Improving a Policy! · Policy Iteration · Jack’s Car Rental · Policy Iteration in Car Rental · Policy Improvement · Modified Policy Iteration · Generalized Policy Iteration

Topics: Monte Carlo RL · Monte Carlo Policy Evaluation

Every-Visit MC Policy Evaluation
Equivalent, “incremental tracking” form: looks like SGD to minimize the MSE from the mean value…

Blackjack Example (policy diagram: stand/hit regions)
Blackjack Value Function (value plots under the stand/hit policy)

Topics: Temporal Difference Learning · MC and TD · Driving Home Example · Driving Home: MC vs TD

Finite Episodes: AB Example
MC and TD can give different answers on the same fixed data:
V(B) = 6/8
V(A) = 0? (direct MC estimate)
V(A) = 6/8? (TD estimate)

MC vs TD

Monte Carlo                             Temporal Difference
• Wait till end of episode to learn     • Learn online after every step
• Only for terminating worlds           • Non-terminating worlds ok
• High variance, low bias               • Low variance, high bias
• Not sensitive to initial value        • Sensitive to initial value
• Good convergence properties           • Much more efficient
• Doesn’t exploit Markov property       • Exploits Markov property
• Minimizes squared error               • Maximizes log-likelihood
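To make the two columns concrete, here is an illustrative sketch (not code from the slides) of the two tabular prediction updates for a fixed policy: every-visit Monte Carlo in the incremental "tracking" form mentioned above, and TD(0). The step size alpha and the episode representation are assumptions of this sketch.

```python
from collections import defaultdict

def mc_every_visit_update(V, episode, gamma=1.0, alpha=0.1):
    """Every-visit MC: after the episode ends, nudge V(s) toward the observed return G."""
    # episode is a list of (state, reward) pairs, where reward is the reward
    # received after leaving that state.
    G = 0.0
    for state, reward in reversed(episode):      # accumulate the return backwards
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])       # incremental "tracking" update

def td0_update(V, state, reward, next_state, done, gamma=1.0, alpha=0.1):
    """TD(0): update online after every step toward the bootstrapped target r + gamma V(s')."""
    target = reward + (0.0 if done else gamma * V[next_state])
    V[state] += alpha * (target - V[state])

V = defaultdict(float)   # tabular state-value estimates
```

The only difference is the target: MC waits for the actual return G (end of episode, high variance, no bias), while TD(0) bootstraps from the current estimate V(s') (online, low variance, some bias), which is the trade-off summarized in the table above.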
Topics: Unified View: Monte Carlo · Unified View: TD Learning · Unified View: Dynamic Programming · Unified View of RL (Prediction)

Model-free Control
Learn a policy π to maximize rewards in the environment.

Topics: On and Off Policy Learning · Generalized Policy Iteration · Generalized Policy Improvement? · Not quite! · Learn the Q function directly… · Greedy Action Selection? · ε-Greedy Exploration · Which Policy Evaluation? · Sarsa: TD for Policy Evaluation · On-Policy Control with Sarsa · Off-Policy Learning · Q-Learning · Off-Policy with Q-Learning · Q-Learning Control Algorithm (SARSAMAX) · Relation between DP and TD · Update Equations for DP and TD
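The control methods named at the end of the deck (ε-greedy exploration, Sarsa, and Q-learning, a.k.a. SARSAMAX) each reduce to a one-line update on a tabular Q function. The sketch below is illustrative and not taken from the slides; the (state, action) keying and the parameter values are assumptions.

```python
import random
from collections import defaultdict

Q = defaultdict(float)   # tabular action-value function, keyed by (state, action)

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Explore with probability epsilon, otherwise act greedily with respect to Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa_update(Q, s, a, r, s2, a2, done, gamma=0.99, alpha=0.1):
    """On-policy (Sarsa): bootstrap from the action a2 the behaviour policy actually took."""
    target = r + (0.0 if done else gamma * Q[(s2, a2)])
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def q_learning_update(Q, s, a, r, s2, actions, done, gamma=0.99, alpha=0.1):
    """Off-policy (Q-learning / SARSAMAX): bootstrap from the greedy action in s2."""
    best_next = 0.0 if done else max(Q[(s2, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
```

The two updates differ only in the bootstrap target: Sarsa uses the next action chosen by the ε-greedy behaviour policy (on-policy), while Q-learning takes the max over actions (off-policy), which is why the slides call it SARSAMAX.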