CS188 Lecture 10 and 11 -- Reinforcement Learning 6PP.pdf

Total pages: 16
File type: PDF, size: 1020 KB

CS 188: Artificial Intelligence -- Reinforcement Learning (RL)
Pieter Abbeel -- UC Berkeley
Many slides over the course adapted from Dan Klein, Stuart Russell, Andrew Moore

MDPs and RL Outline
§ Markov Decision Processes (MDPs)
  § Formalism
  § Planning: value iteration
  § Policy evaluation and policy iteration
§ Reinforcement Learning --- MDP with T and/or R unknown
  § Model-based learning
  § Model-free learning
    § Direct evaluation [performs policy evaluation]
    § Temporal difference learning [performs policy evaluation]
    § Q-learning [learns optimal state-action value function Q*]
    § Policy search [learns optimal policy from subset of all policies]
§ Exploration vs. exploitation
§ Large state spaces: feature-based representations

Reinforcement Learning
§ Basic idea:
  § Receive feedback in the form of rewards
  § Agent's utility is defined by the reward function
  § Must (learn to) act so as to maximize expected rewards

Examples [DEMOS on next slides]
§ Learning to walk: before learning (hand-tuned), one of many learning runs, after learning [after 1000 field traversals] [Kohl and Stone, ICRA 2004]
§ Snake robot [Andrew Ng]
§ Toddler [Tedrake, Zhang and Seung, 2005]
§ In your Project 3

Reinforcement Learning
§ Still assume a Markov decision process (MDP):
  § A set of states s ∈ S
  § A set of actions (per state) A
  § A model T(s,a,s')
  § A reward function R(s,a,s')
§ Still looking for a policy π(s)
§ New twist: don't know T or R
  § I.e., don't know which states are good or what the actions do
  § Must actually try actions and states out to learn

Example to Illustrate Model-Based vs. Model-Free: Expected Age
§ Goal: compute the expected age of CS188 students
§ Known P(A): compute E[A] = Σ_a P(a) · a directly
§ Without P(A), instead collect samples [a1, a2, ..., aN]
  § Unknown P(A), "Model Based": estimate P̂(a) = num(a)/N, then E[A] ≈ Σ_a P̂(a) · a
  § Unknown P(A), "Model Free": average the samples directly, E[A] ≈ (1/N) Σ_i a_i

Model-Based Learning
§ Idea:
  § Step 1: Learn the model empirically through experience
  § Step 2: Solve for values as if the learned model were correct
§ Step 1: Simple empirical model learning
  § Count outcomes for each s,a
  § Normalize to give an estimate of T(s,a,s')
  § Discover R(s,a,s') when we experience (s,a,s')
§ Step 2: Solve the MDP with the learned model
  § Value iteration, or policy iteration

Example: Learning the Model in Model-Based Learning (γ = 1, fixed policy π)
§ Episode 1: (1,1) up -1, (1,2) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (3,3) right -1, (4,3) exit +100 (done)
§ Episode 2: (1,1) up -1, (1,2) up -1, (1,3) right -1, (2,3) right -1, (3,3) right -1, (3,2) up -1, (4,2) exit -100 (done)
§ Learned transitions: T(<3,3>, right, <4,3>) = 1/3, T(<2,3>, right, <3,3>) = 2/2
§ (A counting sketch follows below.)
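The counting-and-normalizing step of model-based learning can be written in a few lines of Python. This is an illustrative sketch, not code from the lecture or the course projects; representing each episode as a list of (s, a, s', r) tuples is my own encoding of the episodes listed above.

```python
# Step 1 of model-based learning: count observed (s, a, s') outcomes,
# normalize to estimate T, and record R when it is experienced.
from collections import defaultdict, Counter

episodes = [
    [((1, 1), "up", (1, 2), -1), ((1, 2), "up", (1, 2), -1), ((1, 2), "up", (1, 3), -1),
     ((1, 3), "right", (2, 3), -1), ((2, 3), "right", (3, 3), -1), ((3, 3), "right", (3, 2), -1),
     ((3, 2), "up", (3, 3), -1), ((3, 3), "right", (4, 3), -1), ((4, 3), "exit", "done", 100)],
    [((1, 1), "up", (1, 2), -1), ((1, 2), "up", (1, 3), -1), ((1, 3), "right", (2, 3), -1),
     ((2, 3), "right", (3, 3), -1), ((3, 3), "right", (3, 2), -1), ((3, 2), "up", (4, 2), -1),
     ((4, 2), "exit", "done", -100)],
]

counts = defaultdict(Counter)   # (s, a) -> Counter over successor states s'
rewards = {}                    # (s, a, s') -> observed reward

for episode in episodes:
    for s, a, s_next, r in episode:
        counts[(s, a)][s_next] += 1
        rewards[(s, a, s_next)] = r   # rewards are deterministic in this example

def T_hat(s, a, s_next):
    """Estimated transition probability: normalized outcome counts."""
    total = sum(counts[(s, a)].values())
    return counts[(s, a)][s_next] / total if total else 0.0

print(T_hat((3, 3), "right", (4, 3)))   # 1/3, as on the slide
print(T_hat((2, 3), "right", (3, 3)))   # 2/2 = 1.0
```

Step 2 would then run value iteration or policy iteration on T_hat and the recorded rewards, exactly as in the planning lectures.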
Learning the Model in Model-Based Learning
§ Estimate P(x) from samples
  § Samples: x1, ..., xN drawn from P(x)
  § Estimate: P̂(x) = count(x) / N
§ Estimate P(s' | s, a) from samples
  § Samples: observed transitions (s, a, s')
  § Estimate: T̂(s,a,s') = count(s,a,s') / count(s,a)
§ Why does this work? Because samples appear with the right frequencies!

Model-Based vs. Model-Free
§ Model-based RL
  § First act in the MDP and learn T, R
  § Then value iteration or policy iteration with the learned T, R
  § Advantage: efficient use of data
  § Disadvantage: requires building a model for T, R
§ Model-free RL
  § Bypass the need to learn T, R
  § Methods to evaluate V^π, the value function for a fixed policy π, without knowing T, R:
    § (i) Direct evaluation
    § (ii) Temporal difference learning
  § Method to learn π*, Q*, V* without knowing T, R:
    § (iii) Q-learning

Direct Evaluation
§ Repeatedly execute the policy π
§ Estimate the value of state s as the average, over all times state s was visited, of the sum of discounted rewards accumulated from state s onwards

Example: Direct Evaluation (γ = 1, living reward R = -1; same episodes as above)
§ V(2,3) ≈ (96 + -103) / 2 = -3.5
§ V(3,3) ≈ (99 + 97 + -102) / 3 ≈ 31.3

Model-Free Learning
§ Want to compute an expectation weighted by P(x):
  § Model-based: estimate P(x) from samples, compute the expectation
  § Model-free: estimate the expectation directly from samples
§ Why does this work? Because samples appear with the right frequencies!

Limitations of Direct Evaluation
§ Assume a random initial state
§ Assume the value of state (1,2) is known perfectly based on past runs
§ Now for the first time we encounter (1,1) --- can we do better than estimating V(1,1) as the reward outcome of that single run?

Sample-Based Policy Evaluation?
§ Who needs T and R? Approximate the expectation with samples of s' (drawn from T!)
§ Almost! But we can't rewind time to get sample after sample from state s.

Temporal-Difference Learning
§ Big idea: learn from every experience!
  § Update V(s) each time we experience (s,a,s',r)
  § Likely s' will contribute updates more often
§ Temporal difference learning
  § Policy still fixed!
  § Move values toward the value of whatever successor occurs: running average!
  § Sample of V(s): sample = R(s, π(s), s') + γ V^π(s')
  § Update to V(s): V^π(s) ← (1-α) V^π(s) + α · sample
  § Same update: V^π(s) ← V^π(s) + α (sample - V^π(s))

Exponential Moving Average
§ The running average above is an exponential moving average
§ Makes recent samples more important
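A minimal sketch of TD value learning with the exponential-moving-average update above. Tabular states and a stream of (s, a, s', r) experience under a fixed policy are assumptions of the sketch, not the lecture's code, and the short episode at the bottom is illustrative.

```python
# TD(0) policy evaluation: V(s) <- (1-alpha)*V(s) + alpha*[r + gamma*V(s')]
from collections import defaultdict

def td_policy_evaluation(experience, gamma=1.0, alpha=0.1):
    """Update V(s) after every observed transition (s, a, s', r)."""
    V = defaultdict(float)                      # V(s) initialized to 0
    for s, a, s_next, r in experience:
        target = r + gamma * V[s_next]          # sample of V(s) from this one transition
        V[s] = (1 - alpha) * V[s] + alpha * target
    return V

stream = [((1, 1), "up", (1, 2), -1), ((1, 2), "up", (1, 3), -1),
          ((1, 3), "right", (2, 3), -1), ((2, 3), "right", (3, 3), -1),
          ((3, 3), "right", (4, 3), -1), ((4, 3), "exit", "done", 100)]
print(dict(td_policy_evaluation(stream)))
```

Because α < 1, older samples are exponentially down-weighted, which is exactly the "forgets about the past" behaviour described next.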
§ Forgets about the past (distant past values were wrong anyway)
§ Easy to compute from the running average
§ Decreasing learning rate can give converging averages

Policy Evaluation when T (and R) Unknown --- Recap
§ Model-based:
  § Learn the model empirically through experience
  § Solve for values as if the learned model were correct
§ Model-free:
  § Direct evaluation: V(s) = sample estimate of the sum of rewards accumulated from state s onwards
  § Temporal difference (TD) value learning: move values toward the value of whatever successor occurs (running average)

Problems with TD Value Learning
§ TD value learning is a model-free way to do policy evaluation
§ However, if we want to turn values into a (new) policy, we're sunk: acting greedily, π(s) = argmax_a Σ_{s'} T(s,a,s') [R(s,a,s') + γ V(s')], still needs T and R
§ Idea: learn Q-values directly
§ Makes action selection model-free too!

Active RL
§ Full reinforcement learning:
  § You don't know the transitions T(s,a,s')
  § You don't know the rewards R(s,a,s')
  § You can choose any actions you like
  § Goal: learn the optimal policy / values
  § ... what value iteration did!
§ In this case:
  § Learner makes choices!
  § Fundamental tradeoff: exploration vs. exploitation
  § This is NOT offline planning! You actually take actions in the world and find out what happens...

Detour: Q-Value Iteration
§ Value iteration: find successive approximations of the optimal values
  § Start with V*_0(s) = 0, which we know is right (why?)
  § Given V*_i, calculate the values for all states for depth i+1:
    V*_{i+1}(s) = max_a Σ_{s'} T(s,a,s') [R(s,a,s') + γ V*_i(s')]
§ But Q-values are more useful!
  § Start with Q*_0(s,a) = 0, which we know is right (why?)
  § Given Q*_i, calculate the q-values for all q-states for depth i+1:
    Q*_{i+1}(s,a) = Σ_{s'} T(s,a,s') [R(s,a,s') + γ max_{a'} Q*_i(s',a')]

Q-Learning
§ Q-learning: sample-based Q-value iteration
§ Learn Q*(s,a) values:
  § Receive a sample (s,a,s',r)
  § Consider your old estimate: Q(s,a)
  § Consider your new sample estimate: sample = R(s,a,s') + γ max_{a'} Q(s',a')
  § Incorporate the new estimate into a running average: Q(s,a) ← (1-α) Q(s,a) + α · sample
§ (A sketch of this update follows below.)

Q-Learning Properties
§ Amazing result: Q-learning converges to the optimal policy
  § If you explore enough
  § If you make the learning rate small enough
  § ... but not decrease it too quickly!
  § Basically doesn't matter how you select actions (!)
§ Neat property: off-policy learning --- learn the optimal policy without following it
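A minimal tabular Q-learning sketch of the update above. The class structure and the tiny action set are my own illustration, not the course's project code.

```python
# Tabular Q-learning: Q(s,a) <- (1-alpha)*Q(s,a) + alpha*[r + gamma*max_a' Q(s',a')]
from collections import defaultdict

class QLearner:
    def __init__(self, actions, gamma=0.9, alpha=0.5):
        self.Q = defaultdict(float)      # Q(s,a), keyed by (state, action)
        self.actions = actions
        self.gamma, self.alpha = gamma, alpha

    def update(self, s, a, s_next, r, terminal=False):
        # Sample estimate of the value of the successor state.
        future = 0.0 if terminal else max(self.Q[(s_next, b)] for b in self.actions)
        sample = r + self.gamma * future
        self.Q[(s, a)] = (1 - self.alpha) * self.Q[(s, a)] + self.alpha * sample

    def greedy_action(self, s):
        return max(self.actions, key=lambda a: self.Q[(s, a)])

learner = QLearner(actions=["up", "down", "left", "right", "exit"])
learner.update((3, 3), "right", (4, 3), -1)
learner.update((4, 3), "exit", "done", 100, terminal=True)
print(learner.greedy_action((4, 3)))     # "exit" after the +100 sample
```

Note that the update bootstraps from the max over the learner's own Q estimates, not from the action actually taken next, which is what makes Q-learning off-policy.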
Exploration / Exploitation
§ Several schemes for forcing exploration
§ Simplest: random actions (ε-greedy)
  § Every time step, flip a coin
  § With probability ε, act randomly
  § With probability 1-ε, act according to the current policy
§ Problems with random actions?
  § You do explore the space, but keep thrashing around once learning is done
  § One solution: lower ε over time
  § Another solution: exploration functions

Exploration Functions
§ When to explore
  § Random actions: explore a fixed amount
  § Better idea: explore areas whose badness is not (yet) established
§ Exploration function
  § Takes a value estimate u and a visit count n, and returns an optimistic utility, e.g. f(u,n) = u + k/n (exact form not important)
  § The Q-learning update
      Q_{i+1}(s,a) ← (1-α) Q_i(s,a) + α [R(s,a,s') + γ max_{a'} Q_i(s',a')]
    now becomes:
      Q_{i+1}(s,a) ← (1-α) Q_i(s,a) + α [R(s,a,s') + γ max_{a'} f(Q_i(s',a'), N(s',a'))]
§ (An action-selection sketch follows below.)

Q-Learning
§ Q-learning produces tables of q-values, one entry Q(s,a) per state-action pair

The Story So Far: MDPs and RL
§ Things we know how to do, and the corresponding techniques:
  § If we know the MDP (model-based DPs):
    § Compute V*, Q*, π* exactly --- value iteration
    § Evaluate a fixed policy π --- policy evaluation
  § If we don't know the MDP:
    § We can estimate the MDP, then solve --- model-based RL
    § We can estimate V for a fixed policy π --- model-free RL (value learning)
    § We can estimate Q*(s,a) for the optimal policy while executing an exploration policy --- Q-learning
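The two action-selection schemes above can be sketched as follows. This is illustrative only; the +1 in the count term avoids division by zero for unvisited pairs, a small deviation from the slide's f(u,n) = u + k/n.

```python
# Epsilon-greedy action selection and count-based optimistic selection.
import random
from collections import defaultdict

def epsilon_greedy(Q, actions, s, epsilon=0.1):
    """With probability epsilon act randomly, otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def exploration_function_action(Q, N, actions, s, k=1.0):
    """Pick the action with the highest optimistic utility Q + k / (N + 1)."""
    return max(actions, key=lambda a: Q[(s, a)] + k / (N[(s, a)] + 1))

Q = defaultdict(float)
N = defaultdict(int)          # visit counts N(s, a)
actions = ["up", "down", "left", "right"]
print(epsilon_greedy(Q, actions, (1, 1)))
print(exploration_function_action(Q, N, actions, (1, 1)))
```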
Recommended publications
  • Probabilistic Goal Markov Decision Processes∗
Proceedings of the Twenty-Second International Joint Conference on Artificial Intelligence

Probabilistic Goal Markov Decision Processes*
Huan Xu (Department of Mechanical Engineering, National University of Singapore, Singapore) and Shie Mannor (Department of Electrical Engineering, Technion, Israel); emails: [email protected], [email protected]

Abstract: The Markov decision process model is a powerful tool in planning tasks and sequential decision making problems. The randomness of state transitions and rewards implies that the performance of a policy is often stochastic. In contrast to the standard approach that studies the expected performance, we consider the policy that maximizes the probability of achieving a pre-determined target performance, a criterion we term probabilistic goal Markov decision processes. We show that this problem is NP-hard, but can be solved using a pseudo-polynomial algorithm. We further consider a variant dubbed "chance-constraint Markov decision problems," that treats the probability of achieving target performance as a constraint instead of the maximizing objective.

... also concluded that in daily decision making, people tend to regard risk primarily as failure to achieve a predetermined goal [Lanzillotti, 1958; Simon, 1959; Mao, 1970; Payne et al., 1980; 1981]. Different variants of the probabilistic goal formulation, such as chance constrained programs, have been extensively studied in single-period optimization [Miller and Wagner, 1965; Prékopa, 1970]. However, little has been done in the context of sequential decision problems including MDPs. The standard approaches in risk-averse MDPs include maximization of an expected utility function [Bertsekas, 1995] and optimization of a coherent risk measure [Riedel, 2004; Le Tallec, 2007]. Both approaches lead to formulations that cannot be solved in polynomial time, except for special cases including exponential utility functions [Chung and Sobel, 1987] and piecewise linear utility functions with a single ...
  • Semi-Markov Decision Processes
SEMI-MARKOV DECISION PROCESSES
M. Baykal-Gürsoy, Department of Industrial and Systems Engineering, Rutgers University, Piscataway, New Jersey. Email: [email protected]

Abstract: Considered are infinite horizon semi-Markov decision processes (SMDPs) with finite state and action spaces. Total expected discounted reward and long-run average expected reward optimality criteria are reviewed. Solution methodology for each criterion is given; constraints and variance sensitivity are also discussed.

1 Introduction. Semi-Markov decision processes (SMDPs) are used in modeling stochastic control problems arising in Markovian dynamic systems where the sojourn time in each state is a general continuous random variable. They are powerful, natural tools for the optimization of queues [20, 44, 41, 18, 42, 43, 21], production scheduling [35, 31, 2, 3], and reliability/maintenance [22]. For example, in a machine replacement problem with deteriorating performance over time, a decision maker, after observing the current state of the machine, decides whether to continue its usage, or initiate a maintenance (preventive or corrective) repair, or replace the machine. There is a reward or cost structure associated with the states and decisions, and an information pattern available to the decision maker. This decision depends on a performance measure over the planning horizon, which is either finite or infinite, such as total expected discounted or long-run average expected reward/cost, with or without external constraints, and variance-penalized average reward. SMDPs are based on semi-Markov processes (SMPs) [9] [Semi-Markov Processes], which include renewal processes [Definition and Examples of Renewal Processes] and continuous-time Markov chains (CTMCs) [Definition and Examples of CTMCs] as special cases.
  • Graph-based State Representation for Deep Reinforcement Learning (arXiv:2004.13965v3 [cs.LG], 16 Feb 2021)
Graph-based State Representation for Deep Reinforcement Learning
Vikram Waradpande, Daniel Kudenko, and Megha Khosla
L3S Research Center, Leibniz University, Hannover. {waradpande,khosla,kudenko}@l3s.de

Abstract. Deep RL approaches build much of their success on the ability of the deep neural network to generate useful internal representations. Nevertheless, they suffer from a high sample-complexity, and starting with a good input representation can have a significant impact on the performance. In this paper, we exploit the fact that the underlying Markov decision process (MDP) represents a graph, which enables us to incorporate the topological information for effective state representation learning. Motivated by the recent success of node representations for several graph analytical tasks, we specifically investigate the capability of node representation learning methods to effectively encode the topology of the underlying MDP in Deep RL. To this end we perform a comparative analysis of several models chosen from 4 different classes of representation learning algorithms for policy learning in grid-world navigation tasks, which are representative of a large class of RL problems. We find that all embedding methods outperform the commonly used matrix representation of grid-world environments in all of the studied cases. Moreover, graph convolution based methods are outperformed by simpler random walk based methods and graph linear autoencoders.

1 Introduction. A good problem representation has been known to be crucial for the performance of AI algorithms. This is not different in the case of reinforcement learning (RL), where representation learning has been a focus of investigation.
  • A Brief Taster of Markov Decision Processes and Stochastic Games
Algorithmic Game Theory and Applications
Lecture 15: a brief taster of Markov Decision Processes and Stochastic Games
Kousha Etessami

Warning
• The subjects we will touch on today are so interesting and vast that one could easily spend an entire course on them alone.
• So, think of what we discuss today as only a brief "taster", and please do explore it further if it interests you.
• Here are two standard textbooks that you can look up if you are interested in learning more:
  – M. Puterman, Markov Decision Processes, Wiley, 1994.
  – J. Filar and K. Vrieze, Competitive Markov Decision Processes, Springer, 1997. (This book is really about 2-player zero-sum stochastic games.)

Games against Nature
Consider a game graph where some nodes belong to Player 1 but others are chance nodes of "Nature". [Figure: a game graph with a Start node, Player 1 nodes, and Nature's chance nodes labeled with probabilities such as 1/3, 1/6, 2/3 and payoffs such as 3, 1, −2.]
Question: What is Player 1's "optimal strategy" and "optimal expected payoff" in this game?

A simple finite game: "make a big number" [Figure: the k digit positions, indexed k−1, k−2, ..., 0.]
• Your goal is to create as large a k-digit number as possible, using digits from D = {0, 1, 2, ..., 9}, which "nature" will give you, one by one.
• The game proceeds in k rounds.
• In each round, "nature" chooses d ∈ D "uniformly at random", i.e., each digit has probability 1/10.
• You then choose which "unfilled" position in the k-digit number should be "filled" with digit d. (A dynamic-programming sketch follows below.)
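The optimal expected payoff of the "make a big number" game can be computed by a short dynamic program over the set of still-unfilled positions. This sketch is my own illustration of that calculation, not code from the lecture notes.

```python
# Expected value under optimal play: for each random digit, place it in the
# unfilled position that maximizes digit * place_value + value of the rest.
from functools import lru_cache

DIGITS = range(10)          # nature draws each digit with probability 1/10

@lru_cache(maxsize=None)
def value(open_places: frozenset) -> float:
    """Optimal expected total over the remaining rounds; open_places holds place values."""
    if not open_places:
        return 0.0
    total = 0.0
    for d in DIGITS:
        total += max(d * p + value(open_places - {p}) for p in open_places)
    return total / len(DIGITS)

k = 3
places = frozenset(10 ** i for i in range(k))   # place values 1, 10, 100
print(value(places))                            # optimal expected 3-digit number
```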
  • Markov Decision Processes
Markov Decision Processes
Robert Platt, Northeastern University
Some images and slides are used from: 1. CS188 UC Berkeley, 2. AIMA, 3. Chris Amato, 4. Stacy Marsella

Stochastic domains
So far, we have studied search. We can use search to solve simple planning problems, e.g. robot planning using A*, but only in deterministic domains. A* doesn't work so well in stochastic environments. We are going to introduce a new framework for encoding problems with stochastic dynamics: the Markov Decision Process (MDP).

SEQUENTIAL DECISION-MAKING

MAKING DECISIONS UNDER UNCERTAINTY
• Rational decision making requires reasoning about one's uncertainty and objectives
• Previous section focused on uncertainty
• This section will discuss how to make rational decisions based on a probabilistic model and utility function
• Last class, we focused on single step decisions; now we will consider sequential decision problems

REVIEW: EXPECTIMAX
• What if we don't know the outcome of actions?
• Actions can fail
  • when a robot moves, its wheels might slip
  • opponents may be uncertain
• Expectimax search: maximize average score
  • MAX nodes choose the action that maximizes the outcome
  • Chance nodes model an outcome ...
[Figure: an expectimax tree with a max node over two actions, chance nodes with probabilities .3/.7 and .5/.5, and numeric leaf utilities.] (A short expectimax sketch follows below.)
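A minimal expectimax sketch matching the description above. The tree, node classes, and leaf utilities are my own example, not the slide's figure.

```python
# Expectimax: MAX nodes take the best child, chance nodes take the
# probability-weighted average of their children.
from dataclasses import dataclass
from typing import List, Tuple, Union

@dataclass
class Chance:
    outcomes: List[Tuple[float, "Node"]]   # (probability, child)

@dataclass
class Max:
    children: List["Node"]

Node = Union[float, Chance, Max]

def expectimax(node: Node) -> float:
    if isinstance(node, Max):                 # MAX node: pick the best child
        return max(expectimax(c) for c in node.children)
    if isinstance(node, Chance):              # chance node: expected value over outcomes
        return sum(p * expectimax(c) for p, c in node.outcomes)
    return node                               # leaf utility

# Two actions: a safe one and a risky one.
root = Max([Chance([(0.5, 20.0), (0.5, 30.0)]),      # expected value 25
            Chance([(0.3, 100.0), (0.7, 0.0)])])     # expected value 30
print(expectimax(root))                              # 30.0
```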
  • Markovian Competitive and Cooperative Games with Applications to Communications
DISSERTATION in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy from University of Nice Sophia Antipolis
Specialization: Communication and Electronics
Lorenzo Maggi
Markovian Competitive and Cooperative Games with Applications to Communications
Thesis defended on the 9th of October, 2012 before a committee composed of:
Reporters: Prof. Jean-Jacques Herings (Maastricht University, The Netherlands); Prof. Roberto Lucchetti (Politecnico of Milano, Italy)
Examiners: Prof. Pierre Bernhard (INRIA Sophia Antipolis, France); Prof. Petros Elia (EURECOM, France); Prof. Matteo Sereno (University of Torino, Italy); Prof. Bruno Tuffin (INRIA Rennes Bretagne-Atlantique, France)
Thesis Director: Prof. Laura Cottatellucci (EURECOM, France)
Thesis Co-Director: Prof. Konstantin Avrachenkov (INRIA Sophia Antipolis, France)

[The French title page repeats the same information, with the French title "Jeux Markoviens, Compétitifs et Coopératifs, avec Applications aux Communications" and the same committee.]

Abstract: In this dissertation we deal with the design of strategies for agents interacting in a dynamic environment. The mathematical tool of Game Theory (GT) on Markov Decision Processes (MDPs) is adopted.
  • Factored Markov Decision Process Models for Stochastic Unit Commitment
MITSUBISHI ELECTRIC RESEARCH LABORATORIES, http://www.merl.com

Factored Markov Decision Process Models for Stochastic Unit Commitment
Daniel Nikovski, Weihong Zhang
TR2010-083, October 2010

Abstract: In this paper, we consider stochastic unit commitment problems where power demand and the output of some generators are random variables. We represent stochastic unit commitment problems in the form of factored Markov decision process models, and propose an approximate algorithm to solve such models. By incorporating a risk component in the cost function, the algorithm can achieve a balance between the operational costs and blackout risks. The proposed algorithm outperformed existing non-stochastic approaches on several problem instances, resulting in both lower risks and operational costs.

2010 IEEE Conference on Innovative Technologies

Daniel Nikovski and Weihong Zhang, {nikovski,wzhang}@merl.com, phone 617-621-7510, fax 617-621-7550, Mitsubishi Electric Research Laboratories, 201 Broadway, Cambridge, MA 02139, USA
  • Capturing Financial Markets to Apply Deep Reinforcement Learning
Capturing Financial Markets to Apply Deep Reinforcement Learning
Souradeep Chakraborty* (BITS Pilani University, K.K. Birla Goa Campus), [email protected]
*while working under Dr. Subhamoy Maitra, at the Applied Statistics Unit, ISI Calcutta
JEL: C02, C32, C63, C45

Abstract: In this paper we explore the usage of deep reinforcement learning algorithms to automatically generate consistently profitable, robust, uncorrelated trading signals in any general financial market. In order to do this, we present a novel Markov decision process (MDP) model to capture the financial trading markets. We review and propose various modifications to existing approaches and explore different techniques, like the usage of technical indicators, to succinctly capture the market dynamics to model the markets. We then go on to use deep reinforcement learning to enable the agent (the algorithm) to learn how to take profitable trades in any market on its own, while suggesting various methodology changes and leveraging the unique representation of the FMDP (financial MDP) to tackle the primary challenges faced in similar works. Through our experimentation results, we go on to show that our model could be easily extended to two very different financial markets and generates a positively robust performance in all conducted experiments.

Keywords: deep reinforcement learning, online learning, computational finance, Markov decision process, modeling financial markets, algorithmic trading

Introduction
1.1 Motivation. Since the early 1990s, efforts have been made to automatically generate trades using data and computation which consistently beat a benchmark and generate continuous positive returns with minimal risk. The goal started to shift from "learning how to win in financial markets" to "create an algorithm that can learn on its own, how to win in financial markets".
  • Markov Decision Processes and Dynamic Programming
Master MVA: Reinforcement Learning
Lecture 2: Markov Decision Processes and Dynamic Programming
Lecturer: Alessandro Lazaric, http://researchers.lille.inria.fr/~lazaric/Webpage/Teaching.html

Objectives of the lecture
1. Understand: Markov decision processes, Bellman equations and Bellman operators.
2. Use: dynamic programming algorithms.

1 The Markov Decision Process
1.1 Definitions

Definition 1 (Markov chain). Let the state space X be a bounded compact subset of the Euclidean space. The discrete-time dynamic system (x_t)_{t∈N} ∈ X is a Markov chain if

  P(x_{t+1} = x | x_t, x_{t-1}, ..., x_0) = P(x_{t+1} = x | x_t),   (1)

so that all the information needed to predict (in probability) the future is contained in the current state (Markov property). Given an initial state x_0 ∈ X, a Markov chain is defined by the transition probability p such that

  p(y | x) = P(x_{t+1} = y | x_t = x).   (2)

Remark: notice that in some cases we can turn a higher-order Markov process into a Markov process by including the past as a new state variable. For instance, in the control of an inverted pendulum, the state that can be observed is only the angular position θ_t. In this case the system is non-Markov, since the next position depends on the previous position but also on the angular speed, which is defined as dθ_t = θ_t − θ_{t−1} (as a first approximation). Thus, if we define the state space as x_t = (θ_t, dθ_t), we obtain a Markov chain.

Definition 2 (Markov decision process [Bellman, 1957, Howard, 1960, Fleming and Rishel, 1975, Puterman, 1994, Bertsekas and Tsitsiklis, 1996]).
  • An Introduction to Temporal Difference Learning
An Introduction to Temporal Difference Learning
Florian Kunz
Seminar on Autonomous Learning Systems, Department of Computer Science, TU Darmstadt
[email protected]

Abstract: Temporal Difference learning is one of the most used approaches for policy evaluation. It is a central part of solving reinforcement learning tasks. For deriving optimal control, policies have to be evaluated. This task requires value function approximation. At this point TD methods find application. The use of eligibility traces for backpropagation of updates as well as the bootstrapping of the prediction for every update state make these methods so powerful. This paper gives an introduction to reinforcement learning for a novice to understand the TD(λ) algorithm as presented by R. Sutton. The TD methods are the center of this paper, and hence, each step for deriving the update function is treated, starting with value function approximation, followed by the Bellman equation, and ending with eligibility traces. The further enhancement of TD in the form of least-squares temporal difference methods is treated. Both methods are compared with respect to their computational cost and learning rate. In the end an outlook towards application in control is given.

1 Introduction. This article addresses the machine learning approach of temporal difference. TD can be classified as an incremental method specialized in predicting future values for a partially unknown system. Temporal difference learning is declared to be a reinforcement learning method. This area of machine learning covers the problem of finding a perfect solution in an unknown environment. To be able to do so, a representation is needed to define which action yields the highest rewards.
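A minimal sketch of TD(λ) with accumulating eligibility traces, the mechanism the abstract refers to. The episode encoding and hyperparameters are my own illustration, not code from the paper.

```python
# Backward-view TD(lambda): every visited state keeps a decaying trace,
# and each one-step TD error updates all currently traced states.
from collections import defaultdict

def td_lambda(episode, gamma=0.9, lam=0.8, alpha=0.1, V=None):
    """episode: list of (state, reward, next_state) transitions under a fixed policy."""
    V = defaultdict(float) if V is None else V
    e = defaultdict(float)                       # eligibility traces
    for s, r, s_next in episode:
        delta = r + gamma * V[s_next] - V[s]     # one-step TD error
        e[s] += 1.0                              # accumulating trace for the visited state
        for x in list(e):
            V[x] += alpha * delta * e[x]         # credit propagates to recently visited states
            e[x] *= gamma * lam                  # traces decay each step
    return V

episode = [("A", 0.0, "B"), ("B", 0.0, "C"), ("C", 1.0, "terminal")]
print(dict(td_lambda(episode)))
```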
  • Markov Decision Processes
An Introduction to Markov Decision Processes
Bob Givan (Purdue University), Ron Parr (Duke University)

Outline
• Markov Decision Processes defined (Bob)
  • Objective functions
  • Policies
• Finding Optimal Solutions (Ron)
  • Dynamic programming
  • Linear programming
• Refinements to the basic model (Bob)
  • Partial observability
  • Factored representations

Stochastic Automata with Utilities
A Markov Decision Process (MDP) model contains:
• A set of possible world states S
• A set of possible actions A
• A real valued reward function R(s,a)
• A description T of each action's effects in each state.
We assume the Markov Property: the effects of an action taken in a state depend only on that state and not on the prior history.

Representing Actions
• Deterministic actions: T: S × A → S. For each state and action we specify a new state.
• Stochastic actions: T: S × A → Prob(S). For each state and action we specify a probability distribution over next states, representing the distribution P(s' | s, a).
(A minimal MDP container sketch follows below.)
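The MDP ingredients listed in the tutorial can be collected in a small container like the following. This is a sketch; the tiny two-state example MDP is my own, not from the tutorial.

```python
# States S, actions A, reward R(s, a), and stochastic transitions T: S x A -> Prob(S).
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List, Tuple

State = Hashable
Action = Hashable

@dataclass
class MDP:
    states: List[State]
    actions: List[Action]
    reward: Callable[[State, Action], float]                    # R(s, a)
    transition: Dict[Tuple[State, Action], Dict[State, float]]  # P(s' | s, a)

tiny = MDP(
    states=["low", "high"],
    actions=["wait", "work"],
    reward=lambda s, a: 1.0 if (s == "high" and a == "work") else 0.0,
    transition={
        ("low", "wait"): {"low": 1.0},
        ("low", "work"): {"low": 0.4, "high": 0.6},
        ("high", "wait"): {"high": 0.8, "low": 0.2},
        ("high", "work"): {"high": 1.0},
    },
)
print(tiny.transition[("low", "work")])   # distribution over next states
```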
  • Reinforcement Learning and Markov Decision Processes (MDPs)
Reinforcement learning and Markov Decision Processes (MDPs)
15-859(B) Avrim Blum

RL and MDPs. General scenario: We are an agent in some state. Have observations, perform actions, get rewards. (See lights, pull levers, get cookies.)

Markov Decision Process: like the DFA problem except we'll assume:
• Transitions are probabilistic. (harder than DFA)
• Observation = state. (easier than DFA)
Assumption is that reward and next state are (probabilistic) functions of the current observation and action only.
• Goal is to learn a good strategy for collecting reward, rather than necessarily to make a model. (different from DFA)
[Figure: a small transition diagram in which action a1 from state s1 leads to s2 with probability 0.6 and to s3 with probability 0.4.]

Typical example: Imagine a grid world with walls.
• Actions: left, right, up, down.
• If not currently hugging a wall, then with some probability the action takes you to an incorrect neighbor.
• Entering the top-right corner gives a reward of 100 and then takes you to a random state.

Nice features of MDPs
• Like DFA, an appealing model of an agent trying to figure out actions to take in the world. Incorporates the notion of actions being good or bad for reasons that will only become apparent later.
• Probabilities allow us to handle situations like: we wanted the robot to go forward 1 foot but it went forward 1.5 feet instead and turned slightly to the right. Or someone randomly picked it up and moved it somewhere else.
• Natural learning algorithms propagate reward backwards through state space. If we get reward 100 in state s, then perhaps give value 90 to the state s′ we were in right before s.
(A step-function sketch of this grid world follows below.)
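The grid world described above can be sketched as a step function. The grid size, slip probability, and reward placement are my assumptions for illustration, and for simplicity the sketch lets moves slip even when hugging a wall.

```python
# Noisy grid-world dynamics: moves slip to a random direction with probability SLIP;
# entering the top-right corner yields +100 and teleports to a random state.
import random

WIDTH, HEIGHT, SLIP = 4, 3, 0.2
MOVES = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}

def step(state, action):
    """Return (next_state, reward) for one noisy move."""
    if random.random() < SLIP:                       # slip to a random direction
        action = random.choice(list(MOVES))
    dx, dy = MOVES[action]
    x = min(max(state[0] + dx, 0), WIDTH - 1)        # walls clamp the move
    y = min(max(state[1] + dy, 0), HEIGHT - 1)
    if (x, y) == (WIDTH - 1, HEIGHT - 1):            # entering the top-right corner
        return (random.randrange(WIDTH), random.randrange(HEIGHT)), 100.0
    return (x, y), 0.0

print(step((2, 2), "right"))
```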