Q-Learning: Christopher Watkins and Peter Dayan
Presented by Noga Zaslavsky, The Hebrew University of Jerusalem
Advanced Seminar in Deep Learning (67679), November 1, 2015

Outline
1. Introduction: What is reinforcement learning? Modeling the problem; Bellman optimality
2. Q-Learning: Algorithm; Convergence analysis
3. Assumptions and Limitations
4. Summary

Introduction: What is reinforcement learning?

Reinforcement Learning: Agent-Environment Interaction
- Reinforcement learning is learning how to behave in order to maximize value or achieve some goal.
- It is not supervised learning, and it is not unsupervised learning.
- There is no expert to tell the learning agent right from wrong; it is forced to learn from its own experience.
- The learning agent receives feedback from its environment; it can learn by trying different actions in different situations.
- There is a tradeoff between exploration and exploitation.
[Figure: the agent-environment interaction loop, with the agent acting on the environment and the environment returning a new state and a reward.]

Toy Problem
[Figure: a toy navigation problem with an initial state $s_0$, intermediate states $s$, and a final state $s_f$.]

Introduction: Modeling the problem

Markov Decision Process (MDP)
- $S \in \mathcal{S}$: the state of the system
- $A \in \mathcal{A}$: the agent's action
- $p(s' \mid s, a)$: the dynamics of the system
- $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$: the reward function (possibly stochastic)
Definition: an MDP is the tuple $\langle \mathcal{S}, \mathcal{A}, p, r \rangle$.
The agent's behaviour is described by a policy $\pi(a \mid s)$.
[Figure: the MDP as a graphical model over states $S_{t-1}, S_t, S_{t+1}, S_{t+2}$, rewards $R_{t-1}, R_t, R_{t+1}$, and actions $A_{t-1}, A_t, A_{t+1}$.]
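To make these ingredients concrete, here is a minimal tabular sketch in Python. It is not from the slides or the paper: the state/action counts, the arrays `p` and `r`, the random dynamics, and the helper `step` are all illustrative assumptions (the later sketches below reuse these names).

```python
import numpy as np

# A minimal tabular MDP <S, A, p, r>, purely illustrative and not from the slides:
# states and actions are integer indices, p[s, a] is a distribution over next states,
# and r[s, a] is a deterministic reward (the slides allow it to be stochastic).
n_states, n_actions = 4, 2

rng = np.random.default_rng(0)
p = rng.random((n_states, n_actions, n_states))
p /= p.sum(axis=2, keepdims=True)       # normalize so that p(. | s, a) sums to 1
r = rng.random((n_states, n_actions))   # reward r(s, a)

def step(s, a):
    """Sample the environment: draw s' ~ p(. | s, a) and return it with r(s, a)."""
    s_next = rng.choice(n_states, p=p[s, a])
    return s_next, r[s, a]
```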
The Value Function
- The undiscounted value function for any given policy $\pi$:
  $V^\pi(s) = \mathbb{E}\left[ \sum_{t=0}^{T} r(s_t, a_t) \mid s_0 = s \right]$
  where $T$ is the horizon, which can be finite, episodic, or infinite.
- The discounted value function for any given policy $\pi$:
  $V^\pi(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s \right]$
  where $0 < \gamma < 1$ is the discounting factor.
- The goal: find a policy $\pi^*$ that maximizes the value for all $s \in \mathcal{S}$:
  $V^*(s) = V^{\pi^*}(s) = \max_\pi V^\pi(s)$
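As a companion to this definition, the sketch below estimates the discounted $V^\pi(s)$ directly from its form as an expected discounted sum of rewards, by averaging sampled rollouts. It reuses `n_states`, `n_actions`, `step`, and `rng` from the MDP sketch above; the uniform policy, the horizon cutoff, and the episode count are illustrative assumptions, not choices made on the slides.

```python
import numpy as np

# Monte Carlo estimate of the discounted value V^pi(s0): average the discounted
# return over many sampled episodes. Depends on n_states, n_actions, step(), and
# rng from the MDP sketch above; policy, horizon, and episode count are assumptions.
def mc_value(s0, gamma=0.9, episodes=2_000, horizon=200):
    pi = np.full((n_states, n_actions), 1.0 / n_actions)  # pi(a | s), uniform here
    total = 0.0
    for _ in range(episodes):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):                 # truncate the infinite sum
            a = rng.choice(n_actions, p=pi[s])   # a ~ pi(. | s)
            s, reward = step(s, a)
            ret += discount * reward
            discount *= gamma
        total += ret
    return total / episodes
```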
Introduction: Bellman optimality

The Bellman Equation
- The value function can be written recursively:
  $V^\pi(s) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \mid s_0 = s \right] = \mathbb{E}_{a \sim \pi(\cdot \mid s),\, s' \sim p(\cdot \mid s, a)}\left[ r(s, a) + \gamma V^\pi(s') \right]$
- The optimal value satisfies the Bellman equation:
  $V^*(s) = \max_\pi \mathbb{E}_{a \sim \pi(\cdot \mid s),\, s' \sim p(\cdot \mid s, a)}\left[ r(s, a) + \gamma V^*(s') \right]$
- $\pi^*$ is not necessarily unique, but $V^*$ is unique.
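In the tabular case the maximization over policies reduces to a per-state maximization over actions, and a standard way to solve the Bellman optimality equation is value iteration: treat the right-hand side as an operator and iterate it to its fixed point. The sketch below is only an illustration of that idea (value iteration is not covered on these slides); `p` and `r` are arrays shaped as in the MDP sketch above, and `gamma`, `tol`, and `max_iter` are assumed values.

```python
import numpy as np

# Value iteration: iterate the Bellman optimality backup until (approximate)
# convergence to V*. Expects tabular arrays p[s, a, s'] and r[s, a] as in the
# MDP sketch above; gamma, tol, and max_iter are illustrative choices.
def value_iteration(p, r, gamma=0.9, tol=1e-8, max_iter=10_000):
    V = np.zeros(r.shape[0])
    for _ in range(max_iter):
        # V(s) <- max_a [ r(s, a) + gamma * sum_{s'} p(s'|s, a) V(s') ]
        V_new = (r + gamma * (p @ V)).max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    return V_new
```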
The Q-Function
- The value of taking action $a$ at state $s$:
  $Q(s, a) = r(s, a) + \gamma \, \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\left[ V(s') \right]$
- If we know $V^*$, then an optimal policy is to decide deterministically
  $a^*(s) = \arg\max_a Q^*(s, a)$
- This means that $V^*(s) = \max_a Q^*(s, a)$.
- Conclusion: learning $Q^*$ makes it easy to obtain an optimal policy.
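A small sketch of that conclusion, under the same illustrative setup: recover $Q^*$ from $V^*$ and read off the greedy policy. It reuses `value_iteration` and the tabular arrays `p`, `r` from the sketches above; `gamma` is again an assumed value.

```python
import numpy as np

# Extract a deterministic greedy policy from Q*, which is recovered here from V*.
# Reuses value_iteration() and the arrays p, r from the sketches above.
def greedy_policy(p, r, gamma=0.9):
    V_star = value_iteration(p, r, gamma)
    # Q*(s, a) = r(s, a) + gamma * E_{s' ~ p(.|s, a)}[ V*(s') ]
    Q_star = r + gamma * (p @ V_star)
    # a*(s) = argmax_a Q*(s, a); note that V*(s) = max_a Q*(s, a)
    return Q_star.argmax(axis=1), Q_star
```

Note that this construction still requires the model ($p$ and $r$). The point of the Q-learning algorithm in the next section is that $Q^*$ can instead be estimated directly from sampled transitions, without knowing the dynamics, after which the same greedy rule yields an optimal policy.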