Q-Learning (Christopher Watkins and Peter Dayan)
Noga Zaslavsky
The Hebrew University of Jerusalem
Advanced Seminar in Deep Learning (67679)
November 1, 2015
Outline
1. Introduction: What is reinforcement learning? / Modeling the problem / Bellman optimality
2. Q-Learning: Algorithm / Convergence analysis
3. Assumptions and Limitations
4. Summary
Introduction: What is reinforcement learning?
Reinforcement Learning: Agent-Environment Interaction

Reinforcement learning is learning how to behave in order to maximize value or achieve some goal.
It is not supervised learning, and it is not unsupervised learning.
There is no expert to tell the learning agent right from wrong; it is forced to learn from its own experience.
The learning agent receives feedback from its environment; it can learn by trying different actions in different situations.
There is a tradeoff between exploration and exploitation.

[Diagram: the agent selects actions that act on the environment, and the environment returns the resulting state and reward to the agent.]
Introduction: What is reinforcement learning?
Toy Problem

[Diagram: a toy problem with an initial state s0, intermediate states s, and a final state sf.]
Introduction: Modeling the problem
Markov Decision Process (MDP)

S ∈ 𝒮: the state of the system
A ∈ 𝒜: the agent's action
p(s′ | s, a): the dynamics of the system
r : 𝒮 × 𝒜 → ℝ: the reward function (possibly stochastic)

Definition: An MDP is the tuple ⟨𝒮, 𝒜, p, r⟩.

π(a | s): the agent's policy

[Diagram: the MDP graphical model. States S_{t-1}, S_t, S_{t+1}, S_{t+2} evolve under the dynamics p; each state S_t yields a reward R_t, and the policy π maps states to actions A_{t-1}, A_t, A_{t+1}.]
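To make the tuple ⟨𝒮, 𝒜, p, r⟩ concrete, here is a minimal sketch of a small MDP as plain Python dictionaries. The particular states, actions, dynamics, and rewards are invented for illustration and are not from the paper:

```python
# A minimal, hypothetical MDP <S, A, p, r> as plain Python structures.
S = ["s0", "s", "sf"]          # state space
A = ["stay", "move"]           # action space

# p[(s, a)] maps each next state s' to its probability p(s' | s, a)
p = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "move"): {"s": 0.9, "s0": 0.1},
    ("s",  "stay"): {"s": 1.0},
    ("s",  "move"): {"sf": 0.9, "s": 0.1},
    ("sf", "stay"): {"sf": 1.0},
    ("sf", "move"): {"sf": 1.0},
}

# r(s, a): here a deterministic reward function stored as a table
r = {(s_, a_): 0.0 for s_ in S for a_ in A}
r[("s", "move")] = 1.0         # reward for moving toward the final state

gamma = 0.9                    # discounting factor (used below)
```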
Introduction: Modeling the problem
The Value Function

The un-discounted value function for any given policy π:
    V^π(s) = E[ Σ_{t=0}^{T} r(s_t, a_t) | s_0 = s ]
where T is the horizon: it can be finite, episodic, or infinite.

The discounted value function for any given policy π:
    V^π(s) = E[ Σ_{t=0}^{∞} γ^t r(s_t, a_t) | s_0 = s ]
where 0 < γ < 1 is the discounting factor.

The goal: find a policy π* that maximizes the value for all s ∈ 𝒮:
    V*(s) = V^{π*}(s) = max_π V^π(s)
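The definition suggests a direct way to estimate V^π(s): sample trajectories under π and average their discounted returns. A minimal Monte Carlo sketch, assuming the dictionary-based p, r, A from the previous snippet; truncating the infinite sum at a finite horizon is my own simplification:

```python
import random

def sample_next(p, s, a):
    """Sample s' ~ p(.|s, a) from a dict mapping next states to probabilities."""
    states, probs = zip(*p[(s, a)].items())
    return random.choices(states, weights=probs)[0]

def mc_value(p, r, policy, s0, gamma=0.9, horizon=100, n_rollouts=1000):
    """Monte Carlo estimate of V^pi(s0): average the discounted return
    over sampled trajectories, truncating the sum at `horizon` steps."""
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            ret += discount * r[(s, a)]
            discount *= gamma
            s = sample_next(p, s, a)
        total += ret
    return total / n_rollouts

# Example with the toy MDP above and a uniformly random policy:
# v0 = mc_value(p, r, lambda s: random.choice(A), "s0")
```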
Introduction: Bellman optimality
The Bellman Equation

The value function can be written recursively:
    V^π(s) = E[ Σ_{t=0}^{∞} γ^t r(s_t, a_t) | s_0 = s ]
           = E_{a∼π(·|s), s′∼p(·|s,a)}[ r(s, a) + γ V^π(s′) ]

The optimal value satisfies the Bellman equation:
    V*(s) = max_π E_{a∼π(·|s), s′∼p(·|s,a)}[ r(s, a) + γ V*(s′) ]

π* is not necessarily unique, but V* is unique.
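The Bellman equation characterizes V* as the fixed point of a one-step backup operator, which is easy to write down explicitly. A sketch under the same dictionary-based MDP assumptions as above (finite state and action lists, deterministic rewards):

```python
def bellman_backup(V, S, A, p, r, gamma):
    """One application of the Bellman optimality operator:
    (TV)(s) = max_a [ r(s, a) + gamma * sum_{s'} p(s'|s,a) V(s') ]."""
    return {
        s: max(
            r[(s, a)] + gamma * sum(prob * V[s2] for s2, prob in p[(s, a)].items())
            for a in A
        )
        for s in S
    }
```

Iterating this backup from any initial V converges to the unique V*, since the operator is a γ-contraction; the convergence proof sketched at the end of this section defines an analogous operator on Q-functions.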
Introduction: Bellman optimality
The Q-Function

The value of taking action a at state s:
    Q(s, a) = r(s, a) + γ E_{s′∼p(·|s,a)}[ V(s′) ]

If we know V*, an optimal policy is to decide deterministically:
    a*(s) = argmax_a Q*(s, a)

This means that
    V*(s) = max_a Q*(s, a)

Conclusion: learning Q* makes it easy to obtain an optimal policy.
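The conclusion translates directly into code: given a Q-table there is no planning left to do, just an argmax per state. A minimal sketch, assuming Q is a dict keyed by (state, action) pairs as in the other snippets:

```python
def greedy_action(Q, s, A):
    """a*(s) = argmax_a Q(s, a) (ties broken by list order)."""
    return max(A, key=lambda a: Q[(s, a)])

def state_value(Q, s, A):
    """V*(s) = max_a Q*(s, a)."""
    return max(Q[(s, a)] for a in A)
```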
Q-Learning: Algorithm
Learning Q*

If the agent knows the dynamics p and the reward function r, it can find Q* by dynamic programming (e.g. value iteration), as sketched below.
Otherwise, it needs to estimate Q* from its experience.
The experience of an agent is a sequence s_0, a_0, r_0, s_1, a_1, r_1, s_2, ...
The n-th episode is (s_n, a_n, r_n, s_{n+1}).
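For the known-model case, value iteration can be run directly on Q. A sketch under the same dictionary-based MDP assumptions as before; the tolerance and iteration cap are arbitrary choices of mine:

```python
def q_value_iteration(S, A, p, r, gamma, tol=1e-8, max_iters=10_000):
    """Compute Q* by iterating the backup
    Q(s,a) <- r(s,a) + gamma * sum_{s'} p(s'|s,a) max_a' Q(s',a')."""
    Q = {(s, a): 0.0 for s in S for a in A}
    for _ in range(max_iters):
        delta = 0.0
        for s in S:
            for a in A:
                backup = r[(s, a)] + gamma * sum(
                    prob * max(Q[(s2, b)] for b in A)
                    for s2, prob in p[(s, a)].items()
                )
                delta = max(delta, abs(backup - Q[(s, a)]))
                Q[(s, a)] = backup
        if delta < tol:          # stop when the largest update is tiny
            break
    return Q
```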
Q-Learning: Algorithm
The Q-Learning Algorithm

Initialize Q_0(s, a) for all a ∈ 𝒜 and s ∈ 𝒮.
For each episode n:
    observe the current state s_n
    select and execute an action a_n
    observe the next state s_{n+1} and the reward r_n
    update
        Q_n(s_n, a_n) ← (1 − α_n) Q_{n−1}(s_n, a_n) + α_n [ r_n + γ max_a Q_{n−1}(s_{n+1}, a) ]

where α_n ∈ (0, 1) is the learning rate.
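The update rule above is essentially the whole algorithm. Below is a minimal tabular sketch; the env object with reset() and step() is a hypothetical interface (not from the paper), and the constant α and ε are simplifications of the schedules discussed on the next slides:

```python
import random
from collections import defaultdict

def q_learning(env, A, gamma=0.9, alpha=0.1, epsilon=0.1, n_steps=100_000):
    """Tabular Q-learning with a constant learning rate and epsilon-greedy
    exploration. `env` is a hypothetical object with reset() -> s and
    step(a) -> (s_next, reward, done)."""
    Q = defaultdict(float)                   # Q[(s, a)], initialized to 0
    s = env.reset()
    for _ in range(n_steps):
        # epsilon-greedy action selection (see the next slide)
        if random.random() < epsilon:
            a = random.choice(A)
        else:
            a = max(A, key=lambda b: Q[(s, b)])
        s_next, reward, done = env.step(a)
        # Q(s,a) <- (1 - alpha) Q(s,a) + alpha [ r + gamma * max_a' Q(s',a') ],
        # with no bootstrap from terminal states
        bootstrap = 0.0 if done else max(Q[(s_next, b)] for b in A)
        Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (reward + gamma * bootstrap)
        s = env.reset() if done else s_next
    return Q
```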
Q-Learning: Algorithm
Exploration vs. Exploitation

How should the agent gain experience in order to learn?
If it explores too much, it might suffer from a low value.
If it exploits the learned Q function too early, it might get stuck at bad states without even knowing about better possibilities (overfitting).
One common approach is to use an ε-greedy policy, sketched below.
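A sketch of the ε-greedy rule: with probability ε take a uniformly random action (explore), otherwise act greedily with respect to the current Q (exploit). The linear decay schedule is one common but arbitrary choice, not something prescribed by the paper:

```python
import random

def epsilon_greedy(Q, s, A, epsilon):
    """With probability epsilon explore uniformly; otherwise exploit greedily."""
    if random.random() < epsilon:
        return random.choice(A)
    return max(A, key=lambda a: Q[(s, a)])

def decayed_epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=10_000):
    """Linear decay from eps_start to eps_end; an arbitrary schedule."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)
```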
Q-Learning: Convergence analysis
Convergence of the Q function

Theorem
Given bounded rewards |r_n| ≤ R and learning rates 0 ≤ α_n ≤ 1 such that
    Σ_{n=1}^{∞} α_n = ∞   and   Σ_{n=1}^{∞} α_n² < ∞,
then with probability 1,
    Q_n(s, a) → Q*(s, a) as n → ∞.

The exploration policy should be such that each state-action pair is encountered infinitely many times.
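The two conditions on α_n are the classic Robbins-Monro conditions. A constant learning rate violates Σ α_n² < ∞, while a harmonic per-pair schedule satisfies both; a minimal sketch (the counter bookkeeping is my own illustration):

```python
from collections import defaultdict

class HarmonicLR:
    """Per-pair harmonic learning rate alpha_n(s, a) = 1 / n(s, a), where
    n(s, a) counts how many times (s, a) has been updated. Harmonic rates
    satisfy sum(alpha_n) = inf and sum(alpha_n^2) < inf, as the theorem
    requires."""
    def __init__(self):
        self.counts = defaultdict(int)

    def __call__(self, s, a):
        self.counts[(s, a)] += 1
        return 1.0 / self.counts[(s, a)]

# Usage inside the Q-learning loop: alpha = lr(s, a) instead of a constant.
```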
Q-Learning: Convergence analysis
Convergence of the Q function

Proof (main idea): Define the operator