
Q-Learning
Christopher Watkins and Peter Dayan (1992)

Presented by Noga Zaslavsky

The Hebrew University of Jerusalem, Advanced Seminar (67679)

November 1, 2015

Outline

1 Introduction: What is reinforcement learning? / Modeling the problem / Bellman optimality
2 Q-Learning: Algorithm / Convergence analysis
3 Assumptions and Limitations
4 Summary


Introduction: What is reinforcement learning?
Reinforcement Learning: Agent-Environment Interaction

Reinforcement learning is learning how to behave in order to maximize value or achieve some goal.
It is not supervised learning, and it is not unsupervised learning.
There is no expert to tell the learning agent right from wrong; it is forced to learn from its own experience.
The learning agent receives feedback from its environment; it can learn by trying different actions in different situations (a minimal interaction loop is sketched below).
There is a tradeoff between exploration and exploitation.

[Figure: the interaction loop between the Agent and the Environment]
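
The interaction loop in the figure can be written in a few lines of code. The following is a minimal sketch only, assuming a hypothetical Environment class with reset() and step(action) methods; none of these names come from the original slides.

```python
import random

class Environment:
    """A stub environment, used only to illustrate the interaction protocol."""

    def reset(self):
        return 0  # return some initial state

    def step(self, action):
        # return (next_state, reward); both are dummy values here
        return random.randint(0, 3), random.random()

def interact(policy, env, num_steps=10):
    """The agent-environment loop: observe a state, act, receive feedback."""
    state = env.reset()
    for _ in range(num_steps):
        action = policy(state)                  # the agent chooses an action
        next_state, reward = env.step(action)   # the environment responds
        # a learning agent would update itself from (state, action, reward, next_state) here
        state = next_state

interact(lambda s: random.choice([0, 1]), Environment())
```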

Introduction: What is reinforcement learning?
Toy Problem

[Figure: a toy problem over states s0, s, and sf]


Introduction: Modeling the problem
Markov Decision Process (MDP)

S ∈ 𝒮 - the state of the system
A ∈ 𝒜 - the agent's action
p(s′|s, a) - the dynamics of the system
r : 𝒮 × 𝒜 → ℝ - the reward function (possibly stochastic)

Definition: an MDP is the tuple ⟨𝒮, 𝒜, p, r⟩

π(a|s) - the agent's policy

[Figure: the graphical model of an MDP - states S_{t-1}, S_t, S_{t+1}, S_{t+2} linked by the dynamics p, rewards R_{t-1}, R_t, R_{t+1}, and actions A_{t-1}, A_t, A_{t+1} drawn from the policy π]
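
To make the tuple ⟨𝒮, 𝒜, p, r⟩ concrete, here is a minimal sketch of a small tabular MDP in Python, loosely based on the toy problem's states s0, s, and sf; the action names and all numbers are invented for illustration only.

```python
# A tiny tabular MDP <S, A, p, r>; the action names and numbers are illustrative.
S = ["s0", "s", "sf"]     # states
A = ["left", "right"]     # actions

# p[(s, a)] maps each next state s' to its probability p(s' | s, a)
p = {
    ("s0", "right"): {"s": 1.0},
    ("s0", "left"):  {"s0": 1.0},
    ("s",  "right"): {"sf": 0.9, "s": 0.1},
    ("s",  "left"):  {"s0": 1.0},
    ("sf", "right"): {"sf": 1.0},
    ("sf", "left"):  {"sf": 1.0},
}

# r[(s, a)] is the reward r(s, a) (deterministic in this sketch)
r = {(s, a): 0.0 for s in S for a in A}
r[("s", "right")] = 1.0

# A policy pi(a | s); here the uniformly random policy
pi = {s: {a: 1.0 / len(A) for a in A} for s in S}
```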


Introduction: Modeling the problem
The Value Function

The un-discounted value function for any given policy π:
$V^{\pi}(s) = \mathbb{E}\left[\left.\sum_{t=0}^{T} r(s_t, a_t) \,\right|\, s_0 = s\right]$
where the horizon T can be finite, episodic, or infinite.

The discounted value function for any given policy π:
$V^{\pi}(s) = \mathbb{E}\left[\left.\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\right|\, s_0 = s\right]$
where 0 < γ < 1 is the discounting factor.

The goal: find a policy π* that maximizes the value for all s ∈ 𝒮:
$V^{*}(s) = V^{\pi^{*}}(s) = \max_{\pi} V^{\pi}(s)$
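
The discounted sum inside the expectation is easy to compute for any sampled trajectory; V^π(s) is then the average of such returns over trajectories that start at s and follow π. A minimal sketch, with made-up rewards:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute sum_t gamma^t * r_t for a finite reward sequence.

    With 0 < gamma < 1 and bounded rewards the infinite sum is well defined,
    and a long finite sequence is a close approximation to it.
    """
    return sum((gamma ** t) * r_t for t, r_t in enumerate(rewards))

# A constant reward of 1 gives approximately 1 / (1 - gamma) = 10
print(discounted_return([1.0] * 200, gamma=0.9))
```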


Introduction: Bellman optimality
The Bellman Equation

The value function can be written recursively:
$V^{\pi}(s) = \mathbb{E}\left[\left.\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\right|\, s_0 = s\right] = \mathbb{E}_{a \sim \pi(\cdot\mid s),\, s' \sim p(\cdot\mid s, a)}\left[ r(s, a) + \gamma V^{\pi}(s') \right]$

The optimal value satisfies the Bellman equation:
$V^{*}(s) = \max_{\pi}\ \mathbb{E}_{a \sim \pi(\cdot\mid s),\, s' \sim p(\cdot\mid s, a)}\left[ r(s, a) + \gamma V^{*}(s') \right]$

π* is not necessarily unique, but V* is unique.
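
For finite 𝒮 and 𝒜, as assumed throughout this talk, the expectations can be written out as sums, which may make the notation easier to parse; in the optimality equation the maximization over policies then reduces to a maximization over the action taken at s:

$V^{\pi}(s) = \sum_{a} \pi(a \mid s) \left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^{\pi}(s') \right]$
$V^{*}(s) = \max_{a} \left[ r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, V^{*}(s') \right]$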


Introduction: Bellman optimality
The Q-Function

The value of taking action a at state s:
$Q(s, a) = r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot\mid s, a)}\left[ V(s') \right]$

If we know V*, and hence Q*, an optimal policy is to decide deterministically:
$a^{*}(s) = \arg\max_{a} Q^{*}(s, a)$

This means that
$V^{*}(s) = \max_{a} Q^{*}(s, a)$

Conclusion: learning Q* makes it easy to obtain an optimal policy.
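
The conclusion, that Q* directly yields an optimal policy, takes only a few lines in code. A minimal sketch, assuming a tabular Q stored as a dict keyed by (state, action) pairs; the example values are made up:

```python
def greedy_policy(Q, actions):
    """Return the deterministic policy a*(s) = argmax_a Q(s, a)."""
    def act(state):
        return max(actions, key=lambda a: Q[(state, a)])
    return act

# Example with invented values: the greedy action in state "s" is "right"
Q = {("s", "left"): 0.2, ("s", "right"): 1.3}
policy = greedy_policy(Q, ["left", "right"])
print(policy("s"))  # -> "right"
```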



Q-Learning: Algorithm
Learning Q*

If the agent knows the dynamics p and the reward function r, it can find Q* by dynamic programming (e.g. value iteration; see the sketch below).
Otherwise, it needs to estimate Q* from its experience.
The experience of an agent is a sequence $s_0, a_0, r_0, s_1, a_1, r_1, s_2, \ldots$; the n-th episode is $(s_n, a_n, r_n, s_{n+1})$.
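
When p and r are known, Q* can be computed by dynamic programming as mentioned above. Below is a minimal sketch of value iteration on the Q function for a finite MDP, using the same dict conventions as the earlier MDP sketch; the discount factor and stopping tolerance are arbitrary choices.

```python
def q_value_iteration(S, A, p, r, gamma=0.9, tol=1e-8):
    """Compute Q* for a finite MDP with known dynamics p and rewards r.

    p[(s, a)] maps next states s' to probabilities p(s' | s, a), and
    r[(s, a)] is the expected reward. The Bellman optimality backup is
    applied until the largest change falls below tol.
    """
    Q = {(s, a): 0.0 for s in S for a in A}
    while True:
        delta = 0.0
        for s in S:
            for a in A:
                backup = r[(s, a)] + gamma * sum(
                    prob * max(Q[(s2, a2)] for a2 in A)
                    for s2, prob in p[(s, a)].items()
                )
                delta = max(delta, abs(backup - Q[(s, a)]))
                Q[(s, a)] = backup
        if delta < tol:
            return Q
```

On the toy MDP sketched earlier, the greedy policy with respect to the returned Q picks "right" in s0 and s, i.e. it heads toward sf.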

Q-Learning: Algorithm
Learning Q*

Q-Learning
Initialize Q_0(s, a) for all a ∈ 𝒜 and s ∈ 𝒮.
For each episode n:
  observe the current state s_n
  select and execute an action a_n
  observe the next state s_{n+1} and the reward r_n
  update
  $Q_n(s_n, a_n) \leftarrow (1 - \alpha_n)\, Q_{n-1}(s_n, a_n) + \alpha_n \left[ r_n + \gamma \max_{a} Q_{n-1}(s_{n+1}, a) \right]$

α_n ∈ (0, 1) is the learning rate.
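
The update rule translates almost line for line into code. A minimal sketch of the tabular algorithm, assuming the same hypothetical environment interface as earlier (env.reset() -> state, env.step(action) -> (next_state, reward)) and a caller-supplied select_action function for exploration (an ε-greedy choice for it is sketched after the next slide). A constant learning rate is used here purely for simplicity; the convergence theorem later in the talk requires a decaying one.

```python
from collections import defaultdict

def q_learning(env, actions, select_action, num_episodes=10000, gamma=0.9, alpha=0.1):
    """Tabular Q-learning.

    Assumes env.reset() -> state and env.step(action) -> (next_state, reward),
    and select_action(Q, state) for choosing actions (e.g. epsilon-greedy).
    Here an "episode" is a single transition, as on the slide above.
    """
    Q = defaultdict(float)   # Q[(state, action)], initialized to 0
    state = env.reset()
    for n in range(num_episodes):
        action = select_action(Q, state)
        next_state, reward = env.step(action)
        # Q_n(s,a) <- (1 - a_n) Q_{n-1}(s,a) + a_n [r_n + gamma * max_a' Q_{n-1}(s',a')]
        best_next = max(Q[(next_state, a)] for a in actions)
        Q[(state, action)] = (1 - alpha) * Q[(state, action)] + alpha * (reward + gamma * best_next)
        state = next_state
    return Q
```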


Q-Learning: Algorithm
Exploration vs. Exploitation

How should the agent gain experience in order to learn?
If it explores too much, it might suffer from a low value.
If it exploits the learned Q function too early, it might get stuck at bad states without even knowing about better possibilities (overfitting).

One common approach is to use an ε-greedy policy.
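
A minimal sketch of ε-greedy action selection, compatible with the Q-learning sketch above; ε = 0.1 is an arbitrary illustrative value, and in practice ε is often decayed over time.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore (uniformly random action);
    otherwise exploit the current Q estimate by acting greedily."""
    if random.random() < epsilon:
        return random.choice(actions)                     # explore
    return max(actions, key=lambda a: Q[(state, a)])      # exploit

# To plug into the q_learning sketch:
# select_action = lambda Q, s: epsilon_greedy(Q, s, actions)
```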

Q-Learning: Convergence analysis
Convergence of the Q function

Theorem
Given bounded rewards |r_n| ≤ R and learning rates 0 ≤ α_n ≤ 1 such that
$\sum_{n=1}^{\infty} \alpha_n = \infty \qquad \text{and} \qquad \sum_{n=1}^{\infty} \alpha_n^{2} < \infty,$
then with probability 1
$Q_n(s, a) \to Q^{*}(s, a) \quad \text{as } n \to \infty.$

The exploration policy should be such that each state-action pair will be encountered infinitely many times.
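
As an example (not from the original slides), the schedule α_n = 1/n satisfies both conditions:

$\sum_{n=1}^{\infty} \frac{1}{n} = \infty, \qquad \sum_{n=1}^{\infty} \frac{1}{n^{2}} = \frac{\pi^{2}}{6} < \infty,$

whereas a constant learning rate violates the second condition and α_n = 1/n² violates the first.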

Q-Learning: Convergence analysis
Convergence of the Q function

Proof - main idea
Define the operator
$B[Q]_{s,a} = r(s, a) + \gamma \sum_{s'} p(s' \mid s, a)\, \max_{a'} Q(s', a')$

B is a contraction under the $\|\cdot\|_{\infty}$ norm:
$\| B[Q_1] - B[Q_2] \|_{\infty} \le \gamma\, \| Q_1 - Q_2 \|_{\infty}$

It is easy to see that B[Q*] = Q*.
We are not done yet, because the updates come from a stochastic process.
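
The contraction property can also be checked numerically. The sketch below builds a random finite MDP, applies the backup operator B to two random Q tables, and verifies that their ∞-norm distance shrinks by at least a factor γ; all sizes and values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9

# Random dynamics p(s' | s, a) with normalized rows, and random rewards r(s, a)
p = rng.random((n_states, n_actions, n_states))
p /= p.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))

def B(Q):
    """Bellman backup: B[Q](s, a) = r(s, a) + gamma * sum_s' p(s'|s,a) max_a' Q(s', a')."""
    return r + gamma * p @ Q.max(axis=1)

Q1 = rng.random((n_states, n_actions))
Q2 = rng.random((n_states, n_actions))
lhs = np.abs(B(Q1) - B(Q2)).max()     # ||B[Q1] - B[Q2]||_inf
rhs = gamma * np.abs(Q1 - Q2).max()   # gamma * ||Q1 - Q2||_inf
print(lhs <= rhs + 1e-12)             # True
```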


Assumptions and Limitations
What assumptions did we make?

Q-learning is model free.
The state and reward are assumed to be fully observable; a POMDP is a much harder problem.
The Q function can be represented by a look-up table; otherwise we need to do something else, such as function approximation, and this is where deep learning comes in!


Summary

The basic reinforcement learning problem can be modeled as an MDP.
If the model is known, we can solve it precisely by dynamic programming.
Q-learning allows learning an optimal policy when the model is not known, and without trying to learn the model.
It converges asymptotically to the optimal Q if the learning rate decays neither too fast nor too slowly, and if the exploration policy is satisfactory.

Further Reading

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
