Markov Decision Process & Dynamic Programming

Keywords: value function, Bellman equation, optimality, Markov property, Markov decision process, dynamic programming, value iteration, policy iteration.

Vien Ngo, MLR, University of Stuttgart

Outline
• Reinforcement learning problem
  – Elements of reinforcement learning
  – Markov process
  – Markov reward process
  – Markov decision process
• Dynamic programming
  – Value iteration
  – Policy iteration
Reinforcement Learning Problem

Elements of a Reinforcement Learning Problem
• Agent vs. Environment.
• State, Action, Reward, Goal, Return.
• The Markov property.
• Markov decision process.
• Bellman equations.
• Optimality and Approximation.
Agent vs. Environment
• The learner and decision-maker is called the agent.
• The thing it interacts with, comprising everything outside the agent, is called the environment.
• The environment is formally formulated as a Markov Decision Process, a mathematically principled framework for sequential decision problems.
(from the Introduction to RL book, Sutton & Barto)
The Markov property
A state that summarizes past sensations compactly yet in such a way that all relevant information is retained. This normally requires more than the immediate sensations, but never more than the complete history of all past sensations. A state that succeeds in retaining all relevant information is said to be Markov, or to have the Markov property. (Introduction to RL book, Sutton & Barto)
• Formally,
  Pr(s_{t+1}, r_{t+1} | s_t, a_t, r_t, ..., s_0, a_0, r_0) = Pr(s_{t+1}, r_{t+1} | s_t, a_t, r_t)
• Examples: the current configuration of the chess board is Markov for predicting the next moves; in the cart-pole domain, the position and velocity of the cart together with the angle of the pole and its rate of change form a Markov state.
Markov Process
• A Markov process (Markov chain) is defined as a 2-tuple (S, P).
  – S is a state space.
  – P is a state transition probability matrix: P_{ss'} = P(s_{t+1} = s' | s_t = s).
Markov Process: Example
Recycling Robot's Markov Chain
[Diagram: a two-state Markov chain with states "Battery: high" and "Battery: low"; the transitions (search, wait, recharge, stop) are labeled with probabilities between 0.1 and 1.0.]
Markov Reward Process
• A Markov reward process is defined as a 4-tuple (S, P, R, γ).
  – S is a state space of n states.
  – P is a state transition probability matrix: P_{ss'} = P(s_{t+1} = s' | s_t = s).
  – R is a reward vector with entries R_s.
  – γ is a discount factor, γ ∈ [0, 1].
• The total return is
  ρ_t = R_t + γ R_{t+1} + γ² R_{t+2} + ...
Markov Reward Process: Example
[Diagram: the recycling robot's Markov chain with each transition labeled "probability; reward", e.g. 0.5; -1.0 for searching and 0.1; -10.0 for running the battery down.]
Markov Reward Process: Bellman Equations
• The value function V(s):
  V(s) = E[ρ_t | s_t = s] = E[R_t + γ V(s_{t+1}) | s_t = s]
• In matrix form V = R + γPV, hence V = (I − γP)^{-1} R. We will revisit this in the MDP setting.
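As a concrete illustration, the fixed point V = R + γPV can also be found by simple iteration instead of matrix inversion. Below is a minimal sketch in plain Python; the 2-state chain (the particular values in P and R) is invented purely for illustration:

```python
# A minimal sketch: solve V = R + gamma * P V by fixed-point iteration.
# The 2-state chain below (P, R values) is made up for illustration.

def mrp_values(P, R, gamma, tol=1e-10):
    """Iterate V <- R + gamma * P V until the change is below tol (needs gamma < 1)."""
    n = len(R)
    V = [0.0] * n
    while True:
        V_new = [R[s] + gamma * sum(P[s][s2] * V[s2] for s2 in range(n))
                 for s in range(n)]
        if max(abs(V_new[s] - V[s]) for s in range(n)) < tol:
            return V_new
        V = V_new

P = [[0.9, 0.1],    # transition probabilities from state 0
     [0.5, 0.5]]    # transition probabilities from state 1
R = [1.0, -1.0]     # reward collected in each state
V = mrp_values(P, R, gamma=0.9)
```

Because γ < 1 the iteration is a contraction, so it converges to the same V as the matrix inversion (I − γP)^{-1} R.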
Markov Reward Process: Discount Factor?
The discount factor has many interpretations:
• Weighing the importance of differently timed rewards: more recent rewards carry more weight.
• Representing uncertainty about whether future rewards occur at all (e.g., a geometric distribution over the horizon length).
• Representing a human's or animal's preference over the ordering of received rewards.
Markov decision process
• A reinforcement learning problem that satisfies the Markov property is called a Markov decision process, or MDP.
• MDP = {S, A, T, R, P_0, γ}.
  – S: the set of all possible states.
  – A: the set of all possible actions.
  – T: a transition function defining the probability T(s', s, a) = Pr(s' | s, a).
  – R: a reward function defining the reward R(s, a).
  – P_0: the probability distribution over initial states.
  – γ ∈ [0, 1]: a discount factor.
Example: Recycling Robot MDP
• A policy is a mapping from state space to action space:
  μ : S → A
[Diagram: an MDP trajectory s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, ...]
• Objective functions:
  – Expected average reward:
    η = lim_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} r(s_t, a_t, s_{t+1}) ]
  – Expected discounted reward:
    η_γ = E[ Σ_{t=0}^{∞} γ^t r(s_t, a_t, s_{t+1}) ]
• Singh et al. 1994: η_γ = η / (1 − γ)
Dynamic Programming
• State value functions
• Bellman equations
• Value iteration
• Policy iteration
State value function
• The value (expected discounted return) of policy π when started in state s:
  V^π(s) = E_π{ r_0 + γ r_1 + γ² r_2 + ... | s_0 = s }     (1)
  with discount factor γ ∈ [0, 1].
• Definition of optimality: a policy π* is optimal iff
  ∀s : V^{π*}(s) = V*(s),  where V*(s) = max_π V^π(s)
  (simultaneously maximising the value in all states).
(In MDPs there always exists at least one optimal deterministic policy.)
Bellman optimality equation
• For a fixed policy π:
  V^π(s) = E{ r_0 + γ r_1 + γ² r_2 + ... | s_0 = s; π }
         = E{ r_0 | s_0 = s; π } + γ E{ r_1 + γ r_2 + ... | s_0 = s; π }
         = R(π(s), s) + γ Σ_{s'} P(s' | π(s), s) E{ r_1 + γ r_2 + ... | s_1 = s'; π }
         = R(π(s), s) + γ Σ_{s'} P(s' | π(s), s) V^π(s')
• We can write this in vector notation: V^π = R^π + γ P^π V^π,
  with vectors V^π_s = V^π(s), R^π_s = R(π(s), s) and matrix P^π_{s's} = P(s' | π(s), s).
• For a stochastic policy π(a|s):
  V^π(s) = Σ_a π(a|s) R(a, s) + γ Σ_{s',a} π(a|s) P(s' | a, s) V^π(s')
• Bellman optimality equation:
  V*(s) = max_a [ R(a, s) + γ Σ_{s'} P(s' | a, s) V*(s') ]
  π*(s) = argmax_a [ R(a, s) + γ Σ_{s'} P(s' | a, s) V*(s') ]
  (Sketch of proof: if π* selected an action other than argmax_a[·], then the policy π' that equals π* everywhere except π'(s) = argmax_a[·] would be better.)
• This is the principle of optimality in the stochastic case (related to Viterbi and the max-product algorithm).

Richard E. Bellman (1920-1984)
• Bellman's principle of optimality: if a path through A to B is optimal, then its remaining part from A to B is also optimal ("A opt ⇒ B opt"). This is exactly what the Bellman optimality equations express.
Value Iteration
• Given the Bellman equation
  V*(s) = max_a [ R(a, s) + γ Σ_{s'} P(s' | a, s) V*(s') ]
  iterate
  ∀s : V_{k+1}(s) = max_a [ R(a, s) + γ Σ_{s'} P(s' | a, s) V_k(s') ]
• Stopping criterion:
  max_s |V_{k+1}(s) − V_k(s)| ≤ ε
• Value iteration converges to the optimal value function V* (proof below).
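The value-iteration loop can be sketched in a few lines of plain Python. The 2-state, 2-action MDP below is invented for illustration (action 0 stays in the current state, action 1 switches states, and only staying in state 1 yields reward 1); the function itself follows the update rule above:

```python
# Minimal value iteration. P[a][s][s2] = P(s2 | a, s), R[a][s] = R(a, s).

def value_iteration(P, R, gamma, eps=1e-8):
    """Iterate V_{k+1}(s) = max_a [R(a,s) + gamma * sum_{s'} P(s'|a,s) V_k(s')]."""
    n_actions, n_states = len(R), len(R[0])
    V = [0.0] * n_states
    while True:
        V_new = [max(R[a][s] + gamma * sum(P[a][s][s2] * V[s2]
                                           for s2 in range(n_states))
                     for a in range(n_actions))
                 for s in range(n_states)]
        if max(abs(V_new[s] - V[s]) for s in range(n_states)) <= eps:
            return V_new
        V = V_new

P = [[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay in place
     [[0.0, 1.0], [1.0, 0.0]]]   # action 1: switch states
R = [[0.0, 1.0],                 # R(a=0, s): reward 1 for staying in state 1
     [0.0, 0.0]]                 # R(a=1, s)
V = value_iteration(P, R, gamma=0.9)
# Expected: V(1) = 1/(1-0.9) = 10 and V(0) = 0.9 * 10 = 9 (up to eps)
```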
2x2 Maze
[Diagram: a 2x2 grid world with reward 1.0 in one cell and 0.0 elsewhere; an action moves in the intended direction with probability 80% and slips to either side with probability 10% each. Solved manually in the lecture.]
State-action value function (Q-function)
• The state-action value function (or Q-function) is the expected discounted return when starting in state s and taking first action a:
  Q^π(a, s) = E_π{ r_0 + γ r_1 + γ² r_2 + ... | s_0 = s, a_0 = a }
            = R(a, s) + γ Σ_{s'} P(s' | a, s) Q^π(π(s'), s')
  (Note: V^π(s) = Q^π(π(s), s).)
• Bellman optimality equation for the Q-function:
  Q*(a, s) = R(a, s) + γ Σ_{s'} P(s' | a, s) max_{a'} Q*(a', s')
  π*(s) = argmax_a Q*(a, s)
Q-Iteration
• Given the Bellman equation
  Q*(a, s) = R(a, s) + γ Σ_{s'} P(s' | a, s) max_{a'} Q*(a', s')
  iterate
  ∀a, s : Q_{k+1}(a, s) = R(a, s) + γ Σ_{s'} P(s' | a, s) max_{a'} Q_k(a', s')
• Stopping criterion:
  max_{a,s} |Q_{k+1}(a, s) − Q_k(a, s)| ≤ ε
• Q-iteration converges to the optimal state-action value function Q*.
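A Q-iteration sketch matching this update, on the same kind of invented 2-state, 2-action MDP used before (action 0 stays, action 1 switches, reward 1 for staying in state 1); the greedy policy π*(s) = argmax_a Q*(a, s) is read off at the end:

```python
# Minimal Q-iteration. P[a][s][s2] = P(s2 | a, s), R[a][s] = R(a, s).

def q_iteration(P, R, gamma, eps=1e-8):
    """Iterate Q_{k+1}(a,s) = R(a,s) + gamma * sum_{s'} P(s'|a,s) max_{a'} Q_k(a',s')."""
    n_actions, n_states = len(R), len(R[0])
    Q = [[0.0] * n_states for _ in range(n_actions)]
    while True:
        Q_new = [[R[a][s] + gamma * sum(P[a][s][s2] * max(Q[a2][s2] for a2 in range(n_actions))
                                        for s2 in range(n_states))
                  for s in range(n_states)]
                 for a in range(n_actions)]
        delta = max(abs(Q_new[a][s] - Q[a][s])
                    for a in range(n_actions) for s in range(n_states))
        Q = Q_new
        if delta <= eps:
            return Q

P = [[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay in place
     [[0.0, 1.0], [1.0, 0.0]]]   # action 1: switch states
R = [[0.0, 1.0],
     [0.0, 0.0]]
Q = q_iteration(P, R, gamma=0.9)
policy = [max(range(2), key=lambda a: Q[a][s]) for s in range(2)]
# Greedy policy: switch to state 1 (action 1 in state 0), then stay (action 0)
```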
Proof of convergence
• Let Δ_k = ||Q* − Q_k||_∞ = max_{a,s} |Q*(a, s) − Q_k(a, s)|. Then
  Q_{k+1}(a, s) = R(a, s) + γ Σ_{s'} P(s' | a, s) max_{a'} Q_k(a', s')
              ≤ R(a, s) + γ Σ_{s'} P(s' | a, s) [ max_{a'} Q*(a', s') + Δ_k ]
              = [ R(a, s) + γ Σ_{s'} P(s' | a, s) max_{a'} Q*(a', s') ] + γ Δ_k
              = Q*(a, s) + γ Δ_k
• Similarly: Q_k ≥ Q* − Δ_k ⇒ Q_{k+1} ≥ Q* − γ Δ_k.
Convergence
• Contraction property:
  ||U_{k+1} − V_{k+1}|| ≤ γ ||U_k − V_k||
  which guarantees convergence from any two initial values U_0, V_0 of the approximation:
  ||U_{k+1} − V_{k+1}|| ≤ γ ||U_k − V_k|| ≤ ... ≤ γ^{k+1} ||U_0 − V_0||
• Stopping condition: ||V_{k+1} − V_k|| ≤ ε ⇒ ||V_{k+1} − V*|| ≤ εγ/(1 − γ)
  Proof:
  ||V_{k+1} − V*|| ≤ γ ||V_k − V*|| ≤ γ ( ||V_k − V_{k+1}|| + ||V_{k+1} − V*|| ) ≤ γε + γ ||V_{k+1} − V*||
  ⇒ ||V_{k+1} − V*|| ≤ εγ/(1 − γ)
Policy Evaluation
• Value Iteration and Q-Iteration compute V* and Q* directly. If we want to evaluate a given policy π, we compute V^π or Q^π instead:
• Iterate using π instead of max_a:
  ∀s : V_{k+1}(s) = R(π(s), s) + γ Σ_{s'} P(s' | π(s), s) V_k(s')
  ∀a, s : Q_{k+1}(a, s) = R(a, s) + γ Σ_{s'} P(s' | a, s) Q_k(π(s'), s')
• Or invert the matrix equation:
  V^π = R^π + γ P^π V^π
  (I − γ P^π) V^π = R^π
  V^π = (I − γ P^π)^{-1} R^π
  which requires inversion of an n × n matrix for |S| = n, i.e. O(n³).
Policy Iteration
• How does computing V^π or Q^π help us find the optimal policy?
• Policy iteration:
  1. Initialise π_0 somehow (e.g. randomly).
  2. Iterate:
     – Policy evaluation: compute V^{π_k} or Q^{π_k}.
     – Policy improvement: π_{k+1}(s) ← argmax_a Q^{π_k}(a, s).
• Demo: 2x2 maze.
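The evaluate-then-improve loop can be sketched as follows, again on an invented 2-state, 2-action MDP (action 0 stays, action 1 switches, reward 1 for staying in state 1). Policy evaluation is done iteratively here; the matrix inversion of the previous slide would work equally well:

```python
# Minimal policy iteration: iterative policy evaluation + greedy improvement.

def evaluate(pi, P, R, gamma, tol=1e-10):
    """Iterative policy evaluation: V <- R^pi + gamma * P^pi V."""
    n = len(pi)
    V = [0.0] * n
    while True:
        V_new = [R[pi[s]][s] + gamma * sum(P[pi[s]][s][s2] * V[s2] for s2 in range(n))
                 for s in range(n)]
        if max(abs(V_new[s] - V[s]) for s in range(n)) <= tol:
            return V_new
        V = V_new

def policy_iteration(P, R, gamma):
    n_actions, n_states = len(R), len(R[0])
    pi = [0] * n_states                      # arbitrary initial policy
    while True:
        V = evaluate(pi, P, R, gamma)        # policy evaluation
        pi_new = [max(range(n_actions),      # greedy policy improvement
                      key=lambda a: R[a][s] + gamma * sum(P[a][s][s2] * V[s2]
                                                          for s2 in range(n_states)))
                  for s in range(n_states)]
        if pi_new == pi:                     # policy stable: optimal
            return pi, V
        pi = pi_new

P = [[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay in place
     [[0.0, 1.0], [1.0, 0.0]]]   # action 1: switch states
R = [[0.0, 1.0],
     [0.0, 0.0]]
pi, V = policy_iteration(P, R, gamma=0.9)
```

On this toy problem the loop stabilises after two improvement steps, with the policy "switch in state 0, stay in state 1".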
Convergence proof
Policy iteration converges because:
• After policy improvement, V^{π_k} ≤ V^{π_{k+1}} (a sketch proof is given in Rich Sutton's book).
• The policy space is finite: there are |A|^{|S|} deterministic policies.
• The Bellman operator has a unique fixed point, due to its strict contraction property (0 < γ < 1) on a Banach space. The same property is used to prove convergence of the VI algorithm.
VI vs. PI
• VI is PI with a single step of policy evaluation.
• PI converges surprisingly rapidly, but each iteration is expensive: the policy evaluation step waits for convergence of V^π.
• PI is preferred when the action set is large.
Asynchronous Dynamic Programming
• The value function table is updated asynchronously.
• Computation is significantly reduced.
• As long as every state continues to be updated infinitely often, convergence is still guaranteed.
• Three simple algorithms:
  – Gauss-Seidel value iteration
  – Real-time dynamic programming
  – Prioritised sweeping
Gauss-Seidel Value Iteration
• The standard VI algorithm updates all states in each iteration using the old values from the previous iteration (an iteration finishes when all states have been updated).

Algorithm 1: Standard Value Iteration
1: while not converged do
2:   V_old = V
3:   for each s ∈ S do
4:     V(s) = max_a { R(s, a) + γ Σ_{s'} P(s' | s, a) V_old(s') }

• Gauss-Seidel VI updates each state using the most recent values from previous computations.

Algorithm 2: Gauss-Seidel Value Iteration
1: while not converged do
2:   for each s ∈ S do
3:     V(s) = max_a { R(s, a) + γ Σ_{s'} P(s' | s, a) V(s') }

Prioritised Sweeping
• Similar to Gauss-Seidel VI, but the order of state updates within each iteration is driven by their update magnitudes (Bellman errors).
• Define the Bellman error as E(s; V_k) = |V_{k+1}(s) − V_k(s)|, i.e. the change of s's value after the most recent update.

Algorithm 3: Prioritised Sweeping VI
1: Initialize V_0(s) and priority values H_0(s), ∀s ∈ S.
2: for k = 0, 1, 2, 3, ... do
3:   pick the state with the highest priority to update: s_k ∈ argmax_{s∈S} H_k(s)
4:   value update: V_{k+1}(s_k) = max_a { R(s_k, a) + γ Σ_{s'} P(s' | s_k, a) V_k(s') }
5:   for s ≠ s_k: V_{k+1}(s) = V_k(s)
6:   update priority values: ∀s ∈ S, H_{k+1}(s) ← E(s; V_{k+1}) (note: the error is w.r.t. the future update).
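A simplified sketch of this scheme in plain Python. Following the algorithm above, a state's priority is the change its next update would cause (its Bellman residual); for clarity we recompute all priorities each step, whereas a practical implementation would only re-prioritise the predecessors of the updated state. The tiny 2-state, 2-action MDP is invented for illustration:

```python
# Simplified prioritised sweeping: always update the state with the largest
# Bellman residual. P[a][s][s2] = P(s2 | a, s), R[a][s] = R(a, s).

def backup(V, s, P, R, gamma):
    """One Bellman backup of state s under the current value table V."""
    n_actions, n_states = len(R), len(R[0])
    return max(R[a][s] + gamma * sum(P[a][s][s2] * V[s2] for s2 in range(n_states))
               for a in range(n_actions))

def prioritised_sweeping(P, R, gamma, tol=1e-8, max_updates=100000):
    n_states = len(R[0])
    V = [0.0] * n_states
    for _ in range(max_updates):
        # priority H(s) = |change in V(s) if s were updated now| (Bellman residual)
        H = [abs(backup(V, s, P, R, gamma) - V[s]) for s in range(n_states)]
        s = max(range(n_states), key=lambda i: H[i])
        if H[s] <= tol:              # all residuals small: converged
            break
        V[s] = backup(V, s, P, R, gamma)
    return V

P = [[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay in place
     [[0.0, 1.0], [1.0, 0.0]]]   # action 1: switch states
R = [[0.0, 1.0],
     [0.0, 0.0]]
V = prioritised_sweeping(P, R, gamma=0.9)
```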
Real-Time Dynamic Programming
• Similar to Gauss-Seidel VI, but the sequence of states updated in each iteration is generated by simulating transitions.

Algorithm 4: Real-Time Value Iteration
1: start at an arbitrary s_0, and initialize V_0(s), ∀s ∈ S.
2: for k = 0, 1, 2, 3, ... do
3:   action selection: a_k ∈ argmax_{a∈A} { R(s_k, a) + γ Σ_{s'} P(s' | s_k, a) V_k(s') }
4:   value update: V_{k+1}(s_k) = R(s_k, a_k) + γ Σ_{s'} P(s' | s_k, a_k) V_k(s')
5:   for s ≠ s_k: V_{k+1}(s) = V_k(s)
6:   simulate the next state: s_{k+1} ∼ P(s' | s_k, a_k)
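A minimal RTDP sketch on the same kind of invented 2-state, 2-action MDP. One detail not on the slide: because the action selection is greedy, the value table should be initialised optimistically (V_0 ≥ V*, here R_max/(1 − γ)) so that the simulation does not get stuck before discovering rewards:

```python
# Minimal real-time DP: the state to update comes from simulated transitions.
import random

def rtdp(P, R, gamma, n_steps=500, seed=0):
    rng = random.Random(seed)
    n_actions, n_states = len(R), len(R[0])
    r_max = max(max(row) for row in R)
    V = [r_max / (1.0 - gamma)] * n_states   # optimistic initialisation
    s = 0                                    # arbitrary start state
    for _ in range(n_steps):
        # greedy action selection w.r.t. the current value estimate
        a = max(range(n_actions),
                key=lambda a: R[a][s] + gamma * sum(P[a][s][s2] * V[s2]
                                                    for s2 in range(n_states)))
        # value update at the visited state only
        V[s] = R[a][s] + gamma * sum(P[a][s][s2] * V[s2] for s2 in range(n_states))
        # simulate the next state s_{k+1} ~ P(. | s_k, a_k)
        s = rng.choices(range(n_states), weights=P[a][s])[0]
    return V

P = [[[1.0, 0.0], [0.0, 1.0]],   # action 0: stay in place
     [[0.0, 1.0], [1.0, 0.0]]]   # action 1: switch states
R = [[0.0, 1.0],
     [0.0, 0.0]]
V = rtdp(P, R, gamma=0.9)
```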
• So far, we have introduced the basic notions of an MDP and of value functions, and methods to compute optimal policies assuming that we know the world (i.e. we know P(s'|a, s) and R(a, s)):
  – Value Iteration / Q-Iteration → V*, Q*, π*
  – Policy Evaluation → V^π, Q^π
  – Policy Improvement: π(s) ← argmax_a Q^{π_k}(a, s)
  – Policy Iteration (iterate Policy Evaluation and Policy Improvement)
• Reinforcement Learning?