Markov decision process

value function, optimality, Markov property, Markov decision process, dynamic programming, value iteration, policy iteration.

Vien Ngo, MLR, University of Stuttgart

Outline

• Reinforcement learning problem
  – Elements of reinforcement learning
  – Markov Process
  – Markov Reward Process
  – Markov decision process

• Dynamic programming
  – Value iteration
  – Policy iteration

Reinforcement Learning Problem

Elements of the Reinforcement Learning Problem

• Agent vs. Environment.

• State, Action, Reward, Goal, Return.

• The Markov property.

• Markov decision process.

• Bellman equations.

• Optimality and Approximation.

Agent vs. Environment

• The learner and decision-maker is called the agent.

• The thing it interacts with, comprising everything outside the agent, is called the environment.

• The environment is formally formulated as a Markov Decision Process, which is a mathematically principled framework for sequential decision problems.

(from Introduction to RL book, Sutton & Barto)

The Markov property

A state summarizes past sensations compactly, yet in such a way that all relevant information is retained. This normally requires more than the immediate sensations, but never more than the complete history of all past sensations. A state that succeeds in retaining all relevant information is said to be Markov, or to have the Markov property. (Introduction to RL book, Sutton & Barto)

• Formally,

  P(s_{t+1}, r_{t+1} | s_t, a_t, r_t, ..., s_0, a_0, r_0) = P(s_{t+1}, r_{t+1} | s_t, a_t, r_t)

• Examples: the current configuration of the chess board for predicting the next moves; the position and velocity of the cart, together with the angle of the pole and its rate of change, in the cart-pole domain.

Markov Process

• A Markov Process (or Markov Chain) is defined as a 2-tuple (S, P).
  – S is a state space.
  – P is a state transition matrix: P_{ss'} = P(s_{t+1} = s' | s_t = s)
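
To make the 2-tuple concrete, the following is a minimal NumPy sketch (not from the lecture) of a Markov chain over a hypothetical two-state battery model; all probabilities and names are illustrative. It stores P as a row-stochastic matrix and samples a trajectory from it.

import numpy as np

# Hypothetical two-state Markov chain ("high"/"low" battery);
# transition probabilities are illustrative, not the lecture's numbers.
states = ["high", "low"]
P = np.array([[0.7, 0.3],    # P(s' | s = "high")
              [0.4, 0.6]])   # P(s' | s = "low")
assert np.allclose(P.sum(axis=1), 1.0)   # each row is a probability distribution

rng = np.random.default_rng(0)

def sample_chain(P, s0=0, T=10):
    """Sample a trajectory s_0, s_1, ..., s_T from the chain."""
    traj = [s0]
    for _ in range(T):
        traj.append(int(rng.choice(len(P), p=P[traj[-1]])))
    return traj

print([states[s] for s in sample_chain(P)])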

Markov Process: Example

Recycling Robot's Markov Chain

[Figure: state-transition diagram over the states "Battery: high" and "Battery: low", with edges labeled search, wait, recharge, and stop, and transition probabilities 0.1, 0.5, 0.9, and 1.0.]

Markov Reward Process

• A Markov Reward Process is defined as a 4-tuple (S, P, R, γ).
  – S is a state space of n states.
  – P is a state transition probability matrix: P_{ss'} = P(s_{t+1} = s' | s_t = s)
  – R is a reward function with entries R_s.
  – γ ∈ [0, 1] is a discount factor.

• The total return is

  ρ_t = R_t + γ R_{t+1} + γ^2 R_{t+2} + ...

Markov Reward Process: Example

[Figure: the recycling robot's chain annotated with "probability; reward" pairs on each edge (e.g. 0.9;0.0, 0.1;-10.0, 0.5;-1.0), over the states "Battery: high" and "Battery: low" with edges search, wait, recharge, and stop.]

Markov Reward Process: Bellman Equations

• The value function V(s):

  V(s) = E[ρ_t | s_t = s] = E[R_t + γ V(s_{t+1}) | s_t = s]

• In vector notation, V = R + γPV, hence V = (I − γP)^{-1} R. We will revisit this for MDPs.
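
As a quick sanity check of the closed form above, here is a short NumPy sketch (not from the lecture) that solves V = (I − γP)^{-1} R for a hypothetical two-state MRP; all numbers are illustrative. A linear solve is used instead of forming the inverse explicitly.

import numpy as np

# Hypothetical two-state MRP (illustrative numbers only).
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])       # transition matrix P_{ss'}
R = np.array([1.0, -1.0])        # expected reward R_s in each state
gamma = 0.9

# Solve (I - γP) V = R instead of inverting the matrix.
V = np.linalg.solve(np.eye(len(R)) - gamma * P, R)
print("V =", V)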

Markov Reward Process: Discount Factor?

The discount factor has many meanings:

• Weighing the importance of differently timed rewards, giving higher weight to more immediate rewards.
• Representing uncertainty about whether future rewards will be received, i.e. a geometric distribution over the horizon.
• Representing a human's or animal's preference over the ordering of received rewards.

Markov decision process

• A reinforcement learning problem that satisfies the Markov property is called a Markov decision process, or MDP.

• MDP = {S, A, T, R, P_0, γ}
  – S: the set of all possible states.
  – A: the set of all possible actions.
  – T: a transition function, defining the probability T(s', s, a) = P(s' | s, a).
  – R: a reward function, defining the reward R(s, a).
  – P_0: a probability distribution over initial states.
  – γ ∈ [0, 1]: a discount factor.
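
For concreteness, a tabular MDP {S, A, T, R, P_0, γ} can be stored as a few arrays. The sketch below (not from the lecture; all numbers are hypothetical) uses the layout T[a, s, s'] = P(s' | s, a) and R[a, s], which the later dynamic-programming sketches also assume.

import numpy as np

# A tiny hypothetical MDP with 2 states and 2 actions (illustrative numbers).
T = np.array([[[0.9, 0.1],     # action 0: P(s' | s=0), P(s' | s=1)
               [0.4, 0.6]],
              [[0.2, 0.8],     # action 1
               [0.1, 0.9]]])   # shape (|A|, |S|, |S|), rows sum to 1
R = np.array([[1.0, 0.0],      # R[a, s]
              [0.5, 2.0]])
P0 = np.array([1.0, 0.0])      # initial-state distribution
gamma = 0.9

assert np.allclose(T.sum(axis=2), 1.0) and np.isclose(P0.sum(), 1.0)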

Example: Recycling Robot MDP

• A policy is a mapping from state space to action space:

  μ : S → A

[Figure: the agent-environment interaction unrolled over time, producing a trajectory s_0, a_0, r_0, s_1, a_1, r_1, s_2, a_2, r_2, ...]

• Objective function:
  – Expected average reward:

    η = lim_{T→∞} (1/T) E[ Σ_{t=0}^{T-1} r(s_t, a_t, s_{t+1}) ]

  – Expected discounted reward:

    η_γ = E[ Σ_{t=0}^{∞} γ^t r(s_t, a_t, s_{t+1}) ]

• Singh et al., 1994:

  η_γ = (1 / (1 − γ)) η
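
To relate the two objectives numerically, here is a small NumPy sketch (not from the lecture; the chain and rewards are hypothetical). It computes the discounted values V = (I − γP)^{-1} r and an average reward from the stationary distribution, and checks that (1 − γ)V approaches the average reward as γ → 1, in the spirit of the Singh et al. relation above.

import numpy as np

# Hypothetical 2-state chain induced by a fixed policy (illustrative numbers).
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])        # P(s' | s) under the fixed policy
r = np.array([1.0, -0.5])         # expected reward per state
gamma = 0.99

# Discounted value of each start state: V = (I - γP)^{-1} r
V = np.linalg.solve(np.eye(2) - gamma * P, r)

# Average reward: η = d·r, where d is the stationary distribution (d P = d).
evals, evecs = np.linalg.eig(P.T)
d = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
d = d / d.sum()
eta = d @ r

print("discounted V      :", V)
print("average reward η  :", eta)
print("(1 - γ) * V       :", (1 - gamma) * V)   # ≈ η for γ close to 1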

Dynamic Programming


• State Value Functions
• Bellman Equations
• Value Iteration
• Policy Iteration

State value function

• The value (expected discounted return) of policy π when started in state s:

  V^π(s) = E_π{ r_0 + γ r_1 + γ^2 r_2 + ... | s_0 = s }        (1)

  with discount factor γ ∈ [0, 1].

• Definition of optimality: a behavior π* is optimal iff

  ∀s : V^{π*}(s) = V*(s),   where V*(s) = max_π V^π(s)

  (simultaneously maximising the value in all states)

  (In MDPs there always exists at least one optimal deterministic policy.)

Bellman optimality equation

• Recursion for the value of a policy π:

  V^π(s) = E{ r_0 + γ r_1 + γ^2 r_2 + ... | s_0 = s; π }
         = E{ r_0 | s_0 = s; π } + γ E{ r_1 + γ r_2 + ... | s_0 = s; π }
         = R(π(s), s) + γ Σ_{s'} P(s' | π(s), s) E{ r_1 + γ r_2 + ... | s_1 = s'; π }
         = R(π(s), s) + γ Σ_{s'} P(s' | π(s), s) V^π(s')

• We can write this in vector notation: V^π = R^π + γ P^π V^π,
  with vectors V^π_s = V^π(s), R^π_s = R(π(s), s) and matrix P^π_{s's} = P(s' | π(s), s).

• For a stochastic policy π(a|s):

  V^π(s) = Σ_a π(a|s) R(a, s) + γ Σ_{s',a} π(a|s) P(s' | a, s) V^π(s')

• Bellman optimality equation:

  V*(s) = max_a [ R(a, s) + γ Σ_{s'} P(s' | a, s) V*(s') ]
  π*(s) = argmax_a [ R(a, s) + γ Σ_{s'} P(s' | a, s) V*(s') ]

  (Sketch of proof: if π were to select an action other than argmax_a[·], then the policy π' that equals π everywhere except π'(s) = argmax_a[·] would be better.)

• This is the principle of optimality in the stochastic case (related to Viterbi, the max-product algorithm).

Bellman's principle of optimality

Richard E. Bellman (1920-1984)

[Figure: points A and B on an optimal path, illustrating the principle "A opt ⇒ B opt": if the path starting at A is optimal, then its remaining part from B is optimal as well.]

  V*(s) = max_a [ R(a, s) + γ Σ_{s'} P(s' | a, s) V*(s') ]
  π*(s) = argmax_a [ R(a, s) + γ Σ_{s'} P(s' | a, s) V*(s') ]

Value Iteration

• Given the Bellman optimality equation

  V*(s) = max_a [ R(a, s) + γ Σ_{s'} P(s' | a, s) V*(s') ]

  → iterate

  ∀s : V_{k+1}(s) = max_a [ R(a, s) + γ Σ_{s'} P(s' | a, s) V_k(s') ]

  with stopping criterion

  max_s |V_{k+1}(s) − V_k(s)| ≤ ε

• Value Iteration converges to the optimal value function V* (proof below).
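
The update above is straightforward to implement for tabular MDPs. Below is a minimal NumPy sketch (not from the lecture) on a hypothetical 2-state, 2-action MDP with T[a, s, s'] = P(s' | a, s) and R[a, s]; all numbers are illustrative.

import numpy as np

# Hypothetical 2-state, 2-action MDP (illustrative numbers).
T = np.array([[[0.9, 0.1], [0.4, 0.6]],     # action 0
              [[0.2, 0.8], [0.1, 0.9]]])    # action 1: T[a, s, s'] = P(s' | a, s)
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])                  # R[a, s]
gamma, eps = 0.9, 1e-6

V = np.zeros(T.shape[1])
while True:
    Q = R + gamma * (T @ V)                 # Q[a, s] = R(a,s) + γ Σ_{s'} P(s'|a,s) V(s')
    V_new = Q.max(axis=0)                   # max over actions
    if np.max(np.abs(V_new - V)) <= eps:    # stopping criterion
        break
    V = V_new

print("V* ≈", V_new, "  greedy policy:", Q.argmax(axis=0))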

2x2 Maze

[Figure: a 2x2 grid world with reward 1.0 in one cell and 0.0 in the others; an action moves in the intended direction with probability 80% and slips to either side with probability 10% each.]

Solve manually.

State-action value function (Q-function)

• The state-action value function (or Q-function) is the expected discounted return when starting in state s and taking first action a:

  Q^π(a, s) = E_π{ r_0 + γ r_1 + γ^2 r_2 + ... | s_0 = s, a_0 = a }
            = R(a, s) + γ Σ_{s'} P(s' | a, s) Q^π(π(s'), s')

  (Note: V^π(s) = Q^π(π(s), s).)

• Bellman optimality equation for the Q-function:

  Q*(a, s) = R(a, s) + γ Σ_{s'} P(s' | a, s) max_{a'} Q*(a', s')

  π*(s) = argmax_a Q*(a, s)

Q-Iteration

• Given the Bellman optimality equation

  Q*(a, s) = R(a, s) + γ Σ_{s'} P(s' | a, s) max_{a'} Q*(a', s')

  → iterate

  ∀a, s : Q_{k+1}(a, s) = R(a, s) + γ Σ_{s'} P(s' | a, s) max_{a'} Q_k(a', s')

  with stopping criterion

  max_{a,s} |Q_{k+1}(a, s) − Q_k(a, s)| ≤ ε

• Q-Iteration converges to the optimal state-action value function Q*.
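
The corresponding sketch for Q-Iteration keeps the full Q-table rather than V; again a hypothetical 2-state, 2-action MDP with illustrative numbers (not the lecture's).

import numpy as np

# Hypothetical MDP (illustrative numbers), same layout as before.
T = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.2, 0.8], [0.1, 0.9]]])   # T[a, s, s'] = P(s' | a, s)
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])                 # R[a, s]
gamma, eps = 0.9, 1e-6

Q = np.zeros_like(R)
while True:
    Q_new = R + gamma * (T @ Q.max(axis=0))   # backup with max_{a'} Q(a', s')
    if np.max(np.abs(Q_new - Q)) <= eps:      # stopping criterion over (a, s)
        break
    Q = Q_new

print("Q* ≈\n", Q_new)
print("π* =", Q_new.argmax(axis=0))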

Proof of convergence

• Let Δ_k = ||Q* − Q_k||_∞ = max_{a,s} |Q*(a, s) − Q_k(a, s)|. Then

  Q_{k+1}(a, s) = R(a, s) + γ Σ_{s'} P(s' | a, s) max_{a'} Q_k(a', s')
                ≤ R(a, s) + γ Σ_{s'} P(s' | a, s) [ max_{a'} Q*(a', s') + Δ_k ]
                = [ R(a, s) + γ Σ_{s'} P(s' | a, s) max_{a'} Q*(a', s') ] + γ Δ_k
                = Q*(a, s) + γ Δ_k

  Similarly, Q_k ≥ Q* − Δ_k ⇒ Q_{k+1} ≥ Q* − γ Δ_k.

Convergence

• Contraction property:

  ||U_{k+1} − V_{k+1}|| ≤ γ ||U_k − V_k||

  which guarantees convergence from different initial values U_0, V_0 of two approximations:

  ||U_{k+1} − V_{k+1}|| ≤ γ ||U_k − V_k|| ≤ ... ≤ γ^{k+1} ||U_0 − V_0||

• Stopping condition: ||V_{k+1} − V_k|| ≤ ε ⇒ ||V_{k+1} − V*|| ≤ εγ/(1 − γ)

  Proof:

  ||V_{k+1} − V*|| / γ ≤ ||V_k − V*|| ≤ ||V_k − V_{k+1}|| + ||V_{k+1} − V*||
  ⇒ ||V_{k+1} − V*|| / γ ≤ ε + ||V_{k+1} − V*||
  ⇒ ||V_{k+1} − V*|| ≤ εγ/(1 − γ)
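
The contraction property is easy to check numerically: applying one Bellman optimality backup to two arbitrary value functions must bring them closer by at least a factor γ in the sup-norm. A NumPy sketch (hypothetical MDP, illustrative numbers):

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical MDP (illustrative numbers).
T = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.2, 0.8], [0.1, 0.9]]])   # T[a, s, s'] = P(s' | a, s)
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

def backup(V):
    """One Bellman optimality backup (TV)(s) = max_a [R(a,s) + γ Σ_{s'} P(s'|a,s) V(s')]."""
    return (R + gamma * (T @ V)).max(axis=0)

U, V = rng.normal(size=2), rng.normal(size=2)    # two arbitrary value functions
lhs = np.max(np.abs(backup(U) - backup(V)))      # ||TU − TV||_∞
rhs = gamma * np.max(np.abs(U - V))              # γ ||U − V||_∞
print(lhs, "<=", rhs, ":", bool(lhs <= rhs + 1e-12))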

Policy Evaluation

Value Iteration and Q-Iteration compute V* and Q* directly. If instead we want to evaluate a given policy π, we want to compute V^π or Q^π:

• Iterate using π instead of max_a:

  ∀s : V_{k+1}(s) = R(π(s), s) + γ Σ_{s'} P(s' | π(s), s) V_k(s')

  ∀a, s : Q_{k+1}(a, s) = R(a, s) + γ Σ_{s'} P(s' | a, s) Q_k(π(s'), s')

• Or, invert the matrix equation:

  V^π = R^π + γ P^π V^π
  V^π − γ P^π V^π = R^π
  (I − γ P^π) V^π = R^π
  V^π = (I − γ P^π)^{-1} R^π

  which requires inversion of an n × n matrix for |S| = n, i.e. O(n^3).
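
A short NumPy sketch of the matrix-based policy evaluation (not from the lecture; the MDP below is hypothetical and its numbers illustrative). For a deterministic policy π, P^π and R^π are obtained by selecting, in each state, the row of T and R that π picks.

import numpy as np

# Hypothetical MDP (illustrative numbers).
T = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.2, 0.8], [0.1, 0.9]]])   # T[a, s, s'] = P(s' | a, s)
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])                 # R[a, s]
gamma = 0.9
pi = np.array([0, 1])                      # deterministic policy: π(s)

nS = T.shape[1]
P_pi = T[pi, np.arange(nS), :]             # P^π[s, s'] = P(s' | π(s), s)
R_pi = R[pi, np.arange(nS)]                # R^π[s]    = R(π(s), s)

# V^π = (I − γ P^π)^{-1} R^π, via a linear solve rather than an explicit inverse.
V_pi = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)
print("V^π =", V_pi)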

Policy Iteration

• How does computing just V^π or Q^π help us find the optimal policy?

• Policy Iteration:
  1. Initialise π_0 somehow (e.g. randomly).
  2. Iterate:
     – Policy Evaluation: compute V^{π_k} or Q^{π_k}
     – Policy Improvement: π_{k+1}(s) ← argmax_a Q^{π_k}(a, s)

demo: 2x2 maze
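
A compact NumPy sketch of these two alternating steps (not from the lecture; hypothetical MDP, illustrative numbers), using the exact linear-solve policy evaluation from the previous slide:

import numpy as np

# Hypothetical MDP (illustrative numbers).
T = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.2, 0.8], [0.1, 0.9]]])   # T[a, s, s'] = P(s' | a, s)
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])                 # R[a, s]
gamma = 0.9
nA, nS, _ = T.shape

pi = np.zeros(nS, dtype=int)               # π_0: arbitrary initial policy
while True:
    # Policy evaluation: V^π = (I − γ P^π)^{-1} R^π
    P_pi = T[pi, np.arange(nS), :]
    R_pi = R[pi, np.arange(nS)]
    V = np.linalg.solve(np.eye(nS) - gamma * P_pi, R_pi)
    # Policy improvement: greedy with respect to Q^π
    Q = R + gamma * (T @ V)
    pi_new = Q.argmax(axis=0)
    if np.array_equal(pi_new, pi):         # policy stable ⇒ optimal
        break
    pi = pi_new

print("π* =", pi, "  V* =", V)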

Convergence proof

The key facts are:

• After policy improvement: V^{π_k} ≤ V^{π_{k+1}} (a sketch proof is given in Rich Sutton's book).

• The policy space is finite: |A|^{|S|}.

• The Bellman operator has a unique fixed point, due to its strict contraction property (0 < γ < 1) on a Banach space. The same property is used to prove convergence of the VI algorithm.

VI vs. PI

• VI is PI with only one step of policy evaluation.
• PI converges surprisingly rapidly, but each iteration is computationally expensive: the policy evaluation step waits for convergence of V^π.
• PI is preferred if the action set is large.

Asynchronous Dynamic Programming

• The value function table is updated asynchronously.
• Computation is significantly reduced.
• If every state is still updated infinitely often, convergence is guaranteed.
• Three simple variants:
  – Gauss-Seidel value iteration
  – Real-time dynamic programming
  – Prioritised sweeping

Gauss-Seidel Value Iteration

• The standard VI algorithm updates all states in the next iteration using the old values from the previous iteration (an iteration finishes when every state has been updated).

Algorithm 1: Standard Value Iteration
  while not converged do
    V_old = V
    for each s ∈ S do
      V(s) = max_a { R(s, a) + γ Σ_{s'} P(s' | s, a) V_old(s') }

• Gauss-Seidel VI updates each state using the most recently computed values.

Algorithm 2: Gauss-Seidel Value Iteration
  while not converged do
    for each s ∈ S do
      V(s) = max_a { R(s, a) + γ Σ_{s'} P(s' | s, a) V(s') }
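
The only implementation difference is whether the sweep reads from a frozen copy V_old or updates V in place. A NumPy sketch of the in-place (Gauss-Seidel) variant on a hypothetical 2-state, 2-action MDP (illustrative numbers):

import numpy as np

# Hypothetical MDP (illustrative numbers).
T = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.2, 0.8], [0.1, 0.9]]])   # T[a, s, s'] = P(s' | s, a)
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])                 # reward table indexed as R[a, s]
gamma, eps = 0.9, 1e-8
nA, nS, _ = T.shape

V = np.zeros(nS)
while True:
    delta = 0.0
    for s in range(nS):                    # in-place: later states see the new values
        v_new = max(R[a, s] + gamma * T[a, s] @ V for a in range(nA))
        delta = max(delta, abs(v_new - V[s]))
        V[s] = v_new
    if delta <= eps:
        break

print("V* ≈", V)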

Prioritised Sweeping

• Similar to Gauss-Seidel VI, but the order in which states are updated is driven by the magnitude of their updates (their Bellman errors).

• Define the Bellman error as E(s; V_t) = |V_{t+1}(s) − V_t(s)|, i.e. the change of s's value after the most recent update.

Algorithm 3: Prioritised Sweeping VI
  Initialize V_0(s) and priority values H_0(s), ∀s ∈ S
  for k = 0, 1, 2, 3, ... do
    pick the state with the highest priority: s_k ∈ argmax_{s∈S} H_k(s)
    value update: V_{k+1}(s_k) = max_{a∈A} [ R(s_k, a) + γ Σ_{s'} P(s' | s_k, a) V_k(s') ]
    for s ≠ s_k: V_{k+1}(s) = V_k(s)
    update priority values: ∀s ∈ S, H_{k+1}(s) ← E(s; V_{k+1})
    (Note: the error is measured with respect to the future update.)
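
A small NumPy sketch of this idea (not from the lecture; hypothetical MDP, illustrative numbers). Here the priority of a state is taken as how much its value would change if it were backed up now, and the highest-priority state is updated first:

import numpy as np

# Hypothetical MDP (illustrative numbers).
T = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.2, 0.8], [0.1, 0.9]]])   # T[a, s, s'] = P(s' | s, a)
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma, eps = 0.9, 1e-8
nS = T.shape[1]

def backup(V):
    """Bellman optimality backup for all states at once."""
    return (R + gamma * (T @ V)).max(axis=0)

V = np.zeros(nS)
H = np.abs(backup(V) - V)                  # initial priorities = Bellman errors
while H.max() > eps:
    s = int(H.argmax())                    # state with the highest priority
    V[s] = backup(V)[s]                    # update only that state
    H = np.abs(backup(V) - V)              # recompute priorities after the update
print("V* ≈", V)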

Real-Time Dynamic Programming

• Similar to Gauss-Seidel VI, but the sequence of states in each iteration is generated by simulating the transitions.

Algorithm 4: Real-Time Value Iteration
  start at an arbitrary s_0, and initialize V_0(s), ∀s ∈ S
  for k = 0, 1, 2, 3, ... do
    action selection: a_k ∈ argmax_{a∈A} [ R(s_k, a) + γ Σ_{s'} P(s' | s_k, a) V_k(s') ]
    value update: V_{k+1}(s_k) = R(s_k, a_k) + γ Σ_{s'} P(s' | s_k, a_k) V_k(s')
    for s ≠ s_k: V_{k+1}(s) = V_k(s)
    simulate the next state: s_{k+1} ∼ P(s' | s_k, a_k)
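
A minimal NumPy sketch of this loop (not from the lecture; hypothetical MDP, illustrative numbers): only the currently visited state is backed up, and the next state is sampled from the model.

import numpy as np

rng = np.random.default_rng(0)
# Hypothetical MDP (illustrative numbers).
T = np.array([[[0.9, 0.1], [0.4, 0.6]],
              [[0.2, 0.8], [0.1, 0.9]]])   # T[a, s, s'] = P(s' | s, a)
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9
nA, nS, _ = T.shape

V = np.zeros(nS)
s = 0                                      # arbitrary start state s_0
for k in range(5000):
    q = R[:, s] + gamma * T[:, s, :] @ V   # q[a] for the current state s_k
    a = int(q.argmax())                    # greedy action selection
    V[s] = q[a]                            # back up only the visited state
    s = int(rng.choice(nS, p=T[a, s]))     # simulate the next state
print("V after real-time updates ≈", V)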

• So far, we have introduced the basic notions of an MDP and value functions, and methods to compute optimal policies assuming that we know the world (i.e. we know P(s' | a, s) and R(a, s)):
  – Value Iteration / Q-Iteration → V*, Q*, π*
  – Policy Evaluation → V^π, Q^π
  – Policy Improvement: π(s) ← argmax_a Q^{π_k}(a, s)
  – Policy Iteration (iterating Policy Evaluation and Policy Improvement)

• Reinforcement Learning?
