Feb 23 & 25, 2021

Markov Decision Processes

Cathy Wu — 6.246: Reinforcement Learning: Foundations and Methods

Outline

1 Markov Decision Processes

2 Dynamic Programming for MDPs

3 Bellman equations and operators

4 Value iteration algorithm

5 Policy iteration algorithm

Outline

1 Markov Decision Processes

2 Dynamic Programming for MDPs

3 Bellman equations and operators

4 Value iteration algorithm

5 Policy iteration algorithm

Why stochastic problems?

§ Stochastic environment:
• Uncertainty in rewards (e.g., multi-armed bandits, contextual bandits)
• Uncertainty in dynamics, i.e., in the transition (s_t, a_t) → s_{t+1}
• Uncertainty in horizon (called stochastic shortest path)
§ Stochastic policies (technical reasons):
• Trade off exploration and exploitation
• Enable off-policy learning
• Compatible with maximum likelihood estimation (MLE)

Dynamic programming in the deterministic setting is insufficient.

The Amazing Goods Company (Supply Chain)

Description. At each month t, a warehouse contains x_t items of a specific good, and the demand for that good is D (stochastic). At the end of each month, the manager of the warehouse can order a_t more items from the supplier.

§ The cost of maintaining an inventory of x is h(x).
§ The cost to order a items is C(a).
§ The income for selling q items is f(q).
§ If the demand d ~ D is bigger than the available inventory x, customers that cannot be served leave.
§ The value of the remaining inventory at the end of the year is g(x).
§ Constraint: the store has a maximum capacity M.

Markov Decision Process

Definition (Markov decision process)
A Markov decision process (MDP) is defined as a tuple M = (S, A, P, r, γ) where
§ S is the state space (often simplified to a finite set),
§ A is the action space,
§ P(s'|s, a) is the transition probability, with P(s'|s, a) = ℙ(s_{t+1} = s' | s_t = s, a_t = a),
§ r(s, a, s') is the immediate reward at state s upon taking action a (sometimes simply r(s)),
§ γ ∈ [0, 1) is the discount factor.

Example: The Amazing Goods Company
§ State space: x ∈ S = {0, 1, …, M}.
§ Action space: it is not possible to order more items than the capacity of the store, so the action space depends on the current state. Formally, at state x, a ∈ A(x) = {0, 1, …, M − x}.
§ Dynamics: x_{t+1} = [x_t + a_t − d_t]^+. The demand d_t is stochastic and time-independent. Formally, d_t ~ D (i.i.d.).
§ Reward: r_t = −C(a_t) − h(x_t + a_t) + f(x_t + a_t − x_{t+1}). This corresponds to a purchasing cost, a cost for excess stock (storage, maintenance), and a reward for fulfilling orders.
§ Discount: γ = 0.95. A dollar today is worth more than a dollar tomorrow.
§ Objective: V(x_0; a_0, a_1, …) = Σ_{t≥0} γ^t r_t. This corresponds to the cumulative reward, plus the value of the remaining inventory.

F In general, a non-Markovian decision process's transitions could depend on much more information:
ℙ(s_{t+1} = s' | s_t = s, a_t = a, s_{t−1}, a_{t−1}, …, s_0, a_0)

F The process generates trajectories h_t = (s_0, a_0, …, s_{t−1}, a_{t−1}, s_t), with s_{t+1} ~ P(·|s_t, a_t).
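The inventory example above can be turned into a tiny simulator. The sketch below assumes concrete, hypothetical choices for the capacity M, the cost functions h and C, the income f, and a Poisson demand; the slides only specify the role of these quantities, not their exact form.

```python
import numpy as np

# Minimal sketch of the Amazing Goods MDP as a simulator.
M = 20          # maximum capacity (assumed)
GAMMA = 0.95    # discount factor

def h(x):  return 0.5 * x        # assumed inventory-holding cost
def C(a):  return 1.0 * a        # assumed ordering cost
def f(q):  return 2.0 * q        # assumed income for selling q items

def step(x, a, rng):
    """Sample one transition (x, a) -> (r, x') of the inventory MDP."""
    assert 0 <= a <= M - x, "cannot order beyond capacity"
    d = rng.poisson(5)                    # assumed demand distribution d ~ D
    x_next = max(x + a - d, 0)            # unmet demand is lost
    sold = x + a - x_next                 # items actually sold this month
    r = -C(a) - h(x + a) + f(sold)        # reward as in the example above
    return r, x_next

rng = np.random.default_rng(0)
x, ret = 0, 0.0
for t in range(12):                        # simulate one year
    a = M - x if x < M // 4 else 0         # an example threshold ordering rule
    r, x = step(x, a, rng)
    ret += GAMMA**t * r
print("discounted return over one year:", round(ret, 2))
```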

Freeway Atari game (David Crane, 1981)

FREEWAY is an Atari 2600 video game, released in 1981. In FREEWAY, the agent must navigate a chicken (think: jaywalker) across a busy road of ten lanes of incoming traffic. The top of the screen lists the score. After a successful crossing, the chicken is teleported back to the bottom of the screen. If hit by a car, the chicken is either forced back slightly or pushed back to the bottom of the screen, depending on the difficulty switch setting. One or two players can play at once.

Related applications:
§ Self-driving cars (input from LIDAR, radar, cameras)
§ Traffic signal control (input from cameras)
§ Crowd navigation robot

Figure: Atari 2600 video game FREEWAY.

ATARI Breakout

[Figure: ATARI Breakout game frames.]

Non-Markov dynamics: a single frame does not reveal the ball's velocity, so the next frame is not determined by the current frame and action alone.

An MDP satisfies the Markovian property if
ℙ(s_{t+1} = s' | s_t, a_t) = ℙ(s_{t+1} = s' | s_t, a_t, s_{t−1}, a_{t−1}, …, s_0, a_0)

i.e., the current state s_t and action a_t are sufficient for predicting the next state s_{t+1}.

Markov Decision Process: the Assumptions

Time assumption: time is discrete,
t → t + 1

Possible relaxations
§ Identify the proper time granularity
§ Most of the MDP literature extends to continuous time

ATARI Breakout

[Figure: game frames sampled at different time steps.]

Too fine-grained resolution vs. too coarse-grained resolution.

Markov Decision Process: the Assumptions

Reward assumption: the reward is uniquely defined by a transition (or part of it),
r(s, a, s')

Possible relaxations
§ Distinguish between global goal and reward function
§ Move to inverse reinforcement learning (IRL) to induce the reward function from desired behaviors

Markov Decision Process: the Assumptions

Stationarity assumption: the dynamics and reward do not change over time,
P(s'|s, a) = ℙ(s_{t+1} = s' | s_t = s, a_t = a),   r(s, a, s')

Possible relaxations
§ Identify and add/remove the non-stationary components (e.g., cyclo-stationary dynamics)
§ Identify the time-scale of the changes
§ Work on finite horizon problems

Markov Decision Process

Recall: an MDP is a tuple M = (S, A, P, r, γ) — state space, action space, transition probability P(s'|s, a) = ℙ(s_{t+1} = s' | s_t = s, a_t = a), immediate reward r(s, a, s'), and discount factor γ ∈ [0, 1).

F Two missing ingredients:
§ How actions are selected → policy.
§ What determines which actions (and states) are good → value function.

Policy

Definition (Policy)
A decision rule d can be
§ Deterministic: d: S → A,
§ Stochastic: d: S → Δ(A),
§ History-dependent: d: H_t → A,
§ Markov: d: S → Δ(A).

A policy (strategy, plan) can be
§ Non-stationary: π = (d_0, d_1, d_2, …),
§ Stationary: π = (d, d, d, …).

F MDP M + stationary policy π = (d, d, …) ⟹ Markov chain over the state space S, with transition probability P(s'|s) = P(s'|s, d(s)).

F For simplicity, we will typically write π instead of d for stationary policies, and π_t instead of d_t for non-stationary policies.

Recall: The Amazing Goods Company Example

Description. At each month t, a warehouse contains x_t items of a specific good, and the demand for that good is D (stochastic). At the end of each month, the manager of the warehouse can order a_t more items from the supplier.

§ Stationary policy composed of deterministic Markov decision rules:
d(x) = M − x if x < M/4, and 0 otherwise.

§ Stationary policy composed of stochastic history-dependent decision rules:
d(x_t) = U(M − x_t, M − x_t + 10) if x_t < x_{t−1}/2, and 0 otherwise.

§ Non-stationary policy composed of deterministic Markov decision rules:
d_t(x) = M − x if t < 6, and (M − x)/5 otherwise.
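As a quick illustration, the three decision rules above can be written as small Python functions. The capacity M and the uniform order range are assumptions carried over from the reconstruction above.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 20  # assumed capacity, as in the earlier sketch

# Stationary deterministic Markov rule: d(x) = M - x if x < M/4, else 0.
def d_markov(x):
    return M - x if x < M / 4 else 0

# Stationary stochastic history-dependent rule: order an amount drawn
# uniformly from {M - x_t, ..., M - x_t + 10} when the inventory dropped
# below half of last month's level, else order nothing.
def d_history(x_t, x_prev):
    if x_t < x_prev / 2:
        return int(rng.integers(M - x_t, M - x_t + 11))
    return 0

# Non-stationary deterministic Markov rule: d_t(x) = M - x if t < 6,
# else (M - x) / 5.
def d_nonstationary(t, x):
    return M - x if t < 6 else (M - x) / 5

print(d_markov(3), d_history(4, 10), d_nonstationary(7, 10))
```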

State Value Function

Given a policy π = (d_0, d_1, …) (deterministic to simplify notation)
§ Infinite time horizon with discount: the problem never terminates, but rewards that are closer in time receive a higher importance.
V^π(s) = E[ Σ_{t≥0} γ^t r(s_t, d_t(s_t)) | s_0 = s; π ]
with discount factor 0 ≤ γ < 1:
§ Small γ = short-term rewards, γ close to 1 = long-term rewards
§ For any γ ∈ [0, 1) the series always converges (for bounded rewards)

§ Used when: there is uncertainty about the deadline and/or an intrinsic definition of discount.

State Value Function

Given a policy π = (d_0, d_1, …) (deterministic to simplify notation)
§ Finite time horizon T: deadline at time T, the agent focuses on the sum of the rewards up to T.
V^π(s) = E[ Σ_{t=0}^{T−1} r(s_t, d_t(s_t)) + R(s_T) | s_0 = s; π = (d_0, …, d_T) ]
where R is a value function for the final state.

§ Used when: there is an intrinsic deadline to meet.

State Value Function

Given a policy π = (d_0, d_1, …) (deterministic to simplify notation)
§ Stochastic shortest path: the problem never terminates, but the agent will eventually reach a termination state.
V^π(s) = E[ Σ_{t=0}^{T} r(s_t, d_t(s_t)) | s_0 = s; π ]
where T is the first (random) time when the termination state is achieved.

§ Used when: there is a specific goal condition.

State Value Function

Given a policy π = (d_0, d_1, …) (deterministic to simplify notation)
§ Infinite time horizon with average reward: the problem never terminates, but the agent only focuses on the (expected) average of the rewards.
V^π(s) = lim_{T→∞} (1/T) E[ Σ_{t=0}^{T−1} r(s_t, d_t(s_t)) | s_0 = s; π ]

§ Used when: the system should be constantly controlled over time.

State Value Function

Technical note: the expectations refer to all possible stochastic trajectories. A (possibly non-stationary, stochastic) policy π applied from state s_0 returns
s_0, a_0, s_1, a_1, s_2, a_2, …
where a_t = d_t(s_t) and s_{t+1} ~ P(·|s_t, a_t) are random realizations.

The value function (discounted infinite horizon) is
V^π(s) = E_{(s_1, s_2, …)}[ Σ_{t≥0} γ^t r(s_t, d_t(s_t)) | s_0 = s; π ]

More generally, for stochastic policies:
V^π(s) = E_{(a_0, s_1, a_1, s_2, …)}[ Σ_{t≥0} γ^t r(s_t, a_t) | s_0 = s; π ]
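Since the value function is an expectation over trajectories, it can be approximated by sampling. Below is a minimal Monte-Carlo sketch for the discounted infinite-horizon case, assuming a small tabular MDP given as arrays; the demo numbers at the bottom are made up.

```python
import numpy as np

def mc_value_estimate(P, r, policy, s0, gamma=0.95, n_traj=1000, horizon=200, seed=0):
    """Monte-Carlo estimate of the discounted value V^pi(s0).

    P[s, a, s'] are transition probabilities, r[s, a] immediate rewards,
    policy[s] a deterministic action per state. Trajectories are truncated
    at `horizon`, adding a bias of at most gamma**horizon * r_max / (1 - gamma).
    """
    rng = np.random.default_rng(seed)
    S = P.shape[0]
    returns = np.empty(n_traj)
    for i in range(n_traj):
        s, g, disc = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy[s]
            g += disc * r[s, a]
            disc *= gamma
            s = rng.choice(S, p=P[s, a])   # s_{t+1} ~ P(.|s_t, a_t)
        returns[i] = g
    return returns.mean()

# Tiny illustrative 2-state, 2-action MDP (all numbers are assumptions).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
print(mc_value_estimate(P, r, policy=[0, 1], s0=0))
```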

Notice

From now on we mostly work in the discounted infinite-horizon setting.

Most results extend to the other settings (though not always so smoothly).

Optimization Problem

Definition (Optimal policy and optimal value function)

The solution to an MDP is an optimal policy π* satisfying
π* ∈ arg max_{π∈Π} V^π(s)
in all states s ∈ S, where Π is some policy set of interest.

The corresponding value function is the optimal value function V* = V^{π*}.

Limitations

1. Average case: All the previous value functions define an objective in expectation

2. Imperfect information (partial observations)

3. Time delays

4. Correlated disturbances

1. See: DPOC vol1 1.4 & Appendix F

Summary & Takeaways

§ Stochastic problems are needed to represent uncertainty in the environment and in the policy. § Markov Decision Processes (MDPs) represent a general class of stochastic sequential decision problems, for which reinforcement learning methods are commonly designed. MDPs enable a discussion of model-free learning (later lectures). § The Markovian property means that the next state is fully determined by the current state and action. § Although quite general, MDPs bake in numerous assumptions. Care should be taken when modeling a problem as an MDP. § Similarly, care should be taken to select an appropriate type of policy and value function, depending on the use case.

Outline

1 Markov Decision Processes

2 Dynamic Programming for MDPs

3 Bellman equations and operators

4 Value iteration algorithm

5 Policy iteration algorithm

The Optimization Problem

The overall optimization problem for stochastic sequential decision making is

max_{π∈Π} V^π(s_0) = E_{τ ~ p^π}[ Σ_{t=0}^∞ γ^t r(s_t, a_t) | π, s_0 ]    (1)
s.t. a_t = π(s_t)  OR  a_t ~ π(·|s_t)   for all t = 0, 1, …
     s_{t+1} ~ P(·|s_t, a_t)   for all t = 0, 1, …

With:
§ Initial state s_0 ∈ S
§ Discount factor γ ∈ [0, 1)
§ Transition function P(s_{t+1}|s_t, a_t)
§ Reward function r(s_t, a_t)
§ Value function V^π: S → ℝ

§ Recall that the action space is A.
§ Let p^π denote the distribution over trajectories of the form τ = (s_0, a_0, r_0, s_1, a_1, r_1, …) generated by the MDP under policy π.

The Optimization Problem

The value function in (1) is the objective. Our goal is to find the optimal value function:

V*(s_0) = max_{π∈Π} V^π(s_0)    (2)

Dynamic programming applies to MDPs, too!

Recall: dynamic programming for deterministic problems

V_T(s_T) = r_T(s_T)
for t = T − 1, …, 0 do
    V_t(s_t) = max_{a_t∈A} [ r_t(s_t, a_t) + V_{t+1}(s_{t+1}) ]
end for

Finite horizon, deterministic (e.g., shortest path routing, traveling salesperson):
V*_T(s_T) = r_T(s_T)   for all s_T   (terminal reward)
V*_t(s_t) = max_{a_t∈A} [ r_t(s_t, a_t) + V*_{t+1}(s_{t+1}) ]   for all s_t, and t = T − 1, …, 0    (3)

Dynamic programming applies to MDPs, too!

Finite horizon, deterministic (e.g., shortest path routing, traveling salesperson):
V*_T(s_T) = r_T(s_T)   for all s_T   (terminal reward)
V*_t(s_t) = max_{a_t∈A} [ r_t(s_t, a_t) + V*_{t+1}(s_{t+1}) ]   for all s_t, and t = T − 1, …, 0    (3)

From deterministic to stochastic problems?

Finite horizon, stochastic and Markov problems (e.g., driving, robotics, games):
V*_T(s_T) = r_T(s_T)   for all s_T   (terminal reward)
V*_t(s_t) = max_{a_t∈A} [ r_t(s_t, a_t) + E_{s_{t+1} ~ P(·|s_t, a_t)} V*_{t+1}(s_{t+1}) ]   for all s_t, and t = T − 1, …, 0    (4)

Dynamic programming applies to MDPs, too!

Finite horizon, stochastic and Markov problems (e.g., driving, robotics, games):
V*_T(s_T) = r_T(s_T)   for all s_T   (terminal reward)
V*_t(s_t) = max_{a_t∈A} [ r_t(s_t, a_t) + E_{s_{t+1} ~ P(·|s_t, a_t)} V*_{t+1}(s_{t+1}) ]   for all s_t, and t = T − 1, …, 0    (4)

From finite to (discounted) infinite horizon problems?

Infinite horizon, stochastic problems (e.g., package delivery over months or years, long-term customer satisfaction, control of autonomous vehicles):
V*(s) = max_{a∈A} [ r(s, a) + γ E_{s' ~ P(·|s, a)} V*(s') ]   for all s    (5)

Dynamic programming applies to MDPs, too!

Infinite horizon, stochastic problems (e.g., package delivery over months or years, long-term customer satisfaction, control of autonomous vehicles):
V*(s) = max_{a∈A} [ r(s, a) + γ E_{s' ~ P(·|s, a)} V*(s') ]   for all s    (5)

F This is called the optimal Bellman equation.

An optimal policy is such that:
π*(s) = arg max_{a∈A} [ r(s, a) + γ E_{s' ~ P(·|s, a)} V*(s') ]   for all s

Discuss: Any difficulties with this new algorithm?
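For the finite-horizon recursions (3)–(4), the backward pass can be coded directly; a sketch on an assumed tabular MDP follows. The infinite-horizon equation (5), by contrast, has no terminal stage to unroll from and V* appears on both sides, which is where the fixed-point machinery of the next section comes in.

```python
import numpy as np

def backward_induction(P, r, r_terminal, T):
    """Finite-horizon stochastic DP, eq. (4): V*_T = r_T, then backwards
    V*_t(s) = max_a [ r(s,a) + E_{s'~P(.|s,a)} V*_{t+1}(s') ].

    P[s, a, s'] transition probabilities, r[s, a] rewards, r_terminal[s]
    terminal rewards (all values below are illustrative assumptions).
    """
    S, A, _ = P.shape
    V = r_terminal.copy()
    policy = np.zeros((T, S), dtype=int)
    for t in range(T - 1, -1, -1):
        Q = r + P @ V                 # Q[s,a] = r(s,a) + sum_s' P(s'|s,a) V(s')
        policy[t] = Q.argmax(axis=1)
        V = Q.max(axis=1)
    return V, policy                  # V is V*_0

# Tiny 3-state, 2-action example (numbers are assumptions).
P = np.array([[[0.8, 0.2, 0.0], [0.0, 0.5, 0.5]],
              [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]],
              [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
r = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 0.0]])
V0, pi = backward_induction(P, r, r_terminal=np.array([0.0, 0.0, 10.0]), T=5)
print(V0)
```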

Outline

1 Markov Decision Processes

2 Dynamic Programming for MDPs

3 Bellman equations and operators

4 Value iteration algorithm

5 Policy iteration algorithm

Outline

1 Markov Decision Processes

2 Dynamic Programming for MDPs

3 Bellman equations and operators

4 Value iteration algorithm

5 Policy iteration algorithm

Value iteration algorithm

1. Let V_0(s) be any function V_0: S → ℝ. [Note: not stage 0, but iteration 0.]
2. Apply the principle of optimality so that, given V_k at iteration k, we compute
   V_{k+1}(s) = max_{a∈A} [ r(s, a) + γ E_{s' ~ P(·|s, a)} V_k(s') ]   for all s
3. Terminate when V_k stops improving, e.g., when max_s |V_{k+1}(s) − V_k(s)| is small.
4. Return the greedy policy:
   π_K(s) = arg max_{a∈A} [ r(s, a) + γ E_{s' ~ P(·|s, a)} V_K(s') ]

Definition (Optimal Bellman operator)
For any W ∈ ℝ^S, the optimal Bellman operator is defined as
(ΤW)(s) = max_{a∈A} [ r(s, a) + γ E_{s' ~ P(·|s, a)} W(s') ]   for all s

F Then we can write step 2 of the algorithm concisely: V_{k+1}(s) = (ΤV_k)(s) for all s.

Key question: Does V_k → V* ?
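A compact tabular implementation of the algorithm above, for an MDP given as arrays P[s, a, s'] and r[s, a]; the demo MDP at the bottom is an arbitrary assumption.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8, max_iter=10_000):
    """Value iteration with the optimal Bellman operator T.

    P[s, a, s'] transition probabilities, r[s, a] rewards. Returns the
    (approximate) optimal value function and a greedy policy.
    """
    S, A, _ = P.shape
    V = np.zeros(S)                                   # V_0 can be any function
    for _ in range(max_iter):
        V_next = (r + gamma * (P @ V)).max(axis=1)    # (T V_k)(s)
        if np.max(np.abs(V_next - V)) < tol:          # V_k stopped improving
            V = V_next
            break
        V = V_next
    greedy = (r + gamma * (P @ V)).argmax(axis=1)     # greedy policy w.r.t. V_K
    return V, greedy

# Tiny 3-state, 2-action MDP (all numbers are assumptions).
P = np.array([[[0.9, 0.1, 0.0], [0.1, 0.0, 0.9]],
              [[0.0, 1.0, 0.0], [0.0, 0.1, 0.9]],
              [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
r = np.array([[0.0, -1.0], [0.0, -1.0], [1.0, 1.0]])
V_star, pi_star = value_iteration(P, r, gamma=0.9)
print(np.round(V_star, 3), pi_star)
```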

The student dilemma

[Figure: transition graph for the student dilemma, with work/rest actions, transition probabilities between 0.1 and 0.9, and rewards r ∈ {0, 1, −1, −10, 100, −1000}; three of the states are terminal.]

Model: all the transitions are Markov; states s_5, s_6, s_7 are terminal.
Setting: infinite horizon with terminal states.
Objective: find the policy that maximizes the expected sum of rewards before achieving a terminal state.
Notice: this is not a discounted infinite-horizon setting, but the Bellman equations hold unchanged.
Discuss: What kind of problem setting is this? (Hint: value function.)

The Optimal Bellman Equation

Bellman’s Principle of Optimality (Bellman (1957)): “An optimal policy has the property that, whatever the initial state and the initial decision are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.”

The Optimal Bellman Equation

Theorem

The optimal value function V* (i.e., V* = max_π V^π) is the solution to the optimal Bellman equation:
V*(s) = max_{a∈A} [ r(s, a) + γ Σ_{s'} P(s'|s, a) V*(s') ]

And any optimal policy is such that:
π*(a|s) > 0  ⟹  a ∈ arg max_{a'∈A} [ r(s, a') + γ Σ_{s'} P(s'|s, a') V*(s') ]

F There is always a deterministic optimal policy (see: Puterman, 2005, Chapter 7).

Proof: The Optimal Bellman Equation

For any policy π = (d, π') (possibly non-stationary),

V*(s) = max_π E[ Σ_{t≥0} γ^t r(s_t, π_t(s_t)) | s_0 = s; π ]    [value function]

= max_{(d, π')} { r(s, d(s)) + γ Σ_{s'} P(s'|s, d(s)) V^{π'}(s') }    [Markov property & change of "time"]

= max_d { r(s, d(s)) + γ Σ_{s'} P(s'|s, d(s)) max_{π'} V^{π'}(s') }

= max_{a∈A} { r(s, a) + γ Σ_{s'} P(s'|s, a) V*(s') }    [value function]

The student dilemma

[Figure: transition graph for the student dilemma (as above).]

V*(s) = max_{a∈A} [ r(s, a) + Σ_{s'} P(s'|s, a) V*(s') ]

System of equations:
V_1 = max{ 0 + 0.5 V_1 + 0.5 V_2 ;  0 + 0.5 V_1 + 0.5 V_3 }
V_2 = max{ 1 + 0.4 V_5 + 0.6 V_2 ;  1 + 0.3 V_1 + 0.7 V_3 }
V_3 = max{ −1 + 0.4 V_2 + 0.6 V_3 ;  −1 + 0.5 V_4 + 0.5 V_3 }
V_4 = max{ −10 + 0.9 V_6 + 0.1 V_4 ;  −10 + V_7 }
V_5 = −10,   V_6 = 100,   V_7 = −1000

Discuss: How to solve this system of equations?
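One answer: treat the right-hand sides as update rules and iterate them to a fixed point, exactly as value iteration does. A sketch using the system as reconstructed above:

```python
# Solve the (reconstructed) student-dilemma system by fixed-point iteration:
# repeatedly apply the right-hand sides until the values stop changing.
V = {i: 0.0 for i in range(1, 8)}
V[5], V[6], V[7] = -10.0, 100.0, -1000.0   # terminal values

for _ in range(1000):
    V_new = dict(V)
    V_new[4] = max(-10 + 0.9 * V[6] + 0.1 * V[4], -10 + V[7])
    V_new[3] = max(-1 + 0.4 * V[2] + 0.6 * V[3], -1 + 0.5 * V[4] + 0.5 * V[3])
    V_new[2] = max(1 + 0.4 * V[5] + 0.6 * V[2], 1 + 0.3 * V[1] + 0.7 * V[3])
    V_new[1] = max(0 + 0.5 * V[1] + 0.5 * V[2], 0 + 0.5 * V[1] + 0.5 * V[3])
    if max(abs(V_new[i] - V[i]) for i in range(1, 5)) < 1e-9:
        V = V_new
        break
    V = V_new

print({i: round(V[i], 1) for i in range(1, 5)})
# With enough iterations this settles at roughly V_1 = V_2 = 88.3, V_3 = 86.9,
# V_4 = 88.9, matching the values shown for this example later in the lecture.
```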

System of Equations

The optimal Bellman equation:
V*(s) = max_{a∈A} [ r(s, a) + γ Σ_{s'} P(s'|s, a) V*(s') ]
is a non-linear system of equations with S unknowns and S non-linear constraints (i.e., the max operator).

Value iteration algorithm

1. Let V_0(s) be any function V_0: S → ℝ.
2. Iterate V_{k+1}(s) = (ΤV_k)(s) = max_{a∈A} [ r(s, a) + γ E_{s' ~ P(·|s, a)} V_k(s') ] for all s.
3. Terminate when V_k stops improving, e.g., when max_s |V_{k+1}(s) − V_k(s)| is small, and return the greedy policy with respect to V_K.

Key question: Does V_k → V* ?

Properties of Bellman Operators

Proposition
1. Contraction in L∞-norm: for any W_1, W_2 ∈ ℝ^S,
   ||ΤW_1 − ΤW_2||_∞ ≤ γ ||W_1 − W_2||_∞
2. Fixed point: V* is the unique fixed point of Τ, i.e., V* = ΤV*.

Proof: value iteration
§ From the contraction property of Τ, the value iteration update V_{k+1} = ΤV_k, and the optimal Bellman equation V* = ΤV*:
  ||V* − V_{k+1}||_∞ = ||ΤV* − ΤV_k||_∞    [value iteration and optimal Bellman eq.]
                     ≤ γ ||V* − V_k||_∞    [contraction]
                     ≤ γ^{k+1} ||V* − V_0||_∞    [recursion]  → 0
  V_k → V*    [convergence of Cauchy sequences, fixed point]

Properties of Bellman Operators

Proposition
1. Contraction in L∞-norm: for any W_1, W_2 ∈ ℝ^S,
   ||ΤW_1 − ΤW_2||_∞ ≤ γ ||W_1 − W_2||_∞
2. Fixed point: V* is the unique fixed point of Τ, i.e., V* = ΤV*.

Proof: value iteration
§ Convergence rate. Let ε > 0 and ||r||_∞ ≤ r_max. Then
  ||V* − V_K||_∞ ≤ γ^K ||V* − V_0||_∞ < ε   whenever   K ≥ log( r_max / ((1 − γ) ε) ) / log(1/γ)
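As a small worked example of the bound, the snippet below computes the smallest K it prescribes, assuming ||V* − V_0||_∞ ≤ r_max / (1 − γ) (e.g., V_0 = 0).

```python
import math

def vi_iteration_bound(r_max, gamma, eps):
    """Smallest K satisfying gamma**K * r_max / (1 - gamma) < eps, i.e. the
    bound K >= log(r_max / ((1 - gamma) * eps)) / log(1 / gamma)."""
    return math.ceil(math.log(r_max / ((1 - gamma) * eps)) / math.log(1 / gamma))

# e.g. with r_max = 1: the bound grows quickly as gamma approaches 1.
for gamma in (0.9, 0.99):
    print(gamma, vi_iteration_bound(r_max=1.0, gamma=gamma, eps=1e-3))
```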

Proof: Contraction of the Bellman Operator

For any s ∈ S:

|ΤW_1(s) − ΤW_2(s)|
= | max_a { r(s, a) + γ Σ_{s'} P(s'|s, a) W_1(s') } − max_{a'} { r(s, a') + γ Σ_{s'} P(s'|s, a') W_2(s') } |
≤ max_a | r(s, a) + γ Σ_{s'} P(s'|s, a) W_1(s') − r(s, a) − γ Σ_{s'} P(s'|s, a) W_2(s') |
= γ max_a | Σ_{s'} P(s'|s, a) ( W_1(s') − W_2(s') ) |
≤ γ ||W_1 − W_2||_∞ max_a Σ_{s'} P(s'|s, a) = γ ||W_1 − W_2||_∞

using | max_a f(a) − max_a g(a) | ≤ max_a | f(a) − g(a) |.

The Grid-World Problem

Example: Winter parking (complete with ice and potholes)

Simple grid world with a goal state (green, desired parking spot) with reward +1, a "bad state" (red, pothole) with reward −100, and all other states neutral (0). An omnidirectional vehicle (agent) can head in any direction. Actions move in the desired direction with probability 0.8, and in one of the two perpendicular directions with probability 0.1 each. Taking an action that would bump into a wall leaves the agent where it is.

[Source: adapted from Kolter, 2016]
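The stated mechanics translate directly into a transition model. The sketch below assumes a hypothetical 3×4 grid; only the 0.8/0.1/0.1 split and the wall behaviour come from the description above.

```python
# Sketch of the winter-parking transition model: the intended move succeeds
# with probability 0.8, each perpendicular move happens with probability 0.1,
# and moves into a wall keep the agent in place.
H, W = 3, 4   # assumed grid size
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
PERP = {"up": ("left", "right"), "down": ("left", "right"),
        "left": ("up", "down"), "right": ("up", "down")}

def transition(state, action):
    """Return a dict {next_state: probability} for a grid cell (row, col)."""
    probs = {}
    for move, p in [(action, 0.8), (PERP[action][0], 0.1), (PERP[action][1], 0.1)]:
        dr, dc = MOVES[move]
        r, c = state[0] + dr, state[1] + dc
        nxt = (r, c) if 0 <= r < H and 0 <= c < W else state  # wall: stay put
        probs[nxt] = probs.get(nxt, 0.0) + p
    return probs

print(transition((0, 0), "up"))   # corner cell: bumping keeps some mass in place
```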

Example: value iteration

Recall the value iteration algorithm:
V_{k+1}(s) = max_{a∈A} [ r(s, a) + γ E_{s' ~ P(·|s, a)} V_k(s') ]   for all s
Let's arbitrarily initialize V_0 as the reward function, since it can be any function.

Example update (red state):
V_1(red) = −100 + 0.9 · max{
  0.8 V_0(green) + 0.1 V_0(red) + 0.1 · 0,    [up]
  0.8 · 0 + 0.1 V_0(red) + 0.1 · 0,           [down]
  0.8 · 0 + 0.1 V_0(green) + 0.1 · 0,         [left]
  0.8 V_0(red) + 0.1 V_0(green) + 0.1 · 0 }   [right]
= −100 + 0.9 · (0.1 · 1) = −99.91    [best: go left]

Example update (green state):
V_1(green) = 1 + 0.9 · max{
  0.8 V_0(green) + 0.1 V_0(green) + 0.1 · 0,   [up]
  0.8 V_0(red) + 0.1 V_0(green) + 0.1 · 0,     [down]
  0.8 · 0 + 0.1 V_0(green) + 0.1 V_0(red),     [left]
  0.8 V_0(red) + 0.1 V_0(green) + 0.1 · 0 }    [right]
= 1 + 0.9 · (0.9 · 1) = 1.81    [best: go up]

We need to also do this for all the "unnamed" states, too.

[Figure: successive value-iteration updates on the grid, panels (a)–(f).]

Value Iteration: the Guarantees

Corollary

Let V_K be the function computed after K iterations of value iteration, and let the greedy policy be
π_K(s) ∈ arg max_{a∈A} [ r(s, a) + γ Σ_{s'} P(s'|s, a) V_K(s') ].
Then
||V* − V^{π_K}||_∞ ≤ (2γ / (1 − γ)) ||V* − V_K||_∞
(performance loss on the left, approximation error on the right).

Furthermore, there exists ε > 0 such that if ||V* − V_K||_∞ ≤ ε, then π_K is optimal.

Proof: Performance Loss

Note 1: We drop the ∞ subscripts everywhere.

Note 2: π is the greedy policy corresponding to V, and V^π is the value function of π.

||V* − V^π|| ≤ ||ΤV* − Τ^π V|| + ||Τ^π V − Τ^π V^π||
            ≤ ||ΤV* − ΤV|| + γ ||V − V^π||
            ≤ γ ||V* − V|| + γ ( ||V* − V|| + ||V* − V^π|| )
⟹ ||V* − V^π|| ≤ (2γ / (1 − γ)) ||V* − V||

Value Iteration: the Complexity

Time complexity
§ Each iteration takes on the order of S²A operations:
  V_{k+1}(s) = (ΤV_k)(s) = max_{a∈A} [ r(s, a) + γ Σ_{s'} P(s'|s, a) V_k(s') ]
§ The computation of the greedy policy takes on the order of S²A operations:
  π_K(s) ∈ arg max_{a∈A} [ r(s, a) + γ Σ_{s'} P(s'|s, a) V_K(s') ]
§ Total time complexity on the order of K S²A.

Space complexity
§ Storing the MDP: dynamics on the order of S²A and reward on the order of SA.
§ Storing the value function and the optimal policy on the order of S.

Value Iteration: Extensions and Implementations

Asynchronous VI:
1. Let V_0 be any vector in ℝ^S.
2. At each iteration k = 1, 2, …, K
   • Choose a state s_k
   • Compute V_{k+1}(s_k) = (ΤV_k)(s_k)
3. Return the greedy policy
   π_K(s) ∈ arg max_{a∈A} [ r(s, a) + γ Σ_{s'} P(s'|s, a) V_K(s') ]

Comparison
§ Reduced time complexity per iteration to O(SA)
§ Using round-robin, the number of iterations increases by at most O(KS), but it is much smaller in practice if states are properly prioritized
§ Convergence guarantees if there is no starvation

Outline

1 Markov Decision Processes

2 Dynamic Programming for MDPs

3 Bellman equations and operators

4 Value iteration algorithm

5 Policy iteration algorithm

More generally…

Value iteration:
1. V_{k+1}(s) = max_{a∈A} [ r(s, a) + γ E_{s' ~ P(·|s, a)} V_k(s') ]   for all s
2. π_K(s) = arg max_{a∈A} [ r(s, a) + γ E_{s' ~ P(·|s, a)} V_K(s') ]

Related operations:
§ Policy evaluation: V^π_{k+1}(s) = r(s, π(s)) + γ E_{s' ~ P(·|s, π(s))} V^π_k(s')   for all s
§ Policy improvement: π_{k+1}(s) = arg max_{a∈A} [ r(s, a) + γ E_{s' ~ P(·|s, a)} V^{π_k}(s') ]

F Generalized Policy Iteration: Repeat:
1. Policy evaluation for m steps
2. Policy improvement

(Value iteration: m = 1; Policy iteration: m = ∞.)
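A sketch of generalized policy iteration for a tabular MDP, with m sweeps of iterative evaluation between improvements (m = 1 behaves like value iteration, large m like policy iteration); the demo MDP is an arbitrary assumption.

```python
import numpy as np

def generalized_policy_iteration(P, r, gamma, m, n_iters=100):
    """Generalized policy iteration: m sweeps of iterative policy evaluation
    followed by a greedy policy improvement.

    P[s, a, s'] transition probabilities, r[s, a] rewards.
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    policy = np.zeros(S, dtype=int)
    for _ in range(n_iters):
        for _ in range(m):                      # partial policy evaluation
            P_pi = P[np.arange(S), policy]      # P_pi[s, s'] = P(s'|s, pi(s))
            r_pi = r[np.arange(S), policy]
            V = r_pi + gamma * P_pi @ V         # one application of T^pi
        Q = r + gamma * (P @ V)                 # policy improvement
        policy = Q.argmax(axis=1)
    return V, policy

# Tiny 2-state, 2-action MDP (all numbers are assumptions).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.6, 0.4], [0.1, 0.9]]])
r = np.array([[0.0, 1.0], [2.0, 0.0]])
print(generalized_policy_iteration(P, r, gamma=0.9, m=5))
```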

Policy Iteration: the Idea

1. Let π_0 be any stationary policy.

2. At each iteration k = 1, 2, …, K
   • Policy evaluation: given π_k, compute V^{π_k}.
   • Policy improvement: compute the greedy policy
     π_{k+1}(s) ∈ arg max_{a∈A} [ r(s, a) + γ Σ_{s'} P(s'|s, a) V^{π_k}(s') ]

3. Stop if V^{π_k} = V^{π_{k−1}}.

4. Return the last policy π_K.
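Below is a sketch of policy iteration for a tabular MDP, using an exact linear solve for the evaluation step; the stopping test compares successive policies, a common variant of checking V^{π_k} = V^{π_{k−1}}. The demo MDP is an arbitrary assumption.

```python
import numpy as np

def policy_iteration(P, r, gamma, max_iter=1000):
    """Policy iteration with exact policy evaluation.

    Evaluation solves the linear system V = r_pi + gamma * P_pi V, i.e.
    V^pi = (I - gamma * P_pi)^{-1} r_pi; improvement is the greedy step.
    P[s, a, s'] transition probabilities, r[s, a] rewards.
    """
    S, A, _ = P.shape
    policy = np.zeros(S, dtype=int)             # pi_0: any stationary policy
    for _ in range(max_iter):
        # Policy evaluation (exact, via a linear solve)
        P_pi = P[np.arange(S), policy]
        r_pi = r[np.arange(S), policy]
        V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        # Policy improvement (greedy w.r.t. V^pi)
        new_policy = (r + gamma * (P @ V)).argmax(axis=1)
        if np.array_equal(new_policy, policy):  # policy (hence value) unchanged
            break
        policy = new_policy
    return V, policy

# Tiny 3-state, 2-action MDP (all numbers are assumptions).
P = np.array([[[0.9, 0.1, 0.0], [0.1, 0.0, 0.9]],
              [[0.0, 1.0, 0.0], [0.0, 0.1, 0.9]],
              [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
r = np.array([[0.0, -1.0], [0.0, -1.0], [1.0, 1.0]])
print(policy_iteration(P, r, gamma=0.9))
```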

Policy Iteration: the Guarantees

Proposition
The policy iteration algorithm generates a sequence of policies with non-decreasing performance, V^{π_{k+1}} ≥ V^{π_k}, and it converges to V* in a finite number of iterations.

The Bellman Equation

Theorem
For any stationary policy π = (π, π, …), at any state s ∈ S, the state value function satisfies the Bellman equation:
V^π(s) = r(s, π(s)) + γ Σ_{s'∈S} P(s'|s, π(s)) V^π(s')

The student dilemma

[Figure: transition graph for the student dilemma under the fixed policy, with resulting values V_1 = 88.3, V_2 = 88.3, V_3 = 86.9, V_4 = 88.9, V_5 = −10, V_6 = 100, V_7 = −1000.]

Applying V^π(s) = r(s, π(s)) + Σ_{s'} P(s'|s, π(s)) V^π(s') at every state gives a system of equations:

V_1 = 0 + 0.5 V_1 + 0.5 V_2
V_2 = 1 + 0.3 V_1 + 0.7 V_3
V_3 = −1 + 0.5 V_4 + 0.5 V_3
V_4 = −10 + 0.9 V_6 + 0.1 V_4
V_5 = −10,   V_6 = 100,   V_7 = −1000

Discuss: How to solve this system of equations?

The Bellman Operators

Notation. W.l.o.g. consider a discrete state space with |S| = S and V^π ∈ ℝ^S (the analysis extends to S → ∞).

Definition
For any W ∈ ℝ^S, the Bellman operator Τ^π: ℝ^S → ℝ^S is
(Τ^π W)(s) = r(s, π(s)) + γ Σ_{s'} P(s'|s, π(s)) W(s')
and the optimal Bellman operator (or dynamic programming operator) is
(ΤW)(s) = max_{a∈A} [ r(s, a) + γ Σ_{s'} P(s'|s, a) W(s') ]

The Bellman Operators

Proposition (Properties of the Bellman operators)
1. Monotonicity: for any W_1, W_2 ∈ ℝ^S, if W_1 ≤ W_2 component-wise, then
   Τ^π W_1 ≤ Τ^π W_2
   ΤW_1 ≤ ΤW_2
2. Offset: for any scalar c ∈ ℝ,
   Τ^π (W + c 1_S) = Τ^π W + γ c 1_S
   Τ (W + c 1_S) = ΤW + γ c 1_S

The Bellman Operators

Proposition
3. Contraction in L∞-norm: for any W_1, W_2 ∈ ℝ^S,
   ||Τ^π W_1 − Τ^π W_2||_∞ ≤ γ ||W_1 − W_2||_∞
   ||ΤW_1 − ΤW_2||_∞ ≤ γ ||W_1 − W_2||_∞
4. Fixed point: for any policy π,
   V^π is the unique fixed point of Τ^π,
   V* is the unique fixed point of Τ.

F For any W ∈ ℝ^S and any stationary policy π,
   lim_{k→∞} (Τ^π)^k W = V^π
   lim_{k→∞} Τ^k W = V*

Policy Iteration: the Idea

1. Let π_0 be any stationary policy.

2. At each iteration k = 1, 2, …, K
   • Policy evaluation: given π_k, compute V^{π_k}.
   • Policy improvement: compute the greedy policy
     π_{k+1}(s) ∈ arg max_{a∈A} [ r(s, a) + γ Σ_{s'} P(s'|s, a) V^{π_k}(s') ]

3. Stop if V^{π_k} = V^{π_{k−1}}.

4. Return the last policy π_K.

Policy Iteration: the Guarantees

Proposition
The policy iteration algorithm generates a sequence of policies with non-decreasing performance, V^{π_{k+1}} ≥ V^{π_k}, and it converges to V* in a finite number of iterations.

Proof: Policy Iteration

From the definition of the Bellman operators and the greedy policy π_{k+1}:
V^{π_k} = Τ^{π_k} V^{π_k} ≤ Τ V^{π_k} = Τ^{π_{k+1}} V^{π_k}    (1)

and from the monotonicity property of Τ^{π_{k+1}}, it follows that
V^{π_k} ≤ Τ^{π_{k+1}} V^{π_k}
Τ^{π_{k+1}} V^{π_k} ≤ (Τ^{π_{k+1}})² V^{π_k}
…
(Τ^{π_{k+1}})^{n−1} V^{π_k} ≤ (Τ^{π_{k+1}})^n V^{π_k}
…

Joining all the inequalities in the chain, we obtain
V^{π_k} ≤ lim_{n→∞} (Τ^{π_{k+1}})^n V^{π_k} = V^{π_{k+1}}

Then (V^{π_k})_k is a non-decreasing sequence.

Policy Iteration: the Guarantees

Since a finite MDP admits a finite number of policies, the termination condition is eventually met for a specific k.

Thus eq. (1) holds with an equality, and we obtain
V^{π_k} = Τ V^{π_k}

and hence V^{π_k} = V*, which implies that π_k is an optimal policy.

Policy Iteration: Complexity

Policy Evaluation Step
§ Direct computation: for any policy π, compute
  V^π = (I − γ P^π)^{−1} r^π
  Complexity: O(S³).
§ Iterative policy evaluation: for any policy π,
  lim_{n→∞} (Τ^π)^n W_0 = V^π
  Complexity: an ε-approximation of V^π requires on the order of S² log(1/ε) / log(1/γ) operations.
§ Monte-Carlo simulation: in each state s, simulate n trajectories (s_t^i)_{t≥0, 1≤i≤n} following policy π and compute
  V̂^π(s) ≃ (1/n) Σ_{i=1}^n Σ_{t≥0} γ^t r(s_t^i, π(s_t^i))
  Complexity: in each state, the approximation error is O( r_max / ((1 − γ) √n) ).

Policy Iteration: Complexity

Policy Improvement Step
§ If the policy is evaluated with V, then the complexity is O(S²A).

Number of Iterations
§ At most on the order of (SA / (1 − γ)) log(1 / (1 − γ))
§ Other results exist that do not depend on γ.

Comparison between Value and Policy Iteration

Value Iteration
§ Pros: each iteration is very computationally efficient.
§ Cons: convergence is only asymptotic.

Policy Iteration
§ Pros: converges in a finite number of iterations (often small in practice).
§ Cons: each iteration requires a full policy evaluation, which might be expensive.

Example: Winter parking (complete with ice and potholes)

Recall the grid world from the value iteration example: a goal state (green, desired parking spot) with reward +1, a "bad state" (red, pothole) with reward −100, and all other states neutral (0). Actions move in the desired direction with probability 0.8, and in one of the two perpendicular directions with probability 0.1 each; bumping into a wall leaves the agent where it is.

[Source: adapted from Kolter, 2016]

Example: value iteration

[Figure: value-iteration updates on the grid, panels (a)–(f).]

Example: policy iteration

[Figure: policy-iteration updates on the grid, panels (a)–(d).]

Other

§ Modified Policy Iteration
§ λ-Policy Iteration
§ Primal-dual formulations

Summary & Takeaways

§ The ideas from dynamic programming, namely the principle of optimality, carry over to stochastic problems. § The value iteration algorithm solves discounted infinite horizon MDP problems by leveraging properties of Bellman operators, namely the optimal Bellman equation, contractions, and fixed points. § Generalized policy iteration methods include policy iteration and value iteration. § As in deterministic problems, the curse of dimensionality often emerges in stochastic problems, where the state space is too large.

References

§ Dimitri P. Bertsekas. Dynamic Programming and Optimal Control, Volume 2, 4th Edition, 2012. Chapters 1–2: Discounted Problems.
§ R. E. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, 1957.
§ D. P. Bertsekas and J. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
§ W. Fleming and R. Rishel. Deterministic and Stochastic Optimal Control. Applications of Mathematics, Vol. 1, Springer-Verlag, Berlin/New York, 1975.
§ R. A. Howard. Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA, 1960.
§ M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, 1994.

Template & materials adapted from Alessandro Lazaric (FAIR/INRIA).
