6.246 Reinforcement Learning: Foundations and Methods                                    Feb 23, 2021
Lecture 3: Markov Decision Processes
Instructor: Cathy Wu        Scribe: Athul Paul Jacob

Note: the lecture notes have not been thoroughly checked for errors and are not at the level of publication.

1 Markov Decision Processes

Last lecture, we talked about deterministic decision problems, as well as a little bit of stochasticity, by introducing a variant of LQR. In this lecture, we introduce a formulation called Markov decision processes to study decision problems that have stochasticity.

1.1 Why stochastic problems?

The reasons can be roughly split into two categories:

• Stochastic environment: This is the case where there is stochasticity in the environment itself. As such, a stochastic framework is necessary to model the problem. Various components of the environment can be a source of stochasticity:
  – Uncertainty in the reward/objective: examples include problems like multi-armed bandits and contextual bandits.
  – Uncertainty in the dynamics.
  – Uncertainty in the horizon, i.e., in the length of the problem. The stochastic shortest path setting, introduced later, is an example of this.

• Stochastic policies: Another source of stochasticity is the policy itself. Stochastic policies are usually adopted for technical reasons:
  – They help trade off exploration and exploitation.
  – They enable off-policy learning.
  – They are compatible with maximum likelihood estimation (MLE).

As such, for the reasons mentioned above, the deterministic setting is insufficient.

Definition 1. A Markov Decision Process (MDP) is defined as a tuple M = (S, A, P, r, γ) where

• S is the state space,
• A is the action space,
• P(s′|s, a) is the transition probability, with:

P(s′|s, a) = P(s_{t+1} = s′ | s_t = s, a_t = a)

• r(s, a, s′) is the immediate reward at state s upon taking action a,
• γ ∈ [0, 1) is the discount factor.

The MDP generates trajectories τ_t = (s_0, a_0, ..., s_{t−1}, a_{t−1}, s_t) with s_{t+1} ∼ P(·|s_t, a_t).
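To make the definition concrete, here is a minimal Python sketch (not from the lecture) of a finite MDP stored as arrays, together with trajectory sampling s_{t+1} ∼ P(·|s_t, a_t). The class and function names are illustrative, not a standard API.

```python
import numpy as np

class FiniteMDP:
    """A finite MDP M = (S, A, P, r, gamma) with tabular arrays (illustrative sketch)."""
    def __init__(self, P, r, gamma):
        # P[s, a, s'] = probability of moving to s' from s under action a
        # r[s, a, s'] = immediate reward for that transition
        self.P, self.r, self.gamma = P, r, gamma
        self.n_states, self.n_actions, _ = P.shape

def sample_trajectory(mdp, policy, s0, T, rng=np.random.default_rng()):
    """Roll out tau_T = (s_0, a_0, ..., s_{T-1}, a_{T-1}, s_T) with a_t = policy(s_t)."""
    traj, s = [], s0
    for t in range(T):
        a = policy(s)
        s_next = rng.choice(mdp.n_states, p=mdp.P[s, a])  # s_{t+1} ~ P(.|s_t, a_t)
        traj.extend([s, a])
        s = s_next
    traj.append(s)
    return traj
```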

Note: Two key ingredients that will be discussed later are:

• Policy: How actions are selected.
• Value function: What determines which actions (and states) are good.

The state and action spaces are generally simplified to be finite, but they can be infinite, either countably infinite or continuous. In general, a non-Markovian decision process' transitions could depend on much more information, such as the whole history:

P(s_{t+1} = s′ | s_t = s, a_t = a, s_{t−1}, a_{t−1}, ..., s_0, a_0)

1.2 Example: The Amazing Goods Company (Supply Chain)

Consider an example of a supply chain problem which can be formulated as a Markov decision process.

Description: At each month t, a warehouse contains s_t items of a specific good, and the demand for that good is D (stochastic). At the end of each month, the manager of the warehouse can order a_t more items from the supplier.

• The cost of maintaining an inventory s is h(s).
• The cost to order a items is C(a).
• The income for selling q items is f(q).
• If the demand d ∼ D is bigger than the available inventory s, customers that cannot be served leave.
• The value of the remaining inventory at the end of the year is g(s).
• Constraint: the store has a maximum capacity M.

We can formulate the problem as an MDP as follows:

• State space: s ∈ S = {0, 1, ..., M}, the number of goods in stock.
• Action space: for a state s, a ∈ A(s) = {0, 1, ..., M − s}. As it is not possible to order more items than the capacity of the store, the action space depends on the current state s.

• Dynamics: s_{t+1} = [s_t + a_t − d_t]^+. The demand d_t is stochastic and time-independent. Formally, d_t ∼ D i.i.d.

• Reward: r_t = −C(a_t) − h(s_t + a_t) + f([s_t + a_t − s_{t+1}]^+). This corresponds to a purchasing cost, a cost for excess stock (storage, maintenance), and a reward for fulfilling orders.

• Discount: γ = 0.95. The discount factor essentially encodes the sentiment that a dollar today is worth more than a dollar tomorrow.

Infinite horizon objective: V(s_0; a_0, ...) = Σ_{t=0}^{∞} γ^t r_t, which corresponds to the cumulative reward, plus the value of the remaining inventory.
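The following is a minimal Python sketch of the dynamics and reward above. The lecture does not specify the cost functions h, C, f, the demand distribution D, or the capacity M, so the linear costs, Poisson demand, and numbers below are purely illustrative assumptions.

```python
import numpy as np

M, GAMMA = 20, 0.95            # assumed capacity and the discount factor from above
h = lambda s: 0.1 * s          # assumed inventory-holding cost h(s)
C = lambda a: 1.0 * a          # assumed ordering cost C(a)
f = lambda q: 2.0 * q          # assumed sales income f(q)

def step(s, a, rng=np.random.default_rng()):
    """One month of the supply chain MDP; assumes a is in A(s) = {0, ..., M - s}."""
    d = rng.poisson(5)                     # assumed demand d ~ D (Poisson for illustration)
    s_next = max(s + a - d, 0)             # s_{t+1} = [s_t + a_t - d_t]^+
    sold = s + a - s_next                  # items actually sold this month
    r = -C(a) - h(s + a) + f(sold)         # r_t = -C(a_t) - h(s_t + a_t) + f(sold)
    return s_next, r
```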

1.3 Example: Freeway Atari game (David Crane, 1981)

FREEWAY is an Atari 2600 video game, released in 1981. In FREEWAY, the agent must navigate a chicken (think: jaywalker) across a busy road of ten lanes of incoming traffic. The top of the screen lists the score. After a successful crossing, the chicken is teleported back to the bottom of the screen. If hit by a car, the chicken is forced back either slightly, or pushed back to the bottom of the screen, depending on the difficulty switch setting. One or two players can play at once.

Figure 1: Atari 2600 video game FREEWAY

Discussion: How to devise a successful strategy for jaywalking across this busy road? We can formulate the problem as an MDP as follows:

• State space:
  – Option 1: Whether there is a car, chicken, or nothing in each location of each road lane and road shoulder, where the road is discretized by lane (10) and car length. There is also a game-over state, reached when enough damage has been done to the chicken. The velocity of the cars could also be added.
  – Option 2: Multiple consecutive image frames of the game.
  – Option 3: A fixed-size vector representing the coordinates of the cars, assuming a maximum number of cars.
• Action space: up, down, left, right, or no action (movement of the chicken).
• Transitions: the chicken and vehicles move, based on the action selected; transitions may include new cars (randomly) entering different lanes.
• Reward: whether or not the chicken is at the top of the screen.
• Discount: γ = 0.999. We choose a high discount factor as we care about maximizing the overall score over time.

Infinite horizon objective: Σ_{t=0}^{∞} γ^t r_t, indicating the number of times the chicken crossed the road.
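As a concrete illustration of Option 3 above, here is a minimal Python sketch (not from the lecture) of a fixed-size coordinate encoding. MAX_CARS, the (x, y) layout, and the function name are illustrative assumptions, not the game's actual internal representation.

```python
import numpy as np

MAX_CARS = 10  # assumed maximum number of cars on screen

def encode_state(chicken_y, cars):
    """cars: list of (x, y) car positions; returns a fixed-length state vector."""
    vec = np.zeros(1 + 2 * MAX_CARS, dtype=np.float32)
    vec[0] = chicken_y                          # chicken's vertical position
    for i, (x, y) in enumerate(cars[:MAX_CARS]):
        vec[1 + 2 * i], vec[2 + 2 * i] = x, y   # remaining slots stay zero-padded
    return vec
```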

Figure 2: Deep reinforcement learning vs human player. Freeway is one of the games where a DQN agent is able to exceed human performance. Mnih et al. (2013)

Some related applications:
• Self-driving cars (input from LIDAR, radar, cameras)
• Traffic signal control (input from cameras)
• Crowd navigation robots

2 MDP Assumptions

Several assumptions are made when developing MDPs, and one needs to be careful about them when designing an MDP. Consider the Atari Breakout game:

Figure 3: Non-Markovian dynamics, as more information would help. Mnih et al. (2013)

Figure 4: Markovian dynamics. Mnih et al. (2013)

Fact 2. An MDP satisfies the Markovian property if:

P(s_{t+1} = s′ | τ_t, a_t) = P(s_{t+1} = s′ | s_t, a_t, s_{t−1}, a_{t−1}, ..., s_0, a_0) = P(s_{t+1} = s′ | s_t = s, a_t = a)

i.e., the current state s_t and action a_t are sufficient for predicting the next state s′.

As discussed previously, game states can be encoded as frames. However, the formulation in Figure 3 is non-Markovian because we do not know which direction the ball is travelling in from just one image. In Mnih et al. (2013), the authors instead use multiple frames (see Figure 4) to encode such information, which could make it Markovian.
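Here is a minimal Python sketch of that frame-stacking idea: keep the last k frames as the state so that direction and velocity information becomes recoverable. The choice k = 4 and the 84×84 frame shape follow common practice, but the class and preprocessing here are illustrative assumptions, not the paper's exact pipeline.

```python
from collections import deque
import numpy as np

class FrameStack:
    """Maintain the last k frames as an approximately Markovian state (sketch)."""
    def __init__(self, k=4, frame_shape=(84, 84)):
        self.k = k
        self.frames = deque([np.zeros(frame_shape, dtype=np.float32)] * k, maxlen=k)

    def push(self, frame):
        """Add the newest frame and return the stacked state of shape (k, H, W)."""
        self.frames.append(np.asarray(frame, dtype=np.float32))
        return np.stack(self.frames, axis=0)
```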

Assumption 3. Time assumption: Time is discrete

t → t + 1

Possible relaxation:

• Identify the proper time granularity.
• Most of the MDP literature extends to continuous time.

Figure 5: Too fine-grained resolution

Figure 6: Too coarse-grained resolution

Identifying the proper time granularity is important, as one can observe from the examples in Figure 5, where it is too fine-grained, and Figure 6, where it is too coarse-grained. Having a too fine-grained time resolution would give a very long-horizon learning problem, which can be challenging.

Assumption 4. Reward assumption: The reward is uniquely defined by a transition (or part of it)

r(s, a, s′)

Possible relaxation:

• Various notions of rewards: global or local reward function.
• Move to inverse reinforcement learning (IRL) to induce the reward function from desired behaviours.

Assumption 5. Stationarity assumption: The dynamics and reward do not change over time, i.e.,

p(s′|s, a) = P(s_{t+1} = s′ | s_t = s, a_t = a),    r(s, a, s′)

This is often the biggest assumption, especially in real-world contexts. Some types of non-stationarities can be handled. Possible relaxation:

• Identify and add/remove the non-stationary components (e.g. cyclo-stationary dynamics as seen in traffic).

• Identify the time-scale of the changes.
• Work on finite horizon problems.

3 Policy

Definition 6. A decision rule d can be:

• Deterministic: d : S → A,
• Stochastic: d : S → ∆(A),

• History-dependent: d : Ht → A,

• Markov: d : S → ∆(A).

A decision rule, in essence, is a mapping from states to a probability distribution over actions.

Definition 7. A policy (strategy, plan) π can be:

• Non-stationary: π = (d_0, d_1, d_2, ...),
• Stationary: π = (d, d, d, ...).

A policy is a sequence of decision rules. You have as many decision rules as there are time-steps in the horizon.

Fact 8. MDP M + stationary policy π = (d, d, d, ...) ⟹ a Markov chain with state space S and transition probability p(s′|s) = p(s′|s, d(s)).

For simplicity, π will be used instead of d for stationary policies, and πt instead of dt, for non-stationary policies.
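Fact 8 is easy to see computationally. Below is a minimal sketch (illustrative names, not from the lecture) that builds the induced Markov chain's transition matrix from a tabular MDP and a stationary deterministic decision rule.

```python
import numpy as np

def induced_chain(P, d):
    """P: array of shape (n_states, n_actions, n_states); d: array with d[s] = chosen action.

    Returns the Markov chain transition matrix P_pi with P_pi[s, s'] = P[s, d(s), s'].
    """
    n_states = P.shape[0]
    P_pi = np.array([P[s, d[s]] for s in range(n_states)])  # row s is P(.|s, d(s))
    return P_pi  # each row sums to 1, so this is a valid Markov chain
```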

3.1 The Amazing Goods Company (Supply Chain) Example

In this section, we look at what different types of policies and decision rules would look like in this example.

• Stationary policy composed of deterministic Markov decision rules:

  π(s) = M − s if s ≤ M/4, and π(s) = 0 otherwise.

• Stationary policy composed of stochastic history-dependent decision rules:

  π(s_t) = U(M − s_t, M − s_t + 10) if s_t ≤ s_{t−1}/2, and π(s_t) = 0 otherwise.

• Non-stationary policy composed of deterministic Markov decision rules:

  π_t(s) = M − s if t ≤ 6, and π_t(s) = ⌊(M − s)/5⌋ otherwise.

As one can see, any combination of different types of decision rules and policies can be constructed.
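For concreteness, here is a minimal Python sketch of the three example policies above for the supply chain MDP. The capacity M and the interpretation of U as a discrete uniform distribution are illustrative assumptions.

```python
import numpy as np

M = 20                              # assumed warehouse capacity
rng = np.random.default_rng()

def pi_stationary_markov(s):
    """Deterministic Markov rule: restock fully when inventory is at most M/4."""
    return M - s if s <= M / 4 else 0

def pi_history_dependent(s_t, s_prev):
    """Stochastic history-dependent rule: order U(M - s_t, M - s_t + 10) items
    when inventory dropped to at most half of last month's level."""
    return int(rng.integers(M - s_t, M - s_t + 11)) if s_t <= s_prev / 2 else 0

def pi_nonstationary(s, t):
    """Non-stationary deterministic Markov rule: behavior changes after month 6."""
    return M - s if t <= 6 else (M - s) // 5
```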

4 State Value Function

The state value function is what we are optimizing for. Although there are several things that we can optimize for, we are generally trying to maximize the cumulative rewards.

Given a policy π = (d1, d2, ...) (deterministic to simplify notation)

• Infinite time horizon with discount: The problem never terminates but rewards which are closer in time receive higher importance

V^π(s) = E[ Σ_{t=0}^{∞} γ^t r(s_t, π_t(h_t)) | s_0 = s; π ]

with discount factor 0 ≤ γ ≤ 1:
  – a small γ emphasizes short-term rewards; a large γ emphasizes long-term rewards;
  – for any γ ∈ [0, 1) the series always converges (for bounded rewards).

This is the most popular formulation used in practice. It is used when there is uncertainty about the deadline and/or an intrinsic definition of discount.

• Finite time horizon T : Deadline at time T , the agent focuses on the sum of the rewards up to T .

V^π(s, t) = E[ Σ_{τ=t}^{T−1} r(s_τ, π_τ(h_τ)) + R(s_T) | s_t = s; π = (π_t, ..., π_T) ]

where R(s_T) is a value function for the final state. This formulation is used when there is an intrinsic deadline to meet; e.g., this course has a fixed deadline.

• Stochastic shortest path: The problem has no fixed termination time, but the agent will eventually reach a termination state.

  V^π(s) = E[ Σ_{t=0}^{T} r(s_t, π_t(h_t)) | s_0 = s; π ]

where T is the first (random) time at which the termination state is reached. These are less discussed but are pertinent to many applications that we will discuss. This formulation is often used when there is a specific goal condition, e.g. when a car reaches a destination.

• Infinite time horizon with average reward: The problem never terminates but the agent only focuses on the (expected) average of the rewards.

  V^π(s) = lim_{T→∞} E[ (1/T) Σ_{t=0}^{T−1} r(s_t, π_t(h_t)) | s_0 = s; π ]

The 1/T factor is essential for the limit to be finite. This is often used when the system needs to be constantly controlled over time, e.g., a medical implant.

Note: The expectations refer to all possible stochastic trajectories. A (possibly non-stationary, stochastic) policy π applied from state s_0 returns (s_0, r_0, s_1, r_1, s_2, r_2, ...)

where r_t = r(s_t, π_t(h_t)) and s_t ∼ p(·|s_{t−1}, a_{t−1} = π_{t−1}(h_{t−1})) are random realizations. More generally, for stochastic policies:

  V^π(s) = E_{a_0, s_1, a_1, s_2, ...}[ Σ_{t=0}^{∞} γ^t r(s_t, π_t(h_t)) | s_0 = s; π ]

From now on we will mostly work in the discounted infinite horizon setting.
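Since these expectations range over trajectories, one simple way to approximate V^π(s) is Monte Carlo rollouts. Below is a minimal sketch under assumed interfaces: env_step(s, a) returns (s_next, r), and the infinite sum is truncated at a finite horizon (the truncation error is bounded by γ^horizon · r_max / (1 − γ) for bounded rewards).

```python
import numpy as np

def mc_value_estimate(env_step, policy, s0, gamma=0.95, n_rollouts=1000, horizon=200):
    """Estimate the discounted value V^pi(s0) by averaging truncated returns."""
    returns = []
    for _ in range(n_rollouts):
        s, G, discount = s0, 0.0, 1.0
        for _ in range(horizon):       # truncate the infinite discounted sum
            a = policy(s)
            s, r = env_step(s, a)
            G += discount * r
            discount *= gamma
        returns.append(G)
    return float(np.mean(returns))     # Monte Carlo estimate of V^pi(s0)
```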

5 Optimization Problem

Definition 9. Optimal policy and optimal value function: The solution to an MDP is an optimal policy π* satisfying

  π* ∈ arg max_{π ∈ Π} V^π

in all states s ∈ S, where Π is some policy set of interest. The corresponding value function is the optimal value function

  V* = V^{π*}.

The optimal policy maximizes the value for every state.

Limitations:
• Average case: all previous value functions define an objective in expectation.
• Imperfect information (partial observations).
• Time delays.
• Correlated disturbances.

6 Dynamic Programming for MDPs

Consider dynamic programming for deterministic problems, shown in Figure 7.

Figure 7: Dynamic programming for deterministic problems

We will shortly show that an algorithm with a similar form can be used for MDPs too (a code sketch of the finite-horizon backward recursion appears after this list).

• Finite horizon deterministic (e.g. shortest path routing, travelling salesperson):

  V_T*(s_T) = r_T(s_T)  ∀ s_T
  V_t*(s_t) = max_{a_t ∈ A} [ r_t(s_t, a_t) + V_{t+1}*(s_{t+1}) ]  ∀ s_t, t = T−1, ..., 0

• Finite horizon stochastic and Markov problems (e.g. driving, games):

  V_T*(s_T) = r_T(s_T)  ∀ s_T
  V_t*(s_t) = max_{a_t ∈ A} [ r_t(s_t, a_t) + E_{s_{t+1} ∼ P(·|s_t, a_t)} V_{t+1}*(s_{t+1}) ]  ∀ s_t, t = T−1, ..., 0

• For discounted infinite horizon problems (e.g. package delivery over months or years, long-term customer satisfaction, control of autonomous vehicles), we have the following optimal value function.

V*(s) = max_{a ∈ A} [ r(s, a) + γ E_{s′ ∼ P(·|s,a)} V*(s′) ]  ∀ s

This is known as the optimal Bellman equation. From this, the optimal policy can be extracted as:

  π*(s) = arg max_{a ∈ A} [ r(s, a) + γ E_{s′ ∼ P(·|s,a)} V*(s′) ]  ∀ s
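Before turning to the infinite-horizon case, here is a minimal Python sketch of the finite-horizon backward recursion from the first two bullets above, for the stochastic case with tabular arrays. The array layout and time-invariant r(s, a) are illustrative assumptions. Unlike the discounted infinite-horizon equation, this recursion is already an algorithm, since it sweeps t = T−1, ..., 0.

```python
import numpy as np

def backward_induction(P, r, r_T, T):
    """P: (S, A, S) transitions; r: (S, A) rewards; r_T: (S,) terminal reward; horizon T."""
    V = [None] * (T + 1)
    pi = [None] * T
    V[T] = r_T.copy()                      # V_T*(s_T) = r_T(s_T)
    for t in range(T - 1, -1, -1):
        Q = r + P @ V[t + 1]               # Q[s, a] = r(s, a) + E_{s'} V_{t+1}*(s')
        V[t] = Q.max(axis=1)               # V_t*(s) = max_a Q[s, a]
        pi[t] = Q.argmax(axis=1)           # optimal (non-stationary) decision rule at time t
    return V, pi
```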

Question: Any difficulties with this new algorithm? This is not an algorithm yet, since V* is defined in terms of itself.

7 Value Iteration Algorithm

With this, we can construct the value iteration algorithm as follows:

1. Let V_0(s) be any function V_0 : S → R [Note: not stage 0, but iteration 0]

2. Apply the principle of optimality so that given Vi at iteration i, we compute

V_{i+1}(s) = max_{a ∈ A} [ r(s, a) + γ E_{s′ ∼ P(·|s,a)} V_i(s′) ]

3. Terminate when V_i stops improving, e.g. when max_s |V_{i+1}(s) − V_i(s)| is small.
4. Return the greedy policy:

π_K(s) = arg max_{a ∈ A} [ r(s, a) + γ E_{s′ ∼ P(·|s,a)} V_K(s′) ]

Definition 10. Optimal Bellman Operator: For any W ∈ R^{|S|}, the optimal Bellman operator is defined as:

  TW(s) = max_{a ∈ A} [ r(s, a) + γ E_{s′ ∼ P(·|s,a)} W(s′) ]  ∀ s

With this, the value iteration algorithm above can be written concisely as:

V_{i+1}(s) = T V_i(s)  ∀ s
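Below is a minimal Python sketch of the value iteration algorithm above for a finite MDP given as tabular arrays P[s, a, s'] and r[s, a, s']. The function name, array layout, and tolerance are illustrative choices, not from the lecture.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-6):
    """P: (S, A, S) transition probabilities; r: (S, A, S) rewards; returns (V, greedy policy)."""
    n_states, n_actions, _ = P.shape
    expected_r = np.einsum("sap,sap->sa", P, r)   # E_{s'}[r(s, a, s')] for each (s, a)
    V = np.zeros(n_states)                        # V_0 can be any function S -> R
    while True:
        # Apply the optimal Bellman operator: (TV)(s) = max_a [ r(s, a) + gamma * E_{s'} V(s') ]
        Q = expected_r + gamma * (P @ V)          # Q[s, a]
        V_next = Q.max(axis=1)
        if np.max(np.abs(V_next - V)) < tol:      # terminate when V stops improving
            V = V_next
            break
        V = V_next
    Q = expected_r + gamma * (P @ V)              # extract the greedy policy w.r.t. the final V
    policy = Q.argmax(axis=1)
    return V, policy
```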

The proof of the optimal Bellman equation leverages the definition of the value function, as well as the Markov and change-of-time properties.

Proof: The Optimal Bellman Equation

V*(s) = max_π E[ Σ_{t=0}^{∞} γ^t r(s_t, π_t(h_t)) | s_0 = s; π ]
      = max_{a, π′} [ r(s, a) + γ Σ_{s′} p(s′|s, a) V^{π′}(s′) ]
      = max_a [ r(s, a) + γ Σ_{s′} p(s′|s, a) max_{π′} V^{π′}(s′) ]
      = max_a [ r(s, a) + γ Σ_{s′} p(s′|s, a) V*(s′) ]

8 Summary

• Stochastic problems are needed to represent uncertainty in the environment.
• Markov Decision Processes (MDPs) represent a general class of stochastic sequential decision problems, for which reinforcement learning methods are commonly designed. MDPs enable a discussion of model-free learning.

• The Markovian property means that the next state is fully determined by the current state and action.

• Although quite general, MDPs bake in numerous assumptions. Care should be taken when modeling a problem as an MDP.

• Similarly, care should be taken to select an appropriate type of policy and value function, depending on the use case.

• Finally, dynamic programming for the deterministic setting can also be extended to MDPs. In particular, we introduce the optimal Bellman operator and the value iteration algorithm.

9 Contributions

Athul Paul Jacob contributed to this draft of the lecture. TA Sirui Li reviewed the draft.

References

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
