6.246 Reinforcement Learning: Foundations and Methods                                    Feb 23, 2021
Lecture 3: Markov Decision Processes
Instructor: Cathy Wu        Scribe: Athul Paul Jacob

Note: the lecture notes have not been thoroughly checked for errors and are not at the level of publication.

1 Markov Decision Processes

Last lecture, we talked about deterministic decision problems, as well as a little bit of stochasticity, by introducing a variant of LQR. In this lecture, we introduce a formulation called Markov decision processes to study decision problems that have stochasticity.

1.1 Why stochastic problems?

The reasons can be roughly split into two categories:

• Stochastic environment: This is the case where there is stochasticity in the environment itself. As such, a stochastic framework is necessary to model the problem. Various components of the environment can be a source of stochasticity:
  – Uncertainty in the reward/objective: examples include problems like multi-armed bandits and contextual bandits.
  – Uncertainty in the dynamics.
  – Uncertainty in the horizon, i.e., in the length of the problem. The stochastic shortest path setting, introduced later, is an example of this.

• Stochastic policies: Another source of stochasticity is the policy itself. Stochastic policies are usually adopted for technical reasons:
  – They help trade off exploration and exploitation.
  – They enable off-policy learning.
  – They are compatible with maximum likelihood estimation (MLE).

As such, for the reasons mentioned above, the deterministic setting is insufficient.

Definition 1. A Markov Decision Process (MDP) is defined as a tuple M = (S, A, P, r, γ) where

• S is the state space,
• A is the action space,
• P(s′|s, a) is the transition probability, with:

P(s′|s, a) = P(s_{t+1} = s′ | s_t = s, a_t = a)

• r(s, a, s′) is the immediate reward at state s upon taking action a,
• γ ∈ [0, 1) is the discount factor.

The MDP generates trajectories τ_t = (s_0, a_0, ..., s_{t−1}, a_{t−1}, s_t) with s_{t+1} ∼ P(·|s_t, a_t).
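To make the definition concrete, here is a minimal Python sketch (not from the lecture) of a finite MDP stored as arrays, together with trajectory sampling s_{t+1} ∼ P(·|s_t, a_t). The class and function names are illustrative, not a standard API.

```python
import numpy as np

class FiniteMDP:
    """A finite MDP M = (S, A, P, r, gamma) with tabular arrays (illustrative sketch)."""
    def __init__(self, P, r, gamma):
        # P[s, a, s'] = probability of moving to s' from s under action a
        # r[s, a, s'] = immediate reward for that transition
        self.P, self.r, self.gamma = P, r, gamma
        self.n_states, self.n_actions, _ = P.shape

def sample_trajectory(mdp, policy, s0, T, rng=np.random.default_rng()):
    """Roll out tau_T = (s_0, a_0, ..., s_{T-1}, a_{T-1}, s_T) with a_t = policy(s_t)."""
    traj, s = [], s0
    for t in range(T):
        a = policy(s)
        s_next = rng.choice(mdp.n_states, p=mdp.P[s, a])  # s_{t+1} ~ P(.|s_t, a_t)
        traj.extend([s, a])
        s = s_next
    traj.append(s)
    return traj
```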

Note: Two key ingredients that will be discussed later are:

• Policy: How actions are selected.
• Value function: What determines which actions (and states) are good.

The state and action spaces are generally simplified to be finite, but they can be infinite, either countably infinite or continuous. In general, a non-Markovian decision process' transitions could depend on much more information, such as the whole history:

P(s_{t+1} = s′ | s_t = s, a_t = a, s_{t−1}, a_{t−1}, ..., s_0, a_0)

1.2 Example: The Amazing Goods Company (Supply Chain)

Consider an example of a supply chain problem which can be formulated as a Markov decision process.

Description: At each month t, a warehouse contains s_t items of a specific good, and the demand for that good is D (stochastic). At the end of each month, the manager of the warehouse can order a_t more items from the supplier.

• The cost of maintaining an inventory s is h(s).
• The cost to order a items is C(a).
• The income for selling q items is f(q).
• If the demand d ∼ D is bigger than the available inventory s, customers that cannot be served leave.
• The value of the remaining inventory at the end of the year is g(s).
• Constraint: the store has a maximum capacity M.

We can formulate the problem as an MDP as follows:

• State space: s ∈ S = {0, 1, ..., M}, the number of goods in stock.
• Action space: for a state s, a ∈ A(s) = {0, 1, ..., M − s}. As it is not possible to order more items than the capacity of the store, the action space depends on the current state s.

• Dynamics: s_{t+1} = [s_t + a_t − d_t]^+. The demand d_t is stochastic and time-independent. Formally, d_t ∼ D i.i.d.

• Reward: r_t = −C(a_t) − h(s_t + a_t) + f([s_t + a_t − s_{t+1}]^+). This corresponds to a purchasing cost, a cost for excess stock (storage, maintenance), and a reward for fulfilling orders.

• Discount: γ = 0.95. The discount factor essentially encodes the sentiment that a dollar today is worth more than a dollar tomorrow.

Infinite horizon objective: V(s_0; a_0, ...) = Σ_{t=0}^{∞} γ^t r_t, which corresponds to the cumulative reward, plus the value of the remaining inventory.
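The following is a minimal Python sketch of the dynamics and reward above. The lecture does not specify the cost functions h, C, f, the demand distribution D, or the capacity M, so the linear costs, Poisson demand, and numbers below are purely illustrative assumptions.

```python
import numpy as np

M, GAMMA = 20, 0.95            # assumed capacity and the discount factor from above
h = lambda s: 0.1 * s          # assumed inventory-holding cost h(s)
C = lambda a: 1.0 * a          # assumed ordering cost C(a)
f = lambda q: 2.0 * q          # assumed sales income f(q)

def step(s, a, rng=np.random.default_rng()):
    """One month of the supply chain MDP; assumes a is in A(s) = {0, ..., M - s}."""
    d = rng.poisson(5)                     # assumed demand d ~ D (Poisson for illustration)
    s_next = max(s + a - d, 0)             # s_{t+1} = [s_t + a_t - d_t]^+
    sold = s + a - s_next                  # items actually sold this month
    r = -C(a) - h(s + a) + f(sold)         # r_t = -C(a_t) - h(s_t + a_t) + f(sold)
    return s_next, r
```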

1.3 Example: Freeway Atari game (David Crane, 1981)

FREEWAY is an Atari 2600 video game, released in 1981. In FREEWAY, the agent must navigate a chicken (think: jaywalker) across a busy road of ten lanes of incoming traffic. The top of the screen lists the score. After a successful crossing, the chicken is teleported back to the bottom of the screen. If hit by a car, the chicken is forced back either slightly, or pushed back to the bottom of the screen, depending on the difficulty switch setting. One or two players can play at once.

Figure 1: Atari 2600 video game FREEWAY

Discussion: How to devise a successful strategy for jaywalking across this busy road? We can formulate the problem as an MDP as follows:

• State space:
  – Option 1: Whether there is a car, chicken, or nothing in each location of each road lane and road shoulder, where the road is discretized by lane (10) and car length. There is also a game-over state, reached when enough damage has been done to the chicken. The velocity of the cars could also be added.
  – Option 2: Multiple consecutive image frames of the game.
  – Option 3: A fixed-size vector representing the coordinates of the cars, assuming a maximum number of cars.
• Action space: up, down, left, right, or no action (movement of the chicken).
• Transitions: the chicken and vehicles move, based on the action selected; transitions may include new cars (randomly) entering different lanes.
• Reward: whether or not the chicken is at the top of the screen.
• Discount: γ = 0.999. We choose a high discount factor as we care about maximizing the overall score over time.

Infinite horizon objective: Σ_{t=0}^{∞} γ^t r_t, indicating the number of times the chicken crossed the road.
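As a concrete illustration of Option 3 above, here is a minimal Python sketch (not from the lecture) of a fixed-size coordinate encoding. MAX_CARS, the (x, y) layout, and the function name are illustrative assumptions, not the game's actual internal representation.

```python
import numpy as np

MAX_CARS = 10  # assumed maximum number of cars on screen

def encode_state(chicken_y, cars):
    """cars: list of (x, y) car positions; returns a fixed-length state vector."""
    vec = np.zeros(1 + 2 * MAX_CARS, dtype=np.float32)
    vec[0] = chicken_y                          # chicken's vertical position
    for i, (x, y) in enumerate(cars[:MAX_CARS]):
        vec[1 + 2 * i], vec[2 + 2 * i] = x, y   # remaining slots stay zero-padded
    return vec
```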

Figure 2: Deep reinforcement learning vs human player. Freeway is one of the games where a DQN agent is able to exceed human performance. Mnih et al. (2013)

Some related applications:
• Self-driving cars (input from LIDAR, radar, cameras)
• Traffic signal control (input from cameras)
• Crowd navigation robots

2 MDP Assumptions

Several assumptions are made when developing MDPs, and one needs to be careful about them when designing an MDP. Consider the Atari Breakout game:

Figure 3: Non-Markovian dynamics, as more information would help. Mnih et al. (2013)

Figure 4: Markovian dynamics. Mnih et al. (2013)

Fact 2. An MDP satisfies the Markovian property if:

P(s_{t+1} = s′ | τ_t, a_t) = P(s_{t+1} = s′ | s_t, a_t, s_{t−1}, a_{t−1}, ..., s_0, a_0) = P(s_{t+1} = s′ | s_t = s, a_t = a)

i.e., the current state s_t and action a_t are sufficient for predicting the next state s′.

As discussed previously, game states can be encoded as frames. However, the formulation in Figure 3 is non-Markovian because we do not know which direction the ball is travelling in from just one image. In Mnih et al. (2013), the authors instead use multiple frames (see Figure 4) to encode such information, which could make it Markovian.
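Here is a minimal Python sketch of that frame-stacking idea: keep the last k frames as the state so that direction and velocity information becomes recoverable. The choice k = 4 and the 84×84 frame shape follow common practice, but the class and preprocessing here are illustrative assumptions, not the paper's exact pipeline.

```python
from collections import deque
import numpy as np

class FrameStack:
    """Maintain the last k frames as an approximately Markovian state (sketch)."""
    def __init__(self, k=4, frame_shape=(84, 84)):
        self.k = k
        self.frames = deque([np.zeros(frame_shape, dtype=np.float32)] * k, maxlen=k)

    def push(self, frame):
        """Add the newest frame and return the stacked state of shape (k, H, W)."""
        self.frames.append(np.asarray(frame, dtype=np.float32))
        return np.stack(self.frames, axis=0)
```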

Assumption 3. Time assumption: Time is discrete

t → t + 1

Possible relaxation:

• Identify the proper time granularity.
• Most of the MDP literature extends to continuous time.

Figure 5: Too fine-grained resolution

Figure 6: Too coarse-grained resolution

Identifying the proper time granularity is important, as one can observe from the examples in Figure 5, where it is too fine-grained, and Figure 6, where it is too coarse-grained. Having a too fine-grained time resolution would give a very long-horizon learning problem, which can be challenging.

Assumption 4. Reward assumption: The reward is uniquely defined by a transition (or part of it)

r(s, a, s′)

Possible relaxation:

• Various notions of rewards: global or local reward function.
• Move to inverse reinforcement learning (IRL) to induce the reward function from desired behaviours.

Assumption 5. Stationarity assumption: The dynamics and reward do not change over time, i.e.,

p(s′|s, a) = P(s_{t+1} = s′ | s_t = s, a_t = a),    r(s, a, s′)

This is often the biggest assumption, especially in real-world contexts. Some types of non-stationarities can be handled. Possible relaxation:

• Identify and add/remove the non-stationary components (e.g. cyclo-stationary dynamics as seen in traffic).

• Identify the time-scale of the changes.
• Work on finite horizon problems.

3 Policy

Definition 6. A decision rule d can be:

• Deterministic: d : S → A,
• Stochastic: d : S → ∆(A),

• History-dependent: d : Ht → A,

• Markov: d : S → ∆(A).

A decision rule, in essence, is a mapping from states to a probability distribution over actions.

Definition 7. A policy (strategy, plan) π can be:

• Non-stationary: π = (d_0, d_1, d_2, ...),
• Stationary: π = (d, d, d, ...).

A policy is a sequence of decision rules. You have as many decision rules as there are time-steps in the horizon.

Fact 8. MDP M + stationary policy π = (d, d, d, ...) ⟹ a Markov chain with state space S and transition probability p(s′|s) = p(s′|s, d(s)).

For simplicity, π will be used instead of d for stationary policies, and πt instead of dt, for non-stationary policies.
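Fact 8 is easy to see computationally. Below is a minimal sketch (illustrative names, not from the lecture) that builds the induced Markov chain's transition matrix from a tabular MDP and a stationary deterministic decision rule.

```python
import numpy as np

def induced_chain(P, d):
    """P: array of shape (n_states, n_actions, n_states); d: array with d[s] = chosen action.

    Returns the Markov chain transition matrix P_pi with P_pi[s, s'] = P[s, d(s), s'].
    """
    n_states = P.shape[0]
    P_pi = np.array([P[s, d[s]] for s in range(n_states)])  # row s is P(.|s, d(s))
    return P_pi  # each row sums to 1, so this is a valid Markov chain
```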

3.1 The Amazing Goods Company (Supply Chain) Example

In this section, we look at what different types of policies and decision rules would look like in this example.

• Stationary policy composed of deterministic Markov decision rules:

  π(s) = M − s if s ≤ M/4, and π(s) = 0 otherwise.

• Stationary policy composed of stochastic history-dependent decision rules:

  π(s_t) = U(M − s_t, M − s_t + 10) if s_t ≤ s_{t−1}/2, and π(s_t) = 0 otherwise.

• Non-stationary policy composed of deterministic Markov decision rules:

  π_t(s) = M − s if t ≤ 6, and π_t(s) = ⌊(M − s)/5⌋ otherwise.

As one can see, any combination of different types of decision rules and policies can be constructed.
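For concreteness, here is a minimal Python sketch of the three example policies above for the supply chain MDP. The capacity M and the interpretation of U as a discrete uniform distribution are illustrative assumptions.

```python
import numpy as np

M = 20                              # assumed warehouse capacity
rng = np.random.default_rng()

def pi_stationary_markov(s):
    """Deterministic Markov rule: restock fully when inventory is at most M/4."""
    return M - s if s <= M / 4 else 0

def pi_history_dependent(s_t, s_prev):
    """Stochastic history-dependent rule: order U(M - s_t, M - s_t + 10) items
    when inventory dropped to at most half of last month's level."""
    return int(rng.integers(M - s_t, M - s_t + 11)) if s_t <= s_prev / 2 else 0

def pi_nonstationary(s, t):
    """Non-stationary deterministic Markov rule: behavior changes after month 6."""
    return M - s if t <= 6 else (M - s) // 5
```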

4 State Value Function

The state value function is what we are optimizing for. Although there are several things that we can optimize for, we are generally trying to maximize the cumulative rewards.

Given a policy π = (d1, d2, ...) (deterministic to simplify notation)

• Infinite time horizon with discount: The problem never terminates but rewards which are closer in time receive higher importance

V^π(s) = E[ Σ_{t=0}^{∞} γ^t r(s_t, π_t(h_t)) | s_0 = s; π ]

with discount factor 0 ≤ γ ≤ 1:
  – a small γ emphasizes short-term rewards; a large γ emphasizes long-term rewards;
  – for any γ ∈ [0, 1) the series always converges (for bounded rewards).

This is the most popular formulation used in practice. It is used when there is uncertainty about the deadline and/or an intrinsic definition of discount.

• Finite time horizon T : Deadline at time T , the agent focuses on the sum of the rewards up to T .

V^π(s, t) = E[ Σ_{τ=t}^{T−1} r(s_τ, π_τ(h_τ)) + R(s_T) | s_t = s; π = (π_t, ..., π_T) ]

where R(s_T) is a value function for the final state. This formulation is used when there is an intrinsic deadline to meet; e.g., this course has a fixed deadline.

• Stochastic shortest path: The problem has no fixed termination time, but the agent will eventually reach a termination state.

  V^π(s) = E[ Σ_{t=0}^{T} r(s_t, π_t(h_t)) | s_0 = s; π ]

where T is the first (random) time at which the termination state is reached. These are less discussed but are pertinent to many applications that we will discuss. This formulation is often used when there is a specific goal condition, e.g. when a car reaches a destination.

• Infinite time horizon with average reward: The problem never terminates but the agent only focuses on the (expected) average of the rewards.

  V^π(s) = lim_{T→∞} E[ (1/T) Σ_{t=0}^{T−1} r(s_t, π_t(h_t)) | s_0 = s; π ]

The 1/T factor is essential for the limit to be finite. This is often used when the system needs to be constantly controlled over time, e.g., a medical implant.

Note: The expectations refer to all possible stochastic trajectories. A (possibly non-stationary, stochastic) policy π applied from state s_0 returns (s_0, r_0, s_1, r_1, s_2, r_2, ...)

where r_t = r(s_t, π_t(h_t)) and s_t ∼ p(·|s_{t−1}, a_{t−1} = π_{t−1}(h_{t−1})) are random realizations. More generally, for stochastic policies:

  V^π(s) = E_{a_0, s_1, a_1, s_2, ...}[ Σ_{t=0}^{∞} γ^t r(s_t, π_t(h_t)) | s_0 = s; π ]

From now on we will mostly work in the discounted infinite horizon setting.
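Since these expectations range over trajectories, one simple way to approximate V^π(s) is Monte Carlo rollouts. Below is a minimal sketch under assumed interfaces: env_step(s, a) returns (s_next, r), and the infinite sum is truncated at a finite horizon (the truncation error is bounded by γ^horizon · r_max / (1 − γ) for bounded rewards).

```python
import numpy as np

def mc_value_estimate(env_step, policy, s0, gamma=0.95, n_rollouts=1000, horizon=200):
    """Estimate the discounted value V^pi(s0) by averaging truncated returns."""
    returns = []
    for _ in range(n_rollouts):
        s, G, discount = s0, 0.0, 1.0
        for _ in range(horizon):       # truncate the infinite discounted sum
            a = policy(s)
            s, r = env_step(s, a)
            G += discount * r
            discount *= gamma
        returns.append(G)
    return float(np.mean(returns))     # Monte Carlo estimate of V^pi(s0)
```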

5 Optimization Problem

Definition 9. Optimal policy and optimal value function: The solution to an MDP is an optimal policy π* satisfying

  π* ∈ arg max_{π ∈ Π} V^π

in all states s ∈ S, where Π is some policy set of interest. The corresponding value function is the optimal value function

  V* = V^{π*}.

The optimal policy maximizes the value for every state.

Limitations:
• Average case: all previous value functions define an objective in expectation.
• Imperfect information (partial observations).
• Time delays.
• Correlated disturbances.

6 Dynamic Programming for MDPs

Consider dynamic programming for deterministic problems, shown in Figure 7.

Figure 7: Dynamic programming for deterministic problems

We will shortly show that an algorithm with a similar form can be used for MDPs too (a code sketch of the finite-horizon backward recursion appears after this list).

• Finite horizon deterministic (e.g. shortest path routing, travelling salesperson):

  V_T*(s_T) = r_T(s_T)  ∀ s_T
  V_t*(s_t) = max_{a_t ∈ A} [ r_t(s_t, a_t) + V_{t+1}*(s_{t+1}) ]  ∀ s_t, t = T−1, ..., 0

• Finite horizon stochastic and Markov problems (e.g. driving, games):

  V_T*(s_T) = r_T(s_T)  ∀ s_T
  V_t*(s_t) = max_{a_t ∈ A} [ r_t(s_t, a_t) + E_{s_{t+1} ∼ P(·|s_t, a_t)} V_{t+1}*(s_{t+1}) ]  ∀ s_t, t = T−1, ..., 0

• For discounted infinite horizon problems (e.g. package delivery over months or years, long-term customer satisfaction, control of autonomous vehicles), we have the following optimal value function.

V*(s) = max_{a ∈ A} [ r(s, a) + γ E_{s′ ∼ P(·|s,a)} V*(s′) ]  ∀ s

This is known as the optimal Bellman equation. From this, the optimal policy can be extracted as:

  π*(s) = arg max_{a ∈ A} [ r(s, a) + γ E_{s′ ∼ P(·|s,a)} V*(s′) ]  ∀ s
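Before turning to the infinite-horizon case, here is a minimal Python sketch of the finite-horizon backward recursion from the first two bullets above, for the stochastic case with tabular arrays. The array layout and time-invariant r(s, a) are illustrative assumptions. Unlike the discounted infinite-horizon equation, this recursion is already an algorithm, since it sweeps t = T−1, ..., 0.

```python
import numpy as np

def backward_induction(P, r, r_T, T):
    """P: (S, A, S) transitions; r: (S, A) rewards; r_T: (S,) terminal reward; horizon T."""
    V = [None] * (T + 1)
    pi = [None] * T
    V[T] = r_T.copy()                      # V_T*(s_T) = r_T(s_T)
    for t in range(T - 1, -1, -1):
        Q = r + P @ V[t + 1]               # Q[s, a] = r(s, a) + E_{s'} V_{t+1}*(s')
        V[t] = Q.max(axis=1)               # V_t*(s) = max_a Q[s, a]
        pi[t] = Q.argmax(axis=1)           # optimal (non-stationary) decision rule at time t
    return V, pi
```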

Question: Any difficulties with this new algorithm? This is not an algorithm yet, since V* is defined in terms of itself.

7 Value Iteration Algorithm

With this, we can construct the value iteration algorithm as follows:

1. Let V_0(s) be any function V_0 : S → R [Note: not stage 0, but iteration 0]

2. Apply the principle of optimality so that given Vi at iteration i, we compute

V_{i+1}(s) = max_{a ∈ A} [ r(s, a) + γ E_{s′ ∼ P(·|s,a)} V_i(s′) ]

3. Terminate when V_i stops improving, e.g. when max_s |V_{i+1}(s) − V_i(s)| is small.
4. Return the greedy policy:

π_K(s) = arg max_{a ∈ A} [ r(s, a) + γ E_{s′ ∼ P(·|s,a)} V_K(s′) ]

Definition 10. Optimal Bellman Operator: For any W ∈ R^{|S|}, the optimal Bellman operator is defined as:

  TW(s) = max_{a ∈ A} [ r(s, a) + γ E_{s′ ∼ P(·|s,a)} W(s′) ]  ∀ s

With this, the value iteration algorithm above can be written concisely as:

V_{i+1}(s) = T V_i(s)  ∀ s
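Below is a minimal Python sketch of the value iteration algorithm above for a finite MDP given as tabular arrays P[s, a, s'] and r[s, a, s']. The function name, array layout, and tolerance are illustrative choices, not from the lecture.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-6):
    """P: (S, A, S) transition probabilities; r: (S, A, S) rewards; returns (V, greedy policy)."""
    n_states, n_actions, _ = P.shape
    expected_r = np.einsum("sap,sap->sa", P, r)   # E_{s'}[r(s, a, s')] for each (s, a)
    V = np.zeros(n_states)                        # V_0 can be any function S -> R
    while True:
        # Apply the optimal Bellman operator: (TV)(s) = max_a [ r(s, a) + gamma * E_{s'} V(s') ]
        Q = expected_r + gamma * (P @ V)          # Q[s, a]
        V_next = Q.max(axis=1)
        if np.max(np.abs(V_next - V)) < tol:      # terminate when V stops improving
            V = V_next
            break
        V = V_next
    Q = expected_r + gamma * (P @ V)              # extract the greedy policy w.r.t. the final V
    policy = Q.argmax(axis=1)
    return V, policy
```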

The proof of the optimal Bellman equation leverages the definition of the value function, as well as the Markov and change-of-time properties.

Proof: The Optimal Bellman Equation

V*(s) = max_π E[ Σ_{t=0}^{∞} γ^t r(s_t, π_t(h_t)) | s_0 = s; π ]
      = max_{a, π′} [ r(s, a) + γ Σ_{s′} p(s′|s, a) V^{π′}(s′) ]
      = max_a [ r(s, a) + γ Σ_{s′} p(s′|s, a) max_{π′} V^{π′}(s′) ]
      = max_a [ r(s, a) + γ Σ_{s′} p(s′|s, a) V*(s′) ]

8 Summary

• Stochastic problems are needed to represent uncertainty in the environment.
• Markov Decision Processes (MDPs) represent a general class of stochastic sequential decision problems, for which reinforcement learning methods are commonly designed. MDPs enable a discussion of model-free learning.

• The Markovian property means that the next state is fully determined by the current state and action.

• Although quite general, MDPs bake in numerous assumptions. Care should be taken when modeling a problem as an MDP.

• Similarly, care should be taken to select an appropriate type of policy and value function, depending on the use case.

• Finally, dynamic programming for the deterministic setting can also be extended to MDPs. In particular, we introduce the optimal Bellman operator and the value iteration algorithm.

9 Contributions

Athul Paul Jacob contributed to this draft of the lecture. TA Sirui Li reviewed the draft.

References

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
