Markov Decision Processes and Reinforcement Learning

Machine Learning 10-601B

Many of these slides are derived from Tom Mitchell, William Cohen, and Eric Xing. Thanks!

Types of Learning

• Supervised Learning
– A situation in which sample (input, output) pairs of the function to be learned can be perceived or are given
– You can think about it as if there is a kind teacher
- Training data: (X, Y). (features, label)
- Predict Y, minimizing some loss.
- Regression, Classification.

• Unsupervised Learning
- Training data: X. (features only)
- Find “similar” points in high-dim X-space.
- Clustering.

Example of Supervised Learning

• Predict the price of a stock in 6 months from now, based on economic data. (Regression)
• Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack. The prediction is to be based on demographic, diet, and clinical measurements for that patient. (Logistic Regression)
• Identify the numbers in a handwritten ZIP code, from a digitized image (pixels). (Classification)

What is Learning?

• Learning takes place as a result of interaction between an agent and the world. The idea behind learning is that

– Percepts received by an agent should be used not only for understanding/interpreting/prediction, as in the tasks we have addressed so far, but also for acting, and furthermore for improving the agent’s ability to behave optimally in the future to achieve the goal.

RL is learning from interaction

Examples of Reinforcement Learning

• How should a robot behave so as to optimize its “performance”? (Robotics)

• How to automate the motion of a helicopter? (Control Theory)

• How to make a good chess-playing program? (Artificial Intelligence)

Supervised Learning vs Reinforcement Learning

• Supervised learning: learn a function f : X → Y
– X: inputs, Y: outputs
– The predicted outputs can be evaluated immediately by the teacher
– f is evaluated with a loss function

• Reinforcement learning: learn a policy
– S: states, A: actions
– Take action A to affect state S
– The predicted action A cannot be evaluated directly, but can be given positive/negative rewards
– The policy is evaluated with value functions

Reinforcement Learning

• Reinforcement Learning
– In the case of the agent acting on its environment, it receives some evaluation of its action (reinforcement), but is not told which action is the correct one to achieve its goal
– Imagine learning to go back to GHC from this room …

- Training data: (S, A, R). (State, Action, Reward)
- Develop an optimal policy (sequence of decision rules) for the learner so as to maximize its long-term reward.
- Robotics, board-game playing programs.

Elements of RL

• Policy
- A map from state space to action space.
- May be stochastic.
• Reward function R(s)
- Maps each state (or state-action pair) to a real number, called the reward.
• Value function
- The value of a state (or state-action pair) is the total expected reward, starting from that state (or state-action pair).
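To make these three elements concrete, here is a minimal Python sketch for a hypothetical 3-state, 2-action problem. The states, actions, and numbers are illustrative assumptions, not taken from the slides.

```python
import numpy as np

# Hypothetical 3-state, 2-action example; all numbers below are made up.
n_states, n_actions = 3, 2

# Policy: a map from state space to action space (here a deterministic lookup
# table; a stochastic policy would instead store a distribution over actions).
policy = np.array([1, 0, 1])          # policy[s] = action to take in state s

# Reward function R(s): maps each state to a real number.
R = np.array([0.0, -1.0, 10.0])

# Value function V(s): total expected (discounted) reward starting from state s.
# Here just a table, to be filled in by one of the algorithms later in the lecture.
V = np.zeros(n_states)
```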

History of Reinforcement Learning

• Roots in the psychology of animal learning (Thorndike, 1911).

• Another independent thread was the problem of optimal control, and its solution using value functions and dynamic programming (Bellman, 1957).

• Idea of temporal difference learning (on-line method), e.g., playing board games (Samuel, 1959).

• A major breakthrough was the discovery of Q-learning (Watkins, 1989).

Robot in a room

• What’s the strategy to achieve max reward?
• What if the actions were NOT deterministic?

Policy

Reward for each step: -2

Reward for each step: -0.1

Reward for each step: -0.04

The Precise Goal

• To find a policy that maximizes the value function.

• There are different approaches to achieve this goal in various situations.

• Value iteration and policy iteration are two more classic approaches to this problem. But essentially both are dynamic programming.

• Q-learning is a more recent approach to this problem. Essentially it is a temporal-difference method.

Markov Decision Processes

A Markov decision process is a tuple (S, A, {P_sa}, γ, R), where:
– S is a set of states
– A is a set of actions
– P_sa are the state transition probabilities: for each state s ∈ S and action a ∈ A, P_sa is a distribution over next states
– γ ∈ [0, 1) is the discount factor
– R is the reward function, R : S → ℝ (or R : S × A → ℝ)

The dynamics of an MDP

• We start in some state s0, and get to choose some action a0 ∈ A
• As a result of our choice, the state of the MDP randomly transitions to some successor state s1, drawn according to s1 ~ P_{s0 a0}

• Then, we get to pick another action a1
• …
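A short Python sketch of these dynamics, rolling out s0, a0, s1, a1, … from a tabular transition model. The function name, the toy transition model P, and the placeholder policy are illustrative assumptions, not part of the slides.

```python
import numpy as np

def sample_trajectory(P, policy, s0, horizon=10, seed=0):
    """Roll out the MDP dynamics: start in s0, pick a_t = policy(s_t),
    then draw s_{t+1} ~ P_{s_t a_t}.  P has shape (n_states, n_actions, n_states)."""
    rng = np.random.default_rng(seed)
    states, actions = [s0], []
    s = s0
    for _ in range(horizon):
        a = policy(s)
        actions.append(a)
        s = int(rng.choice(P.shape[2], p=P[s, a]))   # s_{t+1} ~ P_{s_t a_t}
        states.append(s)
    return states, actions

# Tiny made-up example: 2 states, 2 actions, arbitrary transition probabilities.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.7, 0.3]]])
states, actions = sample_trajectory(P, policy=lambda s: 1 - s, s0=0)
print(states, actions)
```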

The dynamics of an MDP (Cont’d)

• Upon visiting the sequence of states s0, s1, …, with actions a0, a1, …, our total payoff is given by
R(s0, a0) + γ R(s1, a1) + γ² R(s2, a2) + ⋯

• Or, when we are writing rewards as a function of the states only, this becomes
R(s0) + γ R(s1) + γ² R(s2) + ⋯

– For most of our development, we will use the simpler state rewards R(s), though the generalization to state-action rewards R(s, a) offers no special difficulties.

• Our goal in reinforcement learning is to choose actions over time so as to maximize the expected value of the total payoff:
E[ R(s0) + γ R(s1) + γ² R(s2) + ⋯ ]

Policy

• A policy is any function mapping from the states to the actions.

• We say that we are executing some policy if, whenever we are in state s, we take action a = π(s).

• We also define the value function for a policy π according to
Vπ(s) = E[ R(s0) + γ R(s1) + γ² R(s2) + ⋯ | s0 = s, π ]

– Vπ(s) is simply the expected sum of discounted rewards upon starting in state s, and taking actions according to π.

Value Function

• Given a fixed policy π, its value function Vπ satisfies the Bellman equations:
Vπ(s) = R(s) + γ Σ_{s'} P_{sπ(s)}(s') Vπ(s')

– The first term is the immediate reward; the second term is the expected sum of future discounted rewards.

– Bellman's equations can be used to efficiently solve for Vπ (see the sketch below)
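One way to read "efficiently solve for Vπ": for a fixed policy and a finite state space, the Bellman equations are |S| linear equations in the |S| unknowns Vπ(s), so they can be solved directly. The sketch below assumes a tabular transition model P[s, a, s'] and state rewards R[s]; the function and variable names are illustrative.

```python
import numpy as np

def evaluate_policy(P, R, policy, gamma):
    """Solve the Bellman equations V = R + gamma * P_pi V exactly.
    P: (n_states, n_actions, n_states) transition probabilities P[s, a, s'].
    R: (n_states,) state rewards.  policy: (n_states,) deterministic actions."""
    n_states = len(R)
    # P_pi[s, s'] = P(s' | s, policy(s)): the Markov chain induced by the policy.
    P_pi = P[np.arange(n_states), policy]
    # (I - gamma * P_pi) V = R is a linear system with a unique solution for gamma < 1.
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
```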

The Grid world

Transition model M: 0.8 in the direction you want to go, 0.2 perpendicular (0.1 left, 0.1 right).
Policy: mapping from states to actions.

[Figure: an optimal policy for the 4x3 grid-world environment, and the utilities of the states:]
      1      2      3      4
3   0.812  0.868  0.912   +1
2   0.762  (wall) 0.660   -1
1   0.705  0.655  0.611  0.388

Environment:
– Observable (accessible): the percept identifies the state → Markov decision problem (MDP).
– Partially observable MDP (POMDP): the percepts do not have enough info to identify the transition probabilities.
Markov property: transitions depend on the current state only, not on the path to the state.

Optimal value function

• We define the optimal value function according to

V*(s) = max_π Vπ(s)    (1)

– In other words, this is the best possible expected sum of discounted rewards that can be attained using any policy

• There is a version of Bellman's equations for the optimal value function:
V*(s) = R(s) + max_{a ∈ A} γ Σ_{s'} P_{sa}(s') V*(s')    (2)

– Why?
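One way to see why, sketched informally (this derivation is an addition, not verbatim from the slides): combine the definition V*(s) = max_π Vπ(s) with the Bellman equation for a fixed policy, and note that the first action can be optimized separately from the (already optimal) behavior that follows it.

```latex
% Sketch of the standard argument behind equation (2):
\[
V^{*}(s) \;=\; \max_{\pi} \Big[ R(s) + \gamma \sum_{s'} P_{s\pi(s)}(s')\, V^{\pi}(s') \Big]
         \;=\; R(s) + \gamma \max_{a \in A} \sum_{s'} P_{sa}(s')\, V^{*}(s').
\]
```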

Optimal policy

• We also define the optimal policy π* as follows:

π*(s) = arg max_{a ∈ A} Σ_{s'} P_{sa}(s') V*(s')    (3)

• Fact: V*(s) = Vπ*(s) ≥ Vπ(s) for every policy π and every state s.

– Policy π* has the interesting property that it is the optimal policy for all states s.
• It is not the case that if we were starting in some state s then there'd be some optimal policy for that state, and if we were starting in some other state s′ then there'd be some other policy that's optimal for s′.
• The same policy π* attains the maximum above for all states s. This means that we can use the same policy no matter what the initial state of our MDP is.

The Basic Setting for Learning

• Training data: n finite-horizon trajectories, of the form (s0, a0, r0, s1, a1, r1, …, sT, aT, rT)

• Deterministic or stochastic policy: a sequence of decision rules

• Each π maps from the observable history (states and actions) to the action space at that time point.

Algorithm 1: Value iteration

• Consider only MDPs with finite state and action spaces

• The value iteration algorithm:
1. For each state s, initialize V(s) := 0.
2. Repeat until convergence: for every state, update V(s) := R(s) + max_{a ∈ A} γ Σ_{s'} P_{sa}(s') V(s').

• It can be shown that value iteration will cause V to converge to V*. Having found V*, we can find the optimal policy as in equation (3): π*(s) = arg max_{a ∈ A} Σ_{s'} P_{sa}(s') V*(s').
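A minimal tabular implementation sketch of the update described above, assuming a transition model P[s, a, s'] and state rewards R[s]; the function and variable names are illustrative.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """Value iteration for a finite MDP (a sketch).
    Repeatedly applies V(s) := R(s) + gamma * max_a sum_{s'} P_sa(s') V(s')."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        # Q[s, a] = R(s) + gamma * sum_{s'} P[s, a, s'] * V[s']
        Q = R[:, None] + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    # Greedy policy with respect to the (approximately) optimal value function.
    policy = Q.argmax(axis=1)
    return V, policy
```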

Algorithm 2: Policy iteration

• The policy iteration algorithm:
1. Initialize π randomly.
2. Repeat until convergence:
   (a) Let V := Vπ (policy evaluation, e.g., by solving the Bellman equations).
   (b) For each state s, let π(s) := arg max_{a ∈ A} Σ_{s'} P_{sa}(s') V(s').

– The inner loop repeatedly computes the value function for the current policy, and then updates the policy using the current value function.
– Greedy update.
– After at most a finite number of iterations of this algorithm, V will converge to V*, and π will converge to π*.
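A corresponding sketch of policy iteration under the same tabular assumptions, with the policy-evaluation step done by solving the linear Bellman system exactly; the names are illustrative.

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """Policy iteration (a sketch): alternate exact policy evaluation with a
    greedy policy-improvement step, until the policy stops changing."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)
    while True:
        # Policy evaluation: solve the linear Bellman system for V^pi.
        P_pi = P[np.arange(n_states), policy]
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # Greedy policy improvement: argmax_a sum_{s'} P_sa(s') V(s').
        new_policy = (P @ V).argmax(axis=1)
        if np.array_equal(new_policy, policy):
            return V, policy
        policy = new_policy
```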

Convergence

• The utility values for selected states at each iteration step in the application of VALUE-ITERATION to the 4x3 world in our example

[Figure: the 4x3 grid world, with the +1 terminal state at (4,3), the -1 terminal state at (4,2), and the start state at (1,1).]

Theorem: As t → ∞, value iteration converges to the exact U even if updates are done asynchronously and the state i is picked randomly at every step.

Convergence

• When to stop value iteration?

Q learning

• Define the Q-value function: Q(s, a) is the expected discounted reward of taking action a in state s and acting optimally thereafter, so
Q(s, a) = R(s) + γ Σ_{s'} P_{sa}(s') max_{a' ∈ A} Q(s', a')

• Rule to choose the action to take:
π(s) = arg max_{a ∈ A} Q(s, a)

Algorithm 3: Q learning

For each pair (s, a), initialize Q(s, a)
Observe the current state s
Loop forever {
    Select an action a (optionally with ε-exploration) and execute it
    Receive the immediate reward r and observe the new state s'
    Update Q(s, a), e.g. Q(s, a) ← Q(s, a) + α [ r + γ max_{a'} Q(s', a') − Q(s, a) ]
    s = s'
}
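A tabular Q-learning sketch following the loop above. The environment interface env_step(s, a) -> (r, s_next), the learning rate alpha, the fixed starting state, and the ε-greedy selection are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def q_learning(env_step, n_states, n_actions, gamma,
               alpha=0.1, epsilon=0.1, n_steps=10_000, seed=0):
    """Tabular Q-learning (a sketch).  `env_step(s, a) -> (r, s_next)` is a
    placeholder for the environment interaction."""
    rng = np.random.default_rng(seed)
    Q = np.zeros((n_states, n_actions))          # initialize Q(s, a)
    s = 0                                        # observe the current state
    for _ in range(n_steps):
        # epsilon-greedy action selection (exploration vs. exploitation).
        if rng.random() < epsilon:
            a = int(rng.integers(n_actions))
        else:
            a = int(Q[s].argmax())
        r, s_next = env_step(s, a)               # receive reward, observe new state
        # Temporal-difference update toward r + gamma * max_a' Q(s', a').
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
        s = s_next
    return Q
```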

Exploration

• Tradeoff between exploitation (control) and exploration (identification)

• Extremes: greedy vs. random acting (n-armed bandit models); see the ε-greedy sketch below

• Q-learning converges to optimal Q-values if
– every state is visited infinitely often (due to exploration),
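For instance, ε-greedy action selection interpolates between the two extremes mentioned above: ε = 0 is the purely greedy (exploitation-only) extreme, ε = 1 is purely random acting (exploration-only). The function name and interface below are illustrative.

```python
import numpy as np

def epsilon_greedy(Q_row, epsilon, rng=None):
    """Pick an action from one row of the Q-table.
    epsilon = 0: pure exploitation; epsilon = 1: pure exploration."""
    if rng is None:
        rng = np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(Q_row)))   # explore: random action
    return int(np.argmax(Q_row))               # exploit: greedy action
```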

A Success Story

• TD-Gammon (Tesauro, G., 1992)
- A backgammon-playing program.
- An application of temporal difference learning.
- The basic learner is a neural network.
- It trained itself to the world-class level by playing against itself and learning from the outcome. So smart!!

What is special about RL?

• RL is learning how to map states to actions, so as to maximize a numerical reward over time.

• Unlike other forms of learning, it is a multistage decision-making process (often Markovian).

• An RL agent must learn by trial and error. (Not entirely supervised, but interactive.)

• Actions may affect not only the immediate reward but also subsequent rewards (delayed effect).

Summary

• Both value iteration and policy iteration are standard for solving MDPs, and there isn't currently universal agreement over which algorithm is better.
• For small MDPs, value iteration is often very fast and converges within very few iterations. However, for MDPs with large state spaces, solving for Vπ explicitly would involve solving a large system of linear equations, and could be difficult; in these problems, value iteration may be preferred.
• In practice, value iteration seems to be used more often than policy iteration.
• Q-learning is model-free, and is based on temporal differences.

Types of Learning

• Supervised Learning
- Training data: (X, Y). (features, label)
- Predict Y, minimizing some loss.
- Regression, Classification.
• Unsupervised Learning
- Training data: X. (features only)
- Find “similar” points in high-dim X-space.
- Clustering.
• Reinforcement Learning
- Training data: (S, A, R). (State, Action, Reward)
- Develop an optimal policy (sequence of decision rules) for the learner so as to maximize its long-term reward.
- Robotics, board-game playing programs.
