Markov Decision Process and Reinforcement Learning
Machine Learning 10-601B
Many of these slides are derived from Tom Mitchell, William Cohen, and Eric Xing. Thanks!

Types of Learning
• Supervised Learning
  – A situation in which sample (input, output) pairs of the function to be learned can be perceived or are given
  – You can think about it as if there were a kind teacher
  - Training data: (X, Y) (features, label)
  - Predict Y, minimizing some loss
  - Regression, Classification
• Unsupervised Learning
  - Training data: X (features only)
  - Find “similar” points in high-dimensional X-space
  - Clustering
Example of Supervised Learning
• Predict the price of a stock 6 months from now, based on economic data. (Regression)
• Predict whether a patient, hospitalized due to a heart attack, will have a second heart attack. The prediction is to be based on demographic, diet, and clinical measurements for that patient. (Logistic Regression)
• Identify the numbers in a handwritten ZIP code from a digitized image (pixels). (Classification)
What is Learning?
• Learning takes place as a result of interaction between an agent and the world. The idea behind learning is that
  – Percepts received by an agent should be used not only for understanding/interpreting/prediction, as in the machine learning tasks we have addressed so far, but also for acting, and furthermore for improving the agent’s ability to behave optimally in the future to achieve its goal.
RL is learning from interaction
Examples of Reinforcement Learning
• How should a robot behave so as to optimize its “performance”? (Robotics)
• How to automate the motion of a helicopter? (Control Theory)
• How to make a good chess-playing program? (Artificial Intelligence)
Supervised Learning vs Reinforcement Learning
• Supervised learning: learning a function f : X → Y
  – X: inputs; Y: outputs
  – The predicted outputs can be evaluated immediately by the teacher
  – f is evaluated with a loss function
• Reinforcement learning: learning a policy
  – S: states; A: actions
  – Take action A to affect state S
  – The predicted action A cannot be evaluated directly, but can be given positive/negative rewards
  – The policy is evaluated with value functions

Reinforcement Learning
• Reinforcement Learning
  – In the case of the agent acting on its environment, it receives some evaluation of its action (reinforcement), but is not told which action is the correct one to achieve its goal
  – Imagine learning to go back to GHC from this room …
  - Training data: (S, A, R) (State-Action-Reward)
  - Develop an optimal policy (a sequence of decision rules) for the learner so as to maximize its long-term reward
  - Robotics, board-game-playing programs
Elements of RL
• Policy π
  - A map from the state space to the action space
  - May be stochastic
• Reward function R(s)
  - Maps each state (or state-action pair) to a real number, called the reward
• Value function
  - The value of a state (or state-action pair) is the total expected reward, starting from that state (or state-action pair)
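As a concrete illustration, these three elements can be written down directly for a tiny finite problem. This is a minimal sketch in Python; the state names, actions, and numbers are made up for illustration:

```python
# A tiny two-state example (all names and numbers are illustrative).
states = ["cool", "hot"]
actions = ["wait", "work"]

# Policy: a (here deterministic) map from states to actions.
policy = {"cool": "work", "hot": "wait"}

# Reward function R(s): maps each state to a real number.
reward = {"cool": 1.0, "hot": -1.0}

# Value function V(s): total expected (discounted) reward starting from s;
# unknown at first, to be computed by the algorithms later in these slides.
value = {s: 0.0 for s in states}
```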
History of Reinforcement Learning
• Roots in the psychology of animal learning (Thorndike, 1911).
• Another independent thread was the problem of optimal control and its solution using dynamic programming (Bellman, 1957).
• Idea of temporal difference learning (an online method), e.g., for playing board games (Samuel, 1959).
• A major breakthrough was the discovery of Q-learning (Watkins, 1989).
Robot in a room
• What’s the strategy to achieve maximum reward?
• What if the actions were NOT deterministic?
Policy
Reward for each step: -2
Reward for each step: -0.1
Reward for each step: -0.04
The Precise Goal
• To find a policy that maximizes the value function.
• There are different approaches to achieve this goal in various situations.
• Value iteration and policy iteration are two classic approaches to this problem. Essentially, both are dynamic programming.
• Q-learning is a more recent approach to this problem. Essentially, it is a temporal-difference method.
Markov Decision Processes
A Markov decision process is a tuple (S, A, {P_sa}, γ, R), where:
• S is a set of states
• A is a set of actions
• P_sa are the state transition probabilities: for each state s ∈ S and action a ∈ A, P_sa is a distribution over the state space
• γ ∈ [0, 1) is the discount factor
• R : S × A → ℝ (or R : S → ℝ) is the reward function
The dynamics of an MDP
• We start in some state s_0, and get to choose some action a_0 ∈ A
• As a result of our choice, the state of the MDP randomly transitions to some successor state s_1, drawn according to $s_1 \sim P_{s_0 a_0}$
• Then, we get to pick another action a_1
• …
The dynamics of an MDP (cont’d)
• Upon visiting the sequence of states s_0, s_1, … with actions a_0, a_1, …, our total payoff is given by
    $R(s_0, a_0) + \gamma R(s_1, a_1) + \gamma^2 R(s_2, a_2) + \cdots$
• Or, when we are writing rewards as a function of the states only, this becomes
    $R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots$
  – For most of our development, we will use the simpler state rewards R(s), though the generalization to state-action rewards R(s, a) offers no special difficulties.
• Our goal in reinforcement learning is to choose actions over time so as to maximize the expected value of the total payoff:
    $E\big[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots\big]$
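As a quick numerical check of the payoff formula, the discounted sum can be computed in a couple of lines. A minimal Python sketch; the reward values and γ below are made up:

```python
def discounted_return(rewards, gamma):
    """Total payoff R(s0) + gamma*R(s1) + gamma^2*R(s2) + ..."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# Example: rewards 1.0, 0.0, 2.0 with gamma = 0.9
print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))  # 1.0 + 0.0 + 0.81*2.0 = 2.62
```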
Policy
• A policy is any function π : S → A mapping from the states to the actions.
• We say that we are executing some policy π if, whenever we are in state s, we take action a = π(s).
• We also define the value function for a policy π according to
    $V^\pi(s) = E\big[R(s_0) + \gamma R(s_1) + \gamma^2 R(s_2) + \cdots \mid s_0 = s, \pi\big]$
  – V^π(s) is simply the expected sum of discounted rewards upon starting in state s and taking actions according to π.
Value Function
• Given a fixed policy π, its value function V^π satisfies the Bellman equations:
    $V^\pi(s) = \underbrace{R(s)}_{\text{immediate reward}} + \underbrace{\gamma \sum_{s' \in S} P_{s\pi(s)}(s')\, V^\pi(s')}_{\text{expected sum of future discounted rewards}}$
  – Bellman’s equations can be used to efficiently solve for V^π (see later)
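For a finite MDP, the Bellman equations for a fixed policy are |S| linear equations in the |S| unknowns V^π(s), so they can be solved with one linear solve. A minimal sketch, assuming the policy's transition matrix and the rewards are given as NumPy arrays (the array layout and names are illustrative choices):

```python
import numpy as np

def policy_value(P_pi, R, gamma):
    """Solve the Bellman equations V = R + gamma * P_pi @ V for a fixed policy.

    P_pi[s, s'] = P(s' | s, pi(s)) is the transition matrix under the policy;
    R[s] is the immediate reward in state s.
    Rearranging gives the linear system (I - gamma * P_pi) V = R.
    """
    n = len(R)
    return np.linalg.solve(np.eye(n) - gamma * P_pi, R)
```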
The Grid World
Transition model M: 0.8 in the direction you want to go; 0.2 perpendicular (0.1 left, 0.1 right).
Policy: a mapping from states to actions.

Utilities of states in the 4x3 grid world (row 3 at top; (2,2) is a wall; +1 and -1 are terminal states), alongside an optimal policy for the stochastic environment:

  0.812  0.868  0.912   +1
  0.762   ---   0.660   -1
  0.705  0.655  0.611  0.388

• Environment:
  - Observable (accessible): the percept identifies the state
  - Partially observable
• Markov property: transition probabilities depend on the state only, not on the path to the state → Markov decision problem (MDP).
• Partially observable MDP (POMDP): percepts do not have enough information to identify the transition probabilities.
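The stochastic motion model above (0.8 in the intended direction, 0.1 to each perpendicular side) is easy to write down explicitly. A minimal sketch; the compass-direction names are an illustrative choice:

```python
# Intended move succeeds with prob. 0.8; slips to each perpendicular side with 0.1.
LEFT_OF  = {"N": "W", "W": "S", "S": "E", "E": "N"}
RIGHT_OF = {"N": "E", "E": "S", "S": "W", "W": "N"}

def motion_model(action):
    """Return {actual direction: probability} for an intended action."""
    return {action: 0.8, LEFT_OF[action]: 0.1, RIGHT_OF[action]: 0.1}

print(motion_model("N"))  # {'N': 0.8, 'W': 0.1, 'E': 0.1}
```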
Optimal value function

• We define the optimal value function according to
    $V^*(s) = \max_\pi V^\pi(s) \qquad (1)$
  – In other words, this is the best possible expected sum of discounted rewards that can be attained using any policy
• There is a version of Bellman’s equations for the optimal value function:
    $V^*(s) = R(s) + \max_{a \in A} \gamma \sum_{s' \in S} P_{sa}(s')\, V^*(s') \qquad (2)$
  – Why? Intuitively, the immediate reward R(s) does not depend on the action, and the best we can then do is pick the first action optimally and act optimally from the successor state onward.
Optimal policy
• We also define the optimal policy π* : S → A as follows:
    $\pi^*(s) = \arg\max_{a \in A} \sum_{s' \in S} P_{sa}(s')\, V^*(s') \qquad (3)$
• Fact:
  – $V^*(s) = V^{\pi^*}(s) \ge V^\pi(s)$ for every policy π and every state s
  – Policy π* has the interesting property that it is the optimal policy for all states s.
    • It is not the case that if we were starting in some state s then there would be some optimal policy for that state, and if we were starting in some other state s′ then there would be some other policy that is optimal for s′.
    • The same policy π* attains the maximum in (1) for all states s. This means that we can use the same policy no matter what the initial state of our MDP is.
The Basic Setting for Learning
• Training data: n finite-horizon trajectories, of the form
    $\{s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T, a_T, r_T\}$
• Deterministic or stochastic policy: a sequence of decision rules
    $\{\pi_0, \pi_1, \ldots, \pi_T\}$
• Each π_t maps from the observable history (states and actions) to the action space at that time point.
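One way to hold such training data in code is as a list of (state, action, reward) steps per trajectory. A minimal sketch; the record layout and field names are illustrative:

```python
from typing import NamedTuple

class Step(NamedTuple):
    state: int     # s_t
    action: int    # a_t
    reward: float  # r_t

# A trajectory is the sequence (s_0, a_0, r_0), ..., (s_T, a_T, r_T);
# the training data is a list of n such trajectories.
trajectory = [Step(0, 1, 0.0), Step(2, 0, -0.1), Step(3, 1, 1.0)]
dataset = [trajectory]  # n = 1 here, for illustration
```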
Algorithm 1: Value iteration
• Consider only MDPs with finite state and action spaces
• The value iteration algorithm:
  1. For each state s, initialize V(s) := 0.
  2. Repeat until convergence: for every state s, update
     $V(s) := R(s) + \max_{a \in A} \gamma \sum_{s' \in S} P_{sa}(s')\, V(s')$
• It can be shown that value iteration will cause V to converge to V*. Having found V*, we can find the optimal policy using equation (3).
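A minimal NumPy sketch of value iteration, assuming transitions are given as one matrix per action (the array layout, tolerance, and names are illustrative assumptions):

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-8):
    """P[a][s, s'] = P(s' | s, a); R[s] = reward in state s.

    Returns the (approximate) optimal values V* and a greedy policy,
    extracted as in equation (3).
    """
    V = np.zeros(len(R))
    while True:
        # Bellman optimality backup for every state at once
        V_new = R + gamma * np.max([P[a] @ V for a in range(len(P))], axis=0)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    pi = np.argmax([P[a] @ V for a in range(len(P))], axis=0)
    return V, pi
```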
Algorithm 2: Policy iteration
• The policy iteration algorithm:
  1. Initialize π randomly.
  2. Repeat until convergence:
     (a) Let V := V^π (computed, e.g., by solving the Bellman equations).
     (b) For each state s, let $\pi(s) := \arg\max_{a \in A} \sum_{s'} P_{sa}(s')\, V(s')$.
  – The inner loop repeatedly computes the value function for the current policy, and then updates the policy using the current value function (a greedy update).
  – After at most a finite number of iterations of this algorithm, V will converge to V* and π will converge to π*.
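A minimal NumPy sketch of policy iteration in the same illustrative array layout as above, using an exact linear solve for the policy evaluation step:

```python
import numpy as np

def policy_iteration(P, R, gamma):
    """P[a][s, s'] = P(s' | s, a); R[s] = reward in state s."""
    n_actions, n_states = len(P), len(R)
    pi = np.zeros(n_states, dtype=int)  # arbitrary initial policy
    while True:
        # (a) Policy evaluation: solve (I - gamma * P_pi) V = R exactly
        P_pi = np.array([P[pi[s]][s] for s in range(n_states)])
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)
        # (b) Policy improvement: act greedily with respect to V
        pi_new = np.argmax([P[a] @ V for a in range(n_actions)], axis=0)
        if np.array_equal(pi_new, pi):  # converged: policy is stable
            return V, pi
        pi = pi_new
```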
Convergence
• The utility values for selected states at each iteration step in the application of VALUE-ITERATION to the 4x3 world in our example
[Figure: the 4x3 grid world, with the +1 terminal state at (4,3), the -1 terminal state at (4,2), and the start state at (1,1).]
Theorem: As t → ∞, value iteration converges to the exact U, even if updates are done asynchronously and the state i is picked randomly at every step.

Convergence
When to stop value iteration?

Q-learning
• Define the Q-value function:
    $Q(s, a) = R(s) + \gamma \sum_{s' \in S} P_{sa}(s') \max_{a' \in A} Q(s', a')$
  – Q(s, a) is the expected total discounted reward from taking action a in state s and acting optimally thereafter; note that $V^*(s) = \max_a Q(s, a)$.
• Rule to choose the action to take:
    $\pi(s) = \arg\max_{a \in A} Q(s, a)$
Algorithm 3: Q-learning
For each pair (s, a), initialize Q(s, a) (e.g., to zero).
Observe the current state s.
Loop forever {
    Select an action a (optionally with ε-exploration) and execute it.
    Receive the immediate reward r and observe the new state s'.
    Update $Q(s, a) := Q(s, a) + \alpha \big( r + \gamma \max_{a'} Q(s', a') - Q(s, a) \big)$.
    s := s'
}
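A minimal tabular sketch of this loop in Python. It assumes the environment is exposed as a `step(s, a) -> (r, s_next)` function; that interface and the hyperparameter values are illustrative assumptions, not part of the original slides:

```python
import random
from collections import defaultdict

def q_learning(step, actions, s0, gamma=0.9, alpha=0.1, eps=0.1, n_steps=10_000):
    """Tabular Q-learning with epsilon-greedy exploration."""
    Q = defaultdict(float)  # Q[(s, a)], initialized to 0
    s = s0
    for _ in range(n_steps):
        if random.random() < eps:                         # explore
            a = random.choice(actions)
        else:                                             # exploit
            a = max(actions, key=lambda act: Q[(s, act)])
        r, s_next = step(s, a)                            # act in the environment
        td_target = r + gamma * max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] += alpha * (td_target - Q[(s, a)])      # temporal-difference update
        s = s_next
    return Q
```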
Exploration
• Tradeoff between exploitation (control) and exploration (identification)
• Extremes: greedy vs. random acting (n-armed bandit models)
• Q-learning converges to the optimal Q-values if
  – every state is visited infinitely often (due to exploration), and
  – the learning rates are decayed appropriately over time.
A Success Story
• TD-Gammon (Tesauro, G., 1992)
  - A backgammon-playing program
  - An application of temporal difference learning
  - The basic learner is a neural network
  - It trained itself to world-class level by playing against itself and learning from the outcomes. So smart!!
What is special about RL?
• RL is learning how to map states to actions so as to maximize a numerical reward over time.
• Unlike other forms of learning, it is a multi-stage decision-making process (often Markovian).
• An RL agent must learn by trial and error. (Not entirely supervised, but interactive.)
• Actions may affect not only the immediate reward but also subsequent rewards (delayed effect).
Summary
• Both value iteration and policy iteration are standard algorithms for solving MDPs, and there isn’t currently universal agreement over which algorithm is better.
• For small MDPs, policy iteration is often very fast and converges in very few iterations. However, for MDPs with large state spaces, solving for V^π explicitly would involve solving a large system of linear equations and could be difficult; in these problems, value iteration may be preferred. In practice, value iteration seems to be used more often than policy iteration.
• Q-learning is model-free, and it exploits temporal differences.
Types of Learning
• Supervised Learning
  - Training data: (X, Y) (features, label)
  - Predict Y, minimizing some loss
  - Regression, Classification
• Unsupervised Learning
  - Training data: X (features only)
  - Find “similar” points in high-dimensional X-space
  - Clustering
• Reinforcement Learning
  - Training data: (S, A, R) (State-Action-Reward)
  - Develop an optimal policy (a sequence of decision rules) for the learner so as to maximize its long-term reward
  - Robotics, board-game-playing programs