Reinforcement Learning (Model-free RL)
• R&N Chapter 21
[Figure: 4×3 grid world with terminal rewards +1 and -1; transition model T(s,a,s'): intended action a succeeds with probability 0.8, moves to each perpendicular direction with probability 0.1]


• Same (fully observable) MDP as before, except:
– We don't know the model of the environment
– We don't know T(.,.,.)
– We don't know R(.)
• Task is still the same:
– Find an optimal policy

Demos and data contributions from Vivek Mehta ([email protected]) and Rohit Kelkar ([email protected])

General Problem
• All we can do is try to execute actions and record the resulting rewards
– World: You are in state 102, you have a choice of 4 actions
– Robot: I'll take action 2
– World: You get a reward of 1 and you are now in state 63, you have a choice of 3 actions
– Robot: I'll take action 3
– World: You get a reward of -10 and you are now in state 12, you have a choice of 4 actions
– …………..
• Notice we have state-number observability!

Classes of Techniques
Reinforcement Learning:
• Model-Based: try to learn an explicit model of T(.,.,.) and R(.)
• Model-Free: recover an optimal policy without ever estimating a model

Model-Free
• We are not interested in T(.,.,.); we are only interested in the resulting values and policies
• Can we compute something without an explicit model of T(.,.,.)?
• First, let's fix a policy and compute the resulting values

Temporal Differencing
• Upon action a = π(s), the values satisfy:
U(s) = R(s) + γ Σ_s' T(s,a,s') U(s')
• For any successor s' of s, U(s) is "in between":
– the new value considering only s': R(s) + γ U(s')
– and the old value: U(s)

Temporal Differencing
• Upon moving from s to s' by using action a, the new estimate of U(s) is approximated by:
U(s) = (1-α) U(s) + α (R(s) + γ U(s'))
• Temporal Differencing: when moving from any state s to a state s', update:
U(s) ← U(s) + α (R(s) + γ U(s') – U(s))
– U(s) on the right is the current value; R(s) + γ U(s') – U(s) is the discrepancy between the current value and the new guess at the value after moving to s'
– The transition probabilities do not appear anywhere!

Temporal Differencing
U(s) ← U(s) + α (R(s) + γ U(s') – U(s))   (α: learning rate, γ: discount factor)
How to choose the learning rate 0 < α < 1?
• Too small: converges slowly; tends to always trust the current estimate of U
• Too large: changes very quickly; tends to always replace the current estimate by the new guess
• Start with a large α ⇒ we are not confident in our current estimate, so we can change it a lot
• Decrease α as we explore more ⇒ we are more and more confident in our estimate, so we don't want to change it a lot
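To make the update concrete, here is a minimal sketch of TD evaluation of a fixed policy with a constant learning rate. The environment interface (reset/step) and the dictionary representation of U are illustrative assumptions, not part of the slides.

```python
from collections import defaultdict

def td_policy_evaluation(env, policy, gamma=0.9, alpha=0.1, episodes=1000):
    """Estimate U(s) for a fixed policy by temporal differencing.

    Assumes `env` exposes reset() -> s and step(a) -> (s_next, r, done),
    where r plays the role of R(s). No transition model T(s,a,s') is used.
    """
    U = defaultdict(float)                       # U(s) initialized to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # U(s) <- U(s) + alpha * (R(s) + gamma * U(s') - U(s))
            U[s] += alpha * (r + gamma * U[s_next] - U[s])
            s = s_next
    return U
```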


Temporal Differencing
Technical conditions on the learning rate α(t):
• Σ_t α(t) = ∞  (α does not decrease too quickly)
• Σ_t α²(t) converges  (but α does decrease fast enough)
Example: α = K/(K+t)

Summary
• Learning = exploring the environment and recording received rewards
• Model-Based techniques
– Evaluate transition probabilities and apply previous MDP techniques to find values and policies
– More efficient: single value update at each state
– Selection of "interesting" states to update: prioritized sweeping
• Exploration strategies
• Model-Free techniques (so far)
– Temporal update to estimate values without ever estimating the transition model
– Parameter α must decay over iterations
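A tiny sketch of the decaying schedule α(t) = K/(K+t) mentioned above; the value of K is an arbitrary illustrative choice.

```python
def learning_rate(t, K=50.0):
    """Decaying schedule alpha(t) = K / (K + t).

    Meets the conditions above: the sum of alpha(t) diverges (harmonic-like),
    while the sum of alpha(t)**2 converges.
    """
    return K / (K + t)

# alpha shrinks from ~1 toward 0 as experience accumulates.
print([round(learning_rate(t), 3) for t in (0, 10, 100, 1000)])
```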


Temporal Differencing
U(s) ← U(s) + α (R(s) + γ U(s') – U(s))
The transition probabilities do not appear anywhere!
But how to find the optimal policy?

Q-Learning
• U(s) = Utility of state s = expected sum of future discounted rewards
• Q(s,a) = Value of taking action a at state s = expected sum of future discounted rewards after taking action a at state s

Q-Learning
• (s,a) = "state-action" pair
• Maintain a table of Q(s,a) instead of U(s)
• For the optimal Q*:
Q*(s,a) = R(s) + γ Σ_s' T(s,a,s') max_a' Q*(s',a')
π*(s) = argmax_a Q*(s,a)

Q-Learning
Q*(s,a) = R(s) + γ Σ_s' T(s,a,s') max_a' Q*(s',a')
– Q*(s,a): best expected value for the state-action pair (s,a)
– Σ_s' T(s,a,s') max_a' Q*(s',a'): best value averaged over all possible states s' that can be reached from s after executing action a
π*(s) = argmax_a Q*(s,a)

Q-Learning: Updating Q without a Model
After moving from state s to state s' using action a:
Q(s,a) ← Q(s,a) + α (R(s) + γ max_a' Q(s',a') – Q(s,a))
– Q(s,a) on the right: old estimate of Q(s,a)
– R(s): reward at state s
– max_a' Q(s',a'): best value at the next state = maximum over all actions that could be executed at the next state s'
– R(s) + γ max_a' Q(s',a') – Q(s,a): difference between the old estimate and the new guess after taking action a
– α: learning rate, 0 < α < 1
– Q(s,a) on the left: new estimate of Q(s,a)

Q-Learning: Estimating the policy
Q-Update: after moving from state s to state s' using action a:
Q(s,a) ← Q(s,a) + α (R(s) + γ max_a' Q(s',a') – Q(s,a))
Policy estimation:
π(s) = argmax_a Q(s,a)
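Here is a minimal sketch of the Q-update and policy extraction just described. The dictionary table and the `actions` list are illustrative assumptions; only the update rule itself comes from the slides.

```python
from collections import defaultdict

# Q(s,a) initialized to 0 for every state-action pair encountered.
Q = defaultdict(float)

def q_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One Q-learning backup after moving from s to s_next via action a.

    Q maps (state, action) -> value; r plays the role of R(s).
    """
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def greedy_policy(Q, s, actions):
    """Policy estimation: pi(s) = argmax_a Q(s,a)."""
    return max(actions, key=lambda a: Q[(s, a)])
```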

Q-Learning: Estimating the policy
Key point: we do not use T(.,.,.) anywhere ⇒ we can compute optimal values and policies without ever using a model of the MDP!

Q-Learning: Convergence
• Q-learning is guaranteed to converge to an optimal policy (Watkins)
• Very general procedure (because completely model-free)
• May be slow (because completely model-free)

Q-Learning: Exploration Strategies
• How to choose the next action while we're learning?
– Random
– Greedy: always choose the estimated best action π(s)
– ε-Greedy: choose the estimated best action with probability 1-ε, else a random action
– Boltzmann: choose action a with probability proportional to e^(Q(s,a)/T)
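A minimal sketch of the ε-greedy and Boltzmann choices. Q is assumed to be a dict keyed by (state, action) and `actions` a list of the legal actions; both names are illustrative.

```python
import math
import random

def epsilon_greedy(Q, s, actions, eps=0.1):
    """With probability 1 - eps pick the estimated best action, else explore."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def boltzmann(Q, s, actions, T=1.0):
    """Pick action a with probability proportional to exp(Q(s,a)/T)."""
    weights = [math.exp(Q.get((s, a), 0.0) / T) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]
```

Lower temperatures T make the Boltzmann choice closer to greedy; higher temperatures make it closer to uniformly random.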

Evaluation
• How to measure how well the learning procedure is doing?
• U(s) = value estimated at s at the current learning iteration
• U*(s) = optimal value if we knew everything about the environment
• Error = |U – U*|

Constant Learning Rate
[Plots: learning curves for α = 0.001 and α = 0.1]

Decaying Learning Rate
α = K/(K + iteration #)
[Plot: data from Rohit & Vivek, 2005]

Changing Environments
[Plot: data from Rohit & Vivek, 2005]

Adaptive Learning Rate
[Plot: data from Rohit & Vivek, 2005]

Example: Pushing Robot
• Task: learn how to push boxes around
• States: sensor readings
• Actions: move forward, turn
Example from Mahadevan and Connell, "Automatic Programming of Behavior-based Robots using Reinforcement Learning," Proceedings AAAI 1991.

Example: Pushing Robot
• State = 1 bit for each of the NEAR and FAR gates × 8 sonar sensors + 1 bit for BUMP + 1 bit for STUCK = 18 bits
• Actions = move forward, or turn ±22°, or turn ±45° = 5 actions
[Figure: sonar NEAR/FAR gates, BUMP and STUCK bits]

Learn How to Find the Boxes
• A box is found when the NEAR bits are on for all the front sonars.
• Reward:
R(s) = +3 if the NEAR bits are on
R(s) = -1 if the NEAR bits are off

Example from Mahadevan and Connell, "Automatic Programming of Behavior-based Robots using Reinforcement Learning," Proceedings AAAI 1991.

Learn How to Push the Box
• Try to maintain contact with the box while moving forward
• Reward:
R(s) = +1 if BUMP while moving forward
R(s) = -3 if the robot loses contact

Learn How to Get Unwedged
• The robot may get wedged against walls, in which case the STUCK bit is raised.
• Reward:
R(s) = +1 if STUCK is 0
R(s) = -3 if STUCK is 1

Q-Learning
• Initialize Q(s,a) to 0 for all state-action pairs
• Repeat:
– Observe the current state s
– 90% of the time, choose the action a that maximizes Q(s,a); else choose a random action a
– Update Q(s,a)
• Improvement: also update all the states s' that are "similar" to s.
In this case: similarity between s and s' is measured by the Hamming distance between their bit strings.
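A sketch of this loop for bit-string states. The environment interface (`observe_state`, `execute`), the action names, and the exact way the update is spread to similar states (applying the same TD target to neighbors within a small Hamming distance) are illustrative assumptions; the slides only state that similar states are also updated.

```python
import random
from collections import defaultdict

ACTIONS = ["forward", "left22", "right22", "left45", "right45"]   # illustrative names

def hamming(s1, s2):
    """Hamming distance between two equal-length bit strings."""
    return sum(b1 != b2 for b1, b2 in zip(s1, s2))

def q_learning(env, steps=10000, alpha=0.1, gamma=0.9, spread=2):
    Q = defaultdict(float)          # Q(s,a) initialized to 0 for all pairs
    seen = set()                    # states visited so far
    s = env.observe_state()         # e.g. an 18-bit string in the robot example
    for _ in range(steps):
        # 90% of the time choose the action maximizing Q(s,a), else a random one.
        if random.random() < 0.9:
            a = max(ACTIONS, key=lambda act: Q[(s, act)])
        else:
            a = random.choice(ACTIONS)
        s_next, r = env.execute(a)  # assumed to return (next state, reward)
        target = r + gamma * max(Q[(s_next, act)] for act in ACTIONS)
        Q[(s, a)] += alpha * (target - Q[(s, a)])
        # Improvement: also update states "similar" to s (small Hamming distance).
        for s_sim in seen:
            if s_sim != s and hamming(s_sim, s) <= spread:
                Q[(s_sim, a)] += alpha * (target - Q[(s_sim, a)])
        seen.add(s)
        s = s_next
    return Q
```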

Performance
[Plot: learning curves for a hand-coded controller, Q-learning (2 different versions of similarity), and a random agent]

Generalization
• In real problems: too many states (or state-action pairs) to store in a table
• Example: ⇒ 10^20 states!
• Need to:
– Store U for a subset of states {s1, ..., sK}
– Generalize to compute U(s) for any other state s

Generalization
[Plot: value U(s) vs. states, with sample values at s1, s2, ...]
• We have sample values of U for some of the states s1, s2, ...
• We interpolate a function f(.) such that for any query state s_n, f(s_n) approximates U(s_n)

Generalization
• Possible function approximators:
– Neural networks
– Memory-based methods
– Decision trees
– Clustering
– Hierarchical representations
– ... and many other solutions for representing U over large state spaces
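As one concrete instance of the memory-based idea, here is a sketch of distance-weighted nearest-neighbor interpolation of U from stored samples; the feature vectors, distance function, and k are illustrative assumptions.

```python
import math

def interpolate_value(query, samples, k=3):
    """Approximate U(query) from stored (features, value) samples.

    `samples` is a list of (x, u) pairs: x is a feature vector for a stored
    state, u its estimated value. The distance-weighted average of the k
    nearest stored states serves as f(query) ~ U(query).
    """
    def dist(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    nearest = sorted(samples, key=lambda p: dist(p[0], query))[:k]
    weights = [1.0 / (dist(x, query) + 1e-6) for x, _ in nearest]
    return sum(w * u for w, (_, u) in zip(weights, nearest)) / sum(weights)

# Usage: estimate U at an unseen state from four stored samples.
samples = [((0.0, 0.0), 1.0), ((1.0, 0.0), 0.5), ((0.0, 1.0), -0.2), ((1.0, 1.0), 0.1)]
print(interpolate_value((0.4, 0.7), samples))
```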

Example: Backgammon
• States: number of red and white checkers at each location
⇒ on the order of 10^20 states!
⇒ branching factor prevents direct search
• Actions: set of legal moves from any state

Example: Backgammon
• Represent the mapping from states to expected outcomes by a multilayer neural net
• Run a large number of "training games"
– For each state s in a training game: update using temporal differencing
– At every step of the game: choose the best move according to the current estimate of U
• Initially: random moves
• After learning: converges to a good selection of moves

Example from: G. Tesauro. Temporal Difference Learning and TD-Gammon. Communications of the ACM, 1995.
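A heavily simplified sketch of such a self-play TD training loop. The value-network object `V` (with predict/train_on) and the `game` interface (initial, legal_moves, apply, over, outcome) are illustrative assumptions, not TD-Gammon's actual code, which also uses TD(λ) and dice-dependent move generation.

```python
def self_play_td_training(V, game, num_games=200000, alpha=0.1):
    """Self-play TD training in the style described above.

    V is an assumed value-function approximator with predict(s) and
    train_on(s, target); game is an assumed interface with initial(),
    legal_moves(s), apply(s, m), over(s), and outcome(s).
    """
    for _ in range(num_games):
        s = game.initial()
        while not game.over(s):
            # Choose the move whose successor has the best estimated value.
            s_next = max((game.apply(s, m) for m in game.legal_moves(s)),
                         key=V.predict)
            # TD target: the true outcome at the end of the game,
            # otherwise the bootstrapped value of the successor.
            target = game.outcome(s_next) if game.over(s_next) else V.predict(s_next)
            # Nudge V(s) a step of size alpha toward the target.
            V.train_on(s, V.predict(s) + alpha * (target - V.predict(s)))
            s = s_next
    return V
```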

Performance
• Can learn starting with no knowledge at all!
• Example: 200,000 training games with 40 hidden units
• Enhancements use better encoding and additional hand-designed features
• Example:
– 1,500,000 training games
– 80 hidden units
– -1 pt/40 games (against a world-class opponent)

Example: Control and Robotics
• Devil-stick juggling (Schaal and Atkeson): non-linear control at 200 ms per decision. The program learns to keep juggling after ~40 trials; a human requires 10 times more practice.
• Helicopter control: control of a helicopter for specific flight patterns, learning policies from a simulator. Learns policies for control patterns that are difficult even for human experts (e.g., inverted flight). http://heli.stanford.edu/

Summary
• Certainty equivalent learning for estimating future rewards
• Exploration strategies
• One-backup update, prioritized sweeping
• Model-free (Temporal Differencing = TD) estimation of future rewards
• Q-Learning for model-free estimation of future rewards and the optimal policy
• Exploration strategies and selection of actions

(Some) References
• R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press.
• L. Kaelbling, M. Littman and A. Moore. Reinforcement Learning: A Survey. Journal of Artificial Intelligence Research, Volume 4, 1996.
• G. Tesauro. TD-Gammon, a self-teaching backgammon program, achieves master-level play. Neural Computation 6(2), 1995.
• http://ai.stanford.edu/~ang/
• http://www-all.cs.umass.edu/rlr/
