
CS 188: Artificial Intelligence
Reinforcement Learning (RL)
Pieter Abbeel – UC Berkeley
Many slides over the course adapted from Dan Klein, Stuart Russell, Andrew Moore

MDPs and RL Outline
§ Markov Decision Processes (MDPs)
  § Formalism
  § Planning
    § Value iteration
    § Policy Evaluation and Policy Iteration
§ Reinforcement Learning --- MDP with T and/or R unknown
  § Model-based Learning
  § Model-free Learning
    § Direct Evaluation [performs policy evaluation]
    § Temporal Difference Learning [performs policy evaluation]
    § Q-Learning [learns optimal state-action value function Q*]
    § Policy search [learns optimal policy from subset of all policies]
  § Exploration vs. exploitation
  § Large state spaces: feature-based representations

Reinforcement Learning
§ Basic idea:
  § Receive feedback in the form of rewards
  § Agent's utility is defined by the reward function
  § Must (learn to) act so as to maximize expected rewards

Example: learning to walk [DEMOS on next slides]
§ Before learning (hand-tuned)
§ One of many learning runs
§ After learning [after 1000 field traversals]
§ [Kohl and Stone, ICRA 2004]

Example: snake robot [Andrew Ng]

Example: Toddler [Tedrake, Zhang and Seung, 2005]

In your project

Reinforcement Learning
§ Still assume a Markov decision process (MDP):
  § A set of states s ∈ S
  § A set of actions (per state) A
  § A model T(s,a,s')
  § A reward function R(s,a,s')
§ Still looking for a policy π(s)
§ New twist: don't know T or R
  § I.e. don't know which states are good or what the actions do
  § Must actually try actions and states out to learn

Example to Illustrate Model-Based vs. Model-Free: Expected Age
§ Goal: compute the expected age of cs188 students
§ Known P(A): E[A] = Σ_a P(a) · a
§ Without P(A), instead collect samples [a1, a2, … aN]
§ Unknown P(A), "Model Based": estimate P̂(a) = num(a)/N, then E[A] ≈ Σ_a P̂(a) · a
§ Unknown P(A), "Model Free": E[A] ≈ (1/N) Σ_i a_i

Model-Based Learning
§ Idea:
  § Step 1: Learn the model empirically through experience
  § Step 2: Solve for values as if the learned model were correct
§ Step 1: Simple empirical model learning
  § Count outcomes for each s,a
  § Normalize to give an estimate of T(s,a,s')
  § Discover R(s,a,s') when we experience (s,a,s')
§ Step 2: Solving the MDP with the learned model
  § Value iteration, or policy iteration

Example: Learning the Model in Model-Based Learning
§ [Gridworld figure: exit at (4,3) gives +100, exit at (4,2) gives -100], γ = 1, fixed policy π
§ Episodes (each step is state, action, reward):
  § Episode 1: (1,1) up -1; (1,2) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (3,3) right -1; (4,3) exit +100; (done)
  § Episode 2: (1,1) up -1; (1,2) up -1; (1,3) right -1; (2,3) right -1; (3,3) right -1; (3,2) up -1; (4,2) exit -100; (done)
§ Learned transition estimates:
  § T(<3,3>, right, <4,3>) = 1 / 3
  § T(<2,3>, right, <3,3>) = 2 / 2
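To make Step 1 concrete, here is a minimal sketch in Python (not from the slides; the (s, a, s', r) encoding of the episodes and names such as episode1, counts, and T_hat are my own) that counts outcomes for each (s, a), normalizes to estimate T, and records R as it is observed.

```python
from collections import defaultdict

# The two gridworld episodes above, re-encoded as (s, a, s', r) transitions;
# 'exit' actions lead to a terminal 'done' pseudo-state.
episode1 = [((1,1),'up',(1,2),-1), ((1,2),'up',(1,2),-1), ((1,2),'up',(1,3),-1),
            ((1,3),'right',(2,3),-1), ((2,3),'right',(3,3),-1),
            ((3,3),'right',(3,2),-1), ((3,2),'up',(3,3),-1),
            ((3,3),'right',(4,3),-1), ((4,3),'exit','done',100)]
episode2 = [((1,1),'up',(1,2),-1), ((1,2),'up',(1,3),-1),
            ((1,3),'right',(2,3),-1), ((2,3),'right',(3,3),-1),
            ((3,3),'right',(3,2),-1), ((3,2),'up',(4,2),-1),
            ((4,2),'exit','done',-100)]

# Step 1: count outcomes for each (s, a), and record rewards as they are seen.
counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = N(s, a, s')
R_hat = {}                                       # R_hat[(s, a, s')] = observed reward

for s, a, s2, r in episode1 + episode2:
    counts[(s, a)][s2] += 1
    R_hat[(s, a, s2)] = r

def T_hat(s, a, s2):
    """Normalize the outcome counts to estimate T(s, a, s')."""
    total = sum(counts[(s, a)].values())
    return counts[(s, a)][s2] / total

print(T_hat((3,3), 'right', (4,3)))   # 1/3, matching the slide
print(T_hat((2,3), 'right', (3,3)))   # 2/2 = 1.0
```

Step 2 would then run value iteration or policy iteration on T_hat and R_hat exactly as if the MDP were known.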
Learning the Model in Model-Based Learning
§ Estimate P(x) from samples
  § Samples: x1, x2, …, xN (drawn from P)
  § Estimate: P̂(x) = num(x) / N
§ Estimate P(s' | s, a) from samples
  § Samples: successor states s1', s2', …, sN' observed after taking action a in state s
  § Estimate: T̂(s, a, s') = num(s') / N
§ Why does this work? Because samples appear with the right frequencies!

Model-based vs. Model-free
§ Model-based RL
  § First act in the MDP and learn T, R
  § Then value iteration or policy iteration with the learned T, R
  § Advantage: efficient use of data
  § Disadvantage: requires building a model for T, R
§ Model-free RL
  § Bypass the need to learn T, R
  § Methods to evaluate V^π, the value function for a fixed policy π, without knowing T, R:
    § (i) Direct Evaluation
    § (ii) Temporal Difference Learning
  § Method to learn π*, Q*, V* without knowing T, R:
    § (iii) Q-Learning

Direct Evaluation
§ Repeatedly execute the policy π
§ Estimate the value of state s as the average, over all times s was visited, of the sum of discounted rewards accumulated from state s onwards

Example: Direct Evaluation
§ Same gridworld episodes as above, γ = 1, R = -1
§ V(2,3) ≈ (96 + -103) / 2 = -3.5
§ V(3,3) ≈ (99 + 97 + -102) / 3 ≈ 31.3

Model-Free Learning
§ Want to compute an expectation weighted by P(x): E[f(x)] = Σ_x P(x) f(x)
§ Model-based: estimate P̂(x) from samples, then compute the expectation E[f(x)] ≈ Σ_x P̂(x) f(x)
§ Model-free: estimate the expectation directly from samples: E[f(x)] ≈ (1/N) Σ_i f(x_i), with x_i drawn from P(x)
§ Why does this work? Because samples appear with the right frequencies!

Limitations of Direct Evaluation
§ Assume a random initial state
§ Assume the value of state (1,2) is known perfectly based on past runs
§ Now, for the first time, we encounter (1,1) --- can we do better than estimating V(1,1) as the reward outcome of that single run?

Sample-Based Policy Evaluation?
§ Who needs T and R? Approximate the expectation with samples of s' (drawn from T!):
  § sample_i = R(s, π(s), s_i') + γ V^π_k(s_i')
  § V^π_{k+1}(s) ← (1/n) Σ_i sample_i
§ Almost! But we can't rewind time to get sample after sample from state s.

Temporal-Difference Learning
§ Big idea: learn from every experience!
  § Update V(s) each time we experience (s, a, s', r)
  § Likely s' will contribute updates more often
§ Temporal difference learning
  § Policy still fixed!
  § Move values toward the value of whatever successor occurs: running average!
  § Sample of V(s): sample = R(s, π(s), s') + γ V^π(s')
  § Update to V(s): V^π(s) ← (1-α) V^π(s) + α · sample
  § Same update: V^π(s) ← V^π(s) + α (sample - V^π(s))
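As an illustrative sketch only, the snippet below applies both policy-evaluation ideas above to the same two recorded episodes, re-encoded as (state, reward) sequences: direct evaluation averages the observed returns over every visit to a state, and the TD update nudges V(s) toward each sample with learning rate α. The encoding, function names, and the choice to replay the fixed log several times for the TD variant (online TD would use each experience once, as it happens) are my assumptions, not anything prescribed by the slides.

```python
from collections import defaultdict

# The two episodes above, as (state, reward received on that step) sequences.
episode1 = [((1,1),-1), ((1,2),-1), ((1,2),-1), ((1,3),-1), ((2,3),-1),
            ((3,3),-1), ((3,2),-1), ((3,3),-1), ((4,3),100)]
episode2 = [((1,1),-1), ((1,2),-1), ((1,3),-1), ((2,3),-1),
            ((3,3),-1), ((3,2),-1), ((4,2),-100)]

def direct_evaluation(episodes, gamma=1.0):
    """V(s) = average, over every visit to s, of the discounted return from s onward."""
    returns = defaultdict(list)
    for ep in episodes:
        G = 0.0
        for s, r in reversed(ep):          # accumulate returns back to front
            G = r + gamma * G
            returns[s].append(G)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

def td_policy_evaluation(episodes, gamma=1.0, alpha=0.1, sweeps=200):
    """Apply the TD update V(s) <- (1-alpha) V(s) + alpha [r + gamma V(s')],
    replaying the recorded transitions; terminal successors are worth 0."""
    V = defaultdict(float)
    for _ in range(sweeps):
        for ep in episodes:
            for t, (s, r) in enumerate(ep):
                v_next = V[ep[t + 1][0]] if t + 1 < len(ep) else 0.0
                V[s] = (1 - alpha) * V[s] + alpha * (r + gamma * v_next)
    return dict(V)

V = direct_evaluation([episode1, episode2])
print(V[(2, 3)], V[(3, 3)])   # -3.5 and approximately 31.3, matching the slide
```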
Exponential Moving Average
§ Exponential moving average: x̄_n = (1-α) · x̄_{n-1} + α · x_n
§ Makes recent samples more important
§ Forgets about the past (distant past values were wrong anyway)
§ Easy to compute from the running average
§ Decreasing learning rate can give converging averages

Policy Evaluation when T (and R) unknown --- recap
§ Model-based:
  § Learn the model empirically through experience
  § Solve for values as if the learned model were correct
§ Model-free:
  § Direct evaluation:
    § V(s) = sample estimate of the sum of rewards accumulated from state s onwards
  § Temporal difference (TD) value learning:
    § Move values toward the value of whatever successor occurs: running average!

Problems with TD Value Learning
§ TD value learning is a model-free way to do policy evaluation
§ However, if we want to turn values into a (new) policy, we're sunk:
  § π(s) = argmax_a Q(s,a)
  § Q(s,a) = Σ_s' T(s,a,s') [ R(s,a,s') + γ V(s') ]
§ Idea: learn Q-values directly
§ Makes action selection model-free too!

Active RL
§ Full reinforcement learning
  § You don't know the transitions T(s,a,s')
  § You don't know the rewards R(s,a,s')
  § You can choose any actions you like
  § Goal: learn the optimal policy / values
    § … what value iteration did!
§ In this case:
  § Learner makes choices!
  § Fundamental tradeoff: exploration vs. exploitation
  § This is NOT offline planning! You actually take actions in the world and find out what happens…

Detour: Q-Value Iteration
§ Value iteration: find successive approximations of the optimal values
  § Start with V*_0(s) = 0, which we know is right (why?)
  § Given V*_i, calculate the values for all states for depth i+1:
    V*_{i+1}(s) ← max_a Σ_s' T(s,a,s') [ R(s,a,s') + γ V*_i(s') ]
§ But Q-values are more useful!
  § Start with Q*_0(s,a) = 0, which we know is right (why?)
  § Given Q*_i, calculate the q-values for all q-states for depth i+1:
    Q*_{i+1}(s,a) ← Σ_s' T(s,a,s') [ R(s,a,s') + γ max_a' Q*_i(s',a') ]

Q-Learning
§ Q-Learning: sample-based Q-value iteration
§ Learn Q*(s,a) values
  § Receive a sample (s, a, s', r)
  § Consider your old estimate: Q(s,a)
  § Consider your new sample estimate: sample = R(s,a,s') + γ max_a' Q(s',a')
  § Incorporate the new estimate into a running average: Q(s,a) ← (1-α) Q(s,a) + α · sample

Q-Learning Properties
§ Amazing result: Q-learning converges to the optimal policy
  § If you explore enough
  § If you make the learning rate small enough
  § … but not decrease it too quickly!
  § Basically doesn't matter how you select actions (!)
§ Neat property: off-policy learning
  § Learn the optimal policy without following it

Exploration / Exploitation
§ Several schemes for forcing exploration
  § Simplest: random actions (ε-greedy)
    § Every time step, flip a coin
    § With probability ε, act randomly
    § With probability 1-ε, act according to the current policy
§ Problems with random actions?
  § You do explore the space, but keep thrashing around once learning is done
  § One solution: lower ε over time
  § Another solution: exploration functions
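Here is a minimal sketch of the tabular Q-learning update together with ε-greedy action selection. The dictionary layout, hyperparameter values, and helper names are mine; the slides only specify the update rule itself.

```python
import random
from collections import defaultdict

Q = defaultdict(float)            # Q[(s, a)], implicitly 0 for unseen pairs

def epsilon_greedy(s, actions, epsilon=0.1):
    """With probability epsilon act randomly; otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def q_update(s, a, r, s2, next_actions, alpha=0.5, gamma=1.0):
    """Q-learning: fold the new sample estimate into a running average,
    Q(s,a) <- (1-alpha) Q(s,a) + alpha [r + gamma max_a' Q(s',a')]."""
    sample = r + gamma * max((Q[(s2, a2)] for a2 in next_actions), default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample

# One observed sample (s, a, s', r) from the gridworld traces above:
q_update(s=(3, 3), a='right', r=-1, s2=(4, 3), next_actions=['exit'])
print(Q[((3, 3), 'right')])       # -0.5 after this single update (Q started at 0)
```

Because the target uses max over a' rather than the action the agent actually takes next, the values move toward Q* even while behavior follows an exploratory policy such as ε-greedy, which is the off-policy property noted above.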
Exploration Functions
§ When to explore
  § Random actions: explore a fixed amount
  § Better idea: explore areas whose badness is not (yet) established
§ Exploration function
  § Takes a value estimate and a count, and returns an optimistic utility (exact form not important)
§ The Q-learning update
  Q_{i+1}(s,a) ← (1-α) Q_i(s,a) + α [ R(s,a,s') + γ max_a' Q_i(s',a') ]
  now becomes:
  Q_{i+1}(s,a) ← (1-α) Q_i(s,a) + α [ R(s,a,s') + γ max_a' f(Q_i(s',a'), N(s',a')) ]

Q-Learning
§ Q-learning produces tables of q-values [gridworld q-value figure]

The Story So Far: MDPs and RL
§ Things we know how to do, and the techniques that do them:
  § If we know the MDP (model-based DPs):
    § Compute V*, Q*, π* exactly --- Value Iteration
    § Evaluate a fixed policy π --- Policy Evaluation
  § If we don't know the MDP:
    § We can estimate the MDP, then solve it --- Model-based RL
    § We can estimate V for a fixed policy π --- Model-free RL: value learning
    § We can estimate Q*(s,a) for the optimal policy while executing an exploration policy --- Q-Learning
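Returning to the exploration-function update above: as a sketch, suppose f(u, n) = u + k/n for some constant k (one simple choice of optimistic utility; the slides leave the exact form open). Visit counts N(s,a), the constant k, and the +1 in the denominator that avoids division by zero are all assumptions of mine.

```python
from collections import defaultdict

Q = defaultdict(float)   # Q[(s, a)] value estimates
N = defaultdict(int)     # N[(s, a)] visit counts

def f(u, n, k=2.0):
    """Optimistic utility: value estimate plus a bonus that shrinks with visits."""
    return u + k / (n + 1)           # +1 is my tweak to handle unvisited pairs

def q_update_with_exploration(s, a, r, s2, next_actions, alpha=0.5, gamma=1.0):
    """Q(s,a) <- (1-alpha) Q(s,a) + alpha [r + gamma max_a' f(Q(s',a'), N(s',a'))]."""
    N[(s, a)] += 1
    sample = r + gamma * max((f(Q[(s2, a2)], N[(s2, a2)]) for a2 in next_actions),
                             default=0.0)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample

q_update_with_exploration(s=(3, 3), a='right', r=-1, s2=(4, 3), next_actions=['exit'])
print(Q[((3, 3), 'right')])   # 0.5: the never-tried exit at (4,3) got an optimism bonus of k
```

Rarely tried state-action pairs then look artificially good, so the greedy policy is drawn toward them until their counts grow, replacing undirected ε-random exploration with exploration aimed at poorly established estimates.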