Reinforcement Learning Lecture: Inverse Reinforcement Learning

Inverse RL, behaviour cloning, apprenticeship learning, imitation learning.
Vien Ngo, Machine Learning & Robotics Lab (MLR), University of Stuttgart

Outline
• Introduction to inverse RL
• Inverse RL vs. behavioural cloning
• IRL algorithms
(Inspired by a lecture of Pieter Abbeel.)

Inverse RL: Informal Definition
• Given: measurements of an agent's behaviour π over time, i.e. triples (s_t, a_t, s'_t), in different circumstances; if possible, the transition model is also given. The reward function is not given.
• Goal: find the reward function R(s, a, s').

Exercise: A Small Maze Domain¹
(Advanced Robotics, Machine Learning & Robotics Lab, University of Stuttgart, Universitätsstraße 38, 70569 Stuttgart, Germany; June 17, 2014)
Given a 2 × 4 maze as in the figure, i.e. an MDP\R = {S, A, P} (an MDP without a reward function), where S is a state space of 8 states, A is an action space of four movement actions {move-left, move-right, move-down, move-up}, the optimal policy π* (the arrows in the figure), and the transition function T(s, a, s') = P(s' | s, a).
[Figure: 2 × 4 grid maze with a goal cell; arrows indicate the optimal policy π*.]
• Given γ = 0.95, compute (I − γP^{π*})^{-1}.
• Then write the full formulation of the LP problem from the "IRL as linear programming" slide below.
• Bonus: use the fminimax function in Matlab (or another LP library) to find R*; a Python sketch using scipy appears after that slide.
¹ Inspired by a poster of Boularias, Kober, Peters.

Motivation: Two Sources
• The potential use of RL and related methods as computational models for animal and human learning: bee foraging (Montague et al., 1995), song-bird vocalization (Doya & Sejnowski, 1995), ...
• The construction of an intelligent agent for a particular domain: car driving, helicopter flight (Ng et al.), ... (imitation learning, apprenticeship learning).

Examples
• Car driving simulation (Abbeel et al., 2004, etc.)
• Autonomous helicopter flight (Andrew Ng et al.)
• Urban navigation: route recommendation and destination prediction (Ziebart, Maas, Bagnell and Dey, AAAI 2008)
• etc.

Problem Formulation
• Given:
  – State space S, action space A.
  – Transition model T(s, a, s') = P(s' | s, a).
  – No reward function R(s, a, s').
  – The teacher's demonstrations (from the teacher's policy π*): s_0, a_0, s_1, a_1, ...
• IRL:
  – Recover R.
• Apprenticeship learning via IRL:
  – Use the recovered R to compute a good policy.
• Behaviour cloning:
  – Use supervised learning to learn the teacher's policy directly.

IRL vs. Behavioural Cloning
• Behavioural cloning is formulated as a supervised-learning problem (using SVMs, neural networks, deep learning, ...):
  – Given (s_0, a_0), (s_1, a_1), ..., generated from a policy π*.
  – Estimate a policy mapping s to a.
• Behavioural cloning can only mimic the teacher's trajectories; it cannot cope with a change of goal/destination or with non-Markovian environments (e.g. car driving).
• In short, IRL vs. behavioural cloning amounts to estimating R̂* vs. estimating π̂*.
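To make the contrast concrete, the following is a minimal behavioural-cloning sketch in Python (not from the lecture). It treats the demonstrations purely as (state, action) training pairs and fits a classifier from states to actions; the toy 2 × 4 grid encoding, the action labels, and the choice of scikit-learn's LogisticRegression are illustrative assumptions.

```python
# Behavioural cloning = supervised learning: fit a classifier from states to the
# teacher's actions, ignoring rewards and transition dynamics entirely.
# All data and encodings below are assumed for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Assumed toy demonstration in a 2 x 4 grid (states = (x, y) coordinates):
# the teacher walks right along the bottom row, then moves up at the far end.
demo_states = np.array([[0, 0], [1, 0], [2, 0], [3, 0]], dtype=float)
demo_actions = np.array([1, 1, 1, 3])  # assumed encoding: 1 = move-right, 3 = move-up

# Any supervised learner would do (SVMs, neural networks, ...); here a simple
# logistic-regression policy mapping s to a.
policy = LogisticRegression(max_iter=1000)
policy.fit(demo_states, demo_actions)

# The cloned policy only reproduces the mapping seen in the data; it has no
# notion of the teacher's goal, so it cannot adapt if the goal or dynamics change.
print(policy.predict(np.array([[1.5, 0.0]])))  # expected to imitate "move-right"
```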
Inverse Reinforcement Learning

IRL: Mathematical Formulation
• Given:
  – State space S, action space A.
  – Transition model T(s, a, s') = P(s' | s, a).
  – No reward function R(s, a, s').
  – The teacher's demonstration (from the teacher's policy π*): s_0, a_0, s_1, a_1, ...
• Find R* such that
  E[ Σ_{t=0}^∞ γ^t R*(s_t) | π* ] ≥ E[ Σ_{t=0}^∞ γ^t R*(s_t) | π ]   for all π.
• Challenges:
  – R = 0 is a solution (reward-function ambiguity), and multiple R* satisfy the condition above.
  – π* is only given partially, through trajectories; how can the expectation terms be evaluated?
  – The expert must be assumed to be optimal.
  – The right-hand side is computationally expensive, i.e. it enumerates all policies.

IRL: Finite State Spaces
• Bellman equation in matrix form:
  V^π = (I − γP^π)^{-1} R.
• IRL then finds R such that
  (P_{a*} − P_a)(I − γP_{a*})^{-1} R ≥ 0   for all actions a,
  where P_a is the transition matrix for action a and a* is the expert's (optimal) action (considering only deterministic policies).

IRL as Linear Programming with an ℓ1 Penalty
•   max_R   Σ_{i=1}^{|S|}  min_{b ∈ A \ {a*}}  { (P_{a*}(i) − P_b(i)) (I − γP_{a*})^{-1} R }  −  λ ||R||_1
    s.t.    (P_{a*} − P_b)(I − γP_{a*})^{-1} R ≥ 0   for all b ∈ A \ {a*},
            |R(i)| ≤ R_max,
  where P_a(i) denotes the i-th row of P_a.
• Maximize the sum of differences between the value of the optimal action and that of the next-best action.
• The ℓ1 penalty keeps the recovered reward function simple.
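The LP above can be handed to any linear-programming solver once the min over non-optimal actions and the ℓ1 norm are linearized with auxiliary variables. Below is a sketch using scipy.optimize.linprog (instead of the Matlab fminimax suggested in the maze exercise); the 3-state chain, λ, and R_max are illustrative assumptions, and the same code applies to the 2 × 4 maze once its transition matrices are plugged in.

```python
# l1-penalized LP for finite-state IRL, in the spirit of Ng & Russell (2000).
# Decision variables x = [R, t, u]: t_i stands for the min over non-optimal
# actions at state i, and u_i >= |R_i| linearizes the l1 penalty.
import numpy as np
from scipy.optimize import linprog

gamma, lam, R_max = 0.95, 1.0, 1.0
n = 3  # number of states (assumed toy chain, not the maze)

# Assumed deterministic chain: action 0 = "left", action 1 = "right" (optimal).
P = {
    0: np.array([[1, 0, 0], [1, 0, 0], [0, 1, 0]], dtype=float),  # left
    1: np.array([[0, 1, 0], [0, 0, 1], [0, 0, 1]], dtype=float),  # right
}
a_star = 1
M = np.linalg.inv(np.eye(n) - gamma * P[a_star])           # (I - gamma P_{a*})^{-1}
D = {b: (P[a_star] - P[b]) @ M for b in P if b != a_star}  # (P_{a*} - P_b) M

# linprog minimizes, so maximizing sum(t) - lam*||R||_1 becomes this cost vector.
c = np.concatenate([np.zeros(n), -np.ones(n), lam * np.ones(n)])

A_ub, b_ub = [], []
for Db in D.values():
    for i in range(n):
        # t_i <= (P_{a*}(i) - P_b(i)) (I - gamma P_{a*})^{-1} R
        row = np.zeros(3 * n); row[:n] = -Db[i]; row[n + i] = 1.0
        A_ub.append(row); b_ub.append(0.0)
        # (P_{a*} - P_b) (I - gamma P_{a*})^{-1} R >= 0   (a* stays optimal)
        row = np.zeros(3 * n); row[:n] = -Db[i]
        A_ub.append(row); b_ub.append(0.0)
for i in range(n):
    # u_i >= |R_i|:   R_i - u_i <= 0   and   -R_i - u_i <= 0
    row = np.zeros(3 * n); row[i] = 1.0; row[2 * n + i] = -1.0
    A_ub.append(row); b_ub.append(0.0)
    row = np.zeros(3 * n); row[i] = -1.0; row[2 * n + i] = -1.0
    A_ub.append(row); b_ub.append(0.0)

bounds = [(-R_max, R_max)] * n + [(None, None)] * n + [(0, None)] * n
res = linprog(c, A_ub=np.array(A_ub), b_ub=b_ub, bounds=bounds)
print("recovered R:", np.round(res.x[:n], 3))  # reward concentrates on the right-most state
```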
IRL: With FA in Large State Spaces
• Use function approximation (FA) for the reward: R(s) = w^T φ(s), where w ∈ R^n and φ : S → R^n.
• Thus
  E[ Σ_t γ^t R(s_t) | π ] = E[ Σ_t γ^t w^T φ(s_t) | π ] = w^T E[ Σ_t γ^t φ(s_t) | π ] = w^T η(π),
  where η(π) is the vector of discounted feature expectations under π.
• The optimization problem: find w* such that
  w*^T η(π*) ≥ w*^T η(π)   for all π.
• η(π) can be estimated from N sampled trajectories of π:
  η(π) ≈ (1/N) Σ_{i=1}^N Σ_{t=0}^{T_i} γ^t φ(s_t^{(i)}).

Apprenticeship Learning (Abbeel & Ng, 2004)
• Find a policy π whose performance is as close to the expert policy's performance as possible:
  | w*^T η(π*) − w*^T η(π) | ≤ ε.
• Algorithm (a small sketch of the feature-expectation estimate and of step 4 is given in the appendix below):
  1. Assume R(s) = w^T φ(s), where w ∈ R^n and φ : S → R^n.
  2. Initialize π_0.
  3. for i = 1, 2, ... do
  4.   Find a reward function such that the teacher maximally outperforms all previously found controllers:
         max_{γ, ||w||_2 ≤ 1}  γ
         s.t.  w^T η(π*) ≥ w^T η(π) + γ   for all π ∈ {π_0, π_1, ..., π_{i−1}}
       (here γ denotes the margin, not the discount factor).
  5.   Find the optimal policy π_i for the reward function R_w with respect to the current w.

Examples

Simulated Highway Driving
• The dynamics model T(s, a, s') is given.
• Each teacher demonstrates for 1 minute.
• Abbeel et al., 2004.
[Figure: expert demonstration (left), learned controller (right).]

Urban Navigation
[Picture from a tutorial of Pieter Abbeel.]

References
Andrew Y. Ng, Stuart J. Russell: Algorithms for Inverse Reinforcement Learning. ICML 2000: 663-670.
Pieter Abbeel, Andrew Y. Ng: Apprenticeship Learning via Inverse Reinforcement Learning. ICML 2004.
Pieter Abbeel, Adam Coates, Morgan Quigley, Andrew Y. Ng: An Application of Reinforcement Learning to Aerobatic Helicopter Flight. NIPS 2006.
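Appendix: a small Python sketch (not from the lecture) of two building blocks of the apprenticeship-learning loop above, namely the Monte-Carlo estimate of the feature expectations η(π) and the max-margin step 4. The trajectory format and the toy feature map are assumptions, and the generic SLSQP solver stands in for a dedicated QP solver; the slide's margin variable γ is called `margin` here to avoid a clash with the discount factor.

```python
# Sketch: feature expectations eta(pi) and the max-margin step of
# apprenticeship learning.  Data formats and the solver choice are assumptions.
import numpy as np
from scipy.optimize import minimize

GAMMA = 0.95

def feature_expectations(trajectories, phi, gamma=GAMMA):
    """eta(pi) ~ (1/N) sum_i sum_t gamma^t phi(s_t^(i)) from sampled trajectories.

    `trajectories` is a list of state sequences; `phi` maps a state to R^n.
    """
    eta = np.zeros_like(phi(trajectories[0][0]), dtype=float)
    for states in trajectories:
        eta += sum(gamma ** t * phi(s) for t, s in enumerate(states))
    return eta / len(trajectories)

def max_margin_w(eta_expert, eta_policies):
    """Step 4: maximize the margin such that ||w||_2 <= 1 and
    w.eta(pi*) >= w.eta(pi_j) + margin for every previously found policy pi_j."""
    n = len(eta_expert)
    x0 = np.zeros(n + 1)  # x = [w, margin]
    cons = [{"type": "ineq",
             "fun": lambda x, e=e: x[:n] @ (eta_expert - e) - x[n]}
            for e in eta_policies]
    cons.append({"type": "ineq", "fun": lambda x: 1.0 - x[:n] @ x[:n]})  # ||w||_2 <= 1
    res = minimize(lambda x: -x[n], x0, constraints=cons, method="SLSQP")
    return res.x[:n], res.x[n]

# Tiny usage example with an assumed indicator feature map on integer states.
phi = lambda s: np.array([float(s == 0), float(s == 3)])
eta_expert = feature_expectations([[0, 1, 2, 3, 3]], phi)   # expert reaches state 3
eta_pi0 = feature_expectations([[0, 0, 0, 0, 0]], phi)      # initial policy stays at 0
w, margin = max_margin_w(eta_expert, [eta_pi0])
print("w =", np.round(w, 3), " margin =", round(margin, 3))
```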
