Reinforcement Learning Lecture: Inverse Reinforcement Learning
Reinforcement Learning
Inverse Reinforcement Learning
Inverse RL, behaviour cloning, apprenticeship learning, imitation learning.
Vien Ngo, MLR, University of Stuttgart

Outline
• Introduction to Inverse RL
• Inverse RL vs. behavioral cloning
• IRL algorithms
(Inspired by a lecture from Pieter Abbeel.)

Inverse RL: Informal Definition
• Given: measurements of an agent's behaviour $\pi$ over time, $(s_t, a_t, s'_t)$, in different circumstances; if possible, the transition model is also given (but not the reward function).
• Goal: find the reward function $R^{\pi}(s, a, s')$.

Exercise: A Small Maze Domain
(Advanced Robotics, Vien Ngo, Machine Learning & Robotics Lab, University of Stuttgart, Universitätsstraße 38, 70569 Stuttgart, Germany, June 17, 2014.)
Given a 2 × 4 maze as in the figure. Given an MDP\R = {S, A, P}, where S is a state space consisting of 8 states; A is an action space consisting of the four movement actions {move-left, move-right, move-down, move-up}; the optimal policy $\pi^*$ (arrows in the figure); and the transition function $T(s, a, s') = P(s' \mid s, a)$.
[Figure: a 2 × 4 maze with a Goal cell; arrows show the optimal policy $\pi^*$.]
• Given $\gamma = 0.95$, compute $(I - \gamma P)^{-1}$.
• Then write the full formulation of the LP problem from slide 12 ("IRL: finite state spaces"); a Python sketch using an LP solver is given right after that part below.
• Bonus: use the fminimax function in Matlab (or another LP library) to find $R^*$.
(Inspired by a poster of Boularias, Kober, Peters.)

Motivation: Two Sources
• The potential use of RL and related methods as a computational model for animal and human learning: bee foraging (Montague et al. 1995), song-bird vocalization (Doya & Sejnowski 1995), ...
• Construction of an intelligent agent in a particular domain: car driving, helicopter flight (Ng et al.), ... (imitation learning, apprenticeship learning)

Examples
• Car driving simulation (Abbeel et al. 2004, etc.)
• Autonomous helicopter flight (Andrew Ng et al.)
• Urban navigation: route recommendation and destination prediction (Ziebart, Maas, Bagnell and Dey, AAAI 2008)
• etc.

Problem Formulation
• Given:
  – state space $\mathcal{S}$, action space $\mathcal{A}$;
  – transition model $T(s, a, s') = P(s' \mid s, a)$;
  – no reward function $R(s, a, s')$;
  – the teacher's demonstrations (from the teacher's policy $\pi^*$): $s_0, a_0, s_1, a_1, \ldots$
• IRL: recover $R$.
• Apprenticeship learning via IRL: use $R$ to compute a good policy.
• Behaviour cloning: use supervised learning to learn the teacher's policy directly.

IRL vs. Behavioral Cloning
• Behavioral cloning is formulated as a supervised-learning problem (using SVMs, neural networks, deep learning, ...):
  – given $(s_0, a_0), (s_1, a_1), \ldots$ generated from a policy $\pi^*$,
  – estimate a policy mapping $s$ to $a$.
• Behavioral cloning can only mimic the teacher's trajectories; it cannot cope with a change of goal/destination or with a non-Markovian environment (e.g. car driving).
• IRL vs. behavioral cloning amounts to estimating $\hat{R}^*$ vs. estimating $\hat{\pi}^*$.
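To make the behavioral-cloning side concrete, here is a minimal sketch that treats the teacher's (state, action) pairs as an ordinary supervised-learning dataset. It is not from the lecture: the synthetic demonstrations, feature dimensions, and the choice of a logistic-regression classifier (standing in for the SVMs/neural networks mentioned above) are all assumptions for illustration.

```python
# Minimal behavioral-cloning sketch: fit a classifier that maps states to actions.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical demonstrations: 500 states with 4 features each, 3 discrete actions.
states = rng.normal(size=(500, 4))                                          # s_0, s_1, ...
actions = (states[:, 0] > 0).astype(int) + (states[:, 1] > 0).astype(int)   # a_0, a_1, ... in {0, 1, 2}

# Behavioral cloning = plain supervised learning on (state, action) pairs.
policy = LogisticRegression(max_iter=1000).fit(states, actions)

# The cloned policy maps a new state to an action, but it never sees a reward signal,
# so it cannot adapt if the goal/destination changes.
new_state = rng.normal(size=(1, 4))
print("cloned action:", policy.predict(new_state)[0])
```

The cloned policy never reasons about rewards or goals, which is exactly the limitation noted above; IRL instead recovers $R$ and re-plans.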
Inverse Reinforcement Learning

IRL: Mathematical Formulation
• Given:
  – state space $\mathcal{S}$, action space $\mathcal{A}$;
  – transition model $T(s, a, s') = P(s' \mid s, a)$;
  – no reward function $R(s, a, s')$;
  – the teacher's demonstrations (from the teacher's policy $\pi^*$): $s_0, a_0, s_1, a_1, \ldots$
• Find $R^*$ such that
  $E\big[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \mid \pi^*\big] \;\ge\; E\big[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \mid \pi\big], \quad \forall \pi$
• Challenges?
  – $R = 0$ is a solution (reward-function ambiguity), and multiple $R^*$ satisfy the condition above.
  – $\pi^*$ is only given partially, through trajectories; how can the expectation terms be evaluated?
  – We must assume the expert is optimal.
  – The right-hand side is computationally expensive, i.e. it enumerates all policies.

IRL: Finite State Spaces
• Bellman equations:
  $V^{\pi} = (I - \gamma P^{\pi})^{-1} R$
• Then IRL finds $R$ such that
  $(P^{a^*} - P^{a})(I - \gamma P^{a^*})^{-1} R \ge 0, \quad \forall a$
  (if we consider only deterministic policies).
• IRL as linear programming with an $\ell_1$ penalty:
  $\max \; \sum_{i=1}^{|\mathcal{S}|} \min_{b \in \mathcal{A} \setminus \{a^*\}} \big\{ (P^{a^*}(i) - P^{b}(i))(I - \gamma P^{a^*})^{-1} R \big\} \;-\; \lambda \|R\|_1$
  s.t. $(P^{a^*} - P^{b})(I - \gamma P^{a^*})^{-1} R \ge 0, \quad \forall b \in \mathcal{A} \setminus \{a^*\}$
       $|R(i)| \le R_{\max}$
  – Maximize the sum of differences between the values of the optimal action and the next-best action.
  – With an $\ell_1$ penalty. (A numerical sketch of this LP follows at the end of this part.)

IRL: With FA in Large State Spaces
• Using FA: $R(s) = w^{\top} \phi(s)$, where $w \in \mathbb{R}^n$ and $\phi: \mathcal{S} \to \mathbb{R}^n$.
• Thus
  $E\big[\sum_t \gamma^t R(s_t) \mid \pi\big] = E\big[\sum_t \gamma^t w^{\top} \phi(s_t) \mid \pi\big] = w^{\top} E\big[\sum_t \gamma^t \phi(s_t) \mid \pi\big] = w^{\top} \eta(\pi)$
• The optimization problem: find $w^*$ such that
  $w^{*\top} \eta(\pi^*) \ge w^{*\top} \eta(\pi)$ for all $\pi$.
• $\eta(\pi)$ can be evaluated with sampled trajectories from $\pi$:
  $\eta(\pi) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i} \gamma^t \phi\big(s_t^{(i)}\big)$
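Referring back to the finite-state LP above (and to the maze exercise, which asks for exactly the $(I - \gamma P)^{-1}$ term and this LP), the following is a minimal sketch of how that program can be set up and solved with scipy.optimize.linprog instead of Matlab's fminimax. The 4-state, 2-action MDP, the assumption that the expert always takes action 0, and the values of $\lambda$ and $R_{\max}$ are all made up for illustration; this is not code from the lecture.

```python
# Sketch of the finite-state IRL linear program (Ng & Russell 2000 style) on a toy MDP.
import numpy as np
from scipy.optimize import linprog

n, gamma, lam, R_max = 4, 0.95, 0.1, 10.0

# Toy transition matrices P[a][s, s'] (rows sum to 1); action 0 is the expert action a*.
P = np.array([
    [[0.9, 0.1, 0.0, 0.0], [0.0, 0.9, 0.1, 0.0], [0.0, 0.0, 0.9, 0.1], [0.0, 0.0, 0.0, 1.0]],  # a* = 0
    [[0.9, 0.0, 0.0, 0.1], [0.1, 0.9, 0.0, 0.0], [0.0, 0.1, 0.9, 0.0], [0.0, 0.0, 0.1, 0.9]],  # b  = 1
])
inv_term = np.linalg.inv(np.eye(n) - gamma * P[0])   # (I - gamma * P^{a*})^{-1}
M = (P[0] - P[1]) @ inv_term                         # row i gives the value margin of a* over b at state i

# Decision variables x = [R (n), t (n), u (n)]:
#   t_i <= margin at state i (implements the inner min),  u_i >= |R_i| (l1 penalty).
c = np.concatenate([np.zeros(n), -np.ones(n), lam * np.ones(n)])  # minimize -sum(t) + lam*sum(u)

A_ub, b_ub = [], []
for i in range(n):
    e = np.zeros(n); e[i] = 1.0
    A_ub.append(np.concatenate([-M[i], e, np.zeros(n)]))   # t_i <= M[i] @ R
    A_ub.append(np.concatenate([-M[i], np.zeros(2 * n)]))  # M[i] @ R >= 0
    A_ub.append(np.concatenate([e, np.zeros(n), -e]))      # R_i <= u_i
    A_ub.append(np.concatenate([-e, np.zeros(n), -e]))     # -R_i <= u_i
    b_ub += [0.0] * 4

bounds = [(-R_max, R_max)] * n + [(None, None)] * n + [(0, None)] * n
res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds, method="highs")
print("recovered reward R*:", np.round(res.x[:n], 3))
```

The auxiliary variables $t_i$ implement the inner min over non-expert actions and $u_i \ge |R_i|$ implements the $\ell_1$ penalty, so the whole problem stays linear.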
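The next sketch estimates the feature expectation $\eta(\pi)$ with the Monte-Carlo formula on the slide above; the feature map $\phi$ and the random-walk "demonstrations" are synthetic placeholders, not from the lecture. The resulting vector is exactly what the apprenticeship-learning loop on the following slides compares between the expert and the learned policies.

```python
# Monte-Carlo estimate of eta(pi) = (1/N) * sum_i sum_t gamma^t * phi(s_t^(i)).
import numpy as np

def phi(state):
    """Hypothetical feature map phi: S -> R^n (here n = 3)."""
    x = float(state)
    return np.array([1.0, x, x ** 2])

def feature_expectation(trajectories, gamma=0.95):
    """Average discounted feature counts over a list of state trajectories."""
    eta = np.zeros_like(phi(trajectories[0][0]))
    for traj in trajectories:
        for t, s in enumerate(traj):
            eta += (gamma ** t) * phi(s)
    return eta / len(trajectories)

# Synthetic demonstrations: N = 50 random-walk trajectories of length 20.
rng = np.random.default_rng(1)
demos = [np.cumsum(rng.normal(size=20)) for _ in range(50)]
eta_hat = feature_expectation(demos)
print("estimated feature expectation eta(pi):", np.round(eta_hat, 3))

# With R(s) = w^T phi(s), the value of the demonstrated policy is simply w^T eta_hat.
```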
Apprenticeship Learning (Abbeel & Ng, 2004)

• Find a policy $\pi$ whose performance is as close to the expert policy's performance as possible:
  $\| w^{*\top} \eta(\pi^*) - w^{*\top} \eta(\pi) \| \le \epsilon$
• Algorithm (a small numerical sketch of the reward-fitting step 4 is given at the end of these notes):
  1: Assume $R(s) = w^{\top} \phi(s)$, where $w \in \mathbb{R}^n$ and $\phi: \mathcal{S} \to \mathbb{R}^n$.
  2: Initialize $\pi_0$.
  3: for $i = 1, 2, \ldots$ do
  4:   Find a reward function such that the teacher maximally outperforms all previously found controllers:
         $\max_{\gamma,\, \|w\|_2 \le 1} \; \gamma$
         s.t. $w^{\top} \eta(\pi^*) \ge w^{\top} \eta(\pi) + \gamma, \quad \forall \pi \in \{\pi_0, \pi_1, \ldots, \pi_{i-1}\}$
  5:   Find the optimal policy $\pi_i$ for the reward function $R_w$ w.r.t. the current $w$.

Examples

Simulated Highway Driving
• Given the dynamics model $T(s, a, s')$.
• Each teacher demonstrates for 1 minute.
• Abbeel et al. 2004.

Simulated Highway Driving
[Figure: expert demonstration (left), learned controller (right).]

Urban Navigation
[Picture from a tutorial by Pieter Abbeel.]

References
Andrew Y. Ng, Stuart J. Russell: Algorithms for Inverse Reinforcement Learning. ICML 2000: 663-670.
Pieter Abbeel, Andrew Y. Ng: Apprenticeship Learning via Inverse Reinforcement Learning. ICML 2004.
Pieter Abbeel, Adam Coates, Morgan Quigley, Andrew Y. Ng: An Application of Reinforcement Learning to Aerobatic Helicopter Flight. NIPS 2006.
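Closing sketch, referring back to step 4 of the apprenticeship-learning loop above: finding a weight vector $w$ under which the expert maximally outperforms all previously found policies. To keep the sketch a plain linear program for scipy.optimize.linprog, the $\|w\|_2 \le 1$ constraint of Abbeel & Ng (2004) is replaced here by box constraints $|w_j| \le 1$, and the feature-expectation vectors are made-up numbers rather than real estimates.

```python
# Max-margin reward fitting (step 4 of the apprenticeship-learning loop), as a toy LP.
import numpy as np
from scipy.optimize import linprog

eta_expert = np.array([1.0, 0.8, 0.2])        # eta(pi*), e.g. from the estimator sketched earlier
eta_prev = [np.array([0.9, 0.5, 0.4]),        # eta(pi_0), eta(pi_1), ...
            np.array([0.7, 0.9, 0.1])]

n = len(eta_expert)
# Variables x = [w_1 .. w_n, margin]; maximize the margin.
c = np.concatenate([np.zeros(n), [-1.0]])
# Constraints: w^T eta(pi_k) + margin <= w^T eta(pi*) for every previous policy pi_k.
A_ub = np.array([np.concatenate([eta_k - eta_expert, [1.0]]) for eta_k in eta_prev])
b_ub = np.zeros(len(eta_prev))
bounds = [(-1.0, 1.0)] * n + [(None, None)]   # box constraints on w instead of ||w||_2 <= 1

res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
w, margin = res.x[:n], res.x[-1]
print("reward weights w:", np.round(w, 3), " margin:", round(float(margin), 3))
```

In the full loop, the resulting $w$ defines $R_w(s) = w^{\top}\phi(s)$, an optimal policy $\pi_i$ for $R_w$ is computed (e.g. by value iteration), its feature expectation is appended to the list, and the margin shrinks until the expert is matched within $\epsilon$.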