
Inverse Reinforcement Learning

Inverse RL, behaviour cloning, imitation learning.

Vien Ngo
MLR, University of Stuttgart

Outline

• Introduction to Inverse RL
• Inverse RL vs. behavioral cloning
• IRL algorithms

(Inspired by a lecture from .)

Inverse RL: Informal Definition

• Given: measurements of an agent's behaviour π over time, (s_t, a_t, s'_t), in different circumstances. If possible, the transition model is also given (the reward function is not).
• Goal: find the reward function R^π(s, a, s').

Inverse Reinforcement Learning

Advanced Robotics

Vien Ngo, Machine Learning & Robotics Lab, University of Stuttgart, Universitätsstraße 38, 70569 Stuttgart, Germany

June 17, 2014

1 A Small Maze Domain

Given a 2 × 4 maze as in the figure. Given an MDP/R = {S, A, P}, where S is a state space consisting of 8 states; A is an action space consisting of the four movement actions {move-left, move-right, move-down, move-up}; the optimal policy (arrows in the figure); and the transition function T(s, a, s') = P(s' | s, a).

Goal

• Given γ = 0.95, compute (I - γP^{π∗})^{-1}.
• Then, write the full formulation of the LP problem in slide 12.

• Bonus: use the fminimax function in Matlab (or another LP library) to find R∗.
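As a rough Python alternative to the Matlab hint above, here is a minimal sketch of the first bullet. The maze figure is not reproduced in this text, so the deterministic successor table below (every state stepping toward an assumed goal corner) is purely an illustrative stand-in; substitute the actual optimal policy read off the figure.

```python
import numpy as np

gamma = 0.95
n_states = 8                               # 2 x 4 grid, states numbered 0..7 row-major
next_state = [1, 2, 3, 7, 5, 6, 7, 7]      # ASSUMED successor of each state under pi*

# Deterministic transition matrix P^{pi*} for the assumed optimal policy.
P_pistar = np.zeros((n_states, n_states))
for s, s_next in enumerate(next_state):
    P_pistar[s, s_next] = 1.0

# (I - gamma * P^{pi*})^{-1}, as asked in the first bullet.
M = np.linalg.inv(np.eye(n_states) - gamma * P_pistar)
print(np.round(M, 3))
```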


Motivation: Two Sources

• The potential use of RL and related methods as computational models for animal and human learning: bee foraging (Montague et al. 1995), song-bird vocalization (Doya & Sejnowski 1995), ...
• Construction of an intelligent agent in a particular domain: car driving, helicopter flight (Ng et al.), ... (imitation learning, apprenticeship learning)

(Inspired from a poster of Boularias, Kober, Peters.)

Examples

• Car driving simulation (Abbeel et al. 2004), etc.

• Autonomous helicopter flight (Ng et al.)

• Urban navigation (Ziebart, Maas, Bagnell and Dey, AAAI 2008; route recommendation and destination prediction), etc.

Problem Formulation

• Given
  – State space S, action space A.
  – Transition model T(s, a, s') = P(s' | s, a).
  – The reward function R(s, a, s') is not given.
  – Teacher's demonstration (from the teacher's policy π∗): s_0, a_0, s_1, a_1, ...
• IRL:
  – Recover R.
• Apprenticeship learning via IRL:
  – Use R to compute a good policy.
• Behaviour cloning:
  – Use supervised learning to learn the teacher's policy.
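The code sketches later in these notes assume the ingredients above are stored as plain NumPy arrays. The small container below is not from the slides; it only fixes that convention (P[a, s, s'] = P(s' | s, a), demonstrations as state-action sequences, and no reward given).

```python
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class IRLProblem:
    """Inputs of the IRL / apprenticeship-learning problem (no reward given)."""
    n_states: int
    n_actions: int
    P: np.ndarray                        # shape (n_actions, n_states, n_states)
    demos: List[List[Tuple[int, int]]]   # trajectories of (s_t, a_t) pairs
    gamma: float = 0.95
```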

IRL vs. Behavioral Cloning

• Behavioral cloning is formulated as a supervised-learning problem (using SVMs, neural networks, deep learning, ...):
  – Given (s_0, a_0), (s_1, a_1), ..., generated from a policy π∗.
  – Estimate a policy mapping s to a.
• Behavioral cloning can only mimic the trajectory of the teacher; it cannot cope with a change of goal/destination or with a non-Markovian environment (e.g. car driving).
• IRL vs. behavioral cloning is estimating R̂∗ vs. estimating π̂∗.
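To make the contrast concrete, here is a minimal behaviour-cloning sketch for the tabular case, where the supervised-learning step reduces to a per-state majority vote over demonstrated actions (with continuous states one would fit an SVM or a neural network instead). The function and variable names are illustrative, not from the slides.

```python
import numpy as np

def clone_policy(demos, n_states, n_actions):
    """Behaviour cloning in the tabular case: for each state, pick the action
    the teacher demonstrated most often. demos = list of (state, action) pairs."""
    counts = np.zeros((n_states, n_actions))
    for s, a in demos:
        counts[s, a] += 1
    return counts.argmax(axis=1)       # states never visited default to action 0

# Toy usage with hypothetical demonstrations from pi*.
demos = [(0, 1), (1, 1), (2, 3), (0, 1), (1, 2), (2, 3)]
pi_hat = clone_policy(demos, n_states=8, n_actions=4)
```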

Inverse Reinforcement Learning


IRL: Mathematical Formulation

• Given
  – State space S, action space A.
  – Transition model T(s, a, s') = P(s' | s, a).
  – The reward function R(s, a, s') is not given.
  – Teacher's demonstration (from the teacher's policy π∗): s_0, a_0, s_1, a_1, ...
• Find R∗ such that

  E\Big[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \,\Big|\, \pi^*\Big] \;\ge\; E\Big[\sum_{t=0}^{\infty} \gamma^t R^*(s_t) \,\Big|\, \pi\Big], \quad \forall \pi

• Challenges?
  – R = 0 is a solution (reward-function ambiguity), and multiple R∗ satisfy the above condition.
  – π∗ is only given partially through trajectories, so how do we evaluate the expectation terms?
  – We must assume the expert is optimal.
  – The right-hand side is computationally expensive, i.e. it requires enumerating all policies.
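To see why the last challenge bites, the sketch below checks the condition above by brute force on a small finite MDP: it enumerates all |A|^|S| deterministic policies and compares expected discounted returns under a candidate R, using V^π = (I - γP^π)^{-1}R. The names and the single-start-state convention are illustrative simplifications; the LP on the next slide avoids the enumeration.

```python
import itertools
import numpy as np

def policy_value(P, R, pi, gamma):
    """V^pi = (I - gamma P^pi)^{-1} R for a deterministic policy pi (array of actions)."""
    n_states = R.shape[0]
    P_pi = P[pi, np.arange(n_states), :]
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R)

def satisfies_irl_condition(P, R, pi_star, gamma=0.95, start=0):
    """True iff pi* attains the highest discounted return from `start` under R."""
    n_actions, n_states, _ = P.shape
    v_star = policy_value(P, R, np.asarray(pi_star), gamma)[start]
    return all(policy_value(P, R, np.asarray(pi), gamma)[start] <= v_star + 1e-9
               for pi in itertools.product(range(n_actions), repeat=n_states))
```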

IRL: Finite State Spaces

• Bellman equations:

  V^\pi = (I - \gamma P^\pi)^{-1} R

• IRL then finds R such that

  (P^{a^*} - P^{a})(I - \gamma P^{a^*})^{-1} R \ge 0, \quad \forall a

  (if we consider only deterministic policies)

• IRL as a linear program with an l_1 penalty:

  \max_R \; \sum_{i=1}^{|S|} \; \min_{b \in A \setminus \{a^*\}} \Big\{ \big(P^{a^*}(i) - P^{b}(i)\big)(I - \gamma P^{a^*})^{-1} R \Big\} \;-\; \lambda \|R\|_1

  s.t. \quad (P^{a^*} - P^{b})(I - \gamma P^{a^*})^{-1} R \ge 0, \qquad |R(i)| \le R_{\max}

  – Maximize the sum of differences between the values of the optimal action and the next-best action.
  – With an l_1 penalty.
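A sketch of how the LP above could be assembled with scipy.optimize.linprog, assuming the tabular inputs used earlier (P[a, s, s'] and the expert policy pi_star). Auxiliary variables m encode the per-state minimum over non-optimal actions and u encodes the l1 term; this is an illustrative reconstruction, not the original authors' code.

```python
import numpy as np
from scipy.optimize import linprog

def irl_lp(P, pi_star, gamma=0.95, lam=1.0, r_max=1.0):
    """Recover a reward vector R via the finite-state IRL LP sketched above.
    Decision vector x = [R, m, u]: m_s is the per-state margin (min over
    non-optimal actions b), u_s >= |R_s| implements the l1 penalty."""
    n_actions, n_states, _ = P.shape
    P_star = P[pi_star, np.arange(n_states), :]            # row s: P(. | s, pi*(s))
    T = np.linalg.inv(np.eye(n_states) - gamma * P_star)   # (I - gamma P^{pi*})^{-1}

    # Minimize -sum(m) + lam * sum(u)  <=>  maximize sum(m) - lam * ||R||_1.
    c = np.concatenate([np.zeros(n_states), -np.ones(n_states), lam * np.ones(n_states)])
    A_ub, b_ub = [], []
    for s in range(n_states):
        for b in range(n_actions):
            if b == pi_star[s]:
                continue
            d = (P_star[s] - P[b, s]) @ T                  # coefficients of R
            row = np.zeros(3 * n_states); row[:n_states] = -d
            A_ub.append(row); b_ub.append(0.0)             # d @ R >= 0
            row = np.zeros(3 * n_states); row[:n_states] = -d; row[n_states + s] = 1.0
            A_ub.append(row); b_ub.append(0.0)             # m_s <= d @ R
    for s in range(n_states):                              # -u_s <= R_s <= u_s
        row = np.zeros(3 * n_states); row[s] = 1.0; row[2 * n_states + s] = -1.0
        A_ub.append(row); b_ub.append(0.0)
        row = np.zeros(3 * n_states); row[s] = -1.0; row[2 * n_states + s] = -1.0
        A_ub.append(row); b_ub.append(0.0)
    bounds = [(-r_max, r_max)] * n_states + [(None, None)] * n_states + [(0, None)] * n_states
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds)
    return res.x[:n_states]                                # recovered reward R(s)
```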


IRL: With FA in Large State Spaces

• Using FA: R(s) = w^\top \phi(s), where w \in \mathbb{R}^n and \phi : S \mapsto \mathbb{R}^n.
• Thus,

  E\Big[\sum_t \gamma^t R(s_t) \,\Big|\, \pi\Big] = E\Big[\sum_t \gamma^t w^\top \phi(s_t) \,\Big|\, \pi\Big] = w^\top E\Big[\sum_t \gamma^t \phi(s_t) \,\Big|\, \pi\Big] = w^\top \eta(\pi)

• The optimization problem: find w∗ such that

  w^{*\top} \eta(\pi^*) \ge w^{*\top} \eta(\pi)

• η(π) can be evaluated with sampled trajectories from π:

  \eta(\pi) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=0}^{T_i} \gamma^t \phi\big(s_t^{(i)}\big)
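A straightforward Monte-Carlo estimate of η(π) matching the formula above; phi is any feature map returning a NumPy vector and trajectories is a list of sampled state sequences (both names are assumptions for this sketch).

```python
import numpy as np

def feature_expectations(trajectories, phi, gamma=0.95):
    """eta(pi) ~ (1/N) * sum_i sum_t gamma^t phi(s_t^(i)) over N sampled trajectories."""
    eta = None
    for states in trajectories:
        discount = 1.0
        for s in states:
            contrib = discount * np.asarray(phi(s), dtype=float)
            eta = contrib if eta is None else eta + contrib
            discount *= gamma
    return eta / len(trajectories)

# Toy usage: one-hot state features on 8 states, two hypothetical trajectories.
phi = lambda s: np.eye(8)[s]
eta_hat = feature_expectations([[0, 1, 2, 3, 7], [0, 4, 5, 6, 7]], phi)
```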

Apprenticeship Learning (Abbeel & Ng, 2004)

• Find a policy π whose performance is as close to the expert policy's performance as possible:

  \| w^{*\top} \eta(\pi^*) - w^\top \eta(\pi) \| \le \epsilon

• Algorithm:

  1: Assume R(s) = w^\top \phi(s), where w \in \mathbb{R}^n and \phi : S \mapsto \mathbb{R}^n.
  2: Initialize \pi_0.
  3: for i = 1, 2, \ldots do
  4:   Find a reward function such that the teacher maximally outperforms all previously found controllers (a solver sketch for this step follows after the algorithm):

         \max_{\gamma,\, \|w\|_2 \le 1} \; \gamma
         \text{s.t.} \quad w^\top \eta(\pi^*) \ge w^\top \eta(\pi) + \gamma, \quad \forall \pi \in \{\pi_0, \pi_1, \ldots, \pi_{i-1}\}

  5:   Find the optimal policy \pi_i for the reward function R_w w.r.t. the current w.
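Step 4 is a small convex program: maximize the margin by which the expert's feature expectations beat those of every previously found policy, subject to ||w||_2 ≤ 1. The sketch below uses cvxpy as one possible solver; this is an assumption, not part of the slides, and Abbeel & Ng also describe a projection variant that needs no solver. Step 5 then calls any standard MDP solver (e.g. value iteration) with the reward R_w.

```python
import numpy as np
import cvxpy as cp

def max_margin_step(eta_expert, eta_list):
    """Step 4: max_{margin, ||w||_2 <= 1} margin
       s.t. w . eta(pi*) >= w . eta(pi_j) + margin for every previous policy pi_j."""
    n = eta_expert.shape[0]
    w = cp.Variable(n)
    margin = cp.Variable()
    constraints = [cp.norm(w, 2) <= 1]
    constraints += [w @ (eta_expert - eta_j) >= margin for eta_j in eta_list]
    problem = cp.Problem(cp.Maximize(margin), constraints)
    problem.solve()
    return w.value, margin.value
```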

Examples

Simulated Highway Driving

• Given the dynamics model T(s, a, s').
• Each teacher demonstrates for 1 minute.

(Abbeel et al. 2004)

Simulated Highway Driving

Expert demonstration (left), learned control (right).

Urban Navigation

(Picture from a tutorial of Pieter Abbeel.)

References

Andrew Y. Ng, Stuart J. Russell: Algorithms for Inverse Reinforcement Learning. ICML 2000: 663-670

Pieter Abbeel, Andrew Y. Ng: Apprenticeship learning via inverse reinforcement learning. ICML 2004

Pieter Abbeel, Adam Coates, Morgan Quigley, Andrew Y. Ng: An Application of Reinforcement Learning to Aerobatic Helicopter Flight. NIPS 2006: 1-8

Adam Coates, Pieter Abbeel, Andrew Y. Ng: Apprenticeship learning for helicopter control. Commun. ACM 52(7): 97-105 (2009)
