Truly Batch Apprenticeship Learning with Deep Successor Features ∗
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19)

Donghun Lee, Srivatsan Srinivasan and Finale Doshi-Velez
SEAS, Harvard University

Abstract

We introduce a novel Inverse Reinforcement Learning (IRL) method for batch settings where only expert demonstrations are given and no interaction with the environment is allowed. Such settings are common in health care, finance and education, where environmental dynamics are unknown and no reliable simulator exists. Unlike existing IRL methods, our method does not require on-policy roll-outs or assume access to non-expert data. We introduce a robust off-policy estimator of the feature expectations of any policy, and we also propose an IRL warm-start strategy that jointly learns a near-expert initial policy and an expressive feature representation directly from data, both of which together render batch IRL feasible. We demonstrate our model's superior performance in batch settings with both classical control tasks and a real-world clinical task of sepsis management in the ICU.

1 Introduction

Reward design is a key challenge in Reinforcement Learning (RL). Manually identifying an appropriate reward function is often difficult, and poorly specified rewards can pose safety risks [Leike et al., 2017]. Apprenticeship learning is the process of learning to act from expert demonstrations. One type of apprenticeship learning, Imitation Learning (IL) (e.g. [Ross et al., 2011; Ho and Ermon, 2016]), directly learns a policy from demonstrations. In contrast, Inverse Reinforcement Learning (IRL) aims to recover the expert policy by learning a reward function which, when solved, induces policies that are similar to the expert's. While there exist theoretical connections between IL and IRL [Piot et al., 2017], one may be preferred depending on the application. In particular, directly learning a policy (imitation) can be brittle in cases of long-horizon planning and in environments with strong covariate or dynamics shifts [Piot et al., 2013; Fu et al., 2017]. Besides addressing these issues, the learned reward function in IRL can also be used to identify the expert motivations underlying the actions: rewards describe what the expert wishes to achieve, rather than simply what they are reacting to, enabling agents to generalize better with the knowledge of these "intentions" in related environments.

In this work, we focus on IRL in batch settings: we must infer a reward function that induces the expert policy, given only a fixed set of expert demonstrations. No simulator exists, and the environment dynamics (transition model) are unknown. Performing analyses on batch data is often the only reasonable alternative in domains such as health care, finance, education, or industrial engineering, where pre-collected logs of expert behavior are relatively plentiful but new data acquisition or a policy roll-out is costly and risky.

Many existing IRL algorithms (e.g. [Abbeel and Ng, 2004; Ratliff et al., 2006]) use the feature expectations of a policy as the proxy quantity that measures the similarity between the expert policy and an arbitrary policy that IRL proposes. If a simulator is available, feature expectations can be computed by taking sample means across on-policy rollouts [Abbeel and Ng, 2004]. However, new rollouts are not possible in batch settings. To estimate feature expectations in batch settings, a few IRL algorithms exist that either use a linear estimator [Klein et al., 2012] or bypass the estimation by assuming the existence of additional data from an arbitrary policy [Boularias et al., 2011]. Often, linear estimators do not possess the representational power necessary to model real-world tasks, while the requirement of additional data is overly restrictive. Along with an off-policy estimator for feature expectations, successful batch IRL also requires methods to engineer expressive feature spaces and to manage computational complexity. Our work addresses all of these batch IRL challenges.
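To make the quantity at stake concrete: with a simulator, the feature expectations of a candidate policy would simply be a discounted sample mean over fresh on-policy rollouts, as in [Abbeel and Ng, 2004]. The short sketch below illustrates that simulator-based estimate; the trajectory format, the feature map phi and the discount gamma are assumed here for illustration, and this is precisely the computation that is unavailable in the batch setting.

```python
import numpy as np

def empirical_feature_expectations(trajectories, phi, gamma=0.99):
    """Monte Carlo estimate of mu^pi = E[sum_t gamma^t * phi(s_t, a_t)].

    trajectories: list of rollouts of the policy pi (requires a simulator),
                  each rollout a list of (state, action) pairs.
    phi:          feature map phi(s, a) -> np.ndarray of shape (d,).
    """
    estimates = []
    for traj in trajectories:
        total = np.zeros_like(phi(*traj[0]), dtype=float)
        for t, (s, a) in enumerate(traj):
            total += (gamma ** t) * phi(s, a)
        estimates.append(total)
    # Average the discounted feature sums across rollouts.
    return np.mean(estimates, axis=0)
```

In the batch setting this estimate is only available for the expert policy, whose rollouts are the logged demonstrations; for the candidate policies proposed inside the IRL loop no new rollouts exist, which is the gap the off-policy estimator introduced below (DSFN) is meant to close.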
Contributions. Specifically, our work makes two key contributions that together enable a batch version of max-margin IRL that empirically scales well across classical control and real-world tasks, using only expert demonstrations and no additional inputs whatsoever. Firstly, we propose the Deep Successor Feature Network (DSFN), an off-policy feature expectations estimator that replaces the on-policy roll-outs in standard IRL algorithms. Secondly, to mitigate the potentially high bias caused by data support mismatch in DSFN, we propose Transition Regularized Imitation Learning (TRIL), which warm-starts the IRL loop with a near-expert initial policy and simultaneously provides an expressive feature space for better IRL performance, learned directly from the batch data without any manual feature engineering.

∗ Supplementary material (Appendix) to the main text can be found here: https://tinyurl.com/ijcai-dsfn-supplement

2 Related Work

This paper focuses on batch IRL, where the IRL evaluation metric (for evaluating the IRL policy's closeness to the expert) is feature expectations [Abbeel and Ng, 2004]. Our goal is to learn a reward function for imitating and interpreting the expert policy under unknown dynamics, with expert demonstrations alone. This batch definition differs from other settings such as RE-IRL [Boularias et al., 2011], which requires access to non-expert data from an arbitrary policy for importance sampling, or DM-IRL [Burchfiel et al., 2016], which needs additional demonstration queries made to the expert. Finally, while we adopt the max-margin IRL framework [Abbeel and Ng, 2004] in this work, our key contributions, TRIL and DSFN, are generic and can be applied across a broad class of IRL methods (e.g. probabilistic IRL methods such as [Ziebart et al., 2008]).

Figure 1: TRIL+DSFN: DSFN provides the crucial off-policy estimate of the feature expectations of IRL policies in batch settings using the three inputs $D_e$, $\pi_0$, $\phi$ provided by TRIL. TRIL warm-starts the IRL loop by learning an initial policy $\pi_0$ (multi-class classification regularized by next-state prediction $s'$) and generates a feature map $\phi$ from the shared hidden layers in the process.

Batch Feature-Expectations IRL. A major challenge in batch IRL is estimating feature expectations. [Klein et al., 2011] made the important observation that this problem is similar to off-policy policy evaluation and proposed an estimator based on LSTD-Q [Lagoudakis and Parr, 2003]; this estimator suffers from the high sensitivity and weak representational power of linear systems. To address the computational complexity of IRL, [Klein et al., 2012] proposed SCIRL, which takes a supervised learning (classification) approach in which feature expectations are estimated by a simple sample mean. SCIRL assumes a deterministic expert and relies on the heuristic assumption that the feature expectations of non-expert actions can be approximated by a multiplicative constant factor of the feature expectations of expert actions (which requires tuning for every domain). In contrast, our method is scalable across domains with little tuning required and only requires the expert demonstrations, without any additional assumptions. Moreover, SCIRL uses a linear classifier, while our method admits any parametric model (e.g. neural nets). Our flexible neural-net parametrization for estimating feature expectations in DSFN (similar in spirit to [Kulkarni et al., 2016]'s deep successor features for value functions in RL), together with a TRIL warm-start, enables our model to scale the best among alternatives to real-world tasks.

Warm-starting batch IRL. Computational complexity and sensitivity to features are unavoidable practical challenges in IRL, more so in batch settings in which off-policy evaluations are required to compute feature expectations. Standard IRL algorithms [Abbeel and Ng, 2004; Ratliff et al., 2006] assume manually engineered features and initialize with a simple (random, sample-mean) policy.
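The warm-start described in the Figure 1 caption (an initial policy $\pi_0$ obtained by multi-class classification over expert actions, regularized by next-state prediction, with the shared hidden layers providing the feature map $\phi$) can be pictured as a small two-headed network. The sketch below is only a rough illustration of that idea under assumed details: the layer sizes, the loss weight lambda_reg and the use of a plain MSE next-state loss are not taken from the paper.

```python
import torch
import torch.nn as nn

class TRILNet(nn.Module):
    """Two-headed warm-start network (illustrative sketch).

    A shared trunk maps a state to a hidden representation; one head
    classifies the expert's action (imitation), the other predicts the
    next state (transition regularization). The trunk output can be
    reused as a learned feature representation for the IRL loop.
    """
    def __init__(self, state_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.policy_head = nn.Linear(hidden_dim, n_actions)      # logits for pi_0(a|s)
        self.next_state_head = nn.Linear(hidden_dim, state_dim)  # s' prediction

    def forward(self, s):
        h = self.trunk(s)
        return self.policy_head(h), self.next_state_head(h), h

def warm_start_loss(model, s, a_expert, s_next, lambda_reg=1.0):
    """Cross-entropy imitation loss plus lambda_reg * next-state MSE."""
    logits, s_next_pred, _ = model(s)
    ce = nn.functional.cross_entropy(logits, a_expert)   # a_expert: int64 class indices
    mse = nn.functional.mse_loss(s_next_pred, s_next)
    return ce + lambda_reg * mse
```

After training on the logged expert transitions (s, a, s'), the classification head gives a near-expert initial policy and the trunk output serves as a learned representation; note that the paper's feature map is defined over state-action pairs, so some pairing of this state representation with the action (e.g. a one-hot concatenation) is an additional assumption in this sketch.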
3 Background

Markov Decision Process (MDP). An MDP is a tuple $(\mathcal{S}, \mathcal{A}, T, T_0, R, \gamma)$ where $s \in \mathcal{S}$ are states (continuous, in this work), $a \in \mathcal{A}$ are actions (discrete, in this work), $T(s' \mid s, a)$ and $T_0$ are the transition probabilities and the initial state distribution respectively, $R(s, a)$ is the reward function, and $\gamma \in [0, 1)$ is the discount factor. A policy $\pi(a \mid s)$ gives the probability of taking an action $a$ in a state $s$. The state-value function is defined as $V^\pi(s) = \mathbb{E}_\pi\big[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t) \mid s_0 = s\big]$. The action-value function is defined as $Q^\pi(s, a) = R(s, a) + \mathbb{E}_{s_1, a_1 \sim \pi, \ldots}\big[\sum_{t=1}^{\infty} \gamma^t R(s_t, a_t)\big]$. The optimal policy under an MDP is given by $\pi_e = \pi^* = \arg\max_\pi V^\pi(s)$ $(\forall s \in \mathcal{S})$.

Batch Max-Margin IRL. We assume the existence of an expert policy $\pi_e$ that is optimal under some unknown, linear reward function of the form $R(s, a) = w \cdot \phi(s, a)$ for some reward weights $w \in \mathbb{R}^d$ and some pre-defined feature map $\phi(s, a) : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$. Both quantities are bounded, i.e. $\|w\|_2 \le 1$, $\|\phi(\cdot)\|_2 \le 1$. Let $D_e = \{(s_0, a_0 \sim \pi_e, \ldots, s_T)\}$ be a set of $N$ expert trajectories sampled according to $\pi_e$. Unlike traditional IRL methods, we assume that we cannot sample trajectories from any other policy and that $T$ is unknown (common in real-world batch settings). The state-action feature expectations $\mu^\pi(s, a) \in \mathbb{R}^d$ and the overall feature expectations $\mu^\pi \in \mathbb{R}^d$ of a policy are defined as
$$\mu^\pi(s, a) = \phi(s, a) + \mathbb{E}_{s_1, a_1 \sim \pi, \ldots}\Big[\sum_{t=1}^{\infty} \gamma^t \phi(s_t, a_t)\Big] \qquad (1)$$
$$\mu^\pi = \mathbb{E}_{s_0 \sim T_0}\big[\mu^\pi(s_0, \pi(s_0))\big]$$
Conceptually, the feature expectation $\mu^\pi$ represents the (expected, discounted) amount of the features accumulated while following the policy $\pi$.
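Equation (1) satisfies a one-step Bellman-style recursion, $\mu^\pi(s, a) = \phi(s, a) + \gamma\, \mathbb{E}_{s' \sim T(\cdot \mid s, a),\, a' \sim \pi(\cdot \mid s')}\big[\mu^\pi(s', a')\big]$, which is the property that allows feature expectations to be estimated by temporal-difference learning from logged transitions alone, in the spirit of DSFN and of deep successor features [Kulkarni et al., 2016]. The sketch below shows a generic TD regression loss of this kind on a batch of (s, a, s') transitions; the network interfaces, the frozen target network and the training details are assumptions for illustration, not the paper's exact procedure.

```python
import torch
import torch.nn as nn

def successor_feature_td_loss(mu_net, mu_target_net, phi, batch, policy, gamma=0.99):
    """One-step TD loss for a successor-feature network (illustrative sketch).

    mu_net(s)        -> tensor (B, n_actions, d): current estimates of mu^pi(s, a)
    mu_target_net(s) -> frozen copy of mu_net used for the bootstrap target
    phi(s, a)        -> tensor (B, d): state-action features
    batch            -> (s, a, s') transitions from the logged batch data (a: int64)
    policy(s')       -> tensor (B,) of actions chosen by the policy being evaluated
    """
    s, a, s_next = batch
    phi_sa = phi(s, a)                                  # (B, d)
    d = phi_sa.shape[-1]

    # Current estimate mu^pi(s, a) for the logged actions.
    idx = a.view(-1, 1, 1).expand(-1, 1, d)
    mu_sa = mu_net(s).gather(1, idx).squeeze(1)         # (B, d)

    # Bootstrap target: phi(s, a) + gamma * mu^pi(s', pi(s')).
    # The next action comes from the evaluated policy, not from the log,
    # which is what makes the estimate off-policy.
    with torch.no_grad():
        a_next = policy(s_next)
        idx_next = a_next.view(-1, 1, 1).expand(-1, 1, d)
        mu_next = mu_target_net(s_next).gather(1, idx_next).squeeze(1)
        target = phi_sa + gamma * mu_next

    return nn.functional.mse_loss(mu_sa, target)
```

Once $\mu^\pi(s, a)$ has been fit, the overall feature expectations $\mu^\pi$ can be approximated by averaging $\mu^\pi(s_0, \pi(s_0))$ over initial states, e.g. the initial states observed in $D_e$, matching the second line of Equation (1).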