Batch Policy Learning Under Constraints
Hoang M. Le, Cameron Voloshin, Yisong Yue
California Institute of Technology, Pasadena, CA. Correspondence to: Hoang M. Le <[email protected]>.
Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

When learning policies for real-world domains, two important questions arise: (i) how to efficiently use pre-collected off-policy, non-optimal behavior data; and (ii) how to mediate among different competing objectives and constraints. We thus study the problem of batch policy learning under multiple constraints, and offer a systematic solution. We first propose a flexible meta-algorithm that admits any batch reinforcement learning and online learning procedure as subroutines. We then present a specific algorithmic instantiation and provide performance guarantees for the main objective and all constraints. As part of off-policy learning, we propose a simple method for off-policy policy evaluation (OPE) and derive PAC-style bounds. Our algorithm achieves strong empirical results in different domains, including in a challenging problem of simulated car driving subject to multiple constraints such as lane keeping and smooth driving. We also show experimentally that our OPE method outperforms other popular OPE techniques on a standalone basis, especially in a high-dimensional setting.

1. Introduction

We study the problem of policy learning under multiple constraints. Contemporary approaches to learning sequential decision making policies have largely focused on optimizing some cost objective that is easily reducible to a scalar value function. However, in many real-world domains, choosing the right cost function to optimize is often not a straightforward task. Frequently, the agent designer faces multiple competing objectives. For instance, consider the aspirational task of designing autonomous vehicle controllers: one may care about minimizing the travel time while making sure the driving behavior is safe, consistent, or fuel efficient. Indeed, many such real-world applications require the primary objective function be augmented with an appropriate set of constraints (Altman, 1999).

Contemporary policy learning research has largely focused on either online reinforcement learning (RL) with a focus on exploration, or imitation learning (IL) with a focus on learning from expert demonstrations. However, many real-world settings already contain large amounts of pre-collected data generated by existing policies (e.g., existing driving behavior, power grid control policies, etc.). We thus study the complementary question: can we leverage this abundant source of (non-optimal) behavior data in order to learn sequential decision making policies with provable guarantees on both primary objective and constraint satisfaction?

We thus propose and study the problem of batch policy learning under multiple constraints. Historically, batch RL is regarded as a subfield of approximate dynamic programming (ADP) (Lange et al., 2012), where a set of transitions sampled from the existing system is given and fixed. From an interaction perspective, one can view many online RL methods (e.g., DDPG (Lillicrap et al., 2016)) as running a growing batch RL subroutine per round of online RL. In that sense, batch policy learning is complementary to any exploration scheme. To the best of our knowledge, the study of constrained policy learning in the batch setting is novel.

We present an algorithmic framework for learning sequential decision making policies from off-policy data. We employ multiple learning reductions to online and supervised learning, and present an analysis that relates performance in the reduced procedures to the overall performance with respect to both the primary objective and constraint satisfaction.

Constrained optimization is a well studied problem in supervised machine learning and optimization. In fact, our approach is inspired by the work of Agarwal et al. (2018) in the context of fair classification. In contrast to supervised learning for classification, batch policy learning for sequential decision making introduces multiple additional challenges. First, setting aside the constraints, batch policy learning itself presents a layer of difficulty, and the analysis is significantly more complicated. Second, verifying whether the constraints are satisfied is no longer as straightforward as passing the training data through the learned classifier. In sequential decision making, certifying constraint satisfaction amounts to an off-policy policy evaluation problem, which is a challenging problem and the subject of active research. In this paper, we develop a systematic approach to address these challenges, provide a careful error analysis, and experimentally validate our proposed algorithms. In summary, our contributions are:

• We formulate the problem of batch policy learning under multiple constraints, and present the first approach of its kind to solve this problem. The definition of constraints is general and can subsume many objectives. Our approach utilizes multi-level learning reductions, and we show how to instantiate it using various batch RL and online learning subroutines. We show that guarantees from the subroutines provably lift to provide end-to-end guarantees on the original constrained batch policy learning problem.
• While leveraging techniques from batch RL as a subroutine, we provide a refined theoretical analysis for general non-linear function approximation that improves upon the previously known sample complexity result (Munos & Szepesvári, 2008).
• To evaluate off-policy learning performance and constraint satisfaction, we propose a simple new technique for off-policy policy evaluation (OPE), which is used as a subroutine in our main algorithm. We show that it is competitive to other OPE methods.
• We validate our algorithm and analysis with two experimental settings. First, a simple navigation domain where we consider a safety constraint. Second, we consider a high-dimensional racing car domain with smooth driving and lane centering constraints.

2. Problem Formulation

We first introduce notation. Let $X \subset \mathbb{R}^d$ be a bounded and closed $d$-dimensional state space. Let $A$ be a finite action space. Let $c : X \times A \mapsto [0, \bar{C}]$ be the primary objective cost function, bounded by $\bar{C}$. Let there be $m$ constraint cost functions, $g_i : X \times A \mapsto [0, \bar{G}]$, each bounded by $\bar{G}$. To simplify the notation, we view the set of constraints as a vector function $g : X \times A \mapsto [0, \bar{G}]^m$, where $g(x, a)$ is the column vector of the individual $g_i(x, a)$. Let $p(\cdot \mid x, a)$ denote the (unknown) transition/dynamics model that maps state/action pairs to a distribution over the next state. Let $\gamma \in (0, 1)$ denote the discount factor. Let $\chi$ be the initial state distribution.

We consider the discounted infinite horizon setting. An MDP is defined by the tuple $(X, A, c, g, p, \gamma, \chi)$. A policy $\pi \in \Pi$ maps states to actions, i.e., $\pi(x) \in A$. The value function $C^\pi : X \mapsto \mathbb{R}$ corresponding to the primary cost is $C^\pi(x) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t c(x_t, a_t) \mid x_0 = x\right]$, with the expectation taken over the randomness of both the policy $\pi$ and the transition dynamics $p$. We similarly define the vector-valued value function for the constraint costs, $G^\pi : X \mapsto \mathbb{R}^m$, as $G^\pi(x) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t g(x_t, a_t) \mid x_0 = x\right]$. Define $C(\pi)$ and $G(\pi)$ as the expectation of $C^\pi(x)$ and $G^\pi(x)$, respectively, over the distribution $\chi$ of initial states.
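To make these definitions concrete, the following is a minimal Monte Carlo sketch of $C(\pi)$ and $G(\pi)$. It only unpacks the definitions above with on-policy rollouts; it is not the off-policy evaluation procedure developed later in the paper. The simulator interface (`env.reset()` sampling $x_0 \sim \chi$, `env.step(a)` returning the primary cost and the constraint-cost vector) and the rollout parameters are assumptions made for this illustration.

```python
import numpy as np

def monte_carlo_objectives(env, policy, gamma, m, horizon=1000, n_rollouts=100):
    """Estimate C(pi) and G(pi) from their definitions via truncated rollouts.

    Hypothetical interface: env.reset() samples x_0 ~ chi; env.step(a) returns
    (next_state, c(x, a), g(x, a), done) with g an m-dimensional cost vector.
    This is an on-policy illustration only; the batch setting of the paper
    must instead estimate these quantities off-policy from the fixed dataset D.
    """
    c_total = 0.0
    g_total = np.zeros(m)
    for _ in range(n_rollouts):
        x = env.reset()                        # x_0 ~ chi
        discount = 1.0
        for _ in range(horizon):               # truncate the infinite-horizon sum
            a = policy(x)                      # pi(x) in A
            x, c, g, done = env.step(a)
            c_total += discount * c            # accumulates C^pi(x_0)
            g_total += discount * np.asarray(g)  # accumulates G^pi(x_0)
            discount *= gamma
            if done:
                break
    # Averaging over sampled start states approximates
    # C(pi) = E_{x_0 ~ chi}[C^pi(x_0)] and G(pi) = E_{x_0 ~ chi}[G^pi(x_0)].
    return c_total / n_rollouts, g_total / n_rollouts
```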
2.1. Batch Policy Learning under Constraints

In batch policy learning, we have a pre-collected dataset, $D = \{(x_i, a_i, x_i', c(x_i, a_i), g_{1:m}(x_i, a_i))\}_{i=1}^{n}$, generated from (a set of) historical behavioral policies denoted jointly by $\pi_D$. The goal of batch policy learning under constraints is to learn a policy $\pi \in \Pi$ from $D$ that minimizes the primary objective cost while satisfying $m$ different constraints:

$$\min_{\pi \in \Pi} \; C(\pi) \quad \text{s.t.} \quad G(\pi) \leq \tau \qquad \text{(OPT)}$$

where $G(\cdot) = [g_1(\cdot), \ldots, g_m(\cdot)]^\top$ and $\tau \in \mathbb{R}^m$ is a vector of known constants. We assume that (OPT) is feasible. However, the dataset $D$ might be generated from multiple policies that violate the constraints.
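As a small illustration of the objects appearing in (OPT), the sketch below represents one record of the batch dataset $D$ and a component-wise feasibility check $G(\pi) \leq \tau$. The estimator of $G(\pi)$ is left abstract, since any off-policy evaluation routine (such as the one proposed in this paper) could be plugged in; the `Transition` and `is_feasible` names are illustrative, not part of the paper.

```python
from dataclasses import dataclass
from typing import Callable, Sequence
import numpy as np

@dataclass
class Transition:
    """One record of the pre-collected dataset D = {(x_i, a_i, x'_i, c_i, g_i)}."""
    x: np.ndarray        # state x_i
    a: int               # action a_i from the finite action space A
    x_next: np.ndarray   # next state x'_i ~ p(. | x_i, a_i)
    c: float             # primary cost c(x_i, a_i) in [0, C_bar]
    g: np.ndarray        # constraint costs g_{1:m}(x_i, a_i) in [0, G_bar]^m

def is_feasible(policy,
                estimate_G: Callable[[object, Sequence[Transition]], np.ndarray],
                dataset: Sequence[Transition],
                tau: np.ndarray) -> bool:
    """Check the constraint in (OPT): G(pi) <= tau, component-wise.

    `estimate_G` stands in for an off-policy policy evaluation routine that
    returns an estimate of G(pi) from the batch data D; this helper is an
    illustration of the constraint check, not the paper's algorithm.
    """
    G_hat = estimate_G(policy, dataset)
    return bool(np.all(G_hat <= tau))
```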
2.2. Examples of Policy Learning with Constraints

Counterfactual & Safe Policy Learning. In conventional online RL, the agent needs to "re-learn" from scratch when the cost function is modified. Our framework enables counterfactual policy learning, assuming the ability to compute the new cost objective from the same historical data. A simple example is safe policy learning (García & Fernández, 2015). Define the safety cost $g(x, a) = \phi(x, a, c)$ as a new function of the existing cost $c$ and features associated with the current state-action pair. The goal here is to counterfactually avoid undesirable behaviors observed in the historical data. We experimentally study this safety problem in Section 5. Other examples from the literature that belong to this safety perspective include planning under chance constraints (Ono et al., 2015; Blackmore et al., 2011). The constraint here is $G(\pi) = \mathbb{E}[\mathbb{1}(x \in X_{\text{error}})] = \mathbb{P}(x \in X_{\text{error}}) \leq \tau$.

Multi-objective Batch Learning. Traditional policy learning (RL or IL) presupposes that the agent optimizes a single cost function. In reality, we may want to satisfy multiple objectives that are not easily reducible to a scalar objective function. One example is learning fast driving policies under multiple behavioral constraints such as smooth driving and lane keeping consistency (see Section 5).

2.3. Equivalence between Constraint Satisfaction and Regularization

Our constrained policy learning framework accommodates several existing regularized