
Robotics: Science and Systems 2020, Corvallis, Oregon, USA, July 12-16, 2020

Explaining Multi-stage Tasks by Learning Temporal Logic Formulas from Suboptimal Demonstrations

Glen Chou, Necmiye Ozay, and Dmitry Berenson
Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI 48109
Email: {gchou, necmiye, dmitryb}@umich.edu

Abstract—We present a method for learning to perform multi-stage tasks from demonstrations by learning the logical structure and atomic propositions of a consistent linear temporal logic (LTL) formula. The learner is given successful but potentially suboptimal demonstrations, where the demonstrator is optimizing a cost function while satisfying the LTL formula, and the cost function is uncertain to the learner. Our algorithm uses the Karush-Kuhn-Tucker (KKT) optimality conditions of the demonstrations together with a counterexample-guided falsification strategy to learn the atomic proposition parameters and logical structure of the LTL formula, respectively. We provide theoretical guarantees on the conservativeness of the recovered atomic proposition sets, as well as completeness in the search for finding an LTL formula consistent with the demonstrations. We evaluate our method on high-dimensional nonlinear systems by learning LTL formulas explaining multi-stage tasks on 7-DOF arm and quadrotor systems and show that it outperforms competing methods for learning LTL formulas from positive examples.

Fig. 1. Multi-stage manipulation: first fill the cup, then grasp it, and then deliver it. To avoid spills, a pose constraint is enforced after the cup is grasped.

I. INTRODUCTION

Imagine demonstrating a multi-stage task to a robot arm barista, such as preparing a drink for a customer (Fig. 1). How should the robot understand and generalize the demonstration? One popular method is inverse reinforcement learning (IRL), which assumes a level of optimality on the demonstrations, and aims to learn a reward function that replicates the demonstrator's behavior when optimized [1, 4, 32, 36]. Due to this representation, IRL works well on short-horizon tasks, but can struggle to scale to multi-stage, constrained tasks [14, 28, 40]. Transferring reward functions across environments (i.e. from one kitchen to another) can also be difficult, as IRL may overfit to aspects of the training environment.

It may instead be fruitful to decouple the high- and low-level task structure, learning a logical/temporal abstraction of the task that is valid for different environments and which can combine low-level, environment-dependent skills. Linear temporal logic (LTL) is well-suited for representing this abstraction, since it can unambiguously specify high-level temporally-extended constraints [5] as a function of atomic propositions (APs), which can be used to describe salient low-level state-space regions. To this end, a growing community in controls and anomaly detection has focused on learning LTL formulas to explain trajectory data. However, the vast majority of these methods require both positive and negative examples in order to regularize the learning problem. While this is acceptable in anomaly detection, where one expects to observe formula-violating trajectories, in the context of robotics, it can be unsafe to ask a demonstrator to execute formula-violating behavior, such as spilling the drink or crashing into obstacles.

In this paper, our insight is that by assuming that demonstrators are goal-directed (i.e. approximately optimize an objective function that may be uncertain to the learner), we can regularize the LTL learning problem without being given formula-violating behavior. In particular, we learn LTL formulas which are parameterized by their high-level logical structure and low-level AP regions, and we show that to do so, it is important to consider demonstration optimality both in terms of the quality of the discrete high-level logical decisions and the continuous low-level control actions. We use the Karush-Kuhn-Tucker (KKT) optimality conditions from continuous optimization to learn the shape of the low-level APs, along with notions of discrete optimality to learn the high-level task structure. We solve a mixed integer linear program (MILP) to jointly recover LTL and cost function parameters which are consistent with the demonstrations. We make the following contributions:

1) We develop a method for time-varying, constrained inverse optimal control, where the demonstrator optimizes a cost function while respecting an LTL formula, where the parameters of the atomic propositions, formula structure, and an uncertain cost function are to be learned. We require only positive demonstrations, can handle demonstration suboptimality, and for fixed formula structure, can extract guaranteed conservative estimates of the AP regions.

2) We develop conditions on demonstrator optimality needed to learn high- and low-level task structure: AP regions can be learned with discrete feasibility, while logical structure requires various levels of discrete optimality. We develop variants of our method under these different assumptions.

3) We provide theoretical analysis of our method, showing that under mild assumptions, it is guaranteed to return the shortest LTL formula which is consistent with the demonstrations, if one exists. We also prove various results on our method's conservativeness and on formula learnability.

4) We evaluate our method on learning complex LTL formulas demonstrated on nonlinear, high-dimensional systems, show that we can use demonstrations of the same task in different environments to learn shared high-level task structure, and show that we outperform previous approaches.

II. RELATED WORK

There is extensive literature on inferring temporal logic formulas from data via decision trees [9], genetic algorithms [11], and Bayesian inference [38, 40]. However, most of these methods require positive and negative examples as input [13, 26, 27, 31], while our method is designed to only use positive examples. Other methods require a space-discretization [3, 39, 40], while our approach learns LTL formulas in the original continuous space. Some methods learn AP parameters, but do not learn logical structure or perform an incomplete search, relying on formula templates [6, 29, 42], while other methods learn structure but not AP parameters [38]. Perhaps the method most similar to ours is [23], which learns parametric signal temporal logic (pSTL) formulas from positive examples by fitting formulas that the data tightly satisfies. However, the search over logical structure in [23] is incomplete, and tightness may not be the most informative metric given goal-directed demonstrations (c.f. Sec. VIII). To our knowledge, this is the first method for learning LTL formula structure and parameters in continuous spaces on high-dimensional systems from only positive examples.

IRL [1, 19, 24, 25, 36] searches for a reward function that replicates a demonstrator's behavior when optimized, but these methods can struggle to represent multi-stage, long-horizon tasks [28]. To alleviate this, [28, 35] learn sequences of reward functions, but in contrast to temporal logic, these methods are restricted to learning tasks which can be described by a single fixed sequence. Temporal logic generalizes this, being able to represent tasks that involve more choices and can be completed with multiple different sequences. Some work [34, 43] aims to learn a reward function given that the demonstrator satisfies a known temporal logic formula; we will learn both jointly. Finally, there is relevant work in constraint learning. These methods generally focus on learning time-invariant constraints [12, 14, 15, 17] or a fixed sequence of task constraints [33], which our method subsumes by learning time-dependent constraints that can be satisfied by different sequences.

III. PRELIMINARIES AND PROBLEM STATEMENT

We consider discrete-time nonlinear systems x_{t+1} = f(x_t, u_t, t), with state x ∈ 𝒳 and control u ∈ 𝒰, where we denote state/control trajectories of the system as ξ_{xu} = (ξ_x, ξ_u). We use linear temporal logic (LTL) [5], which augments standard propositional logic to express properties holding on trajectories over (potentially infinite) periods of time. In this paper, we will be given finite-length trajectories demonstrating tasks that can be completed in finite time. To ensure that the formulas we learn can be evaluated on finite trajectories, we focus on learning formulas, given in positive normal form, which are described in a parametric temporal logic similar to bounded LTL [21], and which can be written with the grammar

$$\varphi ::= p \mid \neg p \mid \varphi_1 \vee \varphi_2 \mid \varphi_1 \wedge \varphi_2 \mid \square_{[t_1,t_2]}\varphi \mid \varphi_1\, \mathcal{U}_{[t_1,t_2]}\, \varphi_2, \qquad (1)$$

where p ∈ 𝒫 ≐ {p_i}_{i=1}^{N_AP} are atomic propositions (APs), N_AP is known to the learner, and t_1 ≤ t_2 are nonnegative integers. The semantics, describing satisfaction of an LTL formula ϕ by a trajectory ξ_{xu}, denoted ξ_{xu} ⊨ ϕ, are given in App. A of [16]. Note that negation only appears directly before APs. Let the size of the grammar be N_g = N_AP + N_o, where N_o is the number of temporal/boolean operators in the grammar. A useful derived operator is "eventually" ♦_{[t_1,t_2]}ϕ ≐ ⊤ 𝒰_{[t_1,t_2]} ϕ.

We assume the demonstrator optimizes a parametric cost function (encoding efficiency concerns, etc.) while satisfying LTL formula ϕ(θˢ, θᵖ) (encoding constraints defining the task):

Problem 1 (Forward problem):
$$\begin{array}{ll} \underset{\xi_{xu}}{\text{minimize}} & c(\xi_{xu}, \theta^c) \\ \text{subject to} & \xi_{xu} \models \varphi(\theta^s, \theta^p) \\ & \bar{\eta}(\xi_{xu}) \in \bar{\mathcal{S}} \subseteq \bar{\mathcal{C}} \end{array}$$

where c(·, θᶜ) is a potentially non-convex cost function, parameterized by θᶜ ∈ Θᶜ. The LTL formula ϕ(θˢ, θᵖ) is parameterized by θˢ ∈ Θˢ, encoding the logical and temporal structure of the formula, and by θᵖ ≐ {θᵖᵢ}_{i=1}^{N_AP}, where θᵖᵢ ∈ Θᵖᵢ defines the shape of the region where p_i holds. Specifically, we consider APs of the form x ⊨ p_i ⇔ g_i(η_i(x), θᵖᵢ) ≤ 0, where η_i(·): 𝒳 → 𝒞 is a known nonlinear function, g_i(·,·) = [g_{i,1}(·,·), …, g_{i,N_i^ineq}(·,·)]ᵀ is a vector-valued parametric function, and 𝒞 is the space in which the constraint is evaluated, elements of which are denoted constraint states κ ∈ 𝒞. In the manipulation example, the joint angles are x, the end effector pose is κ, and η(·) are the forward kinematics. As shorthand, let G_i(κ, θᵖᵢ) ≐ max_{m ∈ {1,…,N_i^ineq}} g_{i,m}(κ, θᵖᵢ). Define the subsets of 𝒞 where p_i holds/does not hold as

$$\mathcal{S}_i(\theta_i^p) \doteq \{\kappa \mid G_i(\kappa, \theta_i^p) \le 0\} \qquad (2)$$
$$\mathcal{A}_i(\theta_i^p) \doteq \mathrm{cl}(\{\kappa \mid G_i(\kappa, \theta_i^p) > 0\}) = \mathrm{cl}(\mathcal{S}_i(\theta_i^p)^c) \qquad (3)$$

To ensure that Problem 1 admits an optimum, we have defined 𝒜_i(θᵖᵢ) to be closed; that is, states on the boundary of an AP can be considered either inside or outside. For these boundary states, our learning algorithm can automatically detect if the demonstrator intended to visit or avoid the AP (c.f. Sec. IV-B). Any a priori known constraints are encoded in 𝒮̄, where η̄(·) is known. In this paper, we encode in 𝒮̄ the system dynamics, start state, and if needed, a goal state separate from the APs.

We are given N_s demonstrations {ξⱼᵈᵉᵐ}_{j=1}^{N_s} of duration T_j, which approximately solve Prob. 1, in that they are feasible (satisfy the LTL formula and known constraints) and achieve a possibly suboptimal cost. Note that Prob. 1 can be modeled with continuous (ξ_{xu}) and boolean decision variables (referred to collectively as Z) [41]; the boolean variables determine the high-level plan, constraining the trajectory to obey boolean decisions that satisfy ϕ(θˢ, θᵖ), while the continuous component synthesizes a low-level trajectory implementing the plan. We will use different assumptions of demonstrator optimality on the continuous/boolean parts of the problem, depending on if θᵖ (Sec. IV), θˢ (Sec. V), or θᶜ (Sec. VI) are being learned, and discuss how these different degrees of optimality can affect the learnability of LTL formulas (Sec. VII).

Our goal is to learn the unknown structure θˢ and AP parameters θᵖ of the LTL formula ϕ(θˢ, θᵖ), as well as the unknown cost function parameters θᶜ, given demonstrations {ξⱼᵈᵉᵐ}_{j=1}^{N_s} and the a priori known safe set 𝒮̄.
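To make the grammar and AP definitions concrete, the following is a minimal sketch (ours, not from the paper; the box parameterization, function names, and example data are illustrative) of evaluating a box-shaped AP region S_i(θᵖᵢ) and a bounded "eventually" on a finite trajectory:

```python
import numpy as np

def G_box(kappa, theta):
    """Max violation G_i(kappa, theta) for a box AP with theta = (lo, hi).
    G <= 0 iff kappa lies inside the box [lo, hi]."""
    lo, hi = theta
    # rows g_{i,m}: lo - kappa <= 0 and kappa - hi <= 0
    return max(np.max(lo - kappa), np.max(kappa - hi))

def holds(xi, t, theta, eta=lambda x: x):
    """x_t |= p_i  <=>  G_i(eta(x_t), theta) <= 0."""
    return G_box(eta(xi[t]), theta) <= 0.0

def eventually(xi, t1, t2, theta):
    """Bounded eventually: some t in [t1, t2] with x_t |= p_i."""
    return any(holds(xi, t, theta) for t in range(t1, t2 + 1))

# Example: a 2-D trajectory that enters the unit box at t = 2.
xi = np.array([[3.0, 3.0], [2.0, 1.5], [0.5, 0.5], [0.2, -0.3]])
box = (np.array([-1.0, -1.0]), np.array([1.0, 1.0]))
print(eventually(xi, 0, len(xi) - 1, box))  # True
```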
IV. LEARNING ATOMIC PROPOSITION PARAMETERS (θᵖ)

We develop methods for learning unknown AP parameters θᵖ when the cost function parameters θᶜ and formula structure θˢ are known. We first review recent results [17] on learning time-invariant constraints via the KKT conditions (Sec. IV-A), show how the framework can be extended to learn θᵖ (Sec. IV-B), and develop a method for extracting states which are guaranteed to satisfy/violate p_i (Sec. IV-C). In this section, we will assume that demonstrations are locally-optimal for the continuous component and feasible for the discrete component.

Fig. 2. A DAG model of ϕ = (♦_{[0,T_j−1]} p₁) ∧ (♦_{[0,T_j−1]} p₂). ξⱼᵈᵉᵐ ⊨ ϕ iff the first entry at the root node, (∨_{i=1}^{T_j} Zʲ_{1,i}) ∧ (∨_{i=1}^{T_j} Zʲ_{2,i}), is true.

A. Learning time-invariant constraints via KKT

Consider a simplified variant of Prob. 1 that only involves always satisfying a single AP; this reduces Prob. 1 to a standard trajectory optimization problem:

$$\begin{array}{ll} \underset{\xi_{xu}}{\text{minimize}} & c(\xi_{xu}) \\ \text{subject to} & g(\eta(x), \theta^p) \le 0, \ \forall x \in \xi_{xu} \\ & \bar{\eta}(\xi_{xu}) \in \bar{\mathcal{S}} \subseteq \bar{\mathcal{C}} \end{array} \qquad (4)$$

To ease notation, θᶜ is assumed known in Secs. IV-V and reintroduced in Sec. VI. Suppose we rewrite the constraints of (4) as hᵏ(η(ξ_{xu})) = 0, gᵏ(η(ξ_{xu})) ≤ 0, and g^{¬k}(η(ξ_{xu}), θᵖ) ≤ 0, where k/¬k group together known/unknown constraints. Then, with Lagrange multipliers λ and ν, the KKT conditions (first-order necessary conditions for local optimality [10]) of the jth demonstration ξⱼᵈᵉᵐ, denoted KKT(ξⱼᵈᵉᵐ), are:

$$\begin{array}{lll}
\text{Primal} & h^k(\eta(x_t^j)) = 0, & t = 1,\ldots,T_j \quad (5a)\\
\text{feasibility:} & g^k(\eta(x_t^j)) \le 0, & t = 1,\ldots,T_j \quad (5b)\\
 & g^{\neg k}(\eta(x_t^j), \theta^p) \le 0, & t = 1,\ldots,T_j \quad (5c)\\
\text{Lagrange mult.} & \lambda_t^{j,k} \ge 0, & t = 1,\ldots,T_j \quad (5d)\\
\text{nonnegativity:} & \lambda_t^{j,\neg k} \ge 0, & t = 1,\ldots,T_j \quad (5e)\\
\text{Complementary} & \lambda_t^{j,k} \odot g^k(\eta(x_t^j)) = 0, & t = 1,\ldots,T_j \quad (5f)\\
\text{slackness:} & \lambda_t^{j,\neg k} \odot g^{\neg k}(\eta(x_t^j), \theta^p) = 0, & t = 1,\ldots,T_j \quad (5g)\\
\text{Stationarity:} & \multicolumn{2}{l}{\nabla_{x_t} c(\xi_j^{\mathrm{dem}}) + \lambda_t^{j,k\,\top} \nabla_{x_t} g^k(\eta(x_t^j))}\\
 & \multicolumn{2}{l}{\ + \lambda_t^{j,\neg k\,\top} \nabla_{x_t} g^{\neg k}(\eta(x_t^j), \theta^p)}\\
 & \multicolumn{2}{l}{\ + \nu_t^{j,k\,\top} \nabla_{x_t} h^k(\eta(x_t^j)) = 0, \quad t = 1,\ldots,T_j \quad (5h)}
\end{array}$$

where ⊙ denotes elementwise multiplication. We vectorize the multipliers λₜ^{j,k} ∈ ℝ^{N_ineq^k}, λₜ^{j,¬k} ∈ ℝ^{N_ineq^{¬k}}, and νₜ^{j,k} ∈ ℝ^{N_eq^k}, i.e. λₜ^{j,k} = [λ^{j,k}_{t,1}, …, λ^{j,k}_{t,N_ineq^k}]ᵀ. We drop (5a)-(5b), as they involve no decision variables. Then, we can find a constraint which makes the demonstrations locally-optimal by finding a θᵖ that satisfies the KKT conditions for each demonstration:

Problem 2 (KKT, exact):
find θᵖ, λₜ^{j,k}, λₜ^{j,¬k}, νₜ^{j,k}, j = 1, …, N_s
subject to {KKT(ξⱼᵈᵉᵐ)}_{j=1}^{N_s}

If the demonstrations are only approximately locally-optimal, Prob. 2 may become infeasible. In this case, we can relax stationarity and complementary slackness to cost penalties:

Problem 3 (KKT, suboptimal):
minimize over θᵖ, λₜ^{j,k}, λₜ^{j,¬k}, νₜ^{j,k}:  Σ_{j=1}^{N_s} ‖stat(ξⱼᵈᵉᵐ)‖₁ + ‖comp(ξⱼᵈᵉᵐ)‖₁
subject to (5c)-(5e), ξⱼᵈᵉᵐ, j = 1, …, N_s

where stat(ξⱼᵈᵉᵐ) denotes the LHS of Eq. (5h) and comp(ξⱼᵈᵉᵐ) denotes the concatenated LHSs of Eqs. (5f) and (5g).
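The penalties of Prob. 3 are straightforward to evaluate for a fixed candidate θᵖ. The following is a minimal sketch (our illustration, not the paper's code; a path-length cost and a single box constraint are assumed) of the stationarity and complementary-slackness residuals for one demonstration:

```python
import numpy as np

def kkt_residuals(xi, lam, lo, hi):
    """Stationarity / complementary-slackness residuals (Prob. 3 penalties)
    for one demonstration under a single box constraint lo <= x_t <= hi.
    xi: (T, n) states; lam: (T, 2n) multipliers for g = [lo - x; x - hi]."""
    T, n = xi.shape
    stat, comp = [], []
    for t in range(T):
        # gradient of the path-length cost c = sum ||x_{t+1} - x_t||^2 w.r.t. x_t
        grad_c = np.zeros(n)
        if t > 0:
            grad_c += 2.0 * (xi[t] - xi[t - 1])
        if t < T - 1:
            grad_c += 2.0 * (xi[t] - xi[t + 1])
        g = np.concatenate([lo - xi[t], xi[t] - hi])   # g(x_t, theta) <= 0
        grad_g = np.vstack([-np.eye(n), np.eye(n)])    # d g / d x_t
        stat.append(grad_c + grad_g.T @ lam[t])        # LHS of (5h)
        comp.append(lam[t] * g)                        # LHS of (5g)
    return np.abs(np.concatenate(stat)).sum(), np.abs(np.concatenate(comp)).sum()
```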
For some constraint parameterizations (i.e. unions of boxes [17]), Prob. 2-3 are MILP-representable and can be efficiently solved; we discuss this in further detail in Sec. IV-B.

B. Modifying KKT for multiple atomic propositions

Having built intuition with the single AP case, we return to Prob. 1 and discuss how the KKT conditions change in the multiple-AP setting. We first adjust the primal feasibility condition (5c). Recall from Sec. III that we can solve Prob. 1 by finding a continuous trajectory ξ_{xu} and a set of boolean variables Z enforcing that ξ_{xu} ⊨ ϕ(θˢ, θᵖ). For each ξⱼᵈᵉᵐ, let Zʲ(θᵖᵢ) ∈ {0,1}^{N_AP×T_j}, and let the (i,t)th index Zʲᵢ,ₜ(θᵖᵢ) indicate if on ξⱼᵈᵉᵐ, constraint state κₜʲ ⊨ p_i for parameters θᵖᵢ:

$$Z_{i,t}^j(\theta_i^p) = 1 \Leftrightarrow \kappa_t^j \in \mathcal{S}_i(\theta_i^p); \qquad Z_{i,t}^j(\theta_i^p) = 0 \Leftrightarrow \kappa_t^j \in \mathcal{A}_i(\theta_i^p) \qquad (6)$$

Since LTL operators have equivalent boolean encodings [41], the truth value of ϕ(θˢ, θᵖ) can be evaluated as a function of Zʲ, θᵖ, and θˢ, denoted as Φ(Zʲ, θᵖ, θˢ) (we suppress θˢ, as it is assumed known for now). For example, we can evaluate the truth value of ϕ(θˢ, θᵖ) = (♦_{[0,T_j−1]} p₁) ∧ (♦_{[0,T_j−1]} p₂) on ξⱼᵈᵉᵐ by calculating Φ(Zʲ, θᵖ) = (∨_{t=1}^{T_j} Zʲ_{1,t}(θᵖ₁)) ∧ (∨_{t=1}^{T_j} Zʲ_{2,t}(θᵖ₂)) (c.f. Fig. 2). Boolean encodings of common temporal and logical operators can be found in [8]. Enforcing that Zʲᵢ,ₜ(θᵖᵢ) satisfies (6) can be done with a big-M formulation and binary variables sʲᵢ,ₜ ∈ {0,1}^{N_i^ineq} [7]:

$$\begin{array}{ll}
g_i(\kappa_t^j, \theta_i^p) \le M(1 - s_{i,t}^j), & \mathbf{1}^\top s_{i,t}^j - N_i^{\mathrm{ineq}} \le M Z_{i,t}^j - \hat{M}\\
g_i(\kappa_t^j, \theta_i^p) \ge -M s_{i,t}^j, & \mathbf{1}^\top s_{i,t}^j - N_i^{\mathrm{ineq}} \ge -M(1 - Z_{i,t}^j)
\end{array} \qquad (7)$$

where 𝟏_d is a d-dimensional ones vector, M is a large positive number, and M̂ ∈ (0, 1). In practice, M and M̂ can be carefully chosen to improve the solver's performance. Note that sʲᵢ,ₘ,ₜ, the mth component of sʲᵢ,ₜ, encodes if κₜʲ satisfies a negated g_{i,m}(κₜʲ, θᵖᵢ), i.e. if sʲᵢ,ₘ,ₜ = 1 or 0, then κₜʲ satisfies g_{i,m}(κₜʲ, θᵖᵢ) ≤ 0 or ≥ 0. We can rewrite the enforced constraint as g_i(κₜʲ, θᵖᵢ) ⊙ (2sʲᵢ,ₜ − 1) ≤ 0 for each i, t; we use this form to adapt the remaining KKT conditions. While enforcing (7) is hard in general, if g_i(κ, θᵖᵢ) is affine in θᵖᵢ for fixed κ, (7) is MILP-representable; henceforth, we assume g_i(κ, θᵖᵢ) is of this form. Note that this can still describe non-convex regions, as the dependency on κ can be nonlinear.

To modify complementary slackness (5g) for the multi-AP case, we note that the elementwise product in (5g) is MILP-representable:

$$\big[\lambda_{i,t}^{j,\neg k},\; -g_i(\kappa_t^j, \theta_i^p) \odot (2 s_{i,t}^j - 1)\big] \le M Q_{i,t}^j, \qquad Q_{i,t}^j \mathbf{1}_2 \le \mathbf{1}_{N_i^{\mathrm{ineq}}} \qquad (8)$$

where Qʲᵢ,ₜ ∈ {0,1}^{N_i^ineq×2}. Intuitively, (8) enforces that either 1) the Lagrange multiplier is zero and the constraint is inactive, i.e. g_{i,m}(κ, θᵖᵢ) ∈ [−M, 0] or ∈ [0, M] if sʲᵢ,ₘ,ₜ = 0 or 1, 2) the Lagrange multiplier is nonzero and g_{i,m}(κₜ, θᵖᵢ) = 0, or both. The stationarity condition (5h) must also be modified to consider whether a particular constraint is negated; this can be done by modifying the second line of (5h) to terms of the form λ^{j,¬k ⊤}_{i,t} [(2sʲᵢ,ₜ − 1) ⊙ ∇_{x_t} g_i^{¬k}(η(x_t), θᵖ)]. The KKT conditions for the multi-AP case, denoted KKT_LTL(ξⱼᵈᵉᵐ), are then:

$$\begin{array}{lll}
\text{Primal} & \text{Equations (5a)-(5b)}, & t = 1,\ldots,T_j \quad (9a)\\
\text{feasibility:} & \text{Equation (7)}, & i = 1,\ldots,N_{\mathrm{AP}},\ t = 1,\ldots,T_j \quad (9b)\\
\text{Lagrange} & \text{Equation (5d)}, & t = 1,\ldots,T_j \quad (9c)\\
\text{nonneg.:} & \lambda_{i,t}^{j,\neg k} \ge 0, & i = 1,\ldots,N_{\mathrm{AP}},\ t = 1,\ldots,T_j \quad (9d)\\
\text{Complem.} & \text{Equation (5f)}, & t = 1,\ldots,T_j \quad (9e)\\
\text{slackness:} & \text{Equation (8)}, & i = 1,\ldots,N_{\mathrm{AP}},\ t = 1,\ldots,T_j \quad (9f)\\
\text{Stationarity:} & \multicolumn{2}{l}{\nabla_{x_t} c(\xi_j^{\mathrm{dem}}) + \lambda_t^{j,k\,\top} \nabla_{x_t} g^k(\eta(x_t^j))}\\
 & \multicolumn{2}{l}{\ + \sum_{i=1}^{N_{\mathrm{AP}}} \lambda_{i,t}^{j,\neg k\,\top} \big[(2 s_{i,t}^j - 1) \odot \nabla_{x_t} g_i^{\neg k}(\eta(x_t^j), \theta_i^p)\big]}\\
 & \multicolumn{2}{l}{\ + \nu_t^{j,k\,\top} \nabla_{x_t} h^k(\eta(x_t^j)) = 0, \quad t = 1,\ldots,T_j \quad (9g)}
\end{array}$$

As mentioned in Sec. III, if κₜʲ lies on the boundary of AP i, the KKT conditions will automatically determine if κₜʲ ∈ 𝒮_i(θᵖᵢ) or κₜʲ ∈ 𝒜_i(θᵖᵢ) based on whichever option enables sʲᵢ,ₜ to take values that satisfy (9). To summarize, our approach is to 1) find Zʲ, which determines the feasibility of ξⱼᵈᵉᵐ for ϕ(θˢ, θᵖ), 2) find sʲᵢ,ₘ,ₜ, which link the value of Zʲ from the AP-containment level (i.e. κₜʲ ∈ 𝒮_i(θᵖᵢ)) to the single-constraint level (i.e. g_{i,m}(κₜʲ, θᵖᵢ) ≤ 0), and 3) enforce that ξⱼᵈᵉᵐ satisfies the KKT conditions for the continuous optimization problem defined by θᵖ and fixed values of sʲᵢ,ₜ.
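The big-M linking (7) between the region-level indicator Z and the per-inequality signs s is mechanical to encode with an off-the-shelf MILP modeler. Below is a minimal sketch, assuming PuLP with its bundled CBC solver and a 1-D box AP for clarity; the variable names and the toy objective are ours:

```python
import pulp

M, M_hat = 100.0, 0.5   # big-M constants; M_hat in (0, 1)
kappa = 1.7             # one observed constraint state (1-D for clarity)

prob = pulp.LpProblem("big_M_link", pulp.LpMinimize)
lo = pulp.LpVariable("lo", -10, 10)   # theta_p = (lo, hi), a 1-D box AP
hi = pulp.LpVariable("hi", -10, 10)
Z = pulp.LpVariable("Z", cat="Binary")          # kappa |= p_i ?
s = [pulp.LpVariable(f"s{m}", cat="Binary") for m in range(2)]

g = [lo - kappa, kappa - hi]          # g_i(kappa, theta) <= 0 iff inside
for m in range(2):                    # eq. (7), row by row
    prob += g[m] <= M * (1 - s[m])
    prob += g[m] >= -M * s[m]
prob += pulp.lpSum(s) - 2 <= M * Z - M_hat
prob += pulp.lpSum(s) - 2 >= -M * (1 - Z)

prob += Z                             # toy objective: prefer Z = 0
prob.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.value(Z), pulp.value(lo), pulp.value(hi))
```

With Z forced to 1, the solver must set both components of s to 1, i.e. both rows of g nonpositive, so the box must contain κ; with Z = 0, at least one row must be nonnegative, excluding κ, which mirrors (6).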
Finally, we can write the problem of recovering θᵖ for a fixed θˢ as:

Problem 4 (Fixed template):
find θᵖ, λₜ^{j,k}, λ^{j,¬k}_{i,t}, νₜ^{j,k}, sʲᵢ,ₜ, Qʲᵢ,ₜ, Zʲ, ∀i, j, t
subject to {KKT_LTL(ξⱼᵈᵉᵐ)}_{j=1}^{N_s}

We can also encode prior knowledge in Prob. 4, i.e. known AP labels or a prior on θᵖᵢ; we discuss this in App. B of [16].

C. Extraction of guaranteed learned AP

As with the constraint learning problem, the LTL learning problem is also ill-posed: there can be many θᵖ which explain the demonstrations. Despite this, we can measure our confidence in the learned APs by checking if a constraint state κ is guaranteed to satisfy/not satisfy p_i. Denote ℱ_i as the feasible set of Prob. 4, projected onto Θᵖᵢ (the feasible set of θᵖᵢ). Then, we say κ is learned to be guaranteed contained in/excluded from 𝒮_i(θᵖᵢ) if for all θ ∈ ℱ_i, G_i(κ, θ) ≤ 0 / ≥ 0. Denote by

$$\mathcal{G}_s^i \doteq \bigcap_{\theta \in \mathcal{F}_i} \{\kappa \mid G_i(\kappa,\theta) \le 0\} \quad (10) \qquad \mathcal{G}_{\neg s}^i \doteq \bigcap_{\theta \in \mathcal{F}_i} \{\kappa \mid G_i(\kappa,\theta) \ge 0\} \quad (11)$$

the sets of κ which are guaranteed to satisfy/not satisfy p_i. To query if κ is guaranteed to satisfy/not satisfy p_i, we can check the feasibility of the following problem:

Problem 5 (Query containment of κ in/outside of 𝒮_i(θᵖᵢ)):
find θᵖ, λₜ^{j,k}, λ^{j,¬k}_{i,t}, νₜ^{j,k}, sʲᵢ,ₜ, Qʲᵢ,ₜ, Zʲ, ∀i, j, t
subject to {KKT_LTL(ξⱼᵈᵉᵐ)}_{j=1}^{N_s}
G_i(κ, θᵖᵢ) > 0 OR G_i(κ, θᵖᵢ) ≤ 0

If forcing κ to (not) satisfy p_i renders Prob. 5 infeasible, we can deduce that to be consistent with the KKT conditions, κ must (not) satisfy p_i. Similarly, continuous volumes of κ which must (not) satisfy p_i can be extracted by solving:

Problem 6 (Volume extraction):
minimize over ε, κ_near, θᵖ, λₜ^{j,k}, λ^{j,¬k}_{i,t}, νₜ^{j,k}, sʲᵢ,ₜ, Qʲᵢ,ₜ, Zʲ:  ε
subject to {KKT_LTL(ξⱼᵈᵉᵐ)}_{j=1}^{N_s}
‖κ_near − κ_query‖_∞ ≤ ε
G_i(κ_near, θᵖᵢ) > 0 OR G_i(κ_near, θᵖᵢ) ≤ 0

Prob. 6 searches for the largest box centered around κ_query contained in 𝒢ˢᵢ/𝒢₋ˢᵢ. An explicit approximation of 𝒢ˢᵢ/𝒢₋ˢᵢ can then be obtained by solving Prob. 6 for many different κ_query.
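When solving Probs. 5-6 for every query is too expensive, queries can be cheaply pre-screened by sampling parameters from the feasible set. Note the caveat: unlike Probs. 5-6, which certify over the entire feasible set, the sampled check below (our illustration, not the paper's method) only over-approximates (10)-(11), so it can falsify a guarantee but never certify one:

```python
import numpy as np

def guaranteed_sets(kappas, feasible_thetas, G):
    """Sampled over-approximation of the guaranteed sets (10)-(11):
    a state is kept only if ALL sampled feasible parameters agree on it.
    kappas: (N, d) query states; feasible_thetas: samples from F_i;
    G: callable implementing G_i(kappa, theta)."""
    vals = np.array([[G(k, th) for th in feasible_thetas] for k in kappas])
    in_Gs = np.all(vals <= 0.0, axis=1)    # candidate "guaranteed kappa |= p_i"
    in_Gns = np.all(vals >= 0.0, axis=1)   # candidate "guaranteed kappa |/= p_i"
    return in_Gs, in_Gns
```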
V. LEARNING TEMPORAL LOGIC STRUCTURE (θˢ, θᵖ)

We will discuss how to frame the search over LTL structures θˢ (Sec. V-A), the learnability of θˢ based on demonstration optimality (Sec. V-B), and how we combine notions of discrete and continuous optimality to learn θˢ and θᵖ (Sec. V-C).

A. Representing LTL structure

We adapt [31] to search for a directed acyclic graph (DAG), 𝒟, that encodes the structure of a parametric LTL formula and is equivalent to its parse tree, with identical subtrees merged. Hence, each node still has at most two children, but can have multiple parents. This framework enables both a complete search over length-bounded LTL formulas and the encoding of specific formula templates through constraints on 𝒟 [31]. Each node in 𝒟 is labeled with an AP or operator from (1) and has at most two children; binary operators like ∧ and ∨ have two, unary operators like ♦_{[t₁,t₂]} have one, and APs have none (see Fig. 2). Formally, a DAG with N_DAG nodes, 𝒟 = (X, L, R), can be represented as: X ∈ {0,1}^{N_DAG×N_g}, where X_{u,v} = 1 if node u is labeled with element v of the grammar and 0 else, and L, R ∈ {0,1}^{N_DAG×N_DAG}, where L_{u,v} = 1 / R_{u,v} = 1 if node v is the left/right child of node u and 0 else. The DAG is enforced to be well-formed (i.e. there is one root node, no isolated nodes, etc.) with further constraints; see [31] for more details. Since 𝒟 defines a parametric LTL formula, we set θˢ = 𝒟.

To ensure that demonstration j satisfies the LTL formula encoded by 𝒟, we introduce a satisfaction matrix Sⱼᵈᵉᵐ ∈ {0,1}^{N_DAG×T_j}, where Sᵈᵉᵐ_{j,(u,t)} encodes the truth value of the subformula for the subgraph with root node u at time t (i.e., Sᵈᵉᵐ_{j,(u,t)} = 1 iff the suffix of ξⱼᵈᵉᵐ starting at time t satisfies the subformula). This can be encoded with the constraints:

$$|S_{j,(u,t)}^{\mathrm{dem}} - \Phi_{uv}^t| \le M(1 - X_{uv}) \qquad (12)$$

where Φᵗ_{uv} is the truth value of the subformula for the subgraph rooted at u if labeled with v, evaluated on the suffix of ξⱼᵈᵉᵐ starting at time t. The truth values are recursively generated, and the leaf nodes, each labeled with some AP i, have truth values set to Zʲᵢ(θᵖᵢ). Next, we can enforce that the demonstrations satisfy the formula encoded in 𝒟 by enforcing:

$$S_{j,(\mathrm{root},1)}^{\mathrm{dem}} = 1, \quad j = 1,\ldots,N_s \qquad (13)$$

We will also use synthetically-generated invalid trajectories {ξ^{¬s}}_{j=1}^{N_{¬s}} (Sec. V-C). To ensure {ξ^{¬s}}_{j=1}^{N_{¬s}} do not satisfy the formula, we add more satisfaction matrices Sⱼ^{¬s} and enforce:

$$S_{j,(\mathrm{root},1)}^{\neg s} = 0, \quad j = 1,\ldots,N_{\neg s}. \qquad (14)$$

After discussing learnability, we will show how 𝒟 can be integrated into the KKT-based learning framework in Sec. V-C.
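Outside the MILP, the recursion underlying the satisfaction matrix is a simple dynamic program over the DAG. The sketch below (ours; the node encoding is illustrative and only ∧, ∨, and bounded eventually are shown) evaluates truth values on suffixes of a finite trajectory, matching Fig. 2's example:

```python
def sat_matrix(nodes, Z, T):
    """Recursive truth values underlying (12)-(13), for a *fixed* DAG.
    nodes: dict u -> (label, children); labels: ("AP", i), "AND", "OR",
    ("EV", t1, t2) for bounded eventually; Z[i][t] = 1 iff kappa_t |= p_i.
    Returns S with S[(u, t)] = 1 iff the suffix at t satisfies node u."""
    S = {}
    def val(u, t):
        if (u, t) in S:
            return S[(u, t)]
        label, kids = nodes[u]
        if label[0] == "AP":                       # leaf: truth from Z
            out = Z[label[1]][t]
        elif label == "AND":
            out = val(kids[0], t) and val(kids[1], t)
        elif label == "OR":
            out = val(kids[0], t) or val(kids[1], t)
        else:                                      # ("EV", t1, t2)
            _, t1, t2 = label
            out = any(val(kids[0], tau)
                      for tau in range(t + t1, min(t + t2, T - 1) + 1))
        S[(u, t)] = out
        return out
    for t in range(T):
        val(0, t)   # assume node 0 is the root
    return S

# Fig. 2's formula: (eventually p1) AND (eventually p2)
T = 4
nodes = {0: ("AND", [1, 2]),
         1: (("EV", 0, T - 1), [3]), 2: (("EV", 0, T - 1), [4]),
         3: (("AP", 0), []), 4: (("AP", 1), [])}
Z = {0: [0, 1, 0, 0], 1: [0, 0, 0, 1]}  # p1 holds at t=1, p2 at t=3
print(sat_matrix(nodes, Z, T)[(0, 0)])   # True: both eventually hold
```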

B. A detour on learnability

When learning only the AP parameters θᵖ (Sec. IV), we assumed that the demonstrator chooses any assignment of Z consistent with the specification, then finds a locally-optimal trajectory for those fixed Z. Feasibility is enough if the structure θˢ of ϕ(θˢ, θᵖ) is known: to recover θᵖ, we just need to find some Z which is feasible with respect to the known θˢ (i.e. Φ(Zʲ, θᵖ, θˢ) = 1) and makes ξⱼᵈᵉᵐ locally-optimal; that is, the demonstrator can choose an arbitrarily suboptimal high-level plan as long as its low-level plan is locally-optimal for the chosen high-level plan. However, if θˢ is also unknown, using only boolean feasibility is not enough to recover meaningful logical structure, as this makes any formula ϕ for which Φ(Zʲ, θᵖ, θˢ) = 1 consistent with the demonstration, including trivially feasible formulas always evaluating to ⊤. Consider the example in Fig. 3: θᵖ₁, θᵖ₂ are known and we are given two kinematic demonstrations minimizing path length under input constraints, formula ϕ = (¬p₂ 𝒰_{[0,T_j−1]} p₁) ∧ ♦_{[0,T_j−1]} p₂, and start/goal constraints. Assuming boolean feasibility, we cannot distinguish between formulas in ϕ_f, the set of formulas for which the demonstrations are feasible in the discrete variables and locally-optimal in the continuous variables.

Fig. 3. Left: demonstrations satisfying ϕ = (¬p₂ 𝒰_{[0,T_j]} p₁) ∧ ♦_{[0,T_j]} p₂. Right: some demonstration-consistent formulas under various discrete optimality conditions (ϕ_f ⊇ ϕ_µ-SO ⊇ ϕ_g).

On the other end of the spectrum, we can assume the demonstrator is globally-optimal. This invalidates many structures in ϕ_f, i.e. the blue trajectory should not visit both 𝒮₁ and 𝒮₂ if ϕ = (♦_{[0,T_j−1]} p₁) ∨ (♦_{[0,T_j−1]} p₂); we achieve a lower cost by only visiting one. Using global optimality, we can distinguish between all but the formulas with globally-optimal trajectories of equal cost (formulas in ϕ_g), i.e. we cannot learn the ordering constraint (¬p₂ 𝒰_{[0,T_j−1]} p₁) from only the blue trajectory, as it coincides with the globally-optimal trajectory for ϕ = (♦_{[0,T_j−1]} p₁) ∧ (♦_{[0,T_j−1]} p₂); we need the yellow trajectory to distinguish the two. We now define an optimality condition between feasibility and global optimality:

Definition 1 (Spec-optimality): A demonstration ξⱼᵈᵉᵐ is µ-spec-optimal (µ-SO), where µ ∈ ℤ₊, if for every index set ι = {(i₁,t₁), …, (i_µ,t_µ)} in ℐ ≐ {ι | i_m ∈ {1,…,N_AP}, t_m ∈ {1,…,T_j}, m = 1,…,µ}, at least one of the following holds:
• ξⱼᵈᵉᵐ is locally-optimal after removing the constraints associated with p_{i_m} on κᵈᵉᵐ_{t_m}, for all (i_m, t_m) ∈ ι.
• For each index (i_m, t_m) ∈ ι, the formula is not satisfied for a perturbed Z, denoted Ẑ, where Ẑ_{i_m,t_m}(θᵖ_{i_m}) = ¬Z_{i_m,t_m}(θᵖ_{i_m}), for all m = 1,…,µ, and Ẑ_{i′,t′}(θᵖ_{i′}) = Z_{i′,t′}(θᵖ_{i′}) for all (i′,t′) ∉ ι.
• ξⱼᵈᵉᵐ is infeasible with respect to Ẑ.

Spec-optimality enforces a level of logical optimality: if a state κₜʲ on demonstration ξⱼᵈᵉᵐ lies inside/outside of AP i (i.e. G_i(κₜʲ, θᵖᵢ) ≤ 0 / ≥ 0), and the cost c(ξⱼᵈᵉᵐ) can be lowered if that AP constraint is relaxed, then the constraint must hold to satisfy the specification. Intuitively, this means that the demonstrator does not visit/avoid APs which will needlessly increase the cost and are not needed to complete the task. Further spec-optimality details are presented in App. B of [16]. As globally-optimal demonstrations must also be spec-optimal, i.e. ϕ_g ⊆ ϕ_µ-SO (c.f. Lem. 1), we will use spec-optimality to vastly reduce the search space when searching for formulas which make the demonstrations globally-optimal (Sec. V-C).

C. Counterexample-guided framework

In this section, we will assume that the demonstrator returns a solution to Prob. 1 which is boundedly-suboptimal with respect to the globally optimal solution, in that c(ξⱼᵈᵉᵐ) ≤ (1+δ) c(ξⱼ*), for a known δ, where c(ξⱼ*) is the cost of the optimal solution. This is reasonable as the demonstration should be feasible (completes the task), but may be suboptimal in terms of cost (i.e. path length, etc.), and δ can be estimated from repeated demonstrations. Under this assumption, any trajectory ξ_{xu} satisfying the known constraints η̄(ξ_{xu}) ∈ 𝒮̄ at a cost lower than the suboptimality bound, i.e. c(ξ_{xu}) ≤ c(ξⱼᵈᵉᵐ)/(1+δ), must violate ϕ(θˢ, θᵖ) [14, 15]. We can use this to reject candidate structures θ̂ˢ and parameters θ̂ᵖ. If we can find a counterexample trajectory that satisfies the candidate LTL formula ϕ(θ̂ˢ, θ̂ᵖ) at a lower cost by solving Prob. 7,

Problem 7 (Counterexample search):
find ξ_{xu}
subject to ξ_{xu} ⊨ ϕ(θ̂ˢ, θ̂ᵖ)
η̄(ξ_{xu}) ∈ 𝒮̄(ξⱼᵈᵉᵐ) ⊆ 𝒞̄
c(ξ_{xu}) < c(ξⱼᵈᵉᵐ)/(1+δ)

then ϕ(θ̂ˢ, θ̂ᵖ) cannot be consistent with the demonstration. Using this insight, we search for consistent θ̂ˢ, θ̂ᵖ by iteratively proposing candidate θ̂ˢ, θ̂ᵖ by solving Prob. 8 (a modified Prob. 4, to be discussed shortly) and searching for counterexamples proving the parameters are invalid, eventually returning a consistent formula; this is summarized in Alg. 1. We present falsification loop heuristics in App. C of [16]. We now discuss the core components of Alg. 1 (Probs. 7 and 8) in detail.

Algorithm 1: Falsification
1  Input: {ξⱼᵈᵉᵐ}_{j=1}^{N_s}, 𝒮̄; Output: θ̂ˢ, θ̂ᵖ
2  N_DAG ← 0, {ξ^{¬s}} ← {}
3  while ¬consistent do
4    N_DAG ← N_DAG + 1
5    while Problem 8 is feasible do
6      θ̂ˢ, θ̂ᵖ ← Problem 8({ξⱼᵈᵉᵐ}_{j=1}^{N_s}, {ξ^{¬s}}, N_DAG)
7      for j = 1 to N_s do
8        ξ_{xu} ← Problem 7(ξⱼᵈᵉᵐ)
9        if c(ξ_{xu}) < c(ξⱼᵈᵉᵐ)/(1+δ) then {ξ^{¬s}} ← {ξ^{¬s}} ∪ ξ_{xu}
10     if ¬∨_{j=1}^{N_s} (c(ξ_{xu}) < c(ξⱼᵈᵉᵐ)/(1+δ)) then consistent ← ⊤; break

Counterexample generation: We propose different methods to solve Prob. 7 based on the dynamics. For piecewise affine systems, Prob. 7 can be solved directly as a MILP [41]. However, the LTL planning problem for general nonlinear systems is challenging [20, 30]. Probabilistically-complete sampling-based methods [20, 30] or falsification tools [2] can be applied, but can be slow on high-dimensional systems. For simplicity and speed, we solve Prob. 7 by finding a trajectory ξ̂_{xu} ⊨ ϕ(θ̂ˢ, θ̂ᵖ) and boolean assignment Z for a kinematic approximation of the dynamics via solving a MILP, then warm-start the nonlinear optimizer with ξ̂_{xu} and constrain it to be consistent with Z, returning some ξ_{xu}. If c(ξ_{xu}) < c(ξⱼᵈᵉᵐ)/(1+δ), then we return; otherwise, we generate a new ξ̂_{xu}. Whether this method returns a valid counterexample depends on if the nonlinear optimizer converges to a feasible solution; hence, this approach is not complete. However, we show that it works well in practice (see Sec. VIII).
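The control flow of Alg. 1 is compact; the following skeleton (ours; the two solver callables are placeholders for the MILP and counterexample search just described) mirrors it:

```python
def falsify(demos, solve_prob8, solve_prob7, max_dag=20):
    """Skeleton of Alg. 1 (names and solver interfaces are placeholders).
    solve_prob8(demos, counterexamples, n_dag) -> (theta_s, theta_p) or None
    solve_prob7(demo, theta_s, theta_p) -> trajectory beating the
        suboptimality bound c(demo)/(1 + delta), or None if none found."""
    counterexamples = []
    for n_dag in range(1, max_dag + 1):        # grow formula size outward
        while True:
            cand = solve_prob8(demos, counterexamples, n_dag)
            if cand is None:                   # no formula of this size left
                break
            theta_s, theta_p = cand
            new_cex = [xi for xi in (solve_prob7(d, theta_s, theta_p)
                                     for d in demos) if xi is not None]
            if not new_cex:                    # no demo can be undercut:
                return theta_s, theta_p        # candidate is consistent
            counterexamples += new_cex         # else falsified; enforce (14)
    return None
```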
Unifying parameter and structure search: When both θᵖ and θˢ are unknown, they must be jointly learned due to their interdependence: learning the structure involves finding an unknown boolean function of θᵖ, parameterized by θˢ, while learning the AP parameters θᵖ requires knowing which APs were selected or negated, determined by θˢ. This can be done by combining the KKT (9) and DAG constraints (12)-(14) into a single MILP, which can then be integrated into Alg. 1:

Problem 8 (Learning θᵖ, θˢ by global optimality, KKT):
find 𝒟, Sⱼᵈᵉᵐ, Sⱼ^{¬s}, θᵖ, λₜ^{j,k}, λ^{j,¬k}_{i,t}, νₜ^{j,k}, sʲᵢ,ₜ, Qʲᵢ,ₜ, Zʲ, ∀i, j, t
s.t. {KKT_LTL(ξⱼᵈᵉᵐ)}_{j=1}^{N_s}
well-formedness constraints for 𝒟
Equations (12)-(13), j = 1,…,N_s
Equation (14), j = 1,…,N_{¬s}

In Prob. 8, since 1) the Zʲᵢ(θᵖᵢ) at the leaf nodes of 𝒟 are constrained via (7) to be consistent with θᵖ and ξⱼᵈᵉᵐ and 2) the formula defined by 𝒟 is constrained to be satisfied for the Zʲ via (12), the low-level demonstration ξⱼᵈᵉᵐ must be feasible for the overall LTL formula defined by the DAG, i.e. ϕ(θˢ, θᵖ), where θˢ = 𝒟. KKT_LTL(ξⱼᵈᵉᵐ) then chooses AP parameters θᵖ to make ξⱼᵈᵉᵐ locally-optimal for the continuous optimization induced by a fixed realization of boolean variables. Overall, Prob. 8 finds a pair of θᵖ and θˢ which makes ξⱼᵈᵉᵐ locally-optimal for a fixed Zʲ which is feasible for ϕ(θˢ, θᵖ), i.e. Φ(Zʲ, θᵖ, θˢ) = 1, for all j. To also impose the spec-optimality conditions (Def. 1), we can add these constraints to Prob. 8:

$$\begin{array}{ll}
S_{j,(\mathrm{root},1)}^{\mathrm{dem},\hat{Z}_n} \le b_{nj}^1 & (15a)\\
\|\lambda_{i_m,t_m}^{j,\neg k\,\top} \nabla_{x_t} g_{i_m}^{\neg k}(\eta(x_t^j), \theta_{i_m}^p)\| \le M(1 - b_{nj}^2), \quad m = 1,\ldots,\mu & (15b)\\
g_{i_m}^{\neg k}(\eta(x_t^j), \theta_{i_m}^p) \ge -M(1 - e_{nm}^j), \quad m = 1,\ldots,\mu & (15c)\\
\mathbf{1}_{N_{i_m}^{\mathrm{ineq}}}^\top e_{nm}^j \ge \hat{Z}_{i_m t_m}^j(\theta_{i_m}^p) - b_{nj}^3, \quad m = 1,\ldots,\mu & (15d)\\
g_{i_m}^{\neg k}(\eta(x_t^j), \theta_{i_m}^p) \le M(\hat{Z}_{i_m,t_m}^j + b_{nj}^3) & (15e)\\
b_{nj}^1 + b_{nj}^2 + b_{nj}^3 \le 1, \quad b_{nj} \in \{0,1\}^3, \ e_{nm}^j \in \{0,1\}^{N_{i_m}^{\mathrm{ineq}}} & (15f)
\end{array}$$

for n = 1,…,|ℐ|, where Sⱼ^{dem,Ẑₙ} is the satisfaction matrix for ξⱼᵈᵉᵐ where the leaf nodes are perturbed to take the values of Ẑₙ, where n indexes an ι ∈ ℐ. (15a) models the case when the formula is not satisfied, (15b) models when ξⱼᵈᵉᵐ remains locally-optimal upon relaxing the constraint (zero stationarity contribution), and (15c)-(15e) model the infeasible case.

Remark 1: If µ = 1, the infeasibility constraints (15c)-(15e) can be ignored (since together with (15a), they are redundant), and we can modify (15f) to b¹_{nj} + b²_{nj} ≤ 1, b_{nj} ∈ {0,1}².

Remark 2: It is only useful to enforce spec-optimality on index pairs (i₁,t₁),…,(i_µ,t_µ) where G_{i_m}(κʲ_{t_m}, θᵖ_{i_m}) = 0 for all m = 1,…,µ; otherwise the infeasibility case automatically holds. If θᵖ is unknown, we won't know a priori when this holds, but if θᵖ are (approximately) known, we can pre-process so that spec-optimality is only enforced for salient ι ∈ ℐ.

Remark 3: Prob. 8 with spec-optimality constraints (15) can be used to directly search for a ϕ(θ̂ˢ, θ̂ᵖ) which can be satisfied by visiting a set of APs in any order (i.e. surveillance-type tasks) without using the loop in Alg. 1, since (15) directly enforces that any AP (1-SO) or a set of APs (µ-SO) which were visited and which prevent the trajectory cost from being lowered must be visited for any candidate ϕ(θ̂ˢ, θ̂ᵖ).

VI. LEARNING COST FUNCTION PARAMETERS (θᵖ, θˢ, θᶜ)

If θᶜ is unknown, it can be learned by modifying KKT_LTL to also consider θᶜ in the stationarity condition: all terms of the form ∇_{ξ_{xu}} c(ξⱼᵈᵉᵐ) should be modified to ∇_{ξ_{xu}} c(ξⱼᵈᵉᵐ, θᶜ). When ∇c(·,·) is affine in θᶜ for fixed ξⱼᵈᵉᵐ, the stationarity condition is representable with a MILP constraint. However, the falsification loop in Alg. 1 requires a fixed cost function in order to judge if a trajectory is a counterexample. Thus, one valid approach is to first solve Prob. 8, searching also for θᶜ, then fixing θᶜ, and running Alg. 1 for the fixed θᶜ (see App. A of [16]). Note that this procedure either eventually returns an LTL formula consistent with the fixed θᶜ, or Alg. 1 becomes infeasible, and a new θᶜ must be generated and Alg. 1 rerun. While this procedure is guaranteed to eventually return a set of θᶜ, θˢ, and θᵖ which make each ξⱼᵈᵉᵐ globally-optimal with respect to c(ξ_{xu}, θᶜ) under ϕ(θˢ, θᵖ), it may require iterating through an infinite number of candidate θᶜ and hence is not guaranteed to terminate in finite time (Cor. C.3). Nevertheless, we note that for certain simple classes of formulas (Rem. 3), a consistent set of θᶜ, θˢ, and θᵖ can be recovered in one shot.
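To see why stationarity remains MILP-representable once θᶜ enters, suppose (as an illustration consistent with the experiments in Sec. VIII) that the cost is a weighted sum of known features; then its gradient is linear in the weights:

$$c(\xi_{xu},\theta^c)=\sum_k \theta^c_k\, c_k(\xi_{xu}) \;\Rightarrow\; \nabla_{x_t} c(\xi^{\mathrm{dem}}_j,\theta^c)=\sum_k \theta^c_k\,\nabla_{x_t} c_k(\xi^{\mathrm{dem}}_j),$$

and each ∇_{x_t} c_k(ξⱼᵈᵉᵐ) is a constant once evaluated on the demonstration, so (9g) stays a sum of continuous parameters times known constants, plus the binary-times-continuous products already handled by big-M.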
VII. THEORETICAL ANALYSIS

In this section, we prove that our method is complete under some assumptions, without (Thm. 1) or with (Cor. 2) spec-optimality, and that we can compute guaranteed conservative estimates of 𝒮_i/𝒜_i (Thm. 2). Finally, we show that using stronger optimality assumptions on the demonstrator shrinks the set of consistent formulas (Thm. 3). See App. C of [16] for proofs.

Assumption 1: Prob. 7 is solved with a complete planner.
Assumption 2: Each demonstration is locally-optimal (i.e. satisfies the KKT conditions) for fixed boolean variables.
Assumption 3: The true parameters θᵖ, θˢ, and θᶜ are in the hypothesis space of Prob. 8: θᵖ ∈ Θᵖ, θˢ ∈ Θˢ, θᶜ ∈ Θᶜ.

Theorem 1 (Completeness & consistency, unknown θˢ, θᵖ): Under Assumptions 1-3, Alg. 1 is guaranteed to return a formula ϕ(θˢ, θᵖ) such that 1) ξⱼᵈᵉᵐ ⊨ ϕ(θˢ, θᵖ) and 2) ξⱼᵈᵉᵐ is globally-optimal under ϕ(θˢ, θᵖ), for all j, 3) if such a formula exists and is representable by the provided grammar.

Corollary 1 (Shortest formula): Let N* be the minimal size DAG for which there exists (θᵖ, θˢ) such that ξⱼᵈᵉᵐ ⊨ ϕ(θˢ, θᵖ) for all j. Under Assumptions 1-3, Alg. 1 is guaranteed to return a DAG of size N*.

Lemma 1: All globally-optimal trajectories are µ-SO.

Corollary 2 (Alg. 1 with spec-optimality): By modifying Alg. 1 so that Prob. 8 uses constraints (15), Alg. 1 still returns a consistent solution ϕ(θ̂ˢ, θ̂ᵖ) if one exists, i.e. each ξⱼᵈᵉᵐ is feasible and globally optimal for ϕ(θ̂ˢ, θ̂ᵖ).

Remark 4: Please refer to App. C of [16] for a corollary proving that the modified falsification loop described in Sec. VI is guaranteed to return a consistent formula, if it terminates.

Theorem 2 (Conservativeness for unknown θᵖ): Suppose that θˢ and θᶜ are known, and θᵖ is unknown. Then, extracting 𝒢ˢᵢ and 𝒢₋ˢᵢ, as defined in (10)-(11), from the feasible set of Prob. 4 projected onto Θᵖᵢ (denoted ℱ_i), returns 𝒢ˢᵢ ⊆ 𝒮_i and 𝒢₋ˢᵢ ⊆ 𝒜_i, for all i ∈ {1,…,N_AP}.

Theorem 3 (Distinguishability): For the consistent formula sets defined in Sec. V-B, we have ϕ_g ⊆ ϕ_µ̃-SO ⊆ ϕ_µ̂-SO ⊆ ϕ_f, for µ̃ > µ̂.

VIII. EXPERIMENTAL RESULTS

We show that our algorithm outperforms a competing method, can learn shared task structure from demonstrations across environments, and can learn LTL formulas θᵖ, θˢ and uncertain cost functions θᶜ on high-dimensional problems. Please refer to the supplementary video for visualizations of the results: https://youtu.be/cpUEcWCUMqc.

Learning shared task structure: In this example, we show that our method can extract logical structure shared between demonstrations that complete the same high-level task, but in different environments (Fig. 5). A point robot must first go to the mug (p₁), then go to the coffee machine (p₂), and then go to goal (p₃) while avoiding obstacles (p₄, p₅). As the floor maps differ, the θᵖ also differ, and are assumed known. We add two relevant primitives to the grammar, sequence: ϕ₁ 𝒬 ϕ₂ ≐ ¬ϕ₂ 𝒰_{[0,T_j−1]} ϕ₁, enforcing that ϕ₂ cannot occur until after ϕ₁ has occurred for the first time, and avoid: 𝒱 ϕ ≐ □_{[0,T_j−1]} ¬ϕ, enforcing ϕ never holds over [1,T_j]. Then, the true formula is: ϕ* = 𝒱 p₄ ∧ 𝒱 p₅ ∧ (p₁ 𝒬 p₂) ∧ (p₂ 𝒬 p₃) ∧ ♦_{[0,T_j−1]} p₃.

Suppose first that we are given the blue demonstration in Env. 2. Running Alg. 1 with 1-SO constraints (15) terminates in one iteration at N_DAG = 14 with ϕ₀ = 𝒱 p₄ ∧ 𝒱 p₅ ∧ ♦_{[0,T_j−1]} p₂ ∧ ♦_{[0,T_j−1]} p₃ ∧ (p₁ 𝒬 p₂): always avoid obstacles 1 and 2, eventually reach coffee and goal, and visit mug before coffee. This formula is insufficient to complete the true task (the ordering constraint between coffee and goal is not learned). This is because the optimal trajectories satisfying ϕ₀ and ϕ* are the same cost, i.e. both ϕ₀ and ϕ* are consistent with the demonstration and could have been returned, and ϕ₀, ϕ* ∈ ϕ_g (c.f. Sec. VII). Now, we also use the blue demonstration from Env. 1 (two examples total). Running Alg. 1 terminates in two iterations at N_DAG = 14 with the formulas ϕ₁ = 𝒱 p₄ ∧ 𝒱 p₅ ∧ ♦_{[0,T_j−1]} p₁ ∧ ♦_{[0,T_j−1]} p₂ ∧ ♦_{[0,T_j−1]} p₃ and ϕ₂ = ϕ*. Since the demonstration in Env. 1 doubles back to the coffee before going to goal, increasing its cost over first going to goal and then to coffee, the ordering constraint between the two is learnable. We also plot the generated counterexample (Fig. 5, yellow), which achieves a lower cost, as ϕ₁ involves no ordering constraints.

We use the learned formula to plan a path completing the task in a new environment (App. E of [16]). Overall, this example suggests we can use demonstrations in different environments to learn shared task structure and disambiguate between potential explanations.

Fig. 4. Toy example for baseline comparison [23].

Baseline comparison: Likely the closest method to ours is [23], which learns a pSTL formula that is tightly satisfied by the demonstrations via solving a nonconvex problem to local optimality: arg max_{θᵖ} min_j τ(θᵖ, ξⱼᵈᵉᵐ), where τ(θᵖ, ξⱼᵈᵉᵐ) measures how tightly ξⱼᵈᵉᵐ fits the learned formula. We run the authors' code [22] on a toy problem (see Fig. 4), where the demonstrator has kinematic constraints, minimizes path length, and satisfies start/goal constraints and ϕ = ♦_{[0,8]} p₁, where x ⊨ p₁ ⇔ [I_{2×2}, −I_{2×2}]ᵀ x ≤ [3, 2, −1, 2]ᵀ ≐ [3, θᵖᵀ]ᵀ. We assume the structure θˢ is known, and we aim to learn θᵖ to explain why the demonstrator deviated from an optimal straight-line path to the goal. Solving Prob. 6 returns 𝒢ˢ¹ = 𝒮₁ (Fig. 4, right). On the other hand, we run TeLEx multiple times, converging to different local optima, each corresponding to a "tight" θᵖ (Fig. 4, center): TeLEx cannot distinguish between multiple different "tight" θᵖ, which makes sense, as the method tries to find any "tight" solution. This example suggests that if the demonstrations are goal-directed, a method that leverages their optimality is likely to better explain them.

Fig. 5. Different environments (different θᵖ) with shared task (same θˢ).

Multi-stage manipulation task: We consider the setup in Figs. 1 and 6 of teaching a 7-DOF Kuka iiwa robot arm to prepare a drink: first move the end effector to the button on the faucet (p₁), then grasp the cup (p₂), then move the cup to the customer (p₃), all while avoiding obstacles. After grasping the cup, an end-effector pose constraint (α, β, γ) ∈ 𝒮₄(θᵖ₄) (p₄) must be obeyed. We add two "distractor" APs: a different cup (p₅) and a region (p₆) where the robot can hand off the cup. We also modify the grammar to include the sequence operator 𝒬 (defined as before), and add an "after" operator ϕ₁ 𝒯 ϕ₂ ≐ □_{[0,T_j−1]}(ϕ₂ → □_{[0,T_j−1]} ϕ₁), that is, ϕ₁ must hold after and including the first timestep where ϕ₂ holds. The true formula is: ϕ* = (p₁ 𝒬 p₂) ∧ (p₂ 𝒬 p₃) ∧ ♦_{[0,T_j−1]} p₃ ∧ (p₄ 𝒯 p₂). We use a kinematic arm model: jⁱ_{t+1} = jⁱₜ + uⁱₜ, i = 1,…,7, where ‖uₜ‖₂ ≤ 1 for all t. Two suboptimal human demonstrations (δ = 0.7) optimizing c(ξ_{xu}) = Σ_{t=1}^{T−1} ‖j_{t+1} − jₜ‖₂² are recorded in virtual reality. We assume we have nominal estimates of the AP regions 𝒮_i(θᵖ_{i,nom}) (i.e. from a vision system), and we want to learn the θˢ and θᵖ of ϕ*.

We run Alg. 1 with the 1-SO constraints (15), and encode the nominal θᵖᵢ by enforcing that Θᵖᵢ = {θᵖᵢ | ‖θᵖᵢ − θᵖ_{i,nom}‖ ≤ 0.05}. At N_DAG = 11, the inner loop runs for 3 iterations (each taking 30 minutes on an i7-7700K processor), returning candidates ϕ₁ = (p₁ 𝒬 p₃) ∧ (p₂ 𝒬 p₃) ∧ (♦_{[0,T_j−1]} p₃) ∧ (p₄ 𝒯 p₃), ϕ₂ = (p₁ 𝒬 p₃) ∧ (p₂ 𝒬 p₃) ∧ (♦_{[0,T_j−1]} p₃) ∧ (p₄ 𝒯 p₂), and ϕ₃ = ϕ*. ϕ₁ says that before going to the customer, the robot has to visit the button and cup in any order, and then must satisfy the pose constraint after visiting the cup. ϕ₂ has the meaning of ϕ*, except the robot can go to the button or cup in any order. Note that ϕ₃ is a stronger formula than ϕ₂, and ϕ₂ than ϕ₁; this is a natural result of the falsification loop, which returns incomparable or stronger formulas with more iterations, as the counterexamples rule out weaker or equivalent formulas. Also note that the distractor APs don't feature in the learned formulas, even though both demonstrations pass through p₆. This happens for two reasons: we increase N_DAG incrementally and there was no room to include distractor objects in the formula (since spec-optimality may enforce that p₁-p₃ appear in the formula), and even if N_DAG were not minimal, p₆ would not be guaranteed to show up, since visiting p₆ does not increase the trajectory cost.

We plot the counterexamples in Fig. 6: blue/purple are from iteration 1; orange is from iteration 2. They save cost by violating the ordering and pose constraints: from the left start state, the robot can save cost if it visits the cup before the button (blue, orange trajectories), and loosening the pose constraint can reduce joint space cost (orange, purple trajectories). The right demonstration produces no counterexample in iteration 2, as it is optimal for this formula (changing the order does not lower the optimal cost). For the learned θᵖ, θᵖᵢ = θᵖ_{i,nom} except for p₂, p₃, where the box shrinks slightly from the nominal; this is because doing so enables a Lagrange multiplier to be increased to reduce the KKT residual. We use the learned θᵖ, θˢ to plan formula-satisfying trajectories from new start states (see App. F of [16]). Overall, this example suggests that Alg. 1 can recover θᵖ and θˢ on a high-dimensional problem and ignore distractor APs, despite demonstration suboptimality.

Fig. 6. Demonstrations and counterexamples for the manipulation task.

Multi-stage quadrotor surveillance: We demonstrate that we can jointly learn θᵖ, θˢ, and θᶜ in one shot on a 12D nonlinear quadrotor system (see App. G of [16]). We are given four demonstrations of a quadrotor surveilling a building (Fig. 7): it needs to visit three regions of interest (Fig. 7, green) while not colliding with the building. All visitation constraints can be learned directly with 1-SO (see Rem. 3), and collision-avoidance can also be learned with 1-SO, with enough demonstrations. The true formula is ϕ* = ♦_{[0,T_j−1]} p₁ ∧ ♦_{[0,T_j−1]} p₂ ∧ ♦_{[0,T_j−1]} p₃ ∧ □_{[0,T_j−1]} ¬p₄, where p₁-p₃ represent the regions of interest and p₄ is the building. We aim to learn θᵖᵢ for the parameterization 𝒮_i(θᵖᵢ) = {[I_{3×3}, −I_{3×3}]ᵀ [x, y, z]ᵀ ≤ θᵖᵢ}, assuming θᵖ_{4,6} = 0 (the building is not hovering). The demonstrations minimize c(ξ_{xu}, θᶜ) = Σ_{r∈R} Σ_{t=1}^{T−1} γ_r (r_{t+1} − rₜ)², where R = {x, y, z, α̇, β̇, γ̇} and γ_r = 1, i.e. equal penalties to path length and angular acceleration. We assume γ_r ∈ [0.1, 1] and is unknown: we want to learn the weights for each state.

Fig. 7. Quadrotor surveillance demonstrations and learning curves.

Solving Prob. 8 with 1-SO conditions (at N_DAG = 12) takes 44 minutes and recovers θᵖ, θˢ, and θᶜ in one shot. To evaluate the learned θᵖ, we show in Fig. 7 that the coverage of the 𝒢ˢᵢ and 𝒢₋ˢᵢ for each p_i (computed by fixing the learned θˢ and running Prob. 6) monotonically increases with more data. In terms of recovered θˢ, with one demonstration, we return ϕ₁ = ♦_{[0,T_j−1]} p₂ ∧ ♦_{[0,T_j−1]} p₃ ∧ ♦_{[0,T_j−1]} p₄ ∧ □_{[0,T_j−1]} ¬p₁. This highlights the fact that since we are not provided labels, there is an inherent ambiguity in how to label the regions of interest (i.e. p_i, i = 1,…,3 can be associated with any of the green boxes in Fig. 7 and be consistent). Also, one of the regions of interest in ϕ₁ gets labeled as the obstacle (i.e. p₁ and p₄ are swapped), since one demonstration is not enough to disambiguate which of the four p_i should touch the ground. Note that this ambiguity can be eliminated if labels are provided (see App. B of [16]) or if more demonstrations are provided: for two and more demonstrations, we learn ϕᵢ = ϕ*, i = 2,…,4. When using all four demonstrations, we recover the cost parameters θᶜ and structure θˢ exactly, i.e. ϕ(θ̂ˢ, θ̂ᵖ) = ϕ*, and fixing the learned θˢ and running Prob. 6 returns 𝒢ˢᵢ = 𝒮ᵢ and 𝒢₋ˢᵢ = 𝒜ᵢ, for all i. The learned θᶜ, θˢ, and θᵖ are used to plan trajectories that efficiently complete the task for different initial and goal states (see App. G of [16]). Overall, this example suggests that our method can jointly recover consistent θᵖ, θˢ, and θᶜ for high-dimensional systems.
IX. CONCLUSION

We present a method that learns LTL formulas with unknown atomic propositions and logical structure from only positive demonstrations, assuming the demonstrator is optimizing an uncertain cost function. We use both implicit (KKT) and explicit (algorithmically generated lower-cost trajectories) optimality conditions to reduce the hypothesis space of LTL specifications consistent with the demonstrations. In future work, we aim to robustify our method to mislabeled demonstrations, explicitly consider demonstration suboptimality arising from risk, and reduce our method's computation time.

Acknowledgments: We thank Daniel Neider for insightful discussions. This work is supported in part by an NDSEG fellowship, NSF grants IIS-1750489 and ECCS-1553873, and ONR grants N00014-17-1-2050 and N00014-18-1-2501.

REFERENCES
[1] Pieter Abbeel and Andrew Y. Ng. Apprenticeship learning via inverse reinforcement learning. In International Conference on Machine Learning (ICML), 2004.
[2] Yashwanth Annpureddy, Che Liu, Georgios E. Fainekos, and Sriram Sankaranarayanan. S-TaLiRo: A tool for temporal logic falsification for hybrid systems. In Tools and Algorithms for the Construction and Analysis of Systems (TACAS), pages 254-257, 2011.
[3] Brandon Araki, Kiran Vodrahalli, Thomas Leech, Cristian Ioan Vasile, Mark Donahue, and Daniela Rus. Learning to plan with logical automata. In Robotics: Science and Systems (RSS), 2019.
[4] Brenna Argall, Sonia Chernova, Manuela Veloso, and Brett Browning. A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57:469-483, 2009.
[5] Christel Baier and Joost-Pieter Katoen. Principles of Model Checking. MIT Press, 2008.
[6] Alexey Bakhirkin, Thomas Ferrère, and Oded Maler. Efficient parametric identification for STL. In Hybrid Systems: Computation and Control (HSCC), pages 177-186, 2018.
[7] Dimitris Bertsimas and John Tsitsiklis. Introduction to Linear Optimization. Athena Scientific, 1st edition, 1997.
[8] Armin Biere, Keijo Heljanko, Tommi A. Junttila, Timo Latvala, and Viktor Schuppan. Linear encodings of bounded LTL model checking. Logical Methods in Computer Science, 2(5), 2006.
[9] Giuseppe Bombara, Cristian Ioan Vasile, Francisco Penedo, Hirotoshi Yasuoka, and Calin Belta. A decision tree approach to data classification using signal temporal logic. In Hybrid Systems: Computation and Control (HSCC), pages 1-10, 2016.
[10] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[11] Sara Bufo, Ezio Bartocci, Guido Sanguinetti, Massimo Borelli, Umberto Lucangelo, and Luca Bortolussi. Temporal logic based monitoring of assisted ventilation in intensive care patients. In Leveraging Applications of Formal Methods, Verification and Validation (ISoLA), pages 391-403, 2014.
[12] Sylvain Calinon and Aude Billard. A probabilistic programming by demonstration framework handling constraints in joint space and task space. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2008.
[13] Alberto Camacho and Sheila A. McIlraith. Learning interpretable models expressed in linear temporal logic. In International Conference on Automated Planning and Scheduling (ICAPS), pages 621-630, 2019.
[14] Glen Chou, Dmitry Berenson, and Necmiye Ozay. Learning constraints from demonstrations. In Workshop on the Algorithmic Foundations of Robotics (WAFR), 2018. URL https://arxiv.org/abs/1812.07084.
[15] Glen Chou, Necmiye Ozay, and Dmitry Berenson. Learning parametric constraints in high dimensions from demonstrations. In Conference on Robot Learning (CoRL), 2019. URL https://arxiv.org/abs/1910.03477.
[16] Glen Chou, Necmiye Ozay, and Dmitry Berenson. Explaining multi-stage tasks by learning temporal logic formulas from suboptimal demonstrations. Robotics: Science and Systems (RSS), extended version, 2020. URL https://arxiv.org/abs/2006.02411.
[17] Glen Chou, Necmiye Ozay, and Dmitry Berenson. Learning constraints from locally-optimal demonstrations under cost function uncertainty. IEEE Robotics and Automation Letters (RA-L), 2020. URL https://arxiv.org/abs/2001.09336.
[18] Stéphane Demri and Philippe Schnoebelen. The complexity of propositional linear temporal logics in simple cases. Information and Computation, 174(1):84-103, 2002.
[19] Peter Englert, Ngo Anh Vien, and Marc Toussaint. Inverse KKT: Learning cost functions of manipulation tasks from demonstrations. International Journal of Robotics Research (IJRR), 36(13-14):1474-1488, 2017.
[20] Jie Fu, Ivan Papusha, and Ufuk Topcu. Sampling-based approximate optimal control under temporal logic constraints. In Hybrid Systems: Computation and Control (HSCC), pages 227-235, 2017.
[21] Sumit Kumar Jha, Edmund M. Clarke, Christopher James Langmead, Axel Legay, André Platzer, and Paolo Zuliani. A Bayesian approach to model checking biological systems. In Computational Methods in Systems Biology (CMSB), pages 218-234, 2009.
[22] Susmit Jha. susmitjha/TeLEX. URL https://github.com/susmitjha/TeLEX.
[23] Susmit Jha, Ashish Tiwari, Sanjit A. Seshia, Tuhin Sahai, and Natarajan Shankar. TeLEx: learning signal temporal logic from positive examples using tightness metric. Formal Methods in System Design, 54(3):364-387, 2019.
[24] Miles Johnson, Navid Aghasadeghi, and Timothy Bretl. Inverse optimal control for deterministic continuous-time nonlinear systems. In IEEE Conference on Decision and Control (CDC), 2013.
[25] Arezou Keshavarz, Yang Wang, and Stephen P. Boyd. Imputing a convex objective function. In IEEE International Symposium on Intelligent Control (ISIC), pages 613-619, 2011.
[26] Zhaodan Kong, Austin Jones, Ana Medina Ayala, Ebru Aydin Gol, and Calin Belta. Temporal logic inference for classification and prediction from data. In Hybrid Systems: Computation and Control (HSCC), pages 273-282, 2014.
[27] Zhaodan Kong, Austin Jones, and Calin Belta. Temporal logics for learning and detection of anomalous behavior. IEEE Transactions on Automatic Control, 62(3):1210-1222, 2017.
[28] Sanjay Krishnan, Animesh Garg, Richard Liaw, Brijen Thananjeyan, Lauren Miller, Florian T. Pokorny, and Ken Goldberg. SWIRL: A sequential windowed inverse reinforcement learning algorithm for robot tasks with delayed rewards. International Journal of Robotics Research (IJRR), 38(2-3), 2019.
[29] Karen Leung, Nikos Aréchiga, and Marco Pavone. Backpropagation for parametric STL. In IEEE Intelligent Vehicles Symposium (IV), pages 185-192, 2019.
[30] Lening Li and Jie Fu. Sampling-based approximate optimal temporal logic planning. In IEEE International Conference on Robotics and Automation (ICRA), pages 1328-1335, 2017.
[31] Daniel Neider and Ivan Gavran. Learning linear temporal properties. In Formal Methods in Computer Aided Design (FMCAD), pages 1-10, 2018.
[32] Andrew Y. Ng and Stuart J. Russell. Algorithms for inverse reinforcement learning. In International Conference on Machine Learning (ICML), pages 663-670, 2000.
[33] A. L. Pais, Keisuke Umezawa, Yoshihiko Nakamura, and A. Billard. Learning robot skills through motion segmentation and constraints extraction. In ACM/IEEE International Conference on Human-Robot Interaction (HRI), 2013.
[34] Ivan Papusha, Min Wen, and Ufuk Topcu. Inverse optimal control with regular language specifications. In American Control Conference (ACC), pages 770-777, 2018.
[35] Pravesh Ranchod, Benjamin Rosman, and George Dimitri Konidaris. Nonparametric Bayesian reward segmentation for skill discovery using inverse reinforcement learning. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 471-477, 2015.
[36] Nathan D. Ratliff, J. Andrew Bagnell, and Martin Zinkevich. Maximum margin planning. In International Conference on Machine Learning (ICML), pages 729-736, 2006.
[37] Francesco Sabatino. Quadrotor control: modeling, nonlinear control design, and simulation, 2015.
[38] Ankit Shah, Pritish Kamath, Julie A. Shah, and Shen Li. Bayesian inference of temporal task specifications from demonstrations. In Advances in Neural Information Processing Systems (NeurIPS), pages 3808-3817, 2018.
[39] Prashant Vaidyanathan, Rachael Ivison, Giuseppe Bombara, Nicholas A. DeLateur, Ron Weiss, Douglas Densmore, and Calin Belta. Grid-based temporal logic inference. In IEEE Conference on Decision and Control (CDC), pages 5354-5359, 2017.
[40] Marcell Vazquez-Chanlatte, Susmit Jha, Ashish Tiwari, Mark K. Ho, and Sanjit A. Seshia. Learning task specifications from demonstrations. In Advances in Neural Information Processing Systems (NeurIPS), pages 5372-5382, 2018.
[41] Eric M. Wolff, Ufuk Topcu, and Richard M. Murray. Optimization-based trajectory generation with linear temporal logic specifications. In IEEE International Conference on Robotics and Automation (ICRA), pages 5319-5325, 2014.
[42] Zhe Xu, Alexander J. Nettekoven, A. Agung Julius, and Ufuk Topcu. Graph temporal logic inference for classification and identification. In IEEE Conference on Decision and Control (CDC), pages 4761-4768, 2019.
[43] Weichao Zhou and Wenchao Li. Safety-aware apprenticeship learning. In Computer Aided Verification (CAV), pages 662-680, 2018.