SPOTTER: Extending Symbolic Planning Operators through Targeted Reinforcement Learning
Vasanth Sarathy* (Smart Information Flow Technologies, Lexington, MA), Daniel Kasenberg* (Tufts University, Medford, MA), Shivam Goel (Tufts University, Medford, MA), Jivko Sinapov (Tufts University, Medford, MA), Matthias Scheutz (Tufts University, Medford, MA)

* The first two authors contributed equally.

ABSTRACT
Symbolic planning models allow decision-making agents to sequence actions in arbitrary ways to achieve a variety of goals in dynamic domains. However, they are typically handcrafted and tend to require precise formulations that are not robust to human error. Reinforcement learning (RL) approaches do not require such models, and instead learn domain dynamics by exploring the environment and collecting rewards. However, RL approaches tend to require millions of episodes of experience and often learn policies that are not easily transferable to other tasks. In this paper, we address one aspect of the open problem of integrating these approaches: how can decision-making agents resolve discrepancies in their symbolic planning models while attempting to accomplish goals? We propose an integrated framework named SPOTTER that uses RL to augment and support ("spot") a planning agent by discovering new operators needed by the agent to accomplish goals that are initially unreachable for the agent. SPOTTER outperforms pure-RL approaches while also discovering transferable symbolic knowledge and does not require supervision, successful plan traces or any a priori knowledge about the missing planning operator.

KEYWORDS
Planning; Reinforcement Learning

ACM Reference Format:
Vasanth Sarathy, Daniel Kasenberg, Shivam Goel, Jivko Sinapov, and Matthias Scheutz. 2021. SPOTTER: Extending Symbolic Planning Operators through Targeted Reinforcement Learning. In Proc. of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2021), Online, May 3–7, 2021, IFAAMAS, 9 pages.

1 INTRODUCTION
Symbolic planning approaches focus on synthesizing a sequence of operators capable of achieving a desired goal [9]. These approaches rely on an accurate high-level symbolic description of the dynamics of the environment. Such a description affords these approaches the benefit of generalizability and abstraction (the model can be used to complete a variety of tasks), available human knowledge, and interpretability. However, the models are often handcrafted, difficult to design and implement, and require precise formulations that can be sensitive to human error. Reinforcement learning (RL) approaches do not assume the existence of such a domain model, and instead attempt to learn suitable models or control policies by trial-and-error interactions in the environment [28]. However, RL approaches tend to require a substantial amount of training in moderately complex environments. Moreover, it has been difficult to learn abstractions to the level of those used in symbolic planning approaches through low-level reinforcement-based exploration. Integrating RL and symbolic planning is highly desirable, enabling autonomous agents that are robust, resilient and resourceful.

Among the challenges in integrating symbolic planning and RL, we focus here on the problem of how partially specified symbolic models can be extended during task performance. Many real-world robotic and autonomous systems already have preexisting models and programmatic implementations of action hierarchies. These systems are robust to many anticipated situations. We are interested in how these systems can adapt to unanticipated situations, autonomously and with no supervision.

In this paper, we present SPOTTER (Synthesizing Planning Operators through Targeted Exploration and Reinforcement). Unlike other approaches to action model learning, SPOTTER does not have access to successful symbolic traces. Unlike other approaches to learning low-level implementations of symbolic operators, SPOTTER does not know a priori what operators to learn. Unlike other approaches which use symbolic knowledge to guide exploration, SPOTTER does not assume the existence of partial plans.

We focus on the case where the agent is faced with a symbolic goal but has neither the necessary symbolic operator to be able to synthesize a plan nor its low-level controller implementation. The agent must invent both. SPOTTER leverages the idea that the agent can proceed by planning alone unless there is a discrepancy between its model of the environment and the environment's dynamics. In our approach, the agent attempts to plan to the goal. When no plan is found, the agent explores its environment using an online explorer looking to reach any state from which it can plan to the goal. When such a state is reached, the agent spawns offline "subgoal" RL learners to learn policies which can consistently achieve those conditions. As the exploration agent acts, the subgoal learners learn from its trajectory in parallel. The subgoal learners regularly attempt to generate symbolic preconditions from which their candidate operators have high value; if such preconditions can be found, the operator is added with those preconditions into the planning domain. We evaluate the approach with experiments in a gridworld environment in which the agent solves three puzzles involving unlocking doors and moving objects.
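To make this plan-explore-learn loop concrete before the formal treatment, the following is a minimal, illustrative Python sketch. All names (plan, explore_step, spawn_learner, propose_operator, and so on) are hypothetical placeholders standing in for components described later in the paper; this is a sketch of the control flow, not the actual SPOTTER implementation.

```python
from typing import Any, Callable, List, Optional

def spotter_loop(
    plan: Callable[[Any], Optional[list]],    # plan from a symbolic state to the goal, or None
    execute: Callable[[list], None],          # execute a symbolic plan via low-level controllers
    explore_step: Callable[[], Any],          # one step of the online RL explorer
    current_state: Callable[[], Any],         # current symbolic (fluent) state
    spawn_learner: Callable[[Any], Any],      # offline "subgoal" RL learner targeting a plannable state
    add_operator: Callable[[Any], None],      # add a synthesized operator to the planning domain
    episode_over: Callable[[], bool],
) -> None:
    """Illustrative sketch of SPOTTER's control flow (placeholder API, not the paper's)."""
    learners: List[Any] = []
    p = plan(current_state())
    if p is not None:
        execute(p)                            # no model discrepancy: planning alone suffices
        return
    while not episode_over():
        transition = explore_step()           # explorer acts, looking for plannable states
        if plan(current_state()) is not None:
            # Reached a state from which the goal is plannable: spawn a learner
            # whose subgoal is to consistently (re-)achieve such conditions.
            learners.append(spawn_learner(current_state()))
        for learner in learners:              # learners train in parallel off the explorer's trajectory
            learner.update(transition)
            op = learner.propose_operator()   # preconditions under which the candidate operator has high value
            if op is not None:
                add_operator(op)              # extend the symbolic planning domain
```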
In this paper, our contributions are as follows: a framework for integrated RL and symbolic planning; algorithms for solving problems in finite, deterministic domains in which the agent has partial domain knowledge and must reach a seemingly unreachable goal state; and experimental results showing substantial improvements over baseline approaches in terms of cumulative reward, rate of learning and transferable knowledge learned.

1.1 Running example: GridWorld
Throughout this paper, we will use as a running example a GridWorld puzzle (Figure 1) in which an agent must unlock a door that is blocked by a ball. The agent's planning domain abstracts out the notion of specific grid cells, and so all navigation is framed in terms of going to specific objects. Under this abstraction, no action sequence allows the agent to "move the ball out of the way". Planning actions are implemented as handcrafted programs that navigate the puzzle and execute low-level actions (up, down, turn left/right, pickup). Because the door is blocked, the agent cannot generate a symbolic plan to solve the problem; it must synthesize new symbolic actions or operators. An agent using SPOTTER performs RL on low-level actions to learn to reach states from which a plan to the goal exists, and thus can learn a symbolic operator corresponding to "moving the ball out of the way".

2 BACKGROUND
In this section, we provide a background of relevant concepts in planning and learning.

2.1 Open-World Symbolic Planning
We formalize the planning task as an open-world variant of propositional STRIPS [8], T = ⟨F, O, σ_0, σ̃_g⟩. Let F be the set of propositional state variables (fluents) f. A fluent state σ is a complete assignment of values to all fluents in F. That is, |σ| = |F|, and σ includes positive literals (f) and negative literals (¬f). σ_0 represents the initial fluent state. We assume full observability of the initial fluent state. We can define a partial fluent state σ̃ to refer to a partial assignment of values to the fluents F. The goal condition is represented as a partial fluent state σ̃_g. We define L(F) to be the set of all partial fluent states with fluents in F.

The operators under this formalism are open-world. A partial planning operator can be defined as o = ⟨pre(o), eff(o), static(o)⟩, where pre(o) ∈ L(F) are the preconditions, eff(o) ∈ L(F) are the effects, and static(o) ⊆ F are those fluents whose values are known not to change during the execution of the operator. An operator o is applicable in a partial fluent state σ̃ if pre(o) ⊆ σ̃. The result of executing an operator o from a partial fluent state σ̃ is given by the successor function δ(σ̃, o) = eff(o) ∪ restrict(σ̃, static(o)), where restrict(σ̃, F′) is the partial fluent state consisting only of σ̃'s values for fluents in F′. The set of all partial fluent states defined through repeated application of the successor function beginning at σ̃ provides the set of reachable partial fluent states Δ_σ̃. A complete operator is an operator õ in which all the fluents are fully accounted for, namely ∀f ∈ F, f ∈ eff⁺(õ) ∪ eff⁻(õ) ∪ static(õ), where eff⁺(o) = {f ∈ F : f ∈ eff(o)} and eff⁻(o) = {f ∈ F : ¬f ∈ eff(o)}. We assume all operators o satisfy eff(o) \ pre(o) ≠ ∅; these are the only operators useful for our purposes. A plan π_T is a sequence of operators ⟨o_1, ..., o_n⟩. A plan π_T is executable in state σ_0 if, for all i ∈ {1, ..., n}, pre(o_i) ⊆ σ̃_{i-1} where σ̃_i = δ(σ̃_{i-1}, o_i). A plan π_T is said to solve the task T if executing π_T from σ_0 induces a trajectory ⟨σ_0, o_1, σ̃_1, ..., o_n, σ̃_n⟩ that reaches the goal state, namely σ̃_g ⊆ σ̃_n.
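As a concrete illustration of these definitions, the following is a minimal Python sketch (our own, not from the paper) in which a partial fluent state is represented as a mapping from fluent names to truth values; the type and function names are illustrative assumptions.

```python
from typing import Dict, FrozenSet, NamedTuple

# A partial fluent state sigma~ maps a subset of the fluents F to truth values;
# a complete fluent state sigma assigns a value to every fluent in F.
PartialState = Dict[str, bool]

class Operator(NamedTuple):
    name: str
    pre: PartialState        # pre(o): precondition literals
    eff: PartialState        # eff(o): effect literals
    static: FrozenSet[str]   # static(o): fluents known not to change

def applicable(op: Operator, state: PartialState) -> bool:
    """pre(o) subset-of sigma~: every precondition literal holds in the partial state."""
    return all(state.get(f) == v for f, v in op.pre.items())

def restrict(state: PartialState, fluents: FrozenSet[str]) -> PartialState:
    """restrict(sigma~, F'): keep only sigma~'s values for the fluents in F'."""
    return {f: v for f, v in state.items() if f in fluents}

def successor(state: PartialState, op: Operator) -> PartialState:
    """delta(sigma~, o) = eff(o) union restrict(sigma~, static(o))."""
    result = restrict(state, op.static)
    result.update(op.eff)
    return result
```

Under this representation, checking whether a plan ⟨o_1, ..., o_n⟩ is executable amounts to calling applicable at each intermediate partial state produced by successor.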
An open-world forward search (OWFS) is a breadth-first plan search procedure in which each node is a partial fluent state σ̃. The successor relationships and applicability of the operators in O are specified as defined above and are used to generate successor nodes in the search space. A plan is synthesized once σ̃_g has been reached. If σ_0 is a complete state and the operators are complete operators, then each node in the search tree will be a complete fluent state as well.

We also define the notion of a "relevant" operator and "regressor"
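To illustrate the OWFS procedure described above, here is a minimal breadth-first search sketch that reuses the PartialState and Operator definitions from the previous snippet; as before, the names are our own assumptions rather than the paper's implementation.

```python
from collections import deque
from typing import List, Optional, Sequence

def satisfies(state: PartialState, goal: PartialState) -> bool:
    """sigma~_g subset-of sigma~: every goal literal holds in the (partial) state."""
    return all(state.get(f) == v for f, v in goal.items())

def owfs(init: PartialState, goal: PartialState,
         operators: Sequence[Operator]) -> Optional[List[str]]:
    """Breadth-first open-world forward search over partial fluent states.

    Returns the operator names of a plan reaching the goal, or None if no
    plan exists with the given operators.
    """
    frontier = deque([(init, [])])             # (partial fluent state, plan prefix)
    visited = {frozenset(init.items())}
    while frontier:
        state, plan = frontier.popleft()
        if satisfies(state, goal):
            return plan                        # goal condition reached: plan synthesized
        for op in operators:
            if not applicable(op, state):
                continue
            nxt = successor(state, op)         # delta(sigma~, o)
            key = frozenset(nxt.items())
            if key not in visited:             # avoid revisiting partial fluent states
                visited.add(key)
                frontier.append((nxt, plan + [op.name]))
    return None
```

If the initial state is complete and every operator is a complete operator, the states enqueued by this search are themselves complete fluent states, matching the observation above.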