SPOTTER: Extending Symbolic Planning Operators through Targeted Reinforcement Learning

Vasanth Sarathy* (Smart Information Flow Technologies, Lexington, MA), Daniel Kasenberg* (Tufts University, Medford, MA), Shivam Goel (Tufts University, Medford, MA), Jivko Sinapov (Tufts University, Medford, MA), Matthias Scheutz (Tufts University, Medford, MA)

*The first two authors contributed equally.

ABSTRACT

Symbolic planning models allow decision-making agents to sequence actions in arbitrary ways to achieve a variety of goals in dynamic domains. However, they are typically handcrafted and tend to require precise formulations that are not robust to human error. Reinforcement learning (RL) approaches do not require such models, and instead learn domain dynamics by exploring the environment and collecting rewards. However, RL approaches tend to require millions of episodes of experience and often learn policies that are not easily transferable to other tasks. In this paper, we address one aspect of the open problem of integrating these approaches: how can decision-making agents resolve discrepancies in their symbolic planning models while attempting to accomplish goals? We propose an integrated framework named SPOTTER that uses RL to augment and support ("spot") a planning agent by discovering new operators needed by the agent to accomplish goals that are initially unreachable for the agent. SPOTTER outperforms pure-RL approaches while also discovering transferable symbolic knowledge and does not require supervision, successful plan traces or any a priori knowledge about the missing planning operator.

KEYWORDS

Planning, Reinforcement Learning

ACM Reference Format:
Vasanth Sarathy, Daniel Kasenberg, Shivam Goel, Jivko Sinapov, and Matthias Scheutz. 2021. SPOTTER: Extending Symbolic Planning Operators through Targeted Reinforcement Learning. In Proc. of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2021), London, UK, May 3-7, 2021, IFAAMAS, 10 pages.

1 INTRODUCTION

Symbolic planning approaches focus on synthesizing a sequence of operators capable of achieving a desired goal [9]. These approaches rely on an accurate high-level symbolic description of the dynamics of the environment. Such a description affords these approaches the benefit of generalizability and abstraction (the model can be used to complete a variety of tasks), available human knowledge, and interpretability. However, the models are often handcrafted, difficult to design and implement, and require precise formulations that can be sensitive to human error. Reinforcement learning (RL) approaches do not assume the existence of such a domain model, and instead attempt to learn suitable models or control policies by trial-and-error interactions in the environment [28]. However, RL approaches tend to require a substantial amount of training in moderately complex environments. Moreover, it has been difficult to learn abstractions at the level of those used in symbolic planning approaches through low-level reinforcement-based exploration. Integrating RL and symbolic planning is highly desirable, enabling autonomous agents that are robust, resilient and resourceful.

Among the challenges in integrating symbolic planning and RL, we focus here on the problem of how partially specified symbolic models can be extended during task performance. Many real-world robotic and autonomous systems already have preexisting models and programmatic implementations of action hierarchies. These systems are robust to many anticipated situations. We are interested in how these systems can adapt to unanticipated situations, autonomously and with no supervision.

In this paper, we present SPOTTER (Synthesizing Planning Operators through Targeted Exploration and Reinforcement). Unlike other approaches to action model learning, SPOTTER does not have access to successful symbolic traces. Unlike other approaches to learning low-level implementations of symbolic operators, SPOTTER does not know a priori what operators to learn. Unlike other approaches which use symbolic knowledge to guide exploration, SPOTTER does not assume the existence of partial plans.

We focus on the case where the agent is faced with a symbolic goal but has neither the necessary symbolic operator to be able to synthesize a plan nor its low-level controller implementation. The agent must invent both. SPOTTER leverages the idea that the agent can proceed by planning alone unless there is a discrepancy between its model of the environment and the environment's dynamics. In our approach, the agent attempts to plan to the goal. When no plan is found, the agent explores its environment using an online explorer looking to reach any state from which it can plan to the goal. When such a state is reached, the agent spawns offline "subgoal" RL learners to learn policies which can consistently achieve those conditions. As the exploration agent acts, the subgoal learners learn from its trajectory in parallel. The subgoal learners regularly attempt to generate symbolic preconditions from which their candidate operators have high value; if such preconditions can be found, the operator is added with those preconditions into the planning domain. We evaluate the approach with experiments in a gridworld environment in which the agent solves three puzzles involving unlocking doors and moving objects.

In this paper, our contributions are as follows: a framework for integrated RL and symbolic planning; algorithms for solving problems in finite, deterministic domains in which the agent has partial domain knowledge and must reach a seemingly unreachable goal state; and experimental results showing substantial improvements over baseline approaches in terms of cumulative reward, rate of learning and transferable knowledge learned.

1.1 Running example: GridWorld

Throughout this paper, we will use as a running example a GridWorld puzzle (Figure 1) in which an agent must unlock a door that is blocked by a ball. The agent's planning domain abstracts out the notion of specific grid cells, and so all navigation is framed in terms of going to specific objects. Under this abstraction, no action sequence allows the agent to "move the ball out of the way". Planning actions are implemented as handcrafted programs that navigate the puzzle and execute low-level actions (up, down, turn left/right, pickup). Because the door is blocked, the agent cannot generate a symbolic plan to solve the problem; it must synthesize new symbolic actions or operators. An agent using SPOTTER performs RL on low-level actions to learn to reach states from which a plan to the goal exists, and thus can learn a symbolic operator corresponding to "moving the ball out of the way".

2 BACKGROUND

In this section, we provide a background of relevant concepts in planning and learning.

2.1 Open-World Symbolic Planning

We formalize the planning task as an open-world variant of propositional STRIPS [8], T = ⟨F, O, σ_0, σ̃_g⟩. We consider F (fluents) to be the set of propositional state variables f. A fluent state σ is a complete assignment of values to all fluents in F. That is, |σ| = |F|, and σ includes positive literals (f) and negative literals (¬f). σ_0 represents the initial fluent state. We assume full observability of the initial fluent state. We can define a partial fluent state σ̃ to refer to a partial assignment of values to the fluents F. The goal condition is represented as a partial fluent state σ̃_g. We define L(F) to be the set of all partial fluent states with fluents in F.

The operators under this formalism are open-world. A partial planning operator can be defined as o = ⟨pre(o), eff(o), static(o)⟩. pre(o) ∈ L(F) are the preconditions, eff(o) ∈ L(F) are the effects, and static(o) ⊆ F are those fluents whose values are known not to change during the execution of the operator. An operator o is applicable in a partial fluent state σ̃ if pre(o) ⊆ σ̃. The result of executing an operator o from a partial fluent state σ̃ is given by the successor function δ(σ̃, o) = eff(o) ∪ restrict(σ̃, static(o)), where restrict(σ̃, F′) is the partial fluent state consisting only of σ̃'s values for fluents in F′. The set of all partial fluent states defined through repeated application of the successor function beginning at σ̃ provides the set of reachable partial fluent states Δ_σ̃. A complete operator is an operator õ where all the fluents are fully accounted for, namely, for all f ∈ F, f ∈ eff+(õ) ∪ eff-(õ) ∪ static(õ), where eff+(o) = {f ∈ F : f ∈ eff(o)} and eff-(o) = {f ∈ F : ¬f ∈ eff(o)}. We assume all operators o satisfy eff(o) \ pre(o) ≠ ∅; these are the only operators useful for our purposes. (We also assume (eff+(o) ∪ eff-(o)) ∩ static(o) = ∅, i.e., that postconditions and static variables do not conflict.) A plan π_T is a sequence of operators ⟨o_1, ..., o_n⟩. A plan π_T is executable in state σ_0 if, for all i ∈ {1, ..., n}, pre(o_i) ⊆ σ̃_{i-1}, where σ̃_i = δ(σ̃_{i-1}, o_i). A plan π_T is said to solve the task T if executing π_T from σ_0 induces a trajectory ⟨σ_0, o_1, σ̃_1, ..., o_n, σ̃_n⟩ that reaches the goal state, namely σ̃_g ⊆ σ̃_n.

An open-world forward search (OWFS) is a breadth-first plan search procedure where each node is a partial fluent state σ̃. The successor relationships and applicability of O are specified as defined above and used to generate successor nodes in the search space. A plan is synthesized once σ̃_g has been reached. If σ_0 is a complete state and the operators are complete operators, then each node in the search tree will be a complete fluent state as well.

We also define the notion of a "relevant" operator and "regressor" nodes, as used in backward planning search [9]. For a partial fluent state σ̃ with fluents f_i and their valuations c_i ∈ {True, False}, an operator o is relevant at partial fluent state σ̃ when:

(1) restrict(σ̃ \ eff(o), static(o)) = σ̃ \ eff(o), and
(2) σ̃ ⊇ eff(o) ∪ restrict(pre(o), static(o)).

When a relevant operator o is found for a particular σ̃, it can be regressed to generate a partial fluent state δ⁻¹(σ̃, o) = pre(o) ∪ (σ̃ \ eff(o)). (This is a slight abuse of notation because δ⁻¹ is not the function inverse of δ, but they can be thought of as inverses in the sense described in this paragraph.) Regression is the "inverse" of operator application in the following sense: if applying an operator sequence ⟨o_1, ..., o_n⟩ from σ̃ yields final state σ̃′, and we let σ̃″ = δ⁻¹(...(δ⁻¹(σ̃′, o_n))..., o_1), then σ̃″ ⊆ σ̃. In particular, σ̃″ is the minimal partial fluent state from which application of ⟨o_1, ..., o_n⟩ results in σ̃′.
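To make these definitions concrete, here is a minimal Python sketch of operator application and regression over partial fluent states. It assumes partial fluent states are dictionaries from fluent names to truth values; the representation and function names are illustrative and are not taken from the released SPOTTER code.

# Minimal sketch (not the authors' implementation) of the open-world semantics.
# A partial fluent state is a dict mapping fluent names to truth values;
# an operator is a (pre, eff, static) triple.
from typing import Dict, Set, Tuple

PartialState = Dict[str, bool]
Operator = Tuple[PartialState, PartialState, Set[str]]  # (pre, eff, static)

def restrict(sigma: PartialState, fluents: Set[str]) -> PartialState:
    """restrict(sigma, F'): sigma's values for the fluents in F' only."""
    return {f: v for f, v in sigma.items() if f in fluents}

def applicable(op: Operator, sigma: PartialState) -> bool:
    """True when pre(o) is contained in the partial fluent state sigma."""
    pre, _, _ = op
    return all(sigma.get(f) == v for f, v in pre.items())

def delta(sigma: PartialState, op: Operator) -> PartialState:
    """Successor function: delta(sigma, o) = eff(o) union restrict(sigma, static(o))."""
    _, eff, static = op
    return {**restrict(sigma, static), **eff}

def regress(sigma: PartialState, op: Operator) -> PartialState:
    """Regression: pre(o) union (sigma with the fluents mentioned in eff(o) removed)."""
    pre, eff, _ = op
    kept = {f: v for f, v in sigma.items() if f not in eff}
    return {**kept, **pre}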

Figure 1: The agent's (red triangle) goal is to unlock the door (yellow square). SPOTTER can learn how to move the blue ball out of the way. The learned representation is symbolic and can be used to plan (without additional learning) to achieve different goals like reaching the green square.

2.2 Reinforcement Learning

We formalize the environment in which the agent acts as a Markov Decision Process (MDP) M = ⟨S, A, p, r, ι, γ⟩, where S is the set of states, A is the set of actions, p is the transition probability distribution p(s_{t+1} | s_t, a_t), r : S × A × S → R is a reward function, ι is a probability distribution over initial states, and γ ∈ (0, 1] [28]. A policy for M is defined as the probability distribution π_M(a | s) that establishes the probability of an agent taking an action a given that it is in the current state s. We define the set of all such policies in M as Π_M. We let S_0 = {s ∈ S : ι(s) > 0}. An RL problem typically consists of finding an optimal policy π*_M ∈ Π_M that maximizes the expected discounted future rewards obtained from s ∈ S:
$$\pi_M^* = \arg\max_{\pi_M} \sum_{s \in S} v_{\pi_M}(s),$$
where v_{π_M}(s) is the value function and captures the expected discounted future rewards obtained when starting at state s and selecting actions according to the policy π_M:
$$v_{\pi_M}(s) = \mathbb{E}_{\pi_M}\left[\sum_{t=0}^{\infty} \gamma^t r_t \;\middle|\; s_0 = s\right].$$

At each time step, the agent executes an action a and the environment returns the next state s′ ∈ S (sampled from p) and an immediate reward r. The experience is then used by the agent to learn and improve its current policy π_M.

Q-learning [31] is one learning technique in which an agent uses experiences to estimate the optimal Q-function q*(s, a) for every state s ∈ S and a ∈ A, where q*(s, a) is the expected discounted sum of future rewards received by performing action a in state s. The Q-function is updated as follows:
$$q(s, a) \leftarrow q(s, a) + \alpha\left(r + \gamma \max_{a' \in A} q(s', a') - q(s, a)\right),$$
where α ∈ (0, 1] is the learning rate. The Q-learner can explore the environment, e.g., by following an ε-greedy policy, in which the agent selects a random action with probability ε and otherwise follows an action with the largest q(s, a).

3 PROPOSED SPOTTER FRAMEWORK

We begin by introducing a framework for integrating the planning and learning formulations. We define an integrated planning task that enables us to ground symbolic fluents and operators in an MDP, specify goals symbolically, and realize action hierarchies.

We define an executor for a given MDP M = ⟨S, A, p, r, ι, γ⟩ as a triple x = ⟨I_x, π_x, β_x⟩, where I_x ⊆ S is an initiation set, π_x(a | s_init, s) is the probability of performing a given that the executor was initialized at state s_init and the current state is s, and β_x(s_init, s) expresses the probability of terminating x at s given that x was initialized at s_init. We define X_M as the set of executors for M. (An executor is simply an option [29] where the policy and termination condition depend on where it was initialized.)

Definition 3.1. (Integrated Planning Task) We can formally define an Integrated Planning Task (IPT) as 𝒯 = ⟨T, M, d, e⟩, where T = ⟨F, O, σ_0, σ̃_g⟩ is an open-world STRIPS task, M = ⟨S, A, p, r, ι, γ⟩ is an MDP, d : S → L(F) is a detector function that determines a fluent state for a given MDP state, and e : O → X_M is an executor function that determines a mapping between an operator and an executor in the underlying MDP.

For the purposes of this paper, we assume that for each operator o ∈ O, e(o) is accurate to o; that is, for every o ∈ O, I_{e(o)} ⊇ {s ∈ S : d(s) ⊇ pre(o)} and
$$\beta_{e(o)}(s_{init}, s) = \begin{cases} 1 & \text{if } d(s) \supseteq \mathit{eff}(o) \cup restrict(d(s_{init}), static(o)) \\ 0 & \text{otherwise.} \end{cases}$$

The objective of this task is to find an executor x★ ∈ X_M such that β_{x★}(s_init, s) = 1 if and only if d(s) ⊇ σ̃_g, I_{x★} ⊇ S_0, and x★ terminates in finite time.

A solution to a particular IPT 𝒯 is an executor x★ ∈ X_M having the properties defined above.

A planning solution to a particular IPT 𝒯 is a mapping π_T : S_0 → O* such that for every s_0 ∈ S_0, π_T(s_0) = ⟨o_1, ..., o_n⟩ is executable at s_0 and achieves goal state σ̃_g. Assuming all operators are accurate, executing in the MDP M the corresponding executors e(o_1), ..., e(o_n) in sequence will yield a final state s such that d(s) ⊇ σ̃_g, as desired.

An IPT 𝒯 is said to be solvable if a solution exists. It is said to be plannable if a planning solution exists.

3.1 The Operator Discovery Problem

As we noted earlier, symbolic planning domains can be sensitive to human errors. One common error is a domain missing an operator, which then prevents the agent from synthesizing a plan that requires such an operator. We define a stretch-Integrated Planning Task, stretch-IPT (akin to "stretch goals" in business productivity), that captures difficult but achievable goals: those for which missing operators must be discovered.

Definition 3.2. (Stretch-IPT). A Stretch-IPT 𝒯̃ is an IPT 𝒯 for which a solution exists, but a planning solution does not.

Sarathy et al. considered something similar in a purely symbolic planning context as MacGyver Problems [24]; here we extend these ideas to integrated symbolic-RL domains. A planning solution is desirable because a plannable task affords the agent robustness to variations in goal descriptions and a certain degree of generality and ability to transfer decision-making capabilities across tasks. We are interested in turning a stretch-IPT into a plannable IPT, and specifically study how an agent can automatically extend its task description and executor function to invent new operators and their implementations.

Definition 3.3. (Operator Discovery Problem). Given a stretch-IPT 𝒯̃ = ⟨T, M, d, e⟩ with T = ⟨F, O, σ_0, σ̃_g⟩, construct a set of operators O′ = {o′_1, ..., o′_m} and their executors x_{o′_1}, ..., x_{o′_m} ∈ X_M such that the IPT ⟨T′, M, d, e′⟩ is plannable, with T′ = ⟨F, O ∪ O′, σ_0, σ̃_g⟩ and the executor function
$$e'(o) = \begin{cases} e(o) & \text{if } o \in O \\ x_o & \text{if } o \in O'. \end{cases}$$

In the rest of the section, we will outline an approach for solving the operator discovery problem.
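The following sketch shows one way the objects of Definition 3.1 could be laid out as plain data, assuming fluent states are dictionaries and operators are referred to by name; none of these class or field names come from the authors' implementation.

# Illustrative data structures for Definition 3.1 (a sketch under the stated
# assumptions; not the authors' code).
from dataclasses import dataclass
from typing import Callable, Dict

FluentState = Dict[str, bool]

@dataclass
class Executor:
    """An option-like executor x = (I_x, pi_x, beta_x)."""
    initiation: Callable[[object], bool]            # s -> may x start here?
    policy: Callable[[object, object], object]      # (s_init, s) -> action
    termination: Callable[[object, object], bool]   # (s_init, s) -> terminate?

@dataclass
class IntegratedPlanningTask:
    task: object                               # open-world STRIPS task T = (F, O, sigma_0, goal)
    mdp: object                                # the underlying environment M
    detector: Callable[[object], FluentState]  # d: MDP state -> fluent state
    executors: Dict[str, Executor]             # e: operator name -> executor

def accurate(ipt, op_name, op_pre, mdp_states):
    """Simplified accuracy check: e(o) can be initiated wherever pre(o) holds."""
    x = ipt.executors[op_name]
    return all(x.initiation(s) for s in mdp_states
               if all(ipt.detector(s).get(f) == v for f, v in op_pre.items()))

Under this representation, the Operator Discovery Problem of Definition 3.3 amounts to adding new (operator, Executor) pairs until a plan to the goal exists.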
3.1.1 The Operator Discovery Problem in GridWorld. In the example GridWorld puzzle, the SPOTTER agent is equipped with a high-level planning domain specified in an open-world extension of PDDL that can be grounded down into an open-world STRIPS problem. This domain is an abstraction of the environment which ignores the absolute positions of objects. In particular, the core movement actions forward, turnLeft, and turnRight in the MDP are absent from the planning domain, which navigates in terms of objects using, e.g., the operator goToObj(agent, object), with preconditions ¬holding(agent, object), ¬blocked(object), and inRoom(agent, object), and with effect nextToFacing(agent, object). All initial operators are assumed to satisfy the closed-world assumptions, except that putting down an object (putDown(agent, object)) leaves unknown whether, at the conclusion of the action, some other object (e.g., a door) will be blocked. Each operator has a corresponding hand-coded executor. The goal here is σ̃_g = {open(door)}, which is not achievable using planning alone from the initial state.

3.2 Planning and Execution

Our overall agent algorithm is given by Algorithm 1. We consider the agent to be done when it has established that the input IPT 𝒯 is a plannable IPT. This is true once the agent is able to find a complete solution to the IPT through planning and execution alone, without requiring any learning. Algorithm 1 shows that the agent repeatedly learns in the environment until it has completed a run in which there were no planning impasses, i.e., situations where no plan was found through a forward search in the symbolic space. The set of learners L is maintained between runs of solve.

Algorithm 1: spotter(𝒯)
  Input: 𝒯: Integrated Planning Task
  1: impasse ← true
  2: L ← {ℓ_expl}
  3: while impasse do
  4:   s_0 ∼ 𝒯.M.ι(·)
  5:   impasse, L ← solve(𝒯, s_0, ∅, ∅, τ, false, L)
  6: end while
  7: return 𝒯

Two sets of fluent states will be important in Algorithms 2-4: Σ_reach and Σ_plan. Σ_reach contains all states known to be reachable from the initial state via planning. Σ_plan contains all states from which a plan to the goal σ̃_g is known.

Algorithm 2: solve(𝒯, s, Σ_reach, Σ_plan, τ, impasse, L)
  Input: 𝒯: Integrated Planning Task
  Input: s: Initial MDP state from which to plan
  Input: Σ_reach: Set of reachable fluent states
  Input: Σ_plan: Set of plannable fluent states
  Input: τ: Value threshold parameter
  Input: impasse: true if this algorithm was called from learn
  Input: L: A set of learners
  1: σ ← 𝒯.d(s)
  2: if 𝒯.σ̃_g ⊆ σ then
  3:   return impasse, L
  4: end if
  5: π_T, visitedNodes ← owfs(𝒯.T, σ)
  6: Σ_reach.add(visitedNodes)
  7: if π_T ≠ ∅ then
  8:   Σ_plan.add(all visitedNodes along π_T)
  9:   for operator o_i in π_T do
  10:    s ← execute(𝒯.e(o_i), s)
  11:    σ ← 𝒯.d(s)
  12:    if pre(o_{i+1}) ⊄ σ then
  13:      return learn(𝒯, s, Σ_reach, Σ_plan, τ, L)
  14:    end if
  15:  end for
  16:  if 𝒯.σ̃_g ⊆ σ then
  17:    return impasse, L
  18:  else
  19:    return learn(𝒯, s, Σ_reach, Σ_plan, τ, L)
  20:  end if
  21: else
  22:  return learn(𝒯, s, Σ_reach, Σ_plan, τ, L)
  23: end if

Algorithm 2 (solve) begins with the agent performing an OWFS (open-world forward search) of the symbolic space specified by the task. If a plan is found (line 7), the agent attempts to execute the plan (line 9).
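For readers who prefer code, the following Python sketch mirrors Algorithms 1 and 2. The helpers make_exploration_learner, owfs, nodes_along, satisfies, execute, and learn are assumed to exist with the semantics described in the text; this is an illustrative sketch, not the released implementation.

# Compact rendering of Algorithms 1-2 (spotter / solve), under the stated
# assumptions about helper functions and operator objects (op.name, op.pre).
def spotter(ipt, tau):
    impasse, learners = True, {make_exploration_learner()}
    while impasse:
        s0 = ipt.mdp.reset()                          # s0 sampled from iota
        impasse, learners = solve(ipt, s0, set(), set(), tau, False, learners)
    return ipt

def solve(ipt, s, reach, plannable, tau, impasse, learners):
    sigma = ipt.detector(s)
    if satisfies(sigma, ipt.task.goal):               # goal already holds
        return impasse, learners
    plan, visited = owfs(ipt.task, sigma)
    reach.update(visited)
    if plan is None:                                  # planning impasse
        return learn(ipt, s, reach, plannable, tau, learners)
    plannable.update(nodes_along(plan, visited))
    for i, op in enumerate(plan):
        s = execute(ipt.executors[op.name], s)        # run the operator's executor
        sigma = ipt.detector(s)
        nxt = plan[i + 1] if i + 1 < len(plan) else None
        if nxt is not None and not satisfies(sigma, nxt.pre):
            return learn(ipt, s, reach, plannable, tau, learners)
    if satisfies(sigma, ipt.task.goal):
        return impasse, learners
    return learn(ipt, s, reach, plannable, tau, learners)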
If unexpected states are encountered such that the agent cannot perform the next operator, it will call Algorithm 3 (learn). Otherwise, the agent continues with the next operator until all the operators in the plan are complete. If the goal conditions are satisfied (line 16), the algorithm returns success. Otherwise, it turns to learn.

3.3 Learning Operator Policies

Broadly, in Algorithm 3 (learn), the agent follows an exploration policy π_expl, e.g., ε-greedy, to explore the environment, spawning RL agents which attempt to construct policies to reach particular partial fluent states from which the goal σ̃_g is reachable by planning. These partial fluent states correspond to operator effects. For each such set of effects, the system attempts to find sets of preconditions from which that operator can consistently achieve high value; when one such set of preconditions is found, the policy, along with the corresponding set of preconditions and effects, is used to define a new operator and its corresponding executor, which are added to the IPT.

learn receives as input, among other things, a set of learners L (which may grow during execution, as described below). L includes an exploration agent ℓ_expl with corresponding policy π_expl (initialized in Algorithm 1). L also contains a set of offline "subgoal" learners, which is initially empty.

In each time step, learn executes an action a according to its exploration policy π_expl, with resulting MDP state s′ and corresponding fluent state σ (lines 4-6). The agent will then check whether σ is a fluent state from which it can plan to the goal. First, the agent checks if σ is already known to be a state from which it can plan to the goal (σ ∈ Σ_plan; line 7). If not, the agent attempts to plan from σ to the goal (line 10). If there is a plan, then the agent regresses each operator in the plan in reverse order (o_n, ..., o_1). As described in Section 2.1, after regressing through o_i, the resulting fluent state σ̃_{i-1} is the most general possible partial fluent state (i.e., containing the fewest possible fluents) such that executing ⟨o_i, ..., o_n⟩ from σ̃_{i-1} results in σ̃_g; this makes each σ̃_{i-1} a prime candidate for some new operator o★'s effects. That is, assuming static(o★) = ∅, σ̃_{i-1} guarantees that ⟨o_i, ..., o_n⟩ is a plan to σ̃_g, while allowing the corresponding executor to terminate in the largest possible set of fluent states. Each such partial fluent state σ̃_sg is chosen as a subgoal, for which a new learner ℓ_σ̃sg is spawned and added to L (lines 12-18). σ̃_sg is also added to the set of "plannable" fluent states Σ_plan.

Each subgoal learner ℓ_σ̃sg is trained each time step using as reward the indicator function 1_{σ ⊇ σ̃sg}, which returns 1 if that learner's subgoal is satisfied by s′ and 0 otherwise (lines 23-25).

While in this case the exploration learner ℓ_expl may be conceived of as an RL agent whose reward function is 1 whenever any such subgoal is achieved (line 22) and whose exploration policy π_expl is ε-greedy, nothing prevents it from being defined differently, as a random agent or even a symbolic learner, as discussed in Section 6.

Algorithm 3: learn(𝒯, s, Σ_reach, Σ_plan, τ, L)
  Input: 𝒯: Integrated Planning Task
  Input: s: Initial MDP state
  Input: Σ_reach: Set of reachable fluent states
  Input: Σ_plan: Set of plannable fluent states
  Input: τ: Value threshold
  Input: L: set of learners
  1: done ← false
  2: σ ← 𝒯.d(s)
  3: while ¬done do
  4:   a ∼ π_expl(· | s)
  5:   s′ ∼ 𝒯.p(· | s, a)
  6:   σ ← 𝒯.d(s′)
  7:   if σ ⊇ σ̃ for some σ̃ ∈ Σ_plan then
  8:     done ← true
  9:   else
  10:    π_T, visitedNodes ← owfs(𝒯.T, σ)
  11:    if π_T ≠ ∅ then
  12:      σ̃_sg ← σ̃_g
  13:      for o_i in reversed(π_T) do
  14:        σ̃_sg ← δ⁻¹(σ̃_sg, o_i)
  15:        Σ_plan.add(σ̃_sg)
  16:        ℓ_σ̃sg ← spawnLearner(σ̃_sg)
  17:        L ← L ∪ {ℓ_σ̃sg}
  18:      end for
  19:      done ← true
  20:    end if
  21:  end if
  22:  ℓ_expl.train(s, a, 1_{done=true}, s′)
  23:  for ℓ_σ̃sg ∈ L do
  24:    ℓ_σ̃sg.train(s, a, 1_{σ ⊇ σ̃sg}, s′)
  25:  end for
  26:  s ← s′
  27: end while
  28: for ℓ_σ̃sg ∈ L do
  29:  for σ̃_pre ∈ gen-precon(ℓ_σ̃sg, Σ_reach, τ) do
  30:    o★ ← ⟨σ̃_pre, σ̃_sg, ∅⟩
  31:    x★ ← makeExecutor(ℓ_σ̃sg, o★)
  32:    𝒯.addOperator(o★, x★)
  33:  end for
  34: end for
  35: return solve(𝒯, s, Σ_reach, Σ_plan, τ, true, L)

Upon reaching a state from which a plan to the goal exists, the agent stops exploring. For each of its subgoal learners ℓ_σ̃sg ∈ L, it attempts to construct sets of preconditions (characterized as a partial fluent state σ̃_pre) from which its policy can consistently achieve the subgoal state σ̃_sg (see Section 3.4) (lines 28-34). If any such precondition sets σ̃_pre exist, the agent constructs the operator o★ such that pre(o★) = σ̃_pre, eff(o★) = σ̃_sg, and without static fluents (all other variables are unknown once the operator is executed; static(o★) = ∅). The corresponding executor is constructed as makeExecutor(ℓ_σ̃sg, o★) = ⟨I_x★, π_x★, β_x★⟩, where
$$I_{x^\star} = \{s' \in S : \mathcal{T}.d(s') \supseteq pre(o^\star)\},$$
$$\pi_{x^\star}(a \mid s_{init}, s) = \begin{cases} 1 & \text{if } a = \arg\max_{a' \in \mathcal{T}.M.A} \ell_{\tilde{\sigma}_{sg}}.q(s, a') \\ 0 & \text{otherwise,} \end{cases}$$
$$\beta_{x^\star}(s_{init}, s) = \begin{cases} 1 & \text{if } \mathcal{T}.d(s) \supseteq \tilde{\sigma}_{sg} \\ 0 & \text{otherwise.} \end{cases}$$

Note that while the definition of the Operator Discovery Problem allows the constructed operators' executors to depend on s_init, in practice the operators are ordinary options. Options are sufficient, provided the operators being constructed have no static fluents. (This also indicates why we used our open-world planning formalism: in closed-world planning, all fluents not in the effects are static, requiring executors which depend on s_init. By specifying static fluents for each operator, we can leverage handmade operators which largely satisfy the closed-world assumption while allowing the operators returned by SPOTTER to be open-world.) Constructing operators with static fluents is a topic for future work.

Constructed operators and their corresponding executors are added to the IPT (line 32), which ideally becomes plannable as a result, solving the Operator Discovery Problem. Because the agent is currently in a state from which it can plan to the goal, control is passed back to solve (with impasse set to true because this episode required learning).

3.3.1 Learning operator policies in GridWorld. The puzzle shown in Figure 1 presents a Stretch-IPT for which a solution exists in terms of the MDP-level actions, but no planning solution exists. Thus, SPOTTER enters learn, and initially moves around randomly (the exploration policy has not yet received a reward). Eventually, in the course of random action, the agent moves the ball out of the way (Algorithm learn, line 11). From this state, the action plan

  π_T = goToObj(agent, key); pickUp(agent, key); goToObj(agent, door); useKey(agent, door)

achieves the goal σ̃_g = {open(door)}. Regressing through this plan from the goal state (Algorithm learn, lines 13-17), the agent identifies the subgoal

  σ̃_sg = {nextToFacing(agent, ball), handsFree(agent), inRoom(agent, key), inRoom(agent, door), locked(door), ¬holding(agent, key), ¬blocked(door)}

(as well as other subgoals corresponding to the suffixes of π_T), and constructs a corresponding subgoal learner, ℓ_σ̃sg, which uses RL to learn a policy that consistently achieves σ̃_sg.
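The subgoal learners and the makeExecutor construction can be sketched as follows, assuming MDP states are hashable and fluent states are dictionaries of literals; the class and function names are ours, not the authors'.

# Sketch of a tabular subgoal learner (trained with the indicator reward
# 1[sigma contains the subgoal]) and of makeExecutor, under the stated
# assumptions.
from collections import defaultdict

class SubgoalLearner:
    def __init__(self, subgoal, actions, alpha=0.1, gamma=0.99):
        self.subgoal = subgoal                  # partial fluent state (dict)
        self.actions = list(actions)
        self.alpha, self.gamma = alpha, gamma
        self.q = defaultdict(float)             # Q-table: (state, action) -> value

    def satisfied(self, fluent_state):
        return all(fluent_state.get(f) == v for f, v in self.subgoal.items())

    def train(self, s, a, r, s_next):
        # Standard Q-learning update from Section 2.2.
        best_next = max(self.q[(s_next, a2)] for a2 in self.actions)
        self.q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.q[(s, a)])

    def greedy_action(self, s):
        return max(self.actions, key=lambda a: self.q[(s, a)])

def make_executor(learner, operator, detector):
    """Build (initiation, policy, termination) for a new operator from the Q-table."""
    pre, eff, _ = operator
    initiation = lambda s: all(detector(s).get(f) == v for f, v in pre.items())
    policy = lambda s_init, s: learner.greedy_action(s)
    termination = lambda s_init, s: all(detector(s).get(f) == v for f, v in eff.items())
    return initiation, policy, termination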
If the average value (according to ℓ’s ′ ′ ∈ 푏푒푒푛 To establish the fact that (1) operator discovery is possible, and 푞 function) of all states satisfying 휎˜ is above the threshold 휏, 푐표푚푚표푛 (2) that the symbolic operators discovered by SPOTTER are useful then the above-threshold common partial fluent state 휎˜ is 푐표푚푚표푛 for performing new tasks in the environment, we structure our added to Σ . The idea here is to only allow sets of preconditions >휏 evaluation into three puzzles that the agent must solve in sequence, which are guaranteed reachable by the planner (generalizations much like three levels in puzzle video games (Fig. 1). The envi- of elements of Σ ), and to compute the set of all such sets of 푟푒푎푐ℎ ronment in each level consists of two rooms with a locked door preconditions for which the agent can consistently achieve the separating them. The agent and the key to the room are randomly learner’s subgoal 휎˜ (where for the purposes of this paper, a set 푠푔 placed in the leftmost room. In puzzle 1, the agent’s task is to pickup of preconditions 휎˜ “consistently achieves” 휎˜ if the average 푝푟푒 푠푔 the key, and use it to open the door. The episode terminates, and value of all MDP states satisfying 휎˜ is above the threshold 휏). 푝푟푒 the agent receives a reward inversely proportional to the number of elapsed time steps, when the door is open. In puzzle 2 the agent’s goal is again to open the door, but a ball is placed directly in front Algorithm 4: gen-precon ℓ, Σ푟푒푎푐ℎ, 휏 ( ) of the door, which the agent must pick up and move elsewhere ℓ: Learner along with Q-tables Input: before picking up the key (this is the running example). Puzzle 3 Input: Σ : Fluent states reachable from the initial state 푟푒푎푐ℎ is identical to puzzle 2, except that the agent’s goal is now not to Input: 휏: Value threshold open the door, but to go to a goal location (green square) in the far 1: Σ> 휏 ← ∅ 2: queue corner of the rightmost room. The high-level planning domain and ← ∅ 3: for 휎 in Σ푟푒푎푐ℎ do low-level RL actions are as defined for the running example. 4: if value 휎 > 휏 then The evaluation is structured so that for almost all initial condi- ( ) 5: queue.add 휎 tions in puzzle 1, SPOTTER’s planner can produce a plan that can ( ) 6: Σ> .add 휎 휏 ( ) be successfully executed to reach the goal state. In puzzle 2, the 7: end if door is blocked by the ball, and the agent has no operator represent- 8: end for ing “move the ball out of the way” (and no combination of existing 9: Σ getFluentStatesFromQ ℓ 푏푒푒푛 ← ( ) operators can safely achieve this effect given the agent’s planning 10: while queue do domain). The agent must discover such an operator. Finally, puzzle 11: 푛표푑푒 queue.pop ← () 3 is designed to test whether the learned operator can be used in 12: for 휎′ in Σ푏푒푒푛 do 13: 휎˜ 푛표푑푒 휎 the execution of different goals in the same environment. 푐표푚푚표푛 ← ∩ ′ 14: if value 휎˜ > 휏 휎˜ not in Σ> then Figure 2 shows the average results of running SPOTTER 10 ( 푐표푚푚표푛) ∧ 푐표푚푚표푛 휏 15: queue.add 휎˜ times on puzzles 1 through 3. The algorithm was allowed to re- ( 푐표푚푚표푛) 16: Σ>휏 .add 휎˜푐표푚푚표푛 tain any operators/executors discovered between puzzles 2 and 3. ( ) 17: end if In each case we employed a constant learning rate 훼 = 0.1, dis- 18: end for count factor 훾 = 0.99, and an 휖-greedy exploration policy with 19: end while 6 20: return Σ>휏 The preconditions produced are too lengthy to include in this paper, but can be found in the supplementary material. 
4 EXPERIMENTS

We evaluate SPOTTER on MiniGrid [5], a platform based on procedurally generated 2D gridworld environments populated with objects: agent, balls, doors and keys. The agent can navigate the world, manipulate objects, and move between rooms.

To establish that (1) operator discovery is possible, and (2) the symbolic operators discovered by SPOTTER are useful for performing new tasks in the environment, we structure our evaluation into three puzzles that the agent must solve in sequence, much like three levels in puzzle video games (Fig. 1). The environment in each level consists of two rooms with a locked door separating them. The agent and the key to the room are randomly placed in the leftmost room. In puzzle 1, the agent's task is to pick up the key and use it to open the door. The episode terminates, and the agent receives a reward inversely proportional to the number of elapsed time steps, when the door is open. In puzzle 2 the agent's goal is again to open the door, but a ball is placed directly in front of the door, which the agent must pick up and move elsewhere before picking up the key (this is the running example). Puzzle 3 is identical to puzzle 2, except that the agent's goal is now not to open the door, but to go to a goal location (green square) in the far corner of the rightmost room. The high-level planning domain and low-level RL actions are as defined for the running example.

The evaluation is structured so that for almost all initial conditions in puzzle 1, SPOTTER's planner can produce a plan that can be successfully executed to reach the goal state. In puzzle 2, the door is blocked by the ball, and the agent has no operator representing "move the ball out of the way" (and no combination of existing operators can safely achieve this effect given the agent's planning domain). The agent must discover such an operator. Finally, puzzle 3 is designed to test whether the learned operator can be used in the execution of different goals in the same environment.

Figure 2: Experimental performance across three tasks. We report mean cumulative reward (and standard deviation in lighter color) obtained by our approach (SPOTTER) and three baseline algorithms: tabular Q-learning over primitive actions (VQL), tabular Q-learning with primitive and high-level action executors (HLAQL), and HLAQL with q-updates trickling down from HLAs to primitives (HLALQL).

Figure 2 shows the average results of running SPOTTER 10 times on puzzles 1 through 3. The algorithm was allowed to retain any operators/executors discovered between puzzles 2 and 3. In each case we employed a constant learning rate α = 0.1, a discount factor γ = 0.99, and an ε-greedy exploration policy with exploration constant ε beginning at 0.9 and decaying towards 0.05. ε is decayed exponentially using the formula
$$\epsilon(t) = \epsilon_{min} + (\epsilon_{max} - \epsilon_{min}) e^{-\lambda t}, \quad \text{where } \lambda = -\log(0.01)/N,$$
N being the maximum number of episodes. The value threshold was set to τ = 0.9. We ran for 10,000 episodes on puzzle 1, 20,000 episodes on puzzle 2, and 10,000 episodes on puzzle 3.

The experimental results show that SPOTTER achieved higher overall rewards than the baselines in the given time frame and did so more quickly. (Code implementing SPOTTER and the baselines along with experiments will be made available post-review.) Crucially, the agent learned the missing operator for moving the blue ball out of the way in Level 2, and was immediately able to use this operator in Level 3. This is demonstrated both by the fact that the agent did not experience any drop in performance when transitioning to Level 3 and by the fact that, as we know from running the experiment, the agent did not enter learn or gen-precon in Level 3. It is important to note that the baselines did converge, at around 800,000-1,000,000 episodes, significantly later than SPOTTER. (In the supplementary material, we provide the learned operator described in PDDL, learning curves for the baselines over 2,000,000 episodes, and videos showing SPOTTER's integrated planning and learning.)

The results suggest that SPOTTER significantly outperformed comparable baselines. The HLAQL and HLALQL baselines have their action space augmented with the same high-level executors provided to SPOTTER. SPOTTER does not use any function approximation techniques, and the exploration and subgoal learners are themselves tabular Q-learners. Accordingly, we did not compare against any deep RL baselines. We also did not compare against transfer learning and curriculum learning approaches, as these approaches do not handle cases where new representations need to be learned from the environment.

4.1 Experiment 2

Ordinarily, as soon as SPOTTER discovers an operator exceeding the value threshold, that operator is incorporated into the planning agent's model. We dispensed with this assumption and ran SPOTTER on puzzle 2 for 50,000 episodes, allowing the system to continue learning operator policies and generating preconditions throughout this time. Every 50 episodes, SPOTTER logged the new operators it had created. (Operators were not logged if their preconditions were specifications of preconditions for which an operator had already been discovered.) Figure 3 shows the results for one particular operator learner (i.e., each of the output operators has the same postconditions and the same underlying policy, but has a unique set of preconditions). As Figure 3a indicates, by episode 29,500 this learner had discovered 9,049 unique operators; no further operators were discovered after this episode.

Figures 3b and 3c demonstrate that with additional exploration, the agent can construct more general operators. Recall that an operator with a set of preconditions is accepted whenever the average value of all MDP states satisfying those preconditions is greater than the value threshold (in this experiment, 0.9). As the values of additional MDP states increase past the threshold, more general operators (with fewer preconditions) cross this threshold. Figure 3b plots, for each operator discovered, the episode in which it was logged and its total number of preconditions. Note that there are several "waves" of precondition generalization. Beginning with the discovery of the first viable set of preconditions, as additional states cross the threshold there is a rapid discovery of new operators with additional preconditions. Eventually (a little after 10,000 episodes), the existing operators are sufficiently general that many rarely-seen MDP states would have to be thoroughly explored before more general preconditions can be found. For a while, any new operators discovered (while they are not merely specifications of existing operators) have a larger number of preconditions. This ultimately culminates in a "second wave" in which enough MDP states have been explored that more general operators can be produced, ultimately surpassing the first wave.

Figure 3c shows that operators created later not only have fewer preconditions than earlier operators, but dominate earlier operators (in that the preconditions of the dominated operator are a strict superset of the preconditions of the dominating operator). Orange bars represent dominated operators, blue bars those for which a superior operator has not yet been found. Relatively few non-dominated operators persist by episode 30,000, and nearly all that do were created in the last few thousand episodes, suggesting that running SPOTTER for longer before incorporating operators allows the construction of strictly better (more general) operators. (An animated version of this chart, showing how operators are constructed and then dominated as the agent continues exploring, appears in the supplementary material.)
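The dominance relation used in Figure 3c can be stated compactly. The following small check assumes preconditions are represented as dictionaries of literals; it is an illustration rather than the authors' code.

# Sketch of the dominance test: with effects fixed, operator A dominates
# operator B when A's preconditions are a strict subset of B's.
def dominates(pre_a, pre_b):
    """True if operator A (preconditions pre_a) dominates operator B (pre_b)."""
    return (len(pre_a) < len(pre_b)
            and all(pre_b.get(f) == v for f, v in pre_a.items()))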

Figure 3: Results of precondition generalization experiment for a single subgoal learner on puzzle 2. (a) The number of unique sets of preconditions discovered for which average value is above threshold increases as SPOTTER is allowed to continue exploring. (b) Generation of operators with decreasing numbers of preconditions proceeds in "waves" as the value estimates of additional MDP states converge. (c) By episode 30,000, almost all operators have been dominated (replaced by superior operators), and almost all non-dominated operators were created late in the run.

5 DISCUSSION

The SPOTTER architecture assumes the existence of a planning domain with operators that correspond to executors which are more or less correct. It does not assume that these executors are reward-optimal. Further, some tasks can be more efficiently performed if they are not first split into subgoals. Thus, the performance of raw RL systems eventually overtakes that of SPOTTER on any particular environment. This is not a serious flaw: SPOTTER also produces knowledge that can be more easily applied to perform other tasks in the same environment.

The environments used to test SPOTTER (puzzles 1 through 3) are deterministic. In stochastic environments, it is often difficult to design planning domains where operators have guaranteed effects. While SPOTTER can handle stochastic environments, it would need more robust metrics for assessing operator confidence.

Future work could also emphasize adapting this work to high-dimensional continuous state and action spaces using deep reinforcement learning. "Symbolizing" such spaces can be difficult, and in particular such work would have to rethink how to generate candidate preconditions, since the existing approach enumerates over all states, which clearly would not work in deep RL domains.

The key advantage of SPOTTER is that the agent can produce operators that can potentially be applied to other tasks in the same environment. Because these operators' executors are policies (here, policies over finite, atomic MDPs), they do not generalize particularly well to environments with different dynamics or unseen start states (e.g., an operator learned in puzzle 2 could not possibly be applied in puzzle 3 if the door was moved up or down by even one cell). While function approximation could be helpful in solving this problem, an ideal approach might be a form of program synthesis [27], in which the agent learns programs that can be applied regardless of environment.

Furthermore, in this work, we manually sequenced the three tasks to elicit the discovery of operators that would be useful in the final task environment. A possible avenue for future work would be to develop automated task sequencing methods (i.e., curriculum learning [22]) so as to improve performance in downstream tasks.

6 RELATED WORK

In this section, we review related work that overlaps with various aspects of the proposed integrated planning and learning literature.

6.1 Learning Symbolic Action Models

Early work in the symbolic AI literature explored how agents can adjust and improve their symbolic models through experimentation. Gil proposed a method for learning by experimentation in which the agent can improve its domain knowledge by finding missing operators [10]. The agent is able to design experiments at the symbolic level based on observing the symbolic fluent states and comparing against an operator's preconditions and effects. Other approaches (to name a few: [11, 17, 21, 25, 26, 30]), comprising a significant body of literature, have explored recovering from planning failures, refining domain models, and open-world planning.

These approaches do not handle the synthesis of the underlying implementation of an operator as we have proposed. That is, they synthesize new operators, but no executors. That said, the rich techniques offered by these approaches could usefully be integrated into our online learner. As we describe in Section 3.3, the online learner follows an ε-greedy exploration policy. Future work will explore how our online learner could be extended to conduct symbolically-guided experiments as these approaches suggest.

More recently, there has been a growing body of literature exploring how domain models can be learned from action traces [3, 14, 16, 36]. The ARMS system learns PDDL planning operators from examples. The examples consist of successful fluent state and operator pairs corresponding to a sequence of transitions in the symbolic domain [35]. Subsequent work has explored how domains can be learned with partial knowledge of successful traces [2, 6], and with neural networks capable of approximating from partial traces [33] and learning models from pixels [4, 7].

In the RL context, there has been recent work in learning symbolic concepts [18] from low-level actions. Specifically, Konidaris et al. assume the agent has available to it a set of high-level actions, couched in the options framework for hierarchical reinforcement learning. The agent must learn a symbolic planning domain with operators and their preconditions and effects. This approach to integrating planning and learning is, in a sense, a reverse of our approach. While their use case is an agent that has high-level actions but no symbolic representation, ours assumes that we have (through hypothesizing from the backward search) most of the symbolic scaffolding, but need to learn the policies themselves.

6.2 Learning Action Executors

There has been a tradition of research in leveraging planning domain knowledge and symbolically provided constraints to improve the sample efficiency of RL systems. Grzes et al. use a potential function (as a difference between source and destination fluent states) to shape rewards of a Q-learner [13]. While aligned with our own targeted Q-learners, their approach requires the existence of symbolic plans, which our agent does not possess. A related approach (PLANQ) combines STRIPS planning with Q-learning to allow an agent to converge faster to an optimal policy [12]. Planning with Partially Specified Behaviors (PPSB) is based on PLANQ, and takes as input symbolic domain information and produces Q-values for each of the high-level operators [1]. Like Grzes et al., PLANQ and PPSB assume the existence of a plan, or at least a plannable task. Other recent approaches have extended these ideas to incorporate the use of partial plans [15]. The approach proffered by Illanes et al. improves the underlying learner's sample efficiency, but can also be provided with symbolic goals. However, their approach assumes the agent has a complete (partial order) plan for its environment, and can use that plan to construct a reward function for learning options to accomplish a task. One key difference between their work and ours is that we construct our own operators, whereas they use a set of existing operators to learn policies.

Generally, while these methods learn action implementations, they assume the existence of a successful plan or operator definition. In the proposed framework, the agent has neither and must hypothesize operators and their corresponding executors.

RL techniques have also been used to improve the quality of symbolic plans. The DARLING framework uses symbolic planning and reasoning to constrain exploration and reduce the search space, and uses RL to improve the quality of the plan produced [19]. While our approach shares the common goal of reducing the brittleness of planning domains, they do not modify the planning model. The PEORL framework works to choose a plan that maximizes a reward function, thereby improving the quality of the plan using RL [34]. More recently, Lyu et al. propose a framework (SDRL) that generalizes the PEORL framework with intrinsic goals and integration with deep reinforcement learning (DRL) machinery [20]. However, neither PEORL nor SDRL synthesizes new operators or learns planning domains. Our work also differs from the majority of model-based RL approaches [23, 32] in that we are interested in STRIPS-like explicit operators that are useful in symbolic planning, and thereby transferable to a wide range of tasks within a domain.

7 CONCLUSION

Automatically synthesizing new operators during task performance is a crucial capability needed for symbolic planning systems. As we have examined here, such a capability requires a deep integration between planning and learning, one that benefits from leveraging RL to fill in missing connections between symbolic states. While exploring the MDP state space, SPOTTER can identify states from which symbolic planning is possible, use this to identify subgoals for which operators can be synthesized, and learn policies for these operators by RL. SPOTTER thus allows a symbolic planner to explore previously unreachable states, synthesize new operators, and accomplish new goals.

8 ACKNOWLEDGEMENTS

This work was funded in part by NSF grant IIS-2044786 and DARPA grant W911NF-20-2-0006.

REFERENCES

[1] Javier Segovia Aguas, Jonathan Ferrer-Mestres, and Anders Jonsson. 2016. Planning with Partially Specified Behaviors. In CCIA. 263-272.
[2] Diego Aineto, Sergio Jiménez, and Eva Onaindia. 2019. Learning STRIPS action models with classical planning. arXiv preprint arXiv:1903.01153 (2019).
[3] Ankuj Arora, Humbert Fiorino, Damien Pellier, Marc Métivier, and Sylvie Pesty. 2018. A review of learning planning action models. The Knowledge Engineering Review 33 (2018).
[4] Masataro Asai and Alex Fukunaga. 2017. Classical planning in deep latent space: Bridging the subsymbolic-symbolic boundary. arXiv preprint arXiv:1705.00154 (2017).
[5] Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. 2018. Minimalistic Gridworld Environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid.
[6] Stephen N Cresswell, Thomas Leo McCluskey, and Margaret M West. 2013. Acquiring planning domain models using LOCM. Knowledge Engineering Review 28, 2 (2013), 195-213.
[7] Andrea Dittadi, Thomas Bolander, and Ole Winther. 2018. Learning to Plan from Raw Data in Grid-based Games. In GCAI. 54-67.
[8] Richard E Fikes and Nils J Nilsson. 1971. STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence 2, 3-4 (1971), 189-208.
[9] Malik Ghallab, Dana Nau, and Paolo Traverso. 2016. Automated planning and acting. Cambridge University Press.
[10] Yolanda Gil. 1994. Learning by experimentation: Incremental refinement of incomplete planning domains. In Machine Learning Proceedings 1994. Elsevier, 87-95.
[11] Evana Gizzi, Mateo Guaman Castro, and Jivko Sinapov. 2019. Creative Problem Solving by Robots Using Action Primitive Discovery. In 2019 Joint IEEE 9th International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob). IEEE, 228-233.
[12] Matthew Grounds and Daniel Kudenko. 2005. Combining reinforcement learning with symbolic planning. In Adaptive Agents and Multi-Agent Systems III. Adaptation and Multi-Agent Learning. Springer, 75-86.
[13] Marek Grzes and Daniel Kudenko. 2008. Plan-based reward shaping for reinforcement learning. In 2008 4th International IEEE Conference Intelligent Systems, Vol. 2. IEEE, 10-22.
[14] Chad Hogg, Ugur Kuter, and Hector Munoz-Avila. 2010. Learning methods to generate good plans: Integrating HTN learning and reinforcement learning. In Twenty-Fourth AAAI Conference on Artificial Intelligence.
[15] León Illanes, Xi Yan, Rodrigo Toro Icarte, and Sheila A McIlraith. 2020. Symbolic Plans as High-Level Instructions for Reinforcement Learning. In Proceedings of the International Conference on Automated Planning and Scheduling, Vol. 30. 540-550.
[16] Sergio Jiménez, Tomás De La Rosa, Susana Fernández, Fernando Fernández, and Daniel Borrajo. 2012. A review of machine learning for automated planning. The Knowledge Engineering Review 27, 4 (2012), 433-467.
[17] Saket Joshi, Paul Schermerhorn, Roni Khardon, and Matthias Scheutz. 2012. Abstract planning for reactive robots. In 2012 IEEE International Conference on Robotics and Automation. IEEE, 4379-4384.
[18] George Konidaris, Leslie Pack Kaelbling, and Tomas Lozano-Perez. 2018. From skills to symbols: Learning symbolic representations for abstract high-level planning. Journal of Artificial Intelligence Research 61 (2018), 215-289.
[19] Matteo Leonetti, Luca Iocchi, and Peter Stone. 2016. A synthesis of automated planning and reinforcement learning for efficient, robust decision-making. Artificial Intelligence 241 (2016), 103-130.
[20] Daoming Lyu, Fangkai Yang, Bo Liu, and Steven Gustafson. 2019. SDRL: interpretable and data-efficient deep reinforcement learning leveraging symbolic planning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 2970-2977.
[21] Tom M Mitchell, Richard M Keller, and Smadar T Kedar-Cabelli. 1986. Explanation-based generalization: A unifying view. Machine Learning 1, 1 (1986), 47-80.
[22] Sanmit Narvekar, Bei Peng, Matteo Leonetti, Jivko Sinapov, Matthew E. Taylor, and Peter Stone. 2020. Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey. Journal of Machine Learning Research 21, 181 (2020), 1-50. http://jmlr.org/papers/v21/20-212.html
[23] Jun Hao Alvin Ng and Ronald PA Petrick. 2019. Incremental Learning of Planning Actions in Model-Based Reinforcement Learning. In IJCAI. 3195-3201.
[24] Vasanth Sarathy and Matthias Scheutz. 2018. MacGyver problems: AI challenges for testing resourcefulness and creativity. Advances in Cognitive Systems 6 (2018), 31-44.
[25] Wei-Min Shen. 1989. Learning from the environment based on percepts and actions. Ph.D. Dissertation. Carnegie Mellon University.
[26] Wei-Min Shen and Herbert A Simon. 1989. Rule Creation and Rule Learning Through Environmental Exploration. In IJCAI. 675-680.
[27] Armando Solar-Lezama. 2009. The sketching approach to program synthesis. In Asian Symposium on Programming Languages and Systems. Springer, 4-13.
[28] Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT Press.
[29] Richard S Sutton, Doina Precup, and Satinder Singh. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112, 1-2 (1999), 181-211.
[30] Kartik Talamadupula, J Benton, Subbarao Kambhampati, Paul Schermerhorn, and Matthias Scheutz. 2010. Planning for human-robot teaming in open worlds. ACM Transactions on Intelligent Systems and Technology (TIST) 1, 2 (2010), 1-24.
[31] Christopher JCH Watkins and Peter Dayan. 1992. Q-learning. Machine Learning 8, 3-4 (1992), 279-292.
[32] John Winder, Stephanie Milani, Matthew Landen, Erebus Oh, Shane Parr, Shawn Squire, Marie desJardins, and Cynthia Matuszek. 2019. Planning with Abstract Learned Models While Learning Transferable Subtasks. arXiv preprint arXiv:1912.07544 (2019).
[33] Zhanhao Xiao, Hai Wan, Hankui Hankz Zhuo, Jinxia Lin, and Yanan Liu. 2019. Representation Learning for Classical Planning from Partially Observed Traces. arXiv preprint arXiv:1907.08352 (2019).
[34] Fangkai Yang, Daoming Lyu, Bo Liu, and Steven Gustafson. 2018. PEORL: Integrating symbolic planning and hierarchical reinforcement learning for robust decision-making. arXiv preprint arXiv:1804.07779 (2018).
[35] Qiang Yang, Kangheng Wu, and Yunfei Jiang. 2007. Learning action models from plan examples using weighted MAX-SAT. Artificial Intelligence 171, 2-3 (2007), 107-143.
[36] Terry Zimmerman and Subbarao Kambhampati. 2003. Learning-assisted automated planning: looking back, taking stock, going forward. AI Magazine 24, 2 (2003), 73-73.

SPOTTER: Supplementary Material

Contents

1 Extended Experiments with Baselines

2 Videos

3 Open World PDDL

4 Discovered Operator

1 Extended Experiments with Baselines

Below are figures showing experimental results for our baseline algorithms over 2,000,000 episodes.

Figure 1: Baselines up to 2,000,000 episodes, normalized to the maximum reward obtained by SPOTTER. In the long run, the baselines perform better than SPOTTER; however, they take much longer and do not learn transferable knowledge as effectively as SPOTTER.

Figure 2: Baselines up to 200,000 episodes, normalized to the maximum reward obtained by SPOTTER.

Figure 3: Baselines up to 500,000 episodes, normalized to the maximum reward obtained by SPOTTER.

2 Videos

Attached in the supplementary materials package are several videos showing specific runs of SPOTTER through the environment. They begin with Level 1, which SPOTTER can solve entirely by planning (i.e., no need for learning), and then proceed to learning in Level 2. In Level 2, SPOTTER at first needs to explore (episode 0); then, when it hits a plannable state, it can transition to planning the rest of the way to the goal; and once it finds the operator, it can complete Level 2 entirely without any RL, using only planning. One promising aspect of SPOTTER is that in Level 3, we show that it can simply use the learned operator in a plan, without instantiating a new learning session, even though the goals have changed substantially.

• SPOTTER Level 1 (planning)
• SPOTTER Level 2 (episode 0)
• SPOTTER Level 2 (hit plannable state)
• SPOTTER Level 2 (found operator)

4 • SPOTTER Level 3 (plan with learned operator)

We also provide a video for precondition generalization, in which we show how the distribution of preconditions evolves over time. The orange bars show that more general preconditions have superseded less general ones (blue).

3 Open World PDDL

Below is the original PDDL domain given to SPOTTER for the 2-room-blocked minigrid environment.

(define (domain gridworld_abstract)

(:requirements :strips :typing )

(:types thing room - object
        agent physobject - thing
        door nondoor - physobject
        notgraspable graspable - nondoor
        wall floor goal lava - notgraspable
        key ball box - graspable)

(:predicates (nexttofacing ?a - agent ?thing - physobject)
             (open ?t - door)
             (closed ?t - door)
             (locked ?t - door)
             (holding ?a - agent ?t - graspable)
             (handsfree ?a - agent)
             (obstructed ?a - agent)
             (blocked ?t - door)
             (atgoal ?a - agent ?g - goal)
             (inroom ?a - agent ?g - physobject))

(:action pickup
  :parameters (?a - agent ?thing - graspable)
  :precondition (and (handsfree ?a) (nexttofacing ?a ?thing) (not (holding ?a ?thing)))
  :unknown (and)
  :effect (and (not (handsfree ?a)) (not (nexttofacing ?a ?thing)) (not (obstructed ?a)) (holding ?a ?thing)))

(:action putdown
  :parameters (?a - agent ?thing - graspable)
  :precondition (and (not (obstructed ?a)) (holding ?a ?thing))
  :unknown (and (blocked *))
  :effect (and (not (holding ?a ?thing)) (handsfree ?a) (nexttofacing ?a ?thing) (obstructed ?a)))

(:action gotoobj1
  :parameters (?a - agent ?thing - graspable)
  :precondition (and (not (holding ?a ?thing)) (not (obstructed ?a)) (inroom ?a ?thing))
  :unknown (and)
  :effect (and (obstructed ?a) (nexttofacing ?a ?thing)))

(:action gotoobj2
  :parameters (?a - agent ?thing - notgraspable)
  :precondition (and (not (obstructed ?a)) (inroom ?a ?thing))
  :unknown (and)
  :effect (and (obstructed ?a) (nexttofacing ?a ?thing)))

(:action gotoobj3
  :parameters (?a - agent ?thing - graspable ?obstruction - thing)
  :precondition (and (not (holding ?a ?thing)) (nexttofacing ?a ?obstruction) (inroom ?a ?thing))
  :unknown (and)
  :effect (and (nexttofacing ?a ?thing) (not (nexttofacing ?a ?obstruction))))

(:action gotoobj4
  :parameters (?a - agent ?thing - notgraspable ?obstruction - thing)
  :precondition (and (nexttofacing ?a ?obstruction) (inroom ?a ?thing))
  :unknown (and)
  :effect (and (nexttofacing ?a ?thing) (not (nexttofacing ?a ?obstruction))))

(:action usekey
  :parameters (?a - agent ?key - key ?door - door)
  :precondition (and (nexttofacing ?a ?door) (holding ?a ?key) (locked ?door))
  :unknown (and)
  :effect (and (open ?door) (not (closed ?door)) (not (locked ?door))))

(:action opendoor
  :parameters (?a - agent ?door - door)
  :precondition (and (nexttofacing ?a ?door) (closed ?door))
  :unknown (and)
  :effect (and (open ?door) (not (closed ?door)) (not (locked ?door))))

(:action stepinto
  :parameters (?a - agent ?g - goal)
  :precondition (and (nexttofacing ?a ?g))
  :unknown (and (nexttofacing ?a *))
  :effect (and (not (nexttofacing ?a ?g)) (atgoal ?a ?g)))

(:action gotodoor1
  :parameters (?a - agent ?thing - door)
  :precondition (and (not (obstructed ?a)) (not (blocked ?thing)) (inroom ?a ?thing))
  :unknown (and)
  :effect (and (obstructed ?a) (nexttofacing ?a ?thing)))

(:action gotodoor2
  :parameters (?a - agent ?thing - door ?obstruction - thing)
  :precondition (and (nexttofacing ?a ?obstruction) (not (blocked ?thing)) (inroom ?a ?thing))
  :unknown (and)
  :effect (and (nexttofacing ?a ?thing) (not (nexttofacing ?a ?obstruction))))

(:action enterroomof
  :parameters (?a - agent ?d - door ?g - physobject)
  :precondition (and (not (blocked ?d)) (nexttofacing ?a ?d) (open ?d))
  :unknown (and)
  :effect (and (inroom ?a ?g) (not (nexttofacing ?a ?d)) (not (obstructed ?a))))

)
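For concreteness, a problem instance for this domain might look like the following minimal sketch. The object names, initial literals, and goal shown here are illustrative placeholders, not the instance actually generated from the minigrid environment in our experiments.

(define (problem tworoom-blocked-example)  ; illustrative instance, not the one used in the experiments
  (:domain gridworld_abstract)
  (:objects agent1 - agent
            key1 - key
            ball1 - ball
            door1 - door
            goal1 - goal)
  ; the agent starts empty-handed in a room containing a key, a ball, and a locked, blocked door
  (:init (handsfree agent1)
         (locked door1)
         (blocked door1)
         (inroom agent1 key1)
         (inroom agent1 ball1)
         (inroom agent1 door1))
  ; the goal square is in the other room, behind the door
  (:goal (atgoal agent1 goal1)))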

4 Discovered Operator

Below is an example of an operator discovered by SPOTTER.

+PR: (inroom agent obj0013)
+PR: (inroom agent key)
+PR: (inroom agent obj0026)
+PR: (inroom agent obj0005)
+PR: (inroom agent obj0011)
+PR: (inroom agent obj0028)
+PR: (locked door)
+PR: (inroom agent obj0009)
+PR: (inroom agent obj0023)
+PR: (inroom agent obj0001)
+PR: (inroom agent obj0007)
+PR: (inroom agent obj0002)
+PR: (inroom agent obj0020)
+PR: (holding agent key)
+PR: (inroom agent obj0025)
+PR: (inroom agent obj0016)
+PR: (inroom agent obj0006)
+PR: (nexttofacing agent obj0015)
+PR: (inroom agent obj0012)
+PR: (inroom agent obj0017)
+PR: (inroom agent obj0027)
+PR: (inroom agent obj0003)
+PR: (inroom agent obj0000)
+PR: (inroom agent obj0022)
+PR: (inroom agent obj0015)
+PR: (inroom agent obj0031)
+PR: (inroom agent obj0008)
+PR: (inroom agent obj0019)
+PR: (inroom agent obj0004)
+PR: (inroom agent obj0029)
+PR: (inroom agent obj0021)

+PR: (inroom agent ball)
+PR: (inroom agent obj0030)
+PR: (inroom agent obj0014)
+PR: (inroom agent obj0024)
+PR: (inroom agent door)
+PR: (inroom agent obj0018)
+PR: (inroom agent obj0010)
+PR: (obstructed agent)
−PR: (inroom agent obj0038)
−PR: (inroom agent obj0058)
−PR: (inroom agent obj0062)
−PR: (nexttofacing agent obj0004)
−PR: (nexttofacing agent obj0023)
−PR: (nexttofacing agent obj0032)
−PR: (nexttofacing agent obj0014)
−PR: (inroom agent obj0047)
−PR: (nexttofacing agent obj0029)
−PR: (nexttofacing agent obj0046)
−PR: (nexttofacing agent obj0016)
−PR: (inroom agent obj0059)
−PR: (inroom agent obj0042)
−PR: (nexttofacing agent obj0052)
−PR: (nexttofacing agent obj0049)
−PR: (nexttofacing agent obj0006)
−PR: (nexttofacing agent obj0033)
−PR: (nexttofacing agent obj0003)
−PR: (inroom agent obj0054)
−PR: (inroom agent obj0048)
−PR: (inroom agent obj0055)
−PR: (nexttofacing agent obj0027)
−PR: (nexttofacing agent obj0063)
−PR: (inroom agent obj0057)
−PR: (nexttofacing agent obj0013)
−PR: (nexttofacing agent obj0048)
−PR: (inroom agent obj0053)
−PR: (nexttofacing agent goal)
−PR: (holding agent ball)
−PR: (nexttofacing agent obj0012)
−PR: (inroom agent obj0033)
−PR: (inroom agent obj0040)
−PR: (nexttofacing agent obj0051)
−PR: (inroom agent obj0045)
−PR: (inroom agent obj0039)
−PR: (nexttofacing agent obj0041)
−PR: (nexttofacing agent obj0022)
−PR: (inroom agent obj0036)

−PR: (inroom agent obj0051)
−PR: (nexttofacing agent obj0058)
−PR: (handsfree agent)
−PR: (nexttofacing agent obj0040)
−PR: (inroom agent obj0035)
−PR: (nexttofacing agent obj0002)
−PR: (inroom agent obj0050)
−PR: (nexttofacing agent obj0026)
−PR: (inroom agent obj0041)
−PR: (nexttofacing agent obj0028)
−PR: (atgoal agent goal)
−PR: (nexttofacing agent obj0054)
−PR: (nexttofacing agent obj0045)
−PR: (nexttofacing agent obj0059)
−PR: (inroom agent obj0056)
−PR: (inroom agent obj0046)
−PR: (inroom agent obj0034)
−PR: (nexttofacing agent ball)
−PR: (nexttofacing agent obj0009)
−PR: (nexttofacing agent obj0010)
−PR: (inroom agent obj0044)
−PR: (nexttofacing agent obj0008)
−PR: (nexttofacing agent obj0025)
−PR: (nexttofacing agent obj0030)
−PR: (nexttofacing agent obj0024)
−PR: (nexttofacing agent obj0011)
−PR: (nexttofacing agent obj0061)
−PR: (nexttofacing agent key)
−PR: (nexttofacing agent obj0005)
−PR: (inroom agent goal)
−PR: (closed door)
−PR: (nexttofacing agent obj0053)
−PR: (nexttofacing agent agent)
−PR: (inroom agent obj0052)
−PR: (nexttofacing agent obj0044)
−PR: (open door)
−PR: (nexttofacing agent obj0043)
−PR: (inroom agent obj0037)
−PR: (nexttofacing agent obj0019)
−PR: (nexttofacing agent obj0021)
−PR: (nexttofacing agent door)
−PR: (nexttofacing agent obj0050)
−PR: (nexttofacing agent obj0056)
−PR: (inroom agent obj0063)
−PR: (nexttofacing agent obj0047)
−PR: (nexttofacing agent obj0020)

−PR: (nexttofacing agent obj0039)
−PR: (nexttofacing agent obj0007)
−PR: (nexttofacing agent obj0031)
−PR: (inroom agent obj0032)
−PR: (nexttofacing agent obj0000)
−PR: (nexttofacing agent obj0034)
−PR: (nexttofacing agent obj0062)
−PR: (inroom agent obj0060)
−PR: (inroom agent obj0043)
−PR: (nexttofacing agent obj0060)
−PR: (nexttofacing agent obj0038)
−PR: (inroom agent obj0049)
−PR: (nexttofacing agent obj0055)
−PR: (nexttofacing agent obj0001)
−PR: (nexttofacing agent obj0017)
−PR: (nexttofacing agent obj0042)
−PR: (inroom agent obj0061)
−PR: (nexttofacing agent obj0037)
−PR: (nexttofacing agent obj0035)
−PR: (nexttofacing agent obj0036)
−PR: (nexttofacing agent obj0018)
−PR: (nexttofacing agent obj0057)
ADD: (nexttofacing agent ball)
ADD: (handsfree agent)
ADD: (inroom agent key)
ADD: (inroom agent door)
ADD: (locked door)
DEL: (holding agent key)
DEL: (blocked door)
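In this listing, +PR and −PR mark literals that were included as positive and negated preconditions of the discovered operator, and ADD and DEL mark its add and delete effects. Under that reading, and written in the operator format of the domain file above, the discovered (grounded) operator has roughly the shape sketched below; the operator name is illustrative, only a handful of the precondition literals are shown, and the :unknown field is omitted.

(:action discoveredoperator0           ; illustrative name
  :parameters ()                       ; the operator is grounded: literals name specific objects
  :precondition (and (holding agent key)
                     (locked door)
                     (obstructed agent)
                     (nexttofacing agent obj0015)
                     (not (handsfree agent))
                     (not (open door))
                     (not (closed door))
                     ; ... every remaining +PR literal appears here positively,
                     ; and every remaining -PR literal appears negated
                     )
  :effect (and (nexttofacing agent ball)
               (handsfree agent)
               (inroom agent key)
               (inroom agent door)
               (locked door)
               (not (holding agent key))
               (not (blocked door))))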
