
SPOTTER: Extending Symbolic Planning Operators through Targeted Reinforcement Learning

Vasanth Sarathy∗ (Smart Information Flow Technologies, Lexington, MA) [email protected]
Daniel Kasenberg∗ (Tufts University, Medford, MA) [email protected]
Shivam Goel (Tufts University, Medford, MA) [email protected]
Jivko Sinapov (Tufts University, Medford, MA) [email protected]
Matthias Scheutz (Tufts University, Medford, MA) [email protected]

ABSTRACT

Symbolic planning models allow decision-making agents to sequence actions in arbitrary ways to achieve a variety of goals in dynamic domains. However, they are typically handcrafted and tend to require precise formulations that are not robust to human error. Reinforcement learning (RL) approaches do not require such models, and instead learn domain dynamics by exploring the environment and collecting rewards. However, RL approaches tend to require millions of episodes of experience and often learn policies that are not easily transferable to other tasks. In this paper, we address one aspect of the open problem of integrating these approaches: how can decision-making agents resolve discrepancies in their symbolic planning models while attempting to accomplish goals? We propose an integrated framework named SPOTTER that uses RL to augment and support ("spot") a planning agent by discovering new operators needed by the agent to accomplish goals that are initially unreachable for the agent. SPOTTER outperforms pure-RL approaches while also discovering transferable symbolic knowledge and does not require supervision, successful plan traces or any a priori knowledge about the missing planning operator.

KEYWORDS

Planning; Reinforcement Learning

ACM Reference Format:
Vasanth Sarathy, Daniel Kasenberg, Shivam Goel, Jivko Sinapov, and Matthias Scheutz. 2021. SPOTTER: Extending Symbolic Planning Operators through Targeted Reinforcement Learning. In Proc. of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2021), Online, May 3–7, 2021, IFAAMAS, 9 pages.

∗The first two authors contributed equally.

Proc. of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2021), U. Endriss, A. Nowé, F. Dignum, A. Lomuscio (eds.), May 3–7, 2021, Online. © 2021 International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved.

1 INTRODUCTION

Symbolic planning approaches focus on synthesizing a sequence of operators capable of achieving a desired goal [9]. These approaches rely on an accurate high-level symbolic description of the dynamics of the environment. Such a description affords these approaches the benefit of generalizability and abstraction (the model can be used to complete a variety of tasks), available human knowledge, and interpretability. However, the models are often handcrafted, difficult to design and implement, and require precise formulations that can be sensitive to human error. Reinforcement learning (RL) approaches do not assume the existence of such a domain model, and instead attempt to learn suitable models or control policies by trial-and-error interactions in the environment [28]. However, RL approaches tend to require a substantial amount of training in moderately complex environments. Moreover, it has been difficult to learn abstractions to the level of those used in symbolic planning approaches through low-level reinforcement-based exploration. Integrating RL and symbolic planning is highly desirable, enabling autonomous agents that are robust, resilient and resourceful.

Among the challenges in integrating symbolic planning and RL, we focus here on the problem of how partially specified symbolic models can be extended during task performance. Many real-world robotic and autonomous systems already have preexisting models and programmatic implementations of action hierarchies. These systems are robust to many anticipated situations. We are interested in how these systems can adapt to unanticipated situations, autonomously and with no supervision.

In this paper, we present SPOTTER (Synthesizing Planning Operators through Targeted Exploration and Reinforcement). Unlike other approaches to action model learning, SPOTTER does not have access to successful symbolic traces. Unlike other approaches to learning low-level implementations of symbolic operators, SPOTTER does not know a priori what operators to learn. Unlike other approaches which use symbolic knowledge to guide exploration, SPOTTER does not assume the existence of partial plans.

We focus on the case where the agent is faced with a symbolic goal but has neither the necessary symbolic operator to be able to synthesize a plan nor its low-level controller implementation. The agent must invent both. SPOTTER leverages the idea that the agent can proceed by planning alone unless there is a discrepancy between its model of the environment and the environment's dynamics. In our approach, the agent attempts to plan to the goal. When no plan is found, the agent explores its environment using an online explorer looking to reach any state from which it can plan to the goal. When such a state is reached, the agent spawns offline "subgoal" RL learners to learn policies which can consistently achieve those conditions. As the exploration agent acts, the subgoal learners learn from its trajectory in parallel.


The subgoal learners regularly attempt to generate symbolic preconditions from which their candidate operators have high value; if such preconditions can be found, the operator is added with those preconditions into the planning domain. We evaluate the approach with experiments in a gridworld environment in which the agent solves three puzzles involving unlocking doors and moving objects.

In this paper, our contributions are as follows: a framework for integrated RL and symbolic planning; algorithms for solving problems in finite, deterministic domains in which the agent has partial domain knowledge and must reach a seemingly unreachable goal state; and experimental results showing substantial improvements over baseline approaches in terms of cumulative reward, rate of learning and transferable knowledge learned.

1.1 Running example: GridWorld

Throughout this paper, we will use as a running example a GridWorld puzzle (Figure 1) in which an agent must unlock a door that is blocked by a ball. The agent's planning domain abstracts out the notion of specific grid cells, and so all navigation is framed in terms of going to specific objects. Under this abstraction, no action sequence allows the agent to "move the ball out of the way". Planning actions are implemented as handcrafted programs that navigate the puzzle and execute low-level actions (up, down, turn left/right, pickup). Because the door is blocked, the agent cannot generate a symbolic plan to solve the problem; it must synthesize new symbolic actions or operators. An agent using SPOTTER performs RL on low-level actions to learn to reach states from which a plan to the goal exists, and thus can learn a symbolic operator corresponding to "moving the ball out of the way".

2 BACKGROUND

In this section, we provide a background of relevant concepts in planning and learning.

2.1 Open-World Symbolic Planning

We formalize the planning task as an open-world variant of propositional STRIPS [8], T = ⟨F, O, σ0, σ̃g⟩. We consider F (fluents) to be the set of propositional state variables f. A fluent state σ is a complete assignment of values to all fluents in F. That is, |σ| = |F|, and σ includes positive literals (f) and negative literals (¬f). σ0 represents the initial fluent state. We assume full observability of the initial fluent state. We can define a partial fluent state σ̃ to refer to a partial assignment of values to the fluents F. The goal condition is represented as a partial fluent state σ̃g. We define L(F) to be the set of all partial fluent states with fluents in F.

The operators under this formalism are open-world. A partial planning operator can be defined as o = ⟨pre(o), eff(o), static(o)⟩. pre(o) ∈ L(F) are the preconditions, eff(o) ∈ L(F) are the effects, and static(o) ⊆ F are those fluents whose values are known not to change during the execution of the operator. An operator o is applicable in a partial fluent state σ̃ if pre(o) ⊆ σ̃. The result of executing an operator o from a partial fluent state σ̃ is given by the successor function δ(σ̃, o) = eff(o) ∪ restrict(σ̃, static(o)), where restrict(σ̃, F′) is the partial fluent state consisting only of σ̃'s values for fluents in F′. The set of all partial fluent states defined through repeated application of the successor function beginning at σ̃ provides the set of reachable partial fluent states Δσ̃. A complete operator is an operator õ where all the fluents are fully accounted for, namely ∀f ∈ F, f ∈ eff+(õ) ∪ eff−(õ) ∪ static(õ), where eff+(o) = {f ∈ F : f ∈ eff(o)} and eff−(o) = {f ∈ F : ¬f ∈ eff(o)}. We assume all operators o satisfy eff(o) \ pre(o) ≠ ∅; these are the only operators useful for our purposes.¹ A plan π_T is a sequence of operators ⟨o1, ..., on⟩. A plan π_T is executable in state σ0 if, for all i ∈ {1, ..., n}, pre(oi) ⊆ σ̃_{i−1}, where σ̃_i = δ(σ̃_{i−1}, oi). A plan π_T is said to solve the task T if executing π_T from σ0 induces a trajectory ⟨σ0, o1, σ̃1, ..., on, σ̃n⟩ that reaches the goal state, namely σ̃g ⊆ σ̃n.
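To make the formalism concrete, the following is a minimal Python sketch of partial fluent states and open-world operator application. This is our own illustration rather than the authors' released code; the encoding of literals as (fluent, value) pairs, the class layout, and the grounded fluent names in the usage example are assumptions.

from typing import FrozenSet, Tuple

# A literal is a (fluent, value) pair; a (partial) fluent state is a set of literals.
Literal = Tuple[str, bool]
FluentState = FrozenSet[Literal]

class Operator:
    def __init__(self, name, pre, eff, static):
        self.name = name
        self.pre = frozenset(pre)        # pre(o): literals required for o to be applicable
        self.eff = frozenset(eff)        # eff(o): literals guaranteed to hold after o
        self.static = frozenset(static)  # static(o): fluent *names* known not to change

def restrict(sigma, fluents):
    """restrict(sigma, F'): keep only sigma's values for fluents in F'."""
    return frozenset((f, v) for (f, v) in sigma if f in fluents)

def applicable(op, sigma):
    """o is applicable in a partial fluent state iff pre(o) is a subset of it."""
    return op.pre <= sigma

def delta(sigma, op):
    """Successor function: delta(sigma, o) = eff(o) ∪ restrict(sigma, static(o))."""
    return op.eff | restrict(sigma, op.static)

# Illustrative (hypothetical) grounded fluents and operator, loosely following Section 3.1.1:
sigma = frozenset([("holding(agent,key)", False), ("blocked(key)", False), ("inRoom(agent,key)", True)])
go_to_key = Operator("goToObj(agent,key)",
                     pre=[("holding(agent,key)", False), ("blocked(key)", False), ("inRoom(agent,key)", True)],
                     eff=[("nextToFacing(agent,key)", True)],
                     static=[])
assert applicable(go_to_key, sigma)
print(delta(sigma, go_to_key))   # only the effect literals survive, since static(o) is empty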
An open-world forward search (OWFS) is a breadth-first plan search procedure where each node is a partial fluent state σ̃. The successor relationships and applicability of the operators in O are specified as defined above and used to generate successor nodes in the search space. A plan is synthesized once σ̃g has been reached. If σ0 is a complete state and the operators are complete operators, then each node in the search tree will be a complete fluent state as well.

We also define the notion of a "relevant" operator and "regressor" nodes, as used in backward planning search [9]. For a partial fluent state σ̃ with fluents f_i and their valuations c_i ∈ {True, False}, an operator o is relevant at partial fluent state σ̃ when:

(1) restrict(σ̃ \ eff(o), static(o)) = σ̃ \ eff(o), and
(2) σ̃ ⊇ eff(o) ∪ restrict(pre(o), static(o)).

When a relevant operator o is found for a particular σ̃, it can be regressed to generate a partial fluent state δ⁻¹(σ̃, o) = pre(o) ∪ (σ̃ \ eff(o)).² Regression is the "inverse" of operator application in the following sense: if applying an operator sequence ⟨o1, ..., on⟩ from σ̃ yields the final state σ̃′, and we let σ̃″ = δ⁻¹(... (δ⁻¹(σ̃′, on) ...), o1), then σ̃″ ⊆ σ̃. In particular, σ̃″ is the minimal partial fluent state from which application of ⟨o1, ..., on⟩ results in σ̃′.
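Continuing the sketch above, relevance checking and regression can be written directly from the definitions; regress_plan previews how SPOTTER later regresses a plan to obtain candidate subgoals. Again, this is an illustrative sketch under the same representational assumptions, not the paper's implementation.

def relevant(op, sigma):
    """Conditions (1) and (2) above: o is relevant at the partial fluent state sigma."""
    remainder = sigma - op.eff                               # sigma \ eff(o)
    cond1 = restrict(remainder, op.static) == remainder      # condition (1)
    cond2 = sigma >= (op.eff | restrict(op.pre, op.static))  # condition (2)
    return cond1 and cond2

def delta_inv(sigma, op):
    """Regression: delta^{-1}(sigma, o) = pre(o) ∪ (sigma \ eff(o))."""
    return op.pre | (sigma - op.eff)

def regress_plan(goal, plan):
    """Regress a goal backwards through a plan o_1..o_n, yielding the
    increasingly early candidate subgoals used by SPOTTER's learn procedure."""
    subgoal = goal
    subgoals = []
    for op in reversed(plan):
        subgoal = delta_inv(subgoal, op)
        subgoals.append(subgoal)
    return subgoals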

Figure 1: The agent's (red triangle) goal is to unlock the door (yellow square). SPOTTER can learn how to move the blue ball out of the way. The learned representation is symbolic and can be used to plan (without additional learning) to achieve different goals like reaching the green square.

¹ We also assume (eff+(o) ∪ eff−(o)) ∩ static(o) = ∅, i.e., that postconditions and static variables do not conflict.
² This is a slight abuse of notation because δ⁻¹ is not the function inverse of δ, but they can be thought of as inverses in the sense described in this paragraph.


2.2 Reinforcement Learning

We formalize the environment in which the agent acts as a Markov Decision Process (MDP) M = ⟨S, A, p, r, ι, γ⟩, where S is the set of states, A is the set of actions, p is the transition probability distribution p(s_{t+1} | s_t, a_t), r : S × A × S → ℝ is a reward function, ι is a probability distribution over initial states, and γ ∈ (0, 1] is a discount factor [28]. A policy for M is defined as the probability distribution π_M(a | s) that establishes the probability of an agent taking an action a given that it is in the current state s. We define the set of all such policies in M as Π_M. We let S0 = {s ∈ S : ι(s) > 0}. An RL problem typically consists of finding an optimal policy π*_M ∈ Π_M that maximizes the expected discounted future rewards obtained from s ∈ S:

$$\pi^*_M = \arg\max_{\pi_M \in \Pi_M} \sum_{s \in S} v_{\pi_M}(s),$$

where v_{π_M}(s) is the value function and captures the expected discounted future rewards obtained when starting at state s and selecting actions according to the policy π_M:

$$v_{\pi_M}(s) = \mathbb{E}_{\pi_M}\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s\right].$$

At each time step, the agent executes an action a and the environment returns the next state s′ ∈ S (sampled from p) and an immediate reward r. The experience is then used by the agent to learn and improve its current policy π_M.

Q-learning [31] is one learning technique in which an agent uses experiences to estimate the optimal Q-function q*(s, a) for every state s ∈ S and a ∈ A, where q*(s, a) is the expected discounted sum of future rewards received by performing action a in state s. The Q-function is updated as follows:

$$q(s, a) \leftarrow q(s, a) + \alpha \left[ r + \gamma \max_{a' \in A} q(s', a') - q(s, a) \right],$$

where α ∈ (0, 1] is the learning rate. The Q-learner can explore the environment, e.g., by following an ϵ-greedy policy, in which the agent selects a random action with probability ϵ and otherwise follows an action with the largest q(s, a).
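As a concrete reference point, a tabular Q-learner with this update rule and ϵ-greedy action selection might look as follows. SPOTTER's exploration and subgoal learners are tabular Q-learners of this general form, though the class and method names here are our own illustration.

import random
from collections import defaultdict

class QLearner:
    def __init__(self, actions, alpha=0.1, gamma=0.99, epsilon=0.1):
        self.q = defaultdict(float)     # q[(s, a)] -> estimated value, 0.0 by default
        self.actions = list(actions)
        self.alpha, self.gamma, self.epsilon = alpha, gamma, epsilon

    def act(self, s):
        # ϵ-greedy: random action with probability ϵ, otherwise greedy w.r.t. q(s, ·)
        if random.random() < self.epsilon:
            return random.choice(self.actions)
        return max(self.actions, key=lambda a: self.q[(s, a)])

    def train(self, s, a, r, s_next):
        # q(s,a) ← q(s,a) + α [ r + γ max_a' q(s',a') − q(s,a) ]
        best_next = max(self.q[(s_next, a2)] for a2 in self.actions)
        self.q[(s, a)] += self.alpha * (r + self.gamma * best_next - self.q[(s, a)])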
3 PROPOSED SPOTTER FRAMEWORK

We begin by introducing a framework for integrating the planning and learning formulations. We define an integrated planning task that enables us to ground symbolic fluents and operators in an MDP, specify goals symbolically, and realize action hierarchies.

We define an executor for a given MDP M = ⟨S, A, p, r, ι, γ⟩ as a triple x = ⟨I_x, π_x, β_x⟩, where I_x ⊆ S is an initiation set, π_x(a | s_init, s) is the probability of performing a given that the executor was initialized at state s_init and the current state is s, and β_x(s_init, s) expresses the probability of terminating x at s given that x was initialized at s_init. We define X_M as the set of executors for M.³

Definition 3.1. (Integrated Planning Task) We can formally define an Integrated Planning Task (IPT) as 𝒯 = ⟨T, M, d, e⟩ where T = ⟨F, O, σ0, σ̃g⟩ is an open-world STRIPS task, M = ⟨S, A, p, r, ι, γ⟩ is an MDP, d : S → L(F) is a detector function that determines a fluent state for a given MDP state, and e : O → X_M is an executor function that determines a mapping between an operator and an executor in the underlying MDP.

For the purposes of this paper, we assume that for each operator o ∈ O, e(o) is accurate to o; that is, for every o ∈ O, I_{e(o)} ⊇ {s ∈ S : d(s) ⊇ pre(o)} and

$$\beta_{e(o)}(s_{init}, s) = \begin{cases} 1 & \text{if } d(s) \supseteq \mathit{eff}(o) \cup \mathit{restrict}(d(s_{init}), \mathit{static}(o)) \\ 0 & \text{otherwise.} \end{cases}$$

The objective of this task is to find an executor x★ ∈ X_M such that β_{x★}(s_init, s) = 1 if and only if d(s) ⊇ σ̃g, and I_{x★} ⊇ S0, and x★ terminates in finite time.

A solution to a particular IPT 𝒯 is an executor x★ ∈ X_M having the properties defined above.

A planning solution to a particular IPT 𝒯 is a mapping π_T : S0 → O* such that for every s0 ∈ S0, π_T(s0) = ⟨o1, ..., on⟩ is executable at s0 and achieves the goal state σ̃g. Assuming all operators are accurate, executing in the MDP M the corresponding executors e(o1), ..., e(on) in sequence will yield a final state s such that d(s) ⊇ σ̃g, as desired.

An IPT 𝒯 is said to be solvable if a solution exists. It is said to be plannable if a planning solution exists.

3.1 The Operator Discovery Problem

As we noted earlier, symbolic planning domains can be sensitive to human errors. One common error is when the domain is missing an operator, which then prevents the agent from synthesizing a plan that requires such an operator. We define a stretch-Integrated Planning Task, stretch-IPT⁴, that captures difficult but achievable goals, i.e., those for which missing operators must be discovered.

Definition 3.2. (Stretch-IPT). A Stretch-IPT 𝒯̃ is an IPT 𝒯 for which a solution exists, but a planning solution does not.

Sarathy et al. considered something similar in a purely symbolic planning context, namely MacGyver Problems [24]; here we extend these ideas to integrated symbolic-RL domains. A planning solution is desirable because a plannable task affords the agent robustness to variations in goal descriptions and a certain degree of generality and ability to transfer decision-making capabilities across tasks. We are interested in turning a stretch-IPT into a plannable IPT, and specifically study how an agent can automatically extend its task description and executor function to invent new operators and their implementations.

Definition 3.3. (Operator Discovery Problem). Given a stretch-IPT 𝒯̃ = ⟨T, M, d, e⟩ with T = ⟨F, O, σ0, σ̃g⟩, construct a set of operators O′ = {o′1, ..., o′m} and their executors x_o′1, ..., x_o′m ∈ X_M such that the IPT ⟨T′, M, d, e′⟩ is plannable, with T′ = ⟨F, O ∪ O′, σ0, σ̃g⟩ and the executor function

$$e'(o) = \begin{cases} e(o) & \text{if } o \in O \\ x_o & \text{if } o \in O'. \end{cases}$$

In the rest of the section, we will outline an approach for solving the operator discovery problem.

³ An executor is simply an option [29] where the policy and termination condition depend on where it was initialized.
⁴ Akin to "stretch goals" in business productivity.
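The following sketch shows one way the pieces of an IPT and the extended executor function e′ from Definition 3.3 could be wired together in code; the dataclass layout and field names are our assumptions, not the paper's API.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class STRIPSTask:
    fluents: List[str]
    operators: List[object]          # the operator set O, extended in place to O ∪ O'
    sigma0: frozenset
    goal: frozenset                  # the partial fluent state σ̃g

@dataclass
class IPT:
    T: STRIPSTask                    # open-world STRIPS task ⟨F, O, σ0, σ̃g⟩
    M: object                        # handle to the underlying MDP / environment
    d: Callable                      # detector d : S -> L(F)
    e: Dict[str, object]             # executor function e : O -> X_M, keyed by operator name

    def add_operator(self, op, executor):
        """Definition 3.3: extend O to O ∪ {o'} and e to e', with e'(o') = x_{o'}."""
        self.T.operators.append(op)
        self.e[op.name] = executor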


3.1.1 The Operator Discovery Problem in GridWorld. In the example GridWorld puzzle, the SPOTTER agent is equipped with a high-level planning domain specified in an open-world extension of PDDL that can be grounded down into an open-world STRIPS problem. This domain is an abstraction of the environment which ignores the absolute positions of objects. In particular, the core movement actions forward, turnLeft, and turnRight in the MDP are absent from the planning domain, which navigates in terms of objects using, e.g., the operator goToObj(agent, object), with preconditions ¬holding(agent, object), ¬blocked(object), and inRoom(agent, object), and effect nextToFacing(agent, object). All initial operators are assumed to satisfy the closed-world assumption, except that putting down an object (putDown(agent, object)) leaves unknown whether, at the conclusion of the action, some other object (e.g., a door) will be blocked. Each operator has a corresponding hand-coded executor. The goal here is σ̃g = {open(door)}, which is not achievable using planning alone from the initial state.

3.2 Planning and Execution

Our overall agent algorithm is given by Algorithm 1. We consider the agent to be done when it has established that the input IPT 𝒯 is a plannable IPT. This is true once the agent is able to find a complete solution to the IPT through planning and execution alone, without requiring any learning. Algorithm 1 shows that the agent repeatedly learns in the environment until it has completed a run in which there were no planning impasses, i.e., situations where no plan was found through a forward search in the symbolic space. The set of learners L is maintained between runs of solve.

Two sets of fluent states will be important in Algorithms 2-4: Σ_reach and Σ_plan. Σ_reach contains all states known to be reachable from the initial state via planning. Σ_plan contains all states from which a plan to the goal σ̃g is known.

Algorithm 2 (solve) begins with the agent performing an OWFS (open-world forward search) of the symbolic space specified by the task. If a plan is found (line 7), the agent attempts to execute the plan (line 9). If unexpected states are encountered such that the agent cannot perform the next operator, it will call Algorithm 3 (learn). Otherwise, the agent continues with the next operator until all the operators in the plan have been executed. If the goal conditions are satisfied (line 16), the algorithm returns success. Otherwise, it turns to learn.

Algorithm 3 (learn) receives as input, among other things, a set of learners L (which may grow during execution, as described below). L includes an exploration agent ℓ_expl with corresponding policy π_expl (initialized in Algorithm 1). L also contains a set of offline "subgoal" learners, which initially is empty.

Algorithm 1: spotter(𝒯)
Input: 𝒯: Integrated Planning Task
1: impasse ← true
2: L ← {ℓ_expl}
3: while impasse do
4:   s0 ∼ 𝒯.M.ι(·)
5:   impasse, L ← solve(𝒯, s0, ∅, ∅, τ, false, L)
6: end while
7: return 𝒯

Algorithm 2: solve(𝒯, s, Σ_reach, Σ_plan, τ, impasse, L)
Input: 𝒯: Integrated Planning Task
Input: s: Initial MDP state from which to plan
Input: Σ_reach: Set of reachable fluent states
Input: Σ_plan: Set of plannable fluent states
Input: τ: Value threshold parameter
Input: impasse: true if this algorithm was called from learn
Input: L: A set of learners
1: σ ← 𝒯.d(s)
2: if 𝒯.σ̃g ⊆ σ then
3:   return impasse, L
4: end if
5: π_T, visitedNodes ← owfs(𝒯.T, σ)
6: Σ_reach.add(visitedNodes)
7: if π_T ≠ ∅ then
8:   Σ_plan.add(all visitedNodes along π_T)
9:   for operator o_i in π_T do
10:     s ← execute(𝒯.e(o_i), s)
11:     σ ← 𝒯.d(s)
12:     if pre(o_{i+1}) ⊈ σ then
13:       return learn(𝒯, s, Σ_reach, Σ_plan, τ, L)
14:     end if
15:   end for
16:   if 𝒯.σ̃g ⊆ σ then
17:     return impasse, L
18:   else
19:     return learn(𝒯, s, Σ_reach, Σ_plan, τ, L)
20:   end if
21: else
22:   return learn(𝒯, s, Σ_reach, Σ_plan, τ, L)
23: end if
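Algorithm 2 calls owfs, the open-world forward search of Section 2.1, which is not spelled out as pseudocode in the paper. A minimal breadth-first rendering, reusing the applicable and delta helpers sketched earlier and taking the task's operator set and goal explicitly, might look like this (returning None rather than ∅ to signal that no plan was found, which is what routes control to learn):

from collections import deque

def owfs(operators, sigma0, goal):
    """Breadth-first open-world forward search over partial fluent states.
    Returns (plan, visited); plan is None when the goal cannot be reached."""
    frontier = deque([(sigma0, [])])
    visited = {sigma0}
    while frontier:
        sigma, plan = frontier.popleft()
        if goal <= sigma:                      # σ̃g ⊆ σ̃: a plan has been found
            return plan, visited
        for op in operators:
            if applicable(op, sigma):
                succ = delta(sigma, op)        # δ(σ̃, o)
                if succ not in visited:
                    visited.add(succ)
                    frontier.append((succ, plan + [op]))
    return None, visited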
3.3 Learning Operator Policies

Broadly, in Algorithm 3 (learn), the agent follows an exploration policy π_expl, e.g., ϵ-greedy, to explore the environment, spawning RL agents which attempt to construct policies to reach particular partial fluent states from which the goal σ̃g is reachable by planning. These partial fluent states correspond to operator effects. For each such set of effects, the system attempts to find sets of preconditions from which that operator can consistently achieve high value; when one such set of preconditions is found, the policy along with the corresponding set of preconditions and effects is used to define a new operator and its corresponding executor, which are added to the IPT.


In each time step, learn executes an action a according to its exploration policy π_expl, with resulting MDP state s′ and corresponding fluent state σ (lines 4-6). The agent will then check whether σ is a fluent state from which it can plan to the goal. First the agent checks if σ is already known to be a state from which it can plan to the goal (σ ∈ Σ_plan; line 7). If not, the agent attempts to plan from σ to the goal (line 10). If there is a plan, then the agent regresses each operator in the plan in reverse order (o_n, ..., o_1). As described in Section 2.1, after regressing through o_i, the resulting fluent state σ̃_{i−1} is the most general possible partial fluent state (i.e., containing the fewest possible fluents) such that executing ⟨o_i, ..., o_n⟩ from σ̃_{i−1} results in σ̃g; this makes each σ̃_{i−1} a prime candidate for some new operator o★'s effects. That is, assuming static(o) = ∅, σ̃_{i−1} guarantees that ⟨o_i, ..., o_n⟩ is a plan to σ̃g, while allowing the corresponding executor to terminate in the largest possible set of fluent states. Each such partial fluent state σ̃_sg is chosen as a subgoal, for which a new learner ℓ_σ̃sg is spawned and added to L (lines 12-18). σ̃_sg is also added to the set of "plannable" fluent states Σ_plan.

Each subgoal learner ℓ_σ̃sg is trained each time step using as reward the indicator function 1_{σ ⊇ σ̃sg}, which returns 1 if that learner's subgoal is satisfied by s′ and 0 otherwise (lines 23-25).

While in this case the exploration learner ℓ_expl may be conceived of as an RL agent whose reward function is 1 whenever any such subgoal is achieved (line 22) and whose exploration policy π_expl is ϵ-greedy, nothing prevents it from being defined differently, as a random agent or even a symbolic learner, as discussed in Section 6.

Upon reaching a state from which a plan to the goal exists, the agent stops exploring. For each of its subgoal learners ℓ_σ̃sg ∈ L, it attempts to construct sets of preconditions (characterized as a partial fluent state σ̃_pre) from which its policy can consistently achieve the subgoal state σ̃_sg (see Section 3.4) (lines 28-34). If any such precondition sets σ̃_pre exist, the agent constructs the operator o★ such that pre(o★) = σ̃_pre, eff(o★) = σ̃_sg, and without static fluents (all other variables are unknown once the operator is executed; static(o★) = ∅). The corresponding executor is constructed as makeExecutor(ℓ_σ̃sg, o★) = ⟨I_x★, π_x★, β_x★⟩, where

$$I_{x^\star} = \{ s' \in S : \mathcal{T}.d(s') \supseteq pre(o^\star) \}$$

$$\pi_{x^\star}(a \mid s_{init}, s) = \begin{cases} 1 & \text{if } a = \arg\max_{a' \in \mathcal{T}.M.A} \ell_{\tilde{\sigma}_{sg}}.q(s, a') \\ 0 & \text{otherwise} \end{cases}$$

$$\beta_{x^\star}(s_{init}, s) = \begin{cases} 1 & \text{if } \mathcal{T}.d(s) \supseteq \tilde{\sigma}_{sg} \\ 0 & \text{otherwise.} \end{cases}$$

Note that while the definition of the Operator Discovery Problem allows the constructed operators' executors to depend on s_init, in practice the operators are ordinary options. Options are sufficient, provided the operators being constructed have no static fluents.⁵ Constructing operators with static fluents is a topic for future work.

⁵ This also indicates why we used our open-world planning formalism: in closed-world planning, all fluents not in the effects are static, requiring executors which depend on s_init. By specifying static fluents for each operator, we can leverage handmade operators which (largely) satisfy the closed-world assumption while allowing the operators returned by SPOTTER to be open-world.

Constructed operators and their corresponding executors are added to the IPT (line 32), which ideally becomes plannable as a result, solving the Operator Discovery Problem. Because the agent is currently in a state from which it can plan to the goal, control is passed back to solve (with impasse set to true because this episode required learning).
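A sketch of how makeExecutor could realize the three equations above: the executor is a greedy policy over the subgoal learner's Q-table, gated by the generated preconditions and terminated when the subgoal holds. The function and attribute names are ours, building on the earlier IPT and QLearner sketches.

def make_executor(ipt, learner, pre, eff):
    """Build x* = (I_x, pi_x, beta_x) for the operator o* = <pre, eff, {}>."""
    def initiation(s):                       # I_x*: states whose fluent state satisfies pre(o*)
        return pre <= ipt.d(s)

    def policy(s_init, s):                   # pi_x*: act greedily w.r.t. the subgoal learner's Q-table
        return max(learner.actions, key=lambda a: learner.q[(s, a)])

    def termination(s_init, s):              # beta_x* = 1 iff d(s) ⊇ eff(o*) = σ̃_sg
        return eff <= ipt.d(s)

    return initiation, policy, termination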
Algorithm 3: learn(𝒯, s, Σ_reach, Σ_plan, τ, L)
Input: 𝒯: Integrated Planning Task
Input: s: Initial MDP state
Input: Σ_reach: Set of reachable fluent states
Input: Σ_plan: Set of plannable fluent states
Input: τ: Value threshold
Input: L: Set of learners
1: done ← false
2: σ ← 𝒯.d(s)
3: while ¬done do
4:   a ∼ π_expl(· | s)
5:   s′ ∼ 𝒯.M.p(· | s, a)
6:   σ ← 𝒯.d(s′)
7:   if σ ⊇ σ̃ for some σ̃ ∈ Σ_plan then
8:     done ← true
9:   else
10:     π_T, visitedNodes ← owfs(𝒯.T, σ)
11:     if π_T ≠ ∅ then
12:       σ̃_sg ← σ̃g
13:       for o_i in reversed(π_T) do
14:         σ̃_sg ← δ⁻¹(σ̃_sg, o_i)
15:         Σ_plan.add(σ̃_sg)
16:         ℓ_σ̃sg ← spawnLearner(σ̃_sg)
17:         L ← L ∪ {ℓ_σ̃sg}
18:       end for
19:       done ← true
20:     end if
21:   end if
22:   ℓ_expl.train(s, a, 1_{done=true}, s′)
23:   for ℓ_σ̃sg ∈ L do
24:     ℓ_σ̃sg.train(s, a, 1_{σ ⊇ σ̃sg}, s′)
25:   end for
26:   s ← s′
27: end while
28: for ℓ_σ̃sg ∈ L do
29:   for σ̃_pre ∈ gen-precon(ℓ_σ̃sg, Σ_reach, τ) do
30:     o★ ← ⟨σ̃_pre, σ̃_sg, ∅⟩
31:     x★ ← makeExecutor(ℓ_σ̃sg, o★)
32:     𝒯.addOperator(o★, x★)
33:   end for
34: end for
35: return solve(𝒯, s, Σ_reach, Σ_plan, τ, true, L)

3.3.1 Learning operator policies in GridWorld. The puzzle shown in Figure 1 presents a Stretch-IPT for which a solution exists in terms of the MDP-level actions, but no planning solution exists. Thus, SPOTTER enters learn, and initially moves around randomly (the exploration policy has not yet received a reward). Eventually, in the course of random action, the agent moves the ball out of the way (Algorithm learn, line 11). From this state, the action plan

π_T = goToObj(agent, key); pickUp(agent, key); goToObj(agent, door); useKey(agent, door)

achieves the goal σ̃g = {open(door)}. Regressing through this plan from the goal state (Algorithm learn, lines 13-17), the agent identifies the subgoal

σ̃_sg = {nextToFacing(agent, ball), handsFree(agent), inRoom(agent, key), inRoom(agent, door), locked(door), ¬holding(agent, key), ¬blocked(door)}

(as well as other subgoals corresponding to the suffixes of π_T), and constructs a corresponding subgoal learner ℓ_σ̃sg, which uses RL to learn a policy which consistently achieves σ̃_sg.
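Lines 22-25 of Algorithm 3 amount to replaying every exploration transition to each subgoal learner with an indicator reward. A compact sketch, reusing the QLearner from Section 2.2 (the function and argument names are ours):

def train_all_learners(ipt, expl, subgoal_learners, s, a, s_next, done):
    """One exploration step: update the explorer and every offline subgoal learner.
    subgoal_learners maps a subgoal (partial fluent state) to its QLearner."""
    sigma_next = ipt.d(s_next)
    expl.train(s, a, 1.0 if done else 0.0, s_next)              # 1_{done=true}
    for subgoal, learner in subgoal_learners.items():
        reward = 1.0 if subgoal <= sigma_next else 0.0          # 1_{σ ⊇ σ̃_sg}
        learner.train(s, a, reward, s_next)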


3.4 Generating Preconditions

The prior algorithms allow an agent to plan in the symbolic space, and when stuck, explore a subsymbolic space with RL learners. We noted that each learner is connected to a particular subgoal (from which the agent can generate a symbolic plan), which in turn maps onto the effects of a potential new operator. What remains is to define the preconditions of such an operator. Algorithm 4 (gen-precon) incrementally generates increasingly general sets of preconditions for each learner.

gen-precon begins with initializing a set of above-threshold fluent states Σ_{>τ} to empty. This set will represent the output of the algorithm, which in turn is essentially a form of graph search over the space of possible sets of preconditions. More specifically, gen-precon first adds all the fluent states in the set of reachable fluent states Σ_reach to the search queue and to Σ_{>τ} (lines 3-7). The learner ℓ has visited states in the MDP and has been updating its Q-table as part of learn (see line 24 in learn). We probe this Q-table, extract the fluent states that the agent has visited, and populate a set of "been" fluent states Σ_been (line 9). While there are fluent states (or partial fluent states) in the queue, a set of "successor" nodes is computed for each one (node) by capturing the set of fluents common to both the node and a particular "been" fluent state σ′, for each σ′ ∈ Σ_been. If the average value (according to ℓ's q function) of all states satisfying σ̃_common is above the threshold τ, then the above-threshold common partial fluent state σ̃_common is added to Σ_{>τ}. The idea here is to only allow sets of preconditions which are guaranteed reachable by the planner (generalizations of elements of Σ_reach), and to compute the set of all such sets of preconditions for which the agent can consistently achieve the learner's subgoal σ̃_sg (where, for the purposes of this paper, a set of preconditions σ̃_pre "consistently achieves" σ̃_sg if the average value of all MDP states satisfying σ̃_pre is above the threshold τ).

Algorithm 4: gen-precon(ℓ, Σ_reach, τ)
Input: ℓ: Learner along with Q-tables
Input: Σ_reach: Fluent states reachable from the initial state
Input: τ: Value threshold
1: Σ_{>τ} ← ∅
2: queue ← ∅
3: for σ in Σ_reach do
4:   if value(σ) > τ then
5:     queue.add(σ)
6:     Σ_{>τ}.add(σ)
7:   end if
8: end for
9: Σ_been ← getFluentStatesFromQ(ℓ)
10: while queue do
11:   node ← queue.pop()
12:   for σ′ in Σ_been do
13:     σ̃_common ← node ∩ σ′
14:     if value(σ̃_common) > τ ∧ σ̃_common not in Σ_{>τ} then
15:       queue.add(σ̃_common)
16:       Σ_{>τ}.add(σ̃_common)
17:     end if
18:   end for
19: end while
20: return Σ_{>τ}

gen-precon can be terminated at any time during the while loop, and will yield zero or more plausible preconditions for a particular learner-operator. If terminated before any nodes have been expanded, gen-precon will simply yield elements from Σ_reach. However, allowing this algorithm to run longer will yield more general preconditions, i.e., those with fewer fluents.
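Algorithm 4 relies on a value(σ̃) helper that is not given as pseudocode. One plausible reading, consistent with the "consistently achieves" definition above, is to average the learner's greedy Q-values over the visited MDP states whose detected fluent state satisfies σ̃; the sketch below encodes that assumption and is not taken from the paper's code.

def value(sigma_tilde, learner, visited_states, detector):
    """Average greedy Q-value over visited MDP states whose detected fluent
    state satisfies the candidate (partial) precondition sigma_tilde."""
    matching = [s for s in visited_states if sigma_tilde <= detector(s)]
    if not matching:
        return 0.0
    greedy = lambda s: max(learner.q[(s, a)] for a in learner.actions)
    return sum(greedy(s) for s in matching) / len(matching)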
3.4.1 Precondition generation in GridWorld. Returning to the puzzle in Figure 1, the precondition generation algorithm uses the value function for the subgoal learner ℓ_σ̃sg to determine a conjunction of fluents such that the MDP states satisfying those fluents have a high average state value (say, > 0.9), and which is plannable from the environment's start state.⁶ This conjunction specifies the new operator's preconditions. The operator is added to the planning domain, and this augmented domain can be solved using planning alone, and can be used in other plans to achieve other goals in the environment.

⁶ The preconditions produced are too lengthy to include in this paper, but can be found in the supplementary material.

4 EXPERIMENTS

We evaluate SPOTTER on MiniGrid [5], a platform based on procedurally generated 2D gridworld environments populated with objects: agent, balls, doors and keys. The agent can navigate the world, manipulate objects, and move between rooms.

To establish (1) that operator discovery is possible, and (2) that the symbolic operators discovered by SPOTTER are useful for performing new tasks in the environment, we structure our evaluation into three puzzles that the agent must solve in sequence, much like three levels in puzzle video games (Fig. 1). The environment in each level consists of two rooms with a locked door separating them. The agent and the key to the room are randomly placed in the leftmost room. In puzzle 1, the agent's task is to pick up the key, and use it to open the door. The episode terminates, and the agent receives a reward inversely proportional to the number of elapsed time steps, when the door is open. In puzzle 2, the agent's goal is again to open the door, but a ball is placed directly in front of the door, which the agent must pick up and move elsewhere before picking up the key (this is the running example). Puzzle 3 is identical to puzzle 2, except that the agent's goal is now not to open the door, but to go to a goal location (green square) in the far corner of the rightmost room. The high-level planning domain and low-level RL actions are as defined for the running example.

The evaluation is structured so that for almost all initial conditions in puzzle 1, SPOTTER's planner can produce a plan that can be successfully executed to reach the goal state. In puzzle 2, the door is blocked by the ball, and the agent has no operator representing "move the ball out of the way" (and no combination of existing operators can safely achieve this effect given the agent's planning domain). The agent must discover such an operator. Finally, puzzle 3 is designed to test whether the learned operator can be used in the execution of different goals in the same environment.
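For orientation, the low-level interface that SPOTTER's RL components interact with is a standard MiniGrid/Gym loop. The snippet below uses the publicly registered DoorKey environment as a stand-in, since the paper's two-room blocked-door puzzles are custom layouts not available under a public environment ID; the older (pre-Gymnasium) gym API is assumed.

import gym
import gym_minigrid  # noqa: F401  (registers the MiniGrid-* environments)

# DoorKey is an illustrative stand-in; the paper's puzzles 1-3 are custom layouts.
env = gym.make('MiniGrid-DoorKey-8x8-v0')
obs = env.reset()
done = False
while not done:
    action = env.action_space.sample()   # primitives: turn left/right, forward, pickup, drop, toggle
    obs, reward, done, info = env.step(action)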


Figure 2: Experimental performance across three tasks. We report mean cumulative reward (and standard deviation in lighter color) obtained by our approach (SPOTTER) and three baseline algorithms: tabular Q-learning over primitive actions (VQL), tabular Q-learning with primitive and high-level action executors (HLAQL), and HLAQL with q-updates trickling down from HLAs to primitives (HLALQL).

Figure 2 shows the average results of running SPOTTER 10 times on puzzles 1 through 3. The algorithm was allowed to retain any operators/executors discovered between puzzles 2 and 3. In each case we employed a constant learning rate α = 0.1, discount factor γ = 0.99, and an ϵ-greedy exploration policy with decaying exploration constant ϵ beginning at 0.9 and decaying towards 0.05. ϵ is decayed exponentially using the formula

$$\epsilon(t) = \epsilon_{min} + (\epsilon_{max} - \epsilon_{min})\, e^{-\lambda t}, \qquad \lambda = -\log(0.01)/N,$$

with N being the maximum number of episodes. The value threshold was set to τ = 0.9. We ran for 10,000 episodes on puzzle 1, 20,000 episodes on puzzle 2, and 10,000 episodes on puzzle 3.
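The decay schedule, written out as a function of the episode index t (a direct transcription of the stated formula; the function name and defaults are ours):

import math

def epsilon(t, n_episodes, eps_min=0.05, eps_max=0.9):
    """eps(t) = eps_min + (eps_max - eps_min) * exp(-lambda * t), lambda = -log(0.01)/N."""
    lam = -math.log(0.01) / n_episodes
    return eps_min + (eps_max - eps_min) * math.exp(-lam * t)

# e.g., for puzzle 2 (N = 20,000): epsilon(0, 20000) = 0.9, epsilon(20000, 20000) ≈ 0.058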

The experimental results show that SPOTTER achieved higher overall rewards than the baselines in the given time frame and did so more quickly. Crucially, the agent learned the missing operator for moving the blue ball out of the way in Level 2, and was immediately able to use this operator in Level 3. This is demonstrated both by the fact that the agent did not experience any drop in performance when transitioning to Level 3 and by the fact that, from running the experiment, we know the agent did not enter learn or gen-precon in Level 3. It is important to note that the baselines did converge at around 800,000-1,000,000 episodes, significantly later than SPOTTER.⁷

⁷ In the supplementary material (https://github.com/spotter-ai/spotter), we provide the learned operator described in PDDL, learning curves for the baselines over 2,000,000 episodes, videos showing SPOTTER's integrated planning and learning, and a second experiment exploring how running SPOTTER for longer produces operators with more general preconditions. Code implementing SPOTTER and the baselines along with experiments can also be found at the above link.

The results suggest that SPOTTER significantly outperformed comparable baselines. The HLAQL and HLALQL baselines have their action space augmented with the same high-level executors provided to SPOTTER. SPOTTER does not use any function approximation techniques, and the exploration and subgoal learners are tabular Q-learners themselves. Accordingly, we did not compare against any deep RL baselines. We also did not compare transfer learning and curriculum learning approaches, as these approaches do not handle cases where new representations need to be learned from the environment.

5 DISCUSSION

The SPOTTER architecture assumes the existence of a planning domain with operators which correspond to executors which are more or less correct. It does not assume that these executors are reward-optimal. Further, some tasks can be more efficiently performed if they are not first split into subgoals. Thus, the performance of raw RL systems eventually overtakes that of SPOTTER on any particular environment. This is not a serious flaw: SPOTTER also produces knowledge that can be more easily applied to perform other tasks in the same environment.

The environments used to test SPOTTER (puzzles 1 through 3) are deterministic. In stochastic environments, it is often difficult to design planning domains where operators have guaranteed effects. While SPOTTER can handle stochastic environments, it would need more robust metrics for assessing operator confidence.

Future work could also emphasize adapting this work to high-dimensional continuous state and action spaces using deep reinforcement learning. "Symbolizing" such spaces can be difficult, and in particular such work would have to rethink how to generate candidate preconditions, since the existing approach enumerates over all states, which clearly would not work in deep RL domains.

The key advantage of SPOTTER is that the agent can produce operators that can potentially be applied to other tasks in the same environment. Because these operators' executors are policies (here, policies over finite, atomic MDPs), they do not generalize particularly well to environments with different dynamics or unseen start states (e.g., an operator learned in puzzle 2 could not possibly be applied in puzzle 3 if the door was moved up or down by even one cell). While function approximation could be helpful for solving this problem, an ideal approach might be a form of program synthesis [27], in which the agent learns programs that can be applied regardless of environment.

Furthermore, in this work, we manually sequenced the three tasks to elicit the discovery of operators that would be useful in the final task environment. A possible avenue for future work would be to develop automated task sequencing methods (i.e., curriculum learning [22]) so as to improve performance in downstream tasks.

6 RELATED WORK

In this section, we review related work that overlaps with various aspects of the proposed integrated planning and learning approach.

6.1 Learning Symbolic Action Models

Early work in the symbolic AI literature explored how agents can adjust and improve their symbolic models through experimentation. Gil proposed a method for learning by experimentation in which the agent can improve its domain knowledge by finding missing operators [10]. The agent is able to design experiments at the symbolic level based on observing the symbolic fluent states and comparing against an operator's preconditions and effects. Other approaches (to name a few: [11, 17, 21, 25, 26, 30]), comprising a significant body of literature, have explored recovering from planning failures, refining domain models, and open-world planning.


These approaches do not handle the synthesis of the underlying implementation of an operator as we have proposed. That is, they synthesize new operators, but no executors. That said, the rich techniques offered by these approaches could be useful to integrate into our online learner. As we describe in Section 3.3, the online learner follows an ϵ-greedy exploration policy. Future work will explore how our online learner could be extended to conduct symbolically-guided experiments as these approaches suggest.

More recently, there has been a growing body of literature exploring how domain models can be learned from action traces [3, 14, 16, 36]. The ARMS system learns PDDL planning operators from examples. The examples consist of successful fluent state and operator pairs corresponding to a sequence of transitions in the symbolic domain [35]. Subsequent work has explored how domains can be learned with partial knowledge of successful traces [2, 6], with neural networks capable of approximating from partial traces [33], and by learning models from pixels [4, 7].

In the RL context, there has been recent work in learning symbolic concepts [18] from low-level actions. Specifically, Konidaris et al. assume the agent has available to it a set of high-level actions, couched in the options framework for hierarchical reinforcement learning. The agent must learn a symbolic planning domain with operators and their preconditions and effects. This approach to integrating planning and learning is, in a sense, a reverse of our approach. While their use case is an agent that has high-level actions but no symbolic representation, ours assumes that we have (through hypothesizing from the backward search) most of the symbolic scaffolding, but need to learn the policies themselves.

6.2 Learning Action Executors

There has been a tradition of research in leveraging planning domain knowledge and symbolically provided constraints to improve the sample efficiency of RL systems. Grzes et al. use a potential function (as a difference between source and destination fluent states) to shape the rewards of a Q-learner [13]. While aligned with our own targeted Q-learners, their approach requires the existence of symbolic plans, which our agent does not possess. A related approach (PLANQ) combines STRIPS planning with Q-learning to allow an agent to converge faster to an optimal policy [12]. Planning with Partially Specified Behaviors (PPSB) is based on PLANQ, and takes as input symbolic domain information and produces Q-values for each of the high-level operators [1]. Like Grzes et al., PLANQ and PPSB assume the existence of a plan, or at least a plannable task. Other recent approaches have extended these ideas to incorporate partial plans [15]. The approach proffered by Illanes et al. improves the underlying learner's sample efficiency, but can also be provided with symbolic goals. However, their approach assumes the agent has a complete (partial order) plan for its environment, and can use that plan to construct a reward function for learning options to accomplish a task. One key difference between their work and ours is that we construct our own operators, whereas they use a set of existing operators to learn policies.

Generally, while these methods learn action implementations, they assume the existence of a successful plan or operator definition. In the proposed framework, the agent has neither and must hypothesize operators and their corresponding executors.

RL techniques have also been used to improve the quality of symbolic plans. The DARLING framework uses symbolic planning and reasoning to constrain exploration and reduce the search space, and uses RL to improve the quality of the plan produced [19]. While our approach shares the common goal of reducing the brittleness of planning domains, they do not modify the planning model. The PEORL framework works to choose a plan that maximizes a reward function, thereby improving the quality of the plan using RL [34]. More recently, Lyu et al. propose a framework (SDRL) generalizing PEORL with intrinsic goals and integration with deep reinforcement learning (DRL) machinery [20]. However, neither PEORL nor SDRL synthesizes new operators or learns planning domains. Our work also differs from the majority of model-based RL approaches [23, 32] in that we are interested in STRIPS-like explicit operators that are useful in symbolic planning, and thereby transferable to a wide range of tasks within a domain.

7 CONCLUSION

Automatically synthesizing new operators during task performance is a crucial capability needed for symbolic planning systems. As we have examined here, such a capability requires a deep integration between planning and learning, one that benefits from leveraging RL to fill in missing connections between symbolic states. While exploring the MDP state space, SPOTTER can identify states from which symbolic planning is possible, use this to identify subgoals for which operators can be synthesized, and learn policies for these operators by RL. SPOTTER thus allows a symbolic planner to explore previously unreachable states, synthesize new operators, and accomplish new goals.

8 ACKNOWLEDGEMENTS

This work was funded in part by NSF grant IIS-2044786 and DARPA grant W911NF-20-2-0006.

REFERENCES

[1] Javier Segovia Aguas, Jonathan Ferrer-Mestres, and Anders Jonsson. 2016. Planning with Partially Specified Behaviors. In CCIA. 263–272.
[2] Diego Aineto, Sergio Jiménez, and Eva Onaindia. 2019. Learning STRIPS action models with classical planning. arXiv preprint arXiv:1903.01153 (2019).
[3] Ankuj Arora, Humbert Fiorino, Damien Pellier, Marc Métivier, and Sylvie Pesty. 2018. A review of learning planning action models. The Knowledge Engineering Review 33 (2018).
[4] Masataro Asai and Alex Fukunaga. 2017. Classical planning in deep latent space: Bridging the subsymbolic-symbolic boundary. arXiv preprint arXiv:1705.00154 (2017).
[5] Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. 2018. Minimalistic Gridworld Environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid.
[6] Stephen N Cresswell, Thomas Leo McCluskey, and Margaret M West. 2013. Acquiring planning domain models using LOCM. Knowledge Engineering Review 28, 2 (2013), 195–213.
[7] Andrea Dittadi, Thomas Bolander, and Ole Winther. 2018. Learning to Plan from Raw Data in Grid-based Games. In GCAI. 54–67.
[8] Richard E Fikes and Nils J Nilsson. 1971. STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence 2, 3-4 (1971), 189–208.
[9] Malik Ghallab, Dana Nau, and Paolo Traverso. 2016. Automated planning and acting. Cambridge University Press.
[10] Yolanda Gil. 1994. Learning by experimentation: Incremental refinement of incomplete planning domains. In Proceedings 1994. Elsevier, 87–95.
[11] Evana Gizzi, Mateo Guaman Castro, and Jivko Sinapov. 2019. Creative Problem Solving by Robots Using Action Primitive Discovery. In 2019 Joint IEEE 9th International Conference on Development and Learning and Epigenetic Robotics (ICDL-EpiRob). IEEE, 228–233.


[12] Matthew Grounds and Daniel Kudenko. 2005. Combining reinforcement learning with symbolic planning. In Adaptive Agents and Multi-Agent Systems III. Adaptation and Multi-Agent Learning. Springer, 75–86.
[13] Marek Grzes and Daniel Kudenko. 2008. Plan-based reward shaping for reinforcement learning. In 2008 4th International IEEE Conference Intelligent Systems, Vol. 2. IEEE, 10–22.
[14] Chad Hogg, Ugur Kuter, and Hector Munoz-Avila. 2010. Learning methods to generate good plans: Integrating HTN learning and reinforcement learning. In Twenty-Fourth AAAI Conference on Artificial Intelligence. Citeseer.
[15] León Illanes, Xi Yan, Rodrigo Toro Icarte, and Sheila A McIlraith. 2020. Symbolic Plans as High-Level Instructions for Reinforcement Learning. In Proceedings of the International Conference on Automated Planning and Scheduling, Vol. 30. 540–550.
[16] Sergio Jiménez, Tomás De La Rosa, Susana Fernández, Fernando Fernández, and Daniel Borrajo. 2012. A review of machine learning for automated planning. The Knowledge Engineering Review 27, 4 (2012), 433–467.
[17] Saket Joshi, Paul Schermerhorn, Roni Khardon, and Matthias Scheutz. 2012. Abstract planning for reactive robots. In 2012 IEEE International Conference on Robotics and Automation. IEEE, 4379–4384.
[18] George Konidaris, Leslie Pack Kaelbling, and Tomas Lozano-Perez. 2018. From skills to symbols: Learning symbolic representations for abstract high-level planning. Journal of Artificial Intelligence Research 61 (2018), 215–289.
[19] Matteo Leonetti, Luca Iocchi, and Peter Stone. 2016. A synthesis of automated planning and reinforcement learning for efficient, robust decision-making. Artificial Intelligence 241 (2016), 103–130.
[20] Daoming Lyu, Fangkai Yang, Bo Liu, and Steven Gustafson. 2019. SDRL: interpretable and data-efficient deep reinforcement learning leveraging symbolic planning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 2970–2977.
[21] Tom M Mitchell, Richard M Keller, and Smadar T Kedar-Cabelli. 1986. Explanation-based generalization: A unifying view. Machine Learning 1, 1 (1986), 47–80.
[22] Sanmit Narvekar, Bei Peng, Matteo Leonetti, Jivko Sinapov, Matthew E. Taylor, and Peter Stone. 2020. Curriculum Learning for Reinforcement Learning Domains: A Framework and Survey. Journal of Machine Learning Research 21, 181 (2020), 1–50. http://jmlr.org/papers/v21/20-212.html
[23] Jun Hao Alvin Ng and Ronald PA Petrick. 2019. Incremental Learning of Planning Actions in Model-Based Reinforcement Learning. In IJCAI. 3195–3201.
[24] Vasanth Sarathy and Matthias Scheutz. 2018. MacGyver problems: AI challenges for testing resourcefulness and creativity. Advances in Cognitive Systems 6 (2018), 31–44.
[25] Wei-Min Shen. 1989. Learning from the environment based on percepts and actions. Ph.D. Dissertation. Carnegie Mellon University.
[26] Wei-Min Shen and Herbert A Simon. 1989. Rule Creation and Rule Learning Through Environmental Exploration. In IJCAI. Citeseer, 675–680.
[27] Armando Solar-Lezama. 2009. The sketching approach to program synthesis. In Asian Symposium on Programming Languages and Systems. Springer, 4–13.
[28] Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT Press.
[29] Richard S Sutton, Doina Precup, and Satinder Singh. 1999. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence 112, 1-2 (1999), 181–211.
[30] Kartik Talamadupula, J Benton, Subbarao Kambhampati, Paul Schermerhorn, and Matthias Scheutz. 2010. Planning for human-robot teaming in open worlds. ACM Transactions on Intelligent Systems and Technology (TIST) 1, 2 (2010), 1–24.
[31] Christopher JCH Watkins and Peter Dayan. 1992. Q-learning. Machine Learning 8, 3-4 (1992), 279–292.
[32] John Winder, Stephanie Milani, Matthew Landen, Erebus Oh, Shane Parr, Shawn Squire, Marie desJardins, and Cynthia Matuszek. 2019. Planning with Abstract Learned Models While Learning Transferable Subtasks. arXiv preprint arXiv:1912.07544 (2019).
[33] Zhanhao Xiao, Hai Wan, Hankui Hankz Zhuo, Jinxia Lin, and Yanan Liu. 2019. Representation Learning for Classical Planning from Partially Observed Traces. arXiv preprint arXiv:1907.08352 (2019).
[34] Fangkai Yang, Daoming Lyu, Bo Liu, and Steven Gustafson. 2018. PEORL: Integrating symbolic planning and hierarchical reinforcement learning for robust decision-making. arXiv preprint arXiv:1804.07779 (2018).
[35] Qiang Yang, Kangheng Wu, and Yunfei Jiang. 2007. Learning action models from plan examples using weighted MAX-SAT. Artificial Intelligence 171, 2-3 (2007), 107–143.
[36] Terry Zimmerman and Subbarao Kambhampati. 2003. Learning-assisted automated planning: looking back, taking stock, going forward. AI Magazine 24, 2 (2003), 73–73.
