Learning to Plan with Portable Symbols

Steven James 1, Benjamin Rosman 1 2, George Konidaris 3

1 University of the Witwatersrand, Johannesburg, South Africa. 2 Council for Scientific and Industrial Research, Pretoria, South Africa. 3 Brown University, Providence RI 02912, USA. Correspondence to: Steven James.

Accepted to the ICML/IJCAI/AAMAS 2018 Workshop on Planning and Learning (PAL-18), Stockholm, Sweden.

Abstract

We present a framework for autonomously learning a portable symbolic representation that describes a collection of low-level continuous environments. We show that abstract representations can be learned in a task-independent space specific to the agent that, when combined with problem-specific information, can be used for planning. We demonstrate knowledge transfer in a video game domain where an agent learns portable, task-independent symbolic rules, and then learns instantiations of these rules on a per-task basis, reducing the number of samples required to learn a representation of a new task.

1. Introduction

On the surface, the learning and planning communities operate in very different paradigms. Agents in reinforcement learning (RL) interact directly with the environment in order to learn either an optimal behaviour or the dynamics of the environment (Sutton & Barto, 1998). The latter produces a learned forward model of the transition dynamics, which can then be used in some form of tree search to compute optimal actions (Coulom, 2006; Kocsis & Szepesvári, 2006).

Unfortunately, this approach founders when confronted with low-level, high-dimensional and continuous state and action spaces. The innate action space of a robot, for instance, involves directly actuating motors at a high frequency, but it would take thousands of such actuations to accomplish most useful goals. Thus planning is simply infeasible, even with a perfect model.

Approaches such as hierarchical reinforcement learning (Barto & Mahadevan, 2003) tackle this problem by abstracting away the low-level action space using higher-level skills, which can accelerate learning and planning. While skills alleviate the problem of reasoning over low-level actions, the state space remains complex and planning continues to be challenging.

On the other hand, the classical planning approach is to represent the world using abstract symbols, with actions represented as operators that manipulate these symbols (Ghallab et al., 2004). Such representations use only the minimal amount of state information necessary for task-level planning. This is appealing since it mitigates the issue of reward sparsity and admits solutions to long-horizon tasks, but it raises the question of how to build the appropriate abstract representation of a problem. This is often resolved manually, requiring substantial effort and expertise. Fortunately, recent work demonstrates how to learn a provably sound symbolic representation autonomously, given only the data obtained by executing the high-level actions available to the agent (Konidaris et al., 2018).

A major shortcoming of that framework is the lack of generalisability—an agent must relearn the appropriate symbolic representation for each new task it encounters. This is a data- and computation-intensive procedure involving clustering, probabilistic multi-class classification, and density estimation in high-dimensional spaces.

We introduce a framework for deriving a symbolic abstraction over a portable state space known as agent space (Konidaris et al., 2012). Because agent space depends only on the sensing capabilities of the agent (which remain constant regardless of the environment), it is independent of the underlying state space and is thus a suitable mechanism for transfer.

We demonstrate successful transfer in the Treasure Game (Konidaris et al., 2015), indicating that an agent is able to learn symbols that generalise to different tasks, reducing the amount of experience required to learn a high-level representation of a new task.

2. Background

We assume that the tasks faced by an agent can be modelled as a semi-Markov decision process (SMDP) M = ⟨S, O, T, R⟩, where S ⊆ R^n is the n-dimensional continuous state space and O(s) is the set of temporally-extended actions known as options available to the agent at state s. The reward function R(s, o, τ, s′) specifies the feedback the agent receives from the environment when it executes option o from state s and arrives in state s′ after τ steps. T describes the dynamics of the environment, specifying the probability of arriving in state s′ after option o is executed from s for τ timesteps: T_{ss′}^o = Pr(s′, τ | s, o).

An option o is defined by the tuple ⟨I_o, π_o, β_o⟩, where I_o is the initiation set that specifies the states in which the option can be executed, π_o is the option policy which specifies the action to execute, and β_o is the termination condition, where β_o(s) is the probability of option o halting in state s.
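To make these definitions concrete, the sketch below shows one possible way to represent an option and its execution in an SMDP. The Option fields mirror I_o, π_o and β_o; the env.step interface and the function names are illustrative assumptions rather than part of the original framework.

```python
from dataclasses import dataclass
from typing import Callable, Tuple
import numpy as np

# Illustrative option representation: I_o (initiation set), pi_o (policy)
# and beta_o (termination condition) are all functions of the continuous state.
@dataclass
class Option:
    name: str
    initiation: Callable[[np.ndarray], bool]        # I_o(s): can o start in s?
    policy: Callable[[np.ndarray], np.ndarray]      # pi_o(s): low-level action
    termination: Callable[[np.ndarray], float]      # beta_o(s): halting probability


def execute_option(env, s: np.ndarray, option: Option,
                   rng: np.random.Generator) -> Tuple[np.ndarray, float, int]:
    """Run an option to termination, returning (s', total reward, duration tau).

    The env.step(action) -> (next_state, reward) interface is a placeholder.
    """
    assert option.initiation(s), "option executed outside its initiation set"
    total_reward, tau = 0.0, 0
    while True:
        s, r = env.step(option.policy(s))
        total_reward, tau = total_reward + r, tau + 1
        if rng.random() < option.termination(s):
            return s, total_reward, tau
```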

2.1. Portable Skills

We adopt the approach of Konidaris et al. (2012), whereby tasks are related because they are faced by the same agent. For example, consider a robot equipped with various sensors that is required to perform a number of as yet unspecified tasks. The only aspect that remains constant across all these tasks is the presence of the robot, and more importantly its sensors, which map the state space to a portable observation space D known as agent space.

We define an observation function φ : S → D that maps states to observations and depends on the sensors available to an agent. We assume the sensors may be noisy, but that the noise has mean 0 in expectation, so that if s, t ∈ S, then s = t implies E[φ(s)] = E[φ(t)]. We refer to the SMDP's original state space as problem space.

Augmenting an SMDP with this new agent space produces the tuple ⟨S, O, T, R, γ, D⟩, where the observation space D remains constant across all tasks. We can use D to learn agent-space options, whose option policies, initiation sets and termination conditions are all defined in agent space. Because D remains constant regardless of the underlying SMDP, these options can be transferred across tasks.

2.2. Abstract Representations

Much like Konidaris et al. (2018), we are interested in learning an abstract representation to facilitate planning—that is, learning to plan. We define a probabilistic plan p_Z = {o_1, ..., o_n} to be the sequence of options to be executed, starting from some state drawn from distribution Z. It is useful to introduce the notion of a goal option, which can only be executed when the agent has reached its goal. Appending this option to a plan means that the probability of successfully executing a plan is equivalent to the probability of reaching some goal.

A representation suitable for planning must allow us to calculate the probability of a given plan executing to completion. As a plan is simply a chain of options, we must therefore learn when an option can be executed, as well as the outcome of doing so. This corresponds to learning the precondition, which expresses the probability that option o can be executed at a given state, and the image, which represents the distribution of states an agent may find itself in after executing o from states drawn from some distribution. Figure 1 illustrates how the precondition and image are used to calculate the probability of executing a two-step plan.

[Figure 1: (a) The agent begins at distribution Z, and must determine the probability with which it can execute the first option o_1. (b) The agent estimates the effect of executing o_1, given by Z_1. It must then determine the probability of executing o_2 from Z_1.]

Figure 1. An agent attempting to calculate the probability of executing the plan p_Z = {o_1, o_2}, which requires knowledge of the conditions under which o_1 and o_2 can be executed, as well as the effect of executing o_1 (Konidaris et al., 2018).

For continuous state spaces, we cannot represent the image of an arbitrary option; however, we can do so for a subclass known as subgoal options (Precup, 2000), whose terminating states are independent of their starting states (Konidaris et al., 2018). That is, for any subgoal option o, Pr(s′ | s, o) = Pr(s′ | o). We can thus substitute the option's image for its effect. If an option is not subgoal, we may be able to partition its initiation set into a finite number of subsets, so that it becomes subgoal when initiated from each of the individual subsets. That is, we divide an option o's start states into classes C such that Pr(s′ | s, o, c) ≈ Pr(s′ | o, c) for all c ∈ C. Given subgoal options, we can construct a plan graph corresponding to an abstract MDP.

We may also assume that the option is abstract—that is, it obeys the frame and action outcomes assumptions (Pasula et al., 2004). For each option, we can decompose the state into two sets of variables s = [a, b] such that executing the option results in state s′ = [a, b′], where a is the subset of variables that remain unchanged.

Whereas subgoal options induce an abstract MDP, abstract subgoal options allow us to construct a model corresponding to a factored abstract MDP. Equivalently, subgoal options induce a PPDDL description (Younes & Littman, 2004), where each operator's precondition and positive effect is a single proposition. Abstract subgoal options result in preconditions and effects with conjunctive propositions.
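The two-step plan-probability calculation illustrated in Figure 1 can be approximated by Monte Carlo sampling. The sketch below assumes each learned option model exposes a precondition estimator (precondition_prob) and a sampler for its image distribution (sample_effect); both method names are hypothetical interfaces, not the authors' implementation.

```python
import numpy as np

def plan_probability(plan, start_states, n_samples=1000, rng=None):
    """Estimate the probability that a plan (a list of option models) executes
    to completion from states drawn from the start distribution Z (Figure 1).

    Each option model is assumed to provide:
      precondition_prob(states) -> array of Pr(s in I_o) for each state
      sample_effect(n, rng)     -> n samples from its image/effect distribution
    """
    rng = rng or np.random.default_rng()
    states = np.asarray(start_states)            # samples from Z
    prob = 1.0
    for option in plan:
        # Probability the option can be executed from the current distribution.
        prob *= float(np.mean(option.precondition_prob(states)))
        # Subgoal assumption: the effect is independent of the start states,
        # so the state samples are replaced by samples from the option's image.
        states = option.sample_effect(n_samples, rng)
    return prob
```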

3. Building a Portable Symbolic Vocabulary

Prior work (Konidaris et al., 2018) has defined symbols as names for precondition and effect distributions over low-level states, which are directly tied to the SMDP in which they were learned. We instead propose learning a symbolic representation over agent space D. Transfer can be achieved in this manner (provided φ is non-injective [1]), because D remains consistent both within the same SMDP and across SMDPs, even if the state space or transition function do not.

We assume that the agent possesses subgoal agent-space options, but that the goal (and hence the goal option) is defined in problem space. This produces an issue because, given a current distribution over agent-space observations, we cannot determine the probability with which a problem-space option can be executed. This follows naturally from the property that φ is non-injective. We therefore require additional information to disambiguate such situations, allowing us to map from agent space back into problem space.

We can accomplish this by partitioning our agent-space options based on their effects in S, resulting in options that are subgoal in both D and S. This necessitates having access to both problem- and agent-space observations. Recall that options are partitioned to ensure the subgoal property holds, and so each partition defines its own unique image distribution. If we label each class in C, then each label refers to a unique distribution in S and is sufficient for disambiguating our agent-space symbols.

Consider an option GoToDoor that causes the agent to approach the door in its field of view. Figure 2a illustrates its effect given an agent-centric view, which is then combined with partition labels (Figure 2b). This produces the lifted symbol InFrontOfDoor(X), where InFrontOfDoor is the name for a distribution over D and X is simply a partition number. Note that the only time problem-specific information is required is when determining the values of X.

[1] We require that φ be non-injective since the ability to effect transfer relies on two distinct states in S being identical in D.

Figure 2. (a) A single agent-space symbol representing the effect of the option GoToDoor. (b) GoToDoor is subgoal in D but not in xy-space, since the door it approaches depends upon the agent's location. However, we can partition the option into two options GoToDoor(1) and GoToDoor(2), which are subgoal in S. The coloured regions represent the two partitions.
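As a minimal illustration of how a portable symbol might be paired with a problem-specific partition label, the sketch below represents InFrontOfDoor(X) as a lifted name plus an integer parameter; the class names are assumptions made for this example only.

```python
from dataclasses import dataclass

# A portable (agent-space) proposition is just a name for a learned
# distribution over D; grounding it for a particular task attaches a
# partition label X determined in problem space.
@dataclass(frozen=True)
class LiftedSymbol:
    name: str            # e.g. "InFrontOfDoor", a distribution over agent space

@dataclass(frozen=True)
class GroundedSymbol:
    lifted: LiftedSymbol
    partition: int       # problem-specific parameter X

    def __str__(self):
        return f"{self.lifted.name}({self.partition})"

# Example: the same portable symbol instantiated for two door partitions.
in_front_of_door = LiftedSymbol("InFrontOfDoor")
print(GroundedSymbol(in_front_of_door, 1))   # InFrontOfDoor(1)
print(GroundedSymbol(in_front_of_door, 2))   # InFrontOfDoor(2)
```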
4. Generating a Forward Model

We now show how to build a forward model by combining agent-space distributions and partition labels, which results in parameterised symbols. This can be viewed as a two-step process. The agent first learns domain-independent symbols in agent space, which are not strictly tied to the current task, and then determines their parameters and how they link to each other, which depends on S and thus the current task.

The first phase is equivalent to learning propositions in agent space, which are portable because agent space is shared between SMDPs. An example is an agent that learns that the effect of passing through a door is that it finds itself in a room. The second phase learns the specifics of the current environment: which doors connect to which rooms, for example, which differs between buildings. This involves learning how the parameter of the precondition relates to the effect's parameter.

We describe the approach for the Treasure Game, a dungeon-crawler video game in which an agent navigates a maze in search of treasure. The domain contains ladders as well as doors which impede the agent. Some doors can be opened and closed with levers, while others require a key to unlock. The agent possesses a number of high-level agent-space subgoal options, each of which executes many pixel-level primitive actions such as JumpLeft and ClimbLadder. More details are given by Konidaris et al. (2018).

The problem space consists of the xy-position of the agent, key and treasure, the angle of the levers (which determines whether a door is open) and the state of the lock. We construct the agent space by first tiling the screen into cells. We then produce a vector of length 9 whose elements are the sprite types in the cells adjacent to the agent.

We learn a representation using agent-space transitions only, following a procedure identical to Konidaris et al. (2018). First, we collect agent-space transitions by interacting with the environment. We use SVMs (Cortes & Vapnik, 1995) with Platt scaling (Platt, 1999) to estimate preconditions, and use kernel density estimation (Rosenblatt, 1956; Parzen, 1962) to model effect distributions. Finally, for all valid combinations of effect distributions, we compute the probability that states drawn from their grounding lie within the precondition of each option, discarding rules with a success probability of less than 5%. This results in portable action rules, one of which is illustrated by Figure 3a.
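A minimal sketch of this estimation step, assuming scikit-learn is used: an SVM with Platt scaling (probability=True) models the precondition, and a Gaussian kernel density estimate models the effect distribution. The function names, kernel choices, bandwidth and sample counts are placeholders rather than the authors' exact implementation.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.neighbors import KernelDensity

def fit_precondition(observations, can_execute):
    """Probabilistic classifier for Pr(option executable | observation).
    SVC with probability=True applies Platt scaling internally."""
    clf = SVC(kernel="rbf", probability=True)
    clf.fit(observations, can_execute)          # can_execute: 0/1 labels
    return clf

def fit_effect(successor_observations, bandwidth=0.1):
    """Kernel density estimate of the option's effect (image) distribution."""
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth)
    kde.fit(successor_observations)
    return kde

def rule_success_probability(effect_kde, precondition_clf, n_samples=1000):
    """Probability that observations drawn from one option's effect satisfy
    another option's precondition; rules below a threshold (5% in the paper)
    would be discarded."""
    samples = effect_kde.sample(n_samples)
    return float(np.mean(precondition_clf.predict_proba(samples)[:, 1]))
```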

We then partition the above options according to their effects in problem space. We follow previous work (Konidaris et al., 2015; Andersen & Konidaris, 2017) by clustering options' effect states using DBSCAN (Ester et al., 1996), and then assigning each cluster a label. For each transition, we record the start and end partition label, which allows us to learn the link between the precondition and effect parameter. This instantiates the rules for the given task. Figure 3b shows the different partition labels for the DownLadder operator.

Figure 3. (a) The precondition (top) and effect (bottom) for the DownLadder operator, which states that in order to execute the option, the agent must be standing above the ladder. The option results in the agent standing on the ground below it. The black spaces refer to unchanged low-level state variables. (b) Three problem-space partitions for the DownLadder operator. We assign a unique label to each of the circled partitions and combine it with the portable rule in (a) to produce a grounded operator.
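The partitioning step could be implemented as follows, assuming scikit-learn's DBSCAN; the eps and min_samples values are placeholders, and link_parameters simply tallies start/end label pairs as described above.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def partition_effects(effect_states, eps=0.5, min_samples=5):
    """Cluster an option's problem-space effect states with DBSCAN and return
    a partition label for each transition (-1 marks noise points)."""
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(
        np.asarray(effect_states))

def link_parameters(start_labels, end_labels):
    """Count how often each start partition leads to each end partition, i.e.
    the task-specific link between precondition and effect parameters."""
    links = {}
    for a, b in zip(start_labels, end_labels):
        links[(a, b)] = links.get((a, b), 0) + 1
    return links
```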


5. Inter-Task Transfer

We construct a set of seven tasks corresponding to different levels. Because the levels have different configurations, constructing a representation in problem space requires that we relearn each level from scratch. However, when constructing an agent-centric representation, rules learned in one task can immediately be used in subsequent tasks. We gather k transition samples from each task by executing options uniformly at random, and use these samples to build both task-specific and agent-centric (portable) models.

To evaluate model accuracy, we record a set of 100 two-step plans ⟨s_1, o_1, s_2, o_2⟩ for each task. Let M_k^{ρ_i} be the model constructed for task ρ_i using k samples. We then calculate the likelihood of each plan under the model: Pr(s_1 ∈ I_{o_1} | M_k^{ρ_i}) × Pr(s′_1 ∈ I_{o_2} | M_k^{ρ_i}), where s′_1 ∼ Eff(o_1). We build models using increasing numbers of samples until the likelihood averaged over all plans is greater than some acceptable threshold (we use a value of 0.75), at which point we continue to the next task. The results are given by Figure 4.

[Figure 4: cumulative number of samples required (y-axis) against the number of tasks encountered (x-axis), comparing task-specific symbols with portable symbols.]

Figure 4. Cumulative number of samples required to learn sufficiently accurate models as a function of the number of tasks encountered. Results are averaged over 100 random permutations of the task order. Standard errors are specified by the shaded areas.

The results show a sample complexity that scales linearly with the number of tasks when learning problem-space symbols. This is to be expected, since each task is independent and must be learned from scratch. Conversely, learning and reusing portable symbols require fewer samples as we encounter more tasks, leading to a sublinear increase.

We might have expected to see a plateauing in sample collection as we learn more portable rules. However, this was not the case owing to uniform random exploration—a better approach would have been to use knowledge of the currently learned partitions to guide exploration (e.g. partitions that we have yet to model should receive more attention). Despite this, our approach still demonstrates the advantage of transfer.
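A sketch of the evaluation loop described above, assuming a hypothetical build_model(k) routine that fits a model from k transition samples and exposes plan_likelihood for a recorded two-step plan; the batch size and sample cap are arbitrary choices for illustration.

```python
def samples_required(build_model, plans, batch=100, threshold=0.75,
                     max_samples=20000):
    """Increase the number of transition samples until the average two-step
    plan likelihood under the learned model exceeds the threshold (0.75 in
    the paper), then report how many samples were needed.

    build_model(k) is assumed to return a model whose plan_likelihood method
    implements Pr(s_1 in I_o1 | M) * Pr(s'_1 in I_o2 | M), with s'_1 ~ Eff(o_1).
    """
    k = batch
    while k <= max_samples:
        model = build_model(k)
        avg = sum(model.plan_likelihood(p) for p in plans) / len(plans)
        if avg > threshold:
            return k
        k += batch
    return max_samples
```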

6. Related Work

There has been some work in autonomously learning parameterised skills, particularly in the field of relational reinforcement learning. Zettlemoyer et al. (2005), for instance, are able to learn parameterised operators; however, the high-level symbols that constitute the state space are given. Relocatable action models (Leffler et al., 2007) aggregate states into "types" which determine the transition behaviour. State-independent representations of the outcomes from different types are learned and shown to improve the learning rate in a single task. Jetchev et al. (2013), Ugur & Piater (2015) and Kaelbling & Lozano-Pérez (2017) are able to discover parameterised symbols directly from low-level states. In these cases, the parameters refer to the state variables modified by the action. Zhang et al. (2018) learn models of an environment's dynamics which can generalise across tasks, but the abstraction is given.

7. Summary

We specified a framework for learning portable symbols, showing that the addition of problem-specific information can be used to construct a representation for planning. This allows us to leverage experience in solving new, unseen tasks—a step towards creating adaptable, long-lived agents.

References

Andersen, G. and Konidaris, G.D. Active exploration for learning symbolic representations. In Advances in Neural Information Processing Systems, pp. 5016–5026, 2017.

Barto, A.G. and Mahadevan, S. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13(4):341–379, 2003.

Cortes, C. and Vapnik, V. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

Coulom, R. Efficient selectivity and backup operators in Monte-Carlo tree search. In International Conference on Computers and Games, pp. 72–83. Springer, 2006.

Ester, M., Kriegel, H., Sander, J., and Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In 2nd International Conference on Knowledge Discovery and Data Mining, volume 96, pp. 226–231, 1996.

Ghallab, M., Nau, D., and Traverso, P. Automated Planning: Theory and Practice. Elsevier, 2004.

Jetchev, N., Lang, T., and Toussaint, M. Learning grounded relational symbols from continuous data for abstract reasoning. In Proceedings of the 2013 ICRA Workshop on Autonomous Learning, 2013.

Kaelbling, L.P. and Lozano-Pérez, T. Learning composable models of parameterized skills. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation, pp. 886–893. IEEE, 2017.

Kocsis, L. and Szepesvári, C. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pp. 282–293. Springer, 2006.

Konidaris, G.D., Scheidwasser, I., and Barto, A.G. Transfer in reinforcement learning via shared features. Journal of Machine Learning Research, 13(May):1333–1371, 2012.

Konidaris, G.D., Kaelbling, L.P., and Lozano-Pérez, T. Symbol acquisition for probabilistic high-level planning. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, pp. 3619–3627, 2015.

Konidaris, G.D., Kaelbling, L.P., and Lozano-Pérez, T. From skills to symbols: Learning symbolic representations for abstract high-level planning. Journal of Artificial Intelligence Research, 61(January):215–289, 2018.

Leffler, B.R., Littman, M.L., and Edmunds, T. Efficient reinforcement learning with relocatable action models. In Proceedings of the 22nd AAAI Conference on Artificial Intelligence, volume 7, pp. 572–577, 2007.

Parzen, E. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 33(3):1065–1076, 1962.

Pasula, H., Zettlemoyer, L.S., and Kaelbling, L.P. Learning probabilistic relational planning rules. In Proceedings of the Fourteenth International Conference on Automated Planning and Scheduling, pp. 73–81, 2004.

Platt, J. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.

Precup, D. Temporal abstraction in reinforcement learning. PhD thesis, University of Massachusetts Amherst, 2000.

Rosenblatt, M. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, pp. 832–837, 1956.

Sutton, R.S. and Barto, A.G. Reinforcement Learning: An Introduction. MIT Press, 1998.

Ugur, E. and Piater, J. Bottom-up learning of object categories, action effects and logical rules: From continuous manipulative exploration to symbolic planning. In Proceedings of the 2015 IEEE International Conference on Robotics and Automation, pp. 2627–2633. IEEE, 2015.

Younes, H.L.S. and Littman, M.L. PPDDL 1.0: An extension to PDDL for expressing planning domains with probabilistic effects. Technical report, 2004.

Zettlemoyer, L.S., Pasula, H., and Kaelbling, L.P. Learning planning rules in noisy stochastic worlds. In Proceedings of the Twentieth National Conference on Artificial Intelligence, pp. 911–918, 2005.

Zhang, A., Lerer, A., Sukhbaatar, S., Fergus, R., and Szlam, A. Composable planning with attributes. arXiv preprint arXiv:1803.00512, 2018.