Learning to Plan from Raw Data in Grid-Based Games
EPiC Series in Computing, Volume 55, 2018, Pages 54–67. GCAI-2018: 4th Global Conference on Artificial Intelligence.

Andrea Dittadi, Thomas Bolander, and Ole Winther
Technical University of Denmark, Lyngby, Denmark

Abstract

An agent that autonomously learns to act in its environment must acquire a model of the domain dynamics. This can be a challenging task, especially in real-world domains, where observations are high-dimensional and noisy. Although in automated planning the dynamics are typically given, there are action schema learning approaches that learn symbolic rules (e.g. STRIPS or PDDL) to be used by traditional planners. However, these algorithms rely on logical descriptions of environment observations. In contrast, recent methods in deep reinforcement learning for games learn from pixel observations. However, they typically do not acquire an environment model, but a policy for one-step action selection. Even when a model is learned, it cannot generalize to unseen instances of the training domain. Here we propose a neural network-based method that learns from visual observations an approximate, compact, implicit representation of the domain dynamics, which can be used for planning with standard search algorithms, and generalizes to novel domain instances. The learned model is composed of submodules, each implicitly representing an action schema in the traditional sense. We evaluate our approach on visual versions of the standard domain Sokoban, and show that, by training on a single instance, it learns a transition model that can be successfully used to solve new levels of the game.

1 Introduction

Automated planning has efficient methods for computing an action sequence leading to a desired goal, under the condition that the initial state, actions, and goal are all described beforehand in a logical language. Automated planning systems can for instance be used to solve puzzles in the classical Sokoban video game [7]. For a planning system to solve a specific Sokoban level, it only needs to be fed the logical description of the initial state of the level, of the available actions, and of the goal. The standard language for describing these is (some version of) PDDL. A planning system can thus solve Sokoban levels without any previous experience with the game, as long as the game is suitably described in PDDL. However, a human player playing the game for the first time will normally only be provided with the information that there are four available actions, Up, Down, Left, and Right, and will not be given any formal description of the transition models of these actions. The human player will have to learn the transition models from observing how her actions affect the state of the domain. Although current planning systems can be extended so that they can also learn action models from observing state transitions in the domain [19, 1, 28, 32, 20, 17], the human novice player is still facing a more complicated task.

Figure 1: A state from the Sokoban training level. Left: the agent is represented as a green circular shape, target cells as large blue squares, boxes as smaller and lighter blue squares, and walls are black. Right: each cell is represented by one pixel, so the superimposition of box and target is not evident.
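As a concrete reading of the right panel of Figure 1, the short sketch below encodes a Sokoban level as a one-pixel-per-cell observation. Only the fact that observations are matrices of values in $[0, 1]$ comes from the paper (Section 2); the ASCII input format and the particular value assigned to each object type are illustrative assumptions, not the encoding used by the authors.

```python
import numpy as np

# Hypothetical value assignments for a one-pixel-per-cell observation.
# The paper only states that observations are matrices of numbers in [0, 1];
# the exact value given to each object type here is an illustrative assumption.
WALL, FLOOR, TARGET, BOX, AGENT = 0.0, 1.0, 0.6, 0.4, 0.8

def encode_level(ascii_rows):
    """Turn an ASCII Sokoban level into a [0, 1]-valued observation matrix."""
    symbol_to_value = {'#': WALL, ' ': FLOOR, '.': TARGET, '$': BOX, '@': AGENT}
    return np.array([[symbol_to_value[c] for c in row] for row in ascii_rows],
                    dtype=np.float32)

level = ["#####",
         "#@$.#",
         "#####"]
observation = encode_level(level)  # shape (3, 5), one pixel per cell
```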
The existing frameworks for learning action models require that observations are made as transitions between logically described states (conjunctions of propositions or ground atoms). The human player, on the other hand, can only make observations based on the pixels on the screen, the raw data.

Learning to play video games exclusively from pixel input has attracted a lot of attention in the machine learning community in the last few years [16, 15, 18, 5, 10]. However, most state-of-the-art algorithms use model-free reinforcement learning (RL) to learn a policy. The disadvantage of learning a policy for a specific instance of the domain, rather than a general transition model of the domain, is that in a game like Sokoban a policy learned to solve level 1 cannot immediately be used to solve level 2. On the other hand, model-based RL methods applied to these domains currently lack the generalization and planning capabilities that are typical of traditional planning approaches [3, 5, 4, 30].

Our approach allows us to learn a compact representation of the transition model that can be used by automated planning techniques. Since we would like to stay as close as possible to the problem that a novice human player would be facing, we provide our agent only with raw pixel input. The proposed method differs significantly from the standard approach in automated planning by the lack of explicit logical action schemas. Instead, we train neural networks to give a compact and approximate representation of the transition function. However, it still shares some important properties with classical planning. First, the state space is not explicitly represented, but compactly represented by structures that can be exponentially more succinct than the state space they induce. Second, standard tree and graph search techniques like BFS, DFS, and A* can be used to compute plans based on those compact action descriptions. Ideally, one should do informed search with an efficient heuristic, but if we want to solve the problem without providing our system with more information than a human player would have available, this heuristic also needs to be learned. Learning heuristics from raw pixel data is outside the scope of this paper.

We test our agent on Sokoban (see Fig. 1), a classical planning domain in which an agent in a 2D world pushes boxes and moves in 4 possible directions [7]. The goal is to place a box on every target location. Boxes cannot be pulled, and a box can only be pushed if the cell it is pushed towards is currently unoccupied. For this reason, one of the major challenges in solving Sokoban puzzles is the risk of deadlocks that would make the puzzle irreversibly unsolvable.

The main contribution of this paper is a framework for learning a compact and general representation of the transition model exclusively from raw data. This learned model can then be employed in standard search algorithms to solve problems and play games. We make the assumption that actions have local preconditions and effects in the environment, and that these are typically limited. The predictive model can thus learn to focus on a small subset of inputs, leading to efficient learning and generalization.
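To make the role of the learned model in search concrete, the following is a minimal breadth-first-search planner of the kind referred to above. It is a sketch, not the paper's implementation: the interface `predict(state, action)` standing in for the learned transition model, the `goal_test` function, and the node budget are assumed names and parameters, and states are assumed to be NumPy arrays hashed via their byte representation.

```python
from collections import deque

def bfs_plan(s0, actions, predict, goal_test, max_nodes=100_000):
    """Breadth-first search over states generated by a learned transition model.

    `predict(state, action)` stands in for the learned approximation of T(s, a)
    and `goal_test(state)` for the goal test g; both names are assumptions.
    States are assumed to be NumPy arrays and are hashed via their bytes.
    """
    frontier = deque([(s0, [])])
    visited = {s0.tobytes()}
    while frontier and len(visited) < max_nodes:
        state, plan = frontier.popleft()
        if goal_test(state):
            return plan                    # action sequence leading from s0 to a goal
        for a in actions:
            successor = predict(state, a)  # learned transition model
            key = successor.tobytes()
            if key not in visited:
                visited.add(key)
                frontier.append((successor, plan + [a]))
    return None                            # no plan found within the search budget
```

Switching to DFS or A* would only change the frontier data structure and expansion order; as noted above, an informed heuristic for A* would itself have to be learned from raw data.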
2 The Learning Framework

The planning problems considered in this work can be modeled as tuples $(S, A, \mathcal{T}, G, s_0)$, where $S$ is a finite set of states, $A$ is a finite set of actions, $\mathcal{T} : S \times A \to S$ is a deterministic transition function, $G \subseteq S$ is a set of goal states, and $s_0 \in S$ is an initial state. In each state $s$, starting from $s_0$, the agent chooses an action $a \in A$ that leads to $s' = \mathcal{T}(s, a)$. No distinction is made regarding the applicability of actions: $s' = \mathcal{T}(s, a)$ is defined for all $s$ and $a$, with $s' = s$ if $a$ is not applicable in $s$. The agent's task is to find a plan, a sequence of actions that leads from $s_0$ to a goal. In contrast to typical RL settings, in which the goals are given in the form of a reward signal, the set of goal states is given as a goal test function $g : S \to \{0, 1\}$, $g(s) = \mathbb{1}_G(s)$, with $\mathbb{1}_A$ the indicator function of a set $A$. The states are fully observable, and each observation is a visual frame, i.e. a matrix of numbers in the range $[0, 1]$. Moreover, we assume that the agent is aware of its own position in the environment.

We assume that the preconditions and effects of all actions are restricted to a local $K$-neighbourhood of the agent, defined as the $K \times K$ submatrix of the state centered around the agent's location $(i, j)$, with $K$ an integer denoting the neighbourhood size:

$$L^K_{i,j}(s) = s\!\left[\, i - \tfrac{K-1}{2} : i + \tfrac{K-1}{2},\; j - \tfrac{K-1}{2} : j + \tfrac{K-1}{2} \,\right].$$

This assumption, which we call the $K$-locality assumption, implies that $s'$ only differs from $s$ in the submatrix $L^K_{i,j}(s)$, and that the changes in that submatrix depend only on the submatrix itself. We here only consider domains for which the $K$-locality assumption holds for some $K$. The value of $K$ is not necessarily known initially, and has to be learned. When $K$ is given (explicitly or implicitly), we call a local $K$-neighbourhood a local state, and denote it by $s_L \in S_K$, with $S_K$ the set of all possible local states of size $K$.

Let $K^*$ be the least value of $K$ for which a domain satisfies the $K$-locality assumption. We will assume that early in the learning phase the agent observes transitions for which the $K$-locality assumption does not hold for $K < K^*$. Thus, we can consider $K^*$ known for each domain after a few environment interactions. Given $s$, $a$, and the agent's position $(i, j)$, the successor state $s'$ is completely defined by the changes in the local neighbourhood following action $a$.
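The local neighbourhood $L^K_{i,j}(s)$ above is simply a square slice of the observation matrix centered on the agent. A minimal NumPy sketch of this extraction is given below; the function name is hypothetical, and padding the frame at the boundary with a constant value is an assumption, since the text above does not specify how neighbourhoods that extend past the edge of the frame are handled.

```python
import numpy as np

def local_neighbourhood(s, i, j, K, pad_value=0.0):
    """Return L^K_{i,j}(s): the K x K submatrix of state s centred on the
    agent's position (i, j).  K is assumed odd.  Cells falling outside the
    frame are filled with `pad_value` (boundary handling is an assumption)."""
    r = (K - 1) // 2
    padded = np.pad(s, r, mode='constant', constant_values=pad_value)
    return padded[i : i + K, j : j + K]  # (i, j) is shifted by r after padding
```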