Classical Planning in Latent Space: From Unlabeled Images to PDDL (and back)

Masataro Asai, Alex Fukunaga Graduate School of Arts and Sciences University of Tokyo

Abstract Current domain-independent, classical planners re- ? quire symbolic models of the problem domain and Initial state Goal state image Original instance as input, resulting in a knowledge acquisi- image (black/white) Mandrill image tion bottleneck. Meanwhile, although recent work in has achieved impressive results in Figure 1: An image-based 8-puzzle. many fields, the knowledge is encoded in a subsym- bolic representation which cannot be directly used [ ] by symbolic systems such as planners. We propose Mnih et al., 2015; Graves et al., 2016 . However, the current LatPlan, an integrated architecture combining deep state-of-the-art in pure NN-based systems do not yet provide learning and a classical planner. Given a set of un- guarantees provided by symbolic planning systems, such as labeled training image pairs showing allowed ac- deterministic completeness and solution optimality. tions in the problem domain, and a pair of images Using a NN-based perceptual system to automatically pro- representing the start and goal states, LatPlan uses vide input models for domain-independent planners could a Variational to generate a discrete greatly expand the applicability of planning technology and latent vector from the images, based on which a offer the benefits of both paradigms. We consider the problem PDDL model can be constructed and then solved of robustly, automatically bridging the gap between such sub- by an off-the-shelf planner. We evaluate LatPlan us- symbolic representations and the symbolic representations ing image-based versions of 3 planning domains: required by domain-independent planners. 8-puzzle, LightsOut, and Towers of Hanoi. Fig. 1 (left) shows a scrambled, 3x3 tiled version of the the photograph on the right, i.e., an image-based instance of the 8-puzzle. Even for humans, this photograph-based task is ar- 1 Introduction guably more difficult to solve than the standard 8-puzzle be- Recent advances in domain-independent planning have cause of the distracting visual aspects. We seek a domain- greatly enhanced their capabilities. However, planning prob- independent system which, given only a set of unlabeled lems need to be provided to the planner in a structured, sym- images showing the valid moves for this image-based puz- bolic representation such as PDDL [McDermott, 2000], and zle, finds an optimal solution to the puzzle. Although the 8- in general, such symbolic models need to be provided by a hu- puzzle is trivial for symbolic planners, solving this image- man, either directly in a modeling language such as PDDL, or based problem with a domain-independent system which (1) via a compiler which transforms some other symbolic prob- has no prior assumptions/knowledge (e.g., “sliding objects”, lem representation into PDDL. This results in the knowledge- “tile arrangement”), and (2) must acquire all knowledge from acquisition bottleneck, where the modeling step is sometimes the images, is nontrivial. Such a system should not make as- the bottleneck in the problem solving cycle. In addition, the sumptions about the image (e.g., “a grid-like structure”). The requirement for symbolic input poses a significant obstacle to only assumption allowed about the nature of the task is that it applying planning in new, unforeseen situations where no hu- can be modeled and solved as a classical planning problem. man is available to create such a model or a generator, e.g., We propose Latent-space Planner (LatPlan), an integrated autonomous spacecraft exploration. In particular this first re- architecture which uses NN-based image processing to com- quires generating symbols from raw sensor input, i.e., the pletely automatically generate a propositional, symbolic symbol grounding problem [Steels, 2008]. problem representation. LatPlan consists of 3 components: Recently, significant advances have been made in neural (1) a NN-based State Autoencoder (SAE), which provides network (NN) deep learning approaches for perceptually- a bidirectional mapping between the raw input of the world based cognitive tasks including image classification [Deng et states and its symbolic/categorical representation, (2) an ac- al., 2009] and object recognition [Ren et al., 2015], as well tion model generator which generates a PDDL model of the as NN-based problem-solving systems for problem solving problem domain using the symbolic representation acquired model generation method of of Sec. 2.2 could be applied to such “symbols”. However, such a trivial SAE lacks the cru- The latent layer The output converges cial properties of generalization – ability to encode/decode converges to the categorical distrib. to the input unforeseen world states to symbols – and robustness – two similar images that represent “the same world state” should map to the same symbolic representation. Thus, we need a mapping where the symbolic representation captures the Figure 2: Step 1: Train the State Autoencoder by minimiz- “essence” of the image, not merely the raw pixel vector. The ing the sum of the reconstruction loss (binary cross-entropy main technical contribution of this paper is the proposal of between the input and the output) and the variational loss a SAE which is implemented as a Variational Autoencoder of Gumbel-Softmax (KL divergence between the actual la- [Kingma et al., 2014] with a Gumbel-Softmax (GS) activa- tent distribution and the random categorical distribution as the tion function [Jang et al., 2017]. target). As the training continues, the output of the network An AutoEncoder (AE) is a type of Feed-Forward Network converges to the input images. Also, as the Gumbel-Softmax (FFN) that uses to produce an image temperature τ decreases during training, the latent values ap- that matches the input [Hinton and Salakhutdinov, 2006]. proaches the discrete categorical values of 0 and 1. The intermediate layer is said to have a Latent Represen- tation of the input and is considered to be performing data compression. AEs are commonly used for pretraining a neu- by the SAE, and (3) a symbolic planner. Given only a set of ral network. A Variational AutoEncoder (VAE) [Kingma and unlabeled images from the domain as input, we train (unsu- Welling, 2013] is a type of AE that forces the latent layer (the D pervised) the SAE and use it to generate , a PDDL repre- most compressed layer in the AE) to follow a certain distri- sentation of the image-based domain. Then, given a planning bution (e.g., Gaussian) for given input images. Since the tar- problem instance as a pair of initial and goal images such get random distribution prevents backpropagating the gradi- as Fig. 1, LatPlan uses the SAE to map the problem to a sym- ent, most VAE implementations use reparametrization tricks, D bolic planning instance in , and uses the planner to solve the which decompose the target distribution into a differentiable problem. We evaluate LatPlan using image-based versions of distribution and a purely random distribution that does not the 8-puzzle, LightsOut, and Towers of Hanoi domains. require the gradient. For example, the Gaussian distribution N(σ, µ) can be decomposed into µ + σN(1, 0). 2 LatPlan: System Architecture Gumbel-Softmax (GS) reparametrization is a technique for This section describe the LatPlan architecture and the cur- enforcing a categorical distribution on a particular layer of the rent implementation, LatPlanα. LatPlan works in 3 phases. In neural network. [Jang et al., 2017]. A “temperature” param- Phase 1 (symbol-grounding, Sec. 2.1), a State AutoEncoder eter τ, which controls the magnitude of approximation to the providing a bidirectional mapping between raw data (e.g., im- categorical distribution, is decreased by an annealing sched- ages)1 and symbols is learned (unsupervised) from a set of ule τ ← max(0.1, exp(−rt)) where t is the current training unlabeled images of representative states. In Phase 2 (action epoch and r is an annealing ratio. Using a GS layer in the net- model generation, Sec. 2.2), the operators available in the do- work forces the layer to converge to a discrete one-hot vector main is generated from a set of pairs of unlabeled images, when the temperature approaches near 0. and a PDDL domain model is generated. In Phase 3 (plan- The SAE is comprised of multilayer combined ning, Sec. 2.3), a planning problem instance is input as a pair with Dropouts and Batch Normalization in both the encoder of images (i, g) where i shows an initial state and g shows and the decoder networks, with a GS layer in between. The a goal state. These are converted to symbolic form using the input to the GS layer is the flat, last layer of the encoder net- SAE, and the problem is solved by the symbolic planner. For work. The output is an (N,M) matrix where N is the number example, an 8-puzzle problem instance in our system consists of categorical variables and M is the number of categories. of an image of the start (scrambled) configuration of the puz- The input is fed to a fully connected layer of size N × M, zle (i), and an image of the solved state (g). Finally, the sym- which is reshaped to a (N,M) matrix and processed by the bolic, latent-space plan is converted to a sequence of human- GS activation function. comprehensible images visualizing the plan (Sec. 2.4). Our key observation is that these categorical variables can be used directly as propositional symbols by a symbolic rea- 2.1 Symbol Grounding with a State Autoencoder soning system, i.e., this provides a solution to the symbol grounding problem in our architecture The State Autoencoder (SAE) provides a bidirectional map- . We obtain the propo- M = 2 ping between images and a symbolic representation. sitional representation by specifying , effectively ob- taining N propositional state variables. It is possible to spec- First, note that a direct 1-to-1 mapping between images ify different M for each variable and represent the world us- and discrete objects can be trivially obtained simply by us- ing multi-valued representation as in SAS+ [Backstr¨ om¨ and ing the array of discretized pixel values as a “symbol”. The Nebel, 1995]. In this paper, we use M = 2 for all variables 1Although the LatPlan architecture can, in principle, be applied for simplicity. to various unstructured data input including images, texts or low- The trained SAE provides bidirectional mapping between level sensors, in the rest of the paper we refer to “images” for sim- the raw inputs (subsymbolic representation) to and from their plicity and also because the current implementation is image-based. symbolic representations: Input 1: (a) Training images for training the State AutoEncoder and (b) image 2.2 Domain Model Generation pairs representing valid actions The model generator takes as input a trained SAE, and a Encode Subsymbolic set R contains pairs of raw images. In each image pair Input 2: Symbolic Decode Initial state (prei, posti) ∈ R, prei and posti are images representing the & goal state Intermediate state of the world before and after some action a is executed, image Encode Action definitions states i Initial State in PDDL/SAS respectively. In each ground action image pair, the “action” is PDDL Plan implied by the difference between prei and posti. The output Domain- Plan Simulator independent Action1 of the model generator is a PDDL domain file for a grounded Classical Planner Action2 Action3 unit-cost STRIPS planning problem. Goal State For each (prei posti) ∈ R we apply the learned SAE Solution Plan as images to prei and posti to obtain (Encode(prei), Encode(posti)), the symbolic representations (latent space vectors) of the state Figure 3: Classical planning in latent space: We use the before and after action a is executed. This results in a set of learned State AutoEncoder (Sec. 2.1) to convert pairs of im- i symbolic ground action instances A. ages (pre, post) first to symbolic ground actions and then to Ideally, a model generation component would induce a a PDDL domain (Sec. 2.2) We also encode initial and goal complete action model from a limited set of symbolic ground state images into a symbolic ground actions and then a PDDL action instances. However, action model learning from a problem. A classical planner finds the symbolic solution plan. limited set of action instances is a nontrivial area of ac- Finally, intermediate states in the plan are decoded back to a tive research [Konidaris et al., 2014; Mourao˜ et al., 2012; human-comprehensible image sequence. Yang et al., 2007; Celorrio et al., 2012]. Since the focus of this paper is on the overall LatPlan architecture and the SAE, we leave model induction for future work. • b = Encode(r) maps an image r to a boolean vector b. Instead, the current implementation LatPlanα generates a model based on all ground actions, i.e., R contains image • r˜ = Decode(b) maps a boolean vector b to an image r˜. pairs representing all ground actions that are possible in this domain, so A (generated by applying the SAE to all elements Encode(r) maps raw input r to a symbolic representation of R) contains all symbolic ground actions possible in the by feeding the raw input to the encoder network, extract domain (We discuss in Sec. 5 that a robot could traverse the the activation in the GS layer, and take the first row in the whole state space by random-walk with resets and eventually N × 2 matrix, resulting in a binary vector of length N. Sim- collect R). In the experiments Sec. 3, we generate image pairs ilarly, Decode(b) maps a binary vector b back to an im- for all ground actions using an external image generator. It is age by concatenating b and its complement ¯b to obtain a important to note that while R contains all possible actions, R N × 2 matrix and feeding it to the decoder network. These is not used for training the SAE. As explained in Sec. 2.1, the are lossy compression/decompression functions, so in gen- SAE is trained using at most 12000 images while the entire eral, r˜ = Decode(Encode(r)) is similar to r, but may have state space is much larger. negligible errors from r. This is acceptable for our purposes. LatPlanα compiles A directly into a PDDL model as fol- lows. For each action (Encode(pre ), Encode(post )) ∈ A, It is not sufficient to simply use traditional activation func- i i each bit b (1 ≤ j ≤ N) in these boolean vectors is mapped tions such as sigmoid or softmax and round the continuous j to propositions (b -true) and (b -false) when the en- activation values in the latent layer to obtain discrete 0/1 j j coded value is 1 and 0 (resp.). Encode(pre ) is directly used values. As explained in Sec. 2.4, we need to map the sym- i as the preconditions of action a . The add/delete effects of ac- bolic plan back to images, so we need a decoding network i tion i are computed by taking the bitwise difference between trained for 0/1 values approximated by a smooth function, Encode(pre ) and Encode(post ). For example, when b e.g., GS or similar approach such as [Maddison et al., 2017]. i i j changes from 1 to 0, it compiles into (and (b -false) A rounding-based scheme would be unable to restore the im- j (not (b -true))). ages from the latent layer because the decoder network is j The initial and the goal states are similarly created by ap- trained using continuous activation values. Also, represent- plying the SAE to the initial and goal images. ing the rounding operation as a layer of the network is in- feasible because rounding is non-differentiable, precluding 2.3 Planning with an Off-the-Shelf Planner backpropagation-based training of the network. The PDDL instance generated in the previous step can be In some domains, an SAE trained on a small fraction of the solved by an off-the-shelf planner. LatPlanα uses the Fast possible states successfully generalizes so that it can Encode Downward planner [Helmert, 2006]. However, on the mod- and Decode every possible state in that domain. In all our els generated by LatPlanα, the invariant detection routines in experiments below, on each domain, we train the SAE using the Fast Downward PDDL to SAS translator (translate.py) be- randomly selected images from the domain. For example, on came a bottleneck, so we wrote a trivial, replacement PDDL the 8-puzzle, the SAE trained on 12000 randomly generated to SAS converter without the invariant detection. configurations out of 362880 possible configurations is used LatPlan inherits all of the search-related properties of the by the domain model generator (Sec. 2.2) to Encode every planner which is used. For example, if the planner is com- 8-puzzle state. plete and optimal, LatPlan will find an optimal plan for the given plan (if one exists), with respect to the portion of the 8-puzzle. The entire state space consists of 362880 states (9!). state-space graph captured by the acquired model. Domain- From any specific goal state, the reachable number of states is independent heuristics developed in the planning literature 181440 (9!/2) Note that the same image is used for each digit are designed to exploit structure in the domain model. Al- in all states, e.g., the tile for the “1” digit is the same image in though the structure in models acquired by LatPlan may all states. Training takes about 40 minutes with 1000 epochs not directly correspond to those in hand-coded models, intu- on a single NVIDIA GTX-1070. itively, there should be some exploitable structure. The search results in Sec. 3.3 suggest that indeed, latent space has some exploitable structure. 2.4 Visualizing/Executing the Plans Since the actions comprising the plan are SAE-generated la- tent bit vectors, the “meaning” of each symbol (and thus the plan) is not necessarily clear to a human observer. However, we can obtain a step-by-step visualization of the world (im- 0-tile corresponds to the blank ages) as the plan is executed (e.g. Fig. 4) by starting with the tile in standard 8-puzzle latent state representation of the initial state, applying (simu- lating) actions step-by-step (according to the PDDL model Figure 4: Output of solving the MNIST 8-puzzle instance acquired above) and Decode’ing the latent bit vectors for with the longest (31 steps) optimal plan. [Reinefeld 1993] each intermediate state to images using the SAE. In this paper, we evaluate LatPlan in Sec. 3 using puz- Scrambled Photograph 8-puzzle The above MNIST 8- zle domains such as the 8-puzzle, LightsOut, and Towers of puzzle described above consists of images where each digit Hanoi. Thus, physically “executing” the plan is not neces- is cleanly separated from the black region. To show that Lat- sary, as finding the solution to the puzzles is the objective, Plan does not rely on cleanly separated objects, we solve 8- so a “mental image” of the solution (i.e., the image sequence puzzles generated by cutting and scrambling real photographs visualization) is sufficient. In domains where actions have ef- (similar to sliding tile puzzle toys sold in stores). We used the fects in the world, it will be necessary to consider how actions “Mandrill” image, a standard benchmark in the image pro- found by LatPlan (transitions between latent bit vector pairs) cessing literature. The image was first converted to greyscale can be mapped to actuation (future work). and then rounded to black/white (0/1) values.

3 Experimental Evaluation All of the SAE networks used in the evaluation have the same network topology except the input layer which should fit the size of the input images. They are implemented using TensorFlow and Keras and consist of the following layers: [Input, GaussianNoise(0.1), fc(4000), relu, bn, dropout(0.4), Original Right-eye tile corresponds fc(4000), relu, bn, dropout(0.4), fc(49x2), GumbelSoftmax, Mandrill to the blank tile in dropout(0.4), fc(4000), relu, bn, dropout(0.4), fc(4000), relu, image: standard 8-puzzle bn, dropout(0.4), fc(input), sigmoid]. Here, fc = fully con- nected layer, bn = Batch Normalization, and tensors are re- Figure 5: Output of solving a photograph-based 8-puzzle shaped accordingly. The last layers can be replaced with (Mandrill). [fc(input × 2), GumbelSoftmax, TakeFirstRow] for better reconstruction when we can assume that the input image is Towers of Hanoi (ToH) Disks of various sizes must be binarized. The network is trained to minimize the sum of moved from one peg to another, with the constraint that a the variational loss and the reconstruction loss (binary cross- larger disk can never be placed on top of a smaller disk. We entropy) using Adam optimizer (lr:0.001) for 1000 epochs. generated the training and planning inputs for this task, with The latent layer has 49 bits, which sufficiently covers the 3 and 4 disks. Each input image has a dimension of 24 × 122 total number of states in any of the problems that are used and 32 × 146 (resp.), where each disk is presented as a 8px in the following experiments. This could be reduced for each line segment. Due to the smaller number of states, we used domain (made more compact) with further engineering. images of all states as the training input. 3-disk ToH is solved successfully and optimally using 3.1 Solving Various Puzzle Domains with LatPlan the default hyperparameters (Fig. 6, top). As the images be- MNIST 8-puzzle This is an image-based version of the 8- come more complex, training the SAE becomes more dif- puzzle, where tiles contain hand-written digits (0-9) from the ficult. On 4-disks, the SAE trained with the default hy- MNIST database [LeCun et al., 1998]. Each digit is shrunk perparameters (Fig. 6, middle) is confused, resulting in a to 14x14 pixels, so each state of the puzzle is a 42x42 image. flawed model which causes the planner to choose subopti- Valid moves in this domain swap the “0” tile with a neighbor- mal moves (dashed box). Sometimes, the size/existence of ing tile, i.e., the “0” serves as the “blank” tile in the classic disks is confused (red box). Tuning the hyperparameters to re- 8puzzle Twisted LightsOut Twisted LightsOut ↑Result of solving 3-disk Tower of Hanoi with the default network parameters. +N(0,0.3) +N(0,0.3) +salt(0.06)

↑4-disk ToH with the default parameters (optimal plan wrto the flawed model by a confused SAE) Figure 9: SAE robustness vs noise: Corrupted initial state image r and its reconstruction Decode(Encode(r)) by SAE ↑4-disk ToH with tuned parameters on MNIST 8-puzzle and Twisted LightsOut. Images are cor- (optimal plan for the Binarized results of the last steps→ correct model) rupted by Gaussian noise of σ up to 0.3 for both prob- lems, and by salt noise up to p = 0.06 for Twisted Light- Figure 6: Output of solving ToH with 3 and 4 disks. The third sOut. LatPlanα successfully solved the problems. The SAE picture is the result of SAE with different parameters. maps the noisy image to the correct symbolic vector b = Encode(r), conduct planning, then map b back to the de- noised image Decode(b).

3.3 Are Domain-Independent Heuristics Effective Figure 7: Output of solving 4x4 LightsOut (left) and its bina- in Latent Space? rized result (right). Although the goal state shows two blurred switches, they have low values (around 0.3) and disappear in Domain-independent heuristics in the planning literature as- the binarized image. sume certain types of structures that are common in bench- mark domains. An interesting question is whether latent space contains such exploitable structures. We compare search us- duce the SAE loss corrects this problem. Increasing the train- ing a single PDB with greedy merging [Sievers et al., 2012] ing epochs (10000) and tuning the network shape (fc(6000), and blind heuristics (i.e., breadth-first search) in Fast Down- N = 29) allows the SAE to learn a clearer model, allow- ward. The numbers of nodes expanded were: ing correct model generation, resulting in the optimal 15-step plan (Fig. 6, bottom). • MNIST 8-puzzle (6 instances, mean(StdDev)): Blind LightsOut A video game where a grid of lights is in some 176658(25226), PDB 77811(32978) on/off configuration (+: On), and pressing a light toggles its • Mandrill 8-puzzle (1 instance with 31-step optimal solu- state (On/Off) as well as the state of all of its neighbors. The tion, corresponding to the 8-puzzle instance [Reinefeld, goal is all lights Off. Unlike the 8-puzzle where each move 1993]): Blind 335378, PDB 88851 affects only two adjacent tiles, a single operator in 4x4 Light- • ToH (4 disks, 1 instance): Blind 55, PDB 17, sOut can simultaneously flip 5/16 locations. Also, unlike 8- • 4x4 LightsOut (1 instance): Blind 952, PDB 27, puzzle and ToH, the LightsOut game allows some “objects” • 3x3 Twisted LightsOut (1 instance): Blind 522, PDB 214 (lights) to disappear. This demonstrates that LatPlan is not limited to domains with highly local effects and static objects. The domain-independent PDB heuristic significantly re- Twisted LightsOut In all of the above domains, the “ob- duced node expansions. Search times (< 3 seconds for all jects” correspond to rectangles. To show that LatPlan does instances) were also faster for all instances with the PDB. not rely on rectangular regions, we demonstrate its result on Although total runtimes including heuristic initialization is “Twisted LightsOut”, a distorted version of the game where slightly slower than blind search, in domains where goal the original LightsOut image is twisted around the center. Un- states and operators are the same for all instances (e.g., 8- like previous domains, the input images are not binarized. puzzle) PDBs can be reused [Korf and Felner, 2002], PDB generation time can be amortized across many instances. 3.2 Robustness to Noisy Input Although these results show that existing heuristics for We show the robustness of the system against the input noise. classical planning are able to exploit some structure in the We corrupted the initial/goal state inputs by adding Gaussian acquired latent state space model and reduce search effort or salt noise, as shown in Fig. 9. The system is robust enough compared to blind search, much more work is required in to successfully solve the problem, because our SAE is a De- order to understand how the features in latent space interact noising Autoencoder [Vincent et al., 2008] which has an in- with existing heuristics. In addition, a deeper understanding ternal GaussianNoise layer which adds a Gaussian noise to of the symbolic latent space may lead to new search heuristics the inputs (only during training) and learn to reconstruct the which better exploit the properties of latent space. original image from a corrupted version of the image. 4 Related Work [Konidaris et al., 2014] propose a method for generating PDDL from a low-level, sensor actuator space of an agent characterized as a semi-MDP. The inputs to their system are 33 variables representing accurate structured input (e.g., x/y Figure 8: Output of solving 3x3 Twisted LightsOut. distances) or categorical states (the on/off state of a button etc.) while LatPlan takes noisy unstructured images (e.g., for To our knowledge, LatPlan is the first completely auto- 8-puzzle, 42x42=1764-dimensional arrays). mated system of the kind. However, as a proof-of-concept, it Compared to learning from observation (LfO) in the has significant limitations to be addressed in future work. In robotics literature [Argall et al., 2009], (1) LatPlan is trained particular, the domain model generator in LatPlanα does not based on image pairs showing individual actions, not plan ex- perform action model learning from a small set of sample ac- ecutions (sequence of actions); (2) LatPlan focuses on PDDL tions because the focus of this paper is on SAE, not on action for high-level (puzzle-like) tasks, not on motion planning learning. Thus our generator requires the entire set of latent tasks. This significantly affects the data collection scheme: states, transitions and in turn images. While this is obviously While LfO has action segmentation issue because it does impractical, this is not a fundamental limitation of the Lat- not know when an action starts/ends in the plan traces (e.g. Plan architecture. The primitive generator is merely a place- video clip), LatPlan does not, because it can assume that a holder for investigating the overall feasibility of an SAE- robot can explore the world by itself, initiating/terminating based end-to-end planning system (our major contribution) its own action and taking pictures by a camera. The robot and is supposed to be easily replaced by the more sophis- can perform a random walk under physical constraints and ticated ones [Cresswell et al., 2013; Konidaris et al., 2014; supervision, which ensure the legal moves (e.g., the physi- Mourao˜ et al., 2012; Yang et al., 2007]. To our knowledge, cal tile in 8-puzzle). If we further assume that it can “reset” all previous domain learning methods require the structured the world (e.g., into a random configuration), then, the robot (e.g., propositional) representations of states. A related topic could eventually obtain images of the entire state space. is how to specify a goal condition for LatPlan (e.g. “0,1,2 in A closely related line of work in LfO is learning of board the correct places” in a 8-puzzle), rather than a single state. game play from videos [Barbu et al., 2010; Kaiser, 2012; When obtaining the training images is expensive, another Kirk and Laird, 2016]. Unlike LatPlan, these works make rel- bottleneck in LatPlan is the number of images required to atively strong assumptions about the environment, e.g., that train the SAE. Developing more effective learner for mini- there is a grid-like environment. mizing the training data, such as one-shot learning methods There is a large body of previous work using neural net- [Lake et al., 2013], is ongoing work in the DL community. works to directly solve combinatorial tasks, such as TSP LatPlan’s plan can be invalid depending on the SAE accu- [Hopfield and Tank, 1985] or Tower of Hanoi [Bieszczad racy (Sec. 3). Unlike symbolic planning, plan validation re- and Kuchar, 2015]. Although they use NNs to solve search quires manual checking, which does not scale and is prone to problems, they assume a fully symbolic representation of errors. An important direction for future work is the devel- the problem as input. Other line of hybrid systems em- opment of crowd sourcing [Yuen et al., 2011], or automated bed NNs inside a search algorithm to provide search con- approaches similar to inception score [Salimans et al., 2016]. trol knowledge [Silver et al., 2016; Arfaee et al., 2011; Finally, we do not claim that the specific implementation Satzger and Kramer, 2013]. In contrast, we use a NN-based of SAE in this paper works robustly on all images. Making SAE for symbol grounding, not for search control. a robust autoencoder is not a problem unique to LatPlan, but Deep (DRL) has solved complex rather, a fundamental problem in deep learning. Our contribu- image-based problems [Mnih et al., 2015]. For unit-action- tion is the demonstration that it is possible to leverage some cost planning, LatPlan does not require a reinforcement sig- existing deep learning techniques quite effectively in an plan- nal (reward function). Also, it can provide guarantees of com- ning system, and future work will continue leveraging further pleteness and solution cost optimality. improvements in image processing techniques.

5 Discussion and Conclusion References We proposed LatPlan, an integrated architecture for planning [Arfaee et al., 2011] Shahab Jabbari Arfaee, Sandra Zilles, and which, given only a set of unlabeled images and no prior Robert C. Holte. Learning Heuristic Functions for Large State knowledge, generates a classical planning problem model, Spaces. Artificial Intelligence, 175(16-17):2075–2098, 2011. solves it with a symbolic planner, and presents the result- [Argall et al., 2009] Brenna Argall, Sonia Chernova, ing plan as a human-comprehensible sequence of images. Manuela M. Veloso, and Brett Browning. A Survey We demonstrated its feasibility using image-based versions of Robot Learning from Demonstration. Robotics and of planning/state-space-search problems (8-puzzle, Towers of Autonomous Systems, 57(5):469–483, 2009. Hanoi, Lights Out). The key technical contribution is the SAE, [Backstr¨ om¨ and Nebel, 1995] Christer Backstr¨ om¨ and Bernhard which leverages the Gumbel-Softmax reparametrization tech- Nebel. Complexity Results for SAS+ Planning. Computa- [ ] nique Jang et al., 2017 and learns (unsupervised) a bidi- tional Intelligence, 11(4):625–655, 1995. rectional mapping between raw images and a propositional representation usable by symbolic planners. Aside from the [Barbu et al., 2010] Andrei Barbu, Siddharth Narayanaswamy, key assumptions about the deterministic environment and the and Jeffrey Mark Siskind. Learning Physically-Instantiated sufficient training images, we avoid assumptions about the Game Play through Visual Observation. In ICRA, pages input domain. Thus, we have shown that domains with differ- 1879–1886, 2010. ent characteristics can all be solved by the same system. In [Bieszczad and Kuchar, 2015] Andrzej Bieszczad and Skyler other words, LatPlan is a domain-independent, image-based Kuchar. Neurosolver Learning to Solve Towers of Hanoi Puz- classical planner. zles. In IJCCI, volume 3, pages 28–38, 2015. [Celorrio et al., 2012] Sergio Jimenez´ Celorrio, Tomas´ de la [Maddison et al., 2017] Chris J. Maddison, Andriy Mnih, and Rosa, Susana Fernandez,´ Fernando Fernandez,´ and Daniel Yee Whye Teh. The Concrete Distribution: A Continuous Re- Borrajo. A Review of for Automated Plan- laxation of Discrete Random Variables. In ICLR, 2017. ning. Knowledge Eng. Review, 27(4):433–467, 2012. [McDermott, 2000] Drew V.McDermott. The 1998 AI Planning [Cresswell et al., 2013] Stephen Cresswell, Thomas Leo Mc- Systems Competition. AI Magazine, 21(2):35–55, 2000. Cluskey, and Margaret Mary West. Acquiring planning [Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, domain models using LOCM. Knowledge Eng. Review, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, 28(2):195–213, 2013. Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg [Deng et al., 2009] Jia Deng, Wei Dong, Richard Socher, Li-Jia Ostrovski, et al. Human-Level Control through Deep Rein- Li, Kai Li, and Li Fei-Fei. ImageNet: A Large-Scale Hier- forcement Learning. Nature, 518(7540):529–533, 2015. archical Image Database. In CVPR, pages 248–255. IEEE, [Mourao˜ et al., 2012] Kira Mourao,˜ Luke S. Zettlemoyer, 2009. Ronald P. A. Petrick, and Mark Steedman. Learning STRIPS [Graves et al., 2016] Alex Graves, Greg Wayne, Malcolm Operators from Noisy and Incomplete Observations. In UAI, Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska- pages 614–623, 2012. Barwinska,´ Sergio Gomez´ Colmenarejo, Edward Grefen- [Reinefeld, 1993] Alexander Reinefeld. Complete Solution of stette, Tiago Ramalho, John Agapiou, et al. Hybrid Comput- the Eight-Puzzle and the Benefit of Node Ordering in IDA. In ing using a Neural Network with Dynamic External Memory. IJCAI, pages 248–253, 1993. Nature, 538(7626):471–476, 2016. [Ren et al., 2015] Shaoqing Ren, Kaiming He, Ross Girshick, [Helmert, 2006] Malte Helmert. The Fast Downward Planning and Jian Sun. Faster R-CNN: Towards Real-time Object De- System. J. Artif. Intell. Res.(JAIR), 26:191–246, 2006. tection with Region Proposal Networks. In NIPS, pages 91– [Hinton and Salakhutdinov, 2006] Geoffrey E Hinton and Rus- 99, 2015. lan R Salakhutdinov. Reducing the Dimensionality of Data [Salimans et al., 2016] Tim Salimans, Ian Goodfellow, Woj- with Neural Networks. Science, 313(5786):504–507, 2006. ciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved Techniques for Training Gans. In NIPS, pages [Hopfield and Tank, 1985] John J Hopfield and David W Tank. 2226–2234, 2016. ”Neural” Computation of Decisions in Optimization Prob- lems. Biological cybernetics, 52(3):141–152, 1985. [Satzger and Kramer, 2013] Benjamin Satzger and Oliver Kramer. Goal Distance Estimation for Automated Planning [ ] Jang et al., 2017 Eric Jang, Shixiang Gu, and Ben Poole. Cat- using Neural Networks and Support Vector Machines. egorical Reparameterization with Gumbel-Softmax. In ICLR, Natural Computing, 12(1):87–100, 2013. 2017. [Sievers et al., 2012] Silvan Sievers, Manuela Ortlieb, and [Kaiser, 2012] Lukasz Kaiser. Learning Games from Videos Malte Helmert. Efficient Implementation of Pattern Database Guided by Descriptive Complexity. In AAAI, 2012. Heuristics for Classical Planning. In SOCS, 2012. [Kingma and Welling, 2013] Diederik P Kingma and Max [Silver et al., 2016] David Silver, Aja Huang, Chris J Maddi- Welling. Auto-Encoding Variational Bayes. In ICLR, 2013. son, Arthur Guez, Laurent Sifre, George Van Den Driess- [Kingma et al., 2014] Diederik P Kingma, Shakir Mohamed, che, Julian Schrittwieser, Ioannis Antonoglou, Veda Pan- Danilo Jimenez Rezende, and Max Welling. Semi-supervised neershelvam, Marc Lanctot, et al. Mastering the Game of learning with deep generative models. In NIPS, pages 3581– Go with Deep Neural Networks and Tree Search. Nature, 3589, 2014. 529(7587):484–489, 2016. [Kirk and Laird, 2016] James R Kirk and John E Laird. Learn- [Steels, 2008] Luc Steels. The Symbol Grounding Problem has ing General and Efficient Representations of Novel Games been solved. So what’s next? In Manuel de Vega, Arthur Through Interactive Instruction. Advances in Cognitive Sys- Glenberg, and Arthur Graesser, editors, Symbols and Embod- tems, 4, 2016. iment. Oxford University Press, 2008. [ ] [Konidaris et al., 2014] George Konidaris, Leslie Pack Kael- Vincent et al., 2008 Pascal Vincent, Hugo Larochelle, Yoshua bling, and Tomas´ Lozano-Perez.´ Constructing Symbolic Rep- Bengio, and Pierre-Antoine Manzagol. Extracting and Com- resentations for High-Level Planning. In AAAI, pages 1932– posing Robust Features with Denoising . In 1938, 2014. ICML, pages 1096–1103. ACM, 2008. [Yang et al., 2007] Qiang Yang, Kangheng Wu, and Yunfei [Korf and Felner, 2002] Richard E. Korf and Ariel Felner. Dis- Jiang. Learning Action Models from Plan Examples using joint Pattern Database Heuristics. Artificial Intelligence, Weighted MAX-SAT. Artificial Intelligence, 171(2-3):107– 134(1-2):9–22, 2002. 143, 2007. [ ] Lake et al., 2013 Brenden M Lake, Ruslan R Salakhutdinov, [Yuen et al., 2011] Man-Ching Yuen, Irwin King, and Kwong- and Josh Tenenbaum. One-shot Learning by Inverting a Com- Sak Leung. A Survey of Crowdsourcing Systems. In Pri- positional Causal Process. In NIPS, pages 2526–2534, 2013. vacy, Security, Risk and Trust (PASSAT) and 2011 IEEE Third [LeCun et al., 1998] Yann LeCun, Leon´ Bottou, Yoshua Ben- Inernational Conference on Social Computing (SocialCom), gio, and Patrick Haffner. Gradient-Based Learning Applied 2011 IEEE Third International Conference on, pages 766– to Document Recognition. Proc. of the IEEE, 86(11):2278– 773. IEEE, 2011. 2324, 1998.