Classical Planning with Simulators: Results on the Atari Video Games
Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015)

Nir Lipovetzky, The University of Melbourne, Melbourne, Australia ([email protected])
Miquel Ramirez, Australian National University, Canberra, Australia ([email protected])
Hector Geffner, ICREA & U. Pompeu Fabra, Barcelona, Spain ([email protected])

Abstract

The Atari 2600 games supported in the Arcade Learning Environment [Bellemare et al., 2013] all feature a known initial (RAM) state and actions that have deterministic effects. Classical planners, however, cannot be used off-the-shelf as there is no compact PDDL model of the games, and action effects and goals are not known a priori. Indeed, there are no explicit goals, and the planner must select actions on-line while interacting with a simulator that returns successor states and rewards. None of this precludes the use of blind lookahead algorithms for action selection like breadth-first search or Dijkstra's, yet such methods are not effective over large state spaces. We thus turn to a different class of planning methods introduced recently that have been shown to be effective for solving large planning problems but which do not require prior knowledge of state transitions, costs (rewards) or goals. The empirical results over 54 Atari games show that the simplest such algorithm performs at the level of UCT, the state-of-the-art planning method in this domain, and suggest the potential of width-based methods for planning with simulators when factored, compact action models are not available.

Introduction

The Arcade Learning Environment (ALE) provides a challenging platform for evaluating general, domain-independent AI planners and learners through a convenient interface to hundreds of Atari 2600 games [Bellemare et al., 2013]. Results have been reported so far for basic planning algorithms like breadth-first search and UCT, reinforcement learning algorithms, and evolutionary methods [Bellemare et al., 2013; Mnih et al., 2013; Hausknecht et al., 2014]. The empirical results are impressive in some cases, yet a lot remains to be done, as no method approaches the performance of human players across a broad range of games.

While all these games feature a known initial (RAM) state and actions that have deterministic effects, the problem of selecting the next action to be done cannot be addressed with off-the-shelf classical planners [Ghallab et al., 2004; Geffner and Bonet, 2013]. This is because there is no compact PDDL-like encoding of the domain, and the goal to be achieved in each game is not given, precluding the automatic derivation of heuristic functions and other inferences. Indeed, there are no goals but rewards, and the planner must select actions on-line while interacting with a simulator that just returns successor states and rewards.

The action selection problem in the Atari games can be addressed as a reinforcement learning problem [Sutton and Barto, 1998] over a deterministic MDP where the state transitions and rewards are not known, or alternatively, as a net-benefit planning problem [Coles et al., 2012; Keyder and Geffner, 2009] with unknown state transitions and rewards. ALE supports the two settings: an on-line planning setting where actions are selected after a lookahead, and a learning setting that must produce controllers for mapping states into actions reactively without any lookahead. In this work, we are interested in the on-line planning setting.

The presence of unknown transitions and rewards in the Atari games does not preclude the use of blind-search methods like breadth-first search, Dijkstra's algorithm [Dijkstra, 1959], or learning methods such as LRTA* [Korf, 1990], UCT [Kocsis and Szepesvári, 2006], and Q-learning [Sutton and Barto, 1998; Bertsekas and Tsitsiklis, 1996]. Indeed, the net-benefit planning problem with unknown state transitions and rewards over a given planning horizon can be mapped into a standard shortest-path problem which can be solved optimally by Dijkstra's algorithm. For this, we just need to map the unknown rewards r(a, s) into positive (unknown) action costs c(a, s) = C − r(a, s), where C is a large constant that exceeds the maximum possible reward. The fact that the state transition and cost functions f(a, s) and c(a, s) are not known a priori does not affect the applicability of Dijkstra's algorithm, which requires the value of these functions precisely when the action a is applied in the state s.
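This reduction can be sketched in a few lines. The code below is a minimal illustration only: it assumes a hypothetical simulator object with a successor(state, action) method returning the next state and its reward, hashable states (e.g. a bytes object), a fixed action set, and a fixed lookahead horizon. It is not the interface or implementation used in the paper; it just shows how the cost mapping c(a, s) = C − r(a, s) turns finite-horizon reward maximization into a shortest-path problem that Dijkstra's algorithm can solve by querying the simulator on demand.

```python
import heapq

def dijkstra_lookahead(simulator, root, actions, horizon, C):
    """Finite-horizon lookahead solved as a shortest-path problem (Dijkstra),
    with unknown rewards mapped to positive costs c(a, s) = C - r(a, s)."""
    counter = 0                                   # tie-breaker: states are never compared
    queue = [(0.0, counter, root, 0, None)]       # (cost, tie, state, depth, first action)
    best_cost = {}                                # (state, depth) -> cheapest cost found
    while queue:
        cost, _, state, depth, first = heapq.heappop(queue)
        if depth == horizon:                      # all costs positive, so the first
            return first                          # horizon node popped is optimal
        for a in actions:
            nxt, reward = simulator.successor(state, a)   # assumed simulator interface
            new_cost = cost + (C - reward)                # c(a, s) = C - r(a, s) > 0
            key = (nxt, depth + 1)
            if new_cost < best_cost.get(key, float("inf")):
                best_cost[key] = new_cost
                counter += 1
                heapq.heappush(queue, (new_cost, counter, nxt, depth + 1,
                                       a if first is None else first))
    return None                                   # no state at the horizon was reached
```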
The limitation of the basic blind search methods is that they are not effective over large state spaces, neither for solving problems off-line, nor for guiding a lookahead search for solving problems on-line. In this work, we thus turn to a recent class of planning algorithms that combine the scope of blind search methods with the performance of state-of-the-art classical planners: namely, like "blind" search algorithms they do not require prior knowledge of state transitions, costs, or goals, and yet like heuristic algorithms they manage to search large state spaces effectively. The basic algorithm in this class is called IW for Iterated Width search [Lipovetzky and Geffner, 2012]. IW consists of a sequence of calls IW(1), IW(2), ..., IW(k), where IW(i) is a standard breadth-first search where states are pruned right away when they fail to make true some new tuple (set) of at most i atoms. Namely, IW(1) is a breadth-first search that keeps a state only if the state is the first one in the search to make some atom true; IW(2) keeps a state only if the state is the first one to make a pair of atoms true, and so on. Like plain breadth-first and iterative deepening searches, IW is complete, while searching the state space in a way that makes use of the structure of states given by the values of a finite set of state variables. In the Atari games, the (RAM) state is given by a vector of 128 bytes, which we associate with 128 variables Xi, i = 1, ..., 128, each of which may take up to 256 values xj. A state s makes an atom Xi = xj true when the value of the i-th byte in the state vector s is xj. The empirical results over 54 Atari games show that IW(1) performs at the level of UCT, the state-of-the-art planning method in this domain, and suggest the potential of width-based methods for planning with simulators when factored, compact action models are not available.

The paper is organized as follows. We review the iterated width algorithm and its properties, look at the variations of the algorithm that we used in the Atari games, and present the experimental results.

Iterated Width

The Iterated Width (IW) algorithm has been introduced as a classical planning algorithm that takes a planning problem as an input, and computes an action sequence that solves the problem as the output [Lipovetzky and Geffner, 2012]. The algorithm however applies to a broader range of problems. We will characterize such problems by means of a finite and discrete set of states (the state space) that correspond to vectors of size n. Namely, the states are structured or factored, and we take each of the locations in the vector to represent a variable Xi, and the value at that vector location to represent the value of that variable.

In each call IW(i), right after a state s is generated, the state is pruned if it doesn't pass a simple novelty test. More precisely,

• The novelty of a newly generated state s in a search algorithm is 1 if s is the first state generated in the search that makes true some atom X = x, else it is 2 if s is the first state that makes a pair of atoms X = x and Y = y true, and so on.

• IW(i) is a breadth-first search that prunes newly generated states when their novelty is greater than i.

• IW calls IW(i) sequentially for i = 1, 2, ... until a termination condition is reached, returning then the best path found.

For classical planning, the termination condition is the achievement of the goal. In the on-line setting, as in the Atari games, the termination condition is given by a time window or a maximum number of generated nodes. The best path found by IW is then the path that has a maximum accumulated reward. The accumulated reward R(s) of a state s reached in an iteration of IW is determined by the unique parent state s' and action a leading to s from s' as R(s) = R(s') + r(a, s'). The best state is the state s with maximum reward R(s) generated but not pruned by IW, and the best path is the one that leads to the state s from the current state. The action selected in the on-line setting is the first action along such a path. This action is then executed and the process repeats from the resulting state.
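As a minimal sketch of how this on-line lookahead could be realized, the code below implements an IW(1) search over byte-vector states, using a node budget as the termination condition and the accumulated reward R(s) to select the first action of the best path. The simulator.successor(state, action) interface and the actions list are assumptions carried over from the earlier sketch, not the authors' implementation.

```python
from collections import deque

def iw1_lookahead(simulator, root, actions, max_nodes):
    """IW(1) sketch: breadth-first search that prunes a newly generated state
    unless it makes some atom Xi = xj true for the first time in this search.
    States are assumed to be byte vectors (e.g. the 128-byte Atari RAM)."""
    seen_atoms = set()                            # atoms (index, value) seen so far

    def novel(state):
        # Novelty 1: the state makes at least one atom true for the first time.
        new_atoms = [(i, v) for i, v in enumerate(state) if (i, v) not in seen_atoms]
        seen_atoms.update(new_atoms)
        return bool(new_atoms)

    novel(root)                                   # register the root's atoms
    queue = deque([(root, 0.0, None)])            # (state, R(s), first action on path)
    best_reward, best_first = float("-inf"), None
    generated = 0
    while queue and generated < max_nodes:        # on-line termination: node budget
        state, acc, first = queue.popleft()
        for a in actions:
            nxt, r = simulator.successor(state, a)   # assumed simulator interface
            generated += 1
            if not novel(nxt):                       # novelty greater than 1: prune
                continue
            R = acc + r                              # R(s) = R(s') + r(a, s')
            f = a if first is None else first
            if R > best_reward:
                best_reward, best_first = R, f
            queue.append((nxt, R, f))
    return best_first                             # first action along the best path
```

The returned action would then be executed and the lookahead repeated from the resulting state, as described above.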
Performance and Width

IW is a systematic and complete blind-search algorithm like breadth-first search (BRFS) and iterative deepening (ID), but unlike these algorithms, it uses the factored representation of the states in terms of variables to structure the search. This structured exploration has proved to be very effective over classical planning benchmark domains when goals are single atoms.¹ For example, 37% of the 37,921 problems considered in [Lipovetzky and Geffner, 2012] are solved by IW(1), while 51.3% are solved by IW(2).
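To make the IW(1)/IW(2) distinction concrete, the novelty test from the previous sketch can be generalized to tuples of at most i atoms. The bookkeeping below is one plausible way to write it and is only illustrative:

```python
from itertools import combinations

def novelty_at_most(state, i, seen_tuples):
    """Return True iff `state` makes some tuple of at most i atoms true for the
    first time in the search, registering every such tuple as a side effect.
    An atom is a pair (variable index, value), e.g. (byte position, byte value)."""
    atoms = list(enumerate(state))
    is_novel = False
    for size in range(1, i + 1):
        for t in combinations(atoms, size):       # all tuples of `size` atoms
            if t not in seen_tuples:
                seen_tuples.add(t)
                is_novel = True
    return is_novel
```

IW(1) checks just the n atoms a state makes true, while IW(2) already checks on the order of n²/2 pairs per state, so the cost of the novelty test grows quickly with i.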