Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI-16)

Monte Carlo Tree Search in Continuous Action Spaces with Execution Uncertainty

Timothy Yee, Viliam Lisý, Michael Bowling
Department of Computing Science, University of Alberta
Edmonton, AB, Canada T6G 2E8
{tayee, lisy, bowling}@ualberta.ca

Abstract

Real world applications of artificial intelligence often require agents to sequentially choose actions from continuous action spaces with execution uncertainty. When good actions are sparse, domain knowledge is often used to identify a discrete set of promising actions. These actions and their uncertain effects are typically evaluated using a recursive search procedure. The reduction of the problem to a discrete search problem causes severe limitations, notably, not exploiting all of the sampled outcomes when evaluating actions, and not using outcomes to help find new actions outside the original set. We propose a new Monte Carlo tree search (MCTS) algorithm specifically designed for exploiting an execution model in this setting. Using kernel regression, it generalizes the information about action quality between actions and to unexplored parts of the action space. In a high fidelity simulator of the Olympic sport of curling, we show that this approach significantly outperforms existing MCTS methods.

1 Introduction

Many real world problems involve selecting sequences of actions from a continuous space of actions. Examples include choosing target motor velocities in robot navigation; choosing the angle, offset, and speed to hit a billiard ball; or choosing the angle, velocity, and rotation to throw a curling stone. Execution of these actions is often fundamentally uncertain due to limited human or robot skill and the stochastic nature of physical reality. In this paper, we focus on algorithms for choosing good actions in continuous action, continuous state, stochastic planning problems when a model of the execution uncertainty is known.

One often-used approach [Smith, 2007; Archibald et al., 2009; Yamamoto et al., 2015] for such challenging planning problems is to address the continuous action space by using domain knowledge to identify a small, discrete set of candidate actions. Then, the continuous space of stochastic outcomes is sampled for each action. Finally, for each sampled outcome, a heuristic function (possibly preceded by a very shallow search) is used to evaluate the outcomes and thus the original candidates (a minimal sketch of this pipeline appears at the end of this section). This approach reveals a tension between exploring a larger set of candidate actions to increase the probability that a good action is considered, and more accurately evaluating promising candidates through deeper search or more execution outcomes, increasing the probability the best candidate is selected. Monte Carlo tree search (MCTS) methods, such as UCT [Kocsis and Szepesvári, 2006], are well suited for balancing this sort of tradeoff; however, many of the successful variants and enhancements are designed for finite, discrete action spaces. A number of recent advances have sought to address this shortcoming. The classical approach of progressive widening (or unpruning) [Coulom, 2007; Chaslot et al., 2008] can handle continuous action spaces by considering a slowly growing discrete set of sampled actions. cRAVE [Couëtoux et al., 2011] combines this with a modification of the RAVE heuristic [Gelly and Silver, 2011] to generalize from similar (but not exactly the same) actions. HOOT [Mansley et al., 2011] replaces the UCB algorithm in UCT with HOO [Bubeck et al., 2011], an algorithm with theoretical guarantees in continuous action spaces. However, none of these methods make use of one critical insight: samples of execution uncertainty from a particular action provide information about any action that could have generated that execution outcome.

We use this insight to propose a novel variant of Monte Carlo tree search, KR-UCT (Kernel Regression UCT), designed specifically for reasoning about continuous actions with execution uncertainty. Instead of evaluating only a discrete set of candidate actions, the algorithm considers the entire continuous space of actions, with candidates acting only as initialization. The core of our approach is the use of kernel regression to generalize action value estimates over the entire parameter space, with the execution uncertainty model as its generalization kernel. KR-UCT distinguishes itself in a number of key ways. First, it allows information sharing between all actions under consideration. Second, it can identify actions outside of the initial candidates for further exploration by combining kernel regression and kernel density estimation to optimize an exploration-exploitation balance akin to the popular UCB formula [Auer et al., 2002]. Third, it can ultimately select actions outside of the candidate set, allowing it to improve on less-than-perfect domain knowledge.

We evaluate KR-UCT in a high fidelity simulation of the Olympic sport of curling. Curling is an example of a challenging action selection problem with continuous actions, continuous stochastic outcomes, sequential decisions, execution uncertainty, and the added challenge of an adversary. We show that the proposed algorithm significantly outperforms existing MCTS techniques. The improvement is apparent not only at short horizons, which allow exploring a large number of different shots, but also at long horizons when evaluating only tens of samples of execution outcomes. Furthermore, we show that existing MCTS improvements, such as RAVE and progressive widening, do not improve standard UCT as significantly as KR-UCT does in this domain.
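To make the often-used approach described above concrete, here is a minimal sketch of that candidate-evaluation pipeline in Python, assuming three domain-specific components are supplied by the caller; the names generate_candidates, execution_model, and heuristic_value are placeholders of ours, not functions from the cited systems.

def evaluate_candidates(state, generate_candidates, execution_model,
                        heuristic_value, samples_per_action=20):
    """Score each candidate action by averaging a heuristic over sampled execution outcomes."""
    scores = {}
    for action in generate_candidates(state):
        total = 0.0
        for _ in range(samples_per_action):
            # Sample one noisy execution of the intended action.
            outcome_state = execution_model(state, action)
            # Evaluate the outcome (possibly preceded by a very shallow search).
            total += heuristic_value(outcome_state)
        scores[action] = total / samples_per_action
    best = max(scores, key=scores.get)
    return best, scores

The tension noted above is visible in the two loop bounds: a larger candidate set and more samples per action both draw on the same simulation budget.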

2 Background

We begin by describing the core algorithms that KR-UCT builds upon, along with the main building blocks of the competitors used in our evaluation.

2.1 Monte Carlo Tree Search

Monte Carlo Tree Search (MCTS) is a simulation-based search approach to planning in finite-horizon sequential decision-making settings. The core of the approach is to iteratively simulate executions from the current state to a terminal state, incrementally growing a tree of simulated states (nodes) and actions (edges). Each simulation starts by visiting nodes in the tree, selecting which actions to take based on a selection function and information maintained in the node. Consequently, it transitions to a successor state. When a node is visited whose immediate children are not all in the tree, the node is expanded by adding a new leaf to the tree. Then, a rollout policy (e.g., random action selection) is applied from the new leaf to a terminal state. The value of the terminal state is then returned as the value for that new leaf and the information stored in the tree is updated. In the simplest case, a tree with height 1, MCTS starts with an empty tree and adds a single leaf each iteration.

The most common selection function for MCTS is Upper Confidence Bounds Applied to Trees (UCT) [Kocsis and Szepesvári, 2006]. Each node maintains the mean of the rewards received for each action, \bar{v}_a, and the number of times each action has been used, n_a. It first uses each of the actions once and then decides what action to use based on the size of the one-sided confidence interval on the reward, computed from the Chernoff-Hoeffding bound as:

    \arg\max_a \; \bar{v}_a + C \sqrt{\frac{\log \sum_b n_b}{n_a}}    (1)

The constant C controls the exploration-exploitation tradeoff and is typically tuned for the specific domain.

The success of MCTS across a wide range of domains has inspired many modifications and improvements. One of the most notable is Rapid Action Value Estimation (RAVE) [Gelly and Silver, 2011]. It allows learning about multiple actions from a single simulation, based on the intuition that in many domains, such as Go, an action that is good when taken later in the sequence is likely to be good right now as well. RAVE maintains additional statistics about the quality of actions regardless of where they have been used in a subtree. These statistics are then added to action values in the selection function (e.g., UCT) as an additional term with relative weight decreasing with more simulations.

2.2 Progressive Widening

Most selection functions in MCTS, including UCT, require trying every action once, so they are not directly applicable in continuous action spaces. Even if the action space is finite but very large, having too many options can result in a very shallow lookahead. The same solution to this problem was independently introduced in [Coulom, 2007] as progressive widening and in [Chaslot et al., 2008] as progressive unpruning. It artificially limits the number of actions evaluated in a node by MCTS based on the number of visits to the node. Only after the quality of the best available actions is estimated sufficiently well are additional actions taken into consideration. The order in which actions are added can be random or can exploit domain knowledge.

If a domain includes stochastic outcomes, such as those resulting from execution uncertainty, the outcomes are commonly represented by chance nodes in the search tree. If the set of outcomes is finite and small, then the next state can be sampled from the known probability distribution over the outcomes. If the set of possible outcomes is large or even continuous, then one can simply sample a small number of outcomes [Kearns et al., 2002] or slowly grow the number of sampled outcomes as the node is repeatedly visited, in the same way that progressive widening grows the number of actions [Couëtoux et al., 2011].

UCT ensures that the tree grows deeper more quickly in the promising parts of the search tree. The progressive widening strategies add that it also grows wider in those same parts of the search tree.

2.3 Kernel Regression

Kernel regression is a nonparametric method for estimating the conditional expectation of a real-valued random variable from data. In its simplest form [Nadaraya, 1964; Watson, 1964], it estimates the expected value at a point as an average of the values of all points in the data set, weighted by a typically non-linear function of their distance from the point. The function defining the weight given a pair of points is called the kernel, further denoted K. For a data set (x_i, y_i)_{i=0}^{n}, the estimated expected value is:

    E(y \mid x) = \frac{\sum_{i=0}^{n} K(x, x_i)\, y_i}{\sum_{i=0}^{n} K(x, x_i)}    (2)

The kernel is typically a smooth symmetric function, such as the Gaussian probability density function. However, asymmetric kernel functions can also be used in well-motivated cases (e.g., [Michels, 1992]). An important quantity related to kernel regression is the kernel density, which quantifies the amount of relevant data available for a specific point in the domain. For any x, the kernel density is defined to be the denominator in Equation 2:

    W(x) = \sum_{i=0}^{n} K(x, x_i)    (3)
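To make Equations 2 and 3 concrete, the following is a minimal sketch of Nadaraya-Watson kernel regression and the associated kernel density in Python. The Gaussian kernel, its bandwidth, and the function names are illustrative choices for this sketch rather than anything prescribed by the paper; KR-UCT itself will instead use the known execution-uncertainty model as the kernel.

import numpy as np

def gaussian_kernel(x, xi, bandwidth=1.0):
    """K(x, x_i): weight of data point x_i when estimating at x."""
    d = np.atleast_1d(np.asarray(x, dtype=float) - np.asarray(xi, dtype=float))
    return float(np.exp(-0.5 * np.dot(d, d) / bandwidth ** 2))

def kernel_regression(x, data, bandwidth=1.0):
    """Return (E(y|x), W(x)) as in Equations 2 and 3 for data = [(x_i, y_i), ...]."""
    weights = np.array([gaussian_kernel(x, xi, bandwidth) for xi, _ in data])
    values = np.array([yi for _, yi in data])
    density = weights.sum()                                               # W(x), Equation 3
    estimate = float(weights @ values) / density if density > 0 else 0.0  # Equation 2
    return estimate, density

# Example: estimate sin(x) at x = pi/2 from 50 noisy samples.
rng = np.random.default_rng(0)
xs = rng.uniform(0.0, np.pi, size=50)
data = list(zip(xs, np.sin(xs) + 0.1 * rng.standard_normal(50)))
print(kernel_regression(np.pi / 2, data, bandwidth=0.3))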

3 Kernel Regression UCT

The main idea behind KR-UCT is to increase the number of children (i.e., actions) of a node as it is repeatedly visited, while enabling information sharing between similar actions through kernel regression. All evaluated actions, both initially provided candidates and progressively added actions, together with their estimated values, form a single dataset at each node for kernel regression. For the kernel function we use the probability density function of the known model of execution uncertainty, i.e., K(x, x') is the model's probability that action x' is executed when x is the intended action. Equation 2 can then be viewed as a simple Monte Carlo estimate of the integral corresponding to that evaluated action's expected value. This particular kernel regressor is then used to estimate the two key quantities required by UCT. For the average value of the action, \bar{v}_a, we use the kernel regression estimate. For the number of visits, n_a, we use the kernel density estimate of data coverage. Note that if we had a discrete action space and used a Dirac delta as the kernel (so K(x, x') = 1_{x = x'}), then this approach would be identical to vanilla UCT.

The pseudocode for KR-UCT is presented as Algorithm 1. The structure of the algorithm follows the standard four steps of MCTS (selection, expansion, simulation, and backpropagation). The main recursive procedure, KR-UCT, is called on the root of the search tree a fixed number of times determined by a per-decision computation budget. It returns the value of the terminal node reached and a boolean flag indicating whether the expansion step has already occurred. We explore some algorithmic details of each step below.

Algorithm 1 Kernel Regression UCT
 1: procedure KR-UCT(state)
 2:   if state is terminal then
 3:     return utility(state), false
 4:   expanded ← false
 5:   A ← actions considered in state
 6:   action ← argmax_{a∈A}  E(v|a) + C √( log Σ_{b∈A} W(b) / W(a) )
 7:   if √( Σ_{a∈A} n_a ) < |A| then
 8:     newState ← child of state by taking action
 9:     rv, expanded ← KR-UCT(newState)
10:   if not expanded then
11:     newAction ≈ argmin_{a : K(action, a) > τ} W(a)
12:     add newAction to state
13:     newState ← child of state by taking newAction
14:     add initial actions to newState
15:     rv ← ROLLOUT(newState)
16:   update \bar{v}_{action}, n_{action}, and KR with rv
17:   return rv, true

Selection. At each decision node, our algorithm uses an adaptation of the UCB formula to select amongst already explored outcomes to be further refined (line 6). Each node maintains the number of visits, n_b, and the mean utility, \bar{v}_b, for each outcome (child node) b. As outcomes are revisited throughout the iterations, their mean and visit counts are updated. We incorporate such multiple samples of the same outcome into our kernel regression and kernel density estimates by effectively counting each sample as its own data point.

    E(v \mid a) = \frac{\sum_{b \in A} K(a, b)\, \bar{v}_b\, n_b}{\sum_{b \in A} K(a, b)\, n_b}    (4)

    W(a) = \sum_{b \in A} K(a, b)\, n_b    (5)

These estimates are then plugged into their respective roles in the UCB formula, E(v|a) for \bar{v}_a and W(a) for n_a. The UCB formula is then used to select the action to refine further. As usual, the UCB exploration bias is considered infinite if the denominator in the term is 0 (i.e., W(a) = 0). The scaling constant C serves the same role as in vanilla UCT, controlling the tradeoff between exploring less visited actions and refining the value of more promising actions. It should be experimentally tuned for each specific domain.

Expansion. After selecting the action with the highest UCB value, the algorithm continues by either improving the estimated value of the action by recursing on its outcome (lines 8-9), or improving the estimated value of the action by adding a new outcome as a new child node (lines 11-15). This decision is based on keeping the number of outcomes in a node bounded by a sublinear function of the number of visits to the node (line 7), just as in progressive widening. In the event of reaching a terminal node within the tree, a new outcome is always added at the leaf's parent (line 10). When adding a new outcome, we want to choose an outcome b that will be heavily weighted by the kernel (i.e., K(a, b) is large), yet is not well represented by the current set of outcomes (i.e., W(b) is small). We achieve this balance by finding an outcome that minimizes the kernel density estimate within the set of all outcomes whose kernel weight is at least a threshold τ (line 11). For computational efficiency, we approximate this optimization by choosing k outcome samples from the execution uncertainty given the chosen action, and selecting from these the one with minimal kernel density. After a new outcome is identified, its state is added as a child of the current node with some generated set of initial actions, possibly determined using domain-specific knowledge (lines 12-14).

Simulation and Backpropagation. When a new outcome is added to the tree, a complete simulation is executed to reach a terminal state using some rollout policy (line 15). As in vanilla MCTS, the rollout policy can involve random action selection, a fast decision policy learned from data [Silver et al., 2016] or based on hand-coded rules [Gelly et al., 2006], or even a static heuristic evaluation. The value of the resulting terminal state is then used to update \bar{v}_b and n_b of the outcomes along the selected path through the tree. Finally, for the sake of efficiency, the kernel density and value estimates at actions can be updated incrementally during the backpropagation step.

Final Selection. After the computational budget is exhausted, we still need to use the results of the search to select an action to execute. It is common to use the most sampled action or the action with the highest sampled mean at the root.
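To illustrate how Equations 4 and 5 drive the selection and expansion steps of Algorithm 1, here is a sketch in Python. The Node class, the Gaussian stand-in for the execution kernel, and the helper names (value_and_density, select_outcome, propose_new_outcome) are simplifications of ours rather than the authors' implementation; execution_model is assumed to be a callable that samples an executed action (an array-like point) given an intended one.

import math
import numpy as np

class Node:
    """One decision node: the outcomes/actions tracked at the node with their statistics."""
    def __init__(self, outcomes):
        self.outcomes = [np.asarray(o, dtype=float) for o in outcomes]  # the set A
        self.means = [0.0] * len(self.outcomes)                         # mean utilities v_b
        self.visits = [0] * len(self.outcomes)                          # visit counts n_b

def kernel(a, b, sigma=1.0):
    """Stand-in for K(a, b); the paper uses the execution-uncertainty density here."""
    d = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return math.exp(-0.5 * float(np.dot(d, d)) / sigma ** 2)

def value_and_density(node, a):
    """E(v|a) (Equation 4) and W(a) (Equation 5) from the node's outcome statistics."""
    num = den = 0.0
    for b, vbar, n in zip(node.outcomes, node.means, node.visits):
        w = kernel(a, b) * n
        num += w * vbar
        den += w
    return (num / den if den > 0 else 0.0), den

def select_outcome(node, c=1.0):
    """UCB over tracked outcomes using kernel-regressed values and densities (line 6)."""
    total_density = sum(value_and_density(node, b)[1] for b in node.outcomes)
    log_total = max(math.log(total_density + 1e-12), 0.0)
    def ucb(b):
        v, w = value_and_density(node, b)
        return float('inf') if w == 0 else v + c * math.sqrt(log_total / w)
    return max(node.outcomes, key=ucb)

def propose_new_outcome(node, action, execution_model, tau=0.02, k=10):
    """Approximate argmin of W over points with K(action, .) > tau (line 11):
    sample k executions of the chosen action and keep the least-covered one."""
    candidates = [np.asarray(execution_model(action), dtype=float) for _ in range(k)]
    candidates = [b for b in candidates if kernel(action, b) > tau]
    if not candidates:          # fallback of ours: no sample passed the threshold
        return action
    return min(candidates, key=lambda b: value_and_density(node, b)[1])

A full implementation would cache W(a) and the kernel-regressed values and update them incrementally during backpropagation, as noted under Computational Complexity below; this sketch recomputes them for clarity.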

Instead, we select the action at the root with the greatest lower confidence bound (LCB) using our kernel regression and kernel density estimates:

    \arg\max_{a \in A} \; E(v \mid a) - C \sqrt{\frac{\log \sum_{b \in A} W_b}{W_a}}    (6)

The constant C acts as a tradeoff between choosing actions that have been sampled the most (i.e., are the most certain) and actions that have high estimated value. This constant does not have to be the same as the constant used in the upper confidence bound calculation.
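A matching sketch of the final selection rule in Equation 6, reusing the Node and value_and_density helpers from the sketch above; the default constant mirrors the C_LCB = 0.001 reported later in the experiments, and the rule is the lower confidence bound counterpart of the selection formula.

import math

def final_selection(node, c_lcb=0.001):
    """Pick the root action with the greatest lower confidence bound (Equation 6)."""
    total_density = sum(value_and_density(node, b)[1] for b in node.outcomes)
    log_total = max(math.log(total_density + 1e-12), 0.0)
    def lcb(a):
        v, w = value_and_density(node, a)
        return float('-inf') if w == 0 else v - c_lcb * math.sqrt(log_total / w)
    return max(node.outcomes, key=lcb)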

Computational Complexity. Maintaining the value estimates E(v|a) and kernel densities W(a) for the set of actions A tracked by a node can be done incrementally in time linear in |A| in each iteration. If we also store the sum \sum_{b \in A} W_b in the nodes, selection can be performed in a single pass over the set A. The most expensive operation is expansion, namely finding the minimal kernel density close to an action. With a naïve implementation, evaluating each point in the parameter space takes O(|A|) operations; hence if the optimization evaluates k points, it takes O(k|A|) operations. In practice this complexity can be substantially reduced by storing the actions in a spatial index, such as R-trees [Guttman, 1984], but if all |A| actions are close to the selected shot, the worst case complexity stays the same.

4 Experimental Evaluation

We validate our algorithm using the Olympic sport of curling.

4.1 Curling

Curling is a two team game, where the goal is to slide (throw) stones, also known as rocks, down a sheet of ice towards the scoring area or house, a circular area centered on the button. Games are divided into rounds called ends, where teams alternate in throwing a total of eight stones each. When each end finishes, the team with the stone closest to the button scores points equal to the number of their stones in the house closer than any opponent stone. If the state in Figure 1 is the end of an end, yellow would score one point. If the yellow rock closest to the button were missing, then red would instead score two points. After an end, all stones are removed, and the next end begins. The winner is the team with the highest score after all ends are played.

[Figure 1: Sample curling game state. The yellow player scores 1 point if this is the end of an end. The dashed line indicates a typical rock trajectory.]

A key aspect of curling is that rocks do not follow a straight trajectory. Instead, due to friction with the ice, stones curl left or right depending on the stone's rotation, as demonstrated by the dashed line in Figure 1. This fact allows teams to place stones behind other stones and make them harder to remove from the house.

An important strategic consideration in the game is execution uncertainty. A team may intend to deliver a stone at a particular angle and velocity, but inevitably human precision has its limitations. In addition, the ice and stones create additional unpredictability, as different parts of the ice or even different stones might see different amounts of friction and lateral forces. Debris on the ice can also cause drastic trajectory changes. When deciding what shot to attempt, the players have to carefully consider possible negative consequences of execution errors.

4.2 Curling Simulator

The physics of a curling stone is not fully understood and is a surprisingly active area of research [Denny, 2002; Jensen and Shegelski, 2004; Nyberg et al., 2012; 2013]; so, a simulation based on first principles is not possible. The curling simulator used in this paper is implemented using the Chipmunk 2D rigid body physics library with a heuristic lateral force that recreates empirically observed stone trajectories and modified collision resolution to match empirically observed elasticity and energy conservation when rocks collide. A rock's trajectory is simulated from its initial linear velocity and angular velocity (which, due to the trajectory's insensitivity to the speed of rotation, is determined only by its sign).

The curling simulation also needs a model of the execution uncertainty. We model it based on Olympic-level players. Their executed shot is usually quite close to the intended shot, with decreasing probability of being increasingly different. However, the stone can move over debris, drastically altering the shot's trajectory from that which was intended. Therefore, we use the heavy-tailed Student-t distribution as the basis of our execution model. The amount of execution noise added to a shot depends on the weight (speed) of the shot. Faster shots tend to be more precise, since the ice conditions have a smaller effect on their trajectory. The noise added to a specific shot is defined by three parameters: the variance of the Student-t distribution that is added to the intended weight, the variance of the Student-t distribution that is added to the aim, and the number of degrees of freedom of the distributions. We have fit these parameters to match the curling percentage statistics of men from the 2010 and 2014 Olympic games. The number of degrees of freedom for the Student-t distribution was fit to 5, the weight variance is approximately 9.5 mm/s, and the aim variance varies with the weight and is between 1.16×10⁻³ and 3.65×10⁻³ radians.

The simulator does not explicitly model sweeping, where the non-throwing members of a team can vigorously brush the ice in front of a travelling rock to reduce friction and keep the rock's trajectory straighter. While sweeping is critical to the game of curling, we model its effect as a reduction in the execution uncertainty, i.e., an increase in the likelihood that the actual rock's trajectory is close to the planned trajectory.¹

¹Sweeping does have other effects beyond reducing execution error. Sweeping focused on the end of a rock's trajectory can allow it to reach a location on the ice not possible without sweeping. Furthermore, a shot far from the intended shot can be swept to achieve an entirely different purpose, such as rolling under a different guard if the executed angle is off. The primary effect of sweeping, though, is to compensate for execution error.
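The execution model described above can be sketched as follows. The degrees of freedom (5), the weight noise of roughly 9.5 mm/s, and the aim noise range of 1.16e-3 to 3.65e-3 radians come from the paper; the shot representation, the nominal weight range, and the linear interpolation of the aim noise with weight are assumptions of this sketch.

import numpy as np

def noisy_execution(aim, weight, ccw, rng=np.random.default_rng()):
    """Sample an executed curling shot given the intended one.

    aim    : intended release angle in radians
    weight : intended speed in mm/s
    ccw    : True for counter-clockwise rotation (only the sign of the spin matters)
    """
    df = 5                      # degrees of freedom of the Student-t noise (from the paper)
    weight_scale = 9.5          # weight noise in mm/s (from the paper)
    # Aim noise between 1.16e-3 and 3.65e-3 radians, smaller for faster shots;
    # the nominal weight range and the linear interpolation below are assumptions.
    lo, hi = 1.16e-3, 3.65e-3
    frac = float(np.clip((weight - 1000.0) / 2000.0, 0.0, 1.0))
    aim_scale = hi - frac * (hi - lo)
    executed_weight = weight + weight_scale * rng.standard_t(df)
    executed_aim = aim + aim_scale * rng.standard_t(df)
    return executed_aim, executed_weight, ccw

# Example: three noisy executions of an intended draw.
for _ in range(3):
    print(noisy_execution(aim=-0.03, weight=2000.0, ccw=True))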

              KR-UCT    PW        RAVE+PW   RAVE      UCT
  KR-UCT      —         0.086     0.084     0.105     0.110
  PW          -0.086    —         0.010     0.014     0.037
  RAVE+PW     -0.084    -0.010    —         0.015     0.033
  RAVE        -0.105    -0.014    -0.015    —         0.006
  UCT         -0.110    -0.037    -0.033    -0.006    —

Table 1: The average point differences by the algorithms in the rows against the one in the column over one-end games. Bold values denote statistical significance (p=0.95; upper- tailed t-test).

[Figure 2: The game state of the final shot of the fifth end in the 2014 Olympics men's gold medal match. The scores of different shot parameters with an out-turn (counter-clockwise spin), the shots generated by the domain knowledge, and the shots evaluated by the analyzed algorithms after 300 samples.]

4.3 Domain Knowledge

All evaluated algorithms use the same domain knowledge: a curling-specific shot generator that creates a list of potential shots to take from a particular state. The generator works from a list of hand-defined types of shots, such as drawing to the button, removing an opponent's rock, or guarding its own rock, and generates the needed weight, aim, and rotation to make the shot, if it is possible. It also stores the purpose of the shot, such as "takeout yellow 3", which can be used with RAVE to learn quickly about shots in a particular state. The generator typically produces between 7 and 30 different candidate shots. Figure 2 shows a game state in which red throws the last shot of the end. The grayscale heatmap shows the resulting score of executing a shot as the aim (x-axis) and weight (y-axis) vary when using a counter-clockwise rotation. Note that execution uncertainty makes achieving the exact values shown in the heatmap impossible in expectation. Overlaid on this heatmap are blue points showing the candidate shots from the generator. The shot generator does not always produce the best shot in expectation, as seen by the fact that there is no generated shot in the center of the large light area with weight 2000 mm/sec and aim -0.035 radians.

The other domain knowledge in use is a hand-coded rollout policy. The policy involves a hand-coded set of rules to select an action from any state. Even though the rules are deterministic, the shots are executed with execution uncertainty and therefore the rollout policy outcomes vary. Because the physical simulator is computationally expensive, we did not roll out full ends. Instead, rollouts were restricted to at most five simulated shots, after which the end was scored as if it were complete. Many shots in the middle of an end can result in a back-and-forth of placing and removing one's own/opponent stones, which this approach dispenses with.

4.4 Algorithms

We compare our proposed KR-UCT algorithm with four other variants of MCTS. (1) UCT: available actions are limited to the generated shots, but the number of random outcomes sampled for each action is kept below the square root of the number of its visits. (2) RAVE: like UCT but with RAVE, using the generator's action descriptions to identify actions. (3) PW: like UCT but with progressive widening on the generated actions, adding random actions when the number of actions is below the square root of the number of visits. (4) RAVE+PW: a combination of RAVE and PW; the added action is the one with the highest value (UCT + RAVE).

All algorithms used 1600 samples and evaluated the final shot selection with a lower confidence bound estimate (C_LCB = 0.001). For each algorithm, we ran a round robin tournament to identify a good UCB constant from the set {0.01, 0.1, 1.0, 10, 100}. For all algorithms, C_UCB = 1.0 was the best constant. For the weighting in RAVE, we ran a similar round robin tournament to select the parameter from the set {0.01, 0.1, 1.0, 10.0, 100.0}, and found 1.0 to be the best for RAVE and RAVE+PW. For KR-UCT, we set τ = 0.02 and k = 10. We found that these values were a good tradeoff between runtime and exploration.

4.5 One-End Games

We first present results of head-to-head performance in a single end of curling. Each algorithm played against each other in 16000 one-end games. Because having the last shot (the hammer) is advantageous, algorithms played an equal number of games with and without the hammer against each opponent. The algorithms aimed to optimize the score differential in the game (as opposed to the probability of being ahead), and so our results are presented as average (differences in) points.

Table 1 shows the results for every pair of algorithms. KR-UCT had a positive expected point differential against every other algorithm by statistically significant margins. Not surprisingly, UCT appears to be its weakest opponent. Additionally, we can see that RAVE does not significantly outperform UCT, which can be attributed to the limited number of samples. With 1600 samples, the algorithm rarely searches past a depth of 3, so the RAVE statistics are underutilized.

However, the algorithms with progressive widening do result in an improvement, although the gain achieved by KR-UCT is much higher and with higher statistical confidence. KR-UCT's gain of 0.110 points per end, in expectation, against UCT is a considerable result. If we extended the result to a standard 10-end game, KR-UCT might expect to see a full point improvement per game. In the 2014 Olympics, 29% of curling games were decided by a single point.

4.6 Hammer Shot Analysis

The hammer shot is an interesting case to analyze [Ahmad et al., 2016]. Being the last shot of the end, it is by far the most important. Furthermore, the search does not have to recursively evaluate long term consequences of actions, so algorithms can evaluate substantially more actions within the same computational budget.

For each game in the round robin tournament, we took the state right before the hammer shot and let the other algorithms choose a shot from that state. We then found the expected value of that shot by simulating it under our execution model 100,000 times. Table 2 shows the expected points earned, averaging over the reached hammer states. Each row represents the set of hammer states reached by a particular algorithm, and the columns represent the algorithms evaluated from these states. The subtables in Table 3 show the same analysis, except the set of states considered is limited to states from the matches between the two algorithms in the subtable. These tables allow us to analyze two effects: an algorithm's ability to choose a good hammer shot and an algorithm's ability to reach a good state for the hammer shot.

              KR-UCT    PW        RAVE+PW   RAVE      UCT
  KR-UCT      1.424     1.302     1.301     1.303     1.302
  PW          1.452     1.297     1.297     1.297     1.297
  RAVE+PW     1.443     1.284     1.284     1.284     1.284
  RAVE        1.446     1.274     1.274     1.274     1.274
  UCT         1.431     1.264     1.264     1.265     1.265

Table 2: Average number of points gained by the algorithms in the columns in all hammer shot states reached by the algorithms in the rows. The standard error on these values is 0.007. Bold values denote a statistically significant maximum within the row (p = 0.99; paired t-test).

              KR-UCT    RAVE+PW              KR-UCT    PW
  KR-UCT      1.392     1.272      KR-UCT    1.414     1.281
  RAVE+PW     1.405     1.248      PW        1.416     1.265

              KR-UCT    RAVE                 KR-UCT    UCT
  KR-UCT      1.428     1.311      KR-UCT    1.461     1.344
  RAVE        1.400     1.242      UCT       1.423     1.255

Table 3: Average number of points gained by the algorithms in the columns in the hammer shot states reached by the algorithms in the rows, in mutual matches of the algorithms mentioned in each subtable. The standard error is 0.011 for each entry.

A clear picture emerges when comparing the columns of Table 2. KR-UCT outperforms all other algorithms by over 0.12 points in its shot selection, regardless of the set of hammer states under consideration. Also, notice that the preexisting variants of UCT have nearly identical performance. RAVE learns from actions in a subtree, which are not present in this case; PW will always explore all options, removing the need to consider a smaller subset.

This performance difference highlights the main disadvantage of the other algorithms: they can choose only from the generated shots, which is a severe limitation in the hammer shot, when the available samples are sufficient to explore far more options. This can be seen quite dramatically in the bottom part of Figure 2. The two heatmaps are overlaid with the shots evaluated by vanilla UCT and KR-UCT in the first 300 samples from this hammer state. Notice the generated candidate shot at weight 2000 mm/s and aim -0.03 radians, which is on the edge of a high-value region. In UCT, approximately half of the noisy samples of this shot receive 1 point, but the other half receive -1 point, and the shot is not selected. The algorithm uses more samples in the smaller region to the left of this shot. However, KR-UCT is not restricted to choose only from the generated shots. As soon as it generates a noisy shot in the good region, the next sample will be near this shot and the sampling will gradually move over the high-value region. The final shot is selected close to the center of this region.

Additionally, we can compare a table's rows to evaluate an algorithm's ability to reach a good state through lookahead and planning, even when a different algorithm is choosing the final hammer shot. Looking at the rows in the subtables of Table 3, we see a general trend that KR-UCT reaches states that result in equal or greater value regardless of which algorithm is selecting the final shot. On the whole this suggests KR-UCT is an improvement over UCT when the search space is shallow as well as when it is deep. This story is only contradicted in the case where KR-UCT is the algorithm selecting the hammer shot. This discrepancy is most clear in the full results of Table 2, where the preexisting enhancements to UCT seem to result in more valuable states for KR-UCT than KR-UCT does for itself (viz., compare the different rows in the KR-UCT column). We would like to explore this curious effect further in future work.

5 Conclusion

While a number of planning approaches have been proposed for sequential decision-making in continuous action spaces, none fully exploit the situation that arises from execution uncertainty. We present a new approach, KR-UCT, based on the insight that samples of execution uncertainty from a particular action provide information about any other action that could have generated that execution outcome. KR-UCT uses kernel regression to both share information between actions with similar distributions of outcomes and, more importantly, guide the exploration of actions outside of the initially provided candidate set. We show that the algorithm makes a significant improvement over both vanilla UCT and some of its standard enhancements.

Acknowledgements

This research was funded by NSERC and Alberta Innovates Technology Futures (AITF) through Amii, the Alberta Machine Intelligence Institute. The computational resources were made possible by Compute Canada and Calcul Québec. Additionally, we thank all others who have contributed to the development of the Computer Curling Research Group.

References

[Ahmad et al., 2016] Zaheen Farraz Ahmad, Robert C. Holte, and Michael Bowling. Action selection for hammer shots in curling. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI), 2016.

[Archibald et al., 2009] Christopher Archibald, Alon Altman, and Yoav Shoham. Analysis of a winning computational billiards player. In Proceedings of the 21st International Joint Conference on Artificial Intelligence, pages 1377–1382. Morgan Kaufmann Publishers Inc., 2009.

[Auer et al., 2002] Peter Auer, Nicolò Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

[Bubeck et al., 2011] Sébastien Bubeck, Rémi Munos, Gilles Stoltz, and Csaba Szepesvári. X-armed bandits. Journal of Machine Learning Research, 12:1655–1695, 2011.

[Chaslot et al., 2008] G.M.J.B. Chaslot, M.H.M. Winands, J.W.H.M. Uiterwijk, H.J. van den Herik, and B. Bouzy. Progressive strategies for Monte-Carlo tree search. New Mathematics and Natural Computation, 4(3):343, 2008.

[Couëtoux et al., 2011] Adrien Couëtoux, Jean-Baptiste Hoock, Nataliya Sokolovska, Olivier Teytaud, and Nicolas Bonnard. Continuous upper confidence trees. In Learning and Intelligent Optimization, pages 433–445. Springer, 2011.

[Coulom, 2007] Rémi Coulom. Computing "ELO ratings" of move patterns in the game of Go. ICGA Journal, 30(4):198–208, 2007.

[Denny, 2002] Mark Denny. Curling rock dynamics: Towards a realistic model. Canadian Journal of Physics, 80(9):1005–1014, 2002.

[Gelly and Silver, 2011] Sylvain Gelly and David Silver. Monte-Carlo tree search and rapid action value estimation in computer Go. Artificial Intelligence, 175(11):1856–1875, 2011.

[Gelly et al., 2006] Sylvain Gelly, Yizao Wang, Rémi Munos, and Olivier Teytaud. Modification of UCT with patterns in Monte-Carlo Go. Research Report RR-6062, INRIA, 2006.

[Guttman, 1984] Antonin Guttman. R-trees: A dynamic index structure for spatial searching, volume 14. ACM, 1984.

[Jensen and Shegelski, 2004] E.T. Jensen and Mark R.A. Shegelski. The motion of curling rocks: Experimental investigation and semi-phenomenological description. Canadian Journal of Physics, 82(10):791–809, 2004.

[Kearns et al., 2002] Michael Kearns, Yishay Mansour, and Andrew Y. Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning, 49(2-3):193–208, 2002.

[Kocsis and Szepesvári, 2006] Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In Machine Learning: ECML 2006, pages 282–293. Springer Berlin / Heidelberg, 2006.

[Mansley et al., 2011] Christopher R. Mansley, Ari Weinstein, and Michael L. Littman. Sample-based planning for continuous action Markov decision processes. In Proceedings of the 21st International Conference on Automated Planning and Scheduling (ICAPS 2011), Freiburg, Germany, June 11-16, 2011.

[Michels, 1992] Paul Michels. Asymmetric kernel functions in non-parametric regression analysis and prediction. The Statistician, pages 439–454, 1992.

[Nadaraya, 1964] Elizbar A. Nadaraya. On estimating regression. Theory of Probability & Its Applications, 9(1):141–142, 1964.

[Nyberg et al., 2012] Harald Nyberg, Sture Hogmark, and Staffan Jacobson. Calculated trajectories of curling stones sliding under asymmetrical friction. In Nordtrib 2012, 15th Nordic Symposium on Tribology, Trondheim, Norway, 2012.

[Nyberg et al., 2013] Harald Nyberg, Sara Alfredson, Sture Hogmark, and Staffan Jacobson. The asymmetrical friction mechanism that puts the curl in the curling stone. Wear, 301(1):583–589, 2013.

[Silver et al., 2016] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.

[Smith, 2007] Michael Smith. PickPocket: A computer billiards shark. Artificial Intelligence, 171(16-17):1069–1091, 2007.

[Watson, 1964] Geoffrey S. Watson. Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pages 359–372, 1964.

[Yamamoto et al., 2015] Masahito Yamamoto, Shu Kato, and Hiroyuki Iizuka. Digital curling strategy based on game tree search. In Computational Intelligence and Games (CIG), 2015 IEEE Conference on, pages 474–480. IEEE, 2015.
