
Solving Robot Navigation Problems with Initial Pose Uncertainty Using Real-Time Heuristic Search


From: AIPS 1998 Proceedings. Copyright © 1998, AAAI (www.aaai.org). All rights reserved.


Sven Koenig, College of Computing, Georgia Institute of Technology, [email protected]
Reid G. Simmons, Computer Science Department, Carnegie Mellon University, [email protected]

Abstract

We study goal-directed navigation tasks in mazes, where the robots know the maze but do not know their initial pose (position and orientation). These search tasks can be modeled as planning tasks in large non-deterministic domains whose states are sets of poses. They can be solved efficiently by interleaving planning and plan execution, which can reduce the sum of planning and plan-execution time because it allows the robots to gather information early. We show how Min-Max LRTA*, a real-time heuristic search method, can solve these and other planning tasks in non-deterministic domains efficiently. It allows for fine-grained control over how much planning to do between plan executions, uses heuristic knowledge to guide planning, and improves its plan-execution time as it solves similar planning tasks, until its plan-execution time is at least worst-case optimal. We also show that Min-Max LRTA* solves the goal-directed navigation tasks fast, converges quickly, and requires only a small amount of memory.

Figure 1: Goal-Directed Navigation Task #1 (the figure shows the goal pose and the possible starting poses, that is, the starting belief)

Introduction

Situated agents (such as robots) have to take their planning time into account to solve planning tasks efficiently (Good 1971). For single-instance planning tasks, for example, they should attempt to minimize the sum of planning and plan-execution time. Finding plans that minimize the plan-execution time is often intractable. Interleaving planning and plan execution is a general principle that can reduce the planning time and thus also the sum of planning and plan-execution time for sufficiently fast agents in non-deterministic domains (Genesereth & Nourbakhsh 1993). Without interleaving planning and plan execution, the agents have to find a large conditional plan that solves the planning task. When interleaving planning and plan execution, on the other hand, the agents have to find only the beginning of such a plan. After the execution of this subplan, the agents repeat the process from the state that actually resulted from the execution of the subplan instead of all states that could have resulted from its execution. Since actions are executed before their complete consequences are known, the agents are likely to incur some overhead in terms of the number of actions executed, but this is often outweighed by the computational savings gained. As an example we consider goal-directed navigation tasks in mazes where the robots know the maze but do not know their initial pose (position and orientation). Interleaving planning and plan execution allows the robots to make additional observations, which reduces their pose uncertainty and thus the number of situations that their plans have to cover. This makes subsequent planning more efficient.

Agents that interleave planning and plan execution have to overcome two problems: first, they have to make sure that they make progress towards the goal instead of cycling forever; second, they should be able to improve their plan-execution time as they solve similar planning tasks, otherwise they do not behave efficiently in the long run in case similar planning tasks unexpectedly repeat. We show how Min-Max LRTA*, a real-time heuristic search method that extends LRTA* (Korf 1990) to non-deterministic domains, can be used to address these problems. Min-Max LRTA* interleaves planning and plan execution and plans only in the part of the domain around the current state of the agents. This is the part of the domain that is immediately relevant for them in their current situation. It allows for fine-grained control over how much planning to do between plan executions, uses heuristic knowledge to guide planning, and improves its plan-execution time as it solves similar planning tasks, until its plan-execution time is at least worst-case optimal. We show that Min-Max LRTA* solves the goal-directed navigation tasks fast, converges quickly, and requires only a small amount of memory.

Planning methods that interleave planning and plan execution have been studied before. This includes assumptive planning, deliberation scheduling (including anytime algorithms), on-line algorithms and competitive analysis, real-time heuristic search, reinforcement learning, robot exploration techniques, and sensor-based planning. The proceedings of the AAAI-97 Workshop on On-Line Search give a good overview of these techniques.

The contributions of this paper are threefold: First, this paper extends our Min-Max LRTA* (Koenig & Simmons 1995), a real-time heuristic search method for non-deterministic domains, from look-ahead one to arbitrary look-aheads. It shows how the search space of Min-Max LRTA* can be represented more compactly than is possible with minimax trees and discusses how it can be searched efficiently using methods from Markov game theory. Second, this paper illustrates an advantage of real-time heuristic search methods that has not been studied before. It was already known that real-time heuristic search methods are efficient alternatives to traditional search methods in deterministic domains. For example, they are among the few search methods that are able to find suboptimal plans for the twenty-four puzzle, a sliding-tile puzzle with more than 7 × 10^24 states (Korf 1993). They have also been used to solve STRIPS-type planning tasks (Bonet, Loerincs, & Geffner 1997). The reason why they are efficient in deterministic domains is that they trade off minimizing planning time and minimizing plan-execution time. However, many domains from robotics, control, and scheduling are nondeterministic. We demonstrate that real-time heuristic search methods can also be used to speed up problem solving in nondeterministic domains and that there is an additional reason why they are efficient in these domains, namely that they allow agents to gather information early. This information can be used to resolve some of the uncertainty caused by nondeterminism and thus reduce the amount of planning done for unencountered situations. Third, this paper demonstrates an advantage of Min-Max LRTA* over most previous planning methods that interleave planning and plan execution, namely that it improves its plan-execution time as it solves similar planning tasks. As recognized in both (Dean et al. 1995) and (Stentz 1995), this is an important property because no planning method that executes actions before it has found a complete plan can guarantee a good plan-execution time right away, and methods that do not improve their plan-execution time do not behave efficiently in the long run in case similar planning tasks unexpectedly repeat.

The Robot Navigation Problem

We study goal-directed navigation tasks with initial pose uncertainty (Nourbakhsh 1996), an example of which is shown in Figure 1. A robot knows the maze, but is uncertain about its start pose, where a pose is a location (square) and orientation (north, east, south, west). We assume that there is no uncertainty in actuation and sensing. (In our mazes, it is indeed possible to make actuation and sensing approximately 100 percent reliable.) The sensors on-board the robot tell it in every pose whether there are walls immediately adjacent to it in the four directions (front, left, behind, right). The actions are to move forward one square (unless there is a wall directly in front of the robot), turn left ninety degrees, or turn right ninety degrees. The task of the robot is to navigate to any of the given goal poses and stop. Since there might be many poses that produce the same sensor reports as the goal poses, solving the goal-directed navigation task includes localizing the robot sufficiently so that it knows that it is at a goal pose when it stops. We use the following notation: P is the finite set of possible robot poses (pairs of location and orientation). A(p) is the set of possible actions that the robot can execute in pose p ∈ P: left, right, and possibly forward. succ(p, a) is the pose that results from the execution of action a ∈ A(p) in pose p ∈ P. o(p) is the observation that the robot makes in pose p ∈ P: whether or not there are walls immediately adjacent to it in the four directions (front, left, behind, right). The robot starts in pose p_start ∈ P and then repeatedly makes an observation and executes an action until it decides to stop. It knows the maze, but is uncertain about its start pose. It could be in any pose in P_start ⊆ P. We require only that o(p) = o(p') for all p, p' ∈ P_start, which automatically holds after the first observation, and p_start ∈ P_start, which automatically holds for P_start = {p : p ∈ P ∧ o(p) = o(p_start)}. The robot has to navigate to any pose in P_goal ⊆ P and stop.

Figure 2: Goal-Directed Navigation Task #2

Even if the robot is not certain about its pose, it can maintain a belief about its current pose. We assume that the robot cannot associate probabilities or other likelihood estimates with the poses. Then, all it can do is maintain a set of possible poses ("belief"). For example, if the robot has no knowledge of its start pose for the goal-directed navigation task from Figure 1, but observes walls around it except in its front, then the start belief of the robot contains the seven possible start poses shown in the figure. We use the following notation: B is the set of beliefs, b_start the start belief, and B_goal the set of goal beliefs. A(b) is the set of actions that can be executed when the belief is b. O(b, a) is the set of possible observations that can be made after the execution of action a when the belief was b. succ(b, a, o) is the successor belief that results if observation o is made after the execution of action a when the belief was b. Then,

B = {b : b ⊆ P ∧ o(p) = o(p') for all p, p' ∈ b}
b_start = P_start
B_goal = {b : b ⊆ P_goal ∧ o(p) = o(p') for all p, p' ∈ b}
A(b) = A(p) for any p ∈ b
O(b, a) = {o(succ(p, a)) : p ∈ b}
succ(b, a, o) = {succ(p, a) : p ∈ b ∧ o(succ(p, a)) = o}
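To make these definitions concrete, the following sketch (ours, not code from the paper) computes A(b), O(b, a), and succ(b, a, o) from the pose-space primitives; the interface functions actions(p), succ_pose(p, a), and obs(p) are assumptions about how a concrete maze would be encoded.

```python
# Minimal sketch of the belief-space primitives. Beliefs are frozensets of poses.

def belief_actions(b, actions):
    """A(b): all poses in a belief share the same observation and hence the same actions."""
    return actions(next(iter(b)))

def belief_observations(b, a, succ_pose, obs):
    """O(b, a): the observations that can be made after executing action a in belief b."""
    return {obs(succ_pose(p, a)) for p in b}

def belief_successor(b, a, o, succ_pose, obs):
    """succ(b, a, o): the poses consistent with observing o after executing a in belief b."""
    return frozenset(succ_pose(p, a) for p in b if obs(succ_pose(p, a)) == o)
```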

To understand the definition of A(b), notice that A(p) = A(p') for all p, p' ∈ b after the preceding observation, since the observation determines the actions that can be executed. To understand the definition of B_goal, notice that the robot knows that it is in a goal pose if its belief is b ⊆ P_goal. If the belief contains more than one pose, however, the robot does not know which goal pose it is in. Figure 2 shows an example. To solve the goal-directed navigation task, it is best for the robot to move forward twice. At this point, the robot knows that it is in a goal pose but cannot be certain which of the two goal poses it is in. If it is important that the robot knows which goal pose it is in, we define B_goal = {b : b ⊆ P_goal ∧ |b| = 1}. Min-Max LRTA* can also be applied to localization tasks. These tasks are identical to the goal-directed navigation tasks except that the robot has to achieve only certainty about its pose. In this case, we define B_goal = {b : b ⊆ P ∧ |b| = 1}.

Figure 3: Goal-Directed Navigation Task #3

The robot navigation domain is deterministic and small (pose space). However, the beliefs of the robot depend on its observations, which the robot cannot predict with certainty since it is uncertain about its pose. We therefore formulate the goal-directed navigation tasks as planning tasks in a domain whose states are the beliefs of the robot (belief space). Beliefs are sets of poses. Thus, the number of beliefs is exponential in the number of poses, and the belief space is not only nondeterministic but can also be large. The pose space and the belief space differ in the observability of their states. After an action execution, the robot will usually not be able to determine its current pose with certainty, but it can always determine its current belief for sure. We use the following notation: S denotes the finite set of states of the domain, s_start ∈ S the start state, and G ⊆ S the set of goal states. A(s) is the finite set of (potentially nondeterministic) actions that can be executed in state s ∈ S. succ(s, a) denotes the set of successor states that can result from the execution of action a ∈ A(s) in state s ∈ S. Then,

S = B
s_start = b_start
G = B_goal
A(s) = A(b) for s = b
succ(s, a) = {succ(b, a, o) : o ∈ O(b, a)} for s = b

The actual successor state that results from the execution of action a in state s = b is succ(b, a, o) if observation o is made after the action execution. In general, traversing any path from the start state in the belief space to a goal state solves the goal-directed navigation task. To make all goal-directed navigation tasks solvable, we require the mazes to be strongly connected (every pose can be reached from every other pose) and asymmetrical (localization is possible). This modest assumption allows the robot to solve a goal-directed navigation task by first localizing itself and then moving to a goal pose, although this behavior often does not minimize the plan-execution time. Figure 3 shows an example. To solve the goal-directed navigation task, it is best for the robot to turn left and move forward until it sees a corridor opening on one of its sides. At this point, the robot has localized itself and can navigate to the goal pose. On the other hand, to solve the corresponding localization task, it is best for the robot to move forward once. At this point, the robot has localized itself.

Min-Max LRTA*

Min-Max Learning Real-Time A* (Min-Max LRTA*) is a real-time heuristic search method that extends LRTA* (Korf 1990) to non-deterministic domains by interleaving minimax searches in local search spaces and plan execution (Figure 4). Similar to game-playing approaches and reinforcement-learning methods such as Q̂-Learning (Heger 1996), Min-Max LRTA* views acting in nondeterministic domains as a two-player game in which it selects an action from the available actions in the current state. This action determines the possible successor states, from which a fictitious agent, called nature, chooses one. Acting in deterministic domains is then simply a special case where every action uniquely determines the successor state. Min-Max LRTA* uses minimax search to solve planning tasks in nondeterministic domains, a worst-case search method that attempts to move with every action execution as close to the goal state as possible under the assumption that nature is an opponent that tries to move Min-Max LRTA* as far away from the goal state as possible.

Initially, the u-values u(s) ≥ 0 are approximations of the minimax goal distances (measured in action executions) for all s ∈ S. Given a set X, the expression "one-of X" returns an element of X according to an arbitrary rule. A subsequent invocation of "one-of X" can return the same or a different element. The expression "arg min_{x ∈ X} f(x)" returns the set {x ∈ X : f(x) = min_{x' ∈ X} f(x')}.

1. s := s_start.
2. If s ∈ G, then stop successfully.
3. Generate a local search space S_lss with s ∈ S_lss and S_lss ∩ G = ∅.
4. Update u(s) for all s ∈ S_lss (Figure 5).
5. a := one-of arg min_{a ∈ A(s)} max_{s' ∈ succ(s,a)} u(s').
6. Execute action a, that is, change the current state to a state in succ(s, a) (according to the behavior of nature).
7. s := the current state.
8. (If s ∈ S_lss, then go to 5.)
9. Go to 2.

Figure 4: Min-Max LRTA*

The minimax-search method uses the temporary variables u'(s) for all s ∈ S_lss.

1. For all s ∈ S_lss: u'(s) := u(s) and u(s) := ∞.
2. If u(s) < ∞ for all s ∈ S_lss, then return.
3. s' := one-of arg min_{s ∈ S_lss: u(s) = ∞} max(u'(s), 1 + min_{a ∈ A(s)} max_{s'' ∈ succ(s,a)} u(s'')).
4. If max(u'(s'), 1 + min_{a ∈ A(s')} max_{s'' ∈ succ(s',a)} u(s'')) = ∞, then return.
5. u(s') := max(u'(s'), 1 + min_{a ∈ A(s')} max_{s'' ∈ succ(s',a)} u(s'')).
6. Go to 2.

Figure 5: Minimax-Search Method
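For concreteness, the control loop of Figure 4 can be written compactly as follows. This is a sketch under our own interface assumptions (a persistent u-value dictionary with a heuristic fallback h, a generate_lss callback for Line 3, an update_u callback implementing Figure 5, and an execute function that returns the successor state actually chosen by nature); it is not the authors' implementation.

```python
def min_max_lrta_star(s_start, goals, actions, succ, u, h, generate_lss, update_u, execute):
    """Sketch of the Min-Max LRTA* loop of Figure 4, including the optional Line 8.

    u maps states to u-values and persists across calls; missing entries default
    to the heuristic h, which is what lets plan-execution time improve over runs.
    """
    s = s_start
    while s not in goals:                                    # Line 2
        lss = generate_lss(s)                                # Line 3
        update_u(lss, u)                                     # Line 4 (Figure 5)
        while True:
            a = min(actions(s),                              # Line 5 (ties broken arbitrarily)
                    key=lambda act: max(u.get(t, h(t)) for t in succ(s, act)))
            s = execute(s, a)                                # Lines 6 and 7 (nature moves)
            if s in goals or s not in lss:                   # Line 8: keep acting while inside S_lss
                break
    return s
```

Reusing the same u dictionary across a series of similar tasks is the knowledge transfer discussed later in the paper.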

Min-Max LRTA* associates a small amount of information with the states that allows it to remember where it has already searched. In particular, it associates a u-value u(s) ≥ 0 with each state s ∈ S. The u-values approximate the minimax goal distances of the states. The minimax goal distance gd(s) ∈ [0, ∞] of state s ∈ S is the smallest number of action executions with which a goal state can be reached from state s, even for the most vicious behavior of nature. Using action executions to measure the minimax goal distances and thus plan-execution times is reasonable if every action can be executed in about the same amount of time. The minimax goal distances are defined by the following set of equations for all s ∈ S:

gd(s) = 0 if s ∈ G
gd(s) = 1 + min_{a ∈ A(s)} max_{s' ∈ succ(s,a)} gd(s') otherwise

Min-Max LRTA* updates the u-values as the search progresses and uses them to determine which actions to execute. It first checks whether it has already reached a goal state and thus can terminate successfully (Line 2). If not, it generates the local search space S_lss ⊆ S (Line 3). While we require only that s ∈ S_lss and S_lss ∩ G = ∅, in practice Min-Max LRTA* constructs S_lss by searching forward from s. It then updates the u-values of all states in the local search space (Line 4) and, based on these u-values, decides which action to execute next (Line 5). Finally, it executes the selected action (Line 6), updates its current state (Line 7), and iterates the procedure.

Min-Max LRTA* uses minimax search to update the u-values in the local search space (Figure 5). The minimax-search method assigns each state its minimax goal distance under the assumption that the u-values of all states in the local search space are lower bounds on the correct minimax goal distances and that the u-values of all states outside of the local search space correspond to their correct minimax goal distances. Formally, if u(s) ∈ [0, ∞] denotes the u-values before the minimax search and û(s) ∈ [0, ∞] denotes the u-values afterwards, then û(s) = max(u(s), 1 + min_{a ∈ A(s)} max_{s' ∈ succ(s,a)} û(s')) for all s ∈ S_lss, and û(s) = u(s) otherwise. Min-Max LRTA* could represent the local search space as a minimax tree, which could be searched with traditional minimax-search methods. However, this has the disadvantage that the memory requirements and the search effort can be exponential in the depth of the tree (the look-ahead of Min-Max LRTA*), which can be a problem for goal-directed navigation tasks in large empty spaces. Since the number of different states often grows only polynomially in the depth of the tree, Min-Max LRTA* represents the local search space more compactly as a graph that contains every state at most once. This requires a more sophisticated minimax-search method because there can now be paths of different lengths between any two states in the graph. Min-Max LRTA* uses a simple method that is related to more general dynamic programming methods from Markov game theory (Littman 1996). It updates all states in the local search space in the order of their increasing new u-values. This ensures that the u-value of each state is updated only once. More details on the minimax-search method are given in (Koenig 1997).

After Min-Max LRTA* has updated the u-values, it greedily chooses the action for execution that minimizes the u-value of the successor state in the worst case (ties are broken arbitrarily). The rationale behind this is that the u-values approximate the minimax goal distances and Min-Max LRTA* attempts to decrease its minimax goal distance as much as possible. Then, Min-Max LRTA* can generate another local search space, update the u-values of all states that it contains, and select another action for execution. However, if the new state is still part of the local search space (the one that was used to determine the action whose execution resulted in the new state), Min-Max LRTA* also has the option (Line 8) to select another action for execution based on the current u-values. Min-Max LRTA* with Line 8 is a special case of Min-Max LRTA* without Line 8: After Min-Max LRTA* has run the minimax-search method on some local search space, the u-values do not change if Min-Max LRTA* runs the minimax-search method again on the same local search space or a subset thereof. Whenever Min-Max LRTA* with Line 8 jumps to Line 5, the new current state is still part of the local search space S_lss and thus not a goal state. Consequently, Min-Max LRTA* can skip Line 2. Min-Max LRTA* could now search a subset of S_lss that includes the new current state s, for example {s}. Since this does not change the u-values, Min-Max LRTA* can, in this case, also skip the minimax search. In the experiments, we use Min-Max LRTA* with Line 8, because it utilizes more information of the searches in the local search spaces.
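The minimax-search method of Figure 5 can be sketched as follows; the variable names and the fallback to a heuristic h for states whose u-values have never been stored are our own assumptions. States in the local search space are assigned their new u-values in increasing order, so each is updated only once.

```python
import math

def minimax_update(lss, u, actions, succ, h):
    """Sketch of the minimax-search method of Figure 5 over a state graph."""
    u_old = {s: u.get(s, h(s)) for s in lss}   # Step 1: u'(s) := u(s)
    for s in lss:
        u[s] = math.inf                        # Step 1: u(s) := infinity

    def lookup(s):
        if s in lss:
            return u[s]                        # infinity until assigned in this call
        return u.get(s, h(s))                  # outside the lss: current value or heuristic

    def value(s):
        best = min(max(lookup(t) for t in succ(s, a)) for a in actions(s))
        return max(u_old[s], 1 + best)

    unassigned = set(lss)
    while unassigned:                          # Step 2
        s_next = min(unassigned, key=value)    # Step 3: smallest new u-value first
        v = value(s_next)
        if v == math.inf:                      # Step 4
            return
        u[s_next] = v                          # Step 5
        unassigned.remove(s_next)              # back to Step 2
```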

Features of Min-Max LRTA*

An advantage of Min-Max LRTA* is that it does not depend on assumptions about the behavior of nature. This is so because minimax searches assume that nature is vicious and always chooses the worst possible successor state. If Min-Max LRTA* can reach a goal state for the most vicious behavior of nature, it also reaches a goal state if nature uses a different and therefore less vicious behavior. This is an advantage of Min-Max LRTA* over Trial-Based Real-Time Dynamic Programming (Barto, Bradtke, & Singh 1995), another generalization of LRTA* to non-deterministic domains. Trial-Based Real-Time Dynamic Programming assumes that nature chooses successor states randomly according to given probabilities. Therefore, it does not apply to planning in the pose space when it is impossible to make reliable assumptions about the behavior of nature, including tasks with perceptual aliasing (Chrisman 1992) and the goal-directed navigation tasks studied in this paper.

A disadvantage of Min-Max LRTA* is that it cannot solve all planning tasks. This is so because it interleaves minimax searches and plan execution. Minimax searches limit the solvable planning tasks because they are overly pessimistic: they can solve only planning tasks for which the minimax goal distance of the start state is finite. Interleaving planning and plan execution limits the solvable planning tasks further because it executes actions before their complete consequences are known. Thus, even if the minimax goal distance of the start state is finite, it is possible that Min-Max LRTA* accidentally executes actions that lead to a state whose minimax goal distance is infinite, at which point the planning task becomes unsolvable. Despite these disadvantages, Min-Max LRTA* is guaranteed to solve all planning tasks in safely explorable domains. These are domains for which the minimax goal distances of all states are finite. To be precise: first, all states of the domain can be deleted that cannot possibly be reached from the start state or can be reached from the start state only by passing through a goal state; the minimax goal distances of all remaining states have to be finite. For example, the belief space of the robot navigation domain is safely explorable according to our assumptions, and so are other nondeterministic domains from robotics, including manipulation and assembly domains, which explains why using minimax methods is popular in robotics (Lozano-Perez, Mason, & Taylor 1984). We therefore assume in the following that Min-Max LRTA* is applied to safely explorable domains. For similar reasons, we also assume in the following that action executions cannot leave the current state unchanged. The class of domains that Min-Max LRTA* is guaranteed to solve could be extended further by using the methods in (Nourbakhsh 1996) to construct the local search spaces.

Min-Max LRTA* has three key features, namely that it allows for fine-grained control over how much planning to do between plan executions, uses heuristic knowledge to guide planning, and improves its plan-execution time as it solves similar planning tasks, until its plan-execution time is at least worst-case optimal. The third feature is especially important since no method that executes actions before it knows their complete consequences can guarantee a good plan-execution time right away. In the following, we explain these features of Min-Max LRTA* in detail.

Heuristic Knowledge

Min-Max LRTA* uses heuristic knowledge to guide planning. This knowledge is provided in the form of admissible initial u-values. U-values are admissible if and only if 0 ≤ u(s) ≤ gd(s) for all s ∈ S. In deterministic domains, this definition reduces to the traditional definition of admissible heuristic values for A* search (Pearl 1985). The larger its initial u-values, the better informed Min-Max LRTA* is.

Theorem 1. Let ū(s) denote the initial u-values. Then, Min-Max LRTA* with initially admissible u-values reaches a goal state after at most ū(s_start) + Σ_{s ∈ S} [gd(s) − ū(s)] action executions, regardless of the behavior of nature.

All proofs can be found in (Koenig 1997). The theorem shows that Min-Max LRTA* with initially admissible u-values reaches a goal state in safely explorable domains after a finite number of action executions, that is, it is correct. The larger the initial u-values, the smaller the upper bound on the number of action executions and thus the smaller its plan-execution time. For example, Min-Max LRTA* is fully informed if the initial u-values equal the minimax goal distances of the states. In this case, Theorem 1 predicts that Min-Max LRTA* reaches a goal state after at most gd(s_start) action executions. Thus, its plan-execution time is at least worst-case optimal and no other method can do better in the worst case. For the goal-directed navigation tasks, one can use the goal-distance heuristic to initialize the u-values, that is, u(s) = max_{p ∈ s} gd({p}). The calculation of gd({p}) involves no pose uncertainty and can be done efficiently without interleaving planning and plan execution, by using traditional search methods in the pose space. This is possible because the pose space is deterministic and small. The u-values are admissible because the robot needs at least max_{p ∈ s} gd({p}) action executions in the worst case to solve the goal-directed navigation task from pose p' = one-of arg max_{p ∈ s} gd({p}), even if it knows that it starts in that pose. The u-values are often only partially informed because they do not take into account that the robot might not know its pose and then might have to execute additional localization actions to overcome its pose uncertainty. For the localization tasks, on the other hand, it is difficult to obtain better informed initial u-values than those provided by the zero heuristic (zero-initialized u-values).
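A sketch of this goal-distance initialization follows. It computes gd({p}) for every pose with a breadth-first search over the reversed pose graph (valid because every action counts as one action execution) and then takes the maximum over the poses in a belief. The interface names are ours; the maze is assumed to be strongly connected, so all pose goal distances are finite.

```python
from collections import deque

def pose_goal_distances(poses, goal_poses, actions, succ_pose):
    """gd({p}) for every pose p: fewest actions to reach a goal pose in the
    deterministic pose space, computed by backward breadth-first search."""
    preds = {p: [] for p in poses}
    for p in poses:
        for a in actions(p):
            preds[succ_pose(p, a)].append(p)
    dist = {p: float("inf") for p in poses}
    queue = deque()
    for g in goal_poses:
        dist[g] = 0
        queue.append(g)
    while queue:
        q = queue.popleft()
        for p in preds[q]:
            if dist[p] == float("inf"):
                dist[p] = dist[q] + 1
                queue.append(p)
    return dist

def goal_distance_heuristic(belief, dist):
    """Admissible initial u-value of a belief s: u(s) = max over p in s of gd({p})."""
    return max(dist[p] for p in belief)
```

Passing goal_distance_heuristic (with dist bound) as the fallback h in the earlier sketches yields the goal-distance initialization; a constant-zero h yields the zero heuristic used for the localization tasks.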

Fine-Grained Control

Min-Max LRTA* allows for fine-grained control over how much planning to do between plan executions. For example, Min-Max LRTA* with Line 8 and S_lss = S \ G performs a complete minimax search without interleaving planning and plan execution, which is slow but produces plans whose plan-execution times are worst-case optimal. On the other hand, Min-Max LRTA* with S_lss = {s} performs almost no planning between plan executions. Smaller look-aheads benefit agents that can execute plans with a similar speed as they can generate them, because they prevent these agents from being idle, which can minimize the sum of planning and plan-execution time if the heuristic knowledge guides the search sufficiently. Larger look-aheads are needed for slower agents, such as robots.

Improvement of Plan-Execution Time

If Min-Max LRTA* solves the same planning task repeatedly (even with different start states), it can improve its behavior over time by transferring domain knowledge, in the form of u-values, between planning tasks. This is a familiar concept: one can argue that the minimax searches that Min-Max LRTA* performs between plan executions are independent of one another and that they are connected only via the u-values that transfer domain knowledge between them. To see why Min-Max LRTA* can do the same thing between planning tasks, consider the following theorem.

Theorem 2. Admissible initial u-values remain admissible after every action execution of Min-Max LRTA* and are monotonically nondecreasing.

Now assume that a series of planning tasks in the same domain with the same set of goal states are given, for example goal-directed navigation tasks with the same goal poses in the same maze. The actual start poses or the beliefs of the robot about its start poses do not need to be identical. If the initial u-values of Min-Max LRTA* are admissible for the first planning task, then they are also admissible after Min-Max LRTA* has solved the task and are state-wise at least as informed as initially. Thus, they are also admissible for the second planning task and Min-Max LRTA* can continue to use the same u-values across planning tasks. The start states of the planning tasks can differ since the admissibility of u-values does not depend on the start states. This way, Min-Max LRTA* can transfer acquired domain knowledge from one planning task to the next, thereby making its u-values better informed. Ultimately, better informed u-values result in an improved plan-execution time, although the improvement is not necessarily monotonic. The following theorems formalize this knowledge transfer in the mistake-bounded error model. The mistake-bounded error model is one way of analyzing learning methods by bounding the number of mistakes that they make.

Theorem 3. Min-Max LRTA* with initially admissible u-values reaches a goal state after at most gd(s_start) action executions, regardless of the behavior of nature, if its u-values do not change during the search.

Theorem 4. Assume that Min-Max LRTA* maintains u-values across a series of planning tasks in the same domain with the same set of goal states. Then, the number of planning tasks for which Min-Max LRTA* with initially admissible u-values reaches a goal state after more than gd(s_start) action executions is bounded from above by a finite constant that depends only on the domain and goal states.

Proof: If Min-Max LRTA* with initially admissible u-values reaches a goal state after more than gd(s_start) action executions, then at least one u-value has changed according to Theorem 3. This can happen only a finite number of times since the u-values are monotonically nondecreasing and remain admissible according to Theorem 2, and thus are bounded from above by the minimax goal distances.

In this context, it counts as one mistake when Min-Max LRTA* reaches a goal state after more than gd(s_start) action executions. According to Theorem 4, the u-values converge after a bounded number of mistakes. The action sequence after convergence depends on the behavior of nature and is not necessarily uniquely determined, but has gd(s_start) or fewer actions, that is, the plan-execution time of Min-Max LRTA* is either worst-case optimal or better than worst-case optimal. This is possible because nature might not be as malicious as a minimax search assumes. Min-Max LRTA* might not be able to detect this "problem" by introspection since it does not perform a complete minimax search but partially relies on observing the actual successor states of action executions, and nature can wait an arbitrarily long time to reveal it or choose not to reveal it at all. This can prevent the u-values from converging after a bounded number of action executions and is the reason why we analyzed its behavior using the mistake-bounded error model. It is important to realize that, since Min-Max LRTA* relies on observing the actual successor states of action executions, it can have computational advantages even over several search episodes compared to a complete minimax search. This is the case if nature is not as malicious as a minimax search assumes and some successor states do not occur in practice, for example because the actual start pose of the robot never equals the worst possible pose among all start poses that are consistent with its start belief.

1. S_lss := {s}.
2. Update u(s) for all s ∈ S_lss (Figure 5).
3. s' := s.
4. a := one-of arg min_{a ∈ A(s')} max_{s'' ∈ succ(s',a)} u(s'').
5. If |succ(s', a)| > 1, then return.
6. s' := s'', where s'' ∈ succ(s', a) is unique.
7. If s' ∈ S_lss, then go to 4.
8. If s' ∈ G, then return.
9. S_lss := S_lss ∪ {s'} and go to 2.

Figure 6: Generating Local Search Spaces

Figure 7: Goal-Directed Navigation Task #4
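The look-ahead generation method of Figure 6 can be rendered as the following sketch (our own interface names; update_u is assumed to be the minimax-search method of Figure 5 with the domain arguments bound). It grows the local search space along the greedy path through deterministic transitions until it reaches an action with more than one possible outcome, where information can be gained, or a goal state.

```python
def generate_lss(s, goals, actions, succ, u, h, update_u):
    """Sketch of the look-ahead generation of Figure 6."""
    lss = {s}                                                    # Step 1
    while True:
        update_u(lss, u)                                         # Step 2
        s_cur = s                                                # Step 3
        while True:
            a = min(actions(s_cur),                              # Step 4
                    key=lambda act: max(u.get(t, h(t)) for t in succ(s_cur, act)))
            outcomes = succ(s_cur, a)
            if len(outcomes) > 1:                                # Step 5: possible gain in information
                return lss
            s_cur = next(iter(outcomes))                         # Step 6: unique successor
            if s_cur not in lss:                                 # Step 7: keep following the path inside lss
                break
        if s_cur in goals:                                       # Step 8
            return lss
        lss.add(s_cur)                                           # Step 9
```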

Other Navigation Methods

Min-Max LRTA* is a domain-independent planning method that applies not only to goal-directed navigation and localization tasks with initial pose uncertainty, but to planning tasks in nondeterministic domains in general. It can, for example, also be used for moving-target search, the task of a hunter to catch a moving prey (Koenig & Simmons 1995). In the following, we compare Min-Max LRTA* to two more specialized planning methods that can also be used to solve the goal-directed navigation or localization tasks. An important advantage of Min-Max LRTA* over these methods is that it can improve its plan-execution time as it solves similar planning tasks.

Information-Gain Method

The Information-Gain Method (IG method) (Genesereth & Nourbakhsh 1993) first demonstrated the advantage of interleaving planning and plan execution for goal-directed navigation tasks. (Footnote: (Genesereth & Nourbakhsh 1993) refers to the IG method as the Delayed Planning Architecture (DPA) with the viable plan heuristic. It also contains some improvements on the version of the IG method discussed here that do not change its character.) It uses breadth-first search (iterative deepening) on an and-or graph around the current state in conjunction with pruning rules to find subplans that achieve a gain in information, in the following sense: after the execution of the subplan, the robot has either solved the goal-directed navigation task or at least reduced the number of poses it can be in. This way, the IG method guarantees progress towards the goal. There are similarities between the IG method and Min-Max LRTA*: both methods interleave minimax searches and plan execution. Zero-initialized Min-Max LRTA* that generates the local search spaces for goal-directed navigation tasks with the method from Figure 6 exhibits a similar behavior as the IG method: it also performs a breadth-first search around its current state until it finds a subplan whose execution results in a gain in information. The method does this by starting with the local search space that contains only the current state. It performs a minimax search in the local search space and then simulates the action executions of Min-Max LRTA* starting from its current state. If the simulated action executions reach a goal state or lead to a gain in information, then the method returns. However, if the simulated action executions leave the local search space, the method halts the simulation, adds the state outside of the local search space to the local search space, and repeats the procedure. Notice that, when the method returns, it has already updated the u-values of all states of the local search space. Thus, Min-Max LRTA* does not need to improve the u-values of these states again and can skip the minimax search. Its action-selection step (Line 5) and the simulation have to break ties identically. Then, Min-Max LRTA* with Line 8 in Figure 4 executes actions until it either reaches a goal state or gains information. There are also differences between the IG method and Min-Max LRTA*: Min-Max LRTA* can use small look-aheads that do not guarantee a gain in information, and it can improve its plan-execution time as it solves similar planning tasks, at the cost of having to maintain u-values. The second advantage is important because no method that interleaves planning and plan execution can guarantee a good plan-execution time on the first run. For instance, consider the goal-directed navigation task from Figure 7 and assume that Min-Max LRTA* generates the local search spaces with the method from Figure 6. Then, both the IG method and zero-initialized Min-Max LRTA* move forward, because this is the fastest way to eliminate a possible pose, that is, to gain information. Even Min-Max LRTA* with the goal-distance heuristic moves forward, since it follows the gradient of the u-values. However, moving forward is suboptimal. It is best for the robot to first localize itself by turning around and moving to a corridor end. If the goal-directed navigation task is repeated a sufficient number of times with the same start pose, Min-Max LRTA* eventually learns this behavior.

Homing Sequences

Localization tasks are related to finding homing sequences or adaptive homing sequences for deterministic finite state automata whose states are colored. A homing sequence is a linear plan (action sequence) with the property that the observations made during its execution uniquely determine the resulting state (Kohavi 1978). An adaptive homing sequence is a conditional plan with the same property (Schapire 1992). For every reduced deterministic finite state automaton with n states, there exists a homing sequence that contains at most (n − 1)^2 actions. Finding a shortest homing sequence is NP-complete, but a suboptimal homing sequence of at most (n − 1)^2 actions can be found in polynomial time (Schapire 1992). Robot localization tasks can be solved with homing sequences since the pose space is deterministic and thus can be modeled as a deterministic finite state automaton.

Extensions of Min-Max LRTA*

Min-Max LRTA* uses minimax searches in the local search spaces to update its u-values. For the robot navigation tasks, it is possible to combine this with updates over a greater distance, with only a small amount of additional effort. For example, we know that gd(s) ≥ gd(s') for two states s, s' ∈ S with s ⊇ s' (recall that states are sets of poses). Thus, we can set u(s) := max(u(s), u(s')) for selected states s, s' ∈ S with s ⊇ s'. If the u-values are admissible before the update, they remain admissible afterwards. The assignment could be done immediately before the local search space is generated on Line 3 in Figure 4; a small sketch follows below.
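A minimal sketch of this superset-based update, assuming beliefs are represented as Python frozensets and u is the on-demand u-value dictionary used in the earlier sketches; known_beliefs is a hypothetical collection of beliefs whose u-values are already stored.

```python
def propagate_from_subsets(s, u, h, known_beliefs):
    """Raise u(s) using stored u-values of subsets of the belief s.

    Because gd(s) >= gd(s') whenever s is a superset of s', any admissible
    u-value of a subset is also a lower bound on gd(s), so taking the
    maximum preserves admissibility.
    """
    for s_sub in known_beliefs:
        if s_sub <= s:                              # s_sub is a subset of s
            u[s] = max(u.get(s, h(s)), u.get(s_sub, h(s_sub)))
    return u.get(s, h(s))
```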

Koenig 151 From: AIPS 1998 Proceedings. Copyright © 1998, AAAI (www.aaai.org). All rights reserved. after.., measudag... using... Min-MaxLRTA* with look-ahead one Min-MaxLRTA* with larger look-ahead (nsinl~the methodin Figure6) goal-directednavigation localization goal-directednavigation localization goal-distanceheuristic zero heuristic goal-distanceheuristic zae heuristic the first run plan-executiontime action ezacuuons I 13.32 13.33 50.48 12.24 planningtime stateexpansions I 13.32 13.33 73.46 26.62 memoryusage u-values remembered 31.88 13.32 :40.28 26.62 coovergcm:e plan-egecutiontime action executions 49.15 8.82 49.13 8.81 planningtime state expansions 49.15 8. 82 49.13 8.81 memoryusage u-values remembered 446.13 1,782.26 85.80 506.63 numberof rims until convergence 16.49 102.90 3.14 21.55 Figure 8: Experimental Results

We use a simulation of the robot navigation domain whose interface matches the interface of an actual robot that operates in mazes (Nourbakhsh & Genesereth 1997). Thus, Min-Max LRTA* could be run on that robot. We apply Min-Max LRTA* to goal-directed navigation and localization tasks with two different look-aheads each, namely look-ahead one (S_lss = {s}) and the larger look-ahead from Figure 6. As test domains, we use 500 randomly generated square mazes. The same 500 mazes are used for all experiments. All mazes have size 49 × 49 and the same obstacle density, the same start pose of the robot, and (for goal-directed navigation tasks) the same goal location, which includes all four poses. The robot is provided with no knowledge of its start pose and initially senses openings in all four directions. The mazes are constructed so that, on average, more than 1100 poses are consistent with this observation and thus have to be considered as potential start poses. We let Min-Max LRTA* solve the same task repeatedly with the same start pose until its behavior converges. (Footnote: Min-Max LRTA* did not know that the start pose remained the same. Otherwise, it could have used the decreased uncertainty about its pose after solving the goal-directed navigation task to narrow down its start pose and improve its plan-execution time this way.) To save memory, Min-Max LRTA* generates the initial u-values only on demand and never stores u-values that are identical to their initial values. Line 5 breaks ties between actions systematically according to a pre-determined ordering on A(s) for all states s. Figure 8 shows that Min-Max LRTA* indeed produces good plans in large domains quickly, while using only a small amount of memory. Since the plan-execution time of Min-Max LRTA* after convergence is no worse than the minimax goal distance of the start state, we know that its initial plan-execution time is at most 231, 151, 103, and 139 percent (respectively) of the worst-case optimal plan-execution time. Min-Max LRTA* also converges quickly. Consider, for example, the first case, where the initial plan-execution time is worst. In this case, Min-Max LRTA* with look-ahead one more than halves its plan-execution time in less than 20 runs. This demonstrates that this aspect of Min-Max LRTA* is important if the heuristic knowledge does not guide planning sufficiently well.

Conclusions

We studied goal-directed navigation (and localization) tasks in mazes, where the robot knows the maze but does not know its initial pose. These problems can be modeled as nondeterministic planning tasks in large state spaces. We described Min-Max LRTA*, a real-time heuristic search method for nondeterministic domains, and showed that it can solve these tasks efficiently. Min-Max LRTA* interleaves planning and plan execution, which allows the robot to gather information early that can be used to reduce the amount of planning done for unencountered situations. Min-Max LRTA* allows for fine-grained control over how much planning to do between plan executions and uses heuristic knowledge to guide planning. It amortizes learning over several search episodes, which allows it to find suboptimal plans fast and then improve its plan-execution time as it solves similar planning tasks, until its plan-execution time is at least worst-case optimal. This is important since no method that executes actions before it knows their complete consequences can guarantee a good plan-execution time right away, and methods that do not improve their plan-execution time do not behave efficiently in the long run in case similar planning tasks unexpectedly repeat. Since Min-Max LRTA* partially relies on observing the actual successor states of action executions, it does not plan for all possible successor states and thus can still have computational advantages even over several search episodes compared to a complete minimax search if nature is not as malicious as a minimax search assumes and some successor states do not occur in practice.

Acknowledgements

Thanks to Matthias Heger, Richard Korf, Michael Littman, Tom Mitchell, and Illah Nourbakhsh for helpful discussions. Thanks also to Richard Korf and Michael Littman for their extensive comments and to Joseph Pemberton for making his maze generation program available to us. This research was sponsored by the Wright Laboratory, Aeronautical Systems Center, Air Force Materiel Command, USAF, and the Advanced Research Projects Agency (ARPA) under grant number F33615-93-1-1330. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the sponsoring organizations or the U.S. government.
In this case, Acknowledgements Min-MaxLRTA* with look-ahead one more than halves its plan-execution time in less than 20 runs. This demon- Thanksto Matthias Heger, Richard Korf, Michael Littman, strates that this aspect of Min-MaxLRTA* is important if TomMitchell, and Illah Nourbakhshfor helpful discus- the heuristic knowledgedoes not guide planning sufficiently sions. Thanks also to Richard Kerr and Michael Littman well. for their extensive commentsand to Joseph Pembertonfor makinghis mazegeneration program available to us. This research was sponsored by the Wright Laboratory, Aeronau- tical Systems Center, Air Force Materiel Command,USAF, and the AdvancedResearch Projects Agency(ARPA) under -’Min-MaxLRTA* did not knowthat the start pose remained grant numberF33615-93- I- 1330. The views and conclu- the same.Otherwise, it couldhave used the decreaseduncertainty sions contained in this documentare those of the authors aboutits poseafter solvingthe goal-directednavigation task to and should not be interpreted as representing the official narrowdown its start pose and improveits plan-executiontime policies, either expressedor implied, of the sponsoringor- this way. ganizations or the U.S. government.

References

Barto, A.; Bradtke, S.; and Singh, S. 1995. Learning to act using real-time dynamic programming. Artificial Intelligence 73(1):81-138.

Bonet, B.; Loerincs, G.; and Geffner, H. 1997. A robust and fast action selection mechanism. In Proceedings of the National Conference on Artificial Intelligence.

Chrisman, L. 1992. Reinforcement learning with perceptual aliasing: The perceptual distinctions approach. In Proceedings of the National Conference on Artificial Intelligence, 183-188.

Dean, T.; Kaelbling, L.; Kirman, J.; and Nicholson, A. 1995. Planning under time constraints in stochastic domains. Artificial Intelligence 76(1-2):35-74.

Genesereth, M., and Nourbakhsh, I. 1993. Time-saving tips for problem solving with incomplete information. In Proceedings of the National Conference on Artificial Intelligence, 724-730.

Good, I. 1971. Twenty-seven principles of rationality. In Godambe, V., and Sprott, D., eds., Foundations of Statistical Inference. Holt, Rinehart, Winston.

Heger, M. 1996. The loss from imperfect value functions in expectation-based and minimax-based tasks. Machine Learning 22(1-3):197-225.

Koenig, S., and Simmons, R. 1995. Real-time search in non-deterministic domains. In Proceedings of the International Joint Conference on Artificial Intelligence, 1660-1667.

Koenig, S. 1997. Goal-Directed Acting with Incomplete Information. Ph.D. Dissertation, School of Computer Science, Carnegie Mellon University, Pittsburgh (Pennsylvania).

Kohavi, Z. 1978. Switching and Finite Automata Theory. McGraw-Hill, second edition.

Korf, R. 1990. Real-time heuristic search. Artificial Intelligence 42(2-3):189-211.

Korf, R. 1993. Linear-space best-first search. Artificial Intelligence 62(1):41-78.

Littman, M. 1996. Algorithms for Sequential Decision Making. Ph.D. Dissertation, Department of Computer Science, Brown University, Providence (Rhode Island). Available as Technical Report CS-96-09.

Lozano-Perez, T.; Mason, M.; and Taylor, R. 1984. Automatic synthesis of fine-motion strategies for robots. International Journal of Robotics Research 3(1):3-24.

Nourbakhsh, I., and Genesereth, M. 1997. Teaching AI with robots. In Kortenkamp, D.; Bonasso, R.; and Murphy, R., eds., Artificial Intelligence Based Mobile Robotics: Case Studies of Successful Robot Systems. MIT Press.

Nourbakhsh, I. 1996. Interleaving Planning and Execution. Ph.D. Dissertation, Department of Computer Science, Stanford University, Stanford (California).

Pearl, J. 1985. Heuristics: Intelligent Search Strategies for Computer Problem Solving. Addison-Wesley.

Russell, S., and Wefald, E. 1991. Do the Right Thing - Studies in Limited Rationality. MIT Press.

Schapire, R. 1992. The Design and Analysis of Efficient Learning Algorithms. MIT Press.

Stentz, A. 1995. The focussed D* algorithm for real-time replanning. In Proceedings of the International Joint Conference on Artificial Intelligence, 1652-1659.
