From: AAAI Technical Report SS-99-07. Compilation copyright © 1999, AAAI (www.aaai.org). All rights reserved.

Solving Markov Decision Problems Using Heuristic Search

Eric A. Hansen
Computer Science Department
Mississippi State University
Mississippi State, MS 39762
[email protected]

Shlomo Zilberstein
Computer Science Department
University of Massachusetts
Amherst, MA 01002
[email protected]

Abstract

We describe a heuristic search algorithm for Markov decision problems, called LAO*, that is derived from the classic heuristic search algorithm AO*. LAO* shares the advantage heuristic search has over dynamic programming for simpler classes of problems: it can find optimal solutions without evaluating all problem states. The derivation of LAO* from AO* makes it easier to generalize refinements of heuristic search developed for simpler classes of problems for use in solving Markov decision problems more efficiently.

Introduction

The Markov decision process (MDP), a model of sequential decision-making developed in operations research, allows reasoning about actions with probabilistic outcomes in determining a course of action that maximizes expected utility. It has been adopted as a framework for artificial intelligence (AI) research in decision-theoretic planning (Dean et al. 1995; Dearden and Boutilier 1997) and reinforcement learning (Barto et al. 1995). In adopting this model, AI researchers have also adopted the dynamic-programming algorithms developed in operations research for solving MDPs. A drawback of dynamic programming is that it finds a solution for all problem states. In this paper, a heuristic search approach to solving MDPs is described that represents the problem space of an MDP as a search graph and uses a heuristic evaluation function to focus computation on the "relevant" problem states, that is, on states that are reachable from a given start state.

The advantage of heuristic search over dynamic programming is well known for simpler classes of sequential decision problems. For problems for which a solution takes the form of a path from a start state to a goal state, the heuristic search algorithm A* can find an optimal path by exploring a fraction of the state space. Similarly, for problems for which a solution takes the form of a tree or an acyclic graph, the heuristic search algorithm AO* can find solutions more efficiently by focusing computation on reachable states.

Neither of these heuristic search algorithms can solve the class of sequential decision problems considered in this paper. Problems of decision-theoretic planning and reinforcement learning are typically modeled as MDPs with an indefinite (or infinite) horizon. For this class of problems, the number of actions that must be taken to achieve an objective cannot be bounded, and a solution (that has a finite representation) must contain loops. Classic heuristic search algorithms, such as A* and AO*, cannot find solutions with loops. However, Hansen and Zilberstein (1998) have recently described a novel generalization of heuristic search, called LAO*, that can. They show that LAO* can solve MDPs without evaluating all problem states. This paper reviews the LAO* algorithm and describes some extensions of it. This illustrates the relevance of the rich body of AI research on heuristic search to the problem of solving MDPs more efficiently.

MDPs

We consider an MDP with a state set, S, and a finite action set, A. Let A(i) denote the set of actions available in state i. Let p_ij(a) denote the probability that taking action a in state i results in a transition to state j. Let c_i(a) denote the expected immediate cost of taking action a in state i.

We are particularly interested in MDPs that include a start state, s ∈ S, and a set of terminal states, T ⊆ S. For every terminal state i ∈ T, we assume that no action can cause a transition out of it (i.e., it is an absorbing state). We also assume that the immediate cost of any action taken in a terminal state is zero. In other words, the control problem ends once a terminal state is reached. Because we are particularly interested in problems for which a terminal state is used to model a goal state, from now on we refer to terminal states as goal states.

For all non-terminal states, we assume that immediate costs are positive for every action, that is, c_i(a) > 0 for all states i ∉ T and actions a ∈ A(i). The objective is to find a policy that minimizes the expected cost of reaching a goal state from the start state. For some problems, it is reasonable to assume that all possible policies are proper, that is, that they reach a goal state with probability one. When this assumption is not reasonable, future costs can be discounted by a factor β, with 0 < β < 1.
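As a concrete illustration of this notation, the following minimal sketch shows one way such an MDP might be represented. The class name, field layout, and the tiny example problem are assumptions of this illustration, not part of the formal model; all code sketches in this paper are written in Python.

    # A minimal, hypothetical encoding of the MDP notation defined above.
    # A(i) is a dict from state to available actions; p_ij(a) is stored as a
    # list of (successor, probability) pairs; c_i(a) is the expected cost.
    class MDP:
        def __init__(self, states, actions, transitions, costs, start, goals):
            self.states = states              # state set S
            self.actions = actions            # A(i): state -> list of actions
            self.transitions = transitions    # (i, a) -> [(j, p_ij(a)), ...]
            self.costs = costs                # (i, a) -> c_i(a)
            self.start = start                # start state s
            self.goals = goals                # terminal (goal) states T

        def successors(self, i, a):
            """Outcomes of taking action a in state i, as (state, prob) pairs."""
            return self.transitions[(i, a)]

        def cost(self, i, a):
            """Expected immediate cost; zero in goal states by assumption."""
            return 0.0 if i in self.goals else self.costs[(i, a)]

    # A two-state example (hypothetical numbers): action 'go' reaches the goal
    # with probability 0.9 and stays in place otherwise, at cost 1.
    example = MDP(
        states={'s0', 'g'},
        actions={'s0': ['go'], 'g': []},
        transitions={('s0', 'go'): [('g', 0.9), ('s0', 0.1)]},
        costs={('s0', 'go'): 1.0},
        start='s0',
        goals={'g'},
    )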

Two standard dynamic-programming algorithms for solving MDPs, policy iteration and value iteration, are summarized in Figures 1 and 2 (Bertsekas 1995).

Figure 1: Policy iteration.

1. Start with an arbitrary policy δ.
2. Repeat until the policy does not change:
   (a) Compute the evaluation function f_δ for policy δ by solving the following set of |S| equations in |S| unknowns:

       f_δ(i) = c_i(δ(i)) + β Σ_{j∈S} p_ij(δ(i)) f_δ(j).

   (b) For each state i ∈ S,

       δ(i) := arg min_{a∈A(i)} [ c_i(a) + β Σ_{j∈S} p_ij(a) f_δ(j) ].

       Resolve ties arbitrarily, but give preference to the currently selected action.

Figure 2: Value iteration.

1. Start with an arbitrary evaluation function f.
2. Repeat until the error bound of the evaluation function is less than ε: for each state i ∈ S,

       f(i) := min_{a∈A(i)} [ c_i(a) + β Σ_{j∈S} p_ij(a) f(j) ].

3. Extract an ε-optimal policy from the evaluation function as follows. For each state i ∈ S,

       δ(i) := arg min_{a∈A(i)} [ c_i(a) + β Σ_{j∈S} p_ij(a) f(j) ].
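As an illustration, here is a minimal sketch of the value iteration scheme of Figure 2, written against the MDP class sketched earlier. Stopping on a simple Bellman-residual threshold is a simplification of the error bound referred to above, and the parameter names are assumptions of this sketch.

    def value_iteration(mdp, beta=1.0, epsilon=1e-6):
        """Repeat the backup of Figure 2 until the largest change is below epsilon."""
        f = {i: 0.0 for i in mdp.states}          # arbitrary initial evaluation
        while True:
            residual = 0.0
            for i in mdp.states:
                if i in mdp.goals or not mdp.actions[i]:
                    continue                      # goal states keep cost zero
                best = min(
                    mdp.cost(i, a) + beta * sum(p * f[j] for j, p in mdp.successors(i, a))
                    for a in mdp.actions[i])
                residual = max(residual, abs(best - f[i]))
                f[i] = best
            if residual < epsilon:
                return f

    def extract_policy(mdp, f, beta=1.0):
        """Step 3 of Figure 2: pick the minimizing action in each non-goal state."""
        return {
            i: min(mdp.actions[i],
                   key=lambda a: mdp.cost(i, a)
                   + beta * sum(p * f[j] for j, p in mdp.successors(i, a)))
            for i in mdp.states if i not in mdp.goals and mdp.actions[i]
        }

For the two-state example sketched earlier, value_iteration(example, beta=0.9) converges to f('s0') = 1/(1 − 0.9·0.1) ≈ 1.10, and extract_policy returns the action 'go' in state 's0'.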

RTDP

Because policy iteration and value iteration evaluate all states to find an optimal policy, they are computationally prohibitive for MDPs with large state sets. Barto, Bradtke, and Singh (1995) describe an algorithm called real-time dynamic programming (RTDP) that avoids exhaustive updates of the state set. RTDP is related to an algorithm called asynchronous value iteration that updates a subset of the states of an MDP each iteration. RTDP interleaves a control step with an iteration of asynchronous value iteration. In the control step, a real or simulated action is taken that causes a transition from the current state to a new state. The subset of states for which the evaluation function is updated includes the state in which the action is taken.

Figure 3 summarizes trial-based RTDP, which solves an MDP by organizing computation as a sequence of trials. At the beginning of each trial, the current state is set to the start state. Each control step, an action is selected based on a lookahead of depth one or more. The evaluation function is updated for a subset of states that includes the current state. A trial ends when a goal state is reached, or after a specified number of control steps. Because trial-based RTDP only updates states that are reachable from the start state when actions are selected greedily based on the current evaluation function, states that are not reachable may not be updated.

Figure 3: Trial-based RTDP.

1. Start with an admissible evaluation function f.
2. Perform some number of trials, where each trial sets the current state to the start state and repeats the following steps until a goal state is reached or the finite trial ends.
   (a) Find the best action for the current state based on lookahead search of depth one (or more).
   (b) Update the evaluation function for the current state (and possibly others) based on the same lookahead search used to select the best action.
   (c) Take the action selected in (a) and set the current state to the state that results from the stochastic transition caused by the action.
3. Extract a policy from the evaluation function f as follows. For each state i ∈ S,

       δ(i) := arg min_{a∈A(i)} [ c_i(a) + β Σ_{j∈S} p_ij(a) f(j) ].
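The trial-based procedure of Figure 3 can be sketched as follows, again using the MDP class assumed earlier. The depth-one lookahead, the fixed trial and step limits, and the inverse-transform sampling of outcomes are choices of this illustration rather than prescriptions of the algorithm.

    import random

    def rtdp_trials(mdp, h, num_trials=100, max_steps=1000, beta=1.0):
        """Trial-based RTDP (Figure 3) with depth-one lookahead."""
        f = {}                                    # evaluation function, filled lazily

        def value(i):                             # default to h(i), or 0 at a goal
            return f.get(i, 0.0 if i in mdp.goals else h(i))

        for _ in range(num_trials):
            state, steps = mdp.start, 0
            while state not in mdp.goals and steps < max_steps:
                # Steps (a) and (b): one-step lookahead selects the best action
                # and backs its cost up to the current state.
                best_a, best_q = None, float('inf')
                for a in mdp.actions[state]:
                    q = mdp.cost(state, a) + beta * sum(
                        p * value(j) for j, p in mdp.successors(state, a))
                    if q < best_q:
                        best_a, best_q = a, q
                f[state] = best_q
                # Step (c): simulate the stochastic transition caused by best_a.
                r, acc = random.random(), 0.0
                outcomes = mdp.successors(state, best_a)
                for j, p in outcomes:
                    acc += p
                    if r <= acc:
                        state = j
                        break
                else:
                    state = outcomes[-1][0]       # guard against rounding error
                steps += 1
        return f

Note that the dictionary f only acquires entries for states actually visited during the trials, which reflects RTDP's ability to avoid exhaustive updates of the state set.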
Barto, Bradtke, and Singh (1995) prove that RTDP converges to an optimal solution without necessarily evaluating all problem states. They point out that this result generalizes the convergence theorem of Korf's (1990) learning real-time heuristic search algorithm (LRTA*). This suggests the following question. Given that RTDP generalizes real-time search from problems with deterministic state transitions to MDPs with stochastic state transitions, is there a corresponding off-line search algorithm that can solve MDPs?

LAO*

Hansen and Zilberstein (1998) describe a heuristic search algorithm called LAO* that solves the same class of problems as RTDP. It generalizes the heuristic search algorithm AO* by allowing it to find solutions with loops. Unlike RTDP and other dynamic-programming algorithms, LAO* does not represent a solution (or policy) as a simple mapping from states to actions. Instead, it represents a solution as a cyclic graph. This representation generalizes the graphical representations of a solution used by search algorithms like A* (a simple path) and AO* (an acyclic graph). The advantage of representing a solution in the form of a graph is that it exhibits reachability among states explicitly and makes analysis of reachability easier.

When an MDP is formalized as a graph-search problem, each state of the graph corresponds to a problem state and each arc corresponds to an action causing a transition from one state to another. When state transitions are stochastic, an arc is a hyperarc or k-connector that connects a state to a set of k successor states, with a probability attached to each successor. A graph that contains hyperarcs is called a hypergraph and corresponds to an AND/OR graph (Martelli and Montanari 1978; Nilsson 1980). A solution to an MDP formalized as an AND/OR graph is a subgraph of the AND/OR graph called a solution graph, defined as follows:

• the start state belongs to a solution graph;
• for every non-goal state in a solution graph, exactly one action (outgoing k-connector) is part of the solution graph and each of its successor states belongs to the solution graph;
• every directed path in a solution graph terminates at a goal state.
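The reachability view of a solution graph can be made concrete with a small helper, sketched here under the same representation as before. Given an assignment of actions to states, it collects the states reachable from the start state, which is exactly the state set of the corresponding (partial) solution graph.

    def solution_graph_states(mdp, policy):
        """States reachable from the start state by following policy's actions.

        Goal states and states with no assigned action (the tip states of a
        partial solution graph) are included but not searched below.
        """
        reachable, stack = set(), [mdp.start]
        while stack:
            i = stack.pop()
            if i in reachable:
                continue
            reachable.add(i)
            if i in mdp.goals or i not in policy:
                continue
            for j, _p in mdp.successors(i, policy[i]):
                stack.append(j)
        return reachable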

Unlike dynamic programming, a heuristic search algorithm can find an optimal solution graph without evaluating all states. Therefore a graph is not usually supplied explicitly to the search algorithm. We refer to G as the implicit graph; it is specified implicitly by a start state s and a successor function. The search algorithm works on an explicit graph, G′, that initially consists only of the start state. A tip state of the explicit graph is a state that does not have any successors in the explicit graph. A tip state that does not correspond to a goal state is called a non-terminal tip state. A non-terminal tip state can be expanded by adding to the explicit graph its outgoing connectors and any successor states not already in the explicit graph.

In all heuristic search algorithms, three sets of states can be distinguished. The implicit graph contains all problem states. The explicit graph contains those states evaluated in the course of the search. The solution graph contains those states that are reachable from the start state when a minimal-cost solution is followed. The objective of a best-first heuristic search algorithm is to find an optimal solution for a given start state while generating as small an explicit graph as possible.

A heuristic search algorithm for acyclic AND/OR graphs, called AO*, is described by Martelli and Montanari (1973, 1978) and Nilsson (1980). LAO* is a simple generalization of AO* that can find solutions with loops in AND/OR graphs containing cycles. Like AO*, it has two principal steps: a step that performs forward expansion of the best partial solution and a dynamic-programming step. Both steps are the same as in AO* except that they allow a solution graph to contain loops.

Both AO* and LAO* repeatedly expand the best partial solution until a complete solution is found. A partial solution graph is defined the same as a solution graph, except that a directed path may end at a non-terminal tip state. For every non-terminal tip state i of a partial solution graph, we assume there is an admissible heuristic estimate h(i) of the minimal-cost solution graph for it. A heuristic evaluation function h is said to be admissible if h(i) ≤ f*(i) for every state i. We can recursively calculate an admissible heuristic estimate f(i) of the optimal cost of any state i in the explicit graph as follows:

    f(i) = 0                                                  if i is a goal state,
    f(i) = h(i)                                               if i is a non-terminal tip state,
    f(i) = min_{a∈A(i)} [ c_i(a) + β Σ_{j∈S} p_ij(a) f(j) ]   otherwise.

For a partial solution graph that does not contain loops, these heuristic estimates can be calculated by dynamic programming in the form of a simple backwards induction algorithm, and this is what AO* does. For a partial solution graph that contains loops, dynamic programming cannot be performed in the same way. However, it can be performed by using policy iteration or value iteration, and this simple generalization of AO* creates the algorithm LAO*. Figure 4 summarizes the shared structure of AO* and LAO*. In both algorithms, dynamic programming is used to find the best partial solution graph by updating state costs and finding the best action for each state.

Figure 4: AO* and LAO*.

1. The explicit graph G′ initially consists of the start state s.
2. Identify the best partial solution graph and its non-terminal tip states by searching forward from the start state and following the best action for each state. If the best solution graph has no non-terminal tip states, go to 7.
3. Expand some non-terminal tip state n of the best partial solution graph and add any new successor states to G′. For each new state i added to G′ by expanding n, if i is a goal state then f(i) := 0; else f(i) := h(i).
4. Identify the ancestors in the explicit graph of the expanded state n and create a set Z that contains the expanded state and all its ancestors.
5. Perform dynamic programming on the states in set Z to update state costs and determine the best action for each state. (AO* performs dynamic programming using backwards induction. LAO* uses policy iteration or value iteration.)
6. Go to 2.
7. Return the solution graph.

In both AO* and LAO*, dynamic programming is performed on the set of states that includes the expanded state and all of its ancestors in the explicit graph. (More efficient versions of AO* can avoid performing dynamic programming on all ancestor states, and the same is possible for LAO*, but we do not consider that possibility here.) Some of these states may have successor states that are not in this set of states but are still part of the explicit graph. The costs of these successor states are treated as constants by the dynamic-programming algorithm because they cannot be affected by any change in the cost of the expanded state or its ancestors.
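Putting the pieces together, here is a compact sketch of the loop of Figure 4, reusing the MDP class and the solution_graph_states helper sketched earlier. It uses value iteration restricted to Z as the dynamic-programming step; the fixed number of sweeps, the arbitrary choice of which tip state to expand, and the ancestor computation by reverse reachability over the explicit graph are simplifications of this illustration, and the convergence test discussed in the text is omitted.

    def lao_star(mdp, h, beta=1.0, dp_sweeps=50):
        """A sketch of AO*/LAO* as summarized in Figure 4."""
        f = {mdp.start: 0.0 if mdp.start in mdp.goals else h(mdp.start)}
        expanded = set()                          # states whose successors are in G'
        policy = {}                               # best action found so far

        def backup(i):
            """Best (cost, action) for state i; assumes non-goal states have actions."""
            return min(
                ((mdp.cost(i, a) + beta * sum(p * f[j] for j, p in mdp.successors(i, a)), a)
                 for a in mdp.actions[i]),
                key=lambda qa: qa[0])

        def ancestors(n):
            """Expanded states from which n can be reached through the explicit graph."""
            anc, changed = set(), True
            while changed:
                changed = False
                for i in expanded - anc:
                    for a in mdp.actions[i]:
                        if any(j == n or j in anc for j, _p in mdp.successors(i, a)):
                            anc.add(i)
                            changed = True
                            break
            return anc

        while True:
            # Step 2: forward search along current best actions.
            best_graph = solution_graph_states(mdp, policy)
            tips = [i for i in best_graph if i not in mdp.goals and i not in expanded]
            if not tips:
                return f, policy                  # Step 7: no non-terminal tip states
            n = tips[0]                           # Step 3: expand one tip state
            expanded.add(n)
            for a in mdp.actions[n]:
                for j, _p in mdp.successors(n, a):
                    f.setdefault(j, 0.0 if j in mdp.goals else h(j))
            Z = {n} | ancestors(n)                # Step 4
            # Step 5: value iteration on Z only; costs outside Z are constants.
            for _ in range(dp_sweeps):
                for i in Z:
                    f[i], policy[i] = backup(i)
            # Step 6: repeat.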

The dynamic-programming step of LAO* can be performed using either policy iteration or value iteration. An advantage of using policy iteration is that it computes an exact cost for each state of the explicit graph after a finite number of iterations, based on the heuristic estimates at the tip states. When value iteration is used, convergence to exact state costs is asymptotic. However, this disadvantage may be offset by the improved efficiency of value iteration for larger problems.

It is straightforward to show that LAO* shares the properties of AO* and other heuristic search algorithms. Given an admissible heuristic evaluation function, all state costs in the explicit graph are admissible after each step and LAO* converges to an optimal solution without (necessarily) evaluating all problem states (Hansen and Zilberstein 1998).

LAO* converges when the best solution graph does not have any non-terminal tip states. When policy iteration is used in the dynamic-programming step, the solution to which LAO* converges is optimal. When value iteration is used in the dynamic-programming step, the solution to which LAO* converges is ε-optimal, where ε is the error bound of the last value iteration step. (The error bound can be made arbitrarily small by increasing the number of iterations of value iteration before checking for convergence.)

LAO* is closely related to an "envelope" approach to policy and value iteration described by Dean et al. (1995) and Tash and Russell (1994). However, the envelope approach is not derived from AO*, and no proof is given that it converges to an optimal solution without evaluating all problem states.

Extensions

LAO* can solve the same class of MDPs as RTDP, and both algorithms converge to an optimal solution without evaluating all problem states. An advantage of deriving LAO* from the heuristic search algorithm AO* is that it makes it easier to generalize refinements of simpler heuristic search algorithms for use in solving MDPs more efficiently. This is illustrated in this section by extending LAO* in ways suggested by its derivation from AO*.

Pathmax

A heuristic evaluation function h is said to be consistent if it is both admissible and h(i) ≤ c_i(a) + β Σ_{j∈S} p_ij(a) h(j) for every state i and action a ∈ A(i). Consistency is a desirable property because it ensures that state costs increase monotonically as the algorithm converges. If a heuristic evaluation function is admissible but not consistent, state costs need not increase monotonically. However, they can be made to do so by adding a pathmax operation to the update step of the algorithm. This common improvement of A* and AO* can also be made to LAO*, by changing the formula used to update state costs as follows:

    f(i) := max { f(i), min_{a∈A(i)} [ c_i(a) + β Σ_{j∈S} p_ij(a) f(j) ] }.

Use of the pathmax operation in heuristic search can reduce the number of states that must be expanded to find an optimal solution.
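In code, the pathmax variant changes only one line of the backup used in the earlier sketches (a sketch under the same assumptions as before):

    def pathmax_backup(mdp, f, i, beta=1.0):
        """Backup for state i with the pathmax operation: costs never decrease."""
        best_q, best_a = float('inf'), None
        for a in mdp.actions[i]:
            q = mdp.cost(i, a) + beta * sum(p * f[j] for j, p in mdp.successors(i, a))
            if q < best_q:
                best_q, best_a = q, a
        # Keep the larger of the old and the backed-up cost, so that state costs
        # increase monotonically even when the heuristic is not consistent.
        f[i] = max(f[i], best_q)
        return best_a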
Solve-labeling

Descriptions of the AO* algorithm usually include a solve-labeling procedure that improves its efficiency. Briefly, a state is labeled solved if it is a goal state or if all of its successor states are labeled solved. In other words, a state is labeled solved when an optimal solution is found for it. (The search algorithm converges when the start state is labeled solved.) Labeling states as solved can improve the efficiency of the forward search step of AO* because it is unnecessary to search below a solved state for non-terminal tip states.

Solve-labeling is useful for problems that are decomposable into independent subproblems. But in a cyclic solution, it is possible for every state to be reachable from every other state. Therefore an acyclic solution found by AO* is more likely to contain independent subproblems than a cyclic solution found by LAO*. Nevertheless, it is possible for a solution found by LAO* to contain independent subproblems, and, when it does, the same solve-labeling procedure used by AO* may improve the efficiency of LAO*.
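Solve-labeling can be sketched as a fixed-point propagation over the currently assigned best actions. The dictionary policy below, mapping each expanded state to its best action, is an assumption of this illustration.

    def update_solved_labels(mdp, policy, solved=frozenset()):
        """Label states solved: goal states, and states all of whose successors
        under their best action are already labeled solved."""
        solved = set(solved) | set(mdp.goals)
        changed = True
        while changed:
            changed = False
            for i, a in policy.items():
                if i not in solved and all(
                        j in solved for j, _p in mdp.successors(i, a)):
                    solved.add(i)
                    changed = True
        # In a cyclic solution this rule may label few states beyond the goals,
        # which is why solve-labeling helps LAO* less often than it helps AO*.
        return solved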

Weighted heuristic

Studies have shown that use of a weighted heuristic can improve the efficiency of A* in exchange for a bounded decrease in the optimality of the solution it finds (Pohl 1973). Given the familiar decomposition of the evaluation function, f(i) = g(i) + h(i), a weight w, with 0.5 < w < 1, can be used to create a weighted heuristic, f(i) = (1 − w) g(i) + w h(i). Use of this heuristic guarantees that solutions found by A* are no more than a factor of w/(1 − w) worse than optimal. Chakrabarti et al. (1988) show how to use a weighted heuristic with AO* by describing a similar decomposition of the evaluation function. It is straightforward to extend their approach to LAO*. This makes possible an ε-admissible version of LAO* that can find an ε-optimal solution by evaluating a fraction of the states that LAO* might have to evaluate to find an optimal solution.

Before a state is expanded by LAO*, its g-cost is zero and its f-cost equals its h-cost. After a state is expanded, its f-cost is updated as follows:

    f(i) := min_{a∈A(i)} [ c_i(a) + β Σ_{j∈S} p_ij(a) f(j) ].

Having determined the action a that optimizes f(i), the g-cost and h-cost are updated as follows:

    g(i) := c_i(a) + β Σ_{j∈S} p_ij(a) g(j),
    h(i) := β Σ_{j∈S} p_ij(a) h(j).

The g-cost of a state represents the backed-up cost of the expanded descendents along the best partial solution graph. The h-cost of a state represents the backed-up estimated cost of the unexpanded descendents on the frontier of the best partial solution graph. When LAO* converges to a complete solution, the h-cost of every state visited by the solution equals zero and its f-cost equals its g-cost. Given this f = g + h decomposition of the evaluation function, it is straightforward to create a weighted heuristic for LAO*.
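One natural way to realize this weighted heuristic in LAO*'s backup is sketched below. The dictionaries g and h_cost (holding the g- and h-costs of all successor states) and the default weight are assumptions of this illustration.

    def weighted_backup(mdp, g, h_cost, i, w=0.6, beta=1.0):
        """Back up state i using the weighted evaluation (1-w)*g + w*h."""
        def weighted_f(j):
            return (1.0 - w) * g[j] + w * h_cost[j]

        # Choose the action that minimizes the weighted evaluation function.
        best_a = min(
            mdp.actions[i],
            key=lambda a: mdp.cost(i, a) + beta * sum(
                p * weighted_f(j) for j, p in mdp.successors(i, a)))
        # Back up the g-cost (expanded descendants) and the h-cost (frontier
        # estimate) separately along the chosen action.
        g[i] = mdp.cost(i, best_a) + beta * sum(
            p * g[j] for j, p in mdp.successors(i, best_a))
        h_cost[i] = beta * sum(p * h_cost[j] for j, p in mdp.successors(i, best_a))
        return best_a, (1.0 - w) * g[i] + w * h_cost[i]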
Discussion

We have described a generalization of heuristic search, called LAO*, that can find optimal solutions for indefinite-horizon MDPs without evaluating all problem states. Its derivation from AO* provides a foundation for a heuristic search approach to solving MDPs. It also makes it easier to generalize enhancements of simpler search algorithms for use in solving MDPs more efficiently. We note that the enhancements described in this paper (with the exception of solve-labeling) may also be used to improve the efficiency of RTDP.

Barto et al. (1995) do not discuss how to detect the convergence of RTDP to an optimal solution. We have described how to detect the convergence of LAO* to an optimal solution, or an ε-optimal solution. An interesting question is whether similar convergence tests can be developed for RTDP and similar algorithms.

LAO* and RTDP are especially useful for problems for which an optimal solution, given a start state, visits a fraction of the state space. It is easy to describe such problems. But there are also MDPs with optimal solutions that visit the entire state space. An interesting question is how to recognize problems for which an optimal solution visits a fraction of the state space and distinguish them from problems for which an optimal solution visits most, or all, of the state space. Even for problems for which an optimal solution visits the entire state space, LAO* may find useful partial solutions by focusing computation on states that are most likely to be visited from a start state. This possibility motivates the envelope algorithms of Dean et al. (1995) and Tash and Russell (1994).

Acknowledgments

Support for this work was provided in part by the National Science Foundation under grants IRI-9624992 and IRI-9634938.

References

Barto, A.G.; Bradtke, S.J.; and Singh, S.P. 1995. Learning to act using real-time dynamic programming. Artificial Intelligence 72:81-138.

Bertsekas, D. 1995. Dynamic Programming and Optimal Control. Belmont, MA: Athena Scientific.

Chakrabarti, P.P.; Ghose, S.; and DeSarkar, S.C. 1988. Admissibility of AO* when heuristics overestimate. Artificial Intelligence 34:97-113.

Dean, T.; Kaelbling, L.P.; Kirman, J.; and Nicholson, A. 1995. Planning under time constraints in stochastic domains. Artificial Intelligence 76:35-74.

Dearden, R. and Boutilier, C. 1997. Abstraction and approximate decision-theoretic planning. Artificial Intelligence 89:219-283.

Hansen, E.A. and Zilberstein, S. 1998. Heuristic search in cyclic AND/OR graphs. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, 412-418. Madison, WI.

Korf, R. 1990. Real-time heuristic search. Artificial Intelligence 42:189-211.

Martelli, A. and Montanari, U. 1973. Additive AND/OR graphs. In Proceedings of the Third International Joint Conference on Artificial Intelligence, 1-11. Stanford, CA.

Martelli, A. and Montanari, U. 1978. Optimizing decision trees through heuristically guided search. Communications of the ACM 21(12):1025-1039.

Nilsson, N.J. 1980. Principles of Artificial Intelligence. Palo Alto, CA: Tioga Publishing Company.

Pohl, I. 1973. First results on the effect of error in heuristic search. Machine Intelligence 5:219-236.

Tash, J. and Russell, S. 1994. Control strategies for a stochastic planner. In Proceedings of the Twelfth National Conference on Artificial Intelligence, 1079-1085. Seattle, WA.
