From: AAAI Technical Report SS-99-07. Compilation copyright © 1999, AAAI (www.aaai.org). All rights reserved.

Solving Markov Decision Problems Using Heuristic Search

Eric A. Hansen
Computer Science Department
Mississippi State University
Mississippi State, MS 39762
[email protected]

Shlomo Zilberstein
Computer Science Department
University of Massachusetts
Amherst, MA 01002
[email protected]

Abstract

We describe a heuristic search algorithm for Markov decision problems, called LAO*, that is derived from the classic heuristic search algorithm AO*. LAO* shares the advantage heuristic search has over dynamic programming for simpler classes of problems: it can find optimal solutions without evaluating all problem states. The derivation of LAO* from AO* makes it easier to generalize refinements of heuristic search developed for simpler classes of problems for use in solving Markov decision problems more efficiently.

Introduction

The Markov decision process (MDP), a model of sequential decision-making developed in operations research, allows reasoning about actions with probabilistic outcomes in determining a course of action that maximizes expected utility. It has been adopted as a framework for artificial intelligence (AI) research in decision-theoretic planning (Dean et al. 1995; Dearden and Boutilier 1997) and reinforcement learning (Barto et al. 1995). In adopting this model, AI researchers have also adopted the dynamic-programming algorithms developed in operations research for solving MDPs. A drawback of dynamic programming is that it finds a solution for all problem states. In this paper, a heuristic search approach to solving MDPs is described that represents the problem space of an MDP as a search graph and uses a heuristic evaluation function to focus computation on the "relevant" problem states, that is, on states that are reachable from a given start state.

The advantage of heuristic search over dynamic programming is well known for simpler classes of sequential decision problems. For problems for which a solution takes the form of a path from a start state to a goal state, the heuristic search algorithm A* can find an optimal path by exploring a fraction of the state space. Similarly, for problems for which a solution takes the form of a tree or an acyclic graph, the heuristic search algorithm AO* can find solutions more efficiently by focusing computation on reachable states.

Neither of these heuristic search algorithms can solve the class of sequential decision problems considered in this paper. Problems of decision-theoretic planning and reinforcement learning are typically modeled as MDPs with an indefinite (or infinite) horizon. For this class of problems, the number of actions that must be taken to achieve an objective cannot be bounded, and a solution (that has a finite representation) must contain loops. Classic heuristic search algorithms, such as A* and AO*, cannot find solutions with loops. However, Hansen and Zilberstein (1998) have recently described a novel generalization of heuristic search, called LAO*, that can. They show that LAO* can solve MDPs without evaluating all problem states. This paper reviews the LAO* algorithm and describes some extensions of it. This illustrates the relevance of the rich body of AI research on heuristic search to the problem of solving MDPs more efficiently.

MDPs

We consider an MDP with a state set, S, and a finite action set, A. Let A(i) denote the set of actions available in state i. Let p_ij(a) denote the probability that taking action a in state i results in a transition to state j. Let c_i(a) denote the expected immediate cost of taking action a in state i.

We are particularly interested in MDPs that include a start state, s ∈ S, and a set of terminal states, T ⊆ S. For every terminal state i ∈ T, we assume that no action can cause a transition out of it (i.e., it is an absorbing state). We also assume that the immediate cost of any action taken in a terminal state is zero. In other words, the control problem ends once a terminal state is reached. Because we are particularly interested in problems for which a terminal state is used to model a goal state, from now on we refer to terminal states as goal states.

For all non-terminal states, we assume that immediate costs are positive for every action, that is, c_i(a) > 0 for all states i ∉ T and actions a ∈ A(i). The objective is to find a policy that minimizes the expected cost of reaching a goal state from the start state. For some problems, it is reasonable to assume that all possible policies are proper, that is, that they reach a goal state with probability one. When this assumption is not reasonable, future costs can be discounted by a factor β, with 0 < β < 1.
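As a concrete illustration of this notation, the following minimal sketch shows one way such an MDP might be represented. The class name, field layout, and the tiny example problem are assumptions of this illustration, not part of the formal model; all code sketches in this paper are written in Python.

    # A minimal, hypothetical encoding of the MDP notation defined above.
    # A(i) is a dict from state to available actions; p_ij(a) is stored as a
    # list of (successor, probability) pairs; c_i(a) is the expected cost.
    class MDP:
        def __init__(self, states, actions, transitions, costs, start, goals):
            self.states = states              # state set S
            self.actions = actions            # A(i): state -> list of actions
            self.transitions = transitions    # (i, a) -> [(j, p_ij(a)), ...]
            self.costs = costs                # (i, a) -> c_i(a)
            self.start = start                # start state s
            self.goals = goals                # terminal (goal) states T

        def successors(self, i, a):
            """Outcomes of taking action a in state i, as (state, prob) pairs."""
            return self.transitions[(i, a)]

        def cost(self, i, a):
            """Expected immediate cost; zero in goal states by assumption."""
            return 0.0 if i in self.goals else self.costs[(i, a)]

    # A two-state example (hypothetical numbers): action 'go' reaches the goal
    # with probability 0.9 and stays in place otherwise, at cost 1.
    example = MDP(
        states={'s0', 'g'},
        actions={'s0': ['go'], 'g': []},
        transitions={('s0', 'go'): [('g', 0.9), ('s0', 0.1)]},
        costs={('s0', 'go'): 1.0},
        start='s0',
        goals={'g'},
    )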

Two standard dynamic-programming algorithms for solving MDPs, policy iteration and value iteration, are summarized in Figures 1 and 2 (Bertsekas 1995).

Figure 1: Policy iteration.

1. Start with an arbitrary policy δ.
2. Repeat until the policy does not change:
   (a) Compute the evaluation function f_δ for policy δ by solving the following set of |S| equations in |S| unknowns:

       f_δ(i) = c_i(δ(i)) + β Σ_{j∈S} p_ij(δ(i)) f_δ(j).

   (b) For each state i ∈ S,

       δ(i) := arg min_{a∈A(i)} [ c_i(a) + β Σ_{j∈S} p_ij(a) f_δ(j) ].

       Resolve ties arbitrarily, but give preference to the currently selected action.

Figure 2: Value iteration.

1. Start with an arbitrary evaluation function f.
2. Repeat until the error bound of the evaluation function is less than ε: for each state i ∈ S,

       f(i) := min_{a∈A(i)} [ c_i(a) + β Σ_{j∈S} p_ij(a) f(j) ].

3. Extract an ε-optimal policy from the evaluation function as follows. For each state i ∈ S,

       δ(i) := arg min_{a∈A(i)} [ c_i(a) + β Σ_{j∈S} p_ij(a) f(j) ].
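As an illustration, here is a minimal sketch of the value iteration scheme of Figure 2, written against the MDP class sketched earlier. Stopping on a simple Bellman-residual threshold is a simplification of the error bound referred to above, and the parameter names are assumptions of this sketch.

    def value_iteration(mdp, beta=1.0, epsilon=1e-6):
        """Repeat the backup of Figure 2 until the largest change is below epsilon."""
        f = {i: 0.0 for i in mdp.states}          # arbitrary initial evaluation
        while True:
            residual = 0.0
            for i in mdp.states:
                if i in mdp.goals or not mdp.actions[i]:
                    continue                      # goal states keep cost zero
                best = min(
                    mdp.cost(i, a) + beta * sum(p * f[j] for j, p in mdp.successors(i, a))
                    for a in mdp.actions[i])
                residual = max(residual, abs(best - f[i]))
                f[i] = best
            if residual < epsilon:
                return f

    def extract_policy(mdp, f, beta=1.0):
        """Step 3 of Figure 2: pick the minimizing action in each non-goal state."""
        return {
            i: min(mdp.actions[i],
                   key=lambda a: mdp.cost(i, a)
                   + beta * sum(p * f[j] for j, p in mdp.successors(i, a)))
            for i in mdp.states if i not in mdp.goals and mdp.actions[i]
        }

For the two-state example sketched earlier, value_iteration(example, beta=0.9) converges to f('s0') = 1/(1 − 0.9·0.1) ≈ 1.10, and extract_policy returns the action 'go' in state 's0'.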

RTDP

Because policy iteration and value iteration evaluate all states to find an optimal policy, they are computationally prohibitive for MDPs with large state sets. Barto, Bradtke, and Singh (1995) describe an algorithm called real-time dynamic programming (RTDP) that avoids exhaustive updates of the state set. RTDP is related to an algorithm called asynchronous value iteration that updates a subset of the states of an MDP each iteration. RTDP interleaves a control step with an iteration of asynchronous value iteration. In the control step, a real or simulated action is taken that causes a transition from the current state to a new state. The subset of states for which the evaluation function is updated includes the state in which the action is taken.

Figure 3 summarizes trial-based RTDP, which solves an MDP by organizing computation as a sequence of trials. At the beginning of each trial, the current state is set to the start state. Each control step, an action is selected based on a lookahead of depth one or more. The evaluation function is updated for a subset of states that includes the current state. A trial ends when a goal state is reached, or after a specified number of control steps. Because trial-based RTDP only updates states that are reachable from the start state when actions are selected greedily based on the current evaluation function, states that are not reachable may not be updated.

Figure 3: Trial-based RTDP.

1. Start with an admissible evaluation function f.
2. Perform some number of trials, where each trial sets the current state to the start state and repeats the following steps until a goal state is reached or the finite trial ends.
   (a) Find the best action for the current state based on lookahead search of depth one (or more).
   (b) Update the evaluation function for the current state (and possibly others) based on the same lookahead search used to select the best action.
   (c) Take the action selected in (a) and set the current state to the state that results from the stochastic transition caused by the action.
3. Extract a policy from the evaluation function f as follows. For each state i ∈ S,

       δ(i) := arg min_{a∈A(i)} [ c_i(a) + β Σ_{j∈S} p_ij(a) f(j) ].
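The trial-based procedure of Figure 3 can be sketched as follows, again using the MDP class assumed earlier. The depth-one lookahead, the fixed trial and step limits, and the inverse-transform sampling of outcomes are choices of this illustration rather than prescriptions of the algorithm.

    import random

    def rtdp_trials(mdp, h, num_trials=100, max_steps=1000, beta=1.0):
        """Trial-based RTDP (Figure 3) with depth-one lookahead."""
        f = {}                                    # evaluation function, filled lazily

        def value(i):                             # default to h(i), or 0 at a goal
            return f.get(i, 0.0 if i in mdp.goals else h(i))

        for _ in range(num_trials):
            state, steps = mdp.start, 0
            while state not in mdp.goals and steps < max_steps:
                # Steps (a) and (b): one-step lookahead selects the best action
                # and backs its cost up to the current state.
                best_a, best_q = None, float('inf')
                for a in mdp.actions[state]:
                    q = mdp.cost(state, a) + beta * sum(
                        p * value(j) for j, p in mdp.successors(state, a))
                    if q < best_q:
                        best_a, best_q = a, q
                f[state] = best_q
                # Step (c): simulate the stochastic transition caused by best_a.
                r, acc = random.random(), 0.0
                outcomes = mdp.successors(state, best_a)
                for j, p in outcomes:
                    acc += p
                    if r <= acc:
                        state = j
                        break
                else:
                    state = outcomes[-1][0]       # guard against rounding error
                steps += 1
        return f

Note that the dictionary f only acquires entries for states actually visited during the trials, which reflects RTDP's ability to avoid exhaustive updates of the state set.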
Barto, Bradtke, and Singh (1995) prove that RTDP converges to an optimal solution without necessarily evaluating all problem states. They point out that this result generalizes the convergence theorem of Korf's (1990) learning real-time heuristic search algorithm (LRTA*). This suggests the following question. Given that RTDP generalizes real-time search from problems with deterministic state transitions to MDPs with stochastic state transitions, is there a corresponding off-line search algorithm that can solve MDPs?

LAO*

Hansen and Zilberstein (1998) describe a heuristic search algorithm called LAO* that solves the same class of problems as RTDP. It generalizes the heuristic search algorithm AO* by allowing it to find solutions with loops. Unlike RTDP and other dynamic-programming algorithms, LAO* does not represent a solution (or policy) as a simple mapping from states to actions. Instead, it represents a solution as a cyclic graph. This representation generalizes the graphical representations of a solution used by search algorithms like A* (a simple path) and AO* (an acyclic graph). The advantage of representing a solution in the form of a graph is that it exhibits reachability among states explicitly and makes analysis of reachability easier.

When an MDP is formalized as a graph-search problem, each state of the graph corresponds to a problem state and each arc corresponds to an action causing a transition from one state to another. When state transitions are stochastic, an arc is a hyperarc or k-connector that connects a state to a set of k successor states, with a probability attached to each successor. A graph that contains hyperarcs is called a hypergraph and corresponds to an AND/OR graph (Martelli and Montanari 1978; Nilsson 1980). A solution to an MDP formalized as an AND/OR graph is a subgraph of the AND/OR graph called a solution graph, defined as follows:

• the start state belongs to a solution graph;
• for every non-goal state in a solution graph, exactly one action (outgoing k-connector) is part of the solution graph and each of its successor states belongs to the solution graph;
• every directed path in a solution graph terminates at a goal state.
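The reachability view of a solution graph can be made concrete with a small helper, sketched here under the same representation as before. Given an assignment of actions to states, it collects the states reachable from the start state, which is exactly the state set of the corresponding (partial) solution graph.

    def solution_graph_states(mdp, policy):
        """States reachable from the start state by following policy's actions.

        Goal states and states with no assigned action (the tip states of a
        partial solution graph) are included but not searched below.
        """
        reachable, stack = set(), [mdp.start]
        while stack:
            i = stack.pop()
            if i in reachable:
                continue
            reachable.add(i)
            if i in mdp.goals or i not in policy:
                continue
            for j, _p in mdp.successors(i, policy[i]):
                stack.append(j)
        return reachable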

Unlike dynamic programming, a heuristic search algorithm can find an optimal solution graph without evaluating all states. Therefore a graph is not usually supplied explicitly to the search algorithm. We refer to G as the implicit graph; it is specified implicitly by a start state s and a successor function. The search algorithm works on an explicit graph, G′, that initially consists only of the start state. A tip state of the explicit graph is a state that does not have any successors in the explicit graph. A tip state that does not correspond to a goal state is called a non-terminal tip state. A non-terminal tip state can be expanded by adding to the explicit graph its outgoing connectors and any successor states not already in the explicit graph.

In all heuristic search algorithms, three sets of states can be distinguished. The implicit graph contains all problem states. The explicit graph contains those states evaluated in the course of the search. The solution graph contains those states that are reachable from the start state when a minimal-cost solution is followed. The objective of a best-first heuristic search algorithm is to find an optimal solution for a given start state while generating as small an explicit graph as possible.

A heuristic search algorithm for acyclic AND/OR graphs, called AO*, is described by Martelli and Montanari (1973, 1978) and Nilsson (1980). LAO* is a simple generalization of AO* that can find solutions with loops in AND/OR graphs containing cycles. Like AO*, it has two principal steps: a step that performs forward expansion of the best partial solution and a dynamic-programming step. Both steps are the same as in AO* except that they allow a solution graph to contain loops.

Both AO* and LAO* repeatedly expand the best partial solution until a complete solution is found. A partial solution graph is defined the same as a solution graph, except that a directed path may end at a non-terminal tip state. For every non-terminal tip state i of a partial solution graph, we assume there is an admissible heuristic estimate h(i) of the minimal-cost solution graph for it. A heuristic evaluation function h is said to be admissible if h(i) ≤ f*(i) for every state i. We can recursively calculate an admissible heuristic estimate f(i) of the optimal cost of any state i in the explicit graph as follows:

    f(i) = 0                                                  if i is a goal state,
    f(i) = h(i)                                               if i is a non-terminal tip state,
    f(i) = min_{a∈A(i)} [ c_i(a) + β Σ_{j∈S} p_ij(a) f(j) ]   otherwise.

For a partial solution graph that does not contain loops, these heuristic estimates can be calculated by dynamic programming in the form of a simple backwards induction algorithm, and this is what AO* does. For a partial solution graph that contains loops, dynamic programming cannot be performed in the same way. However, it can be performed by using policy iteration or value iteration, and this simple generalization of AO* creates the algorithm LAO*. Figure 4 summarizes the shared structure of AO* and LAO*. In both algorithms, dynamic programming is used to find the best partial solution graph by updating state costs and finding the best action for each state.

Figure 4: AO* and LAO*.

1. The explicit graph G′ initially consists of the start state s.
2. Identify the best partial solution graph and its non-terminal tip states by searching forward from the start state and following the best action for each state. If the best solution graph has no non-terminal tip states, go to 7.
3. Expand some non-terminal tip state n of the best partial solution graph and add any new successor states to G′. For each new state i added to G′ by expanding n, if i is a goal state then f(i) := 0; else f(i) := h(i).
4. Identify the ancestors in the explicit graph of the expanded state n and create a set Z that contains the expanded state and all its ancestors.
5. Perform dynamic programming on the states in set Z to update state costs and determine the best action for each state. (AO* performs dynamic programming using backwards induction. LAO* uses policy iteration or value iteration.)
6. Go to 2.
7. Return the solution graph.

In both AO* and LAO*, dynamic programming is performed on the set of states that includes the expanded state and all of its ancestors in the explicit graph. (More efficient versions of AO* can avoid performing dynamic programming on all ancestor states, and the same is possible for LAO*, but we do not consider that possibility here.) Some of these states may have successor states that are not in this set of states but are still part of the explicit graph. The costs of these successor states are treated as constants by the dynamic-programming algorithm because they cannot be affected by any change in the cost of the expanded state or its ancestors.
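Putting the pieces together, here is a compact sketch of the loop of Figure 4, reusing the MDP class and the solution_graph_states helper sketched earlier. It uses value iteration restricted to Z as the dynamic-programming step; the fixed number of sweeps, the arbitrary choice of which tip state to expand, and the ancestor computation by reverse reachability over the explicit graph are simplifications of this illustration, and the convergence test discussed in the text is omitted.

    def lao_star(mdp, h, beta=1.0, dp_sweeps=50):
        """A sketch of AO*/LAO* as summarized in Figure 4."""
        f = {mdp.start: 0.0 if mdp.start in mdp.goals else h(mdp.start)}
        expanded = set()                          # states whose successors are in G'
        policy = {}                               # best action found so far

        def backup(i):
            """Best (cost, action) for state i; assumes non-goal states have actions."""
            return min(
                ((mdp.cost(i, a) + beta * sum(p * f[j] for j, p in mdp.successors(i, a)), a)
                 for a in mdp.actions[i]),
                key=lambda qa: qa[0])

        def ancestors(n):
            """Expanded states from which n can be reached through the explicit graph."""
            anc, changed = set(), True
            while changed:
                changed = False
                for i in expanded - anc:
                    for a in mdp.actions[i]:
                        if any(j == n or j in anc for j, _p in mdp.successors(i, a)):
                            anc.add(i)
                            changed = True
                            break
            return anc

        while True:
            # Step 2: forward search along current best actions.
            best_graph = solution_graph_states(mdp, policy)
            tips = [i for i in best_graph if i not in mdp.goals and i not in expanded]
            if not tips:
                return f, policy                  # Step 7: no non-terminal tip states
            n = tips[0]                           # Step 3: expand one tip state
            expanded.add(n)
            for a in mdp.actions[n]:
                for j, _p in mdp.successors(n, a):
                    f.setdefault(j, 0.0 if j in mdp.goals else h(j))
            Z = {n} | ancestors(n)                # Step 4
            # Step 5: value iteration on Z only; costs outside Z are constants.
            for _ in range(dp_sweeps):
                for i in Z:
                    f[i], policy[i] = backup(i)
            # Step 6: repeat.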

The dynamic-programming step of LAO* can be performed using either policy iteration or value iteration. An advantage of using policy iteration is that it computes an exact cost for each state of the explicit graph after a finite number of iterations, based on the heuristic estimates at the tip states. When value iteration is used, convergence to exact state costs is asymptotic. However, this disadvantage may be offset by the improved efficiency of value iteration for larger problems.

It is straightforward to show that LAO* shares the properties of AO* and other heuristic search algorithms. Given an admissible heuristic evaluation function, all state costs in the explicit graph are admissible after each step and LAO* converges to an optimal solution without (necessarily) evaluating all problem states (Hansen and Zilberstein 1998).

LAO* converges when the best solution graph does not have any non-terminal tip states. When policy iteration is used in the dynamic-programming step, the solution to which LAO* converges is optimal. When value iteration is used in the dynamic-programming step, the solution to which LAO* converges is ε-optimal, where ε is the error bound of the last value iteration step. (The error bound can be made arbitrarily small by increasing the number of iterations of value iteration before checking for convergence.)

LAO* is closely related to an "envelope" approach to policy and value iteration described by Dean et al. (1995) and Tash and Russell (1994). However, the envelope approach is not derived from AO*, and no proof is given that it converges to an optimal solution without evaluating all problem states.

Extensions

LAO* can solve the same class of MDPs as RTDP, and both algorithms converge to an optimal solution without evaluating all problem states. An advantage of deriving LAO* from the heuristic search algorithm AO* is that it makes it easier to generalize refinements of simpler heuristic search algorithms for use in solving MDPs more efficiently. This is illustrated in this section by extending LAO* in ways suggested by its derivation from AO*.

Pathmax

A heuristic evaluation function h is said to be consistent if it is both admissible and h(i) ≤ c_i(a) + β Σ_{j∈S} p_ij(a) h(j) for every state i and action a ∈ A(i). Consistency is a desirable property because it ensures that state costs increase monotonically as the algorithm converges. If a heuristic evaluation function is admissible but not consistent, state costs need not increase monotonically. However, they can be made to do so by adding a pathmax operation to the update step of the algorithm. This common improvement of A* and AO* can also be made to LAO*, by changing the formula used to update state costs as follows:

    f(i) := max { f(i), min_{a∈A(i)} [ c_i(a) + β Σ_{j∈S} p_ij(a) f(j) ] }.

Use of the pathmax operation in heuristic search can reduce the number of states that must be expanded to find an optimal solution.
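In code, the pathmax variant changes only one line of the backup used in the earlier sketches (a sketch under the same assumptions as before):

    def pathmax_backup(mdp, f, i, beta=1.0):
        """Backup for state i with the pathmax operation: costs never decrease."""
        best_q, best_a = float('inf'), None
        for a in mdp.actions[i]:
            q = mdp.cost(i, a) + beta * sum(p * f[j] for j, p in mdp.successors(i, a))
            if q < best_q:
                best_q, best_a = q, a
        # Keep the larger of the old and the backed-up cost, so that state costs
        # increase monotonically even when the heuristic is not consistent.
        f[i] = max(f[i], best_q)
        return best_a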
Solve-labeling

Descriptions of the AO* algorithm usually include a solve-labeling procedure that improves its efficiency. Briefly, a state is labeled solved if it is a goal state or if all of its successor states are labeled solved. In other words, a state is labeled solved when an optimal solution is found for it. (The search algorithm converges when the start state is labeled solved.) Labeling states as solved can improve the efficiency of the forward search step of AO* because it is unnecessary to search below a solved state for non-terminal tip states.

Solve-labeling is useful for problems that are decomposable into independent subproblems. But in a cyclic solution, it is possible for every state to be reachable from every other state. Therefore an acyclic solution found by AO* is more likely to contain independent subproblems than a cyclic solution found by LAO*. Nevertheless, it is possible for a solution found by LAO* to contain independent subproblems, and, when it does, the same solve-labeling procedure used by AO* may improve the efficiency of LAO*.
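Solve-labeling can be sketched as a fixed-point propagation over the currently assigned best actions. The dictionary policy below, mapping each expanded state to its best action, is an assumption of this illustration.

    def update_solved_labels(mdp, policy, solved=frozenset()):
        """Label states solved: goal states, and states all of whose successors
        under their best action are already labeled solved."""
        solved = set(solved) | set(mdp.goals)
        changed = True
        while changed:
            changed = False
            for i, a in policy.items():
                if i not in solved and all(
                        j in solved for j, _p in mdp.successors(i, a)):
                    solved.add(i)
                    changed = True
        # In a cyclic solution this rule may label few states beyond the goals,
        # which is why solve-labeling helps LAO* less often than it helps AO*.
        return solved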

Weighted heuristic

Studies have shown that use of a weighted heuristic can improve the efficiency of A* in exchange for a bounded decrease in the optimality of the solution it finds (Pohl 1973). Given the familiar decomposition of the evaluation function, f(i) = g(i) + h(i), a weight w, with 0.5 < w < 1, can be used to create a weighted heuristic, f(i) = (1 − w) g(i) + w h(i). Use of this heuristic guarantees that solutions found by A* are no more than a factor of w/(1 − w) worse than optimal. Chakrabarti et al. (1988) show how to use a weighted heuristic with AO* by describing a similar decomposition of the evaluation function. It is straightforward to extend their approach to LAO*. This makes possible an ε-admissible version of LAO* that can find an ε-optimal solution by evaluating a fraction of the states that LAO* might have to evaluate to find an optimal solution.

Before a state is expanded by LAO*, its g-cost is zero and its f-cost equals its h-cost. After a state is expanded, its f-cost is updated as follows:

    f(i) := min_{a∈A(i)} [ c_i(a) + β Σ_{j∈S} p_ij(a) f(j) ].

Having determined the action a that optimizes f(i), the g-cost and h-cost are updated as follows:

    g(i) := c_i(a) + β Σ_{j∈S} p_ij(a) g(j),
    h(i) := β Σ_{j∈S} p_ij(a) h(j).

The g-cost of a state represents the backed-up cost of the expanded descendents along the best partial solution graph. The h-cost of a state represents the backed-up estimated cost of the unexpanded descendents on the frontier of the best partial solution graph. When LAO* converges to a complete solution, the h-cost of every state visited by the solution equals zero and its f-cost equals its g-cost. Given this f = g + h decomposition of the evaluation function, it is straightforward to create a weighted heuristic for LAO*.
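One natural way to realize this weighted heuristic in LAO*'s backup is sketched below. The dictionaries g and h_cost (holding the g- and h-costs of all successor states) and the default weight are assumptions of this illustration.

    def weighted_backup(mdp, g, h_cost, i, w=0.6, beta=1.0):
        """Back up state i using the weighted evaluation (1-w)*g + w*h."""
        def weighted_f(j):
            return (1.0 - w) * g[j] + w * h_cost[j]

        # Choose the action that minimizes the weighted evaluation function.
        best_a = min(
            mdp.actions[i],
            key=lambda a: mdp.cost(i, a) + beta * sum(
                p * weighted_f(j) for j, p in mdp.successors(i, a)))
        # Back up the g-cost (expanded descendants) and the h-cost (frontier
        # estimate) separately along the chosen action.
        g[i] = mdp.cost(i, best_a) + beta * sum(
            p * g[j] for j, p in mdp.successors(i, best_a))
        h_cost[i] = beta * sum(p * h_cost[j] for j, p in mdp.successors(i, best_a))
        return best_a, (1.0 - w) * g[i] + w * h_cost[i]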
Discussion

We have described a generalization of heuristic search, called LAO*, that can find optimal solutions for indefinite-horizon MDPs without evaluating all problem states. Its derivation from AO* provides a foundation for a heuristic search approach to solving MDPs. It also makes it easier to generalize enhancements of simpler search algorithms for use in solving MDPs more efficiently. We note that the enhancements described in this paper (with the exception of solve-labeling) may also be used to improve the efficiency of RTDP.

Barto et al. (1995) do not discuss how to detect the convergence of RTDP to an optimal solution. We have described how to detect the convergence of LAO* to an optimal solution, or an ε-optimal solution. An interesting question is whether similar convergence tests can be developed for RTDP and similar algorithms.

LAO* and RTDP are especially useful for problems for which an optimal solution, given a start state, visits a fraction of the state space. It is easy to describe such problems. But there are also MDPs with optimal solutions that visit the entire state space. An interesting question is how to recognize problems for which an optimal solution visits a fraction of the state space and distinguish them from problems for which an optimal solution visits most, or all, of the state space. Even for problems for which an optimal solution visits the entire state space, LAO* may find useful partial solutions by focusing computation on states that are most likely to be visited from a start state. This possibility motivates the envelope algorithms of Dean et al. (1995) and Tash and Russell (1994).

Acknowledgments

Support for this work was provided in part by the National Science Foundation under grants IRI-9624992 and IRI-9634938.

References

Barto, A.G.; Bradtke, S.J.; and Singh, S.P. 1995. Learning to act using real-time dynamic programming. Artificial Intelligence 72:81-138.

Bertsekas, D. 1995. Dynamic Programming and Optimal Control. Belmont, MA: Athena Scientific.

Chakrabarti, P.P.; Ghose, S.; and DeSarkar, S.C. 1988. Admissibility of AO* when heuristics overestimate. Artificial Intelligence 34:97-113.

Dean, T.; Kaelbling, L.P.; Kirman, J.; and Nicholson, A. 1995. Planning under time constraints in stochastic domains. Artificial Intelligence 76:35-74.

Dearden, R. and Boutilier, C. 1997. Abstraction and approximate decision-theoretic planning. Artificial Intelligence 89:219-283.

Hansen, E.A. and Zilberstein, S. 1998. Heuristic search in cyclic AND/OR graphs. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, 412-418. Madison, WI.

Korf, R. 1990. Real-time heuristic search. Artificial Intelligence 42:189-211.

Martelli, A. and Montanari, U. 1973. Additive AND/OR graphs. In Proceedings of the Third International Joint Conference on Artificial Intelligence, 1-11. Stanford, CA.

Martelli, A. and Montanari, U. 1978. Optimizing decision trees through heuristically guided search. Communications of the ACM 21(12):1025-1039.

Nilsson, N.J. 1980. Principles of Artificial Intelligence. Palo Alto, CA: Tioga Publishing Company.

Pohl, I. 1973. First results on the effect of error in heuristic search. Machine Intelligence 5:219-236.

Tash, J. and Russell, S. 1994. Control strategies for a stochastic planner. In Proceedings of the Twelfth National Conference on Artificial Intelligence, 1079-1085. Seattle, WA.
