
Deterministic POMDPs Revisited

Blai Bonet
Departamento de Computación
Universidad Simón Bolívar
Caracas, Venezuela
[email protected]

Abstract

We study a subclass of POMDPs, called Deterministic POMDPs, that is characterized by deterministic actions and observations. These models do not provide the same generality of POMDPs, yet they capture a number of interesting and challenging problems and permit more efficient algorithms. Indeed, some of the recent work in planning is built around such assumptions, mainly by the quest for amenable models more expressive than the classical deterministic models. We provide results about the fundamental properties of Deterministic POMDPs, their relation with AND/OR search problems and algorithms, and their computational complexity.

1 Introduction

The simplest model for sequential decision making is the deterministic model with known initial and goal states. Solutions are sequences of actions that map the initial state into a goal state and can be computed with standard search algorithms. This model has been studied thoroughly in AI with important contributions such as A*, IDA*, and others [30, 34].

The deterministic model has strong limitations on the type of problems that can be represented: it is not possible to model situations where actions have non-deterministic outcomes or where states are not fully observable. In such cases, one must resort to more expressive formalisms such as Markov Decision Processes (mdps) and Partially Observable mdps (pomdps). The generality of these models comes with a cost, since the computation of solutions increases in complexity, specially for pomdps, and thus one gains in generality but loses in the ability to solve problems. pomdps, for example, are widely used as they offer one of the most general frameworks for sequential decision making [19], yet the known algorithms scale very poorly.

However, we have seen that an important collection of problems that involve uncertainty and partial information have a common characteristic: they have actions with deterministic outcomes, and the observations generated at each decision stage also behave deterministically. Indeed, these models have been used in recent proposals for planning with incomplete information [15, 16, 27], appear in works of more general scope [15, 20] and about causation [31], and are used for learning partially-observable deterministic action models [1].

These models were briefly considered in Littman's thesis [21] under the name of Deterministic POMDPs (det-pomdps), for which some important theoretical results were obtained. Among others, he showed that a det-pomdp can be mapped into an mdp with an exponential number of states and thus solved with standard MDP algorithms, and that optimal non-stationary policies of polynomial horizon can be computed in non-deterministic polynomial time. Unfortunately, det-pomdps briefly appeared as a curiosity of theoretical interest and then quickly faded out from consideration, to the point that, up to our knowledge, there are no publications on this subject from either Littman or others.

Given the role of det-pomdps in recent investigations, motivated mainly by the quest for amenable models for decision making with uncertainty and partial information, we believe that det-pomdps should be further studied. In this paper, we carry out a systematic exploration of det-pomdps, mainly from the complexity perspective, yet we outline novel algorithms for them. We consider three variants of the model: the fully-observable, the unobservable, and the general case, and two metrics of performance: worst- and expected-cost. As will be shown, det-pomdps offer a tradeoff between the classical deterministic model and the general pomdp model. Furthermore, their characteristics permit the use of standard and novel AND/OR algorithms, which are simpler and more efficient than the standard algorithms for pomdps [32, 36, 37] or the pomdp transformation proposed by Littman.

The paper is organized as follows. First, we give examples of challenging problems that help us establish the relevance of det-pomdps. We then present the definition and variants of the model in Sect. 3, the relation with AND/OR graphs and algorithms in Sect. 4, complexity analyses in Sect. 5, and finish with a brief discussion in Sect. 6.

2 Examples

Numerous det-pomdp problems have been used to evaluate and develop different algorithms for planning with uncertainty and partial information. For space reasons, we only provide a few examples and brief descriptions for some of them.

Mastermind. There is a secret word of length m over an alphabet of n symbols. The goal is to discover the word by making guesses about it. Upon each guess, the number of exact matches and near matches is returned. The goal is to obtain a strategy to identify the secret.

Navigation in Partially-Known Terrains. There is a robot in an n × n grid that must navigate from an initial position to a goal position. Each cell of the grid is either traversable or untraversable. The robot has perfect sensing within its cell, but the traversability of a subset of cells is unknown. The task is to obtain a strategy for guiding the robot to its destination [20].

Diagnosis. There are n binary tests for finding out the state of a system among m possible states. An instance consists of an m × n binary matrix T such that Tij = 1 iff test j is positive when the state is i. The goal is to get a strategy for identifying the state [29].

Coins. There are n coins of which one is a counterfeit of different weight, and there is a 2-pan balance scale. A strategy that spots the counterfeit and finds out whether it is heavier or lighter is needed [30].

Domains from IPC. The problems in the 2006 and 2008 International Planning Competitions for the track on conformant planning consisted of domains covering topics such as Blocksworld, circuit synthesis, universal traversal sequences, sorting networks, communication protocols and others [11], all of which are instances of det-pomdps.

3 Model and Variants

Formally, a det-pomdp model is a tuple made of:

– a finite state space S = {1, . . . , n},
– finite sets of applicable actions Ai ⊆ A for i ∈ S,
– a finite set of observations O,
– an initial subset of states b0 ⊆ S, or alternatively an initial distribution of states b0 ∈ ∆S,
– a subset T ⊆ S of goal (target) states,
– a deterministic transition function f(i, a) ∈ S, for i ∈ S and a ∈ Ai, that specifies the result of applying action a on state i,
– a deterministic observation function o(i, a) ∈ O, for i ∈ S and a ∈ A, that specifies the observation received when entering state i after the application of action a, and
– positive costs c(i, a), for i ∈ S and a ∈ Ai, that tell the immediate cost of applying a on i.

For simplicity, we assume that goal states are absorbing. That is, once a goal state is entered, the system remains there and incurs no costs; hence A_t = A, f(t, a) = t and c(t, a) = 0 for t ∈ T.

The two options for b0 depend on whether the interest is in minimizing the worst-case total accumulated cost or the expected total accumulated cost (see below). In any case, b0 is called the initial belief state¹ and describes the different possibilities for the initial state, which is not known a priori. Its importance is crucial since, under the assumption of deterministic transitions, if the initial state were known, then all future states would also be known, and the model reduces to the well-known deterministic model in AI [34]. Hence, the only source of uncertainty in det-pomdp models comes from the initial state, which further induces an uncertainty on the observations generated at each decision stage. Nonetheless, the model remains challenging, with respect to expressivity and computation, as exemplified in the previous section.

Although the state of the system may not be fully observable, it is possible (and indeed useful) to consider preconditions for actions. The role of preconditions is to permit the knowledge engineer to design economical representations by leaving out from the specification unimportant details or undesirable situations. For example, if one does not want to model the effects of plugging a 120V artifact into a 240V outlet, then a simple precondition can be used to avoid such situations. Preconditions in the det-pomdp model are expressed through the sets of applicable actions Ai. As is standard in planning, situations in which there are no actions available are called dead ends.

¹The term 'belief state' refers to a subset of states (or a probability distribution on states) that is deemed possible by the agent at a given point in time.
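To make the tuple concrete, here is a minimal Python sketch of a det-pomdp together with a toy instance of the Diagnosis problem from Sect. 2. The class and all identifiers (DetPOMDP, declare_k, etc.) are illustrative assumptions, not notation from the paper; in particular, the declare actions are a hypothetical device for giving Diagnosis a goal state.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List

@dataclass(frozen=True)
class DetPOMDP:
    states: FrozenSet[int]              # S = {1, ..., n}
    actions: Dict[int, List[str]]       # A_i: actions applicable at state i
    observations: FrozenSet[str]        # O
    init: FrozenSet[int]                # b0, read as a subset of S (minmax case)
    goals: FrozenSet[int]               # T; assumed absorbing
    f: Callable[[int, str], int]        # deterministic transition f(i, a)
    o: Callable[[int, str], str]        # observation on entering state i after a
    c: Callable[[int, str], float]      # immediate cost c(i, a)

# Toy Diagnosis instance: 3 system states and 2 binary tests that leave the
# state unchanged, plus hypothetical 'declare_k' actions that move the true
# state k to the goal state 3 and leave the other states untouched.
TESTS = [[1, 0], [0, 1], [1, 1]]        # TESTS[i][j] = 1 iff test j positive in state i
ACTS = ["test_0", "test_1", "declare_0", "declare_1", "declare_2"]

def trans(i: int, a: str) -> int:
    if i == 3:                          # goal state is absorbing: f(t, a) = t
        return 3
    if a.startswith("declare"):
        return 3 if i == int(a[-1]) else i
    return i                            # tests leave the state unchanged

def observe(i: int, a: str) -> str:
    if i != 3 and a.startswith("test"):
        return "pos" if TESTS[i][int(a[-1])] else "neg"
    return "none"                       # declarations are uninformative

diag = DetPOMDP(
    states=frozenset(range(4)),
    actions={i: ACTS for i in range(4)},
    observations=frozenset({"pos", "neg", "none"}),
    init=frozenset(range(3)),
    goals=frozenset({3}),
    f=trans,
    o=observe,
    c=lambda i, a: 0.0 if i == 3 else 1.0,  # c(t, a) = 0 at the goal
)
```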

3.1 Optimality Criteria

Our goal is to compute strategies that permit the agent to act optimally. We consider two optimality criteria and three variants of the model. For optimality, we consider the minmax criterion, which minimizes the worst-case cost of a policy, and the minexp criterion, which minimizes the expected cost of a policy. The variants of the model depend on the observation model:

• unobservable models in which O is a singleton and thus observations return no information about the state. This class is a subclass of the so-called conformant problems in planning [13].

• fully-observable models in which o(i, a) = i. This class corresponds to fully observable mdps except that the initial state is unknown; after the application of the first action, the state is revealed.

• models with no assumptions on the observations. This case corresponds to general det-pomdps and includes the previous cases.

Two optimization criteria and three variants combine into six models: unobservable models with the minmax and minexp criteria, fully-observable models with the minmax and minexp criteria, and general models with the minmax and minexp criteria. Among these, the unobservable and the general models are the interesting ones; fully-observable models are trivial and pathological and thus will not be considered again.

For partially-observable problems, the most general form of a policy is a function that maps belief states into actions. Beliefs are subsets of states for minmax models and probability distributions over states for minexp models. However, an optimal policy does not need to specify an action for all possible beliefs: it just needs to consider the beliefs that could appear during the execution of the policy.

3.2 Belief Dynamics and Closed Policies

In order to execute an action, the agent must be certain about its applicability no matter what the current state of the environment is. Otherwise, the model might end up in an inconsistent configuration with respect to the specification. Hence, the set A_b of applicable actions at b is A_b = ∩{A_i : i ∈ sup(b)}, where sup(b), called the support of b, is the collection of states that are deemed possible at b; i.e. sup(b) = b for subsets of states, and sup(b) = {i : b(i) > 0} for distributions on states. Once an applicable action is executed, each state in sup(b) changes into a new state, making up a new belief b_a = {f(i, a) : i ∈ sup(b)} for subsets, and b_a(j) = Σ_{i : f(i,a)=j} b(i) for distributions.

On the real environment though, the current state transforms into a new state and an observation is generated, which is used to filter out states inconsistent with it; i.e. the filtered belief is b_a^o = {i ∈ b_a : o(i, a) = o} for subsets, and, for distributions,

$$ b_a^o(i) = \begin{cases} 0 & \text{if } b_a(i) = 0 \text{ or } o(i,a) \neq o, \\ b_a(i)/b_a(o) & \text{otherwise,} \end{cases} $$

where b_a(o) is a normalization constant.

From the point of view of the agent, which does not know the current state of the environment, the set of observations that may appear after the application of a at b is o(b, a) = {o(i, a) : i ∈ sup(b_a)}. Under the minexp criterion, each observation has probability b_a(o) of being received once action a is applied at b.

Littman proposes an interesting representation for the reachable beliefs in the probabilistic case [21], in which a reachable belief b is represented by a table t_b : S → S ∪ {⊥} as follows. For b0, t_{b0}(i) is i or ⊥ depending on whether b0(i) > 0 or not; and for b_a^o, where b is a reachable belief, a ∈ A and o ∈ o(b, a), the table t_{b_a^o} is given by t_{b_a^o}(i) = ⊥ if t_b(i) = ⊥ or the observation generated when the current state is t_b(i) differs from o, and t_{b_a^o}(i) = f(t_b(i), a) otherwise. In words, the entry t_b(i) tells what would be the current state had the initial state been equal to i. The relation between b and t_b is b(i) ∝ Σ_{j ∈ S : t_b(j) = i} b0(j).

We are interested in optimal policies that are closed with respect to the initial belief b0; namely, the policy is defined over all the beliefs that may appear during its execution. We say that a policy π is closed in a subset B of beliefs if π is defined on each b ∈ B, and b^o_{π(b)} ∈ B for each b ∈ B and o ∈ o(b, π(b)); and π is closed with respect to b0, or just closed, if b0 ∈ Dom(π) and π is closed in Dom(π). We will only consider closed policies. The set of reachable beliefs from b0 using (closed) policy π is the minimum subset R = R(b0, π) ⊆ Dom(π) such that b0 ∈ R and π|_R is closed with respect to b0.² A policy π is minimal if π = π|_{R(b0,π)}.

²f|_A means the restriction of f to Dom(f) ∩ A.
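The support-based dynamics above translate directly into code. The following sketch implements A_b, the progression b_a, the observation set o(b, a), and the filtering b_a^o, for the minmax (subset) reading of beliefs; it reuses the hypothetical DetPOMDP class from the previous sketch.

```python
from typing import FrozenSet, List, Set

def applicable(m: DetPOMDP, b: FrozenSet[int]) -> List[str]:
    """A_b: intersection of the A_i over all states deemed possible at b."""
    sets = [set(m.actions[i]) for i in b]
    return sorted(set.intersection(*sets)) if sets else []

def progress(m: DetPOMDP, b: FrozenSet[int], a: str) -> FrozenSet[int]:
    """b_a = { f(i, a) : i in sup(b) }."""
    return frozenset(m.f(i, a) for i in b)

def obs_set(m: DetPOMDP, b: FrozenSet[int], a: str) -> Set[str]:
    """o(b, a) = { o(i, a) : i in sup(b_a) }."""
    return {m.o(i, a) for i in progress(m, b, a)}

def filter_belief(m: DetPOMDP, ba: FrozenSet[int], a: str, obs: str) -> FrozenSet[int]:
    """b_a^o = { i in b_a : o(i, a) = obs }."""
    return frozenset(i for i in ba if m.o(i, a) == obs)

# On the toy Diagnosis instance, test_0 splits b0 = {0, 1, 2} into
# {0, 2} (observation 'pos') and {1} (observation 'neg'), illustrating
# that the filtered supports partition sup(b_a).
```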

3.3 Graphs, Costs and Optimal Policies

Let us reason about the trajectories generated by a policy π. At the beginning, the agent applies the action a0 = π(b0). Then, the agent incurs a certain immediate cost, which depends on the current state and a0, and receives an observation o0, which also depends on the current state and action, and which is used by the agent to update its initial belief into b1 = (b0)_{a0}^{o0}. At the second decision stage, the agent applies the action a1 = π(b1), incurs a cost, and receives a new observation o1 that is used to update the belief into b2 = (b1)_{a1}^{o1}, and so on. This process continues until the agent is certain that a goal state has been reached, i.e. until the current belief b is a target belief characterized by sup(b) ⊆ T. It is important to remark that during the execution of a policy the agent can predict with certainty neither the costs nor the observations, since it is not certain about the current state of the environment.

All possible trajectories of π seeded at b0 form a labeled directed (multi-)graph G_π = (V, E, ℓ), where V = R(b0, π) and there is an edge (b, b′) labeled with o iff b is non-target and b′ = b_{π(b)}^o for some o ∈ o(b, π(b)). In probabilistic models, edges (b, b′) labeled with o have attached probabilities equal to b_{π(b)}(o).

Deterministic actions imply |sup(b_a)| ≤ |sup(b)|, with strict inequality only if f(i, a) = f(j, a) for some i ≠ j in sup(b). Therefore, if a is applicable in b and {b1, . . . , bm} is the collection of possible beliefs after observations, then the collection {sup(b_i)}_{i=1}^m is a partition of sup(b_a), and thus Σ_{o ∈ o(b,a)} |sup(b_a^o)| ≤ |sup(b)| for each a ∈ A_b. Furthermore, if b <_π b′ denotes the existence of a path from b to b′ in G_π, and b <_π^d b′ the existence of a deterministic path, i.e. one in which each node has outdegree one, then

Theorem 1. Let G_π be the graph for π. Then, (a) if b <_π b′ then |sup(b)| ≥ |sup(b′)|, and (b) if b <_π b′ and |sup(b)| = |sup(b′)| then b <_π^d b′.

Theorem 2. If there is a policy of finite cost, then there is an optimal policy π∗ whose graph G_{π∗} is a DAG.

If G_π is a DAG, the cost assigned by π to each belief is defined inductively bottom-up from the sink nodes up to the source b0. Indeed, π being closed, the sinks are target beliefs, which have zero cost under both criteria, i.e. V_π^max(b) = V_π^exp(b) = 0 if sup(b) ⊆ T. Once all successors of b get values, the value at b is defined as

$$ V^{\max}_\pi(b) = c^{\max}(b, \pi(b)) + \max_{(b,b') \in G_\pi} V^{\max}_\pi(b'), $$
$$ V^{\exp}_\pi(b) = c^{\exp}(b, \pi(b)) + \sum_{o \in o(b, \pi(b))} b_a(o) \cdot V^{\exp}_\pi(b_a^o), $$

where a = π(b) and the costs over beliefs are defined as c^max(b, a) = max_{i ∈ b} c(i, a) and c^exp(b, a) = Σ_{i ∈ S} b(i) · c(i, a). The cost assigned by policy π to belief b0, either finite or infinite, is called the cost of π. Clearly, if G_π is a DAG, then V_π^exp(b0) ≤ V_π^max(b0) < ∞.

A policy π is preferred to policy π′ under optimization criterion ∆ if V_π^∆(b0) < V_{π′}^∆(b0). A policy π is optimal under ∆ if no other policy is preferred to it. Hence, by Theorem 2, if there is a policy of finite cost, then there is an optimal policy π∗ whose graph is a DAG. From now on, when we say an optimal policy, we mean an optimal policy of finite cost. If there is no policy of finite cost, we say there is no optimal policy.
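Under the DAG assumption, the bottom-up definition of V_π^max can be evaluated by memoized recursion over the policy graph. A minimal sketch for subset beliefs, reusing the hypothetical helpers above; pi is assumed to be a closed policy given as a dict from beliefs to actions.

```python
from typing import Dict, FrozenSet

def policy_value_max(m: DetPOMDP, pi: Dict[FrozenSet[int], str]) -> float:
    """V_pi^max(b0) for a closed policy pi whose graph G_pi is a DAG."""
    memo: Dict[FrozenSet[int], float] = {}

    def value(b: FrozenSet[int]) -> float:
        if b <= m.goals:                        # target belief: sup(b) subset of T
            return 0.0
        if b not in memo:
            a = pi[b]
            ba = progress(m, b, a)
            c_max = max(m.c(i, a) for i in b)   # c^max(b, a)
            memo[b] = c_max + max(
                value(filter_belief(m, ba, a, z)) for z in obs_set(m, b, a)
            )
        return memo[b]

    return value(frozenset(m.init))
```

The minexp variant is analogous: replace the max over observations by the probability-weighted sum and c^max by c^exp.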

3.4 Policy Forms and Sizes

Unobservable Models. As there is no information in the observations, beliefs evolve deterministically and policies reduce to sequences of actions. By Theorem 1, the supports along a path are non-increasing, so every path decomposes into maximal segments of beliefs whose supports have equal size; we refer to these (deterministic) segments as chunks. The diameter of a model M is the maximum over the diameters of R∗(b, A) for all b ∈ R(b0, A), where R(b, A) denotes the set of beliefs reachable from b using applicable actions and R∗(b, A) the graph they induce. The model has polynomial diameter if its diameter is O(poly(|S|, |A|)).

If R(b0, A) is of polynomial size, the model is of polynomial diameter, but the converse is not necessarily true. If M is of polynomial diameter, the lengths of the chunks are polynomial. Thus, unobservable problems of polynomial diameter have optimal policies of polynomial size.

General Models. Optimal policies are DAGs in which paths correspond to sequences of beliefs with non-increasing supports. As before,

Theorem 3. Models of polynomial diameter have optimal policies of polynomial size.

Proof. It remains to show the claim for general models. Let π be an optimal policy. A subset B of beliefs is called an anti-chain iff there are no two different beliefs b, b′ ∈ B with b <_π b′. An induction shows that if B is an anti-chain, then |sup(b0)| ≥ Σ_{b ∈ B} |sup(b)|. Thus, since |sup(b0)| ≤ |S|, all anti-chains have size at most |S|. On the other hand, the assumption implies that all paths in G_π have polynomial length. Therefore, G_π is of polynomial size, since the beliefs at depth k form an anti-chain and thus their number is at most |S|.

4 AND/OR Graphs

We establish a relation between policies and AND/OR graphs that can be exploited by algorithms. An AND/OR graph is a tuple G = (V1 ∪ V2, E, T, n0, c), where V1 and V2 are finite sets of AND and OR nodes, E ⊆ ((V1 ∪ V2) \ T) × (V1 ∪ V2) is a subset of directed edges between nodes, T ⊆ V2 is a subset of terminal nodes, n0 ∈ V1 ∪ V2 is the root node, and c : E ∪ T → R is a cost function over edges and terminal nodes [26]. A solution is a subgraph H spanned by a subset of edges H(E) such that (1) n0 ∈ H, (2) if n ∈ H \ T is an AND node then all its outbound edges are in H(E), and (3) if n ∈ H \ T is an OR node then one of its outbound edges is in H(E). A solution is valid iff it is acyclic. Costs c(H) can be associated to valid solutions H inductively from the sinks up to the root by

$$ V_H(n) = \begin{cases} c(n) & n \in T, \\ \max_{(n,n') \in H(E)} \left[ c(n,n') + V_H(n') \right] & n \in V_1, \\ c(n,n') + V_H(n') & n \in V_2, \end{cases} $$

where, in the last case, (n, n′) is the unique edge of H(E) leaving the OR node n. A stochastic graph associates probabilities p(n, n′) to the edges (n, n′) incident at AND nodes n so that Σ_{(n,n′) ∈ E} p(n, n′) = 1. The cost V_H is similarly defined except that V_H(n) = Σ_{(n,n′) ∈ H(E)} p(n, n′)[c(n, n′) + V_H(n′)] for AND nodes. In any case, the cost c(H) is defined as V_H(n0), and a solution is optimal if its cost is no larger than the cost of any other solution.

4.1 From Models to Graphs and Algorithms

The relation between det-pomdps and AND/OR graphs is direct. For a model M, we define an AND/OR graph G_M such that the solutions of G_M coincide with the solutions of M. Indeed, given a model M with minmax criterion, the graph G_M is given by

– V1 = {b_a : b ∈ R(b0, A), a ∈ A_b},
– V2 = R(b0, A),
– T = {b ∈ V2 : sup(b) ⊆ T},
– if b ∈ V2 \ T then (b, b_a) ∈ E for a ∈ A_b,
– if b_a ∈ V1 then (b_a, b_a^o) ∈ E for o ∈ o(b, a),
– n0 = b0,
– c(b) = 0 for b ∈ T,
– c(b, b_a) = c^max(b, a) for b ∈ V2 and a ∈ A_b,
– c(b_a, b_a^o) = 0 for b_a ∈ V1 and o ∈ o(b, a).

For minexp, the graph must be extended into a stochastic graph with labels p(b_a, b_a^o) = b_a(o), and c(b, b_a) must be replaced by c^exp(b, a). Also observe that the beliefs in the graph can be represented with probability distributions or with Littman's tables; a sketch of the construction is given at the end of this section.

This relation permits the use of diverse algorithms. Firstly, since the number of reachable beliefs is finite, all reachable beliefs can be enumerated and Value or Policy Iteration applied on the resulting mdp over belief space (this is essentially Littman's proposal); both algorithms are guaranteed to converge in a finite number of steps. However, we can do better, since optimal policies are acyclic and thus consistently improving policies exist. Therefore, more efficient algorithms can be used, such as label-setting methods and Dial's algorithm [3, 7].

Explicit algorithms such as the above require enough memory to compute the set of reachable beliefs. If no such memory is available, search-based algorithms that generate beliefs incrementally and use admissible heuristics to focus the search must be used. Relevant algorithms in this class are the classical AO* [6, 23, 26], LAO* [14], RTDP-like algorithms [2, 4], and LDFS [5]. AO* can only be used on acyclic graphs; LAO*, RTDP and LDFS do not have this restriction.
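The construction of G_M can be sketched as a breadth-first enumeration of R(b0, A). For simplicity, the AND node b_a is keyed by the pair (b, a); the helper functions are the hypothetical ones from the Sect. 3.2 sketch, and this is an illustration under those assumptions rather than a definitive implementation.

```python
from collections import deque

def build_minmax_graph(m: DetPOMDP):
    """Enumerate R(b0, A) and emit the minmax AND/OR graph G_M of Sect. 4.1:
    OR nodes are beliefs, AND nodes are (belief, action) pairs."""
    b0 = frozenset(m.init)
    or_nodes, and_nodes, edges = {b0}, set(), []
    queue = deque([b0])
    while queue:
        b = queue.popleft()
        if b <= m.goals:                       # terminal OR node, c(b) = 0
            continue
        for a in applicable(m, b):
            ba = progress(m, b, a)
            and_nodes.add((b, a))
            edges.append((b, (b, a), max(m.c(i, a) for i in b)))  # c^max(b, a)
            for z in obs_set(m, b, a):
                bo = filter_belief(m, ba, a, z)
                edges.append(((b, a), bo, 0.0))                   # zero-cost edge
                if bo not in or_nodes:
                    or_nodes.add(bo)
                    queue.append(bo)
    return or_nodes, and_nodes, edges, b0
```

For minexp the same traversal applies, attaching the probability b_a(o) to each AND edge and replacing c^max by c^exp.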

5 Complexity

The complexity of pomdps has been thoroughly studied. Results for different optimization criteria, probabilistic and non-deterministic variants, and so on are known [17, 22, 25, 28, 33]. Littman [21] obtained important complexity results about det-pomdps for problems with non-negative costs (i.e. with non-positive rewards):

• deciding the existence of a policy that incurs zero cost for an infinite-horizon det-pomdp is PSPACE-complete, and
• deciding the existence of a policy of polynomial length that incurs zero cost for a det-pomdp is NP-complete.

In this section, we extend these results as follows:

• we show that deciding the existence of a policy of finite cost for det-pomdps of polynomial diameter is NP-complete,
• we show that there are det-pomdps on which all policies of finite cost are of super-polynomial size,
• we give a new class of problems on which the existence of policies can be checked in non-deterministic polynomial time, and
• we give some sufficient conditions for testing whether a det-pomdp is of polynomial diameter.

All these results, as well as Littman's results, are based on the assumption that the models are encoded in a flat language; that is, that the models are encoded with O(poly(|S|, |A|, h)) bits, where h is the maximum number of bits needed to specify a cost or probability. The general question that we address here is whether there is a policy that reaches the goal with certainty. Namely,

Policy Existence for Flat Models (PEF): Is there a policy of finite cost for flat model M?

The fact that PEF only considers flat models is important because with compact representations an exponential number of states can be described, and thus a double-exponential number of beliefs could be needed.

Theorem 4. The PEF problem for unobservable models of polynomial diameter is NP-complete.

Proof. Inclusion is direct by guessing a polynomially-sized policy and checking it. For hardness, we reduce SAT using the method in [25] almost verbatim. Let φ be a 3-CNF formula with variables x1, . . . , xn and clauses C1, . . . , Cm. A variable x 1-appears (0-appears) in clause C if x ∈ C (x̄ ∈ C). We construct an unobservable model M(φ) as follows. The set of states is S = {[xi, Cj] : 1 ≤ i ≤ n, 1 ≤ j ≤ m} ∪ {t, f}. States t and f mean satisfiable and unsatisfiable respectively. The actions set(i, v), for 1 ≤ i ≤ n and v ∈ {0, 1}, are used to set the values of variables: set(i, v) maps state [xi, Cj] to either t or [xi+1, Cj], and state [xn, Cj] to either t or f, depending on whether xi (resp. xn) v-appears in Cj or not. The initial belief is b0 = {[x1, Cj] : 1 ≤ j ≤ m} and t is the unique target state. M(φ) is of polynomial size and diameter, and has a policy of finite cost iff φ is satisfiable.

Corollary 5. The PEF problem for general models of polynomial diameter is NP-complete.

Proof. Hardness is direct since unobservable models are a special case. Inclusion follows by guessing a policy of polynomial size (Theorem 3) and checking it.

Sometimes, it is easy to verify that a model is of polynomial diameter. Indeed, if |sup(b0)| = k, the number of reachable beliefs is O(n^k); if k is a constant, then the number of reachable beliefs is polynomial, as is the diameter. For another case, consider the (global) transition graph T_M: a directed graph whose nodes are the states and in which there is an edge (i, j) iff there is an action a ∈ A such that j = f(i, a). If T_M is acyclic, except for self-loops, the model has linear diameter. Examples are knowledge-gathering problems such as Mastermind, 12-coins and Diagnosis. The IPC problem for sorting networks is also of this type.

Theorem 6. If |sup(b0)| is constant, then the diameter is polynomial. If T_M is acyclic, except perhaps for self-loops, then the diameter is linear.

As noted in Theorem 1, there is a close connection between the diameter of a model and permutations over states. Thus, the study of permutations provides important insight for estimating the diameter of models.
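The second condition of Theorem 6 can be tested mechanically: build T_M, drop the self-loops, and check acyclicity, e.g. with Kahn's algorithm. A sketch under the same hypothetical model class as before:

```python
def transition_graph_acyclic(m: DetPOMDP) -> bool:
    """Check whether T_M is acyclic once self-loops are ignored (Theorem 6)."""
    succ = {i: {m.f(i, a) for a in m.actions[i] if m.f(i, a) != i}
            for i in m.states}
    indeg = {i: 0 for i in m.states}
    for i in m.states:
        for j in succ[i]:
            indeg[j] += 1
    queue = [i for i in m.states if indeg[i] == 0]   # Kahn's algorithm
    seen = 0
    while queue:
        i = queue.pop()
        seen += 1
        for j in succ[i]:
            indeg[j] -= 1
            if indeg[j] == 0:
                queue.append(j)
    return seen == len(m.states)                     # all states topologically ordered
```

On the toy Diagnosis instance the only non-loop edges go into the absorbing goal state, so the check succeeds, matching the knowledge-gathering examples mentioned above.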

5.1 Permutation Groups and Their Diameter

The set of permutations with composition forms a multiplicative group. The order of a permutation σ is the least n such that σ^n is the identity permutation. It is well known that a permutation can be written in cycle notation as a product of disjoint cycles, and that the order of σ = C1 C2 · · · Cm (in cycle notation) is the least common multiple of the lengths of the cycles. For example,

$$ \sigma = \begin{pmatrix} 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 \\ 3 & 4 & 5 & 7 & 8 & 6 & 2 & 1 \end{pmatrix} $$

can be written as σ = (1, 3, 5, 8)(2, 4, 7)(6), and its order is ord(σ) = lcm{4, 3, 1} = 12.

Consider now an unobservable model with a single action a, with empty precondition, that corresponds to a permutation σ = C1 C2 · · · Cm over the states. Furthermore, suppose that b0 has m states, one from each cycle Ci. Then, the repeated application of a over b0 generates a trajectory ⟨b0, . . . , bk⟩ of different beliefs if k < ord(σ). Since observations filter nothing, |sup(b0)| = · · · = |sup(bk)| and there is a chunk of length k + 1. If k is large, the size of an optimal policy could be large as well. We use this idea to prove the theorem below.
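As an aside, ord(σ) is cheap to compute from the cycle decomposition. A small self-contained sketch that verifies the example above; the function names are illustrative only.

```python
from math import gcd
from functools import reduce

def cycle_lengths(perm: dict) -> list:
    """Lengths of the disjoint cycles of a permutation given as a dict i -> perm(i)."""
    lengths, seen = [], set()
    for start in perm:
        if start in seen:
            continue
        n, i = 0, start
        while i not in seen:
            seen.add(i)
            i = perm[i]
            n += 1
        lengths.append(n)
    return lengths

def order(perm: dict) -> int:
    """ord(sigma) = lcm of the cycle lengths."""
    return reduce(lambda x, y: x * y // gcd(x, y), cycle_lengths(perm), 1)

# The example sigma from Sect. 5.1: cycles (1 3 5 8)(2 4 7)(6), order 12.
sigma = {1: 3, 2: 4, 3: 5, 4: 7, 5: 8, 6: 6, 7: 2, 8: 1}
assert order(sigma) == 12
```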

Theorem 7. There is an unobservable model for which all policies are of super-polynomial size.

Proof. Let S = {1, . . . , n}. First, we show that there is a permutation σ over S with super-polynomial order. If σ = C1 · · · Cm in cycle notation, then |C1| + · · · + |Cm| = n and ord(σ) = lcm{|C1|, . . . , |Cm|}, so we need to show that there are integers {d1, . . . , dm} whose sum is n and whose lcm is super-polynomial in n. The Prime Number Theorem says that the number of primes less than N is asymptotically equal to N/log N. Hence, the number of primes in [n^{1/2}, n^{3/4}] is asymptotically equal to 4n^{3/4}/(3 log n) − 2n^{1/2}/log n, which dominates n^{1/4}. Therefore, for sufficiently large n, we can choose ⌊n^{1/4}⌋ different primes in [n^{1/2}, n^{3/4}]. Their sum is bounded by n^{3/4} × n^{1/4} = n, and thus these primes can be extended into a collection {d1, . . . , dm} of integers whose sum is equal to n. Since all primes are different and at least n^{1/2}, their lcm is at least n^{\frac{1}{2} n^{1/4}}, which dominates n^k for any constant k. (Estimations of the expected order of random permutations are also known [9, 12, 38].)

For the model, let a be an action that acts like σ, and let b0 be the subset containing the "first" state of each cycle Ci. Let a′ be another action that maps s into a target state if s is the "last" state in a cycle, and into s otherwise. The optimal plan is the sequence of ord(σ) − 1 repetitions of a followed by a′.

However, even in cases of non-polynomial diameter, we can test in non-deterministic polynomial time the existence of policies for some models (proof below).

Theorem 8. If each action is a permutation with empty precondition, then the PEF problem is in NP, even if the model is not of polynomial diameter.

The group S_n of all permutations over n elements is called the symmetric group of degree n. Let A ⊆ S_n be a subset. Then, A generates the subgroup ⟨A⟩ of all permutations that can be formed by finite compositions of elements from A. A permutation σ ∈ ⟨A⟩ is k-expressible if it is the product of at most k permutations from A. The diameter of ⟨A⟩ is the least integer k such that every permutation in ⟨A⟩ is k-expressible. Problems related to generated groups have a tradition in computer science. For example, Furst, Hopcroft and Luks [10] gave a polynomial-time algorithm for deciding whether a permutation σ is generated by the set A, and Jerrum [18] showed that computing the size of a smallest generator set is PSPACE-complete.

Proof Sketch for Theorem 8. We do it for unobservable models; the proof for general models is a bit more involved. Since actions are permutations, the size of beliefs does not change with actions. If there is a solution b0 → · · · → bn where bn is a target belief, then, by Theorem 1, there is a permutation σ such that sup(b0) is mapped 1-1 into sup(bn). Thus, it is enough to guess σ and then test, in polynomial time using the algorithm in [10], whether σ is generated by the actions.

More important to us are results about the diameters of groups. Driscoll and Furst [8] showed that if the generators are all cycles of bounded length, the diameter is O(n²), while McKenzie [24] showed that if all generators move at most k elements, the diameter is O(n^k).

Theorem 9. If every action has empty precondition, and is a permutation that moves a constant number of states or whose cycles are all of bounded length, then the model is of polynomial diameter.

6 Discussion

We have shown that the Deterministic POMDP model, first studied by Littman, is more relevant than previously thought, for two reasons. First, a number of important and challenging problems in planning correspond to such models; and second, they provide a tradeoff between the restricted yet efficient classical deterministic model and the general but inefficient pomdp model. We have shown novel complexity results that reflect this tradeoff: while checking the existence of plans in classical deterministic (flat) problems is NL-complete (via st-Reachability [35]), and checking the existence of plans for pomdps is EXPTIME-complete [21], checking the existence of plans for det-pomdps of polynomial diameter is NP-complete, and doing so for det-pomdps in which the actions are permutations with empty preconditions is in NP. Furthermore, we give polynomial-time checkable conditions for polynomial diameter.

We also proposed a relation between det-pomdps and AND/OR graphs that permits the use of general AND/OR search algorithms. Although we did not perform an experimental evaluation, it is clear that such algorithms should outperform in practice the explicit mapping of a det-pomdp into an mdp.

It is important to emphasize that det-pomdps are not "general" pomdps, and hence one should be careful when evaluating general algorithms with them. Due to the reduced complexity of the model, det-pomdps must be tackled as a more specialized class which is likely to scale better.

In the future, we would like to further study the relation between det-pomdps and permutation groups, and to implement algorithms based on AND/OR search.

Acknowledgments. Thanks to the reviewers for valuable comments and pointers to related work.

References

[1] E. Amir and A. Chang. Learning partially-observable deterministic action models. J. of Artificial Intelligence Research, 33:349–402, 2008.
[2] A. Barto, S. Bradtke, and S. Singh. Learning to act using real-time dynamic programming. Artificial Intelligence, 72:81–138, 1995.
[3] D. Bertsekas. Dynamic Programming and Optimal Control (2 vols). Athena Scientific, 1995.
[4] B. Bonet and H. Geffner. Labeled RTDP: Improving the convergence of real-time dynamic programming. In ICAPS, pages 12–21, 2003.
[5] B. Bonet and H. Geffner. An algorithm better than AO*? In AAAI, pages 1343–1348, 2005.
[6] P. P. Chakrabarti. Algorithms for searching explicit AND/OR graphs and their applications to problem reduction search. Artificial Intelligence, 65(2):329–345, 1994.
[7] R. Dial. Algorithm 360: Shortest path forest with topological ordering. Comm. ACM, 12:632–633, 1969.
[8] J. R. Driscoll and M. Furst. On the diameter of permutation groups. In STOC, pages 152–160, 1983.
[9] P. Erdős and P. Turán. On some problems of a statistical group theory, IV. Acta Math. Acad. Sci. Hungar., 19:413–435, 1969.
[10] M. Furst, J. Hopcroft, and E. Luks. Polynomial-time algorithms for permutation groups. In FOCS, pages 36–41, 1980.
[11] A. Gerevini, B. Bonet, and R. Givan, editors. 5th International Planning Competition, 2006.
[12] W. Goh and E. Schmutz. The expected order of a random permutation. Bull. Lond. Math. Soc., 23:34–42, 1991.
[13] R. P. Goldman and M. S. Boddy. Expressive planning and explicit knowledge. In AIPS, pages 110–117, 1996.
[14] E. Hansen and S. Zilberstein. LAO*: A heuristic search algorithm that finds solutions with loops. Artificial Intelligence, 129:35–62, 2001.
[15] J. Hoffmann and R. Brafman. Contingent planning via heuristic forward search with implicit belief states. In ICAPS, pages 71–80, 2005.
[16] J. Hoffmann and R. Brafman. Conformant planning via heuristic forward search: A new approach. Artificial Intelligence, 170:507–541, 2006.
[17] D. Hsu, W. S. Lee, and N. Rong. What makes some POMDP problems easy to approximate? In NIPS, 2007.
[18] M. Jerrum. The complexity of finding minimum-length generator sequences. Theoretical Computer Science, 36:265–289, 1985.
[19] L. P. Kaelbling, M. Littman, and A. R. Cassandra. Planning and acting in partially observable stochastic domains. Artificial Intelligence, 101:99–134, 1998.
[20] M. Likhachev and A. Stentz. PPCP: Efficient probabilistic planning with clear preferences in partially-known environments. In AAAI, pages 860–867, 2006.
[21] M. Littman. Algorithms for Sequential Decision Making. PhD thesis, Brown University, 1996.
[22] M. Littman, J. Goldsmith, and M. Mundhenk. The computational complexity of probabilistic planning. J. of Artificial Intelligence Research, 9:1–36, 1998.
[23] A. Martelli and U. Montanari. Additive AND/OR graphs. In IJCAI, pages 1–11, 1973.
[24] P. McKenzie. Permutations of bounded degree generate groups of polynomial diameter. Information Processing Letters, 19(5):253–254, 1984.
[25] M. Mundhenk, J. Goldsmith, C. Lusena, and E. Allender. Complexity of finite-horizon Markov decision process problems. J. ACM, 47(4):681–720, 2000.
[26] N. Nilsson. Principles of Artificial Intelligence. Tioga, 1980.
[27] H. Palacios and H. Geffner. From conformant into classical planning: Efficient translations that may be complete too. In ICAPS, 2007.
[28] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov decision processes. Math. of Operations Research, 12(3):441–450, 1987.
[29] K. Pattipati and M. Alexandridis. Applications of heuristic search and information theory to sequential fault diagnosis. IEEE Trans. Systems, Man and Cybernetics, 20:872–887, 1990.
[30] J. Pearl. Heuristics. Morgan Kaufmann, 1983.
[31] J. Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, New York, 2000.
[32] J. Pineau, G. J. Gordon, and S. Thrun. Anytime point-based approximations for large POMDPs. J. of Artificial Intelligence Research, 27:335–380, 2006.
[33] J. Rintanen. Complexity of planning with partial observability. In ICAPS, pages 345–354, 2004.
[34] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 1994.
[35] M. Sipser. Introduction to the Theory of Computation, 2nd edition. Thomson Course Technology, Boston, MA, 2005.
[36] T. Smith and R. Simmons. Heuristic search value iteration for POMDPs. In UAI, pages 520–527, 2004.
[37] M. T. J. Spaan and N. A. Vlassis. Perseus: Randomized point-based value iteration for POMDPs. J. of Artificial Intelligence Research, 24:195–220, 2005.
[38] R. Stong. The average order of a permutation. Electronic Journal of Combinatorics, 5:#R41, 1998.