
UAI 2009 BONET 59

Deterministic POMDPs Revisited

Blai Bonet
Departamento de Computación
Universidad Simón Bolívar
Caracas, Venezuela
[email protected]

Abstract

We study a subclass of POMDPs, called Deterministic POMDPs, that is characterized by deterministic actions and observations. These models do not provide the same generality as POMDPs, yet they capture a number of interesting and challenging problems, and permit more efficient algorithms. Indeed, some of the recent work in planning is built around such assumptions, driven mainly by the quest for amenable models more expressive than the classical deterministic models. We provide results about the fundamental properties of Deterministic POMDPs, their relation with AND/OR search problems and algorithms, and their computational complexity.

1 Introduction

The simplest model for sequential decision making is the deterministic model with known initial and goal states. Solutions are sequences of actions that map the initial state into a goal state and can be computed with standard search algorithms. This model has been studied thoroughly in AI, with important contributions such as A*, IDA*, and others [30, 34].

The deterministic model has strong limitations on the type of problems that can be represented: it is not possible to model situations where actions have non-deterministic outcomes or where states are not fully observable. In such cases, one must resort to more expressive formalisms such as Markov Decision Processes (mdps) and Partially Observable mdps (pomdps). The generality of these models comes at a cost, since the computation of solutions increases in complexity, especially for pomdps; thus one gains in generality but loses in the ability to solve problems. pomdps, for example, are widely used as they offer one of the most general frameworks for sequential decision making [19], yet the known algorithms scale very poorly.

However, we have seen that an important collection of problems that involve uncertainty and partial information have a common characteristic: their actions have deterministic outcomes, and the observations generated at each decision stage also behave deterministically. Indeed, these models have been used in recent proposals for planning with incomplete information [15, 16, 27], appear in works of more general scope [15, 20] and about causation [31], and are used for learning partially-observable action models [1].

These models were briefly considered in Littman's thesis [21] under the name of Deterministic POMDPs (det-pomdps), for which some important theoretical results were obtained. Among others, he showed that a det-pomdp can be mapped into an mdp with an exponential number of states and thus solved with standard MDP algorithms, and that optimal non-stationary policies of polynomial horizon can be computed in non-deterministic polynomial time. Unfortunately, det-pomdps appeared only briefly, as a curiosity of theoretical interest, and then quickly faded from consideration, to the point that, to our knowledge, there are no further publications on this subject from either Littman or others.

Given the role of det-pomdps in recent investigations, motivated mainly by the quest for amenable models for decision making with uncertainty and partial information, we believe that det-pomdps should be studied further. In this paper, we carry out a systematic exploration of det-pomdps, mainly from the complexity perspective, yet we also outline novel algorithms for them. We present three variants of the model: the fully observable, the unobservable, and the general case, and two metrics of performance: worst- and expected-cost. As will be shown, det-pomdps offer a tradeoff between the classical deterministic model and the general pomdp model. Furthermore, their characteristics permit the use of standard and novel AND/OR algorithms, which are simpler and more efficient than the standard algorithms for pomdps [32, 36, 37], or than the pomdp transformation proposed by Littman.

The paper is organized as follows. First, we give examples of challenging problems that help establish the relevance of det-pomdps. We then present the definition and variants of the model in Sect. 3, the relation with AND/OR graphs and algorithms in Sect. 4, and complexity analyses in Sect. 5, and finish with a brief discussion in Sect. 6.

2 Examples

Numerous det-pomdp problems have been used to evaluate and develop different algorithms for planning with uncertainty and partial information. For space reasons, we provide only a few examples and brief descriptions for some of them.

Mastermind. There is a secret word of length m over an alphabet of n symbols. The goal is to discover the word by making guesses about it. Upon each guess, the number of exact matches and near matches is returned. The goal is to obtain a strategy to identify the secret.

Navigation in Partially-Known Terrains. There is a robot in an n × n grid that must navigate from an initial position to a goal position. Each cell of the grid is either traversable or untraversable. The robot has perfect sensing within its cell, but the traversability of a subset of cells is unknown. The task is to obtain a strategy for guiding the robot to its destination [20].

Diagnosis. There are n binary tests for finding out the state of a system among m possible states. An instance consists of an m × n binary matrix T such that Tij = 1 iff test j is positive when the state is i. The goal is to get a strategy for identifying the state [29].

Coins. There are n coins of which one is a counterfeit of different weight, and there is a 2-pan balance scale. A strategy that spots the counterfeit and finds out whether it is heavier or lighter is needed [30].

Domains from IPC. The problems in the 2006 and 2008 International Planning Competitions for the track on conformant planning consisted of domains covering topics such as Blocksworld, circuit synthesis, universal traversal sequences, sorting networks, communication protocols and others [11], all of which are instances of det-pomdps.

3 Model and Variants

Formally, a det-pomdp model is a tuple made of:

– a finite state space S = {1, ..., n},
– finite sets of applicable actions Ai ⊆ A for i ∈ S,
– a finite set of observations O,
– an initial subset of states b0 ⊆ S, or alternatively an initial distribution of states b0 ∈ ∆S,
– a subset T ⊆ S of goal (target) states,
– a deterministic transition function f(i, a) ∈ S, for i ∈ S, a ∈ Ai, that specifies the result of applying action a on state i,
– a deterministic observation function o(i, a) ∈ O, for i ∈ S, a ∈ A, that specifies the observation received when entering state i after the application of action a, and
– positive costs c(i, a), for i ∈ S, a ∈ Ai, that tell the immediate cost of applying a on i.

For simplicity, we assume that goal states are absorbing. That is, once a goal state is entered, the system remains there and incurs no cost; hence At = A, f(t, a) = t and c(t, a) = 0 for t ∈ T.

The two options for b0 depend on whether the interest is in minimizing the worst-case total accumulated cost or the expected total accumulated cost (see below). In any case, b0 is called the initial belief state¹ and describes the different possibilities for the initial state, which is not known a priori. Its importance is crucial since, under the assumption of deterministic transitions, if the initial state were known, then all future states would also be known, and the model would reduce to the well-known deterministic model in AI [34]. Hence, the only source of uncertainty in det-pomdp models comes from the initial state, which further induces an uncertainty on the observations generated at each decision stage. Nonetheless, the model remains challenging, with respect to expressivity and computation, as exemplified in the previous section.

¹ The term 'belief state' refers to a subset of states (or a probability distribution on states) that is deemed possible by the agent at a given point in time.

Although the state of the system may not be fully observable, it is possible (and indeed useful) to consider preconditions for actions. The role of preconditions is to permit the knowledge engineer to design economical representations by leaving unimportant details or undesirable situations out of the specification. For example, if one does not want to model the effects of plugging a 120V artifact into a 240V outlet, then a simple precondition can be used to avoid such situations. Preconditions in the det-pomdp model are expressed through the sets of applicable actions Ai. As is standard in planning, situations in which there are no actions available are called dead ends.

3.1 Optimality Criteria

Our goal is to compute strategies that permit the agent to act optimally. We consider two optimality criteria and three variants of the model. For optimality, we consider the minmax criterion, which minimizes the worst-case cost of a policy, and the minexp criterion, which minimizes the expected cost of a policy. The variants of the model depend on the observation model:

In the real environment, though, the current state transforms into a new state and an observation is generated, which is used to filter out states inconsistent with it; i.e., the filtered belief is b_a^o = {i ∈ b_a : o(i, a) = o} for subsets, and, for distributions,

    b_a^o(i) = 0                  if b_a(i) = 0 or o(i, a) ≠ o,
    b_a^o(i) = b_a(i) / b_a(o)    otherwise,

where b_a(o) is a normalization constant.