Journal of Artificial Intelligence Research

Nonapproximability Results for Partially Observable Markov Decision Processes

Christopher Lusena (lusena@cs.engr.uky.edu)
University of Kentucky, Dept. of Computer Science, Lexington, KY

Judy Goldsmith (goldsmit@cs.engr.uky.edu)
University of Kentucky, Dept. of Computer Science, Lexington, KY

Martin Mundhenk (mundhenk@ti.uni-trier.de)
Universität Trier, FB IV - Informatik, Trier, Germany

Abstract

We show that for several variations of partially observable Markov decision processes, polynomial-time algorithms for finding control policies are unlikely to, or simply do not, have guarantees of finding policies within a constant factor or a constant summand of optimal. Here "unlikely" means "unless some complexity classes collapse," where the collapses considered are P = NP, P = PSPACE, or P = EXP. Until or unless these collapses are shown to hold, any control-policy designer must choose between such performance guarantees and efficient computation.

Introduction

Life is uncertain; real-world applications of artificial intelligence contain many uncertainties. In this work we show that uncertainty breeds uncertainty: in a controlled stochastic system with uncertainty, as modeled by a partially observable Markov decision process, for example, plans can be obtained efficiently or with quality guarantees, but rarely both.

Planning over stochastic domains with uncertainty is hard, in some cases PSPACE-hard or even undecidable (see Papadimitriou & Tsitsiklis; Madani, Hanks, & Condon). Given that it is hard to find an optimal plan, or policy, it is natural to try to find one that is "good enough." In the best of possible worlds, this would mean having an algorithm that is guaranteed to be fast and to produce a policy that is reasonably close to the optimal policy. Unfortunately, we show here that such an algorithm is unlikely or, in some cases, impossible. The implication for algorithm development is that developers should not waste time working toward both guarantees at once.

The particular mathematical models we concentrate on in this paper are Markov decision processes (MDPs) and partially observable Markov decision processes (POMDPs). We consider both the straightforward representations of MDPs and POMDPs and succinct representations, since the complexity of finding policies is measured not in terms of the size of the system but in terms of the size of the representation of the system.

There has been a significant body of work on heuristics for succinctly represented MDPs (see Boutilier, Dean, & Hanks and Blythe for surveys). Some of this work grows out of the engineering tradition (see, for example, Tsitsiklis and Van Roy's article on feature-based methods), which depends on empirical evidence to evaluate algorithms. While there are obvious drawbacks to this approach, our work argues that this may be the most appropriate way to verify the quality of an approximation algorithm, at least if one wants to do so in reasonable time.

The same problems that plague approximation algorithms for uncompressed representations carry over to the succinct representations, and the compression introduces additional complexity. For example, if there is no computable approximation of the optimal policy in the uncompressed case, then compression will not change this. However, it is easy to find the optimal policy for an infinite-horizon fully observable MDP (Bellman), yet EXP-hard (provably harder than polynomial time) to find approximately optimal policies, in time measured in the size of the input, if the input is represented succinctly (see the section on MDPs below).

Note that there are two interpretations to finding an approximation: finding a policy with value close to that of the optimal policy, or simply calculating a value that is close to the optimal value. If we can do the former, and can evaluate policies, then we can certainly do the latter. Therefore, we sometimes show that the latter cannot be done, or cannot be done in time polynomial in the size of the input, unless something unlikely is true.

The class PSPACE consists of those languages recognizable by a Turing machine that uses only p(n) memory for some polynomial p, where n is the size of the input. Because each time step uses at most one unit of memory, P ⊆ PSPACE, though we do not know whether that is a proper inclusion or an equality. Because, given a limit on the amount of memory used, there are only exponentially many configurations of that memory possible with a fixed finite alphabet, PSPACE ⊆ EXP. It is not known whether this is a proper inclusion or an equality either, although it is known that P ≠ EXP. Thus a PSPACE-hardness result says that the problem is apparently not tractable, but an EXP-hardness result says that the problem is certainly not tractable.

Researchers also consider problems that are P-complete under logspace or other highly restricted reductions. For example, the policy existence problem for infinite-horizon MDPs is P-complete (Papadimitriou & Tsitsiklis). This is useful information because it is generally thought that P-complete problems are not susceptible to significant speedup via parallelization. For a more thorough discussion of P-completeness, see Greenlaw, Hoover, and Ruzzo.

We also know that NP ⊆ PSPACE, so P = PSPACE would imply P = NP. Thus any argument or belief that P ≠ NP implies that P ≠ PSPACE. For elaborations of this complexity theory primer, see any complexity theory text, such as Papadimitriou's.

In this paper we show that there is a necessary tradeoff between running-time guarantees and performance guarantees for any general POMDP approximation algorithm, unless P = NP or P = PSPACE. Table 1 gives an overview of our results. Note that, assuming P ≠ NP or P ≠ PSPACE, this tells us that there is no algorithm that runs in time polynomial in the size of the representation of the POMDP and finds a policy that is close to optimal for every instance. It does not say that fast algorithms will produce far-from-optimal values for all POMDPs: there are many instances where the algorithms already in use or being developed will be both fast and close. We simply cannot guarantee that the algorithms will always find a close-to-optimal policy quickly.

policy                 representation   horizon   problem                      complexity

Partial observability
  stationary           flat             n         ε-approximation              not, unless P = NP
  stationary           flat             n         above-average value          not, unless P = NP
  time-dependent       flat             n         ε-approximation              not, unless P = NP
  history-dependent    flat             n         ε-approximation              not, unless P = PSPACE
  stationary           flat             ∞         ε-approximation              not, unless P = NP
  time-dependent       flat             ∞         ε-approximation              uncomputable (Madani et al.)

Unobservability
  time-dependent       flat             n         ε-approximation              not, unless P = NP

Full observability
  stationary           flat             n         k-additive approximation     P-hard
  stationary           TBN              n         k-additive approximation     EXP-hard

Table 1: Hardness for partially and fully observable MDPs.

Heuristics and Approximations

The state of the art with respect to POMDP policy-finding algorithms is that there are three types of algorithms in use or under investigation: exact algorithms, approximations, and heuristics. Exact algorithms attempt to find exact solutions. In the finite-horizon cases, they run in worst-case time at least exponential in the size of the POMDP and the horizon, assuming a straightforward representation of the POMDP. In the infinite horizon, they do not necessarily halt, but can be stopped when the policy is within ε of optimal (a checkable condition). Approximation algorithms construct approximations to what the exact algorithms find. Examples of this include grid-based methods (Hauskrecht; Lovejoy; White). Heuristics come in two flavors: those that construct or find actual policies that can be evaluated, and those that specify a means of choosing an action (for example, "most likely state"), which do not yield policies that can be evaluated using the standard linear-algebra-based methods.

The best current exact algorithm is incremental pruning (IP) with point-based improvement (Zhang & Lee). Littman's analysis of the witness algorithm (Littman, Dean, & Kaelbling; Cassandra, Kaelbling, & Littman) still applies: this algorithm requires exponential time in the worst case. The underlying theory of these algorithms (Witness, IP, etc.) for infinite-horizon cases depends on Bellman's and Sondik's work on value iteration for MDPs and POMDPs (Bellman; Sondik; Smallwood & Sondik).

The best-known family of approximation algorithms is known as grid methods. The basic idea is to use a finite grid of points in the belief space (the space of all probability distributions over the states of the POMDP; this is the underlying space for the algorithms mentioned above) to define a policy. Once the grid points are chosen, all of these algorithms use value iteration on the points to obtain a policy for those belief states, then interpolate to the whole belief space. The difference in the algorithms lies in the choice of grid points. An excellent survey appears in Hauskrecht's work. These algorithms are called approximation algorithms because they approximate the process of value iteration, which the exact algorithms carry out exactly.
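To make the grid idea concrete, the following sketch (ours, not drawn from any of the surveyed papers) performs value iteration restricted to a fixed set of belief points and answers queries for other beliefs by nearest-neighbor interpolation. The array layout (T, Z, R) and the trivial grid-selection and interpolation rules are assumptions made only for illustration; real grid methods differ exactly in those choices.

import numpy as np

def belief_update(T, Z, b, a, obs):
    """Bayes update of belief b after action a and observation obs.
    T[a, s, s2]: transition probabilities; Z[s2, obs]: observation probabilities."""
    b2 = Z[:, obs] * (b @ T[a])              # unnormalized next belief
    norm = b2.sum()
    return (b2 / norm, norm) if norm > 0 else (b2, 0.0)

def grid_value_iteration(T, Z, R, grid, gamma=0.95, sweeps=50):
    """Approximate value iteration over a fixed grid of belief points.
    R[a, s]: reward for action a in state s.  Returns values and greedy actions at the grid points."""
    n_actions, n_states, _ = T.shape
    n_obs = Z.shape[1]
    V = np.zeros(len(grid))

    def interpolate(b):
        # crude interpolation: value of the nearest grid point (real grid methods do better)
        return V[np.argmin([np.linalg.norm(b - g, 1) for g in grid])]

    for _ in range(sweeps):
        newV, acts = np.zeros_like(V), np.zeros(len(grid), dtype=int)
        for i, b in enumerate(grid):
            q = np.zeros(n_actions)
            for a in range(n_actions):
                q[a] = b @ R[a]
                for o in range(n_obs):
                    b2, p = belief_update(T, Z, b, a, o)
                    if p > 0:
                        q[a] += gamma * p * interpolate(b2)
            newV[i], acts[i] = q.max(), q.argmax()
        V = newV
    return V, acts

A grid containing at least the point-mass beliefs (the corners of the belief simplex) is the usual starting point; the quality of the result then depends entirely on how further grid points are chosen, which is exactly where the surveyed algorithms differ.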

Heuristics that do not yield easily evaluated policies are surveyed by Cassandra. These are often very easy to implement and include techniques such as "most likely state" (choosing a state with the highest probability from the belief state and acting as if the system were fully observable) and "minimum entropy" (choosing the action that gives the most information about the current state). Others depend on "voting," where several heuristics or options are combined.
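As a small illustration of why such a rule sidesteps policy evaluation, here is a minimal sketch (ours) of the most-likely-state heuristic; the MDP policy it consults and the array layout are our assumptions, not part of any cited implementation.

import numpy as np

def most_likely_state_action(belief, mdp_policy):
    """Act as if the system were fully observable: pick the most probable state
    under the current belief and take the action an MDP policy prescribes for it."""
    s = int(np.argmax(belief))      # most likely underlying state
    return mdp_policy[s]            # mdp_policy: array mapping state index -> action

# Example: with belief (0.2, 0.5, 0.3) and MDP policy [1, 0, 2], the heuristic picks action 0.

Because the chosen action depends on the belief rather than on the observation sequence through a fixed policy table, the rule does not define a stationary, time-dependent, or history-dependent policy in the sense defined below, which is why it cannot be evaluated by the standard linear-algebra methods.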

There are heuristics based on finite histories, or other uses of finite amounts of memory within the algorithm (Sondik; Platzman; Hansen; Lusena, Li, Sittinger, Wells, & Goldsmith; Meuleau, Kim, Kaelbling, & Cassandra; Meuleau, Peshkin, Kim, & Kaelbling; Peshkin, Meuleau, & Kaelbling; Hansen & Feng; Kim, Dean, & Meuleau). None of these comes with proofs of closeness, except for some of Hansen's work. For the rest, the tradeoff has been made between fast searching through policy space and guarantees.

Structure of This Paper

In the next section we give formal definitions of MDPs, POMDPs, and policies; two-phase temporal Bayes nets (TBNs) are defined later, in the section on MDPs. We then define ε-approximations and k-additive approximations, and show a relationship between the two types of approximability for MDPs and POMDPs.

We separate the complexity results for finite-horizon policy approximation from those for infinite-horizon policies: one section contains nonapproximability results for finite-horizon POMDP policies, and another contains nonapproximability results for infinite-horizon POMDP policies. Although it is relatively easy to find optimal MDP policies, we also consider approximating MDP policies, since the succinctly represented case, at least, is provably hard to approximate.

Some of the more technical proofs are included in appendices, in order to make the body of the paper more readable. However, some proofs from other papers are sketched in the body of the paper, in order to motivate both the results and the proofs newly presented here.

Definitions

Note that MDPs are in fact special cases of POMDPs. The complexity of finding and approximating optimal policies depends on the observability of the system, so our results are segregated by observability. However, one set of definitions suffices.

Partially Observable Markov Decision Processes

A partially observable Markov decision process (POMDP) describes a controlled stochastic system by its states and the consequences of actions on the system. It is denoted as a tuple M = (S, s_0, A, O, t, o, r), where

- S, A, and O are finite sets of states, actions, and observations;
- s_0 ∈ S is the initial state;
- t : S × A × S → [0, 1] is the state transition function, where t(s, a, s') is the probability that state s' is reached from state s on action a; for every s ∈ S and a ∈ A, either Σ_{s'∈S} t(s, a, s') = 1 (if action a can be applied in state s) or Σ_{s'∈S} t(s, a, s') = 0;
- o : S → O is the observation function, where o(s) is the observation made in state s;
- r : S × A → Q is the reward function, where r(s, a) is the reward gained by taking action a in state s.

If states and observations are identical (i.e., O = S and o is the identity function or a bijection), then the MDP is called fully observable. Another special case is unobservable MDPs, where the set of observations contains only one element; i.e., in every state the same observation is made, and therefore the observation function is constant.

Normally, MDPs are represented by |S| × |S| tables, one for each action. However, we will also discuss more succinct representations, in particular two-phase temporal Bayes nets (TBNs). These will be defined in the section on MDPs below.
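As a concrete, purely illustrative picture of the flat representation just described, the sketch below stores a small POMDP as per-action |S| × |S| transition tables plus reward and observation tables. The field names and layout are our own choices, not notation from the paper.

from dataclasses import dataclass
import numpy as np

@dataclass
class FlatPOMDP:
    T: np.ndarray      # T[a, s, s2] = t(s, a, s2): one |S| x |S| table per action
    R: np.ndarray      # R[a, s]     = r(s, a)
    obs: np.ndarray    # obs[s]      = o(s): deterministic observation per state
    s0: int = 0        # initial state

# A two-state, two-action example; action 0 stays put, action 1 flips the state.
example = FlatPOMDP(
    T=np.array([[[1.0, 0.0], [0.0, 1.0]],
                [[0.0, 1.0], [1.0, 0.0]]]),
    R=np.array([[0.0, 1.0],      # reward 1 for taking action 0 in state 1
                [0.0, 0.0]]),
    obs=np.array([0, 0]),        # both states look alike: the process is unobservable
)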

Policies and Performances

A policy describes how to act, depending on observations. We distinguish three types of policies:

- A stationary policy π_s for M is a function π_s : O → A, mapping each observation to an action.
- A time-dependent policy π_t is a function π_t : O × N → A, mapping each ⟨observation, time⟩ pair to an action.
- A history-dependent policy π_h is a function π_h : O* → A, mapping each finite sequence of observations to an action.

Notice that for an unobservable MDP, a history-dependent policy is equivalent to a time-dependent one.

Recent algorithmic development has included consideration of finite-memory policies as well (Hansen; Lusena, Li, Sittinger, Wells, & Goldsmith; Meuleau, Kim, Kaelbling, & Cassandra; Meuleau, Peshkin, Kim, & Kaelbling; Peshkin, Meuleau, & Kaelbling; Hansen & Feng; Kim, Dean, & Meuleau). These are policies that are allowed some finite amount of memory; sufficient allowances would enable such a policy to simulate a full history-dependent policy over a finite horizon, or perhaps a time-dependent policy, or to use less memory more judiciously. One variant of finite-memory policies, which we call free finite memory policies, fixes the amount of memory a priori.

More formally, a free finite memory policy π_f with the finite set M of memory states, for a POMDP M = (S, s_0, A, O, t, o, r), is a function π_f : O × M → A × M, mapping each ⟨observation, memory state⟩ pair to a pair ⟨action, memory state⟩. The set of memory states M can be seen as a finite scratch memory.

(Note that making observations probabilistic does not add any power to MDPs: any probabilistically observable MDP can be turned into one with deterministic observations with only a polynomial increase in its size.)

Free finite memory policies can also simulate stationary policies, so all hardness results for stationary policies apply to free finite memory policies as well. Because one can consider a free finite memory policy to be a stationary policy over the state space S × M, all upper bounds (complexity class membership results) for stationary policies hold for free finite memory policies as well. The advantages of free finite memory policies appear in the constants of the algorithms, and in special (probably large) subclasses of POMDPs where a finite amount of memory suffices for an optimal policy. The maze instances such as McCallum's maze (McCallum; Littman) are such examples: McCallum's maze requires only one bit of memory to find an optimal policy.
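The upper-bound direction of the previous paragraph rests on a simple product construction. The sketch below is our illustration of it, reusing the hypothetical FlatPOMDP layout from the earlier example: it builds a POMDP over the product state space S × M, with actions A × M, in which a free finite memory policy becomes an ordinary stationary policy. The encoding of product states and observations into integers is an assumption made only for concreteness.

import numpy as np
from itertools import product

def product_pomdp(p: FlatPOMDP, n_mem: int) -> FlatPOMDP:
    """Build the POMDP over states S x M and actions A x M in which a free finite
    memory policy with n_mem memory states becomes a stationary policy."""
    n_act, n_st, _ = p.T.shape
    S2, A2 = n_st * n_mem, n_act * n_mem
    T = np.zeros((A2, S2, S2))
    R = np.zeros((A2, S2))
    obs = np.zeros(S2, dtype=int)
    for (s, m), (a, m2) in product(product(range(n_st), range(n_mem)),
                                   product(range(n_act), range(n_mem))):
        si, ai = s * n_mem + m, a * n_mem + m2
        R[ai, si] = p.R[a, s]
        obs[si] = p.obs[s] * n_mem + m                     # observation now also reveals the memory state
        for s2 in range(n_st):
            T[ai, si, s2 * n_mem + m2] = p.T[a, s, s2]     # memory is updated deterministically to m2
    return FlatPOMDP(T=T, R=R, obs=obs, s0=p.s0 * n_mem)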

Let M = (S, s_0, A, O, t, o, r) be a POMDP. A trajectory of length m for M is a sequence of states θ = σ_0, σ_1, ..., σ_m (σ_i ∈ S) which starts with the initial state of M, i.e., σ_0 = s_0. We use T_k(s) to denote the set of length-k trajectories which end in state s.

The expected reward obtained in state s after exactly k steps under policy π is the reward obtained in s by taking the action specified by π, weighted by the probability that s is actually reached after k steps:

R_π(s, k) = r(s, π(o(s))) · Σ_{θ ∈ T_k(s)} Π_{i=1}^{k} t(σ_{i−1}, π(o(σ_{i−1})), σ_i),   if π is a stationary policy,

R_π(s, k) = r(s, π(o(s), k)) · Σ_{θ ∈ T_k(s)} Π_{i=1}^{k} t(σ_{i−1}, π(o(σ_{i−1}), i−1), σ_i),   if π is a time-dependent policy, and

R_π(s, k) = Σ_{θ ∈ T_k(s)} r(s, π(o(σ_0) ··· o(σ_k))) · Π_{i=1}^{k} t(σ_{i−1}, π(o(σ_0) ··· o(σ_{i−1})), σ_i),   if π is a history-dependent policy.

A POMDP may behave differently under optimal policies of each type. The quality of a policy is determined by its performance, i.e., by the expected rewards accrued by it. We distinguish between different performance metrics for POMDPs that run for a finite number of steps and those that run indefinitely.

The finite-horizon performance of a policy π for POMDP M is the expected sum of rewards received during the first |M| steps by following the policy, i.e.,

perf_f(M, π) = Σ_{i=1}^{|M|} Σ_{s∈S} R_π(s, i).

Other work assumes that the horizon is poly(|M|) instead of |M|. This does not change the complexity of any of our problems.

The infinite-horizon total discounted performance gives rewards obtained earlier in the process a higher weight than those obtained later. For a discount factor 0 < β < 1, the total discounted reward is defined as

perf_td(M, π) = Σ_{i=0}^{∞} β^i Σ_{s∈S} R_π(s, i).

The infinite-horizon average performance is the limit of all rewards obtained within n steps, divided by n, for n going to infinity:

perf_av(M, π) = lim_{n→∞} (1/n) · perf_f(M, π, n),

where perf_f(M, π, n) denotes the expected sum of rewards received during the first n steps. If this limit is not defined, the performance is defined as a lim inf.
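For a stationary policy, the finite-horizon and discounted performances can be computed by propagating the state distribution forward; this is one way to realize the policy-evaluation step used throughout the paper. The sketch below is ours and again assumes the hypothetical FlatPOMDP layout from the earlier example.

import numpy as np

def performance(p: FlatPOMDP, policy, horizon, beta=None):
    """Expected total (or, if beta is given, discounted) reward of a stationary policy
    over the given horizon.  policy[obs] is the action taken on observation obs."""
    dist = np.zeros(p.T.shape[1]); dist[p.s0] = 1.0       # distribution over states
    total = 0.0
    for i in range(horizon):
        # expected reward collected at this step
        step = sum(dist[s] * p.R[policy[p.obs[s]], s] for s in range(len(dist)))
        total += step if beta is None else beta**i * step
        # push the distribution one step forward under the policy
        dist = np.array([sum(dist[s] * p.T[policy[p.obs[s]], s, s2] for s in range(len(dist)))
                         for s2 in range(len(dist))])
    return total

# Example: the unobservable two-state POMDP above, always taking action 0, horizon 4.
# print(performance(example, policy=[0], horizon=4))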


Let perf be any of these performance metrics, and let Π be any policy type (stationary, time-dependent, or history-dependent). The value val_Π(M) of M under the metric chosen is the maximal performance of any policy of type Π for M, i.e., val_Π(M) = max_{π∈Π} perf(M, π), where Π is the set of all policies of the given type.

For simplicity, we assume that the size |M| of a POMDP M is determined by the size n of its state space. We assume that there are no more actions than states, and that each state transition probability is given as a binary fraction with n bits and each reward is an integer of at most n bits. This is no real restriction, since adding unreachable "dummy" states allows one to use more bits for transition probabilities and rewards. Also, it is straightforward to transform a POMDP M with non-integer rewards into a POMDP M' with integer rewards such that val(M', k) = c · val(M, k) for some constant c depending only on M and k, and not on val(M, k).

We consider problem instances that are represented in a straightforward way. A POMDP with n states is represented by a set of n × n tables for the transition function (one table for each action), and a similar table for the reward function and for the observation function. We assume that the number of actions and the number of bits needed to store each transition probability or reward does not exceed n, so such a representation requires O(n^4) bits. This can be modified to allow n^k bits without changing the complexity results. In the same way, stationary policies can be encoded as lists with n entries, and time-dependent policies for horizon n as n × n tables.

For each type of POMDP, each type of policy, and each type of performance metric, the value problem is:

given a POMDP, a performance metric (finite-horizon, total discounted, or average performance), and a policy type (stationary, time-dependent, or history-dependent), calculate the value of the best policy of the specified type under the given performance metric.

The policy existence problem is:

given a POMDP, a performance metric, and a policy type, decide whether the value of the best policy of the specified type under the given performance metric is greater than 0.

Approximability

In previous work (Papadimitriou & Tsitsiklis; Mundhenk, Goldsmith, & Allender; Mundhenk, Goldsmith, Lusena, & Allender; Madani et al.), it was shown that the policy existence problem is computationally intractable for most variations of POMDPs, and even undecidable for some infinite-horizon cases. For example, we showed that the stationary policy existence problems for POMDPs with or without negative rewards are NP-complete. Computing an optimal policy is at least as hard as deciding the existence problem. Instead of asking for an optimal policy, we might wish to compute a policy that is guaranteed to have a value that is at least a large fraction of the optimal value.

A polynomial-time algorithm computing such a nearly optimal policy is called an ε-approximation for the optimal policy, where ε (0 < ε < 1) indicates the quality of the approximation in the following way. Let A be a polynomial-time algorithm which, for every POMDP M, computes a policy A(M). Notice that perf(M, A(M)) ≤ val(M) for every M. The algorithm A is called an ε-approximation if, for every POMDP M,

(val(M) − perf(M, A(M))) / val(M) ≤ ε.

See, e.g., Papadimitriou's text for more detailed definitions. Approximability distinguishes NP-complete problems: there are problems which are ε-approximable for all ε, for certain ε, or for no ε, unless P = NP. Note that this definition of approximation requires that val(M) > 0. If a policy with positive performance exists, then every approximation algorithm yields such a policy, because a policy with performance 0 or smaller cannot approximate a policy with positive performance. Hence any approximation straightforwardly solves the policy existence problem.

An approximation scheme yields an ε-approximation for arbitrary ε. If there is a polynomial-time algorithm that, on input a POMDP M and ε, outputs an ε-approximation of the value in time polynomial in the size of M, then we say the problem has a Polynomial-Time Approximation Scheme (PTAS). If the algorithm runs in time polynomial in the size of M and 1/ε, the scheme is a Fully Polynomial-Time Approximation Scheme (FPTAS). All of the PTASs constructed here are FPTASs; we state the theorems in terms of PTASs because that gives stronger results in some cases, and because we do not explicitly analyze the complexity in terms of 1/ε.

If there is a polynomial-time algorithm that outputs an approximation v to the value of M, val(M), with |val(M) − v| ≤ k, then we say that the problem has a k-additive approximation algorithm.

In the context of POMDPs, the existence of a k-additive approximation algorithm and of a PTAS are often equivalent. This might seem surprising to readers who are more familiar with reward criteria that have fixed upper and lower bounds on the performance of a solution, for example the probability of reaching a goal state. In those cases, the fixed bounds on performance will give different results. However, we are addressing the case where there is no a priori upper bound on the performance of policies, even though there are computable upper bounds on the performance of a policy for each instance.

Theorem. For POMDPs with nonnegative rewards and flat representations, under the finite-horizon total (or total discounted) and infinite-horizon total discounted reward metrics, if there exists a k-additive approximation, then we can determine in polynomial time whether there is a policy with performance greater than 0.

Proof. The theorem follows from two facts: (1) given a POMDP M with value v, we can construct another POMDP M_c with value c·v simply by multiplying all rewards in the former POMDP by c; (2) under these reward metrics, we can find a lower bound λ on v if v is not 0.

The computation of the lower bound on the value of M depends on the reward metric. Because there are no negative rewards, in order for the expected reward to be positive in the finite-horizon case, an action with positive reward must be taken with nonzero probability by the last step. Consider only reachable states of the POMDP, and let p be the lowest nonzero transition probability to one of these states, h the horizon, and r_min the smallest nonzero reward; set λ = p^h · r_min. Then p^h is a lower bound on the probability of actually reaching any particular state after h steps, if this probability is nonzero, in particular a state in which reward r_min can be collected. If the reward metric is discounted, then let λ = β^h · p^h · r_min, where β is the discount factor.

Now consider the infinite horizon under a stationary policy. This induces a Markov process, and the policy has nonzero reward if there is a nonzero-probability path to a reward node, i.e., a state from which there is a positive-reward action possible. This is true if and only if there is a nonzero-probability simple path (visiting each node at most once) to a reward node. Such a path accrues reward at least β^{|S|} · p^{|S|} · r_min, which serves as λ for stationary policies. Since stationary policies have values bounded above by the time-dependent and history-dependent values for infinite-horizon POMDPs, this lower bound for the stationary value of the POMDP is also a lower bound for the other policy types.

Finally, note that if the value of a POMDP with nonnegative rewards is 0, then so is the value of M_c, and a k-additive approximation run on M_c cannot return a value greater than k. To determine whether there is a policy with reward greater than 0 for a given POMDP, compute λ, set c such that c·λ > 2k, and run the k-additive approximation algorithm on M_c. The POMDP has positive value if and only if the approximation returns a value greater than k. □
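A minimal sketch of this argument as a procedure, under the assumption that we are handed some black-box k-additive value approximator approx_value (hypothetical; no such routine is claimed to exist unless the stated collapses hold), again using the FlatPOMDP layout from earlier:

def has_positive_value(pomdp, approx_value, k, horizon):
    """Decide whether some policy has positive finite-horizon value, given a k-additive
    value approximator, by scaling rewards so that any nonzero value exceeds 2k."""
    probs = pomdp.T[pomdp.T > 0]
    rewards = pomdp.R[pomdp.R > 0]
    if rewards.size == 0:                              # no positive reward anywhere: value is 0
        return False
    lam = probs.min() ** horizon * rewards.min()       # lower bound on any nonzero value
    c = (2 * k) / lam + 1                              # scale factor with c * lam > 2k
    scaled = FlatPOMDP(T=pomdp.T, R=c * pomdp.R, obs=pomdp.obs, s0=pomdp.s0)
    return approx_value(scaled) > k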

Note that this does not contradict the undecidability result of Madani et al. The problem that they proved undecidable is whether a POMDP with nonpositive rewards has a history-dependent or time-dependent value of 0. We are asking whether it has value 0 in the nonnegative-reward case; answering this question, even if we multiply the rewards by −1, does not answer their question.

Corollary. For POMDPs with flat representations and nonnegative rewards, the POMDP value problem under finite-horizon or infinite-horizon total discounted reward is k-additive approximable if and only if there exists a PTAS for that POMDP value problem.

Note that the corollary depends only on facts (1) and (2) from the proof of the theorem above. Thus any optimization problem with those properties will have a k-additive approximation if and only if it has a PTAS.

Proof. Let v* = val(M), and let A be a polynomial-time k-additive approximation algorithm. First, the PTAS determines, as in the theorem above, whether v* = 0; if so, it outputs 0. Otherwise, given ε, it chooses c such that k/(c·λ) ≤ ε, and thus k/(c·v*) ≤ ε holds. Let v = A(M_c)/c. Note that A(M_c) is the approximation to the value of M_c found by running algorithm A; it is within k of c·v*, so v is within k/c of v*, and therefore (v* − v)/v* ≤ k/(c·v*) ≤ ε. Hence v is an ε-approximation to v*.

Suppose instead that we have a PTAS for the optimal values for this problem, and let A(M, ε) be an algorithm that demonstrates this. Let v* = val(M) and v = A(M, 1/2). If v = 0, then v* = 0 and we can stop. Otherwise v ≥ v*/2, so v* ≤ 2v. We choose ε such that ε·v* ≤ k; taking ε = min{k/(2v), 1/2} suffices, and ε is of polynomial size and polynomial-time computable in |M|, since v is the output of A(M, 1/2). We run A(M, ε); its output differs from v* by at most ε·v* ≤ k. This gives a k-additive approximation. □

A problem that is not ε-approximable for some ε cannot have a PTAS. Therefore, any multiplicative nonapproximability result yields an additive nonapproximability result. However, an additive nonapproximability result only shows that there is no PTAS, although there might be an ε-approximation for some fixed ε.

Non-Approximability for Finite-Horizon POMDPs

This section focuses on finite-horizon policies. Because that is consistent throughout the section, we do not explicitly mention it in each theorem. However, as the section on infinite-horizon policies shows, there are significant computational differences between finite- and infinite-horizon calculations.

The policy existence problem for POMDPs with negative and nonnegative rewards is not suited for approximation: if a policy with positive performance exists, then every approximation algorithm yields such a policy, because a policy with performance 0 or smaller cannot approximate a policy with positive performance. Hence the decision problem is straightforwardly solved by any approximation. Therefore, we concentrate on POMDPs with nonnegative rewards; results for POMDPs with unrestricted rewards are stated as corollaries. Consider an approximation algorithm A that, on input a POMDP M with nonnegative rewards, outputs a policy π_M of type Π. Then it holds that perf(M, π_M) ≤ val_Π(M).

We first consider the question of whether an optimal stationary policy can be approximated for POMDPs with nonnegative rewards. It is known (Littman; Mundhenk et al.) that the related decision problem is NP-complete. We include a sketch of that proof here, since later proofs build on it. The formal details can be found in Appendix A.

Theorem (Littman; Mundhenk et al.). The stationary policy existence problem for POMDPs with nonnegative rewards is NP-complete.

Proof. Membership in NP is straightforward, because a policy can be guessed and evaluated in polynomial time. To show NP-hardness, we reduce the NP-complete satisfiability problem Sat to it. Let φ(x_1, ..., x_n) be such a formula with variables x_1, ..., x_n and clauses C_1, ..., C_m, where clause C_j = l_{v(j,1)} ∨ l_{v(j,2)} ∨ l_{v(j,3)} for l_i ∈ {x_i, ¬x_i}. We say that variable x_i appears in C_j with signum 1 (resp. 0) if x_i (resp. ¬x_i) is a literal in C_j. Without loss of generality, we assume that every variable appears at most once in each clause. The idea is to construct a POMDP M_φ having one state for each appearance of a variable in a clause. The set of observations is the set of variables. Each action corresponds to an assignment of a value to a variable. The transition function is deterministic. The process starts with the first variable in the first clause. If the action chosen in a certain state satisfies the corresponding literal, the process proceeds to the first variable of the next clause, or, with reward 1, to a final sink state T if all clauses were considered. If the action does not satisfy the literal, the process proceeds to the next variable of the clause, or, with reward 0, to a sink state F. A sink state will never be left. The partition of the state space into observation classes guarantees that the same assignment is made for every appearance of the same variable. Therefore, the value of M_φ equals 1 iff φ is satisfiable. The formal reduction is in Appendix A. □
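The reduction is easy to mechanize. The sketch below is ours and follows the proof sketch rather than the formal construction in Appendix A; it builds the deterministic transition, observation, and reward tables of M_φ from a clause list, reusing the hypothetical FlatPOMDP layout from the earlier example.

import numpy as np

def sat_to_pomdp(clauses, n_vars):
    """clauses: list of clauses, each a list of signed 1-based literals (i for x_i, -i for not x_i).
    Builds the POMDP M_phi of the proof sketch: states are literal occurrences plus sinks T and F;
    observations are variables; actions 0/1 assign False/True to the observed variable."""
    occ = [(j, p) for j in range(len(clauses)) for p in range(len(clauses[j]))]
    T_SINK, F_SINK = len(occ), len(occ) + 1
    n = len(occ) + 2
    T = np.zeros((2, n, n)); R = np.zeros((2, n)); obs = np.zeros(n, dtype=int)

    def state(j, p):                          # index of the p-th literal occurrence of clause j
        return occ.index((j, p))

    for (j, p) in occ:
        s = state(j, p)
        var, sign = abs(clauses[j][p]), 1 if clauses[j][p] > 0 else 0
        obs[s] = var - 1                      # observation = the variable being assigned
        sat_next = state(j + 1, 0) if j + 1 < len(clauses) else T_SINK
        unsat_next = state(j, p + 1) if p + 1 < len(clauses[j]) else F_SINK
        for a in (0, 1):                      # action a assigns truth value a to the variable
            nxt = sat_next if a == sign else unsat_next
            T[a, s, nxt] = 1.0
            if nxt == T_SINK:
                R[a, s] = 1.0                 # reward 1 on satisfying the last clause
    for a in (0, 1):                          # sinks are never left
        T[a, T_SINK, T_SINK] = T[a, F_SINK, F_SINK] = 1.0
    obs[T_SINK] = obs[F_SINK] = n_vars        # a dedicated observation for the sinks
    return FlatPOMDP(T=T, R=R, obs=obs, s0=0)

# (x1 or not x2) and (x2 or x3): value 1, since the formula is satisfiable.
# m = sat_to_pomdp([[1, -2], [2, 3]], n_vars=3)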

Note that all policies have expected reward of either 0 or 1. Immediately, we get the nonapproximability result for POMDPs, even if all trajectories have nonnegative performance.

Theorem. Let 0 < ε < 1. An optimal stationary policy for POMDPs with nonnegative rewards is ε-approximable if and only if P = NP.

Proof. The stationary value of a POMDP can be calculated in polynomial time by a binary search, using an oracle for the stationary policy existence problem for POMDPs. The number of bits to be calculated is polynomial in the size of M. Knowing the value, we can try to fix an action for an observation. If the modified POMDP still achieves the value calculated before, we can continue with the next observation, until a stationary policy is found which has the optimal performance. This algorithm runs in polynomial time with an oracle solving the stationary policy existence problem for POMDPs. Since the oracle is in NP by the theorem above, the algorithm runs in polynomial time if P = NP.

Now assume that A is a polynomial-time algorithm that ε-approximates the optimal stationary policy for some ε with 0 < ε < 1. We show that this implies P = NP by showing how to solve the NP-complete problem Sat. As in the proof of the theorem above, given an instance φ of Sat, we construct a POMDP M_φ. The only change to the reward function of the POMDP constructed in that proof is to make it a POMDP with positive performances: now reward 1 is obtained if state F is reached, and reward ⌈2/(1−ε)⌉ is obtained if state T is reached. Hence φ is satisfiable if and only if M_φ has value ⌈2/(1−ε)⌉.

Assume that policy σ is the output of the approximation algorithm A. If φ is satisfiable, then perf(M_φ, σ) ≥ (1−ε)·⌈2/(1−ε)⌉ ≥ 2 > 1. Because the performance of every policy for M_φ is either 1 (if φ is not satisfiable) or 1 or ⌈2/(1−ε)⌉ (if φ is satisfiable), it follows that σ has performance greater than 1 if and only if φ is satisfiable. So, in order to decide Sat, we can construct M_φ, run the approximation algorithm A on it, take its output σ, and calculate perf(M_φ, σ). That output shows whether φ is in Sat. All these steps are polynomial-time bounded computations. It follows that Sat is in P, and hence P = NP. □
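The first half of this argument is a standard self-reduction. Below is our sketch of it, assuming a hypothetical oracle value_exceeds(pomdp, q) for the question "is the optimal stationary value greater than q?" (a question in NP for nonnegative rewards) and a hypothetical helper fix_action; precision bookkeeping is deliberately glossed over.

def optimal_stationary_policy(pomdp, value_exceeds, actions, observations, precision_bits, tol=1e-9):
    """Sketch of the self-reduction in the proof: binary search for the optimal stationary value
    with the assumed NP oracle, then fix actions observation by observation while the optimum
    is preserved."""
    lo, hi = 0.0, 2.0 ** precision_bits            # the value needs only polynomially many bits
    while hi - lo > tol:
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if value_exceeds(pomdp, mid) else (lo, mid)
    optimum = lo

    policy = {}
    for obs in observations:
        for a in actions:
            candidate = fix_action(pomdp, obs, a)  # hypothetical helper: hard-wire action a for obs
            if value_exceeds(candidate, optimum - tol):
                policy[obs], pomdp = a, candidate
                break
    return policy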

Of course, the same nonapproximability result holds for POMDPs with positive and negative rewards.

Corollary. Let 0 < ε < 1. Any optimal stationary policy for POMDPs is ε-approximable if and only if P = NP.

Using the same proof technique as above, we can show that the value is nonapproximable too.

Corollary. Let 0 < ε < 1. The stationary value for POMDPs is ε-approximable if and only if P = NP.

A similar argument can be used to show that a policy with performance at least the average of all performances for a POMDP cannot be computed in polynomial time unless P = NP. Note that in the proof above, the only performance greater than or equal to the average of all performances is that of an optimal policy.

Corollary. The following are equivalent:
1. There exists a polynomial-time algorithm that, for a given POMDP M, computes a stationary policy under which M has performance greater than or equal to the average stationary performance of M.
2. P = NP.

Thus, even calculating a policy whose performance is above average is likely to be infeasible.

We now turn to time-dependent policies. The time-dependent policy existence problem for POMDPs is known to be NP-complete, as is the stationary one.

Theorem (Mundhenk et al.). The time-dependent policy existence problem for unobservable MDPs is NP-complete.

Papadimitriou and Tsitsiklis proved a similar theorem. Their MDPs had only nonpositive rewards, and their formulation of the decision problem was whether there is a policy with performance 0. The proof by Mundhenk et al., like theirs, uses a reduction from Sat. We modify this reduction to show that an optimal time-dependent policy is hard to approximate, even for unobservable MDPs.

Theorem. Let 0 < ε < 1. Any optimal time-dependent policy for unobservable MDPs with nonnegative rewards is ε-approximable if and only if P = NP.

Proof. We give a reduction from Sat with the following properties. For a formula φ with m clauses, we show how to construct an unobservable MDP M_φ with value 1 if φ is satisfiable, and with value less than 1 − ε if φ is not satisfiable. Therefore, an ε-approximation could be used to distinguish between satisfiable and unsatisfiable formulas in polynomial time.

For formula φ, we first show how to construct an unobservable MDP M'_φ, from which M_φ will be constructed. The formal presentation appears in Appendix B. M'_φ simulates the following strategy. At the first step, one of the m clauses is chosen uniformly at random, each with probability 1/m. At step i, the assignment of variable i is determined. Because the process is unobservable, it is guaranteed that each variable gets the same assignment in all clauses, because its value is determined in the same step. If a clause is satisfied by this assignment, a final state will be reached. If not, an error state will be reached.

Now construct M_φ from m copies M_1, ..., M_m of M'_φ, such that the initial state of M_1 is the initial state of M_φ, the initial state of M_{i+1} is the final state T_i of M_i, and reward 1 is gained if the final state of M_m is reached. The error states of all the M_i's are identified as a unique sink state F.

To illustrate the construction, Figure 1 gives an example MDP consisting of a chain of copies of M'_φ obtained for an example formula φ. The dashed arrows indicate a transition with probability 1/m. The dotted (resp. solid) arrows are probability-1 transitions on one action (resp. the other). The actions correspond to assignments to the variables.

Figure 1: An example unobservable MDP M_φ (a chain of copies M_1, ..., M_4 of M'_φ) for an example formula φ.

If φ is satisfiable, then a time-dependent policy simulating m repetitions of any satisfying assignment has performance 1. If φ is not satisfiable, then under any assignment at least one of the m clauses of φ is not satisfied. Hence the probability that, under any time-dependent policy, the final state T of M'_φ is reached is at most 1 − 1/m. Consequently, the probability that the final state of M_φ is reached is at most (1 − 1/m)^m ≤ e^{−1}. This probability equals the expected reward. Since e^{−1} < 1 − ε already holds for small enough ε, and for larger fixed ε the chain can be repeated a constant number of additional times to drive the bound below 1 − ε, the theorem follows. □
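A quick numeric check of this amplification step (ours, purely illustrative): how many groups of m copies are needed before the unsatisfiable-case value (1 − 1/m)^{c·m} drops below 1 − ε.

import math

def copies_needed(eps, m):
    """Smallest multiplier c with (1 - 1/m)**(c*m) < 1 - eps (assumes 0 < eps < 1, m >= 2)."""
    c = 1
    while (1 - 1/m) ** (c * m) >= 1 - eps:
        c += 1
    return c

# For eps = 0.5 a single group suffices, since (1 - 1/m)**m <= 1/e < 0.5; for eps = 0.9,
# c = 3 groups are enough for any m, since (1/e)**3 < 0.1.
print(copies_needed(0.5, 5), copies_needed(0.9, 5))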

Note that the time-dependent policy existence problem for POMDPs with nonnegative rewards is NL-complete (Mundhenk et al.). The class NL consists of those languages recognizable by nondeterministic Turing machines that use a read-only input tape and additional read-write tapes with O(log n) tape cells. It is known that NL ⊆ P and that NL is properly contained in PSPACE. Unlike the case of stationary policies, approximability of time-dependent policies is harder than the policy existence problem, unless NL = NP.

Unobservability is a special case of partial observability. Hence we get the same nonapproximability result for POMDPs, even for unrestricted rewards.

Corollary. Let 0 < ε < 1. Any optimal time-dependent policy for POMDPs is ε-approximable if and only if P = NP.

Corollary. Let 0 < ε < 1. The time-dependent value of POMDPs is ε-approximable if and only if P = NP.

Note that the proof above assumed a total expected reward criterion. The discounted reward criterion is also useful in the finite horizon. To show the result for a discounted reward criterion, we only need to change the reward in the proof above as follows: multiply the final reward by β^{−m·n}, where β is the discount factor, m is the number of clauses, and n is the number of variables of the formula.

Papadimitriou and Tsitsiklis proved that a problem very similar to history-dependent policy existence is PSPACE-complete.

Theorem (Papadimitriou & Tsitsiklis; Mundhenk et al.). The history-dependent policy existence problem for POMDPs is PSPACE-complete.

To describe a horizon-n history-dependent policy for a POMDP with c observations takes space Σ_{i=0}^{n−1} c^i. (We do not address the case of succinctly represented policies for POMDPs here; for an analysis of their complexity, see Mundhenk.) If c ≥ 2, this is exponential space. Therefore, we cannot expect that a polynomial-time algorithm outputs a history-dependent policy, and we restrict consideration to polynomial-time algorithms that approximate the history-dependent value, i.e., the optimal performance under any history-dependent policy of a POMDP. Burago, de Rougemont, and Slissenko considered the class of POMDPs with a bound of q on the number of states corresponding to an observation, where the rewards corresponded to the probability of reaching a fixed set of goal states and thus were bounded by 1. They showed that for any fixed q, the optimal history-dependent policies for POMDPs in this class can be approximated to within an additive constant k. We showed above, in the result relating k-additive approximations to PTASs, that POMDP history-dependent discounted or total-reward value problems that can be approximated to within an additive constant k have polynomial-time approximation schemes, as long as there are no a priori bounds on either the number of states per observation or the rewards.

Notice, however, that the theorem below does not give us information about the classes of POMDPs that Burago et al. considered. Because of the restrictions associated with the parameter q, our hardness results do not contradict their result.

Finally, we show that the history-dependent value of POMDPs with nonnegative rewards is not ε-approximable under total expected or discounted rewards unless P = PSPACE. Consequently, the value has no PTAS or k-additive approximation under the same assumption.

The history-dependent policy existence problem for POMDPs with nonnegative rewards is NL-complete (Mundhenk et al.). Hence, because NL is a proper subclass of PSPACE, approximability of the history-dependent value is strictly harder than the policy existence problem.

Theorem. Let 0 < ε < 1. The history-dependent value of POMDPs with nonnegative rewards is ε-approximable if and only if P = PSPACE.

Proof. The history-dependent value of a POMDP M can be calculated using binary search over the history-dependent policy existence problem. The number of bits to be calculated is polynomial in the size of M. Therefore, by the theorem above, this calculation can be performed in polynomial time using a PSPACE oracle. If P = PSPACE, it follows that the history-dependent value of a POMDP M can be exactly calculated in polynomial time.

The set Qsat of true quantified Boolean formulae is one of the standard PSPACE-complete sets. To conclude P = PSPACE from an ε-approximation of the history-dependent value problem, we use a transformation of instances of Qsat to POMDPs, similar to a proof given by Mundhenk.

The set Qsat can be interpreted as a two-player game. Player ∃ sets the existentially quantified variables, and player ∀ sets the universally quantified variables. Player ∃ wins if the alternating choices determine a satisfying assignment to the formula, and player ∀ wins if the determined assignment is not satisfying. A formula is in Qsat if and only if player ∃ has a winning strategy. This means player ∃ has a response to every choice of player ∀, so that in the end the formula will be satisfied.

The version where player ∀ makes random choices, and player ∃'s goal is to win with probability greater than 1/2, corresponds to Ssat (stochastic satisfiability), which is also PSPACE-complete. The instances of Ssat are formulas which are quantified alternatingly with existential quantifiers ∃ and random quantifiers R. The meaning of the random quantifier R is that an assignment to the respective variable is chosen uniformly at random from {0, 1}. A stochastic Boolean formula

∃x_1 Rx_2 ∃x_3 Rx_4 ··· φ(x_1, ..., x_n)

is in Ssat if and only if

there exists b_1 such that, for random b_2, there exists b_3 such that, for random b_4, ..., Prob[φ(b_1, ..., b_n) is true] > 1/2.

If φ has r random quantifiers, then a strategy of player ∃ determines a set of 2^r assignments to φ. The term Prob[φ(b_1, ..., b_n) is true] > 1/2 means that more than half of these 2^r assignments satisfy φ.
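The game semantics can be evaluated directly by recursion: an existential quantifier maximizes over the two choices, and a random quantifier averages them. The sketch below is our illustration of this definition, not an algorithm from the paper; it takes exponential time, consistent with the PSPACE-completeness of Ssat.

def ssat_value(prefix, phi, assignment=()):
    """prefix: list of ('E', var) or ('R', var) quantifier/variable pairs, in order.
    phi: function mapping a full tuple of 0/1 values to True/False.
    Returns the optimal probability that phi is satisfied when player E plays optimally."""
    if not prefix:
        return 1.0 if phi(assignment) else 0.0
    (kind, _), rest = prefix[0], prefix[1:]
    branches = [ssat_value(rest, phi, assignment + (b,)) for b in (0, 1)]
    return max(branches) if kind == 'E' else sum(branches) / 2.0

# Example: Ex1 Rx2 (x1 == x2) has value 1/2 regardless of E's choice,
# so it is not in Ssat (the probability is not greater than 1/2).
print(ssat_value([('E', 1), ('R', 2)], lambda a: a[0] == a[1]))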

From the proof of IP = PSPACE by Shamir, it follows that for every PSPACE set A and constant c, there is a polynomial-time reduction f from A to Ssat such that for every instance x and formula f(x) = ∃x_1 Rx_2 ··· φ_x, the following holds:

- If x ∈ A, then ∃b_1, for random b_2, ..., Prob[φ_x(b_1, ..., b_n) is true] > 1 − 2^{−c}, and
- if x ∉ A, then for every strategy of the existential player, Prob[φ_x(b_1, ..., b_n) is true] < 2^{−c}.

This means that player ∃ either has a strategy under which she wins with very high probability, or the probability of winning under any strategy is very small. We show how to transform a stochastic Boolean formula into a POMDP with a large history-dependent value if player ∃ has a winning strategy, and a much smaller value if player ∀ wins.

For an instance ∃x_1 Rx_2 ··· φ of Ssat, where φ is a formula with n variables x_1, ..., x_n, we construct a POMDP M_φ as follows. The role of player ∃ is taken by the controller of the process: a strategy of player ∃ determines a policy of the controller, and vice versa. Player ∀ appears as probabilistic transitions in the process. The process M_φ has three stages.

The first stage consists of one step. The process chooses uniformly at random one of the variables and an assignment to it, and stores the variable and the assignment. More formally, from the initial state s_0, one of the states (x_i, b) (1 ≤ i ≤ n, b ∈ {0, 1}) is reached, each with probability 1/(2n). It is not observable which variable assignment was stored by the process. However, whenever that variable appears later, the process checks that the initially fixed assignment is chosen again. If the assignment made during the second stage differs from the stored one, the process halts: there is a deterministic transition to a final state, which we refer to as s_end or, less formally, the "dead end" state. If such an inconsistency occurs during the third stage, the process halts

with reward 0; the process "notices" that the policy cheats, and there is a deterministic transition to a sink state, which we refer to as s_cheat or, less formally, as the "penalty box," because the player sent there cannot re-enter the game later. If eventually the whole formula is passed, either reward 1 or reward 0 is obtained, depending on whether the formula was satisfied or not.

The first stage is sketched in Figure 2. In this and the following figures, dashed arrows represent random transitions, all of equal probability, irrespective of the action chosen; solid arrows represent deterministic transitions corresponding to the action "True"; and dotted arrows represent deterministic transitions corresponding to the action "False."

Figure 2: The first stage of M_φ.

The second stage starts in each of the states (x_i, b) and has n steps, during which an assignment to the variables x_1, x_2, ..., x_n is fixed. Let A_{cb} denote the part of the process (second stage) during which it is assumed that value b is assigned to variable x_c. If a variable x_i is existentially quantified, then the assignment is the action in {0, 1} chosen by the policy. If a variable x_i is randomly quantified, then the assignment is chosen uniformly at random by the process, independent of the action of the policy. In the second stage, it is observable which assignment was made to every variable. If the variable assignment from the first stage does not coincide with the assignment made to that variable during the second stage, the trajectory on which that happens ends in the dead-end state. Let r be the number of random quantifiers of φ. Every strategy of player ∃ determines 2^r assignments. Every assignment x_1 = b_1, ..., x_n = b_n induces 2n trajectories: n have the form

s_0, (x_i, b_i), (x_1, b_1), (x_2, b_2), ..., (x_n, b_n)

for 1 ≤ i ≤ n, pass stage 2 without reaching the dead-end state, and continue with the first state of the third stage; and n dead-end in stage 2. The latter n trajectories, which do not reach stage 3, are of the form

s_0, (x_i, 1 − b_i), (x_1, b_1), ..., (x_i, b_i), s_end

for 1 ≤ i ≤ n. Accordingly, M_φ has 2^r · n trajectories that reach the third stage. The structure of stage 2 is sketched in Figure 3.

Figure 3: The second stage of M_φ (one part A_{cb}) for an example quantifier prefix over x_1, ..., x_4.

The third stage checks whether φ is satisfied by that trajectory's assignment. The process passes sequentially through the whole formula, checking each literal in each clause for an assignment to the respective variable. The case of a cheating policy, i.e., one that answers during the third stage with another assignment than was fixed during the second stage, must be punished. Whenever the variable corresponding to the initially stored assignment appears, the process checks that the stored assignment is consistent with the current assignment. If eventually the whole formula passes the checking, either reward 1 or reward 0 is obtained, depending on whether the formula was satisfied (and the policy was not cheating) or not.

Let C_{cb} be that instance of the third stage in which it is checked whether x_c always gets assignment b. It is essentially the same deterministic process as defined in the proof of the NP-completeness theorem for stationary policies above, but whenever an assignment to a literal containing x_c is asked for, if x_c does not get assignment b, the process goes to state s_cheat; otherwise the process goes to state s_end. If the assignment chosen by the policy satisfied the formula, reward 1 is obtained; otherwise the reward is 0.

The overall structure of M_φ is sketched in Figure 4. Note that the dashed arrows represent random transitions, all of equal probability, irrespective of the action chosen; solid arrows represent deterministic transitions corresponding to the action "True"; dotted arrows represent deterministic transitions corresponding to the action "False"; and dot-dash arrows represent transitions that are forced, whatever the choice of action.

Consider a formula ∃x_1 Rx_2 ··· φ ∈ Ssat with variables x_1, ..., x_n and r random quantifiers, and consider M_φ. Because the third stage is deterministic, the process has 2^r · 2n trajectories, 2^r · n of which reach stage 3. Now assume that σ is a policy which is consistent with the observations from the n steps during the second stage, i.e., whenever it is asked to give an assignment to a variable during the third stage, it does so according to the observations made during the second stage, and therefore it assigns the same value to every appearance of a variable in C_{cb}. (We can also regard the interaction between the process and the policy as an interactive proof system, where the policy presents a proof and the process checks its correctness.) Because φ ∈ Ssat, a fraction of more than 1 − 2^{−c} of the trajectories that reach stage 3 correspond to a satisfying assignment and are continued under this policy to state s_end, where they receive reward 1. Hence the history-dependent value of M_φ is more than 1 − 2^{−c}.

Figure 4: A sketch of M_φ.

For φ ∉ Ssat, an inconsistent or "cheating" policy on M_φ may have performance greater than 2^{−c}. Therefore, we have to perform a probability amplification, as in the proof of the time-dependent nonapproximability theorem above, that punishes cheating. We construct M_φ^k from k copies M_1, ..., M_k of M_φ (the exact value of k will be determined later), such that the initial state of M_1 is the initial state of M_φ^k, and the initial state of M_{i+1} is the final state s_end of M_i. If in some repetition a trajectory is caught cheating, then it is sent to the penalty box and is not continued in the following repetitions; hence it cannot collect any more rewards. More formally, the states s_cheat of all the M_i's are identified as a unique sink state of M_φ^k.

If φ ∈ Ssat, then in each round (or repetition) an expected reward of more than 1 − 2^{−c} can be collected, and hence the value of M_φ^k is more than k(1 − 2^{−c}).

Consider a formula φ ∉ Ssat. Then a non-cheating policy for M_φ^k has performance less than k·2^{−c}. Cheating policies may have better performances. We claim that for all k, the value of M_φ^k is at most k·2^{−c} + n. The proof is an induction on k. Consider M_φ^1, which has the same value as M_φ; since all rewards are 0 or 1, this value is at most 1 ≤ 2^{−c} + n. As the inductive hypothesis, let us assume that M_φ^k has value at most k·2^{−c} + n. In the inductive step, we consider M_φ^{k+1}, i.e., M_φ followed by M_φ^k. Assume that a policy σ cheats on j of the consistency checks in the first round. Of the 2^r · n trajectories that reach the third stage, at least j are trapped for cheating, and at most a 2^{−c} fraction of them, plus at most j·n further trajectories (each cheated check can allow at most n trajectories of the corresponding assignment to pass), may obtain reward 1 in the first round. The rewards obtained in the following rounds are multiplied by (1 − j/(2^r·n)), because a j/(2^r·n) fraction of the trajectories is sent to the penalty box. Using the induction hypothesis, we obtain the following upper bound for the performance of M_φ^{k+1} under σ, for an arbitrary j:

perf_f(M_φ^{k+1}, σ) ≤ 2^{−c} + (j·n)/(2^r·n) + (1 − j/(2^r·n)) · val(M_φ^k)
                     ≤ 2^{−c} + (j·n)/(2^r·n) + (1 − j/(2^r·n)) · (k·2^{−c} + n)
                     = (k+1)·2^{−c} + n − (j·k·2^{−c})/(2^r·n)
                     ≤ (k+1)·2^{−c} + n.

This completes the induction step. Hence we have proved that for φ ∉ Ssat and for every k, the value of M_φ^k is at most k·2^{−c} + n.

Eventually, we have to fix the constants. We choose c such that 2^{−c} < (1 − ε)/(2 − ε). This guarantees that (1 − ε)(1 − 2^{−c}) > 2^{−c}. Next we choose k such that k·((1 − ε)(1 − 2^{−c}) − 2^{−c}) > n.

Let M̂_φ be the POMDP that consists of k repetitions of M_φ, as described above. Because k is linear in the number n of variables of φ, and hence linear in the length of φ, φ can be transformed to M̂_φ in polynomial time. The above estimates guarantee that

value of M̂_φ for φ ∉ Ssat ≤ k·2^{−c} + n < (1 − ε)·k(1 − 2^{−c}).

The right-hand side of this inequality is a lower bound on the output of an ε-approximation of the value of M̂_φ for φ ∈ Ssat. Hence:

- if φ ∈ Ssat, then M̂_φ has value greater than k(1 − 2^{−c}), so any ε-approximation of its value returns at least (1 − ε)·k(1 − 2^{−c}); and
- if φ ∉ Ssat, then M̂_φ has value at most k·2^{−c} + n, so any approximation returns less than (1 − ε)·k(1 − 2^{−c}).

Hence a polynomial-time ε-approximation of the value of M̂_φ shows whether φ is in Ssat.

Concluding, let A be any set in PSPACE. There exists a polynomial-time function f which maps every instance x of A to a bounded-error stochastic formula f(x) = φ_x with error 2^{−c}, and reduces A to Ssat. Transform φ_x into the POMDP M̂_{φ_x}. Using the approximate value of M̂_{φ_x}, one can answer whether φ_x ∈ Ssat, and hence whether x ∈ A, in polynomial time. This shows that A is in P, and consequently P = PSPACE. □

Corollary. Let 0 < ε < 1. The history-dependent value of POMDPs with general rewards is ε-approximable if and only if P = PSPACE.

MDPs

Calculating the finite-horizon performance of stationary policies is in GapL (Mundhenk et al.), which is a subclass of the class of polynomial-time computable functions. The stationary policy existence problem for MDPs was shown to be P-hard by Papadimitriou and Tsitsiklis, from which it follows that finding an optimal stationary policy for MDPs is P-hard. So it is not surprising that approximating the optimal policy is also P-hard. We include the following theorem because it allows us to present, in isolation, one aspect of the reduction used later for succinctly represented MDPs.

Theorem. The problem of k-additive approximation of the optimal stationary policy for MDPs is P-hard.

The proof shows this for the case of nonnegative rewards; the unrestricted case follows immediately. By the result above relating k-additive approximations and PTASs, this shows that finding a multiplicative approximation scheme for this problem is also P-hard.

Proof. Consider the P-complete problem Cvp: given a Boolean circuit C and input x, is C(x) = 1? A Boolean circuit and its input can be seen as a directed acyclic graph. Each node represents a gate, and every gate has one of the types AND, OR, NOT, 0, or 1. The gates of type 0 or 1 are the input gates, which represent the bits of the fixed input x to the circuit. Input gates have indegree 0. All NOT gates have indegree 1, and all AND and OR gates have indegree 2. There is one gate having outdegree 0. This gate is called the output gate, from which the result of the computation of circuit C on input x can be read.

From such a circuit C, an MDP M_C can be constructed as follows. Because the basic idea of the construction is very similar to one shown by Papadimitriou and Tsitsiklis, we leave out technical details. As an initial simplifying assumption, assume that the circuit has no NOT gates. Each gate of the circuit becomes a state of the MDP. The start state is the output gate. Reverse all edges of the circuit; hence a transition in M_C leads from a gate in C to one of its predecessors. A transition from an OR gate depends on the action and is deterministic: on one action its left predecessor is reached, and on the other its right predecessor is reached. A transition from an AND gate is probabilistic and does not depend on the action: with probability 1/2 the left predecessor is reached, and with probability 1/2 the right predecessor is reached.

Continue considering a circuit without NOT gates. If an input gate with value 1 is reached, a large positive reward is gained, and if an input gate with value 0 is reached, no reward is gained, which makes the total expected reward noticeably smaller than otherwise. If C(x) = 1, then the actions can be chosen at the OR gates so that every trajectory reaches an input gate with value 1; conversely, if this condition holds, then it must be that C(x) = 1. Hence the MDP has a large positive value if and only if C(x) = 1.

If the circuit has NOT gates, we need to remember the parity of the number of NOT gates on each trajectory. If the parity is even, everything goes as described above. If the parity is odd, then the role of AND and OR gates is switched, and the role of 0 and 1 gates is switched. If a NOT gate is reached, the parity bit is flipped. For every gate in the circuit we now take two MDP states, one for even and one for odd parity. Hence, if G is the set of gates in C, the MDP has states G × {0, 1}. The state transition function is

t((s, p), a, (s', p')) =
  1,    if s is an OR gate and p = 0, or s is an AND gate and p = 1, and p' = p and s' is predecessor a of s;
  1/2,  if s is an OR gate and p = 1, or s is an AND gate and p = 0, and p' = p and s' is predecessor 1 of s;
  1/2,  if s is an OR gate and p = 1, or s is an AND gate and p = 0, and p' = p and s' is predecessor 2 of s;
  1,    if s is a NOT gate, s' is the predecessor of s, and p' = 1 − p;
  1,    if s is an input gate or the sink state, and s' is the sink state;
  0,    otherwise.

Now we have to specify the reward function. If an input gate with value 1 is encountered on a trajectory where the parity of NOT gates is even, then reward 2^{|C|+k} is obtained, where |C| is the size of circuit C. The same reward is obtained if an input gate with value 0 is encountered on a trajectory where the parity of NOT gates is odd. All other trajectories obtain reward 0.

Thus each trajectory receives reward either 0 or 2^{|C|+k}. There are at most 2^{|C|} trajectories for each policy. If a policy chooses the correct values for all the OR gates in order to prove that C(x) = 1 (in other words, if C(x) = 1), then the expected value of an optimal policy is 2^{|C|+k}. Otherwise, the expected value is at least 2^k lower than 2^{|C|+k}, i.e., at most 2^{|C|+k} − 2^k.

Thus, if an approximation algorithm returns a policy whose value is within an additive constant k of the optimal value, the returned policy has value at least 2^{|C|+k} − k if C(x) = 1, and at most 2^{|C|+k} − 2^k otherwise. Since 2^k > k, inspection of the returned policy's value immediately determines whether C(x) = 1. Thus any k-additive approximation algorithm for this problem can be used to solve Cvp, and the approximation problem is therefore P-hard.

In Figure 5, an example circuit and the MDP to which it is transformed are given. Every gate of the circuit is transformed to two states of the MDP: one copy for even parity of NOT gates passed on that trajectory (indicated by a thin outline of the state), and one copy for odd parity of NOT gates passed on that trajectory (indicated by a thick outline of the state). A solid arrow indicates the outcome of the action "choose the left predecessor," and a dashed arrow indicates the outcome of the action "choose the right predecessor." Dotted arrows indicate a transition with probability 1/2 on any action. The circuit in Figure 5 has value 1: the policy which chooses the right predecessor in the starting state yields trajectories which all end in an input gate with value 1, and which therefore obtains the optimal value. □

Figure 5: An example circuit and the MDP constructed from it.
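The gate-by-gate construction can be written down directly. The following sketch is ours, handles only NOT-free circuits (the simplified case discussed first in the proof), and uses illustrative field names: it builds the reversed-edge MDP with action-controlled OR gates, coin-flip AND gates, and the large reward on reaching a 1-input gate.

import numpy as np

def circuit_to_mdp(gates, output, k=1):
    """gates: dict name -> ('AND', a, b) | ('OR', a, b) | ('IN', 0 or 1) for a NOT-free circuit.
    Returns transition and reward tables of the MDP of the proof (2 actions; sink appended)."""
    names = list(gates)
    idx = {g: i for i, g in enumerate(names)}
    sink = len(names)
    n = sink + 1
    big = 2.0 ** (len(names) + k)                    # the large reward 2^(|C|+k)
    T = np.zeros((2, n, n)); R = np.zeros((2, n))
    for g, spec in gates.items():
        s = idx[g]
        if spec[0] == 'IN':
            T[:, s, sink] = 1.0                      # input gates fall through to the sink
            if spec[1] == 1:
                R[:, s] = big                        # reward only at 1-inputs
        elif spec[0] == 'OR':
            T[0, s, idx[spec[1]]] = 1.0              # one action: left predecessor
            T[1, s, idx[spec[2]]] = 1.0              # the other action: right predecessor
        else:  # AND
            T[:, s, idx[spec[1]]] += 0.5             # both actions: fair coin over predecessors
            T[:, s, idx[spec[2]]] += 0.5
    T[:, sink, sink] = 1.0
    return T, R, idx[output]                         # start state = output gate

# (x1 OR 0) AND 1: value 1, achieved by choosing the left predecessor at the OR gate.
# T, R, start = circuit_to_mdp({'x1': ('IN', 1), 'z': ('IN', 0), 'o': ('IN', 1),
#                               'g1': ('OR', 'x1', 'z'), 'out': ('AND', 'g1', 'o')}, 'out')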

There have been several recent approximation algorithms introduced for structured MDPs, many of which are surveyed in Boutilier et al. More recent work includes a variant of policy iteration by Koller and Parr, and heuristic search in the space of finite controllers by Hansen and Feng and by Kim et al. While these algorithms are often highly effective in reducing the asymptotic complexity and actual run times of policy construction, they all run in time exponential in the size of the structured representation or offer only weak performance guarantees. We show that exponential asymptotic complexity is necessary for any algorithm scheme that produces such approximations for all instances. For this we consider MDPs represented by TBNs (Boutilier, Dearden, & Goldszmidt).

Figure: A circuit, the MDP it is reduced to, and the trajectories according to an optimal policy for the MDP.

Until now, we have described the state transition function for MDPs by a function t(s, a, s') that computes the probability of reaching state s' from state s under action a. We assumed that the transition function was represented explicitly. A two-phase temporal Bayes net (TBN) is a succinct representation of an MDP or POMDP. Each state of the system is described by a vector of values called fluents. Note that if each of n fluents is two-valued, then the system has 2^n states. Actions are described by the effect they have on each fluent by means of two data structures: a dependency graph and a set of functions encoded as conditional probability tables, decision trees, arithmetic decision diagrams, or some other data structure.

The dependency graph is a directed acyclic graph with nodes partitioned into two sets, {v_1, ..., v_n} and {v'_1, ..., v'_n}. The first set of nodes represents the state at time t, the second at time t+1. The edges go from the first set of nodes to the second (asynchronous edges) or within the second set (synchronous edges). The value of the k-th fluent at time t+1 under action a depends probabilistically on the values of the predecessors of v'_k in this graph. Note that the synchronous edges must form a directed acyclic graph in order for the dependencies to be evaluated. The probabilities are spelled out, for each action, in the corresponding data structure for v'_k and a. We will indicate that stochastic function by f_k.
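To fix ideas, the following Python sketch represents such a TBN directly by its per-action dependency lists and conditional probability tables, and samples the next state fluent by fluent; all names and the data layout are our own illustrative assumptions.

    import random

    # A minimal sketch of a two-phase temporal Bayes net (TBN) as described above.
    # parents[a][k] lists the parents of fluent k at time t+1 under action a; each
    # parent is ("old", j) for v_j at time t or ("new", j) for v'_j at time t+1.
    # cpt[a][k] maps a tuple of parent values to the probability that v'_k = 1.

    def sample_next_state(state, action, parents, cpt, order):
        """state: tuple of 0/1 fluent values at time t; order: a topological order
        of the synchronous (time t+1) dependencies. Returns the state at time t+1."""
        new = {}
        for k in order:
            vals = tuple(state[j] if kind == "old" else new[j]
                         for (kind, j) in parents[action][k])
            p_one = cpt[action][k][vals]
            new[k] = 1 if random.random() < p_one else 0
        return tuple(new[k] for k in range(len(state)))

    # Example: two fluents, one action "a"; v'_0 copies v_1, and v'_1 is the AND of
    # v_0 and the freshly computed v'_0 (a synchronous dependency).
    parents = {"a": {0: [("old", 1)], 1: [("old", 0), ("new", 0)]}}
    cpt = {"a": {0: {(0,): 0.0, (1,): 1.0},
                 1: {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 0.0, (1, 1): 1.0}}}
    print(sample_next_state((1, 1), "a", parents, cpt, order=[0, 1]))  # -> (1, 1)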

We make no assumptions about the structure of rewards for TBNs. In fact, the final TBN constructed in the proof of Theorem below has very large rewards, which are computed implicitly: in time polynomial in the size of the TBN one can explicitly compute any individual bit of the reward. This has the effect of making the potential value of the TBN too large to write down with polynomially many bits.

Theorem. The problem of k-additive approximating any optimal stationary policy for an MDP in TBN-representation is EXP-hard.

Proof. The general strategy is similar to the proof of Theorem. We give a reduction from the EXP-complete succinct circuit value problem to the problem for MDPs in TBN-representation. An instance of the succinct circuit value problem is a Boolean circuit S that describes a circuit C and an input x, i.e., S describes an instance of the flat circuit value problem. We can assume that in C each gate is a predecessor to at most two other gates. Then every gate in C has four neighbors, two of which provide the gate's input and two of which take the gate's output as their input (if there are fewer neighbors, the missing neighbors are set to a fictitious gate 0). Consider a gate i of C. Say that the output of neighbors 1 and 2 is the input to gate i, and the output of gate i is input to neighbors 3 and 4. Now the circuit S on input (i, k) outputs (j, s), where gate j is the k-th neighbor of gate i and s encodes the type of gate j (AND, OR, NOT, 0, and 1). The idea is to construct from C an MDP M as in the proof of Theorem. However, we do it succinctly. Hence we construct from S a TBN-representation of an MDP M(S). The actions of M(S) are 1 and 2, for choosing neighbor 1 of the current state (gate) or, respectively, neighbor 2. The states of M(S) are tuples (i, p, t, r), where i is a gate of C, p is the parity bit as in the proof of Theorem, t is the type of gate i, and r is used for a random bit. Every gate number i is given in binary, using say l bits. Then the TBN has l + 3 fluents i_1, i_2, ..., i_l, p, t, r.

Let f_{i_1}, ..., f_{i_l}, f_p, f_t, f_r be the stochastic functions that calculate i'_1, ..., i'_l, p', t', r'. The simplest is f_r, for the fluent r that is used as a random bit. If from state (i_1, ..., i_l, p, t, r) the next state is chosen uniformly at random from one of the two predecessors of gate i in C -- this happens if the type t of gate i is AND and the parity p is 0, or if t is OR and p is 1 -- then in these cases r takes the value 1 or 2 by flipping a coin. Otherwise r equals 0. Notice that r is independent of the action.

The functions f_{i_c} for the fluents i_c determine the bits of the next state. If t is an AND and the parity p is even, then a predecessor of gate i is chosen randomly. Randomly means here that the random bit r determines whether predecessor 1 or predecessor 2 is chosen. Hence i'_c is the c-th bit of j, where (j, s) is the output of S on input (i, r). Accordingly, t' = s is the type of the chosen gate, and p' = p remains unchanged. The same happens if t is an OR and the parity p is odd. If t is a NOT, there is only one predecessor of i, and that one must be chosen for i' and t'. The parity bit p is flipped to p' = 1 - p. If t is an OR and the parity p is even, then on action a in {1, 2} the predecessor a of gate i is chosen. Hence i'_c is the c-th bit of j, where (j, s) is the output of S on input (i, a). Accordingly, t' = s and p' = p. The same happens if t is an AND and the parity p is odd. Hence the function f_{i_c} can be calculated as follows:

    input: (i, p, t, r) and action a
    if (t = OR and p = 0) or (t = AND and p = 1)
        then calculate S(i, a) = (j, s)
    else if (t = OR and p = 1) or (t = AND and p = 0)
        then calculate S(i, r) = (j, s)
    else if t = NOT
        then calculate S(i, 1) = (j, s)
    else j = 0
    output the c-th bit of j

The state with gate number 0 is a sink state, which is reached from the input gates within one step and which is never left. The type t' of the next state (gate) is calculated accordingly.
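Rendered as executable code, the bit function just described might look as follows; here S is assumed to be available as a callable returning the pair (j, s), and the value encodings follow the conventions used above.

    # A minimal sketch of the function f_c above, under the assumption that the
    # circuit S is available as a callable: S(i, k) returns (j, s), the number of
    # the k-th neighbor of gate i and the type of that neighbor.

    def f_c(c, i, p, t, r, a, S):
        """Return the c-th bit of the gate number j of the next state."""
        if (t == "OR" and p == 0) or (t == "AND" and p == 1):
            j, _ = S(i, a)          # the action picks the predecessor
        elif (t == "OR" and p == 1) or (t == "AND" and p == 0):
            j, _ = S(i, r)          # the random bit picks the predecessor
        elif t == "NOT":
            j, _ = S(i, 1)          # the single predecessor
        else:
            j = 0                   # input gates and the sink lead to the sink gate 0
        return (j >> c) & 1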

If this is not the case, and d is the maximum outdegree of a gate, we can replace the circuit by one with maximum outdegree 2 whose size is at most a factor of log d larger. Since d is at most |C|, such a substitution will not affect the asymptotic complexity of any of our algorithms.


One can also simulate the circuit S for the functions f_{i_c} in the above algorithm by a TBN. Note that, in general, circuits can have more than one output. We consider this more general model here.

Claim. Every Boolean circuit can be simulated by a TBN, to which it can be transformed in polynomial time.

Proof. We sketch the construction idea. Let R be a circuit with n input gates and n' output gates. The outcome of the circuit on any input b_1, ..., b_n is usually calculated as follows. At first, calculate the outcome of all gates that get input only from input gates. Next, calculate the outcome of all gates that get their inputs only from those gates whose outcome is already calculated, and so on. This yields an enumeration of the gates of a circuit in topological order, i.e., such that the outcome of a gate can be calculated when all the outcomes of gates with a smaller index are already calculated. We assume that the gates are enumerated in this way, that g_1, ..., g_n are the input gates, and that g_l, ..., g_s are the other gates, where l, the smallest index of a gate which is neither an output nor an input gate, equals max(n, n') + 1.

Now we define a TBN T simulating R as follows. T has a fluent for every gate of R, say fluents v_1, ..., v_s. The basic idea is that fluents v_1, ..., v_n represent the input gates of R. In one time step, values are propagated from the input nodes v_1, ..., v_n through all gate nodes v_l, ..., v_s, and the outputs are copied to v'_1, ..., v'_{n'}. The dependency graph has the following edges, according to the wires of the circuit R. If an input gate g_i (1 <= i <= n) outputs an input to gate g_j, then we get an edge from v_i to v'_j. If the output of a non-input gate g_i (n < i <= s) is input to gate g_j, then we get an edge from v'_i to v'_j. Finally, the nodes v'_1, ..., v'_{n'} stand for the value bits: if gate g_j produces the i-th output bit, then there is an edge from v'_j to v'_i. Because the circuit R has no loop, the graph is loop-free too.

The functions associated to the nodes v'_1, ..., v'_s depend on the functions calculated by the respective gate, and are as follows. Each of the value nodes v'_i, for 1 <= i <= n', which stand for the value bits, has exactly one predecessor, whose value is copied into v'_i. Hence f_i is the one-place identity function, f_i(x) = x with probability 1, for 1 <= i <= n'. Now we consider the nodes which come from internal gates of the circuit. If g_i is an AND gate, then f_i(x, y) = x AND y, where x and y are the predecessors of gate g_i. If g_i is an OR gate, then f_i(x, y) = x OR y, and if g_i is a NOT gate, then f_i(x) = NOT x, all with probability 1. By this construction, it follows that the TBN T simulates the Boolean circuit R. Notice that the number of fluents of T is at most double the number of gates of R. The transformation from R to T can be performed in polynomial time. 2
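A direct, non-optimized rendering of this construction (with an illustrative gate-list format of our own) is sketched below: one call to step evaluates the circuit R on the current fluent values and copies its outputs back into the value fluents.

    # A minimal sketch of the Claim's construction: turn a Boolean circuit R,
    # given as a topologically ordered gate list, into a one-step TBN update.
    # Gates are ("INPUT", []), ("AND", [i, j]), ("OR", [i, j]), or ("NOT", [i]);
    # outputs[i] is the index of the gate producing the i-th output bit.
    # (Gate-list format and function names are illustrative assumptions.)

    def circuit_to_tbn(gates, outputs):
        def step(old):
            """Propagate the fluent values one TBN time step: evaluate every gate
            of R on the current fluent values and copy the outputs back into the
            first len(outputs) fluents (the 'value' fluents)."""
            new = list(old)
            for g, (kind, preds) in enumerate(gates):
                if kind == "INPUT":
                    new[g] = old[g]                      # input fluents keep their value
                elif kind == "AND":
                    new[g] = new[preds[0]] & new[preds[1]]
                elif kind == "OR":
                    new[g] = new[preds[0]] | new[preds[1]]
                elif kind == "NOT":
                    new[g] = 1 - new[preds[0]]
            for i, g in enumerate(outputs):              # value fluents copy the outputs
                new[i] = new[g]
            return tuple(new)
        return step

    # Example: gates 0,1 are inputs; gate 2 = AND(0,1); gate 3 = NOT(2); output is gate 3.
    step = circuit_to_tbn([("INPUT", []), ("INPUT", []), ("AND", [0, 1]), ("NOT", [2])],
                          outputs=[3])
    print(step((1, 1, 0, 0)))  # fluent 0 now holds the output bit NOT(1 AND 1) = 0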

An example of a Boolean circuit and the TBN to which it is transformed as described above is given in the figure below.

Now we can construct, from the circuit S that is a succinct representation of a circuit C, a TBN T_S with fluents i_1, i_2, ..., i_l, p, t, r as already defined, plus additional fluents for the gates of S, using the technique from the above Claim. Taking the action a, the parity p, the gate type t, and the random bit r into account, we can construct T_S according to the description of the function f_{i_c} above, so that the fluent v_{j_c} contains the bit described by the function f_{i_c}(i_1, ..., i_l, p, t, r) above. Notice that this function depends only on the fluents p, t, r, the action a, and the predecessors of the gate of S represented by v_{j_c}.

Figure: A Boolean circuit which outputs the binary sum of its input bits, and a TBN representing the circuit. Only the functions for the identity, for simulating a NOT gate, for simulating an AND gate, and for simulating an OR gate are described.

Hence it has only a constant number of arguments and can be described by a small table. This holds for all fluents of T_S. Finally, the function f_{i_c} for T_S just copies the value of v_{j_c} into i'_c. Hence from S we can construct an MDP in TBN representation similar to the MDP in the proof of Theorem.

Next we specify the rewards of this MDP. The reward is 2^{2^{|S|}+k} if any action is taken on a state representing an input gate with value 1 and parity 0, or with value 0 and parity 1. Otherwise the reward equals 0. This reward function can be represented by a circuit which on binary input (i, a, b) outputs the b-th bit of the reward obtained in state i on action a. Since it requires 2^{|S|} + k + 1 bits to represent the reward, b can be represented using only |S| + k bits.

jS jk

If C(x) = 1, then there is a choice of actions for each state that gives reward 2^{2^{|S|}+k} on every trajectory, similar to the proof of Theorem. However, if C(x) = 0, any policy has at least one trajectory that receives a reward of 0. Now there are at most 2^{2^{|S|}} trajectories, and therefore there is a gap of at least 2^{2^{|S|}+k} / 2^{2^{|S|}} = 2^k between possible values. As above, we conclude that any k-additive approximation to the factored MDP problem gives a decision algorithm for the succinct circuit value problem. Therefore the lower bound of EXP-hardness for the factored MDP value problem holds for this approximation problem as well. 2

The following structured representation is more general than the representations more common to the AI-planning community. We say that an MDP has a succinct representation, or is a succinct MDP, if there are Boolean circuits C_t and C_r such that C_t(s, a, s', i) produces the i-th bit of the transition probability t(s, a, s'), and C_r(s, a, i) produces the i-th bit of the reward r(s, a). Similar to the proof of Theorem, we can also prove nonapproximability of MDP values for succinctly represented MDPs.
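For illustration, such a bit-level representation can be wrapped as follows; the fixed-point encoding of probabilities and the callable form of C_t are our own assumptions, not part of the definition above.

    # A minimal sketch of the succinct-MDP interface described above. The circuit
    # C_t is modelled as a callable returning single bits; `bits` is the number of
    # bits in the fixed-point encoding of a probability (an illustrative assumption).

    def transition_probability(C_t, s, a, s_next, bits):
        """Reassemble t(s, a, s') from its bits: bit i is the coefficient of 2^-(i+1)."""
        return sum(C_t(s, a, s_next, i) * 2.0 ** -(i + 1) for i in range(bits))

    # Example: an MDP where t(s, a, s') = 1/2 for all s, a, s', encoded by a
    # "circuit" that sets only the most significant fractional bit.
    C_t = lambda s, a, s_next, i: 1 if i == 0 else 0
    print(transition_probability(C_t, 0, 0, 1, bits=8))  # 0.5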

Theorem. The problem of k-additive approximating the optimal stationary policy for a succinctly represented MDP is EXP-hard.

Non-Approximability for Infinite-Horizon POMDPs

The discounted value of an infinite-horizon POMDP is the maximum total discounted performance. When we discuss the policy-existence problem or the average-case performance in the infinite horizon, it is necessary to specify the reward criterion. We generalize the value function as follows.

The value val(M) of M is M's maximal performance under any policy of the given type, i.e., val(M) = max_pi perf(M, pi), where pi ranges over the policies of that type.

Note that a time-dependent or history-dependent infinite-horizon policy for a POMDP is not necessarily finitely representable. For fully observable MDPs it turned out (see e.g. Puterman) that the discounted or average value is the performance of a stationary policy. This means that no history-dependent policy performs better than the best stationary one. As an important consequence, an optimal policy is finitely representable. For POMDPs this does not hold. Madani et al. showed that the time-dependent infinite-horizon policy-existence problem for POMDPs is not decidable under average performance or under total discounted performance. In contrast, we show that the same problem for stationary policies is NP-complete.

Theorem. The stationary infinite-horizon policy-existence problem for POMDPs under total discounted or average performance is NP-complete.

The hardness proof is essentially the same as for Theorem. Note that in that construction every stationary policy obtains reward 1 for at most one step, namely when sink state T is reached, meaning that the formula is satisfied. All other steps yield reward 0. Therefore, for this construction, the total discounted value is greater than 0 if and only if the finite-horizon value is so. To make the construction work for the average value, we have to modify it such that once the sink state T is reached, every subsequent action brings reward 1. Therefore the average value equals 1 if the formula is satisfiable, and it equals 0 if it is unsatisfiable. Hence both problems are NP-hard.

Containment in NP for the total discounted performance follows from the guess-and-check approach: guess a stationary policy, calculate its performance, and accept if and only if the performance is positive. The total discounted and the average performance can both be calculated in polynomial time.
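The check step can be made concrete: a stationary policy induces a finite Markov chain, and its total discounted performance solves a linear system. The sketch below is our own illustration; it assumes a discount factor beta < 1 and a reward function of the form r(s, a, s'), which may differ from the paper's conventions.

    import numpy as np

    # A minimal sketch of the check step: given a POMDP (t, o, r), a stationary
    # policy pi mapping observations to actions, and a discount factor beta < 1,
    # compute the total discounted performance from the start state by solving
    # (I - beta * P) v = rho for the Markov chain induced by the policy.

    def discounted_performance(t, o, r, pi, states, start, beta):
        """states is a list; t(s, a, s2) is a probability; r(s, a, s2) a reward."""
        n = len(states)
        P = np.zeros((n, n))
        rho = np.zeros(n)
        for i, s in enumerate(states):
            a = pi[o(s)]                               # action chosen on observation o(s)
            for j, s2 in enumerate(states):
                P[i, j] = t(s, a, s2)
            rho[i] = sum(t(s, a, s2) * r(s, a, s2) for s2 in states)
        v = np.linalg.solve(np.eye(n) - beta * P, rho)
        return v[states.index(start)]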

In the same way, the techniques proving nonapproximability results for the stationary policy in the finite-horizon case (Corollary) can be modified to obtain nonapproximability results for infinite horizons.

Theorem. The stationary infinite-horizon value of POMDPs under total discounted or average performance can be approximated if and only if P = NP.

The infinite-horizon time-dependent policy-existence problems are undecidable (Madani et al.). We show that no algorithm can even approximate optimal policies.

Theorem. The time-dependent infinite-horizon value of unobservable POMDPs under average performance cannot be approximated.

The proof follows from the proof by Madani et al. showing the uncomputability of the time-dependent value. In Madani et al., from a given Turing machine T an unobservable POMDP is constructed having the following properties for an arbitrary parameter: if T halts on empty input, then there is exactly one time-dependent infinite-horizon policy attaining the maximal performance, all other time-dependent policies have strictly smaller performance, and the average value lies between bounds determined by the parameter. This reduces the undecidable problem of whether a Turing machine halts on empty input to the time-dependent infinite-horizon policy-existence problem for unobservable POMDPs under average performance. In fact, assuming that the value of the unobservable POMDP were approximable, we could choose the parameter in a way that even the approximation would enable us to decide whether T halts on empty input. Since this is undecidable, an approximation is impossible.

Corollary. The time-dependent and history-dependent infinite-horizon value of POMDPs under average performance cannot be approximated.

Acknowledgements

This work was supported in part by NSF grants CCR and CCR (Lusena and Goldsmith) and by Deutscher Akademischer Austauschdienst (DAAD) grant PPP gü-ab (Mundhenk). The third author performed part of the work at Dartmouth College.

We would like to thank Daphne Koller and several anonymous referees for catching errors in earlier versions of this paper.

Appendix A: Proof of Theorem

We present the reduction from Mundhenk et al. Let phi(x_1, ..., x_n) be an instance of 3Sat with variables x_1, ..., x_n and clauses C_1, ..., C_m, where clause C_j = l_{1,j} OR l_{2,j} OR l_{3,j} for l_{v,j} in {x_1, ..., x_n, not-x_1, ..., not-x_n}. We say that variable x_i appears in C_j with signum 1 (resp. 0) if x_i (resp. not-x_i) is a literal in C_j.

From phi we construct a POMDP M = (S, s_0, A, O, t, o, r) with

S = {<i, j> | 1 <= i <= n, 1 <= j <= m} union {F, T},
s_0 = <1, 1>,  A = {0, 1},  O = {x_1, ..., x_n, F, T},

t(s, a, s') =
    1  if s = <i, j>, s' = <1, j+1>, 1 <= j < m, 1 <= i <= n,
       and x_i appears in C_j with signum a;
    1  if s = <i, m>, s' = T, 1 <= i <= n,
       and x_i appears in C_m with signum a;
    1  if s = <i, j>, s' = <i+1, j>, 1 <= i < n,
       and x_i does not appear in C_j with signum a;
    1  if s = <n, j>, s' = F,
       and x_n does not appear in C_j with signum a;
    1  if s = s' = F, or s = s' = T;
    0  otherwise;

o(s) =
    x_i  if s = <i, j>;
    T    if s = T;
    F    if s = F;

r(s, a) =
    1  if t(s, a, T) = 1 and s != T;
    0  otherwise.

Note that all transitions in M are deterministic, and every trajectory has value 0 or 1. There is a correspondence between policies for M and assignments of values to the variables of phi, such that policies under which M has value 1 correspond to satisfying assignments for phi, and vice versa.
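As a sanity check of this correspondence, the following sketch (our own; clauses are given as sets of signed literals) simulates the single deterministic trajectory of M under the policy corresponding to an assignment, and reports whether it reaches T.

    # A minimal sketch of the Appendix A construction: simulate the deterministic
    # trajectory of the POMDP M built from a 3-CNF formula under a stationary
    # policy, i.e. under an assignment to the variables. A clause is a set of
    # signed literals, e.g. {(1, 1), (2, 0), (3, 1)} for (x1 OR not-x2 OR x3).

    def reaches_T(clauses, n, assignment):
        """assignment[i] is the value (0/1) the policy assigns to observation x_i."""
        i, j = 1, 1                                    # start state <1, 1>
        while True:
            a = assignment[i]
            if (i, a) in clauses[j - 1]:               # x_i appears in C_j with signum a
                if j == len(clauses):
                    return True                        # move to T: all clauses checked
                i, j = 1, j + 1                        # move to <1, j+1>
            else:
                if i == n:
                    return False                       # move to F: clause j unsatisfied
                i += 1                                 # move to <i+1, j>

    # phi = (x1 OR x2) treated as a one-clause formula; this assignment reaches T.
    clauses = [{(1, 1), (2, 1)}]
    print(reaches_T(clauses, n=2, assignment={1: 0, 2: 1}))  # True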

Appendix B: Proof of Theorem

Again we present the reduction from Mundhenk et al. Let phi be a formula with n variables x_1, ..., x_n and m clauses C_1, ..., C_m. This time we define the unobservable MDP M = (S, s_0, A, t, r), where

S = {<i, j> | 1 <= i <= n, 1 <= j <= m} union {sat_i | 1 <= i <= n} union {s_0, T, F},
A = {0, 1},

t(s, a, s') =
    1/m  if s = s_0, s' = <1, j>, 1 <= j <= m;
    1    if s = <i, j>, s' = sat_i, 1 <= i < n,
         and x_i appears in C_j with signum a;
    1    if s = <i, j>, s' = <i+1, j>, 1 <= i < n,
         and x_i does not appear in C_j with signum a;
    1    if s = <n, j>, s' = T, and x_n appears in C_j with signum a;
    1    if s = <n, j>, s' = F, and x_n does not appear in C_j with signum a;
    1    if s = sat_i, s' = sat_{i+1}, 1 <= i < n;
    1    if s = sat_n, s' = T;
    1    if s = s' = F or s = s' = T (for any action a);
    0    otherwise;

r(s, a) =
    m  if s != T and t(s, a, T) = 1;
    0  otherwise.

References

Bellman, R. Dynamic programming. Princeton University Press.

Blythe, J. Decision-theoretic planning. AI Magazine.

Boutilier, C., Dean, T. L., & Hanks, S. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of AI Research.

Boutilier, C., Dearden, R., & Goldszmidt, M. Exploiting structure in policy construction. In Extending Theories of Action: Formal Theory & Practical Applications, Papers from the AAAI Spring Symposium. AAAI Press, Menlo Park, California.

Burago, D., de Rougemont, M., & Slissenko, A. On the complexity of partially observed Markov decision processes. Theoretical Computer Science.

Cassandra, A. Exact and approximate algorithms for partially observable Markov decision processes. Unpublished doctoral dissertation, Brown University.

Cassandra, A., Kaelbling, L., & Littman, M. L. Efficient dynamic-programming updates in partially observable Markov decision processes. Tech. Rep., Brown University, Providence, Rhode Island.

Greenlaw, R., Hoover, H. J., & Ruzzo, W. L. Limits to parallel computation: P-completeness theory. Oxford University Press.

Hansen, E. (a). Finite-memory control of partially observable systems. Unpublished doctoral dissertation, Dept. of Computer Science, University of Massachusetts at Amherst.

Hansen, E. (b). Solving POMDPs by searching in policy space. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.

Hansen, E., & Feng, Z. Dynamic programming for POMDPs using a factored state representation. In Proceedings of the International Conference on Artificial Intelligence Planning and Scheduling, Breckenridge, CO.

Hauskrecht, M. Planning and control in stochastic domains with imperfect information. Unpublished doctoral dissertation, Massachusetts Institute of Technology.

Kim, K.-E., Dean, T. L., & Meuleau, N. Approximate solutions to factored Markov decision processes via greedy search in the space of finite controllers. In Proceedings of the International Conference on Artificial Intelligence Planning and Scheduling, Breckenridge, CO.

Koller, D., & Parr, R. Policy iteration for factored MDPs. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.

Littman, M. L. Memoryless policies: Theoretical limitations and practical results. In D. Cliff, P. Husbands, J.-A. Meyer, & S. W. Wilson (Eds.), From Animals to Animats: Proceedings of the Third International Conference on Simulation of Adaptive Behavior. MIT Press.

Littman, M. L., Dean, T. L., & Kaelbling, L. P. On the complexity of solving Markov decision problems. In Proceedings of the Annual Conference on Uncertainty in Artificial Intelligence.

Lovejoy, W. S. Computationally feasible bounds for partially observed Markov decision processes. Operations Research.

Lusena, C., Li, T., Sittinger, S., Wells, C., & Goldsmith, J. My brain is full: When more memory helps. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.

Madani, O., Hanks, S., & Condon, A. On the undecidability of probabilistic planning and infinite-horizon partially observable Markov decision problems. In Proceedings of the National Conference on Artificial Intelligence.

McCallum, A. K. Overcoming incomplete perception with utile distinction memory. In Proceedings of the International Conference on Machine Learning. Morgan Kaufmann Publishers.

Meuleau, N., Kim, K.-E., Kaelbling, L. P., & Cassandra, A. R. Solving POMDPs by searching the space of finite policies. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.

Meuleau, N., Peshkin, L., Kim, K.-E., & Kaelbling, L. P. Learning finite-state controllers for partially observable environments. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.

Mundhenk, M. (a). The complexity of optimal small policies. Mathematics of Operations Research.

Mundhenk, M. (b). The complexity of planning with partially-observable Markov decision processes. Tech. Rep., Dartmouth College.

Mundhenk, M., Goldsmith, J., & Allender, E. The complexity of the policy existence problem for partially-observable finite-horizon Markov decision processes. In Proceedings of the Mathematical Foundations of Computer Science Conference. Springer-Verlag.

Mundhenk, M., Goldsmith, J., Lusena, C., & Allender, E. Complexity results for finite-horizon Markov decision process problems. Journal of the ACM.

Papadimitriou, C. H. Computational complexity. Addison-Wesley.

Papadimitriou, C. H., & Tsitsiklis, J. N. Intractable problems in control theory. SIAM Journal of Control and Optimization.

Papadimitriou, C. H., & Tsitsiklis, J. N. The complexity of Markov decision processes. Mathematics of Operations Research.

Peshkin, L., Meuleau, N., & Kaelbling, L. P. Learning policies with external memory. In Proceedings of the International Conference on Machine Learning. Morgan Kaufmann, San Francisco, CA.

Platzman, L. K. Finite-memory estimation and control of finite probabilistic systems. Unpublished doctoral dissertation, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA.

Puterman, M. L. Markov decision processes. John Wiley & Sons, New York.

Shamir, A. IP = PSPACE. Journal of the ACM.

Smallwood, R. D., & Sondik, E. J. The optimal control of partially observed Markov processes over the finite horizon. Operations Research.

Sondik, E. J. The optimal control of partially observable Markov processes. Unpublished doctoral dissertation, Stanford University.

Tsitsiklis, J. N., & Van Roy, B. Feature-based methods for large scale dynamic programming. Machine Learning.

White, C. C., III. Partially observed Markov decision processes: A survey. Annals of Operations Research.

Zhang, N. L., Lee, S. S., & Zhang, W. A method for speeding up value iteration in partially observable Markov decision processes. In Proceedings of the Conference on Uncertainty in Artificial Intelligence.