Regret-based Robust Solutions for Uncertain Markov Decision Processes

Asrar Ahmed, Singapore Management University
Pradeep Varakantham, Singapore Management University

Yossiri Adulyasak, Massachusetts Institute of Technology
Patrick Jaillet, Massachusetts Institute of Technology

Abstract

In this paper, we seek robust policies for uncertain Markov Decision Processes (MDPs). Most robust optimization approaches for these problems have focused on the computation of maximin policies, which maximize the value corresponding to the worst realization of the uncertainty. Recent work has proposed regret as a suitable alternative to the maximin objective for robust optimization. However, existing algorithms for handling minimax regret are restricted to models with uncertainty over rewards only. We provide algorithms that employ sampling to improve across multiple dimensions: (a) handling uncertainty over both transition and reward models; (b) dependence of model uncertainties across state, action pairs and decision epochs; (c) scalability and quality bounds. Finally, to demonstrate the empirical effectiveness of our sampling approaches, we provide comparisons against benchmark algorithms on two domains from the literature. We also provide a Sample Average Approximation (SAA) analysis to compute a posteriori error bounds.

Introduction

Motivated by the difficulty in exact specification of reward and transition models, researchers have proposed the uncertain Markov Decision Process (MDP) model and robustness objectives in solving these models. Given the uncertainty over the reward and transition models, a robust solution can typically provide some guarantees on the worst case performance. Most of the research in computing robust solutions has assumed a maximin objective, where one computes a policy that maximizes the value corresponding to the worst case realization [8, 4, 3, 1, 7]. This line of work has developed scalable algorithms by exploiting independence of uncertainties across states and convexity of uncertainty sets. Recently, techniques have been proposed to deal with dependence of uncertainties [15, 6]. Regan et al. [11] and Xu et al. [16] have proposed the minimax regret criterion [13] as a suitable alternative to the maximin objective for uncertain MDPs. We also focus on this minimax notion of robustness and provide a new myopic variant of regret called Cumulative Expected Regret (CER) that allows for the development of scalable algorithms. Due to the complexity of computing optimal minimax regret policies [16], existing algorithms [12] are restricted to handling uncertainty only in reward models, with uncertainties that are independent across states. Recent research has shown that sampling-based techniques [5, 9] are not only efficient but also provide a priori (Chernoff-Hoeffding bounds) and a posteriori [14] quality bounds for planning under uncertainty. In this paper, we also employ sampling-based approaches to address restrictions of existing approaches for obtaining regret-based solutions for uncertain MDPs. More specifically, we make the

following contributions: (a) an approximate Mixed Integer Linear Programming (MILP) formulation with error bounds for computing minimum regret solutions for uncertain MDPs, where the uncertainties across states are dependent; we further provide enhancements and error bounds to improve applicability. (b) We introduce a new myopic concept of regret, referred to as Cumulative Expected Regret (CER), that is intuitive and that allows for the development of scalable approaches. (c) Finally, we perform a Sample Average Approximation (SAA) analysis to provide experimental bounds for our approaches on benchmark problems from the literature.

Preliminaries

We now formally define the two regret criteria that will be employed in this paper. In the definitions below, we assume an underlying MDP, $M = \langle S, A, T, R, H \rangle$, where a policy is represented as $\vec{\pi}^t = \{\pi^t, \pi^{t+1}, \ldots, \pi^{H-1}\}$, the optimal policy as $\vec{\pi}^*$ and the optimal expected value as $v^0(\vec{\pi}^*)$. The maximum reward in any state $s$ is denoted as $R^*(s) = \max_a R(s,a)$. Throughout the paper, we use $\alpha(s)$ to denote the starting state distribution in state $s$ and $\gamma$ to represent the discount factor.

Definition 1 Regret for any policy $\vec{\pi}^0$ is denoted by $reg(\vec{\pi}^0)$ and is defined as:
$$reg(\vec{\pi}^0) = v^0(\vec{\pi}^*) - v^0(\vec{\pi}^0), \;\text{ where }\; v^0(\vec{\pi}^0) = \sum_s \alpha(s) \cdot v^0(s, \vec{\pi}^0),$$
$$v^t(s, \vec{\pi}^t) = \sum_a \pi^t(s,a) \cdot \Big[ R(s,a) + \gamma \sum_{s'} T(s,a,s') \cdot v^{t+1}(s', \vec{\pi}^{t+1}) \Big]$$

Extending the definitions of simple and cumulative regret in stochastic multi-armed bandit problems [2], we now define a new variant of regret called Cumulative Expected Regret (CER).

Definition 2 CER for policy $\vec{\pi}^0$ is denoted by $creg(\vec{\pi}^0)$ and is defined as:
$$creg(\vec{\pi}^0) = \sum_s \alpha(s) \cdot creg^0(s, \vec{\pi}^0), \;\text{ where}$$
$$creg^t(s, \vec{\pi}^t) = \sum_a \pi^t(s,a) \cdot \Big[ R^*(s) - R(s,a) + \gamma \sum_{s'} T(s,a,s') \cdot creg^{t+1}(s', \vec{\pi}^{t+1}) \Big] \quad (1)$$

The following properties highlight the dependencies between regret and CER.

Proposition 1 For a policy $\vec{\pi}^0$:
$$0 \leq reg(\vec{\pi}^0) - creg(\vec{\pi}^0) \leq \Big[ \max_s R^*(s) - \min_s R^*(s) \Big] \cdot \frac{(1-\gamma^H)}{1-\gamma}$$

Proof Sketch¹ By rewriting Equation (1) as $creg(\vec{\pi}^0) = v^{0,\#}(\vec{\pi}^0) - v^0(\vec{\pi}^0)$, we provide the proof.

Corollary 1 If $\forall s, s' \in S : R^*(s) = R^*(s')$, then $\forall \vec{\pi}^0 : creg(\vec{\pi}^0) = reg(\vec{\pi}^0)$.

Proof. Substituting $\max_s R^*(s) = \min_s R^*(s)$ in the result of Proposition 1, we have $creg(\vec{\pi}^0) = reg(\vec{\pi}^0)$. $\blacksquare$
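Definitions 1 and 2 translate directly into backward-induction recursions. The following sketch (ours, not from the paper) computes the value, regret and CER of a fixed policy for a single known model; array shapes and function names are illustrative, and the sketch can be used to check Proposition 1 numerically on small instances.

```python
# A minimal sketch (ours) of Definitions 1 and 2 for a single known finite-horizon
# model: policy value, regret and CER by backward induction.
import numpy as np

def policy_value(T, R, pi, gamma):
    """T[t,s,a,s'], R[t,s,a], pi[t,s,a] -> v[t,s] of Definition 1."""
    H, S, A, _ = T.shape
    v = np.zeros((H + 1, S))
    for t in reversed(range(H)):
        q = R[t] + gamma * T[t] @ v[t + 1]          # action values q[s,a]
        v[t] = (pi[t] * q).sum(axis=1)              # expectation over pi^t(s,.)
    return v

def cer(T, R, pi, gamma):
    """Cumulative Expected Regret of Definition 2 (Equation (1))."""
    H, S, A, _ = T.shape
    c = np.zeros((H + 1, S))
    for t in reversed(range(H)):
        r_star = R[t].max(axis=1, keepdims=True)    # R*(s) = max_a R(s,a)
        q = (r_star - R[t]) + gamma * T[t] @ c[t + 1]
        c[t] = (pi[t] * q).sum(axis=1)
    return c

def optimal_value(T, R, gamma):
    H, S, A, _ = T.shape
    v = np.zeros((H + 1, S))
    for t in reversed(range(H)):
        v[t] = (R[t] + gamma * T[t] @ v[t + 1]).max(axis=1)
    return v

def regret(T, R, pi, alpha, gamma):
    """reg of Definition 1: optimal value minus policy value, weighted by alpha."""
    v_star = optimal_value(T, R, gamma)[0]
    v_pi = policy_value(T, R, pi, gamma)[0]
    return alpha @ (v_star - v_pi)
```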

Uncertain MDP

A finite horizon uncertain MDP is defined as the tuple $\langle S, A, \mathcal{T}, \mathcal{R}, H \rangle$. $S$ denotes the set of states and $A$ denotes the set of actions. $\mathcal{T} = \Delta^\tau(T)$ denotes a distribution over the set of transition functions $T$, where $T^t_k(s,a,s')$ denotes the probability of transitioning from state $s \in S$ to state $s' \in S$ on taking action $a \in A$ at time step $t$ according to the $k$th element in $T$. Similarly, $\mathcal{R} = \Delta^\rho(R)$ denotes the distribution over the set of reward functions $R$, where $R^t_k(s,a,s')$ is the reinforcement obtained on taking action $a$ in state $s$ and transitioning to state $s'$ at time $t$ according to the $k$th element in $R$. Both the $T$ and $R$ sets can have infinitely many elements. Finally, $H$ is the time horizon.

In the above representation, every element of $T$ and $R$ represents uncertainty over the entire horizon and hence this representation captures dependence in uncertainty distributions across states. We now provide a formal definition for the independence of uncertainty distributions that is equivalent to the rectangularity property introduced in Iyengar et al. [4].

¹Detailed proof provided in supplement under Proposition 1.

Definition 3 An uncertainty distribution $\Delta^\tau$ over the set of transition functions $T$ is independent over state-action pairs at various decision epochs if
$$\Delta^\tau(T) = \times_{s \in S, a \in A, t \leq H} \, \Delta^{\tau,t}_{s,a}(T^t_{s,a}), \;\;\text{ i.e. }\;\; \forall k, \; Pr_{\Delta^\tau}(T_k) = \prod_{s,a,t} Pr_{\Delta^{\tau,t}_{s,a}}(T^t_{k,s,a})$$
where $T = \times_{s,a,t} T^t_{s,a}$, $T^t_{s,a}$ is the set of transition functions for $s, a, t$; $\Delta^{\tau,t}_{s,a}$ is the distribution over the set $T^t_{s,a}$ and $Pr_{\Delta^\tau}(T_k)$ is the probability of the transition function $T_k$ given the distribution $\Delta^\tau$.

We can provide a similar definition for the independence of uncertainty distributions over the reward functions. In the following definitions, we include the transition model $T$ and reward model $R$ as subscripts to indicate the value ($v$), regret ($reg$) and CER ($creg$) functions corresponding to a specific MDP. Existing works on computation of maximin policies have the following objective:
$$\pi^{maximin} = \arg\max_{\vec{\pi}^0} \min_{T \in \mathcal{T}, R \in \mathcal{R}} \sum_s \alpha(s) \cdot v^0_{T,R}(s, \vec{\pi}^0)$$
Our goal is to compute policies that minimize the maximum regret or cumulative regret over possible models of transition and reward uncertainty:
$$\pi^{reg} = \arg\min_{\vec{\pi}^0} \max_{T \in \mathcal{T}, R \in \mathcal{R}} reg_{T,R}(\vec{\pi}^0); \qquad \pi^{creg} = \arg\min_{\vec{\pi}^0} \max_{T \in \mathcal{T}, R \in \mathcal{R}} creg_{T,R}(\vec{\pi}^0)$$
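The practical difference between the dependent and the independent (rectangular) cases of Definition 3 is how samples of the model are drawn. The sketch below is our illustration (not from the paper): in the dependent case a single scenario index fixes the model for the whole horizon, while in the independent case every $(s, a, t)$ entry is drawn from its own marginal; all data structures are assumptions for illustration.

```python
# A small sketch (ours) contrasting sampling in the dependent and independent
# cases of Definition 3.
import numpy as np

rng = np.random.default_rng(0)

def sample_dependent(scenarios, weights):
    """scenarios: list of full-horizon (T, R) models; one scenario index fixes everything."""
    k = rng.choice(len(scenarios), p=weights)
    return scenarios[k]                      # dependence: the same k for every (s, a, t)

def sample_independent(marginals):
    """marginals[t][s][a]: list of (transition_row, reward) candidates for that (s, a, t)."""
    H = len(marginals); S = len(marginals[0]); A = len(marginals[0][0])
    T = np.zeros((H, S, A, S)); R = np.zeros((H, S, A))
    for t in range(H):
        for s in range(S):
            for a in range(A):
                row, r = marginals[t][s][a][rng.integers(len(marginals[t][s][a]))]
                T[t, s, a] = row             # each (s, a, t) entry drawn on its own
                R[t, s, a] = r
    return T, R
```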

Regret Minimizing Solution

We will first consider the more general case of dependent uncertainty distributions. Our approach to obtaining a regret minimizing solution relies on sampling the uncertainty distributions over the transition and reward models. We formulate the regret minimization problem over the sample set as an optimization problem and then approximate it as a Mixed Integer Linear Program (MILP).

We now describe the representation of a sample and the definition of optimal expected value for a sample, a key component in the computation of regret. Since there are dependencies amongst uncertainties, we can only sample from $\Delta^\tau$, $\Delta^\rho$ and not from $\Delta^{\tau,t}_{s,a}$, $\Delta^{\rho,t}_{s,a}$. Thus, a sample is:
$$\xi_q = \{ \langle T^0_q, R^0_q \rangle, \langle T^1_q, R^1_q \rangle, \cdots, \langle T^{H-1}_q, R^{H-1}_q \rangle \}$$
where $T^t_q$ and $R^t_q$ refer to the transition and reward model respectively at time step $t$ in sample $q$. Let $\vec{\pi}^t$ represent the policy for each time step from $t$ to $H-1$ and let the set of samples be denoted by $\xi$. Intuitively, this corresponds to $|\xi|$ discrete MDPs and our goal is to compute one policy that minimizes the regret over all the $|\xi|$ MDPs, i.e.
$$\pi^{reg} = \arg\min_{\vec{\pi}^0} \max_{\xi_q \in \xi} \sum_s \alpha(s) \cdot \big[ v^*_{\xi_q}(s) - v^0_{\xi_q}(s, \vec{\pi}^0) \big]$$
where $v^*_{\xi_q}$ and $v^0_{\xi_q}(s, \vec{\pi}^0)$ denote the optimal expected value and the expected value for policy $\vec{\pi}^0$ respectively of the sample $\xi_q$.

Let $\vec{\pi}^0$ be any policy; then the expected value for sample $\xi_q$ is defined as follows:
$$v^t_{\xi_q}(s, \vec{\pi}^t) = \sum_a \pi^t(s,a) \cdot v^t_{\xi_q}(s, a, \vec{\pi}^t), \;\text{ where }\; v^t_{\xi_q}(s, a, \vec{\pi}^t) = R^t_q(s,a) + \gamma \sum_{s'} v^{t+1}_{\xi_q}(s', \vec{\pi}^{t+1}) \cdot T^t_q(s,a,s')$$
The optimization problem for computing the regret minimizing policy corresponding to sample set $\xi$ is then defined as follows:
$$\min_{\vec{\pi}^0} \; reg(\vec{\pi}^0)$$
$$\text{s.t.} \quad reg(\vec{\pi}^0) \geq v^*_{\xi_q} - \sum_s \alpha(s) \cdot v^0_{\xi_q}(s, \vec{\pi}^0) \quad \forall \xi_q \quad (2)$$
$$v^t_{\xi_q}(s, \vec{\pi}^t) = \sum_a \pi^t(s,a) \cdot v^t_{\xi_q}(s, a, \vec{\pi}^t) \quad \forall s, \xi_q, t \quad (3)$$
$$v^t_{\xi_q}(s, a, \vec{\pi}^t) = R^t_q(s,a) + \gamma \sum_{s'} v^{t+1}_{\xi_q}(s', \vec{\pi}^{t+1}) \cdot T^t_q(s,a,s') \quad \forall s, a, \xi_q, t \quad (4)$$
The value function expression in Equation (3) is a product of two variables, $\pi^t(s,a)$ and $v^t_{\xi_q}(s, a, \vec{\pi}^t)$, which hampers scalability significantly. We now linearize these nonlinear terms.
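Before turning to the linearization, note that the per-sample optimal values $v^*_{\xi_q}$ in constraint (2) are computed once, by ordinary backward induction on each sampled MDP. The sketch below (ours) computes these values and evaluates the sampled max-regret objective for a candidate policy, reusing the optimal_value() and policy_value() helpers from the earlier sketch; names are illustrative.

```python
# A sketch (ours) of the inner maximisation of the sampled objective: given Q
# sampled models and a candidate policy, compute each sample's optimal value
# v*_q and the worst-case regret over the sample set.
import numpy as np

def max_regret_over_samples(samples, pi, alpha, gamma):
    """samples: list of (T, R) with T[t,s,a,s'], R[t,s,a]; pi[t,s,a] a (randomized) policy."""
    worst = -np.inf
    for T, R in samples:
        v_star = alpha @ optimal_value(T, R, gamma)[0]   # v*_q of the sample
        v_pi = alpha @ policy_value(T, R, pi, gamma)[0]  # v^0_q under the candidate policy
        worst = max(worst, v_star - v_pi)                # the quantity bounded in Eq. (2)
    return worst
```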

Mixed Integer Linear Program

The optimal policy for minimizing maximum regret in the general case is randomized. However, to account for domains which only allow for deterministic policies, we provide the linearization separately for the two cases of deterministic and randomized policies.

Deterministic Policy: In the case of deterministic policies, we replace Equation (3) with the following equivalent integer linear constraints:
$$v^t_{\xi_q}(s, \vec{\pi}^t) \leq v^t_{\xi_q}(s, a, \vec{\pi}^t); \qquad v^t_{\xi_q}(s, \vec{\pi}^t) \leq \pi^t(s,a) \cdot M$$
$$v^t_{\xi_q}(s, \vec{\pi}^t) \geq v^t_{\xi_q}(s, a, \vec{\pi}^t) - (1 - \pi^t(s,a)) \cdot M \qquad \forall s, a, \xi_q, t \quad (5)$$
$M$ is a large positive constant that is an upper bound on $v^t_{\xi_q}(s, a, \vec{\pi}^t)$. Equivalence to the product terms in Equation (3) can be verified by considering all values of $\pi^t(s,a)$.

Randomized Policy: When $\vec{\pi}^0$ is a randomized policy, we have a product of two continuous variables. We provide a mixed integer linear approximation to address the product terms above. Let
$$A^t_{\xi_q}(s, a, \vec{\pi}^t) = \frac{v^t_{\xi_q}(s, a, \vec{\pi}^t) + \pi^t(s,a)}{2}; \qquad B^t_{\xi_q}(s, a, \vec{\pi}^t) = \frac{v^t_{\xi_q}(s, a, \vec{\pi}^t) - \pi^t(s,a)}{2}$$
Equation (3) can then be rewritten as:
$$v^t_{\xi_q}(s, \vec{\pi}^t) = \sum_a \big[ A^t_{\xi_q}(s, a, \vec{\pi}^t)^2 - B^t_{\xi_q}(s, a, \vec{\pi}^t)^2 \big] \quad (6)$$
As discussed in the next subsection on "Pruning dominated actions", we can compute upper and lower bounds for $v^t_{\xi_q}(s, a, \vec{\pi}^t)$ and hence for $A^t_{\xi_q}(s, a, \vec{\pi}^t)$ and $B^t_{\xi_q}(s, a, \vec{\pi}^t)$. We approximate the squared terms by using piecewise linear components that provide an upper bound on the squared terms. We employ a standard method from the literature of dividing the variable range into multiple break points. More specifically, we divide the overall range of $A^t_{\xi_q}(s, a, \vec{\pi}^t)$ (or $B^t_{\xi_q}(s, a, \vec{\pi}^t)$), say $[br_0, br_r]$, into $r$ intervals by using $r+1$ points, namely $\langle br_0, br_1, \ldots, br_r \rangle$. We associate a linear variable $\lambda^t_{\xi_q}(s, a, w)$ with each break point $w$ and then approximate $A^t_{\xi_q}(s, a, \vec{\pi}^t)^2$ (and $B^t_{\xi_q}(s, a, \vec{\pi}^t)^2$) as follows:
$$A^t_{\xi_q}(s, a, \vec{\pi}^t) = \sum_w \lambda^t_{\xi_q}(s, a, w) \cdot br_w, \quad \forall s, a, \xi_q, t \quad (7)$$
$$A^t_{\xi_q}(s, a, \vec{\pi}^t)^2 = \sum_w \lambda^t_{\xi_q}(s, a, w) \cdot (br_w)^2, \quad \forall s, a, \xi_q, t \quad (8)$$
$$\sum_w \lambda^t_{\xi_q}(s, a, w) = 1, \quad \forall s, a, \xi_q, t \quad (9)$$
$$\mathrm{SOS2}_{s,a,t}\big(\{\lambda^t_{\xi_q}(s, a, w)\}_{w \leq r}\big), \quad \forall s, a, \xi_q, t$$
where SOS2 is a construct associated with a set of variables of which at most two can be non-zero, and if two variables are non-zero they must be adjacent. Since any number in the range lies between at most two adjacent break points, we impose the above constructs on the $\lambda^t_{\xi_q}(s, a, w)$ variables. We implement these adjacency constraints on $\lambda^t_{\xi_q}(s, a, w)$ using the CPLEX Special Ordered Sets (SOS) type 2².
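To make the deterministic-policy case concrete, the sketch below is our compact illustration of the MILP of Equations (2) and (4). It is not the authors' implementation: it uses the open-source PuLP modeller with the bundled CBC solver in place of CPLEX, and it uses the standard two-sided big-M linearisation of the product of a binary policy variable and a bounded value variable, which serves the same purpose as the constraints in (5). The per-sample optimal values v_star and the big-M constant are assumed inputs.

```python
# A sketch (ours) of the deterministic-policy regret MILP, using PuLP/CBC.
import pulp

def solve_regret_milp(samples, v_star, alpha, gamma, H, M=1e3):
    """samples: list of (T, R) with T[t][s][a][s'], R[t][s][a]; v_star[q]: optimal value of sample q."""
    S = range(len(alpha)); Q = range(len(samples))
    A = range(samples[0][1].shape[2])                    # R has shape (H, S, A)
    prob = pulp.LpProblem("min_max_regret", pulp.LpMinimize)
    reg = pulp.LpVariable("reg")
    pi = pulp.LpVariable.dicts("pi", (range(H), S, A), cat="Binary")
    v = pulp.LpVariable.dicts("v", (Q, range(H + 1), S), lowBound=-M, upBound=M)
    prob += reg                                          # objective: the max regret variable
    for t in range(H):
        for s in S:
            prob += pulp.lpSum(pi[t][s][a] for a in A) == 1   # one action per (t, s)
    for q, (T, R) in enumerate(samples):
        for s in S:
            prob += v[q][H][s] == 0                      # terminal values
        # Equation (2): reg bounds the regret of every sample
        prob += reg >= v_star[q] - pulp.lpSum(alpha[s] * v[q][0][s] for s in S)
        for t in range(H):
            for s in S:
                for a in A:
                    # Equation (4): action value of (s, a) under sample q
                    q_sa = R[t][s][a] + gamma * pulp.lpSum(
                        T[t][s][a][s2] * v[q][t + 1][s2] for s2 in S)
                    # big-M linearisation: v[q][t][s] equals q_sa for the chosen action
                    prob += v[q][t][s] <= q_sa + (1 - pi[t][s][a]) * M
                    prob += v[q][t][s] >= q_sa - (1 - pi[t][s][a]) * M
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return pulp.value(reg), pi          # query actions via pulp.value(pi[t][s][a])
```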

Proposition 2 Let $[c, d]$ denote the range of values for $A^t_{\xi_q}(s, a, \vec{\pi}^t)$ and assume we have $r+1$ points that divide $A^t_{\xi_q}(s, a, \vec{\pi}^t)^2$ into $r$ equal intervals of size $\epsilon = \frac{d^2 - c^2}{r}$. Then the approximation error $\delta < \frac{\epsilon}{4}$.

Proof: Let the $r+1$ points be $br_0, \ldots, br_r$. By definition, we have $(br_w)^2 = (br_{w-1})^2 + \epsilon$. Because of the convexity of the $x^2$ function, the maximum approximation error in any interval $[br_{w-1}, br_w]$ occurs at its mid-point³. Hence, the approximation error $\delta$ is given by:
$$\delta \leq \frac{(br_w)^2 + (br_{w-1})^2}{2} - \Big( \frac{br_w + br_{w-1}}{2} \Big)^2 = \frac{\epsilon + 2 \cdot br_{w-1} \cdot (br_{w-1} - br_w)}{4} < \frac{\epsilon}{4} \qquad \blacksquare$$
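The bound in Proposition 2 is easy to check numerically. The following snippet (ours, with illustrative values of $c$, $d$ and $r$) builds breakpoints whose squares are equally spaced over $[c, d]$, evaluates the piecewise-linear over-estimator of $x^2$, and verifies that the worst-case gap stays below $\epsilon/4$.

```python
# A small numerical check (ours) of Proposition 2.
import numpy as np

def breakpoints(c, d, r):
    eps = (d**2 - c**2) / r
    return np.sqrt(c**2 + eps * np.arange(r + 1)), eps   # (br_w)^2 = c^2 + w * eps

def pwl_square(x, br):
    """Piecewise-linear interpolation of x^2 between adjacent breakpoints (SOS2-style)."""
    w = np.searchsorted(br, x, side="right").clip(1, len(br) - 1)
    lo, hi = br[w - 1], br[w]
    lam = (x - lo) / (hi - lo)                           # weight on the upper breakpoint
    return (1 - lam) * lo**2 + lam * hi**2

c, d, r = 0.5, 2.0, 5
br, eps = breakpoints(c, d, r)
xs = np.linspace(c, d, 10001)
gap = (pwl_square(xs, br) - xs**2).max()
print(f"eps/4 = {eps/4:.4f}, measured max gap = {gap:.4f}")  # measured gap < eps/4
```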

²Using CPLEX SOS-2 considerably improves runtime compared to a binary variables formulation.
³The proposition and proof for this claim are provided in the supplement.

Proposition 3 Let $\hat{v}^t_{\xi_q}(s, \vec{\pi}^t)$ denote the approximation of $v^t_{\xi_q}(s, \vec{\pi}^t)$. Then:
$$v^t_{\xi_q}(s, \vec{\pi}^t) - \frac{|A| \cdot \epsilon \cdot (1 - \gamma^{H-1})}{4 \cdot (1-\gamma)} \;\leq\; \hat{v}^t_{\xi_q}(s, \vec{\pi}^t) \;\leq\; v^t_{\xi_q}(s, \vec{\pi}^t) + \frac{|A| \cdot \epsilon \cdot (1 - \gamma^{H-1})}{4 \cdot (1-\gamma)}$$

Proof Sketch⁴: We use the approximation error provided in Proposition 2 and propagate it through the value function update. $\blacksquare$
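The propagation step is short enough to sketch here (this is our reading of the argument, not the supplement's proof): each squared term in Equation (6) is approximated to within $\epsilon/4$, so a single backup over $|A|$ actions introduces at most $|A| \cdot \epsilon/4$ error, and the discounted sum over the remaining backups gives the stated bound. Writing $e^t$ for the worst-case approximation error at time step $t$ (with the terminal error zero):
$$\big| \hat{v}^t_{\xi_q}(s, \vec{\pi}^t) - v^t_{\xi_q}(s, \vec{\pi}^t) \big| \;\leq\; \underbrace{|A| \cdot \tfrac{\epsilon}{4}}_{\text{piecewise-linear terms at } t} + \; \gamma \, e^{t+1} \;=\; e^t$$
$$e^0 \;\leq\; |A| \cdot \tfrac{\epsilon}{4} \cdot \big(1 + \gamma + \cdots + \gamma^{H-2}\big) \;=\; \frac{|A| \cdot \epsilon \cdot (1-\gamma^{H-1})}{4\,(1-\gamma)}$$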

Corollary 2 The positive and negative errors in regret are bounded by $\frac{|A| \cdot \epsilon \cdot (1-\gamma^{H-1})}{4 \cdot (1-\gamma)}$.

Proof. From Equation (2) and Proposition 3, we have the proof. $\blacksquare$
Since the break points are fixed beforehand, we can find tighter bounds (refer to the proof of Proposition 2). Also, we can further improve the performance (both run-time and solution quality) of the MILP by pruning out dominated actions and adopting clever sampling strategies, as discussed in the next subsections.

Pruning dominated actions We now introduce a pruning approach⁵ to remove actions that will never be assigned a positive probability in a regret minimization strategy. For every state-action pair at each time step, we define minimum and maximum value functions as follows:
$$v^{t,min}_{\xi_q}(s, a) = R^t_q(s,a) + \gamma \sum_{s'} T^t_q(s,a,s') \cdot v^{t+1,min}_{\xi_q}(s'); \qquad v^{t,min}_{\xi_q}(s) = \min_a \big\{ v^{t,min}_{\xi_q}(s,a) \big\}$$
$$v^{t,max}_{\xi_q}(s, a) = R^t_q(s,a) + \gamma \sum_{s'} T^t_q(s,a,s') \cdot v^{t+1,max}_{\xi_q}(s'); \qquad v^{t,max}_{\xi_q}(s) = \max_a \big\{ v^{t,max}_{\xi_q}(s,a) \big\}$$
An action $a'$ is pruned if there exists the same action $a$ over all samples $\xi_q$ such that
$$v^{t,min}_{\xi_q}(s, a) \geq v^{t,max}_{\xi_q}(s, a') \qquad \exists a, \forall \xi_q$$
The above pruning step follows from the observation that an action $a'$ whose best case payoff is less than the worst case payoff of another action $a$ cannot be part of the regret optimal strategy, since we could switch from $a'$ to $a$ without increasing the regret value. It should be noted that an action that is not optimal for any of the samples cannot be pruned.
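The following sketch (ours, not the supplement's pseudo code) shows one way to apply this rule: compute per-sample min/max value functions by backward induction and drop any action that is dominated by a single other action across every sample. Names and shapes are illustrative.

```python
# A sketch (ours) of the pruning rule for dominated actions.
import numpy as np

def min_max_values(T, R, gamma):
    """Per-sample action bounds q_min[t,s,a], q_max[t,s,a] by backward induction."""
    H, S, A, _ = T.shape
    q_min = np.zeros((H, S, A)); q_max = np.zeros((H, S, A))
    v_min = np.zeros(S); v_max = np.zeros(S)
    for t in reversed(range(H)):
        q_min[t] = R[t] + gamma * T[t] @ v_min
        q_max[t] = R[t] + gamma * T[t] @ v_max
        v_min, v_max = q_min[t].min(axis=1), q_max[t].max(axis=1)
    return q_min, q_max

def prunable_actions(samples, gamma, t, s):
    """Actions a2 for which some single a1 satisfies v_min(s, a1) >= v_max(s, a2) in all samples."""
    bounds = [min_max_values(T, R, gamma) for T, R in samples]
    A = bounds[0][0].shape[2]
    pruned = set()
    for a2 in range(A):
        for a1 in range(A):
            if a1 != a2 and all(q_min[t, s, a1] >= q_max[t, s, a2]
                                for q_min, q_max in bounds):
                pruned.add(a2)
                break
    return pruned
```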

Greedy sampling The scalability of the MILP formulation above is constrained by the number of samples $Q$. So, instead of generating only the fixed set of $Q$ samples from the uncertainty distribution over models, we generate more than $Q$ samples and then pick a set of size $Q$ so that the samples are "as far apart" as possible. The key intuition in selecting the samples is to consider distance among samples as being equivalent to entropy in the optimal policies for the MDPs in the samples. For each decision epoch $t$, each state $s$ and action $a$, we define $Pr^{s,a,t}_\xi(\pi^{*t}_\xi(s,a) = 1)$ to be the probability that $a$ is the optimal action in state $s$ at time $t$. Similarly, we define $Pr^{s,a,t}_\xi(\pi^{*t}_\xi(s,a) = 0)$:
$$Pr^{s,a,t}_\xi\big(\pi^{*t}_\xi(s,a) = 1\big) = \frac{\sum_{\xi_q} \pi^{*t}_{\xi_q}(s,a)}{Q}; \qquad Pr^{s,a,t}_\xi\big(\pi^{*t}_\xi(s,a) = 0\big) = \frac{\sum_{\xi_q} \big(1 - \pi^{*t}_{\xi_q}(s,a)\big)}{Q}$$

Let the total entropy of a sample set $\xi$ ($|\xi| = Q$) be represented as $\Delta S(\xi)$; then
$$\Delta S(\xi) = - \sum_{t,s,a} \sum_{z \in \{0,1\}} Pr^{s,a,t}_\xi\big(\pi^{*t}_\xi(s,a) = z\big) \cdot \ln\Big( Pr^{s,a,t}_\xi\big(\pi^{*t}_\xi(s,a) = z\big) \Big)$$
We use a greedy strategy to select the $Q$ samples, i.e. we iteratively add samples that maximize the entropy of the sample set in that iteration. It is possible to provide bounds on the number of samples required for a given error using the methods suggested by Shapiro et al. [14]. However, these bounds are conservative and, as we show in the experimental results section, we typically only require a small number of samples.

⁴Detailed proof in supplement under Proposition 3.
⁵Pseudo code provided in the supplement under the "Pruning dominated actions" section.
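The greedy selection itself is straightforward; the sketch below is our illustration (not the authors' code), assuming the optimal deterministic policy of each candidate sample has already been computed as a 0/1 array.

```python
# A sketch (ours) of greedy, entropy-based sample selection.
import numpy as np

def set_entropy(policies):
    """policies: list of 0/1 arrays pi_star[t, s, a]; returns the entropy Delta S."""
    p1 = np.mean(policies, axis=0)                 # Pr(pi*(s,a)=1) per (t, s, a)
    ent = np.zeros_like(p1)
    for p in (p1, 1.0 - p1):
        nz = p > 0
        ent[nz] -= p[nz] * np.log(p[nz])           # 0 * log 0 treated as 0
    return ent.sum()

def greedy_select(pool_policies, Q):
    """Pick indices of Q samples from the pool, greedily maximising set entropy."""
    selected = [0]                                  # seed with an arbitrary sample
    while len(selected) < Q:
        best, best_ent = None, -np.inf
        for i in range(len(pool_policies)):
            if i in selected:
                continue
            ent = set_entropy([pool_policies[j] for j in selected + [i]])
            if ent > best_ent:
                best, best_ent = i, ent
        selected.append(best)
    return selected
```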

CER Minimizing Solution

The MILP based approach mentioned in the previous section can easily be adapted to minimize the maximum cumulative regret over all samples when uncertainties across states are dependent:

$$\min_{\vec{\pi}^0} \; creg(\vec{\pi}^0)$$
$$\text{s.t.} \quad creg(\vec{\pi}^0) \geq \sum_s \alpha(s) \cdot creg^0_{\xi_q}(s, \vec{\pi}^0), \quad \forall \xi_q$$
$$creg^t_{\xi_q}(s, \vec{\pi}^t) = \sum_a \pi^t(s,a) \cdot creg^t_{\xi_q}(s, a, \vec{\pi}^t), \quad \forall s, t, \xi_q \quad (10)$$
$$creg^t_{\xi_q}(s, a, \vec{\pi}^t) = R^{*,t}_q(s) - R^t_q(s,a) + \gamma \sum_{s'} T^t_q(s,a,s') \cdot creg^{t+1}_{\xi_q}(s', \vec{\pi}^{t+1}), \quad \forall s, a, t, \xi_q \quad (11)$$
where the product term $\pi^t(s,a) \cdot creg^t_{\xi_q}(s, a, \vec{\pi}^t)$ is approximated as described earlier.

While we were unable to exploit the independence of uncertainty distributions across states with minimax regret, we are able to exploit it with minimax CER. In fact, a key advantage of the CER robustness concept in the context of independent uncertainties is that it has an optimal substructure over time steps and hence a Dynamic Programming (DP) algorithm can be used to solve it.

In the case of independent uncertainties, samples at each time step can be drawn independently and we now introduce a formal notation to account for samples drawn at each time step. Let $\xi^t$ denote the set of samples at time step $t$; then $\xi = \times_{t \leq H-1} \xi^t$. Further, we use $\vec{\xi}^t$ to indicate the cross product of samples from $t$ to $H-1$, i.e. $\vec{\xi}^t = \times_{t \leq e \leq H-1} \xi^e$. Thus, $\vec{\xi}^0 = \xi$. To indicate the entire horizon samples corresponding to a sample $p$ from time step $t$, we have $\vec{\xi}^t_p = \xi^t_p \times \vec{\xi}^{t+1}$.

For notational compactness, we use $\Delta R^{t-1}_p(s,a) = R^{*,t-1}_p(s) - R^{t-1}_p(s,a)$. Because of independence in uncertainties across time steps, for a sample set $\vec{\xi}^{t-1}_p = \xi^{t-1}_p \times \vec{\xi}^t$, we have the following:

$$\max_{\vec{\xi}^{t-1}_p} creg^{t-1}_{\vec{\xi}^{t-1}_p}(s, \vec{\pi}^{t-1}) = \max_{\xi^{t-1}_p \times \vec{\xi}^t} \sum_a \pi^{t-1}(s,a) \Big[ \Delta R^{t-1}_p(s,a) + \gamma \sum_{s'} T^{t-1}_p(s,a,s') \cdot creg^t_{\vec{\xi}^t}(s', \vec{\pi}^t) \Big]$$
$$= \max_{\xi^{t-1}_p} \sum_a \pi^{t-1}(s,a) \Big[ \Delta R^{t-1}_p(s,a) + \gamma \sum_{s'} T^{t-1}_p(s,a,s') \cdot \max_{\vec{\xi}^t_q \in \vec{\xi}^t} creg^t_{\vec{\xi}^t_q}(s', \vec{\pi}^t) \Big] \quad (12)$$

Proposition 4 At time step $t-1$, the CER corresponding to any policy $\pi^{t-1}$ will have least regret if it includes the CER minimizing policy from $t$. Formally, if $\vec{\pi}^{*,t}$ represents the CER minimizing policy from $t$ and $\vec{\pi}^t$ represents any arbitrary policy, then:

$$\forall s: \; \max_{\vec{\xi}^{t-1}_p \in \vec{\xi}^{t-1}} creg^{t-1}_{\vec{\xi}^{t-1}_p}\big(s, \langle \pi^{t-1}, \vec{\pi}^{*,t} \rangle\big) \;\leq\; \max_{\vec{\xi}^{t-1}_p \in \vec{\xi}^{t-1}} creg^{t-1}_{\vec{\xi}^{t-1}_p}\big(s, \langle \pi^{t-1}, \vec{\pi}^t \rangle\big) \quad (13)$$
$$\text{if} \quad \forall s: \; \max_{\vec{\xi}^t_q \in \vec{\xi}^t} creg^t_{\vec{\xi}^t_q}(s, \vec{\pi}^{*,t}) \;\leq\; \max_{\vec{\xi}^t_q \in \vec{\xi}^t} creg^t_{\vec{\xi}^t_q}(s, \vec{\pi}^t) \quad (14)$$

Proof Sketch⁶ We prove this by using Equations (14) and (12) in the LHS of Equation (13). $\blacksquare$

It is easy to show that minimizing CER also has an optimal substructure:

$$\min_{\vec{\pi}^0} \max_{\vec{\xi}^0_p} \sum_s \alpha(s) \cdot creg^0_{\vec{\xi}^0_p}(s, \vec{\pi}^0) \;\implies\; \min_{\vec{\pi}^0} \sum_s \alpha(s) \cdot \Big[ \max_{\vec{\xi}^0_p} creg^0_{\vec{\xi}^0_p}(s, \vec{\pi}^0) \Big] \quad (15)$$
In Proposition 4 (extending the reasoning to $t = 1$), we have already shown that $\max_{\vec{\xi}^0_p} creg^0_{\vec{\xi}^0_p}(s, \vec{\pi}^0)$ has an optimal substructure. Thus, Equation (15) can also exploit the optimal substructure.

6Detailed proof in supplement under Proposition 4.

The MINIMIZECER function below provides the pseudo code for a DP algorithm that exploits this structure. At each stage $t$, we calculate the creg for each state-action pair corresponding to each of the samples at that stage, i.e. $\xi^t$ (lines 6-9). Once these are computed, we obtain the maximum creg and the policy corresponding to it (line 10) using the GETCER() function. In the next iteration, the creg computed at $t$ is then used in the computation of creg at $t-1$ using the same update step (lines 6-9).

MINIMIZECER()
 1: for all t ≤ H − 1 do
 2:     ξ^t ← GENSAMPLES(T, R)
 3: for all s ∈ S do
 4:     creg^H(s) ← 0
 5: while t ≥ 0 do
 6:     for all s ∈ S do
 7:         for all ξ^t_q ∈ ξ^t, a ∈ A do
 8:             creg^t_{ξ_q}(s, a) ← ∆R^t_q(s, a) +
 9:                 γ · Σ_{s'} T^t_q(s, a, s') · creg^{t+1}(s')
10:         π^t, creg^t(s) ← GETCER(s, {creg^t_{ξ_q}(s, a)})
11:     t ← t − 1
    return (creg^0, π^0)

GETCER(s, {creg^t_{ξ_q}(s, a)}):
    min_π  creg^t(s)
    s.t.   creg^t(s) ≥ Σ_a π^t(s, a) · creg^t_{ξ_q}(s, a),   ∀ξ_q
           Σ_a π^t(s, a) = 1
           0 ≤ π^t(s, a) ≤ 1,   ∀a
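As a concrete rendering of the pseudo code above, the following sketch (ours, not the authors' implementation) runs the backward DP and solves the GETCER linear program with scipy.optimize.linprog (HiGHS) instead of a commercial solver. The per-time-step sample sets and array shapes are assumptions for illustration.

```python
# A sketch (ours) of MINIMIZECER for the independent-uncertainty case.
import numpy as np
from scipy.optimize import linprog

def get_cer(creg_sa):
    """creg_sa[q, a]: per-sample creg of each action in one state.
    Solves: min c  s.t.  c >= sum_a pi_a * creg_sa[q, a] for all q,  sum_a pi_a = 1."""
    Qn, A = creg_sa.shape
    obj = np.r_[np.zeros(A), 1.0]                        # variables: pi_1..pi_A, c
    A_ub = np.c_[creg_sa, -np.ones(Qn)]                  # sum_a pi_a*creg_q(a) - c <= 0
    b_ub = np.zeros(Qn)
    A_eq = np.r_[np.ones(A), 0.0].reshape(1, -1)         # sum_a pi_a = 1
    res = linprog(obj, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * A + [(None, None)], method="highs")
    return res.x[:A], res.x[-1]

def minimize_cer(samples_per_t, gamma):
    """samples_per_t[t]: list of (T, R) models for time step t, with T[s,a,s'], R[s,a]."""
    H = len(samples_per_t)
    S, A = samples_per_t[0][0][1].shape                   # R of the first sample: (S, A)
    creg_next = np.zeros(S)                               # creg^H(s) = 0
    policy = [None] * H
    for t in reversed(range(H)):
        pi_t = np.zeros((S, A)); creg_t = np.zeros(S)
        for s in range(S):
            creg_sa = np.array([
                (R[s].max() - R[s]) + gamma * T[s] @ creg_next   # update of lines 8-9
                for T, R in samples_per_t[t]])
            pi_t[s], creg_t[s] = get_cer(creg_sa)                # line 10
        policy[t], creg_next = pi_t, creg_t
    return policy, creg_next                              # (pi^0..pi^{H-1}, creg^0)
```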

It can be noted that MINIMIZECER() makes only $H \cdot |S|$ calls to the LP in the GETCER() function, each of which has only $|A|$ continuous variables and at most $[1 + \max_t |\xi^t|]$ constraints. Thus, the overall complexity of MINIMIZECER() is polynomial in the number of samples given fixed values of the other attributes.

Let $creg^{*,H-1}(s,a)$ denote the optimal cumulative regret at time step $H-1$ for taking action $a$ in state $s$ and $creg^{*,H-1}_\xi(s,a)$ denote the optimal cumulative regret over the sample set $\xi$. Let the indicator random variable $X$ be defined as follows:
$$X = \begin{cases} 1 & \text{if } creg^{*,H-1}(s,a) - creg^{*,H-1}_\xi(s,a) \leq \lambda \\ 0 & \text{otherwise} \end{cases}$$
By using Chernoff and Hoeffding bounds on $X$, it is possible to provide bounds on the deviation from the mean and on the number of samples at $H-1$. This can then be propagated to $H-2$ and so on. However, these bounds can be very loose and they do not exploit the properties of the creg functions. Bounds developed on spacings of order statistics can help exploit the properties of creg functions. We leave this for future work.
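As a concrete (and, as the paragraph above notes, conservative) illustration of the kind of guarantee Hoeffding's inequality gives for the 0/1 variable $X$, the snippet below computes the number of samples needed so that the empirical mean of $X$ is within a tolerance of its expectation with a given confidence; this is our illustration, not an analysis from the paper.

```python
# Illustration (ours) of a Hoeffding-style sample bound for a 0/1 indicator X:
# P(|mean_N(X) - E[X]| >= tol) <= 2 exp(-2 N tol^2), so N >= ln(2/delta) / (2 tol^2)
# suffices for confidence 1 - delta.
import math

def hoeffding_samples(tol, delta):
    return math.ceil(math.log(2.0 / delta) / (2.0 * tol ** 2))

print(hoeffding_samples(tol=0.1, delta=0.05))   # e.g. 185 samples
```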

Experimental Results

In this section, we provide a performance comparison of the various algorithms introduced in the previous sections on two domains. MILP-Regret refers to the randomized policy variant of the MILP approximation algorithm for solving uncertain MDPs with dependent uncertainties. The corresponding algorithm for minimizing CER is referred to as MILP-CER. We refer to the dynamic programming algorithm for minimizing CER in the independent uncertainty case as DP-CER and, finally, we refer to the maximin value algorithm as "Maximin". All the algorithms finished within 15 minutes on all the problems. DP-CER was much faster than the other algorithms and finished within a minute on the largest problems. We provide the following results in this section:

(1) Performance comparison of the Greedy and Random sampling strategies in the context of MILP-Regret as we increase the number of samples.
(2) SAA analysis of the results obtained using MILP-Regret.
(3) Comparison of MILP-Regret and MILP-CER policies with respect to simulated regret.
(4) Comparison of DP-CER and Maximin.

The first three comparisons correspond to the dependent uncertainties case and the results are based on a path planning problem that is motivated by disaster rescue and is a modification of the one employed in Bagnell et al. [1]. On top of the normal transitional uncertainty, we have uncertainty over transition and reward models due to random obstacles and random reward cells. Furthermore, these uncertainties are dependent on each other due to patterns in terrains. Each sample of the various uncertainties represents an individual map and can be modelled as an MDP. We experimented with a grid world of size 4×4 while varying the numbers of obstacles and reward cells, the horizon and the number of break points employed (3-6).

[Figure 1: three plots. (a) Sampling Strategies (MILP-Regret): Difference vs. Samples for Random and Greedy. (b) SAA Analysis: Gap vs. Samples for uncertainty settings 1-3. (c) DP-CER Analysis: Performance vs. Cost-to-revenue ratio for DP-CER and Maximin under three demand settings.]

Figure 1: In (a), (b) we have a 4 × 4 grid, H = 5. In (c), the maximum inventory size X = 50, H = 20, $|\xi^t| = 50$. The normal distribution mean $\mu = \{0.3, 0.4, 0.5\} \cdot X$ and $\sigma \leq \min\{\mu, X-\mu\}/3$.

In Figure 1a, we show the effect of using the greedy sampling strategy on the MILP-Regret policy. On the X-axis, we represent the number of samples used for computation of the policy (the learning set). The test set from which the samples were selected consisted of 250 samples. We then obtained the policies using MILP-Regret corresponding to the sample sets (referred to as learning sets) generated by the two sampling strategies. On the Y-axis, we show the percentage difference between simulated regret values on the test and learning sample sets. We observe that, for a fixed difference, the number of samples required by greedy sampling is significantly lower than for random sampling. Furthermore, the variance in the difference is also much lower for greedy sampling. A key result from this graph is that even with just 15 samples, the difference with actual regret is less than 10%. Figure 1b shows that even the gap obtained using the SAA analysis⁷ is near zero (< 0.1) with 15 samples. We show the gap and the variance of the gap over three different settings of uncertainty, labeled 1, 2 and 3. Setting 3 has the highest uncertainty over the models and Setting 1 has the least. The variance of the gap was higher for the higher uncertainty settings.

While MILP-CER obtained a simulated regret value (over 250 samples) within the bound provided in Proposition 1, we were unable to find any correlation in the simulated regret values of MILP-Regret and MILP-CER policies as the number of samples was increased. We have not yet ascertained a reason for there being no correlation in performance.

In the last result, shown in Figure 1c, we employ the well known single product finite horizon stochastic inventory control problem [10]. We compare DP-CER against the widely used benchmark algorithm on this domain, Maximin. The demand values at each decision epoch were taken from a normal distribution. We considered three different settings of the mean and variance of the demand. As expected, the DP-CER approach provides much higher values than Maximin and the difference between the two reduces as the cost-to-revenue ratio increases. We obtained similar results when the demands were taken from other distributions (uniform and bi-modal).

Conclusions

We have introduced scalable sampling-based mechanisms for optimizing regret and a new variant of regret called CER in uncertain MDPs with dependent and independent uncertainties across states. We have provided a variety of theoretical results that indicate the connection between regret and CER, quality bounds on regret in the case of MILP-Regret, optimal substructure in optimizing CER for the independent uncertainty case, and run-time performance for MINIMIZECER. In the future, we hope to better understand the correlation between regret and CER, while also understanding the properties of CER policies.

Acknowledgement This research was supported in part by the National Research Foundation Singapore through the Singapore MIT Alliance for Research and Technology's Future Urban Mobility research programme. The last author was also supported by ONR grant N00014-12-1-0999.

7We have provided the method for performing SAA analysis in the supplement.

References
[1] J. Andrew Bagnell, Andrew Y. Ng, and Jeff G. Schneider. Solving uncertain Markov decision processes. Technical report, Carnegie Mellon University, 2001.
[2] Sébastien Bubeck, Rémi Munos, and Gilles Stoltz. Pure exploration in multi-armed bandits problems. In Proceedings of the 20th International Conference on Algorithmic Learning Theory, 2009.
[3] Robert Givan, Sonia Leach, and Thomas Dean. Bounded-parameter Markov decision processes. Artificial Intelligence, 122, 2000.
[4] G. Iyengar. Robust dynamic programming. Mathematics of Operations Research, 30, 2004.
[5] Michael Kearns, Yishay Mansour, and Andrew Y. Ng. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning, 49, 2002.
[6] Shie Mannor, Ofir Mebel, and Huan Xu. Lightning does not strike twice: Robust MDPs with coupled uncertainty. In International Conference on Machine Learning (ICML), 2012.
[7] Andrew Mastin and Patrick Jaillet. Loss bounds for uncertain transition probabilities in Markov decision processes. In IEEE Annual Conference on Decision and Control (CDC), 2012.
[8] Arnab Nilim and Laurent El Ghaoui. Robust control of Markov decision processes with uncertain transition matrices. Operations Research, 2005.
[9] Joelle Pineau, Geoffrey J. Gordon, and Sebastian Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In International Joint Conference on Artificial Intelligence, 2003.
[10] Martin Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, 1994.
[11] Kevin Regan and Craig Boutilier. Regret-based reward elicitation for Markov decision processes. In Uncertainty in Artificial Intelligence, 2009.
[12] Kevin Regan and Craig Boutilier. Robust policy computation in reward-uncertain MDPs using nondominated policies. In National Conference on Artificial Intelligence (AAAI), 2010.
[13] Leonard Savage. The Foundations of Statistics. Wiley, 1954.
[14] A. Shapiro. Monte Carlo sampling methods. In Stochastic Programming, volume 10 of Handbooks in Operations Research and Management Science. Elsevier, 2003.
[15] Wolfram Wiesemann, Daniel Kuhn, and Berç Rustem. Robust Markov decision processes. Mathematics of Operations Research, 38(1), 2013.
[16] Huan Xu and Shie Mannor. Parametric regret in uncertain Markov decision processes. In IEEE Conference on Decision and Control (CDC), 2009.
