Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret
Jean Tarbouriech∗† Runlong Zhou∗‡ Simon S. Du§ Matteo Pirotta‖ Michal Valko¶ Alessandro Lazaric‖

∗Equal contribution. †Facebook AI Research & Inria, Scool team. Email: [email protected] ‡Tsinghua University. Email: [email protected] §University of Washington & Facebook AI Research. Email: [email protected] ¶DeepMind. Email: [email protected] ‖Facebook AI Research. Emails: {pirotta, lazaric}@fb.com
Abstract

We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state. We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus to guarantee both optimism and convergence of the associated value iteration scheme. We prove that EB-SSP achieves the minimax regret rate Õ(B?√(SAK)), where K is the number of episodes, S is the number of states, A is the number of actions and B? bounds the expected cumulative cost of the optimal policy from any state, thus closing the gap with the lower bound. Interestingly, EB-SSP obtains this result while being parameter-free, i.e., it does not require any prior knowledge of B?, nor of T? which bounds the expected time-to-goal of the optimal policy from any state. Furthermore, we illustrate various cases (e.g., positive costs, or general costs when an order-accurate estimate of T? is available) where the regret only contains a logarithmic dependence on T?, thus yielding the first horizon-free regret bound beyond the finite-horizon MDP setting.
1 Introduction
Stochastic shortest path (SSP) is a goal-oriented reinforcement learning (RL) setting where the agent aims to reach a predefined goal state while minimizing its total expected cost [Bertsekas, 1995]. Unlike other well-studied settings in RL (e.g., finite-horizon), the interaction between the agent and the environment ends only when (and if) the goal state is reached, so its length is not predetermined (nor bounded) and is influenced by the agent’s behavior. SSP is a general RL setting that includes both finite-horizon and discounted Markov Decision Processes (MDPs) as special cases. Moreover, many popular RL problems can be cast under the SSP formulation, such as game playing (e.g., Atari games) or navigation problems (e.g., Mujoco mazes). We study the online learning problem in the SSP setting (online SSP in short), where both the transition dynamics and the cost function are initially unknown. The agent interacts with the environment for K episodes, each ending when the goal state is reached. Its objective is to reach the goal while incurring minimal expected cumulative cost. Online SSP has only been sparsely and recently studied. Before reviewing the
relevant literature, we begin by identifying some desirable properties for a learning algorithm in online SSP.

• Desired property 1: Minimax. A first characteristic of an algorithm is to perform as close as possible to the optimal policy π?, i.e., it should achieve low regret. The information-theoretic lower bound on the regret is Ω(B?√(SAK)) [Rosenberg et al., 2020], where K is the number of episodes, S is the number of states, A is the number of actions and B? bounds the total expected cost of the optimal policy starting from any state. We say that:
An algorithm for online SSP is (nearly) minimax optimal if its regret is bounded by Õ(B?√(SAK)), up to logarithmic factors and lower-order terms.

• Desired property 2: Parameter-free. Another relevant consideration is the amount of prior knowledge required by the algorithm. While the knowledge of S, A and the cost (or reward) range [0, 1] is standard across regret minimization settings (e.g., finite-horizon, discounted, average-reward), the complexity of learning in SSP problems may be linked to SSP-specific quantities such as B? and T?, where T? denotes the expected time-to-goal of the optimal policy from any state.1 An algorithm is said to rely on a parameter X (e.g., B? or T?) if it requires prior knowledge of X, or a tight upper bound of it, to achieve its regret bound. We then say that:
An algorithm for online SSP is parameter-free if it relies on prior knowledge of neither T? nor B?.

• Desired property 3: Horizon-free. A core challenge in SSP is to trade off between minimizing costs and quickly reaching the goal state. This is accentuated when the instantaneous costs are small, i.e., when there is a mismatch between B? and T?. Indeed, while B? ≤ T? always holds since the cost range is [0, 1], the gap between the two may be arbitrarily large (see, e.g., the simple example of App. A). The lower bound stipulates that the regret does depend on B?, while the “time horizon” of the problem, i.e., T?, should a priori not impact the regret, even as a lower-order term. By analogy with the horizon-free guarantees recently uncovered in finite-horizon MDPs with total reward bounded by 1 (cf. Sect. 1.3), we say that:
An algorithm for online SSP is (nearly) horizon-free if its regret depends only poly-logarithmically on T?.

Remark 1. We also mention that another consideration is computational efficiency. Although the main focus in regret minimization is often statistical efficiency, a desirable property is that the run-time is polynomial in K, S, A, B?, T?. Like all existing algorithms for online SSP, our algorithm is computationally efficient.
1.1 Related Work

The setting of online learning in SSP was introduced by Tarbouriech et al. [2020a], who gave a parameter-free algorithm with an Õ(K^{2/3}) regret guarantee. Rosenberg et al. [2020] later improved this result by deriving the first order-optimal algorithm, with a regret bound of Õ(B?^{3/2} S√(AK)) in the parameter-free case and a bound of Õ(B? S√(AK)) when prior knowledge of B? is available (for appropriately tuning the cost perturbation). These rates display a √(B?S) (resp. √S) mismatch w.r.t. the lower bound. Both approaches can be described as model-optimistic,2 drawing inspiration from the ideas behind the UCRL2 algorithm [Jaksch et al., 2010] for average-reward MDPs. Note that the above bounds contain lower-order dependencies either on T? in the case of general costs, or on B?/cmin in the case of positive costs lower bounded by cmin > 0 (note that T? ≤ B?/cmin, which is one of the reasons why cmin shows up in existing results). As such, none of these bounds is horizon-free according to the third desired property.

Concurrently to our work, Cohen et al. [2021] propose an algorithm for online SSP based on a black-box reduction from SSP to finite-horizon MDPs. It successively tackles finite-horizon problems with horizon set to H = Ω(T?) and costs augmented by a terminal cost set to cH(s) = Ω(B? · I(s ≠ g)), where g denotes the goal state. This finite-horizon construction guarantees that its optimal policy resembles the optimal policy in the original SSP instance, up to a lower-order bias. Their algorithm comes with a regret bound of O(B?√(SAK)·L + T?^4 S²A·L^5), with L = log(KT?SAδ^{-1}) (with probability at least 1 − δ). Inspecting their bound w.r.t. the three aforementioned desired properties yields that it achieves a nearly minimax optimal rate; however, it relies on both T? and B? prior knowledge to tune the horizon and terminal cost in the reduction, respectively. Finally, it is not horizon-free due to the polynomial dependence on T? in the lower-order term.3

1 The SSP-diameter D, i.e., the longest shortest path from any initial state to the goal, is sometimes used in characterizing the complexity of learning in SSP with adversarial costs. Since no existing algorithm for online SSP relies on D to achieve better regret guarantees, we will ignore it in our classification.
2 See Neu and Pike-Burke [2020] for details on differences and interplay of model-optimistic and value-optimistic approaches.
3 As mentioned in Cohen et al. [2021, Remark 2], in the case of positive costs lower bounded by cmin > 0, their knowledge of T? can be bypassed by replacing it with the upper bound T? ≤ B?/cmin. However, when generalizing from the cmin case to general costs with a perturbation argument, their regret guarantee worsens from Õ(√K + cmin^{−4}) to Õ(K^{4/5}), because of the poor additive dependence on cmin^{−1}.
Algorithm | Regret | Minimax | Parameters | Horizon-Free
[Tarbouriech et al., 2020a] | Õ_K(K^{2/3}) | No | None | No
[Rosenberg et al., 2020] | Õ(B? S√(AK) + T?^{3/2} S²A) | No | B? | No
[Rosenberg et al., 2020] | Õ(B?^{3/2} S√(AK) + T? B? S²A) | No | None | No
[Cohen et al., 2021] (concurrent work) | Õ(B?√(SAK) + T?^4 S²A) | Yes | B?, T? | No
This work | Õ(B?√(SAK) + B? S²A) | Yes | B?, T? | Yes
This work | Õ(B?√(SAK) + B? S²A + T?/poly(K))∗ | Yes | B? | No
This work | Õ(B?√(SAK) + B?^3 S³A) | Yes | T? | Yes
This work | Õ(B?√(SAK) + B?^3 S³A + T?/poly(K))∗ | Yes | None | No
Lower Bound | Ω(B?√(SAK)) | – | – | –
Table 1: Regret comparisons of algorithms for online SSP. Õ omits logarithmic factors and Õ_K only reports the dependence on K. Regret is the performance metric defined in Eq. 3. Minimax: Whether the regret matches the Ω(B?√(SAK)) lower bound of Rosenberg et al. [2020], up to logarithmic factors and lower-order terms. Parameters: The parameters that the algorithm requires as input: either both B? and T?, or one of them, or none (i.e., parameter-free). Horizon-Free: Whether the regret bound depends only logarithmically on T?. ∗: If K is known in advance, the additive term T?/poly(K) has a denominator that is polynomial in K, so it becomes negligible for large values of K (if K is unknown, the additive term is T?). See Sect. 4 for the full statements of our regret bounds.
1.2 Contributions

We summarize our main contributions as follows:

• We propose EB-SSP (Exploration Bonus for SSP), an algorithm that builds on a new value-optimistic approach for SSP. In particular, we introduce a simple way of computing optimistic policies, by slightly biasing the estimated transitions towards reaching the goal from each state-action pair with positive probability. Under these biased transitions, all policies are in fact proper in the empirical SSP (i.e., they eventually reach the goal with probability 1 from any starting state). We decay the bias over time in a way that it only contributes to the regret as a lower-order term. See Sect. 3 for an overview of our algorithm and analysis. Note that EB-SSP is not based on a model-optimistic approach2 [Tarbouriech et al., 2020a, Rosenberg et al., 2020], and it does not rely on a reduction from SSP to finite-horizon [Cohen et al., 2021] (i.e., we operate at the level of the non-truncated SSP model);

• EB-SSP is the first algorithm to achieve the minimax rate of Õ(B?√(SAK)) while being parameter-free: it does not require knowing or estimating T?, and it is able to bypass the knowledge of B? at the cost of only logarithmic and lower-order contributions to the regret;

• EB-SSP is the first algorithm to achieve horizon-free regret for SSP in various cases: i) positive costs, ii) no almost-sure zero-cost cycles, and iii) the general cost case when an order-accurate estimate of T? is available (i.e., a value T̄? that verifies T?/υ ≤ T̄? ≤ λT?^ζ for some positive unknown constants υ, λ, ζ ≥ 1 is known). It also does so while maintaining the minimax rate. For instance, under general costs when
relying on T? and B?, EB-SSP achieves a regret of Õ(B?√(SAK) + B?S²A).4 To the best of our knowledge, EB-SSP yields the first set of nearly horizon-free bounds beyond the setting of finite-horizon MDPs.

Table 1 summarizes our contributions and positions them with respect to prior and concurrent work.
1.3 Additional Related Work

Planning in SSP. Early work by Bertsekas and Tsitsiklis [1991], followed by a line of work [e.g., Bertsekas, 1995, Bonet, 2007, Kolobov et al., 2011, Bertsekas and Yu, 2013, Guillot and Stauffer, 2020], examines the planning problem in SSP, i.e., how to compute an optimal policy when all parameters of the SSP model are known. Under mild assumptions, the optimal policy is deterministic and stationary and can be computed efficiently using standard planning techniques, e.g., value iteration, policy iteration or linear programming.
Regret minimization in MDPs. The exploration-exploitation dilemma in tabular MDPs has been extensively studied in finite-horizon [e.g., Azar et al., 2017, Jin et al., 2018, Zanette and Brunskill, 2019, Efroni et al., 2019, Simchowitz and Jamieson, 2019, Zhang et al., 2020b, Neu and Pike-Burke, 2020, Xu et al., 2021, Menard et al., 2021] and infinite-horizon [e.g., Jaksch et al., 2010, Bartlett and Tewari, 2012, Fruit et al., 2018, Wang et al., 2020b, Qian et al., 2019, Wei et al., 2020].
Other SSP-based settings. SSP with adversarially changing costs was investigated by Rosenberg and Mansour [2020], Chen et al. [2020], Chen and Luo [2021].5 Tarbouriech et al. [2021] study the sample complexity of SSP with a generative model, due to the observation that a standard regret-to-PAC conversion may not hold in SSP (as opposed to finite-horizon). Exploration problems involving multiple goal states (i.e., multi-goal SSP or goal-conditioned RL) were analyzed by Lim and Auer [2012], Tarbouriech et al. [2020b].
Horizon-free regret in finite-horizon MDPs. Our third desired property echoes the recent investigation of so-called horizon-free regret in finite-horizon MDPs that satisfy the assumption of total reward bounded by 1. Since Jiang and Agarwal [2018], it has been asked whether it is possible to obtain a regret bound that only depends logarithmically on the horizon H, which would imply that long-horizon RL is no more difficult than short-horizon RL. Recently, this question was answered positively [Wang et al., 2020a, Zhang et al., 2020a, 2021]. In SSP (which we recall subsumes finite-horizon MDPs), a natural counterpart of the horizon is T?, which bounds the expected time of the optimal policy to terminate the episode from any state. This contrasts with the horizon H in finite-horizon MDPs, which is the almost-sure time of any policy to terminate the episode. Note that such a notion of horizon would obviously be too strong for SSP, where some (even most) policies may never reach the goal, thus having an unbounded time horizon. This motivates our definition of the regret in SSP as (nearly) horizon-free if it only contains a logarithmic dependence on T?. We stress that we do not make any extra assumption on the SSP problem; in particular we do not assume any bound on the cumulative cost of a trajectory.
2 Preliminaries
SSP-MDP. An instance of the SSP problem is a Markov decision process (MDP) M := ⟨S, A, P, c, s0, g⟩, where S is the finite state space with cardinality S, A is the finite action space with cardinality A, and s0 ∈ S is the initial state. We denote by g ∉ S the goal state, and we set S′ := S ∪ {g} (thus S′ := S + 1). Taking action a in state s incurs a cost drawn i.i.d. from a distribution with expectation c(s, a) ∈ [0, 1], and the next state s′ ∈ S′ is selected with probability P(s′|s, a) (where Σ_{s′∈S′} P(s′|s, a) = 1). The goal state g is absorbing and zero-cost, i.e., P(g|g, a) = 1 and c(g, a) = 0 for any action a ∈ A, which effectively implies that the agent ends its interaction with M once it reaches the goal g.

For notational convenience, we use P_{s,a} and P_{s,a,s′} to denote P(·|s, a) and P(s′|s, a) respectively. For any two S′-dimensional vectors X and Y, we use XY to denote Σ_{s∈S′} X(s)Y(s), X² to denote Σ_{s∈S′} [X(s)]², and if X is a probability distribution over S′, then V(X, Y) := XY² − (XY)². We also define ‖X‖∞ := max_{s∈S′} |X(s)|, ‖X‖∞^{≠g} := max_{s∈S} |X(s)|, ‖X‖₁ := Σ_{s∈S′} |X(s)|, ‖X‖₁^{≠g} := Σ_{s∈S} |X(s)| = ‖X‖₁ − |X(g)|.

4 We conjecture the optimal problem-independent regret in SSP to be Õ(B?√(SAK) + B?SA) (by extrapolating the conjecture of Menard et al. [2021] for finite-horizon MDPs), which shows the tightness of our bound up to an S lower-order factor.
5 Note that a different line of work [e.g., Neu et al., 2010, 2012, Rosenberg and Mansour, 2019a,b, Jin et al., 2020, Jin and Luo, 2020] studies finite-horizon MDPs with adversarial costs (sometimes called online loop-free SSP), where an episode ends after a fixed number of H steps. This is not to be confused with our setting where an episode lasts as long as the goal is not reached.
Proper policies. A stationary and deterministic policy π : S → A is a mapping that executes action π(s) when the agent is in state s. A policy π is said to be proper if it reaches the goal with probability 1 when starting from any state in S (otherwise it is improper). We denote by Πproper the set of proper, stationary and deterministic policies. As in all prior work on regret minimization in SSP, we make the following basic assumption, which ensures that the SSP problem is well-posed.
Assumption 1. There exists at least one proper policy, i.e., Πproper ≠ ∅.
Q-functions and value functions in SSP. The agent’s objective is to minimize its expected cumulative cost incurred until the goal is reached. The value function (also called cost-to-go) of a policy π is defined as

V^π(s) := lim_{T→+∞} E[ Σ_{t=1}^{T} c_t(s_t, π(s_t)) | s_1 = s ],

where c_t ∈ [0, 1] is the cost incurred at time t at the state-action pair (s_t, π(s_t)). The Q-function of policy π is defined for all (s, a) ∈ S × A as

Q^π(s, a) := lim_{T→+∞} E[ Σ_{t=1}^{T} c_t(s_t, π(s_t)) | s_1 = s, π(s_1) = a ].

In both definitions, the expectation is w.r.t. the random sequence of states generated by executing π starting from state s ∈ S (and taking action a ∈ A in the second case). Note that the value function of a policy may possibly be unbounded if it never reaches the goal. For a proper policy π, V^π(s) and Q^π(s, a) are finite for any s ∈ S and a ∈ A. We also define V^π(g) = Q^π(g, a) = 0 for all policies π and actions a, by virtue of the zero-cost and self-loop property of the goal state g. Finally, we denote by T^π(s) the expected time that π takes to reach the goal g starting at state s; in particular, if π is proper then T^π(s) is finite for all s, whereas if π is improper there must exist at least one state s such that T^π(s) = ∞.

Equipped with Asm. 1 and an additional condition on improper policies defined below, one can derive important properties on the optimal policy π? that minimizes the value function component-wise.

Lemma 2 (Bertsekas and Tsitsiklis, 1991; Yu and Bertsekas, 2013). Suppose that Asm. 1 holds and that for every improper policy π′ there exists at least one state s ∈ S such that V^{π′}(s) = +∞. Then the optimal policy π? is stationary, deterministic, and proper. Moreover, V? = V^{π?} is the unique solution of the optimality equations V? = LV? and V?(s) < +∞ for any s ∈ S, where for any vector V ∈ R^S the optimal Bellman operator L is defined as

LV(s) := min_{a∈A} { c(s, a) + P_{s,a}V }.   (1)

Furthermore, the optimal Q-value, denoted by Q? = Q^{π?}, is related to the optimal value function as follows:

Q?(s, a) = c(s, a) + P_{s,a}V?,   V?(s) = min_{a∈A} Q?(s, a),   (s, a) ∈ S × A.   (2)
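To make the optimality equations concrete, here is a minimal numpy sketch (not from the paper) of value iteration with the operator L of Eq. 1 on a known SSP model; the array shapes, the goal indexing and the tolerance are illustrative assumptions, and convergence is taken for granted under the conditions of Lem. 2.

```python
import numpy as np

def ssp_value_iteration(P, c, goal, tol=1e-8, max_iters=100_000):
    """Sketch of value iteration for a known SSP model (Eqs. 1-2).

    P:    transition kernel of shape (S+1, A, S+1), goal state included at index `goal`.
    c:    expected costs of shape (S+1, A), with c[goal, :] = 0.
    Assumes the conditions of Lem. 2 hold so that the iterates converge to V?.
    """
    S1, A, _ = P.shape
    V = np.zeros(S1)
    Q = np.zeros((S1, A))
    for _ in range(max_iters):
        Q = c + P @ V          # Q(s, a) = c(s, a) + P_{s,a} V   (Eq. 2)
        V_new = Q.min(axis=1)  # V(s) = min_a Q(s, a)            (Eq. 1)
        V_new[goal] = 0.0      # goal is absorbing and zero-cost
        if np.max(np.abs(V_new - V)) <= tol:
            V = V_new
            break
        V = V_new
    return V, Q
```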
Since we will target the best proper policy, we will handle the second requirement of Lem. 2 in the following way [Bertsekas and Yu, 2013, Rosenberg et al., 2020]. First, we see that the requirement is in particular verified if all instantaneous costs are strictly positive. To deal with the case of non-negative costs, we can introduce a small additive perturbation η ∈ (0, 1] to all costs to yield a new cost function cη(s, a) = max{c(s, a), η}, which is strictly positive. In this cost-perturbed MDP, the conditions of Lem. 2 hold, so we obtain an optimal policy π?_η that is stationary, deterministic and proper and has a finite value function V?_η. Taking the limit as η → 0, we have that π?_η → π? and V?_η → V^{π?}, where π? is the optimal proper policy in the original model that is also stationary and deterministic, and V^{π?} denotes its value function. This observation enables us to circumvent the second condition of Lem. 2 and to only require Asm. 1 to hold.
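As a small usage sketch (again not from the paper), the perturbation can be combined with the value-iteration sketch above; the function name and the default value of η are illustrative.

```python
import numpy as np

def perturbed_values(P, c, goal, eta=1e-6):
    """Sketch: value iteration on the cost-perturbed model c_eta(s, a) = max{c(s, a), eta}.
    With eta > 0 the conditions of Lem. 2 hold, and letting eta -> 0 recovers the optimal
    proper policy of the original model, as discussed above."""
    c_eta = np.maximum(c, eta)                  # strictly positive costs
    return ssp_value_iteration(P, c_eta, goal)  # from the sketch above
```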
Learning formulation. We consider the learning problem where the agent does not have any prior knowledge of the cost function c or transition function P . Each episode starts at the initial state s0 (the extension to any possibly unknown distribution of initial states is straightforward), and each episode ends only when the goal state g is reached (note that this may never happen if the agent does not devise an adequate goal-reaching strategy to terminate the episode). We evaluate the performance of the agent after K episodes by its regret, which captures the difference between its total cost over the K episodes and the total expected cost of the optimal proper policy:
R_K := Σ_{k=1}^{K} Σ_{h=1}^{I^k} c_h^k − K · min_{π∈Πproper} V^π(s0),   (3)

where I^k is the time it takes the agent to complete episode k and c_h^k is the cost incurred in the h-th step of episode k when the agent visits state-action pair (s_h^k, a_h^k). If there exists k such that I^k is infinite, we define R_K = ∞.

Throughout the paper, we denote the optimal proper policy by π? and we set V?(s) := V^{π?}(s) = min_{π∈Πproper} V^π(s) and Q?(s, a) := Q^{π?}(s, a) = min_{π∈Πproper} Q^π(s, a) for all (s, a) ∈ S × A. Let B? > 0 bound the values of V?, i.e., B? := max_{s∈S} V?(s). From Eq. 2 it is obvious that Q?(s, a) ≤ 1 + B?. Finally, let T? > 0 bound the expected goal-reaching times of the optimal policy, i.e., T? := max_{s∈S} T^{π?}(s). We see that B? ≤ T? < +∞.
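For illustration only, Eq. 3 could be evaluated from logged interaction data as in the following sketch (the variable names and data layout are assumptions, and the true V?(s0) is of course unknown to the learner):

```python
def empirical_regret(episode_costs, V_star_s0):
    """Regret R_K of Eq. 3 (a sketch): total cost incurred over the K episodes
    minus K times the optimal expected cost from s0.

    episode_costs: list of K lists, where episode_costs[k][h] = c_h^k.
    V_star_s0:     min over proper policies of V^pi(s0)."""
    K = len(episode_costs)
    return sum(sum(costs) for costs in episode_costs) - K * V_star_s0
```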
3 Algorithm and Analysis Overview
We introduce our algorithm EB-SSP (Exploration Bonus for SSP) in Alg. 1. It takes as input the state-action space S × A, the confidence level δ ∈ (0, 1) and the number of episodes K. It also considers that an estimate B such that B ≥ B? is available, and we handle in Sect. 4.2 and App. F the case where such an estimate is unavailable (i.e., unknown B?). As explained in Sect. 2, the algorithm will enforce the conditions of Lem. 2 to hold by adding a small cost perturbation η ∈ [0, 1] (cf. lines 3, 12) — either η = 0 if the agent is aware that all costs are already positive, otherwise a careful choice of η > 0 will be provided in Sect. 4.

Our algorithm builds on a value-optimistic approach, i.e., it sequentially constructs optimistic lower bounds on the optimal Q-function and executes the policy that greedily minimizes these optimistic values. The structure is inspired by the MVP algorithm in Zhang et al. [2020a] designed for finite-horizon RL. In particular, we also adopt the doubling update framework (first introduced by Jaksch et al. [2010]): whenever the number of visits of a state-action pair is doubled, the algorithm updates the empirical cost and transition probability of this state-action pair, and computes a new optimistic Q-estimate and optimistic greedy policy. Note that this slightly differs from MVP, which waits for the end of the finite-horizon episode to update the policy. In SSP, however, having this delay may yield linear regret as the episode has the risk of never terminating under the current policy (e.g., if it is improper), which is why we perform the policy update instantaneously when the doubling condition is met.

The main algorithmic component lies in computing the Q-values (w.r.t. which the policy is greedy) whenever each doubling condition is met. To this purpose, we introduce a procedure called VISGO, for Value Iteration with Slight Goal Optimism. Starting with optimistic values V^(0) = 0, VISGO iteratively computes V^(i+1) = L̃V^(i) for a carefully defined operator L̃. It terminates when a stopping condition is met, specifically once ‖V^(i+1) − V^(i)‖∞ ≤ ε_VI for a precision level ε_VI > 0 (that will be specified later), and it outputs the corresponding values V^(i+1) (and Q-values Q^(i+1)).
Algorithm 1: Algorithm EB-SSP

1  Input: S, s0 ∈ S, g ∉ S, A, δ.
2  Input: an estimate B guaranteeing B ≥ max{B?, 1} (see Sect. 4.2 and App. F if not available).
3  Optional input: cost perturbation η ∈ [0, 1].
4  Specify: Trigger set N ← {2^{j−1} : j = 1, 2, ...}. Constants c1 = 6, c2 = 36, c3 = 2√2, c4 = 2√2.
5  For (s, a, s′) ∈ S × A × S′, set N(s, a) ← 0; n(s, a) ← 0; N(s, a, s′) ← 0; P̂_{s,a,s′} ← 0; θ(s, a) ← 0; ĉ(s, a) ← 0; Q(s, a) ← 0; V(s) ← 0.
6  Set initial time step t ← 1 and trigger index j ← 0.
7  for episode k = 1, 2, ... do
8      Set s_t ← s0.
9      while s_t ≠ g do
10         Take action a_t = arg min_{a∈A} Q(s_t, a), incur cost c_t and observe next state s_{t+1} ∼ P(·|s_t, a_t).
11         Set (s, a, s′, c) ← (s_t, a_t, s_{t+1}, max{c_t, η}) and t ← t + 1.
12         Set N(s, a) ← N(s, a) + 1, θ(s, a) ← θ(s, a) + c, N(s, a, s′) ← N(s, a, s′) + 1.
13         if N(s, a) ∈ N then
14             \\ Update triggered: VISGO procedure
15             Set j ← j + 1.
16             \\ Compute empirical costs ĉ, empirical transitions P̂ and slightly skewed transitions P̃
17             Set ĉ(s, a) ← (2θ(s, a)/N(s, a)) · I[N(s, a) ≥ 2] + θ(s, a) · I[N(s, a) = 1] and θ(s, a) ← 0.
18             For all s′ ∈ S′, set P̂_{s,a,s′} ← N(s, a, s′)/N(s, a), n(s, a) ← N(s, a), and

                   P̃_{s,a,s′} ← (n(s, a)/(n(s, a) + 1)) · P̂_{s,a,s′} + I[s′ = g]/(n(s, a) + 1).   (4)

19             \\ Update Q
20             Set ε_VI ← 2^{−j} and i ← 0, V^(0) ← 0, V^(−1) ← +∞.
21             For all (s, a) ∈ S × A, set n⁺(s, a) ← max{n(s, a), 1} and ι_{s,a} ← ln(12SAS′[n⁺(s, a)]²/δ).
22             while ‖V^(i) − V^(i−1)‖∞ > ε_VI do
23                 \\ Inner iteration of VISGO
24                 For all (s, a) ∈ S × A, set

                   b^(i+1)(s, a) ← max{ c1 √(V(P̃_{s,a}, V^(i)) ι_{s,a} / n⁺(s, a)), c2 B ι_{s,a} / n⁺(s, a) } + c3 √(ĉ(s, a) ι_{s,a} / n⁺(s, a)) + c4 B √S′ ι_{s,a} / n⁺(s, a),   (5)
                   Q^(i+1)(s, a) ← max{ ĉ(s, a) + P̃_{s,a} V^(i) − b^(i+1)(s, a), 0 },   (6)
                   V^(i+1)(s) ← min_a Q^(i+1)(s, a).   (7)

25                 Set i ← i + 1.
26             Set Q ← Q^(i), V ← V^(i).
We begin by explaining how we design L̃ and we then provide some intuition. Let P̂, ĉ be the current empirical transition probabilities and costs, and let n(s, a) be the current number of visits to state-action pair (s, a) (and n⁺(s, a) = max{n(s, a), 1}). We first define transition probabilities P̃ that are slightly skewed towards the goal w.r.t. P̂, as follows:

P̃_{s,a,s′} := (n(s, a)/(n(s, a) + 1)) · P̂_{s,a,s′} + I[s′ = g]/(n(s, a) + 1).   (8)
We then define the bonus function, for any vector V ∈ R^{S′} and state-action pair (s, a), as

b(V, s, a) := max{ c1 √(V(P̃_{s,a}, V) ι_{s,a} / n⁺(s, a)), c2 B ι_{s,a} / n⁺(s, a) } + c3 √(ĉ(s, a) ι_{s,a} / n⁺(s, a)) + c4 B √S′ ι_{s,a} / n⁺(s, a),   (9)
for the estimate B, specific constants c1, c2, c3, c4 and a state-action dependent logarithmic term ι_{s,a}. Note that the last term in Eq. 9 accounts for the skewing of P̃ w.r.t. P̂ (see Lem. 12). Given the transitions P̃ and bonus b, we are ready to define the operator L̃ as

L̃V(s) := max{ min_{a∈A} { ĉ(s, a) + P̃_{s,a}V − b(V, s, a) }, 0 }.   (10)
We see that L̃ instills optimism in two different ways:
(i) On the empirical cost function ĉ, via the variance-aware exploration bonus b (Eq. 9) that intuitively lowers the costs to ĉ − b;

(ii) On the empirical transition function P̂, via the transitions P̃ (Eq. 8) that slightly bias P̂ with the addition of a non-zero probability of reaching the goal from every state-action pair.

While the first feature (i) is standard in finite-horizon approaches, the second feature (ii) is SSP-specific and it is required to cope with the fact that the empirical model P̂ may not admit any proper policy, meaning that executing value iteration for SSP on P̂ may diverge whatever the cost function. Our simple transition skewing in (ii) actually yields the strong guarantee that all policies are proper in P̃, for any bounded cost function.6,7 Crucially, we design the skewing in (ii) so that this extra goal-reaching probability decays inversely with n(s, a), the number of visits of the state-action pair. This allows us to tightly control the gap between P̃ and P̂ and to ensure that it will only account for a lower-order term in the regret (cf. last term of Eq. 9). Note that the model-optimistic algorithms in Tarbouriech et al. [2020a], Rosenberg et al. [2020] induce larger deviations between the empirical model and their optimistic models, which leads to an additional √S factor in their regret bounds.
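To make Eqs. 8–10 concrete, here is a minimal numpy sketch (not part of the paper) of one application of the operator L̃: it skews the empirical transitions towards the goal, forms the variance-aware bonus, and performs the clipped optimistic backup. The array shapes, the goal index and the clipping of the empirical variance at zero are illustrative assumptions.

```python
import numpy as np

def visgo_step(V, P_hat, c_hat, n, B, iota, goal=-1,
               c1=6.0, c2=36.0, c3=2 * np.sqrt(2), c4=2 * np.sqrt(2)):
    """One application of L~ (Eqs. 8-10), i.e., one inner iteration of VISGO.

    V:     current value iterate, shape (S+1,), with V[goal] = 0.
    P_hat: empirical transitions, shape (S+1, A, S+1).
    c_hat: empirical costs, shape (S+1, A).
    n:     visit counts n(s, a), shape (S+1, A).
    iota:  logarithmic terms iota_{s,a}, shape (S+1, A).
    The constants c1..c4 follow Alg. 1; S' = S + 1.
    """
    S1 = V.shape[0]
    n_plus = np.maximum(n, 1)
    # Eq. 8: slightly skew the empirical transitions towards the goal.
    P_tilde = P_hat * (n / (n + 1))[..., None]
    P_tilde[..., goal] += 1.0 / (n + 1)
    # Variance of V under P_tilde: V(P~_{s,a}, V) = P~ V^2 - (P~ V)^2.
    PV = P_tilde @ V
    var = np.maximum(P_tilde @ (V ** 2) - PV ** 2, 0.0)
    # Eq. 9: variance-aware exploration bonus.
    b = np.maximum(c1 * np.sqrt(var * iota / n_plus), c2 * B * iota / n_plus) \
        + c3 * np.sqrt(c_hat * iota / n_plus) + c4 * B * np.sqrt(S1) * iota / n_plus
    # Eq. 10: clipped optimistic Bellman backup.
    Q_new = np.maximum(c_hat + PV - b, 0.0)
    V_new = Q_new.min(axis=1)
    V_new[goal] = 0.0  # the goal state is absorbing and zero-cost
    return V_new, Q_new
```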
Equipped with these two sources of optimism, as long as B ≥ B?, we are able to prove that any VISGO procedure verifies the following two key properties:

(1) Optimism: VISGO outputs an optimistic estimator of the optimal Q-function at each iteration step, i.e., Q^(i)(s, a) ≤ Q?(s, a), ∀i ≥ 0,
(2) Finite-time near-convergence: VISGO terminates within a finite number of iteration steps (note that the final iterate V^(j) approximates the fixed point of L̃ up to an error scaling with ε_VI).
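As a usage sketch of the stopping rule behind property (2), the loop below iterates visgo_step from the sketch above until successive iterates are ε_VI-close; the precision ε_VI and the model statistics are assumed to be maintained as in Alg. 1.

```python
import numpy as np

def visgo(P_hat, c_hat, n, B, iota, eps_vi):
    """Sketch of the full VISGO procedure: starting from V^(0) = 0, iterate
    V^(i+1) = L~ V^(i) until ||V^(i+1) - V^(i)||_inf <= eps_vi, then return the
    final (optimistic) value and Q-value estimates."""
    V = np.zeros(c_hat.shape[0])  # V^(0) = 0 (optimistic initialization)
    while True:
        V_new, Q = visgo_step(V, P_hat, c_hat, n, B, iota)  # see the sketch above
        if np.max(np.abs(V_new - V)) <= eps_vi:
            return V_new, Q  # finite-time near-convergence (property 2)
        V = V_new
```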
6 In fact this transition skewing also implies that the SSP problem defined on P̃ becomes equivalent to a discounted RL problem, with a varying state-action dependent discount factor (cf. Remark 2 later).
7 Also note that for different albeit mildly related purposes, a perturbation trick is sometimes used in regret minimization for average-reward MDPs [e.g., Fruit et al., 2018, Qian et al., 2019], where a non-zero probability of reaching an arbitrary state is added at each state-action pair to guarantee that all policies are unichain and that variants of value iteration (model-optimistic or value-optimistic) are nearly convergent in finite-time.
Having variance-aware exploration bonuses introduces a subtle correlation between consecutive value iterates (i.e., b depends on V in Eq. 9), as in the finite-horizon case [Zhang et al., 2020a]. To satisfy (1), we derive similarly to MVP a monotonicity property of the operator L̃, which can be achieved by carefully tuning the constants c1, c2, c3, c4 in the exploration bonus of Eq. 9. On the other hand, the second requirement (2) is SSP-specific; in particular it is not needed in finite-horizon where value iteration always requires exactly H backward induction steps. We establish that the operator L̃ also satisfies a contraction property, which guarantees a polynomially bounded number of iterations before terminating, i.e., (2).
Regret analysis. Besides ensuring the computational efficiency of EB-SSP, the properties of VISGO lay the foundations for our regret analysis. In particular, we show that the recursion-based methods introduced by Zhang et al.[2020a] in finite-horizon can be adapted to the more general SSP setting. While some differences are quite straightforward to handle — e.g., the mismatch in the regret definitions between SSP
and finite-horizon, or the need in SSP to control the sequence of the VISGO errors ε_VI — we now point out two main challenges that appear.

• Challenge 1: disparity between episodes. In finite-horizon MDPs, the episodes always yield a cumulative reward that is upper bounded by H, or by 1 under the setting of Zhang et al. [2020a]. As such, their analysis is naturally episode-decomposed. However, in SSP, episodes may display very different behavior, as some episodes may incur much higher costs than others (especially at the beginning of learning). To deal with this, we decompose consecutive time steps into so-called intervals and we instead perform an interval-decomposed analysis. Note that the interval decomposition is not performed algorithmically, but only at the analysis level. Similarly to Rosenberg et al. [2020], we construct our intervals so that the cumulative cost within an interval is always upper bounded by B? + 1. This interestingly mirrors the finite-horizon assumption that the cumulative reward within an episode is bounded by 1 [Zhang et al., 2020a]. Our analysis will need to perform some sort of rescaling in 1/B? and consider normalized value functions (with range always upper bounded by 1), in order to avoid an exponential blow-up in the recursions. We stress that we do not make any extra assumption on the SSP problem w.r.t. the cumulative cost of the trajectories (which may even be infinite if the agent never reaches the goal). This differs from Zhang et al. [2020a], whose finite-horizon analysis relies on the assumption that the cumulative reward of any trajectory is upper bounded by 1.
• Challenge 2: unknown range of the optimal value function. In finite-horizon MDPs, both the horizon H and the bound on the cumulative reward (H, or 1 for Zhang et al. [2020a]) are assumed to be known to the agent. This prior knowledge is key to tune exploration bonuses that yield valid confidence intervals and thus guarantee optimism. In SSP, if the agent does not have a valid estimate B ≥ B?, then it may design an under-specified exploration bonus which cannot guarantee optimism. The case of unknown B? is non-trivial: it appears impossible to properly estimate B? (since some states may never be visited) and it is unclear how a standard doubling trick may be leveraged.8 In Sect. 4.2 and App. F, we propose a new scheme that circumvents the knowledge of B? at the cost of only logarithmic and lower-order contributions to the regret. Essentially, the agent starts with estimate B̃ = 1 and tracks both the cumulative costs and the range of the value function computed by VISGO. The agent carefully increments B̃ in two possible ways (each with a different speed): either whenever one of the two tracked quantities exceeds some computable threshold, or whenever a new episode begins. Our analysis then boils down to the following argument: if B̃ < B? we bound the regret by the cumulative cost threshold; otherwise optimism holds and we can recover the regret bound under a known upper bound on B?.
8Note, for instance, that the exploration-bonus-based approach of Qian et al.[2019] in average-reward MDPs requires prior knowledge of an upper bound on the optimal bias span.
4 Main Results
We first establish the following guarantee on which all our subsequent regret bounds will rely (Sect.5 is devoted to its proof).
Theorem 3. Assume that B ≥ max{B?, 1} and that the conditions of Lem.2 hold. Then with probability at least 1 − δ the regret of EB-SSP (Alg.1 with η = 0) can be bounded by
R_K = O( √((B?² + B?)SAK) · log(max{B?, 1}SAT/δ) + BS²A · log²(max{B?, 1}SAT/δ) ),

where the random variable T denotes the accumulated time within the K episodes. In the case where B? ≥ 1, we can simplify the above bound as

R_K = O( B?√(SAK) · log(B?SAT/δ) + BS²A · log²(B?SAT/δ) ).
The statement of Thm.3 depends on the estimate B and more importantly on T (in its log terms), which is the total time accumulated within the K episodes and which is a random (and possibly unbounded) quantity. Recall that throughout we consider that Asm.1 holds. We now proceed as follows:
• In Sect. 4.1 we assume that B = B? (i.e., the agent has prior knowledge of B?) and we focus on bounding T . We instantiate several regret bounds depending on the nature of the SSP problem at hand.
• In Sect. 4.2 we handle the case of unknown B? and show how all the regret bounds of Sect. 4.1 remain valid up to logarithmic and lower-order terms.
Throughout Sect. 4, we consider for ease of exposition that B? ≥ 1; otherwise all the following bounds hold by replacing the occurrences of B? with max{B?, 1}, except for the B? multiplicative factor in the leading term, which becomes √B?.9 Moreover, for simplicity (when tuning the subsequent cost perturbations) we assume as in prior works [e.g., Rosenberg et al., 2020, Chen et al., 2020, Chen and Luo, 2021] that the total number of episodes K is known to the agent (this knowledge can be eliminated with the standard doubling trick).
4.1 Regret Bounds for B = B?
First we assume that B = B? (i.e., the agent has prior knowledge of B?) and we analyze the regret achieved by EB-SSP under various conditions on the SSP model.
4.1.1 Positive Costs
Assumption 4. All costs are lower bounded by a constant cmin > 0 which may be unknown to the agent.
We first focus on the case of positive costs. Asm. 4 guarantees that the conditions of Lem. 2 hold. Moreover, denoting by C the total cumulative cost, we have that the total time satisfies T ≤ C/cmin. By simplifying the bound of Thm. 3 as C = O(B?S²A·√(KB?TSA/δ)) and solving the resulting self-bounding inequality in T, we loosely obtain that T = O(B?³S⁵A³K/(cmin²δ)). Plugging this in Thm. 3 yields the following result.
Corollary 5. Under Asm. 4, running EB-SSP (Alg. 1) with B = B? and η = 0 gives the following regret bound with probability at least 1 − δ:

R_K = O( B?√(SAK) · log(KB?SA/(cmin δ)) + B?S²A · log²(KB?SA/(cmin δ)) ).

9 When B? < 1, Thm. 3 has a leading term scaling in Õ(√(B?SAK)). While this shows a mismatch with the lower bound Ω(B?√(SAK)) of Rosenberg et al. [2020], we notice that this lower bound assumes that B? ≥ 2. It is thus unclear what the dependence in B? of the lower bound should look like (either B? or √B?) in the uncommon corner case of B? < 1.
The bound of Cor. 5 only depends polynomially on K, S, A, B?. We note that T? ≤ B?/cmin and that this upper bound only appears in the logarithms. Under positive costs, the regret of EB-SSP is thus (nearly) minimax and horizon-free. Furthermore, in App. B we introduce an alternative assumption on the SSP problem (which is weaker than Asm. 4) that considers that there are no almost-sure zero-cost cycles. We show that in this case also, the regret of EB-SSP is (nearly) minimax and horizon-free.
4.1.2 General Costs and T? Unknown

Now we handle the general case of non-negative costs, with no assumption other than Asm. 1. We use a cost perturbation argument to generalize the results from the cmin > 0 case to general costs (as also done in Tarbouriech et al. [2020a], Rosenberg et al. [2020], Rosenberg and Mansour [2020]). As reviewed in Sect. 2, this enables us to circumvent the second condition of Lem. 2 (which holds in the cost-perturbed MDP) and to target the optimal proper policy in the original MDP up to a bias scaling with the cost perturbation. Specifically, if EB-SSP is run with costs cη(s, a) ← max{c(s, a), η} for any η ∈ (0, 1], then we recover the regret bound of Cor. 5 with cmin ← η and B? ← B? + ηT?, as well as an additive bias of ηT?K. By picking η to balance these terms, we obtain the following.
Corollary 6. Running EB-SSP (Alg. 1) with B = B? and η = K^{−n} for any choice of constant n > 1 gives the following regret bound with probability at least 1 − δ:

R_K = O( nB?√(SAK)·L + T?/K^{n−1} + nT?√(SA)·L/K^{n−1/2} + n²B?S²A·L² ),