Stochastic Shortest Path: Minimax, Parameter-Free and Towards Horizon-Free Regret

Jean Tarbouriech∗† Runlong Zhou∗‡ Simon S. Du§ Matteo Pirottak Michal Valko¶ Alessandro Lazarick

Abstract We study the problem of learning in the stochastic shortest path (SSP) setting, where an agent seeks to minimize the expected cost accumulated before reaching a goal state. We design a novel model-based algorithm EB-SSP that carefully skews the empirical transitions and perturbs the empirical costs with an exploration bonus to guarantee both optimism and convergence√ of the associated value iteration scheme. We prove that EB-SSP achieves the minimax regret rate Oe(B? SAK), where K is the number of episodes, S is the number of states, A is the number of actions and B? bounds the expected cumulative cost of the optimal policy from any state, thus closing the gap with the lower bound. Interestingly, EB-SSP obtains this result while being parameter-free, i.e., it does not require any prior knowledge of B?, nor of T? which bounds the expected time-to-goal of the optimal policy from any state. Furthermore, we illustrate various cases (e.g., positive costs, or general costs when an order-accurate estimate of T? is available) where the regret only contains a logarithmic dependence on T?, thus yielding the first horizon-free regret bound beyond the finite-horizon MDP setting.

1 Introduction

Stochastic shortest path (SSP) is a goal-oriented (RL) setting where the agent aims to reach a predefined goal state while minimizing its total expected cost [Bertsekas, 1995]. Unlike other well-studied settings in RL (e.g., finite-horizon), the interaction between the agent and the environment ends only when (and if) the goal state is reached, so its length is not predetermined (nor bounded) and is influenced by the agent’s behavior. SSP is a general RL setting that includes both finite-horizon and discounted Markov Decision Processes (MDPs) as special cases. Moreover, many popular RL problems can be cast under the SSP formulation, such as game playing (e.g., Atari games) or navigation problems (e.g., Mujoco mazes). We study the online learning problem in the SSP setting (online SSP in short), where both the transition dynamics and the cost function are initially unknown. The agent interacts with the environment for K episodes, each ending when the goal state is reached. Its objective is to reach the goal while incurring minimal expected cumulative cost. Online SSP has only been sparsely and recently studied. Before reviewing the

arXiv:2104.11186v1 [cs.LG] 22 Apr 2021 relevant literature, we begin by identifying some desirable properties for a learning algorithm in online SSP. • Desired property 1: Minimax. A first characteristic of an algorithm is to perform as close as possible ? to the optimal√ policy π , i.e., it should achieve low regret. The information theoretic lower bound on the regret is Ω(B? SAK) [Rosenberg et al., 2020], where K is the number of episodes, S is the number of states, A is the number of actions and B? bounds the total expected cost of the optimal policy starting from any state. We say that: ∗equal contribution †Facebook AI Research & Inria, Scool team. Email: [email protected] ‡Tsinghua University. Email: [email protected] §University of Washington & Facebook AI Research. Email: [email protected] ¶DeepMind. Email: [email protected] kFacebook AI Research. Emails: {pirotta, lazaric}@fb.com

1 √ An algorithm for online SSP is (nearly) minimax optimal if its regret is bounded by Oe(B? SAK), up to logarithmic factors and lower-order terms. • Desired property 2: Parameter-free. Another relevant consideration is the amount of prior knowledge required by the algorithm. While the knowledge of S, A and the cost (or reward) range [0, 1] is standard across regret minimization settings (e.g., finite-horizon, discounted, average-reward), the complexity of learning in SSP problems may be linked to SSP-specific quantities such as B? and T? which denotes the expected time-to-goal of the optimal policy from any state.1 An algorithm is said to rely on a parameter X (i.e., B?, T?) if it requires prior knowledge of X, or a tight upper bound of it, to achieve its regret bound. We then say that:

An algorithm for online SSP is parameter-free if it relies neither on T? nor B? prior knowledge. • Desired property 3: Horizon-free. A core challenge in SSP is to trade-off between minimizing costs and quickly reaching the goal state. This is accentuated when the instantaneous costs are small, i.e., when there is a mismatch between B? and T?. Indeed, while B? ≤ T? always since the cost range is [0, 1], the gap between the two may be arbitrarily large (see e.g., the simple example of App.A). The lower bound stipulates that the regret does depend on B?, while the “time horizon” of the problem, i.e., T? should a priori not impact the regret, even as a lower-order term. By analogy with the horizon-free guarantees recently uncovered in finite-horizon MDPs with total reward bounded by 1 (cf. Sect. 1.3), we say that:

An algorithm for online SSP is (nearly) horizon-free if its regret depends only poly-logarithmically on T?. Remark 1. We also mention that another consideration is computational efficiency. Although the main focus in regret minimization is often statistical efficiency, a desirable property is that the run-time is polynomial in K, S, A, B?,T?. Like all existing algorithms for online SSP, our algorithm is computationally efficient.

1.1 Related Work The setting of online learning in SSP was introduced by Tarbouriech et al.[2020a] who gave a parameter-free algorithm with a Oe(K2/3) regret guarantee. Rosenberg et al.[2020] later improved this result by deriving the 3/2 √ first order-optimal√ algorithm with a regret bound Oe(B? S AK) in the parameter-free case and a bound Oe(B?S AK) when√ prior knowledge√ of B? is available (for appropriately tuning the cost perturbation). These rates display a B?S (resp. S) mismatch w.r.t. the lower bound. Both approaches can be described as model-optimistic2, drawing inspiration from the ideas behind the UCRL2 algorithm [Jaksch et al., 2010] for average-reward MDPs. Note that the above bounds contains lower-order dependencies either on T? in the case of general costs, or on B?/cmin in the case of positive costs lower bounded by cmin > 0 (note that T? ≤ B?/cmin, which is one of the reasons why cmin shows up in existing results). As such, none of these bounds is horizon-free according to the third desired property. Concurrently to our work, Cohen et al.[2021] propose an algorithm for online SSP based on a black-box reduction from SSP to finite-horizon MDPs. It successively tackles finite-horizon problems with horizon set to H = Ω(T?) and costs augmented by a terminal cost set to cH (s) = Ω(B?I(s 6= g)), where g denotes the goal state. This finite-horizon construction guarantees that its optimal policy resembles the optimal policy in√ the original SSP instance, up to a lower-order bias. Their algorithm comes with a regret bound 4 2 5 −1 of O(B? SAKL + T? S AL ), with L = log(KT?SAδ ) (with probability at least 1 − δ). Inspecting their bound w.r.t. the three aforementioned desired properties yields that it achieves a nearly minimax optimal rate, however it relies on both T? and B? prior knowledge to tune the horizon and terminal cost in the reduction, 3 respectively. Finally it is not horizon-free due to the T? polynomial dependence in the lower-order term. 1The SSP-diameter D, i.e., the longest shortest path from any initial state to the goal, is sometimes used in characterizing the complexity of learning in SSP with adversarial costs. Since no existing algorithm for online SSP relies on D to achieve better regret guarantees, we will ignore it in our classification. 2See Neu and Pike-Burke[2020] for details on differences and interplay of model-optimistic and value-optimistic approaches. 3 As mentioned in Cohen et al.[2021, Remark 2], in the case of positive costs lower bounded by cmin > 0, their knowledge of T can be bypassed by replacing it with the upper bound T ≤ B /c . However, when generalizing from the c case to ? ? ? min √ min −4 4/5 general costs with a perturbation argument, their regret guarantee worsens from Oe( K + cmin) to Oe(K ), because of the −1 poor additive dependence in cmin.

2 Horizon- Algorithm Regret Minimax Parameters Free

2/3 [Tarbouriech et al., 2020a] OeK (K ) No None No √  3/2 2  Oe B?S AK + T? S A No B? No [Rosenberg et al., 2020] √  3/2 2  Oe B? S AK + T?B?S A No None No [Cohen et al., 2021] √  4 2  (concurrent work) Oe B? SAK + T? S A Yes B?, T? No √  2  Oe B? SAK + B?S A Yes B?, T? Yes  √  2 T? ∗ Oe B? SAK + B?S A + poly(K) Yes B? No This work √  3 3  Oe B? SAK + B? S A Yes T? Yes  √  3 3 T? ∗ Oe B? SAK + B? S A + poly(K) Yes None No √ Lower Bound Ω(B? SAK) - - -

Table 1: Regret comparisons of algorithms for online SSP. Oe omits logarithmic factors and OeK only reports the dependence√ in K. Regret is the performance metric defined in Eq.3. Minimax: Whether the regret matches the Ω(B? SAK) lower bound of Rosenberg et al.[2020], up to logarithmic factors and lower-order terms. Parameters: The parameters that the algorithm requires as input: either both B? and T?, or one of them, or none (i.e., parameter- ∗ free). Horizon-Free: Whether the regret bound depends only logarithmically on T?. If K is known in advance, the additive term T?/poly(K) has a denominator that is polynomial in K, so it becomes negligible for large values of K (if K is unknown, the additive term is T?). See Sect.4 for the full statements of our regret bounds.

1.2 Contributions We summarize our main contributions as follows: • We propose EB-SSP (Exploration Bonus for SSP), an algorithm that builds on a new value-optimistic approach for SSP. In particular, we introduce a simple way of computing optimistic policies, by slightly biasing the estimated transitions towards reaching the goal from each state-action pair with positive probability. Under these biased transitions, all policies are in fact proper in the empirical SSP (i.e., they eventually reach the goal with probability 1 from any starting state). We decay the bias over time in a way that it only contributes to the regret as a lower-order term. See Sect.3 for an overview of our algorithm and analysis. Note that EB-SSP is not based on a model-optimistic approach2 [Tarbouriech et al., 2020a, Rosenberg et al., 2020], and it does not rely on a reduction from SSP to finite-horizon [Cohen et al., 2021] (i.e., we operate at the level of the non-truncated SSP model); √ • EB-SSP is the first algorithm to achieve the minimax rate of Oe(B? SAK) while being parameter- free: it does not require to know nor estimate T?, and it is able to bypass the knowledge of B? at the cost of only logarithmic and lower-order contributions to the regret; • EB-SSP is the first algorithm to achieve horizon-free regret for SSP in various cases: i) positive costs, ii) no almost-sure zero-cost cycles, and iii) the general cost case when an order-accurate estimate of T? T? ζ is available (i.e., a value T ? that verifies υ ≤ T ? ≤ λT? for some positive unknown constants υ, λ, ζ ≥ 1 is known). It also does so while maintaining the minimax rate. For instance, under general costs when

3 √ 2 4 relying on T? and B?, EB-SSP achieves a regret of Oe(B? SAK +B?S A). To the best of our knowledge, EB-SSP yields the first set of nearly horizon-free bounds beyond the setting of finite-horizon MDPs. Table1 summarizes our contributions and positions them with respect to prior and concurrent work.

1.3 Additional Related Work Planning in SSP. Early work by Bertsekas and Tsitsiklis[1991], followed by a line of work [e.g., Bertsekas, 1995, Bonet, 2007, Kolobov et al., 2011, Bertsekas and Yu, 2013, Guillot and Stauffer, 2020], examine the planning problem in SSP, i.e., how to compute an optimal policy when all parameters of the SSP model are known. Under mild assumptions, the optimal policy is deterministic and stationary and can be computed efficiently using standard planning techniques, e.g., value iteration, policy iteration or linear programming.

Regret minimization in MDPs. The exploration-exploitation dilemma in tabular MDPs has been extensively studied in finite-horizon [e.g., Azar et al., 2017, Jin et al., 2018, Zanette and Brunskill, 2019, Efroni et al., 2019, Simchowitz and Jamieson, 2019, Zhang et al., 2020b, Neu and Pike-Burke, 2020, Xu et al., 2021, Menard et al., 2021] and infinite-horizon [e.g., Jaksch et al., 2010, Bartlett and Tewari, 2012, Fruit et al., 2018, Wang et al., 2020b, Qian et al., 2019, Wei et al., 2020].

Other SSP-based settings. SSP with adversarially changing costs was investigated by Rosenberg and Mansour[2020], Chen et al.[2020], Chen and Luo[2021]. 5 Tarbouriech et al.[2021] study the sample complexity of SSP with a generative model, due to the observation that a standard regret-to-PAC conversion may not hold in SSP (as opposed to finite-horizon). Exploration problems involving multiple goal states (i.e., multi-goal SSP or goal-conditioned RL) were analyzed by Lim and Auer[2012], Tarbouriech et al.[2020b].

Horizon-free regret in finite-horizon MDPs. Our third desired property echoes the recent investigation of so-called horizon-free regret in finite-horizon MDPs that satisfy the assumption of total reward bounded by 1. Since Jiang and Agarwal[2018], it was wondered whether it is possible to obtain a regret bound that only depends logarithmically in the horizon H, which would imply that long-horizon RL is no more difficult than short-horizon RL. Recently, this question was answered positively [Wang et al., 2020a, Zhang et al., 2020a, 2021]. In SSP (which we recall subsumes finite-horizon MDPs), a natural counterpart of the horizon is T?, which bounds the expected time of the optimal policy to terminate the episode from any state. This contrasts with the horizon H in finite-horizon MDPs, which is the almost-sure time of any policy to terminate the episode. Note that such notion of horizon would obviously be too strong for SSP, where some (even most) policies may never reach the goal, thus having an unbounded time horizon. This motivates our definition of (nearly) horizon-free regret in SSP if it only contains a logarithmic dependence on T?. We stress that we do not make any extra assumption on the SSP problem; in particular we do not assume any bound on the cumulative cost of a trajectory.

2 Preliminaries

SSP-MDP. An instance of the SSP problem is a Markov decision process (MDP) M := hS, A, P, c, s0, gi, where S is the finite state space with cardinality S, A is the finite action space with cardinality A, and s0 ∈ S is the initial state. We denote by g∈ / S the goal state, and we set S0 := S ∪ {g} (thus S0 := S + 1). Taking action a in state s incurs a cost drawn i.i.d. from a distribution with expectation c(s, a) ∈ [0, 1], and the 0 0 0 P 0 next state s ∈ S is selected with probability P (s |s, a) (where s0∈S0 P (s |s, a) = 1). The goal state g is √ 4 We conjecture the optimal problem-independent regret in SSP to be Oe(B? SAK + B?SA) (by extrapolating the conjecture of Menard et al.[2021] for finite-horizon MDPs), which shows the tightness of our bound up to an S lower-order factor. 5Note that a different line of work [e.g. Neu et al., 2010, 2012, Rosenberg and Mansour, 2019a,b, Jin et al., 2020, Jin and Luo, 2020] studies finite-horizon MDPs with adversarial costs (sometimes called online loop-free SSP), where an episode ends after a fixed number of H steps. This is not to be confused with our setting where an episode lasts as long as the goal is not reached.

4 absorbing and zero-cost, i.e., P (g|g, a) = 1 and c(g, a) = 0 for any action a ∈ A, which effectively implies that the agent ends its interaction with M once it reaches the goal g. 0 For notational convenience, we use Ps,a and Ps,a,s0 to denote P (·|s, a) and P (s |s, a) respectively. For any 0 P 2 P 2 two S -dimensional vectors X and Y , we use XY to denote s∈S0 X(s)Y (s), X to denote s∈S0 [X(s)] , 0 2 2 and if X is a probability distribution over S , then V(X,Y ) := XY − (XY ) . We also define kXk∞ := 6=g P 6=g P maxs∈S0 |X(s)|, kXk∞ := maxs0∈S |X(s)|, kXk1 := s∈S0 |X(s)|, kXk1 := s∈S |X(s)| = kXk1 − |X(g)|.

Proper policies. A stationary and deterministic policy π : S → A is a mapping that executes action π(s) when the agent is in state s. A policy π is said to be proper if it reaches the goal with probability 1 when starting from any state in S (otherwise it is improper). We denote by Πproper the set of proper, stationary and deterministic policies. As all prior works on regret minimization in SSP, we make the following basic assumption which ensures that the SSP problem is well-posed.

Assumption 1. There exists at least one proper policy, i.e., Πproper 6= ∅.

Q-functions and value functions in SSP. The agent’s objective is to minimize its expected cumulative cost incurred until the goal is reached. The value function (also called cost-to-go) of a policy π is defined as  T  π X V (s) := lim E ct(st, π(st)) s1 = s , T →+∞ t=1 where ct ∈ [0, 1] is the cost incurred at time t at the state-action pair (st, π(st)). The Q-function of policy π is defined for all (s, a) ∈ S × A as  T  π X Q (s, a) := lim E ct(st, π(st)) s1 = s, π(s1) = a . T →+∞ t=1 In both definitions, the expectation is w.r.t. the random sequence of states generated by executing π starting from state s ∈ S (and taking action a ∈ A in the second case). Note that the value function of a policy may possibly be unbounded if it never reaches the goal. For a proper policy π, V π(s) and Qπ(s, a) are finite for any s ∈ S and a ∈ A. We also define V π(g) = Qπ(g, a) = 0 for all policies π and actions a, by virtue of the zero-cost and self-loop property of the goal state g. Finally, we denote by T π(s) the expected time that π takes to reach the goal g starting at state s; in particular, if π is proper then T π(s) is finite for all s, whereas if π is improper there must exist at least one state s such that T π(s) = ∞. Equipped with Asm.1 and an additional condition on improper policies defined below, one can derive important properties on the optimal policy π? that minimizes the value function component-wise. Lemma 2 (Bertsekas and Tsitsiklis, 1991;Yu and Bertsekas, 2013). Suppose that Asm.1 holds and that for 0 every improper policy π0 there exists at least one state s ∈ S such that V π (s) = +∞. Then the optimal policy ? π? is stationary, deterministic, and proper. Moreover, V ? = V π is the unique solution of the optimality equations V ? = LV ? and V ?(s) < +∞ for any s ∈ S, where for any vector V ∈ RS the optimal Bellman operator L is defined as n o LV (s) := min c(s, a) + Ps,aV . (1) a∈A

? Furthermore, the optimal Q-value, denoted by Q? = Qπ , is related to the optimal value function as follows

? ? ? ? Q (s, a) = c(s, a) + Ps,aV ,V (s) = min Q (s, a), (s, a) ∈ S × A. (2) a∈A

Since we will target the best proper policy, we will handle the second requirement of Lem.2 in the following way [Bertsekas and Yu, 2013, Rosenberg et al., 2020]. First, we see that the requirement is in particular verified if all instantaneous costs are strictly positive. To deal with the case of non-negative costs, we can introduce

5 a small additive perturbation η ∈ (0, 1] to all costs to yield a new cost function cη(s, a) = max{c(s, a), η}, which is strictly positive. In this cost-perturbed MDP, the conditions of Lem.2 hold so we obtain an optimal ? ? policy πη that is stationary, deterministic and proper and has a finite value function Vη . Taking the limit as ? ? ? π? ? η → 0, we have that πη → π and Vη → V , where π is the optimal proper policy in the original model ? that is also stationary and deterministic, and V π denotes its value function. This observation enables to circumvent the second condition of Lem.2 and to only require Asm.1 to hold.

Learning formulation. We consider the learning problem where the agent does not have any prior knowledge of the cost function c or transition function P . Each episode starts at the initial state s0 (the extension to any possibly unknown distribution of initial states is straightforward), and each episode ends only when the goal state g is reached (note that this may never happen if the agent does not devise an adequate goal-reaching strategy to terminate the episode). We evaluate the performance of the agent after K episodes by its regret, which captures the difference between its total cost over the K episodes and the total expected cost of the optimal proper policy:

K Ik X X k π RK := ch − K · min V (s0), (3) π∈Πproper k=1 h=1 k k where I is the time it takes the agent to complete episode k and ch is the cost incurred in the h-th step k k k of episode k when the agent visits state-action pair (sh, ah). If there exists k such that I is infinite, we define RK = ∞. ? Throughout the paper, we denote the optimal proper policy by π? and we set V ?(s) := V π (s) = π ? π? π minπ∈Πproper V (s) and Q (s, a) := Q (s, a) = minπ∈Πproper Q (s, a) for all (s, a) ∈ S × A. Let B? > 0 ? ? ? bound the values of V , i.e., B? := maxs∈S V (s). From Eq.2 it is obvious that Q (s, a) ≤ 1 + B?. Finally, π? let T? > 0 bound the expected goal-reaching times of the optimal policy, i.e., T? := maxs∈S T (s). We see that B? ≤ T? < +∞.

3 Algorithm and Analysis Overview

We introduce our algorithm EB-SSP (Exploration Bonus for SSP) in Alg.1. It takes as input the state- action space S × A, the confidence level δ ∈ (0, 1), the number of episodes K. It also considers that an estimate B such that B ≥ B? is available, and we handle in Sect. 4.2 and App.F the case where such an estimate is unavailable (i.e., unknown B?). As explained in Sect.2, the algorithm will enforce the conditions of Lem.2 to hold by adding a small cost perturbation η ∈ [0, 1] (cf. lines3, 12) — either η = 0 if the agent is aware that all costs are already positive, otherwise a careful choice of η > 0 will be provided in Sect.4. Our algorithm builds on a value-optimistic approach, i.e., it sequentially constructs optimistic lower bounds on the optimal Q-function and executes the policy that greedily minimizes these optimistic values. The structure is inspired from the MVP algorithm in Zhang et al.[2020a] designed for finite-horizon RL. In particular, we also adopt the doubling update framework (first introduced by Jaksch et al.[2010]): whenever the number of visits of a state-action pair is doubled, the algorithm updates the empirical cost and transition probability of this state-action pair, as well as computes a new optimistic Q-estimate and optimistic greedy policy. Note that this slightly differs from MVP which waits for the end of the finite-horizon episode to update the policy. In SSP, however, having this delay may yield linear regret as the episode has the risk of never terminating under the current policy (e.g., if it is improper), which is why we perform the policy update instantaneously when the doubling condition is met. The main algorithmic component lies in computing the Q-values (w.r.t. which the policy is greedy) whenever each doubling condition is met. To this purpose, we introduce a procedure called VISGO, for Value Iteration with Slight Goal Optimism. Starting with optimistic values V (0) = 0, VISGO iteratively computes V (i+1) = LeV (i) for a carefully defined operator Le. It terminates when a stopping condition is met, (i+1) (i) specifically once kV − V k∞ ≤ VI for a precision level VI > 0 (that will be specified later), and it outputs the corresponding values V (i+1) (and Q-values Q(i+1)).

6 Algorithm 1: Algorithm EB-SSP

1 Input: S, s0 ∈ S, g 6∈ S, A, δ. 2 Input: an estimate B guaranteeing B ≥ max{B?, 1} (see Sect. 4.2 and App.F if not available). 3 Optional input: cost perturbation η ∈ [0, 1]. √ √ j−1 4 Specify: Trigger set N ← {2 : j = 1, 2,...}. Constants c1 = 6, c2 = 36, c3 = 2 2, c4 = 2 2. 0 0 5 For (s, a, s ) ∈ S × A × S , set 0 N(s, a) ← 0; n(s, a) ← 0; N(s, a, s ) ← 0; Pbs,a,s0 ← 0; θ(s, a) ← 0; bc(s, a) ← 0; Q(s, a) ← 0; V (s) ← 0. 6 Set initial time step t ← 1 and trigger index j ← 0. 7 for episode k = 1, 2,... do 8 Set st ← s0 9 while st 6= g do

10 Take action at = arg maxa∈A Q(st, a), incur cost ct and observe next state st+1 ∼ P (·|st, at). 0 11 Set (s, a, s , c) ← (st, at, st+1, max{ct, η}) and t ← t + 1. 0 0 12 Set N(s, a) ← N(s, a) + 1, θ(s, a) ← θ(s, a) + c, N(s, a, s ) ← N(s, a, s ) + 1. 13 if N(s, a) ∈ N then 14 \\Update triggered: VISGO procedure 15 Set j ← j + 1. 16 \\Compute empirical costs, empirical transitions Pb and slightly skewed transitions Pe 2θ(s,a) 17 Set c(s, a) ← I[N(s, a) ≥ 2] + I[N(s, a) = 1]θ(s, a) and θ(s, a) ← 0. b N(s,a) 0 0 0 18 For all s ∈ S , set Pbs,a,s0 ← N(s, a, s )/N(s, a), n(s, a) ← N(s, a), and

n(s, a) I[s0 = g] Pes,a,s0 ← Pbs,a,s0 + . (4) n(s, a) + 1 n(s, a) + 1

19 \\Update Q −j (0) (−1) 20 Set VI ← 2 and i ← 0, V ← 0, V ← +∞. +  12SAS0[n+(s,a)]2  21 For all (s, a) ∈ S × A, set n (s, a) ← max{n(s, a), 1} and ιs,a ← ln δ . (i) (i−1) 22 while kV − V k∞ > VI do 23 \\Inner iteration of VISGO 24 For all (s, a) ∈ S × A, set s s p 0 n (P ,V (i))ι Bι o c(s, a)ι B S ιs,a b(i+1)(s, a) ← max c V es,a s,a , c s,a + c b s,a + c , 1 n+(s, a) 2 n+(s, a) 3 n+(s, a) 4 n+(s, a) (5) (i+1)  (i) (i+1) Q (s, a) ← max bc(s, a) + Pes,aV − b (s, a), 0 , (6) V (i+1)(s) ← min Q(i+1)(s, a). (7) a

25 Set i ← i + 1. (i) (i) 26 Set Q ← Q , V ← V .

7 We begin by explaining how we design Le and we then provide some intuition. Let Pb, bc be the current empirical transition probabilities and costs, and let n(s, a) be the current number of visits to state-action pair (s, a) (and n+(s, a) = max{n(s, a), 1}). We first define transition probabilities Pe that are slightly skewed towards the goal w.r.t. Pb, as follows n(s, a) I[s0 = g] Pes,a,s0 := Pbs,a,s0 + . (8) n(s, a) + 1 n(s, a) + 1

0 We then define the bonus function, for any vector V ∈ RS and state-action pair (s, a), as s s p 0 n V(Pes,a,V )ιs,a Bιs,a o bc(s, a)ιs,a B S ιs,a b(V, s, a) := max c1 , c2 + c3 + c4 , (9) n+(s, a) n+(s, a) n+(s, a) n+(s, a)

for the estimate B, specific constants c1, c2, c3, c4 and a state-action dependent logarithmic term ιs,a. Note that the last term in Eq.9 accounts for the skewing of Pe w.r.t. Pb (see Lem. 12). Given the transitions Pe and bonus b, we are ready to define the operator Le as  n o  LeV (s) := max min c(s, a) + Pes,aV − b(V, s, a) , 0 . (10) a∈A b

We see that Le instills optimism in two different ways:

(i) On the empirical cost function bc, via the variance-aware exploration bonus b (Eq.9) that intuitively lowers the costs to bc − b; (ii) On the empirical transition function Pb, via the transitions Pe (Eq.8) that slightly bias Pb with the addition of a non-zero probability of reaching the goal from every state-action pair. While the first feature (i) is standard in finite-horizon approaches, the second feature (ii) is SSP-specific and it is required to cope with the fact that the empirical model Pb may not admit any proper policy, meaning that executing value iteration for SSP on Pb may diverge whatever the cost function. Our simple transition skewing in (ii) actually yields the strong guarantee that all policies are proper in Pe, for any bounded cost function.67 Crucially, we design the skewing in (ii) so that this extra goal-reaching probability decays inversely with n(s, a), the number of visits of the state-action pair. This allows to tightly control the gap between Pe and Pb and to ensure that it will only account for a lower-order term in the regret (cf. last term of Eq.9). Note that the model-optimistic algorithms in Tarbouriech et al.[2020a], Rosenberg et al.[2020] induce√ larger deviations between the empirical model and their optimistic models, which leads to an additional S factor in their regret bounds.

Equipped with these two sources of optimism, as long as B ≥ B?, we are able to prove that any VISGO procedure verifies the following two key properties: (1) Optimism: VISGO outputs an optimistic estimator of the optimal Q-function at each iteration step, i.e., Q(i)(s, a) ≤ Q?(s, a), ∀i ≥ 0,

(2) Finite-time near-convergence: VISGO terminates within a finite number of iteration steps (note (j) that the final iterate V approximates the fixed point of Le up to an error scaling with VI).

6In fact this transition skewing also implies that the SSP problem defined on Pe becomes equivalent to a discounted RL problem, with a varying state-action dependent discount factor (cf. Remark2 later). 7Also note that for different albeit mildly related purposes, a perturbation trick is sometimes used in regret minimization for average-reward MDPs [e.g., Fruit et al., 2018, Qian et al., 2019], where a non-zero probability of reaching an arbitrary state is added at each state-action pair to guarantee that all policies are unichain and that variants of value iteration (model-optimistic or value-optimistic) are nearly convergent in finite-time.

8 Having variance-aware exploration bonuses introduces a subtle correlation between consecutive value iterates (i.e., b depends on V in Eq.9), as in the finite-horizon case [Zhang et al., 2020a]. To satisfy (1), we derive similarly to MVP a monotonicity property of the operator Le, which can be achieved by carefully tuning the constants c1, c2, c3, c4 in the exploration bonus of Eq.9. On the other hand, the second requirement (2) is SSP-specific; in particular it is not needed in finite-horizon where value iteration always requires exactly H backward induction steps. We establish that the operator Le also satisfies a contraction property, which guarantees a polynomially bounded number of iterations before terminating, i.e., (2).

Regret analysis. Besides ensuring the computational efficiency of EB-SSP, the properties of VISGO lay the foundations for our regret analysis. In particular, we show that the recursion-based methods introduced by Zhang et al.[2020a] in finite-horizon can be adapted to the more general SSP setting. While some differences are quite straightforward to handle — e.g., the mismatch in the regret definitions between SSP

and finite-horizon, or the need in SSP to control the sequence of the VISGO errors VI — we now point out two main challenges that appear. • Challenge 1: disparity between episodes. In finite-horizon MDPs, the episodes always yield a cumulative reward that is upper bounded by H, or by 1 under the setting of Zhang et al.[2020a]. As such, their analysis is naturally episode-decomposed. However, in SSP, episodes may display very different behavior, as some episodes may incur much higher costs than others (especially at the beginning of learning). To deal with this, we decompose consecutive time steps into so-called intervals and we instead perform an interval-decomposed analysis. Note that the interval decomposition is not performed algorithmically, but only at the analysis level. Similarly to Rosenberg et al.[2020], we construct our intervals so that the cumulative cost within an interval is always upper bounded by B? + 1. This interestingly mirrors the finite-horizon assumption of the cumulative rewards within an episode that are bounded by 1 [Zhang et al., 2020a]. Our analysis will need to perform some sort of rescaling in 1/B? and consider normalized value functions (with range always upper bounded by 1), in order to avoid an exponential blow-up in the recursions. We stress that we do not make any extra assumption on the SSP problem w.r.t. the cumulative cost of the trajectories (which may even be infinite if the agent never reaches the goal). This differs from Zhang et al.[2020a] whose finite-horizon analysis relies on the assumption of the cumulative reward of any trajectory being upper bounded by 1.

• Challenge 2: unknown range of the optimal value function. In finite-horizon MDPs, both the horizon H and the bound on the cumulative reward (H, or 1 for Zhang et al.[2020a]) are assumed to be known to the agent. This prior knowledge is key to tune exploration bonuses that yield valid confidence intervals and thus guarantee optimism. In SSP, if the agent does not have a valid estimate B ≥ B?, then it may design an under-specified exploration bonus which cannot guarantee optimism. The case of unknown B? is non-trivial: it appears impossible to properly estimate B? (since some states may never be visited) and it is unclear how a standard doubling trick may be leveraged.8 In Sect. 4.2 and App.F, we propose a new scheme that circumvents the knowledge of B? at the cost of only logarithmic and lower-order contributions to the regret. Essentially, the agent starts with estimate Be = 1 and tracks both the cumulative costs and the range of the value function computed by VISGO. The agent carefully increments Be in two possible ways (each with a different speed): either whenever one of the two tracked quantities exceeds some computable threshold, or whenever a new episode begins. Our analysis then boils down to the following argument: if Be < B? we bound the regret by the cumulative cost threshold, otherwise optimism holds and we can recover the regret bound under a known upper bound on B?.

8Note, for instance, that the exploration-bonus-based approach of Qian et al.[2019] in average-reward MDPs requires prior knowledge of an upper bound on the optimal bias span.

9 4 Main Results

We first establish the following guarantee on which all our subsequent regret bounds will rely (Sect.5 is devoted to its proof).

Theorem 3. Assume that B ≥ max{B?, 1} and that the conditions of Lem.2 hold. Then with probability at least 1 − δ the regret of EB-SSP (Alg.1 with η = 0) can be bounded by

 max{B , 1}SAT  max{B , 1}SAT  R = O p(B2 + B )SAK log ? + BS2A log2 ? , K ? ? δ δ where the random variable T denotes the accumulated time within the K episodes. In the case where B? ≥ 1, we can simplify the above bound as

 √ B SAT  B SAT  R = O B SAK log ? + BS2A log2 ? . K ? δ δ

The statement of Thm.3 depends on the estimate B and more importantly on T (in its log terms), which is the total time accumulated within the K episodes and which is a random (and possibly unbounded) quantity. Recall that throughout we consider that Asm.1 holds. We now proceed as follows:

• In Sect. 4.1 we assume that B = B? (i.e., the agent has prior knowledge of B?) and we focus on bounding T . We instantiate several regret bounds depending on the nature of the SSP problem at hand.

• In Sect. 4.2 we handle the case of unknown B? and show how all the regret bounds of Sect. 4.1 remain valid up to logarithmic and lower-order terms.

Throughout Sect.4, we consider for ease of exposition that B? ≥ 1, otherwise all the following bounds hold by replacing the occurrences√ of B? with max{B?, 1}, except for the B? multiplicative factor in the leading term 9 which becomes B?. Moreover, for simplicity (when tuning the subsequent cost perturbations) we assume as in prior works [e.g., Rosenberg et al., 2020, Chen et al., 2020, Chen and Luo, 2021] that the total number of episodes K is known to the agent (this knowledge can be eliminated with the standard doubling trick).

4.1 Regret Bounds for B = B?

First we assume that B = B? (i.e., the agent has prior knowledge of B?) and we analyze the regret achieved by EB-SSP under various conditions on the SSP model.

4.1.1 Positive Costs

Assumption 4. All costs are lower bounded by a constant cmin > 0 which may be unknown to the agent.

We first focus on the case of positive costs. Asm.4 guarantees that the conditions of Lem.2 hold. Moreover, denoting by C the total cumulative cost,√ we have that the total time satisfies T ≤ C/cmin. By simplifying 2 p 3 5 3 2 the bound of Thm.3 as C = O(B?S A K · B?T SA/δ), we loosely obtain that T = O(B?S A K/(cminδ)). Plugging this in Thm.3 yields the following result.

Corollary 5. Under Asm.4, running EB-SSP (Alg.1) with B = B? and η = 0 gives the following regret bound with probability at least 1 − δ  √     KB?SA 2 2 KB?SA RK = O B? SAK log + B?S A log . cminδ cminδ √ 9 When√ B? < 1, Thm.3 has a leading term scaling in Oe( B?SAK). While this shows a mismatch with the lower bound Ω(B? SAK) of Rosenberg et al.[2020], we notice that this lower√ bound assumes that B? ≥ 2. It is thus unclear what the dependence in B? of the lower bound should look like (either B? or B?) in the uncommon corner case of B? < 1.

10 The bound of Cor.5 only depends polynomially on K, S, A, B?. We note that T? ≤ B?/cmin and that this upper bound only appears in the logarithms. Under positive costs, the regret of EB-SSP is thus (nearly) minimax and horizon-free. Furthermore, in App.B we introduce an alternative assumption on the SSP problem (which is weaker than Asm.4) that considers that there are no almost-sure zero-cost cycles. We show that in this case also, the regret of EB-SSP is (nearly) minimax and horizon-free.

4.1.2 General Costs and T? Unknown Now we handle the general case of non-negative costs, with no assumption other than Asm.1. We use a cost perturbation argument to generalize the results from the cmin > 0 case to general costs (as also done in Tarbouriech et al.[2020a], Rosenberg et al.[2020], Rosenberg and Mansour[2020]). As reviewed in Sect.2, this enables to circumvent the second condition of Lem.2 (which holds in the cost-perturbed MDP) and to target the optimal proper policy in the original MDP up to a bias scaling with the cost perturbation. Specifically, if EB-SSP is run with costs cη(s, a) ← max{c(s, a), η} for any η ∈ (0, 1], then we recover the regret bound of Cor.5 with cmin ← η and B? ← B? + ηT?, as well as an additive bias of ηT?K. By picking η to balance these terms, we obtain the following.

−n Corollary 6. Running EB-SSP (Alg.1) with B = B? and η = K for any choice of constant n > 1 gives the following regret bound with probability at least 1 − δ √ ! √ T nT SAL R = O nB SAKL + ? + ? + n2B S2AL2 , K ? n−1 n− 1 ? K K 2

KT?SA  with the logarithmic term L := log δ . √ The bound of Cor.6 can be decomposed as (i) the minimax rate in K (up to a logarithmic term depending on T?), and (ii) an additive term that depends on T? and that vanishes as K → +∞ (we omit the last lower-order term that does not depend polynomially on either K or T?). Note that the second term (ii) can be made as small as possible by increasing the choice of exponent n in the cost perturbation, at the cost of the multiplicative constant n in (i). Equipped only with Asm.1, the regret of EB-SSP is thus (nearly) minimax, and it may be dubbed as horizon-vanishing when K is given in advance, in the sense that it contains an additive term that depends on T? and that becomes negligible for large values of K (if K is unknown in advance, the application of the doubling trick yields an additive term (ii) scaling as T?). We now show that the trade-off between minimizing the terms (i) and (ii) can be resolved with loose knowledge of T? and leads to a horizon-free bound.

4.1.3 General Costs and Order-Accurate Estimate of T? Available

T? ζ Assumption 7. The agent has prior knowledge of a quantity T ? that verifies υ ≤ T ? ≤ λT? for some positive unknown constants υ, λ, ζ ≥ 1.

Note that Asm.7 only requires to have access to an order-accurate estimate of T?. Indeed, T ? may be a constant lower-bound approximation away from T?, or a polynomial upper-bound approximation away from T?. We can now tune the cost perturbation of Sect. 4.1.2 using T ?. In particular, selecting as cost −1 perturbation η := (T ?K) ensures that the bias satisfies ηT?K ≤ υ = O(1), thus yielding the following.

−1 Corollary 8. Under Asm.7, running EB-SSP (Alg.1) with B = B? and η = (T ?K) gives the following regret bound with probability at least 1 − δ

 √ KT SA KT SA R = O B SAK log ? + B S2A log2 ? . K ? δ ? δ

11 The bound of Cor.8 depends polynomially on K, S, A, B?, and only logarithmically on T?. This means that in the general cost case under an order-accurate estimate of T?, the regret of EB-SSP is (nearly) minimax and horizon-free. We can compare√ Cor.8 with the concurrent result of Cohen et al.[2021]. Indeed, their regret bound 4 2 5 −1 scales as O(B? SAKL + T? S AL ) with L = log(KT?SAδ ) under the assumption of known T? and B? (or tight upper bounds of them), which implies that the conditions of Cor.8 hold. The bound of Cor.8 is strictly tighter, since it always holds that B? ≤ T? and the gap between the two may be arbitrarily large (see e.g., App.A), especially when some instantaneous costs are very small.

4.2 Regret Bounds for Unknown B?

We now relax the assumption that B? is known to the learner. In App.F we derive a parameter-free variant of EB-SSP that bypasses the requirement B ≥ B? (line2 of Alg.1) when tuning the exploration bonus. Specifically, it initializes an estimate Be = 1 and increments it with two different speeds. On the one hand, it tracks both the cumulative cost and the range of the value function computed by VISGO, and it doubles Be (i.e., Be ← 2Be) whenever either exceeds computable thresholds (that depend on Be, the number of episodes elapsed so far and other computable terms). We ensure that this only happens when Be < B? and we can bound the regret contribution by the cumulative cost threshold. On the other hand,√ the algorithm also increases the Be estimate whenever a new episode begins by setting Be ← max{B,e k/(S3/2A1/2)}, with k the current episode index. This ensures that at some point Be ≥ B? and thus we can bound the regret thanks to Thm.3. Parameter-free EB-SSP satisfies the following guarantee (which extends Thm.3 to unknown B?). Theorem 9. Assume that the conditions of Lem.2 hold. Then with probability at least 1 − δ the regret of parameter-free EB-SSP (Alg.3) can be bounded by  √ B SAT  B SAT  R = O R? + SAK log2 ? + B3S3A log3 ? K K δ ? δ  B SAT  B SAT  ≤ O R? log ? + B3S3A log3 ? , K δ ? δ

? where T is the cumulative time within the K episodes and RK denotes the regret after K episodes of EB-SSP in the case of known B? (i.e., the bound of Thm.3 with B = B?).

This result implies that we can effectively remove the condition of B ≥ max{B?, 1} in Thm.3, i.e., we make the statement parameter-free. Hence, all the regret bounds derived in Sect. 4.1 in the case of known B? (i.e., Cor.5,6,8, 20) still hold up to additional logarithmic and lower-order terms when B? is unknown.

5 Proof of Theorem3

In this section, we present the proof of Thm.3. We recall that throughout Sect.5 we analyze Alg.1 without cost perturbation (i.e., η = 0) and we assume that 1) the estimate verifies B ≥ max{B?, 1} and 2) the conditions of Lem.2 hold.

5.1 High-Probability Event

Definition 10 (High-probability event). We define the event E := E1 ∩ E2 ∩ E3, where  s  ?  ? V(Pbs,a,V )ιs,a 14B?ιs,a  E1 := ∀(s, a) ∈ S × A, ∀n(s, a) ≥ 1 : |(Pbs,a − Ps,a)V | ≤ 2 + , (11)  n(s, a) 3n(s, a)  ( s ) 2bc(s, a)ιs,a 28ιs,a E2 := ∀(s, a) ∈ S × A, ∀n(s, a) ≥ 1 : |c(s, a) − c(s, a)| ≤ 2 + , (12) b n(s, a) 3n(s, a)

12 ( s ) 0 0 2Ps,a,s0 ιs,a ιs,a E3 := ∀(s, a, s ) ∈ S × A × S , ∀n(s, a) ≥ 1 : |Ps,a,s0 − Pbs,a,s0 | ≤ + , (13) n(s, a) n(s, a)

 12SAS0[n+(s,a)]2  where ιs,a := ln δ .

Lemma 11. It holds that P(E) ≥ 1 − δ.

Proof. The events E1 and E2 hold with probability at least 1 − 2δ/3 by the concentration inequality of Lem. 22 and by union bound over all (s, a) ∈ S × A. The event E3 holds with probability at least 1 − δ/3 by Bennett’s inequality (Lem. 21, anytime version), by Lem. 27 and by union bound over all (s, a, s0) ∈ S × A × S0.

5.2 Analysis of a VISGO Procedure

A VISGO procedure in Alg.1 computes iterates of the form V (i+1) = LeV (i), where Le is an operator defined S for any U ∈ R and s ∈ S as LeU(s) := mina∈A LeU(s, a), where   s   s p 0   (P ,U)ι Bι  c(s, a)ι B S ιs,a  LU(s, a) := max c(s, a) + P U − max c V es,a s,a , c s,a − c b s,a − c , 0 . (14) e b es,a 1 + 2 + 3 + 4 +   n (s, a) n (s, a)  n (s, a) n (s, a) 

Starting from an optimistic initialization V (0) = 0 at each state, we will show the following two properties: • Optimism: with high probability, Q(i)(s, a) ≤ Q?(s, a), ∀i ≥ 0;

• Finite-time near-convergence: Given any error VI > 0, the procedure stops at a finite iteration j such (j) (j−1) (j) that kV − V k∞ ≤ VI, which implies that the vector V verifies some fixed point equation for Le

up to an error scaling with VI.

5.2.1 Properties of the slightly skewed transitions Pe

Lem. 12 shows that the bias introduced by replacing Pbs,a with Pes,a decays inversely with n(s, a), the number of visits to state-action pair (s, a).

0 Lemma 12. For any non-negative vector U ∈ RS such that U(g) = 0, for any (s, a) ∈ S × A, it holds that

2 0 kUk∞ 2kUk∞S Pes,aU ≤ Pbs,aU ≤ Pes,aU + , (Pes,a,U) − (Pbs,a,U) ≤ . n(s, a) + 1 V V n(s, a) + 1

Denote by ν the probability of reaching the goal from any state-action pair in Pe, i.e.,

νs,a := Pes,a,g, ν := min νs,a. (15) s,a

By construction of Pe, the quantity ν is strictly positive. This immediately implies the following result. Lemma 13. In the SSP-MDP associated to Pe with any bounded cost function, all policies are proper. Remark 2 (Mapping to a discounted problem). In an SSP problem with only proper policies, the (optimal) Bellman operator is usually contractive only w.r.t. a weighted-sup norm [Bertsekas, 1995]. Here, the con- struction of Pe entails that any SSP defined on it with fixed bounded costs has a (optimal) Bellman operator that is a sup-norm contraction. In fact, the SSP problem on Pe can be cast as a discounted problem with a (state-action dependent) discount factor γs,a := 1 − νs,a < 1 (we recall that discounted MDPs are a subclass of SSP-MDPs). Intuitively, at insufficiently visited state-action pairs, the agent behaves optimistically which increases the chance of reaching the goal and terminating the trajectory. Equivalently, we can interpret the agent as being uncertain about its future predictions and it is thus encouraged to act more myopically, which is connected to lowering the discount factor in the discounted RL setting.

13 5.2.2 Important auxiliary function f and its properties Lem. 14 examines an auxiliary function f that plays a key role in the analysis. Indeed, we see that an instantiation of f surfaces in the definition of the operator Le (Eq. 14). While the first property (monotonicity) is similar to the one required in Zhang et al.[2020a], the third property (contraction) is SSP-specific and is crucial to guarantee the (finite-time) near-convergence of a VISGO procedure. S0 S0 Lemma 14. Let Υ := {v ∈ R : v ≥ 0, v(g) = 0, kvk∞ ≤ B}. Let f : ∆ × Υ × R × R × R → R with n q o V(p,v)ι Bι f(p, v, n, B, ι) := pv − max c1 n , c2 n , with c1 = 6 and c2 = 36 (here taking any pair of constants 2 S0 such that c1 = c2 works). Then f satisfies, for all p ∈ ∆ , v ∈ Υ and n, ι > 0, 1. f(p, v, n, B, ι) is non-decreasing in v(s), i.e.,

∀(v, v0) ∈ Υ2, v ≤ v0 =⇒ f(p, v, n, B, ι) ≤ f(p, v0, n, B, ι);

q q c1 V(p,v)ι c2 Bι V(p,v)ι Bι 2. f(p, v, n, B, ι) ≤ pv − 2 n − 2 n ≤ pv − 2 n − 14 n ;

6=g 6=g 2 3. If kpk∞ < 1, then f(p, v, n, B, ι) is ρp-contractive in v(s), with ρp := 1 − (1 − kpk∞ ) < 1, i.e.,

0 2 0 0 ∀(v, v ) ∈ Υ , |f(p, v, n, B, ι) − f(p, v , n, B, ι)| ≤ ρpkv − v k∞.

5.2.3 Optimism of VISGO We now show that with the bonus defined in Eq.5, the Q-function is always optimistic with high probability. Lemma 15. Conditioned on the event E, for any output Q of the VISGO procedure (line 26 of Alg.1) and for any state-action pair (s, a) ∈ S × A, it holds that

Q(s, a) ≤ Q?(s, a).

Proof idea. We prove the result by induction on the inner iterations i of VISGO, i.e., Q(i)(s, a) ≤ Q?(s, a). We use the update of the Q-value (line6), Lem. 12, the definition of event E combined with the fact that (i) + B ≥ B?, as well as the first two properties of Lem. 14 applied to f(Pes,a,V , n (s, a), B, ιs,a).

5.2.4 Finite-time near-convergence of VISGO Warm-up: convergence with no bonuses. For the sake of discussion, let us first examine an idealized case where n(s, a) → +∞ for all (s, a), which means b(s, a) = 0 for all (s, a). In that case, the iterates verify (i+1) ? (i) ?  S ? V = Le V , where Le U(s) := mina c(s, a) + Pes,aU , ∀U ∈ R , s ∈ S. Thus Le is the optimal Bellman operator of the SSP instance Mf with transitions Pe and cost function c. From Lem. 13, all policies are proper in Mf. As a result, the operator Le? is contractive (cf. Remark2) and convergent [Bertsekas, 1995]. Convergence with bonuses. In VISGO, however, we must account for the bonuses b(s, a). Setting aside the truncation of each iterate V (i) (i.e., the lower bounding by 0), we notice that a update for V (i+1) can be interpreted as the (truncated) Bellman operator of an SSP problem with cost function c(s, a) − b(i+1)(s, a). However, b(i+1)(s, a) depends on V (i), the previous iterate. This dependence means that the cost function is no longer fixed and the reasoning from the previous paragraph no longer holds. As a result, we directly analyze the properties of the operator Le that defines the sequence of iterates V (i+1) = LeV (i) in VISGO (Eq. 14). (i) Lemma 16. The sequence (V )i≥0 is non-decreasing. Combining this with the fact that it is upper bounded by V ? from Lem. 15, the sequence must converge.

While Lem. 16 states that Le ultimately converges starting from a vector of zeros, the following result guarantees that it can approximate in finite time its fixed point within any (arbitrarily small) positive component-wise accuracy.

14 Lemma 17. Le is a ρ-contractive operator with modulus ρ := 1 − ν2 < 1, where ν is defined in Eq. 15.

6=g Proof idea. We have that, for any state-action pair (s, a), kPes,ak∞ ≤ 1 − νs,a < 1 from Eq. 15. Hence we can (i) + apply the third property (contraction) of Lem. 14 to f(Pes,a,V , n (s, a), B, ιs,a). Taking the maximum over (s, a) pairs yields the contraction property of Le.

(j+1) (j) log(max{B?,1}/VI) Remark 3. Lem. 17 guarantees that kV − V k∞ ≤ VI for j ≥ 1−ρ , which yields the desired property of finite-time near-convergence of VISGO (i.e., it always stops at a finite iteration j). −1 2 1 2 Moreover, we have VI = Ω(T ), the (possibly loose) lower bound 1−ρ = ν ≥ ( T +1 ) , and there are at most O(SA log T ) VISGO procedures in total, thus we see that EB-SSP has a polynomially bounded computational complexity. In addition, we show in App.C that a trivial modification of the constant c2 in the bonus enables to tighten the analysis of the contraction modulus ρ and thus get a total complexity complexity in Oe(T ) (instead of Oe(T 2)), only at the cost of an extra logarithmic term in T in the lower-order term of the regret.

5.3 Interval Decomposition and Notation Interval decomposition. In the analysis we split the time steps into intervals. The first interval begins at the first time step, and an interval ends once: (1) the goal state g is reached; (2) the trigger condition holds (i.e., the visit to a state-action pair is doubled); (3) or the cost in the interval accumulates to at least B?. We see that an update is triggered (line 13 of Alg.1) whenever condition (2) is met. Note that this interval decomposition is similar to the Bernstein-based one of Rosenberg et al.[2020], to the exception that we remove their condition of ending an interval once an “unknown” (i.e., insufficiently visited) state is reached. −1 This will be key to avoid any polynomial dependence on cmin in the regret (when cmin > 0, which then introduces a T? polynomial dependence following a cost perturbation argument).

Notation. We index intervals by m = 1, 2,... and the length of interval m is denoted by Hm (it is bounded m m m m m m almost surely). The trajectory visited in interval m is denoted by U = (s1 , a1 , . . . , sHm , aHm , sHm+1), m m where ah is the action taken in state sh . The concatenation of the trajectories of the intervals up to and m m Sm m0 m including interval m is denoted by U , i.e., U = m0=1 U . Moreover, ch denotes the cost in the h-th m m m m m step of interval m. We use the notation Q (s, a), V (s), Pbs,a, Pes,a and VI to denote the values (computed Q(s, a) V (s) P P  m nm(s, a) cm(s, a) in lines 14-26) of , , bs,a, es,a and VI in the beginning of interval . Let and b m denote the values of max{n(s, a), 1} and bc(s, a) used for computing Q (s, a). Finally, we set  s  s  m  m BpS0ι m V(Pes,a,V )ιs,a Bιs,a bc (s, a)ιs,a s,a b (s, a) := max c1 m , c2 m + c3 m + c4 m .  n (s, a) n (s, a) n (s, a) n (s, a)

m Note that by condition (3) in the interval decomposition and by ch ∈ [0, 1], it holds that, for any interval m, Hm X m ch ≤ B? + 1. (16) h=1

5.4 Bounding the Bellman Error Lemma 18. Conditioned on the event E, for any interval m and state-action pair (s, a) ∈ S × A,

m m  m |c(s, a) + Ps,aV − Q (s, a)| ≤ min β (s, a),B? + 1 , where we define s s 2 (P ,V ?)ι 2S0 (P ,V ? − V m)ι 3B S0ι  q  βm(s, a) := 4bm(s, a) + V s,a s,a + V s,a s,a + ? s,a + 1 + c ι /2 m . nm(s, a) nm(s, a) nm(s, a) 1 s,a VI

15 m Proof idea. We use that V approximates the fixed point of Le up to an error scaling with VI. We end up m m m ? m ? bounding the difference Ps,aV − Pes,aV ≤ (Pbs,a − Pes,a)V + (Ps,a − Pbs,a)V + (Ps,a − Pbs,a)(V − V ), where the first term is bounded by Lem. 12 and 15, while the second and third terms are bounded using the definition of the event E.

5.5 Regret Decomposition We assume that the event E defined in Def. 10 holds. In particular it guarantees that Lem. 15 and Lem. 18 hold for all intervals m simultaneously. We denote by M the total number of intervals in which the first K episodes elapse. For any M 0 ≤ M, 0 0 we denote by M0(M ) the set of intervals among the first M intervals that constitute the first intervals in each episode (i.e., either it is the first interval or its previous interval ended in the goal state). We also 0 P PM m K 0 := 0 1 T 0 := H K T K T denote by M m∈M0(M ) and M m=1 . Note that and are equivalent to M and M , respectively. 0 m PM PH m ? Instead of bounding the regret RK from Eq.3, we bound ReM 0 := m=1 h=1 ch − KM 0 V (s0) for any 0 fixed choice of M ≤ M, as done in Rosenberg et al.[2020]. We see that ReM 0 = RM 0 for any number of 0 0 intervals M (RM 0 is the true regret within the first M intervals). To derive Thm.3, we will show that M is finite and instantiate M 0 = M. In the following we do the analysis for arbitrary M 0 ≤ M as it will be useful for the parameter-free case studied in App.F (i.e., when no estimate B ≥ B? is available). We decompose ReM 0 as follows

M 0 Hm (i) X X m X m ReM 0 ≤ ch − V (s0), 0 m=1 h=1 m∈M0(M ) M 0 Hm M 0 Hm ! (ii) X X m X X m m m m m ≤ ch + V (sh+1) − V (sh ) + 2SA log2(TM 0 ) max kV k∞ 1≤m≤M 0 m=1 h=1 m=1 h=1 M 0 Hm M 0 Hm (iii) X X m m m m  X X m m m ≤ c + P m m V − V (s ) + V (s ) − P m m V + 2B SA log (T 0 ) h sh ,ah h h+1 sh ,ah ? 2 M m=1 h=1 m=1 h=1 M 0 Hm M 0 Hm M 0 Hm (iv) X X m m m X X m m m X X m m m ≤ V (s ) − P m m V + β (s , a ) + c − c(s , a ) +2B SA log (T 0 ), h+1 sh ,ah h h h h h ? 2 M m=1 h=1 m=1 h=1 m=1 h=1

| {z 0 } | {z 0 } | {z 0 } :=X1(M ) :=X2(M ) :=X3(M ) where (i) uses the optimism property of Lem. 15, (ii) stems from the construction of intervals (Lem. 31), m (iii) uses that max1≤m≤M 0 kV k∞ ≤ B? (from Lem. 15), and (iv) comes from Lem. 18. We now focus on 0 0 0 bounding the terms X1(M ), X2(M ) and X3(M ). To this end, we introduce the following useful quantities

M 0 Hm M 0 Hm 0 X X m 0 X X ? m X (M ) := (P m m ,V ),X (M ) := (P m m ,V − V ). 4 V sh ,ah 5 V sh ,ah m=1 h=1 m=1 h=1

0 5.5.1 The X1(M ) term 0 X1(M ) could be viewed as a martingale, so by taking c = max{B?, 1} in the technical Lem. 25, we have with probability at least 1 − δ,

0 p 0 2 |X1(M )| ≤ 2 2X4(M )(log2((max{B?, 1}) TM 0 ) + ln(2/δ)) 2 + 5(max{B?, 1})(log2((max{B?, 1}) TM 0 ) + ln(2/δ)).

0 0 To bound X1(M ), we only need to bound X4(M ).

16 0 5.5.2 The X3(M ) term Taking c = 1 in the technical Lem. 25, we have  v  u M 0 Hm 0 u X X m m P|X3(M )| ≥ 2t2 Var(sh , ah )(log2(TM 0 ) + ln(2/δ)) + 5(log2(TM 0 ) + ln(2/δ)) ≤ δ, m=1 h=1

2 where Var(st, at) := E[(ct − c(st, at)) ] (ct denotes the cost incurred at time step t). By Lem. 27 and Eq. 16,

M 0 Hm M 0 Hm M 0 Hm X X m m X X m m X X m m m 0 0 0 Var(sh , ah ) ≤ c(sh , ah ) ≤ (c(sh , ah ) − ch ) + M (B? + 1) ≤ |X3(M )| + M (B? + 1). m=1 h=1 m=1 h=1 m=1 h=1

Therefore we have h i 0 p 0 0 P |X3(M )| ≥ 2 2(|X3(M )| + M (B? + 1))(log2(TM 0 ) + ln(2/δ)) + 5(log2(TM 0 ) + ln(2/δ)) ≤ δ,   0 p 0 which implies that |X3(M )| ≤ O log2(TM 0 ) + ln(2/δ) + (B? + 1)M (log2(TM 0 ) + ln(2/δ)) with probabil- ity at least 1 − δ.

0 5.5.3 The X2(M ) term 0 The full proof of the bound on X2(M ) is deferred to App. E.3. Here we provide a brief sketch. First, we bound βm and apply a pigeonhole principle to obtain

$$\begin{aligned}
X_2(M') \le O\Bigg( &\sqrt{SA\log_2(T_{M'})\,\iota_{M'}\,X_4(M')} + \sqrt{S^2A\log_2(T_{M'})\,\iota_{M'}\,X_5(M')} + \sqrt{SA\log_2(T_{M'})\,\iota_{M'}\sum_{m=1}^{M'}\sum_{h=1}^{H^m}\widehat{c}^{\,m}(s_h^m,a_h^m)} \\
&+ B_\star S^2A\log_2(T_{M'}) + BS^{3/2}A\log_2(T_{M'})\,\iota_{M'} + \sum_{m=1}^{M'}\sum_{h=1}^{H^m}\big(1 + c_1\sqrt{\iota_{M'}/2}\big)\epsilon_{\mathrm{VI}}^m \Bigg),
\end{aligned}$$

with the logarithmic term $\iota_{M'} := \ln\Big(\frac{12SAS'T_{M'}^2}{\delta}\Big)$, which is the upper bound of $\iota_{s,a}$ when considering only time steps in the first $M'$ intervals. The regret contributions of the estimated costs and the VISGO precision errors are respectively

$$\sum_{m=1}^{M'}\sum_{h=1}^{H^m}\widehat{c}^{\,m}(s_h^m,a_h^m) \le 2SA\big(\log_2(T_{M'})+1\big) + 2M'(B_\star+1), \qquad \sum_{m=1}^{M'}\sum_{h=1}^{H^m}\big(1 + c_1\sqrt{\iota_{M'}/2}\big)\epsilon_{\mathrm{VI}}^m = O\big(SA\log_2(T_{M'})\sqrt{\iota_{M'}}\big).$$

To bound $X_4(M')$ and $X_5(M')$, we perform a recursion-based analysis on the value functions normalized by $1/B_\star$. As explained in Sect. 3, we split the analysis on the intervals, and not on the episodes as done in Zhang et al. [2020a]. In Lem. 33 and 34 we establish that with overwhelming probability,

$$X_4(M') \le O\Big( (B_\star^2 + B_\star)\big(M' + \log_2(T_{M'}) + \ln(2/\delta)\big) + B_\star X_2(M') \Big), \qquad X_5(M') \le O\Big( B_\star^2\big(\log_2(T_{M'}) + \ln(2/\delta)\big) + B_\star X_2(M') \Big).$$

As a result, we obtain
$$\begin{aligned}
X_2(M') &\le O\Big( \sqrt{SA\,X_4(M')\,\iota_{M'}} + \sqrt{S^2A\,X_5(M')\,\iota_{M'}} + \sqrt{(B_\star+1)SAM'\iota_{M'}} + B_\star S^2A\iota_{M'}^2 + BS^{3/2}A\iota_{M'}^2 \Big), \\
X_4(M') &\le O\Big( (B_\star^2 + B_\star)(M' + \iota_{M'}) + B_\star X_2(M') \Big), \\
X_5(M') &\le O\Big( B_\star^2\iota_{M'} + B_\star X_2(M') \Big),
\end{aligned}$$
with the logarithmic term $\iota_{M'} := \ln\Big(\frac{12SAS'T_{M'}^2}{\delta}\Big) + \log_2\big((\max\{B_\star,1\})^2 T_{M'}\big) + \ln\Big(\frac{2}{\delta}\Big)$. Isolating the $X_2(M')$ term finally yields

$$X_2(M') \le O\Big( \sqrt{(B_\star^2 + B_\star)SAM'\iota_{M'}} + BS^2A\iota_{M'}^2 \Big).$$

5.5.4 Putting Everything Together

Ultimately, with probability at least $1 - 6\delta$ we have

$$\widetilde{R}_{M'} \le X_1(M') + X_2(M') + X_3(M') + 2B_\star SA\log_2(T_{M'}) \le O\Big( \sqrt{(B_\star^2 + B_\star)SAM'\iota_{M'}} + BS^2A\iota_{M'}^2 \Big).$$

It remains to bound $M'$, the number of intervals. Denote by $C_{M'}$ the cumulative cost over the first $M'$ intervals and by $K_{M'}$ the number of episodes started over the first $M'$ intervals. The construction of intervals given in Sect. 5.3 entails that

$$M' \le \frac{C_{M'}}{B_\star} + K_{M'} + 2SA\log_2(T_{M'}). \tag{17}$$

Noting that $\widetilde{R}_{M'} = C_{M'} - K_{M'}V^\star(s_0)$, we have

$$\begin{aligned}
C_{M'} &\le K_{M'}V^\star(s_0) + O\Big( \sqrt{(B_\star^2+B_\star)SAK_{M'}\iota_{M'}} + BS^2A\iota_{M'}^2 + \sqrt{(B_\star+1)SA\,C_{M'}\,\iota_{M'}} \Big) \\
&\overset{(i)}{\le} \left( O\Big(\sqrt{(B_\star+1)SA\iota_{M'}}\Big) + \sqrt{K_{M'}V^\star(s_0) + O\Big( \sqrt{(B_\star^2+B_\star)SAK_{M'}\iota_{M'}} + BS^2A\iota_{M'}^2 \Big)} \right)^2 \\
&\le K_{M'}V^\star(s_0) + O\Big( BS^2A\iota_{M'}^2 + \sqrt{(B_\star^2+B_\star)SAK_{M'}\iota_{M'}} \Big),
\end{aligned}$$

where (i) uses Lem. 29 and $V^\star(s_0) \le B_\star$. Hence
$$\widetilde{R}_{M'} \le O\Big( \sqrt{(B_\star^2+B_\star)SAK_{M'}\iota_{M'}} + BS^2A\iota_{M'}^2 \Big).$$

By scaling $\delta \leftarrow \delta/6$ we have the following important bound
$$\widetilde{R}_{M'} \le O\left( \sqrt{(B_\star^2+B_\star)SAK_{M'}}\,\log\left(\frac{\max\{B_\star,1\}SAT_{M'}}{\delta}\right) + BS^2A\log^2\left(\frac{\max\{B_\star,1\}SAT_{M'}}{\delta}\right) \right). \tag{18}$$

The proof of Thm.3 is concluded by taking M 0 = M, where M denotes the number of intervals in which the first K episodes elapse.

6 Conclusion

In this paper, we introduced EB-SSP, an algorithm for online SSP with the two following important properties. 1) It is the first algorithm that is simultaneously nearly minimax-optimal and parameter-free (i.e., it does not require knowledge of $T_\star$ or $B_\star$). 2) In various cases its regret can be coined "horizon-free" since it only has a logarithmic dependence on $T_\star$, thus exponentially improving over existing bounds in terms of the dependence on $T_\star$, which may be arbitrarily larger than $B_\star$ in SSP models with small instantaneous costs.

Beyond our improvements in the SSP setting, we can highlight two broader implications of our results. 1) We show that it is possible to devise an adaptive exploration bonus strategy in a setting where no prior knowledge of the "optimal range" is available (this was an open question raised in Qian et al. [2019], whose exploration-bonus-based approach in average-reward MDPs requires prior knowledge of the optimal bias span). 2) We show that the concept of horizon-free regret can be extended beyond finite-horizon MDPs [e.g., Wang et al., 2020a, Zhang et al., 2020a, 2021]. In fact, it is perhaps even more meaningful in the goal-oriented setting, by virtue of the already present distinction between the "optimal length" ($T_\star$) and the "optimal return" ($B_\star$). Indeed, we do not make any extra assumption on the SSP problem, as opposed to the finite-horizon set-up, which requires decoupling the episode length ($H$) from the total return of any trajectory (at most 1) so as to uncover horizon-free properties. An interesting question raised by our paper is whether it is possible to simultaneously achieve the three properties of minimax, parameter-free and horizon-free regret for SSP under general costs. Another interesting direction would be to build on our approach (e.g., the VISGO procedure) to derive tight sample complexity bounds in SSP, which as explained in Tarbouriech et al. [2021] do not directly ensue from regret guarantees.

References

Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning, pages 263–272. JMLR.org, 2017.

Peter L Bartlett and Ambuj Tewari. Regal: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. arXiv preprint arXiv:1205.2661, 2012.

Dimitri Bertsekas. Dynamic Programming and Optimal Control, volume 2. Athena Scientific, 1995.

Dimitri P Bertsekas. Linear Network Optimization: Algorithms and Codes. MIT Press, 1991.

Dimitri P Bertsekas and John N Tsitsiklis. An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16(3):580–595, 1991.

Dimitri P Bertsekas and Huizhen Yu. Stochastic shortest path problems under weak conditions. Lab. for Information and Decision Systems Report LIDS-P-2909, MIT, 2013.

Blai Bonet. On the speed of convergence of value iteration on stochastic shortest-path problems. Mathematics of Operations Research, 32(2):365–373, 2007. Liyu Chen and Haipeng Luo. Finding the stochastic shortest path with low regret: The adversarial cost and unknown transition case. arXiv preprint arXiv:2102.05284, 2021.

Liyu Chen, Haipeng Luo, and Chen-Yu Wei. Minimax regret for stochastic shortest path with adversarial costs and known transition. arXiv preprint arXiv:2012.04053, 2020. Alon Cohen, Yonathan Efroni, Yishay Mansour, and Aviv Rosenberg. Minimax regret for stochastic shortest path. arXiv preprint arXiv:2103.13056, 2021. Yonathan Efroni, Nadav Merlis, Mohammad Ghavamzadeh, and Shie Mannor. Tight regret bounds for model-based reinforcement learning with greedy policies. In Advances in Neural Information Processing Systems, 2019. Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In International Conference on Machine Learning, pages 1578–1586. PMLR, 2018.

Matthieu Guillot and Gautier Stauffer. The stochastic shortest path problem: a polyhedral combinatorics perspective. European Journal of Operational Research, 285(1):148–158, 2020.

Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.

Nan Jiang and Alekh Agarwal. Open problem: The dependence of sample complexity lower bounds on planning horizon. In Conference On Learning Theory, pages 3395–3398. PMLR, 2018.

Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018. Chi Jin, Tiancheng Jin, Haipeng Luo, Suvrit Sra, and Tiancheng Yu. Learning adversarial markov decision processes with bandit feedback and unknown transition. In International Conference on Machine Learning, pages 4860–4869. PMLR, 2020.

Tiancheng Jin and Haipeng Luo. Simultaneously learning stochastic and adversarial episodic mdps with known transition. arXiv preprint arXiv:2006.05606, 2020. Andrey Kolobov, Mausam, Daniel Weld, and Hector Geffner. Heuristic search for generalized stochastic shortest path mdps. In Proceedings of the International Conference on Automated Planning and Scheduling, volume 21, 2011.

Shiau Hong Lim and Peter Auer. Autonomous exploration for navigating in mdps. In Conference on Learning Theory, pages 40–1. JMLR Workshop and Conference Proceedings, 2012. Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.

Pierre Menard, Omar Darwiche Domingues, Xuedong Shang, and Michal Valko. UCB momentum Q-learning: Correcting the bias without forgetting. arXiv preprint arXiv:2103.01312, 2021.

Gergely Neu and Ciara Pike-Burke. A unifying view of optimism in episodic reinforcement learning. In Advances in Neural Information Processing Systems, volume 33, pages 1392–1403, 2020.

Gergely Neu, András György, and Csaba Szepesvári. The online loop-free stochastic shortest-path problem. In COLT, volume 2010, pages 231–243. Citeseer, 2010.

Gergely Neu, András György, and Csaba Szepesvári. The adversarial stochastic shortest path problem with unknown transition probabilities. In Artificial Intelligence and Statistics, pages 805–813. PMLR, 2012.

Jian Qian, Ronan Fruit, Matteo Pirotta, and Alessandro Lazaric. Exploration bonus for regret minimization in discrete and continuous average reward MDPs. In Advances in Neural Information Processing Systems, pages 4891–4900, 2019.

Aviv Rosenberg and Yishay Mansour. Online convex optimization in adversarial Markov decision processes. In International Conference on Machine Learning, pages 5478–5486. PMLR, 2019a.

Aviv Rosenberg and Yishay Mansour. Online stochastic shortest path with bandit feedback and unknown transition function. Advances in Neural Information Processing Systems, 32:2212–2221, 2019b.

Aviv Rosenberg and Yishay Mansour. Stochastic shortest path with adversarially changing costs. arXiv preprint arXiv:2006.11561, 2020.

Aviv Rosenberg, Alon Cohen, Yishay Mansour, and Haim Kaplan. Near-optimal regret bounds for stochastic shortest path. In International Conference on Machine Learning, pages 8210–8219. PMLR, 2020.

Max Simchowitz and Kevin G. Jamieson. Non-asymptotic gap-dependent regret bounds for tabular mdps. In Advances in Neural Information Processing Systems, volume 32, pages 1151–1160, 2019.

Jean Tarbouriech, Evrard Garcelon, Michal Valko, Matteo Pirotta, and Alessandro Lazaric. No-regret exploration in goal-oriented reinforcement learning. In International Conference on Machine Learning, pages 9428–9437. PMLR, 2020a.

Jean Tarbouriech, Matteo Pirotta, Michal Valko, and Alessandro Lazaric. Improved sample complexity for incremental autonomous exploration in MDPs. In Advances in Neural Information Processing Systems, volume 33, pages 11273–11284, 2020b.

Jean Tarbouriech, Matteo Pirotta, Michal Valko, and Alessandro Lazaric. Sample complexity bounds for stochastic shortest path with a generative model. In Algorithmic Learning Theory, pages 1157–1178. PMLR, 2021.

Ruosong Wang, Simon S. Du, Lin F. Yang, and Sham M. Kakade. Is long horizon RL more difficult than short horizon RL? In Advances in Neural Information Processing Systems, 2020a.

Yuanhao Wang, Kefan Dong, Xiaoyu Chen, and Liwei Wang. Q-learning with UCB exploration is sample efficient for infinite-horizon MDP. In International Conference on Learning Representations, 2020b.

Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Hiteshi Sharma, and Rahul Jain. Model-free reinforcement learning in infinite-horizon average-reward Markov decision processes. In International Conference on Machine Learning, 2020.

Haike Xu, Tengyu Ma, and Simon S Du. Fine-grained gap-dependent bounds for tabular MDPs via adaptive multi-step bootstrap. arXiv preprint arXiv:2102.04692, 2021.

Huizhen Yu and Dimitri P Bertsekas. On boundedness of Q-learning iterates for stochastic shortest path problems. Mathematics of Operations Research, 38(2):209–227, 2013.

Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304–7312, 2019.

Zihan Zhang, Xiangyang Ji, and Simon S Du. Is reinforcement learning more difficult than bandits? a near-optimal algorithm escaping the curse of horizon. arXiv preprint arXiv:2009.13503, 2020a. Zihan Zhang, Yuan Zhou, and Xiangyang Ji. Almost optimal model-free reinforcement learning via reference- advantage decomposition. arXiv preprint arXiv:2004.10019, 2020b. Zihan Zhang, Yuan Zhou, and Xiangyang Ji. Model-free reinforcement learning: from clipped pseudo-regret to sample complexity. arXiv preprint arXiv:2006.03864, 2020c. Zihan Zhang, Jiaqi Yang, Xiangyang Ji, and Simon S Du. Variance-aware confidence set: Variance-dependent bound for linear bandits and horizon-free bound for linear mixture mdp. arXiv preprint arXiv:2101.12745, 2021.

Appendix

A T? can be arbitrarily larger than B?,S,A

Here we provide a simple illustration that the inequality $B_\star \le T_\star$ may be arbitrarily loose, which shows that scaling with $T_\star$ can be much worse than scaling with $B_\star$. Recall that $B_\star$ bounds the total expected cost of the optimal policy starting from any state, and $T_\star$ bounds the expected time-to-goal of the optimal policy from any state.

Let us consider an SSP instance whose optimal policy induces the absorbing Markov chain depicted in Fig. 1. It is easy to see that $B_\star = 1$ and that $T_\star = \Omega(S\,p_{\min}^{-1})$. Hence, the gap between $B_\star$ and $T_\star$ can grow arbitrarily large as $p_{\min} \to 0$. This simple example illustrates the benefit of having a bound that is (nearly) horizon-free (cf. desired property 3 in Sect. 1). Indeed, a bound that is not horizon-free scales polynomially with $T_\star$ and thus with $p_{\min}^{-1}$, which may be arbitrarily large if $p_{\min} \to 0$. In contrast, a horizon-free bound only scales logarithmically with $p_{\min}^{-1}$ and can therefore be much tighter.

[Figure 1: Markov chain of the optimal policy of an SSP instance with $S$ states. Transitions in green incur a cost of 0, while the transition in red leading to the goal state $g$ incurs a cost of 1. All transitions are deterministic, apart from the one starting from $s_0$, which reaches state $s_{-1}$ with probability $p_{\min}$ and state $s_1$ with probability $1 - p_{\min}$, where $p_{\min} > 0$.]
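To make the $B_\star$ versus $T_\star$ gap concrete, here is a minimal numerical sketch (our own; the exact state indexing and loop structure are assumptions read off the caption of Fig. 1): it solves the linear systems for the expected cost-to-goal and time-to-goal of the chain and recovers $B_\star = 1$ while $T_\star$ grows as $\Theta(S/p_{\min})$.

```python
import numpy as np

def chain_values(S=8, p_min=1e-3):
    # Non-goal states: index 0 = s_0, indices 1..S-2 = s_1..s_{S-2}, index S-1 = s_{-1}.
    n = S
    P = np.zeros((n, n))          # transitions among non-goal states (goal is absorbing, excluded)
    c = np.zeros(n)               # expected instantaneous cost incurred when leaving each state
    P[0, 1] = 1.0 - p_min         # s_0 -> s_1 (cost 0)
    P[0, n - 1] = p_min           # s_0 -> s_{-1} (cost 0)
    for i in range(1, S - 2):
        P[i, i + 1] = 1.0         # s_i -> s_{i+1} (cost 0)
    P[S - 2, 0] = 1.0             # s_{S-2} -> s_0 (cost 0)
    c[n - 1] = 1.0                # s_{-1} -> g, cost 1 (this row of P stays 0: the chain leaves)
    # Expected cost-to-goal v and time-to-goal t solve (I - P) v = c and (I - P) t = 1.
    v = np.linalg.solve(np.eye(n) - P, c)
    t = np.linalg.solve(np.eye(n) - P, np.ones(n))
    return v.max(), t.max()       # B_star, T_star (maxima over starting states)

if __name__ == "__main__":
    for p in [1e-1, 1e-2, 1e-3]:
        B_star, T_star = chain_values(S=8, p_min=p)
        print(f"p_min={p:.0e}:  B_star={B_star:.3f}   T_star={T_star:.1f}")
```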

B An Alternative Assumption on the SSP Problem: No Almost- Sure Zero-Cost Cycles

Here we complement Sect. 4.1 by introducing an alternative assumption on the SSP problem (which is weaker than Asm.4) and we analyze the regret bound achieved by EB-SSP (under the set-up of Sect. 4.1). We draw inspiration from the common assumption in the deterministic shortest path setting that the transition graph does not possess any cycle of zero costs [Bertsekas, 1991]. In the following we introduce a “stochastic” counterpart of this assumption. Assumption 19. There exist unknown constants c† > 0 and q† > 0 such that:

$$\mathbb{P}\left[ \bigcap_{s' \in \mathcal{S}} \; \bigcap_{\omega \in \Omega_{s'}} \left\{ \sum_{i=1}^{|\omega|} c_i \ge c^\dagger \right\} \right] \ge q^\dagger,$$

where for every state $s' \in \mathcal{S}$ we denote by $\Omega_{s'}$ the set of all possible trajectories in the SSP-MDP that start from state $s'$ and end in state $s'$, and we denote by $c_1, \ldots, c_{|\omega|}$ the sequence of costs incurred during a trajectory $\omega$.

Asm. 19 is strictly weaker than the assumption of positive costs (Asm. 4) and it guarantees that the conditions of Lem. 2 hold. Intuitively, it implies that the agent has a non-zero probability of gradually accumulating some positive cost as its trajectory length increases. Indeed, it is straightforward to see that under Asm. 19, with probability at least $1 - \delta/T$, for any $z \ge \log(T/\delta)/q^\dagger$, it holds that $\sum_{i=1}^{zS} c_i \ge c^\dagger$. Reiterating this argument, we see that $T = O\big(\frac{S\log(T/\delta)}{c^\dagger q^\dagger}\,C\big)$, with $C$ the total cumulative cost. Using a loose bound on $C$ and isolating $T$ (with the same reasoning as in Sect. 4.1.1) gives that $T$ is upper bounded polynomially by the quantities $B_\star, S, A, K, \delta^{-1}, (c^\dagger q^\dagger)^{-1}$. Plugging this in Thm. 3 yields the following.

Corollary 20. Under Asm. 19, running EB-SSP (Alg. 1) with $B = B_\star$ and $\eta = 0$ gives the following regret bound with probability at least $1 - \delta$:
$$R_K = O\left( B_\star\sqrt{SAK}\,\log\left(\frac{KB_\star SA}{c^\dagger q^\dagger\delta}\right) + B_\star S^2A\,\log^2\left(\frac{KB_\star SA}{c^\dagger q^\dagger\delta}\right) \right).$$

The regret bound of Cor. 20 is (nearly) minimax and horizon-free (and it can be made parameter-free by executing Alg. 3 instead of Alg. 1). The bound depends logarithmically on the inverse of the constants $c^\dagger, q^\dagger$. We observe that i) the bound is no longer relevant if one of the constants is exponentially small, ii) spelling out $c^\dagger, q^\dagger$ satisfying Asm. 19 is challenging as they subtly depend on both the cost function and the transition dynamics, although iii) the agent does not need to know nor estimate $c^\dagger$ and $q^\dagger$ to achieve the regret bound of Cor. 20.

C Computational Complexity of EB-SSP

Here we complement Remarks 1 and 3 on the computational efficiency of EB-SSP (Alg. 1). We show that a trivial modification of the constant $c_2$ in the bonus enables to tighten the analysis of the contraction modulus $\rho$ and thus get a total complexity in $\widetilde{O}(T)$ (instead of $\widetilde{O}(T^2)$), only at the cost of an extra logarithmic term in $T$ in the lower-order term of the regret bound of Thm. 3.

More precisely, we select the same value for $c_1$, yet instead of $c_2 = c_1^2$ we pick $c_2 = c_1^2\ln\Big(\frac{e}{1 - \max_{s,a}\|\widetilde{P}_{s,a}\|_\infty^{\neq g}}\Big)$. Since we have $c_2 \le c_1^2\ln(e(T+1))$, we see that this small change will only add an extra logarithmic term in $T$ in the lower-order term of the regret bound of Thm. 3. We now retrace the analysis of Lem. 14. Specifically, by straightforward algebraic manipulation, we get that if $\|p\|_\infty^{\neq g} < 1$, then $f(p, v, n, B_\star, \iota)$ is $\rho_p$-contractive in $v(s)$, with

$$\rho_p := 1 - \big(1 - \|p\|_\infty^{\neq g}\big)^{y} < 1, \qquad \text{where } y := 1 + \ln^{-1}\left(\frac{e}{1 - \|p\|_\infty^{\neq g}}\right).$$

Recalling that we instantiate $p = \widetilde{P}_{s,a}$ for each state-action pair $(s,a)$, the operator $\widetilde{L}$ is now $\rho$-contractive with $\rho := 1 - \frac{1}{e(T+1)}$. This gives a refined bound on the contraction modulus (that does not exhibit a quadratic term in $T^2$ as in Lem. 17), and therefore on the computational complexity of a VISGO procedure, which can be bounded as $O\big(\frac{S^2A}{1-\rho}\log(B_\star/\epsilon_{\mathrm{VI}})\big)$. Since $\epsilon_{\mathrm{VI}} = \Omega(T^{-1})$ and since there are at most $O(SA\log T)$ VISGO procedures, we obtain a total computational complexity for EB-SSP that is bounded as $O\big(TS^2A\log(B_\star T)\cdot SA\log T\big)$, which is polynomially bounded and in particular near-linear in $T$. Also note that $T$ is always bounded polynomially w.r.t. $K$ as shown in the various cases of Sect. 4.1.

By means of comparison, although Tarbouriech et al. [2020a], Rosenberg et al. [2020] do not formally analyze the computational complexity of their algorithms, they can be shown to be also computationally efficient. Indeed we can leverage the result of Bonet [2007] that value iteration (VI) for SSP converges polynomially in the ratio between the infinity norm of the optimal value function and the minimum cost (either in the true model if costs are positive, or in the cost-perturbed model where the perturbation scales inversely with $K$). This means that the associated extended value iteration (EVI) is also computationally efficient (as in EVI for average-reward [e.g., Fruit et al., 2018]). As for the algorithm of Cohen et al. [2021], it performs a loop-free reduction of SSP to finite-horizon with horizon $H = \Omega(T_\star)$, so the computational complexity scales with that of a finite-horizon VI procedure (i.e., $S^2AH$) times the number of them (i.e., $T/H$), i.e., it scales linearly with $T$.

We also note that the analysis of the computational complexity of EB-SSP may likely be refined. Indeed, we see that i) on the one hand, if $n(s,a)$ is small, then the optimistic skewing of $\widetilde{P}_{s,a}$ is not too small, so the probability of reaching the goal from $(s,a)$ is not too small (so the associated contraction modulus is bounded away from 1); and ii) on the other hand, if $n(s,a) \to +\infty$, then $\widetilde{P}_{s,a} \to \widehat{P}_{s,a} \to P_{s,a}$, so in the limit we should recover the convergence properties of VI for the optimal Bellman operator under the true model, which by assumption admits a proper policy in $P$. Thus we see that studying further the "intermediate regime" may bring into the picture the computational complexity of running VI in the true model, yet this is not our main

focus here. For our purposes, our complexity analysis is sufficient to ensure computational efficiency (which is on par with existing approaches). Moreover, the fact that EB-SSP is a value-optimistic algorithm (and not model-optimistic) makes it "easier to implement and thus more practical" [Neu and Pike-Burke, 2020].
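As a rough numerical illustration of the discussion above (a minimal sketch with constants of our own choosing), the number of VISGO sweeps needed to contract an initial error of order $B_\star$ down to $\epsilon_{\mathrm{VI}}$ scales as $\log(B_\star/\epsilon_{\mathrm{VI}})/\log(1/\rho)$; the refined modulus $\rho = 1 - \frac{1}{e(T+1)}$ gives a near-linear-in-$T$ count, versus a roughly quadratic count for the modulus of Lem. 17.

```python
import math

def visgo_sweeps(B_star, eps_vi, rho):
    """Number of contraction sweeps so that rho**k * B_star <= eps_vi."""
    return math.ceil(math.log(B_star / eps_vi) / math.log(1.0 / rho))

T = 10_000
B_star, eps_vi = 5.0, 1.0 / T
rho_refined = 1.0 - 1.0 / (math.e * (T + 1))   # refined modulus discussed in this appendix
rho_original = 1.0 - 1.0 / (T + 1) ** 2        # modulus implied by Lem. 17 (nu >= 1/(T+1))
print("refined :", visgo_sweeps(B_star, eps_vi, rho_refined))
print("original:", visgo_sweeps(B_star, eps_vi, rho_original))
```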

D Technical Lemmas

Lemma 21 (Bennett's Inequality, anytime version). Let $Z, Z_1, \ldots, Z_n$ be i.i.d. random variables with values in $[0, b]$ and let $\delta > 0$. Define $\mathbb{V}[Z] = \mathbb{E}[(Z - \mathbb{E}[Z])^2]$. Then we have
$$\mathbb{P}\left[ \exists n \ge 1 : \; \mathbb{E}[Z] - \frac{1}{n}\sum_{i=1}^n Z_i > \sqrt{\frac{2\mathbb{V}[Z]\ln(4n^2/\delta)}{n}} + \frac{b\ln(4n^2/\delta)}{n} \right] \le \delta.$$

Proof. Bennett's inequality states that if the random variables have values in $[0, 1]$, then for a specific $n \ge 1$,
$$\mathbb{P}\left[ \mathbb{E}[Z] - \frac{1}{n}\sum_{i=1}^n Z_i > \sqrt{\frac{2\mathbb{V}[Z]\ln(2/\delta)}{n}} + \frac{\ln(2/\delta)}{n} \right] \le \delta.$$
We can then choose $\delta \leftarrow \frac{\delta}{2n^2}$ and take a union bound over all possible values of $n \ge 1$, and the result follows by noting that $\sum_{n\ge 1}\frac{\delta}{2n^2} < \delta$. To account for the case $b \neq 1$ we can simply apply the result to $(Z_n/b)$.

Lemma 22 (Theorem 4 in Maurer and Pontil [2009], anytime version). Let $Z, Z_1, \ldots, Z_n$ ($n \ge 2$) be i.i.d. random variables with values in $[0, b]$ and let $\delta > 0$. Define $\bar{Z} = \frac{1}{n}\sum_{i=1}^n Z_i$ and $\hat{V}_n = \frac{1}{n}\sum_{i=1}^n(Z_i - \bar{Z})^2$. Then we have
$$\mathbb{P}\left[ \exists n \ge 2 : \; \mathbb{E}[Z] - \frac{1}{n}\sum_{i=1}^n Z_i > \sqrt{\frac{2\hat{V}_n\ln(4n^2/\delta)}{n-1}} + \frac{7b\ln(4n^2/\delta)}{3(n-1)} \right] \le \delta.$$

Lemma 23 (Popoviciu's Inequality). Let $X$ be a random variable whose value is in a fixed interval $[a, b]$. Then $\mathbb{V}[X] \le \frac{1}{4}(b-a)^2$.

Lemma 24 (Lemma 11 in Zhang et al. [2020c]). Let $(M_n)_{n\ge 0}$ be a martingale such that $M_0 = 0$ and $|M_n - M_{n-1}| \le c$ for some $c > 0$ and any $n \ge 1$. Let $\mathrm{Var}_n = \sum_{k=1}^n \mathbb{E}[(M_k - M_{k-1})^2\,|\,\mathcal{F}_{k-1}]$ for $n \ge 0$, where $\mathcal{F}_k = \sigma(M_1, \ldots, M_k)$. Then for any positive integer $n$ and any $\epsilon, \delta > 0$, we have that
$$\mathbb{P}\left[ |M_n| \ge 2\sqrt{2\mathrm{Var}_n\ln(1/\delta)} + 2\sqrt{\epsilon\ln(1/\delta)} + 2c\ln(1/\delta) \right] \le 2\left(\log_2\left(\frac{nc^2}{\epsilon}\right) + 1\right)\delta.$$

Lemma 25. Let $(M_n)_{n\ge 0}$ be a martingale such that $M_0 = 0$ and $|M_n - M_{n-1}| \le c$ for some $c > 0$ and any $n \ge 1$. Let $\mathrm{Var}_n = \sum_{k=1}^n \mathbb{E}[(M_k - M_{k-1})^2\,|\,\mathcal{F}_{k-1}]$ for $n \ge 0$, where $\mathcal{F}_k = \sigma(M_1, \ldots, M_k)$. Then for any positive integer $n$ and $\delta \in (0, 2(nc^2)^{1/\ln 2}]$, we have that
$$\mathbb{P}\left[ |M_n| \ge 2\sqrt{2\mathrm{Var}_n\big(\log_2(nc^2) + \ln(2/\delta)\big)} + 2\sqrt{\log_2(nc^2) + \ln(2/\delta)} + 2c\big(\log_2(nc^2) + \ln(2/\delta)\big) \right] \le \delta.$$

Proof. Take $\epsilon = 1$ and $\delta' = 2(\log_2(nc^2)+1)\delta$ in Lem. 24. By $x \ge \ln(x) + 1$, we have
$$\ln(1/\delta) = \ln\big(2(\log_2(nc^2)+1)/\delta'\big) = \ln\big(\log_2(nc^2)+1\big) + \ln(2/\delta') \le \log_2(nc^2) + \ln(2/\delta').$$
Hence,
$$\begin{aligned}
&\mathbb{P}\left[ |M_n| \ge 2\sqrt{2\mathrm{Var}_n\big(\log_2(nc^2) + \ln(2/\delta')\big)} + 2\sqrt{\log_2(nc^2) + \ln(2/\delta')} + 2c\big(\log_2(nc^2) + \ln(2/\delta')\big) \right] \\
&\qquad \le \mathbb{P}\left[ |M_n| \ge 2\sqrt{2\mathrm{Var}_n\ln(1/\delta)} + 2\sqrt{\ln(1/\delta)} + 2c\ln(1/\delta) \right] \le \delta'.
\end{aligned}$$
By swapping $\delta$ and $\delta'$ we complete the proof.

Lemma 26 (Lemma 11 in Zhang et al. [2020a]). Let $\lambda_1, \lambda_2, \lambda_4 \ge 0$, $\lambda_3 \ge 1$ and $i' = \log_2\lambda_1$. Let $a_1, a_2, \ldots, a_{i'}$ be non-negative reals such that $a_i \le \lambda_1$ and $a_i \le \lambda_2\sqrt{a_{i+1}} + 2^{i+1}\lambda_3 + \lambda_4$ for any $1 \le i \le i'$. Then we have that $a_1 \le \max\big\{\big(\lambda_2 + \sqrt{\lambda_2^2 + \lambda_4}\big)^2,\; \lambda_2\sqrt{8\lambda_3} + \lambda_4\big\}$.

Lemma 27. For a random variable $Z \in [0, 1]$, $\mathbb{V}[Z] \le \mathbb{E}[Z]$.

Proof. $\mathbb{V}[Z] = \mathbb{E}[Z^2] - (\mathbb{E}[Z])^2 \le \mathbb{E}[Z^2] \le \mathbb{E}[Z]$.

Lemma 28. For any $a, b \in [0, 1]$ and $k \in \mathbb{N}$, $a^k - b^k \le k\max\{a - b, 0\}$.

Proof. $a^k - b^k = (a - b)\sum_{i=0}^{k-1} a^i b^{k-1-i} \le \max\{a - b, 0\}\cdot\sum_{i=0}^{k-1} 1 = k\max\{a - b, 0\}$.

Lemma 29. For $a, b, x \ge 0$, $x \le a\sqrt{x} + b$ implies $x \le (a + \sqrt{b})^2$.

Proof. $x \le a\sqrt{x} + b \;\Rightarrow\; \sqrt{x} \le \frac{a + \sqrt{a^2 + 4b}}{2} \;\Rightarrow\; x \le \left(\frac{a + \sqrt{a^2 + 4b}}{2}\right)^2 \le (a + \sqrt{b})^2$.

E Proofs

E.1 Proofs of Lemmas 12, 14, 15, 16, 17, 18

Restatement of Lemma 12. For any non-negative vector $U \in \mathbb{R}^{S'}$ such that $U(g) = 0$, for any $(s, a) \in \mathcal{S}\times\mathcal{A}$, it holds that
$$\widetilde{P}_{s,a}U \le \widehat{P}_{s,a}U \le \widetilde{P}_{s,a}U + \frac{\|U\|_\infty}{n(s,a)+1}, \qquad \big|\mathbb{V}(\widetilde{P}_{s,a}, U) - \mathbb{V}(\widehat{P}_{s,a}, U)\big| \le \frac{2\|U\|_\infty^2 S'}{n(s,a)+1}.$$

Proof. The proof uses the definition of Pe (Eq.4) and simple algebraic manipulation. For any s0 6= g, we have 0 Pes,a,s0 ≤ Pbs,a,s0 and U(s ) ≥ 0, as well as U(g) = 0, so Pes,aU ≤ Pbs,aU, and

 n(s, a)  kUk∞ (Pbs,a − Pes,a)U = 1 − Pbs,aU ≤ . n(s, a) + 1 n(s, a) + 1

In addition, for any s0 ∈ S0, n(s, a) I[s0 = g] 2 |Pes,a,s0 − Pbs,a,s0 | ≤ − 1 Pbs,a,s0 + ≤ . n(s, a) + 1 n(s, a) + 1 n(s, a) + 1

Therefore we have that

X 0 2 X 0 2 V(Pbs,a,U) = Pbs,a,s0 (U(s ) − Pbs,aU) ≤ Pbs,a,s0 (U(s ) − Pes,aU) s0∈S0 s0∈S0   2 0 X 2 0 2 2kUk∞S ≤ Pes,a,s0 + (U(s ) − Pes,aU) ≤ (Pes,a,U) + , n(s, a) + 1 V n(s, a) + 1 s0∈S0

? P P 2 where the first inequality is by the fact that z = i pixi minimizes the quantity i pi(xi − z) . Conversely,

X 0 2 X 0 2 V(Pes,a,U) = Pes,a,s0 (U(s ) − Pes,aU) ≤ Pes,a,s0 (U(s ) − Pbs,aU) s0∈S0 s0∈S0   2 0 X 2 0 2 2kUk∞S ≤ Pbs,a,s0 + (U(s ) − Pbs,aU) ≤ (Pbs,a,U) + . n(s, a) + 1 V n(s, a) + 1 s0∈S0

Restatement of Lemma 14. Let $\Upsilon := \{v \in \mathbb{R}^{S'} : v \ge 0,\; v(g) = 0,\; \|v\|_\infty \le B\}$. Let $f : \Delta^{S'}\times\Upsilon\times\mathbb{R}\times\mathbb{R}\times\mathbb{R} \to \mathbb{R}$ with $f(p, v, n, B, \iota) := pv - \max\Big\{c_1\sqrt{\frac{\mathbb{V}(p,v)\iota}{n}},\; c_2\frac{B\iota}{n}\Big\}$, with $c_1 = 6$ and $c_2 = 36$ (here taking any pair of constants such that $c_1^2 = c_2$ works). Then $f$ satisfies, for all $p \in \Delta^{S'}$, $v \in \Upsilon$ and $n, \iota > 0$:

1. $f(p, v, n, B, \iota)$ is non-decreasing in $v(s)$, i.e., $\forall (v, v') \in \Upsilon^2$, $v \le v' \implies f(p, v, n, B, \iota) \le f(p, v', n, B, \iota)$;

2. $f(p, v, n, B, \iota) \le pv - \frac{c_1}{2}\sqrt{\frac{\mathbb{V}(p,v)\iota}{n}} - \frac{c_2}{2}\frac{B\iota}{n} \le pv - 2\sqrt{\frac{\mathbb{V}(p,v)\iota}{n}} - 14\frac{B\iota}{n}$;

3. If $\|p\|_\infty^{\neq g} < 1$, then $f(p, v, n, B, \iota)$ is $\rho_p$-contractive in $v(s)$, with $\rho_p := 1 - (1 - \|p\|_\infty^{\neq g})^2 < 1$, i.e.,

$$\forall (v, v') \in \Upsilon^2, \quad |f(p, v, n, B, \iota) - f(p, v', n, B, \iota)| \le \rho_p\|v - v'\|_\infty.$$

Proof. The second claim holds by max{x, y} ≥ (x + y)/2, ∀x, y, by the choices of c1, c2 and because both q V(p,v)ι Bι n and n are non-negative. To verify the first and third claims, we fix all other variables but v(s) and q V(p,v)ι Bι view f as a function in v(s). Because the derivative of f in v(s) does not exist only when c1 n = c2 n , ∂f q V(p,v)ι Bι where the condition has at most two solutions, it suffices to prove ≥ 0 when c1 6= c2 . Direct ∂v(s) n n computation gives " r # ∂f I V(p, v)ι Bι p(s)(v(s) − pv)ι = p(s) − c1 c1 ≥ c2 p ∂v(s) n n nV(p, v)ι c2 ≥ min p(s), p(s) − 1 p(s)v(s) − pv c2B (i) c2 ≥ min p(s), p(s) − 1 p(s) c2  c2  ≥ p(s) 1 − 1 = 0. c2 q V(p,v)ι Bι Here (i) is by v(s)−pv ≤ v(s) ≤ B. For the third claim, we perform a distinction of cases. If c1 n = c2 n , Bι 6=g where the condition has at most two solutions, then f(v) = pv−c2 , which corresponds to a kpk∞ -contraction q n 6=g V(p,v)ι Bι with kpk∞ < 1. Otherwise c1 n 6= c2 n , then the derivative of f in v(s) exists and it verifies " r # ∂f I V(p, v)ι Bι p(s)(v(s) − pv)ι = p(s) − c1 c1 ≥ c2 p ∂v(s) n n nV(p, v)ι c2 ≤ max p(s), p(s) + 1 p(s)pv − v(s) c2B (i) 1 ≤ max p(s), p(s) + p(s)(1 − p(s))B + p(s)v(s) − v(s)  B | {z } ≤0 ≤ max p(s), p(s) + p(s)(1 − p(s) 2 = 1 − 1 − p(s) .

∂f  6=g 6=g 2 Here (i) is by kvk∞ ≤ B. Therefore, for all s ∈ S, we have ≤ max kpk , 1 − (1 − kpk ) ≤ ∂v(s) ∞ ∞ 6=g 2 1 − (1 − kpk∞ ) = ρp < 1. By the mean value theorem we obtain that f is ρp-contractive.
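The following minimal sketch (our own; the goal state is encoded as the last coordinate, and the constants are the ones of Lem. 14) implements the function $f$ and checks the first two claims on random inputs; the contraction claim is what the proof above establishes analytically.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(p, v, n, B, iota, c1=6.0, c2=36.0):
    """f(p, v, n, B, iota) = p.v - max{c1*sqrt(V(p,v)*iota/n), c2*B*iota/n} (Lem. 14).
    Convention: the last coordinate of p and v is the goal g, with v(g) = 0."""
    mean = p @ v
    var = p @ (v - mean) ** 2
    return mean - max(c1 * np.sqrt(var * iota / n), c2 * B * iota / n)

B, n, iota, S_prime = 4.0, 50, 3.0, 6
for _ in range(1000):
    p = rng.dirichlet(np.ones(S_prime))
    v = np.append(rng.uniform(0.0, B, S_prime - 1), 0.0)       # v in Upsilon: 0 <= v <= B, v(g) = 0
    # Claim 1 (monotonicity): raising coordinates of v (while staying in [0, B]) cannot decrease f.
    bump = np.append(rng.uniform(0.0, 1.0, S_prime - 1), 0.0)
    v_hi = np.minimum(v + bump, np.append(np.full(S_prime - 1, B), 0.0))
    assert f(p, v, n, B, iota) <= f(p, v_hi, n, B, iota) + 1e-10
    # Claim 2: f <= p.v - (c1/2)*sqrt(V(p,v)*iota/n) - (c2/2)*B*iota/n, since max{x,y} >= (x+y)/2.
    mean = p @ v
    var = p @ (v - mean) ** 2
    assert f(p, v, n, B, iota) <= mean - 3.0 * np.sqrt(var * iota / n) - 18.0 * B * iota / n + 1e-10
print("monotonicity and upper-bound checks passed on 1000 random draws")
```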

Restatement of Lemma 15. Conditioned on the event E, for any output Q of the VISGO procedure (line 26 of Alg.1) and for any state-action pair (s, a) ∈ S × A, it holds that

Q(s, a) ≤ Q?(s, a).

Proof. We prove by induction that for any inner iteration i of VISGO, Q(i)(s, a) ≤ Q?(s, a). By definition we have Q(0) = 0 ≤ Q?. Assume that the property holds for iteration i, then

(i+1)  (i) (i+1) Q (s, a) = max bc(s, a) + Pes,aV − b (s, a), 0 , where

(i) (i+1) bc(s, a) + Pes,aV − b (s, a) s s (i) BpS0ι (i) n V(Pes,a,V )ιs,a Bιs,a o bc(s, a)ιs,a s,a = c(s, a) + Pes,aV − max c1 , c2 − c3 − c4 b n+(s, a) n+(s, a) n+(s, a) n+(s, a) s (i) (i) p 0 (i) n V(Pes,a,V )ιs,a Bιs,a o 28ιs,a B S ιs,a ≤ c(s, a) + Pes,aV − max c1 , c2 + − c4 n+(s, a) n+(s, a) 3n+(s, a) n+(s, a) p 0 (i) + 28ιs,a B S ιs,a = c(s, a) + f(Pes,a,V , n (s, a), B, ιs,a) + − c4 3n+(s, a) n+(s, a)

(ii) p 0 ? + 28ιs,a B S ιs,a ≤ c(s, a) + f(Pes,a,V , n (s, a), B, ιs,a) + − c4 3n+(s, a) n+(s, a) s (iii) ? p 0 ? V(Pes,a,V )ιs,a 14Bιs,a B S ιs,a ≤ c(s, a) + Pes,aV − 2 − − c4 n+(s, a) 3n+(s, a) n+(s, a) s (iv) ? p 0 ? V(Pes,a,V )ιs,a 14Bιs,a B S ιs,a ≤ c(s, a) + Pbs,aV − 2 − − c4 n+(s, a) 3n+(s, a) n+(s, a) s s ? ? p 0 (v) (Pbs,a,V )ιs,a (Pes,a,V )ιs,a 14ιs,a B S ιs,a ≤ c(s, a) + P V ? + 2 V − 2 V − (B − B ) − c s,a n+(s, a) n+(s, a) ? 3n+(s, a) 4 n+(s, a) s ? ? p 0 (vi) | (Pbs,a,V ) − (Pes,a,V )|ιs,a 14ιs,a B S ιs,a ≤ c(s, a) + P V ? + 2 V V − (B − B ) − c s,a n+(s, a) ? 3n+(s, a) 4 n+(s, a) ! (vii) p 0 ? 14ιs,a 2 2S ιs,a ≤ c(s, a) + Ps,aV −(B − B?) + + + | {z } 3n (s, a) n (s, a) =Q?(s,a) ≤ Q?(s, a), where (i) is by definition of E2 and choice of c3, (ii) uses the first property of Lem. 14 and the induction (i) ? hypothesis that V ≤ V , (iii) uses the second property of Lem. 14 and assumption B ≥ max{B?, 1}, (iv) √ √ p uses Lem. 12, (v) is by definition of E1, (vi) uses the inequality x − y ≤ |x − y|, ∀x, y ≥ 0, and (vii) uses the second inequality of Lem. 12 and the choice of c4. Ultimately,

Q(i+1)(s, a) ≤ max Q?(s, a), 0 ≤ Q?(s, a).

(i) Restatement of Lemma 16. The sequence (V )i≥0 is non-decreasing. Combining this with the fact that it is upper bounded by V ? from Lem. 15, the sequence must converge.

27 (i+1) (i+1) Proof. We recognize that V (s) ← mina Q (s, a), with

( s p 0 ) c(s, a)ι B S ιs,a Q(i+1)(s, a) ← max c(s, a) + fP ,V (i), n+(s, a), B, ι  −c b s,a − c , 0 , b es,a s,a 3 + 4 + | {z } n (s, a) n (s, a) (i) :=gs,a(V )

+  where we introduce the function gs,a(V ) := f Pes,a, V, n (s, a), B, ιs,a for notational ease as all other parameters (apart from V ) will remain the same throughout the analysis. We prove by induction on the iterations indexed by i that Q(i) ≤ Q(i+1). First, Q(0) = 0 ≤ Q(1). Now assume that Q(i−1) ≤ Q(i). Then

( s p 0 ) c(s, a)ι B S ιs,a Q(i+1)(s, a) = max c(s, a) + g (V (i)) − c b s,a − c , 0 b s,a 3 n+(s, a) 4 n+(s, a)

( s p 0 ) c(s, a)ι B S ιs,a ≥ max c(s, a) + g (V (i−1)) − c b s,a − c , 0 b s,a 3 n+(s, a) 4 n+(s, a) = Q(i)(s, a),

(i) (i−1) where the inequality uses the induction hypothesis V ≥ V and the fact that gs,a is non-decreasing from the first claim of Lem. 14. Restatement of Lemma 17. Le is a ρ-contractive operator with modulus ρ := 1 − ν2 < 1, where ν is defined in Eq. 15.

Proof. Take any two vectors U1,U2, then for any state s ∈ S,

|LeU1(s) − LeU2(s)| = min LeU1(s, a) − min LeU2(s, a) a a n o ≤ max LeU1(s, a) − LeU2(s, a) , a and we have that for any action a ∈ A,   |LeU1(s, a) − LeU2(s, a)| ≤ max bc(s, a) + gs,a(U1), 0} − max bc(s, a) + gs,a(U2), 0}

≤ gs,a(U1) − g(U2) (i) ≤ ρs,akU1 − U2k∞.

6=g To justify the last inequality (i), we see that kPes,ak∞ ≤ 1 − νs,a < 1 from Eq. 15, therefore the third claim of Lem. 14 entails that gs,a is ρs,a-contractive (where gs,a is defined in the proof of Lem. 16), with

6=g 2 2 ρs,a := 1 − (1 − kPes,ak∞ ) ≤ 1 − νs,a.

Taking the maximum over $(s, a)$ pairs, we get that $\widetilde{L}$ is $\rho$-contractive with modulus $\rho := 1 - \nu^2 < 1$.

Restatement of Lemma 18. Conditioned on the event $\mathcal{E}$, for any interval $m$ and state-action pair $(s, a) \in \mathcal{S}\times\mathcal{A}$,
$$\big|c(s,a) + P_{s,a}V^m - Q^m(s,a)\big| \le \min\big\{\beta^m(s,a),\; B_\star + 1\big\},$$
where we define
$$\beta^m(s,a) := 4b^m(s,a) + \sqrt{\frac{2\mathbb{V}(P_{s,a}, V^\star)\iota_{s,a}}{n^m(s,a)}} + \sqrt{\frac{2S'\mathbb{V}(P_{s,a}, V^\star - V^m)\iota_{s,a}}{n^m(s,a)}} + \frac{3B_\star S'\iota_{s,a}}{n^m(s,a)} + \big(1 + c_1\sqrt{\iota_{s,a}/2}\big)\epsilon_{\mathrm{VI}}^m.$$

m m ? ? Proof. First we see that c(s, a) + Ps,aV − Q (s, a) ≤ c(s, a) + Ps,aV = Q (s, a) ≤ B? + 1 and that m m ? Q (s, a) − c(s, a) − Ps,aV ≤ Q (s, a) ≤ B? + 1, from Lem. 15 and the Bellman optimality equation (Eq.2). m m m Now we prove that |c(s, a) + Ps,aV − Q (s, a)| ≤ β (s, a).

28 m m m m Bounding c(s, a) + Ps,aV − Q (s, a). From the VISGO loop of Alg.1, the vectors Q and V can be (i) (i) associated to a finite iteration l of a sequence of vectors (Q )i≥0 and (V )i≥0 such that (i) Qm(s, a) := Q(l)(s, a), (ii) V m(s) := V (l)(s),

(l) (l−1) m (iii) kV − V k∞ ≤ VI,   √ q (l) q m B S0ι m (l+1) V(Pes,a,V )ιs,a Bιs,a bc (s,a)ιs,a s,a (iv) b (s, a) := b (s, a) = max c1 nm(s,a) , c2 nm(s,a) + c3 nm(s,a) + c4 nm(s,a) .

First, we examine the gap between the exploration bonuses at the final VISGO iterations l and l + 1 as follows s s (l−1) p 0 (i) (Pes,a,V )ιs,a Bιs,a c(s, a)ιs,a B S ιs,a b(l)(s, a) ≤ c V + c + c b + c 1 n+(s, a) 2 n+(s, a) 3 n+(s, a) 4 n+(s, a) s s (l) (l−1) (l) (ii) (Pes,a,V )ιs,a (Pes,a,V − V )ιs,a Bιs,a ≤ c 2V + c 2V + c 1 n+(s, a) 1 n+(s, a) 2 n+(s, a)

s p 0 c(s, a)ι B S ιs,a + c b s,a + c 3 n+(s, a) 4 n+(s, a) s (iii) √ (m )2ι ≤ 2 2b(l+1)(s, a) + c VI s,a 1 2n+(s, a) √ (l+1) m q ≤ 2 2b (s, a) + VIc1 ιs,a/2, √ √ √ where (i) uses max{x, y} ≤ x + y; (ii) uses V(P,X + Y ) ≤ 2(V(P,X) + V(P,Y )) and x + y ≤ x + y; (iii) (l−1) (l) m uses x + y ≤ 2 max{x, y} and Popoviciu’s inequality (Lem. 23) applied to V − V ∈ [−VI, 0]. Moreover, (l) (l−1) (l) we have that Q (s, a) ≥ bc(s, a) + Pes,aV − b (s, a) from Eq.6. Combining everything yields q √ −Qm(s, a) ≤ −c(s, a) − P (V m −  ) +  c ι /2 + 2 2bm(s, a) b es,a VI VI 1 s,a √  q  ≤ −c(s, a) − P V m + 2 2bm(s, a) + 1 + c ι /2 m . b es,a 1 s,a VI

Therefore, we have √  q  c(s, a) + P V m − Qm(s, a) ≤ c(s, a) + P V m − cm(s, a) − P V m + 2 2bm(s, a) + 1 + c ι /2 m s,a s,a b es,a 1 s,a VI

(i)   m m B? m q m ≤ Ps,aV − Pbs,aV + + 4b (s, a) + 1 + c1 ιs,a/2  nm(s, a) + 1 VI ? m ? < (Ps,a − Pbs,a)V + (Ps,a − Pbs,a)(V − V ) | {z } | {z } :=Y1 :=Y2 B  q  + ? + 4bm(s, a) + 1 + c ι /2 m , nm(s, a) 1 s,a VI

m where (i) comes from Lem. 12, the event E2, Lem. 15 and (loosely) bounding |c(s, a) − bc(s, a)| ≤ b (s, a). It holds under the event E1 that s 2 (P ,V ?)ι B ι |Y | ≤ V s,a s,a + ? s,a . 1 nm(s, a) nm(s, a)

29 Moreover, we have

(i) X m 0 ? 0 m ? |Y2| = | (Pbs,a,s0 − Ps,a,s0 )(V (s ) − V (s ) − Ps,a(V − V ))| s0 X m 0 ? 0 m ? ≤ |Ps,a,s0 − Pbs,a,s0 ||V (s ) − V (s ) − Ps,a(V − V )| s0 s 0 (ii) X 2Ps,a,s0 ιs,a B?S ιs,a ≤ |V m(s0) − V ?(s0) − P (V m − V ?)| + nm(s, a) s,a nm(s, a) s0 s (iii) 2S0 (P ,V m − V ?)ι B S0ι ≤ V s,a s,a + ? s,a , nm(s, a) nm(s, a)

P P where the shift performed in (i) is by s0 Ps,a,s0 = s0 Pbs,a,s0 = 1; (ii) holds under the event E3 and Lem. 15 m (V (s) ∈ [0,B?]); (iii) is by Cauchy-Schwarz inequality.

m m m (l) m m Bounding Q (s, a) − c(s, a) − Ps,aV . If Q (s, a) = Q (s, a) = 0, then Q (s, a) − bc(s, a) − Ps,aV ≤  m m (l) (l−1) (l) 0 ≤ min β (s, a),B? . Otherwise, we have Q (s, a) = Q (s, a) = bc(s, a) + Pes,aV − b (s, a). Using m (l−1) m m that V ≥ V (Lem. 16) and Pbs,aV ≥ Pes,aV (Lem. 12), we get

m m m m m Q (s, a) − c(s, a) − Ps,aV ≤ Q (s, a) − bc(s, a) − Ps,aV + b (s, a) (l−1) (l) m m = Pes,aV − b (s, a) − Ps,aV + b (s, a) m m m ≤ Pbs,aV − Ps,aV + b (s, a) ? ? m m = (Pbs,a − Ps,a)V − (Pbs,a − Ps,a)(V − V ) + b (s, a) m ≤ |Y1| + |Y2| + b (s, a), which can be bounded as above.

E.2 Additional lemmas Lemma 30. Let Qem(s, a) := Q?(s, a) − Qm(s, a) and Ve m(s) := V ?(s) − V m(s). Then conditioned on the event E, we have that for all (s, a, m, h),

m m m m m V (s ) − P m m V (s ) ≤ β (s , a ). e h sh ,ah e h+1 h h

Proof. We write that

m m m m ? m ? m m m V (s ) − P m m V (s ) = V (s ) − P m m V + P m m V − V (s ) e h sh ,ah e h+1 h sh ,ah sh ,ah h ? m m ? m m m ≤ Q (s , a ) − P m m V + P m m V − V (s ) h h sh ,ah sh ,ah h (i) m m m m m m = c(s , a ) + P m m V − Q (s , a ) h h sh ,ah h h (ii) m m m ≤ β (sh , ah ),

m m m m m where (i) uses the Bellman optimality equation (Eq.2) and the fact that V (sh ) = Q (sh , ah ), and (ii) comes from Lem. 18. Lemma 31. For any M 0 ≤ M, it holds that

M 0 Hm ! X X m m m m X m m V (sh ) − V (sh+1) − V (s0) ≤ 2SA log(TM 0 ) max kV k∞. 1≤m≤M 0 0 m=1 h=1 m∈M0(M )

30 0 0 Proof. We recall that we denote by M0(M ) the set of intervals among the first M intervals that constitute the first intervals in each episode. From the analytical construction of intervals, an interval m < M 0 can end due to one of the following three conditions: (i) If interval m ends in the goal state, then

m+1 m+1 m m m+1 m m+1 V (s1 ) − V (sHm+1) = V (s0) − V (g) = V (s0).

This happens for all the intervals m + 1 ∈ M0.

(ii) If interval m ends when the cumulative cost reaches B? and we maintain the same policy (i.e., we do m m+1 m+1 m not replan), then V = V and s1 = sHm+1. Therefore,

m+1 m+1 m m V (s1 ) − V (sHm+1) = 0.

(iii) If interval m ends when the count to a state-action pair is doubled, then we replan with a VISGO procedure. Thus we get

m+1 m+1 m m m+1 m+1 m V (s1 ) − V (sHm+1) ≤ V (s1 ) ≤ max kV k∞. 1≤m≤M 0

This happens at most 2SA log2(TM 0 ) times. Combining the three conditions above implies that

M 0 Hm ! X X m m m m V (sh ) − V (sh+1) m=1 h=1 M 0 X m m m m = V (s1 ) − V (sHm+1) m=1 M 0−1 M 0−1 X m+1 m+1 m m  X m m m+1 m+1  M 0 M 0 M 0 M 0 = V (s1 ) − V (sHm+1) + V (s1 ) − V (s1 ) +V (s1 ) −V (sHM0 +1) m=1 m=1 | {z } | {z } ≤0 1 1 M0 M0 =V (s1)−V (s1 ) M 0−1 X m+1 m+1 m m  1 ≤ V (s1 ) − V (sHm+1) + V (s0) m=1 M 0−1 X m+1 0 1 m ≤ V (s0)I[m + 1 ∈ M0, m + 1 ≤ M ] + V (s0) + 2SA log2(TM 0 ) max kV k∞ 1≤m≤M 0 m=1 X m m = V (s0) + 2SA log2(TM 0 ) max kV k∞. 1≤m≤M 0 0 m∈M0(M )
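For reference, here is a schematic rendering (ours, not the paper's pseudocode) of the three interval-ending conditions used in the proof above; only condition (iii) triggers a re-planning step, which is why it contributes the $2SA\log_2(T_{M'})$ term.

```python
def interval_should_end(reached_goal, interval_cost, B_star, N, n, s, a):
    """Schematic interval-ending test (our sketch of the conditions used in Lem. 31):
    (i) the goal is reached, (ii) the cumulative cost within the interval reaches B_star,
    (iii) the visit count of the current state-action pair has doubled since the last update."""
    if reached_goal:                            # condition (i): the next interval starts an episode
        return "goal"
    if interval_cost >= B_star:                 # condition (ii): same policy is kept, no re-planning
        return "cost"
    if N[(s, a)] >= 2 * max(n[(s, a)], 1):      # condition (iii): doubling trigger, re-plan via VISGO
        return "doubling"
    return None
```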

0 E.3 Full proof of the bound on X2(M ) x First, bound βm.

Recall that we assume that the event E holds. From Lem. 18, we have for any m, s, a, s s s m ? ? m (Pes,a,V )ιs,a (Ps,a,V )ιs,a S (Ps,a,V − V )ιs,a βm(s, a) = O V + V + V nm(s, a) nm(s, a) nm(s, a)

31 s √ ! cm(s, a)ι B Sι B Sι  q  + b s,a + ? s,a + s,a + 1 + c ι /2 m . nm(s, a) nm(s, a) nm(s, a) 1 s,a VI

Here we interchange S0 and S since we use the O() notation. From Lem. 12 and Lem. 15, for any m, s, a,

2 0 2 0 m m 2B?S m 2B?S (Pes,a,V ) ≤ (Pbs,a,V ) + < (Pbs,a,V ) + . V V nm(s, a) + 1 V nm(s, a)

Under the event E3, it holds that s 2Ps,a,s0 ιs,a ιs,a 3 2ιs,a Pbs,a,s0 ≤ Ps,a,s0 + + ≤ Ps,a,s0 + . nm(s, a) nm(s, a) 2 nm(s, a)

Thus, it holds that for any m, s, a,

2 m X  m 0 m V(Pbs,a,V ) = Pbs,a,s0 V (s ) − Pbs,aV s0 (i) X m 0 m 2 ≤ Pbs,a,s0 (V (s ) − Ps,aV ) s0   X 3 2ιs,a m 0 m 2 ≤ P 0 + (V (s ) − P V ) 2 s,a,s nm(s, a) s,a s0 3 2B2S0ι ≤ (P ,V m) + ? s,a . 2V s,a nm(s, a)

? P P 2 (i) is by the fact that z = i pixi minimizes the quantity i pi(xi − z) . As a result,

2 0 2 0 m 3 m 2B?S 2B?S ιs,a (Pes,a,V ) < (Ps,a,V ) + + . V 2V nm(s, a) nm(s, a) √ √ √ Utilizing V(P,X + Y ) ≤ 2(V(P,X) + V(P,Y )) with X = V ? − V m and Y = V m and x + y ≤ x + y, finally we have s s (P ,V m)ι S (P ,V ? − V m)ι βm(s, a) ≤ O V s,a s,a + V s,a s,a nm(s, a) nm(s, a) s √ ! c(s, a)ι B Sι B Sι  q  + b s,a + ? s,a + s,a + 1 + c ι /2 m . nm(s, a) nm(s, a) nm(s, a) 1 s,a VI y Second, bound a special type of summation.

Lemma 32. Let $w = \{w_h^m \ge 0 : 1 \le m \le M,\; 1 \le h \le H^m\}$ be a group of weights. Then for any $M' \le M$,
$$\sum_{m=1}^{M'}\sum_{h=1}^{H^m}\sqrt{\frac{w_h^m}{n^m(s_h^m, a_h^m)}} \le O\left( \sqrt{SA\log_2(T_{M'})\sum_{m=1}^{M'}\sum_{h=1}^{H^m} w_h^m} \right).$$

0 m i Proof. Since for m ≤ M , n (s, a) ∈ {2 : i ∈ N, i ≤ log2(TM 0 )}, then ∀i, s, a

M 0 Hm X X  2, i = 0, [(sm, am) = (s, a), nm(s, a) = 2i] ≤ I h h 2i, i ≥ 1. m=1 h=1

32 Thus M 0 Hm M 0 Hm X X 1 X X X X m m m i 1 m m m = I[(sh , ah ) = (s, a), n (s, a) = 2 ] i n (sh , ah ) 2 m=1 h=1 s,a 0≤i≤log2(TM0 ) m=1 h=1   X X = 2 + 1

s,a 1≤i≤log2(TM0 )

≤ SA(2 + log2(TM 0 ) − 1) (19)

≤ O(SA log2(TM 0 )). By Cauchy-Schwarz inequality,

0 m v 0 m 0 m v 0 m  M H s m u M H ! M H ! u M H X X wh u X X m X X 1 u X X m ≤ t w ≤ O tSA log (T 0 ) w . nm(sm, am) h nm(sm, am)  2 M h  m=1 h=1 h h m=1 h=1 m=1 h=1 h h m=1 h=1
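The pigeonhole argument behind Eq. 19 can be checked with a small simulation (our own; the doubling trigger mirrors the count update of EB-SSP while the visit stream is arbitrary): the sum of $1/n^m(s,a)$ over $T$ steps stays below $SA(2 + \log_2 T)$.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, T = 5, 3, 20000
N = np.zeros((S, A), dtype=int)      # running visit counts
n = np.zeros((S, A), dtype=int)      # snapshot counts, refreshed on the doubling trigger
total = 0.0
for _ in range(T):
    s, a = rng.integers(S), rng.integers(A)
    total += 1.0 / max(n[s, a], 1)   # the summand 1/n^m(s,a) appearing in the proof of Lem. 32
    N[s, a] += 1
    if N[s, a] >= 2 * max(n[s, a], 1):   # doubling trigger: update the snapshot count
        n[s, a] = N[s, a]
print(f"sum of 1/n over T={T} steps : {total:.1f}")
print(f"bound S*A*(2 + log2(T))     : {S * A * (2 + np.log2(T)):.1f}")
```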

m m ? m m m w = (Psm,am ,V ), (Psm,am ,V − V ) c(s , a ) ιsm,am By setting successively h V h h V h h and b h h , and relaxing h h  0 2  12SAS TM0 to its upper-bound ιM 0 = ln δ we have v v u M 0 Hm u M 0 Hm u u 0 X X m 2 X X ? m X2(M ) ≤ O uSA log (TM 0 )ιM 0 (Psm,am ,V ) + uS A log (TM 0 )ιM 0 (Psm,am ,V − V ) u 2 V h h u 2 V h h u u t m=1 h=1 t m=1 h=1 | {z 0 } | {z 0 } :=X4(M ) :=X5(M ) v u M 0 Hm u X X m m 2 + tSA log2(TM 0 )ιM 0 bc(sh , ah ) + B?S A log2(TM 0 ) m=1 h=1 M 0 Hm ! 3/2 X X p m 0 0 0 + BS A log2(TM )ιM + (1 + c1 ιM /2)VI . m=1 h=1 z Third, bound each summation separately.

2θ(s,a) Regret contribution of the estimated costs. From line 17 in EB-SSP, we have that bc(s, a) ≤ N(s,a) . Let m m θ (s, a) denote the value of θ(s, a) for calculating bc . By definition,

0 M 0 Hm m m m X X m m m0 m0 m m m m0 m0 m0 m0 θ (sh , ah ) = I[(sh , ah ) = (sh0 , ah0 ), n (sh , ah ) = 2n (sh0 , ah0 )]ch0 m0=1 h0=1 0 0 m m m0 m0 m m m m0 m0 m0 m0 − I[first occurrence of (m , h ) such that (sh , ah ) = (sh0 , ah0 ), n (sh , ah ) = 2n (sh0 , ah0 )]ch0 0 0 m m m0 m0 m m m m0 m0 m0 m0 + I[first occurrence of (m , h ) such that (sh , ah ) = (sh0 , ah0 ), n (sh , ah ) = n (sh0 , ah0 )]ch0 0 M 0 Hm X X m m m0 m0 m m m m0 m0 m0 m0 ≤ I[(sh , ah ) = (sh0 , ah0 ), n (sh , ah ) = 2n (sh0 , ah0 )]ch0 + 1. m0=1 h0=1 For any M 0 ≤ M we have

M 0 Hm X X m m m bc (sh , ah ) m=1 h=1

33 M 0 Hm X X 2θm(sm, am) ≤ h h nm(sm, am) m=1 h=1 h h

0 m 0 m0 M H M H m0 X X X X m m m0 m0 m m m m0 m0 m0 2ch0 = [(s , a ) = (s 0 , a 0 ), n (s , a ) = 2n (s 0 , a 0 )] I h h h h h h h h nm(sm, am) m=1 h=1 m0=1 h0=1 h h M 0 Hm X X 2 + nm(sm, am) m=1 h=1 h h 0 M 0 Hm 0 M 0 Hm (i) m X X ch0 X X m m m0 m0 m m m m0 m0 m0 ≤ · [(s , a ) = (s 0 , a 0 ), n (s , a ) = 2n (s 0 , a 0 )] nm0 (sm0 , am0 ) I h h h h h h h h m0=1 h0=1 h0 h0 m=1 h=1

+ 2SA(log2(TM 0 ) + 1)

0 m0 M H m0 X X ch0 m0 m0 m0 ≤ 2SA(log (T 0 ) + 1) + · 2n (s 0 , a 0 ) 2 M nm0 (sm0 , am0 ) h h m0=1 h0=1 h0 h0 0 M 0 Hm X X m0 = 2SA(log2(TM 0 ) + 1) + 2 ch0 m0=1 h0=1 (ii) 0 ≤ 2SA(log2(TM 0 ) + 1) + 2M (B? + 1). Here (i) comes from Eq. 19 and (ii) comes from Eq. 16. 0 Regret contribution of the VISGO precision errors. For any M ≤ M, denote by JM 0 the (unknown) 0 total number of triggers in the first M intervals. For 1 ≤ j ≤ JM 0 , denote by Lj the number of time steps j elapsed between the (j − 1)-th and the j-th trigger. The doubling condition implies that Lj ≤ 2 SA and that j −j 0 0 there are at most JM = O(log2(TM )) triggers. Using that Alg.1 selects as error VI = 2 , we have that

0 m M H JM0 X X p m p X j p √ 0 0 0 0 0 0 (1 + c1 ιM /2)VI ≤ (1 + c1 ιM /2) LjVI ≤ (1 + c1 ιM /2)SAJM = O(SA log2(TM ) ιM ). m=1 h=1 j=1 Lemma 33. Conditioned on Lem. 18, for a fixed M 0 ≤ M with probability 1 − 2δ,

0 2 0 0  X4(M ) ≤ O (B? + B?)(M + log2(TM 0 ) + ln(2/δ)) + B?X2(M ) .

m m Proof. We introduce the normalized value function V := V /B? ∈ [0, 1]. Define

M 0 Hm M 0 Hm X X m 2d m m 2d X X m 2d F (d) := (P m m (V ) − (V (s )) ),G(d) := (P m m , (V ) ). sh ,ah h+1 V sh ,ah m=1 h=1 m=1 h=1

0 2 Then X4(M ) = B?G(0). Direct computation gives that

M 0 Hm X X m 2d+1 m 2d 2 G(d) = P m m (V ) − (P m m (V ) ) sh ,ah sh ,ah m=1 h=1 M 0 Hm M 0 Hm (i) X X m 2d+1 m m 2d+1  X X m m 2d+1 m 2d+1  ≤ P m m (V ) − (V (s )) + (V (s )) − (P m m V ) sh ,ah h+1 h sh ,ah m=1 h=1 m=1 h=1 M 0 X m m 2d+1 − (V (s1 )) m=1 | {z } ≤0

34 M 0 Hm (ii) d+1 X X m m m ≤ F (d + 1) + 2 max{V (s ) − P m m V , 0} h sh ,ah m=1 h=1 0 m d+1 M H 2 X X m m m m = F (d + 1) + max{Q (s , a ) − P m m V , 0} h h sh ,ah B? m=1 h=1 M 0 Hm (iii) d+1 2 X X m m m m m ≤ F (d + 1) + (c(sh , ah ) + β (sh , ah )) B? m=1 h=1 0 m d+1 M H 2 X X m m m m m m m = F (d + 1) + (ch + β (sh , ah ) + (c(sh , ah ) − ch )) B? m=1 h=1

(iv) d+1 2 0 0 0 ≤ F (d + 1) + ((B? + 1)M + X2(M ) + |X3(M )|). B?

d (i) is by convexity of f(x) = x2 ; (ii) is by Lem. 28; (iii) is by Lem. 18; (iv) is by the interval construction 0 m PM PH m 0 (Eq. 16) which entails that m=1 h=1 ch ≤ (B? + 1)M . For a fixed d, F (d) is a martingale. By taking c = 1 in Lem. 25, we have h p i P F (d) > 2 2G(d)(log2(TM 0 ) + ln(2/δ)) + 5(log2(TM 0 ) + ln(2/δ)) ≤ δ.

0 0 Taking δ = δ/(log2(TM 0 ) + 1), using x ≥ ln(x) + 1 and finally swapping δ and δ , we have that

h p i δ P F (d) > 2 2G(d)(2 log2(TM 0 ) + ln(2/δ)) + 5(2 log2(TM 0 ) + ln(2/δ)) ≤ . log2(TM 0 ) + 1

Taking a union bound over d = 1, 2,..., log2(TM 0 ), we have that with probability 1 − δ, s  0 0  p d+1 B? + 1 0 X2(M ) + |X3(M )| F (d) ≤2 2(2 log2(TM 0 ) + ln(2/δ)) · F (d + 1) + 2 M + B? B?

+ 5(2 log2(TM ) + ln(2/δ)).

p 0 From Lem. 26, taking λ1 = TM 0 , λ2 = 2 2(2 log2(TM 0 ) + ln(2/δ)), λ3 = (1 + 1/B?)M + (X2(M ) + 0 |X3(M )|)/B?, λ4 = 5(2 log2(TM 0 ) + ln(2/δ)), we have that (

F (1) ≤ max 42(2 log2(TM 0 ) + ln(2/δ)), s   0 0  ) 1 0 X2(M ) + |X3(M )| 8 1 + M + (2 log2(TM 0 ) + ln(2/δ)) + 5(2 log2(TM 0 ) + ln(2/δ)) . B? B?

Hence

  0 0 0 2 1 X2(M ) + |X3(M )| X4(M ) ≤ O B? 1 + M + B? B? s   0 0  !! 1 0 X2(M ) + |X3(M )| + 1 + M + (log2(TM 0 ) + ln(2/δ)) + log2(TM 0 ) + ln(2/δ) B? B?    0 0  2 1 0 X2(M ) + |X3(M )| ≤ O B? 1 + M + + log2(TM 0 ) + ln(2/δ) . B? B?

35 0 From the bound of |X3(M )|, the following holds with probability 1 − 2δ: 0 2 0 0  X4(M ) ≤ O (B? + B?)(M + log2(TM 0 ) + ln(2/δ)) + B?X2(M ) . √ Throughout the proof, the inequality O( xy) ≤ O(x + y) is utilized to simplify the bound. Lemma 34. Conditioned on Lem. 18, for a fixed M 0 ≤ M with probability 1 − δ,

0 2 0  X5(M ) ≤ O B?(log2(TM 0 ) + ln(2/δ)) + B?X2(M ) . m m Proof. Recall the definition in Lem. 30, we introduce the normalized value function Ve := Ve /B? ∈ [−1, 1]. Define

0 m 0 m M H m m M H m X X 2d m 2d X X 2d F (d) := (P m m (V ) − (V (s )) ), G(d) := (P m m , (V ) ). e sh ,ah e e h+1 e V sh ,ah e m=1 h=1 m=1 h=1

0 2 Then X5(M ) = Ge(0)B?. Direct computation gives that

0 m M H m m X X 2d+1 2d 2 G(d) = P m m (V ) − (P m m (V ) ) e sh ,ah e sh ,ah e m=1 h=1 0 m 0 m M H m m M H m m X X 2d+1 m 2d+1  X X m 2d+1 2d+1  ≤ P m m (V ) − (V (s )) + (V (s )) − (P m m V ) sh ,ah e e h+1 e h sh ,ah e m=1 h=1 m=1 h=1 0 M m X m 2d+1 − (Ve (s1 )) m=1 | {z } ≤0 0 m M H m m d+1 X X m ≤ F (d + 1) + 2 max{V (s ) − P m m V , 0} e e h sh ,ah e m=1 h=1 0 m d+1 M H 2 X X m m m = F (d + 1) + max{V (s ) − P m m V , 0} e e h sh ,ah e B? m=1 h=1 M 0 Hm (i) d+1 2 X X m m m ≤ Fe(d + 1) + β (sh , ah ) B? m=1 h=1 d+1 2 0 = Fe(d + 1) + X2(M ), B? where (i) come from Lem. 30. For a fixed d, Fe(d) is a martingale. By taking c = 1 in Lem. 25, we have  q  P Fe(d) > 2 2Ge(d)(log2(TM 0 ) + ln(2/δ)) + 5(log2(TM 0 ) + ln(2/δ)) ≤ δ.

0 0 Taking δ = δ/(log2(TM 0 ) + 1), using x ≥ ln(x) + 1 and finally swapping δ and δ , we have that  q  δ P Fe(d) > 2 2Ge(d)(2 log2(TM 0 ) + ln(2/δ)) + 5(2 log2(TM 0 ) + ln(2/δ)) ≤ . log2(TM 0 ) + 1

Taking a union bound over d = 1, 2,..., log2(TM 0 ), we have that with probability 1 − δ, s 0 p d+1 X2(M ) Fe(d) ≤ 2 2(2 log2(TM 0 ) + ln(2/δ)) · Fe(d + 1) + 2 + 5(2 log2(TM 0 ) + ln(2/δ)). B?

36 p 0 From Lem. 26, taking λ1 = TM 0 , λ2 = 2 2(2 log2(TM 0 ) + ln(2/δ)), λ3 = X2(M )/B?, λ4 = 5(2 log2(TM 0 ) + ln(2/δ)), we have that  s  0  X2(M )  Fe(1) ≤ max 42(2 log2(TM 0 ) + ln(2/δ)), 8 (2 log2(TM 0 ) + ln(2/δ)) + 5(2 log2(TM 0 ) + ln(2/δ)) .  B? 

Hence with probability 1 − δ, we have

0 2 0  X5(M ) ≤ O B?(log2(TM 0 ) + ln(2/δ)) + B?X2(M ) . √ Throughout the proof, the inequality O( xy) ≤ O(x + y) is utilized to simplify the bound.

{ Finally, bind them together.

 0 2  12SAS TM0 2 2  Let ιM 0 := ln δ + log2((max{B?, 1}) TM 0 ) + ln δ be the upper bound of all previous log terms.   0 p 0 p 2 0 p 0 2 2 3/2 2 X2(M ) ≤ O SAX4(M )ιM 0 + S AX5(M )ιM 0 + (B? + 1)SAM ιM 0 + B?S AιM 0 + BS AιM 0 , 0 2 0 0  X4(M ) ≤ O (B? + B?)(M + ιM 0 ) + B?X2(M ) , 0 2 0  X5(M ) ≤ O B?ιM 0 + B?X2(M ) .

This implies that

(i)   0 p 2 p 0 p 2 0 2 2 X2(M ) ≤ O B?S AιM 0 · X2(M ) + (B? + B?)SAM ιM 0 + BS AιM 0  n o p 2 p 0 p 2 2 2 ≤ O max B?S AιM 0 · X2(M ), (B? + B?)SAMιM 0 + BS AιM 0 , where (i) uses the assumption B ≥ max{B?, 1} to simplify the bound. Considering terms in max{} separately, we obtain two bounds:

0 2 2 X2(M ) ≤ O(B?S AιM 0 ), 0 p 2 0 2 2 X2(M ) ≤ O( (B? + B?)SAM ιM 0 + BS AιM 0 ).

By taking the maximum of these bounds, we have

0 p 2 0 2 2 X2(M ) ≤ O( (B? + B?)SAM ιM 0 + BS AιM 0 ).

37 F Unknown B?: Parameter-Free EB-SSP

In this section, we relax the assumption that (an upper bound of) B? is known to EB-SSP. In Alg.3 we derive a parameter-free EB-SSP that bypasses the requirement B ≥ B? (line2 of Alg.1) to tune the exploration bonus. As in Sect.4 we consider for ease of exposition that B? ≥ 1. We structure the section as follows: App. F.1 presents our algorithm and provides intuition, App. F.2 spells out its regret guarantee, and App. F.3 gives its proof.

F.1 Algorithm and Intuition

Parameter-free EB-SSP (Alg. 3) initializes an estimate $\widetilde{B} = 1$ and decomposes the time steps into phases, indexed by $\phi$. The execution of a phase is reported in the subroutine PHASE (Alg. 4). Given any estimate $\widetilde{B}$, a subroutine PHASE has the same structure as Alg. 1, up to two key differences:

• Halting due to exceeding cumulative cost. PHASE tracks the cumulative cost within the current phase, and terminates whenever it exceeds a threshold $C_{\mathrm{bound}}$ (Eq. 26) that depends on $\widetilde{B}$, $S$, $A$, $\delta$ and the current episode and time indexes $k$ and $t$, which are all computable quantities to the agent.

• Halting due to exceeding VISGO range. During each VISGO procedure, PHASE tracks the range of the value function $V^{(i)}$ at each VISGO iteration $i$, and terminates if $\|V^{(i)}\|_\infty > \widetilde{B}$.

The estimate $\widetilde{B}$ can be incremented in two different ways and speeds:

• Doubling increment of $\widetilde{B}$. On the one hand, whenever a phase ends (i.e., one of the two halting conditions above is met), $\widetilde{B}$ is doubled ($\widetilde{B} \leftarrow 2\widetilde{B}$).

• Episode-driven increment of $\widetilde{B}$. On the other hand, at the beginning of each new episode $k$, the estimate is automatically increased to $\widetilde{B} \leftarrow \max\{\widetilde{B},\; \sqrt{k}/(S^{3/2}A^{1/2})\}$.

We now explain the rationale behind our scheme (a schematic sketch of the update schedule is given after this list):

• Reason for episode-driven increment of $\widetilde{B}$. The fact that $\widetilde{B}$ grows as a function of $k$ implies that at some (unknown) point it will hold that $\widetilde{B} \ge B_\star$ for large enough $k$. This will enable us to recover the analysis and the regret bound of Thm. 3.

• Reason for doubling increment of $\widetilde{B}$. The doubling increment comes into play whenever a phase terminates due to an exceeding cumulative cost or VISGO range. At this point, the agent becomes aware that $\widetilde{B}$ is too small and thus it doubles it. It is crucial to allow intra-episode increments of $\widetilde{B}$ to avoid getting stuck in an episode with an underestimate $\widetilde{B} < B_\star$.

• Reason for cumulative cost halting. The cost threshold $C_{\mathrm{bound}}$ is designed so that (w.h.p.) it can be exceeded at most once in the case of $\widetilde{B} \ge B_\star$, and so that it can serve as a tight enough bound on the regret in the case of $\widetilde{B} < B_\star$.

• Reason for VISGO range halting. The threshold $\widetilde{B}$ on the range of the VISGO value functions is chosen so that (w.h.p.) it is never exceeded in the case of $\widetilde{B} \ge B_\star$, and so that it can serve as a guarantee of finite-time near-convergence of a VISGO procedure (i.e., the contraction property) in the case of $\widetilde{B} < B_\star$.
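A schematic sketch of the $\widetilde{B}$ update schedule described above (our own rendering; the function name and the example values of $S$, $A$, $k$ are illustrative):

```python
import math

def updated_B(B_tilde, k=None, phase_ended=False, S=10, A=4):
    """Schematic update of the estimate B_tilde in parameter-free EB-SSP (sketch).
    - phase_ended=True: a halting condition was met, so the estimate is doubled.
    - k is the index of a newly started episode: episode-driven increment."""
    if phase_ended:
        B_tilde = 2.0 * B_tilde
    if k is not None:
        B_tilde = max(B_tilde, math.sqrt(k) / (S ** 1.5 * A ** 0.5))
    return B_tilde

# Example trace: start at B_tilde = 1, one phase halts early, then episodes accumulate.
B = updated_B(1.0, phase_ended=True)           # doubling increment -> 2.0
for k in [10, 1000, 4_000_000]:                # episode-driven increments
    B = updated_B(B, k=k)
    print(f"episode {k}: B_tilde = {B:.2f}")
```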

F.2 Regret Guarantee of Parameter-Free EB-SSP

Parameter-free EB-SSP satisfies the following guarantee (which extends Thm. 3 to unknown $B_\star$).

Restatement of Theorem 9. Assume that the conditions of Lem. 2 hold. Then with probability at least $1 - \delta$ the regret of parameter-free EB-SSP (Alg. 3) can be bounded by
$$R_K = O\left( R_K^\star + \sqrt{SAK}\,\log^2\left(\frac{B_\star SAT}{\delta}\right) + B_\star^3 S^3 A\log^3\left(\frac{B_\star SAT}{\delta}\right) \right) \le O\left( R_K^\star\log\left(\frac{B_\star SAT}{\delta}\right) + B_\star^3 S^3 A\log^3\left(\frac{B_\star SAT}{\delta}\right) \right),$$
where $T$ is the cumulative time within the $K$ episodes and $R_K^\star$ denotes the regret after $K$ episodes of EB-SSP in the case of known $B_\star$ (i.e., the bound of Thm. 3 with $B = B_\star$).

As a result, parameter-free EB-SSP is able to circumvent the knowledge of B? at the cost of only logarithmic and lower-order terms.

F.3 Proof of Theorem9 We begin by defining notations and concepts exclusively used in this section:

• Ct denotes the cumulative cost up to time step t (included) that is accumulated in the execution of the subroutine PHASE in which time step t belongs. Importantly, note that the cumulative cost Ct is initialized to 0 at the beginning of each PHASE (line5 of Alg.4). Also note that re-planning (i.e., a VISGO procedure) occurs whenever the estimate Be is changed.

• Denote by tm the time step at the end of the current interval m, and by km the episode in which the

time step tm belongs. Bem denotes the value of Be at time step tm. Cm denotes Ctm , i.e., the cumulative cost up to interval m (included) in the execution of the PHASE in which interval m belongs.

Unlike EB-SSP, the parameter-free version has an increasing Be throughout the process. So whenever we utilize regret bound (from previous sections) for a certain time period, we substitute the B term with an upper-bound for Be in the time period. We now show two key properties on which the analysis of parameter-free EB-SSP relies: Property 1: Optimism avoids the first halting condition. Let us study any phase starting with estimate Be ≥ B?. From Eq. 18 (which is the interval-generalization of Thm.3), for a fixed initial state s0 and a fixed interval m, the cumulative cost can be bounded with probability 1 − δ by      ? p B?tmSA 2 2 B?tmSA kmV (s0) + x B? SAkm log + BemS A log , (20) 2 δ 2 δ where x > 0 is a large enough absolute constant (which can be retraced in the analysis leading to Eq. 18). 2 By scaling δ ← δ/(2Stm) for each m ≤ M, we have the following cumulative cost bound that holds for any initial state in S and any interval m ≤ M, with probability 1 − δ:

  2   2  ? p B?tmSA · 2Stm 2 2 B?tmSA · 2Stm Cm ≤ kmV (s0) + x B? SAkm log + BemS A log 2 δ 2 δ      p B?tmSA 2 2 B?tmSA ≤ kmB? + 3x B? SAkm log + BemS A log , 2 δ 2 δ

Since we are in the case of Bem ≥ B?, we have ! !! p BemtmSA 2 2 BemtmSA Cm ≤ kmBem + 3x Bem SAkm log + BemS A log . (21) 2 δ 2 δ

Since costs are non-negative, for any t ≤ tm, we have Ct ≤ Cm hence Ct must also satisfy the bound of Eq. 21. There remains to predict the values of km, tm, Bem, given the current kcur, tcur, Becur. The upper bounds for km and Bem are kcur and Becur respectively, since they can only be incremented when reaching the goal g, which is a condition for ending the current interval. The upper bound for tm can be derived using the P P pigeonhole principle: since tcur = (s,a)∈S×A n(s, a), we know that 2tcur > (s,a)∈S×A(2n(s, a) − 1). Thus by time step 2tcur there must exist a trigger condition, which is a condition for ending the current interval.

39 Hence, by replacing km ← kcur, Bem ← Becur and tm ← 2tcur in Eq. 21, we get, with probability at least 1 − δ, that the cumulative cost within a phase that starts with Be ≥ B? has the following anytime upper bound ! !! p 2BecurtcurSA 2 2 2BecurtcurSA Ctcur ≤ kcurBecur + 3x Becur SAkcur log + BecurS A log . 2 δ 2 δ

Note that this bound corresponds exactly to the cumulative cost threshold Cbound in Eq. 26. This means that with probability at least 1 − δ, the first halting condition cannot be met in a phase that starts with Be ≥ B?.

Property 2: Optimism avoids the second halting condition. Let us consider the case of Be ≥ B? whenever the algorithm re-plans (i.e., running VISGO procedure). The proof of Lem. 15 ensures that at any (i) iteration, kV k∞ ≤ B? ≤ Be, so the second halting condition is never met under the same high-probability event as above.

Implications. The two properties above indicate that, if a phase starts with estimate $\widetilde{B} \ge B_\star$, then with probability at least $1 - \delta$ this phase will never halt due to the two halting conditions (it can only terminate if it completes the final episode $K$), and Alg. 3 will thus never enter a new phase. Due to the doubling increment of $\widetilde{B}$ every time a phase ends, we can therefore bound the total number of phases as $\Phi \le \lceil\log_2(B_\star)\rceil + 1$.

Analysis. We now split the analysis of the regret contributions of the episodes in two regimes. To this end, let $\kappa_\star := \lceil B_\star^2 S^3 A\rceil$ denote a special episode (note that it is unknown to the learner since it depends on $B_\star$). We consider that the high-probability event mentioned above holds (which is the case with probability at least $1 - \delta$). Recall that at the beginning of each episode $k$, the algorithm sets $\widetilde{B} \leftarrow \max\{\widetilde{B},\; \sqrt{k}/(S^{3/2}A^{1/2})\}$.

x Regret contribution in the first regime (i.e., episodes k < κ?).

We denote respectively by R1→κ? and C1→κ? the cumulative regret and the cumulative cost incurred by the algorithm before episode κ? begins. For any phase φ, we denote by C(φ) both φ k < κ • 1→κ? the cumulative cost incurred during time steps that are in phase and in an episode ?; • k(φ) the episode when phase φ ends; • t(φ) the time step when phase φ ends; • Be(φ) the value of Be at the end of phase φ. Observe that Φ X (φ) C = C . 1→κ? 1→κ? φ=1

? Now, by definition of κ , the episode-driven increment of Be never exceeds B?, unless Be is already ≥ B? at the beginning of the phase. But Property 1 ensures that, if Be ≥ B? in the beginning of a phase, then Be will never be doubled afterwards. Hence, we are guaranteed that within the episodes k < κ?, the final value of the estimate Be is at most 2B?. Since PHASE tracks the cumulative cost at each step using the threshold in Eq. 26 and since ct ≤ 1, by the fact that Cbound is monotonously increasing with respect to t, we have that for any phase φ,

(φ) (φ) ! (φ) (φ) !! (φ) p 2Be t SA 2 2Be t SA C ≤ k(φ)Be(φ) + 3x Be(φ) SAk(φ) log + Be(φ)S2A log + 1 1→κ? 2 δ 2 δ      p 2(2B?)TSA 2(2B?)TSA ≤ κ (2B ) + 3x (2B ) SAκ log + (2B )S2A log2 + 1 ? ? ? ? 2 δ ? 2 δ  B TSA B TSA ≤ O B3S3A + B2S2A log ? + B S2A log2 ? . ? ? δ ? δ

40 ? In addition, we recall that Φ ≤ dlog2(B?)e + 1. Hence, by plugging in the definition of κ , we can bound the cost (and thus the regret) accumulated over the episodes k < κ? as follows

dlog2(B?)e+1      X B?TSA B?TSA R ≤ C ≤ O B3S3A + B2S2A log + B S2A log2 1→κ? 1→κ? ? ? δ ? δ φ=1  B TSA B TSA  ≤ O B3S3A log(B ) + B2S2A log ? log(B ) + B S2A log2 ? log(B ) ? ? ? δ ? ? δ ? 3 3 2 2 2 2 3 ≤ O B?S Aι + B?S Aι + B?S Aι .

y Regret contribution in the second regime (i.e., episodes k ≥ κ?).

We denote respectively by Rκ?→K and Cκ?→K the cumulative regret and the cumulative cost incurred during ? ? the episodes k ≥ κ . By definition of κ , the episode-driven increment of Be ensures that Be ≥ B?. During this second regime there may be at most two phases: one that started at an episode k < κ? (i.e., in the first regime) and that overlaps the two regimes, and one starting after that (note that properties 1 and 2 ensure that at this point neither halting condition can end this phase since it started with estimate Be ≥ B?, thus it lasts until the end of the learning interaction). In addition, we can upper bound B as follows √ e n 2 K o Be ≤ max 2B?, . S3/2A1/2 We now introduce a fourth condition of stopping an interval to the analysis performed in Sect. 5.3: (4) an interval ends when a subroutine PHASE ends. This implies that the policy always stays the same within an interval when running Alg.3. Condition (4) is met at most once in the second regime. 0 We now focus on only the second regime: we re-index intervals by 1, 2,...,M and let Tm denote the time

To bound $R_{\kappa_\star\to K}$, we need to adapt the proofs in Sect. 5.5 and App. E.3 to be compatible with our new interval decomposition. Concretely, there are three slight modifications in the analysis of the second regime:
• Statistics: for any statistic (i.e., $N(s,a,s')$, $\theta(s,a)$ and $\widehat{c}(s,a)$ for any $(s,a,s') \in S \times A \times S'$), instead of learning from scratch, PHASE reuses all samples collected thus far. This difference affects neither the regret bound nor the probability, since it can be viewed as taking a partial sum of terms in $\widetilde{R}_{M'}$.
• The regret decomposition: in the proof of Lem. 31, we need to incorporate condition (4), which is met at most once during the second regime. It falls into case (iii) in the proof of Lem. 31, which thus happens at most $2SA\log_2(T_{M'}) + 1$ times, and the regret decomposition becomes
\[
\widetilde{R}_{M'} \le X_1(M') + X_2(M') + X_3(M') + 2B_\star S A \log_2(T_{M'}) + B_\star.
\]

• Interval number: since condition (4) is met at most once, the bound on $M'$ in the second regime is slightly modified as
\begin{align}
M' &\le \frac{C_{\kappa_\star\to K}}{B_\star} + K - \kappa_\star + 1 + 2SA\log_2(T_{M'}) + 1 \nonumber \\
&\le \frac{C_{\kappa_\star\to K}}{B_\star} + K + 2SA\log_2(T) + 1. \tag{22}
\end{align}

Hence, by incorporating these slight modifications in the proof of Thm. 3, we get with probability at least $1-\delta$ that
\begin{align*}
R_{\kappa_\star\to K} &\le O\left( B_\star \sqrt{SAK} \log\Big(\frac{B_\star T S A}{\delta}\Big) + S^2 A\, \widetilde{B}_{M'} \log^2\Big(\frac{B_\star T S A}{\delta}\Big) \right) \\
&\le O\left( B_\star \sqrt{SAK} \log\Big(\frac{B_\star T S A}{\delta}\Big) + S^2 A \frac{\sqrt{K}}{S^{3/2}A^{1/2}} \log^2\Big(\frac{B_\star T S A}{\delta}\Big) \right) \\
&\le O\left( B_\star \sqrt{SAK}\,\iota + \sqrt{SAK}\,\iota^2 \right).
\end{align*}

③ Combining the regret contributions in the two regimes.

The overall regret is bounded with probability at least $1-\delta$ by
\[
R_K = R_{1\to\kappa_\star} + R_{\kappa_\star\to K} \le O\left( B_\star \sqrt{SAK}\,\iota + \sqrt{SAK}\,\iota^2 + B_\star^3 S^3 A\, \iota + B_\star^2 S^2 A\, \iota^2 + B_\star S^2 A\, \iota^3 \right).
\]

This concludes the proof of Thm. 9.
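As a quick sanity check of how this combined bound scales, the sketch below evaluates its five terms (dropping the absolute constants hidden in the $O(\cdot)$) for a hypothetical instance; multiplying $K$ by 100 multiplies the total by roughly 10 once the $\sqrt{K}$ terms dominate, consistent with the $\sqrt{K}$ regret rate, while the last three terms are independent of $K$.

```python
import math

def regret_bound_terms(B_star, S, A, K, T, delta):
    # iota stands for the polylog factor log(B_star * T * S * A / delta).
    iota = math.log(B_star * T * S * A / delta)
    return [
        B_star * math.sqrt(S * A * K) * iota,   # sqrt(K) term at the minimax rate
        math.sqrt(S * A * K) * iota**2,         # extra sqrt(K) term with polylog factors
        B_star**3 * S**3 * A * iota,            # K-independent (first regime)
        B_star**2 * S**2 * A * iota**2,         # K-independent (first regime)
        B_star * S**2 * A * iota**3,            # K-independent (first regime)
    ]

# Hypothetical instance, for illustration only.
B_star, S, A, T, delta = 5.0, 5, 2, 10**8, 0.1
for K in [10**6, 10**8, 10**10]:
    total = sum(regret_bound_terms(B_star, S, A, K, T, delta))
    print(f"K = {K:.0e}: bound ~ {total:.3e}")
```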

Remark 4. At a high level, our analysis to circumvent the knowledge of $B_\star$ boils down to the following argument: if the estimate is too small, we bound the regret by the cumulative cost; otherwise, if it is large enough, we recover the regret bound under a known upper bound on $B_\star$. Interestingly, this somewhat resembles the reasoning behind the schemes for unknown SSP-diameter $D$ in the adversarial SSP algorithms of Rosenberg and Mansour [2020, App. I] and Chen and Luo [2021, App. E] (recall that $D := \max_{s\in S} \min_{\pi\in\Pi_{\mathrm{proper}}} T^{\pi}(s)$ and that $B_\star \le D \le T_\star$). Note however that these schemes change their algorithms' structure: whenever the agent is in a state that is insufficiently visited, it executes the Bernstein-SSP algorithm of Rosenberg et al. [2020] with unit costs until the goal is reached. In other words, these schemes first learn to reach the goal (regardless of the costs) and only then focus on minimizing the costs to the goal. In contrast, our scheme for unknown $B_\star$ targets the original SSP objective from the start and does not fundamentally alter our algorithm EB-SSP with known $B_\star$. Indeed, the only additions in parameter-free EB-SSP are a dual tracking of the cumulative costs and VISGO ranges, and a careful increment of the estimate $\widetilde{B}$ in the bonus.

Algorithm 2: Subroutine VISGO

1 Input: $\epsilon_{\mathrm{VI}}$.
2 Global constants: $S$, $A$, $s_0 \in S$, $g \notin S$, $\eta$.
3 Global variables: $t$, $j$, $N(\cdot)$, $n(\cdot)$, $\widehat{P}$, $\theta(\cdot)$, $\widehat{c}(\cdot)$, $Q(\cdot)$, $V(\cdot)$.
4 For all $(s,a,s') \in S \times A \times S'$, set
\[
\widetilde{P}_{s,a,s'} \leftarrow \frac{n(s,a)}{n(s,a)+1}\,\widehat{P}_{s,a,s'} + \frac{\mathbb{I}[s'=g]}{n(s,a)+1}.
\]
5 For all $(s,a) \in S \times A$, set $n^+(s,a) \leftarrow \max\{n(s,a), 1\}$, $\iota_{s,a} \leftarrow \ln\Big(\frac{12 S A S' [n^+(s,a)]^2}{\delta}\Big)$.
6 Set $i \leftarrow 0$, $V^{(0)} \leftarrow 0$, $V^{(-1)} \leftarrow +\infty$.
7 while $\|V^{(i)} - V^{(i-1)}\|_\infty > \epsilon_{\mathrm{VI}}$ do
8   For all $(s,a) \in S \times A$, set
\begin{align}
b^{(i+1)}(s,a) &\leftarrow \max\left\{ c_1 \sqrt{\frac{\mathbb{V}(\widehat{P}_{s,a}, V^{(i)})\,\iota_{s,a}}{n^+(s,a)}},\; c_2 \frac{\widetilde{B}\,\iota_{s,a}}{n^+(s,a)} \right\} + c_3 \sqrt{\frac{\widehat{c}(s,a)\,\iota_{s,a}}{n^+(s,a)}} + c_4 \frac{\widetilde{B}\sqrt{S'\,\iota_{s,a}}}{n^+(s,a)}, \tag{23} \\
Q^{(i+1)}(s,a) &\leftarrow \max\left\{ \widehat{c}(s,a) + \widetilde{P}_{s,a} V^{(i)} - b^{(i+1)}(s,a),\; 0 \right\}, \tag{24} \\
V^{(i+1)}(s) &\leftarrow \min_a Q^{(i+1)}(s,a). \tag{25}
\end{align}
9   Set $i \leftarrow i+1$.
10  if $\|V^{(i)}\|_\infty > \widetilde{B}$ then
11    \\ Second halting condition: VISGO range exceeds threshold
12    return Fail, $Q^{(i)}$, $V^{(i)}$.
13 return Success, $Q^{(i)}$, $V^{(i)}$.
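For concreteness, here is a minimal Python/NumPy sketch of the VISGO iteration under simplifying, hypothetical conventions (the goal state is assumed to be the last index of the transition arrays, and the default constants match $c_1=6$, $c_2=36$, $c_3=c_4=2\sqrt{2}$ of Alg. 4). It illustrates the skewed transitions of line 4, the updates of Eqs. 23–25 and the second halting condition; it is not a faithful reimplementation of Alg. 2.

```python
import numpy as np

def visgo(P_hat, c_hat, n, B_tilde, eps_vi, delta,
          c1=6.0, c2=36.0, c3=2 * np.sqrt(2), c4=2 * np.sqrt(2)):
    """Sketch of the VISGO subroutine (Alg. 2).

    Hypothetical conventions, for illustration only:
    - non-goal states are 0..S-1 and the goal g is the last index of axis 2;
    - P_hat has shape (S, A, S+1) (empirical transitions over S' = S + 1 states);
    - c_hat has shape (S, A) (empirical costs), n has shape (S, A) (visit counts).
    Returns (success_flag, Q, V), where V is defined over the S' states.
    """
    S, A, Sp = P_hat.shape
    n_plus = np.maximum(n, 1)
    iota = np.log(12 * S * A * Sp * n_plus**2 / delta)          # line 5

    # Line 4: skewed transitions that redirect a little mass towards the goal.
    P_tilde = P_hat * (n[..., None] / (n[..., None] + 1.0))
    P_tilde[:, :, -1] += 1.0 / (n + 1.0)

    V = np.zeros(Sp)                    # V^{(0)} = 0, and V(g) is kept at 0
    V_prev = np.full(Sp, np.inf)        # V^{(-1)} = +infinity
    Q = np.zeros((S, A))
    while np.max(np.abs(V - V_prev)) > eps_vi:                  # line 7
        # Eq. 23: exploration bonus (variance of V under P_hat, plus cost/range terms).
        var_V = np.maximum((P_hat * V**2).sum(-1) - ((P_hat * V).sum(-1))**2, 0.0)
        bonus = (np.maximum(c1 * np.sqrt(var_V * iota / n_plus),
                            c2 * B_tilde * iota / n_plus)
                 + c3 * np.sqrt(c_hat * iota / n_plus)
                 + c4 * B_tilde * np.sqrt(Sp * iota) / n_plus)
        Q = np.maximum(c_hat + (P_tilde * V).sum(-1) - bonus, 0.0)   # Eq. 24
        V_prev = V
        V = np.append(Q.min(axis=1), 0.0)                            # Eq. 25
        if np.max(V) > B_tilde:
            return False, Q, V          # second halting condition (lines 10-12)
    return True, Q, V                   # line 13
```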

Algorithm 3: Algorithm for unknown $B_\star$: Parameter-free EB-SSP

1 Input: $S$, $s_0 \in S$, $g \notin S$, $A$, $\delta$.
2 Optional input: cost perturbation $\eta \in [0, 1]$.
3 Set up global constants: $S$, $A$, $s_0 \in S$, $g \notin S$, $\eta$.
4 Set up global variables: $t$, $j$, $N(\cdot)$, $n(\cdot)$, $\widehat{P}$, $\theta(\cdot)$, $\widehat{c}(\cdot)$, $Q(\cdot)$, $V(\cdot)$.
5 Set estimate $\widetilde{B} \leftarrow 1$.
6 Set current starting state $s_{\mathrm{start}} \leftarrow s_0$.
7 Set $t \leftarrow 1$, $k \leftarrow 1$, $j \leftarrow 0$.
8 For $(s,a,s') \in S \times A \times S'$, set $N(s,a) \leftarrow 0$; $n(s,a) \leftarrow 0$; $N(s,a,s') \leftarrow 0$; $\widehat{P}_{s,a,s'} \leftarrow 0$; $\theta(s,a) \leftarrow 0$; $\widehat{c}(s,a) \leftarrow 0$; $Q(s,a) \leftarrow 0$; $V(s) \leftarrow 0$.
9 Set phase counter $\phi \leftarrow 1$.
10 while True do
11   Set $(s_{\mathrm{cur}}, \widetilde{B}_{\mathrm{cur}}, k_{\mathrm{cur}}) \leftarrow$ PHASE$(s_{\mathrm{start}}, \widetilde{B}, k)$ (Alg. 4).
12   \\ PHASE halts because of $B_\star$ underestimation, entering a new phase
13   Set $s_{\mathrm{start}} \leftarrow s_{\mathrm{cur}}$, $k \leftarrow k_{\mathrm{cur}}$, $\widetilde{B} \leftarrow 2\widetilde{B}_{\mathrm{cur}}$, and increment phase index $\phi \leftarrow \phi + 1$.

Algorithm 4: Subroutine PHASE

1 Input: $s_{\mathrm{start}} \in S$, $\widetilde{B}$, $k$.
2 Global constants: $S$, $A$, $s_0 \in S$, $g \notin S$, $\eta$.
3 Global variables: $t$, $j$, $N(\cdot)$, $n(\cdot)$, $\widehat{P}$, $\theta(\cdot)$, $\widehat{c}(\cdot)$, $Q(\cdot)$, $V(\cdot)$.
4 Specify: trigger set $\mathcal{N} \leftarrow \{2^{j-1} : j = 1, 2, \ldots\}$; constants $c_1 = 6$, $c_2 = 36$, $c_3 = 2\sqrt{2}$, $c_4 = 2\sqrt{2}$; large enough absolute constant $x > 0$ (so that Eq. 20 holds, see App. F.3).
5 Set $C \leftarrow 0$. \\ Reinitialize cumulative cost tracker
6 for episode $k_{\mathrm{cur}} = k, k+1, \ldots$ do
7   if $\sqrt{k_{\mathrm{cur}}}/(S^{3/2}A^{1/2}) > \widetilde{B}$ then
8     Set $\widetilde{B} \leftarrow \sqrt{k_{\mathrm{cur}}}/(S^{3/2}A^{1/2})$, and set $j \leftarrow j+1$, $\epsilon_{\mathrm{VI}} \leftarrow 2^{-j}$.
9   Info, $Q$, $V$ $\leftarrow$ VISGO$(\epsilon_{\mathrm{VI}})$.
10  if Info = Fail then
11    \\ Second halting condition: VISGO range exceeds threshold
12    return $s_t$, $\widetilde{B}$, $k_{\mathrm{cur}}$.
13  Set $s_t \leftarrow \begin{cases} s_{\mathrm{start}}, & k_{\mathrm{cur}} = k, \\ s_0, & \text{otherwise.} \end{cases}$
14  while $s_t \ne g$ do
15    Take action $a_t = \arg\min_{a\in A} Q(s_t, a)$, incur cost $c_t$ and observe next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$.
16    Set $(s, a, s', c) \leftarrow (s_t, a_t, s_{t+1}, \max\{c_t, \eta\})$ and $t \leftarrow t+1$.
17    Set $N(s,a) \leftarrow N(s,a)+1$, $\theta(s,a) \leftarrow \theta(s,a)+c$, $C \leftarrow C+c$, $N(s,a,s') \leftarrow N(s,a,s')+1$, and set
\begin{equation}
C_{\mathrm{bound}} \leftarrow k_{\mathrm{cur}}\widetilde{B} + 3x\left( \widetilde{B}\sqrt{S A k_{\mathrm{cur}}}\log\Big(\frac{2\widetilde{B} t S A}{\delta}\Big) + \widetilde{B} S^2 A \log^2\Big(\frac{2\widetilde{B} t S A}{\delta}\Big) \right). \tag{26}
\end{equation}
18    if $C > C_{\mathrm{bound}}$ then
19      \\ First halting condition: cumulative cost exceeds threshold
20      return $s_t$, $\widetilde{B}$, $k_{\mathrm{cur}}$.
21    if $N(s,a) \in \mathcal{N}$ then
22      Set $\widehat{c}(s,a) \leftarrow \frac{2\theta(s,a)}{N(s,a)} \mathbb{I}[N(s,a) \ge 2] + \mathbb{I}[N(s,a) = 1]\,\theta(s,a)$ and $\theta(s,a) \leftarrow 0$.
23      For all $s' \in S'$, set $\widehat{P}_{s,a,s'} \leftarrow N(s,a,s')/N(s,a)$, $n(s,a) \leftarrow N(s,a)$, and set $j \leftarrow j+1$, $\epsilon_{\mathrm{VI}} \leftarrow 2^{-j}$.
24      Info, $Q$, $V$ $\leftarrow$ VISGO$(\epsilon_{\mathrm{VI}})$.
25      if Info = Fail then
26        \\ Second halting condition: VISGO range exceeds threshold
27        return $s_t$, $\widetilde{B}$, $k_{\mathrm{cur}}$.
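To see why the factor 2 appears in the cost-estimator update of line 22, note that $\theta(s,a)$ only accumulates the costs observed since the previous trigger, i.e., the most recent $N(s,a)/2$ visits once $N(s,a) \ge 2$. The following Python sketch (a single hypothetical $(s,a)$ pair, with synthetic random costs) checks that the update $2\theta/N$ indeed recovers the empirical mean of that most recent half of the samples.

```python
import random

def simulate_cost_estimator(costs):
    """Minimal sketch of the cost-estimator update of Alg. 4 (lines 21-22) for a
    single hypothetical (s, a) pair: theta accumulates costs since the last trigger,
    and whenever the visit count N hits the trigger set {1, 2, 4, 8, ...}, c_hat is
    refreshed as 2*theta/N (theta then holds the most recent N/2 samples), or as
    theta itself when N = 1."""
    N, theta, c_hat = 0, 0.0, 0.0
    trigger = {2**j for j in range(30)}   # {1, 2, 4, ...}
    for c in costs:
        N += 1
        theta += c
        if N in trigger:
            c_hat = 2 * theta / N if N >= 2 else theta
            theta = 0.0
    return c_hat

random.seed(0)
costs = [random.random() for _ in range(1000)]
# The last trigger is N = 512, so the estimate is the average of visits 257..512.
print(simulate_cost_estimator(costs), sum(costs[256:512]) / 256)
```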
