
The Thirty-Fifth AAAI Conference on Artificial Intelligence (AAAI-21)

Lipschitz Lifelong Reinforcement Learning

Erwan Lecarpentier¹,², David Abel³, Kavosh Asadi³,⁴*, Yuu Jinnai³, Emmanuel Rachelson¹, Michael L. Littman³
¹ISAE-SUPAERO, Université de Toulouse, France
²ONERA, The French Aerospace Lab, Toulouse, France
³Brown University, Providence, Rhode Island, USA
⁴Amazon Web Services, Palo Alto, California, USA
[email protected]

*Kavosh Asadi finished working on this project before joining Amazon.
Copyright © 2021, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

We consider the problem of knowledge transfer when an agent is facing a series of Reinforcement Learning (RL) tasks. We introduce a novel metric between Markov Decision Processes (MDPs) and establish that close MDPs have close optimal value functions. Formally, the optimal value functions are Lipschitz continuous with respect to the task space. These theoretical results lead us to a value-transfer method for Lifelong RL, which we use to build a PAC-MDP algorithm with improved convergence rate. Further, we show the method to experience no negative transfer with high probability. We illustrate the benefits of the method in Lifelong RL experiments.

1 Introduction

Lifelong Reinforcement Learning (RL) is an online problem where an agent faces a series of RL tasks, drawn sequentially. Transferring knowledge from prior experience to speed up the resolution of new tasks is a key question in that setting (Lazaric 2012; Taylor and Stone 2009). We elaborate on the intuitive idea that similar tasks should allow a large amount of transfer. An agent able to compute online a similarity measure between source tasks and the current target task could be able to perform transfer accordingly. By measuring the amount of inter-task similarity, we design a novel method for value transfer, practically deployable in the online Lifelong RL setting. Specifically, we introduce a metric between MDPs and prove that the optimal Q-value function is Lipschitz continuous with respect to the MDP space. This property makes it possible to compute a provable upper bound on the optimal Q-value function of an unknown target task, given the learned optimal Q-value function of a source task. Knowing this upper bound accelerates the convergence of an RMax-like algorithm (Brafman and Tennenholtz 2002), relying on an optimistic estimate of the optimal Q-value function. Overall, the proposed transfer method consists of computing online the distance between source and target tasks, deducing the upper bound on the optimal Q-value function of the target task, and using this bound to accelerate learning. Importantly, the method exhibits no negative transfer, i.e., it cannot cause performance degradation, as the computed upper bound provably does not underestimate the optimal Q-value.

Our contributions are as follows. First, we study theoretically the Lipschitz continuity of the optimal Q-value function in the task space by introducing a metric between MDPs (Section 3). Then, we use this continuity property to propose a value-transfer method based on a local distance between MDPs (Section 4). Full knowledge of both MDPs is not required and the transfer is non-negative, which makes the method applicable online and safe. In Section 4.3, we build a PAC-MDP algorithm called Lipschitz RMax, applying this transfer method in the online Lifelong RL setting. We provide sample and computational complexity bounds and showcase the algorithm in Lifelong RL experiments (Section 5).

2 Background and Related Work

Reinforcement Learning (RL) (Sutton and Barto 2018) is a framework for sequential decision making. The problem is typically modeled as a Markov Decision Process (MDP) (Puterman 2014) consisting of a 4-tuple ⟨S, A, R, T⟩, where S is a state space, A an action space, R_s^a is the expected reward of taking action a in state s, and T_ss'^a is the transition probability of reaching state s' when taking action a in state s. Without loss of generality, we assume R_s^a ∈ [0, 1]. Given a discount factor γ ∈ [0, 1), the expected cumulative return Σ_t γ^t R_{s_t}^{a_t} obtained along a trajectory starting with state s and action a using policy π in MDP M is denoted by Q_M^π(s, a) and called the Q-function. The optimal Q-function Q*_M is the highest attainable expected return from s, a, and V*_M(s) = max_{a∈A} Q*_M(s, a) is the optimal value function in s. Notice that R_s^a ≤ 1 implies Q*_M(s, a) ≤ 1/(1−γ) for all s, a ∈ S × A. This maximum upper bound is used by the RMax algorithm as an optimistic initialization of the learned Q-function. A key point to reduce the sample complexity of this algorithm is to benefit from a tighter upper bound, which is the purpose of our transfer method.

Lifelong RL (Silver, Yang, and Li 2013; Brunskill and Li 2014) is the problem of experiencing online a series of MDPs drawn from an unknown distribution. Each time an MDP is sampled, a classical RL problem takes place where the agent is able to interact with the environment to maximize its expected return. In this setting, it is reasonable to think that knowledge gained on previous MDPs could be re-used to improve the performance in new MDPs.
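To fix notation in code, the sketch below (an illustration, not taken from the paper) represents a tabular MDP M = ⟨R, T⟩ with NumPy arrays and computes Q*_M by value iteration; the array shapes and the helper name `value_iteration` are assumptions. It also checks the 1/(1−γ) bound that RMax uses as its optimistic initialization.

```python
import numpy as np

def value_iteration(R, T, gamma, tol=1e-8):
    """Compute Q* for a tabular MDP with rewards R of shape (S, A) in [0, 1]
    and transition probabilities T of shape (S, A, S)."""
    Q = np.zeros(R.shape)
    while True:
        V = Q.max(axis=1)                  # V(s) = max_a Q(s, a)
        Q_new = R + gamma * (T @ V)        # Bellman optimality backup
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

# Example: a small random MDP, just to exercise the function.
rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.9
R = rng.uniform(0.0, 1.0, size=(n_states, n_actions))
T = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
Q_star = value_iteration(R, T, gamma)
assert Q_star.max() <= 1.0 / (1.0 - gamma)  # the optimistic value RMax starts from
```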

In this paper, we provide a novel method for such transfer by characterizing the way the optimal Q-function can evolve across tasks. As commonly done (Wilson et al. 2007; Brunskill and Li 2014; Abel et al. 2018), we restrict the scope of the study to the case where sampled MDPs share the same state-action space S × A. For brevity, we will refer indifferently to MDPs, models or tasks, and write them M = ⟨R, T⟩.

Using a metric between MDPs has the appealing characteristic of quantifying the amount of similarity between tasks, which intuitively should be linked to the amount of transfer achievable. Song et al. (2016) define a metric based on the bi-simulation metric introduced by Ferns, Panangaden, and Precup (2004) and the Wasserstein metric (Villani 2008). Value transfer is performed between states with low bi-simulation distances. However, this metric requires knowing both MDPs completely and is thus unusable in the Lifelong RL setting, where we expect to perform transfer before having learned the current MDP. Further, the transfer technique they propose does allow negative transfer (see Appendix, Section 1). Carroll and Seppi (2005) also define a value-transfer method based on a measure of similarity between tasks. However, this measure is not computable online and thus not applicable to the Lifelong RL setting. Mahmud et al. (2013) and Brunskill and Li (2013) propose MDP clustering methods, respectively using a metric quantifying the regret of running the optimal policy of one MDP in the other MDP, and the L1 distance between the MDP models. An advantage of clustering is to prune the set of possible source tasks. They use their approach for policy transfer, which differs from the value-transfer method proposed in this paper. Ammar et al. (2014) learn the model of a source MDP and view the prediction error on a target MDP as a dissimilarity measure in the task space. Their method makes use of samples from both tasks and is not readily applicable to the online setting considered in this paper. Lazaric, Restelli, and Bonarini (2008) provide a practical method for sample transfer, computing a similarity metric reflecting the probability of the models being identical. Their approach is applicable in a batch RL setting, as opposed to the online setting considered in this paper. The approach developed by Sorg and Singh (2009) is very similar to ours in the sense that they prove bounds on the optimal Q-function for new tasks, assuming that both MDPs are known and that a soft homomorphism exists between the state spaces. Brunskill and Li (2013) also provide a method that can be used for Q-function bounding in multi-task RL.

Abel et al. (2018) present the MaxQInit algorithm, providing transferable bounds on the Q-function with high probability while preserving PAC-MDP guarantees (Strehl, Li, and Littman 2009). Given a set of solved tasks, they derive the probability that the maximum over the Q-values of previous MDPs is an upper bound on the current task's optimal Q-function. This approach results in a method for non-negative transfer with high probability once enough tasks have been sampled. The method developed by Abel et al. (2018) is similar to ours in two fundamental points: first, a theoretical upper bound on optimal Q-values across the MDP space is built; secondly, this provable upper bound is used to transfer knowledge between MDPs by replacing the maximum 1/(1−γ) bound in an RMax-like algorithm, providing PAC guarantees. The difference between the two approaches is illustrated in Figure 1, where the MaxQInit bound is the one developed by Abel et al. (2018), and the LRMax bound is the one we present in this paper. On this figure, the essence of the LRMax bound is noticeable. It stems from the fact that the optimal Q-value function is locally Lipschitz continuous in the MDP space w.r.t. a specific pseudometric. Confirming the intuition, close MDPs w.r.t. this metric have close optimal Q-values. It should be noticed that no bound is uniformly better than the other, as intuited by Figure 1. Hence, combining all the bounds results in a tighter upper bound, as we will illustrate in experiments (Section 5). We first carry out the theoretical characterization of the Lipschitz continuity properties in the following section. Then, we build on this result to propose a practical transfer method for the online Lifelong RL setting.

Figure 1: The optimal Q-value function represented for a particular s, a pair across the MDP space. The RMax, MaxQInit and LRMax bounds are represented for three sampled MDPs.

3 Lipschitz Continuity of Q-Functions

The intuition we build on is that similar MDPs should have similar optimal Q-functions. Formally, this insight can be translated into a continuity property of the optimal Q-function over the MDP space ℳ. The remainder of this section mathematically formalizes this intuition, which will be used in the next section to derive a practical method for value transfer. To that end, we introduce a local pseudometric characterizing the distance between the models of two MDPs at a particular state-action pair. A reminder and a detailed discussion on the metrics used herein can be found in the Appendix, Section 2.

Definition 1. Given two tasks M = ⟨R, T⟩, M̄ = ⟨R̄, T̄⟩, and a function f : S → R+, we define the pseudometric between models at (s, a) ∈ S × A w.r.t. f as:

    D_sa^f(M, M̄) ≜ |R_s^a − R̄_s^a| + Σ_{s'∈S} f(s') |T_ss'^a − T̄_ss'^a|.    (1)

This pseudometric is relative to a positive function f. We implicitly cast this definition in the context of discrete state spaces. The extension to continuous spaces is straightforward but beyond the scope of this paper. For the sake of clarity in the remainder of this study, we introduce

    D_sa(M ∥ M̄) ≜ D_sa^{γV*_M̄}(M, M̄),

corresponding to the pseudometric between models with the particular choice of f = γV*_M̄.
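Definition 1 is a direct computation on tabular models. The following sketch (an illustration under the same tabular assumptions as the earlier snippet, not the authors' code) evaluates D_sa^f for every state-action pair at once and specializes f = γV*_M̄ to obtain D_sa(M ∥ M̄); all array names are assumptions.

```python
import numpy as np

def model_pseudometric(R, T, R_bar, T_bar, f):
    """Equation (1): D_sa^f(M, Mbar) = |R_s^a - Rbar_s^a| + sum_s' f(s') |T_ss'^a - Tbar_ss'^a|.
    R, R_bar: (S, A) rewards; T, T_bar: (S, A, S) transitions; f: (S,) non-negative weights.
    Returns an (S, A) array of local model distances."""
    return np.abs(R - R_bar) + np.abs(T - T_bar) @ f

def D_sa(R, T, R_bar, T_bar, V_bar_star, gamma):
    """D_sa(M || Mbar): Definition 1 with the particular choice f = gamma * V*_Mbar,
    e.g. V_bar_star = Q_bar_star.max(axis=1) from the value-iteration sketch above."""
    return model_pseudometric(R, T, R_bar, T_bar, gamma * V_bar_star)
```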

From this definition stems the following pseudo-Lipschitz continuity result.

Proposition 1 (Local pseudo-Lipschitz continuity). For two MDPs M, M̄, for all (s, a) ∈ S × A,

    |Q*_M(s, a) − Q*_M̄(s, a)| ≤ Δ_sa(M, M̄),    (2)

with the local MDP pseudometric Δ_sa(M, M̄) ≜ min{d_sa(M ∥ M̄), d_sa(M̄ ∥ M)}, where the local MDP dissimilarity d_sa(M ∥ M̄) is the unique solution to the following fixed-point equation for d_sa:

    d_sa = D_sa(M ∥ M̄) + γ Σ_{s'∈S} T_ss'^a max_{a'∈A} d_s'a',   ∀s, a.    (3)

All the proofs of the paper can be found in the Appendix. This result establishes that the distance between the optimal Q-functions of two MDPs at (s, a) ∈ S × A is controlled by a local dissimilarity between the MDPs. The latter follows a fixed-point equation (Equation 3), which can be solved by Dynamic Programming (DP) (Bellman 1957). Note that, although the local MDP dissimilarity d_sa(M ∥ M̄) is asymmetric, Δ_sa(M, M̄) is a pseudometric, hence the name pseudo-Lipschitz continuity. Notice that the policies in Equation 2 are the optimal ones for the two MDPs and thus are different. Proposition 1 is a mathematical result stemming from Definition 1 and should be distinguished from other frameworks of the literature that assume the continuity of the reward and transition models w.r.t. S × A (Rachelson and Lagoudakis 2010; Pirotta, Restelli, and Bascetta 2015; Asadi, Misra, and Littman 2018). This result establishes that the optimal Q-functions of two close MDPs, in the sense of Equation 1, are themselves close to each other. Hence, given Q*_M̄, the function

    s, a ↦ Q*_M̄(s, a) + Δ_sa(M, M̄)    (4)

can be used as an upper bound on Q*_M, with M an unknown MDP. This is the idea on which we construct a computable and transferable upper bound in Section 4. In Figure 1, the upper bound of Equation 4 is represented by the LRMax bound. Noticeably, we provide a global pseudo-Lipschitz continuity property, along with similar results for the optimal value function V*_M and the value function of a fixed policy. As these results do not directly serve the purpose of this article, we report them in the Appendix, Section 4.

4 Transfer Using the Lipschitz Continuity

A purpose of value transfer, when interacting online with a new MDP, is to initialize the value function and drive the exploration to accelerate learning. We aim to exploit value transfer in a method guaranteeing three conditions:

C1. the resulting algorithm is PAC-MDP;
C2. the transfer accelerates learning;
C3. the transfer is non-negative.

To achieve these conditions, we first present a transferable upper bound on Q*_M in Section 4.1. This upper bound stems from the Lipschitz continuity result of Proposition 1. Then, we propose a practical way to compute this upper bound in Section 4.2. Precisely, we propose a surrogate bound that can be calculated online in the Lifelong RL setting, without having explored the source and target tasks completely. Finally, we implement the method in an algorithm described in Section 4.3, and demonstrate formally that it meets conditions C1, C2 and C3. Improvements are discussed in Section 4.4.

4.1 A Transferable Upper Bound on Q*_M

From Proposition 1, one can naturally define a local upper bound on the optimal Q-function of an MDP given the optimal Q-function of another MDP.

Definition 2. Given two tasks M and M̄, for all (s, a) ∈ S × A, the Lipschitz upper bound on Q*_M induced by Q*_M̄ is defined as U_M̄(s, a) ≥ Q*_M(s, a) with:

    U_M̄(s, a) ≜ Q*_M̄(s, a) + Δ_sa(M, M̄).    (5)

The optimism in the face of uncertainty principle leads to considering that the long-term expected return from any state is the maximum return 1/(1−γ), unless proven otherwise. Particularly, the RMax algorithm (Brafman and Tennenholtz 2002) explores an MDP so as to shrink this upper bound. RMax is a model-based, online RL algorithm with PAC-MDP guarantees (Strehl, Li, and Littman 2009), meaning that convergence to a near-optimal policy is guaranteed in a polynomial number of missteps with high probability. It relies on an optimistic model initialization that yields an optimistic upper bound U on the optimal Q-function, then acts greedily w.r.t. U. By default, it takes the maximum value U(s, a) = 1/(1−γ), but any tighter upper bound is admissible. Thus, shrinking U with Equation 5 is expected to improve the learning speed or sample complexity for new tasks in Lifelong RL.

In RMax, during the resolution of a task M, S × A is split into a subset of known state-action pairs K and its complement K^c of unknown pairs. A state-action pair is known if the number of collected reward and transition samples allows estimating an ε-accurate model in L1-norm with probability higher than 1 − δ. We refer to ε and δ as the RMax precision parameters. This results in a threshold n_known on the number of visits n(s, a) to a pair s, a that are necessary to reach this precision. Given the experience of a set of m MDPs {M̄_1, ..., M̄_m}, we define the total bound as the minimum over all the induced Lipschitz bounds.

Proposition 2. Given a partially known task M = ⟨R, T⟩, the set of known state-action pairs K, and the set of Lipschitz bounds on Q*_M induced by previous tasks {U_M̄_1, ..., U_M̄_m}, the function Q defined below is an upper bound on Q*_M for all s, a ∈ S × A:

    Q(s, a) ≜  R_s^a + γ Σ_{s'∈S} T_ss'^a max_{a'∈A} Q(s', a')   if (s, a) ∈ K,
               U(s, a)                                           otherwise,        (6)

with U(s, a) = min{1/(1−γ), U_M̄_1(s, a), ..., U_M̄_m(s, a)}.

Commonly in RMax, Equation 6 is solved to a precision ε_Q via Value Iteration. This yields a function Q that is a valid heuristic (bound on Q*_M) for the exploration of MDP M.
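Equation 3 has the same γ-contraction structure as a Bellman backup, so it can be solved by the same fixed-point iteration as value iteration; the resulting Δ_sa gives the Lipschitz bound of Equation 5, which in turn enters the RMax-style backup of Equation 6. The sketch below is a minimal illustration under the tabular assumptions of the earlier snippets (array names and the boolean known-pair mask `K` are assumptions), not the authors' implementation.

```python
import numpy as np

def local_dissimilarity(D, T, gamma, tol=1e-8):
    """Solve Eq. (3): d_sa = D_sa(M || Mbar) + gamma * sum_s' T_ss'^a max_a' d_s'a'.
    D: (S, A) pseudometric D_sa(M || Mbar); T: (S, A, S) transitions of the first task in d(. || .)."""
    d = np.zeros_like(D)
    while True:
        d_new = D + gamma * (T @ d.max(axis=1))   # gamma-contraction, converges like value iteration
        if np.abs(d_new - d).max() < tol:
            return d_new
        d = d_new

def lipschitz_upper_bound(Q_bar_star, d_M_Mbar, d_Mbar_M):
    """Eq. (5): U_Mbar(s, a) = Q*_Mbar(s, a) + Delta_sa, with Delta_sa the min of both directions."""
    return Q_bar_star + np.minimum(d_M_Mbar, d_Mbar_M)

def rmax_backup(R, T, K, U, gamma, tol=1e-8):
    """Eq. (6): Bellman backups on known pairs (s, a) in K, the upper bound U elsewhere.
    K: (S, A) boolean mask; U: (S, A) bound, e.g. min over the Eq. (5) bounds and 1/(1-gamma)."""
    Q = np.where(K, 0.0, U)
    while True:
        Q_new = np.where(K, R + gamma * (T @ Q.max(axis=1)), U)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new
```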

4.2 A Computable Upper Bound on Q*_M

The key issue addressed in this section is how to actually compute U(s, a), particularly when both source and target tasks are partially explored. Consider two tasks M and M̄, on which vanilla RMax has been applied, yielding the respective sets of known state-action pairs K and K̄, along with the learned models M̂ = ⟨T̂, R̂⟩ and M̄̂ = ⟨T̄̂, R̄̂⟩, and the upper bounds Q and Q̄ respectively on Q*_M and Q*_M̄. Notice that, if K = ∅, then Q(s, a) = 1/(1−γ) for all s, a pairs. Conversely, if K^c = ∅, Q is an ε-accurate estimate of Q*_M in L1-norm with high probability. Equation 6 allows the transfer of knowledge from M̄ to M if U_M̄ can be computed. Unfortunately, the true model and optimal value functions, necessary to compute U_M̄, are only partially known (see Equation 5). Thus, we propose to compute a looser upper bound based on the learned models and value functions. First, we provide an upper bound D̂_sa(M ∥ M̄) on D_sa(M ∥ M̄) (Definition 1).

Proposition 3. Given two tasks M, M̄ and respectively K, K̄ the subsets of S × A where their models are known with accuracy ε in L1-norm with probability at least 1 − δ,

    Pr( D̂_sa(M ∥ M̄) ≥ D_sa(M ∥ M̄) ) ≥ 1 − δ,

with D̂_sa(M ∥ M̄) the upper bound on the pseudometric between models defined below for B = ε(1 + γ max_{s'} V̄(s')):

    D̂_sa(M ∥ M̄) ≜  D_sa^{γV̄}(M̂, M̄̂) + 2B               if (s, a) ∈ K ∩ K̄,
                    max_{μ̄∈ℳ} D_sa^{γV̄}(M̂, μ̄) + B      if (s, a) ∈ K ∩ K̄^c,
                    max_{μ∈ℳ} D_sa^{γV̄}(μ, M̄̂) + B      if (s, a) ∈ K^c ∩ K̄,        (7)
                    max_{μ,μ̄∈ℳ²} D_sa^{γV̄}(μ, μ̄)        if (s, a) ∈ K^c ∩ K̄^c.

Importantly, this upper bound D̂_sa(M ∥ M̄) can be calculated analytically (see Appendix, Section 7). This makes D̂_sa(M ∥ M̄) usable in the online Lifelong RL setting, where already explored tasks may be partially learned and little knowledge has been gathered on the current task. The magnitude of the B term is controlled by ε. In the case where no information is available on the maximum value of V̄, we have that B = ε/(1−γ). ε measures the accuracy with which the tasks are known: the smaller ε, the tighter the B bound. Note that V̄ is used as an upper bound on the true V*_M̄. In many cases, max_{s'} V*_M̄(s') ≤ 1/(1−γ); e.g., for stochastic shortest path problems, which feature rewards only upon reaching terminal states, we have that max_{s'} V*_M̄(s') = 1 and thus B = ε(1 + γ) is a tighter bound for transfer. Combining D̂_sa(M ∥ M̄) and Equation 3, one can derive an upper bound d̂_sa(M ∥ M̄) on d_sa(M ∥ M̄), detailed in Proposition 4.

Proposition 4. Given two tasks M, M̄ ∈ ℳ, and K the set of state-action pairs where (R, T) is known with accuracy ε in L1-norm with probability at least 1 − δ. If γ(1 + ε) < 1, the solution d̂_sa(M ∥ M̄) of the following fixed-point equation on d̂_sa (for all s, a ∈ S × A) is an upper bound on d_sa(M ∥ M̄) with probability at least 1 − δ:

    d̂_sa = D̂_sa(M ∥ M̄) +  γ ( Σ_{s'∈S} T̂_ss'^a max_{a'∈A} d̂_s'a' + ε max_{s',a'∈S×A} d̂_s'a' )   if s, a ∈ K,
                           γ max_{s',a'∈S×A} d̂_s'a'                                               otherwise.      (8)

Similarly as in Proposition 3, the condition γ(1 + ε) < 1 illustrates the fact that for a large return horizon (large γ), a high accuracy (small ε) is needed for the bound to be computable. Eventually, a computable upper bound on Q*_M given M̄ with high probability is given by

    Û_M̄(s, a) = Q̄(s, a) + min{ d̂_sa(M ∥ M̄), d̂_sa(M̄ ∥ M) }.    (9)

The associated upper bound on U(s, a) (Equation 6), given the set of previous tasks {M̄_i}_{i=1}^m, is defined by

    Û(s, a) = min{ 1/(1−γ), Û_M̄_1(s, a), ..., Û_M̄_m(s, a) }.    (10)

This upper bound can be used to transfer knowledge from a partially solved source task to a target task. If Û(s, a) ≤ 1/(1−γ) on a subset of S × A, then the convergence rate can be improved. As complete knowledge of both tasks is not needed to compute the upper bound, it can be applied online in the Lifelong RL setting. In the next section, we present an algorithm that leverages this value-transfer method.

4.3 Lipschitz RMax Algorithm

In Lifelong RL, MDPs are encountered sequentially. Applying RMax to task M yields the set of known state-action pairs K, the learned models T̂ and R̂, and the upper bound Q on Q*_M. Saving this information when the task changes allows computing the upper bound of Equation 10 for the new target task, and using it to shrink the optimistic heuristic of RMax. This computation effectively transfers value functions between tasks based on task similarity. As the new task is explored online, the task similarity is progressively assessed with better confidence, refining the values of D̂_sa(M ∥ M̄), d̂_sa(M ∥ M̄) and eventually Û, allowing for more efficient transfer where the task similarity is appraised. The resulting algorithm, Lipschitz RMax (LRMax), is presented in Algorithm 1. To avoid ambiguities with the set of previous tasks, we use ℳ̂ to store the learned features (T̂, R̂, K, Q) about previous MDPs.

In a nutshell, the behavior of LRMax is precisely that of RMax, but with a tighter admissible heuristic Û that becomes better as the new task is explored (while this heuristic remains constant in vanilla RMax). LRMax is PAC-MDP (Condition C1), as stated in Propositions 5 and 6 below. With S = |S| and A = |A|, the sample complexity of vanilla RMax is Õ(S²A / (ε³(1−γ)³)), which is improved by LRMax in Proposition 5 and meets Condition C2. Finally, Û is a provable upper bound with high probability on Q*_M, which avoids negative transfer and meets Condition C3.

Proposition 5 (Sample complexity (Strehl, Li, and Littman 2009)). With probability 1 − δ, the greedy policy w.r.t. Q computed by LRMax achieves an ε-optimal return in MDP M after

    Õ( S · |{ (s, a) ∈ S × A : Û(s, a) ≥ V*_M(s) − ε }| / (ε³(1−γ)³) )

samples (when logarithmic factors are ignored), with Û defined in Equation 10 a non-static, decreasing quantity, upper bounded by 1/(1−γ).
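Given the surrogate dissimilarities d̂ obtained from Equations 7 and 8, the bounds of Equations 9 and 10 that drive LRMax reduce to element-wise minima. The sketch below (illustrative; shapes and names are assumptions) combines the bounds induced by a list of stored source tasks into the heuristic Û that replaces the default 1/(1−γ) initialization; with no stored task it falls back to vanilla RMax.

```python
import numpy as np

def lipschitz_bound_hat(Q_bar, d_hat_M_Mbar, d_hat_Mbar_M):
    """Eq. (9): Uhat_Mbar(s, a) = Qbar(s, a) + min{ dhat(M || Mbar), dhat(Mbar || M) },
    where Qbar is the learned RMax upper bound on Q*_Mbar."""
    return Q_bar + np.minimum(d_hat_M_Mbar, d_hat_Mbar_M)

def combined_bound(per_task_bounds, gamma, n_states, n_actions):
    """Eq. (10): Uhat(s, a) = min{ 1/(1-gamma), Uhat_Mbar1(s, a), ..., Uhat_Mbarm(s, a) }.
    per_task_bounds: list of (S, A) arrays, one per stored source task (possibly empty)."""
    u_hat = np.full((n_states, n_actions), 1.0 / (1.0 - gamma))
    for u_bar in per_task_bounds:
        u_hat = np.minimum(u_hat, u_bar)
    return u_hat
```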

Algorithm 1: Lipschitz RMax algorithm

    Initialize the memory of previous tasks ℳ̂ = ∅.
    for each newly sampled MDP M do
        Initialize Q(s, a) = 1/(1−γ), ∀s, a, and K = ∅
        Initialize T̂ and R̂ (RMax initialization)
        Q ← UpdateQ(ℳ̂, T̂, R̂)
        for t ∈ [1, max number of steps] do
            s = current state, a = arg max_{a'} Q(s, a')
            Observe reward r and next state s'
            n(s, a) ← n(s, a) + 1
            if n(s, a) < n_known then
                Store (s, a, r, s')
            if n(s, a) = n_known then
                Update K and (T̂_ss'^a, R̂_s^a) (learned model)
                Q ← UpdateQ(ℳ̂, T̂, R̂)
        Save M̂ = ⟨T̂, R̂, K, Q⟩ in ℳ̂

    Function UpdateQ(ℳ̂, T̂, R̂):
        for M̄ ∈ ℳ̂ do
            Compute D̂_sa(M ∥ M̄), D̂_sa(M̄ ∥ M) (Eq. 7)
            Compute d̂_sa(M ∥ M̄), d̂_sa(M̄ ∥ M) (DP on Eq. 8)
            Compute Û_M̄ (Eq. 9)
        Compute Û (Eq. 10)
        Compute and return Q (DP on Eq. 6 using Û)

Proposition 5 shows that the sample complexity of LRMax is no worse than that of RMax. Consequently, in the worst case, LRMax performs as badly as learning from scratch, which is to say that the transfer method is not negative, as it cannot degrade the performance.

Proposition 6 (Computational complexity). The total computational complexity of LRMax (Algorithm 1) is

    Õ( (S³A²N / (1−γ)) · ( τ + ln( 1 / (ε_Q(1−γ)) ) ) )

with τ the number of interaction steps, ε_Q the precision of value iteration, and N the number of source tasks.

4.4 Refining the LRMax Bounds

LRMax relies on bounds on the local MDP dissimilarity (Equation 8). The quality of the Lipschitz bound on Q*_M can be improved according to the quality of those estimates. We discuss two methods to provide finer estimates.

Refining with prior knowledge. First, from the definition of D_sa(M ∥ M̄), it is easy to show that this pseudometric between models is always upper bounded by (1+γ)/(1−γ). However, in practice, the tasks experienced in a Lifelong RL experiment might not cover the full span of possible MDPs and may systematically be closer to each other than (1+γ)/(1−γ). For instance, the distance between two games in the Arcade Learning Environment (ALE) (Bellemare et al. 2013) is smaller than the maximum distance between any two MDPs defined on the common state-action space of the ALE (extended discussion in Appendix, Section 11). Let us note ℳ̃ ⊂ ℳ the set of possible MDPs for a particular Lifelong RL experiment. Let D_max(s, a) ≜ max_{M,M̄∈ℳ̃²} D_sa(M ∥ M̄) be the maximum model pseudo-distance at a particular s, a pair on the subset ℳ̃. Prior knowledge might indicate a smaller upper bound for D_max(s, a) than (1+γ)/(1−γ). We will note such an upper bound D_max, considered valid for all s, a pairs, i.e., such that D_max ≥ max_{s,a,M,M̄ ∈ S×A×ℳ̃²} D_sa(M ∥ M̄). In a Lifelong RL experiment, D_max can be seen as a rough estimate of the maximum model discrepancy an agent may encounter. Figure 2 illustrates the relative importance of D_max vs. (1+γ)/(1−γ). Solving Equation 8 boils down to accumulating D̂_sa(M ∥ M̄) values in d̂_sa(M ∥ M̄). Hence, reducing a D̂_sa(M ∥ M̄) estimate in a single s, a pair actually reduces d̂_sa(M ∥ M̄) in all s, a pairs. Thus, replacing D̂_sa(M ∥ M̄) in Equation 8 by min{D_max, D̂_sa(M ∥ M̄)} provides a smaller upper bound d̂_sa(M ∥ M̄) on d_sa(M ∥ M̄), and thus a smaller Û, which allows transfer if it is less than 1/(1−γ). Consequently, the knowledge of such a bound D_max can make the difference between successful and unsuccessful transfer, even though its exact value is of little importance. Conversely, setting a value for D_max quantifies the distance between MDPs where transfer is efficient.

Figure 2: Illustration of the prior knowledge on the maximum pseudo-distance between models for a particular s, a pair. [The figure sketches nested sets: the set ℳ of all MDPs, of diameter (1+γ)/(1−γ); the set ℳ̃ of possible MDPs in a Lifelong RL experiment, of diameter D_max; and the sampled MDPs within ℳ̃.]

Refining by learning the maximum distance. The value of D_max(s, a) can be estimated online for each s, a pair, discarding the hypothesis of available prior knowledge. We propose to use an empirical estimate of the maximum model distance at s, a: D̂_max(s, a) ≜ max_{M,M̄∈ℳ̂²} D̂_sa(M ∥ M̄), with ℳ̂ the set of explored tasks. The pitfall of this approach is that, with few explored tasks, D̂_max(s, a) could underestimate D_max(s, a). Proposition 7 provides a lower bound on the probability that D̂_max(s, a) + ε does not underestimate D_max(s, a), depending on the number of sampled tasks.

Proposition 7. Consider an algorithm producing ε-accurate model estimates D̂_sa(M ∥ M̄) for a subset K of S × A after interacting with any two MDPs M, M̄ ∈ ℳ. Assume D̂_sa(M ∥ M̄) ≥ D_sa(M ∥ M̄) for any s, a ∉ K. For all s, a ∈ S × A, δ ∈ (0, 1], after sampling m tasks, if m is large enough to verify 2(1 − p_min)^m − (1 − 2p_min)^m ≤ δ, then

    Pr( D̂_max(s, a) + ε ≥ D_max(s, a) ) ≥ 1 − δ.
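As a small illustration of how Proposition 7 and the D_max refinements of Section 4.4 can be used in practice (not code from the paper; p_min, delta and the array names are assumptions), the sketch below finds the smallest number of sampled tasks m satisfying 2(1 − p_min)^m − (1 − 2p_min)^m ≤ δ, and applies the clipping of D̂_sa by a bound on the maximum model distance (either a prior D_max or the online estimate D̂_max(s, a) + ε discussed next) before solving Equation 8.

```python
import numpy as np

def tasks_needed(p_min, delta, m_max=100_000):
    """Smallest m such that 2(1 - p_min)^m - (1 - 2 p_min)^m <= delta (condition of Proposition 7)."""
    for m in range(1, m_max + 1):
        if 2.0 * (1.0 - p_min) ** m - (1.0 - 2.0 * p_min) ** m <= delta:
            return m
    raise ValueError("condition not met within m_max sampled tasks")

def clipped_model_distance(D_hat, D_max_bound):
    """Replace Dhat_sa by min{ D_max_bound, Dhat_sa } before solving Eq. (8); D_max_bound can be
    a scalar prior D_max or the per-pair online estimate Dhat_max(s, a) + eps."""
    return np.minimum(D_max_bound, D_hat)

# Online estimate of the maximum model distance at each (s, a), updated after each new pair of tasks:
# D_hat_max = np.maximum(D_hat_max, D_hat_new_pair)
```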

This result indicates when D̂_max(s, a) + ε upper bounds D_max(s, a) with high probability. In such a case, D̂_sa(M ∥ M̄) of Equation 8 can be replaced by min{D̂_max(s, a) + ε, D̂_sa(M ∥ M̄)} to tighten the bound on d_sa(M ∥ M̄). Assuming a lower bound p_min on the sampling probability of a task implies that ℳ is finite, and is seen as a non-adversarial task sampling rule (Abel et al. 2018).

5 Experiments

The experiments reported here¹ illustrate how the Lipschitz bound (Equation 9) provides a tighter upper bound on Q*, improving the sample complexity of LRMax compared to RMax, and making the transfer of inter-task knowledge effective. Graphs are displayed with 95% confidence intervals. For information in line with the Machine Learning Reproducibility Checklist (Pineau 2019), see the Appendix, Section 16.

¹Code available at https://github.com/SuReLI/llrl

We evaluate different variants of LRMax in a Lifelong RL experiment. The RMax algorithm is used as a no-transfer baseline. LRMax(x) denotes Algorithm 1 with prior D_max = x. MaxQInit denotes the MAXQINIT algorithm from Abel et al. (2018), a state-of-the-art PAC-MDP algorithm. Both the LRMax and MaxQInit algorithms achieve value transfer by providing a tighter upper bound on Q* than 1/(1−γ). Computing both upper bounds and taking the minimum results in combining the two approaches. We include such a combination in our study with the LRMaxQInit algorithm. Similarly, the latter algorithm benefiting from prior knowledge D_max = x is denoted by LRMaxQInit(x). For the sake of comparison, we only compare algorithms with the same features, namely tabular, online, PAC-MDP methods presenting non-negative transfer.

The environment used in all experiments is a variant of the "tight" task used by Abel et al. (2018). It is an 11×11 grid-world; the initial state is in the centre, and the actions are the cardinal moves (Appendix, Section 12). The reward is always zero except for the three goal cells in the upper-right corner. Each sampled task has its own reward values, drawn from [0.8, 1] for each of the three goal cells, and its own probability of slipping (performing a different action than the one selected), picked in [0, 0.1]. Hence, tasks have different reward and transition functions. Notice the distinction in applicability between MaxQInit, which requires the set of MDPs to be finite, and LRMax, which can be used with any set of MDPs. For the comparison between both to be possible, we drew tasks from a finite set of 5 MDPs. We sample 15 tasks sequentially among this set, each run for 2000 episodes of length 10. The operation is repeated 10 times to narrow the confidence intervals. We set n_known = 10, δ = 0.05, and ε = 0.01 (discussion in Appendix, Section 15). Other Lifelong RL experiments are reported in Appendix, Section 13.

The results are reported in Figure 3. Figure 3a displays the discounted return for each task, averaged across episodes. Similarly, Figure 3b displays the discounted return for each episode, averaged across tasks (same color code as Figure 3a). Figure 3c displays the discounted return for five specific instances, detailed below. To avoid inter-task disparities, all the aforementioned discounted returns are displayed relative to an estimator of the optimal expected return for each task. For readability, Figures 3b and 3c display a moving average over 100 episodes. Figure 3d reports the benefits of various values of D_max on the algorithmic properties.

In Figure 3a, we first observe that LRMax benefits from the transfer method, as the average discounted return increases as more tasks are experienced. Moreover, this advantage appears as early as the second task. In contrast, MaxQInit has to wait until task 12 before benefiting from transfer. As suggested in Section 4.4, increasing amounts of prior knowledge allow the LRMax transfer method to be more efficient: a smaller known upper bound D_max on D̂_sa(M ∥ M̄) accelerates convergence. Combining both approaches in the LRMaxQInit algorithm outperforms all other methods. Episode-wise, we observe in Figure 3b that the LRMax transfer method allows for faster convergence, i.e., lower sample complexity. Interestingly, LRMax exhibits three stages in the learning process. 1) The first episodes are characterized by a direct exploitation of the transferred knowledge, causing these episodes to yield high payoff. This behavior is a consequence of the combined facts that the Lipschitz bound (Equation 9) is larger on promising regions of S × A seen on previous tasks and that LRMax acts greedily w.r.t. that bound. 2) This high-performance regime is followed by the exploration of unknown regions of S × A, in our case yielding low returns. Indeed, as promising regions are explored first, the bound becomes tighter for the corresponding state-action pairs, enough for the Lipschitz bound of unknown pairs to become larger, thus driving the exploration towards low-payoff regions. Such regions are then identified and never revisited. 3) Eventually, LRMax stops exploring and converges to the optimal policy. Importantly, in all experiments, LRMax never experiences negative transfer, as supported by the provability of the Lipschitz upper bound with high probability. LRMax is at least as efficient as the no-transfer RMax baseline.

Figure 3c displays the collected returns of RMax, LRMax(0.1), and MaxQInit for specific tasks. We observe that LRMax benefits from transfer as early as Task 2, where the previous three-stage behavior is visible. MaxQInit takes until task 12 to leverage the transfer method. However, the bound it provides is tight enough that it does not have to explore.

In Figure 3d, we display the following quantities for various values of D_max: ρ_Lip, the fraction of the time the Lipschitz bound was tighter than the RMax bound 1/(1−γ); ρ_Speed-up, the relative gain of time steps before convergence when comparing LRMax to RMax (this quantity is estimated based on the last updates of the empirical model); and ρ_Return, the relative total return gain on 2000 episodes of LRMax w.r.t. RMax. First, we observe an increase of ρ_Lip as D_max becomes tighter. This means that the Lipschitz bound of Equation 9 becomes effectively smaller than 1/(1−γ). This phenomenon leads to faster convergence, indicated by ρ_Speed-up. Eventually, this increased convergence rate allows for a net total return gain, as can be seen with the increase of ρ_Return.

Overall, in this analysis, we have shown that LRMax benefits from an enhanced sample complexity thanks to the value-transfer method. The knowledge of a prior D_max increases this benefit.
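For reference, here is a schematic of the lifelong protocol described above (5 candidate tasks, 15 sampled tasks, 2000 episodes of 10 steps each). The `make_env` and `agent` objects are hypothetical placeholders; this is a sketch of the experimental loop, not the released code linked in the footnote.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_task_parameters():
    """Each candidate task draws its own goal rewards in [0.8, 1] (three goal cells)
    and its own slip probability in [0, 0.1]."""
    return rng.uniform(0.8, 1.0, size=3), rng.uniform(0.0, 0.1)

def lifelong_run(make_env, agent, n_candidates=5, n_tasks=15, n_episodes=2000, horizon=10):
    """Draw tasks from a finite set of candidates and face them sequentially;
    the agent keeps its memory of previously explored tasks across task changes."""
    candidates = [sample_task_parameters() for _ in range(n_candidates)]
    returns = np.zeros((n_tasks, n_episodes))
    for task in range(n_tasks):
        goal_rewards, slip_prob = candidates[rng.integers(n_candidates)]
        env = make_env(goal_rewards, slip_prob)   # hypothetical 11x11 "tight" grid-world
        agent.new_task()                          # assumed hook: reset Q and K, keep stored tasks
        for episode in range(n_episodes):
            s = env.reset()
            for t in range(horizon):
                a = agent.act(s)
                s_next, r = env.step(a)           # assumed environment interface
                agent.observe(s, a, r, s_next)
                returns[task, episode] += (agent.gamma ** t) * r
                s = s_next
    return returns
```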

Figure 3: Experimental results. LRMax benefits from an enhanced sample complexity thanks to the value-transfer method. (a) Average relative discounted return vs. task number for RMax, MaxQInit, LRMax, LRMaxQInit, LRMax(0.2), LRMaxQInit(0.1) and LRMax(0.1). (b) Average relative discounted return vs. episode number (same algorithms). (c) Relative discounted return vs. episode number for specific tasks: RMax (Task 1), LRMax(0.1) (Tasks 1 and 2), MaxQInit (Tasks 11 and 12). (d) Algorithmic properties vs. prior knowledge D_max (known upper bound on max_{s,a} D_sa^{γV*_M̄}(M, M̄)): ρ_Lip (% use of the Lipschitz bound), ρ_Speed-up (% convergence speed-up), ρ_Return (% total return gain).

The method is comparable to the MaxQInit method and has some advantages, such as the early fitness for use and the applicability to infinite sets of tasks. Moreover, the transfer is non-negative while preserving the PAC-MDP guarantees of the algorithm. Additionally, we show in Appendix, Section 14 that, when provided with any prior D_max, LRMax increasingly stops using it during exploration, confirming the claim of Section 4.4 that providing D_max enables transfer even though its exact value is of little importance.

6 Conclusion

We have studied theoretically the Lipschitz continuity property of the optimal Q-function in the MDP space w.r.t. a new metric. We proved a local Lipschitz continuity result, establishing that the optimal Q-functions of two close MDPs are themselves close to each other. We then proposed a value-transfer method using this continuity property with the Lipschitz RMax algorithm, practically implementing this approach in the Lifelong RL setting. The algorithm preserves PAC-MDP guarantees, accelerates learning in subsequent tasks and exhibits no negative transfer. Improvements of the algorithm were discussed in the form of prior knowledge on the maximum distance between models and online estimation of this distance. As a non-negative, similarity-based, PAC-MDP transfer method, the LRMax algorithm is the first method of the literature combining those three appealing features. We showcased the algorithm in Lifelong RL experiments and demonstrated empirically its ability to accelerate learning while not experiencing any negative transfer. Notably, our approach can directly extend other PAC-MDP algorithms (Szita and Szepesvári 2010; Rao and Whiteson 2012; Pazis, Parr, and How 2016; Dann, Lattimore, and Brunskill 2017) to the Lifelong setting. In hindsight, we believe this contribution provides a sound basis for non-negative value transfer via MDP similarity, a study that was lacking in the literature. Key insights for the practitioner lie both in the theoretical analysis and in the practical derivation of a transfer scheme achieving non-negative transfer with PAC guarantees. Further, designing scalable methods conveying the same intuition could be a promising research direction.

Acknowledgments

We would like to thank Dennis Wilson for fruitful discussions and comments on the paper. This research was supported by the Occitanie region, France; ISAE-SUPAERO (Institut Supérieur de l'Aéronautique et de l'Espace); fondation ISAE-SUPAERO; École Doctorale Systèmes; and ONERA, the French Aerospace Lab.

References

Abel, D.; Jinnai, Y.; Guo, S. Y.; Konidaris, G.; and Littman, M. L. 2018. Policy and Value Transfer in Lifelong Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), 20–29.

Ammar, H. B.; Eaton, E.; Taylor, M. E.; Mocanu, D. C.; Driessens, K.; Weiss, G.; and Tuyls, K. 2014. An Automated Measure of MDP Similarity for Transfer in Reinforcement Learning. In Workshops at the 28th AAAI Conference on Artificial Intelligence (AAAI 2014).

Asadi, K.; Misra, D.; and Littman, M. L. 2018. Lipschitz Continuity in Model-Based Reinforcement Learning. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018).

Bellemare, M. G.; Naddaf, Y.; Veness, J.; and Bowling, M. 2013. The Arcade Learning Environment: An Evaluation Platform for General Agents. Journal of Artificial Intelligence Research 47: 253–279.

Bellman, R. 1957. Dynamic Programming. Princeton, USA: Princeton University Press.

Brafman, R. I.; and Tennenholtz, M. 2002. R-max - a General Polynomial Time Algorithm for Near-Optimal Reinforcement Learning. Journal of Machine Learning Research 3(Oct): 213–231.

Brunskill, E.; and Li, L. 2013. Sample Complexity of Multi-task Reinforcement Learning. In Proceedings of the 29th Conference on Uncertainty in Artificial Intelligence (UAI 2013).

Brunskill, E.; and Li, L. 2014. PAC-inspired Option Discovery in Lifelong Reinforcement Learning. In Proceedings of the 31st International Conference on Machine Learning (ICML 2014), 316–324.

Carroll, J. L.; and Seppi, K. 2005. Task Similarity Measures for Transfer in Reinforcement Learning Task Libraries. In Proceedings of the 5th International Joint Conference on Neural Networks (IJCNN 2005), volume 2, 803–808. IEEE.

Dann, C.; Lattimore, T.; and Brunskill, E. 2017. Unifying PAC and Regret: Uniform PAC Bounds for Episodic Reinforcement Learning. In Advances in Neural Information Processing Systems 30 (NeurIPS 2017), 5713–5723.

Ferns, N.; Panangaden, P.; and Precup, D. 2004. Metrics for Finite Markov Decision Processes. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI 2004), 162–169. AUAI Press.

Lazaric, A. 2012. Transfer in Reinforcement Learning: a Framework and a Survey. In Reinforcement Learning, 143–173. Springer.

Lazaric, A.; Restelli, M.; and Bonarini, A. 2008. Transfer of Samples in Batch Reinforcement Learning. In Proceedings of the 25th International Conference on Machine Learning (ICML 2008), 544–551.

Mahmud, M. M.; Hawasly, M.; Rosman, B.; and Ramamoorthy, S. 2013. Clustering Markov Decision Processes for Continual Transfer. Computing Research Repository (arXiv/CoRR). URL https://arxiv.org/abs/1311.3959.

Pazis, J.; Parr, R. E.; and How, J. P. 2016. Improving PAC Exploration Using the Median of Means. In Advances in Neural Information Processing Systems 29 (NeurIPS 2016), 3898–3906.

Pineau, J. 2019. Machine Learning Reproducibility Checklist. https://www.cs.mcgill.ca/~jpineau/ReproducibilityChecklist.pdf. Version 1.2, March 27, 2019, last accessed on August 27, 2020.

Pirotta, M.; Restelli, M.; and Bascetta, L. 2015. Policy Gradient in Lipschitz Markov Decision Processes. Machine Learning 100(2-3): 255–283.

Puterman, M. L. 2014. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.

Rachelson, E.; and Lagoudakis, M. G. 2010. On the Locality of Action Domination in Sequential Decision Making. In Proceedings of the 11th International Symposium on Artificial Intelligence and Mathematics (ISAIM 2010).

Rao, K.; and Whiteson, S. 2012. V-MAX: Tempered Optimism for Better PAC Reinforcement Learning. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2012), 375–382.

Silver, D. L.; Yang, Q.; and Li, L. 2013. Lifelong Machine Learning Systems: Beyond Learning Algorithms. In AAAI Spring Symposium: Lifelong Machine Learning, volume 13, 05.

Song, J.; Gao, Y.; Wang, H.; and An, B. 2016. Measuring the Distance Between Finite Markov Decision Processes. In Proceedings of the 15th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2016), 468–476.

Sorg, J.; and Singh, S. 2009. Transfer via Soft Homomorphisms. In Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2009), 741–748. International Foundation for Autonomous Agents and Multiagent Systems.

Strehl, A. L.; Li, L.; and Littman, M. L. 2009. Reinforcement Learning in Finite MDPs: PAC Analysis. Journal of Machine Learning Research 10(Nov): 2413–2444.

Sutton, R. S.; and Barto, A. G. 2018. Reinforcement Learning: An Introduction. MIT Press, Cambridge.

Szita, I.; and Szepesvári, C. 2010. Model-Based Reinforcement Learning with Nearly Tight Exploration Complexity Bounds. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), 1031–1038.

Taylor, M. E.; and Stone, P. 2009. Transfer Learning for Reinforcement Learning Domains: A Survey. Journal of Machine Learning Research 10(Jul): 1633–1685.

Villani, C. 2008. Optimal Transport: Old and New, volume 338. Springer Science & Business Media.

Wilson, A.; Fern, A.; Ray, S.; and Tadepalli, P. 2007. Multi-Task Reinforcement Learning: A Hierarchical Bayesian Approach. In Proceedings of the 24th International Conference on Machine Learning (ICML 2007), 1015–1022.
