
Momentum in Reinforcement Learning

Nino Vieillard (1,2), Bruno Scherrer (2), Olivier Pietquin (1), Matthieu Geist (1)
(1) Google Research, Brain Team  (2) Université de Lorraine, CNRS, Inria, IECL, F-54000 Nancy, France

Abstract

We adapt the optimization concept of momentum to reinforcement learning. Seeing the state-action value functions as an analog to the gradients in optimization, we interpret momentum as an average of consecutive q-functions. We derive Momentum Value Iteration (MoVI), a variation of Value Iteration that incorporates this momentum idea. Our analysis shows that this allows MoVI to average errors over successive iterations. We show that the proposed approach can be readily extended to deep learning. Specifically, we propose a simple improvement on DQN based on MoVI, and experiment it on Atari games.

1 Introduction

Reinforcement Learning (RL) is largely based on Approximate Dynamic Programming (ADP), which provides algorithms to solve Markov Decision Processes (MDP, Puterman [1994]) under approximation. In the exact case, where there is no approximation, classic algorithms such as Value Iteration (VI) or Policy Iteration (PI) are guaranteed to converge to the optimal solution, that is, to find an optimal policy that dominates every policy in terms of value. These algorithms rely on solving fixed-point problems: in VI, one tries to reach the fixed point of the Bellman optimality operator by an iterative method. We focus on VI for the rest of the paper, but the principle we propose can be extended beyond this. Approximate Value Iteration (AVI) is a VI scheme with approximation errors. It is well known [Bertsekas and Tsitsiklis, 1996] that if the errors do not vanish, AVI does not converge. To get some intuition, consider a sequence of policies being greedy according to the optimal q-function, with an additional state-action dependent noise. The resulting sequence of policies will be unstable and suboptimal, even with centered and bounded noise. Dealing with errors is however crucial to RL, as we hope to tackle problems with large state spaces that require function approximation. Indeed, many recent RL successes are algorithms that instantiate ADP schemes with neural networks for function approximation. Deep Q-Networks (DQN, Mnih et al. [2015]), for example, can be seen as an extension of AVI with neural networks.

In optimization, a common strategy to stabilize the descent direction, known as momentum, is to average the successive gradients instead of considering the last one. In reinforcement learning, the state-action value function can be seen informally as a kind of gradient, as it gives an improvement direction for the policy. Hence, we propose to bring the concept of momentum to reinforcement learning by basically averaging q-values in a DP scheme.

We introduce Momentum Value Iteration (MoVI) in Section 3. It is Value Iteration, up to the fact that the policy, instead of being greedy with respect to the last state-action value function, is greedy with respect to an average of the past value functions. We analyze the propagation of errors of this scheme in Section 4. In AVI, the performance bound depends on a weighted sum of the norms of the errors at each iteration. For MoVI, we show that it depends on the norms of the cumulative errors of previous iterations. This means that it allows for a compensation of errors along different iterations, and even convergence in the case of zero-mean and bounded noises, under some assumption. This compensation property is shared by a few algorithms that will be discussed in Section 6. We also show that MoVI can be successfully combined with powerful function approximation by proposing Momentum-DQN in Section 5, an extension of MoVI with neural networks based on DQN. It provides a strong performance improvement over DQN on the standard Arcade Learning Environment (ALE) benchmark [Bellemare et al., 2013]. All stated results are proven in the appendix.

2 Background

Markov Decision Processes. We consider the RL setting where an agent interacts with an environment modeled as a Markov Decision Process, a tuple {S, A, P, r, γ} with S a finite state space¹, A a finite action space, P the Markovian transition kernel (P(s'|s, a) is the probability of reaching s' when taking action a in state s), r a bounded reward function and γ ∈ (0, 1) the discount factor. A policy π maps each state to a distribution over actions, π(·|s) ∈ Δ_A. Its (state-action) value function is q_π(s, a) = E_π[Σ_{t≥0} γ^t r(s_t, a_t) | s_0 = s, a_0 = a]. We write P_π the stochastic kernel induced by π and T_π the Bellman evaluation operator, T_π q = r + γ P_π q, whose unique fixed point is q_π. Any value function is bounded by q_max = r_max/(1 − γ). An optimal policy π_* satisfies, for every policy π and every state-action pair, q_{π_*}(s, a) ≥ q_π(s, a). The Bellman optimality operator is defined as T_* q = max_π T_π q, and we have that q_* is the unique fixed point of T_*. A policy is said to be greedy with respect to q ∈ R^{S×A} if T_* q = T_π q. We denote the set of these policies G(q). Note that such a policy can be computed without accessing the model (the transition kernel).

Finally, for μ ∈ Δ_{S×A} we write d_{π,μ} = (1 − γ) μ (I − γ P_π)^{-1} the discounted cumulative occupancy measure induced by π when starting from the distribution μ (distributions being written as row vectors). We define the μ-weighted ℓ_p-norm as, for each q ∈ R^{S×A}, ‖q‖_{p,μ} = E_{(s,a)∼μ}[|q(s, a)|^p]^{1/p}.

¹This is for ease and clarity of exposition; the proposed algorithm and analysis can be extended to continuous state spaces.

Approximate Value Iteration. Approximate Dynamic Programming provides algorithms to solve an MDP under some errors. One classic algorithm is Approximate Value Iteration. It looks directly for the fixed point of T_* with an iterative process,

    π_{k+1} ∈ G(q_k),
    q_{k+1} = T_{π_{k+1}} q_k + ε_{k+1}.        (AVI) (1)

Notice that here, T_{π_{k+1}} q_k = T_* q_k. In this scheme, we call the first line the greedy step, and the second line the partial evaluation step. AVI satisfies the following bound for the quality of the policy π_k,

    ‖q_* − q_{π_k}‖_∞ ≤ (2γ/(1 − γ)) Σ_{j=1}^{k} γ^{k−j} ‖ε_j‖_∞ + (2γ^k/(1 − γ)) q_max.

In this bound, the errors accumulate through a discounted sum of their norms: errors made at different iterations cannot compensate each other.

3 Momentum Value Iteration

In gradient ascent, one maximizes a function by repeatedly updating a variable in the direction of the gradient; momentum stabilizes this update by replacing the last gradient with an average of the past gradients [Qian, 1999]. To transpose this idea to ADP, we view the q-function as playing the role of the gradient and the policy as playing the role of the optimized variable. In particular, we can rewrite the greedy step (in AVI) as π_k(·|s) ∈ argmax_{π(·|s)∈Δ_A} ⟨q_k(s, ·), π(·|s)⟩, thus seeing this step as finding the policy being state-wise the most colinear with q_k. This is also reminiscent of the direction-finding subproblem of Frank and Wolfe [1956]. Consequently, the greedy step can be seen as an analog of the update in gradient ascent (the policy π is analog to the variable x), the differences being (i) that q_k in AVI is not a gradient, but the result of an iterative process, q_k = T_{π_k} q_{k−1}, and (ii) that the policy is not updated, but replaced.

This analogy is thus quite limited (q_k is not really a gradient, there is no optimized function, the policy is replaced rather than updated). However, it is sufficient to adapt the momentum idea to AVI, by replacing the q-function in the improvement step by a smoothing of the q-functions, h_k = ρ h_{k−1} + q_k. We can then notice that G(h_k) = G(h_k/(1 + ρ)), allowing us to compute an average instead of a smoothing, h_k = β_k h_{k−1} + (1 − β_k) q_k, which leads to the following ADP scheme, initialized with h_0 = q_0,

    π_{k+1} = G(h_k),
    q_{k+1} = T_{π_{k+1}} q_k + ε_{k+1},        (MoVI) (2)
    h_{k+1} = β_{k+1} h_k + (1 − β_{k+1}) q_{k+1}.

We call this scheme Momentum Value Iteration (MoVI); we analyze it in the following section.
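To make the scheme (2) concrete, here is a minimal tabular sketch of MoVI with exact operators, assuming access to a reward table r of shape (S, A) and a transition kernel P of shape (S, A, S); all names and shapes are illustrative and not taken from the authors' code.

```python
import numpy as np

def greedy(h):
    """Deterministic greedy policy with respect to a table h of shape (S, A)."""
    return np.argmax(h, axis=1)

def bellman_eval(q, pi, r, P, gamma):
    """Evaluation operator: (T_pi q)(s, a) = r(s, a) + gamma * E_{s' ~ P(.|s, a)}[q(s', pi(s'))]."""
    S, A = q.shape
    next_values = q[np.arange(S), pi]                      # q(s', pi(s')) for every state s'
    return r + gamma * P.reshape(S * A, S).dot(next_values).reshape(S, A)

def movi(r, P, gamma, n_iterations):
    """Momentum Value Iteration (Eq. 2) with the averaging rate beta_k = k / (k + 1)."""
    S, A = r.shape
    q = np.zeros((S, A))
    h = q.copy()                                           # h_0 = q_0
    for k in range(1, n_iterations + 1):
        pi = greedy(h)                                     # greedy step: pi_{k+1} = G(h_k)
        q = bellman_eval(q, pi, r, P, gamma)               # partial evaluation: q_{k+1} = T_{pi_{k+1}} q_k
        beta = k / (k + 1.0)
        h = beta * h + (1.0 - beta) * q                    # averaging: h_{k+1} = beta h_k + (1 - beta) q_{k+1}
    return greedy(h), h
```

Replacing `bellman_eval` by a sampled version of the operator recovers the approximate setting of Eq. (2), with ε_{k+1} the corresponding estimation error.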

4 Analysis

For the analysis, we consider a specific case of the scheme in Equation (2), with an empirical mean rather than a constant-rate moving average. This amounts to defining β_k = k/(k + 1) in Eq. (2). We study the propagation of errors of MoVI, to see how it is impacted by the introduction of momentum, compared to a classic AVI scheme (see Eq. (1)).

4.1 Error propagation analysis

First, let us define some useful notations. We denote P_{j:i} = P_{π_j} P_{π_{j−1}} ... P_{π_i} if 1 ≤ i ≤ j, P_{j:i} = I otherwise, where π_j is the policy computed by MoVI at iteration j. We then define the negative cumulative error E_k = −Σ_{j=1}^{k} ε_j, and the weighted negative cumulative error E'_{k,j} = −Σ_{i=1}^{k−j} P_{i+j:i+1} (I − γ P_{π_i}) ε_i. To study the efficiency of the algorithm, the natural quantity to bound is the loss q_* − q_{π_k} ≥ 0, the difference between the value of the optimal policy and the (true) value of the policy computed by MoVI.

Theorem 1. After k + 1 iterations of MoVI, we have

    q_* − q_{π_{k+1}} ≤ (1/(k + 1)) [ (I − γ P_{π_*})^{-1} (E_{k+1} + q_{k+1} − q_0)
                        − (I − γ P_{π_{k+1}})^{-1} ( Σ_{j=0}^{k−1} γ^j E'_{k,j} + Σ_{j=0}^{k} γ^j P_{j:1} (T_{π_1} q_0 − q_0) ) ].

To understand and then discuss this result, we provide a bound of a μ-weighted ℓ_1-norm of the loss: this norm is what one would want to control in a practical setting. Notice that we could similarly derive a bound for the μ-weighted ℓ_p-norm.

Corollary 1. Let μ be the distribution of interest, and ν the sampling distribution. We introduce the following concentrability coefficient (the fraction being componentwise),

    C = max_π ‖ d_{π,μ} / ν ‖_∞.

Suppose that we initialize h_0 = q_0 = 0. At iteration k + 1 of MoVI, we have

    ‖q_* − q_{π_{k+1}}‖_{1,μ} ≤ (C/((k + 1)(1 − γ))) ( ‖E_{k+1}‖_{1,ν} + Σ_{j=0}^{k−1} γ^j ‖E'_{k,j}‖_{1,ν} + 2 q_max ).

Theorem 1 shows that q_* − q_{π_k} depends on two error terms, E_k and a γ-discounted sum of E'_{k,j}. The first term corresponds to a sum of errors, which can then compensate, something that is not the case in AVI (see Equation (1)). The normalization by 1/(k + 1) reduces the variance of this term, and that can lead to convergence under some assumptions (see Section 4.2). However, the second term is more cumbersome. The terms E'_{k,j} depend on sums of errors weighted by composed kernels P_{i:j}. Were these kernels arbitrary, this could lead to further variance reduction. However, the corresponding average is done over the state-action space in addition to over iterations, and the kernels depend on the errors they weight, this dependency being hard to quantify. We further discuss this next.

Still, the algorithm can converge in practice, and we illustrate its behaviour on a simple case. We give ourselves a tabular representation of a randomly generated MDP, with access to a generative model. The approximation comes from the fact that the Bellman operator is sampled at each iteration (instead of being evaluated exactly); we compare MoVI to AVI in the same scenario. We report the average error between q_{π_k} and q_* in Figure 1. This experiment illustrates how AVI oscillates with high error, while MoVI converges to q_*.

[Figure 1: Illustration of the convergence of MoVI. The plot shows ‖q_* − q_{π_k}‖_1 against iterations k (up to 5000) for MoVI and AVI; we represent the empirical mean and standard deviation of the error over 100 MDPs.]

We note that our proof technique should hold with a constant β too (moving average instead of average). In this case, instead of having an average error (k^{-1} E_k), we would have a moving average of the (weighted) errors. This would not vanish asymptotically, even with zero-mean bounded noises ε_k, but it would still reduce the variance, and improve upon the AVI bound.
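The compensation effect captured by the first error term can be checked numerically: with centered, bounded noise, the averaged cumulative error (1/k) E_k vanishes even though the individual errors do not. A minimal sketch (the noise model and table sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, K = 50, 4, 5000

E = np.zeros((S, A))                              # cumulative error E_k = -(eps_1 + ... + eps_k)
for k in range(1, K + 1):
    eps = rng.uniform(-1.0, 1.0, size=(S, A))     # centered, bounded noise eps_k
    E -= eps

print(np.abs(eps).max())      # ||eps_K||_inf: stays of order 1
print(np.abs(E).max() / K)    # ||E_K||_inf / K: shrinks, roughly as O(1 / sqrt(K))
```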

4.2 About the sample complexity

To better understand MoVI, we analyze its sample complexity in a simple case, Sampled-MoVI. In this setting, we have access to a generative model of the MDP and we give ourselves a tabular representation of the MDP. At each iteration of Sampled-MoVI, for each (s, a) ∈ S × A, we sample a state s' ∼ P(·|s, a) and perform the update from Equation (2) with only this state. We denote T̂_{π_k} the resulting sampled Bellman operator, T̂_{π_k} q(s, a) = r(s, a) + γ q(s', π_k(s')). The error at iteration k is then, for each (s, a), ε_k(s, a) = T̂_{π_k} q_{k−1}(s, a) − T_{π_k} q_{k−1}(s, a). It is thus centered (zero-mean). We provide a detailed pseudo-code in the Appendix.

We are interested in controlling the distance of our policy to the optimal policy, precisely the norm ‖q_* − q_{π_k}‖_∞ at iteration k of Sampled-MoVI. We have, as a direct consequence of Thm. 1, that

    ‖q_* − q_{π_{k+1}}‖_∞ ≤ (1/((k + 1)(1 − γ))) ( ‖E_{k+1}‖_∞ + Σ_{j=0}^{k−1} γ^j ‖E'_{k,j}‖_∞ + 2 q_max ).        (3)

Informally, using a Hoeffding argument, we have ‖k^{-1} E_k‖_∞ = O(k^{-1/2}). However, bounding a term max_{j≤k} ‖E'_{k,j}‖_∞ is more involved. This could typically be done using the maximal Azuma-Hoeffding inequality. Yet, this requires the errors to be centered and bounded. In our case, the sequence of estimation errors {ε_1(s, a), ..., ε_k(s, a)} is a martingale difference sequence with respect to the natural filtration F_{k−1} (generated by the sequence of states sampled from the generative model), that is, E[ε_k(s, a) | F_{k−1}] = 0. This is sufficient for controlling the term E_k, but the terms E'_{k,j} are more difficult. Indeed, there, the errors are multiplied by a series of transition kernel matrices. For an arbitrary kernel P, independent of ε_k, we would have E[P ε_k(s, a) | F_{k−1}] = P E[ε_k(s, a) | F_{k−1}] = 0. Unfortunately, P_{π_{k+1}} depends on π_{k+1}, which is greedy with respect to h_k, which is computed using q_k and so depends on ε_k. Thus, the independence cannot be assessed. To control the error in Sampled-MoVI, we consequently make the following assumption.

Assumption 1. ∀ i, j ≥ 1, E[P_{j+i:j+1} ε_j | F_{j−1}] = 0.

This assumption may seem very strong, as the dependency is hard to quantify. However, we have that π_{k+1} ∈ G(h_k) = G( (k/(k + 1)) h_{k−1} + (1/(k + 1)) T_{π_k} q_{k−1} + (1/(k + 1)) ε_k ). Thus, the influence of ε_k on π_{k+1} diminishes with time. Indeed, assuming that E[P_{j+i:j+1} ε_j | F_{j−1}] = o(1/√j) should be enough to ensure convergence, but at a lower speed. We study this assumption numerically in Section 7.

Proposition 1. Suppose Asm. 1 holds. After k iterations of Sampled-MoVI, with probability at least 1 − δ,

    ‖q_* − q_{π_k}‖_∞ ≤ (2 r_max/(1 − γ)) ( 1/k + (3/(1 − γ)) √( 2 ln(4|S||A|/δ) / k ) ).

This result only holds under the strong Asm. 1. Under this setting (tabular representation, generative model), there exist algorithms with faster convergence [Wainwright, 2019]. However, they are not easily extendable beyond this setting, contrary to MoVI, which can be easily turned into a practical large-scale deep RL algorithm.
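A minimal sketch of one Sampled-MoVI iteration; the pseudo-code of the appendix is not reproduced here, and `sample_next_state`, standing for the generative model, is an assumed helper.

```python
import numpy as np

def sampled_movi_step(q, h, k, r, gamma, sample_next_state, rng):
    """One iteration of Sampled-MoVI: a single sampled next state per (s, a) defines T_hat_{pi_{k+1}}."""
    S, A = q.shape
    pi = np.argmax(h, axis=1)                              # pi_{k+1} = G(h_k)
    q_new = np.empty_like(q)
    for s in range(S):
        for a in range(A):
            s_next = sample_next_state(s, a, rng)          # s' ~ P(.|s, a), one call to the generative model
            q_new[s, a] = r[s, a] + gamma * q[s_next, pi[s_next]]   # (T_hat_{pi_{k+1}} q_k)(s, a)
    beta = k / (k + 1.0)                                   # empirical-mean rate used in the analysis
    h_new = beta * h + (1.0 - beta) * q_new
    return q_new, h_new
```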

5 Momentum DQN

We now propose an extension of MoVI to Momentum-DQN, introducing stochastic approximation and using deep neural networks for function approximation. We base ourselves on Deep Q-Networks (DQN, Mnih et al. [2015]), using the same algorithmic structure. We propose an off-policy algorithm, using a replay buffer as in DQN: we can apply the Bellman evaluation operator to the estimated q-function in an off-policy manner.

We parametrize the q-function by an online network Q_θ of weights θ, and we keep a copy of these weights in a target network Q^− of weights θ^−. We additionally define the averaging network H_φ of weights φ, and their target counterparts H^− and φ^−. Momentum-DQN interacts in an online way with an environment, collecting transitions {s, a, r, s'} ∈ S × A × R × S that are stored in a FIFO replay buffer B. In DQN, the algorithm performs gradient descent to approximate the partial evaluation step by regressing an approximation of T_* Q^−, and periodically copies the weights of the online networks to the target networks. The loss minimized at each step is almost the same as in DQN, replacing an approximation of T_* Q_k by an approximation of the evaluation operator of the greedy policy with respect to the averaging network, T_{G(H_φ)} Q^−. We define a regression target for Q_θ as

    Q̂(r, s') = r + γ Q^−(s', argmax_b H^−(s', b)),

and a regression loss

    L_q(θ) = Ê_B [ ( Q̂(r, s') − Q_θ(s, a) )^2 ],        (4)

with Ê the empirical loss over a finite set. Then, we define a regression loss for the averaging network as an approximation of Equation (2). We use the general scheme from Equation (2) with a possibly variable mixture rate β_k. The regression target Ĥ for the averaging network is computed as

    Ĥ(s, a, r, s') = β_k H^−(s, a) + (1 − β_k) Q̂(r, s'),

which leads to a regression loss

    L_H(φ) = Ê_B [ ( Ĥ(s, a, r, s') − H_φ(s, a) )^2 ].        (5)

Momentum-DQN interacts with the environment with the policy G_{e_k}(H) that is e_k-greedy with respect to H, the averaging network (e_k depends on k because we use a classic decreasing schedule for the exploration). During training, it minimizes the losses L_q and L_H with stochastic gradient descent (or a variant), and updates the target weights with the online weights every C gradient steps. A detailed pseudo-code is given in Algorithm 1, and we evaluate this algorithm in Section 7.2.

On the mixture rate. We aim at considering a rate close to the one of MoVI, β_k = k/(k + 1). Due to stochastic approximation, an iteration of Momentum-DQN does not match one iteration of MoVI; rather, we should wait for several target updates before considering that we have performed such an iteration. Consequently, we consider a rate such that β_k = ⌊k/κ⌋/(⌊k/κ⌋ + 1), with κ a rate update period (a hyperparameter), that is the number of environment steps between each change of β.
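A minimal sketch of the regression targets and losses (4)-(5), together with the mixture-rate schedule. For readability the networks are stood in for by tables indexed as `[s, a]`; in practice Q_θ, Q^−, H_φ and H^− are neural networks and the squared losses are minimized by SGD. All function and variable names are illustrative.

```python
import numpy as np

def beta_schedule(step, kappa):
    """Mixture rate beta_k = floor(k / kappa) / (floor(k / kappa) + 1)."""
    m = step // kappa
    return m / (m + 1.0)

def momentum_dqn_targets(s, a, r, s_next, q_target, h_target, gamma, beta):
    """Regression targets of Eqs. (4) and (5) for a batch of transitions (s, a, r, s')."""
    a_star = np.argmax(h_target[s_next], axis=1)                  # argmax_b H^-(s', b)
    q_hat = r + gamma * q_target[s_next, a_star]                  # Q_hat(r, s')
    h_hat = beta * h_target[s, a] + (1.0 - beta) * q_hat          # H_hat(s, a, r, s')
    return q_hat, h_hat

def regression_losses(s, a, q_hat, h_hat, q_online, h_online):
    """Empirical squared losses L_q (Eq. 4) and L_H (Eq. 5) over the batch."""
    l_q = np.mean((q_hat - q_online[s, a]) ** 2)
    l_h = np.mean((h_hat - h_online[s, a]) ** 2)
    return l_q, l_h
```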

Algorithm 1 Momentum-DQN
  Require: K ∈ N* the number of steps, C ∈ N* the update period, F ∈ N* the interaction period, κ ∈ N* the rate update period.
  Initialize θ, φ at random
  B = {}
  θ^− = θ, φ^− = φ
  for k = 1 to K do
    Collect a transition t = (s, a, r, s') from G_{e_k}(H_φ)
    B ← B ∪ {t}
    if k mod F == 0 then
      β_k = ⌊k/κ⌋/(⌊k/κ⌋ + 1)
      On a random batch of transitions B_{q,k} ⊂ B, update θ with one step of SGD on L_q, see (4)
      On a random batch of transitions B_{H,k} ⊂ B, update φ with one step of SGD on L_H, see (5)
    end if
    if k mod C == 0 then
      θ^− ← θ, φ^− ← φ
    end if
  end for
  return G(H_φ)

6 Related work and discussion

The closest approaches to MoVI are Speedy Q-Learning (SQL) [Azar et al., 2011] and Dynamic Policy Programming (DPP) [Azar et al., 2012] (generalized by Kozuno et al. [2019] as Conservative VI, with similar guarantees). Both approaches are extensions of AVI that also benefit from a similar compensation of errors along iterations. As far as we know, they are the sole algorithms with this kind of guarantee. We first discuss extensively the links to SQL and DPP, before mentioning other (less) related works.

Algorithmic comparison. First, let us consider DPP, in the DPP-RL version² [Azar et al., 2012, Algorithm 2]. Define the scalar product on A, for any policy π and q-value q, as ⟨π, q⟩(s) = Σ_{a∈A} π(a|s) q(s, a). DPP estimates a quantity ψ_k ∈ R^{S×A} as

    ψ_k = ψ_{k−1} + T_{π_k} ψ_{k−1} − ⟨π_k, ψ_{k−1}⟩ + ε_k,        (6)

with π_k ∈ G(ψ_k). Without error, ψ_k(s, a) converges to q_*(s, a) when a is the optimal action in state s, and to −∞ otherwise. This makes an extension of DPP to a function approximation setting difficult (the function to estimate is unbounded).

²DPP considers general softmax policies, of which greedy policies are a special case, that correspond to DPP-RL.

Secondly, SQL updates a q-value q_k as

    q_k = q_{k−1} + (1/k) (T_* q_{k−2} − q_{k−1}) + ((k − 1)/k) (T_* q_{k−1} − T_* q_{k−2}).        (7)

We then re-write SQL as an update on similar quantities as DPP. Let us define ψ_k = k q_k, and consider the policy π_k = G(q_k) = G(ψ_k). SQL is then equivalent to

    ψ_k = ψ_{k−1} + T_{π_k} ψ_{k−1} − γ P_{π_{k−1}} ψ_{k−2} + ε_k.        (8)

Finally, we also rewrite MoVI in this setting. Here, we define ψ_k as ψ_k = (k + 1) h_k = Σ_{j=0}^{k} q_j. We consider the sequence of policies π_k = G(h_k) = G(ψ_k). With some work (detailed in the Appendix) we can rewrite Equation (2) as an update on ψ_k,

    ψ_k = ψ_{k−1} + T_{π_k} ψ_{k−1} − γ P_{π_k} ψ_{k−2} + ε_k.        (9)

Comparing MoVI, SQL and DPP through the prism of Eqs. (9), (6) and (8), we observe that these three schemes are similar. They all share the first part of their update in common, and differ only in the subtraction term – the term that allows for error compensation. This term is γ P_{π_k} ψ_{k−2} in MoVI; it is replaced by γ P_{π_{k−1}} ψ_{k−2} in SQL, and by ⟨π_k, ψ_{k−1}⟩ in DPP. This writing eases comparison, but we highlight that it is not how the algorithms are defined initially, and implemented, except for DPP (SQL and MoVI do not require estimating an unbounded function).
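To make the comparison of Eqs. (6), (8) and (9) concrete, here is a small tabular sketch of one exact update of each scheme in the ψ parametrization (no error term). For DPP we instantiate ⟨π_k, ψ_{k−1}⟩ with a greedy π_k, i.e. a state-wise max, which corresponds to the DPP-RL case; the helper names are illustrative.

```python
import numpy as np

def apply_kernel(P, pi, q):
    """(P_pi q)(s, a) = E_{s' ~ P(.|s, a)}[q(s', pi(s'))] for a deterministic policy pi."""
    S, A = q.shape
    next_values = q[np.arange(S), pi]
    return P.reshape(S * A, S).dot(next_values).reshape(S, A)

def psi_update(scheme, psi_prev, psi_prev2, pi_k, pi_prev, r, P, gamma):
    """One update of DPP (6), SQL (8) or MoVI (9) written on psi_k."""
    shared = psi_prev + r + gamma * apply_kernel(P, pi_k, psi_prev)   # psi_{k-1} + T_{pi_k} psi_{k-1}
    if scheme == "dpp":
        return shared - np.max(psi_prev, axis=1, keepdims=True)       # <pi_k, psi_{k-1}> with a greedy pi_k
    if scheme == "sql":
        return shared - gamma * apply_kernel(P, pi_prev, psi_prev2)   # gamma P_{pi_{k-1}} psi_{k-2}
    if scheme == "movi":
        return shared - gamma * apply_kernel(P, pi_k, psi_prev2)      # gamma P_{pi_k} psi_{k-2}
    raise ValueError(scheme)
```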

Performance bounds. We now compare the performance bounds of the various algorithms. SQL and DPP both propagate averaged errors instead of errors, as they both satisfy³

    ‖q_* − q_{π_k}‖_∞ ≤ (2γ/(k(1 − γ))) ( Σ_{j=1}^{k} γ^{k−j} ‖E_j‖_∞ + 8γ q_max/(1 − γ) ).

³The bounds in the original papers differ slightly by their multiplicative constants; the one provided here is true for both.

This is to be compared to the bound for MoVI given in Eq. (3). MoVI, DPP and SQL enjoy similar bounds, the main difference being in the nature of the error terms. Both SQL and DPP depend on a term of the form Σ_{j=1}^{k} γ^{k−j} ‖E_j‖_∞, that is a discounted sum of the norms of averaged errors. On the other hand, in MoVI, we have the dependency in Σ_{j=0}^{k−1} γ^j ‖E'_{k,j}‖_∞, so not averaged errors, but averaged weighted errors. In the generative model setting, the bound is less favourable for MoVI. Indeed, in this case, the errors are zero-mean, so the dependence on their average in DPP and SQL is a strong advantage. However, we empirically show that MoVI behaves similarly to SQL and DPP in this case (see Sec. 7.1). In a more general case (ε_k corresponding to a regression error), none of the bounds can be easily instantiated, because the quantity we can hope to control is ‖ε_k‖_{2,μ}, not ‖E_k‖_{2,μ}. Thus, we check the algorithms' behaviors empirically.

From a practical point of view, neither SQL nor DPP have been originally implemented in RL on large-scale problems. A deep version of a variation of DPP⁴ has been proposed by Tsurumine et al. [2017], but it is only applied on a small number of samples. The principal issue of a practical DPP is that it has to estimate ψ_k, a quantity that is asymptotically unbounded. It could then be applied on training environments where this value is updated a relatively small number of times and stays numerically stable. However, on environments like the ALE, where one needs to compute millions of environment steps, DPP is likely to diverge and fail due to numerical issues. In Section 7.2, we provide an experiment in a larger setting. We extended MoVI to deep learning, and, for the sake of comparison, we propose deep versions of SQL and DPP. These two last algorithms are variations of DQN that make use of the updates in Equations (7) and (6) to define DQN-like regression targets. We could not obtain satisfying results with these two implementations. Experimental results and details are given in Section 7 and in the Appendix.

⁴Specifically, it is the update described by Azar et al. [2012, Eq. (24)], which also leads to an asymptotically unbounded function, and thus to numerical instability.

Other related methods. MoVI also shares algorithmic similarities with other algorithms, Softened LSPI [Pérolat et al., 2016] and Politex [Lazic et al., 2019]. Pérolat et al. [2016] consider the zero-sum games setting, and propose a Policy Iteration (PI)-based algorithm. It relates to MoVI in the sense that it averages the q-values of consecutive policies. Politex is also a PI scheme, where the policy is a softmax of the sum of all q-values. These two algorithms share the idea of averaging the q-values, but are derived from different principles. Pérolat et al. [2016] build their algorithm as a quasi-Newton method on the Bellman residual and rely heavily on linear parameterization, while Politex builds upon prediction with expert advice, and deals with the average reward criterion instead of the discounted one. Moreover, neither of these two approaches offers the kind of guarantee about the propagation of averaged errors that DPP, SQL or MoVI have.

7 Experiments

In this section, we present experimental results for MoVI and Momentum-DQN. First, we consider small random MDPs (Garnets), to check Asm. 1 empirically and to compare to DPP and SQL in a tabular setting, with access to a generative model. Then, we experiment with Momentum-DQN on a subset of Atari games, and compare to DQN (a natural baseline) as well as to deep versions of DPP and SQL. Further experimental details are provided in the appendix.

7.1 Garnets

A Garnet [Archibald et al., 1995, Bhatnagar et al., 2009] is an abstract MDP. It is built from three parameters (N_S, N_A, N_B). N_S and N_A are respectively the number of states and actions. The parameter N_B is the branching factor, the maximum number of states accessible from any other state. The transition probabilities P(s'|s, a) are computed as follows. For each state-action couple (s, a), N_B states (s_1, ..., s_{N_B}) are drawn uniformly without replacement. Then, N_B − 1 numbers are drawn uniformly in (0, 1) and sorted as (p_0 = 0, p_1, ..., p_{N_B−1}, p_{N_B} = 1). The transition probabilities are assigned as P(s_k|s, a) = p_k − p_{k−1} for each 1 ≤ k ≤ N_B. The reward function is drawn uniformly in (−1, 1)^{N_S}.
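A minimal sketch of the Garnet construction just described (the function name and RNG handling are illustrative):

```python
import numpy as np

def make_garnet(n_states, n_actions, n_branch, rng):
    """Random Garnet MDP: transition kernel P of shape (S, A, S) and state rewards r in (-1, 1)^S."""
    P = np.zeros((n_states, n_actions, n_states))
    for s in range(n_states):
        for a in range(n_actions):
            reachable = rng.choice(n_states, size=n_branch, replace=False)       # N_B next states, no replacement
            cuts = np.sort(rng.uniform(0.0, 1.0, size=n_branch - 1))             # N_B - 1 sorted cut points
            P[s, a, reachable] = np.diff(np.concatenate(([0.0], cuts, [1.0])))   # p_k - p_{k-1}
    r = rng.uniform(-1.0, 1.0, size=n_states)
    return P, r

P, r = make_garnet(n_states=30, n_actions=4, n_branch=2, rng=np.random.default_rng(0))
```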

Assumption check. First, we want to check that Asm. 1 is reasonable. Given a step j of the algorithm and a size l, we compute an empirical estimate of E[P_{j+l:j+1} ε_j]. With Garnets, we have access to the transition kernel, so we can compute the error at step j, ε_j(s, a) = T̂_{π_j} q_j(s, a) − T_{π_j} q_j(s, a). Given a fixed Garnet, we first compute the value q_j with MoVI. Then, on a number N of runs, we re-start MoVI from the same q_j, re-run the algorithm for l steps from there, and compute the values P_{j+l:j+1,n} ε_{j,n}(s, a), with n ∈ {1, ..., N}. We get an estimate ε̄_{l,N} of ‖E[P_{j+l:j+1} ε_j | F_{j−1}]‖_∞,

    ε̄_{l,N} = max_{(s,a)∈S×A} | (1/N) Σ_{n=1}^{N} P_{j+l:j+1,n} ε_{j,n}(s, a) |.

We want to check that ε̄_{l,N} → 0 when N → ∞. For several values of l, we compute ε̄_{l,N} for N between 0 and 200, and average these results over 100 Garnets. We report the evolution of the means of ε̄_{l,N} (over Garnets) in Figure 2. We observe that the limit of ε̄_{l,N} seems to be 0 for each l, which experimentally validates our assumption. With l = 0, we get the "natural" norm of errors (not multiplied by any matrix). We see that, for every tested l > 0, the norm is lower than for l = 0, meaning that the policy kernels do not have a negative impact on the expected value, but seem to further reduce variance.
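Given the N per-run arrays P_{j+l:j+1,n} ε_{j,n} stacked into a single array of shape (N, S, A), the estimator above is a max over the state-action space of an empirical mean; a minimal sketch (the input layout is an assumption made for illustration):

```python
import numpy as np

def empirical_weighted_error(weighted_errors):
    """eps_bar_{l,N} = max_{(s,a)} | (1/N) * sum_n P_{j+l:j+1,n} eps_{j,n}(s, a) |.

    `weighted_errors` stacks the N per-run arrays and has shape (N, S, A).
    """
    return np.abs(weighted_errors.mean(axis=0)).max()
```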

[Figure 2: Evolution of the empirical weighted average error ε̄_{l,N} with N (log scale), for different values of l (l = 0, 1, 5, 10). We need a convergence towards 0 for our assumption to be numerically verified, which seems to be the case.]

Algorithms comparison. We compare AVI, MoVI, SQL and DPP on random Garnets, using the sampled version with a generative model described in Section 4.2. We run each algorithm on 100 Garnets, and we report the norm of the empirical error on the uniform distribution, ‖q_* − q_{π_k}‖_1. We can compute the exact value of π_k with access to the model. The four algorithms are compared in Figure 3. We observe an almost identical behaviour for MoVI, DPP, and SQL. They all converge towards q_* at roughly the same speed, while AVI oscillates around a sub-optimal policy.

[Figure 3: Error on the policy value of different ADP schemes. Each curve represents ‖q_* − q_{π_k}‖_1, where π_k results from AVI, MoVI, SQL or DPP.]

7.2 Atari

Atari is a standard discrete-actions environment introduced by Bellemare et al. [2013] with a high-dimensional state space. We use this environment to validate our Momentum-DQN architecture. Our baseline is DQN as it is implemented in the Dopamine library [Castro et al., 2018]. We used the same architecture and the same hyperparameters as DQN; notably, we used sticky actions with a rate of 0.25 to introduce stochasticity, as recommended by Machado et al. [2018], and our state consists of the last 4 stacked frames. Every 4 steps in the environment, we perform a gradient update on θ and φ. Every C = 25000 environment steps, we update the target networks. We report the average undiscounted score obtained during learning on the last 250000 steps (named an iteration). On the figures, the thick line shows this average score, averaged over 5 random seeds, while the semi-transparent parts denote the standard deviation with respect to the seeds.

We evaluate Momentum-DQN on a subset of 20 Atari games. These games are selected to represent the categories from Ostrovski et al. [2017, Appendix A], excluding the hardest exploration ones – we make no claim of helping DQN in this setting. Here, we used a schedule of β_k as defined in Section 5, with κ = 2500000, which we tuned on a small subset of games (Asterix, Zaxxon, and Jamesbond). As an example, we give the comparison of Momentum-DQN and DQN on the game SpaceInvaders in Figure 4. In Figure 5, we report the normalized improvement of Momentum-DQN over DQN using the Area Under the Curve (AUC) metric. These results show a clear improvement using momentum. Momentum-DQN outperforms DQN on 16 games out of 20, with an average normalized improvement of 45%. It only under-performs DQN on three games, by a low margin, while the improvement goes up to 200% for the game Seaquest. In the Appendix, we report the scores obtained for the 20 games, along with experiments testing the influence of various β_k schedules.

[Figure 4: Scores obtained on SpaceInvaders by DQN (dashed-dotted, blue) and Momentum-DQN (orange); averaged score against training iterations.]

[Figure 5: Normalized improvement of Momentum-DQN over DQN on the 20 games (average improvement: 45.0%). We obtain an almost constant improvement on these 20 games.]

Deep-SQL and Deep-DPP. We implemented deep versions of SQL and DPP (respectively DSQL and DDPP), that we tested on Atari, also based on the architecture and hyperparameters of Dopamine's DQN. For both algorithms, we derive an update rule based on the corresponding ADP scheme, using the same parametrization as DQN (we report the specific equations in the Appendix). We were however not able to obtain satisfying – i.e., competitive with DQN – scores with these algorithms. We report the experimental results of DDPP and DSQL versus DQN in Figures 6 and 7. We used the same parameters as for Momentum-DQN, in particular the same β_k schedule for DSQL. On these two graphs, we see that both DSQL and DDPP underperform DQN on most of the games.

For DDPP, the reason is quite simple: the Q-network has to estimate a value that diverges to −∞, causing heavy numerical issues, and the algorithm fails on most of the games after a few iterations. It is less clear why DSQL underperforms DQN. Our hypothesis is that Momentum-DQN enjoys a separate network that approximates the average of the q-values, while DSQL needs to compute its update from consecutive targets. However, when using deep networks and stochastic approximation, the consecutive target networks cannot securely be associated to consecutive q-values computed in ADP, making the update in DSQL less reliable.

[Figure 6: Normalized improvement of DSQL over DQN (average improvement: -10.8%).]

[Figure 7: Normalized improvement of DDPP over DQN (average improvement: -28.8%).]

8 Conclusion

We introduced a new ADP scheme, MoVI, inspired by momentum in gradient ascent. To adapt momentum to RL, we made an analogy between the q-values in DP schemes and the gradient in gradient ascent methods, interpreting momentum in RL as an averaging of consecutive q-functions. We provided an analysis of MoVI, showing that momentum brings compensation of errors to AVI. We also derived a partial analysis of the sample complexity when instantiated in the tabular case. These results are similar to those of what are, to our knowledge, the closest algorithms to MoVI: SQL and DPP. Our bound involves a more complicated averaging of errors, extensively discussed. Yet, we have shown that all algorithmic schemes behave similarly on toy problems. We advocated that MoVI is better suited for deep learning extensions and proposed Momentum-DQN, as well as natural deep extensions of DPP and SQL. With experiments on a representative subset of Atari games, we have shown that, contrary to DDPP and DSQL, Momentum-DQN brings a clear improvement over DQN. Note that, in principle, momentum could be applied to any RL algorithm that estimates a value: a value-based algorithm like C51 [Bellemare et al., 2017], or an actor-critic (for example, SAC [Haarnoja et al., 2018]). It could also be extended straightforwardly to continuous action settings, replacing the critic by the average of successive critics. We plan to extend the idea of momentum to other RL algorithms in future works.

References

TW Archibald, KIM McKinnon, and LC Thomas. On the generation of Markov decision processes. Journal of the Operational Research Society, 46(3):354–361, 1995.

Mohammad G Azar, Mohammad Ghavamzadeh, Hilbert J Kappen, and Rémi Munos. Speedy Q-learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 2411–2419, 2011.

Mohammad Gheshlaghi Azar, Vicenç Gómez, and Hilbert J Kappen. Dynamic policy programming. Journal of Machine Learning Research, 13(Nov):3207–3245, 2012.

Marc G Bellemare, Yavar Naddaf, Joel Veness, and Michael Bowling. The arcade learning environment: An evaluation platform for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013.

Marc G Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In International Conference on Machine Learning (ICML), pages 449–458, 2017.

Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming. Athena Scientific, Belmont, MA, 1996.

Shalabh Bhatnagar, Richard S Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural actor–critic algorithms. Automatica, 45(11):2471–2482, 2009.

Pablo Samuel Castro, Subhodeep Moitra, Carles Gelada, Saurabh Kumar, and Marc G Bellemare. Dopamine: A research framework for deep reinforcement learning. arXiv preprint arXiv:1812.06110, 2018.

Marguerite Frank and Philip Wolfe. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95–110, 1956.

Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning (ICML), pages 1861–1870, 2018.

Tadashi Kozuno, Eiji Uchibe, and Kenji Doya. Theoretical analysis of efficiency and robustness of softmax and gap-increasing operators in reinforcement learning. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 2995–3003, 2019.

Nevena Lazic, Yasin Abbasi-Yadkori, Kush Bhatia, Gellert Weisz, Peter Bartlett, and Csaba Szepesvari. Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning (ICML), pages 3692–3702, 2019.

Marlos C Machado, Marc G Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research, 61:523–562, 2018.

Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529, 2015.

Georg Ostrovski, Marc G Bellemare, Aäron van den Oord, and Rémi Munos. Count-based exploration with neural density models. In International Conference on Machine Learning (ICML), pages 2721–2730, 2017.

Julien Pérolat, Bilal Piot, Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. Softened approximate policy iteration for Markov games. In International Conference on Machine Learning (ICML), pages 1860–1868, 2016.

Martin L Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 1994.

Ning Qian. On the momentum term in gradient descent learning algorithms. Neural Networks, 12(1):145–151, 1999.

Yoshihisa Tsurumine, Yunduan Cui, Eiji Uchibe, and Takamitsu Matsubara. Deep dynamic policy programming for robot control with raw images. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 1545–1550, 2017.

Martin J Wainwright. Variance-reduced Q-learning is minimax optimal. arXiv preprint arXiv:1906.04697, 2019.