
Journal of Machine Learning Research 5 (2003) 1-25. Submitted 4/02; Revised 9/02; Published 12/03

Learning Rates for Q-learning

Eyal Even-Dar [email protected]
Yishay Mansour [email protected]
School of Computer Science
Tel-Aviv University
Tel-Aviv, 69978, Israel

Editor: Peter Bartlett

Abstract

In this paper we derive convergence rates for Q-learning. We show an interesting relationship between the convergence rate and the learning rate used in Q-learning. For a polynomial learning rate, one which is $1/t^{\omega}$ at time $t$ where $\omega\in(1/2,1)$, we show that the convergence rate is polynomial in $1/(1-\gamma)$, where $\gamma$ is the discount factor. In contrast, we show that for a linear learning rate, one which is $1/t$ at time $t$, the convergence rate has an exponential dependence on $1/(1-\gamma)$. In addition we show a simple example that proves this exponential behavior is inherent for linear learning rates.

Keywords: Reinforcement Learning, Q-Learning, Stochastic Processes, Convergence Bounds, Learning Rates.

1. Introduction

In Reinforcement Learning, an agent wanders in an unknown environment and tries to maximize its long term return by performing actions and receiving rewards. The challenge is to understand how a current action will affect future rewards. A good way to model this task is with Markov Decision Processes (MDP), which have become the dominant approach in Reinforcement Learning (Sutton and Barto, 1998, Bertsekas and Tsitsiklis, 1996). An MDP includes states (which abstract the environment), actions (which are the actions available to the agent), and, for each state-action pair, a distribution of next states (the states reached after performing the action in the given state). In addition there is a reward function that assigns a stochastic reward for each state and action. The return combines a sequence of rewards into a single value that the agent tries to optimize. A discounted return has a parameter $\gamma\in(0,1)$, where the reward received at step $k$ is discounted by $\gamma^k$.

One of the challenges of Reinforcement Learning is when the MDP is not known, and we can only observe the trajectory of states, actions and rewards generated by the agent wandering in the MDP. There are two basic conceptual approaches to the learning problem. The first is model based, where we first reconstruct a model of the MDP, and then find an optimal policy for the approximate model. The second approach uses implicit methods that update the information after each step, and based on this derive an estimate of the optimal policy. The most popular of those methods is Q-learning (Watkins, 1989).

Q-learning is an off-policy method that can be run on top of any strategy wandering in the MDP. It uses the information observed to approximate the optimal Q function, from which one can



construct the optimal policy. There are various proofs that Q-learning does converge to the optimal Q function, under very mild conditions (Bertsekas and Tsitsiklis, 1996, Tsitsiklis, 1994, Watkins and Dayan, 1992, Littman and Szepesvári, 1996, Jaakkola et al., 1994, Borkar and Meyn, 2000). The conditions have to do with the exploration policy and the learning rate. For the exploration one needs to require that each state-action pair be performed infinitely often. The learning rate controls how fast we modify our estimates. One expects to start with a high learning rate, which allows fast changes, and to lower the learning rate as time progresses. The basic conditions are that the sum of the learning rates goes to infinity (so that any value could be reached) and that the sum of the squares of the learning rates is finite (which is required to show that the convergence is with probability one).

We build on the proof technique of Bertsekas and Tsitsiklis (1996), which is based on convergence of stochastic iterative algorithms, to derive convergence rates for Q-learning. We study two models of updating in Q-learning. The first is the synchronous model, where all state-action pairs are updated simultaneously. The second is the asynchronous model, where at each step we update a single state-action pair. We distinguish between two sets of learning rates. The most interesting outcome of our investigation is the relationship between the form of the learning rates and the rate of convergence. A linear learning rate is of the form $1/t$ at time $t$, and a polynomial learning rate is of the form $1/t^{\omega}$, where $\omega\in(1/2,1)$ is a parameter.

We show for synchronous models that for a polynomial learning rate the convergence rate is polynomial in $1/(1-\gamma)$, while for a linear learning rate the convergence rate is exponential in $1/(1-\gamma)$. We also describe an MDP that has exponential behavior for a linear learning rate. The lower bound simply shows that if the initial value is one and all the rewards are zero, it takes $O((1/\varepsilon)^{1/(1-\gamma)})$ updates, using a linear learning rate, until we reach a value of $\varepsilon$.

The different behavior might be explained by the asymptotic behavior of $\sum_t \alpha_t$, one of the conditions that ensure that Q-learning converges from any initial value. In the case of a linear learning rate we have that $\sum_{t=1}^{T}\alpha_t = O(\ln(T))$, whereas for a polynomial learning rate it behaves as $O(T^{1-\omega})$. Therefore, using a polynomial learning rate each value can be reached within a polynomial number of steps, while using a linear learning rate some values require an exponential number of steps.

The convergence rate of Q-learning in a batch setting, where many samples are averaged for each update, was analyzed by Kearns and Singh (1999). A batch setting does not have a learning rate and has much of the flavor of model based techniques, since each update is an average of many samples. A run of batch Q-learning is divided into phases, and at the end of each phase an update is made. The update after each phase is reliable since it averages many samples.

The convergence of Q-learning with a linear learning rate was studied by Szepesvári (1998) for special MDPs, where the next state distribution is the same for each state. (This setting is much closer to the PAC model, since there is no influence between the action performed and the states reached, and the states are i.i.d. distributed.) For this model Szepesvári (1998) shows a convergence rate which is exponential in $1/(1-\gamma)$. Beleznay et al. (1999) give an exponential lower bound in the number of states for undiscounted return.

2. The Model

We define a Markov Decision Process (MDP) as follows.

Definition 1 A Markov Decision Process (MDP) $M$ is a 4-tuple $(S,U,P,R)$, where $S$ is a set of states, $U$ is a set of actions ($U(i)$ is the set of actions available at state $i$), $P^M_{i,j}(a)$ is the transition probability from state $i$ to state $j$ when performing action $a\in U(i)$ in state $i$, and $R_M(s,a)$ is the reward received when performing action $a$ in state $s$.

We assume that $R_M(s,a)$ is non-negative and bounded by $R_{max}$, i.e., $\forall s,a:\ 0 \le R_M(s,a) \le R_{max}$. For simplicity we assume that the reward $R_M(s,a)$ is deterministic; however, all our results apply when $R_M(s,a)$ is stochastic. A strategy for an MDP assigns, at each time $t$, for each state $s$ a probability for performing action $a\in U(s)$, given a history $F_{t-1} = \{s_1,a_1,r_1,\ldots,s_{t-1},a_{t-1},r_{t-1}\}$ which includes the states, actions and rewards observed until time $t-1$. A policy is a memory-less strategy, i.e., it depends only on the current state and not on the history. A deterministic policy assigns each state a unique action.

While following a policy $\pi$ we perform at time $t$ action $a_t$ at state $s_t$ and observe a reward $r_t$ (distributed according to $R_M(s_t,a_t)$) and the next state $s_{t+1}$ (distributed according to $P^M_{s_t,s_{t+1}}(a_t)$). We combine the sequence of rewards into a single value called the return, and our goal is to maximize the return. In this work we focus on the discounted return, which has a parameter $\gamma\in(0,1)$; the discounted return of policy $\pi$ is $V^\pi_M = \sum_{t=0}^{\infty}\gamma^t r_t$, where $r_t$ is the reward observed at time $t$. Since all the rewards are bounded by $R_{max}$, the discounted return is bounded by $V_{max} = \frac{R_{max}}{1-\gamma}$.

We define a value function for each state $s$, under policy $\pi$, as $V^\pi_M(s) = E[\sum_{i=0}^{\infty}\gamma^i r_i]$, where the expectation is over a run of policy $\pi$ starting at state $s$. We define a state-action value function $Q^\pi_M(s,a) = R_M(s,a) + \gamma\sum_{\bar s} P^M_{s,\bar s}(a)V^\pi_M(\bar s)$, whose value is the return of initially performing action $a$ at state $s$ and then following policy $\pi$. Since $\gamma < 1$ we can define another parameter $\beta = (1-\gamma)/2$, which will be useful for stating our results. (Note that as $\beta$ decreases, $V_{max}$ increases.)

Let $\pi^*$ be an optimal policy, which maximizes the return from any start state. (It is well known that there exists an optimal strategy which is a deterministic policy (Puterman, 1994).) This implies that for any policy $\pi$ and any state $s$ we have $V^{\pi^*}_M(s) \ge V^{\pi}_M(s)$, and $\pi^*(s) = \arg\max_a \big(R_M(s,a) + \gamma\sum_{s'}P^M_{s,s'}(a)\max_b Q(s',b)\big)$. The optimal Q function is the unique fixed point of the operator $(TQ)(s,a) = R_M(s,a) + \gamma\sum_{s'}P^M_{s,s'}(a)\max_b Q(s',b)$. We use $V^*_M$ and $Q^*_M$ for $V^{\pi^*}_M$ and $Q^{\pi^*}_M$, respectively. We say that a policy $\pi$ is an $\varepsilon$-approximation of the optimal policy if $\|V^*_M - V^{\pi}_M\|_\infty \le \varepsilon$.

For a sequence of state-action pairs let the covering time, denoted by $L$, be an upper bound on the number of state-action pairs starting from any pair, until all state-action pairs appear in the sequence. Note that the covering time can be a function of both the MDP and the sequence, or just of the sequence. Initially we assume that from any start state, within $L$ steps all state-action pairs appear in the sequence. Later, we relax the assumption and assume that with probability at least $\frac{1}{2}$, from any start state, in $L$ steps all state-action pairs appear in the sequence. In this paper, the underlying policy generates the sequence of state-action pairs.

The Parallel Sampling Model, $PS(M)$, was introduced by Kearns and Singh (1999). The $PS(M)$ is an ideal exploration policy. A single call to $PS(M)$ returns for every pair $(s,a)$ the next state $s'$, distributed according to $P^M_{s,s'}(a)$, and a reward $r$ distributed according to $R_M(s,a)$. The advantage of this model is that it allows us to ignore the exploration and to focus on the learning. In some sense $PS(M)$ can be viewed as a perfect exploration policy.
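To make these definitions concrete, here is a minimal sketch (ours, not from the paper) of a tabular MDP with a $PS(M)$-style sampler and the operator $T$; the names `TabularMDP`, `parallel_sample` and `bellman_operator` are illustrative only.

```python
import numpy as np

class TabularMDP:
    """Minimal tabular MDP: P[s, a, s'] transition probabilities, R[s, a] deterministic rewards."""
    def __init__(self, P, R, gamma, rng=None):
        self.P, self.R, self.gamma = P, R, gamma
        self.n_states, self.n_actions = R.shape
        self.rng = rng or np.random.default_rng(0)

    def parallel_sample(self):
        """One PS(M) call: for every pair (s, a) return a sampled next state and the reward."""
        next_states = np.empty((self.n_states, self.n_actions), dtype=int)
        for s in range(self.n_states):
            for a in range(self.n_actions):
                next_states[s, a] = self.rng.choice(self.n_states, p=self.P[s, a])
        return next_states, self.R.copy()

    def bellman_operator(self, Q):
        """(TQ)(s,a) = R(s,a) + gamma * sum_s' P[s,a,s'] * max_b Q(s',b)."""
        return self.R + self.gamma * self.P @ Q.max(axis=1)
```

Repeatedly applying `bellman_operator` is ordinary value iteration; the Q-learning algorithms discussed below replace the exact expectation over next states with the single sample returned by a call such as `parallel_sample`.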

Notations: The notation $g = \tilde\Omega(f)$ implies that there are constants $c_1$ and $c_2$ such that $g \ge c_1 f\ln^{c_2}(f)$. All the norms $\|\cdot\|$, unless otherwise specified, are $L_\infty$ norms, i.e., $\|(x_1,\ldots,x_n)\| = \max_i x_i$.


3. Q-learning

The Q-learning algorithm (Watkins, 1989) estimates the state-action value function (for discounted return) as follows:

$$Q_{t+1}(s,a) = (1-\alpha_t(s,a))Q_t(s,a) + \alpha_t(s,a)\Big(R_M(s,a) + \gamma\max_{b\in U(s')}Q_t(s',b)\Big), \qquad (1)$$

where $s'$ is the state reached from state $s$ when performing action $a$ at time $t$. Let $T^{s,a}$ be the set of times at which action $a$ was performed at state $s$; then $\alpha_t(s,a)=0$ for $t\notin T^{s,a}$. It is known that Q-learning converges to $Q^*$ if each state-action pair is performed infinitely often and $\alpha_t(s,a)$ satisfies, for each $(s,a)$ pair, $\sum_{t=1}^{\infty}\alpha_t(s,a)=\infty$ and $\sum_{t=1}^{\infty}\alpha^2_t(s,a)<\infty$ (Bertsekas and Tsitsiklis, 1996, Tsitsiklis, 1994, Watkins and Dayan, 1992, Littman and Szepesvári, 1996, Jaakkola et al., 1994). Q-learning is an asynchronous process in the sense that it updates a single entry each step.

Next we describe two variants of Q-learning, which are used in the proofs. The first algorithm is synchronous Q-learning, which performs the updates by using the $PS(M)$. Specifically:

$$\forall s,a: \quad Q_0(s,a) = C$$
$$\forall s,a: \quad Q_{t+1}(s,a) = (1-\alpha^\omega_t)Q_t(s,a) + \alpha^\omega_t\Big(R_M(s,a) + \gamma\max_{b\in U(\bar s)}Q_t(\bar s,b)\Big),$$

where $\bar s$ is the state reached from state $s$ when performing action $a$, and $C$ is some constant. The learning rate is $\alpha^\omega_t = \frac{1}{(t+1)^\omega}$, for $\omega\in(1/2,1]$. We distinguish between a linear learning rate, which is $\omega = 1$, and a polynomial learning rate, which is $\omega\in(\frac{1}{2},1)$.
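As an illustration, here is a minimal, self-contained sketch of the synchronous update above (the function name `synchronous_q_learning`, the default parameters and the initialization constant are ours, chosen for illustration):

```python
import numpy as np

def synchronous_q_learning(P, R, gamma, omega=0.8, steps=20000, C=0.0, seed=0):
    """Synchronous Q-learning: every (s, a) entry is updated at each step using one
    sampled next state per pair (a PS(M)-style call), with learning rate 1/(t+1)**omega."""
    rng = np.random.default_rng(seed)
    n_states, n_actions = R.shape
    Q = np.full((n_states, n_actions), C, dtype=float)
    for t in range(steps):
        alpha = 1.0 / (t + 1) ** omega      # polynomial (omega < 1) or linear (omega = 1) rate
        V = Q.max(axis=1)                    # max_b Q_t(., b), computed before any entry changes
        for s in range(n_states):
            for a in range(n_actions):
                s_next = rng.choice(n_states, p=P[s, a])
                Q[s, a] += alpha * (R[s, a] + gamma * V[s_next] - Q[s, a])
    return Q
```

Note that the update of all pairs at step $t$ uses the values $Q_t$ through `V`, computed before any entry is modified, matching the synchronous model.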

The asynchronous Q-learning algorithm is simply regular Q-learning as defined in (1), with the added assumption that the underlying strategy has a covering time of $L$. The updates are as follows:

$$\forall s,a: \quad Q_0(s,a) = C$$
$$\forall s,a: \quad Q_{t+1}(s,a) = (1-\alpha^\omega_t(s,a))Q_t(s,a) + \alpha^\omega_t(s,a)\Big(R_M(s,a) + \gamma\max_{b\in U(\bar s)}Q_t(\bar s,b)\Big),$$

where $\bar s$ is the state reached from state $s$ when performing action $a$, and $C$ is some constant. Let $\#(s,a,t)$ be one plus the number of times, until time $t$, that we visited state $s$ and performed action $a$. The learning rate is $\alpha^\omega_t(s,a) = \frac{1}{[\#(s,a,t)]^\omega}$ if $t\in T^{s,a}$ and $\alpha^\omega_t(s,a) = 0$ otherwise. Again, $\omega = 1$ is a linear learning rate, and $\omega\in(\frac{1}{2},1)$ is a polynomial learning rate.

4. Our Main Results

Our main results are upper bounds on the convergence rates of Q-learning algorithms, showing their dependence on the learning rate. The basic case is synchronous Q-learning. We show that for a polynomial learning rate we have a convergence rate which is polynomial in $1/(1-\gamma) = 1/(2\beta)$. In contrast, we show that a linear learning rate has an exponential dependence on $1/\beta$. Our results exhibit a sharp difference between the two learning rates, although they both converge with probability one. This distinction, which is highly important, can be observed only when we study the convergence rate, rather than convergence in the limit.

The bounds for asynchronous Q-learning are similar. The main difference is the introduction of the covering time $L$. For a polynomial learning rate we derive a bound polynomial in $1/\beta$, and for a linear learning rate our bound is exponential in $\frac{1}{\beta}$. We also show a lower bound for a linear learning rate,


which is exponential in $\frac{1}{\beta}$. This implies that our upper bounds are tight, and that the gap between the two bounds is real.

We first prove the results for the synchronous Q-learning algorithm, where we update all the entries of the Q function at each time step, i.e., the updates are synchronous. The following theorem derives the bound for a polynomial learning rate.

Theorem 2 Let $Q_T$ be the value of the synchronous Q-learning algorithm using a polynomial learning rate at time $T$. Then with probability at least $1-\delta$, we have $\|Q_T - Q^*\| \le \varepsilon$, given that
$$T = \Omega\left(\left(\frac{V_{max}^2\ln\big(\frac{|S||A|V_{max}}{\delta\beta\varepsilon}\big)}{\beta^2\varepsilon^2}\right)^{\frac{1}{\omega}} + \left(\frac{1}{\beta}\ln\frac{V_{max}}{\varepsilon}\right)^{\frac{1}{1-\omega}}\right).$$

The above bound is somewhat complicated. To simplify, assume that $\omega$ is a constant and consider first only its dependence on $\varepsilon$. This gives us $\Omega((\ln(1/\varepsilon)/\varepsilon^2)^{1/\omega} + (\ln(1/\varepsilon))^{1/(1-\omega)})$, which is optimized when $\omega$ approaches one. Considering the dependence only on $\beta$, recall that $V_{max} = R_{max}/(2\beta)$; therefore the complexity is $\tilde\Omega(1/\beta^{4/\omega} + 1/\beta^{1/(1-\omega)})$, which is optimized for $\omega = 4/5$.
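The stated optimum for the $\beta$-dependence can be checked numerically: the exponent of $1/\beta$ is $\max(4/\omega,\,1/(1-\omega))$, which is minimized where the two terms cross. A small sketch (the helper name `best_omega` is ours, not the paper's):

```python
import numpy as np

def best_omega(grid=np.linspace(0.51, 0.99, 4801)):
    """Exponent of 1/beta in the Theorem 2 bound is max(4/w, 1/(1-w)); find its minimizer."""
    exponents = np.maximum(4.0 / grid, 1.0 / (1.0 - grid))
    return grid[np.argmin(exponents)], exponents.min()

omega_star, exponent = best_omega()
print(omega_star, exponent)   # approximately 0.8 and 5.0, i.e. omega = 4/5 and 1/beta**5
```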

The following theorem bounds the time for a linear learning rate.

Theorem 3 Let $Q_T$ be the value of the synchronous Q-learning algorithm using a linear learning rate at time $T$. Then for any positive constant $\psi$, with probability at least $1-\delta$, we have $\|Q_T - Q^*\| \le \varepsilon$, given that
$$T = \Omega\left((2+\psi)^{\frac{1}{\beta}\ln\frac{V_{max}}{\varepsilon}}\ \frac{V_{max}^2\ln\big(\frac{|S||A|V_{max}}{\delta\beta\psi\varepsilon}\big)}{(\psi\beta\varepsilon)^2}\right).$$

Next we state our results for asynchronous Q-learning. The bounds are similar to those of synchronous Q-learning, but have an extra dependency on the covering time $L$.

Theorem 4 Let $Q_T$ be the value of the asynchronous Q-learning algorithm using a polynomial learning rate at time $T$. Then with probability at least $1-\delta$, we have $\|Q_T - Q^*\| \le \varepsilon$, given that
$$T = \Omega\left(\left(\frac{L^{1+3\omega}V_{max}^2\ln\big(\frac{|S||A|V_{max}}{\delta\beta\varepsilon}\big)}{\beta^2\varepsilon^2}\right)^{\frac{1}{\omega}} + \left(\frac{L}{\beta}\ln\frac{V_{max}}{\varepsilon}\right)^{\frac{1}{1-\omega}}\right).$$

The dependence on the covering time in the above theorem is $\Omega(L^{3+1/\omega} + L^{1/(1-\omega)})$, which is optimized for $\omega\approx 0.77$. For the linear learning rate the dependence is much worse, since it must be that $L \ge |S|\cdot|A|$, as stated in the following theorem.

Theorem 5 Let $Q_T$ be the value of the asynchronous Q-learning algorithm using a linear learning rate at time $T$. Then with probability at least $1-\delta$, for any positive constant $\psi$ we have $\|Q_T - Q^*\| \le \varepsilon$, given that
$$T = \Omega\left((L+\psi L+1)^{\frac{1}{\beta}\ln\frac{V_{max}}{\varepsilon}}\ \frac{V_{max}^2\ln\big(\frac{|S||A|V_{max}}{\delta\beta\varepsilon\psi}\big)}{(\psi\beta\varepsilon)^2}\right).$$

The following theorem shows that a linear learning rate may require an exponential dependence on $1/(2\beta) = 1/(1-\gamma)$, thus showing that the gap between a linear learning rate and a polynomial learning rate is real and does exist for some MDPs.

Theorem 6 There exists a deterministic MDP, $M$, such that Q-learning with a linear learning rate after $T = \Omega\big((\frac{1}{\varepsilon})^{\frac{1}{1-\gamma}}\big)$ steps has $\|Q_T - Q^*_M\| > \varepsilon$.

5. Experiments

In this section we present experiments using two types of MDPs, as well as one we call $M_0$, which is used in the lower bound example from Section 10. The two MDP types are the "random MDP" and the "line MDP". Each type contains $n$ states and two actions for each state.

We generate the "random MDP" as follows: for every state $i$, action $a$ and state $j$, we assign a random number $n_{i,j}(a)$ uniformly from $[0,1]$. The probability of a transition from state $i$ to state $j$ while performing action $a$ is $p_{i,j}(a) = \frac{n_{i,j}(a)}{\sum_k n_{i,k}(a)}$. The reward $R(s,a)$ is deterministic and chosen at random uniformly in the interval $[0,10]$.

For the line MDP, all the states are integers and the transition probability from state $i$ to state $j$ is proportional to $\frac{1}{|i-j|}$, where $i\ne j$. The reward distribution is identical to that of the random MDP. (We implemented the random function using the function rand() in C.)
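For reproducibility, a sketch of the two generators described above is given below (using NumPy rather than C's rand(), with the hypothetical function names `random_mdp` and `line_mdp`; the sketch also assumes both actions share the line-MDP transition kernel, which the text leaves unspecified):

```python
import numpy as np

def random_mdp(n, n_actions=2, rng=None):
    """Random MDP: transition weights drawn uniformly from [0,1] and normalized per (state, action)."""
    rng = rng or np.random.default_rng(0)
    weights = rng.uniform(0.0, 1.0, size=(n, n_actions, n))
    P = weights / weights.sum(axis=2, keepdims=True)
    R = rng.uniform(0.0, 10.0, size=(n, n_actions))        # deterministic reward, drawn once
    return P, R

def line_mdp(n, n_actions=2, rng=None):
    """Line MDP: transition probability from i to j proportional to 1/|i-j| (i != j)."""
    rng = rng or np.random.default_rng(0)
    i, j = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    weights = np.where(i == j, 0.0, 1.0 / np.maximum(np.abs(i - j), 1))
    P = np.repeat((weights / weights.sum(axis=1, keepdims=True))[:, None, :], n_actions, axis=1)
    R = rng.uniform(0.0, 10.0, size=(n, n_actions))
    return P, R
```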


Figure 1: Example of a 100-state MDP (both line and random), where the discount factor is $\gamma = 0.7$. We ran asynchronous Q-learning using a random exploration policy for $10^8$ steps; each panel plots the precision achieved against the learning rate exponent $\omega$.

Figure 1 demonstrates the relation between the exponent of the learning rate, $\omega$, and the accuracy of the model. The best experimental value for $\omega$ is about 0.85. Note that when $\omega$ approaches one (a linear learning rate), the precision deteriorates. This behavior coincides with our theoretical results on two points. First, our theoretical results predict bad behavior when the learning rate approaches one (an exponential lower and upper bound). Second, the experiments suggest an optimal value for $\omega$ of approximately 0.85; our theoretical results derive optimal values of $\omega$ for different settings of the parameters, and most give a similar range. Furthermore, the two types of MDP show similar behavior, which implies that the difference between linear and polynomial learning rates is inherent to many MDPs and not only to special cases (as in the lower bound example).


Figure 2: Example of 10-state MDPs (both random and line) using two different learning rates for Q-learning, plotting precision against the discount factor. Both runs use a random exploration policy for $10^7$ steps. The solid line is asynchronous Q-learning using $\omega = 0.7$; the dashed line is asynchronous Q-learning using a linear learning rate ($\omega = 1.0$).

Figure 2 demonstrates the strong relationship between the discount factor, $\gamma$, and the convergence rate. In this experiment we again see similar behavior in both MDPs. When the discount factor approaches one, the Q-learning estimate of the Q value using a linear learning rate becomes unreliable, while Q-learning using a learning rate with $\omega = 0.7$ remains stable (the error is below 0.1).

Figure 3 compares two different learning rates, $\omega = 0.6$ and $\omega = 0.9$, for ten-state MDPs (both random and line) and finds an interesting tradeoff. For low precision levels the learning rate with $\omega = 0.6$ was superior, while for high precision levels the learning rate with $\omega = 0.9$ was superior. An explanation for this behavior is that the dependence in terms of $\varepsilon$ is $\Omega((\ln(1/\varepsilon)/\varepsilon^2)^{1/\omega} + (\ln(1/\varepsilon))^{1/(1-\omega)})$, which is optimized as the learning rate approaches one.

Our last experimental result is for $M_0$, the lower bound example from Section 10. Here the difference between the learning rates is the most significant, as shown in Figure 4.

6. Background from Stochastic Algorithms

Before we derive our proofs, we first introduce the proof given by Bertsekas and Tsitsiklis (1996) for the convergence of stochastic iterative algorithms; in Section 7 we show that the Q-learning algorithms fall into this category. In this section we review the proof for convergence in the limit, and in the next sections we analyze the rate at which different Q-learning algorithms converge. (We try to keep the background as close as possible to the needs of this paper rather than giving the most general results.)



Figure 3: Random and line MDPs (10 states each), where the discount factor is $\gamma = 0.9$, plotting precision against the number of steps. The dashed line is synchronous Q-learning using $\omega = 0.9$ and the solid line is synchronous Q-learning using $\omega = 0.6$.


Figure 4: Lower bound example $M_0$, with discount factor $\gamma = 0.9$, plotting precision against the number of steps. Q-learning was run with two different learning rates: linear (dashed line) and $\omega = 0.65$ (solid line).

This section considers a general type of iterative stochastic algorithm, which is computed as follows:
$$X_{t+1}(i) = (1-\alpha_t(i))X_t(i) + \alpha_t(i)\big((H_tX_t)(i) + w_t(i)\big), \qquad (2)$$
where $w_t$ is a bounded random variable with zero expectation, and each $H_t$ is assumed to belong to a family $\mathcal{H}$ of pseudo-contraction mappings (see Bertsekas and Tsitsiklis (1996) for details).
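As a toy illustration of iteration (2), the sketch below runs the scheme with a fixed contraction $Hx = \gamma x + b$ (whose fixed point is $b/(1-\gamma)$) and zero-mean bounded noise; the names `stochastic_iteration`, the choice of $H$ and the step sizes are ours, for illustration only.

```python
import numpy as np

def stochastic_iteration(b, gamma=0.9, omega=0.8, steps=200000, seed=0):
    """X_{t+1} = (1 - a_t) X_t + a_t (H X_t + w_t), with H x = gamma * x + b and uniform noise w_t."""
    rng = np.random.default_rng(seed)
    x = np.zeros_like(b)
    for t in range(steps):
        alpha = 1.0 / (t + 1) ** omega                 # satisfies sum a_t = inf, sum a_t**2 < inf
        w = rng.uniform(-1.0, 1.0, size=b.shape)       # bounded, zero-mean noise
        x = (1 - alpha) * x + alpha * (gamma * x + b + w)
    return x

b = np.array([1.0, 2.0])
print(stochastic_iteration(b))     # drifts towards the fixed point b / (1 - gamma) = [10, 20]
```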

Definition 7 An iterative stochastic algorithm is well-behaved if:


1. The step size $\alpha_t(i)$ satisfies (1) $\sum_{t=0}^{\infty}\alpha_t(i) = \infty$, (2) $\sum_{t=0}^{\infty}\alpha^2_t(i) < \infty$ and (3) $\alpha_t(i)\in(0,1)$.

2. There exists a constant $A$ that bounds $w_t(i)$ for any history $F_t$, i.e., $\forall t,i:\ |w_t(i)| \le A$.

3. There exist $\gamma\in[0,1)$ and a vector $X^*$ such that for any $X$ we have $\|H_tX - X^*\| \le \gamma\|X - X^*\|$.

The main theorem states that a well-behaved stochastic iterative algorithm converges in the limit.

Theorem 8 [Bertsekas and Tsitsiklis (1996)] Let Xt be the sequence generated by a well-behaved stochastic iterative algorithm. Then Xt converges to X∗ with probability 1.

The following is an outline of the proof given by Bertsekas and Tsitsiklis (1996). Without loss of generality, assume that $X^* = 0$ and $\|X_0\| \le A$. The value of $X_t$ is bounded, since $\|X_0\| \le A$ and for any history $F_t$ we have $\|w_t\| \le A$; hence, for any $t$ we have $\|X_t\| \le A$.

Recall that $\beta = \frac{1-\gamma}{2}$. Let $D_1 = A$ and $D_{k+1} = (1-\beta)D_k$ for $k \ge 1$. Clearly the sequence $D_k$ converges to zero. We prove by induction that for every $k$ there exists some time $\tau_k$ such that for any $t \ge \tau_k$ we have $\|X_t\| \le D_k$. Note that this guarantees that at time $t \ge \tau_k$, for any $i$, the value $X_t(i)$ is in the interval $[-D_k,D_k]$.

The proof is by induction. Assume that there is such a time $\tau_k$; we show that there exists a time $\tau_{k+1}$ such that for $t \ge \tau_{k+1}$ we have $\|X_t\| \le D_{k+1}$. Since $D_k$ converges to zero this proves that $X_t$ converges to zero, which equals $X^*$. For the proof we define for $t \ge \tau$ the quantity

$$W_{t+1;\tau}(i) = (1-\alpha_t(i))W_{t;\tau}(i) + \alpha_t(i)w_t(i),$$
where $W_{\tau;\tau}(i) = 0$. The value of $W_{t;\tau}$ bounds the contributions of $w_j(i)$, $j\in[\tau,t]$, to the value of $X_t$ (starting from time $\tau$). We also define, for $t \ge \tau_k$,

$$Y_{t+1;\tau}(i) = (1-\alpha_t(i))Y_{t;\tau}(i) + \alpha_t(i)\gamma D_k,$$
where $Y_{\tau_k;\tau_k}(i) = D_k$. Notice that $Y_{t;\tau_k}$ is a deterministic process. The following lemma gives the motivation for the definition of $Y_{t;\tau_k}$.

Lemma 9 [Bertsekas and Tsitsiklis (1996)] For every $i$, we have
$$-Y_{t;\tau_k}(i) + W_{t;\tau_k}(i) \le X_t(i) \le Y_{t;\tau_k}(i) + W_{t;\tau_k}(i).$$

Next we use Lemma 9 to complete the proof of Theorem 8. From the definition of $Y_{t;\tau}$ and the assumption that $\sum_{t=0}^{\infty}\alpha_t = \infty$, it follows that $Y_{t;\tau}$ converges to $\gamma D_k$ as $t$ goes to infinity. In addition, $W_{t;\tau_k}$ converges to zero as $t$ goes to infinity. Therefore there exists a time $\tau_{k+1}$ such that $Y_{t;\tau} \le (\gamma+\frac{\beta}{2})D_k$ and $|W_{t;\tau_k}| \le \beta D_k/2$. This fact, together with Lemma 9, yields that for $t \ge \tau_{k+1}$,

$$\|X_t\| \le (\gamma+\beta)D_k = D_{k+1},$$
which completes the proof of Theorem 8.


7. Applying the Stochastic Theorem to Q-learning

In this section we show that both synchronous and asynchronous Q-learning are well-behaved iterative stochastic algorithms. The proof is similar in spirit to the proof given by Bertsekas and Tsitsiklis (1996). We first deal with synchronous Q-learning. Define the operator $H$ as

$$(HQ)(i,a) = \sum_{j=0}^{n} P_{ij}(a)\Big(R(i,a) + \gamma\max_{b\in U(j)}Q(j,b)\Big).$$
Rewriting Q-learning with $H$, we get
$$Q_{t+1}(i,a) = (1-\alpha_t(i,a))Q_t(i,a) + \alpha_t(i,a)\big((HQ_t)(i,a) + w_t(i,a)\big).$$
Let $\bar i$ be the state reached by performing action $a$ in state $i$ at time $t$, and let $r(i,a)$ be the reward observed at that time; then
$$w_t(i,a) = r(i,a) + \gamma\max_{b\in U(\bar i)}Q_t(\bar i,b) - \sum_{j=0}^{n} P_{ij}(a)\Big(R(i,a) + \gamma\max_{b\in U(j)}Q_t(j,b)\Big).$$
In synchronous Q-learning, $H$ is computed simultaneously on all state-action pairs.

Lemma 10 Synchronous Q-learning is a well-behaved iterative stochastic algorithm.

Proof We know that for any history $F_t$, $E[w_t(i,a)\mid F_t] = 0$ and $|w_t(i,a)| \le V_{max}$. We also know that for $\frac{1}{2} < \omega \le 1$ we have $\sum_t\alpha_t(s,a) = \infty$, $\sum_t\alpha^2_t(s,a) < \infty$ and $\alpha_t(s,a)\in(0,1)$. We need only show that $H$ satisfies the contraction property.

$$\begin{aligned}
|(HQ)(i,a) - (H\bar Q)(i,a)| &\le \sum_{j=0}^{n} P_{ij}(a)\,\Big|\gamma\max_{b\in U(j)}Q(j,b) - \gamma\max_{b\in U(j)}\bar Q(j,b)\Big|\\
&= \sum_{j=0}^{n} P_{ij}(a)\,\gamma\,\Big|\max_{b\in U(j)}Q(j,b) - \max_{b\in U(j)}\bar Q(j,b)\Big|\\
&\le \sum_{j=0}^{n} P_{ij}(a)\,\gamma\max_{b\in U(j)}|Q(j,b) - \bar Q(j,b)|\\
&\le \gamma\sum_{j=0}^{n} P_{ij}(a)\,\|Q - \bar Q\| \le \gamma\|Q - \bar Q\|.
\end{aligned}$$

Since we update all $(i,a)$ pairs simultaneously, synchronous Q-learning is a well-behaved stochastic iterative algorithm.

We next show that Theorem 8 can be applied also to asynchronous Q-learning.

Lemma 11 Asynchronous Q-learning, where the input sequence has a finite covering time L, is a well-behaved iterative stochastic algorithm.

Proof We define $HQ$ for every start state $i$ and start time $t_1$ of a phase (the beginning of the covering time) until the end of the phase (completing the covering time) at time $t_2$, during which all state-action pairs are updated. Since a state-action pair can be performed more than once, $HQ(i,a)$ can be computed more than once. We consider a time $t$ at which the policy performs action $a$ at state $i$, and let $Q_t$ be the Q vector at that time. We have that

$$\begin{aligned}
|(HQ_t)(i,a) - (HQ^*)(i,a)| &\le \sum_{j=0}^{n} P_{ij}(a)\,\Big|\gamma\max_{b\in U(j)}Q_t(j,b) - \gamma\max_{b\in U(j)}Q^*(j,b)\Big|\\
&= \sum_{j=0}^{n} P_{ij}(a)\,\gamma\,\Big|\max_{b\in U(j)}Q_t(j,b) - \max_{b\in U(j)}Q^*(j,b)\Big|\\
&\le \sum_{j=0}^{n} P_{ij}(a)\,\gamma\max_{b\in U(j)}|Q_t(j,b) - Q^*(j,b)|\\
&\le \sum_{j\in A} P_{ij}(a)\,\gamma\max_{b\in U(j)}|Q_t(j,b) - Q^*(j,b)| + \sum_{j\in B} P_{ij}(a)\,\gamma\max_{b\in U(j)}|Q_t(j,b) - Q^*(j,b)|\\
&\le \gamma\|Q_t - Q^*\|,
\end{aligned}$$
where $A$ includes the states for which during $(t_1,t)$ all the actions in $U(i)$ were performed, and $B = S - A$. We conclude that $\|Q_{t_0} - Q^*\| \le \|Q_{t_0-1} - Q^*\|$, since at each time we only change a single state-action pair, which satisfies $|HQ_t(i,a) - Q^*(i,a)| \le \gamma\|Q_t - Q^*\|$. Looking at the operator $H$ after performing all state-action pairs, $\|HQ - Q^*\| \le \max_{i,a}|HQ_t(i,a) - Q^*(i,a)| \le \gamma\|Q_t - Q^*\| \le \gamma\|Q - Q^*\|$.

8. Synchronous Q-learning

In this section we give the proofs of Theorems 2 and 3. Our main focus is the value of $r_t = \|Q_t - Q^*\|$, and our aim is to bound the time until $r_t \le \varepsilon$. We use a sequence of values $D_k$ such that $D_{k+1} = (1-\beta)D_k$ and $D_1 = V_{max}$. As in Section 6, we consider times $\tau_k$ such that for any $t \ge \tau_k$ we have $r_t \le D_k$. We call the time between $\tau_k$ and $\tau_{k+1}$ the $k$th iteration. (Note the distinction between a step of the algorithm and an iteration, which is a sequence of many steps.)

Our proof has two parts. The first (and simple) part is bounding the number of iterations until $D_i \le \varepsilon$. The bound is derived in the following lemma.

Lemma 12 For $m \ge \frac{1}{\beta}\ln(V_{max}/\varepsilon)$ we have $D_m \le \varepsilon$.

Proof We have that $D_1 = V_{max}$ and $D_i = (1-\beta)D_{i-1}$. We want to find an $m$ that satisfies $D_m = V_{max}(1-\beta)^m \le \varepsilon$. Taking logarithms on both sides of the inequality, and using $\ln\frac{1}{1-\beta} \ge \beta$, we get that $m \ge \frac{1}{\beta}\ln(V_{max}/\varepsilon)$ suffices.

The second (and much more involved) part is to bound the number of steps in an iteration. We use the following quantities, introduced in Section 6. Let $W_{t+1;\tau}(s,a) = (1-\alpha^\omega_t(s,a))W_{t;\tau}(s,a) + \alpha^\omega_t(s,a)w_t(s,a)$, where $W_{\tau;\tau}(s,a) = 0$ and

$$w_t(s,a) = R(s,a) + \gamma\max_{b\in U(s')}Q_t(s',b) - \sum_{j=1}^{|S|} P_{s,j}(a)\Big(R(s,a) + \gamma\max_{b\in U(j)}Q_t(j,b)\Big),$$


where $s'$ is the state reached after performing action $a$ at state $s$. Let
$$Y_{t+1;\tau_k}(s,a) = (1-\alpha^\omega_t(s,a))Y_{t;\tau_k}(s,a) + \alpha^\omega_t(s,a)\gamma D_k,$$
where $Y_{\tau_k;\tau_k}(s,a) = D_k$. Our first step is to rephrase Lemma 9 for our setting.

Lemma 13 For every state $s$, action $a$ and time $\tau_k$, we have
$$-Y_{t;\tau_k}(s,a) + W_{t;\tau_k}(s,a) \le Q^*(s,a) - Q_t(s,a) \le Y_{t;\tau_k}(s,a) + W_{t;\tau_k}(s,a).$$

The above lemma suggests (once again) that in order to bound the error $r_t$ one can bound $Y_{t;\tau_k}$ and $W_{t;\tau_k}$ separately, and the two bounds imply a bound on $r_t$. We first bound the term $Y_{t;\tau_k}$, which is a deterministic process, and then we bound the term $W_{t;\tau_k}$, which is stochastic.

8.1 Synchronous Q-learning using a Polynomial Learning Rate

We start with Q-learning using a polynomial learning rate and show that the duration of iteration $k$, which starts at time $\tau_k$ and ends at time $\tau_{k+1}$, is bounded by $\tau_k^\omega$. For synchronous Q-learning with a polynomial learning rate we define $\tau_{k+1} = \tau_k + \tau_k^\omega$, where $\tau_1$ will be specified later.

Lemma 14 Consider synchronous Q-learning with a polynomial learning rate and assume that for any $t \ge \tau_k$ we have $Y_{t;\tau_k}(s,a) \le D_k$. Then for any $t \ge \tau_k + \tau_k^\omega = \tau_{k+1}$ we have $Y_{t;\tau_k}(s,a) \le D_k(\gamma + \frac{2}{e}\beta)$.

Proof Let $Y_{\tau_k;\tau_k}(s,a) = \gamma D_k + \rho_{\tau_k}$, where $\rho_{\tau_k} = (1-\gamma)D_k$. We can now write
$$Y_{t+1;\tau_k}(s,a) = (1-\alpha^\omega_t)Y_{t;\tau_k}(s,a) + \alpha^\omega_t\gamma D_k = \gamma D_k + (1-\alpha^\omega_t)\rho_t,$$
where $\rho_{t+1} = \rho_t(1-\alpha^\omega_t)$. We would like to show that after time $\tau_{k+1} = \tau_k + \tau_k^\omega$, for any $t \ge \tau_{k+1}$ we have $\rho_t \le \frac{2}{e}\beta D_k$. By definition we can rewrite $\rho_t$ as
$$\rho_t = (1-\gamma)D_k\prod_{l=1}^{t-\tau_k}(1-\alpha^\omega_{l+\tau_k}) = 2\beta D_k\prod_{l=1}^{t-\tau_k}(1-\alpha^\omega_{l+\tau_k}) = 2\beta D_k\prod_{l=1}^{t-\tau_k}\Big(1-\frac{1}{(l+\tau_k)^\omega}\Big),$$
where the last identity follows from the fact that $\alpha^\omega_t = 1/t^\omega$. Since the $\alpha^\omega_t$ are monotonically decreasing,
$$\rho_t \le 2\beta D_k\Big(1-\frac{1}{\tau_k^\omega}\Big)^{t-\tau_k}.$$
For $t \ge \tau_k + \tau_k^\omega$ we have
$$\rho_t \le 2\beta D_k\Big(1-\frac{1}{\tau_k^\omega}\Big)^{\tau_k^\omega} \le \frac{2}{e}\beta D_k.$$
Hence, $Y_{t;\tau_k}(s,a) \le (\gamma + \frac{2}{e}\beta)D_k$.
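A quick numerical sanity check of this deterministic recursion (our own illustration, with arbitrarily chosen $\tau_k$, $\gamma$ and $D_k$): iterate $Y_{t+1} = (1-\alpha^\omega_t)Y_t + \alpha^\omega_t\gamma D_k$ from $Y_{\tau_k} = D_k$ and observe that after roughly $\tau_k^\omega$ additional steps the gap $Y_t - \gamma D_k$ has shrunk to a constant multiple of $\beta D_k$.

```python
def y_after_one_iteration(tau_k=10000, omega=0.8, gamma=0.9, D_k=1.0):
    """Iterate Y_{t+1} = (1 - a_t) Y_t + a_t * gamma * D_k with a_t = 1/t**omega, from t = tau_k."""
    beta = (1.0 - gamma) / 2.0
    steps = int(tau_k ** omega)            # length of iteration k in Lemma 14
    y = D_k
    for t in range(tau_k, tau_k + steps):
        alpha = 1.0 / t ** omega
        y = (1.0 - alpha) * y + alpha * gamma * D_k
    return y, gamma * D_k + (2.0 / 2.718281828) * beta * D_k   # reached value and the Lemma 14 target

print(y_after_one_iteration())   # the residual y - gamma*D_k shrinks to the order of beta*D_k
```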

Next we bound the term $W_{t;\tau_k}$ by $(1-\frac{2}{e})\beta D_k$. The sum of the bounds for $W_{t;\tau_k}(s,a)$ and $Y_{t;\tau_k}(s,a)$ is then $(\gamma+\beta)D_k = (1-\beta)D_k = D_{k+1}$, as desired.

Definition 15 Let $W_{t;\tau_k}(s,a) = (1-\alpha^\omega_t(s,a))W_{t-1;\tau_k}(s,a) + \alpha^\omega_t(s,a)w_t(s,a) = \sum_{i=\tau_k+1}^{t}\eta^{k,t}_i(s,a)w_i(s,a)$, where $\eta^{k,t}_i(s,a) = \alpha^\omega_{i+\tau_k}(s,a)\prod_{j=\tau_k+i+1}^{t}(1-\alpha^\omega_j(s,a))$, and let $W^l_{t;\tau_k}(s,a) = \sum_{i=\tau_k+1}^{\tau_k+l}\eta^{k,t}_i(s,a)w_i(s,a)$.


Note that in the synchronous model $\alpha^\omega_t(s,a)$ and $\eta^{k,t}_i(s,a)$ are identical for every state-action pair. We also note that $W^{t-\tau_k+1}_{t;\tau_k}(s,a) = W_{t;\tau_k}(s,a)$. We have bounded the term $Y_{t;\tau_k}$ for $t = \tau_{k+1}$. This bound holds for any $t \ge \tau_{k+1}$, since the sequence $Y_{t;\tau_k}$ is monotonically decreasing. In contrast, the term $W_{t;\tau_k}$ is stochastic. Therefore it is not sufficient to bound $W_{\tau_{k+1};\tau_k}$; we need to bound $W_{t;\tau_k}$ for every $t \ge \tau_{k+1}$. However, it is sufficient to consider $t\in[\tau_{k+1},\tau_{k+2}]$. The following lemma bounds the coefficients in that interval.

Lemma 16 For any $t\in[\tau_{k+1},\tau_{k+2}]$ and $i\in[\tau_k,t]$, we have $\eta^{k,t}_i = \Theta(\frac{1}{\tau_k^\omega})$.

Proof Since $\eta^{k,t}_i = \alpha^\omega_{i+\tau_k}\prod_{j=\tau_k+i+1}^{t}(1-\alpha^\omega_j)$, we can divide $\eta^{k,t}_i$ into two parts, the first one $\alpha^\omega_{i+\tau_k}$ and the second one $\mu = \prod_{j=\tau_k+i+1}^{t}(1-\alpha^\omega_j)$. We show that the first one is $\Theta(\frac{1}{\tau_k^\omega})$ and the second is constant.

Since the $\alpha^\omega_i$ are monotonically decreasing, for every $i+\tau_k\in[\tau_k,\tau_{k+2}]$ we have $\alpha^\omega_{\tau_{k+2}} \le \alpha^\omega_{i+\tau_k} \le \alpha^\omega_{\tau_k}$, thus
$$\frac{1}{4\tau_k^\omega} < \frac{1}{(3\tau_k+\tau_k^\omega)^\omega} \le \frac{1}{(\tau_{k+1}+\tau_{k+1}^\omega)^\omega} \le \alpha^\omega_{i+\tau_k} \le \frac{1}{\tau_k^\omega}.$$
Next we bound $\mu$. Clearly $\mu$ is bounded from above by 1. Also, $\mu \ge \prod_{j=\tau_k}^{\tau_{k+2}}(1-\alpha^\omega_j) \ge (1-\frac{1}{\tau_k^\omega})^{3\tau_k^\omega}$, which is bounded from below by a constant (of order $e^{-3}$). Therefore, we have that for every $t\in[\tau_{k+1},\tau_{k+2}]$, $\eta^{k,t}_i = \Theta(\frac{1}{\tau_k^\omega})$.

We introduce Azuma’s inequality, which bounds the deviations of a martingale. The use of Azuma’s inequality is mainly needed for the asynchronous case.

Lemma 17 (Azuma, 1967) Let $X_0,X_1,\ldots,X_n$ be a martingale sequence such that for each $1\le k\le n$,
$$|X_k - X_{k-1}| \le c_k,$$
where the constant $c_k$ may depend on $k$. Then for all $n \ge 1$ and any $\varepsilon > 0$,
$$\Pr\big[|X_n - X_0| > \varepsilon\big] \le 2e^{-\frac{\varepsilon^2}{2\sum_{k=1}^{n}c_k^2}}.$$

Next we show that Azuma's inequality can be applied to $W^l_{t;\tau_k}$.

Lemma 18 For any $t\in[\tau_{k+1},\tau_{k+2}]$ and $1\le l\le t$, we have that $W^l_{t;\tau_k}(s,a)$ is a martingale sequence which satisfies
$$|W^l_{t;\tau_k}(s,a) - W^{l-1}_{t;\tau_k}(s,a)| \le \Theta\Big(\frac{V_{max}}{\tau_k^\omega}\Big).$$

Proof We first note that $W^l_{t;\tau_k}(s,a)$ is a martingale sequence, since
$$E[W^l_{t;\tau_k}(s,a) - W^{l-1}_{t;\tau_k}(s,a)\mid F_{\tau_k+l-1}] = E[\eta^{k,t}_{\tau_k+l}(s,a)w_{\tau_k+l}(s,a)\mid F_{\tau_k+l-1}] = \eta^{k,t}_{\tau_k+l}E[w_{\tau_k+l}(s,a)\mid F_{\tau_k+l-1}] = 0.$$
By Lemma 16 we have that $\eta^{k,t}_l(s,a) = \Theta(1/\tau_k^\omega)$, thus
$$|W^l_{t;\tau_k}(s,a) - W^{l-1}_{t;\tau_k}(s,a)| = \eta^{k,t}_{\tau_k+l}(s,a)\,|w_{\tau_k+l}(s,a)| \le \Theta\Big(\frac{V_{max}}{\tau_k^\omega}\Big).$$

The following lemma provides a bound for the stochastic error caused by the term $W_{t;\tau_k}$ by using Azuma's inequality.


Lemma 19 Consider synchronous Q-learning with a polynomial learning rate. With probability at least $1-\frac{\delta}{m}$ we have $|W_{t;\tau_k}| \le (1-\frac{2}{e})\beta D_k$ for any $t\in[\tau_{k+1},\tau_{k+2}]$, i.e.,
$$\Pr\Big[\forall t\in[\tau_{k+1},\tau_{k+2}]:\ |W_{t;\tau_k}| \le (1-\tfrac{2}{e})\beta D_k\Big] \ge 1-\frac{\delta}{m},$$
given that $\tau_k = \Theta\Big(\big(\frac{V_{max}^2\ln(V_{max}|S||A|m/(\delta\beta D_k))}{\beta^2D_k^2}\big)^{1/\omega}\Big)$.

Proof By Lemma 18 we can apply Azuma's inequality to $W^{t-\tau_k+1}_{t;\tau_k}$ with $c_i = \Theta(\frac{V_{max}}{\tau_k^\omega})$ (note that $W^{t-\tau_k+1}_{t;\tau_k} = W_{t;\tau_k}$). Therefore, we derive that
$$\Pr\big[|W_{t;\tau_k}(s,a)| \ge \tilde\varepsilon \mid t\in[\tau_{k+1},\tau_{k+2}]\big] \le 2e^{-\frac{2\tilde\varepsilon^2}{\sum_{i=\tau_k}^{t}c_i^2}} \le 2e^{-c\,\tau_k^\omega\tilde\varepsilon^2/V_{max}^2},$$
for some constant $c > 0$. Set $\tilde\delta_k = 2e^{-c\tau_k^\omega\tilde\varepsilon^2/V_{max}^2}$, which holds for $\tau_k^\omega = \Theta(\ln(1/\tilde\delta_k)V_{max}^2/\tilde\varepsilon^2)$. Using the union bound we have
$$\Pr\big[\exists t\in[\tau_{k+1},\tau_{k+2}]:\ |W_{t;\tau_k}(s,a)| \ge \tilde\varepsilon\big] \le \sum_{t=\tau_{k+1}}^{\tau_{k+2}}\Pr\big[|W_{t;\tau_k}(s,a)| \ge \tilde\varepsilon\big];$$
thus taking $\tilde\delta_k = \frac{\delta}{m(\tau_{k+2}-\tau_{k+1})|S||A|}$ assures that with probability at least $1-\frac{\delta}{m}$ the statement holds for every state-action pair and every time $t\in[\tau_{k+1},\tau_{k+2}]$. As a result we have
$$\tau_k = \Theta\Big(\Big(\frac{V_{max}^2\ln(|S||A|m\tau_k/\delta)}{\tilde\varepsilon^2}\Big)^{1/\omega}\Big) = \Theta\Big(\Big(\frac{V_{max}^2\ln(|S||A|mV_{max}/(\delta\tilde\varepsilon))}{\tilde\varepsilon^2}\Big)^{1/\omega}\Big).$$
Setting $\tilde\varepsilon = (1-2/e)\beta D_k$ gives the desired bound.

We have bounded, for each iteration, the time needed to achieve the desired precision level with probability $1-\frac{\delta}{m}$. The following lemma provides a bound for the error in all the iterations.

Lemma 20 Consider synchronous Q-learning using a polynomial learning rate. With probability at least $1-\delta$, for every iteration $k\in[1,m]$ and time $t\in[\tau_{k+1},\tau_{k+2}]$ we have $|W_{t;\tau_k}| \le (1-\frac{2}{e})\beta D_k$, i.e.,
$$\Pr\Big[\forall k\in[1,m],\ \forall t\in[\tau_{k+1},\tau_{k+2}]:\ |W_{t;\tau_k}| \le (1-\tfrac{2}{e})\beta D_k\Big] \ge 1-\delta,$$
given that $\tau_0 = \Theta\Big(\big(\frac{V_{max}^2\ln(V_{max}|S||A|/(\delta\beta\varepsilon))}{\beta^2\varepsilon^2}\big)^{1/\omega}\Big)$.

Proof From Lemma 19 we know that
$$\Pr\Big[\exists t\in[\tau_{k+1},\tau_{k+2}]:\ |W_{t;\tau_k}| \ge (1-\tfrac{2}{e})\beta D_k\Big] = \Pr\Big[\exists s,a,\ \exists t\in[\tau_{k+1},\tau_{k+2}]:\ \Big|\sum_{i=\tau_k}^{t}w_i(s,a)\eta^{k,t}_i\Big| \ge \tilde\varepsilon\Big] \le \frac{\delta}{m}.$$
Using the union bound we have
$$\Pr\Big[\exists s,a,\ \exists k\le m,\ \exists t\in[\tau_{k+1},\tau_{k+2}]:\ \Big|\sum_{i=\tau_k}^{t}w_i(s,a)\eta^{k,t}_i\Big| \ge \tilde\varepsilon\Big] \le \sum_{k=1}^{m}\Pr\Big[\exists s,a,\ \exists t\in[\tau_{k+1},\tau_{k+2}]:\ \Big|\sum_{i=\tau_k}^{t}w_i(s,a)\eta^{k,t}_i\Big| \ge \tilde\varepsilon\Big] \le \delta,$$
where $\tilde\varepsilon = (1-\frac{2}{e})\beta D_k$.

We have bounded both the size of each iteration, as a function of its starting time, and the number of iterations needed. The following lemma solves the recurrence $\tau_{k+1} = \tau_k + \tau_k^\omega$ and bounds the total time required (it is a special case of Lemma 32).

Lemma 21 Let
$$a_{k+1} = a_k + a_k^\omega = a_0 + \sum_{i=0}^{k}a_i^\omega.$$
For any constant $\omega\in(0,1)$, $a_k = O((a_0^{1-\omega}+k)^{\frac{1}{1-\omega}}) = O(a_0 + k^{\frac{1}{1-\omega}})$.

The proof of Theorem 2 follows from Lemmas 21, 20, 12 and 14.
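The growth rate in Lemma 21 is easy to check numerically; the sketch below (the function name `recurrence_growth` is ours) iterates $a_{k+1} = a_k + a_k^\omega$ and compares $a_k$ with $k^{1/(1-\omega)}$.

```python
def recurrence_growth(a0=1.0, omega=0.8, iterations=10000):
    """Iterate a_{k+1} = a_k + a_k**omega and report the ratio a_k / k**(1/(1-omega))."""
    a = a0
    for _ in range(iterations):
        a += a ** omega
    return a, a / iterations ** (1.0 / (1.0 - omega))

print(recurrence_growth())   # the ratio settles near a constant, consistent with a_k = Theta(k**(1/(1-omega)))
```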

8.2 Synchronous Q-learning using a Linear Learning Rate

In this subsection we derive results for Q-learning with a linear learning rate. The proof is very similar in spirit to the proof of Theorem 2, and we give here lemmas analogous to the ones in Subsection 8.1. First, the number of iterations required for synchronous Q-learning with a linear learning rate is the same as that for a polynomial learning rate. Therefore, we only need to analyze the number of steps in an iteration.

Lemma 22 Consider synchronous Q-learning with a linear learning rate and assume that for any $t \ge \tau_k$ we have $Y_{t;\tau_k}(s,a) \le D_k$. Then for any $t \ge (2+\psi)\tau_k = \tau_{k+1}$ we have $Y_{t;\tau_k}(s,a) \le D_k(\gamma + \frac{2}{2+\psi}\beta)$.

Proof Let $Y_{\tau_k;\tau_k}(s,a) = \gamma D_k + \rho_{\tau_k}$, where $\rho_{\tau_k} = (1-\gamma)D_k$. We can now write
$$Y_{t+1;\tau_k}(s,a) = (1-\alpha_t)Y_{t;\tau_k}(s,a) + \alpha_t\gamma D_k = \gamma D_k + (1-\alpha_t)\rho_t,$$
where $\rho_{t+1} = \rho_t(1-\alpha_t)$. We would like to show that after time $(2+\psi)\tau_k = \tau_{k+1}$, for any $t \ge \tau_{k+1}$ we have $\rho_t \le \frac{2}{2+\psi}\beta D_k$. By definition we can rewrite $\rho_t$ as
$$\rho_t = (1-\gamma)D_k\prod_{l=1}^{t-\tau_k}(1-\alpha_{l+\tau_k}) = 2\beta D_k\prod_{l=1}^{t-\tau_k}(1-\alpha_{l+\tau_k}) = 2\beta D_k\prod_{l=1}^{t-\tau_k}\Big(1-\frac{1}{l+\tau_k}\Big),$$
where the last identity follows from the fact that $\alpha_t = 1/t$. Simplifying the expression, and setting $t = (2+\psi)\tau_k$, we have
$$\rho_t \le 2D_k\beta\,\frac{\tau_k}{t} = \frac{2D_k\beta}{2+\psi}.$$


Hence, $Y_{t;\tau_k}(s,a) \le (\gamma + \frac{2}{2+\psi}\beta)D_k$.

The following lemma enables the use of Azuma's inequality.

Lemma 23 For any $t \ge \tau_k$ and $1\le l\le t$, we have that $W^l_{t;\tau_k}(s,a)$ is a martingale sequence which satisfies
$$|W^l_{t;\tau_k}(s,a) - W^{l-1}_{t;\tau_k}(s,a)| \le \frac{V_{max}}{t}.$$

Proof We first note that $W^l_{t;\tau_k}(s,a)$ is a martingale sequence, since
$$E[W^l_{t;\tau_k}(s,a) - W^{l-1}_{t;\tau_k}(s,a)\mid F_{\tau_k+l-1}] = E[\eta^{k,t}_{\tau_k+l}(s,a)w_{\tau_k+l}(s,a)\mid F_{\tau_k+l-1}] = \eta^{k,t}_{\tau_k+l}E[w_{\tau_k+l}(s,a)\mid F_{\tau_k+l-1}] = 0.$$
For a linear learning rate we have that $\eta^{k,t}_{l+\tau_k}(s,a) = 1/t$, thus
$$|W^l_{t;\tau_k}(s,a) - W^{l-1}_{t;\tau_k}(s,a)| = \frac{|w_{\tau_k+l}(s,a)|}{t} \le \frac{V_{max}}{t}.$$

The following lemma provides a bound for the stochastic term $W_{t;\tau_k}$.

Lemma 24 Consider synchronous Q-learning with a linear learning rate. With probability at least $1-\frac{\delta}{m}$, we have $|W_{t;\tau_k}| \le \frac{\psi}{2+\psi}\beta D_k$ for any $t\in[\tau_{k+1},\tau_{k+2}]$ and any positive constant $\psi$, i.e.,
$$\Pr\Big[\forall t\in[\tau_{k+1},\tau_{k+2}]:\ |W_{t;\tau_k}| \le \frac{\psi}{2+\psi}\beta D_k\Big] \ge 1-\frac{\delta}{m},$$
given that $\tau_k = \Theta\Big(\frac{V_{max}^2\ln(V_{max}|S||A|m/(\psi\delta\beta D_k))}{\psi^2\beta^2D_k^2}\Big)$.

Proof By Lemma 23, for any $t \ge \tau_{k+1}$ we can apply Azuma's inequality to $W^{t-\tau_k+1}_{t;\tau_k}$ with $c_i = \frac{V_{max}}{t}$ (note that $W^{t-\tau_k+1}_{t;\tau_k} = W_{t;\tau_k}$). Therefore, we derive that
$$\Pr\big[|W_{t;\tau_k}| \ge \tilde\varepsilon \mid t \ge \tau_{k+1}\big] \le 2e^{-\frac{2\tilde\varepsilon^2}{\sum_{i=\tau_k}^{t}c_i^2}} \le 2e^{-\frac{ct\tilde\varepsilon^2}{V_{max}^2}},$$
for some positive constant $c$. Using the union bound we get
$$\begin{aligned}
\Pr\big[\exists t\in[\tau_{k+1},\tau_{k+2}]:\ |W_{t;\tau_k}| \ge \tilde\varepsilon\big] &\le \Pr\big[\exists t \ge (2+\psi)\tau_k:\ |W_{t;\tau_k}| \ge \tilde\varepsilon\big] \le \sum_{t=(2+\psi)\tau_k}^{\infty}\Pr\big[|W_{t;\tau_k}| \ge \tilde\varepsilon\big]\\
&\le \sum_{t=(2+\psi)\tau_k}^{\infty}2e^{-\frac{ct\tilde\varepsilon^2}{V_{max}^2}} = 2e^{-\frac{c(2+\psi)\tau_k\tilde\varepsilon^2}{V_{max}^2}}\sum_{t=0}^{\infty}e^{-\frac{ct\tilde\varepsilon^2}{V_{max}^2}}\\
&= \frac{2e^{-\frac{c(2+\psi)\tau_k\tilde\varepsilon^2}{V_{max}^2}}}{1-e^{-\frac{c\tilde\varepsilon^2}{V_{max}^2}}} = e^{-\frac{c'\tau_k\tilde\varepsilon^2}{V_{max}^2}}\,\Theta\Big(\frac{V_{max}^2}{\tilde\varepsilon^2}\Big),
\end{aligned}$$
where the last equality is due to a Taylor expansion and $c'$ is some positive constant. Setting $e^{-\frac{c'\tau_k\tilde\varepsilon^2}{V_{max}^2}}\Theta(\frac{V_{max}^2}{\tilde\varepsilon^2}) = \frac{\delta}{m|S||A|}$, which holds for $\tau_k = \Theta\big(\frac{V_{max}^2\ln(V_{max}|S||A|m/(\delta\tilde\varepsilon))}{\tilde\varepsilon^2}\big)$, and $\tilde\varepsilon = \frac{\psi}{2+\psi}\beta D_k$ assures us that for every $t \ge \tau_{k+1}$ (and as a result for any $t\in[\tau_{k+1},\tau_{k+2}]$), with probability at least $1-\frac{\delta}{m}$ the statement holds for every state-action pair.

We have bounded, for each iteration, the time needed to achieve the desired precision level with probability $1-\frac{\delta}{m}$. The following lemma provides a bound for the error in all the iterations.

Lemma 25 Consider synchronous Q-learning using a linear learning rate. With probability $1-\delta$, for every iteration $k\in[1,m]$, time $t\in[\tau_{k+1},\tau_{k+2}]$ and any constant $\psi > 0$ we have $|W_{t;\tau_k}| \le \frac{\psi\beta D_k}{2+\psi}$, i.e.,
$$\Pr\Big[\forall k\in[1,m],\ \forall t\in[\tau_{k+1},\tau_{k+2}]:\ |W_{t;\tau_k}| \le \frac{\psi\beta D_k}{2+\psi}\Big] \ge 1-\delta,$$
given that $\tau_0 = \Theta\Big(\frac{V_{max}^2\ln(V_{max}|S||A|m/(\psi\delta\beta\varepsilon))}{\psi^2\beta^2\varepsilon^2}\Big)$.

Proof From Lemma 24 we know that

$$\Pr\Big[\exists t\in[\tau_{k+1},\tau_{k+2}]:\ |W_{t;\tau_k}| \ge \frac{\psi\beta D_k}{2+\psi}\Big] \le \frac{\delta}{m}.$$
Using the union bound we have that
$$\Pr\Big[\exists k\le m,\ \exists t\in[\tau_{k+1},\tau_{k+2}]:\ |W_{t;\tau_k}| \ge \frac{\psi\beta D_k}{2+\psi}\Big] \le \sum_{k=1}^{m}\Pr\Big[\exists t\in[\tau_{k+1},\tau_{k+2}]:\ |W_{t;\tau_k}| \ge \frac{\psi\beta D_k}{2+\psi}\Big] \le \delta.$$

The proof of Theorem 3 follows from Lemmas 25, 22 and 12, and the fact that $a_{k+1} = (2+\psi)a_k = (2+\psi)^k a_1$.

9. Asynchronous Q-learning

The major difference between synchronous and asynchronous Q-learning is that asynchronous Q-learning updates only one state-action pair at each time step, while synchronous Q-learning updates all state-action pairs at each time unit. This causes two difficulties. The first is that different updates use different values of the Q function in their update; this problem is fairly easy to handle given the machinery introduced. The second, and more basic, problem is that each state-action pair should occur enough times for the update to progress. To ensure this, we introduce the notion of covering time, denoted by $L$. We first extend the analysis of synchronous Q-learning to asynchronous Q-learning in which each run always has covering time $L$, which implies that from any start state, in $L$ steps all state-action pairs are performed. Later we relax the requirement such that the condition holds only with probability $1/2$, and show that with high probability we have a covering time of $L\log T$ for a run of length $T$. Note that our notion of covering time does not assume a stationary


distribution of the exploration strategy; it may be the case that at some periods of time certain state-action pairs are more frequent, while in other periods different state-action pairs are more frequent. In fact, we do not even assume that the sequence of state-action pairs is generated by a strategy; it can be an arbitrary sequence of state-action pairs, along with their reward and next state.

Definition 26 Let $n(s,a,t_1,t_2)$ be the number of times that the state-action pair $(s,a)$ was performed in the time interval $[t_1,t_2]$.

In this section we use the same notations as in Section 8 for $D_k$, $\tau_k$, $Y_{t;\tau_k}$ and $W_{t;\tau_k}$, with a different set of values for $\tau_k$. We first give the results for asynchronous Q-learning using a polynomial learning rate (Subsection 9.1); a similar proof for linear learning rates is given in Subsection 9.2.
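The covering time of a concrete state-action sequence can be computed directly from the definitions above; the sketch below (the function name `covering_time` is ours) returns the smallest $L$ such that, from every start index of the run whose covering window fits inside it, all state-action pairs appear within the next $L$ steps.

```python
from collections import Counter

def covering_time(trajectory, all_pairs):
    """Empirical covering time of a finite run: the largest, over start indices whose
    covering window fits inside the run, of the number of steps needed until every
    state-action pair in `all_pairs` has appeared at least once."""
    needed = set(all_pairs)
    counts, covered = Counter(), 0
    L, right = 0, 0
    for left in range(len(trajectory)):
        while covered < len(needed) and right < len(trajectory):
            pair = trajectory[right]
            if pair in needed:
                counts[pair] += 1
                covered += counts[pair] == 1
            right += 1
        if covered < len(needed):
            break                 # starts this late are never fully covered within the finite run
        L = max(L, right - left)
        pair = trajectory[left]   # slide the window's left end forward
        if pair in needed:
            counts[pair] -= 1
            covered -= counts[pair] == 0
    return L
```

For example, for a trajectory `traj` generated by a random exploration policy on an $n$-state, two-action MDP, `covering_time(traj, [(s, a) for s in range(n) for a in range(2)])` gives the empirical $L$ that enters the bounds of Theorems 4 and 5.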

9.1 Asynchronous Q-learning using a Polynomial Learning Rate

Our main goal is to show that the size of the $k$th iteration is $L\tau_k^\omega$. The covering time property guarantees that in $L\tau_k^\omega$ steps each state-action pair is performed at least $\tau_k^\omega$ times. For this reason we define, for asynchronous Q-learning with a polynomial learning rate, the sequence $\tau_{k+1} = \tau_k + L\tau_k^\omega$, where $\tau_1$ will be specified later. As in Subsection 8.1, we first bound the value of $Y_{t;\tau_k}$.

Lemma 27 Consider asynchronous Q-learning with a polynomial learning rate and assume that for any $t \ge \tau_k$ we have $Y_{t;\tau_k}(s,a) \le D_k$. Then for any $t \ge \tau_k + L\tau_k^\omega = \tau_{k+1}$ we have $Y_{t;\tau_k}(s,a) \le D_k(\gamma + \frac{2}{e}\beta)$.

Proof For each state-action pair $(s,a)$ we are assured that $n(s,a,\tau_k,\tau_{k+1}) \ge \tau_k^\omega$, since the covering time is $L$ and the underlying policy has made $L\tau_k^\omega$ steps. Using the fact that the $Y_{t;\tau_k}(s,a)$ are monotonically decreasing and deterministic, we can apply the same argument as in the proof of Lemma 14.

The next lemma bounds the influence of each $w_t(s,a)$ on $W_{t;\tau_k}(s,a)$.

Lemma 28 Let $\tilde w^t_{i+\tau_k}(s,a) = \eta^{k,t}_i(s,a)w_{i+\tau_k}(s,a)$; then for any $t\in[\tau_{k+1},\tau_{k+2}]$ the random variable $\tilde w^t_{i+\tau_k}(s,a)$ has zero mean and is bounded by $(L/\tau_k)^\omega V_{max}$.

Proof Note that by definition $w_{\tau_k+i}(s,a)$ has zero mean and is bounded by $V_{max}$ for any history and state-action pair. In a time interval of length $\tau$, by the definition of the covering time, each state-action pair is performed at least $\tau/L$ times; therefore $\eta^{k,t}_i(s,a) \le (L/\tau_k)^\omega$. Looking at the expectation of $\tilde w_{i+\tau_k}(s,a)$ we observe that
$$E[\tilde w^t_{i+\tau_k}(s,a)] = E[\eta^{k,t}_i(s,a)w_{i+\tau_k}(s,a)] = \eta^{k,t}_i(s,a)E[w_{i+\tau_k}(s,a)] = 0.$$
Next we prove that it is bounded as well:
$$|\tilde w^t_{i+\tau_k}(s,a)| = |\eta^{k,t}_i(s,a)w_{i+\tau_k}(s,a)| \le |\eta^{k,t}_i(s,a)|\,V_{max} \le (L/\tau_k)^\omega V_{max}.$$

Next we define $W^l_{t;\tau_k}(s,a) = \sum_{i=1}^{l}\tilde w^t_{i+\tau_k}(s,a)$ and prove that it is a martingale sequence with bounded differences.


Lemma 29 For any $t\in[\tau_{k+1},\tau_{k+2}]$ and $1\le l\le t$, we have that $W^l_{t;\tau_k}(s,a)$ is a martingale sequence which satisfies
$$|W^l_{t;\tau_k}(s,a) - W^{l-1}_{t;\tau_k}(s,a)| \le (L/\tau_k)^\omega V_{max}.$$

Proof We first note that $W^l_{t;\tau_k}(s,a)$ is a martingale sequence, since
$$E[W^l_{t;\tau_k}(s,a) - W^{l-1}_{t;\tau_k}(s,a)\mid F_{\tau_k+l-1}] = E[\tilde w^t_{l+\tau_k}(s,a)\mid F_{\tau_k+l-1}] = 0.$$
By Lemma 28 we have that $\tilde w^t_{l+\tau_k}(s,a)$ is bounded by $(L/\tau_k)^\omega V_{max}$, thus
$$|W^l_{t;\tau_k}(s,a) - W^{l-1}_{t;\tau_k}(s,a)| = |\tilde w^t_{l+\tau_k}(s,a)| \le (L/\tau_k)^\omega V_{max}.$$

The following lemma bounds the value of the term $W_{t;\tau_k}$.

Lemma 30 Consider asynchronous Q-learning with a polynomial learning rate. With probability at least $1-\frac{\delta}{m}$ we have, for every state-action pair, $|W_{t;\tau_k}(s,a)| \le (1-\frac{2}{e})\beta D_k$ for any $t\in[\tau_{k+1},\tau_{k+2}]$, i.e.,
$$\Pr\Big[\forall s,a,\ \forall t\in[\tau_{k+1},\tau_{k+2}]:\ |W_{t;\tau_k}(s,a)| \le (1-\tfrac{2}{e})\beta D_k\Big] \ge 1-\frac{\delta}{m},$$
given that $\tau_k = \Theta\Big(\big(\frac{L^{1+3\omega}V_{max}^2\ln(V_{max}|S||A|m/(\delta\beta D_k))}{\beta^2D_k^2}\big)^{1/\omega}\Big)$.

Proof For each state-action pair we consider $W^l_{t;\tau_k}(s,a)$, and note that $W^{t-\tau_k+1}_{t;\tau_k}(s,a) = W_{t;\tau_k}(s,a)$. Let $\ell = n(s,a,\tau_k,t)$; then for any $t\in[\tau_{k+1},\tau_{k+2}]$ we have that $\ell \le \tau_{k+2}-\tau_k \le \Theta(L^{1+\omega}\tau_k^\omega)$. By Lemma 29 we can apply Azuma's inequality to $W^{t-\tau_k+1}_{t;\tau_k}(s,a)$ with $c_i = (L/\tau_k)^\omega V_{max}$. Therefore, we derive that
$$\Pr\big[|W_{t;\tau_k}(s,a)| \ge \tilde\varepsilon \mid t\in[\tau_{k+1},\tau_{k+2}]\big] \le 2e^{-\frac{2\tilde\varepsilon^2}{\sum_{i=\tau_k+1,\,i\in T^{s,a}}^{t}c_i^2}} \le 2e^{-\frac{2\tilde\varepsilon^2\tau_k^{2\omega}}{\ell V_{max}^2L^{2\omega}}} \le 2e^{-\frac{c\,\tilde\varepsilon^2\tau_k^\omega}{L^{1+3\omega}V_{max}^2}},$$
for some constant $c > 0$. We can set $\tilde\delta_k = 2e^{-c\tau_k^\omega\tilde\varepsilon^2/(L^{1+3\omega}V_{max}^2)}$, which holds for $\tau_k^\omega = \Theta(\ln(1/\tilde\delta_k)L^{1+3\omega}V_{max}^2/\tilde\varepsilon^2)$. Using the union bound we have
$$\Pr\big[\exists t\in[\tau_{k+1},\tau_{k+2}]:\ |W_{t;\tau_k}(s,a)| \ge \tilde\varepsilon\big] \le \sum_{t=\tau_{k+1}}^{\tau_{k+2}}\Pr\big[|W_{t;\tau_k}(s,a)| \ge \tilde\varepsilon\big];$$
thus taking $\tilde\delta_k = \frac{\delta}{m(\tau_{k+2}-\tau_{k+1})|S||A|}$ assures a certainty level of $1-\frac{\delta}{m}$ for each state-action pair. As a result we have
$$\tau_k = \Theta\Big(\Big(\frac{L^{1+3\omega}V_{max}^2\ln(|S||A|m\tau_k/\delta)}{\tilde\varepsilon^2}\Big)^{1/\omega}\Big) = \Theta\Big(\Big(\frac{L^{1+3\omega}V_{max}^2\ln(|S||A|mV_{max}/(\delta\tilde\varepsilon))}{\tilde\varepsilon^2}\Big)^{1/\omega}\Big).$$
Setting $\tilde\varepsilon = (1-2/e)\beta D_k$ gives the desired bound.

We have bounded, for each iteration, the time needed to achieve the desired precision level with probability $1-\frac{\delta}{m}$. The following lemma provides a bound for the error in all the iterations.


Lemma 31 Consider asynchronous Q-learning using a polynomial learning rate. With probability $1-\delta$, for every iteration $k\in[1,m]$ and time $t\in[\tau_{k+1},\tau_{k+2}]$ we have $|W_{t;\tau_k}(s,a)| \le (1-\frac{2}{e})\beta D_k$, i.e.,
$$\Pr\Big[\forall k\in[1,m],\ \forall t\in[\tau_{k+1},\tau_{k+2}],\ \forall s,a:\ |W_{t;\tau_k}(s,a)| \le (1-\tfrac{2}{e})\beta D_k\Big] \ge 1-\delta,$$
given that $\tau_0 = \Theta\Big(\big(\frac{L^{1+3\omega}V_{max}^2\ln(V_{max}|S||A|m/(\delta\beta\varepsilon))}{\beta^2\varepsilon^2}\big)^{1/\omega}\Big)$.

Proof From Lemma 30 we know that
$$\Pr\Big[\exists t\in[\tau_{k+1},\tau_{k+2}]:\ |W_{t;\tau_k}| \ge (1-\tfrac{2}{e})\beta D_k\Big] \le \frac{\delta}{m}.$$
Using the union bound we have that
$$\Pr\big[\exists k\le m,\ \exists t\in[\tau_{k+1},\tau_{k+2}]:\ |W_{t;\tau_k}| \ge \tilde\varepsilon\big] \le \sum_{k=1}^{m}\Pr\big[\exists t\in[\tau_{k+1},\tau_{k+2}]:\ |W_{t;\tau_k}| \ge \tilde\varepsilon\big] \le \delta,$$
where $\tilde\varepsilon = (1-\frac{2}{e})\beta D_k$.

The following lemma solves the recurrence $\tau_{k+1} = \tau_k + L\tau_k^\omega$ (equivalently, $\tau_m = \tau_0 + \sum_{i=0}^{m-1}L\tau_i^\omega$) and derives the total time.

Lemma 32 Let
$$a_{k+1} = a_k + La_k^\omega = a_0 + \sum_{i=0}^{k}La_i^\omega.$$
Then for any constant $\omega\in(0,1)$, $a_k = O((a_0^{1-\omega}+Lk)^{\frac{1}{1-\omega}}) = O(a_0 + (Lk)^{\frac{1}{1-\omega}})$.

Proof We define the following series
$$b_{k+1} = \sum_{i=0}^{k}Lb_i^\omega + b_0,$$
with the initial condition $b_0 = L^{\frac{1}{1-\omega}}$. We show by induction that $b_k \le (L(k+1))^{\frac{1}{1-\omega}}$. For $k = 0$,
$$b_0 = L^{\frac{1}{1-\omega}} \le (0+1)^{\frac{1}{1-\omega}}L^{\frac{1}{1-\omega}}.$$
We assume that the induction hypothesis holds for $k-1$ and prove it for $k$:
$$b_k = b_{k-1} + Lb_{k-1}^\omega \le (Lk)^{\frac{1}{1-\omega}} + L(Lk)^{\frac{\omega}{1-\omega}} \le L^{\frac{1}{1-\omega}}k^{\frac{\omega}{1-\omega}}(k+1) \le (L(k+1))^{\frac{1}{1-\omega}},$$
and the claim is proved. Next we lower bound $b_k$ by $(L(k+1)/2)^{\frac{1}{1-\omega}}$. For $k = 0$,
$$b_0 = L^{\frac{1}{1-\omega}} \ge \Big(\frac{L}{2}\Big)^{\frac{1}{1-\omega}}.$$
Assume that the induction hypothesis holds for $k-1$ and prove it for $k$:
$$b_k = b_{k-1} + Lb_{k-1}^\omega \ge (Lk/2)^{\frac{1}{1-\omega}} + L(Lk/2)^{\frac{\omega}{1-\omega}} = L^{\frac{1}{1-\omega}}\big((k/2)^{\frac{1}{1-\omega}} + (k/2)^{\frac{\omega}{1-\omega}}\big) \ge L^{\frac{1}{1-\omega}}\big((k+1)/2\big)^{\frac{1}{1-\omega}}.$$
For $a_0 > L^{\frac{1}{1-\omega}}$ we can view the series as starting at $b_k = a_0$. From the lower bound we know that the starting point has moved by $\Theta(a_0^{1-\omega}/L)$ iterations. Therefore we have a total complexity of $O((a_0^{1-\omega}+Lk)^{\frac{1}{1-\omega}}) = O(a_0 + (Lk)^{\frac{1}{1-\omega}})$.

The proof of Theorem 4 follows from Lemmas 27, 31, 12 and 32. In the following lemma we relax the condition of the covering time.

Lemma 33 Assume that from any start state, with probability $1/2$, in $L$ steps we perform all state-action pairs. Then with probability $1-\delta$, from any start state we perform all state-action pairs within $L\lceil\log_2(1/\delta)\rceil$ steps.

Proof The proof follows from the fact that after $k$ intervals of length $L$ (where $k$ is a natural number), the probability of not visiting all state-action pairs is at most $2^{-k}$. Since we take $k = \lceil\log_2(1/\delta)\rceil$, we get that the probability of failure is at most $\delta$.

Corollary 34 Assume that from any start state, with probability $1/2$, in $L$ steps we perform all state-action pairs. Then with probability $1-\delta$, from any start state we perform all state-action pairs within $L\log(T/\delta)$ steps, for a run of length $T$.

9.2 Asynchronous Q-learning using a Linear Learning Rate

In this section we consider asynchronous Q-learning with a linear learning rate. Our main goal is to show that the size of the $k$th iteration is $(1+\psi)L\tau_k$, for any constant $\psi > 0$. The covering time property guarantees that in $(1+\psi)L\tau_k$ steps each state-action pair is performed at least $(1+\psi)\tau_k$ times. The sequence of times in this case is $\tau_{k+1} = \tau_k + (1+\psi)L\tau_k$, where $\tau_0$ will be defined later. We first bound $Y_{t;\tau_k}$ and then bound the stochastic term $W_{t;\tau_k}$.

Lemma 35 Consider asynchronous Q-learning with a linear learning rate and assume that for any $t \ge \tau_k$ we have $Y_{t;\tau_k}(s,a) \le D_k$. Then for any $t \ge \tau_k + (1+\psi)L\tau_k = \tau_{k+1}$ we have $Y_{t;\tau_k}(s,a) \le (\gamma + \frac{2}{2+\psi}\beta)D_k$.

Proof For each state-action pair $(s,a)$ we are assured that $n(s,a,\tau_k,\tau_{k+1}) \ge (1+\psi)\tau_k$, since in an interval of $(1+\psi)L\tau_k$ steps each state-action pair is visited at least $(1+\psi)\tau_k$ times by the definition of the covering time. Using the fact that the $Y_{t;\tau_k}(s,a)$ are monotonically decreasing and deterministic (thus independent), we can apply the same argument as in the proof of Lemma 22.

The following lemma enables the use of Azuma's inequality.


Lemma 36 For any $t \ge \tau_k$ and $1\le l\le t$, we have that $W^l_{t;\tau_k}(s,a)$ is a martingale sequence which satisfies
$$|W^l_{t;\tau_k}(s,a) - W^{l-1}_{t;\tau_k}(s,a)| \le \frac{V_{max}}{n(s,a,0,t)}.$$

Proof We first note that $W^l_{t;\tau_k}(s,a)$ is a martingale sequence, since
$$E[W^l_{t;\tau_k}(s,a) - W^{l-1}_{t;\tau_k}(s,a)\mid F_{\tau_k+l-1}] = E[\eta^{k,t}_{\tau_k+l}(s,a)w_{\tau_k+l}(s,a)\mid F_{\tau_k+l-1}] = \eta^{k,t}_{\tau_k+l}E[w_{\tau_k+l}(s,a)\mid F_{\tau_k+l-1}] = 0.$$
For a linear learning rate we have that $\eta^{k,t}_{\tau_k+l}(s,a) = 1/n(s,a,0,t)$, thus
$$|W^l_{t;\tau_k}(s,a) - W^{l-1}_{t;\tau_k}(s,a)| = \eta^{k,t}_{\tau_k+l}(s,a)\,|w_{\tau_k+l}(s,a)| \le \frac{V_{max}}{n(s,a,0,t)}.$$

The following lemma bounds the value of the term $W_{t;\tau_k}$.

Lemma 37 Consider asynchronous Q-learning with a linear learning rate. With probability at least $1-\frac{\delta}{m}$ we have, for every state-action pair, $|W_{t;\tau_k}(s,a)| \le \frac{\psi}{2+\psi}\beta D_k$ for any $t \ge \tau_{k+1}$ and any positive constant $\psi$, i.e.,
$$\Pr\Big[\forall t\in[\tau_{k+1},\tau_{k+2}]:\ |W_{t;\tau_k}(s,a)| \le \frac{\psi}{2+\psi}\beta D_k\Big] \ge 1-\frac{\delta}{m},$$
given that $\tau_k \ge \Theta\Big(\frac{V_{max}^2\ln(V_{max}|S||A|m/(\delta\beta D_k\psi))}{\psi^2\beta^2D_k^2}\Big)$.

Proof By Lemma 36 we can apply Azuma's inequality to $W^{t-\tau_k+1}_{t;\tau_k}$ (note that $W^{t-\tau_k+1}_{t;\tau_k} = W_{t;\tau_k}$) with $c_i = \Theta\big(\frac{V_{max}}{n(s,a,0,t)}\big)$ for any $t \ge \tau_{k+1}$. Therefore, we derive that
$$\Pr\big[|W_{t;\tau_k}| \ge \tilde\varepsilon\big] \le 2e^{-\frac{2\tilde\varepsilon^2}{\sum_{i=\tau_k,\,i\in T^{s,a}}^{t}c_i^2}} \le 2e^{-\frac{c\,n(s,a,\tau_k,t)\tilde\varepsilon^2}{V_{max}^2}},$$
for some positive constant $c$. Let us define the following indicator variable:
$$\zeta_t(s,a) = \begin{cases}1, & \alpha_t(s,a) \ne 0\\ 0, & \text{otherwise.}\end{cases}$$

Using the union bound and the fact that in an interval of length $(1+\psi)L\tau_k$ each state-action pair is visited at least $(1+\psi)\tau_k$ times, we get
$$\begin{aligned}
\Pr\big[\exists t\in[\tau_{k+1},\tau_{k+2}]:\ |W_{t;\tau_k}(s,a)| \ge \tilde\varepsilon\big] &\le \Pr\big[\exists t \ge ((1+\psi)L+1)\tau_k:\ |W_{t;\tau_k}(s,a)| \ge \tilde\varepsilon\big]\\
&\le \sum_{t=((1+\psi)L+1)\tau_k}^{\infty}\Pr\big[|W_{t;\tau_k}(s,a)| \ge \tilde\varepsilon\big]\\
&\le \sum_{t=((1+\psi)L+1)\tau_k}^{\infty}\zeta_t(s,a)\,2e^{-\frac{c\,n(s,a,0,t)\tilde\varepsilon^2}{V_{max}^2}}\\
&\le 2e^{-\frac{c(1+\psi)\tau_k\tilde\varepsilon^2}{V_{max}^2}}\sum_{t=0}^{\infty}e^{-\frac{ct\tilde\varepsilon^2}{2V_{max}^2}}\\
&= \frac{2e^{-\frac{c(1+\psi)\tau_k\tilde\varepsilon^2}{V_{max}^2}}}{1-e^{-\frac{c\tilde\varepsilon^2}{2V_{max}^2}}} = e^{-\frac{c'\tau_k\tilde\varepsilon^2}{V_{max}^2}}\,\Theta\Big(\frac{V_{max}^2}{\tilde\varepsilon^2}\Big),
\end{aligned}$$
where the last equality is due to a Taylor expansion and $c'$ is some positive constant. Setting $e^{-\frac{c'\tau_k\tilde\varepsilon^2}{V_{max}^2}}\Theta(\frac{V_{max}^2}{\tilde\varepsilon^2}) = \frac{\delta}{m|S||A|}$, which holds for $\tau_k = \Theta\big(\frac{V_{max}^2\ln(V_{max}|S||A|m/(\delta\tilde\varepsilon))}{\tilde\varepsilon^2}\big)$, and $\tilde\varepsilon = \frac{\psi}{2+\psi}\beta D_k$ assures us that for every $t \ge \tau_{k+1}$ (and as a result for any $t\in[\tau_{k+1},\tau_{k+2}]$) with probability at least $1-\frac{\delta}{m}$ the statement holds for every state-action pair.

We have bounded, for each iteration, the time needed to achieve the desired precision level with probability $1-\frac{\delta}{m}$. The following lemma provides a bound for the error in all the iterations.

Lemma 38 Consider asynchronous Q-learning using a linear learning rate. With probability $1-\delta$, for every iteration $k\in[1,m]$, time $t\in[\tau_{k+1},\tau_{k+2}]$ and any constant $\psi > 0$ we have $|W_{t;\tau_k}| \le \frac{\psi\beta D_k}{2+\psi}$, i.e.,
$$\Pr\Big[\forall k\in[1,m],\ \forall t\in[\tau_{k+1},\tau_{k+2}]:\ |W_{t;\tau_k}| \le \frac{\psi\beta D_k}{2+\psi}\Big] \ge 1-\delta,$$
given that $\tau_0 = \Theta\Big(\frac{V_{max}^2\ln(V_{max}|S||A|m/(\delta\beta\varepsilon\psi))}{\psi^2\beta^2\varepsilon^2}\Big)$.

Proof From Lemma 37 we know that
$$\Pr\Big[\exists t\in[\tau_{k+1},\tau_{k+2}]:\ |W_{t;\tau_k}| \ge \frac{\psi\beta D_k}{2+\psi}\Big] \le \frac{\delta}{m}.$$
Using the union bound we have that
$$\Pr\Big[\exists k\le m,\ \exists t\in[\tau_{k+1},\tau_{k+2}]:\ |W_{t;\tau_k}| \ge \frac{\psi\beta D_k}{2+\psi}\Big] \le \sum_{k=1}^{m}\Pr\Big[\exists t\in[\tau_{k+1},\tau_{k+2}]:\ |W_{t;\tau_k}| \ge \frac{\psi\beta D_k}{2+\psi}\Big] \le \delta.$$

Theorem 5 follows from Lemmas 35, 38 and 12, and the fact that $a_{k+1} = a_k + (1+\psi)La_k = a_0((1+\psi)L+1)^k$.


10. Lower Bound for Q-learning using a Linear Learning Rate

In this section we show a lower bound for Q-learning with a linear learning rate, of order $(\frac{1}{\varepsilon})^{\frac{1}{1-\gamma}}$. We consider the following MDP, denoted $M_0$, which has a single state $s$, a single action $a$, and a deterministic reward $R_{M_0}(s,a) = 0$. Since there is only one action in the MDP we write $Q_t(s)$ for $Q_t(s,a)$. We initialize $Q_0(s) = 1$ and observe the time until $Q_t(s) \le \varepsilon$.

Lemma 39 Consider running synchronous Q-learning with a linear learning rate on MDP $M_0$, initializing $Q_0(s) = 1$. Then there is a time $t = c(\frac{1}{\varepsilon})^{\frac{1}{1-\gamma}}$, for some constant $c > 0$, such that $Q_t(s) \ge \varepsilon$.

Proof First we prove by induction on $t$ that

$$Q_t(s) = \prod_{i=1}^{t-1}\frac{i+\gamma}{i+1}.$$

For $t = 1$ we have $Q_1(s) = (1-1/2)Q_0(s) + (1/2)\gamma Q_0(s) = (1+\gamma)/2$. Assume the hypothesis holds for $t-1$ and prove it for $t$. By definition,
$$Q_t(s) = \Big(1-\frac{1}{t}\Big)Q_{t-1}(s) + \frac{1}{t}\gamma Q_{t-1}(s) = \frac{t-1+\gamma}{t}Q_{t-1}(s).$$
In order to help us estimate this quantity we use the $\Gamma$ function. Let

$$\Gamma(x+1,k) = \frac{1\cdot 2\cdots k}{(x+1)(x+2)\cdots(x+k)}\,k^x.$$

The limit of $\Gamma(1+x,k)$, as $k$ goes to infinity, is constant for any $x$. We can rewrite $Q_t(s)$ as

$$Q_t(s) = \frac{1}{\Gamma(\gamma+1,t)}\cdot\frac{t^{\gamma}}{t+1} = \Theta(t^{\gamma-1}).$$

Therefore, there is a time $t = c(\frac{1}{\varepsilon})^{\frac{1}{1-\gamma}}$, for some constant $c > 0$, such that $Q_t(s) \ge \varepsilon$.
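The lower bound is easy to observe empirically; the following sketch (our own illustration, with the hypothetical function name `lower_bound_example`) runs Q-learning on $M_0$ (single state, single action, zero reward, $Q_0 = 1$) with learning rate $1/t^\omega$ and reports how long it takes to drop below $\varepsilon$.

```python
def lower_bound_example(gamma=0.7, omega=1.0, eps=0.1, max_steps=10**7):
    """Q-learning on M_0: Q_{t+1} = (1 - a_t) Q_t + a_t * (0 + gamma * Q_t), starting from Q_0 = 1.
    Returns the first time t at which Q_t <= eps (or None within max_steps)."""
    q = 1.0
    for t in range(1, max_steps + 1):
        alpha = 1.0 / t ** omega
        q = (1.0 - alpha) * q + alpha * gamma * q
        if q <= eps:
            return t
    return None

# With a linear rate (omega = 1) the hitting time grows roughly like (1/eps)**(1/(1-gamma)),
# while a polynomial rate (e.g. omega = 0.65) reaches the same eps far sooner.
print(lower_bound_example(omega=1.0), lower_bound_example(omega=0.65))
```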

Acknowledgments

This research was supported in part by a grant from the Israel Science Foundation. Eyal Even-Dar was partially supported by the Deutsch Institute.

References

K. Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 68:357–367, 1967.

F. Beleznay, T. Grobler, and C. Szepesvári. Comparing value-function estimation algorithms in undiscounted problems. Technical Report TR-99-02, Mindmaker Ltd, 1999.


Dimitri P. Bertsekas and John N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.

V.S. Borkar and S.P. Meyn. The O.D.E. method for convergence of stochastic approximation and reinforcement learning. SIAM J. Control, 38(2):447–69, 2000.

Tommi Jaakkola, Michael I. Jordan, and Satinder P. Singh. On the convergence of stochastic iterative dynamic programming algorithms. Neural Computation, 6, 1994.

Michael Kearns and Satinder P. Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In M.J. Kearns, S.A. Solla, and D.A. Cohn, editors, Advances in Neural Information Processing Systems 11, pages 996–1002, 1999.

Michael L. Littman and Csaba Szepesvári. A generalized reinforcement-learning model: Convergence and applications. In L. Saitta, editor, Proceedings of the 13th International Conference on Machine Learning (ICML-96), pages 310–318, Bari, Italy, 1996. Morgan Kaufmann. URL citeseer.nj.nec.com/littman96generalized.html.

Martin L. Puterman. Markov Decision Processes—Discrete Stochastic Dynamic Programming. John Wiley and Sons, Inc., New York, NY, 1994.

R. Sutton and A. Barto. Reinforcement Learning. MIT Press, Cambridge, MA, 1998.

C. Szepesvári. The asymptotic convergence-rate of Q-learning. In M.I. Jordan, M.J. Kearns, and S.A. Solla, editors, Advances in Neural Information Processing Systems 10, pages 1064–1070, 1998.

J. Tsitsiklis. Asynchronous stochastic approximation and Q-learning, 1994.

C. Watkins and P. Dayan. Q-learning. Machine Learning, 8(3/4):279–292, 1992.

C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge, England, 1989.
