Ensemble Bootstrapping for Q-Learning

Oren Peer 1 Chen Tessler 1 Nadav Merlis 1 Ron Meir 1

1 Viterbi Faculty of Electrical Engineering, Technion Institute of Technology, Haifa, Israel. Correspondence to: Oren Peer.

arXiv:2103.00445v2 [cs.LG] 20 Apr 2021

Abstract

Q-learning (QL), a common reinforcement learning algorithm, suffers from over-estimation bias due to the maximization term in the optimal Bellman operator. This bias may lead to sub-optimal behavior. Double-Q-learning tackles this issue by utilizing two estimators, yet results in an under-estimation bias. Similar to over-estimation in Q-learning, in certain scenarios, the under-estimation bias may degrade performance. In this work, we introduce a new bias-reduced algorithm called Ensemble Bootstrapped Q-Learning (EBQL), a natural extension of Double-Q-learning to ensembles. We analyze our method both theoretically and empirically. Theoretically, we prove that EBQL-like updates yield lower MSE when estimating the maximal mean of a set of independent random variables. Empirically, we show that there exist domains where both over- and under-estimation result in sub-optimal performance. Finally, we demonstrate the superior performance of a deep RL variant of EBQL over other deep QL algorithms for a suite of ATARI games.

1. Introduction

In recent years, reinforcement learning (RL) algorithms have impacted a vast range of real-world tasks, such as robotics (Andrychowicz et al., 2020; Kober et al., 2013), power management (Xiong et al., 2018), autonomous control (Bellemare et al., 2020), traffic control (Abdulhai et al., 2003; Wiering, 2000), and more (Mahmud et al., 2018; Luong et al., 2019). These achievements are made possible by the ability of RL agents to learn a behavior policy via interaction with the environment while simultaneously maximizing the obtained reward.

A common family of algorithms in RL is Q-learning (Watkins & Dayan, 1992, QL) based algorithms, which focus on learning the value function. The value represents the expected, discounted reward-to-go that the agent will obtain. In particular, such methods learn the optimal policy via an iterative maximizing bootstrapping procedure. A well-known property of QL is that this procedure results in a positive estimation bias. Van Hasselt (2010) and Van Hasselt et al. (2016) argue that over-estimating the real value may negatively impact the performance of the obtained policy. They propose Double Q-learning (DQL) and show that, as opposed to QL, it has a negative estimation bias.

The harmful effects of any bias on value learning are widely known. Notably, the deadly triad (Sutton & Barto, 2018; Van Hasselt et al., 2018) attributes the procedure of bootstrapping with biased values as one of the major causes of divergence when learning value functions. This has been shown empirically in multiple works (Van Hasselt et al., 2016; 2018). Although this phenomenon is well known, most works focus on avoiding positive estimation instead of minimizing the bias itself (Van Hasselt et al., 2016; Wang et al., 2016; Hessel et al., 2018).

In this work, we tackle the estimation bias in an under-estimation setting. We begin, similarly to Van Hasselt (2010), by analyzing the estimation bias when observing a set of i.i.d. random variables (RVs). We consider three schemes for estimating the maximal mean of these RVs. (i) The single-estimator, corresponding to a QL-like estimation scheme. (ii) The double-estimator, corresponding to DQL, an estimation scheme that splits the samples into two equal-sized sets: one for determining the index of the RV with the maximal mean and the other for estimating the mean of the selected RV. (iii) The ensemble-estimator, our proposed method, which splits the data into K sets. A single set is then used for estimating the index of the RV with the maximal mean, whereas all other sets are used for estimating the mean of the selected RV.

We start by showing that the ensemble estimator can be seen as a double estimator that unequally splits the samples (Proposition 1). This enables, for instance, less focus on maximal-index identification and more on mean estimation. We prove that the ensemble estimator under-estimates the maximal mean (Lemma 1). We also show, both theoretically (Proposition 2) and empirically (Fig. 1), that in order to reduce the magnitude of the estimation error, a double estimator with equally-distributed samples is sub-optimal.

Following this theoretical analysis, we extend the ensemble estimator to the multi-step reinforcement learning case. We call this extension Ensemble Bootstrapped Q-Learning (EBQL). We analyze EBQL in a tabular meta chain MDP and on a set of ATARI environments using deep neural networks. In the tabular case, we show that as the ensemble size grows, EBQL minimizes the magnitude of the estimation bias. Moreover, when coupled with deep neural networks, we observe that EBQL obtains superior performance on a set of ATARI domains when compared to Deep Q-Networks (DQN) and Double-DQN (DDQN).

Our contributions are as follows:

• We analyze the problem of estimating the maximum expectation over independent random variables. We prove that the estimation mean-squared error (MSE) can be reduced using the Ensemble Estimator. In addition, we show that obtaining the minimal MSE requires utilizing more than two ensemble members.

• Drawing inspiration from the above, we introduce Ensemble Bootstrapped Q-Learning (EBQL) and show that it reduces the bootstrapping estimation bias.

• We show that EBQL is superior to both Q-learning and double Q-learning in both a tabular setting and when coupled with deep neural networks (ATARI).

2. Preliminaries

2.1. Model Free Reinforcement Learning

A reinforcement learning (Sutton & Barto, 2018) agent faces a sequential decision-making problem in some unknown or partially known environment. We focus on environments in which the action space is discrete. The agent interacts with the environment as follows. At each time t, the agent observes the environment state s_t ∈ S and selects an action a_t ∈ A. Then, it receives a scalar reward r_t ∈ R and transitions to the next state s_{t+1} ∈ S according to some (unknown) transition kernel P(s_{t+1} | s_t, a_t). A policy π : S → A is a function that determines which action the agent should take at each state, and the sampled performance of a policy starting from state s, a random variable, is denoted by

$$R^\pi(s) = \sum_{t=0}^{\infty} \gamma^t r_t \;\Big|\; s_0 = s,\; a_t \sim \pi(s_t), \qquad (1)$$

where γ ∈ [0, 1) is the discount factor, which determines how myopic the agent should behave. As R^π is a random variable, in RL we are often interested in the expected performance, also known as the policy value,

$$v^\pi(s) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r(s_t) \;\Big|\; s_0 = s\right].$$

The goal of the agent is to learn an optimal policy π* ∈ argmax_π v^π(s). Importantly, there exists a policy π* such that v^{π*}(s) = v*(s) for all s ∈ S (Sutton & Barto, 2018).

2.2. Q-Learning

One of the most prominent algorithms for learning π* is Q-learning (Watkins & Dayan, 1992, QL). Q-learning learns the Q-function, which is defined as follows:

$$Q^\pi(s, a) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t r_t(s_t, a_t) \;\Big|\; s_0 = s,\; a_0 = a\right].$$

Similar to the optimal value function, the optimal Q-function is denoted by Q*(s, a) = max_π Q^π(s, a), ∀(s, a) ∈ S × A. Given Q*, the optimal policy π* can be obtained by using the greedy operator π*(s) = argmax_a Q*(s, a), ∀s ∈ S.

QL is an off-policy algorithm that learns Q* using transition tuples (s, a, r, s'). Here, s' is the state that is observed after performing action a in state s and receiving reward r. Specifically, to do so, QL iteratively applies the optimal Bellman equation (Bellman, 1957):

$$Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha \left(r_t + \gamma \max_{a'} Q(s', a')\right). \qquad (2)$$

Under appropriate conditions, QL asymptotically converges almost surely to an optimal fixed-point solution (Tsitsiklis, 1994), i.e., Q_t(s, a) → Q*(s, a) as t → ∞, ∀s ∈ S, a ∈ A.

A known phenomenon in QL is the over-estimation of the Q-function. The Q-function is a random variable, and the optimal Bellman operator, Eq. (2), selects the maximal value over all actions at each state. Such estimation schemes are known to be positively biased (Smith & Winkler, 2006).

2.3. Double Q-Learning

A major issue with QL is that it overestimates the true Q-values. Van Hasselt (2010); Van Hasselt et al. (2016) show that, in some tasks, this overestimation can lead to slow convergence and poor performance. To avoid overestimation, Double Q-learning (Van Hasselt, 2010, DQL) was introduced. DQL combines two Q-estimators, Q^A and Q^B, each updated over a unique subset of the gathered experience. For any two estimators A and B, the update step of estimator A is defined as (full algorithm in the appendix, Algorithm 3):

$$\hat{a}^*_A = \operatorname*{argmax}_{a'} Q^A(s_{t+1}, a'),$$
$$Q^A(s_t, a_t) \leftarrow (1 - \alpha_t) Q^A(s_t, a_t) + \alpha_t \left(r_t + \gamma Q^B(s_{t+1}, \hat{a}^*_A)\right). \qquad (3)$$

Importantly, Van Hasselt (2010) proved that DQL, as opposed to QL, leads to underestimation of the Q-values.
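To make the two update rules concrete, the sketch below implements Eq. (2) and Eq. (3) for a tabular agent. It is a minimal illustration, not the authors' code; the Q-tables are NumPy arrays indexed by (state, action), and how states and rewards are obtained from the environment is left to the caller.

```python
import numpy as np

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Eq. (2): bootstrap with the max over the same table (positively biased)."""
    target = r + gamma * np.max(Q[s_next])
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target

def double_q_learning_update(QA, QB, s, a, r, s_next, alpha, gamma):
    """Eq. (3): QA selects the next-state action, QB evaluates it (negatively biased).
    In DQL the roles of QA and QB alternate between updates (Algorithm 3)."""
    a_star = np.argmax(QA[s_next])            # index selection with QA
    target = r + gamma * QB[s_next, a_star]   # mean evaluation with QB
    QA[s, a] = (1 - alpha) * QA[s, a] + alpha * target
```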

3. Related Work

Q-learning stability. Over the last three decades, QL has inspired a large variety of algorithms and improvements, both in the tabular setting (Strehl et al., 2006; Kearns & Singh, 1999; Ernst et al., 2005; Azar et al., 2011) and in function-approximation-based settings (Schaul et al., 2015; Bellemare et al., 2017; Hessel et al., 2018; Jin et al., 2018; Badia et al., 2020). Many improvements focus on the estimation bias of the learned Q-values, first identified by Thrun & Schwartz (1993), and on minimizing the variance of the target approximation error (Anschel et al., 2017). Addressing this bias is an ongoing effort in other fields as well, such as economics (Smith & Winkler, 2006; Thaler, 2012). Some algorithms (Van Hasselt, 2010; Zhang et al., 2017) tackle the inherent bias of the QL estimator by using double estimators, resulting in a negative bias.

RL algorithms that use neural-network-based function approximators (deep RL) are known to be susceptible to over-estimation and oftentimes suffer from poor performance in challenging environments. Specifically, estimation bias, a member of the deadly triad (Van Hasselt et al., 2018), is considered one of the main reasons for the divergence of QL-based deep-RL algorithms.

In our work, we tackle the estimation bias by reducing the MSE of the next-state Q-values. While EBQL is negatively biased, we show that it better balances the bias and variance of the estimator. We also show that the bias magnitude of EBQL is governed by the ensemble size and reduces as the ensemble size grows.

Ensembles. One of the most exciting applications of ensembles in RL (Osband et al., 2018; Lee et al., 2020) is to improve exploration and data collection. These methods can be seen as a natural extension of Thompson-sampling-like methods to deep RL. The focus of this work complements their achievements. While they consider how an ensemble of estimators can improve the learning process, we focus on how to better train the estimators themselves.

4. Estimating the Maximum Expected Value

As mentioned in Section 2, QL and DQL display opposing behaviors of over- and under-estimation of the Q-values. To better understand these issues, and our contribution, we take a step back from RL and statistically analyze the different estimators on i.i.d. samples.

Consider the problem of estimating the maximal expected value of m independent random variables (X_1, ..., X_m) ≜ X, with means µ = (µ_1, ..., µ_m) and standard deviations σ = (σ_1, ..., σ_m). Namely, we are interested in estimating

$$\max_a \mathbb{E}[X_a] = \max_a \mu_a \triangleq \mu^*. \qquad (4)$$

The estimation is based on i.i.d. samples from the same distribution as X. Formally, let S = {S_a}_{a=1}^m be a set of samples, where S_a = {S_a(n)}_{n=1}^{N_a} is a subset of N_a i.i.d. samples from the same distribution as X_a. Specifically, this implies that E[S_a(n)] = µ_a for all n ∈ [N_a]. We further assume all sets are mutually independent.

Denote the empirical mean of a set S_a by µ̂_a(S_a) = (1/N_a) Σ_{n∈[N_a]} S_a(n). This is an unbiased estimator of µ_a, and given sufficiently many samples, it is reasonable to approximate µ̂_a(S_a) ≈ µ_a. Then, a straightforward method to estimate (4) is to use the Single-Estimator (SE):

$$\hat{\mu}^*_{\mathrm{SE}} \triangleq \max_a \hat{\mu}_a(S_a) \approx \max_a \mathbb{E}[\hat{\mu}_a(S_a)] = \mu^*.$$

As shown in (Smith & Winkler, 2006) (and rederived in Appendix D for completeness), this estimator is positively biased. This overestimation is believed to negatively impact the performance of SE-based algorithms, such as QL.

To mitigate this effect, Van Hasselt (2010) introduced the Double Estimator (DE). In DE, the samples of each random variable a ∈ [m] are split into two disjoint, equal-sized subsets S_a^{(1)} and S_a^{(2)}, such that S_a^{(1)} ∪ S_a^{(2)} = S_a and S_a^{(1)} ∩ S_a^{(2)} = ∅ for all a ∈ [m]. For brevity, we denote the empirical mean of the samples in S_a^{(j)} by µ̂_a^{(j)} ≜ µ̂_a(S_a^{(j)}). Then, DE uses a two-phase estimation process: in the first phase, the index of the variable with the maximal expectation is estimated using the empirical means of S^{(1)}, â* = argmax_a µ̂_a^{(1)}. In the second phase, the mean of X_{â*} is estimated using S^{(2)}_{â*}: µ̂*_DE = µ̂^{(2)}_{â*}. The resulting estimator is negatively biased, i.e., E[µ̂*_DE] ≤ µ* (Van Hasselt, 2010).

4.1. The Ensemble Estimator

In this section, we introduce the Ensemble Estimator (EE), a natural generalization of DE to K estimators. We take another step forward and ask:

How can we benefit by using K estimators rather than 2?

In the DE, two sources affect the approximation error. One type of error arises when the wrong index â* is selected (inability to identify the maximal index). The second, when the mean is incorrectly estimated. While increasing the number of samples used to identify the index reduces the chance of misidentification, it results in fewer samples for mean estimation. Notably, the DE naïvely allocates the same number of samples to both stages. As we show, the optimal "split" is not necessarily equal; hence, in an attempt to minimize the total MSE, a good estimator should more carefully tune the number of samples allocated to each stage.
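The following Monte Carlo sketch contrasts the single estimator with the K-subset scheme (K = 2 recovers the DE; the general ensemble estimator is defined formally below). The means, variance and sample size are illustrative choices, not values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, 0.7, 0.4, 0.1])   # true means; max E[X_a] = 1.0
sigma, N, trials = 1.0, 40, 20000

def single(S):                        # SE: max of empirical means (positive bias)
    return S.mean(axis=1).max()

def ensemble(S, K):                   # one subset picks the index, the rest estimate its mean
    subsets = np.split(S, K, axis=1)              # K equal-sized subsets per variable
    a_hat = subsets[0].mean(axis=1).argmax()      # index-estimation on subset 0
    rest = np.concatenate(subsets[1:], axis=1)
    return rest[a_hat].mean()                     # mean-estimation on the remaining K-1

est = {"SE": [], "DE (K=2)": [], "EE (K=5)": []}
for _ in range(trials):
    S = rng.normal(mu[:, None], sigma, size=(len(mu), N))
    est["SE"].append(single(S))
    est["DE (K=2)"].append(ensemble(S, 2))
    est["EE (K=5)"].append(ensemble(S, 5))

true_max = mu.max()
for name, v in est.items():           # SE is positively biased; DE/EE negatively biased (Lemma 1)
    v = np.array(v)
    print(f"{name}: bias={v.mean() - true_max:+.3f}, MSE={np.mean((v - true_max) ** 2):.4f}")
```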

Consider, for example, the case of two independent random variables (X_1, X_2) ∼ (N(µ_1, σ²), N(µ_2, σ²)). When the means µ_1 and µ_2 are dissimilar, i.e., |µ_1 − µ_2| / σ ≫ 1, the chance of index misidentification is low, and thus it is preferable to allocate more samples to the task of mean estimation. On the other extreme, as |µ_1 − µ_2| / σ → 0, the two distributions are effectively identical, and the task of index identification becomes irrelevant. Thus, most of the samples should be utilized for mean estimation. Interestingly, in both these extremities, it is beneficial to allocate more samples to mean estimation over index identification.

As argued above, minimizing the MSE of the estimator can be achieved by controlling the relative number of samples in each estimation phase. We will now formally define the ensemble estimator and prove that it is equivalent to changing the relative allocation of samples between index-estimation and mean-estimation (Proposition 1). Finally, we demonstrate both theoretically and numerically that EE achieves lower MSE than DE, and by doing so, better balances the two sources of errors.

Ensemble estimation is defined as follows: while the DE divides the samples into two sets, the EE divides the samples of each random variable a ∈ [m] into K equal-sized disjoint subsets, such that S_a = ∪_{k=1}^K S_a^{(k)} and S_a^{(k)} ∩ S_a^{(l)} = ∅, ∀k ≠ l ∈ [K]. We further denote the empirical mean of the a-th component of X, based on the k-th set, by µ̂_a^{(k)}. As in double-estimation, EE uses a two-phase procedure to estimate the maximal mean. First, a single arbitrary set k̃ ∈ [K] is used to estimate the index of the maximal expectation, â* = argmax_a µ̂_a^{(k̃)}. Then, EE utilizes the remaining ensemble members to jointly estimate µ_{â*}, namely,

$$\hat{\mu}^*_{\mathrm{EE}} \triangleq \frac{1}{K-1} \sum_{j \in [K] \setminus \tilde{k}} \hat{\mu}^{(j)}_{\hat{a}^*} \approx \max_a \mathbb{E}\left[\frac{1}{K-1} \sum_{j \in [K] \setminus \tilde{k}} \hat{\mu}^{(j)}_a\right] = \mu^*.$$

Lemma 1 shows that by using EE, we still maintain the underestimation nature of DE.

Lemma 1. Let M = argmax_a E[X_a] be the set of indices where E[X_a] is maximized. Let â* ∈ argmax_a µ̂_a^{(k̃)} be the estimated maximal-expectation index according to the k̃-th subset. Then E[µ̂*_EE] = E[µ_{â*}] ≤ max_a E[X_a]. Moreover, the inequality is strict if and only if P(â* ∉ M) > 0.

The proof can be found in Appendix A. In addition, below, we show that changing the size of the ensemble is equivalent to controlling the partition of data between the two estimation phases. As previously mentioned, the split-ratio greatly affects the MSE, and therefore, this property enables the analysis of the effect of the ensemble size K on the MSE.

Proposition 1 (The Proxy Identity). Let S^{(1)}, ..., S^{(K)} be some subsets of the samples of equal sizes N/K and let k̃ ∈ [K] be an arbitrary index. Also, denote by µ̂*_EE the EE that uses the k̃-th subset for its index-estimation phase. Finally, let µ̂*_W-DE be a DE that uses S^{(k̃)} for its index-selection and all other samples for mean-estimation; i.e., â* = argmax_a µ̂_a^{(k̃)} and µ̂*_W-DE = µ̂_{â*}(∪_{j∈[K]\k̃} S^{(j)}). Then, µ̂*_EE = µ̂*_W-DE.

We call the weighted version of the DE that uses 1/K of its samples for the index estimation W-DE. The proxy identity establishes that W-DE is equivalent to the ensemble estimator. Specifically, let N_1 be the number of samples used in the first phase of W-DE (index estimation) and say that the MSE of the W-DE is minimized at N_1 = N_1*. Then, by the proxy identity, the optimal ensemble size of the EE is K* ≈ N / N_1*. Thus, the optimal split-ratio of W-DE serves as a proxy to the number of estimators to be used in EE.

To better understand the optimal split-ratio, we analyze the MSE of the EE. To this end, we utilize the proxy identity and instead calculate the MSE of the W-DE with N_1 ≈ N/K samples for index-estimation. Let X be a vector with independent components of means (µ_1, ..., µ_m) and variances (σ_1², ..., σ_m²). Assume w.l.o.g. that the means are sorted such that µ_1 ≥ µ_2 ≥ ... ≥ µ_m. Then, the following holds (see derivations in Appendix C.1):

$$\mathrm{bias}(\hat{\mu}^*_{\text{W-DE}}) = \sum_{a=1}^m (\mu_a - \mu_1)\, P(\hat{a}^* = a),$$
$$\mathrm{var}(\hat{\mu}^*_{\text{W-DE}}) = \sum_{a=1}^m \left(\frac{\sigma_a^2}{N - N_1} + \mu_a^2\right) P(\hat{a}^* = a) - \left(\sum_{a=1}^m \mu_a P(\hat{a}^* = a)\right)^2.$$

As µ_1 is assumed to be the largest, the bias is always negative; hence, EE underestimates the maximal mean. Furthermore, we derive a closed form for the MSE:

$$\mathrm{MSE}(\hat{\mu}^*_{\text{W-DE}}) = \sum_{a=1}^m \left(\frac{\sigma_a^2}{N - N_1} + (\mu_1 - \mu_a)^2\right) P(\hat{a}^* = a). \qquad (5)$$

We hypothesize that in most cases, the MSE is minimized for K > 2. As (5) is rather cumbersome to analyze, we prove this hypothesis in a simpler case and demonstrate it numerically for more challenging situations.

Proposition 2. Let X = (X_1, X_2)^T ∼ N((µ_1, µ_2)^T, σ² I_2) be a Gaussian random vector such that µ_1 ≥ µ_2 and let ∆ = µ_1 − µ_2. Also, define the signal-to-noise ratio as SNR = ∆ / (σ/√N) and let µ̂*_W-DE be a W-DE that uses N_1 samples for index estimation. Then, for any fixed even sample-size N > 10 and any N_1* that minimizes MSE(µ̂*_W-DE), it holds that (1) as SNR → ∞, N_1* → 1; (2) as SNR → 0, N_1* → 1; (3) for any σ and ∆, it holds that N_1* < N/2.
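A small numerical sketch of Eq. (5) for the two-Gaussian case of Proposition 2, reproducing the qualitative behaviour of Fig. 1. The values of N, σ and ∆ below are illustrative, not the paper's.

```python
import numpy as np
from scipy.stats import norm

def mse_wde(N1, N, delta, sigma):
    """Closed-form MSE of the W-DE for two Gaussians with equal variance (Eq. (5)):
    N1 samples pick the index, N - N1 samples estimate the chosen mean."""
    p_wrong = 1.0 - norm.cdf(delta * np.sqrt(N1) / (np.sqrt(2) * sigma))
    return sigma ** 2 / (N - N1) + delta ** 2 * p_wrong

N, sigma = 100, 0.5                      # sigma**2 = 0.25, as in Fig. 1
for delta in [0.1, 0.5, 1.0, 2.0]:
    n1 = np.arange(1, N)                 # all admissible split points
    best = n1[np.argmin(mse_wde(n1, N, delta, sigma))]
    print(f"delta={delta}: optimal N1 = {best} (N/2 = {N // 2})")
```

In line with claim (3) of Proposition 2, the printed optimal N_1 stays below N/2 for every gap.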

Figure 1: The MSE values as a function of the split-ratio for two Gaussians with different mean-gaps ∆ = µ_1 − µ_2 and σ² = 0.25. Stars mark values in which the minimum MSE is achieved. As expected from Proposition 2, the optimal split-ratio is always smaller than N/2. The black arrows show the trend of the optimum as ∆ increases. Notably, the optimal split-ratio increases from 0 to 0.4 and then decreases back to 0, in alignment with the SNR claims.

Figure 2: Numerical calculation of the optimal sample split-ratio as a function of the normalized gap for 2, 4 and 6 Gaussians with means uniformly spread over the interval ∆ = µ_max − µ_min.

The proof can be found in Appendix C.2. Note that a similar analysis can be done for sub-Gaussian variables, using standard concentration bounds instead of Φ. However, in this case, we have to bound P(â* = a) and can only analyze an upper bound of the MSE.

Proposition 2 yields a rather intuitive, yet surprising result. From the first claim, we learn that when X_1 and X_2 are easily distinguishable (high SNR), samples should not be invested in index-estimation, but rather allocated to the mean-estimation. Moreover, the second claim establishes the same conclusion for the case where X_1 and X_2 are indiscernible (low SNR). Then, X_1 and X_2 have near-identical means, and samples are better used for reducing the variance of any of them. The most remarkable claim is the last one; it implies that the optimal split ratio is always smaller than half. In turn, when N is large enough, this implies that there exists K > 2 such that the MSE of EE is strictly lower than that of DE. To further demonstrate our claim for intermediate values of the SNR, Fig. 1 shows the MSE as a function of N_1 for different values of ∆ = µ_1 − µ_2 (and fixed variance σ² = 0.25).

In Fig. 2 we plot the value of N_1 that minimizes the MSE as a function of the normalized distance of the means, ∆/(σ√m), for a different number of random variables m ∈ {2, 4, 6}. The means are evenly spread on the interval ∆, all with σ² = 0.25. Fig. 2 further demonstrates that using N_1 < N/2 can reduce the MSE in a large range of scenarios. By the proxy identity, this is equivalent to using ensemble sizes of K > 2. Therefore, we expect that in practice, EE will outperform DE in terms of the minimal MSE (MMSE).

5. Ensemble Bootstrapped Q-Learning

In Section 4.1, we presented the ensemble estimator and its benefits. Namely, ensembles allow unevenly splitting samples between those used to select the index of the maximal component and those used to approximate its mean. We also showed that allocating more samples to estimate the mean, rather than the index, reduces the MSE in various scenarios.

Although the analysis in Section 4.1 focuses on a statistical framework, a similar operation is performed by the QL agent during training when it bootstraps on next-state values. Specifically, recall that R^π(s, a) is the reward-to-go when starting from state s and action a (see Eq. (1)). Then, denoting (X_1, ..., X_m) = (R^π(s_{t+1}, a_1), ..., R^π(s_{t+1}, a_m)) and (µ_1, ..., µ_m) = (Q^π(s_{t+1}, a_1), ..., Q^π(s_{t+1}, a_m)), the next-state value used by QL is determined by max_a µ_a. As the real Q-values are unknown, DQL uses two Q-estimators, (µ_1^{(1)}, ..., µ_m^{(1)}) = Q^A and (µ_1^{(2)}, ..., µ_m^{(2)}) = Q^B, in a way that resembles the DE – the index is selected as a* = argmax_a µ_a^{(1)} and the value as µ_{a*}^{(2)}.

Our method, Ensemble Bootstrapped Q-Learning (EBQL), presented in Algorithm 1, can be seen as applying the concepts of EE to the QL bootstrapping phase by using an ensemble of K Q-function estimators. Similar to DQL, when updating the k-th ensemble member in EBQL, we define the next-state action as â* = argmax_a Q^k(s_{t+1}, a). However, while DQL utilizes two estimators, in EBQL the value is obtained by averaging over the remaining ensemble members, Q^{EN\k} = (1/(K−1)) Σ_{j∈[K]\k} Q^j(s_{t+1}, â*). Notice how when K = 2 (two ensemble members) we recover the same update scheme as DQL, showing how EBQL is a natural extension of DQL to ensembles. In practice, the update is done using a learning rate α_t, as in DQL (Eq. (3)).
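A minimal tabular sketch of this update (listed as Algorithm 1 below); the ensemble is represented as a single NumPy array of shape (K, |S|, |A|), which is our own choice and not part of the paper.

```python
import numpy as np

def ebql_update(Q, s, a, r, s_next, alpha, gamma, rng):
    """One EBQL step on an ensemble Q of shape (K, n_states, n_actions)."""
    K = Q.shape[0]
    k = rng.integers(K)                              # ensemble member to update
    a_star = np.argmax(Q[k, s_next])                 # index-selection with member k
    others = [j for j in range(K) if j != k]
    q_rest = np.mean(Q[others, s_next, a_star])      # mean-estimation with the remaining K-1
    Q[k, s, a] = (1 - alpha) * Q[k, s, a] + alpha * (r + gamma * q_rest)

def greedy_action(Q, s):
    """Actions are chosen greedily w.r.t. the sum (equivalently, average) of the ensemble."""
    return int(np.argmax(Q[:, s].sum(axis=0)))
```

With K = 2 this reduces to the Double Q-learning update, mirroring the observation above.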


Figure 3: Correct-action rate from state A^i as a function of episode in the Meta Chain MDP. Fig. 3a shows the results of a specific chain-MDP where µ_i = −0.2 < 0, whereas in Fig. 3b, µ_i = 0.2 > 0. In Fig. 3c, the average over 6 symmetric µ-values is presented. All results are averaged over 50 random seeds.

Algorithm 1 Ensemble Bootstrapped Q-Learning (EBQL)

Parameters: learning rates {α_t}_{t≥1}
Initialize: Q-ensemble of size K: {Q^i}_{i=1}^K, s_0
for t = 0, ..., T do
  Choose action a_t = argmax_a Σ_{i=1}^K Q^i(s_t, a)
  a_t = explore(a_t)  // e.g., ε-greedy
  s_{t+1}, r_t ← env.step(s_t, a_t)
  Sample an ensemble member to update: k_t ∼ U([K])
  â* = argmax_a Q^{k_t}(s_{t+1}, a)
  Q^{k_t}(s_t, a_t) ← (1 − α_t) Q^{k_t}(s_t, a_t) + α_t (r_t + γ Q^{EN\k_t}(s_{t+1}, â*))
end for
Return {Q^i}_{i=1}^K

Figure 4: The Meta-Chain-MDP, a set of chain-MDPs differing by the reward at state D^i. Each episode the agent is randomly initialized in one of the MDPs. The variance σ = 1 is identical across the chain MDPs.

6. Experiments

In this section, we present two main experimental results of EBQL compared to QL and DQL, in both a tabular setting and on the ATARI ALE (Bellemare et al., 2013) using the deep RL variants. We provide additional details on the methodology in Appendix E.

6.1. Tabular Experiment - Meta Chain MDP

We begin by analyzing the empirical behavior of the various algorithms on a Meta Chain MDP problem (Fig. 4). In this setting, an agent repeatedly interacts with a predetermined set of N chain-MDPs. All chain-MDPs share the same structure, and by the end of the interaction with one MDP, the next MDP is sampled uniformly at random.

For each chain-MDP i ∈ [N], the agent starts at state A^i and can move either left or right. While both immediate rewards are 0, transitioning to state C^i terminates the episode. However, from state B^i, the agent can perform one of m actions, all transitioning to state D^i and providing the same reward r(D^i) ∼ N(µ_i, σ²), and the episode is terminated.

The set of chain MDPs is a symmetrical mixture of chains with positive and negative means. While a negative µ_i amplifies the sub-optimality of QL due to the over-estimation bootstrapping bias, we observe that a positive µ_i exposes the sub-optimality of DQL's under-estimation bias.

This simple, yet challenging MDP serves as a playground to examine whether an algorithm balances optimism and pessimism. It also provides motivation for the general case, where we do not know if optimism or pessimism is better.

Analysis: We present the empirical results in Fig. 3. In addition to the results on the meta chain MDP, we also present results on the single-chain setting, for both positive and negative reward means µ_i.

As expected, due to its optimistic nature, QL excels in sub-MDPs where µ_i > 0 (Fig. 3b).

Figure 5: Estimation bias of the optimal action in state Ai as a function of time in the meta-chain MDP. Due to the zero-initialization of the Q-tables, both DQL and EBQL are positively biased for a short period of time.

The optimistic nature of QL drives the agent towards regions with higher uncertainty (variance). As argued above, we observe that in this case, DQL is sub-optimal due to the pessimistic nature of the estimator. On the other hand, due to its pessimistic nature, and as shown in Van Hasselt (2010), DQL excels when µ_i < 0. We replicate their results in Fig. 3a.

Although QL and DQL are capable of exploiting the nature of certain MDPs, this may result in catastrophic failures. On the other hand, and as shown in Fig. 3c, EBQL excels in the meta chain MDP, a scenario that averages over the sub-MDPs, showing the robustness of the method to a more general MDP. Moreover, we observe that as the size of the ensemble grows, the performance of EBQL improves.

Estimation Bias: In addition to reporting the performance, we empirically compare the bias of the various learned Q-functions on the meta chain MDP. The results are presented in Fig. 5. As expected, QL and DQL converge to 0+ and 0− respectively. Interestingly, the absolute bias of EBQL decreases and the algorithm becomes less pessimistic as the ensemble size K grows. This explains the results of Fig. 3, where increasing the ensemble size greatly improves the performance in chain MDPs with µ_i > 0. To maintain fairness, all Q-tables were set to zero, which can explain the positive bias of EBQL in the initial episodes.

6.2. Atari

Here, we evaluate EBQL in a high-dimensional task – the ATARI ALE (Bellemare et al., 2013). We present EBQL's performance with K = 5 (ensemble size). Here, we compare to the DQN (Mnih et al., 2015) and DDQN (Van Hasselt et al., 2016) algorithms, the deep RL variants of Q-learning and Double-Q-Learning respectively.

Figure 6: Normalized scores: We present both the mean and median normalized scores over the 11 environments (Table 1) with 5 seeds per environment. Curves are smoothed with a moving average over 5 points. In addition to DQN and DDQN, we provide a comparison against Rainbow (Hessel et al., 2018). Rainbow represents the state-of-the-art non-distributed Q-learning-based agent.

While there exists a multitude of extensions and practical improvements to DQN, our work focuses on learning the Q-function at a fundamental level. Similarly to what is shown in Rainbow (Hessel et al., 2018), most of these methods are orthogonal and will thus only complement our approach. In order to obtain a fair comparison, we do not use the ensemble for improved exploration (as was done in Osband et al. 2016; 2018; Lee et al. 2020). Rather, the action is selected using a standard ε-greedy exploration scheme, where the greedy estimator is taken on the average Q-values of the ensemble. All hyper-parameters are identical to the baselines, as reported in (Mnih et al., 2015), including the use of target networks (see Appendix E for pseudo-codes).

Analysis: We test all methods on 11 randomly-chosen environments. Each algorithm is trained over 50m steps and 5 random seeds. To ensure fairness, for DQN and DDQN we report the results as provided by Quan & Ostrovski (2020). The complete training curves over all environments are provided in Appendix F. The numerical results are shown in Table 1.

Table 1: ATARI: Comparison of the DQN, DDQN and EBQL agents on 11 random ATARI environments. Each algorithm is evaluated for 50m steps over 5 random seeds. We present the average score of the final policies.

Environment       Human      Random       DQN       DDQN       EBQL
Asterix          8503.3       210.0     4603.2     5718.1    22152.5
Breakout           31.8         1.7      283.5      333.4      406.3
CrazyClimber    35410.5     10780.5    93677.1   111765.4   127967.5
DoubleDunk        -15.5       -18.6      -15.8      -18.6      -10.1
Gopher           2321.0       257.6     3439.4     6940.5    21940.0
Pong                9.3       -20.7       17.4       19.9       21.0
PrivateEye      69571.3        24.9       78.1      100.0      100.0
Qbert           13455.0       163.9     5280.2     6277.1    14384.4
RoadRunner       7845.0        11.5    27671.4    40264.9    55927.5
Tennis             -8.9       -23.8       -1.2      -14.5       -1.1
VideoPinball    17297.6     16256.9   104146.4   230534.3   361205.1
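The normalized scores of Fig. 6 are not defined explicitly in the text; a common convention when Human and Random reference scores are reported, as in Table 1, is the human-normalized score. A minimal sketch under that assumption:

```python
def normalized_score(score, human, random):
    # Human-normalized score: 0 = random policy, 1 = human baseline.
    # This convention is an assumption; the paper does not state the formula.
    return (score - random) / (human - random)

# Example using the Asterix row of Table 1.
print(normalized_score(22152.5, human=8503.3, random=210.0))   # EBQL  ~2.65
print(normalized_score(5718.1,  human=8503.3, random=210.0))   # DDQN  ~0.66
```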

In addition to the numerical results, we present the mean and median normalized scores in Fig. 6.

When comparing our method to DQN and DDQN, we observe that, in the environments we tested, EBQL consistently obtains the highest performance. In addition, PrivateEye is the only evaluated environment where EBQL did not outperform the human baseline. Remarkably, despite EBQL not containing many improvements (e.g., prioritized experience replay (Schaul et al., 2015), distributional estimators (Bellemare et al., 2017), and more (Hessel et al., 2018)), it outperforms the more complex Rainbow in the small-sample region (< 20 million steps, Fig. 6). Particularly in the Breakout and RoadRunner domains (Appendix F), EBQL also outperforms Rainbow over all 50m training steps.

7. Summary and Future Work

In this work, we presented Ensemble Bootstrapped Q-Learning, an under-estimating bias-reduction method. We started by analyzing the simple statistical task of estimating the maximal mean of a set of independent random variables. Here, we proved that the ensemble estimator is equivalent to a weighted double-estimator that unevenly splits samples between its two phases (index-selection and mean-estimation). Moreover, we showed both theoretically and empirically that in some scenarios the optimal partition is not symmetrical; in such cases, using the ensemble estimator is beneficial.

Based on this analysis, we propose EBQL, a way of utilizing the ensemble estimator in RL. Our empirical results in the meta chain MDP show that, compared to QL (single estimator) and DQL (double estimator), EBQL dramatically reduces the estimation bias while obtaining superior performance. Finally, we evaluated EBQL on 11 randomly selected ATARI environments. To ensure fairness, we compared DQN, DDQN and EBQL without any additional improvements (Hessel et al., 2018). In all the evaluated environments, EBQL exhibits the best performance, in addition to outperforming the human baseline in all but a single environment. Finally, EBQL performs competitively with Rainbow in the small-samples regime, even though it does not utilize many of its well-known improvements.

Ensembles are becoming prevalent in RL, especially due to their ability to provide uncertainty estimates and improved exploration techniques (Osband et al., 2018; Lee et al., 2020). Although ensembles are integral to our method, to ensure a fair evaluation, we did not utilize their properties for improved exploration. These results are exciting as they suggest that by utilizing these properties, the performance of EBQL can be dramatically improved.

We believe that our work leaves room for many interesting extensions. First, recall that EBQL fixes the ensemble size at the beginning of the interaction and then utilizes a single estimator to estimate the action index. However, in Section 4.1, we showed that the optimal split ratio depends on the distribution of the random variables in question. Therefore, one possible extension is to dynamically change the number of ensemble members used for the index-estimation, according to observed properties of the MDP.

Also, when considering the deep RL implementation, we provided a clean comparison of the basic algorithms (QL, DQL and EBQL). Notably, and in contrast to previous work on ensembles in RL, we did not use ensembles for improved exploration. However, doing so is non-trivial; for exploration needs, previous work decouples the ensemble members during training, whereas our update does the opposite.

References
Abdulhai, B., Pringle, R., and Karakoulas, G. J. Reinforcement learning for true adaptive traffic signal control.

Journal of Transportation Engineering, 129(3):278–285, Kearns, M. and Singh, S. Finite-sample convergence rates 2003. for q-learning and indirect algorithms. Advances in neural information processing systems, pp. 996–1002, 1999. Andrychowicz, O. M., Baker, B., Chociej, M., Jozefowicz, R., McGrew, B., Pachocki, J., Petron, A., Plappert, M., Kober, J., Bagnell, J. A., and Peters, J. Reinforcement Powell, G., Ray, A., et al. Learning dexterous in-hand learning in robotics: A survey. The International Journal manipulation. The International Journal of Robotics of Robotics Research, 32(11):1238–1274, 2013. Research, 39(1):3–20, 2020. Lee, K., Laskin, M., Srinivas, A., and Abbeel, P. Sunrise: A Anschel, O., Baram, N., and Shimkin, N. Averaged-dqn: simple unified framework for ensemble learning in deep Variance reduction and stabilization for deep reinforce- reinforcement learning. arXiv preprint arXiv:2007.04938, ment learning. In International Conference on Machine 2020. Learning, pp. 176–185. PMLR, 2017. Luong, N. C., Hoang, D. T., Gong, S., Niyato, D., Wang, P., Arulkumaran, K. Rainbow dqn. https://https:// Liang, Y.-C., and Kim, D. I. Applications of deep rein- github.com/Kaixhin/Rainbow, 2019. forcement learning in communications and networking: A survey. IEEE Communications Surveys & Tutorials, 21 Azar, M. G., Munos, R., Ghavamzadeh, M., and Kappen, H. (4):3133–3174, 2019. Speedy q-learning. In Advances in Neural Information Processing Systems, 2011. Mahmud, M., Kaiser, M. S., Hussain, A., and Vassanelli, S. Applications of and reinforcement learning Badia, A. P., Piot, B., Kapturowski, S., Sprechmann, P., to biological data. IEEE transactions on neural networks Vitvitskyi, A., Guo, Z. D., and Blundell, C. Agent57: and learning systems, 29(6):2063–2079, 2018. Outperforming the atari human benchmark. In Interna- tional Conference on , pp. 507–517. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, PMLR, 2020. J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidje- land, A. K., Ostrovski, G., et al. Human-level control Bellemare, M. G., Naddaf, Y., Veness, J., and Bowling, M. through deep reinforcement learning. nature, 518(7540): The arcade learning environment: An evaluation plat- 529–533, 2015. form for general agents. Journal of Artificial Intelligence Research, 47:253–279, 2013. Osband, I., Blundell, C., Pritzel, A., and Van Roy, B. Deep exploration via bootstrapped dqn. In Advances in neural Bellemare, M. G., Dabney, W., and Munos, R. A distribu- information processing systems, pp. 4026–4034, 2016. tional perspective on reinforcement learning. In Interna- tional Conference on Machine Learning, pp. 449–458. Osband, I., Aslanides, J., and Cassirer, A. Randomized prior PMLR, 2017. functions for deep reinforcement learning. In Advances in Neural Information Processing Systems, pp. 8617–8629, Bellemare, M. G., Candido, S., Castro, P. S., Gong, J., 2018. Machado, M. C., Moitra, S., Ponda, S. S., and Wang, Z. Autonomous navigation of stratospheric balloons using Quan, J. and Ostrovski, G. DQN Zoo: Reference imple- reinforcement learning. Nature, 588(7836):77–82, 2020. mentations of DQN-based agents, 2020. URL http: //github.com/deepmind/dqn_zoo. Bellman, R. A markovian decision process. Journal of mathematics and mechanics, pp. 679–684, 1957. Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Priori- tized experience replay. arXiv preprint arXiv:1511.05952, Ernst, D., Geurts, P., and Wehenkel, L. Tree-based batch 2015. mode reinforcement learning. 
Journal of Machine Learn- ing Research, 6:503–556, 2005. Smith, J. E. and Winkler, R. L. The optimizer’s curse: Skepticism and postdecision surprise in decision analysis. Hessel, M., Modayil, J., Van Hasselt, H., Schaul, T., Os- Management Science, 52(3):311–322, 2006. trovski, G., Dabney, W., Horgan, D., Piot, B., Azar, M., and Silver, D. Rainbow: Combining improvements in Strehl, A. L., Li, L., Wiewiora, E., Langford, J., and Littman, deep reinforcement learning. In Proceedings of the AAAI M. L. Pac model-free reinforcement learning. In Pro- Conference on Artificial Intelligence, volume 32, 2018. ceedings of the 23rd international conference on Machine learning, pp. 881–888, 2006. Jin, C., Allen-Zhu, Z., Bubeck, S., and Jordan, M. I. Is q-learning provably efficient? arXiv preprint Sutton, R. S. and Barto, A. G. Reinforcement learning: An arXiv:1807.03765, 2018. introduction. MIT press, 2018. Ensemble Bootstrapping for Q-Learning

Thaler, R. H. The winner’s curse: Paradoxes and anomalies of economic life. Simon and Schuster, 2012. Thrun, S. and Schwartz, A. Issues in using function approx- imation for reinforcement learning. In Proceedings of the 1993 Connectionist Models Summer School Hillsdale, NJ. Lawrence Erlbaum, 1993. Tsitsiklis, J. N. Asynchronous stochastic approximation and q-learning. Machine learning, 16(3):185–202, 1994. Van Hasselt, H. Double q-learning. Advances in neural information processing systems, 23:2613–2621, 2010. Van Hasselt, H., Guez, A., and Silver, D. Deep reinforce- ment learning with double q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, vol- ume 30, 2016. Van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., and Modayil, J. Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:1812.02648, 2018. Wang, Z., Schaul, T., Hessel, M., Hasselt, H., Lanctot, M., and Freitas, N. Dueling network architectures for deep reinforcement learning. In International conference on machine learning, pp. 1995–2003. PMLR, 2016. Watkins, C. J. and Dayan, P. Q-learning. Machine learning, 8(3-4):279–292, 1992. Wiering, M. A. Multi-agent reinforcement learning for traffic light control. In Machine Learning: Proceedings of the Seventeenth International Conference (ICML’2000), pp. 1151–1158, 2000. Xiong, R., Cao, J., and Yu, Q. Reinforcement learning-based real-time power management for hybrid energy storage system in the plug-in hybrid electric vehicle. Applied energy, 211:538–548, 2018. Zhang, Z., Pan, Z., and Kochenderfer, M. J. Weighted double q-learning. In IJCAI, pp. 3455–3461, 2017. Ensemble Bootstrapping for Q-Learning A. Proof of Lemma1

Recall the setting of EE defined as follows: Let X = (X1,...,Xm) be a vector of independent random variables and let K n (k)  (k) (k)o µˆ = µˆ1 ,..., µˆm be K unbiased, independent, estimators of E[X]. In our algorithm and analysis, we set k=1 (j) th ˜ µˆa to be the empirical means of the j subset of samples from the distribution of Xa. Also, let k ∈ [K] be some index. ∗ 1 P (j) ∗ (k˜) Then, the ensemble estimator is defined as µˆEE = K−1 j∈[K]\k˜ µˆaˆ∗ , where aˆ = argmaxa µˆa . ˜ ∗ (k) Lemma 1. Let M = argmaxa E [Xa] be the set of indices where E[Xa] is maximized. Let aˆ ∈ argmaxa µˆa be the ˜th ∗ estimated maximal-expectation index according to the k subset. Then E [ˆµEE] = E [µaˆ∗ ] ≤ maxa E [Xa]. Moreover, the inequality is strict if and only if P (ˆa∗ ∈/ M) > 0.

Proof. Since all sets are equally-distributed, assume w.l.o.g. that EE performs its first stage based on the k˜th subset.  (k) K Moreover, recall that the estimators µˆ k=1 are independent and unbiased. Therefore, using the tower property, we get   1 X (j) [ˆµ∗ ] = µˆ E EE E K − 1 aˆ∗  j∈[K]\k˜   

1 X (j) ∗ = E E  µˆa aˆ = a (Tower property) K − 1 j∈[K]\k˜   

1 X ∗ = E E  µa aˆ = a (Estimators are independent and unbiased) K − 1 j∈[K]\k˜ ∗ = E [E [µa|aˆ = a]] = E [µaˆ∗ ] , which proves the first result of the lemma. Next we prove the inequality. First notice that if P (ˆa∗ ∈ M) = 1, then ∗ aˆ ∈ argmaxa E[Xa] almost-surely; then,

∗ ∗ ∗ E [ˆµEE] = E [µaˆ∗ ] = E [µ ] = µ .

∗ ∗ ∗ Otherwise, assume that P (ˆa ∈ M) < 1. Then, there exists an index a such that µa < µ and P (ˆa = a) > 0, which implies that:

m ∗ X ∗ 0 E [ˆµEE] = µa0 P (ˆa = a ) a0=1 X ∗ ∗ 0 ∗ ≤ µ P (ˆa = a ) + µaP (ˆa = a) a06=a X < µ∗P (ˆa∗ = a0) + µ∗P (ˆa∗ = a) a06=a m X = P (ˆa∗ = a0)µ∗ = µ∗. a0=1 Ensemble Bootstrapping for Q-Learning B. Proof of Proposition1 Proposition 1 (The Proxy Identity). Let S(1),...,S(K) be some subsets of the samples of equal sizes N/K and let k˜ ∈ [K] ∗ ˜th ∗ be an arbitrary index. Also, denote by µˆEE, the EE that uses the k subset for its index-estimation phase. Finally, let µˆW-DE (k˜) ∗ (k˜) be a DE that uses S for its index-selection and all other samples for mean-estimation; i.e., aˆ = argmaxa µˆa and ∗ S (j) ∗ ∗ µˆW-DE =µ ˆaˆ∗ j∈[K]\k˜ S . Then, µˆEE =µ ˆW-DE.

∗ ∗ Proof. Denote by aˆEE and aˆW-DE, the output indices from the first stage of EE and W-DE, respectively. Importantly, by ∗ (k˜) ∗ ∗ definitions of both algorithms, we have that aˆW-DE = argmaxa µˆa =a ˆEE. For brevity, we denote this index by aˆ . (j) For the second stage, notice that all subsets are of equal-size, i.e., |Saˆ∗ | = N/K for all j ∈ [K]. Also, denote the sample W-DE S (j) W-DE sets for second stage of the W-DE by Sa = j∈[K]\k˜ Saˆ∗ , which is of size Sa = N(1 − 1/K). Then, the ensemble estimator can be written as

1 X (j) µˆ∗ = µˆ EE K − 1 aˆ∗ j∈[K]\k˜

(j) |Saˆ∗ | 1 X 1 X (j) = Saˆ∗ (n) K − 1 |S(j)| j∈[K]\k˜ aˆ∗ n=1 N/K 1 X 1 X (j) = S (n) K − 1 N/K aˆ∗ j∈[K]\k˜ n=1 N/K 1 X X (j) = S (n) N(K − 1)/K aˆ∗ j∈[K]\k˜ n=1

W-DE |Sa | 1 X W-DE = S ∗ (n) |SW-DE| aˆ a n=1   [ (j) =µ ˆaˆ∗  Saˆ∗  j∈[K]\k˜ ∗ =µ ˆW-DE. Ensemble Bootstrapping for Q-Learning C. MSE Analysis of the Weighted Double Estimator C.1. Calculating the Bias, Variance and MSE of the Estimator In this appendix, we analyze the bias, variance and MSE of the W-DE as a function of the number of samples used for index (1) (2) estimation N1. Formally, for some N ∈ N, we define S and S to be sets of samples from the same distribution as (1) (2) X of sizes |S | = N1 ∈ [1,N − 1] and |S | = N − N1 respectively. We further assume that all samples are mutually independent. We then define the empirical mean estimators as

|S(k)| 1 X µˆ(j) = S(k)(j), ∀k ∈ {1, 2} , a ∈ [m] . a S(k) a j=1

h (j)i The estimators are naturally unbiased, namely, E µˆa = µa, and since samples are independent, have variance of h (j)i 1 2 var µˆa = |S(j)| σa. Then, the W-DE is calculated via the following two-phases estimation procedure:

(1) ∗ (1) 1. Index-estimation using samples from S : aˆ ∈ argmaxa µˆa .

(2) ∗ (2) 2. Mean-estimation using samples from S : µˆW-DE =µ ˆaˆ∗ .

By the proxy-identity (Proposition1), this estimator is closely related to the ensemble estimator. Specifically, we can understand how the ensemble size affects the MSE of EE by analyzing the effect of N1 on the MSE of W-DE. We now directly calculate the statistics of W-DE.

• First moment. Since the samples from the different components of X are mutually independent and µˆ(2) is an unbiased estimator, we can write m ∗ h (2)i h h (2) ∗ ii ∗ X ∗ E [ˆµW-DE] = E µˆaˆ∗ = E E µˆa |aˆ = a = E [E [µa|aˆ = a]] = E [µaˆ∗ ] = µaP (ˆa = a) . a=1

• Second moment. Similar to the first moment, we get

 2   2  h ∗ 2i  (2)  (2) ∗ E (ˆµW-DE) = E µˆaˆ∗ = E E µˆa |aˆ = a

m  2 X  (2) ∗ = E µˆa P (ˆa = a) a=1 m  2 X  (2)  h (2)i ∗ = var µˆa + E µˆa P (ˆa = a) a=1 m X  σ2  = a + µ2 P (ˆa∗ = a) . N − N a a=1 1

Next, we calculate the bias, variance and MSE of the estimator. To this end, assume w.l.o.g. that µ1 ≥ · · · ≥ µm. In ∗ particular, it implies that µ = µ1.

• Bias. m m ∗ ∗ ∗ X ∗ X ∗ bias (ˆµW-DE) = E [ˆµW-DE] − µ = µaP (ˆa = a) − µ1 = (µa − µ1)P (ˆa = a) . (6) a=1 a=1

• Variance. 2 m  2  m ! h 2i 2 X σ X var (ˆµ∗ ) = (ˆµ∗ ) − ( [ˆµ∗ ]) = a + µ2 P (ˆa∗ = a) − µ P (ˆa∗ = a) . (7) W-DE E W-DE E W-DE N − N a a a=1 1 a=1 Ensemble Bootstrapping for Q-Learning

Finally, we calculate the MSE of the estimator. First recall that for any estimator µˆ of µ∗, it holds that

h ∗ 2i MSE (ˆµ) = E (ˆµ − µ )

h ∗ 2i = E (ˆµ − E [ˆµ] + E [ˆµ] − µ )

h 2i ∗ ∗ 2 = E (ˆµ − E [ˆµ]) +2 E [ˆµ − E [ˆµ]] (E [ˆµ] − µ ) + (E [ˆµ] − µ ) | {z } | {z } | {z } =0 =bias(ˆµ)2 =var(ˆµ) = var(ˆµ) + bias(ˆµ)2 (8) Specifically, for W-DE, we get

m !2 ∗ 2 X ∗ bias (ˆµW-DE) = (µa − µ1)P (ˆa = a) a=1 m !2 m X ∗ X ∗ 2 = µaP (ˆa = a) − 2 µaµ1P (ˆa = a) + µ1 a=1 a=1 m !2 m X ∗ X 2 ∗ = µaP (ˆa = a) + (µ1 − 2µaµ1)P (ˆa = a) (9) a=1 a=1 and combining Eq. (7), Eq. (8) and Eq. (9), we get ∗ ∗ ∗ 2 MSE (ˆµW-DE) = var (ˆµW-DE) + bias (ˆµW-DE) m m X  σ2  X = a + µ2 P (ˆa∗ = a) + (µ2 − 2µ µ )P (ˆa∗ = a) N − N a 1 a 1 a=1 1 a=1 m X  σ2  = a + (µ − µ )2 P (ˆa∗ = a) N − N 1 a a=1 1 m X  σ2  a + ∆2 P (ˆa∗ = a) , (10) , N − N a a=1 1 where we defined ∆a = µ1 − µa. Finally, we address the values of the probabilities P (ˆa∗ = a) where m = 2. For this, we assume that all samples are drawn independently from a Gaussian distribution. Then, if Φ is the cumulative distribution function (cdf) of a standard Gaussian variable, we have that

∗  (1) (1) P (ˆa = 1) = P µˆ1 ≥ µˆ2

2 2  σa+σj  = P (Z1,2 ≥ 0) (For Za,j ∼ N µa − µj, ) N1

2 √  Y N1 (µ1 − µj) = 2 · Φ   q 2 2 j=1 σj + σ1 where in the last relation, we substituted the cdf of the Gaussian distribution and used the fact that for j = a, we have 1 ∗ Φ(0) = 2 . Moreover, from symmetry, a similar relation holds for P (ˆa = 2), and we can write

2 √  ∗ Y Na (µa − µj) P (ˆa = a) = 2 · Φ   . (11) q 2 2 j=1 σj + σa Finally, plugging Eq. (11) to Eq. (10), we get  √  2  2  2 ∗ X σa 2 Y Ni (µi − µj) MSE (ˆµW-DE) = 2  + (µa − µ1) Φ  q  (12) N − N1 2 2 i=1 j=1 σi + σj Ensemble Bootstrapping for Q-Learning

C.2. Proof of Proposition2

T 2  Proposition 2. Let X = (X1,X2) ∼ N (µ1, µ2) , σ I2 be a Gaussian random vector such that µ1 ≥ µ2 and let ∆ = µ − µ . Also, define the signal to noise ratio as SNR = ∆√ and let µˆ∗ be a W-DE that uses N samples for 1 2 σ/ N W-DE 1 ∗ ∗ index estimation. Then, for any fixed even sample-size N > 10 and any N1 that minimizes MSE(ˆµW-DE), it holds that (1)As ∗ ∗ ∗ SNR → ∞, N1 → 1 (2) As SNR → 0, N1 → 1 (3) For any σ and ∆, it holds that N1 < N/2.

Proof. For this proof, we analyze the case of m = 2 where σ1 = σ2 = σ and ∆ = µ1 − µ2 ≥ 0. In this specific case, by Eq. (12), we have √ √ 2    2     ∗ σ ∆ N1 σ 2 ∆ N1 MSE (ˆµW-DE) = Φ √ + + ∆ 1 − Φ √ N − N1 2σ N − N1 2σ √ 2   σ 2 2 ∆ N1 = + ∆ − ∆ Φ √ , MSE(N1) N − N1 2σ

For the analysis, we allow N1 to be continuous in the interval [1,N − 1]. We then show that there exists a value N˜1 such that for any N ≥ N˜1, the function MSE(N1) is strictly increasing. This implies that even for integer values, any minimizer ∗ ∗ ˜ of the MSE N1 is upper-bounded by N1 ≤ dN1e. To do so, we lower bound the derivative of the MSE, which equals to

2 3 2 dMSE(N1) σ ∆ − 1 ∆ N 4 σ2 1 = 2 − √ e (13) dN1 (N − N1) 4σ πN1 3 2  2   2 ! 2 σ 2σ 1 ∆ 1 ∆ 2 − 4 ( σ ) N1 = 2 − √ 2 N1 e (14) (N − N1) πN1 4 σ

3 ∆√ Proof of part (1) We now focus on lower bounding the derivative in Eq. (13). Denoting b = 4σ π , we have that 2 3 dMSE(N1) σ ∆ ≥ 2 − √ dN1 (N − N1) 4σ πN1 σ2 b = 2 − √ (N − N ) N1 √ 1 σ2 N − b N 2 − 2NN + N 2 = 1 1 1 (N − N )2 √ 1 2 σ N1 − bN (N − N1) 2 > 2 . (N1 < N and thus N1 < N1N) (N − N1) √ Next, denote x = N1. Then, the numerator of the derivative is quadratic in x and can be written as 2p 2 2 2 2 2 σ N1 − bN (N − N1) = σ x − bN N − x = bNx + σ x − bN . √ as the denominator is always positive, and since b > 0 and x = N1 > 0, the derivative will always be strictly positive for any N1 > N˜1 such that r  2 r  2  6 √ 2 4 ∆3 3 1 ∆ q 2 4 2 3 −σ + σ + 4 √ N √ −1 + 1 + √ √ ˜ −σ + σ + 4b N 4σ π 2 π σ/ N N1 ≥ = 3 = N 2bN ∆√  3 2N 4σ π ∆√ √1 σ/ N 2 π r 6  ∆  q −1 + 1 + c2 √ 2 6 (∗) √ σ/ N √ −1 + 1 + c (SNR) = N = N  3 3 ∆√ c (SNR) c σ/ N where in (∗) we defined c √1 and in the last equality, we substituted SNR ∆√ . Notably, when SNR → 0, one , 2 π , σ/ N can easily verify that N˜1 → 0. Thus, when the SNR is large enough, we will have that N˜1 < 1, and MSE(N1) will strictly ∗ increase in N1 for any integer value in [1,N − 1]; then, it will be minimized at N1 = 1. Ensemble Bootstrapping for Q-Learning

Proof of part (2): For this part of the proof, we focus on the second form of the derivative in Eq. (14) and denote 3/2 −x −x/2 f(x) = x e . Importantly, there exists x0 such that f(x) < e for any x ≥ x0. Since N1 ≥ 1 for any ∆, σ such 1 ∆ 2 that 4 σ ≥ x0, we have that

3 2  2   2 ! 2 dMSE(N1) σ 2σ 1 ∆ 1 ∆ 2 − 4 ( σ ) N1 = 2 − √ 2 N1 e dN1 (N − N1) πN1 4 σ 2  2  σ 2σ 1 ∆ 2 − 8 ( σ ) N1 ≥ 2 − √ 2 e (N − N1) πN1 2  2  σ 2σ 1 ∆ 2 − 8 ( σ ) ≥ 2 − √ 2 e . (N1 ≥ 1) (N − N1) πN1

One can easily verify that this lower bound on the derivative of the MSE is strictly positive for any N1 ∈ (N˜1,N), where

r 2 r  2  − 1 ∆  2  1 2 √ 8 ( σ ) √ − 8N SNR π e π e N˜1 = N = N r 2 r  2  − 1 ∆  2  1 2 √ 8 ( σ ) √ − 8N SNR 1 + π e 1 + π e

1 ∆ 2 ˜ Notably, for any fixed N and large enough SNR, we have that 4 σ ≥ x0 and N1 < 1. Then, the MSE is strictly ∗ increasing in N1 for any integer value N1 ∈ [1,N] and N1 = 1. Proof of part (3): We start from the derivative form of Eq. (14). Notice that the strict maximum of the function f(x) = x3/2e−x over [0, ∞) is achieved at x∗ = 1.5; hence, for any x ≥ 0, we have that f(x) ≤ f(x∗) = 1.51.5e−1.5, which allows us to bound the derivative by

2 2 2 2 dMSE(N1) σ 2σ 1.5 −1.5 σ σ ≥ 2 − √ 2 1.5 e , 2 − 2 d, (15) dN1 (N − N1) πN1 (N − N1) N1 √ where d = 1.51.5e−1.5 √2 . One can easily verify that the lower bound of Eq. (15) equals zero when N = √d N N˜ π 1 1+ d , 1 and is strictly positive for any N1 ∈ (N˜1,N). Notably, N˜1 ≈ 0.405N, and therefore, if N > 10 then dN˜1e < N/2, and choosing N1 = dN˜1e leads to strictly lower MSE than choosing N1 = N/2. This hold for any values of ∆, σ > 0 and proves the third claim of the proposition. Ensemble Bootstrapping for Q-Learning D. Single Estimator (SE) is positively biased

Let X = (X1,...,Xm) be a vector of m independent random variables with expectations µ = (µ1, . . . µm). We wish to |Sa| estimate maxa E [Xa] = maxa µa using samples from same distribution: S = {S1,...Sm} where Sa = {Sa(n)}n=1 is a set of i.i.d. samples from the same distribution as Xa. ∗ The Single Estimator- µˆSE approximates the maximal expectation via the maximal empirical mean over the entire set of 1 P|Sa| samples; namely, if µˆa = Sa(n), then |Sa| n=1

∗ µˆSE(S) , max µˆa ≈ max E [Xa] . (16) a∈[m] a∈[m]

∗ The single estimator presented in (16) overestimates the true maximal expectation. i.e., is positively biased: E [ˆµSE] − maxa EXa ≥ 0.

∗ ∗ Proof. Let a ∈ argmaxa∈[m] µa be an index that maximizes the expected value of X and let aˆ = argmaxa∈[m] µˆa be an index that maximizes the empirical mean. Specifically, by definition, we have that µˆa∗ ≤ µˆaˆ∗ . In turn, this implies that

µa∗ − µˆaˆ∗ ≤ µa∗ − µˆa∗ .

Then, by the monotonicity of the expectation, we get

$$\mathbb{E}[\mu_{a^*} - \hat{\mu}_{\hat{a}^*}] \le \mathbb{E}[\mu_{a^*} - \hat{\mu}_{a^*}] = 0, \quad \text{or} \quad \mathbb{E}[\mu_{a^*}] \le \mathbb{E}[\hat{\mu}_{\hat{a}^*}],$$

which concludes the proof. The equality holds since we assumed that the samples S_a(j) are drawn from the same distribution as X_a and are therefore unbiased:

$$\mathbb{E}[\hat{\mu}_{a^*}] = \mathbb{E}\left[\frac{1}{|S_{a^*}|} \sum_{j=1}^{|S_{a^*}|} S_{a^*}(j)\right] = \mu_{a^*}.$$

E. Implementation details

E.1. Meta-Chain MDP

Environment: We have tested the algorithms (QL, DQL and EBQL) over 5000 episodes and averaged the results of 50 random seeds. The meta-chain MDP was constructed from 6 chain MDPs with r(D^i) ∼ N(µ_i, 1), where {µ_i}_{i=1}^6 = {−0.6, −0.4, −0.2, 0.2, 0.4, 0.6}. In all chain MDPs, the number of actions available at state B^i was set to 10.
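A sketch of this environment as we read it from Section 6.1 and Fig. 4; the left/right action assignment, the state encoding and the episode interface are our own choices and are not specified in the paper.

```python
import numpy as np

class MetaChainMDP:
    """Each episode samples one chain i. From A, one action leads to B and the
    other terminates at C (which action is which is our assumption); from B, any
    of the m actions terminates at D with reward ~ N(mu_i, sigma^2)."""
    A, B, C_TERMINAL, D_TERMINAL = 0, 1, 2, 3

    def __init__(self, mus=(-0.6, -0.4, -0.2, 0.2, 0.4, 0.6), m=10, sigma=1.0, seed=0):
        self.mus, self.m, self.sigma = mus, m, sigma
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.i = self.rng.integers(len(self.mus))   # chain sampled uniformly at random
        self.state = self.A
        return self.state                           # a tabular agent would index its
                                                    # Q-table by (chain i, state)
    def step(self, action):
        if self.state == self.A:
            self.state = self.B if action == 0 else self.C_TERMINAL
            return self.state, 0.0, self.state == self.C_TERMINAL
        if self.state == self.B:
            reward = self.rng.normal(self.mus[self.i], self.sigma)
            self.state = self.D_TERMINAL
            return self.state, reward, True
        raise RuntimeError("step() called on a terminal state")
```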

Algorithm: The learning rate α_t of EBQL was chosen to be polynomial, similar to the one in (Van Hasselt, 2010): α(s_t, a_t) = 1 / n^i_t(s_t, a_t)^{0.8}, where n^i_t(s, a) is the number of updates made until time t to the (s, a) entry of the i-th estimator. We tested EBQL for different ensemble sizes K ∈ {3, 7, 10, 15, 25}. The discount factor was set to γ = 1.
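For concreteness, this schedule can be implemented with a per-estimator visit counter; a minimal sketch (variable names are ours):

```python
import numpy as np

class PolynomialLR:
    def __init__(self, K, n_states, n_actions, power=0.8):
        self.counts = np.zeros((K, n_states, n_actions), dtype=np.int64)
        self.power = power

    def __call__(self, k, s, a):
        self.counts[k, s, a] += 1                          # n_t^i(s, a)
        return 1.0 / self.counts[k, s, a] ** self.power    # alpha = 1 / n^0.8
```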

E.2. Deep EBQL Our code is based on (Arulkumaran, 2019), and can be found in the supplementary material of our paper. EBQL 0 EN\k 0 ∗ k Denote the TD-error of EBQL for ensemble member k as Yk (s, a, r, s ) = r + γQ (s , aˆ ) − Q (s, a).

Algorithm 2 Deep Ensemble Bootstrapped Q-Learning (Deep-EBQL)

Initialize: Q-ensemble of size K, parametrized by random weights: {Q(·, · ; θ_i)}_{i=1}^K, experience buffer B
for t = 0, ..., T do
  Choose action a_t = argmax_a Σ_{i=1}^K Q(s_t, a; θ_i)
  a_t = explore(a_t)  // e.g., ε-greedy
  s_{t+1}, r_t ← env.step(s_t, a_t)
  B ← B ∪ {(s_t, a_t, r_t, s_{t+1})}
  Set k = t mod K
  Define a* = argmax_a Q(s_{t+1}, a; θ_k)
  Take a gradient step to minimize L = E_{(s,a,r,s')∼B} [(Y_k^{EBQL}(s, a, r, s'))²]
end for
Return {Q(·, · ; θ_i)}_{i=1}^K
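A minimal PyTorch sketch of the per-step loss in Algorithm 2, written as an MSE between Q^k(s, a) and the bootstrapped target, which is equivalent to minimizing the squared TD-error Y_k^{EBQL}. The list-of-networks representation, the batch format and the termination mask are our own simplifications, and the target networks used by the actual agent are omitted.

```python
import torch
import torch.nn.functional as F

def ebql_loss(q_nets, k, batch, gamma):
    """q_nets: list of K Q-networks, each mapping a state batch to (B, num_actions) values."""
    s, a, r, s_next, done = batch                  # tensors sampled from the replay buffer
    with torch.no_grad():
        a_star = q_nets[k](s_next).argmax(dim=1)   # index selection with head k
        q_rest = torch.stack([q_nets[j](s_next)
                              for j in range(len(q_nets)) if j != k])
        q_eval = q_rest.mean(dim=0).gather(1, a_star.unsqueeze(1)).squeeze(1)  # mean of other heads
        target = r + gamma * (1.0 - done) * q_eval
    q_sa = q_nets[k](s).gather(1, a.unsqueeze(1)).squeeze(1)                   # Q^k(s, a)
    return F.mse_loss(q_sa, target)
```

In training, the returned loss is back-propagated and only the parameters of head k (and, depending on the design, the shared torso) are updated by the optimizer.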

While in theory each ensemble member should observe a unique data set, we found that, similarly to Osband et al. (2016; 2018), Anschel et al. (2017), and Lee et al. (2020), EBQL performs well when all ensemble members are updated at every iteration using the same samples from the experience replay. In addition, we use a shared feature extractor for the entire ensemble, followed by an independent head for each ensemble member.
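A sketch of the shared-torso-plus-heads architecture described above. The fully-connected torso and layer sizes are placeholders for illustration; the actual ATARI agent uses the DQN convolutional torso of Mnih et al. (2015).

```python
import torch
import torch.nn as nn

class EnsembleQNet(nn.Module):
    """One shared torso, K independent linear heads; forward returns (K, B, num_actions)."""
    def __init__(self, obs_dim, num_actions, K, hidden=256):
        super().__init__()
        self.torso = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(hidden, num_actions) for _ in range(K)])

    def forward(self, obs):
        z = self.torso(obs)                          # shared features
        return torch.stack([head(z) for head in self.heads])
```

Sharing the torso keeps the parameter count close to a single DQN while still providing K distinct value heads for the EBQL update.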

E.3. The Double Q-Learning Algorithm

Algorithm 3 Double Q-Learning (DQL)

Initialize: Two Q-tables: Q^A and Q^B, s_0
for t = 1, ..., T do
  Choose action a_t = argmax_a [Q^A(s_t, a) + Q^B(s_t, a)]
  a_t = explore(a_t)  // e.g., ε-greedy
  s_{t+1}, r_t ← env.step(s_t, a_t)
  if t mod 2 == 0 then
    Define a* = argmax_a Q^A(s_{t+1}, a)  // Update A
    Q^A(s_t, a_t) ← Q^A(s_t, a_t) + α_t (r_t + γ Q^B(s_{t+1}, a*) − Q^A(s_t, a_t))
  else
    Define b* = argmax_a Q^B(s_{t+1}, a)  // Update B
    Q^B(s_t, a_t) ← Q^B(s_t, a_t) + α_t (r_t + γ Q^A(s_{t+1}, b*) − Q^B(s_t, a_t))
  end if
end for
Return Q^A, Q^B

Figure 7: Comparison between DQN, DDQN, Rainbow and EBQL on each domain over 50m steps averaged across 5 random seeds.