Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)

Weighted Double Q-learning

Zongzhang Zhang, Soochow University, Suzhou, Jiangsu 215006, China, [email protected]
Zhiyuan Pan, Soochow University, Suzhou, Jiangsu 215006, China, [email protected]
Mykel J. Kochenderfer, Stanford University, Stanford, CA 94305, USA, [email protected]

Abstract

Q-learning is a popular reinforcement learning algorithm, but it can perform poorly in stochastic environments due to overestimating action values. Overestimation is due to the use of a single estimator that uses the maximum action value as an approximation for the maximum expected action value. To avoid overestimation in Q-learning, the double Q-learning algorithm was recently proposed, which uses the double estimator method. It uses two estimators from independent sets of experiences, with one estimator determining the maximizing action and the other providing the estimate of its value. Double Q-learning sometimes underestimates the action values. This paper introduces a weighted double Q-learning algorithm, which is based on the construction of the weighted double estimator, with the goal of balancing between the overestimation in the single estimator and the underestimation in the double estimator. Empirically, the new algorithm is shown to perform well on several MDP problems.

1 Introduction

Sequential decision problems under uncertainty are often framed as Markov decision processes (MDPs) [Kochenderfer, 2015; Bertsekas, 2007]. Reinforcement learning is concerned with finding an optimal decision making strategy in problems where the transition model and rewards are not initially known [Littman, 2015; Wiering and van Otterlo, 2012; Sutton and Barto, 1998]. Some reinforcement learning algorithms involve building explicit models of the transitions and rewards [Brafman and Tennenholtz, 2001], but other "model-free" algorithms learn the values of different actions directly. One of the most popular model-free algorithms is Q-learning [Watkins, 1989].

The original Q-learning algorithm inspired several improvements, such as delayed Q-learning [Strehl et al., 2006], phased Q-learning [Kearns and Singh, 1999], fitted Q-iteration [Ernst et al., 2005], bias-corrected Q-learning [Lee and Powell, 2012; Lee et al., 2013], and weighted Q-learning [D'Eramo et al., 2016]. This paper focuses on an enhancement known as double Q-learning [van Hasselt, 2010; 2011], a variant designed to avoid the positive maximization bias when learning the action values. The algorithm has recently been generalized from the discrete setting to use deep neural networks [LeCun et al., 2015] as a way to approximate the action values in high dimensional spaces [van Hasselt et al., 2016]. Double Q-learning, however, can lead to a bias that results in underestimating action values.

The main contribution of this paper is the introduction of the weighted double Q-learning algorithm, which is based on the construction of the weighted double estimator, with the goal of balancing between the overestimation in the single estimator and the underestimation in the double estimator. We present empirical results of estimators of the maximum expected value on three groups of multi-arm bandit problems, and compare Q-learning and its variants in terms of the action-value estimate and policy quality on MDP problems.

2 Background

The MDP framework can be applied whenever we have an agent taking a sequence of actions in a system described as a tuple (S, A, T, R, γ), where S is a finite set of states, A is a finite set of actions, T : S × A × S → [0, 1] is a state-transition model, where T(s, a, s') gives the probability of transitioning to state s' after the agent executes action a in state s, R : S × A → R is a reward function, where R(s, a) gives the reward obtained by the agent after executing action a in state s, and γ ∈ [0, 1) is a discount factor that trades off the importance of immediate and delayed rewards.

An MDP policy is a mapping from S to A, denoted by π : S → A. The goal of solving an MDP is to find an optimal policy π* that maximizes V^π : S → R, the value of a state s under policy π, defined as

    V^π(s) = E_π{ Σ_{t=0}^{∞} γ^t R(s_t, π(s_t)) | s_0 = s }.    (1)

A similar state-action value function is Q^π : S × A → R, where Q^π(s, a) is the value of starting in state s, taking action a, and then continuing with the policy π. The optimal state-action value function Q*(s, a) in the MDP framework satisfies the Bellman optimality equation [Bellman, 1957]:

    Q*(s, a) = R(s, a) + γ Σ_{s'∈S} T(s, a, s') max_{a'} Q*(s', a').    (2)

We can use Q* to define π*(s) ∈ arg max_{a∈A} Q*(s, a).
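For concreteness, the Bellman optimality equation (2) can be solved for small MDPs by repeatedly applying the corresponding backup operator. The following Python sketch is illustrative only; the tabular arrays T and R and the stopping tolerance are implementation choices rather than part of the definitions above.

import numpy as np

def value_iteration(T, R, gamma=0.95, tol=1e-8):
    """Iterate the Bellman optimality backup of Eq. (2) to a fixed point.

    T: array of shape (S, A, S) with T[s, a, s2] the transition probability.
    R: array of shape (S, A) with R[s, a] the expected immediate reward.
    Returns an approximation of Q*, an array of shape (S, A).
    """
    num_states, num_actions, _ = T.shape
    Q = np.zeros((num_states, num_actions))
    while True:
        V = Q.max(axis=1)                 # max_{a'} Q(s', a') for every s'
        Q_new = R + gamma * T.dot(V)      # Eq. (2) applied to every (s, a)
        if np.abs(Q_new - Q).max() < tol:
            return Q_new
        Q = Q_new

# The greedy policy satisfies pi*(s) in argmax_a Q*(s, a),
# e.g., pi_star = value_iteration(T, R).argmax(axis=1).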


3 Estimators of the Maximum Expected Value

This section considers the problem of finding an approximation for max_i E{X_i}, the maximum expected value of the set of N random variables X = {X_1, X_2, ..., X_N}. We first describe two existing methods, the single estimator and the double estimator, and then introduce our new weighted double estimator method.

3.1 Single Estimator

Let µ = {µ_1, µ_2, ..., µ_N} be a set of unbiased estimators such that E{µ_i} = E{X_i} for all i. Assume that D = ∪_{i=1}^{N} D_i is a set of samples, where D_i is the subset containing at least one sample for the variable X_i, and the samples in D_i are i.i.d. The single estimator method uses the value max_i µ_i(D) as an estimator of max_i E{X_i}, where µ_i(D) = (1/|D_i|) Σ_{d∈D_i} d is an unbiased estimator of E{X_i}. However, max_i E{µ_i(D)} ≥ max_i E{X_i}, and the inequality is strict if and only if P(j ∉ arg max_i µ_i(D)) > 0 for any j ∈ arg max_i E{X_i} [van Hasselt, 2010]. This implies that there will be a positive maximization bias if we use the maximum of the estimates as an estimate of the maximum of the true values. Such a biased estimate can occur even when all variables in X are i.i.d. The overestimation in the single estimator method arises because the same samples are used both to determine the maximizing action and to estimate its value.

3.2 Double Estimator

To avoid maximization bias in the single estimator, Hasselt proposed the double estimator approach [van Hasselt, 2010]. One of the key ideas is to divide the sample set D into two disjoint subsets, D^U and D^V. Let µ^U = {µ^U_1, µ^U_2, ..., µ^U_N} and µ^V = {µ^V_1, µ^V_2, ..., µ^V_N} be two sets of unbiased estimators such that E{µ^U_i} = E{µ^V_i} = E{X_i} for all i. The two sample subsets are used to learn two independent estimates, µ^U_i(D) = (1/|D^U_i|) Σ_{d∈D^U_i} d and µ^V_i(D) = (1/|D^V_i|) Σ_{d∈D^V_i} d, each an estimate of the true value E{X_i} for all i. The value µ^U_i(D) is used to determine the maximizing action a* ∈ arg max_i µ^U_i(D), and the other is used to provide the estimate of its value, µ^V_{a*}(D). This estimate will then be unbiased in the sense that E{µ^V_{a*}(D)} = E{X_{a*}}. However, since E{X_{a*}} ≤ max_i E{X_i}, we have E{µ^V_{a*}(D)} ≤ max_i E{X_i}. The inequality is strict if and only if P(a* ∉ arg max_i E{X_i}) > 0 [van Hasselt, 2010]. This implies that there will be a negative bias if we use µ^V_{a*} as an estimate of the maximum of the true values when the variables have different expected values and overlapping distributions. When the variables are i.i.d., the double estimator is unbiased because all expected values are equal and P(a* ∈ arg max_i E{X_i}) = 1.

3.3 Weighted Double Estimator

Our weighted double estimator method provides a way of constructing a set of estimators that balances the overestimation of the single estimator and the underestimation of the double estimator. Mathematically, we write it as follows:

    µ^WDE(D) = β µ^U_{a*}(D) + (1 − β) µ^V_{a*}(D),    (3)

where β ∈ [0, 1] and a* ∈ arg max_i µ^U_i(D). Thus, µ^WDE(D) equals the result of the single estimator when β = 1, and the double estimator when β = 0.

Now we consider how to construct the function β. Assume that the variable X_i follows the distribution H_i, i.e., X_i ∼ H_i. Denote the Kullback-Leibler divergence between the two distributions H_i and H_j as KL(H_i ∥ H_j). When max_{i,j} KL(H_i ∥ H_j) = 0, the variables X_i in X are i.i.d. To make µ^WDE(D) an unbiased estimate of max_i E{X_i}, β should then be set to 0. Similarly, when max_{i,j} KL(H_i ∥ H_j) is small, we will also want to set β to a small value. We hypothesize that KL(H_{a*} ∥ H_{a_L}), where a_L ∈ arg min_i µ^U_i(D), can serve as an approximation to max_{i,j} KL(H_i ∥ H_j), because X_{a*} and X_{a_L} are the two variables with the biggest difference in terms of the expected values in the sample subset D^U. Since the distributions of the variables are unavailable, we further use |E{X_{a*}} − E{X_{a_L}}| to approximate KL(H_{a*} ∥ H_{a_L}). Since |µ^V_{a*}(D) − µ^V_{a_L}(D)| is an unbiased estimator of |E{X_{a*}} − E{X_{a_L}}|, we define β as follows:

    β(D, c) = |µ^V_{a*}(D) − µ^V_{a_L}(D)| / (c + |µ^V_{a*}(D) − µ^V_{a_L}(D)|),    (4)

where c ≥ 0. Because there exists one β* ∈ [0, 1] such that β* E{µ^U_{a*}(D)} + (1 − β*) E{µ^V_{a*}(D)} = max_i E{X_i}, and, for any β* ∈ [0, 1], there is a corresponding c* ∈ [0, +∞) such that β* = E{β(D, c*)}, we can conclude that there always exists a c* ∈ [0, +∞) such that µ^WDE(D, c*) is an unbiased estimator of max_i E{X_i}. Note that even if c is a constant, Eq. (4) ensures that the more similar the distributions the variables follow, the closer β(D, c) is to 0; tuning β directly cannot achieve this. Selecting c is discussed in a later section.
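To make the three estimators concrete, the following Python sketch computes each of them from a list of sample arrays, one per variable X_i. It is an illustrative sketch only: the even/odd split into D^U and D^V and the function names are our own choices, not requirements of the method.

import numpy as np

def single_estimator(samples):
    """samples[i] is a 1-D array holding the samples D_i of X_i."""
    means = np.array([s.mean() for s in samples])   # mu_i(D)
    return means.max()                              # positively biased estimate

def weighted_double_estimator(samples, c):
    """Eq. (3) with beta from Eq. (4); c -> +inf recovers the double estimator,
    while c = 0 recovers a single-estimator-like estimate based on D^U alone."""
    mu_U = np.array([s[0::2].mean() for s in samples])   # estimates from D^U
    mu_V = np.array([s[1::2].mean() for s in samples])   # estimates from D^V
    a_star = mu_U.argmax()                               # a*  = argmax_i mu_i^U(D)
    a_L = mu_U.argmin()                                  # a_L = argmin_i mu_i^U(D)
    gap = abs(mu_V[a_star] - mu_V[a_L])
    beta = gap / (c + gap) if c + gap > 0 else 0.0       # Eq. (4)
    return beta * mu_U[a_star] + (1 - beta) * mu_V[a_star]   # Eq. (3)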
4 Q-learning and Its Variants

This section first describes Q-learning and double Q-learning, and then presents the weighted double Q-learning algorithm.

4.1 Q-learning

Algorithm 1 Q-learning
 1: Initialize Q, s
 2: loop
 3:   Choose action a from state s based on Q and some exploration strategy (e.g., ε-greedy)
 4:   Take action a, observe r, s'
 5:   a* ← arg max_a Q(s', a)
 6:   δ ← r + γQ(s', a*) − Q(s, a)
 7:   Q(s, a) ← Q(s, a) + α(s, a)δ
 8:   s ← s'

Q-learning is outlined in Algorithm 1. The key idea is to apply incremental estimation to the Bellman optimality equation. Instead of using T and R, it uses the observed immediate reward r and next state s' to obtain the following update rule:

    Q(s, a) ← Q(s, a) + α(s, a)[r + γ max_{a'} Q(s', a') − Q(s, a)],    (5)


where α(s, a) ∈ [0, 1] is the learning rate. Q-learning can be interpreted as using the single estimator method to estimate the maximum expected value of Q(s', a'), max_{a'} E{Q(s', a')}. Here, max_{a'} Q(s', a') is the single estimator. Since max_{a'} Q(s', a') is an unbiased sample drawn from a distribution with mean E{max_{a'} Q(s', a')}, and E{max_{a'} Q(s', a')} ≥ max_{a'} E{Q(s', a')}, the estimator max_{a'} Q(s', a') has a positive bias. Hence, Q-learning can suffer from overestimation of action values.

4.2 Double Q-learning

Algorithm 2 Double Q-learning
 1: Initialize Q^U, Q^V, s
 2: loop
 3:   Choose a from s based on Q^U and Q^V (e.g., ε-greedy in Q^U + Q^V)
 4:   Take action a, observe r, s'
 5:   Choose (e.g., at random) whether to update Q^U or Q^V
 6:   if chose to update Q^U then
 7:     a* ← arg max_a Q^U(s', a)
 8:     δ ← r + γQ^V(s', a*) − Q^U(s, a)
 9:     Q^U(s, a) ← Q^U(s, a) + α^U(s, a)δ
10:   else if chose to update Q^V then
11:     a* ← arg max_a Q^V(s', a)
12:     δ ← r + γQ^U(s', a*) − Q^V(s, a)
13:     Q^V(s, a) ← Q^V(s, a) + α^V(s, a)δ
14:   s ← s'

Double Q-learning, as shown in Algorithm 2, uses the double estimator method to estimate the maximum expected value of the Q function for the next state, max_{a'} E{Q(s', a')}. It stores two Q functions, Q^U and Q^V, and uses two separate subsets of experience samples to learn them. The action selected for execution in line 3 is calculated based on the average of the two Q values for each action and the ε-greedy exploration strategy. In lines 5 to 13, with equal probability, each Q function is updated using a value from the other Q function for the next state. In line 7, the action a* is the maximizing action in state s' based on the value function Q^U. However, in line 8, as opposed to Q-learning, the value Q^V(s', a*), not the value Q^U(s', a*), is used to update Q^U. Since Q^V was updated with a different set of experience samples, Q^V(s', a*) is an unbiased estimate of the value of the action a* in the sense that E{Q^V(s', a*)} = E{Q(s', a*)}. Similarly, the value Q^U(s', a*) in line 12 is also unbiased. However, since E{Q^V(s', a*)} ≤ max_a E{Q^V(s', a)} and E{Q^U(s', a*)} ≤ max_a E{Q^U(s', a)}, both estimators, Q^V(s', a*) and Q^U(s', a*), sometimes have negative biases. This results in double Q-learning underestimating action values in some stochastic environments.

4.3 Weighted Double Q-learning

Algorithm 3 Weighted Double Q-learning
 1: Initialize Q^U, Q^V, s
 2: loop
 3:   Choose a from s based on Q^U and Q^V (e.g., ε-greedy in Q^U + Q^V)
 4:   Take action a, observe r, s'
 5:   Choose (e.g., at random) whether to update Q^U or Q^V
 6:   if chose to update Q^U then
 7:     a* ← arg max_a Q^U(s', a)
 8:     a_L ← arg min_a Q^U(s', a)
 9:     β^U ← |Q^V(s', a*) − Q^V(s', a_L)| / (c + |Q^V(s', a*) − Q^V(s', a_L)|)
10:     δ ← r + γ[β^U Q^U(s', a*) + (1 − β^U) Q^V(s', a*)] − Q^U(s, a)
11:     Q^U(s, a) ← Q^U(s, a) + α^U(s, a)δ
12:   else if chose to update Q^V then
13:     a* ← arg max_a Q^V(s', a)
14:     a_L ← arg min_a Q^V(s', a)
15:     β^V ← |Q^U(s', a*) − Q^U(s', a_L)| / (c + |Q^U(s', a*) − Q^U(s', a_L)|)
16:     δ ← r + γ[β^V Q^V(s', a*) + (1 − β^V) Q^U(s', a*)] − Q^V(s, a)
17:     Q^V(s, a) ← Q^V(s, a) + α^V(s, a)δ
18:   s ← s'

Weighted double Q-learning, as outlined in Algorithm 3, combines Q-learning and double Q-learning. The key difference between Algorithm 2 and Algorithm 3 is that Algorithm 3 uses lines 8 to 10 to replace line 8, and lines 14 to 16 to replace line 12, in Algorithm 2. These changes make weighted double Q-learning use different δ values when updating Q^U and Q^V. We denote

    Q^{U,WDE}(s', a*) = β^U Q^U(s', a*) + (1 − β^U) Q^V(s', a*),    (6)

where

    β^U = |Q^V(s', a*) − Q^V(s', a_L)| / (c + |Q^V(s', a*) − Q^V(s', a_L)|).    (7)

Equations (6) and (7) are the corresponding forms of Eqs. (3) and (4) in the MDP setting, respectively. From Eq. (6) we see that weighted double Q-learning uses a linear combination of both Q^U(s', a*) and Q^V(s', a*). Thus, Q^{U,WDE} represents a trade-off between the overestimation of Q-learning and the underestimation of double Q-learning. A similar update is used for Q^V, using a*, Q^U, and Q^V.
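For concreteness, the Q^U branch of Algorithm 3 (lines 7–11) can be written as a single update function. The Python sketch below is a direct transcription of Eqs. (6) and (7); the array-based storage of Q^U and Q^V and the function signature are illustrative choices, not part of the algorithm itself.

import numpy as np

def wdq_update_U(QU, QV, s, a, r, s_next, alpha, gamma, c):
    """One weighted double Q-learning update of Q^U (Algorithm 3, lines 7-11).

    QU, QV: arrays of shape (num_states, num_actions), updated in place.
    Assumes c > 0 so that the beta computation is well defined.
    """
    a_star = QU[s_next].argmax()                       # line 7
    a_L = QU[s_next].argmin()                          # line 8
    gap = abs(QV[s_next, a_star] - QV[s_next, a_L])
    beta_U = gap / (c + gap)                           # line 9, Eq. (7)
    target = beta_U * QU[s_next, a_star] \
             + (1.0 - beta_U) * QV[s_next, a_star]     # Eq. (6)
    delta = r + gamma * target - QU[s, a]              # line 10
    QU[s, a] += alpha * delta                          # line 11

The symmetric update of Q^V (lines 13–17) is obtained by exchanging the roles of Q^U and Q^V in this function.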
By using the techniques from the proof of convergence in the limit of double Q-learning [van Hasselt, 2010], we can also prove that weighted double Q-learning converges to the optimal policy. Intuitively, this is what one would expect based on the following two observations: (1) both Q-learning and double Q-learning converge to the optimal function Q* under similar conditions; and (2) weighted double Q-learning spans a spectrum that has the Q-learning algorithm at one end and the double Q-learning algorithm at the other.

5 Experiments

In this section, we first present empirical results of estimators of the maximum expected value on three groups of multi-arm bandit problems. Then, we present empirical comparisons of Q-learning and its variants in terms of the action-value estimate and policy quality on the game of roulette, four MDP problems modified from a 3×3 grid world problem [van Hasselt, 2010], and six intruder monitoring problems.


[Figure 1: three panels showing estimated value versus number of actions (2 to 1024) for SE, DE, and WDE with c = 1, 10, and 100; (a) results on G1, (b) results on G2, (c) results on G3.]

Figure 1: Estimated values and standard errors of different estimators for maximum expected values on three groups of multi-arm bandit problems, G1, G2, and G3. SE: single estimator, DE: double estimator, WDE: weighted double estimator.

In some domains, we also compared the performance of the bias-corrected Q-learning algorithm and the weighted Q-learning algorithm, two recently proposed algorithms that address the overestimation issue in Q-learning. Bias-corrected Q-learning asymptotically cancels the max-operator bias in Q-learning by subtracting from each Q-value a bias correction term that depends on the standard deviation of the reward and on the number of actions [Lee et al., 2013]. Weighted Q-learning estimates the maximum action values by a weighted estimator that computes a weighted mean of all the sample means [D'Eramo et al., 2016].

Algorithms with a polynomial learning rate, α_t(s, a) = 1/n_t(s, a)^m with m = 0.8, were shown to have better performance than ones with a linear learning rate, α_t(s, a) = 1/n_t(s, a) [van Hasselt, 2010]. This paper focuses on empirical results of Q-learning, bias-corrected Q-learning, and weighted Q-learning with parameter m = 0.8, and of both double Q-learning and weighted double Q-learning with parameters m^U = 0.8 and m^V = 0.8 in the two learning rates α^U_t(s, a) = [1/n^U_t(s, a)]^{m^U} and α^V_t(s, a) = [1/n^V_t(s, a)]^{m^V}, where the variables n^U_t(s, a) and n^V_t(s, a) store the number of updates of Q^U(s, a) and Q^V(s, a), respectively. The action-selection strategy in all algorithms was ε-greedy with ε(s) = 1/n_t(s)^{0.5}, where n_t(s) is the number of times state s has been visited.

5.1 Multi-arm Bandits

Our experiments are conducted on three groups of multi-arm bandit problems: (G1) E{X_i} = 0 for i ∈ {1, 2, ..., N}; (G2) E{X_1} = 1 and E{X_i} = 0 for i ∈ {2, 3, ..., N}; and (G3) E{X_i} = i/N for i ∈ {1, 2, ..., N}. Here, E{X_i} represents the expected reward of selecting the ith action, and N is the number of actions. Thus, max_i E{X_i} = 0 in G1, and max_i E{X_i} = 1 in G2 and G3.

In each N-arm bandit experiment, we repeated the following procedure 100 times: generate µ_i from N(E{X_i}, 1) for all i and then generate 1000 samples d_i from N(µ_i, 1). Thus, |D_i| = 1000. The single estimator (SE) method computes µ_i(D) = (1/|D_i|) Σ_{d∈D_i} d, and then uses the average of max_i µ_i(D) over the 100 repetitions as the single estimator of max_i E{X_i}. The double estimator (DE) method and the weighted double estimator (WDE) method first divide D_i into two independent sample subsets, D^U_i and D^V_i, with |D^U_i| = |D^V_i| = 500. Then, the DE method computes µ^V_{a*}(D) = (1/|D^V_{a*}|) Σ_{d∈D^V_{a*}} d, where a* ∈ arg max_i µ^U_i(D), as the double estimator; the WDE method also computes µ^V_{a*}(D) as in the DE method, but uses the average of the weighted combination β(D, c)µ^U_{a*}(D) + (1 − β(D, c))µ^V_{a*}(D) over the 100 repetitions as the weighted double estimator.
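The following Python sketch reproduces one repetition of this procedure for a given group; only the sampling scheme follows the description above, while the first-half/second-half split of each D_i and the helper names are illustrative choices.

import numpy as np

def one_repetition(true_means, n_samples=1000, c=10.0, rng=np.random):
    """One repetition of the bandit experiment (G1, G2, or G3).

    true_means: array of E{X_i}, e.g., np.zeros(N) for G1.
    Returns (single_estimate, double_estimate, weighted_double_estimate).
    """
    N = len(true_means)
    mu = rng.normal(true_means, 1.0)                        # mu_i ~ N(E{X_i}, 1)
    D = rng.normal(mu[:, None], 1.0, size=(N, n_samples))   # 1000 samples per arm
    DU, DV = D[:, :n_samples // 2], D[:, n_samples // 2:]   # |D_i^U| = |D_i^V| = 500
    mu_all = D.mean(axis=1)                                 # mu_i(D)
    mu_U, mu_V = DU.mean(axis=1), DV.mean(axis=1)           # mu_i^U(D), mu_i^V(D)
    a_star, a_L = mu_U.argmax(), mu_U.argmin()
    gap = abs(mu_V[a_star] - mu_V[a_L])
    beta = gap / (c + gap)                                  # Eq. (4)
    se = mu_all.max()                                       # single estimator
    de = mu_V[a_star]                                       # double estimator
    wde = beta * mu_U[a_star] + (1 - beta) * mu_V[a_star]   # weighted double estimator
    return se, de, wde

# Example: group G2 with N = 128 actions, averaged over 100 repetitions.
# true_means = np.zeros(128); true_means[0] = 1.0
# se, de, wde = np.mean([one_repetition(true_means) for _ in range(100)], axis=0)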
Figure 1 shows empirical results for the estimated values and standard errors of different estimators of max_i E{X_i} on the three groups of multi-arm bandit problems. For the weighted double estimator method, we report the results with the input parameter c = 1, 10, 100. From this figure, we see that on all domains, as N increases from 2 to 1024, the degree of overestimation in the single estimator increases. In addition, the larger the value of c, the closer the weighted double estimator is to the double estimator; the closer the value of c is to 0, the closer it is to the single estimator.

This figure also shows that there is no fixed best value of c, even among problems with a fixed number of actions. For example, when N = 128, the best c on the G1 domain is +∞; on G2 the best c is around 10; and on G3 the best c is larger than 10. Although there is always a c* ∈ [0, +∞) such that µ^WDE(D, c*) is an unbiased estimator of max_i E{X_i} on each multi-arm bandit problem, the value of c* appears to be different among problems with different characteristics. Hence, we want to set c adaptively based on the characteristics of the problem.

We observed that the number of actions clearly influences the degree of overestimation in the single estimator; however, a similar effect does not appear with the double estimator. As a result, the estimation errors in the weighted double estimator are affected by the number of actions. The empirical relationship between c in WDQ and N appears log-linear. This suggests to us that c ∝ log_2 N.

Table 1 reports the difference between the expected value of the best action and the largest expected value among the other actions, normalized by the range of expected values, denoted ∆ = (E{X_{i*}} − max_{j≠i*} E{X_j}) / (E{X_{i*}} − min_i E{X_i}), where E{X_{i*}} = max_i E{X_i} (we set ∆ = 0 when E{X_{i*}} − min_i E{X_i} = 0). As ∆ becomes small, the double estimator tends to decrease the error caused by its underestimation, as opposed to the single estimator, which suggests setting c ∝ 1/∆.


Table 1: Estimation biases according to the single estimator and the double estimator on three groups of multi-arm bandit problems with 128 actions.

Group (→)           G1      G3      G2
Difference (∆)      0       1/128   1
Single estimator    2.66    2.19    1.62
Double estimator    −0.06   −0.33   −0.86

Figure 2: Comparison of WDE(k = 1) and DE in terms of estimated values and standard errors on three groups of multi-arm bandit problems, G1, G2, and G3. [The plot shows estimated value versus number of actions (2 to 1024) for DE and WDE(k = 1) on each group.]

The domain G2 is an extreme case where only one action has the highest expected reward and all other actions have the lowest expected reward. In this case, DE has the largest estimation error, and the proportion of estimation error size between DE and SE is about 0.86/1.62 ≈ 1/2. In all other cases, the proportion should be less than 1/2. This suggests taking 1/3 as an upper bound of β(D, c), since (1/3)/(2/3) = 1/2 matches this worst-case ratio of the two error sizes, and therefore β(D, c) ∈ [0, 1/3]. Based on the above analysis, we define c as a heuristic function with the following form:

    c = max( k log_2(N) / max(∆, 10^{-6}), 2|µ^V_{a*}(D) − µ^V_{a_L}(D)| ),    (8)

where 10^{-6} is used to avoid the denominator ∆ becoming too close to 0, the term 2|µ^V_{a*}(D) − µ^V_{a_L}(D)| is used to make β(D, c) ∈ [0, 1/3], and k is a constant, independent of the characteristics of the problem, that has to be determined empirically. In this paper, we set k = 1 because it leads to a much lower estimation error than DE on G2 and G3, as shown in Fig. 2. The average errors of WDE(k = 1) are 0.10, 0.20, and 0.16 on G1, G2, and G3, respectively, while the corresponding estimation errors of DE are 0.01, 0.70, and 0.35.
5.2 Roulette

Roulette is a bandit problem with 171 arms, consisting of 170 betting actions with an expected payout of $0.947 per dollar and one walk-away action yielding $0. Each betting action has an expected loss of $−0.053 per play, based on the assumption that a player bets $1 every time. We recorded the mean action values over all betting actions as found by Q-learning (Q), bias-corrected Q-learning (BCQ), weighted Q-learning (WQ), double Q-learning (DQ), and weighted double Q-learning (WDQ) with different values of the parameter c, as well as with the heuristic value, denoted c(k = 1), defined by Eq. (8). Each trial involves a synchronous update of all actions. As the value of c increases, the degree of overestimation decreases. After 100,000 trials, WDQ valued all betting actions at $12.87, $2.93, $−0.11, and $0.00 when c = 1, 10, 100, and c(k = 1), respectively. The estimates of the expected profit of the betting actions in WDQ(k = 1), WDQ(c = 100), BCQ, WQ, and DQ after 100,000 trials are all very close to the expected loss of $−0.053, with absolute biases smaller than $0.10.

5.3 Grid World

An n × n grid world problem has n^2 states (one state per cell in the grid) and four actions: {north, south, east, west}. Movement to adjacent squares is deterministic, but a collision with the edge of the world results in no movement. The initial state s_0 corresponds to the bottom left cell, and the goal state s_g corresponds to the top right cell. The agent receives a random reward of −30 or +40 for any action ending an episode from the goal state, and a random reward of −6 or +4 for any action at a non-goal state. Since an episode starting from s_0 consists of 2(n − 1) + 1 actions given an optimal policy, the optimal average reward per action is (5 − 2(n − 1)) / (2(n − 1) + 1) = (7 − 2n)/(2n − 1). With a discount factor of γ = 0.95, V*(s_0) = 5γ^{2(n−1)} − Σ_{i=0}^{2n−3} γ^i. Figure 3 shows the results of Q-learning and its variants on four n × n grid world problems.

In contrast with Hasselt's grid world problem, we assume a stochastic reward at the goal state. Such a setting makes underestimating action values at the goal state negatively impact the performance of DQ. In addition, the random reward at non-goal states makes the overestimates at non-goal states negatively impact the performance of Q. As a result, both the Q and DQ agents often estimate that the action values at the goal state are lower than some action values at other states and do not move towards the goal state. Consequently, they performed poorly on all problems, as shown in Figure 3 (a–d). WDQ(c = 10) and WDQ(k = 1)¹ performed significantly better because their estimates of the maximum action values were much closer to the true values, e.g., at the initial state s_0, as shown in Figure 3 (e–h). On this domain, setting c to 10 is empirically a better choice in terms of the action-value estimate and policy quality than setting it adaptively using Eq. (8), in part due to the shortcomings of approximating the constant in a bandit setting and then deploying it in an MDP setting. We leave the design of a more effective heuristic for finding an appropriate value of c in MDPs as a future research topic.

Figure 3 also shows the performance of WQ on the grid world domain. In terms of the average reward per action, WQ performs better than Q and DQ, but worse than WDQ(c = 10) and WDQ(k = 1). Compared with other methods, WDQ shows less max-operator bias.

¹ On this domain, c = 10 and k = 1 are not the best parameter values, and c = 8 and k = 2 can yield better empirical results.

5.4 Intruder Monitoring

An intruder monitoring task involves guarding danger zones using a camera so that if an intruder moves to a danger zone, the camera points at that location.


[Figure 3: eight panels, one column per grid size from 3 × 3 to 6 × 6; panels (a–d) plot the average reward per action and panels (e–h) plot max_a Q(s_0, a) against the number of actions taken (0 to 10,000), with curves for Q, DQ, WQ, WDQ(c = 10), WDQ(k = 1), and the reference value V*(s_0) in (e–h).]

Figure 3: (a–d) The average reward per action and (e–h) the maximum action value in the initial state s0 according to Q-learning and its variants on the n × n grid world problems, where n ranges from 3 to 6. Results are averaged over 1000 runs.

An n × n grid task has n^4 states. Each state corresponds to a unique position of the camera and intruder. At each time step, the camera and the intruder each choose among five actions: {north, south, east, west, stay}. All movements are deterministic. The policy of the intruder is uniform random, but this is unknown to the camera. The system receives a reward of −10 for every visit of the intruder to a danger zone with no camera present, +100 when there is a camera present, and 0 otherwise. The discount factor γ is 0.95. Figure 4 shows the maps of the six intruder monitoring problems.

Figure 4: Maps of intruder monitoring problems: (a–b) two 4 × 4 maps with 256 states, (c–d) two 5 × 5 maps with 625 states, and (e–f) two 6 × 6 maps with 1296 states, where an intruder is initially located in the cell marked 1, a camera in the cell marked 2, and danger zones in the cells marked 3.

From Table 2 we can see that on these problems WDQ is robust and can perform better than both Q and DQ even when there is no randomness in the reward function. Compared with BCQ, WDQ can usually generate better policies due to its more accurate estimate of the maximum over action values. Since the only uncertainty is the behavior of the intruder, the small estimation errors in both Q and DQ sometimes have no significant negative impact on performance, making the improvement achieved by WDQ appear not as significant as on the grid world domain.

Table 2: The average reward per action over 200,000 actions on the six intruder monitoring problems shown in Fig. 4. Results are averaged over 1000 runs.

Problems     Q       DQ      BCQ     WDQ(c)   WDQ(k)
Fig. 4 (a)   12.02   12.07   11.95   12.15    12.30
Fig. 4 (b)   14.35   14.48   14.35   14.69    14.72
Fig. 4 (c)   8.52    8.65    8.54    8.75     8.79
Fig. 4 (d)   8.32    8.31    8.33    8.58     8.53
Fig. 4 (e)   10.61   10.48   10.67   10.78    10.79
Fig. 4 (f)   9.64    9.54    9.75    9.82     9.95

WDQ(c): WDQ(c = 10), WDQ(k): WDQ(k = 1)

6 Conclusion

This paper presented a weighted double Q-learning algorithm to avoid the overestimation bias of action values inherent in regular Q-learning and the underestimation bias of action values inherent in double Q-learning. The parameter c is used to control the relative weight of the single and double estimates. We proposed a heuristic for selecting c based on insights from the empirical results on multi-arm bandit problems and used it to set the parameter adaptively. Empirically, we found that the new algorithm reduces the estimation error and performs well on a variety of MDP problems. In the future, it would be interesting to analyze the performance of weighted double Q-learning when using value function approximation on Atari video games [van Hasselt et al., 2016; Mnih et al., 2015].

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61502323), the High School Natural Foundation of Jiangsu (16KJB520041), and the Collaborative Innovation Center of Novel Software Technology and Industrialization.


References

[Bellman, 1957] Richard Bellman. Dynamic Programming. Princeton University Press, 1957.

[Bertsekas, 2007] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, Belmont, MA, 2007.

[Brafman and Tennenholtz, 2001] Ronen I. Brafman and Moshe Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. In International Joint Conference on Artificial Intelligence (IJCAI), pages 953–958, 2001.

[D'Eramo et al., 2016] Carlo D'Eramo, Alessandro Nuara, and Marcello Restelli. Estimating the maximum expected value through Gaussian approximation. In International Conference on Machine Learning (ICML), pages 1032–1040, 2016.

[Ernst et al., 2005] Damien Ernst, Pierre Geurts, and Louis Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(1):503–556, 2005.

[Kearns and Singh, 1999] Michael J. Kearns and Satinder P. Singh. Finite-sample convergence rates for Q-learning and indirect algorithms. In Advances in Neural Information Processing Systems (NIPS), pages 996–1002, 1999.

[Kochenderfer, 2015] Mykel J. Kochenderfer. Decision Making Under Uncertainty: Theory and Application. MIT Press, 2015.

[LeCun et al., 2015] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521:436–444, 2015.

[Lee and Powell, 2012] Donghun Lee and Warren B. Powell. An intelligent battery controller using bias-corrected Q-learning. In AAAI Conference on Artificial Intelligence (AAAI), pages 316–322, 2012.

[Lee et al., 2013] Donghun Lee, Boris Defourny, and Warren B. Powell. Bias-corrected Q-learning to control max-operator bias in Q-learning. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL), pages 93–99, 2013.

[Littman, 2015] Michael L. Littman. Reinforcement learning improves behaviour from evaluative feedback. Nature, 521:445–451, 2015.

[Mnih et al., 2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518:529–533, 2015.

[Strehl et al., 2006] Alexander L. Strehl, Lihong Li, Eric Wiewiora, John Langford, and Michael L. Littman. PAC model-free reinforcement learning. In International Conference on Machine Learning (ICML), pages 881–888, 2006.

[Sutton and Barto, 1998] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.

[van Hasselt et al., 2016] Hado van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In AAAI Conference on Artificial Intelligence (AAAI), pages 2094–2100, 2016.

[van Hasselt, 2010] Hado van Hasselt. Double Q-learning. In Advances in Neural Information Processing Systems (NIPS), pages 2613–2621, 2010.

[van Hasselt, 2011] Hado van Hasselt. Insights in Reinforcement Learning. PhD thesis, Center for Mathematics and Computer Science, Utrecht University, 2011.

[Watkins, 1989] Christopher J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, King's College, University of Cambridge, 1989.

[Wiering and van Otterlo, 2012] Marco Wiering and Martijn van Otterlo, editors. Reinforcement Learning: State of the Art. Springer, New York, 2012.
