
Diagnosing Bottlenecks in Deep Q-learning Algorithms

Justin Fu*, Aviral Kumar*, Matthew Soh, Sergey Levine
UC Berkeley (*equal contribution)
Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Q-learning methods are a common class of algorithms used in reinforcement learning (RL). However, their behavior with function approximation, especially with neural networks, is poorly understood theoretically and empirically. In this work, we aim to experimentally investigate potential issues in Q-learning by means of a "unit testing" framework where we can utilize oracles to disentangle sources of error. Specifically, we investigate questions related to function approximation, sampling error, and nonstationarity, and where available, verify whether trends found in oracle settings hold true with deep RL methods. We find that large neural network architectures have many benefits with regards to learning stability; offer several practical compensations for overfitting; and develop a novel sampling method based on explicitly compensating for function approximation error that yields fair improvement on high-dimensional continuous control domains.

1. Introduction

Q-learning algorithms, which are based on approximating state-action value functions, are an efficient and commonly used class of RL methods. Q-learning methods have several very appealing properties: they are relatively sample-efficient when compared to policy gradient methods, and they allow for off-policy learning. This makes them an appealing choice for a wide range of tasks, from robotic control (Kalashnikov et al., 2018) and video game AI (Mnih et al., 2015) to off-policy learning from historical data for recommender systems (Shani et al., 2005). However, although the basic tabular Q-learning algorithm is convergent and admits theoretical analysis (Sutton & Barto, 2018), its counterpart with function approximation is poorly understood. In this paper, we aim to investigate the degree to which potential issues with Q-learning manifest in practice. We empirically analyze aspects of the Q-learning method in a unit testing framework, where we can employ oracle solvers to obtain ground truth Q-functions and distributions for exact analysis. We investigate the following questions:

1) What is the effect of function approximation on convergence? Many practical RL problems require function approximation to handle large or continuous state spaces. However, the behavior of Q-learning methods under function approximation is not well understood – there are simple counterexamples where the method diverges (Baird, 1995). To investigate these problems, we study the convergence behavior of Q-learning methods with function approximation by varying the function approximator power and analyzing the quality of the solution found. We find, somewhat surprisingly, that divergence rarely occurs, and that function approximation error is not a major problem in Q-learning algorithms when the function approximator is powerful. This makes sense in light of the theory: a high-capacity approximator can perform an accurate projection of the Bellman backup, thus mitigating potential convergence issues due to function approximation. (Section 4)

2) What is the effect of sampling error and overfitting? RL is used to solve problems where we do not have access to the transition function of the MDP. Thus, Q-learning methods need to learn by collecting samples in the environment, and minimizing errors on samples potentially leads to overfitting. We experimentally show that overfitting exists in practice by performing ablation studies on the number of gradient steps, and by demonstrating that oracle-based early stopping techniques can be used to improve performance of Q-learning algorithms. (Section 5)

3) What is the effect of distribution shift and a moving target? The standard formulation of Q-learning prescribes an update rule, with no corresponding objective function (Sutton et al., 2009a). This results in a process which optimizes an objective that is non-stationary in two ways: the target values are updated during training (the moving target problem), and the distribution under which the Bellman error is optimized changes, as samples are drawn from different policies (the distribution shift problem). These properties can make convergence behavior difficult to understand, and prior works have hypothesized that nonstationarity is a source of instability (Mnih et al., 2015; Lillicrap et al., 2015). We develop metrics to quantify the amount of distribution shift and the performance change due to non-stationary targets. Surprisingly, we find that in a controlled experiment, distributional shift and non-stationary targets do not correlate with a reduction in performance, and some well-performing methods incur high distribution shift.

4) What is the best sampling or weighting distribution? Deeply tied to the distribution shift problem is the choice of which distribution to sample from. Do moving distributions cause instability, as Q-values trained on one distribution are evaluated under another in subsequent iterations? Researchers have often noted that on-policy samples are typically superior to off-policy samples (Sutton & Barto, 2018), and there are several theoretical results that highlight favorable convergence properties under on-policy samples. However, there is little theoretical guidance on how to pick distributions so as to maximize learning. To this end, we investigate several choices for the sampling distribution. Surprisingly, we find that on-policy training distributions are not always preferable, and that broader, higher-entropy distributions often perform better, regardless of distributional shift. Motivated by our findings, we propose a novel weighting distribution, adversarial feature matching (AFM), which explicitly compensates for function approximator error, while still producing high-entropy sampling distributions.

In summary, we introduce a unit testing framework for Q-learning to disentangle potential bottlenecks, where approximate components are replaced by oracles. This allows for controlled analysis of different sources of error. We study various sources of instability and error in Q-learning algorithms on tabular domains, and show that many of these trends hold true in high-dimensional domains. We then propose a novel sampling distribution that improves performance even on high-dimensional tasks.

2. Preliminaries

Q-learning algorithms aim to solve a Markov decision process (MDP) by learning the optimal state-action value function, or Q-function. We define an MDP as a tuple (S, A, T, R, γ). S, A represent the state and action spaces, respectively. T(s'|s, a) and R(s, a) represent the dynamics (transition distribution) and reward function, respectively, and γ ∈ (0, 1) represents the discount factor. Letting ρ_0(s) denote the initial state distribution, the goal in RL is to find a policy π(a|s) that maximizes the expected cumulative discounted rewards, known as the returns:

π* = argmax_π E_{s_0∼ρ_0, a_t∼π, s_{t+1}∼T} [ Σ_{t=0}^∞ γ^t R(s_t, a_t) ].

The quantities of interest in many Q-learning methods are the optimal state (V*(s)) and state-action (Q*(s, a)) value functions, which give the expected future return of the optimal policy starting from a particular state or state-action pair. Q-learning algorithms are based on iterating the Bellman backup operator T, defined as

(T Q)(s, a) = R(s, a) + γ E_{s'∼T}[V(s')],   where V(s) = max_{a'} Q(s, a').

Q-iteration is a dynamic programming algorithm that iterates the Bellman backup Q^{t+1} ← T Q^t. Because the Bellman backup is a γ-contraction in the L∞ norm, and Q* is its fixed point, Q-iteration can be shown to converge to Q* (Sutton & Barto, 2018). A deterministic optimal policy can then be obtained as π*(s) = argmax_a Q*(s, a).

When state spaces are large, function approximation is needed to represent the Q-values. This corresponds to fitted Q-iteration (FQI) (Ernst et al., 2005), a form of approximate dynamic programming (ADP), which forms the basis of modern deep RL methods such as DQN (Mnih et al., 2015). FQI projects the values of the Bellman backup onto a family of Q-function approximators Q: Q^{t+1} ← Π_µ(T Q^t). Π_µ denotes a µ-weighted L2 projection, which minimizes the Bellman error:

Π_µ(Q) = argmin_{Q'∈Q} E_{s,a∼µ}[(Q'(s, a) − Q(s, a))^2].   (1)

The values produced by the Bellman backup, (T Q^t)(s, a), are commonly referred to as target values, and when neural networks are used for function approximation, the previous Q-function Q^t(s, a) is referred to as the target network. In this work, we distinguish between the cases when the Bellman error is estimated with Monte-Carlo sampling or computed exactly (see Section 3.1). The sampled variant corresponds to FQI as described in the literature (Ernst et al., 2005; Riedmiller, 2005), while the exact variant is an example of conventional ADP methods (Bertsekas & Tsitsiklis, 1996). Convergence guarantees for Q-iteration do not cleanly translate to FQI. Π_µ is an L2 projection, but T is a contraction in the L∞ norm – this norm mismatch means the composition of the backup and projection is no longer guaranteed to be a contraction under any norm (Bertsekas & Tsitsiklis, 1996), and hence convergence is not guaranteed.

A related branch of Q-learning methods are online Q-learning methods, in which Q-values are updated while samples are being collected in the MDP. This includes classic algorithms such as Watkins' Q-learning (Watkins & Dayan, 1992). Online Q-learning methods can be viewed as a form of stochastic approximation applied to Q-iteration and FQI (Bertsekas & Tsitsiklis, 1996), and share many of their theoretical properties (Tsitsiklis, 1994; Szepesvári, 1998). Modern deep RL algorithms such as DQN (Mnih et al., 2015) have characteristics of both online Q-learning and FQI – using replay buffers means the sampling distribution µ changes very little between target updates (see Section 6.2), and target networks are justified from the viewpoint of FQI. Because FQI corresponds to the case when the sampling distribution is static between target updates, the behavior of modern deep RL methods more closely resembles FQI than a true online method without target networks.
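To make the backup and projection concrete, the sketch below implements the Bellman backup T and the µ-weighted L2 projection of Eqn. (1) on a small, randomly generated tabular MDP. It is a minimal illustration of the operators defined above, not the authors' code; the MDP sizes, the linear feature class, and the use of weighted least squares for the projection are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 20, 4, 0.95

# Random tabular MDP: transition tensor T[s, a, s'] and reward R[s, a].
T = rng.dirichlet(np.ones(S), size=(S, A))            # shape (S, A, S)
R = rng.uniform(size=(S, A))                          # shape (S, A)

def bellman_backup(Q):
    """(T Q)(s, a) = R(s, a) + gamma * E_{s'~T}[max_a' Q(s', a')]."""
    V = Q.max(axis=1)                                  # (S,)
    return R + gamma * T @ V                           # (S, A)

def project(target, Phi, mu):
    """mu-weighted L2 projection of `target` onto the span of features Phi.

    Q is restricted to the linear class Q_w(s, a) = Phi[s, a] @ w, and we solve
    argmin_w E_{s,a~mu}[(Q_w(s, a) - target(s, a))^2] by weighted least squares.
    """
    X = Phi.reshape(S * A, -1)
    y = target.reshape(S * A)
    w = np.sqrt(mu.reshape(S * A))
    sol, *_ = np.linalg.lstsq(X * w[:, None], y * w, rcond=None)
    return (X @ sol).reshape(S, A)

# Q-iteration with tabular one-hot features: the projection is exact, so FQI
# reduces to exact Q-iteration and converges to Q*.
Phi = np.eye(S * A).reshape(S, A, S * A)
mu = np.full((S, A), 1.0 / (S * A))                    # Unif(s, a) weighting
Q = np.zeros((S, A))
for _ in range(300):
    Q = project(bellman_backup(Q), Phi, mu)
print("max |Q - TQ| after convergence:", np.abs(Q - bellman_backup(Q)).max())
```

Swapping the one-hot features for a smaller feature matrix turns this into the restricted-capacity FQI setting studied in the rest of the paper.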
3. Experimental Setup

Our experimental setup is centered around unit testing. We evaluate a spectrum of Q-learning algorithms, starting with exact dynamic programming and replacing exact components with practical approximations, until the algorithm resembles modern deep Q-learning methods.

3.1. Algorithms

We use three different Q-learning variants, each of which controls for a specific source of approximation error: Exact-FQI, Sampled-FQI, and Replay-FQI. We use FQI as a basis for our controlled analysis, as it strongly resembles modern deep RL algorithms while allowing us to separately isolate target values, update rates, and the number of samples used for each iteration. We then confirm that the observed trends hold with several state-of-the-art deep RL methods (SAC (Haarnoja et al., 2017), TD3 (Fujimoto et al., 2018)) on standard benchmark problems.

Exact-FQI (Algorithm 1): Exact-FQI uses known dynamics and reward functions and computes the backup and projection on all state-action tuples, without sampling error. We use Exact-FQI to study convergence, distribution shift (by varying weighting distributions on transitions), and function approximation in the absence of sampling error.

Sampled-FQI (Algorithm 2): Sampled-FQI is a special case of Exact-FQI, where the Bellman error is approximated with Monte-Carlo estimates from a sampling distribution µ, and the Bellman backup is approximated with samples from the dynamics as r(s, a) + γ max_{a'} Q(s', a'). We use Sampled-FQI to study the effects of overfitting. Sampled-FQI incorporates errors arising from function approximation, sampling, and distribution shift.

Replay-FQI (Algorithm 3): Replay-FQI is a special case of Sampled-FQI that uses a replay buffer (Lin, 1992), which saves past transition samples (s, a, s', r) used for computing the Bellman error. Replay-FQI strongly resembles DQN (Mnih et al., 2015), but lacks the online updates that allow µ to change within an FQI iteration. With large replay buffers, we expect the difference between Replay-FQI and DQN to be minimal as µ changes slowly.

Algorithm 1 Exact-FQI
1: Initialize Q-value approximator Q_θ(s, a).
2: for step t in {1, ..., N} do
3:   Evaluate Q_{θ_t}(s, a) at all states.
4:   Compute exact target values at all states: y(s, a) = r(s, a) + γ E_{s'}[V_{θ_t}(s')]
5:   Minimize projection loss with respect to µ: argmin_θ E_µ[(Q_θ(s, a) − y(s, a))^2]
6: end for

Algorithm 2 Sampled-FQI
1: Initialize Q-value approximator Q_θ(s, a).
2: for step t in {1, ..., N} do
3:   Collect M samples from µ.
4:   Evaluate Q_{θ_t}(s, a) on samples.
5:   Compute sampled target values on samples: ŷ_i = r_i + γ V_{θ_t}(s'_i)
6:   Minimize projection loss with respect to samples: argmin_θ (1/M) Σ_{i=1}^M (Q_θ(s_i, a_i) − ŷ_i)^2
7: end for

Algorithm 3 Replay-FQI
1: Initialize Q-value approximator Q_θ(s, a), replay buffer B.
2: for step t in {1, ..., N} do
3:   Collect K online samples from µ.
4:   Append online samples to buffer B.
5:   Collect M samples from B.
6:   Evaluate Q_{θ_t}(s, a) on samples.
7:   Compute sampled target values on samples: ŷ_i = r_i + γ V_{θ_t}(s'_i)
8:   Minimize projection loss with respect to samples: argmin_θ (1/M) Σ_{i=1}^M (Q_θ(s_i, a_i) − ŷ_i)^2
9: end for

We additionally investigate the following choices of weighting distributions (µ) for the Bellman error:

Unif(s, a): Uniform weights over state-action space.
π(s, a): The on-policy state-action marginal induced by π.
π*(s, a): The state-action marginal induced by π*.
Random(s, a): The state-action marginal induced by executing uniformly random actions.
Prioritized(s, a): Weights Bellman errors proportionally to |Q(s, a) − T Q(s, a)|. This is similar to prioritized replay (Schaul et al., 2015) with β = 0.
Replay(s, a) and Replay10(s, a): The averaged state-action marginal of all policies (or the last 10) produced during training. This simulates sampling from a replay buffer.

3.2. Domains

We evaluate our methods on a suite of tabular environments where we can compute oracle values. We selected 8 tabular domains, each with different qualitative attributes, including: gridworlds of varying sizes and observations, blind Cliffwalk (Schaul et al., 2015), discretized Pendulum and Mountain Car based on OpenAI Gym (Plappert et al., 2018), and a sparsely connected graph. Additional details can be found in Appendix A. In order to provide consistent metrics across domains, we normalize returns and errors involving Q-functions (such as Bellman error) by the returns of the expert policy π* on each environment.

3.3. Function Approximators

Throughout our experiments, we use 2-layer ReLU networks, denoted by a tuple (N, N) where N represents the number of units in a layer. The "Tabular" architecture refers to the case when no function approximation is used.

3.4. High-Dimensional Testing

In addition to diagnostic experiments on tabular domains, we also wish to see if the observed trends hold true on high-dimensional environments. Thus, we include experiments on continuous control tasks in the OpenAI Gym benchmark (Plappert et al., 2018) (HalfCheetah-v2, Hopper-v2, Ant-v2, Walker2d-v2). In continuous domains, computing the maximum over actions of the Q-value (max_a Q(s, a)) is difficult. A common choice in this case is to use an "actor" function to approximate argmax_a Q(s, a) (Lillicrap et al., 2015; Fujimoto et al., 2018; Haarnoja et al., 2018). This approach resembles Replay-FQI, but uses the actor network in place of the max.
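The loop structure of Algorithm 3 (Replay-FQI) can be summarized in a few lines of code. The sketch below is a schematic illustration, not the authors' implementation: a tabular Q-table stands in for the neural network to keep it short, the behavior policy is assumed to be ε-greedy, and the random MDP, step sizes, and sample counts are placeholders chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 20, 4, 0.95
T = rng.dirichlet(np.ones(S), size=(S, A))   # dynamics T(s'|s, a), shape (S, A, S)
R = rng.uniform(size=(S, A))                 # rewards r(s, a)

Q = np.zeros((S, A))        # Q_theta (a tabular Q stands in for the network here)
Q_target = Q.copy()         # previous Q-function Q_{theta_t} (the "target network")
buffer = []                 # replay buffer B of (s, a, r, s') transitions
s = int(rng.integers(S))
eps, lr, K, grad_steps = 0.1, 0.1, 32, 256

for t in range(200):                              # FQI iterations
    for _ in range(K):                            # collect K online samples from mu
        a = int(rng.integers(A)) if rng.random() < eps else int(Q[s].argmax())
        s_next = int(rng.choice(S, p=T[s, a]))
        buffer.append((s, a, float(R[s, a]), s_next))
        s = s_next
    for _ in range(grad_steps):                   # projection step on samples from B
        si, ai, ri, sni = buffer[int(rng.integers(len(buffer)))]
        y = ri + gamma * Q_target[sni].max()      # sampled target value y_i
        Q[si, ai] -= lr * (Q[si, ai] - y)         # SGD step on 0.5*(Q(s_i, a_i) - y_i)^2
    Q_target = Q.copy()                           # update the target network
print("mean state value under greedy policy:", Q.max(axis=1).mean())
```

The ratio of `grad_steps` to collected samples is exactly the knob ablated in Section 5.2.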

4. Function Approximation and Convergence

The interaction between approximation and convergence has been a long-studied topic in RL and ADP. In control theory, it is closely related to the problems of state aliasing or interference (Farrell & Berger, 1995). Baird (1995) introduces a counterexample in which Watkins' Q-learning with linear approximators causes unbounded divergence. However, several works have noted that divergence need not occur. Munos (2005); Antos et al. (2008); Farahmand et al. (2010) address the norm-mismatch problem by showing convergence guarantees in Lp-norms, at the price of introducing a concentrability coefficient that worsens the error bound (and is potentially infinite for deterministic MDPs). In policy evaluation, Tsitsiklis & Van Roy (1997) prove that on-policy TD-learning with linear approximators converges, and methods such as GTD (Sutton et al., 2009a) and ETD (Sutton et al., 2016) have extended results to off-policy cases. In the control scenario, convergent algorithms such as SBEED (Dai et al., 2018) and Greedy-GQ (Maei et al., 2010) have been developed. Concurrently to us, Van Hasselt et al. (2018) experimentally find that unbounded divergence rarely occurs with DQN variants on Atari games.

Figure 1. Normalized returns and Q-function error with function approximation, averaged across domains and seeds. Small architectures show a large gap between the solution found by FQI (FQI Error) and the best solution within the model class (Projection Error).

4.1. How does function approximation affect convergence and solution suboptimality?

The crucial quantities we wish to measure are a trend between function approximation and performance, and a measure of the bias in the learning procedure introduced by function approximation. Using Exact-FQI with uniform weighting (to remove sampling error), we plot the returns of the learned policy, and the L∞ error between Q* and the solution found by Exact-FQI (lim_{t→∞} (Π_µ T)^t Q^0) and the projection of the optimal solution (Π_µ Q*) in Fig. 1. Π_µ Q* represents the best solution within the model class, in the absence of bootstrapping error. Thus, the difference between the FQI error and the projection error represents the additional bias introduced by FQI (the inherent Bellman error of the function class (Munos & Szepesvári, 2008)). This is the potential gap that can be improved via better algorithm design.

We first note the obvious trend that smaller architectures produce lower returns and converge to worse solutions. However, we also find that smaller architectures introduce significant bias in the learning process, and there is a significant gap between the solution found by Exact-FQI and the best solution within the model class. One explanation is that when the target is bootstrapped, we must represent all Q-functions along the path to the solution, and not only the final result (Bertsekas & Tsitsiklis, 1996). This observation implies that using large architectures is crucial not only because they have the capacity to represent a better solution, but also because they are easier to train using bootstrapping. We also note that divergence rarely occurs in practice, occurring in only 0.9% of our experiments (measured as the Q-values growing larger than 10 times Q*(s, a)).

For high-dimensional problems, we present experiments on varying the architecture of the Q-network in SAC (Haarnoja et al., 2018) in Appendix Fig. 13. We still observe that large networks have the best performance, and that divergence rarely happens even in high-dimensional continuous spaces. We briefly discuss theoretical intuitions on the apparent discrepancy between the lack of unbounded divergence and known counterexamples in Appendix B.
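The two error curves in Fig. 1 can be reproduced in miniature with the following diagnostic, which compares the L∞ error of the FQI fixed point against the L∞ error of the direct projection Π_µ Q* for a weak linear feature class, and applies the divergence check described above. It is an illustrative sketch under assumed MDP and feature sizes, not the authors' evaluation code.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma, d = 20, 4, 0.95, 8            # d: number of features (approximator "size")
T = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(size=(S, A))
Phi = rng.normal(size=(S * A, d))          # a weak, random linear feature class
mu = np.full(S * A, 1.0 / (S * A))         # Unif(s, a)

def backup(Q):
    return R + gamma * T @ Q.max(axis=1)

def proj(target):                          # mu-weighted L2 projection onto span(Phi)
    w = np.sqrt(mu)
    sol, *_ = np.linalg.lstsq(Phi * w[:, None], target.reshape(-1) * w, rcond=None)
    return (Phi @ sol).reshape(S, A)

Q_star = np.zeros((S, A))
for _ in range(500):                       # exact Q-iteration gives the oracle Q*
    Q_star = backup(Q_star)

Q_fqi, diverged = np.zeros((S, A)), False
for _ in range(500):                       # Exact-FQI: iterate projected backups
    Q_fqi = proj(backup(Q_fqi))
    diverged |= np.abs(Q_fqi).max() > 10 * np.abs(Q_star).max()

fqi_err = np.abs(Q_fqi - Q_star).max()             # L_inf error of the FQI solution
proj_err = np.abs(proj(Q_star) - Q_star).max()     # L_inf error of Pi_mu Q*
print(f"FQI error {fqi_err:.3f}  projection error {proj_err:.3f}  diverged: {diverged}")
```

Increasing `d` shrinks both errors and closes the gap between them, mirroring the trend reported for larger network architectures.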

5. Sampling Error and Overfitting

Approximate dynamic programming assumes that the projection of the Bellman backup (Eqn. 1) is computed exactly, but in reinforcement learning we can normally only compute the empirical Bellman error over a finite set of samples. In the PAC framework, overfitting can be quantified by a bound on the gap between the empirical and expected loss that holds with high probability and decays with sample size (Shalev-Shwartz & Ben-David, 2014). Munos & Szepesvári (2008); Maillard et al. (2010); Tosatto et al. (2017) provide such PAC bounds which account for sampling error in the context of Q-learning and value-based methods, and quantify the quality of the final solution in terms of sample complexity.

We analyze several key points that relate to sampling error. First, we show that Q-learning is prone to overfitting, and that this overfitting has a real impact on performance. We also show that the replay buffer is in fact an effective technique for addressing this issue, and discuss several methods to mitigate the effects of overfitting in practice.

5.1. Quantifying Overfitting

We first quantify the amount of overfitting that happens during training, by varying the number of samples. In order to provide comparable validation errors across different experiments, we fix a reference sequence of Q-functions, Q^1, ..., Q^N, obtained during an arbitrary run of FQI. We then retrace the training sequence, and minimize the projection error Π_µ(Q^t) at each training iteration, using varying amounts of on-policy data or sampling from a replay buffer. We measure the validation error (the expected Bellman error) at each iteration under the on-policy distribution, plotted in Fig. 3. We note the obvious trend that more samples lead to significantly lower validation loss. A more interesting observation is that sampling from the replay buffer results in the lowest on-policy validation loss, despite the bias due to distribution mismatch from sampling off-policy data. As we elaborate in Section 6, we believe that replay buffers are effective because they reduce overfitting and have good sample coverage over the state space, not necessarily because they reduce the effects of nonstationarity.

Next, Fig. 2 shows the relationship between the number of samples and returns. We see that more samples lead to improved learning speed and a better final solution. A full sweep including architectures is presented in Appendix Fig. 14. Despite overfitting being an issue, larger architectures still perform better because the bias introduced by smaller architectures dominates.

Figure 2. Samples plotted against returns for a 256x256 network. More samples yield better performance.

5.2. How can we compensate for overfitting?

Finally, we discuss methods to compensate for overfitting. One common method for reducing overfitting is to regularize the function approximator to reduce its capacity. However, we have seen that weaker architectures can give rise to suboptimal convergence. Instead, we study methods to mitigate overfitting without reducing model size. We first note that the number of gradient steps taken per sample in the projection step has an important effect on performance: too few steps and the algorithm learns slowly, but too many steps and the algorithm may initially learn quickly but overfit. To show this, we run an ablation over the number of gradient steps taken per sample in Replay-FQI and TD3 (TD3 uses 1 by default). Results for FQI are shown in Fig. 4, and for TD3 in Appendix Fig. 15.

In order to understand whether early stopping criteria can reduce overfitting, we employ oracle stopping rules to provide an "upper bound" on the best potential improvement. We try two criteria for setting the number of gradient steps: the expected Bellman error and the expected returns of the greedy policy (oracle returns). We implement both methods by running the projection step of FQI to convergence, and retroactively selecting the intermediate Q-function which is judged best by the evaluation metric. Using oracle stopping metrics results in a modest boost in performance in tabular domains (Fig. 5). Thus, we believe that there is promise in further improving such early-stopping methods for reducing overfitting in deep RL algorithms.

We can draw a few conclusions from these experiments. First, overfitting is indeed an issue with Q-learning, and too many gradient steps or too few samples can lead to poor performance. Second, replay buffers and early stopping can mitigate the effects of overfitting. Third, although overfitting is a problem, large architectures are still preferred, because the bias from function approximation outweighs the increased overfitting from using large models.
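The oracle early-stopping rule described above amounts to a small selection routine: run the projection step to (near) convergence while recording intermediate Q-functions, then retroactively keep the one ranked best by an oracle metric such as negative expected Bellman error or greedy-policy returns. The sketch below shows that selection logic only; the `grad_step` and `oracle_metric` callables are hypothetical placeholders supplied by the caller, and this is not the paper's implementation.

```python
from typing import Callable, List
import numpy as np

def projection_with_oracle_stopping(
    Q: np.ndarray,
    grad_step: Callable[[np.ndarray], np.ndarray],   # one gradient step on the projection loss
    oracle_metric: Callable[[np.ndarray], float],    # e.g. -expected Bellman error, or greedy returns
    max_steps: int = 1000,
) -> np.ndarray:
    """Run the projection step to convergence, then retroactively select the
    intermediate Q-function judged best by the oracle metric (higher is better)."""
    history: List[np.ndarray] = []
    for _ in range(max_steps):
        Q = grad_step(Q)
        history.append(Q.copy())
    scores = [oracle_metric(q) for q in history]
    return history[int(np.argmax(scores))]

# Toy usage with stand-in callables (illustration only):
target = np.ones((4, 2))
step = lambda q: q - 0.1 * (q - target)               # gradient step toward a fixed target
metric = lambda q: -float(np.abs(q - target).mean())  # oracle: negative expected error
best_Q = projection_with_oracle_stopping(np.zeros((4, 2)), step, metric, max_steps=50)
```

In practice the oracle metrics are only available in the tabular diagnostic setting, which is why the text treats this as an upper bound on what early stopping could deliver.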

Figure 3. On-policy validation losses for varying amounts of on-policy data (or replay buffer), averaged across environments and seeds. Note that sampling from the replay buffer has lower on-policy validation loss, despite bias from distribution shift.

Figure 4. Normalized returns plotted over training iterations (32 samples are taken per iteration), for different ratios of gradient steps per sample using Replay-FQI. We observe that intermediate values of gradient steps work best, and too many gradient steps hinder performance.

Figure 5. Normalized returns plotted over training iterations (32 samples are taken per iteration), for different early stopping methods using Replay-FQI. We observe that using proper early stopping can result in a modest performance increase.

6. Non-Stationarity

Instability in Q-learning methods is often attributed to the nonstationarity of the objective (Lillicrap et al., 2015; Mnih et al., 2015). Nonstationarity occurs in two places: in the changing target values T Q, and in a changing weighting distribution µ ("distribution shift"). Note that a non-stationary objective, by itself, is not indicative of instability. For example, gradient descent can be viewed as successively minimizing linear approximations to a function: for gradient descent on f with parameter θ and learning rate α, we have the "moving" objective

θ^{t+1} = argmin_θ { ∇_θ f(θ^t)^T θ + (1/(2α)) ||θ − θ^t||_2^2 } = θ^t − α ∇_θ f(θ^t).

However, the fact that the Q-learning algorithm prescribes an update rule and not a stationary objective complicates analysis. Indeed, the motivation behind algorithms such as GTD (Sutton et al., 2009b;a), approximate linear programming (De Farias, 2002), and residual methods (Baird, 1995; Scherrer, 2010) can be seen as introducing a stationary objective that can be optimized with standard methods such as gradient descent.

6.1. Does a moving target cause instability in the absence of a moving distribution?

To study the moving target problem, we first isolate the speed at which the target changes. To this end, we define the α-smoothed Bellman backup, T^α, which computes an exponentially smoothed update as follows: T^α Q = α T Q + (1 − α) Q. This scheme is inspired by the soft target update used in algorithms such as DDPG (Lillicrap et al., 2015) and SAC (Haarnoja et al., 2017) to improve the stability of learning. Standard Q-iteration uses a "hard" update where α = 1. A soft target update weakens the contraction of Q-iteration from γ to 1 − α + αγ (see Appendix C), so we expect slower convergence, but perhaps more stability under heavy function approximation error. We performed experiments with this modified backup using Exact-FQI under the Unif(s, a) weighting distribution.

Our results are presented in Appendix Fig. 10. We find that in most cases, the hard update with α = 1 results in the fastest convergence and highest asymptotic performance. However, for smaller architectures, 4×4 and 16×16, lower values of α (such as 0.1) achieve slightly higher asymptotic performance. Thus, while more expressive architectures are still stable under fast-changing targets, we believe that a slowly moving target may have benefits under heavy approximation error. This evidence points to either using large function approximators, in line with the conclusions drawn in the previous sections, or slowing the target updates on problems with high approximation error.

6.2. Does distribution shift impact performance?

To study the distribution shift problem, we exactly compute the amount of distribution shift between iterations in total-variation distance, D_TV(µ^{t+1} || µ^t), and the "loss shift":

E_{µ^{t+1}}[(Q^t − T Q^t)^2] − E_{µ^t}[(Q^t − T Q^t)^2].

The loss shift quantifies the Bellman error objective when evaluated under a new distribution: if the distribution shifts to previously unseen states, we would expect a highly inaccurate Q-value in such states, leading to high loss shift.

Figure 6. Distribution shift and loss shift plotted against time. Prioritized and on-policy distributions induce the greatest shift, whereas replay buffers greatly reduce the amount of shift.

We run our experiments using Exact-FQI with a 256x256 layer architecture, and plot the distribution discrepancy and the loss discrepancy in Fig. 6. We find that Prioritized(s, a) has the greatest shift, followed by on-policy variants. Replay buffers greatly reduce distribution shift compared to on-policy learning, which is similar to the de-correlation argument cited for their use by Mnih et al. (2015). However, we find that this metric correlates little with the performance of FQI (Fig. 7). For example, prioritized weighting performs well yet has high distribution shift.

Overall, our experiments indicate that nonstationarities in both distributions and target values, when isolated, do not cause significant stability issues. Instead, other factors such as sampling error and function approximation appear to have more significant effects on performance.
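The two quantities studied in this section are straightforward to state in code: the α-smoothed backup T^α Q = αT Q + (1 − α)Q from Section 6.1, and the total-variation distribution shift and loss shift from Section 6.2. The helpers below are illustrative definitions, not the authors' code; the `bellman_backup` callable is assumed to be supplied by the caller (for example, the exact backup from the earlier sketches).

```python
import numpy as np

def smoothed_backup(Q, bellman_backup, alpha=0.1):
    """Alpha-smoothed Bellman backup: T^alpha Q = alpha * TQ + (1 - alpha) * Q.
    alpha = 1 recovers the standard 'hard' update."""
    return alpha * bellman_backup(Q) + (1.0 - alpha) * Q

def distribution_shift(mu_new, mu_old):
    """Total-variation distance D_TV(mu^{t+1} || mu^t) between weighting
    distributions given as arrays over (s, a) that each sum to 1."""
    return 0.5 * float(np.abs(mu_new - mu_old).sum())

def loss_shift(Q, TQ, mu_new, mu_old):
    """E_{mu^{t+1}}[(Q - TQ)^2] - E_{mu^t}[(Q - TQ)^2]."""
    sq_err = (Q - TQ) ** 2
    return float((mu_new * sq_err).sum() - (mu_old * sq_err).sum())
```

Logging `distribution_shift` and `loss_shift` once per FQI iteration is enough to reproduce curves of the kind shown in Fig. 6 for any weighting scheme.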
7. Sampling Distributions

Off-policy data has been cited as one of the "deadly triads" for Q-learning (Sutton & Barto, 2018), which has the potential to cause instabilities in learning. On-policy distributions (Tsitsiklis & Van Roy, 1997) and fixed behavior distributions (Sutton et al., 2009b; Maei et al., 2010) have often been targeted for theoretical convergence analysis, and many works use importance sampling to correct for off-policyness (Precup et al., 2001; Munos et al., 2016). However, to our knowledge, there is relatively little guidance on how different weighting distributions compare in terms of convergence rate and final solutions. Nevertheless, several works give hypotheses on good choices for weighting distributions. Munos (2005) provides an error bound which suggests that "more uniform" distributions can guarantee better worst-case performance. Geist et al. (2017) suggest that when the state distribution is fixed, the action distribution should be weighted by the optimal policy for residual Bellman errors. In deep RL, prioritized replay (Schaul et al., 2015) and mixing the replay buffer with on-policy data (Hausknecht & Stone, 2016; Zhang & Sutton, 2017) have been found to be effective.

7.1. What Are the Best Weighting Distributions in the Absence of Sampling Error?

We begin by studying the effect of weighting distributions when disentangled from sampling error. We run Exact-FQI with an ablation over architectures and weighting distributions and report our results in Fig. 8. Unif(s, a), Replay(s, a), and Prioritized(s, a) consistently result in the highest returns across all architectures. We believe that these results are in favor of the uniformity hypothesis: the best distributions spread weight across a larger support of the state-action space. For example, a replay buffer contains state-action tuples from many policies, and therefore would be expected to have wider support than the state-action distribution of a single policy. We can see this general trend in Fig. 9.
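Fig. 9 plots returns against the normalized entropy of the weighting distribution. The paper does not spell out the formula, so the helper below is an assumed reconstruction of that metric: the Shannon entropy of µ over state-action pairs divided by the entropy of the uniform distribution, so that Unif(s, a) scores 1 and narrower distributions score lower.

```python
import numpy as np

def normalized_entropy(mu: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy of a weighting distribution over (s, a), normalized so that the
    uniform distribution has value 1. `mu` is a non-negative array summing to 1."""
    p = mu.reshape(-1)
    h = -float(np.sum(p * np.log(p + eps)))
    return h / float(np.log(p.size))

# A replay-buffer-like distribution is broader (higher entropy) than a single
# policy's peaked state-action marginal.
uniform = np.full(64, 1 / 64)
peaked = np.zeros(64)
peaked[:4] = 0.25
print(normalized_entropy(uniform), normalized_entropy(peaked))
```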
7.2. Designing a Better Off-Policy Distribution: Adversarial Feature Matching

In our final study, we attempt to design a better weighting distribution, using insights from the previous sections, that can be integrated into deep RL methods. We refer to this method as adversarial feature matching (AFM). We draw upon three specific insights outlined in the previous analysis. First, the function approximator should be incentivized to maximize its ability to distinguish states, to minimize function approximation bias (Section 4). Second, the weighting distribution should emphasize areas where the Q-function incurs high Bellman error, to minimize the discrepancy between the L2 norm error and the L∞ norm error. Third, high-entropy weighting distributions tend to perform better. The first insight was also demonstrated in (Liu et al., 2018), where enforcing sparsity in the Q-function was found to provide locality in the Q-function, which prevented catastrophic interference and provided better values for bootstrapping.

We propose to model our problem as a minimax game, where the weighting distribution is a parameterized adversary p_φ(s, a) which tries to maximize the Bellman error, while the Q-function (Q_θ(s, a)) tries to minimize it. Note that in the unconstrained setting, this game is equivalent to minimizing the L∞ norm error in its dual-norm representation. However, in practical settings, where minimizing stochastic approximations of the L∞ norm can be difficult for neural networks, it is crucial to introduce constraints to weaken the adversary. These constraints also make the adversary closer to the uniform distribution while allowing it to be sufficiently different at specific state-action pairs. We use a feature matching constraint which enforces the expected feature vector, E[Φ(s)], under p_φ(s, a) to roughly match the expected feature vector under uniform sampling from the replay buffer. We can express the output of a neural network Q-function as Q_θ(s, a) = w_a^T Φ_θ(s) or, in the continuous case, as Q_θ(s, a) = w^T Φ_θ(s, a), where the feature vectors Φ_θ(s), Φ_θ(s, a) represent the output of all but the final layer. Intuitively, this constraint restricts the adversary to distributing probability mass among states that are perceptually similar to the Q-function. The objective is

min_{θ,w} max_φ E_{p_φ(s,a)}[(Q_{w,θ}(s, a) − y(s, a))^2]
s.t. || E_{p_φ(s,a)}[Φ(s)] − (Σ_i Φ(s_i)) / N || ≤ ε

Note that Φ(s) is a function of θ but, while solving the maximization, θ is assumed to be a constant. This is equivalent to solving only the inner maximization with a constraint, and empirically provides better stability. Implementation details for AFM are provided in Appendix D. The quantity (Σ_i Φ(s_i)) / N denotes an estimator for the true expectation under some sampling distribution, such as a uniform distribution over all states and actions (in exact FQI) or the replay buffer distribution. So, (1/N) Σ_i Φ(s_i) ≈ E_prb[Φ] holds when using a replay buffer.
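To make the inner maximization concrete, the sketch below implements one possible form of it: the adversary p_φ is a softmax over the sampled transitions, it is updated by gradient ascent to increase the weighted Bellman error, and the feature-matching constraint is enforced approximately through a quadratic penalty rather than the constrained solver described in Appendix D. The penalty weight `lam`, the softmax parameterization, and the step sizes are assumptions made for this illustration, not the paper's procedure.

```python
import numpy as np

def afm_weights(delta_sq, Phi, lam=10.0, lr=0.5, steps=100):
    """One (assumed) form of the AFM inner maximization.

    delta_sq: (N,) squared Bellman errors (Q(s_i, a_i) - y_i)^2 on a batch.
    Phi:      (N, d) features of the batch (output of the last hidden layer).
    Returns a weighting p over the batch that increases the weighted Bellman
    error while keeping E_p[Phi] close to the batch mean feature (a penalty
    stands in for the hard feature-matching constraint).
    """
    phi_bar = Phi.mean(axis=0)                     # (1/N) * sum_i Phi(s_i)
    z = np.zeros(len(delta_sq))                    # adversary logits; p = softmax(z)
    for _ in range(steps):
        p = np.exp(z - z.max()); p /= p.sum()
        gap = Phi.T @ p - phi_bar                  # feature-matching violation
        g_p = delta_sq - 2.0 * lam * (Phi @ gap)   # d/dp of [p.delta - lam*||gap||^2]
        g_z = p * (g_p - p @ g_p)                  # chain rule through the softmax
        z += lr * g_z                              # gradient ascent: adversary maximizes
    p = np.exp(z - z.max()); p /= p.sum()
    return p

# Toy usage: the adversary upweights high-error samples while matching features.
rng = np.random.default_rng(3)
errors, feats = rng.uniform(size=32) ** 2, rng.normal(size=(32, 8))
weights = afm_weights(errors, feats)
print(weights.sum().round(3), weights.max().round(3))
```

The resulting weights would then be used either to reweight the projection loss (AFM) or to resample the batch (the AFM+Sampling variant reported in Table 1).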

Figure 7. Average distribution shift across time for different weighting distributions, plotted against returns for a 256x256 model. We find that distribution shift does not have strong correlation with returns.

Figure 8. Weighting distribution versus architecture in Exact-FQI. Replay(s, a) consistently provides the highest performance. Note that Adversarial Feature Matching is comparable to Replay(s, a), but surprisingly better for small networks.

Figure 9. Normalized returns plotted against normalized entropy for different weighting distributions. All experiments use Exact-FQI with a 256x256 network. We see a general trend that high-entropy distributions lead to greater performance.

In tabular domains with Exact-FQI, we find that AFM performs on par with the top-performing weighting distributions, such as Unif(s, a), and better than Prioritized(s, a) (Fig. 8). This confirms that adaptive prioritization works better than Prioritized(s, a). Another benefit of AFM is its robustness to function approximation; the performance gains in the case of small architectures (say, (4, 4)) are particularly noticeable (Fig. 8).

In tabular domains with Replay-FQI (Table 1), we also compare AFM to prioritized replay (PER) (Schaul et al., 2015), where AFM and PER perform similarly in terms of normalized returns. We also evaluate a variant of AFM (AFM + Sampling in Table 1) which changes which samples are drawn instead of reweighting them. We further evaluate AFM on MuJoCo tasks with the TD3 algorithm (Fujimoto et al., 2018) and the entropy-constrained SAC algorithm (Haarnoja et al., 2018). The results are presented in the appendix. We find that AFM improves the performance of three MuJoCo tasks (Ant, Hopper and Cheetah) under two algorithms (TD3 and SAC).

Table 1. Average performance of various sampling distributions for (16, 16) and (64, 64) neural nets in the setting with replay buffers, averaged across 5 random seeds. PER, our AFM, and on-policy sampling perform roughly on par on benchmark tasks in expectation when using (16, 16) architectures. However, note that π(s, a) is generally computationally intractable.

Sampling distribution        Norm. Returns (16, 16)   Norm. Returns (64, 64)
None                         0.18                     0.23
Uniform(s, a)                0.19                     0.25
π(s, a)                      0.45                     0.39
π*(s, a)                     0.30                     0.21
Prioritized(s, a)            0.17                     0.33
PER (Schaul et al., 2015)    0.42                     0.49
AFM (Ours)                   0.41                     0.48
AFM + Sampling (Ours)        0.43                     0.51

8. Conclusions and Discussion

From our analysis, we have several broad takeaways for the design of deep Q-learning algorithms.

Potential convergence issues with Q-learning do not seem to be endemic empirically, but function approximation still has a strong impact on the solution to which these methods converge. The bias introduced by small architectures is magnified by bootstrapping error. Expressive architectures in general suffer less from bootstrapping error, converge faster, and are more stable with moving targets.

Sampling error can cause substantial overfitting problems with Q-learning. However, replay buffers and early stopping can mitigate this problem, and the biases incurred from small function approximators outweigh any benefits they may have in terms of overfitting. We believe a good strategy is to use large architectures but tune the number of gradient steps used per sample, or intelligently use early stopping to dynamically control the number of gradient steps.

The choice of sampling or weighting distribution has a significant effect on solution quality, even in the absence of sampling error. Surprisingly, we do not find on-policy distributions to be the most performant; rather, methods which have high state-entropy are highly effective. Based on these insights, we propose a new weighting algorithm which balances high entropy and state aliasing, AFM, that yields fair improvements in both tabular and continuous domains with state-of-the-art off-policy RL algorithms.

Finally, we note that there are several topics that we did not investigate, such as overestimation bias and multi-step returns. We believe that understanding these issues could also benefit from unit testing.

Acknowledgements

We thank Vitchyr Pong and Kristian Hartikainen for providing us with implementations of RL algorithms. We thank Csaba Szepesvári and Chelsea Finn for comments on an earlier draft of this paper. SL thanks George Tucker for helpful discussion. We thank Google, NVIDIA, and Amazon for providing computational resources. This research was supported by Berkeley DeepDrive, NSF IIS-1651843 and IIS-1614653, the DARPA Assured Autonomy program, and ARL DCIST CRA W911NF-17-2-0181.

References

Antos, A., Szepesvári, C., and Munos, R. Learning near-optimal policies with Bellman-residual minimization based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning (ICML), pp. 214–223, 2017.

Baird, L. Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning (ICML), 1995.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-dynamic programming. Athena Scientific, 1996.

Dai, B., Shaw, A., Li, L., Xiao, L., He, N., Liu, Z., Chen, J., and Song, L. SBEED: Convergent reinforcement learning with nonlinear function approximation. In International Conference on Machine Learning (ICML), pp. 1133–1142, 2018.

Daskalakis, C., Ilyas, A., Syrgkanis, V., and Zeng, H. Training GANs with optimism. In International Conference on Learning Representations (ICLR), 2018. URL https://openreview.net/forum?id=SJJySbbAZ.

De Farias, D. P. The linear programming approach to approximate dynamic programming: Theory and application. PhD thesis, 2002.

Ernst, D., Geurts, P., and Wehenkel, L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.

Farahmand, A.-m., Szepesvári, C., and Munos, R. Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems (NIPS), 2010.

Farrell, J. A. and Berger, T. On the effects of the training sample density in passive learning control. In American Control Conference, 1995.

Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning (ICML), pp. 1587–1596, 2018.

Geist, M., Piot, B., and Pietquin, O. Is the Bellman residual a bad proxy? In Advances in Neural Information Processing Systems (NeurIPS), pp. 3205–3214, 2017.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning (ICML), 2017.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR, abs/1801.01290, 2018. URL http://arxiv.org/abs/1801.01290.

Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., and Levine, S. Soft actor-critic algorithms and applications. Technical report, 2018.

Hausknecht, M. and Stone, P. On-policy vs. off-policy updates for deep reinforcement learning. In Deep Reinforcement Learning: Frontiers and Challenges, IJCAI, 2016.

Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., and Levine, S. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. In CoRL, volume 87 of Proceedings of Machine Learning Research, pp. 651–673. PMLR, 2018.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. International Conference on Learning Representations (ICLR), 2015.

Lin, L.-J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.

Liu, V., Kumaraswamy, R., Le, L., and White, M. The utility of sparse representations for control in reinforcement learning. CoRR, abs/1811.06626, 2018. URL http://arxiv.org/abs/1811.06626.

Maei, H. R., Szepesvári, C., Bhatnagar, S., and Sutton, R. S. Toward off-policy learning control with function approximation. In International Conference on Machine Learning (ICML), 2010.

Maillard, O.-A., Munos, R., Lazaric, A., and Ghavamzadeh, M. Finite-sample analysis of Bellman residual minimization. In Asian Conference on Machine Learning (ACML), pp. 299–314, 2010.

Metelli, A. M., Papini, M., Faccio, F., and Restelli, M. Policy optimization via importance sampling. CoRR, abs/1809.06098, 2018. URL http://arxiv.org/abs/1809.06098.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.

Munos, R. Error bounds for approximate value iteration. In AAAI Conference on Artificial Intelligence (AAAI), pp. 1006–1011, 2005.

Munos, R. and Szepesvári, C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857, 2008.

Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1054–1062, 2016.

Plappert, M., Andrychowicz, M., Ray, A., McGrew, B., Baker, B., Powell, G., Schneider, J., Tobin, J., Chociej, M., Welinder, P., Kumar, V., and Zaremba, W. Multi-goal reinforcement learning: Challenging robotics environments and request for research, 2018.

Precup, D., Sutton, R. S., and Dasgupta, S. Off-policy temporal difference learning with function approximation. In International Conference on Machine Learning (ICML), pp. 417–424, 2001.

Riedmiller, M. Neural fitted Q iteration – first experiences with a data efficient neural reinforcement learning method. In European Conference on Machine Learning, pp. 317–328. Springer, 2005.

Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prioritized experience replay. International Conference on Learning Representations (ICLR), 2015.

Scherrer, B. Should one compute the temporal difference fix point or minimize the Bellman residual? The unified oblique projection view. In International Conference on Machine Learning (ICML), pp. 959–966, 2010.

Shalev-Shwartz, S. and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge University Press, 2014.

Shani, G., Heckerman, D., and Brafman, R. I. An MDP-based recommender system. Journal of Machine Learning Research, 6(Sep):1265–1295, 2005.

Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. Second edition, 2018.

Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Silver, D., Szepesvári, C., and Wiewiora, E. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In International Conference on Machine Learning (ICML), 2009a.

Sutton, R. S., Maei, H. R., and Szepesvári, C. A convergent O(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems (NeurIPS), 2009b.

Sutton, R. S., Mahmood, A. R., and White, M. An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research, 17(1):2603–2631, 2016.

Szepesvári, C. The asymptotic convergence-rate of Q-learning. In Advances in Neural Information Processing Systems, pp. 1064–1070, 1998.

Tosatto, S., Pirotta, M., D'Eramo, C., and Restelli, M. Boosted fitted Q-iteration. In International Conference on Machine Learning (ICML), pp. 3434–3443, 2017.

Tsitsiklis, J. N. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16(3):185–202, 1994.

Tsitsiklis, J. N. and Van Roy, B. An analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1075–1081, 1997.

Van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., and Modayil, J. Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:1812.02648, 2018.

Watkins, C. J. and Dayan, P. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

Yazıcı, Y., Foo, C.-S., Winkler, S., Yap, K.-H., Piliouras, G., and Chandrasekhar, V. The unusual effectiveness of averaging in GAN training. In International Conference on Learning Representations (ICLR), 2019. URL https://openreview.net/forum?id=SJgw_sRqFQ.

Zhang, S. and Sutton, R. S. A deeper look at experience replay. CoRR, abs/1712.01275, 2017. URL http://arxiv.org/abs/1712.01275.