Diagnosing Bottlenecks in Deep Q-learning Algorithms

Justin Fu * 1 Aviral Kumar * 1 Matthew Soh 1 Sergey Levine 1

Abstract

Q-learning methods represent a commonly used class of algorithms in reinforcement learning (RL): they are generally efficient and simple, and can be combined readily with function approximators for deep RL. However, the behavior of Q-learning methods with function approximation is poorly understood, both theoretically and empirically. In this work, we aim to experimentally investigate potential issues in Q-learning, by means of a "unit testing" framework where we can utilize oracles to disentangle sources of error. Specifically, we investigate questions related to function approximation, sampling error, and nonstationarity, and, where available, verify whether trends found in oracle settings hold true with modern deep RL methods. We find that large neural network architectures have many benefits with regard to learning stability; we offer several practical compensations for overfitting; and we develop a novel sampling method based on explicitly compensating for function approximation error that yields fair improvement on high-dimensional continuous control domains.

*Equal contribution. 1Berkeley AI Research, University of California, Berkeley. Correspondence to: Justin Fu, Aviral Kumar. Preprint. Under review. Work in progress. arXiv:1902.10250v1 [cs.LG], 26 Feb 2019.

1. Introduction

Q-learning algorithms, which are based on approximating state-action value functions, are an efficient and commonly used class of RL methods. In recent years, such methods have been applied to great effect in domains such as playing video games from raw pixels (Mnih et al., 2015) and continuous control in robotics (Kalashnikov et al., 2018). Methods based on approximate dynamic programming and Q-function estimation have several very appealing properties: they are generally moderately sample-efficient when compared to policy gradient methods, they are simple to use, and they allow for off-policy learning. This makes them an appealing choice for a wide range of tasks, from robotic control (Kalashnikov et al., 2018) to off-policy learning from historical data for recommender systems (Shani et al., 2005) and other applications. However, although the basic tabular Q-learning algorithm is convergent and admits theoretical analysis (Sutton & Barto, 2018), its non-linear counterpart with function approximation (such as with deep neural networks) is poorly understood theoretically. In this paper, we aim to investigate the degree to which the theoretical issues with Q-learning actually manifest in practice. Thus, we empirically analyze aspects of the Q-learning method in a unit testing framework, where we can employ oracle solvers to obtain ground truth Q-functions and distributions for exact analysis. We investigate the following questions:

1) What is the effect of function approximation on convergence? Most practical reinforcement learning problems, such as robotic control, require function approximation to handle large or continuous state spaces. However, the behavior of Q-learning methods under function approximation is not well understood. There are known counterexamples where the method diverges (Baird, 1995), and there are no known convergence guarantees (Sutton & Barto, 2018). To investigate these problems, we study the convergence behavior of Q-learning methods with function approximation, parametrically varying the power of the function approximator and analyzing the quality of the solution as compared to the optimal Q-function and the optimal projected Q-function under that function approximator. We find, somewhat surprisingly, that function approximation error is not a major problem in Q-learning algorithms, but only when the representational capacity of the function approximator is high. This makes sense in light of the theory: a high-capacity function approximator can perform a nearly perfect projection of the backed-up Q-function, thus mitigating potential convergence issues due to an imperfect ℓ2 norm projection. We also find that divergence rarely occurs; for example, we observed divergence in only 0.9% of our experiments. We discuss this further in Section 4.

2) What is the effect of sampling error and overfitting? Q-learning is used to solve problems where we do not have access to the transition function of the MDP. Thus, Q-learning methods need to learn by collecting samples in the environment, and training on these samples incurs sampling error, potentially leading to overfitting. This causes errors in the computation of the Bellman backup, which degrades the quality of the solution. We experimentally show that overfitting exists in practice by performing ablation studies on the number of gradient steps, and by demonstrating that oracle-based early stopping techniques can be used to improve the performance of Q-learning algorithms (Section 5). Thus, in our experiments we quantify the amount of overfitting that happens in practice using a variety of metrics, perform a number of ablations, and investigate methods to mitigate its effects.

3) What is the effect of distribution shift and a moving target? The standard formulation of Q-learning prescribes an update rule, with no corresponding objective function (Sutton et al., 2009a). This results in a process which optimizes an objective that is non-stationary in two ways: the target values are updated during training, and the distribution under which the Bellman error is optimized changes, as samples are drawn from different policies. We refer to these problems as the moving target and distribution shift problems, respectively. These properties can make convergence behavior difficult to understand, and prior works have hypothesized that nonstationarity is a source of instability (Mnih et al., 2015; Lillicrap et al., 2015). In our experiments, we develop metrics to quantify the amount of distribution shift and the performance change due to non-stationary targets. Surprisingly, we find that in a controlled experiment, distributional shift and non-stationary targets do not in fact correlate with a reduction in performance. In fact, sampling strategies with large distributional shift often perform very well.

4) What is the best sampling or weighting distribution? Deeply tied to the distribution shift problem is the choice of which distribution to sample from. Do moving distributions cause instability, as Q-values trained on one distribution are evaluated under another in subsequent iterations? Researchers have often noted that on-policy samples are typically superior to off-policy samples (Sutton & Barto, 2018), and there are several theoretical results that highlight favorable convergence properties under on-policy samples. However, there is little theoretical guidance on how best to pick the sampling distribution. To this end, we investigate several choices for the sampling distribution. Surprisingly, we find that on-policy training distributions are not always preferable, and that a clear pattern in performance with respect to training distribution is that broader, higher-entropy distributions perform better, regardless of distributional shift. Motivated by our findings, we propose a novel weighting distribution, adversarial feature matching (AFM), which explicitly compensates for function approximator error while still producing high-entropy sampling distributions.

Our contributions are as follows: We introduce a unit testing framework for Q-learning to disentangle issues related to function approximation, sampling, and distributional shift, where approximate components are replaced by oracles. This allows for a controlled analysis of different sources of error. We perform a detailed experimental analysis of many hypothesized sources of instability, error, and slow training in Q-learning algorithms on tabular domains, and show that many of these trends hold true in high-dimensional domains. We propose novel choices of sampling distributions which lead to improved performance even on high-dimensional tasks. Our overall aim is to offer practical guidance for designing RL algorithms, as well as to identify important issues to solve in future research.

2. Preliminaries

Q-learning algorithms aim to solve a Markov decision process (MDP) by learning the optimal state-action value function, or Q-function. We define an MDP as a tuple (S, A, T, R, γ). S and A represent the state and action spaces, respectively. T(s'|s, a) and R(s, a) represent the dynamics (transition distribution) and reward function, and γ ∈ (0, 1) represents the discount factor. The goal in RL is to find a policy π(a|s) that maximizes the expected cumulative discounted rewards, known as the returns:

π* = argmax_π E_{a_t∼π(·|s_t), s_{t+1}∼T(·|s_t, a_t)} [ Σ_{t=0}^∞ γ^t R(s_t, a_t) ].

The quantities of interest in Q-learning methods are state-action value functions, which give the expected future return starting from a particular state-action tuple, denoted Q^π(s, a); the corresponding state value function is denoted V^π(s). Q-learning algorithms are based on iterating the Bellman backup operator T, defined as

(T Q)(s, a) = R(s, a) + γ E_{s'∼T}[V(s')],   where V(s) = max_{a'} Q(s, a').

The (tabular) Q-iteration algorithm is a dynamic programming algorithm that iterates the Bellman backup, Q^{t+1} ← T Q^t. Because the Bellman backup is a γ-contraction in the ℓ∞ norm, and Q* (the Q-values of π*) is its fixed point, Q-iteration can be shown to converge to Q* (Sutton & Barto, 2018). A deterministic optimal policy can then be obtained as π*(s) = argmax_a Q*(s, a).
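The tabular backup above is simple enough to state directly in code. The following is a minimal NumPy sketch of Q-iteration, assuming the dynamics and rewards are available as dense arrays; the function and argument names are illustrative, not taken from the paper's codebase.

```python
import numpy as np

def q_iteration(T, R, gamma=0.95, n_iters=1000, tol=1e-8):
    """Tabular Q-iteration: repeatedly apply the Bellman backup Q <- TQ.

    T: transition tensor of shape (S, A, S), T[s, a, s'] = p(s' | s, a)
    R: reward matrix of shape (S, A)
    Returns an (approximately) optimal Q-function of shape (S, A).
    """
    n_states, n_actions, _ = T.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        V = Q.max(axis=1)                    # V(s) = max_a Q(s, a)
        Q_new = R + gamma * T.dot(V)         # (TQ)(s, a) = R(s, a) + gamma * E_{s'}[V(s')]
        if np.max(np.abs(Q_new - Q)) < tol:  # gamma-contraction in the l_inf norm
            return Q_new
        Q = Q_new
    return Q

# A deterministic optimal policy is then pi_star = q_iteration(T, R).argmax(axis=1).
```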

When state spaces cannot be enumerated in a tabular format, function approximators can be used to represent the Q-values. An important class of such Q-learning methods is fitted Q-iteration (FQI) (Ernst et al., 2005), or approximate dynamic programming (ADP), which forms the basis of modern deep RL methods such as DQN (Mnih et al., 2015). FQI projects the values of the Bellman backup onto a family of Q-function approximators Q:

Q^{t+1} ← Π_µ(T Q^t).

Here, Π_µ denotes a µ-weighted ℓ2 projection, which minimizes the Bellman error:

Π_µ(Q) := argmin_{Q' ∈ Q} E_{s,a∼µ}[(Q'(s, a) − Q(s, a))^2].   (1)

The values produced by the Bellman backup, (T Q^t)(s, a), are commonly referred to as target values, and when neural networks are used for function approximation, the previous Q-function Q^t(s, a) is referred to as the target network. In this work, we distinguish between the cases when the Bellman error is estimated with Monte-Carlo sampling or computed exactly (see Section 3.1). The sampled variant corresponds to FQI as described in the literature (Ernst et al., 2005; Riedmiller, 2005), while the exact variant is analogous to conventional ADP (Bertsekas & Tsitsiklis, 1996).

Convergence guarantees for Q-iteration do not cleanly translate to FQI. Π_µ is an ℓ2 projection, but T is a contraction in the ℓ∞ norm — this norm mismatch means the composition of the backup and projection is no longer guaranteed to be a contraction under any norm (Bertsekas & Tsitsiklis, 1996), and hence convergence is not guaranteed.

A related branch of Q-learning methods is online Q-learning, in which Q-values are updated while samples are being collected in the MDP. This includes classic algorithms such as Watkins' Q-learning (Watkins & Dayan, 1992). Online Q-learning methods can be viewed as a form of stochastic approximation (such as Robbins-Monro) applied to Q-iteration and FQI (Bertsekas & Tsitsiklis, 1996), and they share many of its theoretical properties (Szepesvári, 1998). Modern deep RL algorithms such as DQN (Mnih et al., 2015) have characteristics of both online Q-learning and FQI: using replay buffers means the sampling distribution µ changes very little between target updates (see Section 6.3), and target networks are justified from the viewpoint of FQI. Because FQI corresponds to the case when the sampling distribution is static between target updates, the behavior of modern deep RL methods more closely resembles FQI than a true online method without target networks.

3. Experimental Setup

Our experimental setup is centered around unit-testing. We first introduce a spectrum of Q-learning algorithms, starting with exact approximate dynamic programming and gradually replacing oracle components, such as knowledge of the dynamics, until the algorithm resembles modern deep Q-learning methods. We then introduce a suite of tabular environments where oracle solutions can be computed and compared against, to aid in diagnosis, as well as high-dimensional environments for verifying our hypotheses. In order to provide consistent metrics across domains, we normalize returns and errors involving Q-functions (such as the Bellman error) by the returns of the expert policy π* on each environment.

3.1. Algorithms

In the analysis presented in Sections 4, 5, 6, and 7, we use three different Q-learning variants, each of which removes some of the approximations in the standard Q-learning method used in the literature: Exact-FQI, Sampled-FQI, and Replay-FQI. Although FQI is not exactly identical to commonly used deep RL methods, such as DQN (Mnih et al., 2015), DDPG (Lillicrap et al., 2015), and SAC (Haarnoja et al., 2017), it is structurally similar, and when the replay buffer for the commonly used methods becomes large, the difference becomes negligible, since the sampling distribution changes very little between target network updates. However, FQI methods are much more amenable to controlled analysis, since we can separately isolate target values, update rates, and the number of samples used for each iteration. We therefore use variants of FQI as the basis for our analysis, but we also confirm that similar trends hold with more commonly used algorithms on standard benchmark problems.

Exact-FQI (Algorithm 1): Exact-FQI computes the backup and projection on all state-action tuples without any sampling error. It also assumes knowledge of the dynamics and reward function to compute Bellman backups exactly. We use Exact-FQI to study convergence, distribution shift (by varying weighting distributions on transitions), and function approximation in the absence of sampling error. Exact-FQI eliminates errors due to sampling states and computing inexact, sampled backups.

Sampled-FQI (Algorithm 2): Sampled-FQI is a special case of Exact-FQI, where the Bellman error is approximated with Monte-Carlo estimates from a sampling distribution µ, and the Bellman backup is approximated with samples from the dynamics as r(s, a) + γ max_{a'} Q(s', a'). We use Sampled-FQI to study the effects of overfitting. Sampled-FQI incorporates all sources of error — arising from function approximation, sampling, and also distribution shift.

Replay-FQI (Algorithm 3): Replay-FQI is a special case of Sampled-FQI that uses a replay buffer (Lin, 1992) which saves past transition samples (s, a, s', r) for computing the Bellman error. Replay-FQI strongly resembles DQN (Mnih et al., 2015), lacking only the online updates that allow µ to change within an FQI iteration. With large replay buffers, we expect the difference between Replay-FQI and DQN to be minimal, as µ changes slowly.
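To make the projection in Eqn. (1) concrete, the following is a minimal sketch of the µ-weighted ℓ2 projection for a linear Q-function class, solved in closed form as a weighted least-squares problem. The names and the small ridge term are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def weighted_l2_projection(phi, targets, mu, reg=1e-8):
    """Solve argmin_w sum_i mu_i * (phi_i^T w - targets_i)^2 (cf. Eqn. 1).

    phi:     (N, d) feature matrix, one row per state-action pair
    targets: (N,)   backed-up values (T Q)(s, a) to be projected
    mu:      (N,)   nonnegative weights (the weighting distribution)
    """
    W = np.diag(mu / mu.sum())
    A = phi.T @ W @ phi + reg * np.eye(phi.shape[1])  # small ridge term for numerical stability
    b = phi.T @ W @ targets
    w = np.linalg.solve(A, b)
    return phi @ w  # projected Q-values, Pi_mu(T Q)
```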

Algorithm 1 Exact-FQI
1: Initialize Q-value approximator Qθ(s, a).
2: for step t in {1, . . . , N} do
3:   Evaluate Qθt(s, a) at all states.
4:   Compute exact target values at all states: y(s, a) = r(s, a) + γ E_{s'}[Vθt(s')]
5:   Minimize projection loss with respect to µ: argmin_θ E_µ[(Qθ(s, a) − y(s, a))^2]
6: end for

Algorithm 2 Sampled-FQI
1: Initialize Q-value approximator Qθ(s, a).
2: for step t in {1, . . . , N} do
3:   Collect M samples from µ.
4:   Evaluate Qθt(s, a) on samples.
5:   Compute sampled target values on samples: ŷ_i = r_i + γ Vθt(s'_i)
6:   Minimize projection loss with respect to samples: argmin_θ (1/M) Σ_{i=1}^{M} (Qθ(s_i, a_i) − ŷ_i)^2
7: end for

Algorithm 3 Replay-FQI
1: Initialize Q-value approximator Qθ(s, a), replay buffer B.
2: for step t in {1, . . . , N} do
3:   Collect K online samples from µ.
4:   Append online samples to buffer B.
5:   Collect M samples from B.
6:   Evaluate Qθt(s, a) on samples.
7:   Compute sampled target values on samples: ŷ_i = r_i + γ Vθt(s'_i)
8:   Minimize projection loss with respect to samples: argmin_θ (1/M) Σ_{i=1}^{M} (Qθ(s_i, a_i) − ŷ_i)^2
9: end for

We additionally investigate the following choices of weighting distribution µ for the Bellman error. When sampling the Bellman error, these can be implemented by sampling directly from the distribution, or via importance sampling.

Unif(s, a): Uniform weights over the state-action space. This is the weighting distribution typically used by dynamic programming algorithms, such as FQI.

π(s, a): The on-policy state-action marginal induced by π.

π*(s, a): The state-action marginal induced by π*.

Random(s, a): The state-action marginal induced by executing uniformly random actions.

Prioritized(s, a): Weights Bellman errors proportionally to |Q(s, a) − T Q(s, a)|. This is similar to prioritized replay (Schaul et al., 2015) without importance sampling.

Replay(s, a) and Replay10(s, a): The averaged state-action marginal of all policies (or of the previous 10 policies) produced during training. This simulates sampling uniformly from a replay buffer in which infinite samples are collected from each policy.

3.2. Domains

We evaluate our methods on a suite of tabular environments where we can compute oracle values. This helps us compare, analyze, and fix various sources of error by comparing the learned Q-functions to the true, oracle-computed Q-functions. We selected 8 tabular domains, each with different qualitative attributes, including: gridworlds of varying sizes and observations, blind Cliffwalk (Schaul et al., 2015), discretized Pendulum and Mountain Car based on implementations in OpenAI Gym (Plappert et al., 2018), and a random sparsely connected graph. We give full details of these environments, as well as the motivation for their inclusion, in Appendix A.

3.3. Function Approximators

Throughout our experiments, we use 2-layer ReLU networks, denoted by a tuple (N, N) where N represents the number of units in a layer. The "Tabular" architecture refers to the case when no function approximation is used.

3.4. High-Dimensional Testing

In addition to diagnostic experiments on tabular domains, we also wish to see whether the observed trends hold true in high-dimensional environments. To this end, we include experiments on continuous control tasks in the OpenAI Gym benchmark (Plappert et al., 2018): HalfCheetah-v2, Hopper-v2, Ant-v2, and Walker2d-v2. In continuous domains, computing the maximum of the Q-value over actions (max_a Q(s, a)) is difficult. A common choice in this case is to use a second "actor" neural network to approximate argmax_a Q(s, a) (Lillicrap et al., 2015; Fujimoto et al., 2018; Haarnoja et al., 2018). This approach most closely resembles Replay-FQI, but with the actor network used in place of the max.

4. Function Approximation and Convergence

The first issue we investigate is the connection between function approximation and convergence properties.

4.1. Technical Background

As discussed in Section 2, when function approximation is introduced to Q-learning, convergence guarantees are lost. This interaction between approximation and convergence has been a long-studied topic in reinforcement learning. In the control literature, it is closely related to the problems of state aliasing or interference (Farrell & Berger, 1995). Baird (1995) introduces a simple counterexample in which Watkins' Q-learning with linear approximators can cause unbounded divergence. In the policy evaluation scenario, Tsitsiklis & Van Roy (1997) prove that on-policy TD-learning with linear function approximators converges, and methods such as GTD (Sutton et al., 2009a) and ETD (Sutton et al., 2016) have extended results to off-policy cases. In the control scenario, convergent algorithms such as SBEED (Dai et al., 2018) and Greedy-GQ (Maei et al., 2010) have been developed. However, several works have noted that divergence need not occur. Munos (2005) theoretically addresses the norm-mismatch problem, showing that unbounded divergence is impossible provided µ has adequate support and projections are non-expansive in p-norms. Concurrently to us, Van Hasselt et al. (2018) experimentally find that unbounded divergence rarely occurs with DQN variants on Atari games.

4.2. How does function approximation affect convergence properties and suboptimality of solutions?

The crucial quantities we wish to measure are the trend between function approximation and performance, and a measure of the bias introduced into the learning procedure by function approximation. Thus, using Exact-FQI with uniform weighting (to remove sampling error), we measure the returns of the learning policy, and the ℓ∞ error between Q* and either the solution found by Exact-FQI (lim_{t→∞}(Π_µ T)^t Q^0) or the projection of the optimal solution (Π_µ Q*). Π_µ Q* represents the best solution inside the model class, in the absence of error from the bootstrapping process of FQI. Thus, the difference between the FQI error and the projection error represents the bias introduced by the bootstrapping procedure, while controlling for bias that is simply due to function approximation — this quantity is roughly the inherent Bellman error of the function class (Munos & Szepesvári, 2008). This is the gap which can possibly be improved upon via better Q-learning algorithm design. We plot our results in Fig. 1.

Figure 1. Normalized returns and normalized Q-function error with function approximation, averaged across domains and seeds. We see that for small architectures, there is a significant gap between the solution found by FQI (FQI Error) and the best solution within the model class (Project Error).

We first note the obvious trend that smaller architectures produce lower returns and converge to more suboptimal solutions. However, we also find that smaller architectures introduce significant bias into the learning process, and there is often a significant gap between the solution found by Exact-FQI and the best solution within the model class. This gap may be due to the fact that when the target is bootstrapped, we must be able to represent all Q-functions along the path to the solution, and not just the final result (Bertsekas & Tsitsiklis, 1996). This observation implies that using large architectures is crucial not only because they have the capacity to represent a better solution, but also because they are significantly easier to train using bootstrapping and suffer less from nonconvergence issues. We also note that divergence rarely happens in practice: we observed divergence in 0.9% of our experiments using function approximation, measured by the largest Q-value growing larger than 10 times that of Q*.

For high-dimensional problems, we present experiments on varying the architecture of the Q-network in SAC (Haarnoja et al., 2018) in Appendix Fig. 13. We still observe that large networks have the best performance, and that divergence rarely happens even in high-dimensional continuous spaces. We briefly discuss theoretical intuitions on the apparent discrepancy between the lack of unbounded divergence and the known counterexamples in Appendix B.
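The diagnostics in this section reduce to a few array computations. The sketch below shows one way to compute the normalized FQI error, projection error, and the divergence criterion mentioned above; the function and argument names are illustrative, not the paper's code.

```python
import numpy as np

def convergence_diagnostics(Q_fqi, Q_projected_star, Q_star, expert_return):
    """Section 4.2-style diagnostics (a rough sketch).

    Q_fqi:            solution found by Exact-FQI,           shape (S, A)
    Q_projected_star: projection of Q* onto the model class, shape (S, A)
    Q_star:           oracle optimal Q-function,             shape (S, A)
    expert_return:    return of pi*, used for normalization.
    """
    fqi_error = np.max(np.abs(Q_fqi - Q_star)) / expert_return                  # normalized l_inf FQI error
    project_error = np.max(np.abs(Q_projected_star - Q_star)) / expert_return   # model-class (projection) error
    bootstrap_bias = fqi_error - project_error                                  # gap attributable to bootstrapping
    diverged = np.max(np.abs(Q_fqi)) > 10.0 * np.max(np.abs(Q_star))            # divergence criterion used in the text
    return fqi_error, project_error, bootstrap_bias, diverged
```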

5. Sampling Error and Overfitting

A second source of error in minimizing the Bellman error, orthogonal to function approximation, is that of sampling or generalization error. The next issue we investigate is the effect of sampling error on Q-learning methods.

5.1. Technical Background

Approximate dynamic programming assumes that the projection of the Bellman backup (Eqn. 1) is computed exactly, but in reinforcement learning we can normally only compute the empirical Bellman error over a finite set of samples. In the PAC framework, overfitting can be quantified by a bound on the error between the empirical and expected loss that holds with high probability and decays with sample size (Shalev-Shwartz & Ben-David, 2014). Munos & Szepesvári (2008), Maillard et al. (2010), and Tosatto et al. (2017) provide such PAC bounds which account for sampling error in the context of Q-learning and value-based methods, and which quantify the quality of the final solution.

We analyze several key points that relate to sampling error. First, we show that Q-learning is prone to overfitting, and that this overfitting has a real impact on performance, in both tabular and high-dimensional settings. We also show that the replay buffer is in fact a very effective technique for addressing this issue, and we discuss several methods to mitigate the effects of overfitting in practice.
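The overfitting measurements used in the next subsection amount to comparing an empirical Bellman error on the training samples against an exactly computed validation error. A minimal sketch, under the assumption that exact targets and an on-policy weighting are available from the oracle (names are illustrative):

```python
import numpy as np

def expected_bellman_error(q_values, exact_targets, weights):
    """Validation-style error E_mu[(Q - TQ)^2] under a weighting distribution mu."""
    w = weights / weights.sum()
    return float(np.sum(w * (q_values - exact_targets) ** 2))

def overfitting_gap(q_values, sampled_targets, exact_targets, onpolicy_weights):
    """Gap between the exact on-policy validation loss (as in Fig. 3) and the
    empirical training loss on sampled targets; a large gap indicates overfitting."""
    empirical = float(np.mean((q_values - sampled_targets) ** 2))
    validation = expected_bellman_error(q_values, exact_targets, onpolicy_weights)
    return validation - empirical
```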

5.2. Quantifying Overfitting

We first quantify the amount of overfitting that happens during training by varying the number of samples. In order to provide comparable validation errors across different experiments, we fix a reference sequence of Q-functions, Q^1, ..., Q^N, obtained during a normal training run. We then retrace the training sequence and minimize the projection error Π_µ(Q^t) at each training iteration, using varying amounts of on-policy data or samples from a replay buffer. We measure the exact validation error (the expected Bellman error) at each iteration under the on-policy distribution, plotted in Fig. 3. We note the obvious trend that more samples lead to lower validation loss, confirming that overfitting can in fact occur. A more interesting observation is that sampling from the replay buffer results in the lowest on-policy validation loss, despite the bias due to distribution mismatch from sampling off-policy data. As we discuss in Section 6, we believe that replay buffers are mainly effective because they greatly reduce the effect of overfitting and create relatively good coverage over the state space, not necessarily because they reduce the effects of distribution shift.

Figure 3. On-policy validation losses for varying amounts of on-policy data (or replay buffer), averaged across environments and seeds. Note that sampling from the replay buffer has lower on-policy validation loss, despite bias from distribution shift.

Next, Fig. 2 shows the relationship between the number of samples and returns. We see a clear trend that a higher sample count leads to improved learning speed and a better final solution, confirming our hypothesis that overfitting has a significant effect on the performance of Q-learning. A full sweep including architectures is presented in Appendix Fig. 14. We observe that despite overfitting being an issue, larger architectures still perform better because the bias introduced by smaller architectures dominates.

Figure 2. Number of samples plotted against returns for a 256x256 network. More samples yield better performance.

5.3. What methods can be used to compensate for overfitting?

Finally, we discuss methods to compensate for overfitting. One common method for reducing overfitting is to regularize the function approximator to reduce its capacity. However, as we have seen that weaker architectures can give rise to suboptimal convergence, we instead study early stopping methods that mitigate overfitting without reducing model size. First, we observe that the number of gradient steps taken per sample in the projection step has an important effect on performance — too few steps and the algorithm learns slowly, but too many steps and the algorithm may initially learn quickly but then overfit. To show this, we run a hyperparameter sweep over the number of gradient steps taken per environment step in Replay-FQI and TD3 (TD3 uses 1 by default). Results for FQI are shown in Fig. 4, and for TD3 in Appendix Fig. 15.

Figure 4. Normalized returns plotted over training iterations (32 samples are taken per iteration), for different ratios of gradient steps taken per sample during projection using Replay-FQI. We observe that intermediate values of gradient steps work best, and too many gradient steps hinder performance.

In order to understand whether better early stopping criteria could help with overfitting, we employ oracle early stopping rules. While neither of these rules can be used to solve overfitting in practice, these experiments can provide guidance for future methods and an "upper bound" on the best improvement that can be obtained from optimal stopping. We investigate two oracle early stopping criteria for setting the number of gradient steps: the expected Bellman error and the expected returns of the greedy policy w.r.t. the current Q-function (oracle returns). We implement both methods by running the projection step of Replay-FQI to convergence using gradient descent and afterwards selecting the intermediate Q-function judged best by the evaluation metric (lowest Bellman error or highest returns). Using such oracle stopping metrics results in a modest boost in performance in tabular domains (Fig. 5). Thus, we believe that there is promise in further improving such early-stopping methods for reducing overfitting in deep RL algorithms.

Figure 5. Normalized returns plotted over training iterations (32 samples are taken per iteration), for different early stopping methods using Replay-FQI. We observe that using proper early stopping can result in a modest performance increase.

We might draw a few actionable conclusions from these experiments. First, overfitting is indeed a serious issue with Q-learning, and too many gradient steps or too few samples can lead to poor performance. Second, replay buffers and early stopping can be used to mitigate the effects of overfitting. Third, although overfitting is a problem, large architectures are still preferred, because the harm from function approximation bias outweighs the harm from increased overfitting with large models.

6. Non-Stationarity

In this section, we discuss issues related to the non-stationarity of the Q-learning process, relating to both the Bellman backup and Bellman error minimization.

6.1. Technical Background

Instability in Q-learning methods is often attributed to the nonstationarity of the regression objective (Lillicrap et al., 2015; Mnih et al., 2015). Nonstationarity occurs in two places: in the changing target values T Q, and in a changing weighting distribution µ ("distribution shift"), i.e., due to samples being taken from different policies. Note that a non-stationary objective, by itself, is not indicative of instability. For example, gradient descent can be viewed as successively minimizing linear approximations to a function: for gradient descent on f with parameter θ and learning rate α, we have the "moving" objective

θ^{t+1} = argmin_θ { ∇_θ f(θ^t)^T θ + (1/(2α)) ||θ − θ^t||_2^2 } = θ^t − α ∇_θ f(θ^t).

However, the fact that the Q-learning algorithm prescribes an update rule and not a stationary objective complicates analysis. Indeed, the motivation behind algorithms such as GTD (Sutton et al., 2009b;a) and residual methods (Baird, 1995; Scherrer, 2010) can be seen as introducing a stationary objective that can be optimized with standard procedures, such as gradient descent, for increased stability. Therefore, a key question to investigate is whether these non-stationarities are detrimental to the learning process.

6.2. Does a moving target cause instability in the absence of a moving distribution?

To study the moving target problem, we must first isolate the effects of a moving target and study how the rate at which the target changes impacts performance. To control this rate, we introduce an additional smoothing parameter α to Q-iteration, where the target values are computed as an α-moving average over previous targets. We define the α-smoothed Bellman backup, T^α, as follows:

T^α Q = α T Q + (1 − α) Q.

This scheme is inspired by the soft target update used in algorithms such as DDPG (Lillicrap et al., 2015) and SAC (Haarnoja et al., 2017) to improve the stability of learning. Standard Q-iteration uses a "hard" update where α = 1. A soft target update weakens the contraction of Q-iteration from γ to 1 − α + αγ (see Appendix C), so we expect slower convergence, but perhaps more stability under heavy function approximation error. We performed experiments with this modified backup using Exact-FQI under the Unif(s, a) weighting distribution.

Our results are presented in Appendix Fig. 12. We find that in most cases, the hard update with α = 1 results in the fastest convergence and the highest asymptotic performance. However, for the smallest two architectures we used, 4 × 4 and 16 × 16, lower values of α (such as 0.1) achieve slightly higher asymptotic performance. Thus, while more expressive architectures are still stable under fast-changing targets, we believe that a slowly moving target may have benefits under heavy approximation error. This evidence points to either using large function approximators, in line with the conclusions drawn in the previous sections, or adaptively slowing the target updates when the architecture is weak (relative to the problem difficulty) and the projected Bellman error is therefore high.
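For reference, the α-smoothed backup used in this experiment is a one-line modification of tabular Q-iteration; the sketch below (with illustrative names) makes the interpolation explicit.

```python
import numpy as np

def alpha_smoothed_backup(Q, T, R, gamma, alpha=0.1):
    """One step of the alpha-smoothed Bellman backup from Section 6.2:
        T_alpha Q = alpha * (T Q) + (1 - alpha) * Q,
    a (1 - alpha + alpha * gamma)-contraction instead of a gamma-contraction.

    Q: (S, A) current Q-values; T: (S, A, S) dynamics; R: (S, A) rewards.
    alpha = 1 recovers the standard "hard" Q-iteration update.
    """
    V = Q.max(axis=1)
    TQ = R + gamma * T.dot(V)
    return alpha * TQ + (1.0 - alpha) * Q
```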

6.3. Does distribution shift impact performance?

To study the distribution shift problem, we exactly compute the amount of distribution shift between iterations in total-variation distance, D_TV(µ^{t+1} || µ^t), and the "loss shift":

E_{µ^{t+1}}[(Q^t − T Q^t)^2] − E_{µ^t}[(Q^t − T Q^t)^2].

The loss shift quantifies the change in the Bellman error objective when it is evaluated under a new distribution: if the distribution shifts to previously unseen or low-support states, we would expect a highly inaccurate Q-value in such states, and a correspondingly high loss shift.

We run our experiments using Exact-FQI with a 256x256 architecture, and plot the distribution discrepancy and the loss discrepancy in Fig. 6. We find that Prioritized(s, a) has the greatest shift, followed by the on-policy variants. Replay buffers greatly reduce distribution shift compared to on-policy learning, which is in line with the de-correlation argument cited for their use by Mnih et al. (2015). However, we find that this metric correlates very little with the actual performance of the algorithm (Fig. 7). For example, prioritized weighting performs well yet has high distribution shift.

Figure 6. Distribution shift and loss shift plotted against time. Prioritized and on-policy distributions induce the greatest shift, whereas replay buffers greatly reduce the amount of shift.

Figure 7. Average distribution shift across time for different weighting distributions, plotted against returns for a 256x256 model. We find that distribution shift does not have a strong correlation with returns.

Overall, our experiments indicate that nonstationarities in both distributions and target values, when isolated, do not cause significant stability issues. Instead, other factors such as sampling error and function approximation appear to have more significant effects on performance. In light of these findings, we might therefore ask: can we design a better sampling distribution — one chosen without regard for distributional shift but with regard for high entropy — that results in better final performance and is realizable in practice? We investigate this in the following section.
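Both shift metrics above are straightforward to compute when the weighting distributions are represented explicitly over state-action pairs; a minimal sketch with flattened (s, a) arrays and illustrative names:

```python
import numpy as np

def total_variation(mu_new, mu_old):
    """D_TV(mu_new || mu_old) between two weighting distributions over (s, a)."""
    return 0.5 * float(np.sum(np.abs(mu_new - mu_old)))

def loss_shift(q_values, targets, mu_new, mu_old):
    """E_{mu_new}[(Q - TQ)^2] - E_{mu_old}[(Q - TQ)^2], as defined in Section 6.3."""
    sq_err = (q_values - targets) ** 2
    return float(np.sum(mu_new * sq_err) - np.sum(mu_old * sq_err))
```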

7. Sampling Distributions

As alluded to in Section 6, the choice of sampling distribution µ is an important design decision that can have a large impact on performance. Indeed, it is not immediately clear which distribution is ideal for Q-learning. In this section, we hope to shed some light on this issue.

7.1. Technical Background

Off-policy data has been cited as one of the "deadly triad" for Q-learning (Sutton & Barto, 2018), which has the potential to cause instabilities in learning. On-policy distributions (Tsitsiklis & Van Roy, 1997) and fixed behavior distributions (Sutton et al., 2009b; Maei et al., 2010) have often been targeted for theoretical convergence analysis, and many works use importance sampling to correct for off-policyness (Precup et al., 2001; Munos et al., 2016). However, to our knowledge, there is relatively little guidance on how different weighting distributions compare in terms of convergence rate and final solutions. Nevertheless, several works give hypotheses on good choices for weighting distributions. Munos (2005) provides an error bound which suggests that "more uniform" weighting distributions can guarantee better worst-case performance. Geist et al. (2017) suggest that when the state distribution is fixed, the action distribution should be weighted by the optimal policy for residual Bellman errors. In deep RL, several methods have been developed to prevent instabilities in Q-learning, such as prioritized replay (Schaul et al., 2015), and mixing the replay buffer with on-policy data (Hausknecht & Stone, 2016; Zhang & Sutton, 2017) has been found to be beneficial. In our experiments, we aim to empirically analyze multiple choices of weighting distribution to determine which are the most effective.

7.2. What are the best weighting distributions in the absence of sampling error?

We begin by studying the effect of weighting distributions when disentangled from sampling error. We run Exact-FQI with varying choices of architectures and weighting distributions and report our results in Fig. 8. Unif(s, a), Replay(s, a), and Prioritized(s, a) consistently result in the highest returns across all architectures. We believe that these results are in favor of the uniformity hypothesis: the top-performing distributions spread weight across larger support of the state-action space. For example, a replay buffer contains state-action tuples from many policies, and would therefore be expected to have wider support than the state-action distribution of a single policy. We can see this general trend in Fig. 9. These distributions generally result in the tightest contraction rates, and allow the Q-function to focus on locations where the error is high. In the sampled setting, this observation motivates exploration algorithms that maximize state coverage (for example, Hazan et al. (2018) solve an exploration objective which maximizes state-space entropy). However, note that in this particular experiment there is no sampling: all states are observed, just with different weights, thus isolating the issue of distributions from the issue of sampling.

Figure 8. Weighting distribution versus architecture in Exact-FQI. Replay(s, a) consistently provides the highest performance. Note that Adversarial Feature Matching is comparable to Replay(s, a), but surprisingly better for small networks.

Figure 9. Normalized returns plotted against normalized entropy for different weighting distributions. All experiments use Exact-FQI with a 256x256 network. We see a general trend that high-entropy distributions lead to greater performance.

7.3. Designing a Better Off-Policy Distribution: Adversarial Feature Matching

In our final study, we attempt to design a better weighting distribution, using insights from the previous sections, that can be easily integrated into deep RL methods. We refer to this method as adversarial feature matching (AFM). We draw upon three specific insights outlined in the previous analysis. First, the function approximator should be incentivized to maximize its ability to distinguish states, in order to minimize function approximation bias (Section 4). Second, the weighting distribution should emphasize areas where the Q-function incurs high Bellman error, in order to minimize the discrepancy between the ℓ2 norm error and the ℓ∞ norm error. Third, more uniform weighting distributions tend to be more performant. The first insight was also demonstrated in Liu et al. (2018), where enforcing sparsity in the Q-function was found to provide locality in the Q-function, which prevented catastrophic interference and provided better values for bootstrapping.

We propose to model our problem as a minimax game, where the weighting distribution is a parameterized adversary p_φ(s, a) which tries to maximize the Bellman error, while the Q-function Q_θ(s, a) tries to minimize it. Note that in the unconstrained setting, this game is equivalent to minimizing the ℓ∞ norm error in its dual-norm representation. However, in practical settings, where minimizing stochastic approximations of the ℓ∞ norm can be difficult for neural networks (also noticed when using PER (Van Hasselt et al., 2018)), it is crucial to introduce constraints to limit the power of the adversary. These constraints also keep the adversary closer to the uniform distribution while still allowing it to be sufficiently different at specific state-action pairs.
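The dual-norm equivalence referred to above can be written out explicitly: with no constraint on the adversary, the worst-case weighting concentrates on the single highest-error state-action pair, as in the following standard identity.

```latex
% Unconstrained adversary over weighting distributions recovers the l_infinity error:
\max_{p \,\in\, \Delta(\mathcal{S}\times\mathcal{A})}
  \mathbb{E}_{(s,a)\sim p}\!\left[(Q_\theta(s,a) - y(s,a))^2\right]
  \;=\; \max_{s,a}\,\bigl(Q_\theta(s,a) - y(s,a)\bigr)^2
  \;=\; \lVert Q_\theta - y \rVert_\infty^2 ,
% attained by a point mass on \arg\max_{s,a} |Q_\theta(s,a) - y(s,a)|.
```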

We elect to use a feature matching constraint which enforces the expected feature vector E[Φ(s)] under p_φ(s, a) to roughly match the expected feature vector under uniform sampling from the replay buffer. We can express the output of a neural network Q-function as Q_θ(s, a) = w_a^T Φ_θ(s) or, in the continuous case, as Q_θ(s, a) = w^T Φ_θ(s, a), where the feature vectors Φ_θ(s) and Φ_θ(s, a) represent the output of all but the final layer. Intuitively, this constraint restricts the adversarial sampler to distributing probability mass among states (or state-action pairs) that are perceptually similar to the Q-function, which in turn forces the Q-function to reduce state aliasing by learning features that are more separable. Note that, in our case, φ_w(s, a) = ∇_w Q_{w,θ}(s, a). This also suggests a natural extension of our method: performing expected gradient matching over all parameters (E_{p_φ(s,a)}[∇_{w,θ} Q_{w,θ}(s, a)]) instead of matching only Φ (we leave exploring this direction to future work). Formally, the objective is given as follows:

min_{θ,w} max_φ E_{p_φ(s,a)}[(Q_{w,θ}(s, a) − y(s, a))^2]   s.t.   || E_{p_φ(s,a)}[Φ(s)] − (Σ_i Φ(s_i))/N || ≤ ε

Note that Φ(s) is a function of θ, but while solving the maximization, θ is assumed to be a constant. This is equivalent to solving only the inner maximization with a constraint, and it empirically provides better stability. Implementation details for AFM are provided in Appendix D. The term (Σ_i Φ(s_i))/N denotes an estimator for the true expectation under some sampling distribution, such as a uniform distribution over all states and actions (in Exact-FQI) or the replay buffer distribution; that is, (Σ_i Φ(s_i))/N ≈ E_{p_rb}[Φ] when using a replay buffer.

While both AFM and PER tend to upweight samples in the buffer with a high Bellman error, PER explicitly attempts to reduce distribution shift via importance sampling. As we observed in Section 6.3, distributional shift is not actually harmful in practice, and AFM dispenses with this goal, instead explicitly aiming to rebalance the buffer to attain better coverage via adversarial optimization. In our experiments, this results in substantially better performance, consistent with the hypothesis that coverage, rather than reduction of distributional shift, is the most important property of a sampling distribution.

In tabular domains with Exact-FQI, we find that AFM performs on par with the top-performing weighting distributions, such as Unif(s, a), and better than Prioritized(s, a) (Fig. 8). This confirms that adaptive prioritization works better than Prioritized(s, a). Another benefit of AFM is its robustness to function approximation; the performance gains in the case of small architectures (say, (4, 4)) are particularly noticeable (Fig. 8).

In tabular domains with Replay-FQI (Table 1), we also compare AFM to prioritized replay (PER) (Schaul et al., 2015), where AFM and PER perform similarly in terms of normalized returns. Note that AFM reweights samples drawn uniformly from the buffer, whereas PER changes which samples are actually drawn. We also evaluate a variant of AFM (AFM + Sampling in Table 1) which changes which samples are drawn instead of reweighting. Essentially, in this version we sample from the replay buffer using probabilities determined by the AFM optimization, rather than using importance sampling when making Bellman updates. We note that, in Table 1, AFM + Sampling performs strictly better than AFM and PER.

Sampling distribution         Norm. Returns (16, 16)   Norm. Returns (64, 64)
None                          0.18                     0.23
Uniform(s, a)                 0.19                     0.25
π(s, a)                       0.45                     0.39
π*(s, a)                      0.30                     0.21
Prioritized(s, a)             0.17                     0.33
PER (Schaul et al., 2015)     0.42                     0.49
AFM (Ours)                    0.41                     0.48
AFM + Sampling (Ours)         0.43                     0.51

Table 1. Average performance of various sampling distributions for (16, 16) and (64, 64) neural networks in the setting with replay buffers, where sampling error, function approximation error, and distribution shift all coexist, averaged across 5 random seeds. PER, our AFM, and on-policy sampling perform roughly on par in expectation on benchmark tasks when using (16, 16) architectures. However, note that π(s, a) is generally computationally intractable. Another point to note is that AFM performs as well as PER by virtue of weighting alone, without changing which samples are drawn. AFM + Sampling, the sampling analogue of AFM, works better than PER and AFM on average on tabular domains.

We further evaluate AFM on MuJoCo tasks with the TD3 algorithm (Fujimoto et al., 2018) and the entropy-constrained SAC algorithm (Haarnoja et al., 2018). We find that in all three tested domains (HalfCheetah, Hopper, and Ant), AFM yields substantial empirical improvement in the case of TD3 (Fig. 10) and performs slightly better than entropy-constrained SAC (Fig. 11). Surprisingly, we found PER not to work very well in these domains. In light of these results, we conclude that: (1) the choice of sampling distribution is very important for performance, and (2) considerations such as incorporating knowledge about the function approximator (for example, through Φ) into the choice of µ (the sampling/weighting distribution) can be very effective.
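The following is a rough sketch of the AFM reweighting idea for a single minibatch; it is not the paper's implementation (Appendix D gives those details). The adversary is a softmax distribution over the batch, trained by gradient ascent to maximize the weighted Bellman error, with the feature-matching constraint replaced here by a simple quadratic penalty. All names and the penalty form are assumptions made for illustration.

```python
import numpy as np

def afm_weights(bellman_errors, features, ref_features, lam=1.0, n_steps=50, lr=0.5):
    """Adversarially reweight a minibatch (sketch of the AFM idea).

    bellman_errors: (M,)   per-sample squared Bellman errors
    features:       (M, d) penultimate-layer features Phi(s) for each sample
    ref_features:   (d,)   reference mean feature, e.g. the replay-buffer average of Phi
    lam:            strength of the (soft) feature-matching penalty
    Returns a distribution over the minibatch, usable as loss weights (AFM)
    or as sampling probabilities (AFM + Sampling).
    """
    logits = np.zeros(len(bellman_errors))
    for _ in range(n_steps):
        p = np.exp(logits - logits.max())
        p /= p.sum()
        mismatch = p @ features - ref_features                   # E_p[Phi] - reference mean
        g = bellman_errors - 2.0 * lam * (features @ mismatch)   # d(payoff)/dp_i
        logits += lr * p * (g - p @ g)                           # softmax chain rule, ascent step
    p = np.exp(logits - logits.max())
    return p / p.sum()
```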

Figure 10. (a) Ant-v2, (b) Hopper-v2, (c) HalfCheetah-v2. Average return for rollouts performed with the TD3 algorithm with/without AFM (Ours) and with Prioritized Replay (PER). Note that on average AFM performs better than both the baseline and Prioritized Replay. Each iteration on the x-axis corresponds to 5000 environment steps.

Figure 11. (a) Ant-v2, (b) Hopper-v2, (c) HalfCheetah-v2. Average return for rollouts performed with a trained SAC model with temperature auto-tuning (Haarnoja et al., 2018) with/without AFM. Note that on average AFM performs slightly better and is always at least on par with SAC. Each iteration on the x-axis corresponds to 1000 environment steps.

8. Conclusions and Discussion

From our analysis, we have several broad takeaways for the design of deep Q-learning algorithms.

Potential convergence issues with Q-learning do not seem to be endemic empirically, but function approximation still has a strong impact on the solution to which these methods converge. This impact goes beyond just approximation error, suggesting that Q-learning methods do find suboptimal solutions (within the given function class) with smaller function approximators. However, expressive architectures largely mitigate this problem, suffer less from bootstrapping error, converge faster, and are more stable with moving targets.

Sampling error can cause substantial overfitting problems with Q-learning. However, replay buffers and early stopping can mitigate this problem, and the biases incurred from small function approximators outweigh any benefits they may have in terms of overfitting. We believe the best strategy is to keep large architectures but carefully select the number of gradient steps used per sample. We showed that employing oracle early stopping techniques can provide large benefits in the performance of Q-learning. This motivates the future research direction of devising early stopping techniques to dynamically control the number of gradient steps in Q-learning, rather than setting it as a hyperparameter, as this can give rise to a big difference in performance.

The choice of sampling or weighting distribution has a significant effect on solution quality, even in the absence of sampling error. Surprisingly, we do not find on-policy distributions to be the most performant; rather, methods which have high state-entropy and spread mass uniformly among state-action pairs seem to be highly effective for training. Based on these insights, we propose a new weighting distribution, AFM, which balances high entropy against state aliasing and yields fair improvements in both tabular and continuous domains with state-of-the-art off-policy RL algorithms.

Finally, we note that there are many other topics in Q-learning that we did not investigate, such as overestimation bias and multi-step returns. We believe that these issues too could be studied in future work with our oracle-based analysis framework.

Acknowledgements

We thank Vitchyr Pong and Kristian Hartikainen for providing us with implementations of RL algorithms. We thank Chelsea Finn for comments on an earlier draft of this paper. SL thanks George Tucker for helpful discussion. We thank Google, NVIDIA, and Amazon for providing computational resources. This research was supported by Berkeley DeepDrive, NSF IIS-1651843 and IIS-1614653, the DARPA Assured Autonomy program, and ARL DCIST CRA W911NF-17-2-0181.

References

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In International Conference on Machine Learning (ICML), pp. 214–223, 2017.

Baird, L. Residual algorithms: Reinforcement learning with function approximation. In International Conference on Machine Learning (ICML), 1995.

Bertsekas, D. P. and Tsitsiklis, J. N. Neuro-dynamic programming. Athena Scientific, 1996.

Dai, B., Shaw, A., Li, L., Xiao, L., He, N., Liu, Z., Chen, J., and Song, L. SBEED: Convergent reinforcement learning with nonlinear function approximation. In International Conference on Machine Learning (ICML), pp. 1133–1142, 2018.

Daskalakis, C., Ilyas, A., Syrgkanis, V., and Zeng, H. Training GANs with optimism. In International Conference on Learning Representations (ICLR), 2018. URL https://openreview.net/forum?id=SJJySbbAZ.

Ernst, D., Geurts, P., and Wehenkel, L. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.

Farrell, J. A. and Berger, T. On the effects of the training sample density in passive learning control. In American Control Conference, 1995.

Fujimoto, S., van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning (ICML), pp. 1587–1596, 2018.

Geist, M., Piot, B., and Pietquin, O. Is the Bellman residual a bad proxy? In Advances in Neural Information Processing Systems (NeurIPS), pp. 3205–3214, 2017.

Haarnoja, T., Tang, H., Abbeel, P., and Levine, S. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning (ICML), 2017.

Haarnoja, T., Zhou, A., Abbeel, P., and Levine, S. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. CoRR, abs/1801.01290, 2018. URL http://arxiv.org/abs/1801.01290.

Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., and Levine, S. Soft actor-critic algorithms and applications. Technical report, 2018.

Hausknecht, M. and Stone, P. On-policy vs. off-policy updates for deep reinforcement learning. In Deep Reinforcement Learning: Frontiers and Challenges, IJCAI, 2016.

Hazan, E., Kakade, S. M., Singh, K., and Van Soest, A. Provably efficient maximum entropy exploration. arXiv preprint arXiv:1812.02690, 2018.

Kalashnikov, D., Irpan, A., Pastor, P., Ibarz, J., Herzog, A., Jang, E., Quillen, D., Holly, E., Kalakrishnan, M., Vanhoucke, V., and Levine, S. QT-Opt: Scalable deep reinforcement learning for vision-based robotic manipulation. In CoRL, volume 87 of Proceedings of Machine Learning Research, pp. 651–673. PMLR, 2018.

Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. In International Conference on Learning Representations (ICLR), 2015.

Lin, L.-J. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning, 8(3-4):293–321, 1992.

Liu, V., Kumaraswamy, R., Le, L., and White, M. The utility of sparse representations for control in reinforcement learning. CoRR, abs/1811.06626, 2018. URL http://arxiv.org/abs/1811.06626.

Maei, H. R., Szepesvári, C., Bhatnagar, S., and Sutton, R. S. Toward off-policy learning control with function approximation. In International Conference on Machine Learning (ICML), 2010.

Maillard, O.-A., Munos, R., Lazaric, A., and Ghavamzadeh, M. Finite-sample analysis of Bellman residual minimization. In Asian Conference on Machine Learning (ACML), pp. 299–314, 2010.

Metelli, A. M., Papini, M., Faccio, F., and Restelli, M. Policy optimization via importance sampling. CoRR, abs/1809.06098, 2018. URL http://arxiv.org/abs/1809.06098.

Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015. ISSN 0028-0836.

Munos, R. Error bounds for approximate value iteration. In AAAI Conference on Artificial Intelligence (AAAI), pp. 1006–1011. AAAI Press, 2005.

Munos, R. and Szepesvári, C. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(May):815–857, 2008.

Munos, R., Stepleton, T., Harutyunyan, A., and Bellemare, M. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1054–1062, 2016.

Plappert, M., Andrychowicz, M., Ray, A., McGrew, B., Baker, B., Powell, G., Schneider, J., Tobin, J., Chociej, M., Welinder, P., Kumar, V., and Zaremba, W. Multi-goal reinforcement learning: Challenging robotics environments and request for research, 2018.

Precup, D., Sutton, R. S., and Dasgupta, S. Off-policy temporal difference learning with function approximation. In International Conference on Machine Learning (ICML), pp. 417–424, 2001.

Riedmiller, M. Neural fitted q iteration–first experiences Van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, with a data efficient neural reinforcement learning method. N., and Modayil, J. Deep reinforcement learning and the In European Conference on Machine Learning, pp. 317– deadly triad. arXiv preprint arXiv:1812.02648, 2018. 328. Springer, 2005. Watkins, C. J. and Dayan, P. Q-learning. Machine learning, Schaul, T., Quan, J., Antonoglou, I., and Silver, D. Prior- 8(3-4):279–292, 1992. itized experience replay. International Conference on Learning Representations (ICLR), 2015. Yazıcı, Y., Foo, C.-S., Winkler, S., Yap, K.-H., Piliouras, G., and Chandrasekhar, V. The unusual effectiveness of Scherrer, B. Should one compute the temporal difference averaging in GAN training. In International Conference fix point or minimize the bellman residual? the unified on Learning Representations, 2019. URL https:// oblique projection view. In International Conference on openreview.net/forum?id=SJgw_sRqFQ. Machine Learning (ICML), pp. 959–966, 2010. Zhang, S. and Sutton, R. S. A deeper look at experience Shalev-Shwartz, S. and Ben-David, S. Understanding ma- replay. CoRR, abs/1712.01275, 2017. URL http:// chine learning: From theory to algorithms. Cambridge arxiv.org/abs/1712.01275. university press, 2014.

Shani, G., Heckerman, D., and Brafman, R. I. An mdp- based recommender system. Journal of Machine Learn- ing Research, 6(Sep):1265–1295, 2005.

Sutton, R. S. and Barto, A. G. Reinforcement learning: An introduction. Second edition, 2018.

Sutton, R. S., Maei, H. R., Precup, D., Bhatnagar, S., Sil- ver, D., Szepesvari,´ C., and Wiewiora, E. Fast gradient- descent methods for temporal-difference learning with linear function approximation. In International Confer- ence on Machine Learning (ICML), 2009a.

Sutton, R. S., Maei, H. R., and Szepesvari,´ C. A convergent o(n) temporal-difference algorithm for off-policy learning with linear function approximation. In Advances in Neu- ral Information Processing Systems (NeurIPS), 2009b.

Sutton, R. S., Mahmood, A. R., and White, M. An emphatic approach to the problem of off-policy temporal-difference learning. The Journal of Machine Learning Research, 17 (1):2603–2631, 2016.

Szepesvári, C. The asymptotic convergence-rate of q-learning. In Advances in Neural Information Processing Systems, pp. 1064–1070, 1998.

Tosatto, S., Pirotta, M., D'Eramo, C., and Restelli, M. Boosted fitted q-iteration. In International Conference on Machine Learning (ICML), pp. 3434–3443. JMLR.org, 2017.

Tsitsiklis, J. N. and Van Roy, B. Analysis of temporal-difference learning with function approximation. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1075–1081, 1997.

Haarnoja, T., Zhou, A., Hartikainen, K., Tucker, G., Ha, S., Tan, J., Kumar, V., Zhu, H., Gupta, A., Abbeel, P., and Levine, S. Soft actor-critic algorithms and applications. Technical report, 2018.

Van Hasselt, H., Doron, Y., Strub, F., Hessel, M., Sonnerat, N., and Modayil, J. Deep reinforcement learning and the deadly triad. arXiv preprint arXiv:1812.02648, 2018.

Watkins, C. J. and Dayan, P. Q-learning. Machine Learning, 8(3-4):279–292, 1992.

Yazıcı, Y., Foo, C.-S., Winkler, S., Yap, K.-H., Piliouras, G., and Chandrasekhar, V. The unusual effectiveness of averaging in GAN training. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=SJgw_sRqFQ.

Zhang, S. and Sutton, R. S. A deeper look at experience replay. CoRR, abs/1712.01275, 2017. URL http://arxiv.org/abs/1712.01275.

Appendices

A. Benchmark Tabular Domains

We evaluate on a benchmark of 8 tabular domains, selected for qualitative differences.

4 Gridworlds: The Gridworld environment is an N×N grid with randomly placed walls. The reward is proportional to the Manhattan distance to a goal state (1 at the goal, 0 at the initial position), and there is a 5% chance that the agent travels in a different direction than commanded. We vary two parameters: the size (16×16 and 64×64) and the state representation. We use a "one-hot" representation, an (X, Y) coordinate tuple (represented as two one-hot vectors), and a "random" representation, a vector drawn from $\mathcal{N}(0, 1)^N$, where N is the width or height of the Gridworld. The random observations significantly increase the challenge of function approximation, as significant state aliasing occurs.

Cliffwalk: Cliffwalk is a toy example from Schaul et al. (2015). It consists of a sequence of states, where each state has two allowed actions: advance to the next state, or return to the initial state. A reward of 1.0 is obtained when the agent reaches the final state. Observations consist of vectors drawn from $\mathcal{N}(0, 1)^{16}$.

InvertedPendulum and MountainCar: InvertedPendulum and MountainCar are discretized versions of continuous control tasks found in OpenAI Gym (Plappert et al., 2018), and are based on problems from the classical RL literature. In the InvertedPendulum task, an agent must swing up a pendulum and hold it in its upright position. The state consists of the angle and angular velocity of the pendulum, and maximum reward is given when the pendulum is upright. The observation consists of the sin and cos of the pendulum angle, and the angular velocity. In the MountainCar task, the agent must push a vehicle up a hill, but the hill is steep enough that the agent must gather momentum by swinging back and forth within a valley in order to reach the top. The state consists of the position and velocity of the vehicle.

SparseGraph: The SparseGraph environment is a 256-state graph with randomly drawn edges. Each state has two edges, each corresponding to an action. One state is chosen as the goal state, where the agent receives a reward of one.
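To make the Gridworld state representations described above concrete, the following is a minimal NumPy sketch that constructs the three observation types for a single grid cell. It is an illustration of our reading of the descriptions, not the code used for the experiments; the function names and the fixed seed for the Gaussian features are illustrative choices.

```python
import numpy as np

def one_hot_obs(x, y, n):
    """Full one-hot encoding of the grid cell: a length n*n indicator vector."""
    obs = np.zeros(n * n)
    obs[x * n + y] = 1.0
    return obs

def coordinate_obs(x, y, n):
    """(X, Y) representation: two concatenated one-hot vectors of length n each."""
    ox, oy = np.zeros(n), np.zeros(n)
    ox[x], oy[y] = 1.0, 1.0
    return np.concatenate([ox, oy])

def random_obs(x, y, n, seed=0):
    """Random representation: each cell gets a fixed vector drawn from N(0, 1)^n.
    Nearby cells share no structure, so a weak approximator aliases states easily."""
    rng = np.random.RandomState(seed)
    table = rng.randn(n * n, n)      # one fixed Gaussian vector per grid cell
    return table[x * n + y]

obs_onehot = one_hot_obs(3, 5, 16)   # 256-dimensional indicator
obs_xy = coordinate_obs(3, 5, 16)    # 32-dimensional two-hot
obs_rand = random_obs(3, 5, 16)      # 16-dimensional Gaussian features
```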

B. Fitted Q-iteration with Bounded Projection Error

When function approximation is introduced to Q-iteration, we lose the guarantee that our solution will converge to the optimal solution $Q^*$, because the composition of projection and backup is no longer guaranteed to be a contraction under any norm. However, this does not imply divergence; in most cases it merely degrades the quality of the solution found. This can be seen by recalling the following result from Bertsekas & Tsitsiklis (1996), which describes the quality of the solution obtained by fitted Q-iteration (FQI) when the projection error at each step is bounded. The conclusion is that FQI converges to an $\ell_\infty$ ball around the optimal solution which scales proportionally with the projection error. While this statement does not claim that divergence cannot occur in general (the theorem can only be applied in retrospect, since we cannot always uniformly bound the projection error at each iteration), it nevertheless offers important intuition about the behavior of FQI under approximation error. For similar results concerning $\mu$-weighted $\ell_2$ norms, see Munos (2005).

Theorem B.1 (Bounded error in fitted Q-iteration). Let the projection or Bellman error at each iteration of FQI be uniformly bounded by $\delta$, i.e. $\|\hat{Q}_{i+1} - T\hat{Q}_i\|_\infty \le \delta\ \forall i$. Then, the error in the final solution is bounded as
\[ \lim_{i \to \infty} \|\hat{Q}_i - Q^*\|_\infty \le \frac{\delta}{1 - \gamma}. \]

Proof. See Chapter 6 of Bertsekas & Tsitsiklis (1996).

We can use this statement to provide a bound on the performance of the final policy.

Corollary B.1.1. Suppose we run fitted Q-iteration, and let the projection error at each iteration be uniformly bounded by $\delta$, i.e. $\|\hat{Q}_{i+1} - T\hat{Q}_i\|_\infty \le \delta\ \forall i$. Letting $\eta(\pi)$ denote the returns of a policy $\pi$, the performance of the final policy is bounded as
\[ \lim_{i \to \infty} |\eta(\pi_i) - \eta(\pi^*)| \le \frac{2\gamma\delta}{(1 - \gamma)^2}. \]

Proof. This result is obtained by substituting Thm. B.1 into Proposition 6.1 of Bertsekas & Tsitsiklis (1996).
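To make the quantities in Theorem B.1 concrete, the sketch below runs tabular FQI on a small random MDP while projecting each backup onto a fixed linear feature subspace, and compares the resulting sup-norm error to $Q^*$ against the $\delta/(1-\gamma)$ ball. This is a minimal NumPy illustration under assumptions chosen only for this example (a random 20-state MDP, random features, capacity $d = 16$); it is not the experimental code used in this paper.

```python
import numpy as np

rng = np.random.RandomState(0)
S, A, gamma = 20, 2, 0.95

# Random MDP: P[s, a] is a distribution over next states, R holds rewards.
P = rng.dirichlet(np.ones(S), size=(S, A))           # shape (S, A, S)
R = rng.rand(S, A)

def backup(Q):
    """Exact Bellman optimality backup T Q."""
    return R + gamma * P.dot(Q.max(axis=1))           # shape (S, A)

# Ground-truth Q* by exact value iteration.
Q_star = np.zeros((S, A))
for _ in range(2000):
    Q_star = backup(Q_star)

# Linear function class: Q(s, a) is approximated by phi(s, a)^T w.
d = 16                                                # capacity of the approximator
Phi = rng.randn(S * A, d)

def project(Q):
    """Least-squares projection of Q onto the span of the features."""
    w, *_ = np.linalg.lstsq(Phi, Q.reshape(-1), rcond=None)
    return (Phi @ w).reshape(S, A)

# Fitted Q-iteration: Q_{i+1} = Proj(T Q_i), tracking the projection error.
Q, delta = np.zeros((S, A)), 0.0
for i in range(200):
    target = backup(Q)
    Q = project(target)
    delta = max(delta, np.abs(Q - target).max())      # max projection error seen

print("uniform projection error delta:", delta)
print("||Q_i - Q*||_inf:", np.abs(Q - Q_star).max())
print("Theorem B.1 bound delta / (1 - gamma):", delta / (1 - gamma))
```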
B.1. Unbounded divergence in FQI

Because $\ell_2$ norms are bounded by the $\ell_\infty$ norm, Thm. B.1 implies that unbounded divergence is impossible when the weighting distribution has positive support at all states and actions (i.e. $\mu(s, a) > 0\ \forall (s, a) \in \mathcal{S} \times \mathcal{A}$) and the projection is non-expansive in the $\ell_2$ norm (such as when using linear approximators). We can bound the $\mu$-weighted $\ell_2$ norm in terms of the $\ell_\infty$ norm as follows:
\[ \|\cdot\|_{2,\mu} \le \frac{1}{\min_{s,a} \mu(s,a)} \|\cdot\|_2 \le \frac{\sqrt{|\mathcal{S}||\mathcal{A}|}}{\min_{s,a} \mu(s,a)} \|\cdot\|_\infty. \]
Thus, we can apply Thm. B.1 with $\delta = \sqrt{|\mathcal{S}||\mathcal{A}|} / \min_{s,a} \mu(s,a)$ to show that unbounded divergence is impossible. Note that because this bound scales with the size of the state and action spaces, it is fairly loose in many practical cases, and practitioners may nevertheless see Q-values grow to large values (tighter bounds concerning $\ell_2$ norms, which depend on the transition distribution, can be found in Munos (2005)). It also suggests that weighting distributions which are fairly uniform (so as to maximize the denominator) can perform well.

When the weighting distribution $\mu$ does not have support over all states and actions, divergence can still occur, as noted in counterexamples such as the one in Section 11.2 of Sutton & Barto (2018). In this case, we consider two states (state 1 and state 2) with feature vectors 1 and 2, respectively, and a linear approximator with parameter $w$. There exists a single action with a deterministic transition from state 1 to state 2, and we only sample the transition from state 1 to state 2 (i.e. $\mu(s, a)$ is 1 for state 1 and 0 for state 2). All rewards are 0. In this case, the projected Bellman backup takes the form:
\[ w^{t+1} = \arg\min_w\ (w - 2\gamma w^t)^2, \]
which causes unbounded growth, $\lim_{t \to \infty} w^t = \infty$, when iterated, provided $\gamma > 0.5$. However, if we add a transition from state 2 back to itself or to state 1, and place nonzero probability on sampling these transitions, divergence can be avoided.
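The counterexample above reduces to the scalar iteration $w \leftarrow 2\gamma w$, which a few lines of code make tangible. The following is a minimal NumPy sketch of that iteration and of the stabilized variant in which a transition from state 2 back to state 1 is also sampled; the equal sampling weights in the stabilized case are an illustrative choice.

```python
import numpy as np

gamma = 0.9
features = np.array([1.0, 2.0])      # feature of state 1 and state 2

def projected_backup(w, transitions, mu):
    """One projected Bellman backup under sampling distribution mu.
    transitions is a list of (source, next_state) index pairs; all rewards are 0."""
    x = np.array([features[s] for s, _ in transitions])                # fitted values are x * w_new
    y = np.array([gamma * features[s2] * w for _, s2 in transitions])  # backup targets
    # Weighted least squares in one dimension: argmin_w' sum_i mu_i (x_i w' - y_i)^2
    return np.sum(mu * x * y) / np.sum(mu * x * x)

# Divergent case: only the transition from state 1 to state 2 is sampled.
w = 1.0
for t in range(20):
    w = projected_backup(w, [(0, 1)], np.array([1.0]))
print("only s1 -> s2 sampled:", w)              # grows without bound for gamma > 0.5

# Stabilized case: also sample a transition from state 2 back to state 1.
w = 1.0
for t in range(20):
    w = projected_backup(w, [(0, 1), (1, 0)], np.array([0.5, 0.5]))
print("both transitions sampled:", w)           # shrinks toward 0
```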

C. α-smoothed Q-iteration

In this section we show that the α-smoothed Bellman backup introduced in Section 6 is still a valid Q-iteration method, in that it is a contraction (for $0 < \alpha \le 1$) and thus converges to $Q^*$. We define the α-smoothed Bellman backup as:
\[ T^\alpha Q = \alpha T Q + (1 - \alpha) Q. \]

Theorem C.1 (Contraction rate of the α-smoothed Bellman backup). $T^\alpha$ is a $(1 - \alpha + \gamma\alpha)$-contraction:
\[ \|T^\alpha Q_1 - T^\alpha Q_2\|_\infty \le (1 - \alpha + \gamma\alpha)\|Q_1 - Q_2\|_\infty. \]

Proof. This statement follows from a straightforward application of the triangle inequality and the fact that $T$ is a $\gamma$-contraction:
\begin{align*}
\|T^\alpha Q_1 - T^\alpha Q_2\|_\infty
&= \|(\alpha T Q_1 + (1 - \alpha) Q_1) - (\alpha T Q_2 + (1 - \alpha) Q_2)\|_\infty \\
&= \|\alpha(T Q_1 - T Q_2) + (1 - \alpha)(Q_1 - Q_2)\|_\infty \\
&\le \alpha\|T Q_1 - T Q_2\|_\infty + (1 - \alpha)\|Q_1 - Q_2\|_\infty \\
&\le \alpha\gamma\|Q_1 - Q_2\|_\infty + (1 - \alpha)\|Q_1 - Q_2\|_\infty \\
&= (1 - \alpha + \alpha\gamma)\|Q_1 - Q_2\|_\infty.
\end{align*}

Figure 12. Results for the α-smoothed Bellman backup experiment. Normalized $\ell_\infty$ error to $Q^*$ and normalized returns plotted for different values of α and architectures. Values are averaged over all domains and 5 seeds. For large architectures, higher values of α result in faster convergence and higher asymptotic returns. However, for smaller architectures, low values of α slightly outperform higher values.
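As a quick numerical check of Theorem C.1, the sketch below applies the α-smoothed backup to random pairs of Q-functions on a small random MDP and verifies that the observed contraction factor never exceeds $1 - \alpha + \gamma\alpha$. The MDP size, α, and number of trials are arbitrary illustrative choices; this is not the experimental code from Section 6.

```python
import numpy as np

rng = np.random.RandomState(1)
S, A, gamma, alpha = 10, 3, 0.95, 0.3

P = rng.dirichlet(np.ones(S), size=(S, A))        # transition probabilities, shape (S, A, S)
R = rng.rand(S, A)

def bellman(Q):
    """Standard Bellman optimality backup T Q."""
    return R + gamma * P.dot(Q.max(axis=1))

def smoothed(Q):
    """Alpha-smoothed backup T^alpha Q = alpha * T Q + (1 - alpha) * Q."""
    return alpha * bellman(Q) + (1 - alpha) * Q

worst_ratio = 0.0
for _ in range(1000):
    Q1, Q2 = rng.randn(S, A), rng.randn(S, A)
    num = np.abs(smoothed(Q1) - smoothed(Q2)).max()
    den = np.abs(Q1 - Q2).max()
    worst_ratio = max(worst_ratio, num / den)

print("largest observed ratio:", worst_ratio)
print("theoretical modulus 1 - alpha + gamma * alpha:", 1 - alpha + gamma * alpha)
```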

D. Adversarial Feature Matching (AFM): Detailed Explanation and Practical Implementation

As described in Section 7.3, we devise a novel weighting scheme for the Bellman error objective based on an adversarial minimax game. The adversary computes weights $p_\phi(s, a)$ (representing the weighting distribution $\mu$) for the Bellman error $(Q_{\theta,w}(s, a) - y(s, a))^2$. Recalling from Section 7.3, the optimization problem is given by:
\[ \min_{\theta, w} \max_{\phi}\ \mathbb{E}_{p_\phi(s,a)}\big[(Q_{\theta,w}(s, a) - y(s, a))^2\big] \quad \text{s.t.} \quad \Big\| \mathbb{E}_{p_\phi(s,a)}[\Phi(s)] - \frac{\sum_i \Phi(s_i)}{N} \Big\| \le \varepsilon, \]
where $\Phi(s)$ are the state features learned by the Q-function approximator. $\Phi(s)$ is easy to extract from the multi-headed $Q(s, a)$ model ($Q(s, a) = w_a^T \Phi(s)$) typically used for discrete-action control; one choice is to let $\Phi(s)$ be the output of the penultimate layer of the Q-network. For continuous control tasks, however, we model $\Phi(s, a)$ (which is a function of the actions as well), as state-only features are unavailable unless separately modeled. This can also be interpreted as a feature matching constraint on the gradient of $Q(s, a)$ with respect to the last linear parameters $w_a$. A possible extension is to take the entire gradient as the features in the feature matching constraint, that is, $\nabla_{w,\theta} Q_{w,\theta}(s, a)$.

This choice of constraint is suitable and can be interpreted in two ways. First, an adversary constrained in this manner has enough power to exploit the Q-network at states which get aliased under the chosen function class, thereby promoting more separable feature learning and reducing some negative aspects of function approximation that can arise in Q-learning. This is also similar in motivation to Liu et al. (2018). Second, this feature constraint bears a similarity to the Maximum Mean Discrepancy (MMD) distance between two distributions $P(x)$ and $Q(x)$, which can be written as $\mathrm{MMD}^2(P, Q) := \|\mathbb{E}_P[\Phi] - \mathbb{E}_Q[\Phi]\|_{\mathcal{H}}$, where $\Phi$ is the canonical feature map, $\Phi: \mathbb{R}^n \to \mathcal{H}$ (from real space to the RKHS). In our context, this is analogous to optimizing a distance between the adversarial distribution $p_\phi(s, a)$ and the replay buffer distribution $p_{rb}(s, a)$ (as the average is a Monte-Carlo estimator of the expected $\Phi$ under the replay buffer distribution $p_{rb}(s, a)$). In light of these arguments, AFM, and other associated methods that take the properties of the function approximator (for example, $\Phi$ here) into account, can greatly reduce the bias incurred due to function approximation in the course of Q-learning/FQI, as depicted in Figure 1.

Solving the optimization. We solve this saddle-point problem using alternating dual gradient descent. We first solve the inner maximization problem, and then use its solution to take a step on the outer minimization problem. We first compute the Lagrangian for the inner maximization, $\mathcal{L}_{\mathrm{inner}}(\phi; \lambda, \theta)$, by introducing a dual variable $\lambda$:
\[ \mathcal{L}_{\mathrm{inner}}(\phi; \lambda, \theta) = -\mathbb{E}_{p_\phi(s,a)}\big[(Q_\theta(s, a) - y(s, a))^2\big] + \lambda\Big( \Big\|\mathbb{E}_{p_\phi(s,a)}[\Phi(s)] - \frac{\sum_i \Phi(s_i)}{N}\Big\| - \varepsilon \Big). \]
(Note that this Lagrangian is flipped in sign because we first convert the maximization problem into standard minimization form.) We solve the inner problem using dual gradient descent, then plug the approximate solutions obtained after gradient descent, $(p^*, \lambda^*)$, into the Lagrangian and solve the outer minimization over $\theta$. Note that while $\Phi$ depends on $\theta$ (as it is the feature layer of the Q-network), we do not backpropagate through $\Phi$ while solving the minimization. This improves the stability of Q-network training in practice and makes sure that the Q-function is only affected by FQI updates. In practice, we take up to 10 gradient steps on the inner problem for every gradient step on the outer problem. The algorithm is summarized in Algorithm 4. The results provided in the main paper and here do not rely on additional tricks such as the optimistic gradient (Daskalakis et al., 2018) or an exponential moving average of the parameters (Yazıcı et al., 2019), although our tabular experiments seemed to benefit somewhat from these tricks.

Algorithm 4 AFM with Exact-FQI
1: Initialize Q-value approximator $Q_{\theta,w}(s, a)$, projection distribution $\mu_\phi(s, a)$, threshold $\varepsilon$.
2: for step t in {1, . . . , N} do
3: Initialize Q-value approximator $Q_{\theta,w}(s, a)$.
4: Evaluate $Q_{\theta^t,w^t}(s, a)$ at all states.
5: Compute exact target values at all states: $y(s, a) = r(s, a) + \gamma\,\mathbb{E}_{s'}[V_{\theta^t}(s')]$.
6: Minimize the negative projection loss with respect to $\phi$, subject to the feature ($\Phi$) matching constraint computed exactly over all states and actions:
\[ \phi^{t+1} \leftarrow \arg\min_\phi\ -\mathbb{E}_{p_\phi}\big[(Q_{\theta,w}(s, a) - y(s, a))^2\big] \quad \text{s.t.} \quad \Big\|\mathbb{E}_\mu[\Phi(s, a)] - \frac{\sum \Phi(s, a)}{N}\Big\| \le \varepsilon \]
Maximize the dual loss with respect to $\lambda$:
\[ \lambda^{t+1} \leftarrow \arg\max_{\lambda \ge 0}\ \lambda\Big( \Big\|\mathbb{E}_\mu[\Phi(s, a)] - \frac{\sum \Phi(s, a)}{N}\Big\| - \varepsilon \Big) \]
7: Repeat Step 6 for K steps (K ∈ [1, 10]).
8: Minimize the projection loss under $p_\phi$ with respect to $\theta, w$:
\[ \theta^{t+1}, w^{t+1} \leftarrow \arg\min_{\theta, w}\ \mathbb{E}_{p_\phi}\big[(Q_{\theta,w}(s, a) - y(s, a))^2\big] \]
9: end for
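To make the alternating dual gradient descent of Step 6 concrete, here is a minimal PyTorch sketch of the inner maximization for the exact setting in which all state-action pairs are enumerated. The adversary is parameterized by a vector of logits, the Bellman errors and features are treated as fixed inputs (mirroring the choice not to backpropagate through $\Phi$), and all function names and hyperparameters are illustrative rather than those used in our experiments.

```python
import torch

def afm_inner_max(delta, phi_feats, eps=0.05, steps=10, lr=1e-2, lr_dual=1e-2):
    """Sketch of the inner maximization of the AFM game (Step 6 of Algorithm 4).

    delta:      Bellman errors (Q - y)^2 for every (s, a) pair, shape [M]
    phi_feats:  features Phi for every (s, a) pair, shape [M, d] (held fixed here)
    Returns the adversarial weights p_phi over the M state-action pairs.
    """
    delta = delta.detach()
    phi_feats = phi_feats.detach()
    phi_mean = phi_feats.mean(dim=0)                          # uniform average of features

    logits = torch.zeros(delta.shape[0], requires_grad=True)  # adversary parameters
    lam = torch.tensor(1.0)                                   # dual variable, kept >= 0
    opt = torch.optim.Adam([logits], lr=lr)

    for _ in range(steps):
        p = torch.softmax(logits, dim=0)
        constraint = torch.norm(p @ phi_feats - phi_mean) - eps
        # Lagrangian in minimization form: -E_p[delta] + lam * (||...|| - eps).
        loss = -(p * delta).sum() + lam * constraint
        opt.zero_grad()
        loss.backward()
        opt.step()
        # Dual ascent on lambda, projected onto lambda >= 0.
        with torch.no_grad():
            p = torch.softmax(logits, dim=0)
            viol = torch.norm(p @ phi_feats - phi_mean) - eps
            lam = torch.clamp(lam + lr_dual * viol, min=0.0)

    return torch.softmax(logits, dim=0).detach()

# Example usage with random stand-in values: 50 state-action pairs, 8-dim features.
weights = afm_inner_max(torch.rand(50), torch.randn(50, 8))
```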
Practical implementation with replay buffers. We incorporate this weighting/sampling distribution into Q-learning in the setting with replay buffers and state-action sampling. We evaluate the weighting version of our method, AFM, where we sample a large batch B of state-action pairs from the usual replay buffer used in Q-learning, but use importance weights to match $p_\phi(s, a)$ in expectation. Thus, we use a parametric function approximator to model $p_\phi(s, a)/p_{rb}(s, a)$, that is, the importance weights of the adversarial distribution with respect to the replay buffer distribution $p_{rb}(s, a)$. Mathematically, we estimate
\[ \mathbb{E}_{p_\phi(s,a)}[\delta(s, a)] := \mathbb{E}_{p_{rb}(s,a)}\Big[\frac{p_\phi(s, a)}{p_{rb}(s, a)}\, \delta(s, a)\Big], \quad \text{where } \delta(s, a) = (Q_{\theta,w}(s, a) - y(s, a))^2. \]
The latter expectation is then approximated using a finite set of samples. It has been noted in the literature that importance sampling (IS) suffers from high variance, especially when the number of samples is small. Hence, we use the self-normalized importance sampling estimator, which normalizes the importance weights over the set of samples. That is, letting $w_{p/p_{rb}} = p_\phi(s, a)/p_{rb}(s, a)$, instead of using $w_{p/p_{rb}}(x)$ as the importance weights, we use $\tilde{w}_{p/p_{rb}}(x) = w_{p/p_{rb}}(x) / \sum_{y \in B} w_{p/p_{rb}}(y)$ (where x and y denote state-action tuples, written concisely for visual clarity). We also regularize the second-order Rényi divergence between $p_{rb}$ and $p_\phi$ for stability. Mathematically, it can be shown that the resulting objective is a lower bound on the true expectation of $\delta$ under $p_\phi$, which is being estimated using importance sampling. This result has also been shown by Metelli et al. (2018) (Theorem 4.1), where the authors use this lower bound for policy optimization via importance sampling. We state the theorem below for completeness.

Theorem D.1 (Metelli et al., 2018). Let P and Q be two probability measures on the measurable space $(X, \mathcal{F})$ such that $P \ll Q$ and $d_2(P\|Q) < +\infty$. Let $x_1, x_2, \dots, x_N$ be i.i.d. random variables sampled from Q, and let $f: X \to \mathbb{R}$ be a bounded function. Then, for any $0 < \delta \le 1$ and $N > 0$, with probability at least $1 - \delta$ it holds that:
\[ \mathbb{E}_{x \sim P}[f(x)] \ge \frac{1}{N}\sum_{i=1}^N w_{P/Q}(x_i) f(x_i) - \|f\|_\infty \sqrt{\frac{(1 - \delta)\, d_2(P\|Q)}{N\delta}}, \]
where $d_2(P\|Q) \propto \mathbb{E}_Q\big[(P(x)/Q(x))^2\big]$ is the exponentiated second-order Rényi divergence between P and Q.
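The sketch below shows how the self-normalized importance weights and the sample-based Rényi penalty suggested by Theorem D.1 can be computed for a single batch. It is a NumPy illustration of the estimator described above, with stand-in values for the learned ratio $f_\phi$ and the Bellman errors; the constant C standing in for $\|f\|_\infty$ and the confidence level are illustrative choices.

```python
import numpy as np

def afm_batch_objective(f_phi, bellman_err, C=1.0, conf_delta=0.1):
    """Penalized estimate of E_{p_phi}[delta] from a replay-buffer batch.

    f_phi:       learned ratios p_phi(s,a) / p_rb(s,a) for the batch, shape [N]
    bellman_err: delta(s,a) = (Q(s,a) - y(s,a))^2 for the batch, shape [N]
    C:           constant standing in for ||f||_inf in Theorem D.1
    conf_delta:  the delta in the "with probability at least 1 - delta" statement
    """
    N = f_phi.shape[0]
    # Self-normalized importance weights: w_tilde(x) = w(x) / sum_y w(y).
    # (The plain importance-sampling average would be (f_phi * bellman_err).mean().)
    w_tilde = f_phi / f_phi.sum()
    weighted_error = (w_tilde * bellman_err).sum()

    # Sample-based estimate of the exponentiated second-order Renyi divergence,
    # d2(p_phi || p_rb) estimated by the mean of f_phi^2, used as a variance penalty.
    d2_hat = (f_phi ** 2).mean()
    penalty = C * np.sqrt((1 - conf_delta) * d2_hat / (N * conf_delta))

    return weighted_error - penalty

# Stand-in batch: ratios near 1 and random Bellman errors.
rng = np.random.RandomState(0)
print(afm_batch_objective(np.exp(0.1 * rng.randn(128)), rng.rand(128)))
```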

Hence, our objective for the inner loop, $\max_\phi \mathbb{E}_{p_\phi(s,a)}[\delta(s, a)] = \max_\phi \mathbb{E}_{p_{rb}}\big[\tfrac{p_\phi(s,a)}{p_{rb}(s,a)}\, \delta(s, a)\big]$, is now computed from samples, with an additional Rényi regularization term. Since we model this ratio, $f_\phi(s, a) := p_\phi(s, a)/p_{rb}(s, a)$, directly through our parametric model, we can easily compute an estimator for the Rényi divergence term. The overall lower-bound inner maximization problem is:
\[ \max_\phi\ \frac{1}{N}\sum_{(s,a) \sim p_{rb}} f_\phi(s, a)\,(Q_{\theta,w}(s, a) - y(s, a))^2 \;-\; C\sqrt{\frac{(1 - \delta)\big(\sum f_\phi(s, a)^2 / N\big)}{N\delta}} \]
\[ \text{s.t.} \quad \Big\| \frac{\sum_{(s,a) \in B} f_\phi(s, a)\Phi(s)}{N} - \frac{\sum_{(s,a) \in B} \Phi(s)}{N} \Big\| \le \varepsilon. \]

We found that this Rényi penalty helped stabilize training. In practice, we model the importance weights $f_\phi(s, a)$ with a parametric model whose architecture is identical to that of the Q-network. We use parameter clipping for $f_\phi(s, a)$, where the parameters are clipped to [−0.1, 0.1], analogous to Wasserstein GANs (Arjovsky et al., 2017). We also found that self-normalization during importance sampling has a large practical benefit. Note that since the true $\ell_\infty$ norm of the Bellman error is not known, to compute C in the Rényi divergence term we either replace it with a constant or compute a stochastic approximation to the $\ell_\infty$ norm over the current batch. We found the former to be more stable, and hence used it in all our experiments. The coefficient of the Rényi divergence penalty is tuned uniformly over [0.0, 0.25]. The learning rate for the adversary was 1e-4 for the tabular environments and 5e-4 for TD3. The batch size for our algorithm was 128 for the tabular environments and 500 for TD3/SAC; a larger batch size ensures smoothness in the minimax optimization problem. We also found that, instead of a single Lagrange multiplier for the feature matching constraint, using d Lagrange multipliers that constrain each individual dimension of the features $\Phi \in \mathbb{R}^d$ helps considerably. This ensures that the hyperparameters remain the same across different architectures, regardless of the dimension of the penultimate layer of the Q-network; the algorithm in this case is exactly the same as before, with a vector-valued dual variable $\lambda$. We used the TD3 and SAC implementations from rlkit (https://github.com/vitchyr/rlkit/tree/master/rlkit).

E. Function Approximation Analysis on MuJoCo Tasks

As discussed in Section 4, we validate our findings on the effect of function approximation on 3 MuJoCo tasks from OpenAI Gym, using the SAC algorithm from the authors' implementation (Haarnoja et al., 2018). We observe that bigger networks learn faster and better in general.

(a) HalfCheetah-v2 (b) Hopper-v2 (c) Ant-v2

Figure 13. Performance of different architecture sizes on 3 benchmark MuJoCo tasks from the OpenAI Gym suite with the SAC algorithm. Values are averaged over 3 seeds. Bigger networks learn faster and achieve higher returns. Each epoch on the x-axis corresponds to 1000 environment steps.

F. Additional Plots

Figure 14. Normalized returns with Sampled-FQI, varying over architectures and number of on-policy samples.

Figure 15. Performance on HalfCheetah and Hopper trained via TD3 with a replay buffer of size 2e4, with an increasing number of gradient steps taken per environment step (N) on the critic and the actor. Note the clear decay in the agent's performance as the number of gradient steps grows, which supports our claim that overfitting is present in the Q-functions. Each iteration on the x-axis corresponds to taking 5000 steps in the environment.