Learn to Interpret Atari Agents
Zhao Yang, Song Bai, Li Zhang, Philip H.S. Torr
University of Oxford, Oxford, UK

Abstract

Deep Reinforcement Learning (DeepRL) agents surpass human-level performance in a multitude of tasks. However, the direct mapping from states to actions makes it hard to interpret the rationale behind the decision making of agents. In contrast to previous a-posteriori methods for visualizing DeepRL policies, we propose an end-to-end trainable framework based on Rainbow, a representative Deep Q-Network (DQN) agent. Our method automatically learns important regions in the input domain, which enables characterization of the decision making and interpretation of non-intuitive behaviors. Hence we name it Region Sensitive Rainbow (RS-Rainbow). RS-Rainbow uses a simple yet effective mechanism to incorporate visualization ability into the learning model, not only improving model interpretability but also leading to improved performance. Extensive experiments on the challenging Atari 2600 platform demonstrate the superiority of RS-Rainbow. In particular, our agent achieves state of the art using just 25% of the training frames. Demonstrations and code are available at https://github.com/yz93/Learn-to-Interpret-Atari-Agents.

1. Introduction

Understanding deep neural networks (DNNs) has been a long-standing goal of the machine learning community. Many efforts exploit the class-discriminative nature of CNN-based classification models (Krizhevsky et al., 2012) to produce human-interpretable visual explanations (Simonyan et al., 2014; Zeiler & Fergus, 2014; Springenberg et al., 2015; Shrikumar et al., 2017; Fong & Vedaldi, 2017; Dabkowski & Gal, 2017).

With the advent of Deep Reinforcement Learning (DeepRL) (Mnih et al., 2013; 2015), there is an increasing interest in understanding DeepRL models. Combining deep learning techniques with reinforcement learning algorithms, DeepRL leverages the strong representation capacity and approximation power of DNNs for return estimation and policy optimization (Sutton & Barto, 1998). In modern applications where a state is defined by high-dimensional input, e.g., Atari 2600 (Bellemare et al., 2013), the task of DeepRL divides into two essential sub-tasks: generating (low-dimensional) representations of states, and learning a policy from those representations.

As DeepRL does not optimize for class-discriminative objectives, previous interpretation methods developed for classification models are not readily applicable to DeepRL models. The approximation of the optimal state value or action distribution not only operates in a black-box manner but also incorporates temporal information and environment dynamics. This black-box and sequential nature makes DeepRL models inherently difficult to understand.

Although interpreting DeepRL models is challenging, some efforts have been devoted in recent years to studying the behaviors of these complex models. Most existing interpretation methods (Mnih et al., 2015; Wang et al., 2016; Zahavy et al., 2016; Greydanus et al., 2018) are a-posteriori, explaining a model only after it has been trained. For instance, t-SNE-based methods (Mnih et al., 2015; Zahavy et al., 2016) rely on game-specific human intuitions and expert knowledge in RL, while vision-inspired methods (Wang et al., 2016) adopt traditional saliency techniques. The representative work of Greydanus et al. (2018) takes a data-driven approach, illustrating policy responses to a fixed input masking function at the cost of hundreds of forward passes per frame. As a common limitation, these a-posteriori methods cannot use the deduced knowledge to improve training.

In this work, we approach the problem from a learning perspective and propose Region Sensitive Rainbow (RS-Rainbow) to improve both the interpretability and the performance of a DeepRL model. To this end, RS-Rainbow leverages a region-sensitive module to estimate the importance of different sub-regions of the screen, which is used to guide policy learning in end-to-end training. Specifically, a sub-region containing a distinctive pattern or objects useful for policy learning is assigned high importance, and a combination of important sub-regions replaces the original unweighted screen as the representation of a state. Throughout an episode, the focus points of a pattern detector change as a result of game dynamics and lead to policy variations. Therefore, each pattern detector illustrates a distinct line of reasoning by the agent. With the region-sensitive module, we produce intuitive visualizations (see Fig. 1) in a single backward pass, without human intervention or repetitive, costly passes through the network.

Figure 1. Visualizing the Atari games (a) beam rider, (b) enduro, (c) frostbite, (d) ms pacman, (e) pong, and (f) space invaders. The left frame is the original game frame; the middle and right frames each show a gaze (defined in Sec. 4) of RS-Rainbow during inference. The agent learns multiple salient regions containing functional objects, annotated with red circles for clarity.
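To make the region-sensitive mechanism described above concrete, the sketch below shows one simple way such a weighting could be implemented. It is a minimal illustration under our own assumptions: the 1x1 scoring layer, the tensor shapes, and the choice to weight encoder features rather than raw frames are illustrative and do not reproduce the exact architecture detailed in Sec. 3.2 (Fig. 2).

import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionSensitiveModule(nn.Module):
    # Illustrative sketch: score every spatial sub-region of an encoded frame
    # and re-weight the features before policy learning. Layer choices and
    # shapes are assumptions for exposition, not the paper's exact design.
    def __init__(self, channels):
        super().__init__()
        # A 1x1 convolution acts as a simple "pattern detector", mapping each
        # spatial location to an unnormalized importance score.
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, features):
        # features: (batch, channels, height, width) from the image encoder.
        b, c, h, w = features.shape
        logits = self.score(features).view(b, 1, h * w)
        # Softmax over spatial positions yields an importance map, which can
        # be rendered on the input frame for visualization (the "gaze").
        weights = F.softmax(logits, dim=-1).view(b, 1, h, w)
        # The importance-weighted features replace the unweighted screen
        # representation as the state passed to the policy layers.
        return features * weights, weights

# Toy usage with a fake 64-channel, 7x7 encoder output for a single frame.
encoder_out = torch.randn(1, 64, 7, 7)
state_repr, importance = RegionSensitiveModule(channels=64)(encoder_out)
print(state_repr.shape, importance.shape)  # (1, 64, 7, 7) and (1, 1, 7, 7)

Replicating the scoring head would give several pattern detectors, each producing its own importance map; these correspond to the multiple gazes visualized in Fig. 1.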
The primary contribution of this work is to provide, to the best of our knowledge, the first learning-based approach for automatically interpreting DeepRL models. It requires no extra supervision and is end-to-end trainable. Moreover, it possesses three advantages:

1) In contrast to previous methods (Zahavy et al., 2016; Greydanus et al., 2018), RS-Rainbow illustrates the actual rationale used during inference for decision making, in an intuitive manner and without human intervention.

2) Beyond supporting innate interpretation, quantitative experiments on the Atari 2600 platform (Bellemare et al., 2013) demonstrate that RS-Rainbow effectively improves policy learning. In comparison, previous a-posteriori methods are unable to bring performance enhancements.

3) The region-sensitive module, the core component of RS-Rainbow, is a simple and efficient plug-in. It can potentially be applied to many DQN-based models for performance gains and a built-in visualization advantage.

The rest of the paper is organized as follows. We provide a brief overview of background knowledge in Sec. 2 and present the details of the proposed RS-Rainbow in Sec. 3. Sec. 4 demonstrates the interpretability of RS-Rainbow, and Sec. 5 gives the quantitative evaluation of RS-Rainbow on Atari games. Conclusions are given in Sec. 6.

2. Background

2.1. DQN and Rainbow

As an RL algorithm, DQN seeks a policy that maximizes the long-term return of an agent acting in an environment, with a convergence guarantee provided by the Bellman equation. DQN combines deep learning with the traditional off-policy, value-based Q-learning algorithm by employing a DNN as the value approximation function and mean-squared error minimization as an alternative to temporal-difference updating (Sutton, 1988; Tesauro, 1995). A target network and experience replay are two key engineering feats that stabilize training. In DQN, the Q value refers to the expected discounted return for executing a particular action in a given state and following the current policy thereafter. Given optimal Q values, the optimal policy follows by taking the action with the highest Q value.
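For reference, the sketch below spells out the one-step update just described, assuming generic online and target networks and a transition batch sampled from a replay buffer (all names here are ours, not from the paper). Rainbow replaces this plain objective with the multi-step, distributional variant discussed next.

import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    # One-step DQN update: regress the online network toward a bootstrapped
    # target computed by a periodically synchronized target network.
    states, actions, rewards, next_states, dones = batch

    # Q(s, a) for the actions that were actually taken.
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target: r + gamma * max_a' Q_target(s', a'), cut off at episode ends.
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q

    # Mean-squared error serves as the surrogate for the TD update.
    return F.mse_loss(q_values, targets)

The frozen target computation (the torch.no_grad() branch) and drawing the batch from a replay buffer correspond to the two stabilizing techniques mentioned above; acting greedily with respect to the online network recovers the learned policy.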
Rainbow (Hessel et al., 2018) incorporates many extensions over the original DQN (Mnih et al., 2013; 2015), each of which enhances a different aspect of the model. Such extensions include double DQN (van Hasselt et al., 2016), dueling DQN (Wang et al., 2016), prioritized experience replay (Schaul et al., 2016), multi-step learning (Sutton, 1988), distributional RL (Bellemare et al., 2017), and noisy nets (Fortunato et al., 2018). Double DQN addresses the over-estimation of Q in the target function. Dueling DQN decomposes the estimation of Q into separate estimates of a state value and an action advantage. Prioritized experience replay samples training data with higher learning potential at higher frequency. Multi-step learning looks multiple steps ahead by replacing one-step rewards and states with their multi-step counterparts. Noisy nets inject adaptable noise into linear-layer outputs to introduce state-dependent exploration. In distributional RL, Q is modeled as a random variable whose distribution is learned over a fixed support set of discrete values; the resulting Kullback-Leibler divergence loss enjoys a convergence guarantee because the return distributions also satisfy a Bellman equation.

Figure 2. The architecture of the proposed RS-Rainbow: input images pass through the image encoder and the region-sensitive module before the policy layers, which comprise separate value and advantage streams that produce the action outputs.

2.2. Understanding DeepRL

Interpreting RL systems traditionally involves language generation via first-order logic (Dodson et al., 2011; Elizalde et al., 2008; Khan et al., 2009; Hayes & Shah, 2017). These approaches rely on small state spaces and high-level state variables with interpretable semantics.

3. Proposed Approach

In this section, we introduce our motivation in Sec. 3.1, then describe the architecture of RS-Rainbow in Sec. 3.2, and finally present its capability for visualization in Sec. 3.3.

3.1. Motivation

There are three main considerations in our motivation for