State of the Art Control of Atari Games Using Shallow Reinforcement Learning
Yitao Liang†, Marlos C. Machado‡, Erik Talvitie†, and Michael Bowling‡
†Franklin & Marshall College, Lancaster, PA, USA ({yliang, erik.talvitie}@fandm.edu)
‡University of Alberta, Edmonton, AB, Canada ({machado, mbowling}@ualberta.ca)

Appears in: Proceedings of the 15th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2016), J. Thangarajah, K. Tuyls, C. Jonker, S. Marsella (eds.), May 9–13, 2016, Singapore. Copyright © 2016, International Foundation for Autonomous Agents and Multiagent Systems (www.ifaamas.org). All rights reserved.

ABSTRACT

The recently introduced Deep Q-Networks (DQN) algorithm has gained attention as one of the first successful combinations of deep neural networks and reinforcement learning. Its promise was demonstrated in the Arcade Learning Environment (ALE), a challenging framework composed of dozens of Atari 2600 games used to evaluate general competency in AI. It achieved dramatically better results than earlier approaches, showing that its ability to learn good representations is quite robust and general. This paper attempts to understand the principles that underlie DQN's impressive performance and to better contextualize its success. We systematically evaluate the importance of key representational biases encoded by DQN's network by proposing simple linear representations that make use of these concepts. Incorporating these characteristics, we obtain a computationally practical feature set that achieves performance competitive with DQN in the ALE. Besides offering insight into the strengths and weaknesses of DQN, we provide a generic representation for the ALE, significantly reducing the burden of learning a representation for each game. Moreover, we also provide a simple, reproducible benchmark for the sake of comparison to future work in the ALE.

Keywords
Reinforcement Learning, Function Approximation, DQN, Representation Learning, Arcade Learning Environment

1. INTRODUCTION

In the reinforcement learning (RL) problem an agent autonomously learns a behavior policy from experience in order to maximize a provided reward signal. Most successful RL approaches have relied upon the engineering of problem-specific state representations, undermining the agent's autonomy and reducing its flexibility. The recent Deep Q-Network (DQN) algorithm [20] aims to tackle this problem, presenting one of the first successful combinations of RL and deep convolutional neural networks (CNNs) [13, 14], which are proving to be a powerful approach to representation learning in many areas. DQN is based upon the well-known Q-learning algorithm [31] and uses a CNN to simultaneously learn a problem-specific representation and estimate a value function.

Games have always been an important testbed for AI, frequently being used to demonstrate major contributions to the field [5, 6, 8, 23, 28]. DQN follows this tradition, demonstrating its success by achieving human-level performance in the majority of games within the Arcade Learning Environment (ALE) [2]. The ALE is a platform composed of dozens of qualitatively diverse Atari 2600 games. As pictured in Figure 1, the games in this suite include first-person perspective shooting games (e.g. Battle Zone), platforming puzzle games (e.g. Montezuma's Revenge), sports games (e.g. Ice Hockey), and many other genres. Because of this diversity, successful approaches in the ALE necessarily exhibit a degree of robustness and generality. Further, because it is based on problems designed by humans for humans, the ALE inherently encodes some of the biases that allow humans to successfully navigate the world. This makes it a potential stepping-stone to other, more complex decision-making problems, especially those with visual input.

[Figure 1: Examples from the ALE (left to right: Battle Zone, Montezuma's Revenge, Ice Hockey).]

DQN's success in the ALE rightfully attracted a great deal of attention. For the first time an artificial agent achieved performance comparable to a human player in this challenging domain, achieving by far the best results at the time and demonstrating generality across many diverse problems. It also demonstrated a successful large-scale integration of deep neural networks and RL, surely an important step toward more flexible agents. That said, problematic aspects of DQN's evaluation make it difficult to fully interpret the results. As will be discussed in more detail in Section 6, the DQN experiments exploit non-standard game-specific prior information and also report only one independent trial per game, making it difficult to reproduce these results or to make principled comparisons to other methods. Independent re-evaluations have already reported different results due to the variance in the single-trial performance of DQN (e.g. [12]). Furthermore, the comparisons that Mnih et al. did present were to benchmark results using far less training data. Without an adequate baseline for what is achievable using simpler techniques, it is difficult to evaluate the cost-benefit ratios of more complex methods like DQN.

In addition to these methodological concerns, the evaluation of a complicated method such as DQN often leaves open the question of which of its properties were most important to its success. While it is tempting to assume that the neural network must be discovering insightful tailored representations for each game, there is also a considerable amount of domain knowledge embedded in the very structure of the network. We can identify three key structural biases in DQN's representation. First, CNNs provide a form of spatial invariance not exploited in earlier work using linear function approximation. DQN also made use of multiple frames of input, allowing for the representation of short-range non-Markovian value functions. Finally, small-sized convolutions are quite well suited to detecting small patterns of pixels that commonly represent objects in early video game graphics. These biases were not fully explored in earlier work in the ALE, so a natural question is whether non-linear deep network representations are key to strong performance in the ALE or whether the general principles implicit in DQN's network architecture might be captured more simply.

The primary goal of this paper is to systematically investigate the sources of DQN's success in the ALE. Obtaining a deeper understanding of the core principles influencing DQN's performance and placing its impressive results in perspective should aid practitioners aiming to apply or extend it (whether in the ALE or in other domains). It should also reveal representational issues that are key to success in the ALE itself, potentially inspiring new directions of research. We perform our investigation by elaborating upon a simple linear representation that was one of DQN's main comparison points, progressively incorporating the representational biases identified above and evaluating the impact of each one. This process ultimately yields a fixed, generic feature representation able to obtain performance competitive with DQN in the ALE, suggesting that the general form of the representations learned by DQN may in many cases be more important than the specific features it learns. Further, this feature set offers a simple and computationally practical alternative to DQN as a platform for future research, especially when the focus is not on representation learning. Finally, we are able to provide an alternative benchmark that is more methodologically sound, easing reproducibility and comparison to future work.

2. BACKGROUND

2.1 Reinforcement Learning

We formalize the problem as a Markov decision process with states s ∈ S, actions a ∈ A, and rewards r_{t+1} = R(s_t, a_t, s_{t+1}). The value of a state s under a policy π is the expected discounted return obtained by following π from s:

V^π(s) ≐ E_π[ ∑_{k=0}^{∞} γ^k r_{t+k+1} | s_t = s ]  for all s ∈ S,

where γ ∈ [0, 1) is known as the discount factor. As a critical step toward improving a given policy π, it is common for reinforcement learning algorithms to learn a state-action value function, denoted:

Q^π(s, a) ≐ E_π[ R(s_t, a_t, s_{t+1}) + γ V^π(s_{t+1}) | s_t = s, a_t = a ].

However, in large problems it may be infeasible to learn a value for each state-action pair. To tackle this issue agents often learn an approximate value function: Q^π(s, a; θ) ≈ Q^π(s, a). A common approach uses linear function approximation (LFA), where Q^π(s, a; θ) = θ^⊤ φ(s, a), in which θ denotes the vector of weights and φ(s, a) denotes a static feature representation of the state s when taking action a. However, this can also be done through non-linear function approximation methods, including neural networks.
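To make the linear case concrete, the sketch below computes Q(s, a; θ) = θ^⊤ φ(s, a) in Python with NumPy. The feature-vector size and the per-action block construction of φ(s, a) from a fixed binary state representation are illustrative assumptions for this sketch, not the particular feature sets studied in the paper.

```python
import numpy as np

NUM_STATE_FEATURES = 1024   # illustrative size of a fixed, binary state representation
NUM_ACTIONS = 18            # the full Atari 2600 action set

def phi(state_features, action):
    """phi(s, a): place the state's fixed feature vector in the block belonging to
    the chosen action and leave every other block zero (one common construction)."""
    out = np.zeros(NUM_STATE_FEATURES * NUM_ACTIONS)
    start = action * NUM_STATE_FEATURES
    out[start:start + NUM_STATE_FEATURES] = state_features
    return out

def q_value(theta, state_features, action):
    """Q(s, a; theta) = theta^T phi(s, a): a single dot product."""
    return float(np.dot(theta, phi(state_features, action)))

# Usage with placeholder data: a sparse binary state representation and zero weights.
theta = np.zeros(NUM_STATE_FEATURES * NUM_ACTIONS)
s_features = (np.random.rand(NUM_STATE_FEATURES) < 0.05).astype(float)
q_values = [q_value(theta, s_features, a) for a in range(NUM_ACTIONS)]
greedy_action = int(np.argmax(q_values))
```

With this block layout a single weight vector encodes a separate linear value function per action while reusing the same state features, a common convention for linear approximation over a discrete action set.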
One of the most popular reinforcement learning algorithms is Sarsa(λ) [22]. It consists of learning an approximate action-value function while following a continually improving policy π. As states are visited and rewards are observed, the action-value function is updated and, consequently, the policy is improved, since each update improves the estimate of the agent's expected return from state s, taking action a, and then following policy π afterwards, i.e. Q^π(s, a). The Sarsa(λ) update equations, when using function approximation, are:

δ_t = r_{t+1} + γ Q(s_{t+1}, a_{t+1}; θ_t) − Q(s_t, a_t; θ_t),
e_t = γλ e_{t−1} + ∇_θ Q(s_t, a_t; θ_t),
θ_{t+1} = θ_t + α δ_t e_t,

where α denotes the step-size, the elements of e_t are known as the eligibility traces, and δ_t is the temporal-difference error. Theoretical results suggest that, as an on-policy method, Sarsa(λ) may be more stable in the linear function approximation case than off-policy methods such as Q-learning, which are known to risk divergence [1, 10, 18]. These theoretical insights are confirmed in practice; Sarsa(λ) seems to be far less likely to diverge in the ALE than Q-learning and other off-policy methods [7].
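These updates translate directly into code. Below is a minimal sketch of a single Sarsa(λ) step with linear function approximation, where ∇_θ Q(s_t, a_t; θ) reduces to φ(s_t, a_t); the function name and the accumulating-trace form follow the equations above and are illustrative rather than the paper's implementation.

```python
import numpy as np

def sarsa_lambda_step(theta, e, phi_t, phi_t1, reward, alpha, gamma, lam, terminal=False):
    """One Sarsa(lambda) update with linear function approximation.

    theta  : weight vector (updated in place)
    e      : eligibility-trace vector (updated in place)
    phi_t  : features phi(s_t, a_t);  phi_t1 : features phi(s_{t+1}, a_{t+1})
    """
    q_t = np.dot(theta, phi_t)
    q_t1 = 0.0 if terminal else np.dot(theta, phi_t1)
    delta = reward + gamma * q_t1 - q_t   # temporal-difference error delta_t
    e *= gamma * lam                      # decay the existing traces
    e += phi_t                            # grad of Q(s_t, a_t; theta) is phi_t in the linear case
    theta += alpha * delta * e            # theta_{t+1} = theta_t + alpha * delta_t * e_t
    return delta

# Usage with placeholder data:
theta = np.zeros(4)
e = np.zeros(4)
sarsa_lambda_step(theta, e,
                  phi_t=np.array([1.0, 0.0, 1.0, 0.0]),
                  phi_t1=np.array([0.0, 1.0, 0.0, 1.0]),
                  reward=1.0, alpha=0.1, gamma=0.99, lam=0.9)
```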
lutions are quite well suited to detecting small patterns of One of the most popular reinforcement learning algorithms pixels that commonly represent objects in early video game is Sarsa(λ) [22]. It consists of learning an approximate graphics. These biases were not fully explored in earlier action-value function while following a continually improv- work in the ALE, so a natural question is whether non-linear ing policy π. As the states are visited, and rewards are deep network representations are key to strong performance observed, the action-value function is updated and conse- in ALE or whether the general principles implicit in DQN's quently the policy is improved since each update improves network architecture might be captured more simply. the estimative of the agent's expected return from state s, The primary goal of this paper is to systematically in- taking action a, and then following policy π afterwards, i.e. vestigate the sources of DQN's success in the ALE. Obtain- Qπ(s; a). The Sarsa(λ) update equations, when using func- ing a deeper understanding of the core principles influencing tion approximation, are: DQN's performance and placing its impressive results in per- spective should aid practitioners aiming to apply or extend δt = rt+1 + γQ(st+1; at+1; θ~t) − Q(st; at; θ~t) it (whether in the ALE or in other domains). It should also ~ reveal representational issues that are key to success in the ~et = γλ~et−1 + rQ(st; at; θt) ALE itself, potentially inspiring new directions of research. θ~t+1 = θ~t + αδt~et; We perform our investigation by elaborating upon a sim- ple linear representation that was one of DQN's main com- where α denotes the step-size, the elements of ~et are known parison points, progressively incorporating the representa- as the eligibility traces, and δt is the temporal difference er- tional biases identified above and evaluating the impact of ror. Theoretical results suggest that, as an on-policy method, each one. This process ultimately yields a fixed, generic fea- Sarsa(λ) may be more stable in the linear function approx- ture representation able to obtain performance competitive imation case than off-policy methods such as Q-learning, to DQN in the ALE, suggesting that the general form of which are known to risk divergence [1, 10, 18]. These theo- the representations learned by DQN may in many cases be retical insights are confirmed in practice; Sarsa(λ) seems to more important than the specific features it learns. Further, be far less likely to diverge in the ALE than Q-learning and this feature set offers a simple and computationally prac- other off-policy methods [7]. tical alternative to DQN as a platform for future research, especially when the focus is not on representation learning. 2.2 Arcade Learning Environment Finally, we are able provide an alternative benchmark that In the Arcade Learning Environment agents have access is more methodologically sound, easing reproducibility and only to sensory information (160 pixels wide by 210 pixels comparison to future work.