SCC: an Efficient Deep Reinforcement Learning Agent Mastering the Game of StarCraft II

Xiangjun Wang * 1 Junxiao Song * 1 Penghui Qi * 1 Peng Peng 1 Zhenkun Tang 1 Wei Zhang 1 Weimin Li 1 Xiongjun Pi 1 Jujie He 1 Chao Gao 1 Haitao Long 1 Quan Yuan 1

Abstract

AlphaStar, the AI that reaches GrandMaster level in StarCraft II, is a remarkable milestone demonstrating what deep reinforcement learning can achieve in complex Real-Time Strategy (RTS) games. However, the complexities of the game, algorithms and systems, and especially the tremendous amount of computation needed, are big obstacles for the community to conduct further research in this direction. We propose a deep reinforcement learning agent, StarCraft Commander (SCC). With an order of magnitude less computation, it demonstrates top human performance, defeating GrandMaster players in test matches and top professional players in a live event. Moreover, it shows strong robustness to various human strategies and discovers novel strategies unseen in human play. In this paper, we share the key insights and optimizations on efficient imitation learning and reinforcement learning for the StarCraft II full game.

*Equal contribution. 1 inspir.ai, Beijing, China. Correspondence to: Xiangjun Wang. Proceedings of the 38th International Conference on Machine Learning, PMLR 139, 2021. Copyright 2021 by the author(s).

1. Introduction

Games as research platforms have fueled a lot of recent advances in reinforcement learning research. The successes of Atari (Mnih et al., 2013), AlphaGo (Silver et al., 2016), OpenAI Five (Berner et al., 2019) and AlphaStar (Vinyals et al., 2019b) have demonstrated the remarkable results deep reinforcement learning can achieve in various game environments.

As game complexity increases, those advances come with extremely large computational overhead. For example, training OpenAI Five to professional level utilized thousands of GPUs over multiple months (Berner et al., 2019), and AlphaStar also trained on hundreds of TPUs for months (Vinyals et al., 2019b). StarCraft, one of the most popular and complex Real-Time Strategy (RTS) games, is considered one of the grand challenges for reinforcement learning. The reinforcement learning algorithms need to make real-time decisions from combinatorial action spaces, under partially observable information, plan over thousands of decisions, and deal with a large space of cyclic and counter strategies. Competing with human players is especially challenging because humans excel at reacting to game play and exploiting opponents' weaknesses.

In this paper, we propose StarCraft Commander (SCC). Similar to AlphaStar, it comprises two training stages, starting with imitation learning, followed by league-style reinforcement learning. We describe what it takes to train a reinforcement learning agent to play at top human performance with constrained compute resources, as well as the analysis and key insights of model training and model behaviors.

First, we conduct extensive neural network architecture experiments to squeeze out performance gains while reducing the memory footprint. For example, reducing the input minimap size from 128 × 128 to 64 × 64 cuts the sample data size almost by half with almost identical performance in the supervised learning stage. We also observed additional performance improvements with various techniques such as the group transformer, attention-based pooling, and conditioned concat-attention.

Second, we evaluated the effect of data size and quality for imitation learning. To our surprise, we were able to get most of the performance using only a small number of replays (4,638) compared to the full dataset (105,034 replays). The additional performance gain of the large dataset only comes with large batch size. The best result is obtained by training on the large dataset with a large batch size and fine-tuning on a small dataset of high-quality replays. In the end, the supervised learning model beats the built-in elite bot with a 97% win rate consistently.

Third, during the reinforcement learning stage, due to the game-theoretic design of StarCraft, there exist vast spaces of strategies and cyclic counter strategies. It is crucial to have both strength and diversity so that agents are robust and invulnerable to various counter strategies, which are also the main drivers behind the need for large computational resources. We propose agent branching for efficient training of main agents and exploiters. Even though league training was restricted to a single map and race, the agents exhibit strong generalization when playing against other races and on other maps, including unseen ones.

Lastly, SCC was evaluated in test matches with players at different levels. We also held a live match event against professional players. SCC won all the matches against players from GrandMaster to top professionals. According to the feedback from those players, SCC not only learned to play in a way similar to how humans play, but also discovered new strategies that are rare among human games.

2. Related Work

In this section, we briefly review early work on StarCraft AI and describe the AlphaStar algorithms.

StarCraft is a popular real-time strategy game involving strategy planning, balance of economy and micromanagement, and game-theoretic challenges. Those combined challenges make StarCraft an appealing platform for AI research. StarCraft: Brood War has had an active competitive AI research community since 2010 (Ontanón et al., 2013; Weber, 2010), where most bots are built with heuristic rules together with search methods (Churchill & Buro, 2013; Churchill et al., 2017). There has been some work using reinforcement learning for mini-games and micromanagement (Peng et al., 2017; Vinyals et al., 2017; Zambaldi et al., 2018; Foerster et al., 2017; Usunier et al., 2016). Most recently, reinforcement learning was used to play the full game, combined with hand-crafted rules (Sun et al., 2018; Lee et al., 2018; Pang et al., 2019). Even though some of the bots successfully beat the game's built-in AI (Sun et al., 2018; Pang et al., 2019), none of the early work reached competitive human level.

AlphaStar is the first end-to-end learning algorithm for the StarCraft II full game that reaches GrandMaster level. It uses imitation learning to learn an initial policy from human replay data, which not only provides a strong initialization for reinforcement learning but, more importantly, learns diverse sets of human strategies that are extremely hard for reinforcement learning to discover from scratch. It also uses a statistic z to encode build orders and built units for guiding play strategy. In addition to self-play, league training is adopted for multi-agent reinforcement learning. The league consists of three distinct types of agents for each race: main agents, main exploiters and league exploiters. First, the main agents utilize a prioritized fictitious self-play (PFSP) mechanism that adapts the mixture probabilities proportionally to the win rate of each opponent against the agent, to dynamically focus more on the difficult opponents (a small sketch of this weighting is given at the end of this section). Second, main exploiters play only against current main agents to find weaknesses in the main agents. Third, league exploiters use a similar PFSP mechanism against all agents to find global blind spots in the league. Together they ensure the main agents improve in strength and robustness when competing with human players.

AlphaStar's first version, with high-level ideas, was described in a blog post (Vinyals et al., 2019a). It specialized in the race Protoss and was evaluated against two professional players. A revised version was published later (Vinyals et al., 2019b). The later version changed the mechanism of league training to be more generic, utilized the statistic z to encode build order and units, and trained all three races with constrained actions per minute (APM) and a camera interface setting. On the infrastructure side, a total of 12 separate training agents are instantiated, four for each race, and every training agent runs 16,000 concurrent StarCraft II matches to collect samples while the learner processes about 50,000 agent steps per second. It was evaluated on the official online matching system Battle.net and rated at the top level (GrandMaster) on the European server.

TStarBot-X (Han et al., 2020) is a recent attempt to reimplement AlphaStar, with specialization in the race Zerg. It encountered difficulties reimplementing AlphaStar's imitation learning and league training strategy. To overcome those issues, it introduced importance sampling in imitation learning, rule-guided policy search and new agent roles in league training. Those methods helped efficiency and exploration, but it had to incorporate multiple hand-crafted rules such as rule-guided policy search and curated datasets for six fine-tuned supervised models. In the end, the human evaluation showed comparable performance against two human Master players whose expertise is not the race Zerg.
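To make the PFSP mechanism concrete, here is a minimal Python sketch of win-rate-proportional opponent sampling; the squared weighting and the bookkeeping around it are our own illustrative assumptions rather than the exact AlphaStar or SCC implementation.

```python
import numpy as np

def pfsp_weights(win_rates, power=2.0):
    """Prioritized fictitious self-play weighting (illustrative).

    win_rates[i] is the learning agent's empirical win rate against
    opponent i. Opponents that beat the agent more often (low win rate)
    receive proportionally more probability mass, so training focuses
    on the difficult opponents.
    """
    win_rates = np.asarray(win_rates, dtype=np.float64)
    hardness = (1.0 - win_rates) ** power   # higher for harder opponents
    if hardness.sum() == 0:                 # the agent beats everyone: sample uniformly
        return np.full_like(hardness, 1.0 / len(hardness))
    return hardness / hardness.sum()

# Example: three opponents, against whom the agent wins 80%, 50% and 20%.
probs = pfsp_weights([0.8, 0.5, 0.2])
opponent = np.random.choice(len(probs), p=probs)  # matchmaking draw
```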
3. StarCraft Commander (SCC)

To the best of our knowledge, SCC is the first learning-based agent that reaches top human professional level after AlphaStar, while using an order of magnitude less computation and without using any hand-crafted rules. SCC was developed around the time the first version of AlphaStar was published (Vinyals et al., 2019a) and adopted the main ideas from it. SCC interacts with the game of StarCraft II (getting observations and sending actions) using the s2client protocol (Blizzard) and the PySC2 environment (Deepmind), provided by Blizzard and DeepMind respectively.

Without knowing all the AlphaStar algorithm details at the time, we independently experimented with network architecture, imitation and reinforcement learning training mechanisms. Given our computational constraints, we did extensive optimizations to squeeze efficiency out of each stage of learning. Some of the major differences from AlphaStar are highlighted below:

• Game setup is similar to the first version of AlphaStar (Vinyals et al., 2019a), except that SCC was trained on the race Terran, one of the more challenging races for AI to learn. Although the algorithm is generic to any race, we focus on one race to save computation and evaluation effort.

• AlphaStar uses 971,000 replays played on StarCraft II versions 4.8.2 to 4.8.6. SCC only has access to a much smaller dataset, with 105,034 replays on versions 4.10.0 to 4.11.2.

• SCC's network architecture is similar to AlphaStar's, but more memory efficient, with additional optimizations such as the group transformer, attention-based pooling, etc. SCC's final reinforcement learning model has 49M parameters (139M for AlphaStar) (Vinyals et al., 2019b).

• Each agent of SCC is trained using Proximal Policy Optimization (PPO) (Schulman et al., 2017) in the reinforcement learning stage, with further modifications to utilize asynchronous sampling and large-scale distributed training.

• SCC also uses league training consisting of main agents, main exploiters and league exploiters, but with multiple main agents for more diversity. The agent branching approach effectively improved the learning efficiency of main agents and main exploiters, resulting in a strong and robust pool of agents.

• SCC uses far fewer samples for learning. For each agent, 1,000 StarCraft environments are used to collect samples (16,000 for AlphaStar) and the learner consumes about 800 agent steps per second (50,000 for AlphaStar). In total, each agent experienced 30 years of real-time StarCraft play (200 game years for AlphaStar) (Vinyals et al., 2019a;b).

• Since we do not have access to the official Battle.net, we invited human players of different competitive levels (from Diamond to GrandMaster) to each play five matches with SCC at different strength levels, and SCC won all of them. In the end, we held a live event to play two best-of-three matches against two top professional players; SCC won both matches 2:0.

In the following sections, we describe the differences mentioned above in detail, including network architecture, imitation learning, reinforcement learning and evaluations. For brevity, the training platform is described in detail in Appendix B.

4. Network Architecture

With the same input and output interfaces provided by the StarCraft II game engine, the overall network architecture of SCC is similar to that of AlphaStar. From input to output, the model architecture can be roughly divided into three parts, i.e., input encoders, aggregator, and action decoders. The input encoders take in three groups of observations, i.e., scalar features, spatial features and sets of unit features, process them separately and yield encoded feature vectors accordingly. The encoded feature vectors are fed into the aggregator, which concatenates all of them and feeds the resulting vector into a residual LSTM block. Then, the LSTM output is fed into the action decoder, which decides the final action based on the outputs of its six heads. For a more detailed network architecture, please refer to Fig. 4 in Appendix A.

We would also like to highlight some key designs, which are the main contributors to the improved computational efficiency and expressive power of the network architecture, enabling learning a strong model using fewer than 5,000 replays during imitation learning.

• Network pruning: we use a spatial input size of 64 × 64, which reduces the per-sample data size by around half compared to 128 × 128 for AlphaStar. In addition, instead of the very deep 16 blocks of residual MLP used in AlphaStar, a simple fully connected layer is chosen for the selected action head in the action decoder. With these simplifications, almost identical performance is observed in supervised learning.

• Group transformer: among all the observations, the set of units is the most informative. In AlphaStar, all units are put together and processed by a transformer. We instead divide them into three groups: my units, enemy units and neutral units. Since they are naturally heterogeneous, processing them separately allows more degrees of freedom. Specifically, for each group of units, one multi-head self-attention block is applied to the units within the group, and two multi-head cross-attention blocks are applied to units between groups. The outputs of the three blocks are concatenated together as the final encoded unit features (a minimal code sketch follows this list).

• Attention-based pooling: the unit features encoded by the multi-head attention blocks form a set of vectors. A simple average-pooling is applied in AlphaStar to reduce them into a single vector. We propose attention-based pooling, in which trainable weight vectors are created as the queries, and the unit features are reduced by a learned weighted averaging.

• Conditional structures: to deal with the combinatorial action space, AlphaStar uses an additive auto-regressive action structure. Instead of the additive operator, we adopt the structure of concatenation followed by a fully connected layer, to provide full flexibility for the network to learn a better conditional relationship. In addition, different selected actions may have totally different criteria when selecting the target, e.g., the repair action tends to target damaged allied units and the attack action tends to target nearby enemy units. In light of this, we propose the conditioned concat-attention for target selection heads, in order to learn different selection functions for different selected actions.
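As referenced in the group transformer bullet above, the following is a minimal PyTorch sketch of the group-wise self- and cross-attention; the stacking projection and all layer sizes are illustrative assumptions, not SCC's exact configuration.

```python
import torch
import torch.nn as nn

class GroupAttentionBlock(nn.Module):
    """One round of group-wise attention (illustrative): for each unit group,
    self-attention within the group plus cross-attention to each of the other
    two groups, with the three outputs concatenated."""

    def __init__(self, feat_dim=64, num_heads=2):
        super().__init__()
        # One self-attention block per group and two cross-attention blocks per group.
        self.self_attn = nn.ModuleList(
            [nn.MultiheadAttention(feat_dim, num_heads, batch_first=True) for _ in range(3)])
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(feat_dim, num_heads, batch_first=True) for _ in range(6)])
        # Project the concatenated (3 * feat_dim) features back to feat_dim so the
        # block can be stacked; this projection is our assumption, not from the paper.
        self.proj = nn.ModuleList([nn.Linear(3 * feat_dim, feat_dim) for _ in range(3)])

    def forward(self, groups):
        # groups: [my_units, enemy_units, neutral_units], each (batch, n_i, feat_dim)
        outputs, k = [], 0
        for i, q in enumerate(groups):
            feats = [self.self_attn[i](q, q, q)[0]]            # within-group attention
            for j, kv in enumerate(groups):
                if j != i:
                    feats.append(self.cross_attn[k](q, kv, kv)[0])  # between-group attention
                    k += 1
            outputs.append(self.proj[i](torch.cat(feats, dim=-1)))
        return outputs

# Example: three heterogeneous unit groups of different sizes.
block = GroupAttentionBlock()
my_u, enemy_u, neutral_u = (torch.randn(4, n, 64) for n in (30, 25, 10))
encoded = block([my_u, enemy_u, neutral_u])  # three tensors, same shapes as the inputs
```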

5. Imitation Learning

SCC was initially trained by imitating games sampled from publicly available human replays. Supervised learning not only provides a good initialization for the following reinforcement learning stage but also captures diverse human strategies that are extremely difficult to explore from scratch. Also, supervised learning makes it easy to evaluate the policy architecture: a good network architecture for supervised learning is also likely to be good for reinforcement learning. The current network architecture of SCC is the result of extensive experimental studies; details can be found in Appendix A. In this section, we describe the settings applied in the supervised learning stage and provide ablation study results regarding training configurations. Evaluation results of the model obtained from supervised learning are also presented.

5.1. Dataset and Setting

We used a dataset of 105,034 replays played on StarCraft II versions 4.10.0 to 4.11.2 by players with MMR scores greater than 4,300. Instructions for downloading replays can be found at https://github.com/Blizzard/s2client-proto. Note that since we only trained an agent for the race Terran, the dataset only consists of replays of three race match-ups, i.e., Terran versus Terran, Terran versus Protoss and Terran versus Zerg, with 34,487, 39,391 and 31,156 replays respectively.

5.2. Training Analysis

Dataset size and quality  In supervised learning, the size and the quality of the dataset matter. We conducted several ablation experiments regarding the impact of the datasets. First, we constructed three datasets of different sizes, with 4,638, 15,579 and 105,034 replays respectively. In the following, we name them the small, medium and large datasets accordingly. The large dataset is the full dataset; the small and medium datasets are subsets of replays sampled from the large dataset. We trained supervised models using these three datasets, with batch size 3,850. The learning rate was set to 1e-3 before 40,000 training steps and was decayed to 1e-4 after that. During training, we evaluated the models against the built-in elite bot, and the results are shown in Fig. 1 (a). Surprisingly, the three resulting models perform quite similarly, with peak win rates all slightly above 70%. Given the huge complexity of the StarCraft II full game, it is remarkable that the model trained using a small dataset with fewer than 5,000 replays can already play reasonably well. One hypothesis is that the efficient network architecture enables efficient learning. Our models during early versions of development could not achieve similar performance even with more data, which suggests that the design of the model architecture really matters in this case. We do observe, though, that the performance of the model using the small dataset decreases quickly after the peak, indicating a possibility of overfitting.

We also analyzed the impact of the quality of the dataset, in terms of the MMR score. We filtered the large dataset to keep only replays with MMR scores greater than 5,300, obtaining a dataset of size 17,173, which is close to the size of the medium dataset. We call it the MMR5300 dataset. We then trained a supervised model using the MMR5300 dataset, with all other settings the same as above. The evaluation results of the resulting model and that of the medium dataset (MMR4300 dataset) are plotted together in Fig. 1 (b). The performance of the model trained with the MMR5300 dataset is almost always above that of the MMR4300 dataset, and its peak performance is around 85%, about 10% higher than the MMR4300 dataset.

Batch size  In the previous paragraph, we observed that the larger dataset did not lead to better performance. One intuitive explanation is that a larger dataset may be more difficult to fit, so that a larger batch size is needed to obtain a performance gain. We take the performance of the large dataset in Fig. 1 (a) as the baseline, which was trained with batch size 3,850. With all other training settings fixed, we doubled the batch size to 7,700 and ran another training. The evaluation results during training are shown in Fig. 1 (c). With the batch size doubled, the performance is indeed improved, in terms of both learning speed and peak win rate.

Final supervised training  Based on the experiment results above, we trained the final supervised model on the large dataset first, and then fine-tuned it on a small but high-quality dataset for additional gains. Specifically, we first trained the model using the full dataset with learning rate 1e-3 and batch size 15,400, and then fine-tuned the model with the MMR5500-win dataset and the learning rate decayed to 1e-5. The MMR5500-win dataset is of size 3,458 and contains only the winning replays with MMR scores above 5,500. The evaluation performance during this final supervised training is shown in Fig. 1 (d). At the late fine-tuning stage, the win rate against the built-in elite bot exceeds 97% consistently.
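To summarize the two-stage schedule above in code, here is a hedged Python sketch; the Replay fields and the train() helper are hypothetical stand-ins for the actual replay metadata and training loop.

```python
from dataclasses import dataclass

@dataclass
class Replay:
    path: str
    mmr: int        # player's matchmaking rating (hypothetical field name)
    won: bool       # whether this player won the game

def mmr5500_win_subset(replays):
    """Select the small, high-quality fine-tuning set described in the text:
    winning replays whose MMR is above 5,500."""
    return [r for r in replays if r.mmr > 5500 and r.won]

def final_supervised_training(all_replays, train):
    """Two-stage schedule used for the final supervised model (values from the
    paper; `train` is a placeholder for the actual supervised training loop)."""
    # Stage 1: full dataset, large batch, higher learning rate.
    model = train(dataset=all_replays, batch_size=15_400, learning_rate=1e-3)
    # Stage 2: fine-tune on the MMR5500-win subset with the learning rate
    # decayed to 1e-5 (fine-tuning batch size is not reported in the paper).
    model = train(dataset=mmr5500_win_subset(all_replays),
                  learning_rate=1e-5, init_model=model)
    return model
```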


Figure 1. Win rates versus the built-in elite bot during the supervised training of different trials. (a) Training runs using three datasets of different sizes. The small, medium and large datasets all consist of replays with MMR scores greater than 4,300, with 4,638, 15,579 and 105,034 replays, respectively. (b) Training runs using two datasets of different MMR scores. The MMR4300 dataset is the medium dataset, which consists of 15,579 replays; the MMR5300 dataset consists of 17,173 replays. (c) Training runs using the same large dataset but with different training batch sizes. (d) The final supervised training, consisting of a normal training stage followed by a fine-tuning stage.

5.3. Evaluation

Against built-in AIs  The supervised learning agent was evaluated against built-in AIs of different difficulties, on different ladder maps, and in Terran versus Terran, Protoss and Zerg games. The results are shown in Table 1. The built-in elite bot and the CheatInsane bot are used as the opponents; they are the strongest built-in non-cheating AI and the strongest built-in AI that cheats in vision and resources, respectively. Results on three ladder maps, i.e., Triton, KairosJunction and Catalyst, are reported. Note that our training data does not contain Catalyst, which is a ladder map from earlier versions of StarCraft. From the results, we can see that among the three opponent races, the agent achieves the highest win rates in games of Terran versus Terran. Among the three maps, the two seen during training give higher win rates as expected, but surprisingly the agent also plays relatively well on the unseen map Catalyst, which exhibits the generalization capability of the supervised model to different maps.

Comparison with human policy  To compare the policy learned by supervised learning with that of humans, we let the final supervised agent play against itself on the map Triton for 400 games and calculated four indicative statistics of playing strategy for each game: the number of command centers, the number of SCVs, the number of barracks, and the percentage of Marine units among all trained units. Similarly, we randomly picked 400 Terran versus Terran human replays on the same map with MMR scores above 4,300 and calculated the same four statistics for each replay. The distribution densities of the four statistics for both the supervised agent and the humans are shown in Fig. 2. We observe similar distribution shapes for all statistics, e.g., most of the time three command centers and five barracks are built by both the supervised agent and humans, which indicates that the supervised agent indeed plays a similar set of strategies to humans.

Table 1. Win rates (out of 100 matches) of the final supervised model versus the built-in elite bot and the CheatInsane bot, on three different maps: Triton, KairosJunction and Catalyst. Catalyst is an old map from before 2018 season 2, not included in the dataset and thus not seen during training. The elite bot is the strongest built-in non-cheating AI, and the CheatInsane bot is the strongest built-in AI that cheats in vision and resources, i.e., having full vision of the whole map and resource harvest boosting.

Map              TvT Elite   TvT CheatInsane   TvP Elite   TvP CheatInsane   TvZ Elite   TvZ CheatInsane
Triton           0.97        0.29              0.91        0.10              0.95        0.11
KairosJunction   0.94        0.38              0.90        0.21              0.93        0.23
Catalyst         0.90        0.19              0.87        0.08              0.91        0.09


Figure 2. Comparison of strategy statistic distributions between human data and the final supervised agent: number of command centers, number of SCVs, number of barracks, and Marine percentage.

6. Reinforcement Learning

SCC was trained in the setting of Terran versus Terran on the single ladder map Triton at the reinforcement learning stage. The final supervised agent was used as the initialization for the reinforcement learning stage. In the following, we describe the training procedures that were used during the course of training. A detailed description of the training platform is given in Appendix B.

6.1. Learning Algorithm

Starting from the policy provided by the supervised training stage, we apply Proximal Policy Optimization (PPO) (Schulman et al., 2017) to further improve the performance of SCC based on agent-versus-agent games. To improve sampling and training efficiency, we modified the original PPO to utilize asynchronous sampling and large-scale distributed training. In addition to the standard PPO loss, we also added extra loss terms, i.e., a standard entropy regularization loss, and a KL divergence loss with respect to the final supervised policy to prevent the RL policy from deviating from humans (Vinyals et al., 2019b). Thus the overall loss used in reinforcement training is as follows:

    L_RL = L_PPO + L_entropy + L_KL.
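A minimal PyTorch-style sketch of this combined objective is given below, assuming per-sample log-probabilities, advantages and entropies are already available; the clipping range and loss weights are illustrative, not SCC's tuned values.

```python
import torch

def rl_loss(logp_new, logp_old, logp_sup, advantages, entropy,
            clip_eps=0.2, entropy_coef=1e-3, kl_coef=1e-2):
    """L_RL = L_PPO + L_entropy + L_KL (illustrative sketch).

    logp_new: log pi_theta(a|s) under the current policy
    logp_old: log-probabilities recorded when the samples were collected
    logp_sup: log-probabilities of the same actions under the fixed
              supervised policy (used for the KL regularizer)
    """
    # Clipped PPO surrogate.
    ratio = torch.exp(logp_new - logp_old)
    surr1 = ratio * advantages
    surr2 = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    l_ppo = -torch.min(surr1, surr2).mean()

    # Entropy bonus encourages exploration (maximizing entropy -> negative loss term).
    l_entropy = -entropy_coef * entropy.mean()

    # Simple sampled estimate of KL(pi_theta || pi_supervised), keeping the
    # policy close to human-like play.
    l_kl = kl_coef * (logp_new - logp_sup).mean()

    return l_ppo + l_entropy + l_kl
```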

Rewards  We found the statistic z used in AlphaStar a very useful approach to automate the training process and aid exploration and diversity throughout learning, so we adopted a similar approach. Besides the sparse win/loss reward, there is a second type of reward based on the strategy statistic z. The strategy statistics z are extracted from human replays and encode the buildings constructed and units present during a game. For the reinforcement learning stage, we collected only the z from Terran versus Terran replays on the Triton map. At the start of each game, we randomly sample a z, and at each following agent step, the rewards are calculated as the distances between each part of the sampled statistic z and the actually executed statistic. This encourages the agent to follow different human strategies, and thus helps preserve diversity. Separate value functions and advantages are computed for each reward, and generalized advantage estimation (GAE) (Schulman et al., 2015) is applied to balance the trade-off between the bias and variance of the advantage estimation.

Shared policy and value network with warm-up  Although the policy network is initialized with the supervised model, the value network is missing. We observed that at the beginning of reinforcement learning, the value output is very noisy, resulting in noisy calculated advantages and unstable training. To mitigate the problem, we freeze the policy network and update only the value network for the first 50 training steps as a warm-up. In addition, the policy and value networks share the same network architecture and weights before the LSTM output. With the shared architecture, the value network only needs to learn a mapping from the encoded LSTM output to the predicted value, instead of learning a whole mapping from the raw observations, which may also ease the warm-up of the value network. Additionally, the shared architecture reduces the number of weights, i.e., the model size, so that we can use a larger batch size and accelerate training, which is critical especially when we have only limited memory and computational resources.

6.2. Agent League

We found the general self-play approach insufficient to address the game-theoretic challenge of StarCraft, and adopted a league training approach similar to AlphaStar, using a combination of main agents, main exploiters and league exploiters.

Agent branching  During the league training of SCC, we found that as the quality and quantity of agents in the league increase, it takes longer to train an exploiter starting from the supervised agent. After analyzing the behavior of the main agent with different statistics z specified, we observed that the main agent can play partially according to the specified z but not perfectly, likely due to the limited data for a specific z, since there are hundreds of them. And when playing against the unconditional main agent, different z lead to different win rates ranging between 35% and 65%. Based on these observations, we propose an approach to train agents more efficiently with specified strategies. Specifically, instead of initializing with the supervised agent and training with the sparse win/loss reward, we propose to initialize the new agent with the current main agent and train it with not only the sparse win/loss reward, but also the dense reward based on a specific z or a set of z. The specified z could be the one with the highest win rate against the unconditional main agent if we want to train an exploiter quickly, or any specific strategy as needed. With this approach, the new agents branch off the main agent and move in different directions, so we call the approach agent branching.

In addition to reducing training time, we find the agent branching approach very helpful for agent robustness to human strategies. For example, reaper rush is an early aggressive strategy commonly used by human players, and it is very effective if executed well. We find it really difficult for exploiters to discover and execute similar strategies well if only the sparse win/loss reward is used. Using agent branching, exploiters can be guided during learning to specialize in a strategy, resulting in a much stronger exploiter. Thus the main agent learns to cope with the reaper rush strategy very well, even when playing against human players.

Multiple main agents  With the reduced computational resources required for the exploiters, we explored the idea of training multiple main agents. During the course of SCC league training, three main agents were trained, each starting at a different stage. Among them, two were initialized with the supervised model, and one was trained using the agent branching approach. With multiple main agents, we found that they play differently, even the two initialized with the same supervised model. Agents initiated at different stages of league training can evolve differently due to the dynamics of league training. They provide more diversity to the league, thus making the trained agents more robust.

6.3. Evaluation

Elo scores of the main agents  To examine the progression of the league training, we consider the Elo rating metric. The Elo score determines the rating difference between consecutive pairs of models based on the win rates between them. The supervised agent was chosen as the baseline, with its Elo score set to zero. The Elo score curves of the three main agents over the course of league training are plotted in Fig. 3. Main agent 1 was the first main agent, initialized with the supervised model, and was trained the longest, for about 58 days. Main agent 2 was added into the league about 15 days later than main agent 1, and was also initialized with the supervised model. After being added into the league, main agent 2 played against all existing players in the league, and caught up with main agent 1 in Elo score within about 15 days. Main agent 3 was obtained using the agent branching approach based on main agent 1, and was only trained for about 15 days. The Elo scores of all three main agents increase gradually during the course of training, and are above 1,500 at the end.
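For reference, here is a small Python sketch of how a rating gap can be recovered from a measured win rate under the standard logistic Elo model; the 400-point scale is the conventional choice and an assumption here, not a detail reported above.

```python
import math

def elo_difference(win_rate, scale=400.0):
    """Rating gap implied by a win rate under the logistic Elo model.

    A 0.5 win rate gives a 0-point gap; roughly 0.76 gives about +200 points.
    """
    win_rate = min(max(win_rate, 1e-6), 1 - 1e-6)  # guard against 0% or 100%
    return scale * math.log10(win_rate / (1.0 - win_rate))

def chained_elo(win_rates, base=0.0):
    """Chain pairwise gaps between consecutive checkpoints, starting from a
    baseline rating (the supervised agent is fixed at 0 in the paper)."""
    ratings = [base]
    for w in win_rates:  # w = win rate of checkpoint i+1 against checkpoint i
        ratings.append(ratings[-1] + elo_difference(w))
    return ratings
```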

Figure 3. Elo scores of the three main agents during the 58 days of training. Each point represents a past checkpoint of the main agents. The Elo score of the final supervised agent is set to 0 as the baseline.

Generalization of the agent  The increasing Elo scores exhibit the desired progression of the agents, but those are observed in the setting of Terran versus Terran games and on the map Triton, on which the agents are trained. It is well known that agents trained by reinforcement learning are prone to overfitting to the training environment. A natural question is whether the final agents can still play against other races and on maps that are not seen during reinforcement training. To answer this question, we perform evaluations against built-in AIs again, similar to the evaluation of the supervised agent. The evaluation results are shown in Table 2. They show that the final reinforcement learning agent not only can play against untrained races and on unseen maps, but does so with dramatically improved win rates. It is an indication that during league training, the agent indeed learns the essentials of playing StarCraft II well in general, instead of just memorizing and overfitting to the training environment.

7. Human Evaluation

Human evaluation is the final benchmark for StarCraft AI. To evaluate SCC without access to the official Battle.net, we invited human players of different levels to play against SCC. All SCC versus human matches are conducted in Terran versus Terran games, on the ladder map Triton, using StarCraft II version 4.10.0. During the matches, SCC interacts with the StarCraft game engine directly via its raw interface. In real-time evaluation, SCC reacts with a delay between observation and action, for two reasons. First, after a frame is observed, SCC needs to process the observation into a valid input for the policy network and then perform a forward inference, which takes about 100ms on average. Second, at each step, SCC decides when to observe next, with a delay between 2 and 128 game frames, i.e., 90ms to 5.7 seconds, and on average about 420ms. Due to the delays, SCC is limited in the number of actions per minute (APM), and in practice we found that the agent does not act at the limit, probably because it learns from human replays. During the course of reinforcement training, we observed that the average APM of SCC gradually increased from 250 (with peak APM below 600) to around 400 (with peak APM below 1,000), which is comparable to top human players.

7.1. Test Matches

We evaluated the performance of SCC at different stages by playing against human players of different levels, ranging from Diamond to GrandMaster on the Korean server. During the test matches, five games were played with each player. Since SCC has multiple main agents, a random one was picked for each game. The results of the test matches are shown in Table 3. In general, human players are extremely good and quick at responding and adjusting strategies during multiple consecutive games, which imposes great challenges on the robustness of AI. During some of the games, we even intentionally asked human players to play cheese strategies to exploit the weaknesses of the agent; thanks to the diverse set of exploiters during league training, SCC reacted really well to those strategies and demonstrated strong robustness.

7.2. Live Matches

On June 21, 2020, we held a live match event to challenge top professional players of StarCraft II. TIME (Time) and TooDming (TooDming) were invited to play against SCC in two best-of-three (BO3) matches. Both of them are StarCraft II professional champion winners. The final versions of the three main agents of SCC were used in the matches, with an estimated average Elo score above 1,500. As a result, SCC won both matches 2:0. Two narrated replays can be viewed at https://youtu.be/yy8hc4CONyU.

During the first match versus TIME, SCC adopted a strategy with massive landed Vikings, which is seldom used by human players. The strategy was well received in the StarCraft II community, and some players have started adopting it in their own games.

Usually it is expected that AI has an advantage in precise unit control compared to human players. However, according to the feedback from TIME and TooDming, although there is still room for improvement in terms of micromanagement and unit control, it is really impressive that SCC learns to optimize for long-term economy and macro, with relentless efforts in constant harassing and expansion, which is often observed in the games of top professional players.

Table 2. Win rates (out of 100 matches) of the final reinforcement learning agent versus the built-in elite bot and the CheatInsane bot, on three different maps: Triton, KairosJunction and Catalyst. The agent is trained only on the map Triton in the TvT setting during reinforcement learning.

Map              TvT Elite   TvT CheatInsane   TvP Elite   TvP CheatInsane   TvZ Elite   TvZ CheatInsane
Triton           1.0         1.0               1.0         1.0               1.0         0.99
KairosJunction   1.0         1.0               1.0         1.0               1.0         0.96
Catalyst         1.0         0.99              1.0         1.0               1.0         0.96

Table 3. Test results of SCC when playing against human players from Diamond level to GrandMaster level.

ID   Human MMR score   Human ladder tier   SCC training days   SCC avg estimated Elo score   Match result (SCC vs Human)
1    3800              Diamond             23                  765                           5 : 0
2    5000              Master              30                  905                           4 : 1
3    5500              GrandMaster         39                  1070                          3 : 2
4    5200              GrandMaster         47                  1240                          5 : 0

8. Conclusion

We present StarCraft Commander (SCC), a deep reinforcement learning agent for StarCraft II that demonstrates top human performance, defeating GrandMaster players in test matches and top professional players in live matches. We describe the key ideas and insights for imitation learning and reinforcement learning. With an optimized network architecture and training methods, we reduce the computation needed by an order of magnitude compared to AlphaStar.

There are many interesting observations from the experiments. The ability to learn a good policy for the StarCraft II full game from a very small dataset is truly surprising. SCC also demonstrates unexpected generalization capability to maps and races unseen during reinforcement learning. Finally, the fact that SCC is able to discover novel, long-term strategies that inspire the StarCraft game community is very encouraging. We believe further optimizing the balance of exploration and exploitation of RL policies, for the emergence of more novel strategies, is a very promising area of research.

Even though the experiments are conducted on StarCraft II, the approaches are generic to other tasks. Our goal is to provide insights for the research community to explore large-scale deep reinforcement learning problems under resource constraints. As demonstrated through our experiments, exciting opportunities lie ahead in exploring sample efficiency, exploration and generalization.

References

Berner, C., Brockman, G., Chan, B., Cheung, V., Dębiak, P., Dennison, C., Farhi, D., Fischer, Q., Hashme, S., Hesse, C., et al. Dota 2 with large scale deep reinforcement learning. arXiv preprint arXiv:1912.06680, 2019.

Blizzard. StarCraft II client - protocol definitions used to communicate with StarCraft II. https://github.com/Blizzard/s2client-proto. (Accessed on 01/31/2021).

Churchill, D. and Buro, M. Portfolio greedy search and simulation for large-scale combat in StarCraft. In 2013 IEEE Conference on Computational Intelligence in Games (CIG), pp. 1–8. IEEE, 2013.

Churchill, D., Lin, Z., and Synnaeve, G. An analysis of model-based heuristic search techniques for StarCraft combat scenarios. In AIIDE Workshops, 2017.

Deepmind. PySC2: StarCraft II learning environment. https://github.com/deepmind/pysc2. (Accessed on 01/31/2021).

Foerster, J., Farquhar, G., Afouras, T., Nardelli, N., and Whiteson, S. Counterfactual multi-agent policy gradients. arXiv preprint arXiv:1705.08926, 2017.

Han, L., Xiong, J., Sun, P., Sun, X., Fang, M., Guo, Q., Chen, Q., Shi, T., Yu, H., and Zhang, Z. TStarBot-X: An open-sourced and comprehensive study for efficient league training in StarCraft II full game. arXiv preprint arXiv:2011.13729, 2020.

Ilse, M., Tomczak, J., and Welling, M. Attention-based deep multiple instance learning. In International Conference on Machine Learning, pp. 2127–2136, 2018.

Lee, D., Tang, H., Zhang, J. O., Xu, H., Darrell, T., and Abbeel, P. Modular architecture for StarCraft II with deep reinforcement learning. arXiv preprint arXiv:1811.03555, 2018.

Luong, M.-T., Pham, H., and Manning, C. D. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 1412–1421, 2015.

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

Ontanón, S., Synnaeve, G., Uriarte, A., Richoux, F., Churchill, D., and Preuss, M. A survey of real-time strategy game AI research and competition in StarCraft. IEEE Transactions on Computational Intelligence and AI in Games, 5(4):293–311, 2013.

Pang, Z.-J., Liu, R.-Z., Meng, Z.-Y., Zhang, Y., Yu, Y., and Lu, T. On reinforcement learning for full-length game of StarCraft. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 4691–4698, 2019.

Peng, P., Wen, Y., Yang, Y., Yuan, Q., Tang, Z., Long, H., and Wang, J. Multiagent bidirectionally-coordinated nets: Emergence of human-level coordination in learning to play StarCraft combat games. arXiv preprint arXiv:1703.10069, 2017.

Schulman, J., Moritz, P., Levine, S., Jordan, M., and Abbeel, P. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015.

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree search. Nature, 529:484–503, 2016. URL http://www.nature.com/nature/journal/v529/n7587/full/nature16961.html.

Sun, P., Sun, X., Han, L., Xiong, J., Wang, Q., Li, B., Zheng, Y., Liu, J., Liu, Y., Liu, H., et al. TStarBots: Defeating the cheating level built-in AI in StarCraft II in the full game. arXiv preprint arXiv:1809.07193, 2018.

Time. TIME - Liquipedia - The StarCraft II Encyclopedia. https://liquipedia.net/starcraft2/TIME. (Accessed on 01/31/2021).

TooDming. TooDming - Liquipedia - The StarCraft II Encyclopedia. https://liquipedia.net/starcraft2/TooDming. (Accessed on 01/31/2021).

Usunier, N., Synnaeve, G., Lin, Z., and Chintala, S. Episodic exploration for deep deterministic policies: An application to StarCraft micromanagement tasks. arXiv preprint arXiv:1609.02993, 2016.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.

Vinyals, O., Ewalds, T., Bartunov, S., Georgiev, P., Vezhnevets, A. S., Yeo, M., Makhzani, A., Küttler, H., Agapiou, J., Schrittwieser, J., et al. StarCraft II: A new challenge for reinforcement learning. arXiv preprint arXiv:1708.04782, 2017.

Vinyals, O., Babuschkin, I., Chung, J., Mathieu, M., Jaderberg, M., Czarnecki, W., Dudzik, A., Huang, A., Georgiev, P., Powell, R., Ewalds, T., Horgan, D., Kroiss, M., Danihelka, I., Agapiou, J., Oh, J., Dalibard, V., Choi, D., Sifre, L., Sulsky, Y., Vezhnevets, S., Molloy, J., Cai, T., Budden, D., Paine, T., Gulcehre, C., Wang, Z., Pfaff, T., Pohlen, T., Yogatama, D., Cohen, J., McKinney, K., Smith, O., Schaul, T., Lillicrap, T., Apps, C., Kavukcuoglu, K., Hassabis, D., and Silver, D. AlphaStar: Mastering the real-time strategy game StarCraft II. https://deepmind.com/blog/alphastar-mastering-real-time-strategy-game-starcraft-ii/, 2019a.

Vinyals, O., Babuschkin, I., Czarnecki, W. M., Mathieu, M., Dudzik, A., Chung, J., Choi, D. H., Powell, R., Ewalds, T., Georgiev, P., et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature, 575(7782):350–354, 2019b.

Weber, B. AIIDE 2010 StarCraft competition. In Sixth Artificial Intelligence and Interactive Digital Entertainment Conference, 2010.

Zambaldi, V., Raposo, D., Santoro, A., Bapst, V., Li, Y., Babuschkin, I., Tuyls, K., Reichert, D., Lillicrap, T., Lockhart, E., et al. Relational deep reinforcement learning. arXiv preprint arXiv:1806.01830, 2018.

A. Network Architecture

With the complex observation and action structures, the design of the policy network architecture is critical in order to extract the right information from the raw observation and issue the proper action. The overall architecture of the policy network of SCC is shown in Fig. 4, and it was designed based on performance in supervised learning. With the same input and output interfaces provided by the game engine, the overall policy structure of SCC is similar to that of AlphaStar, while there are some key components that differ, which are discussed in detail below.

A.1. Minimap Size

As part of the observation, the minimap consists of 6 planes of spatial information, i.e., height, visibility, player relative, alerts, pathable and buildable, which are directly provided by the raw interface, and its size can be specified. AlphaStar uses minimaps of size 128 × 128; as the minimap is the only spatial input, it also accounts for most of the data size. We reduce the minimap size from 128 × 128 to 64 × 64, which reduces the per-sample data size by around 49.5%. We compared the overall performance of the two settings in supervised learning, and observed that the two different minimap sizes lead to almost identical training and evaluation results. As a result, we adopt the 64 × 64 minimap, for smaller storage and higher computational efficiency.

A.2. Group Transformer

Among all parts of the observations, the set of units is the most informative, with each unit consisting of the unit type, owner, position and many other fields. The whole set of units can be divided into three groups, namely my units, enemy units and neutral units. In AlphaStar, all three groups of units are put together and processed by a transformer. In our view, the three groups of units are naturally separate, and processing them separately allows a more comprehensive expression. It also facilitates later processing; for example, the selected units head can only select units from my units.

For each group of units, one multi-head self-attention block is applied with the unit feature vectors in the same group as the queries, keys and values, and two multi-head cross-attention blocks are applied with the queries being the unit feature vectors in the same group and those of the other two unit groups being the keys and values (Vaswani et al., 2017). For each of the multi-head attention blocks, there are two heads, each of size 32, and the outputs of the three multi-head attention blocks are concatenated together. The whole process is repeated three times to yield the final encoded unit features for the three groups of units, which act as the attention keys used in the selected units and target unit heads, and are also reduced into a single vector that is fed into the LSTM.

A.3. Attention-based Pooling

To reduce the extracted unit feature vectors of each group into a single vector, a simple average-pooling is applied in AlphaStar. In a similar scenario, max-pooling is utilized in OpenAI Five. Both operators are non-trainable, which potentially limits their expressive capability. Inspired by attention-based multiple instance learning (Ilse et al., 2018), we propose trainable attention-based pooling based on multi-head attention. More specifically, the extracted unit feature vectors are treated as the keys and values in a multi-head attention, and multiple trainable weight vectors are created as the queries. The outputs of the multi-head attention are then flattened to yield the reduced vector, with the dimension being defined by the number of query weight vectors, and also the number of heads and head size of the multi-head attention. We observe that this trainable reduce operator gives better performance in supervised learning.
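The following is a minimal PyTorch sketch of this attention-based pooling; the number of query vectors and all dimensions are illustrative assumptions, not SCC's actual hyperparameters.

```python
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    """Reduce a variable-sized set of unit feature vectors to one vector
    with trainable query vectors (illustrative sketch)."""

    def __init__(self, feat_dim=128, num_queries=4, num_heads=2):
        super().__init__()
        # Trainable queries act as learned "what to summarize" probes.
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim))
        self.attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, unit_feats, key_padding_mask=None):
        # unit_feats: (batch, num_units, feat_dim)
        batch = unit_feats.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)
        # Unit features serve as keys and values; the output is a learned
        # weighted average of the units for each query vector.
        pooled, _ = self.attn(q, unit_feats, unit_feats,
                              key_padding_mask=key_padding_mask)
        return pooled.flatten(start_dim=1)  # (batch, num_queries * feat_dim)

# Example: pool 50 encoded unit vectors into a single fixed-size vector.
pool = AttentionPooling()
units = torch.randn(8, 50, 128)
summary = pool(units)  # shape: (8, 512)
```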

A.4. Conditional Structures

With the observations encoded into vector representations, concatenated and processed by a residual LSTM block, the action is then decided based on the LSTM output. To manage the structured, combinatorial action space, AlphaStar uses an auto-regressive action structure in which each subsequent action head conditions on all previous ones, via an additive auto-regressive embedding passing through all heads. During the design of SCC, we found that it is critical for all other heads to condition on the selected action, which defines what to do, for example, moving, attacking, or training a unit; for the two heads that decide the target unit or position, it is also helpful to condition on the selected units. However, for all other heads, it is not necessary to condition on each other, so the order of these heads, say the skip frames and queued heads, does not matter. In addition, in view of the limited expressive capability of the additive operator, we adopt the structure of concatenation followed by a fully connected layer, to provide full flexibility for the network to learn a better conditional relationship.

Due to the importance of the selected action for some action heads, we propose to condition on it further via the conditioned concat-attention. To be specific, the concat-attention is applied in the target unit head to output the probability distribution from which the target unit is sampled. In the original concat-attention (Luong et al., 2015), the attention score is computed as follows:

    score(q, u_i) = v^T tanh(W [q; u_i]),

where q is the query, u_i is the encoded unit feature vector serving as the key, and W and v are the weights to be trained. In the proposed conditioned concat-attention, we replace the single weight vector v with the embedding of the sampled selected action. Since there are different embeddings for different selected actions, the conditioned concat-attention provides the capability of defining different function mappings for different selected actions. This makes sense intuitively, since different selected actions may need totally different criteria for selecting the target; for example, the repair action may want to target damaged allied units and the attack action may want to target nearby enemy units. The same conditioned concat-attention is also applied inside the pointer network in the selected units head, in place of the original simple dot-product attention, to further condition on the selected action.
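A minimal PyTorch sketch of the conditioned concat-attention score above; the action-embedding table and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConditionedConcatAttention(nn.Module):
    """score(q, u_i) = a^T tanh(W [q; u_i]), where a is the embedding of the
    sampled selected action (taking the place of the single trained vector v)."""

    def __init__(self, query_dim=256, unit_dim=64, attn_dim=64, num_actions=100):
        super().__init__()
        self.W = nn.Linear(query_dim + unit_dim, attn_dim, bias=False)
        # One "v" per action: the action embedding plays the role of v.
        self.action_embed = nn.Embedding(num_actions, attn_dim)

    def forward(self, query, unit_feats, selected_action):
        # query: (batch, query_dim), unit_feats: (batch, num_units, unit_dim),
        # selected_action: (batch,) integer action ids
        num_units = unit_feats.size(1)
        q = query.unsqueeze(1).expand(-1, num_units, -1)      # broadcast over units
        hidden = torch.tanh(self.W(torch.cat([q, unit_feats], dim=-1)))
        a = self.action_embed(selected_action).unsqueeze(-1)  # (batch, attn_dim, 1)
        scores = torch.bmm(hidden, a).squeeze(-1)             # (batch, num_units)
        return torch.softmax(scores, dim=-1)                  # target-unit distribution

# Example: pick a target among 20 candidate units, conditioned on the selected action.
attn = ConditionedConcatAttention()
probs = attn(torch.randn(4, 256), torch.randn(4, 20, 64),
             torch.randint(0, 100, (4,)))  # probs: (4, 20)
```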

B. Training Platform

We developed a highly scalable training system based on the actor-learner architecture. The diagram of the training process for one agent is shown in Fig. 5. In our system, there are a number of Samplers, each of which continuously runs a single SC2 environment and collects training data. When a rollout of training data is generated, it is sent to a Trainer over the network and saved into a local buffer. When there is sufficient training data in the local buffer, the Trainer starts to execute a training step, during which it repeatedly samples a batch of training data, computes the gradient, and executes an MPI all-reduce operation. When the training data has been utilized some number of times, the training step is terminated and the local buffer is cleared. After that, the Trainers distribute the updated network parameters to the Predictors over the network. The Predictor provides a batched inference service on GPU for the Samplers, to make efficient use of resources. For each agent, we run about 1,000 (AlphaStar: 16,000) concurrent SC2 matches, and can collect a sample batch of total size 144,000 and perform a training step in about 180 seconds. So about 800 (AlphaStar: 50,000) agent steps can be processed per second on average.

To support the whole league training, we also built a league system as shown in Fig. 6, which consists of four components, i.e., Storage, Predictor, Scheduler and Evaluator. We introduce each component in the following.

• Storage: we use a MySQL DB service to store league information, such as agent-id, agent-type, model-path and so on. The evaluation results of win rates between agents are also saved into the storage.

• Predictor: we use a single cluster of Predictors to provide inference service for all agents in the league, and the cluster is shared by the training agents. Not only GPUs but also CPUs are used as Predictors.

• Scheduler: the Scheduler maintains the Predictors and provides a naming service, which receives an agent-id and returns an available Predictor (allocating one if none exists). As the number of agents in the league keeps growing, there is no guarantee that each agent has at least one GPU Predictor; in that case CPU Predictors are used instead. The distribution of requests over agents also changes during the training process, so the Scheduler is also responsible for auto-scaling the number of Predictors per agent according to the request volume.

• Evaluator: an Evaluator is needed to obtain the win rates between agents in the league. The evaluation results are saved into Storage and are used to calculate the matchmaking distribution.
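To make the Sampler/Trainer hand-off concrete, here is a toy, single-machine Python sketch using in-process queues in place of the networked services; every name in it is an illustrative stand-in, not the actual system's API.

```python
import queue
import random
import threading

rollout_queue = queue.Queue(maxsize=100)    # Sampler -> Trainer hand-off

def sampler(sampler_id, num_rollouts=5, steps_per_rollout=8):
    """Stand-in for a Sampler: pretends to play and ships fixed-length rollouts."""
    for _ in range(num_rollouts):
        rollout = [(sampler_id, step, random.random())  # (who, step, fake reward)
                   for step in range(steps_per_rollout)]
        rollout_queue.put(rollout)
    rollout_queue.put(None)                             # signal completion

def trainer(num_samplers, batch_size=16, reuse=2):
    """Stand-in for the Trainer: buffers rollouts, 'trains' when enough data arrives."""
    buffer, finished, steps = [], 0, 0
    while finished < num_samplers:
        item = rollout_queue.get()
        if item is None:
            finished += 1
            continue
        buffer.extend(item)
        if len(buffer) >= batch_size:
            for _ in range(reuse):   # each sample is reused a fixed number of times;
                steps += 1           # the real system does a gradient step + MPI all-reduce here
            buffer.clear()           # then clears the buffer and pushes fresh
                                     # parameters to the Predictors
    print(f"performed {steps} training steps")

threads = [threading.Thread(target=sampler, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
trainer(num_samplers=4)
for t in threads:
    t.join()
```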


Figure 4. Overview of the policy network architecture of SCC.


Figure 5. Diagram of the training process.


Figure 6. Diagram of the league training.