Mastering Basketball with Deep Reinforcement Learning: An Integrated Curriculum Training Approach∗ (Extended Abstract)
AAMAS 2020, May 9–13, 2020, Auckland, New Zealand

Hangtian Jia 1, Chunxu Ren 1, Yujing Hu 1, Yingfeng Chen 1+, Tangjie Lv 1, Changjie Fan 1, Hongyao Tang 2, Jianye Hao 2
1 Netease Fuxi AI Lab, 2 Tianjin University
{jiahangtian,renchunxu,huyujing,chenyingfeng1,hzlvtangjie,fanchangjie}@corp.netease.com
{bluecontra,jianye.hao}@tju.edu.cn

ABSTRACT
Despite the success of deep reinforcement learning in a variety of games such as board, RTS, FPS, and MOBA games, sports games (SPG) like basketball have seldom been studied. Basketball is one of the most popular and challenging sports games due to its long time horizon, sparse rewards, complex game rules, and multiple roles with different capabilities. Although these problems can be partially alleviated by common methods such as hierarchical reinforcement learning, which decomposes the whole game into several subtasks based on the game rules (such as attack and defense), these methods tend to ignore the strong correlations between the subtasks and can have difficulty generating reasonable policies across a whole basketball match. Besides, the existence of multiple agents adds extra challenges to such a game. In this work, we propose an integrated curriculum training approach (ICTA) composed of two parts. The first part handles the correlated subtasks from the perspective of a single player: it contains several weighted cascading curriculum learners that smoothly unify the base curriculum training of the corresponding subtasks using a Q-value backup mechanism with a weight factor. The second part enhances the cooperation ability of the basketball team: a curriculum switcher that learns to switch the cooperative curriculum within one team by taking over collaborative actions, such as passing, from the single-player action spaces. We apply our method to a commercial online basketball game named Fever Basketball (FB). Results show that ICTA significantly outperforms the built-in AI and reaches a win rate of up to around 70% against online human players over a 300-day evaluation period.

KEYWORDS
basketball, reinforcement learning, curriculum learning

ACM Reference Format:
Hangtian Jia, Chunxu Ren, Yujing Hu, Yingfeng Chen, Tangjie Lv, Changjie Fan, Hongyao Tang, and Jianye Hao. 2020. Mastering Basketball with Deep Reinforcement Learning: An Integrated Curriculum Training Approach. In Proc. of the 19th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2020), Auckland, New Zealand, May 9–13, 2020, IFAAMAS, 3 pages.

∗ Equal contribution by the first two authors. + Corresponding author.

1 INTRODUCTION AND RELATED WORK
Although different kinds of games have different properties and styles, deep reinforcement learning (DRL) has conquered a variety of them through different techniques: DQN for Atari games [11, 12], Monte Carlo tree search (MCTS) for board games [15, 16], game theory for card games [2], curriculum learning (CL) for first-person shooting (FPS) games [8], and hierarchical reinforcement learning (HRL) for multiplayer online battle arena (MOBA) games [5, 13] and real-time strategy (RTS) games [10, 18]. However, sports games (SPG) have rarely been studied, with the exception of a football simulation platform released by Google, which only provides several benchmarks with baseline algorithms and does not propose any general solutions [7].

Figure 1: Correlations of Subtasks in Basketball

As another popular sports game, basketball represents one special kind of comprehensive challenge that unifies most of the critical problems for RL. First, the long time horizon and sparse rewards remain an issue for modern DRL algorithms.
In basketball, the agents normally do not get a reward signal (goal in or not) until scoring, which requires a long sequence of consecutive events such as dribbling and breaking through the defense of opponents. Second, a basketball game is composed of many different sub-tasks according to the game rules (Figure 1): for example, the offense sub-tasks (attack, assist), the defense sub-task, the sub-task of acquiring ball possession (ballclear), and the sub-task of navigating to the ball (freeball), each of which can be independently formulated as an RL problem. Third, it is a multi-agent system that needs the players in a team to cooperate well to win the game. Last but not least, there are several characters or positions classified by the players' capabilities or tactical strategies, such as center (C), power forward (PF), small forward (SF), point guard (PG), and shooting guard (SG), which adds extra stochasticity to the game.

For solving complex problems, hierarchical reinforcement learning (HRL) is commonly used by training a higher-level agent to generate policies that manipulate the transitions between sub-tasks [1, 6, 9, 14, 17]. However, HRL is not feasible here, since these transitions are controlled by the basketball game itself. Besides, the correlations between sub-tasks (Figure 1) make it more challenging to generate a reasonable policy across the whole basketball match, since the policies for these sub-tasks are highly correlated.

Figure 2: Scenes and Proposed Training Approaches in FB

2 ALGORITHM
By regarding the five subtasks as base curricula for basketball playing, we propose an integrated curriculum training approach (ICTA) composed of two parts. The first part handles the correlated subtasks from the perspective of a single player and contains several weighted cascading curriculum learners for the corresponding subtasks. Such training can smoothly establish relationships during the base curriculum training of correlated subtasks (such as τ_i, τ_j) by using a Q-value backup mechanism when calculating the Q-label y_{τ_i} [11] of the former subtask τ_i and heuristically adjusting the weight factor η ∈ [0, 1]:

$$
y_{\tau_i} =
\begin{cases}
r_i + \eta\,\gamma \max_{a'_{\tau_j}} Q\big(s'_{\tau_j}, a'_{\tau_j};\, \theta_{\tau_j}\big), & \text{terminal } s \text{ of } \tau_i \\
r_i + \gamma \max_{a'_{\tau_i}} Q\big(s'_{\tau_i}, a'_{\tau_i};\, \theta_{\tau_i}\big), & \text{otherwise}
\end{cases}
$$

The second part is a curriculum switcher that targets enhancing the cooperation ability of the whole team. It focuses on learning the switch of the cooperative curriculum within one team by taking over collaborative actions, such as passing, from the single-player action spaces (Figure 2). The switcher has a higher priority than the base curriculum learners in action selection, to ensure good coordination within a team.
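To make the backup rule concrete, the following is a minimal Python sketch of the cascading Q-label computation, assuming a generic per-subtask Q-network interface. The `SubtaskLearner` stub, the action-space sizes, the example values of γ = 0.99 and η = 0.5, and the attack-to-freeball hand-off in the usage example are illustrative assumptions, not details specified in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)


class SubtaskLearner:
    """Stand-in for one per-subtask Q-network; the paper trains one learner
    per base curriculum (attack, assist, defense, ballclear, freeball)."""

    def __init__(self, num_actions: int):
        self.num_actions = num_actions

    def q_values(self, state) -> np.ndarray:
        # Placeholder: a real learner would evaluate its network on `state`.
        return rng.standard_normal(self.num_actions)


def q_label(reward: float, next_state, subtask_terminal: bool,
            learner_i: SubtaskLearner, learner_j: SubtaskLearner,
            gamma: float = 0.99, eta: float = 0.5) -> float:
    """Cascading Q-label y_{tau_i} for a transition collected in subtask tau_i.

    When tau_i terminates and hands over to a correlated subtask tau_j, the
    target bootstraps from tau_j's Q-network, scaled by the weight factor
    eta in [0, 1]; otherwise it is the ordinary single-subtask DQN target [11].
    """
    if subtask_terminal:
        return reward + eta * gamma * float(np.max(learner_j.q_values(next_state)))
    return reward + gamma * float(np.max(learner_i.q_values(next_state)))


# Example: a shot attempt ends the attack subtask and hands over to freeball.
attack, freeball = SubtaskLearner(num_actions=12), SubtaskLearner(num_actions=8)
target = q_label(reward=0.0, next_state=None, subtask_terminal=True,
                 learner_i=attack, learner_j=freeball)
```

The key design point is that only the bootstrap term changes at a subtask boundary; within a subtask, each learner trains exactly as in standard DQN.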
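The switcher's priority rule can likewise be sketched as a simple action-selection wrapper. Everything here beyond the priority mechanism itself — the `open_teammate` feature, the `PASS` action name, and the dictionary of base learners — is a hypothetical illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(1)

PASS = "pass"  # assumed name for a collaborative action owned by the switcher


class CurriculumSwitcher:
    """Stand-in for the switcher policy that owns the collaborative actions
    (such as passing) removed from the single-player action spaces."""

    def collaborative_action(self, state: dict):
        # Placeholder decision rule; a trained switcher would score `state`.
        return PASS if state.get("open_teammate", False) else None


class BaseLearner:
    """Stand-in for one base curriculum learner with a discrete action space."""

    def __init__(self, num_actions: int):
        self.num_actions = num_actions

    def q_values(self, state: dict) -> np.ndarray:
        return rng.standard_normal(self.num_actions)


def select_action(state: dict, switcher: CurriculumSwitcher,
                  base_learners: dict, active_subtask: str):
    # The switcher has strictly higher priority: if it picks a collaborative
    # action, the base curriculum learner is not consulted at all.
    action = switcher.collaborative_action(state)
    if action is not None:
        return action
    # Otherwise, the learner of the game-determined active subtask acts greedily.
    return int(np.argmax(base_learners[active_subtask].q_values(state)))


learners = {"attack": BaseLearner(12), "defense": BaseLearner(10)}
chosen = select_action({"open_teammate": True}, CurriculumSwitcher(), learners, "attack")
```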
3 EXPERIMENTAL RESULTS
We evaluate our models in a commercial online basketball game named Fever Basketball (FB).¹ The ablation experiments (Figure 3(a)) are first carried out by training against the rule-based built-in AI, which has roughly average human capability, to demonstrate the contribution of each of our methods. The results of a 300-day online evaluation with 1,130,926 human players are illustrated in Figure 3(b). The base algorithm we use is Ape-X Rainbow [3, 4].

Figure 3: Performance of Our Models in FB. (a) Training process against the built-in AI: average score gap per round for one-model training, base curriculum training, cascading curriculum training, and ICTA. (b) Win rate of the RL AI team vs. human players in 3v3 PVPs over the 300-day online evaluation.

¹ https://github.com/FuxiRL/FeverBasketball

4 DISCUSSION AND CONCLUSIONS
In the ablation experiments, we find that one-model training fails to learn a generalized model for the five diverse sub-tasks (e.g., because of the differences in action space and in the goal of each subtask). Players trained with the five-model base curricula can generate some fundamental policies but lack tactical movements, since the correlations between related sub-tasks are ignored. The weighted cascading curriculum training improves further on base curriculum training because the correlations between related sub-tasks are retained and the policy can be optimized over the whole task from the perspective of a single player. However, coordination within one team remains a weakness. The ICTA model significantly outperforms the other training approaches since coordination performance is substantially improved by the coordination curriculum switcher. The results of the 300-day online evaluation show that ICTA reaches a win rate of up to around 70% against human players in 3v3 PVPs, despite the fact that human players can purchase equipment to become stronger and may discover the weaknesses of our models over time. We conclude that ICTA can serve as a general method for handling SPG like basketball.

ACKNOWLEDGMENTS
Hongyao Tang and Jianye Hao are supported by the National Natural Science Foundation of China (Grant Nos. 61702362, U1836214).

REFERENCES
[1] Matthew M. Botvinick, Yael Niv, and Andrew C. Barto. 2009. Hierarchically organized behavior and its neural foundations: A reinforcement learning perspective. Cognition 113, 3 (2009), 262–280.
[2] Johannes Heinrich and David Silver.
[10] … IJCAI-19. International Joint Conferences on Artificial Intelligence Organization, 4540–4546. https://doi.org/10.24963/ijcai.2019/631
[11] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing Atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).