4N4-IS-1c-05  The 35th Annual Conference of the Japanese Society for Artificial Intelligence, 2021

Connect6 Opening Leveraging AlphaZero Algorithm and Job-Level Computing

Shao-Xiong Zheng*1,2  Wei-Yuan Hsu*1,2  Kuo-Chan Huang*3  I-Chen Wu*1,2,4

*1 Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
*2 Research Center for IT Innovation, Academia Sinica, Taiwan
*3 Department of Computer Science, National Taichung University of Education, Taichung, Taiwan
*4 Pervasive Artificial Intelligence Research (PAIR) Labs, Taiwan

For most board games, players commonly learn to increase their strength by following the opening moves played by experts, usually in the first stage of play. In the past, many efforts have been made to use game-specific knowledge to construct opening books. Recently, DeepMind developed AlphaZero (2017), which can master game playing without domain knowledge. In this paper, we present an approach based on AlphaZero to constructing an opening book. To demonstrate the approach, we use a Connect6 program trained based on AlphaZero to evaluate positions, and then expand the opening game tree with a job-level computing algorithm called JL-UCT (job-level Upper Confidence Tree), developed by Wu et al. (2013) and Wei et al. (2015). In our experiment, the strength of the Connect6 program using this opening book is significantly improved: the one with the opening book has a win rate of 65% against the one without it. In addition, the one without the opening book lost to Polygames in the Connect6 tournament of the TCGA 2020 competitions, while the one with the opening book won against Polygames in the TAAI and Computer Olympiad competitions later in 2020.

Contact: I-Chen Wu, Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu, Taiwan, +886-3-5731855, +886-3-5733777, [email protected].

1. Introduction

Opening book construction is an important research topic for increasing the strength of game-playing programs [Wei 2015]. Opening books of a strategy game are databases containing good actions at the opening stage of the game. An opening book can bring an especially significant advantage in time-limited game competitions, because a lot of search and computation time can be saved.

In the past, many efforts have been made to use game-specific knowledge to construct opening books, including analyzing opening moves made by top players and using programs that implement domain-specific algorithms to suggest opening moves. These methods may encounter two problems: the quality of the opening book depends on human knowledge of the game, and a method successful in one game might not be applicable to another.

To solve these two issues, this paper proposes an approach based on AlphaZero [Silver 2018] to constructing a high-quality opening book without domain knowledge. The AlphaZero algorithm [Silver 2018], developed by DeepMind, demonstrates the capability of reinforcement learning to master game playing without domain knowledge. In our opening book construction approach, a program trained based on AlphaZero is used to evaluate positions while expanding the opening game tree. For game tree expansion, we use the Job-Level Upper Confidence Tree (JL-UCT) distributed algorithm [Wei 2015] to explore the game tree and select opening positions to evaluate. The evaluation data on the opening game tree are then collected and converted into an opening book.

To demonstrate the feasibility of our approach, we used the proposed approach to construct a Connect6 opening book, and assessed its quality by comparing the strength of our two Connect6 programs with and without the opening book, respectively. It turns out that the opening book significantly improves the strength of our Connect6 program. In our experiment, the one with the opening book has a win rate of 65% against the one without it. The opening book also helped our Connect6 program achieve higher rankings in real tournaments. Our original program without the opening book lost to Polygames [Cazenave 2020] in the Connect6 tournament of the TCGA 2020 competitions, while the new one with the opening book defeated Polygames and won the gold medal in both the TAAI and Computer Olympiad competitions later in 2020.

The remainder of this paper is organized as follows. Section 2 presents the necessary background knowledge on MCTS, AlphaZero, Connect6, and Job-Level Computing, and discusses related work on opening book construction. In Section 3, we describe our approach to constructing a Connect6 opening book and evaluate the performance of the opening book. Section 4 concludes this paper.

2. Background and Related Work

2.1 MCTS

Monte-Carlo tree search (MCTS) is a decision-making algorithm based on Monte Carlo evaluation and best-first search, typically used in turn-based games [Chaslot 2008]. The algorithm repeatedly simulates the possible consequences of each action in such a way that promising actions, selected based on the current simulation results, are given additional simulations. Each iteration of MCTS consists of the following four stages.

Selection: Starting at the root node, a selection policy is recursively applied to choose a child for each visited node until a leaf node is reached. A key issue in this stage is the balance between exploration and exploitation. A commonly used selection policy is first to evaluate the UCT value of each child i, based on the following formula:

$UCT_i = x_i + C \sqrt{\log N / N_i}$    (1)

where $x_i$ is the win rate of child $i$, $N$ and $N_i$ are the visit counts of the node and of its child $i$ respectively, and $C$ is a coefficient. MCTS is inclined to explore for larger C and tends to exploit for smaller C. The policy then selects the child with the maximum $UCT_i$. This allows MCTS to converge to the optimal decision after a sufficiently large number of simulations [Browne 2012].

Expansion: The tree is expanded by adding one or more child nodes to the selected leaf node, according to the available actions at the state represented by the selected node.

Simulation: Simulations are run from the new node(s) by taking a series of random actions, or actions chosen according to a default policy, until outcomes are obtained at the terminal states.

Backpropagation: Simulation results are used to update the UCT values of all ancestors.

The four MCTS stages are repeatedly applied until a predefined time or iteration constraint is reached.
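For concreteness, the sketch below implements one MCTS iteration in Python with UCT selection as in formula (1). It is a minimal illustration rather than the authors' implementation: the Node class, the game-state interface (legal_actions, play, random_playout), and the value of C are assumptions made for the example.

```python
import math
import random

C = 1.4  # exploration coefficient in formula (1); illustrative choice

class Node:
    def __init__(self, state, parent=None):
        self.state = state        # assumed game-state object
        self.parent = parent
        self.children = []
        self.visits = 0           # N_i
        self.wins = 0.0           # accumulated outcomes, so wins/visits = x_i

    def uct(self):
        # Formula (1): x_i + C * sqrt(log N / N_i)
        if self.visits == 0:
            return float("inf")   # unvisited children are tried first
        x_i = self.wins / self.visits
        return x_i + C * math.sqrt(math.log(self.parent.visits) / self.visits)

def mcts_iteration(root):
    # Selection: descend by maximum UCT value until a leaf is reached.
    node = root
    while node.children:
        node = max(node.children, key=Node.uct)
    # Expansion: add child nodes for the available actions at the leaf.
    for action in node.state.legal_actions():
        node.children.append(Node(node.state.play(action), parent=node))
    if node.children:
        node = random.choice(node.children)
    # Simulation: random playout to a terminal state (hypothetical helper);
    # win/loss perspective handling is omitted for brevity.
    outcome = node.state.random_playout()   # 1 = win, 0 = loss
    # Backpropagation: update the statistics of the node and all ancestors.
    while node is not None:
        node.visits += 1
        node.wins += outcome
        node = node.parent
```

In practice, mcts_iteration is called in a loop until the time or iteration budget described above is exhausted, after which the most-visited child of the root is typically played.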
2.2 AlphaZero

AlphaZero is an algorithm that allows programs to learn to master a game without human knowledge. The algorithm combines Monte Carlo tree search and deep neural networks in a reinforcement learning framework, described as follows. During self-play, an MCTS-based program plays against itself to generate game records, and a deep neural network is trained on the game outcomes as well as the probability distributions of the actions chosen by MCTS. Note that the selection in the MCTS of AlphaZero slightly modifies formula (1) by incorporating the probability provided by the network policy into the second term, as described in greater detail in [Silver 2018].

AlphaZero trains programs without using game-specific knowledge, so it can be generally applied to training for many other games or applications, such as Go, Chess, and Shogi, and reaches state-of-the-art strength [Silver 2018].
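That modification is often written in PUCT form, where the network's policy prior scales the exploration term. The sketch below illustrates the idea only approximately and does not reproduce the exact formula of [Silver 2018]; the per-child field prior and the constant c_puct are assumed names.

```python
import math

def puct(child, parent_visits, c_puct=1.5):
    # AlphaZero-style selection: the policy prior P(s, a) weights the
    # exploration term of formula (1); c_puct is an illustrative constant.
    q = child.wins / child.visits if child.visits else 0.0
    u = c_puct * child.prior * math.sqrt(parent_visits) / (1 + child.visits)
    return q + u
```

Because the prior concentrates exploration on moves the network already considers promising, unpromising children receive far fewer simulations than under plain UCT.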
2.3 Connect6

Compared with the similar game Gomoku, Connect6 has better fairness. In Gomoku, a larger board size tends to give the first player a greater advantage [Hsu 2020]. It has even been proved that the first player wins 15x15 Gomoku [Allis 1994]. In contrast, so far there is no evidence that the same unfairness exists in Connect6. In addition, Connect6 is more complex than many other games, since placing two stones in one move makes the number of actions much higher. These properties make Connect6 regarded as one of the most ideal games for studying computer games [Tao 2009].

2.4 Job-Level Computing

Since solving game problems often requires a large amount of computation, parallelization is usually necessary in practice. To help solve game problems, Wu et al. proposed a general distributed computing model named job-level computing [Wu 2013].

Job-level (JL) computing consists of JL clients and the JL system. A JL client dynamically divides game problem solving into tasks that can be completed by specific executions of game programs. The requests to execute game programs are encapsulated as jobs and sent to the JL system. The JL system, comprised of a broker and a collection of (remote) workers, performs the jobs simultaneously by dispatching them to available workers. The job results are then returned to the JL client.

Under the JL computing model, many general problem-solving algorithms that are not limited to specific games have been proposed. A useful one, called JL-UCT [Wei 2015], will be used in our opening book construction in Section 3. JL-UCT is a game tree expansion algorithm adopting ideas similar to MCTS, and works as follows. In the JL system, a game-playing program serves as an agent that suggests actions and evaluates the expected outcomes of given positions. The JL client starts by building a JL game tree rooted at a given position. Then, the JL client repeatedly requests executions of the game-playing program and adds tree nodes, representing the succeeding positions of the suggested actions, to the game tree. For JL-UCT, the JL client recursively applies the UCT formula to select child nodes to visit until it reaches a leaf node. JL-UCT can perform the above game tree expansion for different games when implemented with different game-playing programs. For example, when applied on
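As a rough sketch of how a JL client might drive JL-UCT, the loop below (reusing the Node class from the MCTS sketch in Section 2.1) selects a leaf with formula (1), submits one execution of the game-playing program as a job, and expands the suggested action. The submit_job interface and its result fields are hypothetical stand-ins for the broker/worker protocol of [Wu 2013], and the loop is written sequentially for clarity, whereas the actual JL system dispatches many jobs to workers concurrently.

```python
def jl_uct_expand(root, submit_job, iterations):
    """Client-side JL-UCT loop under the assumed interfaces: submit_job
    sends one job (evaluate a position) to the JL system and returns an
    object with a suggested action and an expected-outcome value."""
    for _ in range(iterations):
        # Selection: descend by formula (1) until a leaf node is reached.
        node = root
        while node.children:
            node = max(node.children, key=Node.uct)
        # One job = one execution of the game-playing program on the leaf
        # position; the broker dispatches it to an available worker.
        result = submit_job(node.state)
        # Expansion: add the succeeding position of the suggested action.
        child = Node(node.state.play(result.action), parent=node)
        node.children.append(child)
        # Backpropagation: update counts and values from the new node
        # up to the root.
        n = child
        while n is not None:
            n.visits += 1
            n.wins += result.value
            n = n.parent
```

Under this scheme, the expensive position evaluations run on remote workers while the client only maintains the game tree, which is what makes the expansion easy to parallelize across many machines.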