AlphaZero with Input Convex Neural Networks

Total pages: 16

File type: PDF, size: 1020 KB

AlphaZero with Input Convex Neural Networks
DEGREE PROJECT IN COMPUTER SCIENCE AND ENGINEERING, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2020
SHUYUAN ZHANG
KTH ROYAL INSTITUTE OF TECHNOLOGY, SCHOOL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCE
Master in Machine Learning
Date: July 27, 2020
Supervisor: John Folkesson
Examiner: Hossein Azizpour
Host company: RISE AB

Abstract

Modelling and solving real-life problems with reinforcement learning (RL) is an important branch of artificial intelligence (AI). In board games, AlphaZero has proven successful against professional human players and other AI programs in games such as Go, Chess, and Shogi. The basic components of the AlphaZero algorithm are Monte Carlo Tree Search (MCTS) and deep neural networks that predict state values and policies. These networks fit the mapping from a state to its value and policy, making the initialization of the state value and policy during search more accurate. In this thesis project, we propose Convex-AlphaZero, which uses a new prediction structure for the state value and policy, and we assess its viability with theoretical arguments and experimental results. Instead of obtaining these values with a single feed-forward pass, our adaptation treats prediction as an optimization process, using input convex neural networks that model the state value as a convex function of the policy given the state (i.e. the game board configuration). The results of our experiments show that our method outperforms traditional mini-max approaches and is worth further research on games other than Connect Four, the game used in this thesis project.

Sammanfattning

(Swedish version of the abstract above.)
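The key mechanism behind the abstract's claim is that an input convex neural network keeps its output convex in one designated input (here the policy) by constraining the weights applied to previous convex-path activations to be non-negative and by using convex, non-decreasing activations such as ReLU, while depending arbitrarily on the other input (the state). The following is a minimal sketch of such a partially input-convex value network in PyTorch; the class name, layer sizes, and fully connected (rather than convolutional) structure are illustrative assumptions, not the thesis's actual architecture (described in Chapter 3).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PartiallyConvexValueNet(nn.Module):
    """Sketch of a partially input-convex network V(s, pi):
    convex in the policy input pi, unrestricted in the state input s."""

    def __init__(self, state_dim, policy_dim, hidden=64):
        super().__init__()
        # State path: an ordinary (non-convex) feed-forward branch.
        self.state_fc1 = nn.Linear(state_dim, hidden)
        self.state_fc2 = nn.Linear(hidden, hidden)
        # Convex path: weights applied to previous convex activations must be
        # non-negative to preserve convexity in pi.
        self.z1 = nn.Linear(policy_dim, hidden)
        self.z2_pi = nn.Linear(policy_dim, hidden)          # direct skip from pi
        self.z2_z = nn.Linear(hidden, hidden, bias=False)   # constrained >= 0
        self.out_pi = nn.Linear(policy_dim, 1)
        self.out_z = nn.Linear(hidden, 1, bias=False)       # constrained >= 0
        self.state_to_z1 = nn.Linear(hidden, hidden)
        self.state_to_z2 = nn.Linear(hidden, hidden)

    def forward(self, state, pi):
        u1 = F.relu(self.state_fc1(state))
        u2 = F.relu(self.state_fc2(u1))
        # ReLU is convex and non-decreasing, so each z_k stays convex in pi.
        z1 = F.relu(self.z1(pi) + self.state_to_z1(u1))
        z2 = F.relu(self.z2_pi(pi) + self.z2_z(z1) + self.state_to_z2(u2))
        return self.out_pi(pi) + self.out_z(z2)

    def project_weights(self):
        # Clamp the convex-path weights so V(s, .) remains convex in pi.
        for layer in (self.z2_z, self.out_z):
            layer.weight.data.clamp_(min=0.0)
```

Calling project_weights() after every optimizer step is one simple way to maintain the non-negativity constraint during training.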
Contents

1 Introduction
  1.1 Motivation
  1.2 Problem Definition
  1.3 Research Question
  1.4 Scope, challenges, and limitations
  1.5 Contributions
  1.6 Societal Impacts
  1.7 Ethical Considerations
  1.8 UN SDG Goals
  1.9 Acknowledgements
2 Background
  2.1 Reinforcement Learning
    2.1.1 Basics of RL
    2.1.2 Markov Decision Processes (MDP)
    2.1.3 Sampling Methods in RL
    2.1.4 Policy Optimization Using Policy Gradient Methods
    2.1.5 Model-based and Model-free RL
    2.1.6 Deep Reinforcement Learning
    2.1.7 Exploration and Exploitation
  2.2 AlphaZero Overall Description
    2.2.1 Deep Neural Network in AlphaZero
    2.2.2 Monte Carlo Tree Search in AlphaZero
    2.2.3 Playing
    2.2.4 Replay Buffer
  2.3 Input Convex Neural Networks
    2.3.1 Structure of ICNN
    2.3.2 Inference in ICNN
    2.3.3 Training in ICNN
    2.3.4 ICNN in AlphaZero
    2.3.5 Application of ICNNs in RL
  2.4 The Game of Connect Four
    2.4.1 Game Introduction
    2.4.2 Previous Solutions of the Game
3 Methods
  3.1 Input Convex Neural Network in AlphaZero
    3.1.1 Network Structure
    3.1.2 Inference
    3.1.3 Training
    3.1.4 Causality Reasoning
  3.2 Variance Reduction
    3.2.1 Uniform Policy
    3.2.2 Merging Matching States
  3.3 Other Details
    3.3.1 Game State Representation
    3.3.2 Extending Data Set
    3.3.3 Clipping And Normalizing Policies
4 Results
  4.1 Training Curves
  4.2 Player Strength Comparison
    4.2.1 Play with a Mini-max Agent
    4.2.2 AlphaZero vs Convex-AlphaZero
    4.2.3 Winning Rate under Different Decision Time
  4.3 Experiments about the ICNN
    4.3.1 Raw Network Performance
    4.3.2 Average Game Length
5 Discussion
  5.1 Comments on General Performance
  5.2 Comments on ICNN Performance
  5.3 Limitations
  5.4 Future Work
6 Conclusions
Bibliography
A Computing Platform Configuration

Chapter 1 Introduction

Human-computer competition on board games has been a hot topic in computer science and artificial intelligence for decades. As early as 1948, Alan Turing showed the possibility of letting intelligent machines play games such as chess, bridge, and poker like a human [1]. Then in 1956, the first chess program, Los Alamos [2], was developed by Paul Stein and Mark Wells for the MANIAC I computer, which opened the era of playing chess and other games with computers. Mini-max [3] is a basic algorithm used in game-playing programs. The most challenging task for such programs is searching game states efficiently.
Chess, as one of the most famous board games, has about 10^47 [4] possible game states. A complete search of such a large game space is impossible even with modern supercomputers. As a result, scientists have tried hard to reduce the scale of the search, either by limiting the depth of Mini-max search or by pruning [5]. The most successful implementation of a Mini-max based approach was Deep Blue [6], a Chess program that defeated the reigning human Chess world champion, Kasparov, in 1997. When it comes to Go, however, such algorithms no longer work, because Go has about 10^170 game states (10^123 times more than Chess), which makes the Mini-max based approach infeasible. Although the game of Go is extremely complex, this did not discourage scientists, who turned to more modern approaches such as reinforcement learning (RL) [7] and deep learning. The combination of RL and search algorithms finally gave birth to AlphaGo [8], which defeated the human world champion Lee Sedol in 2016.

1.1 Motivation

The direct motivation of this project is to propose and evaluate a new variant of AlphaZero, called Convex-AlphaZero. It treats the state value as a function of both the state and the policy using input convex neural networks. AlphaZero achieved state-of-the-art results when compared with other AI programs, but its network maps states directly to their values, ignoring the relationship between policies and values. Convex-AlphaZero may benefit from introducing an extra causal link between the current policy and the value using a two-input, one-output network. We would like to investigate the performance of the proposed method and see how the modification affects the agent's overall performance.

The motivation of this project goes beyond applying Convex-AlphaZero to board games. In real-life automatic control or planning, the amount of data can be extremely large, data pre-processing is expensive, and supervised learning may be infeasible. Reinforcement learning algorithms like AlphaZero, which do not need human effort to process the data and reduce the computing cost through sampling, may bring a new form of artificial intelligence to these practical areas.

1.2 Problem Definition

AlphaZero currently uses a feed-forward network to predict a given state's value and best policy. We change this network to an input convex neural network to model the state value in another way. Input convex neural networks have been proven effective in areas such as multi-label classification, image completion, and continuous-action reinforcement learning. In conclusion, the main problem we want to discuss can be boiled down to: How does replacing the convolutional neural network in AlphaZero with an input convex neural network affect the agent's overall performance?
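To make the proposed change concrete, here is a sketch of how inference could look once the value is modelled as a convex function of the policy: rather than reading the policy off a single forward pass, the policy itself is optimized against the fixed board state. It assumes the PartiallyConvexValueNet sketched earlier, a softmax parameterization to keep the policy on the probability simplex, and a plain gradient-based optimizer; the thesis's actual inference procedure (Section 3.1.2), including the sign convention and the solver, may differ.

```python
import torch

def infer_value_and_policy(net, state, n_actions, steps=50, lr=0.1):
    """Inference as optimization: search over the policy for the optimum of the
    convex value V(s, pi), instead of reading it off a single forward pass.
    The policy is parameterized as pi = softmax(logits) so it stays a distribution."""
    logits = torch.zeros(1, n_actions, requires_grad=True)
    optimizer = torch.optim.Adam([logits], lr=lr)
    for _ in range(steps):
        pi = torch.softmax(logits, dim=-1)
        # Sign convention is an assumption: the network output is treated here
        # as a convex objective to be minimized over pi for the fixed state.
        objective = net(state, pi).mean()
        optimizer.zero_grad()
        objective.backward()
        optimizer.step()
    with torch.no_grad():
        pi = torch.softmax(logits, dim=-1)
        return net(state, pi).item(), pi.squeeze(0)

# Hypothetical usage (dimensions chosen for a 6x7 Connect Four board, 7 moves):
# net = PartiallyConvexValueNet(state_dim=42, policy_dim=7)
# value, policy = infer_value_and_policy(net, torch.randn(1, 42), n_actions=7)
```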
Recommended publications
  • Game Changer
    Matthew Sadler and Natasha Regan, Game Changer: AlphaZero's Groundbreaking Chess Strategies and the Promise of AI, New In Chess 2019. Contents: Explanation of symbols 6; Foreword by Garry Kasparov 7; Introduction by Demis Hassabis 11; Preface 16; Introduction 19; Part I: AlphaZero's history 23; Chapter 1: A quick tour of computer chess competition 24; Chapter 2: ZeroZeroZero 33; Chapter 3: Demis Hassabis, DeepMind and AI 54; Part II: Inside the box 67; Chapter 4: How AlphaZero thinks 68; Chapter 5: AlphaZero's style – meeting in the middle 87; Part III: Themes in AlphaZero's play 131; Chapter 6: Introduction to our selected AlphaZero themes 132; Chapter 7: Piece mobility: outposts 137; Chapter 8: Piece mobility: activity 168; Chapter 9: Attacking the king: the march of the rook's pawn 208; Chapter 10: Attacking the king: colour complexes 235; Chapter 11: Attacking the king: sacrifices for time, space and damage 276; Chapter 12: Attacking the king: opposite-side castling 299; Chapter 13: Attacking the king: defence 321; Part IV: AlphaZero's ...
    [Show full text]
  • Reinforcement Learning (1) Key Concepts & Algorithms
    Winter 2020, CSC 594 Topics in AI: Advanced Deep Learning, 5. Deep Reinforcement Learning (1): Key Concepts & Algorithms (most content adapted from OpenAI 'Spinning Up'), Noriko Tomuro. Reinforcement Learning (RL): RL is a type of machine learning where an agent learns to achieve a goal by interacting with the environment -- trial and error. RL is one of three basic machine learning paradigms, alongside supervised learning and unsupervised learning. The purpose of RL is to learn an optimal policy that maximizes the return over the sequence of the agent's actions. RL has recently enjoyed a wide variety of successes, e.g. robot control in simulation as well as in the real world, and strategy games such as AlphaGo (by Google DeepMind) and Atari games (https://en.wikipedia.org/wiki/Reinforcement_learning). Deep Reinforcement Learning (DRL): a policy is essentially a function that maps each of the agent's actions to the expected return or reward; DRL uses deep neural networks for this function (and other components). Some key concepts and terminology. States and observations: a state s is a complete description of the state of the world; for now, we can think of states as belonging to the environment. An observation o is a partial description of a state, which may omit information. A state can be fully or partially observable to the agent; if partially, the agent forms an internal state (or state estimate). In deep RL, we almost always represent states and observations by a real-valued vector, matrix, or higher-order tensor (https://spinningup.openai.com/en/latest/spinningup/rl_intro.html).
    [Show full text]
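As a concrete companion to the definitions in the excerpt above, the following is a minimal, framework-free sketch of the agent-environment loop and of the discounted return that an optimal policy is meant to maximize. The env and policy objects are hypothetical placeholders with a gym-like reset/step interface, not part of any specific library.

```python
def discounted_return(rewards, gamma=0.99):
    """G = r_0 + gamma * r_1 + gamma^2 * r_2 + ..., accumulated from the end."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

def run_episode(env, policy):
    """One episode of the agent-environment loop: observe, act, collect reward.
    `env.reset()` returns an initial observation; `env.step(a)` returns
    (next_observation, reward, done) -- a simplified, assumed interface."""
    obs = env.reset()
    rewards, done = [], False
    while not done:
        action = policy(obs)               # the policy maps observations to actions
        obs, reward, done = env.step(action)
        rewards.append(reward)
    return discounted_return(rewards)
```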
  • OLIVAW: Mastering Othello with Neither Humans Nor a Penny Antonio Norelli, Alessandro Panconesi Dept
    OLIVAW: Mastering Othello with neither Humans nor a Penny. Antonio Norelli, Alessandro Panconesi, Dept. of Computer Science, Università La Sapienza, Rome, Italy. Abstract: We introduce OLIVAW, an AI Othello player adopting the design principles of the famous AlphaGo series. The main motivation behind OLIVAW was to attain exceptional competence in a non-trivial board game at a tiny fraction of the cost of its illustrious predecessors. In this paper, we show how the AlphaGo Zero paradigm can be successfully applied to the popular game of Othello using only commodity hardware and free cloud services. While being simpler than Chess or Go, Othello maintains a considerable search space and difficulty in evaluating board positions. To achieve this result, OLIVAW implements some improvements inspired by recent works to accelerate the standard AlphaGo Zero learning process. The main modification implies doubling the positions collected per game during the training phase, by including also positions not played but largely explored by the agent. ... of companies. Another aspect of the same problem is the amount of training needed. AlphaGo Zero required 4.9 million games played during self-play, while to attain the level of grandmaster for games like Starcraft II and Dota 2 the training required 200 years and more than 10,000 years of gameplay, respectively [7], [8]. Thus one of the major problems to emerge in the wake of these breakthroughs is whether comparable results can be attained at a much lower cost, computational and financial, and with just commodity hardware. In this paper we take a small step in this direction, by showing that AlphaGo Zero's successful paradigm can be replicated for the game ...
    [Show full text]
  • Monte-Carlo Tree Search As Regularized Policy Optimization
    Monte-Carlo tree search as regularized policy optimization. Jean-Bastien Grill*, Florent Altché*, Yunhao Tang*, Thomas Hubert, Michal Valko, Ioannis Antonoglou, Rémi Munos. Abstract: The combination of Monte-Carlo tree search (MCTS) with deep reinforcement learning has led to significant advances in artificial intelligence. However, AlphaZero, the current state-of-the-art MCTS algorithm, still relies on handcrafted heuristics that are only partially understood. In this paper, we show that AlphaZero's search heuristics, along with other common ones such as UCT, are an approximation to the solution of a specific regularized policy optimization problem. With this insight, we propose a variant of AlphaZero which uses the exact solution to this policy optimization problem, and show experimentally that it reliably outperforms the original algorithm in multiple domains. AlphaZero employs an alternative handcrafted heuristic to achieve super-human performance on board games (Silver et al., 2016). Recent MCTS-based MuZero (Schrittwieser et al., 2019) has also led to state-of-the-art results in the Atari benchmarks (Bellemare et al., 2013). Our main contribution is connecting MCTS algorithms, in particular the highly-successful AlphaZero, with MPO, a state-of-the-art model-free policy-optimization algorithm (Abdolmaleki et al., 2018). Specifically, we show that the empirical visit distribution of actions in AlphaZero's search procedure approximates the solution of a regularized policy-optimization objective. With this insight, our second contribution is a modified version of AlphaZero that comes with significant performance gains over the original algorithm, especially in cases where AlphaZero has been observed to ...
    [Show full text]
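The excerpt's central claim, that the empirical visit distribution of AlphaZero's search approximates the solution of a regularized policy-optimization problem, can be written (in the spirit of the paper; the exact notation and the direction of the KL term may differ) as an objective over the probability simplex $\mathcal{S}$:

$$\bar{\pi} \;=\; \arg\max_{y \in \mathcal{S}} \Big[\, q^{\top} y \;-\; \lambda_N \, \mathrm{KL}\big(\pi_\theta \,\|\, y\big) \Big],$$

where $q$ collects the Q-value estimates at the root, $\pi_\theta$ is the network's prior policy, and $\lambda_N$ is a regularization weight that decays as the total visit count $N$ grows. The proposed variant then acts on this exact solution instead of on the raw visit counts.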
  • Towards Incremental Agent Enhancement for Evolving Games
    Evaluating Reinforcement Learning Algorithms For Evolving Military Games. James Chao*, Jonathan Sato*, Crisrael Lucero, Doug S. Lange, Naval Information Warfare Center Pacific (*equal contribution) ffi[email protected]. Abstract: In this paper, we evaluate reinforcement learning algorithms for military board games. Currently, machine learning approaches to most games assume certain aspects of the game remain static. This methodology results in a lack of algorithm robustness and a drastic drop in performance upon changing in-game mechanics. To this end, we will evaluate general game playing (Diego Perez-Liebana 2018) AI algorithms on evolving military games. Introduction: AlphaZero (Silver et al. 2017a) described an approach that trained an AI agent through self-play to achieve super-human performance. ... games in 2013 (Mnih et al. 2013), Google DeepMind developed AlphaGo (Silver et al. 2016) that defeated world champion Lee Sedol in the game of Go using supervised learning and reinforcement learning. One year later, AlphaGo Zero (Silver et al. 2017b) was able to defeat AlphaGo with no human knowledge and pure reinforcement learning. Soon after, AlphaZero (Silver et al. 2017a) generalized AlphaGo Zero to be able to play more games including Chess, Shogi, and Go, creating a more generalized AI to apply to different problems. In 2018, OpenAI Five used five Long Short-term Memory (Hochreiter and Schmidhuber 1997) neural networks and a Proximal Policy Optimization (Schulman et al. 2017) method to defeat a professional DotA team, each LSTM acting as a player in a team to collaborate and achieve a common goal. AlphaStar used a transformer (Vaswani et al. ...
    [Show full text]
  • Understanding & Generalizing Alphago Zero
    Under review as a conference paper at ICLR 2019. UNDERSTANDING & GENERALIZING ALPHAGO ZERO. Anonymous authors, paper under double-blind review. ABSTRACT: AlphaGo Zero (AGZ) (Silver et al., 2017b) introduced a new tabula rasa reinforcement learning algorithm that has achieved superhuman performance in the games of Go, Chess, and Shogi with no prior knowledge other than the rules of the game. This success naturally begs the question whether it is possible to develop similar high-performance reinforcement learning algorithms for generic sequential decision-making problems (beyond two-player games), using only the constraints of the environment as the "rules." To address this challenge, we start by taking steps towards developing a formal understanding of AGZ. AGZ includes two key innovations: (1) it learns a policy (represented as a neural network) using supervised learning with cross-entropy loss from samples generated via Monte-Carlo Tree Search (MCTS); (2) it uses self-play to learn without training data. We argue that the self-play in AGZ corresponds to learning a Nash equilibrium for the two-player game; and the supervised learning with MCTS is attempting to learn the policy corresponding to the Nash equilibrium, by establishing a novel bound on the difference between the expected return achieved by two policies in terms of the expected KL divergence (cross-entropy) of their induced distributions. To extend AGZ to generic sequential decision-making problems, we introduce a robust MDP framework, in which the agent and nature effectively play a zero-sum game: the agent aims to take actions to maximize reward while nature seeks state transitions, subject to the constraints of that environment, that minimize the agent's reward.
    [Show full text]
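For context on the supervised-learning step mentioned in the excerpt, AlphaGo Zero's per-position training loss (Silver et al., 2017) combines a squared value error, a cross-entropy term between the MCTS visit distribution $\pi$ and the network policy $p_\theta$, and L2 regularization:

$$\ell(\theta) \;=\; \big(z - v_\theta(s)\big)^2 \;-\; \pi^{\top} \log p_\theta(s) \;+\; c\,\lVert \theta \rVert^2,$$

where $z$ is the final game outcome as seen from the sampled position $s$ and $c$ is a small regularization constant.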
  • Efficiently Mastering the Game of Nogo with Deep Reinforcement
    electronics Article: Efficiently Mastering the Game of NoGo with Deep Reinforcement Learning Supported by Domain Knowledge. Yifan Gao 1,*,† and Lezhou Wu 2,†. 1 College of Medicine and Biological Information Engineering, Northeastern University, Liaoning 110819, China; 2 College of Information Science and Engineering, Northeastern University, Liaoning 110819, China; [email protected]; * Correspondence: [email protected]; † These authors contributed equally to this work. Abstract: Computer games have been regarded as an important field of artificial intelligence (AI) for a long time. The AlphaZero structure has been successful in the game of Go, beating the top professional human players and becoming the baseline method in computer games. However, the AlphaZero training process requires tremendous computing resources, imposing additional difficulties for the AlphaZero-based AI. In this paper, we propose NoGoZero+ to improve the AlphaZero process and apply it to a game similar to Go, NoGo. NoGoZero+ employs several innovative features to improve training speed and performance, and most improvement strategies can be transferred to other nonspecific areas. This paper compares it with the original AlphaZero process, and results show that NoGoZero+ increases the training speed to about six times that of the original AlphaZero process. Moreover, in the experiment, our agent beat the original AlphaZero agent with a score of 81:19 after only being trained by 20,000 self-play games' data (small in quantity compared with 120,000 self-play games' data consumed by the original AlphaZero). The NoGo game program based on NoGoZero+ was the runner-up in the 2020 China Computer Game Championship (CCGC) with limited resources, defeating many AlphaZero-based programs.
    [Show full text]
  • AI Chips: What They Are and Why They Matter
    APRIL 2020. AI Chips: What They Are and Why They Matter (An AI Chips Reference). Authors: Saif M. Khan, Alexander Mann. Table of Contents: Introduction and Summary 3; The Laws of Chip Innovation 7; Transistor Shrinkage: Moore's Law 7; Efficiency and Speed Improvements 8; Increasing Transistor Density Unlocks Improved Designs for Efficiency and Speed 9; Transistor Design is Reaching Fundamental Size Limits 10; The Slowing of Moore's Law and the Decline of General-Purpose Chips 10; The Economies of Scale of General-Purpose Chips 10; Costs are Increasing Faster than the Semiconductor Market 11; The Semiconductor Industry's Growth Rate is Unlikely to Increase 14; Chip Improvements as Moore's Law Slows 15; Transistor Improvements Continue, but are Slowing 16; Improved Transistor Density Enables Specialization 18; The AI Chip Zoo 19; AI Chip Types 20; AI Chip Benchmarks 22; The Value of State-of-the-Art AI Chips 23; The Efficiency of State-of-the-Art AI Chips Translates into Cost-Effectiveness 23; Compute-Intensive AI Algorithms are Bottlenecked by Chip Costs and Speed 26; U.S. and Chinese AI Chips and Implications for National Competitiveness 27; Appendix A: Basics of Semiconductors and Chips 31; Appendix B: How AI Chips Work 33; Parallel Computing 33; Low-Precision Computing 34; Memory Optimization 35; Domain-Specific Languages 36; Appendix C: AI Chip Benchmarking Studies 37; Appendix D: Chip Economics Model 39; Chip Transistor Density, Design Costs, and Energy Costs 40; Foundry, Assembly, Test and Packaging Costs 41; Acknowledgments 44. Introduction and Summary: Artificial intelligence will play an important role in national and international security in the years to come.
    [Show full text]
  • ELF Opengo: an Analysis and Open Reimplementation of Alphazero
    ELF OpenGo: An Analysis and Open Reimplementation of AlphaZero. Yuandong Tian, Jerry Ma*, Qucheng Gong*, Shubho Sengupta*, Zhuoyuan Chen, James Pinkerton, C. Lawrence Zitnick. Abstract: The AlphaGo, AlphaGo Zero, and AlphaZero series of algorithms are remarkable demonstrations of deep reinforcement learning's capabilities, achieving superhuman performance in the complex game of Go with progressively increasing autonomy. However, many obstacles remain in the understanding of and usability of these promising approaches by the research community. Toward elucidating unresolved mysteries and facilitating future research, we propose ELF OpenGo, an open-source reimplementation of the AlphaZero algorithm. ELF OpenGo is the first open-source Go AI to convincingly demonstrate superhuman performance with a perfect (20:0) record against global top professionals. We apply ELF OpenGo to conduct extensive ablation studies, and to identify and analyze numerous interesting phenomena in both the model training ... However, these advances in playing ability come at significant computational expense. A single training run requires millions of selfplay games and days of training on thousands of TPUs, which is an unattainable level of compute for the majority of the research community. When combined with the unavailability of code and models, the result is that the approach is very difficult, if not impossible, to reproduce, study, improve upon, and extend. In this paper, we propose ELF OpenGo, an open-source reimplementation of the AlphaZero (Silver et al., 2018) algorithm for the game of Go. We then apply ELF OpenGo toward the following three additional contributions. First, we train a superhuman model for ELF OpenGo. After running our AlphaZero-style training software on 2,000 GPUs for 9 days, our 20-block model has achieved superhuman performance that is arguably comparable to the 20-block models described in Silver et al. (2017) and Silver et al. (2018).
    [Show full text]
  • Applying Deep Double Q-Learning and Monte Carlo Tree Search to Playing Go
    CS221 FINAL PAPER: Applying Deep Double Q-Learning and Monte Carlo Tree Search to Playing Go. Booher, Jonathan [email protected]; De Alba, Enrique [email protected]; Kannan, Nithin [email protected]. I. INTRODUCTION: For our project we replicate many of the methods used in AlphaGo Zero to make an optimal Go player; however, we modified the learning paradigm to a version of Deep Q-Learning which we believe would result in better generalization of the network to novel positions. The modification of Deep Q-Learning that we use is called Deep Double Q-Learning and will be described later. The evaluation metric for the success of our agent is the percentage of games that are won against our Oracle, a Go-playing bot available in the OpenAI Gym. Since we are implementing a version of reinforcement learning, there is no data that we will need other than the simulator. By training on the games that are generated from self-play, our player will output a policy that is learned at the end of training by ... the current position. By sampling states from the self-play games along with their respective rewards, the researchers were able to train a binary classifier to predict the outcome of a game with a certain confidence. Then based on the confidence measures, the optimal move was taken. We depart from this method of training and use Deep Double Q-Learning instead. We use the same concept of sampling states and their rewards from the games of self play, but instead of feeding this information to a binary classifier, we feed the information to a modified Q-Learning formula, which we present shortly. III. CHALLENGES: The main challenge we faced was the computational complexity of the game of Go.
    [Show full text]
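The excerpt is cut off before the authors' exact formula. For reference, the standard deep double Q-learning target (van Hasselt et al.), which their modification presumably builds on, decouples action selection (online parameters $\theta$) from action evaluation (target parameters $\theta^{-}$):

$$y_t \;=\; r_t \;+\; \gamma\, Q_{\theta^{-}}\!\Big(s_{t+1},\; \arg\max_{a'} Q_{\theta}\big(s_{t+1}, a'\big)\Big),$$

with the online parameters updated by gradient descent on $\big(Q_\theta(s_t, a_t) - y_t\big)^2$.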
  • Minigo: a Case Study in Reproducing Reinforcement Learning Research
    Minigo: A Case Study in Reproducing Reinforcement Learning Research Brian Lee, Andrew Jackson, Tom Madams, Seth Troisi, Derek Jones Google, Inc. {brianklee, jacksona, tmadams, sethtroisi, dtj}@google.com Abstract The reproducibility of reinforcement-learning research has been highlighted as a key challenge area in the field. In this paper, we present a case study in reproducing the results of one groundbreaking algorithm, AlphaZero, a reinforcement learning system that learns how to play Go at a superhuman level given only the rules of the game. We describe Minigo, a reproduction of the AlphaZero system using publicly available Google Cloud Platform infrastructure and Google Cloud TPUs. The Minigo system includes both the central reinforcement learning loop as well as auxiliary monitoring and evaluation infrastructure. With ten days of training from scratch on 800 Cloud TPUs, Minigo can play evenly against LeelaZero and ELF OpenGo, two of the strongest publicly available Go AIs. We discuss the difficulties of scaling a reinforcement learning system and the monitoring systems required to understand the complex interplay of hyperparameter configurations. 1 Introduction In March 2016, Google DeepMind’s AlphaGo [1] defeated world champion Lee Sedol by using two deep neural networks (a policy and a value network) and Monte Carlo Tree Search (MCTS) to synthesize the output of these two neural networks. The policy network was trained via supervised learning from human games, and the value network was trained from a much larger corpus of synthetic games generated by sampling game trajectories from the policy network. AlphaGo Zero[2], published in October 2017, described a continuous pipeline, which when initialized with random weights, could train itself to defeat the original AlphaGo system.
    [Show full text]
  • Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm
    Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. David Silver,1∗ Thomas Hubert,1∗ Julian Schrittwieser,1∗ Ioannis Antonoglou,1 Matthew Lai,1 Arthur Guez,1 Marc Lanctot,1 Laurent Sifre,1 Dharshan Kumaran,1 Thore Graepel,1 Timothy Lillicrap,1 Karen Simonyan,1 Demis Hassabis1. 1DeepMind, 6 Pancras Square, London N1C 4AG. ∗These authors contributed equally to this work. Abstract: The game of chess is the most widely-studied domain in the history of artificial intelligence. The strongest programs are based on a combination of sophisticated search techniques, domain-specific adaptations, and handcrafted evaluation functions that have been refined by human experts over several decades. In contrast, the AlphaGo Zero program recently achieved superhuman performance in the game of Go by tabula rasa reinforcement learning from games of self-play. In this paper, we generalise this approach into a single AlphaZero algorithm that can achieve, tabula rasa, superhuman performance in many challenging games. Starting from random play, and given no domain knowledge except the game rules, AlphaZero achieved within 24 hours a superhuman level of play in the games of chess and shogi (Japanese chess) as well as Go, and convincingly defeated a world-champion program in each case. The study of computer chess is as old as computer science itself. Babbage, Turing, Shannon, and von Neumann devised hardware, algorithms and theory to analyse and play the game of chess. Chess subsequently became the grand challenge task for a generation of artificial intelligence researchers, culminating in high-performance computer chess programs that perform at superhuman level (9, 14).
    [Show full text]