Integrating Reinforcement Learning Into Strategy Games
Integrating Reinforcement Learning into Strategy Games

by Stefan Wender
Supervised by Ian Watson
The University of Auckland, Auckland, New Zealand

A Thesis submitted in fulfillment of the requirements for the degree of Master of Science in Computer Science
The University of Auckland, February 2009

Abstract

This thesis describes the design and implementation of a machine learning agent based on four different reinforcement learning algorithms. The agent is integrated into the commercial computer game Civilization IV, a turn-based empire-building game from the Civilization series, and is applied to the city placement selection task, which determines the founding sites for a player's cities. The four reinforcement learning algorithms evaluated are the off-policy algorithms one-step Q-learning and Q(λ) and the on-policy algorithms one-step Sarsa and Sarsa(λ). The aim of the research presented in this thesis is the creation of an adaptive machine learning approach for a task that is originally performed by a complex deterministic script. Since the machine learning approach is nondeterministic, it results in a more challenging and dynamic computer AI. The thesis presents an empirical evaluation of the performance of the reinforcement learning approach and compares the adaptive agent with the original deterministic game AI; the comparison shows that the reinforcement learning approach outperforms the deterministic game AI. Finally, the behaviour and performance of the reinforcement learning algorithms are elaborated on, and the algorithms are further improved by analysing and revising their parameters. The performance of the four algorithms is compared using their identified optimal parameter settings, and Sarsa(λ) is shown to perform best at the task of city site selection.
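The contrast the abstract draws between the off-policy and on-policy one-step algorithms comes down to which successor action each update bootstraps from. A minimal sketch of the two update rules in their standard textbook form (the thesis's own pseudocode appears as Algorithms 2 and 3, listed in the List of Algorithms below):

\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \bigl[ r_{t+1} + \gamma \max_{a'} Q(s_{t+1}, a') - Q(s_t, a_t) \bigr] \quad \text{(one-step Q-learning, off-policy)}
\]
\[
Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \bigl[ r_{t+1} + \gamma\, Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \bigr] \quad \text{(one-step Sarsa, on-policy)}
\]

Q-learning bootstraps from the greedy successor action regardless of the action the agent actually takes next, whereas Sarsa bootstraps from the action a_{t+1} that the current policy actually selects; this is the sense in which the former is off-policy and the latter on-policy.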
Acknowledgements

I am very grateful for the supervision and support of my advisor Ian Watson. Without his ideas and guidance this thesis would not have been possible. My fellow computer science students Sebastian, Paul, Ali and Felix made sure that the time in the lab was never wasted, or at least never wasted alone. Thank you for many insightful and enlightening conversations. The comments and time given by my readers Clare and Christine greatly improved and clarified this thesis. I would especially like to thank my family who have supported me in every possible way for all my life, whatever I wanted to do, wherever I wanted to go. To my mum and dad, Susanne and Bernd, and to my brother and sister, Andreas and Christine, thank you for always being supportive.

Contents

List of Tables
List of Figures
List of Algorithms
1 Introduction, Motivation and Objectives
2 Thesis Outline
3 Background
  3.1 Machine Learning in Games
  3.2 Evolution of Computer Game AI
  3.3 Origins of Reinforcement Learning
  3.4 Related Work
    3.4.1 Q-Learning
    3.4.2 Sarsa
    3.4.3 Reinforcement Learning and Case-Based Reasoning
    3.4.4 Neuroevolution
    3.4.5 Dynamic Scripting
    3.4.6 Motivated Reinforcement Learning
    3.4.7 Civilization Games as Testbed for Academic Research
  3.5 Algorithms
    3.5.1 The Markov Property
    3.5.2 Markov Decision Processes
    3.5.3 Temporal-Difference Learning
    3.5.4 The Q-Learning Algorithm
    3.5.5 The Sarsa Algorithm
    3.5.6 Eligibility Traces
4 Testbed
  4.1 Selection Criteria
  4.2 Testbed Selection
  4.3 The Civilization Game
  4.4 Existing Civilization IV Game AI
  4.5 City Placement Task
  4.6 Existing Procedure for City Foundation
5 Design and Implementation
  5.1 Policies
    5.1.1 Greedy Policies
    5.1.2 ε-Soft Policies
    5.1.3 ε-Greedy Policies
    5.1.4 Softmax Policies
  5.2 Reinforcement Learning Model
    5.2.1 States
    5.2.2 Actions
    5.2.3 The Transition Probabilities
    5.2.4 The Reward Signal
  5.3 Implementation of the Reinforcement Learning Algorithms
    5.3.1 Integration of the Algorithms
    5.3.2 Implementation of Eligibility Traces
  5.4 Accelerating Convergence
    5.4.1 Reduction of the State Space
    5.4.2 Pre-Initialised Q-Values
6 Evaluation
  6.1 Experimental Setup
    6.1.1 Algorithm Parameters
    6.1.2 Game Settings
  6.2 Parameter Selection for the Reinforcement Learning Algorithms
    6.2.1 Aim of Parameter Selection
    6.2.2 Preliminary Test Runs
    6.2.3 Test Runs with Full Discount
    6.2.4 Test Runs with Pre-Initialised Q-Values
    6.2.5 Test Runs with a Reduced Number of Turns
  6.3 Comparison of Reinforcement Learning Algorithm Performance
  6.4 Comparison of the Standard Game AI with Reinforcement Learning
  6.5 Optimising the Learning Rate α
7 Discussion and Future Work
8 Conclusion
A Paper Presented at the 2008 IEEE Symposium on Computational Intelligence and Games (CIG’08)
B Results for 1000 Episodes of Length 60 Turns using Sarsa with Declining ε-Greedy Policy
C RLState class
References
Abbreviations

List of Tables

4.1 Comparison of Civilization Games
5.1 Relationship between the Map Size and the Number of Possible States
6.1 Game Settings
6.2 Parameter Settings for Preliminary Runs
6.3 Parameter Settings for Full Discount Runs
6.4 Parameter Settings for Tests with Pre-Initialised Q-Values
6.5 Parameter Settings for Test Runs with Reduced Length
6.6 Parameter Settings for Comparison of Reinforcement Learning with the Standard AI
List of Figures

4.1 Civilization IV: Workable City Radius
4.2 Civilization IV: The Border of an Empire
4.3 City Distribution
5.1 Excerpt of the State Space S including the Actions that lead to the Transition
5.2 Computation of the Rewards
5.3 General View of the Algorithm Integration for City Site Selection
5.4 Flow of Events for City Site Selection
5.5 Computation of the Rewards
5.6 Comparison of Value Updates between One-Step Q-Learning and Watkins’ Q(λ)
6.1 Results for Preliminary Tests using the One-Step Versions of the Algorithms
6.2 Results for Preliminary Tests using the Eligibility Trace Versions of the Algorithms
6.3 Results for Fully Discounted One-Step Versions of the Algorithms
6.4 Results for Fully Discounted Eligibility Trace Versions of the Algorithms
6.5 Results for Pre-Initialised Q-Values using the One-Step Versions of the Algorithms
6.6 Results for Pre-Initialised Q-Values using the Eligibility Trace Versions of the Algorithms
6.7 Results for Short Runs using the One-Step Versions of the Algorithms
6.8 Results for Short Runs using the Eligibility Trace Versions of the Algorithms
6.9 Illustration of Early Algorithm Convergence
6.10 Results for Declining ε-Greedy using the One-Step Versions of the Algorithms
6.11 Results for Declining ε-Greedy using the Eligibility Trace Versions of the Algorithms
6.12 Performance Comparison between Standard AI and Reinforcement Learning Algorithms
6.13 Comparison of Different Learning Rates using One-Step Q-Learning
6.14 Comparison of Different Learning Rates using One-Step Sarsa
6.15 Comparison of Different Learning Rates using Q(λ)
6.16 Comparison of Different Learning Rates using Sarsa(λ)
7.1 Performance Comparison between Standard AI and Reinforcement Learning Algorithms with Optimal Settings

List of Algorithms

1 A Simple TD Algorithm for Estimating V^π
2 Pseudocode for One-Step Q-Learning
3 Pseudocode for One-Step Sarsa
4 Pseudocode for TD(λ)
5 Pseudocode for Sarsa(λ)