
Player Modelling in Civilization IV

Freek den Teuling

HAIT Master Thesis series nr. 10-001

THESIS SUBMITTED IN PARTIAL FULFILMENT

OF THE REQUIREMENTS FOR THE DEGREE OF

MASTER OF ARTS IN COMMUNICATION AND INFORMATION SCIENCES,

MASTER TRACK HUMAN ASPECTS OF INFORMATION TECHNOLOGY,

AT THE FACULTY OF HUMANITIES

OF TILBURG UNIVERSITY

Thesis committee:

Dr. Ir. P.H.M. (Pieter) Spronck

Dr. M.M. (Menno) van Zaanen

Prof. Dr. E.O. (Eric) Postma

Tilburg University

Faculty of Humanities

Department of Communication and Information Sciences

Tilburg, The Netherlands

April 2010

Preface

Knowledge is power. Knowledge about an opponent is power over that opponent. This reasoning forms the base of opponent modelling. Once we are capable of making a model of an opponent, we have nothing to fear from that opponent. We know the opponent; we know his weaknesses, his skills, his preferences. That is the power of opponent modelling.

As a student in the master track Human Aspects of Information Technology at Tilburg University, a fanatic gamer, and a computer enthusiast, I instantly associated the modelling of an opponent with computers and games. It would be great for gamers if computers were able to create a model of human players. Ideas came to mind of computer games that act based on the specific needs of players (as ally or as adversary), and of game applications that are capable of automatically modelling opponents; in short, an increase in the level of interaction and adaptation between player and computer. However, these were wild ideas without knowledge of the possibilities.

During meetings with my thesis supervisor Dr. Ir. Pieter Spronck we discussed several games we could use as a digital research field, such as WORLD OF WARCRAFT, NEVERWINTER NIGHTS and SID MEIER'S CIVILIZATION IV. Pieter had already done some experiments with CIVILIZATION IV in which he used classifiers to predict when a player would declare war in that game. His experience with the game was one of the reasons to choose CIVILIZATION IV as our digital research field. We altered his concept so that it could be used for opponent modelling. My goal would be to find out to what extent it is possible to create a player model for the game CIVILIZATION IV by using a classifier. The goal of this research is not to actually create a player model, but rather to find out whether our approach is a valid one for creating player models. While reading the literature, it became clear that opponent modelling by means of greedy classification has two major disadvantages: (1) large amounts of data are needed in order to train a classifier, and (2) training a classifier takes time. However, our approach is designed to bypass these problems, which makes it a usable concept to actually be implemented in future games.

This research would not have been completed without the ever helpful Pieter Spronck. I have absorbed large amounts of his time with endless talks about my research. Whenever I was facing obstacles in my research, he helped me past them. My friends and family have also been fantastic by offering advice and moral support. Furthermore, I want to thank the participants, Walter and Job, for their time.

May this research be of as much value to you as it was to me.

Have fun reading!

Freek den Teuling


Abstract

In order for humans to play a game, they just need to understand the rules of the game. However, to become victorious in a game it is necessary to understand both the game and the opponent. In other words, besides knowing the rules it is important to have an opponent model. An opponent model, or player model, is an abstract representation of a player. It can consist of a player's strengths, weaknesses, preferences, strategies, skills, or a combination of those. The player model in this research is based on the preferences of players and is constructed by means of classification. The construction of a player model by means of classification requires large datasets that need to be generated by players, which in turn requires time. To speed up the construction of the player model, computer-controlled players with preferences embedded in their code are used to create a classification model. Such computer-controlled players were found in the commercial computer game SID MEIER'S CIVILIZATION IV, in which the computer-controlled players are represented by leaders of different civilizations. A selection of these computer-controlled leaders is used to generate a large database. The classifier trains on this database in order to become capable of predicting the preferences of computer-controlled players and even human players, thus creating player models.

After an introduction to player modelling, classification, and CIVILIZATION IV, the mid-section of the research follows. It comprises the preparations, the experiments, the results, and the discussion of these results. The preparation consists of: (1) the process of selecting the computer-controlled players, (2) the generation of data, (3) the construction of databases, (4) the selection of the appropriate classifier, and (5) tweaking and fine-tuning mechanics. Three experiments are conducted: (1) Classification Model Validation, in which the constructed classification models are validated, (2) Modelling of AI, in which an attempt is made to predict the preferences of computer-controlled players, and (3) Modelling of Humans, in which an attempt is made to model two human players. Following the experiment section is a detailed result section, displaying the results of the experiments. Noteworthy results are further elaborated on in the discussion. The conclusions of our investigation are that (1) a fairly accurate preference-based player model can be constructed by means of classification; classification therefore seems a suitable approach to player modelling. (2) The predictions on computer-controlled players other than the ones used to create the database are not that accurate. A possible reason is the choice of preferences as class: besides preferences there appear to be more influences that steer the actions of the leaders, e.g. special-purpose code. Furthermore, we notice that it is harder for the classifier to distinguish computer-controlled players that have roughly similar preferences. (3) The creation of a human player model does not seem that accurate either. Interestingly, the accuracy of the player model appears to differ depending on play style. This subject is only briefly researched, however, and is well worth future research.


Table of Contents

Preface

Abstract

Table of Contents

1. Introduction

2. Background
   2.1. Sid Meier's Civilization IV
   2.2. Player Modelling
   2.3. Sequential Minimal Optimization Classifier

3. Experimental Setup
   3.1. Research Elaboration
   3.2. Leaders and Preferences
   3.3. Data Generation
   3.4. Observations
   3.5. Classifier Selection
   3.6. Experiments
        Classification Models Validation
        Modelling of AI
        Modelling of Humans

4. Results
   4.1. Classification Models Validation
        InfoGain and GainRatio
        Minus100
        Validation
        Summary
   4.2. Modelling of AI
        Preference Classification
        Leader Classification
        Summary
   4.3. Modelling of Humans
        Casual Player
        Expert Player
        Close or Not?
        Summary

5. Discussion
   5.1. Constructing the Classification Models
   5.2. Predicting Preference or Player
        First Solution
        Second Solution
   5.3. How to Classify Humans?

6. Conclusions
   6.1. Suitable Player Model
   6.2. Predicting AI Opponents
   6.3. Predicting Human Opponents
   6.4. Answer to the Problem Statement
   6.5. Future Work
        Implementation in Civilization IV

References

Appendices
   A. Features
   B. SMO
   C. InfoGain & GainRatio
   D. InfoGain & GainRatio Features
   E. Reports
   F. SMO Output


"Know thy self, know thy enemy. A thousand battles, a thousand victories."
Sun Tzu (500 BC)

1 Introduction

When humans play games, they strive to master these games. The mastering of a game is a prerequisite to become victorious in a game. That is not the only prerequisite, however. Besides mastering the game it is equally important, or perhaps even more important, to know the opponent. To become victorious, it is necessary to understand the opponent's preferences, strategies, skills and weaknesses. This combination of (1) mastering the game itself and (2) learning about the opponent is part of the preparations of top grandmasters in CHESS, CHECKERS or SHOGI (Carmel & Markovitch, 1993; Van den Herik, Donkers & Spronck, 2005).

Mastering a game can be achieved by observing games, learning the rules, and practising. In order to learn about an opponent, a model can be constructed. Such a model is called an opponent model or, more generally, a player model. A player model can be described as an abstract representation of certain characteristics of a player and his behaviour. A player model can specify a player's preferences, strategies, skills and weaknesses, or any combination of those (Houlette, 2004; Van den Herik et al., 2005; Donkers & Spronck, 2006). Other research has shown that it is possible to create a player model based on the actions that players take in certain game states (Houlette, 2004). However, the actual actions of a player are defined by the strategy of that player, which is in turn defined by the player's preferences. A player model that is purely based on actions and the prediction of actions tends to be of limited accuracy (Donkers & Spronck, 2006). To overcome the issue of limited accuracy, and driven by the desire to model the mind of a player, a different approach is chosen in this research. Here, an attempt is made to model the preferences of a player rather than the actions of a player. The problem statement is defined as follows:

To what extent can a model be constructed of a player, which accurately predicts the player's preferences?

We have limited our research to players in the commercial computer game SID MEIER'S CIVILIZATION IV. First, an attempt will be made to create a classification model based on data generated by computer-controlled players. Second, an attempt will be made to model other computer-controlled players with the constructed classification model. Third, an attempt will be made to model human players with the constructed classification model. This has been formulated in three research questions:

1. What is a suitable player model for the computer game SID MEIER'S CIVILIZATION IV?
2. To what extent can a model be constructed, using a classification algorithm, that recognizes the preferences of a computer-controlled player in SID MEIER'S CIVILIZATION IV?
3. To what extent can a model be constructed, using a classification algorithm, that recognizes the preferences of a human player in SID MEIER'S CIVILIZATION IV?

An advantage of constructing a classification model on data generated by computer-controlled players, before using it to model a player (computer-controlled or human), is that it bypasses two major disadvantages of using greedy classifiers. The first disadvantage of a greedy classifier is the need for human players to invest time in generating data for the classifier to train on. Data to create a database is still needed in this approach; however, that database can be constructed by using computer-controlled players.


In other words, no human players need to invest time in the generation of data. The second disadvantage of a greedy classifier is the amount of time it takes for a classifier to train on the data. In this research, time is still needed to train the classifier. However, once the classifier is trained, the modelling of players, computer-controlled or human, will only take seconds.

The outline of this thesis is as follows. In Chapter 2 background information is provided, containing the ins and outs of the computer game SID MEIER'S CIVILIZATION IV, an elaboration on the approach of player modelling in general, and finally a description of the classifier that is used in this research. Chapter 3 concerns the experimental setup. It contains an elaboration on this research, an explanation of the selected preferences and computer-controlled players, the process of transferring the preferences to the game world, the selection of which features should be incorporated in the database, validation of the used classifier, and finally an explanation of the experiments. Chapter 4 contains the results of the experiments. These results are discussed in Chapter 5, followed by the conclusions and recommendations for future work in Chapter 6.


2 Background

This chapter serves as background and reference for the rest of this research. In Section 2.1 the commercial computer game SID MEIER'S CIVILIZATION IV is described. It gives an overview of the computer game and its mechanics. Furthermore, it describes the reason why this particular computer game is chosen as the digital research field. In Section 2.2 the modelling of a player is elaborated on further, describing player modelling in classical games as well as in modern games. In Section 2.3 the classification algorithm that is used for this research is explained.

2.1 Sid Meier's Civilization IV

SID MEIER'S CIVILIZATION IV (CIV 4) is a turn-based strategy (TBS) game in which the player builds an empire. In general, TBS games are strategically oriented computer games. An important difference with most common computer games is that TBS games are played in turns rather than in real-time. Board games like CHESS and CHECKERS are good examples of turn-based games; similar to those games, players take alternating turns in CIVILIZATION IV. Taking turns provides more time for players to think about their next action. Strategies and planning are therefore even more important for TBS games than for real-time strategy (RTS) games.

In CIV 4 a player begins by selecting an empire and an appropriate leader. There are 18 different empires available and a total of 26 leaders. Once the empire and leader have been selected, the game starts in the year 4000 BC. From here on the player has to compete with rival leaders, manage cities, develop infrastructure, encourage scientific and cultural progress, found religions, etcetera. An original characteristic of CIV 4 is that defeating the opponent is not the only way to be victorious. There are six conditions to be victorious, as mentioned in the CIVILIZATION IV MANUAL (2005): (1) Time Victory, (2) Conquest Victory, (3) Domination Victory, (4) Cultural Victory, (5) Space Race and (6) Diplomatic Victory. Because of these six different victory conditions, the relation between the player and the opponent is different from most strategy games. For the main part of the game the player is at peace with his opponents. Therefore it is possible to interact, to negotiate, to trade, to threaten and to make deals with opponents. Only after declaring war or being declared war upon is a player at war. Any player can declare war at any time, unless that player is in an agreement with an opponent which specifically forbids a war declaration. To provide an impression of the game, an in-game screenshot of CIV 4 is displayed in Figure 2.1.

The reasons to choose CIV 4 as the digital research field were fourfold. (1) Previous research has been done with this game. Having previous research as a reference prevents mistakes that could otherwise have been made. Furthermore, it provides guidelines and even useable programs (Houlette, 2004; Rohs, 2007). (2) There was an extensive database already available containing data of numerous played games. The extensive database reduced time-consuming data generation, although this existing database is drastically expanded in this research (see Chapter 3, Section 3.4). (3) There are multiple (peaceful) ways to be victorious. The multiple victory conditions and the possibility of interaction between players stimulate the use of preferences and strategies. (4) Most importantly, the computer-controlled leaders act based on preferences given to them by the designers of the game. This last fact forms the core of this thesis. As mentioned in Chapter 1, the general goal is to find out to what extent a model can be constructed of a player, which accurately predicts the preferences of the player. The predefined preferences that are implemented by the designers of CIV 4 provide a solid base upon which classification models can be built. This will be further elaborated on in Section 3.1.

Figure 2.1 – A screenshot of a game of CIV 4. The border between two empires is visible, as well as one city of each empire. Furthermore, one can see the division of the land into tiles, the availability of resources, different terrains, and some units.

2.2 Player Modelling

In order to elaborate on player modelling, a distinction is made between classical games and computer games. First, a short overview of the purposes of a player model is given, followed by player modelling in classical games and then in computer games. For an extensive description of the history of player modelling, see Donkers (2003).

Basically, a model of a player can be used in three different ways. (1) In order for a game to adapt to the player it is beneficial to have a player model. Knowing how a player acts can give the game data (and thus knowledge) about the strengths and weaknesses of the player. A game can use this data to defeat strong players by exploiting the weaknesses of those players, or by tutoring the players to overcome their weaknesses. (2) A player model can be used to implement humanlike characteristics in a non-player character (NPC). By copying human models, hence human behaviour, into NPCs, a game can be made more realistic (Van den Herik et al., 2005; Donkers & Spronck, 2006; Bakkes, to appear). (3) Another benefit of a player model is that a model can maximize game results even when a game-theoretical optimal solution is known (Bakkes, Spronck & Van den Herik, 2009). Take for example the game of ROSHAMBO (Rock-Paper-Scissors). The game-theoretical optimal solution is for both players to randomly select one of the three options. The result will roughly be that both players win half of the time and lose half of the time. Now consider an opponent that always chooses rock. Sticking with the game-theoretical optimal solution will have the same result as before. Using the knowledge about the opponent, hence the model of the opponent, one could maximize the result by choosing paper all the time (Fürnkranz, 2007).

Player modelling was envisaged a long time ago in the domain of classic games. In the 1970s, chess programs incorporated a contempt factor: the program accepted a draw against a stronger opponent and declined a draw against a weaker opponent (Van den Herik et al., 2005). This took the other player into consideration and was therefore considered the first form of player modelling. The first attempt to really model an opponent in a classic game was made by Slagle and Dixon (1970), who used an optimized minimax procedure. In 1993, research specifically focussed on opponent modelling. In Israel, Carmel and Markovitch (1993) investigated, in depth, the learning of models of opponent strategies. At the same time in the Netherlands, Iida, Uiterwijk, and Van den Herik (1993) investigated potential applications of opponent-model search. Both research teams called their approach opponent-model search. In 1994, Uiterwijk and Van den Herik invented a search technique to speculate on the fallibility of the opponent player. In the 2000s a probabilistic opponent model was defined by Donkers, Uiterwijk & Van den Herik (2001) and Donkers (2003), which incorporated the player's uncertainty about the opponent's strategy (Bakkes, to appear).

The general aim of player modelling in computer games is to make the games more entertaining to the player, whereas in classic games the general aim is to beat the opponent (Van den Herik et al., 2005). This is one of the reasons player modelling is of increasing importance in modern computer games (Fürnkranz, 2007). According to Van den Herik et al. (2005), player modelling has two main goals in a computer game's artificial intelligence (AI): (1) as a companion and (2) as an opponent.

Companion role: In order for the AI to be a good companion, it is necessary that the AI behaves according to the human's expectations. For instance, when the human player prefers a sneaky approach to deal with hostile characters (e.g., by attempting to remain undetected), he will not be pleased when the computer-controlled companions immediately attack every hostile character that is near. The entertainment value of a game would soon be impaired if the AI fails to predict the preferences of the human player (Bakkes, to appear).

Opponent role: As an opponent it is important for the AI to keep the game interesting. Research has shown that if an opponent is too weak, the player quickly loses interest in the computer game. On the other hand, it is not desirable for the AI to be stronger than the player. This results in the player getting stuck, which also reduces the entertainment value (Van Lankveld, Spronck & Van den Herik, 2009).

Player modelling has been around for some decades now. It started with classic games and is now becoming a point of interest in modern computer games. Much research has been done, but there is still much to learn.

2.3 Sequential Minimal Optimization Classifier

Data extracted from the game world through observations needs to be classified in order to construct a player model. There are many different classification algorithms available. To choose the optimal classification method we considered several different classification methods, each of which uses its own approach to classification. Among them were: (1) the naive Bayes classifier NaiveBayes, (2) the optimised Support Vector Machine (SVM) 'Sequential Minimal Optimization' (SMO), (3) the k-nearest-neighbour classifier IBk, (4) the decision tree builder J48, and (5) the rule-based classifier JRip. Based on a small test, described in Chapter 3, Section 3.5, the fastest and most accurate classification algorithm was SMO. Essentially, SMO is closely related to SVM and is designed to outperform a standard SVM (Platt, 1998, 2000). To understand SMO it is necessary to understand SVM. In the remainder of this section a short overview of SVM is given, concluding with the added value of SMO. For in-depth information about SMO, see Platt (1998); Keerthi, Shevade, Bhattacharyya & Murthy (2000); Mak (2000). For further information on the other four classifiers we refer to John & Langley's (1995) Estimating Continuous Distributions in Bayesian Classifiers, Aha & Kibler's (1991) Instance-based Learning Algorithms, Quinlan's (1993) C4.5: Programs for Machine Learning, and Cohen's (1995) Fast Effective Rule Induction, respectively. These four classification methods are not elaborated on, since they are of little relevance to this research.


SVM uses linear models to implement nonlinear class boundaries. This is done by transforming the input using a nonlinear mapping. A new instance space is created, called the vector space. In this vector space a linear line can be drawn that appears nonlinear in the real instance space. The instance space is shown in Figure 2.2, where it is clearly visible that the instances cannot be separated by a single linear line. Figure 2.3 shows the same instances, but in a vector space in which they can easily be divided by a linear line.

Figure 2.2 – Nonlinearly separable instance space (instances of Class A and Class B)

Figure 2.3 – Nonlinearly separable instance space after transformation into a linearly separable vector space

It is best explained by considering a dataset that consists of twenty-four instances, also called vectors. Each vector is filled with a number of attribute values, followed by a class. In this case there are two classes, Class A and Class B, so each vector belongs to Class A or Class B. By using a nonlinear mapping formula the vectors are transformed into points in a vector space. These points are called feature vectors. The basic idea is that the SVM can draw a linear line that separates all the feature vectors of Class A from all the feature vectors of Class B. The linear model is called a hyperplane. This is shown in Figure 2.3, where Class A is represented by diamonds and Class B by squares. Whenever a new vector is encountered, it will be transformed to a feature vector. Depending on the place of the feature vector in the vector space, it is attributed to Class A or Class B.

The SVM uses an even more elaborate technique than an ordinary hyperplane, called the maximum margin hyperplane. This is the hyperplane which is furthest away from both classes and which is orthogonal to the shortest line connecting the outer boundaries. The instances that are closest to the maximum margin hyperplane are called support vectors. These support vectors determine the layout of the hyperplane. Each class has at least one support vector, but often more. In Figure 2.4 the vectors that are considered support vectors are represented by larger squares and diamonds. It is important to note that when the support vectors are known, all the other vectors are irrelevant. They can be deleted without changing the shape of the maximum margin hyperplane, since the hyperplane is constructed based solely upon the vectors of one class with minimal distance to the vectors of the other class. All the vectors that are further away do not influence the maximum margin hyperplane.

Figure 2.4 – Maximum margin hyperplane with support vectors in a vector space
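The nonlinear mapping behind Figures 2.2 and 2.3 can be made concrete with a standard textbook example (an illustration; it is not necessarily the mapping used by the SMO implementation in this research). For two attributes $x_1$ and $x_2$, consider the quadratic mapping

\[
\varphi(x_1, x_2) = \left( x_1^2,\ \sqrt{2}\,x_1 x_2,\ x_2^2 \right).
\]

For any two instances $\mathbf{x}$ and $\mathbf{z}$ it then holds that

\[
\varphi(\mathbf{x}) \cdot \varphi(\mathbf{z}) = (\mathbf{x} \cdot \mathbf{z})^2,
\]

so the dot products the SVM needs in the vector space can be computed directly from the original attributes, without ever constructing the feature vectors explicitly. This is commonly known as the kernel trick.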

The example given here is that of a rather simple classification problem and would not take long to execute. However, using an SVM with a large dataset containing multiple classes and large numbers of attributes per vector results in a computationally complex problem that is not feasible in practice. It would take a considerable amount of time and computational resources to classify such a dataset. This problem is also known as constrained quadratic optimization. This is where SMO provides the solution. SMO divides the quadratic programming (QP) optimization problem into smaller parts. These small QP problems are solved analytically, which avoids using a time-consuming numerical QP optimization as an inner loop. The amount of memory required for SMO is linear in the training set size, which allows SMO to handle very large training sets. Because matrix computation is avoided, SMO scales somewhere between linear and quadratic in the training set size for various test problems (Platt, 1998). Concluding, SMO is an optimization of an SVM: it reduces the time and computational resources that an SVM would need to perform a classification, without losing accuracy.


3 Experimental Setup

The aim of this research is to explore to what extent we can create a preference-based player model by using a classifier. In order to structure the research we formulated a research plan, which is a theoretical representation of the intended research process. This research plan is further elaborated on in Section 3.1. Section 3.2 revolves around the selection of the computer-controlled leaders and their preferences. Once those are selected, the generation of data starts in Section 3.3. Which data is to be observed and stored is discussed in Section 3.4, followed by the selection of the appropriate classifier in Section 3.5. Finally, an explanation of the experiments is provided in Section 3.6.

3.1 Research Elaboration

The choice for the commercial computer game CIV 4 as the digital research field has four reasons (Section 2.1). The most important reason is the availability of preferences in the game code of the CIV 4 leaders. These preferences are the base of our player model. The goal is to make a player model based on the preferences of players. In order to create the player model we need to start with the construction of classification models, one for each preference. Once the classification models are constructed, an attempt is made to model (1) computer-controlled players and (2) human players.

We believe that, given a player's preferences, this player exhibits certain strategies. However, strategies themselves cannot be observed, since a strategy is not explicitly manifested by parameters of the game world. The actions of the player, on the other hand, are based upon those strategies and are, unlike the strategies themselves, observable in the game world. Via mechanics described in Section 3.3 such observations can be extracted from the game world and stored in a database, Database A. The observations are linked to the preferences of the computer-controlled players that generate the observable actions. Because the observations are linked to the preferences, it is possible for the SMO classifier to train on the observations in the database. After the classifier has trained on the data, one classification model per preference is created. Each classification model is assumed to be capable of predicting one preference of the computer-controlled players, even for players that were not part of the data generation. The combination of the classification models of all the preferences is considered a player model. In the first stage we validate the classification models by letting them predict the preferences of the same computer-controlled players that are used to construct the classification models. In Figure 3.1 an overview is presented of the process of building a classification model. It contains seven aspects consisting of: (1) squares that represent entities, (2) ovals that indicate actions, processes or steps, and (3) a cylinder that represents a database. Each entity is a prerequisite to reach the next entity through an action, process or step.

Determining the accuracy of the predictions of the classification models on computer-controlled players and human players is the following step in our process of player modelling. Once the classification models are built, they are assumed to be able to predict the preferences of other computer-controlled players and even human players. For the created classification models to provide predictions on these new players, new databases are needed. These databases are called Database B for the computer-controlled players and Database C for the human players.


Figure 3.1 – Building a classification model based on the preferences implemented in the computer-controlled players. The figure depicts:
(1) A computer-controlled (AI) player; more precisely, the preferences of that player.
(2) The process of transferring preferences through actions into the game world.
(3) The game world.
(4) The process of observing the game world and extracting data.
(5) A database named Database A; this database is used for the experiment Classification Model Validation.
(6) The process of the classifier training on the data in Database A.
(7) A classification model, capable of predicting one preference.

Determining the accuracy of the predictions on computer-controlled players and human players is rather similar to the creation of the classification models, with minor adjustments. This process is presented in Figure 3.2. It contains nine aspects, rather similar to Figure 3.1, consisting of: (1) squares that represent entities, (2) ovals that indicate actions, processes or steps, and (3) a cylinder that represents a database. Each entity is a prerequisite to reach the next entity through an action, process or step.

Figure 3.2 – Building a player model based on a classification model and player data. The figure depicts:
(1) A player; computer-controlled or human.
(2) The process of transferring preferences through actions into the game world.
(3) The game world.
(4) The process of observing the game world and extracting data.
(5) A database; Database B for the experiment Modelling of AI, Database C for the experiment Modelling of Humans.
(6) The process of feeding the data to the previously built classification model.
(7) The classification model.
(8) The process of combining the results of the classification models into a player model.
(9) A player model.

The first five entities in Figure 3.2 are similar to the first five entities in Figure 3.1. These entities are necessary to create a database of game data from which predictions about preferences can be made. Starting at entity six, the approach slightly differs from the approach in Figure 3.1. Instead of letting the SMO classifier train on the data in the database, the data is inserted into the previously built classification models. In other words, the previously created classification models are used to predict preferences based on new data. This results in predictions of the preferences of the players that are used to create the new databases. Database B will consist of game data from a set of computer-controlled players, different from those initially used to create the classification model. Database C will be created by human players.

Assuming the accuracy of the predictions is sufficient, the predictions form the base of our player model. The player model then provides the name of a player, his preferences, and his preference values. Ultimately the player model would have preferences and preference values similar to those of the actual player.

3.2 Leaders and Preferences

In CIV 4 there are 26 leaders to choose from. Each leader has its own personality. These personalities are for a significant part attributed to preferences that are in the leader’s game code. In this section we will discuss and motivate the choice of the leaders and their preferences to create the classification model.

The leaders in CIV 4 are designed with what the game designers call 'flavours'. All these flavours are attributed to each of the 26 leaders; in other words, each leader has these flavours. Flavours are identical to what we call preferences. Besides these flavours there are several other parameters that can be interpreted as preferences, also found in the game code of each leader. Of the flavours, six are chosen to function as preferences in this research. Finally, the preference Aggression is added to the selected six flavours. Aggression is one of the additional parameters that also influence the leader's behaviour. It is added to the six preferences because a player's tendency towards aggression is valuable information in CIV 4. This sums up to a total of seven preferences that serve as the base on which the player models will be built. These preferences are called: (1) Aggression, (2) Culture, (3) Gold, (4) Growth, (5) Military, (6) Religion, and (7) Science. All 26 leaders have these seven preferences; only the values of the preferences differ per leader. This ensures that each leader will exhibit different behaviour in the game world. These preference values can also be found in the game code. For the six preferences that are defined as flavours, these values can be in the range {0, 2, 5, 10}, which can be interpreted as no preference, minor preference, major preference, and only preference, respectively. For the added preference Aggression the preference values are in the range {1, 2, 3, 4, 5}, which can be interpreted as very low aggression, low aggression, medium aggression, high aggression, and very high aggression, respectively.

In Section 2.1 it was mentioned that a large database of game data was already available. That game data was accumulated by letting leaders duel each other. Using that data implies that three leaders were already selected for us. Three additional leaders were randomly selected to bolster the ranks to six leaders, who are to function as the base for our classification models. These leaders are presented in Table 3.1 with the preference values per preference and a subjective description of each leader's behaviour in the game world, based on game play experience.
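Before turning to Table 3.1, the sketch below shows one possible in-code representation of such a preference vector. It is an illustration only; the types and names (Preference, LeaderProfile) are hypothetical and not the actual CIV 4 game code.

import java.util.EnumMap;
import java.util.Map;

// The seven preferences used in this research.
enum Preference { AGGRESSION, CULTURE, GOLD, GROWTH, MILITARY, RELIGION, SCIENCE }

// A leader and his preference values: flavours take values in {0, 2, 5, 10},
// Aggression takes values in {1, 2, 3, 4, 5}.
public class LeaderProfile {
    private final String name;
    private final Map<Preference, Integer> values = new EnumMap<>(Preference.class);

    public LeaderProfile(String name) { this.name = name; }

    public LeaderProfile set(Preference p, int value) {
        values.put(p, value);
        return this;
    }

    public int get(Preference p) { return values.getOrDefault(p, 0); }
    public String name() { return name; }
}

For example, Alexander from Table 3.1 would be represented as new LeaderProfile("Alexander").set(Preference.AGGRESSION, 5).set(Preference.GROWTH, 2).set(Preference.MILITARY, 5), with the remaining preferences left at 0.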

Table 3.1
The six selected leaders with their preference values per preference and a subjective description of each leader's behaviour in the game world, based on game play experience

             Aggression  Culture  Gold  Growth  Military  Religion  Science
Alexander         5         0      0      2        5         0         0
Hatshepsut        3         5      0      0        0         2         0
Louis XIV         3         5      0      0        2         0         0
Mansa Musa        1         0      5      0        0         2         0
Napoleon          4         0      2      0        5         0         0
Tokugawa          4         0      0      0        2         0         5

Alexander shows some respect to other players that have a strong military force. He is even willing to enter into treaties with those players. As soon as he ascertains that his military force is larger than that of his opponent, he cancels the treaties with that opponent to attack him.

Hatshepsut is a peace-loving leader. She focuses on her own culture and expands her empire gradually. She makes many deals and treaties with other players and will almost never attack without serious provocation.

Louis XIV is a culturally oriented leader. He expands his empire by increasing its cultural value. He remains peaceful as long as his empire is not threatened by military or cultural means. Once at war with this leader, it becomes clear that he is not to be trifled with.

Mansa Musa generates lots of gold and invests it mainly into science. He is therefore almost the only leader who ever attempts to achieve a space victory. Playing against this leader always gives his opponents a scientific disadvantage.

Napoleon is a military oriented leader who is always looking for a fight. He produces many military units and is easily annoyed with other players. Once annoyed, he declares war without hesitation.

Tokugawa is a very isolated leader who is only concerned with his own empire. He will not negotiate with other players and will not even allow other players to pass through his lands. He is a fairly strong adversary once at war, because of some scientific advantage and a sufficiently large military force.

From Table 3.1 it can be concluded that the preference value 2 for Aggression and the value 10 for the other six preferences are missing. This can be attributed to the random selection of leaders. As a result not every preference value is represented; possible implications will be discussed in Chapter 5. Each preference value will function as a class for the SMO classifier. This implies that each preference requires a separate classification model, which results in seven classification models. The SMO classifier is trained on game data generated by these six leaders and their preference values. After training it is expected to predict preferences based on game data of other leaders and even human players. The classification model predicts the preference values of players. In the case of Aggression it predicts a player's preference value to be in {1, 3, 4, 5}. For the other preferences the classification model predicts a player's preference value to be in {0, 2, 5}.

To summarize, six leaders are selected. Each of these leaders has the same seven preferences; the difference can be found in the preference values. These leaders will be used to generate data on which the SMO classifier can train. The training results in the construction of one classification model per preference. Each classification model is assumed to predict the preference values of unknown players (computer-controlled or human).

3.3 Data Generation

The selection of six leaders (the Alexander Set) and their seven preferences results in a set of computer-controlled players that forms the base for creating a classification model. The process of transferring their preferences into actions in the game world is discussed in this section.

For the Alexander Set to transfer their preferences into actions in the game world, they need to play. Normally a computer-controlled player cannot play a game of CIV 4 alone; it requires at least one human player as opponent. However, it is not desirable to use human players to compete with the Alexander Set, for two reasons. (1) Consider the amount of time an average game of CIV 4 requires if played by a human player: this can vary from half an hour to three hours. If a human player were to play only once against each leader of the Alexander Set, it would take ten and a half hours on average. (2) The SMO classifier needs large amounts of data to train on, since it is a greedy classifier. This would require human players to play numerous games against the leaders of the Alexander Set, which would nullify the added value of this approach, since we aimed to minimize the need for humans to generate data.

A solution is found by using an add-on for CIV 4 called AIAutoPlay. This application makes it possible to let computer-controlled players fight each other without the requirement of a human player. It provides the option for a human player to give control to the AI for any desired number of turns. It only requires a human to initiate the game between computer-controlled players and set the number of turns the game should continue. For this research each game is set to take 460 turns, the reason being that games in CIV 4 always end with a winner in the 460th turn through a Time Victory (Section 2.1). If another victory condition has been met by a player before the 460th turn, the game will continue until it reaches the 460th turn; the turns after reaching a victory condition are erased afterwards. To improve the clarity of future observations, only duels between computer-controlled leaders of the Alexander Set are initiated. Creating games with more leaders would possibly hinder some leaders in acting on their preferences, creating unclear observations. The duels between the Alexander Set leaders are structured according to a battle plan. This battle plan is presented in Table 3.2.

Table 3.2
Battle plan presenting the duels for each leader (Player 1 vs. Player 2)

Alexander vs. Hatshepsut      Hatshepsut vs. Louis XIV
Alexander vs. Louis XIV       Hatshepsut vs. Mansa Musa
Alexander vs. Mansa Musa      Hatshepsut vs. Napoleon
Alexander vs. Napoleon        Hatshepsut vs. Tokugawa
Alexander vs. Tokugawa        Hatshepsut vs. Alexander

Louis XIV vs. Mansa Musa      Mansa Musa vs. Napoleon
Louis XIV vs. Napoleon        Mansa Musa vs. Tokugawa
Louis XIV vs. Tokugawa        Mansa Musa vs. Alexander
Louis XIV vs. Alexander       Mansa Musa vs. Hatshepsut
Louis XIV vs. Hatshepsut      Mansa Musa vs. Louis XIV

Napoleon vs. Tokugawa         Tokugawa vs. Alexander
Napoleon vs. Alexander        Tokugawa vs. Hatshepsut
Napoleon vs. Hatshepsut       Tokugawa vs. Louis XIV
Napoleon vs. Louis XIV        Tokugawa vs. Mansa Musa
Napoleon vs. Mansa Musa       Tokugawa vs. Napoleon

According to this battle plan each leader from the Alexander Set is Player 1 in five duels and Player 2 in five more duels. This sums up to a total of 10 games per leader per battle plan. To generate more data the battle plan was executed four times, summing up to a total of 40 games per leader and 240 games in total. These games are the start of a database on which the SMO classifier is to train. To assess the accuracy of the SMO classifier a test set is required. This test set contains data that the SMO classifier has not trained on, but of the same computer-controlled players that generated the training data. To this end 30 games per leader will serve as training data and 10 games per leader will serve as test data.
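A minimal sketch of how the battle plan of Table 3.2 could be generated programmatically; the class and method names are illustrative only.

import java.util.ArrayList;
import java.util.List;

public class BattlePlan {
    // Every ordered pair of distinct leaders: with six leaders this yields
    // 30 duels, in which each leader appears 10 times (5 as Player 1, 5 as Player 2).
    public static List<String[]> duels(List<String> leaders) {
        List<String[]> plan = new ArrayList<>();
        for (String playerOne : leaders)
            for (String playerTwo : leaders)
                if (!playerOne.equals(playerTwo))
                    plan.add(new String[] { playerOne, playerTwo });
        return plan;
    }
}

Executing this plan four times gives the 40 games per leader mentioned above.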


The numbers of games used as training set and test set are not arbitrary. They are based on a preliminary experiment, executed to determine whether the results of a 10-fold cross-validation on the training set are roughly similar to the results of a classification on future test sets. With 30 games per leader as training set and 10 games per leader as test set, the results differ only about 5%, which we found a statistically acceptable difference. It is necessary for the difference in results to be as small as 5% or less. Future adjustments, for instance to the database, need to be tested, and it is important to know whether an adjustment is an improvement or not. Conducting these experiments with the training set and the test set would probably cause overfitting on the test set; in other words, each adjustment would then be made to improve the results on that specific test set. If another test set were used, these adjustments might not be beneficial for that new test set. To prevent this from happening, all adjustments are tested by a 10-fold cross-validation on the training set.
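Such a 10-fold cross-validation can be run with WEKA's Java API roughly as follows; this is a minimal sketch, and the file name is a placeholder.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidate {
    public static void main(String[] args) throws Exception {
        // Load the training part of Database A; the class is the preference value.
        Instances train = new DataSource("databaseA_train.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);

        // 10-fold cross-validation with SMO; comparing this summary before and
        // after an adjustment shows whether the adjustment is an improvement.
        Evaluation eval = new Evaluation(train);
        eval.crossValidateModel(new SMO(), train, 10, new Random(1));
        System.out.println(eval.toSummaryString());
    }
}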

3.4 Observations

All duels serve the purpose of generating data. The games need to be observed and data needs to be extracted to create a database. In this section we discuss how we extract the data, which data we extract, and why we extract that data. The classification models that are to be constructed are based upon the observations made during the duels. It is essential that these observations are useful and contribute to the creation of the classification models. The purpose of the player model is to be applicable in the game for players to use. Therefore it is necessary that the data upon which the classification model is built, and later the player model, is available to all players at all times. The database can only contain data that is available to both leaders in the duels. In strategy games it is common that a player has little knowledge about his opponent's state. For example, most strategy games (including CIV 4) have a 'fog of war', which hides the opponent's units. As opposed to most strategy games, CIV 4 provides statistical information about other players via tables and schematics, which are freely accessible during a game. All that information can be used to create a database on which the classification model can train. An example of such a schematic is shown in Figure 3.3.

Figure 3.3 - Example of a schematic, containing the score for both players. This schematic is available to both players during a game of CIVILIZATION IV


The selected features need to meet two requirements: (1) the features need to be an indication of a preference and (2) the features must be available to each player at all times. The result is a list of twenty base features as displayed in Table 3.3.

Table 3.3
Base features and their meaning

#   Feature            Range      Meaning
1.  Turn               1-459      Turn number
2.  War                0, 1       0 = not in war; 1 = in war
3.  Cities             0-15       Number of cities
4.  Units              0-200      Number of units
5.  Population         0-200      Amount of population
6.  Gold               0-10000    Amount of gold
7.  Land               0-200      Amount of land tiles
8.  Plots              0-400      Amount of land and water tiles
9.  Techs              0-100      Number of technologies researched
10. Score              0-10000    Overall score
11. Economy            0-300      Overall economic score
12. Industry           0-500      Overall industrial score
13. Agriculture        0-400      Overall agriculture score
14. Power              0-4000     Overall power score
15. Culture            0-300000   Overall cultural score
16. Maintenance        0-100      Gold needed for maintenance per turn
17. GoldRate           0-1000     Amount of gold gained per turn
18. ResearchRate       0-2000     Amount of research gained per turn
19. CultureRate        0-3000     Amount of culture gained per turn
20. StateReligionDiff  -1, 0, 1   -1 = different religion; 0 = no religion; 1 = same religion

The game CIV 4 is a TBS (Section 2.1). This means that each player has a limited number of actions to perform in a turn. Once a player is not allowed any more actions, his turn ends and the turn of the opponent starts. At the end of each turn an observation of all these features is made, resulting in one row of data (an instance) which contains the turn, all the observations of the features (feature values), and the corresponding preference values of that specific player. This is how all observable data is stored in the database for the SMO classifier to train on.
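A minimal sketch of what recording such an instance could look like; the class name and output format are illustrative only, as the actual extraction is done by the modified AIAutoPlay application described at the end of this section.

import java.io.PrintWriter;
import java.util.List;

// Illustrative only: writes one comma-separated instance per turn, consisting
// of the turn number, the observed feature values, and the preference value
// (the class) of the observed player.
public class ObservationLogger {
    private final PrintWriter out;

    public ObservationLogger(PrintWriter out) { this.out = out; }

    public void logTurn(int turn, List<Double> featureValues, int preferenceValue) {
        StringBuilder row = new StringBuilder();
        row.append(turn);
        for (double value : featureValues) row.append(',').append(value);
        row.append(',').append(preferenceValue);
        out.println(row);
    }
}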

To extract even more useful information out of the game, several features are added. The added features are seven modifications of almost all 20 base features; the exceptions are (1) Turn, (2) War, and (3) StateReligionDiff. The seven modifications to the base features are displayed in Table 3.4, including the calculation (Spronck & Den Teuling, to appear) and the meaning of each modification. In the calculations, f_t denotes the value of a base feature in turn t and o_t denotes the opponent's value of that base feature.

Table 3.4
Modifications to the base features

#  Modification        Calculation                                    Meaning
1. Derivate            f_t − f_{t−1}                                  Increase or decrease in the base feature per turn
2. Trend               (f_t + f_{t−1} + ... + f_{t−4}) / 5            Average of the base feature over the last five turns
3. TrendDerivate       Trend_t − Trend_{t−1}                          Derivate of the trend
4. Diff                f_t − o_t                                      Difference of the base feature with the opponent's
5. DiffDerivate        Diff_t − Diff_{t−1}                            Derivate of the difference
6. DiffTrend           (Diff_t + Diff_{t−1} + ... + Diff_{t−4}) / 5   Trend of the difference
7. DiffTrendDerivate   DiffTrend_t − DiffTrend_{t−1}                  Derivate of the trend of the difference

To further clarify Table 3.4, an example is provided concerning the base feature Cities. By modifying it, seven new features are created: (1) CitiesDerivate, (2) CitiesTrend, (3) CitiesTrendDerivate, (4) CitiesDiff, (5) CitiesDiffDerivate, (6) CitiesDiffTrend, and (7) CitiesDiffTrendDerivate.
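The sketch below shows how such derived features could be computed from the per-turn history of a base feature. It is an illustration under the assumption of a five-turn trend window, as reconstructed in Table 3.4, and all names are hypothetical.

// Illustrative computation of the derived features of Table 3.4 from the
// per-turn history of one base feature, for the observed player (own) and
// the opponent (opp). Indices are turn numbers starting at 0.
public final class DerivedFeatures {
    /** Derivate: increase or decrease of the feature per turn. */
    static double derivate(double[] f, int t) {
        return t == 0 ? 0.0 : f[t] - f[t - 1];
    }

    /** Trend: average of the feature over the last five turns (fewer at the start). */
    static double trend(double[] f, int t) {
        int from = Math.max(0, t - 4);
        double sum = 0.0;
        for (int i = from; i <= t; i++) sum += f[i];
        return sum / (t - from + 1);
    }

    /** TrendDerivate: derivate of the trend. */
    static double trendDerivate(double[] f, int t) {
        return t == 0 ? 0.0 : trend(f, t) - trend(f, t - 1);
    }

    /** Diff: difference of the feature with the opponent's value. */
    static double diff(double[] own, double[] opp, int t) {
        return own[t] - opp[t];
    }
}

The remaining variants (DiffDerivate, DiffTrend, DiffTrendDerivate) follow by applying derivate and trend to the series of diff values.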

With the preference Aggression in mind, it is our intuition that the base feature War could be an asset in predicting this preference. Since the seven modifications cannot be applied to War (as it is a Boolean, not a numeric value), we designed five new modifications. Three modifications focus on the declaration of war instead of being in war; the other two modifications are based upon being in war. All five modifications, their means of calculation, and their meaning are displayed in Table 3.5.

Table 3.5
Modifications to the base feature War

1. DeclaredWar: 0 = not declaring war; 1 = declaring war. Since the value one is scarce, this feature contributes little by itself, but it is used to calculate other features. Declaring war is an indication of aggression.
2. CumulativeDeclaredWar: the current value of DeclaredWar plus the previous value of CumulativeDeclaredWar. Adds up each time a player declares war; the higher the value, the more aggressive the player.
3. AverageDeclaredWar: the current value of CumulativeDeclaredWar divided by the turn number. Taking the turns into account gives an indication of the time between declarations of war.
4. CumulativeWar: the current value of War plus the previous value of CumulativeWar. Adds up each turn a player is in war.
5. AverageWar: the current value of CumulativeWar divided by the turn number. Taking the turns into account gives an indication of the time spent in war.

All modifications to the 20 base features lead to a feature list containing 128 features. This list can be found in Appendix A. The extraction of these feature values from the game is made possible via a modified version of AIAutoPlay. Besides allowing two computer-controlled players to duel each other, the modified application is also capable of extracting feature data from the game. The use of a modified AIAutoPlay meant killing two birds with one stone: the creation of a large database with useful data, in a relatively short amount of time and with little human effort, ready for the SMO classifier to train on.

3.5 Classifier Selection

In the previous section the creation of the database was discussed. The next step is for a classifier to train on the database. To find the most suitable classifier, a classification experiment between five classifier types was designed (Section 2.3), these five classifiers being: (1) NaiveBayes, (2) SMO, (3) IBk, (4) J48, and (5) JRip. All five classifiers are found in the open-source software WEKA 1. In this section the results of this experiment are discussed.

1 WEKA is a collection of machine learning algorithms for data mining tasks. It contains tools for data pre-processing, classification, regression, clustering, association rules, and visualization. It is also well-suited for developing new machine learning schemes. It can be downloaded from http://www.cs.waikato.ac.nz/ml/weka/.


Two of the seven preferences were randomly chosen for the classifiers to be tested on: (1) Growth and (2) Military. This experiment was conducted before any of the modifications to the database discussed in Section 3.4 were made. This implies that the database used for this experiment only contained the original 20 features displayed in Table 3.3. Furthermore, it is important to note that we chose the standard settings of WEKA. The classifications were done with a test set. There is no reason to fear that overfitting on the test set might occur, since the features were not modified yet. The results of this experiment are presented in Table 3.6.

Table 3.6
Results of a classification presenting the accuracy of the predictions per classifier in percentages on the preference Growth, the preference Military, and the average accuracy of the predictions (n = 23999)

           NaiveBayes   SMO     IBk     J48     JRip
Growth       72.1%     82.5%   74.2%   75.0%   76.5%
Military     43.6%     54.4%   39.3%   44.6%   44.6%
Average      57.8%     68.5%   56.7%   59.8%   60.6%

From the results of this experiment, SMO was chosen as the optimal classifier for this research. Every following experiment is therefore executed with the SMO classifier and the standard settings of WEKA for that classifier. For these standard settings we refer to Appendix B.
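A comparison like the one behind Table 3.6 can be reproduced with WEKA's Java API roughly as follows. This is a minimal sketch with placeholder file names, using each classifier's default settings.

import weka.classifiers.Classifier;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.classifiers.functions.SMO;
import weka.classifiers.lazy.IBk;
import weka.classifiers.rules.JRip;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CompareClassifiers {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("growth_train.arff").getDataSet();
        Instances test = new DataSource("growth_test.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        Classifier[] classifiers = {
            new NaiveBayes(), new SMO(), new IBk(), new J48(), new JRip()
        };
        for (Classifier c : classifiers) {
            c.buildClassifier(train);                // train with WEKA's defaults
            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(c, test);             // classify the held-out test set
            System.out.printf("%s: %.1f%%%n",
                c.getClass().getSimpleName(), eval.pctCorrect());
        }
    }
}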

3.6 Experiments

In the previous sections all preparations leading to the creation of the classification models were discussed. The next step is the actual creation of the classification models. Creating the classification models is the first experiment to be conducted. The completion of the classification models also means that the research process from Figure 3.1 (Section 3.1) is complete. Following the creation of a classification model is the creation of player models for computer-controlled players and human players. These will be the second and third experiment, respectively; both relate to the research process depicted in Figure 3.2. This sums up to a total of three experiments: (1) Classification Model Validation, (2) Modelling of AI, and (3) Modelling of Humans. In this section we discuss these experiments: the preparations for these experiments and in particular their goals. Their results will be discussed in Chapter 4.

Classification Models Validation

The experiments in Classification Models Validation are designed to create and validate the seven classification models, one for each preference. Those classification models are then used in the experiments Modelling of AI and Modelling of Humans to create player models. The preparations for the creation of the classification models are described in Sections 3.2 to 3.5. Before an attempt is made to actually create the classification models, we first aim to increase the accuracy of the predictions of the future classification models. This will be done by manipulating the database with feature selection algorithms like InfoGain and GainRatio. Furthermore, an approach called Minus100 is attempted. Each of the possible improvements is tested by comparing the baseline of the training set to the results of a 10-fold cross-validation on that same training set.

To further elaborate, we chose a majority-class baseline (frequency baseline) instead of a chance baseline. By using a frequency baseline we take into account that not every preference value (class) is represented equally in our dataset; a chance baseline ignores that fact. The frequency baseline error rate is E = 1 − p, where p is the percentage of instances in the data that belong to the most frequent class. For example, suppose we have two classes, Class A and Class B, in a dataset of 40 instances, where Class A occurs in 30 of the instances and Class B in only 10. The chance baseline would still be 50%, since it is a two-way classification. However, the frequency baseline is 75%, since this is the percentage of instances in the data that belong to the most frequent class. The error rate is E = 1 − 75% = 25%, which more accurately represents the classes in the data set.


Following the possible improvements with regard to the accuracy of the future classification models, the classification models are finally validated using a test set. Validating the classification models means confirming that they are capable of predicting the preferences of the leaders in the Alexander Set. Although no actual player model for the leaders in the Alexander Set was made in this research, it would have been possible.

Modelling of AI

Once the classification models are validated, an attempt is made to predict the preferences of computer-controlled players other than the leaders from the Alexander Set. If these predictions are accurate enough, player models of these other computer-controlled players can be created. For the prediction of a new set of computer-controlled leaders, Database B needs to be created (Section 3.1). The creation of Database B follows exactly the same steps as the creation of Database A, which is described in Section 3.2 to Section 3.4. This overlap in approaches is already visible when comparing Figure 3.1, the creation of the classification model, with Figure 3.2, the modelling of computer-controlled players and human players based on the classification model. This means that a selection of six new leaders and their preferences is necessary. The selection was made randomly from among the remaining 20 leaders. The only restriction was that the new leaders could not have preference values that were not present in the Alexander Set. The six new computer-controlled leaders and their preferences are listed in Table 3.5 and are called the Cyrus Set.

Table 3.5 Selected leaders with their preference values based on the seven preferences, called the Cyrus Set

            Aggression  Culture  Gold  Growth  Military  Religion  Science
Cyrus       4           0        0     2       5         0         0

Although a strongly military-oriented leader, he is not reckless. He only attacks once provoked or threatened. However, once he has a powerful army and needs to expand, he will attack.

Montezuma   5           0        2     0       5         0         0

Very aggressive leader who prefers war over peace. This leader declares war without any provocation, pillaging and plundering everything he conquers.

Peter       4           0        0     2       0         0         5

Not a strong military force, but still hard to be at war with because of his scientific advantage. Focused mainly on science and expansion. He becomes agitated when there is no room to expand.

Saladin     3           0        0     0       5         2         0

Religious leader with a strong military presence. Appears wise, since he will only use his forces if necessary. Attempts to convert other players to the same religion, first by words, later by threats.

Victoria    3           0        5     2       0         0         0

Not a prominent leader, mainly dealing with her own affairs, increasing her gold and territory. No remarkable military forces and easy to conquer.

Washington  3           0        0     2       5         0         0

Resembles the style of Cyrus, although less aggressive. Only attacks when seriously provoked or taunted.

A slight mismatch between the Alexander Set and the Cyrus Set can be found with regard to the preference values: some preference values that occur in the Alexander Set do not occur in the Cyrus Set. In general such a mismatch does not matter. However, the SMO classifier in WEKA is unable to function when there is a mismatch in classes between a classification model and a test set. To overcome this mismatch, dummy classes are introduced to Database B. To clarify, a dummy class is introduced as a new instance in the database. This added instance consists solely of the value zero for each feature, with the mismatched preference value appended as class. These dummy classes are excluded from the results and are solely used to solve the mismatch. The construction of Database B follows the same battle plan as the one used for the Alexander Set (Section 3.3). The main difference is that the battle plan was executed only once for the Cyrus Set. This comes down to 10 games per leader, similar to the part of Database A that is to function as test set. The previously built classification models and Database B as test set are used to measure the accuracy of our classification models in predicting the preferences of other computer-controlled opponents. The results of the classification of the Cyrus Set lead to another experiment, which investigates whether the classification model is perhaps more suitable for predicting players than preferences. For that experiment Database A is used instead of Database B. The preference values in Database A are then replaced with numbers representing the computer-controlled players from the Alexander Set. The SMO classifier then trains on that database in an attempt to predict which computer-controlled players belong to which data.
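The dummy-class trick described above can be sketched as follows with the WEKA 3.7+ API. The all-zero instance and the missing preference value "5" are illustrative assumptions; which value must be added depends on which class is missing from Database B.

```java
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AddDummyClass {
    public static void main(String[] args) throws Exception {
        Instances test = DataSource.read("databaseB.arff"); // hypothetical
        test.setClassIndex(test.numAttributes() - 1);

        // New instance whose features are all zero.
        Instance dummy = new DenseInstance(test.numAttributes());
        dummy.setDataset(test);
        for (int i = 0; i < test.numAttributes() - 1; i++) {
            dummy.setValue(i, 0.0);
        }
        // Append the preference value that is missing from the test set,
        // so SMO sees the same set of classes as in the training set.
        dummy.setClassValue("5");
        test.add(dummy); // excluded again when the results are reported
    }
}
```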

Modelling of Humans

The third and final experiment is called Modelling of Humans and is an attempt to model human players based on their preferences. This experiment follows the same process as Modelling of AI, as presented in Figure 3.2. The human players need to play several games in order to create Database C. Once that database is created, the classification models are to predict the preferences of the human players. The difference between computer-controlled players and human players is that although humans have preferences, it is not possible to measure them objectively. An attempt is made to translate human preferences into preference values. The participants were asked to write game reports. In these reports the participants describe what their current preferences are, based on the seven preferences (Section 3.1). While the participants play a duel against a computer-controlled opponent, they keep track of their own preferences. If they feel that their preferences change, they report this, including the turn in which their preferences changed. These reports are then interpreted by the researchers and converted into preference values.

Due to the length of one game of CIV 4 (Section 3.3), only two participants were invited; time-wise it was not feasible to use more participants. These two participants were an expert player, with much experience in playing CIV 4, and a casual player, who had played some games of CIV 4. Each player was given a different assignment for the purpose of this research.


Expert player: This person was asked to play several duels with predetermined preferences in mind. Beforehand the player knew what his preference values were and was supposed to play according to them. However, if the duel forced the participant to alter preferences or preference values, the participant reported the alteration. The consistency of the preference values resembles the style of the computer-controlled players. This will probably result in more accurate predictions.

Casual player: This person was asked to play several duels. Instead of being asked to begin the duel with predetermined preferences, the participant was asked to keep notes of his preferences during the duel. Whenever the participant switched preference or adjusted his aggression level, he was asked to report this. This approach resembles a more natural style of playing and therefore differs more from the style of computer-controlled players. This will probably result in less accurate predictions.

Because the two human players are considered different from each other, it was not possible to create one database with their data mixed. Therefore two databases were created: (1) the Expert Database and (2) the Casual Database. Both databases are considerably smaller than the previous databases due to the fact that they are generated by humans. Once the databases are created, the classification models are used to try and predict the preferences of the human players, resulting in valuable information for answering our research questions.


4 Results

In this chapter we elaborate on the experiments and their results. The chapter is divided into three sections; in each section we discuss one main experiment and smaller, related experiments. Section 4.1 focuses on the Classification Models Validation experiment, Section 4.2 on the Modelling of AI experiment, and Section 4.3 on the Modelling of Humans experiment. Each section ends with a short summary of interesting and noticeable results.

4.1 Classification Models Validation

We want to construct classification models for each preference by using the SMO classifier (Section 3.1) on Database A. Before we can validate the classification models, we attempt to improve the classification results by adjusting the feature data in Database A. Thereafter we test the classification models on that adjusted feature data by means of frequency baselines and 10-fold cross-validations. Only after these two preceding steps is it possible to validate the classification models. Whenever the term feature set is used, it refers to the features and feature data from Database A. Because of future alterations to the data in Database A, we call the unchanged data we work with the Original Feature Set. To improve the feature data we need a baseline against which to compare the results of our improvements. Therefore we started by calculating the baseline of the training set for each preference. Normally a baseline is calculated on the test set, but since we are conducting preliminary experiments, we use 10-fold cross-validations on the training set. Already using the test set at this stage might result in overfitting on the test set; therefore we calculate the frequency baseline on the training set. Table 4.1 provides an overview of how the frequency baseline is calculated for each of the seven preferences. Unlike the explanation of the frequency baseline indicates (Section 3.6), we do not calculate or use the error rate; we use the frequency baseline itself to compare our results with.

Table 4.1 Amount of instances per preference value to illustrate the calculation of the frequency baseline for the feature set from Database A

Preferences  Values                        Total   Most Frequent  Frequency Baseline
             1      3      4      5
Aggression   12302  24730  23641  12221    72894   24730          33.9%

             0      2      5
Culture      48164  -      24730           72894   48164          66.1%
Gold         48626  11966  12302           72894   48626          66.7%
Growth       60673  12221  -               72894   60673          83.2%
Military     24540  24167  24187           72894   24540          33.7%
Religion     48354  24540  -               72894   48354          66.3%
Science      61219  -      11675           72894   61219          84.0%


Now that the baseline is calculated, we can use it to compare the classification results against. We conducted a 10-fold cross-validation on the training set. This way we know whether the SMO classifier performs better than the baseline and how the SMO classifier performs in general. The results are displayed in Table 4.2. The first column displays the frequency baseline per preference, the second column displays the accuracy of the classification model per preference with the root mean squared error between brackets, and the third column displays the improvement of the Original Feature Set over the frequency baseline.

Table 4.2 Classification results in percentages to compare the accuracy of the SMO classifier on the Original Feature Set with the frequency baseline (n = 72894; root mean squared error is between brackets)

             Frequency Baseline  Original Feature Set  Improvement
Aggression   33.9%               67.4% (0.37)          98.8%
Culture      66.1%               81.3% (0.43)          23.0%
Gold         66.7%               78.2% (0.37)          17.2%
Growth       83.2%               83.3% (0.41)          0.1%
Military     33.7%               61.4% (0.43)          82.2%
Religion     66.3%               75.6% (0.49)          14.0%
Science      84.0%               89.6% (0.32)          6.7%
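The cross-validation behind these figures can be reproduced along the following lines; a minimal sketch, assuming the training set is available as an ARFF file (the file name is hypothetical).

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidate {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("databaseA_train.arff");
        train.setClassIndex(train.numAttributes() - 1);

        // 10-fold cross-validation with SMO under WEKA's standard settings.
        Evaluation eval = new Evaluation(train);
        eval.crossValidateModel(new SMO(), train, 10, new Random(1));
        System.out.printf("accuracy = %.1f%%, RMSE = %.2f%n",
                eval.pctCorrect(), eval.rootMeanSquaredError());
    }
}
```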

It is clear that the classifier performs better than the frequency baseline. The accuracy of the predictions is above 60% for all preferences. Besides looking at the accuracy of the predictions, it is important to look at the improvement with regard to the baseline. For example, the preference Growth is predicted with high accuracy (83.3%). In itself this is a good result, but compared to the frequency baseline it is only a very small improvement of 0.1%. Had the classifier 'guessed' the preference values by always selecting the most frequent value, it would have achieved almost the same accuracy. Although the accuracies of the predictions of the preferences Aggression and Military are the lowest (67.4% and 61.4% respectively), the improvements are the largest (98.8% and 82.2% respectively). Based on a z-test, the improvement for each preference has z < -1.96, except for the preference Growth (z = -0.89). This indicates that all improvements are significant with regard to the frequency baseline, except the improvement for the preference Growth.
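The thesis does not spell out which z-test formulation was used. One plausible reading, given only the baseline, the accuracy, and n, is a one-sample test of the observed accuracy against the baseline proportion; the sketch below follows that assumption and is not necessarily the exact computation used here.

```java
public class BaselineZTest {
    // p0 = baseline proportion, p = observed accuracy, n = number of instances.
    // Negative z means the classifier beats the baseline (cf. z < -1.96).
    static double zScore(double p0, double p, int n) {
        double se = Math.sqrt(p0 * (1 - p0) / n); // standard error under H0
        return (p0 - p) / se;
    }

    public static void main(String[] args) {
        // Growth from Table 4.2: baseline 83.2%, accuracy 83.3%, n = 72894.
        // This simple formulation gives a z of comparable magnitude to the
        // reported z = -0.89; the exact test used in the thesis may differ.
        System.out.printf("z = %.2f%n", zScore(0.832, 0.833, 72894));
    }
}
```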

InfoGain and GainRatio

Now that the baseline and a minimal performance are determined, we aim to improve the accuracy of the predictions made by the SMO classifier by adjusting the feature data from Database A. Three different adjustments to the feature set are considered: (1) InfoGain, (2) GainRatio, and (3) Minus100. Based on small-scale experiments we try to determine which adjustment or adjustments improve the accuracy of the prediction. We start by looking at the first two adjustments, InfoGain and GainRatio, since they are relatively similar. The third adjustment to the feature set, Minus100, will be discussed after that. Both InfoGain and GainRatio are feature selection algorithms that are available through WEKA, where they are officially called INFO GAIN ATTRIBUTE EVAL and GAIN RATIO ATTRIBUTE EVAL. In Appendix C we added the standard settings that we used for our experiments with respect to InfoGain and GainRatio. The feature selection algorithms aim to reward those features that contribute most to an accurate prediction and 'punish' those features that contribute least to an accurate prediction. This results in a ranking of weighted features, from very relevant features to irrelevant or even distorting features. The reasoning behind this approach is that by only using the higher ranked features and discarding the lower ranked features, the general quality of the classification models is increased and the time it takes for the SMO classifier to build a classification model is reduced. To build the InfoGain Feature Set we took the top 10 features per preference; the selection of the top 10 features was an arbitrary decision. Furthermore, we added all features that got a weight > 0.05. The boundary of > 0.05 was based upon an approximation of the average division of the weights over the features. To build the GainRatio Feature Set we also took the top 10 features. Instead of > 0.05 as a boundary, we took > 0.02; that boundary was again based upon an approximation of the average division of the weights over the features. This way the InfoGain Feature Set consisted of 32 features, and the GainRatio Feature Set consisted of 33 features. The lists of features per feature set are given in Appendix D. We tested the InfoGain Feature Set and the GainRatio Feature Set on the preferences Aggression and Culture with a 10-fold cross-validation. The classification results of these two feature sets are compared with the classification results on the Original Feature Set. An overview of the results is presented in Table 4.3.

Table 4.3 Classification results in percentages between the Original Feature Set and two possible improvements InfoGain and GainRatio (n = 72894; root mean squared error is between brackets)

             Original Feature Set  InfoGain Feature Set  GainRatio Feature Set
Aggression   67.4% (0.37)          56.2% (0.39)          55.1% (0.39)
Culture      81.3% (0.43)          74.3% (0.51)          75.2% (0.50)

From this table we can conclude that both the InfoGain Feature Set and the GainRatio Feature Set were predicted less accurately than the Original Feature Set. Reducing the feature set with feature selection algorithms did not improve the classification results, but it did drastically reduce the time it took for the SMO classifier to build a classification model, by 75%. Apparently, by reducing the size of the feature set, the amount of useful information in the feature set is also reduced; in other words, useful information is thrown away. Since accuracy is of more importance to this research than the time the SMO classifier needs to build a classification model, we did not continue with either the InfoGain Feature Set or the GainRatio Feature Set.
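A sketch of how such a ranking is obtained with WEKA's attribute selection API (GainRatio is analogous via GainRatioAttributeEval); the file name and threshold handling are illustrative.

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.InfoGainAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankFeatures {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("databaseA_train.arff"); // hypothetical
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new InfoGainAttributeEval());
        Ranker ranker = new Ranker();
        ranker.setThreshold(0.05);       // keep only features weighted > 0.05
        selector.setSearch(ranker);
        selector.SelectAttributes(data); // note: capitalised method name

        // Indices of the retained attributes, highest ranked first
        // (WEKA appends the class attribute index at the end).
        for (int idx : selector.selectedAttributes()) {
            System.out.println(idx + "\t" + data.attribute(idx).name());
        }
    }
}
```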

Minus100

The third possible adjustment to the feature set is named Minus100. With this adjustment the first 100 turns (instances) of each game are removed from the Original Feature Set. The adjustment is based on the design of CIV 4. 100 turns is not an arbitrary number, but is based on our experiences while playing the game, and it rests on four considerations: (1) The game starts in prehistory with few development choices for the players. This means that although players have different preferences, and therefore different strategies, they are not able to express those to their full extent at the beginning of each game. (2) Furthermore, regardless of their preferences, each leader has to perform more or less the same actions to start an empire: build the first city, explore the surrounding countryside, cultivate the surrounding area, etcetera. (3) The third consideration is based on the time a game of CIV 4 takes to play. The first 100 turns are played in several minutes by a player, while a duel in CIV 4 takes two or three hours. (4) Last, contact with the other player also occurs only after approximately 100 turns. In other words, the most interesting aspects for this research only really start after about 100 turns. In general, options are limited in the first 100 turns, so the data from the start of the game tends to be similar for each leader and thus ambiguous for the SMO classifier. It is expected that removing the first 100 turns will amplify the different preferences and reduce the ambiguity in the feature set. To test this, we adjusted the Original Feature Set and created the Minus100 Feature Set; a sketch of the adjustment itself is given below. We tested the Minus100 Feature Set on all seven preferences with a 10-fold cross-validation. The classification results of this feature set are compared with the classification results on the Original Feature Set, including the improvement in percentages of the Minus100 Feature Set over the Original Feature Set. The results are presented in Table 4.4.
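A minimal sketch of the Minus100 adjustment, assuming each instance carries a numeric turn-number attribute named "turn" (a hypothetical name; the actual feature layout may differ).

```java
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class Minus100 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("original_feature_set.arff");
        int turnIdx = data.attribute("turn").index(); // hypothetical attribute

        // Iterate backwards so deleting does not shift the remaining indices.
        for (int i = data.numInstances() - 1; i >= 0; i--) {
            if (data.instance(i).value(turnIdx) <= 100) {
                data.delete(i); // drop all instances from the first 100 turns
            }
        }
        System.out.println(data.numInstances() + " instances remain");
    }
}
```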


Table 4.4 Classification results in percentages between the Original Feature Set and the Minus100 Feature Set including the improvement in percentages (Original Feature Set n = 72894; Minus100 Feature Set n = 54714; root mean squared error is between brackets)

             Original Feature Set  Minus100 Feature Set  Improvement
Aggression   67.4% (0.37)          73.8% (0.35)          9.5%
Culture      81.3% (0.43)          84.4% (0.40)          3.8%
Gold         78.2% (0.37)          82.8% (0.34)          5.9%
Growth       83.3% (0.41)          83.9% (0.40)          0.7%
Military     61.4% (0.43)          67.9% (0.40)          10.6%
Religion     75.6% (0.49)          77.7% (0.47)          2.8%
Science      89.6% (0.32)          92.0% (0.28)          2.7%

From these results it can be concluded that the accuracy of the predictions on the Minus100 Feature Set is higher than the accuracy of the predictions on the Original Feature Set for all seven preferences. We can also conclude that the improvement for the preferences Aggression and Military (9.5% and 10.6% respectively) is considerable. For the other preferences the improvements are marginal, especially for the preference Growth with an improvement of only 0.7%. However, based on a z-test, each improvement scored z < -1.96. This means we can conclude that the improvements by the Minus100 Feature Set are significant with regard to the Original Feature Set for each preference. It appears that by erasing the first 100 turns per game, the overall results of the classification are improved. The Original Feature Set is abandoned and we continue with the Minus100 Feature Set. This concludes the preliminary experiments to improve the classification results by adjusting the feature data in Database A. It is necessary to test the accuracy of the SMO classifier on the adjusted feature set; this provides an impression of the results we might get if we were to use the test set. First we calculated a new frequency baseline for the Minus100 Feature Set, since the frequency baseline may have changed: by erasing the first hundred turns we also erased preference values, so the division of preference values in the feature set is altered, and so is the frequency baseline. The calculation of the frequency baseline for the Minus100 Feature Set is done over the training set, since we are still not using the test set. The calculation of the frequency baseline is presented in Table 4.5.

Table 4.5 Number of instances per preference value to illustrate the calculation of the frequency baseline for the Minus100 Feature Set

Preferences  Values                        Total   Most Frequent  Frequency Baseline
             1      3      4      5
Aggression   9272   18670  17581  9191     54714   18670          34.1%

             0      2      5
Culture      36044  -      18670           54714   36044          65.9%
Gold         36506  8936   9272            54714   36506          66.7%
Growth       45523  9191   -               54714   45523          83.2%
Military     18480  18107  18127           54714   18480          33.8%
Religion     36234  18480  -               54714   36234          66.2%
Science      46069  -      8645            54714   46069          84.2%

To see the performance of the SMO classifier on the Minus100 Feature Set, we combined the frequency baseline of the Minus100 Feature Set, the results of the 10-fold cross-validation on the Minus100 Feature Set, and the improvement in percentages in one table. The results are presented in Table 4.6.


Table 4.6 Classification results in percentages to compare the accuracy of the SMO classifier on the Minus100 Feature Set with the previously calculated frequency baseline (n = 54714; root mean squared error is between brackets)

             Frequency Baseline  Minus100 Feature Set  Improvement
Aggression   34.1%               73.8% (0.35)          116.4%
Culture      65.9%               84.4% (0.40)          28.1%
Gold         66.7%               82.8% (0.34)          24.1%
Growth       83.2%               83.9% (0.40)          0.8%
Military     33.8%               67.9% (0.40)          100.9%
Religion     66.2%               77.7% (0.47)          17.4%
Science      84.2%               92.0% (0.28)          9.3%

From this table we can conclude that for all preferences the classification model performs better than the baseline. We can also conclude that all preferences are predicted with an accuracy of at least 67.9% and that the preferences are predicted correctly in an average of 80.4% of the instances. For the preferences Culture, Gold, Religion and Science the improvements over the baseline are reasonable. The improvement for the preference Growth is marginal. Based on a z-test, each difference between the frequency baseline and the Minus100 Feature Set had a z-value of z < -1.96, meaning that each improvement is significant, even for the preference Growth.

Validation

The results of the SMO classifier on the Minus100 Feature Set were encouraging enough to proceed and validate the classification models. To validate the classification models we use a training set and a test set instead of only a training set. As training set we use the previously used Minus100 Feature Set. As test set we used the feature set that was constructed specifically for this task, as mentioned in Section 3.3. To make the training set and test set compatible, we also erased the first 100 turns per game in the test set. The training set contains approximately 55000 instances (Minus100 Training Set); the test set contains approximately 18000 instances (Minus100 Test Set). We want to compare the classification results against the frequency baseline. In this case we need the frequency baseline of the test set, since we are trying to predict the preference values of the test set and not those of the training set. The calculation of the frequency baseline of the test set is presented in Table 4.7.

Table 4.7 Amount of instances per preference value to illustrate the calculation of the frequency baseline for the Minus100 Test Set

Preferences  Values                        Total   Most Frequent  Frequency Baseline
             1      3      4      5
Aggression   2979   6111   5666   3183     17939   6111           34.1%

             0      2      5
Culture      11828  -      6111            17939   11828          65.9%
Gold         11937  3023   2979            17939   11937          66.5%
Growth       14756  3183   -               17939   14756          82.3%
Military     6337   5396   6206            17939   6337           35.3%
Religion     11602  6337   -               17939   11602          64.7%
Science      15296  -      2643            17939   15296          85.3%


To see the performance of the SMO classifier on the Minus100 Test Set, we combined the frequency baseline of the test set, the results of the SMO classifier on the test set, and the improvement in percentages in one table. The results are presented in Table 4.8.

Table 4.8 Classification results in percentages to compare the accuracy of the SMO classifier on the Minus100 Test Set with the calculated frequency baseline (n = 17939; root mean squared error is between brackets)

             Frequency Baseline  Minus100 Test Set  Improvement
Aggression   34.1%               65.8% (0.38)       93.0%
Culture      65.9%               78.9% (0.46)       19.7%
Gold         66.5%               74.6% (0.38)       12.2%
Growth       82.3%               83.5% (0.41)       1.5%
Military     35.3%               61.0% (0.43)       72.8%
Religion     64.7%               79.0% (0.46)       22.1%
Science      85.3%               88.4% (0.34)       3.6%

We can conclude that the validation of the classification models is a success, as for all preferences the classification model performs better than the baseline. We can also conclude that all preferences are predicted with an accuracy of at least 61.0% and that the preferences are predicted correctly in an average of 75.9% of the instances. For the preferences Culture, Gold, and Religion the improvements over the baseline are reasonable. The improvements on the preferences Growth and Science are marginal. However, it is important to note that, based on a z-test, each improvement over the baseline is significant (z < -1.96). We can also conclude that the results of the classification are less accurate on the test set than on the training set itself (Table 4.6). This is a normal, expected difference between results on a training set and a test set (Section 3.3).
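The validation step itself amounts to training SMO on the training set and evaluating the resulting model on the held-out test set. A minimal sketch, with hypothetical file names:

```java
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ValidateModel {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("minus100_train.arff");
        Instances test = DataSource.read("minus100_test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        SMO smo = new SMO();
        smo.buildClassifier(train);     // build the classification model

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(smo, test);  // predict the held-out test set
        System.out.printf("test accuracy = %.1f%%, RMSE = %.2f%n",
                eval.pctCorrect(), eval.rootMeanSquaredError());
    }
}
```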

Summary

In this section we elaborated on how we strove to achieve a higher accuracy of the predictions. We also performed a test of how much the accuracy improved. Furthermore, we discussed the validation of the classification models. To increase the accuracy we tried three approaches: (1) InfoGain, (2) GainRatio, and (3) Minus100. It appeared that only the Minus100 approach gave positive results, so we continued with the Minus100 Feature Set. To validate our classification models, we used a previously constructed test set. The results were marginally less accurate than when we used only the training set; this difference was expected. The results of the classification were sufficiently accurate to proceed to the Modelling of AI.

4.2 Modelling of AI

In the previous section we created seven validated classification models. These validated classification models are the basis for the second experiment: Modelling of AI. In this section we will discuss (1) the classification of the preferences of computer-controlled players and (2) the classification of specific CIV 4 leaders.

Six CIV 4 leaders were used to build a training set and a test set to validate the classification models. The results of the classification models on the test set are encouraging: the predictions had a minimal accuracy of 61.0% and an average accuracy of 75.9%. Now we aim to predict CIV 4 leaders that were not involved in the validation of the classification models. To this end we created a new test set. This test set, the Cyrus Test Set, was built on data from the following six CIV 4 leaders: (1) Cyrus, (2) Montezuma, (3) Peter, (4) Saladin, (5) Victoria, and (6) Washington. The construction of this test set is discussed in Section 3.6. The Cyrus Test Set is similar to Database B. To make the Cyrus Test Set compatible with our classification models, the Minus100 adjustment is also applied to the Cyrus Test Set.


Furthermore, dummy classes are used to overcome slight mismatches between the preference values of the leaders in the Alexander Set and those of the leaders in the Cyrus Set. If a preference value did not occur in the Cyrus Set, an instance was added to the set containing the missing preference value. These dummy classes are excluded from the results and are solely used to solve the mismatch. For further elaboration see Section 3.6.

Preference Classification

Since the validated classification models and the Cyrus Test Set are now compatible, we want to see to what extent our validated classification models are capable of predicting the preference values of the leaders in the Cyrus Test Set. Since the Cyrus Test Set differs from the previous Alexander Test Set, we need to calculate a new frequency baseline. No training set is created for the Cyrus Set, since no further adjustments to that set are planned. The calculation of this frequency baseline is given in Table 4.9.

Table 4.9 Amount of instances per preference value to illustrate the calculation of the frequency baseline for the Cyrus Test Set

Preferences  Values                        Total   Most Frequent  Frequency Baseline
             1      3      4      5
Aggression   -      8511   5577   2242     16330   8511           52.1%

             0      2      5
Culture      16330  -      -               16330   16330          100.0%
Gold         11412  2242   2676            16330   11412          69.9%
Growth       4981   11349  -               16330   11349          69.5%
Military     5454   -      10876           16330   10876          66.6%
Religion     13591  2739   -               16330   13591          83.2%
Science      13552  -      2778            16330   13552          83.0%

To see the performance of the validated classification models on the Cyrus Test Set, we combined the frequency baseline of the test set, the results of the predictions on the test set, and the improvement in percentages in one table. The results are presented in Table 4.10.

Table 4.10 Classification results in percentages to compare the accuracy of the SMO classifier on the Cyrus Test Set with the frequency baseline (n = 16330; root mean squared error is between brackets)

             Frequency Baseline  Cyrus Test Set  Improvement
Aggression   52.1%               24.9% (0.47)    -52.2%
Culture      100.0%              88.2% (0.34)    -11.8%
Gold         69.9%               38.6% (0.50)    -44.8%
Growth       69.5%               30.8% (0.83)    -55.7%
Military     66.6%               34.6% (0.56)    -48.0%
Religion     83.2%               59.0% (0.64)    -29.1%
Science      83.0%               71.0% (0.54)    -14.5%

From Table 4.10 we can conclude that the preferences are predicted correctly in at least 24.9% of the cases, with an average of 49.6% of the instances predicted correctly. The results also show that the accuracy of the predictions on the Cyrus Test Set is worse than the accuracy of the predictions on the Alexander Test Set. We can also conclude that, based on a z-test, the classification is significantly worse than the frequency baseline for all preferences (z > 1.96). Possible reasons as to why the classification results are significantly worse will be discussed in Chapter 5. In short, the classification of the preference values of the other computer-controlled players was no improvement with regard to the frequency baseline; thus the classification is no improvement.

Leader Classification

The results of the classifications of the Cyrus Set are not as good as expected. A possible explanation for the results is that the classification models are overfitting on the leaders of the Alexander Set. If that holds true, the classification models should be able to predict leaders instead of preferences. To further verify this possible explanation, an attempt is made to predict which CIV 4 leader was linked to an instance. Initially an experiment is conducted to determine whether the SMO classifier is capable of recognising leaders from the Alexander Set. In order to execute this experiment we adjusted the Minus100 Training Set and Minus100 Test Set from Database A. Instead of a preference value as class we used numbers from the set {1, 2, 3, 4, 5, 6} to represent the CIV 4 leaders from the Alexander Set. To measure the performance of the SMO classifier, a frequency baseline is calculated. For the Minus100 Test Set, n = 17939 and the most frequent class occurs 3358 times, which results in a frequency baseline of 18.7%. After classification the predictions were accurate in 45.7% of the instances. This is a significant improvement of 144.1% with regard to the frequency baseline. In other words, we can conclude that the SMO classifier is fairly capable of recognising leaders when compared to the baseline, although an accuracy of 45.7% is not particularly high. To further illustrate these results we present a crosstab in Table 4.11. The correct classifications, which are expected to occur the most, lie on the diagonal. The numbers represent the absolute number of instances from the Minus100 Test Set.

Table 4.11 Crosstab displaying how the instances were attributed to the leaders. The leaders on the left were being classified as the top leaders (n = 17939)

             Alexander  Hatshepsut  Louis XIV  Mansa Musa  Napoleon  Tokugawa
Alexander    1742       288         350        161         440       202
Hatshepsut   167        1152        1369       442         184       44
Louis XIV    311        966         1044       26          369       37
Mansa Musa   198        136         250        2065        229       101
Napoleon     1081       182         269        140         1207      144
Tokugawa     538        371         175        51          529       979
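For reference, a crosstab like Table 4.11 can be read off WEKA's Evaluation object after evaluating on the relabelled test set; a sketch, with hypothetical file names:

```java
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SMO;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LeaderCrosstab {
    public static void main(String[] args) throws Exception {
        // Training and test sets in which the class attribute is the leader
        // number {1..6} instead of a preference value.
        Instances train = DataSource.read("leaders_train.arff");
        Instances test = DataSource.read("leaders_test.arff");
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        SMO smo = new SMO();
        smo.buildClassifier(train);
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(smo, test);

        // Rows are the actual leaders, columns the predicted leaders.
        for (double[] row : eval.confusionMatrix()) {
            for (double cell : row) System.out.printf("%7.0f", cell);
            System.out.println();
        }
    }
}
```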

From the table we can conclude that the most frequent classification is also the correct classification for all leaders except Hatshepsut. Furthermore, we can conclude that Hatshepsut and Louis XIV are hard to discern for the SMO classifier; interestingly, they are both mildly aggressive, culture-oriented leaders. Confusion can also be found, although slightly less, for Napoleon and Alexander, both highly aggressive, military-oriented leaders. Furthermore, Napoleon and Alexander slightly interfere with the classification of Tokugawa. Finally, we can conclude that Mansa Musa is easy to discern. This can be attributed to him being the only non-aggressive, gold-oriented leader. The implications will be further discussed in Chapter 5. Continuing with the impression that the classification models are overfitting on the leaders of the Alexander Set, another experiment is conducted. This experiment investigates as which leader from the Alexander Set the leaders of the Cyrus Set are classified. For this, the Minus100 Training Set with the leaders from the Alexander Set as class is used, combined with a Cyrus Test Set with the leaders from the Cyrus Set as class. For this experiment it is not necessary to calculate a frequency baseline or even consider the accuracy of the classification; it is not more correct if a leader is classified as one leader or another. E.g., whether Cyrus is classified as Alexander or as Mansa Musa makes no difference. The important information is presented in Table 4.12, followed by Table 4.13, which presents an overview of the leaders from both sets and their preference values.

Table 4.12 Crosstab displaying as which leaders from the Alexander Set the leaders from the Cyrus Set are classified (n = 16333)

             Alexander  Hatshepsut  Louis XIV  Mansa Musa  Napoleon  Tokugawa
Cyrus        186        692         121        445         1301      54
Montezuma    815        294         173        411         394       155
Peter        436        302         284        513         1155      88
Saladin      591        206         288        497         978       179
Victoria     685        19          94         578         511       789
Washington   117        61          8          2110        42        761

Table 4.13 Displaying the preferences of each leader from the Alexander Set (left) and the Cyrus Set (right)

             AG  CU  GO  GR  MI  RE  SC                 AG  CU  GO  GR  MI  RE  SC
Alexander    5   0   0   2   5   0   0    Cyrus         4   0   0   2   5   0   0
Hatshepsut   3   5   0   0   0   2   0    Montezuma     5   0   2   0   5   0   0
Louis XIV    3   5   0   0   2   0   0    Peter         4   0   0   2   0   0   5
Mansa Musa   1   0   5   0   0   2   0    Saladin       3   0   0   0   5   2   0
Napoleon     4   0   2   0   5   0   0    Victoria      3   0   5   2   0   0   0
Tokugawa     4   0   0   0   2   0   5    Washington    3   0   0   2   5   0   0
Note: AG=Aggression, CU=Culture, GO=Gold, GR=Growth, MI=Military, RE=Religion, SC=Science

From these two tables we can conclude that there are no clear relations between preference values and the leader combinations. Although some leaders appear to be a match based on preference values, others do not. E.g., Montezuma being mostly classified as Alexander appears logical when looking at the preference values. However, why Washington is almost exclusively classified as Mansa Musa is not logical considering the preference values. There appears to be more to this than meets the eye; therefore these results will be further elaborated on in Chapter 5.

Summary

In this section we tried to model AI players from the game CIV 4. The accuracy of the predictions on the Cyrus Test Set was no improvement over the frequency baseline; therefore, we must conclude that the predictions are no improvement. We had aimed for fairly accurate classifications of other computer-controlled players, although we did expect the classifications on the Cyrus Test Set to be less accurate than those on the Alexander Test Set. The follow-up experiments, classifying the Alexander Set leaders and classifying the Cyrus Set leaders as Alexander Set leaders, provided a different point of view on the feature sets and the SMO classifier that we will elaborate on further in Chapter 5. Despite the fact that the accuracy of the predictions on the AI was no improvement, an attempt is made to model human players in the next section.

4.3 Modelling of Humans

In this section we discuss the use of the validated classification models from the first experiment (Section 4.1) to predict the preferences of human players. We asked two participants, a casual player and an expert player of CIV 4, to play several games. Below we discuss the results of the classification for each player.


Since human players do not have their preferences available in game code like the CIV 4 leaders, each player was asked to make a report about their actions during a game. The reports are then converted into preferences and corresponding preference values. Each report contained the following elements: (1) it stated which leader the human player played and which leader the AI played, (2) it stated the preferences the human player had before the start of the game, (3) it contained a short transcription of what happened during the game, (4) it noted whether the human player changed his preference values during the game, and (5) it ended with the outcome of the game and a summary of the preferences the human player had during the game. All reports and their interpretation into preference values are presented in Appendix E. For further elaboration on the preparations, see Section 3.6.

Casual Player

The casual player played a total of three duels. The data extracted from these duels was combined in one database, the Casual Database. The Minus100 adjustment is also applied to this database, as well as dummy classes to overcome mismatches. As before, these dummy classes are removed from the results. This resulted in a Casual Test Set containing 654 instances. First, a new frequency baseline is calculated. The calculation of the frequency baseline is presented in Table 4.14.

Table 4.14 Amount of instances per preference value to illustrate the calculation of the frequency baseline for the Casual Test Set

Preferences  Values                  Total  Most Frequent  Frequency Baseline
             1    3    4    5
Aggression   433  54   123  44       654    433            66.2%

             0    2    5
Culture      167  -    487           654    487            74.5%
Gold         654  -    -             654    654            100.0%
Growth       -    654  -             654    654            100.0%
Military     487  89   78            654    487            74.5%
Religion     654  -    -             654    654            100.0%
Science      654  -    -             654    654            100.0%

Now that the baselines are calculated, we compare them with the classification results on the Casual Test Set. The results of the classification are presented in Table 4.15, including the frequency baseline and the improvement in percentages.

Table 4.15 Classification results in percentages to compare the accuracy of the SMO classifier on the Casual Test Set with the previously calculated frequency baseline (n = 654; root mean squared error is between brackets)

             Frequency Baseline  Casual Test Set  Improvement
Aggression   66.2%               7.5% (0.50)      -88.7%
Culture      74.5%               62.2% (0.61)     -16.5%
Gold         100.0%              97.3% (0.28)     -2.7%
Growth       100.0%              0.3% (1.00)      -99.7%
Military     74.5%               59.0% (0.47)     -20.8%
Religion     100.0%              68.0% (0.57)     -32.0%
Science      100.0%              99.9% (0.04)     -0.1%

From Table 4.15 we can conclude that the preferences are predicted correctly in at least 0.3% of the cases, with an average of 56.3% of the instances predicted correctly. We can also conclude, based on a z-test, that the classification is significantly worse than the frequency baseline for almost all preferences (z > 1.96). For the preference Science, z = 1.62, which is therefore not significantly worse than the frequency baseline. The results also show that the average improvement on the Casual Test Set is worse than the average improvement on the Alexander Test Set, but roughly similar to the average improvement on the Cyrus Test Set. This concludes the experiment for the casual player.

Expert Player

The expert player played a total of four duels. The data extracted from these duels was combined in one database, the Expert Database. We also applied the Minus100 adjustment to this feature set and introduced dummy classes to overcome mismatches. As before, these dummy classes are removed from the results. This resulted in an Expert Test Set containing 507 instances. Although this player played more games than the casual player, the test set contains fewer instances. This indicates that the average number of turns needed to win a duel was lower in the games of the expert player. Before we take a look at the classification results, we want to know the frequency baseline. The calculation of that baseline is presented in Table 4.16.

Table 4.16 Amount of instances per preference value to illustrate the calculation of the frequency baseline for the Expert Test Set

Preferences  Values                  Total  Most Frequent  Frequency Baseline
             1    3    4    5
Aggression   402  -    -    105      507    402            79.3%

             0    2    5
Culture      255  -    252           507    255            50.3%
Gold         281  77   149           507    281            55.4%
Growth       507  -    -             507    507            100.0%
Military     402  -    105           507    402            79.3%
Religion     255  252  -             507    255            50.3%
Science      142  -    365           507    365            72.0%

To see the performance of our validated classification models on the Expert Test Set, we combined the frequency baseline of the test set, the results of the predictions on the test set, and the improvement in percentages in one table. The results are presented in Table 4.17.

Table 4.17 Classification results in percentages to compare the accuracy of the SMO classifier on the Expert Test Set with the frequency baseline (n = 507; root mean squared error is between brackets)

             Frequency Baseline  Expert Test Set  Improvement
Aggression   79.3%               32.0% (0.45)     -59.6%
Culture      50.3%               92.3% (0.28)     83.5%
Gold         55.4%               81.3% (0.39)     46.8%
Growth       100.0%              97.8% (0.15)     -2.2%
Military     79.3%               76.6% (0.39)     -3.4%
Religion     50.3%               50.9% (0.70)     1.2%
Science      72.0%               43.6% (0.75)     -39.4%

From Table 4.17 we can conclude that the preferences are predicted correctly in at least 32.0% of the cases, with an average of 67.8% of the instances predicted correctly. We can also conclude that the classification results are not uniformly better or worse than the frequency baseline. The preferences Culture and Gold are classified significantly better than the baseline (z < -1.96). Aggression and Science are classified significantly worse than the baseline (z > 1.96). Growth and Military are classified with an accuracy roughly similar to the baseline, but still significantly worse according to a z-test (z > 1.96). Only for the preference Religion is there no significant difference (z = -0.38). The average improvement with regard to the frequency baseline for the Expert Test Set is worse than the average improvement on the Alexander Test Set, but better than the average improvement on the Cyrus Test Set and the Casual Test Set. In other words, the expert player was predicted more accurately than the casual player and the other set of computer-controlled players. This concludes the experiment for the expert player.

Close or Not?

Besides the accuracy of the predictions, it is interesting to investigate whether the wrong predictions are close to the correct predictions or not. A prediction of a wrong preference value that is close to the correct preference value is qualitatively different from a prediction of a wrong preference value that is not even close to the correct one. To this end, all erroneous predictions are presented in Table 4.18A through Table 4.18G, with Table 4.18A presenting the wrong predictions of the first preference, Aggression, and Table 4.18G presenting the wrong predictions of the last preference, Science. These tables are divided in two main parts: the left part for the wrong predictions on the Casual Database, the right part for the wrong predictions on the Expert Database. The number of erroneous predictions differs between players and preferences. An error is considered positive if the difference between the actual and predicted preference value is minimal, i.e., the two values are adjacent on that preference's value scale; it is considered negative if the difference is more than minimal. In the tables a '+' represents a positive error and a '-' represents a negative error. Only erroneous predictions that actually occurred are incorporated in the tables. The preference Aggression consists of four preference values, meaning that six positive erroneous combinations and six negative erroneous combinations are possible. For this preference to be considered close to the correct prediction, the positive erroneous predictions should be 50% or more. The preferences Gold and Military consist of three preference values, meaning that four positive erroneous combinations and two negative erroneous combinations are possible. For these two preferences to be considered close to the correct prediction, the positive erroneous predictions should be 66.6% or more. The preferences Culture, Growth, Religion, and Science all consist of two preference values; no positive or negative erroneous combination is possible, and ideally the errors should be divided equally. A small sketch of this counting rule is given below.
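The counting rule can be made precise as follows: an error is positive exactly when the actual and predicted values are neighbours on the preference's value scale. A minimal sketch of that rule (the value scale and the example errors are illustrative):

```java
import java.util.Arrays;
import java.util.List;

public class CloseOrNot {
    // Value scale of the preference under consideration, e.g. Gold: {0, 2, 5}.
    static final List<Integer> SCALE = Arrays.asList(0, 2, 5);

    // Positive error: actual and predicted are adjacent on the value scale.
    static boolean isPositive(int actual, int predicted) {
        return Math.abs(SCALE.indexOf(actual) - SCALE.indexOf(predicted)) == 1;
    }

    public static void main(String[] args) {
        // Illustrative (actual, predicted) error pairs.
        int[][] errors = {{0, 2}, {0, 5}, {2, 5}, {5, 0}};
        int positive = 0;
        for (int[] e : errors) if (isPositive(e[0], e[1])) positive++;
        System.out.printf("positive errors: %d of %d%n", positive, errors.length);
    }
}
```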

Table 4.18A Prediction errors made by the classification model on the preference Aggression

Casual Player (n = 607)                     Expert Player (n = 346)
Actual  Predicted  Difference  Percentage   Actual  Predicted  Difference  Percentage
1       3          +           59.6%        1       3          +           69.4%
1       5          -           12.0%        4       3          +           0.3%
3       4          +           1.0%         5       1          -           21.1%
3       5          -           1.5%         5       3          -           8.4%
4       3          +           17.1%        5       4          +           0.9%
4       5          +           1.5%         5       3          -           7.2%

From this table it can be concluded that for the casual player 79.2% of the erroneous predictions were close to the correct predictions; this is 70.6% for the expert player. Both are well above the necessary 50%. Although the preference Aggression was predicted worse than the baseline for both players (-88.7% and -59.6% respectively), the erroneous predictions appear to be close to the correct predictions.


Table 4.18B Prediction errors made by the classification model on the preference Culture

Casual Player (n = 248)          Expert Player (n = 39)
Actual  Predicted  Percentage    Actual  Predicted  Percentage
0       5          50.4%         0       5          69.2%
5       0          49.6%         5       0          30.8%

From this table it can be concluded that the division of the erroneous predictions is 50.4% versus 49.6% for the casual player and 69.2% versus 30.8% for the expert player. For both players the erroneous predictions tend to assume more cultural preference than the players indicated, especially for the expert player.

Table 4.18C Prediction errors made by the classification model on the preference Gold

Casual Player (n = 18)                      Expert Player (n = 95)
Actual  Predicted  Difference  Percentage   Actual  Predicted  Difference  Percentage
0       2          +           55.6%        0       5          -           1.1%
0       5          -           33.3%        2       5          +           81.1%
2       0          +           5.6%         5       0          -           17.9%
5       0          -           5.6%

From this table it can be concluded that 61.2% of the erroneous predictions can be considered positive for the casual player, which does not exceed the required 66.6%. For the expert player 81.1% of the erroneous predictions can be considered positive, well above the boundary of 66.6%. Therefore we can conclude that only for the expert player the erroneous predictions are close to the correct predictions.

Table 4.18D Prediction errors made by the classification model on the preference Growth

Casual Player (n = 654)          Expert Player (n = 11)
Actual  Predicted  Percentage    Actual  Predicted  Percentage
2       0          100.0%        0       2          90.9%
                                 2       0          9.1%

From this table it can be concluded that all (100%) of the erroneous predictions for the casual player are of a single type, while the division is 90.9% versus 9.1% for the expert player. This indicates that the casual player considered himself to be focussing on Growth, while the classification model considered the player not to focus on Growth. The opposite holds true for the expert player. Apparently the meaning of the preference Growth is hard to comprehend.

Table 4.18E Prediction errors made by the classification model on the preference Military

Casual Player (n = 269)                     Expert Player (n = 119)
Actual  Predicted  Difference  Percentage   Actual  Predicted  Difference  Percentage
0       2          +           31.6%        0       2          +           26.9%
0       5          -           31.6%        5       0          -           64.7%
2       0          +           11.2%        5       2          +           8.4%
2       5          +           14.5%
5       0          -           11.2%

From this table it can be concluded that 57.3% of the erroneous predictions can be considered positive for the casual player; for the expert player 35.3% can be considered positive. Neither exceeds the required boundary of 66.6%. The correct preference value for Military appears hard to estimate.


Table 4.18F Prediction errors made by the classification model on the preference Religion

Casual Player (n = 210)          Expert Player (n = 250)
Actual  Predicted  Percentage    Actual  Predicted  Percentage
0       2          100.0%        0       2          91.6%
                                 2       0          8.4%

From this table it can be concluded that all (100%) of the erroneous predictions for the casual player are of a single type, while the division is 91.6% versus 8.4% for the expert player. These results indicate that both players did not consider themselves to be focussing on Religion, while the classification model did conclude that. Contrary to the preference Growth, the predictions are consistent between the casual player and the expert player.

Table 4.18G Prediction errors made by the classification model on the preference Science

Casual Player (n = 1)            Expert Player (n = 287)
Actual  Predicted  Percentage    Actual  Predicted  Percentage
5       0          100.0%        0       5          0.3%
                                 5       0          99.7%

From this table it can be concluded that all (100%) of the erroneous predictions for the casual player are of a single type, while the division is 0.3% versus 99.7% for the expert player. Again there is a large overlap between the players: for both, nearly all errors concern an indicated Science preference (5) that the classification model predicted as absent (0). It should be noted, though, that the number of erroneous classifications for the casual player is only n = 1.

Summary

In this section we presented the results of the SMO classifier with regard to human players. The results on the computer-controlled players were less accurate than we expected; therefore we did not expect high accuracies for the predictions on the human players. Our expectations held true for the casual player. The results on the expert player, however, were in contrast with our expectations. None of the predictions on the human players were close to accurate, but in general they were more accurate than the predictions on the Cyrus Set. We discuss possible causes and implications in Chapter 5.


5 Discussion

From the results in Chapter 4, we derived possible optimizations that could be used to improve the results. What these optimizations are and whether they are likely to succeed is discussed here. The first section, Section 5.1, discusses the construction of the classification models, in particular the inequality in the classification results and the improvements between the preferences. The following section, Section 5.2, addresses possible solutions to the performance of the classification model on the Cyrus Set; furthermore, it contains insights about the leader classifications. Finally, we discuss how to predict human players and the conversion of human preferences into numerical preference values in Section 5.3.

5.1 Constructing the Classification Models

The accuracy of the predictions on the Alexander Test Set is 75.9% on average, although there are differences in the level of accuracy per preference. There are even more differences between the frequency baselines and the improvements with regard to those baselines. The classification models would be more usable if the accuracy and the improvement with regard to the baseline were similar and equally reliable across preferences. By selecting the first six leaders we selected certain preferences and preference values. Some preferences were more dominant in the training set than others. This difference in occurrence in the training set contributed to the inequality in the predictions. The occurring preference values in Database A for Growth consisted only of {0, 2}, which can be translated to 'no preference' and 'minor preference'. Furthermore, Growth is necessary for each player to develop his empire. In other words, because the preference values in Database A were not equally represented, some preference values were harder to discern for the SMO classifier than others. To achieve equality in predictions we need to adjust the selection of leaders that we use to create our classification models. The preference values of the leaders are not equally divided in themselves, but the division of preference values in the training set can be made more equal. The preference value 0 always occurs more often than the preference values 2 and 5. We need to accept this inequality and divide the preference values as evenly as possible over the leaders and preferences. Table 5.1 presents a theoretical optimal division of the preference values per preference and per leader for CIV 4.

Table 5.1 Theoretical optimal division of the preference values between preferences and leaders

            Leader 1  Leader 2  Leader 3  Leader 4  Leader 5  Leader 6  Leader 7
Culture     0         0         0         0         0         2         5
Gold        0         0         0         0         2         5         0
Growth      0         0         0         2         5         0         0
Military    0         0         2         5         0         0         0
Production  0         2         5         0         0         0         0
Religion    2         5         0         0         0         0         0
Science     5         0         0         0         0         0         2


In this table we excluded the preference Aggression. Although the preference Aggression is interesting to predict, its preference values {1, 2, 3, 4, 5} are not in line with the other preference values {0, 2, 5}; therefore this preference would only disturb the balance we strive to create. We included the preference Production, which is one of the 'flavours' discussed in Section 3.2. We also added a seventh leader to make an equal division of preference values possible. In practice, the preference values per leader are not as neatly divided as illustrated in Table 5.1. However, it is possible to create a table with equally divided preference values per preference per leader with existing CIV 4 leaders. This is presented in Table 5.2.

Table 5.2 Optimal division of the preference values between preferences and actual CIV 4 leaders

            Alexander  Asoka  Catherine  Louis XIV  Mansa Musa  Mao Zedong  Roosevelt
Culture     0          0      2          5          0           0           0
Gold        0          0      0          0          5           0           2
Growth      2          0      0          0          0           5           0
Military    5          0      0          2          0           0           0
Production  0          0      0          0          0           2           5
Religion    0          5      0          0          2           0           0
Science     0          2      5          0          0           0           0

If we were to use these seven leaders and their corresponding preference values, we would get roughly similar frequency baselines. The accuracy of the predictions and the improvements with regard to these baselines would then be easier to interpret. Furthermore, the classification models would have roughly the same number of instances per preference value to train on. This could result in more accurate predictions of new instances, since a new instance is more likely to resemble the trained instances. This approach would have the most effect on the preferences that had the most unequally divided preference values, such as Growth and Science. However, researching whether this approach is successful is future work.

5.2 Predicting Preferences or Players

The second experiment was designed to predict the preferences of other AIs; other AIs than the ones we used to construct the classification models. The results of that experiment were below our expectations. However, they also caused us to review the possibility that we were predicting something other than preferences: maybe we were predicting leaders, or play styles. It appeared that we could predict the leaders from the Alexander Set fairly accurately. Furthermore, we noticed that leaders with roughly similar preferences were harder to discern for the SMO classifier than leaders with different preferences. This is visible in Table 4.11 of Section 4.2. Hatshepsut and Louis XIV are often confused; Hatshepsut is even classified more often as Louis XIV than as Hatshepsut. Napoleon is confused with Alexander. If we look at Table 3.1 in Section 3.2, we see that the preference values of Louis XIV and Hatshepsut are similar, and those of Alexander and Napoleon are almost similar. This overlap between predicting leaders and predicting preferences can be attributed to the construction of the feature set. We extract feature data from games played by the leaders from the Alexander Set. The SMO classifier trains on that feature set to predict preferences. However, there is a contradiction between the extracted feature data in Database A and the intended prediction. The extracted feature data originates from a game of a leader with a combination of seven preference values; that same feature data is used to predict only one preference value. In other words, there is noise in the feature data, and the other preference values are causing this noise.


First Solution

There are two possible solutions to this problem. The first solution is to use leaders with only strong preference values. In CIV 4 there are only five leaders that have these strong preference values: (1) Bismarck, (2) Frederick, (3) Gandhi, (4) Huayna Capac, and (5) Isabella. This would result in the loss of the preferences Growth and Science, since there are no leaders in CIV 4 with a strong preference value for those two preferences. That situation is presented in Table 5.3.

Table 5.3 Set of leaders that would result in a classification model without 'noise'

             Bismarck  Frederick  Gandhi  Huayna Capac  Isabella
Culture          0         0        10         0            0
Gold             0         0         0        10            0
Military        10         0         0         0            0
Production       0        10         0         0            0
Religion         0         0         0         0           10

Although this solution would solve the problem of noise, it creates another problem. The preference values of all other leaders do not correspond with those of these five leaders, the Bismarck Set. Each leader in the Bismarck Set either has a preference or does not have it at all, whereas all the other leaders have two gradations in their preferences. If we were to use the Bismarck Set to create our classification models, we would have trouble predicting any other AI from CIV 4. For predicting the preferences of human players, however, this approach could certainly improve the classification results, although the predictions would be less specific, since a player can then only have or not have a preference.

Second Solution

The second solution to the problem of noise is to not consider it noise. Besides the seven preferences per leader, there are other factors that can create noise or hamper accurate classifications. It is known that the behaviour of the CIV 4 leaders is not determined by their preferences alone; many more aspects of the game code influence their behaviour. Instead of focussing solely on preferences, it might be wise to define a broader base for a player model. We expected that a player's play style was based on that player's preferences. However, a play style can also be interpreted as a combination of preferences and other parameters that influence a player's actions. This is a plausible explanation for the accurate classification on the Alexander Set (Table 4.8), the less accurate classification on the Cyrus Set (Table 4.10), the fairly accurate Alexander Set leader classification (Table 4.11), and the somewhat odd matches between the Cyrus Set and the Alexander Set (Table 4.12 and Table 4.13). The difference in accuracy of the predictions on the Alexander Set and the Cyrus Set can be explained by considering play styles. Although the same preference values are represented in both sets, the play styles of the leaders do not match at all. The Alexander Set contains culture-oriented leaders, while there are none in the Cyrus Set. As another example, Mansa Musa and Victoria both have a high preference value for Gold, yet with regard to gold Mansa Musa's play style is far superior to Victoria's. The confusion in recognizing leaders from the Alexander Set in Table 4.11 can also be interpreted better when considering play styles instead of preference values: leaders who play alike are confused, for example Hatshepsut and Louis XIV, or Napoleon and Alexander. This holds even more for the confusion in matching leaders from the Cyrus Set to the Alexander Set (Table 4.12 and Table 4.13). Consider Washington, who is almost perfectly classified as Mansa Musa, while based on their preference values there is absolutely no match. Although their preference values do not match, their play styles in the game are rather similar: both are highly peaceful, intelligent leaders.


5.3 How to Classify Humans

The main problem for the classification of human players lies in the transfer of human preferences into preference values. The approach in this research leads to two situations in which subjectivity is unavoidable. The first is the participants' interpretation of the seven preferences: the names of the preferences suggest a meaning, but that meaning is interpreted by the participants. Based on their interpretations the participants wrote reports about their games. The second occasion of subjectivity is the translation of the reports into preference values. The reports were written in plain language, but the preferences of the participants need to be converted into the preference values {0, 2, 5} for each of the preferences. As a solution to this problem it is necessary to create a less subjective approach to the classification of human players. (1) The difference between approaches to a CIV 4 game can be maintained or altered, depending on the goal of the research. It is interesting to measure the difference in classification results between a more natural approach (casual player approach) and a more artificial approach (expert player approach). Regardless of the approach, this can be achieved by giving clear instructions to the participants. (2) It is also possible to create a roughly equal understanding of the preferences among the participants. This could be done by defining each preference. Furthermore, we could provide a list of CIV 4 leaders and their preferences, so the participants can derive the meaning of preferences from previous encounters with some of these leaders. (3) Finally, an issue in this research was the interpretation of the game reports by the researchers. Once the participants understand the preferences and the preference values sufficiently, we advise letting the participants write down their preferences as preference values instead of sentences. This way there is only one subjective phase instead of two, namely the interpretation of the participants' preferences by the participants themselves. An alternative solution is a questionnaire, forcing the participants to provide more controlled answers. Although these three solutions may not greatly enhance the classification results, they will provide a more objective and consistent base for classification on data generated by human players. Once the gathering of preference values is improved, it is possible to start thinking about further adjustments to enhance the classification results.
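As an illustration of solution (3), the sketch below (hypothetical; no such tool was built in this research) records a participant's self-reported values directly and rejects anything outside {0, 2, 5}, so the researchers' interpretation step disappears entirely.

    import java.util.Set;

    public class SelfReport {
        // The only preference values available to the participant.
        private static final Set<Integer> ALLOWED = Set.of(0, 2, 5);

        static int readPreference(String preference, int reported) {
            if (!ALLOWED.contains(reported)) {
                throw new IllegalArgumentException(
                    preference + ": value must be one of {0, 2, 5}, got " + reported);
            }
            return reported;
        }

        public static void main(String[] args) {
            // Hypothetical input by a participant for one game.
            System.out.println("Culture  = " + readPreference("Culture", 5));
            System.out.println("Military = " + readPreference("Military", 2));
        }
    }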


6 Conclusions

In the introduction we stated our problem statement: To what extent can a model be constructed of a player, which accurately predicts the player's preferences? To further specify the problem statement and to come to an answer, we divided it into three research questions:

1. What is a suitable player model for the computer game SID MEIER'S CIVILIZATION IV?
2. To what extent can a model be constructed, using a classification algorithm, which recognizes the preferences of a computer-controlled player in SID MEIER'S CIVILIZATION IV?
3. To what extent can a model be constructed, using a classification algorithm, that recognizes the preferences of a human player in SID MEIER'S CIVILIZATION IV?

In this chapter we will answer each research question and the problem statement. Each question will get a separate section. We conclude this chapter with a section about future work.

6.1 Suitable Player Model

We conclude that preferences are a suitable base for creating a player model in CIV 4. There are more possible bases for a player model, such as a player's strategies, skills, and weaknesses, or any combination of those (Houlette, 2004; Van den Herik et al., 2005; Donkers & Spronck, 2006). Since preferences are already embedded in the game code (Section 3.2), in contrast to skills, strategies, and weaknesses, we deem preferences more suitable for CIV 4. Referring back to the discussion chapter, it may also be wise to use a broader base for a player model: instead of preferences, play styles can be considered as a base. The downside of choosing play styles is that they are not explicitly specified in the game code, but emerge from all the game code that steers the behaviour of computer-controlled players. Using a classifier to create a player model is also suitable. A problem of using a classifier is the need for large feature sets. Because of the preferences in the game code, we could use computer-controlled players to generate these large feature sets (Section 3.3). We thereby solved that problem and could fully benefit from the capabilities of the classifier without its drawbacks. The ability of a classifier to create a model and to make predictions is exactly what was required for this research.

6.2 Predicting AI Opponents

To answer to what extent a player model can be constructed using a classification algorithm to predict the preferences of computer-controlled players, we need to look at the results of the Classification Models Validation (Section 4.1) and the Modelling of AI (Section 4.2). In Section 4.1 we constructed classification models and validated them using the Alexander Test Set. The leaders from the Alexander Test Set were similar to the leaders that were used to create the classification models. The accuracy of the predictions of these leaders is encouraging: all predictions were correct in 65% or more of the instances, and all were a significant improvement over the frequency baseline. However, in Section 4.2 the real test for the classification models was initiated: to predict a new set of computer-controlled leaders, the Cyrus Set. The predictions on the Cyrus Set were less accurate than we anticipated; each preference value prediction had a lower accuracy than the frequency baseline. A possible explanation and solution are discussed in Section 5.2. We conclude that the extent to which we can construct a model, using a classification algorithm, which recognizes the preferences of a computer-controlled opponent in CIV 4, is limited. When the classification models are constructed based on Leader Set A, they are capable of predicting the preferences of Leader Set A fairly accurately. However, classification models constructed based on Leader Set A are not capable of accurately predicting the preferences of Leader Set B. This conclusion is based on the results of this research. Although this is what the results tell us now, we strongly feel that classification algorithms can be used to model and accurately predict computer-controlled players in CIV 4, especially when the solutions provided in Section 5.2 are used.

6.3 Predicting Human Opponents

We conclude, based on the results of the Modelling of Humans (Section 4.3), that the classification models are not capable of accurately predicting human players in general. The classification of human players is only marginally researched here; further research is needed to provide conclusive answers to this research question. Suggestions for further research are made in Section 6.5. From this research we conclude that the extent to which we are capable of predicting the preferences of humans, based on classification models constructed on computer-controlled players, is limited.

6.4 Answer to the Problem Statement

From this research we can conclude that it is possible to create a model that fairly accurately predicts the preferences of computer-controlled players. This is only possible, however, if those same computer-controlled players are also used to create the classification models. When the classification models are to predict the preferences of 'unknown' computer-controlled players, they are not capable of doing so accurately. For the classification models to be able to predict 'unknown' players, it is necessary to reconsider the construction of the classification models. This is elaborately discussed in Section 5.2. The modelling of a human opponent is only marginally researched here. We can conclude that the preferences of a human player are more likely to be predicted accurately if the human player plays consistently. This is somewhat at odds with our approach, as one of its intended strengths should be the ability to adapt to changes. Further research is required, however.

6.5 Future Work

This research presents an interesting approach to player modelling, which most certainly deserves further research. There are several recommendations we would like to share, based on our experience during this research. The recommendations mainly contain improvements or alterations to our own research process. At the end of this section, a possible implementation of the player model in CIV 4 is discussed.

First and foremost we would like to stress the importance of a solid base for the construction of the classification models. Once solid classification models are created, it is possible to accurately predict even skewed feature sets. To be more specific to player modelling in CIV 4, it may be wise to divide the preferences as evenly as possible over the first set of leaders. This way the classification algorithm gets sufficient examples of each preference value to construct a more accurate model. Furthermore, preferences may not be the ideal class to predict on; suggestions are made towards the more general play styles (Section 5.2). Second is the extrapolation of the preference values: instead of selecting leaders with preference values {0, 2, 5}, select leaders with preference values {0, 10}. The advantage lies in the larger differences in behaviour between the leaders, which should make it easier for the SMO classifier to predict the preference values. Whether this will indeed generate better results needs to be investigated in future research.


Third, it is possible to try this approach in a less complicated environment. CIV 4 is a highly complex environment with a multitude of parameters that influence each other and the game environment. To research the full potential of our approach, the environment should be simplified.

Although we did not optimize the classifier, we did optimize the feature set. It includes a large number of possible features that are available to both players. The modifications made to the feature set (Section 3.4) all resulted in better classification results. Furthermore, we used three different techniques to improve the feature set even further (Section 4.1). We aimed to improve the feature set as much as possible and believe it hard to improve it further still. There is of course room for fine-tuning in two areas. (1) To include time in our feature set we used the adjustment 'Trend'. We chose a time span of five turns, since we felt this was a valid number of turns for a game of CIV 4. It is possible that different time spans contribute more to the accuracy of the classification. (2) We chose to erase the first 100 turns of each game, again because we felt this was an appropriate number of turns for a game of CIV 4. Here it is also possible that erasing a different number of turns would improve the accuracy of the predictions. (A sketch of how such derived features could be computed from the raw game data is included after Table A.1 in Appendix A.)

With regard to human participants we advise adjusting our approach slightly: present the participant with a table that includes all the preferences to predict, explain how these preferences are to be interpreted and which preference values are available, and let the participant fill in the preference values themselves. This way there is only one subjective interpretation phase, namely the interpretation of the participants' preferences by the participants themselves. This is discussed extensively in Section 5.3. We believe that these adjustments can improve research in player modelling by means of classification, especially when future research follows the approach we chose and discussed.

Implementation in Civilization IV

The actual implementation of a player model was beyond the reach of this research, but we have strong ideas on how to use the classification output as a base for a player model. To illustrate what the output of an SMO classifier on a feature set looks like, we have included a shortened version of such an output in Appendix F. The SMO classifier output includes the predicted preference value for a preference per instance. We would need to write a program that extracts instances, similar to AIAutoPlay. These instances need to be accumulated and combined into a small feature set. It may even be possible for the player to adjust the number of instances he wants to collect. Those combined instances form a test set, which is then used on all seven classification models, one for each preference. The output back towards the player would be the preference value predicted by the SMO classifier. A theoretical representation of the information towards the player is presented in Table 6.1.

Table 6.1 Theoretical representation of the feedback to the player for n instances

             Low (1)   Medium (3)   High (4)   Very High (5)
Aggression     ...%       ...%        ...%         ...%

             Low (0)   Medium (2)   High (5)
Culture        ...%       ...%        ...%
Gold           ...%       ...%        ...%
Growth         ...%       ...%        ...%
Military       ...%       ...%        ...%
Religion       ...%       ...%        ...%
Science        ...%       ...%        ...%
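A minimal sketch of how this feedback could be computed is given below. It is our reading of the proposal above, not an existing implementation: the predicted values are hypothetical, and the program simply tallies the predictions that one classification model emitted for the collected instances into the percentages of Table 6.1.

    import java.util.HashMap;
    import java.util.Map;

    public class PreferenceFeedback {
        public static void main(String[] args) {
            // Hypothetical predicted values for n = 8 collected instances of
            // the preference Military, one prediction per instance.
            int[] predicted = {5, 5, 0, 5, 2, 5, 5, 0};
            Map<Integer, Integer> counts = new HashMap<>();
            for (int v : predicted) counts.merge(v, 1, Integer::sum);
            // Report each preference value as a percentage of the instances.
            for (int value : new int[]{0, 2, 5}) {
                double pct = 100.0 * counts.getOrDefault(value, 0) / predicted.length;
                System.out.printf("Military, value %d: %.1f%%%n", value, pct);
            }
        }
    }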

Considering the experience and lessons from this research, it is already possible to create this application. With further research it may even be possible to create a more accurate and usable application.


References

Bakkes, S. C. J., Spronck, P. H. M., and Van den Herik, H. J. (2009). Opponent Modelling for Case-based Adaptive Game AI. Entertainment Computing. (To appear).

Carmel, D. and Markovitch, S. (1993). Learning Models of Opponent's Strategy in Game Playing. In Proceedings of the AAAI Fall Symposium on Games: Planning and Learning, pages 140-147, Raleigh, NC.

Donkers, H. H. L. M., Uiterwijk, J. W. H. M., and Van den Herik, H. J. (2001). Probabilistic Opponent-model Search. Information Sciences, 135(3-4):123-149.

Donkers, H. H. L. M. (2003). Nosce Hostem - Searching with Opponent Models. PhD thesis, SIKS Dissertation Series No. 2003-13, Faculty of Humanities and Sciences, Maastricht University, Maastricht, The Netherlands.

Donkers, H. H. L. M. and Spronck, P. H. M. (2006). Preference-based Player Modeling. In Rabin, S. (Ed.), AI Game Programming Wisdom 3, pages 647-659. Charles River Media, Inc., Hingham, Massachusetts, USA.

Firaxis Games (2005). Civilization IV Manual.

Fürnkranz, J. (2007). Recent Advances in Machine Learning and Game Playing. ÖGAI-Journal, 26(2).

Houlette, R. (2004). Player Modeling for Adaptive Games. In Rabin, S. (Ed.), AI Game Programming Wisdom 2, chapter 10.1, pages 557-566. Charles River Media, Inc., Hingham, Massachusetts, USA.

Keerthi, S. S., Shevade, S. K., Bhattacharyya, C., and Murthy, K. R. K. (2001). Improvements to Platt's SMO Algorithm for SVM Classifier Design. Neural Computation, 13(3):637-649.

Mak, G. M. F. (2000). The Implementation of Support Vector Machines using the Sequential Minimal Optimization Algorithm. Master's thesis, School of Computer Science, McGill University, Montreal, Canada.

Platt, J. C. (1998). Fast Training of Support Vector Machines using Sequential Minimal Optimization. In Schölkopf, B., Burges, C., and Smola, A. (Eds.), Advances in Kernel Methods - Support Vector Learning, pages 185-208. MIT Press, Cambridge, Massachusetts, USA.

Rohs, M. (2007). Preference-based Player Modelling for Civilization IV. Bachelor's thesis, Faculty of Humanities and Sciences, Maastricht University, Maastricht, The Netherlands.

Slagle, J. R. and Dixon, J. K. (1970). Experiments with the M & N Tree-searching Program. Communications of the ACM, 13(3):147-154.

Spronck, P. H. M. and Den Teuling, F. (2010). Player Modelling in Civilization IV. (To appear).

Van den Herik, H. J., Donkers, H. H. L. M., and Spronck, P. H. M. (2005). Opponent Modelling and Commercial Games. In Kendall, G. and Lucas, S. (Eds.), Proceedings of the IEEE 2005 Symposium on Computational Intelligence and Games (CIG'05), pages 15-25.

Van Lankveld, G., Spronck, P. H. M., and Van den Herik, H. J. (2009). Incongruity-based Adaptive Game Balancing. In Proceedings of the 12th Advances in Computer Games Conference (ACG12).


A Appendix - Features

In this appendix an overview of all features is provided. All these features are used by the SMO classifier to predict the preferences of players.

Table A.1 All 128 features

#    Attribute                        #    Attribute
001. Turn                            065. ScoreDiffTrend
002. War                             066. ScoreDiffTrendDerivate
003. Cities                          067. Economy
004. CitiesDerivate                  068. EconomyDerivate
005. CitiesTrend                     069. EconomyTrend
006. CitiesTrendDerivate             070. EconomyTrendDerivate
007. CitiesDiff                      071. EconomyDiff
008. CitiesDiffDerivate              072. EconomyDiffDerivate
009. CitiesDiffTrend                 073. EconomyDiffTrend
010. CitiesDiffTrendDerivate         074. EconomyDiffTrendDerivate
011. Units                           075. Industry
012. UnitsDerivate                   076. IndustryDerivate
013. UnitsTrend                      077. IndustryTrend
014. UnitsTrendDerivate              078. IndustryTrendDerivate
015. UnitsDiff                       079. IndustryDiff
016. UnitsDiffDerivate               080. IndustryDiffDerivate
017. UnitsDiffTrend                  081. IndustryDiffTrend
018. UnitsDiffTrendDerivate          082. IndustryDiffTrendDerivate
019. Population                      083. Agriculture
020. PopulationDerivate              084. AgricultureDerivate
021. PopulationTrend                 085. AgricultureTrend
022. PopulationTrendDerivate         086. AgricultureTrendDerivate
023. PopulationDiff                  087. AgricultureDiff
024. PopulationDiffDerivate          088. AgricultureDiffDerivate
025. PopulationDiffTrend             089. AgricultureDiffTrend
026. PopulationDiffTrendDerivate     090. AgricultureDiffTrendDerivate
027. Gold                            091. Power
028. GoldDerivate                    092. PowerDerivate
029. GoldTrend                       093. PowerTrend
030. GoldTrendDerivate               094. PowerTrendDerivate
031. GoldDiff                        095. PowerDiff
032. GoldDiffDerivate                096. PowerDiffDerivate
033. GoldDiffTrend                   097. PowerDiffTrend
034. GoldDiffTrendDerivate           098. PowerDiffTrendDerivate
035. Land                            099. Culture
036. LandDerivate                    100. CultureDerivate
037. LandTrend                       101. CultureTrend
038. LandTrendDerivate               102. CultureTrendDerivate
039. LandDiff                        103. CultureDiff
040. LandDiffDerivate                104. CultureDiffDerivate
041. LandDiffTrend                   105. CultureDiffTrend
042. LandDiffTrendDerivate           106. CultureDiffTrendDerivate
043. Plots                           107. Maintenance
044. PlotsDerivate                   108. MaintenanceDerivate
045. PlotsTrend                      109. MaintenanceTrend
046. PlotsTrendDerivate              110. MaintenanceTrendDerivate
047. PlotsDiff                       111. GoldRate
048. PlotsDiffDerivate               112. GoldRateDerivate
049. PlotsDiffTrend                  113. GoldRateTrend
050. PlotsDiffTrendDerivate          114. GoldRateTrendDerivate
051. Techs                           115. ResearchRate
052. TechsDerivate                   116. ResearchRateDerivate
053. TechsTrend                      117. ResearchRateTrend
054. TechsTrendDerivate              118. ResearchRateTrendDerivate
055. TechsDiff                       119. CultureRate
056. TechsDiffDerivate               120. CultureRateDerivate
057. TechsDiffTrend                  121. CultureRateTrend
058. TechsDiffTrendDerivate          122. CultureRateTrendDerivate
059. Score                           123. StateReligion
060. ScoreDerivate                   124. DeclaredWar
061. ScoreTrend                      125. CumulativeDeclaredWar
062. ScoreTrendDerivate              126. AverageDeclaredWar
063. ScoreDiff                       127. CumulativeWar
064. ScoreDiffDerivate               128. AverageWar
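As a rough illustration of how the derived feature families above could be computed, the sketch below processes a hypothetical per-turn Gold series. The exact definitions are given in Section 3.4; here we merely assume that Derivate is the change since the previous turn, Trend the change over the last five turns, and Diff the difference with the opponent's value, with the first 100 turns of each game erased (both parameters as discussed in Section 6.5).

    public class DerivedFeatures {
        static final int TREND_SPAN = 5;    // five-turn window (Section 6.5)
        static final int SKIP_TURNS = 100;  // erased opening turns (Section 6.5)

        public static void main(String[] args) {
            // Hypothetical Gold series for the player and the opponent.
            double[] mine = new double[300], theirs = new double[300];
            for (int t = 0; t < 300; t++) {
                mine[t] = 10 + 0.5 * t;
                theirs[t] = 8 + 0.6 * t;
            }
            for (int t = SKIP_TURNS + TREND_SPAN; t < mine.length; t += 50) {
                double derivate = mine[t] - mine[t - 1];                 // GoldDerivate
                double trend = mine[t] - mine[t - TREND_SPAN];           // GoldTrend
                double diff = mine[t] - theirs[t];                       // GoldDiff
                double diffTrend = diff                                  // GoldDiffTrend
                    - (mine[t - TREND_SPAN] - theirs[t - TREND_SPAN]);
                System.out.printf(
                    "turn %d: Gold=%.1f Derivate=%.1f Trend=%.1f Diff=%.1f DiffTrend=%.1f%n",
                    t, mine[t], derivate, trend, diff, diffTrend);
            }
        }
    }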


B Appendix - SMO

This appendix provides WEKA's description of SMO. Furthermore, it includes a screenshot of the standard settings that are used in this research and a glossary provided by WEKA about the settings. The description of SMO provided by WEKA is presented in Figure B.1.

NAME weka.classifiers.functions.SMO

SYNOPSIS Implements John Platt's sequential minimal optimization algorithm for training a support vector classifier. This implementation globally replaces all missing values and transforms nominal attributes into binary ones. It also normalizes all attributes by default. (In that case the coefficients in the output are based on the normalized data, not the original data --- this is important for interpreting the classifier.) Multi-class problems are solved using pairwise classification (1-vs-1 and if logistic models are built pairwise coupling according to Hastie and Tibshirani, 1998). To obtain proper probability estimates, use the option that fits logistic regression models to the outputs of the support vector machine. In the multi-class case the predicted probabilities are coupled using Hastie and Tibshirani's pairwise coupling method.

Note: for improved speed normalization should be turned off when operating on SparseInstances.

For more information on the SMO algorithm, see

J. Platt: Fast Training of Support Vector Machines using Sequential Minimal Optimization. In B. Schoelkopf and C. Burges and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, 1998.

S.S. Keerthi, S.K. Shevade, C. Bhattacharyya, K.R.K. Murthy (2001). Improvements to Platt's SMO Algorithm for SVM Classifier Design. Neural Computation. 13(3):637-649.

Trevor Hastie, Robert Tibshirani: Classification by Pairwise Coupling. In: Advances in Neural Information Processing Systems, 1998.

Figure B.1 – Description of the SMO classifier, provided by WEKA

Presented in Figure B.2 are the standard settings for SMO in WEKA. These settings are used for each classification in this research. Following in Figure B.3 is a glossary corresponding with the options in Figure B.2, provided by WEKA.


Figure B.2 – Screenshot of the standard settings that WEKA provides for the SMO classifier. These are the settings we used for each experiment.
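For readers who wish to reproduce these settings outside the WEKA GUI, the sketch below configures SMO through the WEKA Java API, using the option string recorded in the run information of Appendix F (we restored the spacing of "-V -1 -W 1"; the API usage itself is our own sketch, not part of this research).

    import weka.classifiers.functions.SMO;
    import weka.core.Utils;

    public class ConfigureSMO {
        public static void main(String[] args) throws Exception {
            SMO smo = new SMO();
            // Option string taken from the run information in Appendix F.
            smo.setOptions(Utils.splitOptions(
                "-D -C 1.0 -L 0.0010 -P 1.0E-12 -N 0 -V -1 -W 1 "
                + "-K \"weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 1.0\""));
            // smo.buildClassifier(trainingInstances);  // e.g., train on Database A
            System.out.println(Utils.joinOptions(smo.getOptions()));
        }
    }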

OPTIONS

- buildLogisticModels -- Whether to fit logistic models to the outputs (for proper probability estimates).
- c -- The complexity parameter C.
- checksTurnedOff -- Turns time-consuming checks off - use with caution.
- debug -- If set to true, classifier may output additional info to the console.
- epsilon -- The epsilon for round-off error (shouldn't be changed).
- filterType -- Determines how/if the data will be transformed.
- kernel -- The kernel to use.
- numFolds -- The number of folds for cross-validation used to generate training data for logistic models (-1 means use training data).
- randomSeed -- Random number seed for the cross-validation.
- toleranceParameter -- The tolerance parameter (shouldn't be changed).

Figure B.3 – Description of the possible options for the SMO classifier, provided by WEKA


C Appendix – InfoGain & GainRatio

In this appendix we cite the descriptions that WEKA provides of InfoGainAttributeEval and GainRatioAttributeEval. Furthermore, we include a screenshot of the standard settings we used, and a glossary given by WEKA about the settings.

NAME weka.attributeSelection.InfoGainAttributeEval

SYNOPSIS InfoGainAttributeEval :

Evaluates the worth of an attribute by measuring the information gain with respect to the class.

InfoGain(Class,Attribute) = H(Class) - H(Class | Attribute).

Figure C.1 – Description of InfoGainAttributeEval, provided by WEKA

Figure C.2 – Screenshot of the standard settings that WEKA provides for InfoGainAttributeEval. These are the settings we used for the experiments.

OPTIONS

- binarizeNumericAttributes -- Just binarize numeric attributes instead of properly discretizing them.
- missingMerge -- Distribute counts for missing values. Counts are distributed across other values in proportion to their frequency. Otherwise, missing is treated as a separate value.

Figure C.3 – Description of the possible options for InfoGainAttributeEval, provided by WEKA


NAME weka.attributeSelection.GainRatioAttributeEval

SYNOPSIS GainRatioAttributeEval :

Evaluates the worth of an attribute by measuring the gain ratio with respect to the class.

GainR(Class, Attribute) = (H(Class) - H(Class | Attribute)) / H(Attribute).

Figure C.4 – Description of GainRatioAttributeEval, provided by WEKA

Figure C.5 – Screenshot of the standard settings that WEKA provides for GainRatioAttributeEval. These are the settings we used for the experiments.

OPTIONS

- missingMerge -- Distribute counts for missing values. Counts are distributed across other values in proportion to their frequency. Otherwise, missing is treated as a separate value.

Figure C.6 – Description of the possible options for GainRatioAttributeEval, provided by WEKA
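Both evaluators can also be run programmatically. The sketch below (our own illustration; the file name is hypothetical, and this research used the WEKA GUI) ranks all attributes of a dataset with InfoGainAttributeEval; swapping in GainRatioAttributeEval yields the GainRatio ranking used for Appendix D.

    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.InfoGainAttributeEval;
    import weka.attributeSelection.Ranker;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RankFeatures {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("databaseA.arff");  // hypothetical file name
            data.setClassIndex(data.numAttributes() - 1);        // preference value as class
            AttributeSelection selection = new AttributeSelection();
            selection.setEvaluator(new InfoGainAttributeEval()); // or new GainRatioAttributeEval()
            selection.setSearch(new Ranker());                   // rank all attributes
            selection.SelectAttributes(data);
            // Ranked attribute indices; the class attribute appears last.
            for (int index : selection.selectedAttributes()) {
                System.out.println(data.attribute(index).name());
            }
        }
    }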


D Appendix – InfoGain & GainRatio Features

Table D.1 Selected features for InfoGain and GainRatio

#    InfoGain Feature Set           #    GainRatio Feature Set
01.  CitiesDiff                     01.  CitiesDiff
02.  CitiesDiffTrend                02.  CitiesDiffTrend
03.  UnitsDiff                      03.  Land
04.  UnitsDiffTrend                 04.  LandDerivate
05.  PopulationDiff                 05.  LandTrend
06.  Land                           06.  LandDiff
07.  LandTrend                      07.  LandDiffDerivate
08.  LandDiff                       08.  LandDiffTrend
09.  LandDiffTrend                  09.  Plots
10.  Plots                          10.  PlotsDerivate
11.  PlotsTrend                     11.  PlotsTrend
12.  PlotsDiff                      12.  PlotsDiff
13.  PlotsDiffTrend                 13.  PlotsDiffDerivate
14.  TechsDiff                      14.  PlotsDiffTrend
15.  TechsDiffTrend                 15.  TechsDiff
16.  ScoreDiff                      16.  TechsDiffTrend
17.  ScoreDiffTrend                 17.  Economy
18.  Economy                        18.  EconomyTrend
19.  EconomyTrend                   19.  EconomyDiff
20.  EconomyDiff                    20.  EconomyDiffTrend
21.  EconomyDiffTrend               21.  IndustryDerivate
22.  PowerDiff                      22.  CultureDerivate
23.  PowerDiffTrend                 23.  CultureTrendDerivate
24.  CultureDerivate                24.  CultureDiff
25.  CultureTrendDerivate           25.  CultureDiffDerivate
26.  CultureDiff                    26.  CultureDiffTrend
27.  CultureDiffDerivate            27.  CultureDiffTrendDerivate
28.  CultureDiffTrend               28.  MaintenanceDerivate
29.  CultureDiffTrendDerivate       29.  CultureRate
30.  CultureRate                    30.  CultureRateTrend
31.  CultureRateTrend               31.  CumulativeDeclaredWar
32.  CumulativeWar                  32.  AverageDeclaredWar
                                    33.  CumulativeWar


E Appendix - Reports

This appendix contains a total of seven game reports made by the two participants. Three reports were made by the casual player and four by the expert player. The reports of the casual player were originally written in Dutch and have been translated into English here; the reports of the expert player are in English.

Three reports from the casual CIV 4 player, originally written in Dutch and translated into English below.

I am going to play as Gandhi and try to force a cultural victory. My main focus is therefore on Culture and, to a lesser extent, Growth.

My opponent is Mansa Musa, so I am going all-out for the cultural victory and will try to keep the peace at all costs. I was at peace with Mansa Musa throughout the entire game and managed to achieve a cultural victory.

Report E.1 - Gandhi vs. Mansa Musa by the casual player

I am mainly going to focus on Growth and, to a lesser extent, on Military, to have access to good resources and multiple cities, which helps to build up an army quickly. In general I do not play very aggressively but focus more on defence. Once I have gathered a strong army, I attack. For this game I choose Montezuma, because I hate anarchy and because of his special unit, to which you gain access fairly early in the game. I also like it when I receive a scout instead of a warrior at the start of the game. My opponent this time is Julius Caesar; I do not know why, but I hate Caesar. Around 1000 AD I received two great artists in a row and could grab a chunk of Caesar's territory in one go. He responded by threatening me, demanding that I pay gold. Because my empire was stretched out, I placed many military troops in the border city. He has three cities nearby that I am going to try to take over (1030 AD). It went less easily than I thought: I took over only two of them, and it took longer than I expected. Caesar accepted peace in 1290 AD. I am now focusing on Growth and Culture in order to force a domination victory. Caesar declared war in 1665 AD. I answer this by pursuing domination through conquering his cities. The war was somewhat disappointing; I had to buy peace before he would conquer my cities. During the war I conquered one of his cities and he one of mine. Neither of us gained much from it. I am now going to focus on building up an army to conquer his cities (1755 AD).

I think I am strong enough, so I declare war once more (1810 AD). I defeated Caesar with a domination victory in 1836 AD.

Report E.2 - Montezuma vs. Julius Caesar by the casual player


I am mainly going to focus on Military and, to a lesser extent, on Growth, to have access to good resources and multiple cities, which helps to build up an army quickly. In general I do not play very aggressively but focus more on defence. Once I have gathered a strong army, I attack. For this game I choose Montezuma, because I hate anarchy and because of his special unit, to which you gain access fairly early in the game. I also like it when I receive a scout instead of a warrior at the start of the game. Around 275 BC Mansa Musa built a city near me. I was not very pleased about this, so I attacked this city in 200 AD, while it was still small, and took it over fairly quickly. By then I had three reasonably large cities that could quickly raise an army, so I could take another city from Mansa Musa. After conquering these two cities I offered him peace, which he accepted. I have now cut Mansa Musa off from a large part of the world map and I think I will go for a victory by means of domination, so I will focus on Growth and, to a lesser extent, Culture; it is now 475 AD. In 1560 AD I won with a domination victory; I did not have to attack Mansa Musa again. I had sealed him off sufficiently.

Report E.3 - Montezuma vs. Mansa Musa by the casual player

Four reports from the expert CIV 4 player.

In this game I played Alexander against Louis XIV. My goal for the game was to work towards a conquest victory. I only built two cities: Athens and Sparta. Athens had iron, Sparta horses. That is all I needed. I tried to locate horses and iron as quickly as possible, by researching Animal Husbandry and Iron Working. This was followed by Horseback Riding, so I could construct axemen, swordsmen, and horse archers. Until 1600 BC I just developed my two cities and did research. At that point I felt ready to delve into military, and started to build axemen and swordsmen, which I sent towards the French. I also used a worker to build a long road in that direction, to speed up later assaults. When I had two axemen and a swordsman near the French, they created their city right next to my units parked at their border. That was unacceptable, so I attacked. This was 1160 BC. This new city was quickly razed, and I marched on towards Paris. Since Paris was too well defended, I just created loads of units which I parked around the city. These took out any settlers that dared to leave Paris. Louis XIV asked for peace twice, but I declined. In the meantime I researched Mathematics and Construction. As soon as that research was finished, I built several catapults which I sent towards the French, followed by some elephants. Two catapults bombarded Paris' defenses for several rounds, after which they attacked the city, causing collateral damage. Then the seven or eight units parked around the city entered, and managed to take it with minimal losses. I decided to keep Paris as a fallout base for further conquests. The city was a bit expensive to keep, but I did not care as the game would not take very long anymore. The French still had Orleans left, which was taken down about ten rounds later in a similar manner as Paris. A conquest victory was achieved in 350 AD.

How would I rate my characteristics in this game? I was definitely military-oriented for the whole game, and very clearly so after 1600 BC. I was highly aggressive after 1160 BC, not so much before that. I did not care at all about religion, growth, gold, culture, or science.

Report E.4 - Alexander vs. Louis XIV by the expert player

In this game I played Hatshepsut against Tokugawa. My goal for the game was to work towards a cultural victory. The first three cities (Thebes, Memphis, and Heliopolis) were built in such a way that they would generate lots of culture. I specifically used religion to get large amounts of culture going. The culture worked pretty well; I robbed Tokugawa of lots of land. I also gained two of his cities. I tried to keep the peace by taking the same religion that Tokugawa took, namely Confucianism. However, Tokugawa switched to Christianity at some point, and as I did not have any Christian cities I could not follow him. Tokugawa got pretty annoyed with me at that point, but I tried to keep the minimal defense necessary to avoid him attacking me. In the end, I won a domination victory. A cultural victory is probably unachievable on a Duel map, as the cultural spread will force a domination victory before long.

How would I rate my characteristics in this game? I was definitely religion-oriented, and also culturally oriented. I was definitely not aggressive, nor was I military oriented. I did not care about gold at all. I was not much interested in growth, and neither was I much interested in science.

Report E.5 - Hatshepsut vs. Tokugawa by the expert player


In this game I played Mansa Musa against Napoleon. My goal for the game was to create an economic powerhouse, and get to a victory from that basis. A space race victory or diplomatic victory seemed to be most in line with an economic society, but with Napoleon as neighbour that might be impossible. I started by researching some worker techs and some religion. Religion is not needed, but it helps in developing cities. I wanted a basic society consisting of about four cities, most of them surrounded by cottages to get lots of gold in later stages. I researched techs which allowed trading, and then I got Open Borders with Napoleon when this was possible, to get some trade going. After that I gave priority to techs which allowed extra gold from cities by building markets, grocers, banks, and harbors. Around 500 BC Napoleon was annoyed and closed the borders. That was the sign to spend a bit of time on defenses, so I built skirmishers, about two in every city. My civilization filled half the map, with about seven cities. One of Napoleon's cities joined my society. That annoyed him a lot, and around 500 AD he blackmailed me into giving him 80 gold. As I did not really care about gold, having lots of it, I gave him this. I kept developing my society, giving priority to techs, buildings, and wonders which produced gold. After 1500 AD, I invested more in research. I already got fast research going, but if I wanted to win the space race, I needed more. I thought about getting a diplomatic victory, so I steered research towards Mass Media and the United Nations. As I had by far the biggest population, I thought I could vote myself into office. Unfortunately, I found that I could not build the UN, as at least three civilizations are required in the game to do that. Rather irritating. As Napoleon was getting more and more annoyed with me, I knew I had to invest heavily in military, so after 1700 AD I constructed more units. In 1806 AD Napoleon declared war on me. He took one of my cities, but as I was ahead of him in tech, I retaliated by razing one of his cities. I marched towards Paris, taking out his units and destroying some of his developments. He was doing more or less the same to me. Unfortunately, neither of us was strong enough to really get an edge, so when I proposed a Peace Treaty in 1848 AD, Napoleon accepted. At this point I knew that a space race victory was unlikely. I already had about 70% of the world's population, and about 70% of the land mass. If I could get the land mass to 74%, I would get a domination victory. However, that meant I had to take some of Napoleon's cities. Therefore I needed tanks. I beelined to Combustion to get oil, and then beelined to Tanks. In the meantime, I constructed lots of riflemen, grenadiers, cannons, destroyers, and later infantry. When my first tank got out of the factory, I started bombarding Napoleon's cities, declaring war in 1886 AD. The war lasted until 1914 AD, when I had reduced Napoleon to one city, thereby getting the required land mass for a domination victory. This game was not much fun to play. Since I was so rich, I knew I could out-tech and out-build Napoleon, but still it took quite a long time to play.

My characteristics for the game were as follows: Until 1700 AD: highly economical, some science, not at all aggressive. After 1700 AD: highly militaristic, some economy, some science, highly aggressive.

Report E.6 – Mansa Musa vs. Napoleon by the expert player

In this game I played Louis XIV against Mansa Musa. My goal was to win a cultural victory. As in the earlier game where I played Hatshepsut, I guessed that it was probable that it still would turn out to be a domination victory, but I now wanted to take a slightly different route. I decided to focus on culture, again through religion (as this is the easiest), but neglect the military almost completely in favor of science. Against Mansa, it should be possible to play a low-power game. I tried to postpone choosing a state religion as long as possible, as religions give more culture as long as you do not have a state religion. I could do that until the point that I gained Pacifism. I wanted Pacifism for the great-person bonus, but that only pays off if you have a state religion. I picked one that Mansa did not have, as I did not have any city with Buddhism, which was what my opponent had gone for. This soured our relationship, and Mansa closed off his borders. At this point I had to build a few military units, to make him hesitate to attack. I went to Liberalism as quickly as possible. As soon as I had that, I switched to Free Religion, which benefitted my relationship with my neighbor. He opened his borders again and was actually very generous in his trade offers. However, I did not take him up on these offers, as I thought it would be better for me to keep an advantage in the sciences that I had picked. Indeed, the game ended in a domination victory, but my three cultural cities were blooming and a cultural victory was around the corner. My characteristics for this game are: high culture, relatively high religion, relatively high science. The rest absent. No aggression at all.

Report E.7 – Louis XIV vs. Mansa Musa by the expert player


F Appendix – SMO Output

=== Run information ===

Scheme:       weka.classifiers.functions.SMO -D -C 1.0 -L 0.0010 -P 1.0E-12 -N 0 -V -1 -W 1 -K "weka.classifiers.functions.supportVector.PolyKernel -C 250007 -E 1.0"
Relation:     Training_Complete_Minus100-weka.filters.unsupervised.attribute.NumericToNominal-R2-weka.filters.unsupervised.attribute.Remove-R129-weka.filters.unsupervised.attribute.NumericToNominal-R129-136-weka.filters.unsupervised.attribute.Remove-R129-132,134-136
Instances:    54714
Attributes:   129
              [list of attributes omitted]

Test mode: user supplied test set: size unknown (reading incrementally)

=== Classifier model (full training set) ===

SMO

Kernel used: Linear Kernel: K(x,y) = <x,y>

Classifier for classes: 0, 2

BinarySMO

Machine linear: showing attribute weights, not support vectors.

         0.7309 * (normalized) Turn
 +       0.1485 * (normalized) War
 +       0.8281 * (normalized) Cities
 +      -0.1586 * (normalized) CitiesDerivate
 +       1.4166 * (normalized) CitiesTrend
 +      -0.6093 * (normalized) CitiesTrendDerivate
 +       0.8438 * (normalized) CitiesDiff
 +       0.0496 * (normalized) CitiesDiffDerivate
 +       1.0676 * (normalized) CitiesDiffTrend
 +      -0.5381 * (normalized) CitiesDiffTrendDerivate
 +      -2.9234 * (normalized) Units
 +       0.0484 * (normalized) UnitsDerivate
 +      -3.3088 * (normalized) UnitsTrend
 +       1.3264 * (normalized) UnitsTrendDerivate
 +       0.5734 * (normalized) UnitsDiff
 +       0.0162 * (normalized) UnitsDiffDerivate
 +       0.2863 * (normalized) UnitsDiffTrend
 +       0.0625 * (normalized) UnitsDiffTrendDerivate
 +      -4.652  * (normalized) Population
 +       0.28   * (normalized) PopulationDerivate
 +      -4.6523 * (normalized) PopulationTrend

...

...

Time taken to build model: 4709.06 seconds

=== Predictions on test set ===

inst#,  actual, predicted, error, probability distribution
    1      1:0       1:0           *0.667  0.333  0
    2      1:0       1:0           *0.667  0.333  0
    3      1:0       1:0           *0.667  0.333  0
    4      1:0       1:0           *0.667  0.333  0
    5      1:0       1:0           *0.667  0.333  0
    6      1:0       1:0           *0.667  0.333  0
    7      1:0       1:0           *0.667  0.333  0
    8      1:0       1:0           *0.667  0.333  0
    9      1:0       1:0           *0.667  0.333  0
   10      1:0       1:0           *0.667  0.333  0
   11      1:0       1:0           *0.667  0.333  0
   12      1:0       1:0           *0.667  0.333  0
   13      1:0       1:0           *0.667  0.333  0
   14      1:0       1:0           *0.667  0.333  0
   15      1:0       1:0           *0.667  0.333  0
   16      1:0       1:0           *0.667  0.333  0
   17      1:0       1:0           *0.667  0.333  0
   18      1:0       1:0           *0.667  0.333  0
   19      1:0       1:0           *0.667  0.333  0
   20      1:0       1:0           *0.667  0.333  0
   21      1:0       1:0           *0.667  0.333  0
   22      1:0       1:0           *0.667  0.333  0
   23      1:0       1:0           *0.667  0.333  0

...

 9434      3:5       1:0      +    *0.667  0      0.333
 9435      3:5       1:0      +    *0.667  0      0.333
 9436      3:5       1:0      +    *0.667  0      0.333
 9437      3:5       3:5            0.333  0      *0.667
 9438      3:5       3:5            0.333  0      *0.667
 9439      3:5       3:5            0.333  0      *0.667
 9440      3:5       3:5            0.333  0      *0.667
 9441      3:5       3:5            0.333  0      *0.667
 9442      3:5       3:5            0.333  0      *0.667
 9443      3:5       3:5            0.333  0      *0.667
 9444      3:5       3:5            0.333  0      *0.667
 9445      3:5       3:5            0.333  0      *0.667
 9446      3:5       3:5            0.333  0      *0.667
 9447      3:5       3:5            0.333  0      *0.667
 9448      3:5       3:5            0.333  0      *0.667
 9449      3:5       3:5            0.333  0      *0.667
 9450      3:5       3:5            0.333  0      *0.667
 9451      3:5       3:5            0.333  0      *0.667
 9452      3:5       1:0      +    *0.667  0      0.333
 9453      3:5       1:0      +    *0.667  0      0.333
 9454      3:5       1:0      +    *0.667  0      0.333
 9455      3:5       1:0      +    *0.667  0      0.333
 9456      3:5       1:0      +    *0.667  0      0.333
 9457      3:5       1:0      +    *0.667  0      0.333

...


...

12458      2:2       1:0      +    *0.667  0.333  0
12459      2:2       1:0      +    *0.667  0.333  0
12460      2:2       2:2            0.333  *0.667 0
12461      2:2       2:2            0.333  *0.667 0
12462      2:2       2:2            0.333  *0.667 0
12463      2:2       2:2            0.333  *0.667 0
12464      2:2       2:2            0.333  *0.667 0
12465      2:2       2:2            0.333  *0.667 0
12466      2:2       2:2            0.333  *0.667 0
12467      2:2       2:2            0.333  *0.667 0
12468      2:2       2:2            0.333  *0.667 0
12469      2:2       2:2            0.333  *0.667 0
12470      2:2       2:2            0.333  *0.667 0
12471      2:2       1:0      +    *0.667  0.333  0
12472      2:2       2:2            0.333  *0.667 0
12473      2:2       1:0      +    *0.667  0.333  0
12474      2:2       1:0      +    *0.667  0.333  0
12475      2:2       1:0      +    *0.667  0.333  0
12476      2:2       1:0      +    *0.667  0.333  0
12477      2:2       1:0      +    *0.667  0.333  0

...

=== Evaluation on test set ===

=== Summary ===

Correctly Classified Instances        13377               74.5694 %
Incorrectly Classified Instances       4562               25.4306 %
Kappa statistic                           0.4229
K&B Relative Info Score              490468.0566 %
K&B Information Score                  6133.2369 bits      0.3419 bits/instance
Class complexity | order 0            22500.3812 bits      1.2543 bits/instance
Class complexity | scheme           1127152.0362 bits     62.8325 bits/instance
Complexity improvement (Sf)        -1104651.655  bits    -61.5782 bits/instance
Mean absolute error                       0.2916
Root mean squared error                   0.3786
Relative absolute error                  87.4177 %
Root relative squared error              92.6243 %
Total Number of Instances             17939

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.904     0.555     0.764      0.904      0.828      0.678       0
                 0.284     0.065     0.469      0.284      0.354      0.687       2
                 0.581     0.017     0.871      0.581      0.697      0.825       5
Weighted Avg.    0.746     0.383     0.732      0.746      0.726      0.704

=== Confusion Matrix ===

     a     b     c   <-- classified as
 10788   900   249 |   a = 0
  2157   859     7 |   b = 2
  1175    74  1730 |   c = 5

Figure F.1 – Example of the output provided by WEKA after a classification with SMO