
Demonstrating the Feasibility of Automatic Balancing

Vanessa Volz Günter Rudolph Boris Naujoks

Abstract

Game balancing is an important part of the (computer) game design process, in which designers adapt a game prototype so that the resulting gameplay is as entertaining as possible. In industry, the evaluation of a game is often based on costly playtests with human players. It suggests itself to automate this process using surrogate models for the prediction of gameplay and outcome. In this paper, the feasibility of automatic balancing using simulation- and deck-based objectives is investigated for the card game top trumps. Additionally, the necessity of a multi-objective approach is asserted by a comparison with the only known (single-objective) method. We apply a multi-objective evolutionary algorithm to obtain decks that optimise objectives, e.g. win rate and average number of tricks, developed to express the fairness and the excitement of a game of top trumps. The generated decks are compared with published top trumps decks using simulation-based objectives. We demonstrate that it is possible to generate decks that are better than or at least as good as published top trumps decks in terms of these objectives. Our results indicate that automatic balancing with the presented approach is feasible even for more complex games such as real-time strategy games.

arXiv:1603.03795v1 [cs.HC] 11 Mar 2016

I. Introduction

The increasing complexity and popularity of (computer) games result in numerous challenges for game designers. Especially fine-tuning game parameters, which affects the feel and required skill profile of a game significantly, is a difficult task. For example, changing the time between shots for the sniper rifle in Halo 3 from 0.5 to 0.7 seconds impacted the gameplay significantly according to designer Jaime Griesemer1.

It is important to draw attention to the fact that the game designer's vision of a game can rarely be condensed into just one intended game characteristic. In competitive games, for example, it is certainly important to consider fairness, meaning that the game outcome depends on skill rather than luck (skill-based) and that the win rate of two equally matched players is approx. 50% (unbiased). But additionally, the outcome should not be deterministic and it should entail exciting gameplay, possibly favouring tight outcomes.

It therefore suggests itself to support the balancing process with tools that can automatically evaluate and suggest different game parameter configurations which fulfil a set of predefined goals (cf. [14]). However, since the effects of certain goals tend to be obscure at the time of design, we suggest to use a multi-objective approach which allows to postpone the decision on a configuration until the trade-offs can be observed. In this paper, we introduce game balancing as a multi-objective optimisation problem and demonstrate the feasibility of automating the process in a case study.

For the purpose of this paper, we define game balancing as the modification of parameters of the constitutive and operational rules of a game (i.e. the underlying physics and the induced consequences / feedback) in order to achieve optimal configurations in terms of a set of goals. To this end, we analyse the card game top trumps and different approaches to balance it automatically, demonstrating the feasibility and advantages of a multi-objective approach as well as possibilities to introduce surrogate models.

1 http://www.gdcvault.com/play/1012211/Design-in-Detail-Changing-the

In the following section, we present related work on top trumps, balancing for multiplayer competitive games, and gameplay evaluations. The subsequent section highlights some important concepts specific to the game top trumps and multi-objective optimisation, before the following section details our research approach including the research questions posed. Afterwards, the results of our analysis are presented and discussed, before we finish with a conclusion and outlook on future work.

II. Related Work

Cardona et al. use an evolutionary algorithm to select cards for top trumps games from open data [4]. The focus of their research, however, is the potential to teach players about data and learn about it using games. The authors develop and use a single-objective dominance-related measure to evaluate the balance of a given deck. This measure is used as a reference in this paper (cf. fD in Sec. IV).

Jaffe introduces a technique called restricted play that is supposed to enable designers to express balancing goals in terms of the win rate of a suitably restricted agent [11]. However, this approach necessitates expert knowledge about the game as well as an AI and several potentially computationally expensive simulations. In contrast, we explore other possibilities to express design goals and utilise non-simulation-based metrics.

Chen et al. intend to solve "the balance problem of massively multiplayer online role-playing games using co-evolutionary programming" [5]. However, they focus on level progression and ignore any balancing concerns apart from equalising the win-rates of different in-game characters.

Yet, most work involving the evaluation of a game configuration is related to procedural content generation, specifically map or level generation. Several papers focus on issuing guarantees, e.g. with regards to playability [18], solvability [17], or diversity [15, 10]. Other research areas include dynamic difficulty adaptation for single-player games [9], the generation of rules [2, 16], and more interactive versions of game design, e.g. mixed-initiative [13].

III. Basics

In the following, the game top trumps is introduced and theoretical background for the applied methods from multi-objective optimisation and performance evaluation is summarised.

I. Top Trumps

Top trumps is a themed card game originally published in the 1970s and relaunched in 1999. Popular themes include cars, motorcycles, and aircraft. Each card in the deck corresponds to a specific member of the theme (such as a car model for cars) and displays several of its characteristics, such as acceleration, cubic capacity, performance, or top speed. An example can be found in Fig. 1.

Figure 1: Card from a train-themed top trumps deck with 5 categories (winningmoves.co.uk)

At the start of a game, the deck is shuffled and distributed evenly among players. The starting player chooses a characteristic whose value is then compared to the corresponding values on the cards of the remaining players. The player with the highest value receives all cards in the trick and then continues the game by selecting a new attribute from their next card. The game usually ends when at least one player has lost all their cards. However, for the purpose of this paper, we end the game after all cards have been played once in order to avoid possible issues of non-ending games.

II. Multi-objective optimisation

Switching from single- to multi-objective optimisation has advantages and disadvantages [6]. While it is often better to consider multiple objectives for technical optimisation tasks, the complete order of individuals is lost in this case. Instead, one has to handle incomparable solutions, i.e. two solutions each of which is better than the other in at least one objective. This objective- or component-wise approach goes back to the definition of Pareto dominance. A solution or individual x is said to strictly (Pareto) dominate another solution y (denoted x ≺ y) iff x is better than y in all objectives. Considering minimisation, this reads

  x ≺ y iff ∀ i ∈ {1, ..., m}: f_i(x) < f_i(y)

under fitness function f: X ⊂ R^n → R^m, f(x) = (f_1(x), ..., f_m(x)). Based on this, the set of (Pareto) non-dominated solutions (Pareto set) is defined as the set of incomparable solutions as defined above, and the Pareto front is the image of the Pareto set under fitness function f.

Nevertheless, even incomparable solutions need to be distinguished when it comes to selection in an evolutionary algorithm. In the evolutionary multi-objective optimiser considered, SMS-EMOA [1], this is done based on the contribution to the hypervolume (i.e. the amount of objective space covered by a Pareto front w.r.t. a predefined reference point). The contribution of a single solution to the overall hypervolume of the front is used as the secondary ranking criterion for the (µ + 1)-approach to selection; the first one is the non-dominated sorting rank assigned to each solution.

For measuring the performance of different SMS-EMOA runs, we consider two other performance indicators next to the hypervolume of the resulting Pareto fronts. These are the additive ε indicator as well as the R2 indicator, all presented by Knowles et al. [12]. These indicators are also considered for the termination of EMOA runs using online convergence detection as introduced by Trautmann et al. [19]. For variation in SMS-EMOA, the most widely used operators in the field are considered, namely simulated binary crossover and polynomial mutation. These are parametrised using pc = 1.0, pm = 1/n, ηc = 20.0, and ηm = 15.0, respectively, cf. Deb [6].

III. Performance Measurement for Stochastic Multi-objective Optimisers

The unary performance indicators introduced above only express the performance of a single optimisation run. However, to evaluate and compare the relative performance of stochastic optimisers with potentially significantly different outcomes (such as evolutionary multi-objective optimisers), measurements for the statistical performance are needed.

For this purpose, (empirical) attainment functions that describe the sets of goals achieved with different approaches were proposed, expressing both the quality of the achieved solutions as well as their spread over multiple optimisation runs [7]. Based on these functions, the set of goals that are reached in 50% (or other quantiles) of the runs of the optimisers can be computed (also known as the 50%-attainment surface). Comparing the attainment surfaces of different optimisers is already a much better indicator for their performances than comparing the best solutions achieved, as the latter are subject to stochastic influences.

Additionally, Fonseca et al. detail a statistical testing procedure (akin to a two-sample two-sided Kolmogorov-Smirnov hypothesis test) based on the first- and second-order attainment functions of two optimisers [7]. If the null hypotheses of these tests are rejected, it can be assumed that the differences in performance of the considered optimisers are statistically significant.
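The strict dominance relation and the non-dominated filtering it induces can be sketched in a few lines. This is illustrative Python only, not the code used in the experiments (those rely on R and the emoa package, cf. Sec. IV):

```python
def dominates(x, y):
    """Strict Pareto dominance for minimisation:
    x dominates y iff x is better in every objective."""
    return all(xi < yi for xi, yi in zip(x, y))

def pareto_front(points):
    """Non-dominated subset of a list of objective vectors."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q is not p)]
```

Under this strict definition, two solutions that tie in one objective are incomparable; SMS-EMOA then falls back on non-dominated sorting ranks and hypervolume contributions to order them for selection.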

IV. Approach

For the remainder of this paper, we denote the number of cards in a deck as K (an even number) and the number of characteristics (categories) displayed on a card as L. Two representations are used for a deck: (1) as a vector x ∈ R^{K·L} for the evolutionary algorithm and (2) as a K × L matrix V for easier understanding. Accordingly, the value of the k-th card in the l-th category is v_{k,l} with k ∈ {1, ..., K}, l ∈ {1, ..., L}. The k-th card in a deck is v_{k,·} = (v_{k,1}, ..., v_{k,L}). A partial order for the cards can be expressed with v_{k1,·} ≺ v_{k2,·}, meaning that card v_{k2,·} beats v_{k1,·} in all categories.

In this paper, we only consider decks that fulfil two basic requirements we deem essential for entertaining gameplay:

• all cards in the deck are unique:
  ∄ (k1, k2) ∈ {1, ..., K}², k1 ≠ k2 with v_{k1,·} = v_{k2,·}

• there is no strictly dominant card in the deck:
  ∄ k1 ∈ {1, ..., K} with v_{k2,·} ≺ v_{k1,·} ∀ k2 ∈ {1, ..., K}
  (in this case, dominant cards have larger values, since higher values win according to the game rules)

We consider two agents p_4, p_0 with different knowledge about the played deck:

• p_4 knows the exact values of all cards in the deck
• p_0 only knows the valid value range for all values v_{k,l}

Both agents are able to perfectly remember which cards have been played already. Player p_4 is expected to perform better than p_0 on average on a balanced deck. In order to reduce the number of simulations needed to verify this, only games of a player p_4 against p_0 will be considered here.

In our simulation, both of these agents estimate the probabilities to win with each category on a given card in consideration of their respective knowledge about the deck as well as the cards already played. p_0 therefore has to assume a uniform distribution and will only take the values of their current card into account. p_4, in contrast, is able to model the probability more precisely by accounting for the number of cards with a higher value in each category that are still in play.

Let RG be the number of simulation runs. The number of tricks that p_4 received at the end of the r-th game (r ∈ {1, ..., RG}) with deck V will be called t_4^(r,V) henceforth; thus, iff t_4^(r,V) > K/2, p_4 won the game, iff t_4^(r,V) = K/2 the game was a draw, and else, p_4 lost. t_c^(r,V) is the number of times the player choosing the category did not win the trick in round r of the game with deck V, i.e. the number of times the player announcing the categories changed.

Since optimisation tasks are generally assumed to be minimisation problems (without loss of generality), this convention is satisfied here as well for the sake of consistency. Therefore, maximisation problems are transformed into minimisation tasks by multiplication with −1. In the course of this paper, we compare card sets from purchased decks with decks generated using three different approaches and corresponding fitness functions detailed in the following:

• Single-objective optimisation according to the dominance-related (D) measure proposed in [4], which describes the distance of the cards in a deck V to the Pareto front:

  fD(V) = − (1/K) ∑_{k=1}^{K} ∑_{i=1, i≠k}^{K} (1 − 1(v_{k,·} ≺ v_{i,·}))

• Multi-objective optimisation with simulation-based measures developed with expert knowledge that are supposed to express the deck V's fairness, excitement, and resulting balance (B):

  fB(V) = ( − (1/RG) ∑_{r=1}^{RG} 1(t_4^(r,V) > K/2),
            − (1/RG) ∑_{r=1}^{RG} t_c^(r,V),
              (1/RG) ∑_{r=1}^{RG} 2 |t_4^(r,V) − K/2| )

• Multi-objective optimisation with simulation-independent measures developed in the pre-experimental planning phase (cf. Sec. II) as surrogate (S) for the (simulation-based) fitness fB of different decks:

  fS(V) = ( − hv(V), − sd({avg(v_{·,l}) | l ∈ {1, ..., L}}) ),

  with the dominated hypervolume hv of a deck V, sd the empirical standard deviation and avg the average.

In order to compare the aforementioned approaches, an SMS-EMOA is used to approximate the Pareto front for the fitness functions fS and fB. Online convergence detection, variation operators, and parameters as described in Sec. II are used. For the single-objective fitness fD, the algorithm was modified as little as possible to enable comparisons. Thus, a (µ + 1)-EA was used with the same variation operators and equivalent selection. The convergence was tested based on the variations of some single-objective performance indicators, namely the min, mean and max fitness values of the active population. The experiments were conducted using R with the help of the emoa package2 and a related SMS-EMOA implementation3,4.

I. Research questions

The different approaches to finding a balanced top trumps deck are evaluated and compared in Sec. V. We focus on the following topics:

I   Problems of manual balancing and the solutions offered by automation
II  Feasibility of automatic balancing in terms of required quality
III Performance of multi- and single-objective approaches
IV  Feasibility of automatic balancing in terms of computational costs

II. Preexperimental planning

Before any experiments can be executed, the test case has to be defined more accurately. The following assumptions are made:

• The number of cards and categories are set to K = 32, L = 4 in accordance with these values for the purchased decks.
• The valid range of all values v_{k,l} is set to [1, 10] ⊂ R, which all decks can be transformed to. This results in an infinite number of possible cards, but other options entail the necessity to construct a genotype-phenotype mapping.

Due to the large number of possible card distributions among the players, the order of the cards in a deck, and different starting players, a single deck could potentially result in a large number of different games (4K!). As a consequence, all simulation-based metrics to evaluate the deck have to be approximated. The values of the metrics in fB, which all express an average, should be as close to the true mean of the respective distributions as possible. To ensure the quality of the approximation, a statistical t-test is conducted to compute the size of the confidence interval for each metric for RG between 100 and 10 000 at a confidence level of 0.95. This test is repeated 500 times for each possible sample size and each metric. Assuming a normal distribution, the .95-quantile is stored as the result. RG = 2 000 games are found to be a good tradeoff between computational time and fitness approximation accuracy for all metrics.

An equivalent test is conducted to decide on the number of optimisation runs necessary to approximate the performance of the corresponding approach (to a suitable confidence interval). Here, the HV-, ε- and R2-indicators are considered with the Pareto front resulting from all of these runs as a reference set. After considering the results, the number of runs was set to RO = 100.

For the simulation-independent approach, metrics that do not require a simulation are developed. The hypervolume is chosen as a measure to achieve as many non-dominated cards in each deck as possible. This is expected to improve the fairness of a deck (also cf. fD).

2 https://cran.r-project.org/web/packages/emoa/
3 https://github.com/olafmersmann/emoa
4 Additional code written for the simulation and experiments will be made accessible after publication
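The dominated hypervolume of a deck's cards is cheap to compute. As an illustration (not the authors' implementation), the following Python sketch handles a hypothetical two-category deck (the experiments use L = 4); since higher card values win, the cards are treated as a maximisation front with a reference point below all values:

```python
def hypervolume_2d(cards, ref=(0.0, 0.0)):
    """Dominated hypervolume of 2-D cards (maximisation) w.r.t. `ref`,
    computed with a simple sweep over the cards sorted by first category."""
    hv, best_y = 0.0, ref[1]
    for x, y in sorted(set(cards), reverse=True):
        if y > best_y:                      # card extends the front upwards
            hv += (x - ref[0]) * (y - best_y)
            best_y = y
    return hv

# three mutually non-dominated cards cover more space than one extreme card
print(hypervolume_2d([(3, 1), (2, 2), (1, 3)]))   # 6.0
print(hypervolume_2d([(3, 1)]))                   # 3.0
```

A dominated card lies inside the area already covered by another card and contributes nothing, which is why maximising hv(V) pushes towards decks consisting only of non-dominated cards.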

The standard deviation of category means is used to increase the significance of player p_0's disadvantage, thereby resulting in a higher win-rate for p_4.

Different population sizes are tested and the resulting Pareto fronts are compared for a quick estimate of their performance. Based on the results received, approaches are evaluated on runs with population sizes of 10 and 100 individuals. This accounts for both small populations with a high selection pressure and also bigger populations with a larger spread.

V. Results

We evaluate all approaches according to the fitness function fB, which is based on expert knowledge and therefore assumed to characterise a balanced deck best. This assumption is supported by the fact that most of the purchased decks are located on the approximated Pareto front according to these metrics.

In the remainder of the paper, we use the letter corresponding to the fitness function and the population size to refer to the union of the Pareto fronts of RO = 100 runs with the respective fitness function and population size for the multi-objective approaches. The introduced acronym with an added index p refers to the Pareto front of the respective set with regards to fB. A numerical index stands for the attainment surface to the indicated level. For example, S10 is the union of all Pareto fronts from optimiser runs with population size 10 and fitness function fS, S10_p is the Pareto front of this set and S10_50 is its 50%-attainment surface. For the single-objective approach, the union of the best individuals achieved in RO = 100 runs is considered instead, because the populations converge to one deck. The set of purchased decks will henceforth be denoted PD.

To facilitate the discussion of the experiments, the results of the three different approaches are plotted in terms of their performance on the fitness function fB. Figure 2 depicts the sets listed in Tab. 1. Figure 3 visualises the 50%-attainment surfaces as well as the Pareto fronts resulting from the Pareto front union for each approach. The legend for all plots can be found in Tab. 1, where the same colour scheme is used to refer to the Pareto fronts and attainment surfaces of the respective approaches.

Table 1: Legend providing the colour assignment to the results of different optimiser runs as depicted in Fig. 2 and Fig. 3. The same colour is assigned to all results from one approach, i.e. B10, B10_50 and B10_p have the same colour.

  B10   D10   S10   PD
  B100  D100  S100

I. Automatic Balancing Advantages

To evaluate the advantages of automatic balancing, the following hypotheses are proposed.

I-C1 The number of tests needed to approximate some simulation-based metrics for a single deck to an appropriate accuracy is very high and possibly exceeds the number of playtests that could reasonably be done with human players.
I-C2 Many of the purchased decks are unfair in the sense that the game's outcome depends strongly on luck and less on the players' skill levels.

With these hypotheses, the effort needed for manual balancing is considered and the performance of PD (i.e. likely manually balanced decks) is compared to that of automatically balanced decks.

The t-test described in Sec. II already determined that the best tradeoff between the accuracy of the approximation of simulation-based metrics and the number of simulations RG was ≈ 2 000. Considering the large effort playtests with humans necessitate as well as the bias induced by having different players play, testing 2 000 rounds of a game with humans is tedious and potentially not even possible on smaller budgets. For example, the standard deviation on fB for the decks in B10 is ≈ (0.0427, 0.3191, 0.3576). If we assume we had 100 players play 10 games each, the resulting confidence interval for α = 0.05 is

Figure 2: Union of all Pareto fronts of RO = 100 runs for all approaches considered (refer to the legend in Tab. 1) from two different perspectives. Elements of the shared Pareto front are depicted with larger squares.

Figure 3: 50%-attainment surfaces (left) and Pareto fronts (right) per approach of the union of all Pareto fronts from RO = 100 optimisation runs (depicted in Fig. 2). Refer to the legend in Tab. 1 for the colour scheme. Elements of the shared Pareto front are depicted with larger squares.

≈ (0.0442, 0.4298, 0.15), which would not allow the designers to distinguish between different solutions and is therefore not accurate enough. The standard deviation as well as the number of games needed would likely increase with the complexity of the game as well. Therefore, a definitive advantage of automatic balancing over manual playtests is the possibility of a quantitative analysis of simulation-based metrics (cf. [11]).

Except for a single deck (motor cycles), the purchased decks are all on the edge of the estimated Pareto front with the worst performances in terms of the win-rate of p_4. This is obvious in Fig. 3. The low win-rates for p_4 are probably due to the fact that the number of non-dominated cards in those decks in PD is relatively low. The fD average is ≈ −24.12 compared to the optimum of −31 (only non-dominated cards). This means that the resulting gameplay depends heavily on luck because there are card combinations with which a player simply cannot win regardless of their skill. The only exception is the motor cycle deck with a value for fD of −30.4375 and a much better p_4 win-rate of approx. 0.8.

Thus, we demonstrated that the effort needed to evaluate one deck is beyond a reasonable number of playthroughs. Additionally, the manually balanced decks are located on the extreme edges of the Pareto front, implying that it is difficult to find less extreme solutions manually. This also suggests that the approximation of the Pareto front could help a designer by giving them a more sophisticated idea about the characteristics of their game and potential alternatives. The findings by Nelson and Mateas also connote that designers see potential in automatic balancing tools to support the balancing process [14].

II. Automatic Balancing Quality

Next, we demonstrate the feasibility of automatic balancing, i.e. that at least some of the automatically balanced decks perform on par with the purchased decks.

II-C1 Automatically balanced decks are on the Pareto front.
II-C2 The results for II-C1 are statistically significant.

As is obvious from the plots (especially the right plot in Fig. 3), automatically balanced decks (S10 and B10) make up a large part of the Pareto front and are thus not dominated by the purchased decks. Moreover, most purchased decks are concentrated at the extreme edges of the front. The same is true for individuals in S10_50 and B10_50, which ensures that, despite the stochasticity of the approach, decks on the Pareto front can be achieved in at least 50% of all optimisation runs, making the result statistically relevant.

III. Single- and Multi-objective performance

We analyse whether the multi-objectification of the approach used by Cardona et al. [4] could result in better performing individuals. Therefore, we extend fD to fS. The bigger the dominated hypervolume (the first part of fS), the more non-dominated cards are in a deck, which is what fD expresses. The hypervolume additionally favours cards with a larger spread, which does not affect the dominance relationship (or the outcome of a playthrough or any simulation-based fitness values). The second part of fS is the standard deviation of the category means. The higher the deviation, the more problematic is the strategy of player p_0 to assume uniform distributions for the categories to make up for their lack of knowledge.

III-C1 There is a significant difference between the (empirical) attainment functions of the considered multi-objective (S10, S100) and single-objective (D10, D100) optimisation approaches.
III-C2 The results of the single-objective approach (D10, D100) perform significantly worse than the multi-objective ones (S10, S100) in terms of fB.

In order to test hypothesis III-C1, the statistical testing procedure for the comparison of empirical attainment surfaces described by Fonseca et al. [7] is conducted using software published by C. Fonseca5. With 10 000 random permutations and α = 0.05, the null hypothesis (the attainment functions of two approaches are equally distributed) is rejected with a p-value of 0 (critical value 0.23, test statistic 1) for all comparisons in {D10, D100} × {S10, S100}. This result was expected as, judging from the visualisations (e.g. in Fig. 2), the individuals found by the analysed approaches are in very different areas. Additionally, the results found by the single-objective approach have a very low spread, which is probably owed to the character of the fitness measure fD.

The sets of solutions found for the single-objective approaches are both strictly dominated by both surrogate approaches according to the definition by Knowles et al. [12]. Formally, it holds that

  (D10_50 ∪ D100_50) ≺ (D10 ∪ D100) ≺ S10_50 ≺ S10
  (D10_50 ∪ D100_50) ≺ (D10 ∪ D100) ∥ S100_50 ≺ S100.

5 https://eden.dei.uc.pt/~cmfonsec/software.html
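Set comparisons of this kind reduce to pairwise dominance checks between the members of the two sets. A simplified sketch (illustrative Python; Knowles et al. [12] define the set relation and its refinements precisely):

```python
def dominates(x, y):
    """Strict Pareto dominance for minimisation."""
    return all(xi < yi for xi, yi in zip(x, y))

def set_dominates(a, b):
    """True if every objective vector in b is dominated by
    at least one vector in a (i.e. a is the better set)."""
    return all(any(dominates(p, q) for p in a) for q in b)

def incomparable(a, b):
    """Neither set dominates the other."""
    return not set_dominates(a, b) and not set_dominates(b, a)
```

The incomparable case is exactly the situation reported for S100 above: neither set covers the other completely.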

8 The test for hypothesis III-C1 has shown that As visualised in Fig. 2 and Fig. 3, there are the attainment functions of the approaches are individuals in S10 on the shared Pareto front not the same. This indicates that using fitness and which are therefore not dominated by any function fS instead of fD has improved the individual in B10 ∪ B100. In fact, B100 ≺ S10 results in terms of fB, thus confirming III-C2. and B10kS10. For S100, it can only be said that This is suggests that the multi-objectification S100kB100, making IV-C1 only true for S10. of fD can indeed improve the achieved results In order to compare the performances of in this case. the Pareto fronts of the approaches considered here, the performance indicators for HV, e and R2 are computed for S10p, S100p, B10p, B100p. IV. Computational Costs and Surro- To facilitate the interpretation of these values, gate Objectives the aforementioned sets are normalised (result- ing in values between 1 and 2, cf. [19]) with We now address the feasibility of automatic regards to all values achieved (cf. Fig. 2) be- balancing in terms of computational costs. In fore computing the indicators. The normalised Sec. II, we have already analysed and verified Pareto front of the union of all achieved fronts its feasibility on function fB. Therefore, the is used as a reference set (cf. Fig. 3 (right)). computational costs needed with RG = 2 000 The resulting indicator values can be found in simulations per game and RO = 100 optimi- the upper half of Tab. 2. The non-dominated sation runs are obviously manageable for the sorting ranks in Tab. 2 (top half) clearly show considered application. that hypothesis II-C2 is true for the computed However, a simulation-based approach to values and that the approaches with the same balancing might prove too costly for more com- population size perform equally well. plex games with computationally expensive simulations or a large game-state space. 
We approach this problem by investigating the possibility of using simulation-independent measures (e.g. fS) instead of fB. Naturally, in practice these measures would need to be developed in accordance with the intended balancing goals and observations of the optimisers' behaviour, similar to what is described in Sec. II.

The following hypotheses are put forward in order to investigate the computational costs of automatic balancing and the feasibility of simulation-independent objectives:

IV-C1 Some results optimised based on fitness function fS (S10, S100) are not dominated by B10 and B100.

IV-C2 The best individuals in S10 and S100 perform at least equally well as the ones in B10 and B100 in terms of performance indicators.

IV-C3 There is no significant difference between the attainment functions of S10, S100 and B10, B100.

IV-C4 The results for IV-C1 and IV-C2 are statistically significant.

Table 2: Normalised indicator values for the Pareto fronts and 50%-attainment surfaces of the considered approaches (cf. Fig. 3) as well as their resulting ranks. The rank values in brackets are alternatives accounting for statistically insignificant distinctions.

set      | HV    | ε     | R2    | rank
B10_p    | 0.241 | 0.329 | 0.103 | 1
B100_p   | 0.541 | 0.559 | 0.203 | 2 (3)
S10_p    | 0.177 | 0.368 | 0.091 | 1
S100_p   | 0.475 | 0.501 | 0.206 | 2
B10_50   | 0.490 | 0.555 | 0.203 | 1 (1)
B100_50  | 0.638 | 0.650 | 0.258 | 3 (2)
S10_50   | 0.527 | 0.565 | 0.222 | 2 (1)
S100_50  | 0.676 | 0.692 | 0.288 | 4 (3)

In order to test the statistical significance of this statement, the width w of the confidence interval for α = 0.05 for each set and each indicator is computed. This is done using a t-test to estimate the true indicator means on the separately achieved performance indicators for RO = 100 runs for each approach, normalised as before. When accounting for the uncertainty expressed in the confidence intervals, all differences in performance indicators in Tab. 2 (top half) are statistically significant except for the difference in R2 for B10 and S10, as well as B100 and S100. This means that in the true ranking, B100 could be ranked 3 instead of 2.

The tests used to compare empirical attainment functions for hypothesis III-C1 described in Sec. III are applied again here to compare the attainment functions for all combinations of B10, B100 and S10, S100. Contrary to hypothesis IV-C3, the tests all reject the null hypothesis of equal attainment functions with a p-value of 0, although the decisions are a bit tighter than in Sec. III. Thus, hypothesis IV-C3 cannot be confirmed. The differences in attainment functions are likely due to the fact that the compared sets occupy different areas in the objective space (cf. Fig. 2). This can be explained by failing to express the excitement of a playthrough in fS, which was constructed to better express a deck's fairness starting from fD (cf. Sec. III). Therefore, if the goal was to approximate the solutions obtained from fB with simulation-independent fitness measures, different ones should be selected, possibly using the p-value of the aforementioned test as an indicator for their quality.

From Fig. 3 (left) it is obvious that both B10_50 and S10_50 contain individuals on the shared Pareto front, thus proving that IV-C1 is true in at least 50% of optimisation runs. The performance indicators for the 50%-attainment surfaces of the respective approaches are listed in Tab. 2 (bottom half), along with their ranks and the possible true rank when uncertainty is accounted for. In this case, S100_50 performs significantly worse than all the other approaches considered. However, S10_50 definitely performs better than B100_50, and there is no clear ranking of the performances of S10_50 and B10_50. This implies that hypothesis IV-C4 is true as well.

Since the values in Tab. 2 are all based on normalised outcomes, the absolute values can be compared. As expected, the 50%-attainment surfaces all perform worse than their Pareto-front counterparts, and the differences are significant. Interestingly, the differences per indicator are smaller in the bottom half of Tab. 2. This reflects the fact that the distances between the 50%-attainment surfaces of the different approaches in objective space are visibly smaller in Fig. 3 (left) than those between the Pareto fronts in Fig. 3 (right). There are also more individuals in S100_50 and B100_50 when compared to S100_p and B100_p, respectively, which explains their smaller loss in performance indicators. This is because both approaches experience less spread in the direction of the optimum.

V. Additional Observations

Next to the results discussed previously, some interesting observations were made during the experiments.

The single-objective optimisation approach converges to one deck for both population sizes tested, even though all decks with exclusively non-dominated cards perform equally well. The optimal fitness value for fD, 31, is achieved in almost all runs. This suggests that the algorithm used for single-objective optimisation, including the convergence detection, worked for this application. Furthermore, we can conclude that fD is not suited for deck generation because it does not distinguish decks well. This might be entirely different for data selection as done by Cardona et al. [4].

The optimiser runs were stopped by the convergence detection mechanism after very different numbers of function evaluations neval, even for the same approach. For example, the first 30 runs for S10 executed between 3 727 and 23 737 fitness function evaluations. There is no apparent correlation between neval and the quality of the achieved solutions, with neval between 3 993 and 20 243 for runs with solutions on the Pareto front of this subset of S10. This points to a high complexity of the fitness function landscapes and validates the use of online convergence detection in this experiment.
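The confidence-interval width w used for the significance assessment above can be computed per set and per indicator from the per-run indicator values. The following is a minimal sketch with a hypothetical helper `ci_width` and illustrative sample data (not from the experiments); for roughly 100 runs the normal quantile is a close stand-in for the exact t quantile (1.960 vs. 1.984):

```python
from math import sqrt
from statistics import NormalDist, stdev

def ci_width(samples, alpha=0.05):
    """Width of the (1 - alpha) confidence interval for the true mean
    of per-run indicator values (normal approximation of the t quantile,
    reasonable for sample sizes around 100)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)   # 1.960 for alpha = 0.05
    return 2 * z * stdev(samples) / sqrt(len(samples))

# hypothetical normalised indicator values from three runs
print(round(ci_width([0.2, 0.3, 0.4]), 4))    # → 0.2263
```

Two sets whose intervals overlap for an indicator can then be treated as indistinguishable, which is how the alternative ranks in brackets in Tab. 2 can be read.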

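The dominated-hypervolume indicator (HV) underlying the rankings in Tab. 2 can, for the two-objective minimisation case, be sketched as a simple sweep over the sorted front. The function name, point data and reference point below are illustrative, not taken from the paper:

```python
def hypervolume_2d(points, ref):
    """Dominated hypervolume of 2-D objective vectors (both objectives
    minimised) w.r.t. a reference point: the area enclosed between the
    non-dominated front and the reference point."""
    # keep only points that strictly dominate the reference point
    pts = sorted(p for p in points if p[0] < ref[0] and p[1] < ref[1])
    hv, prev_y = 0.0, ref[1]
    for x, y in pts:                  # ascending x
        if y < prev_y:                # dominated points are skipped
            hv += (ref[0] - x) * (prev_y - y)
            prev_y = y
    return hv

print(round(hypervolume_2d([(0.2, 0.8), (0.5, 0.4)], (1.0, 1.0)), 2))  # → 0.36
```

Raw dominated hypervolume is larger-is-better; note that the normalised values in Tab. 2 are on a smaller-is-better scale, as the listed ranks show.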
VI. Conclusion and Outlook

In this paper, we present our approach to automatic game balancing (as defined in Sec. I) and apply it to the card game top trumps. Our approach includes the formalisation and interpretation of the task as a multi-objective minimisation problem, which is solved using a state-of-the-art EMOA with online convergence detection. The performances of the resulting and purchased decks, next to a single-objective approach [4], are evaluated using statistical analyses.

We conclusively show the feasibility of automatic game balancing in terms of the quality of the achieved solutions for the game top trumps under the assumptions detailed in Sec. IV. Being aware that computational concerns could render a simulation-based approach infeasible for complex applications, an approach to avoid simulation was outlined in Sec. IV. The presented work, therefore, is a necessary step towards proving the feasibility of automatic balancing in general. Additionally, the apparent advantages of an automated balancing approach and of multi-objective balancing are discussed as well (cf. Sec. V). These discussions and the additional observations in Sec. V strongly indicate that the presented approach was suitable and successful.

A possible way to proceed with this work is to further optimise the different parts of the approach. For example, the considered optimisers could be improved by better parameters, e.g. determined by tuning methods like sequential parameter optimisation, thereby potentially sharpening our results. In addition, several other modules should be tested for possible (parameter) improvements, like the online convergence detection mechanism.

With respect to the implemented player AI, it seems reasonable to extend our research by testing different improvements of the probabilistic AI used in our study. This could provide interesting results if the restriction of allowing exactly K/2 rounds of play is removed. In this case, the agent is required to plan ahead, making more complex strategies profitable. A viable AI extension is inference-based reasoning about the opponent's cards, as demonstrated by Buro et al. in their work on improving state evaluation in trick-based card games [3]. Monte Carlo search is commonly used for card games as well, as they commonly feature imperfect information (cf. [8, 20]). Another route would be the implementation of AIs that imitate human players.

Further work on the analysis of the presented measures and the discovery of new ones is intended. As a first step in this direction, we propose to use our approach for different applications, possibly after developing application-specific methods. In that regard, we aim to test our approach on more complex computer games. A first attempt will be made incorporating The Open Racing Car Simulator (TORCS)⁶, but further tests on real-time-strategy games and platformers are intended as well. Based on the analysis of different well-performing fitness measures, a next step could be the investigation of generalisable ones.

More importantly, we plan to evaluate our vision of a balanced deck, our fitness measures and the results of our methods with surveys for human players. In our opinion, incorporating human perception of balancing is the only acceptable way to achieve the eventual goal, i.e. accurately expressing and maximising human players' enjoyment of a game.

⁶ http://torcs.sourceforge.net/

References

[1] N. Beume, B. Naujoks, and M. Emmerich. SMS-EMOA: Multiobjective Selection Based on Dominated Hypervolume. European Journal of Operational Research, 181(3):1653–1669, 2007.

[2] C. B. Browne. Automatic generation and evaluation of recombination games. PhD thesis, Queensland University of Technology, 2008.

[3] M. Buro, J. R. Long, T. Furtak, and N. R. Sturtevant. Improving state evaluation, inference, and search in trick-based card games. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1407–1413. Morgan Kaufmann, San Francisco, CA, 2009.

[4] A. B. Cardona, A. W. Hansen, J. Togelius, and M. G. Friberger. Open Trumps, a Data Game. In Foundations of Digital Games (FDG 2014). Society for the Advancement of the Science of Digital Games, Santa Cruz, CA, 2014.

[5] H. Chen, Y. Mori, and I. Matsuba. Solving the balance problem of massively multiplayer online role-playing games using coevolutionary programming. Applied Soft Computing, 18:1–11, 2014.

[6] K. Deb. Multi-Objective Optimization Using Evolutionary Algorithms. Wiley, Chichester, UK, 2001.

[7] C. Fonseca, V. D. Fonseca, and L. Paquete. Exploring the performance of stochastic multiobjective optimisers with the second-order attainment function. In C. A. C. Coello et al., editors, Evolutionary Multi-Criterion Optimization (EMO 2005), pages 250–264. Springer, Berlin, 2005. doi: 10.1007/b106458.

[8] T. Furtak and M. Buro. Recursive Monte Carlo search for imperfect information games. In Computational Intelligence and Games (CIG 2013), pages 225–232. IEEE Press, Piscataway, NJ, 2013.

[9] G. Hawkins, K. V. Nesbitt, and S. Brown. Dynamic Difficulty Balancing for Cautious Players and Risk Takers. International Journal of Computer Games Technology, 2012:1–10, 2012.

[10] A. Isaksen, D. Gopstein, J. Togelius, and A. Nealen. Discovering Unique Game Variants. In H. Toivonen et al., editors, Computational Creativity (ICCC 2015). Brigham Young University, Provo, Utah, 2015.

[11] A. Jaffe. Understanding Game Balance with Quantitative Methods. PhD thesis, University of Washington, 2013.

[12] J. Knowles, L. Thiele, and E. Zitzler. A Tutorial on the Performance Assessment of Stochastic Multiobjective Optimizers. TIK Report 214, Computer Engineering and Networks Laboratory (TIK), ETH Zurich, 2006.

[13] A. Liapis, G. N. Yannakakis, and J. Togelius. Sentient sketchbook: Computer-aided game level authoring. In Foundations of Digital Games (FDG 2013), pages 213–220. Society for the Advancement of the Science of Digital Games, Santa Cruz, CA, 2013.

[14] M. J. Nelson and M. Mateas. A requirements analysis for videogame design support tools. In Foundations of Digital Games (FDG 2009), pages 137–144. ACM Press, New York, 2009.

[15] M. Preuss, J. Togelius, and A. Liapis. Searching for Good and Diverse Game Levels. In Computational Intelligence and Games (CIG 2014), pages 381–388. IEEE Press, Piscataway, NJ, 2014.

[16] A. M. Smith and M. Mateas. Variations Forever: Flexibly generating rulesets from a sculptable design space of mini-games. In Computational Intelligence and Games (CIG 2010), pages 273–280. IEEE Press, Piscataway, NJ, 2010. doi: 10.1109/ITW.2010.5593343.

[17] A. M. Smith, E. Butler, and Z. Popović. Quantifying over Play: Constraining Undesirable Solutions in Puzzle Design. In Foundations of Digital Games (FDG 2013), pages 221–228. Society for the Advancement of the Science of Digital Games, Santa Cruz, CA, 2013.

[18] J. Togelius, M. Preuss, N. Beume, S. Wessing, J. Hagelbäck, G. N. Yannakakis, and C. Grappiolo. Controllable procedural map generation via multiobjective evolution. Genetic Programming and Evolvable Machines, 14(2):245–277, 2013.

[19] H. Trautmann, T. Wagner, B. Naujoks, M. Preuss, and J. Mehnen. Statistical Methods for Convergence Detection of Multi-Objective Evolutionary Algorithms. Evolutionary Computation, 17(4):493–509, 2009. doi: 10.1162/evco.2009.17.4.17403.

[20] C. D. Ward and P. I. Cowling. Monte Carlo search applied to card selection in Magic: The Gathering. In Computational Intelligence and Games (CIG 2009), pages 9–16. IEEE Press, Piscataway, NJ, 2009.