ELIMINATION from Design to Analysis

Ahmed Khalifa, Dan Gopstein, Julian Togelius Game Innovation Lab, New York University, NY, USA [email protected], [email protected], [email protected]

Abstract—Elimination is a word puzzle game for browsers and iOS, and Android devices. The game supports three different mobile devices, where all levels are generated by a constrained gameplay modes: Infinite, Daily, and Levels. In this work, evolutionary algorithm with no human intervention. This paper we will only focus on the most popular mode, Levels, which describes the design of the game and its level generation methods, and analysis of playtraces from almost a thousand players. The presents players with a fixed sequence of challenges. The analysis corroborates that the level generator creates a sawtooth- Levels mode includes 30 levels with 10 challenges each, all shaped difficulty curve, as intended. The analysis also offers of which were procedurally generated initially, and reused in insights into player behavior in this game. each playthrough. Index Terms—word games, game design, procedural content generation, puzzles, difficulty measurement, postmortem A. Challenges An Elimination challenge is a sequence of letters which I.INTRODUCTION doesn’t form a word by itself but through the removal of Using procedural content generation (PCG) in games, es- letters will result in a recognized English word. A challenge pecially for core game elements such as levels, is a bit of an should have more than one word that can be discovered art. When it is executed well, it enhances the player experience after removing various groups of letters. Figure 1 shows an and keeps the player wanting to play again [1] but any mistake example challenge in the game. The player is shown 6 letters in the system and the game can feel boring and repetitive, and “HATDEL” (see figure 1a). The player can eliminate letters may lead to disappointment [2]. Just like game design, PCG by clicking them. A solution for that challenge can be seen systems are designed using designers’ intuition and a trial and in figure 1b and 1c respectively where the player removes error process. Designers usually test the system on a group of the letter “D” then “L” to find “HATE”. That is not the only players and adjust the generator based on the feedback until solution, as the player can remove “H”, “D”, and “L” to find it works as intended [3]. The designers of successful games “ATE”. One of the incentives to lead the player to find certain often write a postmortem about what went right and what words is that the score depends on the length of the found went wrong during the development process to summarize word. Looking back on the example in figure 1, “HATE” will the lessons learned and help themselves and others better have score of 4 while “ATE” will have score of 3. understand the underlying systems [3]–[5]. Few postmortems, We also use bonus letters which are called the 2X letter however, are supplemented by player data to validate the to influence the player choices. The 2X letter multiplies the design decisions and understand player behavior; conversely, score of the found word by 2. For example: if the player found game analytics papers are rarely written by the designers of the word “TRUE” where the “U” is a 2X letter the resulting the analyzed game [6], [7]. In this paper, we will discuss the score would be 8 instead of 4. The 2X letter is a good way to design decisions for the word puzzle game Elimination and influence the player choices toward certain words which could its level generator. Our postmortem is complemented by an be used in creating easier levels for the player. analysis of player data to validate our generation system and B. Levels better understand the parameters influencing player decisions. An Elimination level consists of 10 consecutive challenges II.THE DESIGNOF ELIMINATION where each challenge has to be solved before a timer runs out. A level ends either when the player find the 10th word Elimination is a word puzzle game that was designed during or if the timer runs out. The timer is added to make sure 1 GameZanga 8 , a game jam, with a theme of “Minimalism”. that the players don’t solve the challenges using brute force The game follows the theme literally by showing the player a by listing all the different possible words and then choosing bunch of letters and the player has to remove some of these the highest score. The timer also adds an element of pressure letters so that the remaining letters form a recognized English to a generally casual and relaxed game. Equation 1 shows the word. In this paper, we are analyzing the final version of the time in seconds allocated for each challenge in the level where game which was released on January 23, 2019 for browsers, challengeNumber ranges between 1 and 10.

1https://itch.io/jam/gamezanga8 30 challengeTime = (1) 978-1-7281-1884-0/19/$31.00 ©2019 IEEE 1+(challengeNumber − 1)/5 (a) Current Challenge. (b) Eliminate “D” letter. (c) Eliminate “L” letter. Fig. 1: Example of a challenge where the player eliminated “D” and “L” letters to find the word “HATE” for 4 points of score.

When the level ends (regardless of whether all 10 words are from. After the mixing is done (before testing for constraints found), the game automatically unlocks the next level for the or fitness), the challenge word is shortened by using a greedy player to be played. We decided to allow players progress to algorithm that removes a single letter at a time, each time higher levels without completing previous levels to make sure verifying that the source words can be still extracted from the game is more relaxing and casual by not getting players the new challenge word. The reduction algorithm continues stuck on particular level. This will make sure that players who removing letters until either no more letters can be removed play the same level are doing so willingly and not because the or the new final word satisfies the length constraints. Since game forces them to. the problem of letter ordering solved by the representation, the only constraint being satisfied is the length of the final word. III.PUZZLE GENERATION Target word length is calculated by the formula in equation 2 Elimination levels are generated by running an evolution- where w is the challenge word and tl is the target length. ary algorithm 10 times with same parameters to generate a 1 |w|≤ tl single challenge per run. No human curation or editing was c = (2) performed when generating the included levels. (1 − log(|w|− tl + 1) |w| > tl A. Generative System As soon as the chromosome satisfies the constraint function shown in equation 2, it gets tested for its fitness which For the generation we used Feasible Infeasible 2-Population measures the degree to which recognized English words in (FI-2Pop) Algorithm [8] to generate the game challenges. Each the challenge word are obvious to spot or not. For example: challenge is generated by mixing a group of English words “CADOGT” and “CDAOGT” are two different challenge called the source words (usually two or three words) together words for “CAT” and “DOG” where the word “DOG” is to form the challenge word. The challenge here is that the easier to be noticed in “CADOGT” than in “CDAOGT”. mixture has to be efficient and this is tested by satisfying two The fitness function first generates all the possible recognized basic constraints: the final word has to be less than a target English words in challenge word. For example: “HATDEL” length and the order of the letters in the challenge has to obey is a mixture of “HATE” and “DEL” but these are not the the same order in the source words. The first constraint is only words in that challenge word as you can get “HADE”, added to make sure the system does an intelligent mixing of “ATE”, “TEL”, etc. The possible words set is split based on the source words so there is no repeated letters. For example: each word length using a certain value called the maximum “CAR” and “COOL” could be mixed as “CARCOOL” or sequence parameter to form two sets of words: long words it can be mixed as “CAROOL” where the second one is a and short words. The fitness function tries to minimize the better mixing than the first as it has fewer repeated letters. number of words from the long words set that appears as a The second constraint is added to make sure that the player full words in the challenge word and maximize the number of can solve the level and find any of the source words. For words from the short words set that appear as a full words in example: if “CDATOG” is a mixture of the source words the challenge word. Equation 3 describes the fitness function “CAT” and “DOG”, the ‘G’ can’t come before the ‘O’ as f, where maxSeq is a the maximum sequence parameter, it will be impossible to construct “DOG” from that mixture chWord is the challenge word, and words is a set of all by just eliminating letters. possible words in the chWord. Based on that, we decided that the chromosome consists of a list of integer values. The first three values point to the Long = {w|w ∈ words ∧|w| > maxSeq} source words in the corpus (We don’t use the full corpus during Short = {w|w ∈ words ∧|w|≤ maxSeq} generation. We define a fraction of the corpus to work with for v = {w|w ∈ Long ∧ w ∈ chWord} each level so earlier levels uses more frequent source words (3) than later levels.), while the remaining values are used for e = {w|w ∈ Short ∧ w ∈ chWord} the mixing procedure. The mixing procedure uses the integer |e| |v| f = (1.1 − ) ∗ (1.1 − )/1.21 values to determine which mixing word to grab the next letter |Short| |Long| Histogram of challenges solved by players How Many Players Reached Each Level Average Inferred Difficulty per Level 1000 300 0.8

750 0.7 200 0.6 500 0.5 # Players

Frequency 100 250 Inferred Difficulty Inferred

[1 − mean(Score)] 0.4

0.3 0 0 0 100 200 300 0 10 20 30 0 10 20 30 # of challenges solved Level Level (a) Histogram of challenges (b) The number of players that (a) (b) solved. reached each level. Fig. 3: Comparison between the ideal difficulty curve on the Fig. 2: General gameplay statistics. left, and the observed difficulty curve on the right.

It is important to note that the fitness function works as a public launch sufficient player data was gathered to measure soft constraint compared to the target length in equation 2. the relationship between the intended design of the game and We consider length as a hard constraint since a too-long word the actual reception by players. might not fit on the screen. Because of this we chose FI-2Pop over a standard genetic algorithm, because it makes sure that A. General Gameplay Statistics the hard constraint is satisfied before checking the fitness. In total 975 users played Elimination over the course of After all 10 challenges are generated for each level, the 98 days. The dedication of these players varied from casual, system picks a fixed number of these challenges to which to people who stumbled across the game and quickly realized apply the 2X letter bonus. The 2X letter is picked by finding it wasn’t for them, to the hardcore, playing levels over and the least frequent letter in each source word then picking a over again trying to achieve perfect results. The median player random letter out of that list. solved 20 challenges, equivalent to solving all the challenge of the first 2 levels, and the most diligent player solved exactly B. Parameter Choices 2000 challenges, repeating several levels over dozens of times All 30 levels in Elimination were generated automatically perfecting them. Figure 2 illustrates the distribution of player by the system with minimum interference from the designers. dedication through the number of completed challenges and We only controlled the difficulty of the generated level by the number of players that played each level. controlling the system parameters for each level. We used the following five parameters to control the difficulty of the B. Validation of Difficulty Measure generated level, corpusFreq: the subset of the corpus used The shape of the difficulty curve was intended to be a saw- during generation (this subset is defined by a minimum word tooth graph (as discussed in section III-B) similar to the one frequency and a maximum word frequency), sourceWords: showed in Figure 3a. In practice it’s impossible to measure the a list of the length of the source words being selected (usually absolute difficulty of a level (as players get better playing the two or three source words), targetLength: the length of the game), but we can estimate the relative difficulty by comparing challenge word after mixture that is being used to calculate players’ score from one level to another. We can assume that the constraints, maxSeq: the split parameter used in the fitness levels in which players score higher are easier, and levels in function to calculate the fitness, and num2X: the number of which players score lower are harder. Under this interpretation challenges in the level that has 2X. we had hoped to see that player scores would follow the The level generator of Elimination was designed to guide the inverse saw-toothed pattern the game’s difficulty was designed user through ups and downs in difficulty of play. The idea was to produce. The observed outcome is displayed in Figure 3b. to alternate between challenging the user and rewarding them To validate the efficacy of our level-generation parameters with successes. We selected the parameter values for each level on the resultant game, we employ a linear regression to such that the difficulty of the level increases every 5 levels measure the predictive power of the generation parameters then drops at the 6th level to form a saw-toothed difficulty on the observed player scores. Our linear regression took the curve [9]–[11] which is discussed in details in section IV-B. form of normalizedScore ∼ min(corpusFreq)+maxSeq+ The reason for the difficulty drop is to allow players to have targetLength+num2X+min(sourceWord), predicting the some breathing room after a hard sequence of levels, also mean normalized score of each level based on the parameters the sequence of gradual increase followed by sharp decline is described in section III-B where min(corpusFreq) is the designed to make sure that the generated levels will help the popularity of constituent words in the English language and user to be in state of flow [12]. min(sourceWord) is the size of the smallest constituent word. Together these features explain almost 20% of the IV. PLAYER ANALYSIS 2 variance (R = 19.43) of player scores. What this model Without a large sample of play-testers, the design goals of indicates is that, of the generation parameters used, the ones the game, and the accuracy of the level generation parameters most correlated with user score were the ones controlling word was not fully tested before the game was released. After the popularity, and the length of the challenge word. C. Understanding Word Choices predict the players’ scores based on the generation parameters. Since a level in Elimination consists of 10 independent The model was able to explain 20% of the data with the challenges, we can get a more detailed understanding of highest correlation for the frequency of the source words and the players’ responses by focusing on the outcomes of each the challenge word length. We further analyzed the data to individual challenge. Similar to our analysis of level generation understand more about the player word choices. We found parameters, we can use a linear regression model to explain that the length of the selected word and number of consecutive a player’s selections on a word-by-word basis. Rather than letters from the selected word in the challenge word have the evaluating the efficacy of our AI algorithm, this analysis is most effect on the player’s choice. more suited to teaching us about the way players think while There is much room for improvement to the game, for searching for words. We analyzed the following regression example we can use automated tuning to adjust the parameters selectionRate ∼ wordLength + maxSequence + has2X + to fit the saw-toothed difficulty curve similar to what was splitDistance+firstOccurrence+dirtyWord which mea- done by Isaksen et al [13]. Another direction can be achieved sures the correlation between the frequency a particular word by using Constrained Map-Elites [14] for offline generation is selected in a challenge (selectionRate) and the following instead of FI-2Pop to generate challenges that explores the features: the length of the selected word (wordLength), the full space of the game behaviors. largest number of letters from the selected word appear- ACKNOWLEDGMENT ing consecutively in the original challenge (maxSequence), Ahmed Khalifa acknowledges the financial support from whether the selected word has a 2x bonus on one of its letters NSF grant (Award number 1717324 - “RI: Small: General (has2X), the total number of letters from other words that are Intelligence through Algorithm Invention and Selection.”). interspersed with letters of the selected word (splitDistance), the position in the challenge word of the first letter of the REFERENCES selected word ( ), and whether or not the firstOccurrence [1] P. Glagowski, “Life after death: Edmund mcmillen on the success of selected word is profane (dirtyWord). Together these features the binding of isaac and his future,” https://www.destructoid.com/life- account for 21% of the variance in selection rates (R2 = 21.0). after-death-edmund-mcmillen-on-the-success-of-the-binding-of-isaac- The biggest contributing features are and and-his-future-523209.phtml, 2018, last Accessed: May 14, 2019. wordLength [2] E. Maiberg, “’No Mans Sky’ is like 18 quintillion bowls of oat- maxSequence. It is intuitive that words with more sequential meal,” https://www.vice.com/en us/article/nz7d8q/no-mans-sky-review, letters directly in the challenge word are more likely to be 2016, last Accessed: May 14, 2019. selected, as they can be read without any visual interpretation. [3] D. Yu, : Fight Books# 11. Boss Fight Books, 2016, vol. 11. However, it is less obvious why longer words are more likely [4] E. McMillen, “Postmortem: Mcmillen and himsl’s the binding to be selected. There are several reasons why players might of isaac,” https://www.gamasutra.com/view/feature/182380/postmortem tend to select longer words. Firstly, longer words are more mcmillen and himsls .php?print=1, 2012, last Accessed: May 14, 2019. [5] Lee, L. Kenny, and Teddy, “Rogue legacy design postmortem: Budget highly rewarded by the scoring system, selecting longer words development,” https://www.gdcvault.com/play/1020541/Rogue-Legacy- rewards the player with a bigger score. Counter-intuitively, Design-Postmortem-Budget, 2014, last Accessed: May 14, 2019. though, longer words may also be easier to select than shorter [6] B. G. Weber, M. John, M. Mateas, and A. Jhala, “Modeling player retention in madden nfl 11,” in Twenty-Third IAAI Conference, 2011. words. For example, longer words necessarily require fewer [7] A. Drachen, M. S. El-Nasr, and A. Canossa, Game Analytics: Maximiz- eliminations (clicks) to be selected than shorter words. Some ing the Value of Player Data. Springer, 2013. elimination paths (series of clicks required to select a short [8] S. O. Kimbrough, G. J. Koehler, M. Lu, and D. H. Wood, “On a feasible– infeasible two-population (fi-2pop) genetic algorithm for constrained word) will accidentally select larger words in the process, optimization: Distance tracing and no free lunch,” European Journal immediately solving the challenge. In fact, in the (somewhat of Operational Research, vol. 190, no. 2, 2008. rare) context of specific challenge words, there are some de- [9] P. Holleman, Reverse Design: Super Mario World. CRC Press, 2018. [10] J. Brown, “Difficulty curves start at their peak,” https: sired words that exist, but cannot be selected. For example: the //www.gamasutra.com/blogs/JonBrown/20100922/88111/Difficulty challenge word “AFART”, if the player tries to select the word Curves Start At Their Peak.php?print=1, 2010, last Accessed: May “FAR” they will quickly find out it is impossible. To transform 14, 2019. [11] D. Strachan, “The idea of a difficulty curve is all wrong,” http://www. “AFART” into “FAR” requires removing the first letter ‘A’ and davetech.co.uk/difficultycurves, 2018, last Accessed: May 14, 2019. the last letter ‘T’, however neither can be removed without [12] J. Nakamura and M. Csikszentmihalyi, “The concept of flow,” in Flow inadvertently creating another word (“FART” or “AFAR”, and the foundations of positive psychology. Springer, 2014. [13] A. Isaksen, D. Gopstein, and A. Nealen, “Exploring game space using respectively). This property was completely unknown to us survival analysis.” in FDG, 2015. before finding the prevalence of longer word selections. [14] A. Khalifa, S. Lee, A. Nealen, and J. Togelius, “Talakat: Bullet hell generation through constrained map-elites,” in Proceedings of The V. CONCLUSION Genetic and Evolutionary Computation Conference. ACM, 2018. This paper described the design of Elimination, a word puzzle game where all the levels were generated using the FI-2Pop algorithm. The generator was controlled using five different parameters that were adjusted to reflect a saw-toothed difficulty curve. We analyzed the collected user data to validate our difficulty curve by using a linear regression model to