
A Framework for Exploring and Evaluating Mechanics in Human Computation Games

Kristin Siu, Alexander Zook, and Mark O. Riedl
School of Interactive Computing, Georgia Institute of Technology
[email protected] [email protected] [email protected]

ABSTRACT

Human computation games (HCGs) are an approach to solving computationally-intractable tasks using games. In this paper, we describe the need for generalizable HCG design knowledge that accommodates the needs of both players and tasks. We propose a formal representation of the mechanics in HCGs, providing a structural breakdown to visualize, compare, and explore the space of HCG mechanics. We present a methodology based on small-scale design experiments using fixed tasks while varying game elements to observe effects on both the player experience and the human computation task completion. Finally, we discuss applications of our framework using comparisons of prior HCGs and recent design experiments. Ultimately, we wish to enable easier exploration and development of HCGs, helping these games provide meaningful player experiences while solving difficult problems.

CCS CONCEPTS

• Human-centered computing → Computer supported cooperative work; • Applied computing → Computer games;

KEYWORDS

human computation games; games with a purpose; scientific discovery games; game design; game mechanics

1 INTRODUCTION

Games are everywhere. Games accompany people out on their mobile phones and await them back at home on their entertainment systems. Games are being integrated into educational curricula, embedded in wearable devices, and used in professional training simulations. For human computation, which harnesses the computational potential of the human crowd, this increasingly diverse audience of players represents new opportunities for computation.

Human computation games (HCGs) are games that ask players to help solve complex, computationally-intractable tasks or provide data through gameplay. These games—also known as Games With a Purpose (GWAPs), scientific discovery games, and citizen science games—have been used to solve a wide variety of problems such as image labeling, protein folding, and data collection.

Compared with mainstream games for entertainment, one hurdle compounding HCG development is that these games suffer the design problem of serving two different goals. On the one hand, an HCG must provide a sufficiently-engaging experience for its players. On the other hand, an HCG must enable players to successfully complete the underlying human computation task. Balancing these two goals is difficult and often results in conflicting design decisions. To compound this dilemma, very little design knowledge exists beyond a small number of simple patterns from examples or takeaways from successful games (e.g., [7, 45]). This lack of knowledge is intimidating to task providers and HCG developers, who might find it difficult to justify the risk of an expensive development process when there are no guarantees that a game will be successful at a desired computational task. As a result, most HCGs to date are built around specific kinds of templates, leaving the space of possible HCG designs limited and relatively unexplored.

To facilitate broader adoption and ease of game development, human computation needs the tools and frameworks to study and communicate about these games in a consistent manner. We need to understand precisely what game elements make certain HCGs successful, that is, both effective at engaging players and at solving tasks. Building up comprehensive, reusable design knowledge for HCGs that addresses both players and tasks enables us to construct formal representations for tasks, audiences, and game elements. A common language and structure for HCGs would allow us to talk about and explore the space of possible HCG designs, thus broadening the diversity of HCGs as a whole.

Additionally, this knowledge could also be used to build formal models that can predict the effect of tasks, audiences, and game elements on both aspects of the player experience—engagement and retention—and the completed task metrics—data quality, volume, and diversity. All of these affordances would help to make HCGs more successful interfaces for solving new tasks while still able to engage new player audiences with changing player preferences.

In this paper, we contend that reusable design knowledge for human computation games is necessary and should take the needs of both players and tasks into account. We introduce a formal representation of HCG mechanics that provides us with a common vocabulary and structure to visualize, compare, and explore the space of game mechanics in HCGs. We advocate a methodology for building up HCG design knowledge, which uses small-scale, controlled design experiments on tasks with known solutions to understand how variations of game elements affect the player experience and the completion of human computation tasks. Finally, we illustrate how our representation, combined with this methodology, enables the comparative study of existing HCGs and the exploration of HCG mechanics through novel designs.

2 BACKGROUND

Human computation is the process of using people to solve computational problems as an alternative to current algorithms [23]. Such problems are often broken into smaller tasks, which are completed by crowdsourced workers and then aggregated into larger solutions. This paradigm has been used for tasks including classifying objects, ranking items, summarizing texts, and iterative document editing and improvement [23]. Some human computation or crowdsourcing tasks rely on volunteer efforts, where workers are motivated by participation in scientific processes or altruistic goodwill. More common are online interfaces such as Amazon Mechanical Turk and CrowdFlower, which give task providers a platform to recruit workers, distribute tasks, and provide monetary compensation.

Human computation games have been developed as an alternative to these monetary incentive systems, utilizing game mechanics to enable task completion and providing players an engaging gameplay experience, in addition or as an alternative to financial compensation. The original Game With a Purpose, the ESP Game, addressed the problem of labeling images [44], but the breadth of human-computed tasks has greatly diversified since. HCGs have been used to annotate or classify other kinds of information, from music [3, 22] to galaxies [26], to relational information (e.g., ontology construction) [19, 37], to protein function recognition [34]. Other HCGs have leveraged human players as alternatives to optimization functions in order to manipulate and refine existing input information. Common tasks of this type are often "scientific discovery" problems such as protein folding [6] and RNA folding [25], DNA multiple sequence alignment [17], software verification [8, 28], and mapping dataflow diagrams onto hardware architectures [32]. Additionally, HCGs have used crowdsourced audiences to collect or generate new information, such as creative content or datasets for machine-learning algorithms. These tasks include photo collection [42, 43], location tagging [4, 10], and commonsense knowledge acquisition [20]. Comprehensive taxonomies [18, 33] detail a wide breadth of HCGs and the tasks they have tackled.

Human computation game design has been primarily guided by examples of successful games. These include von Ahn and Dabbish's templates for classification and labeling tasks [45] and the design anecdotes of Foldit [7] rather than systematic study of HCG elements. As a result, we do not understand what specific elements of these particular design choices work and how to appropriately generalize them or consider alternatives. Confounding this issue is the fact that HCG research remains divided on how game elements, in particular game mechanics, can ensure both successful completion of tasks and engaging player experiences. Some argue that HCG game mechanics should be isomorphic or non-orthogonal to the underlying task, that is, game mechanics should map to the process of solving the underlying task [16, 41]. Others argue that incorporation or adaptation of game mechanics from successful digital games designed for entertainment can leverage player familiarity with existing games and keep them more engaged [19]. This ongoing debate highlights the challenge of designing HCGs that optimize for both a positive player experience and a successfully-completed human computation task.

Prior research has demonstrated that qualitative and quantitative research methods can be used to study game design, including methods from game usability [15], game analytics [35], and visual analysis [46]. Controlled studies have proven successful for understanding the influence of general game design elements, particularly in dual-purpose domains such as educational games. Such studies have investigated game aspects including difficulty [29, 47], controls [21, 31], and tutorials [1, 2, 27, 36].

General crowdsourcing and human computation research has worked to formalize the design and presentation of crowdsourcing tasks [23]. Controlled studies in this domain focus on evaluating how design variations affect worker efficiency and task completion, examples of which include the kinds of tasks [9] and the means of motivating workers [24]. How these might apply to human computation games in a way that addresses the necessity to engage players remains unexplored.

It is only recently that researchers have conducted similar controlled studies on specific game elements of HCGs that jointly address aspects of the player experience and the completion of the human computation task [12], and advocated for their use [40]. We discuss these, along with other relevant examples, in subsequent sections. Combined with formal crowdsourcing research, we posit that these approaches can enable a formal study of HCG design, allowing HCGs to more effectively address a broader range of tasks in ways more satisfying to players.

3 FORMALIZING HCG MECHANICS

We propose a formal representation of the mechanics of human computation games. This representation serves three functions:

(1) Provides a common vocabulary and visual organization of HCG elements
(2) Enables formal comparison of existing HCGs to understand the space of HCG designs and their consequences
(3) Facilitates the formulation of controlled design experiments of HCG elements to build further, generalizable knowledge of HCG design

We specifically formalize human computation game mechanics—the rules that define how a player can interact with the game systems—leaving other elements of HCG designs to future work.
We divide HCG game mechanics into three types: action mechanics, verification mechanics, and feedback mechanics. As shown in Figure 1, this breakdown reflects the core gameplay loop of most HCGs. HCGs begin with players taking in-game actions, then compare task-relevant input from these actions through verification mechanisms, and finally use verification output to provide feedback or reward for players.

Figure 1: Breakdown of HCG mechanics. Players provide inputs to take actions (shown in blue), which are verified (shown in orange), and receive feedback (shown in gray) from the game. Solid lines represent transitions through the gameplay loop.

We now define and describe these three sets of mechanics in detail, illustrated using three successful HCGs spanning different tasks: the original ESP Game [44], Foldit [6], and PhotoCity [43]. Figure 2 shows the mechanical breakdown of these games into action, verification, and feedback mechanics. In addition to these three examples, we also highlight notable instances of other HCGs that have explored these subsets of mechanics and discuss aspects of these mechanics that merit further exploration.

3.1 Action Mechanics

Action mechanics are the interface for players to complete a human computation task through in-game actions or gameplay. These mechanics align with the process of solving the human computation task, often asking players to utilize skills necessary for solving the task during play. Such mechanics may be as simple as entering text input or as complicated as piloting a space ship in a virtual environment, and tend to vary based on the nature of the task.

Examples. In the ESP Game, players provide labels through text entry to solve the task of labeling given images. In Foldit, players are given a variety of spatial actions, such as handling or rotating components of a protein structure, to solve the task of "folding" a given protein into a minimal energy configuration. In PhotoCity, players navigate to a desired location and take pictures using their camera phones, which are later uploaded to a database and used to construct a 3D representation of the buildings in that location.

Discussion. In general, the action mechanics in human computation games have been closely aligned with the process of solving the human computation task. The three examples in Figure 2 adhere to this alignment of mechanics and task completion, suggesting that the mechanics of these games were designed first and foremost with the task in mind, as opposed to adapting the mechanics of existing games for entertainment.

Actual explorations of adapting mechanics from mainstream or popular digital games do exist for HCGs, but are few. For example, the game OnToGalaxy [19] converts a classification task analogous to that previously seen in games such as OntoGame [37] into a space shooter game akin to the classic arcade game Asteroids. Unfortunately, no comparisons were conducted between these games; while OnToGalaxy was compared against the ESP Game, differences in tasks—not to mention game mechanics—make direct comparison and generalization difficult. Meanwhile, the game Gwario [38] was used in an experiment to explore singleplayer and multiplayer mechanics in HCGs, turning a classification task from prior work [40] into a platformer game resembling the classic action game Super Mario Bros. To complement this study of game mechanic variations, the authors also solicited opinions from HCG developers and researchers, asking directly if mechanics from mainstream digital games should be adapted into HCGs. Diverging from prior HCG design theories [16, 41], these experts were positive towards such adaptation, though not without cautioning that such mechanics ought to be selected carefully to avoid compromising the task results.

One of the major concerns surrounding human computation games is that when compared against other games designed purely for entertainment, HCGs might be perceived as shallow [41] and therefore risk failing to attract players. Adaptation of familiar or successful mechanics from digital games might serve to address this issue. Given that action mechanics are often the first mechanics that players encounter in a game, an important question is: can actions be used to engage players and attract certain audiences? Players accustomed to certain kinds of (action) mechanics may find some games easier to interact with, easing entry into the game and facilitating early successes. Conversely, familiar mechanics may raise expectations about the game that may or may not be met given the constraints of solving the task alongside expected gameplay. For example, an HCG adopting action mechanics from a successful game may negatively impact players' experiences if the design quality of the HCG did not match that of the popular game or series.

3.2 Verification Mechanics

Verification mechanics combine the output of player actions to compute task-relevant outcomes. These mechanics can support task completion outcomes including the quality, volume, diversity, and the rate at which the data are acquired.

Examples. For many human computation tasks, consensus on player input often serves as verification. The ESP Game (and many of the games inspired by its structure) verify using an online agreement check that filters correct answers from incorrect answers using agreement between players (Figure 2). The ESP Game later added "taboo word" mechanics to promote data diversity through banning words once consensus on existing data was reached.

By contrast, both Foldit and PhotoCity, which emphasize data manipulation and collection, handle verification through a task-based evaluation function. Foldit's protein configuration energy function determines the quality of player solutions. In PhotoCity, the game does not explicitly evaluate the provided photos; photos are instead processed on an offline server and then player feedback is based on the resulting alterations to a constructed 3D mesh of the world.

Foldit also makes use of social mechanics, such as allowing players to share solution procedures (called "recipes") through its community interfaces, as an additional (but optional) instance of verification [5]. Players can utilize existing recipes uploaded by other players as a starting point for solving tasks, thus validating and iterating on pre-existing, partial solution strategies, and can also rate their utility.

Discussion. Verification requirements for a task impact design decisions related to the number of players required to play the game and how they might interact with each other. Whether verification is handled within the human computation game or as an offline process may impact whether a game uses singleplayer mechanics, multiplayer mechanics, or both. Many human computation tasks rely on consensus, and multiple players permit synchronous consensus, as shown in the ESP Game. Other tasks, which can be evaluated using an objective function or comparison against existing data, may permit asynchronous (singleplayer) play, such as Foldit and PhotoCity. Synchronous verification often requires multiple players to be playing at the same time, necessitating a fast verification step for real-time feedback. Note that simulated players can be used to re-verify old solutions (e.g., utilized in the ESP Game), especially where multiple players are required, but are not necessarily available concurrently.
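To make these two verification styles concrete, the following is a minimal sketch of a synchronous output-agreement check in the style of the ESP Game and an asynchronous evaluation-function check in the style of Foldit's energy score. This is our own illustration rather than code from any of the cited games, and all function and variable names are hypothetical.

```python
from typing import Callable, Optional

def agreement_verification(label_a: str, label_b: str,
                           taboo_words: set[str]) -> Optional[str]:
    """ESP-style synchronous verification: two players' inputs are
    compared, and a label is accepted only if both players agree and
    the label has not already been exhausted (taboo)."""
    norm_a, norm_b = label_a.strip().lower(), label_b.strip().lower()
    if norm_a == norm_b and norm_a not in taboo_words:
        return norm_a          # verified label for this image
    return None                # no consensus; no data produced

def evaluation_verification(solution: list[float],
                            energy: Callable[[list[float]], float],
                            threshold: float) -> bool:
    """Foldit-style asynchronous verification: a single player's
    solution is scored by a task-based objective function (e.g., a
    protein-energy function); no second player is required."""
    return energy(solution) <= threshold

if __name__ == "__main__":
    # Toy usage with stand-in data and a stand-in objective function.
    print(agreement_verification("Dog", "dog", taboo_words={"animal"}))   # -> "dog"
    toy_energy = lambda xs: sum(x * x for x in xs)
    print(evaluation_verification([0.1, 0.2], toy_energy, threshold=1.0)) # -> True
```

The synchronous variant only produces data when two concurrent players coincide, while the asynchronous variant can score a lone player's work at any time, which is exactly the design tension discussed above.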

ESP Game (von Ahn & Dabbish 2004) | Foldit (Cooper et al. 2010) | PhotoCity (Tuite et al. 2011)

Figure 2: Examples of HCGs [6, 43, 44] subdivided into action, verification, and feedback mechanics. Arrows from feedback to players have been omitted for clarity.
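One way to make a breakdown like Figure 2 machine-readable is to encode each game as a small record with one slot per mechanic type. The sketch below is a hypothetical encoding of our own devising (the class, field, and function names are not from the paper), shown here only to suggest how such profiles could be compared side by side.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class MechanicsProfile:
    """A game described by the three mechanic types of the framework."""
    game: str
    action: str                       # how players provide task-relevant input
    verification: Callable[..., Any]  # how inputs become task outcomes
    feedback: str                     # what players receive in return

def esp_verify(label1: str, label2: str) -> bool:
    # Output agreement between two simultaneous players.
    return label1.strip().lower() == label2.strip().lower()

def foldit_verify(energy_score: float, best_known: float) -> bool:
    # Objective evaluation: lower protein energy is better.
    return energy_score < best_known

ESP_GAME = MechanicsProfile(
    game="ESP Game",
    action="enter a text label for an image",
    verification=esp_verify,
    feedback="both players' scores increase on agreement",
)

FOLDIT = MechanicsProfile(
    game="Foldit",
    action="spatially manipulate a protein structure",
    verification=foldit_verify,
    feedback="score derived from the energy function",
)

# Comparing two games then amounts to comparing profiles field by field.
for profile in (ESP_GAME, FOLDIT):
    print(profile.game, "->", profile.action)
```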

As with action mechanics, few design experiments have tested alternative verification mechanics in HCGs. The game KissKissBan [14] modified the original structure of the ESP Game to promote data diversity (i.e., a broader set of image labels) by adding a third player to "ban" common words. Compared with the ESP Game's eventual use of taboo words, KissKissBan demonstrates an alternative verification mechanic that players found engaging.

In most human computation games (i.e., those which rely on synchronous consensus), players are commonly forbidden from direct communication with each other (often implemented through anonymous pairing with no means of determining the identity of the other player). Very few games (e.g., Foldit) allow players to communicate through channels such as game forums or community interfaces. This is meant to minimize collusion between players, which may lead to players providing deliberately incorrect answers while still succeeding at the game. The game Gwario [38] was used to compare singleplayer and co-located multiplayer, which allowed players to communicate directly during side-by-side play. Study results found no negative impact from direct communication; collusion was instead found to be the strongest predictor of high task accuracy. This suggests that direct communication may be permissible for certain tasks, audiences, and kinds of games.

Overall, we believe that verification mechanics are ripe with questions for future exploration. How do different ways of determining agreement influence task results? Would different mechanisms have downstream consequences for the feedback players can receive, for example, highlighting flaws in player-provided inputs?

3.3 Feedback Mechanics

Feedback mechanics provide players with information or digital artifacts based on the results of player actions in terms of partial or full task completion. These mechanics commonly encompass gameplay elements such as rewards and scoring, and can also be mapped to evaluation metrics for the underlying task, thus allowing both researchers and designers to assess player performance at both the completion of the task and progression through the in-game experience.

Examples. For all of the games shown in Figure 2, players receive feedback in the form of a score. However, the scoring mechanics themselves are unique to the tasks performed. The ESP Game rewards players with points for agreement on an image label.

By contrast, Foldit rewards players with points for minimizing an energy function describing the protein structure. PhotoCity rewards players for the number of points their photo choices add to the reconstructed 3D mesh. These examples are all similar in that the feedback "currency" is nominal—points contributing to a numerical score—but vary in what players are rewarded for.

Discussion. Aspects of feedback mechanics are some of the most well-explored design elements in human computation games. These mechanics are synonymous with in-game rewards, which some consider comparable to monetary compensation in other crowdsourcing systems. So what should rewards to the player represent: player actions in the game, player performance at the task (as evaluated by verification mechanics), or a combination of both?

Traditionally, positive feedback in human computation games is designed to encourage participation in the crowdsourcing process, as well as correctness of task completion. For example, players may receive rewards for first completing a task and, if such verification is available, additional rewards for completing it correctly. This kind of positive feedback often rewards collaborative player behavior, which is reasonable given that crowdsourcing is an aggregate process. However, reward systems in games, such as points and leaderboards, are often designed to afford and encourage competitive player behavior. A variety of research has examined in-game rewards for collaborative versus competitive behaviors, specifically in the context of HCGs where competition may be discouraged as a distraction from the task (i.e., players will optimize their behaviors to outperform others, rather than to complete the task sufficiently). Both the game KissKissBan [14] and a study by Goh et al. [12] examine collaboration and competition in the context of the ESP Game. KissKissBan explored the introduction of competitive mechanics by rewarding a third player for antagonistic behavior that would encourage the other players to provide more diverse image labels. Goh et al. compared collaborative and competitive scoring using two versions of the ESP Game—one which rewarded players for answer agreement and the other which only rewarded the first player to reach an agreed answer. They found no significant differences when looking at player experience metrics and task results, which suggests that rewarding players for competitive behavior had no negative effects on the completion of the task. Likewise, Siu et al. [40] compared collaborative and competitive versions of the game Cabbage Quest, finding no significant differences in task completion metrics such as accuracy and rate of task completion, but that players found the competitive version more engaging. Finally, the previously-mentioned game Gwario [38] compared collaborative and competitive scoring in its multiplayer condition, finding that players yielded more accurate (but slower) task results in the collaborative version but found the competitive version more challenging.

In addition to examining what players are rewarded for, one must also consider what players are rewarded with. Players have different motivations for playing games [48] and thus may differ in their preferred kinds of feedback. Crowdsourcing research has found that players who are intrinsically motivated to solve tasks are often disengaged by monetary compensation [30] and that curiosity can be a strong incentive for crowdsourced work [24]. This suggests that standard in-game rewards (i.e., points), which may take the place of monetary compensation, may not appeal to all players and that players who are dedicated to the task itself might disengage with games that only provide one kind of reward. In their analysis of Foldit, Cooper et al. [7] identify a subset of players who are intrinsically motivated to participate in scientific discovery games, which highlights that such intrinsically-motivated players do exist.

So what are the alternatives to points and leaderboards for players who may not find typical feedback systems compelling? Researchers have explored the use of different reward systems. Goh et al. [13] compared different versions of a location-based content sharing HCG—one of which rewarded players with points and another which rewarded players with collectible badges—against a control version with no in-game rewards (but which displayed non-gamified progress and statistics). Overall, few differences in player engagement and task results were found between the versions utilizing points and badges, though both were considered more effective and engaging than the control. Siu and Riedl [39] investigated multiple reward systems—leaderboards, customizable avatar items, unlockable narrative, and ungamified statistics—using the game Cafe Flour Sack in a multivariate experiment with two different player audiences: expert crowdsourced workers and a non-expert student population. They found that offering players a choice of reward from different systems (e.g., leaderboards, customizable avatar items, and unlockable narrative), as opposed to randomly assigning rewards from these systems, resulted in a better player experience with no differences in task results. When player audiences were taken into account, expert crowdsourced workers proved to be the most effective, but non-experts could perform nearly as well by being offered the ability to choose which rewards they desired. Gaston and Cooper [11] explored the use of three-star reward systems in the context of Foldit. They found that the introduction of three-star reward systems—mechanics awarding players up to three "stars" based on their performance in a game or challenge—resulted in players completing levels in fewer, longer moves. Additionally, players were more likely to replay levels, suggesting that such reward systems could encourage players to interact with the game in desirable ways (e.g., replaying levels to reinforce mastery of concepts in games like Foldit, which must teach players strategies necessary to complete the task). These explorations suggest that different reward systems and their presentation may have effects on the player experience and the completion of the task. Targeted player audiences may also be affected by how and what kinds of rewards are available, and may behave differently under different conditions.

Finally, we highlight Project Discovery [34] as a unique example of an HCG that is embedded as a playable experience accessible in a mainstream game: the online game EVE Online. Thematically, the HCG is set in the EVE Online universe and players are rewarded with currency that can be spent within the larger game world. Direct integration of HCGs with existing games (moreover, those designed primarily for entertainment) remains otherwise unexplored, but this example demonstrates a potential way to leverage existing player bases of other games using feedback mechanics.

4 A METHODOLOGY FOR HCG DESIGN

Our mechanics representation provides a breakdown of the different kinds of mechanics in human computation games. This enables us to identify where we can focus our explorations of the HCG design space, but not how we should explore the space in order to build up generalizable design knowledge.

We propose a methodology of controlled A/B design experiments that explore the space of human computation game designs, using formal representations for game elements, tasks, and audiences. In the context of HCG mechanics, this manifests as between-subjects (alternatively, within-subjects) experiments comparing separate versions of HCGs with different mechanical variations.

These design experiments should (1) implement a task with a known solution, while (2) focusing on a single element of an HCG's design. First, testing with a known solution allows us to evaluate task-related metrics objectively without simultaneously solving a novel problem. Such known solutions may be the result of pre-solved human computation problems (e.g., image labeling, as discussed in Section 4.1) or simpler tasks (often those relying on commonsense human knowledge) that are analogous to existing problems. Second, focusing on one particular element of an HCG's design allows us to understand exactly what kind of impact an element may have on both players and the task with minimal interaction effects. Our mechanics representation can be used to assist us in understanding where and how the introduction of an element (which may affect any of the action, verification, and feedback mechanics) will affect the HCG game loop.
These experiments should simultaneously evaluate how design decisions meet the needs of players and tasks. Optimizing only for the player may result in a game with mechanics that do not effectively solve the human computation task even if the game is considered engaging. Optimizing only for the task may result in a game that players do not find engaging (and thus will not play) even if the game effectively solves the human computation task. We refer to these two axes of metrics as the player experience and the task completion.

Player experience encompasses metrics such as:

• Engagement: how players interact with the game or rate their experience with it
• Retention: how likely players are to continue playing
• Other subjective measures related to how players interact with and perceive the game (e.g., preferences, unstructured self-reported feedback)

Task completion refers to task-related metrics such as:

• Quality: correctness or accuracy of task results
• Volume: amount of completed tasks
• Diversity: the variation or breadth of task results
• Rate of Acquisition: how quickly tasks are completed

The exact metrics to test for often depend on the nature of the human computation task and the HCG's target player audiences. For example, HCGs for tasks that have a high bar for teaching or training players to solve them effectively may consider metrics such as player retention much more important than HCGs for simpler tasks where maintaining a skilled player base is not a priority.

Moreover, these requirements may change over time. Task providers may find that their initial task completion results may not be sufficient or need additional refinement. Should they wish to change existing mechanics or introduce new ones, it is imperative to understand how those design changes can ensure the necessary results while still maintaining a positive player experience. Likewise, player audiences (not to mention their preferences) may change, especially if an HCG wishes to broaden its reach to new or different target populations of players.

We note that this methodology is not new, as similar experimental testing approaches have been applied to several instances of HCGs. Goh et al. [12] compared a non-gamified control application for image labeling against two versions of the ESP Game, one using collaborative scoring mechanisms and the other using competitive scoring mechanisms. They evaluated both results of the image labeling task and aspects of the application or game experience across these three conditions. We will further examine this experiment in Section 4.1. Similarly, Siu et al. [40] conducted an experiment using the game Cabbage Quest to test collaborative and competitive scoring mechanisms. The study utilized a task with a known solution—categorizing everyday objects with purchasing locations—and the controlled experiment compared two in-game scoring mechanisms: one collaborative and one competitive. Task results were compared to a gold standard to evaluate task completion metrics, while player actions and survey responses were logged to evaluate player engagement and experience. Both Goh et al.'s ESP Games and Cabbage Quest follow our proposed methodology of taking a problem with a known solution or gold-standard answer set, testing design elements by treating a set of game mechanics as independent variables, and measuring aspects of both the player experience and task completion. We note that these experiments benefited from having a gold-standard answer set, but in some cases, a preexisting solution may not be available for new problems. In such cases, it may be sufficient to evaluate first for player experience, followed by later evaluation of task completion upon verification of initial task results.

We now discuss how a controlled methodology of systematically testing game elements in human computation games, combined with our mechanics representation, can be used to study, compare, and explore HCG designs. We provide two case studies as illustrative examples: (1) a comparison of image labeling games and studies of their design space and (2) a discussion of several studies that have utilized (and expanded) this methodology to test multiple aspects of HCGs, including player audiences and novel game elements.

4.1 Case Study: Comparison and Evolution of Image Labeling Games

Image labeling, the task of annotating or classifying images with subject labels or tags, is one of the most iconic and well-studied tasks in human computation games. Figure 3 illustrates our mechanics representation of three HCGs designed to solve the image labeling problem, beginning with the original ESP Game [44] followed by KissKissBan [14] and Goh et al.'s ESP Games [12]. Our mechanics representation enables us to identify where changes in game mechanics were made in later games; Figure 3 shows the mechanical structure of the original ESP Game highlighted in gray on top of later iterations. Furthermore, we can visualize how these later games differ, specifically in their explorations of competitive game mechanics.
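As a sketch of how the task-completion side of such an experiment might be scored when a gold-standard answer set is available, consider the following illustration. It reflects the metrics defined above under our own assumptions; the function and variable names are hypothetical and do not come from the cited studies.

```python
from collections import Counter

def task_completion_metrics(submissions, gold, session_minutes):
    """Compute task-completion metrics for one experimental condition.

    submissions: list of (item_id, player_answer) pairs collected in a condition
    gold: dict mapping item_id -> known correct answer (the gold standard)
    session_minutes: total play time observed in the condition
    """
    graded = [(item, ans) for item, ans in submissions if item in gold]
    correct = sum(1 for item, ans in graded if ans == gold[item])
    quality = correct / len(graded) if graded else 0.0          # accuracy
    volume = len(graded)                                         # completed tasks
    diversity = len(Counter(ans for _, ans in graded))           # distinct answers
    rate = volume / session_minutes if session_minutes else 0.0  # tasks per minute
    return {"quality": quality, "volume": volume,
            "diversity": diversity, "rate": rate}

# Toy comparison of two mechanical variants against the same gold standard:
gold = {"img1": "dog", "img2": "cat", "img3": "car"}
condition_a = [("img1", "dog"), ("img2", "cat"), ("img3", "truck")]
condition_b = [("img1", "dog"), ("img2", "dog")]
print(task_completion_metrics(condition_a, gold, session_minutes=10))
print(task_completion_metrics(condition_b, gold, session_minutes=10))
```

The player-experience axis would be measured separately (e.g., from surveys and interaction logs) and compared across the same conditions.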

ESP Game (von Ahn & Dabbish 2004) | KissKissBan (Ho et al. 2009) | ESP Game (Goh et al. 2011)

Figure 3: Mechanics breakdown of three image labeling HCGs [12, 14, 44]. On the left is the original ESP Game, followed by subsequent variations that modified elements of its original design. The mechanics of the original ESP Game are colored in gray; novel mechanical variations are colored in white.
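The single feedback mechanic varied across Goh et al.'s conditions can be stated compactly in code. The sketch below is our paraphrase of the scoring boxes in Figure 3, with hypothetical function names; it is meant only to show how narrowly such a variation can be scoped while the rest of the game loop stays fixed.

```python
def collaborative_score(scores, times, p1, p2):
    """Both players are rewarded whenever their labels agree."""
    scores[p1] += 1
    scores[p2] += 1

def competitive_score(scores, times, p1, p2):
    """Only the faster of the two agreeing players is rewarded."""
    winner = p1 if times[p1] < times[p2] else p2
    scores[winner] += 1

def on_agreement(scoring_rule, scores, times, p1, p2):
    # Actions and the agreement check are identical in both conditions;
    # only the scoring rule passed in here differs between versions.
    scoring_rule(scores, times, p1, p2)

scores = {"player1": 0, "player2": 0}
times = {"player1": 3.2, "player2": 4.7}   # seconds to enter the agreed label
on_agreement(competitive_score, scores, times, "player1", "player2")
print(scores)   # {'player1': 1, 'player2': 0}
```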

KissKissBan, previously discussed in Section 3, tackled the problem of generating a wider, more diverse set of labels for given images. While not shown in Figure 3, the original ESP Game eventually adopted the use of "taboo" words based on preexisting (repetitive) solutions that players could not input, forcing players to generate a wider range of labels. In this context, KissKissBan provides an alternative mechanical solution to the problem, specifically by introducing an additional player and, with it, an additional verification step that did not rely on preexisting labels. We can see just how these mechanics were injected into the original design of the ESP Game in Figure 3. While the action mechanics of entering text input are identical to those of the first two players, the new design incorporates additional verification steps and feedback mechanics that can penalize the first two players.

Goh et al., as previously discussed in Section 3, conducted a study comparing collaborative and competitive versions of the ESP Game against a non-gamified control application. Their ESP Games are shown in Figure 3 with the two experimental conditions emphasized in boldfaced braces. Using our mechanics breakdown, we can identify that the only major mechanical difference between the collaborative and competitive versions of their ESP Games is in the particular scoring function used to provide feedback. Specifically, we can see just how their study isolated and varied one specific feedback mechanic: the scoring function.

Understanding how certain mechanics, or collections of mechanics, are modified or added on top of existing games can help provide insight into applying them to other HCGs. Visually, Figure 3 gives us a better understanding of how one might take the changes utilized by KissKissBan or Goh et al.'s ESP Games and apply them to a structurally-similar ESP Game (which von Ahn and Dabbish's templates refer to as "output-agreement" [45]).

Moreover, this helps to hypothesize just how these collections of mechanics are likely to affect the player experience and task completion. Unfortunately, while Goh et al. evaluated both player experience and task completion metrics in their experiment, KissKissBan only evaluated the latter against the original ESP Game. This makes it difficult to compare how KissKissBan's mechanics might fare against those of Goh et al.'s competitive ESP Game, particularly in the context of player experience. Ideally, as more comprehensive design knowledge is built up, HCG developers might be able to treat these collections of mechanics as modular when applying them to new tasks and integrating them into games, assisted by an understanding of their potential effects on the player experience and task completion.

4.2 Case Study: Exploring Player Audiences and Novel Designs in HCGs

As previously described, experiments using HCGs such as the ESP Game variations and Cabbage Quest have used a controlled methodology as highlighted in Section 4 to systematically test aspects of games, in particular feedback mechanics. We now report on two recent experiments that have taken this methodology further by exploring aspects of HCGs such as player audience, considering multivariate conditions (i.e., two independent variables), and challenging longstanding design hypotheses with novel game designs.

Cafe Flour Sack. Cafe Flour Sack, previously discussed in Section 3.3, was designed to test aspects of feedback mechanics [39]. The game uses a classification task with a known answer—pairing cooking ingredients with recipes that can be made from them—with drag-and-drop action mechanics for players to classify answers. A visualization of the experiment using our mechanics representation can be seen in Figure 4.

Figure 4: Mechanical breakdown of the Cafe Flour Sack experiment. Experimental conditions for reward distribution (random versus choice) and player audience (experts versus students) are noted using boldfaced braces.

The primary feedback mechanic tested was the distribution of rewards: whether players were randomly assigned one of three possible reward types or were allowed to choose from those reward types. As shown in Figure 4, the variation occurred only in the feedback mechanics. The only difference between the two versions was a reward selection screen presented to the player: random—which showed players a randomly-assigned (highlighted) reward type—and choice—which allowed players to choose and click on one of three reward types. This interface change was kept deliberately minimal, but when looking only at the reward condition, results show that the choice condition was perceived more positively by players. Otherwise, players showed no significant differences in player experience or task completion.

In addition to testing random versus choice reward distribution, Cafe Flour Sack also tested these conditions across two different player audiences. One audience consisted of expert crowdsourcing workers recruited on Amazon Mechanical Turk, while the other audience consisted of university students. Predictably, experts performed overall better than students at task completion metrics such as task accuracy/correctness and rate of task completion. However, when looking at the interactions between the reward conditions, students given a choice of rewards performed nearly as well as experts given a choice of rewards. This suggests that under certain conditions, such as the appropriate choice of mechanics, non-experts (i.e., students) can perform nearly as well as experts. We note that Cafe Flour Sack uses a simple commonsense knowledge task, but such differences could be more impactful for more complex tasks, although further replication and experimentation is needed to understand how these results hold.

While our proposed methodology focuses on using controlled studies of game mechanics, the Cafe Flour Sack experiment highlights the need to also consider other aspects of human computation games. As shown in Figure 4, our mechanics representation can also be extended to other aspects of HCGs, such as player audience. The results from Cafe Flour Sack suggest that testing conditions such as player audience is just as valuable as testing game mechanics, given the effects on the player experience and task completion metrics. We believe that our framework is flexible to such extensions. For example, while we advocate for fixing the task when using this experimental methodology, one could hypothetically explore the effect that different tasks have on player experience and task completion by fixing the mechanics of an HCG while using similar tasks.

Gwario. Gwario, previously discussed in Section 3, was designed to test co-located multiplayer game mechanics in the context of an HCG adopting the mechanics of the popular platformer Super Mario Bros. [38]. The game utilizes the same task with a known solution as that in Cabbage Quest [40]. A visualization of the experiment using our mechanics representation can be seen in Figure 5.

The game tested co-located multiplayer mechanics and consisted of two conditions: a singleplayer version and a co-located multiplayer version, in which players are seated side-by-side sharing the same game screen. Figure 5 shows that the mechanics of the singleplayer version are a deliberately-strict subset of the mechanics in the multiplayer version, which adds verbal communication between players as an optional verification step (i.e., players may choose whether or not to talk during gameplay). Additionally, within its multiplayer condition, Gwario also compared the experimental conditions used in Goh et al.'s previously-described ESP Games and Cabbage Quest: collaborative scoring and competitive scoring. Like these prior experiments, the only variation between the two scoring conditions was the scoring function used to evaluate players. The only visual difference between the two was the reward screen at the end of the game levels, showing either a joint score or separate scores.

Gwario (Singleplayer) | Gwario (Multiplayer)

Figure 5: Mechanical breakdown of the Gwario experiment. Experimental conditions for singleplayer and multiplayer are split for clarity, while multiplayer scoring conditions (collaborative versus competitive) are noted using boldfaced braces.

Adoption of mainstream game mechanics still remains a contentious and relatively-unexplored area of human computation game design. Gwario's adaptation of collection and 2D platforming mechanics from Super Mario Bros. represents a novel design in the space of HCGs. This raises the question: how can we safely adopt and test mechanics from successful mainstream games? When considering a particular genre or game, which mechanics do we adopt? The Gwario experiment demonstrates possible approaches, such as adopting a secondary mechanic from Super Mario Bros.—coin collection—for the task of classifying objects as part of its action mechanics. Meanwhile, Super Mario Bros.' primary navigation mechanics were left untouched to preserve the feel of the original game, but did not affect the completion of the human computation task. Additionally, direct communication—a social element—was treated as a verification mechanic. By breaking HCGs down using our mechanics representation, we can identify and isolate where adopted mainstream game elements could fit into the HCG game loop. This would allow us to carefully and iteratively test variations while ensuring that design changes to mechanics are isolated to particular parts of the game loop. Comparison through our proposed methodology then could help to identify potential tradeoffs between the player experience and task completion.

Additionally, the Gwario experiment also highlights the potential for different experimental methods in the context of human computation game design. Gwario's survey of HCG experts, while limited, provides valuable insights into the design philosophies and considerations of HCG developers. While our proposed methodology is very amenable to quantitative methods for evaluating games, we believe that HCG design stands to benefit from qualitative and mixed-method approaches, such as grounded-theory approaches and/or semi-structured interviews with both players and HCG developers/task providers.

5 CONCLUSIONS

In this paper, we describe a framework for designing and studying human computation games. We believe that generalizable design knowledge for HCGs is necessary and should consider both players and tasks. We proposed a formal representation of HCG mechanics comprising three types: action, verification, and feedback, and illustrated it with examples of its potential use. This formalization allows us to visualize and study HCG elements, enables formal comparisons of existing HCGs, and facilitates controlled experiments to advance HCG design knowledge. We highlighted a methodology of running design experiments on known tasks that measure both player experience and task completion when varying game elements. We illustrated how to utilize both representation and methodology in two examples: a comparison of image labeling games and a discussion of recent HCG experiments.

Human computation games have demonstrated the potential to solve complex and difficult problems, but must be both engaging experiences for players and effective at solving their tasks. With the volume and availability of games increasing each day, HCGs must compete for players' time and attention, and thus must remain relevant and consistent with player expectations. Otherwise, we risk tarnishing their reputation with players, which may ultimately lead to HCGs' premature dismissal as ineffective interfaces for human computation work. In order to ensure this does not happen, we need to understand how HCGs work, to build better and broader generalizable design knowledge that can adapt to new games, tasks, and audiences, especially when HCG developers do not typically have the training or resources of professional game studios. Our mechanics representation and our proposed methodology are designed to explore and evaluate HCG mechanics so that it will be easier to design and develop successful, effective HCGs. In doing so, we hope to work towards a future where HCGs are engaging, effective, ubiquitous, and empowering.

ACKNOWLEDGMENTS

This material is based upon work supported by the National Science Foundation under Grant No. 1525967. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

REFERENCES

[1] Erik Andersen, Yun-En Liu, Richard Snider, Roy Szeto, Seth Cooper, and Zoran Popović. 2011. On the Harmfulness of Secondary Game Objectives. In Proceedings of the 6th International Conference on Foundations of Digital Games (FDG
’11). ACM, New York, NY, USA, 30–37. DOI:hŠp://dx.doi.org/10.1145/2159365. [21] Michael Lankes, Œomas Mirlacher, Stefan Wagner, and Wolfgang Hochleitner. 2159370 2014. Whom Are You Looking for?: Œe E‚ects of Di‚erent Player Representation [2] Erik Andersen, Eleanor O’Rourke, Yun-En Liu, Rich Snider, Je‚ Lowdermilk, Relations on the Presence in Gaze-based Games. In Proceedings of the First ACM David Truong, Seth Cooper, and Zoran Popovic.´ 2012. Œe Impact of Tutorials SIGCHI Annual Symposium on Computer-human Interaction in Play (CHI PLAY on Games of Varying Complexity. In Proceedings of the SIGCHI Conference on ’14). ACM, New York, NY, USA, 171–179. DOI:hŠp://dx.doi.org/10.1145/2658537. Human Factors in Computing Systems (CHI ’12). ACM, New York, NY, USA, 59–68. 2658698 DOI:hŠp://dx.doi.org/10.1145/2207676.2207687 [22] Edith Law and Luis von Ahn. 2009. Input-agreement: A New Mechanism for [3] Luke Barrington, Douglas Turnbull, and Gert Lanckriet. 2012. Game- Collecting Data Using Human Computation Games. In Proceedings of the SIGCHI powered machine learning. Proceedings of the National Academy of Sci- Conference on Human Factors in Computing Systems (CHI ’09). ACM, New York, ences 109, 17 (2012), 6411–6416. DOI:hŠp://dx.doi.org/10.1073/pnas.1014748109 NY, USA, 1197–1206. DOI:hŠp://dx.doi.org/10.1145/1518701.1518881 arXiv:hŠp://www.pnas.org/content/109/17/6411.full.pdf [23] Edith Law and Luis von Ahn. 2011. Human computation. Synthesis Lectures on [4] Marek Bell, Stuart Reeves, Barry Brown, ScoŠ Sherwood, Donny MacMillan, Arti€cial Intelligence and Machine Learning 5, 3 (2011), 1–121. John Ferguson, and MaŠhew Chalmers. 2009. EyeSpy: Supporting Navigation [24] Edith Law, Ming Yin, Joslin Goh, Kevin Chen, Michael A. Terry, and Krzysztof Z. Œrough Play. In Proceedings of the SIGCHI Conference on Human Factors in Gajos. 2016. Curiosity Killed the Cat, but Makes Crowdwork BeŠer. In Proceedings Computing Systems (CHI ’09). ACM, New York, NY, USA, 123–132. DOI:hŠp: of the 2016 CHI Conference on Human Factors in Computing Systems (CHI ’16). //dx.doi.org/10.1145/1518701.1518723 ACM, New York, NY, USA, 4098–4110. DOI:hŠp://dx.doi.org/10.1145/2858036. [5] Seth Cooper, Firas Khatib, Ilya Makedon, Hao Lu, Janos Barbero, David Baker, 2858144 James Fogarty, Zoran Popovic,´ and Foldit players. 2011. Analysis of Social [25] Jeehyung Lee, Wipapat Kladwang, Minjae Lee, Daniel Cantu, Martin Azizyan, Gameplay Macros in the Foldit Cookbook. In Proceedings of the 6th International Hanjoo Kim, Alex Limpaecher, Sungroh Yoon, Adrien Treuille, Rhiju Das, and Conference on Foundations of Digital Games (FDG ’11). ACM, New York, NY, USA, EteRNA Participants. 2014. RNA design rules from a massive open laboratory. 9–14. Proceedings of the National Academy of Sciences 111, 6 (2014), 2122–2127. [6] Seth Cooper, Firas Khatib, Adrien Treuille, Janos Barbero, Jeehyung Lee, Michael [26] Chris J LintoŠ, Kevin Schawinski, Anzeˇ Slosar, Kate Land, Steven Bamford, Beenen, Andrew Leaver-Fay, David Baker, Zoran Popovic,´ and others. 2010. Daniel Œomas, M Jordan Raddick, Robert C Nichol, Alex Szalay, Dan Andreescu, Predicting protein structures with a multiplayer online game. Nature 466, 7307 Phil Murray, and Jan Vandenberg. 2008. Galaxy Zoo: morphologies derived from (2010), 756–760. visual inspection of galaxies from the Sloan Digital Sky Survey. Monthly Notices [7] Seth Cooper, Adrien Treuille, Janos Barbero, Andrew Leaver-Fay, Kathleen Tuite, of the Royal Astronomical Society 389, 3 (2008), 1179–1189. 
Firas Khatib, Alex Cho Snyder, Michael Beenen, David Salesin, David Baker, [27] Yun-En Liu, Travis Mandel, Emma Brunskill, and Zoran Popovic.´ 2014. Trad- Zoran Popovic,´ and Foldit Players. 2010. Œe Challenge of Designing Scienti€c ing O‚ Scienti€c Knowledge and User Learning with Multi-Armed Bandits. In Discovery Games. In 5th International Conference on the Foundations of Digital Educational Data Mining. Games. [28] Heather Logas, Jim Whitehead, Michael Mateas, Richard Vallejos, Lauren ScoŠ, [8] Werner Dietl, Stephanie Dietzel, Michael D. Ernst, Nathaniel Mote, Brian Walker, Dan Shapiro, John Murray, Kate Compton, Joseph Osborn, Orlando Salvatore, Seth Cooper, Timothy Pavlik, and Zoran Popovic.´ 2012. Veri€cation Games: Zhongpeng Lin, Huascar Sanchez, Michael Shavlovsky, Daniel Cetina, Shayne Making Veri€cation Fun. In 14th Workshop on Formal Techniques for Java-like Clementi, and Chris Lewis. 2014. So‰ware Veri€cation Games: Designing Xylem, Programs. Œe Code of Plants. In Proceedings of the 9th International Conference on the [9] Shayan Doroudi, Ece Kamar, Emma Brunskill, and Eric Horvitz. 2016. Toward a Foundations of Digital Games. Learning Science for Complex Crowdsourcing Tasks. In Proceedings of the 2016 [29] Derek Lomas, Kishan Patel, Jodi L. Forlizzi, and Kenneth R. Koedinger. 2013. CHI Conference on Human Factors in Computing Systems (CHI ’16). ACM, New Optimizing Challenge in an Educational Game Using Large-scale Design Experi- York, NY, USA, 2623–2634. DOI:hŠp://dx.doi.org/10.1145/2858036.2858268 ments. In Proceedings of the SIGCHI Conference on Human Factors in Computing [10] Spencer Frazier and Mark Riedl. 2014. Persistent and Pervasive Real-World Systems (CHI ’13). ACM, New York, NY, USA, 89–98. DOI:hŠp://dx.doi.org/10. Sensing Using Games. In Second AAAI Conference on Human Computation and 1145/2470654.2470668 Crowdsourcing. [30] Winter Mason and Duncan J. WaŠs. 2010. Financial Incentives and the ”Per- [11] Jacqueline Gaston and Seth Cooper. 2017. To Œree or Not to Œree: Improving formance of Crowds”. SIGKDD Explor. Newsl. 11, 2 (May 2010), 100–108. DOI: Human Computation Game Onboarding with a Œree-Star System. In Proceedings hŠp://dx.doi.org/10.1145/1809400.1809422 of the 2017 CHI Conference on Human Factors in Computing Systems (CHI ’17). [31] Mitchell W. McEwan, Alethea L. Blackler, Daniel M. Johnson, and Peta A. Wyeth. ACM, New York, NY, USA, 5034–5039. DOI:hŠp://dx.doi.org/10.1145/3025453. 2014. Natural Mapping and Intuitive Interaction in Videogames. In Proceedings 3025997 of the First ACM SIGCHI Annual Symposium on Computer-human Interaction in [12] Dion Hoe-Lian Goh, Rebecca P. Ang, Chei Sian Lee, and Alton Y. K. Chua. 2011. Play (CHI PLAY ’14). ACM, New York, NY, USA, 191–200. DOI:hŠp://dx.doi.org/ Fight or Unite: Investigating Game Genres for Image Tagging. J. Am. Soc. Inf. 10.1145/2658537.2658541 Sci. Technol. 62, 7 (July 2011), 1311–1324. [32] Gayatri Mehta, Carson Crawford, Xiaozhong Luo, Natalie Parde, Krunalkumar [13] Dion Hoe-Lian Goh, Ei Pa Pa Pe-Œan, and Chei Sian Lee. 2015. An Investigation of Patel, Brandon Rodgers, Anil Kumar Sistla, Anil Yadav, and Marc Reisner. 2013. Reward Systems in Human Computation Games. Springer International Publishing, UNTANGLED: A Game Environment for Discovery of Creative Mapping Strate- 596–607. gies. ACM Trans. Recon€gurable Technol. Syst. 6, 3, Article 13 (Oct. 2013), 26 pages. [14] Chien-Ju Ho, Tao-Hsuan Chang, Jong-Chuan Lee, Jane Yung-jen Hsu, and Kuan- DOI:hŠp://dx.doi.org/10.1145/2517325 Ta Chen. 2010. 
KissKissBan: A Competitive Human Computation Game for [33] Ei Pa Pa Pe-Œan, Dion Hoe-Lian Goh, and Chei Sian Lee. 2013. A typology of Image Annotation. SIGKDD Explor. Newsl. 12, 1 (Nov. 2010), 21–24. DOI:hŠp: human computation games: an analysis and a review of current games. Behaviour //dx.doi.org/10.1145/1882471.1882475 & Information Technology (2013). [15] Katherine Isbister and Noah Scha‚er. 2008. Game Usability: Advancing the Player [34] Mark Peplow. 2016. Citizen science lures gamers into Sweden’s Human Protein Experience. Morgan Kaufmann. Atlas. Nature Biotechnology 34, 5 (2016), 452–452. hŠp://dx.doi.org/10.1038/ [16] Peter Jamieson, Lindsay Grace, and Jack Hall. 2012. Research Directions for Push- nbt0516-452c ing Harnessing Human Computation to Mainstream Video Games. In Meaningful [35] Magy Seif El-Nasr, Anders Drachen, and Alessandro Canossa (Eds.). 2013. Game Play. Analytics. Springer London. [17] Alexander Kawrykow, Gary Roumanis, Alfred Kam, Daniel Kwak, Clarence [36] Amy Shannon, Acey Boyce, Chitra Gadwal, and Ti‚any Barnes. 2013. E‚ec- Leung, Chu Wu, Eleyine Zarour, Luis Sarmenta, Mathieu BlancheŠe, Jer´ omeˆ tive Practices in Game Tutorial Systems. In Proceedings of the 8th International Waldispuhl,¨ and Phylo Players. 2012. Phylo: a citizen science approach for Conference on the Foundations of Digital Games. improving multiple sequence alignment. PLoS One 7, 3 (2012), e31362. [37] Katharina Siorpaes and Martin Hepp. 2008. Games with a Purpose for the [18] Markus Krause and Jan Smeddinck. 2011. Human computation games: A survey. Semantic Web. IEEE Intelligent Systems 23, 3 (2008), 50–60. In 2011 19th European Signal Processing Conference. IEEE, 754–758. [38] Kristin Siu, MaŠhew Guzdial, and Mark O. Riedl. 2017. Evaluating Singleplayer [19] Markus Krause, Aneta Takhtamysheva, Marion WiŠstock, and Rainer Malaka. and Multiplayer in Human Computation Games. In Proceedings of the 12th Inter- 2010. Frontiers of a Paradigm: Exploring Human Computation with Digital national Conference on the Foundations of Digital Games. to appear. Games. In Proceedings of the ACM SIGKDD Workshop on Human Computation [39] Kristin Siu and Mark O. Riedl. 2016. Reward Systems in Human Computation (HCOMP ’10). ACM, New York, NY, USA, 22–25. DOI:hŠp://dx.doi.org/10.1145/ Games. In Proceedings of the 2016 Annual Symposium on Computer-Human In- 1837885.1837893 teraction in Play (CHI PLAY ’16). ACM, New York, NY, USA, 266–275. DOI: [20] Yen-ling Kuo, Jong-Chuan Lee, Kai-yang Chiang, Rex Wang, Edward Shen, hŠp://dx.doi.org/10.1145/2967934.2968083 Cheng-wei Chan, and Jane Yung-jen Hsu. 2009. Community-based Game Design: [40] Kristin Siu, Alexander Zook, and Mark O. Riedl. 2014. Collaboration versus Experiments on Social Games for Commonsense Data Collection. In Proceedings Competition: Design and Evaluation of Mechanics for Games with a Purpose. of the ACM SIGKDD Workshop on Human Computation (HCOMP ’09). ACM, New In Proceedings of the 9th International Conference on the Foundations of Digital York, NY, USA, 15–22. DOI:hŠp://dx.doi.org/10.1145/1600150.1600154 Games. A Framework for Exploring and Evaluating Mechanics in Human Computation Games K. Siu, A. Zook, and M. Riedl

[41] Kathleen Tuite. 2014. GWAPs: Games with a Problem. In Proceedings of the 9th International Conference on the Foundations of Digital Games. [42] Kathleen Tuite, Rahul Banerjee, Noah Snavely, Jovan Popovic,´ and Zoran Popovic.´ 2015. PointCra‰: Harnessing Players’ FPS Skills to Interactively Trace Point Clouds in 3D. In Proceedings of the 10th International Conference on the Foundations of Digital Games. [43] Kathleen Tuite, Noah Snavely, Dun-yu Hsiao, Nadine Tabing, and Zoran Popovic.´ 2011. PhotoCity: Training Experts at Large-scale Image Acquisition Œrough a Competitive Game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’11). ACM, New York, NY, USA, 1383–1392. DOI: hŠp://dx.doi.org/10.1145/1978942.1979146 [44] Luis von Ahn and Laura Dabbish. 2004. Labeling Images with a Computer Game. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (CHI ’04). ACM, New York, NY, USA, 319–326. DOI:hŠp://dx.doi.org/10.1145/ 985692.985733 [45] Luis von Ahn and Laura Dabbish. 2008. Designing Games with a Purpose. Commun. ACM 51, 8 (2008), 58–67. [46] G. Wallner and S. Kriglstein. 2013. Visualization-based analysis of gameplay data — A review of literature. Entertainment Computing 4, 3 (2013), 143 – 155. [47] Laura A. Whitlock, Anne Collins McLaughlin, William Leidheiser, Maribeth Gandy, and Jason C. Allaire. 2014. Know Before you Go: Feelings of Flow for Older Players Depends on Game and Player Characteristics. In ACM SIGCHI Annual Symposium on Computer-Human Interaction in Play. [48] Nick Yee. 2006. Motivations for play in online games. CyberPsychology & behavior 9, 6 (2006), 772–775.