The Affect Game AnnotatIoN (AGAIN) Dataset

David Melhart, Antonios Liapis and Georgios N. Yannakakis, IEEE Senior Member
Institute of Digital Games, University of Malta, Msida, Malta
[email protected], [email protected], [email protected]

Abstract—How can we model affect in a general fashion, across dissimilar tasks, and to which degree are such general representations of affect even possible? To address such questions and enable research towards general affective computing, this paper introduces the Affect Game AnnotatIoN (AGAIN) dataset. AGAIN is a large-scale affective corpus that features over 1,100 in-game videos (with corresponding gameplay data) from nine different games, annotated for arousal by 124 participants in a first-person continuous fashion. Even though AGAIN is created for the purpose of investigating the generality of affective computing across dissimilar tasks, affect modelling can also be studied within each of its nine specific interactive games. To the best of our knowledge, AGAIN is the largest (over 37 hours of annotated video and game logs) and most diverse publicly available affective dataset based on games as interactive affect elicitors.

I. INTRODUCTION

A core challenge of affective computing (AC) is the investigation of generality in the ways emotions are elicited and manifested, in the annotation protocols designed, and ultimately in the affect models created. To examine the degree to which general representations of affect are possible and meaningful, AC research requires access to corpora containing affect responses and annotations across dissimilar tasks, participants and annotators. Traditional large-scale AC datasets feature affect annotation of static images, videos, sounds and speech files within a narrow context through which affect is elicited from a particular task. However, such datasets cannot advance research in general AC, as the stimuli used to elicit affect tend to be very similar. Even when the various tasks under annotation vary, they are still limited to a very specific context, such as viewing a set of social interactions under a theme or playing sessions of the same game.

Figure 1. All games currently featured in the AGAIN dataset. The dataset includes three racing games (top row), three shooter games (middle row), and three platformers (bottom row).

Table I
CORE PROPERTIES OF THE AGAIN DATASET

Properties | Raw dataset | Clean dataset
Number of Participants | 124 | 122
Number of Gameplay Videos | 1116 | 995
Number of Game-telemetry Logs | 1116 | 995
Video database size | 37+ hours | 33+ hours
Number of Elicitors | 9 games (3 genres)
Gameplay/Video duration | 2 min
Annotation Perspective | First-person
Annotation Type | Continuous unbounded
Affective Labels | Arousal

arXiv:2104.02643v1 [cs.HC] 6 Apr 2021

Motivated by the lack of corpora for the study of general properties of affect across tasks and participants, in this paper we introduce the Affect Game AnnotatIoN (AGAIN) dataset, which contains data from over 120 participants who played and annotated over 1,000 gameplay sessions. AGAIN features data collected from nine games spanning three dissimilar genres, which were developed specifically for the purposes of the dataset (see Fig. 1). As shown in Table I, along with game telemetry and self-annotated arousal labels, the dataset also features a video database of unique gameplay sessions with over 37 hours of in-game footage. The diverse nature of the AGAIN affect elicitors (games) provides a testbed for general affect detection in games [1], [2] and broadens the horizons for further research on general-purpose AI representations [3], [4] and artificial general intelligence.

The design and creation of AGAIN was guided by the following factors: a) accessibility, which is achieved through an online crowdsourcing framework; b) scalability: AGAIN utilises the PAGAN online annotation framework [5] and, hence, one can easily populate the AGAIN database with more participants and annotators; c) extensibility: more affect dimensions and categories can be considered and integrated into the existing dataset through the customisable PAGAN annotation tool; and d) generality: any additional game or interactive session can be easily integrated into the experimental protocol of AGAIN. While at the time of writing the dataset hosts 9 games annotated for arousal, AGAIN is designed with all aforementioned factors in mind so that it is able to host data from more games and user modalities, considering alternative affective labels.

The AGAIN dataset is unique in a number of ways. First, it is the largest and most diverse publicly available affective dataset based on games as interactive elicitors. Given the breadth of elicitors offered, the dataset can be used for testing specific affect models on one particular task (i.e. a particular game) all the way to general models of affect across tasks (game genres and games in general). Second, the dataset is annotated with the core affective dimension of arousal, linking dominant annotation practices in affective computing with player modelling and game user research. Finally, it employs a novel annotation framework [6] which captures subjective annotations in a continuous and unbounded manner that can be further processed as labels for regression, classification or ordinal learning affect modelling tasks [7], [8].

The remainder of the paper is structured as follows. Section II contextualises the dataset within the fields of affective computing and affect modelling in games, while Section III offers a systematic review of existing audiovisual datasets, placing AGAIN within the literature of affective corpora. The games used as the affect elicitors of AGAIN are described in Section IV. Section V details the AGAIN dataset by describing the protocol followed, the characteristics of the participants, the data types collected, and the annotation framework used. Section VI offers a detailed yet preliminary data analysis of the dataset, and the paper concludes with Section VII.

II. BACKGROUND

AGAIN is an accessible dataset offered for research in affective computing at large and player modelling in particular. This background section discusses the importance of arousal within the field of affect representation (Section II-A) and reviews studies for modelling the affect of game users (i.e. players) in Section II-B.

A. Arousal as Affect Representation

While there are different approaches to affect representation, including categorical [9], [10], dimensional [11], and mixed [12] frameworks, the AGAIN dataset uses a dimensional representation based on the Pleasure-Arousal-Dominance (PAD) model of affect [13] and the Circumplex Model of Emotions [11]. In contrast to categorical frameworks, which assume a clear division between emotional responses, these models propose a more ambiguous and general representation. While the dimensional approach has its own limitations (e.g. the inability to capture complex self-reflexive emotions), it also sidesteps the challenge of subjective emotional appraisal [14] that is usually present in categorical models.

Instead of complex emotions, the PAD model focuses on basic affective states represented across three dimensions. Pleasure is associated with the valence of the emotion; psychological arousal describes the intensity of the emotion; and finally dominance describes the agency or level of autonomy during the emotional episode. One can place different emotions within this 3D continuous space without explicitly categorising them, reducing the chance of misrepresenting how a subject feels. This type of evaluation lends itself better to continuous and subjective annotation [7], [8]. Using the PAD model for self-appraisal removes much of the dependence on culturally and personally biased evaluations when categorising emotions [14], which can increase the face validity [15] of the measurement.

While the Circumplex model and the PAD model represent affect across two and three dimensions, respectively, in the AGAIN dataset we currently focus on soliciting annotations based on the dimension of arousal. Selecting and investigating arousal first, instead of other affect dimensions, is relevant for games, the core domain of AGAIN. Arousal is present and dominant as an emotional manifestation in game affect interactions and has been associated with challenge [16], cognitive and affective engagement [17], tension [18], fun [19], frustration [20] and flow [21], as well as positive post-game outcomes, such as increased creativity [22] and working memory [23] performance. Focusing on one affect dimension reduces the cognitive load of the annotation task [5], which in turn increases the reliability of our data; however, it limits the expressive range of affect annotation in the dataset. Moreover, the focus on arousal assists the research community to build upon, extend and advance studies that have already benchmarked the study of arousal in games [2], [3], [6].

B. Affect Modelling in Games

Player modelling is the study of videogame play both in terms of behavioural and affective patterns [24]. It relies heavily on artificial intelligence methods for building predictive models of player behaviour [25], [26], playtime [27], churn [28], [29], or player experience [3], [7], [30]. It is naturally characterised by dynamic representations and modelling of data, thereby providing even moment-to-moment predictions of a game's elicited experience [31]. A key limitation of player modelling, as with any other data-driven approach, is that it is data hungry. In particular, studies that focus on affective aspects of player experience require ground-truth affect labels which are often costly to collect [32], [33].

To address the above challenge, an increasing number of studies focus on approaches that could realise aspects of general player modelling [1]. General player modelling features methods that are able to predict a player's affective state in unseen games. While early studies such as that of Martinez et al. [34] investigated game-independent features of the playing experience, such as heart rate and skin conductance, later studies put an emphasis on finding general gameplay features either manually [35] or through algorithmic feature mapping [36]. More recently, Camilleri et al. investigated general gameplay features and generalised metrics of player experience across three dissimilar games [2]. Their study used high-level features, such as goal-oriented and goal-opposed gameplay events, and relative metrics of arousal, to moderate success, showing the difficulty of creating general player models. Similarly, Bonometti et al. used high-level general features to characterise the gameplay context (such as activity count and activity diversity) to model engagement across six games published by Square Enix Ltd. [37].

III. AUDIOVISUAL AFFECTIVE DATASETS

The availability of large-scale corpora comprising affect manifestations that are elicited through appropriate stimuli is a necessity for affect modelling. Creating datasets that are annotated with reliable affect information is, therefore, instrumental to the field of affective computing at large. In this section we review representative affective corpora that rely on audiovisual elicitors and discuss the contribution of AGAIN to the current list of datasets that are enriched with affect labels. Table II presents the outcome of our survey1. We follow a systematic approach for reviewing the state of the art in affective corpora and examine the following factors that distinguish the surveyed datasets: the mode and type of the provided elicitors, the number of possible elicitor items, and the overall size of the available video database (see the second to fifth columns of Table II); the number of participants and their recorded modalities (see columns six and seven of Table II); the annotation protocol in terms of the mode and type of the annotation (see columns eight and nine of Table II); the affective labels (see column ten of Table II); and finally the number of annotators and the number of tasks each annotator had to complete (see the eleventh and twelfth columns of Table II).

It is apparent from Table II that affective datasets have gradually, over the last decade or so, drifted away from traditional induced elicitation and posed expressions, and instead turned towards soliciting spontaneous emotion manifestations. Most of these datasets have focused mainly on affect elicitation through passive (i.e. non-interactive) audiovisual stimuli (see second row of Table II). Passive audiovisual elicitors are a popular choice as they do not require any particular skill from the participants and are relatively easy to implement. In contrast, we meet datasets that make use of active elicitors involving tasks in dyads and videogames, including RELOCA [44] and player experience datasets such as PED [46] or the FUNii Database [47]. Compared to passive elicitors, these interactive tasks provide a more complex and multifaceted affective stimulus, while organically structuring the participants' experience.

Most affective computing databases surveyed (see tenth row of Table II) capture affective dimensions such as arousal and valence, with some datasets offering labels for additional dimensions, such as dominance, and categorical labels. The surveyed datasets that have used games as affect elicitors (Mazeball [45], PED [46], and FUNii [47]) tend to be less focused with regards to the labels used and instead aim to capture more complex game-related user states such as engagement, fun or challenge. This core difference makes such player experience datasets distinctive to affective computing primarily because any lessons learned on traditional affective databases are not directly applicable to player experience datasets, and vice versa.

The affective datasets we survey appear to be rather split in terms of the annotation type used. While some (e.g. DEAP [39], MAHNOB-HCI [38]) opt for self-annotation (first-person), many databases (e.g. RELOCA [44], SEWA [43]) use only a few expert annotators in a third-person manner. There is a clear trade-off between these approaches. First-person annotations are ideal for capturing the subjective appraisal of emotional content, while third-person annotations are better at labelling emotion manifestation through inter-rater agreement [48].

The above systematic review of the literature highlights a lack of large-scale databases implementing an active elicitation mode, using multiple elicitor types and adopting a first-person annotation scheme. While datasets using passive elicitors are generally larger, the cost associated with using active elicitation limits these datasets. As Table II shows, the size of databases featuring active elicitors generally cannot reach the standard of datasets featuring passive elicitors. The passive elicitors of these datasets, however, are also less diverse, generally limited to very similar annotation tasks. This does not advance research on general affect modelling, as researchers have to examine dissimilar datasets [2] that often comprise mismatching data collection methods and annotation tools, or do not offer enough context variety (e.g. the FUNii Database [47] features two similar games from the same franchise). AGAIN addresses the aforementioned limitations by offering a large-scale corpus that is based on a set of dissimilar interactive affect elicitors annotated through a first-person protocol. While the dataset at the time of writing is limited to 9 games and their annotated arousal, the dataset is planned to be augmented with more affective dimensions and enriched with more games. The resulting dataset leverages the strength of active emotion elicitation while producing data in amounts comparable to databases featuring passive affect stimuli. Moreover, AGAIN provides a diverse database for general affect modelling research that is not possible within any of the existing corpora.

We position AGAIN at the intersection of traditional affective computing corpora and datasets with a focus on player experience. By focusing on a core affect dimension (i.e. arousal) instead of a game-related complex emotional outcome, we aim to make AC research even more relevant for game user research, and vice versa. As games are highly interactive media, the captured data and annotations encode not merely player affect but also behaviour and game context. We focus on first-person annotations to better capture the subjective intricacies of gameplay. Finally, we choose to record continuous unbounded traces of arousal using RankTrace [6] via the PAGAN online annotation framework [5]. Such traces can be processed and machine learned in a number of ways, including regression, classification and relational learning [7].

1 N/A indicates where the category is "not applicable" (e.g. there are no participants when third-party videos are used) and UNK indicates if an attribute is "unknown".

Table II
A SURVEY OF AFFECTIVE DATASETS OF AUDIOVISUAL CONTENT. A TABLE ENTRY IS INDICATED WITH 'N/A' AND 'UNK' IF IT IS NOT AVAILABLE AND UNKNOWN, RESPECTIVELY.

Database | Elicitation mode | Elicitation type | Items | Video | Participants | Modalities | Annotation mode | Annotation type | Labels | Annotators | Tasks
MAHNOB-HCI [38] | Passive | Video | 20 videos | 20 hours | 30 | EEG, ECG, EDA, temp., resp., face and body video, gaze, audio | First-person | Discrete (9-step) | Arousal, valence, dominance, emotional keywords, predictability | 30 | 20
DEAP [39] | Passive | Video | 40 videos | 40 mins | 32 | EEG, BVP, EDA, EMG, temp., resp., face video | First-person | Discrete (5-step) | Arousal, valence, dominance, liking, familiarity | 32 | 40
LIRIS-ACCEDE [40] | Passive | Video | 9,800 videos | 27 hours | N/A | N/A | First-person | Pairwise | Arousal, valence | 1517 (arousal), 2442 (valence) | UNK
Aff-Wild [41] | Passive | Video | 298 videos | 30 hours | 200 | N/A | Third-person | Continuous bounded | Arousal, valence | 6-8 | 298
AffectNet [42] | Passive | Image | 450,000 images | N/A | N/A | N/A | Third-person | Continuous bounded, categorical | Arousal, valence, 8 emotion categories | 12 | 37,500
Sonancia [18] | Passive | Audio | 1280 sounds | N/A | N/A | N/A | First-person | Pairwise | Arousal, valence, tension | UNK | 10
SEWA DB [43] | Passive, Active | Video | 4 videos, 1 task | 27 hours, 17 hours | 398 | Facial landmarks, FAU, hand and head gestures | Third-person | Continuous bounded | Arousal, valence, (dis)liking intensity, agreement, mimicry | 5 | 90
RELOCA [44] | Active | Video | 1 task | 4 hours | 46 | ECG, EDA, face video, audio | Third-person | Continuous bounded | Arousal, valence | 6 | 23
MazeBall [45] | Active | Game | 1 game | N/A | 36 | BVP (HRV), EDA, game telemetry | First-person | Pairwise | Fun, challenge, frustration, anxiety, boredom, excitement, relaxation | 36 | 1
PED [46] | Active | Game | 1 game | 6 hours | 58 | Gaze, head position, game telemetry | First-person | Discrete (5-step), pairwise | Engagement, frustration, challenge | 58 | 1
FUNii [47] | Active | Game | 2 games | N/A | 190 | ECG, EDA, gaze and head position, controller input | First-person | Continuous, discrete | Fun (cont.), fun, difficulty, workload, immersion, UX | 190 | 2
AGAIN | Active | Game | 9 games | 37 hours | 124 | Game video, game telemetry | First-person | Continuous unbounded | Arousal | 124 | 9

Figure 2. Start screens of the nine games included in the AGAIN dataset, showing each game's rules and the players' controls: (a) TinyCars, (b) Solid, (c) ApexSpeed, (d) Heist!, (e) TopDown, (f) Shootout, (g) Endless, (h) Pirates!, (i) Run'N'Gun.

IV. GAMES

Nine games, across three different genres, were designed and developed as affect elicitors specifically for the AGAIN dataset. We put careful consideration into creating software which is aesthetically pleasing, representative of popular sub-genres of games, can be understood immediately with a basic level of game literacy [49], and produces a coherent and consistent dataset without the need of heavy pre-processing. To achieve this, the games featured in AGAIN are all created using the Unity 3D engine2. The game genres (racing, shooters, platformers) were selected because they represent a good cross-section of game genres [50] and are among the most popular among gamers [24], [51], but also because they have simple enough controls and clear mechanics so that players can pick them up quickly. As opposed to other genres, like role-playing or strategy games, which require a longer time investment and players to learn specific mechanics, strategies and synergies, the games in the dataset rely on fast-paced genres and popular tropes to communicate the game rules as fast as possible. The specific games designed under each genre are representative of that genre.

A. Racing

Three games represent the racing genre, which is characterised by fast-paced driving along a given track. While racing games feature less direct interaction with opponents, players can often try to push others off the track or into a less favourable position. In all three games the races take place in a closed loop. The player always starts from the last position and has to fight their way up during the race. These games contain no combat mechanics, but other cars and the environment can still act as obstacles or challenges. If they feel stuck, players can press the 'R' key to be reset to the last checkpoint. The three racing games included in the AGAIN dataset are as follows:

• TinyCars is a top-down arcade racing game (see Fig. 2a). The player's view is isometric and the camera is at a fixed rotation. The controls are relative to the player's car. The racetrack features no large obstacles, but there is a jump-ramp and an overpass. While off-track, cars are slowed down considerably. The game was inspired by the classic Super Cars II (Magnetic Fields, 1991).

• Solid is a first-person rally game (see Fig. 2b) and plays similarly to games in the Colin McRae Rally series (Codemasters, 1998-2019). In this game the camera is fixed inside the car and the player's vision is partially blocked by the dashboard and the hood of the car. To help with visibility, the UI includes a rear-mirror. The racetrack includes a large loop, which the player has to speed up to pass through. There are no jump-ramps or other obstacles in the game. Going off-track slows the car down only a bit, making it a viable strategy to cut paths through curves.

• ApexSpeed is a third-person view speed-racer-type game (see Fig. 2c), like Wipeout (Psygnosis, 1995) or, more recently, Redout (34BigThings, 2016). The camera follows the player around in a 3D environment. The track is closed and the car moves forward automatically after the race starts. The racetrack features speed boosts, jump-ramps, and dangerous obstacles, which set players back to the last checkpoint. Because of the closed track, the cars cannot go off-road and the impact of collisions is reduced as well.

B. Shooters

Three games represent the shooter genre, which is characterised by action and eliminating enemies. Shooter games, as the name implies, feature projectile weapons and rely heavily on hand-eye coordination and fast reflexes. In all of the shooter games the player has to aim and fire using the mouse, while using four keys to navigate (except Shootout, where the player remains stationary; see Fig. 2f). These are the only games in the dataset that require a mouse. The three shooter games included in AGAIN are as follows:

• Heist! is a first-person shooter game (see Fig. 2d). The player controls the character's movement with the keyboard, while directing their gaze with the mouse. The player can also sprint and crouch behind objects for cover. The weapon used in this game is a semi-automatic pistol with limited ammunition, which has to be manually reloaded by using the 'R' key. The player's health automatically regenerates at a steady rate when out of combat. If the player runs out of health, they are reset to the beginning of the level. The game imitates modern first-person shooters like Call of Duty: Modern Warfare (Infinity Ward & Sledgehammer Games, 2019).

• TopDown is a third-person, top-down shooter (see Fig. 2e), resembling games like Neon Chrome (10tons Ltd., 2016). This game has an isometric top-down camera, which follows the player but its rotation is fixed. Instead of directing the camera's gaze, the player moves the reticle around the screen with the mouse. The player has an automatic rifle with unlimited ammunition. The player's health does not regenerate, but they can pick up health-packs on the level to replenish it. When the player runs out of health, just as in Heist!, they respawn to the beginning of the level with full health.

• Shootout is a first-person arcade shooter (see Fig. 2f). The camera is fixed and the player cannot move their character, only aim with the mouse. The player has no health and the game is played only for the highest score. Enemies appear at ever increasing speeds until the clock runs out. The player has a revolver, which is automatically reloaded when bullets run out, preventing shooting for 2 seconds. This game was inspired by classic shooting gallery games such as Hogan's Alley (Nintendo, 1984).

C. Platformers

Finally, three games represent the platformer game genre; this genre's gameplay focuses on traversal and often requires precision and dexterity. While platformers often feature enemies, the core goal of most platformer games is to reach the end of the level (or in some cases to go on for as long as possible). The platformer games in the dataset have the most diverse control schemes, with Endless requiring two keys to navigate and one to attack, Pirates! requiring three keys to navigate, and Run'N'Gun requiring five keys to navigate and one to attack. The three platformers of AGAIN are the following:

• Endless is a casual endless-runner game (see Fig. 2g). Rather than reaching the end of the level, the player's goal is to stay alive for as long as possible on an endlessly looping map with randomly generated enemy placement. In Endless, the player can switch between two tracks or hit incoming enemies. The game also features pickups which can make the game harder (i.e. speed boost) or easier (i.e. slow down). Additionally, the player can collect coins to increase their score. On a collision with an enemy or an obstacle, the player loses score and the speed of the game is reset.

• Pirates! is a classical platformer (see Fig. 2h) with gameplay which resembles Super Mario Bros. (Nintendo, 1985). The game focuses on traversal, especially jumping to solve light platform puzzles and collecting coins to increase the player's score. The game also features a health pickup akin to the Super Mushroom in Super Mario Bros. While the player has no weapons or special attacks, they can defeat enemies by jumping on their heads. On direct collision with the enemies, however, the player is reset to the last checkpoint.

• Run'N'Gun is a shoot 'em up platformer (see Fig. 2i), imitating games like Metal Slug (SNK, 1996). While the goal of the game is for the player to reach the end of the level, the gameplay includes combat. The player has a health bar and a weapon, and the game features health pickups to replenish health. Enemies come in a large variety, with melee and ranged enemies, and bosses with multiple weapons and health bars. Run'N'Gun is the only platformer which awards the player for defeating enemies. If the player runs out of health, they are sent back to the last checkpoint.

2 https://unity.com/

Table III
NUMBER OF FEATURES EXTRACTED PER GAME

Genre | Sub-Genre | Title | # Features
Racing | Arcade-Racing | TinyCars | 33
Racing | Rally | Solid | 34
Racing | Speed-Racer | ApexSpeed | 34
Shooter | First-Person Shooter | Heist! | 37
Shooter | Top-down Shooter | TopDown | 38
Shooter | Arcade-Shooter | Shootout | 23
Platform | Endless Runner | Endless | 33
Platform | Mario-Clone | Pirates! | 39
Platform | Shoot'Em'Up | Run'N'Gun | 47
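The feature counts in Table III reflect the removal of zero-variance columns, i.e. the constant padding values given to features that do not apply to a given game. A minimal sketch of such filtering, assuming a simple dict-of-columns representation of the telemetry (the column names are illustrative, not the dataset's actual logging code):

```python
def drop_zero_variance(features: dict[str, list[float]]) -> dict[str, list[float]]:
    """Remove features that never change across a session, e.g. columns
    padded with constant values for games where they do not apply."""
    return {name: values for name, values in features.items()
            if len(set(values)) > 1}

# Hypothetical telemetry snippet: visible_loop_count only varies in Solid,
# so in the other racing games it is logged as a constant and dropped here.
tinycars = {
    "player_score": [0, 10, 10, 25],
    "input_intensity": [3, 1, 4, 2],
    "visible_loop_count": [0, 0, 0, 0],  # zero-variance padding
}
print(list(drop_zero_variance(tinycars)))  # ['player_score', 'input_intensity']
```

The same filter applied per game yields different surviving feature sets, which is why the counts in Table III differ across titles within the same genre.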

V. AGAIN DATASET

Games in the AGAIN dataset were built for the WebGL platform and are played in a web browser. The games were integrated into the PAGAN annotation platform [5], which allowed the large-scale crowdsourcing of both the game playing and annotation tasks.

A. Protocol

The collection procedure took anywhere between 45 and 55 minutes. Participants were invited through Amazon's Mechanical Turk service3 and were compensated with 10 USD for their time. The only criterion for participation was prior purchase of videogames, in order to filter out potential subjects who might not have the game literacy required to play the games. Participants were greeted with an introduction screen (see Fig. 3), which informed them about the overall task and explained arousal as a feeling of tension, excitement, exhilaration or readiness, and the opposite of boredom, calmness or relaxation. During the experiment, participants played and annotated all 9 games in the dataset. Each game (and each annotation task) takes approximately 2 minutes to complete. During their play, game telemetry was collected at a rate of 4 Hz and the game canvas was recorded in video format. The collection procedure was set up in an iterative manner, with participants playing for 2 minutes, then annotating their gameplay video for 2 minutes. The order of the games was randomised and this procedure was repeated until all games were played and annotated. After the experiment, participants filled in a simple exit survey recording their biographical data and gaming habits.

Figure 3. Introduction screen of the experiment.

B. Participants

Through the procedure presented in Section V-A, we collected data from 124 participants4, which includes 1,116 gameplay sessions (124 sessions per game) with detailed telemetry and over 37 hours of gameplay videos. Out of the 124 participants, one identified as non-binary, 43 as female, and 80 as male. Participants' ages varied between 19 and 55 years old (average of 33). Most participants were from the USA (82%); the remaining 22 participants came from Brazil (10 participants), Italy (3), Canada (2), India (2), the Czech Republic (1), Germany (1), and Romania (1). Most participants identified as casual gamers (57%) or hard-core gamers (36%). Reflectively, the majority of participants (87%) were playing daily or weekly. All participants had either a PC or a gaming console or both, with the most popular platform being PC. Participants played very diverse games in their free time across different genres: from casual games through platformers, sports simulators and shooters, to role-playing games. The anonymised demographic data is included in the dataset.

C. Game Footage Videos

For realising first-person annotation, the gameplay footage of players had to be recorded and annotated by the players themselves. As a result, the raw AGAIN dataset features 1,116 videos of around 2 minutes each (i.e. over 37 hours of game footage). The video database contains more than 3 × 10^6 frames of video, which are recorded at 24 FPS and have a resolution of 960 × 600 pixels. The characteristics of the AGAIN corpus enable the use of data-hungry deep learning methods for directly mapping affect to frame pixels [3].

D. Game Context Features

In addition to the raw footage, AGAIN features a number of hand-crafted attributes for each game. Inspired by advances in machine learning with privileged information [52], we view telemetry data as privileged information and include such ad-hoc features in the dataset. Fusing gameplay features with other user modalities has also been a dominant practice in game-based affective computing [53], [54]. The game context features described in this section are considered in the preliminary data analysis of the dataset in Section VI.

3 https://requester.mturk.com/
4 While 169 participants completed the data collection process, 45 participants were omitted as their experiments were incomplete (i.e. no video or annotation data) due to software or hardware error.

Table IV
THE GENERAL GAMEPLAY FEATURES OF AGAIN

feature | description
time_passed | time counted from the start of the recording
score | player score
input_intensity | number of keypresses
input_diversity | number of unique keypresses
idle_time | percentage of time spent without input
activity | inverse of idle_time
movement | distance travelled + reticle moved (in shooters)
bot_count | number of bots visible
bot_movement | bot distance travelled
bot_diversity | number of unique bots visible
object_intensity | number of objects of interest
object_diversity | number of unique objects
event_intensity | number of events
event_diversity | number of unique events
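The input-derived features in Table IV can be computed directly from a window of logged keypress events. The following is a minimal sketch under illustrative assumptions: input is represented as the list of keys pressed per logging tick, and "inverse of idle_time" is interpreted as its complement within the window (function and variable names are hypothetical, not the dataset's actual logging code):

```python
def input_features(ticks: list[list[str]]) -> dict:
    """Compute input-derived general features for one gameplay window.

    ticks: for each logging tick (the paper logs at roughly 4 Hz),
    the list of keys pressed during that tick.
    """
    all_keys = [key for tick in ticks for key in tick]
    idle_ticks = sum(1 for tick in ticks if not tick)
    idle_time = idle_ticks / len(ticks)            # fraction of time without input
    return {
        "input_intensity": len(all_keys),          # number of keypresses
        "input_diversity": len(set(all_keys)),     # number of unique keypresses
        "idle_time": idle_time,
        "activity": 1.0 - idle_time,               # inverse of idle_time
    }

# Four ticks (about one second of play): two idle, two with input.
demo = [["w"], [], ["w", "space"], []]
print(input_features(demo))
# {'input_intensity': 3, 'input_diversity': 2, 'idle_time': 0.5, 'activity': 0.5}
```

The bot-, object- and event-based features in Table IV follow the same intensity/diversity pattern, counting occurrences and unique identifiers per window respectively.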

All AGAIN games implement the same data-logging strategy and use a similar method for recording telemetry. Games within the same genre share the same feature labels. Not all features, however, have a qualitative meaning for all games within a genre; in Heist!, for instance, players move, but they are immobile in Shootout. To ease the data collection and aggregation process, when features are absent from a game they are given values with zero variance (zeroes or ones, depending on the feature). For example, a looping racetrack is only present in the Solid game (see Figure 2b); therefore, the visible_loop_count feature is always zero in the other racing games.

Table III shows the number of features we have extracted per game with the zero-variance features removed. The recorded game telemetry encodes control events initiated by the player (e.g. player_steering), player status (e.g. player_health), gameplay events outside of the player's control (e.g. bot_aim_at_player), bot status (e.g. bot_offroad), and the proximal and general game context (e.g. bot_player_distance and pickups_visible). Gameplay is recorded at approximately 4 Hz (every 250 ms). Due to limitations of the Unity engine and the WebGL format, the logging rate is not consistent. To mitigate this issue, the logging script aggregates multiple ticks of the engine's update loop and provides an average value. Due to this processing technique, almost all events are represented by continuous values. For example, pickups_visible can take float values under 1 when a pickup has just become visible at the end of the given 250 ms window. The only features represented by integer values are player_death and objects_destroyed, because of their sparsity.

In addition to the features enumerated in Table III, the dataset includes 14 general gameplay features. These general features are ad-hoc designed, derived from the game-specific events, and based on contemporary studies of general player modelling [2], [37]. Events which require expert evaluation of the game, such as the goal-oriented and goal-opposed events of Camilleri et al. [2], are omitted from these general features of AGAIN but may be considered as additional features. Table IV lists these features alongside their explanation.

Figure 4. The PAGAN RankTrace annotation interface. The gameplay video is played in the window above and the participant controls the annotation cursor (blue circle) below, drawing a visible annotation trace.

E. Annotation

The annotation task was administered through the PAGAN platform [5], using the RankTrace annotation method [6]. PAGAN is an online annotation platform developed as easy-to-use software for crowdsourcing annotation tasks, with a focus on one-dimensional time-continuous annotation using three different methods: RankTrace [6], an ordinal annotation framework; GTrace [55], a bounded annotation scale which gathers continuous data that can be converted to a Likert-like format; and BTrace, a binary annotation tool for both time-continuous and discrete annotation, inspired by AffectRank [56]. We have chosen RankTrace as the annotation framework for this dataset.

RankTrace allowed us to collect data in an unbounded fashion (see Fig. 4). This type of data is best interpreted as subjective, ordinal labels, as it preserves the relative relationships between datapoints [7]. The unbounded trace means that users can always adjust their annotations higher or lower than previous values, which alleviates much of the guesswork compared to when users annotate on an absolute and objective scale [54]. The ordinal nature of the annotation follows the cognitive process of human evaluation, as it provides a trace which factors in habituation [57], anchoring bias [58], [59] and recency effects [60].

F. Data Cleaning

To ease any subsequent analysis and future studies based on the dataset, in this section we propose a preprocessing pipeline which removes 10.8% of the dataset as outliers. AGAIN contains both the raw and the cleaned data that result from the process outlined here.

Table V
PRELIMINARY ANALYSIS OF THE CLEAN AGAIN DATASET. THE TABLE LISTS THE NUMBER OF GAME SESSIONS AND THEIR CORRESPONDING DATA POINTS ON A FRAME-BY-FRAME BASIS (250 MS). THE TABLE ALSO LISTS THE NUMBER OF 3 S TIME WINDOWS WITHIN WHICH THE AROUSAL VALUE INCREASES (↑), DECREASES (↓) OR STAYS STABLE WITHIN A 10% THRESHOLD BOUND (—).

                                 Arousal (3 s interval)
Game        Sessions  Data (·10³)   ↑      ↓      —
TinyCars    109       52.75         543    461    3386
Solid       109       53.42         613    492    3346
ApexSpeed   114       56.10         607    462    3581
Racing      332       162.27        1763   1415   10313
Heist!      110       53.91         580    424    3479
TopDown     115       56.90         650    463    3614
Shootout    106       51.77         471    341    3496
Shooter     331       162.57        1701   1228   10589
Endless     112       55.11         559    438    3595
Pirates!    110       52.26         625    534    3186
Run'N'Gun   110       54.97         618    431    3521
Platformer  332       162.34        1802   1403   10302
Total       995       487.18        5266   4046   31204
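Because RankTrace traces are unbounded, their absolute values carry no meaning across sessions; only the relative ordering within a session does. A minimal sketch (assumed helper names, not part of the AGAIN toolchain) of per-session min-max normalisation and ordinal labelling, showing that the ordinal labels are invariant to any positive affine rescaling of a trace:

```python
def normalise_per_session(trace):
    """Min-max normalise one session's unbounded trace to [0, 1]."""
    lo, hi = min(trace), max(trace)
    if hi == lo:                     # flat trace: no dynamics to preserve
        return [0.0 for _ in trace]
    return [(v - lo) / (hi - lo) for v in trace]

def ordinal_labels(trace):
    """Sign of change between consecutive values: 1 up, -1 down, 0 stable."""
    return [(b > a) - (b < a) for a, b in zip(trace, trace[1:])]
```

Two annotators whose traces differ only by scale and offset thus produce identical ordinal labels, which is the core argument for treating such traces ordinally.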

Since PAGAN only records annotations when there is a change in the signal, and the Unity engine loop is affected by hardware performance, as a first step we resample the whole dataset at 4 Hz to obtain a consistent signal. We remove duplicate values from the dataset, as well as sessions which are either too short (less than 1 minute) or too long (more than 3 minutes) due to software or technical errors during crowdsourcing. We also prune sessions which have fewer than 10 annotation points, assuming that the participant was unresponsive. This initial cleanup phase removes 24 sessions (2.1% of the data).

To clean the dataset further, we apply Dynamic Time Warping (DTW) to obtain an approximate similarity measure between traces. DTW is used in time-series analysis to measure the similarity between temporal sequences that might be out of sync or vary in speed [61]. DTW works by calculating a warping path between two signals using a similarity matrix, and provides a useful metric which qualifies time-series in the form of a cumulative DTW distance describing the similarity to a baseline trace or to other signals [61]. We apply both of these strategies when cleaning the dataset. As a first step, we calculate the cumulative DTW distance to an artificial flat baseline (arousal annotations at 0 in all time windows). The resulting score provides a similarity measure to an artificial session in which the participant performed no annotation; this allows us to remove unresponsive outliers. We remove all sessions which fall more than two standard deviations closer to zero than the average cumulative distance (the left tail of the distribution). This step removes 28 additional sessions from the dataset (2.5%).

Finally, we calculate the cumulative DTW distance between each session and every other session and sum up the resulting distances. This metric shows the relative similarity of a session to every other session. We remove all sessions which fall more than two standard deviations away from the average summed cumulative distance (see Fig. 5). This step removes an additional 69 sessions (6.2%). This last step removes annotations which are too dissimilar from the general trends of participants' annotations; we presume that either the annotation was improper or that the session's elicitor was somehow not in line with how other players played the same game. At the end of the cleaning process, 121 sessions (including all data from 2 participants) are removed (10.8%). The cleaned dataset consists of 122 participants and 995 sessions; more details on the cleaned dataset are provided in Section VI.

Figure 5. Distribution of summed cumulative DTW distance values of each session compared to every other session. The solid line shows the average score, while the dotted lines show the first and second standard deviation. Values in the grey field (right tail) are removed during data cleaning.

VI. AGAIN ANALYSIS

Following the cleanup process presented in Section V-F, this section performs a preliminary analysis of the clean version of the AGAIN dataset, focusing on patterns in the arousal annotations (Section VI-A) and in the AGAIN game context features (Section VI-B). The section concludes with an initial set of affect modelling experiments on the AGAIN dataset (Section VI-C) that can serve as a baseline for future studies with this dataset. While some games receive more aggressive data cleaning than others (TinyCars, Solid, and Shootout), overall there is an even distribution of data and sessions across genres, as shown in Table V.

A. Trends in Annotations

Figure 6 shows the average annotation trace, calculated by averaging the values of all sessions' traces in time windows of 250 ms. It is evident that arousal annotations tend to have an upwards tendency. This is not surprising, as most games considered are action-oriented with an ever-increasing challenge; for instance, Endless keeps increasing the speed of the game, which evidently makes it both harder and more arousing as time passes. Racing games (top row of Figure 6), on the other hand, tend to have arousal converging to a maximum mean value after the first 30 seconds. This is likely because the player is initially rushing to overtake the opponents' cars (players always start last); after this initial excitement the race becomes repetitive, with players trying to either maintain the lead or slowly catch up to the leader.

Figure 6. Average annotation traces (normalised per session) showing an increasing tendency. The coloured area around the mean depicts the 95% confidence interval of the mean.

B. Trends in Game Context Features

Observing the twelve general gameplay features shared across all nine games, one can detect some notable differences between games. In terms of the player's input (control), games with more complex interaction schemes appear to have higher input diversity and input intensity (see Table IV for details on these features). Even accounting for the games' different control schemes (i.e. the number of controls the player has available), ApexSpeed, Shootout, and Endless have the lowest intensity (number of keypresses) and diversity (number of unique keypresses), while Pirates! and TinyCars have the highest diversity. This discrepancy could point to an easier control scheme for the former games, but it could also point to a more frantic and engaging interaction in the latter games. The idle time and activity features corroborate this observation, as racing games have less idle time without keypresses (since in two of the games the player needs to constantly press a button to move forward). In contrast, games where participants mainly reacted to stimuli (e.g. in Shootout players react to opponents popping up, and in Endless players jump only when a gap or obstacle is near) featured much higher idle times. In terms of other features, the number of bots (opponents) visible on screen varied wildly between games, with TinyCars and Shootout having the highest number of visible enemies on average. Perhaps due to the many enemies present, Shootout had the highest number of events (event_intensity in Table IV), while Solid had the fewest events per time window.

Comparing the general gameplay features across games requires some normalisation in order to account for discrepancies in value ranges both between games (e.g. in terms of score) and between players of the same game. Following the paradigm of treating both input and output as relative [2], the gameplay features of each time window are normalised to the [0, 1] value range within each session. As a result, such normalised features consider the dynamics of a single player in a given session (e.g. in which time window the player achieved the top score of their session), disregarding for example whether other players reached higher scores in the same game. Since arousal is similarly a deeply subjective notion, the player is expected to annotate arousal in the context of their current session (e.g. whether their arousal might increase if they start performing better than they were performing previously in the same session). After all 4.9 · 10^5 game context data points were normalised in this fashion, we applied t-distributed Stochastic Neighbor Embedding (t-SNE) [62] to map the data onto a two-dimensional space. Figure 7 shows the resulting data distributions.

The visualised distributions offer some important insights into the differences between games. In particular, every game's general features tend to exhibit different patterns compared to the other 8 games. Moreover, the compressed (game context) feature distributions across the three shooter games (middle row of the figure) appear quite distinct from one another. In some cases, however, there appears to be an overlap, either between games of the same genre (e.g. all racing games) or between games of different genres (e.g. Pirates! and Solid). Even though this type of data visualisation cannot shed light on all possible differences between games, it indicates that the games impact the patterns of the data solicited from players (i.e. the context influences the user behaviour). The t-SNE analysis also indicates that the problem of mapping between game content and arousal seems to be easier for some games (and game genres) than others. The next section hosts an initial study that investigates the potential of machine learning for deriving such a mapping.

C. Preliminary Arousal Models

In this section we provide an initial analysis of the AGAIN dataset which aims to serve as a baseline study for future affect modelling attempts with this dataset. In particular, we process the clean AGAIN dataset to predict arousal. To that end, we split the annotation traces into 3-second time windows and compute the mean arousal value from all data points in each window. Following common practice in affective computing [2], [6], [63], we introduce a time offset of 1 second to the annotation traces. As discussed in Section VI-B, all features (including the arousal values) are normalised on a per-session basis.
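The DTW-based cleaning steps above can be sketched with a textbook dynamic-programming DTW and a two-standard-deviation rule. This is a simplified illustration of the flat-baseline step (function names are assumptions; the paper's exact pipeline may differ):

```python
import statistics

def dtw_distance(a, b):
    """Cumulative DTW distance between two 1-D sequences (O(n*m) textbook DP)."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

def flag_unresponsive(traces, n_std=2.0):
    """Flag near-flat traces: sessions whose cumulative DTW distance to a flat
    zero baseline sits in the left tail (more than `n_std` standard deviations
    below the mean distance), i.e. suspiciously close to 'no annotation'."""
    baseline = [0.0] * max(len(t) for t in traces)
    dists = [dtw_distance(t, baseline) for t in traces]
    mu, sd = statistics.mean(dists), statistics.stdev(dists)
    return [d < mu - n_std * sd for d in dists]
```

The pairwise step follows the same pattern, replacing the flat baseline with the summed DTW distance of each session to every other session and removing the right tail instead.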

Figure 7. Projections of general game features in a 2D space with t-distributed Stochastic Neighbor Embedding.

Figure 8. Performance of random forest models of arousal for each game with game-specific, general, and all available features. The dotted line depicts the performance baseline and the error bars represent 95% confidence intervals.

We treat arousal modelling in AGAIN as a preference learning task [7], [8], [64] and focus on predicting the change in arousal from one 3-second time window to the next. To reduce experimental noise from trivial changes within the arousal trace, we omit all consecutive time windows between which the arousal change is less than 10% of the total amplitude of the session's arousal trace. While this 10% threshold is based on prior experiments on similar problems [18], [65], a more extensive analysis could explore the impact of the threshold value on prediction accuracy and the volume of data lost. Pairs of consecutive time windows where the mean arousal in the second window is higher than in the first (over the threshold) are labelled with a value of 1 (↑; see Table V) and with -1 otherwise (↓; see Table V). As noted, pairs where the absolute difference in arousal is below the 10% threshold are labelled as "stable" (—; see Table V) and omitted from the dataset. Table V shows the distribution of ascending, descending and stable pairs of time windows per game.

By applying this pairwise transformation to consecutive time windows, the preference learning paradigm is reformulated as binary classification. To construct accessible and simple models of arousal, in this initial study we employ a Random Forest classifier. A Random Forest (RF) is an ensemble learning method which operates by constructing a number of randomly initialised decision trees and uses the mode of their independent predictions as its output. Decision trees are simple learning algorithms which operate through an acyclical network of nodes that split the decision process along smaller feature sets and model the prediction as a tree of decisions [66]. In this paper we use the RF implementation of the Scikit-learn Python library [67]. We initialise RFs with their default parameters; to control overfitting, we set the number of estimators in the RF to 100 and the maximum depth of each tree to 10. This experimental setup is meant to provide a simple baseline prediction performance for the dataset and, thus, we do not tune the hyperparameters of the algorithm any further in this paper.

To examine the validity of the general features discussed in Section V-D, models are constructed for each game based on three different feature sets: 1) game-specific features, excluding the additional general features; 2) general features, including only the features shown in Table IV; and 3) all features combined. Due to the pairwise transformation discussed above, the baseline accuracy of all experiments is 50%. Because RFs are stochastic algorithms, we run each experiment 5 times and report the 10-fold cross-validation accuracy. Note that each fold contains the data of 10 to 12 participants and no two folds contain data from the same participant. The reported statistical significance is measured with two-tailed Student's t-tests with α = 0.05, adjusted with the Bonferroni correction where applicable.

Figure 8 shows the performance of the RF models. Prediction accuracy varies between 58.06% and 82.50% across games. The results reveal that arousal appears to be easier to predict in some games (e.g. ApexSpeed, TopDown, and Endless) than in others (e.g. TinyCars, Shootout, and Run'N'Gun). In the racing and platformer genres, games with fewer input options and an automatic progression system (ApexSpeed and Endless, respectively) are tied to higher model performance. An explanation could be that games with more internal structure (due to the sparsity of actions the player can take and automatic progression through the game with minimal input) present a simpler problem.
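The pairwise transformation described above reduces to a few lines: consecutive 3-second window means become +1/-1 preference labels, with near-stable pairs dropped. A sketch under the paper's assumptions (traces already normalised to [0, 1] per session, so the 10% amplitude bound becomes an absolute 0.1 threshold; the function name is hypothetical):

```python
def pairwise_arousal_labels(window_means, threshold=0.10):
    """Label consecutive 3-second windows for preference learning.

    `window_means` are per-window mean arousal values, already normalised
    to [0, 1] per session, so the paper's 10% amplitude bound becomes an
    absolute threshold of 0.1. Returns 1 (arousal up), -1 (down), or None
    (stable pair, omitted from training).
    """
    labels = []
    for a1, a2 in zip(window_means, window_means[1:]):
        delta = a2 - a1
        if abs(delta) < threshold:
            labels.append(None)      # "stable" (the '—' column of Table V)
        else:
            labels.append(1 if delta > 0 else -1)
    return labels
```

Dropping the None pairs leaves a binary task with a 50% baseline; the paper then fits scikit-learn's RandomForestClassifier (100 estimators, maximum tree depth 10) on these labels, with participant-disjoint folds.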
An exception to this observation is Shootout, in which the controls are limited (only looking around and shooting) and enemies appear at an ever-increasing speed; despite these similarities with ApexSpeed and Endless, Shootout models struggle to reach 60% accuracy (the lowest performance across all games).

Looking at individual games across the different feature sets, we observe that the general features perform comparably to the specific features independently of the game tested. Game-specific features yield significantly higher performances than general features in only 4 games (TinyCars, Solid, Endless, and Pirates!). Moreover, the combination of both specific and general features yields significantly more accurate arousal models than either the game-specific or general features (or both) in 5 games: Solid, Heist!, TopDown, Endless, and Pirates!. These results demonstrate the robustness of the general features presented in Section V-D and show that there is little to no trade-off in representing the presented games in a more abstract and general manner.

The arousal model performances presented in this section highlight a number of challenges for future research. Firstly, the differences in performance between games show that the complexity of the affect modelling task depends on the characteristics of the elicitor and the game context. Finding new processing methods, data treatments, algorithms, and model architectures which perform equally well across different games is an open problem. Secondly, the robustness demonstrated by the general features proposed in this paper points towards the possibility of general affect modelling across games. While research has already investigated general affect modelling in videogames [2], early results showed only moderate success. The dataset and baselines presented in this paper provide a large open-source database of games with robust enough general features to continue the exploration of general affect modelling.

VII. DISCUSSION & CONCLUSION

This paper introduced a new database for affective modelling, the AGAIN dataset. AGAIN is the largest and most diverse publicly available dataset coupling gameplay context, video footage of games, and annotated affect to date. It includes a variety of interactive elicitors, in the form of nine games from three popular yet dissimilar game genres. In particular, the dataset consists of 37 hours of video footage accompanied by telemetry and self-annotated arousal labels from 1,116 gameplay sessions played by 124 participants. The motivation behind the construction of this dataset is to facilitate and further advance research on general affect modelling through a clean, large-scale, diverse (elicitor-wise) and accessible database.

While each game elicits similar playstyles across different participants, the database features unique videos with self-annotated arousal traces. AGAIN puts an emphasis on first-person annotation as, compared to a third-person annotation scheme, it is expected to yield ground truths of affect that are closer to the affect experienced [7], [63], [68]. The existing in-game footage of AGAIN, however, can be used directly for third-person annotation in future studies. Regardless of the annotation scheme used (first vs. third person), AGAIN annotations are captured in an unbounded fashion which eliminates high degrees of reporting bias [7], [8].

The current dataset only encodes one affective dimension, arousal, across videos from 9 games; AGAIN, however, is easily scalable to more affective dimensions (e.g. valence or dominance) and more game-based affect stimuli. Future work will focus on expanding the labels with expert annotations of valence and dominance to match the format of other affective computing databases [39], [41], [43], [44]. Its accessibility and its unobtrusive data collection through crowdsourcing make AGAIN easily extendable to more affect labels, affect elicitors and participants.

Inspired by recent work on the importance of game context as a predictor of affect [3], the user modalities of AGAIN are currently limited to in-game video footage and behavioural telemetry data. In addition, the protocol of AGAIN limits the user modalities available so that first-person crowdsourcing of affect annotations is both feasible and efficient. While AGAIN puts an emphasis on accessibility, soliciting game context and behavioural data from users as its modalities, the AGAIN games can be used for small-scale, lab-based affect studies that incorporate more user modalities, including visual and auditory player cues (e.g. [46]).

Given the characteristics of a unique set of diverse elicitors, a large participant count, first-person annotations and a large-scale video and game telemetry database, AGAIN couples important aspects of affective computing with core aspects of game user modelling, thereby enabling research in the area of general affect modelling, in games and beyond.

ACKNOWLEDGEMENTS

This project has received funding from the European Union's Horizon 2020 programme under grant agreement No 952002.

REFERENCES

[1] J. Togelius and G. N. Yannakakis, "General general game AI," in 2016 IEEE Conference on Computational Intelligence and Games (CIG). IEEE, 2016, pp. 1–8.
[2] E. Camilleri, G. N. Yannakakis, and A. Liapis, "Towards general models of player affect," in Proceedings of the International Conference on Affective Computing and Intelligent Interaction (ACII), 2017, pp. 333–339.
[3] K. Makantasis, A. Liapis, and G. N. Yannakakis, "From pixels to affect: A study on games and player experience," in 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII). IEEE, 2019, pp. 1–7.
[4] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, 2015.
[5] D. Melhart, A. Liapis, and G. N. Yannakakis, "PAGAN: Video affect annotation made easy," in Proceedings of the 8th International Conference on Affective Computing & Intelligent Interaction (ACII), 2019.
[6] P. Lopes, G. N. Yannakakis, and A. Liapis, "RankTrace: Relative and unbounded affect annotation," in Proceedings of the Intl. Conference on Affective Computing and Intelligent Interaction. IEEE, 2017, pp. 158–163.
[7] G. N. Yannakakis, R. Cowie, and C. Busso, "The ordinal nature of emotions: An emerging approach," IEEE Transactions on Affective Computing, 2018.
[8] ——, "The ordinal nature of emotions," in Proceedings of the International Conference on Affective Computing and Intelligent Interaction (ACII). IEEE, 2017, pp. 248–255.
[9] J. Diehl-Schmid, C. Pohl, C. Ruprecht, S. Wagenpfeil, H. Foerstl, and A. Kurz, "The Ekman 60 faces test as a diagnostic instrument in frontotemporal dementia," Archives of Clinical Neuropsychology, vol. 22, no. 4, pp. 459–464, 2007.
[10] C. Westbury, J. Keith, B. B. Briesemeister, M. J. Hofmann, and A. M. Jacobs, "Avoid violence, rioting, and outrage; approach celebration, delight, and strength: Using large text corpora to compute valence, arousal, and the basic emotions," The Quarterly Journal of Experimental Psychology, vol. 68, no. 8, pp. 1599–1622, 2015.
[11] J. A. Russell, "A circumplex model of affect," Journal of Personality and Social Psychology, vol. 39, no. 6, p. 1161, 1980.
[12] E. Cambria, A. Livingstone, and A. Hussain, "The hourglass of emotions," in Cognitive Behavioural Systems. Springer, 2012, pp. 144–157.
[13] A. Mehrabian, Basic Dimensions for a General Psychological Theory: Implications for Personality, Social, Environmental, and Developmental Studies. Cambridge, 1980.
[14] H. Aviezer, R. R. Hassin, J. Ryan, C. Grady, J. Susskind, A. Anderson, M. Moscovitch, and S. Bentin, "Angry, disgusted, or afraid? Studies on the malleability of emotion perception," Psychological Science, vol. 19, no. 7, pp. 724–732, 2008.
[15] B. Nevo, "Face validity revisited," Journal of Educational Measurement, vol. 22, no. 4, pp. 287–293, 1985.
[16] M. Klarkowski, D. Johnson, P. Wyeth, C. Phillips, and S. Smith, "Psychophysiology of challenge in play: EDA and self-reported arousal," in Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems, 2016, pp. 1930–1936.
[17] A. Z. Abbasi, D. H. Ting, H. Hlavacs, L. V. Costa, and A. I. Veloso, "An empirical validation of consumer video game engagement: A playful-consumption experience approach," Entertainment Computing, vol. 29, pp. 43–55, 2019.
[18] P. Lopes, A. Liapis, and G. N. Yannakakis, "Modelling affect for horror soundscapes," IEEE Transactions on Affective Computing, vol. 10, no. 2, pp. 209–222, 2017.
[19] A. Clerico, C. Chamberland, M. Parent, P.-E. Michon, S. Tremblay, T. H. Falk, J.-C. Gagnon, and P. Jackson, "Biometrics and classifier fusion to predict the fun-factor in video gaming," in 2016 IEEE Conference on Computational Intelligence and Games (CIG). IEEE, 2016, pp. 1–8.
[20] D. Melhart, "Towards a comprehensive model of mediating frustration in videogames," Game Studies, vol. 18, no. 1, 2018.
[21] J. Seger and R. Potts, "Personality correlates of psychological flow states in videogame play," Current Psychology, vol. 31, no. 2, pp. 103–121, 2012.
[22] C. S.-H. Yeh, "Exploring the effects of videogame play on creativity performance and emotional responses," Computers in Human Behavior, vol. 53, pp. 396–407, 2015.
[23] D. Gabana, L. Tokarchuk, E. Hannon, and H. Gunes, "Effects of valence and arousal on working memory performance in virtual reality gaming," in 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII). IEEE, 2017, pp. 36–41.
[24] G. N. Yannakakis and J. Togelius, Artificial Intelligence and Games. Springer, 2018.
[25] S. C. Bakkes, P. H. Spronck, and G. van Lankveld, "Player behavioural modelling for video games," Entertainment Computing, vol. 3, no. 3, pp. 71–79, 2012.
[26] J. Pfau, J. D. Smeddinck, and R. Malaka, "Deep player behavior models: Evaluating a novel take on dynamic difficulty adjustment," in Extended Abstracts of the 2019 CHI Conference on Human Factors in Computing Systems, 2019.
[27] T. Mahlmann, A. Drachen, J. Togelius, A. Canossa, and G. N. Yannakakis, "Predicting player behavior in Tomb Raider: Underworld," in Proceedings of the Symposium on Computational Intelligence and Games (CIG). IEEE, 2010, pp. 178–185.
[28] Á. Periáñez, A. Saas, A. Guitart, and C. Magne, "Churn prediction in mobile social games: Towards a complete assessment using survival ensembles," in Proceedings of the International Conference on Data Science and Advanced Analytics (DSAA), 2016, pp. 564–573.
[29] M. Viljanen, A. Airola, J. Heikkonen, and T. Pahikkala, "Playtime measurement with survival analysis," IEEE Transactions on Games, vol. 10, no. 2, pp. 128–138, 2018.
[30] D. Melhart, A. Azadvar, A. Canossa, A. Liapis, and G. N. Yannakakis, "Your gameplay says it all: Modelling motivation in Tom Clancy's The Division," in Proceedings of the IEEE Conference on Games, 2019.
[31] D. Melhart, D. Gravina, and G. N. Yannakakis, "Moment-to-moment engagement prediction through the eyes of the observer: PUBG streaming on Twitch," in International Conference on the Foundations of Digital Games, 2020, pp. 1–10.
[32] C.-H. Wu, Y.-M. Huang, and J.-P. Hwang, "Review of affective computing in education/learning: Trends and challenges," British Journal of Educational Technology, vol. 47, no. 6, pp. 1304–1323, 2016.
[33] S. D'Mello, A. Kappas, and J. Gratch, "The affective computing approach to affect measurement," Emotion Review, vol. 10, no. 2, pp. 174–183, 2018.
[34] H. P. Martínez, M. Garbarino, and G. N. Yannakakis, "Generic physiological features as predictors of player experience," in Proceedings of the International Conference on Affective Computing and Intelligent Interaction (ACII). Springer, 2011, pp. 267–276.
[35] N. Shaker, M. Shaker, and M. Abou-Zleikha, "Towards generic models of player experience," in Proceedings of the Conference on Artificial Intelligence and Interactive Digital Entertainment (AIIDE), 2015.
[36] N. Shaker and M. Abou-Zleikha, "Transfer learning for cross-game prediction of player experience," in 2016 IEEE Conference on Computational Intelligence and Games (CIG). IEEE, 2016, pp. 1–8.
[37] V. Bonometti, C. Ringer, M. Ruiz, A. Wade, and A. Drachen, "From theory to behaviour: Towards a general model of engagement," arXiv preprint arXiv:2004.12644, 2020.
[38] J. Lichtenauer and M. Soleymani, "Mahnob-HCI-tagging database," 2011.
[39] S. Koelstra, C. Muhl, M. Soleymani, J.-S. Lee, A. Yazdani, T. Ebrahimi, T. Pun, A. Nijholt, and I. Patras, "DEAP: A database for emotion analysis using physiological signals," IEEE Transactions on Affective Computing, vol. 3, no. 1, pp. 18–31, 2012.
[40] Y. Baveye, E. Dellandrea, C. Chamaret, and L. Chen, "LIRIS-ACCEDE: A video database for affective content analysis," IEEE Transactions on Affective Computing, vol. 6, no. 1, pp. 43–55, 2015.
[41] S. Zafeiriou, D. Kollias, M. A. Nicolaou, A. Papaioannou, G. Zhao, and I. Kotsia, "Aff-Wild: Valence and arousal 'in-the-wild' challenge," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 34–41.
[42] A. Mollahosseini, B. Hasani, and M. H. Mahoor, "AffectNet: A database for facial expression, valence, and arousal computing in the wild," IEEE Transactions on Affective Computing, vol. 10, no. 1, pp. 18–31, 2017.
[43] J. Kossaifi, R. Walecki, Y. Panagakis, J. Shen, M. Schmitt, F. Ringeval, J. Han, V. Pandit, A. Toisoul, B. W. Schuller et al., "SEWA DB: A rich database for audio-visual emotion and sentiment research in the wild," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019.
[44] F. Ringeval, A. Sonderegger, J. Sauer, and D. Lalanne, "Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions," in 2013 10th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG). IEEE, 2013, pp. 1–8.
[45] G. N. Yannakakis, H. P. Martínez, and A. Jhala, "Towards affective camera control in games," User Modeling and User-Adapted Interaction, vol. 20, no. 4, pp. 313–340, 2010.
[46] K. Karpouzis, G. N. Yannakakis, N. Shaker, and S. Asteriadis, "The platformer experience dataset," in 2015 International Conference on Affective Computing and Intelligent Interaction (ACII). IEEE, 2015, pp. 712–718.
[47] N. Beaudoin-Gagnon, A. Fortin-Côté, C. Chamberland, L. Lefebvre, J. Bergeron-Boucher, A. Campeau-Lecours, S. Tremblay, and P. L. Jackson, "The FUNii database: A physiological, behavioral, demographic and subjective video game database for affective gaming and player experience research," in 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII). IEEE, 2019, pp. 1–7.
[48] S. Afzal and P. Robinson, "Natural affect data: Collection and annotation," in New Perspectives on Affect and Learning Technologies. Springer, 2011, pp. 55–70.
[49] D. Buckingham and A. Burn, "Game literacy in theory and practice," Journal of Educational Multimedia and Hypermedia, vol. 16, no. 3, pp. 323–349, 2007.
[50] J. J. Vargas-Iglesias, "Making sense of genre: The logic of video game genre organization," Games and Culture, vol. 15, no. 2, pp. 158–178, 2020.
[51] R. Sevin and W. DeCamp, "Video game genres and advancing quantitative video game research with the genre diversity score," The Computer Games Journal, pp. 1–20, 2020.
[52] V. Vapnik and A. Vashist, "A new learning paradigm: Learning using privileged information," Neural Networks, vol. 22, no. 5-6, pp. 544–557, 2009.
[53] H. P. Martínez and G. N. Yannakakis, "Deep multimodal fusion: Combining discrete events and continuous signals," in Proceedings of the International Conference on Multimodal Interaction. ACM, 2014, pp. 34–41.
[54] H. P. Martínez, G. N. Yannakakis, and J. Hallam, "Don't classify ratings of affect; rank them!" IEEE Transactions on Affective Computing, vol. 5, no. 3, pp. 314–326, 2014.
[55] R. Cowie, M. Sawey, C. Doherty, J. Jaimovich, C. Fyans, and P. Stapleton, "GTrace: General trace program compatible with EmotionML," in Proceedings of the Intl. Conference on Affective Computing and Intelligent Interaction. IEEE, 2013, pp. 709–710.
[56] G. N. Yannakakis and H. P. Martínez, "Grounding truth via ordinal

Antonios Liapis is a Senior Lecturer at the Institute of Digital Games, University of Malta, where he bridges the gap between game technology and game design in courses focusing on human-computer creativity, digital prototyping and game development. He received the Ph.D. degree in Information Technology from the IT University of Copenhagen in 2014. His research focuses on Artificial Intelligence in Games, Human-Computer Interaction, Computational Creativity, and User Modelling. He has published over 100 papers in the aforementioned fields, and has received several awards for his research contributions and reviewing effort. He has served as general chair in four international conferences, as guest editor in four special issues in international journals, and has co-organised 11 workshops.
annotation,” in Proceedings of the International Conference on Affective Computing and Intelligent Interaction (ACII), 2015, pp. 574–580. [57] R. L. Solomon and J. D. Corbit, “An opponent-process theory of motivation: I. temporal dynamics of affect,” Psychological Review, vol. 81, no. 2, p. 119, 1974. [58] A. R. Damasio, Descartes’ error: Emotion, rationality and the human brain. New York: Putnam, 1994. [59] B. Seymour and S. M. McClure, “Anchors, scales and the relative coding of value in the brain,” Current Opinion in Neurobiology, vol. 18, no. 2, pp. 173–178, 2008. [60] S. Erk, M. Kiefer, J. Grothe, A. P. Wunderlich, M. Spitzer, and H. Walter, “Emotional context modulates subsequent memory effect,” Neuroimage, vol. 18, no. 2, pp. 439–447, 2003. [61] D. J. Berndt and J. Clifford, “Using dynamic time warping to find patterns in time series.” in KDD workshop, vol. 10, no. 16. Seattle, WA, USA:, 1994, pp. 359–370. [62] L. van der Maaten and G. Hinton, “Viualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 11 2008. [63] A. Metallinou and S. Narayanan, “Annotation and processing of contin- uous emotional attributes: Challenges and opportunities,” in 2013 10th Georgios N. Yannakakis is a Professor and Director IEEE international conference and workshops on automatic face and of the Institute of Digital Games, University of gesture recognition (FG). IEEE, 2013, pp. 1–8. Malta. He received the Ph.D. degree in Informatics [64]J.F urnkranz¨ and E. Hullermeier,¨ “Preference learning,” in Encyclopedia from the University of Edinburgh in 2006. Prior of Machine Learning. Springer, 2011, pp. 789–795. to joining the Institute of Digital Games, UoM, in [65] D. Melhart, K. Sfikas, G. Giannakakis, G. N. Yannakakis, and A. Liapis, 2012 he was an Associate Professor at the Center “A motivational model of video game engagement.” Proc. 
of Machine for Computer Games Research at the IT University Learning Research, 2018 IJCAI workshop on AI and Affective Comput- of Copenhagen. He does research at the crossroads ing, vol. 86, pp. 26–33, in print. of artificial intelligence, computational creativity, [66] R. J. Lewis, “An introduction to classification and regression tree affective computing, advanced game technology, and (cart) analysis,” in Proceedings of the society for Academic Emergency human-computer interaction. He has published more Medicine (SAEM) annual meeting, 2000. than 260 papers in the aforementioned fields and his work has been cited [67] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, broadly. His research has been supported by numerous national and European O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vander- grants (including a Marie Skłodowska-Curie Fellowship) and has appeared plas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duch- in Science Magazine and New Scientist among other venues. He is currently esnay, “Scikit-learn: Machine learning in Python,” Journal of Machine an Associate Editor of the IEEE Transactions on Evolutionary Computation Learning Research, vol. 12, pp. 2825–2830, 2011. and the IEEE Transactions on Games, and used to be Associate Editor of [68] S. Afzal and P. Robinson, “Emotion data collection and its implications the IEEE Transactions on Affective Computing and the IEEE Transactions for affective computing,” The oxford handbook of affective computing, on Computational Intelligence and AI in Games journals. He has been the pp. 359–369, 2014. General Chair of key conferences in the area of game artificial intelligence (IEEE CIG 2010) and games research (FDG 2013, 2020). Among the several rewards he has received for journal and conference publications he is the recipient of the IEEE Transactions on Affective Computing Most Influential Paper Award and the IEEE Transactions on Games Outstanding Paper Award. 
He is a senior member of the IEEE.

David Melhart is a Ph.D. student at the Institute of Digital Games, University of Malta. He received a master’s degree in Cognition and Communication from the University of Copenhagen in 2016 and has been studying towards a Ph.D. degree in Game Research since 2017. His research focuses on Machine Learning, Affective Computing, and Games User Modelling. He was the Communication Chair of FDG 2020 and has been a recurring organiser and Publicity Chair of the Summer School series on Artificial Intelligence and Games (2018–2020).