Towards Recognizing Emotion in the Latent Space

Towards Recognizing Emotion in the Latent Space Marios Fanourakis Guillaume Chanel Rayan Elalamy Phil Lopes SIMS group SIMS group SIMS group Immersive Interaction Group University of Geneva University of Geneva University of Geneva EPFL Geneva, Switzerland Geneva, Switzerland Geneva, Switzerland Lausanne, Switzerland [email protected] [email protected] [email protected] phil.lopes@epfl.ch Abstract—Emotion recognition is usually achieved by collecting general, there are a few common methods to acquire ground features (physiological signals, events, facial expressions, etc.) to truth data: expert annotations, crowd-sourced annotations, self- predict an emotional ground truth. This ground truth is arguably reports, induced emotions. These different methods to acquire unreliable due to its subjective nature. In this paper, we introduce a new approach to measure the magnitude of an emotion in the the ground truth make the models difficult to meaningfully latent space of a Neural Network without the need for a subjective compare. Most are unavoidably subjective in nature since the ground truth. Our data consists of physiological measurements verbalized/communicated emotion does not necessarily reflect during video gameplay, game events, and subjective rankings of the true underlying emotion of the subject [13]. They also game events for the validation of our model. Our model encodes depend on the capacity of an individual to assess their own physiological features into a latent variable which is then decoded into video game events. We show that the events are ranked in or an other’s emotional state [9], [12]. Consequently, these the latent space similarly to the participants’ subjective ranks. methods only provide a crude approximation of the ground For instance, our model’s ranking is correlated (Kendall τ of truth. 0.91) with the predictability rankings. To create models that better capture the emotion of an Index Terms—affective computing, video games, emotion individual we must take a step back to look at where emo- recognition, neural networks, appraisal theory, arousal, valence, emotional dimensions tions come from. Moors [11], does an invaluable comparison of the different theories and concludes that there is much I. INTRODUCTION AND RELATED WORK agreement that emotions stem from a combination of com- One of the main motives for automatic emotion recognition ponent processes. We will focus on a specific component- has been the improvement of human computer interaction process representation called appraisal theory that is now well (HCI) by affording machines human-like abilities to better established [14]. The general appraisal process starts with an anticipate and adapt to their operators behaviours and needs. event which is subsequently appraised and weighed against Both Cowie et al. [3] and Fragopanagos et al. [5] describe the various criteria which together regulate the specific emotion. challenges and opportunities in this endeavour covering not In a recent analysis, Scherer and Moors [15], bring to light only the need for machines to recognize human emotions but problems in emotion research like the use of discrete emotions also how machines can influence human emotions. despite that emotions are often a combination of different Since then, there has been an abundance of research lit- components with varying amplitudes in a continuous space. erature on the topic of automatic emotion recognition using Although correlated, internal emotional states, experienced physiological signals, a topic which is still very active [6], feelings, and expressed feelings are not one and the same. [16]. Affective gaming is an exciting sub-field of HCI where The emotional states are mainly dependent on the various the emotions of video game players are detected and ana- appraisal criteria, the experienced feelings are the conscious lyzed in the context of gaming. Video games offer a high representation of an amalgamation of the internal emotional level of immersion and can elicit a wide range of emotions, states, the expressed feelings are a further modulation of making it a popular tool in emotion research [7], [10]. In the experienced feelings based on sociocultural norms and the literature, emotion recognition in affective gaming (and interpersonal relationships. in general) is achieved, almost invariably, by utilizing various Recent work of Yannakakis et al. [17] make strong argu- supervised learning techniques which require inputs of features ments for an ordinal approach to measuring and analyzing like physiological signals, events, facial expressions, etc. The emotions. Emotions and the subsequent subjective feelings targeted ground truth can be discrete (happy, sad, angry, etc.) towards an event are not absolute, they are experienced in or continuous (arousal, valence, etc.) [7], [8], [10], [16]. More relation to the emotions of previous events.Yannakakis et al. recently, deep learning techniques have also made their impact show the validity, reliability, and robustness across domains of in affective gaming [1]. an ordinal approach to measuring emotions. Hence, we also The quality and reproducibility of the resulting models is adopt this approach in our work. closely tied to the quality of the ground truth labels [2]. In In our work we attempt to recover the internal emotional state by creating a machine learning model that does not rely This work was funded by Innosuisse. on subjective feelings. We do not yet make any recommenda- tions for the specific layers and components best suited for this (ECG), electrodermal activity (EDA), and respiration of the task but rather propose a general architecture that resembles players using a Bitalino device. Immediately after the partici- parts of the appraisal process. pants finished their gameplay session, they were asked to rank We show that it is feasible to indirectly measure the ampli- a list of game events in terms of four emotional dimensions tude of an emotion in the latent space of a Neural Network [4]: arousal, valence, control, and predictability (four different without the need for a ground truth of emotion. ranking tasks of the game events per participant). Our experimental data consists of physiological signals In total, we collected physiological data from 19 dyads (38 collected from video game players while they are engaged participants). We visually inspected the signal quality for each in gameplay. We create a model with physiological features participant and discarded a participant’s physiological data if as input and specific events in the game as targets using a very the quality of either the EDA or ECG was not good. This left basic neural network with an encoder/decoder architecture. We us with physiological data for 19 participants. All 38 rankings then look at how the model’s latent/learned variables relate to for each ranking task remained valid. emotion, therefore avoiding the use of subjective data during A. Participant ranking analysis model training. To our knowledge this is a new approach to emotion recognition and we hope that our results encourage To show the disparity between the participants’ subjective further development in this direction. rankings of the game events, we compared them by calculating the Kendall τ rank correlation between all pairs of ranking in II. MODEL ARCHITECTURE each ranking task. The results of this comparison are shown in We designed our overall architecture in such a way that Figure 2, where we calculate a histogram of Kendall τ values it resembles a generic appraisal model where an event goes (in the range of −1 to +1). As expected, the participants did through an appraisal process (represented in the encoder) not perfectly agree. The median/mean Kendall τ for arousal, which produces an internal representation of emotional states valence, control, and predictability were 0:42=0:44, 0:56=0:52, (latent space) that are translated into a measurable physio- 0:42=0:44, and 0:38=0:34 respectively. logical reaction (decoder). This is illustrated in the upper -0.5 0.5 -1 0 1 part of Figure 1. The encoder and decoder components can 0.16 include any of the modern neural network layers and their Arousal combinations. Our main assumption is that if the inputs and 0.14 outputs of the model are strongly linked via internal emotional 0.12 states, then the latent space will capture these states. Valence density Probability 0.10 Control Events 0.08 Physiological signals/features 0.06 Physiological Predict. signals/features Events 0.04 Latent space Model(1) Encoder Decoder 0.02 Fig. 1: The proposed general neural network architecture. 0.00 Since our goal is to detect the emotional state from phys- Fig. 2: Kendall τ rank correlation histogram. iological signals we invert the model and use physiological signals on the input side and the events on the output like the Seeing how the participant rankings are correlated, we lower part of Figure 1. wanted to capture a global representative ranking for each ranking task. To achieve this, we used the Schulze Condorcet III. DATA AND IMPLEMENTATION method to combine all rankings into one for each ranking task. In our experiment, pairs of players were asked to play a The resulting rankings are shown in Table I. round of 1 vs 1 deathmatch using the Xonotic computer video Next, we wanted to see which events had the highest agree- game. Xonotic is an open source fast paced first person shooter ment among participants in each ranking task. For two events similar to Quake 4. The goal of the deathmatch gamemode i; j, we can count how many pairs of rankings have them is to be the first to get 10 frags (kills), the player respawns in the same relative order (number of concordances, conci;j) within a few seconds after each death. There are several items or disagree on their relative order (number of discordances, scattered in the environment that the player can pick up and disci;j). Then we can calculate a concordance score, csi, for which are replenished after a short time.

Load more