RAMAS: Russian Multimodal Corpus of Dyadic Interaction for Studying Emotion Recognition

Olga Perepelkina, Eva Kazimirova, Maria Konstantinova
Neurodata Lab LLC; Lomonosov MSU, Moscow, Russia

Abstract—Emotion expression encompasses various types of information, including face and eye movement, voice and body motion. Most of the studies in automated affective recognition use faces as stimuli, less often they include speech, and even more rarely gestures. Emotions collected from real conversations are difficult to classify using one channel. That is why multimodal techniques have recently become more popular in automatic emotion recognition. Multimodal databases that include audio, video, 3D motion capture and physiology data are quite rare. We collected the Russian Acted Multimodal Affective Set (RAMAS), the first multimodal corpus in the Russian language. Our database contains approximately 7 hours of high-quality close-up video recordings of subjects' faces, speech, motion-capture data and such physiological signals as electrodermal activity and photoplethysmogram. The subjects were 10 actors who played out interactive dyadic scenarios. Each scenario involved one of the basic emotions: Anger, Sadness, Disgust, Happiness, Fear or Surprise, and some characteristics of social interaction like Domination and Submission. In order to note emotions that subjects really felt during the process we asked them to fill in short questionnaires (self-reports) after each played scenario. The records were marked by 21 annotators (at least five annotators marked each scenario). We present our multimodal data collection, annotation process, inter-rater agreement analysis and the comparison between self-reports and received annotations. RAMAS is an open database that provides the research community with multimodal data of faces, speech, gestures and physiology interrelation. Such material is useful for various investigations and automatic affective systems development.

Index Terms—affective computing, multimodal affect recognition, multimodal dataset

I. INTRODUCTION

There are several approaches to the classification and categorization of affective conditions. Most studies use discrete categories (e.g. basic emotions such as happiness and sadness). However, a few studies focus on dimensional models (such as valence and arousal) [1]. The dimensional approach in automated emotion recognition faces challenges such as unbalanced data, overlapping categories and differences in the inter-observer agreement on the dimensions [1, 2]. On the other hand, categorical labels describe emotional states based on their linguistic use in our daily life. Most studies use a basic set of emotions that includes anger, happiness, sadness, surprise, disgust and fear [3]. At present there are many studies of automatic emotion recognition using one modality, and the most popular source for affect recognition is face data.

The investigations of facial expression recognition can be classified into two parts: the detection of facial affect (human emotions) and the detection of facial muscle action (action units) [4]. Currently a good classification accuracy (more than 90% [5, 6]) of basic emotions has been achieved using face images. Speech is another essential channel for emotion recognition. Some emotions can be distinguished from an audio stream even better than from video [7]. The average recognition level across different studies varies from 45% to 90% and depends on the number of categories, the classifier type and the dataset type [8]. Physiological signals like cardiovascular, respiratory and electrodermal measures are also successfully applied in emotion recognition. In several studies of biosignal-based affect classification the recognition rate is more than 80% [9, 10]. The analysis of body motion data for emotion recognition has become more common only recently. Movement data gives recognition rates comparable to facial expressions or speech in multimodal scenarios, and improves overall accuracy in multimodal systems when combined with other modalities [2].

Emotions collected from real conversations are difficult to classify by means of one channel. That is why multimodal techniques have recently become more popular in automatic emotion recognition. Multimodal databases containing audio-visual information and 3D motion capture are quite rare. The USC CreativeIT database [11] may serve as an example here. It includes full-body motion capture, video and audio data. This database provides annotations for each fragment concerning valence, activation and dominance categories. The interactive emotional dyadic motion capture database (IEMOCAP) [12] contains audio-visual and motion data for faces and hands only, but not for the whole body. The MPI Emotional Body Expressions Database [13] was also collected by means of several channels, yet only motion capture data are available for the research community. Finally, there are well-designed multimodal databases (e.g. the RECOLA dataset [14] and the GEMEP corpus [15]) that provide multichannel information, but leave out motion data. Thus, we decided to collect a multimodal database with multiple channels (video, audio, motion and physiology data) in Russian, since existing datasets have some limitations. RAMAS expands and complements existing datasets for the needs of affective computing.

II. RAMAS DATASET

A. Dataset collection

The Russian Acted Multimodal Affective Set (RAMAS) consists of multimodal recordings of improvised affective dyadic interactions in Russian. It was obtained in 2016-2017 by Neurodata Lab LLC [16]. Ten semi-professional actors (5 men and 5 women, 18-28 years old, native Russian speakers) participated in the data collection. Semi-professional actors are preferable to professional actors for analyzing movements in emotional states, as professional theatre actors may use stereotypical motion patterns [13, 17].

The actors were given scenarios with descriptions of different situations, but no exact lines. They were encouraged to gesticulate and move, but they had to stay in a certain, specially marked part of the room to achieve stable close-up footage of their faces. In order to perform refined motion tracking, all the actors were dressed in tight black clothes. First of all the participants contributed a sample of their neutral emotional state. Then they improvised from 30 to 60 seconds on each of the 13 topics. Interactions were conducted in mixed-gender dyads where participants were assigned to be either friends or colleagues. Scenarios implied the presence of one of the six basic emotions (Angry, Sad, Disgusted, Happy, Scared, and Surprised) and the neutral state in each dialogue. There were dominating and submissive roles in each scenario, balanced across all scenarios. English translations of the scenarios can be found under the link [18]. The actors were coordinated by a professional teacher from the Russian State University of Cinematography named after S. Gerasimov. Each scenario was played out from two to five times to achieve the best quality and the highest variety in emotional and behavioral expressions. The roles were assigned to actors in such a way that all states were evenly distributed between men and women. We also asked the subjects to fill in short questionnaires (self-reports) after each played scenario in order to note emotions they really felt during the process. The actors received fixed payments for the production days and consented to the usage of all of the recordings for scientific and commercial purposes.

B. Apparatus and recording setup

Audio was recorded with a Sennheiser EW 112-p G3 portable wireless microphone system (wav format, 32-bit, 48000 Hz). Microphones were placed on the participants' necklines. The general acoustic scene was captured by a stereo Zoom H5 recorder. A Microsoft Kinect RGB-D sensor v. 2 [19] was used to gather 3D skeleton data, RGB and depth videos (of both actors simultaneously). A green background was used to enhance motion capture and video quality. Close-up videos of each participant were recorded by means of two cameras (Canon HF G40 and Panasonic HC-V760). Photoplethysmogram (PPG) and electrodermal activity (EDA) were registered by the Consensys GSR Development Kit [20]. As a result, the RAMAS database comprises approximately 7 hours of synchronized multimodal information including audio, 3D skeleton data, RGB and depth video, EDA, and PPG. We used the SSI software to synchronize the streams [21].

III. POST-PROCESSING AND ANNOTATION

Statistical analysis was performed using R 3.4.2 (R Studio version 1.1.383).

A. Annotation

We asked 21 annotators (18-33 years old, 71% women) to evaluate emotions in the received video-audio fragments. Each annotator worked with 150 video fragments (except for two annotators who together annotated 150 fragments), and at least five annotators marked each fragment. We asked all the applicants to take the emotional intelligence test [22, 23], and only those who scored average or above average were picked to annotate the material. The Elan tool [24] from the Max Planck Institute for Psycholinguistics (the Netherlands) was used for emotion annotation. Oral and written instructions as well as templates of all the emotional states were provided. The task was to mark the beginning and the end of each emotion that seemed natural. The work of the annotators was monitored and paid for.

B. Inter-Annotator Agreement

We used Krippendorff's alpha statistic [25] to estimate the amount of inter-rater agreement in the RAMAS database. Alpha defines the two reliability scale points as 1.0 for perfect reliability and 0.0 for the absence of reliability, with alpha < 0 when disagreements are systematic and exceed what can be expected by chance. We chose Krippendorff's alpha because it is applicable to any number of coders, allows for reliability data with missing categories or scale points, and measures agreement for nominal, ordinal, interval, and ratio data [26]. The expression for the calculation of alpha is as follows:

alpha = 1 - D_o / D_e    (1)

where D_o is the observed disagreement and D_e is the disagreement expected by chance.

Results of annotators who labelled the same video fragments (150 in most cases) were grouped, and alpha was computed for each group for each of the 9 emotional, social and neutral scales. Elan allows annotations of variable length (i.e. the annotator determines the starting and ending point of each emotion). Subsequently, to compute reliability with Krippendorff's alpha we split all annotations into one-second intervals. The average alpha for the RAMAS dataset is 0.44. The mean and median statistics for each scale are presented in Table I.
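As a rough illustration of this agreement computation, the following R sketch discretizes annotations into one-second units and computes Krippendorff's alpha for one scale of one video with the irr package. The data frame `ann` and its columns (rater, video, scale, start, end, in seconds) are our own hypothetical naming; the paper does not specify the implementation.

```r
# Minimal sketch: per-scale Krippendorff's alpha over one-second intervals
library(irr)  # kripp.alpha()

alpha_for_scale <- function(ann, video_id, scale_name, duration) {
  secs <- seq_len(ceiling(duration))     # one-second grid over the video
  raters <- unique(ann$rater)
  # rater x second matrix: 1 if the rater marked this scale in that second, else 0
  m <- sapply(secs, function(s) {
    sapply(raters, function(r) {
      any(ann$rater == r & ann$video == video_id & ann$scale == scale_name &
            ann$start <= s & ann$end >= s)
    })
  }) * 1L
  kripp.alpha(m, method = "nominal")$value
}
```

Averaging such values over the videos shared by a group of annotators would yield per-scale statistics of the kind reported in Table I.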

TABLE I
MEAN AND MEDIAN KRIPPENDORFF'S ALPHA FOR EACH SCALE

Scale         Mean alpha   Median alpha
Disgusted     0.54         0.66
Happy         0.58         0.6
Angry         0.5          0.56
Scared        0.47         0.48
Domination    0.45         0.46
Submission    0.46         0.44
Surprised     0.41         0.38
Sad           0.35         0.31
Neutral       0.22         0.07

C. Self-report analysis

Real emotions of the actors were collected by means of self-reports they gave after the last take of each played scenario. The actors were asked to fill in short questionnaires and evaluate their state during the scenario (Angry, Sad, Disgusted, Happy, Scared, and Surprised) on a 5-point Likert scale (1 = did not experience the emotion, 5 = experienced it a lot). They also answered a question about the complexity of the played scenario (1 = very easy, 7 = very difficult). We analyzed the self-reports trying to answer the following questions: 1) Are there any differences between the emotions that actors experienced during all the scenarios? 2) Are there any differences in the complexity of the scenarios? If yes, which kinds of emotions (types of scenarios) were more difficult to play? 3) What are the relations between played and experienced emotions? Did actors really feel the same emotions they played?

Question #1: The differences between experienced emotions. First, we tested the diversity of emotions each actor experienced during the experiment. We wanted to find out which emotions they experienced more often regardless of the type of the emotion in the scenario. The answers from the self-reports were compared with pairwise Wilcoxon rank sum tests (see Table II and "Fig. 1"). There were no significant differences between experienced emotions. That means the actors experienced a balanced overall amount of all the emotions during all sessions. Since the number of the emotions in the scenarios was balanced, it also corresponds to the experienced emotions.

TABLE II
PAIRWISE COMPARISONS: WILCOXON RANK SUM TEST. P VALUE ADJUSTMENT METHOD: BH

            Angry   Disgusted   Happy   Sad    Scared
Disgusted   0.63
Happy       0.05    0.09
Sad         0.65    0.93        0.07
Scared      0.65    0.93        0.07    0.95
Surprised   0.07    0.19        0.65    0.18   0.17

Fig. 1. Bar and violin plots for self-reports of the experienced emotions in all the scenarios.

Then we analyzed how the dominating and submissive positions - assigned to each actor by the scenario - influenced their emotional state. A logistic regression model was constructed, with the self-report evaluations as dependent variables and the types of experienced emotions and the Domination/Submission type of scenario as predictors, and the comparisons of interest were tested using contrast/lsmeans. As a result we found no significant differences between dominative and submissive scenarios for each experienced emotion (see "Fig. 2").

Fig. 2. The influence of Domination and Submission conditions on the experienced emotions. Plot with mean values and bootstrap estimated confidence intervals (CI).

Question #2: Complexity of the scenarios. We analyzed what kinds of scenarios were more difficult to play according to the actors' self-reports. For this purpose a logistic regression model was constructed, with the complexity evaluations as dependent variables and the types of scenario as predictors, and the comparisons of interest were tested using contrast/lsmeans. Scenarios for disgust were more difficult than the scenarios for angriness (p<0.01) and happiness (p<0.05), and the scenarios for fear were more difficult than the scenarios for angriness (p<0.05), see Table III and "Fig. 3".

Then we studied the effect the domination-submission type had on the evaluated complexity of the scenario. A logistic regression model and contrast/lsmeans revealed that there were no differences between these types of scenarios.

Question #3: Relations between played and experienced emotions. First, we studied how the intensity of the experienced emotions depended on individual differences between the actors. We counted the sum of all emotion scores in each answer (intensity of emotions, which varies from 6 to 30 in each answer, M = 10.5, SD = 2.6 across all answers). Then a generalized linear model was created with the intensity as the response and the actors as predictors, and tested with a type II Anova. The model was significant (p<0.001), which meant that the intensity of the experienced emotions varied from actor to actor.
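A rough R sketch of the two self-report comparisons just described, assuming a long-format data frame `reports` with hypothetical columns score (the 1-5 rating), emotion, actor and answer_id (one questionnaire per played scenario); these names are ours, not from the paper.

```r
# Question #1: pairwise Wilcoxon rank sum tests on self-reported scores,
# with Benjamini-Hochberg adjustment (cf. Table II)
pairwise.wilcox.test(reports$score, reports$emotion, p.adjust.method = "BH")

# Question #3, first step: does the overall intensity of experienced emotions
# (the sum of the six scores in a questionnaire) depend on the actor?
library(car)  # Anova() with type II tests
intensity <- aggregate(score ~ answer_id + actor, data = reports, FUN = sum)
fit <- glm(score ~ actor, data = intensity)
Anova(fit, type = 2)
```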

TABLE III
EVALUATED COMPLEXITY OF THE SCENARIOS ACCORDING TO THE TYPE OF EMOTION. P VALUE ADJUSTMENT: TUKEY METHOD FOR COMPARING A FAMILY OF 6 ESTIMATES

Contrast: Emotion in scenario   Estimate   SE      z.ratio   p.value
Angry - Disgusted*              -0.994     0.263   -3.773    0.002
Angry - Happy                   -0.148     0.246   -0.601    0.991
Angry - Sad                     -0.497     0.25    -1.984    0.351
Angry - Scared*                 -0.814     0.249   -3.264    0.014
Angry - Surprised               -0.341     0.249   -1.365    0.748
Disgusted - Happy*               0.846     0.265    3.191    0.018
Disgusted - Sad                  0.497     0.269    1.85     0.434
Disgusted - Scared               0.179     0.268    0.67     0.985
Disgusted - Surprised            0.653     0.268    2.436    0.144
Happy - Sad                     -0.349     0.252   -1.384    0.737
Happy - Scared                  -0.666     0.251   -2.652    0.085
Happy - Surprised               -0.193     0.251   -0.767    0.973
Sad - Scared                    -0.317     0.255   -1.243    0.815
Sad - Surprised                  0.156     0.256    0.611    0.99
Scared - Surprised               0.474     0.255    1.86     0.427

* Significant p-values (p<0.05).

Fig. 3. Bar and violin plots for the evaluated complexity of the scenarios.
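Contrasts of the kind shown in Table III could plausibly be obtained by fitting an ordinal (proportional odds) model of the complexity ratings and requesting Tukey-adjusted pairwise comparisons. A hedged sketch, again assuming a data frame `reports` with hypothetical columns complexity and scenario_emotion; the paper only states that a logistic regression model and contrast/lsmeans were used.

```r
library(MASS)     # polr(): proportional odds logistic regression
library(lsmeans)  # lsmeans() / pairwise contrasts

# Ordinal model of the rated complexity as a function of the scenario emotion
reports$complexity <- ordered(reports$complexity)
mod <- polr(complexity ~ scenario_emotion, data = reports, Hess = TRUE)

# Tukey-adjusted pairwise contrasts between scenario emotions (cf. Table III)
lsmeans(mod, pairwise ~ scenario_emotion, adjust = "tukey")
```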

Since emotion intensity depended on the actors, we normalized their evaluations for further analysis: we divided the score of each emotion by the sum of all emotion scores in the answer. The normalized scores of the experienced emotions were compared to the emotions that the actors had to play according to the scenarios. We analyzed the relation between played and experienced emotions using proportional odds logistic regression with the normalized score from the questionnaire as the response and a logical variable (reflecting whether the played emotion matched the felt emotion) as the predictor. A type II Anova revealed that this predictor was significant, that is, the actors tended to evaluate the emotion they had just played as their most intense feeling. In other words, the actors experienced the same emotions they had played out in the scenario (see "Fig. 4").

Fig. 4. Played and experienced emotions. False: the emotion in the scenario did not match the emotion in the self-report; true: they coincided. Plot with mean values and bootstrap estimated confidence intervals (CI).

The properties of the database are summarized in Table IV. The RAMAS database is a novel contribution to the affective computing area since it contains multimodal recordings of the six basic emotions and two social categories in the Russian language.

TABLE IV
BASIC PROPERTIES OF THE RAMAS DATABASE

Number of videos                               581
General length of database, minutes/hours      395 / 6.6
Number of actors                               10 (5 men, 5 women)
Age of actors                                  18-28 years
Language of videos                             Russian
Video length, seconds (min/max/average)        9/96/41
Number of annotators                           21 (6 men, 15 women)
Inter-rater agreement (Krippendorff's alpha)   0.44

Expression                          Written scripts   Number of videos   Length, minutes
Emotions
Disgusted                           4                 80                 56.6
Happy                               4                 64                 42.2
Angry                               5                 62                 45.5
Scared                              5                 94                 65.8
Surprised                           5                 70                 44.6
Sad                                 5                 84                 63.8
Neutral (speaker + listener)        6 + 6             63 + 64            37.8 + 38.6
Social scales
Domination (emotional + neutral)    14 + 6            227 + 63           158.8 + 37.8
Submission (emotional + neutral)    14 + 6            227 + 64           158.8 + 38.6
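Returning to the matched-emotion comparison described in Question #3, a possible R sketch is given below. The column names (played_emotion for the scenario emotion, emotion and score for the self-report) are hypothetical, and treating the discrete normalized scores as ordered categories for polr() is our simplification, not a detail stated in the paper.

```r
# Normalize each rating by the total intensity of its questionnaire,
# flag whether the rated emotion matches the emotion played in the scenario,
# and fit a proportional odds model with that flag as the only predictor.
library(MASS)  # polr()
library(car)   # Anova()

reports$norm_score <- reports$score /
  ave(reports$score, reports$answer_id, FUN = sum)
reports$matched <- reports$emotion == reports$played_emotion

# polr() needs an ordered response, so the (discrete) normalized scores
# are treated as ordered categories here
fit <- polr(ordered(norm_score) ~ matched, data = reports, Hess = TRUE)
Anova(fit, type = 2)
```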

IV. DISCUSSION

We collected the first emotional multimodal database in the Russian language. Semi-professional actors played out prepared scenarios and expressed basic emotions (Anger, Sadness, Disgust, Happiness, Fear and Surprise), as well as two social interaction characteristics, Domination and Submission. Audio, close-up and whole-scene videos, motion capture and physiology data were collected. Twenty-one annotators marked emotions in the received videos; at least five annotators marked each video.

The further analysis of the annotations revealed that the RAMAS database has moderate inter-rater agreement (Krippendorff's alpha = 0.44). Among all the scales except for the neutral condition, the smallest inter-rater agreement was observed for sadness (0.35), while the largest agreement was observed for happiness (0.58). After playing out each scenario all the actors answered several short questions about the emotions they experienced. Analysis of these answers revealed that the actors experienced a balanced overall amount of all emotions during all sessions. At the same time, no significant differences were found between dominating and submissive scenarios in terms of their impact on the experienced emotions. That is to say, the social positions the actors were assigned in each scenario affected neither the type nor the intensity of the emotions they felt. Dominance is considered to be one of the basic scales in social relations [27]. We assumed that the emotions of the actor who had control of the situation and of the other one in the submissive position might be affected by this social scale. However, we have not found significant effects. This could be because the actors were close in age and some of them had known each other. The other assumption was that this scale was not perfectly considered in the scenarios and in some cases could be ambiguous. As we did not use exact scripts, the effect of dominance-submission may be masked by the improvisation of the actors.

The analysis of the question about the complexity of the played emotion revealed that the actors had more difficulties with the scenarios for disgust compared to the scenarios for angriness and happiness, and the scenarios for fear were more difficult for them than the scenarios for angriness. There were no differences between the complexity of playing dominative vs. submissive types of scenarios. Presumably, experience and natural expression of fear and disgust are more rooted in the real context of the situation in comparison with other emotions, because the function of those affects is avoiding threats here and now [28, 29, 30]. For example, it is easier to trigger natural anger than fear by memories.

The comparison of played and experienced emotions revealed that, according to their self-reports, the actors experienced the same emotions that they played during the scenario. Due to this fact RAMAS could be considered quite a naturalistic database. In our future studies we are going to analyze and compare different methods and classification algorithms for the RAMAS database. Preliminary research has already been conducted; we extracted several features from each channel: 1) For audio data two types of features were estimated: the Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing [31], and 13 MFCC coefficients including the 0th. 2) Motion feature extraction was performed using R 3.4.0 and EyesWeb XMI 5.7.0.0 [32]. The features were computed from the 3D coordinates of the body joints captured with the Kinect v. 2. They included distances between the joints; joint velocities, accelerations, and jerks; smoothness, curvature, and density indexes; symmetry of the hand joints; and kinetic energy [33]. 3) Eye-based features were extracted from the close-up videos and include X- and Y-axis offsets of the left and right eyes, and boolean flags for open eyes and fixation. 4) Facial (mimic) features were extracted from each video frame using the VGG16 neural network, and their dimensionality was reduced with principal component analysis. The described data are used for the Multimodal Emotion Recognition Challenge [34], a machine learning competition on multimodal emotion recognition with missing data. The main goal of this challenge is to find approaches for reliable recognition of emotional behavior when some data are unavailable.
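To make point 2) more concrete, the sketch below shows the simplest pieces of such motion features, inter-joint distances and frame-to-frame derivatives, computed in R from Kinect 3D coordinates. The data frame `joints` and its column names are assumptions of ours; the full feature set (smoothness, curvature, density, symmetry, kinetic energy) is computed with EyesWeb and is not reproduced here.

```r
# Simplified motion-feature sketch, assuming a data frame `joints` with
# hypothetical columns: time (seconds) and x/y/z coordinates of two joints,
# e.g. hand_left_* and hand_right_*, one row per Kinect frame.
dist3d <- function(x1, y1, z1, x2, y2, z2) sqrt((x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2)

# Distance between the two hand joints in every frame
joints$hand_dist <- with(joints, dist3d(hand_left_x,  hand_left_y,  hand_left_z,
                                        hand_right_x, hand_right_y, hand_right_z))

# Frame-to-frame derivatives of one coordinate: velocity, acceleration, jerk
dt <- diff(joints$time)
vel  <- diff(joints$hand_left_x) / dt
acc  <- diff(vel) / dt[-1]
jerk <- diff(acc) / dt[-c(1, 2)]
```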
V. CONCLUSION

RAMAS is a unique play-acted multimodal corpus in the Russian language. The database is open and provides the research community with multimodal data of face, speech, gesture and physiology modalities. Such material is useful for various investigations and automatic affective systems development. It can also be applicable for psychological, psychophysiological and linguistic studies. Access to the database is available under the link [35].

VI. ACKNOWLEDGMENT

The authors would like to thank Elena Arkova for finding the actors and helping with the scenarios and experimental procedure, and Irina Vetrova for evaluating the emotional intelligence of the annotators with the MSCEIT v 2.0 test.

REFERENCES

[1] Hatice Gunes. Automatic, dimensional and continuous emotion recognition. 2010.
[2] Michelle Karg, Ali-Akbar Samadani, Rob Gorbet, Kolja Kühnlenz, Jesse Hoey, and Dana Kulić. Body movements for affective expression: A survey of automatic recognition and generation. IEEE Transactions on Affective Computing, 4(4):341–359, 2013.
[3] Paul Ekman. Are there basic emotions? 1992.
[4] Thim Vai Chaw, Siak Wang Khor, and Phooi Yee Lau. Facial expression recognition using correlation of eyes regions. In The FICT Colloquium 2016 (December 2016), page 34, 2016.
[5] Uğur Ayvaz, Hüseyin Gürüler, and Mehmet Osman Devrim. Use of facial emotion recognition in e-learning systems. Information Technologies and Learning Tools, 60(4):95–104, 2017.
[6] Paweł Tarnowski, Marcin Kołodziej, Andrzej Majkowski, and Remigiusz J Rak. Emotion recognition using facial expressions. Procedia Computer Science, 108:1175–1184, 2017.
[7] Liyanage C De Silva, Tsutomu Miyasato, and Ryohei Nakatsu. Facial emotion recognition using multi-modal information. In Information, Communications and Signal Processing, 1997. ICICS., Proceedings of 1997 International Conference on, volume 1, pages 397–401. IEEE, 1997. ISBN 0780336763.

[8] Moataz El Ayadi, Mohamed S Kamel, and Fakhri Karray. Survey on speech emotion recognition: Features, classification schemes, and databases. Pattern Recognition, 44(3):572–587, 2011.
[9] Adam Anderson, Thomas Hsiao, and Vangelis Metsis. Classification of emotional arousal during multimedia exposure. In Proceedings of the 10th International Conference on PErvasive Technologies Related to Assistive Environments, pages 181–184. ACM, 2017. ISBN 1450352278.
[10] Khadidja Gouizi, F Bereksi Reguig, and C Maaoui. Emotion recognition from physiological signals. Journal of Medical Engineering & Technology, 35(6-7):300–307, 2011.
[11] Angeliki Metallinou, Chi-Chun Lee, Carlos Busso, Sharon Carnicke, and Shrikanth Narayanan. The USC CreativeIT database: A multimodal database of theatrical improvisation. Multimodal Corpora: Advances in Capturing, Coding and Analyzing Multimodality, page 55, 2010.
[12] Carlos Busso, Murtaza Bulut, Chi-Chun Lee, Abe Kazemzadeh, Emily Mower, Samuel Kim, Jeannette N Chang, Sungbok Lee, and Shrikanth S Narayanan. IEMOCAP: Interactive emotional dyadic motion capture database. Language Resources and Evaluation, 42(4):335, 2008.
[13] Ekaterina Volkova, Stephan De La Rosa, Heinrich H Bülthoff, and Betty Mohler. The MPI emotional body expressions database for narrative scenarios. PloS ONE, 9(12):e113647, 2014.
[14] Fabien Ringeval, Andreas Sonderegger, Juergen Sauer, and Denis Lalanne. Introducing the RECOLA multimodal corpus of remote collaborative and affective interactions. In Automatic Face and Gesture Recognition (FG), 2013 10th IEEE International Conference and Workshops on, pages 1–8. IEEE, 2013. ISBN 1467355461.
[15] Tanja Bänziger, Hannes Pirker, and K Scherer. GEMEP - GEneva Multimodal Emotion Portrayals: A corpus for the study of multimodal emotional expressions. In Proceedings of LREC, volume 6, pages 15–19, 2006.
[16] Neurodata Lab LLC. URL http://www.neurodatalab.com.
[17] James A Russell and José Miguel Fernández-Dols. The Psychology of Facial Expression. Cambridge University Press, 1997. ISBN 0521587964.
[18] RAMAS scripts. URL http://neurodatalab.com/upload/technologies files/scenarios/Scripts RAMAS.pdf.
[19] Kinect v.2. URL http://www.microsoft.com/en-us/kinectforwindows/.
[20] Consensys GSR Development Kit. URL http://www.shimmersensing.com/products/gsr-optical-pulse-development-kit.
[21] Johannes Wagner, Florian Lingenfelser, Tobias Baur, Ionut Damian, Felix Kistler, and Elisabeth André. The social signal interpretation (SSI) framework: multimodal signal processing and recognition in real-time. In Proceedings of the 21st ACM International Conference on Multimedia, pages 831–834. ACM, 2013. ISBN 1450324045.
[22] John D Mayer, Peter Salovey, David R Caruso, and Gill Sitarenios. Measuring emotional intelligence with the MSCEIT V2.0. Emotion, 3(1):97, 2003.
[23] E G Sergienko, I I Vetrova, A A Volochkov, and A Y Popov. Adaptation of J. Mayer, P. Salovey and D. Caruso emotional intelligence test on a Russian-speaking sample. 31(1):55–73, 2010.
[24] Han Sloetjes and Peter Wittenburg. Annotation by Category: ELAN and ISO DCR. In LREC, 2008.
[25] Klaus Krippendorff. Estimating the reliability, systematic error and random error of interval data. Educational and Psychological Measurement, 30(1):61–70, 1970.
[26] Andrew F Hayes and Klaus Krippendorff. Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1):77–89, 2007.
[27] Judee K Burgoon and Norah E Dunbar. Nonverbal expressions of dominance and power in human relationships. The Sage Handbook of Nonverbal Communication. Sage, 2, 2006.
[28] M Douglas. Purity and Danger: An Analysis of Pollution and Taboo. London, 1966.
[29] Stanley Rachman. Anxiety. East Sussex, UK: Psychology Press Ltd., Publishers, 1998.
[30] Silvan Tomkins. Affect Imagery Consciousness: Volume II: The Negative Affects. Springer Publishing Company, 1963. ISBN 0826104436.
[31] Florian Eyben, Klaus R Scherer, Björn W Schuller, Johan Sundberg, Elisabeth André, Carlos Busso, Laurence Y Devillers, Julien Epps, Petri Laukka, and Shrikanth S Narayanan. The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing. IEEE Transactions on Affective Computing, 7(2):190–202, 2016.
[32] EyesWeb XMI 5.7.0.0. URL http://www.infomus.org/eyesweb ita.php.
[33] Antonio Camurri, Paolo Coletta, Barbara Mazzarino, and Stefano Piana. Implementation of EyesWeb low-level. 2012.
[34] Multimodal Emotion Recognition Challenge. URL http://www.datacombats.com/.
[35] RAMAS database, 2016. URL http://neurodatalab.com/en/projects/RAMAS/.
