
Context Based Emotion Recognition using EMOTIC Dataset

Ronak Kosti, Jose M. Alvarez, Adria Recasens, Agata Lapedriza

Abstract—In our everyday lives and social interactions we often try to perceive the emotional states of people. There has been a lot of research in providing machines with a similar capacity of recognizing emotions. From a computer vision perspective, most of the previous efforts have focused on analyzing facial expressions and, in some cases, also the body pose. Some of these methods work remarkably well in specific settings. However, their performance is limited in natural, unconstrained environments. Psychological studies show that the scene context, in addition to facial expression and body pose, provides important information to our perception of people's emotions. However, the processing of the context for automatic emotion recognition has not been explored in depth, partly due to the lack of proper data. In this paper we present EMOTIC, a dataset of images of people in a diverse set of natural situations, annotated with their apparent emotion. The EMOTIC dataset combines two different types of emotion representation: (1) a set of 26 discrete categories, and (2) the continuous dimensions Valence, Arousal, and Dominance. We also present a detailed statistical and algorithmic analysis of the dataset along with an analysis of the annotators' agreement. Using the EMOTIC dataset we train different CNN models for emotion recognition, combining the information of the bounding box containing the person with the contextual information extracted from the scene. Our results show how scene context provides important information to automatically recognize emotional states and motivate further research in this direction. The dataset and code are open-sourced and available at https://github.com/rkosti/emotic. Link to the published article: https://ieeexplore.ieee.org/document/8713881.

Index Terms—Emotion recognition, Pattern recognition

1 INTRODUCTION

Over the past years, the interest in developing automatic systems for recognizing emotional states has grown rapidly. We can find several recent works showing how emotions can be inferred from cues like text [1], voice [2], or visual information [3], [4]. The automatic recognition of emotions has a lot of applications in environments where machines need to interact with or monitor humans. For instance, automatic tutors in an online learning platform could provide better feedback to a student according to her level of motivation or frustration. Also, a car with the capacity of assisting a driver can intervene or give an alarm if it detects that the driver is tired or nervous.

In this paper we focus on the problem of emotion recognition from visual information. Concretely, we want to recognize the apparent emotional state of a person in a given image. This problem has been broadly studied in computer vision mainly from two perspectives: (1) facial expression analysis, and (2) body posture and gesture analysis. Section 2 gives an overview of related work on these perspectives and also on some of the common public datasets for emotion recognition.

Fig. 1: How is this kid feeling? Try to recognize his emotional state from the person bounding box, without scene context.

Although the face and the body pose give a lot of information on the affective state of a person, our claim in this work is that scene context information is also a key component for understanding emotional states. Scene context includes the surroundings of the person, like the place category, the place attributes, the objects, or the actions occurring around the person. Fig. 1 illustrates the importance of scene context for emotion recognition. When we just see the kid it is difficult to recognize his emotion from his facial expression alone. However, when we see the context (Fig. 2.a) we see the kid is celebrating his birthday, blowing the candles, probably with his family or friends at home. With this additional information we can interpret much better his face and posture and recognize that he probably feels engaged, happy and excited.

The importance of context in emotion perception is well supported by different studies in psychology [5], [6]. In general situations, facial expression is not sufficient to determine the emotional state of a person, since the perception of the emotion is heavily influenced by different types of context, including the scene context [2], [3], [4].

In this work, we present two main contributions. Our first contribution is the creation and publication of the EMOTIC (from EMOTions In Context) dataset. The EMOTIC database is a collection of images of people annotated according to their apparent emotional states. Images in EMOTIC are spontaneous and unconstrained, showing people doing different things in different environments. Fig. 2 shows some examples of images in the EMOTIC database along with their corresponding annotations. As shown, annotations combine 2 different types of emotion representation: Discrete Emotion Categories and 3 Continuous Emotion Dimensions (Valence, Arousal, and Dominance [7]). The EMOTIC dataset is now publicly available for download at the EMOTIC website (http://sunai.uoc.edu/emotic/). Details of the dataset construction process and dataset statistics can be found in section 3.

Fig. 2: Sample images in the EMOTIC dataset along with their annotations.

Our second contribution is the creation of a baseline system for the task of emotion recognition in context. In particular, we present and test a Convolutional Neural Network (CNN) model that jointly processes the window of the person and the whole image to predict the apparent emotional state of the person. Section 4 describes the CNN model and the implementation details, while section 5 presents our experiments and discussion of the results. All the trained models resulting from this work are also publicly available at the EMOTIC website.

This paper is an extension of the conference paper "Emotion Recognition in Context", presented at the IEEE International Conference on Computer Vision and Pattern Recognition (CVPR) 2017 [8]. We present here an extended version of the EMOTIC dataset, with further statistical dataset analysis, an analysis of scene-centric algorithms on the data, and a study on the annotation consistency among different annotators. This new release of the EMOTIC database contains 44.4% more annotated people compared to its previous, smaller version. With the new extended dataset we retrained all the proposed baseline CNN models with additional loss functions. We also present a comparative analysis of two different scene context features, showing how the context contributes to recognizing emotions in the wild.

• Ronak Kosti & Agata Lapedriza are with Universitat Oberta de Catalunya, Spain. Email: [email protected], [email protected].
• Adria Recasens is with the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, USA. Email: [email protected].
• Jose M. Alvarez is with NVIDIA, USA. Email: [email protected].
• Project Page: http://sunai.uoc.edu/emotic/

2 RELATED WORK

Emotion recognition has been broadly studied by the Computer Vision community. Most of the existing work has focused on the analysis of facial expression to predict emotions [9], [10]. The base of these methods is the Facial Action Coding System [11], which encodes the facial expression using a set of specific localized movements of the face, called Action Units. These facial-based approaches [9], [10] usually use facial-geometry based features or appearance features to describe the face. Afterwards, the extracted features are used to recognize Action Units and the basic emotions proposed by Ekman and Friesen [12]: anger, disgust, fear, happiness, sadness, and surprise. Currently, state-of-the-art systems for emotion recognition from facial expression analysis use CNNs to recognize emotions or Action Units [13].

In terms of emotion representation, some recent works based on facial expression [14] use the continuous dimensions of the VAD Emotional State Model [7]. The VAD model describes emotions using 3 numerical dimensions: Valence (V), which measures how positive or pleasant an emotion is, ranging from negative to positive; Arousal (A), which measures the agitation level of the person, ranging from non-active / calm to agitated / ready to act; and Dominance (D), which measures the level of control a person feels of the situation, ranging from submissive / non-control to dominant / in-control. On the other hand, Du et al. [15] proposed a set of 21 facial emotion categories, defined as different combinations of the basic emotions, like 'happily surprised' or 'happily disgusted'. With this categorization the authors can give fine-grained detail about the expressed emotion.

Although the research in emotion recognition from a computer vision perspective is mainly focused on the analysis of the face, there are some works that also consider other additional visual cues or multimodal approaches. For instance, in [16] the location of the shoulders is used as additional information to the face features to recognize basic emotions. More generally, Schindler et al. [17] used the body pose to recognize 6 basic emotions, performing experiments on a small dataset of non-spontaneous poses acquired under controlled conditions. Mou et al. [18] presented a system for affect analysis in still images of groups of people, recognizing group-level arousal and valence from combining face, body and contextual information.

Emotion Recognition in Scene Context and Image Sentiment Analysis are different problems that share some characteristics. Emotion Recognition aims to identify the emotions of a person depicted in an image. Image Sentiment Analysis consists of predicting what a person will feel when observing a picture. This picture does not necessarily contain a person. When an image contains a person, there can be a difference between the emotions experienced by the person in the image and the emotions felt by observers of the image. For example, in the image of Figure 2.b, we see a kid who seems to be annoyed for having an apple instead of chocolate and another who seems happy to have chocolate. However, as observers, we might not have any of those sentiments when looking at the photo. Instead, we might think the situation is not fair and feel disapproval. Also, if we see an image of an athlete that has lost a match, we can recognize that the athlete feels sad. However, an observer of the image may feel happy if the observer is a fan of the team that won the match.

2.1 Emotion Recognition Datasets

Most of the existing datasets for emotion recognition using computer vision are centered on facial expression analysis. For example, the GENKI database [19] contains frontal face images of a single person with different illumination, geographic, personal and ethnic settings. Images in this dataset are labelled as smiling or non-smiling. Another common facial expression analysis dataset is the ICML Face-Expression Recognition dataset [20], which contains 28,000 images annotated with the 6 basic emotions and a neutral category. On the other hand, the UCDSEE dataset [21] has a set of 9 emotion expressions acted by 4 persons. The lab setting is strictly kept the same in order to focus mainly on the facial expression of the person.

The dynamic body movement is also an essential source for estimating emotion. Studies such as [22], [23] establish the relationship between affect and body posture using as ground truth the base-rate of human observers. The data consist of a spontaneous set of images acquired under a restrictive setting (people playing Wii games). The GEMEP database [24] is multi-modal (audio and video) and has 10 actors playing 18 affective states. The dataset has videos of actors showing emotions through acting. Body pose and facial expression are combined.

The Looking at People (LAP) challenges and competitions [25] involve specialized datasets containing images, sequences of images and multi-modal data. The main focus of these datasets is the complexity and variability of human body configuration, which includes data related to personality traits (spontaneous), gesture recognition (acted), apparent age recognition (spontaneous), cultural event recognition (spontaneous), action/interaction recognition and human pose recognition (spontaneous).

The Emotion Recognition in the Wild (EmotiW) challenges [26] host 3 databases: (1) the AFEW database [27], which focuses on emotion recognition from video frames taken from movies and TV shows, where the actions are annotated with attributes like name, age of actor, age of character, pose, gender, expression of person, the overall clip expression, and the basic 6 emotions plus a neutral category; (2) the SFEW, which is a subset of the AFEW database containing images of face-frames annotated specifically with the 6 basic emotions and a neutral category; and (3) the HAPPEI database [28], which addresses the problem of group-level emotion estimation. Thus, [28] offers a first attempt to use context for the problem of predicting happiness in groups of people.

Finally, the COCO dataset has been recently annotated with object attributes [29], including some emotion categories for people, such as happy and curious. These attributes show some overlap with the categories that we define in this paper. However, COCO attributes are not intended to be exhaustive for emotion recognition, and not all the people in the dataset are annotated with affect attributes.

3 EMOTIC DATASET

The EMOTIC dataset is a collection of images of people in unconstrained environments annotated according to their apparent emotional states. The dataset contains 23,571 images and 34,320 annotated people. Some of the images were manually collected from the Internet with the Google search engine. For that we used a combination of queries containing various places, social environments, different activities and a variety of keywords on emotional states. The rest of the images belong to 2 public benchmark datasets: COCO [30] and Ade20k [31]. Overall, the images show a wide diversity of contexts, containing people in different places, social settings, and doing different activities.

Fig. 2 shows three examples of annotated images in the EMOTIC dataset. Images were annotated using Amazon Mechanical Turk (AMT). Annotators were asked to label each image according to what they think the people in the images are feeling. Notice that we have the capacity of making reasonable guesses about other people's emotional state due to our capacity of being empathetic, putting ourselves into another's situation, and also because of our common sense knowledge and our ability for reasoning about visual information. For example, in Fig. 2.b, the person is performing an activity that requires Anticipation to adapt to the trajectory. Since he is doing a thrilling activity, he seems excited about it and he is engaged or focused in this activity. In Fig. 2.c, the kid feels a strong desire (yearning) for eating the chocolate instead of the apple. Because of his situation we can interpret his facial expression as disquietment and annoyance. Notice that images are also annotated according to the continuous dimensions Valence, Arousal, and Dominance. We describe the emotion annotation modalities of the EMOTIC dataset and the annotation process in sections 3.1 and 3.2, respectively.

After the first round of annotations (1 annotator per image), we divided the images into three sets: Training (70%), Validation (10%), and Testing (20%), maintaining a similar affective category distribution across the different sets. After that, Validation and Testing were annotated by 4 and 2 extra annotators respectively. As a consequence, images in the Validation set are annotated by a total of 5 annotators, while images in the Testing set are annotated by 3 annotators (these numbers can slightly vary for some images since we removed noisy annotations).

We used the annotations from the Validation set to study the consistency of the annotations across different annotators. This study is shown in section 3.3. The data statistics and algorithmic analysis of the EMOTIC dataset are detailed in sections 3.4 and 3.5 respectively.

3.1 Emotion representation

The EMOTIC dataset combines two different types of emotion representation:

Continuous Dimensions: images are annotated according to the VAD model [7], which represents emotions by a combination of 3 continuous dimensions: Valence, Arousal and Dominance. In our representation each dimension takes an integer value that lies in the range [1-10]. Fig. 4 shows examples of people annotated with different values of the given dimensions.

Fig. 3: Examples of annotated people in EMOTIC dataset for each of the 26 emotion categories (Table 1). The person in the red bounding box is annotated by the corresponding category.

Fig. 4: Examples of annotated images in the EMOTIC dataset for each of the 3 continuous dimensions Valence, Arousal & Dominance. The person in the red bounding box has the corresponding value of the given dimension.

Emotion Categories: in addition to VAD we also established a list of 26 emotion categories that represent various states of emotion. The list of the 26 emotional categories and their corresponding definitions can be found in Table 1. Also, Fig. 3 shows (per category) examples of people showing different emotional categories.

The list of emotion categories has been created as follows. We manually collected an affective vocabulary from books on psychology [32], [33], [34], [35]. This vocabulary consists of a list of approximately 400 words representing a wide variety of emotional states. After a careful study of the definitions and the similarities amongst these definitions, we formed clusters of words with similar meanings. The clusters were formalized into 26 categories such that they were distinguishable in a single image of a person with her context. We created the final list of 26 emotion categories taking into account a Visual Separability criterion: words that have a close meaning may not be visually separable. For instance, Anger is defined by the words rage, furious and resentful. These affective states are different, but it is not always possible to separate them visually in a single image. Thus, our list of affective categories can be seen as a first level of a hierarchy, where each category has associated subcategories.

Notice that the final list of affective categories also includes the 6 basic emotions (categories 2, 5, 16, 17, 21, 24), but we used the more general term Aversion for the category Disgust. Thus, the category Aversion also includes the subcategories dislike, repulsion, and hate apart from disgust.

TABLE 1: Proposed emotion categories with definitions.
1. Affection: fond feelings; love; tenderness
2. Anger: intense displeasure or rage; furious; resentful
3. Annoyance: bothered by something or someone; irritated; impatient; frustrated
4. Anticipation: state of looking forward; hoping on or getting prepared for possible future events
5. Aversion: feeling disgust, dislike, repulsion; feeling hate
6. Confidence: feeling of being certain; conviction that an outcome will be favorable; encouraged; proud
7. Disapproval: feeling that something is wrong or reprehensible; contempt; hostile
8. Disconnection: feeling not interested in the main event of the surrounding; indifferent; bored; distracted
9. Disquietment: nervous; worried; upset; anxious; tense; pressured; alarmed
10. Doubt/Confusion: difficulty to understand or decide; thinking about different options
11. Embarrassment: feeling ashamed or guilty
12. Engagement: paying attention to something; absorbed into something; curious; interested
13. Esteem: feelings of favourable opinion or judgement; respect; admiration; gratefulness
14. Excitement: feeling enthusiasm; stimulated; energetic
15. Fatigue: weariness; tiredness; sleepy
16. Fear: feeling suspicious or afraid of danger, threat, evil or pain; horror
17. Happiness: feeling delighted; feeling enjoyment or amusement
18. Pain: physical suffering
19. Peace: well being and relaxed; no worry; having positive thoughts or sensations; satisfied
20. Pleasure: feeling of delight in the senses
21. Sadness: feeling unhappy, sorrow, disappointed, or discouraged
22. Sensitivity: feeling of being physically or emotionally wounded; feeling delicate or vulnerable
23. Suffering: psychological or emotional pain; distressed; anguished
24. Surprise: sudden discovery of something unexpected
25. Sympathy: state of sharing others' emotions, goals or troubles; supportive; compassionate
26. Yearning: strong desire to have something; jealous; envious

3.2 Collecting Annotations

We used the Amazon Mechanical Turk (AMT) crowd-sourcing platform to collect the annotations of the EMOTIC dataset. We designed two Human Intelligence Tasks (HITs), one for each of the 2 formats of emotion representation. The two annotation interfaces are shown in Fig. 5. Each annotator is shown a person-in-context enclosed in a red bounding box along with the annotation format next to it. Fig. 5.a shows the interface for discrete category annotation while Fig. 5.b displays the interface for continuous dimension annotation. Notice that, in the last box of the continuous dimension interface, we also ask AMT workers to annotate the gender and estimate the age (range) of the person enclosed in the red bounding box. The design of the annotation interface has two main focuses: i) the task is easy to understand, and ii) the interface fits the HIT in one screen, which avoids scrolling.

Fig. 5: AMT interface designs. (a) For Discrete Categories' annotations & (b) for Continuous Dimensions' annotations.

TABLE 2: Instruction summary for each HIT. Emotion Category: "Consider each emotion category separately and, if it is applicable to the person in the given context, select that emotion category". Continuous Dimension: "Consider each emotion dimension separately, observe what level is applicable to the person in the given context, and select that level".

To make sure annotators understand the task, we showed them how to annotate the images step-wise, by explaining two examples in detail. Also, instructions and examples were attached at the bottom of each page as a quick reference for the annotator. Finally, a summary of the detailed instructions was shown at the top of each page (Table 2).

We adopted two strategies to avoid noisy annotations in the EMOTIC dataset. First, we conducted a qualification task for annotator candidates. This qualification task has two parts: (i) an Emotional Quotient HIT (based on the standard EQ task [36]) and (ii) 2 sample image annotation tasks, one for each of our 2 emotion representations (discrete categories and continuous dimensions). For the sample annotations, we had a set of acceptable labels. The responses of the annotator candidates to this qualification task were evaluated, and those who responded satisfactorily were allowed to annotate the images from the EMOTIC dataset. The second strategy to avoid noisy annotations was to insert, randomly, 2 control images in every annotation batch of 20 images; the correct assortment of labels for the control images was known beforehand. Annotators selecting incorrect labels on these control images were not allowed to annotate further and their annotations were discarded.

3.3 Agreement Level Among Different Annotators

Since emotion perception is a subjective task, different people can perceive different emotions after seeing the same image. For example, in both Fig. 6.a and 6.b, the person in the red box seems to feel Affection, Happiness and Pleasure, and the annotators have annotated these categories with consistency. However, not everyone has selected all these emotions. Also, we see that annotators do not agree on the emotions Excitement and Engagement. We consider, however, that these categories are reasonable in this situation. Another example is that of Roger Federer hitting a tennis ball in Fig. 6.c. He is seen predicting the ball (or Anticipating) and clearly looks Engaged in the activity. He also seems Confident in getting the ball.

After these observations we conducted different quantitative analyses on the annotation agreement. We focused first on analyzing the agreement level in the category annotation. Given a category assigned to a person in an image, we consider as an agreement measure the number of annotators agreeing on that particular category. Accordingly, we calculated, for each category and for each annotation in the validation set, the agreement amongst the annotators and sorted those values across categories. Fig. 7 shows the distribution of the percentage of annotators agreeing on an annotated category across the validation set.

We also computed the agreement between all the annotators for a given person using Fleiss' Kappa (κ). Fleiss' Kappa is a common measure to evaluate the agreement level among a fixed number of annotators when assigning categories to data. In our case, given a person to annotate, there is a subset of 26 categories.

Fig. 6: Annotations of five different annotators for 3 images in EMOTIC.

Fig. 7: Representation of agreement between multiple annotators. Categories are sorted in decreasing order according to the average number of annotators who agreed on that category.

Fig. 8: (a) Kappa values (sorted) and (b) standard deviation (sorted), for each annotated person in the validation set.

Fig. 9: Dataset statistics. (a) Number of people annotated for each emotion category; (b), (c) & (d) number of people annotated for every value of the three continuous dimensions, viz. Valence, Arousal & Dominance.

If we have N annotators per image, that means that each of the 26 categories can be selected by n annotators, where 0 ≤ n ≤ N. Given an image, we compute the Fleiss' Kappa for each emotion category first, and then the general agreement level on this image is computed as the average of these Fleiss' Kappa values across the different emotion categories. We obtained that more than 50% of the images have κ > 0.30. Fig. 8.a shows the distribution of kappa values for all the annotated people in the validation set, sorted in decreasing order. Random annotations or total disagreement produce κ ∼ 0; in our case, however, κ ∼ 0.3 on average, suggesting a significant agreement level even though the task of emotion recognition is subjective. A sketch of this per-person agreement computation is given below.

For continuous dimensions, the agreement is measured by the standard deviation (SD) of the different annotations. The average SD across the Validation set is 1.04, 1.57 and 1.84 for Valence, Arousal and Dominance respectively, indicating that Dominance has a higher (±1.84) dispersion than the other dimensions. This reflects that annotators disagree more often on Dominance than on the other dimensions, which is understandable since Dominance is more difficult to interpret than Valence or Arousal [7]. As a summary, Fig. 8.b shows the standard deviations of all the images in the validation set for all the 3 dimensions, sorted in decreasing order.
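The snippet below is a minimal sketch of one plausible reading of this agreement analysis, not the exact aggregation used for the paper: it applies the `statsmodels` implementation of Fleiss' Kappa to the per-person table of (selected, not selected) counts over the 26 categories, and computes the per-dimension standard deviation for the continuous annotations.

```python
import numpy as np
from statsmodels.stats.inter_rater import fleiss_kappa

def person_category_kappa(selections):
    """Agreement for one annotated person.

    selections: array of shape (N_annotators, 26) with 0/1 entries,
    one row per annotator, one column per emotion category.
    Builds the 26 x 2 table of (times selected, times not selected)
    counts and applies Fleiss' Kappa to it.
    """
    selections = np.asarray(selections)
    n_annotators = selections.shape[0]
    selected = selections.sum(axis=0)                 # per-category selection counts
    table = np.stack([selected, n_annotators - selected], axis=1)
    return fleiss_kappa(table)

def person_vad_std(vad_annotations):
    """Per-dimension standard deviation of the continuous annotations.

    vad_annotations: array of shape (N_annotators, 3) with integer
    Valence, Arousal, Dominance values in 1..10.
    """
    return np.asarray(vad_annotations).std(axis=0)

# Toy example: 5 annotators with strong agreement on a few categories.
cats = np.zeros((5, 26), dtype=int)
cats[:, [0, 16, 19]] = 1          # e.g. Affection, Happiness, Pleasure selected by all
cats[0, 13] = 1                   # one annotator also selected Excitement
print(person_category_kappa(cats))
print(person_vad_std([[8, 5, 6], [7, 4, 7], [8, 6, 5], [9, 5, 6], [8, 5, 8]]))
```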

3.4 Dataset Statistics

The EMOTIC dataset contains 34,320 annotated people, of which 66% are males and 34% are females. There are 10% children, 7% teenagers and 83% adults amongst them.

Fig. 9.a shows the number of annotated people for each of the 26 emotion categories, sorted in decreasing order. Notice that the data is unbalanced, which makes the dataset particularly challenging. An interesting observation is that there are more examples for categories associated with positive emotions, like Happiness or Pleasure, than for categories associated with negative emotions, like Pain or Embarrassment. The category with the most examples is Engagement. This is because in most of the images people are doing something or are involved in some activity, showing some degree of engagement. Figs. 9.b, 9.c and 9.d show the number of annotated people for each value of the 3 continuous dimensions. In this case we also observe unbalanced data, but fairly distributed across the 3 dimensions, which is good for modelling.

Fig. 10: Co-occurrence of the 26 emotion categories. Each row represents the occurrence probability of every other category given the category of that particular row.

Fig. 10 shows the co-occurrence rates of any two categories. Every value in the matrix (r, c) (r represents the row category and c the column category) is the co-occurrence probability (in %) of category r given that the annotation also contains category c, that is, P(r|c). We observe, for instance, that when a person is labelled with the category Annoyance, there is a 46.05% probability that this person is also annotated with the category Anger. This means that when a person seems to be feeling Annoyance it is likely (by 46.05%) that this person might also be feeling Anger. We also used K-Means clustering on the category annotations to find groups of categories that occur frequently. We found, for example, that these category groups are common in the EMOTIC annotations: {Anticipation, Engagement, Confidence}, {Affection, Happiness, Pleasure}, {Doubt/Confusion, Disapproval, Annoyance}, {Yearning, Annoyance, Disquietment}. A sketch of the co-occurrence computation is given below.
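As a minimal sketch (not the authors' original analysis script), the conditional co-occurrence matrix P(r|c) can be computed directly from a multi-hot label matrix:

```python
import numpy as np

def cooccurrence_matrix(labels):
    """Conditional co-occurrence P(r | c) in percent.

    labels: array of shape (num_people, 26) with 0/1 entries, where
    labels[n, i] = 1 if person n is annotated with category i.
    Returns a 26 x 26 matrix whose entry (r, c) is the percentage of
    people annotated with category c that are also annotated with r.
    """
    labels = np.asarray(labels, dtype=np.float64)
    joint = labels.T @ labels            # joint[r, c] = #people with both r and c
    per_category = labels.sum(axis=0)    # per_category[c] = #people with category c
    with np.errstate(divide="ignore", invalid="ignore"):
        cond = np.where(per_category > 0, joint / per_category, 0.0)
    return 100.0 * cond                  # diagonal is 100% by construction

# Toy example with 4 people and only the first 3 categories used.
toy = np.zeros((4, 26), dtype=int)
toy[0, [0, 1]] = 1
toy[1, [1, 2]] = 1
toy[2, [1]] = 1
toy[3, [0]] = 1
print(cooccurrence_matrix(toy)[:3, :3])
```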

Fig. 11: Distribution of continuous dimension values across emotion categories. The average value of a dimension is calculated for every category and then plotted in increasing order for every distribution.

Fig. 11 shows the distribution of each continuous dimension across the different emotion categories. For every plot, categories are arranged in increasing order of their average values of the given dimension (calculated over all the instances containing that particular category). Thus, we observe from Fig. 11.a that emotion categories like Suffering, Annoyance and Pain correlate on average with low Valence values (feeling less positive), whereas emotion categories like Pleasure, Happiness and Affection correlate with higher Valence values (feeling more positive). It is also interesting to note that a category like Disconnection lies in the mid-range of Valence values, which makes sense. When we observe Fig. 11.b, it is easy to understand that emotion categories like Disconnection, Fatigue and Sadness show low Arousal values, and we see high activeness for emotion categories like Anticipation, Confidence and Excitement. Finally, Fig. 11.c shows that people are not in control when they show emotion categories like Suffering, Pain and Sadness, whereas when the Dominance is high, emotion categories like Esteem, Excitement and Confidence occur more often.

An important remark about the EMOTIC dataset is that there are people whose faces are not visible. More than 25% of the people in EMOTIC have their faces partially occluded or shown at very low resolution, so we cannot rely on facial expression analysis for recognizing their emotional state.

3.5 Algorithmic Scene Context Analysis

This section illustrates how current scene-centric systems can be used to extract contextual information that can be potentially useful for emotion recognition. In particular, we illustrate this idea with a CNN trained on the Places dataset [37] and with the Sentibank Adjective-Noun Pair (ANP) detectors [38], [39], a Visual Sentiment Ontology for image sentiment analysis. As a reference, Fig. 12 shows Places and ANP outputs for sample images of the EMOTIC dataset.

We used the AlexNet Places CNN [37] to predict the scene category and scene attributes for the images in EMOTIC. This information allows us to divide the analysis into place category and place attribute. We observed that the distribution of emotions varies significantly among different place categories. For example, we found that people in the 'ski slope' frequently experience Anticipation or Excitement, which are associated with the activities that usually happen in this place category.

Fig. 12: Illustration of 2 current scene-centric methods for extracting contextual features from the scene: AlexNet Places CNN outputs (place categories and attributes) and Sentibank ANP outputs for three example images of the EMOTIC dataset.

Comparing sport-related and working-environment related images, we find that people in sport-related images usually show Excitement, Anticipation and Confidence, while they show Sadness or Annoyance less frequently. Interestingly, Sadness and Annoyance appear with higher frequency in working environments. We also observe interesting patterns when correlating the continuous dimensions with place attributes and categories. For instance, places where people usually show high Dominance are sport-related places and sport-related attributes. On the contrary, low Dominance is shown in 'jail cell' or for attributes like 'enclosed area' or 'working', where the freedom of movement is restricted. In Fig. 12, the predictions by the Places CNN describe the scene in general; for example, in the top image there is a girl sitting in a 'kindergarten classroom' (place category), which usually is situated in enclosed areas with 'no horizon' (attributes).

We also find interesting patterns when we compute the correlation between detected ANPs and the emotions labelled in the image. For example, in images with people labelled with Affection, the most frequent ANP is 'young couple', while in images with people labelled with Excitement we frequently found the ANPs 'last game' and 'playing field'. Also, we observe a high correlation between images with Peace and ANPs like 'old couple' and 'domestic scenes', and between Happiness and the ANPs 'outdoor wedding', 'outdoor activities', 'happy family' or 'happy couple'.

Overall, these observations suggest that some common sense knowledge patterns relating emotions and context could potentially be extracted, automatically, from the data.
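As a hedged illustration of this kind of scene-context extraction (not the exact pipeline used for the analysis above), the sketch below runs a Places-pretrained CNN on an image and reads off the top scene categories. It assumes a locally available Places365 ResNet18 checkpoint and category file, e.g. as distributed by the Places project, and that the checkpoint stores a DataParallel-style state dict.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# Assumptions: 'resnet18_places365.pth.tar' and 'categories_places365.txt'
# have been downloaded beforehand from the Places project page.
def load_places_model(checkpoint_path="resnet18_places365.pth.tar"):
    model = models.resnet18(num_classes=365)
    checkpoint = torch.load(checkpoint_path, map_location="cpu")
    # Places checkpoints are usually saved from DataParallel, hence the prefix strip.
    state_dict = {k.replace("module.", ""): v
                  for k, v in checkpoint["state_dict"].items()}
    model.load_state_dict(state_dict)
    model.eval()
    return model

def top_scene_categories(model, image_path, categories, k=5):
    preprocess = T.Compose([
        T.Resize((224, 224)),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    ])
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        probs = torch.softmax(model(img), dim=1).squeeze(0)
    scores, idx = probs.topk(k)
    return [(categories[i], float(s)) for i, s in zip(idx.tolist(), scores.tolist())]

if __name__ == "__main__":
    with open("categories_places365.txt") as f:
        # each line is assumed to look like '/a/abbey 0'; keep the readable name only
        categories = [line.split(" ")[0][3:] for line in f]
    model = load_places_model()
    print(top_scene_categories(model, "example.jpg", categories))
```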
4 CNN MODEL FOR EMOTION RECOGNITION IN SCENE CONTEXT

We propose a baseline CNN model for the problem of emotion recognition in context. The pipeline of the model is shown in Fig. 13 and it is divided into three modules: body feature extraction, image (context) feature extraction, and a fusion network. The first module takes the visible body of the person as input and generates body-related features. The second module takes the whole image as input and generates scene-related features. Finally, the third module combines these features to do a fine-grained regression of the two types of emotion representation (section 3.1).

Fig. 13: Proposed end-to-end model for emotion recognition in context. The model consists of two feature extraction modules and a fusion network for jointly estimating the discrete categories and the continuous dimensions.

The body feature extraction module takes the visible part of the body of the target person as input and generates body-related features. These features include important cues like face and head aspects and pose or body appearance. In order to capture these aspects, this module is pre-trained on ImageNet [40], which is an object-centric dataset that includes the category person.

The image feature extraction module takes the whole image as input and generates scene-context features. These contextual features can be interpreted as an encoding of the scene category, its attributes and objects present in the scene, or the dynamics between other people present in the scene. To capture these aspects, we pre-train this module on the scene-centric Places dataset [37].

The fusion module combines the features of the two feature extraction modules and estimates the discrete emotion categories and the continuous emotion dimensions.

Both feature extraction modules are based on the one-dimensional filter CNN proposed in [41]. These CNN networks provide competitive performance while the number of parameters is low. Each network consists of 16 convolutional layers with 1-dimensional kernels alternating between horizontal and vertical orientations, effectively modeling 8 layers with 2-dimensional kernels. Then, to maintain the location of different parts of the image, we use a global average pooling layer to reduce the features of the last convolutional layer. To avoid internal covariate shift we add a batch normalization layer [42] after each convolutional layer, and rectified linear units to speed up the training.

The fusion network module consists of two fully connected (FC) layers. The first FC layer is used to reduce the dimensionality of the features to 256, and then a second fully connected layer is used to learn independent representations for each task [43]. The output of this second FC layer branches off into 2 separate representations, one with 26 units representing the discrete emotion categories, and a second with 3 units representing the 3 continuous dimensions (section 3.1).
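A minimal PyTorch sketch of this two-branch architecture is shown below. For readability it uses small generic convolutional backbones with global average pooling as stand-ins for the 1-D filter CNN of [41], and it collapses the second shared FC layer into two task-specific heads; the layer sizes and the 26+3 output split follow the description above, but the exact backbone, pre-training and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

class ConvBackbone(nn.Module):
    """Generic stand-in feature extractor: conv blocks + global average pooling."""
    def __init__(self, out_dim=512):
        super().__init__()
        layers, in_ch = [], 3
        for out_ch in (64, 128, 256, out_dim):
            layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=2, padding=1),
                       nn.BatchNorm2d(out_ch),
                       nn.ReLU(inplace=True)]
            in_ch = out_ch
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)   # global average pooling

    def forward(self, x):
        return self.pool(self.features(x)).flatten(1)

class EmotionInContextNet(nn.Module):
    """Two feature-extraction branches (body crop and whole image) + fusion network."""
    def __init__(self, feat_dim=512, num_categories=26, num_dimensions=3):
        super().__init__()
        self.body_branch = ConvBackbone(feat_dim)     # would be ImageNet pre-trained
        self.context_branch = ConvBackbone(feat_dim)  # would be Places pre-trained
        self.fusion = nn.Sequential(
            nn.Linear(2 * feat_dim, 256), nn.ReLU(inplace=True))
        self.head_disc = nn.Linear(256, num_categories)  # 26 discrete categories
        self.head_cont = nn.Linear(256, num_dimensions)  # Valence, Arousal, Dominance

    def forward(self, body_crop, full_image):
        feats = torch.cat([self.body_branch(body_crop),
                           self.context_branch(full_image)], dim=1)
        fused = self.fusion(feats)
        return self.head_disc(fused), self.head_cont(fused)

# Quick shape check with random tensors standing in for a batch of inputs.
model = EmotionInContextNet()
disc, cont = model(torch.randn(2, 3, 128, 128), torch.randn(2, 3, 224, 224))
print(disc.shape, cont.shape)  # torch.Size([2, 26]) torch.Size([2, 3])
```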

4.1 Loss Function and Training Setup

We define the loss function as a weighted combination of two separate losses. A prediction ŷ is composed of the predictions for each of the 26 discrete categories and the 3 continuous dimensions, ŷ = (ŷ^disc, ŷ^cont). In particular, ŷ^disc = (ŷ_1^disc, ..., ŷ_26^disc) and ŷ^cont = (ŷ_1^cont, ŷ_2^cont, ŷ_3^cont). Given a prediction ŷ, the loss on this prediction is defined by L = λ_disc L_disc + λ_cont L_cont, where L_disc and L_cont represent the losses corresponding to learning the discrete categories and the continuous dimensions, respectively. The parameters λ_(disc,cont) weight the contribution of each loss and are set empirically using the validation set.

Criterion for Discrete categories (L_disc): The discrete category estimation is a multilabel problem with an inherent class imbalance issue, as the number of training examples is not the same for each class (see Fig. 9.a). In our experiments, we use a weighted Euclidean loss for the discrete categories. Empirically, we found the Euclidean loss to be more effective than the Kullback-Leibler divergence or a multi-class multi-classification hinge loss. More precisely, given a prediction ŷ^disc, the weighted Euclidean loss is defined as follows:

L_{2disc}(ŷ^disc) = \sum_{i=1}^{26} w_i (ŷ_i^disc − y_i^disc)^2    (1)

where ŷ_i^disc is the prediction for the i-th category and y_i^disc is the ground-truth label. The parameter w_i is the weight assigned to each category. Weight values are defined as w_i = 1 / ln(c + p_i), where p_i is the probability of the i-th category and c is a parameter that controls the range of valid values for w_i. Using this weighting scheme the values of w_i remain bounded as the number of instances of a category approaches 0. This is particularly relevant in our case, as we set the weights based on the occurrence of each category within each batch. Experimentally, we obtained better results using this approach compared to setting global weights based on the entire dataset.

Criterion for Continuous dimensions (L_cont): We model the estimation of the continuous dimensions as a regression problem. Since the data is annotated by multiple annotators based on subjective evaluation, we compare the performance when using two different robust losses: (1) a margin Euclidean loss L_{2cont}, and (2) the Smooth L1 loss SL_{1cont}. The former defines a margin of error within which the error is not taken into account when computing the loss. The margin Euclidean loss for the continuous dimensions is defined as:

L_{2cont}(ŷ^cont) = \sum_{k=1}^{3} v_k (ŷ_k^cont − y_k^cont)^2    (2)

where ŷ_k^cont and y_k^cont are the prediction and the ground truth for the k-th dimension, respectively, and v_k ∈ {0, 1} is a binary weight that represents the error margin: v_k = 0 if |ŷ_k^cont − y_k^cont| < θ, otherwise v_k = 1. If a prediction is within the error margin, i.e. the error is smaller than θ, it does not contribute to updating the weights of the network.

The Smooth L1 loss refers to the absolute error, using the squared error if the error is less than a threshold (set to 1 in our experiments). This loss has been widely used for object detection [44] and, in our experiments, has been shown to be less sensitive to outliers. Precisely, the Smooth L1 loss is defined as follows:

SL_{1cont}(ŷ^cont) = \sum_{k=1}^{3} v_k f(x_k),  with f(x_k) = 0.5 x_k^2 if |x_k| < 1, and |x_k| − 0.5 otherwise    (3)

where x_k = (ŷ_k^cont − y_k^cont), and v_k is a weight assigned to each of the continuous dimensions, set to 1 in our experiments.

We train our recognition system end-to-end, learning the parameters jointly using stochastic gradient descent with momentum. The first two modules are initialized using pre-trained models from Places [37] and ImageNet [45], while the fusion network is trained from scratch. The batch size is set to 52, twice the number of discrete emotion categories. We found empirically, after testing multiple batch sizes (including multiples of 26 like 26, 52, 78, 108), that a batch size of 52 gives the best performance on the validation set.
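The following PyTorch sketch mirrors these criteria under stated assumptions (per-batch category frequencies for p_i, and illustrative values for c and θ, which the paper sets empirically); it is an illustration of Eqs. (1)-(3), not the authors' released training code.

```python
import torch

def weighted_euclidean_disc(pred, target, c=1.2):
    """Eq. (1): weighted Euclidean loss over the 26 discrete categories.

    pred, target: tensors of shape (batch, 26); target is multi-hot.
    w_i = 1 / ln(c + p_i), with p_i the category frequency in this batch
    (c is a free parameter; 1.2 here is an illustrative value).
    """
    p = target.float().mean(dim=0)               # per-category occurrence in the batch
    w = 1.0 / torch.log(c + p)
    return ((pred - target.float()) ** 2 * w).sum(dim=1).mean()

def margin_euclidean_cont(pred, target, theta=1.0):
    """Eq. (2): squared error per VAD dimension, ignored inside the margin theta."""
    err = pred - target
    v = (err.abs() >= theta).float()             # v_k = 0 if |error| < theta
    return (v * err ** 2).sum(dim=1).mean()

def smooth_l1_cont(pred, target):
    """Eq. (3): Smooth L1 per VAD dimension with v_k = 1."""
    x = (pred - target).abs()
    per_dim = torch.where(x < 1.0, 0.5 * x ** 2, x - 0.5)
    return per_dim.sum(dim=1).mean()

def total_loss(pred_disc, y_disc, pred_cont, y_cont,
               lambda_disc=0.5, lambda_cont=0.5, use_smooth_l1=True):
    """L = lambda_disc * L_disc + lambda_cont * L_cont (weights set on validation)."""
    l_disc = weighted_euclidean_disc(pred_disc, y_disc)
    l_cont = smooth_l1_cont(pred_cont, y_cont) if use_smooth_l1 \
        else margin_euclidean_cont(pred_cont, y_cont)
    return lambda_disc * l_disc + lambda_cont * l_cont

# Toy usage with random predictions for a batch of 4 people.
pd, yd = torch.rand(4, 26), (torch.rand(4, 26) > 0.8).float()
pc, yc = torch.rand(4, 3) * 10, torch.randint(1, 11, (4, 3)).float()
print(total_loss(pd, yd, pc, yc).item())
```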
5 EXPERIMENTS

We trained four different instances of our CNN model, which are the combinations of two different input types and the two different continuous loss functions described in section 4.1. The input types are body (i.e., the upper branch in Fig. 13), denoted by B, and body plus image (i.e., both branches shown in Fig. 13), denoted by B+I. The continuous loss types are denoted in the experiments by L2 for the Euclidean loss (equation 2) and SL1 for the Smooth L1 (equation 3).

Results for the discrete categories in the form of Average Precision per category (the higher, the better) are summarized in Table 3. Notice that the B+I model outperforms the B model in all categories except one. The combination of body and image features (the B+I(SL1) model) is better than the B model.

Results for the continuous dimensions in the form of Average Absolute Error per dimension, AAE (the lower, the better), are summarized in Table 4. In this case, all the models provide similar results and the differences are not significant.

Fig. 14 shows a summary of the results obtained per each instance in the testing set. Specifically, Fig. 14.a shows the Jaccard coefficient (JC) for all the samples in the test set. The JC coefficient is computed as follows: per each category we use as threshold for the detection of the category the value where Precision = Recall. Then, the JC coefficient is computed as the number of categories detected that are also present in the ground truth (number of categories in the intersection of detections and ground truth) divided by the total number of categories that are in the ground truth or detected (union over detected categories and categories in the ground truth). The higher this JC is the better, with a maximum value of 1, where the detected categories and the ground truth categories are exactly the same. In the graphic, examples are sorted in decreasing order of the JC coefficient. Notice that these results also support that the B+I model outperforms the B model.

For the case of the continuous dimensions, Fig. 14.b shows the Average Absolute Error (AAE) obtained per each sample in the testing set. Samples are sorted in increasing order (best performances on the left). Consistent with the results shown in Table 4, we do not observe a significant difference among the different models.
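A minimal sketch of this evaluation protocol is given below: the per-category thresholds are taken at the point of the precision-recall curve where precision and recall are closest, and the Jaccard coefficient is then computed per sample. It assumes score and ground-truth matrices as inputs and is not the authors' evaluation script.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def equal_pr_thresholds(scores, labels):
    """Per-category threshold where precision is closest to recall.

    scores, labels: arrays of shape (num_samples, num_categories);
    labels is the binary ground truth (e.g. from the validation set).
    """
    thresholds = np.zeros(scores.shape[1])
    for i in range(scores.shape[1]):
        prec, rec, thr = precision_recall_curve(labels[:, i], scores[:, i])
        # precision_recall_curve returns len(thr) + 1 precision/recall points
        gap = np.abs(prec[:-1] - rec[:-1])
        thresholds[i] = thr[np.argmin(gap)]
    return thresholds

def jaccard_per_sample(scores, labels, thresholds):
    """Jaccard coefficient |detected ∩ ground truth| / |detected ∪ ground truth|."""
    detected = scores >= thresholds
    truth = labels.astype(bool)
    inter = (detected & truth).sum(axis=1)
    union = (detected | truth).sum(axis=1)
    # if both sets are empty, define the coefficient as 1 (perfect agreement)
    return np.where(union > 0, inter / np.maximum(union, 1), 1.0)

# Toy usage with random scores for 100 samples and 26 categories.
rng = np.random.default_rng(0)
y = (rng.random((100, 26)) > 0.85).astype(int)
s = 0.6 * y + 0.4 * rng.random((100, 26))       # noisy scores correlated with labels
th = equal_pr_thresholds(s, y)
print(jaccard_per_sample(s, y, th).mean())
```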

TABLE 3: Average Precision (AP) obtained on the test set per category. Results for models where the input is just the body (B) and models where the input is both the body and the whole image (B+I). The type of L_cont used is indicated in parenthesis (L2 refers to equation 2 and SL1 refers to equation 3).

Emotion Category        B(L2)   B(SL1)  B+I(L2) B+I(SL1)
1. Affection            21.80   16.55   21.16   27.85
2. Anger                06.45   04.67   06.45   09.49
3. Annoyance            07.82   05.54   11.18   14.06
4. Anticipation         58.61   56.61   58.61   58.64
5. Aversion             05.08   03.64   06.45   07.48
6. Confidence           73.79   72.57   77.97   78.35
7. Disapproval          07.63   05.50   11.00   14.97
8. Disconnection        20.78   16.12   20.37   21.32
9. Disquietment         14.32   13.99   15.54   16.89
10. Doubt/Confusion     29.19   28.35   28.15   29.63
11. Embarrassment       02.38   02.15   02.44   03.18
12. Engagement          84.00   84.59   86.24   87.53
13. Esteem              18.36   19.48   17.35   17.73
14. Excitement          73.73   71.80   76.96   77.16
15. Fatigue             07.85   06.55   08.87   09.70
16. Fear                12.85   12.94   12.34   14.14
17. Happiness           58.71   51.56   60.69   58.26
18. Pain                03.65   02.71   04.42   08.94
19. Peace               17.85   17.09   19.43   21.56
20. Pleasure            42.58   40.98   42.12   45.46
21. Sadness             08.13   06.19   10.36   19.66
22. Sensitivity         04.23   03.60   04.82   09.28
23. Suffering           04.90   04.38   07.65   18.84
24. Surprise            17.20   17.03   16.42   18.81
25. Sympathy            10.66   09.35   11.44   14.71
26. Yearning            07.82   07.40   08.34   08.34
Mean                    23.86   22.36   24.88   27.38

TABLE 4: Average Absolute Error (AAE) obtained on the test set per each continuous dimension. Results for models where the input is just the body (B) and models where the input is both the body and the whole image (B+I). The type of L_cont used is indicated in parenthesis (L2 refers to equation 2 and SL1 refers to equation 3).

Continuous Dimension    B(L2)   B(SL1)  B+I(L2) B+I(SL1)
Valence                 0.0537  0.0545  0.0546  0.0528
Arousal                 0.0600  0.0630  0.0648  0.0611
Dominance               0.0570  0.0567  0.0573  0.0579
Mean                    0.0569  0.0581  0.0589  0.0573

Fig. 14: Results per each sample (test set, sorted): (a) Jaccard coefficient (JC) of the recognized discrete categories (mean JC of 0.35, 0.33, 0.37 and 0.40 for B(L2), B(SL1), B+I(L2) and B+I(SL1), respectively); (b) Average Absolute Error (AAE) in the estimation of the three continuous dimensions.

Finally, Fig. 15 shows qualitative predictions for the best B and B+I models. These examples were randomly selected among samples with high JC in B+I (a-b) and samples with low JC in B+I (g-h). Incorrect category recognitions are indicated in red. As shown, in general, the B+I model outperforms B, although there are some exceptions, like Fig. 15.c.

5.1 Context Features Comparison

The goal of this section is to compare different context features for the problem of emotion recognition in context. A key aspect of incorporating the context in an emotion recognition model is being able to obtain information from the context that is actually relevant for emotion recognition. Since the context information extraction is a scene-centric task, the information extracted from the context should be based on a scene-centric feature extraction system. That is why our baseline model uses a Places CNN for the context feature extraction module. However, recent works in sentiment analysis (detecting the emotion of a person when he/she observes an image) also provide a system for scene feature extraction that can be used for encoding the relevant contextual information for emotion recognition.

To compute body features, denoted by B_f, we fine-tune an AlexNet ImageNet CNN with the EMOTIC database, and use the average pooling of the last convolutional layer as features. For the context (image), we compare two different feature types, which are denoted by I_f and I_S. I_f are obtained by fine-tuning an AlexNet Places CNN with the EMOTIC database and taking the average pooling of the last convolutional layer as features (similar to B_f), while I_S is a feature vector composed of the sentiment scores for the ANP detectors from the implementation of [39].

To fairly compare the contribution of the different context features, we train Logistic Regressors on the following features and combinations of features: (1) B_f, (2) B_f+I_f, and (3) B_f+I_S. For the discrete categories we obtain mean APs of 23.00, 27.70, and 29.45, respectively. For the continuous dimensions, we obtain AAEs of 0.0704, 0.0643, and 0.0713, respectively. We observe that, for the discrete categories, both I_f and I_S contribute relevant information to emotion recognition in context. Interestingly, I_S performs better than I_f, even though these features have not been trained using EMOTIC. However, these features are smartly designed for sentiment analysis, which is a problem closely related to extracting relevant contextual information for emotion recognition, and are trained with a large dataset of images.
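A hedged sketch of this comparison protocol (linear probes on frozen features) is shown below; feature extraction itself is assumed to have been done beforehand, and mean AP is computed with scikit-learn. It illustrates the protocol rather than reproducing the reported numbers.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score

def mean_ap_with_logistic_probe(train_feats, train_labels, test_feats, test_labels):
    """Train one logistic regressor per emotion category on frozen features
    and report the mean Average Precision over the 26 categories."""
    aps = []
    for i in range(train_labels.shape[1]):
        clf = LogisticRegression(max_iter=1000)
        clf.fit(train_feats, train_labels[:, i])
        scores = clf.predict_proba(test_feats)[:, 1]
        aps.append(average_precision_score(test_labels[:, i], scores))
    return float(np.mean(aps))

# Assumed precomputed features: B_f (body), I_f (Places), I_S (ANP sentiment scores).
# Random arrays stand in for the real features here.
rng = np.random.default_rng(0)
n_train, n_test = 500, 200
B_f, I_f, I_S = (rng.random((n_train + n_test, d)) for d in (256, 256, 100))
y = (rng.random((n_train + n_test, 26)) > 0.85).astype(int)

for name, feats in [("B_f", B_f),
                    ("B_f + I_f", np.hstack([B_f, I_f])),
                    ("B_f + I_S", np.hstack([B_f, I_S]))]:
    ap = mean_ap_with_logistic_probe(feats[:n_train], y[:n_train],
                                     feats[n_train:], y[n_train:])
    print(f"{name}: mean AP = {ap:.4f}")
```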
Incorrect category recognition is indi- a large dataset of images. cated in red. As shown, in general, B+I model outperforms B, although there are some exceptions, like Fig. 15.c. 6 CONCLUSIONS In this paper we pointed out the importance of consider- 5.1 Context Features Comparison ing the person scene context in the problem of automatic The goal of this section is to compare different context emotion recognition in the wild. We presented the EMOTIC features for the problem of emotion recognition in context. database, a dataset of 23, 571 natural unconstrained images A key aspect for incorporating the context in an emotion with 34, 320 people labeled according to their apparent recognition model is to be able to obtain information from emotions. The images in the dataset are annotated using two the context that is actually relevant for emotion recognition. different emotion representations: 26 discrete categories, Since the context information extraction is a scene-centric and the 3 continuous dimensions V alence, Arousal and task, the information extracted from the context should be Dominance. We described in depth the annotation process based in a scene-centric feature extraction system. That is and analyzed the annotation consistency of different anno- why our baseline model uses a Places CNN for the con- tators. We also provided different statistics and algorithmic text feature extraction module. However, recent works in analysis on the data, showing the characteristics of the sentiment analysis (detecting the emotion of a person when EMOTIC database. In addition, we proposed a baseline IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE 11
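As a rough illustration of the comparison above, the sketch below trains independent per-category logistic regressors and per-dimension regressors on top of precomputed feature blocks and reports mean AP and mean AAE. This is not the authors' released code: the array names, shapes, the synthetic stand-in data, and the use of plain linear regression for the continuous dimensions are our own assumptions.

```python
# Sketch of the context-feature comparison of Sec. 5.1 on synthetic stand-in data.
# Bf, If_ and IS stand for body features, Places-based context features and
# ANP sentiment scores; every name, shape and value here is hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n_train, n_test, n_cat = 500, 200, 26
n = n_train + n_test
Bf  = rng.normal(size=(n, 256))    # body features (e.g., pooled conv activations)
If_ = rng.normal(size=(n, 256))    # scene-centric context features
IS  = rng.normal(size=(n, 1024))   # ANP detector scores (stand-in dimensionality)
Y_cat = rng.integers(0, 2, size=(n, n_cat))  # 26 binary discrete categories
Y_vad = rng.uniform(size=(n, 3))             # Valence / Arousal / Dominance

def evaluate(X):
    """Mean AP over categories and mean AAE over VAD for one feature combination."""
    X_tr, X_te = X[:n_train], X[n_train:]
    aps = []
    for c in range(n_cat):  # one logistic regressor per discrete category
        clf = LogisticRegression(max_iter=1000).fit(X_tr, Y_cat[:n_train, c])
        aps.append(average_precision_score(Y_cat[n_train:, c],
                                           clf.predict_proba(X_te)[:, 1]))
    aaes = []
    for d in range(3):      # one regressor per continuous dimension (assumed linear)
        reg = LinearRegression().fit(X_tr, Y_vad[:n_train, d])
        aaes.append(np.mean(np.abs(reg.predict(X_te) - Y_vad[n_train:, d])))
    return np.mean(aps), np.mean(aaes)

for name, X in [("Bf", Bf),
                ("Bf+If", np.hstack([Bf, If_])),
                ("Bf+IS", np.hstack([Bf, IS]))]:
    m_ap, m_aae = evaluate(X)
    print(f"{name}: mean AP = {m_ap:.4f}, mean AAE = {m_aae:.4f}")
```

On real EMOTIC features this loop would reproduce the comparison reported above (mean APs of 23.00, 27.70 and 29.45); on the random data it only demonstrates the evaluation flow.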

[Fig. 15 content: eight example images (a-h); for each one, the ground-truth categories and Valence/Arousal/Dominance values are shown alongside the predictions of the B (L2) and B+I (SL1) models, together with the Jaccard Coefficient (JC) obtained by each model.]

Fig. 15: Ground truth and results on images randomly selected with different JC scores.
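For reference, the short sketch below shows how the per-example scores reported in Figs. 14 and 15 can be computed, assuming the standard definitions of the Jaccard Coefficient over category sets and of the average absolute error over the three continuous dimensions; it is not the authors' evaluation script.

```python
# Minimal sketch of the per-example metrics of Figs. 14 and 15, assuming the
# standard definitions; not the authors' evaluation code.

def jaccard_coefficient(pred_categories, true_categories):
    """|intersection| / |union| of predicted and ground-truth category sets."""
    pred, true = set(pred_categories), set(true_categories)
    union = pred | true
    return 1.0 if not union else len(pred & true) / len(union)

def average_absolute_error(pred_vad, true_vad):
    """Mean absolute error over the Valence/Arousal/Dominance estimates."""
    return sum(abs(p - t) for p, t in zip(pred_vad, true_vad)) / len(true_vad)

# Illustrative values only (not taken from a specific panel of Fig. 15):
print(jaccard_coefficient(
    ["Anticipation", "Confidence", "Engagement"],
    ["Anticipation", "Confidence", "Engagement", "Happiness", "Pleasure"]))  # 0.6
print(average_absolute_error([0.62, 0.70, 0.66], [0.57, 0.83, 0.67]))        # ~0.063
```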

6 CONCLUSIONS

In this paper we pointed out the importance of considering the person's scene context in the problem of automatic emotion recognition in the wild. We presented the EMOTIC database, a dataset of 23,571 natural unconstrained images with 34,320 people labeled according to their apparent emotions. The images in the dataset are annotated using two different emotion representations: 26 discrete categories, and the 3 continuous dimensions Valence, Arousal and Dominance. We described in depth the annotation process and analyzed the annotation consistency of different annotators. We also provided different statistics and algorithmic analyses of the data, showing the characteristics of the EMOTIC database. In addition, we proposed a baseline CNN model for emotion recognition in scene context that combines the information of the person (body bounding box) with the scene context information (whole image). We also compared two different feature types for encoding the contextual information. Our results show the relevance of using contextual information to recognize emotions and, in conjunction with the EMOTIC dataset, motivate further research in this direction. All the data and trained models are publicly available to the research community on the project website.

ACKNOWLEDGMENT

This work has been partially supported by the Ministerio de Economia, Industria y Competitividad (Spain), under Grants Ref. TIN2015-66951-C2-2-R and RTI2018-095232-B-C22, and by the Ministry of Science, Innovation and Universities (FEDER funds). The authors also thank NVIDIA for their generous hardware donations.

REFERENCES

[1] D. Borth, R. Ji, T. Chen, T. Breuel, and S.-F. Chang, "Large-scale visual sentiment ontology and detectors using adjective noun pairs," in Proceedings of the 21st ACM International Conference on Multimedia. ACM, 2013, pp. 223–232.
[2] H. Aviezer, R. R. Hassin, J. Ryan, C. Grady, J. Susskind, A. Anderson, M. Moscovitch, and S. Bentin, "Angry, disgusted, or afraid? Studies on the malleability of emotion perception," Psychological Science, vol. 19, no. 7, pp. 724–732, 2008.
[3] R. Righart and B. de Gelder, "Rapid influence of emotional scenes on encoding of facial expressions: an ERP study," Social Cognitive and Affective Neuroscience, vol. 3, no. 3, pp. 270–278, 2008.
[4] T. Masuda, P. C. Ellsworth, B. Mesquita, J. Leu, S. Tanida, and E. Van de Veerdonk, "Placing the face in context: cultural differences in the perception of facial emotion," Journal of Personality and Social Psychology, vol. 94, no. 3, p. 365, 2008.
[5] L. F. Barrett, B. Mesquita, and M. Gendron, "Context in emotion perception," Current Directions in Psychological Science, vol. 20, no. 5, pp. 286–290, 2011.
[6] L. F. Barrett, How Emotions Are Made: The Secret Life of the Brain. Houghton Mifflin Harcourt, 2017.
[7] A. Mehrabian, "Framework for a comprehensive description and measurement of emotional states," Genetic, Social, and General Psychology Monographs, 1995.
[8] R. Kosti, J. M. Alvarez, A. Recasens, and A. Lapedriza, "Emotion recognition in context," in The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[9] M. Pantic and L. J. Rothkrantz, "Expert system for automatic analysis of facial expressions," Image and Vision Computing, vol. 18, no. 11, pp. 881–905, 2000.
[10] Z. Li, J.-i. Imai, and M. Kaneko, "Facial-component-based bag of words and PHOG descriptor for facial expression recognition," in SMC, 2009, pp. 1353–1358.
[11] E. Friesen and P. Ekman, "Facial action coding system: a technique for the measurement of facial movement," Palo Alto, 1978.
[12] P. Ekman and W. V. Friesen, "Constants across cultures in the face and emotion," Journal of Personality and Social Psychology, vol. 17, no. 2, p. 124, 1971.
[13] C. F. Benitez-Quiroz, R. Srinivasan, and A. M. Martinez, "Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild," in Proceedings of the IEEE International Conference on Computer Vision & Pattern Recognition (CVPR16), Las Vegas, NV, USA, 2016.
[14] M. Soleymani, S. Asghari-Esfeden, Y. Fu, and M. Pantic, "Analysis of EEG signals and facial expressions for continuous emotion detection," IEEE Transactions on Affective Computing, vol. 7, no. 1, pp. 17–28, 2016.
[15] S. Du, Y. Tao, and A. M. Martinez, "Compound facial expressions of emotion," Proceedings of the National Academy of Sciences, vol. 111, no. 15, pp. E1454–E1462, 2014.
[16] M. A. Nicolaou, H. Gunes, and M. Pantic, "Continuous prediction of spontaneous affect from multiple cues and modalities in valence-arousal space," IEEE Transactions on Affective Computing, vol. 2, no. 2, pp. 92–105, 2011.
[17] K. Schindler, L. Van Gool, and B. de Gelder, "Recognizing emotions expressed by body pose: A biologically inspired neural model," Neural Networks, vol. 21, no. 9, pp. 1238–1246, 2008.
[18] W. Mou, O. Celiktutan, and H. Gunes, "Group-level arousal and valence recognition in static images: Face, body and context," in Automatic Face and Gesture Recognition (FG), 2015 11th IEEE International Conference and Workshops on, vol. 5. IEEE, 2015, pp. 1–6.
[19] "GENKI database," http://mplab.ucsd.edu/wordpress/?page_id=398, accessed: 2017-04-12.
[20] "ICML face expression recognition dataset," https://goo.gl/nn9w4R, accessed: 2017-04-12.
[21] J. L. Tracy, R. W. Robins, and R. A. Schriber, "Development of a FACS-verified set of basic and self-conscious emotion expressions," Emotion, vol. 9, no. 4, p. 554, 2009.
[22] A. Kleinsmith and N. Bianchi-Berthouze, "Recognizing affective dimensions from body posture," in Proceedings of the 2nd International Conference on Affective Computing and Intelligent Interaction, ser. ACII '07. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 48–58. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-74889-2_5

[23] A. Kleinsmith, N. Bianchi-Berthouze, and A. Steed, "Automatic recognition of non-acted affective postures," IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), vol. 41, no. 4, pp. 1027–1038, Aug 2011.
[24] T. Bänziger, H. Pirker, and K. Scherer, "GEMEP-Geneva multimodal emotion portrayals: A corpus for the study of multimodal emotional expressions," in Proceedings of LREC, vol. 6, 2006, pp. 15–019.
[25] S. Escalera, X. Baró, H. J. Escalante, and I. Guyon, "Chalearn looking at people: Events and resources," CoRR, vol. abs/1701.02664, 2017. [Online]. Available: http://arxiv.org/abs/1701.02664
[26] A. Dhall, R. Goecke, J. Joshi, J. Hoey, and T. Gedeon, "Emotiw 2016: Video and group-level emotion recognition challenges," in Proceedings of the 18th ACM International Conference on Multimodal Interaction, ser. ICMI 2016. New York, NY, USA: ACM, 2016, pp. 427–432. [Online]. Available: http://doi.acm.org/10.1145/2993148.2997638
[27] A. Dhall et al., "Collecting large, richly annotated facial-expression databases from movies," 2012.
[28] A. Dhall, J. Joshi, I. Radwan, and R. Goecke, "Finding happiest moments in a social context," in Asian Conference on Computer Vision. Springer, 2012, pp. 613–626.
[29] G. Patterson and J. Hays, "COCO attributes: Attributes for people, animals, and objects," in European Conference on Computer Vision. Springer, 2016, pp. 85–100.
[30] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: common objects in context," CoRR, vol. abs/1405.0312, 2014. [Online]. Available: http://arxiv.org/abs/1405.0312
[31] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, "Semantic understanding of scenes through ADE20K dataset," 2016. [Online]. Available: https://arxiv.org/pdf/1608.05442
[32] "Oxford English Dictionary," http://www.oed.com, accessed: 2017-06-09.
[33] "Merriam-Webster online English dictionary," https://www.merriam-webster.com, accessed: 2017-06-09.
[34] E. G. Fernández-Abascal, B. García, M. Jiménez, M. Martín, and F. Domínguez, Psicología de la emoción. Editorial Universitaria Ramón Areces, 2010.
[35] R. W. Picard and R. Picard, Affective Computing. MIT Press Cambridge, 1997, vol. 252.
[36] Y. Groen, A. B. M. Fuermaier, A. E. Den Heijer, O. Tucha, and M. Althaus, "The empathizing and systemizing quotient: The psychometric properties of the Dutch version and a review of the cross-cultural stability," Journal of Autism and Developmental Disorders, vol. 45, no. 9, pp. 2848–2864, 2015. [Online]. Available: http://dx.doi.org/10.1007/s10803-015-2448-z
[37] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba, and A. Oliva, "Places: An image database for deep scene understanding," CoRR, vol. abs/1610.02055, 2015.
[38] D. Borth, R. Ji, T. Chen, T. Breuel, and S.-F. Chang, "Large-scale visual sentiment ontology and detectors using adjective noun pairs," in Proceedings of the 21st ACM International Conference on Multimedia. ACM, 2013, pp. 223–232.
[39] T. Chen, D. Borth, T. Darrell, and S.-F. Chang, "DeepSentiBank: Visual sentiment concept classification with deep convolutional neural networks," arXiv preprint arXiv:1410.8586, 2014.
[40] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
[41] J. Alvarez and L. Petersson, "DecomposeMe: Simplifying convnets for end-to-end learning," CoRR, vol. abs/1606.05426, 2016.
[42] S. Ioffe and C. Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning, 2015, pp. 448–456.
[43] R. Caruana, A Dozen Tricks with Multitask Learning, 2012, pp. 163–189.
[44] R. Girshick, "Fast R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1440–1448.
[45] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in CVPR, 2009.

Ronak Kosti is pursuing his PhD at the Universitat Oberta de Catalunya, Spain, advised by Prof. Agata Lapedriza. He works with the Scene Understanding and Artificial Intelligence (SUNAI) group in computer vision, specifically in the area of affective computing. He obtained his Masters in Machine Intelligence from DA-IICT (Dhirubhai Ambani Institute of Information and Communication Technology) in 2014. His master's research was based on depth estimation from a single image using Artificial Neural Networks.

Jose M. Alvarez is a senior research scientist at NVIDIA. Previously, he was a senior deep learning researcher at Toyota Research Institute, US; prior to that he was a researcher with Data61, CSIRO, Australia (formerly NICTA). He obtained his Ph.D. in 2010 from the Autonomous University of Barcelona under the supervision of Prof. Antonio Lopez and Prof. Theo Gevers. Before CSIRO, he worked as a postdoctoral researcher at the Courant Institute of Mathematical Sciences at New York University under the supervision of Prof. Yann LeCun.

Adria Recasens is pursuing his PhD in computer vision at the Computer Science and Artificial Intelligence Laboratory (CSAIL) of the Massachusetts Institute of Technology, advised by Professor Antonio Torralba. His research interests range over various topics in computer vision and machine learning, and he focuses most of his research on automatic gaze-following. He received a Telecommunications Engineer's Degree and a Mathematics Licentiate Degree from the Universitat Politècnica de Catalunya.

Agata Lapedriza is an Associate Professor at the Universitat Oberta de Catalunya. She received her MS degree in Mathematics from the Universitat de Barcelona and her Ph.D. degree in Computer Science from the Computer Vision Center, at the Universitat Autònoma de Barcelona. She worked as a visiting researcher in the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology (MIT) from 2012 until 2015, and has been a visiting researcher in the Affective Computing group at the MIT Media Lab since September 2017. Her research interests are related to image understanding, scene recognition and characterization, and affective computing.