Analysis of Contextual Emotions Using Multimodal Data

by

Saurabh Hinduja

A dissertation submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, Department of Computer Science and Engineering, College of Engineering, University of South Florida

Major Professor: Shaun Canavan, Ph.D.
Rangachar Kasturi, Ph.D.
Jeffrey F. Cohn, Ph.D.
Marvin J. Andujar, Ph.D.
Pei-Sung Lin, Ph.D.
Elizabeth Schotter, Ph.D.

Date of Approval: March 31, 2021

Keywords: Expression, Action Units, Physiological Signals

Copyright © 2021, Saurabh Hinduja

Dedication

I would like to dedicate this dissertation to my family and my niece, Ria. They have been with me every step of the way through good times and bad. Their unconditional love, guidance, and support have helped me succeed in efficiently completing my Doctoral degree. They are and will always remain my source of inspiration.

Acknowledgments

I would like to express my deepest gratitude to Dr. Shaun Canavan and Dr. Jeffrey F. Cohn for giving me the opportunity to explore my potential. They have been excellent mentors and constant sources of encouragement and support. Thank you for spending so much of your valuable time on the insightful conversations that led to the design and development of my research. I would also like to thank Dr. Rangachar Kasturi for motivating me to pursue my doctoral degree. I would like to thank Dr. Rangachar Kasturi, Dr. Jeffrey F. Cohn, Dr. Pei-Sung Lin, Dr. Marvin J. Andujar, and Dr. Elizabeth Schotter for agreeing to be a part of my defense committee. I appreciate their valuable feedback and review comments on this work. I would also like to thank Dr. Lijun Yin for his valuable time and feedback on my publications. I would like to thank my fellow researchers, Saandeep Aathreya and Shivam Srivastava, for being a constant source of inspiration. Finally, I would like to thank my close friends Gurmeet Kaur and Aleksandrina D. Davidson for their constant support during my doctoral degree.

Table of Contents

List of Tables ...... iv

List of Figures ...... vii

Abstract ...... ix

Chapter 1: Introduction ...... 1
1.1 Motivation ...... 1
1.2 Problems and Open Questions in Affective Computing ...... 2
1.3 Related Works ...... 3
1.4 Contribution ...... 6
1.4.1 Self-Reported Emotions ...... 7
1.4.2 Impact of Action Unit Patterns ...... 7
1.4.3 Multimodal Pain Detection ...... 8
1.4.4 Context Recognition from Facial Movements ...... 8
1.5 Organization ...... 8

Chapter 2: Recognizing Self-Reported Emotions from Facial Expressions ...... 10
2.1 Introduction ...... 10
2.2 Analysis of Self-Report Emotions ...... 12
2.2.1 Dataset ...... 12
2.2.2 Subject Self-Reporting ...... 17
2.3 Recognizing Self-Report Emotions ...... 19
2.3.1 Experimental Design ...... 19
2.3.1.1 Data Preprocessing ...... 19
2.3.1.2 Proposed 3D CNN ...... 19
2.3.1.3 Data Validation ...... 21
2.3.2 Results ...... 22
2.4 Conclusion ...... 22

Chapter 3: Impact of Action Unit Occurrence Patterns on Detection ...... 24
3.1 Introduction ...... 24
3.2 Action Unit Occurrence Patterns ...... 30
3.2.1 Datasets ...... 30
3.2.2 Class Imbalance and AU Detection ...... 31
3.2.3 Action Unit Occurrence Patterns ...... 37
3.3 Impact of AU Patterns on Detection ...... 40
3.3.1 Multi-AU Detection (Experiment 1) ...... 42
3.3.2 Single-AU Detection (Experiment 2) ...... 46
3.4 Training on AU Occurrence Patterns ...... 47
3.4.1 Neural Network Architectures ...... 47
3.4.2 Top 66 AU Occurrence Patterns (Experiment 3) ...... 48
3.4.3 All AU Occurrence Patterns (Experiment 4) ...... 51
3.5 Conclusion ...... 51

Chapter 4: Multimodal Fusion of Physiological Signals and Facial Action Units for Pain Recognition ...... 53
4.1 Introduction ...... 53
4.2 Fusion and Experimental Design ...... 56
4.2.1 Multimodal Dataset (BP4D+) ...... 56
4.2.2 Syncing Physiological Signals with AUs ...... 56
4.2.3 Fusion of Physiological Signals and AUs ...... 57
4.2.4 Pain Recognition Experimental Design ...... 58
4.3 Results ...... 58
4.3.1 Pain Recognition ...... 58
4.3.2 Comparison to State of the Art ...... 60
4.4 Discussion ...... 61
4.5 Conclusion ...... 64

Chapter 5: Recognizing Context Using Facial Expression Dynamics from Action Unit Patterns ...... 65
5.1 Introduction ...... 65
5.2 Related Works ...... 67
5.2.1 Action Unit Detection ...... 68
5.2.2 Expression Recognition and Action Units ...... 69
5.3 Modeling Temporal Dynamics with Action Unit Patterns ...... 70
5.3.1 Motivation and Background ...... 70
5.3.2 Modeling Temporal Dynamics ...... 72
5.3.2.1 AU Occurrence Pattern Modeling ...... 73
5.3.2.2 AU Intensity Pattern Modeling ...... 73
5.3.3 Justification for AU Pattern Images ...... 74
5.4 Datasets ...... 76
5.4.1 DISFA ...... 76
5.4.2 BP4D ...... 78
5.4.3 BP4D+ ...... 78
5.5 Experimental Design ...... 79
5.5.1 Neural Network Architectures ...... 80
5.5.1.1 Convolutional Neural Network ...... 80
5.5.1.2 Feed-forward Fully-connected Network ...... 81
5.5.2 Experiments and Metrics ...... 81
5.5.2.1 DISFA ...... 81
5.5.2.2 BP4D and BP4D+ ...... 82
5.5.3 Metrics ...... 83
5.6 Results ...... 84
5.6.1 DISFA ...... 85
5.6.1.1 Context ...... 85
5.6.1.2 Target from Context ...... 87
5.6.1.3 Target Emotion ...... 88
5.6.2 BP4D and BP4D+ ...... 90
5.6.2.1 BP4D Context Recognition ...... 90
5.6.2.2 BP4D+ Context Recognition ...... 91
5.6.2.3 BP4D vs. BP4D+ Context Recognition ...... 93
5.6.3 Network Attention Maps ...... 96
5.6.4 Automatic AU Detection ...... 99
5.7 Discussion ...... 101

Chapter 6: Conclusion ...... 103
6.1 Findings ...... 103
6.1.1 Self-Reported Emotions ...... 103
6.1.2 Impact of Action Unit Occurrence Distribution on Detection ...... 103
6.1.3 Fusion of Physiological Signals and Facial Action Units for Pain Recognition ...... 103
6.1.4 Recognizing Context Using Facial Expression Dynamics from Action Unit Patterns ...... 104
6.2 Applications ...... 104
6.3 Limitations ...... 105
6.4 Future Work ...... 107

References ...... 108

Appendix A: Dictionary ...... 128

Appendix B: Attention Maps ...... 129

Appendix C: Copyright Permissions ...... 139

About the Author ...... End Page

List of Tables

Table 2.1 Task Descriptions of BP4D+ ...... 13

Table 2.2 Percent of subjects which felt emotion in each task. Rows are subject self-reporting, columns are tasks ...... 13

Table 2.3 Percentage of male subjects that felt emotion in each task ...... 14

Table 2.4 Percentage of female subjects that felt emotion in each task . . . . . 14

Table 2.5 Significance of differences between male and female intensity of self-reported emotion ...... 15

Table 2.6 Significance of differences between male and female occurrence of self-reported emotion ...... 16

Table 2.7 Average intensity of self-report emotions across all tasks (BP4D+) for male and female ...... 20

Table 2.8 Perceived emotion evaluation metrics...... 21

Table 2.9 Variance and standard deviation of accuracy for recognizing perceived emotion...... 22

Table 3.1 Number of AUs and Occurrence Criteria of datasets ...... 31

Table 3.2 Correlation between F1-binary scores and AU class imbalance. . . . . 32

Table 3.3 Standard deviation of F1-binary scores, for each individual AU, between all methods ...... 34

Table 3.4 BP4D AU occurrence patterns and total number of frames for each pattern ...... 34

Table 3.5 DISFA AU occurrence patterns and total number of frames for each pattern ...... 35

Table 3.6 BP4D+ AU occurrence patterns and total number of frames for each pattern ...... 35

Table 3.7 Top AU occurrence patterns, for each sequence, from DISFA . . . . . 38

Table 3.8 Top AU occurrence patterns, for each task, in BP4D ...... 39

Table 3.9 Top AU occurrence patterns, for each task, in BP4D+ ...... 39

Table 3.10 Correlation between metrics and class imbalance for experiment 1. . . 40

Table 3.11 Correlation between metrics and class imbalance for experiment 2. . . 44

Table 3.12 Evaluation on tested networks for top 66 AU occurrence patterns. . . 45

Table 3.13 Evaluation on tested networks for all AU occurrence patterns. . . . . 49

Table 3.14 Evaluation on N3 for unseen patterns...... 50

Table 3.15 Variance of AU F1-binary scores...... 51

Table 4.1 Pain recognition results using physiological signals, action units, and the fusion of both ...... 55

Table 5.1 Video Descriptions of DISFA...... 77

Table 5.2 Task Descriptions of BP4D...... 77

Table 5.3 Evaluation metrics for the feed-forward fully-connected neural network on DISFA, BP4D, and BP4D+...... 85

Table 5.4 Evaluation metrics for context from AU patterns using intensity, occurrence, and no normalization on the DISFA dataset...... 86

Table 5.5 Confusion matrix for AU intensity-based context on the DISFA dataset 86

Table 5.6 Confusion matrix for AU occurrence-based context on the DISFA dataset ...... 87

Table 5.8 Confusion matrix for AU intensity-based target emotion from context on the DISFA dataset ...... 88

Table 5.7 Evaluation metrics for target emotion from context from AU patterns using intensity, occurrence, and no normalization on the DISFA dataset...... 88

Table 5.9 Confusion matrix for AU intensity-based target emotion on the DISFA dataset ...... 89

Table 5.10 Confusion matrix for AU occurrence-based target emotion from context on the DISFA dataset ...... 89

Table 5.11 Evaluation metrics for target emotion from AU patterns using intensity, occurrence, and no normalization on DISFA...... 89

Table 5.12 Confusion matrix for AU occurrence-based target emotion on the DISFA dataset ...... 90

Table 5.13 Evaluation metrics for intensity, occurrence, combination on BP4D . 91

Table 5.14 Confusion matrix for AU intensity-based (5 AUs) context on the BP4D dataset ...... 91

Table 5.15 Confusion matrix for AU occurrence-based (5 AUs) context on the BP4D dataset ...... 92

Table 5.16 Confusion matrix for AU combination-based context on the BP4D dataset ...... 92

Table 5.17 Confusion matrix for AU occurrence-based context on the BP4D dataset ...... 92

Table 5.18 Evaluation metrics for intensity, occurrence, combination on BP4D . 93

Table 5.19 Confusion matrix for AU intensity-based (5 AUs) context on the BP4D+ dataset ...... 93

Table 5.20 Confusion matrix for AU occurrence-based (5 AUs) context on the BP4D+ dataset ...... 94

Table 5.21 Confusion matrix for AU combination-based context on the BP4D+ dataset ...... 94

Table 5.22 Confusion matrix for AU occurrence-based (12 AUs) context on the BP4D+ dataset ...... 94

Table 5.23 Evaluation metrics for experiments on BP4D vs. BP4D+ ...... 95

Table 5.24 AU Intensity based Context - Cross Dataset ...... 95

Table 5.25 AU Occurrence based Context - Cross Dataset ...... 96

Table 5.26 Manually vs. automatically annotated AUs...... 100

List of Figures

Figure 2.1 Different subjects from same task meant to elicit happiness/amusement 11

Figure 2.2 Proposed 3D convolutional neural network for recognizing multiple self-report emotions for one task...... 18

Figure 3.1 AU detection accuracy vs occurrence in BP4D ...... 25

Figure 3.2 AU detection accuracy vs occurrence in DISFA ...... 26

Figure 3.3 AU detection accuracy vs occurrence in BP4D+ ...... 29

Figure 3.4 Patterns and their frame counts, compared to total percentage of patterns for BP4D, DISFA and BP4D+...... 36

Figure 3.5 Comparisons of different accuracy metrics on multiple networks . . . 41

Figure 3.6 Comparisons of different accuracy metrics on different networks . . . 43

Figure 4.1 Visual comparison between respiration and blood pressure, and facial expressions (i.e. action units) occurring at the same time ...... 59

Figure 4.2 Correlation matrices of physiological signals and action units . . . . . 62

Figure 5.1 Overview of proposed approach to model facial dynamics from AU patterns ...... 66

Figure 5.2 Comparison of high/low AU intensity, with same occurrence patterns, from different sequences (DISFA [73])...... 71

Figure 5.3 Comparison between the AU intensity distributions ...... 73

Figure 5.4 Comparison between the AU occurrence and AU intensity patterns . 74

Figure 5.5 Interpolated Intensity Action Unit Signals ...... 75

Figure 5.6 Convolutional neural network architecture for AU pattern training . 80

Figure 5.7 Non-normalized AU Pattern image ...... 82

Figure 5.8 AU occurrence and intensity combination image ...... 83

Figure 5.9 Average network attention maps (intensity) for all subjects in DISFA. 97

Figure 5.10 Average network attention maps (occurrence) for all subjects in DISFA. 98

Figure B.1 Examples of Happy Video Attention Maps Using Intensity ...... 129

Figure B.2 Examples of Surprise Video Attention Maps Using Intensity . . . . . 130

Figure B.3 Examples of Sad Video Attention Maps Using Intensity ...... 131

Figure B.4 Examples of Disgust Video Attention Maps Using Intensity . . . . . 132

Figure B.5 Examples of Fear Video Attention Maps Using Intensity ...... 133

Figure B.6 Examples of Happy Video Attention Maps Using Occurrence . . . . 134

Figure B.7 Examples of Surprise Video Attention Maps Using Occurrence . . . . 135

Figure B.8 Examples of Sad Video Attention Maps Using Occurrence ...... 136

Figure B.9 Examples of Disgust Video Attention Maps Using Occurrence . . . . 137

Figure B.10 Examples of Fear Video Attention Maps Using Occurrence ...... 138

Abstract

Affective computing builds and evaluates systems that can recognize, interpret, and simulate human emotion. It is an interdisciplinary field, which includes computer science, psychology, and many others. For years, human emotion has been studied in psychology, but it has recently become a prominent field in computer science. Largely, the field of affective computing has focused on analyzing static facial expressions to recognize human emotions, without taking bias (e.g. gender, data bias), context, or temporal information into account. Psychology has shown the difficulty of analyzing emotions without incorporating this type of information. In this dissertation, we have proposed new approaches to recognizing emotions by incorporating both contextual and temporal information, as well as approaches to mitigate data bias. More specifically, this dissertation has the following theoretical and application-based contributions: (1) The first work to recognize context using temporal dynamics from facial action units; (2) We recognize multiple self-reported emotions using facial expression-based videos; (3) A new approach to mitigate data bias in facial action units is proposed; and (4) Multimodal, temporal fusion of physiological signals and action units is used for emotion recognition. This dissertation has a wide range of applications in fields including, but not limited to, medicine, security, and entertainment.

Chapter 1: Introduction

1.1 Motivation

Human emotions are important for conveying feelings, and the expression of emotions can vary significantly from person to person. In this dissertation, we investigate expressed and self-reported emotions. Expressed emotions are those shown on the face, while self-reported emotions are those experienced by the subject; self-reported emotions are manually obtained by having the subject report the emotion(s) that they felt. Both expressed and self-reported emotions depend on context and on the stimulus to which the subject is exposed. The current state-of-the-art methods in emotion recognition mainly detect expressed facial emotions; they do not detect self-reported emotions or context. Recent publications have shown that context plays an important role in the analysis and understanding of emotion [71][100]. Much of the current research into emotion recognition also focuses on facial expressions that occurred in a single instant of time [1]; however, emotional responses to stimuli are temporal [125][72][114], not instantaneous. This dissertation proposes new approaches to using temporal information for self-reported emotion and context recognition. Along with expression detection, research has also been conducted on detecting facial action units [34], which correspond to muscle movements of the face. Most research on action units focuses on their detection, while limited research has been conducted on the analysis of action units or their use after detection. As the dynamics of facial emotions are temporal [89], the corresponding muscle movements (action units) are also temporal; that is, action units have an onset, peak, and offset. This temporal nature of facial action units, along with their co-occurrence with

each other [44][109], may give action unit research a new avenue for investigation, as well as new challenges and problems to solve. The overall objective of my dissertation is to explore the temporal and contextual nature of emotions with current state-of-the-art datasets. A main objective was to develop the first method that shows the relationship between facial action units and the experience of a person (context). Along with this, a new approach was developed to detect self-reported emotions from facial expressions. This is the first work to propose approaches to both of these challenging and important problems.

1.2 Problems and Open Questions in Affective Computing

In this dissertation, I conduct an investigation into some of the main challenges in emotion recognition:

1. For emotion recognition, many state-of-the-art works detect either facially expressed emotions or the triggered emotion. However, people feel different emotions under different stimuli, and there has been very little research on detecting self-reported emotions. Self-reported emotions are the closest measure we have to the emotions experienced by a person. Considering this, my dissertation investigates the following open question: Can multiple self-reported emotions be recognized using temporal, facial information (e.g. expression)?

2. In machine learning, data distribution plays an important role in prediction metrics. Often, the distribution of facial action unit occurrences is imbalanced, which negatively impacts evaluation metrics (e.g. F1 score) [50]. This causes the majority of state-of-the-art methods to have a low standard deviation across action units (i.e. they all have similar accuracies) and F1 scores that are highly correlated with the action unit occurrence distribution. This dissertation proposes a new approach to the challenging, open question of how to handle data imbalance in action units (i.e. data bias).

3. Pain recognition is an important question for the medical community. While there are encouraging works in this area, most of them use a single modality (e.g. face) for pain recognition. As it has been shown that multiple modalities can increase the accuracy of recognition systems in infants [134], this naturally leads to the question of how multimodal systems can increase the accuracy of pain recognition across all age ranges. This dissertation investigates the open questions of whether multiple modalities can better recognize pain and whether there are correlations between the modalities. Learning the correlations between the modalities may give new insights into more accurate pain recognition systems.

4. A great deal of encouraging work has been done on detecting facial action units [131, 128, 116]; however, less work has been done on how to use those detected action units. This is a challenging and open question in affective computing. Considering this, this dissertation proposes a first, novel approach to modeling temporal action units for automatically recognizing context, that is, the experience of the person.

1.3 Related Works

Facial expressions are an important means of conveying emotion in humans; however, human traits such as ethnicity, gender, and age can impact the recognition of emotions [41]. This is compounded by the fact that the ethnicity distribution of humans on the planet is not uniform, which makes it difficult to collect a balanced dataset for emotion analysis across ethnicities. This limitation makes it challenging to create a single emotion detection method that works across all ethnicities. One tool that helps in analyzing expressions is the Facial Action Coding System (FACS) [34], which represents facial muscle movements in terms of Action Units (AUs). When an emotion is expressed, one or more AUs can be triggered, which allows a wide range of facial expressions to be represented by FACS. To successfully investigate this problem, large and varied datasets are required. Fortunately, many state-of-the-art datasets for emotion detection have

been collected [139][137][73][74][24] and many promising results have been obtained using these state-of-the-art datasets. Investigating AUs, Srinivasan et al. [111] analyzed cross-cultural and culture-specific perceptions and expressions of facial emotion. They found that although expression and perception are similar for some emotions, there is a statistical difference in other emotions across cultures. Their findings suggest that there might be different AU activations for the same emotions across cultures. Chu et al. [25] used a convolutional neural network (CNN) and a long short-term memory network (LSTM) to perform three-class classification on AUs by learning temporal and spatial cues. Zeng et al. [135] developed a confidence preserving machine (CPM) for the task. In their proposed method, the CPM learns two classifiers: a positive classifier that separates all positive classes and a negative classifier that does the same for the negative classes. The CPM then learns a person-specific classifier to detect the AUs. Li et al. [59] used temporal fusion for AU detection. They developed a deep learning framework where regions of interest are learned independently, so each sub-region has a local CNN; multi-label learning is then used. Although these results are encouraging, imbalance in the data can influence the training of the methods, making it difficult to detect emotion across age, gender, and ethnicity. Ertugrul et al. [35] also found that it is difficult to detect AUs across domains (i.e. across different datasets). Yang et al. [132] proposed the use of FACS3D-Net to detect action units. Their method integrates 2D and 3D convolutional neural networks for this task. They showed that combining spatial and temporal information yields an increase in AU detection results. In recent years, there has been a great deal of research into inferring emotion through the use of facial expressions, using both 2D and 3D information. Fabiano et al. [38] proposed a Deformable Synthesis Model for creating synthetic 3D facial data used to train deep neural networks. They showed that the use of this synthetic training data allowed for generalizing facial expressions across multiple state-of-the-art datasets. Yang et al. [131] proposed de-expression residue learning, which recognizes facial expressions by extracting the expressive

component through a generative model. This component is then used for recognition of facial expression. While the previously mentioned works have shown encouraging results, they did not take context into account, which psychology has shown to be important in emotion recognition [107]. The context principle of Frege's philosophy [97] says to “never ask for the meaning of a word in isolation, but only in the context of a sentence.” This principle motivates the work in this dissertation, which in effect is that emotions are inseparable from their context. Martinez [71] and Barrett et al. [10] have shown how context impacts the perception of emotion. They have also shown how visual scenes, voices, culture, and the presence of other people affect the perception of a person's emotion. Mesquita et al. [75] have shown that emotional factors are important to mental processes. They explored how a person's psychology is a result of the individual's interaction with their physical, social, and cultural environments. Ledgerwood et al. [56] showed that psychological traits are neither positive nor negative and that their implications are dependent on the context. Aldao et al. [2] have shown that there are four components that characterize an emotion: the person who feels the emotion, the stimulus that elicits the emotion, the selection and implementation of strategies, and the types of outcomes. Mesquita et al. [76] also proposed a sociodynamic model of emotions in which they emphasized the role of social context on emotions. Motivated by these works in psychology, Mittal et al. [79] proposed EmotiCon, which performs context-aware emotion prediction. They used the context of the subject and the implementation strategy (background and other people) to predict the emotion. Martinez [71] showed that by only observing the facial expression of a person, it is difficult for other people to recognize the emotion of that person. For example, if we just observe the facial expression of a soccer player after a goal has been scored, the player may appear to be angry. When we see the complete picture and the context that the player has scored a goal, we can say confidently that the person is happy. This shows that it is important for machine learning systems to use context and the entire scene to predict emotions. To the best of our knowledge,

however, there is no current work that considers context to predict emotions. These works motivate our investigation into context, which is further detailed in this dissertation. Along with context, bias and the impact of gender are also important when analyzing emotions. Recently, Zou et al. [145] called for AI to become more fair. They highlighted that the problem is largely because most AI training data is collected in the United States. AI not only learns identification features from the data given to it, but it also learns the bias and distribution of the data. This effect has also been highlighted by Buolamwini et al. [15], who analyzed the accuracy of gender classifiers for people with different demographic traits. They found that gender classifiers performed best on light-skinned males and worst on dark-skinned females.

1.4 Contribution

The contributions made in this dissertation are in the field of Affective Computing. The main contributions are:

1. We are the first to ask how action units directly relate to people’s experience (context).

2. We have developed a novel approach to detect context from the pattern of Action Units.

Additionally, this dissertation focuses on novel solutions to the following questions.

1. Do gender and data imbalance impact recognition accuracies?

2. How can self-reported emotion be used to facilitate an increase in recognition accuracies, and give us insight into the correlation between self-report and expected elicitation of emotion?

3. How can we more effectively use multimodal data in emotion recognition systems (e.g. pain)?

4. What is the relationship between the occurrence of action units and their displayed patterns?

5. Given a set of detected action units, what applications can they positively impact? Can they be used for emotion or context recognition?

To answer these questions, this dissertation has 4 contributions which are detailed below.

1.4.1 Self-Reported Emotions

We investigated the recognition of multiple self-reported emotions by proposing a novel 3D convolutional neural network architecture that takes temporal information into account. Along with this, cross-gender analysis was performed to gain insight into how gender impacts self-reported emotion. To the best of our knowledge, this is the first work to recognize multiple self-reported emotions, laying the groundwork for future investigations into how self-reported and expected elicited emotions are correlated.

1.4.2 Impact of Action Unit Patterns

We have analyzed the impact of facial action unit occurrence imbalance on evaluation metrics. First, we conducted an in-depth literature survey of state-of-the-art facial action unit detection methods. Through this survey, we show that F1 scores are highly correlated with the facial action unit occurrence distribution and that the methods have a low standard deviation (i.e. their accuracies are similar across the action units). We have also shown that the F1 score of a random guess is comparable to that of state-of-the-art methods. This naturally leads to the question of what the machine learning algorithms are learning, which we address in this dissertation. Next, we further analyzed the problem by using two networks: one is a current state-of-the-art network and the other is a neural network that we developed. We have shown that both networks perform equally well on the F1 metric, but differences appear in other metrics such as area under the curve

and F1 micro and macro. This result suggests that more than one evaluation metric should be used when reporting results. Finally, we have shown that action units occur in patterns and that these patterns can be used to increase the performance of action unit detection methods. The proposed approach is a useful step towards mitigating bias in data.

1.4.3 Multimodal Pain Detection

We propose a multimodal approach to pain recognition by fusing physiological signals and action units. We show that the fused signals increase pain recognition accuracy. We have also shown that the correlation between facial action units and physiological signals is strong during the most expressive segments of the data; however, when the data from the entire video is analyzed, the correlation decreases. Finally, we have shown that there is a delay between the time the person experiences pain and the peaks of the physiological signals and facial movements, which can give us insight into future multimodal pain recognition systems. Our results suggest that syncing the modalities is important, as a spike in physiological data may not directly correlate with a spike in expression at a given time step.

1.4.4 Context Recognition from Facial Movements

In this dissertation, we have proposed a first, novel approach to model both the intensity and occurrence of action units using temporal patterns. The action units (both intensity and occurrence) are transformed to the image space and used to train a convolutional neural network to recognize context. We show that intensity and the number of action units (the more the better) can positively impact recognition accuracies for contextual emotions.

1.5 Organization

The rest of the dissertation is organized as follows:

Chapter 2: The proposed approach for detecting contextual self-reported emotions is presented.

Chapter 3: A discussion of action unit occurrence imbalance, the correlation between AU occurrence and F1 score, and the correlations between state-of-the-art methods is given.

Chapter 4: Multimodal pain detection and the correlation between physiological signals and action units are discussed.

Chapter 5: The relationship between context and temporal action unit patterns is detailed, including a novel approach to detecting said context.

Chapter 2: Recognizing Self-Reported Emotions from Facial Expressions¹

2.1 Introduction

In recent years, a great deal of research has been done on inferring emotion through the use of facial expressions, using both 2D and 3D information. Yang et al. [132] proposed the use of FACS3D-Net to detect action units (AUs). Their method integrates 2D and 3D convolutional neural networks for this task. They showed that combining spatial and temporal information yields an increase in AU detection results. Fabiano et al. [38] created synthetic 3D facial data used to train deep neural networks. They showed that the use of this synthetic training data resulted in generalizing the detection of facial expressions across multiple state-of-the-art datasets. Yang et al. [131] proposed de-expression residue learning, which recognizes facial expressions by extracting the expressive component through a generative model. Although there have been encouraging results on recognizing emotion from facial expressions, it has been shown that expressions vary across cultures and people [111]. Barrett et al. [9] found that while there is evidence to suggest people follow a common view of emotion vs. expression (e.g. smile when happy and frown when sad), how this is communicated varies widely across people, even within the same situation. It has also been shown that multiple emotions can occur within facial expressions [46, 129]. Considering this, it is important to investigate what emotions (possibly multiple) the subjects themselves felt (i.e. self-reported emotion) to further learn how expressions are related to the emotion of the subject. Many of the works on recognition of emotion focus on recognizing the emotion that

¹This chapter was published in ©2021 IEEE, Face and Gesture 2020 [48]. Reprinted with permission, included in Appendix C.

was meant to be elicited. Interestingly, Girard et al. [44] have begun to investigate subject self-reported emotions as related to facial expressions, with a specific focus on the emotion felt when a smile occurs. They found that while a smile occurs during emotions such as amusement, embarrassment, fear, and pain, the smiles also looked different, as evidenced by their measurement of action units. This work motivated our current investigation into the self-reported emotions of subjects. Bias and the impact of gender are also important when analyzing self-reported emotions. Recently, Zou et al. [145] called for AI to become more fair. They highlight that the problem is due to the fact that most AI training data is collected in the United States. AI not only learns identification features from the data given to it, but it also learns the bias and distribution of the data. This effect has also been highlighted by Buolamwini et al. [15], who analyzed the accuracy of gender classifiers for people with different demographic traits and found that gender classifiers performed best on light-skinned males and worst on dark-skinned females.

Figure 2.1: Different subjects from the same task, meant to elicit happiness/amusement. Self-reported emotion (top): relaxed: 5, amused: 4, sympathetic: 2, startled: 0, surprised: 1. (bottom): relaxed: 1, amused: 1, sympathetic: 0, startled: 2, surprised: 1. All other self-reported emotions are 0 for both subjects.

As it has been shown that expressions vary among people and that multiple emotions can occur within facial expressions, it is difficult to infer emotion directly from them (see Fig. 2.1). This motivated us to investigate the self-reported emotions of subjects as they relate to facial expressions. The contribution of this work is three-fold:

1. We analyze subject self-reporting from the BP4D+ multimodal spontaneous emotion corpus [139], across gender in relation to the emotion that was meant to be elicited from the task.

2. We propose a 3D CNN architecture for recognizing multiple self-report emotions from facial expressions.

3. To the best of our knowledge, this is the first work to report results for recognizing multiple self-report emotions, across all tasks in BP4D+.

2.2 Analysis of Self-Report Emotions

2.2.1 Dataset

BP4D+ [139] consists of 140 subjects (58 male/82 female) with an age range of 18-66. It contains multiple ethnicities, including Caucasian, African American, Asian, and Hispanic. It contains 2D and thermal images, 3D models, physiological data, action units, and facial landmarks (2D and 3D). Ten tasks were performed to elicit particular target emotions (see Table 2.1). For 138 subjects, self-reports of the emotions they felt during each task were collected. The subjects were allowed to choose multiple emotions for each task, such as relaxed, surprised, sad, happy, etc. The intensity of the self-reported emotions was also collected using a 5-point Likert scale. Our analysis (Section 2.2.2) is done on all 138 subjects.

Table 2.1: Task Descriptions of BP4D+

Task | Activity | Target Emotion
T1 | Interview: Listen to a funny joke | Happy
T2 | Graphic show: Watch 3D avatar of participant | Surprise
T3 | Video clip: 911 emergency phone call | Sad
T4 | Experience a sudden burst of sound | Startle or surprise
T5 | Interview: True or false question | Skeptical
T6 | Improvise a silly song | Embarrassment
T7 | Experience physical threat in dart game | Fear or nervous
T8 | Cold pressor: Submerge hand into ice water | Physical pain
T9 | Interview: Complained for a poor performance | Anger or upset
T10 | Experience smelly odor | Disgust

Table 2.2: Percent of subjects that felt each emotion in each task. Rows are subject self-reporting, columns are tasks. NOTE: Darker color corresponds to higher percent.

T1 T2 T3 T4 T5 T6 T7 T8 T9 T10
Relaxed 0.63 0.25 0.01 0.00 0.09 0.09 0.04 0.07 0.06 0.03
Amused 0.96 0.50 0.00 0.12 0.14 0.43 0.25 0.02 0.14 0.01
Disgusted 0.00 0.04 0.04 0.00 0.00 0.02 0.01 0.00 0.01 0.97
Afraid 0.01 0.00 0.53 0.35 0.01 0.02 0.71 0.12 0.15 0.13
Angry 0.00 0.00 0.42 0.12 0.04 0.01 0.04 0.09 0.47 0.09
Frustrated 0.01 0.00 0.06 0.04 0.06 0.12 0.02 0.08 0.41 0.04
Sad 0.00 0.01 0.65 0.00 0.00 0.01 0.00 0.00 0.09 0.00
Sympathetic 0.05 0.00 0.36 0.01 0.01 0.01 0.01 0.00 0.07 0.01
Nervous 0.22 0.05 0.27 0.10 0.09 0.40 0.55 0.12 0.33 0.16
Pained 0.00 0.01 0.02 0.04 0.00 0.00 0.02 0.99 0.00 0.04
Embarrassed 0.05 0.32 0.00 0.08 0.03 0.94 0.01 0.00 0.27 0.02
Startled 0.03 0.17 0.37 0.99 0.09 0.02 0.33 0.17 0.20 0.17
Surprised 0.23 0.81 0.09 0.64 0.46 0.17 0.31 0.12 0.19 0.12
Skeptical 0.04 0.03 0.01 0.00 0.98 0.04 0.13 0.03 0.28 0.04

Table 2.3: Percentage of male subjects that felt emotion in each task. NOTE: Rows/columns same as Table 2.2.

T1 T2 T3 T4 T5 T6 T7 T8 T9 T10
Relaxed 0.66 0.29 0.00 0.00 0.10 0.16 0.03 0.14 0.07 0.02
Amused 0.93 0.53 0.00 0.21 0.17 0.38 0.29 0.03 0.16 0.00
Disgusted 0.00 0.02 0.10 0.00 0.00 0.03 0.00 0.00 0.02 0.98
Afraid 0.00 0.00 0.50 0.28 0.00 0.03 0.66 0.10 0.12 0.05
Angry 0.00 0.00 0.33 0.12 0.07 0.03 0.03 0.10 0.52 0.07
Frustrated 0.02 0.00 0.09 0.03 0.03 0.14 0.02 0.07 0.36 0.09
Sad 0.00 0.02 0.55 0.00 0.00 0.02 0.00 0.00 0.12 0.00
Sympathetic 0.07 0.00 0.38 0.02 0.00 0.02 0.02 0.00 0.03 0.02
Nervous 0.21 0.03 0.34 0.09 0.07 0.33 0.57 0.14 0.34 0.17
Pained 0.00 0.02 0.02 0.03 0.00 0.00 0.02 0.98 0.00 0.05
Embarrassed 0.07 0.22 0.00 0.12 0.00 0.93 0.00 0.00 0.19 0.00
Startled 0.03 0.19 0.28 1.00 0.14 0.03 0.26 0.09 0.14 0.12
Surprised 0.28 0.78 0.07 0.69 0.45 0.17 0.29 0.07 0.21 0.14
Skeptical 0.02 0.05 0.00 0.00 0.98 0.02 0.21 0.03 0.29 0.05

Table 2.4: Percentage of female subjects that felt emotion in each task. NOTE: Rows/columns same as Table 2.2.

T1 T2 T3 T4 T5 T6 T7 T8 T9 T10
Relaxed 0.60 0.22 0.01 0.00 0.07 0.04 0.05 0.02 0.05 0.04
Amused 0.98 0.48 0.00 0.06 0.12 0.47 0.22 0.01 0.12 0.01
Disgusted 0.00 0.06 0.00 0.00 0.00 0.01 0.01 0.00 0.00 0.96
Afraid 0.01 0.00 0.54 0.40 0.01 0.01 0.74 0.12 0.17 0.19
Angry 0.00 0.00 0.49 0.12 0.02 0.00 0.04 0.09 0.43 0.10
Frustrated 0.01 0.00 0.05 0.04 0.07 0.11 0.02 0.09 0.44 0.01
Sad 0.00 0.01 0.72 0.00 0.00 0.00 0.00 0.00 0.07 0.00
Sympathetic 0.04 0.00 0.35 0.00 0.01 0.00 0.00 0.00 0.10 0.01
Nervous 0.22 0.06 0.22 0.11 0.10 0.46 0.54 0.11 0.32 0.15
Pained 0.00 0.00 0.02 0.05 0.00 0.00 0.02 1.00 0.00 0.02
Embarrassed 0.04 0.38 0.00 0.05 0.05 0.95 0.01 0.00 0.32 0.04
Startled 0.02 0.15 0.44 0.99 0.06 0.01 0.38 0.23 0.25 0.21
Surprised 0.20 0.84 0.11 0.60 0.47 0.16 0.32 0.15 0.17 0.11
Skeptical 0.05 0.01 0.01 0.00 0.98 0.05 0.07 0.02 0.27 0.02

Table 2.5: Significance of differences between male and female intensity of self-reported emotion. Note: 'n.s' stands for not significant, * p < 0.05, ** p < 0.01, *** p < 0.005

Task Relaxed Amused Disgusted Afraid Angry Frustrated Sad Sympathetic Nervous Pained Embarrassed Startled Surprised Skeptical
T1 n.s n.s - n.s - n.s - n.s n.s - n.s n.s n.s n.s
T2 n.s n.s n.s - - - n.s - n.s n.s * n.s n.s n.s
T3 n.s - ** n.s n.s n.s *** n.s n.s n.s - n.s n.s n.s
T4 - *** - n.s n.s n.s - n.s n.s n.s n.s n.s n.s -
T5 n.s n.s - n.s n.s n.s - n.s n.s - n.s n.s n.s n.s
T6 n.s n.s n.s n.s n.s n.s n.s n.s n.s - n.s n.s n.s n.s
T7 n.s n.s n.s n.s n.s n.s - n.s n.s n.s n.s n.s n.s *
T8 *** n.s - n.s n.s n.s - - n.s * - * * n.s
T9 n.s n.s n.s n.s n.s n.s n.s n.s n.s - n.s n.s n.s n.s
T10 n.s n.s n.s * n.s n.s - n.s n.s n.s n.s * n.s n.s

Table 2.6: Significance of differences between male and female occurrence of self-reported emotion. Note: 'n.s' stands for not significant, * p < 0.05, ** p < 0.01, *** p < 0.005

Task Relaxed Amused Disgusted Afraid Angry Frustrated Sad Sympathetic Nervous Pained Embarrassed Startled Surprised Skeptical
T1 n.s n.s - - - - - - n.s - n.s - n.s -
T2 n.s * n.s - - - - - n.s - ** n.s n.s n.s
T3 - - n.s n.s n.s n.s *** n.s n.s n.s - n.s * -
T4 - * - n.s n.s n.s - - n.s n.s n.s n.s n.s -
T5 n.s n.s - n.s n.s n.s - n.s n.s - n.s n.s n.s n.s
T6 n.s n.s n.s n.s n.s n.s n.s - n.s - n.s n.s n.s n.s
T7 n.s n.s n.s n.s n.s n.s - - n.s - n.s * n.s n.s
T8 *** n.s - n.s n.s n.s - - n.s n.s - * n.s n.s
T9 n.s n.s - n.s n.s n.s n.s n.s n.s - n.s n.s n.s n.s
T10 n.s - n.s * n.s n.s - - n.s n.s n.s ** n.s n.s

2.2.2 Subject self-reporting

Motivated by work that has shown multiple emotions can be felt during facial expressions [46, 129], we have investigated which emotions were self-reported by subjects across all of the tasks in BP4D+. As can be seen in Table 2.2, many of the subjects reported multiple emotions for each task. For example, in task 1, where happy was meant to be elicited, a large percentage of the subjects felt amused, relaxed, surprised, and nervous (96%, 63%, 23%, and 22%, respectively). Along with these emotions, a small percentage (<5%) also felt afraid, frustrated, sympathetic, embarrassed, and startled. For each task, the emotion that was meant to be elicited (or a similar emotion) is felt by the largest percentage of subjects (e.g. for the happy task, 96% of the subjects felt amused; for the pain task, 99% of the subjects felt pain). What is interesting to note is that there are many instances of complementary emotions being felt for the tasks. For example, in task 3 (sad), 65% of the subjects felt sad; however, fear, anger, sympathy, nervousness, and startle were all felt with relatively high percentages (53%, 42%, 36%, 27%, and 37%, respectively). This analysis agrees with the literature [9] that people react differently even within the same situation (i.e. task). Another interesting task is task 9 (anger), where frustration is felt with similar frequency as anger (41% and 47%, respectively). Multiple emotions being felt within the same task can partially explain the difficulty in inferring emotion from facial expressions alone. It is possible that even though similar emotions are facially expressed, the subject may feel a different emotion. Along with analyzing all subjects with self-report, we also independently analyzed male and female subjects. As can be seen in Tables 2.3 and 2.4, many of the self-reported emotions are similar across the tasks for both male and female subjects. For example, for task T5 (skeptical), 98% of the male and female subjects reported that they felt skeptical, while 45% of the male and 47% of the female subjects reported they felt surprised. Although the majority of female and male subjects reported similar emotions across the majority of the tasks, there are some instances where they differ. For example, in task 3 (sad), 72%

of the female subjects reported feeling sad, while only 55% of the male subjects reported this.

Figure 2.2: Proposed 3D convolutional neural network for recognizing multiple self-report emotions for one task.

To further investigate this, we evaluated the statistical significance of the differences between Tables 2.3 and 2.4 by conducting paired t-tests (Table 2.6). As mentioned in Section 2.2.1, the self-reports from BP4D+ also contain the intensity of the self-reported emotions; therefore, we also calculated the statistical significance between males and females with respect to the intensity of self-reported emotions (Table 2.5). It can be seen in Table 2.6 that for most emotions there is no significant difference between males and females with respect to the occurrence of self-reported emotions. There are some exceptions to this, especially tasks 3 (sad), 4 (startled), and 8 (pain). These differences in response across gender may be explained by chance, although more statistical analysis is needed to determine this. It is also interesting to note that the differences change when intensity is accounted for (Table 2.5). For example, in task 2 (surprise), there is no significant difference between males and females in how often they reported feeling amused. However, among the subjects who reported feeling amused, there is a significant difference in the intensity of the emotion that they reported. To further investigate this, we calculated the average intensity of each reported emotion, for each task, for male and female subjects. We calculated this value for two cases. First, the range of [0, 5], which also includes when an emotion was

not felt. Second, we calculated it for the range [1, 5], to compare the intensities only when an emotion was felt. For the first case, the average intensity across all tasks and emotions is 0.49 for females and 0.42 for males. For the second case, the average intensity across all tasks and emotions is 3.67 for females and 3.43 for males. Table 2.7 shows the average intensities for all self-reported emotions across all tasks (range [0, 5]); for the majority of tasks, female subjects had stronger intensities of emotion.

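The gender-difference tests summarized in Tables 2.5 and 2.6 can be reproduced along the following lines. This is a hedged sketch rather than the original analysis code: the `ratings` structure is a hypothetical container, and since the male and female groups contain different subjects, Welch's independent-samples t-test is used here in place of the paired t-test reported in the text; the significance markers mirror the thresholds used in the tables.

```python
# Hedged sketch of the male/female significance tests (Tables 2.5 and 2.6).
# `ratings` is a hypothetical dict: (task, emotion) -> (male_scores, female_scores),
# each a list of 0-5 Likert intensities (or 0/1 occurrences for Table 2.6).
from scipy import stats

def significance_markers(ratings):
    markers = {}
    for (task, emotion), (male, female) in ratings.items():
        if len(male) < 2 or len(female) < 2:
            markers[(task, emotion)] = "-"      # too few reports to test
            continue
        _, p = stats.ttest_ind(male, female, equal_var=False)
        if p < 0.005:
            markers[(task, emotion)] = "***"
        elif p < 0.01:
            markers[(task, emotion)] = "**"
        elif p < 0.05:
            markers[(task, emotion)] = "*"
        else:
            markers[(task, emotion)] = "n.s"
    return markers
```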
2.3 Recognizing Self-Report Emotions

2.3.1 Experimental Design

2.3.1.1 Data Preprocessing

First, we tracked the faces and normalized them in terms of rotation, scaling, and centering using Dlib [53]. This resulted in face images of size 256 x 256 pixels, which we scaled down to 175 x 175 pixels. The 3D CNN architecture requires each input sequence to have the same number of frames; to satisfy this requirement one could take the first N consecutive frames, however, this would not be representative of the whole sequence [132]. Therefore, we sampled the entire sequence to obtain an equal number of frames from each sequence. As adjacent frames are highly correlated [132], we sampled 200 equidistant frames from each sequence, preserving maximum temporal information.
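A simplified sketch of this preprocessing step is shown below, assuming OpenCV and Dlib are available. It is not the exact pipeline used in the dissertation: alignment is reduced to cropping the detected face box, whereas the original work also normalizes rotation and scale, and `video_path` is an illustrative argument.

```python
# Simplified preprocessing sketch (not the exact dissertation pipeline):
# detect the face with Dlib, crop it, resize to 175x175, and sample 200
# equidistant frames so every sequence has the same length.
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()

def preprocess_sequence(video_path, n_frames=200, size=175):
    cap = cv2.VideoCapture(video_path)
    frames = []
    ok, frame = cap.read()
    while ok:
        frames.append(frame)
        ok, frame = cap.read()
    cap.release()

    # Equidistant sampling covers the whole sequence instead of the first N frames.
    idx = np.linspace(0, len(frames) - 1, n_frames).astype(int)
    faces = []
    for i in idx:
        dets = detector(cv2.cvtColor(frames[i], cv2.COLOR_BGR2GRAY), 1)
        if not dets:
            continue                      # skip frames with no detected face
        d = dets[0]
        crop = frames[i][max(d.top(), 0):d.bottom(), max(d.left(), 0):d.right()]
        faces.append(cv2.resize(crop, (size, size)))
    return np.asarray(faces)              # shape: (<=n_frames, 175, 175, 3)
```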

2.3.1.2 Proposed 3D CNN

Inspired by FACS3D-Net [132], we extend this work by proposing a multi-tail architecture for multi-emotion recognition that utilizes shared 3D Convolution layers. The 3D convolution captures the temporal information of emotions from the onset to offset and the multi-tail performs regression for each independent emotion. The multiple tails of the network take the deep features of the shared 3D CNN [127] to predict the presence of the self-reported emotion, treating each emotion independently while preserving the co-occurrence of emotions.

Table 2.7: Average intensity (range [0,5]) of self-report emotions across all tasks (BP4D+) for male (M) and female (F). White cells intensity [0, < 1); yellow cells intensity [1, 2); green cells intensity [2, 3); red cells intensity [3, 5].

Emotion T1-M T1-F T2-M T2-F T3-M T3-F T4-M T4-F T5-M T5-F T6-M T6-F T7-M T7-F T8-M T8-F T9-M T9-F T10-M T10-F
Relaxed 1.97 2.00 0.90 0.60 0.00 0.01 0.00 0.00 0.28 0.24 0.38 0.12 0.09 0.15 0.41 0.04 0.16 0.11 0.05 0.12
Amused 2.91 3.11 1.71 1.22 0.00 0.00 0.60 0.11 0.41 0.28 0.95 1.23 0.74 0.59 0.10 0.01 0.36 0.30 0.00 0.01
Disgusted 0.00 0.00 0.02 0.10 0.22 0.00 0.00 0.00 0.00 0.00 0.05 0.04 0.00 0.05 0.00 0.00 0.02 0.00 3.83 4.22
Afraid 0.00 0.01 0.00 0.00 1.48 1.93 0.97 1.54 0.00 0.04 0.07 0.02 2.07 2.01 0.30 0.31 0.28 0.47 0.12 0.59
Angry 0.00 0.00 0.00 0.00 1.26 1.82 0.28 0.46 0.21 0.06 0.09 0.00 0.09 0.12 0.20 0.26 1.55 1.30 0.16 0.25
Frustrated 0.03 0.01 0.00 0.00 0.22 0.16 0.14 0.11 0.07 0.14 0.41 0.21 0.07 0.09 0.16 0.19 1.72 1.40 0.24 0.06
Sad 0.00 0.00 0.02 0.02 1.81 2.85 0.00 0.00 0.00 0.00 0.07 0.00 0.00 0.00 0.00 0.00 0.35 0.21 0.00 0.00
Sympathetic 0.12 0.05 0.00 0.00 1.33 1.30 0.04 0.00 0.00 0.04 0.02 0.00 0.04 0.00 0.00 0.00 0.05 0.25 0.02 0.01
Nervous 0.24 0.33 0.05 0.14 1.07 0.71 0.24 0.38 0.16 0.31 0.86 1.35 1.70 1.50 0.50 0.27 0.90 1.00 0.43 0.37
Pained 0.00 0.00 0.02 0.00 0.02 0.10 0.16 0.17 0.00 0.00 0.00 0.00 0.02 0.05 3.60 4.10 0.00 0.00 0.17 0.05
Embarrassed 0.10 0.09 0.40 0.96 0.00 0.00 0.26 0.14 0.00 0.19 3.20 3.31 0.00 0.04 0.00 0.00 0.60 0.90 0.00 0.15
Startled 0.05 0.04 0.40 0.40 1.00 1.54 4.50 4.61 0.36 0.12 0.05 0.05 0.75 1.15 0.20 0.70 0.36 0.70 0.28 0.70
Surprised 0.62 0.42 2.10 2.20 0.10 0.30 2.74 2.60 1.20 1.40 0.50 0.40 0.90 0.83 0.10 0.41 0.50 0.47 0.47 0.26
Skeptical 0.03 0.08 0.14 0.03 0.00 0.01 0.00 0.00 3.86 4.14 0.09 0.12 0.52 0.17 0.03 0.06 0.90 0.90 0.12 0.09

Table 2.8: Perceived emotion evaluation metrics.

Emotion | All: Acc, F1-Bin, AUC | Female: Acc, F1-Bin, AUC | Male: Acc, F1-Bin, AUC
Relaxed | 0.89, 0.45, 0.70 | 0.89, 0.48, 0.72 | 0.89, 0.42, 0.68
Amused | 0.80, 0.42, 0.65 | 0.81, 0.45, 0.68 | 0.78, 0.38, 0.62
Nervous | 0.81, 0.42, 0.68 | 0.79, 0.44, 0.68 | 0.84, 0.38, 0.66
Pained | 0.86, 0.35, 0.66 | 0.84, 0.31, 0.64 | 0.89, 0.40, 0.68
Embarrassed | 0.86, 0.44, 0.70 | 0.84, 0.39, 0.66 | 0.87, 0.49, 0.75
Surprised | 0.75, 0.39, 0.62 | 0.75, 0.39, 0.61 | 0.74, 0.39, 0.63
Average | 0.83, 0.41, 0.67 | 0.82, 0.41, 0.67 | 0.84, 0.41, 0.67

The proposed multi-tail 3D CNN (Fig. 2.2) has three 3D convolution layers, each followed by batch normalization and max pooling. The first layer has 16 filters with a kernel size of (3,3,3); the second and third layers are identical, with 32 filters and a kernel size of (3,3,3). All three 3D convolution layers are followed by max pooling layers with a kernel size of (3,3,3) and ReLU activation. For the multi-tailed part of the network, each emotion is predicted by a tail of the network: the deep features from the shared 3D CNN are connected to a fully connected layer with 250 neurons, followed by a single-neuron output layer for regression. We used the Adam [54] optimizer, and as we had only 9 training sequences (i.e. 9 tasks) per subject, we used a batch size of 2.
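A hedged Keras sketch of this multi-tail 3D CNN is given below. Filter counts, kernel sizes, the 250-neuron tails, the Adam optimizer, and the six predicted emotions follow the text; padding, the placement of the ReLU activations, and the mean-squared-error loss are assumptions.

```python
# Hedged Keras sketch of the multi-tail 3D CNN in Fig. 2.2. Layer sizes follow
# the text; padding, the MSE loss, and ReLU placement are assumptions.
from tensorflow.keras import layers, Model

EMOTIONS = ["relaxed", "amused", "nervous", "pained", "embarrassed", "surprised"]

def build_multitail_3dcnn(frames=200, size=175, channels=3):
    inp = layers.Input(shape=(frames, size, size, channels))
    x = inp
    # Shared trunk: three 3D convolutions (16, 32, 32 filters), each followed
    # by batch normalization and (3,3,3) max pooling.
    for filters in (16, 32, 32):
        x = layers.Conv3D(filters, (3, 3, 3), padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)
        x = layers.MaxPooling3D(pool_size=(3, 3, 3))(x)
    shared = layers.Flatten()(x)

    # One tail per self-reported emotion: Dense(250) -> single regression neuron.
    outputs = []
    for name in EMOTIONS:
        tail = layers.Dense(250, activation="relu")(shared)
        outputs.append(layers.Dense(1, name=name)(tail))

    model = Model(inp, outputs)
    model.compile(optimizer="adam", loss="mse")
    return model
```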

2.3.1.3 Data Validation

Since it has been shown that emotions can vary widely across people even within the same situation [9], for our experiments we conduct subject-specific task-out validation (e.g. train on 9 tasks from the same subject, test on 1). This leads to 10 experiments per subject on 70 subjects, giving a total of 700 experiments. For each subject there were 1,800 images for training and 200 images for testing. The average performance over all subjects is reported. For training, the self-reported emotions (multiple) are used as the ground-truth class labels for each task sequence. In BP4D+, not all emotions were reported equally across all tasks. To recognize multiple emotions across tasks, only the 6 emotions that were reported more than 100 times were used (Table 2.8).
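A minimal sketch of this validation protocol follows. `load_clip` and `load_labels` are hypothetical loaders, the labels are assumed to be dicts keyed by the same emotion names as the model's output tails above, and the epoch count is an assumption.

```python
# Sketch of subject-specific task-out validation: for each of the 70 subjects,
# train on 9 tasks and test on the held-out task (700 experiments in total).
import numpy as np

def task_out_validation(subjects, tasks, emotions, load_clip, load_labels, build_model):
    predictions = []
    for subj in subjects:
        for held_out in tasks:
            train_tasks = [t for t in tasks if t != held_out]
            x = np.stack([load_clip(subj, t) for t in train_tasks])
            y = {e: np.array([load_labels(subj, t)[e] for t in train_tasks])
                 for e in emotions}
            model = build_model()
            model.fit(x, y, batch_size=2, epochs=10, verbose=0)   # epoch count assumed
            x_test = np.stack([load_clip(subj, held_out)])
            predictions.append((subj, held_out, model.predict(x_test)))
    return predictions
```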

Table 2.9: Variance and standard deviation of accuracy for recognizing perceived emotion.

| Standard Deviation | Variance
Tasks | 0.101 | 0.01
Subjects | 0.16 | 0.026

2.3.2 Results

In Table 2.8 we report the average accuracy, F1-binary, and AUC scores for male, female, and all subjects. Our network achieved an average accuracy of 83%, an F1-binary of 0.41, and an AUC of 0.67. As can be seen in Table 2.9, the standard deviation and variance for recognizing self-reported emotion are low for both tasks and subjects, showing that the approach is consistent across subjects. As shown in Section 2.2, there are many similarities in the occurrence of self-reported emotions across male and female subjects. This partially explains why the evaluation metrics are similar for the tested emotions across gender (Table 2.8). To the best of our knowledge, this is the first work to report these evaluation metrics on self-reported emotions in BP4D+; therefore, we did not have any works to compare against.
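The per-emotion metrics in Table 2.8 can be computed with scikit-learn as sketched below; thresholding the continuous regression outputs at 0.5 to obtain binary predictions is an assumption, not a detail given in the text.

```python
# Sketch of the per-emotion metrics in Table 2.8 using scikit-learn.
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score

def emotion_metrics(y_true, y_score, threshold=0.5):
    """y_true: binary labels (NumPy array); y_score: continuous network outputs."""
    y_pred = (y_score >= threshold).astype(int)
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1_binary": f1_score(y_true, y_pred),      # F1 of the positive class
        "auc": roc_auc_score(y_true, y_score),
    }
```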

2.4 Conclusion

We have analyzed self-reported emotions, showing that while male and female subjects generally respond to the tasks in a similar manner, there are instances where the two classes diverge, which may be explained by chance. We have also shown which emotions are statistically significant across the different tasks for male and female subjects, for both occurrence and intensity of emotion. Results suggest that, on average, female subjects report a higher intensity when an emotion is elicited in BP4D+. Our analysis agrees with the literature [9] that emotions can vary across subjects within the same situation. We have also proposed a 3D CNN architecture for recognizing self-reported emotions from task sequences. We report multiple evaluation metrics, and to the best of our knowledge, this is the first work to recognize self-reported emotions. While these results are encouraging,

there are some limitations to this work. First, only a subset of subjects from BP4D+ is used. All subjects need to be analyzed, as well as different datasets containing subject self-report. Second, while self-reported emotions vary between subjects, to check the generalizability of the method it will be interesting to see results on leave-one-subject-out validation. Finally, the proposed approach needs to be compared with other approaches (e.g. random forest).

Chapter 3: Impact of Action Unit Occurrence Patterns on Detection

3.1 Introduction

Facial expression recognition is a growing field that has attracted the attention of many research groups, due in part to early work by Picard [93]. A range of modalities have been found useful for this problem, including 2D [131, 23], thermal [124, 82], and 3D/4D data [20, 24]. Promising multimodal approaches have also been proposed [62, 126]. Another interesting approach is based on the detection of action units (AUs). These works are based on the Facial Action Coding System (FACS) [34], which decomposes muscle movements (i.e. facial expressions) into a set of action units. There have been promising approaches to AU detection that have made use of both hand-crafted features and deep learning. Liu et al. [65] proposed TEMT-NET, which uses the correlations between AU detection and facial landmarks. Their proposed network performs action unit detection, facial landmark detection, and thermal image reconstruction simultaneously. The thermal reconstruction and facial landmark detection provide regularization on the learned features, providing a boost to AU detection performance. Werner et al. [128] investigated AU intensity estimation using modified random regression forests. They also developed a visualization technique for the relevance of the features. Their results suggest that precomputed features are not enough to detect certain AUs. Li et al. [61] proposed the EAC-Net, which is a deep network that enhances and crops regions of interest for AU detection. They found the proposed approach allows for robustness to face position and pose. They also integrated facial attention maps that correspond to areas of the face with active AUs. Their results suggest that the use of these attention maps can enhance the learning of the network.

Figure 3.1: AU detection accuracy vs occurrence in BP4D. Bars are the average number of AU occurrences, per frame, across all subjects. Line graphs are different F1 scores, of methods in the literature, for each AU. "Ones" is what happens when we manually label all 1's (i.e. AU is active) for each of the AUs. Best viewed in color.

Figure 3.2: AU detection accuracy vs occurrence in DISFA. Bars are the average number of AU occurrences, per frame, across all subjects. Line graphs are different F1 scores, of methods in the literature, for each AU. The scales on the left and right are different due to the low number of active AUs compared to some of the F1 scores. Best viewed in color.

Romero et al. [98] proposed a convolutional neural network to address multi-view AU detection. They explicitly model temporal information using optical flow, predicting the overall view of a sequence before detecting the AUs. Their cascaded approach evaluates temporal AU networks trained for specific views. The Facial Expression Recognition and Analysis challenge (FERA) [119, 118, 120] also focused on the detection of AUs. This challenge was designed to address the need for a common evaluation protocol for AU detection. Girard et al. [43] looked at how much training data is needed for AU detection. They investigated 80 subjects, and more than 350,000 frames of data, using SIFT features. Their results suggest that variation in the total number of subjects is an important factor in increasing AU detection accuracies, as they achieved their maximum accuracy with a large number of subjects. Ertugrul et al. [35] investigated the efficacy of cross-domain AU detection. Using shallow and deep approaches, their results suggest that more varied domains, as well as deep learning, can increase generalizability. Jeni et al. [50] evaluated the effect of imbalanced data on AU detection. They evaluated accuracy, F1 score, Cohen's kappa, Krippendorf's alpha, the receiver operating characteristic (ROC) curve, and the precision-recall curve. Their findings indicated that the skewed data attenuated all metrics except for ROC, which, however, may mask poor performance, and they recommend reporting skew-normalized scores. While this work from Jeni et al. is similar to our work, there are some significant differences. They compare data skew with performance metrics, only using their own method across all the datasets. There are no comparisons across multiple methods. We show how data distribution (skew) affects the F1 score across different state-of-the-art methods, including a random guess. Additionally, we show the correlation between the F1 scores of different methods and the data distribution. Finally, we show that by looking at F1 scores alone it is not possible to say which method is better. Validating an approach based on the F1 score alone is no better than chance. Even though the attenuation of F1 under label imbalance is well known, most state-of-the-art methods still use F1 scores to validate

their approach. We analyzed a large portion of state-of-the-art methods and showed that all methods are closely correlated. Although these works have shown success in detecting AUs, to the best of our knowledge, none have investigated the impact of AU occurrence patterns on detection. Motivated by this, we focus our experiments on AU occurrence patterns, showing they have a significant impact on accuracy. The contributions of our work can be summarized as:

1. An in-depth analysis of the impact of AU occurrence patterns on detection results, showing that the patterns directly impact evaluation metrics.

2. Multi- and single-AU detection experiments are conducted, validating that AU occurrence patterns directly impact evaluation metrics. We show that a network that performs poorly on single-AU detection can learn the patterns, thus improving overall performance.

3. A new method for AU detection is proposed that directly trains deep neural networks on occurrence patterns to help mitigate their negative impact on AU detection accuracy.

Figure 3.3: AU detection accuracy vs. occurrence in BP4D+. Bars are the average number of AU occurrences, per frame, across all subjects. Line graphs are F1 scores of different methods in the literature for each AU. The scales on the left and right are different due to the low number of active AUs compared to some of the F1 scores. Best viewed in color.

3.2 Action Unit Occurrence Patterns

3.2.1 Datasets

DISFA [73] is a spontaneous dataset for studying facial action unit intensity. It contains 27 subjects (12 women/15 men) with an age range of 18 to 50 years old. Ethnicities include 21 Euro-American, 3 Asian, 2 Hispanic, and 1 African American subject. Each subject watched a 4-minute video clip, with segments taken from YouTube that were meant to elicit spontaneous expressions. As all frames (130,815 frames) are AU annotated, we used all of them for our analysis. The following AUs are used, which are the most commonly used in the literature (Fig. 3.2): AU1, AU2, AU4, AU6, AU9, AU12, AU25, and AU26.

BP4D [137] is a multimodal (e.g. 2D, 3D, AUs) facial expression dataset which was used in the Facial Expression Recognition and Analysis challenges in 2015 [118] and 2017 [120], where the focus was the detection of AU occurrence and intensity. It includes 41 subjects (23 women/18 men) with an age range of 18 to 29 years old. Ethnicities include 11 Asian, 6 African American, 4 Hispanic, and 20 Euro-American subjects. Each subject participated in the following 8 tasks (with the expected emotional response shown in parentheses): interview (happy), watch video clip (sad), startled probe (surprise), improvise a song (embarrassed), threat (fear), cold pressor (pain), insult (anger), and smell (disgust). For our analysis, all AU-annotated frames are used from this dataset (146,847 frames). The following AUs are used, which are the most commonly used in the literature (Fig. 3.1): AU1, AU2, AU4, AU6, AU7, AU10, AU12, AU14, AU15, AU17, AU23, and AU24.

BP4D+ [139] is similar to BP4D; however, it includes more modalities such as thermal and physiological data. It includes 140 subjects (82 female/58 male) with an age range of 18 to 66 years old. Ethnicities include 15 Hispanic, 64 Caucasian, 15 African American, 46 Asian, and one subject identified as other. Similar to BP4D, each subject participated in tasks; however, there were 10 tasks in total compared to 8 in BP4D. The following tasks were performed (with the expected emotional response shown in parentheses): interview

(happy), watch 3D video of self (surprise), watch 911 video clip (sad), experience sudden burst of sound (startle), answer true or false questions (skeptical), improvise singing a song (embarrassed), threat (fear), cold pressor (pain), complaint about poor performance (anger), and smell (disgust). In total, 197,875 frames are used from this dataset, which are from the 4 annotated tasks (T1, T6, T7, and T8), and the same AUs as BP4D are used (Fig. 3.3): AU1, AU2, AU4, AU6, AU7, AU10, AU12, AU14, AU15, AU17, AU23, and AU24.

Table 3.1: Number of AUs and Occurrence Criteria of datasets

Dataset    Number of AUs    Occurrence Criterion
DISFA      12 AUs           Level 1 (A)
BP4D       12 AUs           Level 1 (A)
BP4D+      12 AUs           Level 2 (B)

3.2.2 Class Imbalance and AU Detection

It has been shown that AU class imbalance has a negative impact on evaluation metrics [50]. To further extend this notion, we will show that class imbalance and AU occurrence patterns are directly correlated. In order to analyze the impact of occurrence patterns on AU detection, we must first analyze the imbalance of AU classes. To facilitate this, we have analyzed state-of-the-art literature regarding their F1-binary scores from AU detection experiments on BP4D, BP4D+, and DISFA. F1-Binary is defined as

F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} = \frac{\mathrm{TruePositives}}{\mathrm{TruePositives} + \frac{1}{2}(\mathrm{FalsePositives} + \mathrm{FalseNegatives})}    (3.1)

As can be seen in Figs. 3.1, 3.2, and 3.3, the occurrence of AUs in each dataset is imbalanced. For example, in BP4D, AU 10 has an average occurrence of 0.62, while AU 2 has an average occurrence of 0.18. There is a direct correlation between these occurrences and their F1-binary scores. AU 10 is one of the highest-occurring AUs and also has one of the highest F1-binary scores across the literature, with an average F1-binary score of 0.75. Similarly, AU 2 is one of the lowest-occurring AUs and has one of the lowest F1-binary scores across the literature,

Table 3.2: Correlation between F1-binary scores and AU class imbalance.

Method           BP4D    DISFA    BP4D+
LSTM [25]        0.680   -        -
LSVM [59]        0.957   0.347    -
DAM [33]         0.922   -        -
MDA [113]        0.948   -        -
GFK [45]         0.951   -        -
iCPM [135]       0.967   -        -
JPML [141]       0.869   -        -
DRML [142]       0.949   0.844    -
FVGG [59]        0.890   0.785    -
E-Net [60]       0.944   0.919    -
EAC [60]         0.953   0.472    0.84
ROI [59]         0.966   0.773    -
R-T1 [59]        0.931   0.816    -
R-T2 [59]        0.970   -        -
APL [142]        -       0.5098   -
D-PattNett [87]  0.946   -        -
DSIN [28]        0.931   0.792    -
JAA-Net [103]    0.847   0.918    0.833
JÂA-Net [104]    0.863   0.774    0.87
ARL [105]        0.808   0.954    0.827
Average          0.865   0.744    0.828

with an average F1-binary score of 0.36. Considering this, our analysis shows that current state-of-the-art results in AU detection follow an explicit trend which is correlated to the imbalance of the AUs. To further illustrate this trend, we also calculated the F1-binary score obtained if we were to manually predict all AUs as active, for all frames. As can be seen in Fig. 3.1, the general trend that the F1-binary scores follow, for all methods in BP4D, is the same as labeling all AUs as active. The general trend that the F1-binary scores follow can be seen visually in Figs. 3.1, 3.2, and 3.3. To statistically analyze this trend, we calculated the correlation between the class imbalance and the F1-binary scores of the methods shown in Figs. 3.1, 3.2, and 3.3. We define correlation as

\mathrm{corr} = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\sum (x - \bar{x})^2 \sum (y - \bar{y})^2}},

where x and y are the average occurrences of the classes and the F1-binary scores, respectively. For BP4D, BP4D+, and DISFA the average correlations

are 0.865, 0.828, and 0.744, respectively. These results suggest that there is a high correlation between the class imbalance and the reported F1-binary scores for AU detection (Table 3.2). This further suggests that AUs with high F1-binary scores have a lot of training data, while AUs with low F1-binary scores have a small amount. Although there is a general trend of AU occurrence versus F1-binary accuracy, it is important to note that there are some anomalies in the F1-binary scores for some AUs and methods. For example, on BP4D, Chu et al. [25] use high-intensity AUs, equal to or higher than A-level, as positive samples and the rest as negative. Due to the imbalance of AUs in DISFA, some of the experiments train on BP4D and test on DISFA, which is a common approach for testing on this dataset. This can explain, in part, some of the lower correlations in Table 3.2 (e.g. [25, 60, 59]). Along with the correlation between the class imbalance and F1-binary scores, we also calculated the variance,

\mathrm{var} = \frac{\sum (x - \bar{x})^2}{n - 1}, \quad \mathrm{std} = \sqrt{\mathrm{var}},

of the F1-binary scores between each of the methods detailed in Figs. 3.1, 3.2, and 3.3. There is a small amount of variance between the methods across all studied AUs. In BP4D, the average variance is 0.0089 (std of 0.0923), in BP4D+ it is 0.002 (std of 0.0452), and in DISFA the average variance is 0.0383 (std of 0.18). This suggests that the investigated F1-binary scores are all similar, within a small accuracy range. While the general variance and standard deviations are low, there are some outliers, especially in DISFA. For example, AU 9 has a variance of 0.064 and a standard deviation of 0.253. This can also be seen visually in Fig. 3.2, in the F1-binary score of Li et al. [60]. As previously detailed, this can partially be explained by training on BP4D and testing on DISFA. See Table 3.3 for the standard deviation between all methods. This analysis naturally leads to the question of why this specific trend occurs in AU detection. We hypothesized that this is due to the AU occurrence patterns that occur across all of the sequences in the data.
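For reference, the correlation, variance, and standard deviation above can be reproduced with a few lines of NumPy. The snippet below is a minimal sketch assuming the per-AU base rates and a method's per-AU F1-binary scores are available as arrays; the variable names and the placeholder values are illustrative, not taken from the experiments.

```python
import numpy as np

# Illustrative placeholder inputs: average occurrence (base rate) of each AU and the
# F1-binary score a given method reports for that AU, in the same AU order.
base_rates = np.array([0.21, 0.18, 0.20, 0.46, 0.55, 0.62, 0.56, 0.47, 0.17, 0.34, 0.16, 0.15])
f1_scores  = np.array([0.40, 0.36, 0.43, 0.77, 0.74, 0.75, 0.84, 0.61, 0.39, 0.56, 0.41, 0.44])

# Pearson correlation between class imbalance (base rates) and F1-binary scores.
corr = np.corrcoef(base_rates, f1_scores)[0, 1]

# Sample variance (n - 1 in the denominator) and standard deviation of the F1-binary scores.
var = np.var(f1_scores, ddof=1)
std = np.sqrt(var)

print(f"correlation = {corr:.3f}, variance = {var:.4f}, std = {std:.4f}")
```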

Table 3.3: Standard deviation of F1-binary scores, for each individual AU, between all methods (Figs. 3.1 and 3.2).

AU        BP4D     DISFA    BP4D+
AU 1      0.0671   0.1418   0.0759
AU 2      0.0853   0.1301   0.0613
AU 4      0.1197   0.1702   0.0825
AU 6      0.0935   0.1559   0.0203
AU 7      0.0660   -        0.0418
AU 9      -        0.2541   -
AU 10     0.1006   -        0.0214
AU 12     0.0917   0.1455   0.0216
AU 14     0.0885   -        0.0979
AU 15     0.0965   -        0.0524
AU 17     0.0847   -        0.0191
AU 23     0.0663   -        0.0202
AU 24     0.1076   -        0.0282
AU 25     -        0.3172   -
AU 26     -        0.1573   -
Average   0.0890   0.1840   0.0452

Table 3.4: BP4D AU occurrence patterns and total number of frames for each pattern. NOTE: top 5 and bottom 2 rows are patterns with highest and lowest counts, respectively.

Frame Count    Pattern (AUs 1, 2, 4, 6, 7, 10, 12, 14, 15, 17, 23, 24)
10630          0 0 0 0 0 0 0 0 0 0 0 0
8402           0 0 0 1 1 1 1 1 0 0 0 0
6883           0 0 0 1 1 1 1 0 0 0 0 0
3571           0 0 1 0 0 0 0 0 0 0 0 0
1814           0 0 0 0 0 1 1 0 0 0 0 0
1              0 0 1 0 1 1 0 0 1 1 1 0
1              1 1 0 1 1 1 1 0 1 1 1 1

Table 3.5: DISFA AU occurrence patterns and total number of frames for each pattern. NOTE: top 5 and bottom 2 rows are patterns with highest and lowest counts, respectively.

Frame Count    Pattern (AUs 1, 2, 4, 5, 6, 9, 12, 15, 17, 20, 25, 26)
48616          0 0 0 0 0 0 0 0 0 0 0 0
8305           0 0 0 0 0 0 0 0 0 0 1 0
6035           0 0 1 0 0 0 0 0 0 0 0 0
5167           0 0 0 0 1 0 1 0 0 0 1 0
4994           0 0 0 0 0 0 1 0 0 0 1 1
1              1 1 1 0 0 0 0 1 1 1 0 0
1              0 1 0 1 0 0 0 0 0 1 1 1

Table 3.6: BP4D+ AU occurrence patterns and total number of frames for each pattern. NOTE: top 5 and bottom 2 rows are patterns with highest and lowest counts, respectively.

Frame Count    Pattern (AUs 1, 2, 4, 6, 7, 10, 12, 14, 15, 17, 23, 24)
34992          0 0 0 0 0 0 0 0 0 0 0 0
34678          0 0 0 1 1 1 1 0 1 0 0 0
16483          0 0 0 1 1 1 1 0 1 0 0 1
9234           0 0 0 1 1 1 1 0 0 0 0 0
6811           0 0 0 1 1 1 1 0 1 1 0 1
1              1 1 1 0 0 1 1 1 1 1 0 0
1              1 0 1 1 0 1 1 0 1 0 0 1

Figure 3.4: Patterns and their frame counts, compared to the total percentage of patterns for BP4D, DISFA, and BP4D+. (Bar chart; x-axis: frame count ranges from 5000-11000 down to 0-5; y-axis: percentage of patterns.)

3.2.3 Action Unit Occurrence Patterns

In Section 3.2.2, we analyzed the common AUs found in the literature. Here, the AUs investigated are the same for BP4D and BP4D+ (Table 3.3); however, for DISFA we investigate all available AUs (1, 2, 4, 5, 6, 9, 12, 15, 17, 20, 25, 26). We are interested in which

AU patterns occur most often, where a pattern is defined as the vector v = [AU1, ..., AUk], where each element can take the values {0, 1} and k is the number of AUs of interest (e.g. k = 12 for DISFA). To investigate this, we looked at how many individual patterns exist, as well as the total frame count for each pattern (i.e. how many frames of data have that pattern). There are 1,692 different patterns in BP4D, 564 in BP4D+, and 457 in DISFA that contain the investigated AUs for each dataset, respectively. Tables 3.4, 3.5, and 3.6 show the 5 patterns with the highest, and the 2 with the lowest, frame counts in BP4D, DISFA, and BP4D+, respectively. Tables 3.4 - 3.6 reveal a significant imbalance not only in the number of individual AUs, but also in the total number of AU occurrence patterns. As can be seen in each of the tables, the top pattern is all 0's (no active AUs). In BP4D, there are 10,630 frames (≈ 7.2% of the AU-annotated frames) where there are no active AUs. In BP4D+, there are 34,992 frames (≈ 17.7% of the AU-annotated frames), and in DISFA, there are 48,616 frames that have no active AUs (≈ 37% of the data). As can be seen in Table 3.4, the bottom 2 patterns of BP4D both occur only once, while the top pattern that has active AUs occurs 8,402 times. Similarly, in DISFA (Table 3.5) and BP4D+ (Table 3.6) the bottom 2 patterns also occur only once, while the top pattern with active AUs occurs 8,305 times in DISFA and 34,678 times in BP4D+. While there are fewer overall patterns in DISFA and BP4D+, there is still a large imbalance in the active and inactive AUs per frame. This difference in the number of patterns contributes, at least partially, to the imbalance of AUs in these datasets. It is also important to note that for each dataset, there is a similar trend in the percentage of patterns that exist in a range of frame counts (Fig. 3.4). For example, >72% of the patterns in BP4D, >67% in BP4D+, and >65% of the patterns in DISFA occur <50 times in all three datasets.
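The pattern counts reported above can be obtained by treating each frame's binary AU vector as a tuple and counting how often each tuple occurs. A minimal sketch follows, assuming the per-frame AU occurrence labels are available as a NumPy array of shape (frames, AUs); the function name and the randomly generated labels are illustrative only.

```python
from collections import Counter
import numpy as np

def count_au_patterns(au_labels: np.ndarray) -> Counter:
    """Count AU occurrence patterns.

    au_labels: array of shape (num_frames, num_aus) with 0/1 entries,
    e.g. 12 AUs for BP4D, BP4D+, or DISFA.
    Returns a Counter mapping each pattern (tuple of 0/1) to its frame count.
    """
    return Counter(tuple(row) for row in au_labels.astype(int))

# Example with random placeholder labels (not real annotations).
rng = np.random.default_rng(0)
labels = (rng.random((1000, 12)) > 0.7).astype(int)
patterns = count_au_patterns(labels)
print(len(patterns), "unique patterns")
print(patterns.most_common(5))   # five most frequent patterns, as in Tables 3.4-3.6
```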

Table 3.7: Top AU occurrence patterns, for each sequence, from DISFA. NOTE: Last column is target emotion of task.

Seq    Pattern (AUs 1, 2, 4, 5, 6, 9, 12, 15, 17, 20, 25, 26)    Emo
1      0 0 0 0 1 0 1 0 0 0 1 0                                   Happy
2      0 0 0 0 0 0 1 0 0 0 0 0                                   -
3      0 0 0 0 0 0 1 0 0 0 0 0                                   Surp
4      0 0 0 0 0 0 0 0 0 0 1 1                                   Fear
5      0 0 0 0 0 0 0 0 0 0 1 0                                   Disg
6      0 0 0 0 0 0 0 0 0 0 1 0                                   -
7      0 0 0 0 0 0 0 0 0 0 1 0                                   Sad
8      0 0 1 0 0 0 0 0 0 0 0 0                                   -
9      0 0 0 0 0 0 0 0 0 0 1 0                                   Surp

On the other hand, <0.2% of the patterns in BP4D, <0.03% in BP4D+, and <2% of the patterns in DISFA occur >5000 times. Our analysis suggests this imbalance of patterns has a direct impact on the evaluation metrics, as there are many patterns that take up a large percentage of the total patterns; however, each of these patterns has a small number of instances that can effectively be used for training machine learning classifiers. We also analyzed the AU occurrence patterns as they relate to the sequences and tasks in DISFA, BP4D, and BP4D+. Each of the sequences/tasks corresponds to a specific emotion that was meant to be elicited. For each sequence or task, we hypothesized that different, specific patterns would exist based on FACS [34]. Our analysis of these patterns does not support this hypothesis, as can be seen in Tables 3.7, 3.8, and 3.9. In BP4D, it can be seen that the patterns with the highest frame count for tasks T2 and T3 (surprise and sad) are the same, with the only active AU being AU 4. Tasks 1, 4, and 5 (happy, startle, skeptical) also share the same highest-frame-count pattern, with AUs 6, 7, 10, 12, and 14 being active. In DISFA, as can be seen in Table 3.7, the patterns with the highest frame count for each sequence consist largely of inactive AUs. Sequences 2, 3, and 5-9 only have one active AU, sequence 1 has three active AUs, and sequence 4 has two active AUs. The active AUs make up approximately 11% of the AUs across the top patterns for each sequence.

Table 3.8: Top AU occurrence patterns, for each task, in BP4D. NOTE: Last column is target emotion of task.

Task   Pattern (AUs 1, 2, 4, 6, 7, 10, 12, 14, 15, 17, 23, 24)    Emo
T1     0 0 0 1 1 1 1 1 0 0 0 0                                    Happy
T2     0 0 1 0 0 0 0 0 0 0 0 0                                    Surp
T3     0 0 1 0 0 0 0 0 0 0 0 0                                    Sad
T4     0 0 0 1 1 1 1 1 0 0 0 0                                    Startle
T5     0 0 0 1 1 1 1 1 0 0 0 0                                    Skept
T6     0 0 0 1 1 1 1 0 0 0 0 0                                    Emb
T7     0 0 0 0 0 1 1 1 0 0 0 0                                    Fear
T8     0 0 0 1 1 1 1 0 0 0 0 0                                    Pain

Table 3.9: Top AU occurrence patterns, for each task, in BP4D+. NOTE: Last column is target emotion of task.

Task   Pattern (AUs 1, 2, 4, 6, 7, 10, 12, 14, 15, 17, 23, 24)    Emo
T1     0 0 0 1 1 1 1 0 1 0 0 0                                    Happy
T6     0 0 0 1 1 1 1 0 1 0 0 1                                    Emb
T7     0 0 0 1 1 1 1 0 1 0 0 0                                    Fear
T8     0 0 0 0 0 0 0 0 1 0 0 0                                    Pain

In BP4D+ (Table 3.9), the top occurrence patterns for T1 and T7 are exactly the same. Task 6 (T6) is also similar, with the only difference being one extra active AU (AU24). Using action units, Lucey et al. [68] showed that pain can be recognized from the UNBC-McMaster shoulder pain expression archive database [69]. Although these results are encouraging, our analysis of AU patterns across multiple emotions suggests that using AUs to detect expressions across a range of emotions (e.g. happy, sad) is a difficult problem due to the similar AU patterns between them. This could explain, in part, the difficulty in detecting AUs, as machine learning classifiers are trained on similar patterns even though different expressions are used as training data. Considering this, we hypothesized that AU patterns have a direct impact on AU detection. We evaluate this hypothesis in the next section by training on patterns vs. training on single AUs.

Table 3.10: Correlation between metrics and class imbalance for experiment 1.

Metric       Network 1    Network 2    Ones
F1-binary    0.9758       0.9489       0.9912
F1-macro     0.7570       0.6349       0.9961
AUC          0.5795       0.6406       N/A
F1-micro     -0.2656      -0.3118      0.9913

3.3 Impact of AU Patterns on Detection

To further investigate the impact of AU occurrence patterns on detection, we conducted in-depth experiments on BP4D. We chose BP4D for our experiments because BP4D+ and DISFA contain a larger imbalance of active versus inactive AUs, and not enough samples for training [61]. We evaluated two different convolutional neural networks (CNNs) for this investigation to validate that the results are not classifier (i.e. network) specific. First, we implemented the CNN detailed by Ertugrul et al. [35], which we refer to as Network 1. For our second CNN, we used a network with two convolutional layers with 8 and 16 filters, each followed by a max pooling layer, followed by two more convolutional layers with 16 and 20 filters, another max pooling layer, and batch normalization. All CNN layers had a kernel of (3,3). Before the output layer, there were three dense layers with 4096, 4096, and 512 neurons, respectively. The ReLU activation function and a dropout of 0.4 were also used. We refer to this as Network 2; a sketch of this architecture is shown below. In addition to these two networks, we also had a control group called 'Ones', in which we labeled all AUs as active in all frames; this is used as a baseline for comparisons between the two networks.
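The following is a minimal Keras sketch of the Network 2 architecture described above. Only the layer structure (conv layers with 8, 16, 16, and 20 filters, max pooling, batch normalization, dense layers of 4096, 4096, and 512 neurons, ReLU, dropout of 0.4, (3,3) kernels) follows the description; the input resolution, pooling placement, optimizer, and output activation are not specified in the text and are assumptions made for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_network2(input_shape=(96, 96, 1), num_aus=12):
    """Sketch of Network 2; unspecified details (input size, optimizer, output head) are assumed."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(8, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(16, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.Conv2D(16, (3, 3), activation="relu", padding="same"),
        layers.Conv2D(20, (3, 3), activation="relu", padding="same"),
        layers.MaxPooling2D(),
        layers.BatchNormalization(),
        layers.Flatten(),
        layers.Dense(4096, activation="relu"),
        layers.Dropout(0.4),
        layers.Dense(4096, activation="relu"),
        layers.Dropout(0.4),
        layers.Dense(512, activation="relu"),
        layers.Dense(num_aus, activation="sigmoid"),  # multi-label AU output (assumed)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["binary_accuracy"])
    return model
```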

To investigate the impact of the AU patterns on detection accuracy, we calculated the F1-binary, F1-micro, F1-macro, and Area Under the Curve (AUC) scores. The F1 score is defined as F_1 = \frac{2 \cdot \mathrm{TruePositives}}{2 \cdot \mathrm{TruePositives} + \mathrm{FalsePositives} + \mathrm{FalseNegatives}} [22], and AUC is defined as the area under the curve of the true positive rate vs. the false positive rate. F1-binary is the F1 score for the positive class and does not consider the negative class, whereas F1-macro is the simple average of the F1 scores of all classes. F1-micro is calculated globally by counting the total true positives, false positives, and false negatives; the score is computed using all estimated labels and manual annotations, without averaging over the folds.
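These four metrics are available in scikit-learn; the sketch below shows how they could be computed for a single AU, with placeholder labels and scores that are purely illustrative.

```python
from sklearn.metrics import f1_score, roc_auc_score

y_true  = [1, 0, 1, 1, 0, 0, 1, 0]                   # manual AU annotations (placeholders)
y_pred  = [1, 0, 0, 1, 0, 1, 1, 0]                   # thresholded network predictions
y_score = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3]   # continuous network outputs for AUC

f1_binary = f1_score(y_true, y_pred, average="binary")  # positive (active) class only
f1_macro  = f1_score(y_true, y_pred, average="macro")   # unweighted mean over both classes
f1_micro  = f1_score(y_true, y_pred, average="micro")   # pooled TP/FP/FN
auc       = roc_auc_score(y_true, y_score)
print(f1_binary, f1_macro, f1_micro, auc)
```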

40 1 0.8 0.6

0.4 Score 0.2 0 AU 1AU 2AU 4AU 6AU 7AU 10AU 12AU 14AU 15AU 17AU 23AU 24 (a) Comparison of F1-binary scores. 1 0.8 0.6

0.4 Score 0.2 0 AU 1AU 2AU 4AU 6AU 7AU 10AU 12AU 14AU 15AU 17AU 23AU 24 (b) Comparison of F1-micro scores. 1 0.8 0.6

0.4 Score 0.2 0 AU 1AU 2AU 4AU 6AU 7AU 10AU 12AU 14AU 15AU 17AU 23AU 24 (c) Comparison of F1-macro scores. 1 0.8 0.6

0.4 Score 0.2 0 AU 1AU 2AU 4AU 6AU 7AU 10AU 12AU 14AU 15AU 17AU 23AU 24 (d) Comparison of AUC scores.

Network 1 [35] Network 2 Control Group (’Ones’)

Figure 3.5: Comparisons of different accuracy metrics on multiple networks (experiment 1 ).

To facilitate our investigation, we conducted two experiments. First, using the entire dataset, we detected multiple AUs (i.e. the entire sequence/pattern). For this experiment, we used all AU-labeled frames from BP4D; we refer to this as experiment 1. Second, we detected individual AUs, which we refer to as experiment 2 in the rest of this work; this was done to test what impact removing the patterns has on AU detection accuracy. Both experiments were subject-independent (i.e. the same subject did not appear in training and testing), and the subjects in each fold were fixed so that both experiments trained and tested on the same subjects and images. Three-fold cross-validation was used for all experiments.
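Subject-independent folds of this kind (no subject shared between train and test) can be built with scikit-learn's GroupKFold, using the subject ID as the group. The sketch below uses randomly generated placeholder data and names; it is not the original experiment code.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# X: per-frame features, y: per-frame AU labels, subject_ids: one ID per frame (all placeholders).
X = np.random.rand(1000, 64)
y = np.random.randint(0, 2, size=(1000, 12))
subject_ids = np.random.randint(0, 41, size=1000)   # e.g. 41 BP4D subjects

gkf = GroupKFold(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups=subject_ids)):
    # No subject appears in both the training and testing indices.
    assert set(subject_ids[train_idx]).isdisjoint(subject_ids[test_idx])
    print(f"fold {fold}: {len(train_idx)} train frames, {len(test_idx)} test frames")
```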

3.3.1 Multi-AU detection (Experiment 1)

When using the 'Ones' control group as a baseline, it can be seen that there is a high correlation between the class imbalance and the F1-binary, macro, and micro scores (Table 3.10). There is an average correlation with the class imbalance of 0.9928 across the three metrics. While the accuracies vary between the different metrics, it can be seen that the trend is similar (Fig. 3.5). For the control group, the AUC correlation is N/A, as a score of 0.5 was obtained for each AU (Fig. 3.5d). As can be seen in Fig. 3.5a, the F1-binary scores of the tested networks are similar to the control group, and all of them follow a similar trend, which is the class imbalance. This suggests that the F1-binary score may not be an accurate metric to distinguish between correct detection and guessing (i.e. "guessing" all AUs as ones/active). This can be explained, in part, by the fact that the F1-binary score only looks at the positive class [22]. This can also be seen in Table 3.10 (first row), as there is a high correlation between the F1-binary score of both networks and the class imbalance. We also calculated the correlation across each AU of both networks to the control group (i.e. how correlated the F1-binary accuracies are for each AU).

42 1 0.8 0.6

0.4 Score 0.2 0 AU 1AU 2AU 4AU 6AU 7AU 10AU 12AU 14AU 15AU 17AU 23AU 24 (a) Comparisons of F1-binary score. 1 0.8 0.6

0.4 Score 0.2 0 AU 1AU 2AU 4AU 6AU 7AU 10AU 12AU 14AU 15AU 17AU 23AU 24 (b) Comparisons of F1-micro score. 1 0.8 0.6

0.4 Score 0.2 0 AU 1AU 2AU 4AU 6AU 7AU 10AU 12AU 14AU 15AU 17AU 23AU 24 (c) Comparisons of F1-macro score. 1 0.8 0.6

0.4 Score 0.2 0 AU 1AU 2AU 4AU 6AU 7AU 10AU 12AU 14AU 15AU 17AU 23AU 24 (d) Comparisons of AUC score.

Network 1 [35] Network 2 Control Group (’Ones’)

Figure 3.6: Comparisons of different accuracy metrics on different networks (experiment 2 ).

Table 3.11: Correlation between metrics and class imbalance for experiment 2.

Metric       Network 1    Network 2
F1-binary    0.8783       0.1121
F1-macro     0.6869       -0.4281
AUC          0.5840       N/A
F1-micro     0.6290       -0.4679

This resulted in correlations of 0.98 and 0.94 for Networks 1 and 2, respectively, showing that both give similar results to labeling all AUs as active. We also looked at using F1-micro and F1-macro as the metrics for AU detection accuracy. F1-micro does not follow the control group trend; it has a correlation of -0.29 and -0.31 for Networks 1 and 2, respectively, with the control group (Fig. 3.5b). Both networks also had a low negative correlation with the class imbalance (Table 3.10). F1-macro also does not follow the control group trend (Fig. 3.5c). Although F1-macro is more correlated than F1-micro, with correlations of 0.74 and 0.62 for Networks 1 and 2, respectively, it is less correlated than F1-binary (Table 3.10). The final metric we looked at, for AU detection accuracy, was AUC. This metric has a lower correlation with the class imbalance compared to F1-macro, and it can also be seen that it does not follow the class imbalance trend (Fig. 3.5d). Again, as the AUC for the control group was 0.5, we were unable to calculate the correlation between the control group and the two networks. It is also important to note that the correlations between the control group and F1-binary, macro, and micro closely resemble the correlation with the class imbalance. This is due to the high correlation of the control group with the data.

Table 3.12: Evaluation on tested networks for top 66 AU occurrence patterns.

AU          1      2      4      6      7      10     12     14     15     17     23     24     Avg
F1 Bin
  N1        0.2085 0.2571 0.3238 0.844  0.7581 0.901  0.9095 0.5665 0.1668 0.4357 0.237  0.0301 0.4698
  N3        0.1982 0.2428 0.2057 0.7931 0.7409 0.8641 0.8758 0.5134 0.1378 0.3574 0.1118 0.2147 0.438
  N4        0.2154 0.2666 0.2331 0.8267 0.7493 0.8753 0.8724 0.5919 0.1633 0.4241 0.2286 0.261  0.4756
  N5        0.1556 0.1569 0.0952 0.8104 0.7406 0.8732 0.8886 0.545  0.0939 0.2422 0.0422 0.0132 0.491
  N6        0.2872 0.2633 0.3433 0.7919 0.7495 0.8684 0.8842 0.5269 0.1643 0.3224 0.1464 0.0958 0.5293
AUC
  N1        0.6444 0.6917 0.7857 0.9095 0.8037 0.9442 0.9631 0.6724 0.7463 0.7619 0.6499 0.7712 0.7787
  N3        0.563  0.5757 0.56   0.8    0.7249 0.8283 0.8543 0.6167 0.568  0.6097 0.5643 0.525  0.6492
  N4        0.599  0.704  0.6957 0.9027 0.8037 0.9504 0.9484 0.678  0.7469 0.7699 0.668  0.8117 0.7772
  N5        0.6467 0.5972 0.6257 0.7899 0.7175 0.7868 0.8467 0.6749 0.5713 0.591  0.5677 0.528  0.658
  N6        0.6469 0.6425 0.7115 0.858  0.7488 0.9008 0.9307 0.6381 0.6934 0.6205 0.6181 0.6441 0.7211
F1 Micro
  N1        0.6797 0.7432 0.7712 0.746  0.6792 0.7555 0.7956 0.6027 0.7397 0.686  0.7807 0.8402 0.735
  N3        0.8025 0.8373 0.8337 0.8032 0.7289 0.846  0.8673 0.6202 0.8497 0.7703 0.8982 0.9287 0.8155
  N4        0.8323 0.869  0.8613 0.838  0.755  0.8673 0.8867 0.6277 0.9103 0.8253 0.9483 0.9479 0.8474
  N5        0.8705 0.886  0.8449 0.7933 0.7055 0.8406 0.8627 0.6033 0.9163 0.7995 0.9445 0.9396 0.8339
  N6        0.8314 0.8396 0.8233 0.7883 0.7211 0.8408 0.8614 0.6103 0.8672 0.7564 0.9151 0.9285 0.8153
F1 Macro
  N1        0.59   0.607  0.6126 0.7459 0.6593 0.7219 0.7883 0.6003 0.5737 0.67   0.6184 0.6227 0.6508
  N3        0.5587 0.569  0.563  0.7997 0.7236 0.834  0.8593 0.612  0.543  0.606  0.5406 0.5317 0.645
  N4        0.576  0.604  0.609  0.7536 0.6613 0.7477 0.8137 0.5843 0.5757 0.6483 0.605  0.6143 0.6494
  N5        0.5426 0.5477 0.5052 0.7904 0.6993 0.8278 0.8531 0.5967 0.525  0.5632 0.5068 0.491  0.6207
  N6        0.5957 0.5864 0.6206 0.7879 0.7174 0.8315 0.855  0.5974 0.5459 0.5866 0.5508 0.5293 0.6504

3.3.2 Single-AU Detection (Experiment 2)

In experiment 1, we showed that class imbalance has a direct impact on AU detection, as some of the evaluation metrics are similar to manually labeling all AUs as one. To validate our hypothesis that action unit occurrence patterns have a direct impact on AU detection, we trained separate networks for each individual AU (i.e. single-AU detection). This experimental design allows us to directly compare against the correlations and evaluation metrics obtained from experiment 1. Similar to experiment 1, we calculated the correlations between the F1-binary, macro, micro, and AUC scores and the class imbalance for the two networks (Table 3.11). In experiment 1, the correlations with the class imbalance, for each metric, across the two network architectures were similar (e.g. F1-binary correlations of 0.9758 and 0.9489 for Networks 1 and 2, respectively). Conversely, for experiment 2, the correlations between the two networks are not similar. For example, the F1-micro correlation for Network 1 is 0.6290, while Network 2 has a correlation of -0.4679. Similar to experiment 1, the AUC for our control group was 0.5 for all AUs; however, the AUC for Network 2 was also 0.5 for all AUs (Fig. 3.6d). This can be explained, in part, by the general performance of Network 2 as seen in Fig. 3.6. Each metric, for Network 2, generally performed poorly compared to experiment 1 (i.e. multi-AU detection), which supports our hypothesis that AU occurrence patterns have a direct impact on AU detection. Our results suggest that the classifiers (i.e. Networks 1 and 2) are learning the patterns of AU occurrences and not one AU at a time. This is validated through experiment 2, which shows that accuracies decrease when patterns are removed from training. Although the overall trend is that accuracies decrease when patterns are removed, it is interesting to note that AUs with lower base rates can be negatively impacted, as seen in Figs. 3.5 and 3.6. It has been shown that AU co-occurrences are important [109], [141]; however, co-occurrence matrices are generally shown between two AUs [74]. Here, we take that one step further and show that the entire pattern of active and inactive AU occurrences directly impacts AU detection. Our results suggest that higher

AU scores can be achieved with a multi-AU detection approach, which validates other work that has shown similar findings [59].

3.4 Training on AU Occurrence Patterns

We propose a method that uses AU occurrence patterns to train deep neural networks to increase AU detection accuracies. To facilitate this, we conducted experiments on BP4D, as BP4D+ and DISFA contain a larger imbalance of active versus inactive AUs [61] and have more patterns with fewer active AUs. We conducted two experiments: in the first experiment we created a subset of the dataset using the top 66 patterns, which we refer to as experiment 3; in the second experiment, we used all patterns, which we refer to as experiment 4. Following the experimental design from Section 3.3, subject-independent, 3-fold cross-validation was used, where the subjects in each fold were fixed so that all the experiments were trained and tested on the same subjects.

3.4.1 Neural Network Architectures

For training neural networks on AU occurrence patterns, we trained six networks. First, we trained the network from Ertugrul et al. [35] (Network 1). Second, we modified this network by changing the output layer from 12 to 66 neurons, to detect individual patterns. In this network we used class labels [0, 65], with one class for each pattern; for example, the pattern where all AUs are inactive would be labeled with a class of 0. We also changed the loss function to categorical cross-entropy and the monitoring metric to accuracy. We refer to this network as Network 3. Next, we froze the CNN layers from Network 3, split the network after the CNN layers, and added a 400-neuron fully connected layer followed by a 12-neuron output layer. To train this network we followed the same parameters as Network 1. We refer to this network as Network 4. To further validate the proposed approach, we also trained the network described by Fung et al. [40], which we

refer to as Network 5. We then applied to Network 5 the same modifications that were made to Network 1 to create Networks 3 and 4. We also unfroze the CNN layers of Network 4 and retrained it at a lower learning rate of 0.0001 on just the top 66 patterns to create Network 6. When we retrain using all AU occurrence patterns, we refer to it as Network 7. A sketch of the pattern-classification modifications is given below.
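The following Keras sketch illustrates the two modifications described above under stated assumptions: the base CNN is assumed to be the convolutional trunk of Network 1 or 5 built as a functional model, and the function names, layer name argument, and optimizer are illustrative rather than taken from the original code. A Network 3 style model replaces the 12-unit multi-label output with a 66-way softmax over pattern classes trained with categorical cross-entropy; a Network 4 style model then freezes the pattern-trained CNN layers and adds a 400-neuron dense layer followed by a 12-unit AU output.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_pattern_classifier(base_cnn: tf.keras.Model, num_patterns=66):
    """Network 3 style: classify which of the top AU occurrence patterns a frame shows."""
    x = layers.Flatten()(base_cnn.output)
    out = layers.Dense(num_patterns, activation="softmax")(x)
    model = models.Model(base_cnn.input, out)
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
    return model

def build_au_head_from_patterns(pattern_model: tf.keras.Model, last_cnn_layer_name: str, num_aus=12):
    """Network 4 style: freeze the pattern-trained CNN, split after the CNN layers,
    and add a 400-neuron dense layer followed by a 12-unit AU output."""
    cnn_out = pattern_model.get_layer(last_cnn_layer_name).output
    for layer in pattern_model.layers:
        layer.trainable = False          # freeze the CNN layers trained on patterns
    x = layers.Flatten()(cnn_out)
    x = layers.Dense(400, activation="relu")(x)
    out = layers.Dense(num_aus, activation="sigmoid")(x)
    model = models.Model(pattern_model.input, out)
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["binary_accuracy"])
    return model
```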

3.4.2 Top 66 AU Occurrence Patterns (Experiment 3)

For training, we initially used the top 25 occurring patterns, but found that AU 23 does not occur in the top 25 patterns and that others, such as AU 2 and AU 15, occur only a small number of times. For reliable training and testing of all AU occurrences, we used patterns which occur at least 397 times in the dataset, which gave us the top 66 patterns (a sketch of this selection is shown below). For this experiment, we trained Networks 1, 3, 4, 5, 6, and 7. As can be seen from Table 3.12, detecting AUs by first training on AU occurrence patterns (Network 4) yields the highest average F1-binary and F1-micro scores compared to Networks 1 and 3. This shows that the proposed method has a positive impact on detection accuracy. An interesting result is that Network 1 has an F1-binary score of 0.03141 for AU 24, while the proposed approach has an F1-binary score of 0.261 for AU 24. This could be partially explained by the lack of training data (i.e. only the top 66 patterns). Network 1 was unable to accurately detect this particular AU, but the proposed approach was able to increase the accuracy with the limited amount of available training data. As Network 1 is from Ertugrul et al. [35], this validates that the proposed approach can be integrated into current state-of-the-art networks to further improve upon their results. This can also be seen in Table 3.15, where the proposed approach also has the lowest variance, across the F1 scores of all AUs, for experiments 3 and 4.
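Selecting the training patterns as described above amounts to thresholding the pattern counts and mapping the surviving patterns to class labels. The sketch below continues the Counter-based counting shown in Section 3.2.3; the function and variable names are illustrative, not from the original code.

```python
from collections import Counter

def select_top_patterns(pattern_counts: Counter, min_frames=397):
    """Keep patterns occurring at least `min_frames` times and map them to class labels 0..K-1."""
    kept = [p for p, c in pattern_counts.most_common() if c >= min_frames]
    label_of = {pattern: idx for idx, pattern in enumerate(kept)}
    # Sanity check: every AU of interest should be active in at least one retained pattern
    # (this is what rules out a smaller cutoff such as the top 25 patterns, which misses AU 23).
    num_aus = len(kept[0])
    covered = [any(p[i] for p in kept) for i in range(num_aus)]
    return kept, label_of, covered
```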

Table 3.13: Evaluation on tested networks for all AU occurrence patterns.

AU          1      2      4      6      7      10     12     14     15     17     23     24     Avg
F1 Bin
  N1        0.4068 0.3795 0.3649 0.745  0.7408 0.8174 0.827  0.6224 0.3087 0.5977 0.3704 0.3363 0.5431
  N4        0.4339 0.3889 0.3775 0.7085 0.684  0.7978 0.7929 0.6023 0.3485 0.6203 0.4105 0.3763 0.5451
  N5        0.2137 0.2054 0.1749 0.4625 0.7026 0.7869 0.7622 0.3963 0.1214 0.3192 0.1161 0.1412 0.3669
  N7        0.3579 0.2816 0.226  0.6899 0.7125 0.7772 0.7974 0.56   0.2143 0.4789 0.2233 0.1376 0.4547
AUC
  N1        0.625  0.6553 0.6896 0.8293 0.7187 0.814  0.8845 0.63   0.646  0.7107 0.644  0.745  0.716
  N4        0.6693 0.714  0.7086 0.8343 0.7467 0.8351 0.8872 0.6521 0.6671 0.7501 0.6904 0.8034 0.7465
  N5        0.5876 0.5813 0.5937 0.6777 0.6142 0.6937 0.7219 0.564  0.5581 0.6084 0.5681 0.6536 0.6185
  N7        0.6684 0.6217 0.6377 0.7772 0.7087 0.7683 0.8395 0.585  0.6213 0.6494 0.6199 0.7283 0.6854
F1 Micro
  N1        0.6797 0.7432 0.7712 0.746  0.6792 0.7555 0.7956 0.6027 0.7397 0.686  0.7807 0.8402 0.735
  N4        0.6813 0.7613 0.771  0.7543 0.6717 0.7777 0.8197 0.586  0.7573 0.673  0.7987 0.8413 0.7411
  N5        0.7632 0.8108 0.7617 0.6365 0.5988 0.6849 0.6713 0.5533 0.7979 0.6271 0.794  0.8271 0.7105
  N7        0.7681 0.7752 0.7571 0.6952 0.6621 0.7027 0.7556 0.5648 0.7946 0.6386 0.793  0.8346 0.7285
F1 Macro
  N1        0.59   0.607  0.6126 0.7459 0.6593 0.7219 0.7883 0.6003 0.5737 0.67   0.6184 0.6227 0.6508
  N4        0.576  0.604  0.609  0.7536 0.6613 0.7477 0.8137 0.5843 0.5757 0.6483 0.605  0.6143 0.6494
  N5        0.5359 0.5475 0.5166 0.5714 0.5191 0.5706 0.5978 0.5533 0.4867 0.5242 0.4993 0.5389 0.5329
  N7        0.6081 0.5737 0.5409 0.694  0.6474 0.6584 0.7422 0.5648 0.548  0.5989 0.5515 0.5229 0.6041

Table 3.14: Evaluation on N3 for unseen patterns.

AU        F1 Bin   AUC      F1 Micro   F1 Macro
1         0.2763   0.5480   0.6570     0.5250
2         0.2443   0.5500   0.7171     0.5277
4         0.1950   0.5270   0.6829     0.4988
6         0.6433   0.6590   0.6468     0.6453
7         0.6453   0.5876   0.5950     0.5857
10        0.7367   0.6377   0.6673     0.6367
12        0.7280   0.6850   0.6913     0.6827
14        0.5663   0.5483   0.5457     0.5430
15        0.2693   0.5337   0.6577     0.5224
17        0.4855   0.5780   0.5620     0.5493
23        0.2483   0.5320   0.6557     0.5123
24        0.1947   0.5384   0.7277     0.5147
Average   0.4361   0.5771   0.6505     0.5620

As can also be seen in Table 3.12, Network 6 has higher average F1-binary, AUC, and F1-macro scores compared to Network 5. Along with this, Network 7 has average F1-binary, AUC, F1-micro, and F1-macro scores of 0.4055, 0.7389, 0.8443, and 0.6225, respectively. Compared to Networks 5 and 6, that is a higher AUC and F1-micro, which mirrors the results obtained when comparing Network 4 to Networks 1 and 3. These results show the utility of the proposed approach in helping to mitigate the negative impact that the imbalance of AU occurrence patterns has on detection evaluation metrics. We are also interested in what would happen if we were given an unseen pattern to detect. To facilitate this, we trained Network 3 on the top 66 patterns and then tested on all other available patterns. As can be seen in Table 3.14, there is a decrease in overall scores; however, the proposed approach was still able to detect the unseen patterns with a relatively low decrease in performance metrics. For example, when detecting unseen patterns, the F1-binary score was 0.4361, whereas when training and testing only on the top 66 patterns with Network 3, the F1-binary score was 0.4380, a decrease of only 0.0019. It has been shown that cross-domain AU detection is a difficult problem [36], which could explain, in part, the decrease in accuracy; testing on unseen patterns can be seen as an out-of-distribution problem, which is similar to what is seen in cross-domain AU detection.

Table 3.15: Variance of AU F1-binary scores.

Network      Experiment 3    Experiment 4
Network 1    0.0988          0.0553
Network 3    0.0908          -
Network 4    0.0824          0.0529

3.4.3 All AU Occurrence Patterns (Experiment 4)

In this experiment, instead of the top 66 patterns, we used all patterns from BP4D. We trained Networks 1, 4, 5, and 7 with the same CNN weights as in Section 3.4.1. For this experiment, we did not train Networks 3 and 6, as there were not enough instances of all the patterns. As can be seen in Table 3.13, the average F1-binary score, AUC, and F1-micro of Network 4 are higher than those of Network 1. For Network 7, the average F1-binary score, AUC, F1-micro, and F1-macro are higher than those of Network 5. These results further validate that the proposed approach can be integrated into current networks [35, 40] to help improve state-of-the-art results. Similar to experiment 3, and as can also be seen in Table 3.15, pre-training the CNN on the patterns decreases the variance of the F1 scores.

3.5 Conclusion

We showed that AU class imbalance and patterns are correlated, and that the same patterns exist across multiple emotions, which can impact how AU detection methods learn. We validated this through multi- and single-AU experiments, showing that performance decreases when networks are unable to learn the patterns. An interesting result from these experiments is the considerable decrease in performance of Network 2 on single-AU detection, compared to Network 1 (Fig. 3.6). This has led us to a new hypothesis that specific network architectures are less important when AU patterns are detected compared to single AUs. Considering this, we conducted evaluations on multiple network types for single-AU and pattern detection.

To utilize the impact of AU patterns on detection, we proposed a new method that uses the AU patterns to directly train deep neural networks. We have shown that this method, compared to training on individual action units, can improve accuracy, as well as lower the overall variance of the F1 scores across action units. In other words, the proposed approach helps to mitigate the negative impact that AU class imbalance has on detection results, as can be seen in Figs. 3.1, 3.2, and 3.3. This approach can be integrated into any deep neural network architecture, helping to further improve state-of-the-art results.

Chapter 4: Multimodal Fusion of Physiological Signals and Facial Action Units for Pain Recognition²

4.1 Introduction

Assessing pain can be difficult, and subject self-report is the most widely used form of assessment. When pain is reported in a clinical setting, the measures are subjective and no temporal information is available [68]. Automatic methods to recognize pain have the potential to improve quality of life and provide an objective means for assessment. Considering this, in recent years there have been encouraging works in this direction that include analysis of facial expressions, physiological signals, and kinematics and muscle movement. To facilitate advances in automatically recognizing pain, Aung et al. [6] developed the EmoPain dataset, which contains multi-view face videos, audio, 3D motion capture, and electromyographic signals from back muscles. The dataset contains 22 patients (7 male/15 female) with chronic back pain and 28 healthy control subjects (14 male/14 female). They released baseline results on both facial expressions and muscle movements. Using this dataset, Wang et al. [121] proposed the deep learning architecture BodyAttentionNet to recognize protected behavior [112]. This architecture learns temporal information, including which body parts are better suited to this task. They showed improved recognition accuracies, as well as a model with 6 to 20 times fewer parameters compared to the state of the art. Olugbade et al. [85] proposed a method for automatically recognizing pain levels using kinematics and muscle activity. They trained a random forest and a support vector machine, with data from

²This chapter was published at IEEE Face and Gesture 2020 [47], ©2021 IEEE. Reprinted with permission; the permission is included in Appendix C.

the EmoPain dataset, showing accurate results for recognizing low and high pain, as well as healthy control levels. Fabiano et al. [39] proposed a method for fusing physiological signals for emotion recognition. They developed a weighted fusion approach which showed that the fused signals can accurately detect pain; they were able to detect pain in the BP4D+ dataset [139] 98.48% of the time. Olugbade et al. [86] investigated how affective factors in chronic pain interfere with daily functions. They found that movement data can recognize distinct distress and pain levels. Lucey et al. [68] proposed an active appearance model-based approach to detect pain in videos, from the UNBC-McMaster shoulder pain dataset [69], using FACS action units [34]. Their approach is encouraging, showing that action units can be used to detect pain, which motivates the use of this modality here. Motivated by these works, we propose a method for recognizing pain via the fusion of physiological data (e.g. heart rate, respiration, blood pressure, and electrodermal activity) and facial action units [34]. Using the BP4D+ dataset [139], we fuse action units from the most expressive part of the face, along with physiological signals that have been synced to have a one-to-one correspondence with each facial image that contains action units. We conduct experiments to validate the utility of the proposed fusion method, showing the fused signals result in a positive impact on the accuracy of detecting pain. We further analyze the correlations between the modalities, giving insight into future applications of pain recognition. The contributions from this work are 3-fold and can be summarized as follows:

1. A method for recognizing pain through the fusion of physiological signals and action units is proposed.

2. Insight into the correlations between physiological signals and action units is shown.

3. Cross-gender (female vs. male) experiments are conducted for pain recognition, and correlations between modalities are shown.

Table 4.1: Pain recognition results using physiological signals, action units, and the fusion of both. M=Male; F=Female.

Modalities       All Subjects        Male                Female              Trained (M)/Tested (F)   Trained (F)/Tested (M)
                 Acc      F1 Score   Acc      F1 Score   Acc      F1 Score   Acc      F1 Score         Acc      F1 Score
Physiological    77.70%   0.3        75.25%   0.285      76.98%   0.269      69.14%   0.35             75.43%   0.219
Action Units     89.02%   0.734      88.00%   0.668      90.73%   0.778      88.27%   0.725            87.50%   0.753
Fused            89.20%   0.75       88.58%   0.689      91.35%   0.787      88.27%   0.725            86.31%   0.75

4.2 Fusion and Experimental Design

To recognize pain, we propose to fuse physiological signals and action units (AUs) from the most facially expressive segments of sequences of tasks meant to elicit emotion. The most expressive segment is defined as the frames where action units [34] have been manually annotated by experts. To facilitate our experimental design, we use the BP4D+ multimodal emotion corpus [139].

4.2.1 Multimodal Dataset (BP4D+)

BP4D+ [139] is a multimodal dataset that includes thermal and RGB images, 2D and 3D facial landmarks, manually coded action units (33 total), 4D facial models, and 8 physiological signals (diastolic blood pressure, mean blood pressure, EDA, systolic blood pressure, raw blood pressure, pulse rate, respiration rate, and respiration volts). There are 140 subjects (58 male and 82 female) with an age range of 18 to 66. Action units are manually coded on the most expressive frames of each sequence (task). Emotions are elicited through 10 tasks that the subjects perform. In this work, as we are focusing on pain, our experiments focus on the cold pressor task (i.e. the subjects place their hand in a bucket of ice water for an extended time). We use the physiological signals and AUs of 139 of the subjects in our experimental design. Subject F082 has some missing features; therefore, this subject is removed, which follows the experimental design from Fabiano et al. [39]. Approximately 20 seconds of the most expressive frames of each sequence are coded, and four tasks, with the target emotions of happy, embarrassment, fear, and pain, are coded for action units. Considering this, we use these four emotions for our experiments.

4.2.2 Syncing Physiological Signals with AUs

The frame rate of the video recordings in BP4D+ is 25 frames per second (fps), whereas the sampling rate of the physiological signals is approximately 1000 samples per second. Due to this difference in sample rate between the AU frames and the physiological signals, synchronization is needed to perform fusion of the modalities.

To do this, we first calculate the number of frames in the sequence and then downsample the physiological signals to the same number using the one-step bootstrapping technique [101]. Using this technique on the physiological signals, we reduce the number of samples while retaining the important information in the signals [39]. This then gives us a one-to-one correspondence (i.e. they are synced) between each video frame and physiological sample. It is important to note that the video sequences are not the same length (i.e. there are different numbers of frames across subjects). Considering this, we do a final re-sample, of the most expressive segments, to 5000 frames. We use the physiological signals and AUs from these 5000 synced frames to facilitate our fusion approach.
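The syncing step can be approximated with standard resampling. The sketch below uses simple linear interpolation (np.interp) as a stand-in for the one-step bootstrapping technique cited above, which is not reproduced here; the function name and the randomly generated signal are illustrative, while the 25 fps and roughly 1000 Hz rates follow the text.

```python
import numpy as np

def sync_signal_to_frames(signal: np.ndarray, num_frames: int) -> np.ndarray:
    """Downsample a 1D physiological signal so it has one sample per video frame.
    Linear interpolation is used here as a simple stand-in for the bootstrapping step."""
    src = np.linspace(0.0, 1.0, num=len(signal))
    dst = np.linspace(0.0, 1.0, num=num_frames)
    return np.interp(dst, src, signal)

# Example: a ~1000 Hz signal for a 20-second expressive segment, 25 fps video,
# followed by the final re-sample of the synced segment to 5000 frames.
raw = np.random.randn(20 * 1000)            # placeholder physiological samples
per_frame = sync_signal_to_frames(raw, 20 * 25)
fixed_len = sync_signal_to_frames(per_frame, 5000)
```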

4.2.3 Fusion of Physiological Signals and AUs

Given the 5000 frames of synced physiological signals and AUs, from the most expressive segments, we concatenate them to form a new feature vector. For each frame, we concatenate 33 AUs and 8 physiological signals, giving the feature vector,

f = [AU1, ... , AU33, Phys1, ... , Phys8] (4.1)

where AUi are the 33 AUs and Physj are the 8 physiological signals as detailed in Section 4.2.1. As we want to incorporate temporal information into our pain detection approach, we then concatenate f from each of the 5000 synced frames, giving us our final feature vector, F = [f1, ... , f5000]. This final concatenation results in a feature vector of size 205,000 (41 × 5000), which is then used to recognize pain. The proposed fusion results in one feature vector for each sequence, giving us a total of 556 feature vectors across the 4 tasks (139 subjects × 4 tasks).
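The fusion of Eq. 4.1 is a per-frame concatenation followed by flattening over the 5000 synced frames. A minimal NumPy sketch, with illustrative names and placeholder data:

```python
import numpy as np

def fuse_aus_and_physiology(aus: np.ndarray, phys: np.ndarray) -> np.ndarray:
    """aus: (5000, 33) AU occurrences; phys: (5000, 8) synced physiological samples.
    Returns a single feature vector of length 5000 * 41 = 205,000 for one sequence."""
    per_frame = np.concatenate([aus, phys], axis=1)   # f_t = [AU1..AU33, Phys1..Phys8]
    return per_frame.reshape(-1)                      # F = [f1, ..., f5000]

aus = np.random.randint(0, 2, size=(5000, 33)).astype(float)   # placeholder AU labels
phys = np.random.randn(5000, 8)                                 # placeholder signals
F = fuse_aus_and_physiology(aus, phys)
assert F.shape == (205_000,)
```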

4.2.4 Pain Recognition Experimental Design

As detailed in Section 4.2.1, BP4D+ has sequences from four tasks that have been manually AU coded (happy, embarrassment, fear, and pain). As we are interested in recognizing pain, we treat the pain sequences as our positive class (i.e. pain) and the data from the other three sequences as our negative class (i.e. no pain). This experimental design is consistent with other works that have investigated pain recognition on BP4D+ [39]. Using this experimental design, we train a random forest classifier [14], using the feature vectors F, to recognize pain vs. no pain. The random forest consisted of 275 trees, and we validated our approach using subject-independent 10-fold cross-validation on all 139 subjects. Along with training and testing on all subjects, we also performed cross-gender and gender-specific experiments. For each of these experiments, we evaluated each modality independently, that is, physiological signals and action units, as well as the proposed fusion approach (Section 4.2.3).
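The classifier itself is straightforward with scikit-learn. The sketch below assumes the fused vectors and pain labels are available; the placeholder data, label ordering, and feature dimension are illustrative only (the real fused vectors are length 205,000), and subject-independent folds are built with GroupKFold as in Chapter 3.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GroupKFold, cross_val_score

X = np.random.randn(556, 1000).astype(np.float32)   # placeholder; real fused vectors are length 205,000
y = np.array([1, 0, 0, 0] * 139)                    # pain vs. no-pain, one label per sequence (placeholder order)
subjects = np.repeat(np.arange(139), 4)             # 4 task sequences per subject

clf = RandomForestClassifier(n_estimators=275, n_jobs=-1, random_state=0)
scores = cross_val_score(clf, X, y, groups=subjects, cv=GroupKFold(n_splits=10), scoring="f1")
print("mean F1 over subject-independent folds:", scores.mean())
```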

4.3 Results

4.3.1 Pain Recognition

For our pain recognition experiments, we report overall accuracy, as well as the F1-score for each experiment. For our subject independent experiments on all tested subjects, we achieved an accuracy of 89.2% and F1-Score of 0.75 when fusing the physiological signals and action units. Although physiological signals performed reasonably well as a single modality, with 77.7% accuracy, their F1-Score is low with 0.3. Using this modality, the majority of instances were classified as no pain which is the class containing more samples. It is interesting to note that action units alone achieved an accuracy of 89.02% and F1-Score of 0.734. These results agree with the literature that AUs can be a powerful representation for automatically recognizing pain [68].

Figure 4.1: Visual comparison between respiration and blood pressure, and the facial expressions (i.e. action units) occurring at the same time. Panels: (a) respiration, (b) blood pressure, (c) Seq 1, (d) Seq 2, (e) most expressive. During Seq 1, the subject has placed their hand in the bucket of ice water. Seq 2 is approximately 20 seconds after placing the hand in the bucket, and the most expressive sequence occurs after 40 seconds of the hand being in the bucket.

For our gender-based experiments, similar results are obtained. When training and testing on male subjects or on female subjects, the accuracy of the physiological signals is approximately 76%, while the F1-Score is approximately 0.28. Again, the majority of classifications were labeled as no pain. In both cases (male or female) action units performed well, with an accuracy of approximately 89% and F1-Scores of 0.668 and 0.778 for the male and female experiments, respectively. Along with testing and training on male or female subjects, we also conducted cross-gender experiments (e.g. trained on female and tested on male). In these experiments, the results are again similar, where physiological signals have a low F1-Score of approximately 0.28. The interesting result is that the fusion did not help in this scenario. The fusion for training on male and testing on female resulted in the same accuracy of 88.27% and F1-Score of 0.725 as action units alone. When training on female and testing on male, the fusion caused a slight decrease in accuracy of 1.19% compared to action units alone. This can partially be explained by the differences in correlations between male and female subjects when analyzing the most expressive parts of the sequences (see Figs. 4.2b and 4.2c). See Table 4.1 for all experimental results.

4.3.2 Comparison to State of the Art

To the best of our knowledge, there is no direct comparison to this work using the selected modalities from BP4D+. Fabiano et al. [39] investigated pain recognition using fused physiological signals from the BP4D+ dataset; however, their approach was not subject independent, whereas we used a subject-independent experimental design. As this is the closest approach to ours, we detail their results in comparison to ours. Using their proposed approach and a feed-forward neural network, they achieved a pain recognition accuracy of 98.48%; they also evaluated their method using a support vector machine (92.64%), random forest (90.27%), and naïve Bayes (89.77%). Using a random forest trained on physiological signals only, with a subject-independent approach, we achieved an accuracy of 77.7%. When fusing the physiological signals with AUs, the proposed approach achieves

an accuracy of 89.2%, which is comparable to the results from Fabiano et al. when using a random forest.

4.4 Discussion

From our analysis of physiological signals and AUs, with respect to pain, we make two observations.

1. The peaks of the physiological signals and of the facial expressions occur at different times (see Fig. 4.1).

2. Physiological signals are highly correlated during the most expressive parts of the sequence, however, they are not correlated when the entire signal is analyzed (Fig. 4.2).

In Fig. 4.1 we show two physiological signals, respiration (Fig. 4.1a) and blood pressure (Fig. 4.1b), and the facial expressions at three segments of the physiological signal. At the instant where we see the most variance in the physiological signals (Seq 1), there is no facial motor response (Fig. 4.1c), and when the facial expression occurs (Fig. 4.1e), the physiological signals have leveled off and there is little variance in the signals. To further analyze this effect of the variance of the signals at different times during the task, we calculated the correlation [11] between the signals during the most expressive sequence, as well as over the entire signal (Fig. 4.2). As seen in Fig. 4.2d, the physiological signals during the entire sequence are not well correlated, but when we sample the physiological signals over the most expressive part of the sequence, they become highly correlated, as seen in Fig. 4.2a. This increase in correlation can be seen across gender as well. One possible explanation for the low accuracy of pain recognition using physiological signals is this change in the variance and correlation of the physiological signals during the most expressive part of the video. It has been shown that physiological signals with higher variance tend to perform better at recognizing pain [39].

Figure 4.2: Correlation matrices of physiological signals and action units. Panels: (a) physiological signals with average AU occurrence (most expressive sequence), (b) the same for male subjects, (c) the same for female subjects, (d) complete physiological signal, (e) complete physiological signal for male subjects, (f) complete physiological signal for female subjects. The top row of matrices shows the most expressive parts of the sequence; the bottom row shows the entire physiological signal, without AUs.

We have seen that exposure to ice-cold water causes physiological changes in the autonomic nervous system, affecting metabolic rate, the circulatory system, and the respiratory system (Fig. 4.1). This analysis is consistent with results from the medical literature. Circulatory dynamics show the effects of different temperatures on the cardiovascular system (Fig. 4.1b). Immersion in ice-cold water stimulates thermoreceptors, activates different regulatory systems (the sympathetic nervous system and endocrine function) and different effector mechanisms, and increases heart rate, systolic blood pressure, and diastolic blood pressure [110]. The increase in heart rate continues, or is maintained, until the exposure to ice-cold water is discontinued [80]. A one-minute immersion of the hand or foot may increase systolic and diastolic blood pressure by 10-20 mmHg [67]. Immersion in ice-cold water also has significant effects on the autonomic nervous system. Breathing slows down to retain carbon dioxide, resulting in acidosis when exposed to low temperatures; this occurs due to mild alterations in the brain stem neuronal systems [31]. Voluntary movements, which include motor actions and emotional expression, are dependent on the sensory stimulus, which occurs due to the peripheral connection with the brain [7]. Objects emotionally felt are the result of the action-reaction of the cerebral cortex and diencephalon [7]. When subjects are exposed to ice-cold water, a spike in physiological changes occurs first, that is, an increase in heart rate and blood pressure and a decrease in respiratory rate, followed by a motor response, that is, facial expressions. This is due to the reflex arc mechanism, which explains the synapse between sensory and motor neurons. When exposed to external stimuli, sensory signals are sent to the central nervous system through the nerve pathway, which causes changes at the physiological level and relays the signals back to the motor neurons, and the motor actions occur [13]. Considering this, these results suggest the need for physiological signals and coded AUs for the entire sequences, for improved pain recognition applications.

4.5 Conclusion

We have proposed a multimodal approach to detect pain, using physiological signals fused with action units. We have conducted experiments on the BP4D+ dataset using the single modalities, as well as the fusion of both. The experiments also include same- and cross-gender validation. Using the most facially expressive segments from the sequences shows that physiological data alone results in a lower F1-Score, while action units give significantly improved results, and the fusion of both further improves accuracies when evaluations include all subjects or the same gender. Along with our experimental design, we have also given an in-depth analysis of the correlation between the physiological signals and the action units. This analysis shows that there is a high correlation between these modalities during the most expressive segments of the sequences; however, when the entire sequence is analyzed, the correlation decreases. Due to the temporal nature of emotion (Fig. 4.1) and the differences in correlations between the most expressive sequences and the others, we suggest larger and more varied datasets that include physiological signals, as well as facial images that have action units coded for all sequences.

Chapter 5: Recognizing Context Using Facial Expression Dynamics from Action Unit Patterns

5.1 Introduction

Affective Computing is an interdisciplinary (e.g. computer science, psychology, and cognitive science) research area that investigates and analyzes human emotion, which has been extensively studied in psychology. Keltner [52] investigated the distinct display of embarrassment compared to the related emotion of amusement, which are often confused. He found that both the morphology and dynamic patterns of these two emotions were distinct and were differentially related to subject self-reports. Ambadar et al. [4] investigated the timing and morphology of smiles that are perceived as different emotions, such as amused and nervous. Their results suggest that smiles that are perceived as similar have different amplitudes, durations, onsets/offsets, and different AUs. For example, smiles that were perceived as amused were longer, had more amplitude, and more often contained AU6. This work motivates our current work, as it suggests intensity and temporal dynamics are important for smiles. We hypothesize that the same information will result in improved recognition of context from facial expression dynamics. The work from Cowen et al. [30] also gives us motivation for our hypothesis. They discuss how emotion is evoked under specific circumstances (i.e. context) and perceived under distinct facial expressions. They also argue that to fully understand what facial expressions can tell us, we need to map continuous face, body, and voice signals (i.e. temporal information) onto multifaceted experiences using statistical and machine learning methods. An interesting study from Cordaro et al. [27] identified how emotional displays (e.g. expression) vary across cultures. Across 22 emotions and 5 cultures, they found that culture and context both impacted the expression of emotion beyond the six basic emotions that are widely studied in the literature.

65 Figure 5.1: Overview of proposed approach to model facial dynamics from AU patterns. Images are extracted from an input video and per-frame action units are then up-sampled for AU occurrence and interpolated for intensity. The AU patterns are then transformed to the image-space, which are then used to train separate convolutional neural networks to recognize context. NOTE: separate networks are used to compare the efficacy of AU occurrence patterns vs. AU intensity patterns. sic emotions that are widely studied in the literature. Barrett et al.[9] focused on the 6 most popular emotions in the literature: anger, fear, disgust, sadness, happiness, and surprise. Their results suggest that while expressions sometimes correspond to specific emotions (e.g. frown when sad), culture and context matter. Along with this, facial movements such as a scowl don’t always correspond to an emotional state. One of the recommendations from their work is the need to research context-sensitive ways in which people move their facial muscles for emotion. Our proposed work addresses this limitation through an investigation into recognizing context from the temporal dynamics of facial expressions. Motivated by these works in psychology, affective computing has seen considerable growth in the past two decades. It has also been shown that emotion is an important part of human intelligence [94]. It is a fundamental feature of cognition, and emotional outcomes are central to thoughts, actions, and perception [115]. Considering this, the field has broad, world-wide impact with applications in areas such as medicine [16, 117], education [32, 138], security

66 [51, 144], entertainment [18, 81], and better customer shopping experiences [88, 17]. To facilitate these applications, a broad spectrum of modalities are used including but not limited to 2D [58, 122] and 3D [3, 96] face images, thermal images [92, 19], physiological signals [39, 143], gestures [77, 99], and audio [133, 49]. Using 2D images, FACS action units (AUs) [34] are often detected, which are commonly used to quantify facial movement. In Chapter 3, we showed that AU patterns impact detection and proposed an approach to train neural networks on AU patterns to improve recognition of AUs. Considering this, we hypothesize that AU patterns can also be used to recognize context used to elicit an emotional response. The main contributions of this work can be summarized as follows:

1. A new approach to modeling the temporal dynamics of facial expression, from AU patterns, is proposed. We show the proposed approach is able to accurately recognize the context used to elicit an emotional response on three state-of-the-art datasets.

2. Experiments are conducted on both AU occurrence and intensity patterns for recogniz- ing context. Our experiments suggest intensity patterns are better suited to recognize context compared to occurrence patterns. Along with intensity, our results also suggest the need for a larger number of AUs for more accurate recognition of context.

3. To the best of our knowledge, this is the first work to propose the use of AU patterns for recognizing context, providing a baseline for the community.

5.2 Related Works

As we propose to investigate facial expression dynamics using action unit patterns, we first detail recent works that have detected FACS [34] action units (AUs). Next, we detail related works that investigate the use of action units for expression recognition.

5.2.1 Action Unit Detection

In recent years, there has been encouraging work for detecting action units. Niu et al. [83] use person-specific face information to detect AUs. More specifically, they take advantage of the relationship of person-specific face regions to learn local information to help with the inconsistency of local face regions. This approach reduces the influence of person-specific facial information, resulting in more discriminative features. They show encouraging results across 12 AUs compared to the state of the art on the BP4D [137] and DISFA [73] databases. Shao et al. [105] proposed an end-to-end deep learning framework that makes use of channel-wise and spatial attention learning. These are used to extract local features for AU detection, which are then refined using pixel-level relation learning. Tu et al. [116] proposed IdenNet, which accounts for the same AUs varying among subjects. The proposed approach first performs face clustering, where subject-specific identifying features are extracted. Then, an AU detection network normalizes these features for detection. Their results suggest that this approach mitigates bias that individual subjects can cause in AU detection. Liu et al. [66] proposed using a 3D Morphable Model [12] that has parameters conditioned on target AU labels. This allows them to synthesize new facial images with the desired AU labels, which can be used to augment deep learning-based approaches to AU intensity detection. They considered a Conditional Generative Adversarial Network [78] and a Conditional Adversarial Autoencoder [140] to synthesize the target AU-labelled data. Their approach also showed encouraging results on the BP4D database. DGAN [123] was proposed as an approach to semi-supervised and weakly-supervised AU detection. For the semi-supervised scenario, it makes use of partial AU labels along with full expression labels. In the weakly-supervised scenario only the full expression labels are used. This approach allows DGAN to take advantage of the relationship between multiple AUs, and between expressions and AUs. Along with a classifier, the approach also uses an image generator and a discriminator, where the classifier and generator make face-AU-expression tuples. These tuples include the AU and expression relationships that are then used to reconstruct the input face and AU labels. D-PattNet [87] was proposed to handle 3 major challenges in patch-based AU detection: (1) head rotation; (2) co-occurrence of AUs; and (3) spatiotemporal dynamics. To overcome these three challenges, D-PattNet encodes local patches early in the network, then fuses patch-based information later in the network using an attention mechanism, and finally uses a 3D-CNN for spatiotemporal dynamics. Another interesting challenge in AU detection is cross-domain detection (e.g. wild settings). Considering this, Ertugrul et al. [36] conducted an extensive literature review on cross-domain AU detection and experiments to address some of the limitations found in their review. They investigated four public databases (EB+, which is an extension of BP4D+ [139], GFT [42], DISFA [73], and UNBC Shoulder Pain [69]). Their results suggest that both shallow and deep methods perform similarly; however, on average deep methods performed slightly better. They also found that AU specificity decreased in a cross-domain scenario, and suggest that more attention should be paid to the variation of characteristics across domains.

5.2.2 Expression Recognition and Action Units

Along with detecting AUs, there has been encouraging work in using them for detecting expression (e.g. pain recognition). Sikander et al. [108] proposed an approach to 3D action unit recognition to recognize driver fatigue. They used a photometric stereo test to reconstruct 3D AUs, where faces are first detected in images, then ROIs around the eyes and mouth are extracted. Then Uncalibrated Photometric Stereo [95] is used to create the 3D reconstructions of the ROIs. Normal vectors are then extracted to create a quiver map, which was then used to train deep neural networks for recognizing pain. Their results suggest that AUs 1, 15, and 41 can reliably recognize early fatigue. To automatically detect AUs and pain in facial videos, Lucey et al. [68] proposed an active appearance model-based (AAM) system [26]. First, the patient's face is tracked using an AAM, then features are extracted that include the similarity-normalized shape and canonical-normalized appearance. These features are then used to train an SVM to detect AUs and pain. Their results suggest that head pose and context matter when recognizing the AUs, as well as pain. Xu et al. [130] investigated the relationship between pain measurements and frame-level AUs. They proposed a multitask learning model to predict self-reported visual analog scale (VAS) from AUs. They used a multitask learning network to predict pain scores, which are then linearly combined with an ensemble to predict VAS. They show that their proposed approach outperforms human-level performance for the same task. Along with recognizing specific expressions such as pain, AUs have also been used to recognize a broader number of facial expressions. One of the early works to do this is from Lien et al. [63]. They proposed recognizing upper face expressions (i.e. combinations of AUs) using Hidden Markov Models (HMM). They investigated facial feature points, optical flow, and high gradient component detection to extract features for recognition using an HMM. They detailed encouraging results showing the potential for detecting facial expressions based on action units. More recently, Liu et al. [64] proposed a Siamese Action Unit Attention Network (SAANet) that includes an AU attention mechanism which extracts spatial context from expression-specific regions (AUs). More specifically, this mechanism allows the network to select the important features from AU-specific regions of the face such as eyebrows, mouth, nose, and eyes. Their results suggest that focusing on AU-specific regions of the face can increase overall accuracy across multiple expressions including, but not limited to, anger, disgust, fear, and happiness.

5.3 Modeling Temporal Dynamics with Action Unit Patterns

5.3.1 Motivation and Background

Action units, which code facial muscle movement, have both an occurrence and an intensity value. The intensity of the AU is the extent of the facial muscle movement and the occurrence is the presence of the facial muscle movement. Action units can be used to recognize facial expressions (e.g. pain) [69]; however, the same AU pattern can occur in different emotional contexts and can be of different intensities. As can be seen in Fig. 5.2, the image on the left shows a higher intensity expression, and the image on the right shows a lower intensity expression. Although these are different contexts, AU4 and AU9 are active (1), while AU1, AU2, AU5, AU6, AU12, AU15, AU17, AU20, AU25, and AU26 are inactive (0) in both. Conversely, for the image on the left, AU4 and AU9 have intensities of 3 and 2, respectively, while the same AUs have an intensity of 1 in the right image. This shows that intensity patterns tend to be different, across emotions, compared to occurrence patterns. In Chapter 2 we showed that emotional responses do not occur instantaneously but occur over time. This implies that AUs do not occur instantaneously at peak intensity; rather, they have an on-set, peak, and off-set. Considering this, we hypothesize that AU intensity patterns will result in more accurate recognition of context, compared to occurrence patterns alone. To test this hypothesis, we have proposed a new temporal AU coding technique (Section 5.3.2). In Section 5.3.3 we visually demonstrate that the patterns are different across different contexts, and in Section 5.6 we show that this system of coding temporal dynamics can recognize context from AU patterns.

Figure 5.2: Comparison of high/low AU intensity, with the same occurrence patterns, from different sequences (DISFA [73]): (a) higher AU intensity; (b) lower AU intensity.

5.3.2 Modeling Temporal Dynamics

We propose a new method to encode AU patterns to capture temporal dynamics in the form of 2D images. Given a set of contexts (e.g. videos in DISFA) as $I_i \in \mathbb{R}^{N \times H \times W}$, we can use the proposed approach to recognize individual contexts (e.g. the video meant to elicit a happy emotional response in DISFA). Here $i$ is an individual context, $N$ is the total number of frames captured during each context, and $H$ and $W$ are the height and width of a given image frame, respectively. For each individual context ($i$), we extract $m$ action units from each frame, resulting in the feature vector $A_i \in \mathbb{R}^{N \times m}$. It is important to note that $N$ varies across context and subject (i.e. different contexts are shorter or longer than others). Given $A_i$, we then propose to transform this feature vector into a 2D image. Representing $A_i$ in the form of an AU pattern image instead of a feature vector allows us to take advantage of spatial information between the AUs (i.e. multiple AU patterns occur over each frame), which convolutional neural networks are well-suited for [55]. The number of action units, $m$, compared to the number of frames per context, $N$, is relatively low. Considering this, we employ a series of sampling and interpolation techniques to represent this sparse matrix of action unit values so that it can intuitively be transformed into an AU pattern image. To do this, the AU patterns $A_i$ of size $N \times m$ are up-sampled to size $N \times km$: each column vector $a_j \in A_i$ is repeated $k$ times, as shown in Equation 5.1. The parameter $k$ is directly related to the width of the images that will be created (e.g. a larger $k$ will result in a larger width for each pattern image). Post up-sampling, our pipeline (Fig. 5.1) proceeds independently for AU occurrence and intensity, as we detail in Sections 5.3.2.1 and 5.3.2.2.

$$A_i = \big[\underbrace{a_0 \cdots a_0}_{k \text{ times}},\ \underbrace{a_1 \cdots a_1}_{k \text{ times}},\ \cdots,\ \underbrace{a_m \cdots a_m}_{k \text{ times}}\big] \quad (5.1)$$

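To make Equation 5.1 concrete, the following is a minimal NumPy sketch of the column repetition it describes; the matrix size, AU count, and value of k are placeholder assumptions rather than the exact settings used in this work.

```python
# A minimal NumPy sketch of Equation 5.1: each AU column of the N x m matrix
# A_i is repeated k times, giving an N x (k*m) matrix. The sizes and k used
# here are placeholder values, not the settings from the experiments.
import numpy as np

A_i = np.random.randint(0, 6, size=(300, 12))   # N = 300 frames, m = 12 AUs
k = 10
A_up = np.repeat(A_i, k, axis=1)                # a_0 ... a_m, each repeated k times
print(A_up.shape)                               # (300, 120)
```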
Figure 5.3: (a) Original AU intensity distribution; (b) interpolated AU intensity distribution. Taken from a subject in DISFA [73].

5.3.2.1 AU Occurrence Pattern Modeling

Our pipeline captures not only the information about which AUs are activated, but also the degree to which each AU is activated. This is analogous to the study of valence and arousal in emotion recognition [115]. The AU pattern $A_i$, with values in the range [0, 5], is converted into AU occurrence using the piece-wise operation in Equation 5.2.

$$A_c = \begin{cases} 1, & A_{i,j} > 0 \\ 0, & \text{otherwise} \end{cases} \quad (5.2)$$

The AU occurrence pattern of size $N \times km$ is then up-sampled to a fixed constant $T$. The resultant AU occurrence pattern is denoted by $A_c \in \mathbb{Z}^{T \times km}$. Finally, the obtained AU occurrence patterns ($A_c$) are transposed to model the temporal dynamics of the action units (Fig. 5.4a).

5.3.2.2 AU Intensity Pattern Modeling

For intensity, the normalized AU pattern of size $N \times km$ is resampled using linear interpolation to a fixed constant $T$. Fig. 5.3 shows the plot of the interpolation for a sample subject and context. In this case, AU intensity values in the range [0, 5] were interpolated with T = 1000.

Figure 5.4: (a) Up-sampled image from AU occurrence patterns; (b) interpolated image from AU intensity patterns. Both (a) and (b) are from the same subject and context in DISFA [73].

It should be noted that the same sampling constant has been chosen for both AU occurrence and intensity. The resultant AU intensity pattern is denoted by $A_o \in \mathbb{Z}^{T \times km}$. Like the AU occurrence patterns, the obtained AU intensity patterns ($A_o$) are transposed to model the temporal dynamics of the action units (Fig. 5.4b).
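The following is a minimal NumPy sketch of the occurrence and intensity pattern construction described in Sections 5.3.2.1 and 5.3.2.2. The value of T, the example matrix, and the nearest-neighbour row repetition used for the occurrence image are assumptions where the text does not specify the exact resampling routine.

```python
# A minimal sketch of building occurrence and intensity pattern images from an
# up-sampled N x (k*m) AU matrix (Equation 5.1). T and the example sizes are
# placeholders; nearest-neighbour repetition for the occurrence image is an
# assumption.
import numpy as np

def pattern_images(A_up: np.ndarray, T: int = 1000):
    N, km = A_up.shape
    t_old = np.linspace(0.0, 1.0, N)
    t_new = np.linspace(0.0, 1.0, T)

    # Occurrence (Eq. 5.2): binarize, then up-sample the time axis to T rows.
    occ = (A_up > 0).astype(np.uint8)
    occ = occ[np.floor(np.linspace(0, N - 1, T)).astype(int), :]

    # Intensity: linear interpolation of each column to T samples.
    inten = np.stack([np.interp(t_new, t_old, A_up[:, j]) for j in range(km)], axis=1)

    # Transpose so the AU axis is the image height and time runs along the width.
    return occ.T, inten.T

A_up = np.repeat(np.random.randint(0, 6, size=(350, 12)).astype(float), 10, axis=1)
occ_img, int_img = pattern_images(A_up)
print(occ_img.shape, int_img.shape)   # (120, 1000) (120, 1000)
```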

5.3.3 Justification for AU Pattern Images

The proposed approach models the temporal dynamics of facial expressions from AU patterns. The interpolation of the intensity patterns results in a smoother curve for the AU intensity values (Fig. 5.3). The interpolated AU intensities are stacked along the Y-axis, where the AUs are in order from the bottom up. For example, as shown in Fig. 5.4, the AUs are in the following bottom-up order: AU1 (bottom of image), AU2, AU4, AU5, AU6, AU9, AU12, AU15, AU17, AU20, AU25, and AU26 (top of image). This is done because AU activation patterns are different for different facial expressions [34] and, as discussed in Chapter 3, the activations of different AUs are correlated with each other. By stacking the temporal AUs we preserve the temporal information of the facial muscle movement and the correlation between the AUs over time.

Figure 5.5: Interpolated AU intensity signals for one subject across the DISFA contexts: (a) Happy Video 1, (b) Happy Video 2, (c) Surprise Video 1, (d) Fear, (e) Disgust Video 1, (f) Disgust Video 2, (g) Sad Video 1, (h) Sad Video 2, (i) Surprise Video 2.

AU intensity patterns for different contexts for a subject in DISFA are shown in Fig. 5.5. It can be seen from the figure that temporal changes in AU intensity are different for different contextual responses. We can also see that the AUs with the highest intensity in a contextual response correspond to the AUs which are linked with that particular emotional response. For example, between the patterns of Happy Video 1 (Fig. 5.5a) and Disgust Video 1 (Fig. 5.5e) we can see that temporally different AUs get activated at different intensities. In Happy Video 1 we can see that AU6 and AU12 have the highest intensity. From Figs. 5.5e and 5.5f, we can see that even though the expected elicited emotional response for both stimuli is the same and the same AUs are activated in both, the temporal AU pattern is different for them; that is, both have a different on-set, peak, and off-set for the intensity. This implies that different contextual stimuli will produce different temporal expressions and that, to study facial expression, context is important [4]. In Section 5.5, we detail how these differences in temporal AU patterns can be used to recognize context.

5.4 Datasets

5.4.1 DISFA

DISFA [73] contains both the occurrence and intensity of spontaneous facial action units. There are 27 subjects (12 women and 15 men), with an age range of 18-50 years. Ethnicities include Euro-American, Asian, Hispanic, and African-American. Approximately 78% of the subjects are Euro-American, ~11% are Asian, less than 4% are African-American (all male subjects), and less than 8% are Hispanic (all male subjects). To elicit facial expressions, each subject watched a four-minute video that consisted of YouTube clips. Images were captured at 20 frames per second with a resolution of 1024 × 768. In total there were 9 different videos meant to elicit an emotional response (e.g. Happy Video 1 and Happy Video 2). See Table 5.1 for details on the sequence number, video topic, and target emotion for each video (context). For each of the videos, a single FACS coder annotated AUs for each frame in the range [0,5], where 0 represents that the AU is not present and 5 represents maximum intensity. In total, 12 AUs were annotated for each frame: AU1, AU2, AU4, AU5, AU6, AU9, AU12, AU15, AU17, AU20, AU25, and AU26. In many AU detection works [60, 59, 142, 28, 103], only 8 AUs are detected from this database: AU1, AU2, AU4, AU6, AU9, AU12, AU25, and AU26. In the proposed work, we make use of all 12 annotated AUs to detect context.

Table 5.1: Video Descriptions of DISFA.

Seg. | Video                                        | Target Emotion
1    | Husky Dog Talking - "I love you"             | Happy
2    | Four-year-old girl on America's Got Talent   | Happy
3    | Speaking on the phone at the office          | Surprise
4    | Computer Scare Prank                         | Fear
5    | Man vs. Wild - Eating Giant Larva            | Disgust
6    | Large Snake Swallows a Pig                   | Disgust
7    | Pictures of malnourished African children    | Sad
8    | 2009 L'Aquila earthquake                     | Sad
9    | Crocodile snaps man's hand                   | Surprise

Table 5.2: Task Descriptions of BP4D.

Task | Activity                                                                | Target Emotion
T1   | Talk to experimenter and listen to a joke (interview)                  | Happy
T2   | Watch and listen to a recorded documentary and discuss their reactions | Sad
T3   | Experience sudden, unexpected burst of sound                           | Surprise
T4   | Play a game in which they improvise a silly song                       | Embarrassment
T5   | Anticipate and experience physical threat                              | Fear or Nervous
T6   | Submerge their hand in ice water for as long as possible               | Physical Pain
T7   | Experience harsh insults from the experimenter                         | Anger or Upset
T8   | Experience an unpleasant smell                                         | Disgust

5.4.2 BP4D

BP4D [137] is a multimodal (e.g. 2D face images and 3D models) database that contains spontaneous AU occurrences and intensities. There are 41 subjects (23 women and 18 men), with an age range of 18-29 years. Ethnicities include Asian, African-American, Hispanic, and Euro-American. Approximately 47% of the subjects are Euro-American, ~20% are Asian, ~17% are African-American, and less than 5% are Hispanic. To elicit facial expressions, each subject performed 8 tasks, which we refer to as context. See Table 5.2 for details on the task numbers, activities, and target emotion for each of the 8 tasks. For each of the tasks, FACS coding was done for the 20-second segments (~500 frames) that have the highest intensity of expression. Two independent FACS coders manually annotated each of these frames, for AU occurrence, with 27 AUs: AU1, AU2, AU4-AU7, AU9-AU20, AU22-AU24, AU27, AU28, AU30, AU32, AU38, and AU39. For our experimental design, we use the 12 AUs that are commonly found in the literature (each AU is present at least 200 times) [25, 59, 113, 141, 87]: AU1, AU2, AU4, AU6, AU7, AU10, AU12, AU14, AU15, AU17, AU23, and AU24. Along with the annotated occurrence, the FACS coders also manually labeled the intensity of 5 of these AUs (AU6, AU10, AU12, AU14, and AU17) in the range [0,5].

5.4.3 BP4D+

Similar to BP4D, BP4D+ [139] is a multimodal (e.g. physiological, thermal) database with spontaneous AU occurrences and intensities. There are 140 subjects (82 females and 58 males), with an age range of 18-66. Ethnicities include Black, White, East-Asian, Middle-East-Asian, Hispanic, and Native American. Approximately 46% of the subjects are White, ~30% are Asian, ~7% are Black, ~11% are Hispanic, and ~7% identified as Other. A total of 10 tasks were used to elicit emotional responses, which we refer to as context in our experimental design. See Table 2.1 for details on the task numbers, activities, and targeted emotion for each of the tasks. Five FACS coders manually annotated 34 AUs for the most expressive 20 seconds of four tasks (happiness, embarrassment, fear, and pain): AU1, AU2, AU4-AU7, AU9-AU20, AU22-AU24, and AU27-AU39. For our experimental design, we again use the same 12 AUs as BP4D (Section 5.4.2). Similar to BP4D, the FACS coders also manually labeled the intensity of AU6, AU10, AU12, AU14, and AU17.

5.5 Experimental Design

We conducted a range of experiments to show how action unit patterns can be used to recognize context. Considering that convolutional neural networks (CNNs) are well-suited for images [55], we have trained a CNN to recognize context from the proposed AU pattern images (Section 5.3.2). To the best of our knowledge, this is the first work to propose using AU patterns for recognizing context. Therefore, for comparison purposes, we also used the raw action unit intensity values ([0,5]) to train a feed-forward fully-connected neural network. The details of both networks are given in Sections 5.5.1.1 and 5.5.1.2, respectively. For our experimental design, we also have three different classes that we recognize: (1) context, (2) target emotion from context, and (3) target emotion. For DISFA, (1), (2), and (3) are used; however, for BP4D and BP4D+ only (1) is used, as they do not have multiple stimuli to elicit the same emotional response. We define each of these classes as follows:

1. Context. The stimulus used to elicit the emotional response is recognized (e.g. the video or task the subject was exposed to). If two or more stimuli are used to elicit the same emotion, we detect them as separate classes (e.g. Happy Video 1 and Happy Video 2).

2. Target Emotion from Context. The stimuli are first detected as separate classes and then combined into one class representing the emotion they were meant to elicit. For example, given Disgust Video 1 and Disgust Video 2, we detect them as separate classes and then combine them into a single class called Disgust by combining the results (i.e. score-level fusion).

79 3. Target Emotion. Given multiple stimuli meant to elicit the same emotional response we recognize them as one class. For example, given Happy Video 1 and Happy Video 2 we train the neural networks with all samples from both as Happy.

5.5.1 Neural Network Architectures

For our experimental design we used two kinds of neural network architectures (1) con- volutional neural network, and (2) feed- forward fully-connected neural network.

5.5.1.1 Convolutional Neural Network

Figure 5.6: Convolutional neural network architecture. The AU pattern images (both intensity and occurrence) are used to train the network to recognize context.

For context recognition, we trained a 7-layer convolutional neural network with AU pattern images as shown in Fig. 5.6. The first two layers are convolutional layers; the first has 64 filters, the second has 128 filters, and both have a kernel size of (5, 5), a stride of 2, and ReLU activation. The third layer is a batch normalization layer and the fourth layer is max pooling with a kernel size of (2, 2). The fifth layer is a dropout layer with a dropout ratio of 0.3, and the sixth layer is a dense layer of 1000 neurons with ReLU activation. The final layer is the output layer, with as many output neurons as classes and softmax activation. The loss function used was categorical cross-entropy, the optimizer was Adadelta with a learning rate of 0.001 and a batch size of 5, and training was done for 150 epochs.
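The following is a minimal Keras sketch of the architecture described above; the input shape, the number of context classes, and the flatten layer between the dropout and dense layers are assumptions, as they depend on T, k, m, and the dataset being used.

```python
# A minimal Keras sketch of the 7-layer CNN described above. The input shape
# (pattern image height/width) and the number of context classes are
# placeholders; the Flatten layer is assumed, since the pattern image must be
# flattened before the dense layer.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_context_cnn(input_shape=(120, 1000, 1), num_classes=9):
    model = models.Sequential([
        layers.Conv2D(64, (5, 5), strides=2, activation="relu",
                      input_shape=input_shape),                 # conv layer 1
        layers.Conv2D(128, (5, 5), strides=2, activation="relu"),  # conv layer 2
        layers.BatchNormalization(),                             # layer 3
        layers.MaxPooling2D(pool_size=(2, 2)),                   # layer 4
        layers.Dropout(0.3),                                     # layer 5
        layers.Flatten(),
        layers.Dense(1000, activation="relu"),                   # layer 6
        layers.Dense(num_classes, activation="softmax"),         # output layer
    ])
    model.compile(optimizer=tf.keras.optimizers.Adadelta(learning_rate=0.001),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Example usage (pattern_images and one_hot_labels are placeholders):
# model = build_context_cnn()
# model.fit(pattern_images, one_hot_labels, batch_size=5, epochs=150)
```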

5.5.1.2 Feed-forward Fully-connected Network

To compare the results of recognizing context by training a CNN with AU pattern images, we trained a 3-layer feed-forward fully-connected network. We used the same network as in Chapter 4: the first layer is a dense layer of 480 neurons, the second layer is a dense layer of 240 neurons, and both layers use the ReLU activation function. To keep this network similar to the CNN that was trained on images, we used the same training parameters: the optimizer was Adadelta, the loss function was categorical cross-entropy, the learning rate was 0.001, the batch size was 5, and training was done for 150 epochs.
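A comparable sketch of this baseline feed-forward network is given below; the flattened input length and the class count are placeholders.

```python
# A minimal Keras sketch of the baseline feed-forward network described above.
# The flattened input length (e.g. 4000 time steps x m AUs) and the number of
# classes are placeholders.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_baseline_mlp(input_dim=4000 * 12, num_classes=9):
    model = models.Sequential([
        layers.Input(shape=(input_dim,)),
        layers.Dense(480, activation="relu"),             # first dense layer
        layers.Dense(240, activation="relu"),             # second dense layer
        layers.Dense(num_classes, activation="softmax"),  # output layer
    ])
    model.compile(optimizer=tf.keras.optimizers.Adadelta(learning_rate=0.001),
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```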

5.5.2 Experiments and Metrics

As DISFA has different stimuli meant to elicit the same emotional response, and BP4D and BP4D+ do not explicitly have these stimuli, we have conducted different experiments for these datasets. The experimental design for DISFA is given below in Section 5.5.2.1, and the design for BP4D and BP4D+ in Section 5.5.2.2. We also detail the evaluation metrics we use for all datasets in Section 5.5.3.

5.5.2.1 DISFA

We have found the DISFA dataset to be one of the most versatile datasets for AU analysis. As discussed in Section 5.4.1, DISFA has multiple contextual stimuli for the same emotional response. For this dataset we performed the following experiments:

1. The context meant to elicit an emotional response is recognized. In this experiment there are 9 classes: (1) Happy Video 1, (2) Happy Video 2, (3) Disgust Video 1, (4) Disgust Video 2, (5) Surprise Video 1, (6) Surprise Video 2, (7) Fear Video, (8) Sad Video 1, and (9) Sad Video 2.

2. Target Emotion from Context is recognized. In this experiment the 9 classes from the previous experiment are first recognized and then combined (i.e. score-level fusion) to form five classes: (1) Happy, (2) Disgust, (3) Surprise, (4) Fear, and (5) Sad.

3. Target Emotion is recognized. In this experiment the 9 classes are combined into the same five classes as in the target emotion from context experiment. The difference is that the combination happens before training, so the networks are only trained on the five classes.

All of the experiments were done using interpolated AU intensity pattern images and up-sampled AU occurrence images to train the CNN. It is important to note that these experiments were done independently; in other words, one CNN was trained with the intensity images and another was trained with the occurrence images, to be able to compare the utility of using intensity vs. occurrence for context recognition. Also, to compare the effectiveness of the normalization and interpolation, we used up-sampled non-normalized ([0,5]) AU intensity patterns. An example of an up-sampled non-normalized pattern can be seen in Figure 5.7.

Figure 5.7: Non-normalized AU Pattern image. NOTE: due to the range of values being [0,5], it can be difficult to visually see the pattern in the image-space.

5.5.2.2 BP4D and BP4D+

BP4D and BP4D+ are similar datasets for action unit analysis, the main difference being that BP4D has more of the contexts AU-annotated than BP4D+. In BP4D all 8 contexts have been AU coded, whereas in BP4D+ only four contexts have been coded. These datasets have one context corresponding to each target emotion, so in this experimental design we recognize context. As detailed in Sections 5.4.2 and 5.4.3, five AUs have intensity coding in these datasets, which we use for our experimental design. Here, we recognize the context using interpolated AU intensity images (5 AUs), up-sampled AU occurrence images (5 AUs and 12 AUs), non-normalized up-sampled AU intensity images (5 AUs), and a combination of AU intensity and occurrence images (5 + 7 AUs). To make the combination images we used the 12 AUs that are commonly found in the literature (Section 5.4.2). For these 12, we used the five AU intensities (AU6, AU10, AU12, AU14, and AU17) and combined them with the other 7 AU occurrences (AU1, AU2, AU4, AU7, AU15, AU23, and AU24). This combination was done to further compare intensity vs. occurrence, as we evaluate 12 AU occurrence patterns against these 12 AU patterns where five of them are intensity. An example of this combination image can be seen in Fig. 5.8.
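The following is a minimal sketch of assembling such a combination image, assuming the per-AU intensity and occurrence pattern rows have already been built as in Section 5.3.2; the array sizes are placeholders.

```python
# A minimal sketch of a combination pattern image: interpolated intensity rows
# for the five intensity-coded AUs stacked with up-sampled occurrence rows for
# the remaining seven AUs. The arrays below are placeholders, already
# transposed to AU-by-time as in Section 5.3.2.
import numpy as np

T = 1000
k = 10
intensity_img = np.random.rand(5 * k, T)                  # AU6, AU10, AU12, AU14, AU17 (intensity)
occurrence_img = np.random.randint(0, 2, (7 * k, T))      # AU1, AU2, AU4, AU7, AU15, AU23, AU24

# Stack along the AU (height) axis to form the 12-AU combination image.
combination_img = np.vstack([intensity_img, occurrence_img.astype(float)])
print(combination_img.shape)   # (120, 1000)
```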

Figure 5.8: AU occurrence and intensity combination image

It is also important to note that BP4D and BP4D+ have similar contextual stimuli (tasks), such as holding a hand in a bucket of ice water for pain. Considering this, we trained the proposed CNN on the AU patterns of each dataset and tested on the other (e.g. trained on BP4D and tested on BP4D+).

5.5.3 Metrics

It has been shown that it is important to use multiple evaluation metrics when detecting action units due to class imbalance [50]. Considering this, to have a clear understanding of the AU patterns and their usefulness in detecting context, we use the following performance metrics to evaluate the proposed modeling of temporal dynamics using AU patterns.

1. Accuracy. Accuracy is one of the most commonly used performance metrics for multi-class classification. It is the number of correctly classified samples divided by the total number of samples.

2. F1 Macro. It is the average of the individual binary F1 scores of all the classes. The binary F1 is calculated as $F1_{bin} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$.

3. Kappa. Kappa is a measure of the reliability of a predictor. It is calculated as $\kappa = \frac{P_o - P_e}{1 - P_e}$, where $P_o$ is the observed overall agreement and $P_e$ is the expected mean proportion of agreement due to chance. It is the degree of actually attained agreement in excess of chance ($P_o - P_e$), normalised by the maximum agreement attainable above chance ($1 - P_e$). A short sketch of computing these three metrics follows this list.
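A minimal scikit-learn sketch, with placeholder label arrays:

```python
# A minimal sketch of computing the three metrics above with scikit-learn;
# y_true and y_pred are placeholder integer label arrays.
from sklearn.metrics import accuracy_score, f1_score, cohen_kappa_score

y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 2, 1, 1, 0]

accuracy = accuracy_score(y_true, y_pred)
f1_macro = f1_score(y_true, y_pred, average="macro")   # mean of per-class binary F1
kappa = cohen_kappa_score(y_true, y_pred)              # (Po - Pe) / (1 - Pe)

print(accuracy, f1_macro, kappa)
```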

To validate our approach we performed subject-independent cross-validation; to avoid any bias across the folds, we kept the same subject splits across all experiments.
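A minimal sketch of such subject-independent folds, using scikit-learn's GroupKFold, is given below; the feature matrix, labels, and subject IDs are placeholders.

```python
# A minimal sketch of subject-independent folds with scikit-learn's GroupKFold;
# X, y, and the subject IDs are placeholders.
import numpy as np
from sklearn.model_selection import GroupKFold

X = np.random.rand(54, 128)                  # e.g. one flattened pattern image per sequence
y = np.random.randint(0, 9, size=54)         # context labels
subjects = np.repeat(np.arange(27), 2)       # subject ID for each sequence

gkf = GroupKFold(n_splits=9)
for train_idx, test_idx in gkf.split(X, y, groups=subjects):
    # No subject appears in both the training and test folds.
    assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])
```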

5.6 Results

As the experimental design is different for DISFA, compared to BP4D and BP4D+, we split the results section based on this. As mentioned in Section 5.5.1.2, we used a feed-forward fully-connected neural network for comparisons to the proposed approach. To train this network, we used a flattened AU intensity vector. To create the vector, we first resampled the $N \times m$ action units to a uniform length of 4000, where m = 12 for DISFA and m = 5 for BP4D and BP4D+. We then flatten the vector as shown in Equation 5.3.

$$\mathrm{Vect} = \big[\underbrace{AU_1\ AU_2\ \cdots\ AU_m}_{\text{first time step}},\ \underbrace{AU_1\ AU_2\ \cdots\ AU_m}_{\text{second time step}},\ \cdots,\ \underbrace{AU_1\ AU_2\ \cdots\ AU_m}_{\text{4000th time step}}\big] \quad (5.3)$$

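The following is a minimal NumPy sketch of building this flattened baseline vector per Equation 5.3; the example AU matrix is a placeholder and linear interpolation is assumed for the resampling.

```python
# A minimal sketch of the baseline input vector per Equation 5.3: resample each
# AU signal to a fixed number of time steps, then flatten time-step by
# time-step. The example matrix is a placeholder; linear interpolation for the
# resampling is an assumption.
import numpy as np

def flatten_au_vector(A: np.ndarray, T_steps: int = 4000) -> np.ndarray:
    N, m = A.shape
    t_old = np.linspace(0.0, 1.0, N)
    t_new = np.linspace(0.0, 1.0, T_steps)
    # Linearly resample each AU column to T_steps samples.
    resampled = np.stack([np.interp(t_new, t_old, A[:, j]) for j in range(m)], axis=1)
    # Row-major flattening gives [AU1 ... AUm] for step 1, then step 2, and so on.
    return resampled.reshape(-1)

vect = flatten_au_vector(np.random.rand(350, 12))
print(vect.shape)   # (48000,)
```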
The results for this network, for DISFA, BP4D, and BP4D+ can be seen below in Table 5.3. We refer to these results as the baseline in the rest of the work.

Table 5.3: Evaluation metrics for the feed-forward fully-connected neural network on DISFA, BP4D, and BP4D+.

                                      | Accuracy | F1 Macro | Kappa
DISFA - Context                       | 0.4609   | 0.4592   | 0.3935
DISFA - Target Emotion from Context   | 0.5555   | 0.5302   | 0.4388
DISFA - Target Emotion                | 0.5473   | 0.5152   | 0.4264
BP4D - Context                        | 0.3262   | 0.3107   | 0.2299
BP4D+ - Context                       | 0.45     | 0.4482   | 0.2667

As we will show, the proposed modeling of temporal dynamics from AU patterns outperforms using the raw AU intensity values for recognizing context, across all datasets and evaluation metrics.

5.6.1 DISFA

For the DISFA dataset we recognize context, target emotion from context, and target emotion by performing 9-fold subject-independent cross-validation.

5.6.1.1 Context

In this section we report our results for detecting the context. Table 5.4 shows the performance metrics across intensity, occurrence, and no normalization. As can be seen, the intensity AU patterns outperform the baseline results (Table 5.3), as well as the occurrence and no normalization patterns, across all three evaluation metrics. It is also interesting to note that the occurrence patterns outperformed the raw intensity values across all three evaluation metrics as well. These results are encouraging as they suggest our proposed temporal modeling approach can more accurately recognize context.

Table 5.4: Evaluation metrics for context from AU patterns using intensity, occurrence, and no normalization on the DISFA dataset.

               | Intensity | Occurrence | No Norm
Accuracy       | 0.6461    | 0.5144     | 0.4650
F1 Macro Score | 0.6516    | 0.5095     | 0.4536
Kappa          | 0.6019    | 0.4537     | 0.3981

To further analyze the performance of AU patterns, we calculated the confusion matrices for the intensity patterns (Table 5.5). It can be seen that the intensity AU patterns were able to accurately recognize the majority of the contexts. For example, the intensity patterns accurately recognized the two different surprise contexts. Conversely though, the network had the most misclassifications for distinguishing between the two contexts of Sad (Sad Video 1 and Sad Video 2). This can be explained, in part, by the patterns for the two Sad contexts having a similar pattern representation with few AU activations (Fig. 5.5g). The confusion matrix of AU occurrence patterns can be seen in Table 5.6.

Table 5.5: Confusion matrix for AU intensity-based context on the DISFA dataset. HA=Happy Video, SU=Surprise Video, FE=Fear Video, DI=Disgust Video, SA=Sad Video.

     | HA 1 | HA 2 | SU 1 | FE | DI 1 | DI 2 | SA 1 | SA 2 | SU 2
HA 1 |  20  |   3  |   1  |  3 |   0  |   0  |   0  |   0  |   0
HA 2 |   3  |  20  |   3  |  1 |   0  |   0  |   0  |   0  |   0
SU 1 |   2  |   2  |  19  |  2 |   0  |   1  |   0  |   1  |   0
FE   |   2  |   0  |   1  | 18 |   0  |   1  |   2  |   2  |   1
DI 1 |   0  |   0  |   1  |  1 |  20  |   2  |   1  |   0  |   2
DI 2 |   0  |   1  |   1  |  0 |   2  |  15  |   4  |   3  |   1
SA 1 |   0  |   0  |   0  |  0 |   0  |   2  |  12  |  12  |   1
SA 2 |   0  |   0  |   0  |  0 |   0  |   3  |  12  |  12  |   0
SU 2 |   0  |   0  |   0  |  2 |   2  |   0  |   0  |   2  |  21

Table 5.6: Confusion matrix for AU occurrence-based context on the DISFA dataset. HA=Happy, SU=Surprise, FE=Fear, DI=Disgust, SA=Sad.

     | HA 1 | HA 2 | SU 1 | FE | DI 1 | DI 2 | SA 1 | SA 2 | SU 2
HA 1 |  20  |   0  |   3  |  4 |   0  |   0  |   0  |   0  |   0
HA 2 |   4  |  17  |   4  |  1 |   1  |   0  |   0  |   0  |   0
SU 1 |   4  |   5  |  11  |  2 |   2  |   2  |   0  |   1  |   0
FE   |   2  |   3  |   3  | 13 |   1  |   1  |   2  |   1  |   1
DI 1 |   0  |   1  |   2  |  0 |  21  |   1  |   1  |   0  |   1
DI 2 |   3  |   3  |   1  |  1 |   3  |  11  |   1  |   3  |   1
SA 1 |   0  |   0  |   2  |  0 |   1  |   4  |   7  |  13  |   0
SA 2 |   0  |   0  |   4  |  0 |   0  |   1  |   9  |  11  |   2
SU 2 |   0  |   0  |   1  |  4 |   5  |   1  |   1  |   1  |  14

5.6.1.2 Target Emotion from Context

To get target emotion from context, we combine the context classifications from Section 5.6.1.1, where the new correct classifications are defined as $\text{correct} = \sum_{i=1}^{n} t_i$, where $n$ is the total number of contexts (e.g. n = 2 for Happy) and $t_i$ is the number of correct classifications for the given context (e.g. $t_i = 20$ for Happy Video 1 as seen in Table 5.5). The new incorrect classifications are defined as $\text{incorrect} = \sum_{i=1}^{n} f_i$, where $n$ is the total number of contexts and $f_i$ is the number of incorrect classifications for a given context. The performance metrics for the intensity, occurrence, and no normalization patterns are given in Table 5.7. As seen in the previous section, the intensity AU patterns outperform the baseline results, as well as the occurrence and no normalization pattern images.
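The following is a minimal sketch of this score-level fusion; the context-to-emotion mapping reflects the DISFA videos in Table 5.1, while the label lists are placeholders.

```python
# A minimal sketch of the score-level fusion described above: per-context
# predictions are mapped to their target emotion before scoring, so confusions
# between contexts of the same emotion become correct classifications.
import numpy as np

CONTEXT_TO_EMOTION = {
    "Happy Video 1": "Happy", "Happy Video 2": "Happy",
    "Disgust Video 1": "Disgust", "Disgust Video 2": "Disgust",
    "Surprise Video 1": "Surprise", "Surprise Video 2": "Surprise",
    "Sad Video 1": "Sad", "Sad Video 2": "Sad",
    "Fear Video": "Fear",
}

def fuse_to_emotion(context_true, context_pred):
    # Map both true and predicted context labels to their target emotion.
    true_emotions = [CONTEXT_TO_EMOTION[c] for c in context_true]
    pred_emotions = [CONTEXT_TO_EMOTION[c] for c in context_pred]
    return true_emotions, pred_emotions

true_e, pred_e = fuse_to_emotion(["Sad Video 1", "Happy Video 1"],
                                 ["Sad Video 2", "Happy Video 1"])
print(np.mean(np.array(true_e) == np.array(pred_e)))  # 1.0
```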

Table 5.8: Confusion matrix for AU intensity-based target emotion from context on the DISFA dataset. HA=Happy, SU=Surprise, FE=Fear, DI=Disgust, SA=Sad.

     | HA | SU | FE | DI | SA
HA   | 46 |  4 |  4 |  0 |  0
SU   |  4 | 40 |  4 |  3 |  3
FE   |  2 |  2 | 18 |  1 |  4
DI   |  1 |  5 |  1 | 39 |  8
SA   |  0 |  1 |  0 |  5 | 48

Table 5.7: Evaluation metrics for target emotion from context from AU patterns using intensity, occurrence, and no normalization on the DISFA dataset.

               | Intensity | Occurrence | No Norm
Accuracy       | 0.7860    | 0.6419     | 0.5761
F1 Macro Score | 0.7733    | 0.6259     | 0.5426
Kappa          | 0.7292    | 0.5464     | 0.4619

As can be seen in Table 5.7, when we recognize target emotion from context, the accuracy, F1-macro, and Kappa metrics all show a noticeable increase. By comparing Table 5.5 with Table 5.8, we can see that the reason for the increase in the performance metrics is the combination of the contexts into emotions. When this is done, a majority of the misclassifications go away, as most of them were between contexts with the same target emotion. For example, in Table 5.5 we can see that the network gave similar misclassifications between Sad Video 1 and Sad Video 2, but when the classification was combined into Sad, as in Table 5.8, those misclassifications became correct classifications.

5.6.1.3 Target Emotion

We classified the target emotion irrespective of the context (e.g. Sad Video 1 and Sad Video 2 were trained together as Sad). The performance metrics are reported in Table 5.11 and, similar to previous experiments, the AU intensity patterns outperformed the baseline, occurrence, and no normalization patterns.

Table 5.9: Confusion matrix for AU intensity-based target emotion on the DISFA dataset. HA=Happy, SU=Surprise, FE=Fear, DI=Disgust, SA=Sad.

     | HA | SU | FE | DI | SA
HA   | 44 |  6 |  2 |  2 |  0
SU   |  7 | 33 |  3 |  5 |  6
FE   |  1 |  3 | 15 |  3 |  5
DI   |  3 |  4 |  4 | 38 |  5
SA   |  1 |  3 |  2 |  3 | 45

Table 5.10: Confusion matrix for AU occurrence-based target emotion from context on the DISFA dataset. HA=Happy, SU=Surprise, FE=Fear, DI=Disgust, SA=Sad.

     | HA | SU | FE | DI | SA
HA   | 41 |  7 |  5 |  1 |  0
SU   |  9 | 26 |  6 | 10 |  3
FE   |  5 |  4 | 13 |  2 |  3
DI   |  7 |  5 |  1 | 36 |  5
SA   |  0 |  8 |  0 |  6 | 40

By comparing the performance metrics in Table 5.7 with the metrics in Table 5.11, we can see that, compared to target emotion from context, the overall performance across all metrics decreased for the intensity AU patterns but largely increased for the occurrence and no normalization patterns. This suggests that the AU occurrence patterns are similar for target emotion; however, some of the AU intensity patterns are different. As the AU intensity patterns outperform occurrence and no normalization, this further validates that context is important [9]. From Table 5.9 it can be seen that all target emotions suffered some decrease in accuracy compared to target emotion from context (Table 5.8). The AU occurrence-based confusion matrices for context, target emotion from context, and target emotion can be seen in Tables 5.6, 5.10, and 5.12, respectively.

Table 5.11: Evaluation metrics for target emotion from AU patterns using intensity, occurrence, and no normalization on DISFA.

               | Intensity | Occurrence | No Norm
Accuracy       | 0.7202    | 0.6749     | 0.6091
F1 Macro Score | 0.7027    | 0.6547     | 0.5344
Kappa          | 0.6456    | 0.5878     | 0.4828

Table 5.12: Confusion matrix for AU occurrence-based target emotion on the DISFA dataset. HA=Happy, SU=Surprise, FE=Fear, DI=Disgust, SA=Sad.

     | HA | SU | FE | DI | SA
HA   | 45 |  5 |  4 |  0 |  0
SU   |  8 | 28 |  4 | 10 |  4
FE   |  3 |  5 | 13 |  2 |  4
DI   |  4 |  8 |  2 | 36 |  4
SA   |  3 |  3 |  1 |  5 | 42

5.6.2 BP4D and BP4D+

As BP4D and BP4D+ have only one stimulus corresponding to each emotion, we evaluate our method based on recognition of context. BP4D and BP4D+ have four common contextual responses which are AU coded: Happy, Embarrassed, Fear, and Pain. BP4D has four additional contextual responses for a total of 8: Sad, Surprise, Anger, and Disgust. For our experiments, we performed 3-fold cross validation on both datasets.

5.6.2.1 BP4D Context Recognition

As detailed in Section 5.5.2.2, we evaluated AU patterns by looking at interpolated intensity, up-sampled occurrence, non-normalized intensity, and a combination of intensity and occurrence (Fig. 5.8). Some interesting insights can be gained from Table 5.13. It can be seen that the combination of intensity and occurrence performs the best across all the metrics, followed by the occurrence patterns with 12 AUs, then intensity (5 AUs), occurrence (5 AUs), and finally intensity without normalization (i.e. [0,5]). As can be seen in Table 5.16, the majority of contexts were accurately identified. Some exceptions were embarrassed, which was confused with anger, and fear, which was confused with embarrassed. This can be explained, in part, by fear being a secondary emotion to anger [91]. These results suggest the following: (1) more AUs are needed to accurately recognize context; (2) intensity is a better descriptor for recognizing context than occurrence; and (3) when modeling facial expression dynamics with AU patterns, normalization can increase the evaluation metrics for context recognition. Table 5.14 shows the confusion matrix for the 5-AU intensity patterns; Tables 5.15 and 5.17 show the confusion matrices for the 5-AU and 12-AU occurrence patterns, respectively.

Table 5.13: Evaluation metrics for intensity (5 AUs), occurrence (5 AUs), no normalization (5 AUs), combination (5+7 AUs), and occurrence (12 AUs) on the BP4D dataset.

                        | Accuracy | F1 Macro | Kappa
Intensity (5 AUs)       | 0.4177   | 0.4039   | 0.3345
Occurrence (5 AUs)      | 0.3781   | 0.3666   | 0.2892
No Norm (5 AUs)         | 0.3567   | 0.334    | 0.2648
Combination (5 + 7 AUs) | 0.4453   | 0.4377   | 0.3658
Occurrence (12 AUs)     | 0.4359   | 0.4287   | 0.3554

Table 5.14: Confusion matrix for AU intensity-based (5 AUs) context on the BP4D dataset. HA=Happy, SA=Sad, SU=Surprise, EM=Embarrassed, FE=Fear, PA=Pain, AN=Anger, and DI=Disgust.

     | HA | SA | SU | EM | FE | PA | AN | DI
HA   | 16 |  0 |  1 |  4 |  6 |  4 |  5 |  5
SA   |  0 | 31 |  5 |  0 |  0 |  3 |  0 |  2
SU   |  1 |  6 | 27 |  1 |  0 |  2 |  0 |  4
EM   | 13 |  0 |  1 | 14 |  3 |  0 |  4 |  6
FE   |  6 |  1 |  0 |  8 |  8 |  5 |  6 |  7
PA   |  5 | 12 |  1 |  1 |  3 |  9 |  3 |  7
AN   |  7 |  3 |  2 |  5 |  4 |  4 | 10 |  6
DI   |  3 |  2 |  4 |  2 |  1 |  5 |  2 | 22

5.6.2.2 BP4D+ Context Recognition

To further investigate the need for more AUs and intensity over occurrence, the same experiments were conducted on BP4D+ for the four contexts that had AU intensities labeled (Happy, Embarrassed, Fear, and Pain). As can be seen in Table 5.18, the results are similar to those from BP4D. The combination of intensity and occurrence outperforms the rest for all metrics, followed by intensity (5 AUs). With BP4D+, it is interesting to note that while the occurrence patterns with 12 AUs outperformed those with 5 AUs, the non-normalized intensity patterns outperformed both of them. This requires further investigation; however, one possible explanation is that BP4D+ has four contexts with intensity annotations, while BP4D has 8. It is possible we would see a decrease in accuracy for the non-normalized patterns due to this. We also calculated the confusion matrices for the intensity, occurrence (5 and 12 AUs), and combination patterns. As can be seen in Table 5.21, the results are similar to BP4D. Fear is largely confused with Embarrassed; however, here Embarrassed is largely confused with Happy (anger is not evaluated due to not being manually annotated). Similar observations of misclassification between Happy and Embarrassed can be seen in both the intensity patterns (Table 5.19) and the occurrence patterns with five AUs (Table 5.20).

Table 5.15: Confusion matrix for AU occurrence-based (5 AUs) context on the BP4D dataset. HA=Happy, SA=Sad, SU=Surprise, EM=Embarrassed, FE=Fear, PA=Pain, AN=Anger, and DI=Disgust.

     | HA | SA | SU | EM | FE | PA | AN | DI
HA   | 11 |  0 |  3 |  7 |  8 |  3 |  6 |  3
SA   |  0 | 29 |  3 |  0 |  0 |  5 |  0 |  4
SU   |  1 |  6 | 25 |  0 |  0 |  3 |  1 |  5
EM   |  8 |  0 |  1 | 15 |  7 |  0 |  6 |  4
FE   |  6 |  2 |  1 |  9 |  8 |  7 |  6 |  2
PA   |  6 | 11 |  2 |  1 |  4 |  6 |  3 |  8
AN   |  2 |  7 |  2 |  6 |  5 |  1 | 14 |  4
DI   |  5 |  2 |  5 |  3 |  3 |  3 |  4 | 16

Table 5.16: Confusion matrix for AU combination-based (5+7 AUs) context on the BP4D dataset. HA=Happy, SA=Sad, SU=Surprise, EM=Embarrassed, FE=Fear, PA=Pain, AN=Anger, and DI=Disgust.

     | HA | SA | SU | EM | FE | PA | AN | DI
HA   | 19 |  0 |  2 |  8 |  0 |  2 |  6 |  4
SA   |  1 | 25 |  6 |  0 |  1 |  5 |  0 |  3
SU   |  1 |  3 | 30 |  0 |  1 |  3 |  0 |  3
EM   |  9 |  0 |  1 | 13 |  5 |  1 | 11 |  1
FE   |  8 |  2 |  1 | 10 | 13 |  2 |  3 |  2
PA   |  5 |  8 |  2 |  3 |  2 | 12 |  2 |  7
AN   |  6 |  4 |  3 |  9 |  5 |  0 |  9 |  5
DI   |  3 |  2 |  0 |  2 |  2 |  3 |  4 | 25

Table 5.17: Confusion matrix for AU occurrence-based (12 AUs) context on the BP4D dataset. HA=Happy, SA=Sad, SU=Surprise, EM=Embarrassed, FE=Fear, PA=Pain, AN=Anger, and DI=Disgust.

     | HA | SA | SU | EM | FE | PA | AN | DI
HA   | 13 |  0 |  2 | 11 |  2 |  3 |  3 |  7
SA   |  0 | 30 |  4 |  0 |  0 |  4 |  0 |  3
SU   |  2 |  4 | 29 |  1 |  0 |  5 |  0 |  0
EM   |  8 |  0 |  1 | 15 |  4 |  2 | 10 |  1
FE   |  5 |  1 |  0 | 13 | 10 |  2 |  7 |  3
PA   |  6 |  9 |  1 |  4 |  2 | 12 |  2 |  5
AN   |  5 |  5 |  3 |  6 |  7 |  1 | 11 |  3
DI   |  3 |  1 |  2 |  2 |  2 |  3 |  5 | 23

Table 5.18: Evaluation metrics for intensity (5 AUs), occurrence (5 AUs), no normalization (5 AUs), combination (5+7 AUs), and occurrence (12 AUs) on the BP4D+ dataset.

                        | Accuracy | F1 Macro | Kappa
Intensity (5 AUs)       | 0.6018   | 0.6009   | 0.469
Occurrence (5 AUs)      | 0.5125   | 0.5086   | 0.3501
No Norm (5 AUs)         | 0.5732   | 0.431    | 0.5733
Combination (5 + 7 AUs) | 0.6161   | 0.6162   | 0.4881
Occurrence (12 AUs)     | 0.5642   | 0.5608   | 0.419

Table 5.19: Confusion matrix for AU intensity-based (5 AUs) context on the BP4D+ dataset. HA=Happy, EM=Embarrassed, FE=Fear, PA=Pain.

     | HA | EM | FE | PA
HA   | 79 | 37 | 19 |  5
EM   | 33 | 79 | 21 |  7
FE   | 22 | 21 | 72 | 25
PA   |  4 |  9 | 20 | 107

5.6.2.3 BP4D vs. BP4D+ Context Recognition

As BP4D and BP4D+ have similar contexts, we are able to train on one and test on the other. As can be seen in Table 5.23, intensity patterns again outperform the occurrence patterns, both when training on BP4D and testing on BP4D+ and vice-versa. When we compare the two training directions, we see that training on BP4D and testing on BP4D+ performs better than training on BP4D+ and testing on BP4D. This can be explained, in part, by some differences in AUs across the datasets. For example, BP4D has an average AU intensity of 1.19 and BP4D+ has an average intensity of 0.91. As can be seen in Table 5.24, similar observations as Section 5.6.2.1 can be made; more specifically, the Happy and Embarrassed contexts are misclassified. It is important to note that we are not claiming domain transfer [36] with this experiment. While there are some differences between BP4D and BP4D+, they are minor. To note, the AU coders were different but came from the same lab. The threshold for occurrence coding was A or higher in BP4D but B or higher in BP4D+. The scale for intensity coding was the same (A through E), but the difference in threshold for occurrence means that level 1 intensity is likely to be underrepresented in BP4D+ (Table 3.1).

Table 5.20: Confusion matrix for AU occurrence-based (5 AUs) context on the BP4D+ dataset. HA=Happy, EM=Embarrassed, FE=Fear, PA=Pain.

     | HA | EM | FE | PA
HA   | 57 | 52 | 27 |  4
EM   | 39 | 76 | 20 |  5
FE   | 30 | 25 | 50 | 35
PA   |  3 |  9 | 23 | 103

Table 5.21: Confusion matrix for AU combination-based (5+7 AUs) context on the BP4D+ dataset. HA=Happy, EM=Embarrassed, FE=Fear, PA=Pain.

     | HA | EM | FE | PA
HA   | 85 | 30 | 18 |  7
EM   | 34 | 84 | 15 |  7
FE   | 18 | 27 | 74 | 21
PA   |  7 | 13 | 18 | 102

Table 5.22: Confusion matrix for AU occurrence-based (12 AUs) context on the BP4D+ dataset. HA=Happy, EM=Embarrassed, FE=Fear, PA=Pain.

     | HA | EM | FE | PA
HA   | 80 | 33 | 15 | 12
EM   | 36 | 75 | 23 |  6
FE   | 18 | 35 | 58 | 29
PA   | 13 |  7 | 17 | 103

Table 5.23: Evaluation metrics for experiments on BP4D vs. BP4D+. NOTE: 5 intensity and occurrence action units are used in these experiments.

          | Train BP4D / Test BP4D+ (Intensity) | Train BP4D / Test BP4D+ (Occurrence) | Train BP4D+ / Test BP4D (Intensity) | Train BP4D+ / Test BP4D (Occurrence)
Accuracy  | 0.4982 | 0.4875 | 0.4756 | 0.4329
F1 Macro  | 0.4762 | 0.4648 | 0.4831 | 0.4344
Kappa     | 0.3309 | 0.3167 | 0.3008 | 0.2439

Table 5.24: AU Intensity-based Context - Cross Dataset.

Trained on BP4D, tested on BP4D+:
     | HA | EM | FE | PA
HA   | 88 | 25 | 14 |  13
EM   | 54 | 47 | 22 |  17
FE   | 58 | 13 | 32 |  37
PA   | 10 |  2 | 16 | 112

Trained on BP4D+, tested on BP4D:
     | HA | EM | FE | PA
HA   | 20 |  7 | 12 |  2
EM   | 11 | 20 |  9 |  1
FE   | 11 | 10 | 16 |  4
PA   |  8 |  4 |  7 | 22

Table 5.25: AU Occurrence-based Context - Cross Dataset.

Trained on BP4D, tested on BP4D+:
     | HA | EM | FE | PA
HA   | 75 | 24 | 27 |  14
EM   | 48 | 52 | 12 |  28
FE   | 49 | 12 | 30 |  49
PA   | 13 |  0 | 11 | 116

Trained on BP4D+, tested on BP4D:
     | HA | EM | FE | PA
HA   | 13 |  4 | 22 |  2
EM   | 10 | 15 | 16 |  0
FE   |  5 |  7 | 28 |  1
PA   |  4 |  3 | 19 | 15

5.6.3 Network Attention Maps

To understand the relationship between the AUs and context, we visualized attention maps from the output layer using Grad-CAM [102]. We averaged the attention maps across all instances of the context. Fig. 5.9 shows the average attention maps for all videos in DISFA. It is interesting to note that even in the average attention maps we can see that for Happy Video 1 the network has high attention on the presence of AU6 and AU12, whereas in Sad Video 1 the network's attention is on the absence of AUs rather than the presence of them. Please see Appendix B for examples of attention maps for each video for individual subjects.
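The following is a minimal Grad-CAM sketch of the kind used to produce these maps, written against a tf.keras model such as the CNN in Section 5.5.1.1; the convolutional layer name, input image, and class index are placeholders, and this is an illustrative re-implementation rather than the exact code used in this work.

```python
# A minimal Grad-CAM sketch for a tf.keras model. "last_conv" is a placeholder
# layer name; image and class_idx are placeholders.
import numpy as np
import tensorflow as tf

def grad_cam(model, image, class_idx, conv_layer_name="last_conv"):
    grad_model = tf.keras.Model(model.inputs,
                                [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        score = preds[:, class_idx]
    grads = tape.gradient(score, conv_out)                 # d(score)/d(feature maps)
    weights = tf.reduce_mean(grads, axis=(1, 2))           # global-average-pooled gradients
    cam = tf.reduce_sum(weights[:, tf.newaxis, tf.newaxis, :] * conv_out, axis=-1)
    cam = tf.nn.relu(cam)[0]
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()

# Average the maps over all pattern images of one context, as described above:
# avg_map = np.mean([grad_cam(model, img, ctx_id) for img in context_images], axis=0)
```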

Figure 5.9: Average network attention maps (intensity) for all subjects in DISFA. Panels (a)-(i): Happy Video 1, Happy Video 2, Fear Video, Surprise Video 1, Surprise Video 2, Disgust Video 1, Disgust Video 2, Sad Video 1, Sad Video 2.

Figure 5.10: Average network attention maps (occurrence) for all subjects in DISFA. Panels (a)-(i): Happy Video 1, Happy Video 2, Fear Video, Surprise Video 1, Surprise Video 2, Disgust Video 1, Disgust Video 2, Sad Video 1, Sad Video 2.

5.6.4 Automatic AU Detection

We have shown that our method of coding AUs to detect context works for manually annotated AUs. To further evaluate our approach, we also used two open frameworks for detecting AUs: OpenFace [5] and AFAR [37]. To compare with manually annotated AUs, we only used the frames for which AUs were manually annotated. OpenFace [5] gives the intensity of 17 action units and the occurrence of 18 action units. AFAR [37] gives only the probability of occurrence; therefore, we took 0.5 as the cut-off for an AU to be active (occur). Table 5.26 shows the comparison between manually annotated and automatically detected action units. For DISFA context detection and DISFA emotion, intensity detection from OpenFace has the best accuracy. This could be explained, in part, by the increase in the number of AUs that were annotated (12 manual vs. 17 automatic). For the other cases, the manually annotated AUs work better. While the manually labeled AUs did better overall across the three datasets, these results suggest that automatically detected AUs work comparably to manually annotated AUs. This is encouraging for potential use in wild settings where manually annotated AUs are not available.

Table 5.26: Manually vs. automatically annotated AUs.

                           | OpenFace [5] Int | OpenFace [5] Occ | AFAR [37] Occ | Manual Int | Manual Occ | Manual Comb
DISFA Context              | 71.19%           | 57.20%           | 42.80%        | 64.61%     | 51.44%     | -
DISFA Emotion from Context | 76.54%           | 69.96%           | 53.90%        | 78.60%     | 64.19%     | -
DISFA Emotion              | 78.60%           | 70.37%           | 52.65%        | 72.02%     | 67.49%     | -
BP4D                       | 40.78%           | 39.95%           | 37.12%        | 41.77%     | 37.81%     | 44.53%
BP4D+                      | 60.32%           | 58.71%           | 54.46%        | 60.18%     | 51.25%     | 61.16%

5.7 Discussion

While there has been significant work in detecting action units, there has been considerably less work on how to apply the detected action units. Considering this, we proposed to model facial expression dynamics using action unit patterns. The proposed approach is able to model the temporal aspect of facial movement (e.g. expression); more specifically, these patterns are able to recognize the context that was meant to elicit an emotional response. We have shown that the proposed temporal modeling can accurately recognize context, as well as the emotion that the context was meant to elicit. Through our experimental design, we have also shown that action unit intensity patterns are a better descriptor than action unit occurrence patterns when recognizing the context and elicited emotion. Currently, most work has investigated action unit occurrences, with less work being done on intensity. Considering this, our suggestion is for more work to focus on detecting the intensity of action units rather than only the occurrence. This shift in focus would allow for further insight into facial expressions and how they change over time, including across age, gender, culture, and context. Our results also detail the importance of context when recognizing emotion, which is supported by recent works in psychology [9]. Broadly, this work has computationally investigated two theories of emotion: (1) basic emotions have an invariant set of facial configurations [34]; and (2) the counter-argument to (1), where emotions should be looked at from a holistic view, where factors such as context and subject variability need to be accounted for [8]. We argue that the proposed work falls somewhere in the middle of these two conflicting theories and shows validity in each. In this work, we have shown that static action units can be temporally encoded to recognize both context and the targeted emotional response from said context. Our results suggest that while there are different facial configurations (e.g. expressions) for emotions, when context and temporal information are considered, these configurations are a useful tool for recognizing emotional responses.

Although we are encouraged by these results, there are some limitations to our work. Our investigation only used the most widely used and studied action units [73, 137, 139], and for BP4D and BP4D+ we had only five AU intensities encoded. The impact of other action units on recognizing context and the elicited emotional response is not clear. Our results suggest that having a larger corpus of annotated action unit intensities would greatly benefit the field. We have also not taken other emotion-sensitive factors into consideration including, but not limited to, age, gender, and culture. We would like to investigate the changes in facial expression dynamics over these factors (e.g. how we can model changes in facial expression dynamics across age), as it has been shown that these are important factors when analyzing emotion [9, 84, 70]. Along with this, the datasets evaluated are similar in nature, especially BP4D and BP4D+. Considering this, we need to evaluate more varied datasets across both manual and automatically annotated AUs.

Chapter 6: Conclusion

6.1 Findings

This dissertation has resulted in the following findings:

6.1.1 Self-Reported Emotions

We have shown that although males and females have similar self-reported emotions for the same stimulus, the intensity at which they report the emotion differs. Further, we have shown that our proposed 3D CNN is able to detect self-reported emotions with 83% accuracy, and we saw similar accuracy in gender-dependent experiments. We have also shown that there is little variance in the accuracy of detecting the self-reported emotions, showing that the proposed approach works well across a range of subjects.

6.1.2 Impact of Action Unit Occurrence Distribution on Detection

We have shown a strong correlation between the action unit data imbalance and the evaluated F1 scores. We discovered that the same AU patterns occur in multiple tasks, which suggests this negatively impacts detection accuracy. We validated this using multi-AU and single-AU experiments. We showed that AU patterns can be used to train deep neural networks and boost detection accuracy, as well as decrease the negative impact of AU occurrence imbalance.

6.1.3 Fusion of Physiological Signals and Facial Action Units for Pain Recognition

We have shown a multimodal way of detecting contextual pain from other contextual emotions. We found strong correlations between the physiological signals and action units during the most expressive parts of a video sequence. We also verified that there is a delay between the peak of physiological signals and facial movements when the participants immersed their hand in ice-cold water. We achieved an accuracy of 89.20% and an F1 score of 0.75 for detecting pain with our proposed approach, showing the feasibility of using multimodal data for pain recognition.

6.1.4 Recognizing Context Using Facial Expression Dynamics from Action Unit Patterns

Facial action units show the facial muscle movements in single images. In this dissertation, we showed that facial action units have a temporal aspect, that is, their intensity and occurrence change over time. We proposed a novel way of representing this temporal information in the form of images, and demonstrated that these images can be used to detect context with high accuracy. In our analysis, we found that using facial action unit intensity gives better accuracy for detecting context than facial action unit occurrence alone. Further, we have used open facial action unit detection systems to evaluate our approach. This showed that even though there are imperfections in the open methods, our approach is able to detect context with accuracy similar to that of manually labelled action units. We have also shown that a greater number of action units, along with intensity, can positively impact recognition accuracy. This finding can facilitate future research directions into using action units.

6.2 Applications

This dissertation has applications in a wide range of fields, including but not limited to medicine, security, and entertainment. More details on these fields are given below.

1. Health Care is one of the most important applications of our research. Wearable devices with physiological sensors are becoming common, and more people are able to afford them. Currently, such devices can detect incidents such as a fall; our research has shown that physiological signals can support a broader range of applications. With further research, we may be able to identify people experiencing depression, as well as enable early detection of incidents such as strokes or heart attacks. As wearable physiological sensors become more ubiquitous, such early detection could provide timely assistance and has the potential to save lives.

2. Biometrics are becoming ever more important in our daily lives, as we increasingly use software such as face recognition [21]. Biometrics are used for many applications, such as boarding a plane, digital payments, and login verification. As our dependence on biometrics grows, there is an ever-growing need to keep our information secure (e.g., continuous authentication). This dissertation has shown that there are strong correlations between intense facial movements and physiological signals; this correlation can be used for continuous authentication applications.

3. Industry is increasingly leveraging affect for more targeted and personalized products and services. There is a strong correlation between emotions and decision making [57]; therefore, continuously understanding affect is important for industry. Multimodal continuous affect recognition will prove useful for industry and entertainment [29].

6.3 Limitations

While the results from this dissertation are encouraging, there are some limitations.

1. The detection of contextual emotions may not generalize well to other contexts of the same emotions. For example, in this dissertation the context of pain was a person immersing their hand into a bucket of ice-cold water. The emotional response to this might differ from that of a person being poked with a needle (e.g., infant heel lancing [134]). More investigation into context across age, gender, and ethnicity is needed.

2. The large imbalance of action units in many state-of-the-art datasets restricts which patterns can be used. When we pre-trained our networks on action unit patterns, we had to take only the top 66 patterns in order to have enough pattern classes containing minority action units; patterns outside this set could not be investigated (a sketch of this selection step is given after this list). A larger corpus of action units, with a specific focus on patterns, could mitigate this limitation.

3. We found that self-reported emotions are subjective and can vary from person to person. Because of this, while the results are encouraging, they may not generalize to a larger pool of subjects. More experiments to evaluate the generalization of our proposed approach are needed. In addition, our analysis does not consider the initial state of the person, that is, their previous experiences or any predisposition they may have, which could have a significant impact on self-reported emotions.

4. To make video processing tractable for detecting self-reported emotions, we trained our 3D CNN on a subset of frames obtained by skipping frames uniformly. Because the sampling was uniform, key frames useful for detecting self-reported emotions may have been missed. This was done due to hardware constraints, as the proposed network is expensive to train. More efficient models may be needed to fully realize the potential of temporal information for self-reported emotions.

5. In our experiments on recognizing contextual emotion using facial action units, we showed that using more action units, along with their intensities, improves performance. A major limitation of this method, however, is that in publicly available datasets very few action units are annotated, and even fewer have intensity annotations. In addition, action unit annotation is subjective and dependent on expert annotators. More publicly available AU-annotated datasets are needed to facilitate research in this direction.
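The sketch below illustrates the pattern-selection step mentioned in limitation 2: count each frame's AU co-occurrence pattern and keep only the most frequent patterns as pre-training classes. The labels, AU count, and coverage are synthetic; only the cutoff of 66 comes from the text.

```python
# Selecting the most frequent AU occurrence patterns as pre-training classes (synthetic labels).
from collections import Counter
import numpy as np

frames = np.random.default_rng(0).integers(0, 2, size=(10000, 12))  # hypothetical per-frame AU labels (0/1)
pattern_counts = Counter(tuple(row) for row in frames)              # each row is one co-occurrence pattern

top_patterns = {p for p, _ in pattern_counts.most_common(66)}       # keep the 66 most frequent patterns
covered = sum(pattern_counts[p] for p in top_patterns)
print(f"{len(top_patterns)} pattern classes cover {covered / len(frames):.1%} of frames")
```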

6.4 Future Work

In this dissertation, although we have shown the importance of context and its analysis, this is only the beginning of the work that needs to be done. We have shown that context can be detected over time using physiological signals as well as facial movements. Recently, datasets have been collected [106, 90, 136] that are self-annotated with continuous affect (valence and arousal) while participants perform various tasks. These datasets provide a unique opportunity to use our understanding of physiological signals and context to detect continuous affect. Our preliminary results have shown that we can use physiological signals to continuously detect context. We have also seen that EMG signals perform better and are more correlated with context than the other physiological signals. Our analysis of the relationship between context and different modalities, such as EMG, other physiological signals, and facial movements, will help narrow down the most useful modalities and guide future dataset collection.

Analyzing temporal information is useful, but it is also difficult. In this dissertation we have shown a novel way to combine multiple temporal signals, that is, different action units, into a single image that represents all of the action units over time. We found that this temporal representation preserves the co-dependence of the action unit patterns and is useful for detecting contextual affect. We have also shown that some physiological signals are highly correlated with each other. Considering this, we will develop a coding technique, similar to our action unit method, to encode physiological signals as temporal images for analysis.

References

[1] Muhammad Abdul-Mageed and Lyle Ungar. Emonet: Fine-grained emotion detection with gated recurrent neural networks. In Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: Long papers), pages 718–728, 2017.

[2] Amelia Aldao. The future of emotion regulation research: Capturing context. Perspectives on Psychological Science, 8(2):155–172, 2013.

[3] Gilderlane Ribeiro Alexandre, José Marques Soares, and George André Pereira Thé. Systematic review of 3d facial expression recognition methods. Pattern Recognition, 100:107108, 2020.

[4] Zara Ambadar, Jeffrey F Cohn, and Lawrence Ian Reed. All smiles are not created equal: Morphology and timing of smiles perceived as amused, polite, and embarrassed/nervous. Journal of nonverbal behavior, 33(1):17–34, 2009.

[5] Brandon Amos, Bartosz Ludwiczuk, and Mahadev Satyanarayanan. Openface: A general-purpose face recognition library with mobile applications. Technical report, CMU-CS-16-118, CMU School of Computer Science, 2016.

[6] Min SH Aung, Sebastian Kaltwang, Bernardino Romera-Paredes, Brais Martinez, Aneesha Singh, Matteo Cella, Michel Valstar, Hongying Meng, Andrew Kemp, Moshen Shafizadeh, and A. Elkins. The automatic detection of chronic pain-related expression: requirements, challenges and the multimodal emopain dataset. IEEE transactions on affective computing, 7(4):435–451, 2015.

[7] Philip Bard. The neuro-humoral basis of emotional reactions. Handbook of General Experimental Psychology, 1934.

[8] Lisa Feldman Barrett. Was darwin wrong about emotional expressions? Current Directions in Psychological Science, 20(6):400–406, 2011.

[9] Lisa Feldman Barrett, Ralph Adolphs, Stacy Marsella, Aleix M Martinez, and Seth D Pollak. Emotional expressions reconsidered: Challenges to inferring emotion from human facial movements. Psychological science in the public interest, 20(1):1–68, 2019.

[10] Lisa Feldman Barrett, Batja Mesquita, and Maria Gendron. Context in emotion perception. Current Directions in Psychological Science, 20(5):286–290, 2011.

[11] Jacob Benesty, Jingdong Chen, Yiteng Huang, and Israel Cohen. Pearson correlation coefficient. In Noise reduction in speech processing, pages 1–4. Springer, 2009.

[12] Volker Blanz and Thomas Vetter. A morphable model for the synthesis of 3d faces. In Proceedings of the 26th annual conference on Computer graphics and interactive techniques, pages 187–194, 1999.

[13] Eric Bredo. Evolution, psychology, and john dewey’s critique of the reflex arc concept. The Elementary School Journal, 98(5):447–466, 1998.

[14] Leo Breiman. Random forests. Machine learning, 45(1):5–32, 2001.

[15] Joy Buolamwini and Timnit Gebru. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency, pages 77–91. PMLR, 2018.

[16] Hanshu Cai, Xiangzi Zhang, Yanhao Zhang, Ziyang Wang, and Bin Hu. A Case- Based Reasoning Model for Depression Based on Three-Electrode EEG Data. IEEE Transactions on Affective Computing, 11(3):383–392, July 2020.

[17] Silvia Ceccacci, Andrea Generosi, Luca Giraldi, and Maura Mengoni. Tool to make shopping experience responsive to customer emotions. International Journal of Automation Technology, 12(3):319–326, 2018.

[18] Martin Čertický, Michal Čertický, Peter Sinčák, Gergely Magyar, Ján Vaščák, and Filippo Cavallo. Psychophysiological indicators for modeling user experience in interactive digital entertainment. Sensors, 19(5):989, 2019.

[19] Chaitanya, S Sarath, Malavika, Prasanna, and Karthik. Human Emotions Recognition from Thermal Images using Yolo Algorithm. In 2020 International Conference on Communication and Signal Processing (ICCSP), pages 1139–1142. IEEE, July 2020.

[20] Feng-Ju Chang, Anh Tuan Tran, Tal Hassner, Iacopo Masi, Ram Nevatia, and Gerard Medioni. Expnet: Landmark-free, deep, 3d facial expressions. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 122–129. IEEE, 2018.

[21] Jie Chang, Zhonghao Lan, Changmao Cheng, and Yichen Wei. Data uncertainty learning in face recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5710–5719, 2020.

[22] Zachary Chase Lipton, Charles Elkan, and Balakrishnan Narayanaswamy. Thresholding classifiers to maximize f1 score. arXiv, pages arXiv–1402, 2014.

[23] Junkai Chen, Zenghai Chen, Zheru Chi, and Hong Fu. Facial expression recognition in video with multiple feature fusion. IEEE Transactions on Affective Computing, 9(1):38–50, 2016.

[24] Shiyang Cheng, Irene Kotsia, Maja Pantic, and Stefanos Zafeiriou. 4dfab: A large scale 4d database for facial expression analysis and biometric applications. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5117–5126, 2018.

[25] Wen-Sheng Chu, Fernando De la Torre, and Jeffrey F Cohn. Learning spatial and temporal cues for multi-label facial action unit detection. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pages 25–32. IEEE, 2017.

[26] Timothy F Cootes, Gareth J Edwards, and Christopher J Taylor. Active appearance models. In European conference on computer vision, pages 484–498. Springer, 1998.

[27] Daniel T Cordaro, Rui Sun, Dacher Keltner, Shanmukh Kamble, Niranjan Huddar, and Galen McNeil. Universals and cultural variations in 22 emotional expressions across five cultures. Emotion, 18(1):75, 2018.

[28] Ciprian Corneanu, Meysam Madadi, and Sergio Escalera. Deep structure inference network for facial action unit recognition. In Proceedings of the European Conference on Computer Vision (ECCV), pages 298–313, 2018.

[29] Sarah Cosentino, Estelle IS Randria, Jia-Yeu Lin, Thomas Pellegrini, Salvatore Sessa, and Atsuo Takanishi. Group emotion recognition strategies for entertainment robots. In 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 813–818. IEEE, 2018.

[30] Alan Cowen, Disa Sauter, Jessica L Tracy, and Dacher Keltner. Mapping the passions: Toward a high-dimensional taxonomy of emotional experience and expression. Psychological Science in the Public Interest, 20(1):69–90, 2019.

[31] Avijit Datta and Michael Tipton. Respiratory responses to cold water immersion: neural pathways, interactions, and clinical consequences awake and asleep. Journal of Applied Physiology, 100(6):2057–2064, 2006. PMID: 16714416.

[32] Abhinav Dhall, Amanjot Kaur, Roland Goecke, and Tom Gedeon. Emotiw 2018: Audio-video, student engagement and group-level affect prediction. In Proceedings of the 20th ACM International Conference on Multimodal Interaction, ICMI ’18, page 653–656, New York, NY, USA, 2018. Association for Computing Machinery.

[33] Lixin Duan, Ivor W. Tsang, Dong Xu, and Tat-Seng Chua. Domain adaptation from multiple sources via auxiliary classifiers. In Proceedings of the 26th Annual International Conference on Machine Learning, ICML '09, page 289–296, New York, NY, USA, 2009. Association for Computing Machinery.

[34] Paul Ekman and Erika L. Rosenberg, editors. What the face reveals: basic and applied studies of spontaneous expression using the facial action coding system (FACS). Series in affective science. Oxford University Press, Oxford ; New York, 2nd ed edition, 2005.

[35] Itir Onal Ertugrul, Jeffrey F Cohn, László A Jeni, Zheng Zhang, Lijun Yin, and Qiang Ji. Cross-domain au detection: Domains, learning approaches, and measures. In 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), pages 1–8. IEEE, 2019.

[36] Itir Onal Ertugrul, Jeffrey F Cohn, László A Jeni, Zheng Zhang, Lijun Yin, and Qiang Ji. Crossing domains for au coding: Perspectives, approaches, and measures. IEEE transactions on biometrics, behavior, and identity science, 2(2):158–171, 2020.

[37] Itir Onal Ertugrul, László A Jeni, Wanqiao Ding, and Jeffrey F Cohn. Afar: A deep learning based tool for automated facial affect recognition. In 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), pages 1–1. IEEE, 2019.

[38] Diego Fabiano and Shaun Canavan. Deformable synthesis model for emotion recognition. In 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), pages 1–5, 2019.

[39] Diego Fabiano and Shaun Canavan. Emotion recognition using fused physiological signals. In 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), pages 42–48. IEEE, 2019.

[40] Alberto Fung and Daniel McDuff. A scalable approach for facial action unit classifier training using noisy data for pre-training. arXiv preprint arXiv:1911.05946, 2019.

[41] Maria Gendron, Debi Roberson, Jacoba Marietta van der Vyver, and Lisa Feldman Barrett. Perceptions of emotion from facial expressions are not culturally universal: evidence from a remote culture. Emotion, 14(2):251, 2014.

[42] Jeffrey M Girard, Wen-Sheng Chu, László A Jeni, and Jeffrey F Cohn. Sayette Group Formation Task (GFT) Spontaneous Facial Expression Database. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pages 581–588, Washington, DC, USA, May 2017. IEEE.

[43] Jeffrey M Girard, Jeffrey F Cohn, László A Jeni, Simon Lucey, and Fernando De la Torre. How much training data for facial action unit detection? In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), volume 1, pages 1–8. IEEE, 2015.

[44] Jeffrey M Girard, Gayatri Shandar, Zhun Liu, Jeffrey F Cohn, Lijun Yin, and Louis-Philippe Morency. Reconsidering the Duchenne smile: Indicator of positive emotion or artifact of smile intensity? In 2019 8th International Conference on Affective Computing and Intelligent Interaction (ACII), pages 594–599. IEEE, 2019.

[45] Boqing Gong, Yuan Shi, Fei Sha, and Kristen Grauman. Geodesic flow kernel for unsupervised domain adaptation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2066–2073. IEEE, 2012.

[46] Judith A. Hall and David Matsumoto. Gender differences in judgments of multiple emotions from facial expressions. Emotion (Washington, D.C.), 4(2):201–206, June 2004.

[47] Saurabh Hinduja, Shaun Canavan, and Gurmeet Kaur. Multimodal Fusion of Physiological Signals and Facial Action Units for Pain Recognition. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pages 577–581. IEEE, November 2020.

[48] Saurabh Hinduja, Shaun Canavan, and Lijun Yin. Recognizing Perceived Emotions from Facial Expressions. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pages 236–240. IEEE, November 2020.

[49] Rahatul Jannat, Iyonna Tynes, Lott La Lime, Juan Adorno, and Shaun Canavan. Ubiquitous emotion recognition using audio and video data. In Proceedings of the 2018 ACM International Joint Conference and 2018 International Symposium on Pervasive and Ubiquitous Computing and Wearable Computers, pages 956–959, 2018.

[50] László A Jeni, Jeffrey F Cohn, and Fernando De La Torre. Facing imbalanced data–recommendations for the use of performance metrics. In 2013 Humaine association conference on affective computing and intelligent interaction, pages 245–251. IEEE, 2013.

[51] Sarah Jordan, Laure Brimbal, D Brian Wallace, Saul M Kassin, Maria Hartwig, and Chris NH Street. A test of the micro-expressions training tool: Does it improve lie detection? Journal of Investigative Psychology and Offender Profiling, 16(3):222–235, 2019.

[52] Dacher Keltner. Signs of appeasement: Evidence for the distinct displays of embarrassment, amusement, and shame. Journal of personality and social psychology, 68(3):441, 1995.

[53] Davis E King. Dlib-ml: A machine learning toolkit. Journal of Machine Learning Research, 10(Jul):1755–1758, 2009.

[54] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[55] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.

[56] Alison Ledgerwood. Evaluations in their social context: Distance regulates consistency and context dependence. Social and Personality Psychology Compass, 8(8):436–447, 2014.

[57] Jennifer S Lerner, Ye Li, Piercarlo Valdesolo, and Karim S Kassam. Emotion and decision making. Annual review of psychology, 66, 2015.

[58] Shan Li and Weihong Deng. Deep Facial Expression Recognition: A Survey. IEEE Transactions on Affective Computing, pages 1–1, 2020.

[59] Wei Li, Farnaz Abtahi, and Zhigang Zhu. Action unit detection with region adaptation, multi-labeling learning and optimal temporal fusing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1841–1850, 2017.

[60] Wei Li, Farnaz Abtahi, Zhigang Zhu, and Lijun Yin. Eac-net: A region-based deep enhancing and cropping approach for facial action unit detection. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pages 103–110. IEEE, 2017.

[61] Wei Li, Farnaz Abtahi, Zhigang Zhu, and Lijun Yin. Eac-net: Deep nets with enhancing and cropping for facial action unit detection. IEEE transactions on pattern analysis and machine intelligence, 40(11):2583–2596, 2018.

[62] Ya Li, Jianhua Tao, Björn Schuller, Shiguang Shan, Dongmei Jiang, and Jia Jia. Mec 2017: Multimodal emotion recognition challenge. In 2018 First Asian Conference on Affective Computing and Intelligent Interaction (ACII Asia), pages 1–5. IEEE, 2018.

[63] James J Lien, Takeo Kanade, Jeffrey F Cohn, and Ching-Chung Li. Automated facial expression recognition based on facs action units. In Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition, pages 390–395. IEEE, 1998.

[64] Daizong Liu, Xi Ouyang, Shuangjie Xu, Pan Zhou, Kun He, and Shiping Wen. Saanet: Siamese action-units attention network for improving dynamic facial expression recognition. Neurocomputing, 413:145–157, 2020.

[65] Peng Liu, Zheng Zhang, Huiyuan Yang, and Lijun Yin. Multi-modality empowered network for facial action unit detection. In 2019 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 2175–2184. IEEE, 2019.

[66] Zhilei Liu, Guoxian Song, Jianfei Cai, Tat-Jen Cham, and Juyong Zhang. Conditional adversarial synthesis of 3d facial action units. Neurocomputing, 355:200–208, 2019.

[67] William Lovallo. The cold pressor test and autonomic function: A review and integration. Psychophysiology, 12(3):268–282, 1975.

[68] Patrick Lucey, Jeffrey F Cohn, Iain Matthews, Simon Lucey, Sridha Sridharan, Jessica Howlett, and Kenneth M Prkachin. Automatically detecting pain in video through facial action units. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 41(3):664–674, 2010.

[69] Patrick Lucey, Jeffrey F Cohn, Kenneth M Prkachin, Patricia E Solomon, and Iain Matthews. Painful data: The unbc-mcmaster shoulder pain expression archive database. In Face and Gesture 2011, pages 57–64. IEEE, 2011.

[70] Kaixin Ma, Xinyu Wang, Xinru Yang, Mingtong Zhang, Jeffrey M Girard, and Louis-Philippe Morency. Elderreact: A multimodal dataset for recognizing emotional response in aging adults. In 2019 International Conference on Multimodal Interaction, pages 349–357, 2019.

[71] Aleix M Martinez. Context may reveal how you feel. Proceedings of the National Academy of Sciences, 116(15):7169–7171, 2019.

[72] Christine Mattley. The temporality of emotion: Constructing past emotions. Symbolic Interaction, 25(3):363–378, 2002.

[73] Mohammad Mavadati, Mohammad H. Mahoor, Kevin Bartlett, Philip Trinh, and Jeffrey F. Cohn. DISFA: A Spontaneous Facial Action Intensity Database. IEEE Transactions on Affective Computing, 4(2):151–160, April 2013.

[74] Mohammad Mavadati, Peyten Sanger, and Mohammad H. Mahoor. Extended DISFA Dataset: Investigating Posed and Spontaneous Facial Expressions. In 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1452–1459. IEEE, June 2016.

[75] Batja Mesquita, Lisa Feldman Barrett, and Eliot R Smith. The mind in context. Guilford Press, 2010.

[76] Batja Mesquita and Michael Boiger. Emotions in context: A sociodynamic model of emotions. Emotion Review, 6(4):298–302, 2014.

[77] Dale Metcalfe, Karen McKenzie, Kristofor McCarty, and Thomas V Pollet. Emotion recognition from body movement and gesture in children with autism spectrum disorder is improved by situational cues. Research in Developmental Disabilities, 86:1–10, 2019.

[78] Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

[79] Trisha Mittal, Pooja Guhan, Uttaran Bhattacharya, Rohan Chandra, Aniket Bera, and Dinesh Manocha. Emoticon: Context-aware multimodal emotion recognition using frege’s principle. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14234–14243, 2020.

[80] Laurent Mourot, Malika Bouhaddi, and Jacques Regnard. Effects of the cold pressor test on cardiac autonomic control in normal subjects. Physiological research, 58(1), 2009.

[81] Niklas Müller, Bettina Eska, Richard Schäffer, Sarah Theres Völkel, Michael Braun, Gesa Wiegand, and Florian Alt. Arch'n'smile: A jump'n'run game using facial expression recognition control for entertaining children during car journeys. In Proceedings of the 17th International Conference on Mobile and Ubiquitous Multimedia, pages 335–339, 2018.

[82] Thu Nguyen, Khang Tran, and Hung Nguyen. Towards thermal region of interest for human emotion estimation. In 2018 10th International Conference on Knowledge and Systems Engineering (KSE), pages 152–157. IEEE, 2018.

[83] Xuesong Niu, Hu Han, Songfan Yang, Yan Huang, and Shiguang Shan. Local relationship learning with person-specific shape regularization for facial action unit detection. In Computer Vision and Pattern Recognition, pages 11917–11926, 2019.

[84] Behnaz Nojavanasghari, Tadas Baltrušaitis, Charles E Hughes, and Louis-Philippe Morency. Emoreact: a multimodal approach and dataset for recognizing emotional responses in children. In International Conference on Multimodal Interaction, pages 137–144, 2016.

[85] Temitayo A Olugbade, Nadia Bianchi-Berthouze, Nicolai Marquardt, and Amanda C Williams. Pain level recognition using kinematics and muscle activity for physical rehabilitation in chronic pain. In 2015 International Conference on Affective Computing and Intelligent Interaction (ACII), pages 243–249. IEEE, 2015.

[86] Temitayo A Olugbade, Aneesha Singh, Nadia Bianchi-Berthouze, Nicolai Marquardt, Min SH Aung, and Amanda C De C Williams. How can affect be detected and represented in technological support for physical rehabilitation? ACM Transactions on Computer-Human Interaction (TOCHI), 26(1):1–29, 2019.

[87] Itir Onal Ertugrul, Le Yang, Laszlo A Jeni, and Jeffrey F Cohn. D-pattnet: Dynamic patch-attentive deep network for action unit detection. Frontiers in computer science, 1:11, 2019.

[88] Eleonora Pantano. Non-verbal evaluation of retail service encounters through consumers' facial expressions. Computers in Human Behavior, 111:106448, October 2020.

[89] Maja Pantic and Ioannis Patras. Dynamics of facial expression: Recognition of facial actions and their temporal segments from face profile image sequences. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 36(2):433–449, 2006.

[90] Cheul Young Park, Narae Cha, Soowon Kang, Auk Kim, Ahsan Habib Khandoker, Leontios Hadjileontiadis, Alice Oh, Yong Jeong, and Uichin Lee. K-emocon, a multimodal sensor dataset for continuous emotion recognition in naturalistic conversations. Scientific Data, 7(1):1–16, 2020.

[91] W Gerrod Parrott. Emotions in social psychology. Key readings in social psychology, 2000.

[92] Ravindra Patil, Kiran Chaudhari, S. N. Kakarwal, R. R. Deshmukh, and D. V. Kurmude. Analysis of Facial Expression Recognition of Visible, Thermal and Fused Imaginary in Indoor and Outdoor Environment. In Internet of Things, Smart Computing and Technology: A Roadmap Ahead, Studies in Systems, Decision and Control. Springer International Publishing.

[93] Rosalind W. Picard. Affective computing. The MIT Press, first paperback edition, 2000. OCLC: 247967780.

[94] Rosalind W. Picard, Elias Vyzas, and Jennifer Healey. Toward machine emotional intelligence: Analysis of affective physiological state. IEEE transactions on pattern analysis and machine intelligence, 23(10):1175–1191, 2001.

[95] Yvain Quéau, François Lauze, and Jean-Denis Durou. Solving uncalibrated photometric stereo using total variation. Journal of Mathematical Imaging and Vision, 52(1):87–107, 2015.

[96] R. Ramya, K. Mala, and S. Selva Nidhyananthan. 3D Facial Expression Recognition Using Multi-channel Deep Learning Framework. Circuits, Systems, and Signal Processing, 39(2):789–804, February 2020.

[97] Michael David Resnik. The context principle in frege’s philosophy. Philosophy and Phenomenological Research, 27(3):356–365, 1967.

[98] Andrés Romero, Juan León, and Pablo Arbeláez. Multi-view dynamic facial action unit detection. Image and Vision Computing, page 103723, 2018.

[99] Tomasz Sapiński, Dorota Kamińska, Adam Pelikant, and Gholamreza Anbarjafari. Emotion recognition from skeletal movements. Entropy, 21(7):646, 2019.

[100] Sebastian Schindler, Ria Vormbrock, and Johanna Kissler. Emotion in context: how sender predictability and identity affect processing of words as imminent personality feedback. Frontiers in psychology, 10:94, 2019.

[101] William R Schucany and Suojin Wang. One-step bootstrapping for smooth iterative procedures. Journal of the Royal Statistical Society: Series B (Methodological), 53(3):587–596, 1991.

[102] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE international conference on computer vision, pages 618–626, 2017.

[103] Zhiwen Shao, Zhilei Liu, Jianfei Cai, and Lizhuang Ma. Deep adaptive attention for joint facial action unit detection and face alignment. In Proceedings of the European Conference on Computer Vision (ECCV), pages 705–720, 2018.

[104] Zhiwen Shao, Zhilei Liu, Jianfei Cai, and Lizhuang Ma. Jaa-net: Joint facial action unit detection and face alignment via adaptive attention. arXiv preprint arXiv:2003.08834, 2020.

[105] Zhiwen Shao, Zhilei Liu, Jianfei Cai, Yunsheng Wu, and Lizhuang Ma. Facial Action Unit Detection Using Attention and Relation Learning. IEEE Transactions on Affective Computing, pages 1–1, 2019.

[106] Karan Sharma, Claudio Castellini, Egon L van den Broek, Alin Albu-Schaeffer, and Friedhelm Schwenker. A dataset of continuous affect annotations and physiological signals for emotion analysis. Scientific data, 6(1):1–13, 2019.

[107] Erika H Siegel, Molly K Sands, Wim Van den Noortgate, Paul Condon, Yale Chang, Jennifer Dy, Karen S Quigley, and Lisa Feldman Barrett. Emotion fingerprints or emotion populations? a meta-analytic investigation of autonomic features of emotion categories. Psychological bulletin, 144(4):343, 2018.

[108] Gulbadan Sikander and Shahzad Anwar. A Novel Machine Vision-Based 3D Facial Action Unit Identification for Fatigue Detection. IEEE Transactions on Intelligent Transportation Systems, pages 1–11, 2020.

[109] Yale Song, Daniel McDuff, Deepak Vasisht, and Ashish Kapoor. Exploiting sparsity and co-occurrence structure for action unit recognition. In 2015 11th IEEE international conference and workshops on automatic face and gesture recognition (FG), volume 1, pages 1–8. IEEE, 2015.

[110] P. Šrámek, M. Šimečková, L. Janský, J. Šavlíková, and S. Vybíral. Human physiological responses to immersion into water of different temperatures. European Journal of Applied Physiology, 81(5):436–442, Feb 2000.

[111] Ramprakash Srinivasan and Aleix M Martinez. Cross-Cultural and Cultural-Specific Production and Perception of Facial Expressions of Emotion in the Wild. IEEE Transactions on Affective Computing, 2018.

[112] Michael JL Sullivan, Pascal Thibault, André Savard, Richard Catchlove, John Kozey, and William D Stanish. The influence of communication goals and physical demands on different dimensions of pain behavior. Pain, 125(3):270–277, 2006.

[113] Qian Sun, Rita Chattopadhyay, Sethuraman Panchanathan, and Jieping Ye. A two-stage weighting framework for multi-source domain adaptation. In Advances in neural information processing systems, pages 505–513, 2011.

[114] Leslie A Thornton and David R Novak. Storying the temporal nature of emotion work among volunteers: Bearing witness to the lived traumas of others. Health communication, 25(5):437–448, 2010.

[115] Rebecca M Todd, Vladimir Miskovic, Junichi Chikazoe, and Adam K Anderson. Emotional objectivity: Neural representations of emotions and their interaction with cognition. Annual Review of Psychology, 71:25–48, 2020.

[116] Cheng-Hao Tu, Chih-Yuan Yang, and Jane Yung-jen Hsu. Idennet: Identity-aware facial action unit detection. In 2019 14th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2019), pages 1–8. IEEE, 2019.

[117] Md Taufeeq Uddin and Shaun Canavan. Synthesizing physiological and motion data for stress and meditation detection. In Affective Computing and Intelligent Interaction Workshops and Demos, pages 244–247, 2019.

[118] Michel F Valstar, Timur Almaev, Jeffrey M Girard, Gary McKeown, Marc Mehu, Lijun Yin, Maja Pantic, and Jeffrey F Cohn. Fera 2015-second facial expression recognition and analysis challenge. In 2015 11th IEEE International Conference and Workshops on Automatic Face and Gesture Recognition (FG), volume 6, pages 1–8. IEEE, 2015.

[119] Michel F Valstar, Bihan Jiang, Marc Mehu, Maja Pantic, and Klaus Scherer. The first facial expression recognition and analysis challenge. In Face and Gesture 2011, pages 921–926. IEEE, 2011.

[120] Michel F Valstar, Enrique Sánchez-Lozano, Jeffrey F Cohn, László A Jeni, Jeffrey M Girard, Zheng Zhang, Lijun Yin, and Maja Pantic. Fera 2017-addressing head pose in the third facial expression recognition and analysis challenge. In 2017 12th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2017), pages 839–847. IEEE, 2017.

[121] Chongyang Wang, Min Peng, Temitayo A Olugbade, Nicholas D Lane, Amanda C De C Williams, and Nadia Bianchi-Berthouze. Learning bodily and temporal attention in protective movement behavior detection. arXiv preprint arXiv:1904.10824, 2019.

[122] Kai Wang, Xiaojiang Peng, Jianfei Yang, Debin Meng, and Yu Qiao. Region attention networks for pose and occlusion robust facial expression recognition. IEEE Transactions on Image Processing, 29:4057–4069, 2020.

[123] Shangfei Wang, Heyan Ding, and Guozhu Peng. Dual Learning for Facial Action Unit Detection Under Nonfull Annotation. IEEE Transactions on Cybernetics, pages 1–13, 2020.

[124] Shangfei Wang, Bowen Pan, Huaping Chen, and Qiang Ji. Thermal augmented expression recognition. IEEE transactions on cybernetics, 48(7):2203–2214, 2018.

[125] Christian E. Waugh, Elaine Z. Shing, and Brad M. Avery. Temporal dynamics of emotional processing in the brain. Emotion Review, 7(4):323–329, 2015.

[126] Xiaofan Wei, Huibin Li, Jian Sun, and Liming Chen. Unsupervised domain adaptation with regularized optimal transport for multimodal 2d+3d facial expression recognition. In 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018), pages 31–37. IEEE, 2018.

[127] Yunchao Wei, Wei Xia, Min Lin, Junshi Huang, Bingbing Ni, Jian Dong, Yao Zhao, and Shuicheng Yan. Hcp: A flexible cnn framework for multi-label image classification. IEEE transactions on pattern analysis and machine intelligence, 38(9):1901–1907, 2015.

[128] Philipp Werner, Sebastian Handrich, and Ayoub Al-Hamadi. Facial action unit intensity estimation and feature relevance visualization with random regression forests. In 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), pages 401–406. IEEE, 2017.

[129] J.S Winston, J O'Doherty, and R.J Dolan. Common and distinct neural responses during direct and incidental processing of multiple facial emotions. NeuroImage, 20(1):84–97, September 2003.

[130] Xiaojing Xu and Virginia R de Sa. Exploring multidimensional measurements for pain evaluation using facial action units. In 2020 15th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2020), pages 559–565, 2020.

[131] Huiyuan Yang, Umur Ciftci, and Lijun Yin. Facial expression recognition by de-expression residue learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2168–2177, 2018.

[132] Le Yang, Itir Onal Ertugrul, Jeffrey F. Cohn, Zakia Hammal, Dongmei Jiang, and Hichem Sahli. FACS3D-Net: 3D Convolution based Spatiotemporal Representation for Action Unit Detection. pages 538–544, September 2019.

[133] Seunghyun Yoon, Seokhyun Byun, and Kyomin Jung. Multimodal speech emotion recognition using audio and text. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 112–118. IEEE, 2018.

[134] Ghada Zamzmi, Rahul Paul, Dmitry Goldgof, Rangachar Kasturi, and Yu Sun. Pain assessment from facial expression: Neonatal convolutional neural network (n-cnn). In 2019 International Joint Conference on Neural Networks (IJCNN), pages 1–7. IEEE, 2019.

[135] Jiabei Zeng, Wen-Sheng Chu, Fernando De la Torre, Jeffrey F Cohn, and Zhang Xiong. Confidence preserving machine for facial action unit detection. In Proceedings of the IEEE international conference on computer vision, pages 3622–3630, 2015.

[136] Tianyi Zhang, Abdallah El Ali, Chen Wang, Alan Hanjalic, and Pablo Cesar. Rcea: Real-time, continuous emotion annotation for collecting precise mobile video ground truth labels. In Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, pages 1–15, 2020.

[137] Xing Zhang, Lijun Yin, Jeffrey F Cohn, Shaun Canavan, Michael Reale, Andy Horowitz, Peng Liu, and Jeffrey M Girard. Bp4d-spontaneous: a high-resolution spontaneous 3d dynamic facial expression database. Image and Vision Computing, 32(10):692–706, 2014.

[138] Zhaoli Zhang, Zhenhua Li, Hai Liu, Taihe Cao, and Sannyuya Liu. Data-driven online learning engagement detection via facial expression and mouse behavior recognition technology. Journal of Educational Computing Research, 58(1):63–86, 2020.

[139] Zheng Zhang, Jeffrey M. Girard, Yue Wu, Xing Zhang, Peng Liu, Umur Ciftci, Shaun Canavan, Michael Reale, Andrew Horowitz, Huiyuan Yang, Jeffrey F. Cohn, Qiang Ji, and Lijun Yin. Multimodal Spontaneous Emotion Corpus for Human Behavior Analysis. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3438–3446. IEEE, June 2016.

[140] Zhifei Zhang, Yang Song, and Hairong Qi. Age progression/regression by conditional adversarial autoencoder. In Computer Vision and Pattern Recognition, pages 5810–5818, 2017.

[141] Kaili Zhao, Wen-Sheng Chu, Fernando De la Torre, Jeffrey F Cohn, and Honggang Zhang. Joint patch and multi-label learning for facial action unit detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2207–2216, 2015.

[142] Kaili Zhao, Wen-Sheng Chu, and Honggang Zhang. Deep region and multi-label learning for facial action unit detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3391–3399, 2016.

[143] Sicheng Zhao, Amir Gholaminejad, Guiguang Ding, Yue Gao, Jungong Han, and Kurt Keutzer. Personalized emotion recognition by personality-aware high-order learning of physiological signals. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 15(1s):1–18, 2019.

[144] Yiru Zhao, Wanfeng Ge, Wenxin Li, Run Wang, Lei Zhao, and Jiang Ming. Capturing the persistence of facial expression features for deepfake video detection. In International Conference on Information and Communications Security, pages 630–645. Springer, 2019.

[145] James Zou and Londa Schiebinger. AI can be sexist and racist — it’s time to make it fair, July 2018.

Appendix A: Dictionary

Some of the commonly used terms in this dissertation are defined below:

• Context: The stimulus to which the subject is introduced, for example, being shown a video or told a joke.

• Expressed Emotion: The stereotypical facial expression associated with an emotion; for example, a smile is considered to express happiness.

• Self-Reported Emotion: The emotion reported by the subject after a given task.

• Target Emotion: The emotion the stimulus (context) is meant to elicit.

Appendix B: Attention Maps

(a) Examples of Happy Video 1

(b) Examples of Happy Video 2

Figure B.1: Examples of Happy Video Attention Maps Using Intensity

(a) Examples of Surprise Video 1

(b) Examples of Surprise Video 2

Figure B.2: Examples of Surprise Video Attention Maps Using Intensity

(a) Examples of Sad Video 1

(b) Examples of Sad Video 2

Figure B.3: Examples of Sad Video Attention Maps Using Intensity

(a) Examples of Disgust Video 1

(b) Examples of Disgust Video 2

Figure B.4: Examples of Disgust Video Attention Maps Using Intensity

(a) Examples of Fear Video

Figure B.5: Examples of Fear Video Attention Maps Using Intensity

(a) Examples of Happy Video 1

(b) Examples of Happy Video 2

Figure B.6: Examples of Happy Video Attention Maps Using Occurrence

(a) Examples of Surprise Video 1

(b) Examples of Surprise Video 2

Figure B.7: Examples of Surprise Video Attention Maps Using Occurrence

(a) Examples of Sad Video 1

(b) Examples of Sad Video 2

Figure B.8: Examples of Sad Video Attention Maps Using Occurrence

(a) Examples of Disgust Video 1

(b) Examples of Disgust Video 2

Figure B.9: Examples of Disgust Video Attention Maps Using Occurrence

(a) Examples of Fear Video

Figure B.10: Examples of Fear Video Attention Maps Using Occurrence

Appendix C: Copyright Permissions

The permission below is for the reproduction of material in Chapter 2.

Figure C.1: Copyright for Chapter 2

The permission below is for the reproduction of material in Chapter 4.

Figure C.2: Copyright for Chapter 4

About the Author

Saurabh Hinduja obtained a Bachelor's degree in Electronics and Communication Engineering from Birla Institute of Technology, India in 2009, a Master's degree in Business Administration (MBA) in Human Resources from Symbiosis Center for Management and Human Resource Development, India in 2013, and a Master's degree in Computer Science from the University of South Florida, Tampa in 2016. During his doctoral program he published over eight papers in conferences and journals. His main areas of interest are affective computing, computer vision, and neural networks. His current research is in the field of contextual emotions.