FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

Automatic Emotion Identification: Analysis and Detection of Facial Expressions in Movies

João Carlos Miranda de Almeida

Mestrado Integrado em Engenharia Informática e Computação

Supervisor: Paula Viana Co-Supervisor: Inês Nunes Teixeira Co-Supervisor: Luís Vilaça

September 21, 2020

Automatic Emotion Identification: Analysis and Detection of Facial Expressions in Movies

João Carlos Miranda de Almeida

Mestrado Integrado em Engenharia Informática e Computação

September 21, 2020

Abstract

The bond between spectators and films takes place in the emotional dimension of a film. The emotional dimension is characterized by the filmmakers' decisions on writing, imagery and sound, but it is through acting that emotions are directly transmitted to the audience. Understanding how this bond is created can give us essential information about how humans interact with this increasingly digital medium and how this information can be integrated into large film platforms.

Our work represents another step towards identifying emotions in cinema, particularly from camera close-ups of actors that are often used to evoke intense emotions in the audience. During the last few decades, the research community has made promising progress in developing facial expression recognition methods, but without much emphasis on the complex nature of film, with variations in lighting and pose, a problem discussed in detail in this work.

We start by focusing on the understanding, from the social sciences, of the state-of-the-art models for emotion classification, discussing their strengths and weaknesses. Secondly, we introduce Facial Emotion Recognition (FER) systems and automatic emotion analysis in movies, analyzing unimodal and multimodal strategies. We present a comparison between the relevant databases and computer vision techniques used in facial emotion recognition, and we highlight some issues caused by the heterogeneity of the databases, since there is no universal model of emotions.

Built upon this understanding, we designed and implemented a framework for testing the feasibility of an end-to-end solution that uses facial expressions to determine the emotional charge of a film. The framework has two phases: firstly, the selection and in-depth analysis of the relevant databases to serve as a proof of concept of the application; secondly, the selection of the model that satisfies our needs, by conducting a benchmark of the most promising convolutional neural networks when trained with a facial dataset.

Lastly, we discuss and evaluate the results of the several experiments made throughout the framework. We learn that current FER approaches insist on using a wide range of emotions that hinders the robustness of models, and we propose a new way to look at emotions computationally by creating possible clusters of emotions that do not diminish the collected information, based on the evidence obtained. Finally, we develop a new database of facial masks and discuss some promising paths that may lead to the implementation of the aforesaid system.

Keywords: Facial Expression Recognition, Multimodal Sentiment Analysis, Emotion Analysis, Movie Databases, Deep Learning, Computer Vision

Resumo

A ligação estabelecida entre espetadores e filmes ocorre na dimensão emocional de um filme. A dimensão emocional é caracterizada pela decisão dos cineastas na escrita, imagem e som, mas é através da atuação dos atores que as emoções são diretamente transmitidas para a plateia. Perceber como esta ligação é criada pode dar-nos informações essenciais acerca de como os humanos interagem com este meio digital e como esta informação poderá ser integrada em grandes plataformas de filmes.

O nosso trabalho é mais um passo nas técnicas de identificação de emoções no cinema, particularmente através de close-ups aos atores muitas vezes utilizados para evocar emoções intensas na audiência. Nas últimas décadas, a comunidade científica tem feito progressos promissores no desenvolvimento de métodos para o reconhecimento de expressões faciais. No entanto, não é dado grande ênfase à complexa natureza dos filmes, como variações na iluminação das cenas ou pose dos atores, algo que é explorado mais em detalhe neste trabalho.

Começamos por uma explicação extensiva dos modelos mais modernos de classificação de emoções, no contexto das ciências sociais. Em segundo lugar, apresentamos os sistemas de Reconhecimento Facial de Emoções (FER, do inglês Facial Emotion Recognition) e análise automática de emoções em filmes, explorando estratégias unimodais e multimodais. Apresentamos uma comparação entre as bases de dados relevantes e as técnicas de visão computacional usadas no reconhecimento de emoções faciais e destacamos alguns problemas causados pela heterogeneidade das bases de dados, uma vez que não existe um modelo universal de emoções.

Com base neste conhecimento, projetámos e implementámos uma framework para testar a viabilidade de uma solução end-to-end que utilize as expressões faciais para determinar a carga emocional de um filme. A framework é composta por duas fases: em primeiro lugar, a seleção e análise aprofundada das bases de dados relevantes para servir como uma prova de conceito da aplicação; e em segundo lugar, a seleção do modelo deep learning mais apropriado para alcançar os objetivos propostos, conduzindo um benchmark das redes neurais convolucionais mais promissoras quando treinadas com uma base de dados contendo faces.

No final, discutimos e avaliamos os resultados das diversas experiências feitas ao longo da framework. Aprendemos que as actuais abordagens de FER insistem na utilização de uma vasta gama de emoções que dificulta a robustez dos modelos e propomos uma nova forma de olhar para as emoções computacionalmente, criando possíveis grupos de emoções que não diminuem a informação recolhida com base na evidência obtida. Finalmente, desenvolvemos uma nova base de dados de máscaras faciais, discutindo alguns caminhos promissores que podem levar à implementação do referido sistema.

Keywords: Reconhecimento de Expressões Faciais, Análise Multimodal de Sentimento, Análise de Emoção, Bases de dados cinematográficas, Aprendizagem Computacional, Inteligência Artificial, Visão por Computador

Acknowledgements

First and foremost, I would like to thank my supervisors, Professor Paula Viana and Inês Nunes Teixeira, for giving me the opportunity to participate both in an internship and in a dissertation about two of the subjects I am most passionate about. Luís Vilaça, thank you for bringing me the calmness and the necessary clear view on the problems and how to tackle them. To all three supervisors, a big thank you for your support, guidance and feedback throughout this project and for making me grow as an academic and as a person.

Secondly, I am eternally grateful to my parents, brother, girlfriend and friends for endlessly supporting me during this coursework and for always being my Home regardless of how far I fly.

Finally, I would like to thank all the people I have met through the years for making me who I am today and for showing me that love and tolerance are the only values that should really matter.

Obrigado.

João Almeida

“If we opened people up, we’d find landscapes.”

Agnès Varda

Contents

1 Introduction
  1.1 Context
  1.2 Motivation
  1.3 Goals
  1.4 Document Structure

2 The Human Emotion
  2.1 Discrete emotion model
  2.2 Dimensional emotion model
  2.3 Facial Action Coding System (FACS)
  2.4 Mappings between Emotion Representation Models
  2.5 Discussion

3 Facial Expression Recognition
  3.1 Historical Overview
  3.2 Deep Learning current approach
    3.2.1 Data Preprocessing
    3.2.2 Convolutional Neural Networks
  3.3 Model Evaluation and Validation
  3.4 Datasets
  3.5 Open-source and commercial solutions

4 Automatic Emotion Analysis in Movies
  4.1 Unimodal Approach
    4.1.1 Emotion in text
    4.1.2 Emotion in sound
    4.1.3 Emotion in image and video
  4.2 Multimodal approach
    4.2.1 Feature fusion techniques
  4.3 Commercial Solutions

5 Methodology, Results and Evaluation
  5.1 Problem Definition and Methodology
  5.2 Implementation details
  5.3 Approach
    5.3.1 Datasets Exploration
    5.3.2 Facial Detection in Movies
    5.3.3 CNNs Baseline and Benchmark


    5.3.4 Deep Learning Modeling and Optimization
    5.3.5 FER2013 dataset balancing
    5.3.6 Reducing the dimensionality of the problem
    5.3.7 Combining Facial Landmark masks with the original face image as input

6 Conclusions
  6.1 Contributions
  6.2 Future work

A FER Datasets Source

B Movies Present in AFEW and SFEW Datasets

C API response examples from commercial solutions
  C.1 Amazon Rekognition
  C.2 Vision API

D Multimodal Movie Datasets and Related Studies

References

List of Figures

1.1 L’arrivée d’un train en gare de La Ciotat (1895)

2.1 Circumplex Model of Emotion
2.2 Upper Face Action Units (AUs) from the Facial Action Coding System
2.3 Spatial distribution of NAWL word classifications in the valence-arousal affective space
2.4 Distribution of emotions in the valence-arousal affective map

3.1 Facial Expression Recognition stages outline
3.2 Histogram of Oriented Gradients of Barack Obama’s face
3.3 Examples of facial landmarks with their respective faces
3.4 Image geometric transformation examples
3.5 Example of feed-forward neural networks
3.6 Biological neuron and a possible mathematical model
3.7 Flow of information in an artificial neuron
3.8 Most common activation functions
3.9 Example of a 3x3 convolution operation with the horizontal Sobel filter
3.10 Average-pooling and max-pooling operation examples
3.11 Convolutional Neural Network example
3.12 An example of a residual block
3.13 Evolution of the diverse CNN architecture milestones
3.14 Transfer learning process
3.15 Possible queries accepted by the EmotionNet database
3.16 Examples of the FER2013 database
3.17 Class distribution per dataset
3.18 Face-api.js emotion recognition examples [37]
3.19 Deepface facial attribute analysis examples [24]
3.20 An example of the Azure Face API response
3.21 An example of the Algorithmia API response
3.22 Amazon Rekognition face detection and analysis example
3.23 Google Vision API attributes example
3.24 Face++ Emotion Recognition API response example

4.1 The girl in the red coat from Schindler’s List
4.2 Video Indexer functionalities overview
4.3 Video Indexer insights from the 1999 movie The Matrix

5.1 FER2013 content overview
5.2 FER2013 class distribution


5.3 FER2013 image examples
5.4 SFEW face aligned samples
5.5 Dlib face bounding box
5.6 Baseline Network Architecture diagram
5.7 Baseline network accuracy, loss and confusion matrix
5.8 VGG16 accuracy, loss and confusion matrix on FER2013
5.9 InceptionV3 accuracy, loss and confusion matrix on FER2013
5.10 Xception accuracy, loss and confusion matrix on FER2013
5.11 MobileNetV2 accuracy, loss and confusion matrix on FER2013
5.12 ResnetV2 accuracy, loss and confusion matrix on FER2013
5.13 DenseNet accuracy, loss and confusion matrix on FER2013
5.14 Final model
5.15 Accuracy and loss learning curves and confusion matrix of the model when training with FER2013
5.16 Confusion matrix of testing with the SFEW dataset
5.17 Learning curve of the model when training with class weights
5.18 Learning curve of the model when training with the top-4 performing emotions
5.19 Validation confusion matrix of FER2013 top-4 emotions
5.20 Confusion matrix of testing on SFEW
5.21 Learning curve of the model when training with the possible clusters of emotion
5.22 Angry, Happy and Neutral validation confusion matrix on FER2013
5.23 Angry, Happy and Neutral testing confusion matrix on SFEW
5.24 Facial mask example from FER2013
5.25 FER2013 images not containing faces

List of Tables

2.1 Basic and compound emotions with their corresponding AUs

3.1 Confusion Matrix
3.2 Principal databases used in FER systems
3.3 FER approaches and results on widely evaluated datasets
3.4 Comparison between the different open-source and commercial solutions

5.1 Versions of the software used in this work
5.2 FER2013 number of samples per class
5.3 SFEW aligned face samples per class
5.4 Facial Recognition in movies experiment
5.5 Benchmark of CNN architectures
5.6 Final configuration of the network
5.7 Precision, Recall and F1-score of the FER2013 validation set
5.8 Precision, Recall and F1-score of the SFEW test set
5.9 Training class weights
5.10 Precision, Recall and F1-score of the balanced FER2013 validation set
5.11 Precision, Recall and F1-score of the top-4 performing emotions during training on FER2013
5.12 Precision, Recall and F1-score of the top-4 performing emotions during testing on SFEW
5.13 Precision, Recall and F1-score with possible clustered emotions in the validation phase
5.14 Precision, Recall and F1-score with clustered emotions in SFEW

A.1 FER datasets sources

B.1 Movie sources for the SFEW and AFEW databases

D.1 Multimodal movie datasets
D.2 Some multimodal relevant studies

Abbreviations

AI     Artificial Intelligence
AU     (Facial Action Coding System) Action Unit
API    Application Programming Interface
BLSTM  Bidirectional Long Short Term Memory
CNN    Convolutional Neural Network
FACS   Facial Action Coding System
FER    Facial Expression Recognition
GAN    Generative Adversarial Network
LSTM   Long Short Term Memory
ML     Machine Learning
NLP    Natural Language Processing
OCC    Ortony, Clore and Collins’s (Model of Emotion)
PAD    Pleasure-Arousal-Dominance
PA     Pleasure-Arousal
PCA    Principal Component Analysis
SaaS   Software as a Service
SoA    State-of-the-art
SGD    Stochastic Gradient Descent
SDK    Software Development Kit
SVM    Support Vector Machine
RNN    Recurrent Neural Network


Chapter 1

Introduction

1.1 Context
1.2 Motivation
1.3 Goals
1.4 Document Structure

The context, motivation, main objectives and document structure of this dissertation are introduced in this chapter.

1.1 Context

In the early showings of "L’arrivée d’un train en gare de La Ciotat", shown in Figure 1.1, by Louis and Auguste Lumière, credited as among the first inventors of the technology for cinema as a mass medium, there are reports that the audience panicked, thinking that the silent and grainy black-and-white train pictured in the movie was going to drive right into them. Although there is no solid evidence that such an audience panic ever occurred [88], the "Cinema’s Founding Myth" still serves its purpose of showing the power of film, in particular its ability to elicit strong sentiments in audiences. The tale began to surface when people tried to describe the emotional power inherent to the convincing three-dimensional effect of the emerging medium.

Since then, films have been used not only as entertainment, but also as a means of communication, to provoke feelings in their audience and make them relate to the story being told. Although the human expression of emotions when watching movies is nowadays more contended, it is impossible to dissociate the emotional dimension from this kind of activity. In particular, there are studies suggesting that it is possible to extract emotion from the audience’s facial expressions [34], or even from an analysis of their physiological signals [132].

Furthermore, the way people watch films has changed significantly in recent years.


Figure 1.1: L’arrivée d’un train en gare de La Ciotat (1895)

With the advent of the internet, online streaming services and movie databases, video platforms have been developing increasingly complex video indexing and summarization tools, as well as personalized content delivery. Problems still persist in current solutions, with much of the difficulty lying specifically in the state-of-the-art cinematographic databases, since there is a large quantity of undocumented media content and an underlying difficulty in annotating it. Consequently, without information on the contents, establishing relationships between them is a complex task, making dataset access and retrieval inefficient and overlooking the information potential that the content itself is able to provide.

The work developed is framed in the scope of activity B.1 (Plataforma Tecnológica de apoio ao Plano Nacional do Cinema) of the CHIC project (Cooperative Holistic View on Internet and Content — POCI-01-0247-FEDER-024498). One of the contributions of this project will be to allow a simple and intuitive consultation of the contents stored in the vast database of the Cinemateca Portuguesa, a public institution dedicated to the dissemination and preservation of cinematic art. For this purpose, it is intended to develop new forms of navigation, search and interaction with the contents that consider aspects typically not contained in the description metadata. As there is evidence that 7% of human communication is verbal, 38% is vocal and 55% is visual [99], in this project we try to enrich the platform with the information provided by the emotions portrayed by the actors in movies.

1.2 Motivation

Affective Computing is an interdisciplinary field that studies and develops systems that are able to recognize, interpret, process and simulate human affect. Understanding the emotional state of people is a subject that has attracted and fascinated researchers from different branches, combining findings from the communities of computer science, psychology and cognitive science. One of the challenges of this field is to try to answer the highly subjective question "What emotion does this particular content convey?", which is studied in detail in the subfield of Emotion and Sentiment Analysis. Several studies have been trying to answer this question by analyzing different modalities of content.

For a long time, text-based sentiment analysis has been the reference in this area, with the use of Natural Language Processing (NLP) and text analysis computational techniques for the extraction of the sentiment that a text conveys. These techniques are commonly used in the analysis of texts from social networks or online reviews found on e-commerce platforms, due to the proven value that is added to companies and organizations [111].

However, in recent years, due to the advances observed in the Artificial Intelligence and Computer Vision fields, other media modalities have been considered. In the speech recognition field, relevant correlations have been found between statistical measures of sound and the mood of the speaker. Additionally, other approaches try to recognize different facial expressions from video sequences, as some authors suggest that they might have a close relationship with emotions.

The advantage of analyzing videos over text is the behavioural contextualization of the individuals being studied: it is possible to combine visual and sound cues to better identify the true affective state of the videos. Additionally, when analyzing movies, other stylistic characteristics can be used to improve the accuracy of emotion recognition, such as shot length, number of shots, dominant colours or colour histograms. Even though these developments are not film-oriented, combined with state-of-the-art machine learning algorithms and the ever-growing computational power, they could be very valuable in the context of this work, providing crucial cues on how emotion is detected efficiently across several fields.

Therefore, what we will investigate in this work is the applicability of FER solutions to the field of films. We intend to gather a solid understanding of how emotions are addressed in the social and human sciences and discuss possible adaptations of the emotional theories to computational models. Additionally, we shall explore the various datasets available and try to understand their behaviour when confronted with the difficult conditions of a film. Furthermore, we will look for the most suitable computational models for this task and evaluate their reliability. Currently, there is a vast number of solutions available but still no consensus on the best solution when applied to films, an uncertainty that we will embrace. In particular, we will seek to understand the results of multi-class classification models, their limitations, and what adjustments can be made to achieve the desired goals.

The benefits of this study could serve a wide range of applications: personalized content distribution and better recommendation systems that are able to suggest new content based on the characteristics that a user generally connects with may arise from finding new emotion-based relationships between media content. Additionally, the conclusions of this study may also be used to enhance information retrieval from large movie platforms, and emotion-aware content clustering may greatly improve the organization of such systems.

1.3 Goals

This dissertation has the following goals:

• Facial expression analysis — evaluate the contribution of facial expression analysis in the context of the affective dimension of a movie.

• Database evaluation — evaluate the best databases that could be applied in facial expres- sion recognition tasks in the movie domain.

• Machine Learning evaluation — evaluate the most appropriate machine/deep learning techniques to achieve the best possible result.

The ultimate objective of this work is to create a framework that evaluates the feasibility of using faces to determine the emotional charge of a movie, discussing the limitations and promising paths of such a system.

1.4 Document Structure

This document is divided into the following six chapters:

• Chapter 1— Introduction This chapter introduces the work, its context, motivation and goals.

• Chapter 2— The Human Emotion This chapter will provide sufficient knowledge about the models that were developed to classify human emotion. It also shows the limitations of current research, given the non-universal nature of the subject.

• Chapter 3— Facial Expression Recognition In this chapter, Facial Expression Recognition is introduced along with an in-depth review of the current deep learning approach.

• Chapter 4— Automatic Emotion Analysis in Movies This chapter provides an overview of the current approaches for unimodal and multimodal automatic emotion analysis in movies.

• Chapter 5— Methodology, Results and Evaluation This chapter includes a detailed definition of the problem and the developed framework with a description of the technologies used to address it. In the second part of this chapter, the experimentation done throughout this work is presented, accompanied by the evaluation and discussion of the results that were achieved.

• Chapter 6— Conclusions Finally, the last chapter provides a synthesis of the main ideas presented in this work and conclusions are drawn, pointing out promising directions for future work.

Chapter 2

The Human Emotion

2.1 Discrete emotion model
2.2 Dimensional emotion model
2.3 Facial Action Coding System (FACS)
2.4 Mappings between Emotion Representation Models
2.5 Discussion

This chapter describes the background and related work in the social sciences related to human emotion. It starts by providing an in-depth description of how emotion classification models have evolved, from a historical and psychological point of view. In the end, mappings between these models are discussed.

2.1 Discrete emotion model

Understanding the emotional state of people has been a complex topic over the years. The study of emotions dates back to classical antiquity, when Cicero described emotions as a set of four basic states: metus (fear), aegritudo (pain), libido (lust), and laetitia (pleasure) [47]. Years later, Darwin argued that emotions evolved via natural selection, hence being independent of the individual's culture [13]. In this context, the first scientific models for emotion were developed, starting by dividing them into a limited and discrete set of basic emotions — the discrete model of emotions.

Paul Ekman proposed in 1970 that facial expressions are universal and provide sufficient information to predict emotions. His studies suggest that our emotions evolved through natural selection into a limited and discrete set of basic emotions: anger, disgust, fear, happiness, sadness, and surprise [34]. Each emotion is independent of the others in its behavioural, psychological and physiological manifestations, and each of them is born from the activation of unique areas in the central nervous system. The criterion used was the assumption that each primary emotion has a distinct facial expression that is recognized even between different cultures [108].


Other studies tried to expand the set of emotions to non-basic ones, such as fatigue, anxiety, satisfaction, confusion, or frustration [62, 144]. The Ortony, Clore and Collins’s Model of Emotion (OCC Model) [107] is popular amongst systems that incorporate emotions in artificial characters. It is a hierarchical model that classifies 22 emotion types that might emerge as a consequence of events (e.g., happiness and pity), actions of agents (e.g., shame and admiration), or aspects of objects (e.g., love and hate).

Despite the six basic emotion model being the dominant theory of emotion in psychiatric and neuroscience research, recent studies have pointed out some limitations. Certain facial expressions are associated with more than one emotion, which suggests that the initially proposed taxonomy is not adequate [113]. Other studies suggest that there is no correlation between the basic emotions and the automatic activation of facial muscles [14]. Other claims suggest that this model is culture-specific and not universal [60]. These drawbacks caused the emergence of additional methods that intend to be more exhaustive and universally accepted regarding emotion classification.

2.2 Dimensional emotion model

Some studies have assessed people’s difficulty in evaluating and describing their own emotions, which points out that emotions are not discrete and isolated entities, but rather ambiguous and overlapping experiences [120]. This line of thought reinforced a dimensional model of emotions, which describes them as a continuum of highly interrelated and often ambiguous states. The model that gathered the most consensus among researchers — the Circumplex Model of Emotion — argues that there are two fundamental dimensions: valence, which represents the hedonic aspect of emotion (that is, how pleasurable it is for the human being), and arousal, a tension or enthusiasm dimension that represents the energy level [119]. Each emotion is represented using coordinates in a multi-dimensional space. Figure 2.1 shows a two-dimensional visualization of the model. Valence spans from negative (e.g., depressed) to positive (e.g., content), whereas arousal ranges from inactive (e.g., tired) to active (e.g., excited). Some predefined points represent a categorical definition of emotion to facilitate its interpretation.

Another bi-dimensional model was proposed by Whissell, which defines an emotion by a pair of values (evaluation, activation) that rates the emotion as positive or negative in the evaluation dimension, and as active or passive in the activation dimension [146]. This is the particular case of polarity, which is very popular in NLP studies.

Other approaches proposed tri-dimensional models. The most famous one, pleasure-arousal-dominance (PAD), adds a third dimension to the arousal-valence scale called dominance, representing the controlling and dominant nature of the emotion on the individual. For instance, anger is considered a dominant/in-control emotion, while fear is considered submissive/dominated [98]. The utility of the third dimension remains unclear, as several studies revealed that the valence and arousal axes are sufficient to model emotions, particularly when handling emotions induced by videos [93]. More recently, Fontaine concluded that a four-dimensional model (valence, arousal, dominance and predictability) best describes the variety of emotions, but the optimal number of dimensions depends on the purpose of the study [59].

Figure 2.1: Circumplex Model of Emotion [87]

The advantages of a dimensional model compared with a discrete model are the accuracy in describing emotions, by not being limited to a closed set of words, and a better description of emotion variations over time, since the variation in emotion is not realistically discrete, jumping from one universal emotion to another, but rather continuous.

2.3 Facial Action Coding System (FACS)

The Facial Action Coding System (FACS) [33] is an anatomically-based system used to describe all visually discernible movements of the facial muscles. FACS is able to objectively measure the frequency and intensity of facial expressions using Action Units (AUs), i.e., the smallest distinguishable units of measurable facial movement, such as the brow lowerer, eye blink or jaw drop. The system has a total of 46 action units, with some of the AUs having a 5-point ordinal scale used to measure the degree of muscle contraction. Figure 2.2 represents the action units of the upper face region.

FACS is strictly descriptive and does not include an emotion correspondence. The same authors of the system proposed an Emotional Facial Action Coding System (EMFACS) [39], based on the six-basic discrete emotion model described in Section 2.1, that makes a connection between emotions and facial expressions. Recent studies have proposed a new classification system based on basic and compound emotions [8], as shown in Table 2.1.

2.4 Mappings between Emotion Representation Models

Some studies suggest that there are neurological variations within the processes of emotion categorization and assessment of emotion dimensions [65].

Figure 2.2: Upper Face Action Units (AUs) from the Facial Action Coding System. The AUs marked with * are graded according to muscle contraction intensity [145]

Table 2.1: Basic and compound emotions with their corresponding AUs [8]

Category            AUs             Category               AUs
Happy               12,25           Sadly disgusted        4,10
Sad                 4,15            Fearfully angry        4,20,25
Fearful             1,4,20,25       Fearfully surprised    1,2,5,20,25
Angry               4,7,24          Fearfully disgusted    1,4,10,20,25
Surprised           1,2,25,26       Angrily disgusted      1,2,5,10
Disgusted           9,10,17         Disgusted surprised    1,2,5,10
Happily sad         4,6,12,25       Happily fearful        1,2,12,25,26
Happily surprised   1,2,12,25       Angrily disgusted      4,10,17
Happily disgusted   1,4,15,25       Awed                   1,2,5,25
Sadly fearful       1,4,15,25       Appalled               4,9,10
Sadly angry         4,7,15          Hatred                 4,7,10
Sadly surprised     1,4,25,26

Understanding what can bring these models together is just as relevant as understanding what separates them, if not more so. Motivated by the dispersion of classification methods across emotional databases, some studies have investigated a potential mapping between discrete/categorical and dimensional theories. A first linear mapping between the emotions anger, disgust, fear, happiness and sadness and a dimensional space was proposed in 2011 [128]. This linear mapping is based on the matrix of coefficients shown in Equation 2.1.

\[
PAD_{[\text{Anger},\,\text{Disgust},\,\text{Fear},\,\text{Happiness},\,\text{Sadness}]} =
\begin{bmatrix}
-0.51 & -0.40 & -0.64 &  0.40 & -0.40 \\
 0.59 &  0.20 &  0.60 &  0.20 & -0.20 \\
 0.25 &  0.10 & -0.43 &  0.15 & -0.50
\end{bmatrix}
\tag{2.1}
\]

Considering that the matrix was obtained following the quantitative relationship between the three-dimensional Pleasure-Arousal-Dominance (PAD) mood space and the OCC emotions [41], it might be considered a theoretical model instead of an evidence-based one, since no data source was used to derive the mapping. In 2018, a new study proposed a method and evaluation metrics to assess the mapping accuracy, and elaborated a new mapping between the six basic emotions and the PAD model [79]. The new mapping in the three-dimensional PAD space is shown in Equation 2.2 and a new mapping in the two-dimensional Pleasure-Arousal (PA) space is shown in Equation 2.3.

\[
PAD_{[\text{Happy},\,\text{Sad},\,\text{Angry},\,\text{Scared},\,\text{Disgusted},\,\text{Surprised},\,1]} =
\begin{bmatrix}
0.46 & -0.30 & -0.29 & -0.19 & -0.14 & 0.24 & 0.52 \\
0.07 & -0.11 &  0.19 &  0.14 & -0.08 & 0.15 & 0.53 \\
0.19 & -0.18 & -0.02 & -0.10 & -0.02 & 0.08 & 0.50
\end{bmatrix}
\tag{2.2}
\]

\[
PA_{[\text{Happy},\,\text{Sad},\,\text{Angry},\,\text{Scared},\,\text{Disgusted},\,\text{Surprised},\,1]} =
\begin{bmatrix}
0.54 & -0.14 & -0.21 & -0.06 & -0.16 & 0.00 &  0.46 \\
0.50 &  0.06 &  0.37 &  0.36 &  0.12 & 0.00 & -0.01
\end{bmatrix}
\tag{2.3}
\]
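As an illustration of how such a mapping can be used, the sketch below applies the PA matrix from Equation 2.3 to a vector of basic-emotion intensities; the coefficients are copied from the equation, while the emotion scores and variable names are hypothetical.

```python
import numpy as np

# Coefficients copied from Equation 2.3 (rows: pleasure, arousal).
PA = np.array([[0.54, -0.14, -0.21, -0.06, -0.16, 0.00,  0.46],
               [0.50,  0.06,  0.37,  0.36,  0.12, 0.00, -0.01]])

# Hypothetical intensities for [happy, sad, angry, scared, disgusted, surprised].
emotion = np.array([0.8, 0.0, 0.1, 0.0, 0.0, 0.1])
x = np.append(emotion, 1.0)        # constant 1 for the bias column

pleasure, arousal = PA @ x
print(f"pleasure={pleasure:.3f}, arousal={arousal:.3f}")
```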

These linear representations were obtained by cross-referencing the information of lexicons annotated in both models. The PAD mapping was obtained by pairing the Affective Norms for English Words (ANEW) [9] and Synesketch [74] lexicons. The compound set is composed of English words annotated in the three-dimensional PAD model (originally in the ANEW lexicon) and in Ekman’s six basic emotions plus a neutral state (originally in the Synesketch lexicon). The PA mapping was derived using two sentiment annotations of the same lexicon, the Nencki Affective Word List (NAWL) [116, 147]. The NAWL dataset contains 2902 Polish words, annotated in the valence-arousal space and in five basic emotions (happy, sad, angry, scared and disgusted). During the construction of the NAWL dataset, each subject provided a vector of five individual ratings, one for each emotion. Later, the Euclidean distance to six “extreme” points was calculated, representing pure basic emotions: (7,1,1,1,1) for happiness, (1,7,1,1,1) for anger, (1,1,7,1,1) for sadness, (1,1,1,7,1) for fear, (1,1,1,1,7) for disgust and (1,1,1,1,1) for the neutral state. The average distance was calculated to estimate how close the word is to each of the extreme points. To classify a word into one of the six classes, a threshold must be established, which defines the maximum distance a word can be from an emotion in order to be classified as such. Additionally, a word should only belong to a unique category region, i.e., if it falls in an intersection area between two categories it remains unclassified. An interactive analysis of the NAWL database 1 is available on the Internet, where various combinations of parameters and the consequent results can be tested.

1 https://exp.lobi.nencki.gov.pl/nawl-analysis

Using the Euclidean distance based classification method with thresholds of 2.5 for happiness, 5.5 for the remaining basic emotion classes and 2.5 for the neutral class, the researchers of the NAWL lexicon were able to classify 739 out of the 2902 available words. Figure 2.3 illustrates the spatial distribution of these classifications in the valence-arousal affective space. A significant finding was the apparent formation of emotion clusters: happiness is a high-valence, medium-arousal emotion, neutral is low in both dimensions, and the remaining emotions seem to overlap, particularly anger and sadness.
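To make the classification rule above concrete, the sketch below reproduces it in Python; the extreme points and thresholds are taken from the text, while the example rating vector and function names are hypothetical.

```python
import numpy as np

# "Extreme" points in the (happiness, anger, sadness, fear, disgust) rating space.
EXTREMES = {
    "happiness": (7, 1, 1, 1, 1),
    "anger":     (1, 7, 1, 1, 1),
    "sadness":   (1, 1, 7, 1, 1),
    "fear":      (1, 1, 1, 7, 1),
    "disgust":   (1, 1, 1, 1, 7),
    "neutral":   (1, 1, 1, 1, 1),
}
THRESHOLDS = {"happiness": 2.5, "anger": 5.5, "sadness": 5.5,
              "fear": 5.5, "disgust": 5.5, "neutral": 2.5}

def classify_word(ratings):
    """ratings: mean (happiness, anger, sadness, fear, disgust) ratings of a word."""
    ratings = np.asarray(ratings, dtype=float)
    hits = [label for label, point in EXTREMES.items()
            if np.linalg.norm(ratings - np.array(point)) <= THRESHOLDS[label]]
    # keep the word only if it falls inside exactly one category region
    return hits[0] if len(hits) == 1 else None

print(classify_word((6.2, 1.3, 1.1, 1.2, 1.0)))   # -> "happiness"
```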

Figure 2.3: Spatial distribution of NAWL word classifications in the valence-arousal affective space [147]

Further interesting insights have been emerging from the music research field. In 2010, a study was conducted in order to assess a possible correlation between these theories using musical stimuli. Firstly, a validation experiment was carried out involving twelve music experts who had studied a musical instrument for at least ten years. Each member of the study was given five different movie soundtracks and asked to classify them according to the target emotions. Half of the panel focused on discrete emotions (happiness, sadness, fear, anger, surprise and tenderness) and the other half on the dimensional emotions (high-low valence, tension arousal and energy arousal). This phase was important to validate the choice of the emotional frameworks, since the selected stimuli needed to represent emotion concepts. Secondly, the experiment was conducted involving 116 university students, aged between 18 and 42 years. Similar to the validation experiment, the participants were organized in two different blocks. The members of the first block were asked to rate the musical excerpt on a 1 to 9 intensity scale for each discrete emotion, and the participants of the second block were asked to do the same on a (negative) 1-9 (positive) scale for each emotion dimension. A more detailed description of the participants, stimuli, apparatus and procedure can be found in the corresponding study [32]. Some results of the study are shown in Figure 2.4. Despite having different scales, a clustering process similar to the previous study appears to have happened, particularly for the happiness emotion and for the overlap of fear and anger. However, the same did not happen with sadness, which appears to lie in a different region of the valence-arousal affective map.

Figure 2.4: Distribution of emotions in the valence-arousal affective map. Left: Variance of the five discrete target emotions — capital letters represent well-defined areas (high intensity and low deviation) whereas the small letters represent less clearly defined areas (lower intensity and higher variation). Right: Mean rating per discrete emotion represented as marker type in the valence-arousal space [32]

2.5 Discussion

Humans have a complex nature and the study of the social and psychological motivations behind their emotions could not be different. At the moment, there is no truly universal model of emotion. The models covered in this section should not be seen as competing for a single universal truth, but rather as different angles from different academic fields (biology, psychology, psychiatry and the social sciences) that, with different experimental methods, try to analyze and describe a convoluted subject. The discrete emotion model, being categorical, has the advantage of the population’s broad familiarity with the concepts, which may be useful in film rating or recommendation systems. The dimensional model, on the other hand, may offer greater granularity in describing the evolution of emotion throughout a film, which may be particularly interesting in systems that need more comprehensive information about emotion at a given time.

However, trying to create a computational model of a non-universal theory is an arduous task that becomes more evident with the dispersion of models used in the various databases, as will be exposed in the following sections. Even so, what might initially seem like a limitation could become a valuable asset if we consider a possible hybrid model involving classification and linear regression tasks, as indicated by the results of the mappings between emotion representation models.

Chapter 3

Facial Expression Recognition

3.1 Historical Overview
3.2 Deep Learning current approach
3.3 Model Evaluation and Validation
3.4 Datasets
3.5 Open-source and commercial solutions

Like many other computer vision subjects, face-related studies have changed from engineering features by hand to the use of deep learning, which surpassed the SoA approaches of the time. This section is introductory to Facial Expression Recognition (FER). It starts with a brief overview of the historical context of the subject. Then, it explores the current deep learning approach, from the data pre-processing stage to the construction and optimization of a CNN model.

3.1 Historical Overview

Facial Expression Recognition (FER) systems use biometric markers to detect emotion in human faces. Despite the complexity of emotion classification models, the discrete model is the most popular perspective in facial expression recognition algorithms, given its pioneering investigations and the intuitive definition of emotions. FER systems can be branched into two principal categories:

• Static-based methods — the feature representation takes into account only spatial infor- mation from the current single image.

• Dynamic-based methods — the temporal relation between contiguous frames in a facial expression sequence is taken into consideration.


Figure 3.1: Facial Expression Recognition stages outline [56]

Figure 3.1 illustrates the principal stages of a static-based FER system. In the image pre-processing step, several transformations are performed on the image, such as noise reduction (using a Gaussian filter, for instance), normalisation and histogram equalisation. Face-related image transformations are further explored in Section 3.2.1. Initially, most traditional methods were based on shallow learning for feature extraction — unlike deep learning, features are drawn by hand based on heuristics related to the target problem, like Local Binary Patterns (LBP) [127], Non-Negative Matrix Factorization [161] and Sparse Learning [162]. However, the increasing computational power and the emergence of higher quality databases made deep learning models stand out from the SoA techniques of the time, and they are the current focus of FER researchers. A typical deep learning-based FER pipeline is discussed in the forthcoming section.

3.2 Deep Learning current approach

Since 2013, international competitions (like FER2013 [45] and EmotiW [31]) have changed the paradigm of facial expression recognition by providing a significant increase in training data. These competitions introduced more unconstrained datasets, which led to the transition of the study from controlled simulations in the laboratory to the more unpredictable real-life environment. Current approaches rely on deep learning techniques, due to the increased processing power of the chips as well as the availability of more detailed datasets, with results that far exceed traditional methods [75, 50]. The general pipeline of deep facial expression recognition systems includes a pre-processing phase of the input image, which incorporates face alignment, data augmentation and face normalization techniques. Then, the processed data is passed to a neural network, commonly a Convolutional Neural Network (CNN), Deep Belief Network (DBN), Recurrent Neural Network (RNN), Generative Adversarial Network (GAN) or Deep Autoencoder (DAE). In the end, the features extracted by the neural network are passed to a classifier that will label the image. The following sections delve into the details of each of these procedures, with special emphasis on CNNs, since they are the algorithm of choice for image classification challenges [82].

3.2.1 Data Preprocessing

The quality and quantity of data have a great impact on the results of deep learning-based tasks. In deep learning algorithms, feature extraction is done throughout the inner layers of the model, so it becomes crucial to have an initial data processing phase in order to bring the data closer to the goals we intend to achieve. It is unrealistic to expect that a given dataset will be perfect, so it is fundamental to evaluate its quality. The general methods used to preliminarily process image data containing faces are explained in the following sections.

Data quality assessment

This step is essential for determining the quality of a given dataset. It includes checking for missing or duplicate images, examining images whose content is inconsistent with their label, and analyzing class imbalance, i.e., the distribution of images per class in a given dataset.

Face detection

Face detection is used to detect inconsistent images that do not contain faces. Dlib 1 is a toolkit for making real-world machine learning and data analysis applications written in C++ and allows face detection in two distinct ways:

• Histogram of Oriented Gradients (HOG) combined with a linear classifier [23]: HOG is a feature descriptor often used to extract features from image data, focusing on the shape of the object. By counting the occurrences of gradient orientation in localized portions of an image, it allows a trained linear classifier (such as SVM) to use this information and classify if an image contains a face or not. Figure 3.2 contains an example of this gradient applied to a face.

• Convolutional Neural Network: it uses a pre-trained CNN model to find faces in an image. The model was trained using images from ImageNet [27], AFLW [96], Pascal VOC [36], VGG-Face [110], WIDER [153], and Face Scrub [105]. It is more accurate than the HOG-based model, but at the cost of much greater computational power. A minimal sketch using both detectors follows this list.
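The following sketch shows how both Dlib detectors can be invoked in Python; it assumes dlib and OpenCV are installed, that a local image file exists, and that the pre-trained CNN weights file (mmod_human_face_detector.dat, distributed separately by Dlib) has been downloaded. File names are illustrative.

```python
import cv2
import dlib

image = cv2.imread("movie_frame.jpg")                  # illustrative path
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# HOG + linear classifier detector (fast, CPU-friendly)
hog_detector = dlib.get_frontal_face_detector()
hog_faces = hog_detector(gray, 1)                      # 1 = upsample once to find smaller faces

# CNN-based detector (more accurate, heavier)
cnn_detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")
cnn_faces = cnn_detector(gray, 1)

print(f"HOG found {len(hog_faces)} face(s); CNN found {len(cnn_faces)} face(s)")
```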

Face alignment

Although face detection is the only strictly necessary step to train a deep neural network to learn meaningful features, there are natural variations in pictures that are irrelevant to facial expressions, such as background or head pose. A technique to overcome these variations is to perform face alignment using the localization of facial landmarks, which has been proven to enhance the performance of FER systems [104]. The detection of facial landmarks is a subset of the shape prediction problem. Given an input image of a face, the shape predictor tries to localize key points of interest throughout the face’s shape, namely the mouth, the right and left eyebrows, the right and left eyes, the nose and the jaw.

1 http://dlib.net/

Figure 3.2: Histogram of Oriented Gradients of Barack Obama’s face

The different positions of the facial landmarks and contour indicate facial deformations due to head movements and facial expressions. This information is later used to establish a feature vector of the human face that can be used to get a normalized rotation, translation, and scale representation of the face for all images in the database. There are several open-source solutions to estimate facial landmarks. Dlib uses an ensemble of regression trees 2 trained on the iBUG 300-W dataset [121] to determine the positions of the facial landmarks directly from the pixel intensities [64]. Another strategy proposes a deep neural network to estimate facial landmarks in the two-dimensional and three-dimensional spaces [12], with a PyTorch implementation available on GitHub 3. Figure 3.3 contains a visual representation of 68 two-dimensional facial landmarks on their respective faces.

Figure 3.3: Examples of facial landmarks with their respective faces [12]

2 An ensemble of regression trees is a predictive model composed of a weighted combination of multiple regression trees.
3 https://github.com/1adrianb/face-alignment
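As a minimal sketch of landmark extraction with Dlib (assuming the 68-point shape predictor model file has been downloaded; file names are illustrative):

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

gray = cv2.cvtColor(cv2.imread("face.jpg"), cv2.COLOR_BGR2GRAY)

for rect in detector(gray, 1):
    shape = predictor(gray, rect)                      # 68 (x, y) landmark points
    points = [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]
    # the eye landmarks in `points` can feed an affine transform that rotates,
    # translates and scales the face so that all samples share the same geometry
    print(len(points), "landmarks detected")
```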

Data augmentation

Data augmentation is a method used to enlarge the size of the training or testing data by applying several transformations to real face samples or by simulating virtual face samples. This technique is particularly useful for unbalanced databases, since we can generate new images to balance the number of samples per class in a given dataset, reducing overfitting. Image data augmentation is usually applied only to the training dataset, and not to the validation or test datasets. In the context of facial expressions, data augmentation techniques can be divided into three groups [143]:

• Generic transformations: includes geometric and photometric transformations of the geometry or colour of the image. The most frequently used operations include rotation, skew, shifting, scaling, noise, shear, horizontal or vertical flips, contrast and colour jittering. Examples of geometric transformations can be found in Figure 3.4.

• Component transformations: includes face specific transformations, such as changing the persons’ hairstyle, makeup or accessories.

• Attribute transformations: includes face specific transformations, such as changing the persons’ pose, expression or age.

Figure 3.4: Image geometric transformation examples [143]

Component and attribute transformations are generally done using Generative Adversarial Networks (GANs), which are able to learn disentangled representations of the face and modify its characteristics [143]. Besides the manipulation of human faces, there are also studies that point out that the construction of virtual faces based on the facial structure can be advantageous to overcome the problem of small sample sizes [81]. The specific data augmentation techniques applied to the training dataset must be chosen cautiously, within the context of the dataset and the domain knowledge of the problem. For instance, a vertical flip of a face may not be particularly advantageous, since it is quite unlikely that the model will ever see a picture of an upside-down face.
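A possible sketch of such generic augmentation, using the Keras ImageDataGenerator as one common implementation (the parameter values and the placeholder array are illustrative, not taken from this work):

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rotation_range=15,          # small random rotations
    width_shift_range=0.1,      # horizontal shifting
    height_shift_range=0.1,     # vertical shifting
    zoom_range=0.1,             # scaling
    shear_range=0.1,            # shear
    horizontal_flip=True,       # mirrored faces remain plausible
    # vertical_flip is deliberately left out: upside-down faces are unrealistic
)

x_train = np.random.rand(32, 48, 48, 1)                # placeholder 48x48 grayscale batch
augmented_batch = next(datagen.flow(x_train, batch_size=32, shuffle=False))
print(augmented_batch.shape)                           # (32, 48, 48, 1)
```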

Face normalization

Face normalization techniques seek to normalize the illumination and pose in all the samples of a database. Related studies have shown that illumination normalization combined with histogram equalization can improve FER results [82]. On the other hand, normalizing the pose to yield frontal facial views also reports promising performance, generally using facial landmarks [49] or GAN-based deep models [154] for frontal view synthesis, as discussed above.
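As a small illustration of illumination normalization, the snippet below applies OpenCV's histogram equalization to a grayscale face crop (file names are illustrative):

```python
import cv2

face = cv2.imread("face_crop.jpg", cv2.IMREAD_GRAYSCALE)   # illustrative path
equalized = cv2.equalizeHist(face)      # spreads the intensity histogram over [0, 255]
cv2.imwrite("face_crop_equalized.jpg", equalized)
```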

3.2.2 Convolutional Neural Networks

A Convolutional Neural Network is a class of deep, feed-forward artificial neural network that has been proven to be very effective in areas such as image recognition and classification, particularly in FER systems. To understand these models, we first need to understand how a deep neural network works.

Deep feed-forward neural network

The deep feed-forward neural network (FNN), or multilayer perceptron (MLP), is the quintessential structure that supports deep learning models. The idea was inspired by the structure of the human brain, with the so-called perceptrons trying to recreate the firing of neurons [117]. In [44], the following definition of an FNN can be found:

The goal of a feed-forward network is to approximate some function f*. For example, for a classifier, y = f*(x) maps an input x to a category y. A feed-forward network defines a mapping y = f(x;θ) and learns the value of the parameters θ that results in the best function approximation.

The early neural networks were composed of three layers: the input layer, the hidden layer, and the output layer. Each layer of the network is connected to the previous layer in one direction only, so that nodes cannot form a cycle. The information in a feed-forward network only moves in one direction — from the input layer, through the hidden layers, to the output layer. Figure 3.5 illustrates a simple and a deep feed-forward neural network.

Figure 3.5: Example of feed-forward neural networks. Left: One-layer feed-forward neural net- work. Right: Deep feed-forward neural network 3.2 Deep Learning current approach 19

The Universal Approximation Theorem states that "a perceptron with one hidden layer of finite width can arbitrarily accurately approximate any continuous function" [54]. However, a single-layer perceptron is a linear classifier and therefore is not able to distinguish data that is not linearly separable. Stacking up hidden layers — which, combined with non-linear activation functions, creates a multi-layer perceptron — increases the depth of the network, which requires fewer parameters to train and allows more complex functions to be approximated.

The goal of the training phase is to minimise some loss function. To do this, we apply the backpropagation algorithm. "The backpropagation algorithm allows the information from the loss to then flow backward through the network in order to compute the gradient" [44]. The negative gradient is the direction of steepest descent, resulting in a decrease in the loss function. The size of the step taken by the algorithm is controlled by a parameter called the learning rate. In its basic form, the algorithm computes the gradient over the entire dataset. When applied to randomly selected samples, the algorithm is called Stochastic Gradient Descent (SGD): in its pure form, a single randomly shuffled sample is used to perform each iteration, while in practice small random batches (mini-batches) are commonly used. Small batches of data lead to a noisier gradient and can help escape local minima. On the other hand, large batches of samples better represent the true direction of steepest descent, but at the cost of a more memory-intensive implementation. After each iteration, the algorithm updates the parameters, repeating the process until convergence or until the predefined maximum number of iterations is reached.
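In compact form, the (mini-batch) gradient descent update described above can be written as follows, where θ are the network parameters, η denotes the learning rate, B a batch of samples and L the loss function (notation assumed here, not taken from the original):

\[
\theta \leftarrow \theta - \eta \, \nabla_{\theta} \, \frac{1}{|B|} \sum_{(x,\,y) \in B} L\big(f(x;\theta),\, y\big)
\]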

Activation Function

The activation function is a mathematical function that determines whether there is enough informative input at a node to fire a signal to the next layer. This mechanism was inspired by the biological foundations of a neuron, illustrated in Figure 3.6.

Figure 3.6: Biological neuron and a possible mathematical model [135]

As stated previously, every neuron is connected to all the neurons in the previous layer. Each connection has an associated weight, and in each iteration a neuron adds all the incoming inputs multiplied by their corresponding connection weights, plus an optional bias. The sum of these inputs is then passed to the activation function, as illustrated in Figure 3.7. The most common activation functions used in computer vision and classification problems are:

Figure 3.7: Flow of information in a artificial neuron [148]

• Sigmoid function — is the standard logistic function and translates the input from [-∞,+∞] to [0,1], often used in binary classification problems. Being an exponential function, it is computationally expensive and can lead to a vanishing gradient during training, so it should not be used in hidden layers. The vanishing gradient problem is a consequence of the derivative of the sigmoid becoming very small in the saturating regions of the function; the updates to the weights of the network therefore almost vanish, slowing down learning.

• Softmax function — is a generalization of the logistic function to multiple dimensions, being useful in multi-class classification problems. It is normally used as the last activa- tion function of a neural network, normalizing its output to a probability distribution over predicted output classes.

• Rectified Linear Unit (ReLU) function — is easy to compute, converges faster than the sigmoid, can be used in hidden layers and does not suffer from the vanishing gradient problem. However, it can lead to the "dying ReLU" problem: being zero for all negative values, once a neuron gets stuck outputting negative pre-activations it is unlikely to recover. Leaky ReLU is a variation of this function that attempts to solve this problem by giving a small negative slope instead of zero for negative values, but most architectures covered in this document still prefer the ReLU function. A minimal NumPy sketch of these functions follows this list.
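The sketch below implements the activation functions discussed above with NumPy; it is only illustrative of their shapes, not of any particular framework's implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))            # squashes values into (0, 1)

def softmax(x):
    e = np.exp(x - np.max(x))                   # shift for numerical stability
    return e / e.sum()                          # probability distribution over classes

def relu(x):
    return np.maximum(0.0, x)                   # zero for negatives, identity otherwise

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)        # small slope instead of a hard zero

logits = np.array([2.0, -1.0, 0.5])
print(softmax(logits), softmax(logits).sum())   # the probabilities sum to 1.0
```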

A non-exhaustive list of activation functions can be found in Figure 3.8.

Currently, image and video media files have high resolutions. Considering a coloured RGB picture with a 250x250 resolution, a simple feed-forward neural network will have 250x250x3 = 187,500 input features. Each of these inputs holds a value from 0 to 255 describing the pixel intensity at that point. If the hidden layer has 1,000 neurons, the network will have 187,500x1,000 parameters to compute, which is a huge computational overhead. Moreover, simple networks are not translation invariant — if the same object is geometrically translated in a given picture, the network may not be able to identify it.

Figure 3.8: Most common activation functions [61]

Networks overcome these problems, being extensively used in diverse computer vision applications. The basic building blocks of a CNN are the convolutional base, composed of convolutional layers and pooling layers, responsible for generating features from an image; and the classifier, usually composed of fully connected layers, responsible for classifying the image.

Convolutional layers

After the input layer, the convolution layer is the first layer of a CNN. The mathematical operation of convolution is applied to the input image: a kernel (or convolution filter) slides over the whole image, originating an activation map (or feature map). This mechanism performs an element-wise dot product between the kernel and the corresponding region of the previous layer. Figure 3.9 illustrates a single convolution with a 3x3 filter. A hyper-parameter is a value defined a priori to control the network learning process. In CNNs, the hyper-parameters that have more impact on the performance of the network are:

• Kernel size — establishes the dimensions of the sliding window over the input. This hyper-parameter has a great impact on the image classification task: smaller filter sizes are able to extract useful information from local features, while larger kernel sizes can be relevant to extract larger, less detailed features.

• Padding — necessary when the kernel extends beyond the borders of the input, preventing the loss of information at the corners of the image and the shrinking of the output. The most commonly used padding technique is the highly efficient zero-padding, which adds a border of zeros around the input, conserving the input's spatial size between convolutions.

• Stride — defines how many pixels the filter shifts at each step. Normally, a stride of one is used, which means that the filter slides over the input pixel by pixel. A small sketch relating these hyper-parameters to the size of the resulting activation map is given after this list.
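Assuming a square input of width W, a kernel of size K, padding P and stride S, the spatial size of the resulting activation map is (W - K + 2P)/S + 1. The sketch below computes this value and applies the horizontal Sobel filter of Figure 3.9 to a small hypothetical image with plain NumPy (deep learning frameworks actually implement cross-correlation, which they call convolution); the concrete numbers are illustrative only.

import numpy as np

def conv_output_size(w, k, p, s):
    # Spatial size of the activation map for input width w, kernel k, padding p and stride s
    return (w - k + 2 * p) // s + 1

print(conv_output_size(250, 3, 1, 1))   # 3x3 kernel, zero-padding of 1, stride 1 -> 250 (size preserved)
print(conv_output_size(250, 3, 0, 1))   # no padding -> 248 (the output shrinks)

def convolve2d_valid(image, kernel):
    # Naive "valid" convolution: slide the kernel, multiply element-wise and sum (no padding, stride 1)
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

sobel_horizontal = np.array([[-1, -2, -1],
                             [ 0,  0,  0],
                             [ 1,  2,  1]])
image = np.arange(25, dtype=float).reshape(5, 5)   # hypothetical 5x5 grayscale patch
print(convolve2d_valid(image, sobel_horizontal))   # 3x3 activation map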

As multiple convolutions are applied to the same input, the resulting feature maps are stacked to create the convolutional layer. Normally, the convolutional layers closer to the input layer learn basic

Figure 3.9: Example of a 3x3 convolution operation with the horizontal Sobel filter. This filter is able to estimate the presence of light-dark transition zones, normally associated with the edges of an object [22]

features such as edges and corners, the middle layers are able to learn filters that detect parts of objects (for faces, they might learn the representation of an eye or a nose) and the last layers hold a high-level representation of the object, being able to distinguish it in different shapes and positions. Considering that the convolution operation is a linear combination, a non-linear layer is normally added after each convolution layer. Initially, Sigmoid functions were used, but recent developments showed that using ReLU decreases training time [122].

Pooling layers

Pooling is a non-linear downsampling operation that reduces the dimensionality of a feature map by applying an operation window of an arbitrary size. Figure 3.10 represents a 4x4 image with a 2x2 sub-sampling window that divides the image into four non-overlapping 2x2 matrices. In the case of average-pooling, the output is the average of the four values in each matrix; in the case of max-pooling, the maximum value of each matrix is taken. There are other non-linear functions to implement pooling, not covered in this work since max-pooling is the most common. A pooling layer is commonly placed between convolutional layers (with the corresponding ReLU layer) and serves two principal purposes: the first is to reduce the number of parameters and the overall computational load of the network, and the second is to avoid overfitting. The pooling operation also helps CNNs achieve translation invariance [42].
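The 2x2 pooling of Figure 3.10 can be expressed in a few lines of NumPy by reshaping the feature map into non-overlapping blocks; the 4x4 input below is hypothetical.

import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    # Split the map into non-overlapping size x size blocks and reduce each block
    h, w = feature_map.shape
    blocks = feature_map.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))
    return blocks.mean(axis=(1, 3))   # average-pooling

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 0],
                 [4, 8, 3, 5]], dtype=float)

print(pool2d(fmap, mode="max"))   # [[6. 4.] [8. 9.]]
print(pool2d(fmap, mode="avg"))   # [[3.75 2.25] [5.25 4.25]]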

Fully connected layers

Fully connected layers are often used as the final layers of a CNN. The neurons in these layers have full connections to all the activations in the previous layer. Frequently, they are preceded by a

Figure 3.10: Average-pooling and max-pooling operation examples

Flatten layer to transform three-dimensional data into a one-dimensional vector accepted by these layers. Combined with the Softmax function, the output of this layer (and of the overall network) is an N-dimensional vector, where N is the number of classes, and each value of the vector is the probability of the given picture belonging to each class. A classical CNN will have a combination of these layers, as pictured in Figure 3.11. With the recent advances in the field, new and more efficient models have appeared with improved performance. Some of these models are discussed in the following section.

Figure 3.11: Convolutional Neural Network example [28]
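As a concrete illustration of such a stack, the following Keras sketch builds a small classical CNN; the 48x48 grayscale input and the seven output classes are assumptions matching the FER datasets discussed later, not a reference architecture.

from tensorflow.keras import layers, models

num_classes = 7   # assumed number of emotion classes

model = models.Sequential([
    # Convolutional base: convolution + pooling blocks that generate features
    layers.Conv2D(32, (3, 3), padding="same", activation="relu", input_shape=(48, 48, 1)),
    layers.MaxPooling2D((2, 2)),   # downsample to 24x24
    layers.Conv2D(64, (3, 3), padding="same", activation="relu"),
    layers.MaxPooling2D((2, 2)),   # downsample to 12x12
    # Classifier: flatten + fully connected layers
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(num_classes, activation="softmax"),   # probability distribution over the classes
])

model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
model.summary()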

Modern Architectures

Convolutional Neural Networks have been around since the 1990s, with the emergence of one of the very first CNNs, LeNet, in 1998 [80]. Between the late 1990s and early 2010s, as more data and computing power became available, CNNs began to achieve increasingly interesting results. The year 2010 marks the first ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [118], which has since then been the de facto benchmark for evaluating object detection and image classification models at large scale. ImageNet [27] is a large image database containing over 14 million hand-annotated images indicating which objects are present. Some influential architectures and their corresponding innovations are listed below.

• AlexNet [76] (2012) — AlexNet was the first CNN to achieve a remarkable result in the competition, reaching a top-5 test error of 15.4%, whereas the second best result at the time had a top-5 test error rate of 26.4%. The network is composed of five convolutional layers, max-pooling layers, three fully connected layers and dropout layers. Dropout is a regularization technique that ignores (drops out) randomly chosen neurons (along with their connections) during training. With dropout, the network cannot fully rely on any single feature and has to learn robust, useful features, which prevents overfitting. The original work on AlexNet also included data augmentation techniques during the training phase and the use of the ReLU non-linear function.

• VGG [129] (2014) — The major improvement over AlexNet was the reduction of the convolution kernel size to 3x3. The VGG network, with nineteen convolution layers with a fixed filter size of 3x3 and stride and padding of one, interspersed with 2x2 max-pooling layers with stride two, achieved a 7.3% top-5 error rate in ILSVRC 2014.

• GoogLeNet/Inception [136] (2014) — The simplicity of the VGG network contrasts with the complexity of the GoogLeNet network, the winner of ILSVRC 2014 with a top-5 error rate of 6.7%. This network launched a new concept of CNN, proposing parallel modules called Inception modules. These modules allow the network to perform several convolution and pooling operations with different sizes in parallel, concatenating the results at the end. Another difference from AlexNet/VGG is the use of a Global Average Pooling layer instead of fully connected layers to reduce a three-dimensional tensor to a one-dimensional tensor. Despite its complexity, with over a hundred layers, this network has around 12x fewer parameters than AlexNet, being computationally efficient.

• ResNet [50] (2015) — The winner of ILSVRC 2015 was a Residual Network with a top-5 error rate of 3.6%. With the constant increase in network depth, new problems began to emerge, such as the vanishing gradient problem and the degradation problem, defined as "loss of meaningful information on the feed-forward loop as accuracy gets saturated and then degrades rapidly" [50]. To overcome these problems, residual blocks were proposed. In a network with residual blocks, each layer feeds into the next layer and also directly into layers two or three positions ahead, through skip connections. Figure 3.12 illustrates a single residual block, and a minimal code sketch of such a block is given after this list. This architecture allows both the input signal and the gradient of the loss to propagate much further through the network, making training remarkably more efficient. Combinations of different architectures have also emerged, such as Xception, which merges the concepts of the Inception and ResNet families into a new architecture.

• Efficient designs (2019) — the most recent trends in the field focus on highly parameter-efficient networks, with the advent of the Neural Architecture Search (NAS) and MobileNet families. MobileNet architectures strive to have the fewest possible parameters, allowing the networks to run on mobile devices, as the name suggests. MobileNet-V2 [123], for instance, has a top-5 error rate of 7.5% with only 6 million parameters. Another

Figure 3.12: An example of a residual block

efficient design is NASNet [164], which has a defined search space of common CNN building blocks and tries to build the best child network using reinforcement learning. This network achieved a top-5 error rate of 3.8% with 88.9 million parameters. The most recent developments in the field focus on achieving the best possible accuracy with the smallest number of parameters [53].
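A residual block of the kind shown in Figure 3.12 can be sketched with the Keras functional API as follows; the filter count and the input shape are arbitrary and only illustrate the skip connection.

from tensorflow.keras import Model, layers

def residual_block(x, filters=64):
    # Two 3x3 convolutions whose output is added back to the block input (the skip connection)
    shortcut = x
    y = layers.Conv2D(filters, (3, 3), padding="same", activation="relu")(x)
    y = layers.Conv2D(filters, (3, 3), padding="same")(y)
    y = layers.Add()([shortcut, y])   # identity shortcut: the input bypasses the convolutions
    return layers.Activation("relu")(y)

inputs = layers.Input(shape=(56, 56, 64))   # hypothetical feature map entering the block
outputs = residual_block(inputs)
block = Model(inputs, outputs)
block.summary()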

Figure 3.13 illustrates the chronological path of the most relevant CNNs in terms of their accuracy, familiarity and size.

Figure 3.13: Evolution of the diverse architecture milestones for the image recognition task achieved using the ImageNet 2012 database. The size of each circle represents the number of parameters of the network and the colours represent the familiarity between them [53]

Transfer Learning and model fine-tuning

Transfer Learning is a machine learning technique whereby a model developed and trained for a specific task is re-used on a similar task. Concretely, the weights and parameters of a network which has already gone through a training process with a large dataset (a pre-trained network) are used as the initialization for a new model being trained on a different dataset from the same domain, in a process called fine-tuning. Transfer learning and fine-tuning generally improve the generalization of a network (provided that there are sufficient samples) and often speed up training, yielding better performance than training the network from scratch [155]. The transfer learning process is represented in Figure 3.14.

Figure 3.14: Transfer learning process [58]

There are different transfer learning strategies that can improve performance, depending on the domain, the task at hand, and the availability of data. The most common strategy is to use the convolutional base of the pre-trained network as a feature extractor: the last layers of the network are removed and replaced by a classifier suited to the problem at hand, and the new network is trained with the convolutional base frozen, so that the weights of the pre-trained network are not overwritten during the new learning process. Since the pre-trained weights can provide excellent initial values and can still be adjusted by training to better fit the problem, training the entire model, or training some specific layers while leaving the others frozen, are other alternatives to accomplish transfer learning. To find the optimal model, other parameters of the network can also be fine-tuned. Adding batch normalization can accelerate the learning of a neural network, and weight regularization can prevent overfitting in unbalanced datasets. Automatically reducing the learning rate upon plateauing can also improve the performance of a model. Finally, opting for an efficient optimization algorithm, such as Adam or RMSProp, can also lead to better results.
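A minimal Keras sketch of the frozen-base strategy described above could look as follows; the choice of MobileNetV2, the 96x96 input and the seven-class head are assumptions made for illustration, not the configuration used in this work.

import tensorflow as tf
from tensorflow.keras import layers, models

# Pre-trained convolutional base (ImageNet weights), without its original classifier
base = tf.keras.applications.MobileNetV2(input_shape=(96, 96, 3),
                                         include_top=False,
                                         weights="imagenet")
base.trainable = False   # freeze the pre-trained weights

model = models.Sequential([
    base,
    layers.GlobalAveragePooling2D(),         # collapse the spatial dimensions
    layers.Dense(7, activation="softmax"),   # new classifier for the task at hand
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

# Reduce the learning rate automatically when the validation loss plateaus
callbacks = [tf.keras.callbacks.ReduceLROnPlateau(monitor="val_loss", factor=0.5, patience=3)]

# To fine-tune, selected layers of the base can later be unfrozen and the model
# re-compiled with a lower learning rate before resuming training.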

3.3 Model Evaluation and Validation

In the case of a discrete model, the results can be organized in a Confusion Matrix, shown in Table 3.1, which compares the predicted values with the ground-truth values. True positives are data points classified as positive by the model that are indeed positive (meaning that they are correct), and false negatives are data points the model identifies as negative that are in fact positive (meaning that they are incorrect).

Table 3.1: Confusion Matrix

                   Predicted Positive      Predicted Negative
Actual Positive    True Positive (TP)      False Negative (FN)
Actual Negative    False Positive (FP)     True Negative (TN)

The relevant evaluation measures are Accuracy, Precision, Recall, and F-Score.

Accuracy (Equation 3.1) is the fraction between the number of correct predictions and the total number of predictions. It estimates the overall proportion of predictions that the model got right.

Accuracy = \frac{TP + TN}{total} \qquad (3.1)

Precision (Equation 3.2) is the fraction between the number of correct positive predictions and the total number of positive predictions. It estimates the proportion of correct estimations among the positive predictions, and therefore the impact of false positives in the predictions.

Precision = \frac{TP}{TP + FP} \qquad (3.2)

Recall (Equation 3.3) is the fraction between the number of correct positive predictions and the total number of instances that are actually positive (i.e., all instances that should have been identified as positive). It estimates the cost of false negatives in the model under consideration.

Recall = \frac{TP}{TP + FN} \qquad (3.3)

F1 score (Equation 3.4) is the harmonic mean of the precision and recall, where an F1 score reaches its best value at 1 (perfect precision and recall) and worst at 0. It represents how precise the classifier is (how many correct predictions), as well as how robust it is (by not missing a significant number of instances).

F_1 = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall} \qquad (3.4)
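These four measures can be computed directly with scikit-learn, as sketched below; the labels and predictions are hypothetical and only illustrate the calls.

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [1, 0, 1, 1, 0, 1, 0, 0]   # hypothetical ground-truth labels
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]   # hypothetical model predictions

print(confusion_matrix(y_true, y_pred))   # [[TN FP] [FN TP]]
print(accuracy_score(y_true, y_pred))     # (TP + TN) / total
print(precision_score(y_true, y_pred))    # TP / (TP + FP)
print(recall_score(y_true, y_pred))       # TP / (TP + FN)
print(f1_score(y_true, y_pred))           # harmonic mean of precision and recall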

In the case of a dimensional model, the relevant measures are the Mean-Squared Error (MSE) and Pearson's R. Mean-Squared Error (Equation 3.5) measures the average squared difference between the estimated values and the actual values. It is a measure of the quality of an estimator, being always non-negative, with better results closer to zero.

MSE = \frac{1}{n} \sum_{t=1}^{n} e_t^2 \qquad (3.5)

Persons’s Correlation Coefficient, or Person’s R (Equation 3.6), measures the linear cor- relation between two variables, X and Y. It has a value between [+1, -1], where 1 represents a perfect positive relationship, -1 a perfect negative relationship, and 0 indicates the absence of a relationship between the variables.

\rho = \frac{cov(X, Y)}{\sigma_X \sigma_Y} \qquad (3.6)

where cov is the covariance,

σX is the standard deviation of X,

σY is the standard deviation of Y.

k-Fold Cross-Validation Technique

Cross-validation is a model validation technique that evaluates predictive models by partitioning the original dataset into training and test sets, estimating how the model will generalize to unseen data. In the case of k-Fold Cross-Validation, the original dataset is shuffled and randomly distributed among k equally sized sub-datasets. Per iteration (there are k iterations, or folds), one sub-dataset is chosen as the test sub-dataset and the remaining ones are used for training. The results of this technique can be averaged to produce a single estimation. All observations are used for both training and validation, and each observation is used for validation exactly once.
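In scikit-learn this procedure is available through the KFold and cross_val_score utilities, as sketched below on a toy dataset; the choice of classifier is arbitrary.

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_digits(return_X_y=True)        # small toy dataset
model = LogisticRegression(max_iter=5000)

# k = 5 folds: each observation is used for validation exactly once
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(scores)          # accuracy of each fold
print(scores.mean())   # averaged to produce a single estimation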

3.4 Datasets

This section reviews the relevant databases used throughout the wide variety of studies regarding facial expression recognition. The sources of the datasets discussed in this section are available in Appendix A.

AFEW [30] — Acted Facial Expressions In The Wild (AFEW) is a dynamic temporal facial expression dataset consisting of 1809 video excerpts extracted from movies. It is labelled with the six basic emotions, angry, disgust, fear, happy, sad and surprise, plus neutral. A recommendation system uses the movie subtitles to suggest a certain video clip to a human labeler, who then annotates the perceived emotion and information regarding the actors present in the clip, such as their name, head pose and age.

AFEW-VA [73] — AFEW-VA is an extension of the AFEW dataset, in which 600 videos were selected and annotated with highly accurate per-frame levels of valence and arousal, along with per-frame annotations of 68 facial landmarks.

AffectNet [3] — AffectNet contains more than 1 million facial colored images collected from the Internet by querying three major search engines using 1250 keywords related to emotions in six different languages. The entire database was annotated in the dimensional model (valence-arousal) and about half of the database was manually annotated in both the categorical (with the labels neutral, happy, sad, surprise, fear, disgust, anger, contempt, none, uncertain and non-face) and dimensional models.

Aff-Wild2 [72] — The extended Aff-Wild database contains 558 videos annotated with continuous emotions, namely valence and arousal, different action units and the six basic expressions plus neutral. The database is already divided into training, validation and test sets. The test set is not publicly available, as the database is currently being used in a Competition and a Workshop for the IEEE International Conference on Automatic Face & Gesture Recognition (FG 2020).

AM-FED+ [97] — The Extended Dataset of Naturalistic and Spontaneous Facial Expressions Collected in Everyday Settings (AM-FED+) consists of 1,044 facial videos recorded in real-world conditions. All the videos have automatically detected facial landmark locations for every frame and 545 of the videos have been manually FACS coded. A self-report of liking and familiarity responses from the viewers is also provided.

CK+ [89] — The Extended Cohn-Kanade (CK+) is the most widely adopted laboratory-controlled dataset. The database is composed of 593 FACS coded videos, of which 327 are labelled with the six basic expression labels (anger, disgust, fear, happiness, sadness and surprise) and contempt. CK+ does not provide specific training, validation and test sets. The most common data-selection method for static-based approaches is to select the first (neutral) and last (peak expression) frames of each video.

EmotioNet [8] — The EmotioNet database includes 950,000 images collected from the Internet annotated with AUs, AU intensity, basic and compound emotion categories, and WordNet concepts. The emotion categories are happy, sad, fearful, angry, surprised, disgusted, happily sad, happily surprised, happily disgusted, sadly fearful, sadly angry, sadly surprised, sadly disgusted, fearfully angry, fearfully surprised, fearfully disgusted, angrily surprised, disgustedly surprised, happily fearful, angrily disgusted, awed, appalled and hatred. The computer vision algorithm used to automatically annotate the emotion categories and AUs is described in [8]. Figure 3.15 illustrates some query examples that can be made to the database.

FER2013 [45] — FER2013 was introduced in the ICML 2013 Challenges in Representation Learning and consists of 48x48 pixel grayscale images of faces. The images were collected using the Google image search API and resized so that the face is more or less centered and occupies about the same amount of space in each image. The database is composed of 28,709 training images, 3,589 validation images and 3,589 test images with seven emotion labels: anger, disgust, fear, happiness, sadness, surprise and neutral. Figure 3.16 shows some example images from the dataset, illustrating the variability in illumination, age, pose, expression intensity and occlusions that occurs under realistic conditions.

JAFFE [92] — The Japanese Female Facial Expression database is one of the first facial expression databases. It is composed of 10 Japanese female participants who posed for seven facial expressions

Figure 3.15: Possible queries accepted by the EmotionNet database [8]

(six basic emotions and neutral). The database is composed of 213 grayscale images with a resolution of 256x256.

KDEF [91] — The Karolinska Directed Emotional Faces (KDEF) is a set of 4,900 pictures with six different basic facial expressions (happy, angry, afraid, disgusted, sad, surprised) and neutral. The set of pictures contains 70 individuals (35 males and 35 females), viewed from 5 different angles.

MMI [109, 141] — MMI Facial Expression is a laboratory-controlled dataset with over 2,900 videos of 75 subjects. Each video was annotated for the presence of AUs and the six basic expressions plus neutral. It contains recordings of the full temporal pattern of a facial expression, from neutral to peak expression and back to neutral.

OULU-CASIA [160] — The Oulu-CASIA facial expression database contains 2,880 videos categorized into the six basic expressions: happiness, sadness, surprise, anger, fear and disgust. The videos were recorded in a laboratory environment, using two different cameras (near-infrared and visible light) under three different illumination conditions (normal, weak and dark). The first frame of each video is neutral and the last frame contains the peak expression.

RAF-DB [85, 83] — The Real-world Affective Faces Database (RAF-DB) is a large-scale facial expression database with 29,672 facial images downloaded from the Internet. The dataset has crowdsourcing-based annotations with the six basic emotions, neutral, and twelve compound emotions. For each image, information about accurate landmark locations, 37 automatic landmark locations, bounding box, race, age range and gender attributes is also available.

SFEW [29] — Static Facial Expressions in the Wild (SFEW) contains frames selected from AFEW, described above. The dataset is labelled with the six basic expressions (angry, disgust, fear, happy, sad, surprise) plus the neutral class, and is divided into 958 training samples, 372 testing

Figure 3.16: Examples of the FER2013 database

samples and 436 validation samples. The authors also made available a pre-processed version of the dataset with the faces aligned in the image. The SFEW was built following a Strictly Person Independent (SPI) protocol, thus the train and test datasets strictly contain different subjects.

There are several properties common to FER datasets, namely the shooting environment and the elicitation method.

The shooting environment is closely related to the data quality and thus to the performance of deep learning based FER systems. Laboratory-controlled shooting environments provide high-quality image data in which the illumination, background and head poses are normally strictly controlled. However, these datasets are time-consuming to build and therefore have the limitation of a low number of samples. In-the-wild settings, on the other hand, are easier to collect but make it more challenging for deep learning models to achieve high performance.

The elicitation method refers to the way the person pictured in an image portrayed the supposed emotion. Posed expression datasets, in which the facial behavior is deliberately performed, are often exaggerated, increasing the differences between classes and making the images easier to classify. Spontaneous expression datasets are collected under the guarantee that they only contain natural responses to emotion inductions, better reflecting the real world. Datasets collected from the Web or from movies, for instance, normally include both posed and spontaneous facial behavior.

Table 3.2 provides an overview over the FER databases described in this section.

An important aspect to consider regarding datasets is their class balance. Imbalanced data typically refers to classification problems in which the classes are not represented equally. This problem can mislead the interpretation of a model's results, since a model trained on an unbalanced dataset can achieve high accuracy values while still having generalization problems. Figure 3.17 illustrates the class distribution in some of the discussed datasets.

Table 3.2: Principal databases used in FER systems (C. = Colour, Res. = Resolution, G = Grayscale, RGB = RGB-colored, P = Posed (expression), S = Spontaneous (expression), BE = Basic Emotion, AU = Action Unit, CE = Compound Emotion, FL = Facial Landmark)

Database: Samples (C. Res.); Subjects; Source; Annotation
AFEW [29]: 1,809 videos (RGB N/A); 330 subjects; Movie, P & S; 6 BEs + Neutral
AFEW-VA [73]: 600 videos (RGB N/A); 240 subjects; Movie, P & S; Valence, Arousal, FLs
AffectNet [3]: 450,000 images (RGB 425x425); N/A; Web, P & S; 6 BEs + Neutral
Aff-Wild2 [72]: 558 videos (RGB 1454x890); 458 subjects; Web, S; Valence, Arousal
AM-FED+ [97]: 1,044 videos (RGB 320x240); 1,044 subjects; Web, S; 11 AUs, FLs, Liking
CK+ [89]: 593 videos (RGB 640x480); 123 subjects; Lab, P; 6 BEs + contempt, AUs, FLs
EmotioNet [8]: 950,000 images (RGB N/A); N/A; Web, P & S; 12 AUs, 23 BEs and CEs
FER2013 [45]: 35,887 images (G 48x48); N/A; Web, P & S; 6 BEs + Neutral
KDEF [91]: 4,900 images (RGB 562x762); 70 subjects; Lab, P; 6 BEs + Neutral
JAFFE [92]: 213 images (G 256x256); 10 subjects; Lab, P; 6 BEs + Neutral
MMI [109, 141]: 2,900 videos (RGB 720x576); 75 subjects; Lab, P; 6 BEs + Neutral, AUs
OULU-CASIA [160]: 2,880 videos (RGB 320x240); 80 subjects; Lab, P; 6 BEs
RAF-DB [85, 83]: 29,672 images (RGB N/A); N/A; Web, P & S; 6 BEs + Neutral, 42 FLs and 12 CEs
SFEW [29]: 1,766 images (RGB N/A); 95 subjects; Movie, P & S; 6 BEs + Neutral

Discussion

Datasets are the fundamental piece of any machine learning application. In this section we reviewed the principal datasets used in FER systems and came across some serious limitations. At the moment, to the best of our knowledge, AFEW [30] (and its extensions SFEW [29] and AFEW-VA [73]) is the only facial expression dataset in the movie domain, which poses a considerable obstacle given its very limited size. An alternative would be joining datasets from other domains, but there is evidence that increasing the number of databases used in training results in disappointing gains in cross-domain performance [21]. Additionally, there is a huge variability of annotations between datasets, which makes generalizability across domains difficult.

Figure 3.17: Class distribution per dataset [84]

Table 3.3: FER approaches and results on widely evaluated datasets

Dataset: Study (Year), Approach, Acc (%) (No. classes)
CK+ [89]: [100] (2019), FAN, 99.69 (7); [159] (2016), CNN, 98.9 (6); [10] (2017), CNN, 98.62 (6)
FER2013 [45]: [114] (2016), CNN-VGG, 72.7 (7); [10] (2017), CNN, 72.1 (7); [114] (2016), CNN-Inception, 71.6 (7)
JAFFE [92]: [68] (2019), SVM, 97.10 (7); [48] (2019), 2channel-CNN, 95.8 (7); [102] (2019), ATN, 92.8 (7)
SFEW [29]: [156] (2015), CNN, 55.96 (7); [66] (2015), CNN, 53.9 (7); [86] (2019), CNN (ACNN), 51.72 (7)
OULU-CASIA [160]: [152] (2018), GAN + CNN, 88.92 (6); [86] (2019), CNN (ACNN), 58.18 (6)

The discrete model of emotion predominates in FER datasets, presumably given its antiquity and the contribution of the FACS model, which correlates facial landmarks, action units and discrete emotions. Table 3.3 lists the SoA approaches and results on the most widely evaluated categorical datasets. CK+ [89] and JAFFE [92] achieve accuracy rates above 90% in SoA approaches, which is expected since they are datasets with laboratory-controlled, ideal conditions. However, datasets with subjects who perform spontaneous expressions in in-the-wild scenarios, such as FER2013 [45] and particularly SFEW [29], which belongs to the film domain, show less satisfactory results. The disparate performance is due to the heterogeneity of the datasets, because even within the discrete model there are several emotions, or combinations of them, to be considered. Regarding methodology, as shown in Table 3.3, CNN-based approaches are behind the current SoA results and can be applied to FER tasks with consistent performance. In conclusion, the current datasets in the movie domain are still not large enough to train the neural networks that have led to the most promising results in object recognition tasks. Physiological variations (such as age, sex, cultural context or levels of expressiveness) and technical inconsistencies (such as people's pose or lighting) are other challenges currently being addressed [82]. We try to overcome this problem by joining the benefits of a large-scale facial expression dataset with the available dataset from the movie domain, as proposed in Chapter 5.

3.5 Open-source and commercial solutions

Face-api.js

Face-api.js 4 is an open-source JavaScript API which implements several CNNs to solve face detection, face recognition, face landmark detection, face expression recognition, age estimation and gender recognition. The API is a node.js module built on top of tensorflow.js core optimized for the web and mobile devices. Figure 3.18 illustrates some emotion recognition examples from Face-api.js.

Figure 3.18: Face-api.js emotion recognition examples [37]

4https://github.com/justadudewhohacks/face-api.js/

Deepface

Deepface 5 is an open-source face recognition and facial attribute analysis (age, gender, emotion and race) framework for Python. The library is mainly based on Keras and TensorFlow and implements state-of-the-art models such as VGG-Face, Google FaceNet, OpenFace, DeepFace, DeepID and Dlib. Figure 3.19 illustrates some examples of facial attribute analysis from Deepface.

Figure 3.19: Deepface facial attribute analysis examples [24]

Microsoft Azure Face

Microsoft Azure Face 6 is part of Microsoft Cognitive Services and provides cloud-based algorithms that detect, recognize and analyze human faces in images. Through a well-documented API, this service is able to detect faces and extract their facial landmarks, to assess whether two faces belong to the same person with a certain level of confidence, and to detect perceived facial expressions such as anger, contempt, disgust, fear, happiness, neutral, sadness and surprise, as seen in Figure 3.20. At the moment, the Face API provides 30,000 free transactions per month. The Western European standard plan starts at 0.844 € per 1,000 transactions in the range of 0-1 million transactions per month and goes down to 0.338 € per 1,000 transactions if the number of transactions exceeds 100 million per month [5].

5https://github.com/serengil/deepface
6https://azure.microsoft.com/en-us/services/cognitive-services/face/

Figure 3.20: An example of the Azure Face API response [5]

Algorithmia

Algorithmia 7 claims to be the largest marketplace for machine learning algorithms in the world, allowing the creation of easy-to-deploy cloud-based AI applications through the use of community-based machine learning models. In the field of emotion recognition, Algorithmia uses Convolutional Neural Networks and Mapped Binary Patterns to recognize facial expressions in pictures containing faces [1]. Figure 3.21 illustrates an example of an Algorithmia API response.

Figure 3.21: An example of the Algorithmia API response [1]

At the moment, the service starts with a free trial with up to $300 of free credits for testing and then follows a pay-as-you-go pricing plan, with models running at 1 credit per second (10,000 credits/$1) [2].

7https://algorithmia.com/

Amazon Rekognition

Rekognition 8 is a cloud-based SaaS computer vision platform by Amazon that can identify objects, people, text, scenes and activities in any image or video. In the context of emotion recognition, Rekognition can detect and identify faces and retrieve attributes such as gender, age range, eyes open, glasses, facial hair and perceived emotion. The perceived emotion can be disgusted, happy, surprised, angry, confused, calm or sad. In video files, it can measure how these face attributes change over time, for instance constructing a timeline of the emotions expressed by an actor. Figure 3.22 illustrates a face detection example and Section C.1 of Appendix C shows a complete API response from Amazon Rekognition.

Figure 3.22: Amazon Rekognition face detection and analysis example [115]

At the moment, as part of the AWS free tier, it is possible to start with Amazon Rekognition at no cost. The free tier lasts 12 months and allows the analysis of 5,000 images per month and the storage of 1,000 pieces of face metadata per month, as well as the analysis of 1,000 minutes of video per month. The pricing plan starts at $1.16 per 1,000 images or $0.135 per minute of video analysis and comes down with greater use of the service [115].

Google Cloud Vision AI

Cloud Vision AI 9 is a cloud-based SaaS computer vision platform by Google used to derive insights from images through pre-trained models that detect objects, understand printed and handwritten text, and perform face detection along with associated key facial attributes such as emotional state, facial landmarks or whether the person is wearing headwear. The set of emotions consists of joy, sorrow, anger and surprise. Specific individual facial recognition is not supported. Figure 3.23 illustrates some attributes of the Google API response. A more complete example can be found in Section C.2 of Appendix C.

8https://aws.amazon.com/rekognition/
9https://cloud.google.com/vision

Figure 3.23: Google Vision API attributes example [149]

At the moment, the service is free for the first 1,000 image analyses in a given month. Exceeding this limit, the price is $1.50 or $0.60 for every 1,000 image analyses per month, depending on the volume of API usage [142].

Face++

Face++ 10 is a platform offering computer vision solutions that enable applications to understand faces better. It allows developers to add deep-learning technologies into existing solutions, with simple APIs and SDKs. In the context of emotion recognition, the service is able to detect various emotions in pictures of faces with a confidence score. The emotions supported are happiness, neutral, surprise, sadness, disgust, anger and fear. Face++ is also able to determine face attributes including age, gender, smile intensity, head pose or eyes/mouth status and to estimate eye gaze direction in images. Figure 3.24 illustrates an example of the Face++ Emotion Recognition API response. At the moment, all APIs can be used for free, with a pay-as-you-go service upgrade according to business volume, at a price of $0.0005 per call [125].

Discussion

The big technology companies are aware of the needs of the market and provide a myriad of artificial intelligence solutions, including FER. However, none of the suggested solutions is intended to be applied specifically to the movie domain, with the exception of Microsoft Video Indexer, presented in Section 4.3, which

10https://www.faceplusplus.com/

Figure 3.24: Face++ Emotion Recognition API response example

incorporates several features of movies (including facial expressions) in order to conduct an emotional analysis of a film. Additionally, the documentation of the SaaS solutions generally does not provide much detail about which methods and databases are used, restricting the use of these services to simple but limited APIs. Open-source solutions, on the other hand, are more transparent about which methods and databases are used and have the benefit of being free and unlimited to use. Table 3.4 offers a comparison between the different open-source and commercial solutions covered in this section.

Table 3.4: Comparison between the different open-source and commercial solutions. The prices shown apply to the entry level paid plan available after the depletion of the free resources

Face-api.js: Emotion classification: confidence score for Happy, Disgusted, Sad, Fearful, Angry and Surprised. Other facial attributes: face bounding box, face recognition, face similarity, face landmark detection, age estimation, ethnicity and gender recognition. Price: free.

Deepface: Emotion classification: confidence score for Angry, Disgust, Sad, Neutral, Fear, Happy and Surprised. Other facial attributes: face bounding box, face recognition, face landmark detection, age estimation and gender recognition. Price: free.

Microsoft Azure Face: Emotion classification: confidence score for Anger, Contempt, Disgust, Fear, Happiness, Neutral, Sadness and Surprise. Other facial attributes: face bounding box, facial landmarks, face verification. Price: 0.844 € per 1,000 transactions.

Algorithmia: Emotion classification: confidence score for Happy, Neutral, Disgust, Sad, Fear, Angry and Surprise. Other facial attributes: face bounding box. Price: $1 for 10,000 service credits.

Amazon Rekognition: Emotion classification: confidence score for Disgusted, Happy, Surprised, Angry, Confused, Calm and Sad. Other facial attributes: face bounding box, age range, beard, eyeglasses, sunglasses, smile and moustache detection, mouth and eyes status, gender, facial landmarks, head pose and image quality. Price: $1.16 per 1,000 images.

Google Cloud Vision AI: Emotion classification: likelihood (unknown, very unlikely, unlikely, possible, likely, very likely) of Joy, Sorrow, Anger and Surprise. Other facial attributes: face bounding box, facial landmarks, underexposed and blurred image detection, headwear detection. Price: $1.50 for every 1,000 images.

Face++: Emotion classification: confidence score for Anger, Neutral, Disgust, Fear, Happiness, Sadness and Surprise. Other facial attributes: face bounding box, age, gender, smile intensity, head pose, eye status, beauty, eye gaze, mouth status, skin status, ethnicity, face image quality and blurriness. Price: $0.0005 per API call.

Chapter 4

Automatic Emotion Analysis in Movies

4.1 Unimodal Approach
4.2 Multimodal approach
4.3 Commercial Solutions

In this section the principal methods for automatic emotion analysis in movies are reviewed. In the end, the commercial solutions available for this task are discussed.

4.1 Unimodal Approach

There are several studies that use either unimodal or multimodal approaches with interesting results. In an unimodal approach, only one feature modality is considered to evaluate emotion, whether in the form of text, audio or video.

4.1.1 Emotion in text

Retrieving the emotional aspect of text has been a popular topic in business-oriented fields such as e-commerce, marketing and management. The emergence of social networks like Facebook or Twitter made it possible to easily aggregate large amounts of opinions on the widest variety of topics, spanning product reviews, hotel reviews and entertainment reviews of TV series, music, books or movies. In this domain, emotion is normally considered as a sentiment, which can be positive, negative or neutral. Sentiment analysis uses Natural Language Processing (NLP) and text analysis techniques to extract sentiments from text-based content. The approach can be at document level (analyzing the overall text), sentence level (analyzing the sentiment of each sentence) or aspect level, which associates specific sentiments with different aspects of a product or service. The manipulation of the data is done using techniques such as Bag-of-Words and TF-IDF, which

represent the content under analysis in vector form. The vectors are then used in algorithms such as Support Vector Machines (SVM), Naive Bayes or Markov chains. The main challenges in the Sentiment Analysis field include domain dependence bias, NLP overhead, negation handling and spam/fake content detection [57]. Regarding movies, a pre-processed collection of movie scripts was used to generate a sequence of sentiment values that fed a neural network to learn the general sentiment value of the script, showing high similarity between the target and the predicted value [67]. In another approach, plot summaries were used to train a classification task of the movie genre, with a precision of 67.75% [35].
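As an illustration of the TF-IDF plus classifier pipeline mentioned above, the scikit-learn sketch below trains a sentiment classifier on a handful of hypothetical review snippets; real studies use far larger corpora.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical, tiny training corpus with sentiment labels
reviews = [
    "a moving and beautifully acted film",
    "one of the best movies of the year",
    "a dull, lifeless and predictable plot",
    "terrible acting and a boring script",
]
labels = ["positive", "positive", "negative", "negative"]

# TF-IDF turns each review into a weighted bag-of-words vector, which feeds a linear SVM
classifier = make_pipeline(TfidfVectorizer(), LinearSVC())
classifier.fit(reviews, labels)

print(classifier.predict(["a boring and predictable film"]))   # expected: ['negative']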

4.1.2 Emotion in sound

Several studies show a high correlation between sound characteristics and the emotional state of the speaker [4, 26]. Retrieving emotion from sound is possible by analyzing the differences that occur in acoustic features when speech contains the same content but in a different emotional state. For indexing speech into the frequency domain, several physical features are applied to sound, like spectrum irregularity, speech signal filtering and processing, enhancement and manipulation of specific frequency regions, individual phonemes, and others [77]. The Mel-Frequency Cepstral Coefficients (MFCC), which decompose the audio into a short-time spectral shape with relevant information about the voice and sound effects, are widely used given their good performance. The most frequent classification algorithms used in emotion recognition from audio are k-Nearest Neighbor (k-NN), Support Vector Machines (SVM), Artificial Neural Networks (ANN) and Hidden Markov Models (HMM) [90]. Regarding movies, several studies have shown that emotion recognition from speech is possible both in the dimensional model of emotion [25, 43] and in the discrete model of emotion [17, 70].
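As a sketch of how such features are commonly extracted in practice, the snippet below uses the librosa library, which is not part of the toolset described in this work and is therefore an assumption; the audio file path is hypothetical.

import librosa

audio_path = "scene_excerpt.wav"   # hypothetical audio excerpt from a movie scene

# Load the waveform and compute 13 Mel-Frequency Cepstral Coefficients per frame
y, sr = librosa.load(audio_path, sr=None)
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

print(mfcc.shape)   # (13, number_of_frames); these vectors can feed a k-NN, SVM, ANN or HMM classifier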

4.1.3 Emotion in image and video

In addition to Facial Emotion Recognition, covered in Chapter 2, the stylistic characteristics of movies can be used to convey communication and prompt different feelings in the audience [158]. These characteristics can be divided into two groups:

• Temporal features — represent the dynamics of the video, such as the shot length, average shot length, number of shots, scenes and number of scenes, as well as motion (of subjects in a scene or of the camera).

• Spatial features — represent the static information in one single frame, such as color variance, dominant color, color layout, color structure, color energy, lighting and saturation.

The application of this feature set in combination with Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) has shown promising results in film classification [165]. Other studies proposed that video recommendation systems benefit from relying on visual low-level information rather than relying only on high-level information about the movies, such as genre, cast, director and similar metadata [25]. Colour has the ability to affect us either consciously or unconsciously, and filmmakers carefully compose each frame, making color decisions that directly affect our watching experience. The most paradigmatic example of the use of color in cinema and its impact on the audience is the famous scene from the movie Schindler's List where a red coat is used to distinguish a little girl in a scene depicting the liquidation of the Kraków ghetto in an otherwise monochrome movie. The girl in the red coat scene is pictured in Figure 4.1.

Figure 4.1: The girl in red coat from Schindler’s List

However, there has been little research in this field, and colour may be poorly considered given how ubiquitous it is. In the previous example, the color red was used, a color more commonly associated with love or passion scenes in movies. Nevertheless, color is inseparable from the emotional experience of a film and is an important feature to be considered in systems for emotional film analysis.

4.2 Multimodal approach

Recently, a number of approaches to multimodal sentiment analysis with interesting results have been proposed. These approaches often combine video with audio low-level features, sometimes incorporating the script or captions from other databases. The early fusion of audio-visual features has been tested with and without dimensionality reduction, using Extreme Learning Machines (ELM), Random Forests and Support Vector Regression (SVR) for arousal and valence prediction tasks [139]. Moreover, other studies use Convolutional Neural Networks (CNNs) and compare the performance of multimodal early fusion and late fusion, as well as the impact of transfer learning in these models [106]. Other innovative approaches determine the emotional arcs that occur in a movie using audio-visual features, which refer to the evolution of emotionally charged moments (emotional peaks and valleys), relevant to movie engagement analysis [19]. More recent approaches combine textual with audio-visual information. On a discrete emotion dataset, different levels of feature combination were tested (unimodal, bimodal and trimodal) with CNNs, Long Short-Term Memory (LSTM) and Bidirectional LSTM (BLSTM) networks. The architecture of LSTMs and BLSTMs increases the amount of contextual information, considering long-distance dependencies in sequences, which is relevant to textual and video analysis [151]. Considering the characters in movies, as well as the interaction between them at both the conversational level (movie script) and the visual level (facial expressions), can also give important clues for the emotion recognition of a scene [78, 134].

4.2.1 Feature fusion techniques

• Feature-Level Fusion — combines the characteristics extracted from each input channel in a 'joint vector' before any classification operations are performed [126]. Recent studies claim that one of the challenges of this approach is the combination of highly heterogeneous data, with the synchronization of the various inputs being a non-trivial task [95].

• Decision-Level Fusion — The unimodal results are combined at the end of the process by choosing suitable metrics, such as expert rules and other simple operators.

The most consensual fusion method is decision-level fusion, mainly because the methodology is independent of the characteristics being studied [157]. Multimodal models normally deal with high-dimensional data, so it is important to perform dimensionality reduction. The main advantages of using such techniques are avoiding the curse of dimensionality, reducing the amount of time and memory required by data mining algorithms and allowing data to be more easily visualized. Additionally, it may help to eliminate irrelevant features or reduce noise. Principal Component Analysis (PCA) transforms data from a higher-dimensional space into a new coordinate system with a smaller dimension. Its goal is to preserve as much variance as possible in the new coordinate system [40]. The test accuracy of unimodal models and feature early-fusion models improves with the use of PCA, and combining PCA with CNNs helps remove noise from the data [150].
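A minimal scikit-learn sketch of PCA-based dimensionality reduction is given below; the feature matrix is random and only illustrates the call.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 2048))   # hypothetical high-dimensional multimodal feature vectors

# Keep as many principal components as needed to preserve 95% of the variance
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(features)

print(features.shape, "->", reduced.shape)
print(pca.explained_variance_ratio_.sum())   # fraction of the variance preserved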

4.3 Commercial Solutions

Microsoft Video Indexer

The Azure Media Services Video Indexer (VI) 1 is a powerful Software as a Service (SaaS) solution provided by Microsoft that consolidates audio and video artificial intelligence technologies in one integrated online service, capable of extracting several deep insights from multimedia files. It uses personalized machine learning models based on multiple channels (audio, visual, voice) and offers an accessible web platform as well as an API to retrieve deep insights about video content. Figure 4.2 provides a quick overview of the principal functionalities of the service. At the moment, Video Indexer provides up to 10 hours of free indexing for website users and up to 40 hours of free indexing for API users. For larger-scale indexing, there is the option of a paid unlimited account, with a price of 0.127 € per minute for video analysis and 0.034 € per minute for audio analysis in the Western Europe region [101].

Figure 4.2: Video Indexer functionalities overview

The most noteworthy available features are face detection, celebrity identification, scene/shot segmentation, multi-language speech identification with audio transcription and emotion detection based on speech (what’s being said) and voice tonality (how it’s being said). Emotions can be joy, sadness, anger or fear. Video Indexer is also able to perform multi-channel sentiment analysis, identifying positive, negative, and neutral sentiments from speech and visual text. An example of the insights given by this service is illustrated in Figure 4.3.

1https://azure.microsoft.com/en-us/services/media-services/video-indexer/

Figure 4.3: Video Indexer insights from the 1999 movie The Matrix

Chapter 5

Methodology, Results and Evaluation

5.1 Problem Definition and Methodology
5.2 Implementation details
5.3 Approach

After discussing the state-of-the-art approaches and their limitations regarding FER and emotion recognition in movies, this chapter outlines the methodology of this dissertation. Firstly, a clear formulation of the main problem is made. Then, as an introduction to the followed approach, the software used is described. The last section thoroughly covers the hypotheses raised, the experiments and the results that stem from this work.

5.1 Problem Definition and Methodology

Now that we have a good understanding of the context, the main approaches found in the literature and their corresponding limitations, we can focus on the problem of this dissertation and how we will handle it. Based on the evidence discussed in Section 3.4, it becomes clear that there are no sufficiently large movie datasets with face-derived emotion annotations. As a direct consequence, there are not many studies that validate the use of FER deep learning models specifically for the movie domain. Therefore, the problem we investigate is twofold and can be defined as follows:

Does applying the current FER datasets to the movie domain lead to meaningful results? If so, what is the best deep learning approach to achieve those results? If not, what can be done to foster the development of a tool that is based on the affective nature of a film?

This work represents another step towards the conception of a system that could automatically retrieve affective information about movies, by drawing a first draft of how this system would

be accomplished if it depended solely on facial expressions. In order to do so, we will experiment with state-of-the-art methods for each of the sub-tasks in our work: data analysis, data pre-processing, deep learning modelling and evaluation. The research methodology that we will conduct to answer our problem is based on a sequence of experiments in which a hypothesis is raised and tested and the results are analyzed. From the results of one experiment another hypothesis is suggested, in a process that ends with a full comprehension of the problem and a number of ideas on how to overcome it. In order to validate the proposed hypotheses, we will use machine learning validation techniques. We will start with an exploratory analysis of the datasets used, to find any susceptibility to data bias. To validate our models, we will use the most relevant metrics for classification problems and we will closely monitor the learning curves of our models to inspect for possible under- or overfitting problems. Our approach is further detailed in Section 5.3.

5.2 Implementation details

In this section the different technologies used during the development of the present work are introduced. Table 5.1 specifies the software used and the respective versions. Google Colaboratory, or Colab 1, was used as the principal programming interface since it is a cloud-based solution especially suited for machine learning and data analysis applications. The system allows the creation and execution of Python-powered Jupyter Notebooks 2 hosted by Google in Google Drive. The free educational version of Colab allows the development and execution of code on an Intel(R) Xeon(R) CPU @ 2.20GHz, with 12GB of RAM and GPU (NVIDIA(R) T4 Tensor Core GPU) or TPU hardware acceleration. In fact, this is the main reason behind our choice of this solution, since it can handle GPU-intensive tasks like training computer vision deep learning models. In the free version of Colab, the notebooks can run for at most 12 hours. Google currently offers a Pro version for $9.99/month with faster GPUs, longer runtimes and more memory, but it was not required for the development of this project. Google Colab already has pre-installed machine learning libraries that are extensively used throughout this work. Pandas 3 is a library used for data manipulation and analysis, offering data structures and operations for manipulating numerical tables and time series. The Numpy 4 library allows the creation of multi-dimensional arrays and matrices and provides the mathematical operations to employ on these arrays. Tensorflow 5 offers end-to-end open-source solutions for building machine learning models, and Keras 6 is a Python library that runs on top of Tensorflow, designed to enable fast and flexible

1https://colab.research.google.com/
2Jupyter Notebook is an open-source web application in which it is possible to create and share documents that contain live code, equations, visualizations and narrative text.
3https://pandas.pydata.org/
4https://numpy.org/
5https://www.tensorflow.org/
6https://keras.io/

Table 5.1: Versions of the software used in this work

Library: Version
Tensorflow: 2.3.0
OpenCV: 4.1.2
Keras: 2.4.3
Numpy: 1.18.5
Colab: 1.0.0
Matplotlib: 3.2.2
Seaborn: 0.10.1
Python: 3.6.9
scikit-learn: 0.22.2
CUDA: 10.1

experimentation with deep neural networks. Scikit-learn 7 is another Python library for machine learning and statistical modelling, containing efficient tools for supervised and unsupervised learning, cross-validation, feature extraction and dimensionality reduction.

Regarding images and pre-processing operations, OpenCV 8 provides a common infrastructure for computer vision applications, containing more than 2,500 optimized algorithms, including a comprehensive set of both classic and state-of-the-art computer vision and machine learning algorithms. Dlib 9 is a general purpose cross-platform software library that implements a variety of machine learning algorithms and supporting functionality like threading and networking.

For data visualization, Matplotlib 10 allows the creation of static, animated and interactive visualizations in Python. Seaborn 11 is a data visualization library based on Matplotlib that provides a high-level interface for drawing attractive and informative statistical graphics.

The Face_alignment 12 library implements facial landmark detection using the world's most accurate face alignment network, capable of detecting points in both two-dimensional and three-dimensional coordinates. It is the only framework used that does not come pre-installed on Colab.

The input/output operations, along with storage, were handled by the google.colab module, which stores and retrieves information typically from Google Drive. During development we noticed that using Google Drive slowed down the operations of the neural network, so we started to upload the files directly to the memory of the cloud machine, with the limitation that the files were only available during the notebook runtime (12h).

7https://scikit-learn.org/
8https://opencv.org/
9https://dlib.net/
10https://matplotlib.org/
11https://seaborn.pydata.org/
12https://github.com/1adrianb/face-alignment

5.3 Approach

This section focuses on the approach adopted in this dissertation. Each subsection represents an experiment made in order to validate a specific hypothesis. Our approach starts with the selection and analysis of the datasets to be used. This is the primordial phase of any machine learning pipeline and is of crucial importance, since all subsequent results depend on this choice. After this decision, we used the selected datasets to find out which is the best method for face detection in the movie domain, an important pre-processing step in FER applications. Then, we proceeded to test and benchmark several CNN architectures in order to understand which would best fit the problem. The results consist of the accuracy and loss learning curves in the training phase and the confusion matrices in the validation phase. The next natural step was the deep learning modeling and optimization, in which we ran several experiments with the chosen datasets. Following the findings reported in Section 2.4, we try to maximize the obtained results, on the one hand, by proposing clustered emotions and, on the other hand, by combining facial landmark masks with the original face image as input to the network.

5.3.1 Datasets Exploration

None of the datasets introduced in Section 3.4 perfectly fits this project, since there is no large-scale FER database in the film domain. Thus, what we propose is a cross-database study involving two in-the-wild settings that can unite the benefits of a large database with the benefits of a film-based database. The chosen large-scale facial expression dataset was FER2013 [45]. The dataset was created using the Google image search API with 184 different keywords related to emotions, like blissful or enraged, collecting 1,000 images for each search query. Then, the images were cropped to the face region and a face-alignment post-processing phase was conducted. Finally, the images were grouped into the corresponding groups of emotions. The dataset consists of a single .csv file containing the columns emotion, pixels and Usage. Each image, represented as a flattened 48x48 pixel vector in the pixels column, is labeled with an encoded emotion, in which 0 represents Angry, 1 represents Disgust, 2 represents Fear, 3 represents Happy, 4 represents Sad, 5 represents Surprise, and 6 represents Neutral. The dataset is already divided into Training (28,709 samples), PublicTest (3,589 samples) and PrivateTest (3,589 samples) partitions, which serve as the training and validation sets in this work. Figure 5.1 illustrates the content of the .csv file. The number of samples per class is presented in Table 5.2 and the class distribution is illustrated in Figure 5.2. The imbalance of the dataset is fairly evident, especially between the Disgust (547 samples) and Happy (8989 samples) classes. This imbalance can be explained by the fact that it is relatively easy to classify a smile as happiness, while perceiving anger, fear or sadness is a more complicated task for the annotator. This problem can be mitigated by applying data augmentation techniques or by using a cost-sensitive loss function during training. The details on how we addressed this problem can be found in Section 5.3.4.

Figure 5.1: FER2013 content overview

Figure 5.2: FER2013 class distribution

Table 5.2: FER2013 number of samples per class

Emotion          Number of samples (%)
0 — Angry        4953 (13.8)
1 — Disgust      547 (1.5)
2 — Fear         5121 (14.3)
3 — Happy        8989 (25.0)
4 — Sad          6077 (16.9)
5 — Surprise     4002 (11.1)
6 — Neutral      6198 (17.3)
Total            35887 (100)

By combining NumPy numerical capabilities with Matplotlib visualization tools it is possible to recreate and visualize the dataset images inside the notebook, as illustrated in Figure 5.3.

Figure 5.3: FER2013 image examples
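As a reference for this loading step, a minimal sketch of how the .csv can be parsed with pandas and one sample recreated with NumPy and Matplotlib; the fer2013.csv path is illustrative, while the column names and the label encoding are the ones described above.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

EMOTIONS = ['Angry', 'Disgust', 'Fear', 'Happy', 'Sad', 'Surprise', 'Neutral']

# The whole dataset lives in a single .csv file with the columns emotion, pixels and Usage.
df = pd.read_csv('fer2013.csv')

# Each row stores a 48x48 grayscale image as a string of space-separated pixel values.
sample = df.iloc[0]
image = np.array(sample['pixels'].split(), dtype=np.uint8).reshape(48, 48)

plt.imshow(image, cmap='gray')
plt.title(EMOTIONS[int(sample['emotion'])])
plt.axis('off')
plt.show()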

SFEW [29] was another dataset selected for this dissertation, since its images were collected directly from film frames. Furthermore, the labels of SFEW are consistent with the FER2013 dataset, making the aforementioned cross-database study possible. The movies from which the frames were collected are listed in Appendix B. The original version of the dataset only contained movie stills, while the second version comes with already pre-processed aligned faces and LPQ and PHOG13 features. Table 5.3 presents the distribution of the images in the dataset. SFEW was built following a Strictly Person Independent (SPI) protocol, meaning that the train and test datasets do not contain images of the same person. Figure 5.4 illustrates some aligned face examples from the dataset.

Table 5.3: SFEW aligned face samples per class. The test set contains 372 unlabeled images

              Train   Validation   Test (unlabeled)
Angry         178     77
Disgust       49      23
Fear          78      46
Happy         184     72
Sad           161     73
Surprise      94      56
Neutral       144     84
Total         888     431          372

13 Local Phase Quantization (LPQ) and Pyramid Histogram of Oriented Gradients (PHOG) are feature descriptors used for image feature extraction.

Figure 5.4: SFEW aligned face samples. The left image is labeled as Angry, the middle image is labeled as Happy and the right image is labeled as Sad

5.3.2 Facial Detection in Movies

When building a deep learning based FER solution, we need to guarantee that faces, and only faces, are supplied to the neural network for training and testing. Thus, the following hypothesis naturally emerged:

Does applying state-of-the-art face detection methods lead to meaningful results in the movie domain?

To test this hypothesis, we set up a simple experiment. We implemented dlib's HOG and CNN face detectors, explained in Section 3.2.1, and ran them over all 938 original frames contained in the SFEW training folder, with the guarantee that there is always at least one person per frame. The results of this experiment are shown in Table 5.4.

Table 5.4: Facial Recognition in movies experiment

Emotion     Total   HOG (%)       CNN (%)
Angry       178     108 (60.7)    151 (71.7)
Disgust     66      41 (62.1)     51 (77.3)
Fear        98      58 (59.2)     81 (82.7)
Happy       198     154 (77.8)    193 (97.5)
Neutral     150     98 (65.3)     130 (86.7)
Sad         172     105 (61.1)    143 (83.1)
Surprise    96      58 (60.4)     79 (82.3)
Total       938     622 (66.3)    828 (88.3)

Upon the detection of a face, dlib's face detector creates a rect object, a bounding box rectangle containing the (x, y)-coordinates of the detection. Using OpenCV it is possible to draw the bounding box over the input image and visualize the result, as shown in Figure 5.5. This information can then be used in a posterior face cropping and alignment phase.

Figure 5.5: Dlib face bounding box
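A sketch of this detection step, assuming a movie frame at an illustrative path and that dlib's pre-trained CNN weights file (mmod_human_face_detector.dat) has been downloaded; the HOG detector ships with dlib itself.

import cv2
import dlib

hog_detector = dlib.get_frontal_face_detector()
cnn_detector = dlib.cnn_face_detection_model_v1('mmod_human_face_detector.dat')

image = cv2.imread('sfew_frame.png')             # illustrative SFEW movie frame
rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)     # dlib expects RGB channel ordering

# The second argument is how many times the image is upsampled before detection.
hog_rects = list(hog_detector(rgb, 1))
cnn_rects = [d.rect for d in cnn_detector(rgb, 1)]   # CNN detections wrap a rect object

# Draw every detected bounding box over the original frame with OpenCV.
for rect in hog_rects + cnn_rects:
    x1, y1, x2, y2 = rect.left(), rect.top(), rect.right(), rect.bottom()
    cv2.rectangle(image, (x1, y1), (x2, y2), (0, 255, 0), 2)

cv2.imwrite('sfew_frame_boxes.png', image)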

By analyzing the results of the experiment, we learned that current methods of face detection in images taken from movie frames are somewhat limited. Nevertheless, we could conclude that using deep learning methods translates into better detection, since we were able to detect about 22% more faces in the same dataset than with the traditional shallow learning method. Using GPU hardware acceleration, the deep learning method ran on average for 0.18 seconds per image, while the HOG method ran for 0.25 seconds per image. All in all, the CNN method proved to be efficient and more accurate.

5.3.3 CNNs Baseline and Benchmark

Based on the successful CNN approaches reported in Section 3.4, we chose the CNN architecture as the backbone of our work. However, it was not certain which architecture best fits FER tasks. We built a simple convolutional network to get a first sense of its performance on the FER2013 dataset. Not satisfied with the initial results, we decided to go further and apply knowledge transfer through pre-trained networks until better results were achieved. The model with the best performance was the one adopted and optimized, which is addressed in Section 5.3.4.

Baseline Network Architecture

To set a baseline for our work, we first built a simple Convolutional Neural Network consisting of four convolutional blocks of two layers each, each block followed by a pooling layer. The classifier consists of two dense layers with a dropout layer in between. Figure 5.6 illustrates the diagram of the proposed network. The goal of this network is to get a first understanding of how this type of architecture behaves when trained with FER2013, our reference dataset.

Figure 5.6: Baseline Network Architecture diagram

The baseline test was performed as follows. Firstly, the pixel values in FER2013 were normalized and stored as 48x48 one-channel input vectors. Secondly, using Keras utilities, the seven-class vector of labels was converted to a binary class matrix with keras.utils.to_categorical. Both images and labels were then divided into training, test and validation sets according to the initial division from the .csv file. The model was built using the Sequential Keras backbone, which allows grouping a linear stack of layers into a Keras model and provides training and inference features. We trained the baseline network for up to 25 epochs, optimizing the cross-entropy loss with the Adam optimizer. The initial learning rate and batch size were fixed at 0.1 and 128, respectively. The learning rate is reduced by a factor of 0.1 if the validation accuracy does not improve for 3 epochs. The training phase was executed in 465.949 seconds and the learning curve, loss curve and confusion matrix are illustrated in Figure 5.7.

Figure 5.7: Baseline network accuracy, loss and confusion matrix
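A sketch of the baseline experiment described above; since Figure 5.6 is not reproduced here, the number of filters per convolutional block, the width of the first dense layer and the dropout rate are illustrative, while the data handling and hyperparameters follow the text.

import numpy as np
import pandas as pd
import tensorflow as tf

df = pd.read_csv('fer2013.csv')
x = np.stack([np.array(p.split(), dtype='float32') for p in df['pixels']])
x = x.reshape(-1, 48, 48, 1) / 255.0                    # normalized 48x48 one-channel inputs
y = tf.keras.utils.to_categorical(df['emotion'], 7)     # seven-class binary matrix
train = (df['Usage'] == 'Training').values
val = (df['Usage'] == 'PublicTest').values

def conv_block(filters):
    # two convolutional layers followed by a pooling layer, as in Figure 5.6
    return [tf.keras.layers.Conv2D(filters, 3, padding='same', activation='relu'),
            tf.keras.layers.Conv2D(filters, 3, padding='same', activation='relu'),
            tf.keras.layers.MaxPooling2D()]

model = tf.keras.Sequential(
    [tf.keras.layers.Input(shape=(48, 48, 1))]
    + conv_block(32) + conv_block(64) + conv_block(128) + conv_block(256)  # filter counts are illustrative
    + [tf.keras.layers.Flatten(),
       tf.keras.layers.Dense(256, activation='relu'),   # classifier: two dense layers with dropout in between
       tf.keras.layers.Dropout(0.5),
       tf.keras.layers.Dense(7, activation='softmax')])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.1),
              loss='categorical_crossentropy', metrics=['accuracy'])

# Learning rate is reduced by a factor of 0.1 after 3 epochs without validation improvement.
reduce_lr = tf.keras.callbacks.ReduceLROnPlateau(monitor='val_accuracy', factor=0.1, patience=3)

model.fit(x[train], y[train], validation_data=(x[val], y[val]),
          batch_size=128, epochs=25, callbacks=[reduce_lr])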

Although the results are mildly satisfactory, they are still far from SoA results. To improve the performance, we decided to employ transfer learning techniques and use CNN models initialized with pre-trained weights from ImageNet [27], a general dataset with over 14 million images. The selected models were MobileNetV2 [123], Xception [18], VGG16 [129], ResnetV2 [51], InceptionV3 [137] and DenseNet [55]. These models were selected based on their solid performance in other image challenges, with the premise that they can possibly also be applied to FER tasks. Keras provides an API to automatically deploy pre-trained networks by specifying the input_shape, the pre-trained weights and whether we want to include the original network classifier. The networks were trained with the same hyperparameters defined for the baseline and the same process was applied for every model. The overall architecture and further details are provided in the published articles about the networks.
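A sketch of how the benchmarked networks can be instantiated through the Keras applications API with exactly those three options (input_shape, weights and include_top); the concrete ResNetV2 and DenseNet variants (ResNet50V2 and DenseNet121) and the 75x75 input size are assumptions, the latter chosen because some of these backbones impose minimum input sizes.

import tensorflow as tf

backbones = {
    'MobileNetV2': tf.keras.applications.MobileNetV2,
    'Xception': tf.keras.applications.Xception,
    'VGG16': tf.keras.applications.VGG16,
    'ResnetV2': tf.keras.applications.ResNet50V2,      # one of the available ResNetV2 variants
    'InceptionV3': tf.keras.applications.InceptionV3,
    'DenseNet': tf.keras.applications.DenseNet121,     # one of the available DenseNet variants
}

models = {
    name: constructor(weights='imagenet',      # ImageNet pre-trained weights
                      include_top=False,       # drop the original 1000-class ImageNet classifier
                      input_shape=(75, 75, 3)) # large enough for all of the above backbones
    for name, constructor in backbones.items()
}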

VGG16

VGG16 [129] was released in 2015 and was one of the first architectures to use small kernel sizes while increasing the depth of the network to 16 layers; the small kernels keep the number of parameters manageable despite the added depth.

The training results illustrated in Figure 5.8 are the accuracy and loss learning curves and the confusion matrix of the validation set. VGG16 had an execution time of 1364.90 seconds.

Figure 5.8: VGG16 accuracy, loss and confusion matrix on FER2013

InceptionV3

InceptionV3 [137] was submitted in 2015 with several improvements over GoogLeNet/InceptionV1. The most notable improvement was the introduction of convolution factorization, with the aim of reducing the number of connections and parameters without decreasing network performance. The training results illustrated in Figure 5.9 are the accuracy and loss learning curves and the confusion matrix of the validation set. InceptionV3 had an execution time of 1127.68 seconds.

Figure 5.9: InceptionV3 accuracy, loss and confusion matrix on FER2013

Xception

Xception [18] stands for Extreme version of the Inception network and was presented in 2017. The differences to the previous network are the order of the depthwise separable convolutions and the absence of the intermediate ReLU non-linearity after the first operation. The training results illustrated in Figure 5.10 are the accuracy and loss learning curves and the confusion matrix of the validation set. Xception had an execution time of 245.85 seconds.

Figure 5.10: Xception accuracy, loss and confusion matrix on FER2013

MobileNetV2

MobileNetV2 [123] is a neural network architecture designed in 2018 by Google and optimized for mobile devices or any devices with low computational power. The key features of this architecture are depthwise separable convolutions and inverted residual structures, which are explained in depth in the published study.

The training results are illustrated in Figure 5.11. MobileNetV2 had an execution time of 85.32 seconds.

Figure 5.11: MobileNetV2 accuracy, loss and confusion matrix on FER2013

ResnetV2

ResnetV2 [51] was introduced in 2016 and differentiates itself from the first version of Resnet, discussed in Section 3.2.2, particularly by doing the batch normalization and ReLU activation operations before the two-dimensional convolution in the residual block.

The training results illustrated in Figure 5.12 are the accuracy and loss learning curves and the confusion matrix of the validation set. ResnetV2 had an execution time of 4024.74 seconds.

Figure 5.12: ResnetV2 accuracy, loss and confusion matrix on FER2013

DenseNet

DenseNet [55] was proposed in 2016; in this architecture, each layer obtains additional inputs from all preceding layers and passes its own feature maps to all subsequent layers. The training results are illustrated in Figure 5.13. DenseNet had an execution time of 1131.98 seconds.

Figure 5.13: DenseNet accuracy, loss and confusion matrix on FER2013

Discussion

The first experiment served the purpose of obtaining a general picture of how the networks with the best results in object classification tasks would behave if retrained with a face dataset such as FER2013. As such, we decided to implement the vanilla architectures without further fine-tuning in order to get results in a timely fashion. The choice of the network was based on two criteria: runtime and accuracy achieved. The results are summarized in Table 5.5. Although these metrics are not enough to evaluate the complete robustness of the models, they allowed us to foresee the most promising network for our work. In fact, none of the vanilla models achieved SoA results, given the little investment made in their refinement. However, the benchmark allowed us to realize that models based on residual blocks, as is the case with ResnetV2, can achieve an excellent initial result but at the cost of a substantial runtime. The VGG-based network achieved a good accuracy but with the second worst runtime of the benchmark. MobileNetV2 was the network with the best runtime, which we were anticipating since it was designed and optimized to run on smartphones. However, the accuracy achieved is only slightly higher than the defined baseline.

Table 5.5: Benchmark of CNN architectures

               Execution time (s)   macro-avg Accuracy (%)
Baseline       465.95               52.29
MobileNetV2    85.32                52.57
Xception       245.85               64.71
VGG16          1364.90              61.43
ResnetV2       4024.74              62.57
InceptionV3    1127.68              53.14
DenseNet       1131.98              58.86

The results of the Xception network caught our attention. It performed well in execution time, having the second fastest training time of our tests, and the best accuracy result. However, a careful analysis of the training curve made us realize that this result was very likely due to overfitting. Nevertheless, taking into account its promising results, we decided to establish this network as our main network and maximize its robustness by trying to reduce the overfitting problem. The implementation of the Xception network in the movie domain and the efforts made to minimize the overfitting problem will be addressed in the following section.

5.3.4 Deep Learning Modeling and Optimization

After the selection of the CNN backbone, we proceeded to the construction of the model, followed by its optimization, and ended with the testing of our FER solution in the movie domain. Initially, the model was built using the Sequential linear stack of layers from Keras. After the input layer, the images are pre-processed in two Lambda layers: the first converts the grayscale image to RGB by tripling the image channels, and the second resizes the input images to 71x71. The pre-processed images are passed into the Xception model, whose output is reshaped into a one-dimensional vector in a Flatten layer. The classifier is composed of two Dense layers interleaved with a Dropout layer with a rate of 0.3. The designed model is illustrated in Figure 5.14.

Figure 5.14: Final model
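A sketch of the model in Figure 5.14, following the layer sequence described above; the width of the first Dense layer is not stated in the text, so that value is illustrative.

import tensorflow as tf

xception = tf.keras.applications.Xception(weights='imagenet',
                                          include_top=False,
                                          input_shape=(71, 71, 3))
xception.trainable = True   # the Xception block is fine-tuned (see Table 5.6)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(48, 48, 1)),
    # first Lambda layer: triple the channels so the grayscale input becomes RGB
    tf.keras.layers.Lambda(tf.image.grayscale_to_rgb),
    # second Lambda layer: resize the 48x48 inputs to the 71x71 expected by Xception
    tf.keras.layers.Lambda(lambda t: tf.image.resize(t, (71, 71))),
    xception,
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(256, activation='relu'),   # illustrative width
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(7, activation='softmax'),
])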

Model — hyper-parameters and data handling

In order to reduce the model overfitting, a phase of hyperparameter search and testing was conducted. Every parameter was tested with different values in order to find the optimal configuration. The Adam optimizer [69] used in the training process was chosen according to the evidence supporting its superior performance over SGD and RMSProp and for being the optimizer of choice in the original development of the Xception network. The final parameters and configurations are shown in Table 5.6.

Table 5.6: Final configuration of the network

Image preprocessing
- All images from both datasets were rescaled (each pixel value was divided by 255, normalizing the values to the range [0,1]).
- SFEW and FER2013 images are loaded as one-channel grayscale images.
- The colour channels in both SFEW and FER2013 datasets are tripled with the TensorFlow utility tf.image.grayscale_to_rgb, as Xception only accepts RGB images.
- All images are resized to 71x71.

Data augmentation (small data augmentation was applied in the training phase)
- Horizontal flip: randomly flip inputs horizontally.
- Random rotation: randomly rotates inputs up to 10 degrees.
- Width and height shift: randomly shifts width and height up to 10% of the original size.

Training batch size: 128
Training duration: 20 epochs
Optimizer: Adam, with the default initial values (learning rate of 0.001 and the default beta 1 and 2 parameters provided in the original study [69]).
Early stopping: after 3 epochs without improvement on the validation set accuracy the training phase is terminated.
Learning rate reduction: after 1 epoch without improvement on the validation set accuracy the learning rate is reduced by a factor of 0.1.
Weights initialization: ImageNet dataset.
Loss: categorical cross-entropy.
Transfer learning: the Xception block is set as trainable, meaning that its weights can be updated during training.
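A sketch of how the configuration in Table 5.6 can be expressed with Keras, assuming the model from the previous sketch and FER2013 arrays x_train, y_train, x_val, y_val already rescaled to [0, 1] and one-hot encoded as in the earlier snippets.

import tensorflow as tf

# Data augmentation applied only during training.
augmenter = tf.keras.preprocessing.image.ImageDataGenerator(
    horizontal_flip=True,     # random horizontal flip
    rotation_range=10,        # random rotation up to 10 degrees
    width_shift_range=0.1,    # width shift up to 10% of the original size
    height_shift_range=0.1)   # height shift up to 10% of the original size

callbacks = [
    # early stopping: 3 epochs without improvement on the validation accuracy
    tf.keras.callbacks.EarlyStopping(monitor='val_accuracy', patience=3),
    # learning rate reduction: factor of 0.1 after 1 epoch without improvement
    tf.keras.callbacks.ReduceLROnPlateau(monitor='val_accuracy', factor=0.1, patience=1),
]

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss='categorical_crossentropy', metrics=['accuracy'])

model.fit(augmenter.flow(x_train, y_train, batch_size=128),
          validation_data=(x_val, y_val),
          epochs=20, callbacks=callbacks)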

Training Phase — FER2013

Since the SFEW dataset has few samples, the approach we followed was to train the network in a large in-the-wild database of facial expressions and understand if a network trained in these conditions would have the ability to generalize to the film domain by testing it with SFEW. The database of facial expressions chosen was FER2013 given its size and the fact that all faces are aligned in the images. The accuracy and loss learning curves and the confusion matrix of the FER2013 validation set are illustrated in Figure 5.15.

Figure 5.15: Accuracy and loss learning curves and confusion matrix of the model when training with FER2013

The classification report containing the precision, recall and F1-score for each class is shown in Table 5.7.

Table 5.7: Precision, Recall and F1-score of FER2013 validation set

              precision (%)   recall (%)   f1-score (%)   support
Angry         60              62           61             467
Disgust       56              55           56             56
Fear          58              45           51             496
Happy         87              87           87             895
Sad           60              54           57             653
Surprise      79              80           80             416
Neutral       56              71           63             607

accuracy                                   68             3590
macro avg     65              65           65             3590
weighted avg  68              68           68             3590
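Reports such as Table 5.7 can be generated with scikit-learn; a minimal sketch, assuming the trained model and the validation arrays (x_val with normalized images, y_val with integer labels) from the earlier sketches.

import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

class_names = ['Angry', 'Disgust', 'Fear', 'Happy', 'Sad', 'Surprise', 'Neutral']

# Predicted class = index of the highest softmax output for each validation image.
y_pred = np.argmax(model.predict(x_val), axis=1)

print(classification_report(y_val, y_pred, target_names=class_names))
print(confusion_matrix(y_val, y_pred))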

Testing phase — SFEW

To understand whether the developed model is robust enough to adapt to a new context, we tested it with the SFEW dataset, since it contains faces of actors directly extracted from film frames. The results of the model retrained with FER2013 and tested on SFEW are shown in Figure 5.16.

Figure 5.16: Confusion matrix of testing with SFEW dataset

The classification report of the SFEW test set is shown in Table 5.8.

Table 5.8: Precision, Recall and F1-score of SFEW test set

              precision (%)   recall (%)   f1-score (%)   support
Angry         53              40           46             178
Disgust       29              10           14             52
Fear          36              31           33             78
Happy         63              82           71             184
Sad           6               1            1              161
Surprise      13              14           14             94
Neutral       22              49           31             144

accuracy                                   38             891
macro avg     32              32           30             891
weighted avg  35              38           34             891

Discussion

This first experiment managed to overcome the limitations of Xception observed when benchmarking the CNNs. Through a phase of hyperparameter search and testing, we were able to fine-tune the network and achieve an overall accuracy on FER2013 of 68%, which is within state-of-the-art values. Having achieved the first objective, we proceeded to the real test of the network by submitting it to images taken from films, and the results were not satisfactory, reaching an overall accuracy of only 38%. Intuitively, we tried to understand whether the dataset images could be too challenging. As such, and given the reduced size of the dataset, we did a manual search for defective or extremely poorly lit images and reduced the dataset from 891 to 830 images, a reduction of 6.8%. We performed the same tests but there were no noticeable improvements in the results. Another test that was discarded was training and testing with SFEW alone, since it does not have enough samples to achieve meaningful results.

Having noticed a slight overfitting of the network still present in the final part of the training phase, the next step was to try to improve the obtained results by addressing a problem detected in the exploratory phase of the data, the imbalance of FER2013, covered in the next section.

5.3.5 FER2013 dataset balancing

There are two common techniques for dealing with imbalanced data: oversampling the minority classes or training a model with class weights. The problem was addressed using the latter approach, which causes the model to "pay more attention" to examples from an under-represented class. The weight of each class can be calculated by dividing the total number of training samples by the number of classes multiplied by the number of samples of that class; the resulting values are presented in Table 5.9.

Table 5.9: Training class weights

               Anger   Disgust   Fear    Happy   Sad     Surprise   Neutral
Class weight   1.026   9.407     1.001   0.568   0.849   1.293      0.826
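A sketch of the weight computation, assuming y_train holds the integer FER2013 training labels; scikit-learn's 'balanced' heuristic implements exactly the formula described above.

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

classes = np.arange(7)

# weight_c = n_training_samples / (n_classes * n_samples_of_class_c)
weights = compute_class_weight(class_weight='balanced', classes=classes, y=y_train)
class_weight = dict(enumerate(weights))   # e.g. {0: 1.026, 1: 9.407, ..., 6: 0.826}

# The dictionary is then passed to Keras during training:
# model.fit(..., class_weight=class_weight)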

The results of the training phase with class weights are shown in Figure 5.17, and Table 5.10 presents the classification report on the validation set of FER2013.

Figure 5.17: Learning curve of the model when training with class weights

Table 5.10: Precision, Recall and F1-score of balanced FER2013 validation set

              precision (%)   recall (%)   f1-score (%)   support
Angry         57              59           58             467
Disgust       0               0            0              56
Fear          55              38           45             496
Happy         83              90           86             895
Sad           56              57           57             653
Surprise      85              72           78             416
Neutral       55              70           62             607

accuracy                                   66             3590
macro avg     56              55           55             3590
weighted avg  65              66           65             3590

From the learning curves it can be observed that the overfitting was indeed slightly reduced, but this reduction did not lead to better accuracy results. When tested with the SFEW dataset, the results were similar to those already reported, so this improvement did not have a substantial impact on the obtained results.

5.3.6 Reducing the dimensionality of the problem

Taking into account the weak results obtained, new hypotheses were explored with the aim of achieving more promising results. We started by trying to find a heuristic that could map the categorical emotions onto the valence-arousal space and thus allow us to use more complete databases, but the direct application of the findings discussed in Section 2.4 did not prove very promising. However, the gathered evidence indicates that there is an overlap of emotions in the affective space, so that they can be reduced without a significant loss of information, at least from a computational point of view. Thus, we propose a reduction in the dimensionality of the problem by reducing the number of emotions to be considered in affective analyses. We demonstrate the effectiveness of this approach firstly by selecting the top-4 performing emotions in the previous experiments, and secondly by selecting the clusters of emotions most clearly demarcated in the studies previously addressed.
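A sketch of how the reduced label sets used in the next two experiments can be derived from the FER2013 .csv; the helper function is ours and the file path is illustrative.

import pandas as pd

EMOTIONS = {0: 'Angry', 1: 'Disgust', 2: 'Fear', 3: 'Happy',
            4: 'Sad', 5: 'Surprise', 6: 'Neutral'}

def keep_emotions(df, names):
    # Keep only rows labeled with the given emotions and re-encode the labels
    # as consecutive integers so that the network sees a smaller set of classes.
    keep = sorted(code for code, name in EMOTIONS.items() if name in names)
    subset = df[df['emotion'].isin(keep)].copy()
    subset['emotion'] = subset['emotion'].map({old: new for new, old in enumerate(keep)})
    return subset

df = pd.read_csv('fer2013.csv')
top4 = keep_emotions(df, ['Angry', 'Happy', 'Surprise', 'Neutral'])   # top-4 experiment
clusters = keep_emotions(df, ['Angry', 'Happy', 'Neutral'])           # Angry represents the third cluster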

Selecting the top-4 performing emotions

The emotions that stood out in the previous tests were Happy, Surprise, Neutral and Angry, achieving accuracy scores of 87%, 80%, 71% and 62%, respectively. When training the network solely with these emotions, it was able to achieve a global accuracy of 83%, as shown in Table 5.11. The network learning curves are illustrated in Figure 5.18.

Figure 5.18: Learning curve of the model when training with the top-4 performing emotions

Table 5.11: Precision, Recall and F1-score of the top-4 performing emotions during training on FER2013

              precision (%)   recall (%)   f1-score (%)   support
Angry         77              74           75             467
Happy         90              88           89             895
Surprise      89              87           88             415
Neutral       75              81           78             607

accuracy                                   83             2384
macro avg     83              82           83             2384
weighted avg  83              83           83             2384

Analyzing each emotion, we can conclude that by decreasing the size of the problem the network was able to improve its performance. The confusion matrix is illustrated in Figure 5.19.

Figure 5.19: Validation confusion matrix of FER2013 top-4 emotions

When applied to SFEW, the network also demonstrated some improvements with the reduction of dimensionality, going from 38% to 47% in the global accuracy of the model. The results per class are available in Table 5.12. The confusion matrix of the SFEW test is pictured in Figure 5.20.

Table 5.12: Precision, Recall and F1-score of the top-4 performing emotions during testing on SFEW

              precision (%)   recall (%)   f1-score (%)   support
Angry         49              61           54             148
Happy         96              57           71             185
Surprise      29              46           36             129
Neutral       0               0            0              79

accuracy                                   47             541
macro avg     44              41           40             541
weighted avg  53              47           48             541

Figure 5.20: Confusion matrix of testing on SFEW

Selecting possible clustered emotions

Based on the evidence collected in Section 2.4, there are three clearly demarcated emotional clusters: Happy, Neutral and a third one composed of Angry, Sad, Fear and Disgust. This apparent clustering becomes a very important factor since it allows us to group data without losing essential information, at least from a computational point of view. Therefore, another test involving these three clusters was done. Angry was chosen as the representative of the third cluster given its promising results in the tests developed so far. The accuracy and loss learning curves are illustrated in Figure 5.21.

Figure 5.21: Learning curve of the model when training with the possible clusters of emotion

By concentrating only on these three emotions the network achieved a good result of 85% in global accuracy. The classification report with the Precision, Recall and F1-score per class is available in Table 5.13.

Table 5.13: Precision, Recall and F1-score with possible clustered emotions in validation phase

              precision (%)   recall (%)   f1-score (%)   support
Angry         81              78           80             467
Happy         92              90           91             895
Neutral       77              82           79             607

accuracy                                   85             1969
macro avg     83              83           83             1969
weighted avg  85              85           85             1969

Analyzing the confusion matrix illustrated in Figure 5.22, we can notice that Happy achieves a good accuracy score of 90%, while Neutral and Angry reach an accuracy score of 82% and 78%, respectively.

Figure 5.22: Angry, Happy and Neutral validation confusion matrix on FER2013

Testing the three-emotion network with the SFEW dataset achieved a best score of 64%. The classification report is shown in Table 5.14.

Table 5.14: Precision, Recall and F1-score with clustered emotions in SFEW

              precision (%)   recall (%)   f1-score (%)   support
Angry         51              90           65             148
Happy         93              52           67             185
Neutral       70              52           60             129

accuracy                                   64             462
macro avg     71              65           64             462
weighted avg  73              64           64             462

Unlike the validation set of FER2013, the emotion with the best performance in SFEW was Angry reaching an accuracy value of 90%. The confusion matrix is illustrated in Figure 5.23.

Figure 5.23: Angry, Happy and Neutral testing confusion matrix on SFEW

Discussion

Unsurprisingly, decreasing the number of classes in a classification task led to better overall results. However, there is a trade-off between the decrease in classes and the information that can be lost, so it is necessary to be very cautious when making decisions of this nature. We cannot pervert the psychological nature of emotions: from a human sciences point of view, it does not make sense to group somewhat disparate emotions like anger and disgust. However, the evidence indicates that it may be an alternative to overcome the current problems in the classification of emotions through audiovisual media, at least as long as there are no more comprehensive datasets.

The best results of the combination of training with FER2013 and testing with SFEW were obtained when the dimensionality reduction took place, so this may be a suitable solution for emotional analysis systems. The present research is nonetheless affected by faults and limitations, since the selected emotions are the most represented in the datasets, which may not happen in all cases.

5.3.7 Combining Facial Landmark masks with the original face image as input

One last hypothesis to improve the results of the model was a double-input approach, in which the original image and a facial mask formed by the facial landmarks of the face would be fed into the network together. Figure 5.24 illustrates an example of a facial mask obtained from a FER2013 image.

Figure 5.24: Facial mask example from FER2013

For this hypothesis, we developed the FER2013-FM dataset, a FER2013-based dataset of facial masks. The dataset was built with the open-source software face-alignment [12], a library previously described in Section 5.2. The library uses SoA two-dimensional and three-dimensional Face Alignment Networks (FAN) to detect facial landmarks and create digital recreations of the face. When applying the 2D-FAN method to the FER2013 dataset, it did not recognize any faces in 1192 of the 35,887 images (3.32% of the total number of images in the dataset). Inspecting the targeted images, we concluded that most of them were random images that did not contain faces, revealing another dataset fragility. Figure 5.25 illustrates some examples of FER2013 images that do not contain faces. However, due to time limitations it was not possible to adjust the network to learn from the fusion of these inputs, so the viability of this solution is still to be determined.
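A sketch of the mask generation step; the face-alignment calls follow the library's documented interface (the landmark-type enum is LandmarksType._2D in the 2020 releases and LandmarksType.TWO_D in later ones), while the rendering of the 68 points on a black canvas is our assumption of how a FER2013-FM mask can be drawn, and the file paths are illustrative.

import cv2
import numpy as np
import face_alignment

fa = face_alignment.FaceAlignment(face_alignment.LandmarksType._2D, flip_input=False)

image = cv2.cvtColor(cv2.imread('fer2013_face.png'), cv2.COLOR_BGR2RGB)
landmarks = fa.get_landmarks(image)       # returns None when no face is detected

if landmarks is None:
    print('No face detected; the image is flagged, as described above.')
else:
    # Draw the 68 detected landmark points on a black canvas to build the facial mask.
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    for point in landmarks[0]:
        cv2.circle(mask, (int(point[0]), int(point[1])), 1, 255, -1)
    cv2.imwrite('fer2013_face_mask.png', mask)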

Figure 5.25: FER2013 images not containing faces

Chapter 6

Conclusions


The work carried out during this dissertation focused on one main problem: the automatic computation of video-induced emotions using the actors' facial expressions. This chapter marks the end of this dissertation, first by summarizing our main contributions and results, and finally by proposing some future work ideas that we believe can be expanded from this work. We started this work by gathering the knowledge on human emotion conceptualization and classification from the inter-related areas of psychology, psychiatry and the social sciences. In Chapter 2, we identified the main models and theories for representing emotions, discrete and dimensional, along with their respective advantages and limitations. Then, we proceeded with the exploration of a theoretical modeling approach from facial expressions to emotions, and discussed a possible approximation between these two very distinct theories. The contextualization from the human and social sciences allowed us to foresee that the lack of unanimity in the classification of emotions would naturally have repercussions both in the databases and in the classification models, being one of the major bottlenecks of affective analysis work. Taking into account the objective of obtaining emotions through the facial expressions of film actors, in Chapter 3 we investigated the state-of-the-art approaches to the Facial Emotion Recognition (FER) task. After a brief historical overview, we learned that the current approaches use deep learning algorithms, namely Convolutional Neural Networks, which we examined in detail. Then, we made a detailed survey of the databases currently available for this purpose, since quality datasets are of extreme importance in data-driven tasks. From this survey we drew several conclusions. Firstly, there is a lack of training data both in terms of quantity and quality: there is no publicly available dataset that is large by current deep learning standards. Even within the available databases, there are several inconsistencies in the annotation (using different models of emotion, or even differing within the same theory of emotion) and image collection processes (illumination variation, occlusions, head-pose variation) that hinder progress in the FER field.

Additionally, the notion of ground truth applied to this context needs to be taken with a grain of salt, since classifying emotion is intrinsically biased to the degree that it reflects the perception of the emotional experience that the annotator is feeling at the moment. Finally, the existing datasets neglect to ensure diversity of age, gender or ethnicity, which is paramount since there is empirical proof that different societies interpret facial expressions in different ways [60]. At the end of the chapter we discussed some commercial and open-source FER solutions already available. Besides facial expressions, there are other film characteristics that can be used to estimate their emotional charge. Chapter 4 shed some light on this different approach, which uses a single movie characteristic or a combination of movie characteristics that have some kind of interrelation with human emotions. In fact, the fusion of several modalities such as sound, visual aspects and textual information is becoming a promising research direction and is further discussed in Section 6.2. Finally, we developed a comprehensive framework to test and validate the current state-of-the-art FER approaches applied to the movie domain in Chapter 5. We proposed a baseline architecture and developed various models with transfer learning techniques. After an initial benchmark, we fine-tuned the chosen model to better fit the most relevant datasets for the subject, training with FER2013 and testing with the movie-related dataset SFEW. During this phase we learned that the datasets have several flaws and limitations, such as the class imbalance and even some blank images that do not contain faces, and we worked on solutions to these particular problems. Additionally, we studied the hypothesis of a possible heuristic that could approximate the discrete and dimensional models and, based on parallel studies, we proposed a dimensionality reduction of the problem taking into account the apparent clustering of categorical emotions in the valence-arousal space, with promising results. Another hypothesis raised was the use of facial landmarks along with the original image as input to the networks, as suggested by some studies; however, due to time constraints, it was not possible to improve the network in order to achieve satisfactory results. Even so, we have developed a dataset from FER2013 with facial masks that can be used in future work to prove the effectiveness of this method.

6.1 Contributions

In conclusion, we can emphasize the following contributions:

• Literature review — Conducted a thorough literature review to understand the current state of the art of studies regarding emotion, facial emotion recognition, automatic emotion analysis in movies and how these subjects intersect.

• Comprehensive experimentation of deep learning FER approaches in movie domain — Designed and implemented several deep learning experiments regarding FER approaches in the movie domain, specifically by using FER2013 and SFEW datasets.

• Clustering of emotions proposal — Proposed a new way of looking at emotions computationally, given the evidence of the formation of possible clusters of emotions in the valence-arousal emotion space.

• FER2013-FM dataset development — Developed a parallel dataset from FER2013 containing the facial masks of the respective FER2013 faces.

6.2 Future work

Affective computing in movies is a challenging task and needs to be addressed in close collaboration between psychology, sociology and computer vision sciences. This section discusses some promising avenues that can be taken from this work.

High-quality dataset

The lack of highly reliable, publicly available data is the major issue that we have encountered. By combining FER with the movie domain the problem is exacerbated, since FER datasets are traditionally annotated with the discrete model of emotion while movie datasets are usually annotated using the dimensional model of emotion. It thus becomes essential to create a database that is of high quality and tries to solve the various issues reported throughout this dissertation: large scale, with consistent annotations, balanced, with attention to illumination and head-pose variation and occlusions, and highly diverse, covering different cultural and social backgrounds. Since emotion is a continuous process, developing a video dataset to be used in neural networks capable of learning temporal relations can also be advantageous.

Discrete-dimensional heuristic

The development of a heuristic that allows a reliable mapping between the two main models of emotion can help overcome the current dataset compatibility problems and further improve cross-domain generalizability.

Other deep learning approaches

Deep learning is rapidly expanding and there are new techniques that can be applied to this problem. Transductive learning techniques, for instance, could be useful in adapting the same task to different domains. Different types of deep neural networks may also be considered. Siamese architectures [11], Generative Adversarial Networks [46] and Neural Network Ensembles [138] are recent approaches that have shown interesting results in the literature.

Multimodal approach

As addressed in Chapter 4, an approach that combines multiple audiovisual modalities can provide additional information and further improve the reliability of the models. In addition to basic features that can be extracted from films, such as color, sound and text, advanced features such as motion and the relationships between movie characters may also bring interesting information regarding the emotions that a film conveys. Finally, the use of biological signals such as skin conductance, heart rate or the electrical activity of the brain can bring new information about the emotional response of individuals in a truly universal and objective way. During the literature review phase, a survey of multimodal datasets and relevant studies was compiled and can be consulted in Appendix D.

Appendix A

FER Datasets Source

Table A.1: FER datasets sources

Dataset              Data Source
AFEW [29]            https://cs.anu.edu.au/few/AFEW.html
AFEW-VA [73]         https://ibug.doc.ic.ac.uk/resources/afew-va-database/
AffectNet [3]        http://mohammadmahoor.com/affectnet/
Aff-Wild2 [72]       https://ibug.doc.ic.ac.uk/resources/aff-wild2/
AM-FED+ [97]         [email protected]
CK+ [89]             http://www.pitt.edu/ emotion/ck-spread.htm
EmotioNet [8]        http://cbcsl.ece.ohio-state.edu/EmotionNetChallenge/index.html
FER2013 [45]         https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge/data
KDEF [91]            http://www.emotionlab.se/resources/kdef
JAFFE [92]           http://www.kasrl.org/jaffe.html
MMI [109, 141]       http://mmifacedb.eu/
OULU-CASIA [160]     http://www.cse.oulu.fi/CMV/Downloads/Oulu-CASI
RAF-DB [85, 83]      http://www.whdeng.cn/RAF/model1.html
SFEW [29]            https://cs.anu.edu.au/few/AFEW.html

Appendix B

Movies Present in AFEW and SFEW Datasets


Table B.1: Movie sources for the SFEW and AFEW databases

Movie
21
About a Boy
American History X
Aviator
Black Swan
Did You Hear About The Morgans?
Dumb and Dumber
When Harry met Sally
Four weddings and a funeral
Frost/Nixon
Harry Potter and The Philosopher Stone
Harry Potter and The Chamber of Secrets
Harry Potter and The Goblet of Fire
Harry Potter and The Half Blood Prince
Harry Potter and The Order of Phoenix
Harry Potter and The Prisoners of Azkaban
Informant
It's Complicated
I Think I Love My Wife
Kings Speech
Little Manhattan
Notting Hill
One Flew Over Cuckoo's Nest
Pretty In Pink
Pretty Woman
Remember Me
Run Away Bride
Saw 3D
Serendipity
Social Network
Terminal
Term of Endearment
The Hangover
The Devil Wears Prada
Town
Valentine Day
Unstoppable
You've got mail

Appendix C

API response examples from commercial solutions

C.1 Amazon Rekognition

{ "FaceDetails": [ { "AgeRange": { "High": 43, "Low": 26 }, "Beard": { "Confidence": 97.48941802978516, "Value": true }, "BoundingBox": { "Height": 0.6968063116073608, "Left": 0.26937249302864075, "Top": 0.11424895375967026, "Width": 0.42325547337532043 }, "Confidence": 99.99995422363281, "Emotions": [ { "Confidence": 0.042965151369571686, "Type": "DISGUSTED" }, {


"Confidence": 0.002022328320890665, "Type": "HAPPY" }, { "Confidence": 0.4482877850532532, "Type": "SURPRISED" }, { "Confidence": 0.007082826923578978, "Type": "ANGRY" }, { "Confidence": 0, "Type": "CONFUSED" }, { "Confidence": 99.47616577148438, "Type": "CALM" }, { "Confidence": 0.017732391133904457, "Type": "SAD" } ], "Eyeglasses": { "Confidence": 99.42405700683594, "Value": false }, "EyesOpen": { "Confidence": 99.99604797363281, "Value": true }, "Gender": { "Confidence": 99.722412109375, "Value": "Male" }, "Landmarks": [ { "Type": "eyeLeft", "X": 0.38549351692199707, C.1 Amazon Rekognition 83

"Y": 0.3959200084209442 }, { "Type": "eyeRight", "X": 0.5773905515670776, "Y": 0.394561767578125 }, { "Type": "mouthLeft", "X": 0.40410104393959045, "Y": 0.6479480862617493 }, { "Type": "mouthRight", "X": 0.5623446702957153, "Y": 0.647117555141449 }, { "Type": "nose", "X": 0.47763553261756897, "Y": 0.5337067246437073 }, { "Type": "leftEyeBrowLeft", "X": 0.3114689588546753, "Y": 0.3376390337944031 }, { "Type": "leftEyeBrowRight", "X": 0.4224424660205841, "Y": 0.3232649564743042 }, { "Type": "leftEyeBrowUp", "X": 0.36654090881347656, "Y": 0.3104579746723175 }, { "Type": "rightEyeBrowLeft", "X": 0.5353175401687622, 84 API response examples from commercial solutions

"Y": 0.3223199248313904 }, { "Type": "rightEyeBrowRight", "X": 0.6546239852905273, "Y": 0.3348073363304138 }, { "Type": "rightEyeBrowUp", "X": 0.5936762094497681, "Y": 0.3080498278141022 }, { "Type": "leftEyeLeft", "X": 0.3524211347103119, "Y": 0.3936865031719208 }, { "Type": "leftEyeRight", "X": 0.4229775369167328, "Y": 0.3973258435726166 }, { "Type": "leftEyeUp", "X": 0.38467878103256226, "Y": 0.3836822807788849 }, { "Type": "leftEyeDown", "X": 0.38629674911499023, "Y": 0.40618783235549927 }, { "Type": "rightEyeLeft", "X": 0.5374732613563538, "Y": 0.39637991786003113 }, { "Type": "rightEyeRight", "X": 0.609208345413208, C.1 Amazon Rekognition 85

"Y": 0.391626238822937 }, { "Type": "rightEyeUp", "X": 0.5750962495803833, "Y": 0.3821527063846588 }, { "Type": "rightEyeDown", "X": 0.5740782618522644, "Y": 0.40471214056015015 }, { "Type": "noseLeft", "X": 0.4441811740398407, "Y": 0.5608476400375366 }, { "Type": "noseRight", "X": 0.5155643820762634, "Y": 0.5569332242012024 }, { "Type": "mouthUp", "X": 0.47968366742134094, "Y": 0.6176465749740601 }, { "Type": "mouthDown", "X": 0.4807897210121155, "Y": 0.690782368183136 }, { "Type": "leftPupil", "X": 0.38549351692199707, "Y": 0.3959200084209442 }, { "Type": "rightPupil", "X": 0.5773905515670776, 86 API response examples from commercial solutions

"Y": 0.394561767578125 }, { "Type": "upperJawlineLeft", "X": 0.27245330810546875, "Y": 0.3902156949043274 }, { "Type": "midJawlineLeft", "X": 0.31561678647994995, "Y": 0.6596118807792664 }, { "Type": "chinBottom", "X": 0.48385748267173767, "Y": 0.8160444498062134 }, { "Type": "midJawlineRight", "X": 0.6625112891197205, "Y": 0.656606137752533 }, { "Type": "upperJawlineRight", "X": 0.7042999863624573, "Y": 0.3863988518714905 } ], "MouthOpen": { "Confidence": 99.83820343017578, "Value": false }, "Mustache": { "Confidence": 72.20288848876953, "Value": false }, "Pose": { "Pitch": -4.970901966094971, "Roll": -1.4911699295043945, "Yaw": -10.983647346496582 C.2 Google Vision API 87

}, "Quality": { "Brightness": 73.81391906738281, "Sharpness": 86.86019134521484 }, "Smile": { "Confidence": 99.93638610839844, "Value": false }, "Sunglasses": { "Confidence": 99.81478881835938, "Value": false } } ] }

C.2 Google Vision API

"faceAnnotations": [ { "boundingPoly": { "vertices": [ { "x": 669, "y": 324 }, ... ] }, "fdBoundingPoly": { ... }, "landmarks": [ { "type": "LEFT_EYE", "position": { "x": 692.05646, "y": 372.95868, "z": -0.00025268539 88 API response examples from commercial solutions

} }, ... ], "rollAngle": 0.21619819, "panAngle": -23.027969, "tiltAngle": -1.5531756, "detectionConfidence": 0.72354823, "landmarkingConfidence": 0.20047489, "joyLikelihood": "POSSIBLE", "sorrowLikelihood": "VERY_UNLIKELY", "angerLikelihood": "VERY_UNLIKELY", "surpriseLikelihood": "VERY_UNLIKELY", "underExposedLikelihood": "VERY_UNLIKELY", "blurredLikelihood": "VERY_UNLIKELY", "headwearLikelihood": "VERY_LIKELY" }, "landmarkAnnotations": [ { "mid": "//0c7zy", "description": "Petra", "score": 0.5403372, "boundingPoly": { "vertices": [ { "x": 153, "y": 64 }, ] }, "locations": [ { "latLng": { "latitude": 30.323975, "longitude": 35.449361 } } ] } ] Appendix D

Multimodal Movie Datasets and Related Studies

During the literature review, a survey of multimodal datasets was compiled and is shown in Table D.1. Some relevant studies associated with these databases were also collected and are presented in Table D.2. Multimodal datasets may include physiological data, where EEG, ECG, GSR and PPG are the acronyms for Electroencephalography, Electrocardiogram, Galvanic Skin Response and Photoplethysmogram, respectively. Regarding the related studies table, PSD, HRV and MFCC stand for Power Spectral Density, Heart Rate Variability and Mel-frequency cepstral coefficients, respectively. Regarding classifiers, SVM, SVR, kNN, RBF and HMM mean Support Vector Machine, Support Vector Regression, k-Nearest Neighbors, Radial Basis Function and Hidden Markov Model, respectively.

Table D.1: Multimodal movie datasets. Datasets surveyed: FILMSTIM, MAHNOB-HCI, EMDB [16], LIRIS-ACCEDE [7], AMIGOS, DREAMER [63] and MELD, characterized by the columns Dataset, Description, N, Data, No. Subjects, Bio analysis, FER analysis, Labeling Method and Baseline.

Table D.2: Some multimodal relevant studies, characterized by the columns Study, Used Modalities, Extracted Features, Classifier and Evaluation Metric, and grouped by the dataset used (AMIGOS, DREAMER, LIRIS-ACCEDE, MAHNOB-HCI and other datasets). Studies surveyed: Chu et al., Zhu et al., Wang et al., Baveye et al., Canini et al., Stamos et al., Timar et al., Koelstra et al., Miranda et al., Alasaarela et al., Soleymani et al., Srivastava et al. and Malandrakis et al.

References

[1] Algorithmia. Emotion recognition in the wild via convolutional neural networks and mapped binary patterns, 2020 (accessed August 7, 2020). https://algorithmia. com/algorithms/deeplearning/EmotionRecognitionCNNMBP.

[2] Algorithmia. Pricing, 2020 (accessed August 7, 2020). https://algorithmia.com/ pricing.

[3] Behzad Hasani Ali Mollahosseini and Mohammad H. Mahoor. Affectnet: A new database for facial expression, valence, and arousal computation in the wild. IEEE Transactions on Affective Computing, 2017.

[4] N. Amir and S. Ron. Towards an automatic classification of emotions in speech. In ICSLP, 1998.

[5] Microsoft Azure. Pricing - face api, 2020 (accessed August 7, 2020). https: //azure.microsoft.com/en-us/pricing/details/cognitive-services/ face-api/.

[6] Y. Baveye, E. Dellandréa, C. Chamaret, and L. Chen. Deep learning vs. kernel methods: Performance for emotion prediction in videos. In 2015 International Conference on Affec- tive Computing and Intelligent Interaction (ACII), pages 77–83, 2015.

[7] Yoann Baveye, Emmanuel Dellandréa, Christel Chamaret, and Liming Chen. Liris-accede: A video database for affective content analysis. Affective Computing, IEEE Transactions on, 6:43–55, 01 2015.

[8] C. F. Benitez-Quiroz, R. Srinivasan, and A. M. Martinez. Emotionet: An accurate, real-time algorithm for the automatic annotation of a million facial expressions in the wild. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5562–5570, 2016.

[9] M. Bradley and P. Lang. Affective norms for english words (anew): Instruction manual and affective ratings. 1999.

[10] Ran Breuer and Ron Kimmel. A deep learning perspective on the origin of facial expres- sions, 2017.

[11] Jane Bromley, James Bentz, Leon Bottou, Isabelle Guyon, Yann Lecun, Cliff Moore, Ed- uard Sackinger, and Rookpak Shah. Signature verification using a "siamese" time delay neural network. International Journal of Pattern Recognition and Artificial Intelligence, 7:25, 08 1993.


[12] Adrian Bulat and Georgios Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In International Conference on Computer Vision, 2017.

[13] Darwin C. The expression of the emotions in man and animals. John Murray, London, 1872.

[14] John Cacioppo, Gary Berntson, Jeff Larsen, Kirsten Poehlmann, and Tiffany Ito. The Psy- chophysiology of Emotion, pages 173–191. 01 2000.

[15] L. Canini, S. Benini, and R. Leonardi. Affective recommendation of movies based on selected connotative features. IEEE Transactions on Circuits and Systems for Video Tech- nology, 23(4):636–647, 2013.

[16] Sandra Carvalho, Jorge Leite, Santiago Galdo-Alvarez, and Oscar Goncalves. The emo- tional movie database (emdb): A self-report and psychophysiological study. Applied psy- chophysiology and biofeedback, 37, 07 2012.

[17] Vladimir Chernykh, Grigoriy Sterling, and Pavel Prihodko. Emotion recognition from speech with recurrent neural networks. 01 2017.

[18] François Chollet. Xception: Deep learning with depthwise separable convolutions, 2017.

[19] E. Chu and D. Roy. Audio-visual sentiment analysis for learning emotional arcs in movies. In 2017 IEEE International Conference on Data Mining (ICDM), pages 829–834, Nov 2017.

[20] Eric Chu and Deb Roy. Audio-visual sentiment analysis for learning emotional arcs in movies, 2017.

[21] Jeffrey F. Cohn, Itir Onal Ertugrul, Wen-Sheng Chu, Jeffrey M. Girard, Laszlo A. Jeni, and Zakia Hammal. Affective facial computing: Generalizability across domains. In Multi- modal Behavior Analysis in the Wild, pages 407 – 441. June 2019.

[22] Daphne Cornelisse. An intuitive guide to convolutional neural networks, Jun 12, 2019 (accessed September 2, 2020). https://www.freecodecamp.org/news/ an-intuitive-guide-to-convolutional-neural-networks-260c2de0a050/.

[23] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume 1, pages 886–893 vol. 1, 2005.

[24] Deepface. Deepface github page, September 22, 2020 (accessed September 22, 2020). https://github.com/serengil/deepface.

[25] Yashar Deldjoo, Mehdi Elahi, Paolo Cremonesi, Franca Garzotto, Pietro Piazzolla, and Massimo Quadrana. Content-based video recommendation system based on stylistic visual features. Journal on Data Semantics, 5(2):99–113, Jun 2016.

[26] F. Dellaert, T. Polzin, and A. Waibel. Recognizing emotion in speech. In Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96, volume 3, pages 1970–1973 vol.3, Oct 1996.

[27] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.

[28] Arden Dertat. Applied deep learning - part 4: Convolutional neural networks, Novem- ber 8, 2017 (accessed September 5, 2020). https://towardsdatascience.com/ applied-deep-learning-part-4-convolutional-neural-networks-584bc134c1e2.

[29] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon. Acted facial expressions in the wild database, 2011.

[30] A. Dhall, R. Goecke, S. Lucey, and T. Gedeon. Collecting large, richly annotated facial- expression databases from movies. IEEE MultiMedia, 19(3):34–41, 2012.

[31] Abhinav Dhall, Roland Goecke, Jyoti Joshi, Jesse Hoey, and Tom Gedeon. Emotiw 2016: Video and group-level emotion recognition challenges. In Proceedings of the 18th ACM International Conference on Multimodal Interaction, ICMI ’16, page 427–432, New York, NY, USA, 2016. Association for Computing Machinery.

[32] Tuomas Eerola and Jonna Vuoskoski. A comparison of the discrete and dimensional models of emotion in music. Psychology of Music, 01 2011.

[33] P. Ekman and W. Friesen. Facial action coding system: a technique for the measurement of facial movement. 1978.

[34] Keltner D. Ekman P. Universal facial expressions of emotion. CalifMental Health Res Digest 8(4), pages 151–158, 1970.

[35] A. M. Ertugrul and P. Karagoz. Movie genre classification from plot summaries using bidirectional lstm. In 2018 IEEE 12th International Conference on Semantic Computing (ICSC), pages 248–251, Jan 2018.

[36] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2):303– 338, June 2010.

[37] Face-api.js. Face-api.js github page, April 22, 2020 (accessed September 7, 2020). https: //github.com/justadudewhohacks/face-api.js/.

[38] H. Ferdinando, T. Seppänen, and E. Alasaarela. Comparing features from ecg pattern and hrv analysis for emotion recognition system. In 2016 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology (CIBCB), pages 1–6, 2016.

[39] W. Friesen and P. Ekman. Emfacs-7: Emotional facial action coding system. 1983.

[40] Karl Pearson F.R.S. Liii. on lines and planes of closest fit to systems of points in space. The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2(11):559–572, 1901.

[41] Patrick Gebhard. Alma: a layered model of affect. pages 29–36, 01 2005.

[42] Aurelien Geron. Hands-on machine learning with Scikit-Learn and TensorFlow : concepts, tools, and techniques to build intelligent systems. O’Reilly Media, Sebastopol, CA, 2017.

[43] Theodoros Giannakopoulos, Aggelos Pikrakis, and S. Theodoridis. A dimensional approach to emotion recognition of speech from movies. pages 65–68, 04 2009.

[44] Ian J. Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, Cambridge, MA, USA, 2016. http://www.deeplearningbook.org.

[45] Ian J. Goodfellow, Dumitru Erhan, Pierre Luc Carrier, Aaron Courville, Mehdi Mirza, Ben Hamner, Will Cukierski, Yichuan Tang, David Thaler, Dong-Hyun Lee, and et al. Challenges in representation learning: A report on three machine learning contests, 2013.

[46] Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks, 2014.

[47] Margaret R. Graver. Cicero on the Emotions: Tusculan Disputations 3 and 4. University of Chicago Press, 2002.

[48] D. Hamester, P. Barros, and S. Wermter. Face expression recognition with a 2-channel convolutional neural network. In 2015 International Joint Conference on Neural Networks (IJCNN), pages 1–8, 2015.

[49] Tal Hassner, Shai Harel, Eran Paz, and Roee Enbar. Effective face frontalization in unconstrained images, 2014.

[50] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.

[51] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks, 2016.

[52] Hee Lin Wang and Loong-Fah Cheong. Affective understanding in film. IEEE Transactions on Circuits and Systems for Video Technology, 16(6):689–704, 2006.

[53] Thorsten Hoeser and Claudia Kuenzer. Object detection and image segmentation with deep learning on earth observation data: A review-part i: Evolution and recent trends. Remote Sensing, 12, 05 2020.

[54] Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359 – 366, 1989.

[55] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks, 2018.

[56] Yunxin Huang, Fei Chen, Shaohe Lv, and Xiaodong Wang. Facial expression recognition: A survey. Symmetry, 11:1189, 09 2019.

[57] Doaa Mohey El-Din Mohamed Hussein. A survey on sentiment analysis challenges. Jour- nal of King Saud University - Engineering Sciences, 30(4):330 – 338, 2018.

[58] integrate.ai. Transfer learning explained, Aug 29, 2018 (accessed September 5, 2020). https://medium.com/the-official-integrate-ai-blog/transfer-learning-explained-7d275c1e34e2.

[59] J. R. J. Fontaine, K. R. Scherer, E. B. Roesch, and P. C. Ellsworth. The world of emotions is not two-dimensional. Psychological Science, 18:1050–1057, 2007.

[60] Rachael E. Jack, Oliver G. B. Garrod, Hui Yu, Roberto Caldara, and Philippe G. Schyns. Facial expressions of emotion are not culturally universal. Proceedings of the National Academy of Sciences, 109(19):7241–7244, 2012.

[61] Pawan Jain. Complete guide of activation functions, Jun 12, 2019 (accessed September 2, 2020). https://towardsdatascience.com/complete-guide-of-activation-functions-34076e95d044.

[62] J. J. Prinz. Gut reactions: a perceptual theory of emotion. Oxford University Press, Oxford/New York, 2004.

[63] S. Katsigiannis and N. Ramzan. Dreamer: A database for emotion recognition through eeg and ecg signals from wireless low-cost off-the-shelf devices. IEEE Journal of Biomedical and Health Informatics, 22(1):98–107, 2018.

[64] V. Kazemi and J. Sullivan. One millisecond face alignment with an ensemble of regression trees. 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 1867–1874, 2014.

[65] Stéphanie Khalfa, Mathieu Roy, Pierre Rainville, Simone Dalla Bella, and Isabelle Peretz. Role of tempo entrainment in psychophysiological differentiation of happy and sad music? International Journal of Psychophysiology, 68:17–26, 05 2008.

[66] Bo-Kyeong Kim, Hwaran Lee, Jihyeon Roh, and Soo-Young Lee. Hierarchical committee of deep cnns with exponentially-weighted decision fusion for static facial expression recognition. In Proceedings of the 2015 ACM on International Conference on Multimodal Interaction, ICMI ’15, page 427–434, New York, NY, USA, 2015. Association for Computing Machinery.

[67] D. Kim, S. Lee, and Y. Cheong. Predicting emotion in movie scripts using deep learning. In 2018 IEEE International Conference on Big Data and Smart Computing (BigComp), pages 530–532, Jan 2018.

[68] D. H. Kim, W. J. Baddar, J. Jang, and Y. M. Ro. Multi-objective based spatio-temporal feature representation learning robust to expression intensity variations for facial expression recognition. IEEE Transactions on Affective Computing, 10(2):223–236, 2019.

[69] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization, 2017.

[70] Y. Ko, I. Hong, H. Shin, and Y. Kim. Construction of a database of emotional speech using emotion sounds from movies and dramas. In 2017 International Conference on Information and Communications (ICIC), pages 266–267, June 2017.

[71] Sander Koelstra and Ioannis Patras. Fusion of facial expressions and eeg for implicit affective tagging. Image and Vision Computing, 31:164–174, 02 2013.

[72] Dimitrios Kollias and Stefanos Zafeiriou. Aff-wild2: Extending the aff-wild database for affect recognition, 2018.

[73] Jean Kossaifi, Georgios Tzimiropoulos, Sinisa Todorovic, and Maja Pantic. Afew-va database for valence and arousal estimation in-the-wild. Image and Vision Computing, 65:23–36, 2017. Multimodal Sentiment Analysis and Mining in the Wild, Image and Vision Computing.

[74] U. Krcadinac, P. Pasquier, J. Jovanovic, and V. Devedzic. Synesketch: An open source library for sentence-based emotion recognition. IEEE Transactions on Affective Computing, 4(3):312–325, 2013.

[75] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. Imagenet classification with deep convolutional neural networks. Neural Information Processing Systems, 25, 01 2012.

[76] Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. Imagenet classification with deep convolutional neural networks. Neural Information Processing Systems, 25, 01 2012.

[77] Davor Kukolja, Siniša Popović, Marko Horvat, Bernard Kovač, and Krešimir Ćosić. Comparative analysis of emotion estimation methods based on physiological measurements for real-time applications. International Journal of Human-Computer Studies, 72(10):717–727, 2014.

[78] Vincent Labatut and Xavier Bost. Extraction and analysis of fictional character networks: A survey. ACM Computing Surveys, 52:89, 09 2019.

[79] Agnieszka Landowska. Towards new mappings between emotion representation models. Applied Sciences, 8:274, 02 2018.

[80] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to docu- ment recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[81] Lingjun Li, Yali Peng, Guoyong Qiu, Zengguo Sun, and Shigang Liu. A survey of virtual sample generation technology for face recognition. Artificial Intelligence Review, 50, 01 2017.

[82] Shan Li and Weihong Deng. Deep facial expression recognition: A survey, 2018.

[83] Shan Li and Weihong Deng. Reliable crowdsourcing and deep locality-preserving learning for unconstrained facial expression recognition. IEEE Transactions on Image Processing, 28(1):356–370, 2019.

[84] Shan Li and Weihong Deng. A deeper look at facial expression dataset bias. IEEE Trans- actions on Affective Computing, page 1–1, 2020.

[85] Shan Li, Weihong Deng, and JunPing Du. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2584–2593. IEEE, 2017.

[86] Yong Li, Jiabei Zeng, Shiguang Shan, and Xilin Chen. Occlusion aware facial expression recognition using cnn with attention mechanism. IEEE Transactions on Image Processing, PP:1–1, 12 2018.

[87] Zhe Liu, Anbang Xu, Yufan Guo, Jalal Mahmud, Haibin Liu, and Rama Akkiraju. Seemo: A computational approach to see emotions. pages 1–12, 04 2018.

[88] Martin Loiperdinger and Bernd Elzer. Lumiere’s arrival of the train: Cinema’s founding myth. The Moving Image, 4(1):89–118, 2004.

[89] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews. The extended cohn-kanade dataset (ck+): A complete dataset for action unit and emotion-specified expression. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, pages 94–101, 2010.

[90] Sergej Lugovic, Ivan Dunđer, and Marko Horvat. Techniques and applications of emotion recognition in speech. 05 2016.

[91] D. Lundqvist, A. Flykt, and A. Öhman. The karolinska directed emotional faces - kdef, 1998.

[92] M. Lyons, S. Akamatsu, M. Kamachi, and J. Gyoba. Coding facial expressions with gabor wavelets. In Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition, pages 200–205, 1998.

[93] M. K. Greenwald, E. W. Cook, and P. J. Lang. Affective judgment and psychophysiological response: Dimensional covariation in the evaluation of pictorial stimuli. Journal of Psychophysiology, 3(1):51–64, 1989.

[94] N. Malandrakis, A. Potamianos, G. Evangelopoulos, and A. Zlatintsi. A supervised approach to movie emotion tracking. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2376–2379, 2011.

[95] Muharram Mansoorizadeh and Nasrolah Charkari. Multimodal information fusion application to human emotion recognition from face and speech. Multimedia Tools Appl., 49:277–297, 08 2010.

[96] Martin Koestinger, Paul Wohlhart, Peter M. Roth, and Horst Bischof. Annotated Facial Landmarks in the Wild: A Large-scale, Real-world Database for Facial Landmark Localization. In Proc. First IEEE International Workshop on Benchmarking Facial Image Analysis Technologies, 2011.

[97] D. McDuff, M. Amr, and R. Kaliouby. Am-fed+: An extended dataset of naturalistic and spontaneous facial expressions collected in everyday settings. IEEE Transactions on Affective Computing, 2018.

[98] A. Mehrabian. Pleasure-arousal-dominance: a general framework for describing and measuring individual differences in temperament. Current Psychology, 14(4):261–292, 1996.

[99] A. Mehrabian and S. R. Ferris. Inference of attitudes from nonverbal communication in two channels. Journal of Consulting Psychology, 31(3):248–252, 1967.

[100] Debin Meng, Xiaojiang Peng, Kai Wang, and Yu Qiao. Frame attention networks for facial expression recognition in videos, 2019.

[101] Microsoft. Pricing - media services | microsoft azure, 2020 (accessed August 7, 2020). https://azure.microsoft.com/en-us/pricing/details/media-services/#analytics.

[102] Shervin Minaee and Amirali Abdolrashidi. Deep-emotion: Facial expression recognition using attentional convolutional network, 2019.

[103] Juan Abdon Miranda-Correa, Mojtaba Khomami Abadi, Nicu Sebe, and Ioannis Patras. Amigos: A dataset for affect, personality and mood research on individuals and groups, 2017.

[104] A. Mollahosseini, D. Chan, and M. H. Mahoor. Going deeper in facial expression recognition using deep neural networks. In 2016 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 1–10, 2016.

[105] H. Ng and S. Winkler. A data-driven approach to cleaning large face datasets. In 2014 IEEE International Conference on Image Processing (ICIP), pages 343–347, 2014.

[106] Juan Ortega, Patrick Cardinal, and Alessandro Koerich. Emotion recognition using fusion of audio and video features. pages 3847–3852, 10 2019.

[107] Andrew Ortony, Gerald Clore, and Allan Collins. The Cognitive Structure of Emotions. Cambridge University Press, 1988.

[108] P. Ekman. An argument for basic emotions. Cognition and Emotion, 6(3–4):169–200, 1992.

[109] M. Pantic, M. Valstar, R. Rademaker, and L. Maat. Web-based database for facial expression analysis. In 2005 IEEE International Conference on Multimedia and Expo, 5 pp., 2005.

[110] Omkar M. Parkhi, Andrea Vedaldi, and Andrew Zisserman. Deep face recognition. In British Machine Vision Conference, 2015.

[111] Ling Peng, Geng Cui, Mengzhou Zhuang, and Chunyu Li. What do seller manipulations of online product reviews mean to consumers? 2014.

[112] Soujanya Poria, Devamanyu Hazarika, Navonil Majumder, Gautam Naik, Erik Cambria, and Rada Mihalcea. Meld: A multimodal multi-party dataset for emotion recognition in conversations, 2019.

[113] Jonathan Posner, James A. Russell, and Bradley S. Peterson. The circumplex model of affect: An integrative approach to affective neuroscience, cognitive development, and psychopathology. Development and Psychopathology, 17(3):715–734, 2005.

[114] Christopher Pramerdorfer and Martin Kampel. Facial expression recognition using convo- lutional neural networks: State of the art, 2016.

[115] Amazon Rekognition. Pricing, 2020 (accessed August 7, 2020). https://aws.amazon.com/rekognition/pricing.

[116] Monika Riegel, Małgorzata Wierzba, Marek Wypych, Łukasz Żurawski, Katarzyna Jednoróg, Anna Grabowska, and Artur Marchewka. Nencki affective word list (nawl): the cultural adaptation of the berlin affective word list–reloaded (bawl-r) for polish. Behavior Research Methods, 01 2015.

[117] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organization in the brain. Psychological Review, 65(6):386–408, 1958.

[118] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, Alexander C. Berg, and Li Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.

[119] James Russell. A circumplex model of affect. Journal of Personality and Social Psychology, 39:1161–1178, 12 1980.

[120] C. Saarni. Development of emotional competence. New York: Guilford Press, 1999.

[121] Christos Sagonas, Epameinondas Antonakos, Georgios Tzimiropoulos, Stefanos Zafeiriou, and Maja Pantic. 300 faces in-the-wild challenge: database and results. Image and Vision Computing, 47, 01 2016.

[122] Snehanshu Saha, Nithin Nagaraj, Archana Mathur, and Rahul Yedida. Evolution of novel activation functions in neural network training with applications to classification of exoplanets, 2019.

[123] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks, 2018.

[124] Alexandre Schaefer, Frédéric Nils, Xavier Sanchez, and Pierre Philippot. Assessing the effectiveness of a large database of emotion-eliciting films: A new tool for emotion researchers. Cognition and Emotion, 24(7):1153–1172, 2010.

[125] Face++ Cognitive Services. Pricing details, 2020 (accessed August 7, 2020). https://www.faceplusplus.com/v2/pricing-details/#api_1.

[126] Caifeng Shan, Shaogang Gong, and Peter Mcowan. Beyond facial expressions: Learning human emotion from body gestures. 01 2007.

[127] Caifeng Shan, Shaogang Gong, and Peter W. McOwan. Facial expression recognition based on local binary patterns: A comprehensive study. Image and Vision Computing, 27(6):803 – 816, 2009.

[128] Zhiguo Shi, Junming Wei, Zhiliang Wang, Jun Tu, and Qiao Zhang. Affective transfer computing model based on attenuation emotion mechanism. Journal on Multimodal User Interfaces, 5, 03 2011.

[129] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition, 2014.

[130] M. Soleymani, J. J. M. Kierkels, G. Chanel, and T. Pun. A bayesian framework for video affective representation. In 2009 3rd International Conference on Affective Computing and Intelligent Interaction and Workshops, pages 1–7, 2009.

[131] M. Soleymani, J. Lichtenauer, T. Pun, and M. Pantic. A multimodal database for affect recognition and implicit tagging. IEEE Transactions on Affective Computing, 3(1):42–55, 2012.

[132] T. Song, W. Zheng, C. Lu, Y. Zong, X. Zhang, and Z. Cui. Mped: A multi-modal physio- logical emotion database for discrete emotion recognition. IEEE Access, 7:12177–12191, 2019.

[133] R. Srivastava, S. Yan, T. Sim, and S. Roy. Recognizing emotions of characters in movies. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 993–996, 2012.

[134] Ruchir Srivastava, Shuicheng Yan, Terence Sim, and Sujoy Roy. Recognizing emotions of characters in movies. pages 993–996, 03 2012.

[135] CS231n Stanford. Convolutional neural networks for visual recognition, 2020 (accessed September 1, 2020). https://cs231n.github.io/neural-networks-1/.

[136] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions, 2014.

[137] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision, 2015.

[138] Sean Tao. Deep neural network ensembles, 2019.

[139] Yasemin Timar, Nihan Karslioglu, Heysem Kaya, and Albert Salah. Feature selection and multimodal fusion for estimating emotions evoked by movie clips. pages 405–412, 06 2018.

[140] Yasemin Timar, Nihan Karslioglu, Heysem Kaya, and Albert Salah. Feature selection and multimodal fusion for estimating emotions evoked by movie clips. pages 405–412, 06 2018.

[141] M. Valstar and M. Pantic. Induced disgust, happiness and surprise: an addition to the mmi facial expression database. 2010.

[142] Google Vision. Pricing, 2020 (accessed August 7, 2020). https://cloud.google.com/vision/pricing.

[143] Xiang Wang, Kai Wang, and Shiguo Lian. A survey on face data augmentation for the training of deep neural networks. Neural Computing and Applications, Mar 2020.

[144] W. G. Parrott. Emotions in social psychology: essential readings. Psychology Press, Philadelphia, 2001.

[145] what-when-how. Facial expression recognition (face recognition techniques) part 1, 2012 (accessed September 5, 2020). http://what-when-how.com/face-recognition/facial-expression-recognition-face-recognition-techniques-part-1/.

[146] Cynthia M. Whissell. Chapter 5 - the dictionary of affect in language. In Robert Plutchik and Henry Kellerman, editors, The Measurement of Emotions, pages 113–131. Academic Press, 1989.

[147] Małgorzata Wierzba, Monika Riegel, Marek Wypych, Katarzyna Jednoróg, Paweł Turnau, Anna Grabowska, and Artur Marchewka. Basic emotions in the nencki affective word list (nawl be): New method of classifying emotional stimuli. PLOS ONE, 10(7):1–16, 07 2015.

[148] Wikibooks. Artificial neural networks/activation functions, 2018 (accessed September 1, 2020). https://en.wikibooks.org/wiki/Artificial_Neural_Networks/Activation_Functions#Activation_Functions.

[149] Parker Wilhelm. Try google’s emotion-detecting image api for yourself, February 18, 2016 (accessed September 7, 2020). https://www.techradar.com/news/internet/cloud-services/you-can-now-try-google-s-emotion-detecting-image-api-for-yourself-1315249.

[150] Jennifer Williams, Ramona Comanescu, Oana Radu, and Leimin Tian. DNN multimodal fusion techniques for predicting video sentiment. In Proceedings of Grand Challenge and Workshop on Human Multimodal Language (Challenge-HML), pages 64–72, Melbourne, Australia, July 2018. Association for Computational Linguistics.

[151] Jennifer Williams, Steven Kleinegesse, Ramona Comanescu, and Oana Radu. Recognizing emotions in video using multimodal dnn feature fusion. pages 11–19, 01 2018.

[152] H. Yang, Z. Zhang, and L. Yin. Identity-adaptive facial expression recognition through expression regeneration using conditional generative adversarial networks. In 2018 13th IEEE International Conference on Automatic Face Gesture Recognition (FG 2018), pages 294–301, 2018.

[153] Shuo Yang, Ping Luo, Chen Change Loy, and Xiaoou Tang. Wider face: A face detection benchmark. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[154] Xi Yin, Xiang Yu, Kihyuk Sohn, Xiaoming Liu, and Manmohan Chandraker. Towards large-pose face frontalization in the wild, 2017.

[155] Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks?, 2014.

[156] Zhiding Yu and Cha Zhang. Image based static facial expression recognition with multiple deep network learning. pages 435–442, 11 2015.

[157] Z. Zeng, J. Tu, M. Liu, T. S. Huang, B. Pianfetti, D. Roth, and S. Levinson. Audio-visual affect recognition. IEEE Transactions on Multimedia, 9(2):424–428, Feb 2007.

[158] Herbert Zettl. Essentials of Applied Media Aesthetics, pages 11–38. Springer US, Boston, MA, 2002.

[159] Zhanpeng Zhang, Ping Luo, Chen Change Loy, and Xiaoou Tang. From facial expression recognition to interpersonal relation prediction, 2016.

[160] Guoying Zhao, Xiaohua Huang, Matti Taini, Stan Z. Li, and Matti Pietikäinen. Facial expression recognition from near-infrared videos. Image and Vision Computing, 29(9):607 – 619, 2011.

[161] R. Zhi, M. Flierl, Q. Ruan, and W. B. Kleijn. Graph-preserving sparse nonnegative matrix factorization with application to facial expression recognition. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 41(1):38–52, Feb 2011.

[162] L. Zhong, Q. Liu, P. Yang, B. Liu, J. Huang, and D. N. Metaxas. Learning active facial patches for expression analysis. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pages 2562–2569, June 2012.

[163] Y. Zhu, S. Wang, and Q. Ji. Emotion recognition from users’ eeg signals with the help of stimulus videos. In 2014 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6, 2014.

[164] Barret Zoph, Vijay Vasudevan, Jonathon Shlens, and Quoc V. Le. Learning transferable architectures for scalable image recognition, 2017.

[165] Federico Álvarez, Faustino Sánchez, Gustavo Hernández-Peñaloza, David Jiménez, José Manuel Menéndez, and Guillermo Cisneros. On the influence of low-level visual features in film classification. PLOS ONE, 14(2):1–29, 02 2019.