Automatic Emotion Identification: Analysis and Detection of Facial Expressions in Movies
FACULDADE DE ENGENHARIA DA UNIVERSIDADE DO PORTO

Automatic Emotion Identification: Analysis and Detection of Facial Expressions in Movies

João Carlos Miranda de Almeida
Mestrado Integrado em Engenharia Informática e Computação

Supervisor: Paula Viana
Co-Supervisor: Inês Nunes Teixeira
Co-Supervisor: Luís Vilaça

September 21, 2020

Abstract

The bond between spectators and films takes place in a film's emotional dimension. This dimension is shaped by the filmmakers' decisions on writing, imagery and sound, but it is through acting that emotions are transmitted directly to the audience. Understanding how this bond is created can give us essential information about how humans interact with this increasingly digital medium and how that information can be integrated into large film platforms.

Our work represents another step towards identifying emotions in cinema, particularly from the camera close-ups of actors that are often used to evoke intense emotions in the audience. Over the last few decades, the research community has made promising progress in developing facial expression recognition methods, but without much emphasis on the complex nature of film, with its variations in lighting and pose, a problem discussed in detail in this work.

We start by focusing on the understanding, from the social sciences, of the state-of-the-art models for emotion classification, discussing their strengths and weaknesses. Secondly, we introduce Facial Emotion Recognition (FER) systems and automatic emotion analysis in movies, analyzing unimodal and multimodal strategies.
We present a comparison between the relevant databases and the computer vision techniques used in facial emotion recognition, and we highlight some issues caused by the heterogeneity of the databases, since there is no universal model of emotions.

Built upon this understanding, we designed and implemented a framework for testing the feasibility of an end-to-end solution that uses facial expressions to determine the emotional charge of a film. The framework has two phases: firstly, the selection and in-depth analysis of the relevant databases to serve as a proof of concept of the application; secondly, the selection of a deep learning model to satisfy our needs, by benchmarking the most promising convolutional neural networks when trained on a facial dataset. Lastly, we discuss and evaluate the results of the several experiments made throughout the framework.

We learn that current FER approaches insist on using a wide range of emotions, which hinders the robustness of models, and we propose a new way to look at emotions computationally by creating possible clusters of emotions that do not diminish the collected information, based on the evidence obtained. Finally, we develop a new database of facial masks, discussing some promising paths that may lead to the implementation of the aforesaid system.

Keywords: Facial Expression Recognition, Multimodal Sentiment Analysis, Emotion Analysis, Movie Databases, Machine Learning, Deep Learning, Computer Vision

Resumo

The bond established between spectators and films occurs in the emotional dimension of a film. This emotional dimension is characterized by the filmmakers' decisions on writing, imagery and sound, but it is through the actors' performances that emotions are transmitted directly to the audience. Understanding how this bond is created can give us essential information about how humans interact with this digital medium and how this information could be integrated into large film platforms.
Our work is a further step in emotion identification techniques for cinema, particularly through close-ups of actors, which are often used to evoke intense emotions in the audience. In recent decades, the scientific community has made promising progress in developing methods for facial expression recognition. However, little emphasis has been placed on the complex nature of films, such as variations in scene lighting or in the actors' poses, something explored in more detail in this work.

We begin with an extensive explanation of the most modern emotion classification models, in the context of the social sciences. Secondly, we present Facial Emotion Recognition (FER) systems and automatic emotion analysis in movies, exploring unimodal and multimodal strategies. We present a comparison between the relevant databases and the computer vision techniques used in facial emotion recognition, and we highlight some problems caused by the heterogeneity of the databases, since there is no universal model of emotions.

Based on this knowledge, we designed and implemented a framework to test the feasibility of an end-to-end solution that uses facial expressions to determine the emotional charge of a film. The framework consists of two phases: firstly, the selection and in-depth analysis of the relevant databases to serve as a proof of concept of the application; and secondly, the selection of the most appropriate deep learning model to achieve the proposed goals, by benchmarking the most promising convolutional neural networks when trained on a database of faces. Finally, we discuss and evaluate the results of the several experiments carried out throughout the framework.
We learn that current FER approaches insist on using a wide range of emotions, which hinders the robustness of models, and we propose a new way to look at emotions computationally, creating possible clusters of emotions that do not diminish the collected information, based on the evidence obtained. Finally, we develop a new database of facial masks, discussing some promising paths that may lead to the implementation of the aforesaid system.

Keywords: Reconhecimento de Expressões Faciais, Análise Multimodal de Sentimento, Análise de Emoção, Bases de dados cinematográficas, Aprendizagem Computacional, Inteligência Artificial, Visão por Computador

Acknowledgements

First and foremost, I would like to thank my Supervisors Professor Paula Viana and Inês Nunes Teixeira for giving me the opportunity to participate both in an internship and in a dissertation on two of the subjects I am most passionate about. Luís Vilaça, thank you for bringing me the calmness and the clear view necessary to understand the problems and how to tackle them. To all three supervisors, a big thank you for your support, guidance and feedback throughout this project, and for making me grow as an academic and as a person.

Secondly, I am eternally grateful to my parents, brother, girlfriend and friends for endlessly supporting me during this coursework and for always being my Home regardless of how far I fly.

Finally, I would like to thank everyone I have met through the years for making me who I am today and for showing me that love and tolerance are the only values that should really matter. Obrigado.

João Almeida

"If we opened people up, we'd find landscapes."
Agnès Varda

Contents

1 Introduction
  1.1 Context
  1.2 Motivation
  1.3 Goals
  1.4 Document Structure
2 The Human Emotion
  2.1 Discrete emotion model
  2.2 Dimensional emotion model
  2.3 Facial Action Coding System (FACS)
  2.4 Mappings between Emotion Representation Models
  2.5 Discussion
3 Facial Expression Recognition
  3.1 Historical Overview
  3.2 Deep Learning current approach
    3.2.1 Data Preprocessing
    3.2.2 Convolutional Neural Networks
  3.3 Model Evaluation and Validation
  3.4 Datasets
  3.5 Open-source and commercial solutions
4 Automatic Emotion Analysis in Movies
  4.1 Unimodal Approach
    4.1.1 Emotion in text
    4.1.2 Emotion in sound
    4.1.3 Emotion in image and video
  4.2 Multimodal approach
    4.2.1 Feature fusion techniques
  4.3 Commercial Solutions
5 Methodology, Results and Evaluation
  5.1 Problem Definition and Methodology
  5.2 Implementation details
  5.3 Approach
    5.3.1 Datasets Exploration
    5.3.2 Facial Detection in Movies
    5.3.3 CNNs Baseline and Benchmark
    5.3.4 Deep Learning Modeling and Optimization
    5.3.5 FER2013 dataset balancing
    5.3.6 Reducing the dimensionality of the problem
    5.3.7 Combining Facial Landmark masks with the original face image as input
6 Conclusions
  6.1 Contributions
  6.2 Future work
A FER Datasets Source
B Movies Present in AFEW and SFEW Datasets
C API response examples from commercial solutions
  C.1 Amazon Rekognition
  C.2 Google Vision API
D Multimodal Movie Datasets and Related Studies
References

List of Figures

1.1 L'arrivée d'un train en gare de La Ciotat (1896)
2.1 Circumplex Model of Emotion
2.2 Upper Face Action Units (AUs) from the Facial Action Coding System
2.3 Spatial distribution of NAWL word classifications in the valence-arousal affective space
2.4 Distribution of emotions in the valence-arousal affective map
3.1 Facial Expression Recognition stages outline
3.2 Histogram of Oriented Gradients of Barack Obama's face
3.3 Examples of facial landmarks with their respective faces
3.4 Image geometric transformation examples
3.5 Example of feed-forward neural networks
3.6 Biological neuron and a possible mathematical model
3.7 Flow of information in an artificial neuron
3.8 Most common activation functions
3.9 Example of a 3x3 convolution operation with the horizontal Sobel filter
3.10 Average-pooling and max-pooling operation examples
3.11 Convolutional Neural Network example